Regression analysis: likelihood, error and entropy

In a regression with independent and identically distributed normal residuals, the log-likelihood function yields an empirical form of the $\mathcal{L}^2$-norm, whereas the normal distribution can be obtained as a solution of differential entropy maximization subject to a constraint on the $\mathcal{L}^2$-norm of a random variable. The $\mathcal{L}^1$-norm and the double exponential (Laplace) distribution are related in a similar way. These are examples of an "inter-regenerative" relationship. In fact, the $\mathcal{L}^2$-norm and the $\mathcal{L}^1$-norm are just particular cases of the general error measures introduced by Rockafellar et al. (Finance Stoch 10(1):51–74, 2006) on a space of random variables. General error measures are not necessarily symmetric with respect to ups and downs of a random variable, which is a desirable property in finance applications, where gains and losses should be treated differently. This work identifies the set of all error measures, denoted by $\mathscr{E}$, and the set of all probability density functions (PDFs) that form "inter-regenerative" relationships (through log-likelihood and entropy maximization). It also shows that M-estimators, which arise in robust regression but, in general, are not error measures, form "inter-regenerative" relationships with all PDFs. In fact, the set of M-estimators that are error measures coincides with $\mathscr{E}$. On the other hand, M-estimators are a particular case of L-estimators, which also arise in robust regression. A set of L-estimators that are error measures is identified: it contains $\mathscr{E}$ and the so-called trimmed $\mathcal{L}^p$-norms.


Introduction
There are at least two approaches to regression analysis: likelihood maximization and error minimization of regression residuals. The first assumes a certain class of probability distributions for the regression residuals and is traditionally used in statistics, whereas the second ponders a suitable choice of an error measure for the regression residuals and is a customary tool in engineering and risk analysis [39]. Both methods were introduced by Gauss in 1809 [10], who observed that if regression residuals are assumed to be independent and identically distributed (i.i.d.) normal random variables (r.v.'s), then maximization of the log-likelihood function of the regression residuals reduces to the least squares problem or, equivalently, to minimization of the $\mathcal{L}^2$-norm of the regression error. In fact, the normal distribution was introduced in [10] as the only distribution with such a property. This made the least squares (LS) method, as well as the assumption of normally distributed residuals, a cornerstone of regression analysis for the past two centuries. (In fact, LS regression is quite sensitive to outliers: a single outlier may have a drastic impact on regression coefficients [40, pp. 3–5], and there is extensive evidence against the assumption of normality of noise in real data [5].) Information theory [21] highlighted another relationship between the $\mathcal{L}^2$-norm and the normal distribution: the normal distribution is a solution of differential entropy maximization [43] with a constraint on the $\mathcal{L}^2$-norm of an r.v. Thus, the log-likelihood function of the normal distribution yields an empirical form of the $\mathcal{L}^2$-norm, whereas the normal distribution can be "recovered" from the maximum entropy principle with a constraint on the $\mathcal{L}^2$-norm:

$$\text{PDF } f \;\; \overset{\text{log-likelihood}}{\underset{\text{entropy maximization}}{\rightleftarrows}} \;\; \text{error measure } \mathcal{E}. \qquad (1)$$

We call (1) an "inter-regenerative" relationship.
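For concreteness, the calculation behind the first relationship is the following routine derivation (included only as a worked example): for i.i.d. residuals $z_1, \ldots, z_n$ drawn from $N(0, \sigma^2)$, the normalized log-likelihood is

$$\frac{1}{n}\sum_{i=1}^{n} \ln\!\left(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-z_i^2/(2\sigma^2)}\right) = -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\cdot\frac{1}{n}\sum_{i=1}^{n} z_i^2,$$

so, for fixed $\sigma$, maximizing the log-likelihood over the regression coefficients is the same as minimizing $\frac{1}{n}\sum_{i=1}^{n} z_i^2$, the empirical form of the squared $\mathcal{L}^2$-norm of the residual.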
In fact, the $\mathcal{L}^2$-norm and the normal distribution are not the only pair with this remarkable relationship. In 1887, Edgeworth [7] argued that LS regression coefficients are so sensitive to outliers because the residuals are squared, and suggested instead minimizing the sum of absolute values of the residuals, the method now known as $\mathcal{L}^1$-regression. (Although coefficients in $\mathcal{L}^1$-regression are not "immune" to outliers, see e.g. [40, pp. 10–11], the impact of a single outlier in the response variable is not as severe as in LS regression.) Laplace [26] observed that $\mathcal{L}^1$-regression is equivalent to likelihood maximization with the double exponential (Laplace) distribution. It turns out that this distribution maximizes the differential entropy subject to a constraint on the $\mathcal{L}^1$-norm [29]. Thus, the $\mathcal{L}^1$-norm and the Laplace distribution are yet another example of (1).
In 1964, Huber [19] proposed to minimize $\sum_i \rho(z_i)$ with respect to the regression coefficients, where $\rho$ is a non-constant function and $z_i$ are the regression residuals. The cases $\rho(t) = t^2$ and $\rho(t) = |t|$ correspond to LS regression and to $\mathcal{L}^1$-regression, respectively, while the case $\rho(t) = a t^2$ for $t \geq 0$ and $\rho(t) = b t^2$ for $t \leq 0$, with $a > 0$, $b > 0$, $a \neq b$, leads to asymmetric least squares (ALS) regression, also known as expectile regression [8,16,44]. For arbitrary $\rho$, $\sum_i \rho(z_i)$ is known as an M-estimator [19]. Further, Huber [20] suggested summing $\rho(z_i)$ with weights corresponding to the order statistics of $z_i$; for example, the smallest and the largest residuals could be assigned different weights. Such weighted sums are known as L-estimators [20]; they generalize M-estimators and include quantile regression [23], least median of squares (LMS) regression [42], and least trimmed squares (LTS) regression as particular cases. LMS regression coefficients remain unchanged even if half of all data are outliers. M-estimators and L-estimators remain an active research area to this day, see e.g. [1,18,27,30,32].
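As a concrete illustration of Huber's proposal to minimize $\sum_i \rho(z_i)$, the following minimal Python sketch fits a simple linear regression with Huber's $\rho$ (the synthetic data, the conventional threshold $k = 1.345$, and the derivative-free solver are assumptions made only for this example, not choices taken from the references):

```python
import numpy as np
from scipy.optimize import minimize

def huber_rho(t, k=1.345):
    """Huber's rho: quadratic for small residuals, linear for large ones."""
    a = np.abs(t)
    return np.where(a <= k, 0.5 * t**2, k * a - 0.5 * k**2)

def m_estimate(x, y, rho=huber_rho):
    """Fit y ~ beta0 + beta1*x by minimizing sum_i rho(z_i(beta))."""
    def objective(beta):
        z = y - (beta[0] + beta[1] * x)
        return np.sum(rho(z))
    return minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, 50)
y[:3] += 15.0                                    # three gross outliers in the response
print("Huber M-estimate (intercept, slope):", m_estimate(x, y))
print("Least squares    (intercept, slope):", np.polyfit(x, y, 1)[::-1])
```

On such data the Huber fit stays close to the true coefficients, while the least squares fit is visibly distorted by the three outliers.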
The use of M-estimators and L-estimators, as well as other robust estimators, may, however, lead to non-convex optimization problems for the regression coefficients, which is a considerable disadvantage, particularly for large-scale high-dimensional problems. Bernholt [3] suggested an algorithm that computes the LMS estimator for $n$ data points of dimension $d$ in time proportional to $n^d$. Mount et al. [33] offered an $O(n^{d+1})$ algorithm for computing an LTS estimator and showed that the existence of any algorithm which (exactly and deterministically) computes it in time $O(n^k)$ for any $k < d$ would contradict the well-known "hardness of affine degeneracy" conjecture. In real-life applications, particularly with large data sets, LTS regression coefficients can be found by the fast-LTS heuristic [41], but in this case they are not guaranteed to be optimal.
Rockafellar et al. [39] took the second approach to regression analysis. They introduced general measures of error as nonnegative positively homogeneous convex functionals on a space of r.v.'s, which include the $\mathcal{L}^1$-norm and the $\mathcal{L}^2$-norm but are not necessarily symmetric with respect to the ups and downs of r.v.'s, and then proposed to minimize a general error measure of the regression residuals. For a linear regression, this approach yields convex optimization programs for the regression coefficients. Zabarankin and Uryasev [45, Proposition 5.1] showed that entropy maximization subject to a constraint on a general error measure $\mathcal{E}$ is equivalent to entropy maximization subject to two constraints: on the deviation measure projected from $\mathcal{E}$ and on the statistic associated with $\mathcal{E}$. The theory of general error measures opens up the possibility of identifying other pairs of error measures and probability distributions related by (1). Also, the connection between error measures [39] and M-estimators [19] and L-estimators [20] is believed to be an open issue.

This work shows that all possible pairs of error measures and probability density functions (PDFs) that are related by (1) are determined by

$$\mathcal{E}(X) = \big\| (X)_{a,b} \big\|_p \qquad (2a)$$

and

$$f(t) = C\, e^{-\lambda \left( (t)_{a,b} \right)^p}, \qquad C > 0, \;\; \lambda > 0, \qquad (2b)$$

respectively, where $X$ is an r.v., $\|\cdot\|_p$ is the $\mathcal{L}^p$-norm, and $(\cdot)_{a,b}$ is the function defined by

$$(x)_{a,b} = a \max\{x, 0\} + b \max\{-x, 0\}, \qquad a > 0, \;\; b > 0.$$

For example, for $a = b = 1$, (2a) simplifies to the $\mathcal{L}^p$-norm $\|X\|_p$, whereas for $p = 1$, $a = 1$ and $b = 1/\alpha - 1$ with $\alpha \in (0, 1)$, it is the asymmetric mean absolute error, also known as the Koenker-Bassett error measure used in quantile regression [23]. The sets of all error measures defined by (2a) and of all PDFs given by (2b) are denoted by $\mathscr{E}$ and $\mathscr{P}$, respectively. If $\mathcal{E}$ is replaced by M-estimators, which, in general, are not error measures in the sense of Rockafellar et al. [39] (positively homogeneous convex nonnegative functionals), then (1) is extended from $\mathscr{P}$ to all PDFs, and the set of all M-estimators that are error measures coincides with $\mathscr{E}$.

Let $(\Omega, \mathcal{M}, P)$ be a probability space, with $\Omega$, $\mathcal{M}$, and $P$ being a set of elementary events, a $\sigma$-algebra over $\Omega$, and a probability measure on $(\Omega, \mathcal{M})$, respectively. A random variable (r.v.) is any measurable function from $\Omega$ to $\mathbb{R}$, and $\mathcal{L}^r(\Omega) = \mathcal{L}^r(\Omega, \mathcal{M}, P)$, $r \in [1, \infty]$, is the linear space of r.v.'s with norms $\|X\|_r = \big(E[|X|^r]\big)^{1/r}$ for $r < \infty$ and $\|X\|_\infty = \operatorname{ess\,sup} |X|$. For an r.v. $X$, $F_X(x) = \Pr[X \leq x]$ and $q_X(s) = \inf\{x \mid F_X(x) > s\}$ are its cumulative distribution function (CDF) and quantile function, respectively.
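For intuition, the empirical version of an error measure of the form (2a) is straightforward to compute from a sample of residuals. The sketch below (an illustration only; the function names are chosen here and are not taken from [39]) recovers the $\mathcal{L}^2$-norm, the $\mathcal{L}^1$-norm, and the Koenker-Bassett error as special cases:

```python
import numpy as np

def scaled_part(t, a, b):
    """(t)_{a,b} = a*max(t,0) + b*max(-t,0): scales ups and downs differently."""
    return a * np.maximum(t, 0.0) + b * np.maximum(-t, 0.0)

def error_measure(z, p=2.0, a=1.0, b=1.0):
    """Empirical form of ||(Z)_{a,b}||_p for a sample z_1, ..., z_n."""
    return np.mean(scaled_part(z, a, b) ** p) ** (1.0 / p)

z = np.array([-1.5, -0.2, 0.3, 2.0, 0.7])
print(error_measure(z, p=2, a=1, b=1))              # L2-norm of the sample
print(error_measure(z, p=1, a=1, b=1))              # L1-norm of the sample
alpha = 0.9
print(error_measure(z, p=1, a=1, b=1/alpha - 1))    # Koenker-Bassett (quantile) error
```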

Log-likelihood function, error measures, and M-estimators
Suppose variables $x \in \mathbb{R}^m$ (regressor) and $y \in \mathbb{R}$ (regressand) are related by

$$y = \varphi(x, \beta) + z, \qquad (3)$$

where $\varphi$ is a given function, $\beta \in \mathbb{R}^l$ is an unknown deterministic parameter, and $z$ is a regression error (residual). The regression problem is to find $\beta$ based on given data $(x_i, y_i)$, $i = 1, \ldots, n$, with the corresponding residuals

$$z_i(\beta) = y_i - \varphi(x_i, \beta), \qquad i = 1, \ldots, n. \qquad (4)$$

If the residuals are assumed to be i.i.d. r.v.'s with a PDF $f$, the likelihood of the observed data is

$$\prod_{i=1}^{n} f(z_i(\beta)). \qquad (5)$$

Optimal $\beta$ is then found by maximizing (5), or equivalently, the logarithm of (5) (log-likelihood function):

$$\max_{\beta} \;\; \frac{1}{n} \sum_{i=1}^{n} \ln f(z_i(\beta)), \qquad (6)$$

where the multiplier $1/n$ is introduced for convenience. On the other hand, the objective function in (6) can be considered as the sample analogue of the expected log-likelihood $E[\ln f(Z(\beta))]$, which is the negative cross entropy, and likelihood maximization (6) takes the form

$$\max_{\beta} \;\; E[\ln f(Z(\beta))], \qquad (7)$$

where $Z(\beta) = Y - \varphi(X, \beta)$. In this case, the functional $\mathcal{E}(Z(\beta)) = -E[\ln f(Z(\beta))]$ plays the role of a measure of the random error $Z(\beta)$, and problem (7) can be recast with an arbitrary error measure $\mathcal{E}$:

$$\min_{\beta} \;\; \mathcal{E}(Z(\beta)), \qquad (8)$$

which is essentially the approach to regression taken in engineering: find the best fit for the r.v. $Y$ in terms of the explanatory random vector $X = (X_1, \ldots, X_m)$.
In general, an error measure is a functional $\mathcal{E} : \mathcal{L}^r(\Omega) \to [0, \infty]$ satisfying the following axioms [39]:

E1: $\mathcal{E}(0) = 0$ and $\mathcal{E}(X) > 0$ for $X \neq 0$;
E2: $\mathcal{E}(\lambda X) = \lambda\, \mathcal{E}(X)$ for all $\lambda > 0$ (positive homogeneity);
E3: $\mathcal{E}(X + Y) \leq \mathcal{E}(X) + \mathcal{E}(Y)$ (subadditivity);
E4: $\{X \in \mathcal{L}^r(\Omega) \mid \mathcal{E}(X) \leq c\}$ is closed for all $c < \infty$ (lower semicontinuity).

Loosely speaking, $\mathcal{E}(X)$ is a nonnegative positively homogeneous convex functional which generalizes the notion of a norm but, in contrast to a norm, is not necessarily symmetric, i.e., in general, $\mathcal{E}(-X) \neq \mathcal{E}(X)$. A broad class of error measures is given by (2a). Comparison of (7) and (8) with (2a) yields (2b): log-likelihood maximization (6) with (2b) is equivalent to error minimization (8) with (2a).
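This equivalence can be verified directly (a one-line computation, using the forms (2a) and (2b) written above): substituting (2b) into the objective of (6) gives

$$\frac{1}{n}\sum_{i=1}^{n} \ln f(z_i(\beta)) = \ln C - \lambda\, \frac{1}{n}\sum_{i=1}^{n} \big( (z_i(\beta))_{a,b} \big)^{p},$$

so maximizing (6) over $\beta$ amounts to minimizing the empirical form of $\mathcal{E}(Z(\beta))^p = \|(Z(\beta))_{a,b}\|_p^p$, i.e., to solving (8) with the error measure (2a).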

Example 1 (LS regression) The least squares (LS) regression

$$\min_{\beta} \;\; \frac{1}{n} \sum_{i=1}^{n} z_i(\beta)^2 \qquad (9)$$

is equivalent to likelihood maximization with a normally distributed regression error.
Example 2 (Quantile regression) The quantile regression [23] is equivalent to likelihood maximization with the regression error having the PDF (2b) with $p = 1$, $a = 1$, and $b = 1/\alpha - 1$, i.e., an asymmetric Laplace density.

In LS regression (9), a single outlier can substantially alter the regression coefficients. Several alternatives with better robustness properties have been suggested. For example, Huber [19] proposed that the coefficient vector $\beta$ in (4) minimize

$$\sum_{i=1}^{n} \rho(z_i(\beta)) \qquad (10)$$

for some non-constant function $\rho : \mathbb{R} \to \mathbb{R}_+$, where the objective function in (10) is called an M-estimator. The case $\rho(t) = t^2$ corresponds to the ordinary least squares error. Problem (10) is equivalent to (8) with

$$\mathcal{E}(Z) = h\big(E[\rho(Z)]\big), \qquad (11)$$

where $Z$ is an r.v. such that $P[Z = z_i] = 1/n$, $i = 1, \ldots, n$, and $h : \mathbb{R}_+ \to \mathbb{R}_+$ is an arbitrary strictly increasing function. For example, with $\rho(t) = \big((t)_{a,b}\big)^p$ and $h(u) = u^{1/p}$, (11) simplifies to (2a). However, in general, the functional (11) is not an error measure. Under suitable conditions on $\rho$ and $h$, (11) is a regular measure of error [37], i.e., it satisfies axioms E1, E3, E4, and E5. In general, regular measures of error may not satisfy E2. For example, the asymmetric exponential error $\mathcal{E}(Z) = E[e^Z - Z - 1]$ satisfies E1 and E3-E5 but not E2; see [37, Example 8].
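The failure of positive homogeneity for the asymmetric exponential error is easy to see numerically (a quick illustration; the standard normal sample below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=100_000)             # sample representing an r.v. Z

def asym_exp_error(z):
    """Sample version of E[e^Z - Z - 1], the asymmetric exponential error."""
    return np.mean(np.exp(z) - z - 1.0)

# Positive homogeneity (axiom E2) would require E(k*Z) = k*E(Z) for every k > 0.
for k in (0.5, 2.0):
    print(k, asym_exp_error(k * z), k * asym_exp_error(z))
```

The two printed values differ markedly for each $k$, confirming that E2 fails, while convexity and the remaining axioms are untouched.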
The following proposition shows that the set of all M-estimators (11) that are error measures is, in fact, the set of error measures defined by (2a), i.e., the set $\mathscr{E}$.

Proposition 1 An M-estimator (11) is an error measure if and only if it has the form (2a); consequently, the set of all M-estimators that are error measures coincides with $\mathscr{E}$.
Proof See Appendix A.

Entropy maximization
Let $C^1(\Omega) \subset \mathcal{L}^1(\Omega)$ be the set of all r.v.'s that have finite mean and a PDF, and let $\mathcal{X} \subset C^1(\Omega)$. Maximization of the differential entropy $S(Z) = -\int_{-\infty}^{\infty} f_Z(t) \ln f_Z(t)\, dt$, where $f_Z$ is the PDF of $Z$, can be formulated in a general form:

$$\max_{Z \in \mathcal{X}} \;\; S(Z). \qquad (13)$$

Proposition 2 An r.v. $Z^* \in C^1(\Omega)$ can be a solution to (13) for some convex closed (in $\mathcal{L}^1(\Omega)$) law-invariant set $\mathcal{X}$ if and only if $Z^*$ has a log-concave PDF.
Proof See Appendix A.
Problem (5.4.5) in [45] suggests that maximization of the differential entropy with a constraint on an error measure $\mathcal{E}$,

$$\max_{Z \in C^1(\Omega)} \;\; S(Z) \quad \text{subject to} \quad \mathcal{E}(Z) = 1, \qquad (14)$$

can "restore" the PDF of the regression residual. Indeed, if an r.v. $Z$ admits a continuous PDF $f(t) : \mathbb{R} \to \mathbb{R}_+$, then problem (14) with error measure (2a) takes the form

$$\max_{f} \; -\int_{-\infty}^{\infty} f(t) \ln f(t)\, dt \quad \text{subject to} \quad \int_{-\infty}^{\infty} f(t)\, dt = 1, \quad \int_{-\infty}^{\infty} \big((t)_{a,b}\big)^p f(t)\, dt = 1, \qquad (15)$$

and Boltzmann's theorem [6, Theorem 11.1.1] yields (2b) with constants $C > 0$ and $\lambda > 0$ to be found from the constraints in (15); the exact form of $f$, with $C$ and $\lambda$ expressed through the gamma function $\Gamma[\cdot]$, is given by [45, (5.4.8)]. When $a = b = 1$, error measure (2a) takes the form $\mathcal{E}(Z) = \|Z\|_p$, and the PDF simplifies to $f(t) = C\, e^{-\lambda |t|^p}$ [45, (5.4.9)]; see Figure 5.2 in [45] for the graph of this PDF for various $p$. Thus, given PDF (2b), error measure (2a) follows from the log-likelihood function, and given error measure (2a), PDF (2b) follows from entropy maximization, i.e., (2a) and (2b) form the "inter-regenerative" relationship (1). This raises the following questions:

(i) Entropy-error relationship: for which PDFs $f$ does there exist an error measure $\mathcal{E}$ such that $f$ is a maximizer in (14)?

(ii) Error-entropy relationship: for which pairs of an error measure $\mathcal{E}$ and a PDF $f$ does the relationship (1) hold?

Proposition 3 An r.v. $Z^* \in C^1(\Omega)$ can be a solution to (14) for some law-invariant error measure $\mathcal{E}$ if and only if $Z^*$ has a log-concave PDF.

Proof See Appendix A.

Proposition 4 A PDF $f$ and an error measure $\mathcal{E}$ are related by (1), i.e., log-likelihood maximization (6) with $f$ is equivalent to error minimization (8) with $\mathcal{E}$ and $f$ is restored from entropy maximization (14) with $\mathcal{E}$, if and only if $f$ is given by (2b) and $\mathcal{E}$ is given by (2a) with the same parameters $p$, $a$, and $b$.

Proof See Appendix A.
Proposition 4 implies that $\mathscr{P}$ and $\mathscr{E}$ are, in fact, the only sets of PDFs and error measures, respectively, for which the two regression approaches yield the same solution and which are related by (1).
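A quick numerical illustration of the entropy side of (1) for $p = 2$ (only a sanity check, using closed-form entropies available in scipy): among r.v.'s with the same $\mathcal{L}^2$-norm, the normal distribution has larger differential entropy than, for example, the Laplace distribution.

```python
import numpy as np
from scipy.stats import norm, laplace

# Both r.v.'s are scaled so that E[Z^2] = 1 (the same L2-norm constraint).
gaussian = norm(scale=1.0)                      # E[Z^2] = sigma^2 = 1
lap      = laplace(scale=1.0 / np.sqrt(2.0))    # E[Z^2] = 2*b^2   = 1

print("normal  differential entropy:", gaussian.entropy())   # 0.5*ln(2*pi*e) ~ 1.419
print("laplace differential entropy:", lap.entropy())        # 1 + ln(2b)     ~ 1.347
```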
Example 4 (Trimmed $\mathcal{L}^1$-norm) The trimmed $\mathcal{L}^1$-norm (also known as the CVaR norm [31]) is the average of the right $(1 - \alpha)$-tail of $|Z|$:

$$\mathrm{CVaR}_\alpha(|Z|) = \frac{1}{1 - \alpha} \int_{\alpha}^{1} q_{|Z|}(s)\, ds, \qquad (17)$$

where $q_{|Z|}(s)$ is the $s$-quantile of $|Z|$. It was recently used in regression analysis [39].
Since (17) is not of the form (2a), Proposition 4 implies that there is no PDF for which expected log-likelihood maximization is equivalent to (8) with (17).
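The empirical trimmed $\mathcal{L}^1$-norm of a sample is simply the average of the largest $(1 - \alpha)$ fraction of the absolute residuals. A minimal sketch (assuming, for simplicity, that $(1 - \alpha) n$ is an integer, so that no fractional tail weighting is needed):

```python
import numpy as np

def trimmed_l1_norm(z, alpha):
    """Empirical trimmed L1-norm (CVaR norm): mean of the largest (1-alpha)
    fraction of |z_i|.  Assumes (1-alpha)*len(z) is an integer."""
    a = np.sort(np.abs(z))[::-1]                  # |z_i| in decreasing order
    k = max(1, int(round((1.0 - alpha) * len(a))))
    return a[:k].mean()

z = np.array([0.1, -0.4, 2.5, -0.3, 0.2, -5.0, 0.6, 0.05])
print(trimmed_l1_norm(z, alpha=0.75))             # (5.0 + 2.5) / 2 = 3.75
```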
Now suppose the regression error is a mixture of $m$ normal r.v.'s: conditionally on the value of an r.v. $L$ indicating the source of error, the residual is normal with mean $\mu_j$ and standard deviation $\sigma_j$, so that its PDF is

$$f(t) = \sum_{j=1}^{m} \frac{w_j}{\sqrt{2\pi}\,\sigma_j}\, e^{-(t - \mu_j)^2 / (2\sigma_j^2)}, \qquad (18)$$

where $w_j = P[L = j]$, $j = 1, \ldots, m$, which implies that $\sum_{j=1}^{m} w_j = 1$. Let $w_j > 0$, $j = 1, \ldots, m$, i.e., each source of error has a non-zero probability. The parameters $w = (w_1, \ldots, w_m) \in \mathbb{R}^m$, $\mu = (\mu_1, \ldots, \mu_m) \in \mathbb{R}^m$, and $\sigma = (\sigma_1, \ldots, \sigma_m) \in \mathbb{R}^m$ in (18), and $\beta \in \mathbb{R}^l$ in (4), can be found from likelihood maximization through the expectation-maximization (EM) algorithm [2]. Alternatively, we can minimize some error measure $\mathcal{E}$ of the residuals. However, since $f$ in (18) is not of the form (2b), Proposition 4 implies that there is no error measure for which these two approaches are equivalent. Also, since $\ln f$ is not a concave function, Proposition 3 implies that there is no error measure for which $f$ given by (18) is a maximizer in (14).
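The non-concavity of $\ln f$ for a mixture is easy to confirm numerically (an illustration only; the equal-weight mixture of $N(-3, 1)$ and $N(3, 1)$ below is chosen purely to make the effect visible):

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(t):
    """Equal-weight mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * norm.pdf(t, loc=-3.0, scale=1.0) + 0.5 * norm.pdf(t, loc=3.0, scale=1.0)

t = np.linspace(-6.0, 6.0, 601)
log_f = np.log(mixture_pdf(t))
second_diff = np.diff(log_f, n=2)                 # discrete analogue of (ln f)''
print("max second difference:", second_diff.max())   # positive => ln f is not concave
```

The positive second differences near the valley between the two modes show that $\ln f$ fails to be concave.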

The next proposition extends (1) to M-estimators (11) in place of error measures.
Proposition 5 Let $f_*$ be an arbitrary PDF. Then (6) with the PDF $f_*$ yields the same solution as (8) with $\mathcal{E}_*$ in the form of (11) and $\rho_*(t) = -\ln f_*(t)$. Moreover, $f_*$ can be "restored" from maximization of the differential entropy $S(Z)$ subject to the constraint $\mathcal{E}_*(Z) = c$ for some constant $c \in \mathbb{R}$:

$$\max_{Z \in C^1(\Omega)} \;\; S(Z) \quad \text{subject to} \quad \mathcal{E}_*(Z) = c.$$

Proof See Appendix A.
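To see why the "restoration" in Proposition 5 works, here is a short sketch (not the proof from Appendix A) for one natural choice of the constant, $c = h(S(Z_*))$, where $Z_*$ has PDF $f_*$: since $h$ is strictly increasing, the constraint $\mathcal{E}_*(Z) = c$ is then equivalent to $E[\rho_*(Z)] = -\int f_*(t) \ln f_*(t)\, dt$, and for any PDF $f$ satisfying it, Gibbs' inequality gives

$$S(Z) = -\int f(t) \ln f(t)\, dt \;\leq\; -\int f(t) \ln f_*(t)\, dt \;=\; E[\rho_*(Z)] \;=\; S(Z_*),$$

with equality if and only if $f = f_*$ almost everywhere; hence the entropy maximizer is $f_*$ itself.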
Other versions of robust regression, which are similar to (20), first apply a function $\rho$ to the residuals and then rank them; they correspond to (8) with an error functional determined by the distribution of $\rho(Z)$. A simple example is least median of squares (LMS) regression, i.e., problem (8) with the functional (22) equal to the median of $Z^2$, where the median $\operatorname{med}(X)$ of an r.v. $X$ is a real number $x$ such that $\Pr[X < x] \leq 1/2$ and $\Pr[X > x] \leq 1/2$. Coefficients in this regression do not change even if half of the data are outliers, but this regression is much less efficient than (9): more data are required to achieve the same accuracy [42]. Least trimmed squares (LTS) regression has the same robustness level but is more efficient [42, Section 4]. It corresponds to (8) with the functional (24), which averages the smallest $\alpha$-fraction of the squared residuals, for some $\alpha \in (0, 1)$. The functionals (22) and (24) belong to a general family of L-estimator functionals defined through a weight function $w(\alpha)$, where $w(\alpha)$ is either a Dirac delta function at 1 or a non-negative non-decreasing function on $[0, 1)$. A different hybrid of the trimmed $\mathcal{L}^1$-norm (17) and the $\mathcal{L}^p$-norm is the functional $E_{p,\alpha}$; in fact, $E_{p,\alpha}(Z) = \mathrm{HMCR}_{p,\alpha}(|Z|)$, where $\mathrm{HMCR}_{p,\alpha}$ is a higher moment coherent risk measure [24]. For $p \in (1, \infty)$ and $\alpha \in (0, 1)$, $E_{p,\alpha}(Z)$ is not of the form (25). Hence, by Proposition 6, it does not belong to the family (23) of error measures related to L-estimators.
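The robustness of LMS regression mentioned above can be seen in a small experiment (a deliberately naive brute-force grid search, included only for illustration; practical LMS/LTS computation relies on the algorithms and heuristics cited earlier [3,33,41], and the data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, 40)
y[:15] = rng.uniform(-30.0, 0.0, 15)              # ~37% of the responses are outliers

def lms_fit(x, y, slopes=np.linspace(-5.0, 5.0, 401)):
    """Brute-force least median of squares: grid search over slope and intercept."""
    best_obj, best_beta = np.inf, None
    for b1 in slopes:
        r = y - b1 * x
        for b0 in np.linspace(np.median(r) - 2.0, np.median(r) + 2.0, 201):
            obj = np.median((r - b0) ** 2)        # median of squared residuals
            if obj < best_obj:
                best_obj, best_beta = obj, (b0, b1)
    return best_beta

print("LMS (intercept, slope):", lms_fit(x, y))              # close to (1, 2)
print("OLS (intercept, slope):", np.polyfit(x, y, 1)[::-1])  # distorted by outliers
```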

Entropy maximization with a constraint on error measure (25)
Example 4 shows that there is no PDF of the error residuals such that log-likelihood maximization corresponds to minimization of the trimmed $\mathcal{L}^1$-norm, i.e., the "upper arrow" in (1) does not hold. However, the "lower arrow" still works: differential entropy maximization subject to a constraint on a general error measure was addressed in [45, problem (5.5.4)].
Error measure (25) to power $p$ can be rewritten as

$$\big(\mathcal{E}(Z)\big)^p = \frac{1}{1-\alpha} \int_{\alpha}^{1} \big(q_{|Z|}(s)\big)^p\, ds = \mathrm{CVaR}_\alpha\big(|Z|^p\big).$$

The conditional value-at-risk (CVaR) minimization formula [36] (see also (1.4.4) in [45]) yields

$$\big(\mathcal{E}(Z)\big)^p = \min_{\zeta \in \mathbb{R}} \left\{ \zeta + \frac{1}{1-\alpha}\, E\big[(|Z|^p - \zeta)^+\big] \right\},$$

where the minimum is attained at $\zeta(\alpha) = \big(q_{|Z|}(\alpha)\big)^p$. Then the constraint $\mathcal{E}(Z) = 1$ takes the form $\int_{-\infty}^{\infty} \big[\zeta(\alpha) + \tfrac{1}{1-\alpha}\big(|t|^p - \zeta(\alpha)\big)^+\big] f(t)\, dt = 1$, and entropy maximization problem (14) becomes

$$\max_{f} \; -\int_{-\infty}^{\infty} f(t) \ln f(t)\, dt \quad \text{subject to} \quad \int_{-\infty}^{\infty} f(t)\, dt = 1, \quad \int_{-\infty}^{\infty} \Big[\zeta(\alpha) + \tfrac{1}{1-\alpha}\big(|t|^p - \zeta(\alpha)\big)^+\Big] f(t)\, dt = 1, \qquad (26)$$

where $\zeta(\alpha) = \big(q_{|Z|}(\alpha)\big)^p$. By Boltzmann's theorem [6, Theorem 11.1.1], the solution of (26) is given by

$$f(t) = c\, e^{-\lambda \left[ \zeta(\alpha) + \frac{1}{1-\alpha}\,\left(|t|^p - \zeta(\alpha)\right)^+ \right]}, \qquad (27)$$

where $c$ and $\lambda$ are positive constants to be found from the constraints in (26). With $c_2 = c\, e^{-\lambda \zeta(\alpha)}$, we obtain

$$f(t) = c_2\, e^{-\frac{\lambda}{1-\alpha}\,\left(|t|^p - \zeta(\alpha)\right)^+},$$

where $c_2$, $\lambda$, and $\zeta(\alpha)$ are found from the constraints $\int_{-\infty}^{\infty} f(t)\, dt = 1$ and $\mathcal{E}(Z) = 1$ and the equation $\zeta(\alpha) = \big(q_{|Z|}(\alpha)\big)^p$.

Example 8 (Entropy maximization with trimmed $\mathcal{L}^1$-norm) Entropy maximization problem (14) with error measure (17) simplifies to

$$\max_{f} \; -\int_{-\infty}^{\infty} f(t) \ln f(t)\, dt \quad \text{subject to} \quad \int_{-\infty}^{\infty} f(t)\, dt = 1, \quad \int_{-\infty}^{\infty} \Big[\zeta + \tfrac{1}{1-\alpha}\big(|t| - \zeta\big)^+\Big] f(t)\, dt = 1.$$

As in (27), the optimal $f(t)$ has the form

$$f(t) = C\, e^{-\frac{\lambda}{1-\alpha}\,\left(|t| - \zeta\right)^+},$$

where the constants $C$, $\lambda$, and $\zeta$ are found from the constraints $\int_{-\infty}^{\infty} f(t)\, dt = 1$, $\mathcal{E}(Z) = 1$, and $\zeta = q_{|Z|}(\alpha)$, so that $f$ is constant on $[-\zeta, \zeta]$ and has Laplace-type (exponentially decaying) tails outside this interval.

Appendix A.2: Proof of Proposition 2
Proposition 4.7(b) in [11] implies that if $Z^* \in C^1(\Omega)$ has a log-concave PDF, then it is a solution to

$$\max_{Z \in C^1(\Omega)} \;\; S(Z) \quad \text{subject to} \quad E[Z] = \mu, \quad \mathcal{D}(Z) \leq \mathcal{D}(Z^*) \qquad (33)$$

for $\mu = E[Z^*]$ and some law-invariant deviation measure $\mathcal{D}$. Hence $Z^*$ is a solution to (13) with

$$\mathcal{X} = \big\{ Z \in C^1(\Omega) \mid E[Z] = \mu, \; \mathcal{D}(Z) \leq \mathcal{D}(Z^*) \big\},$$

which is a convex closed law-invariant set. Conversely, let $Z^* \in C^1(\Omega)$ be a solution to (13) for some convex closed law-invariant set $\mathcal{X}$. Then it is a solution to (33) for the deviation measure $\mathcal{D}$ defined by (34); see [14]. Indeed, if an r.v. $Z$ satisfies the constraints in (33) with $\mathcal{D}$ given by (34), then $E[Z] = \mu = E[Z^*]$ and $\mathrm{CVaR}_\alpha(Z) \leq \mathrm{CVaR}_\alpha(Z^*)$ for all $\alpha \in [0, 1]$, so that $Z$ dominates $Z^*$ with respect to concave ordering; see Proposition 1 in [14]. Since $Z^*$ has a PDF, the underlying probability space is, by definition, atomless, and part "(a) to (d)" of Corollary 2.61 in [9] along with Lemma 4.2 in [22] implies that $Z \in \mathcal{X}$. Since $Z^* \in C^1(\Omega)$ is a solution to (13), this yields $S(Z^*) \geq S(Z)$, and consequently, $Z^*$ is a solution to (33). Thus, $Z^*$ has a log-concave PDF by Proposition 4.11 in [11].

Appendix A.3: Proof of Proposition 3
If $Z^* \in C^1(\Omega)$ has a log-concave PDF, then it is a solution to (33) for some law-invariant deviation measure $\mathcal{D}$. On the other hand, Proposition 5.1 in [45] shows that problem (33) is equivalent to (14) with an error measure $\mathcal{E}$ such that $\mathcal{D}(Z) = \inf_{C \in \mathbb{R}} \mathcal{E}(Z - C)$, i.e., $\mathcal{D}$ is the deviation measure projected from $\mathcal{E}$. In general, for a given deviation measure $\mathcal{D}$, such an error measure is non-unique and can be determined by

$$\mathcal{E}(X) = \mathcal{D}(X) + |E[X]|, \qquad (35)$$

which is called the inverse projection of $\mathcal{D}$; see [39]. Thus, $Z^*$ is a solution to (14) with (35). Conversely, let $Z^* \in C^1(\Omega)$ be a solution to (14) for some law-invariant error measure $\mathcal{E}$. Then positive homogeneity of $\mathcal{E}$ and the relation $S(kZ) = S(Z) + \ln k$, $k > 0$, imply that $Z^*$ is also a solution to

$$\max_{Z \in C^1(\Omega)} \;\; S(Z) \quad \text{subject to} \quad \mathcal{E}(Z) \leq 1.$$

Since $\{Z \mid \mathcal{E}(Z) \leq 1\}$ is a convex closed law-invariant set, $Z^*$ has a log-concave PDF by Proposition 2.