Sparse alternatives to ridge regression: a random effects approach

In the calibration of a near-infrared (NIR) instrument, we regress chemical compositions of interest on their NIR spectra. This process poses two immediate challenges: first, the number of variables exceeds the number of observations and, second, the multicollinearity between variables is extremely high. To deal with these challenges, prediction models that produce sparse solutions have recently been proposed. The term 'sparse' means that some model parameters are estimated as exactly zero, while the others are estimated naturally away from zero. In effect, variable selection is embedded in the model, potentially achieving better prediction. Many studies have investigated sparse solutions for latent variable models, such as partial least squares and principal component regression, and for direct regression models such as ridge regression (RR). For the latter, however, sparsity has mainly been achieved by adding an L1-norm penalty to the objective function, as in lasso regression. In this study, we investigate new sparse alternatives to RR within a random effects model framework, where we consider Cauchy and mixture-of-normals distributions on the random effects. The results indicate that the mixture-of-normals model produces a sparse solution with good prediction and better interpretation. We illustrate the methods using NIR spectra datasets from milk and corn specimens.


Introduction
In the calibration of a near-infrared (NIR) instrument, we model chemical compositions of interest from a set of observations as a function of their NIR spectra, which are measured at several hundred or thousand wavelengths (variables). In the calibration, we deal with a situation where the number of variables exceeds the number of observations and the variables exhibit extremely high multicollinearity. With these characteristics of the data, we have two inherent problems in building a calibration model. Firstly, estimating the model parameters is problematic, as standard ordinary least squares (OLS) regression fails in this situation; secondly, some variables may not contribute towards better prediction.
To deal with the first problem, many calibration models have been proposed and shown to work well. These models can generally be categorised into two main groups: the first regresses the response on linear combinations of the original variables (called latent variables or components), while the second regresses the response directly on the (original) variables. The first category includes partial least squares (PLS) regression and principal component regression (PCR); the second includes ridge regression (RR) and lasso regression. To deal with the second problem, wavelength selection or variable selection has been advocated to improve prediction (e.g. [25]). Many strategies have been proposed, including genetic algorithms (e.g. [2,20]), forward selection (e.g. [9]), self-organising maps [5], and boundary subset selection [23]. See, for example, Ref. [1] for a review of variable selection strategies.
Dealing with both problems separately can be sub-optimal in building a prediction model, as some selection methods are not well suited to a particular calibration model. For example, the forward selection method may result in a configuration of variables that predicts less well than the 'best subset' selection method in RR. To overcome this, recent studies have suggested sparse alternatives to the calibration models [3,7,15,17]. The term 'sparse' refers to the model parameters, where some are estimated as zero (or very close to zero) while the others are estimated naturally away from zero. Effectively, the contribution of some variables is cancelled and the prediction model relies on the remaining variables with non-zero parameter estimates. This is done in a single framework within the calibration model.
Many recent studies have investigated sparse alternatives for PLS and PCR, mostly involving L1-norm penalisation or the lasso [26]. For example, Chun and Keles [4] developed a sparse PLS method by imposing sparsity in the dimension reduction step. Fu et al. [8] proposed an elastic net, a combination of ridge-type and lasso-type penalties, that imposes a sparse solution within PLS. Lee et al. [16] proposed a shrinkage of singular values in principal component analysis to produce sparse loadings. Lee et al. [17] modified the NIPALS algorithm in PLS, incorporating the lasso to produce sparse loadings.
The above studies have focused on sparse solutions for latent-variable models. For direct regression models such as ridge regression, the main sparse alternative has been lasso regression [26] or similar models with an L1-norm penalty on the objective function. Although these models produce sparse solutions, they share some concerns, including the interpretation of the model estimates in extremely multicollinear data.
In this paper, we propose sparse alternatives to ridge regression within a random effects model framework by demonstrating that a sparse solution can be achieved by modifying the distributional assumption on the random effects. The first is a linear model where the random effects are assumed to follow a Cauchy distribution. The second is a linear model where the random effects are assumed to follow a mixture of three normal distributions, corresponding to random effects with negative, zero, and positive estimates, to create a sparse solution. We compare these models to existing models within the random effects framework, such as ridge regression and lasso regression. The main results indicate that the Cauchy random effects model does not necessarily produce a sparse solution, while the mixture-of-normals random effects model produces a sparse solution with an interesting interpretation and good prediction. We illustrate the application of the proposed models using NIR calibrations from milk and corn datasets.

Methods
In this section, we discuss the different random effects models and how their distributional assumptions are modified to produce a sparse solution. In Section 2.1, we outline a general setting of random effects models. In Section 2.2 we describe the RR model and some current models that produce a sparse solution. We present the Cauchy and mixture-of-normals random effects models in Sections 2.3 and 2.4, respectively.

General setting
Let X be a matrix of NIR spectra of p variables from n observations, so the matrix X is of size n × p with n < p. We assume that each variable has been centred to have zero mean. Let y be an n-vector of the response variable, and let β be a p-vector of regression model parameters. We also assume that the vector y is centred to have zero mean. We model the response y as

y = Xβ + ε,    (1)

where ε ≡ {ε_k}, k = 1, . . . , n, is an n-vector of error terms. We assume ε to follow a normal distribution with mean zero and variance-covariance matrix Σ = σ²I_n, where I_n is the identity matrix of size n. Our main interest is to estimate the regression parameters β.
It is well known that the OLS solution β̂_LS = (XᵀX)⁻¹Xᵀy is not applicable in our case, because XᵀX is not invertible when n < p. Even when XᵀX is invertible (n > p), the estimates β̂_LS are highly unstable because of the extreme multicollinearity between the columns of X. To deal with this problem, it is natural to make a distributional assumption on β, so that we formulate the estimation problem in the context of a random effects model. For example, if we assume β to follow a normal distribution N(0, σ²_β), we obtain an RR model (Section 2.2.1). It is of interest in this paper to modify this assumption on the random effects to obtain a sparse solution for β, which is described in Sections 2.2.2 and 2.4.
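A quick numerical check of this point (a toy illustration with simulated data, not the paper's NIR spectra):

```python
import numpy as np

# Toy illustration: with n < p the Gram matrix X^T X is rank-deficient,
# so the OLS solution (X^T X)^{-1} X^T y does not exist.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)              # centre each variable, as in the paper

rank = np.linalg.matrix_rank(X.T @ X)
print(rank, "<", p)                 # rank is at most n - 1 after centring
```

Centring removes one further degree of freedom, so the Gram matrix has rank at most n − 1, far below the p needed for inversion.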
Let p(β) denote the distribution of β. In the context of random effects models, the estimation of β is based on an extended likelihood [22] or h-likelihood [18]. Let θ denote the variance(s) and other parameters. The log-likelihood of the parameters can then be written as

log L(β, θ) = log p(y | β) + log p(β).    (2)

The estimates of β can be obtained by maximising log L(β, θ) in Equation (2). Regardless of the distribution of the random effects β, the conditional distribution of y given β is normal with mean E(y | β) = Xβ and variance Σ, so the first term in the log-likelihood is given by

log p(y | β) = −(n/2) log(2πσ²) − (1/(2σ²)) (y − Xβ)ᵀ(y − Xβ).    (3)

In our formulation, the RR can be viewed as a random effects model where the random effects β are assumed to follow a normal distribution (see the next section). To obtain sparse alternatives for RR, it is natural to modify the distributional assumption on the random effects and consider other distributions that have fatter tails than the normal distribution. Before discussing this modification, we briefly review the RR and other current models that produce sparse solutions within the random effects model framework in the next section. The proposed new sparse alternatives to ridge regression are presented in Sections 2.3 and 2.4.
It is interesting to note that Equation (2) can have a Bayesian interpretation. From a Bayesian perspective, p(β), p(y | β), and L(β, θ) can be considered as prior, likelihood, and posterior, respectively. We will come across this form again in the next several sections, in particular Equations (4), (7), (8), and (12).

Normal random effects (RR)
To view the RR from within the random effects model framework, we assume that the regression parameters β follow a normal distribution with zero mean and variance-covariance matrix D, independent of ε. Combining this with the conditional log-likelihood (3), the log-likelihood of the parameters can be written as

log L(β, θ) = log p(y | β) − (1/2) log |2πD| − (1/2) βᵀD⁻¹β,    (4)

where θ ≡ {Σ, D}.
At fixed values of θ, we take the first derivative of the log-likelihood log L with respect to β,

∂ log L/∂β = XᵀΣ⁻¹(y − Xβ) − D⁻¹β.

Setting this to zero and solving for β, we obtain the estimates of β as

β̂_N = (XᵀΣ⁻¹X + D⁻¹)⁻¹ XᵀΣ⁻¹y,    (5)

where the subscript N in β̂_N indicates that the estimates are obtained under the normal distribution assumption on β. Considering that Σ ≡ σ²I_n and D ≡ σ²_β I_p, the estimates can be rewritten as

β̂_N = (XᵀX + λI_p)⁻¹ Xᵀy,    (6)

where λ ≡ σ²/σ²_β. These are the RR estimates as first proposed by Hoerl and Kennard [11]. The estimation of λ can be done via cross-validation or by minimising Akaike's information criterion (AIC).
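As a minimal sketch (our own toy data and function names, not the authors' code), the closed-form RR estimate of Equation (6) can be computed directly:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    # Equation (6): beta_N = (X^T X + lambda I_p)^{-1} X^T y,
    # with lambda = sigma^2 / sigma_beta^2.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 120                  # n < p: OLS fails, but the RR estimate exists
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

beta_small = ridge_estimate(X, y, lam=0.1)
beta_large = ridge_estimate(X, y, lam=1000.0)
# A larger lambda shrinks the whole coefficient vector towards zero,
# but no individual estimate becomes exactly zero.
print(np.linalg.norm(beta_large), "<", np.linalg.norm(beta_small))
```

This also illustrates the non-sparsity noted below: the shrinkage acts on the whole vector, so all p estimates remain non-zero regardless of λ.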
One thing to note is that the RR estimates in Equation (6) are not sparse, i.e. there is no clear separation between zero and non-zero estimates. To put forward sparse alternatives to the RR, we assume β to follow distributions that put density mass around zero and have fatter tails (for the non-zero estimates). A natural choice is the Laplace distribution, which is known to have fatter tails than the normal distribution (Figure 1). This assumption results in the well-known lasso regression, described in the next section.

Laplace random effects
A distribution commonly assumed on the random effects β to produce a sparse solution is the Laplace distribution (Figure 1, right panel, dotted line). Specifically, we assume each β_i to follow a Laplace distribution with location 0 and scale σ_v,

p(β_i) = (1/(2σ_v)) exp(−|β_i|/σ_v),

where | · | denotes the modulus or absolute value. Combining this with the conditional log-likelihood (3), the log-likelihood of the parameters can be written as

log L(β, θ) = log p(y | β) − p log(2σ_v) − (1/σ_v) Σ_i |β_i|.    (7)

One thing to note about the estimation of β from the log-likelihood in Equation (7) is that the first derivative of its last term, log p(β), does not contain β: the derivative is −1/σ_v for positive β_i and 1/σ_v for negative β_i. There are alternative ways to estimate β in this case (e.g. [28]). In this study, we resort to the algorithm discussed in the original paper of Tibshirani [26] and in Hastie et al. [10], which is implemented in the R package lars. The package also implements the estimation of σ_v via cross-validation.
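The paper relies on the R package lars for this model. Purely as an illustration of how an L1 penalty zeroes some coefficients, the sketch below uses cyclic coordinate descent, a different algorithm from lars, minimising (1/2)||y − Xβ||² + α Σ_i |β_i|; all names and the synthetic data are ours:

```python
import numpy as np

def soft_threshold(z, a):
    # Soft-thresholding operator: the exact minimiser of the 1-D lasso step.
    return np.sign(z) * max(abs(z) - a, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    # Cyclic coordinate descent for (1/2)||y - X beta||^2 + alpha * sum|beta_i|.
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)            # x_i^T x_i for each column
    for _ in range(n_sweeps):
        for i in range(p):
            # Partial residual excluding variable i's current contribution.
            r = y - X @ beta + X[:, i] * beta[i]
            beta[i] = soft_threshold(X[:, i] @ r, alpha) / col_ss[i]
    return beta

rng = np.random.default_rng(2)
n, p = 50, 80
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

beta = lasso_cd(X, y, alpha=5.0)
print((beta == 0).sum(), "of", p, "coefficients are exactly zero")
```

Unlike the ridge solution, the soft-thresholding step sets many coefficients to exactly zero, while the informative coefficients survive (slightly shrunk by about α/x_iᵀx_i).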

Cauchy random effects
The first alternative model that we consider is a random effects model where the random effects are assumed to follow a Cauchy distribution. This distribution is illustrated in Figure 1 (left), in comparison to the normal distribution. The Cauchy distribution has fatter tails than the normal distribution, which makes it an intuitive first alternative towards obtaining a sparse solution.
Specifically, we consider each random effect β_i to follow a Cauchy distribution with location 0 and scale σ_s,

p(β_i) = 1/(πσ_s [1 + (β_i/σ_s)²]),

where β is assumed to be independent of ε. Combining p(β) with the conditional distribution of y in Equation (3), the log-likelihood of the parameters is given by

log L(β, θ) = log p(y | β) − p log(πσ_s) − Σ_i log[1 + (β_i/σ_s)²].    (8)

To maximise Equation (8), an additional step is usually needed to approximate the log-likelihood by a quadratic form [22, pp. 464-466]. However, we can derive a more stable estimation as follows. First, note that the log-likelihood contribution of the random effect β_i is

log p(β_i) = −log(πσ_s) − log[1 + (β_i/σ_s)²].

The first derivative of this log-likelihood is given by

∂ log p(β_i)/∂β_i = −(2/(σ_s² + β_i²)) β_i,    (9)

where the weights 2/(σ_s² + β_i²) form the diagonal entries of a diagonal matrix D_s⁻¹. Secondly, we combine the expression in Equation (9) with the derivative of the log-likelihood log L in Equation (8) at fixed θ to obtain

∂ log L/∂β = XᵀΣ⁻¹(y − Xβ) − D_s⁻¹β.

At fixed θ, we set this equation to zero and solve it for β to obtain the estimates of β as

β̂_C = (XᵀΣ⁻¹X + D_s⁻¹)⁻¹ XᵀΣ⁻¹y,    (10)

where the subscript C in β̂_C denotes the Cauchy distribution assumption on the random effects β. Since D_s⁻¹ depends on β, with a starting value β⁰_C the estimation of β_C is done alternately at a fixed value of θ, by first computing D_s⁻¹ and then calculating β̂_C as in Equation (10). The error variance σ² is estimated robustly, and σ²_s is estimated through cross-validation.
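Our reading of the alternating scheme around Equation (10) is an iteratively reweighted ridge solve. The sketch below fixes σ² and σ²_s at assumed values rather than estimating them as the paper does, and uses toy data of our own:

```python
import numpy as np

def cauchy_random_effects(X, y, sigma2, sigma_s2, n_iter=100):
    # Alternate between updating D_s^{-1} (which depends on the current beta)
    # and the ridge-like solve of Equation (10):
    #   beta_C = (X^T X / sigma2 + D_s^{-1})^{-1} X^T y / sigma2.
    p = X.shape[1]
    beta = np.zeros(p)                        # starting value beta^0_C
    for _ in range(n_iter):
        d_inv = 2.0 / (sigma_s2 + beta ** 2)  # diagonal entries of D_s^{-1}
        A = X.T @ X / sigma2 + np.diag(d_inv)
        beta = np.linalg.solve(A, X.T @ y / sigma2)
    return beta

rng = np.random.default_rng(3)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = 4.0 * X[:, 0] + rng.normal(size=n)

beta_c = cauchy_random_effects(X, y, sigma2=1.0, sigma_s2=0.01)
# The weight 2/(sigma_s^2 + beta_i^2) collapses once beta_i moves away from
# zero, so no estimate is ever shrunk exactly to zero.
print(np.count_nonzero(beta_c) == p)
```

Note that the update is a majorise-minimise step for the log(1 + β_i²/σ_s²) penalty, which is why the iteration is stable; it also makes plain why the Cauchy model yields estimates near, but never exactly at, zero.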

Mixture-of-normals random effects
Previous approaches to obtaining a sparse solution relied on the assumption that the model parameters follow a distribution that puts some mass at zero and has fatter tails (for example, the Laplace distribution). With this assumption, some parameters receive heavy shrinkage towards zero in the estimation, while the other parameters are allowed to be estimated naturally away from zero. In the RR, assuming a (single) normal distribution with zero mean produces random effects estimates that are all shrunk towards the zero mean. To obtain a sparse solution, we need to allow some parameters to be estimated naturally away from zero, by constructing a random effects distribution that contains non-zero means.
Hence, a natural immediate extension to obtain a sparse solution is to assume that the parameters β follow a mixture of three normal distributions. The first and third components of the mixture are assumed to have negative and positive means, respectively, and the second component to have zero mean. The second component with zero mean is the one that will put shrinkage on some estimates towards zero and give a sparse solution.
Specifically, we assume that the parameters β ≡ {β_i} follow the mixture distribution

p(β_i) = Σ_{j=1}^{g} π_j N(β_i | μ_j, σ²_βj),    (11)

where g = 3 is the number of components in the mixture distribution, π_j is the mixing proportion with 0 ≤ π_j ≤ 1 for all j and Σ_j π_j = 1, μ_j is the mean of the jth normal component, constrained so that μ_1 < 0, μ_2 = 0, and μ_3 > 0, and σ²_βj is the variance of the jth normal component in the mixture distribution.
To estimate the parameters, we first rewrite model (1) as

y_k = x_kᵀβ + ε_k,

for k = 1, . . . , n, where x_kᵀ is the kth row of X. Defining θ ≡ {μ_j, σ²_βj, π_j, Σ = σ²I_n; j = 1, 2, 3}, and combining the expressions in Equations (3) and (11), the contribution of observation k to the extended likelihood of β in Equation (2) can be written as

log L_k(β, θ) = log p(y_k | β) + (1/n) Σ_i log[Σ_j π_j p_j(β_i)],    (12)

where p_j(·) is a normal density function with mean μ_j and variance σ²_βj. At fixed θ, taking the derivative of log L_k(β, θ) with respect to β gives us

∂ log L_k/∂β = x_k(y_k − x_kᵀβ)/σ² − (1/n) Σ_j P_j (β − μ_j 1_p)/σ²_βj,    (13)

where 1_p is a p-vector of ones. Here we define

p_ij = π_j p_j(β_i) / Σ_l π_l p_l(β_i)

and P_j as a diagonal matrix of size p × p with diagonal entries p_ij for i = 1, . . . , p. The quantity p_ij is the probability that β_i belongs to mixture component j.
Combining Expression (13) over k = 1, . . . , n, we arrive at

∂ log L/∂β = XᵀΣ⁻¹(y − Xβ) − Σ_j P_j D_j⁻¹(β − μ_j 1_p),

with Σ = σ²I_n and D_j = σ²_βj I_p. Equating the above expression to zero yields the estimate

β̂_M = (XᵀΣ⁻¹X + Σ_j P_j D_j⁻¹)⁻¹ (XᵀΣ⁻¹y + Σ_j P_j D_j⁻¹ μ_j 1_p).    (17)

We can see that, as a result of assuming a mixture distribution on the random effects β, the expression for β̂_M in Equation (17) contains an additional term that shifts a group of parameter estimates towards their corresponding μ_j, depending on their probability of belonging to mixture component j. For example, if β_i has a high probability of belonging to group j = 3, with a positive mean, the estimate is shrunk towards μ_3 rather than zero. As a result, we have a non-linear shrinkage. The expression for β̂_M in Equation (17) reduces to the ridge regression estimate β̂_N in Equation (5) if μ_j = 0 and σ²_βj = σ²_β for all j = 1, 2, 3, since Σ_j P_j = I_p.
The estimation of θ is done jointly with the estimation of β_M in an iterative manner. We use a non-standard expectation-maximisation (EM) algorithm, where the estimation of the random effects is done in the maximisation step instead of the expectation step. The advantage is that the quantities in β̂_M and θ ≡ {μ_j, σ²_βj, π_j, Σ = σ²I_n; j = 1, 2, 3} have closed-form (tractable) expressions. To obtain a sparse solution, σ²_β2 is constrained to a small value, typically of the order of 10⁻⁴-10⁻⁵ times the estimates of σ²_β1 or σ²_β3, and we constrain σ²_β1 = σ²_β3. Referring to Figure 1 (right), this means that the middle component is a 'spike'. The variances associated with the non-zero means (σ²_β1 and σ²_β3) are estimated via cross-validation.
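A sketch of the β update in Equation (17), under simplifications we introduce ourselves: the component means, the spike variance, and the mixing proportions are held fixed (the paper estimates μ_j and π_j within the iterations and tunes σ²_β1 = σ²_β3 by cross-validation), and the toy design is orthogonal rather than multicollinear so its behaviour is easy to predict:

```python
import numpy as np

def normal_pdf(b, mean, var):
    return np.exp(-((b - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_normals_beta(X, y, sigma2, mu, var, pi, n_iter=50):
    # Sketch of the beta update in Equation (17); mu, var, pi are held fixed.
    # mu = (mu1 < 0, 0, mu3 > 0); var[1] is the small 'spike' variance.
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        # p_ij: probability that beta_i belongs to mixture component j.
        dens = np.array([pi[j] * normal_pdf(beta, mu[j], var[j]) for j in range(3)])
        resp = dens / dens.sum(axis=0)
        # Equation (17): beta_M = (X'X/s2 + sum_j P_j D_j^-1)^-1
        #                         (X'y/s2 + sum_j P_j D_j^-1 mu_j 1_p).
        d_inv = sum(resp[j] / var[j] for j in range(3))   # diag of sum_j P_j D_j^-1
        shift = sum(resp[j] * mu[j] / var[j] for j in range(3))
        A = X.T @ X / sigma2 + np.diag(d_inv)
        beta = np.linalg.solve(A, X.T @ y / sigma2 + shift)
    return beta

rng = np.random.default_rng(4)
n, p = 200, 10
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = Q * np.sqrt(n)                                  # orthogonal toy design
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

beta_m = mixture_normals_beta(
    X, y, sigma2=0.25,
    mu=(-2.0, 0.0, 2.0),                            # mu1 < 0, mu2 = 0, mu3 > 0
    var=(1.0, 1e-4, 1.0),                           # middle component is a 'spike'
    pi=(0.3, 0.4, 0.3),
)
print(beta_m[:2], np.abs(beta_m[2:]).max())
```

Coefficients captured by the spike are shrunk hard towards zero (penalty 1/σ²_β2), while coefficients claimed by the outer components are pulled gently towards μ_1 or μ_3, reproducing the three-group, non-linear shrinkage described above.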

Datasets
In our study, we illustrate the above methods using two datasets. The first is from a calibration of NIR instruments with spectra from milk batches, as previously described in [9,14]. Seventy milk batches were drawn from a production line in a milk-processing factory, and their fat concentrations were measured. For each milk batch, an NIR spectrum was scanned between 829 and 1145 nm at a wavelength resolution of 2.45 nm, creating a spectra matrix of 70 rows (observations) and 130 variables (wavelengths). Figure 2 (left) shows the spectra of the 70 milk batches after centring on their respective variable means. The second dataset is from corn specimens described on the Eigenvector.com website [13]. It consists of NIR spectra from 80 corn specimens measured on two spectrometers (mp5 and mp6). The wavelength range is 1100-2498 nm at 2 nm intervals (creating 700 variables). The moisture, oil, protein, and starch values of each corn specimen were measured, so we have four calibration models for each spectrometer, a total of eight calibration models in this dataset. Figure 2 (top-right and bottom panels) shows the spectra of the 80 corn specimens after centring to zero column means. The results of the analysis on the corn dataset are presented in the supplementary material.

Cross-validation
In the cross-validation, we split the n observations into two subsets. The first is a training set with n_t observations, in which we estimate the regression parameters. The second is a validation set with n_v observations, in which we compute the predicted fat concentrations using the regression parameters obtained from the training set. For each regression method, we calculate the root mean square error of prediction,

RMSEP(γ) = sqrt( (1/n_v) Σ_k (y_k − ŷ_k)² ),

where γ represents the relevant tuning parameter, the summation is over the n_v observations in the validation set, and ŷ_k is the predicted fat concentration using the parameter estimates from the training set. The use of cross-validation to estimate a tuning parameter is appropriate in our NIR calibration problem; it can be shown that an estimate of a tuning parameter based on cross-validation also maximises its profile likelihood. In the general case where the number of variables increases at a higher rate than the number of samples, careful consideration is needed in the choice of method for estimating the tuning parameter (see, for example, [6,27]).
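As an illustration (the data, grid, and function names are ours), RMSEP-based tuning of the ridge λ over a single training/validation split can be sketched as:

```python
import numpy as np

def rmsep(y_obs, y_pred):
    # Root mean square error of prediction over the validation set.
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def tune_ridge_lambda(X_t, y_t, X_v, y_v, lambdas):
    # Pick the tuning parameter (here the ridge lambda) minimising RMSEP
    # on the validation set; the training set supplies the estimates.
    best_lam, best_err = None, np.inf
    p = X_t.shape[1]
    for lam in lambdas:
        beta = np.linalg.solve(X_t.T @ X_t + lam * np.eye(p), X_t.T @ y_t)
        err = rmsep(y_v, X_v @ beta)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

rng = np.random.default_rng(5)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=n)
X_t, y_t, X_v, y_v = X[:40], y[:40], X[40:], y[40:]

lam, err = tune_ridge_lambda(X_t, y_t, X_v, y_v, np.exp(np.arange(-9, 5)))
print(lam, err)
```

The same loop applies to any of the models in this paper by swapping the estimator and letting γ be σ_v, σ²_s, or σ²_β1 = σ²_β3 as appropriate.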

Results
In this section, we present the results of our analysis on the milk dataset; the results on the corn dataset are presented in the supplementary material, except for the summary of prediction errors, which is presented in Section 4.4.

Figure 3 shows the results of cross-validation and the estimates of the normal random effects. The figure indicates that the optimal λ is estimated at exp{−9}. The estimates β̂_N (from the training set) in Figure 3 show that we cannot obtain a sparse solution: the estimates are all non-zero. From a variable selection point of view, we do not have a simple interpretation of the estimates, as all of the wavelengths are taken into account in the prediction. The estimates β̂_N follow approximately a normal distribution (not shown), reflecting our assumption on β_N.

Cauchy and Laplace random effects models
Figure 4 presents the results of cross-validation and the random effects estimates under the Cauchy and Laplace random effects models. For the Cauchy random effects model, the estimate of σ²_s is exp{4}, which minimises the validation residual sum of squares. The figure shows that some parameters are estimated away from zero while the others are estimated near zero. When we 'zoom in' on the estimates β̂_C around zero, we find that they are not exactly zero, only distributed tightly around zero (see Figure 1 in the supplementary material, top-left panel). This indicates that the Cauchy distribution, with its fatter tails, allows some estimates to be away from zero, but does not shrink any estimate exactly to zero. The behaviour of the estimates β̂_C around zero is similar to that of the normal random effects β̂_N.

For the lasso regression, Figure 4 indicates that the estimate of λ_L ≡ 1/σ_v is exp{−5.079}. The algorithm estimates the random effects in 'steps', so the evaluated values of λ_L cannot be made equally spaced. Looking at the bottom-right panel on the estimates β̂_L, we have a sparse solution in which many random effects are shrunk to zero while the others are estimated as non-zero. Unlike the estimates β̂_C, some of the lasso estimates β̂_L are zero: in practice, these estimates are very small (of the order of 10⁻⁷-10⁻⁵) and reported as zero.

Mixture-of-normals random effects model
For the mixture-of-normals random effects model, the estimates of σ²_βj for j = 1, 3 are presented in Figure 5. Based on the cross-validation, the variances σ²_β1 = σ²_β3 are estimated at exp{5}. The estimates β̂_M, presented in the right panel of the figure, exhibit a sparse solution, where some estimates are shrunk to zero.
Compared to the lasso solution, the solution under the mixture-of-normals random effects model has fewer zero estimates. The estimated mixing proportions π_j are 0.471, 0.113, and 0.416 for j = 1, 2, 3, so the zero estimates make up approximately 11% of the 130 wavelengths. The estimated means of the mixture components μ_j are −25.4 and 28.7 for j = 1, 3 (the middle component is constrained to have zero mean). This is reflected in the estimates β̂_M, which are separated into three groups as displayed in Figure 5.
Figure 5 also reveals an interesting feature: there are six regions of wavelengths with 'blocks' of non-zero random effects estimates, three of each sign. It has been suggested in the context of PLS regression that selecting regions of wavelengths, rather than individual wavelengths, can potentially give better prediction (e.g. [19,21,24]) or better stability of the estimates [12]. These studies support the notion that within the NIR wavelength range there are some intervals where the main information lies.
Our results from the mixture-of-normals random effects model on the milk data seem to support this idea. Although we do not explicitly model the spatial information between wavelengths, the existing correlation structure between wavelengths and the mixture model assumption on the random effects together accentuate the regions of wavelengths that are positively or negatively associated with the fat concentration. This result is not specific to this dataset; a similar interpretation can be drawn from the different calibration models in the corn dataset (presented in the supplementary material).

Prediction
To check whether the models are reasonable for real application, we compare the predicted fat concentrations in the validation set to their observed values under the different models, as presented in Figure 6 for the milk dataset. The predicted fat concentrations are based on the parameter estimates obtained in the training set. (The elastic-net and adaptive lasso models are described in the supplementary material.) In general, all of the models considered in this study give good predictions. The figure indicates that the Laplace random effects model exhibits the lowest RMSEP, while the RR exhibits the highest RMSEP, in the milk dataset. The RMSEPs for the different calibration models in the corn and milk datasets are presented in Table 1. Overall, the different calibration models indicate good prediction errors. It is not categorically conclusive that one model is always superior to the others; some of the reductions in prediction error observed in these datasets are marginal. The mixture-of-normals random effects model tends to have lower prediction error in the majority of calibration models.

Discussion
In this study, we have investigated sparse alternatives to RR in the NIR calibration problem using a random effects approach. This approach gives us the flexibility to modify the distributional assumption on the parameters to obtain a sparse solution. We concentrate on distributions that put more density mass around zero but still allow some random effects to be estimated away from zero. In this paper, we investigate the Cauchy and mixture-of-normals distributions for the random effects as alternatives to the normal distribution, and briefly review lasso regression as a well-known alternative.
In a strict sense, assuming a Cauchy distribution for the random effects does not produce a sparse solution, as none of the estimates is shrunk to zero. However, apart from the Laplace distribution, the Cauchy distribution is another natural alternative to the normal distribution: it has fatter tails, which we expect to allow some random effects to be estimated away from zero. Although some parameters are indeed estimated away from zero, the model does not generally improve the prediction compared to the RR.
We found that the mixture-of-normals random effects model is a better alternative to the RR than the Cauchy and Laplace random effects models. Its estimates form a sparse solution and, with extremely high multicollinearity present, have a more intuitive interpretation than those of the Cauchy or Laplace random effects models. Our results suggest that the model gives good predictions in cross-validation.

Conclusion
The challenges in the calibration of NIR instruments have recently been addressed with sparse regression models. We investigated a novel sparse alternative to RR within a random effects model framework, where the random effects are assumed to follow a mixture of three normal distributions. This assumption results in a sparse solution in which some estimates are exactly zero. Compared to the RR or lasso regression, the mixture-of-normals random effects model has a more intuitive interpretation with relatively better prediction.