A comparison of some new and old robust ridge regression estimators

Abstract Ridge regression is used to circumvent the problem of multicollinearity among predictors, and many estimators for the ridge parameter are available in the literature. However, when the level of collinearity among predictors is high, the existing estimators also have high mean squared errors (MSE). In this paper, we consider some existing estimators and propose new estimators for the ridge parameter k. Extensive Monte Carlo simulations as well as a real-life example are used to evaluate the performance of the proposed estimators based on the MSE criterion. The results show the superiority of our proposed estimators over the existing ones.


Introduction
Multiple regression models are extensively used as they play an important role in business, economics, environmental science, finance, and the social sciences. In the classical linear regression model, it is assumed that there is no dependency among the predictor variables, which might not be true in practice. A violation of this assumption leads to the problem of multicollinearity. In the presence of multicollinearity, it is difficult to estimate the individual effect of a particular predictor. Moreover, the regression coefficients have large sampling variances and may take wrong signs, which affects both inference and prediction. Thus, multicollinearity is a problem of great concern in linear regression analysis. In the literature, there are various methods to address this problem, for example, Partial Least Squares (PLS), Principal Components Regression (PCR), and Ridge Regression (RR); ridge regression is one of the most popular among them. To define ridge regression, we consider the traditional multiple linear regression model

\[ Y = X\beta + u, \quad (1) \]

where Y is an n × 1 vector of observations; X is an n × p full-rank matrix known as the design matrix; β is a p × 1 vector of unknown regression coefficients; and u is an n × 1 vector of stochastic errors with a multivariate normal distribution having mean vector 0 and variance-covariance matrix σ²Iₙ, where Iₙ is the identity matrix of order n, i.e.,

\[ \mathrm{Cov}(u) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 I_n. \]

Under the assumptions of the classical linear regression model, the Ordinary Least Squares (OLS) estimator has minimum variance. The OLS estimator of β is

\[ \hat{\beta} = (X'X)^{-1}X'Y. \]

The OLS estimator depends heavily on the characteristics of the matrix C = X′X: if C is ill-conditioned, or det(X′X) ≈ 0, then multicollinearity exists.
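The ill-conditioning of C = X′X can be inspected numerically. The sketch below is purely illustrative (the data-generating scheme, variable names, and parameter values are our own assumptions, not taken from the paper): it builds highly correlated predictors and examines the eigenvalue spread of C, using the square root of the largest-to-smallest eigenvalue ratio as one common diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 50, 4, 0.99

# Predictors sharing a common component, so pairwise correlations are
# close to rho**2 and C = X'X is ill-conditioned.
Z = rng.standard_normal((n, p + 1))
X = np.sqrt(1 - rho**2) * Z[:, :p] + rho * Z[:, [p]]

C = X.T @ X
eigvals = np.linalg.eigvalsh(C)                   # eigenvalues of C, ascending
kappa = np.sqrt(eigvals.max() / eigvals.min())    # condition-number diagnostic
print(f"eigenvalue ratio = {eigvals.min() / eigvals.max():.4f}, kappa = {kappa:.1f}")
```

A small smallest-to-largest eigenvalue ratio (and a large kappa) signals that C is near-singular and OLS will be unstable.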
In such cases, the OLS estimator becomes unstable and has a large mean squared error (MSE). Therefore, as alternatives to OLS, different methods have been proposed, and the ridge regression method (Hoerl and Kennard 1970) is one of them. To resolve the problem of multicollinearity, Hoerl and Kennard (1970) proposed the ridge estimator

\[ \hat{\beta}(k) = (X'X + kI_p)^{-1}X'Y, \]

where the constant k > 0 is the ridge (biasing) parameter. For k = 0, one recovers the OLS estimator. The ridge estimator is biased; however, it yields a smaller mean squared error (MSE) than OLS (Hoerl and Kennard 1970). MSE(β̂(k)) depends on the unknown parameters k, β, and σ². In practice, k is estimated from the data, and different researchers have proposed different estimators for k, e.g., Gunst and Mason (1977), Dempster et al. (1977), Golub et al. (1979), Hoerl and Kennard (1970), Hocking et al. (1976), Hemmerle and Brantle (1978), Gibbons (1981), Wichern and Churchill (1978), McDonald and Galarneau (1975), Lawless and Wang (1976), Singh and Tracy (1999), Kibria (2003), Khalaf et al. (2013), and Firinguett et al. (2017), among others. The aim of this article is to introduce some new estimators for k and compare them with some existing estimators. Further, the problem of multicollinearity in relation to the error variance has not been investigated previously, and the current study is intended to explore it. This article unfolds as follows. Section 2 reviews some existing estimators and proposes new estimators of the ridge parameter; the design of the Monte Carlo simulations used to assess the efficiency of the different estimators is discussed in the same Sec. 2. Sec. 3 deals with the simulation results. In Sec. 4, we analyze a real data set, while some concluding remarks are given in Sec. 5.
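As a minimal sketch of the estimator just defined (the simulated data and the value k = 1 are hypothetical, chosen only for illustration), the ridge fit can be computed directly, and setting k = 0 reproduces the OLS solution:

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Ridge fit beta_hat(k) = (X'X + k I)^{-1} X'y; k = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.standard_normal(30)

b_ols = ridge_estimator(X, y, 0.0)    # k = 0: identical to OLS
b_ridge = ridge_estimator(X, y, 1.0)  # k > 0: coefficients are shrunk
print(b_ols, b_ridge)
```

Since the ridge solution shrinks each canonical coefficient by the factor λᵢ/(λᵢ + k), the norm of the ridge coefficient vector is always smaller than that of the OLS vector.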

Some old and new estimators
This section discusses some existing estimators and then introduces new estimators for the generalized ridge regression model. Before proceeding, suppose there exists an orthogonal matrix D such that Λ = D′CD, where Λ = diag(λ₁, λ₂, …, λ_p) contains the eigenvalues of the matrix C. Then, the modified form of model (1) is

\[ Y = X^{*}\alpha + u, \]

where X* = XD and α = D′β. Consequently, the generalized ridge regression estimator can be written as

\[ \hat{\alpha}(K) = (\Lambda + K)^{-1} X^{*\prime} Y, \]

where K = diag(k₁, k₂, …, k_p), kᵢ > 0, and α̂ = Λ⁻¹X*′Y is the OLS estimate of α. Hoerl and Kennard (1970) suggested that the value of k should minimize

\[ \mathrm{MSE}(\hat{\alpha}(k)) = \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + k_i)^2} + \sum_{i=1}^{p} \frac{k_i^2 \alpha_i^2}{(\lambda_i + k_i)^2}. \quad (6) \]

In Equation (6), the first term is the variance and the second is the squared bias, where σ² represents the error variance of model (1) and αᵢ is the i-th element of α. In practice, σ² and α are seldom known; therefore, Hoerl and Kennard (1970) proposed the following estimator based on unbiased estimates:

\[ \hat{k}_i = \frac{\hat{\sigma}^2}{\hat{\alpha}_i^2}, \]
where σ̂² = Σeᵢ²/(n − p) is the residual mean square estimate, eᵢ² are the squared residuals, and α̂ᵢ is the i-th element of α̂. Soon after the above proposal, Hoerl and Kennard (1970) suggested the single-valued estimator

\[ \hat{k}_{HK} = \frac{\hat{\sigma}^2}{\hat{\alpha}_{\max}^2}, \]

where α̂²_max is the maximum element of α̂². In more recent work, Kibria (2003) proposed a different estimator of k by taking the geometric mean of the α̂ᵢ², that is,

\[ \hat{k}_{KGM} = \frac{\hat{\sigma}^2}{\left( \prod_{i=1}^{p} \hat{\alpha}_i^2 \right)^{1/p}}. \]

Furthermore, Kibria (2003) suggested another modified estimator by taking the median of mᵢ = σ̂²/α̂ᵢ²:

\[ \hat{k}_{KMED} = \mathrm{median}(m_i). \]

Khalaf et al. (2013) suggested an estimator of k involving t_max, the maximum eigenvalue of the matrix C.
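Under the canonical form used above, the reviewed estimators can be sketched in a few lines. The function names, the simulated data, and the residual-based σ̂² are our own illustrative choices; the Khalaf et al. (2013) estimators are omitted here because their closed forms are not reproduced in the text.

```python
import numpy as np

def canonical_quantities(X, y):
    """Eigen-decompose C = X'X and return (lambda, alpha_hat, sigma2_hat)."""
    lam, D = np.linalg.eigh(X.T @ X)              # eigenvalues and orthogonal D
    alpha_hat = (X @ D).T @ y / lam               # Lambda^{-1} X*' y
    n, p = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = resid @ resid / (n - p)          # residual mean square
    return lam, alpha_hat, sigma2_hat

def k_hk(alpha_hat, sigma2_hat):
    """Hoerl-Kennard: sigma^2 / max(alpha_i^2)."""
    return sigma2_hat / np.max(alpha_hat**2)

def k_kgm(alpha_hat, sigma2_hat):
    """Kibria (2003): sigma^2 over the geometric mean of alpha_i^2."""
    return sigma2_hat / np.prod(alpha_hat**2) ** (1 / len(alpha_hat))

def k_kmed(alpha_hat, sigma2_hat):
    """Kibria (2003): median of m_i = sigma^2 / alpha_i^2."""
    return np.median(sigma2_hat / alpha_hat**2)

# Illustrative run on simulated data (hypothetical, not the paper's design):
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.standard_normal(40)
lam, a, s2 = canonical_quantities(X, y)
print(k_hk(a, s2), k_kgm(a, s2), k_kmed(a, s2))
```

Note that k_HK is the minimum of the mᵢ, so it is never larger than the geometric-mean or median versions; the latter two trade extra shrinkage against robustness.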
Later, Khalaf et al. (2013) suggested a further estimator of k involving tᵢ, the i-th eigenvalue of the matrix C. We now propose some new modified estimators based on the 5th percentile (P₀₅), first quartile (Q₁), second quartile (Q₂), third quartile (Q₃), and 90th percentile (P₉₀) of the quantity

\[ W_i = \frac{t_{\max}}{|\hat{\alpha}_i|}. \]

The reason for considering percentiles and quartiles is their robustness against outliers. In particular:

P1: The first proposal to estimate k is

\[ \hat{k}_{SHM_1} = P_{05}(W_i), \]

where P₀₅ denotes the 5th percentile of the Wᵢ.

P2: The second proposed estimator, SHM₂, is defined as

\[ \hat{k}_{SHM_2} = Q_1(W_i), \]

where Q₁ denotes the first quartile.
P3: The third proposal, SHM₃, is defined as

\[ \hat{k}_{SHM_3} = Q_2(W_i), \]

where Q₂ denotes the second quartile (the median).
P4: The fourth proposal is

\[ \hat{k}_{SHM_4} = Q_3(W_i), \]

where Q₃ denotes the third quartile.
P5: Next, we propose SHM₅, which is defined as

\[ \hat{k}_{SHM_5} = P_{90}(W_i), \]

where P₉₀ denotes the 90th percentile.
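The percentile-based proposals P1-P5 can be sketched compactly (SHM₆ is omitted because its defining statistic is not given separately in the text). The helper below assumes the eigenvalues of C and the canonical estimates α̂ have already been computed; the function name and the illustrative values are our own.

```python
import numpy as np

def shm_estimators(lam, alpha_hat):
    """Proposed percentile-based ridge parameters from W_i = t_max / |alpha_i|."""
    w = lam.max() / np.abs(alpha_hat)   # t_max: largest eigenvalue of C = X'X
    return {
        "SHM1": np.percentile(w, 5),    # P1: 5th percentile
        "SHM2": np.percentile(w, 25),   # P2: first quartile
        "SHM3": np.percentile(w, 50),   # P3: second quartile (median)
        "SHM4": np.percentile(w, 75),   # P4: third quartile
        "SHM5": np.percentile(w, 90),   # P5: 90th percentile
    }

# Illustrative eigenvalues and canonical coefficients (hypothetical values):
lam = np.array([10.0, 2.0, 0.5, 0.01])
alpha_hat = np.array([0.8, -0.4, 0.3, 0.1])
print(shm_estimators(lam, alpha_hat))
```

Because they are order statistics of the same sample, the five estimates are automatically ordered: SHM₁ ≤ SHM₂ ≤ SHM₃ ≤ SHM₄ ≤ SHM₅.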
P6: To estimate k, our last proposal is SHM₆, which is also defined in terms of the 90th percentile P₉₀ of the Wᵢ.

2.1. Performance assessment measure

To evaluate the performance of the OLS and ridge estimators, we compute the MSE using

\[ \mathrm{MSE}(\hat{\beta}) = \frac{1}{N} \sum_{j=1}^{N} (\hat{\beta}_j - \beta)'(\hat{\beta}_j - \beta), \]

where β̂ is the estimator of β obtained from OLS or a ridge estimator and N denotes the number of replicates used in the Monte Carlo simulation. Using the MSE as the performance criterion, we compare the HK, KGM, KMED, KMS, KSM, SHM₁, SHM₂, SHM₃, SHM₄, SHM₅, and SHM₆ estimators. In particular, we consider the following settings to assess the performance of the above-mentioned estimators.
For the numerical study, we generated the predictors using

\[ x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho \, z_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p, \]

where the z_{ij} are independent standard normal variates. We consider three different levels of correlation, ρ = (0.90, 0.95, 0.99). The observations on the dependent variable are then determined by

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + u_i, \quad i = 1, 2, \ldots, n, \]

where the uᵢ are independent and identically distributed standard normal random variables, n is the number of observations, and β₀ is taken to be zero without loss of generality. The values of the regression coefficients β were chosen such that the sum of the squared coefficients equals 1, that is, β′β = 1 (Hoerl et al. 1975; Kibria 2003; Muniz and Kibria 2009; Khalaf et al. 2013).
We consider moderate and high degrees of correlation, i.e., ρ = 0.90, 0.95, 0.99, whereas to assess the effect of sample size, n = 25 and 50 are considered. Moreover, we consider p = 4, 8, and 16 predictors with error variances σ² = 1, 4, 5, 10, and 16. For fixed values of n, p, ρ, and σ², the regression coefficients were obtained so that β′β = 1; that is, β was chosen as the normalized eigenvector corresponding to the largest eigenvalue. Furthermore, among the error distributions considered, the uᵢ are generated, for example, as:

14. uᵢ follow the normal(0, 5 + i)
15. uᵢ follow the normal(0, 10 + i)
16. uᵢ follow the normal(0, 16 + i)

The whole procedure was repeated 2000 times, and the estimated MSEs for all the estimators are listed in the next section.
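A condensed sketch of one cell of this design might look as follows. It uses normal errors, compares OLS against ridge with a single illustrative value k = 1 rather than the paper's estimated ridge parameters, and runs 500 rather than 2000 replicates for speed; all of these simplifications are our own assumptions.

```python
import numpy as np

def simulate_mse(n=25, p=4, rho=0.99, sigma2=1.0, k=1.0, N=500, seed=0):
    """Empirical MSE = (1/N) * sum (b_hat - b)'(b_hat - b) for OLS and ridge."""
    rng = np.random.default_rng(seed)
    sse_ols = sse_ridge = 0.0
    for _ in range(N):
        # Correlated predictors via a shared standard-normal component.
        Z = rng.standard_normal((n, p + 1))
        X = np.sqrt(1 - rho**2) * Z[:, :p] + rho * Z[:, [p]]
        # beta: normalized eigenvector of the largest eigenvalue, so beta'beta = 1.
        lam, D = np.linalg.eigh(X.T @ X)
        beta = D[:, -1]
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
        sse_ols += (b_ols - beta) @ (b_ols - beta)
        sse_ridge += (b_ridge - beta) @ (b_ridge - beta)
    return sse_ols / N, sse_ridge / N

mse_ols, mse_ridge = simulate_mse()
print(f"MSE(OLS) = {mse_ols:.3f}, MSE(ridge, k=1) = {mse_ridge:.3f}")
```

With ρ = 0.99 the OLS MSE is inflated by the near-singular X′X, while even this crude fixed-k ridge fit stays substantially smaller, which is the qualitative pattern the tables below report.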

Result and discussion
The estimated MSEs are listed in Tables 1-15, and additional tables are given in the Supplementary Text (Tables S1-S12). A general remark about the MSE is that different factors, such as the variance of the error terms, the sample size, and the degree of correlation, affect the simulation results. More specifically, we observed that a high degree of correlation leads to a high MSE for OLS, whereas the ridge regression estimators have smaller MSEs than OLS; thus, the ridge regression estimators perform better than OLS. The estimators have also been ranked on the basis of MSE, and the sum of ranks across the different sample sizes and situations is given in the last row of each table. The ranks are computed by assigning the lowest rank to the lowest MSE and the highest rank to the largest MSE, i.e., by arranging the MSEs in ascending order. For example, for the first entry in Table 1, the MSE value is 1.19 and is ranked twelfth among the considered estimators, and the overall rank of OLS is the same. It is also observed that the proposed estimators have smaller MSEs for small error variances and sample sizes. This observation runs against the generally held expectation that the variance should decrease as the sample size increases. We attribute it to the fact that our proposed estimators utilize quartiles and percentiles: when the variance of the error term is small, there is little chance of extreme observations, so in such cases the proposed estimators do not follow the usual pattern, although for large error variances they do. Further, the MSEs of the proposed estimators are not as high as those observed for the old estimators. In any case, there is no doubt about the superiority of the proposed estimators over OLS and the other considered ridge regression estimators.
For the existing estimators, a more specific discussion follows. In Table 2, when the number of predictors is eight, KMS is the best choice for the small sample size and KSM for the moderate sample size. In Table 3, with sixteen predictors, KSM is the best option when the sample size is 25, and KMS performed best for sample size 50. In Table 4, when the degree of correlation equals 0.95 with a small error variance and four predictors, KMS is the best option for n = 25 and KSM for n = 50. However, in Tables 5 and 6, KMS and SHM₁ outperformed the rest. Among the proposed estimators, SHM₁ is the best when the error variance is moderate, and SHM₆ when it is large. In Tables 7-9, with correlation 0.99, SHM₁ is the best for the small sample size and KMS for the moderate sample size. For all the considered numbers of predictors, as the error variance increases, SHM₁, SHM₅, and SHM₆ are better than the existing estimators.
To assess the effect of the error distribution, we also generated the error terms from a Student-t distribution with small and large degrees of freedom. For sample size 25, KGM performs better than the rest, while for sample size 50, SHM₅ is better than the other estimators, as listed in Table 10. However, for moderate to large degrees of freedom, KMS, KSM, and SHM₁ outperformed the other ridge and OLS estimators. In Tables 11-15, for ρ = 0.90, 0.95 and df = 1, KGM and SHM₅ are much better than the remaining estimators, whereas for ρ = 0.99 and df = 1, KGM, KMED, and SHM₅ are the best. For ρ = 0.90 and df = 3, 4, 5, 10, and 16, KMS, KSM, and SHM₁ are superior to the rest. Similarly, for ρ = 0.95, df = 3, 4, 5, and n = 25, KMS outperformed the others; for n = 50, SHM₁ is the best, while for ρ = 0.95, p = 4, and df = 10, 16, KMS outperformed the remaining estimators. For p = 8 and 16, KMS and SHM₁ outperformed the old ridge estimators and the OLS estimator. In Tables S1-S3, for ρ = 0.99 and df = 1, KGM, KMED, and SHM₅ perform better than the remaining estimators. For a high degree of correlation and moderate degrees of freedom, SHM₁ outperformed the rest. Similarly, for large degrees of freedom, i.e., df = 10 or 16 with p = 4 or 8 and n = 25, SHM₁ outperformed the existing ridge and OLS estimators, whereas for n = 50, KMS is the best. However, for a high degree of correlation and p = 8 or 16, for a given n, SHM₁ outperformed the other ridge-type and OLS estimators. Across the different variances, i.e., σ² = 1, 4, 5, 10, and 16, for fixed ρ and given n and p, SHM₅ performs better than the rest.
For ρ = 0.90, sample size 25 or 50, and four predictors, across the different error variances, SHM₅ performs better than the other ridge-type estimators and the OLS estimator (Table S4). In Table S5, for a small error variance, eight predictors, and sample size 25, SHM₅ has the smallest MSE, while for sample size 50, SHM₃ outperformed the remaining estimators. For moderate error variances with n = 25, SHM₅ again has the smallest MSE; for n = 50, SHM₄ performs best; and for high error variances, SHM₅ outperformed the rest. In Table S6, for ρ = 0.90 and p = 16, SHM₃, SHM₄, and SHM₅ perform better than the other estimators used in this study. In particular, for ρ = 0.95 and p = 4 or 8, SHM₅ outperformed the remaining estimators (Tables S7-S9). Further, for p = 16, SHM₄ and SHM₅ are much better than the other ridge estimators and the OLS estimator on the basis of the MSE criterion. From Tables S10-S12, for ρ = 0.90, SHM₅ outperformed the rest. Thus, in the presence of high correlation among the predictor variables, we recommend that practitioners use our newly suggested estimators.
Next, to assess the effect of outliers on the proposed estimators, we conducted a simulation study introducing outliers into the data, i.e., generating 5% of the data from N(4, 1); the results are tabulated in the Supplementary file. From the results reported in Tables S13-S18, the proposed estimators outperform the other estimators when either p or ρ is large.

An example
In the previous sections, we conducted a simulation study to compare the performance of the proposed estimators. However, simulations are usually conducted under ideal conditions. In this section, we therefore consider a real data example taken from Chatterjee and Hadi (2012) to compare the performance of our proposed estimators in a practical situation. The data set concerns the gasoline consumption of cars and was collected for 30 different car models. For these data, we consider the model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + u, \]

where Y represents gasoline consumption, X₁ miles per gallon, X₂ displacement (cubic inches), X₃ torque (foot-pounds), and X₄ the number of transmission speeds. For more details about the data, see Chatterjee and Hadi (2012). The correlation matrix of the variables is shown in Table 16. The correlations among the predictor variables range from moderate to high, indicating a multicollinearity problem in the data. To confirm this, the condition number (κ) is also computed; it is the largest condition index and equals the square root of the ratio of the largest eigenvalue (λ_max) to the smallest eigenvalue (λ_min), i.e.,

\[ \kappa = \sqrt{\lambda_{\max} / \lambda_{\min}}. \]

If there is no collinearity, the eigenvalues, condition indices, and condition number all equal one; as collinearity increases, eigenvalues become both greater and smaller than 1 (eigenvalues close to zero indicate a multicollinearity problem), and the condition indices and the condition number increase. An informal rule of thumb reported in the literature is that multicollinearity is a concern if the condition number exceeds 15, and a very serious concern if it exceeds 30. For the data considered in this study, the condition number is 116.715, which indicates very serious multicollinearity.
It is worth mentioning that the eigenvalues are (5.7497, 0.2346, 0.0099, 0.0046, 0.0007, 0.0004). The MSEs of the suggested estimators are computed from

\[ \mathrm{MSE}(\hat{\alpha}(k)) = \hat{\sigma}^2 \sum_{i=1}^{p} \frac{\lambda_i}{(\lambda_i + k)^2} + k^2 \sum_{i=1}^{p} \frac{\hat{\alpha}_i^2}{(\lambda_i + k)^2}, \]

where k is the estimated ridge parameter. The estimated MSEs of the estimators, along with the regression coefficients, are given in Table 17.
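As a quick sanity check, plugging the eigenvalues reported above into the condition-number formula gives a value close to the 116.715 quoted earlier; the small discrepancy comes from the eigenvalues being rounded to four decimal places.

```python
import numpy as np

# Eigenvalues as reported in the text (rounded to four decimal places).
eigvals = np.array([5.7497, 0.2346, 0.0099, 0.0046, 0.0007, 0.0004])
kappa = np.sqrt(eigvals.max() / eigvals.min())  # sqrt(lambda_max / lambda_min)
print(round(kappa, 3))
```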
From the table, one can easily observe that the ridge estimators, existing as well as proposed, have smaller MSEs than the OLS estimator. It is also observed that the proposed estimators outperform in most conditions, especially when the level of collinearity is high. Further, one can clearly observe that SHM₆ performs best in terms of MSE.

Conclusion
In this article, we proposed new estimators of the ridge parameter and compared them with some old ones. Through Monte Carlo simulations, we evaluated the performance of the ridge estimators in terms of mean squared error. We assessed the performance of the ridge estimators as a function of the following parameters.
Performance as a function of ρ: We fixed the values of n, p, and σ², and evaluated the performance in terms of the degree of correlation. From the simulation study, for a small to moderate degree of correlation, KSM, KMS, and SHM₁ perform better than the other estimators; for a high degree of correlation, SHM₁ and SHM₆ perform better.
Performance as a function of σ²: The performance of the ridge estimators has also been evaluated in terms of σ². From the simulation study, for small values of σ², KSM and KMS outperform the rest. For moderate values of σ², SHM₁ performs considerably better than the others, while for high values of σ², SHM₁, SHM₂, SHM₄, SHM₅, and SHM₆ outperform the other ridge estimators.
Performance as a function of n and p: In general, there is an inverse relation between the sample size and the mean squared error; that is, the MSE decreases as the sample size increases. From our simulation study, for a small sample size and a moderate degree of correlation, KSM, KMS, and SHM₁ outperformed the other estimators. Moreover, for a given number of predictors, a high degree of correlation, and σ² > 1, SHM₁ and SHM₆ are better than the remaining estimators.
Performance as a function of k: For given values of n and p, the estimated value of the ridge parameter k was observed to decrease. Most of the time, for given values of n, p, and ρ, the best estimator was the one with a large value of k. One can also observe that as ρ and σ² increase, our proposed estimators outperform the rest.
In the simulation study, our suggested estimators showed the best performance in many of the evaluated situations, especially when the degree of correlation and the error variance are high. Based on these results, we suggest that practitioners use the proposed estimators. Additional tables supporting these conclusions are provided in the Supplementary Text. In the future, this work may be extended to identify the density and distribution functions of different shrinkage parameters. Although we evaluated the performance of the proposed estimators in the presence of multicollinearity and outliers simultaneously, a more comprehensive study is required in this direction.