Comparative study of $L_1$ regularized logistic regression methods for variable selection

$L_1$ regularized logistic regression constitutes an important tool in data science, dedicated to solving sparse generalized linear problems. The $L_1$ regularization is widely used for variable selection and estimation in generalized linear model analysis, and is intended to select the statistically important predictors. In this paper we compare the performance of some existing $L_1$ regularized logistic regression methods. Our simulation study is directed toward the variable selection performance of regularized logistic regression in high dimensions. We consider three varying n (number of observations), p (number of predictors) settings, and we support this comparison analysis by conducting various simulated experiments that take into consideration the correlation structure of the design matrix.


Introduction
Logistic regression analysis is a popular statistical method for modeling binary classification problems. Initially, logistic regression was mostly implemented in biostatistics (Hoffman 2015); however, its application has spread to other areas, such as financial forecasting (Han, Ma and Yu 2008; Zaidi and Ofori-Abebrese 2016), image processing (Vanlalhruaia, Singh and Singh 2017; Wang and Sun 2012), life sciences (Ekstrøm and Sørensen 2015), medicine (Matthews and Farewell 2015), genetics (Zhang et al. 2018), social sciences (Menard 2001), economics (Cramer and Ridder 1988) and business (Burinskiene and Vitalija 2007), to mention but a few.
In the majority of applications, it is often assumed that the number of observations of a dataset, denoted by n, is much larger than the number of predictor variables, denoted by p, i.e., n > p; there are also cases where n = p. However, there are scenarios where the number of predictors is larger than the number of observations (n < p), such as in the scientific field of genomics (Hauskrecht et al. 2007). It is also well known that, according to the sparsity-of-effects principle, only a small number of predictors are truly significant. Therefore, it is important to identify the important variables in such high dimensional data. This problem of fitting an appropriate model in a high dimensional framework has been under investigation in recent years. Some techniques that have been developed belong to the family of penalized logistic regression models (Hastie, Tibshirani and Friedman 2003).
In this paper, we compare the performance of some existing $L_1$ regularized logistic regression methods. A regularization parameter, $\lambda > 0$, balances the influence of a fidelity term, which measures how well the data are approximated, and of an $L_1$ regularization term, which avoids over-fitting. The iteratively reweighted least squares (IRLS) method is one of the most popular methods for iteratively solving the $L_1$ regularized logistic regression problem. Instead of solving the $L_1$ regularized convex optimization problem directly, IRLS solves, at each iteration, a weighted least squares problem using Newton's method. This approach was first used for the regularized logistic regression model in Lee et al. (2006). The Glmnet method, introduced by Friedman, Hastie and Tibshirani (2010), is a more recent approach that has also been developed for $L_1$ penalized maximum likelihood estimation. Compared to the IRLS approach, Glmnet is known to be faster and more efficient for sparse reconstruction. Our simulation comparison study is directed toward the variable selection and also the estimation accuracy of regularized logistic regression in high-dimensional problems. We are interested in the performance evaluation of these methods in logistic regression models. We consider three varying n, p settings, as those mentioned above, and we support this comparison analysis by conducting various simulated experiments taking into consideration the correlation structure of the dataset.
The purpose of the present paper is to compare the quality of the maximum likelihood estimates when solving the $L_1$ regularized logistic regression problem by IRLS and by Glmnet.
The rest of the paper is organized as follows. In Sec. 2, the logistic model is briefly described. Section 3 summarizes the considered methods. In Sec. 4, a performance comparison between the methods is conducted via simulated datasets with various correlation structures. Section 5 gives some concluding remarks.

Logistic regression model
Logistic regression is a widely used statistical analysis (McCullagh and Nelder 1989; Myers, Montgomery and Vining 2002) when the response variable Y is binary, that is, when it takes one of only two possible values corresponding to success or failure. Let $Y \in \{0, 1\}$, where $Y = 1$ is considered a success and $Y = 0$ a failure. The logit function of the logistic regression model is formulated as follows:

$$\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + x^{T}\beta, \qquad (1)$$

where $\beta_0$ and $\beta$ are the regression parameters of interest and $p(x)$ must lie between 0 and 1. Exponentiating Equation (1) results in the following logistic regression function:

$$p(x) = \frac{\exp(\beta_0 + x^{T}\beta)}{1 + \exp(\beta_0 + x^{T}\beta)}. \qquad (2)$$

Therefore, the target is to model the probability of success given x, i.e.,

$$P(Y = 1 \mid x) = p(x) \qquad (3)$$

and

$$P(Y = 0 \mid x) = 1 - p(x). \qquad (4)$$

By performing maximum likelihood estimation in order to compute the estimates, we obtain the following estimated logistic regression model:

$$\hat{p}(x) = \frac{\exp(\hat{\beta}_0 + x^{T}\hat{\beta})}{1 + \exp(\hat{\beta}_0 + x^{T}\hat{\beta})}. \qquad (5)$$

Let $Y \in \{0, 1\}$ be the binary response variable and consider n observation units in the model, $X = [x_1, x_2, \ldots, x_n]$, where $x_i \in \mathbb{R}^{p}$. The logistic regression model described in Sec. 2 is fitted by regularized maximum likelihood. Let $p_i = p(x_i)$ denote the probability of success for the i-th unit.
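To make the model concrete, the following is a minimal R sketch (illustrative only; the sample size, coefficient values and use of glm() are ours, not part of the simulation study) that generates binary responses from the logistic function in Equation (2) and fits the model by unpenalized maximum likelihood.

```r
# Minimal sketch: simulate binary responses from the logistic model of
# Equation (2) and fit it by (unpenalized) maximum likelihood with glm().
set.seed(1)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)            # predictors, marginally N(0, 1)
beta0 <- 0.5
beta  <- c(2, -1.5, 1, 0, 0)               # hypothetical true coefficients
eta   <- drop(beta0 + X %*% beta)          # linear predictor
prob  <- 1 / (1 + exp(-eta))               # p(x): probability of success
y     <- rbinom(n, 1, prob)                # binary response Y in {0, 1}
fit   <- glm(y ~ X, family = binomial)     # maximum likelihood estimates
coef(fit)
```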

IRLS
We consider the following $L_1$ regularized logistic regression problem for the maximum a posteriori (MAP) estimate of $\beta$:

$$\min_{\beta_0, \beta} \; \sum_{i=1}^{n} -\log p_i + \lambda \|\beta\|_1, \qquad (6)$$

where n is the number of sample observations and $\lambda > 0$ is the regularization parameter; as the value of $\lambda$ is increased, more and more coefficients are set to zero, and the term $\|\beta\|_1$ encourages sparsity in the estimated coefficient vector, so that a sparse approximation of $\beta$ is reconstructed. The model (6) is very difficult to solve due to the non-differentiability and nonlinearity of its terms. Nonlinear optimization techniques are therefore needed to solve (6), and a huge effort has been dedicated to developing suitable and efficient algorithms that can deal with such nonlinear problems (see, e.g., Lee et al. 2006; Roth 2004). The iteratively reweighted least squares (IRLS) method (Green 1984) is known to be the most popular one, due to its simplicity and its convergence guarantees: it converges to the global optimum no matter where it is started. In what follows, we outline the main ideas of this iterative approach to solving problem (6); details are described in Lee et al. (2006). The algorithm starts with an initial guess of the parameter vector and, at each iteration, solves a weighted least squares problem to find a new parameter vector. It first uses Newton's method to find a step direction by approximating the objective of the unregularized logistic regression, $\min_{\beta} \sum_{i=1}^{n} -\log p_i$, by a second-order Taylor expansion. This approximation gives rise to a new $L_1$ regularized problem in which the unregularized term is quadratic. To be more specific, let $\beta^{(k)}$ be the current solution at the k-th iteration. The step direction, denoted by $d^{(k)}$, is obtained by solving the following $L_1$ regularized problem:

$$d^{(k)} = \arg\min_{d} \; \sum_{i=1}^{n} w_i \left( z_i - x_i^{T} d \right)^{2} + \lambda \|d\|_1, \qquad (7)$$

where, for $i = 1, 2, \ldots, n$, the weights $w_i$ and working responses $z_i$ are computed from the current iterate $\beta^{(k)}$. Then, Newton's method computes the next iterate by a line search over the step size parameter t. Note that the resulting least squares problem (7) is referred to as a LASSO problem, i.e., an $L_1$ penalized linear regression. Several methods have been proposed in the literature to solve this kind of problem (see, e.g., Efron et al. 2004; Tibshirani 1996). Another interesting simulation study was conducted using logistic regression with forward stepwise and LASSO variable selection methods (see, e.g., Grogan and Elashoff 2017). In this paper, we use the MATLAB lasso function with cross-validation to choose the regularization parameter $\lambda$.
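For illustration, the following R sketch mimics this IRLS scheme under the notation above; it is not the MATLAB implementation used in our experiments. The inner weighted LASSO problem (7) is solved here with the glmnet function (Gaussian family with observation weights, which applies its own scaling of $\lambda$), a full Newton step replaces the line search over t, and the weights and working responses are taken as $w_i = p_i(1 - p_i)$ and $z_i = x_i^{T}\beta^{(k)} + (y_i - p_i)/w_i$, which is an assumption of this sketch.

```r
library(glmnet)

# Hedged sketch of IRLS for L1-regularized logistic regression: at each
# iteration, form the quadratic (Newton) approximation of the negative
# log-likelihood and solve the resulting weighted LASSO problem, here via
# glmnet with Gaussian family and observation weights.
irls_l1 <- function(X, y, lambda, max_iter = 50, tol = 1e-6) {
  p  <- ncol(X)
  b0 <- 0; b <- rep(0, p)                        # initial guess
  for (k in seq_len(max_iter)) {
    eta  <- drop(b0 + X %*% b)                   # current linear predictor
    prob <- 1 / (1 + exp(-eta))                  # current p_i
    w    <- pmax(prob * (1 - prob), 1e-5)        # weights w_i = p_i (1 - p_i)
    z    <- eta + (y - prob) / w                 # working responses z_i
    fit  <- glmnet(X, z, family = "gaussian", weights = w,
                   lambda = lambda, standardize = FALSE)
    cf     <- as.matrix(coef(fit))
    b0_new <- cf[1, 1]; b_new <- cf[-1, 1]
    converged <- max(abs(c(b0_new - b0, b_new - b))) < tol
    b0 <- b0_new; b <- b_new
    if (converged) break                         # full Newton step, no line search
  }
  list(intercept = b0, beta = b)
}
```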

Glmnet
The Glmnet method solves the following problem, in which the penalized log-likelihood is maximized:

$$\max_{(\beta_0, \beta)} \; \left\{ \ell(\beta_0, \beta) - \lambda P_{\alpha}(\beta) \right\}, \qquad (11)$$

where

$$P_{\alpha}(\beta) = (1 - \alpha)\,\tfrac{1}{2}\|\beta\|_{\ell_2}^{2} + \alpha\|\beta\|_{\ell_1}$$

is the elastic-net penalty (for more information we refer to Zou and Hastie (2005)). Here, $\|\cdot\|_{\ell_1}$ and $\|\cdot\|_{\ell_2}$ stand for the $\ell_1$-norm and the $\ell_2$-norm, respectively. For the binary logistic model, the log-likelihood $\ell(\beta_0, \beta)$ of Equation (11) is written as the following concave function of the parameters:

$$\ell(\beta_0, \beta) = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i (\beta_0 + x_i^{T}\beta) - \log\left(1 + e^{\beta_0 + x_i^{T}\beta}\right) \right]. \qquad (12)$$

Then, the so-called partial Newton algorithm can be used, which makes a partial quadratic approximation to the (unpenalized) log-likelihood of Equation (12). Therefore, if $\hat{\beta}_0$ and $\hat{\beta}$ are the current estimates of the parameters $\beta_0$ and $\beta$, respectively, the quadratic approximation to the log-likelihood is given by the following equation:

$$\ell_Q(\beta_0, \beta) = -\frac{1}{2n} \sum_{i=1}^{n} w_i \left( z_i - \beta_0 - x_i^{T}\beta \right)^{2} + C, \qquad (13)$$

where

$$z_i = \hat{\beta}_0 + x_i^{T}\hat{\beta} + \frac{y_i - \hat{p}_i}{\hat{p}_i (1 - \hat{p}_i)}$$

is a working response, $w_i = \hat{p}_i(1 - \hat{p}_i)$ are weights, $\hat{p}_i$ is estimated at the current parameters, and C is a constant. The Newton update is given by the minimization of the quadratic approximation $\ell_Q$. Therefore, for each value of $\lambda$, we create an outer loop in which we compute $\ell_Q$ of Equation (13) about the current parameters $\hat{\beta}_0$ and $\hat{\beta}$. Then, we apply coordinate descent to solve the penalized weighted least-squares problem

$$\min_{(\beta_0, \beta)} \; \left\{ -\ell_Q(\beta_0, \beta) + \lambda P_{\alpha}(\beta) \right\}. \qquad (14)$$

In this paper, we use the R statistical package Glmnet (Friedman, Hastie and Tibshirani 2010) for our simulation experiments. The regularization parameter $\lambda$ is chosen by cross-validation.
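As an illustration of how the Glmnet scheme of our experiments can be run, the following R sketch (with a design matrix X and binary response y assumed to be available) fits the $L_1$-penalized ($\alpha = 1$) logistic model and chooses $\lambda$ by cross-validation; the choice of 10 folds here is the package default, not a setting prescribed by the study.

```r
library(glmnet)

# Hedged sketch of the Glmnet scheme: L1-penalized (alpha = 1) logistic
# regression, with lambda selected by cross-validation.
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10)
beta_hat <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]  # drop the intercept
selected <- which(beta_hat != 0)   # indices of the selected predictors
```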

Simulation study
The purpose of our simulation study is to compare the performance of some existing $L_1$ regularized logistic regression methods, namely IRLS and Glmnet, which were briefly described in the previous section. This simulation study is directed toward investigating the efficiency of these regularized logistic regression methods in high-dimensional problems. In addition, we consider three different settings of the number of observations (n) and the number of predictors (p).
In this section, we first describe the performance measures, the general design and the structure of our simulation study. We then present and review the results, in terms of variable selection and prediction accuracy, obtained from running the simulation study under the three varying n, p settings investigated.

Description of the performance measures
The purpose of this simulation is to examine the performance of the IRLS and Glmnet methods in terms of variable selection and prediction accuracy. Their performance with respect to variable selection is evaluated by two measures: the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR, defined as the proportion of correctly selected important predictors among the true important predictors, is given by the equation

$$\mathrm{TPR} = \frac{TP}{P}, \qquad (15)$$

where TP is the number of correctly identified important predictors and P is the number of real important predictors. FPR, defined as the proportion of falsely selected important predictors among the true unimportant predictors, is given by the equation

$$\mathrm{FPR} = \frac{FP}{N}, \qquad (16)$$

where FP is the number of incorrectly identified predictors and N is the number of real unimportant predictors. Both the TPR and FPR measures take values between 0 and 1. A model with TPR close to 1 and FPR close to 0 is the desired one.
The performance of the compared methods regarding prediction accuracy is evaluated by the sum of absolute differences between the estimated and true coefficients, $\mathrm{BIAS} = \sum_{j=1}^{p} |\hat{\beta}_j - \beta_j|$, which is called bias (BIAS) for simplicity, and by the BIAS divided by the number of important predictors in the true model, which is called the adjusted bias ($\mathrm{BIAS}_{Adj}$). A model with smaller BIAS and $\mathrm{BIAS}_{Adj}$ values estimates the coefficients more accurately.
Moreover, another performance measure of prediction accuracy is the Balanced Classification Rate (BCR). BCR is the proportion of true results among the total number of cases examined, which is given by the equation

$$\mathrm{BCR} = \frac{TP + TN}{P + N}, \qquad (17)$$

where TN is the number of correctly identified unimportant predictors among the N real unimportant predictors. A model with BCR close to 1 is the desired one.
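For concreteness, the measures above can be computed as in the following R helper, a sketch under our conventions: a predictor counts as selected when its estimated coefficient is nonzero, and the function name and interface are illustrative rather than part of either package.

```r
# Hedged helper computing the evaluation measures from an estimated
# coefficient vector beta_hat and the true coefficient vector beta_true.
eval_measures <- function(beta_hat, beta_true) {
  important   <- which(beta_true != 0)   # the P real important predictors
  unimportant <- which(beta_true == 0)   # the N real unimportant predictors
  selected    <- which(beta_hat  != 0)   # predictors kept in the fitted model
  TP <- sum(selected %in% important)
  FP <- sum(selected %in% unimportant)
  TN <- length(unimportant) - FP
  TPR      <- TP / length(important)                  # Equation (15)
  FPR      <- FP / length(unimportant)                # Equation (16)
  BIAS     <- sum(abs(beta_hat - beta_true))          # sum of absolute errors
  BIAS_adj <- BIAS / length(important)
  BCR      <- (TP + TN) / length(beta_true)           # Equation (17)
  c(TPR = TPR, FPR = FPR, BIAS = BIAS, BIAS_adj = BIAS_adj, BCR = BCR)
}
```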

Simulation design
To compare the performance of the IRLS and Glmnet regularized logistic regression methods, we generated data according to the following three cases regarding the number of sample observations (n) and the number of predictors (p): Case I (n > p), Case II (n = p) and Case III (n < p). In particular, we generated Gaussian data with n sample observations and p predictors, where:
1. $x_1, \ldots, x_p$ are jointly Gaussian, marginally distributed N(0, 1) random variables, with population correlation $\rho = \mathrm{cor}(x_i, x_j) = 0.0$, 0.1, 0.2, 0.5, 0.7, 0.9, 0.95 and 0.99, if $i \neq j$;
2. $x_1, \ldots, x_p$ are jointly Gaussian, marginally distributed N(0, 1), with a separate correlation between the first two predictors $x_1$, $x_2$ and each remaining predictor $x_i$, $i \notin \{1, 2\}$, and $\rho = \mathrm{cor}(x_i, x_j) = \tfrac{1}{2}$ if i and j are distinct elements of $\{1, 2, \ldots, p\} \setminus \{1, 2\}$ (denoted by "Composite structure" or "CC" hereafter).
Table 1 presents the three n, p settings along with the correlation structures that were taken into account for each of the three cases presented above.
The true model coefficients that were taken into consideration in the simulation experiments are presented in Table 2, where q is the number of important predictors. Nowadays, there are occasions in practice in which the datasets have several hundreds of predictors, e.g., in genomics, and the experimenters are keen on identifying a very small set of the truly important ones to be included in the final model. This is because sparser models are more desirable and easier to interpret. For that reason, we decided to evaluate the performance of the IRLS and Glmnet methods on high dimensional design matrices with different correlation structures, as those presented in Table 1, and set q equal to 4, 6 and 8. Note that the number of real unimportant predictors is p − q and the corresponding coefficient values were set equal to 0. It should also be noted that, in our simulation design, P of Equation (15) is equal to the number q of important predictors in the model, and N of Equation (16) is equal to p − q.
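The data-generating scheme for the equicorrelated structure (structure 1 above) can be sketched in R as follows; the composite structure would only change the construction of Sigma. The coefficient values used here are illustrative placeholders, not the values of Table 2.

```r
library(MASS)

# Hedged sketch of the data generation: p jointly Gaussian N(0, 1) predictors
# with equicorrelation rho, and a binary response drawn from the logistic
# model with q important (nonzero) coefficients.
gen_data <- function(n, p, rho, beta_true, beta0 = 0) {
  Sigma <- matrix(rho, p, p)
  diag(Sigma) <- 1                                   # equicorrelated structure
  X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  prob <- 1 / (1 + exp(-drop(beta0 + X %*% beta_true)))
  y <- rbinom(n, 1, prob)
  list(X = X, y = y)
}

# Example: a Case I design (n = 1000, p = 100) with rho = 0.5 and q = 4
# important predictors (illustrative coefficient values, not Table 2).
beta_true <- c(3, 1.5, -2, 1, rep(0, 96))
dat <- gen_data(1000, 100, 0.5, beta_true)
```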
The following simulation procedure explains how we computed the performance measures BIAS, $\mathrm{BIAS}_{Adj}$, TPR, FPR and BCR from the coefficient estimates of the selected model for the IRLS and Glmnet methods.
1. For each of Cases I-III, simulate a dataset consisting of n observations and p predictors, based on the n, p settings and correlation structures of Table 1. All predictors are normally distributed with mean 0 and standard deviation 1. In addition, consider the model coefficients $\beta_1$ and $\beta_2$ of Table 2, with q important predictors among the p predictors of the design matrix.
We illustrate the behavior of the $L_1$ regularization (IRLS) and Glmnet approaches presented in this paper for solving the logistic regression model; a sketch of one simulation repetition is given below. The experiments using the $L_1$ regularization approach were performed with Matlab R2018a. Furthermore, under each scenario, we ran 200 simulations in parallel on a Linux cluster at the University of Littoral Opal Coast that has 32 cores and 128 GB of shared memory. The Glmnet simulations were conducted using the Glmnet package (Friedman, Hastie and Tibshirani 2010) in the R statistical software. In the tables provided in the supplementary material, we report the performance measures (BIAS, $\mathrm{BIAS}_{Adj}$, TPR, FPR and BCR) and the elapsed time of 200 repetitions of the IRLS and Glmnet simulation schemes for Cases I-III under the varying n, p settings and $\rho$ correlation structures.
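Putting the pieces together, one repetition of the Glmnet simulation scheme might look as follows, a sketch assuming the gen_data() and eval_measures() helpers defined earlier; the study itself uses 200 repetitions per scenario and additionally runs the corresponding IRLS scheme in MATLAB.

```r
library(glmnet)

# One repetition of the Glmnet scheme for a given scenario (hedged sketch;
# assumes gen_data() and eval_measures() from the sketches above).
run_once <- function(n, p, rho, beta_true) {
  dat <- gen_data(n, p, rho, beta_true)
  cvfit <- cv.glmnet(dat$X, dat$y, family = "binomial", alpha = 1)
  beta_hat <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]
  eval_measures(beta_hat, beta_true)
}

# Average the measures over repetitions (200 in the study; fewer here to keep
# the illustration fast), using the illustrative beta_true defined earlier.
res <- replicate(20, run_once(1000, 100, 0.5, beta_true))
rowMeans(res)
```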

Simulation results
Through our simulation study, we assessed the performance of two $L_1$ regularized logistic regression methods, namely IRLS and Glmnet, by investigating various evaluation measures (BIAS, $\mathrm{BIAS}_{Adj}$, TPR, FPR and BCR). We also examined the elapsed time ("Time", in minutes) of 200 simulation repetitions, taken for each of the two methods under the three varying n, p settings and $\rho$ correlation structures of Cases I-III (see Table 1). In this section, we provide a brief discussion of the results obtained from the simulation experiments for each of Cases I to III.

Case I: n > p
From Tables 3-13 in the supplementary material we observe similar performance of the evaluation measures for all numbers of important predictors, q = 4, 6 and 8. Therefore, we present the results where q is set equal to 4 for both model coefficient vectors $\beta_1$ and $\beta_2$.
In Figures 1 and 2, we present the performance of the IRLS (blue columns) and Glmnet (red columns) methods, considering the mean of the performance measures along with the elapsed time of 200 simulations under Case I. In these figures, Design "1" represents a matrix with n = 1000 and p = 100, "2" a matrix with n = 5000 and p = 100, "3" a matrix with n = 5000 and p = 400 and "4" a matrix with n = 10000 and p = 400, with population correlation $\rho$ equal to 0.0, 0.1, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99 and "CC" (see Case I of Table 1).
We observe that Glmnet gives smaller values for both BIAS and $\mathrm{BIAS}_{Adj}$ for the majority of the scenarios of Case I that were taken into consideration. Both the IRLS and Glmnet methods produce TPR close to 1 for low to medium correlation structures. Furthermore, for high correlations, i.e., $\rho > 0.9$, the IRLS method seems to have higher TPR values, whereas for the composite correlation structure ($\rho$ = "CC") Glmnet is better than IRLS. Considering the FPR measure, Glmnet is closer to 0 in most of the evaluated scenarios, except for some highly correlated designs that produce values between 0.05 and 0.10. As for the IRLS method, the FPR measure ranges from 0.05 to 0.45, which is quite high. Concerning BCR, Glmnet outperforms IRLS in most of the simulation results. However, in the case where $\beta_1$ is used and $\rho = 0.99$, IRLS is better, while in the case of $\beta_2$ and $\rho = 0.99$, Glmnet seems to have slightly better performance. Moreover, as the correlation $\rho$ increases, both methods fail to achieve 100% BCR. Finally, the IRLS simulation scheme seems to be faster, for most of the correlation values $\rho$, when the design matrix used has size (n × p) 1000 × 100 or 5000 × 100, whereas the Glmnet scheme is faster for the remaining examined designs.

Case II: n = p
From Tables 14-24 in the supplementary material we observe similar performance of the measures for all numbers of important predictors q under consideration. Therefore, we present the results where q is equal to 6 for the given coefficient vectors $\beta_1$ and $\beta_2$.
In Figures 3 and 4, we present the performance of the IRLS (blue columns) and Glmnet (red columns) methods, considering the mean of the performance measures and the elapsed time of 200 simulations under Case II. In these figures, Design "1" represents a matrix with n = 100 and p = 100, "2" a matrix with n = 400 and p = 400 and "3" a matrix with n = 1000 and p = 1000, with population correlation $\rho$ equal to 0.0, 0.1, 0.2, 0.5 and "CC" (see Case II of Table 1). The cases where there are no blue IRLS columns in these figures, and where there are empty cells in Tables 14-24 in the supplementary material, correspond to scenarios in which the fitted models did not converge using the IRLS method. Unfortunately, no clear pattern of the non-convergence of IRLS was observed in Case II, which would have helped us form a better view of the situation. For example, IRLS could not converge when: (i) $\beta_1$ is used along with the designs with $\rho$ = "CC", and (ii) $\beta_2$ is used along with the designs with $\rho$ = 0.5.
We observe that Glmnet produces smaller BIAS and $\mathrm{BIAS}_{Adj}$ values for all the examined designs and correlation structures. Moreover, IRLS has TPR values closer to 1 for the majority of the examined scenarios, whereas Glmnet produces lower TPR values when a 100 × 100 design is used, for all the considered correlations of Case II. Furthermore, TPR decreases dramatically for both methods, and especially for Glmnet, when composite correlation structured ($\rho$ = "CC") designs are used. As for the FPR measure, Glmnet has better performance than the IRLS method. It should be noted that the FPR of the IRLS method is near 0.70, which is quite high. Regarding BCR, Glmnet is quite efficient, with values ranging from 0.80 to 1, while IRLS produces BCR values between 0.45 and 0.70. Finally, it is not clear which simulation scheme is faster, due to the non-convergence of IRLS. When $\beta_1$ and the $\rho$ = 0.0 and 0.5 correlated designs are used, the IRLS simulation scheme is faster, whereas, when the $\rho$ = 0.1 and 0.2 correlated designs are taken into consideration, the Glmnet scheme is faster. In the case of $\beta_2$, the IRLS simulation scheme seems to be faster for the low correlated structures ($\rho$ = 0.0, 0.1 and 0.2) and as the size of the design matrix increases.

Figure 1. Simulation comparisons of BIAS, adjusted BIAS, TPR, FPR, BCR and elapsed time (in minutes) of the IRLS (blue columns) and Glmnet (red columns) methods for Case I (n > p), with model coefficients $\beta_1$ and q = 4 important predictors. Design "1": n = 1000, p = 100; "2": n = 5000, p = 100; "3": n = 5000, p = 400; "4": n = 10000, p = 400; population correlation $\rho$ = 0.0, 0.1, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99 and "CC" (Composite Correlation).

Figure 2. As Figure 1, but for the model coefficients $\beta_2$ (Case I, q = 4).

Case III: n < p
In Figures 5 and 6, we present the performance of the IRLS (blue columns) and Glmnet (red columns) methods, considering the mean of the performance measures and the elapsed time of 200 simulations under Case III. In these figures, Design "1" represents a matrix with n = 400 and p = 5000 and "2" a matrix with n = 400 and p = 10000, with population correlation $\rho$ equal to 0.0, 0.1, 0.2, 0.5 and 0.7 (see Case III of Table 1). The cases where there are no blue IRLS columns in Figures 5 and 6, and where there are empty cells in Tables 25-35 in the supplementary material, correspond to scenarios in which the fitted models did not converge using the IRLS method. Unfortunately, no clear pattern of the non-convergence of IRLS was observed in Case III either, which would have given a better view of the circumstances. For example, IRLS could not converge when: (i) $\beta_1$ is used along with the $\rho$ = 0.1 and 0.7 correlated designs, and (ii) $\beta_2$ is used along with the $\rho$ = 0.0 correlated design.

Figure 3. Simulation comparisons of BIAS, adjusted BIAS, TPR, FPR, BCR and elapsed time (in minutes) of the IRLS (blue columns) and Glmnet (red columns) methods for Case II (n = p), with model coefficients $\beta_1$ and q = 6 important predictors. Design "1": n = 100, p = 100; "2": n = 400, p = 400; "3": n = 1000, p = 1000; population correlation $\rho$ = 0.0, 0.1, 0.2, 0.5 and "CC" (Composite Correlation). There are no blue columns where IRLS fails to converge (see Tables 14-24 in the supplementary material).
We observe that Glmnet produces smaller BIAS and $\mathrm{BIAS}_{Adj}$ values for all the examined scenarios compared with IRLS. Moreover, Glmnet has TPR $\approx$ 1 when $\beta_1$ is used, and almost equal to 0.80 when $\beta_2$ and the low correlated ($\rho$ = 0.0, 0.1 and 0.2) structure designs are used. As the correlation $\rho$ increases, the TPR of Glmnet decreases dramatically, whether $\beta_1$ or $\beta_2$ is used. From the IRLS simulation results, we notice that, when the low correlated designs and the model coefficients $\beta_1$ are examined, the performance is similar to that of Glmnet, whereas IRLS performs better as $\rho$ increases. Similarly, when the coefficients $\beta_2$ are used, the TPR of IRLS is better as $\rho$ increases. Concerning the FPR measure, Glmnet is superior to IRLS. It should be noted that the FPR values of the IRLS method range from 0.4 to 0.7, which is quite high. Furthermore, the Glmnet simulation scheme has better BCR performance for the low correlated ($\rho$ = 0.0, 0.1 and 0.2) designs. However, we notice that, according to the results in the supplementary material, the BCR of Glmnet deteriorates when highly correlated designs are considered; as the correlation $\rho$ increases, IRLS seems to produce better BCR values. Finally, from a general point of view of Case III, the Glmnet simulation scheme is faster. However, due to the non-convergence of IRLS for many of the examined scenarios of this case, we cannot conclude with certainty which scheme is faster.

Figure 4. As Figure 3, but for the model coefficients $\beta_2$ (Case II, q = 6).

Figure 6. Simulation comparisons of (a) Bias, (b) Adjusted Bias, (c) TPR, (d) FPR, (e) BCR and (f) Elapsed Time (in minutes) of the IRLS (blue columns) and Glmnet (red columns) methods for Case III (n < p), with model coefficients $\beta_2$ and q = 8 important predictors. Design "1": n = 400, p = 5000; "2": n = 400, p = 10000; population correlation $\rho$ = 0.0, 0.1, 0.2, 0.5 and 0.7. There are no blue columns where IRLS fails to converge (see Tables 25-35 in the supplementary material).

Concluding remarks
In this paper, we investigated the performance of some existing $L_1$ regularized logistic regression methods, i.e., IRLS and Glmnet, in a high-dimensional setting. Particularly, we considered three varying n (number of observations), p (number of predictors) settings, i.e., n > p (Case I), n = p (Case II) and n < p (Case III), and we supported this comparison analysis by conducting various simulated experiments. Through our simulation study, we made some very interesting observations in terms of variable selection and prediction accuracy. Concerning variable selection, both methods manage to correctly identify the true model for designs with small to medium correlation, whereas for highly correlated scenarios the IRLS method seems to perform better in this respect. Moreover, regarding the false selection of unimportant predictors, the Glmnet method outperforms IRLS for all the examined scenarios. As for prediction accuracy, Glmnet is better at estimating the model coefficients and achieves higher classification rates for the majority of the examined scenarios. However, for the highly correlated structures of Cases II and III, the IRLS method has a higher balanced classification rate. Unfortunately, there were scenarios in Cases II and III where the IRLS method could not converge. This issue is a major limitation of the method that could be further investigated; to overcome the convergence and stability issues in the IRLS algorithm, the reader is directed to, for example, Durif et al. (2018). As noted above, this study focused on the comparison of two well-known $L_1$ regularized logistic regression methods for high-dimensional problems. For future work, we are interested in expanding this comparison with other types of regularized logistic regression methods that enforce more sparsity.