Variable Selection Via Subtle Uprooting

This article proposes a variable selection method termed “subtle uprooting” for linear regression. In this proposal, variable selection is formulated into a single optimization problem by approximating cardinality involved in the information criterion with a smooth function. A technical maneuver is then employed to enforce sparsity of parameter estimates while maintaining smoothness of the objective function. To solve the resulting smooth nonconvex optimization problem, a modified Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm with established global and super-linear convergence is adopted. Both simulated experiments and an empirical example are provided for assessment and illustration. Supplementary materials for this article are available online.


INTRODUCTION
With training data $\{(y_i, x_i) : i = 1, \dots, n\}$ that consist of $n$ iid copies of a continuous response $Y$ and $p$ predictors $X_1, \dots, X_p$, we consider variable selection in the Gaussian linear regression model
$$y_i = x_i^\top \beta + \varepsilon_i, \tag{1}$$
where $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$ for $i = 1, \dots, n$. Without loss of generality, we assume the $y_i$'s are centered and the $x_{ij}$'s are standardized, so that the intercept term can be omitted from (1). The true $\beta$ is often sparse in the sense that some of its components are zero.
One standard method is best subset selection. In this approach, an information criterion such as the Akaike information criterion (AIC; Akaike 1974) or the Bayesian information criterion (BIC; Schwarz 1978) is employed to compare and select models. The approach can be formulated as
$$\min_{\beta, \sigma^2} \; -2 \cdot L(\beta, \sigma^2) + \lambda_0 \cdot \|\beta\|_0, \tag{2}$$
where $L(\beta, \sigma^2)$ denotes the log-likelihood function of the model; the penalty parameter $\lambda_0$ is fixed as 2 in AIC or $\log(n)$ in BIC; and $\|\beta\|_0 = \mathrm{card}(\beta)$ is the cardinality of $\beta$, that is, the number of its nonzero components. Plugging in the estimate of $\sigma^2$, $\hat\sigma^2 = \|y - X\beta\|^2/n$, $L(\beta, \sigma^2)$ can be written simply as $L(\beta)$. Thus, up to a constant, (2) becomes
$$\min_{\beta \in B} \; n \log \|y - X\beta\|^2 + \lambda_0 \cdot \|\beta\|_0, \tag{3}$$
where $B$ denotes the parameter set. Owing to the discrete and nonconvex nature of cardinality, optimization of (3), known to be an NP-hard problem (Natarajan 1995), proceeds in two steps: first maximize the log-likelihood for every model choice with known sparsity in $\beta$, and then compare the resulting information criterion $-2 \cdot L(\hat\beta) + \lambda_0 \cdot \|\hat\beta\|_0$ across all model choices, where $\hat\beta$ denotes the maximum likelihood estimate (MLE) of $\beta$ and $L(\hat\beta)$ is the maximized log-likelihood. Although faster algorithms for optimizing (3) are available by resorting to combinatorial optimization techniques such as the branch-and-bound method (Furnival and Wilson 1974), best subset selection is not feasible for even moderately large $p$. Both ridge regression (Hoerl and Kennard 1970) and the least absolute shrinkage and selection operator (LASSO; Tibshirani 1996) can be viewed as convex relaxations of the optimization problem in (3). They specifically solve
$$\min_{\beta \in B} \; n \log \|y - X\beta\|^2 + \lambda \cdot \|\beta\|_r^r, \tag{4}$$
where $\|\beta\|_r^r = \sum_{j=1}^{p} |\beta_j|^r$ for some $r > 0$. One key difference from (3) is that the penalty parameter $\lambda \ge 0$ is no longer fixed. With tunable $\lambda$, (4) is equivalent to the penalized least-squares problem
$$\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|^2 + \lambda \cdot \|\beta\|_r^r.$$
To encourage convexity, typically $r \ge 1$ is used; for example, $r = 1$ in LASSO (or $\ell_1$ regularization) and $r = 2$ in ridge regression. While the ridge estimator is explicitly available, LASSO enjoys the attractive property that it simultaneously enforces sparsity. Since the proposal of LASSO, a vast literature has been devoted to $\ell_1$ regularization, and numerous variants have been developed for enhancement and expansion. Efficient algorithms, such as the homotopy (Osborne, Presnell, and Turlach 2000) or Least Angle Regression (LARS) algorithm (Efron et al. 2004) and coordinate descent (Fu 1998; Friedman et al. 2007; Breheny and Huang 2011), are available, and its statistical properties have been explored in depth. However, $\ell_1$ regularization may not provide satisfactory variable selection, especially when predictors are highly correlated. Nonconvex penalties such as the smoothly clipped absolute deviation (SCAD; Fan and Li 2001) and the minimax concave penalty (MCP; Zhang 2010) have been proposed for improvement. An up-to-date literature survey, which is not attempted here, can be found in Zhang (2010) and Mazumder, Friedman, and Hastie (2011) and the references therein. However, regularization loses track of $\lambda_0$, and the choice of $\lambda$ has to be tuned. As a result, the common practice of regularization involves two steps as well: first compute the whole regularization path, that is, the solutions of $\beta$ for every $\lambda \ge 0$; then select the best $\lambda$ according to a model selection criterion, often aided by cross-validation. While tuning has been the convention in regularization methods, it is worth noting that the model selection criteria themselves have a penalized form similar to regularization. It becomes apparent that there is duplicated effort in the current practice of regularization. Moreover, it is noteworthy that the regularization path is merely a one-dimensional curve in the $p$-dimensional parameter space $B$.
Thus, these regularization methods essentially seek to optimize AIC or BIC over the reduced search space.
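The two-step best subset procedure described above can be sketched in a few lines. The following is an illustrative implementation on hypothetical data (the function name, seed, and coefficients are our own choices, not from the article):

```python
# Brute-force best subset selection with BIC: enumerate all 2^p submodels,
# fit each by least squares, and keep the support minimizing the criterion.
import itertools
import numpy as np

def best_subset_bic(X, y):
    """Return the support (set of column indices) minimizing BIC."""
    n, p = X.shape
    best_support, best_bic = (), np.inf
    for k in range(p + 1):
        for support in itertools.combinations(range(p), k):
            if support:
                Xs = X[:, support]
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = np.sum((y - Xs @ beta) ** 2)
            else:
                rss = np.sum(y ** 2)  # null model (intercept omitted, as in the centered setup)
            # BIC = n * log(RSS / n) + log(n) * card(beta), up to a constant
            bic = n * np.log(rss / n) + np.log(n) * k
            if bic < best_bic:
                best_support, best_bic = support, bic
    return set(best_support)

rng = np.random.default_rng(0)
beta_true = np.array([2.0, 0.0, 1.5, 0.0])
X = rng.standard_normal((100, 4))
y = X @ beta_true + 0.5 * rng.standard_normal(100)
selected = best_subset_bic(X, y)
```

The exhaustive loop over `itertools.combinations` makes the $2^p$ cost of the approach explicit, which is why it becomes infeasible for even moderately large $p$.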
In this article, we reformulate variable selection into one single nonlinear programming problem by making a closer connection between regularization and model selection criteria.
The key idea is to approximate the cardinality $\|\beta\|_0$ with a smooth function, which results in a smoothed version of the information criterion for optimization. This conceptually simplifies the formulation of the variable selection problem and greatly reduces the computational burden. The proposed method can be viewed as a natural extension of the classical maximum likelihood principle. It brings model selection and parameter estimation under the common framework of optimization and accomplishes both in one step. One immediate advantage of this approach is that it avoids the two-step procedures involved in either best subset selection or regularization. In addition, we devise a method, termed "uprooting," to enforce sparsity in parameter estimates while maintaining smoothness of the objective function. The remainder of the article is organized as follows. In Section 2, the subtle uprooting method is introduced in detail. Section 3 addresses algorithmic issues. Section 4 establishes selection consistency. Section 5 presents simulation results and an illustrative example with real data. Section 6 ends the article with a brief discussion.

SUBTLE UPROOTING
We wish to borrow strength from both regularization and information criteria. This entails an extended wish list for the desired approach. From the perspective of $\ell_1$ regularization, a desirable penalty function is expected to help achieve three key goals as spelled out by Fan and Li (2001): unbiasedness in estimating nonzero parameters, sparsity in terms of enforcing zero estimates, and continuity in terms of the model spectrum under consideration. At the same time, for computational reasons, we hope to have $\lambda_0$ fixed at the known choices supplied by information criteria. In addition, we wish the penalty function to possess a certain level of smoothness. This would allow us to capitalize further on established results from both optimization theory and statistical estimation theory.
To tackle these issues, a method termed "subtle uprooting" is proposed. In its final form, subtle uprooting simply solves
$$\min_{\beta} \; n \log \|y - X\tilde\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w(\beta_j), \tag{6}$$
where $w(\beta_j) = \tanh(a \cdot \beta_j^2)$ and $\tilde\beta_j = \beta_j \cdot w(\beta_j)$, for $a > 0$ and $j = 1, \dots, p$. In matrix form, (6) can be rewritten as
$$\min_{\beta} \; n \log \|y - XW\beta\|^2 + \lambda_0 \cdot \mathrm{tr}(W),$$
where $W = \mathrm{diag}(w_j)$ with $w_j = w(\beta_j)$. Again, $\lambda_0$ is fixed according to an information criterion. We recommend setting $a = 50$; this choice will be discussed further. We call this method "subtle uprooting" because variables are uprooted, or virtually unselected, in a subtle way. The method essentially involves two steps: a smooth surrogate function is first applied to approximate cardinality, followed by an additional technical maneuver that enhances sparsity while retaining differentiability of the objective function.
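The matrix-form objective can be written directly in code. The following is an illustrative sketch (the function name and toy data are our own, not from the article):

```python
# The subtle-uprooting objective: n * log||y - X W beta||^2 + lam0 * tr(W),
# with W = diag(tanh(a * beta_j^2)) acting as a smooth surrogate for cardinality.
import numpy as np

def uprooting_objective(beta, X, y, lam0, a=50.0):
    w = np.tanh(a * np.asarray(beta) ** 2)  # w_j approximates 1{beta_j != 0}
    resid = y - X @ (w * beta)              # X W beta
    return len(y) * np.log(np.sum(resid ** 2)) + lam0 * np.sum(w)  # tr(W) = sum of w_j

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, 1.0, 0.0]) + rng.standard_normal(50)
lam0 = np.log(50)  # BIC choice of the penalty parameter
```

With $\lambda_0 = \log n$, this is exactly a smoothed BIC: for coefficients of appreciable size, $w_j \approx 1$ and the penalty counts them; at zero, $w_j = 0$ and they are free.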

THE SURROGATE FUNCTION ω(·)
First, we replace the cardinality in (3) with a continuous or smooth surrogate penalty function $w(\cdot)$. This turns the discrete optimization problem into a continuous one. To preserve $\lambda_0$, a suitable $w(\cdot): \bar{\mathbb{R}} \to [0, 1]$ must be an even function ranging from 0 to 1 and satisfying $w(0) = 0$ and $w(\beta) = 1$ when $\beta$ is away from 0. Here, $\bar{\mathbb{R}}$ denotes the set of extended real numbers, which adjoins $+\infty$ and $-\infty$ to $\mathbb{R}$. The $[0, 1]$ range requirement essentially makes $w(\cdot)$ nonconvex, but this is necessary for $w(\beta)$ to function similarly to cardinality, namely, $w(\beta) \approx 1\{\beta \neq 0\}$. Several choices of $w(\cdot)$ are listed below: 1. The truncated $\ell_r$ penalty, given by $w(\beta) = \min(a \cdot |\beta|^r, 1)$, for $r > 0$ and $a > 0$. Typically, $r = 0.5, 1, 2$ can be considered. The magnitude of $a$ controls the sharpness of the approximation. The corresponding penalty functions are continuous but not smooth.
2. Several nonconvex penalties studied in the literature, including SCAD (Fan and Li 2001) and MCP (Zhang 2010). Both are even functions. The SCAD penalty is given, for $\beta > 0$, by
$$w_{a,b}(\beta) = \begin{cases} a\beta, & \beta \le a, \\ \dfrac{2ab\beta - \beta^2 - a^2}{2(b-1)}, & a < \beta \le ab, \\ \dfrac{a^2(b+1)}{2}, & \beta > ab, \end{cases}$$
with first derivative $\dot{w}_{a,b}(\beta) = a\{I(\beta \le a) + \frac{(ab - \beta)_+}{a(b-1)} I(\beta > a)\}$, for $a \ge 0$ and $b > 2$. It corresponds to a quadratic spline with knots at $a$ and $ab$. For $\beta > 0$, the MCP penalty is given by
$$w_{a,b}(\beta) = \begin{cases} a\beta - \dfrac{\beta^2}{2b}, & \beta \le ab, \\ \dfrac{a^2 b}{2}, & \beta > ab, \end{cases}$$
with first derivative $\dot{w}_{a,b}(\beta) = (a - \beta/b)\, I(\beta \le ab)$, for $a \ge 0$ and $b > 1$. Both the SCAD and MCP penalties are smooth in $\beta \ge 0$ with a singularity at $\beta = 0$. Since they converge to constants as $\beta$ gets large, they can be modified to satisfy the condition $\lim_{\beta \to \infty} w(\beta) = 1$: set $b := (2 - a^2)/a^2$ in SCAD or $b := 2/a^2$ in MCP. The range of $a$ then needs further modification owing to the constraints on $b$.
[Figure 1. Illustration of subtle uprooting: (a) the hyperbolic tangent $w(\beta) = \tanh(a \cdot \beta^2)$ as a surrogate penalty function for cardinality; (b) the optimization problem in the two-dimensional case: minimizing $\|X(\beta - \hat\beta)\|^2$ subject to $w(\beta_1) + w(\beta_2) \le t_0$; (c) $\tilde\beta$ versus $\beta$; (d) $w(\beta)$ versus $\tilde\beta$, where $\tilde\beta = \beta\, w(\beta)$.]
3. The hyperbolic tangent penalty $w(\beta) = \tanh(a \cdot |\beta|^r)$ for some constant $r > 0$ and steepness factor $a > 0$. When $r = 1$, $w(\beta)$ is nonsmooth and singular at 0. When $r = 2$, $w(\beta)$ is smooth. Figure 1(a) plots the hyperbolic tangent penalty for different choices of $a$ when $r = 2$, showing a sigmoid shape on $\mathbb{R}_+$. It can be seen that a larger $a$ provides a sharper discrimination between 0 and values close to 0.
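The surrogate choices above can be compared side by side. The sketch below assumes the standard piecewise SCAD and MCP forms, with $b$ set as described so that each penalty saturates at 1; the function names are illustrative:

```python
# Surrogate penalty functions for cardinality, each mapping R to [0, 1]:
# w(0) = 0 and w(beta) -> 1 as |beta| grows.
import numpy as np

def w_truncated(beta, a=50.0, r=2.0):
    return np.minimum(a * np.abs(beta) ** r, 1.0)

def w_tanh(beta, a=50.0, r=2.0):
    return np.tanh(a * np.abs(beta) ** r)

def w_scad(beta, a=0.5):
    b = (2 - a ** 2) / a ** 2            # forces the plateau a^2 (b + 1) / 2 = 1
    x = np.abs(beta)
    return np.where(x <= a, a * x,
           np.where(x <= a * b, (2 * a * b * x - x ** 2 - a ** 2) / (2 * (b - 1)),
                    a ** 2 * (b + 1) / 2))

def w_mcp(beta, a=0.5):
    b = 2 / a ** 2                       # forces the plateau a^2 b / 2 = 1
    x = np.abs(beta)
    return np.where(x <= a * b, a * x - x ** 2 / (2 * b), a ** 2 * b / 2)
```

All four vanish at 0 and plateau at 1; the tanh version with $r = 2$ is the only one that is smooth everywhere, and a larger $a$ sharpens its discrimination near 0.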
With a surrogate function $w(\cdot)$, we seek to solve
$$\min_{\beta \in B} \; n \log \|y - X\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w(\beta_j). \tag{12}$$
This optimization problem is equivalent to minimizing $\|y - X\beta\|^2$, or $(\beta - \hat\beta)^\top X^\top X (\beta - \hat\beta)$, subject to $\sum_j w(\beta_j) \le t_0$ for some $t_0 \ge 0$, where $\hat\beta = (\hat\beta_j)$ is the MLE or least-squares estimate (LSE) of $\beta$. A two-dimensional illustration is given in Figure 1(b) for the tanh penalty function with $r = 2$. Its contour plot (if $t_0$ is adjustable) starts with circles for small $t_0$ values and then becomes sharpened diamonds as $t_0$ increases. To gain further insight, consider the simplified case with orthogonal design $X^\top X = I$. Equation (12) then reduces to the componentwise optimization problem
$$\min_{\beta_j} \; (\beta_j - \hat\beta_j)^2 + \ell_0 \cdot w(\beta_j) \tag{13}$$
for some $\ell_0 \ge 0$. Although there are no simple formulas for the connections among $\{\lambda_0, t_0, \ell_0\}$, implying that the values of $t_0$ and $\ell_0$, albeit fixed, cannot be explicitly determined, the association between these optimization problems can be analytically useful. Let $\breve\beta$ denote the subtle uprooting estimate. It can be seen from (13) that if $\dot{w}(\breve\beta_j) = 0$, which occurs for a large $\breve\beta_j$ with $w(\breve\beta_j) = 1$, then $\breve\beta_j = \hat\beta_j$. To force sparsity, various regularization methods resort to estimates of thresholding form, which entails a nonsmooth penalty function $w(\beta)$ with a singularity at 0. In this article, we advocate the use of smooth penalty functions, as given by the hyperbolic tangent function with $r = 2$. The primary reason is that the proposed method can then be viewed as a natural extension of MLE. Since most likelihood or log-likelihood functions are smooth, we do not want to alter this nature. Furthermore, the smoothness property allows us to capitalize on many well-developed theories and methods in both optimization and statistical estimation. As shown in Figure 1(b), sparsity could be forced if $t_0$ (or $\lambda_0$) were treated as an adjustable tuning parameter, but that would substantially intensify the computational burden. Our approach is instead to keep $\lambda_0$ fixed, as suggested by the information criteria, and achieve sparsity of $\breve\beta$ in a different way.
While other smooth functions could be used as well, the derivatives of $\tanh(\cdot)$ are readily available, and the $\tanh(\cdot)$ function is connected to the logistic, or expit, function widely used in statistics. Some properties of the hyperbolic tangent penalty $w(\beta)$ are summarized in Proposition 1; the proof is omitted.

UPROOTING
This problem motivates us to enhance the shrinkage around the neighborhood of 0 by having a nonconvex penalty with a singularity at 0, as SCAD or MCP does. To do so, a step called "uprooting" is proposed, based on the fact that $\beta = \beta \cdot 1\{\beta \neq 0\} \approx \beta \cdot w(\beta)$. Denote $\tilde\beta = \beta \cdot w(\beta)$. As shown in Figure 1(c), $\tilde\beta$ and $\beta$ have a one-to-one correspondence, and $\tilde\beta \approx \beta$ everywhere except in a small neighborhood of 0, where the shrinkage in $|\tilde\beta|$ is effective.
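Numerically, the reparameterization behaves exactly as Figure 1(c) suggests; a quick check with illustrative values:

```python
# beta_tilde = beta * w(beta): essentially the identity away from 0,
# but a strong contraction in a small neighborhood of 0.
import numpy as np

a = 50.0
w = lambda b: np.tanh(a * b ** 2)
beta = np.array([1.0, 0.3, 0.05, 0.0])
beta_tilde = beta * w(beta)
```

At $\beta = 1$ or $0.3$ the map is essentially the identity, while $\beta = 0.05$ is shrunk by an order of magnitude and $\beta = 0$ maps exactly to 0.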
We replace $\beta_j$ in Model (1) with $\tilde\beta_j$, so that $\mu_i$ becomes $\mu_i = \sum_{j=1}^{p} x_{ij} \tilde\beta_j = \sum_{j=1}^{p} x_{ij} \beta_j w(\beta_j)$. This can simply be viewed as a reparameterization of the model: the regression coefficient for $X_j$ is no longer $\beta_j$ but $\tilde\beta_j$. The associated optimization problem now becomes
$$\min_{\beta \in B} \; n \log \|y - X\tilde\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w(\beta_j), \tag{15}$$
where the optimization is done with respect to the decision variables $\beta$, instead of $\tilde\beta$. To see why this step helps enforce sparsity, consider what happens if (15) is optimized with respect to $\tilde\beta$:
$$\min_{\tilde\beta} \; n \log \|y - X\tilde\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w(\beta_j), \tag{16}$$
where each $\beta_j$ is implicitly determined by $\tilde\beta_j$ through $\tilde\beta_j = \beta_j w(\beta_j)$. Compared to (12), the only change is the use of the reparameterized $\tilde\beta$ in (16). In other words, (16) would reduce to (12) if $w(\tilde\beta)$, instead of $w(\beta)$, were used as the penalty in (16).
Since $|\beta_j| \ge |\tilde\beta_j|$ and the tanh function is monotonically increasing on $\mathbb{R}_+$, it follows that $w(\beta_j) \ge w(\tilde\beta_j)$. Namely, a larger penalty is applied, especially to values close to 0.
Proposition 2. Concerning $w(\beta)$ as a penalty function of $\tilde\beta$, we have, for $\tilde\beta \neq 0$,
$$\frac{d\, w(\beta)}{d \tilde\beta} = \frac{\dot{w}(\beta)}{w(\beta) + \beta \dot{w}(\beta)}, \qquad \frac{d^2 w(\beta)}{d \tilde\beta^2} = \frac{\ddot{w}(\beta)\{w(\beta) + \beta\dot{w}(\beta)\} - \dot{w}(\beta)\{2\dot{w}(\beta) + \beta\ddot{w}(\beta)\}}{\{w(\beta) + \beta \dot{w}(\beta)\}^3};$$
the first two derivatives do not exist at $\tilde\beta = 0$.
Proof. Since it is difficult to express $\beta$ explicitly in terms of $\tilde\beta$, we apply the chain rule and differentiation of the inverse function:
$$\frac{d\, w(\beta)}{d \tilde\beta} = \dot{w}(\beta) \cdot \frac{d\beta}{d\tilde\beta} = \frac{\dot{w}(\beta)}{d\tilde\beta / d\beta} = \frac{\dot{w}(\beta)}{w(\beta) + \beta \dot{w}(\beta)}.$$
Similarly, for the second derivative,
$$\frac{d^2 w(\beta)}{d \tilde\beta^2} = \frac{d}{d\beta}\left\{\frac{\dot{w}(\beta)}{w(\beta) + \beta\dot{w}(\beta)}\right\} \cdot \frac{d\beta}{d\tilde\beta} = \frac{\ddot{w}\{w + \beta\dot{w}\} - \dot{w}\{2\dot{w} + \beta\ddot{w}\}}{\{w + \beta \dot{w}\}^3}.$$
The validity of these formulas requires $d\tilde\beta / d\beta$ to be nonzero. As a result, both derivatives are expressed as functions of $\beta$ instead and do not exist at $\tilde\beta = 0$. The rest of the proof is omitted.
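Away from zero, the chain-rule expression $dw/d\tilde\beta = \dot{w}/(w + \beta\dot{w})$ can be verified numerically by differencing through the parameterization. An illustrative check (the evaluation point is arbitrary):

```python
# Finite-difference check of the chain-rule derivative of w with respect
# to beta_tilde = beta * w(beta), for w(beta) = tanh(a * beta^2).
import numpy as np

a = 50.0
w  = lambda b: np.tanh(a * b ** 2)
wd = lambda b: 2 * a * b * (1 - np.tanh(a * b ** 2) ** 2)   # w-dot
tilde = lambda b: b * w(b)

beta, h = 0.2, 1e-6
# numeric slope of w with respect to beta_tilde along the curve
num = (w(beta + h) - w(beta - h)) / (tilde(beta + h) - tilde(beta - h))
# analytic slope: w-dot / (w + beta * w-dot)
ana = wd(beta) / (w(beta) + beta * wd(beta))
```

The two quantities agree to several decimal places away from 0, while at $\tilde\beta = 0$ the denominator $w + \beta\dot{w}$ vanishes and no such derivative exists.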
Proposition 2 indicates that $w(\beta)$, as a function of $\tilde\beta$, has a singularity at $\tilde\beta = 0$ after the reparameterization. Figure 1(d) plots $w(\beta)$ versus $\tilde\beta$. Compared to Figure 1(a), the curve now shows a cusp at $\tilde\beta = 0$ and a much steeper rate of change around 0, a pattern similar to the nonconvex SCAD or MCP penalties. Noting that $w + \beta\dot{w} > 0$ in the denominator of the second derivative $d^2 w(\beta)/d\tilde\beta^2$, the convexity of $w(\beta)$ on $\mathbb{R}_+$ (where $|\tilde\beta|$ lies) can be assessed by a detailed look at the numerator, that is, $\ddot{w}(w + \beta\dot{w}) - \dot{w}(2\dot{w} + \beta\ddot{w}) = w\ddot{w} - 2\dot{w}^2$, which is mostly negative. In particular, this is true whenever $\beta > (2\pi - 1)^{-1}$; note that $\pi$ has range $1 \ge \pi \ge 1/2$. This is also true as $\tilde\beta \to 0^+$ since, using Proposition 1(iii), the sign of the second derivative is determined by the dominating term $-8a\beta + O(\beta^2)$.
It is important to emphasize that the decision variables in the optimization problem (15) are $\beta$, not $\tilde\beta$. The objective function remains smooth in $\beta$. This allows us to use well-established optimization algorithms and theoretical results from statistical estimation. The resulting estimates $\breve\beta$ will be illustrated by simulation in Section 5. As we will see, some of the estimates are shrunk to very small values and can be virtually taken as 0 after rounding.

CONNECTION TO NONNEGATIVE GARROTTE

The uprooting step described above can be motivated from a connection with the nonnegative garrotte (NG) method of Breiman (1995). The nonnegative garrotte can be viewed as a sign-constrained regularization method (see, e.g., Meinshausen 2013) based on the decomposition $\beta = \mathrm{sgn}(\beta) \cdot |\beta|$. The sign of each component of $\beta$ is assumed to be correctly specified by an initial estimator, say the LSE $\hat\beta$ or any other consistent estimator (Yuan and Lin 2007). Thus, what remains is the estimation of $|\beta|$. The NG estimator is given as the solution to
$$\min_{c} \; \Big\|y - \sum_{j=1}^{p} c_j \hat\beta_j x_j\Big\|^2 + \lambda \sum_{j=1}^{p} c_j, \quad \text{subject to } c_j \ge 0 \text{ for all } j.$$
If $\mathrm{sgn}(\beta)$ is wrongly specified by the initial estimator, NG has no chance to make a correction. The uprooting method is based on a different decomposition, $\beta = \beta \cdot 1(\beta \neq 0)$. The indicator $1(\beta \neq 0)$ gives the binary selection status of the corresponding predictor, which can be approximated by some measure of the likelihood of the variable being selected. The problem can then be formulated as
$$\min_{\beta, w} \; n \log \|y - XW\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w_j, \quad \text{subject to } 0 \le w_j \le 1,$$
where $w_j$ can be considered a measure of the likelihood of the $j$th variable being selected. Furthermore, the condition $0 \le w_j \le 1$ is enforced by relating $w_j$ to $\beta_j$: with the tanh function, $w_j = w(\beta_j) = \tanh(a \beta_j^2)$. By doing so, $\beta$ becomes the only decision variable in the above optimization problem.

OPTIMIZATION VIA MODIFIED BFGS
Solving (6) is essentially an unconstrained smooth nonconvex optimization problem. While many optimization methods are applicable, we adopted the modified Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm proposed by Li and Fukushima (2001), who extended the popular BFGS quasi-Newton method for nonconvex optimization problems.
Denote the objective function in subtle uprooting as $Q(\beta) = n \log \|y - XW\beta\|^2 + \lambda_0 \cdot \mathrm{tr}(W)$. Given a current estimate $\beta_k$ at the $k$th step, let $g_k$ and $H_k$ denote the gradient vector $g$ and Hessian matrix $H$ of $Q(\beta)$ evaluated at $\beta_k$. In Newton's method, the direction $d_k$ is determined by solving $H_k d_k = -g_k$. The gradient $g$ can also be numerically approximated via finite differences. The form of the Hessian matrix $H$ is more involved. Quasi-Newton methods seek an approximation of $H$ at each step; in particular, the BFGS algorithm computes an approximate Hessian $B_{k+1}$ by adding a rank-two update matrix to its predecessor $B_k$. Nevertheless, owing to the nonconvex nature of the problem, $H_k$ may not be positive definite. As a result, there is no guarantee that the BFGS update preserves positive definiteness ($B_{k+1} \succ 0$ given $B_k \succ 0$), nor that $d_k$ is necessarily a descent direction. The modified BFGS algorithm of Li and Fukushima (2001), presented in Algorithm 1, is intended to address these issues. We refer interested readers to Li and Fukushima (2001) for a detailed description.
Algorithm 1 Modified BFGS for Solving Subtle Uprooting
Set constants $0 < c_0 < 0.5$ and $0 < s_0 < 1$; set the maximum number of runs $K$ and tolerance $\varepsilon > 0$; initialize $\beta_0$, $B_0 \succ 0$, and $k = 0$.
while $k \le K$ and $\|g_k\| > \varepsilon$ do:
  obtain $d_k$ by solving $B_k d + g_k = 0$, done via the Cholesky decomposition of $B_k$;
  set $i = 0$;
  repeat $s_k := s_0^i$ and $i := i + 1$ until the step size $s_k$ is accepted by the line search.

The next proposition explores the asymptotic properties of the modified BFGS algorithm. Proposition 3(i) establishes its global convergence: given any initial value $\beta_0$, the algorithm converges to a critical point of $Q(\beta)$. Proposition 3(ii) shows that, if the algorithm converges to a local minimum point, then it converges at least at a superlinear rate. The proof is omitted, as the results follow immediately from Theorems 5.1 and 5.2 of Li and Fukushima (2001) upon checking the conditions.
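As a sketch of the overall optimization, a generic quasi-Newton routine can stand in for Algorithm 1; here we use SciPy's standard BFGS (not the Li and Fukushima modification) and initialize at the LSE for simplicity. Data, seed, and tolerances are illustrative:

```python
# Minimizing the subtle-uprooting objective Q(beta) with a quasi-Newton method.
# SciPy's plain BFGS is a stand-in for the modified BFGS of Li and Fukushima (2001).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, beta_true = 200, np.array([1.0, 1.0, 0.0])
X = rng.standard_normal((n, 3))
y = X @ beta_true + rng.standard_normal(n)

a, lam0 = 50.0, np.log(n)                    # BIC penalty

def Q(beta):
    w = np.tanh(a * beta ** 2)
    resid = y - X @ (w * beta)
    return n * np.log(np.sum(resid ** 2)) + lam0 * np.sum(w)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]   # start at the LSE
res = minimize(Q, beta_init, method="BFGS")
beta_hat = np.round(res.x, 4)                      # rounding reveals the sparsity
```

Starting at the LSE avoids the flat region around $\beta = 0$, where the smooth penalty has zero gradient; the noise coefficient is driven toward 0 and becomes exactly 0 after rounding, while the strong coefficients stay near their LSE values.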

SELECTION CONSISTENCY
In this section, we show that the subtle uprooting estimator $\breve\beta$ is consistent in selecting variables under mild regularity conditions. All our discussions are restricted to the conventional setting where the total number of predictors $p$ is fixed. Denote the true model as $y_i = x_i^\top \beta_0 + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ is independent of $x_i$. We make the following regularity assumptions, which are rather standard in studying the asymptotics of the LSE or MLE in Gaussian linear models.
(D1) The parameter set B is compact with bounded β .
(D2) X T X is nonsingular with probability one for all n > p.
The subtle uprooting estimator $\breve\beta$ is the minimizer of
$$Q_n(\beta) = n \log \|y - X\tilde\beta\|^2 + \lambda_0 \sum_{j=1}^{p} w(\beta_j) \tag{19}$$
over all $\beta \in B$. Apart from the penalty term $\mathrm{pen}(\beta) = \lambda_0 \sum_j w(\beta_j)$, (19) is, up to a constant, the profile likelihood of $\beta$.
To obtain a sum of iid terms from the objective function in (19), we consider $\breve\beta$ as the solver of $\min_{\beta \in B} \exp\{Q_n(\beta)/n\}$. Therefore, $\breve\beta$ is an M-estimator (see, e.g., chap. 5 of van der Vaart 1998) with base function
$$m_\beta(y, x) = (y - x^\top \tilde\beta)^2 \exp\Big\{\frac{\lambda_0}{n} \sum_{j=1}^{p} w(\beta_j)\Big\}.$$
With standard notation from empirical processes, we have $\exp\{Q_n(\beta)/n\} \propto \mathbb{P}_n m_\beta$, or simply $\mathbb{P}_n m$. The corresponding population version $P m_\beta$, or $P m$, is given by
$$P m_\beta = E(y - x^\top \tilde\beta)^2 \exp\Big\{\frac{\lambda_0}{n} \sum_{j=1}^{p} w(\beta_j)\Big\},$$
where the expectation is taken with respect to the joint distribution of $(y_i, x_i)$. Denote its limit as $P_0 m_\beta$, or simply $P_0 m$, that is,
$$P_0 m_\beta = E(y - x^\top \tilde\beta)^2,$$
which is convex in $\tilde\beta$ and uniquely minimized at $\tilde\beta = \beta_0$ regardless of $\sigma^2$. Theorem 1 shows that $\breve\beta$ is a consistent estimator of $\beta_0$ and provides a rate of convergence. The proof can be found in the supplementary materials. Our arguments are essentially based on M-estimation. However, unlike for most M-estimators, the base function $m_\beta$ involves $n$ and, owing to the penalty part, $P m$ is not differentiable in $\beta$ whenever $\beta_j = 0$. Thus, some modifications are made to accommodate these peculiarities. The $O(n^{-3/8})$ convergence rate may not be the best rate for $\breve\beta$, but it suffices to establish selection consistency. To proceed, we define $m(\cdot)$ as an indexing operator such that $m(\beta) = \{j : \beta_j \neq 0\}$ indexes the selected model. Note that the underfitting class $\mathcal{M}^-$ contains all scenarios in which at least one true predictor is not included in the model.
For a model $m$, let $X_m$ denote the corresponding design matrix after removing from $X$ the predictor columns with zero coefficients. Denote the column space of $X_m$ by $C_m$ and its orthogonal complement by $C_m^\perp$. Let $P_{C_m}$ denote the projection matrix onto the subspace $C_m$. The following theorem indicates that the global minimizer of $Q_n(\beta)$ must have the same sparsity as the true $\beta_0$ with probability approaching one (w.p.a.1) as $n \to \infty$.
Theorem 2. Assume conditions (D1)-(D3) hold and, furthermore, that for any $m \in \mathcal{M}^-$ the signal missed by the underfitted model does not vanish, in the sense that $\liminf_{n \to \infty} \|P_{C_m^\perp} X \beta_0\|^2 / n > 0$. It is particularly worth noting that all of the above asymptotic results hold for any constant $a > 0$. The practical implication is that there is no need to tune $a$ when $n$ is large.

NUMERICAL RESULTS
In this section, simulated experiments and a real data example are used to assess and illustrate the subtle uprooting method. To make comparisons with other methods, we use the same simulation settings and datasets as in Yuan and Lin (2007) and others.
All computations were conducted in R (R Core Team 2013). For practical implementation, we used simulated annealing as a convenient preprocessing step in solving subtle uprooting, although that method is best suited to discrete search spaces. Starting from the estimates obtained by simulated annealing, the modified BFGS algorithm is then applied to continue the optimization until convergence. Optionally, a multistart strategy can also be used in searching for the global minimum: solutions from a number of runs with different starting points are compared, and the one with the smallest $Q(\beta)$ value is selected as the final estimate. In all the numerical examples presented in this section, the BIC penalty $\lambda_0 = \log(n)$ was used unless otherwise indicated, and the parameter estimates were rounded to four decimal places for sparsity.
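The multistart strategy amounts to a short loop around the optimizer. The sketch below illustrates the mechanics on a simple nonconvex toy objective (a separable double well) rather than the subtle uprooting objective itself; all names are our own:

```python
# Multistart: run a local optimizer from several random starting points
# and keep the solution with the smallest objective value.
import numpy as np
from scipy.optimize import minimize

def multistart(Q, p, n_starts=10, scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        res = minimize(Q, scale * rng.standard_normal(p), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best

# toy nonconvex objective: a double well in each coordinate, minima at +/-1
Q_toy = lambda b: np.sum((b ** 2 - 1.0) ** 2)
best = multistart(Q_toy, p=2)
```

Each run only finds a local minimum; comparing several runs raises the chance of landing in the basin of the global one, which is the same rationale as the simulated annealing preprocessing.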

SIMULATED EXAMPLE 1
We first consider Example 1 of Yuan and Lin (2007). The data were generated from $y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$, where $(\beta_1, \beta_2, \beta_3) = (1, 1, 0)$; $X_1$, $X_2$, and $\varepsilon$ are independent $N(0, 1)$; and, conditional on $X_1$ and $X_2$, a noisy variable $X_3$ was generated from $N(\rho \cdot (X_1 + X_2),\, 1 - 2\rho^2)$. Thus, $X_3$ has unit variance and $\mathrm{cor}(X_1, X_3) = \mathrm{cor}(X_2, X_3) = \rho$. Table 1 tabulates the average and SD of the resulting $\beta$ estimates over 100 runs in several selected scenarios, as well as the proportion of correct selections. It can be seen that subtle uprooting provides very accurate variable selection, even with highly correlated variables ($\rho = 0.65$). While varying with the choices of $a$ and $n$, its performance shows substantial reliability and robustness. Not surprisingly, performance generally improves with increased sample size for fixed $a$, except when $a = 1$ and $\rho = 0.35$. Comparing Figure 2(a) with 2(b), it is interesting to see that the performance of the method seems to improve with larger values of $a$ for less correlated data, but with smaller values of $a$ for highly correlated data.
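The data-generating scheme of this example is easy to reproduce; the sketch below uses an arbitrary seed and a large $n$ only to verify the implied correlation structure:

```python
# Example 1 of Yuan and Lin (2007): X3 is a noisy predictor correlated
# with X1 and X2; the true coefficient vector is (1, 1, 0).
import numpy as np

def gen_example1(n, rho, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2, z, eps = rng.standard_normal((4, n))
    x3 = rho * (x1 + x2) + np.sqrt(1 - 2 * rho ** 2) * z  # N(rho*(X1+X2), 1-2*rho^2)
    y = 1.0 * x1 + 1.0 * x2 + 0.0 * x3 + eps
    return np.column_stack([x1, x2, x3]), y

X, y = gen_example1(n=20000, rho=0.35)
```

By construction, $\mathrm{var}(X_3) = 2\rho^2 + (1 - 2\rho^2) = 1$ and $\mathrm{cov}(X_3, X_1) = \mathrm{cov}(X_3, X_2) = \rho$, which the empirical moments confirm.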
To make comparisons, Figure 3 plots the proportion of correct selections over 100 runs versus $n$ for four competing methods: best subset regression with BIC, LASSO, SCAD, and MCP. The R packages bestglm, lars, and ncvreg (for both SCAD and MCP) were used for computation. These results can be compared to Figure 1 in Yuan and Lin (2007). Note, however, that the proportion of correct selection in Yuan and Lin (2007) is defined as the proportion of consistent paths, that is, the frequency with which the whole regularization path contains the true variable selection; even with a consistent path, the final variable selection still depends on the choice of the regularization parameter. It can be seen that best subset selection with BIC, for which this small problem is well within reach, provides the best performance, followed by SCAD and MCP. Despite the variations, their performance does not seem to be affected much by increased correlation among variables. LASSO has the lowest proportion of correct selections, and its performance deteriorates substantially when predictors are more correlated. Subtle uprooting shows a performance between best subset and SCAD or MCP, varying with the value of $a$. Intuitively, a relatively larger $a$ leads to a better approximation of best subset selection, while a smaller $a$ leads to a smoother objective function. In either case, the optimization problem remains nonconvex; this nonconvex nature also plays a critical role, and global optimization techniques can be helpful. In the supplementary materials, additional simulations were conducted implementing the multistart strategy. It can be seen that, with a reasonably large $a$, subtle uprooting essentially approximates best subset selection, with both outperforming the other methods in this setting.

SIMULATION EXAMPLE 2
We next consider three models used in Tibshirani (1996) and by other authors, including Fan and Li (2001) and Yuan and Lin (2007). Data were simulated from the linear model $y = X\beta + \sigma\varepsilon$, where $p = 8$ and $\sigma = 3$. The correlation between $X_j$ and $X_{j'}$ is $\rho^{|j - j'|}$ with $\rho = 0.5$. Three sample sizes, $n = 20$, 50, and 100, were considered. For each sample size, 200 simulation runs were made. For each simulated dataset, subtle uprooting was applied with four choices of $a$: 1, 10, 50, and 100. We recorded the model error, model size, false positive selections, and false negative selections, all averaged over the 200 simulation runs. Given an estimated $\hat\beta$, the model error (ME) is
$$\mathrm{ME}(\hat\beta) = (\hat\beta - \beta)^\top V (\hat\beta - \beta),$$
where $V = E(X^\top X)$ is the population covariance matrix of $X$. Table 2 summarizes the results. Since our simulation settings are exactly the same as in Yuan and Lin (2007), the results for Models A, B, and C in Table 2 can be compared to those presented in Tables 1, 2, and 3 of Yuan and Lin (2007), respectively. Model A is also considered in Table 1 of Fan and Li (2001), although they used $n = 40$. Additional simulation results can be found in the supplementary materials, where subtle uprooting implemented with the multistart strategy was further investigated with an increased sample size, $n = 200$, and different dependence levels, $\rho = 0.2$, 0.5, and 0.8. It can be seen that the results from subtle uprooting, regardless of the choice of $a$, are generally comparable to those obtained from the other methods. Keep in mind that subtle uprooting is intended as an approximation of all-subsets selection with BIC, and BIC works best with large samples and strong signals. However, the simulation settings considered here all have relatively weak signals with a large error variance ($\sigma = 3$), especially for Model B.
As a result, all-subsets selection did not win out as much as in Example 1, and subtle uprooting shows mixed performance relative to the others. In Table 3, we used $\sigma = 1$ for stronger signals. Now it can be seen that subtle uprooting outperforms the other methods most of the time in terms of ME for some $a$; when fixing $a = 50$, the results remain superior or comparable. Despite the variations, subtle uprooting demonstrates considerable robustness with respect to the choice of $a$. As shown theoretically in Section 4, subtle uprooting is selection consistent for any constant $a > 0$. For these reasons, we recommend simply fixing $a = 50$.
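The model error defined above is straightforward to compute once $V$ is formed from the correlation structure $\mathrm{cor}(X_j, X_{j'}) = \rho^{|j - j'|}$ used in these simulations. An illustrative sketch (the coefficient vectors are arbitrary placeholders, not the models from the tables):

```python
# ME = (beta_hat - beta)^T V (beta_hat - beta), with V_ij = rho^|i - j|.
import numpy as np

def model_error(beta_hat, beta_true, rho=0.5):
    p = len(beta_true)
    idx = np.arange(p)
    V = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1)-type covariance
    d = np.asarray(beta_hat) - np.asarray(beta_true)
    return float(d @ V @ d)
```

Since $V$ is positive definite, the ME is zero exactly when $\hat\beta = \beta$ and positive otherwise; with $\rho = 0$ it reduces to the squared Euclidean error.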

SIMULATION EXAMPLE 3
This section compares the computation time of subtle uprooting with that of the other selection methods. Data were generated from the linear model $y = X\beta + \sigma\varepsilon$, where $\sigma = 3$ and $\beta = (\beta_j) \in \mathbb{R}^p$ with $\beta_j = 1$ for $j \le p/2$ and $\beta_j = 0$ otherwise. As in Section 5.2, $X$ was generated from a multivariate normal distribution with $\mathrm{cor}(X_j, X_{j'}) = \rho^{|j - j'|}$ and $\rho = 0.5$. To see how computing time varies with $n$ and $p$, the combinations of $n \in \{200, 2000\}$ and $p \in \{10, 50, 100\}$ were considered.
We considered four choices of a ∈ {1, 10, 50, 100} for subtle uprooting and made comparison with four other methods mentioned before. Table 4 presents the computation time in seconds for performing variable selection with each setting, averaged from three repetitions for every method. Note that the best subset method is not available when p = 50 or p = 100, although it is the fastest when p = 10.
Subtle uprooting offers comparable speed and works well in the higher-dimensional cases. Its computation time varies with the choice of $a$; yet, with $a = 50$, it is faster than SCAD and MCP most of the time. It is worth mentioning that our implementation of subtle uprooting is purely R-based, while the other packages are written in either C or FORTRAN, so the comparison may understate the advantage of subtle uprooting. In principle, subtle uprooting solves a problem similar to SCAD or MCP, yet with only one fixed $\lambda$. Supposedly, computation time should increase with larger correlation for all methods, but this does not always seem to be the case.

REAL EXAMPLE: PROSTATE CANCER DATA
For further illustration, we revisit the prostate cancer dataset previously analyzed by Tibshirani (1996) and many others, including Yuan and Lin (2007). The dataset contains 97 observations and 9 variables. The objective is to regress the log of the prostate-specific antigen level (lpsa) on eight clinical measures: log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). All previous variable selection results clearly indicate that lcavol is the most predictive, while gleason seems unimportant. To illustrate how various variable selection methods perform with correlated predictors, Yuan and Lin (2007) replaced gleason with an artificial predictor computed as $(2 \times \text{lcavol} + \text{gleason})$. A good variable selection method should recognize the feigned multicollinearity and continue to exclude the modified gleason. Figure 4 plots the estimated coefficients $\breve\beta$ from subtle uprooting for $a \in \{1, 5, 10, 15, 20, \dots, 100\}$. We can see that $\breve\beta$ is very stable when $a \ge 25$. Table 5 tabulates the detailed results from subtle uprooting with $a = 50$, as well as results from six other variable selection methods. Except for LASSO and the elastic net, all methods are able to exclude the modified gleason variable. NG, best subset, and subtle uprooting provide a more parsimonious model than the one found by MCP and SCAD.

DISCUSSION
This article makes two noteworthy contributions. First, we point out that variable selection can be reformulated as a single optimization problem by approximating the cardinality in an information criterion with a smooth surrogate function. Second, we propose a technical maneuver that helps maintain smoothness of the objective function while enhancing sparsity of the coefficient estimates. There are several interesting avenues for future research. First of all, the method can be extended in a few ways. It is straightforward to apply subtle uprooting to generalized linear models and other parametric or semiparametric models; subtle uprooting can be most useful when the likelihood function itself is not log-concave. Extensions to structured sparsity, as in the grouped LASSO, can also be considered. Second, we have used $\lambda_0 = \log(n)$ in the spirit of BIC. Note that BIC is no longer a consistent selection criterion when $p \gg n$, in which case the modified BIC criterion of Chen and Chen (2008) may facilitate a natural extension. With large $n$ and $p$, the limited-memory BFGS (L-BFGS; Liu and Nocedal 1989) algorithm could be helpful, although a similar modification would be needed to handle nonconvexity. Third, further studies are ongoing to better understand the theoretical and empirical properties of this method in terms of both variable selection and parameter estimation.

SUPPLEMENTARY MATERIALS
The supplementary materials contain two sections. In the first, additional simulated experiments further address three issues in subtle uprooting: global optimization with the multistart strategy, scenarios with stronger signals, and robustness with respect to the choice of $a$. The second contains detailed proofs of Theorems 1 and 2 in Section 4.