Spline-Lasso in High-Dimensional Linear Regression

We consider a high-dimensional linear regression problem in which the covariates (features) are ordered in some meaningful way and the number of covariates p can be much larger than the sample size n. The fused lasso of Tibshirani et al. is designed especially for this type of problem; it yields sparse coefficients, selects grouped variables, and encourages a locally constant coefficient profile within each group. However, in some applications the effects of different features within a group may differ and change smoothly. In this article, we propose a new spline-lasso, or more generally spline-MCP, to better capture the different effects within a group. The proposed method is very easy to implement since it can be readily converted into a lasso or MCP problem. Simulations show that the method works effectively in both feature selection and prediction accuracy. A real application is also given to illustrate the benefits of the method. Supplementary materials for this article are available online.


INTRODUCTION
Suppose that we have a dataset consisting of n pairs (x_1, y_1), . . . , (x_n, y_n), where x_i^T = (x_{i1}, . . . , x_{ip}) ∈ R^p and the y_i are either quantitative or {0, 1}, representing two classes like "healthy" and "sick." Consider the linear regression model

y_i = x_i^T β + ε_i,  i = 1, . . . , n,

where β^T = (β_1, . . . , β_p) ∈ R^p is the unknown parameter and ε_1, . . . , ε_n are error variables. We are especially interested in problems where the dimension p is high and the sample size n might be small. When dealing with high-dimensional data, there are usually two important considerations: model sparsity and prediction ability. The lasso (Tibshirani 1996) tries to achieve both objectives, giving an estimate of β by

β̂_lasso = argmin_β { ‖y − Xβ‖₂² + λ ∑_{j=1}^p |β_j| }.

Some convergence and asymptotic results for the lasso have been provided by Knight and Fu (2000). There are also many methods similar to the lasso with different purposes. In particular, the elastic net of Zou and Hastie (2005) imposes an extra L₂-penalty on top of the lasso L₁-penalty:

β̂_enet = argmin_β { ‖y − Xβ‖₂² + λ₁ ∑_{j=1}^p |β_j| + λ₂ ∑_{j=1}^p β_j² },

where the strict convexity of the L₂ penalty allows correlated influential features to be selected together, enjoying a grouping effect. Other variants include the adaptive lasso of Zou (2006), SCAD of Fan and Li (2001), the minimax concave penalty (MCP) of Zhang (2010), the grouped lasso of Yuan and Lin (2006), the Dantzig selector of Candes and Tao (2007), the fused lasso of Tibshirani et al. (2005), and so on.

Jianhua Guo is Professor, School of Mathematics and Statistics, Key Laboratory for Applied Statistics of MOE, Changchun 130024, China (E-mail: jhguo@nenu.edu.cn). Jianchang Hu is PhD Student (E-mail: hujianchang@gmail.com), and Bing-Yi Jing is Professor (E-mail: majing@ust.hk), Hong Kong University of Science and Technology, Hong Kong, China. Zhen Zhang is Assistant Professor, Department of Mathematics, National University of Singapore, Singapore (E-mail: matzz@nus.edu.sg). We wish to thank the Editor, an Associate Editor, and two reviewers for their helpful comments, which led to considerable improvements in the presentation of the article.

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/jasa.
In this article, we will focus on cases where the high-dimensional features x_i^T = (x_{i1}, . . . , x_{ip}) are ordered in some meaningful way, which may lead to high correlation among closely located variables. Such examples include protein mass spectroscopy, gene expression data in microarray studies, and single nucleotide polymorphism (SNP) data in genome-wide association studies (GWAS). Our goal is twofold: first, to select the relevant features, and second, to predict Y_j from a future X_j^T = (X_{j1}, . . . , X_{jp}). To tackle this type of problem, Tibshirani et al. (2005) proposed the fused lasso, which gives an estimator of β by

β̂_fused = argmin_β { ‖y − Xβ‖₂² + λ₁ ∑_{j=1}^p |β_j| + λ₂ ∑_{j=2}^p |β_j − β_{j−1}| },

or, equivalently, by minimizing ‖y − Xβ‖₂² subject to ∑_{j=1}^p |β_j| ≤ s₁ and ∑_{j=2}^p |β_j − β_{j−1}| ≤ s₂.
The first constraint yields sparse estimates, while the second encourages flatness of the coefficients β_j. The fused lasso tries to maintain grouping effects as well as sparsity of the coefficients, and has been successfully applied in real applications, for example, hot-spot detection for comparative genomic hybridization (CGH) data in Tibshirani and Wang (2007) and the identification of disease-associated SNP clusters in Yang et al. (2011).
Despite these successes, however, we would like to make two observations upon which we believe improvements can be made.
1. First of all, the fused lasso encourages local constancy of the coefficient profile within each group; see, for example, Figure 3 of Tibshirani et al. (2005). However, in some applications the features in a group might vary smoothly, rather than following a step function, for example, the SNP data in Section 4 later. In those situations, the fused lasso may not work as well as hoped; see Figure 1 later.

2. Second, the optimization problem for the fused lasso is a quadratic programming problem; for large p, as pointed out in Tibshirani and Wang (2007, sec. 2.4), the problem is difficult to solve and special care must be taken to avoid the use of p² storage elements. For large p, more efficient algorithms are needed. One promising algorithm is coordinate descent, proposed by Friedman et al. (2007). Coordinate descent has been shown to work for a variety of penalized (convex and nonconvex alike) regressions, for example, the elastic net, grouped lasso, bridge lasso, SCAD, and MCP. The reason it works is that it reduces the high-dimensional optimization problem to a one-dimensional one, for which an explicit solution can easily be derived at every step. It has proven faster than LARS for the lasso, particularly for large p. However, as shown in Tibshirani and Taylor (2011), coordinate descent does not work for the fused lasso, partially due to the nondifferentiability of the L₁-norm, but a modified version works.
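To make the coordinate-descent idea concrete, here is a minimal numpy sketch of cyclic coordinate descent for the plain lasso, where each one-dimensional step has the explicit soft-thresholding solution. The function names and the unscaled objective 0.5‖y − Xb‖² + λ‖b‖₁ are our own illustration, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """One-dimensional lasso solution: argmin_b 0.5*(b - z)**2 + t*|b|."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # precomputed ||x_j||^2
    r = y - X @ b                        # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]       # add back the j-th contribution
            z = X[:, j] @ r              # univariate least-squares fit
            b[j] = soft_threshold(z, lam) / col_sq[j]
            r = r - X[:, j] * b[j]
    return b
```

Each inner step solves the one-dimensional problem exactly, which is precisely why the algorithm scales to large p.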
In this article, we propose a new approach, referred to as the "spline-lasso" (or more generally, "spline-MCP"), to overcome the above two issues. Borrowing an idea from the cubic spline smoothing literature, we impose an L₂ penalty on the discrete version of the second derivatives of the β's, as well as the lasso penalty on the β's themselves. We will see later that the proposed method can capture smooth changes in the β's within a group. Furthermore, the proposed method is very easy to implement computationally, since it can be turned into a usual lasso or MCP problem, allowing the use of efficient algorithms such as LARS of Efron et al. (2004) and PLUS of Zhang (2010) when p is not very large. If the dimension is very large, the coordinate descent algorithm can also be adopted; since each step has a closed-form solution, the algorithm is easy to implement and converges quickly.
It is worth mentioning some related work here. Hebiri and Van de Geer (2011) introduced the "smooth-lasso," in which the L₁-penalty ∑_{j=2}^p |β_j − β_{j−1}| of the fused lasso is replaced by the L₂-penalty ∑_{j=2}^p (β_j − β_{j−1})². The method is good at capturing smooth features in a group, but it does not work as well as our proposed method under similar settings; see the simulation section later in the article. Huang et al. (2011) proposed the sparse Laplacian shrinkage (SLS) method for variable selection and estimation, which explicitly incorporates the correlation patterns among predictors. Their article differs from ours in that it does not treat the case of ordered covariates (or features) as we do here.
The layout of the article is as follows. In Section 2, we introduce the spline-lasso and spline-MCP and give some oracle inequalities. Simulation studies are given in Section 3. A real application is given in Section 4. In Section 5, we give some discussion. Proofs are postponed to Section 6.

METHODOLOGY AND MAIN RESULTS
We will first introduce the spline-lasso method, and prove some oracle inequalities. We will then introduce the spline-MCP, a more general version of the spline-lasso, and will also prove some oracle inequalities.

The Spline-Lasso
The spline-lasso procedure is defined as

β̂_spline = argmin_β { ‖y − Xβ‖₂² + λ₁ ∑_{j=1}^p |β_j| + nλ₂ ∑_{j=2}^{p−1} (Δ_j^{(2)} β)² },  (1)

where Δ_j β := β_{j+1} − β_j and Δ_j^{(2)} β := Δ_j β − Δ_{j−1} β are the first- and second-order differences (discrete versions of derivatives) of the coefficients β_j. The first λ₁-penalty in (1) is the usual lasso L₁-penalty on the β's to encourage sparse coefficients. The second λ₂-penalty in (1) mimics the cubic spline: it penalizes the squared L₂ norm of the discrete second-order derivatives of the coefficients β_j to encourage smoothness.
The spline-lasso is designed to overcome the two difficulties encountered by the fused lasso mentioned in the introduction. First, it tries to capture the smooth varying features by providing a smoother estimation than the fused lasso. Second, it is computationally easy since the L 2 penalty term can be easily merged with the least squares terms to transform the problem into another lasso problem (see the next proposition), so that we can make full use of the LARS algorithms proposed by Efron et al. (2004).
Proposition 1. Given the dataset (Y, X) and (λ₁, λ₂), define an artificial dataset (Y*, X*) by

Y* = (Y^T, 0^T)^T,  X* = (X^T, √(nλ₂) L^T)^T,

where 0 is a zero vector of length p − 2 and L is the (p − 2) × p matrix of second-order differences, so that Lβ stacks the Δ_j^{(2)} β. Then β̂_spline is the lasso solution with data (Y*, X*) and tuning parameter λ₁.

The next result concerns the oracle inequality for the proposed spline-lasso β̂_spline, which gives L₁- and L₂-error bounds on the risk. We use |A| to denote the cardinality of a set A. We first state the so-called restricted eigenvalue assumption; see Bickel, Ritov, and Tsybakov (2009). We will discuss Assumption 1 further after Theorem 1.
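The transformation in Proposition 1 can be sketched numerically. The scaling √(nλ₂) on the second-difference matrix L is our reading of the quadratic penalty nλ₂ β^T J̃ β, and the helper names are illustrative.

```python
import numpy as np

def second_diff_matrix(p):
    """(p-2) x p second-difference matrix: (L b)_j = b_{j+2} - 2 b_{j+1} + b_j."""
    L = np.zeros((p - 2, p))
    for j in range(p - 2):
        L[j, j], L[j, j + 1], L[j, j + 2] = 1.0, -2.0, 1.0
    return L

def augment(X, y, lam2):
    """Build (Y*, X*) so a plain lasso on them solves the spline-lasso."""
    n, p = X.shape
    L = second_diff_matrix(p)
    X_star = np.vstack([X, np.sqrt(n * lam2) * L])
    y_star = np.concatenate([y, np.zeros(p - 2)])
    return X_star, y_star

# Sanity check: ||Y* - X* b||^2 == ||y - X b||^2 + n*lam2 * b' (L'L) b
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6)); y = rng.normal(size=10); b = rng.normal(size=6)
Xs, ys = augment(X, y, lam2=0.5)
L = second_diff_matrix(6)
lhs = np.sum((ys - Xs @ b) ** 2)
rhs = np.sum((y - X @ b) ** 2) + 10 * 0.5 * b @ (L.T @ L) @ b
assert np.isclose(lhs, rhs)
```

The sanity check at the end confirms that the augmented least-squares term reproduces the spline penalty exactly, so any off-the-shelf lasso solver can be applied to (Y*, X*).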
Assumption 1. Let J̃ = L^T L, where L is given in Proposition 1, and K_n = X^T X + nλ₂ J̃, and let Δ_n = 4√|A| + 4λ₂ λ₁⁻¹ ‖J̃β‖₂, where A = {j : β_j ≠ 0} is the nonsparsity set and λ₁ and λ₂ are the tuning parameters of the spline-lasso. There is a constant φ_{λ₂} > 0 such that a restricted-eigenvalue-type inequality involving K_n and Δ_n holds on the relevant restricted set of vectors.

We now present the oracle inequalities for the spline-lasso.
Theorem 1. Suppose that Assumption 1 holds with a set B ⊃ A such that |B| ≤ c_J̃ |A| for a given constant c_J̃ ≥ 1. Assume λ₁ is chosen appropriately in terms of η ∈ (0, 1). Then, with probability greater than 1 − η, L₁- and L₂-error bounds on the risk of β̂_spline hold; in addition, when λ₂ = 0, the bounds reduce to the usual lasso oracle inequalities.

Assumption 1 is widely used in the literature and basically requires that the restriction of the generalized Gram matrix K_n to the columns in B is invertible; when X^T X is invertible, the condition always holds with φ_{λ₂} at least as large as the smallest eigenvalue of X^T X. See Bickel, Ritov, and Tsybakov (2009), van de Geer and Bühlmann (2009), and Hebiri and Van de Geer (2011) for more details on this assumption.
The two parameters (λ₁, λ₂) have different roles to play: λ₁ controls the sparsity of the coefficients β_j and is preset in Theorem 1, while λ₂ controls the smoothness of the coefficients β_j, as can be seen from the last inequality in Theorem 1. As long as λ₂ is chosen such that λ₁√|A| dominates λ₂‖J̃β‖₂, an increase in λ₂ will lead to an estimate whose smoothness is closer to that of the true β.
Note that the optimality properties are with respect to the same smoothness conditions. If there is no smoothness requirement, λ₂ can be set to zero, which reduces the procedure to the lasso.
The oracle inequalities in Theorem 1 depend intricately on many factors working together. The results in Theorem 1 are not asymptotic in nature, in that they hold for all n and p under certain conditions.

The Spline-Lasso as Bayes Estimates
It is well known that the lasso can be regarded as a Bayes estimator with double-exponential priors on the β_j's. We now explain that the spline-lasso has a similar Bayesian interpretation. To illustrate the idea, let us first look at a related but slightly simpler problem, in which the smoothness penalty is placed on the first-order differences:

min_β ‖y − Xβ‖₂² + λ₂ ∑_{j=1}^p (β_j − β_{j−1})².

It is easy to verify that this amounts to assuming a priori that β_1, . . . , β_p have a Markovian structure with β_0 = 0 and a normal transition probability, with β_j given β_{j−1} centered at β_{j−1}. Similarly, for the spline-lasso (1), the penalty on the second-order differences Δ_j^{(2)} β = Δ_j β − Δ_{j−1} β can be regarded as imposing a Markovian structure on the Δ_j β and assuming a priori that the transition probability of Δ_{j+1} β given Δ_j β is normal.

The Spline-MCP
Zhang (2010) proposed the MCP and showed that it has a number of advantages over the lasso. This motivates us to extend the spline-lasso to the spline-MCP, which we hope will inherit some of the nice properties of MCP. Furthermore, the spline-MCP is computationally easy to implement since it can be readily transformed into an MCP problem, for which we can use the PLUS algorithm of Zhang (2010). Oracle inequalities will also be established. Later, we will see that the spline-MCP performs remarkably well in simulations and in a real application.
First, we define the generalized spline-like estimator as the solution of the following optimization problem:

M(b; λ, γ) = (2n)⁻¹ ‖Y − Xb‖₂² + (λ₂/2) b^T J̃ b + ∑_{j=1}^p ρ(|b_j|; λ, γ),

where J̃ is the same as defined in Assumption 1. This M(b; λ, γ) is very similar to the objective function in Huang et al. (2011), but the intention behind their problem is quite different from ours. The penalty function ρ(|b|; λ, γ) could be SCAD, MCP, or a more general quadratic spline penalty. To find the solution, as in Proposition 1 the original optimization problem can be transformed into an equivalent penalized least-squares problem with the augmented dataset (Y*, X*), which can then be solved by algorithms like PLUS in Zhang (2010).
Next, we consider the properties of general spline-like estimators with a concave penalty. We use MCP as an example; for other quadratic spline penalties, the results are similar. The MCP is defined as

ρ(t; λ, γ) = λ ∫₀^{|t|} (1 − x/(γλ))₊ dx.

Its first-order derivative is

ρ̇(t; λ, γ) = λ (1 − t/(γλ))₊,  t ≥ 0.

Therefore, larger values of γ make ρ less concave.
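Integrating the definition above gives the closed form λ|t| − t²/(2γ) for |t| ≤ γλ and the constant γλ²/2 beyond, which the following sketch encodes (a direct transcription of Zhang's penalty; the function names are ours):

```python
def mcp(t, lam, gamma):
    """MCP: rho(t; lam, gamma) = lam * int_0^|t| (1 - x/(gamma*lam))_+ dx."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)   # quadratic part
    return 0.5 * gamma * lam ** 2                # flat beyond gamma*lam

def mcp_deriv(t, lam, gamma):
    """d/dt rho(t) for t >= 0: lam * (1 - t/(gamma*lam))_+ ."""
    return max(lam - t / gamma, 0.0)
```

Note that the penalty is flat past |t| = γλ, so large coefficients are left unshrunk, which is the source of the MCP's reduced bias relative to the lasso.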
Denote the true regression coefficient by β* and its support by O = {j : β*_j ≠ 0}, and let β̂^o denote the oracle spline OLS estimator on the set O. Let Σ = n⁻¹ X^T X, and let c_min(λ₂) be the smallest eigenvalue of Σ(λ₂) = Σ + λ₂ J̃.
We consider two separate cases: Σ(λ₂) is positive definite, or Σ(λ₂) is singular. In both cases, we will show sign consistency under different conditions.

Case I: Σ(λ₂) is Positive Definite
In this case, the oracle spline OLS estimator β̂^o can be written explicitly: on the set O it is the ridge-type estimator

β̂^o_O = (X_O^T X_O + nλ₂ J̃_{O,O})⁻¹ X_O^T Y,

with β̂^o_j = 0 for j ∉ O. In addition, we give two conditions.
Condition A. For a certain constant ε ∈ (0, 1/3), the data satisfy a sub-Gaussian tail condition indexed by ε.

Condition A puts a sub-Gaussian constraint on the data. Condition B1 ensures that the overall optimization is convex. Condition B2 puts a minimum level on the sparsity penalty; that is, the threshold should be large enough to kill off all small estimates around zero. Condition B3 ensures that the true nonzero β* are large enough to avoid being killed by the threshold level.

Theorem 2. Suppose Conditions A and B hold. Then, with probability tending to one, sgn(β̂_spline-MCP) = sgn(β*) and β̂_spline-MCP = β̂^o.
Theorem 2 shows that, under the assumed conditions, our proposed spline-MCP enjoys sign consistency and equals the oracle estimator β̂^o with high probability.

Case II: Σ(λ₂) is Singular
For high-dimensional data, Σ(λ₂) = Σ + λ₂ J̃ is most likely singular, and the above theorem is then not applicable. To establish the oracle property of the estimator in this case, the sparse Riesz condition of Zhang (2010) is needed.
Let X̃ be a matrix satisfying X̃^T X̃/n = Σ(λ₂) = X^T X/n + λ₂ J̃, and let Ỹ be a vector satisfying X̃^T Ỹ = X^T Y. Define

M̃(b; λ, γ) = (2n)⁻¹ ‖Ỹ − X̃b‖₂² + ∑_{j=1}^p ρ(|b_j|; λ, γ).

Then the minimizer of M̃ is the same as that of the original optimization problem M.
We list another condition, Condition C, used in the next theorem. Condition C1 is the sparse Riesz condition in our situation, which requires that submodels containing the true model be nonsingular. Condition C2 places a minimal thresholding level so that all small estimates are set to zero. Condition C3 puts a minimum level on the true β*, so that the true nonzero β* are large enough to avoid being killed in this case.

Theorem 3. Suppose Conditions A and C hold. Then, with probability tending to one, sgn(β̂_spline-MCP) = sgn(β*) and β̂_spline-MCP = β̂^o.
Again, Theorem 3 shows the sign consistency of our proposed spline-MCP estimator in this situation; it also equals the oracle estimator β̂^o with high probability.
Finally, it is worth mentioning that a nonzero λ₂ induces smoothness in the coefficient estimates: the larger λ₂ is, the smoother the estimates. What is the right level for λ₂? In fact, the optimal level of λ₂ is related to theoretical work in the nonparametric estimation literature, for example, Tsybakov (2009), where the minimax rate can be obtained under assumptions on the smoothness of β. On the other hand, in implementing the method, since we do not know the smoothness of β, a data-driven method like cross-validation will be adopted.

Second-Stage Thresholding After Spline-Lasso/MCP

It will be seen from the simulation studies later that the spline-lasso/MCP can indeed capture smoothly changing effects and has very good prediction accuracy. However, it may not always maintain sparsity, as many small false features may be left in. Incidentally, the same phenomenon is shared by the smooth-lasso method. The reason is as follows. The tuning parameters are determined by cross-validation, which is prediction-oriented, so more effort is spent on capturing the smoothness of the coefficients, resulting in small tuning parameters.
To fix this, we propose a second-stage "thresholding" procedure for the spline-lasso/MCP, applied whenever necessary after the first-stage estimation to further improve variable selection ability; this proves very effective in practice.
The question is then how to choose the thresholding level T. One possibility is to use cross-validation; however, this may be computationally demanding. We instead suggest a simple procedure, inspired by Donoho and Johnstone (1998), who proposed applying σ̂√(2 log p) as a threshold to wavelet coefficients, with the noise level σ̂ estimated from the coefficients at the finest level. Similarly, we propose to adopt T = σ̂√(2 log p) as the thresholding level for the estimated coefficients, where σ̂ is the standard error of the coefficients with small magnitude. One way to determine which coefficients form σ̂ is to sort the absolute values of the estimated β̂_spline in ascending order; the smaller half can then be used to calculate σ̂. This specific threshold was originally derived under Gaussian-type errors. However, our later simulation studies involving non-Gaussian error distributions show that the above thresholding level works well there too. Finally, we remark that this thresholding may not be needed if the solution is already very sparse. One such example is the spline-MCP: in the later simulations, we found that spline-MCP gives sparse solutions, so there is no need for an extra stage of thresholding.
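A minimal sketch of this second-stage rule, with σ̂ computed from the smaller half of the sorted |β̂| as described above (the helper name is ours):

```python
import numpy as np

def second_stage_threshold(beta_hat):
    """Hard-threshold small coefficients at T = sigma_hat * sqrt(2 log p),
    with sigma_hat the std of the smaller half of |beta_hat| (treated as noise)."""
    p = len(beta_hat)
    small_half = np.sort(np.abs(beta_hat))[: p // 2]
    sigma_hat = small_half.std()
    T = sigma_hat * np.sqrt(2.0 * np.log(p))
    out = beta_hat.copy()
    out[np.abs(out) <= T] = 0.0
    return out, T
```

In practice one would apply this to the cross-validated spline-lasso fit; coefficients that survive the cut form the final selected model.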

SIMULATION STUDIES
We now conduct simulation studies to compare performances of the spline-lasso and spline-MCP with other alternatives, for example, the fused lasso of Tibshirani et al. (2005) and the smooth lasso of Hebiri and Van de Geer (2011).
The features X_1, . . . , X_p are generated randomly from a multivariate normal distribution N(0, Σ), and the response variable y is generated from the linear model y = Xβ + ε. We fix n = 71 and p = 600, so that p ≫ n. The coefficients β are obtained from a continuous function consisting of two parts: one part is a combination of a sine and a constant function, and the other is a combination of two linear functions. The function is then evaluated at discrete points, which gives the β shown as blue circles in Figures 1 and 2. There are k = 79 nonzero β's, with the rest being zero. All tuning parameters are determined by 10-fold cross-validation; we refer to Hastie, Tibshirani, and Friedman (2009, sec. 7.1) for a detailed description of K-fold cross-validation. The cross-validation is conducted over a grid of parameters (λ₁, λ₂), and the pair with the smallest cross-validated prediction error is chosen.
Two different covariance structures are considered:

• Case 1: Σ = I (no neighboring correlation).
We first show simulation results when the errors are normally distributed, that is, ε ∼ N(0, σ²). For each case, two different noise levels are considered: σ = 1 as a low noise level and σ = 3 as a high noise level.
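The Case 1 setup can be sketched as follows. The exact smooth profile used for β is not fully specified above, so the sine-shaped nonzero block here is only illustrative, as are the function names.

```python
import numpy as np

def simulate(n=71, p=600, k=79, sigma=1.0, seed=0):
    """Case 1 design (Sigma = I) with a sparse, smoothly varying coefficient
    profile. The sine-shaped block is illustrative, not the paper's exact beta."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:k] = np.sin(np.linspace(0.05, np.pi - 0.05, k))   # smooth nonzero block
    X = rng.normal(size=(n, p))                             # Sigma = I features
    y = X @ beta + rng.normal(scale=sigma, size=n)          # normal errors
    return X, y, beta
```

Swapping the identity covariance for a correlated Σ (Case 2) only changes how X is drawn; everything else stays the same.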

Performance Criterion
The performance of different methods will be examined in two aspects.

Figure 1. The blue circles are the true β and the green stars represent the estimate from each method. The upper panel is the result for the spline-lasso. The lower panel is the spline-lasso after applying the proposed second-stage thresholding, with σ̂ being the standard deviation of the estimated β̂ at locations 300 to 600. The MSE is also reported.
• Prediction error. We generate 1000 testing observations for each method and use the fitted models from the training set to give predictions on the testing set. The mean squared error (MSE) of the predicted values for the testing dataset can then be calculated from the ℓ₂ norm of the difference between the predicted values and the true simulated values.

• Feature selection. We compare sensitivity and specificity of the different methods. Sensitivity (or true positive rate) measures the proportion of actual positives correctly identified. Specificity (or true negative rate) measures the proportion of negatives correctly identified.
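The two criteria can be computed as follows (a small sketch; the function name and interface are ours):

```python
import numpy as np

def evaluate(beta_hat, beta_true, X_test, y_test):
    """Prediction MSE plus sensitivity/specificity of the selected support."""
    mse = np.mean((y_test - X_test @ beta_hat) ** 2)
    sel, truth = beta_hat != 0, beta_true != 0
    sensitivity = (sel & truth).sum() / max(truth.sum(), 1)     # true positive rate
    specificity = (~sel & ~truth).sum() / max((~truth).sum(), 1)  # true negative rate
    return mse, sensitivity, specificity
```

The same function is applied to each competing fit, so the comparison across methods uses identical test data and identical support definitions.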
Simulation results. For illustration, Figures 1 and 2 present the estimation results for the correlated features in Case II by four different methods: fused lasso, spline-MCP, spline-lasso, and spline-lasso with thresholding, the last of which was described in Section 2.3. Clearly, the spline-MCP gives the best estimation, followed by the spline-lasso with thresholding. The other two did not clean out the noisy signals very well. The improvement of the spline-MCP and spline-lasso in capturing smoothly changing effects is clearly seen in Figures 1 and 2.
From Tables 1-4, we make some further observations.
1. Columns 2-5 of Tables 1-4 give results without thresholding. The spline-MCP and spline-lasso outperform the fused lasso in terms of estimation and prediction. This is not surprising, since both are designed to capture smooth changes in the features.

2. The last three columns of Tables 1-4 give results with thresholding. Clearly, thresholding helps improve the performance of both the smooth- and spline-lassos. The spline-lasso with thresholding performs very close to the spline-MCP. However, the improvement is not significant for the fused lasso. This is not surprising: since both penalties forming the fused lasso are ℓ₁ penalties, its estimates are already sparser than those of the smooth- and spline-lassos, so the resulting threshold has very small magnitude and thresholding is not very effective. This also indicates that the fused lasso is another example where thresholding need not be applied.

3. Comparing Tables 1 and 3, the estimates are more accurate when there is correlation between the features. This is because the second penalty uses this information to aid estimation, especially for the spline-MCP, spline-lasso, and smooth-lasso.

4. From Tables 2 and 4, without thresholding, the smooth-lasso and spline-lasso have very high sensitivity but rather low specificity. When thresholding is applied (the last two columns), specificity improves substantially with very little sacrifice in sensitivity.

More Simulation Results
We then consider cases where the noise term in the regression follows a t-distribution with 5 degrees of freedom; the variances are adjusted so that σ = 3. For the second-stage thresholding, we still use the same formula to determine the threshold levels, and the performance criteria are kept the same. The following tables give a summary of the simulation results.
From these two tables, we can still see clearly that the spline-MCP and the spline-lasso after thresholding outperform the other methods. The fused lasso seems better than the spline-lasso without thresholding. This is reasonable, since the fused lasso provides sparser estimates than the spline-lasso. When the noise follows t₅, the estimates of the zero β's are larger than in the normal case, and these false positives may badly affect the performance of the spline-lasso. However, after thresholding, the performance of the spline-lasso is very close to, and sometimes even better than, that of the spline-MCP. This means that the estimation of the true signals is actually fine for the spline-lasso; its undesirable performance is largely due to the noisy terms, which again highlights the necessity of the second-stage thresholding.

EMPIRICAL ANALYSIS
We apply our method to a Crohn's disease (CD) dataset from the WTCCC (Consortium 2007). The combined sample size of the case-control study is 5009. We conduct the analysis on Chromosome 16, which consists of 10,827 SNPs after a standard quality-control strategy. The purpose is to identify disease-associated SNP clusters in the GWAS.
Although GWAS has identified many disease-susceptibility SNPs, these findings can only explain a small portion of genetic contributions to complex diseases (missing heritability). Since a causal SNP will have a small chance of being directly genotyped, the effects of its neighboring SNPs may be too weak to be detected due to the effect decay caused by imperfect linkage disequilibrium. Furthermore, it is still challenging to detect a causal SNP with a small effect even if it has been genotyped. To increase the statistical power when detecting disease-associated SNPs with relatively small effects, Yang et al. (2011) applied a weighted version of fused lasso and found that there are some disease-associated SNPs located close to each other. We would like to see what happens if we apply our method to the dataset.
We first calculate the z value of each SNP using the Cochran-Armitage trend test, and then calculate the p-value by π = 2(1 − Φ(|z|)), where Φ is the distribution function of a standard normal variable. Finally, we take −log π − 1 as our signal. If there were no signals, the p-values would follow a uniform distribution and −log π would follow an exponential distribution with rate 1. Since our method is applied to transformed marginal p-values of dimension 1 × p, it is hard to apply 10-fold cross-validation here. Therefore, the tuning parameters for our method are chosen in the same way as in Yang et al. (2011). That is, we fix λ₁ = √(2 log(π/2.7)) to adjust the number of effective SNPs and consider a wide range of decreasing λ₂. We solve for the spline-MCP solution at each λ₂ and consider the variance of the residuals, finally choosing the largest λ₂ such that the variance of the residuals is ≤ 1. In the analysis below, we control the estimated FDR at the 5% level.
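The signal transformation above can be written in a few lines of standard-library Python (the function name is ours):

```python
from statistics import NormalDist
import math

def snp_signal(z):
    """Transform a Cochran-Armitage z value into the signal -log(pi) - 1,
    where pi = 2 * (1 - Phi(|z|)) is the two-sided p-value."""
    pi = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return -math.log(pi) - 1.0
```

Under the null, −log π is Exp(1) with mean 1, so subtracting 1 centers null signals at zero; for example, z = 0 gives π = 1 and a signal of exactly −1.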
Adopting this parameter-tuning procedure may produce some false positives. We can use the stability selection strategy (Meinshausen and Bühlmann 2010) to reduce the effect of parameter tuning. Stability selection relies on the probability that a given signal is detected (the probability of detection) under model perturbation via subsampling. More specifically, we randomly sample half of the cases and controls in each subsampling round. Let p_b denote the set of p-values for the bth subsample. We then apply our method to −log p_b − 1 and obtain the support s_b; s_{i,b} = 1 indicates that the ith SNP is detected in the bth subsample. The probability of detecting the ith SNP is estimated by

π̂_i = B⁻¹ ∑_{b=1}^B s_{i,b},

where B is the total number of subsamples; here, we set B = 100. We then obtain a set of SNPs A_τ = {i : π̂_i ≥ τ}, with the threshold τ selected to control the estimated FDR at the 5% level. We found a total of 31 SNPs, 30 of which are close to each other; their spline-MCP estimates are shown in Figure 3, where the x-axis gives the location and the y-axis the fitted value. From Figure 3, we see a smooth change in the estimates. More interestingly, we find two peaks in the estimates, suggesting that the two centers could be more relevant to the disease than the others.
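The probability-of-detection estimate π̂_i = B⁻¹ ∑_b s_{i,b} and the stable set A_τ can be sketched as follows (the function names are ours):

```python
import numpy as np

def detection_probabilities(supports):
    """supports: (B, p) binary array with supports[b, i] = 1 if SNP i was
    detected in subsample b. Returns pi_hat_i = (1/B) * sum_b supports[b, i]."""
    return np.asarray(supports, dtype=float).mean(axis=0)

def stable_set(supports, tau):
    """A_tau = {i : pi_hat_i >= tau}, the indices kept by stability selection."""
    return np.flatnonzero(detection_probabilities(supports) >= tau)
```

With B = 100 subsamples as in the analysis above, `supports` would be a 100 × p binary matrix collected from the repeated spline-MCP fits.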
The full results are listed in Table 7, where the first 30 SNPs correspond to those plotted in Figure 3 and are kept in the same order; that is, rs6500315 is the name of the leftmost point in Figure 3, rs8047222 corresponds to the next point, and so on. The last one, rs4783163, is a single SNP located far from the previous 30 but also identified as disease-associated by the spline-MCP. The probability of detection reported is the probability that a given signal is detected under model perturbation via subsampling, as described above.
A closer look at the identified SNPs reveals that near the higher peak, from rs1861759 to rs1861758 and from rs1077861 to rs8060598, there are several expression quantitative trait loci (eQTLs), that is, SNPs that can regulate gene expression, of CARD15, a known gene associated with Crohn's disease. Therefore, our method can indeed help distinguish important SNPs from less important ones by looking at the shape of the estimate. In addition, among the identified SNPs there are also some eQTLs, from rs1981760 to rs8057341, of SLIC1, another gene considered to be associated with Crohn's disease. More related information can be found at http://eqtl.uchicago.edu/cgi-bin/gbrowse/eqtl/.
Comparing our findings with those in Yang et al. (2011), we are able to find all those SNPs in Chromosome 16 reported in their Table 2. Moreover, as pointed out above, our proposed method can also potentially differentiate important SNPs from those less important ones by looking at the shape of the curve.

DISCUSSION
In this article, we considered a high-dimensional linear regression problem where the features are ordered in some meaningful way. When the coefficients are sparse and change smoothly, we propose the spline-lasso or spline-MCP method, which combines the lasso or MCP penalty with a cubic-spline-type estimation procedure (i.e., penalizing the second-order derivatives of the coefficients). The estimation and prediction of the newly proposed methods are more accurate compared with some existing methods. Furthermore, whenever necessary, a second-stage thresholding procedure can improve variable selection as well as prediction and estimation.
The spline-lasso and spline-MCP methods try to achieve feature selection with a smooth structure on the coefficients. These objectives often conflict with each other. It turns out that the spline-MCP works very well for this purpose. For other methods, trying to achieve several targets might work less effectively; in that case, one might divide the task into several stages and focus on one objective at a time.
When the response y_i is binary, for example, {0, 1} representing two classes like "healthy" and "sick," it would be better to use logistic regression rather than the linear model. The idea of the spline-lasso can easily be extended to this context, which we are currently exploring.

APPENDIX
We now give a sketch to the proofs of the main theorems. The full proofs can be found in the supplementary materials.

Proof of Theorem 1
We first give two lemmas.

SUPPLEMENTARY MATERIAL
The supplementary materials contain more detailed proofs of Theorems 1-3.

[Received November 2013. Revised December 2014.]