Wavelet-Based Weighted LASSO and Screening Approaches in Functional Linear Regression

One useful approach for fitting linear models with scalar outcomes and functional predictors involves transforming the functional data to wavelet domain and converting the data-fitting problem to a variable selection problem. Applying the LASSO procedure in this situation has been shown to be efficient and powerful. In this article, we explore two potential directions for improvements to this method: techniques for prescreening and methods for weighting the LASSO-type penalty. We consider several strategies for each of these directions which have never been investigated, either numerically or theoretically, in a functional linear regression context. We compare the finite-sample performance of the proposed methods through both simulations and real-data applications with both 1D signals and 2D image predictors. We also discuss asymptotic aspects. We show that applying these procedures can lead to improved estimation and prediction as well as better stability. Supplementary materials for this article are available online.


INTRODUCTION
Substantial attention has been paid to problems involving the functional linear regression model

y_i = α + ∫_0^1 X_i(t) η(t) dt + ε_i,  i = 1, ..., n,  (1)

where the response y_i and the intercept α are scalar, the predictor X_i and the slope function η are square-integrable functions in L2([0, 1]), and the errors ε_i are independent and identically normally distributed with mean 0 and finite variance σ2. The literature on functional linear regression is growing. A sampling of papers examining this situation and various asymptotic properties includes Cardot, Ferraty, and Sarda (2003), Cardot and Sarda (2005), Cai and Hall (2006), Antoniadis and Sapatinas (2007), Hall and Horowitz (2007), Li and Hsing (2007), Reiss and Ogden (2007), Müller and Yao (2008), Crainiceanu, Staicu, and Di (2009), Delaigle, Hall, and Apanasovich (2009), James, Wang, and Zhu (2009), Crambes, Kneip, and Sarda (2009), Goldsmith et al. (2012), and Lee and Park (2012). A potentially very useful idea in fitting models involving functional data is to transform the functional data via wavelets. Recent literature on functional data analysis in the wavelet domain includes Amato, Antoniadis, and De Feis (2006), Wang, Ray, and Mallick (2007), Malloy et al. (2010), Zhu, Brown, and Morris (2012), and Zhao, Ogden, and Reiss (2012). Wavelet-based LASSO (Zhao et al. 2012) has been proposed as a powerful estimation approach for fitting model (1). It works by transforming the functional regression problem to a variable selection problem. Functional predictors can be efficiently represented by a few wavelet coefficients. After applying the discrete wavelet transform (DWT), techniques such as the LASSO (Tibshirani 1996) can then be applied to select and estimate those few important wavelet coefficients. The wavelet-based LASSO method is well suited for the situation in which the η function has spatial heterogeneity and/or spiky local features.
Although the wavelet-based LASSO method has good prediction ability, we observed from simulation studies that the functional coefficient estimates often show anomalously large point-wise variability.
The purpose of this article is to explore two potential directions for improvements to the wavelet-based LASSO: methods for weighting the LASSO-type penalty and techniques for prescreening. Although weighting the L1 penalty terms and adding a prescreening step have been widely studied in the linear regression setting, these strategies have never been investigated, either numerically or theoretically, in a functional linear regression context. In this study, we first demonstrate that the weighted version of the LASSO can improve both prediction ability and estimation accuracy. In linear regression, although the LASSO shows good prediction accuracy, it is known to be variable-selection inconsistent when the underlying model violates certain regularity conditions (Zou 2006; Zhao and Yu 2006). Meinshausen and Bühlmann (2006) proved that the prediction-based tuning parameter selection commonly used with the LASSO often results in inconsistent variable selection, and consequently the final predictive model tends to include many noise features. To improve the LASSO's performance, Zou (2006) proposed adding weights to the L1 penalty terms, with the weights defined through ordinary least squares (OLS) estimates. Later, Huang, Ma, and Zhang (2008) considered weights based on the magnitudes of the correlation coefficients between the predictors and the response. Both approaches were shown to have better variable selection consistency. In the functional linear model setting, the wavelet-based LASSO suffers from the same difficulty resulting from inconsistent selection of wavelet coefficients. We extend weighted LASSO methods to functional linear regression in the wavelet domain, with the hope that this can improve estimation accuracy by penalizing less important variables more than more important ones. In functional linear regression, the predictors are often curves densely sampled at equally spaced points. That is, the number of data points can be much larger than the sample size, resulting in a "large p, small n" problem.
Therefore, using OLS estimates as weights is not feasible in general. We propose two new weighting schemes. The first uses information from the magnitudes of the wavelet coefficients, whereas the second is based on the sample variances of the wavelet coefficients. These two schemes are fundamentally different from other weighting schemes in that the importance of each predictor is ranked without consideration of its relationship with the response variable. Our results show that the wavelet-based weighted LASSO not only provides good prediction accuracy, but also significantly improves estimation consistency.
Second, we show that incorporating a screening step before applying a LASSO-type penalty in the wavelet domain can improve both prediction ability and estimation accuracy. Adding a screening step to wavelet-based LASSO can be important, as it is increasingly common in practice to have functional predictors with ultra-high dimensionality. The challenge of statistical modeling using ultra-high dimensional data involves balancing three criteria: statistical accuracy, model interpretation, and computational complexity (Fan and Lv 2010). For example, Shaw et al. (2006) used serial image data to study how longitudinal changes in brain development in young children with attention deficit hyperactivity disorder (ADHD) can predict relevant clinical outcomes. In such a case, it is desirable to apply a screening step before model fitting. With 2D images of size 128 × 128 as predictors, it is necessary to deal with more than 16,000 predictors in the wavelet domain. Therefore it is of critical importance to reduce dimensionality to a workable size. In this article, we investigate some screening approaches that would effectively reduce computational burden, while the reduced model still contains all important information with high probability.
The rest of this article is organized as follows. In Section 2, we propose two versions of weighted LASSO in wavelet domain analysis. We then introduce some screening approaches to functional linear models. We show their statistical properties in Section 3 and use simulation studies and real data examples to demonstrate finite sample performance of the proposed methods in Section 4. Section 5 concludes this article with some discussions.

METHODS
In this section, we introduce the wavelet-based weighted LASSO with different weighting schemes in the penalty term and discuss some screening strategies that can be applied to the wavelet-based functional linear model. We assume readers have some familiarity with the wavelet transform; readers without this background can refer to Ogden (1997), Vidakovic (1999), and Abramovich, Bailey, and Sapatinas (2000) for a comprehensive overview of wavelet applications in statistics.
In this article, we require an orthonormal wavelet basis on [0, 1] such as those in Daubechies' family. Wavelet transform is a compelling choice mainly due to its great compression ability, that is, functions can be represented by relatively few nonzero wavelet coefficients. Penalized regression methods can be readily extended to functional linear model once functional predictors are transformed into wavelet domain.

WAVELET-BASED WEIGHTED LASSO
The general form of a linear scalar-on-function model is given in (1). For simplicity, we will drop the term α from Equation (1); the intercept can be estimated by α̂ = Ȳ − ∫_0^1 X̄(t) η̂(t) dt, where Ȳ and X̄ are the sample means of the response and the functional predictor, respectively. We assume that each functional predictor X_i and the coefficient function η have sparse representations in the wavelet domain. By applying the DWT to the functional data at a primary decomposition level j_0, we can obtain a discrete version of model (1),

y_i = Σ_h z_ih β_h + ε_i,  i = 1, ..., n,  (2)

where the z_ih are the empirical wavelet coefficients of X_i and the β_h are the wavelet coefficients of η. A natural estimate of β in Equation (2) can be obtained via penalized regression:

β̂ = argmin_β { Σ_{i=1}^n (y_i − Σ_h z_ih β_h)^2 + λ Σ_h w_h |β_h| },  (3)

where the w_h are user-defined positive weights. Various choices of weights have been proposed in the linear regression setting. For example, the LASSO (Tibshirani 1996) uses constant weights w_h = 1. Weights w_h = |β̂_h^0|^{-1} were considered by Zou (2006), where each β̂_h^0 is the OLS estimate of the corresponding term in the model, while Huang, Ma, and Zhang (2008) considered correlation-based weights with w_h = |(Σ_{i=1}^n z_ih y_i)/(Σ_{i=1}^n z_ih^2)|^{-1}. In this article, we propose two new weighting schemes specific to the wavelet-based penalized regression method. The weights in (3) can be defined either as w_h = |θ̂_h|^{-1}, where θ̂_h is the sample magnitude of the hth wavelet coefficient, or as w_h = s_h^{-2}, where s_h^2 = (n − 1)^{-1} Σ_{i=1}^n (z_ih − z̄_h)^2 and z̄_h is the sample mean of the hth wavelet coefficients.
The proposed methods are mainly motivated by shrinkage-based estimators in nonparametric regression with wavelets (Donoho and Johnstone 1994; Donoho et al. 1996).
That is, empirical wavelet coefficients with magnitudes less than a threshold value C contain mostly noise and thus can be ignored in estimating η. When applied to the L1-type penalty in wavelet-based approaches, weighting the penalty on each wavelet coefficient by the reciprocal of its magnitude induces a threshold-like effect, which eventually leads to adaptive regularization.
The rationale for using sample variance of wavelet coefficients as weights comes from Johnstone and Lu (2009). They pointed out that wavelet coefficients with large magnitudes typically have large sample variances, which also agrees with what we have observed in real data examples. In addition, variables with low variability would provide limited predictive power to the outcome variable in regression analysis. Therefore, weighting by the sample variances of wavelet coefficients could effectively separate important variables from unimportant ones. We expect that weighting by sample variance would similarly introduce adaptive regularization to L 1 -type penalty.
The data-dependent weight w h is critical in terms of consistency of variable selection and model estimation for the LASSO-type estimator. The weight function should be chosen adaptively to reduce biases due to penalizations. Ideally, as the sample size increases, we would want the weights for less important, noisy predictors to go to infinity and the weights for important, nonzero ones to converge to a small constant (Zou 2006). That is, adaptive regularization should effectively separate important nonzero wavelet coefficients from the unimportant ones.
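As a concrete illustration, the two proposed weighting schemes can be sketched as follows. This is a minimal numpy sketch, not the authors' implementation: the function name `wavelet_weights`, the use of the mean absolute coefficient as the sample magnitude, and the small `eps` guard are our own choices; the reciprocal form reflects the requirement above that weights for noisy coefficients grow large.

```python
import numpy as np

def wavelet_weights(Z, scheme="magnitude", eps=1e-8):
    """Illustrative penalty weights for the weighted LASSO.

    Z : (n, p) array of empirical wavelet coefficients (one column per coefficient).
    scheme : 'magnitude' uses the reciprocal of the mean absolute coefficient;
             'variance' uses the reciprocal of the sample variance.
    Large weights correspond to heavily penalized (less important) coefficients.
    """
    if scheme == "magnitude":
        theta = np.abs(Z).mean(axis=0)      # sample magnitude of each coefficient
        return 1.0 / (theta + eps)
    if scheme == "variance":
        s2 = Z.var(axis=0, ddof=1)          # sample variance of each coefficient
        return 1.0 / (s2 + eps)
    raise ValueError("unknown scheme")

rng = np.random.default_rng(0)
# column 0 carries a strong signal, column 1 is near zero
Z = np.column_stack([rng.normal(5.0, 2.0, 50), rng.normal(0.0, 0.1, 50)])
w_mag = wavelet_weights(Z, "magnitude")
w_var = wavelet_weights(Z, "variance")
```

Under both schemes, the near-zero column receives the much larger penalty weight, which is the threshold-like behavior described above.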

SCREENING STRATEGIES
When the data dimensionality is very high, it is natural to perform screening before model fitting. We consider four approaches for the wavelet-based methods: (1) screening by correlation, (2) screening by stability selection, (3) screening by variance, and (4) screening by magnitude of wavelet coefficient.

Screening by Correlation.
Fan and Lv (2008) proposed to select a set of important variables through a sure independence screening (SIS) procedure. Specifically, the SIS procedure involves selecting covariates based on the magnitudes of their sample correlations with the response variable; only the selected covariates are used for further analysis. The SIS step significantly reduces computational complexity. They showed that the SIS procedure can reduce the dimensionality from an exponentially growing number down to a smaller scale, while the reduced model still contains all the variables in the true model with probability tending to 1.

Screening by Stability Selection.
Meinshausen and Bühlmann (2010) demonstrated that variable selection and model estimation improve markedly if covariates are first screened by the stability selection procedure, which selects variables based on their maximum frequencies of inclusion in models built on perturbed data over a range of regularizations. They claimed that, with stability selection, the randomized LASSO can achieve model selection consistency even if the irrepresentability condition is not satisfied. In this article, we use a similar approach to screen out less important wavelet coefficients before model fitting. Using the wavelet-based LASSO as an example, we resample n/2 individuals from the data and fit the wavelet-based LASSO, with the tuning parameter selected by five-fold cross-validation, recording the variables remaining in the fitted model. We repeat this process B times, and the variables whose proportions of inclusion fall below a threshold value (π) are excluded from further analysis.
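The stability-selection screening step can be sketched as follows. This is an illustrative numpy sketch under stated assumptions: a simple proximal-gradient (ISTA) solver stands in for the wavelet-based LASSO fit with cross-validated tuning, and the subsample count `B`, penalty `lam`, and threshold `pi_thr` are illustrative values, not the paper's settings.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal proximal-gradient LASSO for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / (n * L)    # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return b

def stability_screen(X, y, lam, B=50, pi_thr=0.8, seed=0):
    """Keep variables selected in at least a fraction pi_thr of B subsample fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample w/o replacement
        counts += lasso_ista(X[idx], y[idx], lam) != 0
    return np.where(counts / B >= pi_thr)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=100)
kept = stability_screen(X, y, lam=0.2)         # should retain columns 0 and 3
```

Variables with strong, consistent effects survive nearly every subsample fit, while noise variables rarely exceed the inclusion threshold.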

Screening by Variance.
For principal component analysis (PCA) of signals with ultrahigh dimension, Johnstone and Lu (2009) asserted that some initial reduction in dimensionality is desirable and can best be achieved by working in the wavelet domain, in which the signals have sparse representations. Screening by variance before applying the PCA algorithm can improve estimation. We extend this idea to the linear scalar-on-function regression model: wavelet coefficients with small sample variances are excluded from the model.

Screening by Magnitudes of Wavelet Coefficients.
Wavelet coefficients with large magnitude tend to have a large presence in the predictor functions and thus may play a role in predicting the outcome. Let M = {1 ≤ l ≤ N : θ_l ≠ 0}, and let θ̂_(1) ≥ θ̂_(2) ≥ ··· ≥ θ̂_(N) be the ordered sample magnitudes of the wavelet coefficients. For a given k, the selected subset is M̂ = {l : θ̂_l ≥ θ̂_(k)}.
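The variance- and magnitude-based screening rules reduce to selecting the top-k columns by a score. A minimal sketch (the helper name `screen_top_k` and the use of the mean absolute coefficient as the sample magnitude are our own):

```python
import numpy as np

def screen_top_k(Z, k, by="magnitude"):
    """Keep the k wavelet coefficients with the largest sample magnitude or variance."""
    if by == "magnitude":
        score = np.abs(Z).mean(axis=0)
    else:
        score = Z.var(axis=0, ddof=1)
    return np.sort(np.argsort(score)[::-1][:k])   # indices of the k largest scores

rng = np.random.default_rng(1)
Z = rng.normal(0.0, 0.05, (60, 30))               # 30 near-zero coefficients
Z[:, [2, 7, 11]] += rng.normal(3.0, 1.0, (60, 3)) # three coefficients carry signal
kept_mag = screen_top_k(Z, k=3, by="magnitude")
kept_var = screen_top_k(Z, k=3, by="variance")
```

Both criteria recover the three signal-carrying coefficients in this toy example, though in general they can rank coefficients differently.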

ALGORITHM
In the previous two sections, we proposed two weighting schemes for the weighted LASSO that are specific to wavelet domain analysis and extended some screening strategies to wavelet-based methods for the functional linear model. In general, the wavelet-based penalized regression can be implemented as follows:
1. Transform the functional predictors into the wavelet domain.
2. Select a subset of important wavelet coefficients by one of the following criteria: (a) selecting all coefficients; (b) selecting k coefficients with the largest sample variances; (c) selecting k coefficients with the largest sample magnitudes; (d) selecting k coefficients with the largest sample correlations in magnitude; or (e) selecting k coefficients by stability selection.

3. Perform variable selection and model estimation by a LASSO-type algorithm, where the weight w_h in (3) is set according to the chosen method: constant weights, reciprocals of the sample magnitudes, reciprocals of the sample variances, or reciprocals of the correlation magnitudes.
4. Apply the inverse DWT to the estimated wavelet coefficients to obtain the estimated coefficient function η̂.
Extension of the above methods to functional linear models with 2D or 3D image predictors is straightforward: perform a 2D or 3D wavelet decomposition of the image predictors.
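Putting the steps together, the pipeline can be sketched end to end. This is our own minimal implementation, not the paper's: an orthonormal Haar transform stands in for the Daubechies wavelets used in the analyses, variance screening plays the role of step 2, and a proximal-gradient solver handles the weighted L1 criterion; all names and tuning values are illustrative.

```python
import numpy as np

def haar_matrix(N):
    """Orthonormal Haar DWT matrix for signal length N (a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < N:
        top = np.kron(H, [1.0, 1.0])                    # approximation rows
        bot = np.kron(np.eye(H.shape[0]), [1.0, -1.0])  # detail rows
        H = np.vstack([top, bot]) / np.sqrt(2.0)
    return H

def weighted_lasso_ista(Z, y, lam, w, n_iter=1000):
    """Proximal gradient for (1/2n)||y - Zb||^2 + lam * sum_h w_h |b_h|."""
    n, p = Z.shape
    L = np.linalg.norm(Z, 2) ** 2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        u = b - Z.T @ (Z @ b - y) / (n * L)
        b = np.sign(u) * np.maximum(np.abs(u) - lam * w / L, 0.0)
    return b

def fit_pipeline(X, y, lam=0.05, k=8):
    """Steps 1-4: DWT, variance screening, weighted LASSO, inverse DWT."""
    n, N = X.shape
    H = haar_matrix(N)
    Z = X @ H.T                                           # step 1: wavelet coefficients
    keep = np.sort(np.argsort(Z.var(axis=0))[::-1][:k])   # step 2: variance screening
    Zk = Z[:, keep]
    w = 1.0 / (Zk.var(axis=0, ddof=1) + 1e-8)             # step 3: variance-based weights
    bk = weighted_lasso_ista(Zk, y, lam, w)
    beta = np.zeros(N)
    beta[keep] = bk
    return H.T @ beta, beta                               # step 4: inverse DWT gives eta-hat

# toy data: predictors concentrated on two Haar basis functions
rng = np.random.default_rng(0)
n, N = 200, 64
H = haar_matrix(N)
scores = rng.normal(0.0, 2.0, (n, 2))
X = scores @ H[[3, 10], :] + rng.normal(0.0, 0.1, (n, N))
beta_true = np.zeros(N)
beta_true[3], beta_true[10] = 1.5, -1.0
y = (X @ H.T) @ beta_true + rng.normal(0.0, 0.2, n)
eta_hat, beta_hat = fit_pipeline(X, y)
```

Because the transform is orthonormal, reconstructing η̂ from the estimated wavelet coefficients is a single matrix product, and the heavy variance-based penalty zeroes out the screened-in noise coefficients.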

TUNING PARAMETER SELECTION
Two tuning parameters, λ and j_0, are involved in the wavelet-based weighted LASSO methods. The tuning parameter λ > 0 controls the model sparsity: all variables are retained in the model as λ → 0, and the model becomes empty as λ → ∞. The other tuning parameter, j_0, ranges from 1 to log_2(N) − 1 and controls the level of wavelet decomposition for the functional data. In this study, we choose the values of λ and j_0 by five-fold cross-validation, so that the selected combination produces the lowest cross-validated residual sum of squares over a grid of values of λ and j_0.
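The cross-validation machinery can be sketched as follows. This is an illustrative sketch for the λ grid only (j_0 is omitted here, since the sketch assumes a fixed full decomposition); the ISTA solver and grid values are our own stand-ins for the methods described above.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal proximal-gradient LASSO for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / (n * L)
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

def cv_select_lambda(X, y, grid, n_folds=5, seed=0):
    """Pick lambda minimizing the five-fold cross-validated residual sum of squares."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    rss = np.zeros(len(grid))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        for j, lam in enumerate(grid):
            b = lasso_ista(X[train], y[train], lam)
            rss[j] += np.sum((y[test] - X[test] @ b) ** 2)
    return grid[int(np.argmin(rss))], rss

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)
grid = [0.01, 0.1, 1.0, 10.0]
best_lam, rss = cv_select_lambda(X, y, grid)
```

In the full procedure, the same loop would run over the two-dimensional grid of (λ, j_0) pairs, refitting the decomposition at each level.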
If a screening step is applied before running the desired method, the number of features selected for further analysis (i.e., k) needs to be chosen as well. This value could be chosen by cross-validation, but we do not recommend it: in practice, the results are relatively insensitive to the exact value of k, and we do not want to exclude too many variables in the screening step. Typically, we select k such that the first k wavelet coefficients explain 99.5% of the total variability in the data. Due to the great compression ability of wavelets, this number is often smaller than n − 1 in our simulation studies.
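The 99.5%-of-variability rule for choosing k can be sketched as follows (the helper name `choose_k` is our own):

```python
import numpy as np

def choose_k(Z, frac=0.995):
    """Smallest k whose top-k coefficient variances explain a fraction frac of the total."""
    v = np.sort(Z.var(axis=0, ddof=1))[::-1]   # coefficient variances, largest first
    cum = np.cumsum(v) / np.sum(v)
    return int(np.searchsorted(cum, frac) + 1)

rng = np.random.default_rng(2)
Z = rng.normal(0.0, 0.03, (100, 100))          # 95 near-constant coefficients
Z[:, :5] = rng.normal(0.0, 10.0, (100, 5))     # 5 high-variance coefficients
k = choose_k(Z)                                # the 5 dominant coefficients suffice
```

With a sparse wavelet representation, the variance is concentrated in a handful of coefficients, so k stays small even when N is large.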

ASYMPTOTIC PROPERTIES
In this section, we provide some theoretical support for the wavelet-based adaptive LASSO method with magnitudes of wavelet coefficients as weights (i.e., w_h = |θ̂_h|^{-1} in Equation (3)). In addition, we study the correct selection property of one screening approach: selection by magnitudes of wavelet coefficients.

CONSISTENT ESTIMATION
We investigate asymptotic properties of the wavelet-based weighted LASSO estimator when the curves are increasingly densely observed (N → ∞) as the sample size increases (n → ∞). After applying a screening step (i.e., selection by magnitudes of wavelet coefficients), the dimensionality of the functional predictor in the wavelet domain is reduced from N_n to k_n; here we subscript quantities that vary with the sample size n. Consequently, Equation (2) becomes

y_i = Σ_{h=1}^{k_n} z_ih β_h + ε*_i,

where ε*_i = ε_i + ξ_i and ξ_i = Σ_{h=k_n+1}^{N_n} z_ih β_h is the screening error. Let Z_{k_n} be an n × k_n matrix whose columns are the k_n wavelet coefficients that remain after screening. Let H_n = {h : |β_h| ≥ C_n} with C_n > 0, with cardinality S_n = |H_n|. Let ρ_{k_n} be the smallest eigenvalue of Σ_{k_n} = n^{-1} Z_{k_n}^T Z_{k_n}. In addition, let ζ_{H_n} = min_{h ∈ H_n} |β_h| be the smallest magnitude of the coefficients. Following Lee and Park (2012), who proposed a general framework for penalized least squares estimation of η in Equation (1), we make the following assumptions: (a1) After the screening step, k_n is larger than the greatest index in the set H_n.
(a3) η is a q times differentiable function in the Sobolev sense (i.e., η ∈ W q [0, 1]), and the wavelet basis has p vanishing moments, where p > q.
Assumption (a1) is satisfied if k_n → ∞ as n → ∞. Also, due to the correct selection property of the screening method (see Section 3.2), for sufficiently large n, the selected subset of k_n wavelet coefficients includes the nonzero ones with probability tending to one. Assumption (a2) is satisfied if n^{1/2} λ_n S_n ζ_{H_n}^{-1} → 0.
Theorem 1. Let η̂ be the estimated coefficient function for model (2), let k_n be the number of predictors remaining in the model after the screening step, and let ρ_{k_n} be the smallest eigenvalue of Σ_{k_n}. If Assumptions (a1)-(a4) hold, then η̂ is a consistent estimator of η. The proof is provided in the Appendix. Note that Theorem 1 relies on correct selection of the nonzero wavelet coefficients in the screening step. If no screening is applied prior to model fitting, the analogous result holds with ρ_{k_n} replaced by ρ_{N_n}, the smallest eigenvalue of Σ_{N_n} = n^{-1} Z_{N_n}^T Z_{N_n}, where Z_{N_n} is the n × N_n matrix of all wavelet coefficients. Clearly, if the screening strategy selects all nonzero wavelet coefficients with probability tending to 1, the proposed method with a screening step improves the estimation compared to the one without. Johnstone and Lu (2009) showed that the probability that the selected subset does not contain the wavelet coefficients with the largest sample variances is polynomially small. In the next section, we show that the selected subset M̂ contains the largest population magnitudes with probability tending to 1. We assume each wavelet coefficient Z_h follows a normal distribution, that is,

Z_h ∼ N(θ_h, σ_h^2),  h = 1, 2, ..., N.  (5)

PROBABILITY OF FALSE EXCLUSION
Without loss of generality, let the population magnitudes be ordered θ_1 ≥ θ_2 ≥ ··· ≥ θ_N, and let θ̂_(1) ≥ θ̂_(2) ≥ ··· ≥ θ̂_(N) denote the ordered sample magnitudes.

Theorem 2. Assume (5), and let C_h = σ_h/θ_h, h = 1, 2, ..., N, where 0 < C_h < C_0. Let Φ(·) be the cumulative distribution function of a standard normal variable. With γ_n = √(log(n)/n), θ_k = b γ_n where b > 0, a suitably chosen constant d > 1, and a subset of k variables selected, an explicit upper bound on the probability of a false exclusion, P(FE), can be derived. The proof of this theorem, following the steps in the proof of Theorem 3 of Johnstone and Lu (2009), is provided in the Appendix. The bound is a function of the number of observation points N, the sample size n, the size of the selected variable set k, the smallest wavelet coefficient magnitude in the selected model θ_k, and the signal-to-noise ratio as captured by the bound C_0 on the coefficients of variation. As an example, if the size of the selected set is k = 50, while N = 1000, d = 2, and b = 0.75, then P(FE) ≤ 0.05 for C_0 = 3 with n = 100. The probability of false exclusion reduces to 0.009 if the sample size increases to 200.

NUMERICAL STUDIES
In this section, we perform simulations to study the finite sample performance of the wavelet-based weighted LASSO as well as different screening approaches in functional linear regression. To simplify notation in figure legends, labels, and tables, we use "LASSO" to denote the wavelet-based LASSO approach, and "Wv," "Wm," and "Wc" for the wavelet-based weighted LASSO with sample variances, magnitudes of wavelet coefficients, and magnitudes of correlation coefficients as the bases of the penalty weights, respectively.

SIMULATION STUDY-1D FUNCTIONAL PREDICTOR
We employ similar settings as those in Zhao, Ogden, and Reiss (2012). Specifically, functional predictors, x i (t), t ∈ (0, 1), are generated from a Brownian bridge stochastic process. That is, X(t) is a continuous zero-mean Gaussian process both starting and ending at 0 and with cov(X(t), X(s)) = s(1 − t) for s < t. The true coefficient function is "heavisine" (see Figure S1 in supplementary materials). Performance of the proposed methods is compared under different noise levels, where signal-to-noise ratio (S/N), measured by the squared multiple correlation coefficient of the true model, is set to 0.2, 0.5, and 0.9 respectively. The curve is sampled at N = 1024 equally spaced time points. We carry out 200 simulations for each setting of the parameters with the sample size (n) fixed at 100 for each run. The discrete wavelet transform is performed using "wavethresh" package (Nason 2010) in R 2.15.1. In this study, we use "Daubechies least asymmetric family" with periodic boundary handling and the filter number is set to 4.
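The Brownian bridge predictors can be simulated as follows. This is a sketch under stated assumptions: we build a random walk on an interior grid and pin it to zero at the right endpoint, which matches cov(X(s), X(t)) = s(1 − t) up to grid effects; the function name and grid convention are our own.

```python
import numpy as np

def brownian_bridge(n_curves, N, seed=0):
    """Approximate Brownian bridge paths sampled at N interior grid points of (0, 1)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, N + 1) / (N + 1)
    dW = rng.normal(0.0, np.sqrt(1.0 / (N + 1)), (n_curves, N))
    W = np.cumsum(dW, axis=1)                  # Brownian motion on the grid
    return W - np.outer(W[:, -1] / t[-1], t)   # pin the right endpoint to zero

X = brownian_bridge(20000, 64)
mid_var = X[:, 31].var()                       # theory: t(1 - t) = 0.25 near t = 0.5
```

With many simulated curves, the empirical pointwise variance near t = 0.5 is close to the theoretical value t(1 − t) = 0.25.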
The L1 penalty parameter λ and the wavelet decomposition level j_0 are selected by five-fold cross-validation. The size k of the set of selected variables in the screening step should be determined from the data. In this study, we restrict the maximum number of wavelet coefficients selected at the screening step to n − 1 for all screening approaches except screening by stability selection. When screening by stability selection, for each dataset we run the proposed methods 400 times on random subsamples of size n/2 drawn without replacement. Variables appearing in the final models at least 80% of the time are retained for further analysis after the screening step.

PREDICTION AND ESTIMATION: WEIGHTED VERSUS UNWEIGHTED LASSO
Performance of the various methods is compared according to their prediction ability and estimation accuracy. Prediction ability is measured by the mean absolute error of prediction, MAE = (1/n) Σ_{i=1}^n |Z_i^T β̂ − Z_i^T β|, while estimation accuracy is measured by the mean integrated squared error, MISE = ∫_0^1 (η̂(t) − η(t))^2 dt. Table 1 shows MAEs and MISEs for the proposed methods in combination with different screening approaches. The weighted LASSO methods clearly give smaller prediction errors and better estimation accuracy than the unweighted LASSO does. When the SNR is high (i.e., R^2 = 0.9), all three weighted LASSO approaches have prediction accuracy similar to that of the unweighted LASSO. However, the weighted LASSO methods result in about a 15%-19% reduction in prediction error for smaller SNR; the prediction errors from the weighted methods when R^2 = 0.2 are even smaller than those from the unweighted LASSO when R^2 = 0.5. Compared to the unweighted LASSO, the weighted methods show around a 65%-84% reduction in MISE, depending on the weighting scheme and SNR.

[Figure caption: Rows: R^2 = 0.9, 0.5, 0.2; Column 1: no screening; Column 2: screening by variance; Column 3: screening by magnitude of wavelet coefficients; Column 4: screening by magnitudes of correlation coefficients; Column 5: screening by stability selection. The horizontal line indicates the sample mean from Wm with the magnitude screening method.]

Mean and standard deviation functions obtained from the 200 simulated datasets are plotted in Figure 3. Although the unweighted LASSO shows good prediction accuracy, the mean estimated coefficient function η̂ does not approximate the truth well and has large variability. In contrast, the weighted methods clearly improve estimation accuracy, and the resulting η̂ tends to be more stable across the simulated datasets, as demonstrated by much lower point-wise variability in the functional coefficient estimates.
The pointwise standard deviations from unweighted LASSO range from 10 to 30 at most points, while those from weighted LASSO methods are generally in the range of 1-4. It is not surprising to observe the conflict between good prediction and consistent estimation in the unweighted LASSO. Meinshausen and Bühlmann (2006) showed that, in linear regression model, the optimal λ chosen by cross-validation criterion in LASSO often resulted in inconsistent variable selection. This is also observed with unweighted LASSO.
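The two performance criteria used above can be computed as follows (a sketch; `mae` and `mise` are our own helper names, with the integral approximated by a Riemann sum on the equally spaced sampling grid):

```python
import numpy as np

def mae(Z, beta_hat, beta):
    """Mean absolute error of prediction: (1/n) sum_i |Z_i^T beta_hat - Z_i^T beta|."""
    return float(np.mean(np.abs(Z @ (beta_hat - beta))))

def mise(eta_hat, eta):
    """Integrated squared error of eta_hat on an equally spaced grid of [0, 1]."""
    return float(np.mean((eta_hat - eta) ** 2))

Z = np.array([[1.0, 0.0], [0.0, 2.0]])
err_mae = mae(Z, np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # |1| and |-2| average to 1.5
err_mise = mise(np.ones(4), np.zeros(4))                       # constant error of 1 integrates to 1.0
```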

EFFECTS OF SCREENING ON MODEL ESTIMATION
From Table 1, screening by variance, wavelet coefficient magnitude, or stability selection results in significant improvements in prediction and estimation accuracy for the LASSO. The prediction errors are reduced by up to 20% with the screening step and, most notably, the estimation errors drop by up to 93%. It should be noted that those three screening approaches show limited or no improvement in either prediction error or estimation accuracy when the model is fit using the weighted LASSO methods.
Irrespective of the signal-to-noise ratio, the gain from adding a screening step is substantial in terms of the stability of the coefficient function estimate in most settings. Figure 4 and Figures S2, S3, and S4 (in the supplementary materials) demonstrate that the screening step yields significant reductions in the variability of the coefficient function estimate. This effect is most striking for the LASSO method, for which the point-wise standard deviations were reduced from a wide range of values (e.g., 10-50) to values less than 2.
In general, screening by magnitude and screening by variance of the wavelet coefficients perform similarly, and they tend to be better than the other screening methods in terms of both prediction and estimation accuracy. Screening by stability selection generally gives desirable results, but it is computationally expensive. Compared to the other strategies, screening by correlation performs worst in terms of both prediction error and estimation accuracy: it shows very limited improvement over the underlying methods when the S/N is high, and it degrades prediction when the S/N is low, regardless of the weighting scheme applied. It should be noted that when the model-fitting method is the unweighted LASSO or the LASSO with magnitudes of correlation coefficients as penalty weights, screening by correlation does not improve estimation accuracy or stability. One possible reason is that a group of highly correlated wavelet coefficients may remain after the prescreening step, and the LASSO tends to select only one variable from such a group, an observation made by Zou and Hastie (2005).

SIMULATION STUDY-2D IMAGE PREDICTOR
Following Reiss et al. (2014) and Goldsmith, Huang, and Crainiceanu (2014), we simulate 500 images based on the first 332 weighted principal components from a subsample of the ADHD-200 data (ADHD-200 Consortium 2012), where the weights are randomly drawn from N(0, λ̂_j), with λ̂_j, j = 1, ..., 332, being the corresponding jth eigenvalue. The true coefficient image (η_1) is generated from the first five principal components computed from the ADHD-200 data. Similar to the settings for the 1D functional predictor, we generate 200 sets of continuous outcomes with n = 500 and the S/N controlled at 0.2, 0.5, and 0.9, respectively.
Prediction ability and estimation accuracy are compared based on the criteria defined in Section 4.1. The prediction errors from the 200 simulated datasets can be found in Table S1, and the mean estimated image coefficients are shown in Figure S5 in the supplementary materials. The LASSO weighted by the magnitudes of wavelet coefficients shows slightly better performance in terms of estimation accuracy, though it does not generally show better prediction.

REAL DATA ANALYSIS
4.5.1 The Wheat Data: 1D Functional Predictor. The wheat dataset consists of 100 samples with near-infrared (NIR) spectra measured from 1100 nm to 2500 nm in 2-nm intervals (Kalivas 1997). The aim of this study is to establish whether the NIR spectra of wheat samples can be used to predict wheat quality parameters (e.g., moisture and protein contents). In our analysis, we use the once-differenced spectra to correct for a baseline shift. The spectra are then linearly interpolated at 512 equally spaced points.

The ADHD-200 Data: 2D Image Predictor.
This dataset comes from the recent ADHD-200 Global Competition (ADHD-200 Consortium 2012). The image predictors are maps of fractional amplitude of low-frequency fluctuations (fALFF) from resting-state functional magnetic resonance imaging (rs-fMRI), where fALFF indicates the intensity of regional spontaneous brain activity and is defined as the ratio of the BOLD signal power spectrum within the low-frequency range (i.e., 0.01-0.08 Hz) to that of the entire frequency range. The 2D image predictors are obtained from the slices for which mean fALFF has the largest variability across voxels. We have 333 samples in this study, and we aim to determine the relationship between brain measurements and IQ. IQ was first regressed on age, gender, and handedness, and the residuals were used as the new responses.

PREDICTION, ESTIMATION, AND STATISTICAL INFERENCE
4.6.1 Prediction Accuracy. The data are randomly split into two halves, with the first half used as training data and the second half as testing data; we repeat this process 10 times. The mean absolute prediction errors summed over the testing datasets are reported in Table S2 for the wheat data and Table S3 for the ADHD-200 data. In both datasets, the MAEs from the wavelet-based weighted methods are comparable to or smaller than those from the wavelet-based LASSO method, and the screening step generally results in smaller MAEs compared to those from the underlying methods. Figure 5 shows the estimated functional coefficients and their corresponding confidence intervals when the outcome measure is moisture content. Consistent with what we find in the simulation studies, the functional coefficient estimates from the wavelet-based weighted LASSO methods are more stable and consistent. The estimated coefficient functions (Rows 1 and 2 in Figure 5) suggest a negative relationship between moisture content and the intensity of transmission of NIR radiation over the spectral range 1900-2150 nm.

[Figure 5. The wheat data: estimated coefficient functions with their corresponding pointwise (light gray) and joint (dark extensions) confidence intervals (Rows 1-2), and permutation tests assessing statistical significance of the relationship (Rows 3-4).]

The Wheat Data.
Confidence intervals for ω(t) can be generated using a nonparametric bootstrap approach. We used B = 1000 nonparametric bootstrap samples of the matched pairs (Y_i, X_i(t)), reestimating ω(t) for each sample. The pointwise estimators of the mean coefficient function and its standard deviation are ω̄(t) = B^{-1} Σ_{b=1}^B ω̂_b(t) and s(t) = [B^{-1} Σ_{b=1}^B (ω̂_b(t) − ω̄(t))^2]^{1/2}, respectively. The 95% pointwise confidence intervals can be obtained as ω̄(t) ± 1.96 s(t). The 95% joint confidence intervals take the form ω̄(t) ± q_0.95 s(t), where q_0.95 is the 95% quantile of M_b, b = 1, 2, ..., B, and M_b is the maximum over the entire range of t of the standardized deviation of the bth bootstrap realization from the mean. A detailed algorithm for calculating M_b can be found in Crainiceanu et al. (2012b). A permutation-based test developed by James, Wang, and Zhu (2009) can be used for testing the statistical significance of the relationship between the NIR spectrum of wheat and moisture content.
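The pointwise and joint bands can be assembled from the B bootstrap curves as follows. This is a sketch under stated assumptions: `omega_boot` holds the B reestimated coefficient functions on the sampling grid, and the joint band uses the max-statistic quantile described above; the synthetic curves are illustrative only.

```python
import numpy as np

def bootstrap_bands(omega_boot, level=0.95):
    """Pointwise and joint confidence bands from a (B, N) array of bootstrap curves."""
    mean = omega_boot.mean(axis=0)
    s = omega_boot.std(axis=0)                 # pointwise bootstrap SD (divides by B)
    point = (mean - 1.96 * s, mean + 1.96 * s)
    # M_b: maximum standardized deviation of the b-th bootstrap curve from the mean
    M = np.max(np.abs(omega_boot - mean) / np.where(s > 0, s, 1.0), axis=1)
    q = np.quantile(M, level)
    joint = (mean - q * s, mean + q * s)
    return point, joint, q

rng = np.random.default_rng(3)
omega_boot = rng.normal(0.0, 1.0, (2000, 50)) + np.sin(np.linspace(0, np.pi, 50))
point, joint, q = bootstrap_bands(omega_boot)
```

Because the joint quantile accounts for the maximum over all grid points, q exceeds 1.96 and the joint band is wider than the pointwise band everywhere.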
Rows 3 and 4 in Figure 5 illustrate the permutation test results. The horizontal line with a solid dot indicates the observed R² for each method applied to the wheat data; the other dots in each plot are permuted R² values. We permuted the response variable 1000 times and calculated a permuted R² for each permutation. All permuted R² values were well below the observed R², especially when the Wm and Wv methods are employed, providing very strong evidence of a true relationship between the NIR spectrum and wheat moisture content. This strong relationship is plausible, as near-infrared spectroscopy is by now a well-established technique for the quantitative analysis of food products. Note that there is a discrepancy between the bootstrap estimates and the permutation test of zero relationship when the LASSO or Wc methods are applied, providing evidence that the unstable estimation of LASSO and Wc degrades the performance of the bootstrap method.
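The permutation test can be sketched as follows. This is our illustration of the generic procedure (permute the response, refit, compare R² values), not the authors' implementation of James, Wang, and Zhu (2009); `fit_predict` is a placeholder for any fitting routine:

```python
import numpy as np

def permutation_r2_test(Y, X, fit_predict, n_perm=1000, rng=None):
    """Permutation test of no relationship between Y and X.

    fit_predict : callable (Y, X) -> fitted values for Y.
    Returns the observed R^2, the permuted R^2 values, and a p-value.
    """
    rng = np.random.default_rng(rng)

    def r2(y, yhat):
        return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

    obs = r2(Y, fit_predict(Y, X))
    perm = np.empty(n_perm)
    for p in range(n_perm):
        Yp = rng.permutation(Y)             # break the Y-X link, then refit
        perm[p] = r2(Yp, fit_predict(Yp, X))
    pval = (1 + np.sum(perm >= obs)) / (1 + n_perm)
    return obs, perm, pval
```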

The ADHD-200 Data.
We plot the mean coefficient images in Figure 6 (Row 1) for the ADHD study. The mean coefficient images from Wm and Wv are sparser and arguably more interpretable than those from wavelet-based LASSO and Wc. The algorithm in Crainiceanu et al. (2012b) can be extended to the 2D image setting to obtain pointwise and joint confidence intervals for the coefficient image. Row 2 in Figure 6, derived from the 95% pointwise confidence intervals, indicates brain regions with significant associations between the image predictor and the response variable, IQ. Figure 6 also illustrates the permutation test results for the ADHD data. As with the wheat data, we observed a discrepancy between the bootstrap estimates and the permutation test of zero relationship when the LASSO or Wc methods are applied. Across the real-data examples, the original wavelet-based LASSO and the proposed weighted versions agree in terms of permutation test results but differ in bootstrap-based confidence interval estimation. This is expected because permutation tests are based on how well the predictor predicts the response, and both the original method and the proposed methods generally show excellent prediction ability. In contrast, bootstrap-based approaches perform well only if the underlying method produces accurate coefficient estimates with small variance. The original wavelet-based LASSO is unstable in coefficient estimation, which degrades the performance of the bootstrap-based approach.

DISCUSSION
The primary contributions of this article are the investigations of several strategies in the wavelet-based LASSO context for (a) screening coefficients and (b) coefficient-specific weighting of the L 1 penalty. These various strategies have been studied in terms of estimation and prediction as well as in terms of estimator stability. Additionally, this article illustrates one of the strengths of wavelet methods in this context: the basic extension of functional data methods from one-dimensional signals to two- or three-dimensional images is straightforward both conceptually and computationally. One important point to keep in mind is that, due to the greater natural dimensionality of imaging data, this extension merits further study.

[Figure 6. The ADHD data: Row 1: estimated coefficient images; Row 2: corresponding images indicating regions with a statistically significant relationship based on 95% pointwise confidence intervals; Row 3: observed R² (horizontal line) and permuted R² (circles) for assessing statistical significance of the relationship.]
Wavelet-based LASSO shows strong prediction but relatively poor estimation accuracy, especially in the setting in which the "irrepresentable condition" is violated. In this study, we showed that the procedure can be improved by adding a prescreening step prior to model fitting or, alternatively, by using a weighted version of wavelet-based LASSO. The proposed approaches in general show better prediction ability as well as improved estimation accuracy compared to wavelet-based LASSO. An additional advantage is more stable coefficient function estimation, as evidenced by results from a simulation study indicating that the estimated coefficient functions obtained by the weighted LASSO methods all have significantly lower pointwise variability. These advantages are most striking for models with 1D signal predictors but can still be seen for models with 2D image predictors.
The key point in variable/feature selection is to separate the important predictors from the unimportant ones. In general, the "importance" of a variable can be defined either by its relationship with the response variable or by its ability to represent salient features of the signals/predictors. The former definition corresponds to a supervised approach, in the sense that the selection process uses information from both the response variable and the predictors, while the latter, which considers only information from the predictors, corresponds to an unsupervised approach.
Two of the prescreening approaches considered here (i.e., screening by correlation and stability selection) can be thought of as supervised screening because they take into account the relationship of each predictor to the outcome variable. On the other hand, the other two approaches (i.e., screening by magnitude and screening by variance of the wavelet coefficients) can be thought of as unsupervised because a predictor's importance is determined based only on its ability to represent the underlying signals or images. Based on the results shown here, we conclude that the unsupervised screening approaches tend to work slightly better than the supervised approaches. It is worth noting that screening by correlation rests on an implicit assumption that the correlation matrix of the X-variables is not too far from the identity (Fan and Lv 2008). If the correlations among predictors are expected to be high, screening by magnitude or variance could be expected to give better model performance. Although screening by stability selection generally performs well, it comes with considerably higher computational cost and thus might not be suitable for ultrahigh-dimensional data. Notably, we also implemented and tested an approach of screening by magnitude of covariance, a hybrid of the screening-by-correlation and screening-by-variance approaches. However, this approach did not tend to outperform screening by variance in our simulation settings and real-data examples.
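The two unsupervised screening rules can be sketched in a few lines. This is an illustrative sketch under our own conventions (rows are subjects, columns are wavelet coefficients), not the authors' implementation:

```python
import numpy as np

def screen_unsupervised(W, k, by="variance"):
    """Keep the k wavelet-coefficient columns judged most 'important'.

    W : (n, p) matrix of wavelet coefficients (one row per curve or image).
    by = "variance":  columns with the largest sample variance across subjects;
    by = "magnitude": columns with the largest mean absolute coefficient.
    Returns the indices of the retained columns, in their original order.
    """
    if by == "variance":
        score = W.var(axis=0)
    elif by == "magnitude":
        score = np.abs(W).mean(axis=0)
    else:
        raise ValueError(f"unknown screening rule: {by}")
    return np.sort(np.argsort(score)[-k:])  # indices of the top-k columns
```

The retained columns would then be passed to the LASSO step; neither rule looks at the response, which is what makes the screening unsupervised.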
Weighted LASSO involves incorporating weights into the L 1 penalty terms. The distinction from ordinary LASSO is subtle but important. When the model violates the "irrepresentable condition," weighted LASSO tends to avoid spurious selection of noise predictors by penalizing the "important" variables less and the "unimportant" ones more. As noted earlier, the meaning of "importance" depends on whether a supervised or an unsupervised approach is being considered. For the unsupervised approaches, we are guided by the understanding that many signals can be sparsely represented in the wavelet domain. By this we mean that the "energy" in the k coefficients with the largest magnitudes or variances tends to be close to the total energy, even for small k. Thus, the wavelet coefficients with small magnitudes or variances contain mostly noise, and the importance of each wavelet coefficient can be measured by its magnitude or variance.
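A common way to fit a weighted L 1 penalty with off-the-shelf software is the column-rescaling trick. The sketch below is our illustration of that generic trick, assuming scikit-learn is available; it is not the authors' implementation, and the weight construction (e.g., from coefficient magnitudes or variances) is supplied by the caller:

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, weights, alpha=0.1):
    """Weighted LASSO: minimize ||y - Xb||^2 / (2n) + alpha * sum_j w_j |b_j|.

    Implemented via the rescaling trick: solving an ordinary LASSO in
    Z = X / w (column-wise) and mapping the solution back as b_j = c_j / w_j
    gives exactly the weighted-penalty solution.
    weights : (p,) positive weights; small weight = light penalty.
    """
    w = np.asarray(weights, dtype=float)
    Z = X / w                               # column j scaled by 1/w_j
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=50000).fit(Z, y)
    return fit.coef_ / w                    # undo the rescaling
```

In the unsupervised variants, one would set w_j inversely proportional to the magnitude or variance of the jth wavelet coefficient, so that "important" coefficients are shrunk less.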
We posit that the effect of a screening step and the effect of adding weights to the L 1 penalty terms are similar in nature. LASSO variable selection has been shown to be consistent under certain regularity conditions (Meinshausen and Bühlmann 2006; Zhao and Yu 2006). When those regularity conditions are violated, it becomes more difficult to separate the important predictors from the unimportant ones. Weighted LASSO heavily shrinks the unimportant variables by penalizing them more heavily, thus making the important variables more distinguishable from the unimportant ones. Alternatively, the screening step discards the unimportant variables at the outset, resulting in improved regularity conditions on the design matrix. Therefore, both approaches tend to improve the performance of the LASSO estimator in similar ways. We should also acknowledge that when no variable violates the "irrepresentable condition," it is relatively easy to separate the nonzero coefficients from the zero ones; in that case, all methods yield remarkably similar results in terms of both prediction error (data not shown) and estimation accuracy.
In summary, the performance of wavelet-based LASSO can be improved by including weights in the L 1 penalty terms or by adding a screening step before model fitting, and either of these extensions tends to give a roughly equal increase in performance. In most situations, we have found that combining both screening and weighting does not further improve results. Either of these two enhancements has nice theoretical properties and still enjoys the computational advantage of wavelet-based LASSO. Implementations of the proposed methods can be found in the online supplementary materials (Crainiceanu et al. 2012a).

SUPPLEMENTARY MATERIALS
Simulation: Additional simulation results can be found in the Supp.pdf file.
Appendix: The derivations of Theorems 1 and 2 can be found in the Appendix.pdf file.
data.zip: The data.zip file includes the datasets used in the study.
Rcode.zip: The Rcode.zip file includes R code to perform the proposed methods.