Robust and sparse learning of varying coefficient models with high-dimensional features

ABSTRACT The varying coefficient model (VCM) is used extensively across scientific fields owing to its ability to capture the changing structure of predictors. Classical mean regression analysis is often complicated by skewed, heterogeneous and heavy-tailed data. To address this, this work employs the idea of model averaging and introduces a novel comprehensive approach that incorporates quantile-adaptive weights across different quantile levels to improve on both least squares (LS) and quantile regression (QR) methods. The proposed procedure, which adaptively exploits the heterogeneous and sparse nature of the input data, gains efficiency and adapts well to extreme-event cases and high-dimensional settings. Motivated by these properties, we develop several robust methods that reveal the dynamic close-to-truth structure of the VCM and consistently uncover the zero and nonzero patterns in high-dimensional scientific discoveries. We provide a new iterative algorithm that is proven to be asymptotically consistent and attains the optimal nonparametric convergence rate under regularity conditions. The proposed procedures are illustrated with extensive simulation examples and several real data analyses, which demonstrate their stronger predictive power compared with LS, composite quantile regression (CQR) and QR methods.


Introduction
Varying coefficient models (VCMs) are powerful and popular statistical learning techniques, widely applied across scientific fields due to their flexibility and explanatory power. In contemporary scientific research, the collected data often have the significant feature that the dimensionality p can diverge at a non-polynomial rate with the sample size n; attention has therefore focused on variable selection techniques for nonparametric models that can help gain scientific insights while remaining flexible enough to reduce modeling biases. Among nonparametric models, the VCM is particularly attractive for high-dimensional data owing to its simple form and its remarkable ability to avoid much of the curse of dimensionality. Practically, the demand for understanding dynamic changes of high-dimensional covariates, such as time-varying effects, is growing [3]. The VCM has gradually become a superior modeling tool in fields ranging from economics, finance, politics, epidemiology and medical science to ecology and machine learning, and has experienced considerable development over the past several decades. The VCM is defined to be

Y = Σ_{j=1}^p α_j(U) X_j + ε = X^T α(U) + ε,   (1)

where Y is the response, U is a univariate observable exposure variable, X = (X_1, …, X_p)^T is the vector of covariates, and ε is the random noise with probability distribution and density functions F(·) and f(·), respectively. An intercept term (i.e. X_0 ≡ 1) can be included if necessary. The regression coefficient functions α(·) = (α_1(·), …, α_p(·))^T vary smoothly with U, and nonlinear interaction between U and X is allowed.
In the literature, when p is moderate, three types of approaches are used to estimate α(·) in model (1): kernel-local polynomial smoothing procedures [7,12,32], polynomial spline procedures [13,14] and smoothing spline procedures [2,10,12]. However, these procedures suffer a loss of efficiency in the presence of outliers, and estimation accuracy decreases substantially when the input features are heavy-tailed or skewed, a stylized characteristic of high-dimensional data that is common in health care and environmental applications. To apply the VCM successfully to high-dimensional data, one common strategy is to use feature screening and variable selection methods [8,20,21]. But these methods, which are typically based on the conditional mean, still face the challenge of designing a robust algorithm. To tackle these problems, procedures based on quantile regression (QR) are better alternatives, since QR provides a more flexible and systematic way of quantifying the relationship between the response and the covariates at different quantile levels. Despite its extensive development and applications, QR has seen relatively little progress on the VCM. To account for censoring, weighted inverse-probability estimators for quantile regression varying coefficients have been introduced [27,33]. To deal with high-dimensional features, a screening method via quantile partial correlation has been proposed for the VCM [18]. Other relevant works include, but are not limited to, Honda [11], Cai and Xu [1] and Kim [15]. However, QR estimation can suffer from great variability at the tails, since extreme quantiles are typically associated with data sparsity. It is therefore desirable to aggregate information across multiple quantiles to improve estimation efficiency over single-quantile QR. Inspired by these previous works, and to overcome the aforementioned limitations, this study focuses on robust and sparse learning of varying coefficient models in order to further improve the estimation efficiency and accuracy of the varying coefficients.
First, we consider robust learning of the VCM by employing the idea of model averaging when the number of covariates is small. We form a set of candidate varying coefficient quantile regression models at different quantiles. By weighted averaging over all candidate models, one can even out overestimation and underestimation, making the prediction results less sensitive to any single selected quantile, even when conditioning on poor data. The chosen weights are therefore crucial to the final results. For this purpose, we introduce novel quantile-adaptive weights, obtained via the sparsity function, for each potential quantile level, so that the resulting estimators enjoy desirable properties and asymptotic optimality. We refer to this weighted approach as sparsity-integrated quantile regression (SIQR), and develop two robust variants, local SIQR and spline-based SIQR, for parameter estimation in the VCM. We demonstrate that, by effectively integrating the strengths of multiple QRs, the efficiency and accuracy of the SIQR estimators for (1) can be improved dramatically. This is also highlighted in the simulation examples, especially when the input data exhibit a high level of skewness and heterogeneity.
Another objective of this paper is sparse modeling and learning of the VCM with high-dimensional covariates. Under the sparsity assumption, feature screening together with variable selection helps improve estimation accuracy and gain scientific insight. To reduce dimension effectively, hypothesis testing [19] and regularization procedures, such as the SCAD and LASSO penalties, are frequently applied to the VCM [30,31]. Xue and Qu [35] study difference convex programming and investigate global optimality properties to deal with the computational difficulty and theoretical complexity of the high-dimensional setting. Feature screening methods have also emerged to improve the efficiency of estimators for the VCM, for example [3,21]. However, these procedures rely on mean regression and cannot adapt well to the heterogeneous and sparse nature of high-dimensional data, and some do not perform well in high-dimensional settings because of challenges in computational expediency and algorithmic stability. In fact, we find that aggregating information across multiple quantiles usually improves estimation efficiency when dealing with high-dimensional covariates. This in turn suggests a weighted procedure for robust variable selection. Motivated by this, we propose an estimation procedure, called penalized spline-based SIQR (PSQ), for estimating the varying coefficient functions and selecting active variables simultaneously. The advantages of PSQ are threefold. First, it is computationally scalable to big data sets with many features; the PSQ estimators are easy and fast to compute with the proposed iterative algorithm. Second, compared with conventional QR and mean regression methods, the PSQ approach is more robust and flexible. By taking multiple quantile levels into consideration, the PSQ procedure adapts well to heavy-tailed and tail-dependent data, and it is able to identify the true varying structure of the model when one needs to uncover dynamic relational information from data, such as time series data. Third, we show that, with a sparsity-inducing penalty, the PSQ estimator enjoys the oracle property: the irrelevant predictors can be consistently identified, and the corresponding coefficient functions can be estimated at the optimal nonparametric convergence rate under regularity conditions.
The rest of the paper is organized as follows. Section 2 describes the proposed methods, including the LSQ and SSQ procedures for small p and the PSQ procedure for large p. Section 3 investigates the asymptotic properties of the proposed methods. Section 4 provides extensive simulation examples, and Section 5 presents two real data analyses. Concluding remarks are given in Section 6. Technical proofs and additional simulations are deferred to the supplementary materials.

SIQR weights and computation
Consider a set of K quantiles {τ_k, k = 1, …, K}. The SIQR weights are defined as

w_k = 1 / { √(τ_k(1 − τ_k)) s(τ_k) },  k = 1, …, K,   (2)

where s(·) is the sparsity function defined below. These weights are motivated by the asymptotic properties of estimators for linear quantile regression [34]: for a given quantile level τ ∈ (0, 1), the quantile regression estimator satisfies √n(β̂_τ − β_τ) →_D N(0, τ(1 − τ)s²(τ)C^{−1}), where C is a positive definite matrix such that lim_{n→∞} n^{−1}X^T X = C, and →_D denotes convergence in distribution. Intuitively, the precision of β̂_τ depends on the richness of observations in the neighborhood of the τth conditional quantile.
Denote by s(τ) = 1/f(F^{−1}(τ)) the sparsity function [29]. Clearly, the weight (2) contains two parts. The first is √(τ_k(1 − τ_k)), the geometric mean of τ_k and 1 − τ_k, indicating the central tendency of the chosen quantile: the closer √(τ_k(1 − τ_k)) is to its maximum value 0.5, the closer the selected quantile is to the median. The second is s(τ_k), reflecting the richness of sample observations near this quantile. In what follows, we use the normalized weights w*_k = w_k / Σ_{k=1}^K w_k by convention and denote by ω* = (w*_1, …, w*_K)^T the normalized weight vector. The supplementary materials discuss in detail the superior performance of the novel quantile-dependent weights (2).
Practically, we provide three methods to calculate the SIQR weights. (1) Plug-in method. Replacing the unknown distribution with the normal distribution gives ŵ_{1k} = ϕ(Φ^{−1}(τ_k)) / √(τ_k(1 − τ_k)), where ϕ and Φ are the standard normal density and distribution functions, respectively. (2) Kernel-based method. Let K(·) be a kernel function with K_h(·) = K(·/h)/h, where h is a bandwidth, and let ε̂_i be the residuals of the τ_k-th quantile regression. Then ŵ_{2k} = { n^{−1} Σ_{i=1}^n K_h(ε̂_i − ĉ) } / √(τ_k(1 − τ_k)), where ĉ is the τ_k-th quantile of the residuals ε̂_i; the kernel average estimates f(F^{−1}(τ_k)) = 1/s(τ_k). (3) Derivative-based method. Using the fact that s(t) = dF^{−1}(t)/dt, with F̂_n the empirical distribution function, we follow the idea of Koenker and Machado [16]: s(t) is estimated by ŝ_n(t) = [Q̂_Y(t + h_n | x̄) − Q̂_Y(t − h_n | x̄)] / (2h_n), where Q̂_t(Y | x̄) is the t-th conditional quantile estimate of Y at x̄, x̄ is the mean of x, and h_n is a bandwidth chosen as in Koenker and Machado [16]; then ŵ_{3k} = 1 / { ŝ_n(τ_k) √(τ_k(1 − τ_k)) }.

Remark 2.1: The estimated weights based on the kernel-based and derivative-based methods are consistent. In practice, however, we recommend the plug-in method because of its simple form. What is gained by working under the normality assumption is an explicit and applicable formula for weight estimation. In practice we do not know whether the error is normally distributed. If it is, the plug-in method gives the optimal weights; if not, the plug-in method gives weights that are not too far from the optimum for all error distributions that are unimodal, fairly symmetric and do not have overly fat tails. Moreover, our empirical studies (supplementary materials) suggest that the kernel-based and derivative-based methods are good choices for estimating the SIQR weights in small-to-moderate sample problems, even when the normality assumption is violated, especially when the data exhibit clear skewness and heteroscedasticity in the tail. The plug-in method with the normality assumption can be used in larger-sample problems, since for a large number of problems, even with data contamination, it may still be reasonable to assume that the majority of samples have normal random errors.
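For concreteness, the following is a minimal Python sketch of the plug-in and kernel-based weight estimators; the function names and the Gaussian kernel choice are ours, and the bandwidth h is user-supplied.

```python
import numpy as np
from scipy.stats import norm

def siqr_weights_plugin(taus):
    """Plug-in SIQR weights under a normal reference:
    w_k = phi(Phi^{-1}(tau_k)) / sqrt(tau_k (1 - tau_k)), then normalized."""
    taus = np.asarray(taus, dtype=float)
    w = norm.pdf(norm.ppf(taus)) / np.sqrt(taus * (1.0 - taus))
    return w / w.sum()

def siqr_weights_kernel(residuals_by_tau, taus, h):
    """Kernel-based SIQR weights: estimate f(F^{-1}(tau_k)) = 1/s(tau_k) by a
    Gaussian kernel density of the tau_k-th QR residuals at their tau_k-th
    quantile. residuals_by_tau[k] holds the residuals of the tau_k-th fit."""
    w = np.empty(len(taus))
    for k, tau in enumerate(taus):
        e = np.asarray(residuals_by_tau[k], dtype=float)
        c = np.quantile(e, tau)                    # tau_k-th residual quantile
        f_hat = norm.pdf((e - c) / h).mean() / h   # kernel density estimate at c
        w[k] = f_hat / np.sqrt(tau * (1.0 - tau))
    return w / w.sum()

# Example: plug-in weights for K = 5 equally spaced quantiles
taus = np.arange(1, 6) / 6.0
print(siqr_weights_plugin(taus))   # the median quantile receives most weight
```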

Estimation when the number of predictors is small
Suppose we observe a random sample {(u_i, x_i^T, y_i), i = 1, …, n} from model (1). For any given quantile level τ ∈ (0, 1), let ρ_τ(u) = u(τ − I(u < 0)) denote the quantile loss function, where I(·) is the indicator function, and let Q_Y(τ | x) = inf{y : F_Y(y | x) ≥ τ} denote the τth conditional quantile of Y given x, where F_Y(· | x) is the conditional distribution of Y given x. In practice, direct quantile estimates at the tails (extremely high or low quantiles) are usually unstable due to data sparseness, and this motivates new estimation methods based on SIQR weights that aggregate information across quantiles to improve estimation efficiency. Specifically, we develop two robust procedures for the case where the number of predictors is small (p < n): the local SIQR (LSQ) procedure and the spline-SIQR (SSQ) procedure. We also extend the SSQ procedure to the high-dimensional setting (p ≫ n) by introducing sparsity-inducing penalty functions to consistently uncover the zero and nonzero patterns in α(·). Details of the proposed SSQ procedure are described in Section 2.3.

Proposed method: local-SIQR
Let 0 < τ_k < 1 be any quantile level. We first assume the following quantile regression model:

Q_Y(τ_k | x, u) = Σ_{j=1}^p α_{j,τ_k}(u) x_j,

where the α_{j,τ_k}(·), j = 1, …, p, are unknown quantile varying coefficient functions. They can be estimated by classical local linear quantile regression, defined through

(â, b̂) = argmin_{a,b} Σ_{i=1}^n ρ_{τ_k}( y_i − Σ_{j=1}^p [a_j + b_j(u_i − u)] x_ij ) K_{h_k}(u_i − u),   (3)

where a = (a_1, …, a_p)^T, b = (b_1, …, b_p)^T, and h_k is the bandwidth computed for the τ_k-th quantile level; the local linear estimate is α̂_{j,τ_k}(u) = â_j.
To achieve higher efficiency, we estimate α(·) by aggregating information across multiple local linear quantile estimators using the SIQR weights; we call this the local SIQR (LSQ) method. The LSQ method is constructed as follows: (1) take a collection of quantiles {τ_1, …, τ_K}; (2) employ Equation (3) at each selected quantile level to obtain a series of estimators {α̂_{j,τ_k}(·)}; (3) compute the SIQR weights (2) and their normalized version ω*. Then the LSQ estimates for (1) are

α̂_{LSQ,j}(u) = Σ_{k=1}^K w*_k α̂_{j,τ_k}(u),  j = 1, …, p.   (4)

The LSQ estimator is obtained by pooling information from multiple quantile levels. In our implementation, we use equally spaced quantiles, {1/(K + 1), 2/(K + 1), …, K/(K + 1)}. Our empirical study suggests that the LSQ estimator is more efficient than the estimator obtained at a single quantile level in almost all cases.
In our LSQ, we take the Epanechnikov kernel, K(u) = (3/4)(1 − u²)_+, which is typically employed for VCMs. To allow higher efficiency and more flexibility, we use the quantile-dependent bandwidth h_k = {τ_k(1 − τ_k)/f²(F^{−1}(τ_k))}^{1/5} h_LS, where h_LS is the asymptotically optimal bandwidth for LS, as suggested in Fan [5]. We employ the plug-in method to calculate the weights for convenience.
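A compact illustration of the LSQ construction is sketched below in Python. It relies on the positive homogeneity of the check loss, ρ_τ(cu) = cρ_τ(u) for c > 0, to absorb the kernel weights into a standard QR solver; statsmodels' QuantReg serves as that solver, the Yu–Jones-style bandwidth adjustment uses the normal plug-in, and h_ls is assumed to be supplied by the user. This is a sketch of the idea, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.regression.quantile_regression import QuantReg

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def local_linear_qr(y, x, u, u0, tau, h):
    """Local linear QR at u0. Since rho_tau(c * r) = c * rho_tau(r) for c > 0,
    kernel weights can be absorbed into the response and design."""
    w = epanechnikov((u - u0) / h)
    m = w > 0
    z = np.hstack([x[m], x[m] * (u[m] - u0)[:, None]])   # design for (a, b)
    fit = QuantReg(w[m] * y[m], z * w[m][:, None]).fit(q=tau)
    return fit.params[: x.shape[1]]                      # a-hat = alpha_tau(u0)

def lsq_estimate(y, x, u, u0, h_ls, K=5):
    """LSQ sketch: plug-in-weighted average of local linear QR fits over K
    equally spaced quantiles with quantile-dependent bandwidths."""
    taus = np.arange(1, K + 1) / (K + 1.0)
    w = norm.pdf(norm.ppf(taus)) / np.sqrt(taus * (1 - taus))
    w /= w.sum()
    hs = h_ls * (taus * (1 - taus) / norm.pdf(norm.ppf(taus)) ** 2) ** 0.2
    fits = np.array([local_linear_qr(y, x, u, u0, t, h)
                     for t, h in zip(taus, hs)])
    return w @ fits                                      # alpha-hat_LSQ(u0)
```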

Proposed method: spline-SIQR
In practice, the LSQ estimator may fail to reveal the true varying structure when α(·) is not smooth enough, even with the optimal bandwidth; see simulation Examples 4.1-4.2. To resolve this, we consider a spline-basis SIQR (SSQ) method for estimating the varying coefficient functions, which shares the same spirit as the LSQ procedure. Suppose that the jth coefficient function α_j(u) can be approximated by a basis expansion, α_j(u) ≈ Σ_{l=1}^{L_j} θ_jl B_jl(u), where B_jl(u) is a spline basis function, θ_jl is the spline coefficient, and L_j denotes the number of basis functions used to approximate α_j(u), which may vary across the α_j(u). Then, for a given quantile τ_k, the varying coefficient quantile regression model becomes Q_Y(τ_k | x, u) ≈ Σ_{j=1}^p Σ_{l=1}^{L_j} θ_jl B_jl(u) x_j. We use the B-spline basis.
Define a sequence of quantile levels τ_1 < … < τ_K ∈ (0, 1). For each k = 1, …, K, the spline coefficients {θ̂_{jl,τ_k}}_{j,l} are obtained by minimizing

Σ_{i=1}^n ρ_{τ_k}( y_i − Σ_{j=1}^p Σ_{l=1}^{L_j} θ_jl B_jl(u_i) x_ij ),   (5)

and we set α̂_{j,τ_k}(u) = Σ_{l=1}^{L_j} θ̂_{jl,τ_k} B_jl(u). Consequently, our SSQ estimators of the varying coefficient functions are

α̂_{SSQ,j}(u) = Σ_{k=1}^K w*_k α̂_{j,τ_k}(u),  j = 1, …, p,   (6)

where w*_k is the normalized weight derived from (2). In our SSQ, L_j = k_{n,j} + d + 1 is taken for the B-spline basis, where k_{n,j} is the number of interior knots for estimating α_j(u) and d is the degree of the spline. We choose d = 2 or d = 3, corresponding to quadratic or cubic splines. The selection of k_{n,j} is discussed in detail in the next section.
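A minimal sketch of the SSQ estimator follows, assuming scipy >= 1.8 for the B-spline design matrix and equally spaced interior knots on [0, 1]; the weighting reuses the plug-in formula above.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import norm
from statsmodels.regression.quantile_regression import QuantReg

def bspline_design(u, n_knots=3, degree=3):
    """B-spline design matrix on [0, 1] with n_knots interior knots, giving
    L = n_knots + degree + 1 basis functions (assumes scipy >= 1.8)."""
    interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    uu = np.clip(u, 0.0, 1.0 - 1e-12)
    return BSpline.design_matrix(uu, t, degree).toarray()

def ssq_estimate(y, x, u, u_grid, K=5, n_knots=3, degree=3):
    """SSQ sketch: B-spline QR at each of K quantiles, combined with the
    normalized plug-in SIQR weights; returns alpha-hat on u_grid."""
    n, p = x.shape
    B = bspline_design(u, n_knots, degree)               # n x L
    L = B.shape[1]
    Z = np.hstack([x[:, [j]] * B for j in range(p)])     # n x (p L) design
    taus = np.arange(1, K + 1) / (K + 1.0)
    w = norm.pdf(norm.ppf(taus)) / np.sqrt(taus * (1 - taus))
    w /= w.sum()
    Bg = bspline_design(u_grid, n_knots, degree)         # basis on the grid
    alpha = np.zeros((len(u_grid), p))
    for tau, wk in zip(taus, w):
        theta = QuantReg(y, Z).fit(q=tau).params         # theta-hat at tau
        for j in range(p):
            alpha[:, j] += wk * (Bg @ theta[j * L:(j + 1) * L])
    return alpha
```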

Proposed method: penalized SSQ
In Sections 2.2.1 and 2.2.2 we assumed that the dimensionality of the predictors is small and fixed. To extend the SIQR-based procedures to a high-dimensional setting and build an interpretable model with high predictive power, we develop a weighting-adaptive selection method based on SSQ for (1). Without loss of generality, assume that the functional coefficient vector α(·) = (α_1(·), …, α_p(·))^T is sparse, and define A_τ = {j : Q_τ(α_j²(U)) > 0} as the true sparse model with non-sparsity size s = |A_τ|, where Q_τ(·) is the τth quantile function. We assume A_τ is identical across quantiles and allow p to grow with n at a certain rate. To perform variable selection and coefficient estimation simultaneously, the penalized SSQ (PSQ) procedure is introduced. The detailed steps follow.
Step 1. Define a sequence of quantile levels τ_1 < … < τ_K ∈ (0, 1) and obtain the corresponding normalized weights {w*_k}_{k=1}^K from (2).

Step 2. For each quantile τ_k, consider the penalized quantile loss function

Σ_{i=1}^n ρ_{τ_k}( y_i − Σ_{j=1}^p π_j(u_i) x_ij ) + n Σ_{j=1}^p p_{λ_k,j}(‖π_j‖),   (7)

where π_j(u) = Σ_l θ_jl B_jl(u), and p_{λ_k,j}(·) is a penalty function with λ_k the penalization parameter determined for τ_k and X_j. Let θ_j = (θ_j1, …, θ_jL_j)^T and let R_j = (r_lk)_{L_j×L_j} be the Gram matrix of the basis functions with entries r_lk = n^{−1} Σ_{i=1}^n B_jl(u_i) B_jk(u_i), so that ‖π_j‖² = θ_j^T R_j θ_j and (7) can be reformulated as

Σ_{i=1}^n ρ_{τ_k}( y_i − Σ_{j=1}^p θ_j^T B_j(u_i) x_ij ) + n Σ_{j=1}^p p_{λ_k,j}(‖θ_j‖_{R_j}),   (8)

where B_j(u) = (B_j1(u), …, B_jL_j(u))^T and ‖θ_j‖_{R_j} = (θ_j^T R_j θ_j)^{1/2}. To simplify notation, further define Θ = (θ_1^T, …, θ_p^T)^T and Π_i = (B_1(u_i)^T x_i1, …, B_p(u_i)^T x_ip)^T. Then (8) can be rewritten as

ℓ_{τ_k}(Θ) = Σ_{i=1}^n ρ_{τ_k}( y_i − Π_i^T Θ ) + n Σ_{j=1}^p p_{λ_k,j}(‖θ_j‖_{R_j}),   (9)

and the minimizer of (9) is Θ̂_{τ_k}.

Step 3. Obtain the PSQ estimators of Θ and of α(·):

Θ̂_PSQ = Σ_{k=1}^K w*_k Θ̂_{τ_k},   (10)     α̂_{PSQ,j}(u) = Σ_{l=1}^{L_j} θ̂_{PSQ,jl} B_jl(u).   (11)

Remark 2.2: To carry out the above PSQ estimation procedure, three types of tuning parameters are needed. (1) The L_j (the numbers of B-spline basis functions) control the smoothness of the coefficient functions α(·). (2) The λ_k,j (penalization amounts) determine the sparseness of the model. (3) The τ_k (the sequence of quantiles) dominate the efficiency of the PSQ estimator. In our implementation we use equally spaced quantiles, as in Zou and Yuan [39]; it remains to select K. The selection of all three tunings is discussed in Section 2.3.2. Beyond that, the sparsity-inducing penalty function also plays a great role in both estimation accuracy and model sparsity. In the literature, the SCAD [6], the LASSO family [28,38], the MCP [37] and the TLP [25] are extensively used.
Although other penalties can be applied, this paper mainly adopts the adaptive lasso penalty [38], defined by p_{λ_k,j}(u_j) = λ_k η_j |u_j|^γ, where η = (η_1, …, η_p)^T is a known weight vector and γ > 0. Thereby, (9) can be rewritten as

ℓ_{τ_k}(Θ) = Σ_{i=1}^n ρ_{τ_k}( y_i − Π_i^T Θ ) + nλ_k Σ_{j=1}^p η_j ‖θ_j‖_{R_j}^γ,   (12)

where, following the adaptive lasso, η_j is built from a √n-consistent initial estimator of θ_j. Minimizing ℓ_{τ_k}(Θ) over Θ yields the adaptive-lasso penalized QR estimates Θ̂_{τ_k}, k = 1, …, K, and the PSQ estimator with the adaptive lasso penalty is Θ̂_PSQ = Σ_{k=1}^K w*_k Θ̂_{τ_k}.
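Because both the check loss and an ℓ1 penalty are piecewise linear, the adaptive-lasso penalized QR fit at one quantile can be computed exactly as a linear program. The sketch below does this for γ = 1 with a coefficient-wise |θ_j| penalty, a simplification of the grouped ‖θ_j‖_{R_j} penalty in (12); the names and the scaling of λ are ours.

```python
import numpy as np
from scipy.optimize import linprog

def adaptive_l1_qr(y, Z, tau, lam, eta):
    """Adaptive-lasso penalized QR as a linear program:
        min_theta  sum_i rho_tau(y_i - Z_i theta) + lam * sum_j eta_j |theta_j|.
    Split theta = tp - tm and residuals r = up - um, all nonnegative, so the
    check loss becomes tau * up + (1 - tau) * um and the penalty tp + tm."""
    n, m = Z.shape
    c = np.concatenate([lam * eta, lam * eta,
                        tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([Z, -Z, np.eye(n), -np.eye(n)])   # Z theta + r = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:m] - res.x[m:2 * m]                  # recovered theta-hat
```

The PSQ estimate then follows by averaging the per-quantile solutions with the normalized SIQR weights, exactly as in Step 3.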

Smoothing parameter selection and computational algorithm
This section discusses implementation issues for the proposed PSQ. The tuning parameters, namely the number of spline basis functions L_j, the penalty parameter λ, and the number of selected quantiles K, are determined one by one as follows.
(1) Tuning L_j. We use B-splines, with L_j = d + k_{n,j} + 1. To have enough flexibility, the basis should be rich enough, that is, of large enough dimension. Choosing L_j is thus equivalent to selecting k_{n,j}. Define k_n = (k_{n,1}, …, k_{n,p})^T. We apply a Schwartz-type information criterion (SIC); specifically, for a quantile τ,

SIC(k) = log( n^{−1} Σ_{i=1}^n ρ_τ( y_i − Π_i^T Θ̂(k) ) ) + (log n / (2n)) Σ_{j=1}^p (k_j + d + 1),

where k = (k_1, …, k_p) is a length-p vector and Θ̂(k) is the τth QR estimator of (12) with k knots. The optimal knots k_n are selected by a grid search minimizing the SIC criterion. (2) Penalization λ. We use a BIC-like criterion to choose λ; it can be shown that, when the tunings decay at a proper rate, PSQ is consistent in identifying the true parsimonious model. The BIC-like criterion is defined as

BIC(λ) = log( Σ_{i=1}^n ρ_τ( y_i − Π_i^T Θ̂_λ ) ) + df_λ · log n / (2n),

where df_λ is the number of nonzero coefficients. (3) K. We advocate an automatic data-driven method to determine K by virtue of the varying quantile-adaptive sparsity patterns, using leave-one-out cross-validation (CV): CV(K) is the average leave-one-out prediction loss of the PSQ fit with K quantiles, and minimizing CV(K) gives the optimal K. Empirically, we tried K = 5, 9 and 19 and found that, as K increases, the performance of the proposed methods varies little and even worsens in some scenarios, such as for heavy-tailed distributions. We therefore recommend a small value, say K = 5 or 9, to save computational cost, especially for large datasets.
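The BIC-like selection of λ can be sketched as follows, reusing adaptive_l1_qr from the previous sketch; the nonzero count uses a small numerical threshold, an implementation assumption.

```python
import numpy as np

def check_loss(r, tau):
    """Quantile check loss rho_tau summed over residuals."""
    r = np.asarray(r)
    return float(np.sum(r * (tau - (r < 0))))

def bic_like(y, Z, theta, tau, eps=1e-8):
    """BIC-like criterion: log(check loss) + df * log(n) / (2n),
    with df the number of (numerically) nonzero coefficients."""
    n = len(y)
    df = int(np.sum(np.abs(theta) > eps))
    return np.log(check_loss(y - Z @ theta, tau)) + df * np.log(n) / (2.0 * n)

def select_lambda(y, Z, tau, eta, lam_grid):
    """Grid search over the penalization level (uses adaptive_l1_qr above)."""
    fits = [adaptive_l1_qr(y, Z, tau, lam, eta) for lam in lam_grid]
    scores = [bic_like(y, Z, th, tau) for th in fits]
    best = int(np.argmin(scores))
    return lam_grid[best], fits[best]
```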
The above discussion leads to Algorithm 1, which alternates between selecting the knot vector k_n by minimizing the SIC criterion for the current λ, and selecting λ by minimizing the BIC-like criterion for the current k_n, until k_n and λ reach convergence. The algorithm is simple and easy to implement; in practice little iteration is needed, and two passes are generally enough.
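A sketch of this alternating loop is given below; it reuses the helper functions from the earlier sketches, and the adaptive-lasso weights are built from a least-squares pilot fit, which is our assumption rather than a prescription from the paper.

```python
import numpy as np

def algorithm_1(y, x, u, tau, knot_grid, lam_grid, n_iter=2):
    """Sketch of Algorithm 1: alternate SIC-based knot selection (lambda
    fixed) and BIC-based lambda selection (knots fixed); two passes usually
    suffice. Reuses bspline_design, adaptive_l1_qr, check_loss and bic_like
    from the earlier sketches."""
    n, p = x.shape
    kn, lam = knot_grid[0], lam_grid[0]

    def design(k):
        B = bspline_design(u, n_knots=k)
        return np.hstack([x[:, [j]] * B for j in range(p)])

    def fit(Z, l):
        pilot = np.linalg.lstsq(Z, y, rcond=None)[0]    # LS pilot estimate
        eta = 1.0 / (np.abs(pilot) + 1e-4)              # adaptive weights
        return adaptive_l1_qr(y, Z, tau, l, eta)

    for _ in range(n_iter):
        # (i) knots by the SIC-type criterion, lambda held fixed
        sic = []
        for k in knot_grid:
            Z = design(k)
            th = fit(Z, lam)
            sic.append(np.log(check_loss(y - Z @ th, tau) / n)
                       + Z.shape[1] * np.log(n) / (2 * n))
        kn = knot_grid[int(np.argmin(sic))]
        # (ii) lambda by the BIC-like criterion, knots held fixed
        Z = design(kn)
        bic = [bic_like(y, Z, fit(Z, l), tau) for l in lam_grid]
        lam = lam_grid[int(np.argmin(bic))]
    return kn, lam
```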

Quantile stability selection for VCM
We further consider the issue of determining the significance of a variable based on penalized estimates, since judging from the coefficient estimates alone may lead to unstable and sometimes misleading results. Motivated by stability selection techniques [23], we tackle this issue using stability quantile paths obtained by resampling. We use a set I = {α̂_j^q : q ∈ (0, 1), j = 1, …, p} to collect the estimates of all coefficients at each quantile level of interest. The stability of a quantile path is the probability of each variable {α̂_j^q, j = 1, …, p} being selected when the data are randomly resampled, given a quantile level q. Let S be a random subsample of {1, …, n} of size [n/2], drawn without replacement. The probability of a set L ⊂ {1, …, p} being contained in the selected set Ŝ^q is π^q = P(L ⊂ Ŝ^q). For each covariate X_j, the stability quantile path is given by the selection probabilities π_j^q, q ∈ {τ_1, …, τ_K}. In principle K can go to infinity, but practically we suggest K ≤ 20 for computational reasons. Figure 3 shows that this simple path plot is potentially very useful for determining a reasonable K for PSQ.
We say a variable obtained via PSQ is stable if it performs stably at every selected quantile τ_k, k = 1, …, K; that is, a stable variable should be significant across all the quantiles. Let π_j^q be the probability of the jth variable being selected at the qth quantile level; the jth variable is stable if min_q(π_j^q) ≥ π_j, where π_j is a predetermined threshold probability and q ∈ {τ_1, …, τ_K}. The set of stable variables is then defined as Ŝ_stable = {j : min_{q∈{τ_1,…,τ_K}} π_j^q ≥ π_j}. We propose keeping the variables with high selection probabilities at every quantile level and removing those with low selection probabilities. In our numerical analysis the results are not very sensitive to the choice of threshold π_j; we suggest π_j ∈ (0.6, 0.9) in practice.
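The resampling scheme can be sketched as follows; fit_one stands for any sparse fitter returning a length-p selection indicator (for instance the PSQ sketch above) and is a placeholder name of ours.

```python
import numpy as np

def stability_paths(y, x, u, taus, fit_one, n_sub=100, seed=0):
    """Stability quantile paths: for each quantile tau_k, refit a sparse
    estimator on random half-samples (drawn without replacement) and record
    how often each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    paths = np.zeros((len(taus), p))
    for _ in range(n_sub):
        S = rng.choice(n, size=n // 2, replace=False)   # half-sample
        for k, tau in enumerate(taus):
            paths[k] += fit_one(y[S], x[S], u[S], tau)  # boolean vector
    return paths / n_sub          # (K x p) selection probabilities

def stable_set(paths, threshold=0.75):
    """Variables whose minimum selection probability across all quantiles
    meets the threshold (the suggested range is 0.6-0.9)."""
    return np.where(paths.min(axis=0) >= threshold)[0]
```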

Asymptotic properties
In this section, we first study the asymptotic behavior of the LSQ and SSQ estimators defined in (4) and (6) in the case where p is fixed and the sample size n goes to infinity.
Then we describe the asymptotic properties of the PSQ estimator defined in (11) when p can be larger than n and n goes to infinity, but the number of active predictors is relatively small. Let f_τ(· | u, x) and F_τ(· | u, x) be the density and cumulative distribution function (CDF) of ε conditional on (U, X) = (u, x), respectively. Without loss of generality, let the support of U be [0, 1]. Denote by f_U(·) the marginal density function of the index U. Define H_r as the collection of all functions on [0, 1] whose mth-order derivative satisfies the Hölder condition of order ν, with r ≡ m + ν; that is, for each h ∈ H_r there exists a constant c ∈ (0, ∞) such that |h^{(m)}(s) − h^{(m)}(t)| ≤ c|s − t|^ν for any 0 ≤ s, t ≤ 1. The kernel K(·) in (3) is chosen as a symmetric density, and we define D(u) = E(XX^T | U = u). We assume conditions (C1)-(C7), stated at the end of the paper. Conditions (C1)-(C5) are standard assumptions in varying coefficient models and quantile regression. Condition (C6) is required to achieve the optimal convergence rate of α̂_{SSQ,j}(u). Condition (C7) is needed to prove the consistency of the SSQ estimators. Then we have the following results.
Theorem 3.1 (Asymptotic property of LSQ estimator when p is small): Assume conditions (C1)-(C5) hold. If h_k → 0 and nh_k → ∞ as n → ∞ for k = 1, …, K, then α̂_{LSQ,j}(u) is asymptotically normal, with asymptotic variance governed by the constant δ = { Σ_{i,j} (w_i w_j)^{1/5} } / ( Σ_i w_i )², determined by the weights w_k, and by the optimal bandwidth h for LS suggested in Yu and Jones [36].
Theorem 3.2 (Asymptotic property of SSQ estimator when p is small): Assume conditions (C4)-(C7) hold, r > 1/2 and k_{n,j} ∼ n^{1/(2r+1)}. Then, for j = 1, …, p, ‖α̂_{SSQ,j} − α_j‖² = O_p(n^{−2r/(2r+1)}). Throughout this paper, a_n ∼ b_n indicates that a_n and b_n have the same order as n → ∞; in other words, there exist constants 0 < A < B < ∞ such that A ≤ a_n/b_n ≤ B for sufficiently large n.
Theorem 3.1 establishes the asymptotic normality of α̂_{LSQ,j}. Theorem 3.2 shows that, when α_j(·) in (C6) has bounded second-order derivatives, the SSQ estimator attains the well-known optimal rate of convergence n^{−2r/(2r+1)} of Stone [26].
Next, to establish the oracle property of the PSQ estimator, we write p_n, s_n and L_n to indicate that p, L and s are allowed to grow with n. Define L_a := max_{1≤j≤p} L_j, L_b := min_{1≤j≤p} L_j, L_n = L_a/L_b and ρ_n := max_{1≤j≤p} inf_{π_j} ‖α_j − π_j‖_∞, where the infimum is over spline functions π_j; thus ρ_n characterizes the approximation error of the spline procedures. Let |M| denote the cardinality of any set M ⊂ {1, …, p_n} and L_M = Σ_{k∈M} L_k. Let Â = {j : ‖π̂_j‖ ≠ 0, 1 ≤ j ≤ p_n} be the set of indices of the variables selected by the adaptive lasso, with size ŝ = |Â|, and define π_b = min_{k∈A} ‖π_k‖_2. Denote by s_n the number of active predictors; without loss of generality, let the active predictors be the first s_n predictor variables. In addition to (C1)-(C7), we assume the following conditions.

(C8). The random errors ε_i, i = 1, …, n are independent and identically distributed as ε, where E(ε) = 0 and E(ε²) = σ² < ∞, and the tail probability satisfies P(|ε| > x) ≤ c_3 exp(−c_4 x²) for all x ≥ 0 and some positive constants c_3 and c_4.
(C9). Let q_n = p_n − s_n be the number of inactive variables, and suppose that the number of active predictors s_n, the numbers of basis functions L_j and the smallest active coefficient functions satisfy appropriate growth restrictions. Condition (C8) assumes that the error term has sub-Gaussian tail behavior. Condition (C9) restricts the numbers of active predictors and basis functions and the size of the smallest active coefficient function. We now state the oracle property of PSQ.

Theorem 3.3 (Oracle property of PSQ estimator): Assume conditions (C1)-(C9) hold and lim_{n→∞} s_n L_a log(s_n L_a)/n = 0. Then, with a choice of λ_n such that λ_n → 0 and n^{r/(2r+1)} λ_n → ∞, we have: (1) α̂_{PSQ,j}(u) = 0, j = s_n + 1, …, p_n, with probability approaching one; (2) ‖α̂_{PSQ,j} − α_j‖² = O_p(n^{−2r/(2r+1)}) for j = 1, …, s_n. Theorem 3.3 shows that the irrelevant predictors can be consistently identified by the PSQ procedure, which enjoys the same convergence rate as if we knew which predictors are important. Under the conditions of Theorem 3.3, the number of predictors p_n can be as large as exp(o(n/(L_a² s_n))), which can be much larger than n.

Simulation study
In this section, we carry out simulation studies to assess the finite-sample performance of the proposed procedures. We first consider two cases with few predictors and compare the proposed methods with least squares-based (LS), composite quantile regression-based (CQR) and quantile regression-based (QR) methods. We take 19 equally spaced quantiles for CQR, as suggested in Zou and Yuan [39]. We evaluate performance with the ratio of average squared errors (RASE) criterion, where the ASE is defined as

ASE(α̂) = (1/n_grid) Σ_{k=1}^{n_grid} Σ_{j=1}^p { α̂_j(u_k) − α_j(u_k) }²,   (18)

with {u_k : k = 1, …, n_grid} a set of grid points placed uniformly on [0, 1] and n_grid = 100. Then, for any estimator α̂, RASE(α̂) = ASE(α̂_LS)/ASE(α̂), where α̂_LS is the least squares-based estimator. We next consider two high-dimensional cases. To assess variable selection performance we use the notations 'C' and 'IC' to measure model complexity: C is the average number of zero coefficients correctly estimated to be zero, and IC is the average number of nonzero coefficients incorrectly estimated to be zero. 'U-fit' (under-fit) denotes a trial that excludes nonzero coefficients, 'C-fit' (correct fit) denotes the correct model, and 'O-fit' (over-fit) denotes a trial that selects irrelevant predictors in addition to the true significant predictors. In computing LSQ, the Epanechnikov kernel is employed and the bandwidth is set to h_τ = {τ(1 − τ)/f²(F^{−1}(τ))}^{1/5} · h_LS for each quantile. In computing SSQ, B-splines are used with the number of basis functions set to [n^{1/5}] = 3. In computing PSQ, penalized LS and penalized CQR, we apply the adaptive lasso penalty and employ Algorithm 1 to select the tunings. For each example, we report results based on 400 simulation runs.
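For reference, the ASE/RASE computation (18) can be sketched as follows, with alpha_hat, alpha_ls and alpha_true given as lists of callables, one per coefficient function:

```python
import numpy as np

def ase(alpha_hat, alpha_true, u_grid):
    """Average squared error (18) over a uniform grid on [0, 1]:
    the grid average of sum_j {ahat_j(u_k) - a_j(u_k)}^2."""
    H = np.column_stack([f(u_grid) for f in alpha_hat])   # n_grid x p
    T = np.column_stack([f(u_grid) for f in alpha_true])
    return float(np.mean(np.sum((H - T) ** 2, axis=1)))

def rase(alpha_hat, alpha_ls, alpha_true, n_grid=100):
    """RASE relative to the LS fit; values above 1 favor alpha_hat."""
    u_grid = np.linspace(0.0, 1.0, n_grid)
    return ase(alpha_ls, alpha_true, u_grid) / ase(alpha_hat, alpha_true, u_grid)
```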

Example 4.1 (Relatively flat coefficient functions):
We generate 200 samples from the model

Y = α_1(U)X_1 + α_2(U)X_2 + ε,   (19)

where α_1(u) = sin(6πu) and α_2(u) = sin(2πu). The index U follows a uniform distribution on the interval [0, 1] (U(0, 1)). The predictors X_1 and X_2 are jointly normally distributed with zero means and covariance 2/3. We consider five distributions for the error term: standard normal (N(0, 1)), Laplace, t with three degrees of freedom (t_3), chi-square with 4 degrees of freedom (χ²_4) and Cauchy. The QR methods based on local linear approximation and B-spline approximation are denoted LQR and SQR, respectively. We take τ = 0.05, 0.25, 0.50, 0.75 and 0.95 for each QR method for a comprehensive evaluation. To explore the effect of the number of selected quantiles on the proposed methods, we consider K = 5, 9 and 19. Sample means and standard deviations (sd) of the RASEs over 400 simulations are summarized in Table 1, which shows that in all cases the proposed methods provide more efficient estimation than the other methods, and the SSQ methods always perform best, especially when the data exhibit clear heterogeneity in the tail. Table 1 also suggests that the proposed methods are quite insensitive to the choice of K.
To visualize the estimated functions, we plot in Figure 1 various estimators of α_1(·) and α_2(·) under the normal and Cauchy distributions. The proposed SSQ estimator is clearly superior to all the other estimators, even when the error follows the Cauchy distribution, and the results in Figure 1 are consistent with those of Table 1.

Example 4.2 (One ragged and one flat coefficient function):

This example considers the same model as Example 4.1 but with slightly different coefficient functions: α_1(u) = sin(60u), which has a ragged appearance, whereas α_2(u) = 4u(1 − u) is relatively flat. Results for the RASEs are summarized in Table S3 (supplementary material). To visualize the estimated functions, estimators of α_1(·) and α_2(·) from the various methods under normal and Cauchy errors are depicted in Figure 2. There are considerable differences in smoothness among these estimators even though the settings are exactly the same; the SSQ method is overall more efficient and is able to detect the true varying structure. Table S3 and Table 1 convey similar messages. First, all the proposed methods perform better than the LS methods under all error distributions; the RASEs are much greater than 1, indicating a large gain in efficiency. For the Cauchy distribution in particular, the LS methods break down completely, whereas the proposed methods remain robust. Second, for the QR estimators, the chosen quantile levels and error distributions affect efficiency, while the proposed methods remain stable thanks to the integration of multiple QR strengths. Third, in most cases the performance of the proposed methods is comparable to that of CQR, while the proposed methods are computationally faster. Fourth, compared with local linear methods, spline-basis approaches better capture the varying structure of α_1(u) and α_2(u). To illustrate this point, we summarize the RASEs between LSQ and SSQ over 400 simulations in Table S4 of the supplementary material; spline-basis methods achieve higher efficiency than local linear methods regardless of the value of K.

Example 4.3 (Variable selection and stability selection):
This example is designed to investigate variable selection performance. Consider Y = X^T β(U) + ε, where U is generated from U(0, 1) and β(u) is a vector of constant functions, β(u) = (3, 1.5, 2, 0_{p−3})^T. The predictors X follow a multivariate normal distribution with zero means and correlation Σ_ij = 0.5^{|i−j|} for 1 ≤ i, j ≤ p. We consider five error distributions: N(0, 1), t_3, Laplace, Cauchy and the mixture of normals 0.9N(0, 1) + 0.1N(0, 10²). We set n = 200 and two values of p, p = 200 and 2000, corresponding to small-p and large-p cases. We compare three estimators: LS, CQR and our PSQ. For the PSQ estimator, the tuning parameters λ_k are selected independently for each quantile using the BIC-like criterion introduced in Section 2.3.2.
Performance of β̂_PSQ, β̂_CQR and β̂_LS: To examine estimation efficiency and investigate the effect of K on PSQ, we consider K = 5, 9 and 19 and take β̂_LS and β̂_CQR for comparison. Means and SDs of the estimates of the nonzero coefficients β_1, β_2 and β_3 are summarized in Table S1 of the supplementary material.
Performance of variable selection: We use GMSE(β̂) = (β̂ − β)^T E(XX^T)(β̂ − β) to assess the performance of the variable selection procedures. For all three penalized estimators we employ the adaptive lasso penalty. Results over 400 simulations are summarized in Table 2.
Stability selection: To further explore how noise and sample size affect the quantile stability paths, we use a resampling method with n = 50, 200 and 500. We graphically show the stability quantile paths under Cauchy errors in Figure 3 as an illustration, and present Figures S1-S3 of the supplementary material for the other noise distributions. We plot the average selection probability of each individual predictor, evaluated over 100 subsamples. Notably, the depicted quantile paths can also be used to determine a proper number of selected quantiles K for PSQ. Taking K = 9 in our PSQ, the performance of individual variables across distinct quantile levels is further examined in Figure 3.
All the tables and figures in Example 4.3 show that the PSQ estimator performs best across various settings in terms of both estimation accuracy and model sparseness, especially when p is large. The PSQ procedure can be an effective tool for dimension reduction in high-dimensional settings.

Example 4.4 (Variable selection for VCM):
This example is adapted from [35]. The random data are generated from

Y_i = Σ_{j=1}^p α_j(U_i) X_ij + ε_i,

where U_i is uniformly distributed on [0, 1] and X follows a multivariate normal distribution N_p(0, Σ) with Σ_ij = ρ^{|i−j|}. We first take ρ = 0.5 to simulate a correlated example and consider the same noise distributions as in the first two examples. The first three coefficient functions are taken to be nonzero smooth functions of u, as specified in [35], and α_j(u) = 0 for j = 4, …, p, so the first three predictors are active. To examine the power of PSQ in model selection, we vary the dimensionality over p = 10, 200, 400 and 800, corresponding to dimensions smaller than, close to, and exceeding the sample size. We apply cubic B-splines in our PSQ and employ Algorithm 1 to select the tuning parameters. We compute the ASEs (18) for each method to assess the overall estimation accuracy of the varying coefficient functions. In total we conduct 400 simulations and compare the PSQ, penalized LS, penalized CQR and penalized QR procedures. For the contrast analysis we take 19 quantiles for CQR, as suggested by Zou and Yuan [39], τ = 0.25, 0.50 and 0.95 for QR, and K = 5 and 9 for PSQ. Results for standard normal errors are summarized in Table 3 as an illustration. Table 3 also reports the average numbers of nonzero estimated functions (C, IC) and the percentages of correct fitting (C-fit), underfitting (U-fit) and overfitting (O-fit) over the 400 replications. It suggests that the PSQ method outperforms the other procedures by a large margin, in the sense that the percentage of correct fitting of all the other procedures decreases significantly as p increases, while the performance of PSQ_9 remains relatively stable. Figure 4 plots the nonzero estimated coefficient functions α̂_l from these procedures based on one simulated sample with n = 200 and p = 400. We also present the estimated plots for ρ = 0 and 0.8 in Figure S4 of the supplementary material; all demonstrate that the proposed PSQ method estimates the coefficient functions reasonably well in the high-dimensional setting. Overall, PSQ is effective in identifying the active variables and achieving accurate parameter estimation when the dimensionality p is much larger than the sample size n, and its performance is on par with or better than that of the other competitive methods for a variety of error distributions.
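A data-generating sketch for this design is given below; the three nonzero coefficient functions are illustrative placeholders rather than the exact functions of [35].

```python
import numpy as np

def simulate_vcm(n=200, p=400, rho=0.5, seed=None):
    """Generate data from a sparse VCM, Y = sum_j alpha_j(U) X_j + eps, with
    AR(1)-correlated normal predictors as in Example 4.4."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=n)
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))   # Sigma_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    A = np.zeros((n, p))                                 # alpha_j(U_i) values
    A[:, 0] = np.sin(2.0 * np.pi * U)                    # placeholder actives
    A[:, 1] = 8.0 * U * (1.0 - U)
    A[:, 2] = 1.0 + np.cos(np.pi * U)
    y = np.sum(A * X, axis=1) + rng.standard_normal(n)   # N(0, 1) errors
    return y, X, U
```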

Boston housing price analysis
This section first considers the Boston housing data [9], also analyzed by Fan et al. [8] and Li et al. [18], to investigate the performance of the proposed procedures for VCMs when the dimension of the predictors is small. The dataset comprises housing data for 506 census districts of Boston from the 1970 census. To exploit the power of the VCM and examine the efficiency of our proposed methods, we follow the same spirit as Fan et al. [8] and Li et al. [18] and take the variable log(DIS), the logarithm of the weighted distance to five employment centers in the Boston region, as the varying index U. This setting enables us to explore how the distance to the business hubs interacts with the other variables. Following Fan et al. [8], the VCM is specified as

MV_i = α_0(U_i) + Σ_{j=1}^p α_j(U_i) X_ij + ε_i,

where the response MV is the median value of owner-occupied homes. Descriptions of the other predictors can be found in the manual of the R package mlbench. We standardize each predictor and estimate the α_j(U) by several comparative methods: LS, QR and our SSQ. For a comprehensive analysis, we take τ = 0.10, 0.25, 0.50, 0.75 and 0.90 for the QR method and K = 5, 7, 9, 15 and 19 for SSQ. In each repetition we randomly split the dataset into a training and a testing set: the 406 training samples are used to fit the VCMs, and the remaining samples are used to assess the predictive performance of each procedure. We employ two criteria: the mean squared prediction error (MSPE), Σ_{i=1}^{n_t} (y_i − ŷ_i)²/n_t, and the median absolute prediction error (MAPE), median{|y_i − ŷ_i|, i = 1, …, n_t}, where n_t is the number of testing observations in each repetition. We replicate the split 100 times in total; the average prediction errors are listed in Table 4.
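The two prediction criteria and the random-split evaluation can be sketched as follows; fit is a placeholder for any of the estimators compared here, returning a prediction function.

```python
import numpy as np

def mspe(y, y_hat):
    """Mean squared prediction error over the test set."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mape(y, y_hat):
    """Median absolute prediction error (robust to outlying test points)."""
    return float(np.median(np.abs(np.asarray(y) - np.asarray(y_hat))))

def split_eval(y, X, u, fit, n_train=406, reps=100, seed=0):
    """Random train/test evaluation as in the Boston analysis; fit(y, X, u)
    returns a predictor predict(X_new, u_new)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        idx = rng.permutation(len(y))
        tr, te = idx[:n_train], idx[n_train:]
        predict = fit(y[tr], X[tr], u[tr])
        y_hat = predict(X[te], u[te])
        errs.append((mspe(y[te], y_hat), mape(y[te], y_hat)))
    return np.mean(errs, axis=0)    # average (MSPE, MAPE) over repetitions
```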
Table 4 shows that, for the Boston housing data, the SSQ procedures perform better than or comparably to the other procedures in terms of predictive power. Moreover, the sd values indicate that the SSQ procedures are relatively robust and always among the best few procedures, especially when K = 7. We therefore depict in Figure 5 the estimated functions from SSQ-7. These plots, which display very interesting views of housing valuation, are broadly similar to those in Fan et al. [8], but with slightly different functional trends for the predictors TAX and CRIM. We leave this issue, which is worth pursuing, for future exploration.

Application to gene expression data
We next evaluate the performance of our PSQ method on a high-dimensional rat microarray gene expression dataset [24]. The dataset comprises p = 31,042 genes for n = 120 rats and was collected to identify genetic variation relevant to human eye disease. The goal of this study is to analyze how the response variable is functionally associated with the expression of other genes. To this end, we artificially generate an index U distributed uniformly on [0, 1]. The response is the expression of gene TRIM32 (probe 1389163_at), which is known to be related to a human hereditary disease of the retina. Before fitting the VCM, we preprocess the data in the following steps: we remove the genes that show little variation in intensity across samples, keep the 3000 genes with the largest variance in expression value following [17,22], and standardize each vector to have zero mean and unit variance. We subsequently select the top p = 500 genes ranked via MV-SIS [4]. Such a feature screening step is common practice for high-dimensional data, since many genes make essentially no contribution to predicting the response.
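The preprocessing pipeline can be sketched as follows; since MV-SIS [4] itself is not reproduced here, the final screening step ranks genes by absolute marginal correlation with the response as a labeled stand-in.

```python
import numpy as np

def screen_expression(expr, y, n_var=3000, n_keep=500):
    """Sketch of the screening pipeline for the rat eye data: (i) keep the
    n_var genes with the largest expression variance, (ii) standardize each
    gene, (iii) keep the top n_keep genes by a marginal screening score.
    Step (iii) uses |corr(X_j, y)| as a stand-in for the MV-SIS utility."""
    expr = np.asarray(expr, dtype=float)                 # n samples x p genes
    top_var = np.argsort(expr.var(axis=0))[::-1][:n_var]
    X = expr[:, top_var]
    X = (X - X.mean(axis=0)) / X.std(axis=0)             # zero mean, unit sd
    score = np.abs(np.corrcoef(X.T, y)[-1, :-1])         # marginal |corr|
    keep = np.argsort(score)[::-1][:n_keep]
    return X[:, keep], top_var[keep]                     # screened X, gene ids
```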
The reason we advocate our robust PSQ procedure for this dataset is that the response is heterogeneous. On close inspection of gene TRIM32, we detect a possible outlier in the 58th sample, whose value is much lower than the others; we nevertheless keep this point to examine the robustness of all the contrasting methods. We apply the three kinds of methods considered in the previous examples and, to identify significant genes for the response, use the adaptive lasso penalty for all procedures. The VCM here takes the form

Y_i = α_0(U_i) + Σ_{j=1}^p α_j(U_i) X_ij + ε_i,

where X_j denotes the expression of the jth gene, ε is the random error, and U_i is drawn from U(0, 1). To investigate the finite-sample performance and predictive power of the different methods, we repeat the experiment 100 times and compute the MAPE and MSPE prediction errors. For each replicate, we randomly divide the data into a training set of 90 observations and a testing set with the remaining observations. We report the average prediction errors, model size (MS) and their standard deviations in Table 5. We also list the top three selected genes with the highest selection frequency under each scenario; note that the genes identified by a method across the 100 replicates are not always the same, owing to the random sampling of the index U. Table 5 shows that the PSQ methods are superior to the other procedures in terms of both prediction error and model size. The PSQ method always gives a smaller model size and lower prediction error than the LS and QR methods, so it is very effective in filtering noise variables in the high-dimensional setting. A more interesting finding from Table 5 is that our PSQ methods are more stable, in the sense that they select exactly the same genes no matter how K varies under the random sampling. We depict in Figure 6 the estimated functions for two selected genes and the intercept, fitted on the full sample via PSQ_5, since PSQ_5 always gives the smallest prediction errors and standard deviations.

Discussion
In conclusion, this paper develops several robust estimation procedures based on SIQR weights for the VCM to reveal its close-to-truth functional structure. By effectively combining strengths across multiple QRs and fully exploiting the characteristics of the data, the SIQR-based methods are shown to be more robust and efficient than the competing methods. Specifically, LSQ and SSQ are designed for a small number of predictors, while PSQ handles high-dimensional predictors: it significantly reduces the number of unknown functions to be estimated and automatically selects the important predictors by imposing a penalty function. Simulations suggest that the proposed SSQ, along with its penalized version PSQ, is very useful in general and high-dimensional scientific discoveries; they consistently uncover the zero and nonzero patterns and reveal interesting interplay between variables. The proposed procedures have a wide range of applications and can also be adapted to other nonparametric or semiparametric models, with or without missing data. However, our procedures require the response to be continuous, which is restrictive in practice; how to incorporate both continuous and discrete response variables is an interesting topic for future research. As pointed out by the associate editor, one might also study the asymptotic properties of SIQR-based estimators under the proposed estimated weights. Practically, we have shown that the plug-in weight ϕ(Φ^{−1}(τ_k))/√(τ_k(1 − τ_k)) leads to an interesting estimator with nice properties. It would be interesting to see whether other weights could yield similar estimators and how to obtain the optimal weights.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Regularity conditions
(C1). The index variable U has bounded support, and its density function f_U(·) is positive with a continuous second derivative.
(C2). α_j(u) has continuous second derivatives in u ∈ [0, 1] for j = 1, …, p.
(C3). K(·) is a symmetric density function with bounded support and satisfies the Lipschitz condition.
(C4). f_τ(· | u, x) is bounded, bounded away from zero, and has a continuous and uniformly bounded derivative.
(C5). D(u) is nonsingular for all u ∈ [0, 1].
(C6). For some r ≥ 1/2, α_j(t) ∈ H_r, j = 0, 1, …, p. If the α_j have bounded dth-order derivatives on [0, 1], then this holds with r = d.
(C7). The conditional density of U given X = x is bounded: 0 < c_1 ≤ f_{U|X}(u | x) ≤ c_2 < ∞ uniformly in x and u, where c_1 and c_2 are positive constants.



Figure 1. Plots of various estimators α̂_l(·) (Example 4.1) under normal and Cauchy errors based on one sample of size n = 200; the first row is for standard normal error and the second for Cauchy error.

Figure 2. Plots of various estimators α̂_l(·), l = 1, 2 (Example 4.2) under normal and Cauchy errors based on one sample of size n = 200; the first row is for normal error and the second row for Cauchy error.

Figure 3. Stability quantile paths and performance of the first eight covariates under Cauchy noise for Example 4.3. The first row shows the stability paths of the three active covariates β_1, β_2 and β_5 with sample sizes, from left to right, 50, 200 and 500; in each panel the x-axis denotes the quantile number and the y-axis the selection probability. The first panel in the second row shows the behavior of all eight variables via PSQ_9, with plotting symbols distinguishing n = 50, 200 and 500; the x-axis denotes the eight variables and the y-axis the selection probability. The other three panels plot the results for β_1, β_2 and β_5 over 100 resamplings, with plotting symbols distinguishing the three coefficients; the x-axis denotes the nine quantiles and the y-axis the selection probability.

Figure 6. Fitted functional gene estimates selected by PSQ_5 for the gene expression data.

Table 1. Summary of RASEs of α̂(u) based on 400 simulations for Example 4.1 (* denotes our proposed methods).

Table 2. Performance of variable selection for Example 4.3.

Table 3. Results of variable selection under normal error for Example 4.4.

Table 4. Predictive power of different methods for the Boston housing data.

Table 5. Prediction error, model size and top 3 selected genes over 100 repetitions.