Quantile-adaptive variable screening in ultra-high dimensional varying coefficient models

The varying-coefficient model is an important nonparametric statistical model, since it allows appreciable flexibility in the structure of the fitted model. For ultra-high dimensional heterogeneous data, it is essential to examine how the effects of covariates vary with an exposure variable at the quantile levels of interest. In this paper, we extend marginal screening methods to examine and select variables by ranking a measure of the nonparametric marginal contribution of each covariate given the exposure variable. Spline approximations are employed to model the marginal effects and to select the set of active variables in a quantile-adaptive framework. This ensures the sure screening property in the quantile-adaptive varying-coefficient model. Numerical studies demonstrate that the proposed procedure works well for heteroscedastic data.


Introduction
Varying coefficient models have been widely applied as an adaptive generalization of the linear regression model to depict dynamic behaviors. Because of their flexibility and interpretability, much work has been done on parameter estimation and hypothesis testing, mostly in the mean regression setting, until authors such as Honda [11], Cai and Xu [1] and Kim [12] turned to the quantile regression setting in the early 2000s.
Quantiles are defined without moment conditions, and exploring several conditional quantiles provides a more complete description of the data than exploring only the conditional mean. These advantages have inspired many researchers to consider quantile regression not only in parametric models but also in semi-parametric and nonparametric models. Regarding varying coefficient models, Honda [11] and Cai and Xu [1] explored a varying coefficient model for the conditional quantiles using local polynomials, and Kim [12] proposed a polynomial spline-based methodology for the same model. For longitudinal data analysis, Wang et al. [17] developed the partially linear varying coefficient model in quantile regression. In spite of remarkable progress in parameter estimation and hypothesis testing, there is no method for conducting variable selection efficiently in the varying-coefficient model with ultra-high dimensional heterogeneous data in the quantile regression framework.
Ultra-high dimensional data are frequently encountered in a large variety of research areas such as genomics, functional magnetic resonance imaging, tomography, economics, and finance. The analysis of ultra-high dimensional data poses many challenges for statisticians and calls for new statistical methodologies and theories. Under the sparsity assumption, variable selection improves the accuracy of estimation and yields scientific insights. Many significant variable selection methods have been developed, such as Bridge regression in [8], the Lasso in [16], the smoothly clipped absolute deviation (SCAD) and folded concave penalties in [4], the Elastic net in [23], the Adaptive Lasso in [22], and the Dantzig selector in [2]. Techniques for implementing folded concave penalized least squares include the local linear approximation algorithm in [24] and the PLUS algorithm in [19]. However, these methods do not perform well in ultra-high dimensional problems.
Fan and Lv [5] introduced sure independence screening (SIS) to reduce the dimensionality of the feature space to a moderate scale in ultra-high dimensional linear regression models via marginal correlation learning. With this method, the important variables are retained with probability approaching one. Fan and Song [7] extended the methodology to ultra-high dimensional generalized linear models and constructed a useful technical tool for establishing sure screening results and bounding false selection rates. Other methods have been put forward, including the data-tilting method in [9], the marginal partial likelihood method in [21], robust screening by rank correlation in [14], distance correlation in [15], nonparametric independence screening (NIS) with additive models in [3], and variable screening in nonparametric varying-coefficient models with ultra-high dimensionality in [6]. Inspired by these previous works, our study focuses on variable screening in varying coefficient quantile regression models with ultra-high dimensionality, that is, dimensionality of nonpolynomial (NP) order of the sample size.
The nonparametric quantile regression varying-coefficient model is studied under a conditional linear structure. Suppose the observed data $\{(X_i, Y_i, U_i), i = 1, \ldots, n\}$ are i.i.d. copies of $(X, Y, U)$, where $X_i = (X_{i1}, \ldots, X_{ip})^T$. The covariates are denoted $X_1, \ldots, X_p$, so $X$ can also be written as $(X_1, \ldots, X_p)^T$. The varying-coefficient model is
$$Y = \beta_0(U) + \sum_{j=1}^{p} \beta_j(U) X_j + \varepsilon, \qquad (1)$$
where $U$ is some observable exposure variable and $\varepsilon$ is random noise with conditional mean 0. The vector $X$ is assumed to enter the model linearly, while the coefficient functions are allowed to vary smoothly with the exposure variable.
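To fix ideas, the following is a minimal Python sketch that simulates data of this form. The coefficient functions, the heteroscedastic noise design, and the dimensions are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 1000

# Observable exposure variable and covariates (illustrative choices).
U = rng.uniform(0.0, 1.0, size=n)
X = rng.standard_normal(size=(n, p))

# Sparse, smooth coefficient functions: only the first three covariates are active.
beta = [
    lambda u: 2.0 * np.sin(2.0 * np.pi * u),
    lambda u: 4.0 * u * (1.0 - u),
    lambda u: np.exp(u) - 1.0,
]

# Heteroscedastic noise: the scale depends on a covariate, so conditional
# quantiles carry information that the conditional mean misses.
eps = (1.0 + 0.5 * np.abs(X[:, 3])) * rng.standard_normal(n)

Y = sum(b(U) * X[:, j] for j, b in enumerate(beta)) + eps
```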
In the quantile-adaptive screening framework, we use B-spline approximation to estimate the marginal effects of the varying coefficient model. In this respect, our method shares some similarity with that in [6]. The main technical challenge, compared with [6], is handling the nonsmooth loss function. We derive exponential bounds by applying empirical process theory and establish the sure screening property under general conditions.
The rest of the article is organized as follows. In Section 2, we apply B-spline basis approximation and screen variables by ranking a measure of the estimators in the marginal quantile nonparametric regression model. In Section 3, we derive the sure screening property of quantile nonparametric independence screening (QNIS) under certain technical conditions. Iterative QNIS procedures are established in Section 4. In Section 5, simulation studies and a real data application are conducted to evaluate the performance of the proposed methods. Proofs are presented in Section 6.

Nonparametric quantile marginal screening
We assume that the functional coefficient vector $\beta(\cdot) = (\beta_1(\cdot), \ldots, \beta_p(\cdot))^T$ is sparse and let $\mathcal{M}_* = \{j : E\beta_j^2(U) > 0\}$ denote the true sparse model, with nonsparsity size $s_n = |\mathcal{M}_*|$. The dimension $p$ is allowed to grow with $n$ and is written as $p = p_n$ whenever necessary.

Quantile-adaptive marginal regression
Let $Q_\tau(Y \mid X, U)$ denote the conditional $\tau$th quantile of $Y$ given $(X, U)$, and let $Q_\tau(Y \mid U)$ be the $\tau$th quantile of $Y$ given $U$ alone. The set of active variables is then equivalent to the quantile-specific set $\{j : Q_\tau(Y \mid X, U) \text{ depends on } X_j\}$. We consider the marginal quantile regression of $Y$ on $X_j$ and $U$ and define
$$(b_{j0}, b_j) = \operatorname*{argmin}_{b_0, b_1 \in L_2(P)} E\,\rho_\tau\bigl(Y - b_0(U) - b_1(U)X_j\bigr),$$
where $\rho_\tau(u) = u(\tau - I(u < 0))$ is the quantile loss function, $P$ denotes the joint distribution of $(Y, U, X_j)$, $L_2(P)$ is the class of square-integrable functions under the measure $P$, and $I(\cdot)$ denotes the indicator function. It is known that $Q_\tau(Y \mid X_j, U) = b_{j0}(U) + b_j(U)X_j$. We rank the marginal effect of the covariates by $\|f_j\|^2$, where
$$f_j(X_j, U) = b_{j0}(U) + b_j(U)X_j - Q_\tau(Y \mid U),$$
and retain the covariates whose estimated marginal utilities exceed $\gamma_n$, where $\gamma_n$ is a predefined threshold value.
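As a quick numerical illustration of the check loss $\rho_\tau$ (a generic sketch, not tied to the paper's data): minimizing the average check loss over a constant recovers the empirical $\tau$-quantile.

```python
import numpy as np

def rho(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0.0))

rng = np.random.default_rng(1)
y = rng.standard_normal(10_000)
tau = 0.75

# Minimise the empirical check loss over a grid of candidate constants c.
grid = np.linspace(-3.0, 3.0, 601)
c_hat = grid[np.argmin([rho(y - c, tau).mean() for c in grid])]
print(c_hat, np.quantile(y, tau))  # both close to the N(0,1) 0.75-quantile, ~0.674
```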

Marginal regression estimation with B-spline
The next step is to estimate the marginal effects $f_j$, $j = 1, \ldots, p$. We approximate $b_{j0}(U)$ and $b_j(U)$ by functions in $S_n$, the space of polynomial splines on the compact set $\mathcal{U}$, with normalized B-spline basis $\{B_k(\cdot)\}_{k=1}^{d_n}$, $d_n \ge 1$. Let
$$b_{j0}(u) \approx \sum_{k=1}^{d_n} \mu_{jk} B_k(u), \qquad b_j(u) \approx \sum_{k=1}^{d_n} \nu_{jk} B_k(u),$$
where $\{\mu_{jk}\}_{k=1}^{d_n}$ and $\{\nu_{jk}\}_{k=1}^{d_n}$ are scalar coefficients. We now consider the following sample version of the marginal quantile regression problem:
$$(\hat\mu_j, \hat\nu_j) = \operatorname*{argmin}_{\mu, \nu \in \mathbb{R}^{d_n}} \sum_{i=1}^{n} \rho_\tau\Bigl(Y_i - \sum_{k=1}^{d_n} \mu_k B_k(U_i) - \sum_{k=1}^{d_n} \nu_k B_k(U_i) X_{ij}\Bigr). \qquad (6)$$
To obtain the solution of Equation (6), we assume that the regression observations $(Y, X)$ are in general position (as defined by Koenker [13]) and that, for some $h \subset \{1, \ldots, n\}$ with $|h| = 2d_n$, the $n \times 2d_n$ design matrix $H_{nj}$, whose $i$th row is $(B_1(U_i), \ldots, B_{d_n}(U_i), B_1(U_i)X_{ij}, \ldots, B_{d_n}(U_i)X_{ij})$, has an invertible $2d_n \times 2d_n$ submatrix $H_{nj}(h)$ (Theorem 2.1 of Koenker [13]). We then derive the estimate of the marginal effect $f_j$ as
$$\hat f_j(x, u) = \sum_{k=1}^{d_n} \hat\mu_{jk} B_k(u) + \sum_{k=1}^{d_n} \hat\nu_{jk} B_k(u)\,x - \hat Q_\tau(Y \mid U = u),$$
where $\hat Q_\tau(Y \mid U)$ is the corresponding spline estimate of the $\tau$th quantile of $Y$ given $U$ alone. Similarly, we define the empirical norm of $\hat f_j$ as
$$\|\hat f_j\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} \hat f_j(X_{ij}, U_i)^2.$$
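The following Python sketch mirrors this construction. It assumes SciPy >= 1.8 (for `BSpline.design_matrix`) and statsmodels' `QuantReg` for the check-loss fit; the centering by the spline estimate of $Q_\tau(Y \mid U)$ follows the definition of $\hat f_j$ above, and the function names and basis sizes are our own choices, not the authors' code.

```python
import numpy as np
from scipy.interpolate import BSpline               # needs SciPy >= 1.8
from statsmodels.regression.quantile_regression import QuantReg

def bspline_basis(u, n_basis=7, degree=3):
    """Cubic B-spline design matrix for points u in [0, 1]."""
    n_interior = n_basis - degree - 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    return BSpline.design_matrix(u, knots, degree).toarray()

def marginal_utility(y, xj, u, tau, n_basis=7):
    """Empirical norm ||f_j_hat||_n^2 of the marginal effect of X_j."""
    B = bspline_basis(u, n_basis)                    # basis for b_j0 and b_j
    design = np.hstack([B, B * xj[:, None]])         # columns: [B, diag(x_j) B]
    fit = QuantReg(y, design).fit(q=tau)             # solves the check-loss problem
    fitted = design @ fit.params                     # b_j0_hat(U) + b_j_hat(U) X_j

    # Baseline: spline quantile regression of Y on U alone, Q_tau_hat(Y | U).
    base = B @ QuantReg(y, B).fit(q=tau).params

    return np.mean((fitted - base) ** 2)
```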

Sure screening property
Here and below we give the following regularity conditions to establish the sure screening properties.
(i) Each coefficient function $\beta_j(\cdot)$ belongs to $\mathcal{F}$, a class of functions defined on $[0, 1]$ whose $l$th derivative satisfies a Lipschitz condition of order $c$:
$$|f^{(l)}(s) - f^{(l)}(t)| \le C|s - t|^{c}, \quad s, t \in [0, 1].$$
(ii) The conditional density of $Y$ given $(X_j, U)$ is uniformly bounded away from 0 and infinity near the conditional quantile of interest; that is, there exist positive constants $h_1$ and $h_2$ such that $0 < h_1 \le f_{Y \mid X_j, U}(\cdot) \le h_2 < \infty$. (vi) The density function $g$ of $U$ is bounded away from zero and infinity on $\mathcal{U}$; that is, $0 < T_1 \le g(u) \le T_2 < \infty$ for some constants $T_1$ and $T_2$. (vii) For all $j = 1, \ldots, p$, there exist a positive constant $K_1$ and $r_1 \ge 2$ such that $\sup_u E(|X_j|^{r_1} \mid U = u) \le K_1$, and there exist positive constants $K_2$ and $r_2$ such that $P(|Y| > t \mid U = u) \le K_2 \exp(-t^{r_2})$ uniformly on $\mathcal{U}$, for any $t \ge 0$.
Conditions (i)–(iv) were used by He et al. [10] to prove the variable screening property for additive models. Conditions (v)–(vii) restrict the moments of the covariates and the response given the exposure variable and bear some similarity to those in [6].
We assume that the covariates are standardized, that is, $E(X_j) = 0$ and $E(X_j^2) = 1$ for $j = 1, \ldots, p$.

Proposition 1 If conditions (i)–(iv) hold, then $\min_{j \in \mathcal{M}_*} \|f_j\|^2 \ge c_6 n^{-\iota}$ for some positive constant $c_6$ and all sufficiently large $n$.
The proof of Proposition 1 is given in the supplemental material [20].

Theorem 1 Suppose conditions (i)–(vii) hold. (1) There exist positive constants $c_7$, $c_8$, $t_2$ and $t_3$ such that the estimated marginal utilities $\|\hat f_j\|_n^2$ concentrate around their population counterparts $\|f_j\|^2$ at an exponential rate, uniformly in $j = 1, \ldots, p$. (2) With the threshold $\gamma_n$ chosen appropriately, $P(\mathcal{M}_* \subseteq \widehat{\mathcal{M}}) \to 1$, that is, the sure screening property holds.

The proof of Theorem 1 is given in the supplemental material [20].
Remark 1 According to Theorem 1, we can handle a number of covariates of order $p = o(\exp(n^{1-4\iota}))$.

False selection rates
From Proposition 1, the ideal case for a vanishing false selection rate is when $\max_{j \notin \mathcal{M}_*} \|f_j\|^2 = o(n^{-\iota})$, so that important and unimportant variables are naturally separated. By Theorem 1(1), when the bound in Equation (10) tends to 0, the estimated utilities preserve this separation with probability tending to 1. For this reason, by choosing $\gamma_n$ as in Theorem 1(2), model selection consistency, $P(\widehat{\mathcal{M}} = \mathcal{M}_*) \to 1$, can be established. We now consider the more general case.
Theorem 2 Suppose the conditions in Theorem 1 hold. Then, for any $\gamma_n = c_9 d_n n^{-\iota}$, the size of the selected model, and hence the number of falsely selected variables, is controlled with probability tending to one.

The proof of Theorem 2 is given in the supplemental material [20].

The method of QNIS
When applying QNIS to select variables, we inevitably suffer from false negatives (i.e. missing important predictors that are marginally weakly correlated but jointly correlated with the response) and false positives (FP) (i.e. selecting unimportant predictors that are highly correlated with the important ones). Hence, we use the group-SCAD penalty as our selection strategy within an iterative framework, which reduces the large-scale variable set to a moderate-scale one and enhances the performance of our proposed method.

Description of the Algorithm
We build a data-driven threshold $\gamma_{\tau n}$ for independence screening by extending the random permutation idea of Zhao and Li [21], which allows only a $1 - q$ proportion (for a given $q \in [0, 1]$) of inactive variables to enter the model when $X$ and $Y$ are not related (the null model). We decouple $X_i$ and $Y_i$ by random permutation, so that the resulting data $\{X_{\pi(i)}, Y_i\}$ follow a null model, where $\{\pi(1), \ldots, \pi(n)\}$ is a random permutation of $\{1, \ldots, n\}$. The algorithm proceeds as follows (a code sketch of the permutation threshold and the SCAD penalty follows this list):

(1) For $j = 1, \ldots, p$, compute the marginal estimates in (3) and the marginal utilities in (4), and obtain an initial set $\mathcal{M}_0$.

(2) For a given quantile $\tau \in (0, 1)$, regress $Y$ on $\{U, X_j, j \in \mathcal{M}_0\}$ by quantile regression to obtain the estimators (at level $\tau$), namely the intercept $\hat\beta_{n0}$ and the functional coefficient estimators $\{\hat\beta_{nj}, j \in \mathcal{M}_0\}$. Conditioning on $\mathcal{M}_0$, form the $n$-dimensional partial residual $Y^*$, which reflects the additional utility of each covariate conditional on the selected set $\mathcal{M}_0$. Randomly permute the partial residual $Y^*$, yielding $Y^{**}$. After a quantile regression on the decoupled data $\{(Y^{**}, U, X_j), j \in \mathcal{M}_0^c\}$, we obtain $\|\hat f_j^{**}\|_n^2$. Let $\gamma_q^*$ be the $q$-ranked magnitude of $\|\hat f_j^{**}\|_n^2$, $j \in \mathcal{M}_0^c$. The active set of covariates is chosen as $\mathcal{A}_k = \{j \in \mathcal{M}_0^c : \|\hat f_j\|_n^2 \ge \gamma_q^*\}$. In our numerical examples, we use $q = 1$.

(3) We apply the group-SCAD penalty on $\mathcal{A}_k$ to select a subset of variables $\mathcal{M}_k = \{j \in \mathcal{A}_k : \hat\gamma_{jk} \ne 0\}$, where $p_\lambda(\cdot)$ is the SCAD penalty with $p_\lambda(0) = 0$ and derivative
$$p_\lambda'(t) = \lambda\Bigl\{I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda} I(t > \lambda)\Bigr\}, \quad t > 0.$$
Following Fan and Li [4], we set $a = 3.7$. We select $\lambda$ by the BIC (Bayesian information criterion) $n \log(\hat\sigma_\varepsilon^2) + e\, d_n \log n$, where $e$ is the number of covariates chosen.

(4) Repeat steps 2–3 with $\mathcal{M}_{k-1}$ replaced by $\mathcal{M}_k$, $k = 1, 2, \ldots$, obtaining $\mathcal{A}_{k+1}$ and $\mathcal{M}_{k+1}$.
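Below is a compact Python sketch of the permutation threshold of step 2 and the SCAD penalty used in step 3. It is our illustration rather than the authors' implementation: `marginal_utility` is the helper sketched in Section 2, and the function names and defaults are assumptions.

```python
import numpy as np

def permutation_threshold(y_partial, X_inactive, u, tau, q=1.0, rng=None):
    """Data-driven threshold gamma*_q: recompute marginal utilities after
    decoupling the (partial) response from the covariates by permutation."""
    rng = np.random.default_rng() if rng is None else rng
    y_null = y_partial[rng.permutation(len(y_partial))]   # Y** follows a null model
    util_null = np.array([
        marginal_utility(y_null, X_inactive[:, j], u, tau)  # helper from Section 2
        for j in range(X_inactive.shape[1])
    ])
    return np.quantile(util_null, q)   # q = 1 gives the maximum null utility

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|) of Fan and Li [4], with p_lambda(0) = 0."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))
```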

Greedy method
Following Wang [18], Fan and Song [7] and Fan et al. [6], we also implement a greedy version of QNIS. We select the top $K$ variables with the largest marginal norms $\|\hat f_j\|_n^2$, that is, $|\mathcal{M}_0| = K$. We then apply steps 2 and 3 above to obtain $\mathcal{A}_k$ and $\mathcal{M}_k$, and iterate as in step 4 (see the sketch below).
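The greedy variant changes only the initialization; a minimal sketch, again assuming the `marginal_utility` helper from Section 2:

```python
import numpy as np

def greedy_init(y, X, u, tau, K):
    """Initial set M_0: the K covariates with the largest marginal norms."""
    util = np.array([marginal_utility(y, X[:, j], u, tau)
                     for j in range(X.shape[1])])
    return np.argsort(util)[-K:]   # indices of the top-K utilities
```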

Numerical studies
In this section, we present several simulation examples to assess the performance of the proposed method (QNIS) under the conditional linear structure in Equation (1). We also compare QNIS with the NIS method of Fan et al. [6]. We use a common setup for the following simulations: cubic B-splines, $L_n = 7$, sample size $n = 500$, number of variables $p = 1000$, and $N = 100$ simulation replications for each example.

Example 1
This example is adapted from Fan et al. [6]. Let $\{W_1, \ldots, W_p\}$ be i.i.d. standard normal and $\{U_1, U_2\}$ be i.i.d. standard uniform random variables. We generate $\{Y, U, X\}$ by mixing $\{W_j\}$ and $\{U_1, U_2\}$ through parameters $t_1$ and $t_2$, with noise $\varepsilon$ following a Cauchy distribution. Hence $\operatorname{corr}(X_j, X_k) = 0$ for $j \ne k$ when $t_1 = t_2 = 0$; when $t_1 = 3$ and $t_2 = 1$, $\operatorname{corr}(X_j, X_k) = 0.43$ for $j \ne k$ and $\operatorname{corr}(X_j, U) = 0.46$. We simulate the model at quantile levels $\tau = 0.5$ and $\tau = 0.75$. We use the conditional permutation method for different $K$ (when $K = 0$, it reduces to unconditional permutation), with the screening threshold taken as $\gamma_q^*$ with $q = 1$. We report the average true positive (TP) number, the model size, the lower bound of the marginal signal of the true variables, and the upper bound of the marginal signal of the false variables under different correlation settings.
From Table 1, we find that for heteroscedastic data the QNIS method performs better than NIS in selecting important variables at the quantiles of interest; the two behave similarly only at quantile 0.5. When $K = 0$ and the correlation among covariates is large, there is hardly any separation between the marginal utilities of the true variables and the false ones, which makes unconditional random permutation an infeasible way to determine the screening threshold in practice. When we consider a few choices of the tuning parameter $K$, with the screening threshold taken as $\gamma_q^*$ with $q = 1$, the improvement is dramatic. When $K = 4$ or $K = 8$, all the true variables are already included in the first $K$ variables (i.e. $\mathcal{M}_* \setminus \mathcal{M}_0 = \emptyset$), so the minimum true signal is not available, reported as 'NA'. Based on Table 1, the conditional permutation method for selecting the threshold at different quantiles of interest maintains the sure screening property and dramatically reduces the model size for heteroscedastic data.
Example 2 Let $\{U, X\}$ and $\varepsilon$ be the same as in Example 1, and construct a new model with different coefficient functions. In Example 2, we explore the performance of the Conditional-NIS and Greedy-NIS methods, taking $K = 5$ for the conditional method. We report the average numbers of TP and FP and the prediction error (PE). Table 2 reports the results for the simulated model specified in Example 2. Different numbers of covariates are selected at different quantiles; that is, the QNIS method outperforms NIS for heterogeneous data. Correlation between covariates, or between covariates and the exposure variable, influences the performance of sure screening. When the covariates are independent or weakly correlated, TP equals the number of covariates in the true model and FPs are rare. As the correlation gets stronger, TP decreases and FP increases. Greedy-NIS selects slightly more FPs than Conditional-NIS because, at each step, Greedy-NIS selects the top variables by fitting the residuals conditional on the previously chosen variable set and thus tends to overfit.

Tecator Infratec Food Data
We illustrate the performance of our method through an analysis of the Meatspec data set, which contains data from the Tecator Infratec Food Analyzer and is available in the R add-on package faraway and on StatLib. The 100 predictors are channel spectrum measurements, and the response variable is 'fat'. A total of $n = 215$ observations are available. We artificially augment the covariates using i.i.d. standard normal random variables $\{Z_1, \ldots, Z_p\}$ and a standard uniform exposure variable $U$, obtaining covariates $\{V_1, \ldots, V_p\}$, and we regress on $\{U, V_1, \ldots, V_p\}$ with $p = 1000$.
In the Meatspec data set, all variables are strongly relevant to the response 'fat'. To validate the effectiveness of our method, we add a heteroscedastic term $(V_{101}^2 + V_{102})\varepsilon$, $\varepsilon \sim N(0, 1)$, to the regression model. We repeat the analysis 100 times and report the average PE and model size. The results are presented in Table 3, from which we see that the QNIS method is effective for heterogeneous data: the number of covariates selected differs across quantiles, which cannot be achieved with NIS.
From Table 4, we can see that our method is very effective in selecting active variables and filtering out noise variables in the high dimensional setting, and it acquires different model sizes for different quantiles. In conclusion, the proposed QNIS methodology is useful: under the sparsity assumption, it can recover a relatively accurate model and be applied for practical forecasting.

Disclosure statement
No potential conflict of interest was reported by the authors.