Threshold Selection in Feature Screening for Error Rate Control

Abstract. A hard thresholding rule is commonly adopted in feature screening procedures to screen out unimportant predictors for ultrahigh-dimensional data. However, different thresholds are required to adapt to different contexts of screening problems, and an appropriate thresholding magnitude usually varies with the model and error distribution. With an ad hoc choice, it is unclear whether all of the important predictors are selected, and the procedure is likely to include many unimportant features. We introduce a data-adaptive threshold selection procedure with error rate control, which is applicable to most popular screening methods. The key idea is to apply a sample-splitting strategy to construct a series of statistics with a marginal symmetry property and then to exploit the symmetry to approximate the number of false discoveries. We show that the proposed method asymptotically controls the false discovery rate and the per family error rate under certain conditions while still retaining all of the important predictors. Three important examples are presented to illustrate the merits of the proposed procedures. Numerical experiments indicate that the proposed methodology works well for many existing screening methods. Supplementary materials for this article are available online.


Introduction
Ultrahigh-dimensional data analysis is now frequently encountered in diverse fields of scientific research, such as biomedical imaging, functional magnetic resonance imaging, microarrays, and high-frequency financial data, among many others. The ultrahigh dimensionality poses great computational and statistical challenges in statistical inference (Fan, Han, and Liu 2014). To explore the relationship between a response variable Y and a p-dimensional predictor vector X = (X_1, . . ., X_p)^T efficiently, Fan and Lv (2008) introduced the sure independence screening (SIS) procedure, which screens out as many uninfluential predictors as possible while retaining all influential variables via a computationally efficient and stable algorithm. Typically, one then applies existing regularization methods to further clean up uninfluential predictors based on the model selected by the SIS procedure. As SIS builds on marginal Pearson correlations between the response and the features, various extensions of correlation have been proposed to deal with more general cases. See Fan and Lv (2018) for an updated review on this topic.
A feature screening procedure first defines a nonnegative statistic ω_j measuring the importance of X_j to Y. Generally, ω_j quantifies the marginal association of X_j and Y in a certain sense, and we assume without loss of generality that the greater ω_j is, the more important X_j is. Suppose that ω̂_j is an estimate of ω_j based on the sample {X_ji, Y_i}, i = 1, . . ., n. Thus, ω̂_j can be viewed as a marginal utility to rank the importance of X_j at the sample level. We select the set of important predictors with large ω̂_j:

Ŝ(T) = {j : ω̂_j ≥ T, j = 1, . . ., p},    (1)

where T is a threshold. Clearly, T controls the model complexity and plays an important role in model selection. To achieve the sure screening property, theoretical choices of T have been derived in the literature. In general, the theoretical choice of T depends on unknown quantities related to the joint distribution of X and Y. Fan and Lv (2008) advocated retaining a fixed number, d, of predictors. A common choice is d = [n/log n]. This is equivalent to sorting the ω̂_j and obtaining the order statistics ω̂_(1) ≤ ω̂_(2) ≤ · · · ≤ ω̂_(p), and then setting T = ω̂_(p−d+1). This hard thresholding rule has become a standard choice in the literature. However, choosing an appropriate d remains a challenge in practice. On one hand, one can select a conservative choice, say a very large d, to ensure that all influential features are included with high probability, but such an ad hoc choice would in turn include too many noise predictors that should be discarded. On the other hand, a too small d would fail to ensure the sure screening property. Zhu et al. (2011) proposed a strategy of choosing the threshold for their sure independent ranking screening procedure by adding auxiliary covariates. Pan, Wang, and Li (2016) considered using the BIC criterion in the linear discriminant analysis setting. Those strategies aim for specific marginal utilities, and therefore, it is desirable to develop a general strategy for selecting d under a unified feature screening setting.
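As a concrete illustration, the top-d hard-thresholding rule above can be sketched in a few lines. The helper below (a hypothetical name, not from the paper) uses the absolute Pearson correlation of SIS as the marginal utility and keeps the d = [n/log n] features with the largest statistics.

```python
import numpy as np

def sis_top_d(X, Y, d=None):
    """Rank features by |Pearson correlation| with Y and keep the top d,
    following the hard-thresholding rule d = [n / log n]."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    # marginal utility omega_j = |corr(X_j, Y)|
    omega = np.abs(Xc.T @ Yc) / (np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((Yc**2).sum()))
    # equivalent to the threshold T = omega_(p - d + 1)
    keep = np.argsort(omega)[::-1][:d]
    return keep, omega
```

The difficulty the text points out is precisely the choice of `d`: the default above is a convention, not a data-driven quantity.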
There is an obvious tradeoff: a small threshold would include more active features but retain more inactive features as well. Besides the power aspect in terms of the sure screening property, we often would like to know whether the estimated model Ŝ(T) enjoys reproducibility in the sense that the fraction of noise features is controlled (Fan et al. 2020a). However, research on error rate control in feature screening, such as the false discovery rate (FDR), familywise error rate (FWER), or per family error rate (PFER), is very limited. This is partly because the distributions of the ω̂_j's may vary across j and are often difficult to approximate, let alone their joint distribution. Among others, Fan, Samworth, and Wu (2009) established an upper bound on the probability of recruiting any inactive variables for the hard thresholding rule under an exchangeability condition. Zhu et al. (2011) suggested a soft thresholding rule by adding artificial auxiliary variables to the data and obtained a nonasymptotic bound similar to the one in Fan, Samworth, and Wu (2009). Hao and Zhang (2017) suggested a new concept called the oracle p-value and discussed its application to variable screening with FDR control. Recently, Chudik, Kapetanios, and Pesaran (2018) considered FDR control with marginal Pearson correlations. There is clearly a lack of a systematic approach to determining the threshold so that a certain error rate can be controlled.
In this work, we propose a simple yet effective selection procedure based on sample splitting. The method reforms the marginal utility statistics into p new statistics with a marginal symmetry property, using the empirical distribution of the negative statistics to approximate that of the positive ones. The proposed methodology is generic and applicable to most existing feature screening statistics. Under a unified framework and some mild conditions, we show that the proposed method controls the FDR and PFER asymptotically. We also prove that the procedure still retains all of the important predictors. The procedures are theoretically illustrated with several commonly used screening approaches, including the SIS procedure of Fan and Lv (2008), the robust rank correlation screening procedure of Li et al. (2012), and the Kolmogorov filter proposed by Mai and Zou (2012). Numerical experiments indicate that the proposed rule works well in a wide range of settings.
The remainder of the article is structured as follows. In Section 2, we present the basic procedure. Asymptotic justification is provided in Section 3. Numerical studies are conducted in Section 4. Conclusions and discussions are given in Section 5, and theoretical proofs are delineated in the Appendix. Some technical details and additional numerical results are provided in the supplementary material.

FDR Control
For ease of exposition, we start with FDR control, which is a particularly useful tool for maintaining the ability to reliably detect true signals without excessive false positives when large-scale hypotheses are tested simultaneously (Benjamini and Hochberg 1995). Other error rates will be discussed later on.
In what follows, we term X_j an uninformative predictor if ω_j ≤ b_n for some sequence b_n → 0 as n → ∞. The uninformative set is accordingly H_0 = {j : ω_j ≤ b_n, j = 1, . . ., p}, and its complement H_1 is termed the informative set. Denote p_0 = |H_0| and p_1 = |H_1|; throughout this article we assume that p_1 is dominated by p, that is, p_1 = o(p).
Consider a screening procedure defined in (1) which yields a selected set Ŝ(T). A false discovery is made by Ŝ(T) if j ∈ Ŝ(T) ∩ H_0. So, the false discovery proportion (FDP) associated with Ŝ(T) is

FDP(Ŝ(T)) = |Ŝ(T) ∩ H_0| / max{|Ŝ(T)|, 1},

and the FDR is accordingly defined as the expectation of FDP(Ŝ(T)). The main goal is to find a data-adaptive threshold T such that the asymptotic FDR is controlled at a pre-specified level α.
We first impose some conditions on the marginal utility statistic.
Assumption 1. (i) For j ∈ H_0 and a known number γ > 0, n^γ ω̂_j →_d N_j for some nondegenerate random variable N_j; (ii) for j ∈ H_1, ω̂_j →_p ω_j > 0. Here →_d and →_p denote convergence in distribution and convergence in probability, respectively. It is required that γ, the convergence rate of ω̂_j, is known, but we do not need any information about N_j. This assumption is quite mild and is satisfied by most existing screening procedures. For example, if we take b_n = o(n^−γ), then γ = 1/2 for the SIS procedure proposed by Fan and Lv (2008), in which ω̂_j is the absolute value of the Pearson correlation coefficient between X_j and Y, while γ = 1 for the distance correlation SIS (DC-SIS, Li, Zhong, and Zhu 2012).
We next propose a data-driven threshold for controlling the FDR of the feature screening procedure based on the marginal utility statistic ω̂_j. We randomly split the whole dataset into two disjoint groups D_1 = (X, Y)_1 and D_2 = (X, Y)_2 of unequal sizes n_1 = n(K − 1)/K and n_2 = n/K, where K ≥ 3, assuming for simplicity that n/K is an integer. The marginal utility statistics for the jth variable on D_1 and D_2 are denoted as ω̂_j1 and ω̂_j2, respectively. Define

W_j = n_1^γ ω̂_j1 − n_2^γ ω̂_j2.    (2)

By Assumption 1-(ii), for j ∈ H_1 both ω̂_j1 and ω̂_j2 converge to ω_j > 0, so W_j = n_2^γ{(K − 1)^γ ω̂_j1 − ω̂_j2} diverges with probability tending to one, and thus, W_j > 0. For any j ∈ H_0, W_j is (asymptotically) symmetric around zero due to Assumption 1-(i) and the independence between D_1 and D_2. That is, W_j can discriminate between an uninformative predictor and an informative one, and it has a marginal symmetry property for all the uninformative predictors.
Motivated by the properties of W_j, we propose to choose a threshold L > 0 via

L = inf{ t > 0 : #{j : W_j ≤ −t} / max{#{j : W_j ≥ t}, 1} ≤ α },    (3)

and select the predictors by M(L) = {j : W_j ≥ L}, where α is the target FDR level. If the set in (3) is empty, we simply set L = +∞. Suppose the signals ω_j, j ∈ H_1, are not too weak. Then #{j : W_j ≤ −t} is a good approximation to #{j : W_j ≤ −t, j ∈ H_0}, which in turn is a good approximation to #{j : W_j ≥ t, j ∈ H_0} due to the marginal symmetry of W_j for j ∈ H_0. Theoretically speaking, the key idea is to exploit the following symmetry property (see Lemma A.1 in the Appendix and Condition 2):

#{j ∈ H_0 : W_j ≥ t} ≈ #{j ∈ H_0 : W_j ≤ −t} uniformly over 0 < t ≤ M_n,    (4)

for some large M_n. It implies that the fraction in (3) is an estimate of the FDP. Since we use the empirical distribution of the negative statistics to approximate that of the positive ones, we refer to our procedure as REflection via Data Splitting (REDS). Let us briefly discuss how to choose K. The way we construct W_j is to ensure that (a) W_j is (asymptotically) symmetric with mean zero for j ∈ H_0, and (b) W_j is a large positive value for j ∈ H_1, without imposing other requirements on N_j. Intuitively, with a larger K, the strength of the signal for j ∈ H_1 would be larger, and we would be more likely to select the important predictors. However, this leads to a smaller sample size n_2, which would in turn yield a slower convergence rate of n_2^γ ω̂_j2 for j ∈ H_0. Consequently, the marginal symmetry of W_j for j ∈ H_0 may be violated to a certain degree, and the FDR control of the proposed procedure would be compromised. In practice we recommend K = 3, which performs quite well in our numerical study.
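A minimal sketch of the REDS construction, assuming the split statistics ω̂_j1 and ω̂_j2 are already computed and taking γ = 1/2 as for Pearson-type utilities. The function name is ours, not from the paper.

```python
import numpy as np

def reds_threshold(omega1, omega2, n, K=3, gamma=0.5, alpha=0.2):
    """REDS: form W_j = n1^gamma * omega_j1 - n2^gamma * omega_j2, then
    return the smallest t whose estimated FDP
    #{W_j <= -t} / max(#{W_j >= t}, 1) is at most alpha."""
    n2 = n // K
    n1 = n - n2
    W = n1**gamma * omega1 - n2**gamma * omega2
    for t in np.sort(np.abs(W[W != 0])):   # candidate thresholds
        if np.sum(W <= -t) / max(np.sum(W >= t), 1) <= alpha:
            return t, W
    return np.inf, W                       # empty set: L = +infinity
```

The selected model is then M(L) = {j : W_j ≥ L}; scanning only the observed |W_j| suffices because the estimated FDP changes only at those points.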
Before we pursue this further, let us illustrate the idea of the proposed REDS procedure via a toy example. We generate a simulated dataset from a linear model in which X comes from a multivariate normal distribution with the correlation between X_j and X_k being 0.2 for j, k ∈ H_1 and 0 otherwise. In this illustration, we set n = 100, p = 1000, p_1 = 40, and K = 3. As in Fan and Lv (2008), we take the absolute value of the Pearson correlation as the marginal utility statistic. As a visual representation of the proposed procedure, Figure 1(a) depicts the scatterplot of the screening statistic pairs {(K − 1)^γ ω̂_j1, ω̂_j2} for each feature j, with red triangles and black dots denoting informative and uninformative predictors, respectively. A feature j whose point lies below the diagonal line has a positive value of W_j by definition, and vice versa. Figure 1(b) depicts the resulting W_j's. Observe that the true signals lie primarily above the x-axis, indicating W_j > 0, while the uninformative W_j's (black dots) are roughly symmetrically distributed about the horizontal line. Figure 1(c) shows the corresponding estimated FDP curve (against t), that is, the fraction in (3), along with the true FDP. The approximation in this case is very good, as only very few truly informative predictors lie below the horizontal line in Figure 1(b).

Per Family Error Rate Control
The proposed REDS approach can also be used to provide a rough control of other error rates, such as the per family error rate (PFER) and the k-FWER, the probability of making at least k false discoveries. We refer to Romano and Wolf (2007), more recently Janson and Su (2016), and the references therein. Here we discuss the PFER only.
The PFER, the expected number of false discoveries, has a clear interpretation, making it a useful criterion in feature screening approaches. Controlling the PFER amounts to finding a t so that the expected number of false discoveries in M(t) is no more than k_0, where k_0 is the target PFER level. Again, by the intuition that #{j : W_j ≤ −t} approximates the number of false discoveries, we can obtain an approximately correct threshold by

L′ = inf{ t > 0 : #{j : W_j ≤ −t} ≤ k_0 }.    (5)

The proposed thresholds L in (3) and L′ in (5) are in a similar spirit to that in the knockoff framework introduced by Barber and Candès (2015) and further advanced as "Model-X" knockoffs (Candes et al. 2018), where the context is regression models. The model-X knockoffs framework can control FDR without assuming a specific regression model. However, it requires the joint distribution of the predictors to be known. The robustness of model-X knockoffs is theoretically investigated by Fan et al. (2020a) and Barber, Candès, and Samworth (2020). For other recent developments of variable selection via knockoffs, see also Barber and Candès (2019) and Fan et al. (2020b). The knockoff procedure operates by constructing "knockoff copies" of each of the p predictors (features) with certain knowledge of the predictors or responses. The signs of test statistics constructed via knockoffs satisfy (at least roughly) joint exchangeability and thus can yield accurate FDR control in finite samples (Candes et al. 2018). However, in feature screening problems, we usually do not have enough information on (Y, X_1, . . ., X_p), and exact knockoff copies are generally not available when p is too large compared with n. The knockoff framework tests conditional hypotheses, while we follow the literature of marginal feature screening (Fan and Lv 2008) and mainly focus on marginal hypotheses. Although conditional and marginal hypotheses are two different concepts, marginal hypotheses can still be informative for the conditional ones in an ultrahigh-dimensional environment, as has been successfully demonstrated by numerous authors in the field of marginal feature screening. Recently, based on marginal linear regression, McKeague and Qian (2015) proposed an adaptive resampling test to detect the presence of significant predictors. See also Wang, McKeague, and Qian (2018) and the references therein for more examples.
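The PFER rule replaces the ratio in the FDR threshold with the raw count of large negative W_j; a sketch under the same conventions as before (our naming):

```python
import numpy as np

def pfer_threshold(W, k0=1.0):
    """Smallest t with #{j : W_j <= -t} <= k0; the count of large negative
    W_j serves as a proxy for the number of false discoveries at level t."""
    for t in np.sort(np.abs(W[W != 0])):
        if np.sum(W <= -t) <= k0:
            return t
    return np.inf
```

As with the FDR version, the set of candidate thresholds can be restricted to the observed |W_j| because the count only changes at those values.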
The REDS approach for choosing L and L′ provides a unified framework for threshold selection in commonly used marginal feature screening. Whether or not n^γ ω̂_j is asymptotically distribution-free, our approach is applicable provided that n^γ ω̂_j converges to a nondegenerate variable N_j when j ∈ H_0 and ω̂_j converges to ω_j > 0 when j ∈ H_1. Also, notice that the proposed procedure is computationally efficient: it only uses a one-time split of the data and the calculation of the W_j in (2). This is particularly useful in feature screening, where one usually expects a rapid scheme with minimal requirements on computing and storage. In contrast to the knockoff approach, we use the sample-splitting idea to achieve a marginal symmetry property. It turns out that error rates like the FDR can be controlled in a reasonable range even with only marginal symmetry, provided that the empirical distribution converges under certain dependence structures (Storey, Taylor, and Siegmund 2004). We note that the sample-splitting idea has been used by many authors in different contexts. For instance, Wasserman and Roeder (2009) first divided the data into two independent parts, used one part to narrow down the focus, and then used the remainder to perform inference. Fan, Guo, and Hao (2012a) and Chen, Fan, and Li (2018) adopted this idea to estimate the error variance in high-dimensional regression models.

Data-Driven Hard Threshold for Screening Methods
The proposed thresholds L in (3) and L′ in (5) are based on error rate control. We control the error rates by a joint use of the two screening statistics from the two splits. In practice, one may prefer to use the original full-sample statistic ω̂_j. We next demonstrate how to use L and L′ to carry out hard-threshold feature screening. We focus on L only, since we may simply replace L by L′ if we want PFER control. For the selected L, feature screening with the hard threshold |M(L)| is equivalent to using the threshold T defined by T = ω̂_(p−|M(L)|+1). The rationale is as follows. If the FDP is approximately controlled at the level α, there are roughly |M(L)|(1 − α) informative predictors in M(L). Because ω̂_j is expected to be at least as effective as ω̂_j1, those informative predictors would be retained with overwhelming probability when we use ω̂_j for screening. That in turn would yield no more false discoveries than the REDS selection, resulting in a more conservative rule. But due to the use of full-sample information, we can expect that the sure screening property would be better achieved. We will refer to this procedure as the hard-thresholding rule with REDS; the whole algorithm is summarized as follows.
Algorithm 1: Hard-thresholding rule with REDS (HTR)
• Step 1: Randomly split the data into two parts and compute ω̂_kj for k = 1, 2 and j = 1, . . ., p;
• Step 2: Compute W_j by (2), find L by (3), and obtain M(L);
• Step 3: Select Ŝ(T) with the hard threshold T = ω̂_(p−|M(L)|+1).
In the next section, we will show that, with the help of the FDR control of the REDS procedure, the hard-thresholding strategy HTR is also able to control the FDR asymptotically. Hence, the REDS has considerable merit in the sense that M(L) is always informative, especially for complex data and computation-intensive marginal utility statistics where approximate distributions of the ω̂_j are not obvious. A preliminary REDS step provides insight into the appropriate magnitude of the threshold.
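Steps 2 and 3 of the algorithm can be sketched as follows, given the W_j from (2) and the full-sample statistics ω̂_j (the helper name is ours):

```python
import numpy as np

def htr_from_reds(W, omega_full, alpha=0.2):
    """Find L by the FDP-estimate rule from the split statistics W_j, then
    apply the hard threshold T = omega_(p - |M(L)| + 1): keep the |M(L)|
    features with the largest full-sample statistics."""
    m = 0
    for t in np.sort(np.abs(W[W != 0])):
        if np.sum(W <= -t) / max(np.sum(W >= t), 1) <= alpha:
            m = int(np.sum(W >= t))
            break
    return np.argsort(omega_full)[::-1][:m]   # empty if no valid t found
```

Only the size |M(L)| is carried over from the split statistics; the final ranking uses the full sample.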
One may be wary of the stability of a one-time split. To reduce the randomness arising from a single sample split (Meinshausen, Meier, and Bühlmann 2009), we may employ the "bagging" technique to aggregate results from multiple sample-splitting runs. Say, we run the first two steps of Algorithm 1 B times at a nominal level α and obtain a collection of |M_k(L)|, k = 1, . . ., B. We then aggregate them via some commonly used bagging strategy, such as voting or averaging. For example, we may use [B^−1 Σ_k |M_k(L)|] as our hard threshold. Simulation results in the supplementary material show that, compared to a one-time split, the multiple-splits refinement yields similar FDR and power but slightly smaller variation in the FDP.
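One simple form of this aggregation, assuming a user-supplied routine that returns |M_k(L)| for one random split (the names here are illustrative):

```python
import numpy as np

def bagged_hard_threshold(one_split_size, B=20, seed=0):
    """Run the split-and-threshold step B times and take the floored
    average of the B model sizes |M_k(L)| as the final hard threshold."""
    rng = np.random.default_rng(seed)
    sizes = [one_split_size(int(rng.integers(1 << 31))) for _ in range(B)]
    return int(np.floor(np.mean(sizes)))
```

`one_split_size` would wrap Steps 1-2 of Algorithm 1 with a fresh random split seeded by its argument.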

Theoretical Results
In this section, we study the theoretical properties of the proposed procedures. We start with a general setting in Section 3.1 and then work on specific settings corresponding to several commonly used marginal utility statistics in the feature screening literature.

General Settings
In this section, we focus on studying the property of the FDR-based threshold L. We first establish a finite-sample property of W_j.
Proposition 1. Assume that W_j, 1 ≤ j ≤ p, are well defined. For any α ∈ (0, 1), the REDS selection procedure satisfies a finite-sample bound in which the FDR exceeds α by at most a sum of terms, one for each j ∈ H_0. Each such term measures the extent to which the symmetry is violated for variable j, as well as the effect of the dependence between W_j and W_−j on the FDR. This result concurs with our intuition that controlling these terms is sufficient to ensure FDR control for the REDS method. In the most ideal case where the W_j's are symmetrically distributed and independent, the terms vanish for all j ∈ H_0, and we obtain the exact FDR-control result. Under asymmetric and dependent scenarios, these terms can still be expected to be small, due to the convergence of n_1^γ ω̂_j1 and n_2^γ ω̂_j2 to the same distribution if n is not too small.
To establish the asymptotic properties of the proposed procedure, we impose the following technical conditions, which are not the weakest possible but facilitate the proofs. Denote F̄_{N_j}(t) = 1 − F_{N_j}(t), where F_{N_j} is the distribution function of N_j.
Condition 1 (The distribution of marginal utility statistics). (i) For j ∈ H_0, sup_{0<t≤T_U} | Pr(n^γ ω̂_j ≥ t) / F̄_{N_j}(t) − 1 | → 0 for some sequence T_U; (ii) for j ∈ H_1, Pr(|ω̂_j − ω_j| ≥ δ_{p,n}) ≤ d_{p,n} for some sequences δ_{p,n} → 0 and d_{p,n} → 0.

Condition 1-(i) is used to establish the marginal symmetry of W_j. It is closely related to large-deviation theory and is satisfied by many existing screening procedures. We will discuss this condition in more detail for the examples in Section 3.2. Condition 1-(ii) is commonly used in the feature screening literature to derive the sure screening property; for example, in SIS, δ_{p,n} = O(n^−κ) and d_{p,n} = O(exp{−Cn^{1−2κ}/log n}) for 0 < κ < 1/2 and some C > 0, while in DC-SIS, δ_{p,n} = O(n^−κ) and d_{p,n} = O(exp{−Cn^{(1−2κ)/3}}) for 0 < κ < 1/2 and some C > 0 (Fan and Lv 2008; Li, Zhong, and Zhu 2012). Condition 2 sets minimal theoretical requirements for the convergence of the empirical distribution {p_0 G_+(t)}^{−1} Σ_{j∈H_0} I(W_j ≥ t) to its population counterpart. This is reasonable: we cannot expect the selection procedure to work well if the empirical distributions are far from the population ones, since we use the former to search for thresholds. Similar conditions are common; see, for instance, Storey, Taylor, and Siegmund (2004). This condition mainly concerns the correlation among the uninformative statistics W_j, j ∈ H_0. When W_k is independent of all the other statistics, or dependent on only a finite number of them, the condition trivially holds. Certainly, the W_j's inherit their dependence structure from the X_j's, which can affect the validity of the convergence. We will explicitly provide sufficient conditions under which Condition 2 is valid for some specific examples. Denote β_{p,n} = #{j ∈ H_1 : ω_j ≥ 2aδ_{p,n}/(a − 1)} for a constant a > 1.

Theorem 1. Suppose Conditions 1 and 2 hold, and that β_{p,n} → ∞, p_0 d_{p,n} → 0, and T_U/n^η → 0; then for any α ∈ (0, 1), the FDR of the REDS procedure with threshold L satisfies lim sup_{(n,p)→∞} FDR ≤ α.
Theorem 1 implies that the feature screening procedure with the proposed threshold L controls the FDR asymptotically even for dependent predictors. The condition β_{p,n} → ∞ means that the number of informative predictors with identifiable signal strengths is not too small as p → ∞. This appears to be a necessary condition for FDP control. For example, in the context of multiple testing, Liu and Shao (2014) showed that even with the true p-values, no method can control the FDP with high probability if the number of true alternatives is fixed as the number of hypothesis tests goes to infinity. To see this more clearly, notice that the key step is to establish the validity of (4), for which the convergence of empirical sums such as Σ_{j∈H_0} I(W_j < −t) is needed. When t is extremely large, the number of nonzero terms in the summation would be finite, and consequently the convergence would fail. The condition β_{p,n} → ∞ helps rule out the case of a too-large t. In fact, we will show that Pr(Σ_j I(W_j > L) ≥ β_{p,n}) → 1. This implies that those β_{p,n} informative predictors would be identified with probability tending to 1; thus, there are at least β_{p,n} discoveries in M(L). It follows that Σ_j I(W_j < −L) tends to infinity by the definition of L, and so Σ_{j∈H_0} I(W_j < −L) → ∞, since the indices {j : W_j < −L} are more likely to be uninformative predictors.
The condition β_{p,n} → ∞ is not too stringent, since we only require a few predictors among all the informative ones whose marginal utility measures exceed 2aδ_{p,n}/(a − 1). The conditions p_0 d_{p,n} → 0 and T_U/n^η → 0 are technical and easily satisfied for most existing screening statistics; this will become clearer in Section 3.2. Zhu et al. (2011) suggested independently and randomly generating p auxiliary variables Z from N(0, 1), independent of both X and Y. They proved that when the inactive predictors and the auxiliary variables are exchangeable, the probability of recruiting at least r inactive variables is bounded above by (1 − r/(2p))^p. However, the exchangeability condition is not easily satisfied in practice, especially when Z is generated without using any information about X, and accordingly such a procedure may not maintain the desired error rate. Our proposed procedure is distinguished from Zhu et al. (2011) in that it does not require an exchangeability condition; instead, it uses the sample-splitting scheme to achieve a marginal symmetry property, which proves particularly useful for FDR control.
The following theorem establishes the validity of the PFER control of the REDS procedure with threshold L′.
Theorem 2. Under the conditions of Theorem 1, for any k_0 > 0, we have lim sup_{(n,p)→∞} E|M(L′) ∩ H_0| ≤ k_0.

We next show that Condition 2 holds under some weak dependence structures.
Theorem 3. Condition 2 is valid under either of the following two dependence settings, and accordingly the FDR control given in Theorem 1 and the PFER control given in Theorem 2 hold.
(i) Assume that for each W_j, the number of W_k that are dependent with W_j is no more than r_{p,n}, where r_{p,n}/β_{p,n} → 0 as (n, p) → ∞. (ii) Assume that the average over pairs j ≠ k of the covariances between the indicators I(W_j > t) and I(W_k > t) converges to zero at a logarithmic rate in p, where θ > 0 in that rate is any small constant.
The first condition, (i), imposes independence between W_j and the other p − r_{p,n} statistics W_k. The quantity β_{p,n} is required to diverge faster than r_{p,n}. Note that r_{p,n} is an upper bound on the number of statistics dependent with each W_j, while β_{p,n} is a lower bound on the number of nonzero terms in the summation Σ_{j∈H_0} I(W_j ≤ −t). The requirement on the growth rate of β_{p,n} reflects the law of large numbers under dependence. The second condition allows W_j to be correlated with all the other statistics, but the average of the correlations (in terms of I(W_j > t) and I(W_k > t)) needs to converge to zero at a logarithmic rate. This assumption is similar to the weak dependence structure in Fan, Han, and Gu (2012b). If the correlation matrix contains many large entries, this condition may not hold; a certain degree of sparseness is needed. Certainly, these two conditions are not the weakest possible. In fact, Condition 2 can also be proved using a Bernstein-type inequality under some mixing conditions (Merlevède et al. 2009), together with arguments similar to those in the proof of Theorem 3.
The next result establishes the FDR control of Algorithm 1.
It is worth pointing out that this theorem imposes a minimum signal condition that is not required for the REDS procedure. For HTR, we use the minimum signal strength condition to ensure that, for all j ∈ H_1, n^γ ω̂_j > n_1^γ ω̂_j1 with probability tending to 1. Accordingly, most of the informative predictors identified by REDS would be retained in Ŝ(ω̂_(p−|M(L)|+1)). By contrast, in Theorem 1 the FDR of REDS can be controlled provided that there is a certain number of identifiable signals.
The HTR procedure needs more stringent conditions to achieve asymptotic FDR control.If there are many weakly informative predictors in H 1 , the FDR control is likely to be compromised.
Corollary 1. Suppose the conditions in Theorem 4 all hold, and the FDP of HTR satisfies lim sup_{(n,p)→∞} FDP_HTR ≥ α_0 with α_0 β_{p,n} ≥ 1 in probability. If ω_j > 2δ_{p,n} for all j ∈ H_1, then we have Pr(H_1 ⊆ Ŝ(ω̂_(p−|M(L)|+1))) → 1. This corollary implies that the HTR procedure is capable not only of controlling the FDR level but also of achieving the sure screening property under suitable conditions.

Examples
In this section, we demonstrate that the theoretical results can be directly applied to commonly used marginal utility statistics.
Example 3.1. Fan and Lv (2008) proposed the sure independence screening (SIS) procedure using the Pearson correlation coefficient to measure the dependence between X_j and Y. That is, ω_j = |cov(X_j, Y)| / √(var(X_j) var(Y)) and

ω̂_j = | Σ_{i=1}^n (X_ji − X̄_j)(Y_i − Ȳ) | / √( Σ_{i=1}^n (X_ji − X̄_j)^2 Σ_{i=1}^n (Y_i − Ȳ)^2 ),

where X̄_j and Ȳ are the sample means of {X_ji}_{i=1}^n and {Y_i}_{i=1}^n, respectively. They showed that under suitable conditions, Pr(|ω̂_j − ω_j| ≥ cn^{−κ}) ≤ O(exp{−Cn^{1−2κ}/log n}) for 0 < κ < 1/2 and constants c, C > 0. Thus, we can take δ_{p,n} = cn^{−κ} and d_{p,n} = exp{−Cn^{1−2κ}/log n}. From Lemma S.1 in the supplementary material, we can easily verify that Condition 1-(i) holds uniformly for t ∈ (0, n^{1/6}). Hence, Condition 1 holds.
The feature dimension p is allowed to grow exponentially fast with the sample size n, which fits the requirement of ultrahigh-dimensional settings.
Example 3.3. Mai and Zou (2012) developed the Kolmogorov filter (KF) for variable screening in high-dimensional binary classification. For a binary response Y taking values ±1, they considered ω_j = sup_{−∞<x<∞} |F_{+j}(x) − F_{−j}(x)|, where F_{+j}(x) and F_{−j}(x) are the conditional distribution functions of X_j given Y = 1 and Y = −1, respectively. Correspondingly, ω̂_j = sup_{−∞<x<∞} |F̂_{+j}(x) − F̂_{−j}(x)|, where F̂_{+j}(x) and F̂_{−j}(x) are the empirical conditional distribution functions of X_j given Y = 1 and Y = −1, respectively. Mai and Zou (2012) proved that Pr(|ω̂_j − ω_j| ≥ cn^{−κ}) ≤ O(exp{−Cn^{1−2κ}}) for 0 < κ < 1/2 and constants c, C > 0. Then we can still take δ_{p,n} = cn^{−κ} and d_{p,n} = exp{−Cn^{1−2κ}}. In the supplementary material, we show that Condition 1-(i) holds uniformly for t ∈ (0, n^{1/2}). Hence, we have the following result.
Proposition 4. If Condition 2 and p = exp{o(n^{1−2κ})} hold, then the result given in Theorem 4 is valid for the Kolmogorov filter.
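The KF utility ω̂_j of Example 3.3 can be computed directly from the two empirical conditional distribution functions; a sketch with a hypothetical helper name (`scipy.stats.ks_2samp` computes the same two-sample statistic):

```python
import numpy as np

def kolmogorov_utility(x, y):
    """KF marginal utility: sup_x |F_{+j}(x) - F_{-j}(x)| between the
    empirical CDFs of one feature in the two classes (y = +1 / -1)."""
    xp = np.sort(x[y == 1])    # class +1 sample
    xm = np.sort(x[y == -1])   # class -1 sample
    grid = np.sort(x)          # the sup is attained at observed points
    Fp = np.searchsorted(xp, grid, side="right") / xp.size
    Fm = np.searchsorted(xm, grid, side="right") / xm.size
    return np.max(np.abs(Fp - Fm))
```

Because the statistic is distribution-free for continuous features, its null behavior does not depend on the marginal law of X_j, which is what makes the uniform tail approximation in Condition 1-(i) tractable here.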

Numerical Study
We illustrate the breadth of applicability of the proposed procedure by studying its performance on simulated data and a real-data example over four different screening methods: SIS (Fan and Lv 2008), RRCS (Li et al. 2012), KF (Mai and Zou 2012), and DC-SIS (Li, Zhong, and Zhu 2012). The W_j in (2) are constructed with K = 3.

Simulation Study
The performance of the proposed REDS filter and the hard-thresholding rule with REDS ("HTR") is evaluated along with some benchmarks through comparisons of the FDR; the proportion that all active predictors are selected, that is, P_a = Pr(A ⊆ M(L)); the true positive rate (TPR), that is, the proportion of informative predictors correctly identified as such; and the average model size (MS). Here A is the set of predictors truly related to the response. The hard thresholding rule, which selects a set of predictors of a given size d, is considered for comparison. Following the recommendation in Fan and Lv (2008), we choose d_1 = [n/log n] and d_2 = [2n/log n] as two simple competitors, denoted "HT-d_1" and "HT-d_2", respectively. All results are obtained over 1000 replications with nominal level α = 0.2.
Example 1. For the SIS method in Fan and Lv (2008), we set a linear regression model Y = X^T β + ε. The predictors X follow N(0, Σ), where the covariance matrix Σ = (ρ^{|i−j|})_{p×p} with ρ = 0.8, and ε is from N(0, 1). The coefficient β is a p-dimensional vector whose first s elements are 1 and the rest 0, that is, β_i = 1 for i = 1, . . ., s and β_i = 0 for i = s + 1, . . ., p. We vary the value of s to assess the performance. Under this model, the informative set is H_1 = {j : ω_j > b_n} with b_n = n^{−2/3}, and its cardinality is denoted by R. Note that R will be larger than s, since H_1 contains some variables that are correlated with X_1, . . ., X_s but unrelated to Y.
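The data-generating process of Example 1 and the reported metrics can be reproduced as follows (a sketch under the stated settings; the AR(1) recursion is one standard way to draw from Σ = (ρ^{|i−j|}), and the helper names are ours):

```python
import numpy as np

def simulate_example1(n=100, p=5000, s=15, rho=0.8, seed=0):
    """Example 1: X ~ N(0, Sigma) with Sigma_jk = rho^|j-k| via the AR(1)
    recursion X_j = rho*X_{j-1} + sqrt(1-rho^2)*Z_j, and Y = X beta + eps
    with beta_j = 1 for the first s coordinates and 0 otherwise."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = Z[:, 0]
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * Z[:, j]
    beta = np.zeros(p)
    beta[:s] = 1.0
    Y = X @ beta + rng.standard_normal(n)
    return X, Y

def fdp_tpr(selected, truth):
    """Empirical FDP and TPR of a selected index set against the truth."""
    sel, tru = set(selected), set(truth)
    return len(sel - tru) / max(len(sel), 1), len(sel & tru) / max(len(tru), 1)
```

Averaging `fdp_tpr` outputs over replications gives the empirical FDR and TPR columns reported in Table 1.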
We fix the dimension p = 5000 and vary n = 100, 200 and s = 15, 25 to compare the estimated FDR, TPR, P_a, and average model size in Table 1. Under all scenarios, REDS delivers quite accurate control and performs better than BH in terms of FDR control. For example, when n = 100 and s = 25, the proposed REDS and BH yield empirical FDRs of 18.5% and 24.1%, respectively, with corresponding standard errors of 0.5% and 0.3% over 1000 replications. This implies that BH does not control the FDR well in this case, consistent with our theoretical analysis in Section 3.1. We also observe that the FDR levels of HTR are slightly smaller than those of REDS, but HTR improves the TPR and P_a over REDS. This is not surprising, as HTR uses the full-sample statistics ω̂_j for screening without information loss. HT-d_1 has a very small FDR but performs poorly in terms of P_a when n = 100 and s = 25, because the model size d is smaller than s. Meanwhile, HT-d_2 yields a large TPR and P_a (as well as a large FDR). Clearly, a larger d should be used, but as argued before, how large depends on unknown quantities, and thus such an ad hoc choice cannot ensure the sure screening property. More simulation results in the supplementary material also illustrate that REDS and HTR successfully control the FDR at the nominal level.
Example 2. In this example, we consider the RRCS method and the generalized Box-Cox transformation model given by Li et al. (2012), with the predictors X_1, . . ., X_p following {X_j}. We fix the sample size n = 200 and consider the dimension p = 2000 or 5000. The results for R = 50 and 100 are shown in Table 2. The FDR levels of REDS and HTR are close to the nominal level while maintaining a reliable TPR under all scenarios; both are clearly more effective than BH, and the difference is quite remarkable in terms of FDR control. For example, under p = 5000 and R = 50, the empirical FDRs of REDS and HTR are 16.6% and 15.6%, respectively, both with standard error 0.3%, but the empirical FDR of BH is clearly out of control, at 31.6% with standard error 0.3% over 1000 simulations. The hard-thresholding method with two different d's shows that more active predictors can be detected only when d is greater than the cardinality of H_1. This again demonstrates the effectiveness of the proposed HTR: it is a data-driven thresholding rule which allows FDR control. Similar results for other values of R are provided in the supplementary material.
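A marginal rank-correlation utility in the spirit of RRCS (Li et al. 2012) can be sketched as follows. This is our own schematic re-implementation, written from the usual U-statistic form ω_j = (1/(n(n−1))) Σ_{i≠k} I(X_ij < X_kj) I(Y_i < Y_k) − 1/4, which has mean zero when X_j and Y are independent and continuous:

```python
import numpy as np

def rrcs_statistics(x, y):
    # Marginal rank-correlation utilities: for each column j,
    # omega_j = (1/(n(n-1))) * sum_{i != k} I(X_ij < X_kj) I(Y_i < Y_k) - 1/4.
    # Robust to heavy tails and monotone transformations of Y (Box-Cox models).
    n, p = x.shape
    ylt = y[:, None] < y[None, :]            # n x n indicator matrix I(Y_i < Y_k)
    w = np.empty(p)
    for j in range(p):
        xlt = x[:, j][:, None] < x[:, j][None, :]
        w[j] = (xlt & ylt).sum() / (n * (n - 1)) - 0.25
    return np.abs(w)
```

Because the statistic depends on the data only through ranks, it is invariant to the unknown monotone transformation in the generalized Box-Cox model, which is the motivation for using RRCS in this example.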
Example 3. For the KF method in Mai and Zou (2012), we consider a logistic regression model in which Y is distributed as Binomial(1, p(x)) with log{p(x)/(1 − p(x))} = X^T β, where X is generated in the same way as in Example 2, while β_1, . . ., β_8 and β_{[p/3]+1} are equal to 1, β_{[2p/3]+1} is 2, and the rest are 0. As we have shown that HTR usually performs better than REDS in terms of TPR and P_a, in what follows we focus on HTR only. Note that the Kolmogorov statistic is exactly distribution-free for all continuous distributions, and thus we obtain the p-values for the BH procedure via simulations. Besides the measurements of FDR, TPR, P_a and the average model size, we also include the probabilities that the active variables X_{[p/3]+1} and X_{[2p/3]+1}, which follow t(3) and Poisson distributions, respectively, are selected.
Table 3 presents the comparison results under p = 2000 when different sample sizes and R are considered. We observe that the FDRs of the HTR method are close to the nominal level. Meanwhile, the proportion P_a that all active predictors are selected by the HTR procedure increases substantially when n is large, implying that HTR can achieve the sure screening property for large n. In contrast, the BH method results in overly conservative FDR levels across all settings. This can be understood by noticing that some of the covariates in our model are Poisson distributed, so the null distribution of the Kolmogorov statistic obtained under a continuous distribution is misspecified. Although the TPRs of BH are close to those of HTR, the BH method results in a lower P_a and lower probabilities that the active covariates X_{[p/3]+1} and X_{[2p/3]+1} are selected. As expected, the hard thresholding methods yield a larger TPR, as well as a larger FDR, only when a larger d is chosen. For example, when R = 50 and n = 200, HT-d_2 yields a large TPR of 97.1% (standard deviation 1.2%) but also a large FDR of 31.8% (standard deviation 0.8%), because the selected model size d_2 = 74 is much greater than R. Similar analysis applies to the previous two examples.
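The Kolmogorov filter utility used in this example is the Kolmogorov-Smirnov distance between the two class-conditional empirical CDFs of each predictor. A minimal sketch (our own illustration; function name assumed) is:

```python
import numpy as np

def kf_statistics(x, y):
    # Kolmogorov filter (Mai and Zou 2012): for binary Y, the utility of X_j is
    # sup_t |F_{j,1}(t) - F_{j,0}(t)|, the KS distance between the empirical
    # CDFs of X_j within class Y = 1 and class Y = 0.
    n, p = x.shape
    n1, n0 = max((y == 1).sum(), 1), max((y == 0).sum(), 1)
    w = np.empty(p)
    for j in range(p):
        grid = np.unique(x[:, j])            # evaluate ECDFs on the pooled support
        f1 = np.searchsorted(np.sort(x[y == 1, j]), grid, side="right") / n1
        f0 = np.searchsorted(np.sort(x[y == 0, j]), grid, side="right") / n0
        w[j] = np.abs(f1 - f0).max()
    return w
```

Note that the sup is attained on the pooled sample points, so evaluating the ECDFs on `grid` loses nothing; for discrete covariates such as the Poisson one here, the statistic is still well defined, even though its continuous-case null distribution no longer applies, which is precisely why BH becomes conservative in this example.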
Example 4. In this example, we investigate the performance of HTR for DC-SIS. The BH method is not readily applicable, because approximating the null distribution of the distance correlation requires permutation procedures that may be too time-consuming. Hence, we only display the results of HTR and the hard-thresholding methods under the two following models: • Model I (Nonlinear Regression): Figure 2 depicts the FDR, TPR and P_a curves against R when p = 5000 and n = 400. It shows that the HTR method performs reasonably well for both models. The FDR varies within an acceptable range of the target level regardless of the size of R. Meanwhile, the TPR and the proportion of replications in which all active predictors are selected into the model are quite high, which clearly demonstrates the efficiency of our proposed method. As R increases, fewer active predictors can be detected by the hard-thresholding method with two different d's. This further indicates that the hard-thresholding method can hardly guarantee the sure screening property with an ad-hoc d.
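The marginal utility of DC-SIS is the sample distance correlation of Szekely, Rizzo, and Bakirov (2007). For univariate X_j and Y it can be computed by double-centering the pairwise distance matrices, as in the following sketch (our own illustration):

```python
import numpy as np

def distance_correlation(x, y):
    # Sample distance correlation for univariate x and y: double-center the
    # pairwise |x_i - x_k| distance matrices, then take the normalized inner
    # product. Detects general (not just linear or monotone) dependence.
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()
    a = centered(np.asarray(x, dtype=float))
    b = centered(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()
    dvar = np.sqrt((a * a).mean() * (b * b).mean())
    return np.sqrt(max(dcov2, 0.0) / dvar) if dvar > 0 else 0.0
```

The O(n^2) distance matrices explain why permutation-based p-values for BH are costly here, whereas HTR needs only the observed statistics.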
Example 5. Here, we apply the proposed HTR method to PFER control. Following the same model settings as in Examples 1-4, we fix the dimension p = 5000 and the sample size n = 400 for the four screening methods SIS, RRCS, KF and DC-SIS, with X generated as in Example 2 and Model (I) used for DC-SIS. Table 4 reports PFER results of HTR when the target PFER level k_0 ∈ [5, 20]. The validity of the HTR approach in terms of PFER control is clear.
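Under the marginal symmetry of the null W_j's, the count #{j : W_j ≤ −t} approximates the number of false discoveries among {j : W_j ≥ t}, so a PFER-controlling threshold can be chosen as the smallest t keeping that count at most k_0. A minimal sketch of this rule (our own illustration of the idea, not the authors' exact algorithm):

```python
import numpy as np

def pfer_threshold(w, k0):
    # Smallest threshold t (searched over the observed |W_j|) such that the
    # symmetry-based estimate #{j: W_j <= -t} of the number of false
    # discoveries in {j: W_j >= t} is at most the target PFER level k0.
    for t in np.sort(np.abs(w[w != 0])):
        if np.sum(w <= -t) <= k0:
            return t
    return np.inf
```

The selected set is then {j : W_j ≥ t}; raising k_0 lowers the threshold and admits more discoveries.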

An Application
We apply the proposed HTR procedure with the screening methods RRCS and KF to select features for the classification of a diffuse large-B-cell lymphoma (DLBCL-B) dataset. This (2015).
The scatterplots of the statistics W_j for RRCS and KF are presented in Figure 3, with the target FDR level fixed at α = 0.1. We observe that all selected genes (red dots) have large values of W_j, while the unselected ones (black dots) are roughly symmetric about the horizontal lines. Table 5 shows the numbers of genes selected by the HTR procedure under different FDR levels. As expected, a larger FDR level results in the selection of more genes. Given the evidence from the simulation results in the previous subsections suggesting that the FDR is controlled, it is likely that most of these discovered genes (roughly a 1 − α proportion) are informative, at least marginally. To further evaluate the performance, we introduce some simulated variables. Specifically, we independently and randomly generate 1000 auxiliary variables Z from N(0, 1) and select the informative genes from all 1661 variables. Since Z is truly inactive by construction, a selected auxiliary variable can be regarded as a true false discovery. Table 5 and Figure 3 show the number of selected variables and the scatterplots of the resulting statistics W_j. Ranking the marginal utility statistics on the full sample, we find that the 19th statistic for the KF method and the 29th statistic for the RRCS method each correspond to an auxiliary variable. In this case, the traditional hard-thresholding rule with d_1 = n/log(n) = 20 and d_2 = 2n/log(n) = 41 may throw away an informative gene due to the selected auxiliary variable. In contrast, our HTR procedure adaptively chooses a larger threshold to avoid losing those potentially informative genes.
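The auxiliary-variable check above amounts to augmenting the design matrix with known-null columns before screening; any augmented column that survives is a certified false discovery. A minimal sketch of the augmentation step (function name and interface are ours):

```python
import numpy as np

def augment_with_noise(x, m=1000, seed=0):
    # Append m independent N(0, 1) auxiliary columns to the design matrix.
    # Any auxiliary column that survives screening is a known false discovery,
    # giving an empirical sanity check on the FDR of the selection rule.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((x.shape[0], m))
    aux_index = np.arange(x.shape[1], x.shape[1] + m)
    return np.hstack([x, z]), aux_index
```

After screening the augmented matrix, the proportion of selected indices falling in `aux_index` serves as a lower-bound estimate of the realized FDP.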

Concluding Remarks
Threshold selection is very important for conducting efficient feature screening procedures. This article proposes a data-driven threshold selection procedure, REDS, to determine the threshold for error rate control under a unified framework. The REDS procedure is easy to implement and computationally efficient. It is shown that our proposed method can asymptotically control the FDR while still retaining all of the important predictors under mild conditions, and thus it can serve as a useful alternative to ad-hoc hard thresholding approaches in practice.
In Section 2.1, we construct a test statistic W_j which satisfies the marginal symmetry property for j ∈ H_0. In fact, an effective W_j should satisfy two main properties: (a) for j ∈ H_0, W_j is (asymptotically) symmetric with mean zero; (b) for j ∈ H_1, W_j is a large positive value. Any form satisfying these two properties can be taken as a suitable choice for implementing the REDS method. The W_j given in (2) is the one which can be used without imposing other requirements on N_j. For some specific N_j, more effective and simpler surrogates are available. For instance, if N_j follows a symmetric distribution, such as the normal distribution as in SIS, we may take W_j = n^{γ_1} ω_{j1} × n^{γ_2} ω_{j2}. This W_j is also asymptotically symmetric for j ∈ H_0, while it is a large positive value for j ∈ H_1 regardless of the signs of the ω_j's. To maximize the strength of the signal, we can simply take K = 2 in this situation. This construction is clearly applicable to Examples 1-2 discussed in Section 3.2, since we can take ω_j as the original correlation coefficient (without the absolute value) and the corresponding N_j's are normally distributed. Systematic investigation and comparison of different types of W_j certainly warrant future study.
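The product construction above, together with the symmetry-based FDP estimate, can be sketched in a few lines. This is a schematic illustration under our own naming conventions: `utility(x, y)` stands for any signed marginal statistic (e.g., a Pearson or rank correlation without absolute value), and `reds_threshold` implements the rule "smallest t with estimated FDP #{W_j ≤ −t} / #{W_j ≥ t} at most α":

```python
import numpy as np

def split_statistics(x, y, utility, gamma=0.5, seed=0):
    # Sample-splitting construction with K = 2: compute the signed marginal
    # utility on two random halves and combine multiplicatively,
    # W_j = n1^gamma * omega_{j1} * n2^gamma * omega_{j2}.
    # Nulls are (asymptotically) symmetric about 0; signals are large positive.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    h1, h2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    w1 = utility(x[h1], y[h1])
    w2 = utility(x[h2], y[h2])
    return (len(h1) ** gamma * w1) * (len(h2) ** gamma * w2)

def reds_threshold(w, alpha=0.2):
    # Symmetry-based FDP estimate: #{W_j <= -t} counts the null sign-flips,
    # approximating the false discoveries among {W_j >= t}.
    for t in np.sort(np.abs(w[w != 0])):
        if np.sum(w <= -t) / max(np.sum(w >= t), 1) <= alpha:
            return t
    return np.inf
```

For example, with a signed Pearson `utility`, null products are products of two independent (approximately) mean-zero statistics and hence symmetric, while active products are positive regardless of the sign of the true correlation.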
Our unified framework is developed for marginal screening methods, in which the marginal correlations of the important variables must be bounded away from zero. However, this assumption can sometimes be violated, as predictors are often correlated. As a result, unimportant variables that are highly correlated with important predictors will have high priority of being selected. In such situations, the iterative SIS procedure introduced by Fan and Lv (2008) and other related works such as Wang and Leng (2016) should be considered. It is of interest to further generalize the REDS method to screening methods that achieve effective variable screening without the strong marginal correlation assumption. In addition, it would also be important to identify interactions between predictors; for recent developments, see, for instance, Fan et al. (2015) and Kong et al. (2017). How to adapt the proposed method to interaction screening problems would be an interesting topic for future research.

Appendix: Proofs
This appendix contains the proofs of Theorems 1, 3-4 and Corollary 1, in which the technical arguments may be interesting in their own right. The proofs of Theorem 2 and Propositions 1-4 can be found in the supplementary material, along with a few additional lemmas.
Before we present the proof of Theorem 1, we first state a lemma whose proof is given in the supplementary material. Denote
The next lemma characterizes the closeness between Pr(W_j > t) and Pr(W_j < −t), which plays an important role in the proof.
uniformly in j ∈ H_0 and t.
Proof of Theorem 1. We need to show that Equation (4) holds; Lemma A.1 and Condition 2 serve this purpose.
By definition, our thresholding rule is equivalent to selecting the jth feature if W_j > L, where We need to establish an asymptotic bound for this L so that the conditions in the theorem hold. We first show that, for any j ∈ C_β, n^{γ_1} ω_{1j} > n^{γ_2} ω_{2j} with probability tending to one. In fact, we have Similarly, it follows that Then we have V U T U and Pr( j Thus, we get L V U. On the other hand, notice that Accordingly, we have Note that R(L) ≤ Σ_{j∈H_0} I(W_j ≥ L) / Σ_{j∈H_0} I(W_j ≤ −L), and thus lim sup_{n→∞} FDP ≤ α in probability by (A.1). Then, for any ε > 0, FDR ≤ (1 + ε)αR(L) + Pr(FDP ≥ (1 + ε)αR(L)), from which the second part of the theorem is proved.
Proof of Theorem 3. We only show the validity of the first formula; the second one holds similarly. We suppress the "+" in G^+(t) for notational simplicity. Note that G(t) is a decreasing and continuous function. Let a_p = αβ_{p,n}, z_0 < z_1 < ⋯ < z_{h_p} ≤ 1 and t_i = G^{−1}(z_i), where z_0 = a_p/p, z_i = a_p/p + b_p exp(i^δ)/p, h_p = {log((p − a_p)/b_p)}^{1/δ} with 0 < δ < 1 and b_p/a_p → 0. Note that G(t_i)/G(t_{i+1}) = 1 + o(1) uniformly in i. It is therefore enough to obtain the convergence rate of Define M_j = {k ∈ H_0 : W_k is dependent on W_j} and further It is noted that under condition (i)

Supplementary Materials
This supplementary material contains the proofs of some technical lemmas and corollaries, and additional simulation results.

Figure 1. (a) Scatterplot of the screening statistic pairs (T_{j1} = (K−1)^γ ω_{j1}, T_{j2} = ω_{j2}), with red triangles and black dots denoting informative and uninformative predictors, respectively; (b) scatterplot of the W_j's; (c) the corresponding estimate of the FDP curve against t (black line) along with the true FDP (red dashed line).
and select the predictors with the largest |M(L)| values of ω_j; output the selected subset as S(ω_{(p−|M(L)|+1)}).

Figure 3. Scatterplots of the W_j's for KF and RRCS, respectively, with the red points and the blue line denoting the selected predictors and the threshold under α = 0.1.

Table 1. Comparison results of FDR (%), TPR (%), P_a (%) and the average model size under p = 5000, n = 100, 200 and s = 15, 25 in Example 1.

Under this model, we fix s = 10 and note that the set of predictors truly related to the response is A = {1, . . ., 8, [p/3] + 1, [2p/3] + 1}, and there are R covariates dependent on the first s − 2 nonzero elements. Accordingly, the informative set is H_1

Table 5. Numbers of selected genes for the HTR procedure with the KF and RRCS methods under different target FDR levels.
Pr( |Σ_{j∈H_0} [I(W_j > t_i) − Pr(W_j > t_i)]| / (p_0 G(t_i)) ≥ ε ) and

Σ_{j∈H_0} Σ_{k∈M_j} E{[I(W_j > t) − Pr(W_j > t)][I(W_k > t) − Pr(W_k > t)]} ≤ r_{p,n} p_0 G(t).