Model-Free Conditional Feature Screening with FDR Control

Abstract In this article, we propose a model-free conditional feature screening method with false discovery rate (FDR) control for ultra-high dimensional data. The proposed method is built upon a new measure of conditional independence. Thus, the new method does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors. The variables to be conditioned on are allowed to be multivariate. The proposed method enjoys sure screening and ranking consistency properties under mild regularity conditions. To control the FDR, we apply the Reflection via Data Splitting method and prove its theoretical guarantee using martingale theory and empirical process techniques. Simulated examples and real data analysis show that the proposed method performs very well compared with existing methods. Supplementary materials for this article are available online.


Introduction
The unprecedented development of data collection and storage technology has made ultra-high dimensional data available in diverse areas including biomedicine, engineering, finance, and agriculture (Fan et al. 2020). In ultra-high dimensional data, the number of predictors p may grow exponentially with the sample size n. In this case, it is common to assume that only a small number of predictors are relevant to the response, and thus it is of interest to identify these relevant predictors in the initial stage of ultra-high dimensional data analysis. Due to challenges of computational expediency, statistical accuracy, and algorithmic stability (Fan, Samworth, and Wu 2009), regularization-based variable selection methods (Tibshirani 1996; Fan and Li 2001) may become unstable or infeasible in the ultra-high dimensional setting. This inspires researchers to develop feature screening algorithms to exclude unimportant predictors and to take feature screening as a starting point of ultra-high dimensional data analysis. Since the seminal work of Fan and Lv (2008), which developed the sure independence screening (SIS) method for linear regression models using the Pearson correlation as a marginal utility measure, many model-based feature screening methods have been developed for generalized linear models (Fan and Song 2010), nonparametric regression models (Fan, Feng, and Song 2011; Fan, Ma, and Dai 2014; Liu, Li, and Wu 2014), quantile regression (He, Wang, and Hong 2013; Wu and Yin 2015), and the Cox model (Zhao and Li 2012; Yang et al. 2016). Several model-free screening procedures have also been proposed to avoid model misspecification. Li, Zhong, and Zhu (2012) proposed a sure independence screening method using distance correlation, Mai and Zou (2015) developed a Kolmogorov-distance-based screening method for binary classification, and Pan et al. (2019) used the ball correlation for marginal screening. Fan et al. (2020) provided a comprehensive overview of variable selection and feature screening.
This article aims to develop a new model-free conditional feature screening procedure with false discovery rate control. This work was motivated by the empirical analysis of CITE-seq data in Section 4.3. Recent developments in genetics technology have made the simultaneous measurement of RNA expression and protein levels possible at the single-cell level. It is of great interest to investigate the scientific relationships between gene expression and some target proteins. Because some proteins are known activators of target proteins, it is helpful to control for the genes of activators and find the genes that are truly relevant to the translation of target proteins. The details are described in Section 4.3. Hence, it is desirable to develop conditional feature screening procedures that use such knowledge. Recent developments in this topic include conditional feature screening for the generalized linear model (Barut, Fan, and Verhasselt 2016) and the varying coefficient model (Zhou et al. 2018). Wen et al. (2018) developed a model-free sure independence screening method using conditional distance correlation, but it is not robust to heavy-tailed data and is computationally expensive (Wang et al. 2015). In this article, we propose a new conditional feature screening procedure based on the conditional independence measure recently proposed by Cai, Li, and Zhang (2022), and refer to the new procedure as Conditional Independence Based Feature Screening (CIS). The proposed screening method is model-free (i.e., there is no need to specify the functional form of the regression function) and is robust to outliers and heavy-tailed responses and predictors. We also show that CIS possesses both the sure screening property and the ranking consistency property under mild regularity conditions.
In this article, we further propose an FDR control procedure for CIS. Existing screening procedures often require specifying a threshold to separate the active and inactive predictors. The conventional threshold is set to be ⌊n/log n⌋, the integer part of n/log n (Fan and Lv 2008). That is, the screening procedures retain the top ⌊n/log n⌋ predictors for a follow-up analysis. In general, the conventional threshold is conservative and leads to a high FDR. An ideal threshold should achieve the sure screening property and FDR control simultaneously. That is, the selected predictors should include all true active predictors and only a few inactive ones. Existing works on threshold selection with FDR control broadly fall into two types. The first type usually constructs a null model as a benchmark, so that all the candidate predictors yielding larger marginal utilities than the null model are selected. For example, Zhu et al. (2011) introduced auxiliary variables for thresholding by generating truly inactive auxiliary variables that are independent of the response, and set the maximum of the marginal utilities among the auxiliary variables as a soft threshold. Similar ideas also appeared in Fan, Feng, and Song (2011), Barut, Fan, and Verhasselt (2016), and Pan et al. (2019). Works in this direction generally have not justified their FDR control performance theoretically. The other type of works constructs estimators of the FDR based on some special properties of the marginal utilities, and then selects the set of predictors that reaches the desired FDR. For instance, Zhao and Li (2012) and Barut, Fan, and Verhasselt (2016) used high-criticism t-tests, and Liu et al. (in press) used knockoff features (Candès et al. 2018; Barber and Candès 2019). However, how to develop FDR control for conditional screening with a theoretical guarantee remains a challenging problem.
In this article, we first propose CIS, and then develop an FDR control procedure for it by incorporating the Reflection via Data Splitting (REDS) procedure (Guo et al. in press) and studying the theoretical properties of the modified REDS. The theoretical development of the modified REDS is nontrivial, since the technical proofs in Guo et al. (in press) are restricted to nondegenerate U-statistics and do not apply to many model-free screening procedures that rely on degenerate U-statistics. In this article, we use martingale theory and modern empirical process techniques to show that the FDR can be controlled with high probability. We conduct Monte Carlo simulation studies to assess the finite sample performance of the newly proposed procedures and compare them with existing ones. A real data example is used to illustrate the proposed methodology.
The rest of the article is organized as follows. In Section 2, we propose the CIS procedure and establish its sure screening and ranking consistency properties under mild regularity conditions. In Section 3, we propose the REDS for controlling the FDR of CIS (CIS-REDS for short) and show that CIS-REDS controls the FDR with a theoretical guarantee. In Section 4, we assess the finite sample performance of CIS and CIS-REDS and compare them with existing procedures. Concluding remarks are given in Section 5. All technical proofs are deferred to the Appendix and the supplementary materials.

Measure of Conditional Independence
Let us first introduce the basics of the measure of conditional independence proposed in Cai, Li, and Zhang (2022). We start with the univariate case, in which X, Y, and Z are all continuous random variables. The case of discrete X, Y, or Z is discussed in the supplementary material to save space. Let U := F_{X|Z}(X|Z), V := F_{Y|Z}(Y|Z), and W := F_Z(Z), where a := b means defining a as b. Cai, Li, and Zhang (2022) showed that when X and Y have continuous conditional distribution functions for every given value of a continuous variable Z, X ⊥⊥ Y | Z if and only if U, V, and W are mutually independent. Denote by ω an arbitrary positive weight function, and by φ_{U,V,W}(·), φ_U(·), φ_V(·), and φ_W(·) the characteristic functions of (U, V, W), U, V, and W, respectively. Then U, V, and W are mutually independent if and only if (1) holds, where ‖ψ‖^2 := ψ^T ψ̄ for a complex-valued function ψ, and ψ̄ is the conjugate of ψ. If we choose ω(t_1, t_2, t_3) to be the joint density function of three iid standard Cauchy random variables, the left-hand side of (1) has the closed form (2), where (U_j, V_j, W_j), j = 1, ..., 4, are four independent copies of (U, V, W). By the properties that U ⊥⊥ W and V ⊥⊥ W and the fact that U, V, and W all follow Uniform(0, 1), Cai, Li, and Zhang (2022) derived that (2) indeed equals the quantity ρ(X, Y|Z) in (3). As shown in Cai, Li, and Zhang (2022), ρ(X, Y|Z) has many appealing properties. For example, it equals zero if and only if X ⊥⊥ Y | Z. It is also symmetric when conditioning on Z, that is, ρ(X, Y|Z) = ρ(Y, X|Z). Moreover, it is invariant under strictly monotone transformations of X and Y. The ρ(X, Y|Z) in (3) can be naturally generalized to the multivariate case ρ(x, y|z) with x ∈ R^{r_1}, y ∈ R^{r_2}, and z ∈ R^{r_3}, as in (4), where ‖·‖_1 is the L_1-norm, (u_j, v_j, w_j) for j = 1, 2 are two independent copies of (u, v, w), and the transforms u, v, and w are defined according to the context here and below; in particular, q_m = (z^T, Y_1, ..., Y_{m−1})^T ∈ R^{r_3+m−1} for m = 2, ..., r_2, and the components of w are defined analogously for m = 2, ..., r_3. We refer the interested readers to Cai, Li, and Zhang (2022) for the theoretical results on ρ(x, y|z). In the following section, we discuss how to perform feature screening using (4).
Remark. Zhou, Liu, and Zhu (2020) proposed a conditional screening approach that is based on testing U ⊥⊥ V. Their approach is different from ours in that U ⊥⊥ V is not equivalent to X ⊥⊥ Y | Z. Specifically, X ⊥⊥ Y | Z implies that U ⊥⊥ V, but they need an additional assumption (U, V) ⊥⊥ Z so that U ⊥⊥ V implies X ⊥⊥ Y | Z. See Lemma 1 in Zhou, Liu, and Zhu (2020) for details.
In this article, the mutual independence among U, V, and W is equivalent to X ⊥⊥ Y | Z, and we fully exploit this property to construct our screening statistic.
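To make the construction concrete, the transforms U, V, and W can be estimated from a sample by plugging in kernel estimators of the conditional distribution functions. The following sketch (in Python, with hypothetical helper names) estimates U, V, and W via Nadaraya-Watson conditional CDF estimators and computes the leading product term of the form appearing as g*_{k1} in Appendix A.1; the full statistic combines several such terms, which we do not reproduce here.

```python
import numpy as np

def gaussian_kernel(t, h):
    # Unnormalized Gaussian kernel; the normalization cancels in the NW ratio.
    return np.exp(-0.5 * (t / h) ** 2)

def cond_cdf(x, z, h):
    """Nadaraya-Watson estimate of F_{X|Z}(X_j | Z_j) at each sample point j."""
    W = gaussian_kernel(z[:, None] - z[None, :], h)   # W[i, j] = K_h(Z_i - Z_j)
    ind = (x[:, None] <= x[None, :]).astype(float)    # ind[i, j] = 1{X_i <= X_j}
    return (W * ind).sum(axis=0) / W.sum(axis=0)

def g1_term(u, v, w):
    """Leading V-statistic factor: average over i != j of
    exp(-|U_i - U_j|) exp(-|V_i - V_j|) exp(-|W_i - W_j|)."""
    n = len(u)
    E = (np.exp(-np.abs(u[:, None] - u[None, :]))
         * np.exp(-np.abs(v[:, None] - v[None, :]))
         * np.exp(-np.abs(w[:, None] - w[None, :])))
    np.fill_diagonal(E, 0.0)          # exclude i == j terms
    return E.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
x = z + rng.normal(size=n)               # X depends on Z
y = z + rng.normal(size=n)               # Y depends only on Z, so X indep. of Y given Z
h = n ** (-1 / 5)                        # rule-of-thumb bandwidth
u = cond_cdf(x, z, h)                    # estimated U
v = cond_cdf(y, z, h)                    # estimated V
w = (np.argsort(np.argsort(z)) + 1) / n  # estimated W: empirical CDF of Z
print(round(g1_term(u, v, w), 3))
```

Under X ⊥⊥ Y | Z, all three transforms are approximately Uniform(0, 1) and nearly mutually independent, so this term is close to its value under exact independence.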

Conditional Independence Based Feature Screening
In this section, we propose a model-free conditional sure independence screening method that uses the conditional independence measure introduced in Section 2. For ease of presentation, we focus on the case where Y ∈ R is the univariate response variable, z = (Z_1, ..., Z_s)^T ∈ R^s with index set [s] = {1, ..., s} consists of predictors known to relate to the response Y, and x = (X_1, ..., X_p)^T ∈ R^p with index set [p] = {1, ..., p} is an ultra-high dimensional predictor vector that contains more irrelevant than relevant variables to the response Y. The proposed method can also be extended to grouped predictors, where any X_k in x can be multivariate, and/or to a multivariate response y ∈ R^q. Consider the regression model F(Y|z, x) = μ(z, x) and iid observations {Y_i, x_i, z_i}_{i=1}^n from this model. Without assuming a specific regression function, we define the index sets of true active and inactive predictors in x as M = {k : μ(z, x) functionally depends on X_k} and M^c = {k : μ(z, x) does not functionally depend on X_k}, respectively. We assume the true model is sparse, namely, the cardinality |M| of the set M is small.
The goal is to estimate M by screening out irrelevant predictors in x when z is included in the model, and we refer to this method as Conditional Independence Based Feature Screening (CIS). We propose ranking each predictor X_k, k ∈ [p], by the conditional dependence ρ(X_k, Y|z) between X_k and Y given z, which is a special case of (4) with x = X_k and y = Y. A larger ρ(X_k, Y|z) indicates that X_k is more likely to be an active predictor.
We estimate ρ(X_k, Y|z) by the V-statistic in (6), in which Û_k, V̂, and ŵ are obtained from kernel density estimators. With slight abuse of notation, we define the rescaled kernel K_h(t) := K(t/h)/h^d, with d being the dimension of t. Accordingly, for i = 1, ..., n, the elements of Û_k, V̂, and ŵ are defined as the estimated conditional distribution transforms evaluated at the sample points. After estimating ρ(X_k, Y|z), we keep the variables whose estimated utilities exceed T as the estimated M̂, where T is a predetermined threshold. We will discuss in detail how to choose T in Section 3 to achieve false discovery rate control.
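The screening step itself then reduces to ranking the estimated utilities and keeping the largest ones. A minimal sketch, with a hypothetical vector rho of estimated utilities standing in for the estimator in (6) and the conventional threshold ⌊n/log n⌋ of Fan and Lv (2008):

```python
import numpy as np

def cis_screen(rho_hat, d):
    """Keep the indices of the d largest estimated conditional utilities."""
    rho_hat = np.asarray(rho_hat)
    return set(np.argsort(rho_hat)[::-1][:d].tolist())

n, p = 200, 1000
rng = np.random.default_rng(1)
rho = rng.uniform(0.0, 0.05, size=p)  # hypothetical utilities of inactive predictors
rho[[3, 7, 42]] = [0.4, 0.3, 0.2]     # hypothetical active predictors stand out
d = int(n / np.log(n))                # conventional threshold: floor(n / log n)
selected = cis_screen(rho, d)
print(len(selected), {3, 7, 42} <= selected)
```

The sure screening property says that, with high probability, the retained set contains all of M, as the three active indices do here.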

Theoretical Properties
In this section, we discuss some theoretical properties of CIS.
To facilitate the theoretical analysis, we assume that the following regularity conditions hold for any k ∈ [p] and m ∈ [s]\{1}, where k(·) is a univariate kernel function, symmetric about zero and Lipschitz continuous, that satisfies the stated conditions, and the F(z) functions in equation group (8) are locally Lipschitz continuous with respect to z.
Condition (C1.1) is a general assumption. For instance, Giné and Guillou (2002) pointed out that it is satisfied when K(·) = g_1(g_2(·)) with g_2 being a polynomial function and g_1 a bounded real function of bounded variation (Nolan and Pollard 1987), such as K(·) constructed using the Epanechnikov kernel. We prove in Lemma 3 in the supplementary materials that, under (C1.1), the class of functions F = {x ↦ K(t − x) : t ∈ R^d} is a bounded measurable VC class of functions, that is, it satisfies inequality (S1) in the supplementary materials for some A and v and for any probability measure P. The covering number bound (S1) is crucial for proving our main results. Conditions (C1.2), (C1.3), and (C4) are also commonly assumed for kernel density estimation; see Li and Racine (2007), Wang et al. (2015), and Zhou et al. (2018) for examples. Condition (C2) guarantees the validity of the conditional independence measure ρ(X_k, Y|z) (Cai, Li, and Zhang 2022). Condition (C3) is for bounding the conditional density estimators, since we need to estimate these densities in the denominator, as shown in (5). Condition (C5) is widely assumed in the screening literature (Cui, Li, and Zhong 2015; Xie et al. 2020) for justifying the ranking consistency property, and can be relaxed to min_{j∈M} ρ(X_j, Y|z) − max_{j∈M^c} ρ(X_j, Y|z) = 2an^{−γ}, where a and γ are defined in Theorem 1.
The following theorem asserts that CIS has the sure screening and the ranking consistency properties.
Theorem 1. (i) (Convergence Rate) For any 0 < γ + sθ < 1/2 and 0 < γ ≤ 2θ, there exist some positive constants a, c, c′, and c″ such that, for any large enough n, Pr(max_{k∈[p]} |ρ̂(X_k, Y|z) − ρ(X_k, Y|z)| ≥ an^{−γ}) is bounded by an exponential-type tail probability. (ii) (Sure Screening Property) Under the conditions in (i), if we further assume that there exist some a, γ, and θ as defined in (i) such that the minimum signal strength condition (10) holds, then there exist some positive constants c_1, c_2, and c_3 such that the sure screening bound holds for any large enough n. (iii) (Ranking Consistency Property) Under the conditions in (i), if we further assume that Condition (C5) holds, then there exist some positive constants c_4, c_5, and c_6 such that the bound on Pr(max_{j∈M^c} ρ̂(X_j, Y|z) ≥ min_{j∈M} ρ̂(X_j, Y|z)) holds for any large enough n.
The proof of Theorem 1 can be found in Appendix A.1. The exponential-type tail probability in part (i) indicates that ρ̂(X_k, Y|z) converges exponentially fast to ρ(X_k, Y|z), which ensures that the following sure screening and ranking consistency properties hold in the ultra-high dimensional setup.
The minimum signal strength condition (10) is commonly assumed in the feature screening literature (Liu et al. in press; Pan et al. 2019; Xie et al. 2020). It assumes that ρ(X_k, Y|z) for any active X_k is uniformly bounded away from zero. The proof of Theorem 1 in the supplementary materials shows that any threshold T < an^{−γ} in (9) yields the sure screening property.
Under the condition that there exists a positive signal strength gap between the active and inactive predictors, part (iii) of Theorem 1 shows that there exists some threshold T that can separate the active and inactive predictors with probability tending to one, which is a stronger result than the sure screening property in part (ii). The proof is given in the supplementary materials.

FDR Control with REDS
In this section, we introduce how to boost the performance of CIS to achieve FDR control using the REDS method. We randomly split the data into two disjoint sets D_1 and D_2 of sizes n_1 = n(K − 1)/K and n_2 = n/K with K ≥ 3; we let n_2 be an integer for simplicity. In practice, we suggest choosing K = 3. We calculate the ρ(X_k, Y|z)'s on D_1 and D_2 separately as defined in (6), and denote them by ρ_k1 and ρ_k2, respectively. A new marginal utility Q_k measuring the dependence between X_k and Y is defined in (12), where sgn(·) is the sign function and a ∨ b := max{a, b}. For a given threshold L > 0, we define the estimated active set as M̂(L) = {k : Q_k ≥ L}. The asymptotic behaviors of ρ̂(X_k, Y|z) stated in Lemma 1 in the supplementary materials are important for understanding why the Q_k's are ideal for separating the active predictors from the inactive ones. For k ∈ M, by Part 1 of Lemma 1 in the supplementary materials and Condition (C5), n_1 ρ_k1 > n_2 ρ_k2 > 0 in probability, so Q_k stays bounded away from 0 as n → ∞. For k ∈ M^c, by Part 2 of Lemma 1 in the supplementary materials and the independence between D_1 and D_2, Q_k is small and asymptotically symmetric around zero. This is called the marginal symmetric property for inactive predictors (Guo et al. in press), and it indicates that the number of inactive predictors above any threshold can be estimated by the number below its negative, which motivates the threshold L to be chosen as in (13). We name the above procedure for estimating M̂(L) with FDR control the CIS-REDS method. The steps are summarized in Algorithm 1. The following theorem states that the CIS-REDS method can control the FDR under a desired level α ∈ [0, 1] with probability tending to 1 as (n, p) → ∞.
The conditions assumed for Theorem 2 are relatively mild. Specifically, Lemma 10 in the supplementary materials proves that B_k converges in distribution to Bernoulli(1/2), and Condition (C6.1) serves to give the convergence rate of EB_k, a notational convenience in the proof. Condition (C6.2) requires p to grow faster than cn, which is easily satisfied in the ultra-high dimensional setting where p grows exponentially with n. The proof relies on martingale theory and several empirical process techniques. See the Appendix and the supplementary materials for more details.
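The splitting-and-mirroring mechanics described above can be sketched in a few lines. Here mirror_stat is an illustrative stand-in for the exact Q_k in (12), and reds_threshold implements a threshold rule of the form in (13): the estimated false discovery proportion #{k : Q_k ≤ −ℓ}/max(#{k : Q_k ≥ ℓ}, 1) is driven below α, with the numerator estimating the number of false discoveries via the marginal symmetric property of the null Q_k's.

```python
import numpy as np

def mirror_stat(rho1, rho2):
    """Illustrative mirror statistic: sign agreement times the larger magnitude.
    The exact form of Q_k is given in (12); this is only a stand-in."""
    return np.sign(rho1 * rho2) * np.maximum(np.abs(rho1), np.abs(rho2))

def reds_threshold(Q, alpha):
    """Smallest candidate threshold whose estimated FDP,
    #{Q <= -l} / max(#{Q >= l}, 1), is at most alpha."""
    for l in np.sort(np.abs(Q)):
        if l > 0 and (Q <= -l).sum() / max((Q >= l).sum(), 1) <= alpha:
            return l
    return np.inf

rng = np.random.default_rng(2)
p, s = 1000, 10
# Null utilities: small and symmetric around zero on both splits.
rho1 = rng.normal(0.0, 0.01, size=p)
rho2 = rng.normal(0.0, 0.01, size=p)
# Signals: large and positive on both splits.
rho1[:s] += 0.5
rho2[:s] += 0.5
Q = mirror_stat(rho1, rho2)
L = reds_threshold(Q, alpha=0.10)
selected = np.flatnonzero(Q >= L)
print(len(selected))
```

Because null Q_k's fall on both sides of zero with roughly equal frequency, the left tail count is a valid estimate of the number of nulls surviving the right-tail cutoff, which is exactly what the stopping rule in (13) exploits.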

Screening Performance
In this section, we conduct Monte Carlo simulations to assess the finite sample feature screening performance of the proposed method. We compare the proposed conditional independence screening (CIS) with three other conditional screening methods: BKR-CSIS (Zhou, Liu, and Zhu 2020), CDC-SIS (Wen et al. 2018), and C-SIRS (Zhou et al. 2018). For all methods, we approximate the model selection rule (9) by retaining the top ⌊n/log n⌋ predictors, as suggested in Fan and Lv (2008). We will investigate the FDR control algorithm in Section 4.2. Throughout this section, we denote Σ = (σ_ij)_{p×p} with σ_ij = 0.5^{|i−j|} for 1 ≤ i, j ≤ p, and Cauchy(0, I_p) stands for the p-dimensional standard Cauchy distribution, which has a heavy tail. We set the sample size n to be 200 and vary the dimension p from 2000 to 5000. We repeat each experiment 1000 times and record the minimum model size S required to include all active predictors. We report the 5%, 25%, 50%, 75%, 95%, and 100% quantiles of S out of the 1000 replications. We use the Gaussian kernel for all numerical studies without further mention. The kernel bandwidth is chosen by maximum likelihood cross-validation using the mkde.tune function from the R package Compositional (Tsagris, Athineou, and Alenazi 2021).
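As an illustration of the simulation setup, the following sketch generates covariates with the AR(1)-type covariance σ_ij = 0.5^{|i−j|} and records the minimum model size S for a hypothetical sparse linear model, using the absolute Pearson correlation as a stand-in marginal utility (the actual studies use the conditional utilities of Section 2):

```python
import numpy as np

def ar1_cov(p, rho=0.5):
    """Covariance matrix with sigma_ij = rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def min_model_size(utilities, active):
    """Minimum number of top-ranked predictors needed to cover all active ones."""
    order = np.argsort(utilities)[::-1]           # best first
    rank = {k: r for r, k in enumerate(order)}
    return 1 + max(rank[k] for k in active)

rng = np.random.default_rng(3)
n, p = 200, 500
Sigma = ar1_cov(p)
x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
active = [1, 2, 3]                                 # hypothetical active set
y = x[:, active].sum(axis=1) + rng.normal(size=n)  # hypothetical linear model
# Absolute Pearson correlation as a stand-in marginal utility for illustration.
util = np.abs([np.corrcoef(x[:, k], y)[0, 1] for k in range(p)])
print(min_model_size(util, active))
```

A minimum model size close to |M| over many replications indicates that the utility ranks all active predictors near the top, which is how Tables 1 and 2 should be read.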
Example 1 (Conditional on One Active Variable).We generate covariates x and the response Y from Models 1-6, and consider X 1 as a known active predictor to be conditional on.
Model 1: u ∼ Cauchy(0, I_p). In all the above models, we consider two error distributions: the standard normal N(0, 1) and the standard Cauchy(0, 1). Tables 1 and 2 summarize the quantiles of the minimum model size that contains all active predictors among X_2, ..., X_p when p = 2000 and p = 5000, respectively.
In Model 1, x is heavy-tailed, and the dependence between the active predictors and the response is linear. Models 2-6 are all nonlinear models. In particular, Models 2-4 have additive structures, and Models 5-6 have more challenging nonlinear relationships. CIS outperforms CDC-SIS and C-SIRS in almost all cases. BKR-CSIS performs well in the linear model but fails when the model contains complex nonlinear relationships. These findings imply that the proposed method is robust and suitable for nonlinear settings and heavy-tailed data.
Example 2 (Effect of Conditional Variables). We use Model 3 with p = 2000 as an example to examine the effect of conditional variables. Consider the following four settings:
Case 1: Conditional on one active predictor, say X_1.
Case 2: Conditional on two active predictors, say X_1 and X_5.
Case 3: Conditional on one active predictor and one inactive predictor, say X_1 and X_3.
Case 4: Conditional on two inactive predictors, say X_2 and X_3.
Since C-SIRS can only handle one-dimensional conditional variables, we only compare CIS with BKR-CSIS and CDC-SIS. The active predictors X_10 and X_15 remain to be identified in all four cases, so we compare the probability P_k that the active predictor X_k is selected for k = 10, 15.
Results summarized in Table 3 show that CIS outperforms BKR-CSIS and CDC-SIS in all cases. Comparing Case 1 with Case 2 shows that CIS gains more than CDC-SIS from conditioning on more true active predictors. Comparing Cases 2-4 indicates that CIS is more robust to misinformation in the conditional variables than BKR-CSIS and CDC-SIS. All four cases suggest that conditioning on active variables gives better performance than conditioning on inactive ones, which agrees with the findings in Figures 1 and 2 of Barut, Fan, and Verhasselt (2016).

FDR Control Performance
In this section, we assess CIS-REDS with respect to both FDR control performance and the sure screening property using simulated models. Throughout this section, we denote Σ = (σ_ij)_{p×p} with σ_ij = 0.5^{|i−j|} for 1 ≤ i, j ≤ p, and t_d denotes the t distribution with d degrees of freedom. We set n = 200 and p = 2000 and repeat each experiment 1000 times. We randomly split the whole dataset with K = 3, as recommended in Guo et al. (in press). The performance of FDR control is evaluated at target FDR levels α = 0.10, 0.15, 0.20, 0.25, and 0.30.
Example 3 (FDR Control for Linear Models).We generate covariates x and the response Y from the following four models, and consider X 1 as a known active predictor to be conditional on.
We summarize the results in Table 4, where α is the target FDR level, M is the average number of selected predictors, the columns P_k, k = 3, ..., 9, report the probability that the active predictor X_k is selected, P_a stands for the probability that all active predictors are selected, FDR is the empirical FDR, and F1 is the F1 score. The F1 score is the harmonic mean of precision and recall: true positive / (true positive + 0.5 × (false positive + false negative)).
The F1 score is a classical measure for imbalanced data (Sun, Wong, and Kamel 2009), and a higher F1 score implies better performance. As shown in Table 4, the proposed CIS-REDS method controls the empirical FDR under the target level α in almost all cases, and the F1 scores are always higher than 0.77 even when α increases to 0.3. Notice that M is small in all cases, barely larger than the number of true active predictors to be identified. This implies that the CIS-REDS method tends to be conservative, and we could further improve P_a by making the selection of L in (13) less conservative.
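The F1 score reported in Table 4 can be computed directly from the confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 = TP / (TP + 0.5 * (FP + FN)), the harmonic mean of precision and recall."""
    return tp / (tp + 0.5 * (fp + fn))

# Example: 7 true actives all recovered, plus 2 false positives.
print(f1_score(tp=7, fp=2, fn=0))  # 0.875
```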

Real Data Application
In this section, we apply the proposed CIS and CIS-REDS to a CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) dataset publicly available from the 10X Genomics website (10x Genomics 2018). CITE-seq data is a kind of single-cell multimodal omics data, a research area labeled "Method of the Year 2019" by Nature Methods. The data contain RNA sequencing and surface protein expression simultaneously measured at the single-cell level. While gene expression data have been extensively studied in single-cell analysis, surface proteins are relatively difficult to obtain and are directly involved in cell signaling and cell-cell interactions (Davis 2007).
Because CITE-seq data is very important in scientific discoveries of human biology, scientists are interested in studying the relationship between proteins and RNA gene expression and identifying important biomarkers (Hao et al. 2021;Gayoso et al. 2021).In this section, we investigate the relationship by using the proposed conditional feature screening methods to identify important genes that affect the surface protein level.
After filtering out genes with zero variance and cells with more than 90% zero entries, we obtain 241 single cells and 19,703 RNA expression measurements. Among the 17 measured proteins, we take the protein CD8 to be the response variable and let the protein CD3 be the conditional variable. This is because CD3 is a protein complex and T cell co-receptor that is involved in activating the expression of CD4 and CD8 (Murphy and Weaver 2016). By controlling for CD3, we can identify the genes that are directly associated with the translation of CD8.
First, we apply the CIS-REDS method to the whole dataset under the prespecified FDR level α = 0.15. CIS-REDS selects 8 variables, and Figure 1 shows the scatterplots between the response and each selected variable. The black dots are data points. The blue curves are local quadratic regression curves fitted by the stat_smooth function from the R package ggplot2 (Wickham 2011). The gray shaded areas are 95% confidence intervals. Figure 1 indicates that CIS-REDS can discover many different functional forms.
Next, we compare the proposed CIS and CIS-REDS methods with CDC-SIS, C-SIRS, and BKR-CSIS following the steps below. The goal is to compare the predictive power of all methods.
Step 1: Randomly split the dataset into training and testing sets of sizes n_train = 193 and n_test = 48.
Step 2: Apply all the conditional feature screening methods. For CIS-REDS, we set the target FDR level at 0.15. For the remaining methods, we follow the common routine and select the top ⌊n_train/log(n_train)⌋ = 36 features.
Step 3: Fit a random forest model to the selected variables and the response, and compute the root mean squared prediction error (RMSE) ‖Y_test − Ŷ_test‖_2/√n_test on the testing set.
In the last step, we take a log transformation of the data, since the random forest is unsuitable for heavy-tailed distributions. To ensure stable results, we repeat the above procedure 1000 times. Table 5 reports the RMSE of each method. The proposed CIS and CIS-REDS achieve smaller RMSEs than the other methods.
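The three steps above can be sketched as follows; the screener and the regression fit are simple stand-ins (absolute-correlation ranking and least squares) for the conditional screening methods and the random forest used in the actual comparison, and all data here are synthetic:

```python
import numpy as np

def pipeline_rmse(x, y, screen, n_test, fit_predict, rng):
    """One replication of Steps 1-3: split, screen on the training data,
    fit on the selected features, and report the test RMSE."""
    n = len(y)
    idx = rng.permutation(n)
    test, train = idx[:n_test], idx[n_test:]
    keep = screen(x[train], y[train])                 # feature screening on training set
    pred = fit_predict(x[train][:, keep], y[train], x[test][:, keep])
    return np.sqrt(np.mean((y[test] - pred) ** 2))

def top_corr(x, y, d=10):
    """Stand-in screener: top-d absolute Pearson correlations."""
    util = np.abs([np.corrcoef(x[:, k], y)[0, 1] for k in range(x.shape[1])])
    return np.argsort(util)[::-1][:d]

def ols_fit_predict(x_tr, y_tr, x_te):
    """Least-squares stand-in for the random forest used in the paper."""
    X = np.column_stack([np.ones(len(y_tr)), x_tr])
    beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)
    return np.column_stack([np.ones(len(x_te)), x_te]) @ beta

rng = np.random.default_rng(4)
n, p = 241, 300
x = rng.normal(size=(n, p))
y = x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # hypothetical sparse signal
rmse = np.mean([pipeline_rmse(x, y, top_corr, 48, ols_fit_predict, rng)
                for _ in range(20)])
print(round(rmse, 2))
```

Averaging the RMSE over repeated random splits, as done 1000 times in Table 5, guards against a single lucky or unlucky partition driving the comparison.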

Discussion
We developed a conditional-independence-based feature screening method. It is model-free and robust to heavy-tailed predictors and random errors, and the conditioning variables can be multivariate. We proved that CIS enjoys the sure screening and ranking consistency properties. We further developed the CIS-REDS method, which integrates CIS and REDS to control the FDR, and proved that CIS-REDS controls the FDR under a desired level α ∈ [0, 1] with probability tending to 1 as (n, p) → ∞. Simulation studies indicate that CIS performs well in terms of the sure screening property in complex settings, including heavy-tailed predictors and random errors, and that CIS-REDS achieves satisfactory FDR control.
In this article, we studied only the case where Z is low dimensional. This is due to the intrinsic limitation of kernel density estimators. It is possible to use dimension reduction techniques (Li 2018) and allow high-dimensional Z under additional assumptions on the data. Alternatively, one might replace kernel estimation with other modern techniques (such as neural networks) to estimate the conditional distribution functions. These topics are outside the scope of this article and would be interesting directions for future research.

Appendix A: Proofs
For simplicity, we denote c as a positive constant that may take different values (independent of n and p) in each appearance throughout this section.

A.1. Proof of Theorem 1
The proof of Theorem 1 is lengthy, so we present only a sketch here. The full proof is given in the supplementary materials.
Step 1. We deal with the first term on the RHS of (15); that is, we prove the corresponding maximal probability bound for some proper positive constants c, c′, and c″.
We first deal with g_k1 and ĝ_k1, with the goal of proving (16). Then the corresponding tail bound holds for some positive constants b_1, b_2, and b_3. Since |e^{−x} − e^{−y}| ≤ |x − y| for x > 0 and y > 0, analogous bounds hold for some positive constants b_4, b_5, and b_6, and for some positive constants b_7, b_8, and b_9. By Lemma 7 in the supplementary materials together with (17), (18), and (19), we obtain the bound for ĝ_k1 for some positive constants c_1, c_1′, and c_1″. Repeating the above scheme gives results similar to (16) for Pr(|ĝ_ki − g_ki| ≥ an^{−γ}), i = 2, ..., 9. Then, for some positive constants c, c′, and c″, we obtain the combined bound, and hence (20) holds for some proper positive constants a, c, c′, and c″.
By Condition (C2), the conditional distribution transforms are all one-to-one, so instead of regarding ρ̂_k as a function of X_k, Y, and z, we consider it as a function of Û_k, V̂, and ŵ.
We first deal with ĝ_k1. Define g*_k1 = (n(n−1))^{−1} Σ_{i≠j} e^{−|Û_ki − Û_kj|} e^{−|V̂_i − V̂_j|} e^{−‖ŵ_i − ŵ_j‖_1}, which is a U-statistic, so that ĝ_k1 can be rewritten in terms of g*_k1. For any given ε > 0, there is a large enough n such that ε ≥ 2(1 − g_k1)/(n + 1). To prove the uniform consistency of ĝ_k1, it then suffices to show the uniform consistency of g*_k1. By Markov's inequality, for any t > 0, the moment generating function bound applies, where the sum ranges over all n! permutations (i_1, ..., i_n) of (1, ..., n) and each block average involves m = ⌊n/2⌋ terms. By Jensen's inequality, and combining (22) and (23), since h is of the assumed order, the claimed bound follows for some proper positive constants a, c, c′, and c″.

A.2. Proof of Theorem 2
We begin with a decomposition of the FDR. It remains to show that the remainder term is o(1), where o(1) → 0 as (n, p) → ∞, and we achieve this using martingale techniques.
Consider the process by which we find the threshold L. Without loss of generality, we assume the index set of M^c is {1, ..., p_0}, the Q_k's for k ∈ M^c are ordered as |Q_1| ≥ |Q_2| ≥ ... ≥ |Q_{p_0}| > 0, and we further define Q_{p_0+1} = 0. To decide L, we try setting the |Q_k|'s as the threshold from the smallest one, |Q_{p_0+1}|, to the largest one, |Q_1|, stop as soon as we find the first |Q_k| that satisfies (13), and set it as L. Thus L is a stopping time in this process. Specifically, for j = p_0 + 1, p_0, ..., 1, we define the process M_j as follows. For any k ∈ M^c, define B_k = 1{Q_k < 0} and S_k = B_1 + ... + B_k; then we can write M_j in terms of j and S_j. Let F_j be the σ-algebra generated by {B_1 + ... + B_j, B_{j+1}, ..., B_{p_0+1}} and the knowledge of which indices k lie in M^c. Hence, L is a stopping time in reverse time (from p_0 + 1 to 1) with respect to the process M_{p_0+1}, ..., M_1 and its filtration. Next, we prove that {M_j} is a super-martingale with respect to {F_j}. Since ρ̂(X_k, Y|z) is model-free for any k ∈ M^c, by Lemma 1 in the supplementary materials the Q_k's are identically distributed for k ∈ M^c, so {B_1, ..., B_j} are identically distributed with respect to F_j. Hence, by Lemma 8 in the supplementary materials, we have Pr(B_j = 1|F_j) = S_j/j.
If S_j = 0, then S_{j−1} = 0, so M_{j−1} = j − 1 < j = M_j. If S_j > 0, E(M_{j−1}|F_j) can be computed by conditioning on B_j, using Pr(B_j = 0|F_j) = 1 − S_j/j, which yields E(M_{j−1}|F_j) ≤ M_j. Thus, {M_j} is a super-martingale with respect to {F_j}. Now, by the optional stopping theorem, since L is a stopping time, we obtain the desired bound.

(C1.1) K(·) is a function in the set of finite linear combinations of functions ϕ ≥ 0 satisfying the following property: the subgraph of ϕ, {(s, u) : ϕ(s) ≥ u}, can be represented as a finite number of Boolean operations among sets of the form {(s, u) : ρ(s, u) ≥ φ(u)}, where ρ is a polynomial on R^d × R and φ is an arbitrary real function. Here d is the arity of K(·) and varies with the estimators in equation group (8). (C1.2) (Kernel Bounds and Smoothness) For t

Figure 1. Scatterplots between the response and the variables selected by CIS-REDS. The black dots are data points. The blue curves are fitted local polynomial regression curves. The gray shaded areas are 95% confidence regions.

Table 1. The quantiles of the minimum model size for Example 1 over 1000 replications when p = 2000.
NOTE: There are three active predictors in Models 1-5, and five active predictors in Model 6.

Table 2. The quantiles of the minimum model size for Example 1 over 1000 replications when p = 5000.
NOTE: There are three active predictors in Models 1-5, and five active predictors in Model 6.

Table 3. The probabilities of the discovery of the true active predictors X_10 and X_15 for Example 2 over 1000 replications.

Table 5. RMSE of each method over 1000 replications for the CITE-seq data.