Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems

Abstract Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p), only a small share of which is truly associated with the response. In these settings, major concerns about computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis. Along the lines of Pearson's correlation coefficient-based sure independence screening and other model- and correlation-based feature screening methods, we propose a model-free procedure called covariate information number-sure independence screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods. Supplementary materials for this article are available online.


Introduction
Contemporary high-throughput experimental and surveying techniques employed in many scientific fields often generate data on an enormous number of variables. This is the case, for example, in the "Omics" sciences (Genomics, Transcriptomics, Proteomics, Metabolomics, etc.) and in biomedical applications involving imaging and/or the analysis of extensive electronic medical records. Extracting meaningful and interpretable information from these data often requires studying the association of a response with thousands to millions of predictors (p); here and in the following, we use "feature," "predictor," and "covariate" interchangeably, all indicating a potential explanatory variable for the response. Even when these supervised problems have what appears to be a large sample size (n), this can in fact be orders of magnitude smaller than p. For instance, in transcriptomic studies n may be in the hundreds, but p may be in the thousands or tens of thousands. This is commonly referred to as an ultrahigh-dimensional setting, and it is often linked to the fact that p increases as a function of n (collecting more observations produces more features). Roughly speaking, one can have p ≈ exp{O(n^ξ)}, ξ > 0 (Fan and Lv 2008). Importantly, in ultrahigh-dimensional settings, association signals are often sparse; that is, only a handful of predictors contribute to explaining variation in the response.
When p exceeds n, computing or inverting sample covariance matrices (Σ̂) to estimate dependencies among predictors becomes very inaccurate, numerically unstable, or outright infeasible (Ledoit and Wolf 2004; Schäfer and Strimmer 2005; Bickel and Levina 2008; Chen, Chi, and Goldsmith 2015). This, in turn, negatively affects regression model fitting, classification methods, and also techniques for supervised dimension reduction: in their standard versions, most of these tools employ some version of Σ̂ or Σ̂^{-1}.
Popular methods such as LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), Elastic Net (Zou and Hastie 2005), and MCP (Zhang 2010) use penalties to regularize supervised problems, performing feature selection and model fitting simultaneously (see Fan and Lv 2010 for a comprehensive overview). In practice, many of these methods may successfully handle p > n scenarios, but they also deteriorate and might not scale up in realistic time when p ≫ n (see, e.g., Fan and Lv 2008, Table 1, p. 862). Fan, Fan, and Wu (2011) further discussed the challenges of high and ultrahigh dimension in classification. "Curse of dimensionality" issues, including computational burden, statistical inaccuracy, and algorithmic instability, call for alternative approaches to tackling ultrahigh-dimensional supervised problems (Fan, Samworth, and Wu 2009). One such approach is feature screening, pioneered by the development of sure independence screening (SIS; Fan and Lv 2008) and extensively studied ever since; it has led to model-based, model-free, correlation-based, and distance-based procedures for a variety of supervised problems including regression, classification, discriminant analysis, survival analysis, etc. Liu, Zhong, and Li (2015) provided a comprehensive overview of work prior to 2015 (we also provide a selected list of references in Section S5 of the supplementary materials).
In this article, we propose covariate information number-sure independence screening (CIS), a model-free feature screening procedure based on a novel marginal utility called covariate information number (CIN). For each covariate X j , the CIN captures marginal association with the response Y without assuming any specific underlying model, and can be interpreted in terms of the traditional Fisher information in Statistics. CIN is computed from the joint density of (Y, X j ) employing kernels to estimate marginal and inverse conditional densities within response sub-populations that are naturally defined when Y is categorical or discrete, and artificially generated by slicing the range if Y is continuous. Thus, our approach can be employed irrespective of the nature of the response and, in fact for continuous Y, it is robust to outliers (it uses only the ranks of the Y values for slicing). Moreover, switching the roles of Y and X j in computing the CIN, our approach can be utilized also to screen discrete or categorical covariates-as long as the response is continuous. Because of the way information on marginal and conditional densities is used in the CIN calculation, in addition to being model-free, our approach does not require strong assumptions on the predictors (the rationale is similar to that discussed in Yao et al. (2019)). Under mild regularity conditions, we show that the CIS procedure built upon the CIN marginal utility possesses the sure screening property (Fan and Lv 2008). Overall, we find that CIS is competitive with, and sometimes better than, other popular feature screening procedures.
We compare CIS to five other procedures: (a) SIS (Fan and Lv 2008), (b) high-dimensional ordinary least squares projection (HOLP; Wang and Leng 2016), (c) sure independent ranking and screening (SIRS; Zhu et al. 2011), (d) distance correlation-sure independence screening (DC-SIS; Li, Zhong, and Zhu 2012), and (e) martingale difference correlation-sure independence screening (MDC-SIS; Shao and Zhang 2014). SIS postulates a naive linear regression model, and screens the predictors based on the magnitudes of their Pearson correlations with the response. It is intuitive, computationally straightforward, and possesses the sure screening property. SIRS, DC-SIS, and MDC-SIS are among the most popular model-free feature screening procedures in the literature. Specifically, SIRS does not postulate a specific underlying model but relies instead on a general framework which includes many common parametric and semiparametric models. It possesses rank consistency, a stronger property than sure screening, and allows both univariate and multivariate responses. Moreover, similar to our CIS, it only considers response ranks and is therefore robust to outliers. DC-SIS screens the predictors based on their distance correlations (Székely, Rizzo, and Bakirov 2007) with the response, a measure of departure from independence for two random vectors, built through characteristic functions. It possesses the sure screening property, allows both univariate and multivariate responses, and can also handle grouped predictors. MDC-SIS uses martingale difference correlations, which are a natural extension of distance correlations. It screens predictors that contribute to the conditional mean of Y|X, and has the sure screening property. Notably, to tackle regressions with heteroscedastic errors, the authors of MDC-SIS proposed an extension that screens based on contributions to conditional quantiles.
While the feature screening procedures discussed above involve computing the marginal utilities independently, HOLP involves a joint estimation of these measures. HOLP is motivated by the ordinary least squares estimator and ridge regression, straightforward and efficient to compute, and relaxes the often violated assumption of strong marginal correlations between each truly associated covariate and the response. However, similar to SIS, HOLP also postulates a naive linear regression model. All procedures in (a)-(e), as well as CIS, allow p to grow exponentially with n.
The rest of the article is organized as follows. Section 2 contains the details of our proposal: formulation, properties, and implementation of the CIN; the CIS algorithm; and the theoretical properties of CIS under appropriate assumptions. Section 3 presents an extensive simulation study to compare the performance of CIS to those of the five popular screening procedures mentioned above. Section 4 presents an application to transcriptomic data on Norway rats (GeneChip Rat Genome 230 2.0 Array Data; Scheetz et al. 2006). Concluding remarks are provided in Section 5, whereas proofs of theoretical results, full simulation results, details on the transcriptomic data application, and some relevant additional information are provided in the online supplementary materials (an "S" in the numbered references below indicates sections, tables, and figures in the supplementary materials).

CIN-Sure Independence Screening (CIS)
In this section, we describe our proposal. We introduce the general setup, provide details on the formulation, properties, and implementation of our CIN marginal utility, describe our CIN-based screening algorithm (CIS) and discuss its theoretical properties under appropriate assumptions. Notation is similar to that in Zhu et al. (2011).
Consider a univariate response Y with support 𝒴 and a p-dimensional covariate vector X = (X_1, …, X_p)^T with p ≫ n. Let F(y|x) = P(Y ≤ y | X = x) denote the conditional distribution of Y given X = x, and define the two index sets A = {j : F(y|x) functionally depends on x_j for some y ∈ 𝒴} and I = {1, …, p} \ A. A indexes predictors that are truly associated with the response; it is called the active set. I indexes the remaining, inactive predictors. Note that, in this definition, F(y|x) is completely generic; no model form is specified. Let s = |A| denote the cardinality of A; s out of p covariates are active, and s therefore measures the sparsity level of the association between Y and X. In any feature screening procedure, including our CIS, the objective is to estimate A conservatively, that is, with a minimal prevalence of false negatives.
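For concreteness, a toy instance of this setup can be generated as follows (a sketch: the particular model, dimensions, and active set below are illustrative choices of ours, not fixed by the framework):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2000                    # ultrahigh-dimensional: p >> n
X = rng.standard_normal((n, p))     # covariate matrix
A = {0, 1, 21}                      # active set (0-based indices), s = |A| = 3
# F(y|x) depends on x only through the s active covariates:
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 21] + 0.5 * rng.standard_normal(n)
```

A screening procedure applied to (y, X) should rank the three active covariates near the top and discard most of the remaining 1997.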

CIN
Next, we expand upon the CIN, the marginal utility for our proposed CIS. Some of the developments follow directly as special cases (p = 1) of those for the covariate information matrix (Yao et al. 2019).
Let f(y|x_j) and f_j(x_j), respectively, denote the conditional density of Y given X_j = x_j and the marginal density of X_j, and assume that the standard regularity conditions for likelihood analysis hold (see Section 2.6.1). Treating the observed x_j as a "parameter," we can use the traditional Fisher information formulation to create the quantity

I(x_j) = E_{Y|X_j=x_j}[ (∂/∂x_j log f(Y|x_j))^2 ].

Capturing the information that f(y|x_j) would carry about x_j, if it were in fact unknown, this provides a local measure of association. Based on the bivariate joint distribution of (Y, X_j), the CIN for covariate X_j is then defined as the expected value of this quantity; that is,

ω_j = E_{X_j}[ I(X_j) ].

Estimates of the scalars ω_j, j = 1, 2, …, p, which are theoretically nonnegative by definition, are key components of the final form of the marginal utilities (see below) we use for ranking the covariates in our CIS screening procedure. We next introduce two other quantities which are relevant to our proposal. Here, we need to also consider f_j(x_j|y) and f(y), the inverse conditional density of X_j given Y = y and the marginal density of Y, again with standard regularity conditions (see Section 2.6.1). The density information (Hui and Lindsay 2010; Lindsay and Yao 2012; Yao et al. 2019) in the marginal density of X_j is defined as

J_{X_j} = E_{X_j}[ (∂/∂x_j log f_j(X_j))^2 ].    (2)

Similarly, the density information in the conditional density of X_j|Y = y is defined as

J_{X_j|Y=y} = E_{X_j|Y=y}[ (∂/∂x_j log f_j(X_j|y))^2 ]

and can be averaged to produce

J_{X_j|Y} = ∫_𝒴 J_{X_j|Y=y} f(y) dy.    (3)

The marginal utility which CIS uses for each covariate X_j, j = 1, 2, …, p, is the CIN (ω_j) normalized by its density information, that is, ω*_j = ω_j / J_{X_j}. The next theorem summarizes some properties of ω*_j (proofs are in Section S1).
Theorem 2.1 (Properties of the normalized CIN).
(i) The normalized CIN ω * j = 0 if and only if Y and X j are statistically independent.
(ii) If (Y, X_j) follows a bivariate normal distribution with correlation coefficient ρ_j, then the normalized CIN ω*_j is a monotonically increasing function of |ρ_j|.
(iii) ω_j can be expressed as ω_j = J_{X_j|Y} − J_{X_j}. Hence, the normalized CIN can be written as

ω*_j = J_{X_j|Y} / J_{X_j} − 1.    (4)

(iv) ω*_j is invariant to affine transformations of X_j.

In the remainder of the article, we work with the normalized version ω*_j and simply refer to it as the CIN. Properties (i) and (ii) motivate its use as a marginal utility in feature screening: positive values of ω*_j correspond to statistical dependence between Y and X_j and, in a bivariate Gaussian scenario where the association is linear, ω*_j increases with the absolute value of the correlation coefficient. Notably, this fact implies that ω*_j is a more general measure of association than the marginal utility employed by SIS (Fan and Lv 2008), capturing also potential nonlinear dependencies between Y and X_j. (iii) reformulates the CIN in terms of the ratio of the average density information in f_j(x_j|y) (inverse regression) to the density information in f_j(x_j), the marginal density of the predictor X_j. Following the argument in Yao et al. (2019), the ratio in Equation (4) "cleanses" the association signal from potential distributional peculiarities of X_j, and renders ω*_j effective also for "not well-behaved" covariates. Finally, (iv) describes the effect of affine transformations on ω*_j. Note that (3) can be easily adapted for a discrete or categorical response by simply replacing the integral with an appropriate sum:

J_{X_j|Y} = Σ_{ℓ=1}^L π_ℓ J_{X_j|Y=y^(ℓ)},    (5)

where y^(1), …, y^(L) are the L distinct values of Y and π_ℓ = P(Y = y^(ℓ)).
The ratio of (5) to (2) provides a straightforward definition of the CIN in discrete regressions or classification problems.
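To make properties (i)-(iii) concrete, the normalized CIN has a closed form in the bivariate normal case (a worked example assuming, for simplicity, that Y and X_j are both standard normal with correlation ρ_j):

```latex
% Y | X_j = x_j ~ N(rho_j x_j, 1 - rho_j^2), so the information in f(y|x_j) about x_j is
I(x_j) \;=\; \frac{\rho_j^2}{1-\rho_j^2},
\qquad
\omega_j \;=\; E_{X_j}\!\left[I(X_j)\right] \;=\; \frac{\rho_j^2}{1-\rho_j^2}.
% With X_j ~ N(0,1), the marginal density information is
J_{X_j} \;=\; E\!\left[\big(\partial_{x_j}\log f_j(X_j)\big)^2\right] \;=\; E[X_j^2] \;=\; 1,
% while X_j | Y = y ~ N(rho_j y, 1 - rho_j^2) gives J_{X_j|Y} = 1/(1-\rho_j^2). Hence
\omega_j^{*} \;=\; \frac{\omega_j}{J_{X_j}}
\;=\; \frac{J_{X_j|Y}}{J_{X_j}} - 1
\;=\; \frac{\rho_j^2}{1-\rho_j^2}.
```

This quantity is 0 if and only if ρ_j = 0, increases monotonically in |ρ_j|, and the two expressions coincide, in line with Theorem 2.1(i)-(iii).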

Estimation of the CIN
Up to this point, we have defined and characterized the CIN theoretically, at the population level. Next, we describe its estimation for the practical implementation of our CIS procedure on sample data. Three facts are key. First, we write the CIN through (4), which comprises two quantities, J_{X_j|Y} and J_{X_j}. Second, regardless of the nature of the response, we write J_{X_j|Y} through its formulation in (5); if the response is continuous, we create an approximate version with "sub-populations" defined by slicing the range of Y. Notably, this slicing strategy is used by most sufficient dimension reduction methods based on inverse regression, for example, SIR (Li 1991), SAVE (Cook and Weisberg 1991), SR (Wang and Xia 2008), and CIM (Yao et al. 2019). Third, we estimate J_{X_j} and the components J_{X_j|Y=y^(ℓ)}, ℓ = 1, 2, …, L, of J_{X_j|Y} through kernels.
Let us drop the predictor index j to simplify notation (X now stands for a generic predictor) and start with the estimation of J_X. Equation (2) expresses J_X as an expectation: we therefore need to estimate the density f(·) and the expectation of the function g(x) = (∂/∂x log f(x))^2. We use a kernel density estimator to construct ĝ, and replace the theoretical expectation by the sample average, which gives Ĵ_X = Ê_X[ĝ(X)] = (1/n) Σ_{i=1}^n ĝ(x_i). Next, using observations within slices (if Y is continuous) or natural sub-populations (if Y is discrete or categorical), we produce each Ĵ_{X|Y=y^(ℓ)}, ℓ = 1, 2, …, L, in exactly the same way, and set π̂_ℓ = n_ℓ/n, ℓ = 1, 2, …, L (where n_ℓ is the number of observations with Y = y^(ℓ)). Thus, we estimate Ĵ_{X|Y} = Σ_{ℓ=1}^L π̂_ℓ Ĵ_{X|Y=y^(ℓ)}. Finally, following (4), we produce ω̂*_j = Ĵ_{X|Y}/Ĵ_X − 1. An important remark concerns the case of a continuous response: slices are customarily produced so as to contain (approximately) the same number of observations. Thus, the partitioning does not use the observed values y_i, i = 1, 2, …, n, but rather their ranks. Consequently, similar to SIRS (Zhu et al. 2011), our CIS screening is robust to outliers in the response (see Model(4) in Section 3.2).
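The estimator for a single covariate can be sketched as follows (a minimal illustration under our reading of the steps above, assuming a Gaussian kernel, Silverman bandwidths, and equal-size rank-based slices; function names are ours):

```python
import numpy as np

def squared_score(x_eval, sample, h):
    """(d/dx log f_hat(x))^2 at x_eval, where f_hat is a Gaussian-kernel
    density estimate built from `sample` with bandwidth h."""
    u = (x_eval[:, None] - sample[None, :]) / h   # standardized differences
    k = np.exp(-0.5 * u ** 2)                     # Gaussian kernel (constants cancel in the ratio)
    f = k.sum(axis=1)                             # proportional to f_hat
    f_prime = (-u / h * k).sum(axis=1)            # proportional to f_hat'
    return (f_prime / f) ** 2

def silverman(x):
    """Rule-of-thumb bandwidth."""
    return 1.06 * x.std(ddof=1) * len(x) ** (-0.2)

def cin_utility(x, y, L=5):
    """Estimated normalized CIN for one covariate and a continuous response:
    rank-based slices of y, J-hats as in-sample averages of the squared score."""
    n = len(y)
    J_x = squared_score(x, x, silverman(x)).mean()     # J-hat_X
    J_xy = 0.0
    for sl in np.array_split(np.argsort(y), L):        # slices via the ranks of y
        xs = x[sl]
        J_xy += (len(sl) / n) * squared_score(xs, xs, silverman(xs)).mean()
    return J_xy / J_x - 1.0                            # omega*-hat
```

For a covariate associated with the response, the in-slice densities are more concentrated than the marginal, so Ĵ_{X|Y} exceeds Ĵ_X and the utility is positive; for an independent covariate it hovers around 0.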

Tuning Parameters
Following the description regarding (5) in Section 2.2, when the response is discrete or categorical, the number of "slices" L is given. However, when the response is continuous, the choice of L is critical. This is a well-recognized challenge in inverse regression-based sufficient dimension reduction methods (see Wang and Xia 2008; Yao et al. 2019). There is a tradeoff between using more slices to achieve a more accurate approximation of the overall object of interest (in our case, J_{X|Y}) and using fewer slices to have a sufficiently large number of observations for in-slice calculations (in our case, the estimation of each J_{X|Y=y^(ℓ)}). For the simulations in Section 3, we used total sample sizes n = 200 and 600 and investigated the CIS screening performance for L = 2, 3, 5, 8, 10, and 12. CIS performance does vary with L, because we need a large enough sample size within each slice for kernel density estimation to be reliable. We find that moderate values (e.g., L = 3-8) work well in most cases, and that the effect of L becomes negligible as the total n becomes larger and/or the signal-to-noise ratio in the data becomes stronger (see Sections 3 and S3).
Another critical choice in our estimation is that of kernel and bandwidth. In our implementation, we use a simple Gaussian kernel for k_h(·) (k_{h_ℓ}(·) for the sub-population with Y = y^(ℓ)) and Silverman's rule of thumb for the bandwidth, which sets h = 1.06 σ̂ n^{−1/5}, where σ̂ is the sample standard deviation and n the number of observations used in the density estimate (the in-slice quantities n_ℓ and σ̂_ℓ for the conditional densities).

Computational Burden
Computational burden is an important consideration for any feature screening procedure. In addition to the number of covariates to be screened (p), it depends on the time needed to calculate each marginal utility. Our CIN marginal utility (ω̂*_j) has a reasonable computational cost, making CIS viable also in applications with a large number of covariates (see Section 4), and comparable to other model-free screens. For instance, we performed a comparative study of the elapsed computation times for CIS and the five additional screening procedures (see Sections 1, 3.2, and 4) in a simulation scenario with p = 2000 and n = 200 (this used Model(3)(a) with σ = 0.5 and Σ_X = Σ_X^(I); see Section 3.2). In this comparison, all methods except HOLP were implemented in MATLAB (MATLAB 2020), version 9.9.0.1467703 (R2020b). HOLP was implemented in R (R Core Team 2020), version 4.0.2, using the GitHub R package screening available at https://github.com/wwrechard/screening (see Section S6 for more details). All codes were run on a MacBook Pro 2019 laptop with macOS Mojave Version 10.14.6, a 2.3 GHz Intel Core i9 processor, and 16 GB 2400 MHz DDR4 RAM.
Taking the medians over 100 simulation runs, CIS with L = 5 slices took ≈3.054 sec to compute all p = 2000 marginal utilities (see also Table 1). This was higher than, but comparable to, the widely used model-free DC-SIS, which took ≈1.222 sec. As is to be expected given the much simpler nature of their marginal utilities, SIS and HOLP were much faster, taking, respectively, ≈0.082 and ≈0.074 sec. SIRS and MDC-SIS, which are also model-free, took, respectively, ≈0.414 and ≈0.741 sec (see Table S25 for run times with n = 600). Note that, except for HOLP, since the utility of each covariate is computed marginally, total computation time scales linearly with p. Extrapolating from the calculations above, in an application with n = 200 and as many as p = 2,000,000 covariates, CIS would compute all marginal utilities in ≈50 min. Of course, computation time could be vastly reduced by implementing screens in more efficient programming languages, such as C (Ritchie, Kernighan, and Lesk 1988).

CIS Algorithm
Let (y_i, x_i), i = 1, 2, …, n, be a random sample from the distribution of (Y, X). To implement CIS in practice, we proceed as follows: (1) for each covariate X_j, j = 1, 2, …, p, compute the estimated normalized CIN marginal utility ω̂*_j as described in Section 2.2; (2) rank the covariates by decreasing ω̂*_j, fix a threshold d, and estimate Â_d as the top d ranking covariates.
The normalization in (1) ensures that all marginal utilities are on the same density information scale. Moreover, by Theorem 2.1(iv), the ratio expressing ω * j is invariant to affine transformations of X j , for example, marginal centering and/or scaling to unit variance.
The calculations in (1) involve the tuning parameters L (number of slices, if the response is continuous) and h (bandwidth used in kernel density estimation) discussed in Section 2.3. Notably, (2) involves another crucial quantity that plays a role possibly much more vital than those of the tuning parameters in the marginal utility calculation: the number d of covariates retained in the screen. Following the literature (e.g., Fan and Lv 2008; Li, Zhong, and Zhu 2012; Shao and Zhang 2014), in our simulations we employ a hard threshold defined as d = k × n/log(n) with constant multiplier k = 1, 2, or 3 (see Sections 3 and S3). However, the choice of threshold remains an intriguing open question for screening algorithms. For instance, Zhu et al. (2011) showed that one can use a hard threshold, a soft threshold, or a combination of both. How to select d in an effective and data-driven fashion is beyond the scope of this article, but we hint at the development of a potential diagnostic in Section 5.
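The two steps of the algorithm can be sketched as follows (a minimal illustration: the marginal utility is left as a plug-in argument, and the absolute Pearson correlation used here is only a stand-in for the estimated CIN):

```python
import numpy as np

def cis_screen(X, y, utility, k=1):
    """Step (1): compute one marginal utility per covariate.
    Step (2): retain the top d = k * n / log(n) ranking covariates."""
    n, p = X.shape
    d = int(k * n / np.log(n))                          # hard threshold
    scores = np.array([utility(X[:, j], y) for j in range(p)])
    return np.argsort(-scores)[:d]                      # estimated active set A-hat_d

def abs_corr(xj, y):
    """Stand-in marginal utility (SIS-style |Pearson correlation|)."""
    return abs(np.corrcoef(xj, y)[0, 1])
```

For n = 200 and k = 1, this retains d = 37 covariates; doubling or tripling k trades a larger retained set for a lower false-negative risk.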

Sure Screening Property of CIS
In this section, we establish the sure screening property (Fan and Lv 2008) for our CIS, built upon the CIN marginal utility. At the outset, we adjust the definition of the estimated active set as follows:

Â = {j : ω̂*_j ≥ c_0 n^{−κ}, 1 ≤ j ≤ p},    (8)

where c_0 > 0 and 0 < κ < γ < 1/3 are given constants (see below). In proving the sure screening property, we consider a discrete or categorical Y with L distinct values or labels. Recall that if Y is continuous, we operate with its "discretized" version obtained through slicing (see Sections 2.2 and 2.3).

Assumptions and Regularity Conditions
In proving sure screening, we assume that all active covariates X_j, j ∈ A, satisfy a minimum signal strength condition, and specifically that

min_{j∈A} ω*_j ≥ 2 c_0 n^{−κ},    (9)

where c_0 > 0 and 0 < κ < γ < 1/3 are the same constants that appear in (8). Note that here the minimum value for the signals (the true marginal utilities) is twice c_0 n^{−κ}. As pointed out in Liu, Li, and Wu (2014), this assumption bounds the marginal utilities of active covariates away from 0 for any finite n. However, as n increases, this minimum signal strength can decrease, converging to 0 asymptotically. This indicates that, when n is very large, our procedure, which possesses the sure screening property, can retain covariates whose marginal association with Y is weak but that are jointly associated with the response. This assumption corresponds to Condition 3 in Fan and Lv (2008) and is commonly used in the screening literature (see, e.g., Li, Zhong, and Zhu 2012; Shao and Zhang 2014). Also note that, while (9) states the assumption on the minimal signal strength of the active covariates, we do not assume any condition on the order of the maximum signal strength to establish the sure screening property of CIS.
In the context defined by the assumptions and conditions above, we have the following theorem (additional details and proofs are provided in Sections S1.6-S1.8).
Theorem 2.2 (Sure screening property for CIS). Let ξ*_0 and c_0 be positive constants, let κ and γ be constants such that 0 < κ < γ < 1/3, and let n_(1) = min_{1≤ℓ≤L} n_ℓ be the size of the least numerous class (or slice). For j = 1, 2, …, p, the probability P(|ω̂*_j − ω*_j| ≥ c_0 n^{−κ}) is bounded above by a term that decays exponentially fast in n_(1) (Equation (10); the explicit bound, involving ξ*_0, is given in Section S1), where ω*_j and ω̂*_j are the true and the estimated CIN marginal utilities, respectively. Moreover, using the definition of Â in (8) and assuming the minimum signal strength condition in (9), we have P(A ⊆ Â) ≥ 1 − s_n × (the exponential bound above) → 1, where s_n is the cardinality of the active set A.
Note that the cardinality of A in (10), s_n, is indexed so as to indicate dependence on n: CIS guarantees sure screening when the number of covariates (p) as well as the number of active covariates grow as we gather more observations (see also Li, Zhong, and Zhu 2012; Liu, Li, and Wu 2014; Shao and Zhang 2014). Concerning the way p grows with the sample size, the exponent in (10) shows that CIS guarantees sure screening also with a non-polynomial (exponential in n) number of covariates, provided π_(1) = n_(1)/n, the smallest class proportion, does not vanish too quickly. Recall that, when Y is continuous, π_(1) ≈ 1/L since we create slices containing approximately equal numbers of observations (see Section 2.2). Finally, we note that, similar to other model-free approaches such as DC-SIS (Li, Zhong, and Zhu 2012), CIS guarantees sure screening under much more generic conditions compared to SIS (Fan and Lv 2008); in particular, CIS does not require a linear regression function of Y on X.

Simulation Study
In this section, we present simulation results on the performance of CIS in comparison to those of SIS (Fan and Lv 2008), HOLP (Wang and Leng 2016), SIRS (Zhu et al. 2011), DC-SIS (Li, Zhong, and Zhu 2012), and MDC-SIS (Shao and Zhang 2014). Some of the simulation scenarios are adapted from Zhu et al. (2011), Cui, Li, and Zhong (2015), and Chen, Fan, and Li (2018).

Summary Statistics to Assess Screening Performance
Because screening procedures are used as a preliminary step, followed by modeling and fitting efforts in which predictors are further assessed, their main priority is sensitivity; as we separate active from inactive predictors, we want to minimize false negatives, that is, cases in which j ∈ A but j ∉ Â. Thus, to measure performance, we consider two summary statistics commonly used in the literature. (a) For each simulated dataset, we compute the maximum rank achieved by the true predictors X_j, j ∈ A, or equivalently, the minimum size of Â required to ensure A ⊆ Â. Following Zhu et al. (2011), we denote this by R and call it the ranking measure. R close to s = |A| is evidence of ranking consistency, another important property for feature screening procedures (see Zhu et al. 2011). In the result tables below, we present the median (median absolute deviation in parentheses) of R over N = 1000 simulated datasets corresponding to each simulation scenario. (b) For each simulation scenario, fixing d = |Â_d| = k · (n/log(n)) (k = 1, 2, or 3; see Section 2.5), we compute the proportion of simulated datasets (out of N) in which A ⊆ Â_d. Following Shao and Zhang (2014), we denote this by P_a. P_a close to 1 is evidence of sure screening for a procedure.
To further investigate which among the active predictors are easier or harder to retain for a procedure, we also consider predictor-specific inclusion proportions denoted by P_j, j ∈ A. We call P_a and the P_j's the inclusion measures.
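Given a vector of estimated utilities and a known active set (as in a simulation), R and the inclusion indicator underlying P_a can be computed as follows (a sketch; function names are ours):

```python
import numpy as np

def ranking_measure(scores, active):
    """R: largest rank (1 = highest utility) attained by a truly active
    predictor, i.e., the minimum model size ensuring all of A is retained."""
    ranks = np.argsort(np.argsort(-scores)) + 1   # rank of each covariate
    return int(ranks[list(active)].max())

def includes_active(scores, active, d):
    """Indicator that all of A is among the top-d ranked covariates;
    P_a is the average of this indicator over simulated datasets."""
    return set(active).issubset(np.argsort(-scores)[:d].tolist())
```

For example, utilities (0.9, 0.1, 0.8, 0.2, 0.7) with active set {0, 4} give R = 3: the weakest active predictor ranks third, so a screen of size d ≥ 3 retains all of A.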

Simulation Scenarios
We create simulation scenarios based on the elements described below.
1. Sample size, number of predictors, and number of active predictors (n, p, s). We use p = 2000, n = 200 and 600, and s = |A| varying between 3 and 40 (this controls the sparsity level; the smaller the s, the sparser the problem). As the only exception, for Model(3)(d) we use n = 117 (see below).

2. Nature of the predictors. We simulate the p entries in the covariate vector X with different schemes. We start by drawing from a p-variate Gaussian X ∼ N_p(0, Σ_X) (see below for covariance specifications) and: (i) we keep the vector as drawn, to have p continuous predictors (Models(1), (2), (3)(a), (4)-(5)); (ii) we replace 50% of the entries in X with independently drawn binary predictors (Model(3)(b)); (iii) we replace 50% of the entries in X with independently drawn "perturbed" continuous predictors obtained from a Gaussian mixture spiked with a very high variance component (Model(3)(c)). In addition, we consider (iv) predictors randomly selected from the ones in our real data application in Section 4 (Model(3)(d)).

3. Structure of the predictor covariance matrix. For (i)-(iii) above, the multivariate Gaussian X ∼ N_p(0, Σ_X) has four different covariance specifications, including the structures Σ_X^(I), Σ_X^(A), and Σ_X^(C) referenced in the results below.

4. Response generating process. We generate a continuous Y ∈ R using single- or multi-index models. These comprise m = 1, 2, or 3 indexes (i.e., linear combinations of the predictors X_j, j ∈ A) acting linearly or nonlinearly on the mean or the variance of Y|X (Models (1)-(5)).

5. Nature of the error. We always use additive errors and consider: (i) two homoscedastic cases, namely a Gaussian error (ε_1) and a heavier-tailed error (ε_2; see Model(4)); and (ii) a heteroscedastic case (ε_3) in which the error is scaled by a function g(σ, β^T X) of the active predictors.

6. Signal-to-noise ratio (SNR). We define SNR = var(E(Y|X)) / E(var(Y|X)). When we use ε_1 or ε_2, all signal is contained in the mean E(Y|X); var(Y|X) does not depend on X. In these cases, SNR has the standard definition. When we use ε_3, var(Y|X) = g^2(σ, β^T X) itself contains "signal"; SNR benchmarks the signal in the mean to that in the variance.
For the homoscedastic cases, we vary a scalar multiplier σ in the mean functions E(Y|X), and for the heteroscedastic case, the σ in g 2 (σ , β T X), as to obtain SNRs ranging between 0.05 and 20 (see below).
In more detail, for the response generating process, we consider five models. Model(1) is a single-index model in which the index acts linearly on the mean of Y|X; we use X ∼ N_p(0, Σ_X), and σ ranging between 0.34 and 2.02 to give rise to SNRs in the range ≈0.8 ("low") to ≈20 ("high"). Given its underlying (linear) assumptions, we included HOLP (Wang and Leng 2016) in our comparisons only for this model, where its performance should be among the best.

Model(5).
Multi-index with heteroscedastic error: Y = X_1 + X_2^2 + ε_3 with g(σ, β^T X) = exp{σ|X_22|}. Here m = 3 and A = {1, 2, 22} with s = 3. We use n = 200 and 600; X ∼ N_p(0, Σ_X) with Σ_X = Σ_X^(A); and σ in the range 0.23-1.37, giving rise to SNRs in the range 2-0.05. Recall that, instead of the standard definition, SNR here benchmarks the strength of the mean signal to that of the variance signal.
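As an illustration, Model(5) data can be generated as follows (a sketch: for simplicity we use independent standard normal predictors and a standard normal base error, rather than the Σ_X^(A) covariance specified above):

```python
import numpy as np

def simulate_model5(n, p, sigma, rng):
    """Model(5): Y = X_1 + X_2^2 + eps_3, with heteroscedastic error
    eps_3 = exp{sigma * |X_22|} * eps and active set A = {1, 2, 22}."""
    X = rng.standard_normal((n, p))           # independent predictors (simplification)
    eps = rng.standard_normal(n)
    g = np.exp(sigma * np.abs(X[:, 21]))      # X_22 (1-based) drives the variance
    y = X[:, 0] + X[:, 1] ** 2 + g * eps      # X_1 and X_2 drive the mean
    return X, y

X, y = simulate_model5(200, 2000, 0.5, np.random.default_rng(0))
```

Note that X_22 carries no mean signal at all, which is exactly why mean-targeted screens are expected to miss it here.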
In addition to the above scenarios, all with a continuous response, we also investigate scenarios with a categorical response (see Model(6) in Section S3). For each of the scenarios described above, we simulate N = 1000 datasets to compute the performance summary statistics, and we assess the effect of the number of slices on CIS screening by varying L between 2 and 12.

Simulation Results
Due to space constraints, here we present results only for selected scenarios of Models (2) and (3)(a); the remaining results are in Section S3.

Table 2. Median (median absolute deviation) of the ranking measure R over N = 1000 simulated datasets. Model(2) with n = 200 and p = 2000. Sparsity (s) and the predictor covariance structure (Σ_X) vary as indicated in the table. CIS results shown for L = 5 slices. Performance is better for values closer to s.

Table 3. Inclusion measures P_a and P_j, j = 1, 2, 3, 4, over N = 1000 simulated datasets, provided at thresholds d = n/log(n), 2n/log(n), 3n/log(n) (triplets in parentheses). Model(2).

The weaker performance of several competing procedures, including MDC-SIS, is driven by their inability to capture the active covariates X_1 and X_2 involved in the first index, that is, the linear component in E(Y|X) (see Model(2)). While predictors acting linearly ought to be easy to identify, also with the Pearson correlation-based SIS, the exponential scale possibly renders the signals associated with X_3 and X_4 much stronger. Under Σ_X^(C) (the compound-symmetric structure that ought to hinder screening), the P_a's for CIS[3] are the highest.
The results described above represent only a small portion of the extensive simulation experiments we conducted (see Section 3.2). Below, we describe salient trends and observations based also on the additional results presented in Section S3. CIS exhibits promising performance in terms of both sure screening and rank consistency under a wide range of scenarios. Overall, as expected intuitively, CIS performs better at larger SNRs and sample sizes, where it is also less sensitive to the number of slices (L) used for a continuous Y. In Model(1) scenarios, CIS shows sure screening and rank consistency performance competitive with that of the other screening procedures (Tables S1 and S2), especially for larger SNRs (5 and 20) and smaller L (3 and 5). In Model(2) scenarios, CIS does better than SIS, DC-SIS, and MDC-SIS; it performs very well when sparsity is high and, while it tends to deteriorate at lower sparsity under one predictor covariance structure, it remains stable across sparsity levels under the other (Tables 2, 3, S3, and S4). In all the Model(3) scenarios ((a)-(d)), CIS does better than SIS and SIRS, which fail for reasons similar to those articulated above for Model(3)(a). In general, CIS performs well at "moderate" and "high" SNRs, and its deterioration at lower SNR can be counteracted by increasing the sample size and/or using a smaller number of slices to guarantee a sufficient number of observations per slice (Tables 4-6 and S5-S15). Moreover, the measures for inclusion (P_a) and ranking (R) provide empirical evidence for sure screening and ranking consistency of CIS, respectively. This is true when all predictors are continuous (Model(3)(a); Tables 4-6, S5-S7), but also in cases where continuous predictors are mixed with categorical predictors (Model(3)(b); Tables S8 and S9) or with "perturbed" continuous predictors (Model(3)(c); Tables S10-S13), and when predictors are sub-sampled from real data (Model(3)(d); Tables S14 and S15). Notably, in the latter "realistic" scenario, which carries substantial collinearities (see Section 4 and Figure S1), CIS with L = 3 slices (CIS[3]) performs best, beating also the otherwise strongest competitor DC-SIS. CIS performs very well also with the heavier-tailed error of Model(4) (Tables S16-S18; results are similar to those for Model(3)(a)). When n = 200, almost all P_a's for CIS[3] beat those for DC-SIS and MDC-SIS under the compound-symmetric covariance structure and, with a larger sample size (n = 600), CIS performs competitively with these methods in all respects. SIS and SIRS fail in all Model(4) scenarios for reasons similar to those discussed for Model(3)(a). In Model(5) scenarios, where the error is heteroscedastic, CIS and DC-SIS are the best overall performers (Tables S19-S21). When n = 200, the ranking performance of CIS with L = 5 slices (CIS[5]) is better than with smaller or larger L, likely due to the added complexity of capturing signals in the variance (as opposed to the mean). When n = 600, once again CIS performs well, along with DC-SIS, across all L = 3-12 and SNR levels less than 1. Notably, MDC-SIS always fails to capture the active predictor X_22 present in the variance component of the model (Table S19); this is because its marginal utility is designed to detect predictors contributing to the conditional mean of the response (Shao and Zhang 2014). Finally, results for Model(6) scenarios demonstrate that CIS performs quite well also in problems with categorical responses (Tables S22-S24).

Table 4. Median (median absolute deviation) of the ranking measure R over N = 1000 simulated datasets. Model(3)(a) with p = 2000 and s = 3. Sample size (n), SNR, and predictor covariance structure (Σ_X) vary as indicated in the table. CIS results shown for L = 5 slices. Performance is better for values closer to s = 3.

Table 5. Inclusion measures P_a and P_j, j = 1, 2, 5, over N = 1000 simulated datasets, provided at thresholds d = n/log(n), 2n/log(n), 3n/log(n) (triplets in parentheses). Model(3)(a) with p = 2000, s = 3, and n = 200. SNR and predictor covariance structure (Σ_X) vary as indicated in the table. CIS results shown for L = 3 and 5 slices. Values are multiplied by 10^3; values closer to 1000 indicate better performance.

Table 6. As Table 5 for Model(3)(a), with sample size n = 600.
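For concreteness, the two performance measures reported throughout these tables can be sketched as follows. This is a minimal illustration in Python (the accompanying code is in MATLAB and R); the function names are ours, and we assume larger marginal utilities indicate stronger evidence of association.

```python
import numpy as np

def ranking_measure(scores, active):
    """R: the smallest d such that the top-d ranked predictors
    contain all the active ones (values closer to |active| are better)."""
    order = np.argsort(scores)[::-1]                 # indices, best first
    ranks = {j: r + 1 for r, j in enumerate(order)}  # 1-based ranks
    return max(ranks[j] for j in active)

def includes_all(scores, active, d):
    """P_a indicator for one dataset: are all active predictors
    among the top d? (P_a is the proportion of such successes.)"""
    top_d = set(np.argsort(scores)[::-1][:d])
    return set(active) <= top_d

# toy example: predictors 0 and 3 are active
scores = np.array([0.9, 0.1, 0.2, 0.8, 0.05])
print(ranking_measure(scores, [0, 3]))       # 2: both active in the top 2
print(includes_all(scores, [0, 3], d=2))     # True
```

Averaging `includes_all` over simulated datasets at d = n/log(n), 2n/log(n), 3n/log(n) yields the P_a triplets shown in the tables.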

Application to Transcriptomic Data
In this section, we analyze a transcriptomic dataset (Affymetrix GeneChip Rat Genome 230 2.0 Array Data; Scheetz et al. 2006) already used in the feature screening and variable selection literature (Huang, Horowitz, and Wei 2010; Fan, Feng, and Song 2011; Wang, Wu, and Li 2012; Shao and Zhang 2014; Wang and Leng 2016). The expression Quantitative Trait Loci (eQTL) experiment in Scheetz et al. (2006) used gene transcription measurements from 120 12-week-old male F2 Norway rats (Rattus norvegicus) to better understand gene regulation in the mammalian eye, with potential relevance to the study of human eye disease. The dataset is publicly available at the NCBI Gene Expression Omnibus with accession number GSE5680. Following Shao and Zhang (2014), we preprocess the data by taking log transformations and eliminating all genes (more accurately, probe IDs) that do not show sufficient variation across rats, which leaves us with 18,976 genes. As in prior screening exercises conducted on this dataset, we consider as response the transcription of TRIM32 (probe ID: 1389163_at), a gene with a causal association with Bardet-Biedl syndrome, which affects multiple human systems (Chiang et al. 2006). As a further preprocessing step, we identify outliers detected by both the built-in R function boxplot() and the "thrice median absolute deviation rule" (Barghash, Arslan, and Helms 2016) (see, for instance, Figure S2). We eliminate 34 genes that contain more than 12 outliers (10% of the total number of rats). Also, we omit three outliers detected for the response. Thus, we eventually work with a dataset comprising transcription levels for p = 18,941 genes (the predictors) and transcription levels for TRIM32 (the response) measured on n = 117 rats (the observations).
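The "thrice median absolute deviation rule" and the 10% gene-filtering step described above can be sketched as follows. This is a minimal Python illustration (the analysis itself used MATLAB/R code); the helper names and the layout of the toy matrix (rats in rows, genes in columns) are our assumptions.

```python
import numpy as np

def mad_outliers(x):
    """Flag values farther than 3 * MAD from the median
    (the "thrice median absolute deviation rule")."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > 3 * mad

def filter_genes(expr, max_outlier_frac=0.10):
    """Drop genes (columns) whose number of flagged outliers exceeds
    max_outlier_frac of the n samples (rows); return data and kept indices."""
    n = expr.shape[0]
    keep = [j for j in range(expr.shape[1])
            if mad_outliers(expr[:, j]).sum() <= max_outlier_frac * n]
    return expr[:, keep], keep
```

With n = 120 rats, a 10% threshold corresponds to the "more than 12 outliers" cutoff used in the text.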
On this dataset, we apply our CIS with L = 3 slices (CIS[3]), as well as SIS, HOLP, SIRS, DC-SIS, and MDC-SIS screening procedures, and GAMSEL (Chouldechova and Hastie 2015)-a generalized additive model selection procedure.
First, for each gene, we exclude outliers (12 or fewer values) and compute marginal utilities for all screening procedures considered. Next, we focus on the top ranked d = 10 genes, which differ substantially across procedures (Table S26), likely due to linear associations among predictors (the absolute values of pairwise Pearson correlations range between 0 and 0.9812; first quartile 0.0847, median 0.1796, third quartile 0.3053) and/or other complexities of the problem (e.g., level of sparsity, strength and nature of the signals). Notably though, two of the top 10 CIS[3] genes (ranks 1 and 6) are also within the top genes reported in Fan, Feng, and Song (2011). Also, the genes ranked 6 and 9 by CIS[3] are placed in the top 10 by all other procedures (except HOLP). DC-SIS and SIRS also include the gene ranked 5 by CIS[3] in their top 10. Interestingly, none of the top 10 genes for HOLP overlap with the top 10 of any other procedure considered. Figure 1 illustrates the marginal associations between the transcription of TRIM32 (response) and those of the top 10 CIS[3] genes (predictors). The panels for the genes ranked 5, 8, and 10 clearly show nonlinear relationships, supporting the notion that our CIN, unlike the marginal utility used by SIS, is a general measure of association. Preliminary queries indicate that the top 10 CIS[3] genes do indeed have biological significance. Most of them are conserved in other mammalian and vertebrate species including human, chimpanzee, Rhesus monkey, dog, cow, mouse, chicken, zebrafish, and frog, suggesting that they fulfill critical functions in the genome. Klhl7 (the gene ranked 7) is involved in the eye disease Retinitis pigmentosa (RP) in Norway rats; see the Rat Genome Database (RGD), ID 1305564. According to the Online Mendelian Inheritance in Man (OMIM) database, its human ortholog KLHL7 also plays a role in human RP (OMIM ID 611119; Friedman et al. 2009; Wen et al. 2011).
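Comparing the top-d lists produced by different screens, as done for Table S26, reduces to set operations on the rank orderings. A hypothetical sketch (the screen names and scores below are placeholders, not the actual utilities):

```python
import numpy as np

def top_d(scores, d):
    """Indices of the d predictors with the largest marginal utility."""
    return set(np.argsort(scores)[::-1][:d])

def overlap_table(score_dict, d):
    """Pairwise overlap counts between the top-d sets of each screen."""
    names = list(score_dict)
    tops = {s: top_d(score_dict[s], d) for s in names}
    return {(a, b): len(tops[a] & tops[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

scores = {"screen_A": np.array([3.0, 1.0, 2.0, 0.0]),
          "screen_B": np.array([3.0, 2.0, 1.0, 0.0])}
print(overlap_table(scores, d=2))  # one common index in the two top-2 lists
```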
In addition, according to the Mouse Genome Informatics (MGI) database, during wild-type mice development Klhl7 is expressed in the retina (MGI ID 1196453), Rbfox1 (the gene ranked 8) in the retina ganglion cell layer (MGI ID 1926224), and Dr1 (the gene ranked 5) in the retinal inner and outer layers (MGI ID 1100515). Mutations in Fam49b (the gene ranked 2) are involved in abnormal retinal morphology in mice (MGI ID 1923520). Finally, not directly related to the eye, Gorab (the gene ranked 4) plays a role in gerodermia osteodysplastica, osteoporosis and skin abnormalities in Norway rats (RGD ID: 1564990).
When we increase d from 10 to 5000, the overlaps across the top genes identified by the various screening procedures increase (e.g., the top 5000 CIS[3] and DC-SIS genes have ≈72% overlap; see Table S26). Following Wang and Leng (2016), we consider the top 5000 genes produced by each screening procedure for subsequent modeling. We marginally standardize each set of 5000 top-ranked predictors, as well as the response, to zero mean and unit variance and employ GAMSEL (Chouldechova and Hastie 2015), a penalized likelihood approach for fitting sparse generalized additive models in high dimension, using the CRAN package gamsel (Chouldechova, Hastie, and Spinu 2018). We also apply GAMSEL directly on all (standardized) p = 18,941 predictors without any screening. We tune the overall penalty parameter (λ ≥ 0) by 10-fold cross-validation, fixing the folds across runs for reproducibility and selecting the largest λ with cross-validation error within one standard error of the minimum. We set the penalty mixing parameter (0 ≤ γ ≤ 1; values <0.5 penalize the linear fit less than the nonlinear fit) to γ = 0.6 and the degrees (the maximum number of spline basis functions to use) to 5 for each predictor. All other parameters are left at their default values. Table 7 shows that CIS[3] leads to the highest deviance explained (81.64%), followed by SIS (81.11%). Notably, GAMSEL applied to all p = 18,941 predictors leads to the lowest deviance explained. To provide a benchmark, we create a "null" distribution as follows: we select d = 5000 genes at random 1000 times, each time fitting GAMSEL and recording the corresponding deviance explained. The density plot in Figure 2 shows that the deviance explained with 5000 CIS[3]-screened genes is significantly larger than what would be expected when randomly selecting 5000 genes. In contrast, the deviance explained with 5000 HOLP-screened genes is not. Nor is the deviance explained with GAMSEL applied to all genes.
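The tuning rule used here, the largest λ whose cross-validation error is within one standard error of the minimum (gamsel's "lambda.1se" convention), can be sketched generically. We assume a λ grid with mean CV errors and their standard errors already computed; this mirrors the rule itself, not the gamsel internals.

```python
import numpy as np

def lambda_1se(lambdas, cv_mean, cv_se):
    """Pick the largest lambda whose mean CV error is within one
    standard error of the minimum mean CV error ("lambda.1se")."""
    lambdas, cv_mean, cv_se = map(np.asarray, (lambdas, cv_mean, cv_se))
    i_min = int(np.argmin(cv_mean))
    cutoff = cv_mean[i_min] + cv_se[i_min]   # 1-SE band above the minimum
    return lambdas[cv_mean <= cutoff].max()  # heaviest eligible penalty

# toy grid: minimum CV error at lambda = 0.1, but 0.1 is already
# within the band, so the sparser fit at lambda = 0.1 wins over 0.01
print(lambda_1se([1.0, 0.5, 0.1, 0.01],
                 [0.9, 0.5, 0.4, 0.45],
                 [0.05, 0.05, 0.05, 0.05]))  # 0.1
```

Choosing the largest eligible λ trades a small CV-error increase for a sparser, more stable fit.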
Finally, we evaluate the out-of-sample performance of GAMSEL fits on the d = 5000 top ranked genes produced by each screening procedure, as well as on all p = 18,941 genes, and on a random selection of 5000 genes for benchmarking. We produce 200 90%-10% training-validation random splits of the n = 117 observations. We run GAMSEL fits on the training sets, and compute prediction errors on the corresponding validation sets. Figures S3(a)-(c) display boxplots of, respectively, the training-set deviance explained (in %), the number of nonzero training-set coefficient estimates obtained with 10-fold cross-validated λ's, and the validation-set root mean squared prediction error (RMSPE). On the training sets, HOLP and GAMSEL applied to all genes have similar median deviance explained (≈50%) and number of nonzero coefficients (≈15), which are comparable to those achieved with a random selection of 5000 genes. All other screening procedures lead to better fits (median deviance explained in the range ≈65%-70%) and larger gene sets (median number of nonzero coefficient estimates ≈25). Perhaps not surprisingly, given the small sizes of both training and validation sets (106 and 11, respectively) compared to dimensionality (d = 5000 for GAMSEL following screens, and p = 18,941 for GAMSEL directly applied to all genes), for all procedures the RMSPEs computed on the validation sets are on par with those obtained with a random selection of 5000 genes.
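The repeated training-validation exercise can be outlined as follows. This is a generic Python sketch in which fit_predict stands in for any fitting procedure (e.g., GAMSEL after screening); the function is our abstraction, not the actual analysis code.

```python
import numpy as np

def split_rmspe(X, y, fit_predict, n_splits=200, test_frac=0.10, seed=0):
    """Repeated 90%-10% random splits: fit on the training part and
    return the validation-set RMSPE for each split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = fit_predict(X[train], y[train], X[test])
        errs.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return np.array(errs)
```

With n = 117 and test_frac = 0.10, each split uses roughly 105-106 training and 11-12 validation observations, matching the setup described above.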

Concluding Remarks
In this article, we proposed CIS, a model-free feature screening procedure to reduce the predictor dimension in ultrahigh-dimensional supervised problems prior to the use of other statistical techniques for feature selection, dimension reduction, and regression or classification modeling.
CIS is built upon the CIN, a novel marginal utility which is essentially the univariate version of the covariate information matrix (Yao et al. 2019) and has an appealing interpretation in terms of the traditional Fisher information in statistics. It is applicable to any type of response (features), whether continuous, discrete, or categorical, with continuous features (response); it has a reasonable computational burden and possesses the important sure screening property.
Our simulation results demonstrate that CIS is competitive with, and in some cases superior to, popular feature screening procedures such as SIS, HOLP, SIRS, DC-SIS, and MDC-SIS. CIS successfully identifies active covariates at all levels of sparsity, with both continuous and categorical responses as well as categorical, "perturbed," and "realistic" predictors. As is to be expected, it outperforms SIS in the presence of nonlinear signals but, notably, it also outperforms DC-SIS and MDC-SIS in less sparse settings (higher number of active covariates). Moreover, it outperforms SIRS when active covariates affect the response through more than two linear combinations (i.e., indexes in multi-index models). Importantly, in addition to sure screening, our simulation results provide empirical evidence that CIS possesses the ranking consistency property. Like most procedures, CIS improves in performance with higher sample sizes and signal-to-noise ratios. The former is particularly relevant because CIS calculations require a reasonable number of observations per slice. Our general suggestion is to employ a relatively small number of slices (say, L = 3-8). Notably though, L ceases to affect CIS performance when the sample size is sufficiently large. Switching to the case of discrete or categorical responses, where L represents the number of distinct Y values, we note that this (similar to the number of active predictors and that of predictors overall) can increase with n. We considered a finite L to theoretically establish the sure screening property for CIS, but the proof could potentially be generalized to a diverging L.

Table 7. Deviance explained (%) and number of nonzero coefficient estimates selected by 10-fold cross-validated λ ("lambda.1se") from GAMSEL fits on the top d = 5000 genes/probe IDs ranked by each screen, or from GAMSEL directly applied to all p = 18,941 genes ("GAMSEL").
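The slicing step underlying the choice of L discussed above can be sketched as follows. This shows only quantile-based slicing of a continuous response, which is our illustrative reading of the mechanics; it is not the CIN computation itself.

```python
import numpy as np

def slice_response(y, L):
    """Assign each observation to one of L (approximately equal-sized)
    slices based on the empirical quantiles of a continuous response y."""
    y = np.asarray(y, dtype=float)
    # L - 1 interior quantile cut points define the L slices
    cuts = np.quantile(y, np.linspace(0, 1, L + 1)[1:-1])
    return np.searchsorted(cuts, y, side="right")

y = np.arange(12, dtype=float)
print(slice_response(y, L=3))  # three slices of 4 observations each
```

Equal-sized slices keep the per-slice sample count at roughly n/L, which is why a small L is advisable at smaller n.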

While the sure screening property addresses concerns about false negatives, screens can retain false positives when correlations between inactive and active covariates produce spurious associations with the response (Fan and Lv 2008). One way to mitigate this issue is iteration. For example, an iterative model-based screening procedure can be found in Fan, Samworth, and Wu (2009). Iterative model-free screening procedures also exist, and are often based on the notion of a predictor residual matrix, first introduced for iterative SIRS in Zhu et al. (2011) and later used for iterative DC-SIS (Zhong and Zhu 2015). An iterative CIS could be developed as well. Of course, iteration increases the computational burden. In practice, evaluating the strength and structure of the associations among covariates can help gauge whether such burden is justified as a way to reduce potential false positives. Further discussion of iteration can be found in Fan, Samworth, and Wu (2009). The interested reader can also refer to univariate penalization screening (UPS; Ji and Jin 2012), covariance assisted screening and estimation (CASE; Ke, Jin, and Fan 2014), and graphlet screening (Jin, Zhang, and Zhang 2014), among others, for ideas on two-stage "screen and clean" procedures to tackle potential false positives.
We foresee several additional avenues for future work. One is combining different screening approaches. For instance, consider a composite marginal utility of the form ω(τ) = τ ω_S(1) + (1 − τ) ω_S(2), where S(1) and S(2) indicate two different screens and τ ∈ [0, 1] is a weighting parameter. ω(τ), especially with an appropriate data-driven tuning of τ, could combine the strengths of different approaches. As another instance, consider the selection of the threshold d used for separating active and inactive covariates (both soft and hard thresholding rules are discussed in the literature; see Zhu et al. 2011; Li, Zhong, and Zhu 2012; Shao and Zhang 2014). Let c(d) = |Â_d^S(1) ∩ Â_d^S(2)| be the cardinality of the intersection of the active sets estimated by two screens using d. By construction c(d) ≤ d; a plot of c(d) versus d, d = 1, 2, ..., can be used as a visual diagnostic to identify d* where c(d*) comes very close to d*, that is, a threshold that guarantees high congruence between screens. Both the composite marginal utility and the threshold diagnostic plot, of course, could potentially combine more than two screens.
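Both ideas in this paragraph are straightforward to prototype. A sketch, assuming two vectors of marginal utilities placed on a comparable scale (a real application would require normalizing utilities across screens, which we gloss over here):

```python
import numpy as np

def composite_utility(w1, w2, tau):
    """omega(tau) = tau * omega_S(1) + (1 - tau) * omega_S(2)."""
    return tau * np.asarray(w1) + (1 - tau) * np.asarray(w2)

def congruence_curve(w1, w2):
    """c(d) = size of the intersection of the two screens' top-d sets,
    for d = 1, ..., p; by construction c(d) <= d."""
    o1 = np.argsort(w1)[::-1]  # screen 1 ranking, best first
    o2 = np.argsort(w2)[::-1]  # screen 2 ranking, best first
    p = len(o1)
    return [len(set(o1[:d]) & set(o2[:d])) for d in range(1, p + 1)]

# toy example: the screens agree on the top predictor, then diverge
print(congruence_curve([3.0, 2.0, 1.0], [3.0, 1.0, 2.0]))  # [1, 1, 3]
```

Plotting the returned c(d) against d and reading off where the curve hugs the diagonal gives the d* diagnostic described above.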
Another interesting future avenue is investigating the performance of CIS in the rare and weak signal regimes often encountered in Genome Wide Association Studies (GWAS), and in cases where the assumption that zero low-order marginal correlations imply zero higher-order partial correlations (also known as the "faithfulness" condition; Genovese et al. 2012) is violated due to factors such as signal cancellation (Wasserman and Roeder 2009). Procedures such as the covariance assisted screening and estimation (Ke, Jin, and Fan 2014) and the graphlet screening (Jin, Zhang, and Zhang 2014) address these issues.
We mentioned (Section 1) and demonstrated via numerical examples (Sections 3.2 and S3) that CIS can also be used to screen discrete or categorical predictors-as long as the response is continuous. Another important avenue for future work is the extension of CIS to cases where both the response and the covariates are discrete or categorical, as well as to cases where the response is multivariate. Developments in the former (e.g., Huang, Li, and Wang 2014;Cui, Li, and Zhong 2015) and the latter (e.g., Zhu et al. 2011;Li, Zhong, and Zhu 2012;Shao and Zhang 2014) directions already exist in the feature screening literature.

Supplementary Materials and Codes
Proofs of theoretical results, full simulation results, details on the transcriptomic data application, and some relevant additional information are provided in an online Supplement. MATLAB (MATLAB 2020) and R (R Core Team 2020) source functions for the implementation of CIS and other feature screening procedures, codes for the numerical examples in the simulation study, and the analyses of the transcriptomic data are publicly available at the following link: bit.ly/CIS-Codes.