False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation

Abstract. We develop a new class of distribution-free multiple testing rules for false discovery rate (FDR) control under general dependence. A key element in our proposal is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening, and information pooling. The proposed SDA filter first constructs a sequence of ranking statistics that fulfill global symmetry properties, and then chooses a data-driven threshold along the ranking to control the FDR. The SDA filter substantially outperforms the knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic p-values. We first develop finite-sample theories to provide an upper bound for the actual FDR under general dependence, and then establish the asymptotic validity of SDA for both FDR and false discovery proportion control under mild regularity conditions. The procedure is implemented in the R package sdafilter. Numerical results confirm the effectiveness and robustness of SDA in FDR control and show that it achieves substantial power gain over existing methods in many settings.


Introduction
Multiple testing provides a useful approach to identifying sparse signals from massive data. Recent developments in false discovery rate (FDR; Benjamini and Hochberg 1995) methodology have greatly influenced a wide range of scientific disciplines, including genomics (Tusher, Tibshirani, and Chu 2001; Roeder and Wasserman 2009), neuroimaging (Pacifico et al. 2004; Schwartzman, Dougherty, and Taylor 2008), geography (Caldas de Castro and Singer 2006; Sun et al. 2015), and finance (Barras, Scaillet, and Wermers 2010). Conventional FDR procedures, such as the Benjamini-Hochberg (BH) procedure, the adaptive p-value procedure (Benjamini and Hochberg 1997), and the adaptive z-value procedure based on the local FDR (Efron et al. 2001; Sun and Cai 2007), are developed under the assumption that the test statistics are independent. However, data arising from large-scale testing problems are often dependent. FDR control under dependence is a critical problem that requires much research. Two key issues are (a) how dependence may affect existing FDR methods, and (b) how to properly incorporate the dependence structure into inference.

FDR Control Under Dependence
The impact of dependence on FDR analysis was first investigated by Benjamini and Yekutieli (2001), who showed that the BH procedure, when adjusted at level α/(∑_{j=1}^p 1/j) with p being the number of tests, controls the FDR at level α under arbitrary dependence among the p-values. However, this adjustment is often too conservative in practice. Benjamini and Yekutieli (2001) further proved that applying BH without any adjustment is valid for FDR control for correlated tests satisfying the PRDS property. This result was strengthened by Sarkar (2002), who showed that the FDR control theory under positive dependence holds for a generalized class of step-wise methods. Storey, Taylor, and Siegmund (2004), Wu (2008), and Clarke and Hall (2009), respectively, showed that, in the asymptotic sense, BH is valid under weak dependence, Markovian dependence, and linear process models. Although controlling the FDR does not always require independence, some key quantities in FDR analysis, such as the expectation and variance of the number of false positives, may possess substantially different properties under dependence (Owen 2005; Finner, Dickhaus, and Roters 2007). This implies that conventional FDR methods such as BH can suffer from low power and high variability under strong dependence. Efron (2007) and Schwartzman and Lin (2011) showed that strong correlations degrade the accuracy of both estimation and testing. In particular, positive/negative correlations can make the empirical null distributions of z-values narrower/wider, which has a substantial impact on subsequent FDR analyses. These insightful findings suggest that it is crucial to develop new FDR methods tailored to capture the structural information among dependent tests.
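To make the adjustment concrete, the sketch below (a minimal numpy implementation; the simulation sizes are illustrative, not taken from the article) runs the BH step-up rule at level α and at the Benjamini-Yekutieli corrected level α/(∑_{j=1}^p 1/j), showing how much more conservative the correction is.

```python
import math
import numpy as np

def bh_rejections(pvals, alpha):
    """Benjamini-Hochberg step-up: find the largest k with p_(k) <= k*alpha/p
    and reject the k smallest p-values."""
    p = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(np.sort(pvals) <= np.arange(1, p + 1) * alpha / p)[0]
    reject = np.zeros(p, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True
    return reject

rng = np.random.default_rng(0)
p, alpha = 1000, 0.2
z = rng.normal(size=p)
z[:100] += 3.0                                        # 10% true signals
pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])  # two-sided

alpha_by = alpha / sum(1.0 / j for j in range(1, p + 1))  # BY-adjusted level
n_bh = bh_rejections(pvals, alpha).sum()
n_by = bh_rejections(pvals, alpha_by).sum()
```

Since the harmonic sum is about 7.5 for p = 1000, the adjusted level is roughly α/7.5, and the BY-adjusted run can only reject a subset of the BH rejections.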
Intuitively, high correlations can be exploited to aggregate weak signals from individuals to increase the signal-to-noise ratio (SNR). Hence, informative dependence structures can become a blessing for FDR analysis. For example, the works of Benjamini and Heller (2007), Sun and Cai (2009), and Sun and Wei (2011) showed that incorporating functional, spatial, and temporal correlations into inference can improve the power and interpretability of existing methods. However, these methods are not applicable to general dependence structures. Efron (2007), Efron (2010), and Fan, Han, and Gu (2012) discussed how to obtain more accurate FDR estimates by taking into account arbitrary dependence. For a general class of dependence models, Leek and Storey (2008), Friguet, Kloareg, and Causeur (2009), Fan, Han, and Gu (2012), and Fan and Han (2017) showed that the overall dependence can be much weakened by subtracting the common factors out, and factor-adjusted p-values can be employed to construct more powerful FDR procedures. The works by Hall and Jin (2010), Jin (2012), and Li and Zhong (2017) showed that, in both the global testing and multiple testing contexts, the covariance structures can be used, via transformation, to construct test statistics with increased SNR, revealing the beneficial effects of dependence. However, the above methods, for example those of Fan and Han (2017) and Li and Zhong (2017), rely heavily on the accuracy of estimated models and the asymptotic normality of the test statistics. In the finite-sample setting, poor estimates of model parameters or violations of the normality assumption may lead to less powerful and even invalid FDR procedures. This article aims to develop a robust and assumption-lean method that effectively controls the FDR under general dependence with much improved power.
The summary statistic ξ̄ = n^{-1} ∑_{i=1}^n ξ_i obeys a multivariate normal (MVN) model asymptotically:

ξ̄ ~ MVN(μ, n^{-1}Σ).    (1)

Denote by Ω = Σ^{-1} the precision matrix. We first assume that Σ is known. For the case with an unknown precision matrix, a data-driven methodology and its theoretical properties are discussed in Section 4. The problem of multiple testing under dependence can be recast as a variable selection problem in linear regression. Specifically, by taking a "whitening" transformation, Model (1) is equivalent to the following model:

Y = Xμ + ε,    (2)

where Y = Ω^{1/2} ξ̄ ∈ ℝ^p is the pseudo response, X = Ω^{1/2} ∈ ℝ^{p×p} is the design matrix, I_p is the p-dimensional identity matrix, and ε = (ε_1, . . ., ε_p)ᵀ are noise terms that are approximately independent and normally distributed. The connection between model selection and FDR control was discussed in Abramovich et al. (2006) and Bogdan et al. (2015), respectively, under the normal means model and the regression model with orthogonal designs. Let θ_j = I{μ_j ≠ 0}, j = 1, . . ., p, where I is an indicator function, and θ_j = 0/1 corresponds to a null/nonnull variable. Let δ_j ∈ {0, 1} be a decision, where δ_j = 1 indicates that H_{0j} is rejected and δ_j = 0 otherwise. Let A = {j : μ_j ≠ 0} denote the nonnull set and A^c = {1, . . ., p} \ A the null set. The set of coordinates selected by a multiple testing procedure is denoted Â = {j : δ_j = 1}. Define the FDP and true discovery proportion (TDP) as follows:

FDP = #{j : δ_j = 1, j ∈ A^c} / (#{j : δ_j = 1} ∨ 1),  TDP = #{j : δ_j = 1, j ∈ A} / (#{j ∈ A} ∨ 1),

where a ∨ b = max(a, b). The FDR is the expectation of the FDP: FDR = E(FDP). The average power is defined as AP = E(TDP).
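The whitening step and the FDP/TDP definitions can be checked numerically. The sketch below (illustrative dimensions; an AR(1) covariance is used as in the later examples) verifies that with X = Ω^{1/2} the covariance of Y = X ξ̄ is n^{-1} I_p, and computes the FDP and TDP of a selected set.

```python
import numpy as np

p, n, rho = 50, 100, 0.6
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1) covariance
Omega = np.linalg.inv(Sigma)                         # precision matrix

# Symmetric square root: X = Omega^{1/2}.
w, V = np.linalg.eigh(Omega)
X = V @ np.diag(np.sqrt(w)) @ V.T

# Whitening: Cov(xi_bar) = Sigma/n, hence Cov(X xi_bar) = X Sigma X / n = I/n.
cov_Y = X @ (Sigma / n) @ X

def fdp_tdp(selected, nonnull):
    """FDP and TDP of a selected index set, per the definitions above."""
    selected, nonnull = set(selected), set(nonnull)
    fdp = len(selected - nonnull) / max(len(selected), 1)
    tdp = len(selected & nonnull) / max(len(nonnull), 1)
    return fdp, tdp

fdp, tdp = fdp_tdp([1, 2], [0, 1])   # one false and one true discovery
```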

FDR Control by Symmetrized Data Aggregation
This article introduces a new information pooling strategy, the SDA, for handling the dependence issue in multiple testing. The SDA involves splitting and reassembling data to construct a sequence of statistics fulfilling symmetry properties. Our proposed SDA filter for FDR control consists of three steps:
• The first step splits the sample into two parts, both of which are used to construct statistics to assess the evidence against the null.
• The second step aggregates the two statistics to form a new ranking statistic fulfilling symmetry properties.
• The third step chooses a threshold along the ranking by exploiting the symmetry property between positive and negative null statistics to control the FDR.
To gain intuition on how the idea works, we start with the independent case (Zou et al. 2020). The more interesting but complicated dependent case will be described shortly, with detailed discussions, refinements, and justifications deferred to later sections. Suppose the vectors ξ_i = (ξ_{i1}, . . ., ξ_{ip})ᵀ are iid obeying MVN(μ, I_p). The proposed SDA method first splits the full sample into two disjoint subsets D_1 and D_2, with sizes n_1 and n_2, where n = n_1 + n_2. A pair of statistics, both of which follow N(0, 1) under the null, are then calculated to test H_{0j}: μ_j = 0:

T_{kj} = n_k^{1/2} ξ̄_{kj},  k = 1, 2,

where ξ̄_{kj} is the sample mean of the jth coordinate over D_k. The product W_j = T_{1j}T_{2j} is used to aggregate the evidence across the two groups. If |μ_j| is large, then both T_{1j} and T_{2j} tend to have large absolute values with the same sign, thereby leading to a positive and large W_j. By contrast, W_j fulfills the symmetry property under H_{0j}, that is,

Pr(W_j ≥ t | H_{0j}) = Pr(W_j ≤ -t | H_{0j}) for all t > 0.    (4)

This motivates the following selection procedure: Â = {j : W_j ≥ L}, where L is the threshold chosen to control the FDR at level α:

L = inf{ t > 0 : #{j : W_j ≤ -t} / (#{j : W_j ≥ t} ∨ 1) ≤ α }.    (5)

According to the symmetry property (4), the count of negative W_j's below -t strongly resembles the count of false positives in the selected subset (i.e., the null W_j's above t). It follows that the fraction in Equation (5) provides a good estimate of the FDP.
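In the independent case the whole procedure is only a few lines. The following sketch (illustrative sample and effect sizes, not those of the article) splits the sample, forms W_j = T_{1j}T_{2j}, and applies the data-driven threshold of Equation (5).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha = 120, 800, 0.2
mu = np.zeros(p)
mu[:80] = 0.6                                  # 10% signals (illustrative)
xi = rng.normal(loc=mu, size=(n, p))           # iid rows ~ MVN(mu, I_p)

# Step 1: split the sample into D1 and D2.
n1 = 2 * n // 3
D1, D2 = xi[:n1], xi[n1:]

# Step 2: two N(0,1)-under-the-null statistics, aggregated via their product.
T1 = np.sqrt(n1) * D1.mean(axis=0)
T2 = np.sqrt(n - n1) * D2.mean(axis=0)
W = T1 * T2                                    # symmetric about 0 under the null

# Step 3: the data-driven threshold (5).
def fdp_hat(t):
    return np.sum(W <= -t) / max(np.sum(W >= t), 1)

ts = np.sort(np.abs(W))
L = next((t for t in ts if fdp_hat(t) <= alpha), np.inf)
selected = np.flatnonzero(W >= L)
realized_fdp = np.sum(selected >= 80) / max(selected.size, 1)
```

Most true signals have W_j around √n_1 √n_2 μ_j² and rise well above the null products, so the threshold search terminates at a small L.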
The dependent case involves a more carefully designed SDA filter. After sample splitting, we apply variable selection techniques such as the LASSO to D_1 to construct T_{1j}, which is calculated based on linear model (2) to capture the dependence structure. Before using D_2 to construct T_{2j}, we carry out a data screening step to narrow down the focus. We show that the screening step can significantly increase the SNR of T_{2j} under strong dependence; hence the correlations are exploited again to increase the power. The ranking statistic W_j is constructed by combining T_{1j} and T_{2j} with proven asymptotic symmetry properties. The theory of the proposed SDA filter is divided into two parts: the finite-sample theory provides an upper bound for the FDR under general dependence, while the asymptotic theory shows that both the FDR and FDP can be controlled at α + o(1) under mild regularity conditions.

Connections to Existing Works and Our Contributions
The SDA is closely related to existing ideas of sample splitting (Wasserman and Roeder 2009; Meinshausen, Meier, and Bühlmann 2009) and data carving (Fithian, Sun, and Taylor 2014; Lei, Ramdas, and Fithian 2021), both of which first divide the data into two independent parts, then use one part to narrow down the focus (or rank the hypotheses), and finally use the remainder to perform inference tasks such as variable selection, estimation, or multiple testing. These ideas share a common theme with covariate-assisted multiple testing (Lei and Fithian 2018; Cai, Sun, and Wang 2019; Li and Barber 2019), where the primary statistic plays the key role in assessing significance while the side information plays an auxiliary role in assisting inference (see also the discussion by Ramdas 2019). SDA provides a novel way of data aggregation in which both parts of the data, combined under the symmetry principle, play essential roles in both ranking and selection. This substantially reduces the information loss of conventional sample-splitting methods, while the symmetry principle, which is fulfilled by construction, enables the development of an effective and assumption-lean FDR filter.
The SDA is inspired by the elegant knockoff filter for FDR control (Barber and Candès 2015), which creates knockoff features that emulate the correlation structure of the original features, so as to form symmetrized ranking statistics for selecting important variables via the same mechanism (5). The knockoff method, originally developed under regression models, can be applied for FDR control in Model (1) via the equivalent Model (2). The knockoff filter employs local pairwise contrasts: the ranking variable is constructed to capture the differential evidence against the null exhibited by the pair (i.e., the original feature vs. its knockoff copy). While it is desirable to make the pair as "independent" as possible, high correlations greatly restrict the geometric space in which the knockoff can be constructed; see Appendix B.1 for detailed discussions and illustrations. This significantly increases the difficulty of distinguishing a variable from its knockoff and hence lowers the power. By contrast, the SDA filter, which does not rely on pairwise contrasts, does not suffer from high correlations.
To visualize the correlation effects, we consider a setup similar to (Barber and Candès 2015, fig. 5), where correlated normal, t, and exponential data are generated based on an autoregressive model Σ = (ρ^{|j−i|}) (see Section 5.2 for more details about the setup). We vary ρ from −0.9 to 0.9 and apply BH, knockoff, and SDA at FDR level α = 0.2. The actual FDRs and APs based on 500 replications are summarized in Figure 1. The first column (normal data) shows that knockoff outperforms BH in some situations, but both the FDR and AP of knockoff decrease as correlations grow higher. By contrast, SDA controls the FDR near the nominal level consistently, and the power of SDA increases sharply with growing correlations. This pattern corroborates the insights of Benjamini and Heller (2007), Sun and Cai (2009), and Hall and Jin (2010) that high correlations, which can be exploited to increase the SNR, may become a blessing in large-scale inference.
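The correlated normal and multivariate t data in this setup can be generated as follows (a sketch with illustrative dimensions; the exponential case can be handled analogously, e.g. via a Gaussian copula, which is an assumption of this sketch rather than the article's recipe).

```python
import numpy as np

rng = np.random.default_rng(2)
p, rho = 300, 0.8
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1): Sigma_ij = rho^|i-j|
C = np.linalg.cholesky(Sigma)

def ar_normal(m):
    """m rows of MVN(0, Sigma)."""
    return rng.normal(size=(m, p)) @ C.T

def ar_t(m, df=3):
    """m rows of multivariate t with df degrees of freedom and scale Sigma,
    built as normal / sqrt(chi2/df) with one chi-square draw per row."""
    chi = rng.chisquare(df, size=(m, 1)) / df
    return ar_normal(m) / np.sqrt(chi)

Z = ar_normal(2000)
corr01 = np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]   # should be close to rho
```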
The proposed research improves upon the work of Zou et al. (2020) in several ways. First, Zou et al. (2020) mainly focused on the independent and weakly dependent cases, with the major goal of deriving the convergence rate of the FDP when simultaneously performing thousands of t-tests. The methodology in Zou et al. (2020), which does not use the LASSO and does not include the data screening step, becomes highly inefficient under strong dependence; see the Appendix for an illustration. Second, our theories for FDR and FDP control under dependence, as well as the robustness analysis of the SDA filter under model misspecification, are new.
The SDA filter provides a model-free framework that overcomes the limitations of many selective inference procedures, for example, the methods in Lockhart et al. (2014) and Javanmard and Javadi (2019), which require strong assumptions about the conditional distribution to construct asymptotically valid p-values. Our numerical results show that the methods in Fan and Han (2017) and Li and Zhong (2017), which require correctly specified models, accurate parameter estimates, and normality assumptions, are in general not robust for FDR control. The SDA filter, which employs empirical distributions instead of asymptotic distributions, only requires the global symmetry of the ranking statistics. It is substantially more robust than its competitors across a wide range of scenarios, since asymptotic symmetry is much easier to achieve in practice than asymptotic normality. As illustrated by the second column (multivariate t data) of Figure 1, BH fails to control the FDR under heavy-tailed models. The failure to account for deviations from normality may result in
Figure 1. Impacts of correlation on different FDR procedures: "t" denotes the t distribution with 3 df and "exp" denotes the exponential distribution with scale parameter 2. In both cases, the models have been misspecified as normal when computing the p-values.
misleading empirical null distributions and severe bias in FDR analysis (Efron 2004; Delaigle, Hall, and Jin 2011; Liu and Shao 2014). Finally, our Theorem 1, which develops a finite-sample upper bound on the FDR under dependence, is closely connected to the robust knockoffs theory and is established using key arguments from Barber, Candès, and Samworth (2020). Specifically, we employ the leave-one-out technique to analyze the effect on the SDA filter of possible deviations from symmetry and the sure screening property, in a manner similar to the analysis of the effect on the model-X knockoff filter of errors in estimating the true covariance structure. This important connection sheds light on how model uncertainty can affect the actual FDR level and how the error bound on the FDR can be explicitly quantified using appropriate deviation measures; a detailed discussion is provided in Appendix B.3 of the supplementary material.

Organization
The remainder of our article is structured as follows. In Section 2, we introduce the SDA filter for FDR control and discuss the effects of dependence on multiple testing. We develop finite-sample and asymptotic theories for FDR control in Section 3. Methodology and theory for the unknown dependence case are discussed in Section 4. Simulation and real data analyses are presented in Sections 5 and 6, respectively. The extensions, proofs of the theories, and additional comparisons are provided in the supplementary material.
Notations. For M ⊂ {1, . . ., p}, let X_M be the design matrix with columns (X_j : j ∈ M), where X_j = (X_{1j}, . . ., X_{pj})ᵀ denotes the jth column of X.

The SDA Filter for FDR Control
We start with the assumption that the covariance matrix Σ is known and then move to the case with unknown Σ in Section 4. Our discussion is mainly based on regression model (2); an equivalent description of the methodology via Model (1) follows similarly. We first outline in Section 2.1 the steps for constructing the ranking statistics, then provide intuitive explanations of how the SDA filter works in Sections 2.2 and 2.3. The detailed SDA algorithm is provided in Appendix A4.

Construction of Ranking Statistics and the Symmetry Property
SDA first splits the data into two independent parts D_1 and D_2, which are used, respectively, to construct statistics T_{1j} and T_{2j}.
The information in the two parts is then combined to form the ranking statistic W_j = T_{1j}T_{2j}. A wide class of pairs may be constructed from the sample. This section presents a specific pair (T_{1j}, T_{2j}), which is used in all numerical studies. Examples of other possible pairs are presented in Appendix A.2 of the supplementary material.
Remark 1. Following Wasserman and Roeder (2009), we suggest using n_1 = ⌊2n/3⌋, which provides stable performance across a wide range of settings. To obtain an asymptotically unbiased estimator in the next step, it is required that S contain all the signals with high probability. In practice, this can be achieved by deliberately choosing an overfitted model that includes most true signals and many false positives; see also Barber and Candès (2019) and Remark 2 in Section 3.2.
Next, we use D_2 to obtain the least-squares estimates (LSEs).
To aggregate information across both D_1 and D_2, let W_j = T_{1j}T_{2j}, where

T_{2j} = n_2^{1/2} μ̂_{2j} / σ_{S,j},

and the σ²_{S,j}'s are the diagonal elements of (X_Sᵀ X_S)^{-1}. A multiple testing procedure consists of two steps: ranking and thresholding. Next we show that the W_j's play key roles in both steps. Intuitively, the positive W_j's can be used for ranking because a large and positive W_j indicates strong evidence against the null. Meanwhile, the negative W_j's, which usually correspond to null cases, can be used for thresholding. The key idea is to exploit the following asymptotic symmetry property:

Pr(W_j ≥ t) = {1 + o(1)} Pr(W_j ≤ -t),  j ∈ A^c,    (9)

which holds if Pr(A ⊆ S) → 1. Next, we explain how the SDA filter works.
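The construction above can be sketched end to end. In the snippet below (illustrative dimensions; a plain ISTA solver stands in for a packaged LASSO, and the value of λ is an arbitrary choice that deliberately overfits, both assumptions of this sketch), D_1 yields the LASSO estimate and the screened set S, and D_2 yields the least-squares-based T_{2j}.

```python
import numpy as np

def ista_lasso(X, y, lam, iters=500):
    """Minimize 0.5*||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = b - step * (X.T @ (X @ b - y))      # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - lam * step, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(3)
n, p, rho = 90, 200, 0.8
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
Omega = np.linalg.inv(Sigma)
w, V = np.linalg.eigh(Omega)
X = V @ np.diag(np.sqrt(w)) @ V.T                  # design X = Omega^{1/2}

mu = np.zeros(p)
mu[rng.choice(p, 20, replace=False)] = 0.5
xi = rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T + mu

n1 = 2 * n // 3
Y1 = X @ xi[:n1].mean(axis=0)                      # pseudo response from D1
Y2 = X @ xi[n1:].mean(axis=0)                      # pseudo response from D2

mu1 = ista_lasso(X, Y1, lam=0.05)                  # deliberately overfitting lambda
S = np.flatnonzero(mu1)                            # screened set
T1 = mu1[S]

Q = np.linalg.inv(X[:, S].T @ X[:, S])             # (X_S' X_S)^{-1}
mu2 = Q @ X[:, S].T @ Y2                           # LSE on the screened set
T2 = np.sqrt(n - n1) * mu2 / np.sqrt(np.diag(Q))   # standardized as in the text

W = np.zeros(p)
W[S] = T1 * T2                                     # ranking statistics
```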

FDR Thresholding
The asymptotic symmetry property (9) motivates us to choose the following data-driven threshold to control the FDR at level α:

L = inf{ t > 0 : #{j : W_j ≤ -t} / (#{j : W_j ≥ t} ∨ 1) ≤ α }.    (10)

Our decision rule is given by δ = (δ_j : 1 ≤ j ≤ p) with δ_j = I(W_j ≥ L). To see why (10) makes sense, note that #{j : W_j ≤ -t} is an overestimate of #{j : W_j ≤ -t, j ∈ A^c}, which is asymptotically equal to #{j : W_j ≥ t, j ∈ A^c}, the number of false positives, due to the asymptotic symmetry property (9). It follows that the fraction in Equation (10) provides an overestimate of the FDP, which (desirably) leads to conservative FDR control. Moreover, the empirical FDR level is typically very close to α because the gap between the fraction in Equation (10) and the actual FDP is usually small in practice: for a suitably chosen L, most cases in {j : W_j ≤ -L} should come from the null. The operation of the SDA filter is visualized in Figure 2. We generate {ξ_i : i = 1, . . ., 90} from an MVN distribution with μ ∈ ℝ^p, p = 1000, and Σ = (0.8^{|i−j|})_{1≤i,j≤p}. We randomly set 10% of the coordinates of μ to ±0.2 and the rest to 0. Panel (a) presents the scatterplot of 288 nonzero W_j's, with red triangles and black dots denoting true signals and nulls, respectively. Panel (d) plots the normalized knockoff statistics constructed according to (Barber and Candès 2015, eq. (1.7)). We can see that both SDA and knockoff approximately fulfill the symmetry property for the null W_j's (black dots). However, SDA achieves a more clear-cut separation of signals and noise. As explained in Appendix B.1 of the supplementary material, the symmetrized knockoff statistics suffer from high correlations. By contrast, the construction of the SDA statistic, which does not depend on pairwise contrasts, eliminates the need for creating fake variables. We can see from Panel (a) that the SDA ranking places most of the true signals above 0, and many true signals stay well above the majority of the null cases. However, in Panel (d), which illustrates the knockoff ranking, the true signals
are not well separated from the nulls, and many true signals even fall below 0. Since the threshold must be positive, signals with negative W j 's will be missed, which leads to substantial power loss.
The impacts on the FDP processes are shown in the second column of Figure 2. We can see that the estimated FDP process of SDA approximates the true FDP process FDP(t) fairly accurately. However, the knockoff method yields overly conservative estimates of the true FDPs, which leads to overly conservative thresholds (marked by blue vertical lines). The last column of Figure 2 compares the TDP processes of SDA and knockoff. At the FDR level 0.2, the TDP of SDA is 0.87 (threshold L = 0.62), which is much higher than that of knockoff (TDP = 0.03 with threshold L = 6.80). The low TDP of knockoff is due to its decreased power in distinguishing signal from noise [Panel (d)] and an overly conservative threshold [Panel (e)].

Power and Effects of Dependence
The impact of dependence on FDR analysis has been extensively studied, but most discussions have focused on the validity issue. This section first discusses the impact of dependence on power, and then provides insights into the information loss of conventional data-splitting methods.
Under the SDA framework, many possible pairs (T_{1j}, T_{2j}) may be constructed. For example, one may take the pair of standardized sample averages

T⁰_{kj} = n_k^{1/2} ξ̄_{kj},  k = 1, 2.    (11)

It is easy to show that the W_j constructed via this pair also fulfills the asymptotic symmetry property. However, the pair in Equation (11), which falls into the class of marginal testing techniques, can be highly inefficient since it completely ignores the dependence structure. Next we provide intuition on how the dependence structure is incorporated into the SDA filter to improve the efficiency of existing methods. First, T_{1j} is superior to T⁰_{1j} by leveraging joint modeling techniques. The merit of joint modeling has been carefully illustrated by Barber and Candès (2015) through extensive simulations. Candès et al. (2018) further argued that conditional testing techniques are in general more powerful in recovering sparse signals than marginal testing methods. T_{1j} is constructed based on the LASSO (a conditional inference technique) and hence serves as a more suitable building block than T⁰_{1j} for constructing W_j. Second, T_{2j} enjoys a higher SNR than T⁰_{2j} by exploiting the dependence between ξ_S and ξ_{S^c}. Clearly, the expectations of both μ̂_{2S} and ξ̄_{2S} are μ_S. The covariance of μ̂_{2S} is n_2^{-1}Q, where Q = (X_Sᵀ X_S)^{-1}. By the inversion formula for a block matrix, we have X_Sᵀ X_S = Ω_{S,S}, so that Q = Σ_{S,S} − Σ_{S,S^c} Σ_{S^c,S^c}^{-1} Σ_{S^c,S}, which is the conditional covariance of ξ_S given ξ_{S^c}. Let s_{jl} be the (j, l)th element of Σ. Then n_2 Var(ξ̄_{2j}) = s_{jj}. However, n_2 Var(μ̂_{2j}) = s_{jj} − e_jᵀ Σ_{S,S^c} Σ_{S^c,S^c}^{-1} Σ_{S^c,S} e_j < s_{jj}. This provides the key insight into the effect of data screening. In regression terms, strong correlations indicate that a large fraction of the variability of the variables in S can be explained by the variables in S^c. The higher the correlations, the greater the reduction in uncertainty and hence the higher the SNR. This explains why SDA becomes more powerful as correlations increase (Figure 1).
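The variance reduction from conditioning can be verified directly: for an AR(1) covariance with unit diagonal, the conditional variance s_{jj} − e_jᵀ Σ_{S,S^c} Σ_{S^c,S^c}^{-1} Σ_{S^c,S} e_j is strictly below s_{jj} = 1 and shrinks as ρ grows (a small numerical check with illustrative indices).

```python
import numpy as np

def cond_var(Sigma, S):
    """Diagonal of Sigma_{S,S} - Sigma_{S,Sc} Sigma_{Sc,Sc}^{-1} Sigma_{Sc,S}:
    the conditional variances of xi_S given xi_{Sc}."""
    Sc = np.setdiff1d(np.arange(Sigma.shape[0]), S)
    Q = Sigma[np.ix_(S, S)] - Sigma[np.ix_(S, Sc)] @ np.linalg.solve(
        Sigma[np.ix_(Sc, Sc)], Sigma[np.ix_(Sc, S)])
    return np.diag(Q)

p = 100
idx = np.arange(p)
S = np.array([10, 40, 70])               # a small, well-separated screened set
vars_by_rho = []
for rho in (0.3, 0.6, 0.9):
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    vars_by_rho.append(cond_var(Sigma, S))
```

For interior coordinates this conditional variance is approximately (1 − ρ²)/(1 + ρ²), so it decreases monotonically in |ρ|.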
Finally, both knockoff and SDA achieve the symmetry property at the expense of possibly reduced SNR: the former increases the dimension of the design matrix by adding noise variables, while the latter involves sample splitting. In contrast with the sample-splitting method in Wasserman and Roeder (2009), where D_1 is discarded after model selection, SDA provides a new aggregation strategy: T_{1j} is kept and combined with T_{2j} to form the ranking statistic W_j. This substantially reduces the information loss of conventional sample-splitting methods.

Effects of Data Screening
The data screening step is beneficial whenever the tests are correlated. Intuitively, the smaller the set S, the larger the amount of uncertainty that can be explained by the variables in S^c. Hence, a more effective dimension reduction implies an increased SNR and higher power. Meanwhile, our theory on FDR control requires that A ⊆ S hold with high probability, indicating that an overly aggressive data screening step can hurt the FDR procedure. In practice, we recommend deliberately choosing an overfitted model to ensure validity in FDR control; this slightly compromises the power. To illustrate the tradeoff, Figure 3 presents a numerical study of how the size of S affects both the FDR and the power. We choose n = 90, p = 500, and nonzero effects ±0.2; the proportion of nonnulls is 10% and α = 0.2. We investigate the performance of SDA over the three distributions and three covariance structures described in Section 5. Here, k denotes the excess count of |S| relative to the choice with λ selected by the AIC criterion (k can be negative). We can see that the actual FDR of SDA may deviate from the nominal level when S is too small. By contrast, a large S (overfitted model) has little impact on the FDR level but affects the power negatively.

Theoretical Properties of the SDA Filter
This section first establishes the finite-sample theory for FDR bounds (Section 3.1), and then develops asymptotic theories for FDR and FDP control.

Finite-Sample Theory on FDR Control
Our finite-sample theory, which requires no model assumptions, establishes an upper bound for the FDR under general dependence.We emphasize that the upper bound holds for both known and estimated covariance matrices.
Our theory is developed for a modified SDA filter (SDA+), which chooses a slightly more conservative threshold; the difference between SDA+ and SDA is negligible when the number of rejections is large. Recall S = {j : μ̂_{1j} ≠ 0}. Denote W_S = (W_j : j ∈ S) and W_{−j} = W_S \ {W_j}. The key quantity that controls the upper bound is a measure Δ_j of the extent to which the "flip-sign" property of W_j is violated. Our finite-sample theory for FDR control is given by Theorem 1.
Theorem 1. For any α ∈ (0, 1), the FDR of the SDA+ method satisfies an upper bound, given in Equation (13), expressed in terms of the Δ_j's.

Our theorem is closely connected to (Barber, Candès, and Samworth 2020, Theorem 1). Both theorems assess how deviations from the "idealized situation" affect the actual FDR level. However, the interpretations are very different. In model-X knockoffs the deviation (from the assumption of a known X matrix) comes from the errors in estimating the X matrix, whereas in SDA the deviation (from the perfect symmetry property) comes from possible violations of the symmetry assumption and the sure screening property. Our theorem shows that a tight control of the Δ_j's leads to effective FDR control. Next we carefully interpret the bound and present several important settings in which the upper bound in Equation (13) exactly achieves, or is very close to, the nominal level α.
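To make the SDA+/SDA distinction concrete, the sketch below compares the two thresholds on a toy W vector; the "+1" added to the numerator of the estimated FDP is our assumed, knockoff+-style form of the SDA+ correction, not the article's exact display.

```python
import numpy as np

def sda_threshold(W, alpha, plus=True):
    """Data-driven threshold; plus=True adds a knockoff+-style '+1' to the
    numerator (an assumed form of the SDA+ correction for this sketch)."""
    add = 1.0 if plus else 0.0
    for t in np.sort(np.abs(W[W != 0])):
        fdp = (add + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp <= alpha:
            return t
    return np.inf

W = np.array([5.0, 4.0, 3.5, 3.0, 2.5, -0.5, -1.0, 0.7, -0.2, 0.9])
t_plus = sda_threshold(W, alpha=0.2, plus=True)    # 2.5 on this toy vector
t_raw = sda_threshold(W, alpha=0.2, plus=False)    # 0.7 on this toy vector
```

Because the corrected numerator is never smaller, the SDA+ threshold is always at least as large, i.e., SDA+ is weakly more conservative.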
Consider the ideal case where (a) the error distribution is symmetric, (b) S contains all signals, and (c) the W_j's are independent of each other for j ∈ S. We can show that Δ_j = 0 for all j ∈ A^c ∩ S, and hence the upper bound achieves the nominal level α exactly. Even when the error distribution is asymmetric, we expect the Δ_j's to become vanishingly small for moderate sample sizes n, owing to the convergence of μ̂_{2j} to a symmetric distribution (Lemma S.1). Hence, the FDR bound would be close to α.
Next we turn to the dependent case. For simplicity, assume that the ξ_i's come from an MVN distribution. Let Q = (X_Sᵀ X_S)^{-1} := (Q_{jk})_{q_n × q_n} with q_n = |S|. The matrix Q = Σ_{S,S} − Σ_{S,S^c} Σ_{S^c,S^c}^{-1} Σ_{S^c,S} is the conditional covariance matrix of ξ_S given ξ_{S^c}. The following lemma shows that the magnitude of Δ_j is controlled by the matrix Q.
Lemma 1. (Flip-sign property under Gaussian dependence) Assume that the ξ_i's obey an MVN distribution. Denote by Q_{−j,j} the jth column of Q excluding Q_{jj}. If Q_{−j,j} = 0, then Δ_j = 0.
To provide some intuition on how close the bound is to α in practice, consider the autoregressive (AR) structure Σ = (σ_{jl}) = (ρ^{|j−l|}). Since the precision matrix of the AR structure is tridiagonal, only consecutive coordinates are correlated with each other conditional on the remaining variables. Suppose sparse signals are randomly distributed over the p coordinates and the dimension reduction via S is performed effectively, for example, q_n ≪ p. Let E be the event that, for every null variable j ∈ S ∩ A^c, the remaining variables in S are conditionally uncorrelated with it. We expect E to occur with high probability since, for large tridiagonal precision matrices, there is a small chance that two consecutive coordinates are selected into a small set S simultaneously. On the event E, we have Q_{−j,j} = 0, and it follows from Lemma 1 that Δ_j = 0. Consequently, the FDR bound converges to α when Pr(E) → 1. In the same vein, we expect the bound to be close to α for the class of power-decay covariance matrices and the class of sparse precision matrices.
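Both facts used here are easy to check numerically: the AR(1) precision matrix is tridiagonal, and for a well-separated screened set S the conditional covariance Q of ξ_S given ξ_{S^c} is diagonal, so Q_{−j,j} = 0 as in Lemma 1 (a small sketch with illustrative sizes).

```python
import numpy as np

p, rho = 12, 0.7
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
Omega = np.linalg.inv(Sigma)

# AR(1): entries of Omega more than one step off the diagonal vanish.
far_off = np.abs(np.subtract.outer(idx, idx)) > 1
tridiagonal = np.allclose(Omega[far_off], 0.0, atol=1e-8)

# Non-adjacent screened set: the conditional covariance Q = (Omega_{S,S})^{-1}
# is then diagonal, i.e., Q_{-j,j} = 0 for every j in S.
S = np.array([2, 5, 9])
Q = np.linalg.inv(Omega[np.ix_(S, S)])
off_diag_max = np.max(np.abs(Q - np.diag(np.diag(Q))))
```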

Asymptotic Theory on FDP Control
Under the asymptotic paradigm we can prove that the FDR can be controlled at α + o(1) under suitable conditions (asymptotic validity). Define A(S) := (X_Sᵀ X_S)^{-1} X_Sᵀ = (a_{jk})_{q_n × p}. Assume that q_n is uniformly bounded above by some nonrandom sequence q̄_n that will be specified later. We start with some regularity conditions.
Remark 2. Condition 1 ensures that μ̂_{2j} is unbiased for j ∈ S. This pre-selection property, which has been commonly used (Wasserman and Roeder 2009; Meinshausen, Meier, and Bühlmann 2009; Barber and Candès 2019), can be fulfilled with a suitably chosen λ under the "zonal" assumption (Bühlmann and Mandozzi 2014). In practice, we recommend applying the AIC to deliberately choose an overfitted model. The sure screening property may not hold exactly, but missing small μ_j's is inconsequential. For example, if we ignore such "unimportant" signals, then Condition 1 is fulfilled by the LASSO for large signals exceeding the rate d_n √(log p / n). Asymptotically unbiased estimators are usually sufficient for effective FDR control, as corroborated by our empirical results in Section 5. The next two conditions are standard: Condition 3 imposes constraints on the diverging rates of q̄_n and p, both of which depend on the existence of certain moments; Condition 4 requires that the eigenvalues of the design matrix be bounded below and above by two constants.
Condition 3 (Moments). There exist two positive diverging sequences K_{n1} and K_{n2} bounding the relevant moments of the errors.

Condition 4 (Covariance). There exist positive constants κ and κ̄ such that, with probability one, κ ≤ λ_min(X_S^⊤ X_S / n) ≤ λ_max(X_S^⊤ X_S / n) ≤ κ̄.

Remark 4. Condition 5 implies that the number of identifiable effect sizes should not be too small as p → ∞. This seems to be a necessary condition for FDP control. For example, Liu and Shao (2014) showed that if a multiple testing method controls the FDP with high probability, then its number of true alternatives must diverge as the number of tests goes to infinity.
Remark 5. Condition 6 allows ξ_j to be correlated with all the others but requires that the number of large correlations not diverge too fast. The condition appears similar to the regularity conditions in Fan, Han, and Gu (2012) and Xia, Cai, and Sun (2020), but is in fact much weaker. For instance, the correlation between μ̂_{2j_1} and μ̂_{2j_2} is just the partial correlation of ξ_{j_1} and ξ_{j_2} given the remaining variables. In particular, large correlations are highly unlikely after data screening for a wide range of popular models, such as the class of power-decay covariance matrices and the class of moderately sparse precision matrices. This reveals an advantage of SDA, which effectively de-correlates strong dependence via data screening and conditioning.
Our main theoretical result on the asymptotic validity of the SDA method for both FDP and FDR control is given by the next theorem.

Unknown Dependence
Now we turn to the case where the covariance structure is unknown. When Σ is unknown, the SDA filter operates in the same way except that we substitute an estimate in place of Σ.
We propose to estimate the precision matrix Ω using only the first part of the sample, D_1; denote by Ω̂ the corresponding estimator. Then the SDA filter can be readily constructed via the steps in Sections 2.1 and 2.2 with X = Ω̂^{1/2}. Various high-dimensional precision matrix estimation methods, such as the graphical LASSO (Friedman, Hastie, and Tibshirani 2008) and CLIME (Cai, Liu, and Luo 2011), can be used to obtain Ω̂. An attractive feature of the SDA filter under unknown dependence is its robustness for FDR control: we show next that the SDA filter remains valid for FDR control as long as Ω̂ is constructed based only on D_1. We first state a modified version of Condition 6, which uses Q̂ in place of Q.
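The whitening step with an estimated precision matrix can be sketched as follows. This is an illustrative reconstruction, not the package's internal code: a ridge-regularized inverse of the D_1 sample covariance stands in for graphical LASSO or CLIME, and the symmetric square root Ω̂^{1/2} is formed by eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, p = 60, 20
D1 = rng.standard_normal((n1, p))   # first part of the sample

# Estimate the precision matrix from D1 only (ridge inverse as a
# stand-in for graphical LASSO / CLIME).
S = np.cov(D1, rowvar=False)
Omega_hat = np.linalg.inv(S + 0.1 * np.eye(p))

# Symmetric square root Omega_hat^{1/2} via eigendecomposition;
# X then plays the role of the design matrix in the LSE step.
vals, vecs = np.linalg.eigh(Omega_hat)
X = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

print(np.allclose(X @ X, Omega_hat))  # True: X is a valid square root
```

Because Ω̂ depends only on D_1, the independence between the two halves of the sample is preserved, which is what drives the robustness result below.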
The following theorem, parallel to Theorem 2, establishes the asymptotic validity of the SDA filter with an estimated covariance structure.
Remark 6. Our FDR theory does not require an accurate estimator of Ω: the accuracy of the estimator affects the power but not the validity. Consider a working covariance structure that "estimates" Ω by the identity matrix. Then it can be shown that the FDP is still controlled. This is more attractive than the FDR theories in, for example, Fan and Han (2017) and Li and Zhong (2017), which depend critically on the accuracy of the covariance estimators.
The key step in the proof is to verify the validity of Equation (9). This amounts to addressing two major issues: the asymptotic symmetry of W_j under the null and the uniform convergence of q_{0n}^{−1} ∑_{j∈S∩A^c} I(W_j ≥ t). Because Ω̂ is obtained from D_1, μ̂_{2j} is unbiased conditional on D_1, and thus ∑_{j∈S∩A^c} Pr(W_j ≥ t) is approximately equal to ∑_{j∈S∩A^c} Pr(W_j ≤ −t), establishing the symmetry property. The dependence assumption on Q̂ ensures the convergence of q_{0n}^{−1} ∑_{j∈S∩A^c} I(W_j ≥ t). While sample splitting ensures the independence between μ̂_1 and μ̂_2 and hence the robustness of the SDA filter, a more accurate estimate of Ω yields better power, as one would expect. Previously we proposed to estimate Ω using D_1 and to construct the LSE (7) using D_2. In practice, one may instead use D_1 to construct T_{1j} and obtain the LSE via the full-sample estimator, denoted Ω̂_F, computed from {D_1, D_2}. The caveat is that, although X = Ω̂_F^{1/2} can potentially increase the power, stronger conditions are needed to guarantee the asymptotic validity of this "full-sample" SDA method. As pointed out by an insightful referee, the asymptotic theory requires that Ω̂_F converge to Ω at a very fast rate, which can be impractical in applications. We therefore recommend the robust SDA filter that estimates Ω using only D_1. Next, we specify the requirements on the estimation accuracy of Ω̂_F.
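The symmetry property is what makes the data-driven threshold work: negative W_j's serve as an internal estimate of the number of false positives among the positive W_j's. A minimal sketch of such a symmetrized threshold rule is given below (Python, illustrative; the exact rule in the paper, including any finite-sample correction term, may differ).

```python
import numpy as np

def sda_threshold(W, alpha):
    """Smallest t > 0 at which the estimated FDP
    #{j : W_j <= -t} / max(#{j : W_j >= t}, 1) falls below alpha."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = np.sum(W <= -t) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return t
    return np.inf  # no rejections possible at level alpha

rng = np.random.default_rng(1)
# Symmetric null statistics plus a handful of strong positive signals.
W = np.concatenate([rng.standard_normal(200), rng.normal(4.0, 1.0, 20)])
t = sda_threshold(W, alpha=0.2)
rejected = np.flatnonzero(W >= t)
print(len(rejected) > 0)
```

By construction, whenever the returned t is finite, the estimated FDP at that threshold is at most α.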
Condition 7. The estimated precision matrix Ω̂_F satisfies ‖Ω̂_F − Ω‖_∞ = O_p(a_{np}) with a_{np} → 0.
The following theorem shows that the FDR and FDP can be controlled asymptotically when Ω̂_F is sufficiently close to Ω. Let s_n = ‖Ω‖_∞.
Theorem 4. Consider a modified SDA procedure in which D_1 is used to construct T_{1j} and the full-sample estimator Ω̂_F is used to construct the LSE (7). If condition (15) holds for a small γ > 0, then the results in Theorem 2 hold for the procedure with Ω̂_F.
This theorem, a complementary result to Theorem 3, provides conditions that warrant the implementation of a more efficient version of SDA. It is worth examining condition (15) further; it seems unavoidable because T_{1j} and T_{2j} are no longer independent when the whole sample is used to estimate Ω. To fix ideas, suppose that Ω = (ω_{ij})_{p×p} is k_n-sparse, that is, max_{1≤i≤p} ∑_{j≠i} I(ω_{ij} ≠ 0) ≤ k_n, and that all its elements ω_{ij} are bounded. First, standard arguments in, for example, Yuan (2010) and Liu et al. (2012) indicate that a_{np} = O_p(k_n √(log p / n)). Accordingly, with c_{np} = d_n √(log p / n), Equation (15) is equivalent to the condition d_n k_n s_n q̄_n / n^{1/2} → 0 when p grows at a polynomial rate in n. This condition imposes restrictions on the diverging rates of d_n, k_n, s_n, and q̄_n. If d_n, k_n, and s_n are all bounded, then we must require q̄_n = o(n^{1/2}). Alternatively, if we only assume that k_n and s_n are bounded, then a sufficient condition for (15) is q̄_n = o(n^{1/4}) (since d_n ≤ q̄_n). These rates are consistent with those in the literature; see, for example, Portnoy (1984) and Fan and Peng (2004).
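The last sufficiency claim follows from a one-line bound: assuming k_n and s_n are bounded by a constant C and using d_n ≤ q̄_n,

```latex
\frac{d_n k_n s_n \bar{q}_n}{n^{1/2}}
  \;\le\; C\,\frac{\bar{q}_n^{2}}{n^{1/2}} \;\to\; 0
  \quad \text{whenever } \bar{q}_n = o(n^{1/4}).
```
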

Simulation
This section first introduces the implementation in the R package sdafilter (Section 5.1), followed by the simulation designs (Section 5.2) and comparison results (Section 5.3). Additional results for comparisons with unknown covariance matrices and other correlation structures are provided in the supplementary material.

Implementation Details
We describe the implementation details of the R package sdafilter. For sample splitting, we follow the strategy in Wasserman and Roeder (2009), which uses n_1 = [2n/3] observations for selecting variables and the remaining n_2 = n − n_1 for obtaining the LSEs. The AIC is used to select the tuning parameter in LASSO. If the number of variables selected by AIC exceeds [p/3], then only the first [p/3] variables are retained. For the case with unknown Σ, our default option applies the R package glasso to D_1, with the tuning parameter set by the R package huge. If prior knowledge suggests a nonsparse Σ, then the "nonsparse" option in our package can be used; this option first estimates the covariance matrix using the R package POET and then takes its inverse as the input. The stable option implements the R-SDA method described in Appendix A.1 of the supplementary material. The kwd option enables the use of different estimators to summarize the information in the first part of the data, including the de-biased LASSO, the innovated transformation of the sample means (Hall and Jin 2010), and factor-adjusted sample means (Fan and Han 2017).
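The splitting rule itself is simple to express. A minimal sketch (in Python for illustration; the package itself is in R), using n_1 = [2n/3] as above:

```python
import numpy as np

def split_sample(X, seed=0):
    """Randomly split the n rows into D1 (variable screening) and
    D2 (LSE construction), with n1 = [2n/3] as in Wasserman and
    Roeder (2009)."""
    n = X.shape[0]
    n1 = (2 * n) // 3                      # integer part of 2n/3
    perm = np.random.default_rng(seed).permutation(n)
    return X[perm[:n1]], X[perm[n1:]]

X = np.arange(90 * 4, dtype=float).reshape(90, 4)
D1, D2 = split_sample(X)
print(D1.shape[0], D2.shape[0])  # 60 30
```

With n = 90 this reproduces the split n_1 = 60, n_2 = 30 used in the simulations.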

Simulation Settings
We consider three types of covariance structures: (I) Autoregressive (AR) structure: Σ = (ρ^{|j−i|}). (II) Compound symmetry structure: all off-diagonal elements of Σ equal ρ, which can be regarded as a factor model with one principal component. (III) Sparse covariance structure: Σ = B + I_p, where B is a p × p matrix and each row of B has only one nonzero entry, sampled from the uniform distribution on [1, 2].
The diagonal elements are normalized to unity in all three settings. To investigate the robustness of different methods, we consider three error distributions: (i) multivariate normal (MVN); (ii) t-distribution with df = 3; and (iii) exponential distribution with scale parameter 2. The observations are then standardized to have mean zero and standard deviation one; the correlation structure remains nearly unchanged after this transformation. The following six methods are compared: (a) the Benjamini-Hochberg (BH) procedure with p-values transformed from the t statistics; (b) the principal factor approximation (PFA) procedure proposed by Fan, Han, and Gu (2012) for known covariance and Fan and Han (2017) for estimated covariance; two versions using the unadjusted and adjusted p-values are implemented via the R package pfa, denoted PFA_U and PFA_A, respectively, and we report only the results for PFA_A as it generally outperforms PFA_U; (c) the sample-splitting method (SS; Wasserman and Roeder 2009), which conducts data screening using LASSO and then applies BH to the p-values calculated from μ̂_2; (d) the knockoff method (Knockoff; Barber and Candès 2015), implemented using the function "create.fixed" in the R package knockoff; (e) the DATE method (Li and Zhong 2017), which we implemented ourselves; and (f) the stability-refined SDA filter (R-SDA), implemented using our package sdafilter with the "stable" option. To keep the plots easy to read, we present only R-SDA, which we recommend in practice; SDA performs similarly to R-SDA.
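The three covariance structures are straightforward to generate; the sketch below (Python, illustrative) applies the unit-diagonal normalization at the end. Where the text is ambiguous for Structure (III), the symmetrization and the diagonal shift used to keep the matrix positive definite are our own assumptions, as is the placeholder name B.

```python
import numpy as np

def make_cov(structure, p, rho=0.8, seed=0):
    """Generate Sigma for structures (I)-(III), normalized to unit diagonal."""
    idx = np.arange(p)
    if structure == "AR":          # (I) Sigma = (rho^{|j-i|})
        S = rho ** np.abs(idx[:, None] - idx[None, :])
    elif structure == "CS":        # (II) compound symmetry
        S = np.full((p, p), rho)
        np.fill_diagonal(S, 1.0)
    elif structure == "sparse":    # (III) one nonzero in [1, 2] per row of B
        rng = np.random.default_rng(seed)
        B = np.zeros((p, p))
        for j in idx:
            k = rng.choice(idx[idx != j])
            B[j, k] = rng.uniform(1.0, 2.0)
        Bs = (B + B.T) / 2                         # symmetrize (assumption)
        shift = np.abs(Bs).sum(axis=1).max() + 1.0  # diagonal dominance => PD
        S = Bs + shift * np.eye(p)
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)      # normalize diagonal elements to one

S = make_cov("AR", p=100)
print(np.allclose(np.diag(S), 1.0))
```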
Let n be the sample size, p the number of tests, and π_1 the proportion of signals. For each combination (n, p, π_1), we generate data and apply the six methods at FDR level α. The FDR and average power (AP) are calculated by averaging the corresponding proportions over 500 replications.
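For reference, the two per-replication quantities that are averaged into the FDR and AP can be computed as follows (a hypothetical helper, not part of sdafilter):

```python
import numpy as np

def fdp_and_power(rejected, is_signal):
    """False discovery proportion and true positive proportion for one
    replication; FDR and AP are their averages over replications."""
    rejected = np.asarray(rejected, dtype=bool)
    is_signal = np.asarray(is_signal, dtype=bool)
    fdp = (rejected & ~is_signal).sum() / max(rejected.sum(), 1)
    power = (rejected & is_signal).sum() / max(is_signal.sum(), 1)
    return fdp, power

# 10 signals among 100 tests; 12 rejections of which 2 are false.
signal = np.array([True] * 10 + [False] * 90)
reject = np.array([True] * 12 + [False] * 88)
fdp, ap = fdp_and_power(reject, signal)
print(fdp, ap)  # 0.1666... 1.0
```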

Comparison Results for Known Covariance Structures
We fix (n, p, π_1, α) = (90, 500, 0.1, 0.2) and generate μ_j from the random mixture model μ_j ∼ (1 − π_1) δ_0 + π_1 g(·), where δ_0 is the Dirac delta function (a point mass at 0) and g(·) is the density of the non-null distribution, specified as the uniform distribution on [μ − 0.1, μ + 0.1]. The signs of the μ_j's are then randomly flipped. To assess the effect of signal strength, we vary μ from 0.1 to 0.3 and apply the six methods to the simulated data. The results for Structures (I) and (III) are summarized in Figure 4, where in the top row we fix ρ = 0.8. The results for Structure (II) with ρ = 0.8 are shown in Figure S5 of the supplementary material. The following observations can be made. (a) For the Gaussian error case, BH, knockoff, R-SDA, and SS control the FDR at the nominal level, whereas the FDR levels of PFA_A and DATE are inflated when signals are weak. (b) For the non-Gaussian error cases, BH, DATE, SS, and PFA_A fail to control the FDR under various settings, and their FDR levels can be much higher than the nominal level. Knockoff controls the FDR in all settings but can be very conservative. R-SDA has the most accurate and stable FDR levels among all methods. (c) R-SDA vs. SS and BH. As expected, SS and BH control the FDR in the Gaussian case but are not robust to non-Gaussian errors. R-SDA has much higher power than both methods (even when its FDR levels are much lower). It is interesting to note that although SS uses only the second part of the data, its power can be much higher than that of BH when the correlation structure is highly informative [normal case under Structure (I), top left]. This is because the data screening step can significantly increase the SNR (Section 2.3). (d) R-SDA vs.
Knockoff. R-SDA and knockoff, both of which are distribution-free, are the only methods that control the FDR at the nominal level across all scenarios. The knockoff method is overly conservative in Setting (I) due to the high correlation; the conservativeness becomes less severe under Setting (III). By contrast, R-SDA controls the FDR more accurately near the target level and has significantly higher power than knockoff. (e) R-SDA vs. DATE and PFA_A. In some scenarios, DATE and PFA_A can outperform SDA in power. However, the higher power may be attributable to their severely inflated FDRs. The numerical results reveal the promise of extending the SDA framework by employing other methods, such as factor-adjusted z-scores or innovated transformations, as alternatives to the LASSO estimates for constructing T_{1j}.
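The random mixture model used in these simulations can be sketched directly in code (Python, illustrative; the function name and seed handling are our own):

```python
import numpy as np

def generate_mu(p=500, pi1=0.1, mu=0.2, seed=0):
    """Draw mu_j from (1 - pi1) * delta_0 + pi1 * Uniform[mu-0.1, mu+0.1],
    then flip signs at random, mirroring the simulation design."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    nonnull = rng.random(p) < pi1
    beta[nonnull] = rng.uniform(mu - 0.1, mu + 0.1, nonnull.sum())
    signs = rng.choice([-1.0, 1.0], size=p)
    return beta * signs

beta = generate_mu()
print(round(np.mean(beta != 0), 3))  # close to pi1 = 0.1
```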
Next, we investigate how the six methods are affected by the strength of correlation. For covariance structures (I) and (II), we fix μ = 0.2 under the alternative and vary the magnitude of correlation ρ from independence (ρ = 0) to strong dependence (ρ = 0.9). The results are summarized in Figure 5. In addition to the observations made previously, we note the following. (a) The knockoff method becomes more conservative as the correlations grow. Since the average correlation in Structure (II) is much higher than that in Structure (I), the power of the knockoff method deteriorates faster for Structure (II) as ρ increases. For Structure (II), the FDR of BH also decreases as ρ increases.
(b) In contrast with BH and knockoff, both of which suffer from high correlations, the FDR of R-SDA remains consistently at the nominal level, and its power increases with the correlation; the power grows faster for Structure (II). This corroborates the insight that high correlations can be useful in FDR analysis (Benjamini and Heller 2007; Sun and Cai 2009). (c) In Column 2 of Figure 5, knockoff fails to control the FDR for heavy-tailed distributions when the correlation is low. By contrast, SDA controls the FDR accurately under non-Gaussian errors.

A Real-Data Example
This section illustrates the SDA filter with an analysis of high-density oligonucleotide microarrays. The dataset, which contains 12,625 probe sets from 128 adult patients enrolled in the Italian GIMEMA multi-center clinical trial, has been used in Chiaretti et al. (2005) and Bourgon, Gentleman, and Huber (2010) for identifying genetic factors associated with acute lymphoblastic leukemia (ALL). The ALL dataset is available at http://www.bioconductor.org.
We focus on a subset of 79 patients with B-cell differentiation because existing research reveals that malignant cells in B-lineage ALL are often associated with genetic abnormalities that have significant impacts on the clinical course of the disease. The patients are divided into two groups based on the molecular heterogeneity of B-lineage ALL: 37 with the BCR/ABL mutation and 42 with NEG. We further narrow the focus to 10% of the genes (i.e., p = 1,263) before carrying out the FDR analysis. Specifically, the uncorrelated screening method (Bourgon, Gentleman, and Huber 2010) is used to remove probe sets with small overall sample variances, since they are unlikely to be differentially expressed.
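The variance-based pre-filtering step is simple to express. An illustrative sketch (Python; the helper name and tie handling are our own) that keeps the top 10% of probe sets by overall sample variance:

```python
import numpy as np

def variance_filter(expr, keep_frac=0.10):
    """Keep the top keep_frac fraction of rows (probe sets) ranked by
    overall sample variance, in the spirit of Bourgon, Gentleman, and
    Huber (2010)."""
    variances = expr.var(axis=1, ddof=1)
    k = int(np.ceil(keep_frac * expr.shape[0]))
    keep = np.sort(np.argsort(variances)[::-1][:k])
    return expr[keep], keep

rng = np.random.default_rng(2)
# Toy expression matrix: 100 probe sets x 79 patients, varying scales.
expr = rng.standard_normal((100, 79)) * rng.uniform(0.5, 3.0, (100, 1))
filtered, kept = variance_filter(expr)
print(filtered.shape)  # (10, 79)
```

Because this screening uses only marginal variances, it does not peek at the group labels and so does not bias the subsequent FDR analysis.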
We apply a two-sample version of R-SDA (see Appendix A.3 for details), BH, SS, PFA_A, Knockoff, and DATE at several significance levels to identify differentially expressed genes across the two groups. Table 1 summarizes the number of significant probe sets for each method. In Figure 6(a) and (b), we plot the pairwise correlations of the genes. A significant proportion of the correlations exceed 0.4, and jointly these correlations can exhibit a nonnegligible dependence effect; this explains why the knockoff method is overly conservative. R-SDA is more powerful than SS because it exploits additional information from the second part of the data. BH, PFA_A, and DATE claim more significant genes than R-SDA. However, some caveats are needed regarding the reliability of BH, PFA_A, and DATE, all of which require normality assumptions (and the latter two require accurate estimates of the unknown covariance matrices).
Next, we conduct a preliminary analysis of the normality assumption, which appears to be severely violated in this dataset. From Column 2 of Figure 6, we can see that the skewness scores of many genes exceed the conventional cutoff of ±1. As a comparison, we display in Column 3 of Figure 6 the "ideal" pattern under the normality assumption. The histograms in Column 2 are much wider than those in Column 3, indicating a possibly highly skewed error distribution. One possible explanation for the difference in power is that BH, PFA_A, and DATE may have inflated FDR levels under violations of normality; this has been observed in our simulation studies (e.g., the last column of Figure S3). By contrast, SDA and knockoff are distribution-free methods, which tend to produce more reliable and replicable findings. The lists of the 19 highest-ranked probe sets by the six methods are presented in Table S1 of Appendix E.
Notation: let λ_min(B) and λ_max(B) denote the smallest and largest eigenvalues of a square matrix B. The notation A_n ∼ B_n means that A_n/B_n and B_n/A_n are both bounded in probability as n → ∞; "≲" and "≳" are defined similarly. Let A_n ≈ B_n denote that the two quantities are asymptotically equivalent, in the sense that A_n/B_n → 1 in probability.

Figure 2 .
Figure 2. (a) Scatterplot of the 288 nonzero W_j's from the SDA filter, with red triangles and black dots denoting true signals and nulls, respectively; a vertical space is added in the middle of the plot to better contrast positive and negative W_j's. (b) The corresponding estimate of the FDP curve (against t), along with the true FDP, for the SDA filter. (c) The true power curve (against t) for the SDA filter. (d)-(f) The scatterplot of the p = 1000 W_j's, the corresponding FDP estimate, and the true power for the knockoff method.

Figure 3 .
Figure 3. The effects of data screening. We choose n = 90, p = 500, and μ = ±0.2. The proportion of nonnulls is 10% and α = 0.2. We investigate the performance of SDA over the three distributions and three covariance structures described in Section 5. Here, k denotes the excess count of |S| with λ selected by the AIC criterion (k can be negative).

Figure 4 .
Figure 4. FDR and AP comparison for varying μ in Settings (I) and (III) with known covariance matrix.

Figure 5 .
Figure 5. FDR and AP comparison for varying ρ in Settings (I)-(II) with known covariance matrix.

Figure 6 .
Figure 6. (a) and (b): histograms of the off-diagonal elements of the sample correlation matrix for BCR/ABL and NEG. (c) and (d): histograms of the skewness of the p = 1263 genes for BCR/ABL and NEG. (e) and (f): the ideal patterns of (c) and (d) when the data are normal.

Table 1 .
The number of rejections for six multiple testing procedures at various significance levels.