Least Conservative Critical Boundaries of Multiple Hypothesis Testing in a Range of Correlation Values

Abstract Under suitable assumptions we prove that there does not exist a perfect exact multiple test procedure that would apply simultaneously to any positive correlation coefficient even with a known distribution of test statistics. This nonexistence theorem holds for all simple tests under normal distribution, and holds for all tests under Ferguson’s distribution. Given the nonexistence of a perfect exact test, we provide least conservative tests using three parametric models. The average conservativeness of these tests can be reduced to as low as 1/8 of that of the widely used Simes test assuming normal distribution. Power analysis indicates that the newly proposed tests are useful in practice.


Introduction
Multiple test procedures almost always keep the probability of rejecting a true null hypothesis below the desired significance level. For instance, the bivariate normal Simes (1986) test is a valid test with two uncorrelated or positively correlated outcomes, but is exact only when the correlation coefficient ρ = 0 or ρ = 1. The Simes test is conservative when 0 < ρ < 1 because the bound in the corresponding Simes inequality is not tight (Finner, Roters, and Strassburger 2017). The Simes test can of course be modified to have exact size α if the value of ρ is known. Examples of such modifications include Dunnett and Tamhane (1992) and Cai and Sarkar (2006). Unfortunately, the correlation coefficient ρ is unknown in most cases. In practice, we have only limited information on ρ, such as its sign. Thus, we have to apply a conservative test to control the Type I error rate below α for all possible ρ values, even though the probability of rejecting the null hypothesis falls short of its desired level in almost all situations.
A natural question is whether it is possible to obtain a perfect test that controls the Type I error rate exactly at level α for a wide range of correlation coefficients. In this article we prove that, under suitable assumptions, no such perfect hypothesis test exists. However, we can aim for subperfect tests that are less conservative than the commonly used multiple hypothesis tests in a given range of correlation coefficients.
The present article focuses on bivariate tests, where the number of hypotheses is n = 2, and discusses multivariate tests in the last section. In the next section, we formulate the problem and prove that the perfect bivariate normal test has no simple function solution. In addition, we prove that no perfect test covers all positive dependence cases when the test statistics follow the Ferguson distribution. In Section 3, we discuss three parametric models for the critical boundary curve.

Theoretical Results
Intersection hypothesis testing is the basis of multiple test procedures (Henning and Westfall 2015; Tamhane and Gou 2018; Tamhane, Gou, and Dmitrienko 2020; Zhang and Gou 2021). For example, the Hochberg (1988) procedure that controls the familywise error rate (FWER) and the Benjamini and Hochberg (1995) procedure that controls the false discovery rate (FDR) are both based on the Simes (1986) test, which rejects the intersection hypothesis $\bigcap_{i=1}^{n} H_i$ if and only if $p_{(i)} \le i\alpha/n$ for at least one $i$, where $\{H_i\}_{i=1}^{n}$ is a set of null hypotheses with corresponding p-values $\{p_i\}_{i=1}^{n}$, and the ordered p-values are denoted by $p_{(1)} \le \cdots \le p_{(n)}$. The Simes test is a valid α-level test when the Simes inequality holds: $\Pr\left(\bigcup_{i=1}^{n} \{p_{(i)} \le i\alpha/n\}\right) \le \alpha$. Samuel-Cahn (1996) showed that the Simes inequality holds for two outcomes that follow a bivariate normal distribution with zero or positive correlation coefficient. The upper bound α is not attainable when 0 < ρ < 1. This result was later generalized to multivariate cases by Sarkar and Chang (1997) and Sarkar (1998). It is worth noting that the condition of multivariate total positivity of order two (MTP$_2$) or positive dependence through stochastic ordering (PDS) in Sarkar and Chang (1997) and Sarkar (1998) is much stricter than the condition of positive correlation. A multivariate normally distributed random vector whose correlation matrix has all positive elements may not satisfy this condition. For example, the trivariate normally distributed random vector with correlation matrix $\{\rho_{ij}\}_{i,j=1,2,3}$, where $\rho_{11} = \rho_{22} = \rho_{33} = 1$, $\rho_{12} = \rho_{21} = 0.1$, and $\rho_{23} = \rho_{32} = \rho_{31} = \rho_{13} = 0.6$, does not satisfy the MTP$_2$ condition. In this article, we mainly focus on bivariate distributions, especially the bivariate normal distribution, since the central limit theorem guarantees approximate normality in the large-sample limit. The Simes test has been shown to be conservative under certain positive dependence structures.
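The Simes rejection rule described above can be sketched in a few lines. This is a minimal illustration, not code from the article; the function name `simes_reject` is a hypothetical choice.

```python
def simes_reject(pvalues, alpha=0.05):
    """Return True if the Simes test rejects the intersection hypothesis.

    The rule rejects when at least one ordered p-value p_(i) satisfies
    p_(i) <= i * alpha / n, where n is the number of hypotheses.
    """
    ordered = sorted(pvalues)
    n = len(ordered)
    return any(p <= (i * alpha) / n for i, p in enumerate(ordered, start=1))


if __name__ == "__main__":
    # Two hypotheses at alpha = 5%: thresholds are 0.025 and 0.05.
    print(simes_reject([0.03, 0.04]))  # larger p-value is below 0.05
    print(simes_reject([0.03, 0.20]))  # neither threshold is met
```

For n = 2 the rule reduces to rejecting when the smaller p-value is at most α/2 or the larger p-value is at most α.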
For example, using the bivariate normal Simes test with nominal significance level α = 5%, the true Type I error rate can be as low as 4.1%. If the dependence structure and correlation coefficient could be clearly prespecified, there are various ways to modify the Simes test and make it exact. However, in practice, it is rare to have a complete understanding of the dependence structure between two endpoints in the design stage. It is more likely that people only know two outcomes of interest are positively or negatively correlated. So it is practical to obtain new tests which are less conservative than the Simes test for all possible situations. Furthermore, a perfect test that is always exact is our optimal and ambitious aim. If such a perfect test does not exist and it is impossible to construct a test for all ρ values, then we realize that some compromises have to be made.
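The conservativeness of the bivariate normal Simes test for 0 < ρ < 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's own computation: it assumes one-sided left-tail p-values p = Φ(x), conditions on the first statistic, and integrates with a composite Simpson rule. The names `simpson` and `simes_type1` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3


def simes_type1(rho, alpha=0.05):
    """Type I error rate of the two-hypothesis Simes test for |rho| < 1.

    With a = Phi^{-1}(alpha/2) and b = Phi^{-1}(alpha), the test rejects when
    x <= a, or y <= a, or both x <= b and y <= b. Given X = x, the second
    statistic is N(rho * x, 1 - rho^2).
    """
    a = STD.inv_cdf(alpha / 2)
    b = STD.inv_cdf(alpha)
    s = sqrt(1 - rho * rho)

    def cond(x, cut):
        # P(Y <= cut | X = x)
        return STD.cdf((cut - rho * x) / s)

    middle = simpson(lambda x: STD.pdf(x) * cond(x, b), a, b)  # a < x <= b
    tail = simpson(lambda x: STD.pdf(x) * cond(x, a), b, 8.0)  # x > b
    return alpha / 2 + middle + tail  # alpha / 2 is P(X <= a)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, round(simes_type1(rho), 5))
```

Under independence the three pieces sum to exactly α, which gives a quick sanity check of the integration.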
In what follows, we restrict attention to testing the intersection hypothesis $H_1 \cap H_2$ of two elementary null hypotheses. We define admissible critical boundary curves to rule out the possibility that a less extreme pair of p-values is rejected while a more extreme one is accepted.
An admissible critical boundary curve divides $[0, 1]^2$ into a rejection region at the bottom left and an acceptance region at the top right. The property of admissibility guarantees that if $(p_1, p_2)$ is in the rejection region, then $(p_1', p_2')$ is also in the rejection region for all pairs of p-values that satisfy $p_1' \le p_1$ and $p_2' \le p_2$. A symmetric admissible curve is both admissible and symmetric with respect to $f(u) = u$. Common p-value combination tests include the Simes (1986) test, Fisher's (1925) probability test, and Stouffer et al.'s (1949) Z-score test. The symmetric admissible critical boundary curves of these tests are expressed in terms of $1_A(u)$, the indicator function that takes the value 1 for all $u$ in $A$; $F^{-1}_{\chi^2}(p, k)$, the quantile function of the chi-square distribution with $k$ degrees of freedom; and $\Phi^{-1}$, the inverse cumulative distribution function of the standard normal distribution. These curves are shown in Figure 1, where a large α = 0.20 is used to make the differences among the three tests visible.
The top left part of the critical boundary curve, which satisfies $f(u) \ge u$, is referred to as the left curve, and the bottom right part, which satisfies $f(u) \le u$, is referred to as the bottom curve. The rejection region is divided by the line $u = f(u)$ into the left and bottom regions. The probability that $(p_1, p_2)$ falls in the left rejection region is denoted by $I^l_f$, and the probability that it falls in the bottom region by $I^b_f$. Consequently, the probability of rejection is $I^l_f + I^b_f$. For a symmetric admissible curve, $I^l_f = I^b_f$, and we denote both by $I_f$ for simplicity.
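To make the three combination tests concrete, the sketch below writes each one as a combined p-value for two hypotheses, so that a test rejects when its combined p-value is at most α. These are the standard textbook forms, stated here as an assumption rather than taken from the article's displayed curves; the function names are hypothetical. For Fisher's test with two p-values, the chi-square tail probability with 4 degrees of freedom has the closed form w(1 − ln w) with w = p1·p2.

```python
from math import log, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def simes_pvalue(p1, p2):
    """Simes combined p-value for two hypotheses: min(2 * p_(1), p_(2))."""
    lo, hi = min(p1, p2), max(p1, p2)
    return min(2 * lo, hi)


def fisher_pvalue(p1, p2):
    """Fisher combined p-value: P(chi2_4 >= -2 ln(p1 p2)) = w (1 - ln w)."""
    w = p1 * p2  # requires p1, p2 > 0
    return w * (1 - log(w))


def stouffer_pvalue(p1, p2):
    """Stouffer combined p-value from the sum of the two Z-scores."""
    z = (STD.inv_cdf(p1) + STD.inv_cdf(p2)) / sqrt(2)  # requires 0 < p < 1
    return STD.cdf(z)


if __name__ == "__main__":
    alpha = 0.20  # the large level used for the visual comparison in the text
    for pair in [(0.15, 0.15), (0.09, 0.90)]:
        decisions = {
            "Simes": simes_pvalue(*pair) <= alpha,
            "Fisher": fisher_pvalue(*pair) <= alpha,
            "Stouffer": stouffer_pvalue(*pair) <= alpha,
        }
        print(pair, decisions)
```

The second pair shows how the tests differ: one small p-value is enough for the Simes test, while the Fisher and Stouffer tests also weigh the large second p-value.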
We first consider the symmetric admissible curve where f is a simple function, which is a finite linear combination of indicator functions. In addition, a simple test is defined to be a hypothesis test with a simple function as its symmetric admissible critical boundary curve. The majority of the most commonly used multiple test methods, including the Bonferroni (Dunn 1958), Sidak (1967), and Simes (1986) tests, are examples of simple tests.
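A simple test can be illustrated by writing its boundary curve as a finite sum of constants times indicator functions. The sketch below is an assumption-laden illustration: it takes the rejection region to be {(p1, p2): p2 ≤ f(p1)}, and the particular step functions for the Simes and Bonferroni tests are one way to encode their rejection rules, not necessarily the article's displayed formulas. All names are hypothetical.

```python
def simple_function(pieces):
    """Build f(u) = sum_j c_j * 1_{(lo_j, hi_j]}(u) from (c, lo, hi) triples."""
    def f(u):
        return sum(c for c, lo, hi in pieces if lo < u <= hi)
    return f


def simes_curve(alpha):
    # The first interval starts below 0 so that u = 0 is covered.
    return simple_function([
        (1.0, -1.0, alpha / 2),      # p1 <= alpha/2: reject for any p2
        (alpha, alpha / 2, alpha),   # alpha/2 < p1 <= alpha: need p2 <= alpha
        (alpha / 2, alpha, 1.0),     # p1 > alpha: need p2 <= alpha/2
    ])


def bonferroni_curve(alpha):
    return simple_function([
        (1.0, -1.0, alpha / 2),      # p1 <= alpha/2: reject for any p2
        (alpha / 2, alpha / 2, 1.0), # otherwise need p2 <= alpha/2
    ])


def rejects(f, p1, p2):
    """Assumed convention: reject when the point lies on or below the curve."""
    return p2 <= f(p1)


if __name__ == "__main__":
    f = simes_curve(0.05)
    print(rejects(f, 0.03, 0.04), rejects(f, 0.03, 0.20))
```

Comparing `rejects` with the direct rules on a grid of p-value pairs is a simple way to confirm that the step functions encode the intended tests.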
Theorem 1. For any α-level simple bivariate normal test that is exact at ρ = 1, there exists a correlation coefficient $\rho_0 < 1$ such that the test is either conservative or anti-conservative for all $\rho \in (\rho_0, 1)$.
Theorem 1 states that there is no simple bivariate normal test that is exact for all $\rho \in [1 - \varepsilon, 1]$, where $\varepsilon > 0$. To put it another way, the Type I error rate of such a test is either strictly less than or strictly greater than α, and does not equal α, for all $\rho \in (1 - \varepsilon, 1)$. Furthermore, we call a bivariate normal test perfectly exact if the test is exact in all situations considered. Corollary 1 follows directly from Theorem 1 and concludes that the functional equation (1) for $f$ has no simple function solution, where $\varphi_\rho(x, y)$ is the probability density function of the standard bivariate normal distribution with correlation ρ.
To measure the distance between a perfect exact test and a non-perfect one, we calculate the average conservativeness using the difference between the significance level α and the Type I error rate of the test, and compare it with α. We call this measure relative conservativeness, which depends on the distribution of ρ.
Definition 2 (Relative conservativeness). We define the relative conservativeness by
$$R_f(g) = \frac{1}{\alpha}\left(\alpha - \int_{-1}^{1} \left[I^l_f(\rho) + I^b_f(\rho)\right] g(\rho)\, d\rho\right),$$
where $g(\rho)$ is the density function of ρ, and $f$ is the admissible critical boundary curve of the test. When the type of correlation coefficient needs to be specified, the relative conservativeness is denoted by $R_f(\rho \sim g)$.
The relative conservativeness in Definition 2 serves as the objective function when searching for the least conservative boundary in the corresponding optimization problems. If ρ is given, the density function $g(\rho)$ reduces to a Dirac delta function. For a test with symmetric curve $f$, the expression simplifies to
$$R_f(g) = \frac{1}{\alpha}\left(\alpha - 2\int_{-1}^{1} I_f(\rho)\, g(\rho)\, d\rho\right).$$
The beta distribution and functions of it can be used to describe the distribution of ρ. For example, with a flat distribution on all positive correlation coefficients, the relative conservativeness is
$$R_f(U(0, 1)) = \frac{1}{\alpha}\left(\alpha - 2\int_{0}^{1} I_f(\rho)\, d\rho\right).$$
Consider the relative conservativeness of the Simes and Bonferroni tests under uniform distributions of ρ. The Simes bivariate normal test is conservative for positive ρ. Numerical calculation shows that $R_{\mathrm{Simes}}(U(0, 1))$ decreases as α decreases. For example, $R_{\mathrm{Simes}}(U(0, 1); \alpha = 5\%) = 6.67\%$, and $R_{\mathrm{Simes}}(U(0, 1); \alpha = 1\%) = 5.30\%$. A rough upper bound of $R_{\mathrm{Simes}}(U(0, 1))$ is $1/(3 \log(1/\alpha))$. The Bonferroni test is conservative under any dependence. For the Bonferroni bivariate normal test, $R_{\mathrm{Bonf}}(U(-1, 1))$ also decreases as α decreases, with a rough upper bound $1/(4 \log(1/\alpha))$. For instance, $R_{\mathrm{Bonf}}(U(-1, 1); \alpha = 5\%) = 6.49\%$, and $R_{\mathrm{Bonf}}(U(-1, 1); \alpha = 1\%) = 4.64\%$.
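The relative conservativeness, read as (α minus the average Type I error rate) divided by α, can be approximated numerically. The sketch below does this for the bivariate normal Bonferroni test with ρ uniform on (−1, 1). It is an illustration under assumptions: left-tail p-values p = Φ(x), a Simpson rule for the inner integral, and a midpoint rule over ρ. The names `simpson`, `bonferroni_type1`, and `relative_conservativeness` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def simpson(func, lo, hi, n=400):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3


def bonferroni_type1(rho, alpha=0.05):
    """P(p1 <= alpha/2 or p2 <= alpha/2) under a standard bivariate normal, |rho| < 1."""
    a = STD.inv_cdf(alpha / 2)
    s = sqrt(1 - rho * rho)
    # P(X <= a and Y <= a), conditioning on X = x with Y | x ~ N(rho * x, 1 - rho^2).
    both = simpson(lambda x: STD.pdf(x) * STD.cdf((a - rho * x) / s), -8.0, a)
    return alpha - both  # inclusion-exclusion: alpha/2 + alpha/2 - P(both)


def relative_conservativeness(type1, lo, hi, alpha=0.05, steps=200):
    """(alpha - average Type I error) / alpha for rho ~ Uniform(lo, hi)."""
    width = (hi - lo) / steps
    # Midpoint rule, so the endpoints rho = -1 and rho = 1 are never evaluated.
    average = sum(type1(lo + (k + 0.5) * width, alpha) for k in range(steps)) / steps
    return (alpha - average) / alpha


if __name__ == "__main__":
    print(relative_conservativeness(bonferroni_type1, -1.0, 1.0))
```

The same `relative_conservativeness` helper accepts any function of (ρ, α) that returns a Type I error rate, so other boundary curves can be compared on the same footing.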
Corollary 1 shows that there is no simple function admissible critical boundary curve f that satisfies the functional equation in (1). However, when test statistics are not normally distributed, or correlation coefficients are limited to some subsets on [−1, 1], some solutions may exist. Ferguson (1995) proposed a class of bivariate uniform distribution models, and a special type of Ferguson's model is proposed by Gou and Tamhane (2018). We denote this distribution by the FGT distribution.
Definition 3 (FGT distribution). The Ferguson distribution (Gou-Tamhane type) is a bivariate distribution of (U, V) on the unit square $[0, 1]^2$ with density function $p(u, v)$.
Ferguson's model and the bivariate normal copula are both bivariate distributions on the unit square with uniform marginal distributions. Their density functions are compared in Figure 2 for the positive dependence case (ρ = 0.5) and the negative dependence case (ρ = −0.2). Generally speaking, Ferguson's model and other step-function models provide good approximations to normal models for power analysis (Zhang and Gou 2016).
In this article, we denote the correlation coefficient of copula function by τ , and that of normal distribution by ρ. When calculating the relative conservativeness R f (g), we use ρ for normal distributed random variables, and τ for other random variables, if not otherwise specified.
The Simes normal test is exact only when ρ = 0 or ρ = ±1. For FGT distributed test statistics, the Simes test is exact on a wide range of τ, and the relative conservativeness over all positive τ is very small.
The Bonferroni test is free of dependence assumptions, but is exact only when ρ = −1 for bivariate normally distributed test statistics. The multiple test procedures that use the Bonferroni technique, including the Holm (1979) procedure and the graphical approach (Bretz et al. 2009), share similar properties with the Bonferroni test. However, for FGT distributed test statistics, the error rate control of the Bonferroni test is notably different and is summarized in Theorem 3.
From Figure 3, we observe that the Simes test is almost perfectly exact for τ ∈ [0, 1] under FGT distribution. A natural question arises about whether a perfect exact test exists for all positive τ values. The following theorem gives a negative answer.
Corollary 1 states that no simple normal test is exact for all positive correlation coefficients. Theorem 4 further concludes that no perfect exact test exists for all positive correlation coefficients under the FGT distribution. Despite the nonexistence of a perfect exact test, we may instead minimize the conservativeness over a given range of correlation coefficients. We have observed that the conservativeness of the Simes test, measured by $R_f(U(0, 1))$, is extremely small under the FGT distribution, but is not negligible for normal test statistics. In the following section, we propose three families of intersection hypothesis tests that are, on average, less conservative than the Simes test. The relative conservativeness is calculated under the normal distribution, which applies asymptotically to a wide variety of data.

Conservativeness Minimization
The symmetric admissible critical boundary curves are modeled with three types of parametric functions on $(p_1, p_2) \in [0, 1]^2$. The bivariate probability density function of $(p_1, p_2)$ under the normal distribution is that of the bivariate normal copula. When ρ = ±1, the probability density concentrates on $\Phi^{-1}(p_1) = \pm\Phi^{-1}(p_2)$, that is, on $p_2 = p_1$ for ρ = 1 and on $p_2 = 1 - p_1$ for ρ = −1, which are the complete positive and complete negative dependence situations (Song 2000; Zhang and Gou 2016).
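For reference, the density of a pair of p-values $(p_1, p_2) = (\Phi(X), \Phi(Y))$ with $(X, Y)$ standard bivariate normal is the bivariate normal copula density. The sketch below evaluates its standard closed form; whether the article writes the density in exactly this form is an assumption, and the function name is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def normal_copula_density(p1, p2, rho):
    """Density of (Phi(X), Phi(Y)) at (p1, p2) for |rho| < 1 and 0 < p1, p2 < 1.

    Standard form: exp(-(rho^2 (x^2 + y^2) - 2 rho x y) / (2 (1 - rho^2)))
    divided by sqrt(1 - rho^2), with x = Phi^{-1}(p1) and y = Phi^{-1}(p2).
    """
    x, y = STD.inv_cdf(p1), STD.inv_cdf(p2)
    r2 = rho * rho
    exponent = -(r2 * (x * x + y * y) - 2 * rho * x * y) / (2 * (1 - r2))
    return exp(exponent) / sqrt(1 - r2)


if __name__ == "__main__":
    print(normal_copula_density(0.3, 0.7, 0.0))  # independence: density is 1
    print(normal_copula_density(0.1, 0.1, 0.5))  # mass piles up near the diagonal
    print(normal_copula_density(0.1, 0.9, 0.5))  # and thins out away from it
```

As |ρ| approaches 1 the factor $1/\sqrt{1-\rho^2}$ blows up along the diagonal (or anti-diagonal), which matches the concentration of mass described in the text.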

Power-Function Model
The power-function model has a power-function admissible critical boundary curve, where the left curve is $f(u) = b(c - u)^k + c$ for $u \le c$, and the bottom curve is $f(u) = c - \left[(1/b)(u - c)\right]^{1/k}$ for $u \ge c$. Because the curve is symmetric, we only consider the left part. We first consider two types of power-function models, with the constraints that the tests are exact (1) at ρ = ±1 and (2) at ρ = 0 and ρ = 1. Valid α-level tests may not exist for all ρ values under these two types of models. Details are included in the online Appendix A.2.1. To remedy the anticonservativeness of these tests, we relax the assumption of having exact tests under independence, and allow the tests to be conservative when ρ = 0. For simplicity, we omit the subscript $f$ in $I_f(\rho)$ unless we need to specify the expression of the boundary curve. Assume $I(\rho = 0) = \alpha_0/2 \le \alpha/2$, so that $\alpha - \alpha_0 \ge 0$. The corresponding test, with Type I error rate $\alpha_0$ when ρ = 0 and Type I error rate exactly α when ρ = 1, is given in (2). Numerical calculation shows that the tests in (2) are valid α-level tests for all positive and negative ρ values, as shown in Figure 4. Table 1 lists the Type I error rate under independence $2I(\rho = 0)$, the average Type I error rate for positive correlation coefficients $\int_0^1 2I(\rho)\,d\rho$, and the relative conservativeness $R_{\mathrm{pwr}}(\rho \sim U(0, 1)) = \left(\alpha - \int_0^1 2I(\rho)\,d\rho\right)/\alpha$ of various power-function model based tests, compared with the Simes test. When α = 5%, the minimum of $R_{\mathrm{pwr}}(\rho \sim U(0, 1))$ is 2.42%, achieved at k = 7.8.
Figure 4. Type I error rate versus correlation coefficient, power function models (α = 5%).
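The two branches of the power-function curve are inverses of each other, which is what makes the curve symmetric about the diagonal. The sketch below builds such a curve and checks this. The parameter values b = 9700, c = 0.03, k = 2 are purely illustrative choices that make the arithmetic easy to follow; they are not calibrated to any significance level and are not the values reported in the article. The clamping to [0, 1] and the function name are also assumptions.

```python
def power_function_curve(b, c, k):
    """Return f with left branch b*(c-u)**k + c (u <= c) and its inverse as bottom branch."""
    def f(u):
        if u <= c:
            value = b * (c - u) ** k + c            # left curve, f(u) >= c
        else:
            value = c - ((u - c) / b) ** (1.0 / k)  # bottom curve, f(u) <= c
        return min(1.0, max(0.0, value))            # keep the boundary inside [0, 1]
    return f


if __name__ == "__main__":
    f = power_function_curve(b=9700.0, c=0.03, k=2)  # illustrative parameters only
    for u in (0.01, 0.025, 0.03, 0.5, 1.0):
        print(u, f(u))
    # Symmetry about the diagonal: applying f twice returns the starting point
    # wherever the curve is not clamped.
    print(f(f(0.025)), f(f(0.5)))
```

With these illustrative parameters the left branch reaches 1 at u = 0.02, so the curve passes through (0.02, 1) and, by symmetry, through (1, 0.02).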

Linear-Function Model
A linear function provides another three-parameter model for the left curve $f$ of the critical boundary. To have a valid test for all ρ values, the true significance level at ρ = 0 may be set below α. The critical boundary curve is derived under the assumption $\alpha^2/2 \le I(\rho = 0) = \alpha_0/2 \le \alpha/2$, so that $\alpha - \alpha_0 \ge 0$. We include the Type I error rate under independence, the average Type I error rate for positive correlation coefficients, and the corresponding relative conservativeness of some linear-function model based tests in Table 2. The relations between the Type I error rate and the correlation coefficient under linear-function models are presented in Figure 5. The average relative conservativeness of the 5%-level test over all positive ρ values can be as small as 3.44%, attained at d = 0.0117.

Tangent-Function Model
Trigonometric functions can also be applied to model the critical boundary curves, here through a three-parameter tangent-function model. Figure 6 shows how the Type I error rate varies with the correlation coefficient under tangent-function models. The Type I error rate under independence, the average Type I error rate for positive correlation coefficients, and the corresponding relative conservativeness of several 5%-level tangent-function model based tests are reported in Table 3. The smallest average relative conservativeness on ρ ∈ [0, 1] is 0.89%, which is smaller than the minimum values obtained with the power-function and linear-function models.

Power Comparisons
Figure 6. Type I error rate versus correlation coefficient, tangent function models (α = 5%).
(Table row, Simes test: Type I error rate under independence 0.05000; average Type I error rate for positive ρ 0.04667; relative conservativeness 6.67%.)
To investigate the statistical power of the proposed intersection hypothesis tests, numerical integration is used. Assume that $(x, y)$ is bivariate normal with mean $(\mu_x, \mu_y)$ and correlation ρ.
Then the joint density function can be written as $\varphi_\rho(x, y; \mu_x, \mu_y)$. The probability that $(p_1 = \Phi(x), p_2 = \Phi(y))$ falls in the left rejection region with critical boundary curve $f$ is denoted by $I(\rho; \mu_x, \mu_y)$ and given in (5), where the left critical boundary curve passes through $(p_1 = u_0, p_2 = 1)$. For a test with symmetric admissible curve, the statistical power is $I(\rho; \mu_x, \mu_y) + I(\rho; \mu_y, \mu_x)$. We further define the average power of a test with symmetric admissible curve $f$ by
$$P^{\mathrm{avg}}_f(g) = \int \left[I(\rho; \mu_x, \mu_y) + I(\rho; \mu_y, \mu_x)\right] g(\rho)\, d\rho,$$
where $g(\rho)$ is a density function of ρ values. Table 4 compares the average power values $P^{\mathrm{avg}}_f(\rho \sim U(0, 1))$ of the Simes test and six newly proposed tests from the three model families. The power-function model with k = 49.5, the linear-function model with d = 0.00149, and the tangent-function model with b = 0.00295 lead to three tests that are less conservative than the Simes test for any ρ. The tests based on the power-function model with k = 7.8, the linear-function model with d = 0.0117, and the tangent-function model with b = 0.0206 achieve smaller relative conservativeness $R_f(U(0, 1))$ than the Simes test. Power gains over the Simes test are observed when $\mu_x$ and $\mu_y$ are not remarkably different. Bold font indicates the highest average power among the listed tests for a given combination of $\mu_x$ and $\mu_y$. Tests based on the tangent-function model are more powerful than those based on the power-function and linear-function models. Since we use the left rejection region for one-sided hypothesis testing in the calculation, the components of the mean vector $(\mu_x, \mu_y)$ under the alternative hypothesis are negative when computing the power.
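The article computes power by numerical integration. As a cross-check that works for any rejection rule, the sketch below swaps in a plain Monte Carlo estimate instead: it draws correlated normal pairs with unit variances, converts them to left-tail p-values, and counts rejections. The unit-variance assumption, the grid of ρ values used to approximate an average over U(0, 1), and all function names are assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def simes_rule(p1, p2, alpha=0.05):
    """Two-hypothesis Simes rejection rule."""
    return min(p1, p2) <= alpha / 2 or max(p1, p2) <= alpha


def mc_power(rule, rho, mu_x, mu_y, n=20000, seed=0):
    """Monte Carlo rejection probability of `rule` for bivariate normal statistics."""
    rng = random.Random(seed)
    s = sqrt(1 - rho * rho)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mu_x + z1
        y = mu_y + rho * z1 + s * z2   # corr(x, y) = rho, unit variances
        if rule(STD.cdf(x), STD.cdf(y)):  # left-tail p-values
            hits += 1
    return hits / n


def mc_average_power(rule, mu_x, mu_y, rhos, n=20000, seed=0):
    """Average of the estimated power over a list of rho values."""
    return sum(mc_power(rule, r, mu_x, mu_y, n, seed) for r in rhos) / len(rhos)


if __name__ == "__main__":
    print(mc_power(simes_rule, 0.5, 0.0, 0.0))    # null case: close to the Type I error rate
    print(mc_power(simes_rule, 0.5, -3.0, -3.0))  # strong negative means: high power
    print(mc_average_power(simes_rule, -2.0, -2.0, [0.1, 0.5, 0.9]))
```

Passing a different `rule` function, for example one built from another boundary curve, gives a like-for-like comparison with the Simes test.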
Note that the Simes test is actually anticonservative under negative ρ values, whereas our newly proposed tests control the Type I error rate at level α for any correlation coefficient. For a fair comparison, we may consider the c-Simes test proposed by Gou and Tamhane (2018), a conservative version of the Simes test that is a valid α-level test for both positive and negative ρ values. For a bivariate test, the c-Simes test compares $p_{(2)}$ with α, and compares $p_{(1)}$ with $\alpha/2 - \alpha^2/2$. We can show that the 5%-level tests based on the power-function model with k = 49.5, the linear-function model with d = 0.00149, and the tangent-function model with b = 0.00295 are uniformly more powerful than the 5%-level c-Simes test, since they all have larger rejection regions than the c-Simes test. Generally speaking, the power-function and tangent-function based models show more power gain than the linear-function based models, and are therefore recommended in this simulation setting.
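The bivariate c-Simes rule as described in the text differs from the Simes rule only in the threshold for the smaller p-value. A minimal sketch, with hypothetical function names:

```python
def simes_reject(p1, p2, alpha=0.05):
    """Simes: reject if p_(1) <= alpha/2 or p_(2) <= alpha."""
    lo, hi = min(p1, p2), max(p1, p2)
    return lo <= alpha / 2 or hi <= alpha


def c_simes_reject(p1, p2, alpha=0.05):
    """c-Simes: reject if p_(1) <= alpha/2 - alpha^2/2 or p_(2) <= alpha."""
    lo, hi = min(p1, p2), max(p1, p2)
    return lo <= alpha / 2 - alpha ** 2 / 2 or hi <= alpha


if __name__ == "__main__":
    # At alpha = 5% the c-Simes threshold for the smaller p-value is 0.02375.
    print(simes_reject(0.024, 0.5), c_simes_reject(0.024, 0.5))
    print(simes_reject(0.020, 0.5), c_simes_reject(0.020, 0.5))
```

Because the only change is a smaller threshold, every pair rejected by the c-Simes rule is also rejected by the Simes rule.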
To gauge the significance of a given increase in power, we compare the sample size needed to achieve a given power for normally distributed observations. When only one hypothesis is involved, the relation between the sample size m and the power 1 − β satisfies $m \propto \left(\Phi^{-1}(\alpha) + \Phi^{-1}(\beta)\right)^2$, and therefore the differentials of m and β satisfy the relation
$$\frac{dm}{m} = \frac{2\, d\beta}{\phi\left(\Phi^{-1}(\beta)\right)\left(\Phi^{-1}(\alpha) + \Phi^{-1}(\beta)\right)},$$
where $\phi$ denotes the standard normal density.
Table 4. Average powers (%) of Simes tests and newly proposed tests based on the power-function, linear-function, and tangent-function models for various $(\mu_x, \mu_y)$ pairs (α = 5%).
Table 4 column groups: tests with $dI_f/d\rho = 0$ at ρ = 0; tests with minimum $R_f(U(0, 1))$.
For a fixed differential dβ, the relative sample size reduction |dm/m| is an increasing function of the power 1 − β when the power is sufficiently large. The two-hypothesis case is more complex than the one-hypothesis case. Consider testing two one-sided hypotheses $H_x: \eta_x = 0$ and $H_y: \eta_y = 0$. The test statistic X follows a normal distribution with mean $\eta_x$ and variance $\sigma_x^2$, and the test statistic $Y \sim N(\eta_y, \sigma_y^2)$. The correlation between X and Y is denoted by ρ. Given significance level α, the relation between the sample size m and the power 1 − β is expressed through $I(\rho; \mu_x, \mu_y)$, which is defined in (5). There is no explicit expression for the sample size m as a function of the Type II error rate β, except in a few special cases. For example, using the critical boundary curve $f$ of the Bonferroni test and assuming independence ρ = 0 and equal effect sizes $\eta/\sigma = \eta_x/\sigma_x = \eta_y/\sigma_y$, the sample size m can be written as a function of the power 1 − β, namely $m = (\eta/\sigma)^2 \cdot \left(\Phi^{-1}(\alpha/2) + \Phi^{-1}(\sqrt{\beta})\right)^2$, and the relative sample size reduction is
$$\frac{dm}{m} = \frac{d\beta}{\sqrt{\beta}\, \phi\left(\Phi^{-1}(\sqrt{\beta})\right)\left(\Phi^{-1}(\alpha/2) + \Phi^{-1}(\sqrt{\beta})\right)}.$$
When α = 5%, increasing the power from 0.50 to 0.51, from 0.70 to 0.71, and from 0.98 to 0.99 represents efficiency gains of 1.029, 1.025, and 1.142, respectively. In the general case, |dm/m| depends on the critical boundary curve $f$, the correlation ρ, and the effect sizes $\eta_x/\sigma_x$ and $\eta_y/\sigma_y$. For a given power 1 − β, we calculate the required sample sizes for tests with different critical boundary curves $f$ under various choices of correlations and effect sizes. The tests are compared using the relative sample size, defined as the ratio of the sample size of the Simes test to that of a particular test, where the significance level is α = 5%, the power is 1 − β = 80%, the ratios of the effect size in testing $H_x$ to that in testing $H_y$ are 1:1 and 1:2, and the correlation coefficients between X and Y are −0.9 and −0.5 (negative dependence), 0 (independence), and 0.5 and 0.9 (positive dependence). The results are listed in Table 5.
For example, when $(\eta_x/\sigma_x) : (\eta_y/\sigma_y) = 1 : 1$ and ρ = 0.9, the relative sample size of the test based on the tangent-function model with parameter b = 0.0206 is 1.072, which indicates that a trial designed with 536 participants using the Simes test requires only 500 participants if the tangent-function model based least conservative test is applied. Note that the Simes test is an α-level test under independence and positive ρ values, whereas the least conservative tests proposed in Section 3 are valid for all ρ values. The least conservative tests are more powerful than the Simes test in most cases, even under negative dependence.
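The efficiency gains quoted for the Bonferroni special case can be reproduced, to the rounding shown, as ratios of the power-dependent factor $\left(\Phi^{-1}(\alpha/2) + \Phi^{-1}(\sqrt{\beta})\right)^2$ at the two power levels. The effect-size prefactor cancels in the ratio. Reading "efficiency gain" as this sample size ratio is an assumption of the sketch, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bonferroni_size_factor(power, alpha=0.05):
    """Power-dependent factor of the sample size for two independent Bonferroni tests."""
    beta = 1 - power
    # Both tests must fail to reject for a Type II error, so each fails with
    # probability sqrt(beta) under independence and equal effect sizes.
    return (STD.inv_cdf(alpha / 2) + STD.inv_cdf(sqrt(beta))) ** 2


def efficiency_gain(power_from, power_to, alpha=0.05):
    """Ratio of required sample sizes when the power rises from power_from to power_to."""
    return bonferroni_size_factor(power_to, alpha) / bonferroni_size_factor(power_from, alpha)


if __name__ == "__main__":
    for lo, hi in [(0.50, 0.51), (0.70, 0.71), (0.98, 0.99)]:
        print(lo, hi, round(efficiency_gain(lo, hi), 3))
```

The ratio is not monotone in the starting power: it dips between 0.50 and 0.70 and then rises sharply near 0.98, in line with the three values quoted in the text.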

Discussion
We have proved that the integral equation characterizing a test that is exact for all positive correlation coefficients has no simple function solution under the normal distribution and no solution under the FGT distribution. For bivariate normally distributed test statistics, the conclusion may not generalize directly from simple functions to arbitrary functions, because $dI_{f_n}(\rho)/d\rho$ does not converge uniformly on [0, 1], where $\{f_n(u)\}_{n=1}^{+\infty}$ is a pointwise increasing sequence of nonnegative simple functions, $f = \lim_{n \to +\infty} f_n$, and $I_{f_n}(\rho)$ is the rejection probability defined in Section 2. However, an extensive numerical search suggests that the equation in (1) has no solution of any form. Since no solution exists for FGT distributed test statistics either, we conjecture that there is no perfect exact test for all positive dependence cases regardless of the distribution of the test statistics.
If a bivariate perfect exact test does not exist, multivariate perfect exact tests have little practical meaning. The reasons are as follows. Intersection hypothesis tests are developed to construct multiple test procedures for the individual hypotheses. For example, the closure principle can be used to design an FWER-controlling multiple test procedure for $\{H_i\}_{i=1}^{n}$ based on the hypothesis tests of the intersection hypotheses $\bigcap_{i \in I_0} H_i$, where $I_0$ is a subset of {1, . . . , n} (Marcus, Peritz, and Gabriel 1976; Hochberg and Tamhane 1987). Without a perfect exact bivariate test of the intersection hypothesis, there is no chance of establishing a perfect exact multiple test procedure. Nevertheless, a least conservative surface for intersection hypothesis tests with more than two elementary hypotheses can be searched for under some parametric models, using methods similar to those introduced in this article. For example, we may search for the least conservative critical boundary surfaces on $(p_1, p_2, p_3) \in [0, 1]^3$ over all possible correlation matrices. We can define the symmetric admissible critical boundary surface in a similar way, where the surface is symmetric with respect to the planes $p_1 = p_2$, $p_2 = p_3$, and $p_3 = p_1$, as shown in Figure 7. The Simes test for three hypotheses is also plotted in Figure 7 as a reference critical boundary surface. Power gain over the Simes test is also expected for tests with rejection regions defined by least conservative surfaces.
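The role of intersection tests in the closure principle can be sketched as follows: an individual hypothesis is rejected only if every intersection hypothesis that contains it is rejected by its local test. The sketch uses the Simes test as the local test and brute-force enumeration of subsets, which is only practical for small n. It is an illustration with hypothetical function names, not a procedure taken from the article.

```python
from itertools import combinations


def simes_reject(pvalues, alpha):
    """Simes test of an intersection hypothesis: some p_(i) <= i * alpha / n."""
    ordered = sorted(pvalues)
    n = len(ordered)
    return any(p <= (i * alpha) / n for i, p in enumerate(ordered, start=1))


def closed_testing(pvalues, alpha=0.05, local_test=simes_reject):
    """Closure principle: reject H_i iff every intersection containing H_i is rejected."""
    n = len(pvalues)
    decisions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        reject_i = all(
            local_test([pvalues[i]] + [pvalues[j] for j in subset], alpha)
            for size in range(n)                 # size of the companion subset
            for subset in combinations(others, size)
        )
        decisions.append(reject_i)
    return decisions


if __name__ == "__main__":
    print(closed_testing([0.01, 0.04, 0.30]))
    print(closed_testing([0.01, 0.02]))
```

Any other intersection test with the same call signature can be passed as `local_test`, so a less conservative bivariate test would plug into the same closure without further changes.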