Nonparametric Two-Sample Tests of High Dimensional Mean Vectors via Random Integration

Abstract Testing the equality of the means in two samples is a fundamental statistical inferential problem. Most of the existing methods are based on the sum-of-squares or supremum statistics. They are possibly powerful in some situations, but not in others, and they do not work in a unified way. Using random integration of the difference, we develop a framework that includes and extends many existing methods, especially in high-dimensional settings, without restricting the same covariance matrices or sparsity. Under a general multivariate model, we can derive the asymptotic properties of the proposed test statistic without specifying a relationship between the data dimension and sample size explicitly. Specifically, the new framework allows us to better understand the test’s properties and select a powerful procedure accordingly. For example, we prove that our proposed test can achieve the power of 1 when nonzero signals in the true mean differences are weakly dense with nearly the same sign. In addition, we delineate the conditions under which the asymptotic relative Pitman efficiency of our proposed test to its competitor is greater than or equal to 1. Extensive numerical studies and a real data example demonstrate the potential of our proposed test. Supplementary materials for this article are available online.


Introduction
In many applications, high-dimensional data, whose dimension is much larger than the sample size, are commonly available.Examples include diffusion tensor imaging (Le Bihan et al. 2001), finance (Lam and Yao 2012), gene expression (Pan et al. 2018), and risk management (Bollerslev, Meddahi, and Nyawa 2019).Testing the equality of two high-dimensional mean vectors is a basic problem.For example, Chen and Qin (2010) and Zhang et al. (2020) studied differential gene expression in various molecules and tissues.Ayyala et al. (2015) detected differentially methylated regions based on MethylCap-seq data.
We deal with the two-sample test for equality of high-dimensional mean vectors.Given {X 1 , . . ., X m } and {Y 1 , . . ., Y n } are independent identically distributed random samples drawn from p-dimensional random variables X and Y having p × 1 mean vectors μ 1 and μ 2 , respectively, we want to test: (1.1) where their covariance matrices 1 and 2 are unknown.
For testing (1.1), some existing methods assume that 1 = 2 (Bai and Saranadasa 1996;Wu, Genton, and Stefanski 2006;Srivastava and Du 2008;Li et al. 2020).This assumption is complicated to be validated for high-dimensional data.Motivated by the idea from Bai and Saranadasa (1996), Chen and Qin (2010) (Chen and Qin 2010) Yes Cai (Cai, Liu, and Xia 2014) Yes Yes Yes aSPU (Xu et al. 2016) Yes Yes Mult1 (Chen, Li, and Zhong 2019) Yes Yes L2 (Zhang et al. 2020) Y e s Y e s DCF (Xue and Yao 2020) Yes Yes procedure for improving the supremum-type statistic's power.Cai, Liu, and Xia (2014) concluded that the supremum-based tests were powerful when the true mean differences were sparse in the sense that there were only a few but significant nonzero componentwise differences.However, such tests may not work well under a nonsparse alternative (Xu et al. 2016).
In practice, the true alternative hypothesis is unknown, so it can not help us choose a powerful test.Fortunately, powerful methods exist for both "dense" alternatives and sparse alternatives in the high-dimensional setting.Xu et al. (2016) proposed an adaptive testing procedure by combining information across a class of sum-of-powers tests.Chen, Li, and Zhong (2019) introduced an L 2 -type test by either thresholding, which removed the nonsignal bearing dimensions or transforming the data via the precision matrix for signal enhancement.Zhang et al. (2020) proposed an L 2 -norm-based test through the Welch-Satterthwaite χ 2 -approximation to deal with the nonnormality of the null distribution.However, their test again assumes an equal covariance matrix between the two populations.In contrast, Xue and Yao (2020) proposed a distribution and correlation free two-sample mean test.Yu et al. (2022) proposed a power-enhanced high-dimensional mean test.
We categorize the methods mentioned above in Table 1 according to (a) whether a test needs distribution assumptions; (b) whether a test needs assumptions on the common covariance matrix; (c) whether a test needs a sparsity assumption under the alternative hypothesis; and (d) whether a test requires clear conditions in the relationship between the data dimension p and the sample size n.Note that most of the methods are either sumof-squares or supremum-based.Xu et al. (2016) pointed out that such tests were not powerful if nonzero signals in the true mean differences were weakly dense with nearly the same sign or there were more dense or only weakly dense nonzero signals, but did not offer a solution.
This article establishes a unified framework by using random integration (Jiang et al. 2022) of the difference (RID) technique for two-sample tests of high-dimensional mean vectors.This technique uses the difference in the p-dimensional independent density-weighted function with the finite mean and variance.Many existing tests, such as the weighted L 2norm-based test, the supremum-type tests, and a burden test through p i=1 ( X(i) − Ȳ(i) ) (Pan and Shen 2011;Lee et al. 2012), are special cases of our unified framework.Furthermore, our framework (RID) has the following advantages: • It is nonparametric and can operate without assuming 1 = 2 .
• It does not require a direct relationship between the data dimension and sample size, and nor the sparsity assumption under the alternative hypothesis.Xu et al. (2016).
The rest of the article is organized as follows.Section 2 introduces our test statistic via the random integration of the difference technique and establishes the asymptotic properties.In Section 3, simulation studies are conducted to evaluate the finite sample performance of the proposed test.In Section 4, a real dataset is analyzed to compare the proposed test with some existing methods.We conclude with some remarks in Section 5.All technical details and some additional simulation results are provided as supplementary materials.

Note that
Therefore, testing whether μ 1 and μ 2 amounts to testing whether where w(δ) is any positive weight.
It is important to clarify that for convenience, we use RID w (X, Y) to indicate its dependence on the distributions of X and Y, not the random variables X and Y per se.We obtain that μ 1 = μ 2 if and only if RID w (X, Y) = 0 by Equation (2.1).Theorem 1 is a critical result that provides an explicit derivation for evaluating RID w (X, Y) with a suitable w.Theorem 1.If w(δ) = p i=1 w i (δ i ) and w i (•) is a density function with a mean α i and variance and RID θ (X, Y) ≥ 0 with the equality holds if and only if μ 1 = μ 2 , where θ = (α 1 , . . ., α p , β 1 , . . ., β p ) , a = α 1 , α 2 , . . ., α p , and Remark 1.Using Theorem 1, we can derive an explicit form of RID w (X, Y), such as when δ follows a density function with independent components.With different choices of the parameters {α i , i = 1, . . ., p} and {β i , i = 1, . . ., p}, RID θ (X, Y) can give rise to existing tests.Thus, RID θ (X, Y) provides a unified framework.
1.When α i = 0 and β i = 0 for all i, RID θ (X, Y) yields a weighted L 2 -norm-based test that is designed to be powerful for the "dense" alternatives (Chen and Qin 2010).

When α
results in a burden test, which is widely used in genome wide association study of rare variants (Pan and Shen 2011;Lee et al. 2012).3. When α i = 0 for all i, β j = 0 for j = j 0 , and otherwise β j = 0, where j 0 = arg max 1≤j≤p (μ j1 − μ j2 ) 2 , RID θ (X, Y) can lead to the supremum-type tests using the L ∞ -norm of the mean differences.In practice, j 0 is not known a priori and can be estimated by ĵ0 = arg max 1≤j≤p X(j) − Ȳ(j) 2 , where X(j) and Ȳ(j) are the jth component of sample mean vectors X and Ȳ for μ 1 and μ 2 , respectively.These tests are powerful against the "sparse" alternatives (Cai, Liu, and Xia 2014).4. When α 1 = • • • = α p = 0 and β i = 0 for all i, RID θ (X, Y) produces a hybrid of a weighted L 2 -norm-based test and a burden test, and may retain the strengths of both tests with proper weights so that it is powerful whether there is a large proportion of small to moderate componentwise differences or nonzero signals are weakly dense with nearly the same sign (Chen and Qin 2010;Xu et al. 2016). 5.When β j = 0 for all j, α i > 0 if μ i1 − μ i2 > 0, and α i < 0 otherwise, RID θ (X, Y) induces a hybrid of a weighted L 2norm-based test and a weighted L 1 -norm-based test, which should be powerful for dense or only weakly dense nonzero signals (Chen and Qin 2010;Xu et al. 2016).

Asymptotic Properties
To establish the limiting distribution of RID θ,m,n , we assume the following four conditions: E1.There exist a p × k 1 matrix 1 , a p × k 2 matrix 2 , k 1 -dimensional random vectors {Z 1i } m i=1 , and k 2 -dimensional random vectors {Z 2j } n j=1 , such that for some constants 1 and 2 , where Z ιjl is the lth component of Z ιj with ι = 1 or 2. Also, for a positive integer q and ς l such that As p → ∞, and for s 1 , s 2 , s 3 , s 4 ∈ {1, 2}, (2.4) Remark 2. Condition E1 gives a general multivariate model for high-dimensional data analysis, which includes the Gaussian family and members of the elliptically contoured distributions among many others (Bai and Saranadasa 1996;Chen and Qin 2010;Zhang et al. 2020).As said in Chen and Qin (2010), min{k 1 , k 2 } ≥ p indicates that the rank and eigenvalues of 1 or 2 are not affected by the transformation.Condition (2.3) means that each Z ij has a kind of pseudo-independence, which is a relaxed independence relation that allows some margin over probabilities (Kim and Lesser 2008).Clearly, if Z ij has independent components, then (2.3) is true.Condition E2 is a standard regularity assumption in twosample problems, which guarantees that m and n go to infinity proportionally.
Condition E3 is satisfied under H 0 , and enables the variance of RID θ,m,n to be asymptotically characterized by σ 2 m,n given in the following Theorem 2. Similar to Chen and Qin (2010), if all of the eigenvalues of W θ , 1 and 2 are bounded above from infinity and below away from zero and μ 1 − μ 2 = (ω, . . ., ω) , then condition E3 implies ω = o((n + m) −1/2 ), which is also studied in Xu et al. (2016), and is a smaller order than the local alternative hypotheses with the form μ 1 − μ 2 = ν(n + m) −1/2 for a nonzero constant vector ν and the fixed p setting.
Therefore, condition E3 can be viewed as a high-dimensional version of the local alternative hypotheses.
Condition E4 is typical to obtain a normal limit for the leading terms using the martingale central limit theorem (Hall 1984), and similar conditions can be found in Chen, Zhang, and Zhong (2010) and Li and Chen (2012) for proving the asymptotic distribution in high-dimensional hypothesis-testing problems.In addition, condition E4 is also useful for our proposed test in the high-dimensional case, although an explicit relationship between p and m, n is not required.Furthermore, if 1 and 2 are close to identity matrix, then one must rule out the case that β j = 0 for all j; otherwise condition E4 is violated.Therefore, other tests would be preferable in this case.
Corollary 1.Under Conditions E1-E4 and H 0 : To formulate a test procedure, we need to estimate σ m,n .Similar to Chen and Qin (2010), we propose the following estimators of tr{(W θ 1 ) 2 }, tr{(W θ 2 ) 2 }, and tr(W where X(i,j) and Ȳ(i,j) are the sample mean after excluding X i , X j and Y i , Y j , respectively, and X(i) and Ȳ(j) are the sample mean after excluding X i and Y j , respectively.Therefore, we can obtain an estimator of σ 2 m,n .
Furthermore, we can obtain the following Theorem 3.
Theorem 3.Under Conditions E1-E4 and H 0 : According to Theorem 3, the proposed test with a nominal ϑ level of significance rejects H 0 if RID θ,m,n ≥ σm,n z ϑ , where z ϑ is the upper-ϑ quantile of N (0, 1).

Power of the Proposed RID Test
In this section, we investigate the power of the proposed RID test.Denote , where (•) is the cumulative distribution function of the standard normal random variable.
Theorem 4 provides a general result on the power of the proposed RID test statistic.It reveals that as long as (m + n)P θ (μ 1 − μ 2 , 1 , 2 ) diverges to the infinity, the power will converge to 1. Next, we investigate the power of the following two special cases for heterogeneous α i and For the sake of simplicity, let us assume that where Case I: μ 1 − μ 2 = (ω, . . ., ω) for ω ≥ 0.Then, we have Therefore, we have By the above inequality, we can obtain the following Corollary 2.
Corollary 2 demonstrates that our proposed RID test is powerful when nonzero signals in the true mean differences are weakly dense with nearly the same sign.By contrast, (3.11) in Chen and Qin ( 2010) and ( 27) in Zhang et al. (2020) indicate that their tests have low power under H 1 .
Corollary 3 indicates that if all the eigenvalues of ˜ (τ ) are bounded away from 0 and ω 2 = O((m + n) −1 ), then e > 7/12.Therefore, the proposed RID test cannot cope with extremely sparse signals as did in Cai, Liu, and Xia (2014) and Chang et al. (2017) unless the sparse nonzero signals are extremely strong.
It follows from (2.2) that the first term in RID θ (X, Y) is a weighted L 2 -norm between μ 1 and μ 2 .Therefore, we will compare the asymptotic power of the our proposed RID test with the CQ test of Chen and Qin (2010).According to Theorem 4, and equation (3.11) in Chen and Qin (2010), the asymptotic power functions for the proposed RID test and CQ test are defined as follows Then, the asymptotic relative Pitman efficiency of the proposed RID test versus the CQ test is given by .
Furthermore, we can obtain the following Theorem 5.
Theorem 5. Assume Conditions E1-E4, The conditions, pr 2 = O(p 1/4 ) and The latter condition is used in the literature (e.g., Remark 3 in Wang, Peng, and Li 2015) and stronger than ours.

Simulation Studies
Example 1.In this example, we investigate the numerical performance of the proposed method using Monte Carlo simulations in the presence of weak signals.We compare the RID with the adaptive sum-of-powers test proposed by Xu et al. (2016) (aSPU), the CQ test (Chen and Qin 2010), the method without transformation proposed by Cai, Liu, and Xia (2014) (Cai), the nonstudentized test with screening (  Xue and Yao (2020).The aSPU is implemented by the function apval aSPU in the R package highmean.As suggested in Xu et al. (2016), to obtain aSPU, γ takes seven values, that is, γ ∈ {1, 2, 3, 4, 5, 6, ∞}, and p-value is given by (1) in Xu et al. (2016).
The empirical p-values are shown in Tables 2-3.From Tables 2-3, we find that the empirical p-values of all methods are controlled fairly well around 0.05 for all cases.
The power of RID is similar under the four different distributions, so we report the empirical power under only N (0, 1) in Figures 1-4.For the other three distributions, the results are presented in the supplementary material.From Figures 1 to 4, we have the following findings: 1.The proposed RID test has the greatest power when ρ = 0.1 or ρ = 0.2.This result is consistent with Corollary 2. Thus, the RID is powerful when nonzero signals of the difference between two mean vectors are weakly dense with nearly the same sign.2. The empirical power of RID increases as the signal strength r increases.3. RID's empirical power diminishes as ρ increases.
Example 2. In this example, we evaluate the power performance of our proposed RID test for the weakly dense signals with varying signs.We use the same setup in Example 1, except that ρ = 0, 1 = 0.4 |i−j| and 2 = 2 1 for 1 ≤ i, j ≤ p, (m, n) = (60, 80), and p = 200.The values of the nonzero entries in μ 2 were uniformly drawn at random from the interval − 2r(1/m + 1/n) log p, 2r(1/m + 1/n) log p with r = {0.02,0.04, 0.06, 0.08}.Due to the similar power performance of the proposed RID test under the four different distributions, we present only the empirical power under N (0, 1) in Figure 5.The results for the remaining three distributions are included in the supplementary materials.As illustrated in Figure 5, the CQ test has the best performance, followed by our proposed RID test.Meanwhile, all tests have poor power when r = 0.02.
Example 3. In this example, we carry out numerical simulations to illustrate the power performance of our proposed method for the sparse setting.We use the same setup in Example 1, except that we now use r = {0.6,0.8, 1.0, 1.2}, which are considered in Chen, Li, and Zhong (2019), 0.05p elements in μ 2 are set to nonzero values, which is a sparse setting and is studied in Cai, Liu, and Xia (2014), and 1 = 2 = , where The covariance matrice structure is studied in Cai, Liu, and Xia (2014).Since the proposed RID test has the similar power performance under the four different distributions, we only display the empirical power under N (0, 1) in Figure 6.For the other three distributions, the results are shown in the supplementary material.From Figure 6, we find that the power of the aSPU is the highest among all tests.Cai test, Mult1 test, and the proposed RID test have similar power when p = 200.Furthermore, all tests have power close to 1 when p ≥ 600 and r ≥ 0.8 except the DCF and f ns,ϑ .

Real Data Analysis
In this section, we evaluate the finite sample performance of various tests in the analysis of a breast cancer dataset from a genome-wide association study (Gravier et al. 2010), which can be downloaded from https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/query/acc.cgi?acc=GSE19159.The dataset contains 168 pT1T2pN0 invasive ductal carcinoma patients with either good (no event five years after diagnosis: 111 patients) or poor (57 patients with early-onset metastasis) outcomes (Qiu, Xu, and Zhu 2021).There are 2905 genome tiling array type variables for each subject, representing the normalized log2 ratio of the Cy5 and Cy3 signals.
It has been reported that breast cancer is the most frequent malignancy in women and has been a significant source of cancer-related morbidity and mortality in women worldwide (Harbeck et al. 2019;Hendrick, Baker, and Helvie 2019).Therefore, it is essential to identify significant genetic variants with breast cancer, and this will be helpful for the diagnosis, prevention, and treatment of breast cancer.Traditional multiple testing methods need multiple comparisons, making the genome-wide association study computationally intensive and might lead to misleading findings.Furthermore, it has been shown that there is strong evidence of polygenic effects for breast cancer (Shiovitz and Korde 2015).Therefore, we apply various tests to analyze the genome tiling data in each of the 22 autosomes separately to better demonstrate the possible power differences between the tests.The familywise nominal significance level is set at 0.05, and we set 0.05/22 = 0.0023 for each chromosome to consider the Bonferroni adjustment.
All of the compared methods have moment-based assumptions and according to Wang, Peng, and Li (2015), they are sensitive to outliers.We refer to Wang, Peng, and Li (2015) and Feng, Zou, and Wang (2016) for specific examples demonstrating the sensitivity of such tests to the outliers.In addition, we plot the histograms of the marginal kurtosises for chromosomes 3, 6, and 7 in Figure 7, which clearly demonstrate that some gene expression levels have heavy tails due to their kurtosises being much larger than 3. Therefore, prior to performing any test, we use an outlier detection method proposed by Ro et al. (2015) to eliminate outliers.The number of samples after the outlier removal and the number of genes after the pre-processing as in Gravier et al. (2010) are listed in Table 4.
To be concise, Table 5 presents some representative results.We also plot the differences of sample means between two groups for the selected chromosomes in Figure 8.It can be seen that for chromosome 3, all methods yield p-values less than 0.05/22 = 0.0023.For chromosome 6, RID is the only test indicating a significant signal (p-value = 3.22×10 −5 ).For chromosome 7, all methods except Cai and DCF lead to significant p-values, supporting the possible lower power of Cai and DCF tests.
Having observed the significant differences, we further explore different signal structures.From Figure 8, we can see that the number of negative mean differences is comparable with those of positive mean differences on chromosome 3 while most signals are relatively strong.These signals are easily detected by most methods.In contrast, chromosome 6 has a dense-butweak signal structure and has a large proportion of the negative mean differences (81.34%).As expected from our theoretical and simulation results, such signals are challenging to the existing methods.This application confirms that RID is particularly useful when nonzero signals are extremely dense with nearly the same sign.Also importantly, our result is consistent with the literature supporting the role of the genes on chromosome 6 in breast cancer (Noviello, Courjal, and Theillet 1996;Tao et al. 1999).
Comparing the signal structures between chromosome 6 and chromosome 7, we can see that although both have dense signals, the signal magnitude on chromosome 7 is much stronger.This indicates that the power of RID is stable when the signal is relatively weak, whereas the other methods lose power in detecting weak signals.Thus, our data analysis demonstrates that our proposed RID fills a gap in the existing methodology by introducing a test that is powerful in a setting when the existing methods are not powerful, that is, when nonzero signals of the difference between two mean vectors are weakly dense with nearly the same sign.

Discussion
A variety of methods have been developed for testing the equality of mean vectors in two samples.As summarized in Table 1, this problem remains to be an active research topic in statistics (Chen, Li, and Zhong 2019;Xue and Yao 2020;Zhang et al. 2020).Despite meaningful progress, further research is warranted.For example, there is no powerful test when nonzero signals are weakly dense with nearly the same sign or when there are more dense or only weakly dense nonzero signals (Xu et al. 2016).The first contribution of this work is to fill in this gap by developing a powerful test to detect such signals.We provide theoretical and numerical results that convincingly demonstrate that this goal is achieved.While searching for this solution, we use the random integration of the difference technique, enabling us to unify many existing methods.This is the second significant contribution of this work because this unified framework helps us understand when and why a test is powerful.By re-analysis a real dataset, we illustrate how our proposed test may discover insightful information.
It is noteworthy that there are further issues to investigate for our proposed method.First, in our simulation studies and real data analysis, we use the p-dimensional independent density function with α 1 = • • • = α p = 2p −3/8 and β i = √ 2(p + i)/p, i = 1, . . ., p as the weight function.As a result, RID combines a weighted L 2 -norm-based test and a burden test.Our empirical results support these choices in their effectiveness for constructing a powerful RID test.The test is still reliable when there are many small to moderate componentwise differences or when nonzero signals are weakly dense with nearly the same sign.Nonetheless, there are other choices for the weight function.An interesting future topic is to consider a p-dimensional        independent density function with an adaptive choice α i , β i or other choices of weight function.Second, a weight function is given by a density function with independent components in this article.We will consider a different measure with dependent components as further work.Finally, we investigate the fourth setting in Remark 1 in order to increase the power for extremely dense nonzero weak signals of nearly identical sign.We will consider the other settings in Remark 1 in greater detail to ensure that they meet the requirements of practical applications.
f ns,ϑ ) (Chang et al. 2017), the multilevel thresholding test without the data transformation proposed by Chen, Li, and Zhong (2019) (Mult1), an L 2 -norm-based test proposed by Zhang et al. (2020) (L2), and the distribution and correlation free (DCF) two-sample mean test proposed by

Figure 1 .
Figure 1.Empirical power when Z ij follows N (0, 1) and m = 60, n = 80 for Scenario 1 under different signal levels of r and sparsity levels of ρ in Example 1.

Figure 2 .
Figure 2. Empirical power when Z ij follows N (0, 1) and m = 60, n = 80 for Scenario 2 under different signal levels of r and sparsity levels of ρ in Example 1.

Figure 3 .
Figure 3. Empirical power when Z ij follows N (0, 1) and m = 90, n = 120 for Scenario 1 under different signal levels of r and sparsity levels of ρ in Example 1.

Figure 4 .
Figure 4. Empirical power when Z ij follows N (0, 1) and m = 90, n = 120 for Scenario 2 under different signal levels of r and sparsity levels of ρ in Example 1.

Figure 7 .
Figure 7. (a) Histogram of marginal kurtosises of X, Y for 2,905 genes in chromosome 3; (b) Histogram of marginal kurtosises of X, Y for 2905 genes in chromosome 6; (c) Histogram of marginal kurtosises of X, Y for 2905 genes in chromosome 7.

Figure 8 .
Figure 8. Differences in gene expression levels between the two groups for chromosomes 3, 6, 7. The X-axis represents the position (Mb) of the clones along the corresponding chromosome.

Table 1 .
The comparison of two-sample tests of high-dimensional mean vectors.
• The asymptotic relative Pitman efficiency of our proposed RID test compared to the test (CQ) proposed by Chen and Qin (2010) is greater than or equal to 1 under some conditions.• It leads to a distinctly powerful test when nonzero signals are weakly dense with nearly the same sign or when nonzero signals are "dense" under the alternative hypothesis.Hence, we solve the problem raised by m, and Y j = μ 2 + 2 Z 2j for j = 1, . . ., n.And 1 , 2 ,

Table 4 .
After removing the outlier samples, the total number of samples and the number of genes after pre-processing on each Chromosome (Chr).denotes the dimension and m, n denote the sample sizes after eliminating outliers.

Table 5 .
The p-values of the various methods applied to the breast cancer data.