Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference scale-invariant test

ABSTRACT For high-dimensional two-sample Behrens–Fisher problems, several non-scale-invariant and scale-invariant tests have been proposed. Most of them impose strong assumptions on the underlying group covariance matrices so that their test statistics are asymptotically normal. In practice, however, these assumptions may not be satisfied, or can hardly be checked, so these tests may not maintain the nominal size well. To overcome this difficulty, a normal reference scale-invariant test is proposed and studied in this paper. It works well without imposing strong assumptions on the underlying group covariance matrices and without assuming their equality. It is shown that, under some regularity conditions and the null hypothesis, the proposed test statistic and a chi-square-type mixture have the same normal and non-normal limiting distributions. It is then justifiable to approximate the null distribution of the proposed test using that of the chi-square-type mixture. The distribution of the chi-square-type mixture can be well approximated by the Welch–Satterthwaite chi-square approximation, with the approximation parameter consistently estimated from the data. The asymptotic power of the proposed test is established. Numerical results demonstrate that the proposed test has much better size control and power than several well-known non-scale-invariant and scale-invariant tests.


Introduction
The problem of testing the equality of mean vectors for high-dimensional data is frequently encountered in many contemporary statistical studies. One prominent aspect of high-dimensional data is that many measurements are taken on only a few subjects; that is, the number of variables is much larger than the number of observations. For example, in DNA microarray data, thousands of gene expression levels are often measured on relatively few subjects. Our motivating example is the colon data set, which is well known and publicly available at http://microarray.princeton.edu/oncology/affydata/index.html. It contains 22 normal colon tissues and 40 tumor colon tissues, each having 2000 gene expression levels. It is of interest to check whether the normal colon tissues and the tumor colon tissues have the same mean expression levels. In this two-sample problem, the data dimension p = 2000 is much larger than the total sample size n = 62, and the covariance matrices of the two samples are probably not the same. Therefore, the classical Hotelling $T^2$-test cannot be applied, as it is based on the common covariance matrix assumption and the assumption that the data dimension is much smaller than the total sample size. Without imposing the covariance matrix homogeneity assumption, the above two-sample problem for high-dimensional data is usually termed a high-dimensional two-sample Behrens–Fisher (BF) problem, and tests for such problems are needed. Mathematically, a two-sample BF problem can be defined as follows. Suppose that we have the following two independent high-dimensional samples:
$$y_{i1},\ldots,y_{in_i}\ \text{i.i.d. with}\ E(y_{ij})=\mu_i,\ \operatorname{Cov}(y_{ij})=\Sigma_i,\quad i=1,2, \qquad (1)$$
where the dimension p of $y_{i1}$ is very large, and may be close to or even much larger than the total sample size $n = n_1 + n_2$. We wish to test the equality of the two mean vectors:
$$H_0:\ \mu_1=\mu_2 \quad \text{versus} \quad H_1:\ \mu_1\neq\mu_2, \qquad (2)$$
without assuming that the covariance matrices $\Sigma_1$ and $\Sigma_2$ are equal. To our knowledge, several tests have been proposed for this high-dimensional two-sample BF problem.
We classify them as non-scale-invariant and scale-invariant tests. Note that scale-invariant tests are unchanged under any scale transformation of the high-dimensional data, while non-scale-invariant tests do not have such a property. The non-scale-invariant tests include [1,2,4], and [14], among others, while the scale-invariant tests include [7,10], and [5], among others. Note that when the variances of the p variables of high-dimensional data are very different, the scale-invariant tests generally have higher power than the non-scale-invariant tests. This is because the scale-invariant tests take the diagonal variations of the sample covariance matrices into account while the non-scale-invariant tests do not. Since it is quite common in practice that different components of a multivariate observation have different scales, the scale-invariant tests may be preferred, although they often require much stronger conditions than the non-scale-invariant tests. Besides, [6] presented some examples demonstrating the advantage of scale-invariant tests over non-scale-invariant tests for two-sample BF problems. Note that most of the above tests impose strong assumptions on the underlying covariance or correlation matrices so that their test statistics are asymptotically normal. However, these assumptions may not be satisfied, or can hardly be checked, in real data applications. This means that these tests may not be able to maintain their sizes well in practice, and their conclusions may be unreliable and even misleading. To overcome this difficulty, [14] proposed a simple non-scale-invariant test for high-dimensional two-sample BF problems, and [15] proposed a scale-invariant two-sample test for high-dimensional data, but with the common covariance matrix assumption. They showed that their tests have much better size control than several existing competitors under different configurations.
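The scale-invariance property discussed above can be made concrete with a toy statistic. The sketch below is illustrative only and is not any of the cited test statistics: it standardizes each coordinate by a pooled diagonal variance estimate and is therefore unchanged when every variable is rescaled, unlike a plain squared-norm statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 15, 20, 50
y1 = rng.standard_normal((n1, p))
y2 = rng.standard_normal((n2, p)) + 0.3

def scale_inv_stat(y1, y2):
    """Squared mean difference standardized coordinate-wise by a pooled
    diagonal variance estimate (a toy scale-invariant statistic)."""
    n1, n2 = len(y1), len(y2)
    n = n1 + n2
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    # pooled diagonal of (n2/n)*S1 + (n1/n)*S2
    d = (n2 / n) * y1.var(axis=0, ddof=1) + (n1 / n) * y2.var(axis=0, ddof=1)
    return np.sum(diff**2 / d)

def nonscale_stat(y1, y2):
    """Plain squared norm of the mean difference (not scale-invariant)."""
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    return np.sum(diff**2)

# rescale every variable by a different positive factor
c = rng.uniform(0.1, 10.0, size=p)
t_inv = scale_inv_stat(y1, y2)
t_inv_scaled = scale_inv_stat(y1 * c, y2 * c)
t_plain = nonscale_stat(y1, y2)
t_plain_scaled = nonscale_stat(y1 * c, y2 * c)
```

Rescaling multiplies both the squared mean differences and the pooled variances by the same factors, so the standardized statistic is unchanged while the plain squared norm is not.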
In this paper, we extend the tests of [14] and [15] to high-dimensional two-sample BF problems so that the resulting scale-invariant test enjoys good size control and high power.
Let $\bar y_i = n_i^{-1}\sum_{j=1}^{n_i} y_{ij}$ and $\hat\Sigma_i = (n_i-1)^{-1}\sum_{j=1}^{n_i}(y_{ij}-\bar y_i)(y_{ij}-\bar y_i)^\top$, $i=1,2$, be the usual sample mean vectors and sample covariance matrices of the two samples (1). Set
$$\hat\Sigma_n = \frac{n_2}{n}\hat\Sigma_1 + \frac{n_1}{n}\hat\Sigma_2 \quad\text{and}\quad \hat D_n = \operatorname{diag}(\hat\Sigma_n). \qquad (3)$$
To motivate our new test for the high-dimensional two-sample BF problem (2), we first recall the scale-invariant test of [10], namely the SKK-test. Its test statistic $T_{\mathrm{SKK}}$ is given in (4), where
$$\hat\sigma^2 = 2\Big\{\operatorname{tr}(\hat R_n^2) - \big[(n_1-1)^{-1} n_2^2 \operatorname{tr}^2(\hat D_n^{-1}\hat\Sigma_1) + (n_2-1)^{-1} n_1^2 \operatorname{tr}^2(\hat D_n^{-1}\hat\Sigma_2)\big]/n^2\Big\},$$
and $c_{n,p} = 1 + \operatorname{tr}(\hat R_n^2)/p^{3/2}$. According to [9], $c_{n,p}$ is an adjustment coefficient used to improve the convergence of $T_{\mathrm{SKK}}$ to N(0, 1). The critical value or p-value of the SKK-test is calculated based on the asymptotic normality of the SKK-test statistic, which is valid only when some strong conditions on the underlying group correlation matrices of the two samples (1) are satisfied. In practice, however, these strong conditions are hardly satisfied or difficult to check, so the null distribution of $T_{\mathrm{SKK}}$ may not be asymptotically normal and the SKK-test may not be able to maintain the nominal size well. To appreciate this, in Figure 1 we display the histograms of the simulated $T_{\mathrm{SKK}}$. The two samples (1) are generated from normal distributions under the null hypothesis, with the two covariance matrices associated with two tuning parameters $\rho_1$ and $\rho_2$ which control the correlation among the p variables of the high-dimensional data: the larger the values of $\rho_1$ and $\rho_2$, the larger the correlation among the p variables. It is seen that the shape of the histogram of the simulated $T_{\mathrm{SKK}}$ is mainly determined by the values of $\rho_1$ and $\rho_2$. When $[\rho_1,\rho_2]=[0.01,0.01]$, the histograms are quite symmetric and bell-shaped, so a normal approximation, as suggested by the theory of [10], is adequate for approximating the underlying null distribution of $T_{\mathrm{SKK}}$.
However, when $[\rho_1,\rho_2]=[0.5,0.3]$ and $[0.9,0.5]$, the histograms are quite skewed, so a normal approximation is no longer adequate and may result in a misleading test if blindly employed. In fact, the simulation results presented in Section 3.2 reveal that when the underlying null distribution of $T_{\mathrm{SKK}}$ is nearly normal, the SKK-test indeed has reasonable sizes and powers, but when the underlying null distribution of $T_{\mathrm{SKK}}$ is actually skewed, the SKK-test has unacceptably small sizes and powers.
To overcome the drawbacks of the SKK-test, our test statistic $T_{n,p}$ is constructed as in (6). Note that there is a close connection between $T_{n,p}$ and $T_{\mathrm{SKK}}$, as seen from the expressions (4) and (6), and for large samples the distributions of $T_{n,p}$ and $T_{\mathrm{SKK}}$ have similar shapes, either symmetric or skewed. However, it is not difficult to see that $T_{n,p}$ is always nonnegative while $T_{\mathrm{SKK}}$ takes both positive and negative values.
In the next section, we shall show that, under some regularity conditions and the null hypothesis, the proposed test statistic $T_{n,p}$ and a chi-square-type mixture have the same normal or non-normal limiting distributions, so that it is justifiable to approximate the null distribution of $T_{n,p}$ using that of the chi-square-type mixture. It turns out that the chi-square-type mixture is obtained from the proposed test statistic under the null hypothesis and when the two samples (1) are normally distributed. Therefore, the associated scale-invariant test is termed a normal reference scale-invariant test. The distribution of the chi-square-type mixture can be well approximated by the well-known Welch–Satterthwaite chi-square approximation with the approximation parameters consistently estimated from the data. The asymptotic power of the proposed test under some local alternative is also established.
As mentioned in the definition (4) of the SKK-test statistic $T_{\mathrm{SKK}}$, [10] employed the adjustment coefficient $c_{n,p}$ to improve the convergence of $T_{\mathrm{SKK}}$ to N(0, 1). A small simulation study presented in Section 3.1 reveals that $c_{n,p}$ does improve the convergence of $T_{\mathrm{SKK}}$ and $T_{n,p}$ to normal distributions when their null distributions are nearly normal, but it significantly worsens their performance otherwise. This shows that applying the adjustment coefficient blindly to a scale-invariant test is not wise. The same simulation study reveals that when $c_{n,p}\le 1.2$, we can apply $c_{n,p}$ to improve the performance of $T_{n,p}$; otherwise we should not use it. This empirical criterion for using $c_{n,p}$ in a scale-invariant test is new. Two simulation studies and a real data application demonstrate the good performance of the proposed test, with $c_{n,p}$ applied judiciously, against the tests of [4] and [10] in terms of size control and power.
It may be worthwhile to make a detailed comparison of the proposed test $T_{n,p}$ with the tests of [14] and [15], namely $T_{ZZGZ}$ and $T_{ZZZ}$, respectively. First, both $T_{n,p}$ and $T_{ZZGZ}$ are robust against the covariance heterogeneity of the two samples (1) but $T_{ZZZ}$ is not, because $T_{ZZZ}$ imposes the covariance homogeneity assumption while the other two tests do not. When the covariance homogeneity assumption is approximately valid, all three tests are generally comparable in terms of size control, but when the covariance homogeneity assumption is seriously violated, $T_{ZZZ}$ will be biased in the sense that its empirical sizes may be much smaller or larger than the nominal size. In the high-dimensional setting, this can be a serious drawback since it is often difficult to check whether the two samples (1) satisfy the covariance homogeneity assumption. Secondly, both $T_{n,p}$ and $T_{ZZZ}$ are scale-invariant but $T_{ZZGZ}$ is not, because both $T_{n,p}$ and $T_{ZZZ}$ take the variances of the p variables of the high-dimensional data into account in their test statistic constructions while $T_{ZZGZ}$ does not. When the variances of the p variables are approximately the same, $T_{n,p}$ and $T_{ZZGZ}$ are generally comparable in terms of power. However, when the variances of the p variables are very different, $T_{n,p}$ generally has much higher power than $T_{ZZGZ}$. Thirdly, both $T_{n,p}$ and $T_{ZZZ}$ impose the assumption that $\log(p) = o(n_{\min})$ [see (10) in Section 2] but $T_{ZZGZ}$ does not. This assumption indicates that the minimum sample size of the two samples should not be too small compared with $\log(p)$. This is reasonable because both $T_{n,p}$ and $T_{ZZZ}$ need to estimate the variances of the p variables accurately but $T_{ZZGZ}$ does not. This indicates that the power gain of $T_{n,p}$ over $T_{ZZGZ}$ is not a free lunch.
Last but not least, the estimators of the approximation parameters of $T_{n,p}$ are much more complicated than those of $T_{ZZGZ}$ and $T_{ZZZ}$, due to the fact that $T_{n,p}$ takes the covariance heterogeneity and the variances of the p variables into account simultaneously, while $T_{ZZGZ}$ and $T_{ZZZ}$ take either the covariance heterogeneity or the variances of the p variables into account, but not both. In summary, the proposed test $T_{n,p}$ is designed to overcome the drawbacks of $T_{ZZGZ}$ and $T_{ZZZ}$ and to inherit their advantages, at some cost. In addition, we develop an empirical criterion for using the adjustment coefficient $c_{n,p}$ of [10] judiciously to improve the performance of $T_{n,p}$ in Section 3.1; such an empirical criterion was not investigated in the development of $T_{ZZZ}$.
The rest of this paper is organized as follows. The main results are presented in Section 2. Section 3 presents a small simulation study on the adjustment coefficient, two simulation studies, and an application to the colon data. Some concluding remarks are presented in Section 4. The technical proofs of the main results are given in the supplement.

Asymptotic null distribution
To test (2), we need to derive the null distribution of $T_{n,p}$ defined in (6). However, it is very difficult to derive this distribution directly. We first consider the statistic $T^*_{n,p}$ obtained from $T_{n,p}$ by replacing $\hat D_n$ with $D_n = \operatorname{diag}(\Sigma_n)$, where $\Sigma_n = \frac{n_2}{n}\Sigma_1 + \frac{n_1}{n}\Sigma_2$; $D_n$ is unknown but can be consistently estimated using $\hat D_n$ in (3). Let $\hat\sigma_{i,rr}$, $r=1,\ldots,p$, and $\sigma_{i,rr}$, $r=1,\ldots,p$, be the diagonal entries of $\hat\Sigma_i$ and $\Sigma_i$, $i=1,2$, respectively. Let $\hat\sigma_{rr}$, $r=1,\ldots,p$, and $\sigma_{rr}$, $r=1,\ldots,p$, be the diagonal entries of $\hat\Sigma_n$ and $\Sigma_n$, respectively; that is, $\hat D_n = \operatorname{diag}(\hat\sigma_{11},\ldots,\hat\sigma_{pp})$ and $D_n = \operatorname{diag}(\sigma_{11},\ldots,\sigma_{pp})$. We have $\hat\sigma_{rr} = \frac{n_2}{n}\hat\sigma_{1,rr} + \frac{n_1}{n}\hat\sigma_{2,rr}$ and $\sigma_{rr} = \frac{n_2}{n}\sigma_{1,rr} + \frac{n_1}{n}\sigma_{2,rr}$, $r=1,\ldots,p$. Under Conditions C1–C4 specified below, we show in the supplement that (8) and (9) hold, where $n_{\min}=\min(n_1,n_2)$ is the minimum sample size of the two samples and $O_p(\cdot)$ denotes "bounded in probability". That is, $T_{n,p}$ and $T^*_{n,p}$ have the same distribution asymptotically provided that
$$\log(p) = o(n_{\min}). \qquad (10)$$
Note that Condition (10) requires that $n_{\min}$ should not be too small compared with $\log(p)$, so p is allowed to tend to infinity at a sub-exponential rate in $n_{\min}$. This is a reasonable assumption in high-dimensional problems. Therefore, under Condition (10), studying the null distribution of $T_{n,p}$ is equivalent to studying that of $T^*_{n,p}$. For further study, we decompose $T^*_{n,p} = T_{n,p,0} + S_{n,p}$ as in (11), where $T_{n,p,0}$ and $S_{n,p}$ are defined in (12). It is clear that $T_{n,p,0}$ has the same distribution as the null distribution of $T^*_{n,p}$. Throughout this paper, let $A_r$, $r=1,2,\ldots$, be i.i.d. $\chi^2_1$ random variables. Then, when the two samples (1) are normally distributed, for any given n and p, it is easy to show that the distribution of $T_{n,p,0}$ is the same as that of the following chi-square-type mixture:
$$T^*_{n,p,0} = p^{-1}\sum_{r=1}^{p} \lambda_{n,p,r} A_r,$$
where $\lambda_{n,p,r}$, $r=1,\ldots,p$, are the eigenvalues of $R_n$ in descending order and $R_n = D_n^{-1/2}\Sigma_n D_n^{-1/2}$.
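The chi-square-type mixture is easy to simulate once the eigenvalues of $R_n$ are available; the mean-one normalization $p^{-1}\sum_r \lambda_r A_r$ used below is consistent with the moment expressions quoted later in the text. The AR(1)-style choice of $R_n$ here is purely illustrative and is not the covariance structure used in the paper's simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 100

# An AR(1)-style correlation matrix standing in for R_n (illustration only)
rho = 0.5
R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

lam = np.linalg.eigvalsh(R)        # eigenvalues lambda_{n,p,r} of R_n
B = 50_000
A = rng.chisquare(1, size=(B, p))  # i.i.d. chi^2_1 variables A_r
mixture = A @ lam / p              # p^{-1} * sum_r lambda_r * A_r

# Moments of the mixture: mean tr(R)/p = 1, variance 2 tr(R^2)/p^2
mean_mc = mixture.mean()
var_mc = mixture.var()
var_theory = 2.0 * np.trace(R @ R) / p**2
```

With strong correlation, a histogram of `mixture` is visibly right-skewed even for large p, which is exactly why a normal approximation to such mixtures can fail.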
Since $T^*_{n,p,0}$ is obtained from $T_{n,p,0}$ when the two samples (1) are normally distributed, we may call the distribution of $T^*_{n,p,0}$ the normal-reference distribution of $T_{n,p,0}$. By some simple algebra, the first three cumulants of $T^*_{n,p,0}$ are
$$\kappa_1 = \operatorname{tr}(R_n)/p = 1,\quad \kappa_2 = 2\operatorname{tr}(R_n^2)/p^2,\quad \kappa_3 = 8\operatorname{tr}(R_n^3)/p^3,$$
and the skewness of $T^*_{n,p,0}$ is then $\kappa_3/\kappa_2^{3/2} = \sqrt{8}\,\operatorname{tr}(R_n^3)/[\operatorname{tr}(R_n^2)]^{3/2}$. For further study, we set
$$R_i = D_n^{-1/2}\,\Sigma_i\, D_n^{-1/2},\quad i=1,2, \qquad (16)$$
so that $R_n = \frac{n_2}{n}R_1 + \frac{n_1}{n}R_2$. In addition, set $\rho_{n,p,r} = \lambda_{n,p,r}/[\operatorname{tr}(R_n^2)]^{1/2}$, $r=1,\ldots,p$, which are the eigenvalues of $R_n/[\operatorname{tr}(R_n^2)]^{1/2}$ in descending order. We impose the following conditions:
C4. There exist two constants $c_1$ and $c_2$ such that $0 < c_1 \le \min_{1\le r\le p}\sigma_{i,rr} \le \max_{1\le r\le p}\sigma_{i,rr} \le c_2 < \infty$ for all p and $i=1,2$.
C5. There exist real numbers $\rho_r$, $r=1,2,\ldots$, such that $\lim_{n,p\to\infty}\rho_{n,p,r} = \rho_r$, $r=1,2,\ldots$, uniformly, and $\lim_{n,p\to\infty}\sum_{r=1}^{p}\rho_{n,p,r} = \sum_{r=1}^{\infty}\rho_r$.
Conditions C1 and C2 are also imposed in [3] and [4]; they specify a factor model for high-dimensional data analysis. Condition C3 requires that $n_1$ tends to $\infty$ proportionally with the total sample size n; under Condition C3, as $n\to\infty$, we have $n_i/n\to\tau_i\in(0,1)$, $i=1,2$. Condition C4 requires that the diagonal entries of $\Sigma_1$ and $\Sigma_2$ are bounded below and above; it is also imposed in [10]. Condition C5 ensures the existence of the limits of the eigenvalues of $R_n/[\operatorname{tr}(R_n^2)]^{1/2}$ as $n,p\to\infty$ and that the limit and summation operations in $\lim_{n,p\to\infty}\sum_{r=1}^{p}\rho_{n,p,r}$ are exchangeable. It ensures that the limiting distributions of the normalized $T_{n,p,0}$ and $T^*_{n,p,0}$, namely $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$, are non-normal. Condition C6 is used to ensure that the limiting distributions of $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$ are normal.
Let $\overset{\mathcal L}{\longrightarrow}$ and $\overset{d}{=}$ denote convergence in distribution and equality in distribution, respectively. The following theorem states that $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$ have the same normal and non-normal limiting distributions under some regularity conditions.

Theorem 2.1: (a) Under Conditions C1–C5, as $n,p\to\infty$, $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$ have the same non-normal limiting distribution; (b) under Conditions C1–C4 and C6, as $n,p\to\infty$, $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$ are both asymptotically normal. Then, under the conditions of (a) or (b), as $n,p\to\infty$, we always have that $\tilde T_{n,p,0}$ and $\tilde T^*_{n,p,0}$ converge to the same limiting distribution.

Implementation
Theorem 2.1 indicates that it is natural to approximate the distribution of $T_{n,p,0}$ by that of $T^*_{n,p,0}$. Since $T^*_{n,p,0}$ is a $\chi^2$-type mixture with nonnegative unknown coefficients $\lambda_{n,p,r}$, $r=1,\ldots,p$, it is nonnegative and generally skewed. Therefore, it is not always appropriate to approximate the distribution of $T^*_{n,p,0}$ by a normal distribution. To overcome this difficulty, following [13], we approximate the distribution of $T^*_{n,p,0}$ by that of the random variable
$$G \sim \chi^2_d/d, \qquad (22)$$
where the approximation parameter d may be called the approximate degrees of freedom. It can be determined by matching the variances of $T^*_{n,p,0}$ and G. By (14), $\operatorname{Var}(T^*_{n,p,0}) = 2p^{-2}\operatorname{tr}(R_n^2)$, while by (22), $\operatorname{Var}(G) = 2/d$. Therefore, equating the variances of $T^*_{n,p,0}$ and G leads to
$$d = p^2/\operatorname{tr}(R_n^2). \qquad (23)$$
Since $E(T^*_{n,p,0}) = E(G) = 1$, the above variance-matched $\chi^2$-approximation is essentially the well-known Welch–Satterthwaite $\chi^2$-approximation, which is very accurate and widely used in solving Behrens–Fisher problems for univariate data [8,11].
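Equating the two variances gives the closed form $d = p^2/\operatorname{tr}(R_n^2)$, which the sketch below computes for an illustrative $R_n$ (the AR(1)-style matrix is our assumption, not the paper's) before reading off an approximate critical value from the scaled chi-square distribution.

```python
import numpy as np
from scipy import stats

p = 100
rho = 0.5
# Illustrative R_n with unit diagonal (an assumption for this sketch)
R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# Welch-Satterthwaite matching: Var(G) = 2/d vs Var(T*) = 2 tr(R^2)/p^2
d = p**2 / np.trace(R @ R)

# G = chi^2_d / d; approximate upper-alpha critical value
alpha = 0.05
crit = stats.chi2.ppf(1 - alpha, df=d) / d
```

Since $R_n$ has unit diagonal, $\operatorname{tr}(R_n^2)$ lies between p and $p^2$, so d always falls in $[1, p]$: strong correlation pushes d toward 1 (skewed reference distribution), weak correlation pushes it toward p (nearly normal).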
For any nominal significance level α > 0, let $\chi^2_v(\alpha)$ denote the upper 100α percentile of the $\chi^2_v$-distribution. Let $\hat d$ be a ratio-consistent estimator of d in (23); then the proposed test can be conducted using the approximate critical value $\chi^2_{\hat d}(\alpha)/\hat d$. Theorem 4 of [13] indicates that the chi-square-type mixture $T^*_{n,p,0}$ is asymptotically normal if and only if $d^* \to \infty$, where $d^*$ is defined in (15). Further, by Theorem 5 of [13], we have $1 \le d^* \le d \le p$. The above results indicate that the proposed normal reference scale-invariant two-sample test is adaptive to the shape of the underlying distribution of $T^*_{n,p,0}$ in the following sense. On the one hand, when $T^*_{n,p,0}$ is asymptotically normal, the associated $d^*, d \to \infty$ and hence G is also asymptotically normal. On the other hand, when d is finite and hence G is not asymptotically normal, the associated $d^*$ is also finite and $T^*_{n,p,0}$ is also not asymptotically normal.
In real data analysis, d is always finite, so the proposed normal reference scale-invariant two-sample test can always be conducted provided that d is consistently estimated from the data. Based on (23), we only need to find a ratio-consistent estimator of $\operatorname{tr}(R_n^2)$. To this end, since $R_n = \frac{n_2}{n}R_1 + \frac{n_1}{n}R_2$, we can write
$$\operatorname{tr}(R_n^2) = \frac{n_2^2}{n^2}\operatorname{tr}(R_1^2) + \frac{n_1^2}{n^2}\operatorname{tr}(R_2^2) + \frac{2n_1n_2}{n^2}\operatorname{tr}(R_1R_2),$$
where $R_1$ and $R_2$ are defined in (16), so we only need to estimate $\operatorname{tr}(R_i^2)$, $i=1,2$, and $\operatorname{tr}(R_1R_2)$ consistently. Note that $R_i$, $i=1,2$, are not the usual correlation matrices of $\Sigma_i$, $i=1,2$; however, we can treat them as the covariance matrices of the transformed data $D_n^{-1/2}y_{ij}$. Following [13] and [14], under Conditions C1–C3, ratio-consistent estimators of $\operatorname{tr}(R_1R_2)$ and $\operatorname{tr}(R_i^2)$, $i=1,2$, are available. Notice that under Conditions C1–C4, by (8) and (10), $\hat D_n^{-1}$ is ratio-consistent for $D_n^{-1}$; therefore, under Conditions C1–C4 and (10), the plug-in versions $\widehat{\operatorname{tr}(R_1R_2)}$ and $\widehat{\operatorname{tr}(R_i^2)}$, $i=1,2$, remain ratio-consistent. Hence a ratio-consistent estimator of $\operatorname{tr}(R_n^2)$ follows, and the ratio-consistent estimator $\hat d$ of d is then given in (26). Remark 2.1: A small simulation study presented in Section 3.1 indicates that, in terms of size control, $T_{n,p}$ performs well regardless of whether the data are nearly uncorrelated, moderately correlated, or highly correlated, but it is still slightly liberal when the data are nearly uncorrelated. To overcome this difficulty, an empirical criterion for adjusting $\hat d$ in (26) using the adjustment coefficient $c_{n,p}$ in (5) of [10] is developed in Section 3.1. Roughly speaking, $T_{n,p}$ may be implemented with $\hat d$ replaced by $\hat d_c = \hat d/c_{n,p}$ whenever $c_{n,p} \le 1.2$.
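The paper's bias-corrected estimators of $\operatorname{tr}(R_1R_2)$ and $\operatorname{tr}(R_i^2)$ appear in displays not reproduced here. The sketch below instead uses a naive plug-in of the pooled sample covariance, which conveys the idea of estimating d from data but is not the authors' exact estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, p = 60, 80, 40
y1 = rng.standard_normal((n1, p))
y2 = rng.standard_normal((n2, p))

n = n1 + n2
S1 = np.cov(y1, rowvar=False)            # sample covariance, group 1
S2 = np.cov(y2, rowvar=False)            # sample covariance, group 2
Sn = (n2 / n) * S1 + (n1 / n) * S2       # pooled Sigma_n-hat
dinv_sqrt = 1.0 / np.sqrt(np.diag(Sn))   # D_n-hat^{-1/2}

# naive plug-in R_n-hat = D_n-hat^{-1/2} Sigma_n-hat D_n-hat^{-1/2}
Rn = Sn * np.outer(dinv_sqrt, dinv_sqrt)
tr_R2 = np.trace(Rn @ Rn)
d_hat = p**2 / tr_R2                     # naive plug-in degrees of freedom
```

Because the plug-in $\hat R_n$ has unit diagonal and entries bounded by one in absolute value, the naive $\hat d$ automatically lands in $[1, p]$, mirroring the theoretical bound $1 \le d^* \le d \le p$.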

Asymptotic power
In this section, we investigate the asymptotic power of $T_{n,p}$ when both n and p are large. Under Conditions C1–C4 and (10), by (9), we can write $T_{n,p} = T^*_{n,p}[1+o_p(1)]$, with $T^*_{n,p}$ having the decomposition (11), where $T_{n,p,0}$ and $S_{n,p}$ are defined in (12). Following [10], we consider the power of $T_{n,p}$ under a local alternative; this is the case when the information in the local alternative is weak compared with the variance of $T_{n,p,0}$. Throughout, D and $\Sigma$ are defined as in (17).

Theorem 2.2:
Assume that, as $n, p \to \infty$, $\hat d$ is ratio-consistent for d.

Remark 2.2:
Under the conditions of Theorem 2.2(b), the asymptotic power of $T_{n,p}$ is identical to that of the test of [10]. However, this is not the case under the conditions of Theorem 2.2(a).

Effect of the covariance homogeneity
In this section, we briefly discuss the effect of covariance homogeneity on the proposed normal reference scale-invariant test, which is a covariance-heteroscedasticity-based test. To this end, assume that the two samples (1) actually have the same covariance matrix, i.e. $\Sigma_1 = \Sigma_2 = \Sigma$. In this case, we have $\Sigma_n = (n_2/n)\Sigma_1 + (n_1/n)\Sigma_2 = \Sigma$ and $D_n = \operatorname{diag}(\Sigma)$. It follows that $R_n = D_n^{-1/2}\Sigma_n D_n^{-1/2} = \operatorname{diag}(\Sigma)^{-1/2}\,\Sigma\,\operatorname{diag}(\Sigma)^{-1/2}$, which is the same as the matrix R defined in [15]. Therefore, the approximation parameter d of our covariance-heteroscedasticity-based test reduces to the approximation parameter of the covariance-homogeneity-based test in [15]. In addition, the estimator $\hat d$ in (26) remains consistent for d. That is, our test continues to work when the two covariance matrices are actually the same.

Numerical results
In this section, we first present a simulation study on the effect of the adjustment coefficient $c_{n,p}$ in (5) used in the SKK-test statistic (4), and then conduct two simulation studies to compare the proposed normal reference scale-invariant test against several well-known existing competitors for high-dimensional Behrens–Fisher problems in terms of size control and power. The section ends with a real data application.
To measure the overall performance of a test in terms of maintaining the nominal size α, we adopt the average relative error (ARE) of [12], defined as $\mathrm{ARE} = 100 M^{-1}\sum_{r=1}^{M} |\hat\alpha_r - \alpha|/\alpha$, where $\hat\alpha_r$, $r=1,\ldots,M$, are the empirical sizes under consideration. A smaller ARE value indicates a better overall performance of the associated test in terms of size control. Throughout the simulation studies in this section, the nominal size is α = 5% and the number of replications is 10,000.
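The ARE criterion is straightforward to compute; a minimal sketch (the helper name `are` and the example size vectors are ours, not taken from the paper's tables):

```python
import numpy as np

def are(alpha_hat, alpha=0.05):
    """Average relative error (in %) of empirical sizes alpha_hat
    around the nominal size alpha: 100 * mean(|alpha_hat - alpha|) / alpha."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    return 100.0 * np.mean(np.abs(alpha_hat - alpha)) / alpha

# hypothetical empirical sizes from a well-calibrated and a poorly
# calibrated test, respectively
sizes_good = [0.048, 0.052, 0.050, 0.046]
sizes_bad = [0.010, 0.120, 0.002, 0.090]
are_good = are(sizes_good)   # small: sizes hug the nominal 5%
are_bad = are(sizes_bad)     # large: sizes far from the nominal 5%
```

A test whose empirical sizes stay within a couple of tenths of a percentage point of 5% yields an ARE in the single digits, while badly liberal or conservative tests yield AREs near 100.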

A simulation study on the adjustment coefficient c n,p
Note that the adjustment coefficient $c_{n,p}$ in (5), used in the SKK-test statistic (4), aims to improve the convergence of the null distribution of $T_{\mathrm{SKK}}$ to N(0, 1). It was used by [9] in a scale-invariant test statistic for a high-dimensional two-sample problem when the covariance matrices of the two samples are the same, and it has also been used in other high-dimensional tests developed by Dr. Srivastava and his collaborators. However, to the best of our knowledge, its effect on high-dimensional tests has not been well studied in the literature. In this section, we address this problem carefully by studying its effect on both $T_{n,p}$ and $T_{\mathrm{SKK}}$.
Following the way [10] adjusts the SKK-test statistic (4) using the adjustment coefficient $c_{n,p}$, we adjust our test statistic $T_{n,p}$ by replacing $\operatorname{tr}(\hat R_n^2)$ with $\operatorname{tr}(\hat R_n^2)c_{n,p}$ in the expression (26) of the estimated approximate degrees of freedom. The resulting estimated approximate degrees of freedom is then $\hat d_c = \hat d/c_{n,p}$, and the resulting test, denoted $T^c_{n,p}$, is conducted with $\hat d$ replaced by $\hat d_c$. To study the effect of $c_{n,p}$ on $T^c_{n,p}$ and $T_{\mathrm{SKK}}$ in terms of size control, we conduct a small simulation study as described below.
To generate the two samples (1), we set $y_{ij} = \mu_i + \Sigma_i^{1/2} z_{ij}$, where the entries of $z_{ij}$ are i.i.d. N(0, 1), and we specify the covariance matrices $\Sigma_1$ and $\Sigma_2$ via two tuning parameters $\rho_1$ and $\rho_2$. Note that the diagonal elements of $\Sigma_i$ are different, that a larger difference between $\rho_1$ and $\rho_2$ yields a larger difference between $\Sigma_1$ and $\Sigma_2$, and that the value of $\rho_i$ controls the correlation among the p variables of the generated data: the larger the value of $\rho_i$, the larger the correlation among the p variables. The absolute correlation $|r_{i,k\ell}|$, $i=1,2$, decays as $|k-\ell|$ increases. We specify the tuning parameters as follows: p ∈ {200, 500, 1000}, and $n_1 = 0.8n_0$ and $n_2 = 1.2n_0$ with $n_0$ ∈ {200, 300, 400}. When the data are nearly uncorrelated, i.e. when $[\rho_1,\rho_2]=[0.01,0.01]$, the null distributions of the test statistics are close to N(0, 1). However, when the data are moderately or highly correlated, i.e. when $[\rho_1,\rho_2]=[0.5,0.3]$ and $[0.9,0.5]$, the three tests perform quite differently in terms of size control, showing that in these two cases the adjustment coefficient $c_{n,p}$ actually worsens the performance of $T^c_{n,p}$ and $T_{\mathrm{SKK}}$. Therefore, the adjustment coefficient $c_{n,p}$ cannot be applied blindly to high-dimensional tests.
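The exact $\Sigma_i$ used in the study is given in a display not reproduced here; the sketch below builds a stand-in covariance with the two described features (correlations decaying as `rho ** |k - l|` and unequal diagonal entries) and generates $y_{ij} = \mu_i + \Sigma_i^{1/2} z_{ij}$. The U(1, 3) range for the standard deviations is our assumption.

```python
import numpy as np

def make_sigma(p, rho, rng):
    """Stand-in covariance: correlations decaying as rho**|k-l| and
    unequal diagonal entries (the paper's exact Sigma_i is not shown)."""
    corr = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    sd = rng.uniform(1.0, 3.0, size=p)   # differing scales (assumed range)
    return corr * np.outer(sd, sd)

rng = np.random.default_rng(3)
p, n1 = 50, 30
Sigma1 = make_sigma(p, 0.5, rng)
mu1 = np.zeros(p)

# y_ij = mu_i + Sigma_i^{1/2} z_ij, realized via the Cholesky factor L
# (L L^T = Sigma_1, so L z has covariance Sigma_1 for z ~ N(0, I_p))
L = np.linalg.cholesky(Sigma1)
y1 = mu1 + rng.standard_normal((n1, p)) @ L.T
```

The Cholesky factor is used in place of the symmetric square root purely for convenience; any matrix square root of $\Sigma_1$ yields the same distribution.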
A question naturally arises: in what situations can we apply the adjustment coefficient $c_{n,p}$ to a test? [10] pointed out that, ideally, $c_{n,p}$ should converge to 1 in probability as $n,p\to\infty$. However, in finite-sample situations, the values of $c_{n,p}$ can be much larger than 1. Table 2 displays the values of $c_{n,p}$ in the simulation study mentioned above, together with the associated values of $\hat d$ and $\hat d_c$ under various configurations. According to Tables 1 and 2, we may use $c_{n,p} \le 1.2$ as an empirical criterion for applying the adjustment coefficient properly; that is, we adjust $\hat d$ only when $c_{n,p} \le 1.2$. For notational simplicity, from now on the resulting estimated approximate degrees of freedom is still denoted by $\hat d_c$, and the resulting test, still denoted $T^c_{n,p}$, is conducted with $\hat d$ replaced by $\hat d_c$. We shall further study the performance of $T^c_{n,p}$ in the following sections.
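The empirical rule is simple enough to state in code; the helper name `adjusted_dof` is ours, and the 1.2 threshold comes directly from the criterion above.

```python
def adjusted_dof(d_hat, c_np, threshold=1.2):
    """Empirical rule from the simulation study: divide the estimated
    degrees of freedom by c_np only when c_np <= threshold."""
    return d_hat / c_np if c_np <= threshold else d_hat

d1 = adjusted_dof(50.0, 1.1)   # nearly-normal regime: adjust
d2 = adjusted_dof(50.0, 3.0)   # skewed regime: leave unadjusted
```

When $c_{n,p}$ is close to 1 the adjustment slightly shrinks the degrees of freedom and sharpens the normal-regime approximation; when $c_{n,p}$ is large, dividing by it would badly distort the reference distribution, so $\hat d$ is left alone.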

Simulation studies
In this section, we conduct two simulation studies, namely Simulations 1 and 2, to compare the proposed normal reference scale-invariant test $T^c_{n,p}$ against some well-known existing competitors in terms of size control and power, including the non-scale-invariant test proposed by [4] and the scale-invariant test proposed by [10], denoted $T_{CQ}$ and $T_{\mathrm{SKK}}$, respectively.
Model 3 generates the entries of $z_{ij}$ from a standardized $\chi^2_2$ distribution. The above three models generate the entries of $z_{ij}$ with three different distributions: normal, non-normal but symmetric, and non-normal and skewed.

Simulation 1
In this simulation study, we set the covariance matrices and the associated tuning parameters as in Section 3.1. First, we compare $T_{CQ}$, $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ in terms of size control. Table 3 displays the empirical sizes of the three tests, with the associated values of $\hat d_c$ and $c_{n,p}$ listed in Table 4 for space saving. It is seen that when $[\rho_1,\rho_2]=[0.5,0.3]$ and $[0.9,0.5]$, the small values of $\hat d_c$ and the large values of $c_{n,p}$ indicate that the normal approximations to the underlying null distributions of the tests are less adequate or inadequate. This explains why, in these two cases, $T_{CQ}$ and $T_{\mathrm{SKK}}$ do not perform well in terms of size control. Finally, we compare the empirical powers of the three tests. Table 5 displays the empirical powers of $T_{CQ}$, $T_{\mathrm{SKK}}$ and $T^c_{n,p}$. It is seen that $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ generally have much higher powers than $T_{CQ}$ under various configurations, and $T_{CQ}$ is nearly powerless. This is reasonable since the diagonal entries of both $\Sigma_1$ and $\Sigma_2$ are different, and this information is taken into account by both $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ but not by $T_{CQ}$. From Table 5, it is also seen that when $[\rho_1,\rho_2]=[0.01,0.01]$, the powers of $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ are generally comparable. However, when $[\rho_1,\rho_2]=[0.5,0.3]$ and $[0.9,0.5]$, $T^c_{n,p}$ has much higher powers than $T_{\mathrm{SKK}}$. The low powers of $T_{\mathrm{SKK}}$ are probably due to the blind use of the adjustment coefficient $c_{n,p}$ in the definition of $T_{\mathrm{SKK}}$.

Simulation 2
In this simulation, we generate data from a moving average model which has been used by several researchers including [4]: the k-th component $y_{ijk}$ of $y_{ij}$ is a moving average of the i.i.d. innovations $z_{ij\ell}$, $\ell = 1,\ldots,2p-1$, generated in the same ways as described in Simulation 1. For the configurations of dependence among the p components of $y_{ij}$, we consider two cases, partial dependence and full dependence, with the coefficients $\rho_1$'s generated from U(2, 3) and the $\rho_2$'s from U(1, 2). They are kept fixed throughout the simulation study once generated.
Other tuning parameters are specified in the same way as in Simulation 1.
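The moving average display itself is not reproduced here; a common form consistent with the text's 2p−1 innovations per observation is $y_{ijk} = \sum_{\ell=1}^{p} \rho_{i\ell}\, z_{ij,k+\ell-1}$, which the sketch below assumes (the coefficient range U(2, 3) for group 1 comes from the text).

```python
import numpy as np

rng = np.random.default_rng(4)
p, n1 = 40, 25
# coefficients drawn once and kept fixed, as in the text (U(2,3) for group 1)
rho1 = rng.uniform(2.0, 3.0, size=p)

def ma_sample(n, p, coef, rng):
    """y_{ijk} = sum_{l=1}^p coef_l * z_{ij,k+l-1}, built from 2p-1 i.i.d.
    N(0,1) innovations per observation (assumed form of the model)."""
    z = rng.standard_normal((n, 2 * p - 1))
    y = np.empty((n, p))
    for k in range(p):
        y[:, k] = z[:, k:k + p] @ coef   # sliding window of length p
    return y

y1 = ma_sample(n1, p, rho1, rng)
corr_adjacent = np.corrcoef(y1[:, 0], y1[:, 1])[0, 1]
```

Because every component is a length-p window over the same innovation stream, all components share one variance and adjacent components are strongly positively correlated, matching the "full dependence" flavor of the design.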
We first compare the performance of the three tests in terms of size control. Table 6 displays the empirical sizes of the three tests under consideration. For the partial dependence case, both $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ outperform $T_{CQ}$, indicating that the adjustment coefficient $c_{n,p}$ may indeed improve the performance of $T_{\mathrm{SKK}}$ and $T^c_{n,p}$ when the simulated data are nearly uncorrelated. However, for the full dependence case, i.e. when the simulated data are highly correlated, $T^c_{n,p}$ performs much better than the other two tests: $T_{CQ}$ is quite liberal while $T_{\mathrm{SKK}}$ is extremely conservative. The poor performance of $T_{CQ}$ and $T_{\mathrm{SKK}}$ may be partially explained by the fact that their asymptotic null distributions are unlikely to be normal, as indicated by the small values of $\hat d_c$ and the large values of $c_{n,p}$. The blind application of the adjustment coefficient $c_{n,p}$ to $T_{\mathrm{SKK}}$ also contributes to its poor performance. These conclusions are similar to those drawn from Table 3.
We now compare the performance of the three tests in terms of power. Table 7 displays the empirical powers of the three tests. It is seen that for the partial dependence case, the empirical powers of the three tests are generally comparable, with those of $T_{CQ}$ slightly higher and those of $T^c_{n,p}$ slightly lower. This is because the diagonal elements of the covariance matrices are the same in this moving average model, and the empirical sizes of $T_{CQ}$ are also slightly larger while those of $T^c_{n,p}$ are slightly smaller. For the full dependence case, the empirical powers of $T_{CQ}$ are still slightly larger than those of $T^c_{n,p}$, again because the empirical sizes of $T_{CQ}$ are slightly larger and those of $T^c_{n,p}$ slightly smaller. However, the empirical powers of $T_{\mathrm{SKK}}$ are much smaller than those of $T^c_{n,p}$, showing the strong impact of the blind application of the adjustment coefficient $c_{n,p}$ to $T_{\mathrm{SKK}}$. These conclusions are similar to those drawn from Table 5. (In Table 6, the columns associated with $\hat d_c$ and $c_{n,p}$ list the values of $\hat d_c$ and $c_{n,p}$ calculated using (31) and (5), respectively.) In summary, the proposed test $T^c_{n,p}$ generally outperforms $T_{CQ}$ and $T_{\mathrm{SKK}}$ in terms of size control and power. While $T_{CQ}$ is generally liberal and less powerful when the diagonal entries of the two covariance matrices are not the same and when the data are moderately or highly correlated, $T_{\mathrm{SKK}}$ performs well only when its asymptotic null distribution is nearly normal.

Application to the colon data
We now apply $T^c_{n,p}$, together with $T_{CQ}$ and $T_{\mathrm{SKK}}$, to the colon data introduced in Section 1, aiming to check whether the normal colon tissues and the tumor colon tissues have the same mean expression levels, without assuming that the two tissue groups have the same covariance matrix; in high-dimensional settings, it is often challenging to check whether two samples have the same covariance matrix. Table 8 displays the results of the three tests. It is seen that both $T_{CQ}$ and $T^c_{n,p}$ suggest a strong rejection of the null hypothesis, indicating that the mean gene expression levels of the normal colon tissues and the tumor colon tissues are unlikely to be the same. The p-value of $T_{CQ}$ is less trustworthy than that of $T^c_{n,p}$ because we have $\hat d_c = 5$ and $c_{n,p} = 10.93$, showing that the asymptotic null distribution of $T_{CQ}$ is unlikely to be normal and hence the normal approximation to the null distribution of $T_{CQ}$ is not adequate. $T_{\mathrm{SKK}}$, on the other hand, does not suggest a rejection of the null hypothesis, showing that $T_{\mathrm{SKK}}$ has low power in detecting the differences between the mean gene expression levels of the two colon tissue groups. This is not a surprise since, according to the simulation results presented in Simulation 1, the power of $T_{\mathrm{SKK}}$ can be substantially reduced when the underlying distribution of $T_{\mathrm{SKK}}$ is actually not normal.

Concluding remarks
We have proposed and studied the normal-reference scale-invariant test T_{n,p} and compared it against the non-scale-invariant test T_CQ of [4] and the scale-invariant test T_SKK of [10] for the high-dimensional two-sample Behrens–Fisher problem in which the two high-dimensional samples may have different covariance matrices. Both competitors are normal-limiting-distribution-based tests in the sense that their null distributions are shown to be asymptotically normal under some strong conditions and are approximated by normal distributions. We showed that under some regularity conditions, T_{n,p} and a chi-square-type mixture, after normalization, have the same normal and non-normal limiting distributions. Thus, it is not always appropriate to approximate the null distribution of T_{n,p} using just one of its limiting distributions, either the normal or the non-normal one. However, it is justified to approximate the null distribution of T_{n,p} using that of the chi-square-type mixture, which arises naturally from T_{n,p} under the null hypothesis when the data are normally distributed. Since the normal and non-normal limiting distributions are not used to approximate the null distribution of T_{n,p}, there is no need to check whether the key conditions (e.g. Conditions C5 and C6 in Section 2.1) hold. This is an advantage since, in real data analysis, it is rather challenging to check whether these key conditions hold even approximately. The distribution of the chi-square-type mixture can be well approximated by that of a scaled chi-square random variable of the form G ∼ χ²_d/d, with the approximate degrees of freedom d consistently estimated from the data. The distribution of G, as an approximation to the null distribution of T_{n,p}, is adaptive to the shape of the underlying distribution of T_{n,p}: when that distribution is approximately normal, d is large; otherwise, d is moderate or small.
This adaptivity is an advantage of a normal-reference test that is not shared by T_CQ and T_SKK, since they are normal-limiting-distribution-based tests. This partially explains why T_{n,p} has good size control regardless of whether the high-dimensional data are nearly uncorrelated, moderately correlated, or highly correlated, while T_CQ and T_SKK have good size control only when the high-dimensional data are nearly uncorrelated, as demonstrated by the simulation results presented in Section 3.2.
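The Welch–Satterthwaite matching underlying G can be sketched as follows. Assuming the chi-square-type mixture takes the standard form of a nonnegative combination of independent χ²_1 variables with coefficients λ_i (the paper estimates the resulting d from the data via (31); the coefficients below are purely hypothetical), matching the first two moments of the normalized mixture with those of χ²_d/d gives d = (Σλ_i)²/Σλ_i²:

```python
def ws_degrees_of_freedom(lams):
    """Welch-Satterthwaite approximate degrees of freedom for a
    chi-square-type mixture sum_i lams[i] * chi2_1: match the first two
    moments of the normalized mixture with those of chi2_d / d."""
    s1 = sum(lams)
    s2 = sum(l * l for l in lams)
    return s1 * s1 / s2

# Hypothetical coefficient profiles (illustrative only, not from the paper):
d_flat = ws_degrees_of_freedom([1.0] * 30)               # near-equal weights: d = 30
d_spiked = ws_degrees_of_freedom([8.0, 1.0, 0.5, 0.25])  # one dominant weight: d is small
print(d_flat, d_spiked)
```

Near-equal coefficients (nearly uncorrelated data) yield a large d, so χ²_d/d is close to normal; a few dominant coefficients (highly correlated data) yield a small d and a markedly skewed reference distribution, which is exactly the adaptivity described above.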
Both T_{n,p} and T_SKK are scale-invariant but T_CQ is not, since T_{n,p} and T_SKK take the variances of the p variables of the high-dimensional data into account in their test statistic constructions while T_CQ does not. This partially explains why both T_{n,p} and T_SKK have higher power than T_CQ when the variances of the p variables are very different (see Simulation 1 for details), and why the three tests have comparable power when these variances are approximately the same (see Simulation 2 for details). Unfortunately, the power gain of a scale-invariant test over a non-scale-invariant test is not a free lunch: a scale-invariant test often requires a larger sample size [see (10)] than a non-scale-invariant test, so as to guarantee that the variances of the p variables are accurately estimated, as indicated by (8).
The small simulation study in Section 3.1 indicates that the adjustment coefficient c_{n,p} used in [10]'s test does improve the convergence of T_SKK to a normal distribution when the asymptotic null distribution of T_SKK is nearly normal, but it substantially worsens the size control and power of T_SKK otherwise. This has also been demonstrated by the simulation results presented in Section 3.2. An empirical criterion was developed in Section 3.1 to apply the adjustment coefficient to T_{n,p} judiciously, resulting in the so-called T^c_{n,p} test. When the high-dimensional data are nearly uncorrelated, so that c_{n,p} ≤ 1.2, T^c_{n,p} slightly outperforms or is generally comparable with T_{n,p} in terms of size control and power; otherwise, the two tests are exactly the same.
In real data analysis, the key conditions (e.g. Conditions C5 and C6 in Section 2.1) required by a test are often difficult to check, especially in high-dimensional settings. Therefore, when the sample sizes are large (roughly speaking, when n_min ≥ 20 log(p)), T^c_{n,p} is recommended since it has good size control and power regardless of whether the high-dimensional data are nearly uncorrelated, moderately correlated, or highly correlated. When the sample sizes are large and the high-dimensional data are nearly uncorrelated (roughly speaking, when d̂ ≥ 30 or c_{n,p} ≤ 1.2), T_SKK may also be used; otherwise, it is not recommended since it cannot maintain its size well and has low power. When the sample sizes are moderately large (roughly speaking, when 5 log(p) ≤ n_min < 20 log(p)) and the high-dimensional data are nearly uncorrelated (again, when d̂ ≥ 30 or c_{n,p} ≤ 1.2), T_CQ may be used; otherwise, it is not recommended since it cannot maintain its size well. In the latter case, the normal-reference test T_ZZGZ proposed and studied by [14] is recommended.
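These rules of thumb can be collected in a small helper. The thresholds follow the guidelines above; the base of the logarithm (natural log assumed here) and the fallback to T_ZZGZ when no guideline applies are our reading of the text, so this is an interpretive sketch rather than a definitive procedure:

```python
import math

def recommend_test(n_min, p, d_hat=None, c_np=None):
    """Rules-of-thumb test selection for the two-sample BF problem.
    Thresholds (20*log(p), 5*log(p), d_hat >= 30, c_np <= 1.2) follow the
    concluding remarks; natural log and the final fallback are assumptions."""
    nearly_uncorrelated = ((d_hat is not None and d_hat >= 30)
                           or (c_np is not None and c_np <= 1.2))
    if n_min >= 20 * math.log(p):
        return "T^c_{n,p}"   # large samples: recommended in all correlation regimes
    if n_min >= 5 * math.log(p) and nearly_uncorrelated:
        return "T_CQ"        # moderately large samples, nearly uncorrelated data
    return "T_ZZGZ"          # otherwise: the normal-reference test of [14]

# Illustrative (hypothetical) inputs; note 20*log(2000) is about 152:
print(recommend_test(200, p=2000, c_np=10.93))  # T^c_{n,p}
print(recommend_test(60, p=2000, d_hat=45))     # T_CQ
print(recommend_test(60, p=2000, c_np=10.93))   # T_ZZGZ
```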
In many real applications, it is often of interest to test the equality of the mean vectors of several high-dimensional samples without assuming the equality of their group covariance matrices. This problem is often referred to as a heteroscedastic one-way multivariate analysis of variance (MANOVA) problem for high-dimensional data. An extension of our normal-reference scale-invariant test to this high-dimensional MANOVA problem is of interest and warranted.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The work was supported by the National Research Foundation Singapore academic research grant [R-155-000-187-114].