Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem

ABSTRACT This article concerns tests for the two-sample location problem when the data dimension is larger than the sample size. Existing multivariate-sign-based procedures are not robust against high dimensionality, producing tests with Type I error rates far from nominal levels. This is mainly due to the biases from estimating location parameters. We propose a novel test to overcome this issue by using the "leave-one-out" idea. The proposed test statistic is scalar-invariant and thus is particularly useful when different components have different scales in high-dimensional data. Asymptotic properties of the test statistic are studied. Simulation studies show that, compared with other existing approaches, the proposed method behaves well in terms of size and power. Supplementary materials for this article are available online.

We consider testing the two-sample location hypothesis $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_1 \neq \theta_2$, referred to as (1) below. Such a hypothesis test plays an important role in a number of statistical problems. A classical method for this problem is the famous Hotelling's $T^2$ test statistic $T_n^2 = \frac{n_1 n_2}{n}(\bar X_1 - \bar X_2)^T S_n^{-1}(\bar X_1 - \bar X_2)$, where $n = n_1 + n_2$, $\bar X_1$ and $\bar X_2$ are the two sample means, and $S_n$ is the pooled sample covariance. However, $T_n^2$ is undefined when the dimension of the data is greater than the within-sample degrees of freedom, the so-called "large p, small n" case that is largely motivated by the identification of significant genes in microarray and genetic sequence studies (Kosorok and Ma 2007).
The "large p, small n" situation refers to high-dimensional data whose dimension p increases to infinity as the number of observations n → ∞. With the rapid development of technology, various types of high-dimensional data have been generated in many areas, such as hyperspectral imagery, internet portals, microarray analysis, and DNA sequencing. Like Hotelling's $T^2$ test mentioned above, traditional methods may no longer work in this situation since they assume that p stays fixed as n increases. This challenge calls for new statistical tests for high-dimensional data; see Dempster (1958), Bai and Saranadasa (1996), Srivastava and Du (2008), and Chen and Qin (2010) for two-sample tests for means, Ledoit and Wolf (2002), Schott (2005), Chen, Zhang, and Zhong (2010), and Zou et al. (2014) for testing a specific covariance structure, and Goeman, Van De Geer, and Van Houwelingen (2006) for testing high-dimensional regression coefficients. For the two-sample problem (1), Bai and Saranadasa (1996) proposed a test statistic based on the squared Euclidean distance $\|\bar X_1 - \bar X_2\|^2$. The key feature of Bai and Saranadasa's proposal is to use the Euclidean norm in place of the Mahalanobis norm, since having $S_n^{-1}$ is no longer beneficial when p/n → c > 0. Chen and Qin (2010) considered removing the terms $\sum_{j=1}^{n_i} X_{ij}^T X_{ij}$, i = 1, 2, from $\|\bar X_1 - \bar X_2\|^2$ because these terms impose demands on the dimensionality. However, neither of these two methods is scalar-invariant. In practice, different components may have completely different physical or biological readings and thus their scales would certainly not be identical. For example, Nettleton, Recknor, and Reecy (2008) pointed out that "it is well known that different genes exhibit different levels of variation. If this heterogeneity of variance is not accounted for, genes with larger variability can dominate the results of the proposed test." As a result, prior to analysis the authors applied scaling, which is equivalent to standardizing the data for each gene to a common variance.
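To make the construction concrete, the following Python sketch (not part of the original articles; the function name `cq_statistic` is ours) computes the Chen–Qin variant of the squared Euclidean distance, with the within-sample terms $\sum_j X_{ij}^T X_{ij}$ removed:

```python
import numpy as np

def cq_statistic(X1, X2):
    """Numerator of a Chen-Qin (2010)-style two-sample statistic: the squared
    Euclidean distance ||mean(X1) - mean(X2)||^2 with the within-sample terms
    sum_j X_ij^T X_ij removed. Rows of X1, X2 are observations."""
    n1, n2 = X1.shape[0], X2.shape[0]
    G1 = X1 @ X1.T  # Gram matrix of sample 1; off-diagonals are X_1i^T X_1j
    G2 = X2 @ X2.T
    term1 = (G1.sum() - np.trace(G1)) / (n1 * (n1 - 1))
    term2 = (G2.sum() - np.trace(G2)) / (n2 * (n2 - 1))
    cross = (X1 @ X2.T).sum() / (n1 * n2)
    return term1 + term2 - 2.0 * cross
```

Removing the diagonal Gram terms makes the statistic an unbiased estimator of $\|\mu_1 - \mu_2\|^2$; algebraically it equals $\|\bar X_1 - \bar X_2\|^2 - \operatorname{tr}(S_1)/n_1 - \operatorname{tr}(S_2)/n_2$ with $S_i$ the sample covariances.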
Note that both the Bai and Saranadasa (1996) and Chen and Qin (2010) tests take the sum of all p squared mean differences without using the information from the diagonal elements of the sample covariance, that is, the variances of the variables, and thus their power would heavily depend on the underlying variance magnitudes. To address this issue, Srivastava and Du (2008) proposed a scalar-transformation-invariant test under the assumption of equality of the two covariance matrices. Srivastava, Katayama, and Kano (2013) further extended this approach to unequal covariance matrices and revealed, by asymptotic and numerical analysis, that their test has certain advantages over the Chen and Qin (2010) test. Park and Ayyala (2013) considered a similar setting with the idea of leave-one-out cross-validation as employed in Chen and Qin (2010). Recently, Feng et al. (2015) and Gregory et al. (2014) also considered cases where heteroscedasticity is present. Both articles addressed the issue of nonnegligible bias terms due to plugging in variance estimators, and the latter considered a higher-order expansion for bias correction.
Although essentially nonparametric in spirit, the statistical performance of the moment-based tests mentioned above would be degraded when the nonnormality is severe, especially for heavy-tailed distributions. This motivates us to consider multivariate sign- and/or rank-based approaches to construct robust tests for (1). Many nonparametric methods have been developed, as a reaction to the Gaussian approach of Hotelling's test, with the objective of extending the classical univariate rank and signed-rank techniques to the multivariate context. Essentially, these methods belong to three main groups. The first group relies on componentwise rankings (Puri and Sen 1971), but suffers from a severe lack of invariance with respect to affine transformations, which has been the main motivation for the other two approaches. The second group uses the concept of interdirections (Peters and Randles 1991), while the third (Hettmansperger and Oja 1994; Möttönen and Oja 1995; Randles 2000) is closely related to the spatial signs and ranks along with the use of the so-called spatial median. See Oja (2010). This work belongs to the third group, with emphasis on applicability and effectiveness in high-dimensional environments.
Most of the tests proposed in the works mentioned above are based on spatial signs and ranks of the norms of observations centered at θ (an estimate in practice), with test statistics that have structures similar to $T_n^2$. Those statistics are distribution-free under mild assumptions, or asymptotically so. Please refer to chap. 11 of Oja (2010) for a nice overview. Among them, due to its simplicity and effectiveness, the test based entirely on spatial signs is of particular interest and has been discussed in detail. In this article, we focus on this type of test and show that some "off-the-shelf" modifications for high-dimensional data are not robust against high dimensionality, in the sense that they would produce tests with Type I errors far from nominal levels. This is mainly due to additional biases yielded by using an estimate of the location parameter in place of the true one. In the next section, we develop a novel remedy that is robust against high dimensionality. We show that the proposed test statistic is asymptotically normal under elliptical distributions. Simulation comparisons show that our procedure performs reasonably well in terms of size and power for a wide range of dimensions, sample sizes, and distributions. Recently, Wang, Peng, and Li (2015) proposed a high-dimensional nonparametric multivariate test for the mean vector based on spatial signs. However, their focus is on the one-sample problem, and thus their method is significantly different from our proposal, as we explain in a later section.

The Proposed Test Statistic
We develop the test for (1) under the elliptically symmetric assumption, which is commonly adopted in the literature on multivariate-sign-based approaches (Oja 2010). Assume $\{X_{i1}, \ldots, X_{in_i}\}$, i = 1, 2, are two independent, identically distributed (iid) random samples from p-variate elliptical distributions with density functions $\det(\Sigma_i)^{-1/2} g_i(\|\Sigma_i^{-1/2}(x - \theta_i)\|)$, i = 1, 2, where the $\theta_i$'s are the symmetry centers and the $\Sigma_i$'s are positive definite symmetric p × p scatter matrices. The spatial sign function is defined as $U(x) = \|x\|^{-1} x\, I(x \neq 0)$. Denote $\varepsilon_{ij} = \Sigma_i^{-1/2}(X_{ij} - \theta_i)$. The modulus $\|\varepsilon_{ij}\|$ and the direction $u_{ij} = U(\varepsilon_{ij})$ are independent, and the direction vector $u_{ij}$ is uniformly distributed on the p-dimensional unit sphere. It is then well known that $E(u_{ij}) = 0$ and $\operatorname{cov}(u_{ij}) = p^{-1} I_p$.
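A minimal implementation of the spatial sign function (the function name is ours):

```python
import numpy as np

def spatial_sign(x):
    """Spatial sign U(x) = x / ||x|| for x != 0, and 0 at the origin."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else np.zeros_like(x)
```

For spherically distributed $\varepsilon$, the vectors `spatial_sign(eps)` are uniform on the unit sphere, which is what gives $E(u) = 0$ and $\operatorname{cov}(u) = p^{-1} I_p$.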
In the traditional fixed-p setting, the following so-called "inner centering and inner standardization" sign-based procedure is usually used (Oja 2010, sec. 11.3): a statistic $Q_n^2$ is constructed from the spatial signs $U(S^{-1/2}(X_{ij} - \hat\theta))$, where $\hat\theta$ and $S^{-1/2}$ are the Hettmansperger and Randles (2002) estimates (HRE) of the location and scatter matrix, so that $Q_n^2$ is affine-invariant and can be regarded as a nonparametric counterpart of $T_n^2$ obtained by using the spatial signs instead of the original observations $X_{ij}$. However, when p > n, $Q_n^2$ is not defined because the matrix $S^{-1/2}$ is not available in high-dimensional settings. Though there has been a growing body of research on large-scale covariance matrix estimation under certain sparseness assumptions (Bickel and Levina 2008; Cai and Liu 2011), obtaining a sparse estimator of the scatter matrix in the robust context appears to be even more complicated and has not been thoroughly addressed in the literature. More importantly, the effectiveness of Mahalanobis-distance-based tests is adversely impacted by an increased dimension even when p < n, reflecting a reduced degree of freedom in estimation when the dimensionality is close to the sample size. The contamination bias, which grows rapidly with p, would make Mahalanobis-distance-based tests unreliable for large p. Bai and Saranadasa (1996) provided certain asymptotic justifications, and Goeman, Van De Geer, and Van Houwelingen (2006) contained some numerical evidence. Please refer to the numerical comparison in Section 3 and some technical discussion in Appendix F of the supplemental material.
Alternatively, we develop a scalar-transformation-invariant test along the lines of Srivastava, Katayama, and Kano (2013), which is able to integrate all the individual information in a relatively "fair" way. To this end, we suggest finding a pair of a diagonal matrix $D_i$ and a vector $\theta_i$ for each sample that simultaneously satisfy a set of estimating equations, labeled (3), which can be viewed as a simplified version of HRE that ignores the off-diagonal elements of S. We can adapt the recursive algorithm of Hettmansperger and Randles (2002) to solve (3); that is, we repeat a three-step update until convergence. The resulting estimators of the location and diagonal matrix are denoted as $\hat\theta_i$ and $\hat D_i$, i = 1, 2, respectively. We may use the sample mean and sample variances as the initial estimators.
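The update steps are not reproduced above; the following Python sketch assumes the standard Hettmansperger–Randles update forms restricted to a diagonal scatter matrix (the function name `diag_hr_estimate` and the exact update forms are our assumptions, not necessarily the paper's equation (3)):

```python
import numpy as np

def diag_hr_estimate(X, tol=1e-8, max_iter=500):
    """Sketch of an HR-type recursion for a location vector theta and the
    diagonal of a scatter matrix D (stored as a vector d):
      1. standardize: eps_j = d^{-1/2} * (X_j - theta)
      2. location step: shift theta by a weighted average of the spatial signs
      3. scale step: rescale d so each sign coordinate has variance 1/p
    Initial values: sample mean and sample variances."""
    n, p = X.shape
    theta = X.mean(axis=0)
    d = X.var(axis=0, ddof=1)
    for _ in range(max_iter):
        eps = (X - theta) / np.sqrt(d)
        norms = np.linalg.norm(eps, axis=1)
        u = eps / norms[:, None]  # spatial signs of standardized data
        theta_new = theta + np.sqrt(d) * u.mean(axis=0) / np.mean(1.0 / norms)
        d_new = d * p * np.mean(u**2, axis=0)
        done = (np.abs(theta_new - theta).max() < tol
                and np.abs(d_new - d).max() < tol)
        theta, d = theta_new, d_new
        if done:
            break
    return theta, d
```

At a fixed point, the spatial signs of the standardized data average to zero (the inner-centering condition) and each sign coordinate has mean square $1/p$ (the inner-standardization condition restricted to the diagonal).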
It appears that we could construct $Q_n^2$ with a pooled sample estimate, $(\hat\theta, \hat D)$, obtained by using (3). However, this would yield a bias term that is not negligible (relative to the standard deviation) when n/p = O(1), because we replace the true spatial median with its estimate. It seems infeasible to develop a bias-correction procedure as done in Zou et al. (2014) because the bias term depends on the unknown quantities $\Sigma_i$. Please refer to Appendix C in the supplemental material for the closed form of this bias and the associated asymptotic analysis. In fact, the test statistic proposed by Wang, Peng, and Li (2015) is constructed in a fashion similar to $Q_n^2$. However, their method does not suffer from additional biases because in a one-sample problem the estimate of the spatial median is not needed.
To overcome this difficulty, we propose a test statistic $R_n$ that pairs, over all (i, j), the spatial signs $U(\hat D_{1,i}^{-1/2}(X_{1i} - \hat\theta_{2,j}))$ and $U(\hat D_{2,j}^{-1/2}(X_{2j} - \hat\theta_{1,i}))$, where $\hat\theta_{i,j}$ and $\hat D_{i,j}$ are the corresponding location vectors and diagonal scatter matrices computed from the "leave-one-out" samples $\{X_{ik}\}_{k \neq j}$. Intuitively, if $\theta_1 \neq \theta_2$, both $U(\hat D_{1,i}^{-1/2}(X_{1i} - \hat\theta_{2,j}))$ and $U(\hat D_{2,j}^{-1/2}(X_{2j} - \hat\theta_{1,i}))$ would deviate from 0 to a certain degree, and thus a large value of $R_n$ leads to rejection of the null hypothesis. As shown later, $E(R_n) \propto (\theta_1 - \theta_2)^T D_1^{-1/2} D_2^{-1/2} (\theta_1 - \theta_2)$ (approximately speaking), and hence $R_n$ can well reflect the difference between the two locations, which is basically all we need for testing. A derivation in Appendix A shows that under $H_0$ the expectation of $R_n$ is asymptotically negligible compared with its standard deviation. This feature is particularly useful in the construction of the test because we do not need to estimate its expectation. The following proposition shows that the proposed test statistic $R_n$ is invariant under location shifts and the group of scalar transformations.
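The construction can be sketched directly. In the sketch below we take $R_n$ to be (minus) the average over all pairs of the inner products of the two sign vectors; the normalization and the sign convention (chosen so that a location difference pushes $R_n$ upward, matching the rejection rule) are our assumptions, and `fit_location_scale` is a crude stand-in (coordinatewise median and variances) for the estimating-equation fit in (3):

```python
import numpy as np

def sign(x):
    n = np.linalg.norm(x)
    return x / n if n > 0 else np.zeros_like(x)

def fit_location_scale(X):
    # Stand-in for the fit in (3): coordinatewise median for theta and
    # sample variances for the diagonal of D.
    return np.median(X, axis=0), X.var(axis=0, ddof=1)

def rn_statistic(X1, X2):
    """Leave-one-out spatial-sign statistic: X_1i is standardized with a fit
    computed without X_1i and centered at a fit of theta_2 computed without
    X_2j (and symmetrically); the two sign vectors are paired by an inner
    product.  Sign convention: location differences increase the value."""
    n1, n2 = X1.shape[0], X2.shape[0]
    fits1 = [fit_location_scale(np.delete(X1, i, axis=0)) for i in range(n1)]
    fits2 = [fit_location_scale(np.delete(X2, j, axis=0)) for j in range(n2)]
    total = 0.0
    for i in range(n1):
        th1, d1 = fits1[i]
        for j in range(n2):
            th2, d2 = fits2[j]
            u1 = sign((X1[i] - th2) / np.sqrt(d1))
            u2 = sign((X2[j] - th1) / np.sqrt(d2))
            total += u1 @ u2
    return -total / (n1 * n2)
```

Because each observation is centered and scaled by fits that exclude it, no term pairs an observation with an estimate contaminated by that same observation, which is what removes the bias discussed above.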

Asymptotic Results
Next, we study the asymptotic behavior of R n under the null and local alternative hypotheses.
and $\sigma_n^2 = \frac{2}{n_1(n_1-1)p^2}\operatorname{tr}(A_1^2) + \frac{2}{n_2(n_2-1)p^2}\operatorname{tr}(A_2^2) + \frac{4}{n_1 n_2 p^2}\operatorname{tr}(A_3^T A_3)$. From the proof of Theorem 1, we can see that $\operatorname{var}(R_n) = \sigma_n^2(1 + o(1))$. We need the following conditions for the asymptotic analysis. To appreciate Condition (C2), consider the simple case $D_1 = D_2 = I_p$, $c_1 = c_2$; the condition then becomes $\operatorname{tr}(\Sigma_i \Sigma_j \Sigma_k \Sigma_l) = o(\operatorname{tr}^2((\Sigma_1 + \Sigma_2)^2))$, $i, j, k, l = 1, 2$, which is the same as condition (3.8) in Chen and Qin (2010). To better understand Condition (C3), consider $\Sigma_1$ and $\Sigma_2$ with bounded eigenvalues, which leads to $\sigma_n^2 = O(n^{-2}p^{-1})$. Thus, the condition becomes p = O(n²), which allows the dimensionality to increase as fast as the square of the sample size. Certainly, when p is larger than n², there would be another bias term in $R_n$ that is difficult to calculate; this deserves future research.
Condition (C4) is used to obtain the consistency of the diagonal matrix estimators. If $R_i = I_p$, i = 1, 2, the modulus and the direction of $D_i^{-1/2}(X_{ij} - \theta_i)$ are independent, and accordingly it is easy to obtain the consistency of the diagonal matrix estimator. However, the modulus and the direction of $D_i^{-1/2}(X_{ij} - \theta_i)$ are not independent in the general case. Consider a simple setting with $\operatorname{tr}(R_i^2) = O(p)$ (Srivastava and Du 2008); the condition then reduces to p/n → ∞. When the correlations among the components become larger, the dimension is required to be higher to reduce the difference between the modulus $\|\varepsilon_{ij}\|$ and $\varepsilon_{ij}^T R_i \varepsilon_{ij}$. See the proof of Theorem 1 in Appendix A for more information.

This result suggests rejecting $H_0$ at significance level α whenever $R_n/\hat\sigma_n$ exceeds $z_\alpha$, the upper α quantile of the standard normal distribution.
Next, we consider the asymptotic distribution of $R_n$ under the alternative hypothesis. The local-alternative assumption implies that the difference between $\theta_1$ and $\theta_2$ is not large relative to $c_1^{-1} c_2^{-1} \sigma_n$, so that a workable expression for the variance of $R_n$ can be derived, which in turn leads to an explicit power expression for the proposed test. It can be viewed as a high-dimensional version of the classical local alternative hypotheses.
Theorems 1 and 2 allow us to compare the proposed test with some existing work, such as Chen and Qin (2010) and Srivastava, Katayama, and Kano (2013), in terms of limiting efficiency. We consider local alternatives indexed by a positive constant, under which the asymptotic power of our proposed spatial-sign-based test (abbreviated as SS) can be expressed in terms of the standard normal distribution function $\Phi$. To obtain an explicit expression for comparison, we assume that F = G and $\lambda_{\max}(p^{-1} R_i) = o(n^{-1})$. Then $D_1 = D_2 = D$, $R_1 = R_2 = R$, and $c_1 = c_2 = c_0$; as a consequence, $\tilde\sigma_n = \sigma_n(1 + o(1))$ and the asymptotic power simplifies accordingly. In comparison, by Srivastava, Katayama, and Kano (2013) we can derive the asymptotic power of their proposed test (abbreviated as SKK hereafter), and hence the asymptotic relative efficiency (ARE) of SS with respect to SKK, where we use a fact established in the proof of Theorem 1 under Condition (C4). It can also be shown that for multivariate normal distributions ARE(SS, SKK) → 1 as p → ∞ (see Appendix E in the supplemental material). Hence, the SS test is asymptotically as efficient as the Srivastava, Katayama, and Kano (2013) test in such settings. When the dimension p is not very large, it can be expected that the proposed test, using only the direction of an observation from the origin, should be outperformed by a test constructed from the original observations, like that of Srivastava, Katayama, and Kano (2013). However, as p → ∞, this disadvantage diminishes.
On the other hand, if the $X_{ij}$ are generated from the multivariate t-distribution with ν degrees of freedom (ν > 2), the ARE can be evaluated explicitly; the ARE values for ν = 3, 4, 5, 6 are 2.54, 1.76, 1.51, and 1.38, respectively. Clearly, the SS test is more powerful than SKK when the distributions are heavy-tailed (ν is small), which is verified by the simulation studies in Section 3. In contrast, Chen and Qin (2010) derived the power of their proposed test (abbreviated as CQ). As shown by Srivastava, Katayama, and Kano (2013), this quantity, $\beta_{CQ}$, can be much smaller than $\beta_{SKK}$ in cases where different components have different scales, because the CQ test is not scalar-invariant. To appreciate the effect of scalar-invariance, we consider the multivariate normality assumption and let the $\Sigma_i$ be diagonal matrices, where the variances of the first p/2 components are $\tau_1^2$ and the variances of the other components are $\tau_2^2$. Then the ARE of the proposed test (and likewise SKK) with respect to the CQ test is $\sqrt{\tau_1^4 + \tau_2^4}/(\sqrt{2}\,\tau_1^2)$. It is clear that SS and SKK are more powerful than CQ if $\tau_1^2 < \tau_2^2$, and vice versa. This ARE has a positive lower bound of $1/\sqrt{2}$ when $\tau_1^2 \gg \tau_2^2$, and it can be arbitrarily large if $\tau_1^2/\tau_2^2$ is close to zero. This property shows the necessity of a test with the scale-invariance property.
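For a quick numerical check, assume the ARE of the scalar-invariant tests (SS and SKK) relative to CQ in this half-and-half variance setting takes the form $\sqrt{\tau_1^4 + \tau_2^4}/(\sqrt{2}\,\tau_1^2)$, consistent with the limiting values stated above; then those limits can be verified directly:

```python
import math

def are_ss_vs_cq(tau1_sq, tau2_sq):
    # Assumed ARE form for the scalar-invariant tests (SS/SKK) vs CQ when half
    # the component variances equal tau1^2 and half equal tau2^2 (see text).
    return math.sqrt(tau1_sq**2 + tau2_sq**2) / (math.sqrt(2.0) * tau1_sq)
```

With equal scales the ARE is 1; letting $\tau_1^2 \gg \tau_2^2$ drives it to the lower bound $1/\sqrt{2} \approx 0.707$; letting $\tau_1^2/\tau_2^2 \to 0$ makes it arbitrarily large.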

Bootstrap Procedure and Computational Issue
The proposed SS test is based on asymptotic normality; a good approximation requires that both n and p be large. As shown in Section 3, when p is not large enough (as in Table 1), the empirical size is a little larger than the nominal one. In contrast, when p is too large compared with n, our proposed test is somewhat conservative. The reason is that Condition (C3) essentially prevents us from using the asymptotic normality when p grows at too fast a rate relative to n. To address this issue, we propose a bootstrap procedure for finite-sample cases.
The procedure is implemented as follows. We first center the observations at the estimated locations and draw bootstrap resamples from the centered samples, recomputing the test statistic on each resample. When this procedure is repeated many times, the bootstrap critical value $z^*_\alpha$ is the empirical 1 − α quantile of the bootstrap test statistics. The test with rejection region $R_n \geq z^*_\alpha$ is our proposal. We recommend using this bootstrap method when either p is not large (say, p ≤ 50) or p is very large compared with n (at a rate of O(n²) or faster). We will study the effectiveness of this bootstrap method by simulation in the next section.
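The calibration step can be sketched generically. The following Python sketch assumes a simple within-sample centered resampling scheme, which may differ from the paper's exact procedure (the median-centering and function names are our choices); it returns a bootstrap critical value $z^*_\alpha$ for any supplied statistic:

```python
import numpy as np

def bootstrap_critical_value(X1, X2, statistic, alpha=0.05, B=200, seed=0):
    """Generic bootstrap calibration sketch: center each sample at its own
    location fit (so the null holds for the resampled data), resample within
    each sample with replacement, recompute the statistic, and return the
    empirical (1 - alpha) quantile as the critical value z*_alpha."""
    rng = np.random.default_rng(seed)
    Xc1 = X1 - np.median(X1, axis=0)  # median as a crude location fit
    Xc2 = X2 - np.median(X2, axis=0)
    vals = np.empty(B)
    for b in range(B):
        b1 = Xc1[rng.integers(0, X1.shape[0], X1.shape[0])]
        b2 = Xc2[rng.integers(0, X2.shape[0], X2.shape[0])]
        vals[b] = statistic(b1, b2)
    return np.quantile(vals, 1.0 - alpha)
```

The test then rejects when the statistic computed on the original (uncentered) samples exceeds the returned critical value.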
The leave-one-out procedure may seem complex, but it is in fact fast to compute, and with today's computing power it is computationally feasible to implement the SS test. For example, it takes 1.5 sec to calculate $R_n/\hat\sigma_n$ in Fortran on an Intel Core 2.2 GHz CPU with $n_1 = n_2 = 50$ and p = 1440. Each leave-one-out estimator essentially requires O(pn) computation, and thus the calculation of $R_n$ is of order O(pn³). Also note that computing the estimates of the trace terms needs O(pn²) computation, so the complexity of the entire procedure is O(pn³). In contrast, the SKK test requires O(p²n) computation. When p is large but n is small, such as $n_1 = n_2 = 50$ and p = 1440, our method is even faster than the Srivastava, Katayama, and Kano (2013) test. The Fortran code for implementing the procedure is available in the supplemental material.

Monte Carlo Simulations
Here, we report a simulation study designed to evaluate the performance of the proposed SS test. All the simulation results are based on 2500 replications. The variety of multivariate distributions and parameter settings is too large to allow a comprehensive, all-encompassing comparison, so we choose certain representative examples for illustration. The following scenarios are considered first.
Here, $f_p$ denotes the density function of the p-variate multivariate normal distribution, and γ is chosen to be 0.8. First, we consider the low-dimensional case p < n and compare the SS test with the traditional spatial-sign-based test (abbreviated as TS). We choose $R_1 = R_2 = (\rho_{jk})$ with $\rho_{jk} = 0.5^{|j-k|}$. Without loss of generality, under $H_1$ we fix $\theta_1 = 0$ and choose $\theta_2$ as follows. The percentage of components with $\theta_{1l} = \theta_{2l}$, l = 1, ..., p, is chosen to be 95% and 50%, respectively. At each percentage level, two patterns of allocation are explored for the nonzero $\theta_{2l}$: the equal allocation, in which all nonzero $\theta_{2l}$ are equal, and the linear allocation, in which the nonzero $\theta_{2l}$ are linearly increasing. To make the power comparable among the configurations of $H_1$, we set $\eta := \|\theta_1 - \theta_2\|^2/\operatorname{tr}(\Sigma_1^2) = 0.1$ throughout the simulation. Two combinations of (n, p) are considered: $(n_i, p) = (50, 40)$ and $(n_i, p) = (75, 60)$. Table 1 reports the empirical sizes and power comparison at the 5% nominal significance level under various scenarios. From Table 1, we observe that the sizes of the SS test are generally close to the nominal level under all scenarios, though in some cases SS appears to be a little liberal. In contrast, the sizes of the TS test are much smaller than 5%, that is, too conservative. Our SS test is clearly more powerful than the TS test in most cases. This finding is consistent with Bai and Saranadasa (1996), who demonstrated that the classical Mahalanobis distance may not work well because the contamination bias in estimating the covariance matrix grows rapidly with p. When p and n are comparable in a certain sense, having the inverse of the estimate of the scatter matrix in constructing tests is no longer beneficial.
Next, we consider the high-dimensional cases, p > n, and compare SS with the tests proposed by Chen and Qin (2010) (CQ), Srivastava, Katayama, and Kano (2013) (SKK), and Gregory et al. (2014) (GCT). Because the SKK procedure uses an estimate of $\operatorname{tr}(R^2)$ derived under the normality assumption, which has a considerable bias under nonnormal distributions as shown by Srivastava, Kollo, and von Rosen (2011), we replace it by the estimator proposed by Srivastava, Yanagihara, and Kubokawa (2014). GCT is implemented using the function GCT.test in the R package "highD2pop". We consider cases with unequal correlation matrices: $R_1 = (0.5^{|j-k|})$ and $R_2 = I_p$. The sample size $n_i$ is chosen as 25, 50, and 100, and three dimensions, p = 120, 480, and 1440, are considered for each sample size. Table 2 reports the empirical sizes at the 5% nominal significance level. The empirical sizes of the SS, SKK, and CQ tests converge to the nominal level as p and n increase together under all scenarios. There is some slight size distortion (tending to be larger than 5%) for the CQ test when p is small or moderate. In contrast, our proposed SS test appears somewhat conservative when p/n is very large, such as p = 480 or 1440 with $n_i = 25$; this is because the ratio $\hat\sigma_n^2/\sigma_n^2$ tends to be slightly larger than one in such cases. The empirical sizes of GCT tend to be a little smaller than the nominal level when p/n is large, especially in the nonnormal cases. To get a broader picture of the goodness of fit of the asymptotic normal approximation for $R_n/\hat\sigma_n$, Figure 1 displays normal Q-Q plots for various combinations of sample size and dimension. Here, we only present the results of Scenarios (I), (III), and (V) since the results for the other scenarios are similar. There is a general convergence of our test statistic to N(0, 1) as n and p increase simultaneously.
The results of the comparison with Feng et al. (2015) are reported in the supplemental material. Generally, under Scenarios (I) and (II), SKK has certain advantages over SS, as we would expect since the underlying distribution is multivariate normal. The SS test also offers quite satisfactory performance, though its sizes are often a little smaller than those of SKK, as shown in Table 2. Under Scenario (I), the variances of the components are identical and thus the superior efficiency of CQ is obvious. However, under Scenario (II), both SKK and SS outperform CQ uniformly by a quite large margin of power, which again concurs with the asymptotic comparison in Section 2.2. Under the other elliptical scenarios, (III), (IV), and (V), the SS test is clearly more efficient than the CQ, SKK, and GCT tests, and the difference is quite remarkable. Certainly, this is not surprising, as neither the $t_{p,4}$ nor the $MN_{p,\gamma,9}$ distribution belongs to the linear transformation model on which the validity of CQ heavily depends. Because the variance estimator in GCT usually leads to an overestimation under the alternative, especially in the sparse cases, the power of GCT is not as good as that of the other scalar-invariant tests. In addition, the power of the four tests mainly depends on $\|\theta_1 - \theta_2\|$, as analyzed in Section 2, and thus should be invariant (roughly speaking) across different patterns of allocation and different percentage levels of true nulls.
Next, to study the effect of the correlation matrix on the proposed test and to further explore the scope of application of our method, we consider another four scenarios with different correlations and distributions. The following moving average model is used: $X_{ijk} = \rho_{i1} Z_{ij,k} + \rho_{i2} Z_{ij,k+1} + \cdots + \rho_{iT_i} Z_{ij,k+T_i-1} + \theta_{ik}$, where the $Z_{ij,l}$ are iid innovations. The coefficients $\{\rho_{il}\}_{l=1}^{T_i}$ are generated independently from U(2, 3) and are kept fixed once generated throughout our simulations. The correlations between $X_{ijk}$ and $X_{ijl}$ are determined by |k − l| and $T_i$. We consider "full dependence" for the first sample and "2-dependence" for the second sample, that is, $T_1 = p$ and $T_2 = 3$, to generate different covariances of $X_{ij}$. For simplicity, we set $\eta := \|\theta_1 - \theta_2\|^2/(\operatorname{tr}(\Sigma_1^2) + \operatorname{tr}(\Sigma_2^2)) = 0.05$ and $(n_i, p) = (50, 480)$. Table 4 reports the empirical sizes and power of SS, SKK, CQ, and GCT. Even under the last three nonelliptical scenarios, the SS test maintains the empirical sizes reasonably well. Again, the empirical sizes of CQ tend to be larger than the nominal level. The empirical sizes of GCT deviate considerably from the nominal level, making its power unnecessarily high. In general, the SS test performs better than the CQ and SKK tests in terms of power for the three nonnormal distributions, especially under the 95% pattern. This may be explained by the fact that the proposed test, using only the direction of an observation from the origin but not its distance, is more robust to a certain degree for the heavy-tailed distributions considered.
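A data-generation sketch for this design, assuming the moving average form described above (the function name and the Gaussian choice for the innovations are our assumptions):

```python
import numpy as np

def ma_sample(n, p, T, rho, theta, rng):
    """Generate n rows X_j with
        X_{jk} = sum_{l=1}^{T} rho_l * Z_{j,k+l-1} + theta_k,
    so X_{jk} and X_{jl} are correlated iff |k - l| < T ('(T-1)-dependence')."""
    Z = rng.normal(size=(n, p + T - 1))  # iid innovations
    X = np.zeros((n, p))
    for k in range(p):
        X[:, k] = Z[:, k:k + T] @ rho[:T] + theta[k]
    return X

rng = np.random.default_rng(0)
p = 50
rho_full = rng.uniform(2, 3, size=p)  # kept fixed across replications
# full dependence (T1 = p) for sample 1, 2-dependence (T2 = 3) for sample 2:
X1 = ma_sample(30, p, T=p, rho=rho_full, theta=np.zeros(p), rng=rng)
X2 = ma_sample(30, p, T=3, rho=rho_full, theta=np.zeros(p), rng=rng)
```

With T = 3, components more than two positions apart share no innovations and are exactly uncorrelated, which is what "2-dependence" refers to.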
Next, we compare the SS test with the nonparametric method proposed by Biswas and Ghosh (2014) (abbreviated as BG). The samples are generated from Scenarios (I)-(V) with $R_1 = (0.5^{|j-k|})$ and $R_2 = \sigma^2 R_1$. Note that the null hypothesis of the BG test is the equality of the two distributions rather than the equality of the two location parameters; thus, under $H_0$, we set $\sigma^2 = 1$. Under $H_1$, we consider $\theta_1 = 0$ and $\theta_2 = (\theta, \ldots, \theta)$. Table 5 reports the empirical size and power comparison with $(\theta, \sigma^2) = (2.5, 1)$ or (1, 1.2) when $(n_i, p) = (50, 240)$. The empirical sizes of BG appear to be a little larger than the nominal level. The general conclusion is that our SS test significantly outperforms the BG test when the two location parameters are different. This is not surprising, because the BG test is an omnibus one, which is also sensitive to differences in scale or shape parameters. When both location and scale parameters differ, the BG test performs better than our test under the normality assumption (Scenarios (I) and (II)), while our SS test is more efficient under the other three nonnormal scenarios.
We also study how our method compares with some variants of Hotelling's $T^2$ test. One variant is the regularized Hotelling's $T^2$ test of Chen et al. (2011) (abbreviated as RHT), which uses $S + \zeta I_p$ instead of S with a sufficiently small value of ζ; Chen et al. (2011) suggested a resampling procedure to implement the RHT. This method is implemented using the "RHT" R package with the false alarm rate 0.05. Another natural variant is to use a sparse estimate of $\Sigma$ to replace S. Here, we consider the banding estimator proposed by Bickel and Levina (2008) and denote the resulting test as the sparse Hotelling's $T^2$ test (abbreviated as SHT). It is easy to see that the normalized version of $\frac{n_1 n_2}{n}(\bar X_1 - \bar X_2)^T \hat\Sigma^{-1}(\bar X_1 - \bar X_2)$ is asymptotically normal as (n, p) → ∞. As shown in Appendix F, this asymptotic normality does not hold well even when an optimal estimator of $\Sigma$ is used in high-dimensional settings. Thus, we also consider using the bootstrap procedure for the SHT (denoted as SHTB). For simplicity, the correlation matrices are fixed as $R_1 = R_2 = (0.5^{|j-k|})_{1 \le j,k \le p}$. We choose the banding parameter as five, resulting in a nearly "oracle" estimator of $\Sigma$. Table 6 reports the empirical sizes of SS, RHT, SHT, and SHTB when $(n_i, p) = (50, 240)$. We observe that the RHT is rather conservative in this high-dimensional setting even with a bootstrap procedure, while the empirical sizes of SHT are much larger than the nominal level, as explained before. Fortunately, the SHTB maintains the significance level well. For the power comparison, we consider two cases for $\theta_2$. Half random: the last p/2 components of $\theta_2$ are generated from N(0, 1) and the others are zero; Sparse random: the last 5% of the components of $\theta_2$ are generated from N(0, 1) and the others are zero. The power results with $\eta = \|\theta_2\|^2/\operatorname{tr}(\Sigma_1^2) = 0.1$ are also tabulated in Table 6.
Our SS test has superior efficiency under all the scenarios, which further concurs with our previous claim that having the inverse of the estimate of the scatter matrix in high-dimensional settings may not be beneficial.
Finally, we evaluate the performance of the bootstrap procedure (denoted as SSB) suggested in Section 2.3. Table 7 reports the empirical sizes of the SSB; the data generation mechanism is the same as that in Table 2. Clearly, the bootstrap procedure improves upon the asymptotic SS test in terms of size control, in the sense that its empirical sizes are closer to the nominal level. In this table, the power values of SS and SSB with $(n_i, p) = (25, 480)$ are also presented, from which we can see that SSB performs better than SS in all cases because the latter is conservative in this setting of $(n_i, p)$. We conducted some other simulations with various alternative hypotheses, dimensions, and nominal sizes to check whether the above conclusions would change in other cases. These simulation results, not reported here but available from the authors, show that the SS test works well in terms of size for other cases as well, and its good power performance holds for other choices of alternatives.

A Real-Data Example
Here, we demonstrate the proposed methodology by applying it to a real dataset. The colon data (Alon et al. 1999) contain the expression levels of the 2000 genes with the highest minimal intensity across 62 tissues (http://microarray.princeton.edu/oncology/affydata/index.html). There are 22 normal colon tissues and 40 tumor colon tissues. We want to test the hypothesis that the tumor group has the same gene expression levels as the normal group. Figure 2 shows the p-values of a normality test and the standard deviations of each variable for the two samples. From the top panels of Figure 2, we observe that most variables of the tumor colon tissues are not normally distributed; thus, we expect the SS test to be more robust than the CQ and SKK tests. Furthermore, from the bottom panels of Figure 2, the standard deviations of the variables in these two samples range from 15.9 to 4655, which illustrates that a scalar-invariant test is needed. Thus, we apply the SS and SKK tests to this dataset. The p-values of our SS test with asymptotic normality and with the bootstrap procedure are 0.0093 and 0.001, respectively. These results suggest rejection of the null hypothesis: the gene expression levels of the tumor group are significantly different from those of the normal group. However, the p-value of the SKK test is about 0.18 for this dataset, so it cannot detect the difference between the two groups at the 5% significance level. This is again consistent with our preceding theoretical and numerical analysis.

Discussion
Our asymptotic and numerical results together suggest that the proposed spatial-sign test is quite robust and efficient for testing the equality of locations, especially for heavy-tailed or skewed distributions. It should be pointed out that when the data come from light-tailed distributions, the SS test is expected to be outperformed by the SKK test. This drawback is inherited from the spatial-sign-based nature of SS and is shared by all sign- or rank-based procedures.
Our analysis in this article shows that the spatial-sign-based test combined with the data transformation via the estimated diagonal matrix leads to a powerful test procedure. The analysis also shows that the data transformation is crucial for high-dimensional data. This confirms the benefit of the transformation discovered by Srivastava, Katayama, and Kano (2013) for L2-norm-based tests. In a significant development in another direction, using the max-norm rather than the L2-norm, Cai, Liu, and Xia (2014) proposed a test based on the max-norm of marginal t-statistics; see also Zhong, Chen, and Xu (2013) for a related discussion. Generally speaking, the max-norm test targets sparser and stronger signals, whereas the L2-norm test targets denser but fainter signals. Developing a spatial-sign-based test for sparse signals is of interest for future study. Besides the spatial-sign-based sphericity test, there are other tests in the literature based on spatial signed-ranks or spatial ranks, such as Hallin and Paindaveine (2006). Deriving similar procedures for those tests is highly nontrivial due to their complicated construction and deserves future research.
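The max-norm versus L2-norm contrast can be illustrated with marginal t-statistics. The code below is a schematic comparison on synthetic data with one sparse, strong signal; it is not the calibrated test of Cai, Liu, and Xia (2014), and the sample sizes and shift are our own illustrative choices.

```python
import numpy as np

def marginal_t(x, y):
    """Component-wise two-sample t-statistics with pooled variance."""
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    s2 = ((n1 - 1) * x.var(axis=0, ddof=1)
          + (n2 - 1) * y.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return diff / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))

rng = np.random.default_rng(2)
n1, n2, p = 30, 30, 200
x = rng.normal(size=(n1, p))
y = rng.normal(size=(n2, p))
y[:, 0] += 2.0                       # one sparse, strong location signal

t = marginal_t(x, y)
max_norm = np.max(np.abs(t))         # sensitive to sparse, strong signals
l2_norm = np.sum(t ** 2)             # accumulates dense, faint signals
```

Here the max-norm statistic is driven almost entirely by the single shifted coordinate, while the L2-norm statistic averages it with 199 pure-noise coordinates, which is the trade-off described above.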
In our tests, we standardize the data for each variable to a common scale to account for heterogeneity of variance. By contrast, Dudoit, Fridlyand, and Speed (2002) standardized gene expression data so that the observations (arrays) have unit variance across variables (genes). They pointed out that such scale adjustments are desirable in some cases to prevent the expression levels of one particular array from dominating the average expression levels across arrays. This way of standardizing high-dimensional data warrants future study.
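The two standardization conventions differ only in the axis along which variances are equalized. A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
# Rows are observations (arrays), columns are variables (genes).
data = rng.normal(size=(10, 100)) * rng.uniform(1.0, 50.0, size=100)

# (a) Per-variable (per-gene) standardization, as used in our tests:
# each column is scaled to unit variance.
by_gene = data / data.std(axis=0, ddof=1)

# (b) Per-observation (per-array) standardization, as in Dudoit,
# Fridlyand, and Speed (2002): each row is scaled to unit variance
# across variables.
by_array = data / data.std(axis=1, ddof=1, keepdims=True)

assert np.allclose(by_gene.std(axis=0, ddof=1), 1.0)
assert np.allclose(by_array.std(axis=1, ddof=1), 1.0)
```

Convention (a) makes the test scalar-invariant across genes; convention (b) instead equalizes the overall scale of each array.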

Appendix: Technical Proofs
This appendix contains six parts, but we present here only Appendix A, which gives succinct proofs of Theorems 1 and 2. Additional simulation results, the technical arguments of Section 2, and the proofs of Lemmas A.1, A.2, A.4, A.6, and A.7 can be found in Appendices B–F of the supplemental material. The proofs of Lemmas A.3 and A.5 are given below because they are the core of the technical arguments and may be interesting in their own right.

Appendix A: The Proofs of Theorems 1 and 2
Before proving the two main theorems of Section 2, we present several necessary lemmas. First, we restate Lemma 4 of Zou et al. (2014).
Lemma A.1. Suppose u_1, ..., u_n are independent and identically distributed uniform random vectors on the unit sphere in R^p. For any p × p symmetric matrix M, we have E(u_i^T M u_i) = tr(M)/p, together with the higher-order moment bounds stated in Lemma 4 of Zou et al. (2014).

Let the corresponding estimator be θ̂_i. The following lemma provides an asymptotic expansion for θ̂_i. Note that, given D̂_i, the estimator μ̂_i is the minimizer of the objective function L(μ); equivalently, θ̂_i solves the associated estimating equation.

Proof. We first show that ||μ̂_i|| = O_p(b_{n,p,i}). Note that the objective function L(μ) is strictly convex in μ. Thus, as long as we can show that it has a b_{n,p,i}^{-1}-consistent local minimizer, that minimizer must also be the b_{n,p,i}^{-1}-consistent global minimizer. The existence of a b_{n,p,i}^{-1}-consistent local minimizer is implied by the fact that, for an arbitrarily small ε > 0, there exists a sufficiently large constant C, which does not depend on n or p, such that
$$\liminf_n P\Big( \inf_{\|\mu\| = C\, b_{n,p,i}} L(\mu) > L(0) \Big) \geq 1 - \varepsilon.$$
It can then easily be seen that ..., and thus .... Next, we show that B_1 = o_p(b_{n,p,i}) and B_2 = o_p(b_{n,p,i}). By a Taylor expansion and Lemma A.2, ... by Condition (C3), so B_1 = o_p(b_{n,p,i}). Similarly to B_1, we can show that B_2 = o_p(b_{n,p,i}), from which the assertion of the lemma immediately follows.
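The estimator μ̂_i discussed above is the minimizer of a strictly convex sum-of-norms objective. A minimal sketch of computing such a minimizer (the spatial median) by a Weiszfeld-type iteration, assuming the plain objective L(μ) = Σ_i ||x_i − μ|| without the paper's diagonal rescaling by D̂_i:

```python
import numpy as np

def spatial_median(x, n_iter=200, tol=1e-8):
    """Weiszfeld-type iteration for the minimizer of
    L(mu) = sum_i ||x_i - mu|| (the spatial median)."""
    mu = x.mean(axis=0)                     # start from the sample mean
    for _ in range(n_iter):
        d = np.linalg.norm(x - mu, axis=1)
        d = np.maximum(d, 1e-12)            # guard against division by zero
        w = 1.0 / d
        mu_new = (w[:, None] * x).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

rng = np.random.default_rng(4)
# Heavy-tailed data with true location 1 in every coordinate.
x = rng.standard_t(df=2, size=(50, 30)) + 1.0
mu_hat = spatial_median(x)
```

Because each step re-weights observations by the inverse of their distance to the current center, distant heavy-tailed outliers get small weight, which is the source of the estimator's robustness.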
Lemma A.4. Under H_0 and Condition (C1), for k = 1, 2, .... Next, we give an asymptotic equivalence for R_n under H_0.
Proof. By Lemma A.2, ..., which implies that ..., where Q_n denotes the remaining terms of R_n. For simplicity, we only show that ...; the other parts of Q_n can be shown to be o_p(σ_n) by similar arguments. Next, we show that E(G_{n1}^2) = o(σ_n^2).
... by Condition (C2). By the Cauchy–Schwarz inequality, we have ..., so we obtain G_{n1} = o_p(σ_n) by Condition (C4). By the same procedure, we can also show that G_{n2} = o_p(σ_n). Moreover, ..., where C_{ni} is a bounded random variable between −1 and −(u_{1i}^T R_1 u_{1i})^2. By the same arguments as for G_{n1}, we can show that J_{n12} = o_p(σ_n). Thus, .... Similarly, .... Finally, under H_0,
$$R_n = \frac{\sum_{i \neq j}^{n_1} u_{1i}^T A_1 u_{1j}}{n_1(n_1 - 1)} + \frac{\sum_{i \neq j}^{n_2} u_{2i}^T A_2 u_{2j}}{n_2(n_2 - 1)} - \frac{2 \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} u_{1i}^T A_3 u_{2j}}{n_1 n_2} + o_p(\sigma_n).$$
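The leading term of R_n under H_0 is thus a two-sample U-statistic in the spatial signs u_{ki} = x_{ki}/||x_{ki}||. A minimal numerical sketch, taking A_1 = A_2 = A_3 = I_p purely for illustration (the actual matrices A_k arise from the data transformation and are simplified away here):

```python
import numpy as np

def unit_rows(x):
    """Spatial signs: each observation scaled to the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sign_ustat(x, y):
    """Two-sample U-statistic of spatial signs with A_1 = A_2 = A_3 = I:
    sum_{i!=j} u_{1i}'u_{1j}/(n1(n1-1)) + sum_{i!=j} u_{2i}'u_{2j}/(n2(n2-1))
      - 2 sum_{i,j} u_{1i}'u_{2j}/(n1 n2)."""
    u, v = unit_rows(x), unit_rows(y)
    n1, n2 = len(u), len(v)
    g1, g2 = u @ u.T, v @ v.T
    within1 = (g1.sum() - np.trace(g1)) / (n1 * (n1 - 1))
    within2 = (g2.sum() - np.trace(g2)) / (n2 * (n2 - 1))
    between = (u @ v.T).sum() / (n1 * n2)
    return within1 + within2 - 2.0 * between

rng = np.random.default_rng(5)
x = rng.normal(size=(25, 100))
y = rng.normal(size=(25, 100))
r_null = sign_ustat(x, y)          # near zero under H0
r_alt = sign_ustat(x, y + 0.5)     # inflated under a location shift
```

Excluding the diagonal terms (the i = j pairs) is what makes each within-sample sum an unbiased, mean-zero quantity under H_0, mirroring the exclusion of i = j in the displayed expansion.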
By Condition (C3), var(Z_n) = O(n^{-2}). Consequently, R_n = Z_n + o_p(n^{-2}). To prove Theorem 1, it now remains to show that Z_n is asymptotically normal. Clearly, Z_n is essentially a two-sample U-statistic of order two. However, the standard CLT for U-statistics (Serfling 2009) is not directly applicable because the kernel is degenerate: the variance of its conditional expectation is zero. Instead, the martingale central limit theorem (Hall and Heyde 1980) can be used. We require some additional lemmas, stated as follows.
We can verify that, for each n, {S_{nm}, F_{nm}}_{m=1}^n is a zero-mean, square-integrable martingale sequence. To prove the normality of Z_n, it suffices to establish the following two lemmas. Their proofs are straightforward but involve rather tedious calculations, and thus they are included in Appendix D of the supplemental material. Proof of Theorem 1. By Corollary 3.1 of Hall and Heyde (1980) together with Lemmas A.6 and A.7, we conclude that Z_n/σ_n →d N(0, 1). Combining this with Lemma A.5, we obtain the assertion of the theorem immediately.
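The asymptotic normality established here is easy to illustrate by Monte Carlo. The sketch below simulates the within-sample spatial-sign U-statistic under H_0 and standardizes it empirically (rather than by the theoretical σ_n); the sample sizes and replication count are our own illustrative choices, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(6)

def sign_within(x):
    """Within-sample spatial-sign U-statistic:
    sum_{i != j} u_i' u_j / (n(n-1))."""
    u = x / np.linalg.norm(x, axis=1, keepdims=True)
    g = u @ u.T
    n = len(u)
    return (g.sum() - np.trace(g)) / (n * (n - 1))

n, p, reps = 30, 200, 500
stats_ = np.array([sign_within(rng.normal(size=(n, p))) for _ in range(reps)])
z = stats_ / stats_.std(ddof=1)          # empirical standardization
frac_in = np.mean(np.abs(z) < 1.96)      # should be close to 0.95
```

With both n and p moderately large, the empirical coverage of the interval (−1.96, 1.96) sits near the nominal 95%, consistent with the martingale CLT argument.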
Thus, following the same procedure as in Lemma A.5, we obtain that .... By the same arguments as in Lemma A.5, we can show that ... × (1 + o(1)).
Next, following the same procedure as in the proof of Theorem 1, we can prove the assertion.

Supplementary Materials
The supplementary material contains some technical lemmas and additional simulation results.