A Comparison of Two-Sample Tests of Significance When Used With Variable Treatment Effects

This article compares the performance of many two-sample tests of significance that might be used to test the equality of means when the effect of the treatment is variable. Of the 19 tests that were compared, the normal scores test is recommended for general use in testing the null hypothesis of no treatment effect against the alternative that the distributions are stochastically ordered when the ratio of the larger standard deviation to the smaller standard deviation does not exceed 1.3. The Baumgartner-Weiß-Schindler tests and an adaptive test also have higher power than the pooled t-test, the unequal variance t-test, and the rank-sum test for many distributions. In the simulation studies, data in the first sample are generated from nine distributions, including long-tailed and skewed distributions. Data in the second sample are generated by adding a random treatment effect to a random variable that was generated from the same distribution that was used in the first sample. Because we restricted our power studies to treatment effects that are positive or zero, the population distributions will be stochastically ordered. The results of these studies demonstrate that the normal scores test is often more powerful than the t-tests and the rank-sum test. If the ratio of the standard deviations does exceed 1.3, then one of the t-tests is recommended.


Introduction
In this research, we will compare 19 two-sample tests to determine which one can be recommended in experimental settings when the treatment effect may be variable. In many experiments, subjects have some variability in their response to treatments. For example, in an experiment designed to measure the effect of exercise on blood pressure, Georgiades et al. (2000) reported differences in variability between the control group and the exercise-diet group with blood pressure measurements that were recorded during mental stress. In our investigation of the significance level and power of these tests, we will not assume that treatment effects are constant, which would imply a location shift model.
There are several reasons why the location shift model may not be realistic. The experimental units may have responses that are monotone functions of follow-up time, but the functions are not identical, so for a fixed follow-up time the responses will vary. Another source of variability in response in human and animal experiments could be differences in baseline characteristics (e.g., initial weight). In this article, we will investigate the relative power of many tests of significance when the location shift model may be violated.
© American Statistical Association, Statistics in Biopharmaceutical Research, May 2015, Vol. 7, No. 2, DOI: 10.1080/19466315.2015

Before we describe the test procedures that we will compare, we need to clearly describe the null and alternative hypotheses. In an experimental setting, the null hypothesis is that the treatment has no effect, so the distributions in the first and second populations are identical. Let X1, . . . , Xm be a random sample from the first population that has a cumulative distribution function (cdf) of F(x), and let Y1, . . . , Yn be a random sample from the second population that has cdf G(y). A random variable X with cdf F(x) is said to be stochastically smaller than the random variable Y with cdf G(y) if F(x) ≥ G(x), with F(x) > G(x) for at least one x-value. If the random variable X is stochastically smaller than the random variable Y, then these distributions are said to be stochastically ordered. See Randles and Wolfe (1979, p. 130) for details.
If the treatment effect is equal to the same positive constant for all subjects then F(x) and G(y) are stochastically ordered, with G(y) shifted to the right of F(x). If the treatment effect is not constant, but is positive for some subjects and is never negative, then the distributions will be stochastically ordered. In some studies it may be reasonable to assume that the treatment effect increases with time and that the variability of the treatment effect increases with time. Because the researcher would typically not know the direction of the effect, we will assume, at the time the measurements are recorded, that the treatment effect is nonnegative for all subjects or nonpositive for all subjects. Thus, we will perform a two-sided test of H0: F(x) = G(x) for every x versus Ha: F(·) is stochastically smaller, or larger, than G(·).
Clearly, stochastically ordered distributions would result from positive treatment effects in an experimental setting, but it is also possible to perform the same tests with observational data if the distributions are stochastically ordered. While the stochastically ordered alternative is more general than the shift alternative, it is more restrictive than the general alternative Ha: F(x) ≠ G(x) for at least one x. In this research we focus on the performance of two-sample tests with stochastically ordered distributions, but it should be remembered that many of these tests were designed for a shift alternative and that some of the other tests were designed to be powerful with a general alternative. The Behrens-Fisher problem, which assumes unequal variances under the null hypothesis, will not be addressed because, in an experimental setting, the null hypothesis usually is a statement that the treatment has no effect, which implies that F(x) = G(x) for all x.

Description of Two-Sample Tests
The pooled (or equal variance) t-test is one of the most widely used tests of significance because it is the most powerful test when both populations are normal and the variances are equal. In addition, simulation studies have shown that this test maintains its level of significance with many nonnormal distributions when the variances are equal. The problem with the pooled t-test is that it is not as powerful as the Wilcoxon rank-sum test (RS test) and the normal scores test (NS test) when the distributions are skewed. In addition, it is not known how the pooled t-test performs with variable treatment effects.
The unequal variance t-test is recommended by some researchers because it does not assume that the variances are equal. However, O'Gorman (1997) demonstrated that this test does not maintain its level of significance with skewed distributions when the sample sizes are not equal.
In an effort to avoid this difficulty, we have included a permutation version of this test, which will be called the permutation Welch t-test. It is performed by computing the Welch test statistic in the usual manner, but the p-value is estimated by taking many random permutations of the group labels and computing the Welch test statistic for each permutation. These test statistics form the permutation distribution. Typically, the values of m and n are so large that a full permutation distribution would be difficult to obtain, so we usually use a sample permutation method with a large number of permutations. Let R be the number of permutations and let E be the number of permutation test statistics that are as extreme as or more extreme than the unpermuted test statistic. Following Davison and Hinkley (1997, p. 140), the two-sided p-value (p) is estimated by p = (E + 1)/(R + 1). In this study, we used R = 2000 for all tests that require a permutation approach to obtain a p-value. We also used this approach to construct a permutation version of the pooled t-test, which will be called the permutation t-test. We used R = 2000 for the sample permutation tests because it gives a reasonably accurate estimate of the p-value that would be obtained from a full permutation distribution; this value of R gives a standard error of about 0.005 for p-values near α = 0.05. In practice, when only one dataset is analyzed, a larger value of R could be used.
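The sample-permutation recipe can be sketched in Python. The function name and the use of SciPy's Welch statistic are our own illustrative choices, not from the article:

```python
import numpy as np
from scipy import stats

def perm_welch_pvalue(x, y, R=2000, seed=0):
    """Two-sided permutation p-value for the Welch t statistic.

    Sketch of the sample-permutation approach: permute the group
    labels R times and set p = (E + 1)/(R + 1), where E counts
    permuted statistics at least as extreme as the observed one
    (Davison and Hinkley 1997, p. 140).
    """
    rng = np.random.default_rng(seed)
    combined = np.concatenate([x, y])
    m = len(x)
    t_obs = abs(stats.ttest_ind(x, y, equal_var=False).statistic)
    E = 0
    for _ in range(R):
        perm = rng.permutation(combined)
        t_perm = abs(stats.ttest_ind(perm[:m], perm[m:],
                                     equal_var=False).statistic)
        if t_perm >= t_obs:
            E += 1
    return (E + 1) / (R + 1)
```

The same wrapper with `equal_var=True` gives the permutation version of the pooled t-test.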
Some researchers prefer to use nonparametric tests to perform two-sample tests because, by relying on the ranks, the test statistics are not overly sensitive to outliers. We included the Wilcoxon rank-sum test, which uses the ranks as scores, and we used the large-sample normal approximation to obtain the two-sided p-value. These procedures are described by Randles and Wolfe (1979, chap. 9) and by Hájek, Šidák, and Sen (1999, chap. 4). We have also included the Van der Waerden version of the normal scores test. If we let Ri be the rank in the combined sample for the ith observation, the normal scores are given by Φ−1[Ri/(m + n + 1)], where Φ(·) is the cdf of the standard normal distribution. These normal scores are used along with a large-sample normal approximation to obtain the two-sided p-value. It is also possible to use normal scores and rank scores in a Mantel-Haenszel procedure to perform these rank tests. While the Mantel-Haenszel tests are more commonly used with categorical data, they can be used to perform the rank tests if the ranks are used as column scores and the sample indicator is used as the row scores. The choice of column scores is discussed by Graubard and Korn (1987). Although we will not use the Mantel-Haenszel approach in our simulations for the rank tests, our results will apply to rank tests that use the Mantel-Haenszel approach.
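As a sketch, the Van der Waerden normal scores test with the large-sample approximation can be computed as follows. The permutation mean and variance used here are the standard ones for linear rank statistics; the helper name is ours:

```python
import numpy as np
from scipy import stats

def normal_scores_test(x, y):
    """Van der Waerden normal scores test with the large-sample
    normal approximation (a sketch of the NS test)."""
    m, n = len(x), len(y)
    N = m + n
    ranks = stats.rankdata(np.concatenate([x, y]))
    scores = stats.norm.ppf(ranks / (N + 1))  # Phi^{-1}[Ri/(m+n+1)]
    s = scores[m:].sum()                      # score sum in sample 2
    # Under permutation the scores sum to (approximately) zero by
    # symmetry, and the variance of s is m*n/(N*(N-1)) * sum(a_i^2).
    var = m * n / (N * (N - 1)) * np.sum(scores ** 2)
    z = s / np.sqrt(var)
    p = 2 * stats.norm.sf(abs(z))
    return z, p
```

With ties, average ranks make the score sum only approximately zero; a permutation p-value could be substituted without changing the statistic.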
A rank test was proposed by Baumgartner, Weiß, and Schindler (1998) that gives greater weight to the observations having the smallest and largest ranks. This test is based on a weighted difference between the empirical distribution functions. We denote the observations in Sample 1 as x1, . . . , xm and those in Sample 2 as y1, . . . , yn, and let G(1) ≤ · · · ≤ G(m) be the ordered ranks, over both samples, of the observations in Sample 1 and H(1) ≤ · · · ≤ H(n) be the ordered ranks, over both samples, of the observations in Sample 2. The components of the test statistic are

Bx = (1/m) Σi=1..m [G(i) − (m + n)i/m]² / {[i/(m + 1)][1 − i/(m + 1)] · n(m + n)/m}

and

By = (1/n) Σj=1..n [H(j) − (m + n)j/n]² / {[j/(n + 1)][1 − j/(n + 1)] · m(m + n)/n}.

The test statistic is B = (Bx + By)/2. Although approximate critical values for this test have been computed by Baumgartner, Weiß, and Schindler (1998), we used a permutation approach to compute the p-value because Neuhäuser (2003) found that, if those critical values were used, the test may not maintain its level of significance with discrete data. This test will be called the BWS test.

Fligner and Policello (1981) proposed another modification of the rank-sum test to test the null hypothesis of equal medians. With this method they do not assume equal variances or shapes for the two populations. We have discovered in a simulation study that using the critical values from the standard normal will produce a test that does not always maintain its level of significance with skewed distributions. For example, with m = 12 and n = 48, if we use a skewed low-kurtosis distribution the test will reject the null hypothesis in approximately 6.72% of the datasets. Hence, a permutation approach was used to compute the p-values for the Fligner-Policello (FP) test.
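A minimal sketch of the statistic B = (Bx + By)/2, with the component formulas following Baumgartner, Weiß, and Schindler (1998); note that the components use the ordered ranks of each sample:

```python
import numpy as np
from scipy import stats

def bws_statistic(x, y):
    """Baumgartner-Weiss-Schindler statistic B = (Bx + By)/2
    (a sketch; in practice a permutation p-value would be used)."""
    m, n = len(x), len(y)
    N = m + n
    ranks = stats.rankdata(np.concatenate([x, y]))
    G = np.sort(ranks[:m])  # ordered ranks of sample 1
    H = np.sort(ranks[m:])  # ordered ranks of sample 2

    def component(R, a, b):
        # a = size of this sample, b = size of the other sample
        i = np.arange(1, a + 1)
        num = (R - N * i / a) ** 2
        den = (i / (a + 1)) * (1 - i / (a + 1)) * b * N / a
        return np.mean(num / den)

    return (component(G, m, n) + component(H, n, m)) / 2
```

Well-separated samples produce a much larger B than interleaved samples, which is what the weighting is designed to detect.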
Because, in this research, the treatment effects will be variable, the population distributions will differ, and it is reasonable to include test statistics that are sensitive to differences between the cumulative distribution functions. Let Fm(x) and Gn(y) denote the empirical cdfs for the sample of m observations from the first population and the sample of n observations from the second population, respectively. The two-sample version of the Cramér-von Mises test statistic, which was studied by Rosenblatt (1952), is

CVM2 = [mn/(m + n)²] Σk=1..m+n [Fm(zk) − Gn(zk)]²,

where z1, . . . , zm+n denote the observations in the combined sample. The exact distribution of CVM2 has been tabulated for small samples, but we have used a permutation test to estimate the p-value. A variant of the CVM2 test that was investigated by Schmid and Trede (1995), which we will call the CVM1 test, uses a test statistic based on the sum of the absolute differences |Fm(zk) − Gn(zk)| over the combined sample. We use a permutation method to obtain the p-value for this test as well. It was expected that the CVM2 test and the CVM1 test would be sensitive to both location and scale, so they might have high power to detect a variable treatment effect.
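The CVM2 statistic and its permutation p-value can be sketched as follows (any constant normalization factor cancels in the permutation test; recent SciPy versions also provide `scipy.stats.cramervonmises_2samp` as an alternative):

```python
import numpy as np

def cvm2_statistic(x, y):
    """Two-sample Cramer-von Mises statistic evaluated over the
    combined sample, using the mn/(m+n)^2 normalization."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y])
    Fm = np.searchsorted(np.sort(x), z, side="right") / m
    Gn = np.searchsorted(np.sort(y), z, side="right") / n
    return m * n / (m + n) ** 2 * np.sum((Fm - Gn) ** 2)

def cvm2_perm_test(x, y, R=2000, seed=0):
    """Permutation p-value p = (E + 1)/(R + 1) for CVM2."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    m = len(x)
    t_obs = cvm2_statistic(x, y)
    E = sum(cvm2_statistic(p[:m], p[m:]) >= t_obs
            for p in (rng.permutation(z) for _ in range(R)))
    return (E + 1) / (R + 1)
```

Replacing the squared differences with absolute differences gives the corresponding CVM1-style statistic.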
Several tests have been proposed that are adaptive, which means that the procedures used to perform the tests depend on the empirical distribution of the combined samples. The oldest and best known of these is the HFR test that was proposed by Hogg, Fisher, and Randles (1975). This test maintains its level of significance because all of the component tests are rank tests and the selection of the rank scores depends on selection statistics that are functions of the order statistics of the combined samples, which are independent of the ranks for every distribution. For details see Randles and Wolfe (1979, chap. 11). In a small simulation study, this test has been shown by Hogg, Fisher, and Randles (1975) to have higher power than some other tests.
Another adaptive test was proposed by O'Gorman (2012, chap. 3). In this test, weights are assigned to individual observations, with outliers getting smaller weights, and a weighted least squares approach (WLS) is used to compute an F-test statistic based on the weighted observations. To ensure that the test maintains its significance level, a permutation method is used to compute the p-value.
Several two-sample tests for changes in location and scale have been proposed. Lepage (1971) proposed a test that combines the test statistic of the Wilcoxon rank-sum test with that of the Ansari-Bradley test of scale; analogous combinations pair the Wilcoxon statistic with Mood's scale statistic and the normal scores statistic with Klotz's scale statistic. O'Brien (1988) suggested that a test for location and scale could be performed by using the group indicator as the dependent variable in a regression with the observed values and the square of the observed values as independent variables. The F-test statistic for this quadratic model is computed and compared to an F distribution with 2 and m + n − 2 degrees of freedom to obtain the p-value. We will call this test the quadratic raw data test. O'Brien (1988) recommended using this quadratic test along with the pooled t-test to analyze the data, but we have decided to evaluate only the quadratic raw data test. In addition, O'Brien (1988) suggested using the ranks and squared ranks as independent variables, instead of the raw data, in a test procedure that we will call the quadratic rank test. It is also possible to use the normal scores and the squares of the normal scores, in a test that we will call the quadratic normal scores test. It is expected that the quadratic raw data test, the quadratic rank test, and the quadratic normal scores test will be sensitive to changes in location and scale that will arise with variable treatment effects.

Table 1. Distributions used to generate the individual effects

Standard normal: N(0, 1)
t with df = 4: t with df = 4, adjusted to give σ = 1
Symmetric bimodal: 0.5 N(−√(9/13), 4/13) + 0.5 N(√(9/13), 4/13)
Skewed low kurtosis: RST distribution with α3 = 1 and α4 = 4.2
Skewed high kurtosis: RST distribution with α3 = 1 and α4 = 8.4
Highly skewed low kurtosis: RST distribution with α3 = 2 and α4 = 11.4
Highly skewed high kurtosis: RST distribution with α3 = 2 and α4 = 15.6
Skewed bimodal: 0.75 N(−1/2, 1/7) + 0.25 N(3/2, 4/7)

NOTE: All distributions have a mean of zero and standard deviation of one. The bimodal distributions are mixtures of normals. For RST distributions, see Ramberg et al. (1979).
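The quadratic raw data test can be sketched as a regression of the group indicator on the observations and their squares. The function name is ours, and the error degrees of freedom follow the m + n − 2 stated in the text:

```python
import numpy as np
from scipy import stats

def quadratic_raw_data_test(x, y):
    """O'Brien-style quadratic raw data test (a sketch): regress the
    group indicator on v and v^2 and refer the regression F statistic
    to an F distribution with 2 and m + n - 2 degrees of freedom."""
    m, n = len(x), len(y)
    N = m + n
    v = np.concatenate([x, y])
    g = np.concatenate([np.zeros(m), np.ones(n)])  # group indicator
    X = np.column_stack([np.ones(N), v, v ** 2])
    beta, *_ = np.linalg.lstsq(X, g, rcond=None)
    fitted = X @ beta
    ss_reg = np.sum((fitted - g.mean()) ** 2)  # regression sum of squares
    ss_err = np.sum((g - fitted) ** 2)         # residual sum of squares
    F = (ss_reg / 2) / (ss_err / (N - 2))
    p = stats.f.sf(F, 2, N - 2)
    return F, p
```

The quadratic rank and quadratic normal scores variants substitute ranks, or normal scores, for v before forming the design matrix.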
Finally, Neuhäuser, Leuchs, and Ball (2011) proposed a new test for location and scale. Let x̃ and ỹ be the sample medians for the observations in the first and second samples, respectively. To perform the test we first compute the transformed values |xi − x̃| for i = 1, . . . , m and |yj − ỹ| for j = 1, . . . , n. The proposed test statistic is the sum of the Wilcoxon test statistic on the raw data and the Wilcoxon test statistic on the transformed data. Neuhäuser et al. (2011) recommended that a permutation method be used to compute a p-value, which is what we did in this study. For those tests that use permutation methods we used R = 2000 permutations of the raw data.
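The combined statistic can be sketched as below. Whether the original proposal sums raw or standardized Wilcoxon statistics is an implementation choice we are assuming here; this sketch sums standardized statistics so the location and scale parts are on the same footing, and a permutation method (recomputing the medians for each permutation) would supply the p-value:

```python
import numpy as np
from scipy import stats

def wilcoxon_z(a, b):
    """Standardized Wilcoxon rank-sum statistic for sample b."""
    m, n = len(a), len(b)
    N = m + n
    ranks = stats.rankdata(np.concatenate([a, b]))
    W = ranks[m:].sum()
    mu = n * (N + 1) / 2
    sigma = np.sqrt(m * n * (N + 1) / 12)
    return (W - mu) / sigma

def neuhauser_statistic(x, y):
    """Location-scale statistic (a sketch): Wilcoxon on the raw data
    plus Wilcoxon on absolute deviations from the sample medians."""
    tx = np.abs(x - np.median(x))
    ty = np.abs(y - np.median(y))
    return wilcoxon_z(x, y) + wilcoxon_z(tx, ty)
```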

Description of Simulation
To evaluate the relative power of these tests of significance, we used nine distributions to generate the data in the first sample. These individual effect distributions are listed in Table 1 along with a brief description of each. All nine of these distributions had μ = 0 and σ = 1. In the table, we use N (μ, σ 2 ) to denote the normal distribution with mean μ and variance σ 2 . For the t distribution, we generated a random variable from the t distribution with df = 4; then it was divided by √ 2 so that it had unit variance. The observations from the symmetric bimodal distribution are generated by a N (− √ 9/13, 4/13) distribution with probability one-half, and from a N ( √ 9/13, 4/13) distribution otherwise. Most of the skewed distributions are members of the Ramberg, Schmeiser, and Tukey (RST) family of distributions, which allows the user to specify the first four moments. For these distributions we set μ = 0, σ 2 = 1, and we specified the skewness (α 3 ) and kurtosis (α 4 ) in the manner described by Ramberg et al. (1979). The observations from the bimodal skewed distribution were generated by a N (−1/2, 1/7) distribution with probability 0.75 and from a N (3/2, 4/7) distribution otherwise.
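Two of the individual effect distributions can be sketched directly from the description above; the function names are illustrative:

```python
import numpy as np

def scaled_t4(size, rng):
    """t with df = 4, divided by sqrt(2) so the variance is 1
    (the variance of t_4 is 4/(4 - 2) = 2)."""
    return rng.standard_t(4, size) / np.sqrt(2)

def symmetric_bimodal(size, rng):
    """0.5 N(-sqrt(9/13), 4/13) + 0.5 N(sqrt(9/13), 4/13):
    choose a component with probability one-half, then draw from it.
    The mixture has mean 0 and variance 9/13 + 4/13 = 1."""
    mu = np.sqrt(9 / 13)
    sd = np.sqrt(4 / 13)
    comp = rng.random(size) < 0.5
    return np.where(comp, -mu, mu) + rng.normal(0, sd, size)
```

The skewed RST distributions would be generated from the Ramberg et al. (1979) generalized lambda quantile function with the stated skewness and kurtosis.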
While the observations in the first sample were generated from one of the individual effect distributions in Table 1, the observations in the second sample are obtained as the sum of the individual effect and the treatment effect. The individual effect is obtained from one of the nine distributions listed in Table 1 and the treatment effect is generated from a treatment distribution and is independent of the individual effect. We used five treatment effects for each sample size configuration and distribution. Let ti be the treatment effect for the ith observation and let Δ be the expected value of the treatment effect. If we let εi be the individual effect for the ith observation, then the observations in the first sample are generated from yi = μ + εi for i = 1, . . . , m, and the observations in the second sample are generated from yi = μ + ti + εi for i = 1, . . . , n. For the constant treatment effect we set ti = Δ for i = 1, . . . , n. For the second treatment effect, the effect is uniformly distributed so that ti ∼ U[0, 2Δ] for i = 1, . . . , n. For the third treatment effect, the effect has an exponential distribution with mean Δ so that ti ∼ Exponential(Δ) for i = 1, . . . , n. To generate even more variability in the second sample, the treatment effects for the fourth effect are generated from U[0, 4Δ] with probability one-half, and are set to zero with probability one-half. For the last treatment effect, the effects are generated from an exponential distribution with mean 2Δ with probability one-half, and are set to zero with probability one-half.
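The five treatment-effect designs can be sketched in one helper, writing `delta` for the expected value of the treatment effect; all five designs have mean `delta`, so the tests compare like with like (the `kind` labels are ours):

```python
import numpy as np

def treatment_effects(kind, n, delta, rng):
    """Generate n treatment effects with expected value delta."""
    if kind == "constant":
        return np.full(n, delta)
    if kind == "uniform":              # U[0, 2*delta], mean delta
        return rng.uniform(0, 2 * delta, n)
    if kind == "exponential":          # exponential with mean delta
        return rng.exponential(delta, n)
    if kind == "mixed-uniform":        # U[0, 4*delta] w.p. 1/2, else 0
        return np.where(rng.random(n) < 0.5,
                        rng.uniform(0, 4 * delta, n), 0.0)
    if kind == "mixed-exponential":    # Exp(mean 2*delta) w.p. 1/2, else 0
        return np.where(rng.random(n) < 0.5,
                        rng.exponential(2 * delta, n), 0.0)
    raise ValueError(kind)
```

The second-sample observations are then `mu + treatment_effects(...) + individual_effects`.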
If Δ is set to a large value then most of the tests will have high power, so it will be difficult to compare the powers of the tests. For most of the simulations we investigated the relative powers of these tests when Δ was set to obtain empirical powers near 80%. In these studies we use α = 0.05 and a desired power of 0.8, and we determine Δ from

Δ = (z1−α/2 + zP) √(1/m + 1/n),    (1)

where zq denotes the q-quantile of the standard normal distribution and P is the desired power, which approximates the constant treatment effect needed to achieve the desired power when σ = 1. For given sample sizes of m and n, and a desired power, we compute Δ from Equation (1) and use it with all five treatment effects. We do not adjust Δ to compensate for the loss of power that occurs with variable treatment effects, nor do we adjust it to compensate for nonnormal data.
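Under the assumption that Equation (1) is the usual normal-approximation power formula with unit variance, the calculation is a one-liner:

```python
import numpy as np
from scipy import stats

def delta_for_power(m, n, alpha=0.05, power=0.8):
    """Approximate constant treatment effect giving the desired power
    for the two-sample test with sigma = 1 (a hedged reconstruction
    of Equation (1) using the normal approximation)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_p = stats.norm.ppf(power)
    return (z_a + z_p) * np.sqrt(1 / m + 1 / n)
```

For example, m = n = 15 gives Δ ≈ 1.02, and Δ shrinks as the sample sizes grow, which is why the treatment-effect variability also shrinks with N.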
When variable treatment effects are used, the second population will have greater variability than the first population. Table 2 gives the population standard deviations in the second group for several combinations of sample size and treatment effects for a desired power of 80%. Because Δ will decrease as the sample size increases, the variability of the treatment effect will decrease with increasing sample size. Because the treatment effects and the individual effects are independent, the standard deviations of the observations in the second population were obtained as the square root of the sum of the variance of the individual effect and the variance of the treatment effect.
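Since the individual effects have variance 1 and are independent of the treatment effects, the second-group standard deviation is sqrt(1 + Var(t)). The variances below follow from the five designs (the `kind` labels are ours):

```python
import numpy as np

def second_group_sd(kind, delta):
    """Population SD in the second group: sqrt(1 + Var(t))."""
    var = {
        "constant": 0.0,
        "uniform": (2 * delta) ** 2 / 12,   # Var of U[0, 2*delta]
        "exponential": delta ** 2,          # Var of Exp(mean delta)
        "mixed-uniform": 5 * delta ** 2 / 3,
        "mixed-exponential": 3 * delta ** 2,
    }[kind]
    return np.sqrt(1 + var)
```

As a consistency check, with m = n = 30 the U[0, 2Δ] effect has Δ ≈ 0.723, giving a standard deviation of approximately 1.084, the value quoted from Table 2.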
We used total sample sizes of N = m + n = 30, 60, 120, and 240 in these simulations. For each N, we used a balanced size of m = n = N/2. We also used unbalanced configurations, so that each N had five sample size configurations of balanced and unbalanced data.

Results
We begin by describing the level of significance of these tests. In Table 3 the empirical level of significance is shown, based on simulations that use 10,000 datasets, for two sample size configurations with a normal distribution, a skewed low kurtosis distribution, and a skewed high kurtosis distribution. For each of these tests, we determined if the p-value was less than α for each of the 10,000 datasets, and then we determined the proportion of datasets that led to rejection of the null hypothesis, which is the empirical level of significance. The majority of tests have empirical significance levels near the α = 0.05 level, while a few have levels that are below the α = 0.05 level. Note that the six tests for location and scale generally have empirical significance levels below the α = 0.05 level for m = n = 15. However, the Welch t-test had an empirical significance level of 5.68% when m = 12 and n = 48 observations were generated from a skewed, low kurtosis distribution. The online supplementary tables give the empirical significance level for N = 60 for all distributions and these results show that the empirical significance level of the Welch t-test equals 6.86% for the bimodal skewed distribution when m = 12 and n = 48. We conclude that the Welch t-test does not always maintain its significance level. Note that the results from simulation studies for the significance level with m = 12 and n = 48 also apply to studies having m = 48 and n = 12 because they are performed under the null hypothesis.
Because most of the tests maintain the nominal level of significance, we will proceed to display, in Table 4, the empirical powers of the 19 tests that were used in six of these simulations. We used 10,000 datasets to calculate the empirical power in these simulations. In Table 4, the empirical powers are shown for a simulation study with a constant treatment effect with m = n = 15 observations that were generated from a normal distribution. Because the treatment effect was constant, these are the powers for testing the null hypothesis against a shift alternative. We used Equation (1) to determine the value of Δ that would achieve 80% power for the t-test with a constant treatment effect. If m = n = 15 with data generated from a normal distribution we find that, as expected, the pooled t-test was the most powerful test against a shift alternative. However, many of the other tests had powers within 5% of the maximum. Note that the last seven tests in Table 4 had lower power than the other tests. The next two columns give the same information for datasets that were generated from skewed distributions having α3 = 1. With these skewed distributions, the empirical powers show that the nonparametric tests tend to be more powerful than the t-tests. In the last three columns, we see that these general results also hold for larger samples having m = n = 30 when the treatment effect is variable. In general, the t-tests have greater power with normally distributed errors and the nonparametric tests have greater power with skewed errors. Because, for m = n = 30, the treatment effect distribution was U[0, 2Δ], the standard deviation in the second population is 1.084, as was shown in Table 2. Note that the results for m = n = 15 are not comparable to those for m = n = 30 because different values of Δ were used for the simulations.
For each of the four sample sizes with five sample configurations of balanced and unbalanced data, we obtained the population standard deviation in the second group, which we will denote by σ. In our initial analysis of the results, we found that for simulations that had σ ≤ 1.3 the results were similar to those that had σ = 1.0. We will give some results for the simulations having σ > 1.3 at the end of this section, but most of the simulations had σ ≤ 1.3, as we observed in Table 2. For example, with N = 60, the constant, U[0, 2Δ], and Exponential(Δ) treatment effects had σ ≤ 1.3, and with N = 240 all of the treatment effects had σ ≤ 1.3. Because it would be impractical to show all of the results from this study, in Table 5 we give the average power over the five sample configurations and the nine error distributions for simulation studies that have σ ≤ 1.3. The results for all simulations having N = 60 are given in the online supplementary tables.
To make the results more manageable, we will eliminate from further consideration those tests that have lower power than the other tests, and those tests that appear to have very similar performance to other well-known tests. In Tables 4 and 5, we observe that the last seven tests have lower power than many of the traditional tests. This includes the Wilcoxon-AB test, the Wilcoxon-Mood test, the NS-Klotz test, the three quadratic tests, and the Neuhäuser test. Because these seven tests have lower power than many of the other tests, we will not consider them further. Nor does there seem to be a good reason to include, in our comparisons, the permutation version of the pooled t-test because the pooled t-test maintains its level of significance at or near α with a variety of distributions and sample size configurations. On the other hand, the Welch t-test does not maintain its significance level with unbalanced sample sizes and skewed data, as we observed in Table 3. Hence, we will use the permutation Welch t-test because it maintains its significance level, and we will not consider the Welch t-test further. We note, in Table 5, that the average power of the HFR adaptive test is more than 3% below that of the adaptive WLS test for all sample sizes. In addition, the average power of the HFR test is less than the average power of the rank-sum test for all sample sizes. Hence, we will not include the HFR test in further comparisons.
We also want to exclude any tests whose performance is closely approximated by a standard test. To find tests whose performance is closely related to other tests, we computed the Pearson correlations between the powers of all tests over the 612 simulation studies having N = 30, 60, 120, and 240 with σ ≤ 1.3. We found a correlation of r = 0.996 between the Wilcoxon rank-sum test and the CVM1 test and a correlation of r = 0.982 between the Wilcoxon rank-sum test and the CVM2 test. We also note that the average power of the rank-sum test approximates the average power of the CVM1 test and that the CVM2 test is less powerful than the Wilcoxon test. Hence, we will not include either of the Cramér-von Mises tests in the other comparisons. None of the other tests had correlations with the t-tests or the rank-sum test that exceeded r = 0.973. After excluding these tests, we have seven tests remaining that will be compared.
In Table 6 we see the average power, when Δ is set to achieve 80% power, for the seven most promising tests for simulations having m = n, m < n, and m > n. For each sample size, the average power is obtained by averaging over the nine individual effect distributions, the sample size configurations, and the treatment effects that produce σ ≤ 1.3. From Table 2 we see that, when m = n = 30, this includes simulations with the constant, the U[0, 2Δ], and the Exponential(Δ) treatment effects. Averaging over individual effect distributions and treatment effect distributions can be justified because the researcher would know m and n but would not know the individual effect distribution or the treatment effect distribution. The results in Table 6 show that the NS, BWS, and adaptive tests are generally more powerful than the other tests.
To make the results more useful, we have compared each test to the others for each simulation by finding the test that had the maximum power and all other tests that had powers within 5% of the maximum. We did this because a researcher would prefer to select the most powerful test, but may not be too concerned if the selected test is slightly less powerful than the most powerful test. None of these seven tests is the most powerful test for every simulation, but some are nearly as powerful as the test with the greatest power for most of the simulations. Figure 1 gives the percent of the simulations in which each of the seven tests has the greatest power, or power within 5% of the greatest, for the 27 balanced simulations that have Δ set to give approximately 80% power with N = 60 and m = n with σ ≤ 1.3. These results show that if a researcher had a variety of datasets similar to those used in this study, and chose to use the NS test to analyze every dataset, then they would be assured that they were using the most powerful test, or nearly the most powerful test, of these seven tests. In contrast, if they had decided to use one of the t-tests for all datasets having m = n, they would often be using a test that had much lower power than some of the other tests. Hence, for datasets having N = 60 and m = n the pooled t-test and the Welch t-test appear to be poor choices for general use, and the NS, BWS, and adaptive tests appear to be good choices.
If N = 60 and m < n, the larger sample will have the larger population variance. Figure 2 gives the percent of the simulations where the tests have power within 5% of the greatest power for simulations designed to give approximately 80% power with N = 60 and m < n with σ ≤ 1.3. In these simulations, the BWS test that was proposed by Baumgartner, Weiß, and Schindler (1998) is superior to the other tests, but the NS test and the adaptive test also have high power. Again, the t-tests are not good choices when m < n.

If m > n, the larger sample will have the smaller population variance. For simulations having N = 60 and m > n with σ ≤ 1.3, which were designed to give approximately 80% power, the percent of the simulations where the tests have power within 5% of the greatest power are shown in Figure 3. The results show that the adaptive test is the best choice if the larger sample has the smaller population variance.
So far, we have limited the choice of Δ to a value that would produce an empirical power near 80% for the pooled t-test. To see if these results apply to other values of Δ, we performed simulations having N = 60 with desired powers near 50% and 90%. The average powers for simulations having N = 60 with σ ≤ 1.3 are given in Table 7. The average powers demonstrate that the NS test is more powerful, on average, than the pooled t-test, the Welch t-test, and the rank-sum test. They also show that, when m < n, the BWS test is the most powerful test and, when m > n, the NS test and the adaptive test are more powerful than the other tests. The relative powers of the tests with 50% power and 90% power are similar to those for 80% power that are shown in Table 6.
We also wanted to determine which tests would be appropriate if the treatment effect was so variable that σ > 1.3. In Table 8, we show the average power over the simulations designed to give approximately 80% power that had σ > 1.3, which in this study only occurs with the smaller samples that have the highly variable treatment effects listed in Table 2. The results show that, if σ > 1.3, the t-tests tend to be more powerful than the other tests. When the larger sample has the greater variability the Welch t-test is more powerful, and when the smaller sample has the greater variability the pooled t-test is more powerful. If the sample sizes are equal the Welch t-test is slightly more powerful, on average.
Because the results for σ > 1.3 differ greatly from those for σ ≤ 1.3, we investigated the relationship between σ and the relative power of these tests. If m = n we note, from Table 6, that if σ ≤ 1.3 the NS test is more powerful, on average, than the pooled t-test but, from Table 8, if σ > 1.3 the NS test is less powerful, on average, than the pooled t-test. We will define the power advantage of the NS test to be the power of the NS test minus the power of the pooled t-test. In Figure 4, we display a scatterplot of the power advantage of the NS test versus σ for all simulations having m = n. Clearly, the power advantage of the NS test declines with σ and is negative for σ > 1.3. Note that most of the simulations have σ ≤ 1.3 because large differences in variability only occur with small samples in these simulations, as we have shown in Table 2.
In Figure 5, we display the power advantage of the NS test versus σ for m < n. We see that the power advantage of the NS test declines as σ increases for σ ≤ 1.3, but for σ > 1.3 the power advantage is sometimes positive and sometimes negative. For simulations having m > n, the power advantage of the NS test is shown in Figure 6. For σ < 1.2 the power advantage is positive and for σ > 1.3 the power advantage is negative. Taken together, Figures 4-6 show that the power advantage of the NS test for simulations having σ ≤ 1.3 is quite different from simulations having σ > 1.3.
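Under the additive model used in these studies, the ratio σ has a simple closed form, assuming the random treatment effect is independent of the baseline response (an assumption we add for this illustration): if the control response has standard deviation σ₁ and the treatment effect has standard deviation σ_Δ, the treated-group variance is σ₁² + σ_Δ², so σ = √(1 + σ_Δ²/σ₁²). The ratio therefore exceeds 1.3 only when σ_Δ is roughly 83% of σ₁ or more, which is why large values of σ arise only with the most variable treatment effects.

```python
import math

def sd_ratio(sigma1, sigma_delta):
    """Ratio of the treated-group SD to the control-group SD under the
    additive model Y = X' + Delta, with X' and Delta independent."""
    return math.sqrt(1.0 + (sigma_delta / sigma1) ** 2)

# sqrt(1.3**2 - 1) ~= 0.8307, so the threshold sits between these two:
print(sd_ratio(1.0, 0.83))   # just under 1.3
print(sd_ratio(1.0, 0.85))   # just over 1.3
```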

Conclusions
We have attempted to evaluate the power of many two-sample tests of significance when the treatment effect is variable. If the treatment effect is constant, or somewhat variable so that the ratio of standard deviations does not exceed 1.3, we find that:
• The NS test, the BWS test, and the adaptive test are often more powerful than the pooled t-test, the permutation Welch t-test, the Wilcoxon rank-sum test, and the FP test.
• The BWS test often has lower power than the NS test and the adaptive test if m > n.
• There is little difference in power between the NS and adaptive tests.
• The power advantage of the NS test decreases with the ratio of the standard deviations.
Consequently, if the ratio of standard deviations does not exceed 1.3, then the NS test and the adaptive test are recommended. However, the NS test has several advantages over the adaptive test. The NS test is described in many nonparametric textbooks and is relatively easy to understand. Not only does the NS test have high power over a range of distributions and treatment effects, but it also maintains its level of significance without using a permutation method. In addition, it is available in statistical software packages, including the NPAR1WAY procedure in SAS/STAT® (SAS Institute 2014) software. In contrast, the adaptive test requires a permutation-testing method to obtain a p-value. The major advantage of the adaptive test is that it can also be used for more complex designs, as described by O'Gorman (2012).
These results apply to datasets where it is reasonable to assume that the distributions are stochastically ordered. If a study is being planned, results from similar studies or pilot studies should be used, when available, to determine whether it is reasonable to assume that the ratio of standard deviations will be 1.3 or less. If the data have already been obtained, an analysis strategy might be to first decide whether it is reasonable to assume that the treatment effect, if any, is one-directional. If the treatment effect is nonnegative for all subjects, or nonpositive for all subjects, then the distributions should be stochastically ordered and our results are relevant. Let ω be the ratio of the larger sample standard deviation to the smaller sample standard deviation. If ω ≤ 1.3, then the normal scores test would be a good choice. If ω > 1.3, then the treatment effect is highly variable, so one of the t-tests would be recommended.
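The analysis strategy above can be sketched in a few lines. The function below is our own illustrative encoding of the recommendations, not software from the article: the 1.3 threshold and the Welch-versus-pooled choice follow the results reported here, and a t-test on van der Waerden scores again stands in for the exact normal scores test.

```python
import numpy as np
from scipy import stats

def choose_and_test(x, y, alpha=0.05):
    """Pick a two-sample test from the sample SD ratio omega and return
    (name of the chosen test, omega, its p-value). Assumes the analyst has
    already judged the distributions to be stochastically ordered."""
    s1, s2 = np.std(x, ddof=1), np.std(y, ddof=1)
    omega = max(s1, s2) / min(s1, s2)
    if omega <= 1.3:
        # normal scores test, approximated via van der Waerden scores
        pooled = np.concatenate([x, y])
        scores = stats.norm.ppf(stats.rankdata(pooled) / (len(pooled) + 1))
        p = stats.ttest_ind(scores[:len(x)], scores[len(x):]).pvalue
        return "normal scores", omega, p
    # highly variable treatment effect: fall back to a t-test
    if len(x) == len(y):
        return "Welch t", omega, stats.ttest_ind(x, y, equal_var=False).pvalue
    larger_more_variable = (len(x) > len(y)) == (s1 > s2)
    if larger_more_variable:
        return "Welch t", omega, stats.ttest_ind(x, y, equal_var=False).pvalue
    return "pooled t", omega, stats.ttest_ind(x, y).pvalue
```

For example, two equal-variance samples would be routed to the normal scores branch, while a small sample with a highly variable response would trigger one of the t-tests.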
In some research we assume that the treatment effect is constant, so that the estimated mean treatment effect is also the treatment effect for each individual. If we are unwilling to assume that the treatment effect is constant, we cannot estimate a treatment effect for an individual, but we can still estimate the mean treatment effect.
If observational data are obtained from two populations, the results of this research may be of some value provided the distributions are stochastically ordered. However, if the distributions are not stochastically ordered, with either experimental or observational data, then these results are not relevant. If the distributions are not ordered, the analyst may decide that a test for location is not appropriate and may consider one of the tests for location and scale, such as the test of Lepage (1971) or the test by Neuhäuser, Leuchs, and Ball (2011). [Received February 2014. Revised February 2015.]