Design choices in randomization tests that affect power

Abstract We consider the problem of evaluating designs for a small-sample two-arm randomized experiment using the power of the one-sided randomization test as a criterion. Our evaluation assumes a linear response in one observed covariate, an unobserved component and an additive treatment effect, where the model's randomness is due to the different treatment allocations. It is well known that power depends on the allocations' imbalance in the observed covariate. We show that power is additionally affected by two other design choices: the number of allocations and the degree of the allocations' dependence. We prove that the more allocations, the higher the power and the lower the variability in power. Our theoretical findings and simulation studies show that the designs with the highest power provide thousands of nearly independent allocations, each with nominal imbalance in the observed covariate. These high-powered designs exhibit less randomness than complete randomization and more randomness than recently proposed designs that employ numerical optimization. This advantage of high power is easily accessible to practicing experimenters via the popular rerandomization design and a greedy pair-switching design, both of which outperform complete randomization and numerical optimization. The tradeoff we find also provides a means to specify rerandomization's imbalance threshold parameter.


Introduction
Our goal is to examine experimental power in small-sample, classic treatment-control randomized experiments when testing for a positive treatment effect via the randomization test. The subjects have a continuous response, each subject is assigned to the treatment group (T) or the control group (C), and the subjects' covariates are known beforehand and considered fixed. This non-sequential setting was studied by Fisher (1925) when assigning treatments to agricultural plots and is still of great importance today. Such settings occur in clinical trials, as "many phase I studies use 'banks' of healthy volunteers … [and] … in most cluster randomized trials, the clusters are identified before treatment is started" (Senn 2013, page 1440).
To investigate experimental power, we formalize our notation, assume a response model and then compute explicit expressions for the power of the one-sided randomization test in Section 2. Section 3 investigates the power of common designs and provides theoretical results, and Section 4 provides empirical results. We conclude with practical advice and future directions in our discussion, Section 5.

Formulation
We denote the responses y = [y_1, …, y_n]^⊤, where the number of subjects n is assumed even. Each subject is associated with a one-dimensional covariate. The vector of covariates is denoted x = [x_1, …, x_n]^⊤, which is assumed centered and scaled. The assignment or allocation vector is w = [w_1, …, w_n]^⊤, whose entries are either +1 (the subject received T) or −1 (the subject received C), so that w ∈ {−1, +1}^n. The full design D is defined as a discrete uniform random variable with support W_D ⊆ {−1, +1}^n. The subset of designs we investigate are termed forced balance procedures, where all allocations have the same number of treated and control subjects (Rosenberger and Lachin 2016, Chapter 3.3). This is a minor restriction, denoted W_FB = {w : w^⊤ 1_n = 0}, which contains (n choose n/2) allocations. Herein, every design is distinguished by its specific W_D ⊆ W_FB. Allocations can be thought of as a division of the n individuals into two subsets, and the two subsets are delegated to T/C or C/T with probability 1/2. Thus, the allocation space can be written as pairs W_D = {w_1, −w_1, w_2, −w_2, …} and we call this the mirror property of designs.
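As a concrete illustration (our own sketch, not code from the paper), the forced-balance allocation space and its mirror property can be enumerated directly for a small n:

```python
# Enumerate W_FB = {w in {-1,+1}^n : sum(w) = 0} for a small n and check
# that its size is C(n, n/2) and that it satisfies the mirror property.
from itertools import combinations
from math import comb

def forced_balance_allocations(n):
    """All +/-1 vectors of length n with exactly n/2 entries equal to +1."""
    assert n % 2 == 0
    allocations = []
    for treated in combinations(range(n), n // 2):
        w = [-1] * n
        for i in treated:
            w[i] = 1
        allocations.append(tuple(w))
    return allocations

W_FB = forced_balance_allocations(6)
```

For n = 6 this yields (6 choose 3) = 20 allocations, and the mirror −w of every allocation w is itself in the space.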
A few examples of full designs we examine herein are: the balanced complete randomization design (BCRD), specified by W_FB; the rerandomization design, specified by {w : |w^⊤x| ≤ a} ∩ W_FB, where a is a threshold of covariate imbalance (to be elaborated upon later); the pairwise matching design, specified by {w : w_r = −w_s} for unordered pairs with indices {r, s} in the set of pairs (created by minimizing a distance function), which is naturally a subset of W_FB; and the greedy pair-switching design of Krieger, Azriel, and Kapelner (2019), which is a subset of W_FB with covariate imbalance provably lower than the previous designs.
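A minimal rerandomization sketch (our illustration; the covariate values and threshold are toy choices, not the paper's) makes the acceptance rule concrete: draw BCRD allocations and keep only those whose observed-covariate imbalance |w^⊤x/n| falls below the threshold a.

```python
# Rejection-sampling sketch of rerandomization on a forced-balance space.
import random

def bcrd_draw(n, rng):
    """One balanced-complete-randomization allocation: n/2 treated, n/2 control."""
    w = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(w)
    return w

def rerandomize(x, a, R, rng):
    """Collect R allocations meeting the imbalance threshold, then add mirrors."""
    n = len(x)
    kept = []
    while len(kept) < R:
        w = bcrd_draw(n, rng)
        imbalance = abs(sum(wi * xi for wi, xi in zip(w, x)) / n)
        if imbalance <= a:
            kept.append(w)
    return kept + [[-wi for wi in w] for w in kept]  # 2R vectors total

rng = random.Random(0)
x = [(i - 4.5) / 3 for i in range(10)]   # centered, scaled toy covariate
W_2R = rerandomize(x, a=0.1, R=5, rng=rng)
```

Note that mirrors are appended after acceptance: since |B_x| is identical for w and −w, the mirror of any accepted allocation also satisfies the threshold.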
The rerandomization, matching and greedy pair-switching designs are examples of restricted randomization. They are restricted in the sense that the space of available vectors is a proper subset of W_FB. These designs are less random than BCRD but provide a solid advantage: smaller observed covariate imbalance. Our previous work examined this tradeoff in the context of estimation (Kapelner et al. 2020), where we showed that if a design deviates from randomness by too much, i.e., it is too restrictive, one incurs large risk in mean squared error of treatment effect estimation. Herein, we seek to explore this tradeoff by analyzing the power of the randomization test.
We first discuss an important practical consideration. The full designs above have allocation spaces W_D that are exponentially large. In practice, these spaces are sharply limited to W_2R := {w_1, −w_1, w_2, −w_2, …, w_R, −w_R}, a unique randomly sampled subset of W_D that has R mirrored pairs and thus a total of 2R vectors. Henceforth in this paper, we refer to a design (sans "full") as the discrete uniform random variable that draws from W_2R, not the full space. Since this smaller space is used in statistical practice, it is the target of our theoretical investigation henceforth and properties of full designs are not explored.
This introduces a choice for the experimenter beyond which design strategy to employ: how large should the hyperparameter R be? The effect of the value of R on statistical power has, as far as we know, never been investigated. Previously, the R vectors were regarded merely as a means of approximating the full space (Dwass 1957), and common practice is that R should be large enough to provide sufficient resolution to the p-value in the randomization test. Our perspective here is slightly different: R is now a design choice in and of itself.
To investigate the power of the randomization test in our design with hyperparameter choice R, we first assume the following response model:

y = βw + β_x x + z,  (1)

where z is the unexplained but fixed component after a homogeneous additive treatment effect and a linear covariate effect. The only source of the randomness in the response is thus the treatment assignments w. This assumption on the source of randomness is termed the randomization model (Rosenberger and Lachin 2016, Chapter 6.3), whereby "the n subjects are the population of interest; they are not assumed to be randomly drawn from a superpopulation" (Lin 2013, page 297).
Our focus is to assess whether H_a: β > 0 can be demonstrated over H_0: β ≤ 0 beyond a reasonable doubt. We employ the simple differences-in-means estimator β̂ = w^⊤y/n = (Ȳ_T − Ȳ_C)/2, where the equality follows from our assumption of forced balance (i.e., allocations have the same number of treated and control subjects). Our criterion to evaluate experimental designs is the statistical power of the randomization test at level α.
To "run an experiment", an allocation w_run is first chosen at random from W_2R. The allocation w_run is used to generate the responses y_run via Equation (1) and the estimate of the treatment effect β̂_run. Our previous work (Kapelner et al. 2020) examined the effect of the full design on the mean squared error of this estimator in the same settings.
The decision of the one-sided randomization test locates this estimate within the null distribution and, if it is larger than the 1 − α quantile, H_0 is rejected.
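This decision rule can be sketched in code (a toy illustration of ours; the data, constants and helper names are not from the paper):

```python
# One-sided randomization test: reject H0 when the observed estimate exceeds
# the 1 - alpha quantile of the reference (null) distribution of estimates.
import random

def diff_in_means(w, y):
    """Differences-in-means estimator w^T y / n (valid under forced balance)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / len(w)

def randomization_test(w_run, y_run, allocations, alpha=0.05):
    """Return True iff H0: beta <= 0 is rejected at level alpha."""
    null_estimates = sorted(diff_in_means(w, y_run) for w in allocations)
    k = min(int((1 - alpha) * len(null_estimates)), len(null_estimates) - 1)
    return diff_in_means(w_run, y_run) > null_estimates[k]

# Toy run with a strong treatment effect (beta = 2) and small noise.
rng = random.Random(7)
n, beta = 10, 2.0
w_run = [1] * (n // 2) + [-1] * (n // 2)
rng.shuffle(w_run)
y_run = [beta * wi + rng.gauss(0, 0.1) for wi in w_run]
refs = []
for _ in range(100):
    w = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(w)
    refs.extend([w, [-wi for wi in w]])   # mirrored reference set
rejected = randomization_test(w_run, y_run, refs)
```

With an effect this large relative to the noise, the observed estimate sits far in the right tail of the reference distribution and the test rejects.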
In the work that follows, we wish to explore two questions about this widely used testing procedure. (Q1) Is there an optimal value of R? (Q2) How does the choice of the full design affect power?
To explore these questions, we must understand the role of these 2R vectors in the experimental decision and approximate power computation. To do so, consider M, the 2R × 2R matrix of estimates, where the row i indexes the experimental run allocation, i.e., w_run = w_i, and the column j indexes the allocation corresponding to the jth element of its null distribution, i.e., the elements of the matrix are m_ij = w_j^⊤ y_i / n, where y_i denotes the vector of responses when the allocation w_i is applied. The rows of M are organized by the mirrored couples, i.e., the first row corresponds to w_1, the second row corresponds to −w_1, the third row corresponds to w_2, etc. The diagonal elements are the estimators β̂_run for each w_run. For any run w_i, the null hypothesis is rejected if the diagonal element m_ii is greater than Quantile(m_i·, 1 − α), where m_i· denotes the ith row vector of M. We examine experimental power, denoted P_z,W_2R, the proportion of rejections over our 2R run allocations,

P_z,W_2R = (1/(2R)) Σ_{i=1}^{2R} 1(m_ii > Quantile(m_i·, 1 − α)),  (3)

where 1(·) denotes the indicator function and the index z emphasizes the dependence on the unobserved component of the response (from Equation 1).
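The matrix-of-estimates construction can be computed directly; the following is our sketch (toy constants, not the paper's simulation code), with the diagonal-versus-row-quantile rejection rule applied to every run allocation:

```python
# Build the rows of M with m_ij = w_j^T y_i / n, where row i's responses are
# generated under run allocation w_i, and compute power as the proportion of
# runs whose diagonal estimate m_ii exceeds the 1 - alpha quantile of row i.
import random

def power_over_runs(allocations, x, z, beta, beta_x, alpha=0.05):
    n = len(x)
    rejections = 0
    for w_i in allocations:
        # responses when w_i is the run allocation (linear response model)
        y = [beta * wl + beta_x * xl + zl for wl, xl, zl in zip(w_i, x, z)]
        row = [sum(wl * yl for wl, yl in zip(w_j, y)) / n for w_j in allocations]
        m_ii = sum(wl * yl for wl, yl in zip(w_i, y)) / n
        k = min(int((1 - alpha) * len(row)), len(row) - 1)
        rejections += m_ii > sorted(row)[k]
    return rejections / len(allocations)

rng = random.Random(3)
n = 10
x = [(i - 4.5) / 3 for i in range(n)]        # centered toy covariate
z = [rng.gauss(0, 0.2) for _ in range(n)]    # fixed unobserved component
allocs = []
for _ in range(50):
    w = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(w)
    allocs.extend([w, [-wi for wi in w]])    # mirrored pairs: 2R = 100
p_effect = power_over_runs(allocs, x, z, beta=2.0, beta_x=1.0)
p_null = power_over_runs(allocs, x, z, beta=0.0, beta_x=1.0)
```

With a strong effect the diagonal dominates its row and power is high; with β = 0 every row is identical and only about an α fraction of runs reject.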
To understand how this power expression depends on the choice of the full design (Q2), we use the response model of Equation (1) to express the matrix entries as

m_ij = β ρ_ij + β_x B_x,j + B_z,j,

where ρ_ij := w_i^⊤ w_j / n, B_x,j := w_j^⊤ x / n and B_z,j := w_j^⊤ z / n, termed the pairwise allocation correlation, the imbalance in the observed covariate and the imbalance in the unobserved response component, respectively. Note that w_run affects only the first term above.
The quantity ρ_ij is of pivotal importance in this work, so we enumerate a few useful facts here. First, w_i^⊤ w_j is the number of subjects whose assignments agree in the two allocations minus the number that disagree, and thus ρ_ij = 2f_ij − 1, where f_ij is the fraction of assignments that agree in the two allocations. Also, ρ_ij varies within [−1, +1] and ρ_ii = 1. Further, fixing w_i, the conditional expectation of the pairwise allocation correlation over each draw of w_j will be zero for all designs due to the mirror property. Thus, fixing w_i, the conditional variance of the pairwise allocation correlation over each draw of w_j is the expectation of the square, i.e., the average of ρ_ij² = (w_i^⊤ w_j)² / n² over j.
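These facts are easy to check numerically; here is a quick verification of ours on a pair of hand-built allocations:

```python
# Verify rho_ij = 2 f_ij - 1, rho_ii = 1, and that a mirrored pair
# contributes +rho and -rho (so the conditional mean over mirrors is zero).
def rho(w_i, w_j):
    """Pairwise allocation correlation w_i^T w_j / n."""
    return sum(a * b for a, b in zip(w_i, w_j)) / len(w_i)

w1 = [1, 1, 1, 1, -1, -1, -1, -1]
w2 = [1, 1, 1, -1, 1, -1, -1, -1]   # w1 with one treated/control swap
f_12 = sum(a == b for a, b in zip(w1, w2)) / len(w1)   # fraction agreeing
```

Here f_12 = 6/8, so ρ_12 = 2(6/8) − 1 = 1/2, and ρ(w1, −w2) = −1/2 cancels it exactly.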
In the ith experimental run, we compare the null estimates in m_i· with the data-generated estimate m_ii, and we desire m_ii > m_ij for most j in order to reject. If this is true for a large proportion of the R experimental runs, we reject the null often (and frequent rejections translate to high power). Hence, the following difference quantity being positive for a sufficient number of j (based on α) for each fixed i is of great importance:

m_ii − m_ij = β(1 − ρ_ij) + β_x(B_x,i − B_x,j) + (B_z,i − B_z,j).  (5)

This difference (and ultimately experimental power) fundamentally depends on three quantities: (I) the imbalance in the observed covariates, (II) the imbalance in the unobserved response component and (III) the correlations among the set of allocations. All terms are functions of w, making them critically dependent on the experimenter's choice of full design. The results of this paper lie in a careful examination of these three. What follows is an intuitive and conceptual discussion of these terms, while the formal theorems follow in Section 3. If we consider a fixed w_run, we fix i and desire as many positive m_ii − m_ij terms as possible.
We start the discussion with term III, which constitutes the "signal", while terms I and II constitute the "noise" (due to the observed and unobserved covariates, respectively). The signal term is always nonnegative, and thus in the absence of noise, power would be one. The signal term can be controlled by the experimenter through the choice of the allocations in the design, which yields a set of ρ_ij's. Since term III features a 1 − ρ_ij factor and the ρ_ij's come in ± pairs (due to the mirror property), term III optimizes power when the ρ_ij's are as close to zero as possible.
Terms I and II are the noise, and if they were zero, power would be one. Therefore, we would like to design the allocation vectors to make these terms as small as possible. Term I can be effectively eliminated by using restricted randomization, as x is observed before the experimenter selects a design procedure. Popular designs differ here: the pairwise matching design yields I = O_p(n^−1), the rerandomization design makes I small through choosing the threshold a to be small, and Krieger, Azriel, and Kapelner (2019, Theorem 1) prove that greedy pair switching has I = O_p(n^−3). (These orders follow from an assumption that the x_i's are drawn from a continuous distribution with a finite variance.) As for term II, we do not know the values z_1, …, z_n and thus cannot employ restricted randomization. Nevertheless, the term can be small if the allocations are similar to each other, since II can be written (w_i − w_j)^⊤ z / n. This corresponds to a set of allocations where the ρ_ij's are as close to one as possible.
In conclusion, for modest sample sizes, term I is negligible when a restricted randomization design is employed. This was a known answer to (Q2): full designs that seek to equalize treatment and control groups on an observed covariate will have higher power. Terms II and III reveal great complexity, and thus we have no clear answers to our two questions about design decisions when committed to the randomization test. For (Q1), we still do not have any intuition on the role of R, the number of vectors to employ in the randomization test. For (Q2), the full design also controls the pairwise correlations. However, we do not know the effect of the pairwise correlations, since terms II and III are in opposition (II is optimized with large-magnitude correlations and III is optimized with small-magnitude correlations). We answer these questions in the next section through a careful mathematical analysis. The answer to (Q1) is to have a large R. For a second answer to (Q2), we find that term III dominates term II and thus it is best to choose full designs that keep the ρ_ij's close to zero.

Results
To obtain our results, we need to make strong assumptions that allow our problem to become tractable. First, we assume that each allocation has trivial observed covariate imbalance (i.e., B_x,i = 0 for all intents and purposes) for all allocations i = 1, …, 2R. This is attainable by the restricted randomization designs discussed in Section 2 (e.g., via rerandomization with a suitably small threshold). We also assume that B_z,j := Σ_{l=1}^n w_j,l z_l / n ~ N(0, σ_z²/n). This is true when the full design is BCRD due to the finite-population central limit theorem of Hájek (1960), but it is not yet proven for other full designs. This assumption removes the dependence of power on the n fixed values of the z_l's, shifting the dependence to more tractable realizations from a normal model.
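The normal approximation for B_z under BCRD is easy to probe by simulation; this small check of ours fixes z and compares the empirical mean and variance of B_z = w^⊤z/n over BCRD draws to 0 and σ_z²/n:

```python
# Simulate B_z under BCRD for a fixed z vector and check the moments of the
# N(0, sigma_z^2 / n) approximation.
import random

rng = random.Random(0)
n = 50
z = [rng.gauss(0, 1) for _ in range(n)]          # fixed unobserved component
draws = []
for _ in range(20000):
    w = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(w)                               # one BCRD allocation
    draws.append(sum(wi * zi for wi, zi in zip(w, z)) / n)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
sigma2 = sum(zi ** 2 for zi in z) / n            # plug-in sigma_z^2
```

The ratio var · n / σ_z² lands near 1 and the mean near 0, consistent with the approximation (the exact forced-balance variance carries an n/(n−1) correction that is negligible here).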
Due to the mirror property, pairwise allocation correlations come in pairs ρ_ij, −ρ_ij. We also approximate these correlations as ±ρ, i.e., constant in magnitude. In practice, it can be shown that even if these correlations differ, replacing the positive pairs by their average over the (R choose 2) pairs (and the negative pairs by the negative of this average) approximates power well. This approximation is simulated in Section B.2 of the Supplementary Material. Thus, we will assume going forward that ρ is constant within any subset of assignments in a full design. Also, without loss of generality, we let σ_z² = 1, whose value only serves as a scale factor and does not affect the theoretical results of this section.

Mathematical preliminaries
We begin by rewriting the power defined in Equation (3) as

P_z,W_2R = (1/(2R)) Σ_{i=1}^{R} (I_i + I_i^m),

where i denotes the index of the experimental run vector w_run, I_i denotes the indicator that the treatment effect estimator exceeds the 1 − α quantile of the estimates from the other 2R − 1 allocations (i.e., H_0 is rejected) and I_i^m is the analogous indicator for the mirror allocation. We now define our power metric, which averages the above expression over B_z and all 2R-sized subsets of W_D,

P := E_{B_z, W_2R}[P_z,W_2R] = (1/2)(P(I_1 = 1) + P(I_1^m = 1)).  (6)

The expectation above is identical for all different subsets W_2R, as the simplifying assumption of uniform ρ implies each allocation vector subset is equivalent in this context.
To compute P(I_1 = 1) and P(I_1^m = 1), we require comparisons of the estimator under w_run = w_1 to the other 2R − 1 estimators under the other allocations. To emphasize that these are random variables, we denote the 2R estimators V_11, V_12, …, V_1,2R, which are driven by Z_0, Z_1, …, Z_R ~ iid N(0, 1). We first examine the case where R → ∞. Here, the quantile in Equation (7) is fixed if we condition on Z_0 = z, and it can be found by solving Equation (8) for q(z), where Φ(·) denotes the CDF of the standard normal distribution.
Then the probability P(I_1 = 1), which here equals P(I_1^m = 1), reduces to a normal CDF calculation that is a function of z. Averaging over the distribution of Z_0, we obtain the limit of P as R → ∞ in Equation (9), where φ(·) denotes the PDF of the standard normal distribution. Details can be found in Theorem A.2 in the Supplementary Material.
We now examine the case of finite R. Here, we need to compare V_11 to every other estimator and count the number of times V_11 is larger. We change variables from (Z_0, Z_1) to a rotation (U, S), which greatly simplifies our analysis (see Equation A30 in the Supplementary Material). When we condition on U = u and S = s, the events V_1j > V_11 become iid Bernoulli random variables whose probability parameter is a function of u and s. Then P(I_1 = 1 | U = u, S = s) reduces to a binomial CDF calculation. We then average the computation over the distribution of both U and S to arrive at Equation (10), where F_B(·; N, θ) is the CDF of the Binomial(N, θ) random variable, φ(·; μ, σ²) denotes the PDF of a N(μ, σ²) random variable and p(u, s) is the Bernoulli probability above. Details of the derivation of Equation (10) can be found in Theorem A.4 in the Supplementary Material. Note that power is discretized at natural number values of 2αR, which we term "attainable power" values, similar to the concept of attainable p-values (Hemerik and Goeman 2021). From this point forward, when we refer to "R increasing", we mean increasing values in the sequence of ⌊2αR⌋. To gain intuition about how R and ρ affect the power expression of Equation (10), we use Monte Carlo integration to compute P under the following settings: R ∈ {10, 30, 100, 320, 1000, 3160} (approximately 10^{1.0}, 10^{1.5}, …, 10^{3.5}), ρ ∈ {0.0, 0.1, 0.2, 0.3} and n ∈ {26, 50, 100, 200}, fixing α = 5%. The results are illustrated in Figure 1 along with the calculation of lim_{R→∞} P via Equation (9). These illustrations suggest two conjectures: (1) for any fixed ρ, power monotonically increases as R increases and (2) for any fixed R, power monotonically increases as ρ decreases. The ensuing subsections explore these two conjectures analytically.
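The effect of R in the stylized setting can also be seen by direct Monte Carlo. The construction below is our reading of the setup, not the paper's code: with B_x = 0 and constant correlation ±ρ, we model the run estimate as c + √ρ·Z_0 + √(1−ρ)·Z_1 and the other positive-pair estimates as cρ + √ρ·Z_0 + √(1−ρ)·Z_j, with mirrors as their negatives (c stands for the scaled effect; these names and scalings are assumptions).

```python
# Monte Carlo sketch of power under the stylized constant-rho model.
import random

def stylized_power(c, rho, R, alpha=0.05, reps=2000, seed=0):
    rng = random.Random(seed)
    reject = 0
    for _ in range(reps):
        z0 = rng.gauss(0, 1)
        shared = rho ** 0.5 * z0                       # common component
        own = [(1 - rho) ** 0.5 * rng.gauss(0, 1) for _ in range(R)]
        m_run = c + shared + own[0]                    # rho_11 = 1
        others = [c * rho + shared + oj for oj in own[1:]]  # rho_1j = +rho
        all_est = [m_run] + others + [-e for e in [m_run] + others]  # mirrors
        k = min(int((1 - alpha) * 2 * R), 2 * R - 1)
        if m_run > sorted(all_est)[k]:
            reject += 1
    return reject / reps

p_small_R = stylized_power(c=2.0, rho=0.0, R=20)
p_large_R = stylized_power(c=2.0, rho=0.0, R=500)
```

At small R the run estimate must essentially be the maximum of the reference set to reject, while at large R it only needs to clear the 1 − α quantile, so the power gap in favor of large R is substantial.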

As ρ decreases the power increases for fixed R
We prove this conjecture for any R in Theorem A.2 in the Supplementary Material by first noting that ρ has two competing effects that complicate its overall behavior in power. High values of ρ are preferable because they reduce the variance of V_1j − V_11. Simultaneously, low values of ρ are also preferable because they reduce V_1j by diminishing its term ρc. To examine which consideration is stronger, we first examine the case where R = 2. Through much manipulation, we prove that power increases as ρ decreases (see Theorem A.1 in the Supplementary Material).
To understand the role of ρ for R > 2, we need to examine Equation (10), which requires understanding the behavior of the random variable T = p(U, S) in the binomial CDF calculation. Assuming c = 0, Lemma A4.1 in the Supplementary Material shows that the measure of T has a point mass P(T = 1) = 1/2 and otherwise has uniform density on [0, 1] (and this result holds beyond the normality assumption). If c > 0, Lemma A4.2 in the Supplementary Material shows P(T = 1) = Φ(−c) and the density on [0, 1] is given by Equation (11), where u = u(a, t) satisfies Φ(−u + a) + Φ(−u − a) = t. To prove that power increases as ρ decreases, we first show that it is sufficient to demonstrate that for every ρ_1 < ρ_2 there exists t_0 such that f_T(t; ρ_1, c) > f_T(t; ρ_2, c) for t < t_0, and there exists t_1 such that f_T(t; ρ_1, c) < f_T(t; ρ_2, c) for t > t_1. We then prove the existence of t_0 and t_1.

Figure 1. An illustration of asymptotic power in R for different fixed settings of n and ρ where α = 5%. The green horizontal line is the theoretical asymptotic power upper bound (Equation 9). In red is the finite power calculated using the derived expression for power as a function of R, ρ, α (Equation 10) via Monte Carlo integration (one million samples per cell). The points indicate the values of R for which finite power was computed.
The assumption that all pairs of allocations have the same fixed ρ is not realistic for any design used in practice, and it may be questionable whether this theorem applies. We offer simulation evidence demonstrating that power increases as the average ρ_ij declines for the case of R = 40 in the pairwise matching design (see Section B.1 in the Supplementary Material).

As R increases the power increases for fixed ρ
We prove this conjecture in Theorem A5 in the Supplementary Material by first showing that f_T of Equation (11) decreases in t for any ρ and c > 0 (Corollary A4.2 in the Supplementary Material). We then use this fact in conjunction with the behavior of f_T when c = 0 (Lemma A4.1 in the Supplementary Material), and representing the CDF of the binomial as a regularized incomplete beta function allows us to complete the proof.

The variability of power
The previous results addressed the expected value of power. A further consideration is the variability of the power. Recall that the power metric we focus on (Equation 6) considers power for an average value of the imbalance of the unobserved response component B_z, i.e., it takes the expectation over B_z and W_2R. We can also take the standard error over B_z and W_2R to get the standard error in power, SE_{B_z, W_2R}[P_z,W_2R], i.e., the instability in experimental power. Replacing the value of the quantile in Equation (7) by q(z) from Equation (8), Theorem A6 in the Supplementary Material proves that as R increases, this instability monotonically decreases to a positive constant. This limiting constant is a function of c and ρ. Numerical studies show that it does not exhibit monotonicity in either of these two parameters. Also, for typical values of c and ρ, this limiting constant is large; it could be as high as 0.06. Future work will elucidate designs that seek to minimize this value.

Simulations
In this section, we explore the power P of the randomization test at α = 0.05 for different experimental design strategies and different values of R and n. We vary n ∈ {26, 50, 100, 200} and set the observed covariates x to be the {1/(n+1), 2/(n+1), …, n/(n+1)} quantiles of the standard normal distribution. The full designs considered were BCRD, rerandomization, a priori pairwise matching and the greedy pair switching of Krieger, Azriel, and Kapelner (2019). For rerandomization, we picked a threshold of |B_x| corresponding to the 0.1% best out of 1,000,000 allocations from BCRD. All these designs are restricted to have an equal number of subjects n/2 assigned to the treatment and control groups by construction.
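The covariate construction above can be sketched as follows (`statistics.NormalDist` is our choice of inverse-CDF implementation, not necessarily the one used in the paper's code):

```python
# x is set to the i/(n+1) quantiles of the standard normal, i = 1, ..., n,
# which yields a centered, sorted covariate vector.
from statistics import NormalDist

def covariate_grid(n):
    nd = NormalDist()
    return [nd.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

x = covariate_grid(26)
```

By symmetry of the quantiles about 1/2, the resulting x sums to zero (up to floating point).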
We vary R ∈ {10, 30, 100, 320, 1000, 3160}. After sampling the R vectors w, we augment this set by concatenating their mirrors −w, thus arriving at 2R vectors for each design. To generate the response, we vary β_x ∈ {0, 1} and β ∈ {0, 0.25}. The positive value of β was selected to both induce separation among the many simulation settings and result in powers that were not close to either zero or one. In each simulation cell, we run 50 realizations of each of the considered designs, i.e., 50 different subsets W_2R ⊂ W_D. In order to simulate different values of B_z under the asymptotic setting, we take 500 draws of the unobserved covariates z from a standard normal distribution within each design duplicate (a different value of σ_z² would only monotonically shift our results). The responses were computed according to our theoretical setup (Equation 1) and the power in each cell, P_z,W_2R, was tabulated via Equation (3).
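A heavily condensed version of this simulation loop can be written as follows (our sketch: R, the number of z draws and the covariate-effect size are scaled down or exaggerated for speed and contrast, and the rerandomization threshold is a toy value rather than the paper's 0.1% quantile):

```python
# Compare randomization-test power for BCRD versus rerandomization under the
# linear response model y = beta*w + beta_x*x + z.
from statistics import NormalDist
import random

def draw_design(x, R, threshold, rng):
    """R mirrored pairs; threshold = inf gives BCRD, small gives rerandomization."""
    n, kept = len(x), []
    while len(kept) < R:
        w = [1] * (n // 2) + [-1] * (n // 2)
        rng.shuffle(w)
        if abs(sum(wi * xi for wi, xi in zip(w, x)) / n) <= threshold:
            kept.append(w)
    return kept + [[-wi for wi in w] for w in kept]

def estimate(w, y):
    return sum(wi * yi for wi, yi in zip(w, y)) / len(w)

def sim_power(x, threshold, beta, beta_x, R=50, z_reps=20, alpha=0.05, seed=0):
    """Average rejection rate over run allocations and draws of z."""
    rng = random.Random(seed)
    n = len(x)
    allocs = draw_design(x, R, threshold, rng)
    rejects = total = 0
    for _ in range(z_reps):
        z = [rng.gauss(0, 1) for _ in range(n)]
        for w_run in allocs:
            y = [beta * wi + beta_x * xl + zl for wi, xl, zl in zip(w_run, x, z)]
            ests = sorted(estimate(w, y) for w in allocs)
            k = min(int((1 - alpha) * len(ests)), len(ests) - 1)
            rejects += estimate(w_run, y) > ests[k]
            total += 1
    return rejects / total

n = 26
x = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
p_bcrd = sim_power(x, threshold=float("inf"), beta=0.25, beta_x=2.0)
p_rr = sim_power(x, threshold=0.05, beta=0.25, beta_x=2.0)
```

With a nonzero covariate effect, the rerandomized allocations suppress the B_x noise term and the power gap over BCRD is pronounced.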
For n = 26, the total number of allocations is (26 choose 13) = 10,400,600, which is nearly the largest size that can comfortably be enumerated exhaustively. Thus, for the n = 26 setting we include another experimental design, which we term best. Here, we calculate the observed covariate imbalance B_x for each allocation vector w. We then sort from the smallest |B_x| to the largest and enumerate the best 2R vectors (since |B_x| is equal for w and −w, the mirrored pairs appear in order after sorting). Additionally, since this subset of vectors is deterministic, we do not simulate 50 duplicates of this design during the simulation.
The main goal of this simulation is to compare the power P across the various design strategies, R and n, computed via Equation (6), where the expectation over B_z was approximated by averaging the 500 replicates over different z realizations and the expectation over all subsets W_2R ⊂ W_D was approximated by averaging over the 50 replicates of the different W_2R subsets. Using the law of total variance, we can compute standard errors of our simulation that incorporate these two sources of variation.
We also collected other information during the simulation, such as an estimate of the mean ρ, a metric of how similar the allocation vectors are within specific experimental designs by R and n (as measured by the average ρ_ij). This allows us to understand the interplay of R and ρ on power and to assess our theoretical results in settings outside of their stylized assumptions. The power results are found in Figure 2 and the allocation dependence illustrations are found in Figure 3. There are many observations from these plots. First, all eight illustrations of Figure 2 are consistent with our technical result that P increases monotonically in R (Theorem A5 in the Supplementary Material).
Further, the bottom row of Figure 2 (where β_x ≠ 0) demonstrates the effect of the design on the observed covariate imbalance B_x affecting term (I) in the difference quantity (Equation 5). BCRD has poor balancing performance and hence much lower power than the contenders (see Figure C1 in the Supplementary Material for observed covariate imbalance by design and n). Matching has worse power compared to rerandomization and greedy pair switching for small sample sizes, but the difference is no longer detectable at the illustration scale for n = 200. This is due to a combination of worse performance balancing the observed covariate (see Figure C1) and higher allocation dependence (as apparent in Figure 3), i.e., higher average ρ_ij, with the latter consideration being more responsible (as evidenced by the top row of Figure 2, corresponding to the setting of β_x = 0). The strategy of using the best vectors from an exhaustive search does poorly for low R, since the vectors are highly dependent, as apparent from the leftmost plot of Figure 3. The winning strategies in the realistic setting of β_x ≠ 0 are rerandomization and greedy pair switching, because both of these designs can (a) drive |B_x| to nominal levels and (b) provide highly independent allocations.
Figure 3 also confirms the disadvantage of pairwise matching: higher ρ versus the other full designs, as predicted theoretically. Also confirmed is the similarity of ρ among BCRD, rerandomization at the threshold considered and greedy pair switching. Theoretically, BCRD vectors must exhibit less pairwise correlation in small samples than rerandomization and greedy pair switching, a point confirmed by the recent work of Nordin and Schultzberg (2020). The simulation here is not powered to detect the 10-15% difference that their Figure 2 shows. With more covariates and/or less well-behaved covariates, we would likely find marked differences. Additionally, our metric of average absolute pairwise correlation differs from their metric of assignment correlation.
We also verified that the tests are properly sized in all settings (see Figure C2 in the Supplementary Material). And we verified that the standard error of P monotonically decreases in R, but not ultimately to zero. This can be seen by inspecting the length of the error bars of Figure 2, and more clearly in Figure C3 in the Supplementary Material. Again, the winners here are rerandomization and greedy pair switching. The variability of BCRD at small R in the case where β_x ≠ 0 is the largest variability in the simulation. This is likely due to B_x varying wildly across allocations.

Discussion
We investigated the power of the randomization test under different experimental designs in the setting of a simple response model with an additive treatment effect, an additive effect of an observed covariate and an additive effect of a fixed unobserved component. Through our investigation of the power, we noticed that three main features of the experimental design (i.e., within the experimentalist's control) affect power: (1) the allocations' imbalance in the observed covariates, (2) the number of allocations in the design and (3) the orthogonality of the allocations.
The most important design consideration is (1): to make the imbalance in the observed covariate between the two arms small enough that it becomes inconsequential. This fact is not new; it is a well-studied problem with many heuristic designs, including Student's (1938) classic rerandomization and, recently, Bertsimas, Johnson, and Kallus (2015) and Kallus (2018), who employ numerical optimization.
For consideration (2), we have demonstrated that it is critical to have a large number of allocations R in the design, as power will increase and the variability of power due to the effect of the unobserved response component will decrease. An order of R in the low 1000s seems to be sufficient. One must be careful that all these R allocations respect the small observed covariate imbalance restriction. This restriction is explicit in rerandomization, where the minimum threshold imbalance is specified. One can prove that imbalance is very small in matching and greedy pair switching.
As for consideration (3), we conjecture that power will improve upon increasing the orthogonality among the allocations. Intuitively, (2) and (3) are at odds with one another: as the number of vectors increases in a fixed space of dimension n, there will be pairs with larger and larger correlations. The tradeoff depends on the constants σ_z², the importance of the unobserved covariates to the response (conceptually the same as R² for the observed covariates), and β, the size of the experimental effect.
In our previous work (Kapelner et al. 2020), we studied the mean squared error (MSE) of the same estimator in the same settings in an effort to understand how it is affected by the design. We found that (a) the expected MSE over z is optimized when using one allocation vector that minimizes |B_x|, (b) when considering the worst-case MSE over z, BCRD is the optimal strategy and (c) when examining a high quantile of MSE over z, a design that provides good balance while having nearly orthogonal allocations is preferred. Our results herein concerning experimental power dovetail with our previous finding in (c).
Based on this work and our previous work, we offer design recommendations for the practitioner that address these salient considerations. We recommend using rerandomization with a threshold as low as possible while still producing R on the order of a few thousand. The threshold can be lowered further given more time and computational resources at the practitioner's disposal. Rerandomization puts an upper limit on observed covariate imbalance and also provides allocations that seem (based on our numerical experience) to be as orthogonal as allocations expected under BCRD, as suggested by Figure 3 (we await a rigorous proof of this statement). We also recommend greedy pair switching (Krieger, Azriel, and Kapelner 2019), which provides smaller observed imbalance than rerandomization (by an order of n) and whose allocations are proven to be nearly as orthogonal as BCRD's in finite sample sizes.
A priori pairwise matching has observed covariate imbalance performance between these two designs but is not recommended because its allocations are more dependent (less orthogonal), lowering power in the small-sample setting (see Figure 2, columns 1 and 2). However, if the linear additive model (Equation 1) is not believed, matching affords better performance on consideration (1) as it matches the response component of the observed covariate and not just the value of the observed covariate (see Kallus 2018, Section 2.3.2). In that case matching is recommended, as its lower allocation orthogonality does not penalize power by a significant amount.
Designs that enumerate all allocation vectors in W_D and select the w's that provide optimally small |B_x| are possible in very small sample sizes (up to n ≈ 30) and would naively seem to be the best strategy. We strongly recommend against these designs as they have both small R and high ρ̄, which can be deleterious to the experiment (see Figure 2, column 1, green line).

Further research
In this paper we considered one covariate, but in more realistic settings there would be many covariates. In such settings, the imbalance in the observed covariates would increase, increasing the importance of term (I) in Equation (5), but would not impact the other terms, which are functions of the dependence structure of the allocation vectors. Our design recommendation remains the same: find allocations that drive down covariate imbalance while retaining the relative independence of the allocations. To reiterate, this can be accomplished by using a lower rerandomization threshold. Future work can investigate numerical optimization approaches that optimize observed imbalance (Bertsimas, Johnson, and Kallus 2015; Kallus 2018) while still providing many allocations with a high degree of orthogonality. We also assumed the classic difference-in-means estimator β̂ = (1/2)(Ȳ_T − Ȳ_C). Alternatively, one could employ the ordinary least squares estimator, which adjusts for the observed covariate(s). We showed in our previous work that the MSE of that estimator is an entire order of n lower in observed covariate imbalance (Kapelner et al. 2020, Equation 24). Thus, when the ordinary least squares estimator is employed, we anticipate term (I) to decrease but terms (II) and (III) to be unaffected. This would give the number of vectors R and the average absolute allocation dependence ρ̄ a more pronounced impact on experimental power.
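A toy contrast of the two estimators, assuming numpy; the coefficients (β_x = 1, β = 0.25), noise scale, sample size, and seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data from a linear model y = beta_x * x + beta * w + noise,
# with w in {-1, +1}, so the treatment-vs-control gap is 2 * beta.
n = 20
w = rng.permutation([+1] * (n // 2) + [-1] * (n // 2)).astype(float)
x = rng.standard_normal(n)
y = 1.0 * x + 0.25 * w + 0.1 * rng.standard_normal(n)

# Classic difference-in-means estimator: beta_hat = (Ybar_T - Ybar_C) / 2,
# which absorbs any covariate imbalance between the two groups.
beta_dm = (y[w == +1].mean() - y[w == -1].mean()) / 2

# OLS estimator adjusting for the observed covariate:
# regress y on [1, x, w] and read off the coefficient on w.
X = np.column_stack([np.ones(n), x, w])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][2]
```

The OLS coefficient on w removes the observed-imbalance contribution that the difference-in-means estimator absorbs, which is why we anticipate term (I) to shrink under covariate adjustment.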
We also assumed a basic response model, Equation (1), which is additive and linear in the observed covariate, the unobserved covariate, and the treatment effect. If this model is misspecified, the z term would effectively be a function of x, and thus x would appear in the expression for B_z and in its finite-sample central limit theorem. Examining designs in this setting would be interesting future work.
In the approximation scenario where 2R ≪ R_D and a Monte Carlo approximation is employed, the recommendation that R be large has been noted in the literature on permutation tests (Hemerik and Goeman 2018, Section 3.2). An additional concern in the approximation scenario is that our power expression given in Equation (3) implicitly assumes the p-value is computed as the ratio of the number of allocations whose statistic beats the observed estimate to the total number of allocations, known in the literature as the "unbiased estimate". Phipson and Smyth (2010) and many others caution practitioners against this approach as it is anti-conservative by a factor of about 1/(2R) for α < 1/2, an unintuitive flaw that compounds in severity under multiple testing. Their solution is to use the Wilson estimate, adding one to both the numerator and the denominator in the ratio calculation, which is conservative if R is small. However, since in our construction we include m_ii in the quantile calculation, our test is already conservative (Lehmann and Romano 2006, Equation 15.8). Further, since randomization tests are categorically different methods that do not rely on group structure (Hemerik and Goeman 2021), we are unsure how this literature applies to our findings. We leave exploration of these issues to further research, although we do not believe a new estimator or dropping the identity permutation would change our results nor the thrust of our recommendations.
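The two p-value estimates can be contrasted in a short sketch, assuming numpy; the data-generating choices, R = 999, and the use of a plain difference-in-means statistic are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def randomization_pvalues(y, w_obs, R):
    """Monte Carlo randomization-test p-values for a positive effect
    using the difference-in-means statistic. Returns the 'unbiased'
    estimate b/R and the add-one estimate (b+1)/(R+1) discussed by
    Phipson and Smyth (2010)."""
    t_obs = y[w_obs == +1].mean() - y[w_obs == -1].mean()
    b = 0
    for _ in range(R):
        w = rng.permutation(w_obs)          # draw a random reallocation
        t = y[w == +1].mean() - y[w == -1].mean()
        if t >= t_obs:                      # count statistics beating the observed one
            b += 1
    return b / R, (b + 1) / (R + 1)

n = 20
w_obs = np.array([+1] * (n // 2) + [-1] * (n // 2))
y = rng.standard_normal(n) + 0.5 * (w_obs == +1)   # toy positive effect
p_unbiased, p_add_one = randomization_pvalues(y, w_obs, R=999)
```

The add-one estimate is never smaller than b/R, which is the sense in which it trades the anti-conservative flaw for conservatism when R is small.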
Another question our work brings up is: given a fixed budget R, is there a way to minimize ρ̄ by curating a subset of vectors? There is a rich literature across the sciences whose applications require finding allocation vectors with small dependence, similar to the problem we face. For example, if a Hadamard matrix of order n exists, then its rows yield n orthogonal allocations for an experiment of size n. However, our focus is small experiments (e.g., n < 100), and thus this number of allocations falls far short of the desired R > 1000 (even before filtering for covariate balance). Other methods such as Gold (1967) codes and Kasami (1966) codes, used in telecommunications and global positioning system technology, offer larger allocation sets, but not larger by the orders of magnitude we have shown are necessary. Broadly, when R is large relative to n, attempts to eliminate correlation are fruitless in the following sense: for allocation vectors w_1, ..., w_R in an experiment of size n, Datta, Howard, and Cochran (2012, Theorem 2.2) prove that

$$\frac{1}{R(R-1)} \sum_{i \neq j} \left| w_i^\top w_j \right|^2 \;\geq\; n\,\frac{R - n}{R - 1}.$$
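A minimal numerical illustration of both points, assuming numpy; the orders, R = 500, and seed are arbitrary illustrative choices. The Sylvester construction yields a Hadamard matrix whose rows are mutually orthogonal, while for R ≫ n the average squared inner product of any set of ±1 vectors cannot fall below the n(R − n)/(R − 1) lower bound of Datta, Howard, and Cochran (2012, Theorem 2.2).

```python
import numpy as np

def sylvester_hadamard(k):
    """Hadamard matrix of order 2^k via the Sylvester doubling
    construction; its rows are mutually orthogonal +/-1 vectors."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

def mean_sq_inner(W):
    """Average of |w_i^T w_j|^2 over distinct pairs, i.e., the
    left-hand side of the Datta, Howard, and Cochran (2012) bound."""
    R, _ = W.shape
    G = (W @ W.T) ** 2
    return (G.sum() - np.trace(G)) / (R * (R - 1))

H = sylvester_hadamard(4)                 # 16 orthogonal rows, n = 16
assert mean_sq_inner(H.astype(float)) == 0.0

# With R much larger than n, correlation is unavoidable:
rng = np.random.default_rng(3)
n, R = 16, 500
W = rng.choice([-1, 1], size=(R, n)).astype(float)
bound = n * (R - n) / (R - 1)             # the theorem's lower bound
```

Since a Hadamard matrix supplies at most n orthogonal rows, an experiment needing R > 1000 allocations with n < 100 must accept average pairwise dependence at least at the level of the bound.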

Figure 2. Simulated estimates of power P of the randomization test by number of allocation vectors in the design R, where β = 0.25, α = 0.05, and β_x ∈ {0, 1}. Individual plots correspond to different settings of the effect of the observed covariate β_x and the sample size n. Colors indicate the design strategy employed. Error bars are jittered slightly left-right and indicate 95% confidence.

Figure 3. Simulated average |ρ_ij| over all 2R allocation vectors on the log scale for each design method considered, by subset size R. Individual plots correspond to different settings of the sample size n. Allocation vector correlations are independent of β, β_x, and α. Colors indicate the design strategy employed. Error bars are jittered slightly left-right and indicate 95% confidence but are usually smaller than the dot width.
which are the random variables in row m_{1·} computed under the three assumptions).
z is an equicorrelated standard multivariate normal, where B_z is the vector of the B_{z_j}'s for unmirrored allocations j = 1, ..., R. Hence we can write the estimators as