Assessing the Impacts of Cluster Effects and Covariate Imbalance in Cluster Randomized Equivalence Trials

Abstract Equivalence tests establish whether treatments are similar in their intended outcomes. This is in contrast to superiority tests, which establish whether a new treatment is better than a standard treatment or placebo. Few equivalence trials have employed a cluster randomized design, but they are subject to some of the same analysis pitfalls that are common to superiority trials—namely, a failure to adjust for either cluster effects or covariate imbalances resulting from randomization. Using real and simulated data from a cluster randomized trial comparing exercise protocols among U.S. Army soldiers, this study empirically demonstrates the consequences for power and Type I error rates when either or both of these effects have been ignored in the analysis. Analysis of real trial data showed that equivalence test outcomes can change depending on whether appropriate adjustments are applied. Simulations demonstrated that failing to adjust for important baseline covariates severely reduces statistical power, and failing to adjust for cluster effects increases the risk of false declarations of equivalence. As cluster randomized designs are increasingly employed for equivalence trials, analysts must be aware of the importance of adjusting for cluster effects and covariate imbalances to avoid false conclusions.


Introduction
Equivalence tests in randomized controlled trials (RCTs) aim to establish whether different treatments are similar in their intended outcomes. This is in contrast to superiority tests, which seek to determine whether a new treatment has improved outcomes as compared to a standard treatment or placebo. A distinction can also be made from noninferiority tests, which establish whether a new treatment is not worse than the standard by an acceptable margin. Establishing equivalence or noninferiority may be preferred over superiority when a new treatment is not necessarily favored for having greater efficacy, but for being less costly to implement, for example by being cheaper to produce or by having fewer side effects for patients. Equivalence and noninferiority testing are also appropriate for active control trials, where denying treatment to a comparator group would be unethical.
Though equivalence and noninferiority trials are often discussed together in opposition to superiority trials (Chan 2004; Wang et al. 2006; Powers 2008; Dasgupta, Lawson, and Wilson 2010; Bowalekar 2011; Flight and Julious 2016; Kerai 2017; Ranstam and Cook 2017; Rief and Hofmann 2019; Acuna, Dossa, and Baxter 2020; Ghosh et al. 2020; Schober and Vetter 2020), there is arguably greater similarity between noninferiority and superiority from a purely analytical standpoint (Ganju and Rom 2017; Dunn, Copas, and Brocklehurst 2018). Both involve a one-sided hypothesis test, and post hoc superiority may be declared from a noninferiority trial if the treatment difference is large enough in the favored direction. Ganju and Rom (2017) have even suggested that the distinction between the two is better regarded as a matter of clinical versus statistical superiority. Equivalence, on the other hand, might be described as the criterion of both noninferiority and nonsuperiority, statistically speaking. This would be appropriate for situations when true replication of a treatment's effects is desired, or when a large effect magnitude in either direction may be detrimental. Modern equivalence tests were originally developed to assess the comparable bioavailability of pharmaceutical drugs (e.g., a brand name vs. a generic alternative). Since then, the use of equivalence trials has expanded to a variety of therapeutic investigations, and awareness regarding their use continues to grow.
Equivalence tests distinguish themselves from conventional two-sided hypothesis tests through an inversion of the null hypothesis, from that of no difference between the relevant parameters of the new and standard treatments ($H_0: \theta_N - \theta_S = 0$) to that of a difference beyond clinically important thresholds ($H_0: \theta_N - \theta_S \le \delta_L$ or $\theta_N - \theta_S \ge \delta_U$) (Schuirmann 1981; Westlake 1981; Anderson and Hauck 1983). The reason for this inversion is that under the null hypothesis of no difference, equivalence could falsely be declared merely by having an underpowered study. Because equivalence typically implies that the margin of acceptable difference between treatments be small, the sample sizes required to detect the desired effects are usually larger than those needed for superiority trials (Stefanos, Graziella, and Giovanni 2020).
Few equivalence trials have employed a cluster randomized design (Piaggio et al. 2001; Jaffar et al. 2009; Irena et al. 2015), in which naturally occurring groups (e.g., clinics, classrooms, or communities) are the units of randomization, rather than each individual subject (Turner et al. 2017b). A cluster randomized trial (CRT) design may be beneficial when administering different interventions within a cluster is impractical or prone to contamination. Special statistical considerations in CRTs warrant attention, however. First, because some homogeneity is usually expected within each cluster, outcomes are likely to be correlated, which violates the assumption of independence in many standard statistical tests and power calculations. As such, alternative power calculations resulting in greater sample sizes are needed (Rutterford, Copas, and Eldridge 2015; Turner et al. 2017a), as are analyses that account for the dependence of outcomes within clusters (Murray, Varnell, and Blitstein 2004; Campbell, Donner, and Klar 2007; Turner et al. 2017b). Unfortunately, many studies using a cluster randomized design fail to apply appropriate statistical techniques. A systematic review of CRTs from 2004, for example, found that only 59% of published trials accounted for clustering in their analysis (Eldridge et al. 2004). More recently, a 2018 review of cancer-related CRTs found that 51% used appropriate methods (Murray et al. 2018). Expected consequences of ignoring clustering are falsely precise effect estimates (see Appendix, supplementary materials) and test statistics with misspecified degrees of freedom, both of which lead to a higher risk of Type I error. Although this has been thoroughly investigated in the context of superiority trials (Feng et al. 2001; Lee and Thompson 2005), these concerns should apply to equivalence trials as well.
Second, investigators must be aware of the potential for imbalances in baseline covariates between treatment arms. Although a major purpose of randomization in RCTs is to "equalize" the distribution of potential confounders, skewed assignments may still result by chance, with implications for effect bias, power, and Type I error (Pocock et al. 2002; Ciolino et al. 2011, 2019; Turner et al. 2012; Kahan et al. 2014). This is especially true for CRTs, where small numbers of clusters lead to greater chances for imbalance (Ivers et al. 2012; Leon et al. 2013; Wright et al. 2015; Moerbeek and van Schie 2016; Yang et al. 2020). Strategies to lessen the impact of covariate imbalance usually include some form of restricted randomization in the design stage and covariate adjustment in the analysis (preferably declared a priori) (Committee for Proprietary Medicinal Products 2004; Moher et al. 2010). In a 2015 review, Wright et al. (2015) found that more than 40% of CRTs did not report use of restricted randomization and 27% did not adjust for any covariates in the analysis of the primary outcome. Covariate adjustment in particular has been shown to increase power (Moerbeek and van Schie 2016) and mitigate the potential for treatment effect bias (Yang et al. 2020) in CRTs. This has only been demonstrated in the context of superiority testing, however, and not for equivalence tests.
Little research has addressed the consequences of failing to adjust for clustering or covariate imbalance in cluster randomized equivalence trials. Hence, the purpose of this article is to demonstrate the impact on study conclusions when either or both of these have been ignored in the analysis. There are two main parts to this article. First, conclusions will be compared between appropriate and inappropriate analyses of an actual CRT example regarding physical exercise protocols. Appropriate analyses are defined as those that adjust for both the clustering effect and the impact of baseline covariates. Second, a simulation based on the same trial data will show how empirical power and false positive (Type I error) rates differ between appropriate and inappropriate analyses across a range of realistic data generation model parameters. Of particular interest is the performance of the two one-sided tests (TOST) procedure (Schuirmann 1987), which is the current standard for equivalence testing. (Although it is not the most powerful test (Phillips 1990; Berger and Hsu 1996), the simplicity of the TOST has led to its widespread adoption, including in FDA guidelines for bioequivalence trials.) Implications for the analysis of equivalence trials using cluster randomized designs are discussed.

Study Data
This study is based on data from a parallel-arm CRT comparing two exercise protocols meant to improve lumbar strength and core muscular endurance among U.S. Army soldiers (Mayer et al. 2016). These interventions were (i) a suite of core stabilization exercises (CORE) and (ii) a program of high-intensity progressive resistance exercise (HIPRE) for the lumbar extensor muscles. The chosen outcome for the present study is core muscular endurance, measured by seconds achieved in a prone static plank test (this was a secondary outcome in the original trial). The sample originally consisted of 582 soldiers randomized by platoon (cluster) to either the CORE or HIPRE intervention. Ultimately, 451 soldiers from 12 platoons completed the plank test at follow-up; of these, 220 were in the CORE intervention group and 231 received HIPRE. Covariates measured at baseline included sex, age, height, weight, and seconds achieved in the plank test prior to the intervention.

Test for Equivalence
The null and alternative hypotheses for the equivalence test are

$$H_0: \theta \le \delta_L \ \text{or}\ \theta \ge \delta_U \qquad \text{versus} \qquad H_1: \delta_L < \theta < \delta_U,$$

where θ is the average difference in the outcome between the CORE and HIPRE intervention groups. The equivalence thresholds are set at $\delta_L = -10$ and $\delta_U = 10$. Thus, the null hypothesis is rejected and equivalence is declared if an α-level statistical test concludes that the mean difference in the follow-up plank test between interventions is between −10 and 10. In a parallel-arm trial, the test statistics for a two one-sided test (TOST) of the null hypothesis are

$$t_L = \frac{\hat{\theta} - \delta_L}{SE(\hat{\theta})}, \qquad t_U = \frac{\hat{\theta} - \delta_U}{SE(\hat{\theta})},$$

where $\hat{\theta}$ is an estimate of θ and $SE(\hat{\theta})$ is its estimated standard error. Assuming these follow a Student's t-distribution with $v$ degrees of freedom, the null hypothesis can be rejected if both

$$t_L \ge t_{1-\alpha,\,v} \quad \text{and} \quad t_U \le -t_{1-\alpha,\,v}.$$

Since these tests are "equal-tailed" in their rejection regions (Berger and Hsu 1996), an equivalent procedure in this case is to reject the null hypothesis if the 100(1 − 2α)% confidence interval for θ is contained entirely within the thresholds:

$$\left( \hat{\theta} - t_{1-\alpha,\,v}\, SE(\hat{\theta}),\ \hat{\theta} + t_{1-\alpha,\,v}\, SE(\hat{\theta}) \right) \subset (\delta_L, \delta_U).$$
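For concreteness, the following is a minimal R sketch of this decision rule; the tost() helper and the numbers in the example call are illustrative, not values from the trial.

# Two one-sided tests (TOST) for equivalence from summary statistics.
# theta_hat: estimated mean difference; se: its standard error;
# df: degrees of freedom; delta: equivalence thresholds (lower, upper).
tost <- function(theta_hat, se, df, delta = c(-10, 10), alpha = 0.05) {
  t_L <- (theta_hat - delta[1]) / se  # tests H0: theta <= delta_L
  t_U <- (theta_hat - delta[2]) / se  # tests H0: theta >= delta_U
  crit <- qt(1 - alpha, df)
  # Reject H0 (declare equivalence) only if both one-sided tests reject;
  # equivalently, the 100(1 - 2*alpha)% CI must lie inside (delta_L, delta_U).
  ci <- theta_hat + c(-1, 1) * crit * se
  list(ci = ci, reject = (t_L >= crit) && (t_U <= -crit))
}

# Hypothetical example: a difference of 2.3 s with standard error 3.1 and 449 df
tost(2.3, 3.1, 449)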

Comparison of Approaches
Four different TOSTs based on the 90% confidence interval for θ were compared. These tests involved:

1. The unadjusted estimates of θ and its standard error, ignoring the effects of clustering
2. The unadjusted estimates, accounting for the effects of clustering
3. The covariate-adjusted estimates, ignoring the effects of clustering
4. The covariate-adjusted estimates, accounting for the effects of clustering

The estimates of (1) and (3) were obtained from an ordinary least squares model fit to the data, and the estimates of (2) and (4) were obtained from a two-level linear mixed model fit using REML, with platoons included as random intercepts. Note that for linear mixed models, the test statistics are only approximately t-distributed, particularly with unbalanced designs. The Satterthwaite or Kenward-Roger approximations to the degrees of freedom are often recommended (Brown and Prescott 2015; Luke 2017), though alternative inferential approaches (e.g., likelihood-based inference, bootstrapping, or large-sample normal approximations) are valid as well. To maintain theoretical consistency with the TOST, we apply the Satterthwaite adjustment to the values of $t_{1-\alpha,\,v}$ used for the confidence intervals in (2) and (4).
For the analysis of the actual trial data, the estimates of θ, the 90% confidence intervals, and the conclusions from the four above tests were computed. The estimated intraclass correlations (ICC) were also recorded for (2) and (4).
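As an illustration of how tests (1) through (4) might be implemented in R, the sketch below fits the four models; the data frame dat and its column names (plank_post, plank_pre, treatment coded 0 = CORE and 1 = HIPRE, platoon) are hypothetical stand-ins for the trial variables.

library(lmerTest)  # loads lme4 and adds Satterthwaite df to lmer() summaries

# (1) Unadjusted, ignoring clustering: ordinary least squares
m1 <- lm(plank_post ~ treatment, data = dat)

# (2) Unadjusted, accounting for clustering: random platoon intercepts (REML)
m2 <- lmer(plank_post ~ treatment + (1 | platoon), data = dat)

# (3) Covariate-adjusted, ignoring clustering
m3 <- lm(plank_post ~ treatment + plank_pre + sex + age + height + weight,
         data = dat)

# (4) Covariate-adjusted, accounting for clustering
m4 <- lmer(plank_post ~ treatment + plank_pre + sex + age + height + weight +
             (1 | platoon), data = dat)

# 90% CI for the treatment difference: directly for the OLS models (1), (3) ...
confint(m3, "treatment", level = 0.90)
# ... and via the Satterthwaite degrees of freedom for (2) and (4)
s4 <- coef(summary(m4))["treatment", ]
s4["Estimate"] + c(-1, 1) * qt(0.95, s4["df"]) * s4["Std. Error"]

# Estimated ICC from model (4): between-platoon variance over total variance
vc <- as.data.frame(VarCorr(m4))
vc$vcov[vc$grp == "platoon"] / sum(vc$vcov)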

Simulation Model
For the simulation study, outcome data were generated from the following model:

$$Y_{ij} = \mu + \theta t_i + \gamma X_{ij} + u_i + \varepsilon_{ij},$$

where $Y_{ij}$ is seconds achieved in the follow-up plank test for the jth soldier (j = 1, 2, ..., m) in the ith platoon (i = 1, 2, ..., K), μ is the mean outcome for subjects receiving the CORE intervention, θ is the mean difference in outcomes between subjects in the HIPRE and CORE groups, $t_i$ is a treatment indicator taking the value 0 for the CORE group and 1 for the HIPRE group, γ is the mean increase in $Y_{ij}$ for each unit increase in covariate $X_{ij}$, and $X_{ij}$ is seconds achieved in the plank test at baseline. $u_i$ is a random term representing the additive effect of platoon i on outcome $Y_{ij}$, and $\varepsilon_{ij}$ is the subject-specific residual error. $u_i$ and $\varepsilon_{ij}$ are independent and normally distributed with variances $\sigma^2_u$ and $\sigma^2_\varepsilon$, respectively.

The parameters in this model were set to empirical estimates from the full model fit to the trial data (with sex, age, height, and weight included as additional covariates). An exception was $\sigma^2_u$, which was varied across three different simulation scenarios, each with a different chosen value for the intraclass correlation (ICC), $\rho = \sigma^2_u / (\sigma^2_u + \sigma^2_\varepsilon)$. All parameter values are listed in Table 1. θ was set to 0.5 for the evaluation of empirical power and set to the lower equivalence threshold $\delta_L = -10$ for an evaluation of the empirical rate of Type I error. Simulations were repeated for each combination of θ and ρ, for a total of six scenarios. The number of clusters K was calculated to achieve approximately 80% power at level α = 0.05 for each setting of ρ, with a hypothesized effect size of 0.5.
Equal numbers of platoons, each of size m = 38, were assigned to each intervention group. The single covariate $X_{ij}$ was generated as a uniform U(51, 540) random variate, with the range equal to that observed for the baseline plank test among actual trial subjects. Covariate imbalance was induced by choosing a randomly generated assignment such that the average difference in X between treatment arms ($\bar{X}_{HIPRE} - \bar{X}_{CORE}$) was within one unit of the 0.80 quantile of its distribution. This created an imbalance where subjects in the HIPRE group had a higher average value of X, which would create a positive bias in unadjusted estimates of θ (Yang et al. 2020). $t_i$ and $X_{ij}$ remained fixed across replicate datasets.
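The sketch below generates a single replicate dataset under this scheme. The slope and variance parameters shown are placeholders standing in for the Table 1 estimates, and the imbalance rule is a simplified stand-in for the 0.80-quantile criterion described above.

set.seed(20231)

# Illustrative parameter values; the study's actual values appear in Table 1.
K <- 18                      # number of platoons (even, half per arm)
m <- 38                      # soldiers per platoon
mu <- 100; theta <- 0.5      # CORE mean and treatment difference (seconds)
gamma <- 0.2                 # hypothetical slope on the baseline covariate
rho <- 0.05                  # illustrative intraclass correlation
sigma2_e <- 2500             # hypothetical residual variance
sigma2_u <- rho * sigma2_e / (1 - rho)  # between-platoon variance implied by rho

platoon <- rep(seq_len(K), each = m)
x <- runif(K * m, 51, 540)   # baseline plank time, using the trial's range

# Induce covariate imbalance: re-draw the cluster-level assignment until the
# HIPRE arm has a clearly higher mean baseline value (a simplified stand-in
# for the paper's 0.80-quantile rule)
repeat {
  arm <- sample(rep(0:1, each = K / 2))  # 0 = CORE, 1 = HIPRE
  trt <- arm[platoon]
  if (mean(x[trt == 1]) - mean(x[trt == 0]) > 10) break
}

# One replicate outcome from the data-generating model
u <- rnorm(K, 0, sqrt(sigma2_u))[platoon]  # platoon random effects
y <- mu + theta * trt + gamma * x + u + rnorm(K * m, 0, sqrt(sigma2_e))
dat <- data.frame(plank_post = y, plank_pre = x, treatment = trt, platoon)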

Simulation Scenarios
There were two sets of simulation scenarios, and each scenario had 10,000 replicates. The first set had a "ground truth" of θ = 0.5, a treatment difference well within the equivalence thresholds. The rejection rate of the null hypothesis for each of the four TOSTs provided an estimate of their empirical power. This was repeated for three preset values of the ICC (see Table 1). The second set of simulation scenarios had a "ground truth" of θ = −10, a treatment difference equal to the lower equivalence threshold. In this case, the rejection rate of the null hypothesis for each of the four TOSTs indicated their expected rate of false positives (Type I error). This was again repeated for the three different values of the ICC listed in Table 1. In addition, the empirical MSE for each of the four estimates of θ was computed in each scenario.
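Reusing the tost() helper and the fixed design (trt, x) from the sketches above, the replication loop for a single scenario might look as follows; only the appropriate model (4) is shown, with a reduced number of replicates for brevity.

n_rep <- 1000  # 10,000 in the study
reject <- logical(n_rep)
for (r in seq_len(n_rep)) {
  # Redraw only the random effects and residuals; trt and x stay fixed
  u <- rnorm(K, 0, sqrt(sigma2_u))[platoon]
  dat$plank_post <- mu + theta * trt + gamma * x + u +
    rnorm(K * m, 0, sqrt(sigma2_e))
  fit <- lmer(plank_post ~ treatment + plank_pre + (1 | platoon), data = dat)
  s <- coef(summary(fit))["treatment", ]  # lmerTest supplies Satterthwaite df
  reject[r] <- tost(s["Estimate"], s["Std. Error"], s["df"])$reject
}
mean(reject)  # empirical power when theta = 0.5; Type I error when theta = -10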

Sample Size, Nominal Power, and α
For each simulation scenario, the sample size needed to achieve at least 80% power using a standard TOST was obtained using PROC POWER in SAS 9.4. With a hypothesized effect size for θ set to 0.5, α set to 0.05, and the assumed within-cluster variance ($\sigma^2_\varepsilon$) from Table 1, the resulting total sample size was 662 individuals. Since this number is intended for individually randomized trials, it was inflated by a multiplicative factor of $1 + (m-1)\rho$, where m is the average cluster size and ρ is the intraclass correlation. This factor is known as the "design effect" for CRTs (Rutterford, Copas, and Eldridge 2015). The final sample size was rounded up so that it was divisible by 2m, resulting in an even number of clusters. The final number of clusters corresponding to each setting of ρ is listed in Table 1.
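As a sketch of this calculation in R, with an illustrative ICC value:

n_ind <- 662                 # total n from PROC POWER for the unclustered TOST
m <- 38                      # average cluster size
rho <- 0.05                  # illustrative ICC; Table 1 lists the values used
deff <- 1 + (m - 1) * rho    # design effect for a CRT
n_crt <- n_ind * deff        # inflated total sample size
K <- 2 * ceiling(n_crt / (2 * m))  # round up to an even number of platoons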
All simulations and analyses were performed in R (v3.6.1). Linear mixed models were fit with the lmer function from the lme4 package (v1.1-23) (Bates et al. 2015), and the Satterthwaite approximation to the degrees of freedom was implemented using lmerTest (v3.1-3) (Kuznetsova, Brockhoff, and Christensen 2017).

Trial Data
For the actual trial, the average values of the model covariates in each treatment arm are presented in Table 2. The results from applying each type of TOST are presented in Table 3. Although the tests are underpowered with 451 participants in 12 platoons, it is apparent that covariate adjustment makes it easier to reject the null hypothesis by explaining a portion of the residual variance. Ignoring the effects of clustering makes the confidence intervals narrower still, due to underestimation of the standard error of the estimated treatment difference. This leads to a rejection of the null hypothesis and a declaration of equivalence in the adjusted model that ignores clustering, whereas the adjusted model that accounts for clustering (the appropriate model) fails to establish equivalence.

Simulation
The empirical power of each TOST for the first set of simulation scenarios is provided in Table 4. Samples of the simulated tests are plotted in Figure 1. Power is generally lower in the unadjusted than in the adjusted models. This is because in the unadjusted models, the variability in the outcome due to the baseline covariate becomes part of the estimated residual variance ($\sigma^2_\varepsilon$), which in turn widens the confidence interval for θ and increases the MSE. The imbalance in covariates also creates bias, as shown in Table 4, and as evidenced by the asymmetric distribution of the unadjusted confidence intervals around θ in Figure 1. Models that ignore the clustering effect reject the null hypothesis more often than models that account for it. This is because the standard error of the estimated treatment difference is underestimated when clustering is ignored, and the confidence intervals for θ are hence narrower. Although ignoring clustering leads to higher power in the case of true equivalence, this comes at the cost of a greater risk of false positives when equivalence in fact does not exist. This is seen in Table 4, which shows the empirical rate of Type I error (false positives) when the true intervention effect is equal to $\delta_L$, the lower equivalence threshold. Plotted samples appear in Figure 2.
Even when clustering is accounted for, the risk of false positives is elevated when covariate imbalance is left unadjusted, since in this particular scenario the imbalance biases the estimate of θ toward the midpoint of the equivalence thresholds. Had the covariate imbalance occurred in the opposite direction (biasing the estimate of θ toward $-\infty$), the Type I error rate for the unadjusted models would be smaller, yet power would still be diminished as well. Only in the adjusted model that accounts for clustering (the correct model) is the empirical Type I error rate near the nominal level of 0.05.

Discussion
As equivalence trials become more common outside of pharmacological research, diversification in trial designs calls for guidance with respect to appropriate statistical procedures. Cluster randomized equivalence trials in particular have been rare, but this is expected to change as situations are encountered where treatment assignment at the group level offers practical advantages over assignment at the individual level. Failing to account for the statistical dependence of responses within groups (i.e., clusters) is a common mistake during analysis, however (Eldridge et al. 2004; Murray et al. 2018). Furthermore, randomization by cluster increases the potential for unequal covariate distributions among treatment arms, which, if unadjusted for, can result in statistical bias (Ivers et al. 2012; Leon et al. 2013).

Through an analysis of trial data comparing physical exercise protocols on lumbar strength among U.S. Army soldiers, this study has shown that equivalence tests are not robust to failures to account for cluster effects or covariate imbalances when using a cluster randomized design. In addition, a simulation based on the same trial data has demonstrated that not adjusting for covariate imbalance can severely reduce the statistical power of equivalence tests by introducing both bias and additional error into the estimate of the treatment difference. Meanwhile, incorrectly assuming independence increases the risk of falsely concluding equivalence when in fact there is none.
A few things are worth noting: first, the extent of treatment effect bias depends on the degree of covariate imbalance. Also, the present simulation assumes that the adjusted model specifies the true functional relationship between the covariates and the outcome. In practice, this relationship may be difficult to model correctly. Hence, the bias-remedying effects of covariate adjustment observed in this simulation may be somewhat optimistic.
Importantly, this study also assumes that treatment effects are constant among clusters. It is common, however, for treatment effects to depend on cluster membership, and this is often modeled with additional random effects specifying a cluster-treatment interaction. In this case, equivalence tests (as heretofore discussed) would similarly apply to the mean treatment effect among clusters, but investigators must also decide whether equivalence is important for the between-cluster treatment variance or other model parameters as well (Berger and Hsu 1996; Chervoneva, Hyslop, and Hauck 2007; Hua, Xu, and D'Agostino 2015; Pallmann and Jaki 2017).
In the case of mean effects, however, these findings mirror the expectations for superiority tests in CRTs (Campbell, Donner, and Klar 2007; Yang et al. 2020). The importance of these adjustments should also extend to alternative CRT designs, such as crossover, stepped-wedge, network-randomized, pseudocluster, and individually randomized group treatment trials (Campbell and Walters 2014; Moerbeek and Teerenstra 2015; Turner et al. 2017a, 2017b). Future research should also explore further developments for equivalence testing in these contexts.

Supplementary Materials
The statistical properties of the OLS estimator for fixed effects applied to clustered data are described in the appendix. R code for the simulation is also included as a supplementary file.