Efficient analysis of split-plot experimental designs using model averaging

Abstract Split-plot experimental data are often analyzed as if the data came from a completely randomized design. As is well known, ignoring the different levels of randomization and replication can lead to serious inferential errors. However, in some experiments, including many of the ocean global change experiments that motivated this research, variation between whole-plot experimental units may be small relative to variation between subplot units. Although a factorial analysis generally performs poorly for split-plot data, in this special case it can outperform a split-plot analysis, providing narrower confidence intervals for treatment means and differences with coverage rates close to the desired level. The performance of the proposed model-averaged analysis was compared to a classical split-plot analysis via a simulation study, and its utility was demonstrated for an ocean global change experiment examining growth and condition of a juvenile mussel species. In our simulation study, model-averaged confidence intervals for whole-plot treatment means or comparisons of means were up to 40% narrower than split-plot confidence intervals while maintaining close to nominal coverage rates. In our example experiment, we observed narrowing of up to 25%. We recommend model averaging as a preferred approach when variation between whole-plot experimental units is expected to be less than between subplot units, with a few caveats for studies with very few replicates.


Introduction
Researchers performing manipulative multi-factor experiments are often logistically constrained by physical equipment and resources. A natural consequence is that factors may be manipulated at different scales, leading to multiple levels of replication and randomization. A classical example is the split-plot experimental design, first developed for use in agricultural experiments (Fisher 1925). Split-plot designs occur when, for practical considerations, some factors are kept constant across a "whole-plot" experimental unit, while other factors are allowed to vary across "subplots" within each whole plot. Like Fisher's classic paper, we focus our analysis on the simplest case of two factors, with one factor manipulated at each level. However, we anticipate that our work should be informative for more complex split-plot or related experiments. Thus, in this study there are two levels of randomization and, correspondingly, two types of experimental unit. The first level of randomization occurs at the whole-plot level and assigns whole-plot treatments. Subplot treatments are then randomly allocated to subplots within each of the whole plots.
Our research is motivated by ocean global change experiments relevant for both ecological studies and the aquaculture industry, whose features are described in Section 1.1. However, we start with a classical and illustrative agricultural experiment that motivates the importance of split-plot designs and analyses. Later, we compare these to other experimental fields, e.g. industrial experiments. In the illustrative agricultural experiment, we are interested in testing the effect of irrigation methods and crop varieties on crop yield. It may be physically difficult to vary irrigation methods for areas smaller than a field; however, it is often feasible to have different crop varieties within a field. Thus, we can randomize different irrigation methods to our fields (whole plots), and, within each field, randomize different crop varieties to the smaller subplots (Figure 1a). The subplots are nested within whole plots, and this can be extended such that the whole plots are themselves nested within blocks. Because there may be substantial field-to-field variation in crop yield irrespective of treatment, subplots within the same whole plot will be more similar than subplots in different whole plots. Hence, within-whole-plot comparisons are generally more precise than between-whole-plot comparisons, and this must be accounted for during analysis. Since their origins in agricultural research, split-plot designs have been used across various disciplines including industrial experiments (Bingham and Sitter 2001; Box and Jones 1992; Goos and Vandebroek 2001; Jones and Nachtsheim 2009; Li and Marasteanu 2004; Simpson, Kowalski, and Landman 2004), genetics (Baskauf, McCauley, and Eickmeier 1994; Stevens, Waller, and Lindroth 2007; Whitehead and Crawford 2005), life sciences (Alessi et al. 2005; Creighton et al. 2019; Gueguen et al. 2019; Kuboyama et al. 2017; Wellington et al. 2006), physical sciences (Langhans, Goos, and Vandebroek 2005; Parker et al. 2008), and environmental research (Jain et al. 2019; Purahong et al. 2016; Quinn and Keough 2002; Schielzeth and Nakagawa 2013; Wang et al. 2016).
Analysis of variance (ANOVA) is the classical parametric analysis method for experimental designs, with versions available for particular experimental designs including split-plot designs (Fisher 1925). Split-plot designs are also commonly analyzed using regression-based methods, e.g. via linear mixed models (Bates et al. 2015; Jones and Nachtsheim 2009), or using Bayesian methods (Browne and Draper 2006; Gilmour and Goos 2009), with differences between the three approaches described in Section 2.5. However, the split-plot nature of experiments is often hidden or not understood (Jones and Nachtsheim 2009; Wooding 1973). This can lead to researchers analyzing split-plot experiments as if they were completely randomized factorial experiments, i.e. using a factorial ANOVA instead of a split-plot ANOVA or a linear regression instead of a linear mixed model. Typically, this leads to researchers believing they have more independent replicates than they actually do, a situation that is often termed "pseudoreplication" (Hurlbert 1984). A consequence of this may be incorrect conclusions due to overly small estimates of uncertainty (Ganju and Lucas 1997).
In some cases it is possible to keep experimental conditions nearly constant across different whole plots, leading to small whole-plot error terms. An interesting special case is when the whole-plot variance is close to zero. Then, the models for completely randomized designs and split-plot designs are virtually identical (regardless of whether ANOVA, mixed models, or Bayesian approaches are used to analyze the data). For a narrow range where the ratio of whole-plot to subplot variance is sufficiently small, there are advantages, in terms of a bias-variance tradeoff, in analyzing a split-plot design as a completely randomized design, e.g. using a factorial ANOVA. That is, the widths of confidence intervals for treatment means or comparisons are reduced relative to a split-plot analysis while maintaining nominal or near-nominal coverage. For agricultural or other field-based experiments, the whole-plot variance may be relatively large and this special case would be rare. For many ocean global change studies, however, whole-plot manipulations are often of physico-chemical properties, while biological activity primarily occurs at the subplot level, e.g. in tanks holding organisms (Figure 1b). In these cases, it is often reasonable to expect that whole-plot experimental units will exhibit substantially lower unit-to-unit variation than occurs at the subplot scale. But this is not always the case, and it should not be routinely assumed that this special case is relevant (Hurlbert 2004).

Figure 1. Identical split-plot experimental layouts for (a) an agricultural experiment where irrigation is randomized at the field level and plant variety is randomized across subplots within each field and (b) an ocean global change experiment where pCO₂ is randomized at the header tank level and light is randomized across culture tanks within each header tank. Inherent variability between fields may be substantial, while inherent variability between header tanks may not be.
In other settings, e.g. industrial experiments where whole-plot and subplot variation may both be small on an absolute basis, or social science experiments where both may be large, it may be difficult to anticipate whether their ratio is small. For example, Browne and Draper (2006) include a social science and a health science example. Their first example, a longitudinal study of 2,000 pupils from 50 primary schools, found the ratio near 0.15, while their second example, a Guatemalan child health study, found the ratio closer to 1. Similarly, Jones and Nachtsheim (2009) discuss two industrial examples. The first, a robust product design, found the ratio near 1.3, while the second, a 5-factor central composite response surface design, found the ratio near 0.3.
Determining a priori whether a factorial ANOVA (or equivalent) for a split-plot design would work well for any given experiment is challenging, relies on an assumption, and can lead to poor inference if incorrect. One common approach is to perform an implicit or explicit model selection process to choose between the split-plot or completely randomized factorial model, e.g. reverting to the factorial analysis degrees of freedom when the whole-plot variance is estimated at the boundary of 0 (Bates et al. 2015) or via a stepwise procedure where non-significant random and fixed effects are eliminated sequentially (Kuznetsova et al. 2015). However, inferences based on a single model following a model selection process can lead to confidence intervals with coverage probabilities well below the nominal level (Fletcher and Dillingham 2011; Freedman 1983; Kabaila 2009; Lukacs, Burnham, and Anderson 2010).
An alternative approach that we employ is to use model averaging, in which results from both split-plot and factorial ANOVAs are combined. Model averaging is an approach to estimation (or prediction) that allows for uncertainties due to the model selection process, and is appropriate when the parameter of interest has the same interpretation in all models, as it does when estimating treatment means or comparisons. It has been used in a wide range of application areas, such as econometrics (Kapetanios, Labhard, and Price 2008; Pesaran, Schleicher, and Zaffaroni 2009), ecology (Burnham, Anderson, and Huyvaert 2011; Lamon and Clyde 2000), environmental risk assessment (Clyde 2000; Piegorsch et al. 2013), public health (Jackson, Thompson, and Sharples 2009; Schwartz et al. 2008), and pharmacology (Faes et al. 2007; Lunn 2008). Particularly relevant to this study are previous uses of model averaging to improve precision of treatment means and comparisons in factorial experiments (Fletcher and Dillingham 2011), including ocean global change experiments (Boyd et al. 2016; Zeng et al. 2019). In this context, we think of the split-plot model as the full model, and the factorial model as a reduced-parameter approximation. This is similar to the use of model averaging in factorial experiments, where a full model with all interactions may be thought of as closest to "truth," yet reduced models may be usefully incorporated on a bias-variance tradeoff basis. For a recent comprehensive review of model averaging, see Fletcher (2018).

Motivation from ocean global change studies
Globally, many physical, chemical, and biological aspects of the oceans are expected to change due to anthropogenic CO₂ emissions. Biological and ecosystem impacts are of particular interest, e.g. on ecosystem services, culturally important taxa, or the aquaculture industry, but these are driven by physical and chemical changes. Much of the early research in the field revolved around the impact of individual drivers (factors with positive or negative effects on organisms) or stressors (factors with negative effects), e.g. decreasing pH and its impact on calcifying organisms, a process termed ocean acidification. This process can be observed at natural CO₂ seeps (Fabricius et al. 2011; Hall-Spencer et al. 2008) or via the laboratory-based experiments that motivate this paper. With increasing awareness of the importance of the interplay of various marine drivers with each other (e.g. temperature, light, trace metals, pH), there has been a move away from single-driver studies (e.g. pH only) to multiple-driver studies (Boyd et al. 2019, 2018; Boyd and Hutchins 2012; Byrne and Przeslawski 2013; Riebesell and Gattuso 2015; Wernberg, Smale, and Thomsen 2012).
Multiple-driver studies allow researchers to improve predictions of the effects from our changing oceans (Boyd and Hutchins 2012; Byrne and Przeslawski 2013; Crain, Kroeker, and Halpern 2008; Halpern et al. 2007; Orr et al. 2020; Riebesell and Gattuso 2015). The complexity of multiple-driver experiments grows rapidly when more than two drivers are required (e.g. pH and temperature and nutrients and light) and the maximum sample size available is often small. Similar issues arise in other application areas, e.g. industrial experiments. However, planned confounding using collapsed factorial or similar designs can be used to reduce a multi-factor experiment to a two-factor experiment (Boyd et al. 2016, 2018). Therefore, statistical designs and analyses for two-factor studies are of particular interest to the field. Because of cost and space limitations, analyses that optimize information for a given sample size while maintaining desirable statistical properties are especially relevant. A typical approach to laboratory-based studies examining these phenomena is to isolate an organism or organisms, e.g. in individual tanks or mesocosms, and to subject them to a treatment for a given period. At the end of the period, and often during it, response metrics are recorded, e.g. growth rate or physiology. These experiments normally operate in challenging logistical environments and it is common for an individual experiment to have different levels of replication, including interdependent replicates (Riebesell et al. 2011). In particular, many global change studies manipulate physico-chemical factors of the water supplied to the organisms, such as pH or temperature, at a relatively large scale with few (or single) replicates. These large tanks are commonly referred to as "header" tanks. The large-scale manipulation is then applied to multiple smaller units for which further manipulation may occur, e.g. multiple individual tanks where organisms are cultured being supplied by a common header tank where pH is manipulated, while light is manipulated at the individual tank level (Figure 1b). In these systems there may be little variation between the large-scale replicates. For example, while it is challenging and expensive to manipulate and monitor seawater chemistry (Riebesell et al. 2011), systems with advanced feedback control can produce highly repeatable results (Barr, Lohrer, and Cummings 2017; McGraw et al. 2010). However, other unit-to-unit sources of variability cannot be entirely eliminated, e.g. contamination by micro-organisms (Hurlbert 2004), and should be quantified where possible. Notwithstanding this, there may often be relatively little large-scale variation compared to small-scale biological variability. Ideally, designs and analyses would be used that can accommodate large-scale variation when it occurs, but also allow for and capitalize on the possibility that it may be small in some or many cases.
Our motivating example was a study of green-lipped mussels or kūtai (Perna canaliculus), a culturally treasured (taonga) and commercially important species endemic to Aotearoa New Zealand. Aotearoa New Zealand's aquaculture industry is poised to achieve annual sales of NZD$1 billion by 2025, with green-lipped mussels accounting for over half of the export value (Ministry for Primary Industries 2019). This study examined the potential for a waste product (shells) from the aquaculture industry to mitigate ocean acidification at a vulnerable stage in the mussel farming process. Specifically, it examined whether culturing juvenile green-lipped mussels with a waste shell substrate would afford some advantages in reduced pH conditions by providing the mussels access to additional inorganic carbon for growth and maintenance of their shells. The experimental system had 12 header tanks, each of which supplied water to 6 "raceways." Seawater pH was manipulated at the header tank level using a highly reproducible and stable system, with 6 tanks each set to Year 2020 or projected Year 2050 pH levels. At the raceway level, either inert glass beads or crushed shell was added as a substrate. Juvenile mussels were then placed in each raceway, with growth and condition responses recorded.
Given the reproducibility of the seawater system, and the high level of biological activity expected to occur between mussels at the raceway level, we expected the whole-plot variance to be considerably smaller than the subplot variance. On a theoretical basis, we also expected that model averaging would have benefits relative to a split-plot ANOVA, if the whole-plot variance was small enough. What we did not know was whether the levels of variability observed in a real-world setting would align with the range where model averaging would be particularly effective.

Outline
The outline of the paper is as follows. First, we describe theory for split-plot ANOVA, including inference for common contrasts of scientific interest, e.g. treatment differences at whole-plot or subplot levels. Then, we provide details of model averaging and construction of model-averaged confidence intervals. Next, we describe a simulation study used to evaluate confidence interval coverage and width for model-averaged intervals relative to split-plot ANOVA. We also consider alternative approaches using linear mixed models or hierarchical Bayesian implementations, which yield qualitatively similar results. Then, we indicate the relevance of the simulation study through the analysis of real experimental data involving four response variables concerning two size classes of juvenile mussels. This study follows a slightly different design, indicative of applications beyond the classic split-plot design analyzed in detail in the simulation study, and typical of variation observed across multiple-driver studies of ocean global change. Finally, after presenting relevant results, we discuss implications of our research study. In Supporting Information, we provide additional technical detail for the split-plot model, supplementary simulation results, comparable results using linear mixed effects or Bayesian implementations, and further details about the layout of the mussel experiment.

Split-plot model
We consider the classical split-plot model, with whole-plot treatments in a completely randomized design:

Y_ijkl = μ + s_i + c_j + (sc)_ij + ε_ik + δ_ijkl, (1)

where Y_ijkl is the response of the lth subplot replicate in the kth whole plot given whole-plot treatment i and subplot treatment j; μ is a constant; s_i is the whole-plot treatment effect; c_j is the subplot treatment effect; (sc)_ij is an interaction effect; ε_ik is a whole-plot error; and δ_ijkl is a subplot error. The whole-plot errors are independent N(0, σ_w²) and the subplot errors are independent N(0, σ_s²) random variates, with i = 1, …, n₁, j = 1, …, n₂, k = 1, …, r₁, l = 1, …, r₂; n₁, n₂ are the numbers of whole-plot and subplot treatment levels; and r₁, r₂ are the numbers of whole-plot and subplot replicates. Randomization ensures that the whole-plot errors are mutually independent and that the subplot errors are mutually independent within whole plots. When the split-plot experiment is balanced, the full ANOVA for this model is given in Table S1.
Under model (1), the confidence intervals for treatment comparisons are based on the standard errors and degrees of freedom in Table S2. The degrees of freedom associated with each of the standard errors correspond to the error mean squares. When comparing whole-plot-subplot interactions between different whole plots, the degrees of freedom are obtained using Satterthwaite's approximation. There are more degrees of freedom associated with the subplot error, hence comparisons for the subplot treatments and their interactions are more precise than those for whole-plot treatments. The appropriate multipliers are then obtained using Student's t-distributions.

Factorial model
When a split-plot design is intentionally or unintentionally analyzed as if it were a completely randomized two-way factorial experiment, the model reduces to

Y_ijk = μ + s_i + c_j + (sc)_ij + δ_ijk, (2)

where μ, s, c are as defined previously, and the errors δ_ijk are independent N(0, σ²) for k = 1, …, r₁r₂. Here, all treatment combinations are treated as being randomly assigned to experimental units with a single pooled error variance and common degrees of freedom. This can lead to confidence intervals that are too wide (over-coverage) for subplot factors or, more concerning, too narrow (under-coverage) for whole-plot factors. Therefore, model (2) is usually incorrect for analyzing data from a split-plot experiment. However, if σ_w/σ_s is close to 0, then model (1) and model (2) are nearly equivalent on a theoretical basis.
In this region, model (2) should gain an advantage from an estimation basis, as the pooling of error variances in model (2) produces narrower confidence intervals than model (1) for treatment means or comparisons at the whole-plot level, while maintaining nominal coverage. One component of the simulation study examines this region, determining how narrow it is and how quickly coverage drops.

Model-averaging
Information criteria such as the Akaike Information Criterion (AIC) are often used for model selection or model averaging. In the context of split-plot designs, it is reasonable to assume that model (1) is closer to the truth than model (2). That is, σ_w may be near 0, but it is unlikely to be precisely 0. The purpose of model selection or averaging in this setting is to balance prediction bias against variance by considering a simpler model as a potential approximation to a more complex model, rather than due to any underlying uncertainty about which model is correct. However, use of the "best selected model" can lead to poor coverage rates when it is subsequently used to estimate treatment means or comparisons (Fletcher and Dillingham 2011; Turek and Fletcher 2012). Consequently, our focus is on the properties of model averaging rather than model selection.
In frequentist model averaging, for a parameter of interest ψ, the model-averaged point estimate is a weighted sum of the estimates obtained from each model, i.e.

ψ̂ = Σ_{m=1}^{M} w_m ψ̂_m, (3)

where w_m is the weight for model m, with w_m ≥ 0 and Σ_{m=1}^{M} w_m = 1. The weight associated with a model is often based on an information criterion. The commonly used AIC is a measure of how well the model performs at predicting an independent copy of the observed data, and is often used to determine AIC weights (Fletcher 2018). Below, we describe calculation of AIC weights using the marginal likelihood, and construction of model-averaged confidence intervals.
Model (1) can be expressed as a linear mixed model, i.e.

Y = Xβ + Zb + ε, (4)

where Y is an n × 1 vector corresponding to the response variable and n = n₁n₂r₁r₂; X is an n × p matrix corresponding to the p × 1 parameter vector β of fixed effects with p = n₁n₂; Z is the n × q matrix corresponding to the q × 1 vector b of whole-plot random effects, with q = n₁r₁; and ε is the n × 1 vector of subplot errors. We assumed that b ~ N(0_q, σ_w² I_q), ε ~ N(0_n, σ_s² I_n) and Cov(b, ε) = 0_{q×n}, where I_n is the n × n identity matrix. Model (2) is obtained by simply removing the Zb term.
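To make the dimensions above concrete, the design matrices X and Z for a small balanced layout can be built directly. This is a minimal Python sketch (the paper's own analyses use R); the layout sizes and the cell-means coding of the fixed effects are our illustrative choices:

```python
# Sketch: fixed-effect (X) and random-effect (Z) design matrices for a
# balanced split-plot layout. Sizes are illustrative, not from the paper.
n1, n2, r1, r2 = 2, 3, 2, 2   # whole-plot trts, subplot trts, WP reps, SP reps
p, q = n1 * n2, n1 * r1       # fixed-effect and random-effect dimensions

X, Z = [], []
for i in range(n1):           # whole-plot treatment
    for k in range(r1):       # whole-plot replicate
        for j in range(n2):   # subplot treatment
            for l in range(r2):  # subplot replicate
                x = [0] * p
                x[i * n2 + j] = 1   # cell-means indicator for treatment (i, j)
                z = [0] * q
                z[i * r1 + k] = 1   # indicator of whole plot (i, k)
                X.append(x)
                Z.append(z)

n = len(X)  # n = n1 * n2 * r1 * r2 observations
```

Each row of X selects one treatment-combination mean and each row of Z selects the whole plot the observation sits in, so every column of Z has n₂r₂ ones (the observations sharing that whole-plot error).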
For statistical inference, parameter estimates might be calculated via least squares, maximum likelihood (ML), or Bayesian methods. Likelihood-based methods are widely used, drawing on the assumption that b and ε are normally distributed (Hartley and Rao 1967; Harville 1977; Patterson and Thompson 1971), while restricted maximum likelihood (REML) is often used for linear mixed models. However, when we consider model selection amongst linear mixed models with different numbers of fixed effects or nested fixed effects, the restricted likelihoods of the models are not directly comparable (Faraway 2016; Müller, Scealy, and Welsh 2013; Verbeke and Molenberghs 2000). Consequently, for model averaging, use of ML rather than REML is appropriate. For the split-plot design, the log-likelihood is a special case of the log-likelihood described by Harville (1977) for the general case of linear mixed models:

l(β, θ; y) = −(1/2){ n log(2π) + log|V| + (y − Xβ)ᵀ V⁻¹ (y − Xβ) }, (5)

where θ = (σ_s², σ_w²) and V = σ_w² ZZᵀ + σ_s² I_n. In linear mixed models, AIC weights can be based on either the marginal likelihood (Eq. [5]), or they can be based on the conditional likelihood. The marginal likelihood is appropriate when interest is in estimation of the population parameters; that is, the fixed effects and variance components (Zhang et al. 2016). In other settings, prediction given the random effects b is of interest, e.g. when predicting outcomes for repeat studies on the same system. In these cases, the "conditional likelihood" function, which also treats the random effects as given, is more appropriate (Greven and Kneib 2010; Vaida and Blanchard 2005); also see Zhang, Zou, and Liang (2014) and Kruse, Silbersdorff, and Säfken (2022) for discussion of model averaging in the conditional case. For our setting, the fixed effects are of interest, and so the marginal AIC (mAIC) was used:

mAIC = −2 l(y | β̂, θ̂) + 2k, (6)

where l(y | β̂, θ̂) is the maximized log-likelihood and k is the number of model parameters. Under model (1), θ̂ is obtained from the full ANOVA table.
For linear mixed models, negative estimates of σ_w² are truncated to zero, i.e. model (1) reduces to model (2).
Using mAIC, w_m in Eq. [3] is given by

w_m = exp(−Δ_m/2) / Σ_{m'=1}^{M} exp(−Δ_{m'}/2),

where Δ_m = mAIC_m − min mAIC and min mAIC is the minimum mAIC over the entire model set. The mAIC differences correspond to differences in Kullback-Leibler divergence, e.g. information gain or loss when using one model versus another. To obtain the weights w_m, and to interpret the relative likelihood of a model, the relative likelihoods of the models given the data are normalized to sum to one (Burnham and Anderson 2002). A full justification of this choice of model weights is given in Fletcher (2018).
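The weight calculation itself is only a few lines. Here is a Python sketch (the paper's analyses use R; the mAIC values below are hypothetical, chosen only to illustrate the normalization):

```python
import math

def aic_weights(aics):
    """Normalized information-criterion weights: w_m proportional to exp(-delta_m / 2)."""
    deltas = [a - min(aics) for a in aics]        # delta_m = mAIC_m - min mAIC
    rel = [math.exp(-d / 2) for d in deltas]      # relative likelihood of each model
    total = sum(rel)
    return [r / total for r in rel]               # normalize to sum to one

# Hypothetical mAIC values for the split-plot (model 1) and factorial (model 2) fits
w = aic_weights([102.3, 101.1])  # model 2 has the smaller mAIC, so the larger weight
```

A model whose mAIC exceeds the minimum by Δ receives weight proportional to exp(−Δ/2), so the best model always has the largest weight, and equal mAICs give equal weights.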
For the simple case where we only consider the split-plot model and full factorial model, we have M = 2 in Eq. [3]. This can be extended such that the set of models includes other sub-models, for example, models with or without individual fixed effects or interaction terms. For example, a previous study applied model averaging of fixed effects across 19 models in a 2³ factorial study, finding that interval widths could be reduced by up to 30% while maintaining near-nominal coverage (Fletcher and Dillingham 2011). Here, we focus on the simple M = 2 case in order to most usefully examine the benefits of model averaging on random effects.
Model weights are used to calculate model-averaged confidence intervals (Burnham and Anderson 2002; Fletcher 2018; Turek and Fletcher 2012; Zeng et al. 2019) or confidence distributions. Kabaila, Welsh, and Mainzer (2017), Turek and Fletcher (2012), and Zeng et al. (2019) demonstrate that model-averaged tail area (MATA) intervals outperform conventional model-averaged Wald (MA-WA) intervals (Burnham and Anderson 2002) in terms of coverage and interval width. We focus on MATA-Wald confidence intervals (Turek and Fletcher 2012) here, which are appropriate when individual models use z-based or t-based intervals. However, these methods can be extended to more complex settings, e.g. skewed data, using parametric studentised bootstrap intervals (MATA-SBoot; Zeng et al. 2019).
The MATA-Wald interval calculates a weighted average of the tail areas of the t-based confidence interval associated with each model. The 100(1 − 2α)% MATA-Wald interval [ψ_L, ψ_U] is obtained by solving

Σ_{m=1}^{M} w_m F_{ν_m}((ψ̂_m − ψ_L)/se(ψ̂_m)) = 1 − α and Σ_{m=1}^{M} w_m F_{ν_m}((ψ̂_m − ψ_U)/se(ψ̂_m)) = α,

where F_{ν_m} is the cumulative distribution function of a t-distribution with ν_m degrees of freedom, since (ψ̂_m − ψ)/se(ψ̂_m) has a t-distribution with ν_m degrees of freedom when model m is true.

In Bayesian settings, a variety of approaches for model averaging can be applied, e.g. using Watanabe-Akaike Information Criterion (WAIC) model weights, Bayesian stacking, or reversible jump Markov chain Monte Carlo (Fletcher 2018). In the example presented here, the simple approach of Carlin and Chib (1995) can be applied. In this approach, a variable is incorporated that indicates the selected model, and both model-specific parameters and the model indicator are updated using Markov chain Monte Carlo. First, a prior distribution for model choice is incorporated into the model. Then, responses are modeled conditional on the selected model, e.g. by calculating linear predictors for all models but only using the predictor for the currently selected model. Finally, model weights are given by the proportion of times each model is selected, and model averaging can be done by examining the relevant posterior distribution, e.g. treatment means or comparisons based on the selected model at each iteration.
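To show how the tail-area equations are solved numerically, here is a hedged Python sketch of a z-based MATA interval, i.e. using a normal reference distribution as a large-degrees-of-freedom stand-in for the t-distributions; the function name and inputs are our own illustration, not the MATA package's API:

```python
from statistics import NormalDist

def mata_wald_z(estimates, ses, weights, alpha=0.025):
    """Z-based MATA-Wald sketch: find psi_L, psi_U where the weighted sum of
    model tail areas equals 1 - alpha and alpha respectively. The t-based
    version replaces the normal CDF with the t CDF for each model's df."""
    Phi = NormalDist().cdf

    def mix_cdf(psi):  # weighted average of model CDFs; decreasing in psi
        return sum(w * Phi((est - psi) / se)
                   for w, est, se in zip(weights, estimates, ses))

    def solve(target):  # bisection on the monotone function mix_cdf
        lo = min(e - 10 * s for e, s in zip(estimates, ses))
        hi = max(e + 10 * s for e, s in zip(estimates, ses))
        for _ in range(200):
            mid = (lo + hi) / 2
            if mix_cdf(mid) > target:
                lo = mid  # mix_cdf too large: move right
            else:
                hi = mid
        return (lo + hi) / 2

    return solve(1 - alpha), solve(alpha)  # (lower, upper)

# Hypothetical estimates, standard errors, and mAIC weights for two models
lower, upper = mata_wald_z([1.0, 1.1], [0.30, 0.20], [0.4, 0.6])
```

With a single model the interval reduces to the usual Wald interval (estimate ± 1.96 × se at the 95% level), which is a convenient sanity check.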

Simulation study
We conducted a simulation study to investigate the properties of 95% confidence intervals for treatment comparisons from a split-plot ANOVA and compared them to those obtained when using model averaging. Given how often split-plot experiments are incorrectly analyzed, we also considered intervals estimated using a completely randomized two-way factorial ANOVA, both to demonstrate the narrow range where the factorial ANOVA performs well and to measure how quickly its performance deteriorates. However, the primary comparison of interest is the performance (both precision and coverage) of split-plot ANOVA versus MATA-Wald intervals, as factorial ANOVA is (obviously) not a viable general analysis method for split-plot experiments.
We simulated data with n₁ = 2 whole-plot treatments and n₂ = 3 subplot treatments, reflective of the studies that motivated this research. The numbers of replicates at each level were varied to reflect typical scenarios: r₁ = 2, 3, 4, 5 at the whole-plot level and r₂ = 2, 3, …, 25 at the subplot level. The choice of β has no impact on the coverage rates of confidence intervals, and so we arbitrarily set it to 0_p. This meant that we could also easily assess performance in terms of Type I error rates. In addition, as only the ratio of the whole-plot and subplot error standard deviations is relevant, rather than their actual values, we arbitrarily set σ_s = 1 and considered σ_w between 0.01 and 1; the upper limit was set to 1 as we did not expect model averaging to provide any benefit beyond that range. This means that σ_w can be regarded as the ratio of the whole-plot error standard deviation to the subplot error standard deviation, similar to other settings, e.g. Goos and Vandebroek (2001) and Jones and Nachtsheim (2009), that focus on d = σ_w²/σ_s² when studying properties of split-plot designs. We simulated 10,000 datasets for every scenario, incrementing σ_w by 0.01 between 0.01 and 0.20, and by 0.05 thereafter.
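The data-generating step can be sketched as follows. This is a minimal Python analogue of the simulation (the study itself used R), with parameter names of our own choosing; all treatment effects are set to zero as described above, so each response is just a shared whole-plot error plus a subplot error:

```python
import random

def simulate_split_plot(n1=2, n2=3, r1=3, r2=5,
                        sigma_w=0.1, sigma_s=1.0, seed=1):
    """Simulate one balanced split-plot dataset under model (1) with all
    fixed effects equal to zero. Returns (wp_trt, sp_trt, wp_rep, response)."""
    rng = random.Random(seed)
    data = []
    for i in range(n1):
        for k in range(r1):
            eps_ik = rng.gauss(0.0, sigma_w)  # error shared by whole plot (i, k)
            for j in range(n2):
                for l in range(r2):
                    y = eps_ik + rng.gauss(0.0, sigma_s)  # add subplot error
                    data.append((i, j, k, y))
    return data

data = simulate_split_plot()
```

Each simulated dataset would then be analyzed three ways (split-plot ANOVA, factorial ANOVA, and model averaging), with coverage and interval width accumulated over replicate datasets.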
Model averaging can be used to estimate treatment means or compare treatments. This can be done for individual treatments, or at marginal levels, e.g. examining responses for a given whole-plot level averaged across all subplot levels. Of course, the relevance of marginal means and comparisons depends on both scientific context and the magnitude of any interaction terms. Our simulation was exhaustive in this regard, i.e. we examined mean responses and differences for whole-plot treatments, subplot treatments, interactions within the same whole plot, and interactions across different whole plots. We focus presentation on the whole-plot factor, where model averaging has the greatest potential benefit. This implicitly presumes that inferences at that level are relevant, i.e. that interactions are small enough that describing an average response across levels of the subplot factor is meaningful. Model averaging also provides benefits when comparing individual treatments, but there is no benefit for marginal means or comparisons of the subplot factor. As results are qualitatively similar whether examining treatment means or differences, we focused on treatment means in the main manuscript. Further results are presented in the Supporting Information. The 95% confidence intervals following a split-plot ANOVA were calculated following Table S2, while the intervals for the factorial ANOVA were calculated in the usual manner. The MATA-Wald interval was computed using the R function mata.wald() from the MATA package (Turek 2019), based on the methods described in Section 2.3. We calculated the coverage probability and mean interval width (over all simulations) for each scenario. Using 10,000 simulations for each run ensured that the coverage probability for each scenario was estimated with a binomial standard error of approximately 0.2%. All calculations were performed in R Version 3.6.0.
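The quoted Monte Carlo precision follows directly from the binomial standard error of an estimated coverage probability; a quick check:

```python
import math

# Standard error of a coverage estimate near the nominal 0.95,
# based on 10,000 simulated datasets per scenario
se = math.sqrt(0.95 * (1 - 0.95) / 10_000)
# se is about 0.0022, i.e. roughly 0.2 percentage points
```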

Alternative analysis approaches
Split-plot and other nested designs are commonly analyzed using linear mixed models and Bayesian implementations (Browne and Draper 2006; Kuznetsova, Brockhoff, and Christensen 2017; Schielzeth and Nakagawa 2013). These analysis approaches have benefits in terms of generalizability to a broader variety of models, and are recommended as experimental complexity increases, e.g. multiple levels of random factors or non-orthogonal designs. In particular, Gilmour and Goos (2009) recommend a Bayesian approach for complex, non-orthogonal designs with few large-scale degrees of freedom, where the linear mixed model approach often selects the zero boundary value for whole-plot variance components (Goos, Langhans, and Vandebroek 2006). Both alternative approaches were considered in addition to ANOVA via a set of secondary simulations, with details in Section 3 of the Supporting Information. Any of the three approaches could have been used as the primary analysis, and they typically behave similarly for balanced designs. The primary differences between the three are how they perform with small whole-plot variances, and computational speed. Split-plot ANOVA allows negative whole-plot variance estimates, which perversely can lead to narrower interval widths for whole-plot comparisons compared to a factorial ANOVA or model averaging. However, it otherwise performs well. The linear mixed model approach, as implemented in the R packages lme4 (Bates et al. 2015) or lmerTest (Kuznetsova, Brockhoff, and Christensen 2017), performs an implicit model-selection step if the whole-plot variance estimate hits the boundary of 0, reverting to a factorial ANOVA model (when using Satterthwaite degrees of freedom). The Bayesian implementation requires the most computational effort, precluding 10,000 simulations per scenario, and is also the least used by ocean global change researchers.
Ultimately, we preferred to avoid conflation of model-selection effects and model-averaging effects more than we wanted to avoid negative variance estimates, so chose ANOVA as the primary analysis for the simulation study. For more complex designs, we would normally choose Bayesian or linear mixed model implementations.

Ocean acidification and substrate experiment
For the experiment measuring the effects of ocean acidification and substrate on juvenile mussel growth, twelve independently controlled header tanks were available to manipulate pH. Six header tanks were assigned to "Ambient pH" (Year 2020) conditions and six tanks to "Low pH" (projected Year 2050 conditions) (Figure 2). Each header tank supplied flowing water (50 ml min⁻¹) to six raceways (each 240 mm long × 38 mm wide × 38 mm deep), each containing a small (14-19 mm shell length (SL)) and a large (25-31 mm SL) juvenile mussel (size classes analyzed separately). For each raceway, either crushed shell or inert glass beads were used as a substrate. A common substrate treatment was independently applied to all raceways supplied from a given header tank. Finally, spatial randomization within the laboratory was done in blocks (Figure 2). These latter two aspects differ from the classic design used in the simulation study, which would have led to half the raceways supplied by a given header tank receiving each substrate treatment, and are typical of the variation, or added complexity, in the design of many actual experiments in the field. Details of the experimental system and the pH manipulation and monitoring methods are described in Bylenga, Cummings, and Ryan (2017), using analytical methods from McGraw et al. (2010). Because near-shore marine organisms live in variable pH conditions and this can affect their growth (Cornwall et al. 2013), seawater pH (total hydrogen ion scale, pH_T) in all header tanks was additionally manipulated to mimic the diel cycle measured in field conditions. Due to the commercial importance of mussels, four industry-relevant response variables (growth and condition) were measured after 37 days. Specifically, the growth variables were % growth in SL and % growth in shell width (SW). Physiological condition was calculated using FW:SW, the ratio of dry flesh weight (FW; dried at 60 °C for 48 h) to dry shell weight (SW, air dried for 48 h).
Physical condition used the ratio of SW to SL (SW:SL). Condition indices follow examples in Lucas and Beninger (1985). Split-plot designs come in many varieties, depending on their purpose and the resources available (Goos and Vandebroek 2001; Jones and Nachtsheim 2009). For example, the design for the experiment allows a simpler treatment of the subplots as subsamples, where averaged responses across raceways supplied by each header tank could alternatively be analyzed using a two-factor ANOVA. For such an analysis, any potential (random) block effects are absorbed by the random error term; alternatively, the model could be modified to explicitly include them as fixed or random effects. With model averaging, we could explicitly average both block and header tank effects, i.e. consider models with none, one, or both random effects. However, to keep our focus on model-averaging the header tank random effects, and after preliminary analysis demonstrated that any block effects were minimal, we removed block from the analysis. That is, our interest here was in two variance components (header tank and raceway), to assess the potential efficacy of model-averaging the header tank random effects. As such, we estimated the header tank error standard deviations and the raceway error standard deviations for the four indices (for the small and large juvenile mussels) using split-plot ANOVA, linear mixed effects, and Bayesian approaches. We also performed calculations using model averaging to determine the benefit it would bring. We calculated the confidence interval width reduction (%) between model averaging and a split-plot analysis for each of the three approaches using

W_R = 100 × (CI_SP − CI_MA) / CI_SP,

where CI_MA is the width of the model-averaged confidence interval for the whole-plot treatment comparison (i.e. Ambient pH versus Year 2050 pH) and CI_SP is the width from the comparable split-plot analysis.
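The width-reduction calculation is simple arithmetic; as a minimal sketch (function name ours), with negative values indicating that the model-averaged interval is wider than the split-plot interval:

```python
def width_reduction(ci_ma, ci_sp):
    """Confidence interval width reduction (%) from model averaging,
    relative to the comparable split-plot analysis. Negative values
    mean the model-averaged interval is the wider of the two."""
    return 100.0 * (ci_sp - ci_ma) / ci_sp
```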

Simulation study
Simulated data were generated from a split-plot experimental design and analyzed using split-plot ANOVA, factorial ANOVA, and model averaging. Only split-plot ANOVA and model averaging were considered as viable or potentially viable general analysis approaches; factorial ANOVA was included for purposes of comparison. The simulations highlight the advantages of model averaging and the risks from using a factorial ANOVA rather than a split-plot ANOVA. Related analyses using linear mixed models and Bayesian models showed qualitatively similar results.
First, we examine coverage rates versus the nominal level of 0.95. Figure 3 shows the coverage probabilities for treatment comparisons versus σ_w across different levels of replication, with σ_s = 1 throughout. Figure S1 shows the same information in higher resolution for σ_w < 0.2. Next, Figures 4 and S2 show the ratio of the confidence interval widths under the factorial ANOVA or model averaging, relative to those under the split-plot ANOVA. As expected, the split-plot ANOVA achieved near-nominal coverage rates throughout all simulations, and forms a standard for comparison, while factorial ANOVA and model averaging produce substantially narrower intervals. There is no expectation that nominal rates will be met exactly when using approximate methods like model averaging, or methods that ignore different levels of randomization like factorial ANOVA. While some departure from nominal rates may be acceptable if there are corresponding advantages in terms of reduced widths of confidence intervals, large deviations indicate that uncertainty is not being correctly characterized and the results are therefore untrustworthy. We quickly observe that model averaging maintains near-nominal coverage (Figure 3) while producing substantially narrower interval widths for whole-plot treatment comparisons (Figure 4). In comparison, factorial ANOVA has the narrowest intervals but performs poorly in terms of coverage outside a narrow range, which is why it is not viable as a general analysis method for split-plot experiments. Expanded results follow below.

Factorial ANOVA
We start by demonstrating that a factorial ANOVA for a split-plot design generally performs poorly, except within a narrow range where it has advantages. When σ_w is less than 10% of σ_s, the factorial ANOVA shows close to nominal coverage, especially when replication levels are low (Figures 3 and S1), with the widths of whole-plot confidence intervals being substantially smaller than those from the split-plot ANOVA (Figure 4). In this narrow range where coverage rates are close to nominal, the reduction in width is approximately equivalent to having an additional whole-plot replicate (and corresponding subplot replicates) for each of the three scenarios presented in Figure 3. There is never a benefit to using a factorial ANOVA model when the interest is in subplot treatment comparisons, as the interval is wider rather than narrower (up to 30% in our study, Figure S2).
The reason the factorial ANOVA model cannot be used in general is evident from the rapid and dramatic decline in performance as σ_w increases (Figure 3). Coverage for whole-plot treatment comparisons drops far below the nominal rate, while over-coverage is exhibited for subplot comparisons. There is again under-coverage when estimating treatment means rather than making treatment comparisons (Figures S3-S5), and usually under-coverage for interactions (Figure S5). Type I error rates for F tests (Figure S6) also deviate markedly from 5% as σ_w increases relative to σ_s. Finally, simulations for the other replication combinations described in the Methods section showed similar results, with the three scenarios discussed here indicative of the observed patterns.

Model averaging
We next consider results from the MATA-Wald interval, where most of the benefits of the factorial ANOVA are achieved with substantially reduced coverage loss. Figure 5 shows how mean model weights vary with σ_w for three scenarios. When σ_w is a small fraction of σ_s, around 60% of the model weight (on average) is given to the factorial ANOVA model. With increasing σ_w, the weight shifts to the split-plot ANOVA model; that is, the model-averaged analysis tends toward the split-plot ANOVA as σ_w increases relative to σ_s. Finally, this shift occurs more quickly for greater levels of replication. Figures 3 and S1 show that coverage probabilities when using a MATA-Wald interval are near nominal across a broad range. While coverage probabilities for whole-plot treatment comparisons are still below nominal levels, there is a marked improvement over the factorial ANOVA model, and coverage is usually within 5 percentage points of the nominal rate. In the most challenging scenario (2 whole-plot replicates, 3 subplot replicates), coverage fell slightly below 90% from σ_w = 0.4 until the simulation boundary at 1.0 (Figures 3 and S2). This suggests scope for small-sample calibration to improve coverage, e.g. via an additional expansion of interval widths by a factor based on replication levels.
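The model weights behind these averages are Akaike weights, computed from the AIC of the factorial and split-plot models. A minimal sketch of the standard formula (function name ours):

```python
import math

def aic_weights(aics):
    """Akaike weights from a list of AIC values: w_k is proportional to
    exp(-delta_k / 2), where delta_k = AIC_k - min(AIC), normalized to
    sum to 1."""
    amin = min(aics)
    raw = [math.exp(-(a - amin) / 2.0) for a in aics]
    tot = sum(raw)
    return [r / tot for r in raw]
```

For example, a factorial model whose AIC is 2 units below the split-plot model's receives weight exp(0) / (exp(0) + exp(-1)) ≈ 0.73.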
The advantages of the MATA-Wald interval are evident when we consider the width of the intervals relative to a split-plot ANOVA (Figures 4 and S2). When the interest is in subplot treatment comparisons, there is no benefit in considering a MATA-Wald interval, but, unlike the factorial ANOVA, there is negligible cost (Figure S2). However, for whole-plot treatment comparisons, there is a substantial reduction in interval widths while maintaining near-nominal coverage. While there is a narrow range where the factorial ANOVA outperforms model averaging, the range is small and the gap between them is fairly narrow (Figure 4). Like the factorial ANOVA, the gains observed in Figure 4 correspond to an effective increase in experimental units, on the order of one-half to a single whole-plot replicate and corresponding subplot replicates per treatment. When interest involves interactions, although we expect reduced benefit relative to whole-plot marginal means, near-nominal coverage is maintained (Figure S5).

Figure 4. The ratio of interval widths, relative to a split-plot ANOVA, shows the narrowing that occurs when using factorial ANOVA and model-averaged confidence intervals when making whole-plot treatment comparisons (σ_w ∈ (0.00, 0.20)). The yellow and orange-filled circles are where the error of the coverage probability is greater than 5% and 10%, respectively, demonstrating that the narrowing for the factorial ANOVA model comes at the cost of coverage, while the model-averaged approach maintains near-nominal coverage.

Figure 5. Mean AIC weights given to the factorial ANOVA model for (r1, r2) = (2, 3), (3, 5), and (5, 15).

Linear mixed model and Bayesian model averaging
When we consider the linear mixed model, coverage probabilities are similar to those under the split-plot ANOVA model and within 2% of the nominal level (Figure S7). There is only a slight reduction in the widths of treatment-comparison confidence intervals (< 10%) when replication levels are low; otherwise, the interval widths are virtually identical to those under the split-plot ANOVA model (Figure S8). As such, regardless of whether the linear mixed model or split-plot ANOVA model was used, model averaging with the factorial ANOVA model results in a similar reduction in interval widths.
Bayesian model averaging is straightforward to implement using JAGS (Plummer 2019; Figure S9). Credible intervals under Bayesian model averaging have similar coverage probabilities to confidence intervals under frequentist model averaging with marginal AIC. There is slight over-coverage with the Bayesian credible intervals (Figure S8), which may be due to the choice of prior distributions for σ_s and σ_w. The magnitude of the reduction in interval widths is approximately 10% lower than under frequentist model averaging, but this difference disappears as replication levels increase (Figure S9).
In brief, results from these alternative approaches are qualitatively similar to those under ANOVA. Therefore, the choice of which modeling or model-averaging approach to use lies with the researcher and depends on the focus of the research. The key point is that model averaging (frequentist or Bayesian) gives intervals that are narrower while maintaining near-nominal coverage probabilities, compared to using just a single model (the split-plot ANOVA model, the linear mixed model, or the hierarchical Bayesian model).

Ocean acidification and substrate experiment
Table 1 presents the estimated header tank and raceway error variances for the four response indices using a split-plot ANOVA, and the decrease in interval widths achieved by model averaging. As expected, the header tank error variances were considerably smaller than the raceway error variances, and in half the cases the estimated header tank error variances were negative using ANOVA. When comparing interval widths for whole-plot treatment means for each pH level from model averaging to comparable split-plot analyses, e.g. split-plot ANOVA versus ANOVA-based model averaging or a split-plot Bayesian analysis versus Bayesian model averaging, interval widths were reduced by up to 25% (small mussels, growth in width). Similar results occur for whole-plot treatment comparisons. Interestingly, because of the negative error variance estimates allowed by ANOVA, ANOVA-based model averaging can lead to wider intervals than split-plot ANOVA (i.e., W_R^SPA < 0). However, this widening may be considered desirable, as the model-averaged intervals are closer to intervals that do not allow negative variances. For linear mixed effects and Bayesian approaches, model-averaged intervals are never wider for whole-plot comparisons due to the restriction that σ̂_HT ≥ 0. For linear mixed models, the header tank error variance is set to the boundary value of 0 whenever the split-plot variance estimate is negative. In these cases, there is no difference between factorial, linear mixed model, or model-averaged intervals, as the linear mixed model devolves to the factorial model. That is, all three estimators achieve the narrowest possible intervals. However, substantial reductions in interval widths were observed for all indices that had positive estimates for the header tank error variance. Finally, Bayesian model averaging always achieved narrower intervals than a Bayesian split-plot analysis, as the header tank error variance was strictly positive.

Table 1. Header tank error variances (σ²_HT) in the mussel experiment are small relative to raceway error variances (σ²_RW), as estimated using a split-plot ANOVA. Model averaging can lead to a substantial reduction in interval widths for whole-plot treatment means versus a comparable split-plot analysis, e.g. when using an ANOVA (W_R^SPA), linear mixed model (W_R^LMM), or Bayesian approach (W_R^B).

Discussion
We showed that model averaging for split-plot experimental designs can improve the precision of treatment comparisons while maintaining coverage rates close to nominal levels, with particular advantages accruing when σ_w is small relative to σ_s. If the whole-plot variance term is very small, in principle the different levels of randomization and replication can be entirely ignored and a factorial analysis used without model averaging. That is, there are certain cases where embracing pseudoreplication is advantageous with minimal cost. However, in practice this would be a risky strategy even if it was anticipated that σ_w/σ_s was small, as the range where a factorial analysis performs well is small and the range where it performs extremely badly is large. Model averaging acts as a bet-hedging strategy, recognizing the different levels of randomization but also allowing for the possibility that whole-plot variation may be small. The initial motivation for our study came from conversations with ocean global change researchers about the replication of header tanks. In many ocean global change studies, there may be no replication of header tanks or other large-scale equipment. In these cases, header tank effects and treatment effects are inherently confounded (Hurlbert 2004). However, some researchers mentioned to us that they believed any header tank effects in their setups would be small, based on the effort they took to eliminate variation between tanks. Contrary to these beliefs, Cornwall and Hurd (2016) describe many of the challenges inherent to tank-based studies, the need for replication, and the importance of experiments with sufficient statistical power. While whole-plot variation may be kept low through good experimental practice, it cannot be routinely assumed to be small. That is why best practice incorporates replication at the header tank level (Boyd et al. 2018, 2019; Cornwall and Hurd 2016; Wernberg, Smale, and Thomsen 2012), as in the mussel experiment. Without whole-plot replication, there is no way to employ a model-averaging strategy; with replication, there is. Like a split-plot analysis, this provides protection against false inference, while simultaneously providing narrow confidence intervals when appropriate. The mussel experiment shows the applicability of model averaging for these studies. Given the physical constraints of the laboratory and other on-going experiments, there was no practicable way to physically increase the number of header tanks allocated to this study. However, header tank variation was small relative to raceway variation, and confidence intervals from model averaging for whole-plot comparisons were up to 25% narrower than without it. By replicating at the header tank level, there is no danger of confounding, while use of model averaging avoids pseudoreplication and provides appropriately narrow confidence intervals.
Our methods, while initially motivated by ocean global change studies, are broadly applicable to any split-plot design, including designs with more levels or more complex structures. For example, we are currently working with colleagues applying the method in a mixed model framework to an experiment with two large-scale random factors crossed with each other. Here, the model set includes models where combinations of random and fixed factors are set to 0; initial results suggest a substantial reduction in interval widths for treatment comparisons of primary interest. As in this study, we do not expect perfect coverage for more complex designs, but do anticipate coverage acceptable for many applications. Future studies examining coverage for more complex designs would be beneficial, particularly comparing linear mixed effects and Bayesian implementations due to the potential boundary value issues with mixed effects models identified by Gilmour and Goos (2009). However, there are settings where model averaging will not provide any benefits. Where σ_w is large relative to σ_s, there is no advantage (or disadvantage) in using model averaging, as the weight given to the split-plot model will be close to 1. If it is expected a priori that this would be the case, as may occur in some field-based experiments, there would be no benefit in going through the additional computational effort of model averaging the random factors. Similarly, if the focus is on subplot treatment comparisons, model averaging does not help. However, for many studies we may expect σ_w < σ_s, or at least consider that to be plausible, e.g. pupils within schools (Example 1 in Browne and Draper (2006)) or in some industrial experiments (e.g. Example 2 in Jones and Nachtsheim (2009)). In these settings, model averaging may substantially reduce whole-plot interval widths and can be thought of as increasing the effective sample size at no additional cost.
In principle, it could also allow studies to be designed with fewer observations, although care would be needed regarding unintended consequences, e.g. potential widening of subplot comparisons. Further benefits can accrue when model averaging is employed on the fixed effects in the model (Fletcher and Dillingham 2011).
For our simulation, ANOVA-based methods, linear mixed models, and Bayesian methods all performed similarly. There is, however, a key technical distinction relevant to this study, where the three approaches perform differently when fitting model (1). Usually, any estimation approach based on model (1) will yield wider intervals for whole-plot treatment comparisons than those based on model (2), due to the whole-plot variance component, with model-averaged intervals sitting in between the two. While the Bayesian approach always behaves as expected, the other two do not. Normally, split-plot ANOVA and linear mixed models yield identical estimates to each other (for balanced designs). However, with ANOVA, the estimated variance component for the whole plot may be negative, as observed in the mussel experiment. Then, the split-plot ANOVA may actually yield narrower intervals for whole-plot comparisons than the factorial ANOVA. Model-averaged intervals using ANOVA follow the same reversal. Linear mixed models, e.g. as implemented by the lmer() function in the lmerTest package (Kuznetsova, Brockhoff, and Christensen 2017), truncate variance estimates at zero and revert to a factorial analysis. This means that models (1) and (2) and model averaging yield identical intervals. Despite the obvious disadvantage of a negative variance estimate, estimation procedures that allow negative variances have some advantages, e.g. in terms of bias, and there is no fundamental reason not to employ them (Fletcher and Underwood 2002; Nelder 1954; Wang, Yandell, and Rutledge 1992). However, it is also common to truncate variances to 0. For model averaging, the choice of ANOVA versus linear mixed models versus Bayesian models is unlikely to make a large difference for balanced designs, but it is important to be aware of the differences between them and the seemingly perverse behavior of ANOVA.
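The possibility of a negative ANOVA variance estimate follows from the standard expected-mean-squares argument: for a balanced split-plot design, the whole-plot variance component is estimated from the difference of two mean squares, which is not sign-constrained. A sketch, with m denoting the number of subplot observations per whole plot (function name ours):

```python
def whole_plot_variance(ms_wp_error, ms_sp_error, m):
    """Method-of-moments estimate of the whole-plot variance component
    from a balanced split-plot ANOVA, using E[MS_wp] = sigma_s^2 +
    m * sigma_w^2, where m is the number of subplot observations per
    whole plot. Unlike REML as implemented in lme4/lmerTest, this
    estimate is not truncated at zero, so it can be negative.
    """
    return (ms_wp_error - ms_sp_error) / m
```

Whenever the whole-plot error mean square falls below the subplot error mean square, the estimate is negative, which is the situation observed for half the indices in the mussel experiment.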
The broadly similar performance of the different approaches is important because many experiments are unbalanced, often due to the equipment available or because of data loss during the study. For example, laboratories often have a fixed number of header tanks and, depending on the experimental treatments, the experimenter may need to choose between using all available equipment or having a balanced design. When the design is unbalanced or more complex, more general regression-based modeling methods are required for analysis, and linear mixed effects or Bayesian models are recommended.

Conclusions
We recommend that model averaging be routinely deployed for split-plot designs in settings where the whole-plot variance may be less than the subplot variance, provided that a small departure from nominal coverage is contextually acceptable. We reiterate the importance of incorporating replication at all levels in any analysis, as analyses that ignore whole-plot variance can lead to poor inference even when it is fairly small relative to subplot variance. Some caution is warranted at present for small sample studies, where coverage for nominal 95% model-averaged confidence intervals can drop below 90%, likely due to the difficulty in estimating appropriate model weights with little data, and future work adjusting small sample intervals to improve coverage would be beneficial. Finally, we note that the methods can be employed using ANOVA, linear mixed models, or Bayesian models, with similar benefit.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by a University of Otago Research Grant (no. 115043.01.R.FO). CEC was funded by a Rutherford Discovery Fellowship from The Royal Society of New Zealand Te Apārangi (no. RDF-VUW1701). The mussel experiment was supported by funding from the Sustainable Seas National Science Challenge Innovation Fund (no. IF2.2.2.1).

About the authors
Chuen Yen Hong is a final year MBChB student at the University of Otago, New Zealand, having previously completed a PhD in Statistics on model averaging. Her research interests include model averaging and application of statistics in ophthalmology.

David Fletcher is a Statistical Consultant, who previously worked for 35 years at universities in the US, UK, Australia, and New Zealand. His research interests include model averaging and overdispersion, and he is the author of the Springer book Model Averaging (2018).
Jiaxu (Jimmy) Zeng is a Senior Lecturer in Biostatistics at the University of Otago. His research interests include model averaging, longitudinal data analysis, causal inference, and study design.
Christina McGraw is a Senior Lecturer in Chemistry at the University of Otago, New Zealand. She is an analytical chemist whose research interests include sensor development, carbonate chemistry, and multiple driver experiments. She is also involved in capacity development and building tools for policymakers to help address impacts of ocean acidification.
Christopher Cornwall is a Lecturer in Marine Biology at Victoria University of Wellington, New Zealand, and co-leader of the Understanding theme of the Coastal People: Southern Skies Centre of Research Excellence. His research focusses on kelp forests, coral reefs, and climate change.
Vonda Cummings is a Principal Scientist -Marine Ecology, at the National Institute of Water and Atmosphere. Her research is focused on implications of acidification, climate change and other anthropogenically derived changes to the structure and function of New Zealand and Antarctic ecosystems.
Neill Barr is Principal Technician -Marine Ecology, at the National Institute of Water and Atmosphere, Wellington, New Zealand. His expertise includes developing and building experimentation systems for aquaculture, marine ecology, and ocean change studies. Additionally, he has a PhD in seaweed ecophysiology, specifically around developing seaweeds as indicators of marine pollution.
Emily Frost is a Research Fellow in Aquatic Biotechnology at Auckland University of Technology, New Zealand. Her research interests include marine biology and environmental/comparative physiology, ecology and molecular biology.
Peter W. Dillingham is a Senior Lecturer in Statistics at the University of Otago, and co-leader of the Training theme of the Coastal People: Southern Skies Centre of Research Excellence. His research focusses on ecological and environmental statistics, including the use of model averaging in multi-factor experimental designs.