A Comparison of Multilevel Imputation Schemes for Random Coefficient Models: Fully Conditional Specification and Joint Model Imputation with Random Covariance Matrices

Abstract: Literature addressing missing data handling for random coefficient models is particularly scant, and the few studies to date have focused on the fully conditional specification framework and "reverse random coefficient" imputation. Although it has not received much attention in the literature, a joint modeling strategy that uses random within-cluster covariance matrices to preserve cluster-specific associations is a promising alternative for random coefficient analyses. This study is apparently the first to directly compare these procedures. Analytic results suggest that both imputation procedures can introduce bias-inducing incompatibilities with a random coefficient analysis model. Problems with fully conditional specification result from an incorrect distributional assumption, whereas joint imputation uses an underparameterized model that assumes uncorrelated intercepts and slopes. Monte Carlo simulations suggest that biases from these issues are tolerable if the missing data rate is 10% or lower and the sample is composed of at least 30 clusters with 15 observations per group. Furthermore, fully conditional specification tends to be superior with intraclass correlations that are typical of cross-sectional data (e.g., ICC = .10), whereas the joint model is preferable with high values typical of longitudinal designs (e.g., ICC = .50).

model that features incomplete variables regressed on complete variables, whereas fully conditional specification imputes variables one at a time from a sequence of univariate models with an incomplete variable regressed on complete and previously imputed variables. As explained later, univariate and multivariate imputation schemes offer very different approaches to random coefficient models.
The few studies to examine missing data handling for random coefficient models have focused on the fully conditional specification imputation framework (Enders et al., 2016) popularized by van Buuren and colleagues (van Buuren, 2011, 2012; van Buuren, Brand, Groothuis-Oudshoorn, & Rubin, 2006; van Buuren et al., 2014). In the context of a random slope analysis, fully conditional specification employs a "reverse random coefficient" approach that treats y as a random predictor of an incomplete covariate, and vice versa. Simulation results suggest that fully conditional specification can produce negatively biased slope variance estimates, and analytic work shows that an incorrect distributional assumption is responsible for this problem (Enders, Du, & Keller, 2017, October). As an alternative, Yucel (2011) proposed a joint (multivariate) imputation strategy that uses random within-cluster covariance matrices to model cluster-specific associations. This promising approach was recently implemented in the R package jomo (Quartagno & Carpenter, 2016a), and the only study to examine its performance did so in the context of individual patient data meta-analysis (Quartagno & Carpenter, 2016b). As we show later, using within-cluster covariance matrices to model random coefficients is potentially problematic because the method cannot preserve covariance between the random intercepts and slopes, but the practical magnitude of this shortcoming is unclear.
Random coefficient models pose particularly challenging missing data problems because existing imputation frameworks can introduce bias. Promising new Bayesian imputation procedures are currently under development but are not yet widely available in software (Enders et al., 2016). Although fully conditional specification and joint modeling with random covariance matrices are both imperfect, they will not necessarily produce comparable results because their underlying models are quite different. As such, understanding the relative strengths and weaknesses of these strategies is important for selecting an appropriate imputation procedure. As the methodology literature currently offers no guidance on this issue, the primary goal of our manuscript is to compare fully conditional specification and joint model imputation with random level-1 covariance matrices. We begin by describing each approach and discussing its compatibility with a random coefficient analysis, after which we use Monte Carlo computer simulations to evaluate parameter recovery in a wide range of conditions typical of applied research settings. This article concludes with a real data analysis example.
To keep the ensuing discussion as simple as possible, we focus on missing data handling for a two-level random slope model with a single predictor at each level. The analysis model is

$$y_{ij} = b_0 + b_1 x_{ij} + b_2 w_j + b_{0j} + b_{1j} x_{ij} + \varepsilon_{ij} \quad (1)$$

where y_ij is the outcome score for observation i in cluster j, x_ij and w_j are level-1 and level-2 predictors, respectively, b_0 is the intercept, and b_1 and b_2 are the fixed effect slope coefficients for the two predictors. Consistent with the standard multilevel model formulation, we assume that the intercept and slope residuals, b_0j and b_1j, respectively, are multivariate normal with zero means and an unstructured covariance matrix Σ_b. Similarly, the within-cluster residuals are normal with common variance σ²_ε. As an aside, readers may question our focus on multiple imputation, given that maximum likelihood is the default estimator in most multilevel software programs. Maximum likelihood is applicable to a wide range of single-level missing data problems, but it is arguably less flexible for multilevel missing data. Dedicated software programs that operate within the traditional mixed model framework allow incomplete outcomes (e.g., a longitudinal analysis where the number and configuration of measurements differ across respondents), but the absence of distributional assumptions for explanatory variables forces programs to exclude observations or clusters with missing predictor scores. The HLM program (Raudenbush, Bryk, & Congdon, 2013) appears to be the only dedicated multilevel software package that can accommodate incomplete explanatory variables, but this functionality is limited to random intercept models with normally distributed predictors (Shin & Raudenbush, 2007). The multilevel structural equation modeling framework (Mehta & Neale, 2005; Rabe-Hesketh, Skrondal, & Zheng, 2012; Stapleton, 2013) is a second option for fitting random coefficient models with missing data.
Mplus (Muthén & Muthén, 1998) is the only package we know of that can accommodate random slope predictors with missing data, but no technical documentation is available describing this functionality. Maximum likelihood estimators for incomplete multilevel data tend to invoke complicated transformations of the population joint distribution (Shin & Raudenbush, 2007), making it difficult to understand and explain their performance without also knowing their technical underpinnings.^1 As such, we focus on multiple imputation because it is arguably a more mature technology for incomplete multilevel data.

Joint model imputation with random covariance matrices
The joint imputation framework is based on a multivariate regression model that is typically parameterized by regressing the set of incomplete variables on the set of complete variables, or by treating all variables as outcomes regardless of missing data pattern (Asparouhov & Muthén, 2010; Carpenter & Kenward, 2013; Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2011). We describe the latter parameterization, focusing on a variant of the joint model that introduces heterogeneous within-cluster covariance matrices (Yucel, 2011). This approach holds promise for random coefficient models but has not received much attention in the literature. Note that we alter our notation in this section (e.g., replacing bs with γs) to clearly differentiate the imputation and analysis models, and we use alphanumeric subscripts to associate parameters with specific variables (e.g., γ_(y) and γ_(x)). The joint imputation model for the random coefficient analysis defines each observation as the sum of a grand mean and one or more normally distributed deviation scores

$$y_{ij} = \gamma_{(y)} + u_{j(y)} + \varepsilon_{ij(y)} \qquad x_{ij} = \gamma_{(x)} + u_{j(x)} + \varepsilon_{ij(x)} \qquad w_j = \gamma_{(w)} + u_{j(w)} \quad (2)$$

where γ_(·) is the regression intercept or grand mean, u_j(·) is a level-2 deviation score capturing the difference between a latent cluster mean and the grand mean (e.g., u_j(y) = μ_j(y) − γ_(y)), and ε_ij(·) is a within-cluster deviation between an observation and its corresponding latent group mean (e.g., ε_ij(y) = y_ij − μ_j(y)). Furthermore, Σ_u is the level-2 covariance matrix of the between-cluster deviation scores, and Σ_εj is the within-cluster covariance matrix for cluster j, as follows.
$$\Sigma_{\varepsilon j} \mid \cdot \sim \mathrm{IW}\!\left(t + n_j,\; S + S_{\varepsilon j}\right) \quad (3)$$

where IW denotes an inverse Wishart probability distribution for cluster j's covariance matrix, t and S are the degrees of freedom and scale matrix (i.e., jointly, these quantities define the center and spread) of the hierarchically specified inverse Wishart distribution of all random covariance matrices, n_j is the number of observations in cluster j, and S_εj is the within-cluster residual sum of squares and cross-products matrix for cluster j. Importantly, allowing the covariance matrices to vary across clusters differs from the classic joint model formulation (Schafer, 2001; Schafer & Yucel, 2002) and provides a mechanism for preserving random slope variation. The computational machinery for joint model imputation uses an iterative Markov chain Monte Carlo (MCMC) algorithm consisting of two major steps. At each iteration, the algorithm first uses Bayesian estimation to generate model parameters and level-2 deviation scores, conditional on the filled-in data from the previous cycle, after which it updates imputations, conditional on the current estimates. The estimation sequence generates a coefficient vector, γ, a between-cluster covariance matrix, Σ_u, level-2 deviation scores, u = (u_(y), u_(x), u_(w)), and a set of J within-cluster covariance matrices, Σ̃_ε = (Σ_ε1, ..., Σ_εJ). The distribution of the random covariance matrices across all clusters requires two additional parameters, the degrees of freedom and scale matrix, t and S (i.e., parameters defining the distribution's center and spread, respectively). Roughly speaking, t and S can be viewed as quantities that define a pooled level-1 covariance matrix.
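The final sampling step of each MCMC cycle can be sketched numerically. The snippet below is a minimal illustration (ours, not the jomo implementation): it draws cluster j's covariance matrix from an inverse Wishart whose degrees of freedom and scale pool the hierarchical parameters (t and S in the text) with cluster j's sample size and residual sum of squares and cross-products matrix. The function and variable names are ours.

```python
import numpy as np
from scipy.stats import invwishart

def draw_cluster_cov(t_df, S_scale, resid_j, rng):
    """Draw cluster j's within-cluster covariance matrix:
    Sigma_ej ~ IW(t + n_j, S + S_ej), where S_ej is the residual SSCP matrix."""
    n_j = resid_j.shape[0]
    S_ej = resid_j.T @ resid_j                    # within-cluster SSCP matrix
    return invwishart.rvs(df=t_df + n_j, scale=S_scale + S_ej, random_state=rng)

rng = np.random.default_rng(1)
resid_j = rng.normal(size=(15, 2))                # 15 observations, 2 level-1 variables
Sigma_ej = draw_cluster_cov(5.0, np.eye(2), resid_j, rng)
print(Sigma_ej.shape)                             # a 2 x 2 covariance draw
```

Because the draw blends the pooled prior (t, S) with cluster-specific information (n_j, S_εj), small clusters borrow strength from the pooled level-1 covariance matrix while large clusters retain their own associations.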
The Bayesian estimation sequence views model parameters as random variables, and it uses Monte Carlo computer simulation to "sample" plausible estimates from a series of probability distributions. Specifically, each iteration t of the MCMC algorithm applies the following estimation steps to the filled-in data.

1 We included the Mplus implementation of maximum likelihood estimation in a subset of our simulations, and the method was uniformly biased. We do not report these results here because there is currently no technical documentation that helps us understand and explain the results.
The estimation steps sample each parameter set in turn (the final steps sample the scalar degrees of freedom t and scale matrix S), where N and IW are normal and inverse Wishart distributions, respectively, and the dot accents denote synthetic estimates generated via Monte Carlo computer simulation (van Buuren, 2012). The full conditional distributions of γ, Σ_u, and u are given in a variety of resources that describe the classic joint model formulation (Carpenter & Kenward, 2013; Goldstein et al., 2009; Schafer & Yucel, 2002; Yucel, 2011).^2 The conditional distributions of t and S are based on aggregate information from Σ̃_ε (Yucel, 2011, p. 359), and the final sampling step draws random covariance matrices from a distribution that leverages this pooled information as well as cluster-specific variation (Equation 3). Finally, we use the generic symbol f to denote the conditional distribution of t because this quantity does not follow a standard parametric form. In addition to Yucel (2011), interested readers can consult Carpenter and Kenward (2013) and Quartagno and Carpenter (2016b) for additional technical details, including a discussion of prior distributions.
After estimating the parameters and random effects, a final MCMC step updates missing values by sampling imputations from a multivariate normal distribution, the center and spread of which depend on the multilevel model parameters and level-2 residual terms.
Recall that w has no within-cluster variation, and hence each ẇ_j is simply the sum of a grand mean and a between-cluster residual term (i.e., ẇ_j = γ̇_(w) + u̇_j(w)) because this variable has zero entries in the within-cluster covariance matrix. The updated version of the completed data set carries forward to the next iteration, where the MCMC estimation steps provide new parameter estimates and residual terms for the next round of imputation. The desired number of imputed data sets is obtained by iterating the MCMC algorithm and culling data sets at specified intervals in the chain (Schafer, 1997; Schafer & Olsen, 1998).

Compatibility of imputation and analysis models
A vital consideration with multiple imputation is whether (and how) the imputation model preserves the features of the analysis; methodologists have described this issue as congeniality (Meng, 1994; Schafer, 2003) and more recently as compatibility (Bartlett, Seaman, White, & Carpenter, 2014; Carpenter & Kenward, 2013). Yucel (2011) explained that modeling heterogeneous covariance matrices is "mimicking the idea underlying random effects" (p. 352), but he did not specifically examine random coefficient models.^3 The covariance matrix formulation makes it difficult to see which parts of the imputation model are preserving important features of the random slope analysis, but we can informally explore compatibility by using estimates of Σ_εj and Σ_u to define within- and between-cluster regression parameters that are analogous to those of the analysis model. To be clear, we are not examining whether the random coefficient analysis and imputation model are formally compatible with the joint distribution of the population. Rather, we are deriving regression model parameters as functions of Σ_εj and Σ_u to highlight major similarities and differences between the imputation and the analysis models. Quartagno and Carpenter (2016b) provide a similar assessment in the context of meta-analysis models.
To begin, consider the level-1 part of the imputation model from Equation 2. The Σ_εj matrix contains variances and covariances of within-cluster deviation scores for cluster j, which we use to define a level-1 regression model.

$$\varepsilon_{ij(y)} = \pi_{1j}\,\varepsilon_{ij(x)} + e_{ij} \quad (6)$$

where π_1j is the slope coefficient for cluster j, e_ij is a within-cluster residual for observation i in cluster j, and σ²_ej is the corresponding residual variance for that group. Equation 6 clearly shows that the random covariance matrices are responsible for preserving slope variation, as these parameters imply a unique regression coefficient for each group, π_1j. Importantly, the level-1 model has no mean structure or intercept that varies across level-2 units because the within-cluster deviation scores have an expectation of zero in all groups. To facilitate our subsequent comparison with the analysis model, we express the cluster-specific slopes as a function of an average coefficient and a between-cluster residual term, π_1j = γ_1 + η_1j. Substituting the right-hand side of this expression into Equation 6 gives the following level-1 regression model.

$$\varepsilon_{ij(y)} = (\gamma_1 + \eta_{1j})\,\varepsilon_{ij(x)} + e_{ij} \quad (7)$$

2 Cluster-specific covariance matrices replace a common within-cluster matrix in the classic joint model formulation, but the probability distributions for γ, Σ_u, and u are otherwise the same. The joint model uses improper uniform prior distributions for the coefficients and level-2 residual terms. Both the level-1 and level-2 covariance matrices use inverse Wishart prior distributions with identity scale matrices and degrees of freedom equal to the dimension of the respective matrices.
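To make Equation 6 concrete, cluster j's implied slope and residual variance can be read directly off Σ_εj, exactly as in a bivariate regression. A small sketch (our notation, ordering x before y in the matrix):

```python
import numpy as np

def implied_level1_params(Sigma_ej):
    """Slope and residual variance implied by cluster j's within-cluster
    covariance matrix: pi_1j = cov(x, y) / var(x); sigma2_ej = var(y) - pi_1j^2 var(x)."""
    var_x, cov_xy, var_y = Sigma_ej[0, 0], Sigma_ej[0, 1], Sigma_ej[1, 1]
    pi_1j = cov_xy / var_x
    sigma2_ej = var_y - pi_1j**2 * var_x
    return pi_1j, sigma2_ej

Sigma_ej = np.array([[1.0, 0.4],
                     [0.4, 2.0]])
pi_1j, sigma2_ej = implied_level1_params(Sigma_ej)
print(pi_1j, sigma2_ej)   # slope 0.4, residual variance ~1.84
```

Because each cluster has its own Σ_εj, each cluster carries its own implied π_1j, which is precisely how the random covariance matrices preserve slope variation.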
Next, the Σ_u matrix contains variances and covariances of between-cluster deviation scores, which we can use to define a level-2 multiple regression model with u_j(x) and u_j(w) predicting u_j(y).

$$u_{j(y)} = \gamma_2\,u_{j(x)} + \gamma_3\,u_{j(w)} + \eta_{0j} \quad (8)$$

where γ_2 and γ_3 are between-cluster slope coefficients, η_0j is a residual that captures unexplained variation in the cluster means, and σ²_η0 is the residual variance. Finally, combining the right- and left-hand sides of the within- and between-cluster regression equations and adding the grand mean of y to each side of the resulting expression gives a model with the same form as the random coefficient analysis.
$$y_{ij} = \gamma_{(y)} + \gamma_1\,\varepsilon_{ij(x)} + \gamma_2\,u_{j(x)} + \gamma_3\,u_{j(w)} + \eta_{0j} + \eta_{1j}\,\varepsilon_{ij(x)} + e_{ij} \quad (9)$$

Comparing Equations 1 and 9 reveals two important differences between the imputation and analysis models, one of which is potentially problematic. First, Equation 9 includes an additional regression slope that partitions the relationship between x and y into distinct within- and between-cluster components (i.e., γ_1 and γ_2, respectively), as in the classic contextual effects model (Enders, 2013; Kreft, de Leeuw, & Aiken, 1995; Longford, 1989; Lüdtke, Marsh, Robitzsch, & Trautwein, 2011). Second, the imputation model accommodates a heterogeneous within-cluster residual variance structure (e.g., σ²_ej from Equation 6), whereas the analysis model assumes constant variance. Incompatibilities such as these that arise from a rich imputation model with ancillary parameters are usually not problematic and may even be beneficial (Collins, Schafer, & Kam, 2001; Meng, 1994; Schafer, 2003). Although Equations 6-9 clearly confirm that random covariance matrices mimic random effects, they highlight an important caveat. Specifically, because Σ_εj and the corresponding within-cluster regression model do not include a mean structure or intercept that varies across level-2 units, between-cluster variation in π_1j (or equivalently, η_1j) is necessarily orthogonal to functions of Σ_u, notably the between-cluster residual term, η_0j. Consequently, an imputation model based on random covariance matrices can only be approximately compatible with a random coefficient analysis with uncorrelated intercepts and slopes (i.e., a block-diagonal Σ_b). This restriction could be detrimental to analyses where the level-2 covariance captures an interesting substantive phenomenon (e.g., growth models where initial status informs change rates).

Fully conditional specification
Whereas the joint model approach uses a multivariate regression model to generate imputations, fully conditional specification cycles through incomplete variables one at a time, generating imputations from a sequence of univariate multilevel models that feature an incomplete variable regressed on all remaining variables, complete and previously imputed (Enders et al., 2016; van Buuren, 2011, 2012). Consistent with the previous section, we alter our notation (e.g., replacing bs with γs) to clearly differentiate the imputation and analysis models, and we use alphanumeric subscripts to associate parameters with specific variables (e.g., γ_(y) and γ_(x)).
Returning to the random slope analysis from Equation 1, fully conditional specification applies the following imputation models

$$y_{ij} = \gamma_{0(y)} + \gamma_{1(y)} x_{ij} + \gamma_{2(y)} w_j + \gamma_{3(y)} \bar{x}_j + u_{0j(y)} + u_{1j(y)} x_{ij} + \varepsilon_{ij(y)} \quad (10)$$
$$x_{ij} = \gamma_{0(x)} + \gamma_{1(x)} y_{ij} + \gamma_{2(x)} w_j + \gamma_{3(x)} \bar{y}_j + u_{0j(x)} + u_{1j(x)} y_{ij} + \varepsilon_{ij(x)} \quad (11)$$
$$w_j = \gamma_{0(w)} + \gamma_{1(w)} \bar{x}_j + \gamma_{2(w)} \bar{y}_j + \varepsilon_{j(w)} \quad (12)$$

where x̄_j and ȳ_j are the level-2 cluster means computed from the imputed data for cluster j, u_0j(·) and u_1j(·) are now between-cluster residual terms, and ε_ij(·) is a within-cluster residual with constant variance σ²_ε(·). Notice that the round-robin imputation scheme features the outcome from one equation (the target of imputation) as a predictor in all other equations. Importantly, the models use a "reverse random coefficient" approach that preserves random slope variation, whereby x is a random predictor in the y imputation model, and vice versa. Also, note that latent rather than manifest variable group means can be used to preserve between-cluster variation among the incomplete variables (Grund, Lüdtke, & Robitzsch, 2017). Analytic and simulation results suggest that manifest cluster means provide good performance in most situations but can introduce slight biases with low intraclass correlations, few observations per cluster, and extremely unbalanced cluster sizes.
Like joint model imputation, fully conditional specification employs an iterative MCMC algorithm that generates Bayesian estimates of model parameters and level-2 residual terms, conditional on the filled-in data, after which it updates the imputations, conditional on current estimates and residuals. Each iteration applies these estimation and imputation steps to the incomplete variables in sequence. The model parameters for an incomplete level-1 variable q include a coefficient vector, γ_(q), level-2 residuals, u_(q), a within-cluster residual variance, σ²_ε(q), and the level-2 residual covariance matrix, Σ_u(q), whereas level-2 imputation requires only the coefficients and residual variance (Equation 12). Consistent with the joint model, Monte Carlo computer simulation samples plausible estimates from the series of probability distributions given below.
As mentioned before, the dot accents denote synthetic values generated via Monte Carlo simulation (van Buuren, 2012). A variety of resources provide the technical details about these distributions (Browne & Draper, 2000; Enders et al., 2017; van Buuren, 2012).^4 After estimating the parameters and random effects for incomplete variable q, a final MCMC step updates missing values by sampling imputations from a univariate normal distribution, the center and spread of which depend on the model parameters. The imputation steps for our example are as follows.
To simplify notation, we omit the iteration superscript from the parameters and residual terms because these quantities always originate from iteration t.

Compatibility of imputation and analysis models
Consistent with the joint model, the imputation models in Equations 10 and 11 are more flexible than necessary because they include a contextual effects-type specification that partitions the relationship between x and y into unique within- and between-cluster components (Carpenter & Kenward, 2013; Enders et al., 2016, 2017).^5 The primary problem with fully conditional specification is its use of reverse regression to preserve random slope variation (e.g., y is a random predictor of x, and vice versa). As noted previously, simulation studies suggest that this strategy can produce negatively biased slope variance estimates (Enders et al., 2016), and an incorrect distributional assumption appears to be responsible for this problem (Enders et al., 2017, October). Specifically, when a random predictor is missing, assuming a normal distribution for the level-2 residuals in the analysis or y imputation model precludes the possibility that the corresponding residuals from the reverse regression are also normal. MCMC estimation violates this assumption when it samples level-2 residuals for x's imputation model in Equation 11. A related issue occurs in the single-level context with reverse regressions that include a product term (Bartlett et al., 2014; Gelman & Raghunathan, 2001).
To provide additional insight into the incompatibility problem, we can express the random effects from the reverse regression as a function of the corresponding terms from the analysis model. To begin, we solve for x in the analysis model as follows.

$$x_{ij} = \frac{y_{ij} - b_0 - b_2 w_j - b_{0j} - \varepsilon_{ij}}{b_1 + b_{1j}}$$
Next, the right-hand side of this expression replaces x ij in the left-hand side of the reverse random coefficient model in Equation 11. For simplicity, we omit the cluster means from the right-hand side of the imputation model as they do not appear in the analysis. Finally, solving for the imputation model's random effects gives the following expressions.
Equation 18 shows that assuming normality for b_0j and b_1j precludes the possibility that u_0j(x), u_1j(x), and ε_ij(x) are normal, because neither the inverse of a normal variate nor the ratio of two normal variates is a member of the normal distribution family. In fact, the above expressions can be quite skewed and kurtotic. Consequently, applying the MCMC estimation steps from Equation 13 to the reverse random coefficient model violates a key distributional assumption, and this violation degrades imputation quality. As an aside, this issue does not occur when x is complete, nor does it occur in random intercept models.
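A quick Monte Carlo check (ours, with arbitrary parameter values) illustrates the point: even when the slope term b_1 + b_1j and the level-1 residuals are both normal, the ratio that appears in the solved-for expression is dramatically heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
b1 = 0.3                                  # arbitrary fixed slope
b1j = rng.normal(0.0, 0.15, size=n)       # normal random slope residuals
e_ij = rng.normal(0.0, 1.0, size=n)       # normal within-cluster residuals

ratio = e_ij / (b1 + b1j)                 # ratio of Gaussians from the reverse regression

def excess_kurtosis(z):
    """Sample excess kurtosis; approximately 0 for a normal variate."""
    z = z - z.mean()
    return (z**4).mean() / (z**2).mean() ** 2 - 3.0

print(excess_kurtosis(e_ij))              # near 0: consistent with normality
print(excess_kurtosis(ratio))             # large and positive: heavy tails
```

Whenever b_1 + b_1j wanders near zero, the ratio explodes, so the implied reverse-regression residuals have far heavier tails than any normal distribution permits, which is the distributional violation described above.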

Computer simulation study
To date, the only study to examine imputation with random covariance matrices did so in the context of individual patient data meta-analysis (Quartagno & Carpenter, 2016b), and it did not examine fully conditional specification. Given the compatibility issues highlighted previously, it is important to understand the relative strengths and weaknesses of these two approaches, as they will not necessarily produce comparable results. To address this gap in the literature, we designed a Monte Carlo simulation study with two within-participants factors (missing data handling method and cause of missingness) and five between-participants factors: intraclass correlation (ICC = .10 and .50), residual correlation between the random intercepts and slopes (0 and .50), number of clusters (J = 15, 30, and 100), within-cluster sample size (n_j = 5, 15, 30, and 50), and MAR missing data rate (10%, 20%, and 30%). We generated 2000 artificial data sets within each of the 144 between-participants design cells, resulting in 288,000 replications.
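The factorial design described above can be enumerated as a quick arithmetic check (values taken from the text; variable names are ours):

```python
from itertools import product

# Between-participants factors of the simulation design
iccs = (.10, .50)
intercept_slope_corrs = (0, .50)
cluster_counts = (15, 30, 100)
cluster_sizes = (5, 15, 30, 50)
missing_rates = (.10, .20, .30)

cells = list(product(iccs, intercept_slope_corrs, cluster_counts,
                     cluster_sizes, missing_rates))
print(len(cells), len(cells) * 2000)   # 144 design cells, 288000 replications
```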
Previous studies of fully conditional specification and recommendations from the literature guided our design choices. For example, the ICC values of .10 and .50 are common in published applications and are consistent with those from cross-sectional and repeated measures designs, respectively (Gulliford, Ukoumunne, & Chinn, 1999; Hedges & Hedberg, 2007; Murray & Blitstein, 2003; Spybrook et al., 2011). In choosing the level-1 and level-2 sample sizes, we attended to recommendations from the literature (Kreft & de Leeuw, 1998; Maas & Hox, 2005), but we also selected small values capable of introducing bias (McNeish & Stapleton, 2016). It is difficult to identify typical missing data rates because published manuscripts do not routinely report this information, and hence we selected values large enough to expose small-sample biases in variance estimates.

Data generation
The random coefficient analysis from Equation 1 served as the true population model for the simulations. The data generation procedure also included level-1 and level-2 auxiliary variables, a_1 and a_2, respectively. As described later, a_1 determined missingness probabilities for x in one of the simulations (y was the cause of missingness in the second), and a_2 always served as the cause of missingness on w. To specify the effect sizes on a convenient metric, we used the values listed in Table 1 to define the total correlations among the variables. We chose correlations of .30 to align with Cohen's (1988) medium effect size benchmark, and we set the correlations between the analysis and the auxiliary variables to .40 to ensure that bias would result if these additional variables were omitted from imputation (Collins et al., 2001). We also set the correlation between x and y to .40, and hence we could manipulate the cause of missingness (i.e., a_1 or y) without changing the strength of the MAR selection mechanism.

5 Carpenter and Kenward (2013, p. 220) attribute the idea of including cluster means as predictors to a personal communication from Ian White.
We solved for the population parameters using the following steps. First, we fixed the total variance of all variables to unity. The level-2 predictor w and the auxiliary variables had all variation assigned to a single level, whereas x and y had variation at both levels, with between-cluster variance set to either .10 or .50 (ICC = .10 and .50, respectively). Second, we computed the level-1 and level-2 covariance matrices, Σ_(1) and Σ_(2), by pre- and postmultiplying the total correlation matrix by a diagonal matrix containing the level-1 or level-2 standard deviations (e.g., in the ICC = .10 condition, the level-1 and level-2 standard deviations of x (or y) were √.90 and √.10, respectively). The total covariance matrix was the sum of the within- and between-cluster matrices (i.e., Σ = Σ_(1) + Σ_(2)). Third, we used the following expressions to solve for the model parameters, where Σ_(xw), Σ_(1)(xw), and Σ_(2)(xw) are submatrices containing the variances and covariances of the predictors, Σ_(y,xw) is a vector containing covariances between the outcome and the predictors, the σ²_x terms are within- and between-cluster variances, respectively, and ρ_b0b1 is the correlation between the intercepts and the slopes. We somewhat arbitrarily set the slope variance equal to 40% of the between-cluster variance, reasoning that this parameter should be large enough to reveal important differences between the imputation methods.
Applying the previous expressions produced the parameter values in Equation 20. We then used the following steps to generate the data: (a) sample the within- and between-cluster components of the predictor variables from a normal distribution with zero means and covariance matrices of Σ_(1)(xw) and Σ_(2)(xw), respectively, (b) compute x as the sum of its within- and between-cluster deviation scores, (c) sample b_0j and b_1j from a normal distribution with zero means and covariance matrix Σ_b, (d) sample ε_ij from a normal distribution with variance σ²_ε, and (e) compute y_ij as a weighted sum of x, w, b_0j, b_1j, and ε_ij, as in Equation 20. The R data generation script is available upon request from the first author.
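Steps (a) through (e) can be sketched as follows. This is a simplified illustration with placeholder parameter values (the Equation 20 values are not reproduced here), and it omits the auxiliary variables and the correlations among the predictors:

```python
import numpy as np

rng = np.random.default_rng(2023)
J, n_j = 30, 15                              # clusters, observations per cluster
b0, b1, b2 = 0.0, 0.3, 0.3                   # placeholder fixed effects
icc = .10                                    # between-cluster variance share of x
Sigma_b = np.array([[.06, .00],              # placeholder intercept/slope covariance
                    [.00, .04]])
sigma2_e = .70                               # placeholder level-1 residual variance

x_all, w_all, y_all = [], [], []
for j in range(J):
    # (a) sample within- and between-cluster components of the predictors
    x_between = rng.normal(0.0, np.sqrt(icc))
    x_within = rng.normal(0.0, np.sqrt(1 - icc), size=n_j)
    w_j = rng.normal(0.0, 1.0)               # level-2 predictor (no within variation)
    # (b) x is the sum of its within- and between-cluster deviation scores
    x = x_between + x_within
    # (c) sample the random intercept and slope residuals
    b0j, b1j = rng.multivariate_normal([0.0, 0.0], Sigma_b)
    # (d) sample level-1 residuals; (e) compute y as a weighted sum
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n_j)
    y = b0 + b1 * x + b2 * w_j + b0j + b1j * x + e
    x_all.append(x); w_all.append(np.full(n_j, w_j)); y_all.append(y)

x_all, w_all, y_all = map(np.concatenate, (x_all, w_all, y_all))
print(x_all.shape)   # (450,) total observations
```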
As noted previously, either the level-1 auxiliary variable a_1 or the outcome y determined missing values on x, and a_2 served the same role for w. In all cases, higher scores on the cause of missingness produced higher rates of missing data. Using a logistic regression model, we defined the level-1 missingness probability for each observation as

$$p_{ij} = \frac{\exp(\lambda_0 + \lambda_1 z_{ij})}{1 + \exp(\lambda_0 + \lambda_1 z_{ij})} \quad (21)$$

where λ_0 and λ_1 are intercept and slope coefficients, respectively, and z_ij is the standardized version of a_1 or y. Using the latent variable formulation for logistic regression (Agresti, 2012; Johnson & Albert, 1999), we derived coefficients that produced the desired missing data rate while maintaining a .50 correlation between the cause of missingness and the underlying normal propensity for missing data. The slope coefficient was λ_1 = 1.815, and the intercept coefficients for the 10%, 20%, and 30% missing data rates were approximately λ_0 = -3.22, -2.10, and -1.31, respectively. Substituting z into Equation 21 produced an N-row vector of level-1 missingness probabilities, and we subsequently created an N-row vector of missing data indicators (0 = observed, 1 = missing) by sampling binary values from a binomial distribution with success rate equal to each observation's missingness probability. An identical procedure produced missing values on w, such that p_j and z_j (the standardized version of a_2) replaced p_ij and z_ij in the logistic function. As different auxiliary variables determined level-1 and level-2 missingness, this deletion process induced a general missing data pattern where x or w (or both) could be missing for a particular observation.
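The deletion process above can be sketched directly, assuming the logistic form in Equation 21 (function and variable names are ours; the coefficients are those reported in the text for the 10% condition):

```python
import numpy as np

def mar_indicators(cause, lam0, lam1, rng):
    """MAR missingness indicators (0 = observed, 1 = missing) from a
    logistic model on the standardized cause of missingness."""
    z = (cause - cause.mean()) / cause.std()       # standardize the cause
    p = 1.0 / (1.0 + np.exp(-(lam0 + lam1 * z)))   # Equation 21 probabilities
    return rng.binomial(1, p)                      # binomial draw per observation

rng = np.random.default_rng(7)
a1 = rng.normal(size=100_000)                      # stand-in level-1 auxiliary variable
miss = mar_indicators(a1, lam0=-3.22, lam1=1.815, rng=rng)
print(miss.mean())                                 # close to the targeted 10% rate
```

Because higher z values raise the missingness probability, observations with large values on the cause of missingness are deleted more often, which is exactly the MAR selection mechanism described above.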

Imputation, estimation, and outcomes
We used the R package jomo (Quartagno & Carpenter, 2016a) for joint model imputation and the Blimp application for fully conditional specification. To diagnose convergence of the MCMC algorithm, we used Blimp to generate potential scale reduction factors (Gelman & Rubin, 1992) for a single replication in every design cell. These diagnostic values generally reached acceptable levels (e.g., <1.05) in fewer than 1500 iterations. Based on these convergence diagnostics, we generated 20 imputations^6 (Graham, Olchowski, & Gilreath, 2007) from MCMC chains with 50,000 iterations, such that the first data set was saved after the 2500th iteration and the remaining data sets were saved every 2500 iterations thereafter (i.e., in Blimp, we set the burn-in and thinning intervals to 2500). We relied on the default prior distributions of each software package under the assumption that most researchers would do the same. The imputation models for both procedures included the three analysis variables and the two auxiliary variables. We used Mplus 8 (Muthén & Muthén, 1998) to fit the analysis model to the multiply imputed data sets, and we wrote a custom R program to pool the resulting estimates and standard errors. All computational tasks were executed on UCLA's Shared Hoffman2 Cluster, and we used a Linux shell script to coordinate the simulation components. The simulation code is available upon request from the first author. We examined three outcomes: relative bias, mean squared error, and confidence interval coverage. Relative bias was computed as the difference between an average estimate and the true population value, divided by the true value and multiplied by 100 (i.e., bias as a percentage of the true value). In the conditions with orthogonal random effects, we used the nonzero covariances from Equation 20 as the divisor of the relative bias expression to avoid dividing by zero.
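The pooling step follows Rubin's rules for combining estimates and standard errors across imputations; a minimal sketch of that computation (illustrative values, not the authors' R program):

```python
import numpy as np

def pool(estimates, ses):
    """Rubin's rules: pooled point estimate and standard error across m imputations."""
    estimates, ses = np.asarray(estimates), np.asarray(ses)
    m = len(estimates)
    qbar = estimates.mean()                 # pooled point estimate
    w = (ses**2).mean()                     # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    t = w + (1 + 1/m) * b                   # total variance
    return qbar, np.sqrt(t)

est, se = pool([0.30, 0.27, 0.33], [0.05, 0.05, 0.05])
print(est, se)   # pooled estimate 0.30 with an SE slightly above 0.05
```

The between-imputation component inflates the pooled standard error relative to any single filled-in data set, which is how multiple imputation propagates missing data uncertainty into the final inferences.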
Published simulation studies often define relative bias of less than 10% in absolute value as acceptable (Finch, West, & MacKinnon, 1997; Kaplan, 1988). Mean squared error (MSE) is the average squared difference between an estimate and its true population value. This outcome is an overall measure of accuracy that captures the sum of squared bias and sampling variance. To facilitate interpretation, we defined an MSE ratio as MSE_JM / MSE_FCS, such that values greater than unity favor fully conditional specification (i.e., the joint model had larger squared errors), whereas values less than unity favor the joint model. Finally, 95% confidence interval coverage was computed as the proportion of estimates for which the 95% symmetric confidence interval (i.e., the estimate plus or minus 1.96 standard error units) included the true parameter value. Values lower than the nominal 95% rate reflect Type I error inflation (e.g., a coverage value of 90% suggests a twofold increase in Type I errors), whereas values greater than 95% reflect conservative inference. Confidence interval coverage is an indicator of standard error quality when estimates are unbiased. We consider coverage only for the slope coefficients because the literature argues that symmetric confidence intervals are inappropriate for variance estimates (Maas & Hox, 2005; Snijders & Bosker, 2012).
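The three outcome measures just described are simple to compute. The Python sketch below spells them out for a single parameter across replications; it is an illustration of the definitions in the text (our own computations used R), and the function names are hypothetical.

```python
import numpy as np

def relative_bias(estimates, truth):
    """Bias as a percentage of the true value (|value| < 10 is often deemed acceptable)."""
    return 100.0 * (np.mean(estimates) - truth) / truth

def mse(estimates, truth):
    """Mean squared error = squared bias + sampling variance."""
    return np.mean((np.asarray(estimates) - truth) ** 2)

def mse_ratio(est_jm, est_fcs, truth):
    """MSE_JM / MSE_FCS: values > 1 favor fully conditional specification, < 1 the joint model."""
    return mse(est_jm, truth) / mse(est_fcs, truth)

def ci_coverage(estimates, ses, truth):
    """Proportion of symmetric 95% intervals (estimate +/- 1.96 SE) covering the truth."""
    estimates, ses = np.asarray(estimates), np.asarray(ses)
    return np.mean((estimates - 1.96 * ses <= truth) & (truth <= estimates + 1.96 * ses))
```

For example, two estimates of 1.1 and 0.9 for a true value of 1.0 have zero relative bias but a nonzero MSE, which is why MSE is the more complete accuracy summary.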

Results
The results are organized as follows. We begin by discussing the scenario where auxiliary variables determine missingness and the random intercepts and slopes are uncorrelated in the population. This situation is a useful baseline because the random covariance matrix approach should be approximately compatible with the data-generating model. Next, we examine the same missingness mechanism with correlated random effects. As described previously, this correlation should be detrimental to the joint model because imputation uses fewer parameters than the data-generating process. We did not expect this correlation to meaningfully influence fully conditional specification. Finally, we discuss any differences that result from an MAR mechanism where the outcome variable causes missingness in x. In the interest of space, we use a limited set of figures to highlight the main findings, and we present the results for all combinations of conditions in the online supplemental material. [Footnote 6: When considering the design of the study, we examined a subset of results with 5 and 100 imputations. As bias values were effectively the same in both situations, we adopted Graham et al.'s (2007) recommendation to use 20 imputations.]

Missingness owing to auxiliary variables and uncorrelated random effects
Figures 1 and 2 are trellis plots showing relative bias for the ICC = .10 simulation with 15 and 30 clusters, respectively. In the interest of space, we relegate the 100-cluster plot to the online supplemental material because it was comparable to Figure 2. The figures highlight a few general trends. Not surprisingly, increasing the number of clusters from 15 to 30 reduced bias, particularly for the b2 coefficient and the intercept variance, estimates that rely heavily on the number of level-2 units. Similarly, increasing the within-cluster sample size from 5 to 15 reduced bias, but further increasing the group sizes did not impact most parameters. With the exception of the slope variance, using 30 clusters with at least 15 observations per group produced imputation estimates that exhibited roughly the same bias as complete-data estimates, even with 30% missing data. Figures 1 and 2 suggest that both imputation approaches had difficulty preserving random slope variation, although they erred in different directions and by different amounts. Specifically, the joint model overestimated slope variance, with bias decreasing to reasonable levels (e.g., less than 10% in absolute value) when the within-cluster sample size was 30 or larger. The reliance on the within-cluster sample size is not surprising, given that the joint model uses random level-1 covariance matrices to preserve random slope variation (Equation 6). In contrast, fully conditional specification consistently underestimated slope variance, and the magnitude of this bias did not decrease in larger samples. Rather, bias was largely a function of the missing data rate, with values exceeding 10% in absolute value at a 20% missing data rate. The trellis plots in Figures 3 and 4 show relative bias for the ICC = .50 simulation with 15 and 30 clusters, respectively. The major trends from the ICC = .10 simulations were also evident here, and hence we do not reiterate those results.
The primary difference in Figures 3 and 4 is that the joint model slope variance estimates were negatively rather than positively biased. We suspect that the prior distribution (an inverse Wishart with an identity scale matrix) is responsible for this change, perhaps because its mass better approximates that of the likelihood function. As described before, the bias decreased to reasonable levels with group sizes of 30 or larger.
Mean-squared error ratios provide additional information about the relative performance of the two imputation approaches. Recall that mean squared error captures the overall accuracy of an estimate, with ratios greater than unity favoring fully conditional specification (i.e., joint model estimates were further from their true values, on average). For brevity, we report average ratios across the fixed regression coefficients and the level-2 covariance matrix elements and do not consider ratios for individual parameters. However, the online supplemental material includes trellis plots of the mean-squared error ratios for all conditions. Mean-squared error ratios for the regression coefficients favored fully conditional specification, albeit slightly, with average ratio values of approximately 1.02. The ratios for the level-2 covariance matrix differed by intraclass correlation. Fully conditional specification was superior in the ICC = .10 conditions, with average ratio values of approximately 1.10, whereas the joint model was superior in the ICC = .50 conditions, with average ratios of about .96.
No additional figures are needed to convey the confidence interval coverage results. Collapsing across other design cells, the average coverage value for the joint model regression slopes was .94 (min = .88, max = .99), and the corresponding average for fully conditional specification was .93 (min = .89, max = .99). To further convey the results, we examined the percentage of design cells with coverage values between .925 and .975, an interval defining the so-called liberal criterion from Bradley (1978). These results varied by intraclass correlation but consistently favored joint model imputation. Specifically, in the ICC = .10 simulations, approximately 81.9% of the

Missingness owing to auxiliary variables and correlated random effects
Next, we consider the results from the simulation conditions where the residual correlation between the random intercepts and slopes equaled .50. As described previously, a nonzero covariance should be detrimental to the joint model because imputation uses fewer parameters than the data-generating process. To illustrate these results, the trellis plot in Figure 4 shows relative bias values from the ICC = .50 condition with 30 clusters. As expected, the joint model underestimated the level-2 covariance, although the magnitude of this bias was not as large as one might expect. In particular, bias was generally in a tolerable range (e.g., less than 10% in absolute value) if the missing data rate was 20% or less. Consistent with the previous results, joint model imputation improved as the within-cluster sample size increased. Mean-squared error ratios were largely similar to those from the simulation with uncorrelated random effects. For example, ratios for the level-2 covariance matrix elements again differed by intraclass correlation, with average ratio values of about 1.06 and .93 in the ICC = .10 and .50 conditions, respectively.

Missingness owing to the outcome variable
To explore whether our simulation results generalize to other MAR processes, we repeated the previous simulations, this time treating y as the cause of missingness on x. The level-2 auxiliary variable a2 again determined missingness for w, and the imputation routines always included both auxiliary variables. Manipulating the cause of missingness produced no meaningful changes to the results. For completeness, the online supplemental material includes a full set of trellis plots showing relative bias, but no additional plots are needed to convey these results here. Similarly, the mean-squared error ratios and confidence interval coverage results were virtually identical to those from the previous simulation, and hence no further discussion of these results is warranted.

Real data example
In this section, we use the R package jomo (Quartagno & Carpenter, 2016a) and the Blimp application to illustrate joint model imputation and fully conditional specification, respectively. The Blimp application for macOS, Windows, and Linux is a free download available at www.appliedmissingdata.com/multilevel-imputation. The data set for the analysis mimics a cluster-randomized design from an educational study of math achievement in which 50 schools with 25 students each were randomly assigned to a novel math curriculum condition (the intervention) or a standard curriculum condition (the control) (Montague, Krawec, Enders, & Dietz, 2014). The analysis model examines the influence of the intervention dummy code (0 = control, 1 = intervention), controlling for math self-efficacy and teacher's experience in years.
The intervention dummy code is complete, but all other analysis variables have missing values. For the purposes of the analysis example, we treat math self-efficacy (a 6-point rating scale) as continuous, but jomo and Blimp have facilities for imputing categorical variables (Carpenter & Kenward, 2013; Enders et al., 2017).
The online supplemental material gives the jomo and Blimp syntax files and R code for the analysis and pooling steps. The Blimp commands can also be implemented via pull-down menus from the graphical user interface (macOS and Windows only). In the jomo package, random level-1 covariance matrices are invoked by specifying the imputation method as "random," whereas Blimp denotes a random association by joining two variables with a colon (e.g., "math:mathse") on the MODEL line. We refer readers to the software documentation files for additional details on syntax conventions (Quartagno & Carpenter, 2016a). Table 2 lists the estimates from the two approaches. Consistent with the simulation results, the imputation methods produced similar fixed effect estimates but different variance components. In particular, fully conditional specification produced level-2 covariance matrix elements that were uniformly smaller than those of joint model imputation. Pooled likelihood ratio tests (Meng & Rubin, 1992) obtained from the mitml package indicated that the slope variance (and covariance) were significant, although the joint model test statistic was somewhat larger in value at F(2, 1176.28) = 13.60, p < .001 versus F(2, 567.04) = 6.51, p = .002. Of course, it is difficult to form judgments from a single sample, but the results of the analysis highlight that the two imputation approaches can produce divergent estimates.
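For readers who want to see the pooling step spelled out, the sketch below applies Rubin's (1987) rules to a set of per-imputation estimates and standard errors. It is a minimal Python illustration with made-up input values, not the R pooling code used in our analyses.

```python
import numpy as np

def rubin_pool(estimates, ses):
    """Pool M imputed-data estimates and standard errors via Rubin's (1987) rules."""
    q = np.asarray(estimates, dtype=float)   # point estimates, one per imputation
    u = np.asarray(ses, dtype=float) ** 2    # within-imputation sampling variances
    m = len(q)
    qbar = q.mean()                          # pooled point estimate
    ubar = u.mean()                          # average within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t = ubar + (1 + 1 / m) * b               # total variance of the pooled estimate
    return qbar, np.sqrt(t)                  # pooled estimate and standard error

# Hypothetical slope estimates and standard errors from five imputed data sets:
est, se = rubin_pool([0.52, 0.48, 0.50, 0.55, 0.45], [0.10] * 5)
```

Note that the pooled standard error exceeds the average within-imputation standard error whenever the estimates vary across imputations, which is how multiple imputation propagates missing data uncertainty into the inference.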

Discussion
The procedures for handling multilevel missing data have received less attention in the literature relative to single-level analysis problems. Existing research has largely focused on random intercept models, and the literature addressing random slopes is particularly scant. To date, the few studies to examine missing data handling for these models have focused on the fully conditional specification framework and "reverse random coefficient" imputation. Although it has not received much attention in the literature, a joint modeling strategy that uses random within-cluster covariance matrices to preserve cluster-specific associations (Yucel, 2011) is a promising alternative for random coefficient analyses. This approach was recently implemented in the R package jomo (Quartagno & Carpenter, 2016a), and the only study to examine its performance did so in the context of individual patient data meta-analysis (Quartagno & Carpenter, 2016b).
The primary goal of our manuscript was to compare fully conditional specification to joint model imputation with random level-1 covariance matrices. A vital consideration with multiple imputation is whether (and how) the imputation model preserves features of the analysis, a notion described as congeniality (Meng, 1994; Schafer, 2003) or compatibility (Bartlett et al., 2014; Carpenter & Kenward, 2013). Simulation results suggest that fully conditional specification can produce negatively biased slope variance estimates, and we show that an incorrect distributional assumption is responsible for this issue; the reverse regression model's random effects cannot be normally distributed if the analysis model's random effects are also normal (a standard assumption in multilevel modeling applications). In a similar vein, we show that joint model imputation with random covariance matrices is potentially problematic because it fails to preserve covariation between the intercepts and the slopes.
Our simulation results convey a number of take-home messages to substantive researchers. First, both imputation procedures generally produced tolerable biases when the missing data rate was 10% or less and the sample was composed of at least 30 clusters with 15 observations per group. These favorable conditions may encompass a wide range of multilevel modeling applications. Second, in terms of overall accuracy, fully conditional specification appears to be superior when the intraclass correlation is low (e.g., ICC = .10), whereas joint model imputation is somewhat better when the intraclass correlation is high (e.g., ICC = .50). Roughly speaking, this suggests that fully conditional specification is preferable for cross-sectional data, and the joint model is better for within-participants designs. Third, the missing data pattern is an important determinant of imputation quality. Problems with fully conditional specification are generally restricted to incomplete predictors with random slopes, and the incompatibilities that we described earlier in the article would not occur with incomplete outcomes or fixed predictors. On the other hand, there is no reason to expect joint model imputation to improve if y is missing and x is complete.
Perhaps the most important take-home message from our study is that neither fully conditional specification nor joint model imputation is optimal for random slope analyses with incomplete predictors. Fully Bayesian (i.e., model-based) imputation (Erler et al., 2016; Ibrahim, Chen, & Lipsitz, 2002; Zhang & Wang, 2017) and its close relative, substantive model-compatible imputation (Bartlett et al., 2014; Enders, Du, & Keller, 2018; Goldstein, Carpenter, & Browne, 2014), are promising alternatives that are starting to receive attention in the literature. Erler et al. (2016) and Grund, Lüdtke, and Robitzsch (2018) illustrate model-based Bayesian imputation with the JAGS software package (Plummer, 2016), and a substantive model-compatible variant of fully conditional specification was recently implemented in the Blimp application. Both procedures avoid bias by respecifying the conditional distribution of the incomplete covariates (e.g., the problematic reverse random coefficient model from Equation 11) into a composite function, the components of which are compatible with the analysis model. Limited simulation results suggest that these procedures can provide a rather dramatic improvement over conventional imputation routines when the analysis model includes random coefficients and other types of interactive effects (Grund et al., 2018). For interested readers, the online supplemental material gives the Blimp code for applying substantive model-compatible imputation to a random slope analysis, and this material also gives plots showing relative bias values from a subset of our simulation conditions. Additional analysis examples (e.g., for models with interactive terms) are available in the Blimp user guide. Though our study addresses an important gap in the methodological literature, our simulations lack generalizability because we necessarily investigated a limited set of conditions.
Future studies could investigate different combinations of level-1 and level-2 sample sizes, different effect sizes, and more complex models. The interaction of model complexity and sample size could be particularly important for understanding joint model imputation, given the influence of within-cluster sample size on imputation quality. Also, it could be important to investigate the impact of unequal group sizes on fully conditional specification, as the imputation model for level-2 variables (Equation 12) does not directly adjust for the fact that cluster means are potentially derived from different numbers of level-1 units. Analytic and simulation results suggest that manifest cluster means provide good performance in most situations but can introduce slight biases with low intraclass correlations, few observations per cluster, and extremely unbalanced cluster sizes. A future release of Blimp will feature the option to invoke the latent cluster mean procedure described by Grund et al. (2017). Next, we limited this preliminary comparison to normally distributed variables. Nonnormal continuous variables are potentially problematic for multiple imputation (Yuan, Yang-Wallentin, & Bentler, 2012; Yucel & Demirtas, 2010), and hence future studies should examine whether the two imputation frameworks are differentially impacted by this common problem. Additionally, both approaches readily accommodate incomplete categorical variables; Blimp uses a probit regression framework (i.e., latent variable imputation) for ordinal or nominal variables, and jomo offers the same functionality for nominal variables (imputation for nominal variables can also accommodate ordered categories, albeit with a richly parameterized model).
Although both software programs use the same underlying latent variable definition of categorical variables (Albert & Chib, 1993; Carpenter & Kenward, 2013; Johnson & Albert, 1999), implementing this approach in the joint model framework could greatly accentuate model complexity-related issues, particularly with incomplete level-1 variables and random slopes. Finally, it is important to emphasize that we considered a rather conventional multilevel regression model. Further studies could examine the use of multiple imputation with other covariance structures or analytic frameworks (e.g., random intercept models, multilevel structural equation models, generalized estimating equations).

Conclusion
In sum, relatively few studies have examined missing data handling for random coefficient models, and ours was the first to compare fully conditional specification to joint model imputation with random covariance matrices (Yucel, 2011). Although both approaches can introduce bias-inducing incompatibilities, they tend to do so in somewhat different situations. However, the potential for problematic biases underscores the importance of promising new Bayesian imputation procedures (Enders et al., 2016).