A Bayesian Approach for Estimating Multilevel Latent Contextual Models

In many applications of multilevel modeling, group-level (L2) variables for assessing group-level effects are generated by aggregating variables from a lower level (L1). However, the observed group mean might not be a reliable measure of the unobserved true group mean. In this article, we propose a Bayesian approach for estimating a multilevel latent contextual model that corrects for measurement error and sampling error (i.e., sampling only a small number of L1 units from a L2 unit) when estimating group-level effects of aggregated L1 variables. Two simulation studies were conducted to compare the Bayesian approach with the maximum likelihood approach implemented in Mplus. The Bayesian approach showed fewer estimation problems (e.g., inadmissible solutions) and more accurate estimates of the group-level effect than the maximum likelihood approach under problematic conditions (i.e., small number of groups, predictor variable with a small intraclass correlation). An application from educational psychology is used to illustrate the different estimation approaches.

Multilevel modeling (MLM) is one of the most prominent modeling approaches in the social sciences. One major advantage of MLM is that it can be used to simultaneously relate outcome variables to predictors at the individual and group levels (e.g., Goldstein, 2011;Raudenbush & Bryk, 2002;Snijders & Bosker, 2012). However, many constructs in the social sciences cannot be measured without error, and it is well known that this measurement error leads to biased and less efficient estimates of population parameters. In the last two decades, substantial progress has been made in integrating MLM with structural equation modeling (SEM) by employing multiple indicators to correct for the unreliability that occurs in the assessment of constructs (e.g., Lee & Poon, 1998;McDonald, 1993;Mehta & Neale, 2005;B. O. Muthén & Asparouhov, 2011;. In many applications of MLM, group-level (L2) variables for assessing group-level effects are generated by aggregating variables from a lower level (L1). For example, in educational psychology, features of the learning environment (e.g., classroom climate, teacher behavior; see Marsh et al., 2012) are measured by aggregating individual student ratings at the class level and then, in the next step, using these aggregated ratings to predict learning outcomes. However, it has been pointed out in the methodological literature that the observed group mean might not provide a reliable measure of the unobserved true group mean (e.g., Asparouhov & Muthén, 2007;Croon & van Veldhoven, 2007;Grilli & Rampichini, 2011;Lüdtke et al., 2008;Pokropek, 2015;Preacher, Zyphur, & Zhang, 2010;Shin & Raudenbush, 2010). Lüdtke, Marsh, Robitzsch, and Trautwein (2011; see also Marsh et al., 2009;Rabe-Hesketh, Skrondal, & Pickles, 2004) argued that two main sources of error can be distinguished when assessing group-level effects by aggregating individual data from a lower level: unreliability that is due to measurement error at L1 and L2 and unreliability that is due to sampling only a small number of L1 units (e.g., persons) within each L2 unit. To correct for these two types of error, they discussed a multilevel latent contextual model for estimating group-level effects of aggregated L1 variables. This model has been described as a doubly latent model for estimating contextual effects because it takes into account measurement error at L1 and L2 (based on multiple L1 indicators) as well as L2 sampling error (latent aggregation from L1 to L2). The multilevel latent contextual model is currently implemented in Mplus (L. K.  and is based on maximum likelihood (ML) estimation (see also the GLLAMM module in Stata; Rabe-Hesketh et al., 2004).
However, extensive simulation research has identified two major obstacles to applying the multilevel latent contextual model in research practice. First, serious estimation problems such as nonconvergence or inadmissible estimates (e.g., negative variances) have been observed for the doubly latent approach with small intraclass correlations and small sample sizes at L1 and L2 (Li & Beretvas, 2013;Lüdtke et al., 2011). Second, under certain conditions (e.g., a small number of L2 groups, a small intraclass correlation), the multilevel latent contextual model, although approximately unbiased, was found to be much more variable (i.e., to have a larger empirical sampling variance) than the biased traditional MLM approach, which uses the observed group mean as a predictor variable. This has resulted in the counterintuitive finding that-under certain conditions-the multilevel latent contextual model provides less accurate estimates of the group-level effect than the manifest approach in terms of the root mean squared error (RMSE), which combines bias and variability. This bias-accuracy trade-off constitutes a major challenge for the general applicability of the multilevel latent contextual model (Li & Beretvas, 2013; see also Ledgerwood & Shrout, 2011).
In this article, we introduce a Bayesian approach for estimating the multilevel latent contextual model, and we evaluate it with simulation studies. We believe that a Bayesian approach has the potential to mitigate the limitations of the multilevel latent contextual model. First, several scholars have argued that Bayesian methods offer a promising approach for estimating multilevel models with latent variables, particularly when the focus is on estimating variance components, and the number of groups is small (see Depaoli & Clifton, 2015;Hox, van de Schoot, & Matthijsse, 2012; see also Hamaker & Klugkist, 2011). Using appropriate prior distributions for the parameters, the Bayesian approach guarantees that the model estimates will be within the admissible range (Can, van de Schoot, & Hox, 2015). Second, Bayesian methods allow for the stabilization of parameter estimates by specifying weakly informative prior distributions for critical parameters (Baldwin & Fellingham, 2013;Chung, Gelman, Rabe-Hesketh, Liu, & Dorie, 2015). In a recent study, Zitzmann, Lüdtke, and Robitzsch (2015) showed that a Bayesian approach with a weakly informative prior distribution for the group-level variance of the predictor produced more accurate estimates of group-level effects than an ML approach (see also Depaoli & Clifton, 2015). However, the authors did not take into account the role of measurement error and considered only the sampling error.
This article is organized as follows. We begin by briefly describing the multilevel latent contextual model. We then present two studies. In Study 1, where there is one latent predictor variable, we discuss how the multilevel latent contextual model can be estimated by applying a Bayesian approach and present a simulation study that compares the proposed Bayesian approach with the ML approach. In Study 2, we add a manifest group-level predictor to the multilevel latent contextual model. We show both analytically and with a simulation how an unreliable assessment of the latent predictor at the group level results in biased estimates of the effect of the manifest group-level predictor. We again compare the Bayesian approach with the ML approach. Finally, we use an empirical example from educational psychology to demonstrate how the Bayesian approach can be specified with the WinBUGS software (Spiegelhalter, Thomas, Best, & Lunn, 2003) and in Mplus.

MULTILEVEL LATENT CONTEXTUAL MODEL
In this section, we briefly present the main features of the multilevel latent contextual model, which is essentially a multilevel SEM. Different frameworks for conceptualizing multilevel SEMs have been proposed (see Rabe-Hesketh, Skrondal, & Zheng, 2012, for a discussion). In the following, we use the within-between framework, which decomposes observed indicator variables into within-group components (denoted by W) and between-group components (denoted by B). Two standard structural equation models are then defined: one within-group model and one between-group model. More specifically, the multivariate response vector y ij for person i in group j (e.g., student i in class j) is decomposed as follows: where y B j is a vector of random effects between groups, and y W ij is a vector of residuals within groups.
In the next step, measurement models that relate the within-and between-group components of the multiple indicators to the latent constructs at the individual and group levels are specified: where ν is a vector of intercepts; and Λ W and Λ B are factor loading matrices linking the within-parts y W ij and betweenparts y B j to the latent constructs η W ij and η B j at L1 and L2. The latent constructs η W ij and η B j are assumed to be multivariate normally distributed, and the ε W ij and ε B j terms denote within-and between-group errors that have multivariate normal distributions around zero with covariance matrices Θ W and Θ B at L1 and L2. The model in Equation 2 has been extensively discussed in the literature on multilevel confirmatory factor analysis (e.g., Geldhof, Preacher, & Zyphur, 2014;Mehta & Neale, 2005;B. O. Muthén, 1994).
The structural models that relate the latent constructs at L1 and L2 to one another are given as follows: where α is a vector of regression intercepts and B W and B B are coefficient matrices for the effects of the latent constructs (η W ij and η B j ) on each other at L1 and L2. The influence of the exogenous covariates x ij and x j at L1 and L2 are given by the coefficient matrices Γ W and Γ B . Note that exogenous covariates are manifest variables that are assumed to be measured without error. The ζ W ij and ζ B j terms denote structural residuals that are multivariate normally distributed with means of zero and covariance matrices Ψ W and Ψ B at L1 and L2. It is important to emphasize that in the multilevel latent contextual model, the effects of the latent constructs at the group level given by B B are corrected for both measurement error and the unreliability that results from sampling only a finite number of persons from a potentially infinite population (e.g., a small number of students from a school). Both kinds of error are discussed in more detail in the next section.

STUDY 1: ONE LATENT PREDICTOR VARIABLE
In this section, we begin with the simple case of a multilevel latent contextual model with one latent predictor variable that is measured at the individual level. We further assume that the latent construct is measured by k = 1, …, p multiple indicators (e.g., the items or subscales of a test). According to Equation 2, the indicator y kij is decomposed as follows: where v k is the indicator-specific mean, λ W k is the withingroup factor loading, λ B k is the between-group factor loading, and ε W kij and ε B kj are the indicator-specific residuals at L1 and L2, which are assumed to be uncorrelated across indicators. As the factor loadings are allowed to vary across indicators, Equation 4 represents a congeneric measurement model for the unobserved true scores η W 1ij and η B 1j at L1 and L2. To identify the metric of both latent variables η W 1ij and η B 1j , the loadings of the first indicator at L1 ðλ W 1 ¼ 1Þ and at L2 ðλ B 1 ¼ 1Þ can be fixed to 1. In multilevel research, the intraclass correlation (ICC) provides important information about the multilevel structure of the data. For the latent predictor variable, the intraclass correlation is defined by: where ψ B 1 is the variance of the latent between-group portion η B 1j , and ψ W 1 is the variance of the latent within-group portion η W 1ij . Thus, the ICC η indicates the proportion of the total variance of the latent predictor variable that can be attributed to between-group differences.
In the following, we are interested in predicting an observed dependent variable y dep,ij that is measured without measurement error and can be decomposed into its within and between portions as follows: The structural models at L1 and L2 are then given as: where β W is the within-group (L1) regression coefficient, which describes the relationship between η W 2ij and η W 1ij within groups, and β B is the between-group (L2) regression coefficient, which describes the relationship between the group components η B 2j and η B 1j . The structural part of the multilevel latent contextual model can be written as follows: It can now be seen that the dependent variable is decomposed into the parts that can be predicted by the within-and between-group components of the latent predictor and the corresponding residual terms at L1 and L2. The model is illustrated in Figure 1. The congeneric measurement model for the latent predictor can be further simplified by assuming parallel indicators; that is, the factor loadings are equal at L1 ðλ W k ¼ λ W k 0 Þ and at L2 ðλ B k ¼ λ B k 0 Þ, and the measurement error variances are also equal at L1, Varðε W kij Þ ¼ Varðε W k 0 ij Þ, and at L2, In research practice, the model with parallel indicators at L1 and L2 in Equation 7 is often approximated by a traditional multilevel model in which the latent between-group component η B 1j is assessed by using the observed group mean, which is calculated by averaging across items and persons within a group: y kij , where n j denotes the number of persons within a group j. In the traditional multilevel model, this observed group mean of the total score is treated like a manifest variable rather than a latent variable. Using the group-mean-centered predictor variable at L1, the traditional approach would be specified as follows (see Raudenbush & Bryk, 2002): As can now be seen, the traditional approach in Equation 8 approximates the multilevel latent contextual model from Equation 7. The latent variables η B 1j and η W 1ij are substituted by the manifest variables y j and y ij À y j at L1 and L2, and β W DM and β B DM are the corresponding within-group and between-group coefficients. It is apparent that the traditional approach does not take into account the error that is present in the manifest scores. As both types of error are ignored-measurement error and sampling error-the approach could be labeled a doubly manifest (DM) approach for estimating group-level effects (Lüdtke et al., 2011). It can be shown thatβ W DM is a biased estimator of the within-group regression coefficient β W that underestimates the absolute size of the true relationship within groups (e.g., Cohen, Cohen, West, & Aiken, 2003;Kenny, 1979;see Equation 14 in Lüdtke et al., 2011). The bias ofβ W DM is a direct function of the reliability of the manifest score at L1, which depends on the number of items when indicators are parallel. However, the bias of the estimatorβ B DM of the between-group coefficient β B depends on the reliability of the observed group mean, which is also influenced by the L2 sampling error. Assuming parallel indicators at L1 and L2, the reliability of the observed group mean (Kane, Gillmore, & Crooks, 1976; see also Schweig, 2014) is given by: where θ B is the variance of the between-group errors ε B kj , and θ W is the variance of the within-group errors ε W kij . As Equation 9 shows, the reliability is a function of the sampling error (i.e., 1 n j ψ W 1 ), the measurement error at L2 (i.e., 1 p θ B ), and the measurement error at L1 (i.e., 1 pn j θ W ). Assuming equal group sizes (i.e., omitting the index j in n j ), it can be shown analytically (Lüdtke et al., 2011; see also Shin & Raudenbush, 2010) that the bias of the estimator of the between-group coefficient in the traditional multilevel model in Equation 8 is given by: As can be seen, the bias depends on the reliability of the manifest group means as well as the size of the within-group (β W ) and between-group (β B ) coefficients in the population. It is also evident that the bias of the estimator of the between-group coefficient is a function of sampling error (n and ICC η ) and measurement error (number of indicators p), and that the bias decreases with an increasing number of groups and a larger ICC. However, even when there is no sampling error (when the ICC η is 1),β B DM is still a biased estimator because measurement error attenuates the effect of the manifest group mean.

664
ZITZMANN ET AL. multiplied by the likelihood function. One refers to the result as the posterior distribution, which forms the basis for the construction of Bayesian point and interval estimators. This idea can be formally written as the wellknown Bayes's theorem: posterior / likelihood • prior. In practice, the posterior is often approximated by using Markov chain Monte Carlo (MCMC) algorithms that iteratively sample from conditional distributions, constituting an MCMC chain. Bayesian point and interval estimates can be derived when the chain has converged and approximates the posterior distribution sufficiently well (for an introduction, see Christensen, Johnson, Branscum, & Hanson, 2011;Jackman, 2009;Lynch, 2007). The prior distributions that we used to estimate the multilevel latent contextual model were as follows. We specified uninformative or weakly informative priors for the parameters of the measurement model in Equation 4 and the structural model in Equation 6, respectively. Given meancentered indicator variables, a normal distribution with zero mean and a small precision of .001 (or equivalently a large variance of 1,000) was considered uninformative for the intercepts and the loadings (except for the first one) in the measurement model of the latent predictor as well as for the intercept and coefficients in the structural model: Although it is often less critical to select priors for regression coefficients, there is an ongoing discussion about which prior distribution should be specified for variance components (e.g., Browne & Draper, 2006;Chung, Rabe-Hesketh, Dorie, Gelman, & Liu, 2013;Gelman, 2006;Natarajan & Kass, 2000). However, the inverse gamma (IG) distribution is most commonly used in research practice (Browne & Draper, 2006). With a shape/scale parameter of .001, it is approximately uniform for a large range of possible values and is thus meant to be uninformative (but see Gelman, 2006, for the argument that this prior can be quite informative when variance components are very small). Thus, an approximately uninformative IG distribution with a shape/scale parameter equal to .001 was specified for the within-group variances in the multilevel latent contextual model: Bayesian approach provides the opportunity to stabilize parameter estimates by specifying weakly informative prior distributions (e.g., Baldwin & Fellingham, 2013). One application of an informative prior is related to the observation that, under particular conditions, estimates of the between-group variances can be close to zero (Chung et al., 2013). To penalize for near-zero estimates, we set the shape/scale parameter to .1 (see Depaoli & Clifton, 2015): ψ B 1 ,IG : quantile of the distribution was 0.094 and the 97.5% quantile was 1.082 • 10 15 . As can be seen, large estimates of the between-group variances were still allowed, so the prior could be considered weakly informative.

Simulation Study
The aim of the simulation was threefold. First, previous simulations have shown that the multilevel latent contextual model is prone to estimation problems (nonconvergence, inadmissible estimates) when estimation is carried out with an ML approach (Li & Beretvas, 2013;Lüdtke et al., 2011). These estimation problems should not occur with a Bayesian approach (e.g., Hox & Maas, 2001). Second, we believe that one reason for the low accuracy of the multilevel latent contextual model when estimated by an ML approach is a nonnegligible proportion of near-zero estimates of the between-group variance of the predictor variable (see Chung et al., 2013). We assumed that by specifying weakly informative prior distributions for variance parameters at the group level, the estimates of the between-group effect provided by the Bayesian approach would be stabilized and thus more accurate than the estimates provided by the ML approach (Zitzmann et al., 2015). Third, we compared the doubly manifest model that did not correct for error in the predictor variable with two multilevel latent contextual models that differed in their measurement models (parallel vs. congeneric indicators). This allowed us to test how the complexity of the measurement model was related to the quality of the estimates of the between-group effect.

Method
Conditions. The data were generated by the multilevel latent contextual model given in Equations 4 and 6 (see also Figure 1). The latent predictor was measured with three mean-centered and normally distributed indicators, each with a standardized between-group loading of 0.7. The (unstandardized) within-and between-group regression coefficients were set to 0.2 and 0.5, respectively. The manifest outcome variable was standardized and normally distributed (i.e., zero mean, unit variance). The unconditional ICC of the outcome variable (not adjusted for the predictor) was 0.3. The (unstandardized) within-group and between-group loadings of the first indicator were both set to 1 to identify the metric of the latent predictor. The indicators could be either parallel (withingroup loadings: 1, 1, 1; between-group loadings: 1, 1, 1) or congeneric (within-group loadings: 1, 0.4, 1.6; between-group loadings: 1, 0.4, 1.6). In both cases, the loadings were specified to be invariant across levels (cross-level invariance). The number of groups was set GROUP-LEVEL EFFECTS to either 25 or 100. These numbers were chosen so that the lower number would represent a very small number in multilevel research (see Kreft & De Leeuw, 1998). The group size was either 10 or 30 persons. Small group sizes occur in organizational psychology (e.g., working teams), whereas group sizes are often larger in educational psychology (e.g., school classes). The ICC of the latent predictor was set to either .1 or .3 to simulate ICCs that often occur in the social sciences (see Gulliford, Ukoumunne, & Chinn, 1999). The indicators were assumed to have the same ICC as the latent predictor. For each of the 2 Â 2 Â 2 Â 2 ¼ 16 data constellations, 1,000 data sets were generated.
Analysis models. Each of the data sets was analyzed with the doubly manifest model and the multilevel latent contextual models with parallel and congeneric indicators. In the model with parallel indicators, all within-and betweengroup loadings were fixed to 1. In the model with congeneric indicators, only the within-and between-group loadings of the first indicator were fixed to 1 to identify the metric of the latent predictor, whereas the remaining within-and between-group loadings were freely estimated but constrained to be equal across levels. Moreover, in the model with parallel indicators, the within-and between-group error variances were freely estimated but constrained to be equal across the indicators, whereas in the model with congeneric indicators, the error variances were freely estimated without any constraints.
Each model was estimated in Mplus 7.1 by applying ML with robust standard errors (MLR; L. K.  and in WinBUGS 1.4.3 (Lunn, Thomas, Best, & Spiegelhalter, 2000). Before running the main simulation study, we studied the behavior of the MCMC sampler in a preliminary simulation by inspecting two criteria for each data constellation and each analysis model: (a) the potential scale reduction factor (PSR; Gelman & Rubin, 1992), and (b) the MCMC standard error (MCMCSE) divided by the posterior standard deviation (PSD): MCMCSE PSD More specifically, for each model parameter, we inspected whether the PSR was close to 1 (see B. O. Muthén & Asparouhov, 2012) and whether the MCMCSE PSD ratio was smaller than .05 (see Cho & Bottge, 2015). The PSR is typically used to diagnose whether the chain has converged, and the MCMCSE PSD ratio is used to assess how well the empirical mean of the chain approximates the posterior mean (Hoff, 2009, p. 101). Applying these two criteria suggested that an average chain length of about 80,000 iterations (with a burn-in period of one tenth of the total length) was sufficient to provide a good approximation of the posterior distribution (see Figure E.1 in the supplemental materials for example trace plots). However, the actual chain length varied substantially in accordance with the analysis model and the data constellation. In the worst case (J ¼ 25, n ¼ 10, ICC η ¼ :1), the model with congeneric indicators required a total of 180,000 iterations. However, in the most favorable case (J ¼ 100, n ¼ 30, ICC η ¼ :3), the doubly manifest model required only 900 iterations. The Bayesian point estimate was defined by the mode of the posterior distribution (MAP; see DeCarlo, Kim, & Johnson, 2011;Zitzmann et al., 2015). The MAP was calculated using kernel density estimates (e.g., Silverman, 1998). The Bayesian confidence interval (BCI) estimate was defined by the 2.5th and the 97.5th percentiles of the posterior distribution (see Box & Tiao, 1992).
Evaluation criteria. The performances of the different analysis models and estimation approaches were evaluated on the basis of the following four criteria: frequency of estimation problems, relative bias, relative RMSE, and coverage rate. An ML solution was classified as nonconverged (NC) when the convergence criterion was not met within the maximum number of iterations or when difficulties arose in optimizing the fit function. 2 Furthermore, a solution was classified as inadmissible converged (IAC) when it converged to a solution that included an inadmissible estimate (e.g., a negative variance; see Marsh, Hau, Balla, & Grayson, 1998; see also Wothke, 1993). Solutions that did not fall into these two categories were classified as admissible converged (AC), indicating that no estimation problems occurred. The relative bias of the estimator of the between-group effect β B is given as the difference between the expectation of the estimateβ B and β B divided by β B . A maximum absolute value of 0.10 was considered to constitute an acceptable degree of relative bias (L. K. Muthén & Muthén, 2002). The overall accuracy was assessed by computing the relative RMSE, which is given as the square root of the expectation of the squared deviation of the estimateβ B from β B divided by β B . Finally, the standard errors ofβ B were evaluated in terms of the coverage rate, which is given as the probability that the 95% confidence interval (CI) captures β B . A coverage rate between 91% and 98% was considered acceptable (L. K. Muthén & Muthén, 2002).

Results
Estimation problems. No estimation problems occurred for the Bayesian approach. However, in some data constellations, serious estimation problems were observed when the ML approach was applied. Table 1 presents the AC rate, the NC rate, and the IAC rate for the multilevel latent contextual models with parallel and congeneric indicators. The doubly manifest model estimated with ML did not show any estimation problems and was therefore not included in Table 1. When the data-generating model was the model with parallel indicators, the AC rate was higher for the correctly specified model with parallel indicators (M = 95.2%, range = 80.9%-100.0%) than for the model with congeneric indicators (M = 89.8%, range = 62.0%-100.0%) for which loadings were estimated. As can be seen, for both analysis models, the AC rate was lowest when the number of groups was small (i.e., J = 25). When the data-generating model assumed congeneric indicators, the percentages of admissible converged solutions were clearly lower for the model with parallel indicators (M = 66.2%, range = 42.3%-93.3%) than for the model with congeneric indicators (M = 90.9%, range = 59.7%-99.9%). Most of the estimation problems for the "misspecified" model with parallel indicators were due to nonconverged solutions, which increased in particular when the ICC increased.
For the following analyses, the comparison between the ML and Bayesian approaches was based on only those data sets for which the ML approach provided an AC solution. Furthermore, we restricted the comparison of the two estimation approaches to the case where the analysis model matched the data-generating model.

Comparison
of the ML and Bayesian approaches. Table 2 shows the relative bias, relative RMSE, and coverage rate for the estimators of the betweengroup effect provided by the ML and Bayesian approaches. As can be seen, when the data were generated by the model with parallel indicators, both estimation approaches provided approximately unbiased estimates when the ICC of the latent   Note. Results are based on only those data sets for which ML provided an admissible converged solution. PI = multilevel latent contextual model with parallel indicators; CI = multilevel latent contextual model with congeneric indicators; J = number of groups; n = group size; ICC η = intraclass correlation of the latent predictor.

GROUP-LEVEL EFFECTS
predictor was large (ML: M = 0.05, range = 0.04-0.07, Bayes: M = 0.00, range = −0.10-0.07). However, in conditions with a small ICC, the Bayesian approach produced a negatively biased estimate, particularly when the number of groups was small. By contrast, the ML approach provided slightly positively biased estimates of the between-group effect in conditions with a large number of groups and a small group size. A very similar picture emerged for the relative bias when the data were generated with the model with congeneric indicators.
Moreover, when the model with parallel indicators was the data-generating model, the estimates obtained with the Bayesian approach (M = 0.64, range = 0.32-1.04) were more accurate than the estimates provided by the ML approach (M = 1.00, range = 0.31-2.12), particularly when the number of groups was small and the ICC was low. For example, in the condition with a small number of groups, a small group size, and a low ICC, the relative RMSE was 2.12 for the ML approach and only 1.04 for the Bayesian approach. Again, the findings were very similar when the data were generated by the model with congeneric indicators. However, it needs to be pointed out that for both datagenerating models, differences in the relative RMSE between the ML and Bayesian approaches vanished in the most favorable condition (J = 100, n = 30, ICC η = .3).
As can also be seen from Table 2, when the model with parallel indicators held in the population, the 95% CI for the ML estimator (M = 94.7%, range = 93.2%-96.3%) and the 95% BCI for the Bayesian approach (M = 95.7%, range = 93.4%-98.1%) produced coverage rates that were close to the nominal 95% rate. However, the Bayesian approach showed a tendency to provide coverage rates that were slightly too high (compared with ML) in conditions with a small number of groups. The results for the coverage rate did not change when the model with congeneric indicators was the data-generating model.
Comparisons of different analysis models within Bayes. We also investigated the differences between the different analysis models for estimating the between-group effect. For these analyses, we only used the Bayesian approach because no estimation problems were produced by the Bayesian approach. Table 3 presents the relative bias, RMSE, and coverage rate for the estimator of the between-group effect for the doubly manifest model, the model with parallel indicators, and the model with congeneric indicators. As can be seen, when the data were generated by the model with parallel indicators, the estimates provided by the doubly manifest model were substantially negatively biased. As expected, the estimated bias was largest when the group size was small and the ICC was low. By contrast, the model with parallel indicators and the model with congeneric indicators provided approximately unbiased estimates of the between-group effect, with the exception that in the worst condition (J = 25, n = 10, ICC η = .1), both models produced slightly negatively biased estimates. However, when the data were generated under the assumption of congeneric indicators, the misspecified model with parallel indicators provided positively as well as negatively biased estimates, particularly in conditions with a large group size and a high ICC. Again, estimates obtained from the doubly manifest model were substantially negatively biased, whereas the correctly specified model with congeneric indicators provided approximately unbiased estimates.
Next, we compared the different analysis models with respect to their relative RMSE. When the data-generating model assumed parallel indicators, the doubly manifest model provided more accurate estimates than the model with parallel indicators and the model with congeneric indicators when the number of groups was small. However, with a large number of groups and a high ICC, the doubly manifest model, which does not take into account error in the predictor variable, was outperformed by the two latent models. For example, in the most favorable condition (J = 100, n = 30, ICC η = .3) the relative RMSE was 0.44 for the doubly manifest model but 0.32 for the model with parallel indicators and 0.31 for the model with congeneric indicators. A very similar pattern of findings was observed when the data were generated under the assumption of congeneric indicators. However, as could be expected, the model with congeneric indicators provided more accurate estimates than the misspecified model with parallel indicators when the number of groups was large.
Finally, the doubly manifest model provided coverage rates that were clearly below 90% (range = 15.3%-87.6%) when the data-generating model assumed parallel indicators. By contrast, the model with parallel indicators and the model with congeneric indicators produced coverage rates that were close to the nominal 95% rate. When the data were generated under the assumption of congeneric indicators, the coverage rates provided by the doubly manifest model were again too low (range = 9.4%-85.8%). The models with parallel (M = 95.3%, range = 88.1%-99.2%) and congeneric indicators (M = 95.8%, range = 93.6%-97.2%) provided coverage rates close to 95%, with the exception that the coverage rate was too low (88.1%) for the misspecified model with parallel indicators in the condition with a high ICC and a large number of groups with large group size. 3 We also compared the WinBUGS software with the Bayesian estimation procedure that is implemented in the popular SEM software Mplus. To this end, we conducted an additional simulation in both programs in which we examined the relative bias, relative RMSE, and coverage rate for the estimator of the between-group effect in the models with parallel and congeneric indicators. The results for WinBUGS and Mplus were almost identical (see Table A3 in the supplemental materials). 4

Discussion
The main findings from the simulation can be summarized as follows. First, in some of the conditions, serious estimation problems were observed when the ML approach was used to estimate the multilevel latent contextual model. Estimation problems were due to nonconverged solutions or inadmissible estimates. As expected, these estimation problems could be avoided by using a Bayesian estimation approach. Thus, from the perspective of an applied researcher, adopting a Bayesian approach seems to be particularly promising when the ML approach does not converge or converges to an inadmissible solution (see Depaoli & Clifton, 2015). Second, the results for the relative RMSE, which combines bias and variability, suggested that the Bayesian estimator of the between-group effect can be more accurate than the ML estimator, particularly when sample sizes are small and the ICC is low. However, these differences between the Bayesian estimator and the ML estimator vanished when the number of groups was large. Third, consistent with previous research, the estimates of the between-group effect provided by the doubly manifest model were substantially biased. However, in challenging data constellations (i.e., small groups, low ICC), the estimates provided by the doubly manifest model were more accurate than the estimates produced by the multilevel latent contextual model (bias-accuracy trade-off; Lüdtke et al., 2011; see also Ledgerwood & Shrout, 2011). There is a turning point when the approximately unbiased estimator of the latent model becomes more accurate than the estimator of the manifest model. Further analyses of the simulation data indicated that the Bayesian approach required less data (i.e., a smaller number of groups and a smaller group size) Note. DM = doubly manifest model; PI = multilevel latent contextual model with parallel indicators; CI = multilevel latent contextual model with congeneric indicators; J = number of groups; n = group size; ICC η = intraclass correlation of the latent predictor. 4 However, it is important to note that the same specification of the MCMC method was used (i.e., single MCMC chain with a fixed length and the first tenth of iterations as burn-in iterations). In a further simulation, we evaluated a different specification that uses the default convergence criterion in Mplus (i.e., a PSR< 1.1 for the parameter with the worst convergence). It is possible for an MCMC chain to meet the default criterion but the approximation of the posterior might not be optimal (as indicated by MCMCSE PSD >:05). One could speculate that this could lead to less accurate parameter estimates in terms of the RMSE. We reran the simulation with two specifications in Mplus: the default convergence criterion and a specification that used a chain that was 100 times longer (see Table A4 in the supplemental materials). With a small number of groups (J = 25), the default convergence criterion resulted in less accurate estimates of the between-group effect. B. O. Muthén and Asparouhov (2012) also pointed out that the default criterion might not be sufficient for a close approximation of the posterior distribution and that longer chains should be used.

GROUP-LEVEL EFFECTS
to reach this turning point than the ML approach. For the model with congeneric indicators, Figure 2 shows that differences in the RMSE between the manifest and latent models depended on the estimation approach. For example, in the condition with a large number of groups (J = 100) and a small ICC (ICC η = .1), the Bayesian estimator of the between-group effect had almost the same RMSE as the biased estimator of the doubly manifest model, whereas the ML estimator was clearly less accurate. To conclude, these findings suggest that the Bayesian approach offers a promising alternative to an ML approach. It reduces estimation problems in the multilevel latent contextual model and can also increase the accuracy in the estimation of grouplevel effects (see Depaoli & Clifton. 2015).
One limitation of the simulation was that it was restricted to only one predictor variable that was measured at L1. In the next section, we discuss the multilevel latent contextual model for the case of two predictor variables in which one predictor is directly measured at the group level.

STUDY 2: ADDING A MANIFEST GROUP-LEVEL PREDICTOR VARIABLE
MLM is often used to assess how variables directly measured at the group level (e.g., class size, treatment condition) affect outcomes at the individual level. In cluster or group randomized trials in educational research, classrooms or schools are randomly assigned to different treatment conditions (Aydin, Leite, & Algina, in press;Murray, 1998). MLM is then applied to estimate the effect of the intervention on individual student outcomes. In observational studies, researchers rely on the naturally occurring variation between groups to estimate the effect of directly measured group-level variables. For example, educational psychologists are interested in assessing the effect of school type (e.g., academic track schools vs. nonacademic track schools) on cognitive and noncognitive student outcomes. In this section, we investigate the multilevel latent contextual model with a directly measured group-level predictor and one latent predictor measured at L1.

Multilevel Latent Contextual Model With a Manifest Group-Level Predictor Variable
The multilevel latent contextual model with a manifest group-level variable is an extension of the model presented in Study 1. In addition to the latent predictor η 1ij , which is measured by multiple indicators y kij k ¼ 1; . . . ; p ð Þ and splits into a between-group part η B 1j and a within-group part η W 1ij , the model assumes a manifest group-level predictor  x j , which is measured directly at the group level without error. Then the structural model becomes: where the regression coefficient γ indicates the effect of x j on the dependent variable. The measurement model for the latent predictor η 1ij is the same as in Equation 4 when an additional manifest group-level predictor is added. In the traditional MLM approach for estimating the effect of group-level predictors, the observed group mean y j would be used instead of η B 1j and the group-mean-centered individual scores y ij À y j would be used instead of η W 1ij . The traditional multilevel model would then be specified as follows: It can now be shown that β B DM ,β B DM andγ DM are biased estimators of the corresponding regression coefficients β W , β B , and γ in the multilevel latent contextual model presented in Equation 11 (for the analytical derivation, see Appendix). For the between-group effect of the predictor that is measured at L1, the bias (assuming that η 1ij and x j are standardized) is given by: where ρ ηx indicates the correlation between the betweengroup component of the latent predictor and the manifest group-level predictor. As can be seen, when the two predictors are uncorrelated at the group level ρ ηx ¼ 0 , the expression for the bias of the between-group coefficient simplifies to the case with only one predictor variable measured at L1 (see Equation 10). However, with correlated predictors ρ ηx Þ0 , the bias becomes a function of the reliability of the observed group mean, the ICC of the latent predictor, and the correlation of the two predictors at the group level. Note that with correlated predictors, the bias of the estimatorβ B DM will be even stronger than in the case with uncorrelated predictors.
It is interesting to note that the doubly manifest model can also provide a biased estimator of the effect of the observed group-level predictor that is assumed to be measured without error. It can be shown thatγ DM is a biased estimator of γ: This relationship indicates that the bias depends on the correlation between the predictors at the group level as well as on the ICC of the latent predictor and the bias of the estimator of the between-group effect. As can be seen, when there are uncorrelated predictors at the group level ρ ηx ¼ 0 ,γ DM is an unbiased estimator of γ. However, when there are positively correlated predictors at the group level ρ ηx >0 ,γ DM will be positively biased when a positive between-group effect exists β B >0 À Á . For the case where β W ¼ 0:2, β B ¼ 0:5; γ ¼ :2; and ρ ηx ¼ :3, Figure 3 shows the expected bias as a function of the ICC of the latent predictor, the group size, and the number of indicators (again assuming that the standardized loadings are .7 at L2). As can be seen, the bias grows larger as the ICC increases. However, increasing the group size or the number of indicators has a negative effect on the bias. Thus, in the general condition in which the predictor variables are will be biased if the predictor measured at L1 is affected by sampling and measurement error.

Specification of Prior Distributions
As in the previous section, we suggest a Bayesian approach for estimating the multilevel latent contextual model with an additional manifest group-level predictor. In this model, we assume that the between-group component of the latent predictor and the manifest group-level predictor are multivariate normally distributed. For these additional model parameters, appropriate prior distributions have to be selected in the Bayesian approach. More specifically, we specified an uninformative normal distribution for the mean of the manifest group-level predictor: μ x ,N 0; 1000 ð Þ; and a somewhat informative IG distribution for its variance: σ 2 x ,IG (.1,.1). Moreover, as suggested by Lunn, Jackson, Best, Thomas, and Spiegelhalter (2013; see also Liu, Zhang, & Grimm, 2016), we used an uninformative uniform distribution ranging from -1 to 1 as a prior distribution for the correlation between η B 1j and x j : ρ ηx ,U À1; 1 ð Þ. The prior distributions for all other parameters were specified as in Study 1.

Simulation Study
The simulation had two goals. First, the simulation was expected to confirm the results of the analytical derivation that showed that the doubly manifest model can provide substantially biased estimates of the effect of a manifest group-level predictor when the error in a predictor variable measured at L1 is not taken into account. Second, we compared the ML approach with the Bayesian approach and assumed that the advantages of the Bayesian approach for estimating the multilevel latent contextual model would also transfer to the situation in which an additional manifest group-level variable was added to the model.

Method
The data-generating model was a multilevel latent contextual model with parallel indicators and a manifest grouplevel predictor (see Equation 11). The manifest group-level predictor was standardized at the group level, and its regression coefficient was set to 0.2. The within-group and between-group loadings of the latent predictor were all set to 1 and were not varied further in the simulation. The number of groups was set to either 50 or 200 with a group size of either 5 or 20 persons each. The ICC of the latent predictor was .05 or .2. The very low value (.05) was chosen to include a condition that is expected to be particularly problematic for the ML approach. The correlation between the between-group component of the latent predictor η B 1j and the manifest group-level predictor x j À Á was set to zero (uncorrelated) or .3 (correlated; see Croon & van Veldhoven, 2007). For each of the 2 × 2 × 2 × 2 = 16 conditions, 1,000 data sets were generated. Each data set was analyzed with the doubly manifest model (see Equation  12) and the multilevel latent contextual model with parallel indicators. Again, the PSR and the MCMCSE PSD ratio were used to assess the convergence behavior of the MCMC algorithm in the different conditions. 5

Results
Estimation problems. First, we evaluated potential estimation problems for the ML approach (see also Table B.1 in the supplemental materials). Again, the doubly manifest model did not show any estimation problems. However, for the multilevel latent contextual model with parallel indicators, estimation problems were observed under some of the conditions. For example, for the most problematic condition (J = 50, n = 5, ICC η = .05, ρ ηx = .3), only 69.6% of the solutions were classified as admissible converged. The percentage of solutions with estimation problems decreased as the ICC of the latent predictor increased and as the group size and number of groups increased. In the most favorable condition (J = 200, n = 20, ICC η = .20), no estimation problems were identified for the latent model. The occurrence of estimation problems was not affected by whether correlated or uncorrelated predictors were specified in the data-generating model.

Comparison
of the ML and Bayesian approaches. Table 4 shows the relative bias, RMSE, and coverage rate for the estimator of the effect of the manifest group-level predictor provided by the ML and Bayesian approaches. When uncorrelated predictors were specified in the data-generating model, both the ML approach and the Bayesian approach provided approximately unbiased estimates. This was also true when the predictors were correlated, with the exception of the condition with a small number of groups, a small group size, and a low ICC, in which both approaches were slighthy positively biased (ML = 0.10, Bayes = 0.12).
Next, an inspection of the relative RMSE showed that the Bayesian approach (M = 0.35, range = 0.19-0.53) provided more accurate estimates of the effect of the manifest grouplevel predictor than the ML approach did (M = 0.47, range = 0.20-1.05). Especially in conditions with a small number of groups and a low ICC, the Bayesian approach clearly outperformed the ML approach. In addition, the difference between the ML approach and the Bayesian approach in terms of the RMSE was larger when correlated predictors were specified in the data-generating model. However, it is interesting that even in conditions with uncorrelated predictors in the population, the Bayesian approach was more accurate than the ML approach. For example, when the number of groups was small, the group size was small, and the ICC was low, the relative RMSE was 0.85 for the ML approach and only .51 for the Bayesian approach. One possible explanation for these gains in accuracy is that in conditions with small groups and low ICCs, the observed correlations between the two predictors at the group level might deviate substantially from zero in the sample, and the Bayesian approach provides more stable estimates of these associations at the group level. This will in turn also lead to more stable estimates of the effect of the manifest group-level predictor.
Both the ML approach (range = 93.8%-97.5%) and the Bayesian approach (range = 94.5%-99.6%) provided acceptable coverage rates, with the exception that the coverage rate was somewhat too high (99.6%) for the Bayesian approach in the condition with an ICC of .05 and 50 groups with n = 5. The results for the between-group effect of the latent predictor produced by the ML and Bayesian approaches are in line with the findings presented in the previous simulation (see Tables B2 and B3 in the supplemental materials).
Comparisons of different analysis models within Bayes. In the next step, we compared the doubly manifest model with the model with parallel indicators. As expected by the analytical derivation, the doubly manifest model provided approximately unbiased estimates of the effect of the manifest group-level predictor in the case of uncorrelated predictors (see Table 5). The same was true for the model with parallel indicators. However, with correlated predictors, the doubly manifest model produced positively biased estimates, whereas the estimates provided by the model with parallel indicators were still approximately unbiased, with the exception of a slight positive bias (0.11) in one extreme condition (J = 50, n = 5, ICC η = .05).
With regard to the relative RMSE, the doubly manifest model (M = 0.33, range = 0.19-0.49) provided slightly more accurate estimates than the model with parallel indicators (M = 0.35, range = 0.19-0.54), particularly when the number of groups or the group size was small and the ICC was low. However, with increasing information about the aggregated L1 predictor (i.e., large number of groups, large group size), the difference between the doubly manifest model and the model with parallel indicators decreased. In one of the most favorable conditions (J = 200, n = 20, ICC η = .20, ρ ηx = .3), the model with parallel indicators even outperformed the doubly manifest model in terms of the relative RMSE. Furthermore, the difference between the models was more pronounced when the predictors were correlated in the population.
Finally, in the case of uncorrelated predictors, both the doubly manifest model (range = 94.1%-96.1%) and the model with parallel indicators (range = 94.5%-99.6%) provided acceptable coverage rates such that the model with parallel indicators had a tendency to provide higher coverage Note. Results are based on only those data sets for which ML provided an admissible converged solution. J = number of groups; n = group size; ICC η = intraclass correlation of the latent predictor.
GROUP-LEVEL EFFECTS rates than the doubly manifest model in conditions with a small number of groups. The findings for the coverage rates were similar when the predictors were correlated, except that in conditions with a large number of groups (J = 200), the doubly manifest model provided coverage rates that were too low.

Discussion
Overall, this simulation confirmed the results of the analytical derivation that showed that the doubly manifest model can produce biased estimates of group-level effects even when the group-level predictor is measured directly at the group level without error. In addition, the simulation was in line with the previous simulation, which showed that the Bayesian approach for estimating a multilevel latent contextual model provides more accurate estimates than the ML approach, particularly when the sample sizes were small and the ICC was low. It is interesting that these gains in accuracy offered by the Bayesian approach were also found for the effect of the manifest group-level predictor when the predictors were assumed to be uncorrelated in the population. The next section illustrates the different estimation approaches with a real data example from educational psychology.

ILLUSTRATIVE EXAMPLE: EFFECT OF TEACHER BEHAVIOR ON STUDENTS' ACHIEVEMENT
Previous research in educational psychology has supported the notion that students' learning outcomes are affected by teacher behavior. One important specific teacher characteristic is how easily teachers get distracted by their students.
In the Third International Mathematics and Science Study (Baumert et al., 1997), students were asked to rate their mathematics teacher's distractibility on three items (e.g., "Our mathematics teacher is easily distracted if something attracts his/her attention"; see Kounin, 1970). The data set stemmed from the German sample of lower secondary school students and contained observations from 2,131 students distributed across 108 classes with an average class size of 19.73. The ICCs of the student ratings for the three items were 0.03, 0.04, and 0.09. In line with previous research, we expected that student perceptions of their mathematics teacher's distractibility would be negatively associated with students' mathematics achievement. We specified two multilevel latent contextual models in which the latent predictor variable (teacher distractibility) was measured with three indicators at the student level. In the first model, the loading of the first indicator was fixed to 1 at L1 and L2. The other two loadings were freely estimated but constrained to be invariant across levels (crosslevel invariance). In the second model, the constraint of invariant loadings across levels was relaxed, and the second and third loadings were freely estimated at L1 and L2. For comparison purposes, we also estimated the doubly manifest model, which ignores error in the predictor variable.
The models were estimated with both the ML and Bayesian approaches (see the supplemental materials for the WinBUGS code). For the ML approach, we used Mplus 7.1. For the Bayesian approach, we specified a burn-in period of 50,000 iterations. Each fifth iteration of the subsequent 450,000 iterations was saved, which resulted in 90,000 draws from the posterior distribution. For each model and model parameter, the PSR was 1 and the MCMCSE PSD ratio was well below .05, indicating that WinBUGS approximated the posterior distributions very well (see Hoff, 2009;B. O. Muthén & Asparouhov, 2012; see also Cho & Bottge, 2015). We selected the MAP as the Bayesian point estimate and the PSD as the Bayesian standard error.
Results for the doubly manifest model, the model with invariant loadings, and the model with noninvariant loadings are presented in Table 6. In all three models, there was no effect of distractibility on mathematics achievement at the student level. However, student perceptions of their mathematics teacher's distractibility had a negative effect at the class level, indicating that students in classes with a more distractible mathematics teacher achieved less in mathematics than did those with a less distractible teacher. As expected from the the results of the analytical derivation in Equation 10, the estimate of the class-level

674
effect provided by the doubly manifest model was substantially smaller than those provided by the multilevel latent contextual models. It is interesting to note that fitting the model with noninvariant loadings with ML led to an inadmissible solution as indicated by a negative error variance at L2. To avoid estimation problems, we fixed that variance to zero. As expected, no inadmissible estimates were observed with the Bayesian approach. However, given that the data provided a relatively large amount of information (i.e., relatively large sample sizes), the differences between the Bayesian and ML approaches were small 6

GENERAL DISCUSSION
We presented an application of the Bayesian approach to multilevel SEM, focusing on (a) multilevel contextual models (Study l), and (b) multilevel models with multiple predictors at L2 (Study 2). In Study 1, we discussed the multilevel latent contextual model, which is often considered the true model in contextual studies, and presented a Bayesian estimation approach. In Study 2, we extended the model to include an additional observed group-level predictor and analytically derived the bias of that coefficient under the traditional MLM approach (i.e., the doubly manifest approach). In two simulation studies, we showed that with a Bayesian approach, estimation of multilevel SEMs could be stabilized (fewer estimation problems), and the accuracy of the estimates of group-level effects could be substantially improved. Finally, we presented an example from educational psychology to show how the Bayesian approach could be specified in WinBUGS and Mplus. Note. Fitting the multilevel latent contextual model with congeneric noninvariant indicators with ML had led to an inadmissible solution as indicated by a negative between-group error variance. Therefore, we fixed that variance to zero. DM = doubly manifest model; CI = multilevel latent contextual model with congeneric indicators; CNI = multilevel latent contextual model with congeneric noninvariant indicators; Est. = estimate; SE = standard error. 6 We also specified the Bayesian approach in the software Mplus (see Appendix for example code). Using the default criterion in Mplus to assess convergence (i.e., a maximal PSR of 1.1) resulted in MCMCSE PSD ratios that were too high (doubly manifest: .12; invariant loadings: .20; noninvariant loadings: .23), indicating that the posterior distributions of the parameters were not sufficiently approximated by the MCMC chain. When we used the same specifications in Mplus that we had used in WinBUGS (burn-in period of 100,000, chain length of 450,000, thinning factor of 5), the MCMCSE PSD ratios were well below .05. Moreover, Mplus provided point estimates and standard errors that were very similar to those provided by WinBUGS (see Table C.1 in the supplemental materials). It is interesting that Mplus was about seven times faster than WinBUGS. However, this is not surprising because Mplus's MCMC sampler is specifically written for latent variable models, whereas WinBUGS is written for (almost) arbitrary classes of statistical models.

GROUP-LEVEL EFFECTS
However, the following limitations of this study should be mentioned. First, as with most simulation studies, it is difficult to generalize the findings beyond the conditions that we manipulated in our studies. Second, as advised by Zitzmann et al. (2015), we stabilized the group-level effects by adding very little information to the prior distributions for the variances at L2. The priors for other model parameters were left uninformative. However, Depaoli and Clifton (2015) showed that the accuracy of the estimate of the between-group effect could be improved even more by incorporating previous knowledge about that parameter. For example, a weakly informative normal distribution could be selected for the between-group effect. Moreover, in Study 2 we specified an uninformative uniform prior ranging from −1 to 1 for the correlation between the between-group component of the latent predictor and the observed group-level predictor. Alternatively, a multivariate inverse Wishart (IW) distribution could have been specified for the covariance matrix (e.g., Gelman & Hill, 2007;B. O. Muthén & Asparouhov, 2012;see Daniels, 1999, for a multivariate alternative to the IW distribution). It would be interesting to see whether the findings would change with an IW prior distribution. Third, we used the Mplus default specification of ML in the comparison with the Bayesian approach. However, estimation problems are expected to be reduced when variances are constrained to be positive, negative variances are fixed to zero, more iterations are run, or better starting values are used. Fourth, only relatively simple multilevel SEMs were discussed. Future research should investigate the potential of a Bayesian approach for stabilizing the estimates of more complex multilevel SEMs with randomly varying slopes and cross-level interactions (see Preacher, Zhang, & Zyphur, 2016). It could be assumed that the Bayesian approach will become even more advantageous for estimating these models. Finally, it should also be noted that the ideas that motivated the use of the Bayesian approach in this investigation can also be found in likelihood-based approaches. For example, penalized likelihood methods (e.g., Chung et al., 2013;DeCarlo et al., 2011;Galindo-Garre, Vermunt, & Bergsma, 2004;McNeish, 2015;Mislevy, 1986) employ weakly informative priors for variance components to avoid boundary estimates (e.g., zero variances; Chung et al., 2013).
To conclude, this article showed that applying the Bayesian approach to multilevel SEM could be a promising alternative to an ML approach, particularly when the data contain only a little information about the group-level variables (e.g., small group sizes, low ICCs). General-purpose Bayesian software (e.g., WinBUGS) as well as software that was designed specifically for latent variable models (e.g., Mplus) can be used to specify the proposed approach.

DERIVATION OF BIAS FOR DOUBLY MANIFEST APPROACH
Assuming that the multilevel latent contextual model in Equations 4 and 6 holds in the population, the theoretically expected biases associated with adopting the doubly manifest approach were derived elsewhere (e.g., Lüdtke et al., 2011). Therefore, in the following, we consider the case of an additional observed group-level predictor (see also Study 2). If we assume that the indicators y k of the latent predictor η 1 , the manifest dependent variable η 2 and the manifest group-level predictor x are centered (i.e., zero mean), then all intercepts become 0. Moreover, if we assume that the y k s are parallel, then the loadings become 1, and the error variances no longer depend on k (i.e., Var ε W kij ¼ θ W ; Var ε B kij ¼ θ B ). More formally, the measurement model for the latent predictor becomes: and the structural model becomes: Let the within-and the between-group components η W 1 and η B 1 of the latent predictor η 1 and the observed group-level predictor x be summarized by the vector η 1 ¼ η W 1 ; η W 1 ; x À Á T . Then the vector of the coefficients β ¼ β W ; β B ; γ À Á T can be calculated as: where: S η 1 η 1 is the covariance matrix of the predictors, and S y dep η 1 is the covariance matrix of the outcome variable and the predictors, respectively. In the doubly manifest approach, the groupmean-centered manifest scale score y ij À y j and the observed group mean y j replace the within-and the between-group components η W 1 and η B 1 of the latent predictor η 1 , respectively: If we let the vector y be defined as y ¼ y ij À y j ; y j ; x T , and we assume equal group sizes (i.e., omitting the index j in n j ), then the covariance matrix S y dep y is given by: where I is the identity matrix and C is: The covariance matrix Σ yy is given by: The manifest regression coefficients β DM ¼ β W DM ; β B DM ; γ DM À Á T are obtained as an ordinary least squares estimate: Let the estimates of β DM be summarized by the vector β DM ¼ ðβ W DM ;β B DM ;γ DM Þ T . Then its bias with respect to β is given as the difference between the expectation ofβ DM and β: For ease of notation, let: be the variances of y ij À y j and y j , respectively. Then the inverse of the manifest covariance matrix S À1 yy is given by: By inserting Equation A12 into Equation A10, we obtain the following biases: To facilitate interpretation, we insert the terms: Rel y ij À y j ¼ ψ W 1 s 2 ; Rel y j ¼ ψ W 1 =t 2 ; into Equation A13: Assuming that the latent predictor η 1 and the observed group-level predictor x are standardized implies: GROUP-LEVEL EFFECTS