A Comparison of ML, WLSMV, and Bayesian Methods for Multilevel Structural Equation Models in Small Samples: A Simulation Study

ABSTRACT Multilevel structural equation models are increasingly applied in psychological research. With increasing model complexity, estimation becomes computationally demanding, and small sample sizes pose further challenges to estimation methods that rely on asymptotic theory. Recent developments in Bayesian estimation techniques may help to overcome the shortcomings of classical estimation techniques. The use of potentially inaccurate prior information may, however, have detrimental effects, especially in small samples. The present Monte Carlo simulation study compares the statistical performance of classical estimation techniques with Bayesian estimation using different prior specifications for a two-level SEM with either continuous or ordinal indicators. Using two software programs (Mplus and Stan), differential effects of between- and within-level sample sizes on estimation accuracy were investigated. Moreover, it was tested to what extent inaccurate priors may have detrimental effects on parameter estimates in categorical indicator models. For continuous indicators, Bayesian estimation did not show performance advantages over ML. For categorical indicators, Bayesian estimation outperformed WLSMV only in the case of strongly informative accurate priors. Weakly informative inaccurate priors did not deteriorate the performance of the Bayesian approach, whereas strongly informative inaccurate priors led to severely biased estimates even with large sample sizes. With diffuse priors, Stan yielded better parameter estimates than Mplus.

Hierarchical data structures that call for multilevel analysis are encountered in several distinct research applications in psychology and sociology as well as in educational research. One important application of multilevel analysis is multirater studies. In practice, data often consist of different types of self- and informant reports, incorporating data from multiple raters that are nested within targets. This applies, for example, to peers as raters or to students rating their teachers, as these raters are considered interchangeable, having the same access to the target's behavior (Eid et al., 2008).
Observed variables in psychological research are most often not measured on a continuous but rather on a discrete scale (i.e., categorical dependent variables), imposing additional challenges for the estimation process. Maximum likelihood (ML) estimation of multilevel SEMs with categorical outcomes requires numerical integration with many dimensions of integration (B. Muthén, 2010), making the estimation process computationally demanding. Furthermore, estimation techniques like ML that rely on large-sample properties tend to produce unreliable and unstable results in small samples (Asparouhov & Muthén, 2010b; Hox & Maas, 2001; Meuleman & Billiet, 2009). In multilevel analysis, estimation of the parameters on the between level is based on as many observations as there are clusters in the data, so a sufficient number of clusters is needed for the asymptotic properties of the estimators to hold (Asparouhov & Muthén, 2010b).
As an alternative to classical estimation techniques, Bayesian estimation might help to overcome some of the aforementioned shortcomings when estimating complex multilevel SEMs in small samples. The Bayesian approach has recently been applied successfully to many complicated SEMs, such as two-level nonlinear SEMs, multivariate latent curve models (Song, Lee, & Hser, 2009), or semiparametric SEMs (Song, Chen, & Lu, 2013; Yang & Dunson, 2010). Bayesian analysis can deal with highly complex models in cases where classical approaches such as ML estimation often fail, as is the case for the estimation of complex multilevel SEMs with categorical indicators (Asparouhov & Muthén, 2010b; Muthén & Asparouhov, 2012). Furthermore, sampling-based Bayesian methods do not rely on asymptotic theory, exhibit better small-sample performance for factor analyses or when there are only few clusters in multilevel models (B. Muthén & Asparouhov, 2012; Hox et al., 2012), and might overcome convergence problems in small samples (Depaoli & Clifton, 2015).
Only a few studies have compared these methods for continuous indicator multilevel SEMs (Hox et al., 2012; Depaoli & Clifton, 2015). However, Hox et al. (2012) focused solely on between-level sample sizes in the context of international comparative surveys, assuming large within-level sample sizes. In multirater research, involving the assessment of interpersonal relationships with family members, friends, or colleagues, within-level sample sizes often do not comprise more than two to 10 observations. Moreover, one recent study compared Bayesian with frequentist approaches for a multilevel SEM with dichotomous indicators (Depaoli & Clifton, 2015). Their results indicated advantages of Bayesian estimation over WLSMV (weighted least squares, mean and variance adjusted) when used in combination with informative priors. In accordance with previous studies (Asparouhov & Muthén, 2010b; Lee, Song, & Cai, 2010), their results showed that for models with categorical items in small samples there was a substantial effect of the choice of priors (Depaoli & Clifton, 2015), a phenomenon termed prior assumption dependence (Asparouhov & Muthén, 2010b). As there is less empirical information available in data with small sample sizes and categorical indicators, inaccurate prior information may have particularly detrimental effects in these situations. Nevertheless, simulation studies investigating the influence of inaccurate priors on the estimation of multilevel SEMs are scarce. The current study fills this gap. To the best of our knowledge, this is the first study investigating the influence of inaccurate priors on estimation accuracy in multilevel SEMs and comparing estimation methods for ordered categorical indicators for small within- and between-level sample sizes.

Simulation studies on sample size requirements
With respect to continuous multilevel SEMs, simulation studies have indicated a strong influence of between-level (Maas & Hox, 2005) as well as within-level sample sizes (Koch, 2013; Koch, Schultze, Eid, & Geiser, 2014; Yuan & Hayashi, 2005) on parameter estimates. For two-level SEMs, a minimum of 100 between-level units has been recommended (Julian, 2001; Maas & Hox, 2005). Koch et al. (2014) have shown that 100 between-level units require at least five within-level observations to obtain appropriate parameter estimates. A simulation study by Lee and Song (2004) suggested that for ML estimation the sample size should comprise at least 4 to 5 times as many observations as parameters, while 2 to 3 times the number of parameters seemed sufficient for Bayesian estimation.
According to a simulation study by Hox et al. (2012) on continuous indicator multilevel models, Bayesian estimation, as compared to ML, yielded satisfactory results with only 20 instead of 50 to 100 clusters on the between level. Simulation studies on SEMs with binary and/or ordinal variables found that larger sample sizes and more iterations until convergence were required than for models with continuous variables (Depaoli & Clifton, 2015; Lee et al., 2010). In a simulation study by Depaoli and Clifton (2015), Bayesian estimation showed fewer convergence problems than frequentist estimation methods in small samples for models with dichotomous indicators. Results on the accuracy of Bayesian estimation with diffuse priors are mixed (Depaoli & Clifton, 2015; Hox et al., 2012). While Bayesian estimation with informative priors outperformed other estimation techniques with respect to the estimation of between-group parameters (Depaoli & Clifton, 2015), diffuse priors led to biased between-group parameter estimates in small samples (<100) and for low intraclass correlations (ICCs) (≤ .1). For the less problematic within-group part, Bayesian estimation with diffuse priors required at least 100 level 2 observations for accurate estimates of loading parameters in dichotomous indicator models, while larger samples were needed for the accurate estimation of structural effects (Depaoli & Clifton, 2015).

Objectives of the study
The objective of this study is to address the preceding issues for multilevel SEMs by Monte Carlo simulation. Bayesian estimation and classical estimation procedures (ML for continuous and WLSMV for categorical indicators) are compared in their performance for a simple multilevel SEM under different between- and within-level sample sizes. The study focuses on small sample size conditions that are realistic for many psychological studies, including multirater studies or longitudinal studies with a small number of measurement occasions (two to three) per person. Differences in estimation accuracy for a model with continuous and categorical indicator variables, as well as the influence of different prior input (diffuse vs. different informative priors) in Bayesian analysis, are investigated. Inaccurate informative prior conditions are included in the simulation to scrutinize the effect of possible prior misspecifications. Furthermore, we compared Mplus (L. K. Muthén & Muthén, 1998) and Stan (Stan Development Team, 2014b) in their performance with regard to Bayesian estimation. With increasing computational power and the availability of Markov chain Monte Carlo (MCMC) methods, the number of statistical software packages for Bayesian data analysis has grown immensely, with widely varying implementations of MCMC methods and algorithms. To the best of our knowledge, this is the first study directly comparing different software programs for Bayesian estimation of multilevel SEMs.

Bayesian versus ML and WLS estimation methods
In the classical SEM framework, ML estimation focuses on covariances rather than individual observations, aiming to minimize the difference between the sample covariances and the covariances predicted by the model of interest (Bollen, 1989). If the model is correctly specified, ML yields consistent and asymptotically unbiased, asymptotically efficient, and asymptotically normally distributed parameter estimates (Bollen, 1989). Difficulties with ML estimation arise when numerical integration is required with many dimensions of integration, as is the case in confirmatory factor analysis (CFA) for categorical outcomes (B. Muthén, 2010; Wirth & Edwards, 2007). The computational demands of numerical integration methods increase with the number of factors, currently reaching their limits at a maximum of three to four dimensions of integration, corresponding to models with at most three to four latent variables (Asparouhov & Muthén, 2012a). An alternative estimation method is weighted least squares (WLS) estimation, which is based on a polychoric correlation matrix using pairwise information and can be applied even with a comparably high number of latent variables (Asparouhov & Muthén, 2012a). The robust WLSMV estimator as implemented in Mplus uses a diagonal weight matrix for the fitting function, with standard errors and a mean- and variance-adjusted χ² test statistic that use the full weight matrix (L. K. Muthén & Muthén, 1998; Asparouhov & Muthén, 2007).
For detailed information on the WLSMV estimation procedure and the two-level WLS estimator employed in Mplus, see B. Muthén (1983, 1984) and Asparouhov and Muthén (2007). Note that the categorical weighted least squares procedure is not unique to Mplus, and similar, if not identical, implementations of WLS estimation in other programs should lead to similar or identical results. The WLSMV estimator has been found to perform well for SEMs with ordinal indicator variables (Beauducel & Herzberg, 2006; Nussbeck, Eid, & Lischetzke, 2006) and multilevel SEMs with continuous and categorical indicator variables (Asparouhov & Muthén, 2007; Hox, Maas, & Brinkhuis, 2010). Previous Monte Carlo studies also indicated a deterioration of performance in small samples and complex models (Nussbeck et al., 2006) and convergence problems for multilevel SEMs at low sample sizes (Depaoli & Clifton, 2015).
By contrast, the Bayesian approach regards parameters as random variables, reflecting the uncertainty about a parameter's value in the parameter's probability distribution. Bayesian statistical analysis applies Bayes' theorem to probability distributions, expressing prior uncertainty about model parameters in terms of distributions and updating these prior beliefs with observed data to obtain a posterior probability distribution (Lynch, 2007). It states the following (Lynch, 2007):

p(θ | y) = p(y | θ) p(θ) / p(y) ∝ p(y | θ) p(θ),

where y is a vector of observed data; θ is the unknown parameter vector; p(θ | y) represents the posterior density for the parameter θ; p(y | θ) is the likelihood of the data given θ; and p(θ) is the prior probability density function (pdf) for the parameter. The Bayesian approach offers many advantages over frequentist approaches such as ML or WLSMV. Bayesian analysis allows one to update knowledge by using informative priors based on previous studies, the posterior distribution being more precise than the likelihood or the prior alone (Jackman, 2009). The incorporation of informative priors into the Bayesian analysis can be used to obtain better results in the presence of small to moderate sample sizes. Furthermore, Bayesian analysis tackles the problem of inadmissible parameter estimates, such as negative variances, by choosing a shape of the prior that assigns zero prior probability to these parameter spaces (Depaoli & Clifton, 2015; Hox et al., 2012). While ML estimation is grounded in large-sample theory, Bayesian estimation does not rely on asymptotic arguments and can give more reliable results for small samples (Song & Lee, 2012). For example, in multilevel models with only a small number of level 2 units, the number of independent observations is small, and Bayesian estimation might perform better than ML (Asparouhov & Muthén, 2010b; Baldwin & Fellingham, 2013; Hox et al., 2012).
Nevertheless, Bayesian and ML approaches are asymptotically equivalent, with Bayesian estimates sharing the optimal properties of ML estimates (Song & Lee, 2012). As the likelihood p(y | θ) depends on the sample size while the prior does not, the effect of the prior distribution decreases in large samples and the posterior approximates the likelihood. Hence, for n → ∞, the prior's influence on the posterior becomes negligible (Lynch, 2007).
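The shrinking influence of the prior can be illustrated with a conjugate normal-normal model (a minimal sketch, not from the study; all numeric values are hypothetical). With known data variance, the posterior mean is a precision-weighted combination of the prior mean and the data, and the data term dominates as n grows:

```python
import numpy as np

# Conjugate normal-normal updating with known data variance, illustrating
# how the prior's influence on the posterior shrinks as n grows.
def posterior_mean_var(y, prior_mean, prior_var, data_var):
    """Posterior of a normal mean under a N(prior_mean, prior_var) prior,
    given observations y with known variance data_var."""
    n = len(y)
    prec = 1.0 / prior_var + n / data_var          # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * (prior_mean / prior_var + np.sum(y) / data_var)
    return post_mean, post_var

rng = np.random.default_rng(1)
true_mu, data_var = 2.0, 1.0
prior_mean, prior_var = -3.0, 1.0                  # deliberately inaccurate prior

for n in (5, 50, 5000):
    y = rng.normal(true_mu, np.sqrt(data_var), size=n)
    m, v = posterior_mean_var(y, prior_mean, prior_var, data_var)
    print(n, round(m, 2))   # posterior mean moves toward 2.0 as n increases
```

With n = 5 the inaccurate prior pulls the posterior mean well below the true value; with n = 5000 its effect is negligible, mirroring the asymptotic argument above.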
Another main advantage of Bayesian analysis is the possibility to analyze models that are computationally heavy or impossible to estimate with ML or WLSMV (e.g., Asparouhov & Muthén, 2012b). These include models with categorical outcomes with many latent variables and thus many dimensions of numerical integration (Asparouhov & Muthén, 2010b;B. Muthén, 2010).

Bayesian estimation software
Bayesian analysis relies on sampling from complex distributions using MCMC methods, an iterative sampling process generating a Markov chain of draws from the posterior (Lynch, 2007). Implementations of MCMC methods and algorithms, however, vary widely across software packages for Bayesian data analysis. Currently available programs comprise, among others, Mplus (L. K. Muthén & Muthén, 1998), WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), MLwiN (Rasbash, Charlton, Browne, Healy, & Cameron, 2009), JAGS (Plummer et al., 2003), and Stan (Stan Development Team, 2014b). In the current study, two of these programs, Mplus and Stan, are compared with respect to their estimation accuracy. Mplus was chosen because it is one of the most widely applied programs for structural equation modeling and because of its fast estimation. Stan, on the other hand, is an open-source software coded in C++ that offers more flexible modeling possibilities. The two Bayesian estimators, as implemented in Mplus and Stan, differ in several ways that shall be delineated briefly. Mplus uses an MCMC algorithm based on the Gibbs sampler (Asparouhov & Muthén, 2010a). The Mplus algorithm is based on conjugate priors, that is, normally distributed priors for intercept, slope, and loading parameters, inverse Wishart priors for variance-covariance matrices, or an inverse gamma prior in case the updating block consists of only one parameter. Stan, on the other hand, uses an adaptive version of Hamiltonian Monte Carlo (HMC), the No-U-Turn Sampler (NUTS). HMC is a generalization of the Metropolis algorithm that uses proposal distributions that change depending on the current position of the sampler and are influenced by the direction in which the posterior distribution increases. It avoids the random-walk behavior and sensitivity to correlated parameters inherent to many MCMC methods such as Gibbs sampling.
Iterations are allowed to move farther in parameter space and more rapidly through the target distribution, thereby allowing faster mixing, especially in high dimensions. The NUTS sampler is an extension of HMC that prevents inefficiencies by sending the trajectories from one position to the next in parameter space as far as possible during each iteration but stopping them before they make a U-turn back to the starting position (Kruschke, 2015). In addition, tuning parameters of the HMC algorithm are adaptively set during the sampling procedure. For an introduction to HMC, see Gelman et al. (2014, chapter 12) or Kruschke (2015, chapter 14). For a more detailed illustration, the interested reader is referred to Neal (2011) and Hoffman and Gelman (2014). As Stan does not use Gibbs sampling, it is not restricted to the use of conjugate priors (Stan Development Team, 2014c) and offers a wider variety of possible prior distributions that can be assigned to the parameters. Besides differences in Bayesian estimators and samplers, the two programs differ in that the between-level error variances cannot be set to zero (but only to a value close to zero) in Mplus. Stan, on the other hand, allows fixing between-level error variances to zero, a flexibility that might render it preferable in the present context.
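The mechanics of HMC can be illustrated with a minimal one-dimensional sketch (purely illustrative, not Stan's implementation; Stan's NUTS additionally adapts the step size and chooses the trajectory length dynamically). Momentum is resampled, the trajectory is integrated with the leapfrog scheme, and a Metropolis correction on the joint energy keeps the target distribution invariant:

```python
import numpy as np

# Minimal HMC step for a 1-D target with log density log_p and its gradient.
def hmc_step(x, log_p, grad_log_p, eps=0.2, n_leapfrog=20, rng=None):
    rng = rng or np.random.default_rng()
    p = rng.normal()                       # sample auxiliary momentum
    x_new, p_new = x, p
    # Leapfrog integration of the Hamiltonian dynamics
    p_new += 0.5 * eps * grad_log_p(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new
        p_new += eps * grad_log_p(x_new)
    x_new += eps * p_new
    p_new += 0.5 * eps * grad_log_p(x_new)
    # Metropolis accept/reject on the joint (position, momentum) energy
    log_accept = (log_p(x_new) - 0.5 * p_new**2) - (log_p(x) - 0.5 * p**2)
    return x_new if np.log(rng.uniform()) < log_accept else x

# Target: standard normal, log p(x) = -x^2/2, grad log p(x) = -x
rng = np.random.default_rng(0)
x, draws = 0.0, []
for _ in range(5000):
    x = hmc_step(x, lambda z: -0.5 * z**2, lambda z: -z, rng=rng)
    draws.append(x)
print(round(np.mean(draws), 2), round(np.std(draws), 2))
```

Because proposals follow the gradient of the log posterior rather than a random walk, successive draws are far less autocorrelated than Gibbs or random-walk Metropolis draws, which is why Stan typically needs fewer iterations (see the iteration counts reported below).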

Data-generating model
To keep the results as general as possible while still allowing the investigation of between- as well as within-level latent correlations, the most basic multilevel CFA model with two correlated factors at each level (Figure 1) was chosen for the simulation. This model was suggested as the multilevel CFA model for interchangeable raters by Eid et al. (2008). Applied to the context of multirater studies, in this CFA model each target is rated by multiple interchangeable raters on several constructs (traits). Using several indicators per construct, each observed variable can be decomposed into a latent trait component, a method effect associated with the raters, and an error component. It is a multilevel CFA model with two latent factors at each level. We used this model with (a) continuous observed variables or (b) categorical observed variables as the data-generating model in the simulation studies. The model can be defined as follows. Considering multiple indicators i of multiple traits k, measured by multiple raters r per target t, each observed variable Y_rtik can be decomposed into the following components:

Y_rtik = μ_ik + T_tik + M_rtik + ε_rtik. (2)

Equation (2) states that each observed variable for indicator i of trait k is additively decomposed into an indicator-specific grand mean μ_ik, representing the mean rating over all targets for indicator i of trait k; a level 2 mean-centered trait variable T_tik, capturing the deviation of a particular target's true mean (μ_ik + T_tik) from the population grand mean (μ_ik) for indicator i of trait k; a level 1 method variable M_rtik, representing the deviation of a particular rater's true rating from the mean rating over all raters for this target; and a measurement error ε_rtik. The component μ_ik + T_tik is called a target's "true" mean as it is corrected for measurement error as well as rater-specific effects (see Eid et al., 2008). The component μ_ik + T_tik + M_rtik is called a rater's "true" rating as it is corrected for measurement error.
The trait variable T_tik is a level 2 variable characterizing the target, while the method-specific variable M_rtik and the measurement error variable ε_rtik are both level 1 variables specific to a particular rater's rating of the target. To permit the separation of M_rtik and ε_rtik, the assumption of unidimensionality of the method-specific variables within a trait k is made, resulting in one common method factor M_rtk for all indicator variables measuring the same trait k. In addition, the indicator-specific trait variables T_tik can be assumed to be unidimensional for reasons of parsimony, resulting in a single common trait factor T_tk. This means that the method as well as the trait variables are positive linear functions of each other, respectively. The resulting equation for an outcome variable Y_rtik is

Y_rtik = μ_ik + λ_Tik T_tk + λ_Mik M_rtk + ε_rtik, (3)

where λ_Tik and λ_Mik denote the between-level and within-level loading parameters. The unidimensionality of method effects implies that while method effects may differ between traits, they are homogeneous within traits, resulting in the assumption that a rater over- or underestimates a target's trait value on all indicators measuring the same trait simultaneously.
The variables T_tk, M_rtk, and ε_rtik are assumed to have zero means, and it is assumed that (a) trait variables are uncorrelated with all method and error variables; (b) method variables are uncorrelated with error variables; and (c) all error variables are uncorrelated (Eid et al., 2008). Correlations between trait variables of different traits as well as between method variables belonging to different traits are explicitly allowed. The former are a measure of discriminant validity of the traits at the target level, while the latter indicate the generalizability of method effects across traits (Eid et al., 2008). This model is depicted in Figure 1 for the case of two traits and three indicators per trait.
Note that this definition of the model is just a specific example applying multilevel CFA models to the case of informant reports by several interchangeable raters per target. The model is equivalent to a general multilevel CFA model measuring two factors (see, e.g., Fox, 2010), where the observed variables Y_rt for observation r in cluster t are decomposed into

Y_rt = μ + Λ_B η_Bt + Λ_W η_Wrt + ε_rt,

with μ being a vector of intercepts (grand means), Λ_B and Λ_W matrices of loading parameters, and η_Bt and η_Wrt the latent variable vectors on the between and within level, respectively. Note that there is no between-level error variance in this model. Following this notation, the variance-covariance matrix of Y is given by

Σ = Σ_B + Σ_W,

with Σ_B and Σ_W being the between- and the within-covariance matrices, respectively, and Φ_B, Φ_W, and Θ being the covariance matrices of η_B, η_W, and ε. For a more detailed description of the model, see Eid et al. (2008). The multilevel CFA model introduced in the preceding can be easily extended to ordered categorical variables, assuming the existence of a continuous latent response variable Y*_rtik underlying the observed ordinal variables Y_rtik. For a variable taking on one of c_ik different values out of the set of possible categories S_ik := {0, …, c_ik − 1}, the relation between Y*_rtik and Y_rtik is given by the following measurement structure (Eid, 1996; B. Muthén, 1984; Nussbeck et al., 2006):

Y_rtik = s  if and only if  κ_sik < Y*_rtik ≤ κ_(s+1)ik,  s ∈ S_ik,

with κ_0ik = −∞ and κ_(c_ik)ik = +∞. According to this measurement model, the variable Y_rtik results from a categorization of the variable Y*_rtik that is determined by the threshold parameters κ_sik. As Y*_rtik is a latent variable and its metric therefore is not determined (B. Muthén & Asparouhov, 2002), the residual variances have to be fixed to one and the intercepts μ_ik of Y*_rtik set to zero for identification reasons, resulting in

Y*_rtik = λ_Tik T_tk + λ_Mik M_rtk + ε*_rtik,

where the residuals are assumed to be i.i.d. as ε*_rtik ∼ N(0, 1).
Note that this is only one of several parameterization options, corresponding to a probit link with theta parameterization in Mplus (B. Muthén & Asparouhov, 2002).
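Under these assumptions, data generation for the categorical indicator model can be sketched as follows (a hypothetical single-trait example; the loadings, variances, and thresholds shown are illustrative and are not the study's population values). Latent responses Y* are built from a target-level trait factor, a rater-level method factor, and a standard-normal residual (theta parameterization), then cut at fixed thresholds:

```python
import numpy as np

# Illustrative data generation for the two-level CFA model with one trait:
# n_t targets, each rated by n_r raters on 3 indicators; ordinal data arise
# by cutting the latent response Y* at fixed thresholds (probit link).
rng = np.random.default_rng(42)
n_t, n_r = 100, 4
lam_T = np.array([1.0, 0.9, 0.8])      # between-level (trait) loadings
lam_M = np.array([1.0, 1.1, 0.9])      # within-level (method) loadings
var_T, var_M = 0.4, 0.6                # factor variances (hypothetical)
kappa = np.array([-1.0, 0.0, 1.0])     # thresholds -> 4 response categories

T = rng.normal(0, np.sqrt(var_T), size=(n_t, 1))          # trait per target
M = rng.normal(0, np.sqrt(var_M), size=(n_t, n_r, 1))     # method per rater
eps = rng.normal(0, 1.0, size=(n_t, n_r, 3))              # Var(eps*) = 1 (theta)
y_star = lam_T * T[:, None, :] + lam_M * M + eps          # latent responses Y*
y = np.digitize(y_star, kappa)                            # categories 0..3

print(y.shape)   # (100, 4, 3): targets x raters x indicators
```

`np.digitize` implements exactly the threshold rule of the measurement model: category 0 below the first threshold, category 3 above the last.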

Monte Carlo simulation study
The main goal of this Monte Carlo (MC) simulation study was to investigate differences between ML (for continuous observed variables) or WLSMV (for categorical observed variables) and Bayesian estimation methods for multilevel SEMs under different sample size conditions. The focus in the present work was on parameter estimates, standard errors, and coverage assuming correctly specified models. For this purpose we conducted two simulation studies. The first MC study concentrated on the comparison of classical estimation methods with the Bayesian approach under several conditions. The aim of the second MC study was to compare the two different estimation programs for Bayesian inference (Mplus and Stan).

Simulation design
Five aspects were manipulated in Simulation Study 1: (a) the dependent variable type (continuous or categorical); (b) the estimation method (ML for continuous observed variables, WLSMV for categorical observed variables, and Bayes); (c) the type of prior in the case of Bayesian analysis (diffuse vs. different informative priors); (d) the number of clusters (level 2 observations; nL2 = 50, 100, 150, 200); and (e) the number of observations per cluster (level 1; nL1 = 2, 4, 6). Small sample size settings were chosen in order to investigate the model under minimal, realistic conditions and to evaluate possible advantages of Bayesian statistics. They can be considered realistic sample sizes in the context of multirater studies. In the case of the categorical indicator model, the simulation comprised an extended set of prior conditions (see the section on data analysis).
The aim of Simulation Study 2 was to compare the two different estimation programs for Bayesian inference (Mplus vs. Stan). Simulation Study 2 added the software as a condition to the conditions of Simulation Study 1, restricted to Bayesian analysis as the estimation method. However, not all sample size conditions of Simulation Study 1 were analyzed; instead, we focused on a reduced set of conditions for the number of level 2 and level 1 units (nL2 = 100, 150; nL1 = 4, 6). The numbers of observations on the between and within level were chosen according to the results of Simulation Study 1 as well as Stan test runs, considering both feasibility and stability of results.
The combination of these factors results in 72 and 36 simulation conditions for the categorical and the continuous indicator models in Simulation Study 1, respectively, and 32 simulation conditions in Simulation Study 2. In Simulation Study 2, 100 replications per condition were simulated. To yield more stable results and diminish the effect of potential convergence problems, the comparatively more complex categorical indicator model was simulated with an increased number of iterations (see Data generation and analysis), using 300 replications per condition in Simulation Study 1.

Data generation and analysis
The data-generating model used in the present study is described in the preceding and depicted in Figure 1. Categorical dependent variables were modeled with four response categories, using a probit link. For identification reasons, the first loading per factor was set to one.

Continuous indicator model
Table 1 (not reproduced here). Note. λ_Tik = between-level loadings; λ_Mik = within-level loadings; μ_ik = intercepts; ε_rtik = level 1 residuals; κ_sik = thresholds; T = trait (between-level factor); M = method (within-level factor); inf. = informative prior. IW = inverse Wishart distribution; IG = inverse gamma distribution; N = normal distribution; LKJ = LKJ correlation matrix distribution (Lewandowski et al., 2009), set on the correlation matrix of the respective parameters. Note that for Stan the priors listed for variance parameters are half-Cauchy distributions due to truncation, set on the respective scale parameters.
The remaining parameters were chosen corresponding to a consistency coefficient of CO(Y_ik) = .308. The consistency coefficient is defined as the proportion of the reliable variance of an indicator that is due to differences in the trait factor. It indicates the degree to which true differences between the ratings represent differences between targets (Eid et al., 2008). A consistency coefficient of .308 corresponds to intraclass correlation values of ICC = .28 and ICC = .20 in the continuous and categorical indicator models, respectively, values that are realistic for multirater data applications. The data for the model with categorical dependent variables were generated according to the theta parameterization in Mplus (Asparouhov & Muthén, 2007), corresponding to a level 1 error variance of Var(ε*_rtik) = 1 for all i, k. Models were simulated using Mplus 7 (L. K. Muthén & Muthén, 1998) and estimated using Mplus 7 and RStan (Stan Development Team, 2014a), the R interface to Stan. Results were analyzed using the software R 3.0.2 (R Development Core Team, 2013) as well as the R package "MplusAutomation" (Hallquist, 2013). In the present work, the MLR and WLSMV estimators as implemented in Mplus were used for model estimation of the continuous and the categorical indicator models, respectively. Simulation Study 1 comprised six different prior conditions for Bayesian estimation in the categorical indicator model: (a) diffuse, (b) weakly informative accurate, and (c) strongly informative accurate priors on loading parameters; (d) weakly and (e) strongly informative inaccurate priors on loading parameters; or (f) informative priors for variances and covariances only (informative Wishart prior). For the continuous indicator model, diffuse and strongly informative accurate priors were used. Exact population model parameters and prior distributions are given in Table 1.
Strongly informative prior conditions were intended to provide an upper bound for the performance of the estimator. The estimates obtained with these priors are the best that can be expected under Bayesian analysis and can thus serve as a benchmark for the evaluation of the diffuse or weakly informative priors. More realistic prior conditions in applied contexts were covered by the weakly informative and inaccurate prior conditions. Inverse Wishart priors in the informative Wishart prior condition corresponded to prior means matching the population parameter values, with prior variances of 0.017, 0.010, and 0.010 for the variances, the between-level covariance, and the within-level covariance, respectively. Priors on variance and covariance parameters in all but the informative Wishart prior condition were left at Mplus default values. For details on the expectation and variance of the inverse Wishart distribution, see Asparouhov and Muthén (2010b).
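For orientation, the prior mean of the inverse Wishart distribution is a standard textbook result (not taken from the paper): for a p × p covariance matrix Σ ∼ IW(Ψ, ν),

```latex
\mathrm{E}[\Sigma] \;=\; \frac{\Psi}{\nu - p - 1}, \qquad \nu > p + 1,
```

so the scale matrix Ψ and degrees of freedom ν jointly fix the prior mean, while larger ν concentrates the prior more tightly around it (i.e., smaller prior variances such as those reported above).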
Mplus priors for variance and covariance parameters are based on inverse gamma and inverse Wishart distributions. A disadvantage of the inverse Wishart distribution is that the informativeness of one parameter in the matrix (e.g., a covariance) determines the informativeness of other parameters (e.g., variances) and might restrict their prior range. In addition, inverse Wishart priors impose a dependency between standard deviations and correlations (Tokuda, Goodrich, Van Mechelen, Gelman, & Tuerlinckx, 2011). This reduces the flexibility of prior settings that might be needed for some applications. As Stan is not restricted to conjugate priors, a wider variety of possible prior distributions can be applied. In Stan, the covariance matrix can be decomposed into a correlation and a scale matrix so that correlation and marginal variance parameters are independent and can receive separate priors. In the present simulation, the vector of scales was given a weakly informative half-Cauchy distribution, and the prior for the correlation matrix was an LKJ prior (Lewandowski, Kurowicka, & Joe, 2009). For a correlation matrix Σ, the LKJ distribution with shape α > 0 is defined by (Stan Development Team, 2014c)

LkjCorr(Σ | α) ∝ det(Σ)^(α − 1)

and has support on correlation matrices, that is, symmetric positive definite matrices with unit diagonal. The expected amount of correlation among the parameters is controlled by the shape parameter α, favoring less correlation as α increases. Setting α equal to one corresponds to a uniform prior over all correlation matrices of a given order, while for 0 < α < 1 the density has a trough at the identity matrix. For details, see Lewandowski et al. (2009). For a discussion of using inverse gamma priors or half-Cauchy priors in hierarchical models, see Gelman et al. (2006). Bayesian estimation was done using two MCMC chains. The number of iterations used was determined by extensive preanalyses, inspecting PSR values, KS tests, and traceplots.
For Mplus, 80,000 iterations with a thinning factor of 3 were used, resulting in a total of 240,000 iterations, a burn-in of 120,000 iterations, and a sample of 40,000 iterations for posterior analyses. The NUTS sampler implemented in Stan usually needs fewer iterations than Gibbs sampling in order to converge, as iterations typically have lower autocorrelation (Stan Development Team, 2014c). Preanalyses showed that for Stan fewer iterations were needed than for Mplus. These were (a) 40,000 iterations for ordered categorical indicators and diffuse priors, (b) 10,000 for ordered categorical indicators and informative priors, and (c) 20,000 for continuous indicators. For estimation with Stan, MCMC chains were not thinned as a compromise between precision and estimation time, as thinning reduces precision (Link & Eaton, 2012) if not applied with a simultaneous increase of iterations.
Additional sensitivity analyses were conducted to investigate whether increasing the number of iterations or excluding replications according to the potential scale reduction (PSR) factor would change simulation results.

Model performance and evaluation criteria
Estimation performance was judged according to the following evaluation criteria: convergence, relative parameter estimation bias (peb), standard error bias (seb), and 95% coverage. Nonconvergence in the Bayesian setting refers to the Markov chain not converging to its equilibrium distribution. The PSR, effective sample sizes, and results of the Kolmogorov-Smirnov convergence test (KS test) were inspected to determine convergence for each replication. Traceplots for randomly selected replications within each condition were checked manually to ensure convergence of the chains to the same target distribution. In addition, the number of Mplus warning messages and potential improper solutions was checked.
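As an illustration of the PSR diagnostic, the following is a minimal sketch of the Gelman-Rubin computation for a single parameter. This is the textbook formula, not the exact implementation used by Mplus or Stan.

```python
import numpy as np

def psr(chains):
    """Potential scale reduction (Gelman-Rubin) for one parameter.

    chains: 2-D array of shape (m, n) = (number of chains, draws per chain).
    Values near 1 indicate that the chains have mixed.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(2, 1000))     # both chains on target
stuck = np.stack([rng.normal(0.0, 1.0, 1000),
                  rng.normal(3.0, 1.0, 1000)])   # chains in different regions
print(psr(mixed))   # close to 1
print(psr(stuck))   # well above 1
```

Large PSR values, as in the second example, signal that the chains explored different regions of the parameter space and that posterior summaries would be unreliable.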
The peb is a standardized measure of parameter estimation bias, computed as peb(θ_p) = ((1/n_rep) Σ_t θ̂_pt − θ_p) / θ_p, with θ̂_pt being the parameter estimate of the t-th Monte Carlo replication and θ_p the true population value for parameter p (Koch et al., 2014). Following Lee and Song (2004), standard error bias (seb) was calculated as the ratio between average standard error (SE) estimates and empirical standard deviations (SD), seb(θ_p) = ((1/n_rep) Σ_t se(θ̂_pt)) / sd(θ̂_p), where se(θ̂_pt) denotes the estimated standard error (or posterior SD) of parameter p for replication t, and sd(θ̂_p) denotes the standard deviation of the parameter estimates over all n_rep replications (empirical SD). Peb values are pooled for groups of parameters. In line with L. K. and Koch et al. (2014), peb values falling below a cutoff value of 0.10 (10%) were considered acceptable. Peb values falling between 0.10 and 0.30 (10% to 30% deviation from the population value) were considered medium bias, and peb values >0.30 were considered large bias. If SE estimates (posterior SDs) are close to empirical SDs, their ratio should be close to 1. Seb values (SE/SD) were classified as (a) negligible if 5/6 < ratio < 6/5; (b) medium if 2/3 < ratio < 5/6 or 6/5 < ratio < 3/2; and (c) large if ratio < 2/3 or ratio > 3/2. The 95% coverage is the proportion of replications for which the 95% confidence interval (ML and WLSMV) or the 95% credibility interval (Bayes) contains the true parameter value. Coverage should lie between .91 and .98.
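The three criteria can be sketched as follows. Variable names are assumptions, and the replication data are simulated purely for illustration.

```python
import numpy as np

# est[t], se[t]: estimate and standard error (posterior SD) of one
# parameter in replication t; theta: the true population value.

def peb(est, theta):
    """Relative parameter estimation bias: (mean(est) - theta) / theta."""
    return (np.mean(est) - theta) / theta

def seb(est, se):
    """SE bias as the ratio of the average SE to the empirical SD."""
    return np.mean(se) / np.std(est, ddof=1)

def coverage(est, se, theta, z=1.96):
    """Share of replications whose 95% interval contains theta."""
    lo, hi = est - z * se, est + z * se
    return np.mean((lo <= theta) & (theta <= hi))

def classify_seb(ratio):
    """Cutoffs from the article: 5/6 < r < 6/5 negligible, etc."""
    if 5/6 < ratio < 6/5:
        return "negligible"
    if 2/3 < ratio < 5/6 or 6/5 < ratio < 3/2:
        return "medium"
    return "large"

rng = np.random.default_rng(7)
theta = 0.6                                # a hypothetical true loading
est = rng.normal(theta, 0.05, size=500)    # estimates over 500 replications
se = np.full(500, 0.05)                    # well-calibrated SEs

print(round(peb(est, theta), 3))           # close to 0
print(classify_seb(seb(est, se)))          # negligible
print(coverage(est, se, theta))            # close to .95
```

With well-calibrated SEs, as simulated here, all three criteria fall in their acceptable ranges; biased estimates or miscalibrated SEs would push peb, seb, or coverage outside the cutoffs described above.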

Results
Exact peb, seb, and coverage values for all parameter types and conditions of both Simulation Study 1 and 2 are provided in Tables S1-S12 in the supplementary material. An overview of the main results of the simulation studies is given in Table 2.

Convergence
Convergence problems, as indicated by high PSR values at the last iteration, significant KS tests, and bad mixing in traceplots, were mostly restricted to certain parameters. These were primarily the between-level loadings in the categorical indicator models. Convergence problems almost solely occurred in the case of Bayesian estimation using diffuse priors in combination with categorical indicator variables. As expected, convergence difficulties were associated with a small number of level 1 as well as level 2 units, especially for those conditions with diffuse priors on loading parameters. Weakly informative (accurate and inaccurate) priors on loading parameters reduced convergence problems to the nL1 = 2 conditions, while all replications converged under all conditions for strongly informative priors. Convergence differences between the different conditions are illustrated by means of the PSR values at the last iteration for all replications in Figure 2. Increasing the number of iterations or excluding replications according to a PSR criterion did not change the simulation results. That is, no significant or systematic differences in estimation bias of the parameters could be found when either increasing the number of iterations or excluding replications according to PSR values.
For WLSMV estimation, 5.3% of the replications did not converge in the nL2 = 50, nL1 = 2 condition, while in all other conditions the number was close to zero. Computation of SEs in WLSMV estimation failed in 34.7%, 10.7%, and 1.7% of replications in the nL1 = 2 conditions with nL2 = 50, nL2 = 100, and nL2 = 150, respectively. Convergence problems in these conditions might indicate a lack of empirical information on the within level. The model might thus be observationally underidentified with only two observations within each cluster (two raters per target). Results on parameter estimates for the nL1 = 2 conditions should therefore be treated with caution.
Convergence checks using graphical inspection of trace plots and the KS test for the models with continuous indicator variables revealed satisfactory results, except for the intercept parameters. With ML estimation all models converged. No warning messages or improper solutions were found for ML estimation.

Continuous indicator model
Peb values fell below the cutoff value (<10%) for almost all within- and between-level sample sizes for all parameter types. Bias was highest for the Bayesian estimates of variances and covariances, with peb values exceeding the cutoff for some of the conditions involving 50 level 2 units (medium bias) and decreasing when increasing the number of observations on the between level (between-level variances and covariances) or on both the between and the within level (within-level variances and covariances). More accurate parameter estimates were found under ML as compared to Bayesian estimation for variances and covariances on both levels for almost all sample sizes. Differences in peb values between diffuse and informative priors were negligible.
For between-and within-level loadings, average posterior SDs in Bayesian estimation with informative priors tended to overestimate empirical SDs, the amount of bias decreasing with increasing sample sizes on both levels. There was no clear pattern of differences in SE bias between ML, Bayes diffuse, and Bayes informative for variances and covariances, residual variances, or intercepts.
Coverage values of loading parameters as estimated with ML and Bayes with diffuse priors all lie in the desired interval of 91% to 98%, while for some small sample size conditions Bayesian estimation with informative priors gave coverage rates higher than 98%. For between-level variances and covariances, coverage values below 91% occurred only for ML estimation with 50 level 2 units, while for within-level variances and covariances, coverage values in the nL1 = 2 and nL1 = 4 with nL2 = 50 condition of Bayes with diffuse priors did not reach 91%. Coverage for the intercepts revealed satisfactory results under ML estimation, while those of Bayesian estimation exceeded 98% in three cases. Coverages of residual variances ranged between 91% and 98%, except for ML with nL1 = 2 and nL2 = 50 (<91%), with Bayesian estimation yielding coverages closer to the values of 95%.

Categorical indicator model
Peb values for the categorical indicator model are depicted in Figure 3. For the loading parameters, WLSMV estimates yielded peb values below 10% for all but the lowest sample size conditions, while Bayesian estimation with diffuse priors only yielded peb values below the cutoff for the two highest nL1 = 6 conditions. The weakly informative prior condition showed considerably smaller peb values as compared to the diffuse prior condition, while estimation bias mostly stayed at higher levels as compared to WLSMV. With respect to the within-level variances and covariances, WLSMV exhibited peb values greater than 10%, while Bayesian estimation with diffuse priors yielded better results for all sample sizes. By contrast, peb values for the between-level variances and covariances of WLSMV estimation tended to be better than those of diffuse prior Bayesian estimation.
Peb values were smallest for the strongly informative accurate prior condition. Notably, the weakly informative inaccurate prior condition also showed notably small estimation bias for within-level loadings and bias values comparable to those of diffuse priors for between-level loadings. Inaccurate priors with strong prior information increased bias values for loading as well as variance and covariance parameters to considerable degrees. This effect diminished with increasing sample sizes; however, in most cases bias did not reach the levels of diffuse prior bias even for the highest sample size conditions. Note that informative priors on loading parameters seemed to exert an influence on the estimation accuracy of variance and covariance parameters, while the opposite was not the case. The informative Wishart prior did not affect estimation accuracy of loading and threshold parameters. The informative Wishart prior condition indicated a clear reduction in estimation bias for variances and covariances as compared to WLSMV or Bayes diffuse.
Seb values for the categorical indicator model are depicted in Figure 4. In the informative prior conditions, posterior SDs tended to overestimate empirical SDs for the respective parameters. Setting weakly informative accurate and inaccurate priors on loading parameters caused negligible bias in large samples and medium to large biases in small samples, while strongly informative accurate and inaccurate priors led to large biases in the loading parameters' posterior SDs irrespective of sample size. In the Wishart prior condition, large biases were found for the variances' and covariances' posterior SDs. Note, however, that the large deviations between posterior SDs and empirical SDs in the informative prior conditions are partly caused by empirical SDs decreasing considerably under the use of informative priors. In general, bias values decreased with increasing sample sizes on both levels. The largest bias was observed for WLSMV estimation in the nL1 = 2 conditions.
WLSMV mean coverage lay in the range between 90.3% and 97.7% for all parameter types. Coverage values of Bayesian estimation with diffuse priors lay between 82.3% and 96.3%, reaching the desired minimum of 91% for within-level loadings and between-level variances and covariances only with sample sizes of at least nL2 = 150 and nL1 = 4. Weakly informative accurate priors led to coverage values in the desired range (91% to 98%) for all conditions, except for the loading parameters (coverage >98%) in small sample size conditions (nL1 = 2 or nL2 = 50). Strongly informative accurate priors led to coverage values close to 100% for loading parameters and acceptable coverage for all remaining parameters. Weakly informative inaccurate priors led to acceptable coverage levels except for the within-level loadings (>98%) and between-level variances and covariances (<91%). Coverage values of the strongly informative inaccurate prior conditions lay close to zero for loading parameters (maximum of 17% coverage for nL2 = 200 and nL1 = 6), and between 35% and 73% for variances and covariances. Wishart priors led to coverages >98% for variances and covariances and coverages between 84.6% and 95.6% for loadings in small samples.

Convergence
Convergence checks using graphical inspection of trace plots and PSR values for estimation with Stan indicated good convergence for all parameter types and conditions. For convergence with Mplus see Simulation Study 1.

Continuous indicator model
Peb values for the model with continuous indicator variables all fell below the cutoff of 10%. With both programs, the highest peb values were observed for between-level variances and covariances (mean peb < 0.063), followed by within-level variances and covariances (mean peb < 0.028). Peb values of Stan estimation were smaller than those of Mplus; however, differences in peb values between the two programs were negligible in absolute value.
Standard error biases showed comparable levels for Stan and Mplus.
Coverage lay above 91% for all parameters in all conditions and exhibited very similar values for Mplus and Stan per condition (maximal difference between coverage values of the programs: 2.3%).

Categorical indicator model
In accordance with the results of Simulation Study 1, peb values were considerably lower in the informative as compared to the diffuse prior condition for both programs. Peb values in the diffuse prior conditions tended to be lower for the software Stan as compared to Mplus for most parameters. However, peb differences between the programs were negligible in absolute value (<10%). Peb values are illustrated in Figure 5, and results are summarized in Table 2. Seb values for loading parameters were higher in the informative than in the diffuse prior conditions with both programs. In the informative prior conditions, Mplus as well as Stan posterior SDs overestimated empirical SDs of the loading parameters.
Mplus estimation with diffuse priors yielded coverage values in the desired range (91% to 98%) for all parameters but the within-level loadings in the nL2 = 100 condition (>89.7%) and the between-level variances and covariances in the nL2 = 100 condition (>89.7%). Stan with diffuse priors yielded coverage values closer to 95% as compared to Mplus in most conditions (maximal difference in coverage between the programs: 3.7%) and higher than the cutoff of 91% in all but one case (between-level variances and covariances in the nL2 = 100, nL1 = 6 condition, coverage >89.7%). Informative priors increased coverage for both loading and variance and covariance parameters, leading to coverage rates of up to 100% for the loadings with both programs.

Data application
To study the estimation of model parameters in real data situations, the simulated model was applied to data stemming from a nonacademic 360° feedback survey. The survey was conducted by the cut-e Group in 2011 (cut-e Group, 2011; Lochner & Preuß, 2014) and included the assessment of managers' leadership competencies by their colleagues, supervisors, and subordinates on 64 items. For the current purpose, six items measuring two different constructs were selected, and the categorical indicator model was fitted to the data. The first three items assessed the managers' competencies and strategies for identifying and solving complex problems. The other three items measured different facets of handling conflicts in the working environment. We assumed complete data. The final data set used for estimation comprised 2,386 observations with information on 425 targets (managers), rated by a mean number of 5.614 raters (colleagues) each. Items were measured on an ordered response scale ranging from 1 (far above expectations) to 5 (far below expectations).
Included estimation methods are WLSMV and Bayes with diffuse and informative priors, estimated with Mplus as well as Stan. Diffuse prior settings correspond to those used in the simulation study. Informative priors were based on the parameters as estimated with WLSMV and were set on loading parameters only. WLSMV estimates were used as prior means with a prior variance of 0.01. This was done for illustrative reasons only and is not recommended in practice. That is, the informative prior condition was included in order to demonstrate the effect of different priors in a real data application. However, this use of data-driven priors is, in general, questionable, as the same data are used twice: first for constructing the prior and second for estimating the posterior distribution. This approach might overestimate precision (Darnieder, 2011; Gelman et al., 2014) and has been criticized and recommended against as "double dipping" (Berger, 2006; Darnieder, 2011; Depaoli & Clifton, 2015).
Iterations were set to 80,000 with a thinning of 20 in Mplus and 8,000 with 3,000 burn-in iterations (no thinning) in Stan. Convergence of the chains was satisfactory with both Mplus and Stan. The model fits the data reasonably well, with an RMSEA of 0.04, CFI of 0.99, TLI of 0.99, and SRMR of 0.02 for the within and 0.07 for the between level. Bayesian model fit was also acceptable, with a PPP value of .173 and a 95% confidence interval for the difference between the observed and replicated χ2 values including zero [−15.70, 46.20] (fit of Mplus estimation with diffuse priors). Parameter estimates, standard errors (posterior SDs), and estimated consistency coefficients are given in Table 3. It can be seen that estimates of loading parameters of Bayesian estimation with Mplus and Stan are rather similar when given comparable prior information. The same is true for posterior SDs. Moreover, parameter estimates obtained with Bayesian estimation are close to those obtained with WLSMV, with the exception of the trait loadings for the second construct, which are higher with all Bayesian methods than with WLSMV. As expected, the use of informative priors shrinks the Bayesian parameter estimates toward the WLSMV estimates, while posterior SDs for Bayesian estimation with informative priors are reduced as compared to those with diffuse priors. Latent correlations of trait and method factors show similar magnitudes for all estimation methods, indicating similar degrees of estimated discriminant validity and generalizability of method effects across different constructs. The magnitude of the trait correlation indicates a high association between a manager's conflict handling and theoretical problem-solving competencies. Estimated consistency coefficients for Bayesian estimation with Mplus and Stan are almost identical and only differ slightly from those estimated with WLSMV.
Consistency coefficients lie around a mean value of CO = .273 (WLSMV estimate), indicating a level of convergent validity often encountered in multirater studies. That is, approximately 27% of interindividual differences in the informant ratings can be explained by differences in the managers' competencies. This magnitude of the consistency coefficient is close to the value used for data generation in the simulation studies (.308).
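To make the interpretation of the consistency coefficient concrete, the following is a minimal sketch under made-up variance components (the numbers are not taken from Table 3, and exact definitions of consistency coefficients vary across multirater models; see Eid et al., 2008).

```python
# Hedged sketch: a consistency coefficient can be read as the share of
# an indicator's variance that is due to stable differences between
# targets (trait variance), as opposed to rater-specific (method) and
# residual variance. All values below are illustrative assumptions.

trait_var = 0.32    # variance due to the target's competency
method_var = 0.55   # variance due to rater-specific views
error_var = 0.30    # residual variance

co = trait_var / (trait_var + method_var + error_var)
print(round(co, 3))   # about .27, the order of magnitude reported above
```

Under this reading, a coefficient of .27 means that roughly a quarter of the interindividual differences in ratings reflect the targets themselves rather than rater idiosyncrasies or noise.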

Discussion
The results of the simulation studies indicated convergence problems for the model with categorical indicator variables in small sample size conditions and diffuse priors with Mplus. In line with previous findings, convergence problems diminished with increasing observational information and indicated that a minimum of 100 level 2 and 4 level 1 units should be used to ensure convergence. Weakly informative accurate priors could not ensure convergence in all cases with only two level 1 units.
Warnings concerning noncomputation of standard errors in WLSMV estimation with nL1 = 2 pointed in the same direction. In contrast to Mplus, Stan did not show any convergence problems for categorical indicator models in the nL2 = 100 and nL2 = 150 conditions in combination with nL1 = 4 and nL1 = 6. This might be due to HMC iterations typically having lower autocorrelation (Stan Development Team, 2014c), allowing for faster mixing in high dimensions and not showing the sensitivity to correlated parameters inherent to many MCMC methods such as Gibbs sampling. However, one disadvantage of the NUTS sampler is that each iteration takes considerably more computation time.

Continuous indicators
For the continuous indicator model, parameter estimation bias was relatively small with all estimation methods and diminished with increasing number of observations. In accordance with previous findings (Depaoli & Clifton, 2015; Koch et al., 2014), the parameters that were most sensitive to bias were the latent variances and covariances. Hence, the choice of an appropriate sample size is a more important issue in research contexts where the parameters of interest are latent variances and covariances. The use of informative priors was associated with higher standard error bias for the respective parameters in small samples. In general, in the continuous indicator model, informative as compared to diffuse priors did not exhibit any distinctive improvement in parameter estimates once the number of observations reached the recommended numbers. The minor differences in estimation bias between the use of diffuse and informative priors indicate that the choice of prior information for continuous indicator models is less influential than it is in the case of categorical models. According to these results, at least 100 level 2 and 4 level 1 observations are needed for Bayesian estimation of the continuous indicator model in order to yield proper parameter estimates comparably good to those of ML. A minimum number of 100 observations was also given as a recommended between-level sample size for two-level SEMs when estimated with ML by previous simulation studies (Julian, 2001; Koch et al., 2014). The given recommendations correspond to a ratio of 10 or more observations per parameter, a number in line with the findings of previous simulation studies on continuous indicator SEMs (Bentler & Chou, 1987; Koch et al., 2014). In accordance with the simulation study by Koch et al. (2014), the number of level 1 units was found to have an important effect on parameter and standard error estimates on the within level.
We did not find any performance advantages of Bayesian estimation over ML in terms of the accuracy of parameter estimates in the continuous indicator model. On the contrary, parameter estimation bias was higher for Bayesian methods than for ML in the case of variance and covariance parameters. This difference decreased with increasing empirical information. This result stands in contrast to previous studies finding performance advantages of the Bayesian approach over ML estimation for small sample sizes in single-level SEMs or small between-level sample sizes in two-level SEMs (Hox et al., 2012). A similar result was, however, obtained by Depaoli and Clifton (2015) concerning the effect of diffuse priors on between-level estimates in small samples.
With respect to the estimation software, Stan showed slightly better results than Mplus in terms of parameter estimates; differences, however, were minor.

Categorical indicators
For the ordered categorical indicator model, in contrast to the continuous indicator model, Bayesian estimates showed a strong dependence on prior assumptions. This is a known phenomenon termed prior assumption dependence (Asparouhov & Muthén, 2010b). In line with previous results (Asparouhov & Muthén, 2010b;Lee et al., 2010), the effect of the prior decreased with increasing sample size.
In general, parameter estimation bias for the loading parameters was considerably high in small samples using Bayesian estimation with diffuse priors. This indicates a need for a considerably larger number of observations on both levels in order to obtain proper parameter estimates as compared to the continuous case. In line with a simulation study by Nussbeck et al. (2006) on single-level MTMM models with categorical items, the WLSMV estimator as implemented in Mplus performed well in terms of parameter estimation bias except for the smallest sample size conditions. On the basis of our findings, we recommend sampling a minimum of 100 level 2 and 4 level 1 observations when using WLSMV. In contrast, we suggest sampling a minimum of 150 level 2 and 6 level 1 units to obtain proper parameter estimates using Bayesian estimation techniques with diffuse priors. This number was primarily determined by the loading parameters. The use of weakly informative accurate priors on the loading parameters reduced the required between- and within-level sample sizes to 100 and 4 observations, respectively. Using strongly informative priors for the loading parameters, an improvement in parameter estimation bias of all parameters was found, which outperformed WLSMV estimation irrespective of sample sizes and led to proper parameter estimates for more than two observations on the within level. These performance advantages are in line with findings by Depaoli and Clifton (2015). Strongly informative priors increased bias of standard errors (posterior SDs), again requiring a larger number of observations on both levels to adjust for this effect. It should be noted, however, that standard error estimates in Bayesian estimation do not play as important a role as in ML estimation, as they do not directly influence hypothesis testing or credibility intervals. Nevertheless, large standard errors indicate an excessive variability that could be reflected in too wide credibility intervals.
Notably, using weakly informative but inaccurate priors for the loading parameters yielded results comparable to those obtained with diffuse priors. Sufficiently low peb values for all parameters were achieved with at least nL2 = 100 and nL1 = 6. Increasing the informativeness of inaccurate priors had considerably detrimental effects on the estimation accuracy of factor loading as well as variance and covariance parameters. Increasing the sample size did not decrease bias sufficiently to reach the levels of diffuse prior bias.
In contrast to informative priors on the loadings, Wishart priors for the variances and covariances did not affect the estimation of the other parameters and reduced estimation bias only for variance and covariance parameters.
With diffuse priors, Stan parameter estimates were slightly less biased than Mplus parameter estimates; however, in order for all parameter estimates to show sufficiently low bias levels, minimum required sample sizes are still 150 level 2 and 6 level 1 units. Both programs showed a tendency to overestimate empirical SDs for those parameters that were given informative priors. A systematic overestimation of standard errors by Bayesian estimation methods has been reported before for estimation with WinBUGS (Lee et al., 2010).
According to these results, at least 150 level 2 and 6 level 1 observations are recommended for Bayesian estimation of the ordered categorical indicator model with diffuse priors, while lower sample sizes can be considered in combination with informative priors. In partial accordance with a study on ordered categorical single-level CFA models (Liang & Yang, 2014), Bayesian methods did not outperform WLSMV in all cases, but solely when highly informative accurate priors were used. Also for multilevel SEMs with dichotomous indicators, Bayesian estimation with diffuse priors performed worse than WLSMV (Depaoli & Clifton, 2015). Notably, the use of weakly informative inaccurate priors on loading parameters recovered parameter estimates as well as or even better than Bayesian estimation with diffuse priors, a result that has, for example, also been reported for the use of inaccurate priors for growth parameters in Bayesian growth mixture modeling (Depaoli, 2014). However, putting strongly informative inaccurate priors on loading parameters can lead to severe bias in the parameter estimates. Applying the high amount of prior information needed in order to render the Bayesian approach preferable to WLSMV consequently involves the risk of detrimental effects in case the locations of these priors are not accurate. The effect of inaccurate priors on variance or covariance parameters was not addressed here and should be explored in future research. Priors on variance and covariance parameters did not, however, exert an influence on estimates of other parameters, making inaccurate prior choices for these parameters relatively less harmful. As to the choice of estimation software, recommended sample sizes for Stan and Mplus are comparable.

Limitations and future directions
A potential reason for the rather slow convergence of the models under estimation with Mplus might be the amount of specified between-level error variance. Theoretically, the between-level error variance in the specified model is zero. However, as Mplus requires setting it to a value different from zero, it was set to a value as small as 0.002. Increasing the between-level error variances was observed to considerably speed up convergence; it would, however, present a misspecification of the model.
A topic for future studies should be the sensitivity of the results to the magnitude of loading parameters, error variances, latent covariances, and ICC values as previous studies indicate a potential effect on estimation accuracy (Depaoli & Clifton, 2015;Liang & Yang, 2014;Koch, Schultze, Burrus, Roberts, & Eid, 2015;Koch et al., 2014).
Further research is needed to explore the effect of different prior distributions, as the specification of good and meaningful priors is a difficult endeavor. Especially the effect of inaccurate prior specifications at different sample sizes needs further investigation, as the availability of good a priori information on model parameters might be scarce in practice. The advantages of the Bayesian approach when combined with informative priors are of no practical value if the possible harm due to wrong prior specifications cannot be judged.

Conclusion
The current study is the first to investigate performance differences between classical and Bayesian estimation techniques for multilevel SEMs with ordered categorical indicators and to study the effect of inaccurate priors on estimation accuracy in these models. Focusing on situations with small sample sizes on the within level, as often encountered in multirater studies, the simulation study produced results suggesting that Bayesian estimation with diffuse priors does not outperform classical estimation techniques, neither for continuous nor for categorical indicator models. Prior information exhibited a great influence on estimation accuracy for models with ordered categorical indicators. Priors with inaccurate locations performed as well as, or in some cases even better than, diffuse priors if prior information was weak, but led to considerable bias when the informativeness of the prior was increased, even in large samples. A comparison of two software programs for Bayesian estimation and recommendations for minimum required sample sizes were provided.

Article information
Conflict of Interest Disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.
Ethical Principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

Funding: None.
Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.