The SEM Reliability Paradox in a Bayesian Framework

Abstract Within the frequentist structural equation modeling (SEM) framework, adjudicating model quality through measures of fit has been an active area of methodological research. Complicating this conversation is research revealing that a higher quality measurement portion of a SEM can result in poorer estimates of overall model fit than lower quality measurement models, given the same structural misspecifications. Through population analysis and Monte Carlo simulation, we extend the earlier research to recently developed Bayesian SEM measures of fit to evaluate whether these indices are susceptible to the same reliability paradox, in the context of using both uninformative and informative priors. Our results show that the reliability paradox occurs for RMSEA and, to some extent, gamma-hat and PPP (measures of absolute fit), but not CFI or TLI (measures of relative fit), across Bayesian (MCMC) and frequentist (maximum likelihood) SEM frameworks alike. Taken together, these findings indicate that the behavior of these newly adapted Bayesian fit indices maps closely to that of their frequentist analogs. Implications for their utility in identifying incorrectly specified models are discussed.


Introduction
Structural equation modeling (SEM) combines measurement models, which define latent variables in terms of their relations with observed variables, and structural models, which are a system of user-defined simultaneous equations relating the latent variables to each other in a path analysis-like approach, with some paths fixed to zero and others freely estimated as dictated by substantive theory. Prior to interpreting the estimated model coefficients from a SEM, an important preliminary step involves demonstrating that the hypothesized analytic model provides a reasonable approximation to the observed data (e.g., Bentler, 1990). Measures of overall model fit simultaneously evaluate both the quality of the measurement models and the reasonableness of the free and fixed structural relationships among the latent variables.
However, challenges arise in evaluating model fit because no two substantive applications of SEM are identical. These differences arise from variations, for example, in the number of latent and indicator variables (i.e., model size), the sample sizes on which they are estimated, and the measurement quality of the latent variables, all of which have been found to impact typically reported frequentist goodness-of-fit (FGFI) measures beyond the actual degree of model misspecification that may be present. In fact, research on typically reported (frequentist) measures of model fit has revealed that better measurement models (i.e., latent variables that have stronger influences on their indicators, as measured by higher factor loadings) can result in poorer model fit estimates than SEMs with worse measurement (i.e., lower factor loadings) when the same structural misspecification is held constant, a phenomenon known as the reliability paradox (Hancock & Mueller, 2011; Heene et al., 2011; McNeish et al., 2018; Miles & Shevlin, 2007).
We revisit the reliability paradox in the context of recently described measures of fit designed for use with Bayesian SEM (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018). Bayesian frameworks offer well-known opportunities and benefits over frequentist methods in SEM. These include the ability of researchers to incorporate prior information into their statistical models, along with estimates of (un)certainty around these priors (Kaplan & Depaoli, 2012), the capacity to analyze more complex models with smaller sample sizes, and models that will not fail to converge (Muthén & Asparouhov, 2012). In addition, credible intervals for fit indices provide a more straightforward and desirable interpretation than frequentist confidence intervals.
Even with these advantages, the evaluation of model fit in Bayesian analysis remains as much a part of the analytic process as it is in the frequentist framework to ensure substantive interpretations are not misleading. Our study specifically investigates whether the Bayesian forms of analogous frequentist fit measures demonstrate the same reliability paradox observed in prior frequentist studies using uninformative, less informative, and strongly informative priors. In the sections that follow, we briefly review the relevant frequentist measures of SEM fit that have been the focus of these studies, as well as the newly developed Bayesian goodness-of-fit measures. We then study their behavior using Monte Carlo simulation.

Measures of Fit
Applied researchers are faced with a formidable challenge in navigating the literature about recommendations for evaluating the fit of their SEMs, as the meaning of model fit and the ways in which it is estimated are operationalized in different ways across the multitude of fit indices available to users. For example, measures of model fit have been distinguished on the basis of (a) whether they are characterized as descriptive measures or test statistics (Yuan, 2005), (b) whether or not they account for sample size and/or model complexity (or parsimony), (c) if or what they include as a contrasting baseline model to which the hypothesized model is compared, (d) whether or not they are based on a noncentrality parameter, and (e) whether they are considered absolute or incremental. Moreover, any given measure of fit can typically be described as having several of these characteristics. Given the detrimental impact that higher quality measurement models have been found to have on estimates of frequentist measures of overall fit (χ², RMSEA, CFI, and TLI), we examine the behavior of their analogous Bayesian forms (BRMSEA, BCFI, and BTLI) along with the PPP. In doing so, we organize them with respect to a taxonomy that distinguishes them as absolute or incremental estimates of model fit.

Absolute Measures
In the frequentist framework, absolute fit indices provide a quantitative estimate of how well a theoretical model is able to reproduce the observed sample data. Hypothesized models that imply a covariance matrix that more closely matches the observed data covariance matrix can be viewed as better-fitting models. The likelihood ratio test (LRT, i.e., the chi-square test statistic, χ²), the root mean square error of approximation (RMSEA; Steiger & Lind, 1980), and gamma-hat (Steiger, 1989) are three such examples of absolute measures of fit.
In the case of maximum likelihood estimation, the LRT is based on the maximum likelihood (ML) fit function (F_ML) that reflects the similarity of the model-implied covariance matrix to the observed sample covariance matrix, weighted by N − 1 (Bollen, 1989):

LRT = (N − 1) F_ML.

When the model-implied covariance matrix matches the observed covariance matrix (i.e., H0 is true), the LRT follows a χ² distribution (assuming large samples and multivariate normality), with an expected value equal to the degrees of freedom associated with the hypothesized model (df_H) and a noncentrality parameter of zero. When H0 is false, the LRT is distributed as a noncentral χ² distribution, with a noncentrality parameter equal to the difference between the hypothesized model χ² and its degrees of freedom: χ²_H − df_H (see, for example, Garnier-Villarreal & Jorgensen, 2020). This noncentrality parameter is a composite of the measurement portion of the model and the structural path portion of the model (McDonald & Ho, 2002), and allows for a test of exact fit through the use of the χ² distribution with df = df_H when used as a measure of stand-alone fit. Because the LRT follows a χ² distribution, it has become more widely known as the model χ² test (Kline, 2011).
The problems with overreliance on this test statistic have been well-documented (Gerbing & Anderson, 1992; Yuan & Chan, 2016), leading some to conclude that the χ² measure "should never be used as the sole measure of model fit" (Lomax, 2019, p. 463). One reason is because SEM models require relatively large samples for estimation, and these large samples can result in over-powered tests rejecting reasonably well-fitting hypothesized models (Bentler & Bonett, 1980).
The RMSEA follows a noncentral χ² distribution, with a noncentrality parameter defined as the difference between the hypothesized model χ² and its associated degrees of freedom (i.e., χ²_H − df_H; Kline, 2011). This noncentrality parameter is weighted by the inverse of the product of model complexity (i.e., df_H) and sample size, where smaller values indicate better fitting models:

RMSEA = sqrt( max(χ²_H − df_H, 0) / (df_H (N − 1)) ).

Finally, gamma-hat provides another measure of absolute fit that is also based on the noncentrality parameter, but further takes into account the number of observed variables (p) in the model:

Γ̂ = p / (p + 2(χ²_H − df_H)/(N − 1)),

with an adjusted form (adjusted gamma-hat) that may be less susceptible to sample size bias:

Γ̂_adj = 1 − (p*/df_H)(1 − Γ̂),

where p = the number of observed variables in the model, and p* = the number of unique sample variances and covariances (i.e., p(p + 1)/2 for covariance structures). For both forms of gamma-hat, larger values (i.e., closer to 1) reflect better fitting models.
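As a concrete illustration, these absolute indices can be computed directly from a model's χ² value, degrees of freedom, sample size, and number of indicators. The following Python sketch is illustrative only (not the study's code); it uses the N − 1 scaling described above, whereas some software scales by N instead, and the numeric inputs are hypothetical:

```python
import math

def rmsea(chi2, df, n):
    # Noncentrality (chi2 - df), truncated at zero, scaled by df and N - 1
    ncp = max(chi2 - df, 0.0)
    return math.sqrt(ncp / (df * (n - 1)))

def gamma_hat(chi2, df, n, p):
    # Noncentrality-based absolute fit, adjusted for the number of observed variables p
    ncp = max(chi2 - df, 0.0)
    return p / (p + 2 * ncp / (n - 1))

def adj_gamma_hat(chi2, df, n, p):
    # Penalizes gamma-hat by the ratio of p* (unique (co)variances) to df
    p_star = p * (p + 1) / 2
    return 1 - (p_star / df) * (1 - gamma_hat(chi2, df, n, p))

# Hypothetical values: chi2 = 150 on df = 87, N = 1,000, p = 15 indicators
print(round(rmsea(150, 87, 1000), 4))          # 0.0269
print(round(gamma_hat(150, 87, 1000, 15), 4))  # 0.9917
```

Note that a perfectly fitting model (χ² ≤ df) yields an RMSEA of exactly zero because the noncentrality estimate is truncated at zero.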

Incremental Measures
Incremental measures of fit have been referred to as comparative fit (Miles & Shevlin, 2007), relative fit (McDonald & Ho, 2002), and approximate or descriptive fit (Kline, 2011); these indices evaluate model quality by comparing the hypothesized model to a baseline model. Although in most applications the baseline model is an independence model in which all variables are assumed to be uncorrelated, other baseline models can be used. The Comparative Fit Index (CFI; Bentler, 1990) and the Tucker-Lewis Index (TLI; Tucker & Lewis, 1973) are two widely used examples of incremental fit measures. CFI is expressed as:

CFI = 1 − max(χ²_H − df_H, 0) / max(χ²_B − df_B, χ²_H − df_H, 0),

where (χ²_H − df_H) is the noncentrality parameter of the hypothesized model and (χ²_B − df_B) is the noncentrality parameter of the baseline model. As a result, the CFI evaluates the improvement obtained in moving from the baseline model (e.g., in which all variables are assumed to be uncorrelated) to the hypothesized model, in which some of these constraints are relaxed (i.e., some paths are freely estimated).
The TLI (or the non-normed fit index; Bentler & Bonett, 1980) also evaluates the relative improvement of a hypothesized model over a baseline model, but does so by taking model complexity (i.e., df) into account by way of a penalty for more complex models:

TLI = (χ²_B/df_B − χ²_H/df_H) / (χ²_B/df_B − 1).

CFI and TLI values typically range from 0 to 1.0, with larger values taken to reflect better fitting models.
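The incremental indices can be sketched similarly; here χ²_B and df_B come from the baseline (independence) model. The Python sketch below is illustrative, not the study's code, and all numeric inputs are hypothetical:

```python
def cfi(chi2_h, df_h, chi2_b, df_b):
    # Improvement in noncentrality from baseline to hypothesized model, bounded in [0, 1]
    ncp_h = max(chi2_h - df_h, 0.0)
    ncp_max = max(chi2_b - df_b, chi2_h - df_h, 0.0)
    return 1.0 if ncp_max == 0 else 1 - ncp_h / ncp_max

def tli(chi2_h, df_h, chi2_b, df_b):
    # Complexity-penalized relative fit via chi2/df ratios; not truncated,
    # so values can exceed 1 or fall below 0
    return (chi2_b / df_b - chi2_h / df_h) / (chi2_b / df_b - 1)

# Hypothetical values: hypothesized chi2 = 150 (df = 87), baseline chi2 = 2000 (df = 105)
print(round(cfi(150, 87, 2000, 105), 4))  # 0.9668
print(round(tli(150, 87, 2000, 105), 4))  # 0.9599
```

The truncation in the CFI is what keeps it in [0, 1], whereas the TLI lacks this bound, one practical difference between the two indices.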

Bayesian Measures of Fit (New Friends)
Bayesian forms of the RMSEA, CFI, and TLI have recently been introduced (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018), allowing for an extension of their general principles of evaluating model quality to Bayesian contexts. The resulting values are on the same scale as their frequentist analogs, and they additionally offer posterior-distribution-based credible intervals to capture uncertainty in the point estimates (Garnier-Villarreal & Jorgensen, 2020). Before reviewing these new fit measures, it is important to note that posterior predictive model checking (PPMC; Gelman et al., 1996) has long been an integral part of Bayesian analysis. Specifically, PPMC involves evaluating the discrepancy between the observed data (D_obs) and that which is predicted by the model through replications (D_rep) across iterations (i). Here, D_obs_i at a given iteration represents the discrepancy between the observed data mean vector (m) and covariance matrix (S) on the one hand, and the model-implied mean vector (μ_i) and covariance matrix (Σ_i) on the other hand; D_rep_i at a given iteration (i) replaces m and S in D_obs_i with m_i and S_i, values obtained from replicated data of the same size as the sample data (Asparouhov & Muthén, 2021).
The posterior predictive p-value (PPP) is one index that has been shown to be useful as part of Bayesian PPMC; it measures the proportion of iterations in which D_rep_i equals or exceeds D_obs_i. Very well-fitting models produce PPP values near 0.50, and lower values are reflective of misspecification (Cain & Zhang, 2019). Some have suggested that PPPs < 0.05 are indicative of misspecified models (Asparouhov & Muthén, 2010), while others have shown that values in the range of 0.10 to 0.15 are reflective of misspecification, depending upon model design conditions (Cain & Zhang, 2019).
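Computed from posterior draws, the PPP is simply the fraction of iterations in which the replicated-data discrepancy is at least as large as the observed-data discrepancy. A minimal Python sketch with hypothetical discrepancy values:

```python
def ppp(d_obs, d_rep):
    # Proportion of iterations in which the replicated-data discrepancy
    # equals or exceeds the observed-data discrepancy
    assert len(d_obs) == len(d_rep) and len(d_obs) > 0
    return sum(r >= o for o, r in zip(d_obs, d_rep)) / len(d_obs)

# Hypothetical draws: a well-fitting model hovers near 0.50 ...
print(ppp([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]))  # 0.5
# ... while a misspecified model has D_obs systematically larger than D_rep
print(ppp([9.0, 8.0, 7.0, 6.0], [2.0, 1.0, 4.0, 3.0]))  # 0.0
```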
Although the PPP has shown some improvement over the frequentist χ² test in terms of not rejecting reasonably fitting models estimated on large samples, it is by no means immune to the problem of over-rejection in large-sample situations (Asparouhov & Muthén, 2010; Hoijtink & Van de Schoot, 2018).
Below we review these Bayesian forms of model fit following the work of Garnier-Villarreal & Jorgensen (2020) [1]. In their derivation of these measures, frequentist elements were replaced with what can be considered their Bayesian analogs. Namely, χ² values appearing in the ML-based formulas above were replaced with D_obs_i − pD, degrees of freedom were replaced with p* − pD, and the noncentrality parameter becomes D_obs_i − p*, where pD = the effective number of parameters. In a frequentist framework, model complexity refers to the number of free and fixed parameters as reflected in the model degrees of freedom. However, in a Bayesian framework, model complexity is a function of the number of estimated parameters, prior beliefs about those parameters, and the certainty of them (see, for example, Hoofs et al., 2018). Consequently, the effective number of parameters (pD; Spiegelhalter et al., 2002) incorporates these elements in characterizing model complexity in Bayesian models. The effective number of parameters based on the deviance information criterion (pD_DIC) is estimated from the posterior distribution as the amount by which the mean posterior deviance (D̄) exceeds the deviance evaluated at the posterior mean, D(θ̄) (Spiegelhalter et al., 2002; see also Garnier-Villarreal & Jorgensen, 2020, for a good discussion of the pD_DIC). Alternatives to the pD_DIC, such as the leave-one-out pD_LOO and the widely applicable information criterion pD_WAIC, can be found in Vehtari et al. (2017).
Through substitution of these values into the frequentist RMSEA formula, the Bayesian form becomes:

BRMSEA_i = sqrt( max(D_obs_i − p*, 0) / ((p* − pD)(N − 1)) ),

where a posterior distribution of BRMSEA is constructed across MCMC iterations (i).
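Putting the substitutions together, pD_DIC and the per-iteration BRMSEA can be sketched as below. This Python sketch is illustrative only: it applies the same N − 1 scaling as the frequentist formula above (published implementations may scale by N), and all numeric inputs are hypothetical:

```python
import math

def p_d_dic(deviance_draws, deviance_at_post_mean):
    # pD_DIC: mean posterior deviance minus deviance at the posterior mean
    return sum(deviance_draws) / len(deviance_draws) - deviance_at_post_mean

def brmsea(d_obs_i, p_star, p_d, n):
    # chi2 -> D_obs_i - pD and df -> p* - pD, so the noncentrality is D_obs_i - p*
    ncp = max(d_obs_i - p_star, 0.0)
    return math.sqrt(ncp / ((p_star - p_d) * (n - 1)))

# Hypothetical draws: evaluating at each iteration yields a posterior distribution
d_obs_draws = [183.0, 170.0, 195.0]
posterior = [brmsea(d, p_star=120, p_d=33, n=1000) for d in d_obs_draws]
print(round(posterior[0], 4))  # 0.0269
```

Summarizing the resulting posterior (e.g., its mean and an equal-tail interval) gives the point estimate and credible interval reported for the BRMSEA.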
Similarly, the Bayesian forms of gamma-hat (BΓ̂_i), adjusted gamma-hat (BΓ̂_adj,i), CFI, and TLI become:

BΓ̂_i = p / (p + 2(D_obs_H,i − p*)/(N − 1)),

BΓ̂_adj,i = 1 − (p*/(p* − pD))(1 − BΓ̂_i),

BCFI_i = 1 − max(D_obs_H,i − p*, 0) / max(D_obs_B,i − p*, D_obs_H,i − p*, 0), and

BTLI_i = [ (D_obs_B,i − pD_B)/(p* − pD_B) − (D_obs_H,i − pD_H)/(p* − pD_H) ] / [ (D_obs_B,i − pD_B)/(p* − pD_B) − 1 ],

where the subscript B refers to the baseline model. All other terms are as previously described. Similar to the BRMSEA, the BCFI and BTLI are calculated for each MCMC iteration to form posterior distributions of these estimates. From the resulting distributions, point estimates and credible intervals can be obtained. Recent simulations on these Bayesian model fit measures (with uninformative priors) have found them to "be reasonable approximations of fit indices under [maximum likelihood estimation]" (Garnier-Villarreal & Jorgensen, 2020, p. 70). In addition, these Bayesian estimates provide credible intervals that capture uncertainty around the point estimates in the form of an interval that contains the parameter of interest with a given probability (see Garnier-Villarreal and Jorgensen, 2020, for a more detailed discussion).

Characterizing "Good" Fit and the Reliability Paradox
Although descriptive measures of fit were never intended to be used to dichotomize models into distinctions of good or bad (Bentler & Bonett, 1980), the problems with the LRT's over-rejection of models with large samples (Miles & Shevlin, 2007), and similar behavior found with use of the PPP in Bayesian approaches (Asparouhov & Muthén, 2010; Hoijtink & Van de Schoot, 2018), have led to the application of fixed cut-off values to descriptive measures of fit. However, the delimiting point on these descriptive fit scales that can be taken to distinguish between "good," "reasonable," or "poor" fitting models has been the subject of considerable discussion and research in the frequentist methodological literature. While substantive researchers point to the simulations of Hu and Bentler (1999) when defending their models as indicating a relatively good fit on the basis of RMSEA values < 0.06 and CFI and TLI values > 0.95, methodological simulations have shown why these
recommendations do not generalize across all sample sizes, model complexities, and data conditions (e.g., Fan & Sivo, 2005, 2007; Heene et al., 2011; Saris et al., 2009). Further, recent Bayesian simulations of fit indices indicate that the Bayesian forms of these indices behave in ways similar to their frequentist counterparts (Garnier-Villarreal & Jorgensen, 2020).
Exacerbating the challenges in quantifying what constitutes a "good" fitting model is the situation in which, for a given structural model, higher quality measurement models adversely impact values of the aforementioned frequentist measures (i.e., the reliability paradox; Hancock & Mueller, 2011). This has been demonstrated across both population-level analyses (Hancock & Mueller, 2011; Miles & Shevlin, 2007) and Monte Carlo simulations (Heene et al., 2011; McNeish et al., 2018). These studies all shared several common design features. Namely, the measurement portions of the SEMs were correctly specified (i.e., the same number of free and fixed loadings), and the factor loadings linking the observed indicators to their latent influences (i.e., reliabilities) were manipulated across conditions. In addition, model fit indices for 'lower' versus 'higher' factor loadings were evaluated within the context of the same structural misspecification, although the exact nature of the structural misspecification varied across studies (e.g., an omitted factor covariance in the estimated model that was present in the population model, or an omitted set of direct paths among latent variables that were present in the population model).
Importantly, these earlier studies all used maximum likelihood (ML) estimation. As Hancock and Mueller (2011) point out, applied researchers would expect the overall fit of a SEM to be better for measurement models with higher factor loadings than for those with lower factor loadings, given the exact same structural misspecification, yet this has not been the case. Across all of the earlier studies, absolute measures of fit (LRT and RMSEA) resulted in values suggestive of poorer fitting models (larger LRT and RMSEA values) when the factor loadings were high, and better fitting models (smaller LRT and RMSEA values) when the factor loadings were low. To be clear, these studies have shown that for a given structural misspecification, small or large, these measures of model fit seem to suggest better-fitting models when the reliabilities of the measured variables are lower. Both Heene et al. (2011) and Miles and Shevlin (2007) attributed this behavior to the fact that larger factor loadings result in smaller observed-variable residuals. This in turn results in a more powerful test to reject the hypothesis of adequate model-data fit (Miles & Shevlin, 2007). In a similar way, when residual variances are larger (as a potential consequence of lower factor loadings), the eigenvalue decomposition is impacted in a way that reduces the value of χ² (Heene et al., 2011).
In terms of incremental measures of fit (CFI and TLI), results across prior studies have been mixed. For models with the same structural misspecification, Hancock and Mueller (2011) and McNeish et al. (2018) reported that the CFI and TLI improved when factor loadings were lower (worse measurement) compared to higher (better measurement). Miles and Shevlin (2007), on the other hand, found little change in CFI and TLI across loading levels. Lastly, Heene et al. (2011) found that the CFI and TLI improved as loadings increased, and Shi et al. (2019) found that they improved as a function of the number of indicators.

The Current Study
The current study extends the reliability paradox research on SEM fit indices from the frequentist framework to the more recently developed Bayesian SEM measures of fit. In doing so, we manipulate the strength and variety of the factor loadings (i.e., the measurement portion of the model) and the degree of structural relations misspecification. We also consider the behavior of these fit indices when different levels of informativeness for loading priors are used. Specifically, less informative priors can be useful in circumstances in which researchers do not have a meaningful basis from which to assign priors (Muthén & Asparouhov, 2012) and often result in estimates similar to those obtained with maximum likelihood estimation (Browne & Draper, 2006; Muthén, 2010). At the same time, using non-informative priors can result in sample-specific results (Van Erp et al., 2018); as such, we also considered two informative prior specifications in which the measurement portion of the model was informed by population values. These latter conditions allowed for a more targeted evaluation of the impact of structural misspecification in the context of correctly specified and informed measurement models.

Population Model
Data were generated using Mplus 8 (Muthén & Muthén, 1998-2017) via the MplusAutomation package in R (Hallquist & Wiley, 2018). Our design conditions were aligned with previous frequentist simulation studies focused on the impact of measurement quality on overall goodness of fit in a SEM framework (i.e., Heene et al., 2011; McNeish et al., 2018), which were themselves based on conditions used in Hu and Bentler (1999). These commonalities included specification of a 3-factor, 15-indicator population model with factor correlations of φ1,2 = 0.5, φ1,3 = 0.4, and φ2,3 = 0.3 (see Figure 1). However, because reliability is a function of both the factor loadings and the number of indicators (Meade et al., 2008), we also added a less reliable population condition with nine indicators (three indicators per factor). For each of the two types of indicator models, we specified nine factor-loading conditions, resulting in a total of 18 population models. The loading conditions can be characterized as having tau-equivalent and uniform (TE-U) factors, tau-equivalent and mixed (TE-M) factors, and congeneric and random (Cong) factors, each of which was varied to include relatively low, medium, and high loading magnitudes to reflect low-, medium-, and high-quality measurement (Table 1).
In the TE-U factors model, standardized factor loadings were fixed at 0.4, 0.6, and 0.8 for the low, middle, and high variations, respectively. In the TE-M factors condition, loadings were the same for a given factor but, as shown in Table 1, were mixed across factors to average 0.4, 0.6, and 0.8. In the Cong condition, factor loadings were randomly sampled from a uniform continuous distribution to assign a loading to each indicator: low λ ~ U(0.3, 0.5), medium λ ~ U(0.5, 0.7), and high λ ~ U(0.7, 0.9). These latter values, shown in Table 1, were borrowed from Heene et al. (2011). Given the factor loadings, we computed the population scale reliability for each factor, for each model, using McDonald's (1970) omega (ω); thereafter, we computed the mean omega across the three factors; those values are shown in Table 1. Across all models combined, the average scale reliability (at the population level) is 0.42, 0.67, and 0.86 for the low, medium, and high measurement quality conditions, respectively. For each of the 18 population models, 1,000 datasets were generated at each of four sample sizes: N = 150, 250, 500, and 1,000.
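Population omega values of this kind can be reproduced from the loadings alone. For standardized indicators with uncorrelated residuals, McDonald's omega reduces to the closed form sketched below (illustrative Python, not the study's code):

```python
def omega(loadings):
    # McDonald's omega for standardized indicators with uncorrelated residuals:
    # (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances),
    # where each residual variance is 1 - loading^2
    s = sum(loadings)
    resid = sum(1 - l ** 2 for l in loadings)
    return s ** 2 / (s ** 2 + resid)

# A TE-U factor with five indicators loading 0.8 (high-quality measurement)
print(round(omega([0.8] * 5), 3))  # 0.899
# The same loading with only three indicators yields lower reliability,
# illustrating why the 9-indicator condition is less reliable than the 15-indicator one
print(round(omega([0.8] * 3), 3))  # 0.842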
For each of the replications across the 72 conditions, we analyzed the data using three SEM model specification types (correctly specified: CS, and two forms of incorrect specifications: IS1-IS2) crossed with four SEM analysis approaches: the frequentist approach with full maximum likelihood (ML) and the Bayesian MCMC approach with three levels of prior informativeness for the loadings (described in detail next), for a total of 12 analyses per dataset.In sum, there were 72 conditions x 12 analyses ¼ 864 cell conditions.All models were analyzed using Mplus; in addition, we also analyzed all datasets for the N ¼ 1,000 sample size condition using R blavaan (Merkle et al., 2021) to obtain gamma-hat results 2 .In addition to sample-based results, we present population-level ML-based goodness-of-fit values computed using the model-implied covariance matrix for the N ¼ 1,000 sample size as another reference point.
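The cell count follows directly from crossing the design factors; a quick Python enumeration (the labels are ours, for illustration) confirms the 72 data-generation conditions and 864 cells:

```python
from itertools import product

indicator_models = [9, 15]                      # total indicators across three factors
factor_structures = ["TE-U", "TE-M", "Cong"]
loading_levels = ["low", "medium", "high"]
sample_sizes = [150, 250, 500, 1000]
specifications = ["CS", "IS1", "IS2"]
approaches = ["ML", "Bayes-NonInf", "Bayes-LsInf", "Bayes-StrInf"]

conditions = list(product(indicator_models, factor_structures, loading_levels, sample_sizes))
cells = list(product(conditions, specifications, approaches))
print(len(conditions), len(cells))  # 72 864
```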

Model Specification Details
In all analyses, the measurement portions of the models were correctly specified and all factor variances were fixed to 1 (i.e., all factor loadings were estimable). In the correctly specified (CS) model, the three structural components of the model (latent variable correlations) were freely estimated. In the first incorrect specification (IS1), the association between the first and second factors, φ1,2 = 0.5, was fixed to zero; in the second incorrect specification (IS2), all three factor associations were fixed to zero. For frequentist analyses, model parameters were estimated using maximum likelihood. For Bayesian analyses, model parameters were estimated in Mplus using the Gibbs sampler algorithm (Geman & Geman, 1984) with two chains and a maximum of 50,000 iterations, with the first half of iterations discarded as burn-in [3]; in the R blavaan package, model parameters were estimated using Stan, with 500 burn-in samples and three chains of 1,000 samples each [4].

Note 2: Other R blavaan goodness-of-fit results, which showed the same results pattern as those from Mplus, are provided in the supplemental materials.

Bayesian Priors
Our three levels of informativeness included: non-informative loading priors of N(0, 10^10), weakly informative loading priors of N(0.60, 0.04), and strongly informative loading priors of N(true value, 0.01) [5]. The weakly informative level relates to the idea that the manifest variables being used may be believed to be modestly good indicators of the factors, with a range of 0.60 ± 0.20 SDs. The strongly informative level captures a situation where the full data set may have been randomly divided, and model estimates from one sub-sample are used to inform priors for the other sub-sample used in the final analysis, with a range of the known value ± 0.10 SDs. On a practical level, because variances and covariances are more difficult to assign priors to, we used non-informative priors for the remaining model parameters. Specifically, once we specified the factors to be standardized, the factor-factor covariance priors are correlation priors, and these were set to U(−1, 1) [6]. Last, the indicator residual variance priors were set to IG(3, 1) [7].
Average fit index values for the Bayesian PPP, as well as the frequentist and Bayesian forms of the RMSEA, CFI, and TLI, were calculated across replications for each design condition. Our focus on the latter three Bayesian goodness-of-fit (BGFI) measures is because their frequentist analogs have been the focus of past research on the reliability paradox. Bayesian descriptive measures of fit were calculated based on the pD_DIC, with credible intervals calculated from the posterior distribution using the equal-tail percentile approach.
Additionally, because Bayesian forms of gamma-hat have performed well in recent simulations and the pD_LOO has been recommended for BGFI calculations (Garnier-Villarreal & Jorgensen, 2020), and because these are not currently available in Mplus, we also used blavaan (Merkle et al., 2021) in R to analyze the same datasets as those analyzed with Mplus for the N = 1,000 sample size (across all other conditions). We present the gamma-hat results below and provide the other BGFI results estimated with blavaan in the supplemental information. For all blavaan results, estimates were based on the pD_LOO, and credible intervals were calculated using the highest posterior density (HPD) method.

Note 3: Mplus invokes an automatic convergence criterion based on the potential scale reduction (PSR; Brooks & Gelman, 1998), monitored at every 100th iteration (Asparouhov & Muthén, 2010), such that all model parameters must reach PSR values < 1.1 for iterations to stop. We kept this setting in the current study.

Note 4: We added code to blavaan's model-fitting process such that the burn-in was iteratively increased by 1,000 samples until all model parameters reached PSRs ≤ 1.05, similar to Garnier-Villarreal & Jorgensen (2020).

Note 5: The default non-informative priors for loadings in R blavaan (Merkle et al., 2021) are based on Stan defaults, unless otherwise specified, and are N(0, 10) [sd]. Mplus' default non-informative priors for loadings are wider: N(0, 10^10).

Note 6: In all our Bayesian analyses, because factors were specified as standardized, their covariances were treated as correlations with non-informative priors. B(1, 1) is the recommended non-informative prior for correlations in blavaan, which translates to U(−1, 1) and is similar to Mplus' default factor correlation matrix prior of IW(0, −p − 1) (Asparouhov & Muthén, 2021). We note that we could not use the IW as a prior for our simulations because the IS1 and IS2 analyses place constraints on at least one of the correlations in the psi matrix.

Note 7: In all our Bayesian analyses, we used a non-informative prior of IG(3, 1) for indicator residual variances, to closely approximate blavaan's default non-informative residual variance prior G(1, .5). The IG and G distributions are positively skewed and cannot be negative. The IG prior we used in Mplus specifies a distribution with a mode of 0.25, a mean of 0.50, and a standard deviation of 0.50. The G prior in blavaan specifies a distribution with a mode of 0, a mean of 0.50, and a standard deviation of 0.50.
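The moments quoted above for the IG(3, 1) residual-variance prior can be checked analytically from the inverse-gamma's closed-form mode, mean, and variance (an illustrative Python sketch):

```python
import math

def inv_gamma_moments(alpha, beta):
    # Inverse-gamma(shape=alpha, scale=beta): mode = beta/(alpha+1),
    # mean = beta/(alpha-1) for alpha > 1, var = beta^2/((alpha-1)^2 (alpha-2)) for alpha > 2
    mode = beta / (alpha + 1)
    mean = beta / (alpha - 1) if alpha > 1 else math.inf
    sd = math.sqrt(beta ** 2 / ((alpha - 1) ** 2 * (alpha - 2))) if alpha > 2 else math.inf
    return mode, mean, sd

print(inv_gamma_moments(3, 1))  # (0.25, 0.5, 0.5)
```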

Results
Bayesian goodness-of-fit (BGFI) results are presented in tables across the different design conditions in the interest of transparency; for ease of illustration, tabled values are also represented graphically in figures. For brevity, and to be consistent with prior research, we focus most of our discussion on results for the N = 1,000 sample size condition; further, sample size had a relatively minor impact on fit index point estimates, with the exception of the PPP. This said, we briefly present sample size results as well.

Convergence Rates
Across simulation conditions, for correctly specified models, frequentist structural equation model (SEM) analyses (using maximum likelihood, ML) had a convergence rate of 99.27%, with convergence problems isolated to the smaller sample size conditions of N ≤ 500. For incorrectly specified models, the convergence rates were 99.20% and 99.28% for the first and second misspecification types, respectively (i.e., one factor-factor relation set to zero vs. all three factor-factor relations set to zero). In the Bayesian models, convergence completed with all model parameters reaching PSR values < 1.1 in Mplus, and PSR values < 1.05 in the supplemental blavaan models.
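For reference, the PSR criterion compares between-chain and within-chain variance. The Python sketch below implements the original (non-split) Gelman-Rubin form on toy data; the Mplus implementation differs in details (e.g., monitoring every 100th iteration), so this is illustrative only:

```python
def psr(chains):
    # Potential scale reduction (Gelman-Rubin R-hat) for a list of equal-length chains
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain variance
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # mean within-chain variance
    v_hat = (n - 1) / n * w + b / n
    return (v_hat / w) ** 0.5

# Toy chains sampling the same region mix well (PSR near 1) ...
print(round(psr([[0, 1, 2, 3], [3, 2, 1, 0]]), 3))
# ... while chains stuck in different regions do not (PSR far above 1.1)
print(round(psr([[0, 1, 0, 1], [10, 11, 10, 11]]), 2))
```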

Absolute Fit Indices
Mean results for the root mean square error of approximation (RMSEA/BRMSEA) and the Bayesian posterior predictive p-value (PPP) by condition, model, and analytic approach (including population-level analysis for the ML-based RMSEA) are reported in Tables 2 and 3, respectively; these results are plotted in Figures 2-4. Gamma-hat and adjusted gamma-hat results are shown in Figures 5 and 6.
With respect to the RMSEA/BRMSEA results, several key findings can readily be observed. First, all three analytic approaches (ML, Bayesian MCMC, and population-level analyses) produced nearly identical results across conditions. Second, fit values were similar across the three factor structure types (tau-equivalent with uniform loadings, TE-U; tau-equivalent with mixed loadings, TE-M; and congeneric, Cong). Third, measurement quality (i.e., reliability conditions) did not affect fit indices for correctly specified models (all produced very low RMSEAs/BRMSEAs). Fourth, similar to results reported previously (Fan & Sivo, 2007; Garnier-Villarreal & Jorgensen, 2020), model size plays a role in the behavior of RMSEA and BRMSEA values. Here, values were generally lower (better) for the 15-indicator model than for the 9-indicator model when measurement quality was moderate to high. Fifth, and importantly, the BRMSEA behaved nearly identically to the ML and population-analysis RMSEAs: better measurement models yielded worse BRMSEAs than worse measurement models, for the same misspecification. In other words, the BRMSEA is also prone to the "reliability paradox." Last, Figure 3 shows 90% credible intervals for the three Bayesian levels of loading prior informativeness. As average BRMSEA values increased for models with better measurement quality, the credible intervals simultaneously became narrower.

Table note: Values represent means across N = 1,000 replications for the N = 1,000 sample size condition. Num Ind: total number of indicators across three factors. Pop: population-based computation. ML: maximum likelihood estimation. Bayesian: Bayesian MCMC estimation. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) were set to U(−1, 1) and indicator residual variances to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Avg Scale Reliability: mean of each factor's scale/composite reliability (omega; McDonald, 1970) given the population loading and residual error variance values. The grand mean across indicator and factor structure conditions is 0.42, 0.67, and 0.86 for the low, medium, and high reliability levels, respectively.
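The scale reliability (omega) values in the table note can be reproduced directly from population loadings. The following is a minimal Python sketch of McDonald's (1970) omega for standardized indicators; the loading vectors are illustrative, not the exact simulation values:

```python
def omega(loadings):
    """Composite reliability (McDonald's omega) for standardized indicators:
    omega = (sum lambda)^2 / ((sum lambda)^2 + sum theta),
    where residual variance theta_i = 1 - lambda_i^2 under standardization."""
    s = sum(loadings)
    resid = sum(1 - lam ** 2 for lam in loadings)
    return s ** 2 / (s ** 2 + resid)

# Tau-equivalent factors with 3 indicators each (illustrative loadings)
print(round(omega([0.4, 0.4, 0.4]), 3))  # low loadings -> low reliability
print(round(omega([0.8, 0.8, 0.8]), 3))  # high loadings -> high reliability

# All else equal, more indicators raise omega at the same loading magnitude
print(round(omega([0.8] * 5), 3))
```

This also makes the design logic concrete: both larger loadings and more indicators per factor push omega upward, which is why the 9- versus 15-indicator conditions matter alongside the loading levels.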
The Bayesian PPP results, reported in Table 3 and illustrated in Figure 4 for the N = 1,000 sample size condition, showed several patterns. First, for correctly specified models, values were stable across loading levels, ranging within approximately 10% of the 50% mark (i.e., indicating good fit). Second, PPP patterns were nearly identical across the three types of factor structures (TE-U, TE-M, and Cong), and were also similar across levels of loading prior informativeness. However, in contrast to the RMSEA/BRMSEA results, the PPP had similarly low values across reliability levels for misspecified models (near 0%), indicating it to be an excellent detector of model misspecification irrespective of measurement quality. This said, we observed a small paradoxical pattern for the 9-indicator model and, as we show in the forthcoming sample size results, a paradoxical pattern for smaller samples: for fewer indicators and/or lower sample sizes, better measurement model quality was associated with lower PPP (worse fit).
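The general logic of the PPP can be sketched outside the SEM setting: for each posterior draw, data are replicated from the model at that draw, and a discrepancy statistic for the replicated data is compared with the same statistic for the observed data; the PPP is the proportion of draws in which the replicated discrepancy exceeds the observed one. The toy sketch below uses a simple sum-of-squares discrepancy for a normal mean, not the likelihood-based discrepancy used in Bayesian SEM, purely to illustrate why a well-specified model yields a PPP near 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: observed data from a standard normal, standardized exactly
n = 200
y_obs = rng.normal(size=n)
y_obs = (y_obs - y_obs.mean()) / y_obs.std()

# Approximate posterior draws for the mean (known unit variance)
post_mu = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=4000)

def discrepancy(y, mu):
    # Sum-of-squares discrepancy between data and the model mean
    return np.sum((y - mu) ** 2)

exceed = 0
for mu in post_mu:
    y_rep = rng.normal(loc=mu, scale=1.0, size=n)        # replicate from the draw
    if discrepancy(y_rep, mu) > discrepancy(y_obs, mu):  # replicated vs. observed
        exceed += 1
ppp = exceed / len(post_mu)
# For this correctly specified toy model, ppp should land near 0.5;
# gross misspecification drives ppp toward 0.
```

Under misspecification the observed discrepancy systematically exceeds the replicated ones, pushing the proportion toward zero, which mirrors the near-0% values we report for the misspecified SEMs.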
Figures 5 and 6 display Bayesian gamma-hat and adjusted gamma-hat results from blavaan analyses of the N = 1,000 sample size datasets (other blavaan BGFI results are in the supplemental file). For incorrectly specified models, these indices also decreased (reflecting worse fit) as measurement quality increased. This effect was more pronounced for the adjusted form. Like the PPP and RMSEA, these indices were generally better at flagging problematic structural misspecifications for models with greater measurement quality.

Relative Fit Indices
Mean results for the CFI/BCFI and TLI/BTLI by condition, model, and analytic method are reported in Tables 4 and 5 and Figures 7-10, respectively. Across the results for both indices, we observed several key themes. First, and most noteworthy, better measurement quality (i.e., greater reliability) resulted in better average relative model fit values for misspecified models; this, of course, is in contrast with the absolute fit findings. Here, when measurement quality was high, values of both measures were near 0.95 in the presence of both misspecifications (i.e., IS1 and IS2) with the 15-indicator model, and near 0.90 with the 9-indicator model. The remaining findings, on the other hand, are similar to what was found for the absolute fit indices: namely, (1) all three analytic approaches yielded nearly identical values; (2) there was little difference in relative model fit patterns among the three kinds of factor structures (TE-U, TE-M, and Cong), although fit was slightly better for the TE-M factor structure than for the other two; (3) for correctly specified models, reliability levels had little impact on fit values (all produced very high CFI/BCFI and TLI/BTLI values); and (4) model size appeared to play a role. In the most severe misspecification (i.e., IS2), CFI/BCFI and TLI/BTLI values were generally larger for the 15-indicator model than the 9-indicator model across measurement quality conditions. However, like Fan and Sivo (2007) and Garnier-Villarreal and Jorgensen (2020), we found this was only the case for the TLI/BTLI in the less severely misspecified model (IS1), where CFI/BCFI values were much less affected by model type. Last but not least, Figures 8 (BCFI) and 10 (BTLI) illustrate the 90% credible intervals for each of the three levels of prior informativeness. As was the case with the BRMSEA, credible intervals for the BCFI and BTLI became narrower as measurement quality increased.
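The contrast with the absolute fit indices follows from the standard CFI and TLI formulas: stronger loadings inflate the null (independence) model's misfit faster than a fixed structural misspecification inflates the target model's misfit, so the ratio improves. A sketch of the frequentist formulas with hypothetical chi-square values (illustrative numbers, not taken from our tables):

```python
def cfi(chi_m, df_m, chi_b, df_b):
    # CFI = 1 - max(chi_m - df_m, 0) / max(chi_b - df_b, chi_m - df_m, 0)
    num = max(chi_m - df_m, 0.0)
    den = max(chi_b - df_b, chi_m - df_m, 0.0)
    return 1.0 - num / den

def tli(chi_m, df_m, chi_b, df_b):
    # TLI = ((chi_b/df_b) - (chi_m/df_m)) / ((chi_b/df_b) - 1)
    return ((chi_b / df_b) - (chi_m / df_m)) / ((chi_b / df_b) - 1.0)

# Same structural misspecification, hypothetical chi-square values:
# low loadings -> modest target-model misfit, but also a weak null model
print(round(cfi(chi_m=80, df_m=24, chi_b=400, df_b=36), 3))
# high loadings -> larger target-model misfit, but a far worse null model
print(round(cfi(chi_m=200, df_m=24, chi_b=5000, df_b=36), 3))
```

The second call returns the larger CFI even though the target model's chi-square is higher, because the baseline misfit grows disproportionately; this is the mechanism behind the relative indices' apparent immunity to the reliability paradox.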

Sample Size
Although we focused on the N = 1,000 sample size condition in the results above (again, for both brevity and consistency with previous work on the reliability paradox), we now draw attention to the influence of sample size on the Bayesian fit indices. As illustrated in Figures 11 and 12, collapsed across factor structure (TE-U, TE-M, and Cong) and model size (9 and 15 indicators), the pattern of mean fit index findings described above was relatively impervious to sample size. The exception was that, for the PPP, the smaller sample size conditions (i.e., N = 150 and 250) yielded better fit for misspecified models when the model had lower measurement quality compared to models with better reliability (i.e., the reliability paradox). Otherwise, mean patterns were consistent across sample sizes, and, as would be expected, larger samples had narrower credible intervals.

Credible Intervals
Credible intervals are a valuable tool in Bayesian analysis for quantifying the uncertainty in parameters (Kaplan & Depaoli, 2012). Less is known about their usefulness with descriptive measures of fit for adjudicating between correctly and incorrectly specified models. Our results show that, across all investigated BGFIs, larger credible intervals occurred in combination with poorer measurement quality (increased measurement error). Further, this effect was invariant across both minor (IS1) and major (IS2) structural misspecifications. That is, the width of the credible intervals for a given factor reliability was largely the same across varying degrees of structural misspecification, and did not differentiate between correctly and incorrectly specified models, nor between the severities of the structural misspecifications.

Prior Informativeness
Three levels of loading prior informativeness were examined that might arguably be described as non-informative, N(0, 10^10); less informative, N(0.60, 0.04); and strongly informative, N(true value, 0.01). Average BGFI values, along with their credible intervals, are illustrated in Figures 3, 5, 6, 8, 10, and 12. Across all indices, mean values and their respective credible intervals are nearly perfectly overlapping, reflecting that the precision of measurement loading priors had negligible to no effect on these measures of fit. Because these figures illustrate results for the N = 1,000 sample size condition, and because the likelihood will overwhelm the priors in larger samples (e.g., Gelman et al., 2014, p. 32; Kruschke, 2015, pp. 112-113), Figures 11 and 12 contrast average BGFI values, along with their credible intervals, across different sample sizes. Here again, these estimates showed very little variability across the different levels of informativeness and sample size conditions, consistent with others' finding that "misspecified priors do not appear to translate to increased precision for fit indices" (Garnier-Villarreal & Jorgensen, 2020, p. 64).
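The "likelihood overwhelms the prior" point can be made concrete with a conjugate normal-mean example (known variance, a deliberate simplification of the SEM setting): at N = 1,000, posteriors under a diffuse prior comparable to N(0, 10^10) and a fairly tight prior comparable to N(0.6, 0.04) are nearly indistinguishable. The sample mean and variance below are illustrative assumptions:

```python
def posterior(mu0, var0, ybar, sigma2, n):
    """Conjugate posterior for a normal mean with known data variance sigma2,
    given prior N(mu0, var0) and a sample of size n with mean ybar."""
    prec = 1.0 / var0 + n / sigma2            # posterior precision
    mean = (mu0 / var0 + n * ybar / sigma2) / prec
    return mean, 1.0 / prec                   # posterior mean and variance

ybar, sigma2, n = 0.75, 1.0, 1000             # hypothetical sample summaries
m_diffuse, v_diffuse = posterior(0.0, 1e10, ybar, sigma2, n)  # ~ N(0, 10^10)
m_tight, v_tight = posterior(0.6, 0.04, ybar, sigma2, n)      # ~ N(0.6, 0.04)
print(m_diffuse, m_tight)  # both sit close to the sample mean of 0.75
```

With n = 1,000 observations, the data contribute roughly 40 times the precision of even the tight prior here, so the two posterior means differ by well under 0.01, consistent with the negligible prior effects we observed on the fit indices.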

Discussion
The current investigation of new Bayesian model fit indices was motivated by earlier research on frequentist measures of structural equation model (SEM) fit, which were found to produce values suggestive of poorer-fitting models, for a given structural misspecification, when the measurement of the latent variables is better (i.e., the "reliability paradox"; Hancock & Mueller, 2011). Given our focus, we purposefully varied design facets related to the quality of the measurement portion of the SEMs. The first, and perhaps most salient, of these was the magnitude of the indicator-factor relationships within and across factors. Here, standardized factor loadings were designed to yield low-, medium-, and high-reliability conditions. In addition, loading levels were crossed with three factor structure conditions in order to investigate the possibility that variance in Bayesian goodness-of-fit indices (BGFIs) across factor loading magnitudes might be further influenced by tau-equivalent (TE-U and TE-M) or congeneric (Cong) factor specifications. This aspect of our simulation design was incorporated because CFI and TLI values have been found to be moderated by factor loading magnitudes within the frequentist SEM framework. Namely, across increasing factor loading values, some research has found that TLI and CFI values decrease (Hancock & Mueller, 2011; McNeish et al., 2018), while others have reported that the values remain relatively unchanged (Miles & Shevlin, 2007) or increase (Heene et al., 2011). A notable difference among these studies was that the former three investigations used tau-equivalent factors, whereas the latter study employed congeneric factors. Finally, because, all things being equal, more indicators result in more reliable measurement models (Meade & Bauer, 2007), we included both 9- and 15-indicator models.

(Footnote: The default non-informative prior for loadings in R blavaan (Merkle et al., 2021) is based on Stan defaults unless otherwise specified and is N(0, 10) [SD]; Mplus's default non-informative priors for loadings are similar but wider: N(0, 10^10).)
We specifically focused most of our attention on the Bayesian forms of the RMSEA, CFI, and TLI because: (1) they are new additions to the Bayesian toolbox (Garnier-Villarreal & Jorgensen, 2020); (2) they have been highlighted in past research on the reliability paradox within a frequentist framework (Hancock & Mueller, 2011; Heene et al., 2011; McNeish et al., 2018; Miles & Shevlin, 2007); and (3) they remain among the most frequently reported measures of fit in frequentist applications of SEM (Jackson et al., 2009) and are therefore likely to rise in use within Bayesian modeling applications. Our results show that the BRMSEA behaves similarly to the RMSEA in relation to the reliability paradox (Hancock & Mueller, 2011; Heene et al., 2011; McNeish et al., 2018; Shi et al., 2019). Namely, for a given structural misspecification, values of the BRMSEA increased (demonstrating worse fit) as the magnitude of the factor loadings increased. This pattern was consistent across factor structure conditions (i.e., tau-equivalent and congeneric) and sample sizes. Moreover, the level of prior informativeness for factor loadings did not make an appreciable difference in BRMSEA means or credible intervals. Although this relationship has been described as a paradox, it might also be what one would expect, given that larger factor loadings result in smaller observed variable residuals and more powerful tests to reject the hypothesis of adequate model fit (Miles & Shevlin, 2007).
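This power interpretation can be expressed through the standard population relation RMSEA = sqrt(F0/df), where F0 is the population minimum of the discrepancy function: larger loadings shrink residual variances, so the same omitted structural covariance produces a larger F0 and hence a larger RMSEA. The F0 and df values below are hypothetical, chosen only to show how the same misspecification can straddle the conventional cutoff:

```python
import math

def rmsea_pop(f0, df):
    # Population RMSEA from the minimized discrepancy F0 and model df
    return math.sqrt(f0 / df)

df = 25  # hypothetical model degrees of freedom
# Same omitted factor covariance; stronger loadings imply a larger F0
print(round(rmsea_pop(0.02, df), 3))  # low loadings: looks acceptable (< 0.05)
print(round(rmsea_pop(0.15, df), 3))  # high loadings: clearly flagged (> 0.05)
```

Nothing about the misspecification itself changes between the two calls; only the measurement quality (through F0) does, which is the essence of the paradox.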
By contrast, increasing the number of indicators in the measurement models was found to mitigate the increase in BRMSEA values across higher loadings. That is, like findings in frequentist SEM with respect to the RMSEA (Kenny & McCoach, 2003), model size appeared to play a role in the BRMSEA reliability paradox, with values generally smaller in the 15-indicator condition than in the 9-indicator condition, across both types of structural misspecifications. Still, researchers invoking the RMSEA < 0.06 threshold for BRMSEA values as an indication of acceptable fit (often attributed to Hu and Bentler, 1999, in frequentist applications) would likely assume their models fit well even in the most extreme misspecification (i.e., with all three structural relationships incorrectly fixed to zero) when factor loadings are low (i.e., near 0.40). It is worth noting that although the pattern of smaller BRMSEA values for a given structural misspecification when factor loadings are lower has been characterized as a paradox, it may be better understood as a consequence of the reduced statistical power that accompanies weaker measurement.
The PPP, an absolute measure of fit available only for Bayesian analyses, was a much better barometer of structural misspecification across conditions. Although it reflected a decline in fit for misspecified models as measurement quality increased, this effect was limited to models with fewer indicators and instances in which N < 500. For models estimated on samples of N > 500, the reliability paradox was not evident.
Although gamma-hat, another absolute fit index for Bayesian analyses in which values closer to 1 are optimal (Garnier-Villarreal & Jorgensen, 2020), is not currently available in Mplus and has not been studied for cutoff criteria like those of the RMSEA, results from our analyses of the N = 1,000 dataset conditions using blavaan in R showed that gamma-hat and adjusted gamma-hat values both decreased (reflecting worse fit) as measurement quality increased. This effect was more pronounced for adjusted gamma-hat than for gamma-hat (see again Figures 5 and 6). This said, these BGFI values were still quite high (many above 0.95) for incorrectly specified models, particularly in models with more indicators.

Incremental Measures of Fit
The two Bayesian incremental fit measures investigated (i.e., the BCFI and BTLI) behaved similarly to their frequentist forms (i.e., the CFI and TLI) when examined across varying measurement conditions (e.g., Heene et al., 2011; Shi et al., 2019). Namely, they appeared somewhat impervious to the measurement paradox, and in fact revealed a pattern of increasing values (associated with better-fitting models) with increases in factor loadings, regardless of whether the measurement model factors were specified as tau-equivalent (uniform or mixed) or congeneric. In the majority of conditions, average values fell below the thresholds that continue to be applied in the frequentist literature as demonstrating poor fit (i.e., CFI and TLI < 0.95) when the model was incorrectly specified (i.e., IS1 and IS2). However, as the magnitude of the factor loadings increased to 0.80, both average BCFI and BTLI values were near 0.95 for both incorrectly specified model conditions. Thus, a BCFI or BTLI value of 0.95 can occur when there are in fact structural misspecifications but the factors are defined by indicators with high loadings.

Model Size
We also replicate and extend findings with the frequentist (Fan & Sivo, 2007) and Bayesian (Garnier-Villarreal & Jorgensen, 2020) forms of the TLI in relation to its being influenced by model size. As our results demonstrate, this occurs as a function of measurement quality. When the number of indicators increased from 9 to 15, BTLI values exceeded 0.95 when factor loadings were high (i.e., 0.80), even when the structural misspecification was severe (i.e., IS2: fixing all three structural parameters to zero). This outcome was invariant to the type of factors in the measurement model (i.e., TE-U, TE-M, or Cong). Prior work has also shown that the CFI/BCFI is relatively invariant to model size (Fan & Sivo, 2007; Garnier-Villarreal & Jorgensen, 2020). Our results demonstrate similar outcomes for the moderately misspecified structural model (i.e., IS1; see Figure 7), but show that model size does play a role in the more severely misspecified structural model (i.e., IS2), where more indicators (a design feature that can serve to increase factor reliability, all things being equal) resulted in higher CFI/BCFI values. Overall, we see that high-quality measurement models (more indicators with average loadings of 0.80) may inflate values of the BCFI and BTLI (as well as the frequentist versions of these GFIs) to the point that many substantive researchers would assume their models fit well when in truth there could be severe structural misspecifications.

Prior Informativeness
Average BRMSEA (Figure 3), gamma-hat (Figures 5 and 6), BCFI (Figure 8), and BTLI (Figure 10) values, as well as their credible intervals, were nearly identical across the three investigated loading prior informativeness levels. With the exception of the small sample size condition, the average PPP was also fairly invariant to the loading prior condition. This said, the use of more informative priors was found to result in slightly better BTLI values than those obtained with non-informative priors in the poorer measurement model conditions (i.e., factor loadings of 0.40 with a 9-indicator model). In short, the level of informativeness did not have an appreciable impact on fit index behavior.

Implications for Applications of SEM
The BRMSEA, gamma-hat, BCFI, and BTLI were all found to be sensitive to measurement quality, which of course has different implications for the different types of fit measures. In the case of the BRMSEA, incorrectly specified models should intuitively result in higher values (poorer fit); however, average BRMSEA values only exceeded 0.05 (the common flag for a model fit problem) for structurally incorrect models when factor loadings were moderate or better in the 9-indicator condition, and when factor loadings were high in the 15-indicator condition. Consequently, the usefulness of the BRMSEA in helping to adjudicate between structurally correct and incorrect models appears limited to situations in which the average factor loadings are 0.60 or greater in less complex models (i.e., 3 indicators per factor) and 0.80 or greater in more complex models (i.e., 5 indicators per factor). Gamma-hat and adjusted gamma-hat behaved similarly to the BRMSEA. Namely, lower values (poorer fit) were most evident for structurally incorrect models when factor loadings were high (in the case of gamma-hat) or moderate to better (in the case of adjusted gamma-hat). Taking these results together, we encourage analysts to use these absolute measures of fit as a diagnostic tool only when there is evidence of high-quality measurement via moderate to strong factor loadings or many indicators per factor.
In contrast to the absolute fit measures, it is difficult to see the usefulness of the BCFI and BTLI relative fit measures in adjudicating between structurally correct and incorrect model specifications. These measures only appear to perform well at identifying incorrectly specified models (e.g., values < 0.90) when factor loadings are low to moderate (i.e., 0.40-0.60) and when factors are measured with fewer indicators, neither of which is desirable in practice.
Last but certainly not least, the PPP was most immune to measurement quality and performed well in differentiating between correctly and incorrectly specified models. As noted elsewhere (Asparouhov & Muthén, 2010; Hoijtink & Van de Schoot, 2018), a small qualification is that the PPP was less capable of detecting structural misspecification in small samples (N = 150), but only in conditions with low measurement quality (average loadings of 0.40).
In short, our results indicate that the BRMSEA and adjusted gamma-hat can be useful in detecting structural misspecifications in instances in which the standardized factor loadings are moderate or better, across sample sizes. For modest to large samples, the PPP seems most suitable for identifying structural problems, regardless of factor loading magnitude; further, similar to the BRMSEA, it was also capable of detecting structural misspecifications in smaller samples (e.g., N = 150) when standardized factor loadings were moderate or better.

Conclusion
Bayesian SEM offers many advantages over traditional frequentist modeling. Nevertheless, the task of assessing model fit, irrespective of the analytic framework, must still be undertaken. The current study shows that the quality of the measurement model, as characterized by the number of indicators and their factor loadings, plays a strong role in the obtained values of the BGFI measures investigated here, with the direction of influence varying as a function of BGFI type. Poorer quality measurement models resulted in lower (better) values of the BRMSEA, to the point where incorrectly specified models would yield BRMSEA values < 0.05 when factor loadings were low (i.e., 0.40), and BRMSEA values > 0.05 when factor loadings were high (i.e., 0.80). In contrast, poorer quality measurement models resulted in lower (worse) BCFI and BTLI values and wider credible intervals, as one might expect. Here, however, these measures had a harder time picking up on the misspecifications in the highest quality measurement conditions. Namely, in the 15-indicator condition with factor loadings of 0.80, BCFI and BTLI values were > 0.95 in both the moderately and severely misspecified models.
Taken together, our results are consistent with the views of others in emphasizing that commonly applied model fit cutoff values (e.g., CFI > 0.95) should not be taken to indicate a "good" fit (e.g., Kang et al., 2016; Miles & Shevlin, 2007), and that the adoption of cutoff criteria is something of a moving target that likely requires more responsiveness to different model conditions. We recommend the use of the PPP, as well as transparency in the reporting of results pertaining to the measurement portions of researchers' SEM models, as these have implications for the usefulness of various descriptive measures of overall model fit, even in a Bayesian analytic framework.

1
Although Hoofs et al. (2018) first introduced a Bayesian form of the RMSEA, the version proposed by Garnier-Villarreal and Jorgensen (2020), as shown in (5), minimizes the influence of sample size on the index.

Figure 3. BRMSEA results plotted by condition, model, and loading prior informativeness for the N = 1,000 sample size. Values represent means across N = 1,000 replications per cell for the N = 1,000 sample size condition. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) set to U(−1, 1) and indicator residual variances set to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Ind: total number of indicators across three factors (with equal numbers of indicators per factor). TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings for a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 4. Posterior predictive probability (PPP) results plotted by condition, model, and loading prior informativeness for the N = 1,000 sample size. Values represent means across N = 1,000 replications per cell for the N = 1,000 sample size condition. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) set to U(−1, 1) and indicator residual variances set to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Ind: total number of indicators across three factors (with equal numbers of indicators per factor). TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings for a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 5. Gamma-hat results plotted by condition, model, and loading prior informativeness. Values represent means of medians and 90% CIs across N = 1,000 replications for the N = 1,000 sample size condition. Num Ind: total number of indicators across three factors. Bayesian MCMC estimation in blavaan (based on Stan) was used. Fit indices based on pD LOO. NonInf: non-informative loading priors set to N(0, 100); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). Priors for factor-factor covariances (correlations) set to B(1, 1) and indicator residual variances set to G(1, 0.5). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Avg Scale Reliability: mean of each factor's scale/composite reliability (omega; McDonald, 1970) given the population loading and residual error variance values. The grand mean across indicator and factor structure conditions is 0.42, 0.67, and 0.86 for the low, medium, and high reliability levels, respectively. Due to overlapping values, some colors are not visible.

Figure 6. Adjusted gamma-hat results plotted by condition, model, and loading prior informativeness. Values represent means of medians and 90% CIs across N = 1,000 replications for the N = 1,000 sample size condition. Num Ind: total number of indicators across three factors. Bayesian MCMC estimation in blavaan (based on Stan) was used. Fit indices based on pD LOO. NonInf: non-informative loading priors set to N(0, 100); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). Priors for factor-factor covariances (correlations) set to B(1, 1) and indicator residual variances set to G(1, 0.5). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Avg Scale Reliability: mean of each factor's scale/composite reliability (omega; McDonald, 1970) given the population loading and residual error variance values. The grand mean across indicator and factor structure conditions is 0.42, 0.67, and 0.86 for the low, medium, and high reliability levels, respectively. Due to overlapping values, some colors are not visible.

Figure 7. CFI/BCFI results plotted by condition, model, and analytic approach for the N = 1,000 sample size. Values represent means across N = 1,000 replications per cell for the N = 1,000 sample size condition. Pop: population-based; ML: maximum likelihood estimation; Bayes: Bayesian MCMC estimation (collapsed across levels of loading prior informativeness). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Ind: total number of indicators across three factors (with equal numbers of indicators per factor). TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings for a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 8. BCFI results plotted by condition, model, and loading prior informativeness for the N = 1,000 sample size. Values represent means across N = 1,000 replications per cell for the N = 1,000 sample size condition. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) set to U(−1, 1) and indicator residual variances set to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Ind: total number of indicators across three factors (with equal numbers of indicators per factor). TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings for a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 9. TLI/BTLI results plotted by condition, model, and analytic approach for the N = 1,000 sample size. Values represent means across N = 1,000 replications per cell for the N = 1,000 sample size condition. Pop: population-based; ML: maximum likelihood estimation; Bayes: Bayesian MCMC estimation (collapsed across levels of loading prior informativeness). CS: correct specification; IS1: first type of incorrect specification, where the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, where all three factor covariances were set to zero. Ind: total number of indicators across three factors (with equal numbers of indicators per factor). TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings for a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 10. BTLI results plotted by condition, model, and loading prior informativeness for the N = 1,000 sample size. Values represent means across 1,000 replications per cell for the N = 1,000 sample size condition. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) were set to U(−1, 1) and indicator residual variances to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, in which the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, in which all three factor covariances were set to zero; Ind: total number of indicators across three factors (with equal numbers of indicators per factor); TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings within a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for the Low, Medium, and High levels, respectively. Due to overlapping values, some colors are not visible.

Figure 11. Absolute fit indices plotted by model and prior informativeness across sample size conditions. Panel A: BRMSEA; Panel B: PPP. Values represent means across 1,000 replications per cell across all conditions. Gamma-hat and adjusted gamma-hat are not shown, as these were estimated only for the N = 1,000 sample size condition. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) were set to U(−1, 1) and indicator residual variances to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, in which the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, in which all three factor covariances were set to zero; Ind: total number of indicators across three factors (with equal numbers of indicators per factor); TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings within a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for Low, Medium, and High, respectively. Due to overlapping values, some colors are not visible.

Figure 12. Relative fit indices plotted by model and prior informativeness across sample size conditions. Panel A: BCFI; Panel B: BTLI. Values represent means across 1,000 replications per cell across all conditions. NonInf: non-informative loading priors set to N(0, 10^10); LsInf: less informative loading priors set to N(0.6, 0.04); StrInf: strongly informative loading priors set to N(true value, 0.01). For ease of model comparability, priors for factor-factor covariances (correlations) were set to U(−1, 1) and indicator residual variances to IG(3, 1). CS: correct specification; IS1: first type of incorrect specification, in which the covariance between the first two factors was set to zero; IS2: second type of incorrect specification, in which all three factor covariances were set to zero; Ind: total number of indicators across three factors (with equal numbers of indicators per factor); TE-U: tau-equivalent loadings, uniform across factors; TE-M: tau-equivalent loadings within a given factor, but incrementally mixed magnitudes across factors; Cong: congeneric loading magnitudes across factors. Scale reliability reflects population-level omega values calculated from the population-level loading values, which average 0.42, 0.67, and 0.86 (across model type and number of indicators per factor) for Low, Medium, and High, respectively. Due to overlapping values, some colors are not visible.
Table 1. Ind: total number of indicators across three factors; Factor structure: loading pattern; TE-U: tau-equivalent with uniform loading values across all three factors; TE-M: tau-equivalent and uniform loadings within a factor, but with mixed levels across factors; Cong: congeneric factor loadings with values randomly drawn from a uniform distribution based on Heene et al.'s (2011) values. Avg Scale Reliability = mean of each factor's scale/composite reliability (omega; McDonald, 1970) given the population loading and residual error variance values. The grand mean across indicator and factor structure conditions is 0.42, 0.67, and 0.86 for the low, medium, and high reliability levels, respectively. Across conditions, population factor-factor correlations were set to Φ1,2 = 0.5, Φ1,3 = 0.4, and Φ2,3 = 0.3.
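The omega values reported in the table and figure notes follow from the population loadings and residual variances via McDonald's (1970) composite reliability formula. A minimal sketch for a single factor with unit variance and standardized indicators (the function name and example inputs are ours, not values from the article's conditions):

```python
def mcdonald_omega(loadings, residual_variances):
    """McDonald's omega for one factor with unit variance:
    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances)."""
    true_var = sum(loadings) ** 2
    return true_var / (true_var + sum(residual_variances))

# e.g., six tau-equivalent standardized loadings of 0.6,
# each with residual variance 1 - 0.6^2 = 0.64:
print(round(mcdonald_omega([0.6] * 6, [0.64] * 6), 3))  # → 0.771
```

The reported Low/Medium/High averages (0.42, 0.67, 0.86) are means of such per-factor values across the article's model-type and indicators-per-factor conditions.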

Table 2. Mean RMSEA/BRMSEA results by condition, model, and analytic approach for the N = 1,000 sample size.

Table 4. Mean CFI/BCFI results by condition, model, and analytic approach for the N = 1,000 sample size.

Table 5. Mean TLI/BTLI results by condition, model, and analytic approach for the N = 1,000 sample size.