Impact of Informative Priors on Model Fit Indices in Bayesian Confirmatory Factor Analysis

Abstract

Assessing model fit is a key component of structural equation modeling (SEM); however, measures of fit in Bayesian SEM remain limited. Recently, versions of frequentist fit indices have been adapted for use in Bayesian models, but the impact of prior information on these fit indices remains unknown. This simulation study investigates the performance of three fit indices (RMSEA, CFI, and TLI) in Bayesian confirmatory factor analysis (CFA) across a variety of prior specifications that included different degrees of informativeness and inaccuracy. Results show that Bayesian fit indices are impacted by prior choice, particularly when sample sizes are small. We discuss implications of assessing model fit with Bayesian fit indices and provide recommendations for applied researchers.


Introduction
Confirmatory factor analysis (CFA) is a useful tool for evaluating the quality of measurement models that form the basis for examining relationships among latent variables in a structural equation modeling (SEM) framework. In practice, CFA is commonly conducted using frequentist methods such as maximum likelihood estimation; however, in the last decade, Bayesian estimation for CFA has gained attention as a tractable alternative (e.g., Kaplan & Depaoli, 2012; Muthén & Asparouhov, 2012; van de Schoot et al., 2017). Applications of Bayesian CFA are increasingly found in behavioral and educational research (e.g., de Beer & Bianchi, 2019; Dombrowski et al., 2018; Falkenström et al., 2015; Modrowski et al., 2021; Murray et al., 2019; Reis, 2019; Taylor, 2019). Yet, as the Bayesian approach continues to grow in popularity among methodologists and applied researchers alike, more research is needed to fully understand the behavior of these methods across different applications. Much of the extant methodological literature has focused on parameter estimate bias in Bayesian SEM, but a number of other considerations remain understudied, including the behavior of recently derived methods for evaluating model fit within a Bayesian context.
A key aspect of CFA is assessing model fit to determine how well a proposed measurement model is consistent with the observed data. In the frequentist framework, model fit is traditionally evaluated with multiple measures that address different aspects of (mis)fit, including the chi-square test statistic and a variety of descriptive indices (e.g., RMSEA, CFI, TLI). Descriptive fit indices are useful because they are less sensitive to sample size compared to the chi-square test statistic, which tends to reject approximately well-fitting models with large samples (Bentler & Bonett, 1980). In Bayesian analysis, model evaluation typically involves using the posterior predictive p-value (PPP); however, the PPP method is also sensitive to sample size. With large sample sizes, PPP can reject models with even the slightest amount of misspecification, essentially rendering PPP useless for evaluating approximately well-fitting models in large samples (Asparouhov & Muthén, 2010; Cain & Zhang, 2019). In an effort to develop alternative methods of model evaluation for Bayesian SEM models, versions of RMSEA, CFI, and TLI were recently extended to the Bayesian context (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018). These Bayesian approximate fit indices have been shown to be less sensitive to large sample sizes compared to PPP (Garnier-Villarreal & Jorgensen, 2020).
Although the extension of these measures to the Bayesian framework represents significant progress for the field, questions remain around their utility in applied settings. Specifically, it is unclear how the proposed fit indices perform under different prior specifications, given that the choice of priors can severely impact parameter estimation in Bayesian models (e.g., Gelman, 2006). For example, uninformative and (informative) inaccurate priors have been found to result in biased estimates and insufficient power, particularly when sample sizes are small (Depaoli et al., 2021; van Erp et al., 2018). In addition, recent work demonstrated that the PPP is also influenced by prior specifications in Bayesian CFA models, suggesting that model fit may be prior dependent (Cain & Zhang, 2019). With respect to Bayesian versions of the approximate fit indices (e.g., RMSEA, CFI, and TLI), previous simulation research has only evaluated these indices in the context of diffuse (i.e., uninformative) prior specifications through default prior settings readily available in statistical software packages such as Mplus (Muthén & Muthén, 2021) and the R package blavaan (Merkle & Rosseel, 2018). With uninformative priors, Bayesian fit indices have been shown to behave similarly to frequentist fit indices (Garnier-Villarreal & Jorgensen, 2020); however, their behavior in the context of informative or inaccurate priors remains understudied. Garnier-Villarreal and Jorgensen (2020) considered the case of informative priors by way of an illustrative example but acknowledged that additional research is needed to understand how the proposed indices are affected by the choice of priors. Understanding the influence of different prior specifications on Bayesian fit indices is important given that priors play a critical role in Bayesian analysis. Building on previous work (Asparouhov & Muthén, 2021; Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018), this paper presents a simulation study that evaluates the performance of Bayesian fit indices under various design conditions, with particular focus on the impact of different prior specifications not previously examined. In the following sections, we describe frequently used methods of model evaluation in frequentist and Bayesian CFA, with attention to their similarities and differences, and discuss factors, such as sample size, that are known to impact model fit in both traditions. Thereafter, we present our study design and results, along with a discussion of their implications for using fit indices in applications of Bayesian CFA.

Frequentist Methods
Measurement models within an SEM framework are used to quantify various aspects of relationships among a set of observed variables and their underlying latent construct(s). CFA is often conducted to evaluate the quality of these models. Thereafter, relationships among the resulting latent variables can be examined in a variety of SEM applications (e.g., multilevel, mediation, mixture, etc.). For model evaluation in the frequentist tradition, the likelihood ratio test, which is asymptotically distributed as chi-square (χ²) when the assumption of multivariate normality is met, evaluates the discrepancy between the observed and model-implied covariance and mean structures. The χ² statistic provides a test of exact model fit, such that any statistically significant discrepancy (beyond random sampling error) leads to the conclusion that the model is misspecified. Statistically nonsignificant results are often taken to imply that the model provides a reasonable approximation to the data. However, this test is somewhat controversial given that it is based on a number of assumptions that are unlikely to be met in applied work (Bollen, 1989), it tends to be over-powered in rejecting reasonable models (Kaplan, 1990), and the χ² approximation may not hold in a variety of circumstances (Chen et al., 2020). In light of these limitations, several fit indices have been developed as alternative measures of model fit. Among the more popular approaches that have been adapted to the Bayesian framework are the root mean square error of approximation (RMSEA; Steiger & Lind, 1980), the comparative fit index (CFI; Bentler, 1990), and the Tucker-Lewis Index (TLI; Tucker & Lewis, 1973).

Root Mean Square Error of Approximation
RMSEA is an absolute fit measure of the average discrepancy between the model-implied covariance matrix and that of the observed data per degrees of freedom. Unlike the χ² test of exact model-data fit, RMSEA is used to evaluate how well a model approximates the observed data. Based on the notion that some misspecification is inherent in all models, RMSEA assumes a noncentral χ² distribution that is defined by discrepancies attributable to both sampling error and specification error (Browne & Cudeck, 1992). When estimated using maximum likelihood, RMSEA is computed as a function of the hypothesized model's χ² statistic (χ²_H), degrees of freedom (df_H), and sample size (N):

$$\mathrm{RMSEA} = \sqrt{\max\left(\frac{\chi^2_H - df_H}{df_H \, N},\ 0\right)} \tag{1}$$

The degree of model misspecification is measured by the noncentrality parameter, which is equal to χ²_H − df_H. As shown in Equation (1), the noncentrality parameter is then divided by the product of df_H and N. In effect, RMSEA accounts for model complexity (i.e., the number of model parameters) and sample size. The lower bound of RMSEA is zero, with higher values indicating increasingly poorer fit.
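To make Equation (1) concrete, the following minimal Python sketch (ours, not from the original article; the function name and example values are illustrative) computes RMSEA from the fit statistics it depends on:

```python
import math

def rmsea(chisq_h: float, df_h: int, n: int) -> float:
    """Frequentist RMSEA per Equation (1): the noncentrality (chi^2 - df)
    is divided by df * N and floored at zero before taking the square root."""
    noncentrality = max(chisq_h - df_h, 0.0)
    return math.sqrt(noncentrality / (df_h * n))

# Example: chi^2 = 85.3 on 53 df with N = 250 gives RMSEA ~= 0.049
print(round(rmsea(85.3, 53, 250), 3))
```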

Comparative Fit Index
CFI is an incremental fit index that compares the hypothesized model to a more restricted baseline model (i.e., a null, or independence, model) to measure the improvement of model fit (Bentler, 1990). The baseline model is assumed to be nested under a theoretically best-fitting model that imposes no constraints on the covariance structure (i.e., a saturated model). The hypothesized model then lies somewhere on a continuum between the baseline and saturated models. CFI is normed to a scale of 0 to 1, such that values near 0 indicate the hypothesized model more closely resembles the baseline model and therefore provides poor fit. At the other end of the scale, CFI values near 1 indicate that the hypothesized model fits the data nearly as well as the saturated model. CFI is expressed as

$$\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_H - df_H,\ 0\right)}{\max\left(\chi^2_H - df_H,\ \chi^2_B - df_B,\ 0\right)} \tag{2}$$

where χ²_H − df_H and χ²_B − df_B correspond to the noncentrality parameters of the hypothesized and baseline models, respectively. Thus, CFI can be interpreted as a normed ratio, or comparison, of the degree of misspecification in the nested models.
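A companion sketch for Equation (2), under the same caveats (illustrative names and values, not from the article):

```python
def cfi(chisq_h: float, df_h: int, chisq_b: float, df_b: int) -> float:
    """Frequentist CFI per Equation (2): one minus the ratio of the
    hypothesized model's noncentrality to the baseline model's,
    with both floored at zero so the index stays in [0, 1]."""
    nc_h = max(chisq_h - df_h, 0.0)
    nc_b = max(chisq_b - df_b, nc_h, 0.0)
    return 1.0 - (nc_h / nc_b if nc_b > 0 else 0.0)

# Example: chi^2_H = 85.3 (df 53), chi^2_B = 920.4 (df 66) gives CFI ~= 0.962
print(round(cfi(85.3, 53, 920.4, 66), 3))
```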

Tucker-Lewis Index
TLI is also an incremental fit index that evaluates a hypothesized model's fit relative to the fit of the baseline model (Bentler & Bonett, 1980; Tucker & Lewis, 1973); however, unlike CFI, values of TLI can exceed the range of 0 to 1, and TLI is not based on the noncentral χ² distribution. The formula for TLI is

$$\mathrm{TLI} = \frac{\chi^2_B/df_B - \chi^2_H/df_H}{\chi^2_B/df_B - 1} \tag{3}$$

where the ratio χ²/df imposes a penalty for model complexity. Like CFI, higher values of TLI indicate better fit.

Findings from previous studies demonstrate that RMSEA, CFI, and TLI are largely robust to the effects of large sample sizes (Bentler, 1990; Fan et al., 1999; Marsh et al., 1988; Tanguma, 2001). When correctly specified models are fit to large-sample data, fit indices tend to appropriately characterize model fit, even when the chi-square test statistic is inflated. However, factors other than sample size have been shown to impact the performance of fit indices. Such factors include non-normality (Jobst et al., 2021), missing data (Zhang & Savalei, 2020), factor loading magnitude (Gagne & Hancock, 2006), and model size (Kenny & McCoach, 2003; Shi et al., 2019). Due to the influence of these factors, it is difficult to establish fixed cutoff values of RMSEA, CFI, and TLI that can be universally applied to different modeling contexts. Although fixed values of CFI and TLI > .95 and RMSEA < .06 are often cited as indicative of good model fit following the seminal work of Hu and Bentler (1999), many researchers have emphasized that such cutoffs do not generalize beyond the model conditions with which they were developed (e.g., Fan & Sivo, 2005; Jorgensen et al., 2018; Marsh et al., 2004). As a solution, researchers have developed simulation-based methods for determining appropriate cutoff values of RMSEA, CFI, and TLI that account for specific model characteristics (McNeish & Wolf, 2021; Millsap, 2007, 2013; Pornprasertmanit et al., 2013); however, these methods are not widely used by applied researchers in practice. Notwithstanding the oft-cited ambiguity of fixed cutoffs (e.g., Lai & Green, 2016; Shi et al., 2019; Ximénez et al., 2022; Yuan et al., 2016), fit indices continue to provide empirical researchers with a practical method of model evaluation that is otherwise not available with large sample sizes.
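Completing the set, a sketch of Equation (3) under the same caveats; note the absence of a zero floor, which is why TLI can fall outside [0, 1]:

```python
def tli(chisq_h: float, df_h: int, chisq_b: float, df_b: int) -> float:
    """Frequentist TLI per Equation (3): compares the chi^2/df ratios of
    the baseline and hypothesized models; unlike CFI it is not normed."""
    ratio_h = chisq_h / df_h
    ratio_b = chisq_b / df_b
    return (ratio_b - ratio_h) / (ratio_b - 1.0)

# Example: the same fit statistics as above give TLI ~= 0.953
print(round(tli(85.3, 53, 920.4, 66), 3))
```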

Bayesian Methods
Although confirmatory factor analysis (and SEM, more generally) has a long tradition in the frequentist framework, applications of Bayesian SEM in the social and behavioral sciences have become increasingly prevalent in recent years. One advantage of the Bayesian approach is the ability to incorporate prior beliefs about the parameters into the model. Prior information is combined with the observed data to construct the posterior distribution

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \propto p(x \mid \theta)\, p(\theta) \tag{4}$$

where θ is a vector of unknown parameters, x is the observed data, p(x | θ) is the conditional data likelihood function, and p(θ) is the prior distribution. Markov chain Monte Carlo (MCMC) methods are often used to compute an empirical approximation of the posterior distribution. In Bayesian CFA, priors are specified for factor loadings, indicator residual variances, and latent factor variances and covariances. For example, the prior distribution for factor loadings is typically specified as the normal distribution, N(m, σ²), with mean (m) and variance (σ²) hyperparameters. The variance hyperparameter determines the amount of information that the prior distribution contributes to the posterior. When relevant prior information about the model parameters is unknown, uninformative (i.e., diffuse) priors can be specified using large variances, so less information is contributed to the posterior distribution. Alternatively, informative priors can be specified using smaller variances, such that the prior contributes more information to the posterior, thereby reflecting more certainty about the parameters. The mean hyperparameter determines the accuracy of the prior distribution. As the value of the mean hyperparameter approaches the true value of the population parameter, the prior is said to be increasingly accurate.
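To illustrate how the variance hyperparameter governs prior influence, consider a deliberately simplified conjugate normal-normal sketch (a textbook example of Bayesian updating, not the article's CFA model; all names and values here are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=0.7, scale=1.0, size=50)  # true value 0.7, N = 50

def posterior_mean(prior_mean, prior_var, x, data_var=1.0):
    """Conjugate normal-normal update: the posterior mean is a
    precision-weighted average of the prior mean and the sample mean."""
    prior_precision = 1.0 / prior_var
    data_precision = len(x) / data_var
    w = data_precision / (data_precision + prior_precision)
    return w * x.mean() + (1 - w) * prior_mean

# Diffuse prior: the data dominate the posterior.
print(posterior_mean(0.0, 1e10, data))  # ~ sample mean
# Strongly informative but inaccurate prior: estimate shrunk toward 0.
print(posterior_mean(0.0, 0.01, data))
```

With small N, the prior's precision can rival the data's, which is why inaccurate informative priors are most damaging in small samples.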
Extensive research shows that the choice of priors can have a substantial impact on results in Bayesian analysis. Relying on default diffuse priors is not always appropriate and can result in biased estimates, especially when sample sizes are small (e.g., McNeish, 2016; Smid & Winter, 2020; van Erp et al., 2018). Because the prior distribution is combined with the data (Equation (4)), sample size plays a nontrivial role in the formation of the posterior. With large samples, the information contributed to the posterior distribution by the likelihood function p(x | θ) outweighs the amount of information contributed by the prior. However, with small samples, the likelihood function contributes less information from the data, so the prior distribution has a greater impact on the posterior. As a result, in small sample contexts, priors would ideally be specified as strongly informative with small variance hyperparameters, assuming the prior is accurately centered on the population parameter value. Yet, in empirical settings, researchers cannot know with certainty how accurate (or inaccurate) a prior is with respect to the true value. Inaccurate informative priors with small-sample data will result in biased parameter estimates (Depaoli, 2014; van de Schoot et al., 2018). Hence, careful consideration must be given to the choice of priors.

Posterior Predictive Model Checking
Evaluation of model fit in Bayesian analysis is typically conducted with posterior predictive model checking (PPMC) (Gelman et al., 1996). PPMC assesses whether the model adequately summarizes the data by comparing the observed data to replicated data predicted by the model. Draws from the posterior distribution are simulated to empirically construct the posterior predictive distribution, which is the conditional distribution of the replicated data given the observed data and the model. Discrepancy measures are then used to assess any significant difference between the replicated and observed data. The realized discrepancy measure (D_obs) is obtained from the observed data, and the predictive discrepancy measure (D_rep) is obtained from the replicated data. The proportion of iterations in which D_rep is greater than D_obs is called the posterior predictive p-value (PPP). Values of PPP near .5 indicate good data-model fit, with PPP < .05 generally taken to reflect model misspecification (Asparouhov & Muthén, 2010).
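Computationally, PPP reduces to a proportion over MCMC iterations. A minimal sketch, assuming vectors of realized and predictive discrepancies have already been extracted from the sampler output (the names are hypothetical):

```python
import numpy as np

def ppp(d_obs: np.ndarray, d_rep: np.ndarray) -> float:
    """Posterior predictive p-value: proportion of MCMC iterations in
    which the predictive discrepancy D_rep exceeds the realized
    discrepancy D_obs. Values near .5 suggest adequate fit; values
    near 0 (e.g., < .05) flag misspecification."""
    return float(np.mean(d_rep > d_obs))
```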
Although PPMC is a common procedure for model-fit evaluation in Bayesian SEM (Levy, 2011; Zhang et al., 2022), PPP is known to be sensitive to sample size (Asparouhov & Muthén, 2010; Hoijtink & van de Schoot, 2017; Lee & Song, 2004; Rindskopf, 2012; Rupp et al., 2004). In large samples, PPP tends to reject models with negligible misspecification. In addition, research has found that PPP is sensitive to other factors beyond model misspecification, including prior specification and model size (Cain & Zhang, 2019). This work has also demonstrated that factor-loading priors with one standard deviation of inaccuracy will result in high PPP false rejection rates (Cain & Zhang, 2019). Thus, conclusions about Bayesian model fit based on PPP values may be misguided depending on characteristics of the model and data.

Bayesian Fit Indices
To address this issue of PPP in large sample sizes, Hoofs et al. (2018) proposed a method for adapting RMSEA to the Bayesian context as an alternative measure of model fit in large samples. More recently, Garnier-Villarreal and Jorgensen (2020) extended that work by developing Bayesian versions of several approximate fit indices (including RMSEA, CFI, and TLI) that would behave like the frequentist versions under ML estimation for a wider range of sample sizes (not just large samples of N > 1,000). Subsequently, the Bayesian versions proposed by Garnier-Villarreal and Jorgensen (2020) have been implemented in two software packages: blavaan (Merkle & Rosseel, 2018) and Mplus (Asparouhov & Muthén, 2021). In the current study, we focus our investigation on the Bayesian formulations of RMSEA, CFI, and TLI as developed by Garnier-Villarreal and Jorgensen (2020). These fit indices are formulated by replacing frequentist measures of model complexity and misspecification in Equations (1)-(3) with analogous forms from the Bayesian framework. Specifically, p* − pD is used in place of df as a measure of model complexity, where p* is the number of observed sample moments and pD is the estimated number of parameters in the hypothesized model. In addition, D_i^obs − p* is used as a measure of model misspecification, where D_i^obs is the discrepancy function for the observed data at MCMC iteration i. The Bayesian form of RMSEA (Garnier-Villarreal & Jorgensen, 2020) is computed for each MCMC iteration (i) as:

$$\mathrm{RMSEA}_i = \sqrt{\max\left(\frac{D^{\mathrm{obs}}_i - p^*}{(p^* - pD)\, N},\ 0\right)} \tag{5}$$

Compared to the frequentist formulation of RMSEA in Equation (1), the Bayesian version in Equation (5) replaces the noncentrality parameter (χ²_H − df_H) in the numerator with D_i^obs − p* and replaces df_H in the denominator with p* − pD. Through the process of MCMC sampling, an empirical distribution of RMSEA values is obtained. CFI is computed for Bayesian models as:

$$\mathrm{CFI}_i = 1 - \frac{\max\left(D^{\mathrm{obs}}_i - p^*,\ 0\right)}{\max\left(D^{\mathrm{obs}}_i - p^*,\ D^{\mathrm{obs}}_{Bi} - p^*,\ 0\right)} \tag{6}$$

where D_Bi^obs is the discrepancy function for the observed data in the baseline model. In Equation (6), D_i^obs and D_Bi^obs are used in place of χ²_H and χ²_B, respectively, and p* is used in place of the frequentist df. Then, CFI_i from each MCMC iteration is combined to obtain a distribution of values. The formula for Bayesian TLI is:

$$\mathrm{TLI}_i = \frac{D^{\mathrm{obs}}_{Bi}/(p^* - pD_B) - D^{\mathrm{obs}}_i/(p^* - pD)}{D^{\mathrm{obs}}_{Bi}/(p^* - pD_B) - 1} \tag{7}$$

where pD_B is the estimated number of parameters in the baseline model. Similarly, values of TLI_i for all iterations are used to construct a distribution of TLI. Point estimates and credibility intervals of Bayesian RMSEA, CFI, and TLI can be obtained using summary statistics of the respective empirical distributions.
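A hedged sketch of the per-iteration computations in Equations (5)-(7) (our illustration; the variable names are hypothetical, and it assumes the discrepancy vectors and pD estimates have already been extracted from the MCMC output):

```python
import numpy as np

def bayes_fit_indices(d_obs, d_obs_b, p_star, pd_h, pd_b, n):
    """Per-iteration Bayesian RMSEA, CFI, and TLI following Equations
    (5)-(7). d_obs and d_obs_b hold one discrepancy value per MCMC
    iteration for the hypothesized and baseline models; p_star is the
    number of sample moments; pd_h and pd_b are the estimated numbers
    of parameters; n is the sample size."""
    d_obs, d_obs_b = np.asarray(d_obs), np.asarray(d_obs_b)
    nc_h = d_obs - p_star        # misspecification analog, hypothesized
    nc_b = d_obs_b - p_star      # misspecification analog, baseline
    rmsea = np.sqrt(np.maximum(nc_h / ((p_star - pd_h) * n), 0.0))
    denom = np.maximum(np.maximum(nc_h, nc_b), 1e-12)  # guard division
    cfi = 1.0 - np.maximum(nc_h, 0.0) / denom
    tli = ((d_obs_b / (p_star - pd_b) - d_obs / (p_star - pd_h))
           / (d_obs_b / (p_star - pd_b) - 1.0))
    return rmsea, cfi, tli
```

Point estimates would then be posterior means of these per-iteration vectors, with 90% credibility intervals taken from the 5th and 95th percentiles (e.g., np.percentile(rmsea, [5, 95])).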
Bayesian versions of the fit indices appear to perform well under the simulation conditions examined thus far. Point estimates of Bayesian RMSEA, CFI, and TLI based on posterior means are consistent with values of the frequentist fit indices across a variety of model types (CFA and SEM), sample sizes, and misspecification levels (Garnier-Villarreal & Jorgensen, 2020). In addition, the Bayesian framework provides credibility intervals for fit indices, which allow researchers to summarize uncertainty around the point estimates. As noted by Asparouhov and Muthén (2021), credibility intervals of the Bayesian fit indices are particularly useful when determining whether sample sizes are large enough to conclusively evaluate model fit. If the sample is too small, credibility intervals for fit indices will be wide enough to contain the cutoff value. In that case, model fit is said to be inconclusive based on fit indices, and PPP values should be used instead (Asparouhov & Muthén, 2021).
Bayesian versions of RMSEA, CFI, and TLI provide researchers with additional methods of model fit evaluation beyond the traditional PPMC. However, the methodological literature on the applicability of these fit indices across a range of modeling conditions is limited. Thus far, research on Bayesian fit indices has only considered their performance with default diffuse priors. As a result, a systematic understanding of how the choice of priors influences Bayesian fit indices remains incomplete. This simulation study aims to address this gap by examining the performance of fit indices across priors with different degrees of (in)accuracy and informativeness. Based on previous work that documented the effect of prior specifications on PPP (Cain & Zhang, 2019), we expect priors to also impact Bayesian approximate fit indices. In addition, we expect sample size to play a role, such that priors with higher levels of informativeness and inaccuracy will have a larger impact on results obtained with smaller sample sizes. Finally, our investigation considers how these modeling factors may differentially impact fit assessment when using credibility intervals instead of point estimates of the fit indices.

Method
The primary focus of the current study was on the performance of three Bayesian model fit indices (i.e., RMSEA, CFI, and TLI) in models estimated with MCMC under conditions of varying prior distributions for factor loadings. ML estimation was also included as a design facet for purposes of comparison. The behavior of these fit indices was further evaluated by varying model specification type (3 levels), magnitude of cross-loadings (3 levels), and sample size (5 levels). Levels of each design facet are provided in Table 1. The population CFA model used for data generation was a correlated two-factor model with 12 observed variables, which was based on the two-factor reference model used in Hoofs et al. (2018). The path diagram for our data-generating model is shown in Figure 1. Population parameter values included factor variances set to 1.0, intercepts and latent means set to zero, a latent factor covariance equal to 0.3, and primary factor loadings equal to 0.7. In the population model, each observed indicator was constrained to load onto a single factor, with one notable exception: Item 6. Specifically, the magnitude of the cross-loading from Item 6 to Factor 2 was varied to be λ6,2 = 0, 0.15, or 0.3, representing null, small, and moderate values, respectively (Muthén & Asparouhov, 2012).
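For concreteness, a sketch of the data-generating step (our reconstruction, not the authors' code; the residual variances are an assumption, set to 1 − 0.7² so each indicator has approximately unit variance in the null cross-loading condition):

```python
import numpy as np

def generate_data(n, cross_loading=0.0, seed=0):
    """Simulate n cases from the two-factor population CFA model:
    12 indicators, primary loadings 0.7, factor variances 1.0,
    factor covariance 0.3, and one cross-loading (Item 6 on
    Factor 2) equal to 0, 0.15, or 0.3."""
    rng = np.random.default_rng(seed)
    lam = np.zeros((12, 2))
    lam[:6, 0] = 0.7            # Items 1-6 load on Factor 1
    lam[6:, 1] = 0.7            # Items 7-12 load on Factor 2
    lam[5, 1] = cross_loading   # Item 6 cross-loads on Factor 2
    phi = np.array([[1.0, 0.3], [0.3, 1.0]])     # factor covariance matrix
    theta = np.diag(np.full(12, 1.0 - 0.7**2))   # assumed residual variances
    eta = rng.multivariate_normal(np.zeros(2), phi, size=n)
    eps = rng.multivariate_normal(np.zeros(12), theta, size=n)
    return eta @ lam.T + eps
```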
We fit a series of three analysis models (Models 1-3) to the data simulated from the population model, whereby each subsequent model specification included an additional parameter constraint. A summary of these model specifications and their corresponding population RMSEA, CFI, and TLI (based on frequentist versions) are provided in Table 2. Model 1 involved freely estimating both the cross-loading on Item 6 and the latent factor covariance. For Model 2, the latent factor covariance was freely estimated, but the cross-loading was constrained to zero. Hence, this model specification was considered correct when λ6,2 = 0 and incorrect when λ6,2 = 0.15 or 0.3. Finally, in Model 3, both the cross-loading and factor covariance were constrained to zero, resulting in model misspecification at all levels of λ6,2.
As detailed in Table 1, the study compared nine different Bayesian prior specifications for the primary factor loadings. Plots of these prior distributions are presented in Figure A1 of the Supporting Information. Prior 1 was a diffuse (i.e., uninformative) prior, specified using the Mplus default of N(0, 10^10) for factor loadings, and Priors 2-9 were informative priors with varying degrees of accuracy and informativeness. Accurate priors (Priors 2-5) were specified with a mean hyperparameter equal to 0.7 (the true value of the primary loadings), and inaccurate priors (Priors 6-9) with a mean hyperparameter of zero. These two conditions of (in)accuracy were then crossed with four conditions of informativeness, varied by manipulating the variance hyperparameter, which ranged from 0.01 (most informative) to 7.84 (least informative): Priors 2-5 were N(0.7, 7.84), N(0.7, 0.49), N(0.7, 0.05), and N(0.7, 0.01), and Priors 6-9 were N(0, 7.84), N(0, 0.49), N(0, 0.05), and N(0, 0.01). In addition to manipulating the priors for the primary factor loadings, we also implemented two different prior specifications for the cross-loading: λ6,2 ~ N(0, 0.01) and N(0, 0.05); these cross-loading priors applied only to Model 1, in which the cross-loading was freely estimated. This allowed us to examine the impact of different small-variance priors for cross-loadings on model fit.
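For reference, the nine primary-loading prior specifications can be enumerated as (mean, variance) pairs; Priors 2 and 3 are inferred here from the crossed accuracy-by-informativeness structure described above:

```python
# (mean, variance) hyperparameters for the primary-loading priors
priors = {
    1: (0.0, 1e10),   # Mplus default diffuse
    2: (0.7, 7.84),   # accurate, weakly informative
    3: (0.7, 0.49),
    4: (0.7, 0.05),
    5: (0.7, 0.01),   # accurate, strongly informative
    6: (0.0, 7.84),   # inaccurate, weakly informative
    7: (0.0, 0.49),
    8: (0.0, 0.05),
    9: (0.0, 0.01),   # inaccurate, strongly informative
}
```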
Beyond the factor loadings, all other model parameters in the MCMC conditions were specified with diffuse priors using the Mplus defaults, which included Γ⁻¹(−1, 0) for residual variances and W⁻¹(0, −3) for the latent factor covariance matrix. For each of the 585 cells in the simulation design, 1,000 datasets were generated, and all models were estimated in Mplus version 8.5 (Muthén & Muthén, 1998-2021). Bayesian estimation was conducted using the Gibbs sampler, two chains, and 50,000 iterations per chain with the first half discarded as burn-in. Fit measures (RMSEA, CFI, TLI, and p-value/PPP) for all models were obtained during the estimation process in Mplus, and 90% credibility intervals for RMSEA, CFI, and TLI were computed for MCMC models.

Evaluation Criteria
Analysis of variance (ANOVA) was conducted to evaluate the strength of association between the simulation design conditions and model fit indices. Effect sizes were evaluated based on partial eta-squared, with η²_p ≥ 0.14 indicating a large effect size (Cohen, 1988). Following the ANOVA, we examined bias and coverage of the fit indices. For CFI and TLI, bias was computed as the relative difference between the parameter estimate (averaged across all replications in the cell) and the population value: (θ_est − θ_pop)/θ_pop. Relative bias values beyond ±0.1 were considered extreme (Kaplan, 1988). For RMSEA, we evaluated absolute bias, θ_est − θ_pop, to avoid undefined values for cases when the population RMSEA equaled zero. Coverage of the credibility interval (CI) was computed as the proportion of replications for which the population value was within the 90% CI.
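These criteria reduce to a few summary statistics per design cell; a minimal sketch (the function and argument names are hypothetical):

```python
import numpy as np

def summarize_cell(estimates, ci_lower, ci_upper, pop_value):
    """Absolute bias, relative bias, and 90% CI coverage for one
    simulation cell, given per-replication point estimates and
    credibility-interval bounds."""
    estimates = np.asarray(estimates)
    abs_bias = estimates.mean() - pop_value                 # used for RMSEA
    rel_bias = abs_bias / pop_value if pop_value != 0 else np.nan  # CFI/TLI
    coverage = np.mean((np.asarray(ci_lower) <= pop_value)
                       & (pop_value <= np.asarray(ci_upper)))
    return abs_bias, rel_bias, float(coverage)
```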
To evaluate the applicability of common cutoff values for the fit indices in Bayesian CFA, model fit for each replication was classified based on recommendations provided in Hu and Bentler (1999): .06 for RMSEA, and .95 for CFI and TLI. Two different methods were used to classify model fit, following the approach described in Asparouhov and Muthén (2021). In the first method, model fit was classified as either "good" or "poor" based on point estimates of fit indices. That is, model fit was determined to be good when the point estimate met the cutoff criterion, and poor otherwise. The second method involved using 90% credibility intervals of the Bayesian fit indices instead of point estimates. Model fit was classified as good when the entire CI met the cutoff criterion, and poor when the entire CI was beyond the threshold. When the CI contained the cutoff value, model fit was classified as "inconclusive."
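The two classification rules can be written directly; the sketch below (ours) uses the RMSEA cutoff of .06, with the direction of the inequalities reversed for CFI and TLI:

```python
def classify_point(est, cutoff=0.06):
    """Point-estimate rule: 'good' if the estimate meets the cutoff."""
    return "good" if est < cutoff else "poor"

def classify_ci(lower, upper, cutoff=0.06):
    """CI rule: 'good' if the entire 90% CI is below the cutoff,
    'poor' if entirely above, and 'inconclusive' if it straddles it."""
    if upper < cutoff:
        return "good"
    if lower > cutoff:
        return "poor"
    return "inconclusive"
```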

Results
We first present ANOVA results to show the effect of each design condition on RMSEA, CFI, TLI, and p-value/PPP. Partial eta-squared values are provided in Table 3 for each measure of model fit as a function of simulation condition; all values were statistically significant at p < .001. Notably, results for MCMC estimation revealed that the two-way interaction of sample size and prior specification for primary loadings had a large effect on RMSEA (η²_p = 0.42), CFI (η²_p = 0.78), and TLI (η²_p = 0.75), holding constant all other factors included in the analysis. That is, the interaction effect accounted for 42% of the variance in Bayesian RMSEA and more than 75% of the variance in Bayesian CFI and TLI, after accounting for variance explained by the other design facets. Although this interaction term had a negligible effect on PPP (η²_p = 0.03), the primary-loading prior was found to have a large main effect on PPP (η²_p = 0.24). No effects were observed for cross-loading priors (η²_p = 0).

Figure 2 shows plots of bias in parameter estimates (RMSEA, CFI, TLI) across levels of sample size, model specification type, and cross-loading values. Each plot provides results for the nine MCMC prior conditions for primary loadings (Priors 1-9), as well as results for ML estimation. Results are only shown for the cross-loading prior specification of λ6,2 ~ N(0, 0.01), given that results were similar for λ6,2 ~ N(0, 0.05). Full tables of results for bias are provided in Tables B1-B6 of the Supporting Information. As depicted in Figure 2, results indicated that for all fit indices, MCMC and ML estimates tended to converge on population values as sample size increased, such that at N = 1,000, bias was generally negligible. Notably, this trend held for all primary-loading prior conditions except for Prior 9, which was considered a strongly informative, inaccurate prior: N(0, 0.01). Specifically, Prior 9 resulted in upwardly biased estimates of RMSEA even at large sample sizes, although estimates of CFI and TLI were found to be unbiased at N = 1,000. At small sample sizes (e.g., N = 50), RMSEA was upwardly biased, particularly for the strongly informative inaccurate priors: 8 and 9. For example, when the population RMSEA was zero and sample size was 50, estimates of RMSEA were, on average, 0.15 and 0.22 for Priors 8 and 9, respectively.

Plots of 90% coverage rates are shown in Figure 3, and full tables of results are provided in Tables C1-C6 of the Supporting Information. Overall patterns of coverage were largely consistent between RMSEA, CFI, and TLI. In general, coverage was always below the nominal 90% for Prior 9 (i.e., the most restrictive inaccurate prior). Even when the model was correctly specified (Model 1), the use of such an inaccurate, tight prior yielded coverage rates near zero. Coverage rates obtained with the other prior specifications for Model 1 improved as sample size increased. However, we observed the opposite pattern when models were misspecified (i.e., for Model 2 when λ6,2 = 0.15 or 0.3 and for Model 3 at all levels of λ6,2). Under conditions of model misspecification, coverage declined for Priors 1-8 at larger sample sizes but improved for Prior 9. For instance, for Model 2 at λ6,2 = 0.3 and N = 1,000, RMSEA coverage was, on average, 0.07 for Priors 1-8 and 0.59 for Prior 9.
To provide further context for understanding the coverage results, Figure 4 presents plots of point estimates and 90% CIs of RMSEA for two primary-loading priors: the diffuse (Prior 1) and strongly informative inaccurate (Prior 9) specifications. For clarity, Figure 4 only depicts these two prior conditions, which reflect the most disparate results across priors; however, Appendices D and E of the Supporting Information provide full tables of point estimates and CIs for all prior conditions evaluated in this study. As shown in Figure 4, the diffuse prior generally yielded point estimates of RMSEA that were closer to the population value compared to estimates obtained with the inaccurate-informative prior; however, as sample size increased, these differences gradually diminished. Similarly, although the diffuse prior resulted in wider CIs at small sample sizes than did the inaccurate-informative prior, the estimated CIs for both priors became narrower as sample size increased. Looking at the plots for conditions in which the model was misspecified (i.e., for Model 2 when λ6,2 = 0.15 or 0.3 and for Model 3 at all levels of λ6,2), Figure 4 shows that at large sample sizes (e.g., N = 1,000), the diffuse prior produced CIs that were almost completely below population values (indicative of better model fit), while the inaccurate-informative prior's CIs were narrowly above these values. This trend is consistent with the pattern of coverage reported above. Specifically, the declining coverage rates for Priors 1-8 in misspecified models (Figure 3) reflect the underestimated CIs.
The results presented above help extend our understanding of how Bayesian fit indices behave under different prior specifications; however, we think it prudent to also illustrate why the conventional fixed cutoff values that are ubiquitous in the frequentist literature should not be hastily applied to the Bayesian context. Figure 5 presents stacked bar charts of model fit based on RMSEA point estimates (left panel) and 90% CIs (right panel) across the different prior specifications and sample sizes, where each plotted bar depicts the proportion of replications that resulted in good, poor, and inconclusive (for CIs) model fit. Results based on point estimates of RMSEA showed that correctly specified models (Model 1) were often falsely rejected (i.e., poor fit) when sample size was small (N = 50), regardless of prior specification. Conversely, misspecified models were often falsely accepted (i.e., good fit) when sample size was large (N = 1,000) and when the degree of misspecification (as quantified by the population-level RMSEA) did not exceed the threshold for poor model fit based on the fixed cutoff criterion of RMSEA < .06. Comparing results across point estimates and CIs revealed that evaluation of Bayesian model fit as either good or poor (based on fixed cutoff values) was dependent on whether point estimates or CIs were used. Results showed that when CIs were used for model evaluation, fit was largely inconclusive for MCMC models at small sample sizes (N ≤ 100), contradicting conclusions about model fit that would otherwise be drawn from point estimates under the same conditions. For example, as shown in Figure 5, when Model 1 was estimated with a sample size of N = 100, more than 75% of replications for Priors 1-7 resulted in good model fit based on point estimates of RMSEA. However, under the same model/data conditions, when CIs were used to assess model fit, approximately 50% of replications yielded inconclusive fit. The proportion of replications characterized as having inconclusive model fit (based on CIs) was slightly lower for Priors 4 and 5, which were the accurate informative priors with smaller variance hyperparameters. Conversely, the strongly informative inaccurate specification (Prior 9) always resulted in poor model fit based on RMSEA CIs when sample sizes were N ≤ 250. Similar results were observed for CFI and TLI (Figures A2 and A3 of the Supporting Information).

Discussion
The present simulation study was designed to evaluate the effect of prior specifications across different model characteristics on the performance of fit indices in Bayesian CFA. Although there has been increasing interest in Bayesian applications of latent variable modeling in educational and behavioral research (König & van de Schoot, 2018; Levy, 2016; van de Schoot et al., 2017), methodological guidance on model fit evaluation within the Bayesian SEM framework is still limited (Cain & Zhang, 2019; Fife et al., 2022; Levy, 2011; Muthén & Asparouhov, 2012). The recent development of Bayesian approximate fit indices (Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018) has prompted important work on this topic (Asparouhov & Muthén, 2021; Winter & Depaoli, 2022); however, the utility of these fit indices for model fit assessment in Bayesian contexts remains understudied. In the methodological literature, questions have long been raised about the use of approximate fit indices in the frequentist framework (Marsh et al., 2004). Specifically, extensive research has shown that fit indices are impacted by various model characteristics (e.g., Gagne & Hancock, 2006; Jobst et al., 2021; Kenny & McCoach, 2003; Shi et al., 2019; Zhang & Savalei, 2020), which undermines the applicability of commonly used fixed cutoff values (Marsh et al., 2004; McNeish & Wolf, 2021; Pornprasertmanit et al., 2013). Hence, if Bayesian fit indices are to hold any adjudicative value, factors that may affect their performance should be well understood.

The current study focused on how different prior specifications impact model fit. Considerable attention has been paid to the influence of prior choice on parameter estimation in Bayesian latent variable models (e.g., Depaoli, 2014; McNeish, 2016; Smid & Winter, 2020; van Erp et al., 2018). Collectively, these previous studies show that uninformative and inaccurate priors can yield biased estimates, especially in the context of small samples. We add to this corpus of work by documenting the performance of Bayesian fit indices under different prior specifications.
While previous work has evaluated Bayesian fit indices using diffuse priors (Asparouhov & Muthén, 2021; Garnier-Villarreal & Jorgensen, 2020; Hoofs et al., 2018; Winter & Depaoli, 2022), results of the current study show how informative priors for factor loadings influence Bayesian versions of RMSEA, CFI, and TLI. In general, Bayesian estimates of fit were found to behave as expected for different prior specifications when examined across different levels of sample size. That is, informative priors had a larger impact on model fit indices at smaller sample sizes (e.g., N ≤ 100), particularly for priors that were inaccurately centered away from the true value and specified with narrow precision (e.g., variance hyperparameter = .01). When sample sizes were small, the strongly informative inaccurate priors resulted in higher values of RMSEA and lower values of CFI/TLI (indicative of worse model-data fit) compared to values obtained with the other priors. However, as sample size increased, differences in model fit between priors diminished, such that at large sample sizes, results were largely the same for all prior specifications. This finding is consistent with Bayesian theory, as we would expect the prior to contribute less information to the posterior distribution when the data contribute more information in the form of larger N. In addition, informative small-variance priors for nonzero cross-loadings resulted in fit indices that were indicative of better model fit compared to when nonzero cross-loadings were constrained to zero. However, the magnitude of the variance hyperparameter for these small-variance priors (i.e., σ² = .01 vs. σ² = .05) did not meaningfully impact results.
Findings from our study support concerns raised by Garnier-Villarreal and Jorgensen (2020) regarding the inappropriateness of generalizing fixed cutoff values obtained from ML estimation to Bayesian fit indices. Consistent with results reported by Garnier-Villarreal and Jorgensen (2020), the current study found that the sampling variability in the Bayesian fit indices was underestimated by the 90% CIs. As a result, when models were misspecified, coverage rates approached zero as sample sizes increased. Even the diffuse prior yielded intervals that did not capture the ML-based population fit values under conditions of model misspecification and large sample size. Instead, these intervals suggested the misspecified model fit the data better than would be expected based on the population values. Furthermore, in most cases, the population-level fit indices for the misspecified models met the Hu and Bentler (1999) cutoff criterion for acceptable model fit. Hence, when point estimates of the Bayesian fit indices were evaluated using fixed cutoffs, rejection rates for misspecified models generally decreased as sample size increased.
In addition to evaluating model fit using point estimates of the Bayesian fit indices, we showed how results may differ when using 90% credibility intervals for Bayesian RMSEA, CFI, and TLI. At small sample sizes, CIs had large interval widths that spanned the fixed cutoff values, indicating that model fit was inconclusive. As sample size increased, CIs became narrower such that models could be conclusively characterized as having either "good" or "poor" fit. These results were largely consistent across different prior specifications, and in line with those reported in previous investigations that focused on diffuse priors (Asparouhov & Muthén, 2021; Hoofs et al., 2018; Winter & Depaoli, 2022). As noted in Asparouhov and Muthén (2021), the use of CIs in this context can thus help researchers identify whether their sample size is more conducive to the use of fit indices or PPP values. When model fit is inconclusive based on the CI, the sample size is likely too small, and researchers should instead use PPP values to evaluate model fit. Conversely, when the CI provides conclusive model fit, the sample size is likely large enough that PPP values will no longer be useful (Cain & Zhang, 2019). However, even when using CIs to evaluate fit, models with slight misspecifications may still be deemed acceptable if ML-based fixed cutoffs are applied and sample sizes are large (e.g., N ≥ 250).
It is important to note that these findings are limited to the model characteristics we evaluated. For example, our simulation study considered the two-factor CFA model; however, results may not generalize to models with more latent factors. One area for future research, then, is the investigation of Bayesian fit indices with informative priors and larger factor structures. In addition, we only evaluated informative prior specifications for factor loadings. Although the priors used in the current study represent a range of informative and (in)accurate priors, additional research is needed to understand how priors for other parameters (i.e., residual variances and the latent factor covariance matrix) impact fit indices in Bayesian CFA. Previous research has demonstrated that priors for variance parameters can play an important role in Bayesian latent variable modeling (e.g., Depaoli, Liu, & Marvin, 2021; Liu, Zhang, & Grimm, 2016), so future investigations should consider how Bayesian model fit indices behave when estimated with different variance priors. Finally, as we highlight above, fixed cutoff values are known to be inappropriate beyond the modeling conditions for which they were originally recommended (Hu & Bentler, 1999), despite their continued use in practice. As an alternative, researchers can empirically derive values that are specific to the data and model characteristics being evaluated (Millsap, 2007, 2013; Pornprasertmanit et al., 2013). Recently, McNeish and Wolf (2021) developed a software application that uses simulation-based methods to construct data- and model-specific dynamic cutoffs for use in frequentist SEM. An interesting extension of this work would be the development of dynamic cutoffs for Bayesian SEM that applied researchers could easily implement in their own work.