Problems of Domain Factors with Small Factor Loadings in Bi-Factor Models

Abstract Many measurement designs produce domain factors with small variances and factor loadings. The current study investigates the cause, prevalence, and problematic consequences of such domain factors. We collected a meta-analytic sample of empirical applications, conducted a simulation study on statistical power and estimation precision, and provide a reanalysis of an empirical example. The meta-analysis shows that about a quarter of all standardized domain factor loadings is in the range of −.2<λ<.2 and about a third of all domains is measured by five or fewer indicators, resulting in small factor variances. The simulation study examines the associated difficulties concerning statistical power, trait recovery, irregular estimates, and estimation precision for a range of such realistic cases. The empirical example illustrates the challenge to develop measures that produce clearly interpretable domain factors. Study planning and interpretation need to take the (expected) sum of squared factor loadings per domain factor into account. This is relevant even if influences of domain factors are desired to be small, and equally applies to different model variants. We propose several strategies for how researchers may better unlock the bifactor model’s full potential and clarify its interpretation.


Introduction
Bi-factor models (Holzinger & Swineford, 1937) have become increasingly popular in psychological research over the past years (Reise, 2012; Zhang et al., 2021). One major reason is their ability to distinguish domain-specific variation in item responses from a general trait. Unlike traditional models with a set of correlated factors, bi-factor models include an overall trait across different content domains, raters, tasks, or otherwise grouped indicators. This trait is of focal interest in many studies, for example as a general measure of quality of life (Chen et al., 2006), intelligence (Beaujean, 2015; Gignac & Watkins, 2013; Keith & Reynolds, 2018), or psychopathology ("p-factor," Caspi et al., 2014; Lahey et al., 2012; Patalay et al., 2015; Watts et al., 2019). Domain factors capture additional, domain-specific variation. Critically, many common study designs entail weak domain factors (small factor variance). In the following, we consider domain factors to be "weak" to the degree that appropriate statistical tests for their detection have low power, they provide unreliable trait estimates, or their related estimates are small and therefore difficult to interpret. Weak domain factors are abundant in the literature. A review of articles from 2013 and 2014 found non-significant factor loadings or non-significant domain factor variances ("collapsing factors") in 47 of 82 articles (57%; Eid et al., 2017).
Whereas some studies merely account for domain-specific variation to obtain a "clean" measure of the general trait, others are concerned with the domain factors themselves. In validation studies, the presence of certain domain factors indicates a valid measurement design. Domain factor loadings indicate whether indicators are valid exemplars of their assigned domain. In substantive research, the unique associations of the general factor and the domain factors with third variables can be studied independently. In this way, structural equation models (SEM) can test increasingly differentiated theories within complex nomological nets (Eid et al., 2018; Zhang et al., 2021). Finally, practitioners may be interested in domain-specific individual scores (DeMars, 2013; Reise et al., 2013). The distinction between a general factor and domain factors offers a whole new perspective on psychological constructs and their relationships.
In the following section, we introduce the bi-factor model and its notation. An investigation of the causes, prevalence, and consequences of weak domain factors follows. The role of statistical power and the strength of domain factors in confirmatory bi-factor models has not yet been addressed in the literature. Although there are results on the recovery of loading matrices in exploratory bi-factor analysis (Giordano & Waller, 2020), to our knowledge, the problem of weak domain factors has not been specifically investigated in the bi-factor EFA literature, either. Therefore, this study aims to assess which conditions are necessary to reliably detect and estimate domain factors and their loadings, and to compare these conditions to real studies. We discuss how awareness of potentially weak domain factors when designing, choosing, or interpreting measures can drastically improve the utility of bi-factor model applications.

Bi-factor models
Bi-factor models use a general factor across all indicators and a set of domain factors for sets of related indicators (Figure 1). In the symmetrical model variant (S), every indicator loads on both a general factor $\eta_g$ and one domain factor $\eta_s$.
The S bi-factor model of the response vector $Y_i$ of case $i$ is shown in Equation (1).

$$Y_i = \Lambda \eta_i + \varepsilon_i \tag{1}$$

$\Lambda$ is the matrix of factor loadings and $\eta_i$ the vector of latent trait values of case $i$. The error values in the vector $\varepsilon_i$ are assumed to be independently and normally distributed for each indicator variable $Y$.
Equation (2) shows the characteristic loading pattern of bi-factor models: all indicators load on the general factor $\eta_g$ (first column of $\Lambda$) and on one of the $k$ domain factors (further columns). So $\lambda_{s_j s}$ is the loading of the $j$'th item of domain $s$ on the domain factor $\eta_s$ and $\lambda_{s_j g}$ its loading on the general factor $\eta_g$.
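As an illustration of the loading pattern described above, the following is a minimal sketch for a hypothetical design with two domains and two indicators per domain. The dimensions are chosen only for display and are an assumption, not the design of any study discussed here.

```latex
\Lambda =
\begin{pmatrix}
\lambda_{1_1 g} & \lambda_{1_1 1} & 0 \\
\lambda_{1_2 g} & \lambda_{1_2 1} & 0 \\
\lambda_{2_1 g} & 0 & \lambda_{2_1 2} \\
\lambda_{2_2 g} & 0 & \lambda_{2_2 2}
\end{pmatrix},
\qquad
\eta_i = \left( \eta_{g,i},\; \eta_{1,i},\; \eta_{2,i} \right)^{\top}
```

Each row holds exactly two non-zero entries: the loading on the general factor in the first column and the loading on the indicator's own domain factor in one of the remaining columns.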
Since all factors of the model are orthogonal in the S variant, the variance-covariance matrix $\Phi$ of its factors is a diagonal matrix. In the S-1 bi-factor model variant proposed by Eid et al. (2017), one domain factor is omitted (cf. Figure 1). The presence of the reference domain, whose indicators exclusively load on $\eta_g$, enables a proper variant of the bi-factor model in which the remaining domain factors may correlate freely (S-1c).¹ In the S-1 and S-1c models, $\eta_g$ is interpreted as the common trait as assessed with the reference domain. In the terminology of classical test theory (CTT; Novick, 1966), in S-1 models the general trait combines the common true score of all domains and the true score specific to the reference domain (Eid et al., 2017). Compared to the S model, the S-1 and S-1c models therefore provide improved clarity in the interpretation of $\eta_g$ if domains are not randomly sampled. If the domains are a meaningful selection, as in most multifaceted psychological measures, "defining the latent variables of the [S bi-factor and second-order] models [ … ] as random variables on a well explicated set of possible outcomes" (Eid et al., 2017, p. 548) could not be achieved.

Figure 1. Bi-factor model path diagram with a general trait $\eta_g$ and four domain traits ($\eta_{1-4}$, S model); only if some items exclusively load on the general factor (e.g., omitted dashed $\eta_1$, S-1 model), freely estimating correlations between domains is a reasonable option (dotted double-headed arrows, S-1c model).

¹ For a discussion of problems regarding the estimation of correlated domain factors in the S model see Markon (2019). Conceptually, a full set of positively correlated domain factors (= correlations between all indicators) and the general factor are to some degree redundant, leading to problems in both estimation and interpretation.
To examine problems with weak domain factors, a measure of their strength is needed. In the following, we use the sum of squared loadings $SS_\lambda$ in the fully standardized bi-factor model with indicator and trait variances equal to one (Equation (3)).

$$SS_\lambda(\eta_s) = \sum_{j} \lambda_{s_j s}^2 \tag{3}$$

This quantity measures the total share of indicator variance of the factor. $SS_\lambda = 1$ means that the factor explains a total indicator variance equal to the variance of one indicator.² To better understand the influence of each factor, the variance of each indicator $Y_{s_j}$ can be decomposed into three components: consistency, specificity, and error.

$$Var(Y_{s_j}) = \lambda_{s_j g}^2 + \lambda_{s_j s}^2 + \sigma_{\varepsilon_{s_j}}^2 = 1 \tag{4}$$

Note that the simplified Equation (4) assumes a fully standardized model. The first term $\lambda_{s_j g}^2$ is the consistency of the indicator: the proportion of variance due to the general trait $\eta_g$. The second term $\lambda_{s_j s}^2$ is the specificity of the indicator: the proportion of variance due to the domain-specific trait $\eta_s$. The remaining error is assumed to be independently, randomly, and normally distributed with a variance of $\sigma_{\varepsilon_{s_j}}^2$. The reliability (Rel) of an indicator is the proportion of its variance that is explained by the latent variables, which in the fully standardized model is the sum of consistency and specificity, $\lambda_{s_j g}^2 + \lambda_{s_j s}^2$.
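The decomposition and the strength measure described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a fully standardized model; the function names and the example loadings are hypothetical and not taken from the study.

```python
def decompose_indicator(lambda_g, lambda_s):
    """Split the unit variance of a fully standardized indicator."""
    consistency = lambda_g ** 2   # share due to the general trait
    specificity = lambda_s ** 2   # share due to the domain-specific trait
    reliability = consistency + specificity
    error = 1.0 - reliability     # what is left is error variance
    if error < 0:
        raise ValueError("squared loadings exceed the unit indicator variance")
    return {"consistency": consistency, "specificity": specificity,
            "error": error, "reliability": reliability}


def sum_squared_loadings(domain_loadings):
    """Strength of one domain factor: sum of its squared standardized loadings."""
    return sum(lam ** 2 for lam in domain_loadings)


# Hypothetical indicator with a high general and a low domain loading.
parts = decompose_indicator(lambda_g=0.7, lambda_s=0.2)
# Three indicators that each load 0.5 on the same domain factor.
strength = sum_squared_loadings([0.5, 0.5, 0.5])
print(parts)
print(strength)
```

With these illustrative values the indicator is fairly reliable (about .53), yet only about .04 of its variance is domain-specific. Three loadings of .5 give a domain factor strength of .75, less than the variance of a single indicator.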
Weak and "anomalous" domain factors
Weak and anomalous domain factors are a consequence of the structure of bi-factor models and the typical construction process of psychological measures. There are several reasons why weak domain factors (desired or not) should be expected in practical applications. First, the measurement of domain factors and the general factor compete for each indicator; indicator reliability is split into consistency and specificity (Equation (4)). Standardized factor loadings for both are typically lower compared to models with indicators relating to only one factor each. In indicators with high consistency, the ratio of domain factor variance to error variance can be small, even though the reliability is high. It is a frequent intention to use reliable total scores as the main criterion when applying measures in practice (e.g., conscientiousness and general intelligence, rather than their facets, in personnel selection). Even if another purpose of a measure is to discern different parts of a construct (e.g., different facets of a personality trait or different aspects of intelligence), a likely concern is that the indicator still measures the overall construct (e.g., the personality trait or general intelligence). A key challenge is that factor loadings in correlated-factor models confound the relationship of indicators to general variance (shared among all domains) and domain-specific variance. Therefore, indicators without domain-specific variance are not automatically disqualified. The goal conflict between measuring a general trait and domain-specific variance may be more or less likely to occur and more or less easy to solve depending on the nature of the construct and the other desired properties of the measure.
Second, each domain consists of a fraction of the indicators of the overall measure. Measures based on a correlated-factor model had their number of indicators chosen based on stronger factors, which include a substantial portion of the general trait from the bi-factor model. Factors in the correlated-factor model contribute to the general factor in the bi-factor model to the extent of their intercorrelation. The leftover domain-specific variance can be tiny. Especially problematic are short measures that were reduced to a barely acceptable length. They may measure a general trait or a set of correlated factors efficiently (shortened as much as possible without their reliability falling below a target value) but fail to produce reliable domain-specific factors in bi-factor models. As we will show in more detail below, if researchers choose the desired length of a measure without explicitly considering the consequences for domain factor measurement, they are in danger of choosing too few indicators to properly recover them.
For these reasons, one should expect a substantial portion of domain factors to have few and small factor loadings ($\lambda_s < .2$) and therefore little variance, even before considering the substantive research context. Given that the surge in popularity of the bi-factor model (Reise, 2012; Zhang et al., 2021) is in large part based on reanalyses of older measures, this disconnect between the listed particularities of the bi-factor model and the development process of the measures should be expected to lead to weak domain factors and small domain-factor loadings. Some research areas may welcome such outcomes, potentially because they adequately reflect the trait of interest. We nevertheless argue that obtaining weak domain factors should not be an accident. Researchers should be aware of this issue before conducting their research.
Indeed, Eid et al. (2017) showed an abundance of problematic empirical examples. Not only did many domain factor loadings fail to differ significantly from zero; multiple domain factors "collapsed" entirely, showing non-significant variance estimates or a set of non-significant factor loadings. Some extreme cases had negative factor variance estimates.³ This led many researchers to question or modify their application of the bi-factor model (see also Watts et al., 2019) and Eid et al. (2017) to speak of "anomalous results". The prevalence of studies with at least one anomaly was 57% in their sample of articles that used a bi-factor model and were published in 2013 or 2014. This number might have been even higher if there were unpublished studies or if researchers quietly switched to another model.
Problematic results were one reason why Eid et al. (2017) questioned the use of the symmetrical bi-factor model (S). They criticized its use in cases where domains are specifically selected (single-level sampling structure) as opposed to randomly sampled (two-level sampling structure). They base their argument on Stochastic Measurement Theory (SMT; Steyer, 1989):

From the perspective of SMT, the latent variables in traditional bifactor and related G-factor models cannot be defined as random variables on a well explicated random experiment when only a single-level sampling design is considered. [ … ] From the scope of SMT many of the anomalous results encountered in empirical applications in fact have to be expected when domains are not randomly selected or when they cannot be considered interchangeable. (Eid et al., 2017, p. 555)

They consequently introduced the S-1 and S-1c variants⁴ as sound alternatives from the perspective of SMT (Eid et al., 2017, p. 550ff). They did not discuss the effect of small domain strength, insufficient statistical power, or the rate at which anomalous results occur in S-1 models. Because they classified all non-significant estimates of factor loadings and factor variances as "anomalous" results due to badly specified models, we consider the current work a crucial extension of theirs: it inquires into alternative explanations. If "anomalous" results are equally frequent in S and S-1 models, the consideration of the sampling structure would be irrelevant to problems with weak domain factors.

Statistical power, effect size, and estimation precision
In the context of our simulation study, we consider domain factors to be weak if they cause a problem: a) if their associated null hypothesis cannot be rejected (the model without the domain factor fits the data equally well, given a finite, reasonable sample size) or b) if they produce (comparatively) unreliable trait estimates, meaning that the trait recovery ($R^2$) is half as good as for the general factor (or worse). One purpose of the simulation study is to provide a range of benchmark values for applied researchers to compare empirical results to. To understand the surprisingly high prevalence of null results in the literature, statistical power needs to be taken into account. For power analysis, the size of the effect needs to be known: how large are estimates of domain factor loadings and domain factor strengths in empirical applications? Moreover, for many applications, it is not enough to show that certain parameters in the model differ significantly from 0. Sufficient model parameter estimation precision and trait recovery precision are crucial for interpretation. Especially studies that use domain factors to predict other variables or use domain-specific scores rely on unbiased trait estimates and sufficient precision.
The presence of domain-specific variance may be a mere nuisance to the measure of the general factor for some purposes or areas of research. In that sense, weak or completely absent domain factors are desirable, as long as they do not produce irregular estimates. The corresponding ideal case is a model with a single general factor explaining all systematic variance of the indicators. This is especially true for applications that assign specific factors to different raters or alternative methods of measurement (e.g., Frey et al., 2017; Scholz et al., 2022). These factors do not necessarily have a useful substantive meaning. Instead, they are influences that should be controlled for. In such scenarios, researchers may want to avoid strong domain factors. Nevertheless, judging their strength and impact may be the focal point of a study. A research question could be whether two measures (or two types of raters) can be treated as interchangeable or whether biases are introduced by choosing one over the other. For this purpose, the ability to judge the statistical power to detect undesired domain-specific influences and the precision with which they are captured by the model is relevant.

The current study
To identify the necessary conditions to reliably detect and properly estimate domain factors and their loadings, we conducted a simulation study. We compare its results to the conditions in a meta-analytic sample of empirical applications. The meta-analysis uses the reported factor loading matrices of the studies listed by Eid et al. (2017). It tests our arguments on why weak domain factors should be expected in practice: How large are domain factor loadings and general factor loadings typically? How many indicators per domain are used? How prevalent are reliable indicators with low specificity ($\lambda_g > .5$ and $\lambda_s < .2$)? Do null results happen in small samples ($n \le 300$) only?
In the simulation study, the measurement design was varied to answer the following questions: What is the strength of a detectable domain factor under realistic conditions? Which measurement designs provide a relatively adequate recovery of domain trait scores? What are the core influences on the precision of domain factor loading estimates? Under which conditions do unacceptable "anomalous" results (negative domain factor variance estimates, non-convergence) occur? Can the newly proposed model variants (S-1 or S-1c) reduce the number of irregular results or null results?
After presenting the meta-analysis and the simulation study, we reuse open data to provide an empirical example that facilitates the discussion. The subsequent discussion combines the meta-analysis results and simulation results to examine the origins and consequences of the outlined practical challenges. We propose several steps to maximize the utility of bi-factor applications and outline limitations.

Methods
For the analysis of factor loadings and $SS_\lambda$ of domain factors in the literature, we chose to adopt the list of empirical examples in Eid et al. (2017) to enable comparison with their work. These studies were originally sampled from PsycInfo using the terms "bifactor" and "bi-factor" (all fields), and include publications from 2013 or 2014. They were coded to contain either a non-significant domain factor variance estimate (Eid et al., 2017, Table 1) or a non-significant domain factor loading estimate (Eid et al., 2017, Table 2). We searched the 47 articles for S bi-factor loading matrices ($\Lambda$). Only one set of estimates per sample was included so as not to bias the overall result by repetition. Two articles reported two bi-factor studies on unique samples, which were both included. We excluded 21 articles from subsequent analysis: incomplete report of estimates (1), IRT model (1), exploratory model (1), free estimation of domain factor correlations (5), no consideration of S model variant (4), exclusive report of adapted models (7), outlier⁵ (1). We reconstructed one unreported S bi-factor model based on the reported correlation matrix.⁶ Reversely keyed indicators and domain factors were recoded for the current analysis so that all factor loadings are expected to be positive. An indicator or domain was considered reversely keyed if the factor loadings were expected to be negative based on the study design and theory.
The final sample included 28 models from 26 articles (a reference list can be found in the Appendix). Two were coded by Eid et al. (2017) as including a non-significant domain factor variance estimate. The other 26 were coded as including (at least one) non-significant domain factor loading estimate. The sample of models includes 3 ability tests, 21 self-report scales, and 4 other-report scales. Table 1 shows the large variety of constructs encountered in the sampled articles (see also Eid et al., 2017, Tables 1 and 2). We sorted the constructs into three broad categories: Clinical/health constructs include mental and physical health related outcomes and behaviors. Personality constructs include non-clinical, relatively stable interindividual differences. Education constructs are specific to the education context. Of the 28 models in our analysis, 18 dealt with clinical/health constructs, 6 with personality constructs, and 4 with education constructs. A full table linking articles to constructs can be found on the OSF page of this article.

Another one had to be omitted due to irregular estimates. The error variance of an indicator variable was estimated to be impossibly large and negative, leading to uninterpretable results.

Results
Figure 2 shows the combined distribution of factor loadings on the general factor ($\lambda_g$) and the domain factor ($\lambda_s$) for each indicator variable.⁷ Indicator reliabilities show a large variability (M = 0.54, SD = 0.19). This may reflect differences in the breadth of constructs as well as differences in the quality of the selected indicators. The sizes of $\lambda_s$ and $\lambda_g$ are limited by each other: 79% of all indicators have a very small domain factor loading ($-.2 < \lambda_s < .2$), but also a reasonably high factor loading on the general factor ($\lambda_g > .5$). This likely reflects indicator selection procedures that focus on the measurement of the general trait or maximize the internal consistency of the whole measure. 17 of 28 models include at least one negative factor loading estimate. Note that negatively keyed factors and indicators were recoded before plotting, so these are unexpected results. Figure 3 shows the number of indicators per domain. 31.25% of all domains were measured by 5 or fewer indicators. What is the resulting strength of the domain factors? Figure 4 shows the combined distribution of domain factor $SS_\lambda$ and sample sizes. 52.50% of factors have $SS_\lambda < 1$, and 16.25% have $SS_\lambda < 0.5$. 26 of 28 models were included because of a non-significant domain factor loading, but most of them show at least one whole weak domain factor (if judged by $SS_\lambda < 1$, more detailed discussion below). Weak domain factors could be the product of noise in small samples even if the true underlying factor is strong in the population. Figure 4 shows that a lack of power due to insufficient sample size alone cannot explain weak domain factors: they occur across all sample sizes. In conclusion, the presence of at least one weak domain factor ($SS_\lambda < 1$) is the norm in the sampled bi-factor models, not the exception.
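The descriptive criteria used in this section (small domain loading combined with a high general loading, five or fewer indicators per domain, and a sum of squared domain loadings below one) can be applied to any reported loading matrix. The following Python sketch shows one way to do so. The function name, the input layout, and the loadings are hypothetical and serve only as an illustration; they are not data from the sampled articles.

```python
def summarize_loadings(domains):
    """Apply the descriptive criteria to reported standardized loadings.

    domains: dict mapping a domain name to a list of
             (general loading, domain loading) pairs, one pair per indicator.
    """
    pairs = [pair for items in domains.values() for pair in items]
    # Reliable indicators with low specificity: general > .5, |domain| < .2.
    low_specificity = [p for p in pairs if p[0] > 0.5 and abs(p[1]) < 0.2]
    # Strength of each domain factor: sum of squared domain loadings.
    strength = {name: sum(ls ** 2 for _, ls in items)
                for name, items in domains.items()}
    return {
        "share_low_specificity": len(low_specificity) / len(pairs),
        "strength": strength,
        "weak_domains": [name for name, ss in strength.items() if ss < 1.0],
        "short_domains": [name for name, items in domains.items()
                          if len(items) <= 5],
    }


# Hypothetical loadings for two domains (illustration only).
example = {
    "A": [(0.70, 0.10), (0.60, 0.15), (0.65, 0.30)],
    "B": [(0.40, 0.60)] * 6,
}
summary = summarize_loadings(example)
print(summary["weak_domains"], summary["short_domains"])
```

In this made-up example, domain A is both short and weak, whereas domain B, with six indicators loading .6 each, reaches a strength above two.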

Methods
In the simulation study, random data for bi-factor models of the three model variants were generated. In S-1 and S-1c models, the first of four domain factors was omitted. For S-1c data generation the correlation between the second and the third domain was set to $r_{23} = .5$ and all other correlations were set to zero. Conditions relevant to statistical power and estimation precision were systematically varied (Table 2).⁸ To vary the strength of the domain factors, the factor loading size $\lambda_s$ and the number of indicators per domain were varied. Factor loadings were held constant across all indicators and invariant during data generation, which greatly simplifies interpretation. We only included domain factor loadings that are positive and at least $\lambda_s = .2$, so it can be checked whether sampling variation of truly admissible values explains the occurrence of negative or zero factor loadings in practice (Figure 2). For both the sample size and $\lambda_s$, realistic values and values in a problematic range were included (down to $n = 200$ and $\lambda_s = 0.2$). The domain factor loadings lie in a range that was frequently observed in the reviewed empirical example studies ($.2 \le \lambda_s \le .6$, dashed lines in Figure 2). Given these fixed values for $\lambda_s$, the reliability of the indicators was varied using two different values for $\lambda_g$. This design produces reliabilities between 0.29 and 0.85 across conditions. All factor loadings are fully standardized because random error variance was added to all indicators to reach $\sigma^2_Y = 1$ and traits were sampled with a variance of one. Since S-1 and S-1c models have no variance attributed to the first domain factor, and $\lambda_g$ was held constant, they have a higher proportion of error variance on indicators of the first domain. Only continuous data with multivariately normally distributed trait values and error terms were considered. Although contamination with other types of errors is frequent in practice (Micceri, 1989), and the true distribution of latent traits is debatable, normally distributed traits and errors are prototypical for this model class and frequently assumed in practice. The fully crossed design resulted in 300 simulation conditions with 1008 replications per condition.

The correlation between domain factors in the S-1c model also affects the statistical power and estimation precision (Yuan et al., 2010), but was not varied beyond the distinction between S-1 and S-1c models. Higher correlations were shown to lead to both increases and decreases in standard errors for both loadings and factor variances in correlated-factor models depending on the other model parameters (Yuan et al., 2010, Table 3). It is unclear whether such differences are substantial in bi-factor models and how they would propagate to other factors in the model. As seen below, the difference in statistical power between the S-1 and S-1c model, which is essentially a large variation of a domain factor correlation (0 vs. 0.5), proved to be relatively inconsistent and unimportant in comparison to other factors.
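The data-generating scheme described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the simulation code of the study: the function name, its parameters, and the seed are hypothetical, and traits and errors are drawn case by case from independent normal distributions.

```python
import math
import random


def simulate_bifactor(n, n_domains, n_per_domain, lambda_g, lambda_s,
                      variant="S", r23=0.5, seed=1):
    """Draw n cases of fully standardized indicators from a bi-factor model.

    variant: "S"    all domain factors present and orthogonal,
             "S-1"  first domain factor omitted (reference domain),
             "S-1c" as S-1, with domains two and three correlated at r23.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        eta_g = rng.gauss(0.0, 1.0)
        eta_s = [rng.gauss(0.0, 1.0) for _ in range(n_domains)]
        if variant == "S-1c" and n_domains >= 3:
            # Correlate the third with the second domain, keeping unit variance.
            z = rng.gauss(0.0, 1.0)
            eta_s[2] = r23 * eta_s[1] + math.sqrt(1.0 - r23 ** 2) * z
        row = []
        for s in range(n_domains):
            # In S-1 and S-1c the reference domain has no domain factor,
            # so its indicators carry more error variance.
            lam_s = 0.0 if (variant != "S" and s == 0) else lambda_s
            error_sd = math.sqrt(1.0 - lambda_g ** 2 - lam_s ** 2)
            for _ in range(n_per_domain):
                row.append(lambda_g * eta_g + lam_s * eta_s[s]
                           + rng.gauss(0.0, error_sd))
        data.append(row)
    return data


sample = simulate_bifactor(n=200, n_domains=4, n_per_domain=3,
                           lambda_g=0.5, lambda_s=0.2)
print(len(sample), len(sample[0]))  # 200 12
```

Because the error variance is set to one minus the squared loadings, every indicator has a population variance of one, which is what makes the loadings fully standardized. With the smallest loadings considered here (.5 and .2) the indicator reliability is .29; with the largest (.7 and .6) it is .85.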
For each sample dataset, all model variants were estimated using maximum likelihood (ML) estimation with the default settings of lavaan (Version 0.6-7, Rosseel, 2012). The fixation of the first factor loading to one for identification made negative estimates of the domain factor variance possible. In S-1c models, all correlations between domain factors were freely estimated. This results in a fully crossed design regarding data-generating model variant and estimation model variant. To analyze anomalous results, improper solutions (e.g., negative variance estimates) were retained. In the following, converged solutions are those for which lavaan indicated convergence and standard errors of estimates were obtained. If not specified otherwise, the presented results refer to correctly specified models only, meaning the data-generating model and the estimated model variant are the same. Results on domains are presented as a summary (mean) for domains two, three, and four, even for the S-1c models. The distinction between the uncorrelated fourth domain and the other domains in the S-1c model did not prove relevant in any of the analyses.
The statistical power to detect domain factors was measured in three different ways. First, the proportion of significant variance estimates of the domain factor was calculated based on the Wald-test against zero with $\alpha = .05$. This test corresponds to the "anomalous results" in Eid et al. (2017) (non-significant domain factor variance estimates). Results for this test are part of the default summary output of lavaan. Note that this is a test against the boundary of the parameter space ($H_0: Var(\eta_s) = 0$). For this reason, the distributional assumption is violated and results are conservatively biased (Molenberghs & Verbeke, 2007; Stoel et al., 2006). The uncorrected version is used to represent what plausibly was the general practice in the sample of studies above. Second, the proportion of significant likelihood-ratio tests (LRT) comparing the model with and the model without the first domain factor was calculated. The LRT tests the difference in model misfit $\Delta\chi^2 \sim \chi^2(df_{ModH_0} - df_{ModH_1})$ between the correctly specified model $H_1$ (which includes the domain factor's variance and its loadings) and the incorrectly specified model $H_0$ (which by omitting the domain factor essentially fixes the latent variances and all related factor loadings at 0 and is therefore nested within the first model) against 0.
This is a more adequate test to decide whether the domain in question should be part of the model. The LRT is based on all estimated parameters related to the domain in question, whereas the Wald-test is based solely on the variance estimate. Therefore, differences in the results can be expected. Note that non-converged models were counted as false negatives, so the reported values for statistical power can never exceed the convergence rate. Omitting non-converged cases would bias the results in conditions with a low convergence rate. Furthermore, researchers planning a study are likely more interested in the probability of a successful study than in the conditional probability given convergence. Third, the theoretical power of this LRT was computed for all simulation conditions, testing the correctly specified model variant against the same model without the domain factor in question. The model without the domain factor was fit to the theoretical variance-covariance matrix under the true model with the domain factor present. The misfit between the resulting model-implied variance-covariance matrix and the true variance-covariance matrix was then used to compute the statistical power of the LRT using the semPower R package (Moshagen, 2021).
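The counting rule for simulated power (non-converged replications count as false negatives) can be made explicit with a short Python sketch. The function name and the convention of marking non-convergence with `None` are assumptions made for illustration; the p-values in the usage example are made up.

```python
def empirical_power(p_values, alpha=0.05):
    """Summarize the replications of one simulation condition.

    p_values: one entry per replication; None marks a replication
              that did not converge (counted as a false negative).
    """
    n_total = len(p_values)
    n_converged = sum(p is not None for p in p_values)
    n_significant = sum(p is not None and p < alpha for p in p_values)
    return {
        "convergence_rate": n_converged / n_total,
        # Unconditional power: bounded above by the convergence rate.
        "power": n_significant / n_total,
        # Conditional on convergence: shown for contrast only.
        "power_given_convergence": (n_significant / n_converged
                                    if n_converged else float("nan")),
    }


# Four hypothetical replications, one of which did not converge.
result = empirical_power([0.01, 0.20, None, 0.03])
print(result)
```

In this toy case the unconditional power is .50, below the convergence rate of .75, whereas the power conditional on convergence would be two thirds.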
For individual indicators, the average number of significant indicators per domain was calculated under each condition. Significance was judged based on the Wald-test of the factor loadings with $\alpha = .05$. To assess the quality of estimated trait values, the squared correlations between the true and the estimated trait values ($R^2$) were calculated. This is the proportion of variance of the estimated trait values that is determined by the true trait. Trait values were estimated using regression factor scores (DiStefano et al., 2009), as implemented in lavaan (Version 0.6-7, Rosseel, 2012). To assess the precision of factor loading estimates, root mean square errors (RMSEs) were calculated for each repetition. They were computed based on the differences between the estimates and the true factor loadings in the population, given by the simulation condition.
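Both precision measures are simple to compute once true and estimated values are available. The sketch below shows them in plain Python; the function names and the numbers in the usage example are hypothetical.

```python
import math


def squared_correlation(true_scores, estimated_scores):
    """Trait recovery: squared Pearson correlation of true and estimated scores."""
    n = len(true_scores)
    mean_t = sum(true_scores) / n
    mean_e = sum(estimated_scores) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_scores, estimated_scores))
    var_t = sum((t - mean_t) ** 2 for t in true_scores)
    var_e = sum((e - mean_e) ** 2 for e in estimated_scores)
    return cov ** 2 / (var_t * var_e)


def loading_rmse(estimates, true_loadings):
    """Root mean square error of loading estimates within one replication."""
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, true_loadings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for one replication.
recovery = squared_correlation([-1.0, 0.0, 1.0, 2.0], [-0.5, 0.2, 0.4, 1.1])
rmse = loading_rmse([0.5, 0.3], [0.4, 0.4])
print(recovery, rmse)
```

Perfectly recovered trait scores give a squared correlation of one, and the RMSE is expressed in the metric of the standardized loadings.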
To complete the list of potential "anomalous" results discussed by Eid et al. (2017), the proportion of cases with at least one negative domain factor variance estimate was computed. The simulation did not replace (or tweak the estimation of) cases that did not converge. Instead, convergence rates are analyzed below.
To assess the importance of the simulation conditions (Table 2) for each outcome, we estimate general linear⁹ models. Because these models merely serve to indicate the relevance of the conditions, we use a simple baseline model without interactions. For the parameters with multiple conditions on a metric scale ($n$ and $\lambda_s$), we also include a quadratic term to allow for non-linear effects. To assess the importance of a given parameter, we compare this baseline model (Equation (5)) to the model without the term(s) relating to this parameter. In Equation (5), outcome refers to all the individually analyzed outcomes (statistical power, estimation precision, … ) and variant is a dummy-coded factor with three levels. For brevity, we only report the p-value of the F-test for model comparison, as well as the difference in adjusted $R^2$. We describe any predictor with a $\Delta R^2 < .01$ (equivalent to $r < 0.1$) as irrelevant, regardless of its statistical significance.
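One plausible form of the baseline model described above is sketched here. The coefficient labels $b_0, \dots, b_8$, the symbol $m$ for the number of indicators per domain, and the dummy variables $D$ are hypothetical notation. Entering $\lambda_g$ and $m$ as linear terms is an assumption based on the verbal description, not a reproduction of Equation (5).

```latex
\text{outcome} = b_0 + b_1 n + b_2 n^2
               + b_3 \lambda_s + b_4 \lambda_s^2
               + b_5 \lambda_g + b_6 m
               + b_7 D_{\text{S-1}} + b_8 D_{\text{S-1c}} + e
```

Under this sketch, judging the importance of the sample size, for example, would mean comparing the full model with the model that drops both $b_1 n$ and $b_2 n^2$, and reporting the resulting change in adjusted $R^2$.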

Domain factor detection
In general, the power of the LRT tends to exceed that of the Wald-test (McCulloch & Searle, 2004, p. 150).
In the current simulation results, the LRT for model comparison consistently shows superior statistical power to the Wald-test of the factor variance. There is no condition with a meaningful advantage of the Wald-test. A substantial advantage of the LRT appears under many conditions: Under conditions where at least one test has a power estimate below 1 (not all replications significant), the mean difference in statistical power is 0.26 in favor of the LRT (additional figure in supplementary materials).
Figure 5 presents an overview of the power of the LRT depending on SS_λ and sample size. The simulated values (including non-converged cases as false negatives) are connected via vertical lines with the theoretical values. The model variant is irrelevant to the statistical power of the LRT to detect domain factors (p = 0.62, ΔR² = 0.00). Sample size (p = 0.00, ΔR² = 0.10), number of indicators per domain (p = 0.00, ΔR² = 0.05), loading on the general factor (p = 0.00, ΔR² = 0.02), and the size of the domain factor loadings (p = 0.00, ΔR² = 0.52) all contribute uniquely to the prediction of statistical power. Consider a domain factor with SS_λ = 0.75, based on three standardized domain factor loadings of 0.5: The LRT easily detects the presence of the domain, even in samples of n = 200 (1 − β = 0.98). For smaller effects, there is a steep drop in statistical power. Judging by the relationship between SS_λ and the statistical power (Figure 5), adding a single indicator with λ_s ≥ .4 (ΔSS_λ ≥ 0.16) can improve power drastically. Realistic variations in the reliability of indicators beyond their loading on the domain factor (λ_g = .5 (circles) vs. λ_g = .7 (triangles)) result in large differences in statistical power (up to Δ(1 − β) = 0.46). The blindness of the theoretical analysis to non-convergence is a major cause of the difference between the theoretical and simulated power under challenging conditions. Table 3 compares the cumulative results of the LRT by model variant for correctly specified models. Across conditions, the model variant barely influences convergence or power; S-1 models converge slightly more often. This explains a slight increase in the proportion of non-significant results because convergence is most often an issue in low-power conditions. In case of misspecification, there are much larger differences (see section on convergence).
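The arithmetic behind the example above can be made explicit. The following minimal Python sketch assumes that the domain strength is the sum of squared standardized domain factor loadings of a domain factor, which matches the value of 0.75 for three loadings of 0.5; the function name is hypothetical.

```python
def sum_squared_loadings(domain_loadings):
    """Domain strength: sum of squared standardized domain factor loadings."""
    return sum(loading ** 2 for loading in domain_loadings)


# Three indicators with standardized domain factor loadings of 0.5.
base = sum_squared_loadings([0.5, 0.5, 0.5])

# Adding a single indicator with a domain factor loading of 0.4
# raises the domain strength by 0.4 ** 2 = 0.16.
extended = sum_squared_loadings([0.5, 0.5, 0.5, 0.4])

print(base, extended, extended - base)
```

Because loadings enter as squares, one indicator with a loading of 0.4 adds as much to the domain strength as four indicators with loadings of 0.2.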
Figure 6 presents an overview of the statistical power of the test of domain factor loadings. The model variant is irrelevant to the statistical power of the test of domain factor loadings (p = 0.41, ΔR² = 0.00). Sample size (p = 0.00, ΔR² = 0.08), number of indicators per domain (p = 0.00, ΔR² = 0.04), loading on the general factor (p = 0.00, ΔR² = 0.02), and the size of the domain factor loadings themselves (p = 0.00, ΔR² = 0.56) all contribute uniquely to the prediction of statistical power. The more indicators a domain factor has, and the less error variance its indicators have (higher λ_g), the more precisely its loadings are estimated (for an analytical approach, see Yuan et al. (2010)). Under favorable circumstances (λ_g = .7, m = 6), a sample size of n = 300 is more than sufficient for λ_s = .3 (in the population) to be detected with high power (1 − β = 0.99). This power is much higher than under realistic but much less favorable conditions (λ_g = .5, m = 3; 1 − β = 0.51). To compensate for this, the sample size would have to be increased to n > 1000 (1 − β > 0.95).
Footnote 9. For outcomes on a scale of 0 to 1, we considered linear models to be sufficient, because they detect the presence of monotonic effects, and their easily interpretable coefficient of determination makes it possible to roughly order the effects by importance. Binomial regression would not have offered an easy-to-interpret coefficient of determination, and a logit transform would have led to many infinite values due to observed relative frequencies of exactly 1.

Parameter recovery
The distribution of the RMSE of domain factor loading estimates is heavily skewed and includes outliers from irregular estimates. Therefore, Figure 7 shows the median of the RMSE distribution across replications. Note that for a small proportion of replications, the RMSE was substantially higher (Footnote 10). The model variant is irrelevant to median estimation precision of domain factor loadings (p = 0.38, ΔR² = 0.00). Sample size (p = 0.00, ΔR² = 0.16), number of indicators per domain (p = 0.00, ΔR² = 0.05), loading on the general factor (p = 0.00, ΔR² = 0.05), and the size of the domain factor loading itself (p = 0.00, ΔR² = 0.34) all contribute uniquely to the prediction of the estimation precision. Domain factor loadings that are relatively small in the population are estimated with less precision than larger ones (Footnote 11). Higher overall indicator reliability (higher λ_g) and more indicators per domain increase precision. The problem case that a domain factor loading is truly substantial but estimated near zero can only be expected under a combination of multiple adverse conditions. For example: Assuming a normal distribution of estimates and RMSE = 0.05 (dashed line), only 2.28% of the estimates of λ_s = .3 fall at 0.2 or lower. Only very few near-zero loadings can be explained by estimation uncertainty (cf. Figure 2). This could also be understood from the estimated standard errors and confidence intervals of the loading estimates in empirical studies reporting negative or near-zero estimates.
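The 2.28% figure can be checked with the normal distribution function. The sketch below assumes, as in the text, normally distributed estimates that are centered on the true loading, so that the RMSE can be treated as the standard deviation of the estimates; the function name is hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


true_loading = 0.3   # population domain factor loading
rmse = 0.05          # treated as the standard deviation of the estimates
threshold = 0.2      # estimates at or below this value count as "small"

# Share of estimates expected at or below the threshold; the threshold
# lies two standard deviations below the true loading.
share_at_or_below = normal_cdf(threshold, true_loading, rmse)
print(round(100 * share_at_or_below, 2))
```

The same function can be reused with a larger RMSE to see how quickly this share grows under less favorable conditions.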
Figure 8 shows that domain trait recovery barely improves with increased sample size and improves much more slowly with increased effect size than statistical power does. The model variant (p = 0.00, ΔR² = 0.00) and sample size (p = 0.00, ΔR² = 0.01) are irrelevant to domain trait recovery. The number of indicators per domain (p = 0.00, ΔR² = 0.07), loading on the general factor (p = 0.00, ΔR² = 0.04), and the size of the domain factor loadings (p = 0.00, ΔR² = 0.88) all contribute uniquely to the prediction of trait recovery. At SS_λ = 1, even in large samples only about 50-70% of the variance of the factor score is determined by the true trait. Below SS_λ = 1, this value quickly declines even further, falling below half the typical value of the general trait (≈ 0.7 to 0.95, see below).
Footnote 11. The same absolute difference on the scale of λ_s is larger on the scale of λ²_s (indicator variance) for larger values of λ_s. The truth of this claim, therefore, depends on the scale.
Figure 9 shows the influence of the domain traits on the recovery of η_g. The sample size (p = 0.00, ΔR² = 0.00) is irrelevant to general trait recovery. The model variant (p = 0.00, ΔR² = 0.03), the number of indicators per domain (p = 0.00, ΔR² = 0.12), the loading on the general factor (p = 0.00, ΔR² = 0.67), and the size of the domain factor loadings (p = 0.00, ΔR² = 0.15) all contribute uniquely to the prediction of general trait recovery. Importantly, the recovery of η_g gets worse the higher the domain factor loadings are (for constant general factor loadings). This may be counter-intuitive because it means that less reliable indicators (lower λ²_s and higher σ²_e) produce more reliable factor scores of η_g. That this effect seems strongest for the S model is probably a consequence of the additional domain factor. The model variant in Figure 9 refers to both data generation and estimation. If instead the S-1 or S-1c model is estimated on S data, the recovery of η_g is worse (Footnote 12; additional figure in supplementary materials).

Anomalous results
The main contributors to convergence problems of correctly specified models (Figure 10) are weak domain factors. The model variant (p = 0.49, ΔR² = 0.00) is irrelevant for the rate of convergence. The sample size (p = 0.00, ΔR² = 0.10), the number of indicators per domain (p = 0.00, ΔR² = 0.03), the loading on the general factor (p = 0.00, ΔR² = 0.02), and the size of the domain factor loadings (p = 0.00, ΔR² = 0.49) all contribute uniquely to the prediction of convergence rates. Selective non-convergence in the presence of small factor loadings has also been observed in several other studies (for a discussion of those results, see Yuan & Bentler, 2017). Beyond that, small sample sizes and weaker loadings on η_g increase the risk of convergence problems. For model variants, the picture is less clear. The S variant tends to perform worst under otherwise problematic conditions, which may be related to the additional weak domain. In cases with misspecification (not shown), the combination of S-1 data and the estimation of the S model produces particularly bad results: the S model has convergence rates below 0.7 under all conditions. This problem is less frequent if the data-generating model is S-1c instead of S-1. According to the present simulation results, convergence problems originate from specifying factors for domains with no specific variance, not from the S model variant per se: S model estimation works fine if the reference domain has a specific variance. Negative domain factor variance estimates are most prevalent if the true variance is small (SS_λ ≤ .27, figure in supplementary materials). Without misspecification, there is no principled advantage of S-1 models over S models regarding anomalies.

Empirical example
To illustrate the potential for difficulties with weak domain factors in practice, we reanalyzed the open data shared by Dueber and Toland (2023) (https://doi.org/10.17605/OSF.IO/3QT5S). The Scoliosis Quality of Life Index (SQLI) questionnaire features 20 indicators (Footnote 13) measuring four subdomains with five indicators each: self-esteem (SE, indicators 1-5), back pain (BP, indicators 6-10), physical activity (PA, indicators 11-15), and moods and feelings (MF, indicators 16-20). The data comprise n = 2322 cases of adolescent idiopathic scoliosis patients. As stated in the introduction, the approach to indicator selection plays a key role in the emergence of weak domain factors. The SQLI was developed as an adaptation of an existing questionnaire (Asher et al., 2000; Haher et al., 1999) without a repeated analysis of its covariance structure (Feise et al., 2005). The indicator selection of the original questionnaire included an exploratory factor analysis (EFA) with varimax rotation. In a major overhaul of this original instrument, many indicators were exchanged or changed, effectively reducing the number of dimensions from seven to five (Asher et al., 2000). None of the authors report an effort to prioritize or balance generality (measuring quality of life) and discrimination of subdomains (covering distinct features of the chosen dimensions). Presumably, the resulting domain factor variance is largely a by-product of other design choices (desired total length of the scale, conceptualization and choice of domains, subscale reliability standards).
Footnote 12. Given that the S-1 model was proposed along with a change in the interpretation of η_g, one could also understand this as the consequence of a change in the meaning of η_g. The current work can only demonstrate the recovery of the original data-generating trait η_g, not the interpretability or reliability of the resulting factor score if the S-1 model is estimated.
Footnote 13. The dataset provided by Dueber and Toland (2023) omits two indicators referring to satisfaction with management.
To understand the structure of the SQLI, the dissection of the indicator variances into general factor variance, domain factor variance (including the 95% confidence interval of the estimates), and unique indicator variance in an S bi-factor model (Footnote 14; CFI = 0.95, RMSEA = 0.056, SRMR = 0.051) is displayed in Figure 11. All but one factor loading reach significance, and one domain factor loading is estimated to be significantly negative (λ̂_SQLI_10,BP = −.13). Plotting variance proportions makes it immediately obvious that there are several indicators which barely contribute to their domain factors. This may be surprising to researchers, even if they knew the correlated-factor model of the same data (CFI = 0.91, RMSEA = 0.067, SRMR = 0.065), in which all indicators load substantially on their respective factors (λ̂ ≥ .35, for example λ̂_SQLI_12,PA = 0.55, 95% CI [0.52, 0.58]). Because confidence intervals are depicted in Figure 11, it is clearly visible that the near-zero estimates of some domain factor loadings in the bi-factor model are hard to explain as random underestimations. Some indicators just contribute less to the estimation of factors overall (such as SQLI_5), but importantly, there are meaningful differences in the specificity of equally reliable indicators (such as SQLI_2 and SQLI_8). The presence of near-zero domain factor loadings results in two domain factors with SS_λ < 1: SS_λ(BP) = 0.71, SS_λ(MF) = 1.30, SS_λ(PA) = 0.77, SS_λ(SE) = 1.51. For this reason, researchers might expect the domain factor scores of these factors to be substantially less reliable than those of the others (cf. Figure 8). But in turn, the domains with a higher SS_λ have smaller average factor loadings on the general SQLI factor (M(λ̂_BP) = 0.64, M(λ̂_MF) = 0.46, M(λ̂_PA) = 0.65, M(λ̂_SE) = 0.39), which also limits the precision of their factor scores. When comparing to the most favorable conditions in Figure 7, it becomes clear that domain trait recovery could be slightly increased if the indicator selection were optimized for the measurement of the domain traits by selecting for high λ_s (which may or may not be a relevant goal). At the same time, the domain factor detection is trivial in a sample of n > 2000 cases (Figure 5). All p-values of the LRTs comparing the full bi-factor model to models excluding individual domain factors were below 10^−102 (Wald-tests of domain factor variances: all p < 10^−8).
This example shows how easily small factor loadings can appear when using a bi-factor model on a measure developed with a correlated-factor model.In this case the main problem is the limited interpretability of domains due to some of the domain factor loadings unexpectedly being close to zero.

Discussion
The aim of the meta-analysis and simulation was to identify the necessary conditions to reliably detect and estimate domain factors and their loadings, and to compare these to real studies. The meta-analysis shows that many domain factor loadings are small (λ_s < .2) in practice (Figure 2) and mostly smaller than the loadings on the general factor. There is an abundance of indicators that contribute barely anything to their domain factor (|λ_s| < .2) but have reasonable loadings (λ_g > .5) on the general factor. On the one hand, this may be desired because it provides a relatively pure general factor. On the other hand, given that many domains are measured by six or fewer indicators (Figure 3), this results in low domain strengths (SS_λ; Figure 4). The diverse nature of the sampled constructs (Table 1), in combination with the extremely high prevalence of models having at least one domain factor for which SS_λ < 1 (Figure 4), shows that weak domain factors can be found in many research contexts.
The simulation, which covers a realistic range of factor loading values, provides an overview of the consequences of small domain factor variances (especially in the range of SS_λ < 1). The presence of domain factors is best detected by a likelihood-ratio test (LRT) that compares the model with the domain factor to the model without it. This way, domain factors with SS_λ ≥ 1 will almost always be detected. In large samples and with high overall indicator reliability, much smaller effects are reliably detectable (Figure 5). Larger samples, however, do not meaningfully improve the precision of the estimation of domain factor scores (Figure 8) or general factor scores (Figure 9).
[Figure 11, caption (beginning not recovered): ... Dueber and Toland (2023); bars are non-overlapping; specific = squared lower limit of domain factor loading estimate, variance attributable to the domain factor with relative certainty; specific ci = complete 95% confidence interval of the squared estimate of the domain factor loading (lower limit to upper limit), variance potentially attributable to the domain factor; gray areas indicate leftover (error) variance if the upper limit of the specific ci were true; thick horizontal lines separate domains; the factor loadings of the indicators SQLI_10 and SQLI_12 on the respective domain factors were estimated to be negative.]
There was almost no difference between model variants for any of the results, meaning that "anomalous" results and the occurrence of weak domain factors are not avoided by using the S-1 or S-1c variant. Judging the degree to which the prediction of other variables is affected by domain size is beyond the current simulation study (for a discussion of such models, see Zhang et al. (2021)).

How to avoid problems with weak domain factors?
Before conducting a bi-factor study, it is important to specify its goal: Should domain factors or their scores be used? Is the only consideration to obtain the best possible measure of η_g? Are all domains equally relevant? If those questions are answered at the time of the design of the study (or ideally: the measure), appropriate decisions can be made.

Expected SS k of domain factors
We recommend aiming for domain factor strengths of SS_λ > 1 regardless of sample size if domains are to be measured. Null results of the LRT and non-convergence are unlikely for domain factors of strength SS_λ > .75. But researchers may overestimate the precision with which such domain factors are measured. About half of the domain factors from the meta-analysis are so small (SS_λ < 1) that their scores can be expected to contain ≲ 60% true trait variance (see Figure 8). This makes the use of subscale scores highly questionable (see also Reise et al., 2013). Domain factor variance estimates below zero occur almost exclusively if the true effect size of the domain is tiny (SS_λ ≤ .27). From a theoretical standpoint, factors with SS_λ > 1 are more meaningful because they represent more variance than any single indicator. In exploratory factor analysis, factors with SS_λ ≤ 1 are almost always omitted, because they cannot be distinguished from random noise (parallel analysis, e.g., Hayton et al., 2004). If a study is merely concerned with measuring η_g, SS_λ < 1 can easily be tolerated (see below). If the measure's design goal is to provide valid and reliable scores of a specific domain, selecting a set of indicators with SS_λ < 1 is suboptimal, so more or better (higher λ_s) indicators need to be selected.

Number of indicators per domain
The desirable number of indicators depends on their specificity, but three to four indicators per domain are too few under most conditions. Few indicators result in small domain factor variances (SS_λ < 1). Randomly sampling six indicators from those observed in practice (Figure 2) results in SS_λ < 1 in 61.12% of cases. Adding indicators or selecting a longer measure improves the estimation precision of each individual factor loading. If domains contain very few indicators (or very few indicators with substantial loadings), including correlated error terms may be more appropriate than specifying a domain factor. The importance of increasing the number of indicators per factor to improve the recovery of the factor structure has previously been noted for EFA (Mundfrom et al., 2005; Preacher & MacCallum, 2002). For confirmatory bi-factor models, it is especially important to consider that the same number of indicators usually represents smaller SS_λ compared to other models, meaning that more indicators are needed to reliably measure domain factors compared to factors of other models (e.g., correlated-factor models).

Indicator specificity
Selecting indicators based on their specificity implies that measures are developed or revised using bi-factor models because other models do not assess indicator specificity (Footnote 15). In many cases that is not feasible for the purpose of a specific application. But it is feasible to consider the specificity of the indicators to choose realistic study goals. Low specificity is a major contributor to weak domain factors, as showcased in the empirical example (Figure 11). On the other hand, low specificity is desirable for the estimation of η_g scores. Factor loadings can themselves be of interest, for example in validation studies. Null results for domain factor loadings occur frequently for true factor loadings of 0.2 and in relatively small samples (n ≤ 500) for loadings of 0.3 (Figure 6). In addition, small factor loadings are estimated much less precisely than larger ones (Figure 7). For the abundance of estimated loadings smaller than 0.3 in the literature (Figure 2), it is therefore difficult to judge if they are truly reflecting the domain. Indicators with low specificity are somewhat less problematic if their reliability is good (high λ_g). In the empirical example, there seemed to be a strong tradeoff between λ_g and λ_s, which we also observed more generally in the meta-analysis (r(λ_g, λ_s) = −0.35, t = −9.02, p < 0.001, 95% CI [−0.42, −0.28]). This tradeoff does not exist for other models. Whereas the literature on factor structure recovery in EFA considers the number and communality (i.e., reliability) of indicators (Mundfrom et al., 2005), we suggest using SS_λ for orientation in confirmatory bi-factor analysis instead. From the results of our simulation it is clear that the size of domain factors, not the reliability of indicators, is the most important influence on statistical power and trait recovery regarding domain factors.

A priori power analysis and estimation of domain trait recovery
To estimate the statistical power to detect a domain factor, the results of this study can be used as a guideline. Alternatively, the semPower R package (Moshagen, 2021) can be used to compute the theoretical power. A simple example script is provided in the supplementary materials and can be adapted to the application at hand. The script first shows how to specify the population model and estimation model syntax to obtain the true and the model-implied variance-covariance matrices. In the next step, the degrees of freedom for power analysis via semPower are set to the difference in the degrees of freedom of the two models. This is different from a standard power analysis for model misspecification. Here, the correctly specified model is the alternative option during model selection, instead of being treated as the unknown truth. The script further demonstrates how to obtain an estimate of the trait recovery for the hypothesized model. Its code is based solely on the expected standardized factor loadings (and domain factor correlations for S-1c models). It needs minimal computational resources (no simulations). If the a priori expectation for the model parameters is very uncertain, a conservative case with relatively low factor loadings should be checked. The distribution from the current meta-analysis (Figure 2) may serve as a reference. It is important to realize that theoretical power does not consider the issue of non-convergence and can therefore vastly overestimate the chance to obtain a significant result (Figure 5).
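How an expected trait recovery can follow from loadings alone may be illustrated with a small sketch. The code below is not the supplementary script (which is not reproduced here); it is a Python illustration of a standard population result for regression factor scores: with standardized indicators and orthogonal factors of unit variance, the squared correlation between a factor and its regression factor score equals λ'Σ⁻¹λ, where λ is that factor's loading vector and Σ the model-implied covariance matrix. The S variant, the function names, and the example loadings are assumptions made for illustration.

```python
def implied_covariance(general, specific, domains):
    """Model-implied correlation matrix of an S bi-factor model with
    orthogonal unit-variance factors and standardized indicators."""
    p = len(general)
    sigma = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                sigma[i][j] = 1.0  # standardized indicators
            else:
                cov = general[i] * general[j]
                if domains[i] == domains[j]:
                    cov += specific[i] * specific[j]
                sigma[i][j] = cov
    return sigma


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def expected_recovery(loading_vector, sigma):
    """Population squared correlation between a factor and its
    regression factor score: lambda' Sigma^-1 lambda."""
    weights = solve(sigma, loading_vector)
    return sum(l * w for l, w in zip(loading_vector, weights))


# Hypothetical design: three domains with three indicators each.
general = [0.5] * 9
specific = [0.5] * 9
domains = [0, 0, 0, 1, 1, 1, 2, 2, 2]
sigma = implied_covariance(general, specific, domains)

# Loading vector of the first domain factor (zero outside its domain).
first_domain = [s if d == 0 else 0.0 for s, d in zip(specific, domains)]
print(expected_recovery(first_domain, sigma))  # domain trait recovery
print(expected_recovery(general, sigma))       # general trait recovery
```

These values are population quantities; in finite samples the loadings are themselves estimated, so the simulated recovery can be somewhat lower. S-1c models with correlated domain factors would require the factor covariance matrix in the formula and are not covered by this sketch.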

Measurement of the general factor
The most efficient way to improve the measurement of η_g is to use more indicators with higher factor loadings on η_g (Figure 9). Non-convergence becomes an issue in cases with weak domains (SS_λ ≤ .27; Figure 10) or when trying to estimate non-existent domains. However, in cases that do converge, strong domain factor loadings (λ_s ≥ .5, see Figure 9) are an issue. For the estimation of η_g factor scores, indicators preferably contain random error instead of domain-specific variance, even if the domain factors are included in the model. The measurement of η_g does improve with sample size, but extremely inefficiently (ΔR² < .01). Even a tenfold increase in sample size rarely compensates for an otherwise suboptimal design.

Omission of domain factors or domain factor loadings
It is prudent to consider a set of plausible models for model selection and robustness checks. The popularity of the S bi-factor model may suggest that all indicators should be allocated to a domain, but this serves no statistical purpose. Indicators that do not belong to a domain do not invalidate the model. The current meta-analysis found a large proportion of indicators with low specificity, likely due to indicator selection based on other models. In the empirical example, the bi-factor model of the SQLI included several indicators with little to no contribution to their domain factor, which could have easily gone unnoticed during the development of the measure, even if a correlated-factor model had been considered. Indicator allocation to domains should be reconsidered in these cases. For this purpose, exploratory bi-factor analysis techniques (Jennrich & Bentler, 2011, 2012) and bifactor exploratory structural equation models (Morin et al., 2016) were developed. What does lead to all kinds of problems, in contrast, are domain factors without specific variance in the population. Such null results for domain factors can be perfectly acceptable, for example, if domains represent converging measurement methods. But importantly, the respective factors then have to be omitted from the model.

Troubleshooting non-convergence
If a bi-factor model does not converge, one should try to omit the domain factor that is expected to be the weakest. Non-convergence is not an issue given a correctly specified model and reasonably large domain factors (Figure 10). In practice, however, "all models are wrong" (Box, 1976, p. 792). So with inevitable misspecification, non-convergence may occur more frequently, possibly most frequently for the S model variant. Convergence is worst for the S model on S-1 data, or if domain factors (SS_λ ≤ .27) and sample sizes are small (see Figure 10). The main problem seems to be the specification of superfluous or very weak domain factors, which should be avoided. For a detailed analysis of convergence problems in structural equation modeling and some other potential solutions, see Yuan and Bentler (2017).

How to interpret weak domain factors and weak domain factor loadings?
In the interpretation of bi-factor models, statistical power and the precision of estimates need to be taken into account more thoroughly. For this, it is useful to compute the SS_λ of domain factors. Our simulation provides a general reference for statistical power and parameter recovery (Footnote 16) given a range of realistic cases. The example script (supplementary materials) can be used to examine a specific case. Domain factors can include a surprisingly small amount of systematic variance (Figure 8, see also Reise et al., 2013) and may have multiple indicators whose attribution to them is unclear (Figures 2 and 11). If domains are used to predict third variables, this may explain their failure to do so. They could be just as weak as domains that result from random allocation of indicators to domains: Bi-factor models tend to fit almost any pattern in the data (Bonifay & Cai, 2017).
Taking a closer look at the factor loadings is often crucial. The large variation in loadings on the domain factor (Figures 2 and 11) means there is a very uneven mixture of the contribution of indicators to domains (e.g., Watts et al., 2019). To communicate factor composition clearly, figures of factor loadings (e.g., bar charts, such as Figure 11) can be useful. In addition to the variation in the factor loading estimates, there is substantial variation in their estimation precision (Figure 7). They should be interpreted more carefully when they are small (λ_s ≤ .3), overall indicator reliability is far from perfect (Rel < 0.5), or the sample size is small (n ≤ 300). Point estimates are most misleading for the most relevant loadings: small loadings that are often hard to interpret. It would be useful to always report (and interpret) standard errors and confidence intervals of factor loadings to make this visible, as we did in Figure 11. However, the fact that many domain factor loadings are estimated near or below zero (Figure 2) cannot be explained by sampling variation alone (Figure 7), certainly not in the empirical example.
Are S-1 models and models with a null result on a domain factor the same?
Models with omitted domain factors should not all be interpreted the same. If a domain factor is omitted because it is too weak, the resulting model is structurally equivalent to an S-1 model. However, the domain in question may not necessarily be interpreted as a natural reference domain, especially if it has small loadings on the general factor. For the interpretation of the remaining estimates, it does not matter if the absence of the domain was defined or estimated, so the interpretation of η_g does not need to change. A priori S-1 models, on the other hand, were proposed irrespective of the size of the unique variance of the reference domain and should therefore be interpreted differently (see Eid et al., 2017). Their reference domain clarifies the meaning of η_g, which is especially relevant if the reference domain has a unique variance that could be attributed to it.
Are small domain factor loadings an empirical fact or a technical artifact?
Looking at the distribution of factor loadings in Figure 2, researchers may come to the conclusion that the many small domain-factor loadings (λ_s < .2) are a valid empirical finding, rather than indicating a statistical or measurement issue. If they reflect the nature of the construct accurately, it would be undesirable to try to find indicators with higher domain-factor loadings. Such an effort could even challenge the validity of the measure. For this reason, it is important to consider the multiple ways in which these factor loadings are influenced. Firstly, indicators may be selected based on their factor loadings, irrespective of their content, usually preferring those with higher reliabilities. This strategy is based on the idea that there are better and worse constructed, and more or less relevant, indicators, and that the better, more relevant ones should be chosen. Secondly, indicators may be selected for reflecting a certain domain based on their content, in an attempt to best capture the essence of the domain (e.g., extraversion indicators that most clearly describe prototypical social boldness behaviors). In both cases, near-zero domain-factor loadings would indicate a failure to construct or select appropriate indicators. Thirdly, indicators may be selected because they are considered to measure an important, irreplaceable part of the target construct, irrespective of dimensionality (e.g., symptoms in clinical assessment or criterion-relevant tasks in a performance test). To the degree that these indicators are properly designed, small factor loadings or SS_λ values of domain factors are then a relevant empirical finding. In these cases, researchers need to deal with the resulting domain factor and accept interpretational difficulties. Overall, we consider the results of our meta-analysis to be a mixture of these different scenarios. The current study should help researchers to avoid obtaining such results by accident, that is, without having strong arguments to interpret small factor loadings as a relevant empirical finding.

Limitations and future directions
High estimation precision does not guarantee interpretability.We agree with Eid et al. (2017) that the interpretability of bi-factor models needs more careful attention and should guide model selection.S-1 models were introduced to improve interpretability in cases with a fixed set of domains (in which domains are not randomly sampled).Eid et al. (2017) demonstrated a straightforward interpretation of S-1 models for this common case.They warned that S models lack a clear interpretation of the general factor in cases with a fixed set of domains.Although the current simulation showed that anomalous results occur in all model variants, this does not mean that S and S-1 models are equally interpretable.On top of that, the S variant is prone to identification problems when used as a measurement model in SEM (Zhang et al., 2021).
In the current simulation, factor loadings were fixed to be equal and constant within and across domains. This very selective set of scenarios greatly simplified the design and interpretation of the simulation. Most probably, problems with the estimation of a particular domain factor or domain factor loading are less severe if the rest of the model consists of more reliable indicators. Conversely, the estimation of one part of the model may become more problematic if the rest of the model consists of less reliable indicators. For this reason, we suggest interpreting the results of the simulation with the whole model in mind. When in doubt, one should check the specific case. Furthermore, we omitted imperfections (cross-loadings, correlated errors) in the simulated data, which are frequently encountered in practice (Morin et al., 2016). Such added complexity could both hamper efforts to detect and estimate domain factors and produce spurious or inflated factors.
The current simulation assumes continuous, normally distributed error terms (and latent traits). In practice, this assumption is usually violated (Micceri, 1989) and robust methods should be considered (see, e.g., Yuan & Bentler, 2007). Furthermore, data analyzed in Confirmatory Factor Analysis (CFA) are frequently categorical (i.e., measured on Likert scales). In principle, categorical data are better analyzed using Item Response Theory (IRT) models. The estimation of parameters, χ² values, and fit indexes in CFA can be, but is not necessarily, biased by the categorization of data (DiStefano, 2002; Finney & DiStefano, 2006). Despite these issues, many researchers make use of CFA models on categorical data. If bi-factor CFA models are used to analyze categorical or decidedly nonnormal data, it is especially important to consider the current results to be an optimistic upper limit of the to-be-expected statistical power, trait recovery, and parameter estimation precision. Future research may show if bi-factor IRT models also tend to produce weak domain traits on typical data.
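One mechanism behind such bias can be illustrated with a small simulation: cutting two correlated normal variables into a few ordered categories tends to shrink their Pearson correlation, which is the input to a standard CFA. The sketch below is not part of the reported simulation study; the correlation of .5, the thresholds, the sample size, and all function names are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def categorize(value, thresholds):
    """Map a continuous value to an ordered category (0, 1, 2, ...)."""
    return sum(value > t for t in thresholds)


def simulate(rho=0.5, n=20000, thresholds=(-1.5, -0.5, 0.5, 1.5), seed=1):
    """Return (continuous r, categorized r) for bivariate normal data.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x.append(z1)
        # Construct y so that its population correlation with x is rho.
        y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    x_cat = [categorize(v, thresholds) for v in x]
    y_cat = [categorize(v, thresholds) for v in y]
    return pearson(x, y), pearson(x_cat, y_cat)


for label, cuts in [("5 categories", (-1.5, -0.5, 0.5, 1.5)),
                    ("2 categories", (0.0,))]:
    r_cont, r_cat = simulate(thresholds=cuts)
    print(f"{label}: continuous r = {r_cont:.3f}, categorized r = {r_cat:.3f}")
```

In this setup, the attenuation is expected to be more pronounced with fewer categories, which is in line with treating results for continuous data as an optimistic upper limit.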
The current study did not examine how weak domain factors affect estimates in the structural part of SEMs. This topic is only partly touched on by the simulation data of Zhang et al. (2021), who demonstrated a strong influence of the model variant on SEM estimates. Further research is needed to explore the influence of domain strength on relationships with other variables. Domain factors with SS_λ < 1 might show estimates of latent relationships that are imprecise and biased toward zero, because they are measured with less precision. To corroborate the empirical result of our meta-analysis that many measures do not produce a full set of interpretable domain-specific factors, assessing the prevalence of weak or vanishing domain factors using exploratory models (Jennrich & Bentler, 2011, 2012; Morin et al., 2016) on a representative sample of studies would be useful. This is especially relevant because results of bi-factor CFA might be biased in cases with substantial cross-loadings, which can realistically be expected in many applications (Morin et al., 2016). Finally, several models are structurally similar to the bi-factor model (multitrait-multimethod models, longitudinal models, latent state-trait models, e.g., Koch et al., 2018). Future research may show to what degree these involve similar challenges.

Conclusion
The role and prevalence of study designs that produce small domain factor strengths, which lead to null results or uninterpretable results, are underappreciated in the literature. Study planning and interpretation need to take the (expected) strength of domain factors and domain factor loadings into account. The outlined strategies aim to enable researchers to fully unlock the model's potential. The bi-factor model does not generally produce problematic results, but it needs appropriate data. The crucial step is to select or design measures for the use of bi-factor models. If that is not possible, the results have to be interpreted with caution and alternative models should be considered. Moreover, the current study provides further explanations for the results that Eid et al. (2017) termed "anomalous". It shows that they occur in the S-1 and S-1c variants with roughly the same frequency if there is no misspecification involved.
Many of the above suggestions imply that existing measures need to be revised or new measures need to be developed to meet common study goals. This is both a challenge and a chance. There are many reasons why current measurement practices are considered suboptimal (Flake & Fried, 2020). Bi-factor models offer new opportunities to create improved measures, especially if the underlying construct is multifaceted by definition. The measurement of domain traits may be a practical challenge, but with it comes an opportunity to refine psychological research.
of this negative dependency is counteracted by variation in the indicator reliability: Low values of λ_s and λ_g coincide in indicators with a large variance of the measurement error. The resulting correlation between the factor loadings is r(λ_g, λ_s) = −0.35 (t = −9.02, p < 0.001, 95% CI [−0.42, −0.28]). This suggests competition in the measurement of the traits. For each indicator with λ_s > λ_g, there are 4.30 indicators with λ_s < λ_g.

Figure 2. Fully standardized factor loadings of individual indicator variables from 28 S bi-factor models; dashed lines indicate simulation conditions.

Figure 3. m = number of indicators per domain from 28 S bi-factor models; filled bars mark simulation conditions; for some indicators it was unclear if their loadings were fixed or estimated at 0.00.

Figure 4. Sum of squared loadings and sample sizes of domain factors from 28 S bi-factor models; domain factors of the same model are connected by a line.

Figure 5. Power to detect domain factors by likelihood-ratio test. Only correctly specified models are shown. Each symbol represents one simulation condition. Vertical lines show the discrepancy between simulated power (symbol) and theoretical power (arrow tail; small horizontal offset for readability).

Figure 7. Estimation precision of domain factor loadings. Each symbol represents one simulation condition. The logarithmic y-axis scale is cut at 0.0001. Only correctly specified models are displayed. m = number of indicators.

Figure 6. Power to detect domain factor loadings by Wald test. Each symbol represents one simulation condition. m = number of indicators.

Figure 8. Average squared correlation between true domain trait values and estimated factor scores. Each symbol represents one simulation condition.

Figure 9. Average squared correlation between true general trait values and estimated factor scores.Each symbol represents one simulation condition.The variation between identical symbols is due to sample size (200 to 2000).

Figure 10. Convergence rate. Each symbol represents one simulation condition. Only correctly specified models are shown.

Figure 11. Variance proportions of the Scoliosis Quality of Life Index (SQLI) questionnaire explained by general and specific factors; open data by Dueber and Toland (2023); bars are non-overlapping; specific = squared lower limit of domain factor loading estimate, variance attributable to the domain factor with relative certainty; specific ci = complete 95% confidence interval of the squared estimate of the domain factor loading (lower limit to upper limit), variance potentially attributable to the domain factor; gray areas indicate leftover (error) variance if the upper limit of the specific ci were true; thick horizontal lines separate domains; the factor loadings of the indicators SQLI_10 and SQLI_12 on the respective domain factors were estimated to be negative.

Table 1. Constructs in the meta-analysis sample.

Table 3. Likelihood ratio test outcomes (percent) by model variant.
10 The same plot, but with 0.95 quantiles (instead of medians) of the RMSE distributions, is included in the supplementary materials.