Comparing MIMIC and MIMIC-interaction to Alignment Methods for Investigating Measurement Invariance concerning a Continuous Violator

Abstract Measurement invariance holds when a latent construct is measured in the same way across different levels of background variables (continuous or categorical) while controlling for the true value of that construct. Using Monte Carlo simulation, this paper compares the multiple indicators, multiple causes (MIMIC) model and MIMIC-interaction to a novel use of alignment optimization (AO) for detecting measurement noninvariance when the violator is a continuous variable. Results showed that MIMIC and MIMIC-interaction, under sequential likelihood ratio tests and Wald tests with a Bonferroni correction, provided a good balance between identifying invariant and noninvariant items (linear violations) when n ≥ 500 in terms of classification accuracy (CA). AO (CA ≥ .86) was as competitive as MIMIC and MIMIC-interaction under linear invariance violations but was far better under nonlinear quadratic violations when n ≥ 1,000 (i.e., 100 per group for 10 groups).


Introduction
Structural equation modeling (SEM) is a popular statistical technique for describing and studying the relationships between constructs and observed scores in psychological and educational research (Kline, 2016). However, for observed scores to be meaningfully compared across groups with different characteristics, such as gender, age, or cultural background, an instrument must measure the latent construct the same way across populations (Vandenberg, 2002). More specifically, an observed item score Y_ij is invariant with respect to a background variable W if the probability of endorsing an option y on an item does not depend on W while controlling for the latent construct η_i (Meredith, 1993):

P(Y_ij = y | η_i, W) = P(Y_ij = y | η_i),

where i denotes the ith individual and j the jth observed item. W can be at any measurement level (e.g., categorical or continuous) and can be a violator (Barendse et al., 2012), producing noninvariance in observed item scores while controlling for the latent construct. This study examined a novel use of alignment optimization (AO; Asparouhov & Muthén, 2014) for detecting measurement noninvariance concerning a continuous violator and compared AO to the multiple indicators, multiple causes (MIMIC; Muthén, 1989) model and MIMIC-interaction (Masyn, 2017).
In this section we discuss the multiple-group confirmatory factor analysis (MGCFA; Jöreskog, 1971; Sörbom, 1974; Vandenberg & Lance, 2000) and MIMIC invariance methods and their limitations. While researchers have used MGCFA to evaluate measurement invariance concerning a categorical variable (e.g., nationality) or ordinal variable (e.g., different occasions in repeated measures for time-to-event data), MGCFA does not allow invariance concerning a continuous variable to be examined directly unless researchers discretize the continuous variable into categories so that groups can be formed. For example, to examine the impact of the continuous variable socioeconomic status (SES) on measurement invariance of the Children's Hope Scale (Snyder et al., 1997), Lei et al. (2019) dichotomized SES into low and high groups using the lower and upper 27% of SES scores and then used MGCFA for the invariance study. However, categorization (e.g., into two or three groups) introduces inaccurate estimation and reduces statistical power (Bennette & Vickers, 2012; Royston et al., 2006; Thoresen, 2019). Moreover, categorization in MGCFA is usually done with few (e.g., 2-3) categories (Kim et al., 2017), which leads to a substantial loss of information about the continuous variable (Royston et al., 2006). Although in theory MGCFA can handle many groups (e.g., 30), this procedure is not practical because of the large number of pairwise comparisons involved and the need to identify adequate equality constraints to scale the latent constructs. It is also more error-prone when the number of groups is large, as setting inadequate equality constraints can lead to incorrect parameter estimation (Xu & Green, 2016). This method is thus not well equipped to handle continuous violators and constrains proper invariance investigation concerning them.
On the other hand, the MIMIC model (Muthén, 1989) defines a measurement model and includes a covariate as a causal indicator of an observed item, denoting potential intercept noninvariance. MIMIC examines uniform noninvariance (i.e., intercept differences between the focal group and reference group; Lee et al., 2017) concerning a continuous covariate by regressing each observed item on the covariate while holding the latent construct constant. This method has demonstrated good statistical power against medium uniform noninvariance, but it is not sensitive to nonuniform noninvariance (i.e., factor loading noninvariance; Kim et al., 2012). Adding an interaction between η and W in predicting Y allows for testing nonuniform noninvariance, a method we refer to as MIMIC-interaction (Masyn, 2017). Researchers have also examined the MIMIC-interaction model that models intercept and factor loading noninvariance simultaneously (also called full MIMIC-interaction; Lee et al., 2017; Woods & Grimm, 2011). In other words, MIMIC-interaction can examine uniform and nonuniform noninvariance simultaneously. Previous literature showed that MIMIC-interaction had good statistical power in detecting uniform noninvariance (Lee et al., 2017) but had inflated Type I error rates in an omnibus test (i.e., a joint test for uniform and nonuniform noninvariance simultaneously; Woods & Grimm, 2011). While previous studies examined MIMIC-related models under linear invariance violations (Kim et al., 2012), few have focused on how MIMIC and MIMIC-interaction perform when there is a nonlinear (e.g., quadratic) violation. Because MIMIC is a parametric method, it either assumes that the violator is linearly related to noninvariance or requires researchers to have correctly specified the functional form of noninvariance, which is rarely the case in practice.
Different from multiple-group analysis, which focuses on a small number of groups (e.g., 4), the AO method was designed to study measurement invariance with many groups (e.g., 2-100; Asparouhov & Muthén, 2014). Because AO can investigate invariance among many groups, it can use information from a continuous violator by connecting that violator to group membership. Compared with the small number of groups used in multiple-group analysis, discretizing a continuous variable into many groups allows more information to be extracted from that variable for invariance evaluation, with the benefit of not requiring a specific functional form of noninvariance, such as linearity. In addition, the AO method is more flexible than multiple-group analysis: it assumes approximately invariant parameters between groups and aims to minimize total noninvariance among many groups, whereas scalar invariance under MGCFA is often too ideal to achieve (Marsh et al., 2018). Although the AO method has shown good recovery and estimation of group-specific parameters such as loadings and intercepts, especially with large sample sizes, small-to-moderate noninvariance (Flake & McCoach, 2018), and few groups (2 to 4; Lai et al., 2022), to our knowledge no previous study has applied or examined AO for evaluating invariance concerning a continuous violator.
Different from discretizing a continuous variable into a few groups in linear regression models (Thoresen, 2019) or MGCFA, our use of discretization in AO is a nonparametric way to approximate the function relating noninvariance to the continuous violator. Different from the MIMIC-related methods, which assume a functional form for the invariance violation, AO uses a simplicity function to hold parameters (e.g., factor loadings) close among most groups, identifies groups with large noninvariant parameters, and places no parametric restrictions on how noninvariance is related to the continuous violator in the population. In addition to this flexibility with respect to invariance violations, AO focuses on "minimizing total noninvariance" among groups with a model fit identical to that of a configural model (Asparouhov & Muthén, 2014); we therefore expect the AO method to be sensitive to noninvariance across categorized groups even under a nonlinear violation.
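As a concrete illustration of the discretization step just described, the snippet below cuts a simulated continuous violator into ten equal-frequency (decile) groups. This is our own sketch, not the study's code (the study used R and Mplus); the equal-frequency rule and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.normal(size=1000)  # a standardized continuous violator, e.g., SES

# Cut W at its empirical deciles so each of the 10 groups has equal size.
n_groups = 10
edges = np.quantile(w, np.linspace(0, 1, n_groups + 1))
group = np.digitize(w, edges[1:-1])  # inner cut points only -> labels 0..9

counts = np.bincount(group, minlength=n_groups)  # 100 observations per group
```

The resulting `group` variable can then serve as the grouping (class) variable in an AO analysis, with no functional form imposed on how noninvariance varies over W.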
In this paper, we propose a novel way to use the AO method to investigate noninvariance concerning a continuous violator: discretize that variable into many groups (e.g., 10), use the discretized variable as group membership, and then nonparametrically examine invariance across these groups. This study is the first to combine AO with covariate categorization. We aim to compare MIMIC and MIMIC-interaction to the AO method in detecting measurement noninvariance concerning a continuous variable, and especially to understand their sensitivity to both linear and nonlinear forms of invariance violations, in a Monte Carlo simulation.

Measurement Invariance under the Multiple-group Factor Model
Under a factor analytic model (e.g., Thurstone, 1947), measurement invariance implies the equality of factor structures and parameters among different groups (Meredith, 1993; Vandenberg & Lance, 2000). Let a latent construct be measured by p (j = 1, 2, …, p) items under m common factors (e.g., a five-factor personality measure; O'Keefe et al., 2012). For the kth (k = 1, 2, …, K) group, the model is

Y_k = τ_k + Λ_k η_k + δ_k,

where Y_k is a p × 1 vector of observed item scores, τ_k a p × 1 vector of intercepts, Λ_k a p × m matrix of factor loadings, η_k an m × 1 vector of latent constructs, and δ_k a p × 1 vector of unique factors with covariance matrix Θ_k. Four stages of measurement invariance are often explored (Vandenberg & Lance, 2000). First, configural invariance implies that the same factor structure holds in all K groups. Second, metric invariance further requires factor loadings to be equal across groups (i.e., Λ_k = Λ), indicating that a one-unit increase in a latent construct introduces the same change in observed scores across groups. Third, scalar invariance also requires equal regression intercepts (i.e., τ_k = τ), so that observed item scores are equal across subpopulations for a latent construct score of zero. Finally, strict invariance requires equal unique variances and covariances for all k (i.e., Θ_k = Θ), indicating equal precision in measuring a factor across groups (Kline, 2016).

Multiple Indicators Multiple Causes Related Models
The MIMIC-related models (i.e., MIMIC and MIMIC-interaction) are alternative models for studying measurement invariance. The measurement model with m factors and p items discussed previously (Meredith, 1993) can be expressed for a MIMIC model as

Y = τ + Λη + δ,    (2)

where Y, τ, and δ are vectors of observed item scores, intercepts, and unique factors of length p; η is a vector of latent constructs and Λ is a matrix of regression coefficients, as previously defined; δ is assumed multivariate normal with mean 0 and covariance matrix Θ. MIMIC (Muthén, 1989) allows l grouping variable(s) in a vector W to serve not only as causal indicators of the latent constructs but also as covariates of q of the p observed indicators, by introducing a p × l regression coefficient matrix β₁ for Y:

Y = τ + Λη + β₁W + δ.    (3)

For invariant items, the corresponding coefficients in β₁ are constrained to 0; for noninvariant items, they are estimated.
The structural part of the model is

η = α + ΓW + ε,    (4)

where α and ε are vectors of intercepts and errors of the factors of length m (the latent construct is one-dimensional in this study, with m = 1); Γ is an m × l regression coefficient matrix, and W can be at any measurement level. If W only affects the latent construct, combining Equations (2) and (4) gives

Y = τ + Λ(α + ΓW + ε) + δ,    (5)

suggesting no violation of measurement invariance after controlling for W. If the grouping variable is also a covariate of an indicator, combining Equations (3) and (4) gives

Y = τ + Λ(α + ΓW + ε) + β₁W + δ,    (6)

indicating a possible linear violation of intercept invariance due to β₁. If there is an interaction β₂Wη (with β₂ a p × l coefficient matrix) between the grouping variable and the latent construct after controlling for the effect of W on η, combining Equations (2), (4), and the interaction effect gives the MIMIC-interaction equation

Y = τ + Λη + β₂Wη + δ,  with η = α + ΓW + ε,    (7)

suggesting a possible linear violation of factor loading invariance due to β₂. A MIMIC-interaction model,

Y = τ + Λη + β₁W + β₂Wη + δ,    (8)

can include not only an interaction effect β₂Wη (see Equation (7)), suggesting potential factor loading noninvariance, but also a main effect β₁W (see Equation (6)), indicating potential intercept noninvariance.
It is sensible to assume intercept noninvariance (with β₁ estimated) when factor loading noninvariance exists (i.e., β₂ ≠ 0).
Furthermore, if we assume a nonlinear (e.g., quadratic) violation of intercept invariance (i.e., β₃W²) or factor loading invariance (i.e., β₄W²η), Equations (6) and (7) are re-expressed as

Y = τ + Λ(α + ΓW + ε) + β₃W² + δ,    (9)

Y = τ + Λη + β₄W²η + δ.    (10)

A measurement invariance study under MIMIC or MIMIC-interaction examines whether the path coefficients β₁ or β₂ of the grouping variable(s) or violator(s) W on the observed scores Y are nonzero while controlling for the effect of W on the latent construct, when we focus on a linear violation. If there are statistically significant differences between the models with and without the effect of W on Y, there is evidence of a violation of intercept invariance or factor loading invariance with respect to W (Kim et al., 2012; Woods & Grimm, 2011). Figure 1 displays a MIMIC model and a MIMIC-interaction model with a grouping variable W predicting an item Y₆, indicating intercept noninvariance and factor loading noninvariance, respectively.
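The distinction between the linear and quadratic intercept violations can be made concrete with a small sketch of the implied expected item score. This is our own illustration (function name and coefficient values are assumptions, taken from the simulation design later in the paper), using only the measurement part of Equations (6) and (9).

```python
def expected_item_score(eta, w, tau=0.0, lam=1.0, b1=0.0, b3=0.0):
    """E[Y | eta, W] for one item under a linear (b1*W) and/or
    quadratic (b3*W**2) violation of intercept invariance."""
    return tau + lam * eta + b1 * w + b3 * w ** 2

# At eta = 0: a linear violation shifts the intercept monotonically in W,
# while a quadratic violation shifts W = -1 and W = +1 by the same amount.
lin_lo = expected_item_score(0, -1, b1=0.5)
lin_hi = expected_item_score(0, 1, b1=0.5)
quad_lo = expected_item_score(0, -1, b3=0.5)
quad_hi = expected_item_score(0, 1, b3=0.5)
```

The symmetry of the quadratic shift (equal at W = ±1) is exactly the kind of pattern a linear β₁W term can average out, which previews why a linear MIMIC analytic model may miss quadratic violations.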

Alignment Optimization Method
The AO method is appropriate for measurement invariance studies with many groups (e.g., 2-100; Asparouhov & Muthén, 2014; Muthén & Asparouhov, 2018). It provides group-specific estimates, including factor means, factor variances, factor loadings, and intercepts, while keeping the maximum number of groups approximately invariant (Asparouhov & Muthén, 2014). According to Asparouhov and Muthén (2014), the observed item score Y_ijk for individual i and item j in group k can be denoted by the same measurement model as described under the multiple-group factor model. Assumptions for this model include normally distributed latent constructs η_ik and item errors δ_ijk.
AO starts with a configural model (also called the base model M₀) under MGCFA and assumes that it fits better than a scalar invariance model (Asparouhov & Muthén, 2014; Marsh et al., 2018; Muthén & Muthén, 1998-2017), with the latent factors standardized in all K groups (i.e., η_ik ~ N(0, 1)); M₀ provides factor loading and intercept estimates λ_jk,0 and τ_jk,0. If the configural model does not fit the data well, it should be modified. AO then frees the factor means and variances and searches for a new model M₁ under the constraint that it has the same fit as the base model. This constraint makes the factor loadings λ_jk,1 and intercepts τ_jk,1 of M₁ expressible as functions of the parameters λ_jk,0 and τ_jk,0 and the factor means α_k and variances ψ_k. Asparouhov and Muthén (2014) suggested that the factor means and variances can be obtained by minimizing a total simplicity function on total noninvariance,

F = Σ_j Σ_{k1<k2} ω_{k1,k2} f(λ_{jk1,1} − λ_{jk2,1}) + Σ_j Σ_{k1<k2} ω_{k1,k2} f(τ_{jk1,1} − τ_{jk2,1}),    (11)

where λ_{jk1,1} − λ_{jk2,1} denotes the factor loading difference for item j between groups k₁ and k₂ under the new model M₁; τ_{jk1,1} − τ_{jk2,1} is the corresponding intercept difference; and ω_{k1,k2} is a weight coefficient related to the sample sizes N_k in each group (Asparouhov & Muthén, 2014).
F is minimized when many pairwise groups show small differences in factor loadings and intercepts (i.e., approximate invariance). One form of the simplicity component function is f(x) = √(√(x² + ε)), with ε a small positive number such as 0.01. More details on how AO works can be found in Asparouhov and Muthén (2014).
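A minimal sketch of the total simplicity function may help: the component loss f(x) = (x² + ε)^(1/4) penalizes large pairwise differences much less steeply than a quadratic loss, which is what lets AO concentrate noninvariance in a few group pairs. The weighting by √(N_k1 · N_k2) is our reading of ω_{k1,k2}; function names are our own.

```python
import numpy as np
from itertools import combinations

def clf(x, eps=0.01):
    """Component loss function f(x) = (x**2 + eps)**0.25;
    eps keeps f differentiable at x = 0."""
    return (np.asarray(x, float) ** 2 + eps) ** 0.25

def total_simplicity(loadings, intercepts, group_n, eps=0.01):
    """Total simplicity function F: CLF-penalized pairwise loading and
    intercept differences over all group pairs, summed over items.
    loadings, intercepts: (K groups x p items) arrays."""
    loadings = np.asarray(loadings, float)
    intercepts = np.asarray(intercepts, float)
    F = 0.0
    for k1, k2 in combinations(range(loadings.shape[0]), 2):
        w = np.sqrt(group_n[k1] * group_n[k2])  # assumed sample-size weight
        F += w * clf(loadings[k1] - loadings[k2], eps).sum()
        F += w * clf(intercepts[k1] - intercepts[k2], eps).sum()
    return F

# F is smaller when pairwise differences are near zero (approximate invariance).
aligned = total_simplicity([[1.0], [1.0]], [[0.0], [0.0]], [100, 100])
spread = total_simplicity([[1.0], [1.5]], [[0.0], [0.8]], [100, 100])
```

In AO the alignment parameters (factor means and variances) are chosen to minimize F; the sketch only evaluates F for fixed parameter values.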

Likelihood Ratio Test in MIMIC and MIMIC-interaction
In MIMIC or MIMIC-interaction (Masyn, 2017), the likelihood ratio test (LRT) examines whether the model estimating the effect of W on the item(s) fits the data better than the model excluding that effect. For example, when examining one item under the LRT, a more constrained model (e.g., a full invariance model; see Equation (2)) is compared to a less constrained model (e.g., an intercept noninvariance model with one item released; see Equation (6)) under maximum likelihood (ML). Both models are estimated by maximizing the likelihood function; with large samples, the difference in −2 × maximized log-likelihood approximately follows a chi-square distribution with degrees of freedom (df) equal to the difference in the number of free parameters between the two models (Agresti, 2013).
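The LRT computation itself is simple once the two maximized log-likelihoods are in hand; the sketch below uses hypothetical log-likelihood values (not from the study) for a one-df comparison.

```python
from scipy.stats import chi2

def lrt(loglik_constrained, loglik_free, df_diff):
    """Likelihood ratio test: -2 * (logL_constrained - logL_free)
    is approximately chi-square(df_diff) in large samples."""
    stat = -2.0 * (loglik_constrained - loglik_free)
    return stat, chi2.sf(stat, df_diff)

# Hypothetical maximized log-likelihoods: full invariance model vs. a
# model releasing one intercept path (df difference = 1).
stat, p = lrt(-4210.7, -4205.1, 1)  # stat = 11.2
```

A statistic of 11.2 on 1 df is significant at conventional levels, so the released item would be flagged as noninvariant.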

Wald Test in MIMIC and MIMIC-interaction
The Wald test (Agresti, 2013; Wald, 1943) is commonly used in MIMIC to test the null hypothesis H₀: β = β₀ with the statistic

z = (β̂ − β₀) / SE(β̂),

where β̂ is a parameter estimate and SE(β̂) is its standard error obtained under maximum likelihood, against the alternative hypothesis β ≠ β₀ (Agresti, 2013). With a nonnull SE(β̂), the z statistic (i.e., a Wald statistic; Wald, 1943) approximates a standard normal distribution in large samples when β = β₀ for a one- or two-sided test; z² has an asymptotic chi-square null distribution with df = 1 for a two-sided test. The Wald test examines whether the coefficient of W on an item is statistically different from 0 at a given significance level in MIMIC or MIMIC-interaction. If the result is statistically significant (e.g., p < .05), the item is identified as noninvariant.
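The Wald decision rule can be sketched directly from the formula above; the estimate and standard error below are hypothetical, and the Bonferroni-corrected alpha anticipates the six per-item tests used in this study's design.

```python
from scipy.stats import norm

def wald_test(beta_hat, se, beta0=0.0):
    """Two-sided Wald z test of H0: beta = beta0."""
    z = (beta_hat - beta0) / se
    return z, 2.0 * norm.sf(abs(z))

# Hypothetical ML estimate of the path from W to one item.
z, p = wald_test(0.42, 0.11)
noninvariant = p < 0.05 / 6  # Bonferroni-corrected alpha for six items
```

Here z ≈ 3.82, so the item would be flagged even at the corrected level.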

Type I Error Corrections in MIMIC and MIMIC-interaction
We do not know in advance which item violates invariance in the real world. Except for one or more anchor items, all remaining items need to be tested to detect potential violations; alternatively, each item is tested separately with the remaining items treated as anchor items (e.g., Kim et al., 2012). Either way, multiple tests result. To reduce Type I error inflation under multiple testing, researchers have suggested methods such as a Bonferroni correction of the critical p values (Stark et al., 2006) or Oort's (1992) adjustment of the chi-square critical value. Oort's (1992) adjustment involves two models: a baseline model assuming full invariance, with chi-square statistic χ²₀ and degrees of freedom df₀, and an alternative model allowing noninvariance, with chi-square statistic χ²₁ and degrees of freedom df₁. The chi-square critical value C′ is adjusted as

C′ = C × (χ²₀ / df₀),    (15)

where C is the original chi-square critical value based on the difference in degrees of freedom between the two models at a given significance level.
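Both corrections reduce to one-line computations. The sketch below is ours: the Bonferroni line reproduces the chi-square critical value for six per-item tests, and the Oort line assumes the adjustment scales the usual critical value by the baseline model's chi-square/df ratio (our reading of the rule), using hypothetical fit values.

```python
from scipy.stats import chi2

alpha, n_items = 0.05, 6
# Bonferroni: test each of six items at alpha/6.
c_bonf = chi2.ppf(1 - alpha / n_items, df=1)  # approximately 6.96

# Oort's adjustment (assumed form): inflate the df = 1 critical value by
# the baseline full-invariance model's chi-square/df ratio (hypothetical).
chisq0, df0 = 95.0, 60
c_oort = chi2.ppf(1 - alpha, df=1) * (chisq0 / df0)
```

Both adjusted critical values exceed the unadjusted 3.84, making per-item flags more conservative.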

Pairwise Comparisons in AO
The AO method estimates factor means and variances via the total simplicity function (see Equation (11)) and connects groups with invariant parameters through pairwise group comparisons when the p value is not statistically significant at the .01 level. AO further identifies a group as noninvariant if its parameter (e.g., an intercept) differs statistically from the average of that parameter across invariant groups in pairwise comparisons with a critical p value of .001 (Asparouhov & Muthén, 2014). In addition, pairwise z tests on factor means are conducted to identify statistically significant mean differences (Muthén & Asparouhov, 2013). Because of these differences, we expect MIMIC and MIMIC-interaction to be more effective in detecting noninvariance when the continuous variable (or violator) underlying group membership is linearly related to the items, as parametric methods generally have higher statistical power when the model is correctly specified. However, when the continuous variable is nonlinearly related to the items, MIMIC and MIMIC-interaction may not be effective unless one can specify the nonlinear function in advance; otherwise AO, a nonparametric method, should be more effective.
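AO's follow-up flagging step can be sketched as a z test of a group's parameter against the average over currently invariant groups. This is a deliberately simplified illustration (our own function; AO's actual test accounts for the sampling variability of that average, which is treated as known here).

```python
from scipy.stats import norm

def flag_noninvariant_group(est, se, invariant_avg, alpha=0.001):
    """Simplified sketch of AO's noninvariance flag: two-sided z test
    of a group estimate against the (assumed known) average over
    invariant groups, at the .001 level used by AO."""
    z = (est - invariant_avg) / se
    return abs(z) > norm.ppf(1 - alpha / 2)

flagged = flag_noninvariant_group(0.90, 0.10, 0.0)  # z = 9.0
kept = flag_noninvariant_group(0.10, 0.10, 0.0)     # z = 1.0
```

With the .001 critical value (|z| > 3.29), only the first group is flagged as noninvariant.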

Previous Research on MIMIC, MIMIC-interaction, and AO
For measurement invariance studies under MIMIC-related models, researchers have examined their use in SEM and item response theory (IRT) to detect differential item functioning (DIF; Lee et al., 2017; Woods & Grimm, 2011). For example, Woods and Grimm (2011) explored MIMIC, MIMIC-interaction, and the LRT in detecting uniform and nonuniform DIF in two groups with binary and five-category Likert-type data. They found that the MIMIC-interaction model had higher statistical power than MIMIC and IRT with LRTs in detecting both uniform and nonuniform DIF (magnitudes of 0.3 to 0.7 differences between the two groups) when the focal group sample size N_F ≥ 200, but at the expense of inflated Type I error rates (≥ .15 across N_F from 50 to 400). Kim et al. (2012) reported inflated Type I error rates under MIMIC and suggested Oort's (1992) critical value adjustment for the chi-square LRT when detecting noninvariance. Kim et al. (2012) also found that MIMIC had low statistical power (close to 0) against nonuniform noninvariance, even under a large effect size (0.4 differences between focal and reference groups) and two DIF items, regardless of data type (i.e., dichotomous, polytomous, or continuous). Lee et al. (2017) further examined how MIMIC-interaction performed against uniform and nonuniform DIF under multidimensional IRT. Different from Woods and Grimm's (2011) finding of similar statistical power against both types of DIF under one-dimensional MIMIC-interaction, Lee et al. (2017) found that the multidimensional MIMIC-interaction model had higher statistical power (≥ .8 when n ≥ 1,000 in the reference group and n ≥ 200 in the focal group) against a medium uniform DIF magnitude (i.e., 0.5 differences in intercepts) than against a medium nonuniform DIF magnitude (e.g., 0.6 differences in factor loadings), for which power ranged from 0 to 1.0 across the corresponding sample sizes.
The review above showed that MIMIC and MIMIC-interaction can detect uniform noninvariance and that MIMIC-interaction is more effective in detecting nonuniform noninvariance than MIMIC. This study employed the MIMIC and MIMIC-interaction models to detect different invariance violations concerning a continuous violator. Specifically, the MIMIC analytic model was formulated to detect possible intercept noninvariance, and the MIMIC-interaction analytic model was designed to detect potential factor loading noninvariance. In addition to a procedure that treated the remaining items as anchor items when testing an item (the nonsequential method; Kim et al., 2012), this study also examined a sequential procedure for testing noninvariant items. Under the nonsequential method, each item was tested separately against the full invariance model, with the remaining items treated as anchor items in each test (Yoon & Kim, 2014); identified noninvariant items were then released simultaneously in a MIMIC-related model. Under the sequential method, similar to forward selection in regression, items were freed one at a time based on the largest statistic, until the largest statistic for the next item was no longer statistically significant. The same test statistics were used under the nonsequential and sequential methods; for example, under the sequential method we searched for the largest chi-square statistic under the LRT or the smallest p value under the Wald test. In addition, limited research has examined how MIMIC-related models perform when the noninvariance has a nonlinear (i.e., quadratic) relationship with the grouping variable, and we aim to understand such nonlinear invariance violations.
The AO method is designed for estimating the factor means and variances of each group and can be used for measurement invariance studies (Asparouhov & Muthén, 2014; Muthén & Asparouhov, 2013, 2018). Simulation studies showed that parameter bias (e.g., in factor means) under AO increased as the number of groups or the proportion of noninvariant parameters increased, or as the within-group sample size decreased (Asparouhov & Muthén, 2014). Researchers have begun to apply AO more widely, for example in the IRT context (Flake & McCoach, 2018; Muthén & Asparouhov, 2014) or in comparison to multiple-group analysis (real data with 59 groups in Lomazzi, 2018; real data with 30 groups in Marsh et al., 2018). For example, in a simulation with one factor, five continuous indicators, and 15 groups, Marsh et al. (2018) found that AO estimated factor means better than partial scalar invariance and complete scalar invariance procedures when scalar invariance did not hold, as indicated by smaller average mean square error and standard deviation of bias. Furthermore, according to Flake and McCoach (2018), except under extreme conditions (e.g., 43% of factor loadings noninvariant), AO performed well for polytomous data in factor mean and factor variance coverage (i.e., the proportion of replications in which the true value was captured by the 95% confidence interval).
The review shows that applications of the AO method are still sparse but have demonstrated promising results with many groups. Furthermore, AO's advantages in handling many groups can provide an efficient way of evaluating invariance concerning a continuous variable by discretizing it into many groups, which no previous study has done. The adaptation of alignment in this context can thus address the problem of studying invariance concerning a continuous variable (i.e., violator). In addition, it does not assume a parametric form, unlike MIMIC-related models. We therefore hypothesize that the AO method will have higher statistical power in detecting noninvariance when there is a nonlinear relationship between the violator and an indicator.

Methods
We used Monte Carlo simulation to explore MIMIC and MIMIC-interaction and to compare them to the AO method in detecting violations of measurement invariance. Five hundred data sets were generated under each condition in R 3.6.2 (R Core Team, 2019). For intercept noninvariance conditions, data were generated under a MIMIC model with one latent factor and six items; for factor loading noninvariance conditions, data were generated under a MIMIC-interaction model with one latent factor and six items. Invariance analyses for MIMIC, MIMIC-interaction, and AO were conducted in Mplus Version 8.3 (Muthén & Muthén, 1998-2017).

Type of Noninvariance
Intercept and factor loading noninvariance were investigated separately.

Number of Noninvariant Items
One (16.7%) or two (33.3%) of the six items were generated as noninvariant. These proportions of noninvariant items are similar to previous studies (Kim et al., 2012).

Relationships between the Continuous Violator and Noninvariant Items
The continuous violator was either linearly or nonlinearly (i.e., quadratically in this design; see Equations (9) and (10)) related to the intercept (i.e., a main effect) or the factor loading (i.e., an interaction between the factor and the violator) of the noninvariant items. Item characteristic curves relating the latent construct to the expected item score under the different invariance violations are shown in Figures 2 and 3.

Magnitude of Noninvariance
Small or medium violations of invariance were examined. We used the R² effect size (Wright, 1921), the proportion of observed item variance accounted for by the violator, to reflect the magnitude of noninvariance, with R² between .07 and .14 for small effect sizes and between .15 and .29 for medium effect sizes under a linear violation (Ferguson, 2009; see Table 1). Under small intercept noninvariance, the path coefficient (β₁ under a linear violation or β₃ under a nonlinear violation) between the violator and a noninvariant item was 0.5 (e.g., R² = .07 under a linear violation); under medium intercept noninvariance, it was 0.8 (e.g., R² = .15 under a linear violation). Under small factor loading noninvariance, the path coefficient (β₂ under a linear violation or β₄ under a nonlinear violation) regressing a noninvariant item on the violator-by-latent-factor interaction was 0.5 (e.g., R² = .14 under a linear violation); under medium factor loading noninvariance, this coefficient was 0.8 (e.g., R² = .29 under a linear violation). Under the nonlinear violations, we held the coefficient values the same as in the linear conditions to understand how the form of the noninvariance function (i.e., quadratic) affected the different invariance evaluation methods.

Sample Size
There were three sample size conditions: n = 200, 500, and 1,000. Our sample size decisions were based on the minimum ratio of sample size to number of estimated parameters in SEM (i.e., 10:1; Kline, 2016) and on parameter bias and recovery related to the proportion of noninvariant parameters, sample size per group, and number of groups under AO. These sample sizes are similar to previous studies (e.g., Asparouhov & Muthén, 2014; Kim et al., 2012).
This study employed 2 × 2 × 2 × 2 × 3 = 48 conditions for comparing the performance of MIMIC and MIMIC-interaction to the AO method in detecting noninvariance (see Table 2). In addition, there were 3 (sample size) × 2 (intercept or factor loading analytic models) = 6 conditions for evaluating familywise Type I error rates (Frane, 2015) of MIMIC, MIMIC-interaction, and AO when data were invariant. In the population model, the single factor was normally distributed with mean = 0 and error variance = 1; all six indicators were continuous, with factor loadings = 1 and unique factor variances = 1.25. The continuous violator was normally distributed and standardized, with mean = 0 and variance = 1. The regression coefficient for the violator as a causal indicator of the factor was 0.6.
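The population model above can be sketched as a data-generating function. This is our Python re-implementation for illustration only (the study generated data in R); which item carries the violation, and the function and variable names, are our assumptions.

```python
import numpy as np

def generate_data(n, b1=0.0, b3=0.0, seed=0):
    """One data set under the population model described above: one
    factor, six continuous items with loadings 1 and unique variances
    1.25, and a violator W ~ N(0, 1) with path 0.6 to the factor.
    b1 / b3 add a linear / quadratic intercept violation to item 6
    (the choice of item 6 is an assumption)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)                    # continuous violator
    eta = 0.6 * w + rng.normal(size=n)        # factor: error variance = 1
    y = eta[:, None] + rng.normal(scale=np.sqrt(1.25), size=(n, 6))
    y[:, 5] += b1 * w + b3 * w ** 2           # intercept noninvariance
    return w, eta, y

w, eta, y = generate_data(1000, b1=0.5)
```

An invariant item then has variance var(η) + 1.25 = 0.36 + 1 + 1.25 = 2.61 in the population, and W and η correlate at 0.6/√1.36 ≈ .51.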

Analytic Models and Analysis Methods
Two types of analytic models were used for the different types of generated data sets. For full invariance data, we fit MIMIC and MIMIC-interaction models separately to obtain Type I error rates for identifying each item as violating intercept or factor loading invariance. For intercept noninvariance data, we fit a linear MIMIC analytic model (see Equation (6)) to detect intercept noninvariance; for factor loading noninvariance data, we fit a linear MIMIC-interaction analytic model (see Equation (7)) to detect factor loading noninvariance. After fitting the analytic models, we investigated noninvariance in the generated data using the MIMIC, MIMIC-interaction, and AO methods introduced below.
Under the MIMIC and MIMIC-interaction methods, nonsequential and sequential procedures were combined with LRTs with a Bonferroni correction or Oort's (1992) adjustment and Wald tests with a Bonferroni correction (i.e., three tests). Therefore, seven analysis methods (2 procedures [nonsequential or sequential] × 3 tests + 1 under AO) were used in the study.

Nonsequential Likelihood Ratio Test with Bonferroni Correction (NLB)
For the nonsequential LRT detecting noninvariance, we fit one full invariance model (see Equation (2)) and six intercept or factor loading noninvariance models (see Equation (6) or Equation (7), respectively), testing each item under ML in each replication while keeping the remaining five items as anchor items. As six LRTs were conducted, one per item, the chi-square critical value was 6.960 under the Bonferroni correction (i.e., .05/6 for testing six items separately) at a .05 significance level with df = 1.

Nonsequential Likelihood Ratio Test with Oort's (1992) Adjustment (NLO)

For the nonsequential LRT with Oort's (1992) adjustment (NLO), we used the same procedures as NLB in terms of the estimator, the full invariance model, and the noninvariance analytic models releasing one item. However, instead of the chi-square critical value of 6.960 under the Bonferroni correction, we used the chi-square critical value under Oort's (1992) rule (see Equation (15)). For example, for a full invariance model and an intercept noninvariance model with one item released, the original chi-square critical value C = 3.84 was adjusted at a .05 significance level with df = 1.

Nonsequential Wald Test with Bonferroni Correction (NW)
For the nonsequential Wald test detecting noninvariance, the estimator and the analytic models testing each item were the same as in NLB and NLO. Specifically, we used an intercept noninvariance model involving b 1 (see Equation (6)) and examined its statistical significance for a potential linear or nonlinear intercept noninvariance in the population. We labeled b 1 as b1 and used the Mplus command MODEL TEST: b1 = 0; to test whether it was statistically different from 0. Similarly, for detecting factor loading noninvariance, we labeled as b2 the coefficient b 2 of the latent construct and violator interaction effect (see Equation (7)) in Mplus (with the command MODEL TEST: b2 = 0;) and examined its statistical significance for a potential linear or nonlinear factor loading noninvariance in the population. The p critical value was .0083 (i.e., .05/6 for testing six items separately) for each test with a Bonferroni correction at a .05 significance level.
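The single-parameter Wald test carried out by Mplus's MODEL TEST can be sketched outside Mplus as follows (illustrative only; the estimate and standard error below are made-up numbers standing in for MIMIC output):

```python
from statistics import NormalDist

def wald_p_value(estimate, se):
    """Two-sided Wald test p value for H0: coefficient = 0 (df = 1)."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical values standing in for an estimated b1 and its SE
p = wald_p_value(estimate=0.30, se=0.10)
flagged = p < 0.05 / 6  # Bonferroni-adjusted criterion (.0083)
```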

Sequential Likelihood Ratio Test with Bonferroni Correction (SLB)
For the sequential LRT with the Bonferroni correction, in step 1, we used the same models as in NLB and NLO and identified items with statistically significant noninvariance under each replication (if such items existed). In the following steps, we searched for the item (e.g., Y 1 ) with the largest statistically significant chi-square statistic among the six items and released that item under the given replication. The search stopped when no statistical significance was found for the next item, or when five out of six items had been released. The chi-square critical value of 6.960 was used throughout each search with the Bonferroni correction, as in NLB.
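The sequential release procedure can be sketched as a simple loop (a minimal illustration; `fit_lrt` is a hypothetical stand-in for refitting the model with one candidate item freed and returning its chi-square difference statistic):

```python
CRITICAL = 6.960  # Bonferroni-adjusted chi-square critical value, df = 1

def sequential_release(items, fit_lrt, max_released=5):
    """Release the most significant item one at a time until no item
    exceeds the critical value or max_released items have been freed."""
    released = []
    candidates = list(items)
    while candidates and len(released) < max_released:
        # Chi-square difference statistic for freeing each remaining item
        stats = {item: fit_lrt(item, released) for item in candidates}
        best = max(stats, key=stats.get)
        if stats[best] <= CRITICAL:
            break  # nothing left that is statistically significant
        released.append(best)
        candidates.remove(best)
    return released

# Toy example: fixed chi-square statistics standing in for real model fits
toy_stats = {"Y1": 2.1, "Y2": 8.4, "Y3": 1.0, "Y4": 0.5, "Y5": 3.2, "Y6": 12.7}
released = sequential_release(toy_stats, lambda item, _: toy_stats[item])
```

In this toy example, Y6 is released first (12.7 > 6.960), then Y2 (8.4 > 6.960), and the search stops because the next-largest statistic (3.2) is below the critical value.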

Sequential Likelihood Ratio Test with Oort's Adjustment (SLO)
The procedures for detecting noninvariant items under SLO were the same as SLB, except that Oort's (1992) rule was used to adjust the chi-square critical value.
[Table residue: sample size conditions n = 200, 500, and 1,000. Notes: In total, six items were generated under a one-factor model. b 1 and b 3 were path coefficients between a continuous covariate W and an item, denoting potential linear and nonlinear violations of intercept invariance, respectively; b 2 and b 4 were path coefficients due to a continuous covariate W and latent construct interaction on an item, denoting possible linear and nonlinear violations of factor loading invariance, respectively.]

Sequential Wald Test with Bonferroni Correction (SW)
For the sequential Wald test detecting noninvariant items, in step 1, we used the same models and ML estimator as in NW and identified items with statistically significant noninvariance under each replication (if such items existed). In the following steps, the procedures for detecting noninvariant items were similar to SLB, searching for the item with the smallest (i.e., most statistically significant) p value. The p critical value of .0083 was used throughout each search for noninvariant item(s), as in NW.

Alignment Optimization (AO)
Under AO, the continuous violator was discretized in R into k = 10 groups and used as the group membership. We used fixed AO (i.e., the factor mean in the reference group a 10 = 0, as Mplus defaults to using the last group as the reference group) and fit a factor model with six items in Mplus to detect noninvariance. An item was flagged as noninvariant when at least one parenthesis related to that item appeared in the Mplus output. In our preliminary analysis, we found that larger k values (e.g., 20) had severe convergence issues, as a larger k introduced more parameters while reducing the sample size of each group. Therefore, we chose k = 10 for our proposed method.
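The discretization step was done in R; the sketch below shows an equivalent equal-frequency (quantile) binning in Python. Note that the exact binning rule used in the study is not stated here, so equal-frequency grouping is an assumption:

```python
import random

def discretize(w, k=10):
    """Equal-frequency binning of a continuous violator into k groups
    (labels 1..k), usable as a grouping variable for AO."""
    order = sorted(range(len(w)), key=lambda i: w[i])
    groups = [0] * len(w)
    for rank, i in enumerate(order):
        # Assign each observation's rank to one of k equal-sized bins
        groups[i] = rank * k // len(w) + 1
    return groups

random.seed(1)
w = [random.gauss(0, 1) for _ in range(1000)]  # simulated continuous violator
groups = discretize(w, k=10)
# 10 equal-frequency groups of 100 observations each
```

Equal-frequency bins keep the per-group sample size balanced, which matters here because AO's convergence degraded as group sizes shrank.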
Under each method, we defined Type I error rates (i.e., false positive rates) of a test identifying each invariant item under full invariance, and statistical power rates (i.e., true positive rates) for detecting each noninvariant item as well as Type I error rates for testing each invariant item under noninvariance conditions. As introduced, the procedure or the critical value used to identify an item as noninvariant differed among the seven methods.

Evaluation Criteria
Under MIMIC and MIMIC-interaction, the Type I error rate of a test was the proportion of replications in which an item was falsely detected as noninvariant when it was truly invariant. The statistical power rate of a test was the proportion of replications correctly identifying an item as noninvariant when the item was generated as noninvariant. Under the AO method, Type I error rates were the proportion of replications in which an invariant item was falsely flagged in at least one group when the item was generated with full invariance. Statistical power rates were the proportion of replications in which a noninvariant item was identified in at least one group when the item was generated with intercept or factor loading noninvariance. Considering both true positive rates (i.e., statistical power rates) for detecting each noninvariant item and true negative rates (i.e., 1 − Type I error rates) for identifying each invariant item, we used classification accuracy (CA) to evaluate the performance of the MIMIC, MIMIC-interaction, and AO methods in detecting violations of invariance. Classification accuracy was the proportion of replications correctly identifying all invariant and noninvariant items within the same replication. For example, under the one intercept noninvariance item (i.e., Y 6 ) condition and NLB, CA was the proportion of replications that flagged only Y 6 as noninvariant, and none of the other items. As another example, under full invariance and the AO method, CA was the proportion of replications in which all items were identified as invariant within the same replication.
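Classification accuracy as defined above can be computed per replication as an exact match between the flagged item set and the truly noninvariant item set (a minimal sketch; the flag data below are made up for illustration):

```python
def classification_accuracy(flagged_per_rep, true_noninvariant):
    """Proportion of replications whose flagged item set exactly
    matches the set of truly noninvariant items."""
    truth = set(true_noninvariant)
    hits = sum(set(flags) == truth for flags in flagged_per_rep)
    return hits / len(flagged_per_rep)

# Toy example: 4 replications where Y6 is the truly noninvariant item.
# Only replications flagging exactly {Y6} count as correct.
reps = [{"Y6"}, {"Y6"}, {"Y2", "Y6"}, set()]
ca = classification_accuracy(reps, {"Y6"})  # 2 of 4 replications correct
```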

Classification Accuracy under Intercept Invariance and Intercept Noninvariance
When there was full invariance, NLB, NW, SLB, and SW under the MIMIC method and AO gave similarly high CA (CA ≥ .94) when n ≥ 500. When n = 200, NLB, NW, SLB, and SW had higher CA (e.g., .95 under SW) than AO (.85). The NLO and SLO methods had CAs of less than .62 across all sample sizes.
Figure 4 displays the CA of the MIMIC and AO methods for correctly identifying one noninvariant item Y 6 and five invariant items (i.e., Y 1 -Y 5 ). The magnitude of noninvariance had a modest impact on CA for linear and nonlinear intercept invariance violations. For example, under a linear violation, CA was similarly high between a small intercept noninvariance with NLO (.78-.99), SLB (.92-.95), and SW (.93-.96), and a medium intercept noninvariance with NLO (.86-1.00), SLB (.94-.96), and SW (.95-.97) across all ns. AO also showed high CA (.85-.99) across both magnitudes of noninvariance when 500 ≤ n ≤ 1,000. On the other hand, NLB, NW, and SLO did not perform well under a linear violation of intercept invariance, with CA between .00 and .67. In addition, NLB and NW showed higher CA under a small magnitude of intercept noninvariance than under a medium one, due to lower Type I error rates for identifying invariant items when one of the five anchor items was noninvariant; statistical power rates for detecting the noninvariant item Y 6 , however, were similar. Under a nonlinear violation, Figure 4 shows that all MIMIC methods had low CA (.03-.21) across all ns and both magnitudes of intercept noninvariance. In contrast, AO's CA improved from .09 to a range of .66-.97 as n increased from 200 to 500 ≤ n ≤ 1,000.
Figure 5 shows the CAs of the MIMIC and AO methods identifying two noninvariant items (i.e., Y 2 and Y 6 ) and four invariant items (i.e., Y 1 and Y 3 -Y 5 ) when a linear or nonlinear violation of intercept invariance occurred. For a linear violation of intercept invariance, among all MIMIC methods, SLB and SW showed acceptable and higher CA (.85-.97) than NLB, NLO, NW, and SLO (CA between .00 and .78) across small and medium noninvariance when 500 ≤ n ≤ 1,000; AO had low to high CA (.60-.99) under the corresponding conditions. For a nonlinear violation of intercept invariance, the MIMIC methods had low CA across all sample sizes and magnitudes of noninvariance (e.g., maximum CA = .31 under SLB and SW, medium violation, and n = 1,000). AO had CA in a broad range of .25-.95 under both magnitudes of noninvariance and 500 ≤ n ≤ 1,000. Once n reached 1,000, AO performed as competitively as SLB and SW for a linear invariance violation (CA ≥ .94) and was the best among all methods for a nonlinear invariance violation (CA ≥ .86), regardless of the magnitude of intercept noninvariance and the proportion of noninvariant items.

Classification Accuracy under Factor Loading Invariance and Factor Loading Noninvariance
When there was no factor loading noninvariance, the CA results of the MIMIC-interaction and AO methods were similar to those under no intercept noninvariance. For MIMIC-interaction, NLB, NW, SLB, and SW had high CA (.94-.96) across all ns; NLO and SLO had low CA (.59-.63). AO performed well, with high CA (.95-.99) across all sample sizes from n = 200 (i.e., 20 per group) to 1,000 (i.e., 100 per group) for k = 10 groups.
We found CA patterns for MIMIC-interaction models detecting factor loading noninvariance similar to the corresponding conditions of MIMIC models detecting intercept noninvariance. MIMIC-interaction was not sensitive to the magnitude of noninvariance and was more sensitive to a linear violation of factor loading invariance than to a nonlinear one across all ns. Figure 6 shows the CA of the MIMIC-interaction and AO methods under one noninvariant item (i.e., Y 6 ) and five invariant items (i.e., Y 1 -Y 5 ) when a linear or nonlinear violation of factor loading invariance occurred. SLB and SW had good and similar CA under a linear violation of factor loading invariance, with the CA lines for different magnitudes of noninvariance overlapping under each method. Specifically, for SLB and SW, CA was .95-.97 under a small magnitude of noninvariance and .96-.97 under a medium magnitude. When there was a nonlinear violation, all MIMIC-interaction methods were suboptimal, with CA ≤ .44.
As with detecting intercept noninvariance, AO correctly identified invariant and noninvariant items under linear and nonlinear violations of factor loading invariance. However, AO was affected by the magnitude of noninvariance, especially when n ≤ 500. For example, under a linear violation of factor loading invariance on one item, the method had a slightly lower CA under a small magnitude of violation (.10-.84, see Figure 6) than under a medium magnitude (.45-.98) when n ≤ 500. Once the sample size increased to 1,000, the two CA lines for the different magnitudes of noninvariance intersected at CA = .99.
When there were two noninvariant items due to a linear or nonlinear violation of factor loading invariance, the AO method also performed well when 500 ≤ n ≤ 1,000 (e.g., .89-.99 for a small magnitude of nonlinear violation, see Figure 7), except for CA = .54 under a small magnitude of linear violation with n = 500. On the other hand, SLB and SW under the MIMIC-interaction model were sensitive only to a linear violation of factor loading invariance, with CA ≥ .96 regardless of the magnitude of noninvariance and sample size. The other MIMIC-interaction methods, NLB, NLO, NW, and SLO, did not correctly identify the invariant items and the factor loading noninvariant items.

Discussion
Our research was driven by the practical need to evaluate measurement invariance concerning a continuous violator. On the one hand, previous studies showed that both MIMIC and MIMIC-interaction models were sensitive to intercept noninvariance (e.g., Kim et al., 2012; Lee et al., 2017), and that the MIMIC-interaction model was more effective for factor loading noninvariance than MIMIC (e.g., Woods & Grimm, 2011). These MIMIC or MIMIC-interaction studies often used a binary or categorical grouping variable as the violator and compared performance to MGCFA (e.g., Bauer, 2017). As MIMIC and MIMIC-interaction allow grouping variables at any measurement level, we could evaluate their performance for our question. It was also unknown how MIMIC and MIMIC-interaction perform under nonlinear (i.e., quadratic in this design) violations of invariance.
On the other hand, the recently developed AO method is designed for invariance evaluation across many groups (e.g., 2-100; Asparouhov & Muthén, 2014). It therefore allowed us to explore a new use of the AO method: discretizing a continuous violator into many groups and evaluating invariance concerning the newly categorized violator, which again relates to our goal of solving practical invariance evaluation problems. In addition, the AO method does not assume a specific functional form for an invariance violation, leading us to expect that it would outperform MIMIC (for intercept noninvariance investigation) and MIMIC-interaction (for factor loading noninvariance investigation) when there is a nonlinear violation.

Summary of Simulation Results
For MIMIC, we found high false positive rates with NLB and NW on each invariant item when one or two of the five anchor items were noninvariant. We confirmed the findings of Kim et al. (2012) that MIMIC was sensitive to a linear violation of intercept invariance when using LRTs with Oort's (1992) adjustment (i.e., NLO with one noninvariant item) and, as new findings, in SLB and SW with one and two noninvariant items when n ≥ 500. False positive rates were higher in the two-noninvariant-item conditions than in the one-noninvariant-item conditions for SLB and SW when n = 200. As a result, the rates of correctly identifying the invariance status of all items were low, with CA ≤ .81 in SLB under small and medium noninvariance and in SW under small noninvariance.
In our simulation, which included one continuous violator and continuous items, for MIMIC-interaction we observed severely inflated Type I error rates for identifying invariant items with NLB and NW when one or two of the five anchor items were truly noninvariant. The results were similar to Woods and Grimm's (2011) findings of severe Type I error rates when researchers used MIMIC-interaction for the joint test (i.e., testing factor loading and intercept noninvariance simultaneously, p < .05) under one binary grouping variable, binary or five-category ordinal response options, and full invariance conditions. In addition, we confirmed our assumption that MIMIC and MIMIC-interaction models were not sensitive to nonlinear invariance violations (i.e., quadratic in this study).
The AO method showed high CA in identifying all six items under intercept invariance (CA ≥ .85) or factor loading invariance (CA ≥ .95) across all sample sizes when data were generated as invariant. The AO method was sensitive to linear and nonlinear invariance violations, with CA ≥ .86 when n ≥ 1,000 (i.e., 100 per group for 10 groups), regardless of the type of noninvariance, the proportion of noninvariant items, and the magnitude of noninvariance. As expected, AO, a nonparametric method, did not work well under a small sample size.

Recommendations and Limitations
Once the sample size reached 1,000, the AO method performed best among all methods, correctly detecting invariant and noninvariant items with classification accuracy exceeding 85% in all conditions. Therefore, we recommend a new direction for the AO method in invariance evaluation: categorizing a continuous violator into many groups and using the newly categorized variable as the grouping variable. This procedure works well for detecting intercept and factor loading noninvariance so long as the sample size is adequate (e.g., 100 per group for k = 10 groups), whether the violation is linear or nonlinear. Regardless of the magnitude of noninvariance, MIMIC models under sequential likelihood ratio tests with Bonferroni correction (SLB) and sequential Wald tests with Bonferroni correction (SW) are suitable for detecting a linear violation of intercept invariance with zero to one noninvariant item when n ≥ 200 and with two noninvariant items when n ≥ 500. Likewise, MIMIC-interaction models under SLB and SW are appropriate for detecting a linear violation of factor loading invariance on zero to two items when n ≥ 200. However, our work showed that the SLB and SW procedures under MIMIC (and MIMIC-interaction) required formulating and comparing many models in Mplus and R, which was less convenient than AO. In the Supplementary Materials, we show that AO performed well in detecting simultaneous violations of linear intercept invariance and factor loading invariance (CA ≥ .87 for factor loading invariance and factor loading noninvariance, n = 1,500).
As with other studies, there are several limitations. First, we evaluated the performance of the MIMIC, MIMIC-interaction, and AO methods under conditions of intercept noninvariance and factor loading noninvariance separately. We aimed to understand the performance of each method at a specific noninvariance level first before going further, as Flake and McCoach (2018) did. Second, we used k = 10 groups to discretize the continuous violator with AO, as models were not identified with larger k values, such as 20, when the total sample size n ≤ 1,000. k = 10 gave better results than k = 15 across all conditions and was sufficient for good performance under AO when n ≥ 1,000. Future work needs to investigate how the AO method is influenced by the proportion of noninvariant items and by the ratio between within-group sample size and the number of groups. Third, the factor quality in this study was reasonable, with 1/(1 + 0.6²) ≈ 74% of the total variance of an observed score explained by the factor. Future studies can examine measurement invariance under different levels of factor quality. Fourth, the current study focuses on MIMIC and AO under the assumption that configural invariance holds; future research can explore whether MIMIC and AO can be used, and how they compare to each other, when configural invariance does not hold. Fifth, this study generated noninvariance using partial invariance models; future studies can examine how the MIMIC-related methods and AO perform under approximate invariance data. In addition, one model similar to MIMIC for detecting measurement noninvariance is restricted factor analysis (RFA; Barendse et al., 2010, 2012, 2014), which presumes a correlation between the grouping variable and the latent construct. Our study did not include an RFA model, as we controlled the impact of the grouping variable on the latent construct directly via a regression instead of a correlation. This can be further explored in future studies under different tests.

Open Practices Statement
The R and Mplus codes, and supplementary materials for the simulation study are available on the Open Science Framework (https://osf.io/94zmu/?view_only=b73cfa8293fb4bd7b4349c6b934c0098).

Figure 1. MIMIC and MIMIC-interaction model: one covariate on one observed item. MIMIC: multiple indicators, multiple causes; η: latent construct; ψ: latent construct variance; Y 1 to Y 6 : observed item scores; λ 1 to λ 6 : factor loadings; τ 1 to τ 6 : intercepts; δ 1 to δ 6 : unique factors; W: a covariate; c: the path coefficient between W and η; b 1 : a path coefficient between W and Y 6 , indicating a possible violation of intercept invariance; b 2 : a path coefficient due to the W and η interaction on Y 6 , indicating a possible violation of factor loading invariance. b 1 and b 2 were tested separately in the study.

Figure 2. Latent construct vs. expected value of item score: intercept noninvariance. The dotted horizontal gray lines represent an expected item score of 0.

Figure 3. Latent construct vs. expected value of item score: factor loading noninvariance. The dotted horizontal gray lines represent an expected item score of 0.

Figure 4. Classification accuracy under intercept noninvariance: one noninvariant item. AO: alignment optimization; NLB: nonsequential likelihood ratio test (LRT) with Bonferroni correction; NLO: nonsequential LRT with Oort's (1992) adjustment; NW: nonsequential Wald test with Bonferroni correction; SLB: sequential LRT with Bonferroni correction; SLO: sequential LRT with Oort's (1992) adjustment; SW: sequential Wald test with Bonferroni correction. Magnitude denotes small or medium noninvariance. For multiple indicators, multiple causes (MIMIC), classification accuracy (CA) was investigated under NLB, NLO, NW, SLB, SLO, and SW. The number of groups in the AO method is 10, with sample sizes of 20, 50, and 100 per group. The dashed green lines and solid red lines represent CA under small and medium noninvariance across methods, respectively. The dotted and long-dashed gray lines represent CA = .00 and .90, respectively.

Figure 5. Classification accuracy under intercept noninvariance: two noninvariant items. AO: alignment optimization; NLB: nonsequential likelihood ratio test (LRT) with Bonferroni correction; NLO: nonsequential LRT with Oort's (1992) adjustment; NW: nonsequential Wald test with Bonferroni correction; SLB: sequential LRT with Bonferroni correction; SLO: sequential LRT with Oort's (1992) adjustment; SW: sequential Wald test with Bonferroni correction. Magnitude denotes small or medium noninvariance. For multiple indicators, multiple causes (MIMIC), classification accuracy (CA) was investigated under NLB, NLO, NW, SLB, SLO, and SW. The number of groups in the AO method is 10, with sample sizes of 20, 50, and 100 per group. The dashed green lines and solid red lines represent CA under small and medium noninvariance across methods, respectively. The dotted and long-dashed gray lines represent CA = .00 and .90, respectively.

Figure 6. Classification accuracy under factor loading noninvariance: one noninvariant item. AO: alignment optimization; NLB: nonsequential likelihood ratio test (LRT) with Bonferroni correction; NLO: nonsequential LRT with Oort's (1992) adjustment; NW: nonsequential Wald test with Bonferroni correction; SLB: sequential LRT with Bonferroni correction; SLO: sequential LRT with Oort's (1992) adjustment; SW: sequential Wald test with Bonferroni correction. Magnitude denotes small or medium noninvariance. For multiple indicators, multiple causes with interaction (MIMIC-interaction), classification accuracy (CA) was investigated under NLB, NLO, NW, SLB, SLO, and SW. The number of groups in the AO method is 10, with sample sizes of 20, 50, and 100 per group. The dashed green lines and solid red lines represent CA under small and medium noninvariance across methods, respectively. The dotted and long-dashed gray lines represent CA = .00 and .90, respectively.

Figure 7. Classification accuracy under factor loading noninvariance: two noninvariant items. AO: alignment optimization; NLB: nonsequential likelihood ratio test (LRT) with Bonferroni correction; NLO: nonsequential LRT with Oort's (1992) adjustment; NW: nonsequential Wald test with Bonferroni correction; SLB: sequential LRT with Bonferroni correction; SLO: sequential LRT with Oort's (1992) adjustment; SW: sequential Wald test with Bonferroni correction. Magnitude denotes small or medium noninvariance. For multiple indicators, multiple causes with interaction (MIMIC-interaction), classification accuracy (CA) was investigated under NLB, NLO, NW, SLB, SLO, and SW. The number of groups in the AO method is 10, with sample sizes of 20, 50, and 100 per group. The dashed green lines and solid red lines represent CA under small and medium noninvariance across methods, respectively. The dotted and long-dashed gray lines represent CA = .00 and .90, respectively.
For group k, the factor model is Y_k = τ_k + Λ_k η_k + δ_k, where Y_k, τ_k, and δ_k are vectors of observed item scores, measurement intercepts, and unique factors of length p, respectively; η_k is a vector of common factors of length m; and Λ_k is a p × m factor loading matrix. A factor loading reflects how an observed score changes per one-unit change in a latent construct; an intercept indicates the observed item score when every latent construct is 0. Let E(η_k) = α_k, Cov(η_k) = Ψ_k, and Cov(δ_k) = Θ_k for group k. Let each unique factor in δ_k have mean 0 (i.e., E[δ_k] = 0); it is usually assumed that δ_k is multivariate normal. Furthermore, let the latent factors and unique factors be independent, with covariance Cov(η_k, δ_k) = 0.
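The group-specific factor model described in this passage can be written compactly in standard CFA notation (consistent with the definitions given here):

```latex
\begin{aligned}
\mathbf{Y}_k &= \boldsymbol{\tau}_k + \boldsymbol{\Lambda}_k \boldsymbol{\eta}_k + \boldsymbol{\delta}_k,\\
E(\boldsymbol{\eta}_k) &= \boldsymbol{\alpha}_k, \quad
\operatorname{Cov}(\boldsymbol{\eta}_k) = \boldsymbol{\Psi}_k, \quad
\operatorname{Cov}(\boldsymbol{\delta}_k) = \boldsymbol{\Theta}_k,\\
E(\boldsymbol{\delta}_k) &= \mathbf{0}, \quad
\operatorname{Cov}(\boldsymbol{\eta}_k, \boldsymbol{\delta}_k) = \mathbf{0}.
\end{aligned}
```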

Table 1. Effect size under violation of invariance on one item.