Modeling Heterogeneity in Temporal Dynamics: Extending Latent State-Trait Autoregressive and Cross-lagged Panel Models to Mixture Distribution Models

Abstract Longitudinal models suited for the analysis of panel data, such as cross-lagged panel or autoregressive latent state-trait models, assume population homogeneity with respect to the temporal dynamics of the variables under investigation. This assumption is likely too restrictive in a myriad of research areas. We propose an extension of autoregressive and cross-lagged latent state-trait models to mixture distribution models. The models allow researchers to model unobserved person heterogeneity and qualitative differences in longitudinal dynamics based on comparatively few observations per person, while taking into account temporal dependencies between observations as well as measurement error in the variables. The models are extended to include categorical covariates in order to investigate the distribution of the encountered latent classes across observed groups. The potential of the models is illustrated with an application to self-esteem and affect data from patients with borderline personality disorder, patients with an anxiety disorder, and healthy control participants. Requirements for the models' applicability are investigated in an extensive simulation study, and recommendations for model applications are derived.

Recent developments in modeling longitudinal data emphasize the importance of investigating time-dependent within-person dynamics of psychological processes as they unfold at the level of the individual (e.g., Hamaker, 2012; Hamaker et al., 2015; Molenaar, 2004; Molenaar & Campbell, 2009). This call for investigating longitudinal dynamics on the within-person level is also omnipresent in the study of psychopathology and psychotherapy (e.g., Piccirillo & Rodebaugh, 2019; Wright & Woods, 2020; Wright & Zimmermann, 2019) and was fostered by technological advances that facilitate the collection of (intensive) longitudinal data and a concomitant increase in the popularity of ambulatory assessment (AA) methods.
As argued by Molenaar (2004), results from between-person analyses (inter-individual variation) can only be generalized to the level of the individual (intra-individual variation) under very strict conditions, termed ergodicity. Ergodicity of a dynamic process requires stationarity of the process as well as homogeneity of the population. Homogeneity is given if the same statistical model with equal model parameters applies to the dynamic process of all individuals within a population (Molenaar & Campbell, 2009). This assumption has repeatedly been observed to be too strict in a myriad of psychological research areas. For instance, empirical applications in emotion research indicate that heterogeneity of affect dynamics across individuals is the rule rather than the exception (e.g., Brose et al., 2015; Fisher et al., 2017; Wright et al., 2016, 2019). Consequently, longitudinal models that assume population homogeneity with respect to the dynamics of psychological processes, such as (random-intercept) cross-lagged panel models (Hamaker et al., 2015) or autoregressive latent state-trait models (Eid et al., 2017), may be too restrictive. However, as argued by Voelkle et al. (2014), conditional equivalence may be given even if unconditional equivalence (i.e., ergodicity) is not, if between- and within-person structures are equivalent after controlling for relevant factors (sources of heterogeneity or non-stationarity) on either level (e.g., covariates explaining group differences).
In the following, we focus on differences between individuals' dynamics that are of a qualitative nature. Consider affect dynamics in persons with borderline personality disorder (BPD) as compared to a healthy comparison group. Two of the defining features of BPD are a pervasive pattern of instability in affect as well as in the sense of self (American Psychiatric Association, 2013). Patients with BPD are therefore assumed to exhibit different temporal dynamics in affect and self-esteem (and potentially also in the temporal coupling of the two) than healthy comparisons do.
One recent approach to capturing qualitative differences in within-person dynamics is latent class vector autoregressive (VAR) modeling, together with clustering approaches for VAR models, which cluster persons with similar dynamic processes into subgroups (Bulteel et al., 2016; Ernst et al., 2020; Ernst et al., 2021). One disadvantage of these VAR clustering approaches is that they require comparatively long time series per person (a minimum of 50 observations per person; Bulteel et al., 2016; Ernst et al., 2021). Another disadvantage is that they do not handle possible measurement error in the observed variables. This restriction can be overcome by using structural equation modeling approaches, which separate reliable from unreliable variance and specify the dynamic processes on the basis of measurement-error-free latent variables. For instance, Courvoisier et al. (2007) extended latent state-trait (LST) models to mixture models, which allow researchers to separate latent subgroups of persons differing in the stability and variability of a construct across time. The latter models were designed for panel data comprising few observations per person sampled at comparatively long time intervals between adjacent measurements (e.g., several months or years). In many instances, data gathered by AAs or diary studies comprise a medium number of observations per person (e.g., between 7 and 30), with short time intervals between measurements (e.g., one or several hours, one day, or one week). While such data oftentimes do not meet the sample size requirements of models designed for the analysis of intensive longitudinal data,¹ the temporal dependency between the measurements should be adequately modeled and is often of prime interest.
In the following, we propose an extension of autoregressive (AR) as well as cross-lagged (CL) LST models to mixture models, in order to model unobserved person heterogeneity and qualitative differences in longitudinal dynamics based on comparatively few observations per person, while taking into account temporal dependencies as well as measurement error. The proposed models aim to identify latent subgroups of persons with different temporal dynamics, with a focus on variability processes. That is, as opposed to growth mixture models (GMMs; e.g., Ram & Grimm, 2009), the presented models are suited to model (short-term) variability and covariation rather than change processes.
Mixture LST-AR models provide several advantages for the analysis of longitudinal data with a moderate number of observations per person. First, in (multiple-indicator) LST models, measurement error can be separated from true situational influences underlying the temporal variability of psychological constructs. Second, by including AR effects in LST models, the serial dependency that is likely to exist between measurements closely spaced in time is adequately modeled. Thereby, true occasion-specific components of a construct can be separated into temporally predictable and unpredictable components. Instability can then adequately be estimated based on the variance of that part of an occasion-specific observation that cannot be predicted by temporally preceding observations. By extending LST-AR models to mixture LST-AR models, latent classes of persons that differ with respect to their variability and reactivity to unobserved situational events can be separated. Thereby, the model accounts for possible heterogeneity in persons' dynamics across time and allows researchers to identify latent subgroups of persons that differ with respect to key model parameters. That is, we relax the restrictive assumption that the temporal dynamics as well as the reliability of the measures are the same for all persons. We show that the mixture LST-AR approach presented here already performs well with comparatively few observations per person.

¹ Note that there is a large variety of recently introduced modeling strategies for the analysis of intensive longitudinal data (see, e.g., Asparouhov et al., 2018; Beltz et al., 2016; Driver & Voelkle, 2018; Lane et al., 2019; Oravecz et al., 2009; Schuurman & Hamaker, 2019; Song & Zhang, 2014; Voelkle et al., 2014). A detailed description and comparison of these models is beyond the scope of the present article. However, note that in the case of multilevel time series models, the required sample sizes largely depend on the chosen model and its complexity, i.e., the number of parameters that are modeled as random effects, the number of covariates included, the modeling of measurement error, etc. (see, e.g., Asparouhov et al., 2018; Schultzberg & Muthén, 2018). Due to the regularizing nature of hierarchical models, the required length of the time series tends to decrease with the number of persons observed on the between level, and may additionally be reduced by respective prior specifications with Bayesian estimation (e.g., Driver & Voelkle, 2018).
To analyze the temporal interplay between different psychological constructs (e.g., self-esteem and affect), the models can be extended to accommodate several constructs and CL effects between these constructs across time. Multiconstruct mixture cross-lagged LST models can also be conceived of as an extension of the random-intercept cross-lagged panel model (RI-CLPM; Hamaker et al., 2015) by (a) introducing latent classes with differing model parameters to account for heterogeneity in persons' temporal dynamics and (b) modeling these dynamics on the level of latent factors by use of multiple indicators.
Oftentimes researchers are interested in explaining person heterogeneity with respect to temporal dynamics, e.g., inertia, variability, or instability. To this end, external variables may be used to predict differences in persons' temporal dynamics in mixture LST-AR or LST-CL models by predicting latent class membership. We might, for instance, want to investigate whether differences in persons' temporal dynamics (as captured by the latent classes) mirror clinical diagnoses or, framed differently, whether different clinical diagnoses accurately cluster people with respect to their temporal dynamics in clinically relevant variables. To this end, we extend the models to include categorical predictor variables for latent class membership, which allows us to investigate the distribution of the encountered latent classes across (patient) groups. That is, we can investigate whether patients with different clinical diagnoses (e.g., borderline personality disorder or anxiety disorders) and healthy controls have different probabilities of belonging to latent classes that differ with respect to the inertia in affect or self-esteem, reciprocal temporal effects of affect on self-esteem and vice versa, instability and reactivity to unobserved internal and external influences, as well as the temporal coupling of unexplained variability in affect and self-esteem.
The aim of the present work is to extend existing models to mixture LST-AR and LST-CL models, to illustrate their potential with an application to clinical data, and to investigate the models' applicability under different realistic settings. In the following, we first describe the extension of LST-AR models to multiconstruct models, including CL effects between constructs. In a second step, the presented models are extended to mixture models. The models are applied to affect and self-esteem ratings collected in AAs from patients with BPD, patients with an anxiety disorder (AD), and healthy controls (HC). The example of modeling temporal dynamics in affect and self-esteem is used throughout the model presentation. In an extensive simulation study based on the results of the empirical application, requirements for the accurate estimation of the models in practice are investigated and recommendations for applied researchers are derived.

Modeling heterogeneity in autoregressive and cross-lagged LST models
In the following model extensions, we rely on LST models with AR effects as introduced by Eid et al. (2017), which are defined based on revised latent state-trait (LST-R) theory (Steyer et al., 2015). Eid et al. (2017) have shown that a version of the popular trait-state-occasion (TSO) model (Cole et al., 2005) can be defined as a model of LST-R theory and that some of its restrictions can reasonably be relaxed. They provide detailed information on the definition and meaning of the latent variables and introduce different variance components that can be calculated in LST-AR models. We briefly summarize the LST-AR model within LST-R theory before introducing the extensions to multiple constructs and mixture LST-AR models.

Autoregressive latent-state-trait models
According to LST-R theory (Steyer et al., 2015), an observed variable $Y_{ijt}$ of item $i$ of construct $j$ at time point $t$ can be decomposed into a latent trait variable $\xi_{ijt}$, a latent state residual variable $\zeta_{ijt}$, and a measurement error variable $\varepsilon_{ijt}$. Latent trait variables $\xi_{ijt}$ are defined as the expectation of the person-specific distribution of $Y_{ijt}$ across all possible situations the person might experience at the time of observation. As reflected in the index $t$ of the trait variable $\xi_{ijt}$, LST-R theory explicitly takes into account that a person's trait is time-specific; that is, a trait value might change with the course of time and with the different experiences the person makes. Latent state residual variables, on the other hand, are defined as the deviation of the expected value of $Y_{ijt}$ for a person at time $t$ in situation $s_t$ from the expected value across all possible situations at time $t$ (i.e., $\xi_{ijt}$). That is, latent state residual variables $\zeta_{ijt}$ capture effects of the situation and/or person-situation interaction at time $t$. Together, trait and state residual variables constitute a person's true score $\tau_{ijt}$ at time $t$, with $\varepsilon_{ijt}$ capturing the deviation of the observed score $Y_{ijt}$ from the respective true score $\tau_{ijt}$:

$$\tau_{ijt} = \xi_{ijt} + \zeta_{ijt}, \qquad (1)$$

$$Y_{ijt} = \tau_{ijt} + \varepsilon_{ijt}. \qquad (2)$$

To define models of LST-R theory, several assumptions have to be made, with different assumptions constituting different models of LST-R theory (Steyer et al., 2015). To define autoregressive LST models in line with LST-R theory, it is assumed that a person's trait $\xi_{ijt}$ at time $t > 1$ can be written as a linear combination of the person's trait at $t = 1$ and previous state residual variables $\zeta_{ijt'}$ with $t' < t$, that is,

$$\xi_{ijt} = \alpha_{ijt} + \lambda_{Tijt}\,\xi_{ij1} + \sum_{k=1}^{t-1} \beta_j^{k}\,\zeta_{ij(t-k)}. \qquad (3)$$

The parameter $\alpha_{ijt}$ is an intercept parameter capturing average additive trait change, that is, changes in the average trait level in case $E(\xi_{ij1}) = 0$. The term $\lambda_{Tijt}\,\xi_{ij1}$ captures the effect of a person's trait at $t = 1$ on subsequent true scores and thereby stability ($\lambda_{Tijt} = 1$) or multiplicative trait change ($\lambda_{Tijt} \neq 1$). Multiplicative trait change can be understood as the attenuation ($\lambda_{Tijt} < 1$) or amplification ($\lambda_{Tijt} > 1$) of the initial trait values (at time point 1), which can be described by a linear transformation of the initial trait values that uniformly applies to all persons (Eid et al., 2017; Oeltjen et al., 2020, under review).
The term $\sum_{k=1}^{t-1} \beta_j^{k}\,\zeta_{ij(t-k)}$ captures the effect of cumulative experiences across time, that is, carry-over effects of previous situations and person-situation interactions on subsequent trait values (and true scores). The AR parameter $\beta_j$ thus quantifies carry-over effects across temporally adjacent observations of the same construct. For parsimony, the notation in Eq. (3) assumes that the AR parameter $\beta_j$ is time-invariant, that is, that the strength of carry-over effects is the same between any two temporally adjacent observations. This assumption might be relaxed if time-varying effects are plausible or if successive observations are not measured at equidistant time intervals.
The interpretation of the AR parameters in LST-AR models based on LST-R theory differs slightly from the common interpretation of AR parameters in time series models. In the latter, AR parameters are sometimes interpreted as inertia (resistance to change) or as the speed of return to baseline after a deviation following a perturbation or shock (Kuppens et al., 2010; Suls et al., 1998). In the LST-AR model described here, the latent trait factor does not necessarily represent a stable baseline but is defined as the trait value at the first occasion of measurement. The AR parameters capture the degree to which situation and/or person-situation interaction effects at preceding time points predict trait values at the following time point, that is, the degree to which experienced elevated or decreased levels in a construct due to situation and/or person-situation interactions predict subsequent trait values of the same construct. The interpretation is conceptually similar in that the AR parameter captures carry-over effects within the same variable across time, that is, the degree to which a variable affects itself or is correlated with itself across time above and beyond the degree that is expected based on a person's initial trait level (at time 1).
A person's true score $\tau_{ijt}$ at time $t > 1$ is thus a linear combination of the person's trait value at $t = 1$, previous state residual variables $\zeta_{ijt'}$ with $t' < t$, and the state residual variable $\zeta_{ijt}$ at time $t$, that is,

$$\tau_{ijt} = \alpha_{ijt} + \lambda_{Tijt}\,\xi_{ij1} + \sum_{k=1}^{t-1} \beta_j^{k}\,\zeta_{ij(t-k)} + \zeta_{ijt}. \qquad (4)$$

Again, the latent state residual variable $\zeta_{ijt}$ is the deviation of a person's true score $\tau_{ijt}$ at time $t$ from this person's trait $\xi_{ijt}$ at time $t$. It thereby captures effects of the situation and/or person-situation interactions at time $t$ that are not expected or cannot be predicted based on the person's initial trait values and cumulative experiences across preceding time points. That part of $\tau_{ijt}$ that is not determined by the initial trait $\xi_{ij1}$ is combined into the occasion-specific variable $O_{ijt}$, which is defined as

$$O_{ijt} = \sum_{k=1}^{t-1} \beta_j^{k}\,\zeta_{ij(t-k)} + \zeta_{ijt} = \beta_j\,O_{ij(t-1)} + \zeta_{ijt}. \qquad (5)$$

For model identification reasons, it is assumed that latent state residual variables $\zeta_{ijt}$ and occasion-specific variables $O_{ijt}$ are perfectly correlated across items $i$ of the same construct $j$ at the same time point $t$. Note that we define the trait factors $T_{ij}$ to be indicator-specific, allowing for item-specific effects on the level of the latent traits. If reasonable, a more restrictive model variant assuming perfectly correlated latent traits (i.e., a common trait factor) across indicators measuring the same construct can be specified.
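The equivalence of the cumulative-sum form and the first-order recursion for the occasion-specific variable in Eq. (5) can be checked numerically. The following is a minimal sketch for a single construct and indicator, without measurement error; the parameter value and variable names are illustrative, not the article's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.4                 # hypothetical AR parameter beta_j (illustrative value)
T = 6
zeta = rng.normal(size=T)  # state residuals zeta_t for t = 1..T (0-indexed here)

# Recursive form: O_t = beta * O_{t-1} + zeta_t, with O_1 = zeta_1
O_rec = np.zeros(T)
O_rec[0] = zeta[0]
for t in range(1, T):
    O_rec[t] = beta * O_rec[t - 1] + zeta[t]

# Cumulative form: O_t = zeta_t + sum_{k=1}^{t-1} beta^k * zeta_{t-k}
O_cum = np.array(
    [sum(beta ** k * zeta[t - k] for k in range(t + 1)) for t in range(T)]
)

print(np.allclose(O_rec, O_cum))  # → True
```

Because the recursion simply unrolls into the weighted sum of all previous residuals, the two formulations coincide for any residual sequence and any value of the AR parameter.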
However, the assumption of perfectly correlated indicators is often violated in practice and should be tested empirically by means of model comparisons. That is, in principle, researchers could fit a large variety of different factor models on the trait- as well as the occasion-specific level (e.g., indicator-specific trait factors, one-factor or two-factor congeneric factor models, random-intercept factor models, etc.; Maydeu-Olivares & Coffman, 2006). Therefore, we recommend that researchers perform a detailed analysis of the factor structure before proceeding with the extension of the model to a mixture model. Also see Ram and Grimm (2009) for recommendations on step-by-step procedures for implementing mixture model analyses in longitudinal data.
Based on the previous definitions, the resulting measurement equation for the observed variables $Y_{ijt}$ is given by

$$Y_{ijt} = \alpha_{ijt} + \lambda_{Tijt}\,T_{ij} + \lambda_{Oijt}\,O_{jt} + \varepsilon_{ijt},$$

with $\lambda_{Tij1}$ typically set to 1 for all items $i$ and $\lambda_{O1jt}$ set to 1 for all time points $t$ for identification reasons, and

$$O_{jt} = \beta_{jt}\,O_{j(t-1)} + \zeta_{jt}.$$

Typically, the parameters $\beta_{jt}$ are set invariant across time, resulting in a common AR parameter $\beta_j = \beta_{jt} = \beta_{jt'}$, $\forall t, t'$ (see below for an extension to inter-individually varying times of observation). Note that we have already imposed this assumption in Eqs. (3)-(5). An LST-AR model as defined above is depicted for a model spanning four measurement time points in Figure 1.
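To make the generative structure concrete, the following sketch simulates data from a single-indicator LST-AR model with hypothetical parameter values (loadings fixed to 1, intercepts to 0). It also illustrates that the serial dependency is carried by the occasion-specific process, for which cov(O_t, O_{t-1}) = beta * var(O_{t-1}) holds when the residuals are serially uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T_occ = 20_000, 4
beta, lam_T, lam_O = 0.4, 1.0, 1.0         # hypothetical parameter values
sd_trait, sd_zeta, sd_err = 1.0, 0.8, 0.5

trait = rng.normal(0.0, sd_trait, size=N)  # T_ij (= trait at t = 1)
O = np.zeros((N, T_occ))
Y = np.zeros((N, T_occ))
for t in range(T_occ):
    zeta = rng.normal(0.0, sd_zeta, size=N)
    O[:, t] = zeta if t == 0 else beta * O[:, t - 1] + zeta
    # Measurement equation: Y_ijt = lam_T * T_ij + lam_O * O_jt + eps_ijt
    Y[:, t] = lam_T * trait + lam_O * O[:, t] + rng.normal(0.0, sd_err, size=N)

# Serial dependency of the occasion process: cov(O_t, O_{t-1}) = beta * var(O_{t-1})
print(np.cov(O[:, 2], O[:, 1])[0, 1], beta * O[:, 1].var())
```

With a large simulated sample, the empirical lag-1 covariance of the occasion variables closely matches the model-implied value, while the trait contributes a constant between-person component to every occasion.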
To ensure interpretability of the AR effects, weak measurement invariance of the latent occasion-specific factors across time needs to be established (Meredith, 1993). This is achieved by holding the respective factor loadings constant across time, that is, by setting $\lambda_{Oij} = \lambda_{Oijt} = \lambda_{Oijt'}$, $\forall t, t'$. Note that, in contrast, trait factor loadings $\lambda_{Tijt}$ as well as intercept parameters $\alpha_{ijt}$ do not need to be invariant for a meaningful interpretation of the latent factors and model parameters in LST models. Rather, differences in these parameters across time indicate changes in latent trait scores or average levels across time (for details, see Eid et al., 2017). However, note that in this case the dynamics modeled by the LST model deviate from a pure variability process (also see Geiser et al., 2015). Latent growth curve (LGC) models, in contrast, require different measurement invariance assumptions (Meredith & Horn, 2001). In general, measurement invariance restrictions should be tested in detail before they are imposed, e.g., as a first step after the factor structure of the items has been established. Details on how to perform measurement invariance testing are provided, for instance, in Millsap (2011).

Multiconstruct cross-lagged LST models
In the following, the LST-AR model is extended to a model including two constructs. The multiconstruct version of the LST-AR model is called the cross-lagged LST (LST-CL) model in the following. See Figure 2 for a graphical representation of the two-construct LST-CL model. The basic decomposition of the observed variables $Y_{ijt}$ for each construct $j$ is identical to the decomposition in the LST-AR model defined above, that is,

$$Y_{ijt} = \alpha_{ijt} + \lambda_{Tijt}\,T_{ij} + \lambda_{Oijt}\,O_{jt} + \varepsilon_{ijt},$$

with $\lambda_{Tij1}$ and $\lambda_{O1jt}$ typically set to 1 for all items $i$, constructs $j$, and time points $t$ for identification reasons. In models including two (or more) constructs, the temporal coupling and spillover effects from one construct to the other across time are modeled by the inclusion of CL effects at the level of the occasion-specific variables. That is, the occasion-specific variables are modeled with the following temporal dependencies, for two constructs ($j = 1, 2$):

$$O_{1t} = \beta_{1t}\,O_{1(t-1)} + c_t\,O_{2(t-1)} + \zeta_{1t},$$

$$O_{2t} = d_t\,O_{1(t-1)} + \beta_{2t}\,O_{2(t-1)} + \zeta_{2t}.$$

The parameters $\beta_{jt}$ capture carry-over effects of a variable on itself across time, that is, AR effects. The parameters $c_t$ and $d_t$ are CL effects, capturing spillover effects across different constructs across time. Within LST-R theory, these can be interpreted as the effects of experienced situations (and/or person-situation interactions) with respect to one construct on latent trait variables of another construct at later time points (controlling for AR effects). That is, the parameters $c_t$ and $d_t$ quantify the temporal coupling between the constructs across subsequent time points (controlling for AR effects). A positive CL effect of positive affect (at $t-1$) on subsequent self-esteem (at time $t$) would, for instance, indicate that a level of positive affect that is higher than expected based on the person's trait level at time $t-1$ is associated with a subsequent increase in the person's true score in self-esteem at time $t$ (controlling for AR effects). That is, the CL effects capture potential reciprocal effects between different constructs over time via predictive relationships. Note that the term effect is used in the sense of temporal prediction and does not imply causality.
Typically, the parameters $\beta_{jt}$, $c_t$, and $d_t$ are set invariant across time, resulting in common AR and CL parameters $\beta_j = \beta_{jt} = \beta_{jt'}$, $c = c_t = c_{t'}$, and $d = d_t = d_{t'}$, $\forall t, t'$ (see below for an extension). Again, to ensure interpretability of the AR effects, weak measurement invariance of the latent occasion-specific factors should be established by setting the factor loadings of the occasion-specific factors equal across time, that is, $\lambda_{Oij} = \lambda_{Oijt} = \lambda_{Oijt'}$, $\forall t, t'$. The variables $\zeta_{jt}$ capture true temporary, occasion-specific deviations from the true score value that is expected based on previous trait as well as state residual variables. The variance of $\zeta_{jt}$ thereby quantifies the amount of unexplained variability in construct $j$ across time. In addition to CL effects that capture spillover effects across time, the unexplained deviations $\zeta_{jt}$ of different constructs might be coupled within a time point. That is, time-specific, unexplained deviations of true score values from the predicted values at time $t$ might be correlated for positive affect and self-esteem, with latent correlations quantifying to what extent we expect that larger/smaller positive/negative deviations in positive affect go along with larger/smaller positive/negative deviations in self-esteem.
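The occasion-level dynamics of the two-construct model can be written compactly as a first-order matrix recursion, with the AR effects on the diagonal and the CL effects off-diagonal. The following sketch uses illustrative time-invariant values; the matrix form is our rewriting of the two coupled equations, not notation from the article.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical time-invariant dynamics for two constructs:
# B[0, 0] and B[1, 1] are the AR effects beta_1 and beta_2;
# B[0, 1] and B[1, 0] play the role of the CL effects c and d.
B = np.array([[0.40, 0.15],
              [0.10, 0.30]])

T_occ = 5
zetas = rng.normal(0.0, 0.8, size=(T_occ, 2))  # occasion residuals zeta_{jt}
O = np.zeros((T_occ, 2))
O[0] = zetas[0]
for t in range(1, T_occ):
    O[t] = B @ O[t - 1] + zetas[t]  # O_t = B O_{t-1} + zeta_t

# A stable variability process requires the spectral radius of B to be below 1
print(max(abs(np.linalg.eigvals(B))))
```

Writing the system in matrix form makes the stability condition explicit: as long as all eigenvalues of the coefficient matrix lie inside the unit circle, the coupled process describes fluctuations around the trait levels rather than explosive change.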
The multiconstruct LST-CL model as described above can be seen either as an extension of the LST-AR model to multiple constructs or as an extension of the RI-CLPM (Hamaker et al., 2015) to a model using multiple indicator variables, in which the AR process is modeled on the level of measurement-error-free latent variables (also see Mulder & Hamaker, 2021). In the RI-CLPM, between-person differences on the latent trait factor are captured by the random intercept factor. Note that, in contrast to the RI-CLPM, we do not constrain the factor loadings of the latent trait factors to unity. Rather, differences in latent trait factor loadings across time are assumed to capture multiplicative trait change, that is, changes in trait levels that are elicited by a (uniform) amplification or attenuation of the trait value at the first time point (Eid et al., 2017; Oeltjen et al., 2020, under review).

Unequally spaced time lags
In AA data, unequal time intervals between observations across individuals and time are common. One possibility to accommodate differences in the time lags between adjacent observations across individuals and time is to extend the models by estimating the AR parameters as $\beta_{jt} = \beta_j^{lag_{it}}$, where $lag_{it}$ denotes the time lag for individual $i$ between occasions $t-1$ and $t$ (see Eid et al., 2012). Thereby, the parameter $\beta_j$ corresponds to the autoregressive effect for a time lag of $lag_{it} = 1$. Hence, the variable $lag_{it}$ should be scaled such that a value of one corresponds to a meaningfully interpretable time interval for the dataset at hand. We provide a description of how to implement this extension in Mplus in the OSM (OSM Section 8; also see Eid et al., 2012). When applying the model to panel data, which do not use a randomized time-sampling schedule and are not or only marginally affected by delayed responses, differences in time lags may be negligible and a correction for differential time lags as proposed above may not (or only minimally) affect results.
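The lag-scaled parameterization implies a simple monotone relation between lag length and carry-over strength, sketched below with an illustrative value for the reference AR effect.

```python
# Lag-dependent AR effects: beta_jt = beta_j ** lag_it, where lag_it is the
# observed time lag, scaled so that a value of 1 is a meaningful reference
# interval. The value of beta_j is illustrative, not estimated from data.
beta_j = 0.5  # AR effect for a lag of exactly 1 time unit

for lag in (0.5, 1.0, 2.0):
    print(lag, beta_j ** lag)
# shorter lags imply stronger carry-over (0.5 ** 0.5 ≈ 0.71),
# longer lags imply weaker carry-over (0.5 ** 2 = 0.25)
```

For 0 < beta_j < 1 this exponential decay mirrors the behavior of a continuous-time AR process sampled at varying intervals: the effect shrinks smoothly toward zero as the interval between adjacent observations grows.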

Trends and cycles
When analyzing data gathered across different times of day and/or across different days of the week, seasonal or cyclical effects (e.g., diurnal or weekly cycles) and trends have to be addressed. That is, if the constructs under investigation show systematic trends or cyclical effects across measurement occasions, the relation between constructs across time may be due to these trends or effects rather than to shared covariation. Therefore, researchers should test for the presence of trends and cycles when setting up the longitudinal model. The presented LST models are suited to model variability processes; in the presence of (person-specific) growth in the data, LGC models are a better-suited alternative. The presented LST models can, however, easily be extended to account for trends and cyclical effects. First, person-specific trends across time points (within a day) can be incorporated by extending the presented model by one or two latent factors capturing linear and quadratic growth components. Thereby, the multiconstruct LST-CL model with growth factors would constitute an extension of a multivariate LGC model with structured residuals (Curran et al., 2014) to a multiple-indicator model. In addition to individual growth trajectories, there may be a trend across time points that is common to all individuals, e.g., due to systematic differences between times of day that apply to all individuals. This kind of trend is easily incorporated in the LST model by allowing the intercepts of the indicators to vary systematically across time points. If data with multiple measurements per day across multiple days are modeled, changes or trends across days can, for instance, be captured by specifying day-specific trait factors (see the following data application).
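A common, non-person-specific trend can be illustrated by letting the indicator intercepts vary across occasions; the occasion means of the simulated observations then trace the trend. The intercept values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
# Time-varying indicator intercepts alpha_t capture a diurnal trend that is
# common to all persons (illustrative values for 4 occasions within a day).
alpha_t = np.array([0.0, 0.3, 0.5, 0.2])

N = 5_000
trait = rng.normal(size=(N, 1))                  # person-specific trait level
Y = alpha_t + trait + rng.normal(0.0, 0.5, size=(N, 4))

# Occasion means recover the common trend up to sampling error
print(np.round(Y.mean(axis=0), 2))
```

Because the trend enters only through the intercepts, it shifts all persons' expected scores by the same amount per occasion and leaves the variability and covariance structure of the latent variables untouched.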

Extension to mixture models
By extending LST-AR and LST-CL models to mixture LST-AR and LST-CL models, latent classes that differ with respect to variability and reactivity to unobserved situational events, as well as with respect to the temporal coupling between different constructs, can be separated. Mixture modeling in general aims at detecting subpopulations that differ in some (specified) model parameters that drive the distribution of the observed variables (McLachlan & Peel, 2000). Mixture structural equation models can be considered an extension of finite mixture models to SEMs, in which it is assumed that in each latent subpopulation the same structural SEM with potentially different parameter values holds. Mixture SEMs can thereby be viewed as conceptually similar to multigroup SEMs where the grouping variable is not observed but latent (B. Muthén, 2001). The density of the general multivariate normal mixture SEM is given by (Dolan & van der Maas, 1998; McLachlan & Peel, 2000):

$$f(y_i) = \sum_{c=1}^{C} \pi_c\, f_c(y_i;\, \mu_c, \Sigma_c),$$

where $y_i$ is a vector of observed random variables for subject $i$ ($i = 1, \ldots, N$) with probability density function $f(\cdot)$, unknown covariance matrix $\Sigma$, and mean structure $\mu$; the $f_c(\cdot)$ are class-specific densities (component densities of the mixture) with unknown covariance matrices $\Sigma_c$ and mean structures $\mu_c$; and the vector $\pi$ contains the mixing proportions or latent class probabilities for the $C$ classes (components), which determine the proportion of subjects in each class, with $\sum_{c=1}^{C} \pi_c = 1$ and $0 < \pi_c < 1$. In SEMs, the covariance and mean structure depend on the parameter vector $\theta$. In mixture SEMs, the parameter vectors $\theta_c$ are class-specific, with a specified number of elements of the parameter vector allowed to vary across classes.
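The mixture density can be sketched numerically. For readability, the example below uses univariate normal component densities with hypothetical mixing proportions, means, and variances; in the mixture SEM, the component densities are multivariate normal with model-implied mean and covariance structures.

```python
import math

def normal_pdf(y, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_density(y, pi, mu, var):
    # f(y) = sum_c pi_c * f_c(y; mu_c, sigma2_c), with sum_c pi_c = 1
    return sum(p * normal_pdf(y, m, v) for p, m, v in zip(pi, mu, var))

# Hypothetical two-class example
pi, mu, var = [0.3, 0.7], [-1.0, 2.0], [1.0, 1.5]

# Like any density, the mixture integrates to one (numerical check on a grid)
grid = [i * 0.01 - 15.0 for i in range(3001)]
total = sum(mixture_density(y, pi, mu, var) for y in grid) * 0.01
print(round(total, 3))
```

The weighted sum of component densities makes explicit that each person's observations are treated as coming from one of C subpopulation models, with the mixing proportions giving the prior class probabilities.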
The mixture LST-AR or LST-CL model is a special variant of the general mixture SEM in which it is assumed that within each class an LST-AR or LST-CL model holds, with potentially differing structural parameters across classes (also see Courvoisier et al., 2007, on mixture LST models). For each latent class or subpopulation, the LST-AR model, conditional on class $c$, is defined by

$$Y_{ijt} \mid c = \alpha_{ijtc} + \lambda_{Tijtc}\,T_{ijc} + \lambda_{Oijt}\,O_{jtc} + \varepsilon_{ijtc}.$$

The parameters $\lambda_{Oijt}$ are held constant across classes to ensure measurement invariance and thereby comparability of latent correlations and regression coefficients across classes. Additionally, the parameters $\lambda_{Tijtc}$ may be restricted to be equal across classes for parsimony if it is reasonable to assume that multiplicative trait change does not differ across classes. Latent subpopulations might thus differ with respect to inertia and spillover effects between constructs (AR and CL parameters), latent trait variances, latent state residual variances and covariances, latent means or intercepts, and measurement error variances. To ensure comparability of the latent mean structure and still allow for differences in trait levels, the intercept parameters $\alpha_{ijtc}$ can be set invariant across classes while allowing for differing latent trait means $E(T_{ijc})$. The identifying restrictions that have to be imposed to extend an SEM to a mixture SEM correspond to those needed for multigroup SEMs (Dolan & van der Maas, 1998).
In the application of mixture LST-AR or LST-CL models to data comprising samples from different populations, e.g., clinical populations, a key interest is to investigate whether the extracted latent subpopulations coincide with observed groups. That is, mixing proportions of latent classes might differ between, for instance, clinical groups, with some classes exhibiting patterns of stability and variability that are characteristic of specific clinical populations. To investigate whether latent classes extracted in mixture LST-AR or LST-CL modeling capture characteristic longitudinal dynamics of observed groups, the models are extended to include categorical covariates. That is, we allow for potentially differing mixing proportions of classes across groups by allowing the latent class probabilities $\pi_c$ to differ across observed groups. Associations between latent class and observed group membership are modeled by (multinomial) logistic regressions of most likely latent class membership on observed group. That is, the probability $\pi_{ic}$ of person $i$ belonging to latent class $c$ is modeled as

$$\pi_{ic} = \frac{\exp\bigl(\gamma_{0c} + \sum_{g=1}^{G-1} \gamma_{gc}\,X_g\bigr)}{1 + \sum_{c'=1}^{C-1} \exp\bigl(\gamma_{0c'} + \sum_{g=1}^{G-1} \gamma_{gc'}\,X_g\bigr)}$$

for latent classes $c = 1, \ldots, C-1$, and as

$$\pi_{iC} = \frac{1}{1 + \sum_{c'=1}^{C-1} \exp\bigl(\gamma_{0c'} + \sum_{g=1}^{G-1} \gamma_{gc'}\,X_g\bigr)}$$

for the reference class $C$, with $X_g$ being dummy variables coding observed group membership for $G$ observed groups with reference group $G$.
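The multinomial logistic parameterization of the class probabilities can be sketched as follows. The coefficient names (an intercept and slopes per non-reference class) are illustrative, not taken from the article; the last class serves as the reference class.

```python
import math

def class_probabilities(x, gamma):
    """Latent class probabilities pi_ic under a multinomial logistic model.

    x     : dummy codes for observed group membership (length G - 1)
    gamma : one (intercept, slopes) pair per non-reference class;
            the last class C is the reference class.
    Parameter names are hypothetical, for illustration only.
    """
    logits = [g0 + sum(gk * xk for gk, xk in zip(gs, x)) for g0, gs in gamma]
    denom = 1.0 + sum(math.exp(l) for l in logits)
    return [math.exp(l) / denom for l in logits] + [1.0 / denom]

# Hypothetical example: 3 latent classes, 3 observed groups (2 dummies)
gamma = [(0.5, (1.2, -0.3)), (-0.2, (0.4, 0.8))]
probs = class_probabilities([1.0, 0.0], gamma)  # a person from the first group
print(probs, sum(probs))
```

With all coefficients set to zero, the parameterization reduces to equal class probabilities across groups; non-zero slopes shift the mixing proportions for the corresponding observed group relative to the reference group.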

Application to AA data of affect and self-esteem in BPD and AD
To illustrate the proposed approach, the mixture LST-AR and LST-CL models are applied to data from an AA study investigating time trajectories of affect and self-esteem in patients with BPD, an AD, and HCs (Kockler et al., 2022; Santangelo et al., 2017, 2020). Temporal instability in affect as well as in the sense of self are defining characteristics of BPD (American Psychiatric Association, 2013). The lack of specificity of affective instability for BPD when compared to clinical control groups, as observed in AA studies, led Santangelo et al. (2014) to the hypothesis that the temporal interplay of affect and self-esteem dynamics might distinguish BPD from clinical control groups (also see Santangelo et al., 2017). The current application investigates the existence of latent subgroups in the (joint) dynamics of affect and self-esteem in BPD, AD, and HC participants by use of mixture LST-AR and LST-CL models.
That is, we expect to find two or three latent subgroups depending on whether the clinical groups can be separated based on their affect or self-esteem dynamics.
The analyzed sample consisted of 353 female participants, comprising 119 patients with BPD, 108 patients with an AD, and 126 HCs. (Subsamples of the current sample were analyzed in Santangelo et al., 2017 [60 BPD and 60 HC participants], Santangelo et al., 2020 [119 BPD participants], and Kockler et al., 2022 [all of the AD, 65 of the HC, and 59 of the BPD participants], using different models investigating different research questions.) Further sample characteristics are provided in Table S1 in the online supplemental materials (OSM). Recruitment of the participants, diagnostic criteria for inclusion in the study, and the data acquisition process are described in detail in Kockler et al. (2022). Participants carried an electronic diary on four consecutive days, which emitted a prompting signal 12 times a day following a pseudorandomized time-sampling schedule in intervals of approximately 1 hour (±10 min) between 10 am and 10 pm. The resulting dataset contains self-ratings of momentary affective state and self-esteem with an average of 42 (median = 44, min = 10, max = 48, SD = 6.83) repeated measurements per person across four days. Momentary affective states were assessed with a measure specifically designed and validated for repeated assessments in e-diary studies (Wilhelm & Schoebi, 2007). For the current analyses, we used the affective state dimension valence, ranging from unpleasant to pleasant affective state, measured by two items on a 7-point rating scale. Current self-esteem was assessed using an adapted four-item short form of the Rosenberg Self-Esteem Scale (Rosenberg, 1965), rated on a 10-point rating scale. Reversed items were re-coded such that high values are indicative of a positive state, i.e., high pleasantness or high self-esteem. The four self-esteem items were combined into two item parcels. For further details on the e-diary assessment and measures see Santangelo et al. (2020) or Kockler et al.
(2022). Figures S1-S3 in the OSM provide graphical representations of the affect and self-esteem dynamics of BPD, AD, and HC participants, respectively. These indicate that patients with BPD show more variable and on average lower self-esteem than AD and HC participants, with patients with an AD being intermediate to BPD and HC participants with respect to the level and variability of self-esteem. Patients with BPD also show the lowest levels of valence across the three groups, followed by patients with an AD. Differences in the variability of valence over time are less pronounced between patients with AD and BPD than is the case for self-esteem; however, both clinical groups clearly differ in variability from the HCs.
In a first step, LST-AR models were fit to the data of two consecutive days, spanning 24 observations across days 1 and 2 (period 1) or days 3 and 4 (period 2), respectively. We tested different measurement invariance restrictions across time points in the LST-AR models. For both valence and self-esteem as well as both time periods, LST-AR models with strict measurement invariance across time fit the data very well (see Tables S2-S5 in the OSM). That is, AR effects, latent state residual variances (with the exception of the first time point per day), measurement error variances, loading parameters, and intercept parameters were set invariant across time. Models were first estimated with day- and indicator-specific trait variables. These resulted in high correlations of indicator-specific trait factors across days and/or across indicators, such that all models were re-estimated with stable (indicator-specific) trait factors across days or common trait factors across indicators. The good fit of the models with strict measurement invariance (intercepts fixed to equality across time and trait factor loadings set to one for all time points), along with the very high correlations of day-specific trait factors across days, indicates that there is no form of trait change present in the data. That is, models that imply a pure variability process across the respective two days capture the observed dynamics very well.
We accounted for the overnight lag between days by excluding the AR effect from the last observation of a day to the first observation of the following day. Variances of the latent state residual factors of the first time point per day were allowed to differ from those of later time points. Additionally, we checked for potential growth and trends in the data. Respective extensions of the models resulted in growth factors with variances and means close to zero and were therefore discarded. All results are provided on the project's OSF page.
In a second step, the models were extended to mixture models. The results of the mixture models estimated for period 2 served as a check for the robustness of the results obtained for period 1. For each time period, a 2-class, a 3-class, and a 4-class solution was estimated. Factor loadings of latent occasion-specific factors were set equal across latent classes to ensure measurement equivalence for the respective factors and thereby ensure interpretability of latent class differences in the remaining model parameters. Furthermore, intercepts were constrained to equality across latent classes to allow for latent mean differences in latent trait levels between latent classes. All remaining parameters were freely estimated and allowed to vary across latent classes. For estimation of the multiconstruct LST-CL model, results of the LST-AR models were taken into account in the model specification (see description below). Mixture analyses were conducted with Mplus 8.2 and 8.3 (L. K. Muthén & Muthén, 1998-2017), using the robust maximum likelihood (MLR) estimator with 1000/2100 random sets of starting values in the initial stage and 100/210 optimizations in the final stage estimation for the LST-AR/LST-CL model. For the comparatively more complex multiconstruct LST-CL mixture models, the maximum number of iterations allowed in the initial stage estimation was increased from 10 to 50. In case the best loglikelihood was not replicated, the analysis was rerun using the estimated parameters from the respective best solution as starting values for the model parameters in the unperturbed starting value set run.
Following Nylund et al. (2007) and Tofighi and Enders (2007), we compared the best k-class and (k − 1)-class solutions using the Lo-Mendell-Rubin likelihood ratio test (LMR; Lo et al., 2001), the BIC, and the sample-size-adjusted BIC (SABIC) to decide on the number of classes. Note that it is not uncommon to find inconsistent results with respect to the number of classes selected by different indices in mixture analyses (e.g., Masyn, 2013; Nylund-Gibson & Choi, 2018; Ram & Grimm, 2009). Furthermore, studies investigating the performance of class selection strategies report partially diverging results. Nylund et al. (2007) report for GMMs that the BIC outperforms the LMR, which outperforms the SABIC, whereas Tofighi and Enders (2007) found the SABIC and LMR to outperform all other fit statistics. Also, Henson et al. (2007) report that the SABIC performed best, followed by the LMR test, and McNeish and Harring (2017) report that BIC and BLRT perform best in identifying whether a mixture of distributions is present in data. Therefore, several authors recommend basing model selection not only on fit indices but also on the models' and classes' substantive interpretability as well as on classification diagnostics (e.g., Masyn, 2013; B. Muthén, 2003; Ram & Grimm, 2009). We will therefore jointly consider the LMR test, information criteria, entropy, class sizes, and the replication of the results across the two time periods as well as the treatment of time lags, and discuss the substantive interpretations of different models' results.
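For readers who want to reproduce such comparisons, both information criteria follow standard formulas: the BIC penalizes by ln N, while the SABIC substitutes (N + 2)/24 for N in the penalty. A minimal sketch with invented loglikelihood values (the function names are ours):

```python
import math

def bic(loglik: float, n_params: int, n: float) -> float:
    # Bayesian information criterion: smaller values indicate better fit
    return -2.0 * loglik + n_params * math.log(n)

def sabic(loglik: float, n_params: int, n: float) -> float:
    # Sample-size-adjusted BIC: substitutes (N + 2) / 24 for N in the penalty
    return -2.0 * loglik + n_params * math.log((n + 2.0) / 24.0)

# Hypothetical 2-class vs. 3-class comparison at N = 353 (loglikelihoods invented)
for k, (ll, p) in {2: (-5000.0, 40), 3: (-4950.0, 61)}.items():
    print(k, round(bic(ll, p, 353), 1), round(sabic(ll, p, 353), 1))
```

Because the SABIC penalty is milder, it tends to favor models with more classes than the BIC, which is one source of the inconsistent selections discussed above.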
Models were first estimated with fixed AR and CL effects across time as well as individuals, and were subsequently re-estimated using the approach for considering individually varying time lags described above. In the latter case, the reported AR parameters correspond to a time lag of one hour. All model outputs (including the Mplus code for the respective models) are available on the project's OSF page.

Results of mixture LST-AR and LST-CL models
An overview of convergence, model fit, entropy, and class sizes in the different LST-AR models is reported in Tables 1 and S6 for valence and Tables S10 and S11 in the OSM for self-esteem. Due to space restrictions, we will only report the results for valence in the following. Results with respect to the LST-AR models for self-esteem are reported in Tables S12-S16 and summarized in the OSM.

LST-AR model for valence
Four-class models either did not converge or converged to a 2-class or 3-class solution (i.e., the class size of the third/fourth class equaled zero). The loglikelihood of the resulting 3-class solution did not match the best loglikelihood obtained when specifying a respective 3-class model (see Table 1 for details). A 3-class model with common trait factors across days for time period 1 converged with two replicated best loglikelihoods. The remaining 3-class models either did not converge or resulted in a 2-class solution. 2-class models converged well for both period 1 and period 2. In 2-class models with day-specific trait factors, these day-specific traits were highly correlated across days, with correlations ranging from .901 to .976. For both periods 1 and 2, results of the models with day-specific trait factors and models with stable trait factors across days were almost identical. For reasons of simplicity, we therefore focus on the models with stable trait factors across days in the following.
When re-estimating the models accounting for individually varying time lags, the 3-class models did not converge or resulted in a 2-class solution (see Table S6 in the OSM). The models with individual time lags show a better model fit as compared to their respective counterparts with fixed AR effects, as judged based on information criteria. A comparison of the results from the 2-class as well as 1-class models with fixed AR effects and with AR effects varying by individual lag shows that the model results stay essentially the same (see Table S7 in the OSM for the one-class models and Tables 2 and 3 for the two-class models of valence; note that class labels switched between the different results, which has to be considered in the comparison). Consequently, we focus on the results from the models with fixed time lags in the following. (One model issued a warning that the latent variance-covariance matrix is not positive definite; the estimated correlations between day-specific traits across days were approximately .984/.986.)
For models with fixed time lags, we found a 3-class as well as a 2-class solution for period 1. Both BIC and SABIC indicate that the 3-class model is to be preferred over the 2-class model. In contrast, the LMR test favored the 2-class model over the 3-class model (LMR test p = .098; note, however, that even for 12 time points, the LMR test shows a tendency to favor the 2-class model, see below for results).
That is, model comparisons led to inconsistent results as to whether a 3-class or a 2-class solution is to be preferred during period 1. However, for period 2, no stable 3-class solution was found, such that the 3-class solution could not be replicated across time periods. This was also the case for the models with individual time lags, for which no 3-class solution was found. In the following, we will therefore report the results of both the 2-class (periods 1 and 2) and 3-class solutions (period 1) and discuss the different solutions with respect to their substantive interpretation and differences.
The patterns of latent classes in the 2-class solutions (see Table 2) show that the substantive results are replicated across the two time periods. Note that the labels of the latent classes are reversed in the two periods. The large entropy values show a clear separability of the latent classes. We observe a slightly larger class (55% and 67% of participants) which is characterized by (a) lower average levels of trait valence, (b) less interindividual variation in trait levels, (c) substantially higher time-point-specific unexplained variability in valence, (d) slightly higher values of inertia in valence across time points, and (e) higher measurement error variances as compared to the other class. This class will be termed the "instable" class due to the comparatively higher amount of unexplained variability as compared to the other, "stable" class. Across the first two days, the odds of belonging to the stable class instead of the instable class are 2.85 for HC participants. The respective odds are significantly lower for BPD (OR = 0.131) and AD (OR = 0.163) patients as compared to HC participants. Similar results are obtained for days three and four: the odds of belonging to the stable as opposed to the instable class are 1.59 for HCs. The same odds are considerably lower for BPD (OR = 0.097) and AD (OR = 0.160) patients.
Classes 1 and 2 in the 3-class solution (see Table S8 in the OSM) largely resemble the two classes found in the 2-class solution of period 1. The third class that emerged in the 3-class solution is a comparatively small class (8% of the sample) characterized by a high level of stability in valence, which exceeds the stability found for the stable class in the 2-class solution. Apart from exhibiting low levels of unexplained state variability, subjects in class 3 are characterized by a high trait level of valence with few interindividual differences in these high trait levels.
Multinomial regressions of class membership on group revealed the following class membership probabilities. For HC participants, the probabilities of belonging to the first (instable), second (stable), or third (highly stable) class are 16.4%, 63.2%, and 20.3%, respectively. The same probabilities are 59.3%, 40.7%, and 0.0% for BPD patients, as well as 52.1%, 46.9%, and 0.9% for AD patients. That is, the odds of belonging to the first (instable) as opposed to the third (highly stable) class are 0.81 in the HC, > 1000 in the BPD, and 56.3 in the AD group, and the odds of belonging to the second (stable) as opposed to the third (highly stable) class are 3.11 in the HC, > 1000 in the BPD, and 50.7 in the AD group. The odds of belonging to the first (instable) as opposed to the second (stable) class are 0.26 in the HC, 1.46 in the BPD, and 1.11 in the AD group. That is, it is highly unlikely for BPD and AD patients to belong to the very stable valence class, and they have high probabilities of belonging to the instable class. Furthermore, the third class encountered in the 3-class solution appears to be a (comparatively small) subclass of the first class in the two-class model which solely contains HC participants. As the 3-class model could not be replicated across the two time periods and was not found when considering varying time lags, we would therefore argue for selecting the 2-class model for substantive interpretations. Figures S4-S9 in the OSM provide graphical illustrations of the affect dynamics of the BPD, AD, and HC participants separated by most likely latent class membership with respect to the 2-class model solution of time period 1. Table S9 in the OSM displays mean comparisons of different symptom severity scales between the first and the second latent class, separated by observed group. These suggest that patients with BPD in the instable group are, on average, characterized by higher affective intensity and lability as compared to those in the stable group. In contrast, HC participants in the instable class show, on average, increased levels of BPD-characteristic symptoms, and patients with an AD in the instable class show, on average, increased levels of psychological distress and psychopathological symptoms in general, as compared to the stable class.
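The reported odds are simply ratios of the estimated class-membership probabilities. As a sketch, the HC values of 0.81, 3.11, and 0.26 can be recovered from the percentages above (the AD odds against the tiny third class deviate slightly from the reported values because the probabilities are rounded):

```python
# Class-membership probabilities per observed group in the 3-class solution,
# as reported above (rounded to one decimal percentage point)
probs = {
    "HC":  {"instable": 0.164, "stable": 0.632, "highly_stable": 0.203},
    "BPD": {"instable": 0.593, "stable": 0.407, "highly_stable": 0.000},
    "AD":  {"instable": 0.521, "stable": 0.469, "highly_stable": 0.009},
}

def odds(group: str, class_a: str, class_b: str) -> float:
    """Odds of belonging to class_a rather than class_b within a group."""
    p = probs[group]
    return p[class_a] / p[class_b] if p[class_b] > 0 else float("inf")

print(round(odds("HC", "instable", "stable"), 2))       # 0.26
print(round(odds("HC", "stable", "highly_stable"), 2))  # 3.11
print(round(odds("BPD", "instable", "stable"), 2))      # 1.46
```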

LST-CL model for valence and self-esteem
Due to the encountered 2-class solutions in the mixture LST-AR models for valence (and self-esteem) and the encountered convergence problems in case of 3-class models, only 2-class models were estimated for the mixture LST-CL models combining valence and self-esteem. Based on the LST-AR model results, the LST-CL models were directly specified with stable trait factors across days. The models converged with 4 and 5 replicated best loglikelihoods in periods 1 and 2, respectively. The results of the respective 2-class models are summarized in Tables S17 and S18 in the OSM. In accordance with the results of the LST-AR models for self-esteem and valence, the smaller of the two encountered classes (18% and 24% of participants, respectively) is characterized by higher trait levels of self-esteem and valence as well as less variability of both constructs across time. Accordingly, the probability of belonging to this class of rather stable self-esteem and affect dynamics is substantially smaller for BPD and AD patients as compared to HCs (see multinomial logistic regression coefficients in Tables S17 and S18). Differences between the classes with respect to AR and CL parameters of self-esteem and valence were not replicated across the two time periods. That is, in time period 1, CL parameters were not significantly different from zero in either of the two classes, while in time period 2, the smaller, more stable class showed higher positive CL effects of self-esteem on subsequent valence (not significant in the instable class), while the instable class showed a small positive CL effect of valence on self-esteem (close to zero and negative in the stable class). We did not observe consistent differences between classes with respect to time-specific associations of unpredicted fluctuations in self-esteem and valence. Latent correlations between the state residual variables at the first time point were higher in the instable as compared to the stable class in period 1, however, not in period 2.
As the results with respect to CL effects did not replicate across time periods, these should be interpreted with caution.

Monte Carlo simulation study
We conducted two extensive simulation studies to test the performance of the models under realistic conditions and derive recommendations for applied research.

Simulation design and data-generating values
In Simulation study I, we varied (1) the number of days spanned by the models (1, 2, 3, or 4 days); (2) the number of measurement time points per day (4, 6, 8, 10, or 12); (3) the sample size per group (50, 75, 100, 125, 150, or 200 persons); and (4) the inclusion of a categorical group predictor variable (multiple observed groups modeled vs. not modeled), resulting in a 4 × 5 × 6 × 2 design with a total of 240 conditions for the LST-AR as well as the LST-CL model.
In Simulation study II, we investigated the effects of the number of latent classes and latent class separation in the LST-AR model. To this end, we simulated 2-class as well as 3-class models and varied the differences in parameters between classes from small to medium effect sizes. Effect sizes are defined based on Cohen's d for latent trait means and on the variance ratio v = σ₁²/σ₂² (with σ₁² > σ₂²) for variance parameters, with d = 0.2 and v = 1.1 considered small effects and d = 0.5 and v = 1.5 medium effects (see Cohen, 1988; Faul et al., 2007). For the AR parameters, a difference of 0.1 was considered small and a difference of 0.2 medium. The resulting data-generating parameter values are provided in Table S19 in the OSM. Additionally, we varied (1) the number of days: models spanning 1 or 2 days; (2) the number of time points per day: 6, 8, or 12; and (3) the sample size: 100, 150, or 300 persons per group. Note that the design was not completely crossed (with respect to the T = 12 and N = 300 conditions). All datasets were analyzed specifying a 2-class as well as a 3-class mixture model, and the performance of information criteria and the LMR test for identifying the number of latent classes was examined.
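A minimal sketch of how these separation measures translate into code, using the cutoff values from the definitions above (the helper names and the example inputs are ours, chosen for illustration):

```python
def cohens_d(mean1: float, mean2: float, sd: float) -> float:
    # Standardized latent trait mean difference between two classes
    return abs(mean1 - mean2) / sd

def variance_ratio(var1: float, var2: float) -> float:
    # Ratio of the larger to the smaller variance parameter
    return max(var1, var2) / min(var1, var2)

def label(effect: float, small: float, medium: float) -> str:
    # Map an effect size to the design's small/medium categories
    if effect >= medium:
        return "medium"
    return "small" if effect >= small else "negligible"

# Hypothetical class-separation values (invented for illustration)
print(label(cohens_d(5.5, 5.0, 1.0), small=0.2, medium=0.5))   # medium
print(label(variance_ratio(1.1, 1.0), small=1.1, medium=1.5))  # small
```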
We simulated 500 replications per condition. Data were generated separately per observed group using Mplus 8.3 (L. K. Muthén & Muthén, 1998-2017) and subsequently merged using the open source software R (R Core Team, 2019). Model estimation was performed using the MLR estimator for mixture analysis in Mplus 8.3, using the data-generating values as starting values. Results were analyzed and visualized using R (R Core Team, 2019).
Label switching of the latent classes was evaluated based on the parameters with the most pronounced differences between classes, that is, the latent trait means and latent state residual variances, as well as AR parameters in simulation study II.Label switching was assumed if differences in the parameter estimates were reversed between classes for all of these parameter types.There were no replications in any of the simulation conditions that met these criteria in simulation study I (i.e., no case of label switching was detected).In simulation study II, several replications fell under this definition of label switching and the respective replications were removed from further analyses.Additionally, replications that produced negative variance estimates were removed.
Estimation performance was judged by means of the following evaluation criteria. First, relative parameter estimation bias (peb) was calculated per parameter p as the average relative deviation of the parameter estimates from the data-generating value across replications. Second, relative standard error bias (seb) was computed as the relative deviation of the average estimated standard error from the empirical standard deviation of the parameter estimates across replications. Third, as a measure combining bias and efficiency, the mean squared error (MSE) is considered. Fourth, 95%-coverage is defined as the percentage of replications within a simulation condition in which the parameter's data-generating value falls within the 95% confidence interval of the respective parameter. Coverage between 91% and 98% was considered acceptable (L. K. Muthén & Muthén, 2002).
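The formulas themselves did not survive extraction; under the standard Monte Carlo definitions (in the spirit of L. K. Muthén & Muthén, 2002, not necessarily the authors' exact formulas), the four criteria can be computed per parameter and condition roughly as follows (function and variable names are ours):

```python
from statistics import mean, stdev

def evaluation_criteria(estimates, ses, theta, z=1.96):
    """Standard Monte Carlo evaluation criteria for a single parameter.

    estimates, ses -- per-replication point estimates and standard errors
    theta          -- data-generating population value
    """
    peb = mean((est - theta) / theta for est in estimates)   # relative parameter bias
    emp_sd = stdev(estimates)                                # empirical SD of estimates
    seb = (mean(ses) - emp_sd) / emp_sd                      # relative standard error bias
    mse = mean((est - theta) ** 2 for est in estimates)      # combines bias and efficiency
    coverage = mean(                                         # share of 95% CIs covering theta
        1.0 if est - z * se <= theta <= est + z * se else 0.0
        for est, se in zip(estimates, ses)
    )
    return peb, seb, mse, coverage

# Toy example: five replications of one condition (numbers invented)
peb, seb, mse, cov = evaluation_criteria(
    estimates=[0.52, 0.48, 0.50, 0.55, 0.45], ses=[0.04] * 5, theta=0.50
)
```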

Simulation study I: Mixture LST-AR models
None of the LST-AR models spanning more than one day issued any error or warning messages. In the one-day models, only replications of models with 4 or 6 time points issued errors and warning messages, with models with 6 time points mainly affected in conditions with small samples (N ≤ 75). The number of excluded (non-converged) replications as well as the number of error and warning messages for the one-day models are depicted in Figure S10 in the OSM.
We investigated the effect of including vs. excluding replications that were potentially affected by Heywood cases (i.e., replications for which Mplus issued warnings concerning non-positive definite (residual) variance-covariance matrices). Figure S11 in the OSM presents a comparison of peb for the relevant parameters in the affected one-day conditions when including and when excluding replications with warning messages regarding Heywood cases. Exclusion of the respective replications only marginally changed parameter bias. Therefore, the following results are based on the full set of replications.
Boxplots visualizing the distribution of relative peb, seb, MSE, and coverage values across parameters per simulation condition are depicted in Figures S13-S16 in the OSM. Exact numbers presented in tabular form as well as all Mplus outputs are available at the project's OSF page.
Relative peb fell below the cutoff value of 10% for all parameters in all conditions of models spanning at least two days; that is, parameters were estimated with high accuracy irrespective of the number of measurement time points per day or the sample size. In one-day models, 8 time points sufficed for peb values below the cutoff, irrespective of sample size, while in case of 6 time points at least 125 persons per group were required. MSE values are in accordance with the peb values in that models spanning two to four days show small MSE values with only small differences between 2-, 3-, and 4-day models. As no clear cutoff criteria for MSE values exist, recommendations on sample sizes are derived from the remaining evaluation criteria.
Across all parameters and conditions of the mixture models with group covariate spanning at least two days, seb values exceeded the cutoff of 0.1 in only four cases, with a negligible relative bias of at most 0.11. The largest seb value of 0.11 was observed for a model with 4 measurement time points and 100 observations. For models without covariate modeling, the logit parameters quantifying the latent class proportions exhibited relative standard error bias above 0.1, with maximal values of 0.19. In 1-day models, SEs were estimated accurately (seb < 0.1) with more than 6 observations per day (except for the logit parameters without covariate modeling; maximum seb = 0.13) or with 6 time points per day and at least 125 observations per group.
Minimal observed coverage across all conditions and parameter types was 0.855. (Non-aggregated results per replication as well as results for incorrectly specified numbers of classes are provided by the first author upon request.) Parameters showing coverage values below 0.91 were primarily trait variances and covariances. For models with at least 8 time points per day or at least 125 observations, all coverage values fell into the desired range. For models including 6 time points or 100 observations, coverage values fell into the desired range for all parameters with only very few exceptions of coverage values ranging between 0.902 and 0.908. For models spanning several days including 4 time points per day, no coverage values below 0.91 were observed in case of at least 125 observations. With respect to models including 4 time points across only one day, at least 200 observations are needed such that all coverage values exceed the cutoff of 0.91.

Simulation study I: Mixture LST-CL models
In models spanning at least two days, few warnings regarding non-positive definite latent variance-covariance matrices and regarding the computation of standard errors occurred, and only in models with 4 time points or N = 50. Apart from these, warning messages with respect to fixed multinomial logit parameters were frequently issued in the models with (but not without) covariate modeling across all conditions. Error and warning messages were more frequent for the one-day models, for which they are depicted in Figure S12 in the OSM.
Boxplots visualizing the distribution of relative peb, seb, MSE, and coverage values across parameters per simulation condition are depicted in Figures S17-S20 in the OSM. Relative peb fell below the cutoff value of 10% for almost all parameters in all models spanning at least two days. Peb values greater than 0.3 that appear in the models with, but not the models without, group covariate spanning two to four days exclusively belong to the logistic regression parameter of the BPD group dummy in the regression of latent class membership on observed group. This can be explained by the parameter values themselves: the probabilities of belonging to the latent classes were modeled (based on the real data analysis) such that the probability of belonging to the first class is 99.16% for individuals from the BPD group. Hence, data with sample sizes of up to 200 persons per group contain little information with respect to the exact value of a regression parameter that generates conditional probabilities in the range between 99 and 100%. In the extreme case of all observations from the BPD group belonging to the first class, the respective regression coefficient cannot be estimated and is fixed to a large value. Estimated regression coefficients for the BPD group on the logit scale correspond to conditional probabilities of belonging to class 1 ranging between 99.96 and 100%, thereby closely matching the simulated probability.
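The correspondence between fixed logit coefficients and near-one conditional probabilities is just the inverse-logit transform; a small sketch (the coefficient values are illustrative back-calculations, not estimates from the study):

```python
import math

def inv_logit(x: float) -> float:
    # Conditional probability implied by a logit value (intercept + group effect)
    return 1.0 / (1.0 + math.exp(-x))

# A total logit of about 4.77 corresponds to the simulated BPD class-1 probability of 99.16%
print(round(inv_logit(4.77), 4))  # 0.9916
# Coefficients fixed to large values yield probabilities indistinguishable from 1
print(inv_logit(15.0))
```

This illustrates why such coefficients are poorly identified: once the probability is essentially 1, large changes on the logit scale barely move the probability, so the likelihood is nearly flat in that parameter.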
With respect to models spanning three or four days, peb values greater than 0.1 occurred in very few cases for the trait covariance parameters in conditions with N = 50 or N = 75. In case of N = 75 this applies to four cases of peb values between 0.104 and 0.108, with corresponding absolute biases between 0.012 and 0.013 and coverage values between 0.926 and 0.944. In the N = 50 case, absolute and relative biases were slightly higher, indicating that a sample size of 50 observations per group is not sufficient.
With respect to two-day models, 88% of the parameters with relative peb larger than 10% are CL parameters (disregarding the BPD group logistic regression coefficient reported above). The corresponding CL parameter has a true population value of 0.006, explaining large relative biases in combination with negligible absolute biases and good coverage rates. For sample sizes larger than 50, the respective parameter has absolute biases < .002, MSE values < .002, and coverage rates between .932 and .962. The six remaining cases of parameters with peb values larger than 0.1 in the two-day models occur solely in conditions with N = 50 or four measurement time points per day.
In one-day models, high peb values occurred frequently in conditions with small sample sizes and few measurement time points. Disregarding the logistic regression coefficients for the BPD group, none of the parameters in the model including 12 measurements across one day in combination with sample sizes of at least 125 showed relative biases above the cutoff of 10%. The same holds for models with 10 measurements across one day and a sample size of N = 200.
With respect to SE bias in models spanning at least two days, seb values larger than 0.2 were observed solely for the parameters of the regression of class membership on observed group (models with covariate) or the logit parameter quantifying latent class sizes (models without group covariate). In the former case, these elevated seb values occurred for the regression coefficient of the anxiety group dummy variable only under sample sizes of N = 50. For the BPD dummy variable, high seb values result from the fact that the respective regression parameter had to be fixed in many replications due to a probability of one of belonging to the first class for members of the BPD group. The resulting bias is therefore not considered practically relevant for model applications. In case of models without group modeling, the respective logit parameter is usually not tested for significance, such that standard error bias is of little practical importance.
Relative seb slightly above the cutoff value of 0.1 occurred for a few parameters across different conditions, however, primarily in conditions with N = 50. In conditions with N > 50 and more than one day, seb fell below .146 for all parameters, with values larger than .110 in only 11 cases across conditions and parameters. These correspond to absolute differences between average standard errors and population values lying between 0.002 and 0.052. For one-day models, it is apparent that seb is large especially in conditions with 8 or fewer time points in combination with 125 or fewer observations per group.
Minimum observed coverage across all conditions and parameter types for models spanning more than one day was 0.84. Parameters affected by low coverage values were mainly trait variances and covariances. For models spanning at least two days, all coverage values were larger than .870, with a maximum of 0.037% of cases per condition with coverage < .91. With respect to one-day models, none of the parameters had coverage values smaller than .91 in conditions with 12 time points and N = 200 or N = 150. In case of 12 time points and N = 125 or N = 100, minimum observed coverage was .901 and .887, respectively. In case of 10 time points across one day, coverage < .91 occurred in models with N = 200, N = 150, N = 125, and N = 100 in 0.005, 0.003, 0.002, and 0.002% of cases, with coverage > .881, .905, .898, and .859, respectively.

Simulation study II
Boxplots of relative peb, seb, and coverage values across parameters (except logit parameters) per simulation condition are depicted in Figures S27-S29 in the OSM. Parameters were estimated accurately as judged by relative peb for all 2-class and 3-class models with data spanning two days in case of medium-effect-sized class separation. One exception concerned the coefficients of the multinomial logistic regression of latent class probability on observed group. Warning messages related to the multinomial logit regression coefficients indicate that in several replications at least one of the logistic regression coefficients was fixed to a high value or not estimated accurately, most likely due to empty cells in the joint distribution of latent classes and observed groups. The frequency of issued warning messages by simulation condition is depicted in Figures S21-S24 in the OSM. Additionally, some replications produced implausible values for the multinomial logit regression coefficients (e.g., −1356). When excluding replications with respective warning messages as well as logit parameters larger than 10 in absolute value, the multinomial logit parameters were estimated with high accuracy in the remaining replications of models with medium-effect-sized class separation (see Figure S25). The number of replications excluded per condition by the aforementioned definition is depicted in Figure S26.
In one-day models with medium-effect-sized class separation, parameters in 2-class models are estimated accurately with at least 8 time points per day, while 3-class models require more time points and/or larger samples than 2-class models for accurate estimation. A similar picture emerges for standard error bias and coverage values. Overall, the results suggest that in case of very small differences between classes, 8 time points across two days with N = 150 per group is sufficient for neither 2-class nor 3-class models (small-effect-sized class separation). For 3-class models with medium-effect-sized class separation, at least 12 time points across two days are required, while for 2-class models 6 time points across two days or 8 time points on one day with N = 150 appear sufficient.

Class enumeration
The percentages of correctly identified numbers of latent classes by condition based on the AIC, BIC, SABIC, entropy, and (adjusted) LMR test are depicted in Figures S30-S35 in the OSM. For the simulated conditions, the LMR as well as the adjusted LMR test show a tendency to favor a (k − 1)-class model over a k-class model and thereby tend to under-extract latent classes. The results further suggest that this tendency might vanish in larger samples (with respect to time points and persons). Similarly, the BIC tends to favor 2-class models, while the AIC and SABIC select the correct number of classes in approximately 50% of the cases. Both the SABIC and AIC appear to perform better with a sufficient number of time points and larger samples. For the simulated conditions, selecting the number of classes by choosing the model with the higher entropy showed the best performance among the selection methods.
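As an illustration of how such criterion-based class enumeration proceeds, the following sketch compares fitted k-class solutions by AIC, BIC, SABIC, and entropy (the LMR tests require refitting nested models and are omitted). This is illustrative only; the summary layout of the fitted models is hypothetical:

```python
import math

def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n):
    return -2 * loglik + n_params * math.log(n)

def sabic(loglik, n_params, n):
    # Sample-size-adjusted BIC (Sclove, 1987): n is replaced by (n + 2) / 24
    return -2 * loglik + n_params * math.log((n + 2) / 24)

def enumerate_classes(fits, n):
    """Number of classes favored by each criterion.

    fits: dict mapping k (number of classes) to a dict with keys
    'loglik', 'n_params', and 'entropy' (hypothetical layout).
    Lower is better for the information criteria, higher for entropy.
    """
    choice = {}
    choice["AIC"] = min(fits, key=lambda k: aic(fits[k]["loglik"], fits[k]["n_params"]))
    choice["BIC"] = min(fits, key=lambda k: bic(fits[k]["loglik"], fits[k]["n_params"], n))
    choice["SABIC"] = min(fits, key=lambda k: sabic(fits[k]["loglik"], fits[k]["n_params"], n))
    choice["entropy"] = max(fits, key=lambda k: fits[k]["entropy"])
    return choice
```

With fits from models differing only in k, the criteria need not agree, mirroring the divergent performance of BIC, AIC, SABIC, and entropy reported above.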

Conclusion and recommendations
Results of the simulation study show that mixture LST-AR and LST-CL models can be estimated accurately with only a few observed time points per person and a moderate number of persons included in the analyses, given medium-effect-sized class separation.

Mixture LST-AR model
The following recommendations can be derived for the mixture LST-AR model. In general, we recommend sampling observations from at least 100 persons per group and at least 6 time points per person per day if at least medium effect sizes can be expected. If at least 8 measurements per person per day are collected across at least 2 days, observations from 75 persons per group are sufficient for good estimation properties. If measurements are collected within one day only, at least 8, preferably 10, measurement time points per person should be included. For the extraction of 3 latent classes, it is recommended to sample at least 12 time points across two days.

Mixture LST-CL model
In comparison to the mixture LST-AR model, the mixture LST-CL model requires larger sample sizes as well as a larger number of time points. Considering the combination of encountered error and warning messages, peb, seb, and coverage values, the following recommendations can be derived for two-class models. In general, we recommend sampling observations from at least 125 persons per group and at least 10 time points per person per day. We recommend increasing sample sizes above 125 in case observations are taken across two or three days only. In case of 10 measurement time points across one single day, data should be collected for at least 200 persons per group. In case of 12 time points per person across one day, data collected from 125 persons per group seems sufficient. Note that these recommendations refer to 2-class models and slightly higher numbers might be needed for good estimation accuracy in 3-class models.
Mixture models with vs. without observed group as covariate
We observed only small differences between the models, which primarily concern the estimation of the logit parameters denoting latent class proportions or the logistic regression parameters. Logistic regression parameters showed good estimation accuracy as judged by peb and coverage values as well as small MSE values, provided that the recommendations given above are met.
Logit parameters in the models without group modeling showed seb slightly elevated above the cutoff value. Note, however, that the logit parameter denoting latent class proportions in the models without group modeling is typically not tested for significance, such that its standard error might not be of the same importance as those of the remaining model parameters or those of the logit parameters in case groups are included as a predictor variable.

Discussion
Individuals are likely to differ in their temporal dynamics across time, rendering longitudinal models that assume population homogeneity with respect to intra-individual dynamics overly restrictive.In this paper we present an extension of the LST-AR model and RI-CLPM to mixture distribution models.The proposed mixture models are less restrictive than the general RI-CLPM as well as LST-AR models in that they relax the assumption of population homogeneity and parsimoniously capture qualitative differences in longitudinal processes by identifying latent subgroups of individuals with similar dynamics.
In an empirical application we showed how different subgroups with respect to the variability in, and temporal associations between, self-esteem and affect can be identified and related to observed diagnostic groups. The mixture LST-AR and LST-CL models successfully identified two to three latent subgroups of individuals characterized by distinct patterns of affect and self-esteem dynamics across 24 measurements on two consecutive days. Multinomial regressions of most probable latent class membership on observed diagnostic group revealed that HC participants were most likely to belong to a subgroup of individuals characterized by comparatively high and temporally stable levels of valence and self-esteem, while patients with AD and BPD were far more likely to belong to classes characterized by low habitual levels of valence and self-esteem showing high instability across time. For patients with BPD, the odds of belonging to a subgroup with highly unstable self-esteem and affect were even higher than for patients with AD. However, the results suggest that patients with BPD and AD cannot be clearly separated from each other based on their temporal dynamics. Latent class membership could not be entirely explained by observed diagnostic group, indicating that accounting for observed group membership would not have been sufficient to capture qualitative differences in the dynamics. Furthermore, correspondence between diagnostic classification and the subgroups identified based on temporal dynamics in self-esteem and affect was moderate, indicating that diagnostic groups are heterogeneous with respect to individuals' hour-to-hour temporal dynamics of affect and self-esteem. Future research should investigate with respect to which characteristics (e.g., comorbidities or symptom severity) individuals with similar or different temporal dynamics (i.e., those clustered within the same or different latent classes) might differ.
A decision between a 2-class and a 3-class solution for the LST-AR models was not clear-cut in the present case. However, the third class was small and only further separated HC participants (into those with stable vs. very stable patterns of affect). After considering the recommended criteria and test statistics (Nylund et al., 2007; Tofighi & Enders, 2007), the substantive meaning and relevance of the encountered classes for distinguishing (clinically relevant) subgroups of interest might additionally inform the decision on the number of latent classes.
In an extensive simulation study the proposed mixture LST-AR and LST-CL models performed well with a moderate number of repeated observations per individual and a moderate number of individuals in the sample. In accordance with the Monte Carlo simulation study by Courvoisier et al. (2007), smaller sample sizes can be compensated for by increasing the number of measurement occasions. In conclusion, the proposed models can be appropriately applied even in situations with relatively few observed time points for a moderate number of persons and are therefore suited to analyzing data from longitudinal studies with few repeated measurements.
Naturally, the results of the simulation study are to be interpreted as a best-case scenario, as model misspecifications as well as data properties such as non-normality and skewness that might be encountered in practice are likely to deteriorate estimation and class enumeration performance (e.g., Bauer & Curran, 2003; McNeish & Harring, 2017). It is, for instance, well known that a violation of the within-class normality assumption might lead to the extraction of spurious classes (Asparouhov & Muthén, 2016; Bauer & Curran, 2003). However, results by McNeish and Harring (2017) for GMMs suggest that the BIC and BLRT might work quite well in identifying the true number of latent classes even in the presence of model misspecifications, provided that a large number of random starts and final stage optimizations is used for the mixture model estimation. Asparouhov and Muthén (2016) introduced skew-t mixture analyses, which allow within-class distributions to be skewed and to have heavy tails and thereby reduce the risk of spurious class formation. An investigation of the performance of the skew-t mixture modeling approach for LST models indicated that required sample sizes are increased as compared to the case of within-class normality, but N = 500 with at least 4 and 5 observed time points under mild and high levels of skewness, respectively, produced acceptable results for a mixture LST model without AR effects (Hohmann et al., 2018).
We showed how an extension of the models makes it possible to accommodate unequal time intervals between observations across individuals and time points in (mixture) LST-AR and LST-CL models. In the present data application, time intervals between the ambulatory assessments varied slightly across individuals and time but did not substantially affect the estimation of AR parameters, with estimates being very close when ignoring and when accounting for differences in time intervals. However, in case of unbalanced data with a varying number of time points per individual and unequally spaced time intervals within and across individuals, as is often generated by a randomized time-sampling schedule in ambulatory assessments, continuous-time dynamic models, such as the hierarchical Bayesian continuous-time dynamic model by Driver and Voelkle (2018) or the hierarchical Ornstein-Uhlenbeck process model by Oravecz et al. (2009, 2016), are better suited and recommended.
With respect to addressing cyclical effects (e.g., diurnal or weekly cycles) or trends in the data, we recommend that researchers test for the presence of idiosyncratic trends and cycles due to time of day or day of the week when setting up the longitudinal model. The presented LST models may easily be extended to capture person-specific trends or cyclical effects in the data by introducing latent factors capturing the respective trajectory (e.g., LGC models for growth trajectories). However, a restriction of this approach is that, while individuals may vary with respect to the degree of change/variation captured by the trend/cycle, it requires the general form or pattern of these trends/cycles to be the same across individuals. In the presence of idiosyncratic trends, researchers may consider detrending the time series at the level of the individual in a first step. This approach does, however, carry the risk of deleting substantive information regarding the co-occurrence of the phenomena of interest (see, e.g., Wang & Maxwell, 2015). Researchers may consider comparing different modeling options with respect to the handling of trends as part of a sensitivity analysis.
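As an illustration of the person-wise detrending option mentioned above, the following sketch removes a person-specific linear trend by regressing each individual's series on time and keeping the residuals. This is illustrative only: it assumes a linear trend form, and the function and variable names are not drawn from the study.

```python
import numpy as np

def detrend_linear(y, t=None):
    """Remove a person-specific linear trend from one individual's series.

    y: sequence of repeated measurements for a single person;
    t: optional measurement times (defaults to 0, 1, 2, ...).
    Returns the residual series. Note that the residuals also have the
    person's mean removed, i.e., trend AND trait-level information are
    stripped, which is part of the risk discussed in the text.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float) if t is None else np.asarray(t, dtype=float)
    slope, intercept = np.polyfit(t, y, deg=1)  # OLS fit of y on t
    return y - (intercept + slope * t)
```

Applying this function per person before fitting the mixture model corresponds to the two-step detrending approach discussed above; comparing results with and without this step would be one form of the suggested sensitivity analysis.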
A restrictive feature of the proposed approach is that it still assumes homogeneity across persons within classes with respect to the parameters that describe the dynamics across time. This restriction stands in contrast to some recently proposed alternative modeling approaches. For instance, Ernst et al. (2021) introduced a probabilistic clustering method which clusters individually estimated VAR model parameters using finite mixture modeling, thereby allowing person-specific VAR model parameters to vary within clusters. Lane et al. (2019) proposed to use S-GIMME (also see Gates et al., 2017) for modeling AA data, a data-driven approach that uses unsupervised classification to identify dynamic relations at the group, subgroup, and individual level. Furthermore, multilevel time series models as well as their extensions to latent variable models (as proposed, for instance, in Asparouhov et al., 2018; Schuurman & Hamaker, 2019; Song & Zhang, 2014) allow for variation in individual-specific parameters by imposing the assumption that these parameters stem from a common distribution of parameters on the between-person level. In contrast to the proposed mixture models, which aim at identifying qualitative differences, these approaches model quantitative differences with respect to persons' dynamics across time and are better suited if intensive longitudinal data is to be analyzed. A disadvantage of the VAR clustering and S-GIMME approaches is that they require a comparatively larger number of repeated observations per individual (e.g., 50 or more; Ernst et al., 2020; Lane et al., 2019; Schuurman, Houtveen, & Hamaker, 2015), a requirement that is often not fulfilled in research using ambulatory assessed self-reports, especially in clinical populations, due to the associated high burden for the participants (Myin-Germeys & Kuppens, 2021). The proposed mixture LST-AR and LST-CL models can be considered an alternative in situations where a) the estimation of more complex (person-specific) models is not recommended because the sample size requirements on the within-person level are not met, b) the aim is to investigate the presence of qualitative differences in longitudinal processes as captured by few latent classes, and c) heterogeneity in panel data is to be identified.

Article information
Conflict of interest disclosures: The author(s) declare that there were no conflicts of interest with respect to the authorship or the publication of this article. The authors are unable to share any data publicly as they did not explicitly ask participants to agree to make their anonymized data available online (sharing participants' data would violate confidentiality).
Ethical principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.
Role of the funders/sponsors: None of the funders or sponsors of this research had any role in the design.

Figure 1. Path diagram of the autoregressive latent state-trait (LST-AR) model. The model is depicted for one construct measured by two indicators on four measurement occasions. For the sake of clarity, residual variables ε_ijt are only labeled for exemplary indicators. ε_ijt: measurement error variable; O_jt: occasion-specific variable; T_ij: latent trait variable; Y_ijt: observed indicator variable; ζ_jt: latent state residual variable; β_jt: autoregressive parameter; λ_Oijt: loading parameter of the occasion-specific factor; λ_Tijt: loading parameter of the latent trait factor; i: indicator; j: construct; t: measurement occasion/time point.

Figure 2. Path diagram of the multiconstruct cross-lagged latent state-trait (LST-CL) model. The model is depicted for two constructs measured by two indicators on four measurement occasions. For the sake of clarity, parameter labels for cross-lagged effects are omitted and residual variables ε_ijt are only labeled for exemplary indicators. O_jt: occasion-specific variable; T_ij: latent trait variable; Y_ijt: observed indicator variable; ζ_jt: latent state residual variable; β_jt: autoregressive parameter; λ_Oijt: loading parameter of the occasion-specific factors; λ_Tijt: loading parameter of the latent trait factor; i: indicator; j: construct; t: measurement occasion/time point.
Stable traits refer to models with indicator-specific stable trait factors spanning two days (instead of being day-specific). Time period 1: 24 observations during days 1 and 2 of the study; Time period 2: 24 observations during days 3 and 4 of the study; repl. loglik.: number of replicated best log-likelihood values in the mixture model estimation; AIC: Akaike information criterion; BIC: Bayesian information criterion; SABIC: sample-size-adjusted BIC; C: class; not repl.: best log-likelihood was not replicated. a 3-class solution with log-likelihood that does not correspond to the best 3-class solution. b Estimated correlations between day-specific traits across days ≈ 0.976/0.964/0.901/0.948. c

Relative parameter estimate bias (peb): $\mathrm{peb}_p = \frac{\frac{1}{n_{rep}}\sum_{e=1}^{n_{rep}}\hat{\theta}_{pe} - \theta_p}{\theta_p}$, where $\hat{\theta}_{pe}$ denotes the parameter estimate of parameter $p$ in replication $e$, $\theta_p$ the data-generating value, and $n_{rep}$ the number of replications. Relative standard error bias (seb): $\mathrm{seb}_p = \frac{\frac{1}{n_{rep}}\sum_{e=1}^{n_{rep}}\widehat{se}(\hat{\theta}_p)_e - sd(\hat{\theta}_p)}{sd(\hat{\theta}_p)}$, where $\widehat{se}(\hat{\theta}_p)_e$ denotes the standard error of the parameter estimate $\hat{\theta}_p$ in replication $e$, and $sd(\hat{\theta}_p)$ denotes the empirical standard deviation of the parameter estimate across all replications. Peb and seb values falling below a cutoff value of 0.10 (10% deviation from the population value) were considered acceptable (L. K. Muthén & Muthén, 2002).

Table 1 .
Overview of convergence, model fit and class sizes of mixture LST-AR models for valence.

Table 2 .
Mixture LST-AR model application: results of the 2-class solutions for valence.
Note. Mixture autoregressive latent state-trait (LST-AR) model results for days 1 and 2 (time period 1) and days 3 and 4 (time period 2). Displayed results stem from the LST-AR models with stable indicator-specific trait variables. Standard errors are given in parentheses. The logistic regression modeled the probability of belonging to class 1 in dependence on observed group, with healthy controls as the reference group. i: indicator; α_i: intercept parameter for indicator i; β: autoregressive parameter of the latent occasion-specific variables; ε_i: measurement error variables; λ_O2: latent factor loading for the second indicator of the occasion-specific factors; T_i: stable, indicator-specific latent trait factors; ζ_1: latent state residual variables for the first time point per day (t = 1 or 13); ζ_t: latent state residual variables for time point t, with t ≠ 1, 13; E(·): expectation; Var(·): variance; r(·, ·): correlation.

Table 3 .
Mixture LST-AR model application: results of the 2-class solutions for valence with individually varying time lags. Note. Mixture autoregressive latent state-trait (LST-AR) model results for days 1 and 2 (period 1) and days 3 and 4 (period 2). Displayed results stem from the LST-AR models with stable indicator-specific trait variables. Standard errors are given in parentheses. The logistic regression modeled the probability of belonging to class 1 in dependence on observed group, with healthy controls as the reference group.
i: indicator; α_i: intercept parameter for indicator i; β: autoregressive parameter of the latent occasion-specific variables; ε_i: measurement error variables; λ_O2: latent factor loading for the second indicator of the occasion-specific factors; T_i: stable, indicator-specific latent trait factors; ζ_1: latent state residual variables for the first time point per day (t = 1 or 13); ζ_t: latent state residual variables for time point t, with t ≠ 1, 13; E(·): expectation; Var(·): variance; r(·, ·): correlation.