Random effects regression mixtures for analyzing infant habituation

Random effects regression mixture models are a way to classify longitudinal data (or trajectories) having possibly varying lengths. The mixture structure of the traditional random effects regression mixture model arises through the distribution of the random regression coefficients, which is assumed to be a mixture of multivariate normals. An extension of this standard model is presented that accounts for various levels of heterogeneity among the trajectories, depending on their assumed error structure. A standard likelihood ratio test is presented for testing this error structure assumption. Full details of an expectation-conditional maximization algorithm for maximum likelihood estimation are also presented. This model is used to analyze data from an infant habituation experiment, where it is desirable to assess whether infants comprise different populations in terms of their habituation time.


Introduction
Finite mixture models have long been used as a way to model a sample from a population subdivided into some known number of distinct categories that occur in unknown proportions in the case where each individual's category membership is unobserved. Finite mixture models appear in areas such as economics, machine learning, neural networks, and the social and behavioral sciences. In addition to a vast library of journal articles about mixture models, there are also books dedicated to the subject; for instance, see Titterington et al. [40], Lindsay [21], McLachlan and Peel [25], and Mengersen et al. [27]. Maximum likelihood estimation for the model presented here requires use of an expectation-conditional maximization (ECM) algorithm [26], which we carefully detail in the appendix. We also discuss various hypothesis tests that arise in this setting (e.g. testing the different levels of heterogeneity conditional on component membership) and their scientific interpretations for the infant habituation data. The code developed for this analysis is available in the R contributed package mixtools [4].

The model
Consider the following mixed-effects regression model for observation i, i = 1, . . . , n:

y_i = U_i α + X_i β_i + ε_i, (2)

where y_i is an n_i-dimensional response vector (trajectory) that is linearly related to two quantities: (1) a known n_i × q matrix of explanatory variables U_i with an associated q-dimensional vector of unknown regression parameters α and (2) a known n_i × p design matrix X_i with an associated p-dimensional vector of unknown individual (random) effects β_i ∼ N_p(μ, R). The β_i are distributed independently of one another and of the ε_i, which are n_i-dimensional random error terms that are, themselves, distributed independently as N_{n_i}(0, σ² I_{n_i}), where I_d denotes the d × d identity matrix. Now, assume that the ith of n subjects generates an n_i-dimensional trajectory that can be modeled as in Equation (2), but the individual comes from one of k groups, say group j_i, and the error terms are now ε_i ∼ N_{n_i}(0, σ²_{j_i} I_{n_i}). The conditional density for this setting is

f(y_i | β_i, j_i) = φ_{n_i}(y_i | U_i α + X_i β_i, σ²_{j_i} I_{n_i}),

where φ_d(· | μ, Σ) denotes the d-dimensional normal density with mean μ and covariance Σ. We now assume that the p-dimensional vector β_i is normally distributed with mean μ_{j_i} and variance R_{j_i}. Notice that the mean of β_i and the variances of both β_i and ε_i depend on the mixture component j_i from which this trajectory is drawn. Because we are most interested in the conditional distribution of y_i given X_i, we ignore the distribution of the X_i here.
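The joint normal structure of (y_i, β_i) within a component can be checked by simulation. The following Python sketch is purely illustrative (it is not the mixtools code used in the paper, and every parameter value is made up); it simulates many (y, β) pairs from a single component and compares the empirical moments with the theoretical ones implied by the model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions and parameters for one trajectory in component j.
n_i, p, q = 4, 2, 1
X = rng.uniform(0.0, 1.0, size=(n_i, p))       # random-effects design X_i
U = rng.uniform(0.0, 1.0, size=(n_i, q))       # fixed-effects matrix U_i
alpha = np.array([2.0])                        # fixed-effect coefficient
mu_j = np.array([1.0, -0.5])                   # component mean of beta_i
R_j = np.array([[1.0, 0.3], [0.3, 0.5]])       # component covariance of beta_i
sigma2_j = 0.25                                # component error variance

# Simulate many (y_i, beta_i) pairs from y_i = U_i alpha + X_i beta_i + eps_i.
B = 200_000
beta = rng.multivariate_normal(mu_j, R_j, size=B)          # (B, p)
eps = rng.normal(0.0, np.sqrt(sigma2_j), size=(B, n_i))    # (B, n_i)
y = (U @ alpha)[None, :] + beta @ X.T + eps                # (B, n_i)

# Theoretical moments implied by the joint multivariate normal distribution.
mean_y = U @ alpha + X @ mu_j
cov_y = X @ R_j @ X.T + sigma2_j * np.eye(n_i)             # Var(y_i | j)
cov_y_beta = X @ R_j                                       # Cov(y_i, beta_i | j)

emp_mean_y = y.mean(axis=0)
emp_cov_y = np.cov(y, rowvar=False)
yc, bc = y - y.mean(axis=0), beta - beta.mean(axis=0)
emp_cov_y_beta = yc.T @ bc / (B - 1)
```

The empirical mean, covariance, and cross-covariance should agree with the model-implied quantities up to Monte Carlo error.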
Letting j_i specify to which of the k groups the ith trajectory belongs, assume that β_i | j_i ∼ N_p(μ_{j_i}, R_{j_i}) and that β_i is conditionally independent of ε_i. Let ς² = (σ²_1, . . . , σ²_k)′, μ = (μ′_1, . . . , μ′_k)′, and R = (vech(R_1)′, . . . , vech(R_k)′)′, where vech is the half-vectorization function which gives, in vector form, the lower triangular portion of the symmetric matrix it operates on. Assuming that j_i takes value j with probability λ_j, the marginal density of β_i is the mixture density

Σ_{j=1}^{k} λ_j φ_p(β_i | μ_j, R_j). (3)

Here, we let λ = (λ_1, . . . , λ_{k−1})′ denote only the first k − 1 component weights, since λ_k = 1 − λ_1 − · · · − λ_{k−1}. Finally, let

ψ = (λ′, α′, ς²′, μ′, R′)′ (4)

and β = (β′_1, . . . , β′_n)′. Under the earlier assumptions and given the component membership j_i, the joint distribution of the trajectories y_i and the random effects β_i is derived as for a standard random effects regression model. Straightforward calculations show that, conditional on component membership j_i = j, the vector (y′_i, β′_i)′ has a joint multivariate normal distribution with mean vector ((U_i α + X_i μ_j)′, μ′_j)′, Var(y_i) = X_i R_j X′_i + σ²_j I_{n_i}, Var(β_i) = R_j, and Cov(y_i, β_i) = X_i R_j. (5) Maximum likelihood estimation using the aforementioned framework can now be performed using an ECM algorithm [26]. We note that the incorporation of σ²_j is what necessitates the use of an ECM algorithm over a traditional EM algorithm. If σ²_j ≡ σ² for all j = 1, . . . , k, then our model collapses to the traditional model presented in Xu and Hedeker [46] and, thus, an EM algorithm can be employed. The details of constructing an ECM algorithm for our random effects regression mixture model are presented in the appendix.
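The marginal mixture density of the random effects can be evaluated directly. A minimal Python sketch with a hypothetical k = 2, p = 2 mixture (all parameter values here are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture for the random effects beta_i.
lam = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, -1.0])]
Rs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def beta_marginal_density(beta):
    """Marginal density of beta_i: a weighted sum of multivariate normal
    densities, one per mixture component."""
    return sum(l * multivariate_normal.pdf(beta, mean=m, cov=R)
               for l, m, R in zip(lam, mus, Rs))

val = beta_marginal_density(np.zeros(2))
```

At the origin the first component dominates, so the density is close to 0.4 times the bivariate standard normal density there.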

Equal variances and equal variance-covariance matrices
As we have noted, the difference between the model presented here and the models presented in Verbeke and Lesaffre [44], Xu and Hedeker [46], and Gaffney and Smyth [12] is that we allow for different trajectory variance terms. This means that, given the component membership of the random effects, we allow the trajectories to be heterogeneous between the different components. Here, we use a likelihood ratio test (LRT) for the hypotheses

H_0: σ²_1 = σ²_2 = · · · = σ²_k versus H_1: not all of the σ²_j are the same. (6)

Under the null hypothesis in Equation (6), the LRT statistic is asymptotically χ²_{k−1}. Note that here standard regularity conditions (see, for example, [8]) are satisfied and so the usual asymptotic results apply to the LRT; we do not have to worry about the difficulties sometimes associated with LRT statistics in a finite mixture model context when the test involves the number of components [6,21,25].
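The test itself reduces to a chi-squared tail probability on the usual LRT statistic. A minimal Python sketch (the fitted log-likelihood values below are made-up stand-ins, not numbers from the paper's analysis):

```python
from scipy.stats import chi2

def lrt_equal_variances(loglik_null, loglik_alt, k):
    """LRT of H0: all k component error variances are equal. Under H0 the
    statistic is asymptotically chi-squared with k - 1 degrees of freedom."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2.sf(stat, df=k - 1)

# Hypothetical fitted log-likelihoods for a k = 2 model.
stat, p = lrt_equal_variances(-1090.0, -1060.0, k=2)
```

The same recipe applies to the test of equal variance-covariance matrices, with the degrees of freedom adjusted to the number of constrained parameters.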
We also allow for k different variance-covariance matrices in the mixture structure for the random effects. Gaffney and Smyth [12] did incorporate this into their model, but did not perform any testing for this heterogeneity. We again use an LRT, this time for the hypotheses

H_0: R_1 = R_2 = · · · = R_k versus H_1: not all of the R_j are the same. (7)

Determining the number of components
Some applications of mixture models specify the number of components based on scientific assumptions. For example, the clinical trial data sets analyzed in Xu and Hedeker [46] are analyzed with the random effects regression mixture model for identification of treatment 'responders' versus 'nonresponders'. Hence, the β i are assumed to come from a two-component mixture of multivariate normals. However, when the number of components is not specified a priori, then one must determine an appropriate number of components to use.
From a likelihood perspective, we consider testing, for some positive integer k_0,

H_0: k = k_0 versus H_1: k = k_0 + 1. (8)

Letting ψ̂_0 and ψ̂_1 denote the MLEs of ψ calculated under H_0 and H_1, respectively, we could consider the LRT statistic

2[ℓ_o(ψ̂_1) − ℓ_o(ψ̂_0)]. (9)

It is well known that standard regularity conditions do not hold in the setting of Equation (8) and thus the asymptotic distribution of Equation (9) is not the usual chi-squared distribution (see [1,21] for a discussion). However, model selection techniques (e.g. the Akaike information criterion and the Bayesian information criterion) and bootstrapping for testing the number of components both perform well empirically (see [25] for references). We use the latter approach for our analysis, which we now describe in greater detail. The bootstrapping approach for determining the number of components was proposed by McLachlan [23]. The algorithm is an attempt to approximate the null distribution of the LRT statistic in Equation (9). An outline of the algorithm is as follows: (1) Fit mixture models with k_0 and k_0 + 1 components to the data, which leads to the estimates ψ̂_0 and ψ̂_1, respectively. (2) Calculate the (observed) log-likelihood ratio statistic in Equation (9).
(3) Simulate a data set of size n from the null distribution (the model with k_0 components). (4) Fit mixture models with k_0 and k_0 + 1 components to the simulated data set and calculate the LRT statistic in Equation (9) for this bootstrap sample. (5) Repeat steps (3) and (4) a large number of times, B, and take the bootstrap p-value p_B to be the proportion of the B bootstrap LRT statistics that are at least as large as the observed statistic from step (2).
We implement the aforementioned algorithm by first testing 1 versus 2 components. We obtain p B for this test and if it is lower than some significance level α (say, 0.05), we claim statistical significance and proceed to test 2 versus 3 components. If not, we stop and claim that there is not statistically significant evidence for a two-component fit. We proceed in this manner until we fail to reject the null hypothesis.
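The bootstrap logic can be sketched on a toy problem. The Python code below is an illustration only: it substitutes a tiny univariate normal mixture EM for the full random effects regression mixture fit (which the paper performs with mixtools), but the bootstrap procedure around it is the same; all names and values here are made up.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(y, k, n_iter=150):
    """Tiny EM for a k-component univariate normal mixture; a toy stand-in
    for the full mixture fit."""
    lam = np.full(k, 1.0 / k)
    mu = np.quantile(y, (np.arange(k) + 1.0) / (k + 1.0))   # spread-out starts
    sig = np.full(k, y.std())
    for _ in range(n_iter):
        dens = lam * norm.pdf(y[:, None], mu, sig)          # (n, k)
        z = dens / dens.sum(axis=1, keepdims=True)          # E-step posteriors
        nk = z.sum(axis=0)
        lam, mu = nk / len(y), (z * y[:, None]).sum(axis=0) / nk
        sig = np.maximum(np.sqrt((z * (y[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-3)
    loglik = np.log((lam * norm.pdf(y[:, None], mu, sig)).sum(axis=1)).sum()
    return lam, mu, sig, loglik

def boot_comp(y, k0, B=40, seed=0):
    """Parametric bootstrap of the LRT for k0 versus k0 + 1 components,
    following McLachlan's procedure outlined above."""
    rng = np.random.default_rng(seed)
    lam0, mu0, sig0, ll0 = em_normal_mixture(y, k0)
    *_, ll1 = em_normal_mixture(y, k0 + 1)
    lrt_obs = 2.0 * (ll1 - ll0)                      # observed LRT statistic
    exceed = 0
    for _ in range(B):
        comp = rng.choice(k0, size=len(y), p=lam0)   # simulate from the null fit
        yb = rng.normal(mu0[comp], sig0[comp])
        *_, ll0b = em_normal_mixture(yb, k0)
        *_, ll1b = em_normal_mixture(yb, k0 + 1)
        exceed += 2.0 * (ll1b - ll0b) >= lrt_obs
    return exceed / B                                # bootstrap p-value p_B

rng = np.random.default_rng(42)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
p_B = boot_comp(y, k0=1)   # two well-separated clusters, so p_B should be small
```

With clearly bimodal data, the observed LRT dwarfs the bootstrap replicates and the one-component null is rejected.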

Some caveats
There are various difficulties and limitations that arise when estimating mixture models. For example, mixture densities having the same parametric families for the components (like the mixture of multivariate normals in Equation (3)) are invariant under the k! permutations of the component labels in the parameter vector; see McLachlan and Peel [25] for discussion. From an estimation standpoint, this can become a concern if performing a parametric bootstrap to obtain standard error estimates (as in our application) or if conducting Bayesian inference via Markov chain Monte Carlo samplers. Such a permutation of the component labels results in the well-known label switching problem. Fortunately, there are numerous methods in the literature for dealing with label switching; see, for example, McLachlan and Peel [25] and Jasra et al. [17]. One common approach for handling this issue is to impose identifiability constraints as discussed in Aitkin and Rubin [1], who suggest restricting the mixing proportions so that λ_1 ≤ λ_2 ≤ · · · ≤ λ_k. The aforementioned identifiability constraint is what is applied in our analysis in the next section.
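An identifiability constraint of this kind is easy to apply after the fact to bootstrap output. The Python sketch below uses synthetic replicates (not the paper's bootstrap draws) to show how imposing an ordering on the mixing proportions corrects label switching and shrinks the resulting standard errors:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic bootstrap replicates of the mixing proportions for a k = 2 fit.
B = 1000
lam1 = rng.normal(0.3, 0.02, B)
lam = np.column_stack([lam1, 1.0 - lam1])

# Mimic the label switching problem: randomly permute the component labels.
flip = rng.random(B) < 0.5
lam[flip] = lam[flip, ::-1]

# Relabel by imposing the constraint lambda_1 < lambda_2 in each replicate.
lam_sorted = np.sort(lam, axis=1)

# Standard errors before and after correcting the labels.
se_raw = lam[:, 0].std(ddof=1)
se_fix = lam_sorted[:, 0].std(ddof=1)
```

Without relabeling, the first column mixes draws near 0.3 and 0.7 and its standard error is badly inflated; after sorting, the replicates again estimate a single quantity.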
The label switching problem tends to become less problematic and easier to correct when the components are well-separated. This was discussed in Hurn et al. [16] when performing Bayesian inference on finite mixtures of regressions models. Moreover, if the components of a mixture model are not well-separated, then this can also impact the ability to accurately test the number of components when not known a priori. McLachlan [23] showed how the power of the test in Equation (8) for k 0 = 1 is poor for the mixture of normals setting (with common variances) unless the components are well-separated.
It is also necessary to consider the possibility of multiple local maxima since the likelihood will have multiple roots. This can become an issue when testing the number of components as one risks making a decision based on convergence to a local maximum. When performing maximum likelihood estimation with an optimization algorithm (e.g. with EM algorithms), it is always a good practice to try multiple starting values and assess the reasonableness of the results. Moreover, one also needs to consider the practical interpretation of the results obtained. For example, when testing the number of components, one should consider if there is some additional research or theory that substantiates why the chosen number of components may be appropriate. For our infant habituation analysis, the components of our mixture model are used to model possible subgroups based on visual habituation trends. We will highlight the literature that suggests different subgroups of infants take different strategies when exposed to habituation experiments.
If using an EM algorithm for maximum likelihood estimation, then convergence of the algorithm can be slow. There are various methods that have been proposed for speeding up convergence of EM algorithms, many of which leverage Aitken's acceleration method; see McLachlan and Krishnan [24] and McLachlan and Peel [25] for discussion of such methods. Unfortunately, the ECM algorithm we present in the appendix is also very slow to converge. We have not yet incorporated any methods for speeding up convergence of our ECM algorithm, but note that this is a direction for future research.
One final issue is that the likelihood function may be unbounded, which is also a concern when implementing numerical algorithms. Focusing on local maxima on the interior of the parameter space helps circumvent this problem because under certain regularity conditions, there exists a strongly consistent sequence of roots to the likelihood equation that is asymptotically efficient; see McLachlan and Peel [25]. Regarding numerical optimization, it is again suggested to use either informative starting values or try multiple starting values as a way to handle this situation.

Examples
In this section, we first analyze simulated data to highlight the relevant functions in the mixtools package for a mixed-effects regression mixture analysis. We also perform a limited simulation study of the equal variances and equal variance-covariance matrices tests. The analysis of the simulated data is intended to highlight the general application of the procedures in Section 2. We then proceed to analyze the infant habituation data using a random effects regression mixture model. In the following analyses, we highlight some of the limitations discussed at the end of Section 2.

Simulated data
We begin by generating 50 trajectories from a two-component random effects regression mixture model with quadratic regressions and a fixed explanatory variable. The entries of the random-effects covariates x_1, . . . , x_50 were generated independently from Unif(0, 10). The vectors were randomly assigned a length n_i ∈ {5, 6, 7}, with each length selected with equal probability. Thus, the design matrix for the random effects of a trajectory is X_i = (1_{n_i}, x_i, x̃_i), where x̃_i is a vector consisting of the squared entries in x_i. The explanatory variables for the fixed effect, u_1, . . . , u_50, were generated independently from the Bin(3, 0.5) distribution. The trajectories were then defined by

y_i = u_i α + X_i β_i + ε_i, where ε_i ∼ N_{n_i}(0, σ²_{j_i} I_{n_i}). (12)

The parameters for Equation (12) are α = 6, σ_1 = 1, and σ_2 = 6, while the individual effects were generated according to the two-component mixture of trivariate normals given in Equation (13). We first ran two simulation studies to check the probability of a type I error when the data are generated according to each of the null hypotheses in Section 2.2. Each simulation was run 500 times and the 0.05 significance level was used for each test. Each of these tests can be performed using the test.equality.mixed function. For the testing scenario in Equation (6), the data were generated according to the structure outlined earlier, but under the null hypothesis where σ²_1 = σ²_2 = 1. The proportion of times the null hypothesis was rejected is 0.0648. For the testing scenario in Equation (7), the data were also generated according to the structure outlined earlier, but this time under the null hypothesis where R_1 = R_2 = I_3. The proportion of times the null hypothesis was rejected for this simulation is 0.0440.
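A data-generating scheme of this form can be sketched as follows in Python. Note that the actual μ_j, R_j, and λ_j of Equation (13) are not reproduced in this excerpt, so the random-effects mixture parameters below are placeholder choices; only α, σ_1, σ_2, the covariate distributions, and the trajectory lengths follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder random-effects mixture (Equation (13)'s values are not shown here).
lam = np.array([0.5, 0.5])
mus = np.array([[0.0, 4.0, -0.5], [10.0, -2.0, 0.3]])
Rs = np.stack([np.eye(3), 2.0 * np.eye(3)])
alpha, sigmas = 6.0, np.array([1.0, 6.0])       # alpha = 6, sigma_1 = 1, sigma_2 = 6

trajectories = []
for i in range(50):
    n_i = int(rng.choice([5, 6, 7]))                   # length, equal probability
    x = rng.uniform(0.0, 10.0, n_i)                    # Unif(0, 10) covariates
    X = np.column_stack([np.ones(n_i), x, x ** 2])     # X_i = (1, x_i, squared x_i)
    u = rng.binomial(3, 0.5)                           # fixed-effect covariate
    j = int(rng.choice(2, p=lam))                      # component membership j_i
    beta = rng.multivariate_normal(mus[j], Rs[j])      # random effects
    y = u * alpha + X @ beta + rng.normal(0.0, sigmas[j], n_i)
    trajectories.append((y, X, u, j))
```

Each element of `trajectories` bundles a response vector, its design matrix, the fixed-effect covariate, and the (normally unobserved) component label.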
Next, a data set simulated using Equations (12) and (13) is used to illustrate the method of bootstrapping for the number of components. This bootstrapping procedure is implemented by using the boot.comp function. Plots of this data set, as well as a least-squares fit to each trajectory, can be found in Figure 1(a) and 1(b), respectively.
For testing k = 1 versus k = 2 for the number of components, the bootstrapped p-value is p_B = 0.000, strongly suggesting that a two-component model is more appropriate than a one-component model. However, for k = 2 versus k = 3, p_B = 0.280. Thus, the bootstrapping procedure for the number of components correctly identifies the two-component structure of these data. Moreover, the LRT statistics for testing the equality of variances of the two components and the equality of the variance-covariance matrices are 108.1 (one df, p < 0.0001) and 23.944 (six df, p = 0.001), respectively. These results are consistent with the performance of the simulation presented earlier in this section.
The ECM algorithm is implemented by using regmixEM.mixed and the output can be found in Table 1. The observed log-likelihood for the two-component fit is ℓ_o(ψ̂) = −1032.534. The standard errors for the ECM algorithm are calculated using B = 500 bootstrap samples according to a parametric bootstrapping scheme given in Efron and Tibshirani [11]. This is implemented via the boot.se function. Label switching did not appear to be present since the identifiability constraint λ_1 < λ_2 is always met despite never being enforced. Figure 2 shows a histogram of λ_2 − λ_1 for each sample and all differences are positive and well above 0. Figure 3 depicts two trajectories and their (y_i, X_i) values from the simulated data set with both posterior regression lines plotted. The line generated using the most probable regression coefficients is denoted by a solid line. We have also plotted the regression lines that result from the posterior random effect regressors (the β̂_ij values). Since we fit a two-component model to these data, we have two sets of regressors. Furthermore, each set of regressors has a corresponding posterior membership probability and these weights typically indicate which regression line is a better fit for the trajectory. Figure 4(a) shows all such regression lines for each trajectory and the corresponding regression coefficients are plotted in Figure 4(b). Figure 4(c) and 4(e) denotes the subsets of those regression lines whose posterior random effect regressors had the highest posterior membership probability for each component and the corresponding regression coefficients are plotted in Figure 4(d) and 4(f). Clearly, Figure 4(c)-(f) indicates a separation of the trajectories into two groups. Thus, the analysis we presented using random effects regression mixtures captured the structure of the simulated data, including the correct number of components, unequal variances, and unequal variance-covariance matrices.
The simulated data discussed thus far are ideal since the components are well-separated. However, any analysis of random (or mixed) effects regression mixture models is still prone to the issues discussed in Section 2.4. To demonstrate the effects on determining the number of components, consider now a simple random effects regression mixture model measured at 11 occasions, where the design matrix is X_i = (1_{11}, x_i) and ε_i ∼ N_{11}(0, σ²_{j_i} I_{11}), for j_i ∈ {1, 2, 3}. The parameters for Equation (14) are σ_1 = 1, σ_2 = 3, and σ_3 = 2, while the individual effects are generated according to the three-component mixture given in Equation (15). We consider a well-separated setting where (τ_1, τ_2, τ_3) = (0.20, 0.19, 0.21) and a heavily overlapping setting where (τ_1, τ_2, τ_3) = (2.00, 1.99, 2.01). The regression lines and data generated using these parameters are given in Figure 5. Clearly, one can see how estimation routines in the heavily overlapping setting can be challenging. When testing the number of components using Equation (8) in the well-separated setting, we obtain bootstrap p-values of 0.000, 0.020, and 0.574 for k_0 = 1, 2, and 3, respectively. Thus, the correct number of components is determined. However, for the same test in the heavily overlapping setting, we obtain a bootstrap p-value of 0.960 for the simplest test of one component (i.e. no mixture structure) versus two components. Thus, we cannot detect the difference between components in the second setting. Such a difficulty in detecting the difference between components is consistent with the results of McLachlan [23].

Figure 5. (a) Regression lines and (b) data points for the well-separated setting using the β_i's generated by the mixture distribution in Equation (15). (c) Regression lines and (d) data points for the heavily overlapping setting using the β_i's generated by the mixture distribution in Equation (15). Note that the y-axis range differs between the two settings due to the larger spread of the data in the heavily overlapping setting.

Infant habituation data set
Psychological data sets can provide interesting examples for mixture modeling when theories of, say, cognition posit that human populations consist of discrete subgroups. For example, one might assume that different children employ qualitatively different strategies when solving a cognitive task, and any such differences may be detected by fitting a finite mixture model to data collected about the task [39]. Here, we consider an experiment concerning habituation among infants; the motivation will be to determine the appropriateness of a two-component random effects regression mixture model, that is, whether there is evidence for two distinct groups of infants that may then be presumed to possess differing abilities or to use differing strategies.
Habituation is a decrease in response times upon repeated stimulus presentations. Visual habituation studies in infants have attempted to predict later cognitive abilities in childhood; for example, see Slater [35], Hood et al. [14], and Colombo and Mitchell [7]. Much of this research asserts that there are common and stable mechanisms that underlie both visual fixation behavior observed during infant habituation studies and later cognitive abilities measured in the child. Hence, it is of scientific interest to investigate possible subgroups of infants based on visual habituation results as this could provide insights into their cognitive development.
The data set involves n = 47 infants at 4 months of age. The infants were exposed to a checkerboard pattern on a computer screen and the time (in milliseconds) until they turned away was recorded. Since this is a habituation experiment, there should be a progressive decline of behavioral response observed. This was repeated for l = 1, . . . , 11 trials on each infant. Thus, each infant will have an associated trajectory of length 11. Plots of the data are given in Figure 6.
For our model, the time until the infant turns away has been converted to the log scale, which we call y_i. The model of interest is Equation (2), where the design matrix for the random effects contains an exponential term in its third column and no fixed effects are present. This exponential term incorporates some curvature into the estimation of the trajectories, which is an attempt to better capture the longer response times that occur before the infants habituate to the repeated stimulus.
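A design matrix of this general shape can be sketched in Python. The paper's exact exponential term is not reproduced in this excerpt, so the form exp(−t) below, with the trial index shifted by one so that the peak sits at trial 1, is an assumed stand-in:

```python
import numpy as np

trials = np.arange(1, 12)   # l = 1, ..., 11 habituation trials
t = trials - 1              # exponents shifted by one: peak response at trial 1

# Hypothetical design matrix for the random effects: intercept, linear trend,
# and a decaying exponential third column (assumed form, for illustration).
X = np.column_stack([np.ones(len(trials)), t, np.exp(-t)])
```

The third column is largest at the first trial and decays thereafter, which is the qualitative behavior the curvature term is meant to capture.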
We also shift all exponents by one to reflect a peak at trial 1 (rather than zero), where an infant should exhibit its longest response time since the stimulus is novel to the infant at that initial trial. For determining the number of components, the test of k = 1 versus k = 2 yields p_B = 0.0000, while the test of k = 2 versus k = 3 yields p_B = 0.2360. Thus, we proceed with a two-component fit for the data. We also test for equal variances and equal variance-covariance matrices, which yield p-values of 0.0868 and 0.0002, respectively. Based on these results, the model we fit is the random effects regression mixture of Equation (2) where, conditional on component membership j_i, the ε_{i,j_i} ∼ N(0, σ²_{j_i}) are independent, i = 1, . . . , 47, and the random effects coefficients are distributed according to a two-component mixture of multivariate normals with component-specific variance-covariance matrices. The ECM algorithm output can be found in Table 2. The standard errors for the ECM algorithm are calculated using B = 500 bootstrap samples. Label switching was present among these bootstrap samples. This was diagnosed by first noting that the standard errors appeared to be fairly large. Since σ_1 and σ_2 appeared to be well-separated in the sample, we simply imposed the identifiability constraint σ_2 < σ_1 in order to correct the label switching (see the histogram in Figure 7).
It is sometimes of scientific interest to estimate a shift in the response variable from 0, which can be done by subtracting a shift parameter θ_i from the original response times before taking the log. One method of estimating θ_i involves taking a specified percentage of the minimum time of each trajectory. Another possible method is advocated for lognormal data by Boswell et al. [5], which looks at a function of the 100αth lower, 50th, and 100αth upper percentiles for a chosen α. We attempted to estimate θ_i using these methods, but found no improvement in the fit of our model.

Figure 8 gives plots of regression lines that result from the posterior random effect regressors for two of the infants. The solid lines have the higher posterior membership probability and appear to be the better fit of the two lines. Figure 9(a) shows all such regression lines for each infant's trajectory and the corresponding regression coefficients are plotted in Figure 9(b). Figure 9(c) and 9(e) shows the subsets of those regression lines whose posterior random effect regressors had the highest posterior membership probability for each component; the corresponding regression coefficients are plotted in Figure 9(d) and 9(f). Figure 9(c) and 9(e) gives a key to the interpretation of the two groups that emerge from this analysis. The group of Figure 9(c) and 9(d) tends to exhibit a steeper drop in response times at the earlier occasions than the group of Figure 9(e) and 9(f). The second group appears to follow a more linear trend than the first group. The mixture structure on the random effects coefficients provides an effective way to capture this characteristic and allows us to classify the infants using all of their measured data across time as opposed to, say, a single time point.
The results presented earlier are also consistent with many studies on infant habituation found in the literature. For example, Oakes [31] discusses how habituation experiments provide insights into the cognitive processes of infants. Subgroups with different patterns of habituation as in our analysis are plausible because of inherent differences in infants' abilities (e.g. memory and sensitivity to feature combinations) and possibly other mental processes. These two subgroups could, potentially, highlight overall differences in learning strategies. Moreover, Thomas and Gilmore [38] note that since infants are sufficiently different in their behavior, it is important that any modeling approach reflects these differences, which is captured using our random effects regression mixture model. Finally, while not measured with the data we analyzed, there could be some common features within the two identified subgroups regarding their neural responses; see Turk-Browne et al. [41] for a discussion of habituation in functional neuroimaging. Overall, our analysis provides potential evidence of two groups of infants with different patterns of habituation.

Discussion
The random effects regression mixture model allows one to search for latent group structure in data that comprise multiple independent regression observations. The model we present here allows for k different variance-covariance matrices in the mixture structure for the random effects. We also provide an extension to the models presented in Xu and Hedeker [46] and Gaffney and Smyth [12] by allowing the error variance terms to differ between the trajectories, thus allowing a more nuanced view of the infant data. Moreover, relative to Gaffney and Smyth [12] who consider this model from a Bayesian perspective, our approach allows for formal testing of various hypotheses regarding the level of heterogeneity among the mixture components. These tests are helpful in deciding upon an appropriate model in case studies, such as the infant data set considered here. Maximum likelihood is used for parameter estimation, which we accomplish using the ECM algorithm of Meng and Rubin [26].
As with any mixture model, the components should have a meaningful interpretation in the context of the data problem. Given the literature establishing the connection between habituation in infants and cognitive development as children, analysis using the random effects regression mixture model can provide a way of classifying infants that may already be demonstrating differing strategies or cognitive abilities. Moreover, the consideration of other functional forms for the mixture components (e.g. higher order polynomials) may reveal the need for a different number of components.
There are a couple of extensions to the random effects regression mixture model that could be of further interest for analyzing trajectory data like the infant habituation data. One extension to consider is a correlated error structure, like the models presented in Laird [18]. For example, we might consider ε_i ∼ N_{n_i}(0, σ² Ω_i), where Ω_i is an appropriately defined correlation matrix. This would be informative given that a correlated structure for habituation data might be a reasonable assumption.
Another extension might allow each trajectory to have its own variance term. In other words, the σ²_{j_i} for the variance of the error terms in model (2) would be replaced by trajectory-specific variances, leading to a comparison between a hypothesis H_1 with k component-specific variances, written τ²_{j_i}, and a hypothesis H_2 with n trajectory-specific variances σ²_i. Notice that we write H_1 using τ²_{j_i} to avoid an abuse of notation. If we had written H_1 using σ²_{j_i}, then the parameter space of this hypothesis would include {σ²_1, . . . , σ²_k}, while the parameter space of H_2 would include {σ²_1, . . . , σ²_n}. In other words, the first k variances in the parameter space under H_2 need not be equal to the k variances in the parameter space under H_1.

Appendix. Details of ECM algorithm
From the conditional distribution of y_i given component membership in Equation (5), we obtain the (observed-data) log-likelihood function

ℓ_o(ψ) = Σ_{i=1}^{n} log [ Σ_{j=1}^{k} λ_j φ_{n_i}(y_i | U_i α + X_i μ_j, X_i R_j X′_i + σ²_j I_{n_i}) ].

As is standard practice in finite mixture model EM algorithms, we let Z_ij = I{j_i = j} denote the (unobserved) indicator that the ith observation comes from the jth component. The unobservable β are also treated as missing, so {Z, β} constitutes the missing data for this problem. An EM algorithm consists of two alternating steps, repeated until convergence: the E-step and the M-step. An ECM algorithm operates on only part of the parameter vector at once. In other respects, though, it is just like an EM algorithm: during each E-step, the expectation of the complete-data log-likelihood function is calculated. In the corresponding CM-step, some part of the parameter vector is held fixed (this is the 'conditioning') while the expected log-likelihood is maximized in the other parameters. This process repeats with different subsets of the parameters until each parameter has been updated at least once, and the ECM iteration is complete.
In the E-step for our model, we form the conditional expectation, given the observed y_1, . . . , y_n, of the complete-data log-likelihood function ℓ_c(ψ). For iteration t, t = 0, 1, . . ., we take the expectation of ℓ_c(ψ) with respect to the conditional density of the missing data {Z, β} given the observed data and the current estimate ψ^(t), where ψ is defined in Equation (4). Recalling that the Z_ij are marginally Bernoulli distributed with parameter λ_j, we have

Ẑ^(t)_ij = λ^(t)_j φ_{n_i}(y_i | U_i α^(t) + X_i μ^(t)_j, X_i R^(t)_j X′_i + σ^{2(t)}_j I_{n_i}) / Σ_{j′=1}^{k} λ^(t)_{j′} φ_{n_i}(y_i | U_i α^(t) + X_i μ^(t)_{j′}, X_i R^(t)_{j′} X′_i + σ^{2(t)}_{j′} I_{n_i}).

The conditional density for β_i in Equation (A2) may be expressed, up to a constant of proportionality not involving β_i, as a multivariate normal density with conditional variance and mean

V^(t)_ij = (X′_i X_i / σ^{2(t)}_j + (R^(t)_j)^{−1})^{−1} and β^(t)_ij = V^(t)_ij (X′_i (y_i − U_i α^(t)) / σ^{2(t)}_j + (R^(t)_j)^{−1} μ^(t)_j),

given that the ith observation comes from the jth component. [In Equation (A4), we know the component j_i because we are conditioning on Z_i in that equation.] Notice that we applied Bayes' theorem in deriving Equation (A4) and combined the complete data and an estimate of ψ to approximate this density. Hence, β^(t)_ij is simply an empirical Bayes estimate of the random effects. Note that β^(t)_ij will not always resemble the parameters one would obtain from ordinary least squares regression; indeed, these parameters can be quite different, particularly when Ẑ^(t)_ij is a small probability.
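The E-step quantities amount to straightforward linear algebra. The Python sketch below is illustrative only (it is not the mixtools implementation, and all parameter values are made up): it computes the posterior membership probabilities from the marginal density of one trajectory given each component, and then the conditional (empirical Bayes) variance and mean of the random effects.

```python
import numpy as np

rng = np.random.default_rng(3)

# One trajectory and hypothetical current parameters for a k = 2 fit.
n_i, p = 6, 2
X = np.column_stack([np.ones(n_i), np.arange(n_i, dtype=float)])  # random-effects design
U = np.ones((n_i, 1))                                             # fixed-effect design
alpha = np.array([0.5])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n_i)               # observed trajectory

lam = np.array([0.4, 0.6])                            # current lambda_j
mus = [np.array([1.0, 0.8]), np.array([-2.0, 0.1])]   # current mu_j
Rs = [np.eye(p), 0.5 * np.eye(p)]                     # current R_j
sig2 = np.array([1.0, 2.0])                           # current sigma_j^2

def log_marginal(j):
    # log density of y given component j: normal with mean U alpha + X mu_j
    # and covariance X R_j X' + sigma_j^2 I.
    m = U @ alpha + X @ mus[j]
    S = X @ Rs[j] @ X.T + sig2[j] * np.eye(n_i)
    r = y - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (n_i * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

# Posterior membership probabilities via Bayes' theorem (log-space for stability).
logs = np.log(lam) + np.array([log_marginal(j) for j in range(2)])
Z = np.exp(logs - logs.max())
Z /= Z.sum()

# Conditional variance V_ij and empirical Bayes mean beta_ij of the random effects.
V = [np.linalg.inv(X.T @ X / sig2[j] + np.linalg.inv(Rs[j])) for j in range(2)]
b = [V[j] @ (X.T @ (y - U @ alpha) / sig2[j] + np.linalg.solve(Rs[j], mus[j]))
     for j in range(2)]
```

Since the trajectory was generated near the first component's mean, the first posterior probability dominates.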
Using the joint distribution of (Z, β) given (U, X, y, ψ^(t)) as summarized by Equations (A2)-(A6), we may find the expectation of the terms in the complete-data log-likelihood function, as required in the E-step of our ECM algorithm, using several well-known identities involving multivariate normal distributions. To wit, the only problematic expectations are

E[‖y_i − U_i α − X_i β_i‖² | y_i, Z_ij = 1] = ‖y_i − U_i α − X_i β^(t)_ij‖² + tr(X′_i X_i V^(t)_ij)

and, similarly,

E[(β_i − μ_j)′ R^{−1}_j (β_i − μ_j) | y_i, Z_ij = 1] = (β^(t)_ij − μ_j)′ R^{−1}_j (β^(t)_ij − μ_j) + tr(R^{−1}_j V^(t)_ij),

where we have used the facts that tr(X_i V^(t)_ij X′_i) = tr(X′_i X_i V^(t)_ij) and that, if W ∼ N_p(ν, Σ), then E‖W‖² = Σ_{i=1}^{p} ν²_i + tr(Σ). The expectation of the complete-data log-likelihood, which must be maximized as a function of ψ, is assembled from these pieces in Equation (A9). The first CM-step of the ECM algorithm maximizes this expectation over all parameters except α. As with any mixture model EM (or EM-type) algorithm, only a single term of Equation (A9) involves the mixing proportions λ, and this term is maximized by

λ^(t+1)_j = (1/n) Σ_{i=1}^{n} Ẑ^(t)_ij.
For the means of the random effects, we obtain

μ^(t+1)_j = Σ_{i=1}^{n} Ẑ^(t)_ij β^(t)_ij / Σ_{i=1}^{n} Ẑ^(t)_ij.
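Closed-form CM-step updates of this kind are simple weighted averages of the E-step output. A Python sketch with made-up stand-in values for one iteration (n = 4 trajectories, k = 2 components, p = 3 random effects; not the mixtools implementation):

```python
import numpy as np

# Made-up E-step output: posterior membership probabilities Z (n x k) and
# conditional random-effects means beta_hat (n x k x p).
Z = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.1, 0.9]])
beta_hat = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)

lam_new = Z.mean(axis=0)                 # update for the mixing proportions
nk = Z.sum(axis=0)
mu_new = (Z[:, :, None] * beta_hat).sum(axis=0) / nk[:, None]   # update for mu_j
```

Each updated mean is the posterior-probability-weighted average of the empirical Bayes estimates assigned to that component.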