Pairwise Likelihood Inference for Nested Hidden Markov Chain Models for Multilevel Longitudinal Data

In the context of multilevel longitudinal data, where sample units are collected in clusters, an important aspect that should be accounted for is the unobserved heterogeneity between sample units and between clusters. To this aim, we propose an approach based on nested hidden (latent) Markov chains, which are associated with every sample unit and with every cluster. The approach allows us to account for the previously mentioned forms of unobserved heterogeneity in a dynamic fashion; it also allows us to account for the correlation that may arise between the responses provided by units belonging to the same cluster. Under the assumed model, computing the manifest distribution of the response variables is infeasible even with a few units per cluster. Therefore, we make inference on this model through a composite likelihood function based on all the possible pairs of subjects within each cluster. Properties of the composite likelihood estimator are assessed by simulation. The proposed approach is illustrated through an application to a dataset concerning a sample of Italian workers, for whom a binary response variable, indicating whether the worker received an illness benefit, was repeatedly observed. Supplementary materials for this article are available online.


INTRODUCTION
In modeling longitudinal data, it is common to account for the unobserved heterogeneity between sample units, that is, the heterogeneity that cannot be explained by means of observable covariates (Diggle et al. 2002; Hsiao 2003; Frees 2004; Fitzmaurice et al. 2009). This is normally accomplished by the introduction of latent variables or random effects. For instance, a typical approach consists of associating a random intercept with every sample unit, which affects the distribution of each occasion-specific response in the same fashion. This allows us to account for a form of time-constant unobserved heterogeneity.
More recent approaches for longitudinal data are based on allowing for a form of time-varying unobserved heterogeneity, relaxing in this way the assumption that the effect of unobservable covariates on the response variables is constant in time. This is sensible in many applied contexts, especially in the presence of long panels and with a limited set of observable covariates. Among these approaches, it is worth mentioning the one described in Heiss (2008), which is based on random effects having an AR(1) structure, and that proposed by Bartolucci and Farcomeni (2009), which is based on a hidden (latent) Markov chain for capturing the unobserved heterogeneity in a dynamic fashion. For a comparison between the two approaches, see Bartolucci, Bacci, and Pennoni (2014).
The above considerations are obviously pertinent when we deal with multilevel longitudinal data, where sample units are collected in clusters, with the addition that it is also appropriate to model the unobserved heterogeneity between clusters and the correlation between the responses provided by the units in the same cluster. Note that multilevel longitudinal data are more and more easily encountered in socio-economic contexts. In particular, the dataset motivating this article concerns a sample of workers (sample units) in different firms (clusters), who are longitudinally observed for 10 years. Each response is binary: it is equal to 1 if the employee receives illness benefits in a certain year and to 0 otherwise. In many applications, as in the present one, long panel datasets have a limited set of observable covariates, and this also motivates the need for approaches modeling unobserved heterogeneity between both sample units and clusters.

Francesco Bartolucci, Department of Economics, University of Perugia, Perugia 06100, Italy (E-mail: bart@stat.unipg.it). Monia Lupparelli, Department of Statistical Sciences, University of Bologna, 40126 Bologna, Italy. The authors thank Laboratorio Riccardo Revelli, of Collegio Carlo Alberto in Torino (IT), for providing the dataset analyzed in this article; in particular, the authors thank Dr. Claudia Villosio for her helpful support. Francesco Bartolucci acknowledges the financial support from grant RBFR12SHVV of the Italian Government (FIRB - Futuro in Ricerca - project "Mixture and latent variable models for causal inference and analysis of socio-economic data").
Standard multilevel techniques for clustered data have been developed by, among others, Goldstein (2011). These techniques have also been extended to model longitudinal clustered data in several ways within the wide class of mixture models; see, among others, Muthén and Asparouhov (2009). Algorithms for maximum likelihood estimation of these models are implemented in the Mplus software (Muthén and Asparouhov 2011). Furthermore, multilevel data have been modeled using latent class analysis (Vermunt 2003), which can be implemented and fitted using the Latent GOLD software (Vermunt and Magidson 2005).
We propose an approach for multilevel longitudinal data based on nested hidden Markov chains, which may be seen as an extension of the approach proposed by Bartolucci and Farcomeni (2009) for longitudinal data. In particular, we associate a first-order homogeneous hidden Markov chain with every sample unit and with every cluster. The time-specific realizations of these two chains affect the distribution of the response variables together with the covariates observed at unit and cluster level. With reference to the mentioned application, the different states of the unit-level Markov chain correspond to different levels of the residual tendency (not explained by the unit-level observable covariates) of an employee to require an illness benefit. A similar interpretation may be provided for the different states of the cluster-level Markov chain, which affects the behavior of all employees in the same firm. Moreover, we account for the possibility that unit-level state changes are due to events in the employee's life that are not recorded in the dataset, such as a sudden worsening of his/her health status. Similarly, a change in the cluster-level state may be due to events concerning the firm, such as a change of management. In any case, we can test whether the latent effects are indeed dynamic for the dataset at hand.
The proposed approach may be cast in the literature on latent Markov (LM) models for longitudinal data, as described by Bartolucci, Farcomeni, and Pennoni (2013). It is worth noting that other multilevel extensions of the latent (or hidden) Markov approach for longitudinal data are available in the literature. We mention, in particular, the extensions proposed by Bartolucci, Lupparelli, and Montanari (2009) and Bartolucci, Pennoni, and Vittadini (2011). About multilevel extensions, see also Asparouhov and Muthén (2008), and about related models including random effects, but not in a context of analysis of multilevel data, see van de Pol and Langeheine (1990), Altman (2007), and Maruotti (2011). In these cases, the effects (fixed or random) associated with every cluster are time constant. An extension in which these effects are time varying has not yet been proposed, at least to our knowledge. Nevertheless, it is worth mentioning the recent article by Lagona, Maruotti, and Padovano (2015), which proposes an extension of a hidden Markov model for bivariate time series of law data in which there is a nested structure in time (months nested in government periods nested in legislatures). However, the context of this work is rather different from the longitudinal context considered here.
Under the extended model we propose, the manifest distribution of the response variables is computationally intractable in most applications; see Sections 4.2 and 5.1 for a discussion of the computational complexity. Therefore, to make inference on the model we employ a composite likelihood method (Lindsay 1988; Cox and Reid 2004) based on the joint distribution of the response variables for each pair of subjects in the same cluster. A similar approach was followed by Renard, Molenberghs, and Geys (2004) for a multilevel probit model; see also Hjort and Varin (2008) and Varin and Czado (2010). In particular, we show how to compute the pairwise likelihood by using the same recursion employed by Baum et al. (1970) to deal with hidden Markov models, and how to maximize this likelihood by an expectation-maximization (EM) algorithm similar to the one they propose, implemented along the same lines as in Bartolucci and Farcomeni (2009).
We also outline a weighted composite likelihood approach that may in general be more suitable in the presence of data grouped in clusters of different sizes. Through a simulation study, we compare the performance of the unweighted and weighted approaches under different scenarios in terms of efficiency of the estimators for the class of models developed in this article. We also consider some scenarios in which full likelihood estimation is viable, due to a reduced complexity. In this way, we can also evaluate the loss of efficiency of the proposed estimators with respect to the (typically infeasible) full likelihood estimator. Furthermore, we show how to obtain standard errors for the parameter estimates and how to perform model selection using the CL-BIC approach formulated by Gao and Song (2010), which is a composite likelihood-based selection criterion for high-dimensional data models.
The article is organized as follows. In Section 2, we briefly review the LM model with covariates and maximum likelihood estimation of this model. Section 3 provides a summary of existing approaches related to the approach proposed here to deal with multilevel longitudinal data. This section also contains a brief overview of the recent literature about composite likelihood inference. The multilevel extension for LM models is illustrated in Section 4 for the case of continuous and binary response variables. Pairwise likelihood inference for this model is described in Section 5 and studied by simulation in Section 6. In Section 7, we illustrate the approach by an application based on the dataset concerning the sample of workers mentioned above and in Section 8 we draw the main conclusions.
An R implementation of the functions used for the estimation of the proposed model in the presence of binary response variables is available to the reader upon request.

USING HIDDEN MARKOV CHAINS FOR MODELING TIME-VARYING UNOBSERVED HETEROGENEITY
Consider a panel of $n$ subjects observed at $T$ occasions, let $Y_i^{(t)}$ denote the response variable of interest for subject $i$ at occasion $t$, $i = 1, \ldots, n$, $t = 1, \ldots, T$, and let $Z_i^{(t)}$ be the corresponding column vector of covariates, which may also include the lagged responses. In the context of our application, the response variables are binary, although the LM model may also be applied to variables of a different nature. In the following, we outline how to model these data accounting for unobserved heterogeneity in a dynamic fashion by introducing a hidden Markov chain, as suggested by Bartolucci and Farcomeni (2009).

Model Assumptions
We assume that, for $i = 1, \ldots, n$, the response variables $Y_i^{(1)}, \ldots, Y_i^{(T)}$ are conditionally independent given the covariate vectors $Z_i^{(1)}, \ldots, Z_i^{(T)}$ and a latent process $V_i = (V_i^{(1)}, \ldots, V_i^{(T)})$ that follows a first-order homogeneous Markov chain and is independent of the covariates. This chain has $k$ states, labeled from 1 to $k$, with initial probabilities $\pi_v = p(V_i^{(1)} = v)$ and transition probabilities $\pi_{v|\bar{v}} = p(V_i^{(t)} = v \mid V_i^{(t-1)} = \bar{v})$, $t = 2, \ldots, T$, $\bar{v}, v = 1, \ldots, k$. In the above definitions, $v$ refers to the current state, whereas $\bar{v}$ refers to the previous one. This convention will be used throughout the article. Moreover, the initial probabilities are collected in the $k$-dimensional column vector $\boldsymbol{\pi}$, whereas the transition probabilities are collected in the $k \times k$ transition matrix $\boldsymbol{\Pi}$. Note that these probabilities are the same for all sample units and the transition probabilities are time homogeneous.
More parsimonious models can be defined by imposing different constraints on the initial and transition probabilities; see also Bartolucci (2006). In particular, we may assume a stationary hidden Markov model, in the sense that the initial distribution is equal to the stationary distribution. This assumption is rather common in the hidden Markov literature because it provides more interpretable results in the presence of covariates also affecting the response variables; see Section 4 and Zucchini and MacDonald (2009) for a discussion about this constraint. Moreover, we may assume constant off-diagonal elements for each row of the transition matrix. This means that the probability of moving to a different state is invariant with respect to the state of destination, depending only on the current state. This additional constraint makes the model more parsimonious without loss of interpretability.
For subject $i$ at occasion $t$, the latent variable $V_i^{(t)}$ corresponds to the level of the unobservable characteristic of interest. The way in which this characteristic affects the corresponding response variable $Y_i^{(t)}$ depends on the assumed measurement model. For instance, in the case of continuous response variables, it is natural to formulate the following assumption on the conditional distribution of $Y_i^{(t)}$:
$$Y_i^{(t)} \mid V_i^{(t)} = v, Z_i^{(t)} = z \sim N(\beta_v + z'\delta, \sigma^2),$$
where $\beta_v$ is an intercept related to the latent state and $\delta$ is a vector of regression coefficients. These parameters, including the variance $\sigma^2$, can be estimated together with the initial and transition probabilities. For binary response variables, it is instead natural to assume that
$$\log \frac{\psi_i^{(t)}(v, z)}{1 - \psi_i^{(t)}(v, z)} = \beta_v + z'\delta,$$
with $\psi_i^{(t)}(v, z)$ corresponding to the conditional probability of "success," that is, $\psi_i^{(t)}(v, z) = p(Y_i^{(t)} = 1 \mid V_i^{(t)} = v, Z_i^{(t)} = z)$. The above approach may be extended to response variables having a different nature through a generalized linear model (GLM) parameterization (McCullagh and Nelder 1989) or similar parameterizations for multivariate categorical data (Bartolucci, Colombi, and Forcina 2007). We refer the reader to Bartolucci and Farcomeni (2009) and Maruotti (2011) for further details.
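As a concrete illustration of the binary measurement model, the success probability is the inverse logit of the linear predictor. The following minimal sketch is in Python rather than the article's R, and the function name and arguments are our own:

```python
import math

def success_prob(beta_v, delta, z):
    """Success probability psi(v, z) under the logit measurement model:
    logit psi = beta_v + z' delta, where beta_v is the intercept of
    latent state v and delta the vector of regression coefficients."""
    eta = beta_v + sum(d * zj for d, zj in zip(delta, z))
    return 1.0 / (1.0 + math.exp(-eta))
```

With no covariates, the probability reduces to the inverse logit of the state-specific intercept alone.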

Maximum Likelihood Estimation
In an application, we observe the response configuration $y_i = (y_i^{(1)}, \ldots, y_i^{(T)})$ and the sequence of covariate vectors $z_i^{(1)}, \ldots, z_i^{(T)}$ for $i = 1, \ldots, n$, where $n$ is the sample size; we collect the covariates in the unique vector $z_i$ (for all time occasions). To perform maximum likelihood estimation of the above model on the basis of these data, we need to compute the manifest distribution of $y_i$ given $z_i$, that is,
$$p(y_i \mid z_i) = \sum_{v} p(y_i \mid V_i = v, z_i)\, p(V_i = v), \qquad (1)$$
where the sum $\sum_v$ is over all the possible configurations $v = (v^{(1)}, \ldots, v^{(T)})$ of the latent process $V_i$. Efficient computation of the probability in (1) may be performed by a forward recursion available in the hidden Markov literature (see Baum et al. 1970; Levinson, Rabiner, and Sondhi 1983; Zucchini and MacDonald 2009). For every $i = 1, \ldots, n$, $t = 1, \ldots, T$, and $v = 1, \ldots, k$, the probability (or density) function $q_i^{(t)}(v) = p(y_i^{(1)}, \ldots, y_i^{(t)}, V_i^{(t)} = v \mid z_i)$ is obtained by the recursion
$$q_i^{(1)}(v) = \pi_v\, p(y_i^{(1)} \mid V_i^{(1)} = v, z_i^{(1)}), \qquad q_i^{(t)}(v) = p(y_i^{(t)} \mid V_i^{(t)} = v, z_i^{(t)}) \sum_{\bar{v}} q_i^{(t-1)}(\bar{v})\, \pi_{v|\bar{v}}, \quad t = 2, \ldots, T. \qquad (2)$$
The manifest probability of $y_i$ is finally obtained as
$$p(y_i \mid z_i) = \sum_{v=1}^{k} q_i^{(T)}(v).$$
This forward recursion requires a number of iterations of order $O(k^2 T)$, which linearly increases with the panel length. The memory requirement to store all values of $q_i^{(t)}(v)$, $t = 1, \ldots, T$, $v = 1, \ldots, k$, is instead of order $O(kT)$. For further technical details, see Khreich et al. (2010).
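To make the recursion concrete, the following minimal sketch (in Python, with our own naming; the article's implementation is in R) iterates the forward vector over time and returns the manifest probability:

```python
import numpy as np

def forward_manifest_prob(pi, Pi, cond):
    """Forward (Baum et al. 1970) recursion for p(y_i | z_i).

    pi:   initial probabilities pi_v, shape (k,)
    Pi:   transition matrix with entry [v_bar, v] = pi_{v|v_bar}, shape (k, k)
    cond: cond[t, v] = p(y_i^(t) | V_i^(t) = v, z_i^(t)), shape (T, k)
    """
    q = pi * cond[0]                 # q^(1)(v)
    for t in range(1, cond.shape[0]):
        q = cond[t] * (q @ Pi)       # q^(t)(v), cost O(k^2) per step
    return q.sum()                   # p(y_i | z_i) = sum_v q^(T)(v)
```

Each time step costs $O(k^2)$, which gives the $O(k^2 T)$ total mentioned above; only the current vector needs to be stored if the intermediate forward probabilities are not required.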
Maximum likelihood estimation is performed on the basis of the log-likelihood $\ell(\theta) = \sum_{i=1}^{n} \log p(y_i \mid z_i)$, where $\theta$ denotes the parameter vector specified according to the model of interest. We maximize this function by an EM algorithm (Baum et al. 1970; Dempster, Laird, and Rubin 1977) based on the complete-data log-likelihood denoted by $\ell^*(\theta)$, that is, the log-likelihood that we could compute if we knew the latent state of each subject at every occasion.
The EM algorithm alternates two steps (E and M) until convergence: the E-step consists of computing the conditional expectation of $\ell^*(\theta)$, given the observed data and the current value of $\theta$, using recursions similar to the one illustrated above; the M-step consists of maximizing this expected value with respect to $\theta$, so that the parameter vector is updated. The latter step may require simple iterative algorithms of Newton-Raphson type. A detailed description of this EM algorithm is available in Bartolucci and Farcomeni (2009). Nevertheless, the implementation of the algorithm must include further adjustments to account for the stationarity assumption on the latent process; see Bulla and Berzel (2008).

OVERVIEW OF THE RELEVANT LITERATURE
We provide a brief description of the approaches more closely related to the one proposed in this article, together with an overview about composite likelihood methods.

Models for Multilevel Longitudinal Data
Latent variable models have been formulated in several ways to account for the additional unobserved correlation that may arise in the presence of clustered data. In the multilevel literature, clustering of longitudinal data generally refers to sets of individual observations grouped across time. However, in this article we consider longitudinal data with a multilevel structure, where clustering refers to sets of individuals grouped according to some criteria. Therefore, we are interested in suitable statistical tools for modeling two types of correlation that cannot be explained by means of the observable covariates: (i) correlation between responses of a single individual across time and (ii) correlation between individuals belonging to the same cluster.
The standard LM approach, formulated as in Bartolucci and Farcomeni (2009) and related articles, provides a well-established tool for modeling the first kind of correlation. Some attempts have already been made to extend this approach to deal with correlation at cluster level in the presence of multilevel longitudinal data. In particular, Bartolucci, Lupparelli, and Montanari (2009) considered an LM model with covariates for assessing the evolution of the health status of patients admitted to certain nursing homes. In this model, the correlation between individuals in the same facility is explained by including suitable fixed effects at cluster level. Later, Bartolucci, Pennoni, and Vittadini (2011) proposed an extension of the LM model with covariates for analyzing ability in a sample of students belonging to different classes in different schools; here the effect of each cluster is modeled by a discrete latent variable. Both approaches assume time-constant cluster effects. Moreover, the use of fixed effects at cluster level as in Bartolucci, Lupparelli, and Montanari (2009) is suitable only for samples with a limited number of large clusters.
We propose a multilevel LM approach based on nested Markov chains for modeling time-varying heterogeneity between individuals and between groups of individuals. The approach allows for time-varying cluster effects and may also be used in the presence of many clusters of moderate size. This is the main novelty of the proposal, together with the use of composite likelihood methods required by the complexity of the resulting model.
It is also worth recalling the proposal of Altman (2007) based on mixed LM models, where the cluster effect is modeled by including continuous latent variables in an LM model. This approach permits modeling unobserved heterogeneity even in the presence of a large number of clusters, but it is not explicitly developed to deal with multilevel data. A related approach was developed by Maruotti (2011) on the basis of discrete latent variables.

Composite Likelihood Inference
Composite likelihood methods provide an alternative to full likelihood inference when the latter is not feasible. Problems in computing the full likelihood may arise when dealing with high-dimensional data with a complex dependence structure. Nowadays, composite likelihood techniques are widely employed in several contexts; for a recent review see Varin, Reid, and Firth (2011).
Following Lindsay (1988) and Cox and Reid (2004), the composite likelihood is a function obtained as the product of likelihood components referring to specific subsets of the data. These components may correspond to marginal or conditional densities, so the literature typically distinguishes between a conditional and a marginal version.
Composite conditional likelihood, originally proposed by Besag (1974), is the product of the conditional density of each observation given its neighbors. Other versions have later been developed, for instance by Molenberghs and Verbeke (2004). Composite marginal likelihood is the product of the marginal densities of subsets of observations. In particular, the pairwise likelihood, which considers densities of pairs of observations, represents a relevant case within the marginal approach (see Cox and Reid 2004; Varin 2008) and corresponds to the composite likelihood method we employ in this article. For related work on the pairwise likelihood method and the asymptotic properties of the resulting estimators, see also Hanfelt (2004), Varin and Vidoni (2005), and Gao and Song (2011).
Composite likelihood techniques inherit many of the appealing properties of full likelihood inference, mainly consistency and asymptotic normality of the resulting estimators, which have been well established for several classes of models. Nevertheless, there are still many practical and technical issues that represent open problems and need to be better explored. At this stage, we recall three crucial aspects related to our application.
First, the sample estimate of the score variance is generally computed by the outer product of the composite scores. When this yields a biased estimate of the Hessian matrix, pairwise inference may not be asymptotically efficient; see Lindsay (1982). Nevertheless, the loss of efficiency may be reduced if the score variance is estimated by means of bootstrap methods or by Monte Carlo simulation; see Varin, Reid, and Firth (2011, sec. 5.1) for an illustration of this method. The Monte Carlo solution is further illustrated in Section 5.2 for the class of models discussed in this article.
Second, model selection also represents an important aspect, because well-known criteria such as the Akaike information criterion (AIC; Akaike 1973) or the Bayesian information criterion (BIC; Schwarz 1978) cannot be directly used for composite likelihood inference. In this regard, Varin and Vidoni (2005) proposed a method based on CLIC, a composite version of AIC, where the penalization term depends on the estimate of the variance-covariance matrix mentioned above. Subsequently, Gao and Song (2010) proposed a composite version of BIC for the selection of high-dimensional data models, named CL-BIC, which increases the penalization term with respect to CLIC. The latter is the criterion adopted in this article because, as shown by Gao and Song (2010) through a series of simulations, CL-BIC generally outperforms CLIC, in particular for complex models.
Finally, another important issue is whether a weighted approach provides useful improvements for composite likelihood methods. This may be crucial in the presence of unequal cluster sizes, as the composite approach we initially propose gives more weight to observations coming from larger clusters than to those coming from smaller ones. See Renard, Molenberghs, and Geys (2004) and Kuk and Nott (2000) for a discussion of this issue. Therefore, in Section 5.3 we also propose a weighted pairwise likelihood method for the class of models developed in this work, and in Section 6 we compare the performance of the weighted and unweighted approaches through a simulation study.

THE PROPOSED MULTILEVEL EXTENSION FOR LM MODELS
In the context of multilevel longitudinal data, the $n$ sample units are grouped, according to some criteria, into $H$ clusters of size $n_1, \ldots, n_H$. Then, for each subject $i$ in cluster $h$, data are available at $T$ consecutive occasions. In particular, we denote by $Y_{hi}^{(t)}$ the response variable and by $Z_{hi}^{(t)}$ the corresponding column vector of covariates, where $h = 1, \ldots, H$, $i = 1, \ldots, n_h$, and $t = 1, \ldots, T$. Moreover, by $X_h^{(t)}$, with $h = 1, \ldots, H$ and $t = 1, \ldots, T$, we denote the column vector of cluster-level covariates, which may be time varying.
In the following, we show how multilevel longitudinal data, having the structure described above, may be analyzed by an extension of the LM approach outlined in Section 2.

Model Assumptions
Our extension assumes the existence of a latent process $U_h = (U_h^{(1)}, \ldots, U_h^{(T)})$ for every cluster $h$ and of a latent process $V_{hi} = (V_{hi}^{(1)}, \ldots, V_{hi}^{(T)})$ for every unit $i$, $i = 1, \ldots, n_h$, in the cluster. Both processes follow a first-order homogeneous Markov chain, with $k_1$ states at cluster level and $k_2$ at unit level. These processes are assumed to be independent of each other and also independent of the unit- and cluster-level covariates. Moreover, extending the assumptions formulated in Section 2, we impose that, for every sample unit $hi$ (unit $i$ in cluster $h$), the response variables $Y_{hi}^{(t)}$ are conditionally independent given $U_h$, $V_{hi}$, and the corresponding covariates. This implies that the response vectors for two subjects in the same cluster are conditionally independent given $U_h$, but they are not marginally independent. Marginal independence only holds for subjects belonging to different clusters.
The initial and transition probabilities of each cluster-level latent process are denoted by $\lambda_u = p(U_h^{(1)} = u)$ and $\lambda_{u|\bar{u}} = p(U_h^{(t)} = u \mid U_h^{(t-1)} = \bar{u})$, $t = 2, \ldots, T$, $\bar{u}, u = 1, \ldots, k_1$, and are collected in the vector $\boldsymbol{\lambda}$ and in the transition matrix $\boldsymbol{\Lambda}$. Moreover, for the unit-level latent processes $V_{hi}$, we substantially adopt the same notation as in Section 2, with initial probabilities $\pi_v = p(V_{hi}^{(1)} = v)$ and transition probabilities $\pi_{v|\bar{v}} = p(V_{hi}^{(t)} = v \mid V_{hi}^{(t-1)} = \bar{v})$, $\bar{v}, v = 1, \ldots, k_2$; these are collected in the vector $\boldsymbol{\pi}$ and in the matrix $\boldsymbol{\Pi}$, respectively.
To maintain a parsimonious parameterization, with respect to more standard versions of the LM model, we adopt two constraints on each hidden Markov chain at cluster and unit level: (i) the initial distribution coincides with the stationary distribution (stationarity); (ii) the transition probabilities from one latent state to another depend only on the state of departure. With reference to the cluster-level processes, the first constraint amounts to assuming that $\boldsymbol{\Lambda}'\boldsymbol{\lambda} = \boldsymbol{\lambda}$, so that the marginal distribution is the same at any time occasion, that is, $p(U_h^{(t)} = u) = \lambda_u$, $u = 1, \ldots, k_1$. This constraint makes particular sense if, as in the application illustrated in Section 7, the common time trend is modeled by including suitable explanatory variables among the covariates. The latent states may then be interpreted as residuals with respect to this common trend; their distribution is constant in time, although they are allowed to be serially dependent. The second constraint may be formally expressed as $\lambda_{u|\bar{u}} = \lambda^*_{\bar{u}}$ for all $u \neq \bar{u}$, where the $\lambda^*_{\bar{u}}$, $\bar{u} = 1, \ldots, k_1$, are common transition probabilities between 0 and $1/(k_1 - 1)$. For instance, with $k_1 = 3$ latent states we have
$$\boldsymbol{\Lambda} = \begin{pmatrix} 1 - 2\lambda^*_1 & \lambda^*_1 & \lambda^*_1 \\ \lambda^*_2 & 1 - 2\lambda^*_2 & \lambda^*_2 \\ \lambda^*_3 & \lambda^*_3 & 1 - 2\lambda^*_3 \end{pmatrix}.$$
In this way, the Markov chain is parameterized on the basis of $k_1$ free parameters. Note that in the case of two latent states, $k_1 = 2$, there is no restriction on the transition matrix. Concerning the unit-level Markov chains, the same constraints (i) and (ii) are expressed on the basis of the common transition probabilities $\pi^*_{\bar{v}}$, $\bar{v} = 1, \ldots, k_2$, which are between 0 and $1/(k_2 - 1)$.
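As a sketch of constraints (i) and (ii) (in Python, with our own function names), one can build the constrained transition matrix from the $k_1$ free parameters and check that the stationary vector used as initial distribution satisfies $\boldsymbol{\Lambda}'\boldsymbol{\lambda} = \boldsymbol{\lambda}$:

```python
import numpy as np

def constrained_transition(common):
    """Transition matrix under constraint (ii): each row u_bar has all
    off-diagonal entries equal to common[u_bar] (a value in (0, 1/(k-1)))
    and diagonal entry 1 - (k - 1) * common[u_bar]."""
    common = np.asarray(common, dtype=float)
    k = common.size
    Lam = np.tile(common[:, None], (1, k))
    np.fill_diagonal(Lam, 1.0 - (k - 1) * common)
    return Lam

def stationary_dist(Lam):
    """Stationary distribution lam solving Lam' lam = lam with sum(lam) = 1
    (constraint (i): initial distribution equal to the stationary one)."""
    k = Lam.shape[0]
    A = np.vstack([Lam.T - np.eye(k), np.ones(k)])
    b = np.zeros(k + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]
```

With $k_1 = 3$ this reproduces the matrix displayed above, and each row sums to one by construction.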
For the conditional response probabilities, the same considerations expressed in Section 2 still hold. In the case of continuous response variables, we then assume that
$$Y_{hi}^{(t)} \mid U_h^{(t)} = u, V_{hi}^{(t)} = v, X_h^{(t)} = x, Z_{hi}^{(t)} = z \sim N(\alpha_u + \beta_v + x'\gamma + z'\delta, \sigma^2),$$
where $\alpha_u$ is an intercept related to the cluster-level latent state, $\beta_v$ is an intercept related to the unit-level latent state, and $\gamma$ and $\delta$ are the corresponding vectors of regression coefficients. With binary response variables, instead, it is natural to assume that $\psi_{hi}^{(t)}(u, v, x, z) = p(Y_{hi}^{(t)} = 1 \mid U_h^{(t)} = u, V_{hi}^{(t)} = v, X_h^{(t)} = x, Z_{hi}^{(t)} = z)$, where
$$\log \frac{\psi_{hi}^{(t)}(u, v, x, z)}{1 - \psi_{hi}^{(t)}(u, v, x, z)} = \alpha_u + \beta_v + x'\gamma + z'\delta, \qquad (5)$$
with parameters having the same interpretation as above. Obviously, this formulation may also be extended to other cases by a GLM parameterization or similar parameterizations, as illustrated at the end of Section 2.1.

Manifest Distribution
When we observe a set of multilevel longitudinal data, we have a sequence of responses $y_{hi} = (y_{hi}^{(1)}, \ldots, y_{hi}^{(T)})$ for every sample unit $hi$, with $h = 1, \ldots, H$, $i = 1, \ldots, n_h$. We denote by $y_h$ the vector obtained by collecting the responses of all subjects in cluster $h$, that is, $y_{hi}^{(t)}$ for $i = 1, \ldots, n_h$ and $t = 1, \ldots, T$. Similarly, we observe the vectors of unit-level covariates $z_{hi}^{(1)}, \ldots, z_{hi}^{(T)}$; these covariates are collected in the unique vector $z_{hi}$ when referred to unit $hi$ (for all time occasions) and in the vector $z_h$ when referred to all units in the same cluster $h$. Finally, for every cluster $h$, we observe the vectors of cluster-level covariates $x_h^{(1)}, \ldots, x_h^{(T)}$, which are collected in the unique vector $x_h$ (for all time occasions).
Under the above assumptions, the manifest probability of $y_h$ given $x_h$ and $z_h$ has the following expression:
$$p(y_h \mid x_h, z_h) = \sum_{u} p(U_h = u)\, p(y_h \mid U_h = u, x_h, z_h),$$
where
$$p(y_h \mid U_h = u, x_h, z_h) = \prod_{i=1}^{n_h} \sum_{v} p(y_{hi} \mid U_h = u, V_{hi} = v, x_h, z_{hi})\, p(V_{hi} = v), \qquad (7)$$
with the sum $\sum_u$ extended over all the possible configurations $u = (u^{(1)}, \ldots, u^{(T)})$ of the latent process $U_h$ and $\sum_v$ over all the possible configurations $v = (v^{(1)}, \ldots, v^{(T)})$ of $V_{hi}$. According to the above expression, the manifest distribution $p(y_h \mid x_h, z_h)$ is obtained by first computing the conditional distribution $p(y_h \mid U_h = u, x_h, z_h)$ and then marginalizing the latter with respect to $U_h$. However, even if we employ recursion (2) instead of the expression in (7) for computing $p(y_h \mid U_h = u, x_h, z_h)$, the number of operations required to compute $p(y_h \mid x_h, z_h)$ may be huge. In fact, this number is of order $O(k_1^T k_2^2 n_h T)$, which increases linearly with the cluster size $n_h$ and exponentially with the panel length $T$. The memory requirement to store all the forward probabilities is instead of order $O(k_1^T k_2 n_h T)$. To clarify this point, Table S-1 in the supplementary material shows the order of the number of operations for different combinations of $k_1$, $k_2$ ($1, \ldots, 5$), $T$ (5, 10), and $n_h$ (5, 10, 20).
The results in Table S-1 confirm that the numerical complexity rapidly increases with $T$; in particular, with $T = 10$ this complexity makes it prohibitive to compute $p(y_h \mid x_h, z_h)$ even for a small cluster size and a limited number of latent states. Indeed, we found that computing this distribution is infeasible even for values of $T$ smaller than 10, which may easily be encountered in social and economic applications and in certain medical studies. Therefore, maximum likelihood estimation of the multilevel LM model parameters cannot be performed in general. This is certainly the case for the application motivating this article, described in Section 7. For this reason, we suggest below a composite likelihood approach based on a likelihood function with components corresponding to all possible pairs of units in each cluster. This is then a pairwise likelihood function across sample units.
Nevertheless, when computing the manifest distribution p( y h |x h , z h ) is feasible, because the number of time occasions is limited, on the basis of this distribution we may compute a full likelihood function. This function may be maximized by an extension of the algorithm described in Bartolucci and Farcomeni (2009) and related to the extension used by Bartolucci, Pennoni, and Vittadini (2011) for estimating a simpler multilevel version of the LM model. We skip details about this algorithm here because the resulting full maximum likelihood estimator is only used as a comparison for the composite likelihood estimator in the simulation study dealt with in Section 6 (when this estimator may be used) and it is not suggested in general.

PAIRWISE LIKELIHOOD INFERENCE
To make inference on the model parameters, we propose the use of the following pairwise log-likelihood:
$$\ell_p(\theta) = \sum_{h=1}^{H} \sum_{i=1}^{n_h - 1} \sum_{j=i+1}^{n_h} \log p(y_{hi}, y_{hj} \mid x_h, z_{hi}, z_{hj}), \qquad (8)$$
which recalls the pairwise log-likelihood used by Renard, Molenberghs, and Geys (2004) in a simpler context. With binary response variables, the parameter vector $\theta$ includes the support points $\alpha_u$, $u = 1, \ldots, k_1$, and $\beta_v$, $v = 1, \ldots, k_2$, the regression coefficients $\gamma$ and $\delta$ for the covariates (see expression (5)), and the parameters $\rho_{\bar{u}}$, $\bar{u} = 1, \ldots, k_1$, and $\tau_{\bar{v}}$, $\bar{v} = 1, \ldots, k_2$, defined on the basis of a logit-type transformation of the transition probabilities $\lambda^*_{\bar{u}}$ and $\pi^*_{\bar{v}}$; see the Appendix for technical details on these transformations. The transformations are useful because the resulting parameter space is $\mathbb{R}^{|\theta|}$, where $|\theta|$ stands for the dimension of $\theta$. Note that, when the dimension of each cluster is two ($n_h = 2$, $h = 1, \ldots, H$), $\ell_p(\theta)$ corresponds to the full log-likelihood of the model, since it is based on the manifest probability of the responses provided by all possible pairs of subjects in the same cluster. Moreover, when some clusters contain a single unit, this function must be suitably modified by adding a component of the type
$$\sum_{h: n_h = 1} \log p(y_{h1} \mid x_h, z_{h1}), \qquad (11)$$
whereas the sum in (8) refers to all clusters including at least two units. However, to simplify the description of the algorithm to maximize $\ell_p(\theta)$, in the following we consider component (11) as omitted. The adjustments required in the presence of some clusters with only one unit are minimal.
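The exact form of the logit-type transformation is given in the article's Appendix; a natural choice consistent with the stated range $(0, 1/(k-1))$, sketched here in Python with our own naming and offered only as an assumption, maps each constrained probability to the real line and back:

```python
import math

def unconstrained(p_star, k):
    """Map a common transition probability p* in (0, 1/(k-1)) to the real
    line via rho = log{ p* / [1/(k-1) - p*] } (a logit-type transformation;
    this particular form is our assumption, not taken from the Appendix)."""
    return math.log(p_star / (1.0 / (k - 1) - p_star))

def constrained(rho, k):
    """Inverse map, back from the real line to the interval (0, 1/(k-1))."""
    return (1.0 / (k - 1)) / (1.0 + math.exp(-rho))
```

Any such one-to-one map lets the maximization run over an unconstrained parameter space while the transition probabilities automatically respect their bounds.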

Computation and Maximization of the Pairwise Likelihood
To efficiently compute the probability p(y_hi, y_hj | x_h, z_hi, z_hj) as a function of the parameters in θ, we employ recursion (2), already used for the model illustrated in Section 2. In fact, let ỹ^(t)_hij = (y^(t)_hi, y^(t)_hj) denote the vector of responses provided by the pair of units at occasion t. It may be simply proved that, for t = 1, . . . , T, these vectors follow a bivariate LM model with covariates, since they are conditionally independent given the latent process W^(1)_hij, . . . , W^(T)_hij, with W^(t)_hij = (U^(t)_h, V^(t)_hi, V^(t)_hj), and the corresponding covariates. In particular, this latent process follows a Markov chain with an augmented space of k = k_1 k_2^2 states indexed by w = (u, v_1, v_2). It is simple to see that the initial probability of state w is

φ_w = λ_u π_{v_1} π_{v_2}, (13)

whereas, for t = 2, . . . , T, the transition probability from state w̄ = (ū, v̄_1, v̄_2) to state w is

φ_{w|w̄} = λ_{u|ū} π_{v_1|v̄_1} π_{v_2|v̄_2}. (14)

Moreover, the model assumptions imply that, given W^(t)_hij = w, the conditional probability of ỹ^(t)_hij is equal to

p(ỹ^(t)_hij | W^(t)_hij = w, x_h, z_hi, z_hj) = p(y^(t)_hi | U^(t)_h = u, V^(t)_hi = v_1, x_h, z_hi) p(y^(t)_hj | U^(t)_h = u, V^(t)_hj = v_2, x_h, z_hj). (15)

A similar expression holds for continuous response variables, based on the corresponding density functions.
To compute p(y_hi, y_hj | x_h, z_hi, z_hj), recursion (2) is applied with the probability or density p(y^(t)_i | V^(t)_i = v, z^(t)_i) substituted by the probability or density p(ỹ^(t)_hij | W^(t)_hij = w, x_h, z_hi, z_hj) for all w. Similarly, π must be substituted by the initial probability vector φ, with elements φ_w, and Π by the transition matrix Φ, with elements φ_{w|w̄}. In this way, the order of the number of operations, representing the computational complexity, is O(k_1^2 k_2^4 n_h(n_h − 1)T/2), which increases linearly, rather than exponentially, with the panel length; see Table S-2 for an illustration.
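The construction above can be sketched in a few lines of NumPy (a minimal illustration with our own naming, not the authors' implementation): by (13) and (14), φ and Φ are Kronecker products of the cluster-level quantities (λ, Λ) and two copies of the unit-level quantities (π, Π), and a scaled forward recursion on the augmented chain then returns log p(y_hi, y_hj | ·).

```python
import numpy as np

def augmented_chain(lam, Lam, pi, Pi):
    """phi and Phi for the pair process W = (U_h, V_hi, V_hj):
    phi_w = lambda_u * pi_{v1} * pi_{v2} (cf. (13)) and
    Phi = Lam (x) Pi (x) Pi (cf. (14), Kronecker products)."""
    phi = np.kron(lam, np.kron(pi, pi))
    Phi = np.kron(Lam, np.kron(Pi, Pi))
    return phi, Phi

def pair_log_prob(log_emis, phi, Phi):
    """Scaled forward recursion on the k = k1 * k2**2 augmented states.
    log_emis: (T, k) array of log p(y~_hij^(t) | W_hij^(t) = w).
    Returns log p(y_hi, y_hj | covariates); scaling avoids underflow."""
    alpha = phi * np.exp(log_emis[0])
    c = alpha.sum()
    loglik = np.log(c)
    alpha /= c
    for t in range(1, log_emis.shape[0]):
        alpha = (alpha @ Phi) * np.exp(log_emis[t])
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c
    return loglik
```

With k_1 = k_2 = 2, the augmented chain has 8 states; the emission probabilities in `log_emis` would be built from (15) as products of the two unit-level conditional probabilities.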
From the comparison between Tables S-1 and S-2 in the supplementary material, we notice that the pairwise-based approach generally leads to a complexity reduction when k_1 > 2. In particular, when T = 5, the ratio between the order of the complexity under the pairwise and under the full likelihood approach is 0.30 for k_1 = 3, k_2 = 2, and n_h = 5; it is 0.63 for k_1 = 4, k_2 = 3, and n_h = 10; and it is 0.59 for k_1 = 4, k_2 = 2, and n_h = 20. This reduction is more evident in long panels with smaller cluster sizes. When T = 10, this ratio is always lower than 1 when k_1 > 1; moreover, with k_1 = k_2 = 2 the ratio is 0.03 for n_h = 5, 0.07 for n_h = 10, and 0.15 for n_h = 20; with k_1 = 2 and k_2 = 3 the ratio is 0.07 for n_h = 5, 0.16 for n_h = 10, and 0.33 for n_h = 20; with k_1 = 3 and k_2 = 2 it becomes close to zero even for larger cluster sizes. An emblematic case is k_1 = k_2 = 5, n_h = 5, and T = 10, for which the order of the number of operations required by the full likelihood approach is around 7800 times that required by the pairwise likelihood method.
The pairwise log-likelihood ℓ_p(θ) defined in (8) can be maximized by an EM algorithm having a structure that closely recalls that outlined in Section 2.2. In this case, in particular, the complete data pairwise log-likelihood is

ℓ*_p(θ) = Σ_h Σ_{i<j} [ Σ_w d^(1)_hij(w) log φ_w + Σ_{t=2}^{T} Σ_{w̄} Σ_w d^(t)_hij(w̄, w) log φ_{w|w̄} + Σ_{t=1}^{T} Σ_w d^(t)_hij(w) log p(ỹ^(t)_hij | W^(t)_hij = w, x_h, z_hi, z_hj) ]. (16)

In the above expression, d^(t)_hij(w) is a dummy variable equal to 1 if, at occasion t, cluster h is in latent state u, unit hi is in latent state v_1, and unit hj is in latent state v_2; moreover, we have d^(t)_hij(w̄, w) = d^(t−1)_hij(w̄) d^(t)_hij(w). The complete data pairwise log-likelihood may be simply expressed in terms of the parameters of the proposed multilevel model by substituting (13), (14), and (15) in the above expression. For instance, the first component becomes the sum over u of

d̃^(1)_hij(u) log λ_u + Σ_{v_1} d̃^(1,1)_hij(u, v_1) log π_{v_1} + Σ_{v_2} d̃^(1,2)_hij(u, v_2) log π_{v_2}, (17)

where the variables d̃^(1)_hij(u), d̃^(1,1)_hij(u, v_1), and d̃^(1,2)_hij(u, v_2) are obtained by summing d^(1)_hij(w) over suitable configurations of w. In a similar way, we can express the other two components, involving the transition and the conditional response probabilities (or densities).
At the E-step of the EM algorithm, the conditional expected value of each dummy variable d^(t)_hij(w) and d^(t)_hij(w̄, w) is computed by the same recursions used in the algorithm of Baum et al. (1970). At the M-step, the model parameters are updated by maximizing the function obtained by substituting these expected values in (16) and using simplification (17) and similar simplifications. In any case, the EM algorithm is implemented along the same lines as the algorithm used for computing the full likelihood, taking into account that the parameters ρ_ū and τ_v̄ for the transition probabilities, defined in (9) and (10), are updated by a numerical algorithm, which also accounts for the constraint that the initial distribution of each latent process coincides with its stationary distribution (Bulla and Berzel 2008).
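The E-step recursions for one pair can be sketched as follows (a standard scaled forward-backward implementation under our own naming; the inputs are the augmented-chain quantities φ, Φ and the emission probabilities from (15)): the recursions of Baum et al. (1970) return the posterior expectations of d^(t)_hij(w) and d^(t)_hij(w̄, w).

```python
import numpy as np

def posterior_dummies(log_emis, phi, Phi):
    """Scaled forward-backward recursions for one pair (hi, hj).
    Returns gamma[t, w]    = posterior E[d_hij^(t)(w)] and
            xi[t-1, wb, w] = posterior E[d_hij^(t)(wbar, w)], t = 2..T."""
    T, k = log_emis.shape
    B = np.exp(log_emis)
    alpha = np.zeros((T, k)); beta = np.zeros((T, k)); c = np.zeros(T)
    alpha[0] = phi * B[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                       # scaled forward pass
        alpha[t] = (alpha[t - 1] @ Phi) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # scaled backward pass
        beta[t] = Phi @ (B[t + 1] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                        # each row sums to 1
    xi = np.array([np.outer(alpha[t - 1], B[t] * beta[t]) * Phi / c[t]
                   for t in range(1, T)])
    return gamma, xi
```

At the M-step, these expectations would be aggregated over pairs and summed over the suitable configurations of w, as in (17), before updating the parameters.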
To make the pairwise likelihood estimation faster, we combine the EM algorithm previously illustrated with the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm, a well-known iterative method for solving unconstrained nonlinear optimization problems (Avriel 2003). In practice, the first algorithm is run for a limited number of steps, and the second algorithm is then used to reach the maximum of ℓ_p(θ) directly, starting from the EM solution. In the implementation of the BFGS algorithm, the score of the objective function is computed analytically and only the Hessian matrix is approximated numerically. This procedure has proved to perform very well because the second-order derivative is not recomputed at each step, but is updated in a suitable way. This leads to an accurate estimate of the observed information matrix and to a faster convergence compared with running the EM algorithm until convergence.
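The hybrid strategy can be outlined as follows (a schematic sketch only; `neg_pll`, `neg_score`, and `em_update` are placeholders for model-specific routines, not the authors' code). A few EM steps move the starting value into a stable region, and BFGS with the analytical score then converges quickly.

```python
import numpy as np
from scipy.optimize import minimize

def fit_pairwise(theta0, neg_pll, neg_score, n_em_steps, em_update):
    """Hybrid EM + BFGS maximization of the pairwise log-likelihood.
    neg_pll(theta)   : negative pairwise log-likelihood
    neg_score(theta) : its analytical gradient
    em_update(theta) : one EM iteration (monotone but slowly converging)"""
    theta = theta0
    for _ in range(n_em_steps):
        theta = em_update(theta)
    # BFGS refines the EM solution; its inverse-Hessian update also
    # provides an approximation to the observed information matrix
    res = minimize(neg_pll, theta, jac=neg_score, method="BFGS")
    return res.x, res
```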
From the results of the EM algorithm and, in particular, on the basis of the posterior expected values computed at the E-step, a local decoding for each cluster and unit can be performed. This amounts to singling out the state that, at each time occasion, has the greatest posterior probability given the observed sequences of responses and covariates. In particular, the quantities to be used are the posterior expected values of suitable sums of the indicator variables d^(t)_hij(w), of the same type as those used in Equation (17). An alternative procedure, named global decoding, is based on the Viterbi algorithm (see Viterbi 1967; Juang and Rabiner 1991), which, in the present case, allows us to efficiently find the path of states with the highest posterior probability for each cluster or unit. The predicted paths, found by either local or global decoding, may be represented in plots similar to those in Figures S-1 and S-2 in which, however, the latent trajectories are conditional on the observed responses.
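Global decoding can be sketched as follows (a textbook log-space Viterbi implementation with our own naming; the inputs are the augmented-chain quantities of Section 5.1):

```python
import numpy as np

def viterbi(log_emis, phi, Phi):
    """Most probable latent-state path given the observed sequence.
    log_emis: (T, k) log emission probabilities; phi, Phi: initial
    probabilities and transition matrix of the (augmented) chain."""
    T, k = log_emis.shape
    logPhi = np.log(Phi)
    delta = np.log(phi) + log_emis[0]
    psi = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logPhi     # cand[wbar, w]
        psi[t] = cand.argmax(axis=0)       # best predecessor of each state
        delta = cand.max(axis=0) + log_emis[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):          # backtrack
        path[t - 1] = psi[t, path[t]]
    return path
```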

Model Selection and Hypothesis Testing
For model selection, we adopt the criterion suggested by Gao and Song (2010), which may be seen as the counterpart of BIC for composite likelihood inference. The model to be selected is then the one that maximizes the index in (18), in which tr(Ĵ^{-1}K̂) enters as a penalty for the model complexity. The general CL-BIC proposed by Gao and Song (2010) also includes a further penalty term related to the model space complexity, which makes the criterion applicable to a much broader range of likelihood or quasi-likelihood approaches. However, we do not deem this further penalty term to provide useful insights in the present context, and it is omitted here. The matrices Ĵ and K̂ are estimates of the sensitivity and variability matrices of the composite score, J(θ) = −E[∂²ℓ_p(θ)/∂θ ∂θ'] and K(θ) = V[∂ℓ_p(θ)/∂θ]. In Renard, Molenberghs, and Geys (2004), these matrices are estimated empirically and, consequently, the variance-covariance matrix of the pairwise likelihood estimator θ̂, and hence its standard errors, are obtained by the sandwich formula

V̂(θ̂) = Ĵ^{-1} K̂ Ĵ^{-1}. (19)

In particular, Ĵ is obtained from the Hessian provided by the numerical algorithm described at the end of Section 5.1. On the basis of a series of simulated samples, we verified that the empirical estimator K̂ of the variance of the score is often unstable and does not provide coherent values of the penalty term for CL-BIC, because smaller penalty values may result for more complex models. More stable estimates are instead achieved by a Monte Carlo approximation. Therefore, we estimate K through a Monte Carlo method, as suggested, among others, by Varin, Reid, and Firth (2011). This method is based on: (i) drawing a suitable number of samples from the estimated model; (ii) computing the pairwise score ∂ℓ_p(θ)/∂θ for each simulated sample; and (iii) estimating K as the variance-covariance matrix of these score vectors. We found that, in this way, more stable estimates of the variance of the score are obtained, which in turn implies penalty terms more coherent with the model complexity computed according to expression (18).
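Steps (i)-(iii) and the resulting penalty can be sketched as follows (hypothetical function names; `simulate_data` and `pairwise_score` stand for model-specific routines that the paper does not spell out):

```python
import numpy as np

def mc_variability_matrix(theta_hat, simulate_data, pairwise_score, n_rep, rng):
    """Monte Carlo estimate of K = Var[pairwise score]:
    (i) simulate n_rep datasets from the fitted model,
    (ii) evaluate the pairwise score at theta_hat on each,
    (iii) return the sample variance-covariance of the scores."""
    scores = np.array([pairwise_score(theta_hat, simulate_data(rng))
                       for _ in range(n_rep)])
    return np.cov(scores, rowvar=False)

def cl_bic_penalty(J_hat, K_hat):
    """Effective dimension tr(J^{-1} K) used as the CL-BIC penalty."""
    return float(np.trace(np.linalg.solve(J_hat, K_hat)))
```

When the composite likelihood coincides with a true likelihood, J = K and the penalty reduces to the number of parameters, which is the usual BIC-type behavior.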
We use CL-BIC to select the numbers of states k_1 and k_2 of the latent processes U_h at cluster level and V_hi at unit level. Moreover, this criterion can also be used for selecting one of the possible parameterizations illustrated in Section 4.
An important issue that must be considered when dealing with latent variable models is parameter identifiability, as lack of identifiability may lead to complications in estimation and problems regarding the inferential properties of the estimators. As typically happens with these models, we focus on local identifiability, that is, identifiability in a neighborhood of a given parameter value, rather than global identifiability, which refers to the whole parameter space (e.g., McHugh 1956; Rothenberg 1971); see also Bartolucci (2006) with specific reference to a basic LM model. To establish that the model is locally identifiable, we check the rank of the observed information matrix. The condition that this matrix has full rank must always be verified within the inferential procedures described in this section (in particular, for model selection and to derive standard errors for the parameter estimates), as these procedures are based on the inverse of this matrix.

Weighted Pairwise Likelihood
In its current form, the pairwise likelihood defined in (8) gives more weight to the units belonging to larger clusters, because the number of components involving the data of a specific unit equals the number of other units in the same cluster. As suggested by Renard, Molenberghs, and Geys (2004), a weighted version of the pairwise log-likelihood may be more suitable when the clusters differ considerably in size. We therefore propose the following weighted pairwise log-likelihood:

ℓ_wp(θ) = Σ_{h=1}^{H} ξ_h Σ_{i=1}^{n_h−1} Σ_{j=i+1}^{n_h} log p(y_hi, y_hj | x_h, z_hi, z_hj), (20)

where ξ_h = 1 if the size of cluster h is n_h = 1, and ξ_h = (n_h − 1)^{−1} otherwise, for h = 1, . . . , H. The weight given to each sample unit thus decreases proportionally with the cluster size, so that in models with k_1 = k_2 = 1 (no latent structure) the maximization of (20) yields the same estimates as the maximization of the full likelihood; the same does not happen with the unweighted pairwise likelihood initially defined. Pairwise likelihood estimation and model selection based on the weighted approach can be carried out using essentially the same maximization procedure and CL-BIC discussed in Sections 5.1 and 5.2, respectively. Notice that Renard, Molenberghs, and Geys (2004) conjecture that the weighted approach should in general be preferable, but that on some occasions the unweighted version may perform better; the choice thus depends on the problem at hand. The simulation study presented in the following section compares the performance of the unweighted and weighted pairwise likelihood methods with that of the full likelihood method, when the latter may be used.
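The weighting scheme can be illustrated with a minimal sketch (our own naming; the singleton component (11) is left out, as in the description above). Each unit appears in n_h − 1 pairs, each carrying weight 1/(n_h − 1), so its total weight is 1 regardless of cluster size.

```python
import math

def pair_weight(n_h):
    """Weight xi_h of cluster h: 1 for singleton clusters, 1/(n_h - 1)
    otherwise, so that each unit's pairwise terms carry total weight 1."""
    return 1.0 if n_h == 1 else 1.0 / (n_h - 1)

def weighted_pairwise_ll(clusters):
    """clusters: list of (n_h, pair_lls) tuples, where pair_lls holds the
    values log p(y_hi, y_hj | ...) for that cluster's pairs (empty for
    singleton clusters)."""
    return sum(pair_weight(n_h) * sum(pair_lls) for n_h, pair_lls in clusters)
```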

SIMULATION STUDY
In this section, we illustrate and discuss the results of a simulation study aimed at assessing the properties of the pairwise maximum likelihood estimator outlined in Section 5 for the proposed multilevel LM model in its version for binary variables. This study is based on scenarios that recall the application described in Section 7.
Given a model with k_1 and k_2 latent states, 100 datasets are simulated for each scenario, obtained by combining different values of the number of occasions T, the number of clusters H, and the sample size. In detail, we consider T = 5, 10 and H = 200, 400, whereas the number of units per cluster is generated either from the uniform distribution U(1, 10) or from U(1, 20). In this way, the average sample size is 1100 for H = 200 with U(1, 10), 2200 for H = 400 with U(1, 10), 2100 for H = 200 with U(1, 20), and 4200 for H = 400 with U(1, 20). Moreover, two covariates at unit level are generated from an AR(1) process with autocorrelation equal to 0.5.
The simulation design contemplates eight different scenarios, each replicated under three multilevel LM models characterized by different numbers of latent states; in particular, data are generated from the model with k_1 = k_2 = 2, with k_1 = 2 and k_2 = 3, and with k_1 = 3 and k_2 = 2. The simulated data are fitted using the pairwise likelihood inference illustrated in Section 5, based on both the unweighted and the weighted approach described in Section 5.3. The simulated data are also fitted using the full likelihood inference illustrated in Section 2, but only for the scenarios in which computing the full likelihood is feasible, that is, for T = 5 with k_1 = k_2 = 2 or with k_1 = 2 and k_2 = 3. The results displayed in Tables S-3 to S-8 of the supplementary material allow us to compare the performance of the unweighted and weighted pairwise inference, also with respect to the full likelihood approach.
Table S-3 contains the results for the four scenarios for the model with k_1 = k_2 = 2 latent states and T = 5. The first row of the table shows the true parameters of the model, which are, for the cluster-level latent process U_h, the logit-type transformations ρ_1 and ρ_2 of the transition probabilities and, for the unit-level latent process V_hi, the logit transformations τ_1 and τ_2 of the transition probabilities. The intercept and the regression coefficients for the two covariates generated at unit level are denoted by δ_0, δ_1, and δ_2, and the support points are denoted by α_2 and β_2 (given that α_1 = β_1 = 0). For each scenario, we show the bias, the standard deviation (sd), and the root mean square error (rmse) of the estimates computed over the 100 replications. Furthermore, for the pairwise likelihood inference, we compare the standard deviation of the estimates with the mean of the estimated standard errors (mean-se) computed on the basis of the sandwich formula in (19).
The results in Table S-3 show that, for both the pairwise and the full likelihood approach, the highest bias values concern the parameters involved in the individual latent process (τ_1, τ_2) and the support point β_2. The bias decreases as the sample size increases in the second and, in particular, in the fourth scenario. In terms of bias, the weighted pairwise approach performs better than the unweighted one; for instance, for τ_1 in the first scenario, the bias in the former case is nearly 50% of that in the latter. More generally, the table shows that the weighted pairwise likelihood inference provides a better approximation of the full likelihood inference than the unweighted approach. For instance, for τ_1 in the first scenario, the rmse with the unweighted approach is 2.6 times that with the full likelihood approach; this ratio decreases to 1.6 with the weighted approach. In the pairwise likelihood approaches, the difference between mean-se and sd is largest for the parameters involved in the individual latent process, and these differences are more evident for the unweighted than for the weighted approach. For instance, for τ_2 in the first scenario, the ratio between mean-se and sd is nearly 1 under the weighted approach, that is, 0.926, whereas it is 0.575 under the unweighted one. Nevertheless, the sandwich formula (19) provides good estimates of the true standard errors as the sample size increases (in particular for the unweighted approach); for the same parameter τ_2 in the fourth scenario, this ratio becomes 1.109 and 1.143 for the weighted and unweighted approach, respectively. Similar considerations also hold for the scenarios considered in Table S-4, referred to models with k_1 = 2 and k_2 = 3 latent states and T = 5. The first row of this table includes one more parameter, τ_3, for the latent chain V_hi and one more support point, β_3.
In models with higher values of k_2, estimation of the parameters for V_hi seems to be more critical; however, the weighted pairwise approach is closer to the full likelihood inference than the unweighted one, both in terms of bias and of efficiency. For τ_3 in the second scenario, the rmse is 0.833, 0.548, and 0.370 under unweighted pairwise, weighted pairwise, and full likelihood inference, respectively. In these scenarios, the standard error estimates of the parameters for V_hi under the pairwise approach are especially critical when the sample size is small, whereas good results are obtained for the other parameter estimates. For instance, consider τ_2 in the first scenario: the ratio between mean-se and sd is 0.392 for the unweighted approach, whereas it improves to 0.598 under the weighted approach.
Table S-5 contains the results for the four scenarios for the model with k_1 = 3 and k_2 = 2 latent states and T = 5. The results refer only to the pairwise approaches because, in this case, computing the full likelihood estimator turned out to be very slow even with five time occasions. The first row includes one more parameter, ρ_3, for the latent chain U_h and one more support point, α_3. As in the previous tables, the highest bias and rmse generally concern the estimates for the individual latent process. The weighted approach always outperforms the unweighted one, especially for H = 200 with a smaller sample size, even in terms of the standard error estimates; in the first scenario, the ratio between mean-se and sd for τ_2 is 0.622 and 0.994 under the unweighted and weighted pairwise approach, respectively.
The simulation study shows very good results for the pairwise inference, both in terms of bias and of standard deviation, when the number of occasions increases; see the results for T = 10 in Table S-6 for models with k_1 = k_2 = 2 latent states, in Table S-7 for k_1 = 2 and k_2 = 3, and in Table S-8 for k_1 = 3 and k_2 = 2. These tables show the performance only for the pairwise inference because computing the full likelihood is not feasible in these cases. We observe that the results benefit more from an increase of the number of time occasions than from an increase of the sample size. In Tables S-7 and S-8, in particular in the scenarios with H = 200, we notice that the estimates under the weighted approach are more stable (lower standard error), whereas the estimates under the unweighted approach are more accurate (lower bias). However, in all these tables, the weighted pairwise inference generally performs better in terms of rmse than the unweighted one. For instance, in the fourth scenario considered in Table S-6, the rmse for τ_1 is 0.136 and 0.142 under the weighted and unweighted approach, respectively. In the fourth scenario of Table S-7, the rmse for τ_3 is 0.158 and 0.181 under the weighted and unweighted approach, respectively. In the fourth scenario of Table S-8, the rmse for τ_2 is 0.166 and 0.178 under the weighted and unweighted approach, respectively. In particular, we notice that in models with T = 10 the estimates of the standard errors and the standard deviations are very close. For instance, for τ_1 in the first scenario, the ratio between mean-se and sd is 0.850 and 0.842 in Table S-6, and 0.874 and 0.861 in Table S-7, for the weighted and unweighted pairwise inference, respectively. For τ_2 in the first scenario of Table S-8, the same ratio is 1.000 and 0.959 for the weighted and unweighted pairwise inference, respectively.

Table 1. Values of CL-BIC computed using the Monte Carlo estimate of the score variance for different values of k_1 and k_2 (in boldface the largest CL-BIC value). For each model, the penalization term is included in brackets.

APPLICATION
We illustrate the proposed approach by an application based on a dataset on individual work histories derived from the administrative archives of the Italian National Institute of Social Security (INPS). We consider a sample of 3850 employees (both blue-collars and white-collars, 24% of them are women) grouped in 402 private Italian firms with more than 1000 workers. The subjects, continuously working in the same firm and aged between 18 and 60 in 1994, were followed for 10 years, from 1994 to 2004.
As already mentioned in Section 1, the binary response variable of interest is illness (equal to 1 if the employee received illness benefits in a certain year and to 0 otherwise). We also consider a set of unit-level covariates: gender (dummy equal to 1 for a woman), age in 1994, time (time occasion from 1 to 10), skill (dummy equal to 1 for a blue-collar), income (total annual compensation in thousands of Euros), and part-time (dummy equal to 1 for a part-time employee); we also include the lagged response. The unit-level covariates are time varying, except gender and age in 1994. We then consider a set of cluster-level covariates: area (North-West, North-East, Center, South, or Islands) and size (number of observed workers in each firm, taking values between 1 and 252). In the application, the nonlinear effect of age and time is accommodated by including age^2 and time^2 among the covariates.
Given the reduced number of covariates, it is reasonable to let the response variable of interest depend on unobservable latent variables related to the propensity of employees to ask for illness benefits and to the average propensity at firm level. Furthermore, we are interested in verifying the hypothesis that, across time, these latent traits may be influenced by dynamic (unobserved) aspects related to the worker's life and to the firm's management.
The model described in Section 4 is fitted to this dataset under assumptions (4) and (5) for the conditional distribution of the binary response illness. Moreover, we assume that the two latent Markov chains at cluster and unit level are stationary, with constant off-diagonal elements in each row of the transition matrices, as described in Section 4. In this application, then, we estimate the effect of the covariates on the probability of getting ill, in addition to the unobservable heterogeneity modeled over time through two stationary Markov processes. The unit-level process is expected to capture the worker's propensity (not explained by the observed covariates) to get ill, whereas the cluster-level latent process explains the effect of the different firms on this propensity. Model estimation is carried out using the weighted pairwise inference illustrated in Section 5. The first step of the analysis is the choice of k_1 and k_2 for the two processes. This choice is based on CL-BIC defined in (18); the value of this index is reported in Table 1 for different values of k_1 and k_2. According to these results, the model with k_1 = 3 latent states at cluster level and k_2 = 3 latent states at unit level is selected.

Table 2. Estimates of the logistic regression parameters (collected in the vectors γ and δ) affecting the conditional probabilities under the models with k_1 = k_2 = 1, k_1 = k_2 = 2, and k_1 = k_2 = 3 latent states.

Table 2 collects the estimates of the regression parameters for the selected model with k_1 = k_2 = 3 states (displayed on the right side), and also for two other models of interest, with k_1 = k_2 = 1 (fixed-effects model, displayed on the left side) and with k_1 = k_2 = 2.
From the comparison of the estimates under the three fitted models, we notice that gender and all cluster-specific covariates (size and area) are not significant; the quadratic effects of age and time are also not significant. On the other hand, the probability of receiving illness benefits is positively related to age, time, and being a blue-collar, whereas it is negatively related to income and to having a part-time job. The positive effect of the lagged response requires a more detailed discussion. This effect is significant under the models with a lower number of latent states, whereas it is no longer significant under the selected model. We consider the significant effects found in the models with k_1 = k_2 = 1 and k_1 = k_2 = 2 to be a spurious association, arising because unobserved heterogeneity is not properly accounted for. In fact, this association vanishes after conditioning on the latent processes with k_1 = k_2 = 3. Hence, the probability of receiving illness benefits is explained by a set of covariates and also by unobservable aspects that cannot be ignored. This is confirmed by the fact that the model with no random effects, that is, with k_1 = k_2 = 1, provides a lower value of CL-BIC. In particular, the correlation between workers in the same firm seems to be completely explained by the cluster-level latent process, given that no cluster-specific covariates are significant. The dependence on time of the latent variables at cluster and unit level is also investigated. For this aim, the data are fitted under three additional constraints on the selected model: (i) identity transition matrix for the cluster-level process (time-constant cluster random effects), (ii) identity transition matrix for the unit-level process (time-constant unit random effects), and (iii) identity transition matrices for both the cluster- and unit-level processes (time-constant cluster and unit random effects).
Note that, when we constrain the transition matrix to be diagonal, either at cluster or unit level, we adopt free initial probabilities. The CL-BIC values obtained under these three models are (i) −6076.29, (ii) −6069.15, and (iii) −6089.32, respectively; all of them are lower than −6057.69 obtained for the selected model with time-varying latent variables at both levels.
As regards the distribution of the cluster- and unit-level latent processes, the estimates of the initial and transition probabilities are reported in Tables 3 and 4, which also include the support point estimates.
For both processes, we observe that the states are well separated; the first and second states have the highest probabilities for the cluster-level process, whereas the second state has the highest probability for the individual-level process. The estimates of the transition matrices show a high persistence, in particular for the cluster-level latent process. To illustrate the behavior of the latent processes at individual and cluster level based on these parameters, Figures S-1 and S-2 represent 1000 trajectories simulated from the corresponding distributions.
Nevertheless, simpler models implying time-constant random effects provide a lower value of CL-BIC compared to the selected model. This may depend on the fact that we are dealing with a long panel with a large number of clusters of small size where the transitions between latent states, even if moderate, are not negligible both for employees and firms.

CONCLUSIONS
With reference to multilevel longitudinal data, where sample units are collected in clusters, we propose an approach to account for the unobserved heterogeneity between sample units and between clusters in a dynamic fashion. For this class of models, computing the full likelihood is often infeasible, in particular with long panel data. Therefore, we propose a pairwise likelihood approach for model estimation. We remark that the approach we propose is rather new in the literature for multilevel longitudinal data, because this type of composite likelihood is defined for subsets of individuals in each cluster and not for subsets of time occasions for each individual. Hence, we assess the performance of the proposed pairwise likelihood inference by means of a simulation study. From this study, we draw the following conclusions.
First, we notice that the estimates of the latent parameters are in general more critical than the estimates of the regression parameters for the observed covariates, which show very low bias and mean square error. Second, pairwise likelihood inference becomes more efficient as the sample size increases and, in particular, as the panel length increases. Third, the weighted pairwise approach generally performs better than the unweighted one, because the former better approximates the maximum likelihood estimates (when computing the full likelihood is feasible) and, in general, provides lower mean square errors, thus reducing the loss of information and increasing the efficiency of the parameter estimates.
Pairwise likelihood inference for multilevel LM models is used for the analysis of a longitudinal dataset based on a sample of Italian workers employed in different firms. The results of this application show that the proposed model is suitable for long panels with a large number of clusters of small size, that is, in a context where computing maximum likelihood estimates is demanding (or even infeasible) and where pairwise likelihood inference has been shown to perform well. It is worth noting that the analyzed dataset supports the hypothesis of time-dependent latent variables at both cluster and unit level. This is the main novelty of the proposed extension of the LM model based on nested hidden Markov chains. This extension provides useful insights for panel data with a limited number of covariates, such as the data considered in this application, in which the unobservable propensities of employees to ask for, and of firms to grant, illness benefits are influenced by several aspects that may vary considerably over the considered period of 10 years.
Finally, some aspects require special attention, among them the so-called initial condition problem (Skrondal and Rabe-Hesketh 2013), which may lead to biased estimates. We deem that this problem is mitigated in the application discussed in this article because most of the covariates (including the lagged response) are not significant and the simulation results show that the bias strongly decreases in similar contexts. Nevertheless, this aspect deserves consideration, in particular for the analysis of short panels. Moreover, although we consider an application with binary response variables, the methodological approach illustrated in this article is completely general in terms of the type of response variables, which may also be categorical, either ordinal or unordered.

APPENDIX: LOGIT-TYPE TRANSFORMATION OF THE TRANSITION PROBABILITIES
As shown in Section 5, the model parameters include a transformation of the transition probabilities based on (9) and (10). First of all, it is important to note that the inverse of the first transformation is

λ*_ū = exp(ρ_ū) / [1 + (k_1 − 1) exp(ρ_ū)], ū = 1, . . . , k_1,

which ensures that 0 < λ*_ū < 1/(k_1 − 1) for ρ_ū ∈ ℝ. The inverse of the transformation for π*_v̄ is defined in a similar way on the basis of τ_v̄. Obviously, from the transition probabilities λ*_ū and π*_v̄ we obtain in a simple way the corresponding transition matrices Λ and Π, see expression (3), and then the corresponding vectors of initial probabilities λ and π are obtained by a simple rule given in Zucchini and MacDonald (2009, sec. 4.2). In particular, we have that

λ = (I_{k_1} − Λ' + 1_{k_1} 1'_{k_1})^{−1} 1_{k_1},

where 1_{k_1} is a column vector of k_1 ones and I_{k_1} is an identity matrix of the same dimension; in a similar way, we obtain π from Π.
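Assuming transition matrices with constant off-diagonal elements in each row, as in expression (3), the mapping from the ρ_ū parameters to Λ and the stationary-distribution rule above can be sketched as follows (our own naming):

```python
import numpy as np

def transition_matrix(rho, k):
    """Row ubar has off-diagonal elements lam_star[ubar] and diagonal
    1 - (k - 1) * lam_star[ubar], with lam_star from the inverse
    logit-type transformation of rho."""
    lam_star = np.exp(rho) / (1 + (k - 1) * np.exp(rho))
    Lam = np.repeat(lam_star[:, None], k, axis=1)
    np.fill_diagonal(Lam, 1 - (k - 1) * lam_star)
    return Lam

def stationary(Lam):
    """Stationary distribution of a transition matrix (Zucchini and
    MacDonald 2009, sec. 4.2): solve lambda' (I - Lam + 1 1') = 1'."""
    k = Lam.shape[0]
    A = np.eye(k) - Lam + np.ones((k, k))
    return np.linalg.solve(A.T, np.ones(k))
```

With ρ_ū = 0 and k_1 = 2, for instance, every λ*_ū equals 1/2 and the stationary distribution is uniform.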

SUPPLEMENTARY MATERIALS
The supplementary material includes: (i) Tables S-1 and S-2, discussed in Sections 4.2 and 5.1, about the complexity of the algorithms; (ii) Tables S-3 to S-8, related to the simulation study of Section 6; and (iii) Figures S-1 and S-2 of simulated trajectories from the estimated latent processes mentioned in Section 7.

[Received June 2013. Revised June 2014.]