Bayesian Heterogeneous Hidden Markov Models with an Unknown Number of States

Abstract Hidden Markov models (HMMs) are valuable tools for analyzing longitudinal data due to their capability to describe dynamic heterogeneity. Conventional HMMs typically assume that the number of hidden states (i.e., the order of HMMs) is known or predetermined through criterion-based methods. However, prior knowledge about the order is often unavailable, and a pairwise comparison using criterion-based methods becomes increasingly tedious and computationally demanding when the model space enlarges. A few studies have considered simultaneously performing order selection and parameter estimation under the frequentist framework. Still, they focused only on homogeneous HMMs and thus cannot accommodate situations where potential covariates affect the between-state transition. This study proposes a Bayesian double-penalized (BDP) procedure to conduct a simultaneous order selection and parameter estimation for heterogeneous HMMs. We develop a novel Markov chain Monte Carlo algorithm coupled with an efficient adjust-bound reversible jump scheme to address the challenges in updating the order. Simulation studies show that the proposed BDP procedure considerably outperforms the commonly used criterion-based methods. An application to the Alzheimer’s Disease Neuroimaging Initiative study further confirms the utility of the proposed method. Supplementary materials for this article are available online.


Introduction
Hidden Markov Models (HMMs) have broad applications in medical, behavioral, social, and psychological sciences, wherein heterogeneous longitudinal data are frequently collected and analyzed. HMMs consist of two parts: a transition model to characterize the dynamic transition process between hidden states and a conditional regression (emission) model to examine state-specific covariate effects on the response of interest.
Conventional HMMs typically assume that the number of hidden states (i.e., the order of the HMM) is known or predetermined through criterion-based methods, such as Akaike's information criterion (AIC, Akaike 1974) and the Bayesian information criterion (BIC, Schwarz 1978). However, despite their successful applications in many substantive studies (e.g., Celeux and Durand 2008; Ip et al. 2013; Song et al. 2017), these criterion-based methods conduct pairwise comparisons among candidate models, which become increasingly tedious and computationally intensive when the model space is large. Moreover, these procedures perform estimation in two stages, choosing the order in the first stage and estimating the parameters of the selected model in the second, and thus may not be as effective as single-stage approaches.
Penalization methods are valuable alternatives to their criterion-based counterparts in estimating HMMs with unknown order. Two types of overfitting exist in performing estimation for HMMs. The first type arises when some hidden states are almost empty, leading to near-zero mixing probabilities. The second type appears when two or more states have similar emission densities, resulting in nearly identical parameter values. Chen and Khalili (2008) pointed out the necessity of preventing the second type of overfitting induced by similar-density components in finite mixture models and suggested a double penalization procedure to avoid the two types of overfitting simultaneously. Ye et al. (2019) extended their method to a finite mixture of varying coefficient models. Manole and Khalili (2021) developed the Group-Sort-Fuse (GSF) procedure for order selection and parameter estimation in multidimensional finite mixture models. For HMMs, Mackay (2002) proposed a single penalization on small state proportions and obtained a consistent order estimate. Hung et al. (2013) introduced the double penalized method to nonregression Gaussian HMMs. Zhou et al. (2020) considered continuous-time HMMs and proposed a modified penalized maximum likelihood estimation approach. Lin and Song (2022) extended the GSF procedure and adapted the double penalization idea to regression-based HMMs. Apart from the frequentist penalization methods, Liu and Song (2020) developed a Bayesian approach by regarding the order of HMMs as a random variable and updating it with the other parameters using the reversible jump Markov chain Monte Carlo (MCMC) algorithm (Green 1995). Nevertheless, the available methods focus mainly on finite mixture models or homogeneous HMMs. Moreover, all existing double penalized procedures were developed under the frequentist framework. Unlike its frequentist counterpart, the Bayesian approach typically converts the penalization problem into introducing appropriate priors for the relevant parameters and updates the penalty through the posterior, making the penalization data-driven and easy to implement. Unfortunately, Bayesian double penalization procedures have never been considered in the literature.
This study aims to fill the gap and proposes a novel Bayesian double penalized (BDP) procedure for simultaneous order selection and parameter estimation in heterogeneous HMMs. The procedure includes two penalties. The first is a lower bound imposed on the mixing proportions to prevent states with near-zero initial probabilities. The second is a least absolute shrinkage and selection operator (lasso)-type penalty on the distance between regression coefficients to avoid states with nearly identical parameters. We develop a hybrid MCMC algorithm that integrates data augmentation, the Gibbs sampler, forward filtering backward sampling (FFBS, Baum et al. 1970), and the Metropolis-Hastings (MH) algorithm. In particular, we offer an efficient adjust-bound reversible jump (ABRJ) sampling scheme to address the challenges of updating the order in implementing the MCMC algorithm. Simulation studies in Section 5 demonstrate that the proposed BDP procedure considerably outperforms the commonly used AIC and BIC in order selection accuracy and two existing one-stage methods in state allocation accuracy. In addition, by setting a sizable upper bound on the order (e.g., 200), the proposed method allows sufficient flexibility in estimating the order of HMMs and thus can accommodate cases where many states exist. Last but not least, the BDP procedure accomplishes order selection and parameter estimation in a single stage. By contrast, criterion-based approaches perform pairwise comparison and parameter estimation on a two-stage basis, and the related computational burden dramatically increases when the candidate model space enlarges. Therefore, the proposed BDP procedure is also superior to the criterion-based methods in terms of computational efficiency.
The rest of this article is organized as follows. Section 2 describes the model and related identifiability issues. Sections 3 and 4 present the BDP procedure and the specific sampling schemes. Section 5 evaluates the empirical performance of the proposed method through simulation studies, and Section 6 reports an application to the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Section 7 concludes the article. Technical details are provided in the supplementary material.

Model
Let Y = (y_1, ..., y_n), where y_i = (y_{i1}, ..., y_{iT})' and y_{it} is the response of subject i at time t; X = (X_1, ..., X_n), where X_i = (x_{i1}, ..., x_{iT})' and x_{it} is the covariate vector of subject i at time t; D = (D_1, ..., D_n), where D_i = (d_{i1}, ..., d_{iT})' and d_{it} is another covariate vector of subject i at time t, whose elements can be distinct from or overlap with those of x_{it}; and Z = (Z_1, ..., Z_n), where Z_i = (Z_{i1}, ..., Z_{iT})' and Z_{it} is the hidden state of subject i at occasion t, which follows a first-order Markov chain and takes values in {1, ..., K}. Given the hidden state Z_{it}, y_{it} is assumed to be independent for all i and t and is formulated through the conditional regression model

y_{it} = x_{it}' β_{Z_{it}} + δ_{it},   (1)

where x_{it} = (1, x_{it1}, ..., x_{it,p−1})' is the p × 1 vector of covariates, β_s is the p × 1 vector of state-specific regression coefficients, and δ_{it} is the residual term, independent of x_{it}, with δ_{it} | Z_{it} = s ~ N(0, ψ_s). Given that hidden states typically have a natural ranking and real meanings in most practical situations, we assume that the hidden states {1, ..., K} are ordered. The hidden transition process is then formulated by Z_{i1} ~ multinomial(π_1, ..., π_K), with 0 ≤ π_s ≤ 1 and Σ_{s=1}^K π_s = 1, together with a continuation-ratio logit model (Agresti 2003): for t = 2, ..., T, s = 1, ..., K − 1, and u = 1, ..., K,

ϑ_{itus} = P(Z_{it} = s | Z_{it} ≥ s, Z_{i,t−1} = u, d_{it}).   (2)

One can then easily check that log{P(Z_{it} = s | Z_{i,t−1} = u, d_{it}) / P(Z_{it} > s | Z_{i,t−1} = u, d_{it})} = logit(ϑ_{itus}), which is the log conditional odds of transitioning to the sth state instead of a higher state given Z_{i,t−1} = u. Therefore, the transition model (2) can be equivalently rewritten as logit(ϑ_{itus}) = ζ_{us} + α' d_{it}.
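To make the transition mechanism concrete, the continuation-ratio logits can be converted into a full probability vector over the next state. The following sketch (NumPy; `zeta_u`, `alpha`, and `d_it` are generic stand-ins for ζ_us, α, and d_it, not code from the paper) shows the conversion:

```python
import numpy as np

def crl_transition_probs(zeta_u, alpha, d_it):
    """Transition distribution out of state u under a continuation-ratio
    logit model: logit(theta_s) = zeta_us + alpha' d_it for s = 1..K-1,
    where theta_s is the probability of landing in state s given that the
    next state is at least s."""
    eta = zeta_u + alpha @ d_it                 # linear predictors, length K-1
    theta = 1.0 / (1.0 + np.exp(-eta))          # conditional continuation probs
    probs = np.empty(len(zeta_u) + 1)
    survive = 1.0                               # P(next state >= s) so far
    for s, th in enumerate(theta):
        probs[s] = th * survive
        survive *= 1.0 - th
    probs[-1] = survive                         # remaining mass goes to state K
    return probs
```

By construction the K entries are nonnegative and sum to one, so any values of ζ_us and α yield a valid transition distribution.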
Equations (1) and (2) define a heterogeneous HMM, under which some time-variant or baseline covariates affect the between-state transition. Let θ be the vector containing all the regression and variance parameters in the proposed model. Then, the complete-data log-likelihood function of the proposed model is given by

log L_c(θ, K; Y, X, D, Z) = Σ_{i=1}^n { log π_{Z_{i1}} + Σ_{t=2}^T log P(Z_{it} | Z_{i,t−1}, d_{it}) + Σ_{t=1}^T log p(y_{it} | x_{it}, Z_{it}) },

where p(y_{it} | x_{it}, Z_{it} = s) denotes the N(x_{it}' β_s, ψ_s) density. The proposed model is unidentifiable due to the label switching problem. Label switching arises because a random permutation of the state labels does not change the likelihood function, which leads to a multimodal posterior under a symmetric prior distribution. We address the problem by introducing the cluster ordering procedure proposed by Zhou et al. (2020) to sort the multidimensional parameters in the conditional regression model (1), which satisfies the atom property mentioned in Manole and Khalili (2021). We define the cluster ordering procedure in the context of the proposed model as follows.
The above cluster ordering procedure incorporates the ideas of Zhou et al. (2020). The difference is that they label the parameter with the smallest L2 norm as state one, whereas we take the largest as the first. This procedure guarantees that the state labels are uniquely determined and induces a set of differences η_1 = β_(1) and η_k = β_(k) − β_(k−1) for k = 2, ..., K. Recall that the hidden states are assumed ordered; we can then rewrite the conditional regression model (1) as

y_{it} = x_{it}' Σ_{k=1}^{Z_{it}} η_k + δ_{it}.

Hence, by constructing the bijective mapping between the β_k and the η_k, the complete-data log-likelihood function can be reformulated in terms of (η_1, ..., η_K), which facilitates the double penalization in the next section.
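The ordering-and-differencing step can be sketched directly; the helper below (an illustrative sketch with hypothetical names, not the paper's implementation) relabels states by descending L2 norm and forms the successive differences η_k:

```python
import numpy as np

def cluster_order(betas):
    """Relabel states so that the coefficient vector with the largest L2 norm
    becomes state 1 (the paper reverses Zhou et al.'s smallest-first rule),
    then form eta_1 = beta_(1) and eta_k = beta_(k) - beta_(k-1)."""
    betas = np.asarray(betas, dtype=float)
    order = np.argsort(-np.linalg.norm(betas, axis=1), kind="stable")
    sorted_betas = betas[order]
    # stack beta_(1) on top of the successive differences
    etas = np.vstack([sorted_betas[:1], np.diff(sorted_betas, axis=0)])
    return sorted_betas, etas
```

The mapping is bijective: taking cumulative sums of the rows of `etas` recovers the ordered coefficient vectors, which is what allows the likelihood to be rewritten in terms of the η_k.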

Bayesian Double Penalized (BDP) Procedure
In analyzing HMMs, we must tackle two types of overfitting: nearly empty states and redundant states. To prevent the first type, we impose a lower bound on π_k to ensure the existence of a proper partition and avoid nearly empty states or near-zero mixing proportions. Under the Bayesian framework, the lower-bound penalization is implemented by assigning a symmetric Dirichlet prior distribution to π, denoted as (π_1, ..., π_K) ~ Dir(c_K, ..., c_K), where the concentration parameter c_K = c·n/K and c > 0 is a preassigned constant. With such a prior specification, we have Proposition 3.1, whose derivation is provided in the supplementary material. This proposition ensures that the conditional mean of each element of π is bounded below by (c/(c+1))·(1/K), thereby preventing near-zero probabilities and nearly empty states. The constant c can be determined according to the degree of penalty required for specific problems. The lower bound of E(π_s | Z) approaches 1/K as c increases and tapers off as c approaches zero.
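A quick numerical check of the lower-bound property is straightforward. The function below is an illustrative sketch, assuming the standard Dirichlet-multinomial update applied to the initial-state counts; it computes the conditional posterior mean of π and exhibits the bound (c/(c+1))·(1/K):

```python
import numpy as np

def conditional_mean_pi(state_counts, c):
    """Conditional posterior mean of the initial-state proportions under the
    symmetric Dirichlet(c_K, ..., c_K) prior with c_K = c * n / K. Each
    coordinate is bounded below by (c / (c + 1)) * (1 / K), so even an empty
    state keeps strictly positive mass."""
    counts = np.asarray(state_counts, dtype=float)
    n, K = counts.sum(), len(counts)
    c_K = c * n / K
    # standard Dirichlet-multinomial update: (c_K + n_s) / (K * c_K + n)
    return (c_K + counts) / (K * c_K + n)
```

With c = 0.5 and K = 3, for instance, even a state with zero observed members retains conditional mean mass of at least 0.5/(1.5 × 3) = 1/9.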
To address the second type of overfitting, we impose penalization on the norm of the discrepancy between different coefficient vectors. Manole and Khalili (2021) pointed out that the ordering procedure considerably outperforms the naive approach of penalizing all pairwise differences between the β_k when many hidden states exist. Therefore, instead of naively penalizing all K(K−1)/2 pairwise differences, we exploit the fact that ||η_k||_2 ≈ 0 implies that states k and k − 1 are redundant, and we penalize ||η_k||_2 to prevent redundant states. Park and Casella (2008) introduced the Bayesian lasso to achieve shrinkage on regression coefficients by assigning them a conditional Laplace prior. Based on this idea, we modify the Bayesian lasso by placing a conditional Laplace prior (9) on ||η_k||_2 to achieve shrinkage on the entire vector η_k. The proposed model can then be formulated through the hierarchical representation (10): for s = 1, ..., K, where 0 is the vector of zero elements, I_p is the p-dimensional identity matrix, and a_{ψs0}, b_{ψs0}, a_{γk0}, and b_{γk0} are hyperparameters whose values are prespecified.
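Because the displayed hierarchy (10) is not reproduced here, the sketch below illustrates only the general scale-mixture mechanism behind such priors, following the standard Bayesian group-lasso construction (a normal scale mixture whose marginal puts a Laplace-type penalty on ||η_k||_2); the function name and the exact mixing distribution are assumptions for illustration, not the paper's specification:

```python
import numpy as np

def draw_eta_prior(p, gamma_k, size, rng):
    """Draws from a standard scale-mixture representation of a group-lasso
    prior on a p-vector eta_k (assumed form, in the spirit of the Bayesian
    group lasso): tau_k^2 ~ Gamma((p + 1) / 2, rate = gamma_k^2 / 2) and
    eta_k | tau_k^2 ~ N(0, tau_k^2 * I_p). Marginally, this shrinks the
    whole vector toward 0 through its L2 norm."""
    tau2 = rng.gamma(shape=(p + 1) / 2, scale=2.0 / gamma_k**2, size=size)
    return rng.standard_normal((size, p)) * np.sqrt(tau2)[:, None]
```

Larger values of the state-specific tuning parameter gamma_k concentrate the mixing distribution near zero and hence shrink ||η_k||_2 more aggressively, which is the behavior the MAGlasso penalty exploits.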
Proposition 3.2. Under the hierarchical model (10), the conditional prior distribution of η_k has the form of (9).
The derivation of Proposition 3.2 is provided in Appendix 1 of the supplementary material. Notably, (9) modifies the conditional Laplace prior proposed by Park and Casella (2008) and plays a role similar to the Bayesian lasso penalty in shrinking η_k. Moreover, it introduces a state-specific tuning parameter γ_k for each ||η_k||_2 and penalizes the L2 norm of the entire vector η_k rather than its elements. Therefore, the penalty in (9) is adaptive and group-wise, and we refer to it as a modified adaptive group lasso (MAGlasso) penalty. The MAGlasso procedure updates the tuning parameters by exploiting the data, thereby automatically imposing large penalties on unimportant coefficients. This target can be naturally achieved by introducing dispersed priors with small hyperparameters a_{γk0} and b_{γk0}. The degree of dispersion of the gamma priors determines the magnitudes of the penalties imposed on unimportant components. Typically, setting a_{γk0} to a positive integer (e.g., 1) and b_{γk0} to a small value (e.g., 0.1 or 0.01) induces a dispersed gamma prior. With this prior specification, we can derive the posterior distributions of the tuning parameters, which take the forms given in (11) and (12). If ||η_s||_2 is sizable, τ²_s tends to be large based on (11), and the corresponding γ_s is then dominated by τ²_s based on (12). On the contrary, if ||η_s||_2 is negligible, τ²_s tends to be small, and the related γ_s is then dominated by the dispersed prior.
Considering that the Bayesian lasso does not shrink coefficients exactly to zero, we need a criterion to quantify the closeness of ||η_s||_2 to zero. Based on the specification of (9), we can show that P(η_s | Y, Z, θ) ~ N(η*_s, Σ*_s), where η*_s and Σ*_s are provided in Appendix 2 of the supplementary material. The squared Mahalanobis distance therefore determines a hyper-ellipsoidal density contour in R^p centered at η*_s. In this study, we adopt the 95% highest posterior credible region (HPCR) criterion (Harper and Hooker 1976), or equivalently the smallest region covering 95% of the posterior probability mass: η_s is regarded as redundant if its 95% HPCR covers 0. Alternatively, we can transform the decision rule into a direct comparison of the squared Mahalanobis distance between 0 and η*_s with the critical value χ²_{p,0.05}, the upper 5% quantile of the chi-squared distribution with p degrees of freedom.
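The HPCR decision rule reduces to one line of linear algebra. A minimal sketch (SciPy; `redundant_state` is an illustrative name) compares the squared Mahalanobis distance from 0 to η*_s with the chi-squared critical value:

```python
import numpy as np
from scipy.stats import chi2

def redundant_state(eta_star, cov_star, level=0.95):
    """HPCR redundancy check: eta_s is declared redundant (its state merges
    with the previous one) when the squared Mahalanobis distance between 0
    and the posterior mean eta_s* falls below the chi-square critical value
    with p degrees of freedom."""
    d2 = float(eta_star @ np.linalg.solve(cov_star, eta_star))
    return d2 < chi2.ppf(level, df=len(eta_star))
```

Geometrically, the test asks whether the origin lies inside the 95% posterior hyper-ellipse centered at η*_s.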

Tuning Parameters and Other Prior Specification
This section introduces how to tune the parameters in the prior distributions to facilitate double penalization and specifies proper priors for the parameters not discussed in Section 3.
As discussed in Section 3, we assign a Dirichlet prior to π = (π_1, ..., π_K) to prevent empty states, where c is a hyperparameter determined according to the degree of penalty required for specific problems. Typically, a value of c around 0.5 effectively prevents near-zero π_s. Alternatively, one can regard c as another tuning parameter and update it in the MCMC algorithm. However, our numerical results show that this data-driven method increases computational complexity while performing similarly to the approach that fixes c in the interval (0, 1). Moreover, setting a moderate value in (0, 1) avoids an extreme (too small or too large) penalty on the mixing probabilities; based on our extensive simulation study, a value around 0.5 performs satisfactorily. Furthermore, we introduce the conditional Laplace prior (9) on ||η_k||_2 to prevent redundant states with almost identical parameter values. The proposed model can then be formulated through the hierarchical representation (10). For the tuning parameter γ_k involved in (9) and (10), we assign a dispersed gamma prior, where a_{γk0} and b_{γk0} are hyperparameters whose values are prespecified to achieve high dispersion. The degree of dispersion determines the magnitude of the penalty on unimportant regression parameters. In this study, we follow common practice (Guo et al. 2012; Kang et al. 2019) and set a_{γk0} = 1 and b_{γk0} = 0.01.
For the other parameters involved in the transition model (2), we assign conjugate priors, where ζ_{us0}, σ²_{us0}, α_{k0}, and Σ_{αk0} are hyperparameters with prespecified values. The common practice is to set ζ_{us0} and the elements of α_{k0} to zero and assign σ²_{us0} and the diagonal elements of Σ_{αk0} large values to induce vague priors when preliminary information about ζ_{us} and α_k is unavailable.

MCMC Algorithm
Unlike conventional HMMs that prespecify K, this study regards K as another unknown parameter and updates it together with the other model parameters in θ. The Bayesian estimate of (θ, K) can be obtained through the mean of the posterior samples drawn from P(θ, K | Y, X, D). However, P(θ, K | Y, X, D) involves the unknown hidden states, making direct sampling intractable. Using the data augmentation technique, we instead work with P(Z, θ, K | Y, X, D). Because this joint posterior distribution is still complex, the Gibbs sampler is employed to iteratively update each component through sampling from its full conditional distribution as follows: (a) update the hidden states by sampling Z from P(Z | Y, X, D, θ, K); (b) update the model parameters by sampling θ from P(θ | Y, X, D, Z, K); and (c) update the order by sampling K from P(K | Y, X, D, Z, θ). Owing to the transitioning features of the hidden states and the nonlinearity of the transition model (2), steps (a) and (b) require MCMC techniques, such as the FFBS and MH algorithms. The full conditional distributions involved in steps (a) and (b) are derived in Appendix 2 of the supplementary material. Step (c) is the so-called ABRJ step, which allows K to be updated at each MCMC iteration as follows.
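Step (a) can be sketched for a single subject. The implementation below uses a fixed K × K transition matrix for brevity, whereas the paper's transition probabilities additionally depend on the covariates d_it; the function name is illustrative:

```python
import numpy as np

def ffbs(log_emission, trans, pi, rng):
    """Forward filtering backward sampling for one subject's state path.
    Forward pass: filtered probabilities P(Z_t = k | y_1, ..., y_t).
    Backward pass: draw Z_T from the last filter, then Z_t | Z_{t+1}
    proportional to filt[t, k] * trans[k, Z_{t+1}] for t = T-1, ..., 1."""
    T, K = log_emission.shape
    # subtract the row max before exponentiating for numerical stability
    emis = np.exp(log_emission - log_emission.max(axis=1, keepdims=True))
    filt = np.empty((T, K))
    f = pi * emis[0]
    filt[0] = f / f.sum()
    for t in range(1, T):
        f = (filt[t - 1] @ trans) * emis[t]
        filt[t] = f / f.sum()
    z = np.empty(T, dtype=int)
    z[-1] = rng.choice(K, p=filt[-1])
    for t in range(T - 2, -1, -1):
        b = filt[t] * trans[:, z[t + 1]]
        z[t] = rng.choice(K, p=b / b.sum())
    return z
```

Sampling the whole path jointly in this way, rather than one state at a time, is what gives the Gibbs sampler its good mixing over Z.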
Let (K_min, K_max) be the lower and upper bounds of K, and let (K_min^(0), K_max^(0)) and (K_min^(j), K_max^(j)) be their values at the initial stage and the jth iteration of the MCMC algorithm, respectively. Typically, we set K_max^(0) to a relatively large positive integer (e.g., 100 or 200) to allow sufficient flexibility in updating K. At the (j + 1)th iteration, K^(j+1) can remain unchanged, increase by 1, or decrease by 1. To update K^(j), we locate the state s* with the smallest squared Mahalanobis distance d²_{s*}. If d²_{s*} < χ²_{p,0.05}, state s* is redundant, and K^(j) jumps downward by 1; if d²_{s*} ≥ χ²_{p,0.05}, we regard every component as necessary and jump K^(j) upward by 1, unless K^(j) has reached K_max^(j) − 1, in which case K^(j) remains unchanged, that is, K^(j+1) = K^(j). Figure 1 shows the strategy of updating K in the ABRJ step. A pseudocode for implementing the MCMC algorithm is given below.
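The jump logic of the ABRJ step can be summarized schematically. The sketch below shows only the proposal direction; the reversible-jump acceptance probability and the paper's exact bound-adjustment rule are omitted, and the function name is illustrative:

```python
def abrj_propose(K, d2_min, chi2_crit, K_min, K_max):
    """Schematic direction of the adjust-bound reversible-jump move: if the
    smallest squared Mahalanobis distance flags a redundant state, propose
    K - 1; otherwise every state looks necessary and K + 1 is proposed.
    Proposals outside [K_min, K_max] leave K unchanged."""
    proposal = K - 1 if d2_min < chi2_crit else K + 1
    return proposal if K_min <= proposal <= K_max else K
```

Because the initial upper bound is large and downward jumps are triggered whenever a redundant state is detected, the chain typically drifts from an over-specified order down toward the true one.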
It should be noted that the two penalties in the suggested strategy target potential overfitting problems without providing adequate solutions for underfitting concerns. As a result, the initial upper bound should be a relatively large number, and order selection takes effect as the chain jumps downward from above.

Simulation Study
This section includes three simulations to demonstrate the effectiveness of the proposed algorithm in order selection and parameter estimation under various scenarios. Simulation 1 evaluates the performance of the proposed method in the case of K = 2, Simulation 2 assesses estimation performance for HMMs with higher orders, and Simulation 3 focuses on order selection and compares the proposed method with AIC and BIC.

Simulation 1
This simulation considers a 2-state HMM with p = 4 and q = 1. Two sample sizes, (n, T) = (50, 4) and (200, 4), are considered. In each setting, 100 datasets are generated from the proposed model with δ_it ~ N(0, 1). [Algorithm 1, the pseudocode for the MCMC algorithm, appears here in the original layout: update Z^(j) by sampling from P(Z | Y, X, θ^(j), K^(j)) via the FFBS algorithm; update θ^(j) by sampling from P(θ | Y, X, Z, K^(j)) (see Appendix B); compute the posterior mean vector and posterior covariance matrix of each η_s; compare d²_{s*} with χ²_{p,0.05}; and update K accordingly. The displayed true population values of the parameters are omitted.] The hyperparameters of the prior distributions in (13)-(15) are specified as follows (Prior I): a_{ψs0} = 9, b_{ψs0} = 4, a_{γk0} = 1, b_{γk0} = 0.1, c = 0.5, α_{k0} = ζ_{us0} = 0, and σ²_{αk0} = σ²_{us0} = 1. In implementing the MCMC algorithm, we impose the constraint described in Definition 2.1 (i.e., ||β_(1)||_2 > ||β_(2)||_2) at each MCMC iteration to avoid label switching. Moreover, we set K_min^(0) = 2 and K_max^(0) = 200, which provides an extensive range for K. The algorithm's convergence is checked through the trace plots of the parameters. Figure 2(a) presents the trace plots of three MCMC chains of K starting from different initial values in an arbitrarily selected replication. The three MCMC chains of K mix rapidly and converge to the true value K_0 = 2 within a few iterations. Figure S1 of the supplementary material presents the trace plots of three MCMC chains for other randomly selected parameters. Both figures indicate fast convergence of the MCMC algorithm. To be conservative, we collect 10,000 posterior samples, discard the first 3000 iterations as burn-in, and calculate the bias and root mean square error (RMS) between the parameter estimates and their true values based on the remaining 7000 posterior samples corresponding to the selected order. Table 1 presents the estimation results. The bias and RMS are close to zero, and the performance improves as the sample size increases.
To reveal the sensitivity of the Bayesian estimates to the prior input, we disturb the hyperparameters as follows (Prior II): …, and σ²_{αk0} = σ²_{us0} = 100. Table S1 of the supplementary material reports the obtained results. The parameter estimates perform similarly to those in Table 1, indicating that the proposed Bayesian estimation is insensitive to the disturbed prior considered.
Furthermore, we check the sensitivity of the Bayesian estimation to misspecification of the distribution of δ_it by considering two nonnormal cases: (1) δ_it ~ U(−1, 1) and (2) δ_it ~ 0.4N(1, 1) + 0.6N(−1, 1). We simulate 100 datasets from the proposed model with n = 200 and δ_it drawn from case (1) or (2). The hyperparameters and other settings are the same as before. Except for the variance of δ_it and some parameters involved in the transition model, most parameter estimates are robust to the violation of the normality assumption on δ_it. Therefore, the impact of misspecifying the distribution of δ_it falls mainly on the estimation of its variance.

Simulation 2
This simulation examines estimation performance for higher-order models and a model with unordered states. We first consider a 3-state HMM. The covariates x_it are the same as in Simulation 1. For simplicity, we set d_it = x*_it, where x*_it is the subvector of x_it excluding the intercept 1. Two sample sizes, (n, T) = (200, 6) and (400, 6), are considered. The true population values of the parameters are set as β_1 = (3, 3, 3, 3)', β_2 = (0, 1, 2, 2)', β_3 = (−4, 2, 1, 1)', ψ = (0.25, 0.25, 0.25)', π = (0.3, 0.4, 0.3)', … . The prior specification and simulation settings are similar to Simulation 1, except that the hyperparameters for α are set as α_{k0} = 0 and Σ_{αk0} = I_3. Figure 2(b) presents the trace plots of three MCMC chains of K starting from different initial values in an arbitrarily selected replication. Again, the MCMC chains mix well and converge rapidly to the true value K_0 = 3. The trace plots of other parameters (Figure S2 of the supplementary material) suggest that the algorithm converges within 3000 iterations. Therefore, we discard the first 3000 iterations as burn-in and use the remaining 7000 posterior samples to obtain the Bayesian estimates of the parameters. Table S3 of the supplementary material shows the estimation results based on 100 replications under (n, T) = (200, 6), indicating that the proposed method performs satisfactorily in bias and RMS. The results under (n, T) = (400, 6) are further improved and not reported.
Next, we further increase the order to K_0 = 5. We consider a 5-state HMM with covariates x_it = (1, x_it1, x_it2)'. The other model setup is the same as above. The true population values of the unknown parameters are set as β_1 = (5, 5, 5)', … . The prior and simulation settings are similar to those given above. Figure 2(c) presents the trace plots of three MCMC chains of K starting from different initial values in an arbitrarily selected replication, showing that the iterative K quickly converges to its true value K_0 = 5. The trace plots of other parameters (Figure S3 of the supplementary material) suggest that the MCMC chains mix well within 4000 iterations. Thus, we discard the first 4000 iterations as burn-in and use the remaining 6000 posterior samples to obtain the Bayesian estimates of the parameters. Table S4 of the supplementary material presents the estimation results under (n, T) = (200, 6), indicating that the proposed method performs satisfactorily in order selection and parameter estimation when K_0 increases to 5. The results under the larger size (n, T) = (400, 6) are better and not reported.
To accommodate the scenario in which a hidden state may represent a political belief or another unordered attribute, we adapt the transition model (2) to a multinomial logit model and conduct the analysis. Table S5 of the supplementary material presents the estimation results. Although the bias and RMS are slightly larger than those in Table 1 due to the more complicated transition model, they still show the satisfactory performance of the proposed method. Moreover, Figure S5 indicates that three MCMC chains of K starting from different initial values mix rapidly and converge to the true value K_0 = 2.

Simulation 3
This simulation assesses the performance of order selection. Considering that no existing methods can simultaneously estimate the order and model parameters of heterogeneous HMMs, we compare the proposed BDP procedure with the criterion-based approaches, AIC and BIC, in order selection accuracy.
Table 2 presents the proportions of correct order selection calculated based on 100 replications with K_0 ∈ {3, 4, 5} and n ∈ {100, 200, 400}. The results show that the proposed BDP procedure consistently outperforms AIC and BIC in all the scenarios considered. In general, the performance of the three methods improves when the sample size increases but declines when K_0 increases. In particular, AIC and BIC perform poorly when K_0 = 5 regardless of the sample size; their correct selection proportions are below or around 0.5. By contrast, our proposed method performs much better, and its correct selection proportion attains 0.81 when n = 400. This comparison confirms that our procedure achieves higher accuracy than the conventional criterion-based approaches. In addition, unlike the two-stage methods that perform order selection and parameter estimation separately, our proposed method accomplishes the two tasks in a single stage. Moreover, our procedure guarantees that sampling occurs only for states that are necessary and retained. Hence, its advantages in computational efficiency over existing methods become increasingly pronounced as the candidate model space enlarges. Per an anonymous referee's suggestion, we also compare the BDP procedure with two existing one-stage methods proposed by Lin and Song (2022) and Liu and Song (2020). Since these two methods assume that the between-state transition is homogeneous, the heterogeneity is ignored when applying them to the 100 datasets generated in Simulation 1. In addition, we generate 100 datasets from a homogeneous model by setting α = 0 in the transition model (2). Table S6 of the supplementary material presents the comparison results, from which we draw two findings. First, the three approaches perform comparably in order selection; our method performs better than Lin and Song's, especially in the heterogeneous case, but slightly worse than Liu and Song's. Second, our method significantly outperforms the other two in estimating the hidden states, especially for heterogeneous data. This result is anticipated, given that the other two approaches disregard heterogeneity. Finally, it is worth mentioning that despite the excellent performance of Liu and Song's method in order selection, it performs the poorest in state allocation accuracy, likely due to its constant switching between states and unstable allocation of individuals. Therefore, the comparison results demonstrate the advantages of the proposed method in handling both homogeneous and heterogeneous scenarios.

Real Data Analysis
In this section, we applied the proposed method to a dataset extracted from the ADNI study to demonstrate its practical utility. ADNI is a longitudinal multicenter study that began in 2004, collecting participants' imaging and clinical assessments. More information is available at the official website: www.adni-info.org.
We focused on 616 subjects from the ADNI study with four follow-up visits: the baseline, six months, 12 months, and 24 months. The Alzheimer's Disease Assessment Scale-Cognitive 13 (ADAS13), which reflects cognitive impairment in AD assessment, is treated as the response y_it. Generally, a high ADAS13 score indicates low cognitive ability. In addition, several clinical and genetic variables were considered as covariates. One is a time-variant continuous variable, x_it1, the logarithm of the ratio of hippocampal volume to whole brain volume (HIP). Other covariates include APOE-4, coded as 0, 1, or 2 to denote the number of APOE-4 alleles and represented by x_it2 (x_it2 = 1 if carrying one allele and 0 otherwise) and x_it3 (x_it3 = 1 if carrying two alleles and 0 otherwise); patients' age at baseline, x_it4; and patients' gender, x_it5 (x_it5 = 1 if female). The main goal of this study is to simultaneously identify the number of hidden states and the state-specific relationship between ADAS13 and its important risk factors.
The prior specification and other settings are similar to those in the simulation study. We imposed the constraint described in Definition 2.1 at each MCMC iteration to avoid label switching. The trace plots of K shown in Figure 2(d) indicate that the MCMC chains of K from different initial values quickly converge to K = 4, suggesting a 4-state HMM for the data. Figure S4 of the supplementary material presents the trace plots of other parameters involved in the selected model. The MCMC chains mixed well within 5000 iterations. Thus, we discarded 5000 burn-in iterations and used the remaining 5000 posterior samples to obtain the parameter estimates. Table 3 presents the parameter estimates of the selected 4-state HMM. Based on the results, we have the following observations. First, the state-specific intercept β_1s exhibits an ascending trend: patients have the lowest mean ADAS13 score in state 1 and the highest in state 4. According to the existing literature (Kantarci et al. 2013), states 1 to 4 can be interpreted as cognitively normal (CN), early mild cognitive impairment (MCI), late MCI, and AD, respectively.
Second, HIP (β_2s) exerts an adverse effect on ADAS13, implying that a sizable hippocampal volume is associated with a low ADAS13 score and thus high cognitive ability. Moreover, the magnitude of the HIP effect on ADAS13 increases from CN to AD, implying that hippocampal atrophy continuously impairs patients' cognitive ability during AD progression. Published medical reports (e.g., Dickerson and Wolk 2013) have also revealed that the loss of hippocampal volume significantly affects AD.
Third, the effects of APOE-4 (β_3s and β_4s) on ADAS13 are positive, suggesting that carrying APOE-4 increases AD risk, and this impact becomes increasingly pronounced as the disease progresses. This finding is in line with the medical report (Risacher et al. 2015) that APOE-4 is a crucial biomarker of AD. Furthermore, the magnitude of β_4s is larger than that of β_3s for s = 1, ..., 4, implying that carrying two alleles generally impairs cognitive function more severely than carrying only one. Besides, patients' age and gender do not substantially affect ADAS13 when controlling for hippocampal volume and APOE-4. An exception lies in β_64 = 0.399 (0.101), which suggests that females suffer more severe cognitive decline than males in the late AD progression period. This result again agrees with the existing literature (e.g., Via et al. 2010; Kang et al. 2019). Lastly, the transition pattern described by ζ exhibits a banding structure; that is, patients are likely to transition between adjacent states. Moreover, α_2 and α_3 are significant and negative, implying that the transition pattern between hidden states is heterogeneous. APOE-4 allele carriers are more likely than noncarriers to transition to a worse state rather than remain in the current one, and carrying two alleles induces a higher risk of transitioning to a worse state than carrying one. This result is consistent with the existing finding (Eunjee et al. 2015) that APOE-4 alleles increase the risk of developing AD. However, other covariates, such as age and HIP, do not significantly affect the between-state transition given APOE-4, implying that, conditional on APOE-4, the direct effects of age and hippocampal volume on the transition probabilities are weak.

Discussion
In this study, we have proposed a double-penalized method to perform order selection and parameter estimation for heterogeneous HMMs under the Bayesian framework. In addition, we have developed a novel MCMC algorithm with an ABRJ sampling scheme to facilitate joint updating of the order and model parameters. Multiple simulation studies and an application to the ADNI dataset demonstrate the superiority of the proposed method over existing ones and its utility in realistic settings. Furthermore, the proposed model can cope with general situations in which specific covariates simultaneously influence the emission and transition processes.
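The details of the ABRJ scheme are given in the algorithm itself, but any such order-updating move must repeatedly evaluate the HMM likelihood under candidate orders K. The sketch below shows the standard forward-algorithm log-likelihood for a homogeneous Gaussian HMM, which is the basic building block such comparisons rest on; it is deliberately simplified (no covariates in the emission or transition model, unlike the heterogeneous HMMs studied here), and all names and shapes are illustrative assumptions.

```python
import numpy as np

def hmm_loglik(y, pi0, P, mu, sigma):
    """Forward-algorithm log-likelihood of a K-state Gaussian HMM.

    y     : (T,) observed sequence
    pi0   : (K,) initial state probabilities
    P     : (K, K) transition matrix, rows summing to 1
    mu    : (K,) state-specific means
    sigma : (K,) state-specific standard deviations
    """
    # log emission densities, shape (T, K)
    logB = (-0.5 * np.log(2 * np.pi * sigma**2)
            - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
    log_alpha = np.log(pi0) + logB[0]
    for t in range(1, len(y)):
        m = log_alpha.max()                      # log-sum-exp scaling
        log_alpha = np.log(np.exp(log_alpha - m) @ P) + m + logB[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```

An order-updating move would evaluate this quantity (plus the penalties) under the current K and a proposed K ± 1 and accept or reject the jump via a Metropolis-Hastings ratio; the log-sum-exp scaling keeps the recursion numerically stable for long sequences.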
The present work can be extended in several directions. First, we use a single indicator to represent each response or predictor; for example, we adopt ADAS13 to reflect cognitive ability in the ADNI data analysis. In practice, however, multiple tests can be used to examine cognitive impairment, and their scores can be integrated into a univariate latent construct through factor analysis. Such an extension can accommodate latent responses or covariates, reduce model dimensionality, and improve interpretability. Second, our conditional regression model only accommodates a continuous response variable. Given that complex data types are frequently encountered in medical, social, and psychological sciences, generalizing our method to incorporate multivariate, functional, or imaging variables could considerably enhance model capability. Third, this study mainly focuses on order selection. However, variable selection is also of interest in the presence of high-dimensional covariates; thus, additional penalties could be considered for simultaneous order and variable selection. Finally, the two penalties in the proposed method aim to tackle possible overfitting without an effective strategy for addressing underfitting; we therefore design the jumping scheme in a top-down direction. Given this design, developing a penalization method that prevents overfitting and underfitting simultaneously is of great interest. These possible extensions may, however, raise new theoretical and computational challenges.

Figure 1. Strategy of updating K in the ABRJ step (c).
Algorithm 1. MCMC algorithm for the estimation of heterogeneous HMMs. Data: Y, X, D, J, K_min^(0), K_max^(0).

Figure 2. Trace plots of three MCMC chains of K in simulations and the ADNI study.

Table 2. Proportions of correct order selections among 100 replications.

Table 3. Parameter estimation results for the ADNI study.