Bayesian approaches to variable selection in mixture models with application to disease clustering

ABSTRACT In biomedical research, cluster analysis is often performed to identify patient subgroups based on patients' characteristics or traits. In the model-based clustering for identifying patient subgroups, mixture models have played a fundamental role in modeling. While there is an increasing interest in using mixture modeling for identifying patient subgroups, little work has been done in selecting the predictors that are associated with the class assignment. In this study, we develop and compare two approaches to perform variable selection in the context of a mixture model to identify important predictors that are associated with the class assignment. These two approaches are the one-step approach and the stepwise approach. The former refers to an approach in which clustering and variable selection are performed simultaneously in one overall model, whereas the latter refers to an approach in which clustering and variable selection are performed in two sequential steps. We considered both shrinkage prior and spike-and-slab prior to select the importance of variables. Markov chain Monte Carlo algorithms are developed to estimate the posterior distribution of the model parameters. Practical applications and simulation studies are carried out to evaluate the clustering and variable selection performance of the proposed models.


Introduction
In biomedical research, cluster analysis based on patients' characteristics or traits is often performed to identify disease subtypes. Disaggregating disease heterogeneity can help better understand the underlying biological mechanisms, which is a key building block for better disease management strategies, novel treatments and precision medicine.
Of many existing statistical methods, the finite mixture model (FMM) is a popular tool for modeling population heterogeneity. This model refers to modeling with categorical latent variables that represent subgroups of the population that is unobserved and need to infer from the data [40]. When clustering is the main interest, FMM is a powerful tool and is usually referred to as model-based clustering. Application of model-based clustering can be found in many different areas of research, for example, in the analysis of gene expression data [60], brain magnetic resonance image data [61], adolescent alcohol use [11] and identifying asthma phenotypes [52]. A comprehensive review of this topic can be found in the discussion [21] and the recent review of model-based clustering [41]. An R package FlexMix has also been developed to perform finite mixture regression analysis [23,32].
While it is important to correctly identify clusters, for many cluster analyses, identifying clinical factors that are associated with the class assignment is also of great interest. These factors are crucial since they can help clinicians to make informed treatment decisions. However, it is often hardly known which variables are associated with class membership. Including all variables into the model without a selection will result in a large and complex model, which is also often difficult to interpret. Therefore, it is desirable to identify the appropriate subset of the variables in a mixture model to obtain a parsimonious model that simultaneously achieves consistent variable selection and optimal classification. Within the mixture model framework, several methods have been proposed to perform variable selection. Following the categorization in [18], the existing variable selection methods in the mixture model can be roughly divided into penalization approaches, in which variable selection is performed by using a penalized log-likelihood approach [28,59], model selection approaches, in which variable selection is considered as a model selection problem [12,17,19] and Bayesian approaches, in which variable selection is conducted by making inference about the posterior distribution via sampling strategies such as Markov Chain Monte Carlo (MCMC) [51,61]. A common feature of these approaches is that the variable selection procedure is applied in component-specific distribution (see f k (·) defined in Equation (2) in Section 3) to identify important variables or measurements, while assuming the probabilities of mixture components (or mixture weights) (see π ik (·) defined in Equation (3) in Section 3) do not depend on any covariates. Nevertheless, in clinical practice researchers may also be interested in which covariates predict the class assignment. Therefore, allowing variable selection in the context of the mixture model to determine the variables affecting the probabilities of the mixture components would significantly increase the flexibility of the model.
In the current study, we develop two approaches to perform variable selection in the context of the mixture model to identify important predictors that are associated with the class assignment. These two approaches are referred to as the one-step approach and the stepwise approach. The former refers to an approach in which clustering and variable selection are performed simultaneously in one overall model, whereas the latter refers to an approach in which clustering and variable selection are performed in two sequential steps. Shrinkage and spike-and-slab priors are adapted to these proposed approaches for selecting important variables. MCMC algorithms are developed to estimate the posterior distributions of parameters of interest. Real data and simulated data are used to compare the performance of these two approaches under different scenarios. The remainder of this paper is organized as follow: in Section 2, we describe two motivating studies and datasets to be considered in the current study. In Section 3, we first review the FMM and then describe the proposed model for both cross-sectional and longitudinal data. Variable selection methods based on shrinkage and spike-and-slab priors are also introduced in this section. In Section 4, we discuss the Bayesian inference for the proposed models. In Section 5, we apply the proposed models to analyze two motivating datasets for discovering disease phenotypes and their risk factors, and in Section 6, we perform a simulation study to compare the clustering and variable selection performance of these models under different scenarios. Finally, in Section 7, we discuss and conclude our findings.

Motivating examples
Our methodological development is motivated by two clinical studies, namely the Primary Biliary Cirrhosis (PBC) study and the Childhood Asthma Management Program (CAMP).

Primary biliary cirrhosis study
The first example is a well-known study, the Mayo Clinic trial of the liver Primary Biliary Cirrhosis (PBC). PBC is a fatal chronic liver disease with an unknown cause. A randomized placebo-controlled trial was conducted between 1974 and 1984 to study the effect of the drug D-penicillamine on the treatment of PBC. In this 10-year interval, the trial recruited 424 PBC patients who met the eligibility criteria [14]. This dataset can be found in [16] and is also available in R survival package. The trial collected several variables, for example, patients' age survival status (0=alive, 1=liver transplant, 2=dead), drug (1=D-penicillamine, 2=placebo), presence of ascites (0=no, 1=yes), hepatomegaly (0=no, 1=yes), spiders (0=no, 1=yes) and edema (0=no, 1=yes), as well as other indices such as serum bilirubin and cholesterol level. A common problem from the clinical practice is how to make use of these markers of disease progression to identify subgroups (i.e. clusters) of PBC patients with similar characteristics, and what are the predictors of these group assignments. These subgroups may provide an important indication of patients' prognostic.

Childhood asthma management program
It is widely accepted that asthma is not a single disease, but a distinct disease caused by different underlying biological mechanisms. Discovering different phenotypes of asthma and factors associated with these phenotypes can help better understand the underlying biological mechanisms, which is a key building block for better asthma management strategies, novel treatments and precision medicine.
The Childhood Asthma Management Program (CAMP) is a triple-blinded randomized clinical trial originally designed to evaluate whether treatment with either an inhaled corticosteroid (budesonide) or an inhaled noncorticosteroid drug (nedocromil) safely produces an improvement in lung growth when compared with treatment for symptoms only. The primary outcome of CAMP is lung function measured by forced expiratory volume in 1 s (FEV1). Participants recruited to the trial were also followed by three phases of observational follow-up lasting 13 years. Both trial and observational follow-up included an annual prebronchodilator and postbronchodilator spirometry test (during which FEV1 data were collected) as part of the protocol. At baseline, 1041 children were randomly assigned to receive 200 μg of budesonide (311 children), 8 mg of nedocromil (312 children), or placebo (418 children) twice daily. These participants were from 5 to 12 years of age with mild-to-moderate asthma and were treated for four to six years. The trial found that the anti-inflammatory medications did not have a better long-term effect than placebo on lung function growth [7]. The trial also assessed differences between treatment groups regarding airway responsiveness, morbidity and physical growth etc. More details about the design of this study can be found in [6].
Since lung function has been associated with asthma in childhood [35] and adulthood [44], we are interested in understanding the growth and decline in lung function in these patients with childhood asthma. These distinct lung function patterns may reveal links between asthma and subsequent chronic airflow obstruction and aid in developing optimal personalized treatment strategies. Moreover, identifying determinants of the growth and decline in lung function would facilitate early disease diagnosis and prevent it from progressing to a more severe stage.

Methodology
In this section, we begin by first reviewing the finite mixture model, then describing the one-step and stepwise approaches for cluster analysis. We also describe the Gaussian mixture model for both cross-section and longitudinal data, as well as variable selection priors.

Review of the finite mixture model
The model presented here is intended to be in sufficient generality to allow for more complex data structures. Let y i denote the data for N subject (i = 1, . . . , N), where y i can be a single data point or multi-dimensional vector. An FMM with K components (i.e. classes) can be written as f (y i ) = K k=1 π k f k (y i ), for k = 1, . . . , K, where f k (y i ) are densities of cluster k and π k are the probabilities of mixture components (or mixture weights) that sum to one, that is, 0 ≤ π k ≤ 1 and K k=1 π k = 1. It is not uncommon that f k is assumed to have a parametric form, i.e. f k (y i ) = f k (y i ; k ), where k is a set of parameters that characterize the density f k (·). Therefore, the FMM can be written as where = (π , 1 , . . . , K ) . Depending on the types of data, one can specify a different probability model for f k . For example, to model continuous cross-sectional data f k may be a Gaussian density defined by mean and variance, whereas to model longitudinal data f k may be a multivariate Gaussian density defined by mean vector and variance-covariance matrix.

One-step approach
A more general formulation of the FMM is to allow π k depend on some covariates x i and the class-specific density function depend on some covariates u i , that is, where = (π , 1 , . . . , K ) , and k = (β k , θ k ) denotes all parameters to be estimated for the k th component. θ k is a set of parameters characterized the distribution f k . π ik (x i , β k ) is the probability of subject i belonging to class k, depending on the p dimensional covariates x i = (x i1 , . . . , x ip ) and its associated parameters β k = (β k1 , . . . , β kp ) . Specifically, π ik (x i , β k ) can be modeled using a multinomial logistic regression We set β 10 = 0 (i.e. the intercept of the first class) and β 1 = 0 (i.e. the p-dimensional coefficients of the first class) for identifiability purpose. The coefficients β k (for k > 1) can be interpreted in terms of change in log-odds relative to the first category. We refer to this model as the one-step approach in which predictors are included in the FMM.

Stepwise approach
Alternatively, one can also determine the variables that are associated with the class membership using an unconditional approach, also known as the stepwise approach. In the stepwise approach, the classification and prediction are separated steps such that class predictors do not affect the classification results. This avoids the scenarios in which including different covariates could lead to a different classification. Specifically, the stepwise approach to incorporate predictors is consist of two steps. In the first step, we identify the latent classes (i.e. most likely class membership) via an unconditional model without including any predictors, i.e. f (y i ) = K k=1 π k f k (y i ), where π k does not depend on covariates and is identical to all subjects in class k. And in the second step, the class membership variable derived from the model is used as the outcome variable and regressed on the latent class predictors, which can be achieved by using a multinomial logistic regression as in (3).

Gaussian mixture model for cross-sectional data
To specify f k , here we consider two common models based on Gaussian distribution. The first model can be used to model continuous cross-sectional data, whereas the second model can be used to model continuous longitudinal data.
For cross-sectional data, let y i = y i denote a single data point of subject i. Let u i = u i denote the time when the measurement is collected. The Gaussian mixture regression model can be written as where N(·, ·) is a normal distribution and to model the component mean with the time covariate u i , one can further specify a regression model For example, if the component mean μ k is assumed to have a quadratic relationship with u i , one can specify U i = (1, u i , u 2 i ) and γ k = (γ k0 , γ k1 , γ k2 ) , which leads to μ k = γ k0 + γ k1 u i + γ k2 u 2 i . To facilitate the Gibbs sampling of the model parameters, we used the following standard independent conjugate priors, γ k ∼ MVN(0, V k0 ) and σ 2 k ∼ IG(a k0 , b k0 ), where V k0 is the variance-covariance matrix of a q dimensional multivariate normal distribution. IG denotes the inverse gamma distribution with parameters a k0 and b k0 .

Gaussian mixture model for longitudinal data
For longitudinal data, let y i = (y i1 , . . . , y in i ) denote the data for subject i, where n i denote the number of measurements for subject i. Let u i = (u i1 , . . . , u in i ) denote the time when these measurements are collected. The growth mixture model for class k can be written as where θ k = (γ k , k , σ 2 k ) and ϑ ik ∼ MVN(0, k ). U i denotes the q 1 × n i design matrix of the fixed effect and Z i , which is a subset of the columns of U i , denotes the q 2 × n i (q 2 < q 1 ) design matrix of random effect. ϑ ik is the corresponding random effect coefficient. Under this model, the subject i that belongs to class k has a mean trajectory U i γ k and the variance-covariance matrix Z i k Z i + σ 2 k I n i ×n i . We used the following standard independent conjugate priors,

Variable selection priors
For the mixture model specified in (2) and (3), the number of coefficients β needed to be estimated grows as the number of class K and the covariate dimension p increase. From a practical perspective, it is crucial to identify only the variables of importance in order to obtain an interpretable and parsimonious model ; therefore, it is necessary to place restrictions on the estimation of β in order to obtain a robust final model. Bayesian variable selection methods have received increasing attention and a variety of MCMC methods have been proposed for identifying important variables. These methods fall within the concept of Bayesian modeling average (BMA) in which parameter estimates uncertainty and model uncertainty are simultaneously achieved [24,37,58]. A popular class of the method for variable selection is imposing a shrinkage prior on the regression coefficients β to cause a 'shrinkage' of the parameter estimation to lie around the origin. Examples of shrinkage priors include Bayesian lasso (also known as Laplace prior) [46,56], Horseshoe prior [9,10], Dirichlet-Laplace prior [4] and the modified Bayesian lasso method [38]. However, shrinkage prior itself usually would not lead to variable selection, and hard shrinkage (HS) rules (e.g. coefficients > 1 standard deviation (SD)) may be applied to achieve this purpose. A review of these methods can be found in [37]. In our current implementation, we considered a popular shrinkage prior, the Horseshoe prior, which possesses strong theoretical guarantees for estimation, prediction and variable selection [9,10].

Horseshoe prior
Since its proposal, the Horseshoe prior has become one of the most popular methods for shrinkage due to its outstanding performance and computational advantages. The Horseshoe prior is a global-local shrinkage procedure in which the local shrinkage for the coefficient is determined by a hyper-parameter k and the overall level of shrinkage is determined by a hyper-parameter ξ kj . While the Horseshoe prior does not have a closed-form density function, it can be written as a scale mixture of normals, where C + denotes half-Cauchy distribution. Unlike Bayesian lasso which imposes shrinkage effect uniformly across all coefficients, the Horseshoe uses half-Cauchy distributions over the global parameter (ξ kj ) and local hyper-parameter ( k ) which results in strong shrinkage over weak coefficients whereas almost no shrinkage over the large coefficients. Such property has been proven to be useful to discriminate between true effects and noise.
To efficiently sample from the posterior distribution, Makalic and Schmidt [39] proposed a simple sampler for regression models the Horseshoe prior. Another category of variable selection method is the spike-and-slab prior (also known as discrete mixtures) which places a mixture prior of a point mass at β = 0 and a continuous at β = 0. The spike-and-slab prior directly estimates the variable inclusion probabilities and thereby provides a direct measurement of variable importance. Examples of spike-andslab priors include stochastic search variable selection (SSVS) [22], Gibbs variable selection [8,13] and Kuo and Mallick (KM) prior based on unconditional distribution [31]. A review of these methods can be found in [45]. In our current study, we considered a commonly used prior, i.e. SSVS prior.

Stochastic Search Variable Selection
In SSVS, selecting a subset of important predictors is equivalent to setting the associated β kj of those non-selected variables to zero. With this idea, the SSVS places a normal mixture prior on β kj Therefore, The idea here is to set τ kj very small, such that those β kj for which δ kj = 0 will tend to be clustered around 0 (leading to the spike), and to set c kj very large, such that for those β kj for which δ kj = 1 will tend to be dispersed (leading to the slab). To facilitate the posterior computation, the SSVS can also be represented as a multivariate normal prior, It is recognized that SSVS is closely connected to the Horseshoe prior. To see this connection, one can re-parametrize the SSVS in (7) , where s 2 kj ands 2 kj denote the hyper-variances for the spike and slab distributions. Settings kj = 0 induces a degenerate distribution at the origin and under such case the SSVS can be written analogous to (6) as β kj | δ kj ∼ N(0, δ 2 kjs 2 kj ). Here δ kj ands kj have taken the role of k and ξ kj in (6), but instead of giving continuous distribution for δ kj it only allows to take two values (i.e. δ kj = 0, 1).

Bayesian inference
Based on the prior distributions described previously, in this section, we sketch the ideas of posterior computation via Gibbs sampling, followed by the clustering procedure and approaches to determine the number of clusters.

Posterior computation
To facilitate the posterior computation via Gibbs sampling, we considered a dataaugmentation approach which is based on a new class of Polya-Gamma (PG) distribution (a subset of the class of infinite convolutions of gamma distributions), which allows fully Bayesian inference in models with binomial or multinomial likelihoods through an efficient Gibbs sampler.
A challenge of using an MCMC algorithm (including the Gibbs sampler) is the nonidentifiability of the classes (or components). This issue arises because the mixture model is invariant under permutation of the indices of the classes, i.e. the parameters 1 in class 1 cannot be distinguished from parameters 2 in class 2 because they are exchangeable in the sense that the likelihood function will be invariant. As a result, the marginal posterior distributions of the parameters will be identical for each mixture component. This phenomenon is also known as the 'label switching' problem. In the current study, we applied a popular post-processing algorithm to reorder the labels based on Kullback-Leibler divergence [55]. The convergence of MCMC can be checked using Geweke statistics as well as visual inspection of the trace plots. The algorithm for posterior updates of parameters is provided in the Supplementary Materials.
The optimal classification for subject i (i = 1, . . . , N) can be determined based on the marginal posterior component probabilities that subject i is assigned to class k, which is defined as where Y denotes all data and denotes all the parameters. S is the total number of MCMC iterations. z

Determine the number of clusters
Although there is no widely accepted method for determining the number of clusters K in the mixture model, there have been many methods proposed. The common practice assumes the number of clusters K is fixed and finding the best K can be viewed as a model selection process. In such cases, model selection criteria can be implemented to compare models with different values of K. Statistical information criteria are commonly used, such as Akaike's Information Criterion (AIC) [1] and Bayesian Information Criterion (BIC) [50]. Alternatively, the Lo-Mendell-Rubin (LMR) [33] method or bootstrap likelihood ratio test (BLRT) [40] have also been used to determine K in finite mixture models. Nylund et al. [43] compared these indices for selecting the number of clusters in latent class models, factor mixture models and growth mixture models, and found that BIC outperformed the AIC, and the bootstrap likelihood ratio test provided the most consistent indicator of clusters.
Many model selection approaches were also proposed within the Bayesian framework. For example, the Bayes factor (BF) [27] is a commonly used index that is based on integrated likelihood, and can be applied when there are more than two candidate models and can be used for comparing non-nested models. For comparing two models, M 1 and M 2 , the BF is defined as the ratio of the two integrated likelihood, i.e. B 12 = P(y | M 1 ) P(y | M 2 ) . While BF provides the researcher with a directly interpretable number that quantifies the evidence provided by the data, it requires integration over all parameter space and therefore is computationally intensive, particularly when there are many parameters of interest. Alternatively, BIC can be used and previous studies suggest that it provides good performance in model-based clustering context [20]. The BIC is defined as is the log-likelihood under model M k , Y denotes all the data, N is the sample size and v k is the number of parameters. The model with the smallest BIC is usually preferred. The log BF can be approximated by BIC with 2 ln(B 10 ) ≈ 2( BIC), where B 10 is the BF comparing a model with k + 1 classes to a model with k classes, and BIC is the changes between these two models [26,42].

Practical application
In this section, we apply both one-step and stepwise approaches with variable selection priors to the PBC and CAMP datasets introduced in Section 2.

Primary biliary cirrhosis data
For the PBC data, the primary interest is to identify subgroups of patients with similar serum Bilirubin levels and predictors that are associated with these group assignments. Bilirubin is an orange-yellow substance made during the normal breakdown of red blood cells and higher than normal levels of bilirubin may indicate an increased risk of liver problems. Bilirubin was converted to a logarithmic scale (logbili) prior to modeling. Eight predictors were considered in the current analysis, namely treatment (trt, D-penicillamine vs. placebo), edema (edema), alkaline phosphatase (alk.phos), serum cholesterol (chol), serum albumin (albumin), triglycerides (trig), standardized blood clotting time (protime), histologic stage of disease (stage, stage 1 or 2 vs. stage 3 or 4). Here, we initially considered the 312 subjects who participated in the randomized trial. We then removed 30 (9.6%) subjects who had incomplete covariates data. Therefore, the data included in our final analysis consists of 282 subjects. Age was centered and all predictors were standardized such that the means were 0 and variances were 1.
We chose the hyper-parameters of prior distributions to provide weakly information to the parameters of interest. For priors of the growth trajectory parameters, we used V k0 = 1000I 2 , a k0 = 3 and b k0 = 0.01. We used λ 2 1 = · · · = λ 2 K = λ 2 ∼ gamma(ι a0 = 0.01, ι b0 = 0.01), where ι a0 and ι b0 are the shape and rate parameters, respectively. This induces a non-informative prior on the tuning parameter. For the hyper-parameters in SSVS, based on a preliminary analysis, we set c 2 kj = 100 and τ 2 kj = 0.01 (SSVS1) and c 2 kj = 100 and τ 2 kj = 0.04 (SSVS2) for all k and j. These values induce a hyper-variance of 1 for slab and 0.01 for spike in SSVS1, and a hyper-variance of 4 for slab and 0.04 for spike in SSVS2. For each model, we ran the MCMC with 6000 iterations, discarded the first 1000 iterations as burn-in and kept every 5th iterations. The final chain includes 1000 samples.
The BIC for one-to six-class models are 832.  [27]. Therefore, the two-class model provided the best fit to the current data compared to all other models. Therefore, we set K = 2 and refit the one-step approach with predictors and stepwise approaches with predictors based on different variable selection priors described in Section 3. The clustering results for different models are shown in Figure 1 and parameter estimates with 95% credible interval (CR) are shown in Table E1. Overall there were clear patterns of two groups, with Class 1 indicating patients with high Bilirubin and Class 2 indicating patients with low Bilirubin. However, the class proportions were different between the conditional models (i.e. with predictors) and the unconditional model (i.e. without predictors). Specifically, the unconditional model (Figure 1(A)) assigned 39% of patients to Class 1 and 61% to Class 2, whereas conditional models with Horseshoe (Figure 1(B)), SSVS1 (Figure 1(C)) and SSVS2 (Figure 1(D)) priors consistently assigned 46% to Class 1 and 54% to Class 2. This indicates models including covariates changed the patients' posterior class probability, which resulted in higher uncertainty compared to the unconditional model ( Figure E1). This difference between conditional and unconditional models resulted in different class proportions.
The covariate effects based on different models were shown in Figure 2. In general, within the same category of models (i.e. one-step or stepwise approaches), all models provided similar coefficient estimation (Figure 2(A)), which was expected given the small number of predictors considered in this analysis. Based on the HS rule, the one-step model with Horseshoe prior identified six important predictors (i.e. albumin, chol, edema, protime, stage, trig), whereas based on the inclusion probability ( > 0.5), SSVS1 and SSVS2 identified five (i.e. albumin, chol, protime, stage, trig) and two predictors (i.e. chol, trig), respectively (Figure 2(B)). These predictors were shown to be associated with the patients' class membership. For example, a patient with a higher albumin value was more likely to be assigned to Class 1 when compared to Class 2. It is interesting to observe that the estimated coefficients from the conditional models (i.e. Horseshoe-onestep, SSVS1onestep, SSVS2-onestep) moved towards the origin as compared to unconditional models (i.e. Horseshoe-stepwise, SSVS1-stepwise, SSVS2-stepwise). This attenuation was due to the increased number of patients in Class 1 in the conditional models. For the stepwise models, Horseshoe identified the same set of predictors as those in one-step models. The SSVS1-stepwise model also identified the same set of predictors as the shrinkage priors, whereas the SSVS2-stepwise only kept four predictors (albumin, chol, protime, trig) in the model.

Childhood asthma management program data
For the CAMP data, the primary interest is to characterize the FEV1 trajectory at the population level and the individual level, as well as to identify significant predictors of abnormal FEV1 longitudinal patterns. In the current analysis, we included 657 participants who contributed to a total of 15,138 measurements. On average, each participant had 23 measurements. Of these participants, there are 450 (68.5%) white and 391 (60%) male. The mean (SD) of age is 8.97 (2.10). For predictors of the class membership, here we considered 12 baseline covariates and their first-order interaction (66 combinations) as candidate predictors, which induce a total of 78 covariates. These 12 baseline covariates are race (white vs. other), any_pets (any pets, yes vs. no), agehome (age of current home, years), ast.age (age when asthma confirmed by MD), whitecell (white blood cell count), hemoglobin, gas.stove (gas cooking stove, range or oven), wood.stove (yes vs. no), hosp.ast (child ever in hospital for asthma, yes vs. no), mother.ast (mother has asthma, yes vs. no). To model the FEV1 patterns, we considered a quadratic function with random intercept and slope.
We chose similar hyper-parameters for the growth trajectory model as the previous example. Specifically, we used V k0 = 1000I 2 , a k0 = 3 and b k0 = 0.01, r = 4 and R k = 3I 2 . For hyper-parameter of Lasso prior, here we used λ 2 1 = · · · = λ 2 K = λ 2 ∼ gamma(ι a0 , ι b0 ), where ι a0 = 10 and ι b0 = 1/λ 2 , respectively. Theλ denotes the maximum likelihood estimate of λ. This yields a gamma distribution with a mean of 10 times of theλ. For hyperparameters of SSVS, we considered four different settings. In SSVS1, we set c 2 kj = 100 and τ 2 kj = 0.01 which yielded a hyper-variance of 1 for slab and 0.01 for spike. In SSVS2, we set c 2 kj = 100 and τ 2 kj = 0.1 which yielded a hyper-variance of 10 for slab and 0.1 for spike. In SSVS3, we set c 2 kj = 100 and τ 2 kj = 0.3 which yielded a hyper-variance of 30 for slab and 0.3 for spike. In SSVS4, we set c 2 kj = 100 and τ 2 kj = 0.5 which yielded a hyper-variance of 50 for slab and 0.5 for spike. For each model, we ran the MCMC with 60,000 iterations, discarded the first 10,000 iterations as burn-in and kept every 50th iterations as burn-in and kept every. The final chain of each model includes 1000 samples.
The BIC for one-to six-class models are 9556, 8206, 8028, 7936, 7938 and 7932, respectively. While a six-class model yielded the lowest BIC, the 2 ln(B 10 ) of comparing the six-class model to a four-class model was 7.6, which did not show a significant improvement in terms of model fitting. In contrast, the 2 ln(B 10 ) comparing a four-class model to a three-class model was 184, which showed a significant improvement ( > 10) and justified using a more complicated model. This evidence indicated that a four-class model provided a good fit to the current data.
The observed trajectory patterns of the four-class model based on different variable selection priors are shown in Figure 3 and the parameter estimates with 95% CR are shown in Table E2. The four FEV1 growth patterns were similar across different models. Class 1 represented normal FEV1 patterns and patients from this class had the best lung function compared to patients from all other classes. Class 2 and Class 3 represented mild and moderate reduced FEV1 patterns, while Class 4 represented severe reduced FEV1 patterns, which suggested that patients from this group had the worst lung function compared to the patients from all other classes. However, the proportion of patients assigned to different classes differ across different models. Specifically, the proportion of Class 2 (ranged 37% to 39%) and Class 4 (8% to 9%) were relatively stable across different models whereas Class 1 (ranged 18% to 33%) and Class 3 (20% to 34%) were substantially different among different models. The difference in class proportion can be partially explained by the predictors kept by each model. Similar to the previous example, the models including covariates changed the patients' posterior class probability, which also resulted in different class proportions ( Figure E2). Consider Class 1 (with the best FEV1 patterns) as the reference category, the predictors selected by each model were shown in Figure 4. Unlike the previous example when only eight predictors were considered, in the current example different models kept a different number of variables and there is a lack of consistency. Take the comparison between Class 2 vs. Class 1 for example, Horseshoe prior identified only one predictor (i.e. bPREFEV) in either the one-step or stepwise approach. For SSVS prior, we observed that the larger the hyper-variance (for either spike or slab), the smaller number of predictors were kept in the model. For example, SSVS1 resulted in a model with the largest number of predictors in either the one-step or stepwise approach. In contrast, SSVS4 yielded the smallest model. Similar results were observed for the comparison of Class 3 vs. Class 1 and Class 4 vs. Class 1.
To compare the behavior of different models, we provided the estimated posterior distribution of a selected predictor (for variable bPREFEV) in Figure E3. It was evident that the posterior density of the Horseshoe model was tighter (i.e. with smaller SD) compared to other models. And the estimate shifted away from zero when higher class (i.e. the class with lower FEV1 patterns) is compared to Class 1. This is expected given Class 1 was the group with the highest mean FEV1 over time and the higher the class number, the lower the FEV1 and, therefore, the larger the difference (Figure 3).

Simulation study
To further investigate the performance of variable selection in both one-step and stepwise approaches, in this section, we conducted a small simulation study. We only considered the longitudinal setting in this simulation. We generated the data from mixture models with K classes using quadratic mean functions with random intercept and slope. We considered K = 2, 3, 4 which are commonly seen in practice. The trajectory of each class was generated to mimic the FEV1 trajectories from the CAMP. Specifically, we considered two scenarios ( Figure E4), representing high separation (Scenario 1) among classes and low separation among classes (Scenario 2). The total sample size was set at N = 600, and the number of  measurements per individual n i was set to have an equal probability of being 1 to n max = 10, using a uniform distribution.
For subject-level covariates (e.g. baseline covariates), we generated a p = 20 dimensional covariates X from a multivariate normal distribution MVN(0, ), where has exchangeable correlation structure with ρ = 0.5, representing moderate correlation between predictors. The regression coefficients were set to be sparse to represent common scenarios often seen in practice. For identifiability purpose, we set β 10 = 0 and β 1 = 0, and therefore, the first class was considered as a reference category. For K = 2, we set β 20 = log(2) and β 2 = (log(2), log(5), log(5), log (8) To evaluate the performance of different models, we used several evaluation metrics. To evaluate the clustering performance, we considered the correct classification rate (cRate) and adjusted rand index (aRand). Both of these two indices measure the agreement between the estimated class membership and the true class membership, and the higher cRate or aRand indicates a better agreement. To evaluate the variable selection performance, we used the true positive rate (TPR), true negative rate (TNR), accuracy (ACY) and root mean square error (RMSE). These metrics were calculated over all the p(K − 1) estimated coefficients from each model to represent the overall model performance in clustering and parameter estimation. For each setting in each scenario, we simulated 25 datasets, and both one-step and stepwise approaches with Horseshoe or SSVS priors were applied to these datasets to estimate the class and identify predictors of the class membership, assuming the number of classes K is known. The mean and SD of these metrics were reported.
The simulation results for Scenario 1 (high separation) and Scenario 2 (low separation) are provided in Tables 1 and 2, respectively. Specifically, in Scenario 1 and when K = 2, all approaches performed reasonably well in identifying the true class membership. This is expected given the high separation between classes. However, it was interesting that one-step approaches yielded slightly lower cRates and aRands compared to stepwise approaches, while within either the one-step or stepwise approaches, the clustering performance was similar among each other. On the other hand, one-step approaches with Horseshoe and SSVS1 yielded slightly better variable selection performance in terms of accuracy, compared to their stepwise model counterpart. For example, Horseshoe-onestep yielded higher accuracy compared to Horseshoe-stepwise (0.97 vs. 0.94). Of note, the one-step or stepwise approach with Horseshoe prior correctly identified all non-zero coefficients in all datasets, with a mean (SD) TPR of 1 (0). Similar results were observed for K = 3 and K = 4. In particular, the Horseshoe priors yielded higher TPR and ACY compared to SSVS priors (i.e. SSVS1 and SSVS2) for either a one-step or stepwise approach. Of note, Horseshoe prior maintained high variable selection accuracy in both one-step and stepwise approaches. However, the one-step model with Horseshoe prior generally yielded better variable selection performance compared to the stepwise model with Horseshoe prior in all settings and scenarios we considered. Similar results were found in Scenario 2.

Discussion
In this study, we developed and compared two approaches for variable selection in the context of the mixture model to determine the variables affecting the probabilities of the mixture components: the one-step approach and the stepwise approach. Horseshoe prior and SSVS prior were used within these two approaches to select important variables. We also developed MCMC algorithms based on Gibbs sampling to estimate the posterior distribution of model parameters. The proposed models were applied to two clinical datasets with the goal of finding disease phenotypes and their predictors. A simulation study was carried out to investigate the clustering and variable selection performance under different settings and scenarios. In our practical applications, we considered BIC and BF to determine the number of clusters in the proposed mixture models. While these indices have been widely used in many model-based cluster analyses, they can be infeasible when there are a larger number of candidate models needed to fit. Alternatively, one can treat K as a random variable, and the inference of K is considered as part of the modeling process. In this regard, Richardson and Green proposed a reversible jump MCMC method, which allows the sampler jumps between parameter subspaces of different dimensionality [48]. Stephens proposed a birthand-death process, in which the MCMC sampler allows the number of components to vary by allowing new components to be 'born' and existing components to 'die' [54]. Moreover, Bayesian non-parametric methods such as the Dirichlet process mixture [15] can also be employed.
A previous study by Vermunt suggested two improved stepwise approaches to account for the uncertainty of the class assignment estimated from a latent class model [57]. In our current study, we also compared the one-step and stepwise approaches but instead focus on estimating the covariate effects and selecting the variables that are associated with the class assignment. It is worth noting that even if covariate selection works well with both onestep and stepwise approaches, some covariate effects are expected to be downward biased given the penalization effect from either the shrinkage or spike-and-slab prior, which is also reflected in the estimates of RMSE in our simulation study (Tables 1 and 2). Moreover, for the stepwise approach, if the uncertainty of the class membership is not taken into account, a systematic underestimation of covariate effects may occur [57].
The number of predictors considered in the current study is not large as compared to many medical studies which may have hundreds or thousands of predictors, such as genetics studies. The definition of large here is more relevant to the computational scale in which estimating 2 p(K−1) models would be infeasible when p and/or K increase. Therefore, the variable selection provides a convenient tool to identify important predictors and to facilitate better interpretation. Furthermore, the literature has not reached a consensus in terms of whether adding covariates can improve the class recovery or not (i.e. subjects are classified more accurately). Huang et al. [25] found that whether covariates were included in the model and which covariates were included could have an impact on the class assignment, which highlights how the inclusion of covariates and the decision of what covariates to include can dramatically influence the nature of the latent class variable. A previous study also showed that deciding the number of latent classes without predictors of latent class (i.e. via an unconditional model), and including the latent class predictors into the model subsequently lead to good estimates for all model parameters [29]. Similarly, our study finds that the one-step approach with variable selection priors resulted in larger uncertainty in the posterior class probability compared to the unconditional model in which no predictors were included (Figures E1 & E2) and, therefore, resulted in worse clustering performance. The implementation of SSVS in the current study considers several sets of hyperparameters in both Practical Applications and Simulation Studies. It is recognized that SSVS is sensitive to the specification of its hyperparameters c 2 and τ 2 , as observed in our analyses, and these hyperparameters are rarely known in practice. Moreover, tuning these hyperparameters is a difficult process in many applications. While our implementation considers different specifications of these hyperparameters, careful selection and tuning these parameters by a more extensive search may lead to refined results. In contrast, the Horseshoe prior is free of tuning parameters and was found to have desirable variable selection performance in our practical applications and simulation study. However, the Horseshoe prior has also been criticized for not providing sufficient shrinkage to large coefficients, which is undesirable when the parameters are weakly identified. Its variants such as regularized Horseshoe [47] that allows users to specify a minimum level of regularization to the largest values can be considered in the future study to refine the model. The current proposed model refers to a basic model where f k is a Gaussian distribution, and classes (or components) are used to model unobserved heterogeneity. More complicated models can be considered in future studies. In this regard, variable selection priors can be adapted to mixture models such as latent class analysis [3], shape invariant mixture model [36], Bayesian consensus clustering [34] and Bayesian model for multivariate continuous and discrete longitudinal data [30].