Joint Modeling and Clustering Paired Generalized Longitudinal Trajectories With Application to Cocaine Abuse Treatment Data

In a cocaine dependence treatment study, we have paired binary longitudinal trajectories that record the cocaine use patterns of each patient before and after a treatment. To better understand the drug-using behaviors among the patients, we propose a general framework based on functional data analysis to jointly model and cluster these paired non-Gaussian longitudinal trajectories. Our approach assumes that the response variables follow distributions from the exponential family, with the canonical parameters determined by some latent Gaussian processes. To reduce the dimensionality of the latent processes, we express them by a truncated Karhunen-Lóeve (KL) expansion allowing the mean and covariance functions to be different across clusters. We further represent the mean and eigenfunctions functions by flexible spline bases, and determine the orders of the truncated KL expansions using data-driven methods. By treating the cluster membership as a missing value, we cluster the cocaine use trajectories by a likelihood-based approach. The cluster membership and parameter estimates are jointly estimated by a Monte Carlo EM algorithm with Gibbs sampling steps. We discover subgroups of patients with distinct behaviors in terms of overall probability to use, binge verses periodic use pattern, etc. The joint modeling approach also sheds new lights on relating relapse behavior to baseline pattern in each subgroup. Supplementary materials for this article are available online.


Data Description
In this article, we consider data in the form of longitudinal trajectories that are commonly collected in cocaine and other substance abuse treatment studies. The data that we will analyze came from a recently completed clinical trial by the Yale Stress Center. In the study, 142 cocaine dependent subjects were admitted to the Clinical Neuroscience Research Unit of the Connecticut Mental Health Center, for two to four weeks of in-patient relapse prevention treatment for cocaine dependence. Upon treatment entry, all subjects recalled their daily cocaine use retrospectively for the previous ninety days by using the time-line follow-back (TLFB; Sobell and Sobell 1993) Substance Use Calendar, which is a reliable and well established instrument for assessing self-report drug use in alcoholic and drug abusing populations (Fals-Stewart et al. 2000). After completion of the in-patient treatment, all participants were invited back for follow-up interview(s) to assess cocaine use outcomes. The study was conducted in two phases. For the 59 subjects who participated the first phase of the study, only one interview was administrated at day 90 after the treatment. For the remaining eighty three subjects enrolled in the second phase of the study, four interviews were given at days 14, 30, 90, and 180 after the treatment. Daily cocaine use records were recalled based on the Hui Huang is Research Fellow, Center for Statistical Science and School of Mathematical Science, Peking University, Beijing, China, 100871 (E-mail: huanghui@math.pku.edu.cn). Yehua Li is Associate Professor, Department of Statistics, Iowa State University, Ames, IA 50011 (E-mail: yehuali@iastate.edu). Yongtao Guan is Professor, Department of Management Science, University of Miami,Coral Gables,. This research has been partially supported by NIH grant 7R01DA029081 and NSF grants DMS-0845368, DMS-1105634, and DMS-1317118.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/jasa. TLFB procedure during each interview for the period prior to the interview date.
Even though the actual daily cocaine use amounts (in gram equivalents) were estimated by the study participants, such data are often subject to large errors because of the long period of time associated with these recalls as well as the lack of a common scale to assess the amount used due to different methods of consumption (e.g., intranasal use versus injection). We therefore consider only dichotomized trajectories comprised of no use (=0) and any use (=1). We refer to the resulting trajectories before and after the treatment as the baseline and relapse trajectories, respectively. Figure 1 shows the baseline and relapse trajectories from three subjects. Clearly different subjects exhibit very different cocaine use patterns in terms of use frequencies. In particular, subject 3 also appears to be exhibit a strong weekly cycle.
In a cocaine abuse treatment study like ours, an important goal is to understand the causes for the often highly diverse treatment outcomes, which have been given in the forms of daily cocaine use trajectories in our case. It is of particular interest to know the relationship between one's baseline cocaine use pattern and the treatment outcome. For example, do high frequency users during the baseline period tend to use more after treatment, and will those with a weekly baseline cocaine use pattern maintain such a pattern after the treatment? If a significant relationship can be found, then we can potentially provide a much improved prediction for one's treatment prognosis based on his/her baseline cocaine use pattern. More significantly, such a knowledge could even help us design more desirable treatment plans, should multiple treatment options be available. We answer the questions raised above by jointly modeling and clustering the paired baseline and relapse trajectories. The clustering analysis will allow us to assess the different types of cocaine use behaviors both before and after a treatment, and the joint modeling can provide insight on how one's baseline cocaine use pattern may affect the treatment outcome. The mean and covariance structures of these trajectories can be highly complex (e.g., Guan, Li, and Sinha 2011) and therefore be difficult to model by a parametric approach. We will instead develop a more flexible, nonparametric approach based on functional data analysis (Ramsay and Silverman 2005).

Literature Review
In functional data analysis, functional principal component analysis (FPCA) is a useful tool to capture features of both mean and covariance functions. There is a vast volume of recent literature on FPCA, including James, Hastie, and Sugar (2000); Yao, Müller, and Wang (2005a) ;Hall, Müller, and Wang (2006); and Li and Hsing (2010). These papers studied FPCA for a single functional variable. More recently, Di et al. (2009) and Crainiceanu, Staicu, and Di (2009) considered multilevel FPCA, while Zhou et al. (2010) studied hierarchical functional data. In all three papers, the random curves are assumed to be spanned by the same set of eigenfunctions. In contrast, Yao, Müller, and Wang (2005b) and Zhou, Huang, and Carroll (2008) modeled pairs of functional variables, where the two functional variables are allowed to have different sets of eigenfunctions and their principal component scores are cross-correlated. Hall, Müller, and Yao (2008) extended the FPCA to non-Gaussian longitudinal data. They modeled the longitudinal response by a generalized linear mixed model with a known link function, where the latent longitudinal process is assumed to have a FPCA decomposition.
There is also a strong interest in clustering functional data. James and Sugar (2003) proposed a model-based clustering method for sparsely sampled functional data. Jiang and Serban (2011) considered clustering spatially correlated functional data. Both papers assumed that the random curves had different mean curves across clusters but the same covariance structure. Serban and Jiang (2012) studied clustering multilevel functional data under a modeling framework similar to that in Di et al. (2009). Chiou andLi (2007, 2008) proposed to cluster functional data using the k-means method. There are also some related methods for clustering non-Gaussian data in the machine learning literature, such as the model-based methods (Rao and Scott 1992;Banfield and Raftery 1993), the K-means and its variants (Ordonez 2003), and the entropy-based clustering methods (Li 2006). These methods, however, were not designed for longitudinal data and hence not suitable for our problem.
Our proposed method advances the existing literature in two innovative ways. First, joint modeling and clustering of non-Gaussian trajectories are unified in one framework. A joint modeling approach is needed for our motivating data, because one's baseline cocaine use pattern may be related to relapse. To achieve this, we use a generalized linear mixed model to link the binary daily cocaine use records to a latent longitudinal process that admits an FPCA decomposition. At the latent process level, we use a joint modeling approach as in Zhou, Huang, and Carroll (2008) to model the paired longitudinal latent processes, allowing them to have different eigen-systems. The FPCA approach in our joint model can reduce the dimensionality of the latent functional data efficiently, and increase the interpretability of the model. As James and Sugar (2003), we adopt a modelbased clustering framework, by modeling the class labels as multinomial random variables. To handle computational challenges due to non-Gaussian responses, we propose a Monte Carlo EM (MCEM) algorithm based on Metropolis-Hasting steps for parameter estimation. Second, we allow the latent processes to have different mean and different covariance structures across clusters, where in contrast most existing literature only allows the former. In cocaine abuse treatment studies, it is also important to study the covariance structure, because it may carry useful information about one's cocaine use behavior. For example, a covariance structure with a long dependence range often suggests a binge pattern, while that exhibiting weekly cycles indicates a weekly cocaine use pattern. Even though the model will be more complex by allowing different covariance structures, this modeling effort is not formidable because we reduce the dimensionality of the data efficiently by FPCA. We organize the remainder of this article as follows. We present our proposed model in Section 2, and describe its estimation through MCEM in Section 3. We discuss issues on model selection in Section 4. We present some simulation results in Section 5, and apply the proposed method to our motivating data in Section 6. We discuss the strengths and limitations of our approach in Section 7. Technical details are in the Appendices, and more details of the simulation study can be found in the supplementary material.

A General Modeling Framework
The data are collected from n independent subjects. Let T and S be the time intervals of the baseline and followup study period, and {B i (t), t ∈ T } and {R i (s), s ∈ S} denote the cocaine use records for the ith subject in the baseline and followup period respectively. The observed generalized responses are B ij = B i (t ij ), j = 1, . . . , T i , and R i = R i (s i ), = 1, . . . , S i . In our motivating data, B i (t) and R i (s) are both binary trajectories, but we will develop the proposed method in a more general setting.
We assume that, conditional on some latent random factors, the distributions of B ij and R i belong to the canonical exponential family with densities where γ B,ij and γ R,i are canonical parameters depending on some latent random factors, σ k , k = 1, 2, are dispersion parameters, and a k (·), b k (·), and c k (·), k = 1, 2, are known functions (McCulloch and Nelder 1989). We assume where {X i (t), t ∈ T } and {Y i (s), s ∈ S} are two latent Gaussian processes that drive the behavior of the observed longitudinal processes. In the cocaine abuse treatment study, X i (t) and Y i (s) can be understood as unobserved temporally varying factors such as cocaine craving levels that may affect one's probability of cocaine use on a given day. Conditioning on X i (t) and Y i (s), we assume that B i1 , . . . , B iT i , R i1 , . . . , R iS i are mutually independent.
We assume that the subjects are drawn from C populations, and all information in B i (t) and R i (s) related to the cluster membership is completely specified by the latent processes [X i (t), Y i (t)] . Define the latent cluster label for the ith subject as , X i (t 2 )|ω ic = 1] be the mean and covariance function of X(·) in cluster c. Similarly, define μ y,c (s) and y,c (s 1 , s 2 ) as the mean and covariance of Y (·) in cluster c. We allow (μ x,c , x,c , μ y,c , y,c ) to vary across different clusters. We also allow different covariance structures across clusters to better understand the cocaine use pattern in different sub populations of cocaine users.
In our application, some of the baseline trajectories showed periodic patterns and we also anticipate certain trend in the relapse trajectories. We hence model μ x,c and μ y,c nonparametrically. Guan, Li, and Sinha (2011) examined the empirical autocorrelation of individual baseline trajectories. They found that the correlation varied across subjects, decayed at a rate much slower than low order autoregressive correlations and often exhibited periodic patterns. It is challenging to fully capture such a complex correlation structure parametrically. As a result, we model x,c and y,c flexibly as bivariate nonparametric functions. To keep the model computationally tractable, we propose to reduce the dimensionality of the covariance functions using the recently much-celebrated FPCA method.
Suppose that the covariance functions yield the spec- x,c (·, ·) and y,c (·, ·) respectively, and ψ jc (·) and φ jc (·) are the corresponding eigenfunctions. The eigenfunctions are orthonormal functions so that T ψ jc (t)ψ j c (t)dt = I (j = j ) and S φ jc (s)φ j c (s)ds = I (j = j ) for all j, j and c, where I (·) is an indicator function. Given the cluster membership of subject i, we can then express X i (t) and Y i (s) using the Karhunen-Loève expansion where ξ ij and ζ ij are normal random variables with mean zero and variances equal to λ jx,c and λ jy,c . In practice, model (2) is not feasible for estimation due to the infinite dimensionality of the parameter space. Instead, it is common to approximate (2) by using the leading principal components, that is, where P 1,c and P 2,c are sufficiently large to capture most variation in the random processes. The reduced rank model (3) can help obtain more reliable estimates, especially for irregular and/or sparse functional data (James, Hastie, and Sugar 2000; Zhou, Huang, and Carroll 2008). We treat P 1,c and P 2,c as tuning parameters and choose them using data driven methods in Section 4. The Karhunen-Loève theorem ensures that ξ ij 's are uncorrelated between different orders of j, and so are the variables ζ ij 's. Let ξ i and ζ i be two column vectors containing all ξ ij 's and ζ ij 's used in the reduced rank model (3). As discussed in Section 1, it is of interest to understand how one's baseline cocaine-use behavior is related to his/her relapse pattern. In our joint modeling framework, we therefore allow ξ and ζ to be cross-correlated. Following Zhou, Huang, and Carroll (2008) and Yao, Müller, and Wang (2005b), we jointly model ξ i and ζ i by where . . , λ P 2,c y,c }, xy,c is a P 1,c × P 2,c crosscovariance matrix, and yx,c = T xy,c . To better quantify the association between ξ i and ζ i , we also define the cross-correlation matrix as where ρ jk,c = corr(ξ ij , ζ ik |ω ic = 1) for j = 1, . . . , P 1,c and k = 1, . . . , P 2,c . Our conditional modeling of the generalized longitudinal trajectories in (1) resembles that in Hall, Müller, and Yao (2008), but there are three major differences between our approach and theirs. First, the procedure proposed by Hall, Müller, and Yao (2008) is designed for modeling a single non-Gaussian longitudinal process but not for jointly modeling paired longitudinal processes. Second, even though they adopted a conditional modeling of the longitudinal process, their estimation method is a marginal regression method based on asymptotic approximations under the assumption that the latent Gaussian process has a "small" magnitude. Third, even though the estimated principal component scores produced by Hall et al. can be further used for clustering in a two-stage procedure, the estimation error in the first stage will inevitably propagate into the clustering analysis. In contrast, our model based approach treats estimation and clustering under a unified framework and seems to be a more natural approach for this problem.

Reduced Rank Spline Representation
Assuming that the mean functions and eigenfunctions in model (3) are smooth, we approximate them by reduced rank regression splines. Specifically, let S 1 (t) and S 2 (s) be q 1 and q 2 dimensional B-spline bases defined on T and S respectively. For simplicity, we place the knots of the spline bases to be equally spaced in T and S. Using the normalization method in Zhou, Huang, and Carroll (2008), we orthogonalize the spline bases so that where I m×m is an m-dimension identity matrix. We then express the mean functions and eigenfunctions as where ν x,c , ν y,c , ψ,c, and φ,c are coefficients with dimensions q 1 × 1, q 2 × 1, q 1 × P 1,c, and q 2 × P 2,c , respectively. Based on Lemma 1 in Zhou, Huang, and Carroll (2008), the identifiability of , D and xy is guaranteed by two conditions: orthornormality and consistent signs of eigenfunctions. Thus, we set the following constraints on ψ,c and φ,c : In addition, for each eigenfunction, we force the sign of the first element with the largest magnitude to be positive. The reduced rank model (3) can now be expressed as With the spline approximation described above, the parameters to be estimated are Let B i and R i be the data vectors collecting all observations in the baseline and relapse trajectories for subject i, then his/her contribution to the complete data likelihood is where f c (ξ i , ζ i ) is the joint density of the latent variables specified in (4), and f c (B ij |ξ i ) and f c (R i |ζ i ) are the conditional distributions specified in model (1) given that the ith subject belongs to cluster c. In the cocaine abuse treatment data, both B i (t) and R i (s) are binary processes and their conditional distributions are given by To facilitate the Gibbs sampler described in the Appendix, it is more convenient to rewrite (4) in a regression form (Zhou, Huang, and Carroll 2008) where c = xy,c D −1 λy,c and η ic is a zero-mean normal vector independent with ζ i and with a covariance matrix η,c = D λx,c − xy,c D −1 λy,c yx,c . Then, the conditional joint density of (ξ i , ζ i ) can be written as

Monte Carlo EM Algorithm
The complete data likelihood (9) depends on the latent random variables ω i , ξ i , and ζ i and hence cannot be maximized directly. We treat these latent variables as missing data, and estimate the unknown parameters by a Monte Carlo EM algorithm (Wei and Tanner 1990;McCulloch 1997), which was shown by Chan and Ledolter (1995) to converge to the maximum likelihood estimator under some general regularity conditions. Cluster memberships can be determined by assigning subject i to cluster c that maximizes the conditional probability π ic = P (ω ic = 1|B i , R i ). The procedure iterates between an Estep and an M-step described in the following subsections until the convergence of the parameters.

E-
Step. In the E-step, we take expectation on the logarithm of the likelihood (9) conditioning on the observed data B i , R i and the value of the parameter vector prev from the previous EM iteration. For non-Gaussian responses, this conditional expectation does not have a closed form, hence we approximate it by a Monte Carlo average.
Specifically, let i ( ; B i , R i , ω i , ξ i , ζ i ) be the log complete data likelihood for the ith subject given in (9). In the E-step, we need to evaluate the expected log-likelihood We approximate Q(·) by its Monte Carlo counterpart, where K is the Monte Carlo sample size and Since this conditional distribution does not have a closed form, we draw samples from it using the Gibbs sampler (Geman and Geman 1984) incorporated with a Metropolis-Hastings step (Tanner 1993). The detailed algorithm is provided in Appendix A.

M-
Step. In the M-step, we update the parameters to their current values, curr , which maximize Q( | prev ) defined in (11). Up to some negligible constant terms, we factorize Q(·) into One can easily see that Q 1 , Q 2 , Q 3 , and Q 4 depend on mutually disjoint sets of parameters in and therefore can be maximized separately. Specifically, π c 's are updated by maximizing Q 1 , (ν x,c , ψ,c ) and (ν y,c , φ,c ) are updated by maximizing Q 2 and Q 3 , and finally ( c , η,c , D λy,c ) are updated by maximizing Q 4 . When maximizing Q 2 and Q 3 , we apply iterative reweighted least square (IRLS) method and update ψ,c and φ,c under the constraint (7). The detailed algorithm for the M-step is provided in Appendix B.

Implementation and Cluster Membership Estimation
We stop the EM iterations when the parameters satisfy where δ 1 and δ 2 are predetermined constants, l denotes the lth parameter in . Following the suggestion in Booth and Hobert (1999), we set δ 1 = 0.001 and δ 2 = 0.005. Increasing the Monte Carlo sample size K can reduce the Monte Carlo errors in (11), but may waste computation time especially during the early iterations when the parameter values are still far from the truth. As a compromise, we gradually increase K along the iterations. Specifically, in our simulation study and data analysis, we set K = 500 for iterations 1-4, K = 5000 for iterations 5-9, K = 10,000 for iterations 10-19 and K = 20,000 for iterations 20 and over.
At convergence of the algorithm, we estimate the conditional probability of subject i belonging to cluster c given the observed data as π ic = K −1 K k=1 ω (k) ic , c = 1, . . . , C, where ω (k) ic 's are the Monte Carlo samples at the final E-step. We assign the cluster membership of subject i to arg max c { π ic }.

Information Criterion
Model selection is one of the most important issues in modelbased clustering problems. In our setting, there are three key features that need to be determined, that is, the number of clusters C, the numbers of leading principal components {P 1,c } C c=1 and {P 2,c } C c=1 , and the dimensions q 1 and q 2 of the spline bases S 1 and S 2 given in Equation (8). Let M be a candidate model, and N M be the effective number of parameters in M, which is defined as the total number of parameters in M minus the number of constraints in (7). We choose the model that minimizes the following Baysian information criterion (BIC): where n is the number of subjects and Q M is the Monte Carlo average defined in (11) using the Monte Carlo samples from the final EM iteration. The BIC in (14) is a special case of the IC Q criterion in Ibrahim, Zhu, and Tang (2008). Instead of using the loglikelihood of the observed data, which does not have a close form, they used the Monte Carlo expectation of the logarithm of the complete data likelihood. By using the final outputs of the MCEM algorithm, it does not require any extra effort to evaluate the IC Q criterion. Ibrahim, Zhu, and Tang (2008) performed intensive simulation studies on an AIC version of (14) which replaces the penalty term with 2N M , and it performed well in various missing data settings. They also proposed a more sophisticated IC H,Q criterion, but that procedure is considerably more difficult to implement in our setting. We use the BIC version of IC Q to impose higher penalty on overestimating the model, and this procedure performs well in our simulation study.

Expedited Search for the Optimal Model
The MCEM algorithm is computationally intensive, and it therefore can be extremely time consuming to search over all possible models in the model selection procedure. We take further steps to expedite model selection by narrowing the search range.
Among the features that determine model complexity, the numbers of spline basis functions q 1 and q 2 are of least importance. The inverse of the number of spline bases is asymptotically equivalent to the bandwidth in a kernel smoother, and the spline estimators are consistent for a wide range of values for q 1 and q 2 . Assuming the mean and eigenfunctions in our model are twice continuously differentiable, we follow the asymptotic theory of Li and Hsing (2010) to set q 1 ≈ (nT ) 1/5 + 4 and q 2 ≈ (nS) 1/5 + 4, where T and S be averages of T i 's and S i 's. One can fix q 1 and q 2 at these suggested values or do a quick search in a neighborhood of these values.
In contrast to the basis functions, the number of clusters C and the numbers of principal components in each cluster, that is, {P 1,c } C c=1 and {P 2,c } C c=1 , are much more crucial. To further expedite the search, we assume that the number of principal components are the same across clusters, that is P 1 ≡ P 1,c and P 2 ≡ P 2,c , for c = 1, . . . , C. With these approximations, the BIC in (14) only depends on (C, P 1 , P 2 ), and we can minimize the BIC using a three-dimension grid search. Further simplification can be made to narrow the search range for P 1 , P 2 , and C, and we present such a strategy in the supplementary material.

SIMULATION STUDY
We conduct a simulation study to assess the performance of our proposed clustering and estimation method. In the simulation, n = 200 subjects are randomly drawn from C = 2 clusters, with marginal probabilities π 1 = 0.6 from cluster 1 and π 2 = 0.4 from cluster 2. The latent processes X i (t) and Y i (s) are spanned by P 1 = 2 and P 2 = 1 eigenfunctions respectively. We set the mean and eigenfunctions of these processes to be different in different clusters. Details on simulating the latent processes, including the eigenvalues, cross-correlation parameters between X i and Y i as well as the explicit mathematical expressions and plots of the mean and eigen functions are provided in the supplementary material. We first generate the latent processes, and then generate the binary baseline and relapse trajectories B i (t) and R i (s) based on model (1) with a logistic link. We repeat the simulation 100 times. For each simulated dataset, we use the method in Section 4 to determine (C, P 1 , P 2 ). The true model with (C, P 1 , P 2 ) = (2, 2, 1) is correctly selected 95% of the time. This high accuracy provides a strong support for our model selection approach.
We study the performance of clustering using the adjusted Rand index. The Rand index measures the agreement between two partitions of objects (Rand 1971). Hubert and Arabie (1985) proposed an improved version of the index which has an expected value of 0 and is bounded by ±1. A larger adjusted Rand index means a higher similarity between the two partitions. In our simulation, we calculate the adjusted Rand index between the clustering results and the true cluster memberships in each simulation, and then take the average over the 100 simulations to evaluate the effectiveness of our proposed method. For comparison, we perform two additional analyses. In the first analysis, we concatenate the baseline and relapse trajectories and apply the conventional k-means method, assuming that all subjects have the same covariance structure. In the second analysis, we perform clustering and estimation based on baseline and relapse trajectories alone. The average adjusted Rand Indices are 0.9376 for our proposed method, 0.7381 for the k-means method, and 0.7725 and 0.7441 by using baseline and relapse trajectories alone, respectively. These results indicate the superiority of our method over these existing alternative methods.
To assess the accuracy of the estimation, we calculate the mean square error (MSE) for the scalar parameters, and the mean integrated squared error (MISE) for the mean and eigen functions. The detailed results are given in Tables W.1-W.3 in the supplementary material. Compared to the approach using baseline or relapse trajectories alone, our proposed joint modeling approach leads to considerable reductions in MSEs (mean reduction: 44%, range: 8% to 73%) for the scalar parameters and moderate reductions in MISE (mean reduction: 12%, range: 3% to 22%) for the mean and eigen functions. Note that the cross-correlation parameters can only be estimated by the joint modeling approach.
We also provide some graphical summary of the estimated functions in Figures W.4 and W.5 in the supplementary material. The true μ(·) functions, the mean and the 5th and 95th percentiles of the estimated curves based on the joint model are given in Figure W.4. Similar plots for φ(·)'s and ψ(·)'s are provided in Figure W.5. From these plots, we find that the means of the estimated functions are almost identical to the true functions, indicating very small biases. Moreover, the percentile bands are all very narrow indicating small variances in these functional estimators.

COCAINE USE DATA ANALYSIS
We apply our proposed method to cluster the baseline and relapse trajectories of cocaine use described in Section 1. In light of the potential weekly use patterns in the baseline period, we align the baseline trajectories in such a way that all baseline trajectories started on the first Monday of the baseline period and lasted for 80 days. For the relapse trajectories, we will only consider data in the first 90 days, because there were significantly more missing data in the second 90-day period. In particular, the 59 subjects enrolled in the first phase of the study had no data reported 90 days after the treatment. We use the actual number of post-treatment days to index the relapse trajectories. Conventional approaches to analyze similar data are typically conducted under a regression framework. Specifically, summary statistics extracted from the relapse trajectories such as percent days abstinent and time to first relapse are often used as dependent variables, whereas summary statistics derived from the baseline trajectories such as frequency and average daily use amount are included as predictors (e.g., Carroll et al. 1993;Anton et al. 2006;Sinha et al. 2006). For modeling the mean and eigenfunctions, we use cubic Bspline bases with seven interior knots, where the interior knots are equally spaced over the baseline and relapse periods, respectively. We determine the Monte Carlo size and the stopping rule as described in Section 3.2 and closely monitor the convergence of algorithm. Additional trace plots for the Gibbs sampler and the EM iterations are provided in the supplementary material.
Following the procedure described in Section 4, we obtain the minimum BIC score from the model with C = 3, P 1 = 3, and P 2 = 2. In other words, we obtain three clusters, and in each cluster there are three and two leading FPCs for the baseline and relapse trajectories, respectively. At the convergence of MCEM, the numbers of subjects in each cluster are n 1 = 41, n 2 = 59, and n 3 = 33. To summarize the three clusters, we find that the average daily use frequency for baseline and relapse periods are 0.73 and 0.17 in cluster 1, 0.43 and 0.14 for cluster 2, and 0.37 and 0.08 for cluster 3. Our estimation results using the proposed model provide further insights about the mean and covariance structures of the trajectories in each cluster than these simple summary statistics. Figure 2 shows the estimated mean probabilities of daily cocaine use, together with the associated pointwise 95% bootstrap confidence intervals based on 100 bootstrap replicates, where the bootstrap procedure is carried out by resampling subjects within each cluster. The estimated mean probabilities describe the mean structures of the trajectories in different clusters. For all clusters, the probabilities for the baseline trajectories are always higher than those for the relapse trajectories, which indicates the benefit of treatment. For the baseline trajectories, cluster 1 has overall the highest probabilities, followed in turn by clusters 2 and 3. The same order holds for the relapse trajectories. This observation suggests that one's baseline cocaine use pattern is related to relapse, in that a heavier cocaine user in the baseline period tends to use more cocaine after the treatment. Such a phenomenon has been well documented in literature (e.g., Fox et al. 2006).

Results on Mean Structures of the Trajectories
There appear to be weekly patterns in the baseline period for the subjects in clusters 2 and 3, with the pattern being more prominent in cluster 2. The peaks generally occurred on Fridays, indicating that subjects in these two clusters were potentially recreational weekend users. Despite the local bumps in the estimated probabilities, the overall trend of the baseline trajectories appears to be flat. In contrast, the estimated probabilities for the relapse trajectories are smoother but show an increasing trend with time for both clusters 1 and 2, indicating that the treatment effect diminished over time. The increase is especially fast and significant for subjects in cluster 1, who were heavy cocaine users. It may appear puzzling at the first sight that the probabilities drop after day 60 for these subjects. However, the drop was likely due to increased dropout and incarceration rates, in which cases the subjects either failed to report any cocaine use or were forced to be abstinent. Hence, the treatment did not appear to be particularly effective for subjects in cluster 1.
Given the estimated covariance, we can further estimate the correlation and investigate how the correlation differs across clusters. The top panel of Figure 3 shows the estimated correlation surfaces for the baseline trajectories. There appears to be a periodic pattern in the plots for clusters 2 and 3, and a further look reveals that the cycle is seven days. This indicates a weekly baseline cocaine use pattern for subjects in these two clusters. However, no such weekly pattern is observed in the surface plot for cluster 1. The bottom panel of Figure 3 shows the estimated correlation surfaces for the relapse trajectories and none of these plots shows any weekly pattern. This suggests that the weekly cocaine users during the baseline period no longer maintained any weekly pattern after the treatment. Hence, the treatment effectively changed the cocaine use behaviors of subjects in these two clusters. For a further investigation, we average the correlations as Corr xc (r) = |t 1 −t 2 |=r Corr xc (t 1 , t 2 ) |t 1 −t 2 |=r 1 , Corr yc (r) = |s 1 −s 2 |=r Corr yc (s 1 , s 2 ) where r is the time lag. The top panel of Figure 4 shows plots of the average correlation for the baseline trajectories. Note that the weekly patterns become more obvious for both clusters 2 and 3. However, there is also a clear difference between them. Specifically, the average correlation tends to decrease as the time lag increases for cluster 3, but no such trend is present for cluster 2. A decreasing trend in the current context would suggest that cocaine use outcomes are less correlated when the time lag is larger. Hence, subjects in cluster 2 exhibited a more regular cocaine use pattern than those in cluster 3. We stress that such a conclusion cannot be reached by examining the mean structures alone. Note that a decreasing trend is also observed in the plot for cluster 1. The bottom panel of Figure 4 shows plots of the average correlation for the relapse trajectories. The overall trend in these plots is decreasing for all three clusters. There are some irregular alterations at the last few time lags. For a larger time lag, the sample size used to produce the average correlation is smaller.
This in turn will increase the variability of the estimate. The average correlation is large (> 0.7) at all r values for cluster 3. From Figure 2(b), we also see that the estimated probabilities of daily cocaine use are similar over time for the relapse trajectories in cluster 3. Combined together, these two results suggest that the relapse pattern of subjects in this cluster did not change significantly during the study period. In other words, the treatment appeared to have had a long-lasting and positive effect on cocaine users in this particular group.

Results on Cross-Correlation Between Trajectories
Let ρ jk,c (j = 1, 2, 3, k = 1, 2, c = 1, 2, 3) denote the estimated correlation between the jth FPC for the baseline trajectories and the kth FPC for the relapse trajectories in cluster c. Table W.4 in the supplementary material shows the estimates of these cross-correlation parameters and their 90% bootstrap confidence intervals. These results show that most of these correlation parameters are significantly different from zero, which indicates strong correlation between the baseline and relapse behaviors of a patient.
It is of particular interest to study ρ 11,c , because it contributes the most to the correlation between the baseline and relapse trajectories. For the baseline trajectories, the first FPCs explain 75%, 91%, 82% of variation for cluster 1, 2, and 3, respectively. For the relapse trajectories, the percentages of variation explained by the first FPCs become 82%, 90%, and 83%. The values of ρ 11,c are 0.43, 0.17, and 0.28 for clusters 1, 2, and 3, respectively, with 90% confidence intervals (0.13, 0.75), (0.07, 0.32), and (0.12, 0.36). The positive values of ρ 11,c suggest that within each cluster, if one used more cocaine in the baseline period, he/she tended to use more after the treatment. This result agrees with a similar finding that we made by visually comparing the estimated probabilities of daily cocaine use shown in Figure 2. The larger ρ 11,c is, the stronger the relationship is. Hence, the treatment outcomes were the least affected by their baseline pattern for users in cluster 2 (i.e., the strong weekly cocaine users), followed in turn by those in clusters 3 and 1. This result reaffirms that the treatment was effective in altering weekly cocaine users' cocaine use pattern. It also warrants further investigations to understand the mechanism causing the change. From a practical point of view, our results indicate that it is important to treat weekly cocaine users differently from the rest such as binge users. This is significant because most existing studies in literature often do not single them out as one particular group of interest.

Model Diagnostics
To assess the goodness-of fit of the fitted model, we calculate differences between observed frequencies and expected probabilities of cocaine use over time at baseline and relapse within clusters. Given cluster membership c, we define the standardized difference for baseline at day t as , i ] is an estimate of P [B i (t) = 1 | B i , R i ] andξ (k) i are Gibbs samples from the final iteration of the MCEM algorithm. The standardized differences for relapse ε y,c (t) can be similarly defined. We present plots of these statistics in Figure 5. It shows that the standardized differences distribute around 0 with no obvious patterns for both baseline and relapse in all clusters. This indicates that proposed model fits well to the data.

DISCUSSION
We consider subtyping cocaine dependent subjects based on their baseline and post-treatment cocaine use behaviors. The data are in the form of paired binary longitudinal cocaine use trajectories before and after a treatment. We propose a novel functional data analysis approach to jointly model and cluster these trajectories. Specifically, we model the non-Gaussian (i.e., binary) response variables by generalized linear mixed models from an exponential family, and model the latent longitudinal processes as functional data. Our approach allows different clusters to have distinct covariance structures, where most existing methods for clustering functional data assume the same covariance structure.
We have identified three clusters based on our cocaine abuse treatment data. The three groups of patients have different probabilities of cocaine use, which suggest differences in the mean structures of the latent processes. Moreover, their covariance structures are also appreciably different. These two observations combined reveal distinct cocaine use behaviors. Specifically, subjects in cluster 1 are heavy users, subjects in cluster 2 use less, but have a strong weekly pattern, and subjects in cluster 3 use cocaine the least often but have both a weekly and decaying correlation structure. Differences in the covariance structures play an important role in defining these clusters, but such differences cannot be detected using any existing methods. We have also found that the relationship between baseline and post-treatment cocaine use behaviors tends to vary across the three clusters.
To balance between model flexibility and complexity, we have made two important dimension reduction efforts in our model. We first register the mean and covariance function by splines, and then further reduce the dimensionality by the much celebrated, data-driven FPCA. We also develop an efficient MCEM algorithm to fit the proposed model. Our simulation and data analysis show that our proposed modeling and estimation procedure can produce insightful results that cannot be obtained by conventional methods.
There is a potential problem of underreporting for self-report drug use data like ours. We relied on the well-established TLFB procedure to construct cocaine use trajectories in our study. To ensure data quality, all staff had been trained by Ph.D.-level psychologists before data collection and were closely supervised when conducting the interviews. All study participants had been informed upfront that all data would be kept strictly confidential and nonidentifiable, and that no legal consequences as protected by law would be incurred to them by providing the most honest answers. The interviews were conducted in a quiet and comfortable testing room, and excellent rapport was established between the staff and the study participants. These measures helped minimize the chance of underreporting, even though this problem could not be completely eliminated.
The Gibbs algorithm proceeds by repeating steps (A.1) -(A.3) K + K times. The first K samples are discarded as a burn-in period. In the simulation study and data analysis, we use K = 500.