clusterMLD: An Efficient Hierarchical Clustering Method for Multivariate Longitudinal Data

Abstract Longitudinal data clustering is challenging because the grouping has to account for the similarity of individual trajectories in the presence of sparse and irregular times of observation. This paper puts forward a hierarchical agglomerative clustering method based on a dissimilarity metric that quantifies the cost of merging two distinct groups of curves, which are depicted by B-splines for the repeatedly measured data. Extensive simulations show that the proposed method has superior performance in determining the number of clusters, classifying individuals into the correct clusters, and in computational efficiency. Importantly, the method is not only suitable for clustering multivariate longitudinal data with sparse and irregular measurements but also for intensely measured functional data. Towards this end, we provide an R package for the implementation of such analyses. To illustrate the use of the proposed clustering method, two large clinical data sets from real-world clinical studies are analyzed.


Introduction
Biomedical studies often use multiple markers collected over time to depict the evolution of biological processes. In the practice of precision medicine, clustering such data helps divide patients into homogeneous groups so that targeted treatment can be applied to maximize therapeutic benefits (Paulsen et al. 2019).
Clustering the time course of the observed data is never an easy task because data proximity at selected time points does not guarantee the closeness of temporal trajectories. Most clustering methods that emerged in the last two decades originated from functional data analysis, where observations are generated from stochastic processes and individual trajectories are characterized by functional curves (Ramsay and Silverman 2005). In a review article, Jacques and Preda (2014a) classified the existing clustering algorithms for functional data into four broad categories: raw-data methods, filtering methods, adaptive methods, and distance-based methods. Some of these methods have used the K-means clustering technique on B-spline coefficients for individual curves (Abraham et al. 2003), while others have used random-effects models for the spline coefficients derived from individual curves (James and Sugar 2003), wavelets (Giacofci et al. 2013), and functional principal component scores (Bouveyron and Jacques 2011; Jacques and Preda 2014b; Schmutz et al. 2020). These methods were mainly developed for clustering univariate longitudinal or functional outcomes. Clustering methods for multivariate data were few; notable works included Tokushige et al. (2007); Ieva et al. (2013); Genolini et al. (2015); Schmutz et al. (2020). Longitudinal data present a different set of challenges for clustering analysis because data tend to be more sparsely and often irregularly measured (Rice 2004). For this reason, methods developed for functional data are not always applicable to longitudinal data. But some K-means-based algorithms can be extended to longitudinal data. For example, Genolini and Falissard (2010, 2011) proposed a K-means method for longitudinal data (KmL) that measures the pairwise Euclidean distance between two trajectories when data are assessed regularly and at the same time points.
One could also perform K-means clustering analyses on the coefficients of B-spline basis functions (Abraham et al. 2003; Garcia-Escudero and Gordaliza 2005), Fourier basis functions (Serban and Wasserman 2005), and wavelet basis functions (Giacofci et al. 2013), as long as the individual curve fitting is possible. Such K-means-based algorithms alleviate the constraints on observational regularity required by KmL but have reduced computational efficiency.
For sparse and irregularly measured longitudinal data, model-based methods are good alternatives to the K-means methods, especially considering the maturity of longitudinal modeling. Many clustering methods have been developed based on mixture models (Jones et al. 2001; De la Cruz-Mesía et al. 2008; McNicholas and Murphy 2010; McNicholas and Subedi 2012; Chamroukhi 2016; Chamroukhi and Nguyen 2019; Jacques and Preda 2014b; Schmutz et al. 2020). With correctly specified models, these methods usually have respectable performance, even in sparsely observed data situations, because the model parameters are estimated with aggregated data from all subjects. A significant issue with the model-based methods, however, is the lack of verifiability of the model assumptions (Biernacki et al. 2000). Several authors have used mixture models with regression splines to reduce the risk of model misspecification (James and Sugar 2003; Luan and Li 2003; Coffey et al. 2014). Typically, model-based methods tend to have a heavier computational burden because different mixture models must be explored to determine the optimal number of clusters by the Bayesian Information Criterion (BIC) (Schwarz 1978). To improve on this, Chamroukhi (2016) adopted a robust EM algorithm (Yang et al. 2012) to automatically ascertain the number of mixtures of regression models. More recently, Zhu et al. (2018) proposed a B-spline-based penalized regression method that allows the estimation of individual-specific spline coefficients and selects the number of clusters simultaneously using LASSO. Though the approach is theoretically elegant, its performance deteriorates quickly with an increasing sample size, as we shall demonstrate in Section 3.1.
In this article, we put forward a new clustering method for longitudinal data with sparse and irregular observations. The method falls into the broad category of hierarchical agglomerative algorithms. In other words, it is a "bottom-up" procedure that combines trajectories as it moves up the hierarchy whenever the "cost" of merging two groups of trajectories is low (Ward 1963). We show that the proposed method exhibits excellent clustering accuracy, is insensitive to the underlying cluster sizes, and is computationally efficient. The method can easily handle multiple-outcome longitudinal data, including outcomes that are not readily distinguishable among clusters. Viewed as a whole, we believe that the proposed method provides a reliable and efficient clustering tool for longitudinal data.
The rest of the article is organized as follows: In Section 2, we describe the proposed clustering method and indices for determining the number of clusters. In Section 3, we present numerical results from an extensive simulation study; the performance of the proposed method is compared with competing methods. In Section 4, we present two real-data applications to demonstrate the use of the new clustering method. We end the paper in Section 5 with a few remarks. We present the technical details and additional simulation results as Supplementary Materials.

Methods
We propose a new clustering method for multivariate longitudinal data by carrying out the grouping in a hierarchical fashion (Ward 1963). The method uses a dissimilarity metric that is structurally analogous to the classic Chow test statistic (Chow 1960) and reflects the "merging cost" of combining two clusters. We shall demonstrate the good operating characteristics of the proposed method.

Modeling and Estimating Longitudinal Trajectories
We first introduce the notation for describing the temporal trajectories of longitudinally measured outcomes. Let {(t_{ij}, Y_{ij}) : i = 1, 2, ..., N, j = 1, 2, ..., n_i} be a collection of observed data from N subjects, where the ith subject has n_i observations Y_{i1}, ..., Y_{i n_i}, made at times t_i = {t_{ij}}_{j=1}^{n_i}. The observation times are subject-specific and do not have to be common across subjects.
Suppose that the longitudinal trajectory for the ith subject can be described by the following model

Y_{ij} = f_i(t_{ij}) + e_i(t_{ij}),   (1)

where f_i(·) is the fixed component, and e_i(·) is the random component with mean zero.

To retain maximal flexibility, we avoid specifying a parametric functional form for f_i(·). Instead, we use B-splines with p basis functions B_1(·), ..., B_p(·) to approximate f_i(·) for i = 1, 2, ..., N, and conduct regression spline estimation for f_i(·) using the observed data. Letting X_{ij} = (B_1(t_{ij}), ..., B_p(t_{ij})) be the row vector of the B-spline basis functions evaluated at t_{ij}, we fit (1) by rewriting the model as

Y_i = X_i β_i + e_i,

where Y_i = (Y_{i1}, ..., Y_{i n_i})^T and X_i = (X_{i1}^T, ..., X_{i n_i}^T)^T, and where the random component e_i has an unspecified variance-covariance matrix at the observation times t_i for subject i. The B-spline coefficients β_i are estimated by the ordinary least-squares method.
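As an illustration, the subject-level spline fit can be sketched in a few lines. The following is a minimal Python sketch (not the clusterMLD implementation; the helper names, knot placement, and the use of scipy are our own), fitting a cubic B-spline with p basis functions by ordinary least squares:

```python
import numpy as np
from scipy.interpolate import BSpline


def bspline_design(t, interior_knots, lo, hi, degree=3):
    """Design matrix X with one column per B-spline basis function, evaluated at t."""
    # Repeat the boundary knots degree+1 times to form the full knot vector.
    knots = np.concatenate([[lo] * (degree + 1), interior_knots, [hi] * (degree + 1)])
    p = len(knots) - degree - 1  # number of basis functions
    # Identity coefficients turn BSpline into a basis evaluator: output is (len(t), p).
    return BSpline(knots, np.eye(p), degree)(np.asarray(t, float))


def fit_subject(t, y, interior_knots, lo, hi, degree=3):
    """OLS estimate of the subject-specific spline coefficients beta_i and the SSR."""
    X = bspline_design(t, interior_knots, lo, hi, degree)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    ssr = float(np.sum((np.asarray(y, float) - X @ beta) ** 2))
    return beta, ssr
```

With three interior knots and cubic splines, this yields p = 7 basis functions, matching the setting used in the simulations of Section 3.1.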

Cost of Merging Two Subgroups
Clustering is a process of partitioning an index set c = {1, 2, ..., N} into mutually exclusive subsets {c_k : k = 1, 2, ..., K}, where ∪_{k=1}^{K} c_k = c and c_k ∩ c_{k'} = ∅ for k ≠ k', such that subjects in the same set are homogeneous in some sense.
In longitudinal data clustering, we seek partitions where the subjects in cluster c_k share a common longitudinal trajectory, i.e., they have the same fixed component. Here, we seek to identify c_k for k = 1, 2, ..., K, such that c_k = {i ∈ c : f_i(·) = f_{c_k}(·)}. This process requires fitting the fixed component given in (1). We denote the observed data associated with cluster c_k as C_k, that is, C_k = {(t_{ij}, Y_{ij}) : i ∈ c_k, j = 1, ..., n_i}, k = 1, 2, ..., K.
Similarly, we merge the observations for two clusters c_u and c_v, and let C_{u,v} be the union of C_u and C_v. To determine the appropriateness of clustering, one needs a metric to quantify the cost of combining C_u and C_v. When the cost is low, the two subsets could be combined with a shared trajectory. When the cost is high, combining the subsets would lead to a larger error in the combined model and thus should be avoided. Errors associated with statistical models are typically described by the sum of squared residuals (SSR). We, therefore, contend that the following ratio of SSRs under the separate and combined models gives a quantification for the merging cost:

D(C_u, C_v) = [(SSR(C_{u,v}) − SSR(C_u) − SSR(C_v)) / p] / [(SSR(C_u) + SSR(C_v)) / (n_u + n_v − 2p)],

where n_u and n_v denote the total numbers of observations in C_u and C_v, and

SSR(C_k) = Σ_{i∈c_k} Σ_{j=1}^{n_i} (Y_{ij} − Ŷ_{ij})².

In the calculation, Ŷ_{ij} = f̂_{c_k}(t_{ij}) = X_{ij} β̂_k, where β̂_k is the ordinary least-squares estimate for β_k, the identical B-spline coefficients for all i ∈ c_k, using all the data in C_k. Alternatively, D(C_u, C_v) can also be viewed as a metric of dissimilarity between two sets of longitudinal data, C_u and C_v. A larger D(C_u, C_v) value indicates greater dissimilarity between C_u and C_v, and thus provides stronger evidence against merging the two subsets.
Remark 1. The numerator of D(·, ·) is conceptually similar to the "merging cost" proposed by Ward (1963) under the classical linear models. We note that for any C u , C v , D(C u , C v ) ≥ 0, meaning that merging actions would not incur negative costs. See Proposition 1 in Web Supplementary Materials.

Remark 2.
We also note that the structure of D(·, ·) is parallel to that of the Chow-test statistic (Chow 1960), which was originally proposed to test whether two datasets are generated from a common linear model. For our purpose, two subsets with the smallest D(·, ·) value among all possible pairs of subsets are least likely to be rejected for homogeneity. Hence sequentially merging two subsets with the smallest D(·, ·) value gives rise to a hierarchical agglomerative clustering algorithm.
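To make the Chow-type cost concrete, here is a minimal Python sketch (helper names are ours; the degrees-of-freedom convention follows the classic Chow statistic and the scipy-based design matrix is our own stand-in for the paper's spline fitting):

```python
import numpy as np
from scipy.interpolate import BSpline


def design(t, interior_knots, lo, hi, degree=3):
    """B-spline design matrix: one column per basis function."""
    knots = np.concatenate([[lo] * (degree + 1), interior_knots, [hi] * (degree + 1)])
    p = len(knots) - degree - 1
    return BSpline(knots, np.eye(p), degree)(np.asarray(t, float))


def cluster_ssr(cluster, interior_knots, lo, hi, degree=3):
    """SSR of one common spline fit to all (t, y) pairs in a cluster (list of subjects)."""
    t = np.concatenate([np.asarray(ti, float) for ti, _ in cluster])
    y = np.concatenate([np.asarray(yi, float) for _, yi in cluster])
    X = design(t, interior_knots, lo, hi, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2)), len(y)


def merge_cost(Cu, Cv, interior_knots, lo, hi, degree=3):
    """Chow-type cost D(Cu, Cv): relative increase in SSR from forcing a shared fit."""
    p = len(interior_knots) + degree + 1
    ssr_u, n_u = cluster_ssr(Cu, interior_knots, lo, hi, degree)
    ssr_v, n_v = cluster_ssr(Cv, interior_knots, lo, hi, degree)
    ssr_uv, _ = cluster_ssr(Cu + Cv, interior_knots, lo, hi, degree)
    return ((ssr_uv - ssr_u - ssr_v) / p) / ((ssr_u + ssr_v) / (n_u + n_v - 2 * p))
```

Because the pooled fit is the constrained version of the two separate fits, SSR(C_{u,v}) ≥ SSR(C_u) + SSR(C_v), so the cost is nonnegative, consistent with Remark 1.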
Remark 3. The metric D(·, ·) is readily extendable to situations of longitudinal clustering with multiple outcomes: A weighted sum of D(·, ·) across outcomes represents the overall merging cost,

D(C_u, C_v) = Σ_{h=1}^{H} W_h D(C_u^{(h)}, C_v^{(h)}),

where W_h is the standardized weight for outcome h, and C_k^{(h)} = {(t_{ij}, Y_{ij}^{(h)}) : i ∈ c_k, j = 1, ..., n_i} contains the data in cluster k for outcome h.
To maintain computing efficiency, we propose an ad-hoc method to determine the weights by expanding the Chow-type statistic defined above: For each outcome h, we compute a statistic D^{(h)} contrasting a single fit to C^{(h)} against the subject-specific fits to the C_i^{(h)}, where C^{(h)} includes all subjects, and C_i^{(h)} is the "bottom-level cluster" that contains only the observations from the ith subject, provided the data allow for fitting an individual-specific spline for calculating SSR(C_i^{(h)}). Some data preprocessing steps are needed when the data from an individual are not sufficient to fit the spline model, as we shall illustrate in Section 2.4. The outcome-specific weight W_h can then be chosen as

W_h = D^{(h)} / Σ_{h'=1}^{H} D^{(h')}.

Remark 4. In our data examples for multivariate clustering, we used the same number of spline basis functions (p) defined at the same knots, based on the locations of the observation times, for numerical convenience. The proposed method, however, does not require the same knots to be used across outcomes. In fact, when the observation times and/or the number of observations are drastically different across the outcomes, outcome-specific spline fittings would be recommended.
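A minimal sketch of this weighting idea follows (our own Python illustration; for brevity a linear-in-time basis stands in for the B-spline fit, and the function names are illustrative). Outcomes whose pooled fit is much worse than the subject-specific fits receive larger weights:

```python
import numpy as np


def chow_stat(subjects):
    """Chow-type statistic D^(h): relative increase in SSR from pooling all
    subjects into one fit versus subject-specific fits (linear basis for brevity)."""
    def ssr(ts, ys):
        ts, ys = np.asarray(ts, float), np.asarray(ys, float)
        X = np.column_stack([np.ones_like(ts), ts])  # intercept + slope
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        return float(np.sum((ys - X @ beta) ** 2))

    ssr_indiv = sum(ssr(t, y) for t, y in subjects)
    t_all = np.concatenate([np.asarray(t, float) for t, _ in subjects])
    y_all = np.concatenate([np.asarray(y, float) for _, y in subjects])
    return (ssr(t_all, y_all) - ssr_indiv) / ssr_indiv


def outcome_weights(outcomes):
    """Standardized weights W_h proportional to D^(h), summing to one."""
    d = np.array([chow_stat(s) for s in outcomes])
    return d / d.sum()
```

An outcome whose subjects follow visibly different trajectories yields a large D^{(h)}, while a pure-noise outcome yields a small one, mirroring the weight ordering reported in the simulations of Section 3.2.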

Determination of the Number of Clusters
Choosing an optimal number of clusters is an essential component of all clustering methods. Since there is no universally accepted criterion to define the "optimality" of data partitioning, solutions tend to be problem-specific. Two approaches are frequently used in the literature. One is to optimize the difference between within-cluster (or intra-cluster) and between-cluster (or inter-cluster) dissimilarity, as done in the CH index (Caliński and Harabasz 1974) and the Silhouette statistic (Rousseeuw 1987). The other is to maximize the "gap" between the within-cluster dissimilarity and its expectation under the hypothesis that the data are fully homogeneous (Tibshirani et al. 2001).
In this work, we explore both approaches in the context of longitudinal data clustering. For a given partition of K clusters, the within-cluster dissimilarity can be defined as

W(K) = Σ_{k=1}^{K} SSR(C_k),

and the between-cluster dissimilarity as

B(K) = Σ_{k=1}^{K} Σ_{i∈c_k} Σ_{j=1}^{n_i} (Ŷ_{ij} − Ỹ_{ij})²,

where Ỹ_{ij} = f̂(t_{ij}) is the predicted outcome using all data, for which f̂(·) is the B-spline estimator of the population mean function. The CH index at K partitions is defined, therefore, by

CH(K) = [B(K)/(K − 1)] / [W(K)/(N − K)]

for longitudinal data. The optimal number of clusters corresponds to the maximum value of the CH index, that is, K̂ = arg max_K CH(K).
Similarly, we propose a new Gap statistic using the between-cluster dissimilarity, defined as

B_b(K) = Σ_{k=1}^{K} D(C_k, C_{−k}),

where C_{−k} represents the data from all subjects except those in set c_k. The new Gap statistic is therefore

Gap_b(K) = B_b(K) − E[B_b(K)].   (6)

The expectation in equation (6) can be calculated under certain data homogeneity assumptions. For example, Tibshirani et al. (2001) assumed that the observed data are uniformly distributed in the sample space for clustering cross-sectional data, and they used the bootstrap method to compute the expectation. For longitudinal data, however, this approach appears to be numerically intractable. To maintain computational efficiency, we propose an ad-hoc procedure to compute the expectation under the following homogeneous longitudinal B-spline model:

Y_{ij} = X_{ij} β + e_{ij},

where E[e_{ij}] = 0 and var[e_{ij}] = σ². Under this model, E[B_b(K)] admits a closed form in σ², following the result of Proposition 2 in the Web Supplementary Materials, where σ² can be estimated by fitting the B-spline model with all the data. Following the recommendation of Tibshirani et al. (2001), we use the first turning point as the chosen number of clusters, that is,

K̂ = min{K ≥ 1 : Gap_b(K) ≥ Gap_b(K + 1)}.
Remark 6. Our numerical experiments showed that Gap_b(K) performs better for data with a small number of clusters (≤5), and it is more sensitive than the CH(K) statistic in picking out clusters with extreme sizes. Additionally, it is numerically convenient to compute, as shown above, compared with the original Gap statistic, for which the bootstrap is required.
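Both selection rules reduce to a few lines once the dissimilarity values are available. The following Python sketch (our own; the turning-point rule is written exactly as described above, and the input values are assumed precomputed) illustrates them:

```python
def ch_index(B, W, N, K):
    """Calinski-Harabasz index from between- (B) and within-cluster (W) dissimilarity."""
    return (B / (K - 1)) / (W / (N - K))


def first_turning_point(gap):
    """Choose K at the first turning point of the Gap_b sequence.

    gap[k] holds Gap_b(K) for K = k + 1. Returns the smallest K with
    Gap_b(K) >= Gap_b(K + 1), or the largest K examined if the sequence keeps rising."""
    for k in range(len(gap) - 1):
        if gap[k] >= gap[k + 1]:
            return k + 1
    return len(gap)
```

In contrast to K̂ = arg max_K CH(K), the turning-point rule stops at the first local maximum of the Gap_b sequence rather than scanning all candidate values of K.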

The Clustering Algorithm
We propose a clustering algorithm that uses a greedy search strategy with a hierarchical merging mechanism for longitudinal data. The algorithm is structured in an agglomerative (i.e., bottom-up) manner such that the size of the subclusters increases as the algorithm proceeds. This process yields increasingly accurate estimates of the mean function for each subcluster and of the dissimilarity between subclusters, thereby enhancing the clustering performance in sparse longitudinal data situations. The essential steps are as follows:

1. Baseline subclusters: Assuming that for the ith subject the observed data {(t_{ij}, Y_{ij}) : j = 1, ..., n_i} (e.g., Figure 1(a)) allow for the fitting of a B-spline with p basis functions, we start with each subject as a baseline subcluster, fit the B-spline model, and calculate the SSRs and Gap_b.

2. Bottom-up merging process: Conduct a greedy search to merge the two subclusters with the minimal cost metric D into one cluster, and then update the SSRs and Gap_b after each merge. This process is repeated in a hierarchical manner up to the top level, where only one cluster is left. During the process, subclusters merged in previous steps stay in the same cluster. The process is depicted in a dendrogram (Figure 1(b)).

3. Determining the number of clusters (illustrated with Gap_b): Plot the sequential Gap_b statistics against the respective numbers of clusters during the merging process, and determine the optimal number of clusters by the first turning point in the plot, as shown in Figure 1(c).

4. Conclusion: Summarize the final clusters from the hierarchically structured clustering results with the number determined in Step 3, as shown in the color-coded dendrogram in Figure 1(d) and the subject-level clustering results in Figure 1(e).
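The bottom-up merging loop can be sketched generically. The following simplified Python sketch (ours, with an arbitrary pairwise cost function; the real algorithm recomputes spline SSRs and Gap_b at each merge) captures the greedy agglomerative mechanism:

```python
def agglomerate(clusters, cost_fn):
    """Greedy bottom-up merging: repeatedly fuse the pair with minimal cost.

    Returns the merge history (the dendrogram order) and the final single cluster."""
    clusters = [list(c) for c in clusters]
    history = []
    while len(clusters) > 1:
        # exhaustive search for the cheapest pair to merge
        u, v = min(((a, b)
                    for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cost_fn(clusters[ab[0]], clusters[ab[1]]))
        history.append((clusters[u], clusters[v]))
        merged = clusters[u] + clusters[v]
        clusters = [c for i, c in enumerate(clusters) if i not in (u, v)] + [merged]
    return history, clusters[0]
```

Because merges are never undone, the history defines a nested hierarchy that can be cut at the number of clusters chosen in Step 3.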

The data assumption in Step 1 is commonly used in longitudinal data clustering (Abraham et al. 2003; Garcia-Escudero and Gordaliza 2005; Zhu et al. 2018). When the assumption is not met, one can add a preprocessing step to form the baseline subclusters for situations where some individual subjects cannot form a baseline subcluster by themselves: Let c^(−) = {i ∈ c : n_i ≤ p} be the set of subjects whose longitudinal observations are not sufficient for the fitting of a subject-specific B-spline model with p basis functions, and let c^(+) = {i ∈ c : n_i > p} be the set of subjects with sufficient data to fit a subject-specific B-spline. Each subject in c^(+) forms an initial subcluster. The idea of the data preprocessing is to empty c^(−) by combining the "most" similar subjects such that the combined data are sufficient to fit a baseline subcluster. The process can be accomplished through another greedy search strategy as described below.

1. For each i ∈ c^(−), merge the data from this subject with that of each subject j satisfying n_i + n_j > p, fit a B-spline with p basis functions, and calculate the mean squared residual d(i, j) of the fitted model.

2. Let (u, v) be the pair attaining the smallest d(i, j). If v ∈ c^(−), update c^(−) by removing subjects u and v, and update c^(+) by adding the combined longitudinal observations of the two subjects as a new baseline subcluster in c^(+); otherwise, update c^(−) by removing subject u, and update c^(+) by merging the data from subject u into that of v ∈ c^(+).

3. Repeat Steps 1 and 2 until c^(−) becomes empty. The modified subjects in c^(+) then constitute the baseline subclusters for Step 1 of the clustering algorithm.
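The preprocessing loop can be sketched as follows (a simplified Python sketch of our own; a p-parameter polynomial fit stands in for the p-basis spline, a feasible partner is assumed to always exist, and the data structures are illustrative):

```python
import numpy as np


def msr(ts, ys, p):
    """Mean squared residual of a p-parameter polynomial fit
    (a stand-in for the p-basis B-spline fit)."""
    ts, ys = np.asarray(ts, float), np.asarray(ys, float)
    X = np.vander(ts, p)
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return float(np.mean((ys - X @ beta) ** 2))


def preprocess(subjects, p):
    """subjects: dict id -> (times, values). Returns baseline subclusters as lists
    of ids, after merging every data-poor subject (n_i <= p) into its most
    similar partner."""
    data = {i: (np.asarray(t, float), np.asarray(y, float))
            for i, (t, y) in subjects.items()}
    poor = [i for i in data if len(data[i][0]) <= p]        # c^(-)
    groups = {i: [i] for i in data if len(data[i][0]) > p}  # c^(+)
    while poor:
        u = poor.pop(0)
        # candidate partners with enough combined observations for a p-parameter fit
        cands = [j for j in poor + list(groups)
                 if len(data[u][0]) + len(data[j][0]) > p]
        v = min(cands, key=lambda j: msr(np.concatenate([data[u][0], data[j][0]]),
                                         np.concatenate([data[u][1], data[j][1]]), p))
        data[v] = (np.concatenate([data[u][0], data[v][0]]),
                   np.concatenate([data[u][1], data[v][1]]))
        if v in groups:
            groups[v].append(u)          # fold u into an existing subcluster
        else:
            poor.remove(v)               # two data-poor subjects form a new subcluster
            groups[v] = [u, v]
    return list(groups.values())
```

The key design point is that a data-poor subject is paired with whichever partner produces the smallest lack of fit under a shared curve, so the merge introduces as little heterogeneity as possible into the baseline subclusters.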
We have implemented the method in an R package, clusterMLD, short for Clustering for Multivariate Longitudinal Data (https://github.com/junyzhou10/clusterMLD). The package also provides an interactive web interface so that analysts who are not familiar with the R software can access the clustering functions.

Simulation Studies
We examined the operating characteristics of the proposed method and compared it against existing methods through an extensive simulation study. We first assessed the method's performance in a relatively straightforward setting involving one outcome and less sparse observations (Zhu et al. 2018); this is a setting where many existing clustering methods are readily applicable. We then examined the proposed method's performance in situations with multiple outcomes and sparse observations, where the existing methods are expected to struggle; in the latter setting, only a few existing methods are available for comparison. Finally, we compared the proposed method with the competing clustering methods in a functional data setting with complicated functional shapes and a large number of clusters.
Metrics of assessment. Key metrics include: (1) the number of clusters; (2) the overall accuracy of clustering, frequently quantified by the rate of pairwise agreement, known as the Rand index (RI) (Rand 1971); and (3) the time, in seconds, to complete the task. Theoretically, RI takes a value between 0 and 1, but it is unlikely to be near 0 even for a random classification when dealing with more than two groups. A more intuitive metric is the Adjusted Rand Index (ARI) proposed by Hubert and Arabie (1985):

ARI = (RI − E[RI]) / (max(RI) − E[RI]),

where the expected value of RI is calculated under completely random classification, and max(RI) is the highest value that a classification can achieve. ARI is close to zero for a poor clustering method, is bounded from above by 1, and approaches 1 as a clustering method becomes more accurate. ARI is applied to assess the overall accuracy of clustering, whether or not the true number of clusters is used. When the correct number of clusters is used, we also explored a clustering-specific accuracy (CSA) measure to quantify the percentage of subjects correctly assigned to the right clusters. We examined both the Gap_b and CH statistics for their performance in determining the number of clusters.
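For reference, the ARI can be computed directly from a contingency table of the two partitions. The following self-contained Python sketch uses the standard pair-counting form of the formula above (our own implementation, not any package's internals):

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: (RI - E[RI]) / (max(RI) - E[RI]) in contingency-table form."""
    n = len(labels_true)
    # pairs that agree within each cell of the cross-tabulation
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    a = sum(comb(c, 2) for c in Counter(labels_true).values())  # same-cluster pairs, truth
    b = sum(comb(c, 2) for c in Counter(labels_pred).values())  # same-cluster pairs, estimate
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    if max_index == expected:  # degenerate partitions (e.g., all singletons)
        return 1.0
    return (index - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling of the clusters: a partition that matches the truth up to a permutation of the labels still scores 1.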

Case 1: Single Outcome with Nonsparse Observations
We first considered a single-outcome longitudinal model of the form (1), using a setting previously explored by Zhu et al. (2018), which has four clusters with cluster-specific mean trajectories f_{c_k}(t), k = 1, ..., 4 (e.g., f_{c_4}(t) = 1.5 − 1.5t). We assumed a random error ε_{ij} ∼ N(0, 0.4²). However, instead of using 10 evenly spaced observation times in [0, 1], we simulated {t_{ij}}_{j=1}^{10} from a continuous uniform distribution U(0, 1) to add more variability to the observation times, requiring the interval between two adjacent observations to be larger than 0.06. We modeled the random component with a random quadratic function,

r_i(t) = b_{0i} + b_{1i} t + b_{2i} t²,

where the random coefficients (b_{0i}, b_{1i}, b_{2i})^T follow a mean-zero multivariate normal distribution with correlation coefficient matrix ρ_b and diagonal standard deviation matrices σ_b = diag(0.1, 0.2, 0.2) and σ_b = diag(0.2, 0.4, 0.4), indicating low and high noise scenarios, respectively. We generated data for 100 subjects for each of two cluster size combinations: a balanced case with 25 subjects in each cluster and an unbalanced case with 5, 25, 25, and 45 subjects in the four clusters.
For the clustering algorithm, we used cubic B-splines (Schoenberg 1946; Schumaker 2007) for the individual curve fitting, as described in the Methods section. Three internal knots were placed at the first, second, and third sample quartiles of all observation times to form the B-spline basis functions. This resulted in a total of 7 basis functions in the B-spline curve fitting (Ruppert 2002).
We conducted comprehensive comparisons between the proposed method and the eight competing clustering methods mentioned in the Introduction for which computing packages are available: PAM, GM, KmL, UReMix, NPG, longclust, funHDDC, and curvclust. The Gap statistic (Tibshirani et al. 2001) was used for PAM, and BIC (Schwarz 1978) for GM, longclust, funHDDC, and curvclust, to determine the number of clusters. For KmL, the CH index, the recommended choice among the alternatives provided in the R package kml, was used to select the optimal number of clusters. For UReMix, the number of mixtures was automatically determined through a robust EM algorithm (Yang et al. 2012). The number of clusters in NPG was decided by properly tuning the hyperparameters.
The simulation was repeated 100 times, and we reported the average number of identified clusters, the ARI, the CSA, and the computing time, together with the standard deviations of the three performance measures. The results are summarized in Table 1. Table 1 shows that the proposed method, clusterMLD with Gap_b, yielded highly accurate clustering results in the low noise settings, and its performance was less sensitive to extreme cluster sizes; the performance was somewhat reduced in the high noise settings but still respectable. With CH, clusterMLD yielded better results when the cluster sizes were balanced and was insensitive to the noise levels, but its performance was largely reduced in the settings with extreme cluster sizes. PAM, GM, and curvclust, on the other hand, often failed to identify the correct number of clusters. When a method failed to identify the correct number of clusters, we did not calculate its CSA level. Similar to clusterMLD with CH, KmL performed well when the cluster sizes were balanced but was unable to identify the right cluster number when the sizes were highly unbalanced. This suggests that CH is a less reliable statistic for identifying the right cluster number when the cluster sizes are highly unbalanced. UReMix had outstanding computational efficiency owing to the explicit updating rule of the EM algorithm, but its clustering accuracy left much to be desired, especially when the cluster sizes were unbalanced. NPG and funHDDC were comparable to the proposed method with Gap_b in determining the number of clusters on average in the tested settings, but not in the clustering accuracy measured by ARI and CSA, and they were much more costly in computing: In particular, NPG required 334-358 times the computing time of the proposed method. longclust appeared to perform slightly better than the proposed method with Gap_b when the noise level was high. However, the computational efficiency of longclust was much lower than that of the proposed method: On average, the computing times for longclust were 13-15 times those of the proposed method.
The empirical distribution of the identified cluster numbers based on the 100 simulation runs (Figure 1 in Web Supplementary Materials) showed that (1) the proposed method with Gap_b had the highest probability of identifying the right cluster number in the low noise settings and good probabilities in the high noise settings; and (2) the proposed method with CH had the highest probability of identifying the right cluster number in the balanced settings, but its performance was much reduced with extreme cluster sizes. (Note to Table 1: CSA was only reported when at least two of the 100 repetitions resulted in K̂ = 4.) NPG and longclust appeared to be comparable to the proposed method with Gap_b in overall clustering accuracy in the tested settings but required much more computing resources, which may prevent them from being applied to clustering larger longitudinal data. Overall, the excellent ability of clusterMLD (Gap_b) to identify the right cluster number, together with its high computational efficiency and clustering accuracy, offers a great advantage for longitudinal data clustering, especially when the data volume is large.

Case 2: Multiple Outcomes with Sparse Observations
For this part of the simulation, we considered a multivariate longitudinal model,

Y_{ij}^{(h)} = f_i^{(h)}(t_{ij}) + r_i(t_{ij}) + σ_{ih} + ε_{ijh},

where Y_{ij}^{(h)} represents the hth outcome from subject i on the jth occasion, and f_i^{(h)}(·) the hth fixed component, or the hth mean trajectory, for the ith subject. Herein, we considered five outcomes with four underlying clusters. The mean trajectories f^{(h)}(·), h = 1, ..., 5, for each cluster are given in Table 2 and graphically illustrated in Figure 2(A). In this setting, the clusters were mostly distinguished by Y^{(1)} and Y^{(2)} and less distinguished by Y^{(3)} and Y^{(4)}, while Y^{(5)} was completely noninformative in separating the clusters.
The random component r_i(·) was modeled by the same random quadratic function as in Case 1, with σ_b = diag(2, 0.3, 0.06) and the same ρ_b. The terms σ_{ih}, h = 1, ..., 5, were additional random effects describing the correlations among the outcomes; they were generated from a multivariate normal distribution with mean zero and covariance matrix σ = diag(3, 3, 3, 3, 3). As pure noise, outcome 5 was set to be totally independent of the other four outcomes. The random measurement error ε_{ijh} was simulated from either a normal distribution N(0, η²) or a uniform distribution U(−η/2, η/2).
To evaluate the performance of the proposed method at different noise levels, we chose η = 3 or 6. The sparse and irregular observations were simulated as follows. For subject i, we sampled the number of observations n_i from a discrete uniform distribution between 4 and 12. The observation times {t_{ij}}_{j=1}^{n_i} were then taken as the order statistics of n_i random draws from U(0, 11), with the first observation fixed at 0 and the interval between two adjacent observations set to be greater than 0.5. Sample trajectories, color-coded by cluster, are plotted in Figure 2(B) for each outcome.
The performance under different cluster size combinations was also explored: In setting S0 we considered a balanced case wherein the ratio of four cluster sizes was 1:1:1:1; S1 represented a case where one cluster was extremely small, with the cluster size ratio being 1:13:13:13; and S2 represented a case where one cluster was much larger than the other clusters, with the cluster size ratio being 1:10.33:1:1. Settings S1 and S2 were designed to assess whether the clustering algorithm was influenced by an extremely small or large cluster, a situation likely to occur in studies of rare or overly abundant disease phenotypes. Furthermore, to explore the influence of sample size, simulations with a total sample size of 200 and 400 were conducted. When the total sample size was 200, all four clusters had 50 subjects in S0; one cluster had 5, the rest had 65 subjects each in S1; one cluster had 155, and the rest had 15 subjects each in S2.
We again used cubic B-splines (Schoenberg 1946;Schumaker 2007) with the three internal knots as described in Section 3.1 to carry out the individual functional curve fitting. For subjects without sufficient observations, the data preprocessing step was implemented, as described in Section 2.4.
Except for the proposed method clusterMLD, among the eight competing methods used in Case 1, only the KmL and funHDDC packages have been extended to handle multiple outcomes. However, KmL3d (Genolini et al. 2015), the multivariate version of KmL, requires the observations to be aligned at the same time points and thus could not accommodate the simulated data. As a result, we only compared clusterMLD with funHDDC. For each of the simulation settings, we ran 100 replications. The means and standard deviations of the identified cluster number, ARI, and CSA are summarized in Table 3 (comparison of the mean (sd) of K̂ in 100 replicates between clusterMLD and funHDDC) and Table 4, respectively. Table 3 shows that clusterMLD with Gap_b was generally accurate in identifying the cluster number (K = 4), with small standard deviations, whereas the performance of clusterMLD with CH deteriorated, especially when the cluster sizes were severely unbalanced. funHDDC often deviated from the correct number, except in setting S2, where its numbers were closer. The performance of clusterMLD with Gap_b improved with increasing sample size, and the method performed better with normal errors than with uniform errors. The empirical distribution of the identified cluster numbers, presented in Figure 2 in Web Supplementary Materials, showed that the probabilities of identifying the right cluster number were much higher in all tested settings for clusterMLD with Gap_b, but not for funHDDC. Additionally, Table 4 shows that the ARI and CSA (when available for calculation) were much higher for clusterMLD with Gap_b than for funHDDC. It is possible that the less satisfactory performance of funHDDC was due to the fact that it was designed for functional data with more frequent observations.
Computational efficiency was an added advantage of clusterMLD over funHDDC in this case: On average, clusterMLD took 2.92 and 11.35 seconds to complete the task for sample sizes N = 200 and N = 400, respectively; the corresponding numbers were 26.79 and 62.28 seconds for funHDDC.

In the low noise data setting (η = 3), the average estimated importance weights W h for the five outcomes were 0.419, 0.295, 0.097, 0.125, and 0.064 with uniformly distributed measurement error and 0.411, 0.278, 0.111, 0.126, and 0.074 with normally distributed measurement errors, respectively. In the high noise setting (η = 6), the corresponding weights were 0.418, 0.286, 0.103, 0.126, and 0.067 with uniformly distributed measurement error and 0.384, 0.266, 0.124, 0.135, and 0.090 with normally distributed measurement error, respectively. The estimated weights reflected the order of importance of the multiple outcomes in differentiating the clusters, as shown in Figure 2.
Though the method we developed is mainly for clustering longitudinal data, it is also applicable to clustering functional data, where it requires no additional data preprocessing steps. We conducted an additional simulation study to assess its performance in clustering functional data against the existing methods, generating functional data with a large number of clusters (K = 10) and more complicated functional shapes. The full results of this simulation study are presented in the Web Supplementary Materials. Briefly, the two best-performing methods were clusterMLD with CH and GM, while the other methods performed rather poorly to varying degrees. Among the two top performers, GM had a 98% chance (ARI = 0.996) of identifying the correct cluster number, whereas clusterMLD with CH had an 81% chance (ARI = 0.976). A distant third, funHDDC, had only an 11% chance of identifying the right cluster number, with ARI = 0.808. Interestingly, clusterMLD with GAP_b seemed to lose its advantage in selecting the right cluster number when that number is large.
In summary, clusterMLD with GAP_b delivers highly reliable performance for clustering longitudinal data across a variety of data situations, compared with the existing methods, when the number of clusters is not large, and it is readily implementable with multivariate outcomes. The method's computational efficiency offers an important advantage in clustering large volumes of data. As confirmed by the simulation studies, these characteristics make clusterMLD with GAP_b an appealing clustering tool for a broad class of longitudinal studies, as shown in the two real applications below.

Real-World Applications
The method discussed in the current research is readily applicable to real data. To demonstrate its general applicability, we present two clustering analyses using data from two real clinical investigations.

The Systolic Blood Pressure Intervention Trial
The Systolic Blood Pressure Intervention Trial (SPRINT) is a randomized clinical trial aimed at reducing cardiovascular complications in people with hypertension by aggressively lowering systolic blood pressure (BP). Participants were randomly assigned to two arms: an intensive treatment arm, where the systolic BP goal was set to 120 mmHg, and a standard treatment arm, where the systolic BP goal remained at 140 mmHg. Therapeutic decisions on how to bring down BP were left to the treating physicians. The study tracked BP in participants at approximately three-month intervals for up to five years and found that the intervention resulted in significantly lower systolic and diastolic BP in patients who received the intensive treatment (The-SPRINT-Research-Group 2015). The SPRINT data are publicly available from the National Heart, Lung, and Blood Institute through its Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) (https://biolincc.nhlbi.nih.gov/home) under signed Research Materials Distribution Agreements (RMDA).
In this research, we performed a clustering analysis of BP data from SPRINT participants. Since the original trial had established the blood pressure-lowering effect of the SPRINT intervention, one would expect a good clustering method to successfully differentiate patients with well-controlled blood pressure from those with poorly controlled blood pressure, with the cluster membership roughly corresponding to the original treatment assignment: the lower blood pressure cluster should consist mainly of intensively treated patients, and the higher blood pressure cluster mainly of control patients.
The SPRINT study had a total of 9,173 participants, including 4,600 in the intensive treatment arm and 4,573 in the control arm. The participants generated 144,824 BP measurements; each patient, on average, contributed 15.8 BP measures. We excluded those with only baseline BP because they did not contribute any discriminating information to the clustering. Systolic BP was used as the primary outcome so that all methods described in Section 3 could be applied. The pointwise mean longitudinal systolic BP patterns in the two treatment groups are shown in Figure 3(a).
The clustering results are presented in Table 5. We used Cohen's Kappa (κ) coefficient (Cohen 1960) to assess the agreement between the cluster membership and the original treatment assignment. The proposed method identified two clusters (K = 2), achieved a high Kappa coefficient (κ = 0.647), and had superior computational efficiency: the computing time on a MacBook equipped with a 2.3GHz 8-Core Intel i9 processor was 147.17 seconds. The low and high systolic BP clusters included 4,674 and 4,499 patients, respectively. In the low BP cluster, 3,828 (81.9%) were from the intensive treatment arm; in the high BP cluster, 3,729 (82.8%) were from the standard treatment group. One should not anticipate perfect agreement between cluster membership and the original treatment assignment, however, because the SPRINT intervention did not work for every patient, and some control patients could achieve lower BP than their peers in the same arm. We present the pointwise mean systolic BP trajectories for the two clusters in Figure 3(b); the mean trajectories clearly resemble the BP trajectories of the two treatment groups. Most of the competing methods did not fare well: PAM, GM, longclust, funHDDC, and curvclust yielded less meaningful numbers of clusters that contradicted the findings from the main trial paper (The-SPRINT-Research-Group 2015). The only other method that identified two clusters was KmL, but its Kappa value (κ = 0.202) was disappointing, and the method consumed 23 times more computing time. We were not able to ascertain results from UReMix and NPG because they were not equipped to deal with a dataset of this size; we had to terminate both programs because of memory overflow and excessive computing time.
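Cohen's κ has a simple closed form from the cluster-by-arm cross-tabulation. The sketch below is illustrative code (not part of the reported analysis) that assembles the table from the counts quoted above and approximately reproduces the reported κ = 0.647; the tiny discrepancy reflects rounding in the published counts.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square agreement table
    (rows and columns index the same categories in matching order)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_obs = np.trace(table) / n                            # observed agreement
    p_exp = table.sum(axis=1) @ table.sum(axis=0) / n**2   # chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

# Cross-tabulation assembled from the counts reported above:
# rows = cluster (low BP, high BP), cols = arm (intensive, standard)
sprint = [[3828, 4674 - 3828],
          [4499 - 3729, 3729]]
kappa = cohens_kappa(sprint)   # close to the reported kappa = 0.647
```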

The PREDICT-HD Study
In the second application, we used clustering analysis to investigate Huntington's disease (HD) progression phenotypes. HD is a neurodegenerative disease caused by an expanded cytosine-adenine-guanine (CAG) trinucleotide repeat in the first exon of the huntingtin (HTT) gene (MacDonald et al. 1993). The disease debilitates motor function in the afflicted, often accompanied by accelerated impairment of cognitive function (Long et al. 2015; Walker 2007). Diagnosis is typically made using the motor subscale of the Unified HD Rating Scale (UHDRS) (Shoulson and Fahn 1979; Kieburtz et al. 2001). PREDICT-HD is a 12-year observational study, conducted between 2002 and 2014, of prodromal HD individuals who carried the expanded HTT allele but had not met the motor diagnostic criteria for HD at study entry. The study was conducted at 33 sites across six countries (USA, Canada, Germany, Australia, Spain, and the UK). Large volumes of data on neuroimaging and on motor, cognitive, and psychiatric assessments were collected for predicting the onset of HD (Paulsen et al. 2014).
In this research, we used the PREDICT-HD data to study phenotypes of HD progression in motor and cognitive impairment and to examine how these progression phenotypes relate to disease onset. The data are publicly available through the NIH Database of Genotypes and Phenotypes (dbGaP); data requests should be sent to ninds-dac@mail.nih.gov.
We selected five motor and cognitive measures that are commonly used for tracking HD progression: the total motor score (TMS) (Kieburtz et al. 2001; Hogarth et al. 2005), the Symbol Digit Modalities Test (SDMT) (Smith 1973), and the three Stroop tests, i.e., the Stroop Color Test (stroopco), the Stroop Word Test (stroopwo), and the Stroop Color-Word Interference Test (stroopin) (Stroop 1935; Golden 1978). All five measures are on numerical scales, with a smaller value indicating more impairment, except for TMS, where a larger value corresponds to greater motor impairment. For the 1,006 participants in PREDICT-HD, the timings of assessment, though scheduled annually, differed considerably among participants. The average number of observations per participant was 5.62, with a minimum of 2 and a maximum of 13. The application represents a typical example of multiple-outcome longitudinal data with irregular and sparse observations, the very setting that motivated the development of the proposed method.
Two clustering methods, clusterMLD and funHDDC, are readily applicable. For clusterMLD, we fitted cubic B-spline curves with interior knots placed at the three quartiles of the total observation time. clusterMLD identified three clusters, each representing a different disease progression phenotype, coinciding well with the disease progression patterns perceived by the Huntington's disease research community (Duff et al. 2010; Harrington et al. 2012; Paulsen et al. 2013). Since funHDDC produced only two clusters, we based the subsequent analysis on the clusters generated by clusterMLD. For narrative convenience, we refer to them as Clusters 1, 2, and 3, respectively covering 317 (32.2%), 332 (33.7%), and 336 (34.1%) participants. We present the mean B-spline functional curves in Figure 4. The figure shows that SDMT and the three Stroop tests were clearly separated at baseline, with subjects in Cluster 1 being more impaired throughout the observational window, especially near the end, when impairment accelerated. For TMS, the analysis showed that Cluster 1 had a much higher rate of increase, suggesting more rapid deterioration. In contrast, Cluster 3 progressed slowly in both motor and cognitive decline. A closer examination of the data showed that only 25 participants (7.4%) in Cluster 3 received an HD diagnosis, with a median time to diagnosis beyond 12 years after enrollment. The numbers of HD diagnoses for Clusters 2 and 1 were 64 (19.3%) and 149 (47.0%), with median times to HD diagnosis of 11.1 and 5.6 years from study enrollment, respectively. The estimated survival functions are presented in Figure 5(a), which confirms that participants in the three clusters had significantly different time-to-HD-diagnosis distributions (p-value < 0.001, log-rank test; Peto and Peto 1972).
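The curve-fitting step described above, a cubic B-spline with interior knots at the three quartiles of observation time, can be illustrated with SciPy's least-squares spline routines. The data below are simulated stand-ins for a single subject's sparse, irregular record; this is a sketch of the knot-placement idea, not the package's code.

```python
import numpy as np
from scipy.interpolate import splrep, splev

rng = np.random.default_rng(0)

# Hypothetical sparse, irregular longitudinal record for one subject:
# irregular observation times and a noisy smooth trajectory.
t_obs = np.sort(rng.uniform(0.0, 10.0, size=30))
y_obs = np.sin(t_obs / 2.0) + rng.normal(scale=0.05, size=t_obs.size)

# Cubic B-spline (k = 3) with interior knots at the three quartiles of the
# observation times, mirroring the knot placement used in this analysis.
knots = np.quantile(t_obs, [0.25, 0.50, 0.75])
tck = splrep(t_obs, y_obs, t=knots, k=3)   # least-squares spline fit

# Evaluate the fitted curve on a fine grid for plotting or comparison.
grid = np.linspace(t_obs.min(), t_obs.max(), 200)
fitted = splev(grid, tck)
```

Quartile-based knots guarantee that each knot interval contains observations, which keeps the least-squares fit well conditioned even for irregular designs.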
Previously, Zhang et al. (2011) explored using the CAG trinucleotide repeat length to quantify the HD genetic burden. They proposed a CAG-Age Product scale, hereafter referred to as the CAP score, CAP = AGE × (CAG − 33.7), which was found to strongly predict HD onset and has since been used to characterize prodromal HD risk and for disease screening (Rodrigues et al. 2019). In the current analysis, CAP was predictive of disease progression: the three clusters had significantly different CAP values (p-value < 0.001). But as a static metric assessed at baseline, CAP does not fully capture disease progression; a large overlap in CAP values was observed among the three clusters, as shown in Figure 5(b). We performed an additional Cox proportional hazards analysis with both CAP and the cluster labels as covariates. After adjusting for CAP, the HD progression clusters remained significant (p-value < 0.001), confirming the added value of the clustering analysis. This clustering analysis has profound implications for designing intervention trials aimed at modifying disease progression: identifying individuals at risk for a specific HD progression group can enrich the study sample for the targeted intervention (Paulsen et al. 2019).

Discussion
In this research, we have developed an efficient hierarchical clustering method for multivariate longitudinal data with irregular and sparse assessments. The method uses functional curve-fitting techniques to alleviate the influence of noisy and sparse observation. It takes a hierarchical agglomerative approach that merges subjects and subclusters in a bottom-up fashion and hence lays out all candidate clusterings, with the cluster number ranging from 1 to N, the total sample size. It is equipped with two statistics for selecting the optimal cluster number, GAP_b and CH. The numerical studies showed that the proposed method with GAP_b is highly reliable for clustering longitudinal data when the true number of clusters is not large (≤ 5). It is therefore most applicable to longitudinal clinical data.
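For readers unfamiliar with the CH (Calinski-Harabasz) statistic mentioned above, it contrasts between-cluster and within-cluster dispersion, CH(K) = [B(K)/(K − 1)] / [W(K)/(N − K)], with larger values indicating better-separated clusters. The generic sketch below illustrates the index for any labeled dataset; it is not the package's internal code.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: ratio of between- to within-cluster
    dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, _ = X.shape
    cats = np.unique(labels)
    k = len(cats)
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in cats:
        member = X[labels == c]
        center = member.mean(axis=0)
        between += member.shape[0] * np.sum((center - overall) ** 2)
        within += np.sum((member - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```

In a hierarchical procedure, the index is evaluated along the merge path and the cluster number maximizing it is selected.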
While the method's development was motivated by the need for clustering longitudinal clinical data, evidence supports its use in functional data analysis, where the method is numerically more effective because it requires no preprocessing of the data. The proposed method with CH also appears to be reliable for functional data settings where the number of clusters is large and functional shapes are complex. Extensive simulation studies have highlighted the method's superior performance over its competitors in terms of cluster number determination, clustering accuracy, and computational efficiency.
At the core of the algorithm is a metric that quantifies the cost of each merge. The metric is calculated by comparing the sums of squared residuals of the separate and combined models; it can thus be viewed as a measure of distance between two subclusters, and clusterMLD therefore belongs to the general category of distance-based clustering methods per Jacques and Preda (2014a). However, this metric is not a traditional L2 measure of the "distance" between two curves, on which other functional data clustering methods are based (Ferraty and Vieu 2006; Ieva et al. 2013; Tarpey and Kinateder 2003; Tokushige et al. 2007). Instead, it is based on the sum of squared residuals, which can be calculated for a much broader class of parametric and nonparametric models, including but not limited to the regression B-spline models used in the current research. For traditional linear models, the metric has well-grounded roots in the classical theory of statistical inference, a foundation that the L2-based methods lack. This may help explain the superior performance of the proposed algorithm. An essential advantage of the method is that the readily calculated sums of squared residuals drastically cut down the computational burden: the least-squares-based cost metric greatly simplifies the comparison of longitudinal data models between subclusters during the clustering process. We were surprised that such an intuitive approach had not been previously studied, as it has clear potential in clustering both longitudinal and functional data. The strongest support for the method perhaps comes from the numerical studies, which highlighted the many advantages that set this method apart from existing ones. Major appeals include accurate identification of the cluster number, clustering accuracy, and computational efficiency, in addition to the ability to handle sparse and irregular data as well as multiple outcomes.
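The merging cost described above can be sketched generically: fit a mean curve to each subcluster and to their pooled observations, and take the increase in the residual sum of squares. The toy example below uses polynomial least squares as a stand-in for the paper's regression B-splines; the function names and the degree-3 choice are illustrative, not the package's API.

```python
import numpy as np

def rss(t, y, degree=3):
    """Residual sum of squares from a least-squares polynomial fit
    (a stand-in for the regression B-spline models of the paper)."""
    coef = np.polyfit(t, y, deg=degree)
    return float(np.sum((np.polyval(coef, t) - y) ** 2))

def merge_cost(cluster_a, cluster_b, degree=3):
    """Cost of merging two subclusters: the extra lack of fit incurred by
    modeling their pooled observations with a single mean curve.
    Always nonnegative, since separate fits are at least as flexible."""
    ta, ya = cluster_a
    tb, yb = cluster_b
    pooled_t = np.concatenate([ta, tb])
    pooled_y = np.concatenate([ya, yb])
    return rss(pooled_t, pooled_y, degree) - rss(ta, ya, degree) - rss(tb, yb, degree)
```

Two subclusters tracing the same underlying curve incur a near-zero cost, while subclusters with different mean curves incur a large one, which is exactly the ordering an agglomerative merge rule needs.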
Notably, the advantages appear to grow with sample size, as demonstrated in the simulation studies and in the SPRINT data analysis.
In many practical analyses, the sample size is a double-edged sword. While larger data sets lend more information to the analysis, very large datasets present greater challenges to data processing, and the computational burden of running iterative, computationally heavy algorithms on them has frustrated many analysts. For a dataset of size N, agglomerative clustering algorithms typically have a time complexity of O(N²) and require O(N²) memory (Nielsen 2016); this can make them too slow even for medium-sized analyses. Parallel computing can help alleviate the burden by splitting the original data into multiple subsets, each of size n, and applying the algorithm to each of them in parallel. On each subset, the algorithm stops when d subclusters remain, instead of continuing until only one cluster is left. No specific requirement is imposed on d, as long as it is larger than the actual number of clusters. The resultant subclusters from each subset are then combined. This split-and-pool procedure can be applied repeatedly until the number of remaining subclusters reaches a manageable scale; finally, the algorithm keeps merging the pooled subclusters until only one cluster is left. This scheme reduces the time complexity to O((n + d)N). In general, larger n and d improve the clustering performance but slow the algorithm. Our analysis of the SPRINT data confirmed the method's practical feasibility in moderate-to-large datasets.
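The split-and-pool procedure can be sketched as follows, with generic Ward clustering on feature vectors standing in for clusterMLD's curve-based merge rule; the function name and the parameters n_chunks, d, and k_final are illustrative choices, not part of the package.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def split_and_pool(X, n_chunks=4, d=6, k_final=2):
    """Sketch of the split-and-pool scheme: cluster each random chunk down
    to d subclusters, pool the subcluster means, then run a final
    agglomerative pass on the pooled means."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    chunks = np.array_split(idx, n_chunks)

    pooled_means, pooled_members = [], []
    for chunk in chunks:
        Z = linkage(X[chunk], method="ward")
        sub = fcluster(Z, t=d, criterion="maxclust")   # stop at d subclusters
        for label in np.unique(sub):
            members = chunk[sub == label]
            pooled_means.append(X[members].mean(axis=0))
            pooled_members.append(members)

    # Final pass: merge the pooled subcluster means down to k_final clusters.
    Zf = linkage(np.vstack(pooled_means), method="ward")
    final = fcluster(Zf, t=k_final, criterion="maxclust")

    # Propagate the final labels back to the original observations.
    labels = np.empty(len(X), dtype=int)
    for sub_label, members in zip(final, pooled_members):
        labels[members] = sub_label
    return labels
```

Each chunk costs O(n²) at most, and the final pass operates on only n_chunks × d pooled means, which is the source of the overall O((n + d)N) behavior noted above.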
As is true for all statistical methods, clusterMLD has its limitations: (i) we assume that the time trajectories are smooth functions that can be adequately depicted by spline models; (ii) although the method is designed to accommodate irregular and sparse observations, the longitudinal observations among subjects are still expected to have some level of overlap so that individual curves can be fitted; (iii) neither the GAP_b nor the CH statistic is defined for K = 1, which precludes testing for the existence of heterogeneity within the data; we assume data heterogeneity a priori, and when in doubt, that assumption could be checked with goodness-of-fit statistics; (iv) since clustering is a data-exploratory technique, not an inference procedure, we did not consider the correlations among the multivariate outcomes. Incorporating a proper working correlation matrix would no doubt enhance inferential efficiency (Liang and Zeger 1986), but this gain is likely to come at the expense of an increased numerical burden without substantively improving the clustering performance. Finally, the method as it currently stands does not work for discrete data. To the best of our knowledge, no existing clustering method has been devoted to the analysis of multivariate data of mixed types. That said, the distance metric defined in the current paper may have the potential to be extended to other data types. In addition, prior knowledge about the data may help with the choice between GAP_b and CH for selecting the number of clusters. More generally, it would be advantageous to develop an adaptive strategy for choosing between GAP_b and CH during the hierarchical process of merging subclusters, which requires further research. Notwithstanding these limitations, we have put forward a reliable and efficient clustering algorithm that hierarchically merges subclusters of multivariate continuous data observed longitudinally.