A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density

The goal of clustering is to detect the presence of distinct groups in a dataset and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all datasets used in the examples, and R code producing the figures and numerical results, are available online.


INTRODUCTION
The goal of clustering is to identify distinct groups in a dataset and assign a group label to each observation. We use the term "distinct groups" in the sense of Carmichael, George, and Julius (1968): contiguous, densely populated areas of feature space, separated by comparatively empty areas.

To cast clustering as a statistical problem we regard the data X = {x_1, . . . , x_n} ⊂ R^m as a sample from some unknown probability density p(x). There are two statistical approaches to clustering. The parametric approach (Fraley and Raftery 1998, 2006; McLachlan and Peel 2000) is based on the assumption that each group g is represented by a density p_g(x) that is a member of some parametric family, such as the multivariate Gaussian distributions. The density p(x) then is a mixture of the group densities, and the number of mixture components and their parameters are estimated from the data. Observations can be labeled using Bayes's rule.
In contrast, the nonparametric approach adopted in this article is based on the premise that groups correspond to modes of the density p(x). The goal then is to find the modes and assign each observation to the "domain of attraction" of a mode. Searching for modes as a manifestation of the presence of groups was first advocated in Wishart's (1969) article on Mode Analysis. According to Wishart, clustering methods should be able to detect and "resolve distinct data modes, independently of their shape and variance." Hartigan (1975, sec. 11; 1981) expanded on Wishart's idea and made it more precise by introducing the notion of high density clusters. Define the level set L(λ; p) of a density p at level λ as the subset of the feature space for which the density exceeds λ: L(λ; p) = {x | p(x) > λ}. Hartigan defined the high density clusters at level λ as the connected components of L(λ; p).
Hartigan also pointed out that the collection of high density clusters has a hierarchical structure: for any two clusters A and B (possibly at different levels) either A ⊂ B or B ⊂ A or A ∩ B = ∅. This hierarchical structure is summarized by the cluster tree of p. Each node N of the tree represents a subset D(N) of the support L(0; p) of p, a high density cluster of p, and is associated with a density level λ(N). The cluster tree is easiest to define recursively (Stuetzle 2003). The root node represents the entire support of p and has associated density level λ(N) = 0. To determine the descendants of a node N we find the lowest level λ_d for which L(λ_d; p) ∩ D(N) has two or more connected components. If there is no such λ_d, then p has only one mode in D(N), and N is a leaf of the tree. Otherwise, let C_1, . . . , C_k be the connected components of L(λ_d; p) ∩ D(N). If k = 2 (the usual case), we create daughter nodes representing the connected components C_1 and C_2, both with associated level λ_d, and apply the definition recursively to the daughters. If k > 2, we create daughter nodes representing C_1 and C_2 ∪ · · · ∪ C_k and recurse. (Our terminology is different from the one used by Hartigan 1975; Hartigan referred to the connected components of all level sets as high density clusters, whereas we reserve this term for connected components of level sets associated with the nodes of the cluster tree.) Figure 2 shows a density and the corresponding cluster tree. It is worth noting that the topology of the cluster tree of a density is invariant under nonsingular affine transformations of feature space; only the levels of the nodes change. In particular, the topology does not depend on the choice of units of measurement.
We regard estimating the cluster tree as the fundamental goal of nonparametric cluster analysis. In this article we propose a graph-based approach to cluster tree estimation which we call generalized single linkage clustering. In Section 2 we review previously suggested clustering methods that can be described in terms of level sets. In Section 3 we present the basic idea of our graph-based approach, and in Section 4 we illustrate our algorithm on a simple example. In Section 5 we describe a way of measuring the "prominence" of modes of a density, motivating a method for pruning branches of a cluster tree likely to correspond to spurious modes caused by sampling variability. In Section 6 we point out mathematical and algorithmic connections between our approach and single linkage clustering. In Section 7 we show additional examples. Section 8, with a summary and ideas for future work, concludes the article.

PREVIOUS WORK
Several previously suggested clustering methods can be described in terms of level sets and high density clusters.
Probably the earliest such method is Wishart's (1969) one-level mode analysis. The goal of one-level mode analysis is to find the connected components of L(λ; p) for a given density level λ chosen by the user. The idea is to first compute a kernel density estimate p̂ (Silverman 1986, chap. 4) and set aside all observations with p̂(x_i) ≤ λ, that is, all observations not in the level set L(λ; p̂). If the connected components of L(λ; p) are well separated, then the remaining high density observations should fall into clearly separated groups. Wishart suggested using single linkage clustering of the high density observations to identify the groups. One-level mode analysis anticipates some of the "sharpening" ideas later put forth by Tukey and Tukey (1981).
A reincarnation of one-level mode analysis is the DBSCAN algorithm of Ester et al. (1996). DBSCAN consists of four steps: (a) for each data point calculate a kernel density estimate using a spherical uniform kernel with radius r; (b) choose a density threshold λ and find the observations with p̂(x_i) > λ; (c) construct a graph connecting each high density observation to all other observations within distance r; (d) define the clusters to be the connected components of this graph. All observations not within distance r of a high density observation are considered "noise."

DBSCAN is closely related to Walther's (1997) method for estimating level sets and to the clustering method of Cuevas, Febrero, and Fraiman (2000, 2001). Walther's level set estimator consists of two steps: (i) compute an estimate p̂ of the underlying density p; (ii) estimate L(λ; p) by a union of balls centered at those observations with p̂(x_i) > λ. Estimating a level set by a union of balls centered at high density observations is also the basic idea of Cuevas, Febrero, and Fraiman (2000, 2001). Their estimate differs from Walther's estimate in the way the radii of the balls are chosen. The clusters produced by their method are the connected components of the estimated level set or, more precisely, the corresponding subsets of the observations. (This short summary does not do justice to the work reported by Cuevas, Febrero, and Fraiman (2000, 2001); their articles contain a number of interesting ideas, e.g., on using the bootstrap to more accurately estimate level sets and assess the variability of the estimates.)

A weakness of one-level mode analysis, or of any method that attempts to find clusters based on a level set for a single level λ, is apparent from Figure 2. The degree of separation between connected components of L(λ; p), and therefore of L(λ; p̂), depends critically on the choice of the cut level λ, which is left to the user. Moreover, there might not be a single value of λ that reveals all the modes.
Citing the difficulty of choosing a cut level, Wishart (1969) proposed hierarchical mode analysis, which can be regarded as a heuristic for computing the cluster tree of a kernel density estimate p̂, although it appears that Wishart did not view it thus. (The word "tree" does not occur in the section of his article on hierarchical mode analysis.) We use the term "heuristic" because there is no guarantee that hierarchical mode analysis will indeed correctly compute the cluster tree of p̂ as defined previously. Wishart's (1969) algorithm constructs the tree by iterative merging (i.e., it is an agglomerative algorithm). It is quite complex, probably because its iterative approach is not well matched to the tree structure it is trying to generate.
The basic weakness of one-level mode analysis was also noted by Ankerst et al. (1999), who proposed OPTICS, an algorithm for "Ordering Points to Identify the Clustering Structure." OPTICS generates a data structure that allows one to calculate efficiently the result of DBSCAN for any desired density threshold λ. It also produces a graphical summary of the cluster structure.

Stuetzle (2003) suggested a new method, runt pruning, for extracting clusters from the cluster tree of the nearest-neighbor density estimate (which is isomorphic to the single linkage dendrogram) by pruning branches believed to correspond to spurious modes. For a generalization of single linkage clustering to the kth nearest-neighbor density estimate see Wong (1979) and Wong and Lane (1983).

Klemelä (2004, 2005) proposed tools for visualizing the level sets of density estimates that are piecewise constant over (hyper)-rectangles, such as histograms or discretized kernel estimates. Level sets and their connected components for such density estimates are easy to obtain. Klemelä defined a "level set tree" which differs from the cluster tree in that it has nodes at every one of the (finitely many) levels occurring as values of p̂.

A GRAPH-BASED APPROACH TO ESTIMATING THE CLUSTER TREE OF A DENSITY
An obvious way of estimating the cluster tree of a density p from a sample is to first compute a density estimate p̂ and then use the cluster tree of p̂ as an estimate for the cluster tree of p. This plug-in approach works for histograms or binned kernel density estimates (Nugent 2006). Such estimates, however, are only viable in low dimensions because binning high-dimensional space is not practical: ten bins per variable in ten dimensions would result in 10^10 bins. For many other estimates suitable for high-dimensional data, like Gaussian mixtures, kernel estimates, or projection pursuit estimates, computing level sets, and therefore computing the cluster tree, seems intractable; it is not even clear how one would represent level sets on the computer. Instead we define and solve a closely related, but much simpler, graph problem.
Let p̂_ij be the minimum value of the density estimate p̂ over the line segment connecting observations x_i and x_j:

p̂_ij = min_{0 ≤ t ≤ 1} p̂(x_i + t (x_j − x_i));

in particular, p̂_ii = p̂(x_i). Let G be the complete graph over the observations with edge weights p̂_ij and vertex weights p̂_ii. Define the threshold graph G(λ) as the subgraph of G consisting of the edges and vertices with p̂_ij > λ. By construction, the vertices of G(λ) are exactly the observations in L(λ; p̂).
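To make this concrete, here is a minimal R sketch of the two ingredients just defined: the grid search approximation to the edge weight p̂_ij that we use in the examples (Sections 4 and 7), and the thresholding operation yielding G(λ). The function and variable names are illustrative, not those of the accompanying package; dens stands for any pointwise-evaluable density estimate.

    ## Approximate p_ij = min of the density estimate over the segment
    ## [x_i, x_j] by evaluating it at ngrid points along the segment.
    edge_weight <- function(xi, xj, dens, ngrid = 10) {
      t <- seq(0, 1, length.out = ngrid)
      min(sapply(t, function(s) dens(xi + s * (xj - xi))))
    }

    ## Threshold graph G(lambda): keep vertices with p_ii > lambda and
    ## edges with p_ij > lambda. 'edges' is a matrix with columns (i, j, weight),
    ## 'pii' the vector of vertex weights p_ii = dens(x_i).
    threshold_graph <- function(edges, pii, lambda) {
      list(vertices = which(pii > lambda),
           edges = edges[edges[, 3] > lambda, , drop = FALSE])
    }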
There is also a link between the connected components of L(λ; p̂) and the connected components of the threshold graph G(λ): two observations in the same connected component of G(λ) are guaranteed to lie in the same connected component of L(λ; p̂) because they are connected by a path in G along which all edge weights p̂_ij exceed λ. Note that the reverse is not necessarily true: there might be a curve x(t) : [0, 1] → R^m with x(0) = x_i, x(1) = x_j, and p̂(x(t)) ≥ λ for all t ∈ [0, 1], even if there is no path in the graph G with this property. Therefore, observations in the same connected component of L(λ; p̂) may lie in different connected components of G(λ). However, erroneous splits are rare if p̂ is smooth; we present some evidence for this assertion in Section 7. We will altogether miss connected components of L(λ; p̂) that do not contain any observations, but those are probably artifacts of the density estimate and not of interest. In any case, our real targets are the level sets L(λ; p) of the feature density and their connected components; occasional mistakes in identifying connected components of L(λ; p̂) are but one component of the overall estimation error.
The connected components of G(λ) for different values of λ have a tree structure, just like the connected components of L(λ; p̂). We call this tree the graph cluster tree; it is our approximation to the cluster tree of p̂. Like the cluster tree of a density, the graph cluster tree is easiest to define recursively. Each node N of the graph cluster tree represents a subgraph D̂(N) of G and is associated with a density level λ(N). We refer to the vertex set of D̂(N) as the graph high density cluster associated with N. The root node represents the entire graph G and has associated density level λ(N) = 0. To determine the descendants of a node N we find the lowest level λ_d for which G(λ_d) ∩ D̂(N) has two or more connected components. If there is no such λ_d, then N is a leaf of the tree. Otherwise, let C_1, . . . , C_k be the connected components of G(λ_d) ∩ D̂(N). If k = 2 (the usual case), we create daughter nodes representing the connected components C_1 and C_2, both with associated level λ_d, and apply the definition recursively to the daughters. If k > 2, we create daughter nodes representing C_1 and C_2 ∪ · · · ∪ C_k and recurse.
Remark 1: In our graph-based approach the observations play two conceptually different roles. First, they are used to compute the density estimate p̂. Second, they form the vertices of the graph G. The graph is merely a tool for approximating the structure of the level sets of p̂. Following Cuevas, Febrero, and Fraiman (2000, 2001), we could use a different set of "test" points as the graph vertices. For example, we could generate a large sample from p̂, which would reduce the likelihood of erroneously splitting connected components of level sets of p̂. We would end up with cluster labels for the test points and could then label the original observations using any classification method.

COMPUTING THE GRAPH CLUSTER TREE
The recursive definition of the graph cluster tree given at the end of Section 3 translates directly into a recursive algorithm for its computation. Note that when we apply the splitting procedure to the subgraph D̂(N) associated with a node N, the only values of the threshold λ we have to consider are the weights of the edges in D̂(N). However, we can simplify the algorithm and its visualization and expose similarities to other clustering methods (Section 6) by making use of a connection between the threshold graphs of the graph G and of its maximal spanning tree T:

Proposition 1. Let G be an edge weighted graph and T its maximal spanning tree. Then two vertices belong to the same connected component of G(λ) iff they belong to the same connected component of T(λ).
Proof: Two vertices in the same connected component of T(λ) are in the same connected component of G(λ) because the edges of T are a subset of the edges of G. Now assume that vertices x_i and x_j are in different connected components of T(λ). This means that the unique path in T connecting x_i and x_j contains at least one edge e with weight ≤ λ. Removing e from T breaks T into two connected components T_1 and T_2, one containing x_i and the other containing x_j. If x_i and x_j were in the same connected component of G(λ), there would be a path in G connecting x_i and x_j for which all edge weights are greater than λ. This path would have to contain an edge e* connecting T_1 and T_2. Replacing e with e* in T would lead to a tree with larger total edge weight, contradicting the assumption that T is the maximal spanning tree of G.
Proposition 1 implies that we can apply the recursive cluster tree algorithm to the maximal spanning tree of G instead of G itself.
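The following R sketch turns Proposition 1 into code: a naive Prim-style construction of the maximal spanning tree, followed by the recursive splitting procedure, which now only needs to examine the tree's edge weights. This is an illustration of the logic under our own naming conventions, not the packaged implementation, and for brevity it creates one daughter per connected component rather than the binary C_1 versus C_2 ∪ · · · ∪ C_k splits described in Section 3.

    ## W: symmetric matrix of edge weights p_ij, with p_ii on the diagonal.
    max_spanning_tree <- function(W) {
      n <- nrow(W); in_tree <- 1; edges <- NULL
      while (length(in_tree) < n) {
        out <- setdiff(1:n, in_tree)
        sub <- W[in_tree, out, drop = FALSE]
        k <- which(sub == max(sub), arr.ind = TRUE)[1, ]  # heaviest crossing edge
        edges <- rbind(edges, c(in_tree[k[1]], out[k[2]], sub[k[1], k[2]]))
        in_tree <- c(in_tree, out[k[2]])
      }
      edges                                 # (n-1) x 3 matrix: i, j, weight
    }

    ## Vertices reachable from v via 'edges' (an edge list as above).
    component <- function(edges, v) {
      comp <- v
      repeat {
        if (is.null(edges) || nrow(edges) == 0) return(comp)
        hit <- edges[, 1] %in% comp | edges[, 2] %in% comp
        new <- setdiff(c(edges[hit, 1:2]), comp)
        if (length(new) == 0) return(comp)
        comp <- c(comp, new)
      }
    }

    ## Recursive graph cluster tree of the maximal spanning tree. 'pii' holds
    ## the vertex weights; a node is a list holding its vertex set, the level
    ## at which it was created, and its children.
    cluster_tree <- function(edges, verts, pii, level = 0) {
      repeat {
        if (is.null(edges) || nrow(edges) == 0)   # nothing left to split: leaf
          return(list(vertices = verts, level = level, children = NULL))
        lambda <- min(edges[, 3])                 # lowest remaining edge weight
        edges <- edges[edges[, 3] > lambda, , drop = FALSE]
        verts <- verts[pii[verts] > lambda]       # peel low density vertices
        comps <- list(); left <- verts
        while (length(left) > 0) {                # connected components
          cc <- intersect(component(edges, left[1]), verts)
          comps <- c(comps, list(cc)); left <- setdiff(left, cc)
        }
        if (length(comps) >= 2) break             # genuine split at lambda
      }
      kids <- lapply(comps, function(v)
        cluster_tree(edges[edges[, 1] %in% v & edges[, 2] %in% v, , drop = FALSE],
                     v, pii, lambda))
      list(vertices = verts, level = level, children = kids)
    }

Given the weight matrix W of the graph, the whole computation is then cluster_tree(max_spanning_tree(W), 1:nrow(W), diag(W)).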
We still face an operational problem: the edge weights p̂_ij of G for j ≠ i are not known explicitly but are solutions of an optimization problem. One way of dealing with this problem is to approximate the p̂_ij using a numerical optimizer, most simply a grid search. We used this method to generate the examples presented in the article.
We do have a more principled but computationally more demanding approach that can be shown to produce the correct tree. It is based on two observations: (i) to compute the maximal spanning tree of G and the graph cluster tree we only need the order of the edge weights of G; their exact values are not important; (ii) if the density estimate p̂ is smooth (e.g., a kernel estimate with a smooth kernel or a Gaussian mixture estimate), then we can obtain upper and lower bounds on the p̂_ij using Taylor expansions. These bounds can be made arbitrarily tight at the cost of additional evaluations of p̂ and its derivatives. This approach was described by Nugent (2006). In the examples we have tried, grid search and exact computation produce very similar results.
We now illustrate the cluster tree algorithm on a simple two-dimensional example. Figure 3(a) shows a dataset consisting of two obvious groups, which we will refer to as the "lump" and the "banana." Superimposed are the isopleths of a kernel density estimate. Figure 3(b) shows the maximal spanning tree of G. The "shortest" edge, the one with the lowest edge weight p̂_ij and the first one to be broken during the recursive thresholding process, is dashed. The minimum of the density along this edge is attained at one of the endpoints, drawn in gray. Therefore, thresholding eliminates this edge and the endpoint, leaving us with one connected component. Figure 3(c) illustrates the second step of the algorithm. The second shortest edge, the second one to be broken, is dashed; edges and vertices below the current threshold are drawn in gray. Again, the minimum of the density along the edge is attained at an endpoint, and thresholding leaves us with one connected component.
The thresholding process progressively peels off edges and vertices until we reach the stage shown in Figure 3(d), where for the first time thresholding results in two connected components, essentially the lump and the banana, with a few low density points removed. Applying the thresholding process to the lump does not result in any more splits; edges and vertices are removed until we are left with an empty graph. We therefore focus on the banana. Figure 3(e) shows the first split of the banana; one further split separates the remaining fragments, after which there are no more splits.

The graph cluster tree shown in Figure 4(a) has four leaves, corresponding to the lump and the three fragments of the banana. In Figure 4(b) observations in the high density clusters corresponding to the leaves of the tree are indicated by numbers; the remaining observations (the fluff) are drawn in gray. The numbers below the interior nodes of the tree are their runt excess masses (Section 5).

Remark 2: The density estimate has at least one additional mode, visible in the isopleths in Figure 3(a).

Remark 3: The maximal spanning tree edge connecting the lump and the banana in Figure 3(d) might seem implausible. Note, however, that there are many edges of the complete graph G with very similar edge weights crossing the density valley separating the lump from the banana. Which of those has the largest edge weight and therefore ends up in the maximal spanning tree depends on minor details of the density estimate and the locations of the grid points along the edges.

Remark 4: For the data in this example we would hope to obtain a graph cluster tree with two leaves corresponding to the lump and the banana, respectively. However, density estimates are inherently noisy, and the occurrence of spurious modes is unavoidable. Note, though, that the two valleys separating the three spurious modes in the banana are shallow, and the separation between them is not nearly as clear as the separation between the lump and the banana. This fact is not apparent from the graph cluster tree in Figure 4(a) because the tree only indicates the levels of the valleys, not the heights of the peaks. In Section 5 we propose a measure for the "prominence" of a high density cluster incorporating both its spatial extent and the rise of its peak (or peaks) above the valley separating it from the rest of feature space. Given such a measure, we can then prune branches of the graph cluster tree corresponding to clusters with low prominence.

Remark 5: As illustrated in Figure 4(b), the graph high density clusters corresponding to the leaves of the graph cluster tree do not form a partition of the data. If we want a partition, we need a way of assigning the fluff to the clusters. In keeping with the recursive nature of the clustering process, it is natural to make this assignment recursively. Consider Figure 3(d) where we make the first split. The graph high density clusters corresponding to the daughters of the root node are the solid black points in the lump and the banana, respectively. The gray points are fluff, and the picture suggests a way of assigning the fluff to the graph high density clusters: breaking the dashed edge splits the maximal spanning tree into two subtrees, and we assign each fluff point to the high density cluster in its subtree.
The same recipe can be applied at any stage of the algorithm. A problem with this method is that it occasionally results in counterintuitive assignment of outliers. The minimum densities along the edges in G connecting an outlier to the rest of the observations will all be small, and which of them is the smallest will depend, for example, on the locations of the grid points used to approximate the edge weights. An alternative, used in the examples, is to assign the fluff using a nearest-neighbor rule. The details of fluff assignment do not appear to make much difference in terms of performance.
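A minimal R sketch of the nearest-neighbor rule for assigning the fluff, as used in the examples; X is the data matrix and labels contains the cluster labels of the high density observations and NA for fluff points (names are illustrative).

    assign_fluff <- function(X, labels) {
      fluff <- which(is.na(labels)); core <- which(!is.na(labels))
      for (i in fluff) {
        d2 <- colSums((t(X[core, , drop = FALSE]) - X[i, ])^2)  # squared distances
        labels[i] <- labels[core[which.min(d2)]]  # label of nearest core point
      }
      labels
    }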

PRUNING THE GRAPH CLUSTER TREE
There is an obvious way of measuring the prominence of a high density cluster in the population case. Consider Figure 2, showing a density with three modes and the corresponding cluster tree. Recall that each node N of the cluster tree represents a subset D(N) of the feature space and is associated with a level λ(N). We propose to measure the prominence of a high density cluster by its excess mass

E(N) = ∫_{D(N)} (p(x) − λ(N)) dx.

In Figure 2(a) the excess mass associated with the left daughter of the root node is represented by the shaded area. The concept of "excess mass" is due to Hartigan (1987); for an approach to clustering based on excess mass see also the works of Mueller and Sawitzki (1991) and Polonik (1995).
To find a sample analog of E(N) observe that

E(N) = E[ 1(X ∈ D(N)) (1 − λ(N)/p(X)) ],

and therefore

Ẽ(N) = (1/n) Σ_{x_i ∈ D̂(N)} (1 − λ(N)/p̂(x_i))

is a plug-in estimate of E(N). To prune the graph cluster tree we choose an excess mass threshold γ and remove all nodes with excess mass Ẽ(N) < γ and their incident edges. Note that excess mass is monotone: if node N_2 is a descendant of N_1, then Ẽ(N_2) < Ẽ(N_1). Monotonicity implies that pruning will not result in any isolated branches or nodes. The resulting graph may no longer be a binary tree, but it can be converted into one by splicing out internal degree-2 nodes.
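In R the sample excess mass is a one-liner; here idx indexes the observations in the graph high density cluster D̂(N), lambda is the level λ(N), and phat holds the estimated densities p̂(x_i) at all n observations (names are illustrative).

    ## Sample excess mass of a node, following the formula above.
    excess_mass <- function(idx, lambda, phat)
      sum(1 - lambda / phat[idx]) / length(phat)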
The nodes of the graph cluster tree surviving the pruning process are those whose daughters both have excess mass > γ. Define the runt excess mass of an interior node as the smaller of the excess masses of its two daughters. The numbers 46, 2, and 0 next to the interior nodes of the graph cluster tree in Figure 4(a) are the runt excess masses, multiplied by the sample size and rounded for readability. (Informally we use the term "excess mass" for both Ẽ(N) and round(nẼ(N)). Multiplying by the sample size expresses excess mass in units of observations.) Clearly there is only one split separating two prominent peaks of the estimated density, namely the one represented by the root node; the remaining two split off minor bumps. Pruning the graph cluster tree with excess mass threshold 46 results in a tree with two leaves representing the lump and the banana, respectively.
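A sketch of excess mass pruning for trees in the list representation used in the sketch of Section 4, with gamma the excess mass threshold. A node with no sufficiently massive daughter becomes a leaf; if exactly one daughter survives, the resulting degree-2 node is spliced out as described above. Again the names are ours, not the package's.

    prune_tree <- function(node, phat, gamma) {
      if (is.null(node$children)) return(node)
      em <- sapply(node$children, function(ch)      # daughters' excess masses
        sum(1 - ch$level / phat[ch$vertices]) / length(phat))
      keep <- which(em >= gamma)
      if (length(keep) == 0) { node$children <- NULL; return(node) }
      if (length(keep) == 1) {                      # splice out degree-2 node
        sub <- prune_tree(node$children[[keep]], phat, gamma)
        node$children <- sub$children
        return(node)
      }
      node$children <- lapply(node$children, prune_tree, phat = phat, gamma = gamma)
      node
    }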
It would be desirable to have an automatic method for determining an appropriate value for the threshold γ . We do not yet have such a method, so the choice will have to be subjective (see Section 7).

Remark 6: A simpler measure of the "significance" of a mode is its size, the probability mass

∫_{D(N)} p(x) dx

of the corresponding high density cluster; its sample analog is the fraction of observations in D̂(N). In Figure 2(a) the size associated with the right daughter of the root node is represented by the hashed area. Hartigan and Mohanty (1992) used size as an indicator of "significance" in their runt test for unimodality. Like excess mass, size is monotone and can be used for pruning the graph cluster tree. The runt sizes for the three interior nodes of the tree in Figure 4(a) are 96, 26, and 5. So in this example, runt size does not provide as clear a guide for pruning as runt excess mass.

CONNECTIONS TO SINGLE LINKAGE CLUSTERING
Single linkage and generalized single linkage clustering are connected through the nearest-neighbor density estimate

p̂_1(x) = 1 / (n v_m d(x, X)^m),

where d(x, X) is the Euclidean distance from x to the closest observation and v_m is the volume of the unit ball in R^m. Because p̂_1 has a singularity at each observation, the vertex weights of the graph G are p̂_ii = ∞.

Proposition 2. The graph cluster tree of G is isomorphic to the single linkage dendrogram.
A proof of Proposition 2 is given in the Appendix (available online). Proposition 2 shows that generalized single linkage clustering applied to the nearest-neighbor density estimate reduces to single linkage clustering; the name is therefore justified.
The commonly used method of extracting clusters from a single linkage dendrogram is dendrogram cutting. Stated in terms of the graph cluster tree, dendrogram cutting is equivalent to choosing a density threshold λ* and removing all nodes with level λ(N) > λ* and their incident edges. There are two problems with this pruning strategy. First, it tends to result in many singletons or tiny clusters consisting of outliers, and one or a few large clusters. This problem could be remedied by choosing a size threshold and discarding all clusters of size smaller than the threshold. However, there is a more fundamental problem: dendrogram cutting forms the clusters based on a single level set of the nearest-neighbor density estimate and, as Figure 2 illustrates, there may not be a single level revealing all the groups or modes. An alternative is to apply the pruning method described in Section 5. Note that the nearest-neighbor density estimate has a singularity at each data point (p̂_1(x_i) = ∞) and therefore the measure of prominence Ẽ(N) reduces to the fraction of observations in the graph high density cluster D̂(N), that is, to size. Extracting clusters from a single linkage dendrogram by pruning branches with small size was proposed by Stuetzle (2003), who also provided experimental results suggesting that pruning is vastly superior to dendrogram cutting.

EXAMPLES
The goal of this section is to illustrate generalized single linkage clustering on some examples and to compare it to standard hierarchical clustering methods (single linkage, average linkage, complete linkage, and Ward's method) and to parametric (model-based) clustering with Gaussian mixtures (Fraley and Raftery 1998, 2006; McLachlan and Peel 2000). (For hierarchical clustering we used the functions "hclust" and "cutree" in R version 2.7.1 (R Development Core Team 2008). For model-based clustering we used the function "Mclust" in the R package mclust version 3.1-5 (Fraley and Raftery 2008).)

Density estimation: Where feasible we present the results of generalized single linkage clustering for two different density estimates: the nearest-neighbor density estimate and a kernel density estimate with bandwidth determined by least squares cross-validation (Silverman 1986). We refer to these two versions as GSL-NN and GSL-K, respectively. The nearest-neighbor estimate is computationally attractive because the maximal spanning tree of G is the Euclidean minimal spanning tree of the observations, and computing a Euclidean minimal spanning tree for 10,000 points in ten dimensions only takes about a minute on a standard PC. We chose kernel estimates as the alternative because they are well understood and easy to implement, and least squares cross-validation offers a simple way for automatic bandwidth selection. We use an elliptical Gaussian kernel with the same covariance as the data. For some of the examples with high-dimensional feature space, kernel density estimation is not viable because least squares cross-validation fails or the kernel is singular. For these examples we only give results for the nearest-neighbor density estimate.
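For concreteness, here is a minimal R sketch of a kernel estimate of the kind used for GSL-K: an elliptical Gaussian kernel whose covariance is the data covariance scaled by a bandwidth h; the article selects h by least squares cross-validation, which we take as given here. Names are illustrative.

    ## Returns a function evaluating the kernel density estimate at a point.
    make_kde <- function(X, h) {
      n <- nrow(X); m <- ncol(X)
      S <- h^2 * cov(X)                 # kernel covariance: scaled data covariance
      Sinv <- solve(S)
      const <- 1 / (n * sqrt((2 * pi)^m * det(S)))
      function(x) {
        d <- t(X) - x                   # m x n matrix of differences x_i - x
        const * sum(exp(-0.5 * colSums((Sinv %*% d) * d)))
      }
    }

An estimate in this form can be handed directly to the edge weight sketch of Section 3.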
Edge weights: For a kernel density estimate the edge weights p̂_ij are not available in closed form and have to be approximated. A simple approximation method is grid search: approximate p̂_ij by the minimum of p̂ over a regular grid on the line segment connecting x_i and x_j. In the examples we used ten grid points.
In practice it is not necessary to calculate the edge weights for all the edges of the complete graph: edges connecting observations with large Euclidean distance will typically not end up in the maximal spanning tree. In the examples we only calculate the edge weights for a sparse test graph with edges that are either in the Euclidean minimal spanning tree of the observations or in the 20-nearest-neighbor graph. This shortcut results in a large speedup of the edge weight calculation: evaluating a kernel density estimate at a single point takes O(n) work, and therefore computing all the edge weights of the complete graph is O(n^3). The shortcut reduces this to O(n^2) while producing essentially the same clustering results.
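A sketch of the test graph construction, assuming the function mst from the ape package for the Euclidean minimal spanning tree (any minimal spanning tree routine would do); names are illustrative.

    ## Edge list (i, j) of the test graph: union of the Euclidean minimal
    ## spanning tree and the k-nearest-neighbor graph (k = 20 in the examples).
    test_graph_edges <- function(X, k = 20) {
      D <- as.matrix(dist(X)); n <- nrow(D)
      A <- unclass(ape::mst(dist(X)))   # 0/1 adjacency matrix of the MST
      for (i in 1:n) {
        nb <- order(D[i, ])[2:(k + 1)]  # k nearest neighbors (skip self)
        A[i, nb] <- 1; A[nb, i] <- 1
      }
      which(A == 1 & upper.tri(A), arr.ind = TRUE)
    }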
Choosing an excess mass threshold for pruning: We sort the runt excess masses for the interior nodes of the graph cluster tree in decreasing order. Typically there is a small number of large values followed by a long trail of small values, like 98, 32, 22, 4, 3, 3, 3, 2, 2, 1, 1, . . . . A large runt excess mass indicates a split separating two prominent modes whereas a small runt excess mass indicates separation of a spurious mode most likely caused by variability of the density estimate. We scan the values from small to large looking for the first clear break, in our example between 4 and 22, and then choose the larger value as the threshold. There are three runt excess masses greater than or equal to the threshold, leading to a pruned tree with three interior nodes and four leaves.
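The scan can be mimicked mechanically, for example by looking at ratios of consecutive sorted runt excess masses; on the sequence above this picks the break between 4 and 22. This is only a rough automation of the subjective choice described here, not a method from the article.

    rem <- c(98, 32, 22, 4, 3, 3, 3, 2, 2, 1, 1)  # sorted runt excess masses
    ratios <- rem[-length(rem)] / rem[-1]         # drop between consecutive values
    thresh <- rem[which.max(ratios)]              # value just above the largest drop
    sum(rem >= thresh)                            # splits surviving pruning: here 3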

Leaves versus modes: There is a one-to-one correspondence between modes of p̂ and leaves of the cluster tree of p̂. As pointed out in Section 3, however, the same is not necessarily true for the graph cluster tree. The graph cluster tree may fail to reflect modes of p̂ whose domain of attraction does not contain any observations and, more importantly, multiple leaves may correspond to the same mode due to spurious splits of level sets of p̂. To get an alternative estimate for the number of modes we use numerical optimization. We start a numerical optimizer at each of the n observations and then cluster the resulting local optima using Ward's clustering method, a hierarchical version of k-means clustering. Define the loss associated with a partition as the sum of squared distances of the observations from their closest cluster means. Initially, every observation is a cluster. At any stage of the algorithm, Ward's method merges the two clusters leading to the smallest increase in loss. When applying Ward's method to the local optima there typically is a clear jump in the loss. A jump after the ith merge indicates that the local optima fall into n − i clusters corresponding to n − i modes.
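A sketch of this mode-counting procedure, using optim started from every observation and hclust with Ward's method on the resulting optima; dens is a pointwise-evaluable density estimate as in the earlier sketches, and the names are again illustrative.

    count_modes <- function(X, dens) {
      ## climb from every observation to a local maximum of the estimate
      optima <- t(apply(X, 1, function(x0)
        optim(x0, function(x) -dens(x), method = "BFGS")$par))
      ## Ward's method on the optima ("ward.D" expects squared distances)
      hc <- hclust(dist(optima)^2, method = "ward.D")
      ## the largest jump in the merge criterion after the i-th merge
      ## suggests n - i distinct local optima
      nrow(X) - which.max(diff(hc$height))
    }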
Measuring agreement between partitions: Let P_1 and P_2 be two partitions of a set of n objects. The partitions define a contingency table: let n_ij be the number of objects that belong to subset i of partition P_1 and to subset j of partition P_2. We measure the agreement between P_1 and P_2 by the adjusted Rand index (Hubert and Arabie 1985), defined as

ARI = [ Σ_{ij} C(n_ij, 2) − Σ_i C(n_i·, 2) Σ_j C(n_·j, 2) / C(n, 2) ] / [ ½ (Σ_i C(n_i·, 2) + Σ_j C(n_·j, 2)) − Σ_i C(n_i·, 2) Σ_j C(n_·j, 2) / C(n, 2) ].

Here C(a, 2) = a(a − 1)/2, n_i· = Σ_j n_ij, and n_·j is defined analogously. The adjusted Rand index has a maximum value of 1, which is achieved when the two partitions are identical up to renumbering of the subsets. It has expected value 0 under random assignment of the objects to the subsets of P_1 and P_2 that leaves the marginals n_i· and n_·j fixed.
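For completeness, a direct R transcription of the index; the function adjustedRandIndex in the mclust package computes the same quantity.

    adj_rand <- function(l1, l2) {
      tab <- table(l1, l2)              # contingency table n_ij
      a <- sum(choose(tab, 2))
      b <- sum(choose(rowSums(tab), 2)) # from the marginals n_i.
      c <- sum(choose(colSums(tab), 2)) # from the marginals n_.j
      d <- choose(sum(tab), 2)          # C(n, 2)
      (a - b * c / d) / ((b + c) / 2 - b * c / d)
    }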

NONELLIPTICAL CLUSTERS-THE BULLSEYE DATA
The data in this example consist of a two-dimensional Gaussian cloud forming the center of the bullseye and a second group in the shape of a ring around the center. Figure 5(a)-(f) shows the 2-partitions generated by single linkage, average linkage, complete linkage, Ward's method, model-based clustering, and generalized single linkage clustering with a kernel density estimate, respectively. Generalized single linkage clustering is the only method that recovers the obvious group structure. Single linkage clustering partitions the data into a cluster consisting of a single outlier and a second cluster made up of the rest. All the other methods fail in similar ways. This failure is not surprising, as they are all designed to find elliptical or roughly convex clusters.
The unpruned graph cluster tree of the corresponding kernel density estimate has 210 leaves, suggesting that the density estimate has 210 modes. To obtain an alternative estimate for the number of modes we start a numerical optimizer at each of the 560 observations and apply Ward's method to the local optima. The loss for the first 324 merges stays below 4 × 10^−3 and then abruptly jumps to 3.8 × 10^−1, suggesting that there are at least 560 − 324 = 236 distinct local optima. We conclude that the kernel estimate indeed has more than 200 modes.

Table 1 summarizes the performance of single linkage, average linkage, complete linkage, Ward's method, model-based clustering, generalized single linkage with nearest-neighbor density estimate, and generalized single linkage with kernel density estimate. The first row of the table contains the values of the adjusted Rand index when the methods are asked to construct a 7-partition. The second row contains the optimal values of the index (optimized over partition size). Numbers in parentheses are standard errors obtained by half-sampling (Shao and Tu 1995, sec. 5.2.2). All methods except single linkage perform well, although generalized single linkage with the nearest-neighbor estimate falls off a bit.

Figure 6. Graph cluster tree of Olive Oil data for kernel density estimate. Numbers above the leaves are labels; numbers below the interior nodes are runt excess masses.

THE OLIVE OIL DATA
The Olive Oil data consist of measurements of eight chemical components on 572 samples of olive oil. The samples come from three different regions of Italy. The regions are further partitioned into nine areas: areas A1-A4 belong to region R1, areas A5 and A6 belong to region R2, and areas A7-A9 belong to region R3.
We show the results for the kernel estimate. The unpruned graph cluster tree has 514 leaves. The alternative estimate for the number of modes (obtained by clustering local optima) is 501.

THE HANDWRITTEN DIGIT DATA
The data for this example are 2000 16 × 16 gray level images of handwritten digits; the data therefore are of dimensionality 256. (The data were previously used to evaluate machine learning algorithms.) Least squares cross-validation failed for this dataset, and we report the results of generalized single linkage clustering only for the nearest-neighbor density estimate. The runt sizes are 288, 283, 90, 84, 74, 47, 37, 35, 22, 21, 21, 19, 19, 18, 13, 12, 12, . . . . The gap after 35 (vaguely) suggests nine clusters. The corresponding adjusted Rand index is 0.64.

Table 4 summarizes the performance of single linkage, average linkage, complete linkage, Ward's method, model-based clustering, and generalized single linkage with nearest-neighbor density estimate. Ward's method and generalized single linkage perform best; the poor showing of average linkage and model-based clustering is surprising.

Figure 7. Graph cluster tree of ALL data for nearest-neighbor density estimate. Numbers above the leaves are labels; numbers below the interior nodes are runt sizes.

THE ACUTE LYMPHOBLASTIC LEUKEMIA DATA
The purpose of this example is to illustrate that generalized single linkage clustering can be applied to very high-dimensional datasets. The Acute Lymphoblastic Leukemia (ALL) data are oligonucleotide microarray gene expression levels of 12,558 genes for each of 360 ALL patients. Yeoh et al. (2002) divided the patients into seven diagnostic groups corresponding to six known leukemia subtypes (T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL rearrangement, and Hyperdiploid > 50 chromosomes), and one unknown type, labeled OTHER. The data were taken from the Kent Ridge Bio-Medical Data Set Repository, where they have been split into training and test sets. We clustered the training set comprising 215 patients.
We first selected the 1000 genes with the highest variance and normalized the expression profiles to have zero mean and unit variance; squared Euclidean distance between patients is then a monotonically decreasing function of the correlation between the corresponding expression profiles. Kernel density estimation does not make sense in this example, as the observations lie in a 214-dimensional subspace of 1000-dimensional space. Next, we computed the graph cluster tree for the nearest-neighbor density estimate (the single linkage dendrogram). The largest runt sizes are 36, 27, 21, 14, 8, 5, 5, . . . , suggesting five clusters. Figure 7 shows the (pruned) graph cluster tree, and Table 5 shows a table of ALL subtype (vertical axis) against leaf code (horizontal axis). The T-ALL, E2A-PBX1, and TEL-AML1 subtypes correspond closely to individual leaves of the tree.

Table 5. ALL data: leaf code (horizontal axis) tabulated against ALL subtype (vertical axis).

SUMMARY AND DISCUSSION
Nonparametric clustering is based on the premise that groups in the data correspond to modes of the feature density. The goal then is to detect modes of the density and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; the modes of the density correspond to the leaves of the cluster tree. We have pursued a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We have developed a graph-based method that can approximate the cluster tree of any density estimate. Due to sampling variability, density estimates tend to have spurious modes that do not reflect modes of the feature density and that will lead to spurious branches in the graph cluster tree. We have proposed excess mass as a measure for the size of branches of the graph cluster tree, reflecting the height of the corresponding peak (or peaks) of the density above the surrounding valley floor and its spatial extent. Excess mass can be used as a guide for subjective pruning of the graph cluster tree. The graph cluster tree of the nearest-neighbor density estimate is (essentially) the single linkage dendrogram. Excess mass pruning is a generalization of the runt pruning method for extracting clusters from a single linkage dendrogram proposed by Stuetzle (2003).
In the examples presented in Section 7, as well as in about a half dozen others not reported here, we have observed the following:

1. Single linkage clustering (extracting clusters from the single linkage dendrogram by dendrogram cutting) is not a viable clustering method. Pruning branches with small size is vastly superior.
2. The other standard hierarchical clustering methods (average and complete linkage, and Ward's method) and model-based clustering cannot find clusters that are highly nonconvex (bullseye), whereas generalized single linkage can.
3. In the (admittedly small number of) examples, generalized single linkage clustering was consistently among the top performers. This suggests that the performance penalty paid for the flexibility of generalized single linkage clustering (the ability to cope with nonconvex clusters) is small.
4. Kernel density estimates with bandwidth determined by least squares cross-validation tend to have many modes, most of them caused by sampling variability, and pruning the graph cluster tree is crucial. Runt size and runt excess mass provide good indicators for the number of groups.
5. Kernel and nearest-neighbor density estimates give comparable clustering performance in most examples. This came as a pleasant surprise because the calculations for the nearest-neighbor estimate are very fast: clustering the Handwritten Digit data with 2000 observations and 256 variables takes less than 10 sec on a modern PC.
There are several directions for future work.

Other density estimates: Kernel and nearest-neighbor density estimates are known to be susceptible to the curse of dimensionality. It may be worthwhile to investigate the performance of generalized single linkage clustering with other density estimates potentially less impacted by high dimensionality, like projection pursuit density estimates (Friedman, Stuetzle, and Schroeder 1984; Friedman 1987).
Alternative pruning strategies: Excess mass pruning is based solely on the prominence of peaks of the estimated density, that is, their height and spatial extent; it does not take into account the spatial separation between peaks. Pruning strategies taking into account both prominence and separation may allow for better detection of small but highly isolated groups.
Automatic pruning: Subjective pruning casts doubt on interpretations of clustering results and makes quantitative comparisons of results difficult. A fully automatic pruning method (analogous to model selection methods for regression and classification) would be preferable. The work of Nugent (2006) contains some preliminary ideas and results.

SUPPLEMENTAL MATERIALS
Data and results: A zip archive containing an R package implementing generalized single linkage clustering as well as the LaTeX source, R code, and data needed to reproduce the article, the Appendix, and all the figures and numerical results presented therein is available online. The file "README.txt" in the archive has a detailed description of the archive. (Stuetzle-Nugent-supp.zip)