The Conditional-Potts Clustering Model

This article presents a Bayesian kernel-based clustering method. The associated model arises as an embedding of the Potts density for class membership probabilities into an extended Bayesian model for joint data and class membership probabilities. The method may be seen as a principled extension of super-paramagnetic clustering. The model depends on two parameters: the temperature and the kernel bandwidth. The clustering is obtained from the posterior marginal adjacency membership probabilities and does not depend on any particular value of the parameters. We elicit an informative prior based on random graph theory and kernel density estimation. A stochastic population Monte Carlo algorithm, based on parallel runs of the Wang-Landau algorithm, is developed to estimate the posterior adjacency membership probabilities and the parameter posterior. The convergence of the algorithm is also established. The method is applied to the whole human proteome to uncover human genes that share a common evolutionary history. Our experiments and application show that good clustering results are obtained at many different values of the temperature and bandwidth parameters. Hence, instead of focusing on finding adequate values of the parameters, we advocate making clustering inferences based on the study of the distribution of the posterior adjacency membership probabilities. This article has online supplementary material.


INTRODUCTION
Clustering with the Potts model is a rather recent nonparametric technique introduced by Blatt, Domany, and Wiseman (1996, 1997) under the name of super-paramagnetic clustering. There is a large literature on the subject in the physics community (Agrawal and Domany 2003; Ott et al. 2004; Reichardt and Bornholdt 2004). Its impact has reached the medical (Stanberry, Murua, and Cordes 2008), bioinformatics (Getz et al. 2000; Einav et al. 2005), and the computer science and machine learning communities as well (Domany et al. 1999; Quiroga, Nadasdy, and Ben-Shaul 2004). It has also appeared in the statistical literature as Potts model clustering (Murua, Stanberry, and Stuetzle 2008), where its link with other kernel-based methods and nonparametric density estimation was presented. A similar, simpler model has also been used as a probabilistic framework for K-nearest-neighbor classification (Cucala et al. 2009). One of the main advantages of super-paramagnetic clustering over other kernel-based methods is the simultaneous estimation of the clustering and the number of clusters. That is, there is no need to specify the number of clusters a priori; the method uncovers it. From the statistical point of view, one of the advantages of the method is its connection to the well-known Potts model density, also known as the Boltzmann density. Having a probabilistic framework aids in making inferences about the clusters and their number. In fact, the clustering is estimated by Markov chain Monte Carlo (MCMC) simulation of the labels' distribution.
Let $X = \{x_i \in \mathbb{R}^p,\ i = 1, \ldots, n\}$ be our data. In the super-paramagnetic clustering framework, the observations form the vertices of a graph. Let us denote this data graph by G(X). We say that two observations $x_i, x_j$ are neighbors if their corresponding vertices are joined by an edge of the graph G(X). In this case, we will write $i \sim j$. The edge weights are given by the similarities between the neighboring points. We assume that the similarities are given by a Mercer kernel $k_{ij} = k(x_i, x_j)$, that is, a continuous symmetric nonnegative definite kernel (Girolami 2002). The similarities usually depend on the distances between the observations $\|x_i - x_j\|$, and a scale parameter $\sigma$, the bandwidth, that controls the relative spread of the distances, so that

$$k_{ij}(\sigma) = k\big(\|x_i - x_j\|/\sigma\big). \qquad (1)$$

In practice, the data graph is made rather sparse by setting $k_{ij} = 0$ for observations that are too far apart. One way to accomplish this is by using the K-nearest-neighbor graph, that is, a graph where the only edges associated with each $x_i$ are those joining $x_i$ to its K nearest neighbors. In this case, the work of Blatt, Domany, and Wiseman (1997) recommends using K = 10 for high-dimensional data, whereas the work of Stanberry, Murua, and Cordes (2008) recommends using a moderate value of K in the range $5 < K \le 30$. In many applications, such as in bioinformatics (e.g., microarray data), the similarities correspond to the correlation between two signals (e.g., signals coming from two different probe sets or genes that measure the expression levels under several experimental conditions or over certain elapsed time). In this case, $\|x_i - x_j\|^2 = 2(1 - \mathrm{corr}(x_i, x_j))$. When Euclidean distances are used, one may want to control for different variations in spread of the different covariates making up each data point $x_i$. This may be achieved by considering a multi-bandwidth model (one $\sigma_j$ for each covariate $j \in \{1, \ldots, p\}$), or by standardizing the covariates. In practice, it is hard to estimate just one bandwidth $\sigma$, let alone p of them. We prefer to standardize the covariates. Most of the literature treats the kernel bandwidth as a fixed parameter. It is usually set to the mean of the data point similarities (Blatt, Domany, and Wiseman 1996, 1997). However, Murua, Stanberry, and Stuetzle (2008) showed that treating it as variable may give better clustering results. Although they suggest the use of a data-driven adaptive bandwidth, we believe that its incorporation as a parameter of the model is more appropriate.
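For concreteness, the following minimal Python sketch builds such a sparse K-nearest-neighbor similarity graph. The Gaussian kernel $k(d) = \exp(-d^2/2)$ is an assumption for illustration (the article leaves the Mercer kernel unspecified), as is the function name:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_kernel_graph(X, K=10, sigma=1.0):
        # Standardize the covariates, as preferred in the text.
        X = (X - X.mean(axis=0)) / X.std(axis=0)
        nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
        dist, idx = nn.kneighbors(X)  # column 0 is the point itself
        edges, weights = [], []
        for i in range(X.shape[0]):
            for d, j in zip(dist[i, 1:], idx[i, 1:]):
                # k_ij(sigma) = k(||x_i - x_j|| / sigma), Gaussian k assumed
                edges.append((i, int(j)))
                weights.append(np.exp(-0.5 * (d / sigma) ** 2))
        return edges, np.array(weights)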
We assume that, given the data, the class labels follow the Potts model density. This assigns labels $\ell \in \{1, \ldots, q\}$ to each observation $x_i$, $i = 1, \ldots, n$, so that observations similar to each other are likely to be assigned the same label. Statistically, let $z_{i\ell} = 1$ if $x_i$ has been assigned to the $\ell$th label class, and zero otherwise. The Potts model density is given by

$$p(\{z_{i\ell}\} \mid \sigma, T, X) = \big(Z(\sigma, T, X)\big)^{-1} \exp\Big\{-\frac{1}{T}\sum_{i\sim j} k_{ij}(\sigma)\big(1-\delta_{ij}\big)\Big\}, \qquad (2)$$

where $\delta_{ij} \doteq \sum_{\ell=1}^{q} z_{i\ell} z_{j\ell} = 1$ if and only if $x_i$ and $x_j$ are assigned the same label; T, the temperature, is one of the main parameters of interest in this work; and

$$Z(\sigma, T, X) = \sum_{\{z_{i\ell}\}} \exp\Big\{-\frac{1}{T}\sum_{i\sim j} k_{ij}(\sigma)\big(1-\delta_{ij}\big)\Big\} \qquad (3)$$

is the normalizing constant of the Potts density. Since any label configuration $\{z_{i\ell}\}$ defines a partition of the data, we will refer to them as partitions. They can be efficiently sampled by MCMC algorithms (Swendsen and Wang 1987). The labels do not necessarily indicate cluster membership. However, two observations lying in the same cluster must share the same label. The labels are rather auxiliary variables that help to merge similar observations together. The information on the data clustering is obtained from the probabilities $Q_{ij}(\sigma, T)$ that two given data points lie in the same cluster (under the Potts model),

$$Q_{ij}(\sigma, T) = p\big(\delta_{ij} = 1 \mid \sigma, T, X\big). \qquad (4)$$

We refer to the $Q_{ij}(\sigma, T)$'s as the membership adjacency probabilities (they are also known as the spin-spin correlations in the physics literature). In practice, the $Q_{ij}(\sigma, T)$'s are estimated from several partition samples generated sequentially using MCMC.
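As a toy illustration of this last point, the sketch below averages the indicators $\delta_{ij}$ over a set of sampled partitions; the array of label draws would come from an MCMC sampler such as Swendsen-Wang (not shown here):

    import numpy as np

    def estimate_Q(label_samples):
        # label_samples: (m, n) array; each row holds the label of each
        # of the n observations in one sampled partition.
        m, n = label_samples.shape
        Q = np.zeros((n, n))
        for labels in label_samples:
            Q += (labels[:, None] == labels[None, :])  # delta_ij indicators
        return Q / m  # estimated Q_ij(sigma, T)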
The Conditional-Potts Model. Note that the dependency on the data in the interactions or edge weights $k_{ij}(\sigma)$ is present only through the mutual distances (or similarities) between the data points (for convenience in the notation, we will sometimes drop the $\sigma$ from the $k_{ij}$'s). Therefore, in what follows, it will be useful to think of the mutual distances (or similarities) between the data points as the observed data. We set (i.e., define) our likelihood for the parameters $(\sigma, T)$ to

$$p(X \mid \sigma, T) = \frac{Z(\sigma, T, X)}{Z(\sigma, T)}.$$

Given a prior distribution for the parameters, $\pi(\sigma, T)$, we consider the Bayesian embedding of super-paramagnetic clustering whose density is proportional to

$$\pi(\sigma, T)\,\big(Z(\sigma, T)\big)^{-1} \exp\Big\{-\frac{1}{T}\sum_{i\sim j} k_{ij}(\sigma)\big(1-\delta_{ij}\big)\Big\}. \qquad (5)$$

As a consequence, in this framework, the Potts model density corresponds to the marginal conditional $p(\{z_{i\ell}\}\mid\sigma, T, X)$ of the labels given the data and the parameters. In practice, we are interested in estimating the posterior distribution of the parameters $(\sigma, T)$. It is straightforward to see that this posterior is given by

$$\pi(\sigma, T \mid X) \propto \pi(\sigma, T)\,\frac{Z(\sigma, T, X)}{Z(\sigma, T)}, \qquad (6)$$

where $Z(\sigma, T)$ is the overall normalizing constant of the density given by (5) (i.e., the integral over the data of $Z(\sigma, T, X)$), and, as before, $Z(\sigma, T, X)$ is the Potts model density normalizing constant. Our procedure to unveil the clustering of the data is based on estimates of this posterior. Note that all we need to estimate the data clustering are the posterior adjacency probabilities $Q_{ij}(\sigma, T)$. Equation (6) allows us to get estimates of the marginal posteriors $Q_{ij} = p(\delta_{ij} = 1\mid X)$ by sampling from (6) and then from (2). We will refer to the model given by (5) or (6) as the conditional-Potts clustering model, as opposed to the Potts model or super-paramagnetic clustering procedure given by (2).

A clustering of the data is obtained by consensus: having obtained several partitions of the data, the probabilities of cluster membership adjacency of two data points are readily estimated from the proportion of times these two data points are found in the same component (Sokal 1996). A new graph, the consensus graph, is constructed using these probabilities. An edge between two data points exists in the consensus graph if and only if the frequency of cluster membership adjacency between these points is larger than a predefined threshold, say Q. A consensus clustering is formed by the connected components of this graph (a minimal code sketch is given below). Note that changing the threshold Q may result in different clusterings of the data. The super-paramagnetic clustering procedure fixes it at Q = 0.5. Choosing the threshold has not received much attention in the literature. However, we show in this work that, together with the temperature, this is a key parameter of the procedure. Our experiments of Section 5 show that dramatic clustering differences may occur depending on its value. Therefore, it is advantageous to conceive a more principled way to obtain a suitable clustering of the data. That is in part what we propose in this article. We advocate shifting the focus of the problem from trying to estimate optimal values of the parameters $(\sigma, T)$ to trying to obtain good estimates of the marginal posterior adjacency membership probabilities $Q_{ij}$. Note that in contrast to the $Q_{ij}(\sigma, T)$ used in super-paramagnetic clustering, these quantities do not depend on the parameters $(\sigma, T)$. Consequently, we can use these probabilities to get posterior consensus clusterings that are independent of the values of $(\sigma, T)$. Their distribution implies an inherent distribution on the number of clusters and hence on the clusterings of the data.
This latter distribution is given by the frequency of each particular consensus clustering, say $\pi(c)$, where $c \in \{1, 2, \ldots, n\}$ denotes the number of clusters, as a function of a random membership probability threshold drawn from the distribution of the $Q_{ij}$'s. We suggest drawing inference on the data clustering from $\pi(c)$. We show that, in general, several different clusterings are admissible and that some of them have more weight than others. We try to formalize this point by defining for each clustering a measure of clustering evidence (see Section 3).
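The consensus step itself is simple to express in code. A minimal sketch, reusing the Q matrix from the previous sketch and a probability threshold:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def consensus_clustering(Q, threshold=0.5):
        # Edge {i, j} is present iff Q_ij > threshold; the consensus
        # clusters are the connected components of this graph.
        A = csr_matrix(Q > threshold)
        n_clusters, labels = connected_components(A, directed=False)
        return n_clusters, labels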
The advantages of our conditional-Potts clustering model led Linard et al. (2012) to apply our methodology to their large evolutionary biology history data. Evolutionary systems biology studies the evolution of biological networks by integrating evolutionary information extracted from multiple biological levels, from the DNA level (genome context) to protein and phylum levels (protein sequence and structure, species sharing common genes, etc.) (Medina 2005; Loewe 2009). Aiming to ease the analysis of complex evolutionary histories from a human-genome perspective, Linard et al. (2012) developed a formalism to put together diverse sources of data from 17 vertebrate species whose genomes are comparable with the human genome. In this work, we apply our clustering methodology to identify genes that share similar evolutionary histories, that is, to find a meaningful clustering of the barcodes from an evolutionary systems biology point of view (see Section 6).
Another important contribution of this work is the introduction of a procedure for clustering with the Potts model that is less computationally intensive than super-paramagnetic clustering and the conditional-Potts clustering. This procedure relies heavily on a data-driven estimate of a very informative prior, which is derived from random graph theory and the connection between kernel-based methods and kernel density estimation (Murua, Stanberry, and Stuetzle 2008). We refer to this latter procedure as the informed conditional-Potts clustering. We show in Section 5 that its performance is similar to that of the main stochastic population Monte Carlo method introduced in this work for Potts model clustering (see Section 2.3).
The article is organized as follows. The stochastic procedure to estimate the posterior densities of the parameters is described in Section 2. The estimation of the consensus clustering is described in Section 3. The informed conditional-Potts clustering method is introduced in Section 4 together with the elicitation of the informative priors for the temperature and kernel bandwidth parameters. The results of the application of our model to certain artificial and real datasets are shown in Section 5. Section 6 presents the application of our methodology to the clustering of evolutionary history of the human proteome barcode data. Section 7 presents some conclusions and a discussion. The proofs of the theorems that led to the elicitation of an informative prior for the temperature, and that established the convergence of our algorithm are shown in the online supplementary materials.

SAMPLING FROM THE POSTERIOR DISTRIBUTION
The normalizing constants $Z(\sigma, T)$ and $Z(\sigma, T, X)$ of the conditional-Potts clustering model are intractable when the data size is moderately large. Therefore, sampling directly from the posterior $\pi(\sigma, T \mid X)$ is not possible. In addition, even MCMC methods such as the Metropolis-Hastings algorithm to generate samples $(\sigma', T')$ from the posterior do not work, since knowledge of the ratios $Z(\sigma, T)/Z(\sigma', T')$ and $Z(\sigma, T, X)/Z(\sigma', T', X)$ is necessary. The literature suggests several possibilities to overcome this problem. The most popular technique seems to be based on path sampling (Ogata 1989; Richardson and Green 1997; Gelman and Meng 1998). However, this involves the estimation of the normalizing constant for all $(\sigma, T)$, or at least for a reasonable grid of values of the parameters; moreover, it does not take into account the uncertainty in the estimators arising from the Monte Carlo integration. Instead, we follow a different adaptive MCMC procedure inspired by the work of Atchade, Lartillot, and Robert (2013). Section 2.3 describes our adaptation of their algorithm to the conditional-Potts clustering model. It relies on the Wang-Landau algorithm (Wang and Landau 2001) to produce a "flat histogram" on the parameter space. Like the algorithm of Atchade, Lartillot, and Robert (2013), our algorithm generates a stochastic process for which the distribution of $(\sigma_m, T_m)$ approaches the desired posterior distribution as $m \to +\infty$. The process is not necessarily Markovian, although it involves Metropolis-Hastings-like sampling. The main idea is to replace the normalizing constants $Z(\sigma, T)$ and $Z(\sigma, T, X)$ by a series of stochastic approximations. These are derived through population Monte Carlo techniques using a small fixed grid of values of the parameter space. A variant of the Wang-Landau algorithm is used to sample on this parameter-space grid.

THE WANG-LANDAU ALGORITHM
In the original Wang-Landau algorithm, the goal is to sample from a target distribution, say $\pi(u)$, for u in one of the discrete "energy" states $g \in \{1, 2, \ldots, d\}$, where $\pi(u)$ is only known up to a normalizing constant. Wang and Landau (2001) suggested sampling instead from

$$\pi_c(u) \propto \sum_{g=1}^{d} \frac{\pi(u)}{c(g)}\, 1_g(u),$$

where $c(\cdot)$ is a function of the energies, and the indicator function $1_g(u) = 1$ if and only if $u \in g$, and is zero otherwise. Once an energy state g is visited, $c(g)$ is modified so as to make another visit to g more unlikely. In fact, at iteration $m + 1$ of the sampler, $c_{m+1}(g) = c_m(g)(1 + \gamma_m 1_g(U_{m+1}))$, where $U_{m+1}$ denotes the current state of U. The sequence $\gamma_m$ is a slowly decreasing random sequence that controls the amount of penalty given to currently visited energies. The goal is to visit all energy states uniformly as $m \to \infty$. If this is the case, $c_m(g)/\sum_{g'=1}^{d} c_m(g')$ may be used as an estimate of the probability of the energy state g. The algorithm seems to be very efficient (Dayal et al. 2004; Ghulghazaryan, Hayryan, and Hu 2006; Zhou et al. 2006). Atchade, Lartillot, and Robert (2013) extended it to more sophisticated target densities without assuming a discrete energy space. We have adapted this latter procedure to our problem.
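The following toy sketch illustrates the mechanics on a discrete state space; the flat-histogram test used to shrink $\gamma_m$ is a crude stand-in for the heuristic of Atchade, Lartillot, and Robert (2013), and all function arguments are assumptions made for illustration:

    import numpy as np

    def wang_landau(log_pi, state_of, propose, u0, d, n_iter=100_000):
        log_c = np.zeros(d)      # log c(g), kept on the log scale
        gamma = 1.0
        visits = np.zeros(d)
        u = u0
        for m in range(n_iter):
            v = propose(u)       # symmetric proposal assumed
            # Metropolis ratio for the penalized target pi(u) / c(g)
            log_alpha = (log_pi(v) - log_c[state_of(v)]) \
                      - (log_pi(u) - log_c[state_of(u)])
            if np.log(np.random.rand()) < log_alpha:
                u = v
            g = state_of(u)
            log_c[g] += np.log1p(gamma)   # c(g) <- c(g) * (1 + gamma)
            visits[g] += 1
            # crude flat-histogram check: shrink gamma when visits even out
            if (m + 1) % 1000 == 0 and visits.min() > 0.8 * visits.mean():
                gamma, visits = gamma / 2.0, np.zeros(d)
        return u, log_c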

DECOMPOSITION OF THE NORMALIZING CONSTANT
As mentioned above, the data X enter the model through the distances $D_{ij} = \|x_i - x_j\|$ used to evaluate the edge weights. Without loss of generality, we may assume that all distances lie in the interval [0, 1] (the scale parameter $\sigma$ will absorb the differences in scaling). To simplify the calculation, we will assume that there is no interaction among the distances, that is, we will assume that the domain of integration of the distances is a rectangular region. This assumption yields

$$Z(\sigma, T) = \sum_{\{z_{i\ell}\}} \exp\Big\{-\beta(\sigma, T) \sum_{i\sim j}\big(1-\delta_{ij}\big)\Big\}, \qquad (7)$$

where $\beta(\sigma, T) = -\log\big(\int_0^1 e^{-k(D_{ij}/\sigma)/T}\, dD_{ij}\big)$ (this quantity can be easily computed for all $(\sigma, T)$ that need to be evaluated). Note that $Z(\sigma, T)$ turns out to be the normalizing constant of the Potts density with constant interaction $\beta(\sigma, T)$.
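The constant $\beta(\sigma, T)$ reduces to a one-dimensional integral, so simple quadrature suffices. A sketch, again assuming a Gaussian kernel for illustration:

    import numpy as np

    def beta_const(sigma, T, kernel=lambda d: np.exp(-0.5 * d ** 2), n_grid=1000):
        # beta(sigma, T) = -log integral_0^1 exp(-k(D/sigma)/T) dD
        D = np.linspace(0.0, 1.0, n_grid)
        return -np.log(np.trapz(np.exp(-kernel(D / sigma) / T), D))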
Let $\kappa(\sigma, T; \sigma', T')$ be a kernel function summing to unity on a fixed grid $\{(\sigma_g, T_g),\ g = 1, \ldots, d\}$ for all $(\sigma, T)$. A population Monte Carlo estimate of $Z(\sigma, T)$ can be derived from the following identity, which combines (7) with the decomposition of the normalizing constant suggested in Atchade et al. (2010):

$$Z(\sigma, T) = \sum_{g=1}^{d} \kappa(\sigma, T; \sigma_g, T_g)\, Z(\sigma_g, T_g)\; E_{\beta(\sigma_g, T_g)}\Big[\exp\Big\{-\big(\beta(\sigma, T) - \beta(\sigma_g, T_g)\big) \sum_{i\sim j}\big(1-\delta_{ij}\big)\Big\}\Big], \qquad (8)$$

where $E_{\beta(\sigma_g, T_g)}$ denotes expectation under the Potts density $p(\cdot\mid\beta(\sigma_g, T_g))$. This hints at estimating $Z(\sigma, T)$ by sampling from the Potts densities $p(\cdot\mid\beta(\sigma_g, T_g))$, $g = 1, \ldots, d$. Similarly, a population Monte Carlo estimate of $Z(\sigma, T, X)$ can be derived from (3) by

$$Z(\sigma, T, X) = \sum_{g=1}^{d} \kappa(\sigma, T; \sigma_g, T_g)\, Z(\sigma_g, T_g, X)\; E_{\sigma_g, T_g, X}\Big[\exp\Big\{-\sum_{i\sim j}\Big(\frac{k_{ij}(\sigma)}{T} - \frac{k_{ij}(\sigma_g)}{T_g}\Big)\big(1-\delta_{ij}\big)\Big\}\Big], \qquad (9)$$

where $E_{\sigma_g, T_g, X}$ denotes expectation under $p(\cdot\mid\sigma_g, T_g, X)$. As before, this hints at estimating $Z(\sigma, T, X)$ by sampling from the Potts densities $p(\cdot\mid\sigma_g, T_g, X)$, $g = 1, \ldots, d$. However, (8) and (9) still require knowledge of $Z(\sigma_g, T_g)$ and $Z(\sigma_g, T_g, X)$. We overcome this problem by considering a stochastic Monte Carlo algorithm based on parallel runs of the Wang-Landau algorithm: one to estimate $Z(\sigma_g, T_g, X)$, and a second one to estimate $Z(\sigma_g, T_g)$. The proposals $(\sigma', T')$ are generated from stochastic approximations of the posterior distribution based on approximations of (8) and (9). The size of the grid, d, is a parameter to be chosen, which depends on the problem at hand. Atchade, Lartillot, and Robert (2013) suggested using a moderate value, for example, 100 (see also Section 5).
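On the log scale, the Monte Carlo version of identity (8) is a weighted log-sum-exp over the grid. A sketch follows (identity (9) is analogous, with per-edge energy differences in place of the $\beta$ differences); all argument names are illustrative:

    import numpy as np
    from scipy.special import logsumexp

    def log_Z(beta_target, beta_grid, logZ_grid, log_kappa, n01_samples):
        # n01_samples[g]: values of sum_{i~j}(1 - delta_ij) for draws from
        # the Potts density with constant interaction beta_grid[g];
        # logZ_grid[g]: log Z(sigma_g, T_g), e.g., from Wang-Landau.
        terms = []
        for g, n01 in enumerate(n01_samples):
            log_mean = logsumexp(-(beta_target - beta_grid[g]) * np.asarray(n01)) \
                       - np.log(len(n01))
            terms.append(log_kappa[g] + logZ_grid[g] + log_mean)
        return logsumexp(terms)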

A POPULATION MONTE CARLO ALGORITHM FOR POTTS MODEL CLUSTERING (PMC2)
Our algorithm follows closely that of Atchade, Lartillot, and Robert (2013). The main difference lies in the incorporation of a parallel run of the Wang-Landau algorithm as well as a mixing step to generate the parameters. In what follows, a draw from the Potts model density $p(\cdot\mid\sigma, T, X)$ will be denoted simply as $\{z^m_{i\ell}\}$ (for $\ell = 1, \ldots, q$ and $i = 1, \ldots, n$) or $\{\delta^m_{ij}\}$ (for $i, j = 1, \ldots, n$), whereas a draw from the Potts model $p(\cdot\mid\beta(\sigma, T))$ will be denoted as $\{z^m_{\beta;i\ell}\}$ or $\{\delta^m_{\beta;ij}\}$. The superscript m will denote the iteration or sample number during the run of the algorithm. We will refer to the algorithm that follows as Population Monte Carlo for Potts Model Clustering, and will denote it by PMC2 for short. Each iteration starts from the current state of the parameters at iteration m, $(\{z^m_{i\ell}\}, \{z^m_{\beta;i\ell}\}, T_m, \sigma_m)$, and $I_m, I_{\beta;m} \in \{1, \ldots, d\}$, and proceeds through the four steps described below (a schematic sketch is given at the end of this subsection).
Note that Equations (13) and (14) resemble Equations (8) and (9), respectively. The expectations have been approximated by the population Monte Carlo sampling estimates.
Steps 1 and 2 correspond to the two parallel Wang-Landau algorithms.
Step 3 is the sampling step. This differs from the original Wang-Landau algorithm in that, instead of only sampling from the grid of parameter values, the sampling is also done on the whole space of parameter values. Substeps 1.2 and 2.2 ensure that the sampling on the grid is done so as to obtain, in the long run, a flat histogram, that is, so that in the long run all points in the grid are equally likely to be sampled.
Step 4 is added so that the posterior adjacency membership probabilities may be estimated marginally, that is, independently of the parameters (σ, T ).
The label vectors $\{z_{\beta;i\ell}\}$ and $\{z_{i\ell}\}$ are initialized via the corresponding Potts model density using the Swendsen-Wang algorithm (Swendsen and Wang 1987; Sokal 1996). We suggest monitoring the convergence of the algorithm by computing the relative Euclidean distance between the estimated marginal adjacency probabilities yielded by successive iterations of the algorithm. Recall from the introduction that these probabilities are the quantities of interest for estimating the clustering structure of the data.
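Since the detailed step-by-step listing appears in the original algorithm box, we give here only a schematic reconstruction of one PMC2 sweep from the step descriptions above; all helper names (wang_landau_update, sample_parameters, swendsen_wang_step, same_component_indicators) are hypothetical stand-ins:

    # Schematic sketch of the PMC2 loop; helper names are hypothetical.
    for m in range(n_iter):
        # Steps 1-2: two parallel Wang-Landau updates on the (sigma, T)
        # grid, one for Z(sigma_g, T_g, X) and one for Z(sigma_g, T_g).
        z, I, log_c = wang_landau_update(z, I, log_c, gamma[m], data_graph)
        z_b, I_b, log_c_b = wang_landau_update(z_b, I_b, log_c_b,
                                               gamma_b[m], const_graph)
        # Step 3: Metropolis-Hastings-like draw of (sigma, T) over the
        # whole parameter space, using the stochastic approximations of
        # Z(sigma, T, X) and Z(sigma, T) built through (8) and (9).
        sigma, T = sample_parameters(sigma, T, log_c, log_c_b)
        # Step 4: refresh the labels given (sigma, T) and accumulate the
        # marginal adjacency probability estimates Q_ij.
        z = swendsen_wang_step(z, sigma, T, data_graph)
        if m >= burn_in:
            Q += same_component_indicators(z)
    Q /= (n_iter - burn_in)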

ASYMPTOTIC PROPERTIES
At convergence, $e^{c_m(g)}$ (respectively, $e^{c_{\beta;m}(g)}$) is proportional to the corresponding normalizing constant $Z(\sigma_g, T_g, X)$ (respectively, $Z(\sigma_g, T_g)$). The rejuvenation step-size parameters $\gamma_m$ and $\gamma_{\beta;m}$ control the speed of convergence of $e^{c_m(g)}$ and $e^{c_{\beta;m}(g)}$, respectively. They should slowly decrease to zero. We choose to update them according to the heuristic suggested in Atchade, Lartillot, and Robert (2013, sec. 2.4). The almost sure convergence of these sequences is assured by the Wang-Landau algorithm (Atchade and Liu 2010; Atchade, Lartillot, and Robert 2013), since the two sequences are built independently of each other.
Let $\{F_m\}$ be the filtration generated by all the variables drawn by the algorithm up to iteration m. The convergence in probability of the samples $(\sigma_m, T_m)$ is established as in Atchade, Lartillot, and Robert (2013) by an application of their general stochastic convergence Theorem 2.1. To ensure that the assumptions of this theorem hold, we only need the following:

(P1) The sequences $\gamma_m$ and $\gamma_{\beta;m}$ are strictly positive random sequences adapted to $\{F_m\}$ satisfying $\sum_m \gamma_m = \sum_m \gamma_{\beta;m} = +\infty$, and, with probability one, $\sum_m \gamma_m^2 < +\infty$ and $\sum_m \gamma_{\beta;m}^2 < +\infty$.

(P2) The proposal $q_m(\sigma', T'\mid\sigma, T)$ used to generate the samples $(\sigma_{m+1}, T_{m+1})$ in Step 3 of the PMC2 algorithm is bounded from above and away from zero independently of m and of the pairs $(\sigma', T')$ and $(\sigma, T)$.

(P3) The kernel $\kappa(\sigma', T'; \sigma, T)$ is bounded from above and away from zero independently of the pairs $(\sigma', T')$ and $(\sigma, T)$.

(P4) The prior density $\pi(\sigma, T)$ is bounded from above and away from zero.

CLUSTERING EVIDENCE
Having estimates of the marginal posterior adjacency membership probabilities $Q_{ij}$, we can obtain a consensus clustering of the data as the connected components of a consensus graph. As mentioned earlier, a consensus graph depends on a threshold Q on the probabilities $Q_{ij}$. It is clear that as the threshold increases, more clusters are found. We note that in super-paramagnetic clustering the optimal clustering is chosen by fixing the threshold at Q = 0.5. We show in our experiments that, in general, this choice does not necessarily give the best results, unless the temperature T is explicitly chosen so as to give good consensus clusterings for Q = 0.5. To avoid having to select a threshold, we adopt a strategy based on what we call the clustering evidence associated with each possible clustering. For each possible value of $Q \in \mathcal{Q} = \{Q_{ij},\ i, j = 1, \ldots, n\}$, we obtain the corresponding consensus clustering P(Q), and the number of clusters c(Q) in it. A histogram of c(Q) may be used as support for favoring certain clusterings over others. It may be interpreted as follows: the chances of selecting a clustering with c(Q) components are given by the frequency of such a clustering had the threshold been chosen randomly according to the distribution of the posterior adjacency membership probabilities. The normalized frequencies (or "probabilities") derived from this histogram will be denoted by $\pi(c(Q))$. For any $c \in \{c(Q) : Q \in \mathcal{Q}\}$, let $F(c) = \sum_{c' \le c} \pi(c')$. The clustering evidence odds of a clustering with at most c clusters against a clustering with more than c clusters are defined as

$$\mathrm{odds}(c) \doteq \frac{F(c)}{1 - F(c)}.$$

We consider the pair $(\mathrm{odds}(c), \pi(c))$ as the evidence supporting a clustering with c components (a computational sketch is given below). The best clusterings are chosen as the clusterings with the best clustering evidence: ideally, $\pi(c)$ must be large in comparison with the other normalized frequencies $\pi(c')$, and odds(c) must be much larger than 1.0. Clusterings with odds too large against them are discarded. In our experiments (see Section 5), we note that sometimes it is sufficient to look at $\pi(c)$ to decide on the clustering, since some datasets present a large mode at a specific number of clusters. There are occasions where $\pi(c)$ presents several modes of roughly the same order of magnitude. In these cases, it is better to look at the odds of the different modes. Often, there are clear jumps in magnitude at specific numbers of clusters c. The preferred clusterings are associated with these values. Figure 1 illustrates these observations on the clustering evidence for two of the datasets used in our experiments of Section 5. The Gaussian dataset, which is composed of 50 Gaussian clumps, gives strong evidence for 49 clusters (the peak in $\pi(c)$). However, the Olive oil dataset gives evidence for three, four, six, or nine clusters. The evidence for more than 11 clusters is much weaker, since the odds against them are too large. Hence, one could choose nine clusters as the preferred clustering. But it may be worth investigating the solutions with three and six clusters. Further analysis of these and other datasets can be found in Sections 5 and 7.
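A short sketch of this computation, reusing the consensus_clustering helper sketched in the introduction:

    import numpy as np

    def clustering_evidence(Q):
        # Sweep every observed Q_ij value as a threshold, count clusters,
        # and form pi(c), F(c) = sum_{c' <= c} pi(c'), and odds(c).
        thresholds = np.unique(Q[np.triu_indices_from(Q, k=1)])
        counts = np.array([consensus_clustering(Q, t)[0] for t in thresholds])
        c_values, freq = np.unique(counts, return_counts=True)
        pi = freq / freq.sum()
        F = np.cumsum(pi)
        odds = F / np.clip(1.0 - F, 1e-12, None)
        return c_values, pi, odds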

THE INFORMED CONDITIONAL-POTTS CLUSTERING (IPRIOR + PMC)
In our experiments, we observe that we can obtain similar clusterings by fixing the temperature and bandwidth parameters at their maximum a posteriori (MAP) estimates $(\sigma_M, T_M)$. In this case, the clustering evidence is given by the adjacency membership probabilities $Q_{ij}(\sigma_M, T_M)$, and not by the posterior probabilities $Q_{ij}$ as above. In what follows, we will refer to this method of choosing the clustering as MAP + PMC. One of the main results derived from our experiments (see next section) is that a good clustering may be obtained by maximizing a data-driven prior of the critical temperature and kernel bandwidth instead of using the MAP estimates. The prior is elicited in the next section. It is used in Section 4.2 to build the informed conditional-Potts clustering method.

ELICITATION OF SENSIBLE PRIORS FOR THE PARAMETERS
Consider an extension of the Potts model where on each edge of the data graph a Bernoulli variable $b_{ij}$, the bond, is introduced. This bond is set to 1 with probability $p_{ij}(\sigma, T) = p_{ij}(\sigma, T, \{k_{ij}\}) = 1 - \exp(-k_{ij}(\sigma)/T)$ if $i \sim j$ and $x_i$ and $x_j$ share the same label; otherwise, the bond is set to zero. The resulting joint label/bond model is known as the Fortuin-Kasteleyn-Swendsen-Wang model (Sokal 1996). The model was discovered in an effort to develop an efficient MCMC algorithm to sample the labels $\{z_{i\ell}\}$. The resulting algorithm is the Swendsen-Wang algorithm (Swendsen and Wang 1987). The bonds are auxiliary variables that help split the clusters. The marginal over the bonds is known as the random cluster model. Searching for clusterings in the data corresponds to searching for splits in the connected components of the random graph generated by the bonds given the labels. For example, a large variance in the size of the largest component indicates an imminent split of the component. There is a vast literature on this subject for the random cluster model with constant edge weights (for a list of references, see, e.g., the excellent notes of Sokal 1996, and the book of Bollobas 2001). In this latter model, $p_{ij}(\sigma, T) = p(\sigma, T)$ for all $x_i, x_j$. The probability of having a bond $b_{ij} = 1$ is then constant. The corresponding random cluster model density reduces to $q^C\, p(\sigma, T)^{n_1} (1 - p(\sigma, T))^{n_0}$, where $n_1$ = number of edges with bond $b_{ij} = 1$, $n_0$ = number of edges with bond $b_{ij} = 0$ among neighboring data points, and $C = C(\{b_{ij}\})$ = number of connected components in the graph. This corresponds to a random graph with parameter $p = p(\sigma, T)$. If the original data graph is an r-nearest-neighbor graph, then the resulting random graph is an r-regular random graph, that is, each vertex has exactly r edges. The search for a cluster split in the data corresponds to the appearance of a giant (connected) component in the graph (Bollobas 2001, p. 138, chap. 6), which is a connected component consisting of more than half the number of vertices of the graph. It can be shown that if p is large enough, then this is the only nontrivial component of the graph. The appearance of a giant component is also referred to as a phase transition in the graph. The probability p causing it is the phase transition probability or the critical probability.

Since, in general, our weights $k_{ij}(\sigma)$ are not constant, we cannot directly use these results to find a good temperature for clustering. However, we can still devise a method to elicit a good prior density for the sought temperatures based on random graph theory. Consider, for i, j fixed, the graph $G(k_{ij}(\sigma))$ with the same vertices and edges as the original data graph G(X), but whose edges now all have the constant weight $k_{ij}(\sigma)$, so that $p = 1 - \exp\{-k_{ij}(\sigma)/T\}$ for all existing edges in G(X). If m is the number of edges in G(X), then there are m graphs $G(k_{ij}(\sigma))$, one for each different edge weight. The idea is to find estimates of the random graph phase transition probabilities for each of the graphs $G(k_{ij}(\sigma))$, $i, j \in \{1, \ldots, n\}$. These are then used to find estimates of the critical temperatures $T_{ij}$ associated with each particular value of the edge weight $k_{ij}(\sigma)$. We show below in Section 4.1.1 the explicit connection between these probabilities and the critical temperatures. The temperatures so obtained are used to construct a prior density for the critical temperatures.
Since this density depends on a given bandwidth σ , we only are able to elicit a conditional prior that will be denoted by π (T |σ ). Its formal derivation is given below.

The Temperature Prior: The Connection to Random Graphs

The critical temperatures are inferred by using the random cluster model associated with the Potts model with equal edge weights. We consider random r-regular graphs of order n, the number of data points in X. In the random cluster model, the degree $r = r(T)$ of the vertices depends on the probabilities of having bonds equal to one ($b_{ij} = 1$), which are given by $p_{ij} = 1 - \exp\{-k_{ij}(\sigma)/T\}$. The idea is to increase the degree r until a giant component appears. This would signal a merging of the clusters, and hence a cluster transition. Since the critical degree $r_c$ depends on the temperature, the transition must occur at a particular temperature $T_c = r^{-1}(r_c)$. This computation is done separately for every value $p_{ij}$. The prior for the critical temperatures is then obtained by estimating the density associated with the set of critical temperatures $\{T_c(i, j)\}$.
To simplify the calculation, we suppose that each possible r-regular graph G(n, r) generated from the data has the same probability. The problem can be formulated as finding the value of r such that, with high probability, G(n, r) is connected. To find a bound on this probability, we follow the second proof of Theorem 7.3 in Bollobas (2001), which bounds the probability that a giant component appears as the degree r increases. Let $p_G$ denote the probability of having a connected component in G(n, r) of order at most n/2. Note that there cannot be a component of fewer than r + 1 vertices, as each vertex is connected to r other vertices.

Theorem (probability bound). Let $s \in (0, 1)$ be the quantity, depending on n and r, defined in the online supplementary materials. Then

$$p_G \le \frac{s^{r+1}}{\sqrt{2\pi}\,(r+1)^{5/2}\,(1-s)}.$$

The proof can be found in the online supplementary materials. Suppose now that this bound is upper bounded by $p_B$. Then, with probability at least $1 - p_B$, there is only a giant component, as isolated vertices cannot exist. We choose the smallest value of r, say $r_c = r(p_B)$, that satisfies $s^{r+1}/(\sqrt{2\pi}(r+1)^{5/2}(1-s)) \le p_B$. Typically, we set $p_B$ equal to a small value such as 0.01. Note that the number of edges necessary for the connectivity of the graph is then $r_c n/2$. Assuming that all edges are equally likely to occur, we have $p = p_c = r_c n/(2m)$, where m is the number of possible edges in the graph. For a K-nearest-neighbor graph, m is bounded by nK. The corresponding critical temperature is then given by

$$T_c(i, j) = -\,\frac{k_{ij}(\sigma)}{\log\big(1 - p_c\big)}. \qquad (16)$$
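A sketch of the resulting recipe follows; the function s_of_r stands in for the quantity s of the theorem (whose exact form is given in the supplementary materials), so it is an assumed input here:

    import numpy as np

    def critical_temperature(k_ij, n, m, s_of_r, p_B=0.01, r_max=200):
        # Find the smallest degree r_c whose bound falls below p_B ...
        r_c = r_max
        for r in range(1, r_max + 1):
            s = s_of_r(r)
            bound = s ** (r + 1) / (np.sqrt(2 * np.pi) * (r + 1) ** 2.5 * (1 - s))
            if bound <= p_B:
                r_c = r
                break
        # ... then convert the critical bond probability into T_c(i, j).
        p_c = r_c * n / (2.0 * m)
        return -k_ij / np.log(1.0 - p_c)   # Equation (16)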

The Bandwidth Prior: The Connection With Kernel Density Estimation.
Exploiting the connection between Potts model clustering and kernel density estimation, we derive a prior for the bandwidth parameter based on an adaptive bandwidth kernel density estimator (Abramson 1982; Silverman 1986, sec. 5.3). Having an estimate $\hat{p}(x)$ of the data density, the adaptive bandwidth at observation $x_i$ is given by

$$\sigma(x_i) \propto \hat{p}(x_i)^{-1/2}. \qquad (17)$$

We note that in many applications the data graph corresponds to a K-nearest-neighbor graph. A quick estimate $\hat{p}_{\mathrm{knn}}(x)$ of the data density is readily available in this case. We will use this observation to derive a data-driven procedure to estimate an optimal bandwidth and temperature for clustering.
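A minimal sketch of this bandwidth sample, using the classical K-nearest-neighbor density estimate $\hat{p}_{\mathrm{knn}}(x) = K/(n V_K(x))$, with $V_K(x)$ the volume of the ball reaching the K-th neighbor, and the square-root law (17) with proportionality constant set to one (an arbitrary normalization for illustration):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_adaptive_bandwidths(X, K=10):
        n, p = X.shape
        nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
        dist, _ = nn.kneighbors(X)
        r_K = dist[:, -1]                            # distance to the K-th neighbor
        log_p_knn = np.log(K / n) - p * np.log(r_K)  # ball volume up to constants
        return np.exp(-0.5 * log_p_knn)              # sigma_knn(x_i) from (17)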

THE INFORMED CONDITIONAL-POTTS CLUSTERING
The idea is to find empirical, that is, data-driven, estimates of the random graph phase transition probabilities for each of the graphs $G(k_{ij}(\sigma))$, $i, j \in \{1, \ldots, n\}$. These are used to find empirical estimates of the critical temperatures $T_{ij}$ associated with each particular value of the edge weight $k_{ij}(\sigma)$. The explicit relation is given by Equation (16). The sample of edge-wise critical temperatures so obtained is used to construct a density estimate of the critical temperature, say $\hat{\pi}(T\mid\sigma)$. We stress that this density depends on all edge weights present in the data graph for any given bandwidth $\sigma$. The same idea can be applied to derive a data-driven prior for the kernel bandwidth. Equation (17) defines a sample of n possible values for the bandwidth parameter. We use an empirical estimate of the density, $\hat{\pi}(\sigma)$, associated with the bandwidths $\{\sigma_{\mathrm{knn}}(x_i)\}$ as our data-driven prior for $\sigma$. In our experiments of Section 5, we have used a kernel density estimator for both $\hat{\pi}(T\mid\sigma)$ and $\hat{\pi}(\sigma)$. The kernel employed was the Epanechnikov kernel with bandwidths set to the standard deviations of the corresponding samples. Figure 2 shows the data-driven priors for some of the datasets used in our experiments of Section 5.

Figure 2. The logarithm of the data-driven priors for the Iris and Yeast cycle datasets. The "*" indicates the value that maximizes the prior.
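A sketch of the density estimates $\hat{\pi}(T\mid\sigma)$ and $\hat{\pi}(\sigma)$ with the Epanechnikov kernel and the bandwidth choice just described:

    import numpy as np

    def epanechnikov_kde(sample, n_grid=200):
        sample = np.asarray(sample, dtype=float)
        h = sample.std()                          # bandwidth = sample std. dev.
        grid = np.linspace(sample.min(), sample.max(), n_grid)
        u = (grid[:, None] - sample[None, :]) / h
        k = 0.75 * np.maximum(1.0 - u ** 2, 0.0)  # Epanechnikov kernel
        return grid, k.mean(axis=1) / h           # density estimate on the grid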
We say that a region in the temperature-bandwidth space is stable if there is high similarity between clusterings yielded by nearby points. Stanberry, Murua, and Cordes (2008) observed that these regions are also stable regions for the number of clusters and may correspond to phase-free regions of the random cluster model (i.e., no giant component appears in these regions; see Section 4.1.1). Very often in our experiments, the informative data-driven prior maximizer (σ p , T p ) lies in the high-density region of the posterior, and the clustering evidence is similar for both choices of the parameters: the MAP estimates and the informative prior maximizer. In what follows we will refer to this way of estimating the clustering as the informed-Prior PMC or iPrior + PMC for short. We believe that the good results achieved by the iPrior + PMC procedure indicate that, in general, our data-driven prior presents a high-density region in a stable region of the temperature-bandwidth space, as signaled by the posterior found through the PMC2 algorithm. Based on our experiments (see next section), we support the use of the iPrior + PMC procedure as an admissible alternative to PMC2 and MAP + PMC whenever the computational cost is an issue, for example, in the case of very large datasets.

EXPERIMENTAL RESULTS
Here, we present a comparison of the clustering performance of the conditional-Potts clustering achieved through PMC2 with that of the MAP + PMC and iPrior + PMC procedures presented above. For completeness in the comparisons, we also show the results associated with super-paramagnetic clustering (which we will denote as SPMC). As is now customary in the machine learning and clustering literature, we measure the goodness of fit of the resulting clusterings with the adjusted Rand index (ARI). The ARI is a measure of similarity (agreement) between two clusterings (or partitions). It was first suggested by Rand (1971) and then corrected for randomness by Hubert and Arabie (1985). A perfect match is signaled by an ARI score of 1.0: the closer the score is to one, the more similar the clusterings are.
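For instance, using scikit-learn's implementation:

    from sklearn.metrics import adjusted_rand_score

    # Two partitions of six points that agree only partially.
    labels_a = [0, 0, 0, 1, 1, 1]
    labels_b = [1, 1, 0, 0, 0, 0]
    print(adjusted_rand_score(labels_a, labels_b))  # about 0.32, far from 1.0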
In all the simulations, we set a 12-cell grid of values for the kernel bandwidth and a 30-cell grid of values for the temperature. The grids were uniform grids on a slightly larger interval containing the range of values indicated by the informative data-driven priors. More explicitly, if the informative prior suggested the region $(\sigma_{\min}, \sigma_{\max}) \times (T_{\min}, T_{\max})$, we placed the $12 \times 30$ grid points uniformly in $(\sigma_{\min}/\kappa, \kappa\sigma_{\max}) \times (T_{\min}/\kappa, \kappa T_{\max})$, for a fixed $\kappa$ in the range $1 < \kappa < 1.5$ (see the snippet below). The idea behind this heuristic is not to give all the weight to the prior, so as to safeguard against cases where it has missed some important information. For the datasets in our experiments, the value of $\kappa$ was not very relevant. We set it to $\kappa = 1.05$. We also tried larger values, but obtained similar results. We stress that the data-driven priors were not used as priors for the conditional-Potts clustering model; the priors used were uniform priors. The temperature grid is finer because finding a good temperature is critical. The sampling of the parameters was started after a burn-in period of at least 300,000 iterations. We did not notice much difference between the results yielded by using only 1000 or 10,000 samples after the burn-in period. The final clusterings were constructed by consensus clustering based on clustering evidence as described in Section 3.
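A sketch of this grid construction (the prior-support endpoints are illustrative values):

    import numpy as np

    # Illustrative prior-support endpoints; in practice they come from
    # the data-driven priors of Section 4.
    sigma_min, sigma_max, T_min, T_max = 0.1, 2.0, 0.01, 1.0
    kappa = 1.05
    sigmas = np.linspace(sigma_min / kappa, kappa * sigma_max, 12)
    temps = np.linspace(T_min / kappa, kappa * T_max, 30)
    grid = [(s, t) for s in sigmas for t in temps]  # the 12 x 30 = 360 cells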

PERFORMANCE ON REAL AND SIMULATED DATA
In this section, we report the results of a simulation study carried out to assess the performance of the conditional-Potts clustering model on three artificially generated datasets and three real datasets. For comparison purposes, in addition to our own ARI scores, we also report the best ARI scores published elsewhere for the real datasets (see Table 1). These are the scores to be compared with the results given in Tables 2 and 3. Based on these scores, the reader may get an idea of how difficult it is to cluster some datasets into the groups selected by some experts (see, e.g., the Yeast cycle data below).

Table 2. Clustering results based on the clustering evidence given by PMC2. "Q" stands for a representative adjacency membership probability threshold associated with the clustering evidence, "c(Q)" for the number of clusters, and "ARI" for the adjusted Rand index.

Table 3. Results based on clustering evidence with a fixed set of values for the parameters. "T" stands for temperature, "σ" for bandwidth, "Q" for a representative adjacency membership probability threshold associated with the clustering evidence, "c" for c(Q), and "ARI" for the adjusted Rand index. MAP + PMC and iPrior + PMC stand for the procedures with clustering evidence drawn from the MAP $(\sigma_M, T_M)$ and the data-driven prior maximizer $(\sigma_p, T_p)$, respectively. SPMC stands for super-paramagnetic clustering. The best ARI for each dataset is highlighted in bold.

The artificial datasets were (a) a 5-clump-3-arc dataset (Murua, Stanberry, and Stuetzle 2008), whose clusters present high variation in shape and distribution and are not very well separated; (b) a three-ring version of the Bull's eye data (Blatt, Domany, and Wiseman 1997), which is a real challenge for most clustering methods; and (c) a 50-Gaussian mixture dataset, whose differences in cluster volume may produce difficulties when choosing the appropriate temperature-bandwidth parameters. The data are plotted in Figure 3. The three real datasets were the well-known (a) two-class Iris data (Anderson 1935), which serve as test data to check that the procedures work; (b) the nine-class Olive oil data (Forina and Armanino 1982), which are moderately difficult data to cluster; and (c) the 5-phase subset of the Yeast cell cycle data (Cho et al. 1998), which is very difficult to cluster correctly. Further description of the datasets used in the simulation is given in the online supplementary materials.

The results for all six datasets are shown in Table 2. Figure 4 shows the clustering evidence yielded by PMC2 for all datasets, except for the 50-Gaussians and Olive oil datasets. The evidence yielded by PMC2 is clear for the Bull's eye and the Iris datasets. For the Iris data, $\pi(c)$ presents a unique large peak at c = 2. For the Bull's eye data, there are similar peaks at c = 2 and c = 3. The odds evidence makes us lean toward c = 3, since a large jump in odds is not produced until a much larger number of clusters. For the Yeast cycle data, there is a peak at c = 1, but the odds against a larger number of clusters do not increase until c = 5, where a jump from 3.9 (c = 5) to 11.1 (c = 6) is produced. Also, the probabilities $\pi(c)$ for c = 4, 5, or 6 are not that small relative to $\pi(1)$. Therefore, if one believes that there are at least two clusters, then c = 5 is a reasonable estimate following the clustering evidence. For the 5-clump-3-arc dataset, the clustering evidence is not totally clear.
There are several similar-size peaks at c = 5, c = 11, and c = 13. We choose c = 11 because of the large jump in odds against a larger number of clusters between c = 11 and c = 13. The results associated with the MAP + PMC, iPrior + PMC, and SPMC procedures are shown in Table 3.

CLUSTERING THE EVOLUTIONARY HISTORY OF THE HUMAN PROTEOME

Linard et al. (2012) created barcodes that convey information on 10 different evolutionary parameters. Each barcode is associated with one human gene and describes this gene when compared with the closest related homolog proteins in 17 early vertebrates. The parameters are the human protein length, the mean length difference between the human reference protein and the corresponding proteins of the 17 vertebrate species, the mean number of regions shared between the human reference protein and the vertebrate species, the mean sequence identity between the human reference protein and the vertebrate species, the mean number of protein domains in the vertebrate species, the mean domain conservation, the mean hydrophilicity, the mean number of inparalogs with respect to the vertebrates, the mean number of co-orthologs, and the mean synteny between the human genome and the vertebrate genomes. The data consist of 19,778 evolutionary barcodes representing the evolutionary history that led to the complete human proteome since the emergence of vertebrates. Linard et al. (2012) applied our conditional-Potts clustering procedure using its informative data-driven prior version. Barcodes with missing values were removed before the clustering, leaving a total of 19,465 barcodes. Prior to the clustering, each barcode coordinate was standardized by its mean and variance. The Euclidean distances between any two barcodes were computed so as to obtain the interactions between vertices in the conditional-Potts clustering model. They found 303 clusters. A Gene Ontology (GO; Ashburner et al. 2000) enrichment analysis was performed to find out whether the clusters were biologically relevant: 75% of the clusters had an enrichment p-value < 0.025 (the maximum recommended p-value is 0.05; the lower the p-value, the larger the enrichment). In particular, one of the clusters grouped numerous olfactory receptors that are known to have experienced a vast expansion during chordate evolution. Indeed, the number of olfactory receptors ranges from a dozen in fishes to over a thousand in rodents (Zhang and Firestein 2009). For further details, the reader is referred to the work of Linard et al. (2012). In the present work, we applied our complete methodology to these data to shed light on why Linard et al. (2012) found about 300 clusters. To estimate the posterior membership adjacency probabilities and the posterior joint density of the bandwidth-temperature parameters, we ran our PMC2 algorithm with a burn-in period of 200,000 iterations. Our inference was based on the 5000 samples drawn after the burn-in period. Figure 5 shows the clustering evidence based on PMC2 as well as that associated with the specific critical parameters given by the MAP and iPrior maximizer. The results are discussed in the next section.

DISCUSSION AND CONCLUSIONS
Concerning the evolutionary history of the human proteome data, one can infer from Figure 5 that the clustering evidence for more than about 260 clusters is very weak. Observe that the distribution $\pi(c)$ almost vanishes after about c = 260. A closer look at the clustering given by 265 clusters shows that most clusters are of moderate size (the median is 37 barcodes per cluster). It also shows the existence of a large cluster containing about 59% of the barcodes. Similar results are found in the other two clusterings given by the clustering evidence associated with $(\sigma_M, T_M)$ and $(\sigma_p, T_p)$. The three clusterings are very similar: their mutual adjusted Rand indexes are all above 0.60. We note that the 75% highly enriched clusters found by Linard et al. (2012) correspond to about 220 clusters, which is in close agreement with the clustering evidence found here.
Concerning the results in Section 5, one can see that all four clustering procedures applied to the six datasets performed similarly. However, we note that the results given for SPMC are associated with the best-performing admissible large peak $T_s$ in the temperature trajectory of the magnetization (the variance of the size of the largest cluster). Sometimes, it was difficult to decide which peak to choose among two or three competitive peaks: the wrong peak yielded much poorer results. Therefore, our choice of the peaks was sometimes biased toward peaks yielding better results. In this sense, the performance of SPMC has been overestimated in the comparisons reported in this article. We will denote by $(\sigma_s, T_s)$ the parameters associated with this procedure. Note that $\sigma_s$ is simply the square root of the mean of the distances between any two points in the dataset. The procedures based on single values of the parameters, MAP + PMC and iPrior + PMC, perform very well. Note that the values of the temperatures $T_M$ and $T_p$ may be very different. The informative data-driven prior tends to give more weight to smaller temperatures than the posterior. A closer look at the posterior reveals that the iPrior maximum lies in a relatively high-density region of the posterior but not necessarily near the MAP estimate. These observations indicate that there is a vast region in the bandwidth-temperature space for which good clustering results may be found. This had already been observed by Stanberry, Murua, and Cordes (2008) on fMRI data, and had been suggested by Blatt, Domany, and Wiseman (1996) as a result of what is known about the Potts model in ferromagnetism. Note from Table 3 that the optimal temperature for SPMC is sometimes very low. The reason is that to obtain a good clustering associated with the high threshold Q = 0.5 (associated with SPMC), one needs a low temperature. At low temperatures, the chances of creating bonds between neighboring points in the data graph are very high (equal to $1 - \exp\{-k_{ij}(\sigma)/T\}$). Hence, neighboring points will more often share the same connected component, that is, $Q_{ij}(\sigma, T)$ will be large more often.
A comparison between the order of operations required to run PMC2 and the order of operations needed by SPMC reveals that, in general, from a purely computational point of view, PMC2 is more efficient for large datasets. Let M be the number of burn-in iterations used in a run of PMC2, let m be the number of iterations after the burn-in period, and let d be the size of the internal grid in the bandwidth-temperature space where the posterior density is to be evaluated. The Swendsen-Wang algorithm requires O(n log n) operations to get the connected components of the graph. Updating the estimates of the normalizing constants requires O(dn) operations. Therefore, the order of operations during the burn-in period, which corresponds to the Wang-Landau algorithm, is about O(M{dn + n log n}). After the burn-in, we need to sample $(\sigma, T)$ and interpolate the normalizing constants for parameter values outside the basic grid. This part comprises roughly O(m{n log n + dn + d²}) operations. It should be noted that in general M is much larger than m. SPMC requires $O(d_T(M + m)\, n \log n)$ operations, where $d_T$ is the size of the grid used for the temperature. Hence, the difference in the order of operations between PMC2 and SPMC is about $O\big((M + m)\{dn + d^2 m/(M + m) - (d_T - 1)\, n \log n\}\big)$. Therefore, this difference will be negative for small grids and for large n. For example, for the barcode data, we have n = 19,465, m = 5000, M = 200,000, and d = 12 × 30 = 360. If we set $d_T = 50$, the difference is about $-5 \times 10^{11}$ (see the check below). The left panel of Figure 6 shows, in log-log scale, the real time per grid point spent by PMC2 on the six datasets used in Section 5 against the theoretical order of computations per grid point divided by $3.4 \times 10^9$. This latter quantity is the number of operations per second claimed by the computer used to run the algorithm, a desktop PC running Linux GNOME 2.28.2. A simple linear regression of observed times as a function of the theoretical times yields a coefficient of determination larger than 0.99, thus indicating good theoretical predictions. The regression fitted values are also displayed in Figure 6. For comparison purposes, the right panel of Figure 6 shows the times spent per grid point (i.e., time/(12 × 30) for PMC2) by PMC2 and SPMC when SPMC is run for the same number of iterations per grid point as PMC2 (i.e., M + m = 301,000). It is clearly seen that PMC2 is much more efficient in time spent per grid point than SPMC. However, as a note of caution, we remark that, in general, SPMC may be run for fewer iterations than PMC2 at each temperature. Since the time spent per temperature by SPMC is proportional to (M + m), a simple computation shows that SPMC's timing per temperature is comparable to that of PMC2 run with M + m = 301,000 when only about 21,000 iterations per temperature are run, which for most problems may be appropriate.
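The quoted figure can be checked directly (natural logarithms assumed):

    import numpy as np

    # Operation-count difference for the barcode data example.
    n, m, M, d, d_T = 19_465, 5_000, 200_000, 12 * 30, 50
    diff = (M + m) * (d * n + d ** 2 * m / (M + m) - (d_T - 1) * n * np.log(n))
    print(f"{diff:.2e}")  # about -4.9e+11, i.e., roughly -5 x 10^11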
In summary, our model and experiments show that the search for an optimal clustering reduces to a search for a stable high-density region of the parameter posterior, not to a point search in the parameter space. Procedures that serve to uncover the totality or parts of this region will perform as well as procedures that search for an optimal point in the parameter space. This may explain why just estimating the posterior on a grid of the parameter space, or just optimizing our data-driven informative prior, does well. As long as the grid covers parts of the stable region (or the informative data-driven prior presents a mode inside the stable region), the clustering obtained will be a good clustering. The prior based on random graph theory gives us a strong hint about the location of the stable region of the Potts model. Since we derived this information from the study of many random cluster models with constant interaction, it is reasonable to expect further improvements in clustering as knowledge of the phase transition region of random cluster models with variable interactions between the data points advances.

SUPPLEMENTARY MATERIALS
The supplementary material consists of the proof of the convergence theorem for PMC2 (Section 1), the proof of the probability bound theorem used to construct an informative data-driven prior for the temperature parameter of the conditional-Potts model (Section 2), and a more detailed description of the datasets used in our simulations (Section 3).