Semiparametric Bayesian Regression via Potts Model

ABSTRACT We consider Bayesian nonparametric regression through random partition models. Our approach involves the construction of a covariate-dependent prior distribution on partitions of individuals. Our goal is to use covariate information to improve predictive inference. To do so, we propose a prior on partitions based on the Potts clustering model associated with the observed covariates. Covariate proximity thus drives both the formation of clusters and the prior predictive distribution. The resulting prior model is flexible enough to support many different types of likelihood models. We focus the discussion on nonparametric regression. Implementation details are discussed for the specific case of multivariate multiple linear regression. The proposed model performs well in terms of model fitting and prediction when compared to other alternative nonparametric regression approaches. We illustrate the methodology with an application to the health status of nations at the turn of the 21st century. Supplementary materials are available online.


Introduction
Clustering is a ubiquitous problem in statistical theory and practice. Many methods and algorithms are available in the literature. A clustering procedure consists of constructing a partition of a set of individuals or experimental units into nonempty, disjoint, and exhaustive subsets. Usually, the number of subsets (the size of the partition) is not known beforehand. We focus on probability models for partitions (also referred to as random partition models), avoiding purely algorithmic methods. Specifically, we build a prior probability model for partitions that is indexed by covariates and has the property of encouraging coclustering of individuals with close covariate values.
Defining probability models for partitions has been a topic of active research in the past decade in the Nonparametric Bayes (NPB) community. The main reason comes from the NPB model itself. Consider any hierarchical model of the form y_i | θ_i ind∼ p(y_i | θ_i) and θ_i | F iid∼ F, for i ∈ [n] = {1, …, n}. When F is drawn from a discrete random probability measure, a prior distribution on partitions is induced. The discreteness of F implies the existence of ties among θ_1, …, θ_n. A partition is obtained by grouping together those elements in [n] that have identical θ values. See, for example, Quintana (2006).
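As a small illustration of this tie-induced partition mechanism (not part of the paper's model), the sketch below samples a partition from the Chinese restaurant process, which is the partition law induced by the DP; the function name and the choice α = 1 are illustrative:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of [n] from the Chinese restaurant process, the
    partition law induced by ties among the theta_i when F ~ DP(alpha, F0)."""
    rng = random.Random(seed)
    clusters = []  # clusters[j] holds the indices assigned to cluster j
    for i in range(n):
        # join existing cluster j w.p. |S_j|/(i + alpha); open a new one w.p. alpha/(i + alpha)
        weights = [len(S) for S in clusters] + [alpha]
        u = rng.uniform(0, sum(weights))
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if u <= acc:
                break
        if j == len(clusters):
            clusters.append([i])
        else:
            clusters[j].append(i)
    return clusters

partition = crp_partition(10, alpha=1.0)
```

The returned clusters are nonempty, disjoint, and exhaustive, matching the definition of a partition given above.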
A natural extension of the random partition model consists of incorporating covariate values in its definition. The motivation for this comes in part from the NPB community's interest in defining classes of covariate-dependent random probability measures {F_x : x ∈ X} indexed by sets of predictors/covariates x ∈ X. The seminal work by MacEachern (1999, 2000) proposed a collection of dependent random probability measures with marginal distributions given by the Dirichlet process (DP) (Ferguson 1973). This general idea has been extended and applied to the construction of several types of random probability measures, producing what may be generically termed NPB regression. This includes, for instance, the notion of density regression (Dunson, Pillai, and Park 2007; Tokdar, Zhu, and Ghosh 2010). When the F_x distributions are discrete, a collection of random partition models indexed by the set of available covariate vectors x_n = (x_1, …, x_n) emerges just as before. We note, however, that the induced probability distribution may not be easily recognizable.
Exploiting the connection between product partition models (PPM) (Hartigan 1990; Barry and Hartigan 1992) and the DP (see, e.g., Quintana 2006), a covariate-dependent extension was proposed by Müller, Quintana, and Rosner (2011). The usual PPM involves a prior distribution on a partition ρ_n = (S_1, …, S_{k_n}) of [n] as p(ρ_n) ∝ ∏_{j=1}^{k_n} c(S_j), where c(S) (the cohesion function) describes how strongly the user thinks the elements of S are to co-cluster a priori. The covariate-dependent version (termed PPMx) alters this prior distribution as p(ρ_n | x_n) ∝ ∏_{j=1}^{k_n} c(S_j) g(x*_j), where x*_j = {x_i : i ∈ S_j} is the set of all covariates for individuals in cluster j = 1, …, k_n, and g(x*_j) is the similarity of cluster j. Large similarity values are meant to represent homogeneous covariate values in the corresponding cluster. Some alternative extensions to build covariate-dependent random partition models explicitly can be found in Dahl (2008), Park and Dunson (2010), and Argiento, Cremaschi, and Guglielmi (2014).
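A hedged sketch of such a PPMx-style prior follows; the DP cohesion c(S) = α(|S| − 1)! and the Gaussian auxiliary-model similarity are our own illustrative choices for one-dimensional covariates, not necessarily those of the cited works:

```python
import math

def ppmx_log_prior(partition, x, alpha=1.0, tau=1.0):
    """Unnormalized log p(rho | x) for a PPMx-style prior: DP cohesion
    c(S) = alpha * (|S| - 1)! times a similarity g(x*_j) obtained by
    integrating unit-variance normal kernels over a N(0, tau) cluster mean."""
    logp = 0.0
    for S in partition:
        m = len(S)
        logp += math.log(alpha) + math.lgamma(m)  # log cohesion c(S)
        s1 = sum(x[i] for i in S)
        s2 = sum(x[i] ** 2 for i in S)
        # log of the integral of prod_i N(x_i | mu, 1) * N(mu | 0, tau) over mu
        logp += (-0.5 * m * math.log(2 * math.pi)
                 - 0.5 * math.log(1 + m * tau)
                 - 0.5 * (s2 - tau * s1 ** 2 / (1 + m * tau)))
    return logp

x = [0.0, 0.1, 5.0, 5.1]
homogeneous = ppmx_log_prior([[0, 1], [2, 3]], x)  # similar covariates together
mixed = ppmx_log_prior([[0, 2], [1, 3]], x)        # far-apart covariates together
```

Since the two candidate partitions share the same cohesion terms, the similarity term alone makes the homogeneous grouping a priori more probable, which is exactly the behavior described above.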
Our motivation is similar to that of Müller, Quintana, and Rosner (2011), but our approach differs substantially from theirs in the way the prior distribution on partitions is constructed. Our approach is based on Potts model clustering, introduced by Blatt, Wiseman, and Domany (1996).

The Modeling Approach
Suppose that each individual i ∈ [n] in a given sample is associated with a p-dimensional vector y_i of responses of interest and a q-dimensional covariate vector x_i. Suppose also that the interest lies in studying the relationship between y_i and x_i, and in particular, in predicting the response y_{n+1} associated with a covariate vector x_{n+1} of a future individual. Let p(y_i | x_i, ψ_i) be a likelihood model stating the relationship between the ith response and the associated covariate vector. For instance, the likelihood could be a multivariate multiple normal regression N_p(x_i θ_i, Σ_i). In this case ψ_i would be given by the pair (θ_i, Σ_i). The specific nature of p(y_i | x_i, ψ_i) will not be needed until later when we discuss implementation issues.
We assume a random partition model with a hierarchical structure for these data: y_i | x_i, ψ_i ind∼ p(y_i | x_i, ψ_i) for i ∈ [n], together with ρ_n | x_n ∼ p(ρ_n | x_n). (1) Here ρ_n is a partition of [n] into k_n subsets. Also, s_1, …, s_n are cluster membership indicators such that s_i = j if the ith individual belongs to the jth cluster. In addition, ψ_i = ψ*_{s_i} for all i ∈ [n]. Model (1) groups in cluster j those individuals having identical parameter value ψ*_j. Individuals within this cluster are conditionally iid given ψ*_j. The clustering distribution in (1) is generically denoted by p(ρ_n | x_n). We clarify that the "| x_n" notation refers not to an actual conditioning but to an indexing of the resulting distribution by the vector of available covariates x_n. We refer to models that give rise to this type of partition as dependent random partition models. A review of such models was given by Müller and Quintana (2010). In defining p(ρ_n | x_n), we follow the general principles laid out by Müller, Quintana, and Rosner (2011): two individuals i_1 and i_2 with similar covariate vectors x_{i_1} and x_{i_2} should be more likely to cocluster a priori. We adopt here a prior defined by the Potts model, which we describe next.

The Potts Model Prior
To facilitate the upcoming development, we view the data structure as a graph G = (Y_n, E) with n nodes Y_n = (y_1, …, y_n) and a set of edges E. We also consider the associated graph (x_n, E) whose nodes are formed by the corresponding covariate vectors x_n = (x_1, …, x_n). These covariates will be used to index the random partition models as follows. Consider the Potts model applied to the graph (x_n, E). We assume that the probability distribution p(ρ_n | x_n) for partitions arises from the random cluster model (Sokal 1997) associated with the Potts model. This implies the existence of auxiliary binary variables, the so-called bonds b, such that p(ρ_n | x_n) = Σ_{b : b ⇒ ρ_n} p(b | x_n). (2) Here, the notation b ⇒ ρ_n means that the partition ρ_n arises from the connected components of the graph associated to edges given by b. To explain how (2) arises we introduce the following notation. Let r > 1 be an integer indicating the number of label colors for the individuals in the data graph. The value of r is one of the defining parameters of the Potts model. Let z_ji be a color label indicator with z_ji = 1 if x_j is assigned color i, and z_ji = 0 otherwise. Let α_ij = 1 if x_i and x_j are neighbors in the graph (x_n, E), that is, if there is an edge connecting them, and 0 otherwise. Let δ_ij = 1 if x_i and x_j have the same label, and 0 otherwise. In general, we use values of r in the range 10 ≤ r ≤ 30. These are the recommended values in the literature (Blatt, Wiseman, and Domany 1996; Stanberry, Murua, and Cordes 2008). We stress that the color labels are not the cluster labels; they only serve to segment the graph. They are auxiliary variables that split and/or merge the connected components of the graph. Figure 1 displays an example of the coloring (labeling) and the creation of bonds in a data graph. The figure also illustrates the Swendsen-Wang algorithm described below. Let K(·, ·) be a Mercer kernel (Girolami 2002).
The similarities between pairs of covariate vectors are defined by κ_ij = K(x_i, x_j). As usual, we will assume that K is a function of the distances, such as the Gaussian kernel κ_ij(σ) = exp{−‖x_i − x_j‖²/(2σ²)}, where σ > 0 is a bandwidth parameter. The kernel function is not limited to real-valued vector components, and can be easily modified to account for discrete variables. To avoid problems with different scales or variances in the data, we standardize the variables x_n. An alternative would be to consider variable-dependent bandwidth parameters in the kernel, but this would unnecessarily complicate the model. The choice of the bandwidth parameter is not straightforward. Blatt, Wiseman, and Domany (1996) suggested using the square root of the mean of the squared distances. Murua and Wicker (2014) estimated it with a Wang-Landau-type stochastic algorithm (Wang and Landau 2001; Atchade and Liu 2010). However, they also suggested a faster estimate obtained from the mode of an informative empirical prior density of the bandwidth. In our experiments, we adopted this latter strategy. The Potts model assumes the following density on the set of labels {z_ji} given x_n, or equivalently, the {κ_ij(σ)} "distances": p({z_ji} | x_n) = Z⁻¹ exp{−β Σ_{i<j} α_ij κ_ij(σ)(1 − δ_ij)}. (3)
Figure 1. Illustration of the Swendsen-Wang algorithm. The leftmost panel shows the original data graph. Its edges are given by the black lines. The vertices are colored (labeled) black or red. The second panel shows the modified data graph whose edges are now given by the bonds (gray lines). The third image shows the connected components of the modified data graph after a recoloring of its vertices. The rightmost image shows the original data graph with the new color labels.
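The kernel similarities and the square-root-of-mean-squared-distance bandwidth heuristic of Blatt, Wiseman, and Domany (1996) can be sketched as follows; the Gaussian kernel form is the usual choice, and the tiny dataset is illustrative (the paper's preferred bandwidth estimate uses the mode of an empirical prior instead):

```python
import math

def gaussian_kernel_matrix(X, sigma):
    """kappa_ij(sigma) = exp(-||x_i - x_j||^2 / (2 sigma^2)) on standardized covariates."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return K

def bandwidth_blatt(X):
    """Bandwidth heuristic of Blatt, Wiseman, and Domany (1996):
    the square root of the mean squared pairwise distance."""
    n = len(X)
    d2 = [sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
          for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(d2) / len(d2))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sigma = bandwidth_blatt(X)           # sqrt((1 + 1 + 2)/3)
K = gaussian_kernel_matrix(X, sigma)
```

The resulting matrix is symmetric with unit diagonal, and off-diagonal entries decay with the distance between covariate vectors.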
Here β is the inverse of the so-called temperature parameter, and Z is the normalizing constant, which depends on β, σ, and x_n. As with the bandwidth parameter, β needs to be estimated. In principle, it could be possible to make formal posterior inference on β and σ, but in practice this is difficult. The main reason stems from the fact that the intractable normalizing constant Z in (3) depends on these parameters. As a compromise, our approach considers setting the pair (β, σ) to the mode of a data-driven prior as suggested in the work of Murua and Wicker (2014). To work in practice with the Potts model (3), Swendsen and Wang (1987) proposed an algorithm based on augmenting the model by introducing a set of latent variables, the bonds b. The bond b_ij is said to be frozen if b_ij = 1 and α_ij = δ_ij = 1, that is, if the points x_i and x_j are neighbors and have the same label. Otherwise, the bond is not frozen. The bond b_ij becomes frozen with probability p_ij = 1 − exp{−β κ_ij(σ)}. The joint distribution of the labels and bonds is given by p({z_ji}, b | x_n) = Z⁻¹ ∏_{(i,j) : α_ij = 1} [(1 − p_ij)(1 − b_ij) + p_ij b_ij δ_ij]. (4) This is known as the Fortuin-Kasteleyn-Swendsen-Wang model (Sokal 1997). The distribution in (3) follows by marginalizing out b in (4). And p(b | x_n), needed in (2), follows by marginalizing out {z_ji}. Swendsen and Wang (1987) used (4) to propose the following Gibbs sampling strategy for generating samples from the Potts model (and the associated random cluster model p(b | x_n)):
Step 1. Given the labels {z_ji}, each bond b_ij becomes frozen independently of the others with probability p_ij if α_ij = δ_ij = 1. Otherwise, b_ij is set to 0.
Step 2. Given the bonds {b i j }, each connected subset is assigned the same color label. The color label assignments are drawn uniformly at random and independently from each other.
It is important to note in Step 2 that all vertices in a connected component already share the same color assignment. By assigning a new color to the vertices in the connected component, neighboring vertices with the same color assignment become eligible to join the connected component in the next iteration (merging). Similarly, those neighboring vertices that were not joined by bonds but had the same color as the connected component vertices, become ineligible to join the component in the next iteration (splitting). Figure 1 displays an illustration of the Swendsen-Wang algorithm.
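Steps 1 and 2 can be sketched as a single Swendsen-Wang sweep; this is a minimal illustration on an explicit edge list, with a union-find structure standing in for the connected-component computation (the graph, kernel values, and parameter settings are illustrative):

```python
import math
import random

def swendsen_wang_sweep(labels, edges, kappa, beta, r, rng):
    """One scan of the Swendsen-Wang sampler. labels[i] in {0,...,r-1};
    edges is a list of neighbor pairs (i, j); kappa[(i, j)] is the similarity."""
    n = len(labels)
    parent = list(range(n))  # union-find for connected components

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Step 1: freeze bond (i, j) with probability p_ij = 1 - exp(-beta * kappa_ij)
    # when the endpoints are neighbors sharing the same label.
    for (i, j) in edges:
        if labels[i] == labels[j] and rng.random() < 1.0 - math.exp(-beta * kappa[(i, j)]):
            parent[find(i)] = find(j)

    # Step 2: recolor every connected component uniformly at random.
    new_color = {}
    for v in range(n):
        root = find(v)
        if root not in new_color:
            new_color[root] = rng.randrange(r)
        labels[v] = new_color[root]
    return labels

rng = random.Random(0)
labels = swendsen_wang_sweep([0, 0, 1], [(0, 1), (1, 2)],
                             {(0, 1): 1.0, (1, 2): 1.0}, beta=2.0, r=10, rng=rng)
```

For large β the bond between equally labeled neighbors is frozen with probability near one, so such vertices end the sweep in the same component with the same new color, matching the merging behavior described above.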

Posterior Simulation in the General Case
Our general modeling approach is expressed as (1) together with (2). The set of parameters ψ = (ψ_1, …, ψ_n) may be alternatively represented as (ρ_n, ψ*), where ψ* = (ψ*_1, …, ψ*_{k_n}) and k_n is the size of partition ρ_n. The corresponding posterior distribution of ψ is a crucial component for our posterior predictive inference goal. Consider a hierarchical model with a discrete random probability measure as prior for the unknown distribution function. Random partition distributions arising from this type of model have several advantages. One of them is that there is typically an efficient posterior simulation algorithm that can be used or adapted to get samples from the posterior of partitions. Unfortunately, this is not the case for the Potts model prior, and a different strategy is needed. Note first that the sum in (2) is intractable even for moderate values of n. Therefore, we use the Swendsen-Wang algorithm (Swendsen and Wang 1987) described in Section 2.1 as part of our MCMC approach to the target distribution p(ψ | Y_n, x_n). Also note that changing the partition, say from ρ_n to ρ′_n, involves changing the dimension of the parameters ψ. Thus, it is necessary to resort to transdimensional MCMC. We choose to follow the framework given by Besag (2004), rather than the reversible jump method proposed by Green (1995). Besag's approach avoids the specification of a proposal bijection between the different dimensions. This is done by embedding the problem into a very-high-dimensional space. The idea is the following, where we drop the subindex n for partitions to simplify the notation. Let Ω = {ρ(1), ρ(2), …, ρ(B_n)} be the collection of all possible partitions of n nodes, where B_n is the corresponding Bell number. Denote by ψ = (ψ_{ρ(1)}, ψ_{ρ(2)}, …, ψ_{ρ(B_n)}) the collection of all parameters associated to the elements of Ω, and by S_ρ the parameter space associated to the partition ρ ∈ Ω. Also, for every ρ ∈ Ω, let p_0(ψ_ρ | ρ, Y_n, x_n) be any distribution on S_ρ.
Define the target distribution as π̃(ρ, ψ | Y_n, x_n) = p(ρ, ψ_ρ | Y_n, x_n) ∏_{ρ′ ∈ Ω, ρ′ ≠ ρ} p_0(ψ_{ρ′} | ρ′, Y_n, x_n). It is easy to see that the marginal of (ρ, ψ_ρ) under this density is exactly p(ρ, ψ_ρ | Y_n, x_n). The problem is thus reduced to sampling from π̃(ρ, ψ | Y_n, x_n). Besag (2004) considered two different types of moves during the sampling: (a) from the current state (ρ, ψ_ρ), propose an updated vector of parameters ψ′_ρ and leave the remaining ψ_{ρ′}'s unchanged; and (b) from the current state (ρ, ψ_ρ), propose, for some ρ′ ≠ ρ, the update (ψ′_ρ, ψ′_{ρ′}). The first update is generated with a proposal q(ψ_ρ, ψ′_ρ), which yields a Metropolis-Hastings ratio equal to [p(ψ′_ρ | ρ, Y_n, x_n) q(ψ′_ρ, ψ_ρ)] / [p(ψ_ρ | ρ, Y_n, x_n) q(ψ_ρ, ψ′_ρ)]. Note that if the update is proposed with q(ψ_ρ, ψ′_ρ) = p(ψ′_ρ | ρ, Y_n, x_n), then the ratio becomes 1, and the update is always accepted. For the second proposal, generated with q_{ρρ′}, the Metropolis-Hastings acceptance ratio is [p(ρ′, ψ′_{ρ′} | Y_n, x_n) p_0(ψ′_ρ | ρ, Y_n, x_n) q_{ρ′ρ}(ψ′, ψ)] / [p(ρ, ψ_ρ | Y_n, x_n) p_0(ψ_{ρ′} | ρ′, Y_n, x_n) q_{ρρ′}(ψ, ψ′)]. Hence, if we set as before q_{ρρ′} to draw ψ′_{ρ′} from p(ψ_{ρ′} | ρ′, Y_n, x_n) and ψ′_ρ from p_0(ψ_ρ | ρ, Y_n, x_n), and propose ρ′ from the prior p(ρ | x_n), the acceptance ratio reduces to p(Y_n | ρ′, x_n) / p(Y_n | ρ, x_n). (5) Note that there is no need to actually generate ψ′_{ρ′} at all before the accept/reject decision. The proposal is reduced to generating a new partition ρ′ from p(ρ | x_n).
If the proposal is accepted, then one generates (and automatically accepts) an updated ψ_{ρ′} from p(ψ_{ρ′} | ρ′, Y_n, x_n). If the partition is not accepted, then there is no need to generate ψ_{ρ′}. Practical implementation of the posterior simulation scheme laid out above requires tractability of p(ψ_{ρ′} | ρ′, Y_n, x_n). When this distribution is intractable, one may attempt to estimate the ratio via MCMC using, for example, methods devised for the equivalent problem of estimating Bayes factors. For a list of such methods see, for example, Lodewyckx et al. (2011). Alternatively, one may choose a different and tractable proposal q(ψ_ρ, ψ_{ρ′}), whose nature is problem-specific. If so, the simplification (5) no longer applies, and the above Metropolis-Hastings ratio needs to be evaluated from scratch.
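The resulting partition move can be sketched generically; this is a minimal illustration assuming a function `log_marginal` that evaluates log p(Y_n | ρ, x_n) and a proposal drawing from the prior p(ρ | x_n), here exercised on a toy two-state space rather than actual partitions:

```python
import math
import random

def partition_mh_step(rho, log_marginal, propose_from_prior, rng):
    """One Metropolis-Hastings move on the partition under Besag's scheme:
    propose rho' from the prior p(rho | x_n) and accept with probability
    min{1, p(Y_n | rho', x_n) / p(Y_n | rho, x_n)}.
    log_marginal(rho) must return log p(Y_n | rho, x_n)."""
    rho_new = propose_from_prior(rng)
    log_ratio = log_marginal(rho_new) - log_marginal(rho)
    u = rng.random()
    if log_ratio >= 0.0 or (u > 0.0 and math.log(u) < log_ratio):
        return rho_new, True
    return rho, False

# Toy check: two "partitions" A and B, with B having 3x the marginal likelihood.
rng = random.Random(7)
log_m = {"A": 0.0, "B": math.log(3.0)}
state, visits_B = "A", 0
for _ in range(20000):
    state, _ = partition_mh_step(state, log_m.get,
                                 lambda g: "A" if g.random() < 0.5 else "B", rng)
    visits_B += state == "B"
freq_B = visits_B / 20000.0
```

Under the uniform proposal, the chain's stationary distribution is proportional to the marginal likelihoods, so state B should be visited roughly three times as often as A.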

The Regression Setting
The discussion so far is generic in the sense that no particular form was assumed for the likelihood model p(y_i | x_i, ψ_i). Naturally, details for particular applications are specific to whatever case is being considered. Thus, from now on we consider the special case of multivariate multiple linear regression models. Our main goal here is to develop predictive inference for such regression models with multiple responses and predictors. The response vectors are assumed p-dimensional, y_i ∈ R^p, and the corresponding covariates are x_i ∈ R^q. The parameters are then ψ_i = (θ_i, Σ_i), where θ_i ∈ R^{qp} and Σ_i ∈ M_{p×p} are subject-specific regression coefficients and covariance matrices. Here, M_{a×b} denotes the collection of matrices with a rows and b columns. The partition-based regression model has thus likelihood derived from (1): y_i | x_i, θ_i, Σ_i ind∼ N_p(m_i, Σ_i), i ∈ [n]. (6) Note that we can write m_i = (I_p ⊗ x_i′) θ_i, (7) where I_a represents the identity matrix of dimension a and A ⊗ B is the Kronecker product between two matrices A and B. The individual likelihood thus corresponds to a p-dimensional multivariate normal with mean vector formed by the individual linear combinations x_i′θ_{iℓ} for ℓ = 1, …, p, and covariance matrix Σ_i. The complete likelihood from (1) becomes p(Y_n | x_n, ρ_n, ψ*) = ∏_{j=1}^{k_n} ∏_{i∈S_j} N_p(y_i | (I_p ⊗ x_i′) θ*_j, Σ*_j). (8) The model specification is completed by defining suitable priors on the partition ρ_n and the parameters ψ_i. We assume that ρ_n | x_n ∼ p(ρ_n | x_n) and θ*_j iid∼ N_{qp}(μ, Σ ⊗ τ), j = 1, …, k_n, (9) where the Potts model was stated in (2). Here, Σ ∈ M_{p×p} and τ ∈ M_{q×q} are positive-definite matrices and μ ∈ R^{qp}. A conjugate prior on the Σ*_j completes the specification. The regression model defined in (6)-(9) splits the sample into clusters of internally homogeneous individuals, as probabilistically described by the Potts model. In other words, the cluster formation is driven by the bonds, whose distribution is in turn determined by distances between the covariate vectors {x_i}. As a result, the model favors, a priori, clusters formed by individuals with similar covariate values. The model then implies common multiple multivariate regression parameters for individuals in the same cluster.

Posterior Simulation in the Regression Setting
To implement the Metropolis-Hastings posterior simulation scheme outlined in Section 2.2, and for posterior predictive inference, we require explicit evaluation of (5). Note first that due to the multiplicative structure of the likelihood (8), we can factor this expression into k_n terms, one per cluster. Given a partition ρ_n with k_n subsets, let (θ*_1, Σ*_1), …, (θ*_{k_n}, Σ*_{k_n}) denote the set of unique, cluster-specific parameters. Recall our earlier notation x*_j = {x_i : i ∈ S_j}, and define y*_j = {y_i : i ∈ S_j}. Let n_j = |S_j| be the size of the jth cluster. It follows that the likelihood for cluster j is p(y*_j | x*_j, θ*_j, Σ*_j) = ∏_{i∈S_j} N_p(y_i | (I_p ⊗ x_i′) θ*_j, Σ*_j), (10) for j = 1, …, k_n. Let Y_j ∈ M_{n_j×p} and X_j ∈ M_{n_j×q} be the matrices whose rows are formed by the y_i and x_i vectors in y*_j and x*_j, respectively. The proofs of the three results that follow can be found in the supplementary material.
Lemma 1. The least squares estimator of θ*_j is given by θ̂*_j = vec((X_j′X_j)⁻¹X_j′Y_j), for j = 1, …, k_n, where vec(A) denotes the vector formed by stacking the columns of the matrix A. Moreover, the likelihood (10) can be written in terms of θ̂*_j.
Lemma 2. The posterior distribution of (θ*_j, Σ*_j) can be decomposed as the product of the conditional posterior of θ*_j given Σ*_j and the marginal posterior of Σ*_j, where, for a ∈ R^{qp}, a_mtx denotes the q × p matrix derived by sequentially dividing a in p columns of dimension q.
The results from Lemmas 1 and 2 are useful for posterior simulation. However, to evaluate (5) we need the marginal distribution p(Y_n | x_n, ρ_n), which follows from the following result.
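Lemma 1's estimator is straightforward to compute; below is a small numerical check with NumPy, where the dimensions (n_j = 50, q = 3, p = 2) and the noiseless data are illustrative:

```python
import numpy as np

def theta_ls(Xj, Yj):
    """Lemma 1 estimator: vec((Xj' Xj)^{-1} Xj' Yj), columns stacked."""
    B = np.linalg.solve(Xj.T @ Xj, Xj.T @ Yj)  # q x p coefficient matrix
    return B.reshape(-1, order="F")            # vec(): stack the columns

rng = np.random.default_rng(0)
Xj = rng.normal(size=(50, 3))           # n_j = 50 observations, q = 3 covariates
B_true = np.arange(6.0).reshape(3, 2)   # q x p coefficient matrix, p = 2 responses
Yj = Xj @ B_true                        # noiseless responses
theta_hat = theta_ls(Xj, Yj)
```

With noiseless responses the least squares estimator recovers vec(B_true) exactly, which makes the column-stacking convention of the "mtx" notation easy to verify.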
A direct consequence of Lemma 3 is the closed-form expression (15) for the marginal p(Y_n | x_n, ρ_n), stated in terms of the quantities defined in Lemma 2. Here, μ^(j)_mtx denotes the q × p matrix derived by sequentially dividing μ^(j) in p columns of dimension q; recall that μ_mtx denotes the q × p matrix derived by sequentially dividing μ in p columns of dimension q. In summary, our computational strategy for posterior simulation can be expressed as a Metropolis-within-Gibbs algorithm. Each scan consists of the following steps: Partition: Let ρ_n(0) denote the currently imputed partition.
- Propose a new partition ρ_n(1) by applying Steps 1 and 2 in Section 2.1.
- Accept ρ_n(1) with probability min{p(Y_n | ρ_n(1), x_n) / p(Y_n | ρ_n(0), x_n), 1}, which can be evaluated in log-scale using (15).
Parameters: If k_n is the number of clusters of the currently imputed partition ρ_n = (S_1, …, S_{k_n}), then for j = 1, …, k_n draw new parameter values (θ*_j, Σ*_j), where the corresponding distributions are specified in Lemma 2.

Posterior Predictive Inference
Another key element of our inferential effort is the implementation of posterior predictive inference. Let y_{n+1} denote the response of a new individual and let x_{n+1} denote its associated covariate vector. The goal here is to find an expression for the predictive distribution p(y_{n+1} | Y_n, x_n, x_{n+1}). This is done by conditioning on the partitions associated with the model. Let ρ_{n+1} be a partition of the complete data {(Y_n, y_{n+1}), (x_n, x_{n+1})}, and let j̄ denote the index of the cluster S_{j̄} in ρ_{n+1} that includes the new individual n + 1. We write V_{j̄}(w, ρ_{n+1}) to stress the dependency of V_{j̄} on ρ_{n+1} and on the new response value w; this yields the conditional predictive density (16). Note that the denominator of (16) (which plays the role of a normalizing constant) depends on ρ_{n+1} as well, and that, since a new point has been added to the partition, the n_{j̄}'s now denote cluster sizes associated with ρ_{n+1}. Next, let ρ_n denote a posterior partition of the data. Let A(ρ_{n+1}) denote the collection of all partitions ρ_n giving rise to ρ_{n+1} by generating appropriate bonds b_{n+1} from x_{n+1} in the graph. Inversely, one can think of ρ_n as a partition of the original data that arises by marginalizing out all the bonds from x_{n+1}. Now let [ρ_n, b_{n+1}] denote a partition of x_n ∪ {x_{n+1}} generated from ρ_n by the bonds b_{n+1} associated with x_{n+1}. This decomposition allows us to write the posterior predictive distribution p(y_{n+1} | Y_n, x_n, x_{n+1}) as an average of (16) over partitions and bonds. Therefore, given a posterior sample of partitions {ρ_{n,j} : j = 1, …, M} and bonds {b_{n+1,j} : j = 1, …, M}, we can estimate the posterior predictive density by the corresponding Monte Carlo average. In practice, the application of the predictive inference approach discussed above requires a method to deal with the integrals in the denominator of (16). We show that such integrals may be well approximated by expressions that can be resolved analytically.
Proposition 1. Let V_{j̄}(ρ_n) = V_{j̄} be as defined in Lemma 2 (the dependency on the partition ρ_n involving only the original data is stressed). Let (w, x_{n+1}) be a new individual in cluster j̄. Then the updated mean for this cluster is μ^(*) = (n_{j̄} μ^(j̄) + w)/(n_{j̄} + 1). Suppose that μ^(*) ≈ μ^(j̄), that is, suppose that either n_{j̄} is large or that w ≈ μ^(j̄). Then the approximation (17) of the predictive density follows. Here, we have used the determinant identity |A + bb′| = |A|(1 + b′A⁻¹b), which holds for any invertible p × p matrix A and any p-dimensional vector b. Approximation (17) is implemented as an extension to the posterior simulation algorithm laid out in Section 3.1. An extra step is added to the algorithm to sample the additional bonds required to include x_{n+1} in the augmented partition ρ_{n+1}. Expression (17) is evaluated after collecting a sample of size M of such partitions and bonds.
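The determinant identity invoked in the proposition is easy to verify numerically; the dimensions and random inputs below are illustrative:

```python
import numpy as np

# Matrix determinant lemma used in Proposition 1:
# |A + b b'| = |A| (1 + b' A^{-1} b), for invertible p x p A and b in R^p.
rng = np.random.default_rng(1)
p = 4
M = rng.normal(size=(p, p))
A = M @ M.T + p * np.eye(p)   # symmetric positive definite, hence invertible
b = rng.normal(size=p)
lhs = np.linalg.det(A + np.outer(b, b))
rhs = np.linalg.det(A) * (1.0 + b @ np.linalg.solve(A, b))
```

This rank-one update is what reduces the predictive determinant computation to a single quadratic form, avoiding a fresh determinant evaluation for every candidate w.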

Posterior Inference on Partitions
The posterior distribution for the number of clusters can be easily obtained from MCMC samples, as each of these involves a partition of the data. To produce a meaningful posterior estimate consisting of a single partition, we use the proposal by Murua, Stanberry, and Stuetzle (2008). The method starts by finding the posterior mode of the number of clusters, and then builds a consensus clustering from all the MCMC sample partitions with this given number of clusters. The proportion of times a given pair of individuals appears together in the same cluster within a given partition is calculated for every pair of neighbors of the original data graph. The consensus clustering is obtained as the connected components of the graph whose edges consist of those neighbors with a positive proportion. In general, one could prune the edges with an associated proportion smaller than a certain predetermined threshold (e.g., 0.5). Alternatively, Murua and Wicker (2014) set the threshold to the mode of the posterior density associated with the probability of having two given neighbors in the same cluster. We have adopted this latter strategy. We also have tried the least-squares clustering (LSCLUST) procedure of Dahl (2006). This method is also based on pairwise proportions, but now computed using the entire set of posterior MCMC samples. The estimated partition is chosen as the minimizer of the sums of squares of the componentwise differences between these proportions and the individual co-clustering indicators among sampled partitions. For more details see Dahl (2006). In all the examples we tried, LSCLUST and consensus clustering gave the same results.
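A simplified sketch of the consensus-clustering computation follows; for brevity it computes proportions over all pairs and all supplied partitions and uses a fixed 0.5 threshold, whereas the procedure described above restricts to neighbor pairs, to partitions with the modal number of clusters, and to a data-driven threshold:

```python
def coclustering_proportions(partitions, n):
    """Fraction of sampled partitions in which each pair of individuals
    co-clusters. Each partition is a list of clusters (lists of indices)."""
    P = [[0.0] * n for _ in range(n)]
    for rho in partitions:
        for S in rho:
            for i in S:
                for j in S:
                    P[i][j] += 1.0
    M = len(partitions)
    return [[v / M for v in row] for row in P]

def consensus_clusters(P, threshold=0.5):
    """Connected components of the graph keeping pairs with proportion > threshold."""
    n = len(P)
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:  # depth-first search over retained edges
            v = stack.pop()
            comp.append(v)
            for w in range(n):
                if w not in seen and P[v][w] > threshold:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(comp))
    return clusters

parts = [[[0, 1], [2, 3]], [[0, 1], [2], [3]], [[0, 1, 2], [3]]]
P = coclustering_proportions(parts, 4)
estimate = consensus_clusters(P, threshold=0.5)
```

Here individuals 0 and 1 co-cluster in every sampled partition, so only that pair survives the threshold and the consensus estimate keeps them together while separating 2 and 3.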

Simulation Studies
In this section we study through simulations the performance of the Potts regression model and compare it with that of three other alternative methods: FLEXMIX, as implemented in the homonymous R package (Leisch 2004; Grün and Leisch 2007), the PPMx approach by Müller, Quintana, and Rosner (2011), and mixtures of experts (McLachlan and Peel 2001). For the PPMx and mixtures of experts approaches we keep the same likelihood model (6) but change the proposed Potts model prior to a dependent PPM model in the former case or to a dependent mixture in the latter case. We describe three different simulation scenarios. Two of them relate directly to the multiple regression model in Section 3. In the last scenario we investigate the robustness and performance of the Potts regression model in a setting that is totally different from the one that motivated it. First we concentrate on the two simulation scenarios involving multiple regression. Scenario I: The odd shape cluster dataset was conceived to see if the methods considered here can recover an arbitrary cluster structure in the covariates. It consists of four two-dimensional covariate groups and a one-dimensional response variable. Groups 1 and 2 were generated as uniform draws around a couple of arcs; Groups 3 and 4 are Gaussian clumps. The groups can be seen in the left panel of Figure 2. The arcs were generated with 1000 points; the clumps were generated with 500 points each. The response variable Y corresponds to Gaussian values with means equal to group-dependent linear combinations of the covariate coordinates X_1 and X_2. The group-dependent linear combinations are 3X_1 + 2X_2, X_1 + 3X_2, −3X_1, and −3X_1 − 2X_2, respectively. The corresponding regression standard deviations are 0.03, 0.05, 0.04, and 0.05, respectively. The densities associated with Y for each group are displayed in the right panel of Figure 2.
Scenario II: The five-copula cluster dataset was conceived to study the performance of our method on a multivariate response data similar to the one studied in the application discussed in Section 5. It was generated using a mixture of five six-dimensional copulas as covariates and a two-dimensional mixture-component-dependent response variable. The copulas were generated from a multivariate Gaussian copula with compound symmetry and marginals set to normal, exponential, gamma, normal, gamma, and exponential. The components differ in the marginal distribution parameters. Details on the method used to generate the samples are given in the supplementary material.
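The Gaussian-copula construction underlying Scenario II can be sketched as follows; the three dimensions, exponential marginals, and fixed 0.5 compound-symmetry correlation are illustrative stand-ins for the paper's six-dimensional, five-component mixture (whose exact parameters are in the supplementary material):

```python
import numpy as np
from math import erf, sqrt

def gaussian_copula_sample(n, corr, marginal_ppfs, rng):
    """Draw n vectors whose dependence comes from a Gaussian copula with
    correlation matrix corr, and whose marginals are given by the
    inverse CDFs in marginal_ppfs (one per coordinate)."""
    d = corr.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), corr, size=n)
    # Transform to uniform marginals via the standard normal CDF Phi(z)
    U = 0.5 * (1.0 + np.vectorize(erf)(Z / sqrt(2.0)))
    return np.column_stack([marginal_ppfs[k](U[:, k]) for k in range(d)])

rng = np.random.default_rng(2)
corr = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)        # compound symmetry, rho = 0.5
ppfs = [lambda u, r=r: -np.log(1.0 - u) / r            # Exp(rate=r) inverse CDF
        for r in (1.0, 2.0, 0.5)]
X = gaussian_copula_sample(20000, corr, ppfs, rng)
```

Each column keeps its specified marginal while the copula induces positive dependence across columns, which is the mechanism the five-copula dataset exploits with heterogeneous marginals.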
To fit the models, we randomly generated 100 subsamples of size 200 from each of these datasets. We applied our procedure to each sample, and computed the associated root mean square error (RMSE). We also created an independent sample of size 100 to estimate the goodness of fit of the models fitted with each of the 100 training datasets. We report both the RMSE associated with the training data and the RMSE associated with the test data. Note that the RMSE may be computed in two different ways when using the Potts regression model. First, the posterior mean response for each covariate value can be estimated as the mean of the estimated posterior predictive density in (17). Alternatively, we can compute the mean of the corresponding sample values (associated with the covariates) generated from the MCMC output. In what follows we consider both methods. We ran our method with 20,000 burn-in iterations, but only kept 5000 samples after burn-in to do the data analysis. From preliminary computational experiments, we judged this simulation size to be sufficient for the target posterior and predictive inference.
For each compared procedure and each dataset used in the simulation, we computed the average of the 100 RMSEs. The results are shown in Table 1 for the odd-shape-clusters dataset, and in Table 2 for the five-copula one. We also computed the adjusted Rand index (ARI) (Rand 1971; Hubert and Arabie 1985) to assess how well each method recovers the true grouping. The mixture of experts greatly overfits the data. Its performance is very dependent on the number of hidden units chosen to fit the data. However, it seems to recover well the clustering structure in the covariates. An explanation for this behavior might be that the mixture of experts considers too many parameters. This makes it flexible enough to capture the clustering structure of the covariates. However, those many parameters also overfit the data, yielding poor predictions. The PPMx-type prior shows a somewhat better performance. It achieves a good partition estimation. However, some of the overfitting trend is also present, although not to the extent seen in the mixtures of experts models. On the other hand, FLEXMIX performs poorly on these datasets. We stress that none of the datasets in this simulation study fits exactly the assumptions of the models tested. It appears that FLEXMIX does not generalize easily to complex datasets. We also consider a third scenario, given by the survival data analysis application described by Müller, Quintana, and Rosner (2011). In this case, we need to change the likelihood model to just a univariate normal distribution that does not involve covariate dependence. To compare Potts regression with PPMx, we use the same simulated data described in Table 1 of Müller, Quintana, and Rosner (2011). The competing models are defined as the univariate likelihood model described above, and either the Potts regression or PPMx priors. We find that PPMx has a very slightly better performance, and that density-based estimation of the mean performs somewhat worse than the other alternatives.
However, as in the previous scenarios, the Potts regression model yields better results for the prediction cases. Due to lack of space, further details for this scenario are found in the supplementary material.
In summary, the Potts regression model seems to be the most reliable method for local regression among the methods compared in this simulation study. It shows the best performance in terms of testing (predictive) RMSE while doing a reasonable job in revealing the cluster structure in the covariates. Note that among the methods compared, the current implementation of FLEXMIX does not support multiple responses.

Data Illustration
On the eve of the 21st century, nations showed a wide variation in their health status. A lot of research has focused on developing policies that could directly affect health outcomes (Hegyvary, Berry, and Murua 2008;Wilkinson 1996;Collins 2003;Deaton 2002;Gatrell 2002;Poland et al. 1998;Dollar 2001). For illustrative purposes only, we consider here a sample from the cross-sectional analysis carried out by Hegyvary, Berry, and Murua (2008). It consists of 134 member countries of the United Nations, representing a wide range of possible economic and demographic conditions in the world. Data on diverse factors were gathered by Hegyvary, Berry, and Murua (2008) from the World Bank, the United Nations, the U.S. Central Intelligence Agency, and others. Out of the 18 factors considered by Hegyvary, Berry, and Murua (2008), we selected six factors as covariates: Secondary Education (SE), Human Resources (HR), Population Density (PDn), Population Dependency (PDp), Birth Rate (BR), and the level of Democracy (DL); and two factors as our response variable (the same two chosen by Hegyvary, Berry, and Murua (2008)): (1) Adjusted life expectancy (ALE), which is the average of the life expectancy and health adjusted life expectancy variables, and (2) Child mortality (CM), which corresponds to the under-five mortality rate. The choice of factors in our study was mainly driven by the availability of complete data (that is, no factors with missing values were selected). All factors were normalized to have mean zero and unit variance. More details on this dataset can be found in Hegyvary, Berry, and Murua (2008).
The results of fitting the Potts regression model to these data are summarized in Figure 3. We also tried the PPMx approach, which gave an RMSE of 0.343 and 5 clusters according to the LSCLUST procedure. The results yielded by the Potts regression model are more in agreement with those found by Hegyvary, Berry, and Murua (2008). However, our results hint at a larger Cluster 4 than the one found in their work. In fact, our model puts the ex-Soviet countries together with the richest countries. At the turn of the century, the relationship between the health status of the ex-Soviet countries and the demographic, education, and health system factors considered in this study was comparable to that of the richest western countries.

Of the six covariates considered, we took a closer look at BR and HR. These are variables that policy makers could, and may want to, control to improve the health status of a population. Figure 4 displays contours of the posterior predictive density associated with different levels of BR and HR, while keeping the other four variables fixed at their overall observed median values. Each panel of this figure shows contours in the ALE-CM plane associated with every combination of the observed quantiles 0.1, 0.25, 0.50, 0.75, and 0.90 of the BR and HR variables. Observe that the "median" state of the health status of these 134 countries is right in the middle (the variables have been standardized). Not surprisingly, the overall pattern is that lower BR values and larger amounts of HR are associated with better health. But there are exceptions. For instance, while most countries in Cluster 4 showed a low birth rate, Israel had a relatively high birth rate and showed very good health outcomes. Note also that the best health outcomes are not necessarily associated with the highest amount of HR (see the top-left panels in column 2 and rows 1 and 2). For example, the United Kingdom, Cyprus, Greece, Portugal, and Japan presented very good health outcomes, even though their HR levels were among the lowest in Cluster 4.
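The covariate settings behind a panel grid of this kind are straightforward to construct: the two varied covariates take every combination of selected empirical quantiles, and the remaining covariates are pinned at their observed medians. A minimal sketch (the function `prediction_grid` and its interface are ours, purely for illustration):

```python
import numpy as np
from itertools import product

def prediction_grid(X, vary=("BR", "HR"), qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Build covariate settings for posterior predictive contours.

    X: dict mapping covariate name -> 1-D array of observed values.
    The two covariates in `vary` take every combination of the
    requested empirical quantiles; all remaining covariates are
    held fixed at their observed medians.
    """
    fixed = {k: float(np.median(v)) for k, v in X.items() if k not in vary}
    levels = {k: np.quantile(np.asarray(X[k], float), qs) for k in vary}
    grid = []
    for a, b in product(levels[vary[0]], levels[vary[1]]):
        row = dict(fixed)
        row[vary[0]], row[vary[1]] = float(a), float(b)
        grid.append(row)
    return grid  # one covariate setting per panel cell (5 x 5 = 25 here)
```

Each returned setting would then be fed to the posterior predictive density to produce one contour panel.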
HR levels do make a difference in countries with a high birth rate, especially for lowering child mortality rates. Take, for instance, Cluster 3, which is composed of countries in the mid range of health outcomes. Examination of the corresponding posterior predictive density (contour plots are shown in the supplementary material) reveals that moving only a single variable, BR or HR, would have little impact on health outcomes. A noticeable change in health outcomes requires a simultaneous movement in both variables. This implies an interaction effect between these variables, suggesting that policy makers should concentrate not on single factors, but on several of them simultaneously. Contours for the other two large groups, Clusters 2 and 4, show no major changes under "realistic" within-cluster changes in the levels of BR or HR. This is to be expected, since these countries are located at either the low or high extreme levels of both variables.

Discussion
The main purpose of this work is the use of the Potts model as a covariate-dependent prior on partitions. The discussion focused on a particular example (multivariate multiple linear regression) as a concrete illustration, but the approach is quite general and can be extended to essentially any likelihood model of interest. Doing so would also require adapting our computational approach to the specific model assumptions adopted.
In our examples, the covariates were all continuous, but this should not be seen as a restriction of the method. Any type of covariate can be used, as the kernel function K(·, ·) can be easily modified to accommodate such changes. It is important, though, to introduce some notion of standardization in the kernel values so as to mitigate the effect of different scales and/or variabilities in the covariate values.
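One way to combine covariate types while standardizing, sketched below: continuous coordinates are compared on a standardized scale and categorical coordinates through simple 0/1 matching. The Gaussian form, the matching term, and their equal weighting are illustrative choices of ours, not the paper's fixed specification of K(·, ·):

```python
import numpy as np

def kernel(xi, xj, cont_idx, cat_idx, scale):
    """Illustrative similarity K(x_i, x_j) for mixed covariate types.

    Continuous coordinates are divided by their sample standard
    deviations (`scale`) before computing a Gaussian similarity, so
    no covariate dominates through sheer magnitude; categorical
    coordinates contribute through the fraction of exact matches.
    """
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    d2 = np.sum(((xi[cont_idx] - xj[cont_idx]) / scale) ** 2)
    match = np.mean(xi[cat_idx] == xj[cat_idx]) if len(cat_idx) else 1.0
    return float(np.exp(-0.5 * d2) * match)
```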
Our simulation and data analysis results suggest that the Potts prior model has superior performance in the cases we tried, with both synthetic and real data. The only exception we found to this trend is the case where the covariate dependence was restricted to the prior on partitions, in which case the PPMx and Potts regression models had almost identical performance in the simulation.
A limitation of the algorithm outlined in Section 2.2 is that it requires analytical marginalization of ψ_ρ to derive p(Y_n | ρ, x_n), which is needed to compute (5). In the case of local multiple regression, this quantity can be calculated explicitly. We show that a clever approximation to the posterior predictive density makes the procedure very efficient and highly competitive with other flexible regression methods. When p(ψ_ρ | ρ, Y_n, x_n) is intractable, one may need to consider other Metropolis-Hastings proposals. However, as noted in Section 2.2, no evident simplification of the acceptance ratio arises.
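To make the analytical-marginalization requirement concrete, the sketch below computes p(Y_n | ρ) in closed form for a deliberately simplified conjugate setting: a univariate Normal likelihood with known variance and a Normal prior on a cluster-specific mean, integrated out via the Sherman-Morrison and matrix determinant identities. This is not the paper's multivariate regression model, only an illustration of how the marginal factors over the clusters of ρ:

```python
import numpy as np

def log_marginal(y, sigma2=1.0, tau2=1.0):
    """log p(y) for y_i ~ N(mu, sigma2) i.i.d. within a cluster,
    mu ~ N(0, tau2), with mu integrated out analytically.

    Uses |sigma2*I + tau2*J| = sigma2^m * (1 + m*tau2/sigma2) and
    Sherman-Morrison for the quadratic form.
    """
    y = np.asarray(y, float)
    m, s1, s2 = len(y), y.sum(), (y ** 2).sum()
    logdet = m * np.log(sigma2) + np.log1p(m * tau2 / sigma2)
    quad = s2 / sigma2 - tau2 * s1 ** 2 / (sigma2 * (sigma2 + m * tau2))
    return -0.5 * (m * np.log(2 * np.pi) + logdet + quad)

def log_marginal_partition(y, labels, sigma2=1.0, tau2=1.0):
    """log p(Y | rho): sum of per-cluster log marginals, since the
    cluster-specific parameters are independent given the partition."""
    y, labels = np.asarray(y, float), np.asarray(labels)
    return sum(log_marginal(y[labels == c], sigma2, tau2)
               for c in np.unique(labels))
```

In such conjugate cases, the marginal needed in the acceptance ratio is available cluster by cluster; outside them, no comparable factorization is generally available, which is precisely the limitation discussed above.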