Label Switching in Bayesian Mixture Models: Deterministic Relabeling Strategies

Label switching is a well-known problem in the Bayesian analysis of mixture models. On the one hand, it complicates inference; on the other, it has been perceived as a prerequisite to justify Markov chain Monte Carlo (MCMC) convergence. As a result, nonstandard MCMC algorithms that traverse the symmetric copies of the posterior distribution, and possibly genuine modes, have been proposed. To perform component-specific inference, methods to undo the label switching and to recover the interpretation of the components need to be applied. If latent allocations are included in the design of the MCMC strategy, and the sampler has converged, then the labels assigned to each component may change from iteration to iteration. However, observations allocated together must remain similar, and we use this fundamental fact to derive an easy and efficient solution to the label switching problem. We compare our strategy with other relabeling algorithms on univariate and multivariate data examples and demonstrate improvements over alternative strategies. Supplementary materials for this article are available online.


INTRODUCTION
The finite mixture model with k components, where we assume k is known, is written as

p(y_i \mid w, \theta) = \sum_{j=1}^{k} w_j f(y_i \mid \theta_j),   (1)

independently for i = 1, ..., n, where θ = {θ_j}_{j=1}^{k} and w = {w_j}_{j=1}^{k}. For (1) to be a density, the weights, w, must be nonnegative and sum to one. A useful idea when working with model (1) is to include latent allocation variables z = {z_i}_{i=1}^{n} such that, given z_i, the component from which y_i has been drawn is known, that is,

p(y_i \mid \theta, z_i) = f(y_i \mid \theta_{z_i}),  with  z_i \sim M_k(1 \mid w_1, \ldots, w_k).   (2)

Mixture models can be used to approximate irregular densities or to model heterogeneity. In a density estimation context, any distribution on the real line can be approximated to within any degree of accuracy, in the L_1 norm, using enough normal components; see Ferguson (1983). In this case, the flexibility of (1) is extended by allowing an infinite number of components. On the other hand, when a mixture model is used to model heterogeneity, there is a proper interpretation for (1): k represents the number of clusters within the data, w_j is the weight of cluster j in the population, and f(y|θ_j) is a parametric distribution that models the behavior of cluster j. The latent allocations (2) automatically induce a clustering structure over the observations.
In a Bayesian mixture modeling setup, to perform inference for any characteristic of the components, the objective is to find the posterior distribution and then calculate posterior summaries, which are approximated via Markov chain Monte Carlo (MCMC) methods. A problem emerges when we note that the likelihood of (1) is invariant under permutation of the component indices, and there are k! such permutations. Thus, under symmetric priors, the posterior will inherit the likelihood's invariance. As a result, in any MCMC algorithm, the labels of the components can permute multiple times between iterations of the sampler. This makes ergodic averages to estimate characteristics of the components useless. This is known as the label switching problem.
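The generative reading of (1)-(2) can be sketched in a few lines of code. The sketch below draws each observation by first sampling its latent allocation; the weights and normal component parameters are illustrative and do not come from the article.

```python
import random

# Sketch: draw n observations from a finite normal mixture by first sampling
# the latent allocation z_i from the weights, then y_i ~ N(mu_{z_i}, sigma_{z_i}).
# All weights and component parameters below are made up for illustration.

def sample_mixture(n, w, means, sds, rng):
    ys, zs = [], []
    for _ in range(n):
        z = rng.choices(range(len(w)), weights=w)[0]  # latent allocation z_i
        ys.append(rng.gauss(means[z], sds[z]))        # y_i | z_i ~ f(.|theta_{z_i})
        zs.append(z)
    return ys, zs

rng = random.Random(0)
y, z = sample_mixture(500, w=[0.3, 0.7], means=[-2.0, 3.0], sds=[1.0, 0.5], rng=rng)
```

The allocations z make the clustering structure explicit: observations sharing a z value were drawn from the same component.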
Paradoxically, as noted by Celeux, Hurn, and Robert (2000), Frühwirth-Schnatter (2001), Jasra, Holmes, and Stephens (2005), and Papastamoulis and Iliopoulos (2010), among others, label switching is a prerequisite for MCMC convergence. If there is no label switching, the sampler is not exploring all the modes of the posterior distribution of (1): it should visit all k! symmetric modes. Notably, when the modes of the posterior are well separated, the standard Gibbs sampler fails the label switching test (Jasra, Holmes, and Stephens 2005, p. 55), as it becomes trapped in one of the symmetric modes. While the resulting inference may still be meaningful, convergence cannot then be justified.
To address this problem, three alternative ideas have been suggested. The first keeps the standard Gibbs sampler and incorporates a Metropolis-Hastings move that proposes a random permutation of the labels; see Frühwirth-Schnatter (2001) and Papastamoulis and Iliopoulos (2010). The second is to use more sophisticated and complex MCMC methods to improve the mixing of the sampler, specifically avoiding the use of (2); see Jasra, Holmes, and Stephens (2005). The third is to use a transdimensional sampler; see Richardson and Green (1997).
Initial attempts to "undo the label switching" were focused on imposing artificial constraints on the parameter space via the prior distribution aiming to break the symmetry of the posterior distribution and force a unique labeling; see Diebolt and Robert (1994) or Richardson and Green (1997). However, Stephens (2000) showed that these constraints do not always solve the problem. Another idea, based on identifiability constraints over the parameter space, was proposed by Frühwirth-Schnatter (2001). The main criticism for this method is the difficulty to select the adequate constraints among all the possible choices.
More elaborate solutions aim to undo the label switching deterministically. This means finding permutations of the parameters that minimize a loss function; see Celeux (1998), Stephens (2000), Nobile and Fearnside (2007), and Grün and Leisch (2009). Recent methods combine similar ideas with the use of the maximum a posteriori (MAP) estimator; see Marin, Mengersen, and Robert (2005) and Papastamoulis and Iliopoulos (2010). Other approaches focus on devising a loss function invariant to permutations of the parameters (see, e.g., Celeux, Hurn, and Robert 2000), thus solving the label switching problem immediately. The drawbacks of this method are that it is computationally expensive and that such a loss function cannot always be defined; see Celeux, Hurn, and Robert (2000) and Jasra, Holmes, and Stephens (2005). Alternative proposals, which move away from the loss function framework, incorporate uncertainty in the estimation of the relabeling process. These methods include probabilistic relabeling and the ideas described by Yao (2012). Probabilistic relabeling strategies were first proposed by Jasra (2005) and then developed and extended by Sperrin, Jaki, and Wit (2010).
Our proposed solution to the label switching problem relies on the relationship between the allocation variables (2) and the observations. The key is to use this relationship to incorporate the data directly within the loss function used to undo the label switching. From iteration to iteration of an MCMC algorithm, the labels of the clusters may change; but if the sampler has converged, the clusters must remain roughly the same. We use this fundamental fact to derive an easy and efficient solution to the label switching problem, and we compare our proposal with current relabeling alternatives. To assess the effect of the chosen MCMC strategy on the resulting inference, we compare results obtained via the standard Gibbs sampler against those obtained via a transdimensional MCMC algorithm.
The article is structured as follows. In Section 2, we describe the label switching problem with associated ideas. Then, in Section 3, the key ideas under a deterministic relabeling strategy are outlined. In Section 4, our relabeling algorithm is motivated and derived. In Section 5, we illustrate and compare our proposal against other relabeling methodologies. The article concludes with a discussion in Section 6.

LABEL SWITCHING IN MIXTURE MODELS
The ultimate objective under any relabeling algorithm is to gain identifiability for the components and then to be able to perform component-specific inference.

COMPONENT-SPECIFIC INFERENCE IN MIXTURE MODELS
The latent variables (2) introduce a natural clustering structure over the observations and are the basis for making inference about the hidden groups. Under this framework, the goal is to find the cluster from which each y_i is drawn. Hence, to group a set of observations into k clusters, we make inference about the classification probabilities of each observation. If a single best clustering is all that is needed, each observation is assigned to the cluster with the highest classification probability. In the rare case that the parameters of the mixture are known, Bayes' theorem can be used to calculate the classification probabilities directly:

p(z_i = j \mid y_i, w, \theta) = \frac{w_j f(y_i \mid \theta_j)}{\sum_{l=1}^{k} w_l f(y_i \mid \theta_l)}.   (3)

When the parameters of the mixture are unknown, we need to calculate the posterior classification probabilities

p(z_i = j \mid y) = \int p(z_i = j \mid y_i, w, \theta)\, p(w, \theta \mid y)\, dw\, d\theta,   (4)

where y = {y_i}_{i=1}^{n}. Using MCMC methods, (4) can be approximated via

\hat{p}(z_i = j \mid y) = \frac{1}{N} \sum_{t=1}^{N} \frac{w_j^t f(y_i \mid \theta_j^t)}{\sum_{l=1}^{k} w_l^t f(y_i \mid \theta_l^t)},   (5)

where (w^t, θ^t) ∼ p(w, θ | y) for t = 1, ..., N (t indexes the iterations of the MCMC). If we want to classify a new observation y*, with y* conditionally independent of y, then

p(z^* = j \mid y^*, y) = \frac{\int w_j f(y^* \mid \theta_j)\, p(w, \theta \mid y)\, dw\, d\theta}{p(y^* \mid y)}.   (6)

This can be estimated in a way analogous to (5), with obvious changes. Note that (6) tells us that inference for the scaled predictive density of y* is equivalent to the estimation of its classification probabilities. Finally, with the same MCMC output, posterior mean estimates for the components' parameters (or functions of them) can be approximated by averaging over the parameters identified by the index of interest:

\widehat{E}[h(w_j, \theta_j) \mid y] = \frac{1}{N} \sum_{t=1}^{N} h(w_j^t, \theta_j^t).   (7)

Later in the article, our aim will be to estimate classification probabilities (4), scaled densities (6), and posterior means (7).
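As a quick illustration of the known-parameters case, the following sketch computes classification probabilities by Bayes' theorem for a two-component normal mixture; all parameter values are made up for the example.

```python
import math

# Sketch: Bayes-rule classification probabilities for known mixture parameters,
# p(z_i = j | y_i) proportional to w_j f(y_i | theta_j).  Normal components are
# assumed here purely for illustration.

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classification_probs(y, w, means, sds):
    num = [wj * normal_pdf(y, mj, sj) for wj, mj, sj in zip(w, means, sds)]
    total = sum(num)
    return [p / total for p in num]

probs = classification_probs(0.1, w=[0.5, 0.5], means=[-2.0, 2.0], sds=[1.0, 1.0])
best = max(range(len(probs)), key=lambda j: probs[j])  # single best clustering
```

Averaging such probabilities over posterior draws of (w, θ), as in the Monte Carlo estimate above, gives the unknown-parameters version.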

THE LABEL SWITCHING PROBLEM: A COMPLICATION FOR INFERENCE
The likelihood of a mixture model is invariant under permutations of its parameters. Let P_k be the set of the k! permutations of the labels {1, ..., k}. If for some ρ ∈ P_k we define ρ(w, θ) := ((w_{ρ(1)}, ..., w_{ρ(k)}), (θ_{ρ(1)}, ..., θ_{ρ(k)})), it is easy to observe that, for every ρ, ν ∈ P_k,

p(y \mid \rho(w, \theta)) = p(y \mid \nu(w, \theta)).   (8)

Consequently, if the support of the parameters is the same across components, for example, under symmetric priors, the posterior distribution will inherit the likelihood's invariance. Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups, which makes (5) and all other ergodic averages to estimate characteristics of the components useless.
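The invariance can be verified numerically: permuting the labels of an illustrative three-component normal mixture leaves the log-likelihood unchanged up to floating-point error.

```python
import itertools, math

# Numerical check of the likelihood's invariance under label permutations.
# The data and mixture parameters are illustrative values, not from the article.

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_lik(ys, w, means, sds):
    return sum(math.log(sum(wj * normal_pdf(yi, mj, sj)
                            for wj, mj, sj in zip(w, means, sds))) for yi in ys)

ys = [-1.3, 0.4, 2.2, 2.9]
w, means, sds = [0.2, 0.3, 0.5], [-1.0, 0.0, 3.0], [1.0, 0.5, 0.8]
base = log_lik(ys, w, means, sds)
for perm in itertools.permutations(range(3)):  # all k! = 6 relabelings
    permuted = log_lik(ys, [w[p] for p in perm], [means[p] for p in perm],
                       [sds[p] for p in perm])
    assert abs(permuted - base) < 1e-10
```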

MCMC CONVERGENCE IN MIXTURE MODELS
Under a mixture model setup, to assess convergence of an MCMC strategy, marginal posterior estimates are monitored. For the Gibbs sampler, plots of these estimates show that in some cases the labels of the components do not permute (see Celeux, Hurn, and Robert 2000, pp. 960, 962; Jasra, Holmes, and Stephens 2005, p. 55). This has been perceived as a symptom of poor mixing of the sampler.
To solve the lack of label switching, Frühwirth-Schnatter (2001) added a Metropolis-Hastings move that proposes a random permutation of the labels, which is accepted with probability one. This solution can be implemented in a postprocessing stage, without further modification of the sampler. However, it was discarded by Celeux, Hurn, and Robert (2000, p. 959). "Our insistence on searching for an algorithm that can achieve symmetry without such a move is that any concerns over convergence are not necessarily dealt by such a strategy, which simply alleviates the most obvious symptom," and in Jasra, Holmes, and Stephens (2005, p. 55), it is stated that "this course of action is only appropriate if the posterior distribution is not genuinely multimodal (which would not be known a priori to simulation)." Both sets of authors concluded that alternative and more complex MCMC samplers should be considered. A broader discussion along the same lines can be found in Geweke (2007), who is not sympathetic to the construction of complex MCMC algorithms.
To address the convergence problem, Papastamoulis and Iliopoulos (2010) implemented, at each iteration of a Gibbs sampler, a Metropolis-Hastings step that proposes a random permutation between two labels of the mixture, j, l ∈ {1, ..., k}. The acceptance probability for this proposal is given by α(j, l) = min{1, (w_j / w_l)^{n_l − n_j}}. However, this idea was designed to introduce label switching in infinite mixtures, and under that setting the weights are weakly identifiable; see Papaspiliopoulos and Roberts (2008, pp. 178-179). This is not the case for the weights of a finite mixture model, so this proposal is an alternative to the ideas described by Frühwirth-Schnatter (2001).
The last alternative is to use a transdimensional sampler. The point here is that a moving-k sampler is able to move around the modes of the posterior distribution of the fixed-k model via models of lower or higher dimensions, thus taking fewer iterations to escape trapping states and, in general, mixing better within k. Note that these samplers can be computationally demanding and may be inefficient, because the chain will not necessarily stay in the model of interest for the entire run; see Richardson and Green (1997), Stephens (1997, sec. 4.4.1), and Jasra, Holmes, and Stephens (2005).
A further justification for complex MCMC and transdimensional samplers is possible genuine multimodality. Genuine multimodality is more than the modes being identical up to permutation of the labels; see Stephens (1997, secs. 2.3 and 3.4.2). Hence, the argument against the standard Gibbs sampler is that in the presence of genuine multimodality, it can get trapped in one of the symmetric modes, and not visit genuine modes.

DETERMINISTIC RELABELING STRATEGIES
Suppose a sample from the posterior distribution of (1) is generated: (w^t, θ^t) ∼ p(w, θ | y), for t = 1, ..., N. The aim under any relabeling strategy is to find a permutation ρ_t ∈ P_k such that the permuted sample

\rho_t(w^t, \theta^t),  for t = 1, \ldots, N,   (9)

recovers identifiability for the mixture components. An obvious way to proceed is to find artificial constraints over the parameter space of (w, θ). It is important to note that, in this case, the permutation ρ_t ∈ P_k will be the one for which the constraints are satisfied. We can start with basic constraints and then use summaries of the MCMC output to update them; see Frühwirth-Schnatter (2001). While this might work in the univariate case, it is challenging to extend to the multivariate setting. Thus, the objective is to find constraints automatically via deterministic algorithms. This reduces to finding the permutation ρ_t that minimizes a loss function. It is important to stress that such a strategy can be implemented as an online method or by postprocessing the sample, without modifying the MCMC, hence allowing us to make inference with the permuted MCMC sample. For a formal justification, see Stephens (1997, proposition 3.1).
The key points of any deterministic relabeling strategy are the definition of the loss function, the selection of a pivot, and the minimization step. Let L^t be the loss function to minimize, and c^t_{l,j} the cost incurred if the assignment ρ_t(l) = j is chosen. The minimization step for the sampled values at iteration t of the MCMC can be formulated as the well-known assignment problem (see Burkard, Dell'Amico, and Martello 2009):

minimize  L^t = \sum_{l=1}^{k} \sum_{j=1}^{k} c^t_{l,j}\, a_{l,j}
subject to  \sum_{l=1}^{k} a_{l,j} = 1,  \sum_{j=1}^{k} a_{l,j} = 1,  a_{l,j} \in \{0, 1\}.   (10)

We solve (10) by minimizing over the (a_{l,j}); if, for some t, (\hat{a}_{l,j}) is an optimal solution with \hat{a}_{l,j} = 1, then this corresponds to the permutation ρ_t(l) = j. Also, let \hat{L}^t be the loss function (10) evaluated at (\hat{a}_{l,j}). This formulation was first used by Stephens (1997). In the next section, we describe the algorithms, loss functions, and pivots used by Stephens (2000) and Papastamoulis and Iliopoulos (2010).
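For the small k typical of these models, the assignment problem can even be solved by brute force over the k! permutations, as in this sketch (the cost matrix is illustrative); for larger k, a Hungarian-type solver such as scipy.optimize.linear_sum_assignment would be preferable.

```python
import itertools

# Sketch: solve the assignment problem by enumerating all k! permutations.
# cost[l][j] is the cost of choosing rho(l) = j; the values are illustrative.

def solve_assignment(cost):
    k = len(cost)
    best_perm, best_loss = None, float("inf")
    for perm in itertools.permutations(range(k)):  # perm[l] = rho(l)
        loss = sum(cost[l][perm[l]] for l in range(k))
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss

cost = [[4.0, 1.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 2.0, 2.0]]
perm, loss = solve_assignment(cost)
```

Brute force costs O(k! * k), which is negligible for, say, k <= 8, and avoids any external dependency.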

KULLBACK-LEIBLER RELABELING
The idea behind Stephens' Kullback-Leibler (KL) relabeling is straightforward; see Stephens (2000). First, during the MCMC algorithm, the classification probabilities are stored. Let P^t be the n × k matrix of classification probabilities at iteration t of the MCMC, that is,

P^t_{i,j} = \frac{w_j^t f(y_i \mid \theta_j^t)}{\sum_{l=1}^{k} w_l^t f(y_i \mid \theta_l^t)}.

Second, let Q be the n × k matrix of "true classification probabilities." Under Stephens' setting, the matrix Q is the pivot, and the goal is to find a permutation ρ_t ∈ P_k of the columns of P^t such that the KL divergence between ρ_t(P^t) and Q is minimized over all permutations. Clearly, Q is not known; thus, it is approximated via a recursive algorithm. We provide a description of this strategy in Algorithm 1.
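A sketch of a KL-type cost matrix follows. This is one plausible reading of the cost, not Stephens' exact formulation: column l of P^t is compared with column j of the pivot Q through a Kullback-Leibler divergence, with a small epsilon guarding the logarithm. The matrices are invented for the example.

```python
import math

# Sketch: KL-type relabeling costs c[l][j] = sum_i P[i][l] * log(P[i][l] / Q[i][j]),
# comparing column l of the iteration-t classification matrix with column j of
# the pivot Q.  EPS guards against log(0); values are illustrative only.

EPS = 1e-12

def kl_costs(P, Q):
    n, k = len(P), len(P[0])
    return [[sum(P[i][l] * math.log((P[i][l] + EPS) / (Q[i][j] + EPS))
                 for i in range(n))
             for j in range(k)] for l in range(k)]

P = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]   # classification probs at iteration t
Q = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2]]   # pivot: columns are swapped here
cost = kl_costs(P, Q)
```

Feeding this cost matrix to an assignment solver recovers the column swap: matching column 0 of P to column 1 of Q (and vice versa) has near-zero cost.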

EQUIVALENT CLASSES REPRESENTATIVES RELABELING
The equivalent classes representatives (ECR) method (Papastamoulis and Iliopoulos 2010) is a fast and easy strategy to undo the label switching. The idea is as follows. We can define two allocation vectors z^1 and z^2 as equivalent if there is a permutation ρ ∈ P_k such that z^1 = ρ(z^2). Let g = (g_1, ..., g_n) be the vector of "true allocations" and z^t the vector of allocations sampled at iteration t of the MCMC algorithm. To undo the label switching, we find the permutation ρ_t ∈ P_k that minimizes the simple matching distance (SMD) between g and ρ_t(z^t) = (ρ_t(z^t_1), ..., ρ_t(z^t_n)), for t = 1, ..., N:

\mathrm{SMD}(g, \rho_t(z^t)) = \sum_{i=1}^{n} I(g_i \neq \rho_t(z^t_i)).

Thus, in this case, the pivot will be the vector g. An easy solution that avoids a recursive algorithm to estimate g is to use the MCMC output to choose a "good" allocation vector, and following Marin, Mengersen, and Robert (2005), Papastamoulis and Iliopoulos (2010) used the MAP estimator: at each iteration of the MCMC, we store the latent allocations, z^t, and the log-posterior, \ell^t = \log p(w^t, \theta^t, z^t \mid y). To choose the MAP estimate, we find t* such that \ell^{t*} = \max_{t=1,\ldots,N} \ell^t and let g_MAP = z^{t*}. A description of the ECR relabeling strategy is provided in Algorithm 2; note that n^t_j = #{i : z^t_i = j} and n^t_{j,l} = #{i : z^t_i = j, g_i = l}.

Algorithm 2 ECR relabeling strategy
1: Find g_MAP.
2: For t = 1 to N, find ρ_t solving (10) with costs c^t_{l,j} = n^t_j − n^t_{j,l}.
From Algorithm 2, it is not immediately clear how the costs work, so we have devised a simple example: let z^t and g_MAP be as in Table 1. It is clear that the permutation that minimizes the SMD is ρ_t = {3, 1, 2, 4}; hence, ρ_t(1) = 3, ρ_t(2) = 1, ρ_t(3) = 2, and ρ_t(4) = 4. The costs are thus matching distances translated into comparisons of counts. According to Papastamoulis and Iliopoulos (2010), besides the MAP estimate, other valid choices for good allocation vectors are the most probable allocation and the allocation vector corresponding to the maximum of the likelihood (8).
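A small sketch in the spirit of the example above (with made-up allocations, not those of Table 1) finds the permutation minimizing the simple matching distance by direct search.

```python
import itertools

# Sketch of the ECR idea: find the label permutation that minimizes the simple
# matching distance (number of disagreements) between a pivot allocation g and
# the permuted sampled allocation rho(z).  Allocations are illustrative.

def smd(g, z, perm):  # perm[label] = relabeled value applied to z
    return sum(1 for gi, zi in zip(g, z) if gi != perm[zi])

g = [0, 0, 1, 1, 2, 2, 2]          # pivot allocations (e.g., the MAP clustering)
z = [2, 2, 0, 0, 1, 1, 1]          # iteration-t allocations: labels rotated
best = min(itertools.permutations(range(3)), key=lambda p: smd(g, z, p))
```

Here the rotation 2 -> 0, 0 -> 1, 1 -> 2 makes the two clusterings agree exactly, so the minimizing permutation attains SMD zero.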
We stress that in this summary of the ECR algorithm, we have restricted our attention to the costs. However, to show that the sampler converges to the correct posterior distribution, Papastamoulis and Iliopoulos (2010) placed more emphasis on arguments about the equivalence classes induced by ρ(z) for ρ ∈ P_k and on convergence proofs: the name representatives comes from taking a set Z_0 consisting of exactly one representative from each equivalence class and then showing that a sample from the posterior p(w, θ | y) can be obtained by calculating the posterior restricted to Z_0 and then permuting the labels of the simulated values.

COMMENTS ON KL AND ECR ALGORITHMS
The KL and ECR algorithms need estimated pivot variables Q and g, respectively. We stress that these must be chosen as close as possible to one of the k! symmetric modes of the posterior distribution: the aim is to eliminate the influence of the remaining symmetric copies. The KL algorithm uses a posterior mean, and the ECR strategy aims for a posterior mode. This is an important point because it is valid for all relabeling algorithms. KL obtains its pivot by starting from a bad estimate and improving it via fixed-point ideas, leading to an iterative algorithm that converges to a local minimum, while ECR aims to find a good estimate from the start (usually the MAP estimate). Once the pivot variables are obtained, they are used to permute the MCMC output, that is, (9), and obtain a representative of one of the modes of the posterior distribution.
The real difference between the algorithms is the costs used in the loss function (10). It is possible to use the MAP estimator of Q, Q_MAP, in the KL algorithm. In the same manner, via algorithm 1 of Stephens (1997, p. 800), an estimate \hat{g} = \{\hat{g}_i\}_{i=1}^{n} of g can be obtained. A description of the ECR strategy via an iterative process is provided in Algorithm 3, and in this case, n^t_{j,l} = #{i : z^t_i = j, \hat{g}_i = l}. Now we start with a bad estimate \hat{g} and improve it via an iterative algorithm. The last iteration will be exactly the original ECR algorithm, but with the pivot obtained in an alternative manner.
Note, however, that neither in the original ECR algorithm nor in the iterative version will the pivot g coincide with the final estimate of the single best clustering. An alternative that ensures this is provided in Algorithm 4, although this algorithm is computationally demanding.

USING THE DATA TO UNDO THE LABEL SWITCHING
Our aim is to combine all the information about the clusters gathered by the latent allocations, (2), and the data to devise a simple loss function. Then, as with other relabeling algorithms, to undo the label switching, we will find the permutation of the indices that minimizes a loss function.
The key point is that if the MCMC has converged, from iteration to iteration, the labels of each cluster may change. But the clusters should be roughly the same. The difference should be small enough for us to discover where a cluster of the data has moved to. Hence, all that we need is to keep track of the k clusters throughout each iteration of the sampler. For now, to explain our ideas, we will restrict our attention to the univariate case. The multivariate problem is a trivial generalization and is explained in the illustrations for multivariate data.

A SIMPLE LOSS FUNCTION
Let us assume that the centers and dispersions of the clusters within the data are known: {m_1, ..., m_k} and {s_1, ..., s_k}. These will be our pivot variables. To undo the label switching, we can find the permutation ρ_t ∈ P_k, for t = 1, ..., N, that minimizes

L^t = \sum_{j=1}^{k} \frac{n^t_{\rho_t(j)}}{s_j^2} \left[ \left( \bar{y}^t_{\rho_t(j)} - m_j \right)^2 + \left( s^t_{\rho_t(j)} - s_j \right)^2 \right],   (11)

where n^t_{\rho_t(j)} = \#\{i : z^t_i = \rho_t(j)\}, and \bar{y}^t_{\rho_t(j)} and s^t_{\rho_t(j)} are the sample mean and sample standard deviation of the observations allocated to group ρ_t(j). With this loss, the costs in (10) will be given by

c^t_{l,j} = \frac{n^t_l}{s_j^2} \left[ \left( \bar{y}^t_l - m_j \right)^2 + \left( s^t_l - s_j \right)^2 \right].   (12)

If the clustering assumptions induced by (1) and (2) are met, once the MCMC has converged, at each iteration of the sampler there must be k clusters identified by the conditional relationship between y_i and z^t_i. Thus, a k-means type of divergence measure, (11), will be able to keep track of each cluster. To measure deviations from the center of each cluster, m_j, we eliminate the effect of the scale, and because the number of observations allocated within each cluster can be rather distinct, we make the divergences proportional to the size of each group. These considerations improved the loss function greatly.
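The following sketch is one plausible reading of these costs: each sampled group l is compared with pivot cluster j through scale-free squared deviations of its sample mean and standard deviation, weighted by the group size. The data, allocations, and pivots are invented, and the exact functional form in the article may differ slightly.

```python
import itertools, math

# Sketch of data-based relabeling costs: group l (from the sampled allocations z)
# vs. pivot cluster j with center m[j] and dispersion s[j].  Illustrative only.

def group_stats(y, z, k):
    stats = []
    for j in range(k):
        pts = [yi for yi, zi in zip(y, z) if zi == j]
        n = len(pts)
        mean = sum(pts) / n if n else 0.0
        sd = math.sqrt(sum((p - mean) ** 2 for p in pts) / (n - 1)) if n > 1 else 0.0
        stats.append((n, mean, sd))
    return stats

def data_costs(y, z, m, s):
    k = len(m)
    stats = group_stats(y, z, k)
    return [[stats[l][0] * (((stats[l][1] - m[j]) / s[j]) ** 2 +
                            ((stats[l][2] - s[j]) / s[j]) ** 2)
             for j in range(k)] for l in range(k)]

y = [-2.1, -1.9, -2.0, 4.0, 4.2, 3.8]
z = [1, 1, 1, 0, 0, 0]                    # labels swapped relative to the pivot
cost = data_costs(y, z, m=[-2.0, 4.0], s=[0.1, 0.2])
best = min(itertools.permutations(range(2)),
           key=lambda p: sum(cost[l][p[l]] for l in range(2)))
```

Because the costs depend on the data only through per-group sample summaries, they are cheap to compute and the swap is recovered directly.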

Means and Standard Deviations.
There are different alternatives for measures of central tendency and dispersion for the clusters; here, we derive algorithms based on the mean and standard deviation. Later in the article, we describe how to use more robust alternatives.
To estimate the means and standard deviations for each cluster, we could use the mixture parameters or functions of them; for example, in the case of the normal mixture, the obvious options are posterior means for the means and standard deviations of each component, that is, m_j = E(μ_j | y) and s_j = E(σ_j | y), for j = 1, ..., k. However, there are two problems with this approach. First, the algorithm would be restricted to mixtures with a particular component distribution. Second, we would need to estimate these posterior means from the MCMC output, and the parameters sampled via the MCMC are random and can exhibit great variability depending on whether the Markov chain traverses regions of high or low probability in the posterior distribution. To solve these problems, we work with functions depending only on the data and the latent allocations. The data are fixed, and there is only a certain amount of variability that subsets of it will admit. On the other hand, the distributional assumptions of the model are contained in the clusters induced by the latent allocations.
With these ideas, to estimate the means and standard deviations of each cluster, we use posterior means of the sample mean and sample standard deviation:

m_j = E(\bar{y}_j \mid y)  and  s_j = E(s_j \mid y),  for j = 1, \ldots, k,   (13)

where

\bar{y}_j = \frac{1}{n_j} \sum_{i : z_i = j} y_i  and  s_j = \left[ \frac{1}{n_j - 1} \sum_{i : z_i = j} (y_i - \bar{y}_j)^2 \right]^{1/2},

with n_j = #{i : z_i = j}. Thus, for (13), the function h(y, z, j) is given by \bar{y}_j and s_j for the sample mean and sample standard deviation, respectively. The posterior means in (13) will be estimated using the MCMC output. Section 4.2, specifically Algorithm 5, explains the technical details for the cases n^t_j = 0 or 1.

THE ALGORITHM
As with the previous algorithms, to define the costs (12), complete knowledge of the pivot variables is assumed; in our case, the means and standard deviations of each group. We have two ways to proceed: via a recursive algorithm or using estimates. We use the second approach simply because it is faster, but for completeness we describe both.
The idea to find estimates is as follows. First, we obtain initial estimates of our pivot variables, in this case (13), in one of the modes of the posterior distribution. If the MCMC has converged, we choose the first set of allocations after the burn-in period and calculate our starting sample means and sample standard deviations with it. At this point, the MCMC should be traversing one of the k! modes of the posterior distribution, and for our purpose, it is not important which particular mode this is. To move these initial pivots closer to the center of this particular mode, we average the same estimates for each group over all the iterations of the MCMC algorithm, preserving identifiability via (11). The final estimates will be close to one of the modes of the posterior distribution, and we can use them as pivots to find the permutations (9). A formal description of the strategy to find estimates is given in Algorithm 5 (note that R = y_(n) − y_(1)).
Algorithm 5 Data-based relabeling: finding estimates
1: For j = 1 to k, initialize n^m_j = 1, n^s_j = 1, m_j = y_(1) + R j / (k + 1), and s_j = R / k.
2: For t = 1 to N:
   (a) Find ρ_t solving (10) with costs (12).
   (b) For j = 1 to k, update m_j, s_j, n^m_j, and n^s_j:
       If n^t_{ρ_t(j)} > 0, then m_j = [(n^m_j − 1) m_j + \bar{y}^t_{ρ_t(j)}] / n^m_j and n^m_j = n^m_j + 1.
       If n^t_{ρ_t(j)} > 1, then s_j = [(n^s_j − 1) s_j + s^t_{ρ_t(j)}] / n^s_j and n^s_j = n^s_j + 1.
A description of the Data relabeling via estimates is provided in Algorithm 6. We have defined a two-step procedure: in Algorithm 5, the aim is to find estimates of the mean and standard deviation for each cluster, while in Algorithm 6, the interest lies in the permutations ρ_t, for t = 1, ..., N. Several experiments were performed using Algorithm 5 alone to obtain the permutations directly, and in some cases the results were as good as with the two-step procedure; in general, however, the two-step procedure was more accurate.
Algorithm 6 Data-based relabeling: estimates strategy
1: Find estimates \hat{m}_j and \hat{s}_j, for j = 1 to k, via Algorithm 5.
2: For t = 1 to N, find ρ_t solving (10) with costs (12) evaluated at the estimates.

The corresponding recursive algorithm follows directly from algorithm 1 of Stephens (1997, p. 800). A description of this strategy is given in Algorithm 7.
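The running-average update at the heart of Algorithm 5 can be sketched as follows; the notation is loose and the sequence of cluster sample means is invented for the example.

```python
# Sketch of the online pivot update: after relabeling iteration t, the pivot
# mean m_j is refined with the group's sample mean via a running average.
# Starting the counter at 1 means the initializer is discarded at the first
# update, as in Algorithm 5.  The sample-mean sequence below is illustrative.

def online_update(current, count, new_value):
    # running mean: ((count - 1) * current + new_value) / count
    return ((count - 1) * current + new_value) / count

m, n_m = 0.0, 1                            # crude initializer and its counter
for ybar_t in [2.0, 2.2, 1.8, 2.0]:        # sample means of cluster j over t
    m = online_update(m, n_m, ybar_t)
    n_m += 1
```

After the loop, m equals the plain average of the observed sample means, so the final pivot reflects the whole (relabeled) chain rather than the crude initializer.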

ADDITIONAL COMMENTS
In this section, we describe the ideas behind the definition of the costs (12) and possible alternatives.

Main Ideas.
There are two important ideas behind the definition of (12). First, it is easier to obtain estimates (pivots) close to one of the modes of the posterior distribution using general characteristics of the components, such as location and dispersion, than via those related to individual observations, such as classification probabilities or allocations. If the algorithm has converged, we can obtain estimates for the location and dispersion of the groups directly from the MCMC output: the relationship between the observations and the allocations is the key. Second, the observations are incorporated directly within the loss, and not only via the MCMC output. Indeed, note that all the quantities needed for the calculation of the costs (12) are based on the data, while the groups are indicated by the allocations. This should reduce the variability of the costs and thus lead to a more stable and robust relabeling algorithm.

Extensions.
Our costs (12) can be computed efficiently and, in the examples that we have studied, they have good performance. However, it is straightforward to derive alternatives: we can start by using robust measures of location and dispersion for the groups, for example, the median and the median absolute deviation. Thus, in (13), we substitute (\bar{y}_j, s_j) by (med_j, mad_j). With this change, the costs (12) will be resistant to outliers, which is useful when the sampler visits the tails of the posterior distribution. To find (\hat{m}_j, \hat{s}_j), for j = 1, ..., k, in Algorithm 5, we would use the corresponding sample medians and median absolute deviations of each group. The calculation of these statistics leads to a slower algorithm but provides accurate component-specific inference.
Another alternative, as suggested by a referee, is to use in (10) costs based on the component densities,

c^t_{l,j} = -\sum_{i : z^t_i = l} \log f(y_i \mid \hat{\theta}_j),   (14)

thus devising label switching algorithms for particular component distributions. The idea again is to find posterior estimates \hat{\theta}_j using the data and the allocations, as in (13). Under this setting, for a mixture of normals, the costs will be given by

c^t_{l,j} = \sum_{i : z^t_i = l} \left[ \log \hat{\sigma}_j + \frac{(y_i - \hat{\mu}_j)^2}{2 \hat{\sigma}_j^2} \right].

To estimate \hat{\mu}_j and \hat{\sigma}_j, we can again use (13). Importantly, all the elements of the costs connect directly with the data.

COMPARISONS
In this section, using simulated and real datasets, we compare the performance of the relabeling algorithms: Data, ECR, and KL. First, all comparisons are carried out under a univariate setting. Then we move to the multivariate case.

UNIVARIATE MIXTURES
For our first comparison, we take samples from three models involving mixtures. We generated samples of size n = 1000, n = 200, and n = 600 from Models 1, 2, and 3, respectively. To improve comparability, the samples were not randomly drawn: we evaluated the quantile function of each mixture on a grid in (0, 1); a friendly implementation of this can be found in the R package "nor1mix"; see Mächler (2011). The idea is that under noninformative priors, inference is only as good as the sample, and since the objective is to compare our proposal with other alternatives, using the true values as a reference is the best way to proceed. This type of data has been used before by Nobile and Fearnside (2007) to illustrate the performance of their transdimensional MCMC for mixture models.
Model 1 was designed to exhibit the same characteristics as the crab data, which was first analyzed by Pearson (1894), and comprises the measurements of the ratio of forehead to body length of 1000 crabs. Model 2 is a mixture of symmetric and poorly separated components, and Model 3 has been used by Papastamoulis and Iliopoulos (2010) to demonstrate the performance of the ECR algorithm. All mixtures have overlapping components and present challenging scenarios for any relabeling algorithm.
For the real data, we have chosen the enzyme data. This dataset concerns the distribution of enzymatic activity in the blood, for an enzyme involved in the metabolism of a carcinogenic substance, among a group of 245 unrelated individuals. It was analyzed by Richardson and Green (1997).
To compare the effect on inference related to the choice of MCMC strategy, we will compare posterior mean estimates, obtained via the standard Gibbs sampler, with those obtained via its reversible jump extension.

Relabeling Algorithms, Gibbs Sampler.
For Models 1-3, we work with the Gibbs sampler for normal mixtures described by Richardson and Green (1997), along with their data-based priors. In each case, we generated an MCMC chain of 60,000 iterations, discarding the first 30,000 as a burn-in period. Then, with the output of the same chain, we undo the label switching using each algorithm: Data, ECR, and KL. Thus, for each case, posterior mean estimates for the weights, means, and variances are calculated. To measure the accuracy of the point estimates, we determined the relative error

\frac{|\widehat{h}(w_j, \theta_j) - h(w_j, \theta_j)|}{|h(w_j, \theta_j)|},

where the posterior means for the functions h_1(w_j, θ_j) = w_j and h_2(w_j, θ_j) = θ_j are estimated via (7). To have a fair comparison, this was repeated 100 times.
In Table 2, we present the first comparison, with the dataset from Model 1. In this case, from the column of average relative errors, it is clear that the Data and KL algorithms are more accurate than the ECR strategy; there is no difference in precision between Data and KL. In Figure 1(a)-1(c), we display the traces for the means (just 10,000 iterations) after the Data, ECR, and KL algorithms, respectively, were used to undo the label switching. From these graphics, it is clear that while the Data and KL strategies introduce the constraint μ_1 < μ_2 over all 10,000 iterations, the ECR relabeling introduced the same constraint only on iterations where regions of high posterior probability were explored. Hence, since Papastamoulis and Iliopoulos (2010) proposed further alternatives for obtaining good vectors of allocations, we replaced g MAP with the true single best clustering (there is no better alternative) in the ECR algorithm, and the results were then the same as with the Data and KL strategies. To rule out convergence problems, we performed individual runs of 500,000 iterations (taking 400,000 as a burn-in period), but the results calculated via the ECR algorithm and g MAP were similar to those presented in Table 2 and Figure 1.

For our second comparison, we use the dataset from Model 2; the results are shown in Table 3. In this case, we have a symmetric mixture of four components where the only difference between components is the mean. From the average relative errors for the weights and means, we observe that the Data relabeling is far more accurate than either the ECR or KL strategies. Estimates of the variances are equally poor across the three algorithms. In the scaled components, predictive density, and single best clustering presented in Figure 2, we observe again that the Data relabeling performs much better than the alternatives.
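What "introducing the constraint μ_1 < μ_2" means for a trace can be illustrated by permuting each MCMC draw so that the component means are increasing. This simple ordering is not the Data, ECR, or KL algorithm; it is only a sketch of the post hoc identifiability constraint observed in Figure 1:

```python
def relabel_by_mean(w, mu, sigma):
    # Permute one MCMC draw so component means are increasing,
    # imposing mu_1 < mu_2 < ... after the fact.  NOT the Data/ECR/KL
    # algorithms; an illustration of the ordering constraint only.
    order = sorted(range(len(mu)), key=lambda j: mu[j])
    return ([w[j] for j in order],
            [mu[j] for j in order],
            [sigma[j] for j in order])

# A label-switched draw: component labels arrive in the "wrong" order
w, mu, sigma = relabel_by_mean([0.3, 0.7], [2.1, -1.9], [1.2, 0.8])
# mu is now [-1.9, 2.1]; the weights and variances follow the same permutation
```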
True scaled densities were plotted via w_j N(y | μ_j, σ_j), and the true single best clustering was determined using Equation (3). Here we can observe that the single best clustering achieved with the Data relabeling, Figure 2(d), is almost identical to the one calculated with the true values [Figure 2(a)]. Out of curiosity, we used the true single best clustering instead of g MAP in the ECR algorithm, and the Data relabeling results shown in Table 3 remained more accurate. To rule out convergence problems, we ran the Gibbs sampler for 1,000,000 iterations, discarding the first 900,000, and the results were almost identical to those described above.
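The scaled densities and the clustering rule above can be sketched as follows, under the assumption that Equation (3) allocates each observation to the component with the largest scaled density w_j N(y | μ_j, σ_j); treat this as an illustrative reading of the rule, not its definitive form:

```python
import math

def scaled_density(y, w, m, s):
    # One scaled component: w * N(y | m, s)
    return w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def single_best_clustering(ys, w, mu, sigma):
    # Assign each observation to the component maximizing the scaled
    # density; assumed to match Equation (3) of the text
    return [max(range(len(w)),
                key=lambda j: scaled_density(y, w[j], mu[j], sigma[j]))
            for y in ys]

# Well-separated illustrative components: allocation follows the nearest mean
z = single_best_clustering([-2.0, 2.0], [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
# z == [0, 1]
```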
Our insistence on maintaining a maximum of 100,000 iterations for the analysis stems from the storage requirements for the classification probabilities and the time needed by the KL algorithm to converge (in this case, approximately 16 min). See the Appendix (given in the online supplementary materials) for more about the storage requirements.
In Model 3, we have three separated components (components 3, 4, and 5) and two components with the same mean and weight (components 1 and 2), where component 1 has a larger variance than component 2. Here, posterior mean estimates for the separated components are acceptable, but for components 1 and 2 we observe difficulties; see Table 4. In this case, posterior mean estimates for the weights calculated via our relabeling algorithm are more accurate, while for the other variables the ECR algorithm performs marginally better.

For the enzyme data, we ran the Gibbs sampler for normal mixtures described by Richardson and Green (1997), along with their data-based priors. Here, assuming k = 4, we generated 200,000 iterations and used the first 100,000 as a burn-in period. Then, with the output of the same MCMC, we undo the label switching using each algorithm: Data, ECR, and KL. Estimated scaled densities and the single best clustering are displayed in Figure 3. There are differences in the single best clustering produced by the three algorithms; see groups 2 and 4 (in red and purple, respectively). There are also differences in the estimates of the scaled densities.
In Table 5, we present the recorded average time required to undo the label switching with each algorithm. For the Data relabeling, we differentiate between the time registered to find estimates (Algorithm 5) and the actual relabeling time (Algorithm 6). For the ECR strategy, the time required to calculate the log-posterior distribution and to find and extract the MAP estimate was not recorded. Given the order of the system times displayed in Table 5 (in seconds), we believe that if these calculations were included, the Data and ECR relabelings would be equally fast. We remark that this is for the univariate case.

Relabeling Algorithms, Reversible Jump.
Here, the results obtained in the last section for Models 1-3 are compared with estimates obtained via the reversible jump sampler for normal mixtures described by Richardson and Green (1997), along with their data-based priors. First, to obtain convergence, we generated 500,000 iterations and kept the last 100,000 for Model 1 and the last 200,000 for Models 2 and 3. Then the output relevant to a given k was extracted: k = 2, 4, and 5 for the datasets of Models 1, 2, and 3, respectively. Finally, after relabeling the output, posterior mean estimates were calculated. In this case, because reversible jump is computationally more expensive than the Gibbs sampler, we did not carry out 100 experiments. A remarkable difference between the two samplers is how they explore the support of the posterior distribution: the reversible jump sampler visits regions of high and low probability uniformly, while the Gibbs sampler explores mainly high-probability regions, and does so in a nonuniform manner. This can clearly be observed in Figure 1, where traces of the means for the dataset from Model 1 are displayed, for only 10,000 iterations. Figure 1(d)-1(f) correspond to the reversible jump output after undoing the label switching with the Data, ECR, and KL algorithms, respectively; Figure 1(a)-1(c) are the corresponding traces generated with the output of the Gibbs sampler.
For Model 1, with reversible jump, we did not observe major improvements in accuracy and all the estimates were very close to those shown in Table 2.
For the dataset from Model 2, we observed an improvement in the accuracy of the estimation of the weights. For example, with reversible jump and the Data relabeling, we achieved a relative error of 0.269, while the average relative error via the Gibbs sampler was 0.442, with a standard deviation of just 0.089. Similar improvements were obtained with the ECR and KL strategies. Curiously, estimates for the means and variances went in the opposite direction: in the case of the Data relabeling, with the Gibbs sampler we obtained an average relative error of 0.776 and a standard deviation of 0.065, while via reversible jump a relative error of 0.923 was recorded.
For Model 3, we again observed an improvement in the estimation of the weights. This time it happened only with the Data relabeling: from an average relative error of 0.818 (and a standard deviation of 0.230), it went to 0.455. All the other estimates were of the same order as those displayed in Table 4.
To conclude this section, it is important to note that the computational cost of the reversible jump sampler is far greater than that of the Gibbs sampler. For example, to obtain the results for Model 1, it took almost 9 min to run the MCMC and a further 4 min to extract and calculate the classification probabilities for k = 2. For the standard Gibbs sampler, less than 2 min were needed overall. Hence, given the evidence observed in our experiments, we see no reason to use the transdimensional MCMC rather than the standard Gibbs sampler.

MULTIVARIATE NORMAL MIXTURES
Here, we extend the loss function (11) to allow for multivariate data. In this case, y = {{y_{i,r}}_{r=1}^{p}}_{i=1}^{n}, and hence each group has p means and standard deviations: m_j = {m_{j,r}}_{r=1}^{p} and s_j = {s_{j,r}}_{r=1}^{p}. To illustrate the performance of our methodology in a multivariate setting, the Gibbs sampler for multivariate normals described by Stephens (1997), with default priors, was implemented. We simulated n = 200 observations from the distribution Σ_{j=1}^{4} w_j N_2(μ_j, Σ_j), with the true values shown in Table 6 (note that Σ = (Σ_{11}, Σ_{12} = Σ_{21}, Σ_{22})). This mixture model has been considered by Papastamoulis and Iliopoulos (2010). In our experiments, we took k = 4. For the analysis, we kept the last 30,000 iterations from a chain of 100,000 iterations.
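The simulation step can be sketched as follows: pick a component by its weight, then draw from the corresponding bivariate normal via the Cholesky factor of its 2x2 covariance. The parameter values below are illustrative placeholders, not the true values of Table 6.

```python
import random, math

def draw_bivariate_normal(mu, cov, rng):
    # Sample N_2(mu, cov) using the Cholesky factor of the 2x2 covariance
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

def simulate_mixture(n, w, mus, covs, seed=0):
    # Draw n points from sum_j w_j N_2(mu_j, Sigma_j): choose a component
    # by its weight, then sample from that bivariate normal
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        j = rng.choices(range(len(w)), weights=w)[0]
        data.append(draw_bivariate_normal(mus[j], covs[j], rng))
        labels.append(j)
    return data, labels

# Illustrative 4-component setting (NOT the true values of Table 6)
w = [0.25, 0.25, 0.25, 0.25]
mus = [(-3, -3), (-3, 3), (3, -3), (3, 3)]
covs = [[[1.0, 0.2], [0.2, 1.0]]] * 4
data, labels = simulate_mixture(200, w, mus, covs)
```

The labels returned here are the true allocations, which is what makes the comparison against the true single best clustering in Figure 4(a) possible.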
In Figure 4(a), we display the level curves calculated via the true model, along with the true single best clustering. In Figure 4(b)-4(d), we display the level curves of the plug-in density estimates and the single best clustering generated via the ECR, Data, and KL algorithms, respectively. In this case, the results obtained with the three algorithms are identical; we repeated the experiment several times and found no differences between the estimates.
The time required to find estimates for the Data algorithm was 1.352 sec, and a further 1.525 sec were needed to undo the label switching. With the ECR and KL strategies, it took 1.457 and 30.319 sec, respectively, to undo the label switching.

DISCUSSION
We have described the key ideas behind a deterministic relabeling algorithm and, from them, derived an easy and efficient solution to the label switching problem. Our proposed solution, the Data relabeling, relies on the relationship between the allocation variables (2) and the observations. Label switching can be considered under a general missing data model framework that includes as special cases finite mixtures, hidden Markov models, and Markov random fields; see Papastamoulis and Iliopoulos (2011). In this article, we restrict our attention to finite mixtures. A further application is the mixture of Dirichlet process (MDP) model; see Lo (1984). In this setting, to deal with an infinite measure, there are two alternative ideas: marginal and conditional methods (see Griffin and Holmes 2010, sec. 6.3). If the Dirichlet process is integrated out, as in the so-called marginal algorithms, the labels of the clusters become unidentifiable; see Papaspiliopoulos and Roberts (2008). For the conditional algorithms, the weights of the Dirichlet process are only weakly identifiable, so label switching is present.

SUPPLEMENTARY MATERIALS
The supplementary materials include an Appendix with the computational issues and storage requirements to implement each algorithm. The actual code used for this article is also included: Data, ECR, and KL relabeling strategies and the standard Gibbs sampler for a fixed k (LabelSwitching.zip).