An Objective Approach to Prior Mass Functions for Discrete Parameter Spaces

We present a novel approach to constructing objective prior distributions for discrete parameter spaces. These parameter spaces are particularly problematic, as the common objective procedures for designing prior distributions appear to be problem specific. We propose an objective criterion, based on loss functions, instead of trying to define objective probabilities directly. We systematically apply this criterion to a series of discrete scenarios previously considered in the literature, and compare the resulting priors. The proposed approach applies to any discrete parameter space, making it appealing in that it does not involve different concepts according to the model. Supplementary materials for this article are available online.


INTRODUCTION AND BACKGROUND
The purpose of this article is to outline a novel approach to constructing objective prior distributions for discrete parameter spaces. The difficulty in dealing with this type of parameter space is well known and, while problem-specific solutions have been successfully developed, a generalized approach that could deal with any type of problem remains elusive. In recent work, Berger, Bernardo, and Sun (2012) presented a solution to the problem by embedding the discrete structure in a continuous one, through the application of one or more of four possible approaches, and applying reference analysis to the continuous structure to obtain a prior distribution. In the cited article, it is also possible to find references to past work on the development of a general procedure to obtain objective priors for discrete parameter spaces. Other work on the subject can be found, for example, in Rissanen (1983), Lindsay and Roeder (1987), and Barger and Bunge (2008).
Our aim is to propose a general framework that can be applied to any discrete parameter space, and which is objective in the sense that, once a model has been chosen, it does not require subjective input. Alongside this, we intend to define prior distributions for all the models covered in Berger, Bernardo, and Sun (2012), from now on referred to as BBS. It is worth mentioning that we have kept the notation as similar as possible to that in BBS.

Alternative Approaches
We would first like to briefly summarize the approach outlined by BBS. It is in fact our intention to derive prior distributions for the discrete models discussed in their article, and any comparison we make largely benefits from this short review.
Their basic idea is to embed the original model, defined by a discrete parameter (or vector of discrete parameters), into a continuous model, such that the structure is preserved. Then, prior distributions are obtained by applying the regular reference prior algorithm to the continuous structure.

The Idea
Let us consider a probability distribution f(x | θ), which depends only on the value of a discrete parameter θ ∈ Θ. In a subjective Bayesian approach, the prior π(θ) represents the initial degree of belief that we have about the possible values that θ can take within the parameter space Θ. Then, by combining it with the information contained in the observed data, represented by the likelihood function, the initial beliefs are updated to the posterior probability distribution π(θ | x). The prior and the posterior retain the same meaning.
While it is problematic to assign an objective probability when attempting to be least informative, we argue that the fact that a parameter value θ has been selected to be part of the model conveys information that can be exploited. For such an inclusion, the value θ must have a worth associated with it. We can objectively evaluate this worth by determining what is objectively lost if the value θ is removed from the model. Let us denote by π(θ) the prior distribution for the discrete parameter θ ∈ Θ. If a prior π has been assigned, then we can link it to the worth of each element by means of the self-information loss function −log π(θ); or, equivalently, the utility of θ is given by log π(θ). Details of this particular loss/utility function can be found, for example, in Merhav and Feder (1998). We can then identify an appropriate objective way to associate a utility with each θ, representing its worth in the model, and the prior distribution π(θ) then follows. Furthermore, we note that in this way the Bayesian approach is conceptually consistent, as we update an initial (i.e., prior) utility assigned to θ, through the application of Bayes' theorem, to obtain the resulting utility expressed by log π(θ | x). Indeed, there is an elegant procedure akin to Bayes' theorem which works from a utility point of view, namely

log π(θ | x) = K + log f(x | θ) + log π(θ),

which has the interpretation

Utility(θ | x, π) = K + Utility(θ | x) + Utility(θ | π).

This, with a "−" sign placed in front of each utility, is a cumulative loss function for assessing the loss of θ in the presence of the two pieces of information x and π. Here K is a constant that does not depend on θ.
The utility to be assigned to each model θ is a function of the Kullback-Leibler divergence (Kullback and Leibler 1951) measured from the model to the nearest one, where the nearest model is the one defined by θ′ ∈ Θ, θ′ ≠ θ, such that D_KL(f(· | θ) ‖ f(· | θ′)) is minimized. This is justified by the fact that, if the model is misspecified (that is, θ is removed although it is correct), the posterior distribution accumulates asymptotically at the value θ′ ∈ Θ nearest to θ with respect to the Kullback-Leibler divergence (Berk 1966). Thus, this divergence represents the loss in information incurred by removing θ, if it is the true value. This will then be the quantification of the utility of that value of θ. The objectivity of this is clear, as it depends solely on the available set of options (i.e., the choice of the family of densities). As the divergence to the nearest alternative θ′ increases, θ becomes more valuable to the model.
The formal derivation of the prior distribution for θ on the basis of our idea can be expressed as follows. Let us write u_1(θ) = log π(θ) and let the minimum divergence from θ be represented by u_2(θ). We want u_1(θ) and u_2(θ) to be matching utility functions; though as it stands −∞ < u_1 ≤ 0 and 0 ≤ u_2 < ∞, and we want u_1 = −∞ when u_2 = 0. The scales are matched by taking exponential transformations, so that exp(u_1) and exp(u_2) − 1 are on the same scale. Hence, we set

u_1(θ) = g(u_2(θ)),   (1)

where

g(u) = log(e^u − 1).   (2)
By setting the functional form of g in (1) as it is defined in (2), we derive the objective prior distribution for the discrete parameter θ:

π(θ) ∝ exp{ min_{θ′≠θ} D_KL(f(· | θ) ‖ f(· | θ′)) } − 1.   (3)

In other words, for each value θ ∈ Θ, we identify the value θ′ ∈ Θ for which the divergence measured between the two models f(x | θ) and f(x | θ′) is minimum; the prior mass at θ is then obtained by applying the transformation in (3) to this minimum Kullback-Leibler divergence. The prior defined in (3) will be applied to all the models considered in this article.
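To make (3) concrete, the following sketch computes the objective prior for a finite family of discrete models, each supplied as a probability vector over a common finite support. This is illustrative code of our own, not from the article; the function name and the toy two-model family are assumptions for the example.

```python
import numpy as np

def objective_prior(pmfs):
    """Objective prior (3) for a finite family of discrete models.

    pmfs: sequence of K probability vectors over a common finite support;
    row k is f(. | theta_k). Returns the normalized prior over the K values.
    """
    pmfs = np.asarray(pmfs, dtype=float)
    K = len(pmfs)
    w = np.empty(K)
    for k in range(K):
        divs = []
        for j in range(K):
            if j == k:
                continue
            p, q = pmfs[k], pmfs[j]
            mask = p > 0
            if np.any(q[mask] == 0):   # supp(p) not inside supp(q): infinite divergence
                divs.append(np.inf)
            else:
                divs.append(float(np.sum(p[mask] * np.log(p[mask] / q[mask]))))
        w[k] = np.expm1(min(divs))     # exp{min KL} - 1, as in (3)
    return w / w.sum()

# Toy example: two models on a two-point support
pi = objective_prior([[0.5, 0.5], [0.25, 0.75]])
```

In this toy family the first model has the larger divergence to its nearest alternative, so it is "harder to replace" and receives the larger share of prior mass.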
The outline of the article follows the sections in BBS. So in Sections 2 to 6, we define the objective prior in accordance with the approach we propose and compare the results with the ones obtained by BBS. Each of these sections has the same structure: first, we examine the model and the prior distribution defined by BBS; we then derive the prior by applying (3) and compare the two. Section 7 contains final discussion points.

A POPULATION SIZE MODEL
The first case considered is the estimation of the size of a population by means of Type II censoring. In this experiment, there is a sample of N units whose lifetimes are modeled by an exponential distribution with unknown parameter λ. The experiment stops when a predetermined number of failures R is reached. The failure times t_1 ≤ ⋯ ≤ t_R, which are assumed to be independent, have the joint distribution

f(t_1, …, t_R | N, λ) = [N!/(N − R)!] λ^R exp{ −λ[t_1 + ⋯ + t_R + (N − R)t_R] }.

It can be shown (Goudie and Goldie 1981) that inference on N is equivalently performed by considering the density (4), p(V | N), where V = (t_1 + ⋯ + t_R)/t_R. BBS applied Approach 1; that is, N is embedded in a continuous parameter space and the reference prior π(N) ∝ i_R(N)^{1/2} is computed, where i_R(N) is the Fisher information. The prior for θ = N − R + 1 was computed for R = 2, 3, 4. Numerically, BBS verified that the prior is proper up to R = 100; properness is necessary, as the likelihood function converges to a constant when N tends to infinity, so the posterior would otherwise be improper. Furthermore, the prior distributions for different values of R are all very similar, except at θ = 1. Therefore, the recommendation is to use the prior for R = 2 for any value of R, as it is considered a good approximation.

The Objective Prior
To define the prior distribution for N, by applying the objective approach we propose, we first consider the Kullback-Leibler divergence between two densities of type (4) with different values of N,

D_KL(p(· | N) ‖ p(· | N + c)) = E[ log{ p(V | N) / p(V | N + c) } ],

where the expectation is with respect to p(V | N), and c is a nonzero integer. This divergence is increasing in |c|, which means that the nearest model to p(V | N), in terms of Kullback-Leibler divergence, is either p(V | N − 1) or p(V | N + 1). We have ascertained by computation that the Kullback-Leibler divergence from p(V | N) is minimized by c = 1, rather than c = −1. Therefore, by applying (3), we see that the objective prior for N is given by

π(N | R) ∝ exp{ D_KL(p(· | N) ‖ p(· | N + 1)) } − 1.   (5)

The prior distribution (5) is proper, as the following theorem, proved in the online supplementary material, shows.
Theorem 1. The prior distribution π(N | R) defined in (5) is proper.

Illustrations and Comments
The prior distribution for N assigns large mass to small values of the parameter and rapidly decreases as N increases. This is graphically verifiable in Figure 1 (right graph), where the priors for R = 2, R = 3, and R = 4 have been plotted. This behavior is similar to the one obtained by BBS (left graph). It has to be noted that we have analytically proved that our prior is proper. Given that the likelihood function is constant as N → ∞, as discussed in BBS, this is a necessary condition for the posterior to be proper. We have performed an analysis of the posterior performance (not shown here) by considering the approximate frequentist coverage of the 95% credible intervals. In addition, we have considered the relative mean squared error, √MSE(θ)/θ, for the posterior means. For 20,000 simulations of sample size m = 1, the coverages of the BBS prior and our prior are remarkably similar, although conservative, as they tend to be higher than the nominal level. In terms of MSE, the BBS prior performs better for low values of θ (except for θ = 1); however, when θ increases, the two priors tend to have similar performance. Our prior has a more stable performance, in terms of MSE, over the whole parameter space considered. We have performed the simulations with m = 1 so that the differences in the frequentist performances of the priors are more prominent. However, for more realistic sample sizes (e.g., m = 30), the performances become remarkably similar.

HYPERGEOMETRIC MODEL
Let us consider now a hypergeometric distribution with probability mass function

p(r | N, R, n) = C(R, r) C(N − R, n − r) / C(N, n),   (6)

where C(a, b) denotes the binomial coefficient "a choose b", with the population size N and the sample size n known, and R = 0, 1, …, N representing the number of units in the population that satisfy a certain criterion (or that possess a certain characteristic). The parameter R is unknown, and the aim is to define the prior π(R).
BBS obtained a prior on R by applying Approach 2. It is assumed that the parameter has a binomial hierarchical model, Bin(R | N, p), where the continuous parameter p is unknown. Therefore, the problem reduces to finding the reference prior for p. As shown in Bernardo and Smith (1994), this is accomplished by marginalizing out the parameter in the lower level. Given that the reference prior for the parameter p of a binomial distribution is the Jeffreys' rule prior Be(p | 1/2, 1/2), we have

π(R | N) = C(N, R) B(R + 1/2, N − R + 1/2) / B(1/2, 1/2),

which, being defined on a finite parameter space, is proper.

The Objective Prior
Let us consider two hypergeometric models that differ only in the value of the parameter R, say p_R = p(r | N, R, n) and p_{R′} = p(r | N, R′, n). The Kullback-Leibler divergence between the two models is given by

D_KL(p_R ‖ p_{R′}) = E[ log{ p_R(r) / p_{R′}(r) } ],   (7)

Figure 1. Prior distribution for the population size model with the BBS approach (left graph) and our approach (right graph). The priors have been computed for R = 2, 3, 4, and put on θ = N − R + 1.
where E is the expected value with respect to p_R, and the sum defining it runs over the set of possible values of r, that is, max(0, n − (N − R)) ≤ r ≤ min(n, R). We note that the random process modeled through a hypergeometric distribution is symmetric around R = N/2. In fact, by swapping the roles of the units that satisfy the criterion and those that do not, we have

p(r | N, R, n) = p(n − r | N, N − R, n).

To prove the above result, it is sufficient to rearrange the terms of Equation (6). In other words, the model with parameter R is equal to the model with parameter N − R, for the same values of N and n.
For a hypergeometric set of models with common known parameters N and n, the minimum divergence from each element of the set is determined by the following Lemma 1, which has a proof in the online supplementary material.
Lemma 1. Consider the hypergeometric distribution p_{R_0}, with parameters R_0, N, and n, where N and n are assumed to be known. If we indicate by p_R the hypergeometric distribution that differs from p_{R_0} only in the number R of units in the population that satisfy the criterion, then the Kullback-Leibler divergence from p_{R_0} to p_R is minimized at R = R_0 + 1 when R_0 ≤ N/2 and, by symmetry, at R = R_0 − 1 when R_0 > N/2.

In deriving the objective prior for the parameter R of the hypergeometric distribution, we first make the following considerations. We assume that the parameters N (population size) and n (sample size) are known. Given the result of Lemma 1, for R ≤ N/2 the minimum Kullback-Leibler divergence is obtained from (7) by setting R′ = R + 1, that is, D_KL(p_R ‖ p_{R+1}). Therefore, for R ≤ N/2, the prior is obtained by applying (3), and is given by

π(R | N, n) ∝ exp{ D_KL(p_R ‖ p_{R+1}) } − 1.   (8)

By symmetry, the prior mass for R, when R > N/2, is given by

π(R | N, n) = π(N − R | N, n).   (9)
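As an illustration (ours, not the article's), the prior can be computed by brute force with scipy.stats.hypergeom, minimizing the divergence over all alternative values R′ rather than invoking Lemma 1; the two are equivalent, and the brute-force search sidesteps support edge cases at the boundaries of the parameter space.

```python
import numpy as np
from scipy.stats import hypergeom

def kl_hyper(N, n, R, Rp):
    """D_KL(p_R || p_R') for hypergeometric models with known N and n."""
    r = np.arange(max(0, n - (N - R)), min(n, R) + 1)   # support of p_R
    p = hypergeom.pmf(r, N, R, n)
    q = hypergeom.pmf(r, N, Rp, n)
    if np.any(q == 0):        # support of p_R not contained in that of p_R'
        return np.inf
    return float(np.sum(p * np.log(p / q)))

def objective_prior_R(N, n):
    """Normalized objective prior pi(R | N, n), via (3)."""
    w = np.empty(N + 1)
    for R in range(N + 1):
        divs = [kl_hyper(N, n, R, Rp) for Rp in range(N + 1) if Rp != R]
        w[R] = np.expm1(min(divs))
    return w / w.sum()

pi = objective_prior_R(25, 3)   # N = 25, n = 3, as in Figure 2
```

The resulting mass is highest at R = 0 and R = N and decreases symmetrically toward N/2, in agreement with the shape shown in Figure 2.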

Illustrations and Comments
To illustrate the behavior of the prior distribution as defined in (8) and (9), we can inspect Figure 2. The continuous curve shows the prior π (R | N ) for N = 25. The distribution has higher value at the extremes of the parameter space, symmetrically decreasing when approaching the center of the space (i.e., N/2). In the same graph, we see the prior obtained by BBS (dashed curve). Our prior is flatter than the one BBS proposed, but it has a very similar shape. In particular, it is symmetric about N/2.
Both we and BBS have computed π(R | N) for different values of N. In all circumstances, the distributions behave similarly. Unlike the objective prior of BBS, our prior depends on the value of n. As in Section 2, we have analyzed the frequentist performance of the posterior over 20,000 simulations (not shown here). In particular, for values of N = 10, N = 20, and N = 25, we have computed the (approximate) frequentist coverage of the 95% credible intervals for n = 1 and n = 3. Both the BBS prior and our prior have similar (although conservative) coverage, for n = 1 and n = 3. We have computed the MSE from the mean in all the above cases; the BBS prior appears to perform better toward the center of the parameter space, while the roles are swapped in the extreme regions of the space. However, these differences are not substantial, and diminish considerably for either n > 1 or N > 10.

MULTIVARIATE HYPERGEOMETRIC MODEL
Consider the multivariate hypergeometric distribution MH_d(N, R, n) of dimension d, with probability mass function

p(r | N, R, n) = [ ∏_{j=1}^d C(R_j, r_j) ] / C(N, n),   (10)

where r = (r_1, …, r_d) belongs to N^d, the d-dimensional space of nonnegative integers, and with n ∈ {0, 1, …, N}, Σ_{j=1}^d R_j = N, Σ_{j=1}^d r_j = n, and r_j ≤ min(n, R_j) for j = 1, …, d. For d = 2, we obtain the univariate hypergeometric distribution discussed in Section 3. We assume that the parameters N and n are known, and R = (R_1, …, R_d) represents the vector of unknown parameters.
The prior distribution for the vector R defined by BBS is based on Approach 2, similarly to Section 3 for the univariate hypergeometric model. In particular, BBS considered the vector of hyperparameters p_d, where each element is defined as p_i = R_i/N, for i = 1, …, d. This vector is the hyperparameter vector of the multinomial distribution Mu_d(R_d | N, p_d), on which the prior mass is placed. This prior has the following form:

π(R | N) ∝ ∏_{i=1}^d p_i^{−1/2} = ∏_{i=1}^d (R_i/N)^{−1/2},   (11)

which corresponds to the Jeffreys' prior (Jeffreys 1961).

The Objective Prior
The Kullback-Leibler divergence between the multivariate hypergeometric distribution with parameters N, R, and n, denoted by p_{N,R,n}, and the multivariate hypergeometric distribution with parameters N, R + a, and n, denoted by p_{N,R+a,n}, where a ∈ Z^d, is given by

D_KL(p_{N,R,n} ‖ p_{N,R+a,n}) = E[ log{ p_{N,R,n}(r) / p_{N,R+a,n}(r) } ],

where the expectation is with respect to p_{N,R,n}. The following lemma, proved in the online supplementary material, determines the minimum Kullback-Leibler divergence from the model p_{N,R,n}.
Lemma 2. Consider the d-dimensional multivariate hypergeometric distribution p_{N,R,n}, where the parameters N and n are assumed to be known, with probability mass function as specified in (10). If we consider the multivariate hypergeometric distribution p_{N,R′,n}, which differs from p_{N,R,n} in the composition of the unknown d-dimensional parameter vector R, then the Kullback-Leibler divergence between p_{N,R,n} and p_{N,R′,n} is minimized when R′ = R + c, where c is a vector of dimension d with d − 1 zeroes and, in the position of the element of R closest to N/2, a plus or minus one according to whether that element lies below or above N/2, respectively.
To summarize the results obtained in the proof of Lemma 2, and to generalize to any multivariate hypergeometric distribution, the smallest divergence between p_{N,R,n} and p_{N,R+c,n} is obtained when only one of the components of R is changed, and the change is an increase or decrease of one unit. Therefore, c has d − 1 elements equal to zero, and the remaining one equal to plus or minus one. From the analysis of the Kullback-Leibler divergence between two bivariate hypergeometric models, the nearest model to p_{N,R,n} corresponds to the model p_{N,R+c,n}, where c has d − 1 null elements and the remaining one, in position i (where i is the index of the element R_i of R nearest to N/2, either from below or above), has value one if R_i ≤ N/2, and minus one if R_i > N/2.

Figure 3. Graphical representation of our normalized objective prior for a bivariate hypergeometric model with R = (R_1, R_2), N = 10, and n = 3, as per Table 1.
The objective prior for R follows from the result of the previous paragraph, and it has the form

π(R) ∝ exp{ D_KL(p_{N,R,n} ‖ p_{N,R+a,n}) } − 1,   (12)

where the vector a is determined so that the divergence is minimized, as discussed in Lemma 2.
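A brute-force sketch for the multivariate case, again ours rather than the article's: it enumerates all d-part compositions of N, computes the divergence of each model from every alternative with scipy.stats.multivariate_hypergeom, and applies (3). This is feasible only for small d and N, but it avoids having to implement the nearest-neighbor rule of Lemma 2 explicitly.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_hypergeom

def kl_mv(R, Rp, n):
    """D_KL(p_{N,R,n} || p_{N,R',n}); N is implied by sum(R) = sum(R')."""
    div = 0.0
    for r in itertools.product(range(n + 1), repeat=len(R)):
        if sum(r) != n:
            continue
        p = multivariate_hypergeom.pmf(r, R, n)
        if p == 0:
            continue
        q = multivariate_hypergeom.pmf(r, Rp, n)
        if q == 0:            # r possible under p_R but not under p_R'
            return np.inf
        div += p * np.log(p / q)
    return float(div)

def objective_prior_mv(N, n, d):
    """Normalized prior over all d-part compositions R of N, via (3)."""
    comps = [R for R in itertools.product(range(N + 1), repeat=d) if sum(R) == N]
    w = np.array([np.expm1(min(kl_mv(R, Rp, n) for Rp in comps if Rp != R))
                  for R in comps])
    return dict(zip(comps, w / w.sum()))

pi = objective_prior_mv(10, 3, 2)   # the bivariate case of Table 1
```

For d = 2 this reproduces the symmetric, edge-heavy table of the bivariate illustration, since the bivariate case coincides with the univariate hypergeometric model of Section 3.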

Illustrations and Comments
We have computed the prior distribution in the case of a bivariate hypergeometric distribution. In particular, we have considered a bivariate distribution with population size N = 10 and sample size n = 3. Table 1 reports the normalized prior distribution π(R), where R = (R_1, R_2). To get a better feel for its shape, we have also plotted the distribution in Figure 3.
This prior can be compared with the one obtained by BBS, as in (11). We have computed it for the bivariate hypergeometric distribution considered in the illustration above, that is, for N = 10. This objective prior is represented in Table 2. Note that, due to the continuous embedding underlying (11), some areas of the discrete parameter space are now assigned a prior mass that tends to infinity.
If we compare our prior distribution with the one defined by BBS, we note some similarities. In particular, both show a symmetrical behavior, which can be noticed by row and by column; that is, by fixing one of the two parameters, the distribution of the remaining one is symmetric around its central value.

BINOMIAL-BETA MODEL
Let us now consider a binomial model with parameters n and p, with n = 1, 2, … and p ∈ (0, 1). We also assume that the parameter p can be represented by a beta distribution with parameters a and b, both strictly positive. The binomial-beta distribution is given by

p(x | n, a, b) = C(n, x) B(x + a, n − x + b) / B(a, b),  x = 0, 1, …, n.   (13)

BBS derived the prior for n in the binomial-beta model by applying Approach 3 and Approach 4. As their choice, based on the analysis of the frequentist properties of the posterior, is the one coming from Approach 3, we focus on that only. The approach applies reference prior theory to a consistent estimator. In this case, the estimator for n is given by n̂ = (ak)^{−1}(a + b) Σ_{j=1}^k x_j, where k is the number of independent samples from (13). Thus, the proposed prior is obtained by applying Jeffreys' rule to the extended model p(n̂ | n) ≈ N(n̂ | n, {n(n + a + b)}/{a²(a + b + 1)k}), when n is considered continuous:

π(n) ∝ {n(n + a + b)}^{−1/2}.   (14)

It is worth mentioning that BBS seemed to focus on values of a and b larger than one. In fact, to identify the best prior among the one derived by applying Approach 3 and the one derived by applying Approach 4, they studied the frequentist coverage of credible intervals, and this comparison was done for values a = b = 5, a = b = 20, and a = b = 50. As such, we consider the same restriction, that is, a, b > 1.

The Objective Prior
To find the model that is nearest, in terms of Kullback-Leibler divergence, to p(x | n), we first note that, as p(x = n | n′) = 0 for n′ < n, the divergence D_KL(p_n ‖ p_{n′}) is infinite for n′ < n. Therefore, the nearest model will have the parameter representing the number of trials larger than n. The Kullback-Leibler divergence from p_n to p_{n′} is minimized when n′ = n + 1. This has been verified computationally; it can also be proved along the lines of the proof of Lemma 3 in Section 6. Thus,

D_KL(p_n ‖ p_{n+1}) = E[ log{ p(x | n, a, b) / p(x | n + 1, a, b) } ],   (15)

where the expectation is taken with respect to p_n. The objective prior for n is then

π(n) ∝ exp{ D_KL(p_n ‖ p_{n+1}) } − 1.   (16)

The prior in (16) is improper. The following Theorem 2 shows that, with only one observation, the posterior is proper. The proof of the theorem is in the online supplementary material.

Table 1. Our normalized objective prior distribution for the bivariate hypergeometric distribution with N = 10, n = 3, and parameter vector R = (R_1, R_2). Note the symmetry of the prior with respect to the center of the two-dimensional parameter space.
Theorem 2. Let us assume that we observe the data point x_1 from a binomial-beta distribution with parameters a > 1, b, and n. Also, assume the prior distribution for the parameter n is π(n) ∝ exp{D_KL(p_n ‖ p_{n+1})} − 1. Then, the posterior distribution

π(n | x_1) ∝ p(x_1 | n, a, b) π(n)

is proper.
The following theorem, proved in the online supplementary material, shows that the posterior distribution for n is consistent.
Theorem 3. Consider the family of binomial-beta distributions p_n, with n = 1, 2, … and common parameters a and b. We also assume that the true value of n is n_0. Given the prior distribution π(n) and a set of observations x = (x_1, …, x_k) from p_{n_0}, the mass of the posterior corresponding to n_0 converges to one almost surely; that is, π(n_0 | x_1, …, x_k) → 1, as k → ∞.
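The prior (16) can be evaluated directly with scipy.stats.betabinom. The sketch below is our illustration, not the article's code; it computes the unnormalized prior masses for a = b = 5, exploiting the fact that the nearest model to p_n is p_{n+1}.

```python
import numpy as np
from scipy.stats import betabinom

def kl_bb(n, a, b):
    """D_KL(p_n || p_{n+1}) for binomial-beta models with common a and b."""
    x = np.arange(n + 1)        # support of p_n, contained in that of p_{n+1}
    p = betabinom.pmf(x, n, a, b)
    q = betabinom.pmf(x, n + 1, a, b)
    return float(np.sum(p * np.log(p / q)))

# Unnormalized prior masses pi(n), n = 1..50, for a = b = 5; the prior (16)
# is improper, so no normalization is attempted.
w = np.expm1([kl_bb(n, 5, 5) for n in range(1, 51)])
```

Consistent with the discussion below, the masses are largest for small n and decay as n grows, since neighboring binomial-beta models become harder to distinguish.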

Illustrations and Comments
To get a feel for the behavior of the prior in (16), we have computed it for different values of a and b. In particular, as shown in Figure 4 (right graph), we have computed the prior for the parameter n given a = b = 5, a = b = 20, and a = b = 50. The three prior distributions are very similar. As expected, a large part of the mass is put on relatively small values of n.
To compare our prior with the BBS one, we have computed the prior in (14) for n = 1, …, 70 and a = b = 5, a = b = 20, and a = b = 50. The results are in Figure 4 (left graph). Unlike in our approach, it appears that the values of a and b play a nonnegligible role, as the three distributions are different, in particular for small values of n.

Table 2. Objective prior of BBS, computed from (11), for a bivariate hypergeometric model with N = 10, n = 3, and parameter vector R = (R_1, R_2). We have indicated by "-" the points of the parameter space where the prior is infinite; the embedding in a continuous structure does not allow a finite mass to be determined for the points at the edge of the parameter space.
The comparison of our prior with the one of BBS has been completed through a simulation study of the frequentist performance of the posterior. In particular, for each of the cases a = b = 5, a = b = 20, and a = b = 50, we have performed 20,000 simulations of sample size m = 1. As mentioned in Section 2.2, the choice of m = 1 relies on the fact that possible differences are more prominent for small sample sizes. We have computed the frequentist coverage of the 95% credible intervals and the frequentist relative MSE from the posterior means. Figure 5 shows the results of the simulation for a = b = 20: on the left, the frequentist coverage, and on the right, the relative MSE. Both priors lead to very similar coverage performance, with more conservative values for n < 5. In this area of the parameter space, our prior has a better MSE, with the difference from the BBS prior decreasing rapidly as n increases. For a = b = 5, the results are consistent with the above; however, the differences in the MSE are more marked. For a = b = 50, both priors have substantially the same performance.
It appears that our prior being less sensitive than the BBS prior to the values of a and b represents an advantage, as the simulation results show. This is particularly relevant for small values of the parameters.

BINOMIAL MODEL
Let us assume that the random variable x is binomially distributed with number of trials n and probability of success p. Its probability mass function is

f(x | n, p) = C(n, x) p^x (1 − p)^{n−x},  x = 0, 1, …, n.

For simplicity of notation, the model representing a binomial random variable x, with parameters n and p, will be written as f_n instead of f(x | n, p).
The prior for the parameter n (with known p) defined by BBS is given in (17); it assigns decreasing mass as n grows. Following the same steps as in Sections 2 to 5, our objective prior is obtained by applying (3): the nearest model to f_n in Kullback-Leibler divergence is f_{n+1} (Lemma 3, proved in the online supplementary material), so that

π(n) ∝ exp{ D_KL(f_n ‖ f_{n+1}) } − 1.

Figure 5. Frequentist coverage of 95% credible intervals for n (left graph) and square root of the relative mean squared error for the mean (right graph). The simulation has been run for a = b = 20, for the BBS prior (black dashed line) and our prior (red continuous line).

Illustrations and Comments
To get an understanding of the objective prior for n, we have computed and plotted it for a given value of p = 0.5, and for n = 1, …, 100. The result is shown in Figure 6. We see that the highest mass is put on n = 1, as the largest divergence is D_KL(f_1 ‖ f_2), and that the mass decreases as n increases. This is expected, as the difference between a binomial with n trials and a binomial with n + 1 trials diminishes as n tends to infinity; as such, the Kullback-Leibler divergence measured between the two models tends to zero, and so does the prior mass.
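For completeness, a sketch of our own computing these prior masses with scipy.stats.binom:

```python
import numpy as np
from scipy.stats import binom

def kl_binomial(n, p):
    """D_KL(f_n || f_{n+1}) for binomial models with common success probability p."""
    x = np.arange(n + 1)        # support of f_n, contained in that of f_{n+1}
    f = binom.pmf(x, n, p)
    g = binom.pmf(x, n + 1, p)
    return float(np.sum(f * np.log(f / g)))

# Unnormalized prior masses for p = 0.5 and n = 1..100, as in Figure 6
w = np.expm1([kl_binomial(n, 0.5) for n in range(1, 101)])
```

Plotting w against n reproduces the monotone decay described above, with the peak at n = 1.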
The comparison between the prior we propose and the prior defined by BBS is straightforward. By simply inspecting expression (17), we notice that the BBS prior distribution assigns large mass to small values of n, and then decreases as n becomes larger (Figure 6). The behavior is thus similar to that of our prior; and although our prior depends on the value of p, we have seen by computation that the resulting changes in the prior are negligible.
We performed a simulation study of the frequentist performance of our prior and the one proposed by BBS. We ran 20,000 simulations of sample size m = 1 for values of p equal to 0.05, 0.25, 0.50, 0.75, and 0.95. Similar to Section 5, we have considered the frequentist coverage of the 95% credible intervals and the frequentist relative MSE from the posterior means. Figure 7 shows the results for p = 0.50. We note very similar performances for both priors, with a slightly better MSE result for our prior. For the other values of p (not shown here), the outcome is similar, with a more marked better performance of our prior for lower values of p than for higher ones.
We have not examined in detail the case of unknown p, as was done in BBS. However, the task does not bear any complication, as we can define the joint prior on n and p as π(n, p) ∝ π(n | p)π(p), where π(p) = Be(p | 1/2, 1/2) (the Jeffreys' prior) would be a natural choice. That is,

π(n, p) ∝ [ exp{ D_KL(f_{n,p} ‖ f_{n+1,p}) } − 1 ] p^{−1/2}(1 − p)^{−1/2},

which, even though improper, will yield a proper posterior when the sample size is larger than or equal to 1.

Figure 7. Frequentist coverage of the 95% credible intervals for n (left graph) and square root of the relative mean squared error for the mean (right graph). The simulation has been run for p = 0.5, for the BBS prior (black dashed line) and our prior (red continuous line).

DISCUSSION
The strength of deriving prior distributions objectively, not by defining probabilities directly but by assigning a worth to each element of the parameter space, is that the approach can be applied to any discrete parameter space. It has been shown that such an objective prior can be defined for all the models presented in this article, and that these prior distributions are suitable for inference.
We believe the reason why this approach works can be found in the solid foundations it rests upon. Bayes' rule, when applied in an objective setup, carries an important inconsistency: π(θ) (the prior) and π(θ | x) (the posterior) do not carry the same interpretation. By recasting the update rule in the context of losses or utilities, the inconsistency disappears.
This approach can be extended to any situation where a prior has to be defined for a discrete parameter. A further noteworthy application can be found in Villa and Walker (2014), where a prior distribution for the number of degrees of freedom of the t distribution is derived.
The frequentist analysis of the models has shown that, in general, our prior performs at least as well as the BBS prior. The comparisons have been made by using very small sample sizes (i.e., m = 1) where the possible differences in the frequentist properties of the priors are more prominent. For realistic values of sample size both priors appear to have very similar frequentist properties.
For the practitioner, we would argue that a solid explanation of the prior, subjective or objective, is mandatory. This obligation becomes more acute when the prior is objective. Our prior has such an explanation, in terms of measuring the worth of a parameter value being included in the model. As such, a user can find a comprehensible interpretation of the prior and fully understand the derivation of π(θ) for each θ. Moreover, the approach used to derive our prior is applicable to any discrete parameter model.
Finally, by considering the limiting process in Blyth (1994), the idea we propose can be extended to continuous parameters. That is, the minimum Kullback-Leibler divergence is replaced by its local approximation

D_KL(f(· | θ) ‖ f(· | θ + ε)) ≈ (1/2) Σ_{i,j} ε_i ε_j I_{i,j}(θ),

where I_{i,j}(θ) is the (i, j)th element of the Fisher information matrix. In Brown and Walker (2012), it is shown that the approach, in the continuous case, may yield the Jeffreys' prior.

SUPPLEMENTARY MATERIAL
Proof of Lemmas and Theorems: This file contains the proofs of the lemmas and theorems of the main article. [Received March 2013. Revised May 2014.]