Finite-dimensional Discrete Random Structures and Bayesian Clustering

Abstract Discrete random probability measures stand out as effective tools for Bayesian clustering. The investigation in the area has been very lively, with a strong emphasis on nonparametric procedures based on either the Dirichlet process or on more flexible generalizations, such as the normalized random measures with independent increments (NRMI). The literature on finite-dimensional discrete priors is much more limited and mostly confined to the standard Dirichlet-multinomial model. While such a specification may be attractive due to conjugacy, it suffers from considerable limitations when it comes to addressing clustering problems. In order to overcome these, we introduce a novel class of priors that arise as the hierarchical compositions of finite-dimensional random discrete structures. Despite the analytical hurdles such a construction entails, we are able to characterize the induced random partition and determine explicit expressions of the associated urn scheme and of the posterior distribution. A detailed comparison with (infinite-dimensional) NRMIs is also provided: indeed, informative bounds for the discrepancy between the partition laws are obtained. Finally, the performance of our proposal over existing methods is assessed on a real application where we study a publicly available dataset from the Italian education system comprising the scores of a mandatory nationwide test.


Introduction
The definition and the investigation of flexible discrete priors, beyond the Dirichlet process, for Bayesian analysis has attracted increasing interest in recent years. Among the proposals that have been made, we recall the Pitman-Yor process (Pitman and Yor 1997), the normalized inverse-Gaussian process (Lijoi, Mena, and Prünster 2005), the normalized generalized gamma process (Lijoi, Mena, and Prünster 2007) and the general classes of Gibbs-type priors (Gnedin and Pitman 2005) and of normalized random measures with independent increments (NRMIs) (Regazzini, Lijoi, and Prünster 2003). These priors have been mainly employed to address predictive inference and density estimation. More recently, they have emerged as powerful tools to achieve sparsity in network models, as effectively described in Caron and Fox (2017). While these are all instances of infinite-dimensional (nonparametric) priors, in this paper we embrace a different perspective and rely on a class of finite-dimensional priors whose flexibility can be increased at will, so that they can be seen as approximations of suitable nonparametric priors. Although this argument has proved useful in Bayesian analysis (Green and Richardson 2001; Miller and Harrison 2018), it has mostly led to the specification of Dirichlet-based priors, thus introducing in the analysis some limitations that are widely known in the literature. For example, the underlying model for the clustering structure is somewhat restrictive, since it depends on a single parameter, the so-called concentration or total mass, thus calling for more flexible specifications. Such narrow flexibility has further relevant effects on the associated system of predictive distributions. In the nonparametric case, these and other aspects are illustrated in De Blasi et al. (2015). Of course, one might mitigate this issue by placing a prior distribution on the concentration parameter; however, the resulting posterior would not be available in closed form. Hence, albeit simple, such a specification has mostly turned out to be a sort of "black-box" prior. Besides these drawbacks, the lack of theoretical results has also prevented the development of computational algorithms that may facilitate their use. For example, while our investigation leads to exact sampling from the posterior distribution, a prior on the concentration parameter in a Dirichlet-based structure would require a Metropolis step in the computation.
The main focus of our article will be clustering and density estimation. Just to set up some preliminary notation, let (X_n)_{n≥1} be a sequence of X-valued exchangeable random elements and let K : X × Θ → R_+ be a transition kernel such that x ↦ K(x; θ) is a density function on X, for any θ ∈ Θ. Conditionally on a random probability measure p̃, we suppose that (X_i | p̃) ~iid ∫_Θ K(x; θ) p̃(dθ), with

p̃ = ∑_{h=1}^H π_h δ_{θ̃_h},   (1)

where δ_{θ̃_h} denotes the unit mass at θ̃_h, ∑_{h=1}^H π_h = 1 almost surely, and the collections {θ̃_1, …, θ̃_H} and {π_1, …, π_H} are independent. Furthermore, θ̃_1, …, θ̃_H are iid draws from a diffuse probability measure P. Hence, p̃ is a proper species sampling model (Pitman 1996). It is worth noting that the theoretical results displayed in the next sections are essential for applying our proposal to a number of other inferential problems, beyond the specific ones we consider here. With H < ∞, the symmetric Dirichlet(c/H, …, c/H) distribution is a popular prior choice for (π_1, …, π_H) and it is well known that, in this case, the law of p̃ in (1) converges weakly to the law of the Dirichlet process, with parameters (c, P), as H → ∞. Such a symmetric Dirichlet specification, when directly used for exchangeable data (X_1, …, X_n | p̃) ~iid p̃, is often referred to as the Dirichlet multinomial model. See, for instance, Ishwaran and Zarepour (2002). It is worth recalling that this model may be used for sparse finite mixtures as a way to circumvent the issue of selecting the number of mixture components (Malsiner-Walli, Frühwirth-Schnatter, and Grün 2016). Indeed, under suitable assumptions, overfitted mixtures consistently select the correct number of mixture components (Rousseau and Mengersen 2011). See also Nguyen (2013) for further asymptotic validations of mixture models based on Dirichlet-like mixing measures. In addition, the Dirichlet multinomial process has been employed in a variety of statistical applications, for example, for Bayesian modeling of network data (Durante, Dunson, and Vogelstein 2017; Durante and Dunson 2018), for semiparametric random effects in regression models (Rigon, Durante, and Torelli 2019), and for functional data analysis (Dunson, Herring, and Siega-Riz 2008; Petrone, Guindani, and Gelfand 2009). Finally, the symmetric Dirichlet for (π_1, …, π_H) may be combined with a "repulsive" prior for the locations θ̃_1, …, θ̃_H as in Xu, Müller, and Telesca (2016).
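As a concrete illustration (a minimal sketch of ours, not code from the paper), the symmetric Dirichlet weights of the Dirichlet multinomial model can be sampled by normalizing independent gamma random variables: each Gamma(c/H, 1) jump is infinitely divisible, and normalizing H of them yields a Dirichlet(c/H, …, c/H) vector, mirroring the normalization construction used throughout the paper.

```python
import random

def dirichlet_multinomial_weights(H, c, rng=random.Random(0)):
    """Sample (pi_1, ..., pi_H) ~ Dirichlet(c/H, ..., c/H) by
    normalizing H independent Gamma(c/H, 1) jumps."""
    jumps = [rng.gammavariate(c / H, 1.0) for _ in range(H)]
    total = sum(jumps)
    return [j / total for j in jumps]

weights = dirichlet_multinomial_weights(H=50, c=1.0)
print(abs(sum(weights) - 1.0) < 1e-12)  # weights lie on the simplex
```

As H grows with c fixed, most of the sampled weights become negligible, which is the mechanism behind the weak convergence to the Dirichlet process mentioned above.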
In this article we introduce and study a general class of finite-dimensional random probability measures p̃ as in (1), which will be termed normalized infinitely divisible multinomial (NIDM) processes. The investigation we develop will benefit from the connection with homogeneous NRMIs, whose theory has achieved remarkable advances in recent years (Lijoi and Prünster 2010). The proposed NIDM processes allow for a finer control of the underlying clustering mechanism and for a robustification of the prior specification process. We will provide both theoretical and empirical evidence in support of this claim, in line with what was already noticed in the infinite-dimensional case (Lijoi, Mena, and Prünster 2007). As a by-product of our investigation, we note that one might employ NIDMs as approximations of their infinite-dimensional counterparts. Besides the theoretical interest that a result of this type may give rise to, it is also very helpful from a practical standpoint since it helps lighten computational bottlenecks. Indeed, posterior inference for NRMIs most often relies on series representations of the type devised in Ferguson and Klass (1972), hence requiring numerical and analytical approximations. In contrast, the posterior structure of several NIDMs can be computed and sampled exactly, without the need for MCMC steps. This is an important bonus, even when compared to standard Dirichlet-multinomial specifications. Such a gain, however, does not come for free, since the probabilistic structure of our model yields some challenging technical hurdles when it comes to determining distributional properties of interest for Bayesian inference. These difficulties parallel those that arise when one uses a discrete base measure for a nonparametric prior process. See, for example, Canale et al. (2017) for an example in the Pitman-Yor case.
The main application we focus on concerns the INVALSI 2016-2017 dataset, from a national examination conducted in Italy. Specifically, we aim to measure the teaching competencies of a set of schools while taking into account the socio-demographic characteristics of their students. Great effort has been made by the public research institute INVALSI to provide reliable quantifications of the effect of each school on test performance. Indeed, such an indicator is nationally relevant, especially for the development and the evaluation of educational policies. We address this problem via semiparametric modeling with nonparametric school-specific random effects, which will be interpreted as a proxy of the added value of the school.
The article is organized as follows. In Section 2 we review some background material about completely random measures and species sampling models. In Section 3 we define NIDM processes and discuss several a priori properties, such as the law of the random partition they induce and the weak convergence to NRMIs. In Section 4, we discuss generalized urn schemes and posterior characterizations, with a particular emphasis on the normalized generalized gamma (NGG) multinomial process. In Section 5 we employ the NGG multinomial prior for the analysis of a real dataset, highlighting practical advantages over existing methods. We conclude with a discussion in Section 6. Detailed proofs are deferred to the supplementary material, which further includes an extensive simulation study showing the greater flexibility of the NGG multinomial prior under a wide range of data generating distributions, with different choices of the hyperparameters and sample sizes.

Discrete Priors and Random Partitions
Since NIDM processes are closely related to homogeneous NRMIs, we sketch a concise overview of some preliminary definitions and notation on random measures.
Let Θ be a separable and complete metric space and let B(Θ) be its Borel σ-field. A random measure μ̃_∞ on (Θ, B(Θ)) such that μ̃_∞(A_1), …, μ̃_∞(A_d) are mutually independent random variables, for any choice of pairwise disjoint A_1, …, A_d in B(Θ) and for any d ≥ 2, is a completely random measure (CRM). For our purposes, it is useful to recall that a CRM with no fixed points of discontinuity and no deterministic drift can be represented as μ̃_∞ = ∑_{h≥1} J_h δ_{θ̃_h} and is characterized by the Laplace functional transform

E[ exp( −∫_Θ f(θ) μ̃_∞(dθ) ) ] = exp( −∫_{R_+ × Θ} (1 − e^{−s f(θ)}) ν(ds, dθ) ),   (2)

where f : Θ → R_+ is a measurable function and the measure ν on R_+ × Θ, termed the Lévy measure, or intensity, characterizes the CRM and is such that ∫_{R_+ × A} min{1, s} ν(ds, dθ) < ∞ for any bounded A ∈ B(Θ). In the following, we consider only homogeneous CRMs, which amounts to having a Lévy intensity of the form ν(ds, dθ) = ρ(s) ds cP(dθ), where P is a probability measure over (Θ, B(Θ)) and c ∈ R_+. We will use the notation μ̃_∞ ∼ CRM(c, ρ; P). If one additionally has ∫_{R_+} ρ(s) ds = ∞, then 0 < μ̃_∞(Θ) < ∞ almost surely, and a homogeneous NRMI is defined as p̃_∞ = ∑_{h≥1} (J_h / J̄) δ_{θ̃_h}, where J̄ = ∑_{h≥1} J_h = μ̃_∞(Θ). We will write p̃_∞ ∼ NRMI(c, ρ; P) to denote such a random probability measure. Several relevant nonparametric priors are NRMIs, and a noteworthy class, which is the object of investigation in the present paper, arises when

ρ(s) = s^{−1−σ} e^{−κs} / Γ(1 − σ),   (3)

whose additional parameters 0 ≤ σ < 1 and κ ≥ 0 are such that at least one of them is positive (Brix 1999). The resulting NRMI is often referred to as the normalized generalized gamma (NGG) process, and it includes some well-known nonparametric priors as special cases. See Lijoi, Mena, and Prünster (2007).
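For the generalized gamma intensity, both the Laplace exponent ψ and the functions τ_m introduced below admit closed forms. The following sketch (our illustration, assuming the standard form ρ(s) = s^{−1−σ} e^{−κs}/Γ(1 − σ)) evaluates ψ(u) = ((κ + u)^σ − κ^σ)/σ and τ_m(u) = Γ(m − σ)/Γ(1 − σ) (κ + u)^{σ−m}, and checks τ_2 against direct numerical integration.

```python
import math

def psi(u, sigma, kappa):
    # Laplace exponent of the generalized gamma intensity:
    # psi(u) = ((kappa + u)^sigma - kappa^sigma) / sigma
    return ((kappa + u) ** sigma - kappa ** sigma) / sigma

def tau(m, u, sigma, kappa):
    # tau_m(u) = int_0^inf s^m e^{-us} rho(s) ds
    #          = Gamma(m - sigma) / Gamma(1 - sigma) * (kappa + u)^(sigma - m)
    return math.gamma(m - sigma) / math.gamma(1 - sigma) * (kappa + u) ** (sigma - m)

def tau2_numeric(u, sigma, kappa, upper=60.0, n=200000):
    # Trapezoidal check of tau_2; the integrand s^(1-sigma) e^{-(kappa+u)s}
    # is bounded near zero, so a plain trapezoid rule suffices.
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        s = i * h
        f = 0.0 if s == 0.0 else s ** (1.0 - sigma) * math.exp(-(kappa + u) * s) / math.gamma(1.0 - sigma)
        total += f if 0 < i < n else f / 2.0
    return total * h

print(abs(tau(2, 1.5, 0.5, 1.0) - tau2_numeric(1.5, 0.5, 1.0)) < 1e-5)
```

Setting σ → 0 and κ = 1 recovers the gamma-process quantities ψ(u) = log(1 + u) and τ_m(u) = Γ(m)(1 + u)^{−m}.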
Both NRMIs and the novel class of NIDM processes are discrete random probability measures. Thus, when their law identifies the de Finetti measure of the exchangeable sequence of Θ-valued random elements (θ_n)_{n≥1}, there will be ties with positive probability, namely P[θ_i = θ_{i'}] > 0 for any i ≠ i'. Hence, an n-sample θ^{(n)} = (θ_1, …, θ_n) contains k ≤ n distinct values θ*_1, …, θ*_k and induces a random partition {C_1, …, C_k} of the indices {1, …, n}. As discussed in Pitman (1996), this clustering mechanism is regulated by a symmetric function called the exchangeable partition probability function (EPPF), which is defined by

Π(n_1, …, n_k) = P( θ^{(n)} induces the partition {C_1, …, C_k} ),   (4)

where the vector n^{(k)} = (n_1, …, n_k) of positive integers is such that n_j = card(C_j) and ∑_{j=1}^k n_j = n. Notice further that the conditional probabilities that θ_{n+1} displays a new value from the diffuse base measure P or coincides with a previously observed θ*_j, given θ^{(n)}, can be expressed in terms of the EPPF as w_0(n^{(k)}) = Π(n_1, …, n_k, 1)/Π(n_1, …, n_k) and w_j(n^{(k)}) = Π(n_1, …, n_j + 1, …, n_k)/Π(n_1, …, n_k), for j = 1, …, k, which, for any A ∈ B(Θ), entails

P(θ_{n+1} ∈ A | θ^{(n)}) = w_0(n^{(k)}) P(A) + ∑_{j=1}^k w_j(n^{(k)}) δ_{θ*_j}(A).

If p̃_∞ ∼ NRMI(c, ρ; P), the induced EPPF equals

Π_∞(n_1, …, n_k) = (c^k / Γ(n)) ∫_0^∞ u^{n−1} e^{−c ψ(u)} ∏_{j=1}^k τ_{n_j}(u) du,   (5)

where ψ(u) = ∫_{R_+} (1 − e^{−us}) ρ(s) ds is the Laplace exponent and τ_m(u) := ∫_{R_+} s^m e^{−us} ρ(s) ds, for any integer m ≥ 1. If ρ is as in (3), that is, if p̃_∞ is an NGG process, one finds that Π_∞(n_1, …, n_k) = V_{n,k} ∏_{j=1}^k (1 − σ)_{n_j − 1}, where (a)_n = a(a + 1) ⋯ (a + n − 1) for any real a and integer n ≥ 1, (a)_0 = 1, and

V_{n,k} = (e^β σ^{k−1} / Γ(n)) ∑_{i=0}^{n−1} (n−1 choose i) (−1)^i β^{i/σ} Γ(β; k − i/σ),   (6)

where β = cκ^σ/σ, and Γ(x; a) = ∫_x^∞ s^{a−1} e^{−s} ds for any a > 0 is the incomplete gamma function. One may refer to Lijoi, Mena, and Prünster (2007) for details about the determination of (6).

Normalized Infinitely Divisible Multinomial Processes
In order to define a more flexible, and still tractable, class of finite-dimensional priors we crucially start from infinitely divisible masses and adopt a normalization procedure similar to the one that yields NRMIs.

NIDM Processes
The main building block of the new class of processes we propose is a collection of independent and infinitely divisible random variables. In particular, we will deal with finite and strictly positive infinitely divisible random variables, say J, such that

E[e^{−λJ}] = exp( −c ∫_{R_+} (1 − e^{−λs}) ρ(s) ds ),   (7)

for any λ > 0 and some constant c > 0. The function ρ is a nonnegative measurable function satisfying the same conditions outlined in Section 2, that is, ∫_{R_+} min{1, s} ρ(s) ds < ∞ and ∫_{R_+} ρ(s) ds = ∞. We will henceforth use the notation J ∼ ID(c, ρ).
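To make (7) concrete, consider an illustrative sketch of ours under the assumption ρ(s) = s^{−1} e^{−s}: the Laplace exponent then reduces to ψ(λ) = log(1 + λ) (a Frullani integral), so (7) becomes E[e^{−λJ}] = (1 + λ)^{−c}, the Laplace transform of a Gamma(c, 1) random variable. A quick numerical check:

```python
import math

def psi_gamma_numeric(lam, upper=50.0, n=100000):
    # Trapezoidal evaluation of psi(lambda) = int_0^inf (1 - e^{-lam*s}) s^{-1} e^{-s} ds;
    # the integrand tends to lam as s -> 0, so it is bounded.
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        s = i * h
        f = lam if s == 0.0 else (1.0 - math.exp(-lam * s)) / s * math.exp(-s)
        total += f if 0 < i < n else f / 2.0
    return total * h

lam, c = 2.0, 1.5
# Right-hand side of (7) versus the Gamma(c, 1) Laplace transform (1 + lam)^{-c}.
print(abs(math.exp(-c * psi_gamma_numeric(lam)) - (1.0 + lam) ** (-c)) < 1e-4)
```

This is exactly the special case that, after normalization, yields the Dirichlet multinomial process discussed below.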
Definition 1. The random measure

μ̃_H = ∑_{h=1}^H J_h δ_{θ̃_h},

where J_1, …, J_H ~iid ID(c/H, ρ) are independent of θ̃_1, …, θ̃_H ~iid P, H < ∞, and P is a diffuse probability measure on (Θ, B(Θ)), is a multinomial random measure with infinitely divisible jumps and will be denoted as μ̃_H ∼ IDM(c, ρ; P).
It is important to stress that IDMs are not completely random, since μ̃_H(A_1), …, μ̃_H(A_d) are not mutually independent random variables for any choice of pairwise disjoint sets A_1, …, A_d in B(Θ) and for any d ≥ 2. Indeed, if A ∈ B(Θ) is such that 0 < P(A) < 1, then P(μ̃_H(A) = 0) = (1 − P(A))^H > 0. On the other hand, P(μ̃_H(A) = 0 | μ̃_H(A^c) = 0) = 0, since P(μ̃_H(Θ) > 0) = 1. Hence, independence between μ̃_H(A) and μ̃_H(A^c) does not hold. Nonetheless, the Laplace functional transform of an IDM is available and, for any nonnegative function f, equals

E[e^{−μ̃_H(f)}] = ( ∫_Θ e^{−(c/H) ψ(f(θ))} P(dθ) )^H.

Moreover, for any A ∈ B(Θ), μ̃_H(A) is equal in distribution to ∑_{i=0}^m J_i, where m ∼ Binomial(H, P(A)) and, conditional on m, the random variables J_1, …, J_m are iid from ID(c/H, ρ) and J_0 is a point mass at zero: this provides an interesting and alternative representation of μ̃_H(A).
The new class of finite-dimensional processes we define is obtained by normalizing the random measures in Definition 1. In this respect, the construction is related to the normalized infinitely divisible (NID) family of distributions in Favaro, Hadjicharalambous, and Prünster (2011), in the sense that the random probability weights are obtained by normalizing infinitely divisible random variables.
Definition 2. Let μ̃_H ∼ IDM(c, ρ; P). If

p̃_H = μ̃_H / μ̃_H(Θ) = ∑_{h=1}^H π_h δ_{θ̃_h},  with π_h = J_h / ∑_{l=1}^H J_l,

then p̃_H is a normalized infinitely divisible multinomial process and is denoted as p̃_H ∼ NIDM(c, ρ; P). Notice, further, that π = (π_1, …, π_H) is a normalized infinitely divisible (NID) random vector, for which we use the notation π ∼ NID(c, ρ).

Hence, a NIDM process is a random probability measure with finite support whose vector of probability weights π is termed NID according to the terminology in Favaro, Hadjicharalambous, and Prünster (2011). Indeed, Favaro, Hadjicharalambous, and Prünster (2011) propose this class of distributions on the simplex as an alternative to the Dirichlet distribution. While they investigate some distributional properties and discuss examples for which the density function of π is available in closed form, they neither consider its use in a Bayesian setting nor construct random probability measures with finite support based on π. From Definition 2, it is apparent that μ̃_H(Θ) ∼ ID(c, ρ), thus ensuring that the above normalization is well defined. Note that a NIDM process is also a proper species sampling model (1) with H < ∞, since P is diffuse. If we set ρ(s) = s^{−1} e^{−s}, we recover the Dirichlet multinomial process, whose weights satisfy (π_1, …, π_H) ∼ Dirichlet(c/H, …, c/H). Henceforth, we take ρ as in (3) and obtain a novel and tractable class of NIDM processes that we will term the NGG multinomial process.
It is useful to point out that NIDM processes display a hierarchical structure. Indeed, if (μ̃_H | p̃_{0,H}) ∼ CRM(c, ρ; p̃_{0,H}) and p̃_{0,H} = H^{−1} ∑_{h=1}^H δ_{θ̃_h}, with θ̃_h ~iid P, then μ̃_H ∼ IDM(c, ρ; P). Hence, if we pick p̃_H = μ̃_H / μ̃_H(Θ), one has that p̃_H ∼ NIDM(c, ρ; P) and, in view of the previous remark, it can be described through the following hierarchical model:

p̃_H = μ̃_H / μ̃_H(Θ),  (μ̃_H | p̃_{0,H}) ∼ CRM(c, ρ; p̃_{0,H}),  p̃_{0,H} = H^{−1} ∑_{h=1}^H δ_{θ̃_h},  θ̃_h ~iid P.   (10)

The converse holds true as well: any NIDM can be represented as a hierarchical process with an NRMI having a discrete baseline measure at the bottom of the hierarchy. Note that the law of p̃_{0,H} is that of a specific Gibbs-type prior, arising when the discount parameter goes to −∞; see Gnedin and Pitman (2005). Furthermore, this representation relates any NIDM process to hierarchical constructions like those presented in Camerlenghi, Lijoi, and Prünster (2018).
In view of (10), it is instructive to compare the finite-dimensional distributions of a NIDM p̃_H with those of a p̃_∞ ∼ NRMI(c, ρ; P). In particular, for any finite partition {B_1, …, B_d} of Θ into B(Θ)-sets, the distribution of (p̃_H(B_1), …, p̃_H(B_{d−1})) can be expressed as a mixture of NID distributions with multinomial weights, motivating the NIDM denomination. More precisely, conditionally on the vector m = (m_1, …, m_d) ∼ Multinomial(H; P(B_1), …, P(B_d)), which counts the atoms θ̃_h falling in each B_i, the vector (p̃_H(B_1), …, p̃_H(B_{d−1})) has a NID distribution with parameters (cm_1/H, …, cm_d/H; ρ). On the other hand, since p̃_∞ arises from the normalization of a completely random measure, the corresponding vector (p̃_∞(B_1), …, p̃_∞(B_{d−1})) has a NID distribution with parameters (cP(B_1), …, cP(B_d); ρ). A similar argument will be used in the next section to show that NIDM processes weakly converge to the corresponding homogeneous NRMI as H → ∞. Finally, note that the moments of a NIDM process can be obtained in closed form; see the supplementary material for details.

Convergence of NIDM Processes
The previous discussion suggests that, when H is large enough, a NIDM approaches a nonparametric prior corresponding to a homogeneous NRMI. The main result of this section concerns the convergence of IDM random measures to homogeneous CRMs; we henceforth resort to the more concise notation μ̃(f) = ∫_Θ f(θ) μ̃(dθ) for any positive and measurable function f : Θ → R_+.

Theorem 1. Let μ̃_H ∼ IDM(c, ρ; P) and μ̃_∞ ∼ CRM(c, ρ; P). Then, for any positive and measurable f : Θ → R_+ such that ∫_Θ ψ(f(θ)) P(dθ) < ∞, one has μ̃_H(f) →d μ̃_∞(f) as H → ∞.

Note that if f in the above theorem is integrable with respect to P, then the condition ∫_Θ ψ(f(θ)) P(dθ) < ∞ is satisfied. In addition, given the convergence of the Laplace functionals, vague convergence μ̃_H →v μ̃_∞ as H → ∞ follows from Theorem 4.11 of Kallenberg (2017), since the above condition is satisfied if f is a positive, continuous, and bounded function. Now recall that p̃_H ∼ NIDM(c, ρ; P) and that p̃_∞ ∼ NRMI(c, ρ; P). An important implication of Theorem 1 is the convergence of the finite-dimensional distributions of p̃_H to those of p̃_∞. Indeed, plugging into Theorem 1 the simple function f = ∑_{i=1}^d λ_i 1_{B_i}, for λ_i ≥ 0 and a partition {B_1, …, B_d} of Θ, yields the convergence in distribution of (μ̃_H(B_1), …, μ̃_H(B_d)) to (μ̃_∞(B_1), …, μ̃_∞(B_d)), and hence of (p̃_H(B_1), …, p̃_H(B_d)) to (p̃_∞(B_1), …, p̃_∞(B_d)), as a consequence of the continuous mapping theorem. When working with random probability measures, the convergence of the finite-dimensional distributions suffices to guarantee the weak convergence of the whole process (Corollary 1: p̃_H →w p̃_∞ as H → ∞); see Theorem 4.11 of Kallenberg (2017). The above statement implies the convergence in distribution of general functionals p̃_H(f) →d p̃_∞(f) as H → ∞ when f is a continuous and bounded function. In the Dirichlet multinomial process case, related results were previously obtained, e.g., in Kingman (1975) and Green and Richardson (2001).
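The finite-dimensional convergence can be made fully explicit in the Dirichlet special case. There, conditionally on m atoms falling in B, p̃_H(B) ∼ Beta(cm/H, c(H − m)/H) with m ∼ Binomial(H, P(B)), and a standard mixture computation (our derivation, consistent with the convergence result above) gives Var(p̃_H(B)) = P(B)(1 − P(B)) [(1 − 1/H)/(c + 1) + 1/H], which decreases to the Dirichlet-process value P(B)(1 − P(B))/(c + 1) as H → ∞:

```python
def var_pH(PB, c, H):
    # Exact variance of p_H(B) in the Dirichlet special case
    # (Beta mixture over a Binomial(H, P(B)) count of atoms in B).
    return PB * (1 - PB) * ((1 - 1 / H) / (c + 1) + 1 / H)

PB, c = 0.3, 2.0
limit = PB * (1 - PB) / (c + 1)  # Dirichlet-process (NRMI) limit
for H in (2, 10, 100, 1000):
    print(H, round(var_pH(PB, c, H), 6))
print(abs(var_pH(PB, c, 10**6) - limit) < 1e-6)
```

The excess variance over the limit is of order 1/H, matching the O(1/H) convergence rates derived later for the number of clusters.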

Random Partitions and Number of Clusters
In this section we study the clustering mechanism underlying a NIDM process, which amounts to determining the EPPF, namely the probability distribution of the induced exchangeable partition: this will be denoted by Π_H and is defined through (4), with p̃ replaced by p̃_H. More specifically, it will be shown that the EPPF of any NIDM process is a finite mixture of the partition functions arising in the infinite-dimensional setting. Before stating the theorem, let us introduce a further quantity of interest: for any m ≥ 1, define the function ξ_{m,H}(u), which plays the role of τ_m(u) in the finite-dimensional setting, and set V_{0,H} := 1; here ψ(u) is the Laplace exponent defined as in (7), and the inner sum in the definition of ξ_{m,H} runs over all vectors q = (q_1, …, q_ℓ) of positive integers such that ∑_{r=1}^ℓ q_r = m. Moreover, for any vector x ∈ R^p, we let |x| = ∑_{i=1}^p x_i.
Theorem 2. Let (θ_n)_{n≥1} be an exchangeable sequence directed by a NIDM process prior. Then, for any k ≤ min{n, H}, the associated EPPF Π_H is given by (11), where the first sum runs over all vectors ℓ = (ℓ_1, …, ℓ_k) such that ℓ_j ∈ {1, …, n_j}, and the jth of the k inner sums runs over the vectors q_j = (q_{j1}, …, q_{jℓ_j}) of positive integers with |q_j| = n_j. Moreover, Π_H can be written as a finite mixture involving Π_∞, the EPPF of the corresponding homogeneous NRMI in (5).
The mixture representation in (11) is reminiscent of the one in Camerlenghi, Lijoi, and Prünster (2018) for hierarchical NRMIs, which is not surprising in view of the hierarchical representation of NIDM processes in (10). Furthermore, note that lim_{H→∞} ξ_{m,H}(u) = τ_m(u) for any u > 0 and m ≥ 1. This shows that lim_{H→∞} Π_H = Π_∞, namely that the EPPF associated with a NIDM process converges to the one induced by a homogeneous NRMI and displayed in (5). As a matter of fact, one can say something more precise and identify bounds for the ratio between Π_H and Π_∞, for any H, as stated in the next theorem.
Theorem 3. For any k ≤ min{n, H}, the ratio Π_H/Π_∞ admits explicit lower and upper bounds, expressed in terms of expectations with respect to the density q_∞. Note that q_∞ is the density function of a latent random variable used in James, Lijoi, and Prünster (2009) to provide posterior characterizations of NRMIs. Both bounds converge to 1 as H → ∞. As a simple application of Theorem 3, one might obtain bounds also for the predictive distributions, by exploiting their relationship with the EPPF. For NGG multinomial processes and, a fortiori, in the Dirichlet multinomial case, the EPPF and the related bounds can be computed explicitly. This is illustrated in the following examples.
Example 1 (Dirichlet multinomial process). Let the NIDM process be characterized by the intensity function ρ(s) = s^{−1} e^{−s}. For any k ≤ min{n, H}, Theorem 2 yields the EPPF in closed form; this EPPF can be found, for example, in Green and Richardson (2001). Additionally, note that such a prior is a Pitman-Yor process with discount parameter σ = −c/H < 0 and H|σ| = c > 0; see, for example, De Blasi et al. (2015). A direct application of Theorem 3 then yields bounds making clear that Π_H and Π_∞ are close when either H is large, as is natural to expect on the basis of Corollary 1, or the total mass parameter c is small. This is the same bound obtained in the Appendix of Ishwaran and Zarepour (2002) by means of different techniques, so Theorem 3 can be seen as a generalization of their result.
Example 2 (NGG multinomial process). Let the NIDM process be characterized by the generalized gamma intensity function ρ(s) given in (3). On the basis of Theorem 2, for any k ≤ H, the EPPF is available in a simplified form involving the generalized factorial coefficients C(n, k; σ), which can be computed by exploiting recursive formulas. Hence, the EPPF of an NGG multinomial process has a simpler form than the general Equation (11), because it only depends on the integers ℓ_1, …, ℓ_k. This also enables the actual evaluation of the upper bound in Theorem 3.

The availability of Π_H naturally leads one to address the problem of determining the distribution of the number of partition sets K_{n,H}, that is, the law of the number of distinct values observed in a sample θ^{(n)} from a NIDM process prior. As one might expect, K_{n,H} converges to the number of partition sets K_{n,∞}, namely the number of distinct values generated by an exchangeable n-sample from a homogeneous NRMI, as H → ∞. Another interesting connection between K_{n,H} and K_{n,∞} is formalized in the following theorem.
Theorem 4. For any k ≤ min{H, n},

P(K_{n,H} = k) = (H! / (H − k)!) ∑_{ℓ=k}^n H^{−ℓ} S(ℓ, k) P(K_{n,∞} = ℓ),

where S(ℓ, k) is the Stirling number of the second kind, for ℓ, k ≥ 0. Moreover, the expected value of K_{n,H} is given by

E(K_{n,H}) = H { 1 − E[(1 − 1/H)^{K_{n,∞}}] }.
From Theorem 4 it can be easily seen that

E(K_{n,H}) / E(K_{n,∞}) = 1 − E[K_{n,∞}(K_{n,∞} − 1)] / (2H E(K_{n,∞})) + O(1/H²).

Thus, the convergence of the expected value to the infinite-dimensional case occurs at the linear rate O(1/H). We expanded the above ratio up to the second order to gain further understanding of the speed at which E(K_{n,H}) approaches its limit: broadly speaking, convergence is quick whenever E(K²_{n,∞}) ≈ E(K_{n,∞}), which is the case when E(K_{n,∞}) is relatively small. On the other hand, one needs a large value of H when trying to approximate an infinite-dimensional process having a high expected number of clusters in a sample of size n. This is in line with the discussion of Example 1.
Example 3 (Dirichlet multinomial process, cont'd). A straightforward application of Theorem 4 yields a closed-form expression for the law of K_{n,H}. This simple form is due to the Gibbs-type structure of the Dirichlet multinomial process, and indeed the formula might be deduced from Gnedin and Pitman (2005). In the relevant particular case c = H, that is, when the weight vector of the NIDM process is uniformly distributed on the simplex, a further simplification occurs and the law of K_{n,H} reduces to a particular instance of the hypergeometric distribution. As for the expected value of K_{n,H}, for any value of c we have the following simple formula:

E(K_{n,H}) = H [ 1 − (c − c/H)_n / (c)_n ].

Note that, as H → ∞, the right-hand side converges to E(K_{n,∞}) = ∑_{i=0}^{n−1} c/(c + i), the expected number of clusters induced by the Dirichlet process with concentration parameter c.
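The expected-value formula in the Dirichlet special case is easy to evaluate directly. The sketch below (our illustration, consistent with Theorem 4) computes E(K_{n,H}) = H(1 − (c − c/H)_n/(c)_n), with (a)_n the rising factorial, and checks that it increases to the Dirichlet-process value ∑_{i=0}^{n−1} c/(c + i) as H grows:

```python
from math import prod

def expected_clusters_dm(n, H, c):
    # E(K_{n,H}) = H * (1 - (c - c/H)_n / (c)_n), (a)_n rising factorial
    rising = lambda a, m: prod(a + i for i in range(m))
    return H * (1.0 - rising(c - c / H, n) / rising(c, n))

def expected_clusters_dp(n, c):
    # Dirichlet-process limit: E(K_{n,infty}) = sum_{i=0}^{n-1} c/(c+i)
    return sum(c / (c + i) for i in range(n))

n, c = 100, 1.0
for H in (10, 50, 250, 1000):
    print(H, round(expected_clusters_dm(n, H, c), 4))
print(round(expected_clusters_dp(n, c), 4))
```

For a fixed sample size, the finite-dimensional expectation is always below its nonparametric limit, quantifying how large H must be for the approximation to be accurate.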
Example 4 (NGG multinomial process, cont'd). A direct application of Theorem 4 yields the distribution of K_{n,H} for any k = 1, …, min{H, n}, since the law of K_{n,∞} under the NGG process is available in closed form through the generalized factorial coefficients. To illustrate the clustering mechanism, in Figure 1 we depict the distribution of the random variable K_{100,H} under both a Dirichlet multinomial process and an NGG multinomial process, for different values of H. To make these prior choices comparable, we set the hyperparameters c and (β, σ) such that the expected number of clusters E(K_{100,∞}) for the corresponding infinite-dimensional NRMIs is the same.
As highlighted in Figure 1, the distribution of K_{100,H} under the NGG multinomial prior is "flatter", that is, less informative, than the one induced by the Dirichlet multinomial prior, for any of the values of H being considered. We note that the variance of K_{n,H} under the NGG prior can be tuned through the parameter σ; when σ → 0, the Dirichlet multinomial case is recovered. In the limiting case H → ∞, this effect of σ was extensively discussed in Lijoi, Mena, and Prünster (2007) and, in view of Theorem 4, it comes as no surprise that it is reflected also in the finite-H case. Theorems 2-4 show that, despite the technical difficulties posed by the finite-H setting, distributional properties similar to those available in the infinite-dimensional case can still be achieved. Nonetheless, we remark that if one is interested in approximating the infinite-dimensional NRMI prior, a relatively large value of H is typically required. For instance, Figure 1 suggests that, in this specific setting, we should set H = 250 to satisfactorily approximate both the Dirichlet and the NGG processes with their NIDM counterparts.

Posterior Characterizations
We complete here the picture of the distributional properties of NIDM processes by determining their posterior distribution. We will refer to a framework where

(θ_i | p̃_H) ~iid p̃_H, for i = 1, …, n,  with p̃_H ∼ NIDM(c, ρ; P),   (12)

and will provide two representations of the posterior distribution of p̃_H, one of which is effectively described in terms of the multiroom Chinese restaurant metaphor introduced in Camerlenghi, Lijoi, and Prünster (2018). While our results are general, particular emphasis is given to the NGG multinomial process and, despite the lack of conjugacy, we are able to devise algorithms for exact (iid) sampling of the posterior. In contrast with the most widely known samplers for homogeneous NRMIs (e.g., Lijoi and Prünster 2010; Barrios et al. 2013; Favaro and Teh 2013; Arbel and Prünster 2017), which either require Gibbs-sampling strategies or the truncation of certain series representations, the posterior laws of NIDM processes can be sampled exactly.

Predictive Distributions and Posterior Laws
Since the EPPF Π_H can be evaluated through Theorem 2, it is straightforward to compute the predictive distributions. To this end, for any n ≥ 1 we define a positive random variable U whose density function q_H on R_+, conditional on the sample θ^{(n)}, is available up to a normalizing constant; this constant is finite and essentially identifies the EPPF of a NIDM process. The density q_H parallels the density, say q_∞, of the latent variable appearing in the posterior representation for NRMIs; see James, Lijoi, and Prünster (2009). Note that q_H converges pointwise to q_∞ as H → ∞, since ξ_{m,H}(u) → τ_m(u) for any m ≥ 1 and u > 0.
Corollary 2. Let θ_1, …, θ_n be as in (12), admitting k distinct values θ*_1, …, θ*_k, with θ*_j having frequency n_j. Then P(θ_{n+1} ∈ A | θ^{(n)}) = w_0(n^{(k)}) P(A) + ∑_{j=1}^k w_j(n^{(k)}) δ_{θ*_j}(A), where n^{(k)} = (n_1, …, n_k) and the weights w_0 and w_j, for j = 1, …, k, are available in closed form.

This result is reminiscent of the predictive distributions obtained for homogeneous NRMIs, and indeed similar sampling strategies can be borrowed from that context. For example, conditionally on the latent variable U, the predictive distribution P(θ_{n+1} ∈ A | θ^{(n)}, U) takes a simple form. Hence, one can devise a generalized Pólya-urn scheme by first drawing U from its density q_H and then sampling from the predictive distribution given U. The terms ξ_{m,H} might be expensive to compute in practice, mainly because of their combinatorial nature. However, this issue can be attenuated by relying on the recursive definition of ξ_{m,H} provided in James, Lijoi, and Prünster (2006). Furthermore, in the fairly general NGG multinomial case, the weights ξ_{m,H} have an explicit formula, and this allows for the implementation of an exact sampling algorithm for U; see the supplementary material.
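The urn mechanics are easiest to see in the Dirichlet multinomial special case, where no latent U is needed and the predictive weights take the well-known simple form w_j ∝ n_j + c/H for an existing cluster and w_0 ∝ (c/H)(H − k) for a new one. The following sketch (our illustration; the general NIDM scheme additionally conditions on U) sequentially generates cluster labels from this urn:

```python
import random

def urn_sample(n, H, c, rng=random.Random(1)):
    """Sequentially sample cluster sizes from the Dirichlet multinomial urn:
    a new cluster is opened with probability (c/H)(H - k)/(c + i) and an
    existing cluster j is joined with probability (n_j + c/H)/(c + i),
    where i observations have been seen and k clusters opened so far."""
    counts = []  # current cluster sizes
    for i in range(n):
        k = len(counts)
        weights = [nj + c / H for nj in counts] + [(c / H) * (H - k)]
        total = c + i  # the weights above sum to i + c
        r = rng.random() * total
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if j == k:
            counts.append(1)  # open a new cluster
        else:
            counts[j] += 1    # join an existing cluster
    return counts

counts = urn_sample(200, H=20, c=1.0)
print(sum(counts), len(counts))  # n observations, at most H clusters
```

Note that, unlike the Dirichlet-process urn, the probability of a new cluster vanishes once k = H clusters have been opened, reflecting the finite support of the prior.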
Remark 1. If one is only interested in obtaining an exchangeable draw for θ^{(n)}, a direct strategy consists in simulating p̃_H and then, conditionally on it, sampling iid observations from p̃_H, according to (12). Indeed, any ID random variable whose Laplace exponent is available in closed form can be sampled, for example through the general algorithm of Ridout (2009), and this enables the simulation of NIDM processes.
This representation is related to the posterior characterization for homogeneous NRMIs, which can be recovered as H → ∞. Indeed, the first term in (14) converges to a CRM with the exponentially tilted Lévy intensity ρ*, as a consequence of Theorem 1. On the other hand, J*_j →d 0 and E[e^{−λ I_j} | θ^{(n)}, U] → τ_{n_j}(λ + U)/τ_{n_j}(U) for any j = 1, …, k: hence, the second term on the right-hand side of (14) converges to the fixed-jumps component of the posterior representation of NRMIs.
See James, Lijoi, and Prünster (2009). Interestingly, a structural property is shared by NRMIs and NIDMs: conditionally on a latent variable U, the posterior law is a mixture of (i) a component with a tilted intensity and (ii) a collection of independent jumps corresponding to the distinct values θ*_1, …, θ*_k in the sample θ^{(n)}. However, it must be stressed that for NIDMs the tilted component vanishes as soon as k = H distinct values are recorded in the sample, in which case the posterior distribution coincides with the law of a measure with jumps at the fixed locations identified by the distinct values θ*_1, …, θ*_H.
Example 5 (Dirichlet multinomial process, cont'd). Note that ξ_{m,H}(λ + u)/ξ_{m,H}(u) = τ_m(λ + u)/τ_m(u) for any m ≥ 1 and any H, implying that the random variables in Theorem 5 have a simple gamma form for any H, where we agree that I_j = 0 almost surely for any j > k. Hence, after normalization, we get the usual Dirichlet structure with updated parameters, which does not depend on the latent variable U. Finally, it is easy to check that Corollary 2 can be specialized to obtain the well-known predictive distributions of the Dirichlet multinomial process.
Moreover, the random variables J*_1, …, J*_H of Theorem 5 are conditionally conjugate, because ρ*(s) = e^{−Us} ρ(s) = σ/Γ(1 − σ) s^{−1−σ} e^{−(κ+U)s} identifies an updated generalized gamma process. The distribution of each J*_1, …, J*_H is known as tempered stable, and there exist several methods for drawing samples from it; see for example Ridout (2009) and references therein. Furthermore, the random variables I_1, …, I_k, given U, have conditional mixture densities of gamma type for j = 1, …, k, where GAMMA(w; a, b) denotes the density function of a gamma random variable. Finally, for any A ∈ B(Θ), some algebra yields an explicit expression for the posterior mean of p̃_H(A), involving weights of the form ξ_{n_j,j,H}(U).

Remark 2. To enable posterior inference through random sampling it suffices to simulate iid values of U from q_H and, then, make use of the above posterior representation. Although q_H is known only up to a normalizing constant, we can nonetheless draw samples by acceptance-rejection algorithms. The simulation of the limiting density q_∞ was typically addressed via Markov chain Monte Carlo (Lijoi, Mena, and Prünster 2007). However, this further complication can be avoided in the NGG setting, given the availability of algorithms for exact sampling, which are discussed in the supplementary material both for q_H and q_∞. As such, these algorithms might be useful beyond their application to NIDMs. This data-augmentation scheme circumvents the need of computing the weights V_{n,k}, which may be numerically unstable for large values of n (Lijoi, Mena, and Prünster 2007).
In contrast, a potential bottleneck of such a strategy is the evaluation of the generalized factorial coefficients C(n, k; σ) when n is large, although the availability of recursive formulas for C(n, k; σ) may mitigate the issue.
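The recursive formulas alluded to above can be sketched with the standard triangular recursion for generalized factorial coefficients reported in the Gibbs-type prior literature; for very large n a log-space version would be preferable to avoid overflow, which this illustrative sketch does not attempt.

```python
import numpy as np

def gen_factorial_coeffs(n, sigma):
    """Generalized factorial coefficients C(m, k; sigma) for
    0 <= k <= m <= n, via the triangular recursion
        C(m+1, k; s) = (m - k*s) C(m, k; s) + s C(m, k-1; s),
    with C(0, 0; s) = 1; for 0 < sigma < 1 all entries are nonnegative."""
    C = np.zeros((n + 1, n + 1))
    C[0, 0] = 1.0
    for m in range(n):
        for k in range(m + 2):
            prev = C[m, k - 1] if k >= 1 else 0.0
            C[m + 1, k] = (m - k * sigma) * C[m, k] + sigma * prev
    return C

C = gen_factorial_coeffs(3, 0.5)
# e.g. C(2, 1; 0.5) = sigma * (1 - sigma) = 0.25
```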

Multiroom Chinese Restaurant Metaphor
To gain further insights into the structural properties of NIDM processes, we now describe a data augmentation based on the hierarchical representations (10)–(11). To this purpose, it is worth recalling the so-called multiroom Chinese restaurant metaphor coined by Camerlenghi, Lijoi, and Prünster (2018), which can be adapted to NIDM processes. Suppose that there exists a restaurant which serves a finite number of dishes H, corresponding to iid draws from the diffuse P. The restaurant has infinitely many rooms; each room contains infinitely many tables and is associated with a single dish out of the H available from the menu. The first customer sits at one of the tables of the first room and selects a dish. The nth customer can either select a dish previously chosen by the other n − 1 customers or choose a new dish. In the former case, she will be seated in the room serving the dish of choice, either at a new table or at an existing one. If a new dish is chosen, she will sit in a new room and at a new table. An illustration of this generative scheme is depicted in Figure 2.
Recalling the notation used so far, the entries of θ^(n) = (θ_1, …, θ_n) represent the dishes eaten by the n customers of the restaurant, whereas the labels identifying the single tables and their respective frequencies are unobservable and, hence, latent quantities. Specifically, we consider the latent random variables T^(n) = (T_1, …, T_n), where T_i can be thought of as the label of the table at which the ith customer is seated. Recall that k distinct dishes are served at the restaurant, that is, there are k distinct values θ*_1, …, θ*_k having frequencies n_1, …, n_k, meaning that the total number of customers seated in room j or, equivalently, eating dish j, corresponds to the frequency n_j. Finally, each ℓ_j ∈ {1, …, n_j} represents the number of tables in room j where customers are seated, while each q_{jr} denotes the number of customers seated at table r in room j, so that Σ_{j=1}^k Σ_{r=1}^{ℓ_j} q_{jr} = Σ_{j=1}^k n_j = n. When H → ∞, the probability of observing a new table where the same dish is being served tends to zero, since ℓ_j → 1 for any j = 1, …, k. This in turn implies |ℓ| = Σ_{j=1}^k ℓ_j → k and, hence, each room will have only one table. We denote the |ℓ| distinct labels of the tables by T*_{j1}, …, T*_{jℓ_j}, having frequencies q_{jr} for r = 1, …, ℓ_j and j = 1, …, k. Thus, the joint augmented model for θ^(n) = (θ_1, …, θ_n) and T^(n) = (T_1, …, T_n) follows immediately from equation (11). Indeed, one has the following Corollary 3. The joint probability distribution of (θ^(n), T^(n), Π_{n,H}), where Π_{n,H} is the partition of [n] induced by θ^(n) through (12), is given by (17).
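The bookkeeping of rooms, tables, and occupancies can be sketched as follows. The seating weights used here (a constant weight for opening a new room and another for opening a new table) are deliberate placeholders: the actual probabilities implied by (17) depend on the chosen NIDM and are not reproduced in this illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

# Multiroom restaurant bookkeeping: rooms[j] holds the table occupancies
# q_j1, q_j2, ... of room j (one room per distinct dish), so that
# l_j = len(rooms[j]) and n_j = sum(rooms[j]).
# NOTE: alpha (weight of a new room) and beta (weight of a new table)
# are placeholder values, not the weights implied by (17).
def seat_customers(n, alpha=1.0, beta=0.5):
    rooms = []
    for _ in range(n):
        room_w = np.array([sum(r) for r in rooms] + [alpha], dtype=float)
        j = rng.choice(room_w.size, p=room_w / room_w.sum())
        if j == len(rooms):
            rooms.append([1])                  # new dish: new room and table
        else:
            table_w = np.array(rooms[j] + [beta], dtype=float)
            r = rng.choice(table_w.size, p=table_w / table_w.sum())
            if r == len(rooms[j]):
                rooms[j].append(1)             # same dish, new table
            else:
                rooms[j][r] += 1               # join an existing table
    return rooms

rooms = seat_customers(20)
```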
The baseline distribution of T^(n) is set equal to P for simplicity, but any other diffuse probability measure would work, since only the clustering mechanism implied by the table configuration is relevant for our purposes, not the actual labels. The representation in (17) thus suggests that, conditionally on the table configuration induced by T^(n), the predictive distribution for θ_{n+1} can be easily obtained. Moreover, given the previously observed values θ^(n), the table configuration can be drawn through a Gibbs sampler. For the sake of exposition, we do not attempt the full derivation of these conditional distributions; the interested reader may refer to Camerlenghi, Lijoi, and Prünster (2018) for a detailed discussion, though in a different setting.
Example 7 (NGG multinomial process, cont'd). Conditionally on the latent random variables T^(n), the predictive probabilities can be readily obtained from (17), and a relevant simplification occurs when considering NGG multinomial processes. In particular, the resulting conditional distribution depends on the table configuration only through the number of distinct tables ℓ_1, …, ℓ_k rather than on the table-specific frequencies q_{jr}. This is a major computational advantage, since we only need to sample k latent variables rather than n, as in general NIDM processes. Moreover, this conditional law is intimately related to (16), as it arises from the combination of an augmentation over the table configuration and a marginalization with respect to the latent variable U.
In the following, we expand the posterior characterization of Theorem 5 by conditioning also on the table configuration. The random variable U, conditionally on (θ^(n), T^(n)), is a nonnegative latent variable whose density function is available in closed form. Thus, conditionally also on T^(n), the latent variable U has the same structure as the one involved in the posterior derivation of NRMIs. Similar simplifications occur for the fixed-jump component, as summarized in the next corollary.
Hence, conditionally on the table configuration, the posterior structure of p̃_H closely resembles that of homogeneous NRMIs, for any finite value of H. Specifically, the random variables I_{jr} have the same distributional structure as the jumps in NRMI-based models.
Example 8 (NGG multinomial process, cont'd). Specializing Corollary 4 we obtain that (Σ_{r=1}^{ℓ_j} I_{jr} | θ^(n), T^(n), U) ∼ GAMMA(n_j − ℓ_j σ, κ + U), for j = 1, …, k, which depends on T^(n) only through the number of tables ℓ_1, …, ℓ_k. This representation is essentially the augmentation of equation (15) with respect to the number of tables. Note that when σ = 0, the posterior law of p̃_H becomes independent of the table configuration.
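The gamma representation of Example 8 is directly usable in a sampler. The sketch below draws the total fixed jump of each occupied component given U and the table counts; the frequencies and hyperparameter values are illustrative, not from the application.

```python
import numpy as np

rng = np.random.default_rng(3)

# Example 8: given U and the table counts l_1, ..., l_k, the total fixed
# jump of each occupied component of the NGG multinomial posterior is
# (sum_r I_jr | theta, T, U) ~ Gamma(n_j - l_j * sigma, kappa + U).
def total_fixed_jumps(counts, tables, sigma, kappa, U):
    counts = np.asarray(counts, dtype=float)   # n_1, ..., n_k
    tables = np.asarray(tables, dtype=float)   # l_1, ..., l_k
    shape = counts - sigma * tables            # requires n_j > l_j * sigma
    return rng.gamma(shape, 1.0 / (kappa + U))

jumps = total_fixed_jumps(counts=[6, 7], tables=[3, 2],
                          sigma=0.25, kappa=0.1, U=1.3)
```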

The INVALSI Dataset
We consider a publicly available dataset gathered by the INVALSI institute, a public research center for the assessment of the Italian education system. In particular, the 2016–2017 dataset we analyze is part of a national examination program conducted in Italy with the aim of "carrying out periodic and systematic checks on knowledge and skills of students", as declared in the official documentation of the INVALSI statistical service. INVALSI has devoted a great effort to quantifying the added value of a school on the basis of these data. The Bayesian framework is a natural choice when trying to combine multiple sources of information, that is, the schools. This can be accomplished through hierarchical nonparametric models, which enable a flexible borrowing of information between different institutions. A broad and systematic sociodemographic analysis is beyond the scope of this paper and, hence, we focus on the presentation of novel modeling strategies based on NIDM priors.
We focus on data related to 8th grade students from schools in the city of Bologna; more specifically, we consider the questions related to the comprehension of the Italian language. Having omitted a few observations for which covariates were not available, the resulting dataset comprises a total of N = 8126 observations (students), belonging to 84 educational institutions. The INVALSI test has 45 questions, and the performance of each student can be summarized by the proportion of correct answers. To ease the modeling process we take a logistic transformation of the original proportions, defining the score S_ij for the ith student in the jth school as

S_ij := logit( (number of correct answers of student i in school j + 1/2) / (number of questions + 1) ),

for i = 1, …, N_j and j = 1, …, 84, where N_j denotes the number of students in the jth school. In the above ratio, we added a small correction to the original proportions to avoid boundary issues. Such a transformation maps the original scores into R, and therefore is more amenable to classical linear modeling with Gaussian errors. Consistently with this, we model the scores as

S_ij = z_ij'γ + μ_j + ε_ij, with ε_ij iid N(0, σ²),

for i = 1, …, N_j and j = 1, …, 84, where z_ij = (1, z_ij1, …, z_ijp)' is a vector of student-specific covariates associated with a (p + 1)-dimensional vector of regression coefficients γ = (γ_0, γ_1, …, γ_p)'. Each vector z_ij encodes student-specific categorical variables, namely: the gender of the student, the education level of her/his father and mother (primary school, secondary school, etc.), the employment status of her/his father and mother, the regularity of the student (i.e., regular, behind schedule, etc.), and the citizenship (Italian, first-generation immigrant, etc.). Moreover, the coefficients μ_1, …, μ_84 represent the school effects or, in the terminology used so far, the added value of each school given a set of covariates, thus being the main quantity of interest in our analysis. Note that since the intercept term is included in z_ij, the coefficients μ_1, …, μ_84 are not identified. In practice, this is not a concern if inference is based on the "centered" set of parameters η_j = γ_0 + μ_j, for j = 1, …, 84, rather than on the original random effects.
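The score transformation above can be sketched in a few lines; the function name is illustrative.

```python
import numpy as np

def invalsi_score(correct, n_questions=45):
    """Score S_ij: logit of the corrected proportion of right answers,
    (correct + 1/2) / (n_questions + 1), which stays strictly inside
    (0, 1) even for all-wrong or all-right answer sheets."""
    p = (np.asarray(correct, dtype=float) + 0.5) / (n_questions + 1)
    return np.log(p) - np.log1p(-p)

scores = invalsi_score([0, 20, 45])
```

Note the symmetry of the correction: a perfect sheet maps to logit(45.5/46) = log(91), and an all-wrong sheet to −log(91).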
We aim at introducing a flexible nonparametric prior for the school effects μ_1, …, μ_84 that allows for: (i) borrowing of information across schools; (ii) arbitrary deviations from Gaussianity; and (iii) robustness under model misspecification. Gaussian mixture models are able to capture all three aspects, and we specifically let each μ_j, j = 1, …, 84, follow a mixture of H Gaussian kernels, as in (19). Selecting the appropriate number of mixture components H is typically a difficult task, and the BIC index, customarily employed in this framework, is not theoretically well justified here. Hence, overfitted mixture models can be exploited to circumvent this issue. Here we propose the usage of general NIDM processes, which amounts to placing the following prior specification on the parameters in (19): the kernel-specific parameters are iid from a diffuse probability measure P on R × R⁺, with mixing weights induced by a NIDM. By enlarging the class of priors from the Dirichlet multinomial to general NIDM multinomial processes we are essentially acting on the robustness requirement, that is, we are ensuring that the clustering mechanism is less affected by specific choices of the total mass c, whose specification is often critical. For this specific application, we employed in (19) a NGG multinomial process having jump intensity (3). The hyperparameters are set equal to c = 0.1, κ = 0.1, and σ = 0.87, combined with a very conservative upper bound H = 70 for the number of mixture components. The a priori effect of these values on the number of clusters K_{n,H} is depicted in Figure 3, where it is compared with the distributions induced by a Dirichlet multinomial process (c = 45.21), a Dirichlet process (c = 21.69), and a NGG process (c = 0.2, κ = 0.1, σ = 0.8), all having roughly the same expected value E(K_{n,H}) ≈ 35. As apparent from Figure 3, the parameter σ plays a crucial role in controlling the variability of K_{n,H}. Hence, one can obtain flat distributions for K_{n,H}, which leads to more robust inference on the cluster configurations. Such an effect persists a posteriori, as we shall discuss
later on, and is further corroborated by the extensive simulation study provided in the supplementary material. As for the choice of P, we assume the conditionally conjugate prior for the kernel parameters, with Gaussian means of variance σ²_μ and inverse gamma component variances, where we set σ²_μ = 0.1 and a_σ = 1.5, b_σ = 0.045. It is well known that the choice of these parameters is delicate, since it may influence posterior inference; this issue, however, affects both the Dirichlet and NGG specifications. To complete our Bayesian formulation we consider a multivariate Gaussian prior for the regression coefficients γ and an inverse gamma prior for the residual variance σ², letting γ ∼ N(b, B) and σ^{−2} ∼ GAMMA(1, 1), where we set b = 0 and B = diag(100, …, 100) to incorporate the neutral hypothesis of no relevant effects; the unit shape and rate induce a fairly noninformative prior on σ². Posterior inference was conducted through Gibbs sampling, whose details can be found in the supplementary material. For comparison, we also estimated the same model under a Dirichlet multinomial, a Dirichlet process, and a NGG prior, as in Figure 3. We ran the algorithm for 20,000 iterations after a burn-in period of 2000 iterations; the traceplots show no evidence against convergence and an excellent mixing.
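The prior comparison of Figure 3 is easy to reproduce by Monte Carlo for the Dirichlet multinomial benchmark: draw symmetric Dirichlet(c/H, …, c/H) weights, sample n labels, and count the occupied components. The value c = 45.21 echoes the one quoted in the text; the simulation size is arbitrary, and a careful replication would check that the Monte Carlo mean of K_{n,H} is close to the quoted value of about 35.

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo sketch of the prior law of K_{n,H}, the number of occupied
# components, under a Dirichlet multinomial process:
# pi ~ Dirichlet(c/H, ..., c/H), then n labels are drawn from pi.
def prior_clusters(n, H, c, n_sims=500):
    k = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        pi = rng.dirichlet(np.full(H, c / H))
        k[s] = np.unique(rng.choice(H, size=n, p=pi)).size
    return k

k_prior = prior_clusters(n=84, H=70, c=45.21)
```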
From Figure 3, it is apparent that the posterior of K_{n,H} is heavily influenced by the choice of the prior. In the Dirichlet multinomial and Dirichlet process cases, the number of mixture components peaks at around 29 and 24, respectively, as a consequence of the strongly informative prior distribution. Conversely, when the less informative NGG priors are used, the data adaptively select a much smaller number of components, peaking at around 4–6 mixture components. As expected, the infinite-dimensional NGG and the NGG multinomial behave in a similar fashion. Thus, if a simple Gaussian random effects model were employed, the random effects η_j would be overshrunk toward the global mean, potentially affecting the quality of the analysis. On the other hand, the large number of clusters induced by Dirichlet priors could lead to undershrinkage, resulting in a model that is overly adapted to the data.
To empirically validate these expectations, we also estimated a Gaussian random effects model, corresponding to H = 1, and a so-called "no pooling" model, in which no shrinkage is induced on the school effects η_j. All these models were compared in terms of the DIC, the WAIC (Gelman, Hwang, and Vehtari 2014), and the Bayesian leave-one-out criterion (LOO), computed as in Vehtari, Gelman, and Gabry (2017). These indexes, reported in Table 1, represent an overall goodness-of-fit measure and yield, as a byproduct, an estimate for the "effective number of parameters", which is a measure of model complexity. In addition, in Table 2 we report the posterior means of some (centered) school effects η_j − γ_0. According to the DIC, the WAIC, and the LOO, it is clear that a certain amount of shrinkage is required: the "no pooling" model is overly complex. Conversely, the Gaussian model is too parsimonious and overshrinks the school effects η_j toward the global mean. Mixture models lie between these two extremes, although a critical distinction between Dirichlet and NGG models is quite evident from Table 1 and Table 2.
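The WAIC figures of the kind reported in Table 1 can be computed from a matrix of pointwise log-likelihoods over the posterior draws; the sketch below follows the standard definition of Gelman, Hwang, and Vehtari (2014) on the deviance scale, where lower values are better.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale (lower is better), from an (S, N)
    matrix of pointwise log-likelihoods over S posterior draws.
    Returns (waic, p_waic), where p_waic is the estimate of the
    effective number of parameters (sum of pointwise variances)."""
    S, _ = log_lik.shape
    # log pointwise predictive density: log of the posterior-mean likelihood
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic), p_waic
```

A degenerate sanity check: a log-likelihood matrix that is constant across draws has p_waic = 0, so the WAIC reduces to minus twice the total log-likelihood.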
Specifically, Dirichlet priors are much closer to the "no pooling" model, displaying a very high number of clusters a posteriori and in turn inducing undershrinkage on the school effects. On the other hand, NGG specifications are more conservative, favoring a stronger shrinkage while allowing deviations from Gaussianity whenever needed. In other words, the findings in Table 1 and Table 2 imply that NGG specifications induce the right amount of shrinkage, suitably balancing flexibility and complexity. We finally investigated the impact of the covariates z_ij on the response variable S_ij. The posterior distributions of the associated coefficients γ are reported in the supplementary material. We found that all the considered covariates explain a considerable fraction of the variability of the response. In particular, we noticed a strong positive association between the education level of the parents and the students' performance on the test. Specific employments (e.g., executive positions) were also found to be positively associated with the outcomes. Unsurprisingly, one of the strongest predictors is the regularity of the studies, meaning that students who have fallen behind in their studies are less likely to obtain a high score on the INVALSI test.

Discussion
In this article we introduce a novel class of discrete random probability measures, termed NIDMs, and provide a complete toolbox for their use as priors in Bayesian inference. They are characterized by a finite number of atoms H and, importantly, homogeneous NRMIs are recovered as a limiting case when H → ∞. Although our results are general, in applications we focus on the NGG special case, which has appealing analytical and computational properties, representing a robust alternative to Dirichlet multinomial specifications when interest is in the number of clusters K_n. We conclude our presentation by offering a high-level comparison between NIDMs and allied models.
For moderate values of H, there are important differences between NIDMs and the corresponding NRMIs. An illustrative comparison between the Dirichlet multinomial and the Dirichlet process is given in Green and Richardson (2001), who show that the finite-dimensional specification leads to more "balanced" random partitions. Such a difference carries over to our setting, at least a priori, since the distribution of the weights (π_1, …, π_H) of a NIDM is symmetric. Hence, the balancedness of the random partition is a peculiar property of NIDMs, which might be desirable or not depending on the specific application.
Also for large H, NIDMs represent a viable alternative to the corresponding homogeneous NRMIs, especially from a computational perspective. Indeed, posterior computations in the latter case usually rely on truncations of the Ferguson and Klass (1972) representation (see, e.g., Arbel and Prünster 2017), which requires numerical approximations while implicitly reducing to a finite-dimensional model. Alternatively, one may use the slice sampling strategy described in Favaro and Teh (2013), which however requires simulating a further layer of latent variables "slicing" the infinite-dimensional model, potentially deteriorating the mixing. In contrast, the algorithm we propose for NGG multinomial priors allows one to draw iid samples from the posterior distribution of p̃_H given the data θ_1, …, θ_n.
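To make the computational point concrete, the Ferguson–Klass construction can be sketched for a σ-stable CRM, the one case in which the tail of the Lévy intensity inverts in closed form; for gamma or NGG intensities the inversion must be done numerically, which is precisely what makes the truncated representation costly in practice. The function name and parameter values are illustrative.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(5)

# Ferguson-Klass series: the ranked jumps of a CRM are J_i = N^{-1}(G_i),
# where G_1 < G_2 < ... are the arrival times of a unit-rate Poisson
# process and N(x) = nu([x, inf)) is the tail of the Levy intensity.
# For a sigma-stable CRM, N(x) = x^(-sigma) / Gamma(1 - sigma), so
# N^{-1}(g) = (Gamma(1 - sigma) * g)^(-1/sigma) in closed form.
def stable_fk_jumps(sigma, n_jumps):
    arrivals = np.cumsum(rng.exponential(size=n_jumps))
    return (gamma(1.0 - sigma) * arrivals) ** (-1.0 / sigma)

jumps = stable_fk_jumps(sigma=0.5, n_jumps=25)
# jumps are decreasing, so truncating the series keeps the largest ones
```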
Finally, mixture models based on NIDM processes lead to a quite different modeling framework compared to mixture of finite mixtures (MFM) models (Richardson and Green 1997; Nobile 2004). In the latter context, a Dirichlet multinomial process with a prior on H is employed, the inferential target often, albeit not always, being H itself. Instead, our focus is the number of occupied clusters K_n ≤ H, representing the number of nonempty mixture components. Although these two quantities are sometimes used interchangeably in applications, they are different objects. This distinction is particularly relevant in species sampling applications, where K_n represents the number of discovered species within the sample, whereas H represents the total number of species in the population. Using NIDMs for species discovery is an interesting research direction. Nonetheless, the MFM framework is a lively research topic, as testified by the recent contributions on asymptotic analysis (De Blasi, Lijoi, and Prünster 2013), on algorithmic aspects (Miller and Harrison 2018), and on applications to complex data structures (Legramanti et al. 2022). We believe that NIDMs may play a pivotal role also within the context of MFMs, after considering a suitable prior on H. Indeed, we envision that our theoretical findings would enable the derivation of marginal algorithms in the same spirit as Miller and Harrison (2018), while generalizing the current MFM framework. Finally, it is worth noting that, while the manuscript was being reviewed, other contributions have appeared proposing alternative modeling approaches that explicitly distinguish occupied clusters from mixture components; see, for example, Frühwirth-Schnatter, Malsiner-Walli, and Grün (2021), De Blasi and Gil-Leyva (2021), Nguyen et al. (2021), and Argiento and De Iorio (in press).

Figure 2. The multiroom Chinese restaurant metaphor: circles represent tables and bullets represent customers. The numbers of tables for these two rooms are (ℓ_1, ℓ_2) = (3, 2), so that |ℓ| = ℓ_1 + ℓ_2 = 5. The number of customers eating the first dish (i.e., first room) is n_1 = Σ_{r=1}^{ℓ_1} q_{1r} = 6, while the number of customers eating the second dish (i.e., second room) is n_2 = Σ_{r=1}^{ℓ_2} q_{2r} = 7.

Figure 3. A priori and a posteriori distributions of the number of clusters K_{n,H} in the INVALSI application, under a Dirichlet multinomial prior, a Dirichlet process, a NGG multinomial process prior, and a NGG process, with H = 70 and E(K_{n,H}) ≈ 35.

Table 1. DIC, WAIC, and LOO (deviance scale, lower is better) in the INVALSI application, defined as in Gelman, Hwang, and Vehtari (2014) and Vehtari, Gelman, and Gabry (2017). The quantities p_DIC, p_WAIC, and p_LOO are the associated estimates of the "effective number of parameters".

Table 2. Posterior means of the (centered) school effects η_j − γ_0, under different model specifications. Schools with IDs from 1 to 5 were randomly chosen from the set of 84 schools, whereas schools with IDs 6 and 7 were selected to better emphasize the over- and under-shrinkage effects.