Pseudo-Bayesian Classified Mixed Model Prediction

Abstract We propose a new classified mixed model prediction (CMMP) procedure, called pseudo-Bayesian CMMP, that uses network information in matching the group index between the training data and the new data whose characteristics of interest one wishes to predict. The current CMMP procedures do not incorporate such information; as a result, they are not consistent in terms of matching the group index. Although the current CMMP method can predict the mixed effects of interest consistently as the number of training data groups increases, its accuracy is not guaranteed when the number of groups is moderate, as is the case in many potential applications. The proposed pseudo-Bayesian CMMP procedure assumes a flexible working probability model for the group index of the new observation to match the index of a training data group, which may be viewed as a pseudo-prior. We show that, given any working model satisfying mild conditions, the pseudo-Bayesian CMMP procedure is consistent and asymptotically optimal both in terms of matching the group index and in terms of predicting the mixed effect of interest associated with the new observations. The theoretical results are fully supported by empirical studies, including Monte-Carlo simulations and real-data validation.


Introduction
Classified mixed model prediction (Jiang et al. 2018) is a new method developed out of traditional mixed model prediction that is particularly suitable for subject-level inference, such as in precision medicine and public health. For example, the National Research Council of the United States in 2014 defined precision medicine as the ability to classify individuals into subpopulations that differ in their susceptibility to a particular disease, in the biology and/or prognosis of those diseases they may develop, or in their response to a specific treatment. Preventive or therapeutic interventions can then be concentrated on those who will benefit, sparing expense and side effects for those who will not.
In spite of being a new idea in prediction with many potential applications, a key step in CMMP, that is, the matching between a class among the training data and an unknown class associated with the new/future observations, is based on a crude sample mean of the responses observed for the unknown class. In fact, as noted in Jiang et al. (2018, p. 273), "as m, the number of classes increases, the probability of identifying the true index I decreases, even in the matched case; thus, there is no consistency in terms of estimating I, even in the matched case." Here, I denotes the class index for the unknown class. However, in terms of prediction of the mixed effect of interest, CMMP is still consistent as m goes to infinity. This is because, as m increases, class matching becomes less important in terms of approximation to the mixed effect associated with the unknown class. In other words, even if one does not have the exact match, there is a high chance, as m increases, that one will find an approximate match in the sense that the corresponding mixed effect is close to the one associated with the unknown class. Thus, if prediction of the mixed effect is of primary interest, CMMP will perform well as long as m is large.
On the other hand, there are many practical situations where m, the number of classes, is not very large. This happens, for example, in the analysis of network data, where the number of communities identified within the network is typically not large, or only moderately large (see below). Empirical studies have shown (e.g., Sections 3 and 4) that CMMP may perform poorly in such situations, which is consistent with the statement of Jiang et al. (2018) noted earlier. It turns out that, in the case that m is small or moderately large, precision in matching is critically important. Having realized the importance of the precision of matching in CMMP when m is relatively small, Sun et al. (2018) proposed to incorporate covariate information in the matching. The authors assumed that there exist some class-level covariates that can be used in the matching. The modified CMMP procedure then focuses on matching the class-level covariates to those of the training data classes. However, for the latter method to work, one needs a (nearly) one-to-one correspondence between the class-level covariates and the random effects, which may not hold in practice. In fact, by definition, the random effects are supposed to be orthogonal to the covariates in a mixed effects model, so the strategy may also lack theoretical justification.
A main motivation for the current work comes from the analysis of network data, which has generated substantial interest in both research and applications. See, for example, Bickel and Chen (2009), McAuley and Leskovec (2012), Bickel and Sarkar (2016), Ma, Su and Zhang (2018), and Li, Shen and Pan (2020). One of the extensively studied topics in network analysis is community detection in networks. Such work and results are potentially useful in identifying the classes that we are concerned with. It should be noted that our primary interest is prediction of mixed effects, or future observations, using the class information. These classes are potentially closely related to the communities detected in the network. Thus, it is natural to consider using the existing work on community detection in CMMP. In fact, community information within a network has been used to improve prediction accuracy in precision epidemiology for infectious disease control (e.g., Keeling and Eames 2005; Ladner et al. 2019). As another example, it is known that social networks have effects on economics (e.g., Bailey et al. 2018); therefore, it may be possible to use network information in making predictions of characteristics of economic interest.
One feature of the network data is that the number of communities within the network is typically not very large. For example, in the Jazz musician network (Gleiser and Danon 2003; data available at www.redhotjazz.com), the number of communities known to exist is 3. In the political books network (Newman 2006; data available at www.orgnet.com), the number of known communities is also 3. In the Facebook friendship network (available at www.snap.stanford.edu), 11 communities have been identified (Ma et al. 2018). With such a small or moderate number of communities, the original CMMP method of Jiang et al. (2018) would not apply, if the communities are treated as the classes.
Furthermore, there is also a concern about whether the communities in a network and the classes in CMMP should match exactly. Typically, the classes in CMMP correspond to random effects under a mixed effects model. The random effects are associated with the means or proportions of some characteristics of interest. It is unlikely that these characteristics are solely determined by the community membership, even though the two may be associated. Therefore, uncertainty, or discrepancy, should be allowed in matching the communities to the classes. Due to such a consideration, we consider the following model for the data, which includes a working probability model for the class membership.
Suppose that we have a network with known communities. As mentioned, there has been extensive work on identification of network communities (e.g., Bickel and Sarkar 2016; Ma et al. 2018), so we can assume that the network communities are known. Suppose that the training data associated with the network satisfy a nested-error regression (NER; Battese, Harter and Fuller 1988) model:

$$y_{ij} = x_{ij}'\beta + \alpha_i + \epsilon_{ij}, \quad i = 1, \dots, m, \; j = 1, \dots, k_i, \qquad (1)$$

where $i$ represents the community, $k_i$ is the number of subjects in the training data that belong to community $i$; $y_{ij}$ is the outcome of interest, $x_{ij}$ is a vector of associated covariates, $\beta$ is an unknown vector of regression coefficients (the fixed effects), $\alpha_i$ is a community-specific random effect, and $\epsilon_{ij}$ is an error. It is assumed that the random effects and errors are independent with $\alpha_i \sim N(0, \sigma^2)$ and $\epsilon_{ij} \sim N(0, \tau^2)$, where $\sigma^2 > 0, \tau^2 > 0$ are unknown variances. We are interested in prediction of a mixed effect associated with a new subject. The mixed effect may be a conditional mean, proportion, or other characteristic, given the random effect associated with the subject. Suppose that the new subject belongs to a known community $c_n$. Here, the subscript $n$ stands for "new." The random effect associated with the new subject, however, is not entirely determined by $c_n$; it is subject to some uncertainty. This happens, for example, when the training data were collected from a previous time period, a network that has grown bigger, or smaller, or a network that is not exactly the same as the one relevant to the new subject. Consider the following working probability model. Let $\gamma_n$ denote the true class index of the new subject. Note that the words "class" and "community" do not necessarily mean the same thing, in that the former refers to that associated with the random effect while the latter to that associated with the network. For the training data, however, the classes match the communities, by assumption, but this is not necessarily true for the new subject. For now, let us assume that $\gamma_n$ is an unknown integer between 1 and $m$. This is called a matched case. Later we also consider the case that $\gamma_n$ does not match any of the integers between 1 and $m$. This is called an unmatched case (Jiang et al. 2018). We assume that there is a working probability model for $\gamma_n$:

$$\pi(i) = \pi(\gamma_n = i), \quad i = 1, \dots, m, \qquad (2)$$

where $\pi(\cdot)$ is a known probability distribution, which is not necessarily the true distribution of $\gamma_n$. For example, in connection with using the network information in the matching, one may consider the following working model:

$$\pi(c_n) = p, \qquad \pi(i) = \frac{1-p}{m-1}, \quad i \ne c_n, \qquad (3)$$

where $p$ is a given probability (see below); in other words, the pseudo-prior mass $p$ is placed on the known community index $c_n$, and the remaining mass is spread evenly over the other indexes. We call $\pi(\cdot)$ a pseudo-prior due to a role it plays in deriving our method in the sequel. Although (3) is a special case of (2), it is an important special case that motivates our method. In this special case, the community index to which the new subject belongs, $c_n$, is known. However, it is not necessarily equal to the true class index, $\gamma_n$. The pseudo-prior probability that $\gamma_n = c_n$ is $p$. Because, in a way, $c_n$ may be viewed as part of the data, the pseudo-prior (3) may be viewed as a conditional probability. In general, the pseudo-prior (2) could be conditioning on certain data other than $y$, the combined responses of the training data and new data (see below). However, for notational simplicity, the conditioning notation is suppressed [e.g., $\pi(\gamma_n = i)$ instead of $\pi(\gamma_n = i \mid c_n)$]. Similarly, the pseudo-posterior derived below may be viewed as conditioning on $y$ and $c_n$, in the case of (3), although the $c_n$ part is suppressed in notation [e.g., $P_\pi(\gamma_n = i \mid y)$ instead of $P_\pi(\gamma_n = i \mid y, c_n)$]. The asymptotic results, established later in this article, should be regarded as conditional on $c_n$. Also note that there is no randomness in $c_n$; in other words, $c_n$ is a known constant.
Furthermore, suppose that the outcomes of interest corresponding to the new subject, $y_{nj}$, $1 \le j \le k_n$, satisfy a similar NER model to (1), that is,

$$y_{nj} = x_{nj}'\beta + \alpha_{\gamma_n} + \epsilon_{nj}, \quad 1 \le j \le k_n, \qquad (4)$$

where $x_{nj}$ is the corresponding vector of covariates, and the $\epsilon_{nj}$'s are the new errors that are independent and distributed as $N(0, \tau^2)$, and are independent of $\alpha_{\gamma_n}$ and the $\alpha_i$'s and $\epsilon_{ij}$'s associated with the training data. Note that, given $\gamma_n = i$, (4) becomes

$$y_{nj} = x_{nj}'\beta + \alpha_i + \epsilon_{nj}, \quad 1 \le j \le k_n, \qquad (5)$$

so that the data separate into the independent blocks $y_1, \dots, y_{i-1}, (y_i, y_n), y_{i+1}, \dots, y_m$, where $y_i = (y_{ij})_{1\le j\le k_i}$ and $y_n = (y_{nj})_{1\le j\le k_n}$. The pdf of $y_u$ ($u \ne i$) is the $N(X_u\beta, V_u)$ density, where $X_u = (x_{uj}')_{1\le j\le k_u}$ and $V_u = \tau^2 I_{k_u} + \sigma^2 J_{k_u}$ ($I_k$, $J_k$ denote the $k \times k$ identity matrix and matrix of 1s, respectively). Similarly, given $\gamma_n = i$, the joint pdf of $(y_i, y_n)$ is the multivariate normal density with mean $(X_i\beta, X_n\beta)$ and covariance matrix

$$\Sigma_i = \begin{pmatrix} V_i & \sigma^2 J_{k_i \times k_n} \\ \sigma^2 J_{k_n \times k_i} & V_n \end{pmatrix}, \qquad (6)$$

where $X_n = (x_{nj}')_{1\le j\le k_n}$ and $J_{k \times l}$ denotes the $k \times l$ matrix of 1s; the off-diagonal blocks reflect the random effect $\alpha_i$ shared by $y_i$ and $y_n$. Combining the above results, and with some reorganization of terms, we obtain the joint pdf of $y$ given $\gamma_n = i$,

$$f_i(y) = \left\{\prod_{u \ne i} \phi(y_u; X_u\beta, V_u)\right\} \phi\{(y_i, y_n); (X_i\beta, X_n\beta), \Sigma_i\}, \qquad (7)$$

where $\phi(\cdot; \mu, V)$ denotes the $N(\mu, V)$ density, $k_\bullet = \sum_{u=1}^m k_u$ (so that the normalizing constant in (7) is $(2\pi)^{-(k_\bullet + k_n)/2}$), and $y = (y_1', \dots, y_m', y_n')'$. From (2) and (7), we obtain the pseudo-posterior distribution of $\gamma_n$:

$$P_\pi(\gamma_n = i \mid y) = \frac{\pi(i) f_i(y)}{\sum_{u=1}^m \pi(u) f_u(y)}, \quad 1 \le i \le m. \qquad (8)$$

The match of $\gamma_n$ to the training data groups is chosen as the pseudo-posterior mode, that is,

$$\hat\gamma_n = \arg\max_{1 \le i \le m} P_\pi(\gamma_n = i \mid y). \qquad (9)$$

Note that the denominator in (8) is not needed in obtaining $\hat\gamma_n$.
In case the maximizer of (8) or (9) is not unique, choose the one with the lowest index number.
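To make the matching step concrete, here is a minimal sketch of MPPM in Python under the model above. It is our own illustration, not code from the paper: the function names (block_cov, mppm_match) are hypothetical, and the log-densities follow the normal forms in (7), with terms common to all i dropped.

```python
import numpy as np
from scipy.stats import multivariate_normal

def block_cov(k, sigma2, tau2):
    """Covariance of one class of size k: tau^2 I_k + sigma^2 J_k."""
    return tau2 * np.eye(k) + sigma2 * np.ones((k, k))

def mppm_match(y_list, X_list, y_new, X_new, beta, sigma2, tau2, pi):
    """Maximum pseudo-posterior matching (MPPM), Equation (9).

    y_list, X_list : per-class training responses and covariate matrices
    y_new, X_new   : responses and covariates for the new subject
    pi             : pseudo-prior probabilities pi(1), ..., pi(m)
    Returns the matched class index (0-based).
    """
    m, k_n = len(y_list), len(y_new)
    # log-density of each training block alone, under N(X_u beta, V_u)
    base = [multivariate_normal.logpdf(y, X @ beta, block_cov(len(y), sigma2, tau2))
            for y, X in zip(y_list, X_list)]
    scores = np.empty(m)
    for i in range(m):
        k_i = len(y_list[i])
        # joint covariance of (y_i, y_n): the shared random effect alpha_i
        # makes this one big exchangeable block, as in (6)
        Sigma = block_cov(k_i + k_n, sigma2, tau2)
        mean = np.concatenate([X_list[i] @ beta, X_new @ beta])
        joint = multivariate_normal.logpdf(
            np.concatenate([y_list[i], y_new]), mean, Sigma)
        # log{pi(i) f_i(y)} up to an additive constant not depending on i
        scores[i] = np.log(pi[i]) + joint - base[i]
    return int(np.argmax(scores))  # ties: argmax returns the lowest index
```

Under the working model (3), `pi` would place mass p on the known community index c_n and (1 - p)/(m - 1) on each other index.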
Note. Although the procedure clearly resembles maximum posterior, or Bayesian, classification (e.g., Murty and Devi 2011), the set-up is not Bayesian. In fact, the distribution π in (2) is treated as a working model rather than a prior. In other words, there is no underlying assumption that the working model is the true distribution of γ_n. In this regard, the method is similar to Henderson's original derivation of BLUP, the best linear unbiased predictor (e.g., Jiang, Jia, and Chen 2001, p. 98). On the other hand, whenever there is additional information regarding the class index, γ_n, such as network information, it should be incorporated into the working model. As will be demonstrated, the choice of the working model does not make a difference asymptotically, but it may make a difference in terms of finite-sample performance. Due to its similarity to the Bayesian classifier, we call the proposed matching procedure maximum pseudo-posterior matching (MPPM).
Once $\hat\gamma_n$ is determined, the prediction of the new mixed effect is carried out as in CMMP (Jiang et al. 2018). Namely, given $\gamma_n = i$, the best predictor (BP), in the sense of minimum mean squared prediction error (MSPE), of $\theta_{nj} = x_{nj}'\beta + \alpha_{\gamma_n}$ is

$$\tilde\theta_{nj} = x_{nj}'\beta + \frac{\sigma^2\{k_i(\bar y_{i\cdot} - \bar x_{i\cdot}'\beta) + k_n(\bar y_{n\cdot} - \bar x_{n\cdot}'\beta)\}}{\tau^2 + (k_i + k_n)\sigma^2}, \qquad (10)$$

where $\bar y_{i\cdot} = k_i^{-1}\sum_{j=1}^{k_i} y_{ij}$, $\bar x_{i\cdot} = k_i^{-1}\sum_{j=1}^{k_i} x_{ij}$, and $\bar y_{n\cdot}$, $\bar x_{n\cdot}$ are the corresponding means over the new observations. The classified mixed effect predictor (CMEP) of $\theta_{nj}$, denoted by $\hat\theta_{nj}$, is given by the right-hand side of (10) with $i$ replaced by $\hat\gamma_n$, and $\beta, \sigma^2, \tau^2$ replaced by their estimators (e.g., REML estimators; see Jiang 2007, sec. 1.3.2) based on the training data. We call the new procedure pseudo-Bayesian CMMP, or PBCMMP, due to its connection to both MPPM and CMMP.
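As a companion to the matching sketch above, here is a minimal plug-in version of the CMEP, assuming the BP form displayed in (10); cmep_predict is again a hypothetical name of our own.

```python
import numpy as np

def cmep_predict(x_new, i_hat, y_list, X_list, y_new, X_new, beta, sigma2, tau2):
    """Classified mixed effect predictor: plug the matched index i_hat and
    (estimated) parameters into the best predictor (10)."""
    y_i, X_i = y_list[i_hat], X_list[i_hat]
    k_i, k_n = len(y_i), len(y_new)
    # residual sums from the matched training class and the new observations
    resid = np.sum(y_i - X_i @ beta) + np.sum(y_new - X_new @ beta)
    alpha_hat = sigma2 * resid / (tau2 + (k_i + k_n) * sigma2)  # E(alpha_i | y)
    return float(x_new @ beta + alpha_hat)
```

In practice, beta, sigma2, and tau2 would be replaced by their REML estimates from the training data, as described above.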
The difference between PBCMMP and CMMP lies in their strategies for matching the class index, γ_n, which is a key component of CMMP. The CMMP of Jiang et al. (2018) used a strategy that matched the sample mean of the new observations and the empirical BP (EBP), given the class index, of the mixed effect associated with the new observations (see Jiang et al. 2018, 3rd paragraph in sec. 7.2). The new PBCMMP uses the maximum pseudo-posterior idea, as noted above. As will be seen, the new matching strategy of PBCMMP, that is, MPPM, has both asymptotic and empirical superiority over the CMMP matching strategy. The superior matching strategy of PBCMMP does lead to (significantly) better predictive performance over CMMP, as we show both theoretically and empirically.
One can also make prediction of a future observation, say, $y_{nj}$, if it is unobserved. By (4), we have $y_{nj} = \theta_{nj} + \epsilon_{nj}$. Because the new error, $\epsilon_{nj}$, is independent of the training data, the BP of $y_{nj}$ is $E(y_{nj}|y) = E(\theta_{nj}|y) + E(\epsilon_{nj}|y) = E(\theta_{nj}|y)$. This shows that the BP of $y_{nj}$ is the same as the BP of $\theta_{nj}$; therefore, naturally, the PBCMMP of $y_{nj}$, denoted by $\hat y_{nj}$, is the same as $\hat\theta_{nj}$, the CMEP of $\theta_{nj}$ based on MPPM.
In Section 2, we establish the asymptotic superiority of MPPM mentioned above, as well as consistency of MPPM in terms of class-index matching. In Section 3, we establish consistency and asymptotic optimality of the CMEP based on MPPM. It should be noted that the consistency and asymptotic optimality of MPPM, and of the CMEP based on MPPM, hold for any working model π subject to some regularity conditions. In Section 4, we carry out extensive simulation studies to investigate the finite-sample performance of MPPM and the CMEP based on MPPM, and their comparisons with existing methods. The simulation results demonstrate, in particular, the superiority of MPPM and the CMEP based on MPPM over the CMMP of Jiang et al. (2018), as predicted by the theory. In Section 5, we discuss a measure of uncertainty for the proposed classified predictor. In Section 6, we consider an application to Facebook network data, and provide a real-data validation that further demonstrates the advantage of our new methods. Conclusion and discussion are offered in Section 7. Proofs and additional tables are deferred to the Supplementary Material.

Consistency and Asymptotic Optimality of MPPM
We first consider a simpler, but also realistic, situation (e.g., network data), where m, the number of classes in the training data, is bounded. Later we extend the result to the case that m increases with the sample sizes at an appropriate rate. Throughout this section, we assume a matched case, which means that γ_n matches one of the indexes 1 ≤ i ≤ m. This is reasonable because, otherwise, there is, of course, no consistency.

Consistency and Asymptotic Optimality When m is Bounded
We assume that a consistent estimator of $\beta$, $\hat\beta$, is available; however, for $\sigma^2, \tau^2$, we only assume that some estimators, $\hat\sigma^2, \hat\tau^2$, are available, which satisfy

$$a \le \hat\sigma^2 \le A, \qquad a \le \hat\tau^2 \le A, \qquad (11)$$

where $a, A$ are some known positive constants. Note that (11) can always be met by truncating the estimators of $\sigma^2$ and $\tau^2$. For example, one may choose $a$ as a small positive number (e.g., $10^{-6}$) and $A$ as a large positive number (e.g., $10^{6}$), and redefine the value of $\hat\sigma^2$ as $a$ if it is less than $a$, and as $A$ if it is greater than $A$; and similarly for $\hat\tau^2$. Also note that, for the consistency of $\hat\beta$, one does not need $m \to \infty$. In fact, $m \to \infty$ is necessary for the consistency of $\hat\sigma^2$ but not for that of $\hat\beta$ and $\hat\tau^2$. However, it is known that a consistent estimator of $\beta$ is available given any "working" estimators of $\sigma^2$ and $\tau^2$, which need not be consistent. For example, such a result was established earlier in the context of generalized estimating equations (GEE; Liang and Zeger 1986), and later in the context of generalized linear mixed models (GLMM; e.g., Jiang 1999; Jiang et al. 2001). Define $\bar y_{n\cdot} = k_n^{-1}\sum_{j=1}^{k_n} y_{nj}$ and $\bar x_{n\cdot} = k_n^{-1}\sum_{j=1}^{k_n} x_{nj}$. To distinguish MPPM based on different working models, let us denote the MPPM, (9), by $\hat\gamma_{n,\pi}$, where $\pi$ corresponds to the working model (2). Let $\gamma^*_n$ denote $\hat\gamma_{n,\pi}$ when $\pi$ is the true distribution of $\gamma_n$. The latter is not practically available, of course, but we can still involve it in theoretical studies. Define $\tilde\gamma_n = \arg\max_{1\le i\le m} f_i(y)$, the matcher based on the likelihood alone. Note that $\tilde\gamma_n$ does not depend on $\pi$ (therefore no index $\pi$ is needed).
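For instance, the truncation that guarantees (11) is a one-line operation; a minimal sketch, with the example bounds from the text:

```python
import numpy as np

def truncate_variance(est, a=1e-6, A=1e6):
    """Enforce condition (11): redefine the estimate as a if it is below a,
    and as A if it is above A."""
    return float(np.clip(est, a, A))
```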
Theorem 1 (consistency of MPPM and more). Suppose that, among other regularity conditions, the following holds: (iii) $\hat\beta$ is consistent, and (11) holds. Then, we have the following conclusions: (I) $P(\hat\gamma_{n,\pi} = \tilde\gamma_n) \to 1$ for any working model $\pi$, including the true distribution of $\gamma_n$. (II) MPPM is consistent for any working model $\pi$, that is, $P(\hat\gamma_{n,\pi} = \gamma_n) \to 1$. The proof of Theorem 1 is given in the supplementary material. If we apply the result to the special case that $\pi$ is the true distribution of $\gamma_n$, then we have the following result.
Corollary 1. Under the conditions of Theorem 1, we have, for any working model $\pi$,

$$P(\hat\gamma_{n,\pi} = \gamma^*_n) \to 1. \qquad (14)$$

Proof. Using conclusion (I) of Theorem 1, we have $P(\hat\gamma_{n,\pi} = \tilde\gamma_n) \to 1$ and $P(\gamma^*_n = \tilde\gamma_n) \to 1$; hence $P(\hat\gamma_{n,\pi} = \gamma^*_n) \to 1$.

Next, we establish asymptotic optimality of MPPM. To do so we first introduce a lemma that states the (exact) optimality of $\gamma^*_n$, which we call maximum posterior matching (MPM), even though it is not practically available. The optimality is in terms of minimizing the probability of mismatch, that is, $P(\check\gamma_n \ne \gamma_n)$, where $\check\gamma_n$ is any class matcher of $\gamma_n$.

Lemma 1 (optimality of MPM). For any class matcher $\check\gamma_n$, we have $P(\gamma^*_n \ne \gamma_n) \le P(\check\gamma_n \ne \gamma_n)$.

Although such a result is well known in the literature on the Bayesian classifier (e.g., Murty and Devi 2011), we were unable to find a simple proof. Thus, a proof is given below.

Proof. Note that, when $\pi$ is the true distribution of $\gamma_n$, the right-hand side of (8) is equal to $P(\gamma_n = i|y)$, the (true) conditional probability. Thus, by definition, $\gamma^*_n$ has the property that

$$P(\gamma_n = \gamma^*_n \mid y) = \max_{1\le i\le m} P(\gamma_n = i \mid y). \qquad (15)$$

Now, for any other class matcher, $\check\gamma_n$, we have

$$P(\check\gamma_n = \gamma_n) = E\{P(\gamma_n = \check\gamma_n \mid y)\}
= E\left\{\sum_{i=1}^m 1_{(\check\gamma_n = i)} P(\gamma_n = i \mid y)\right\}
\le E\left\{\max_{1\le i\le m} P(\gamma_n = i \mid y)\right\}. \qquad (16)$$

On the other hand, by replacing $\check\gamma_n$ with $\gamma^*_n$, and using property (15), it is seen that the same arguments of (16) hold with the inequality in the third line replaced by equality. Therefore, the right-hand side of (16) is equal to $P(\gamma^*_n = \gamma_n)$, and the result follows.
Because the MPM, $\gamma^*_n$, is optimal, in view of (14), it is not surprising that MPPM is asymptotically optimal. Let $N$ denote the different sample sizes associated with $y$, such as $k_i, 1 \le i \le m$, and $k_n$. Let $a_N$ be a sequence of positive numbers such that

$$a_N P(\hat\gamma_{n,\pi} \ne \gamma^*_n) \to 0. \qquad (17)$$

By Theorem 1 and Corollary 1, we have $P(\hat\gamma_{n,\pi} \ne \gamma_n) \to 0$ and $P(\hat\gamma_{n,\pi} \ne \gamma^*_n) \to 0$. Thus, for example, for any $0 < \lambda < 1$, $a_N = \{P(\hat\gamma_{n,\pi} \ne \gamma^*_n)\}^{-\lambda}$ is a sequence that satisfies (17). We have the following result.
Theorem 2 (asymptotic optimality of MPPM). Suppose that the conditions of Theorem 1 are satisfied. Let $\check\gamma_n$ be any other class matcher. Then, we have

$$\limsup a_N \{P(\hat\gamma_{n,\pi} \ne \gamma_n) - P(\check\gamma_n \ne \gamma_n)\} \le 0 \qquad (18)$$

for any sequence $a_N$ that satisfies (17).
Proof. We have, using Lemma 1,

$$P(\hat\gamma_{n,\pi} \ne \gamma_n) - P(\check\gamma_n \ne \gamma_n) \le P(\hat\gamma_{n,\pi} \ne \gamma_n) - P(\gamma^*_n \ne \gamma_n) \le P(\hat\gamma_{n,\pi} \ne \gamma^*_n).$$

Thus, by multiplying both sides by $a_N$, we obtain

$$a_N\{P(\hat\gamma_{n,\pi} \ne \gamma_n) - P(\check\gamma_n \ne \gamma_n)\} \le a_N P(\hat\gamma_{n,\pi} \ne \gamma^*_n) \to 0.$$

The result then follows by taking the lim sup on both sides.

Now we know that MPM is optimal and MPPM is asymptotically optimal. A question is how much difference there is between the two. The next result gives an upper bound on this difference in terms of the probability of match. To simplify the notation, write $P(i) = P(\gamma_n = i|y)$ and $P_\pi(i) = P_\pi(\gamma_n = i|y)$.

Theorem 3 (error probability bounds). The following inequalities hold:

$$0 \le P(\hat\gamma_{n,\pi} \ne \gamma_n) - P(\gamma^*_n \ne \gamma_n) \le E\left\{\max_{1\le i\le m} P_\pi(i) - P_\pi(\gamma^*_n)\right\} + E\left\{\max_{1\le i\le m} P(i) - P(\hat\gamma_{n,\pi})\right\}. \qquad (19)$$

Intuitively, the expression inside the first expectation on the right-hand side of (19) is the distance between the maximum pseudo-posterior and the pseudo-posterior at the MPM; similarly, the expression inside the second expectation on the right-hand side of (19) is the distance between the maximum posterior and the posterior at the MPPM. An implication is that what matters is how close the MPPM is to maximizing the posterior, and vice versa (how close the MPM is to maximizing the pseudo-posterior). For example, the pseudo-posterior does not need to be close to the posterior everywhere, as long as it is close to the posterior where the latter peaks; in other words, the posterior and pseudo-posterior peak around the same place. The proof of Theorem 3 is given in the supplementary material.

Consistency and Asymptotic Optimality When m → ∞
First note that some results from the previous subsection, namely Lemma 1 and Theorem 3, are not asymptotic, which means that they hold for fixed sample sizes. These results extend without change when m → ∞. The theorem below extends the asymptotic results, namely Theorems 1 and 2, in that it allows m to increase at a suitable rate. Define $k_* = \min_{1\le i\le m} k_i$ and $k^* = \max_{1\le i\le m} k_i$.
Theorem 4 (consistency and asymptotic optimality of MPPM when m → ∞). Suppose that (a) m → ∞ and there is a constant $d > 0$ such that the rate condition (13) holds; (b) $\bar x_{i\cdot}, 1 \le i \le m$, and $\bar x_{n\cdot}$ are bounded; and (c) $m^d(\hat\beta - \beta) \overset{P}{\to} 0$ for the same $d$, and (11) holds. Then, we have the following conclusions: (I) $P(\hat\gamma_{n,\pi} = \gamma^*_n) \to 1$ and $P(\hat\gamma_{n,\pi} = \gamma_n) \to 1$. (II) (18) holds for any other class matcher $\check\gamma_n$ and sequence $a_N$ satisfying (17).

Assumption (a) states, in particular, that the rate at which m increases is relatively slower than that of $k_n$, and even slower than those of the $k_i, 1 \le i \le m$. This is reasonable because, so far as this paper is concerned, the cases we are concerned with are such that m is much smaller than the $k_i$'s or $k_n$. The key ideas of the proof are similar to those of Theorem 1, but with a more careful evaluation of the bound for the log-ratio of selection probabilities and the lower bound for the distance between different random effects, taking into account that now m → ∞. Details of the proof are given in the Supplementary Material.
To conclude this section, we have shown (Theorems 2 and 4) that MPPM is asymptotically superior to any other class matcher, including the CMMP class matcher of Jiang et al. (2018) (see the paragraph below (10)), in terms of a smaller probability of mismatch under a suitable asymptotic framework. This theoretical result is fully supported by our empirical studies presented in Sections 4 and 6. Furthermore, this is the very reason for the superior performance of PBCMMP over CMMP, which we demonstrate both theoretically and empirically in the sequel.

Consistency and Asymptotic Optimality of $\hat\theta_{n,\pi}$
We now switch attention to the asymptotic behavior of the CMEP based on MPPM. Let $\theta_n$ denote a mixed effect associated with some new observations that satisfy (4). The mixed effect can be expressed as $\theta_n = x_n'\beta + \alpha_{\gamma_n}$. For example, $x_n$ may be one of the observed $x_{nj}$, in which case $\theta_n = \theta_{nj}$, defined above (10), but $x_n$ can also be a value not among the new observations $y_{nj}, x_{nj}, 1 \le j \le k_n$. The CMEP of $\theta_n$, denoted by $\hat\theta_{n,\pi}$, is given by (10) with $x_{nj}$ replaced by $x_n$, $i$ replaced by $\hat\gamma_n$, the MPPM defined by (9), and $\beta, \sigma^2, \tau^2$ replaced by $\hat\beta, \hat\sigma^2, \hat\tau^2$, respectively. We first establish consistency of the CMEP based on MPPM. Later, we study asymptotic optimality of the CMEP based on MPPM under a suitable framework.

Consistency When m is Bounded
Although Jiang et al. (2018) also established consistency of the CMEP based on their proposed class matcher, the result was proved under the assumption that m → ∞ (see Assumption A5 in the supplementary material of Jiang et al. 2018). Here, we consider consistency of the CMEP based on MPPM when m is bounded, which is also practical, for example, in some cases of network data.
Theorem 5 (consistency of CMEP when m is bounded). Suppose that $x_n$ is bounded. Then, under the assumptions of Theorem 1, the CMEP based on MPPM is consistent, that is, $\hat\theta_{n,\pi} - \theta_n \overset{P}{\longrightarrow} 0$. The result holds regardless of the choice of the working model $\pi$.
The proof of Theorem 5 is given in the supplementary material. It should be noted that, although we have established consistency of the CMEP of $\theta_n$, there is no consistency of the corresponding classified predictor of $y_n$, a future observation associated with $\theta_n$. This is because $y_n$ is subject to a new error, which has a nonvanishing variance; in other words, $y_n = \theta_n + \epsilon_n$, where $\epsilon_n \sim N(0, \tau^2)$ with $\tau^2 > 0$. Therefore, there is no way to estimate, or predict, $y_n$ consistently, due to the nonvanishing $\epsilon_n$.
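This point can be made precise by a standard decomposition: because $\epsilon_n$ is independent of everything used to compute $\hat y_n = \hat\theta_{n,\pi}$, the cross term vanishes and

```latex
E(\hat{y}_n - y_n)^2 = E(\hat{\theta}_{n,\pi} - \theta_n)^2 + \tau^2 ,
```

so the MSPE of $\hat y_n$ is bounded below by $\tau^2 > 0$ no matter how accurately $\theta_n$ is predicted.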

Consistency When m → ∞
In this case, there are two scenarios, the matched case and the unmatched case (Jiang et al. 2018). In the matched case, it is assumed that the true class index of the new observation, γ_n, belongs to {1, ..., m}, the set of indexes associated with the training data classes. This is the case that we have considered, so far, in studying the asymptotic behaviors. Note that consistency in terms of the class matching is only possible in the matched case. In the unmatched case, γ_n does not belong to the above index set. There is no matching consistency, of course, in the unmatched case. Nevertheless, we can still establish consistency of the CMEP in the unmatched case. The latter result was also obtained in Jiang et al. (2018).
Theorem 6 (consistency of CMEP in the matched case). Suppose that $x_n$ is bounded. Then, under the assumptions of Theorem 4, the CMEP based on MPPM is consistent, that is, $\hat\theta_{n,\pi} - \theta_n \overset{P}{\longrightarrow} 0$, regardless of the choice of the working model $\pi$.
The proof of Theorem 6 is given in the supplementary material.
We now consider the case that $\gamma_n \notin \{1, \dots, m\}$. This means that the random effect corresponding to the new observations does not match any of the random effects associated with the training data. Such a case was considered in Jiang et al. (2018) and Sun et al. (2018). Of course, in this case, there is no consistency in terms of matching the class index; however, it was shown (e.g., Jiang et al. 2018) that, as long as m → ∞, the CMEP of $\theta_n$ (based on the mismatched class index) is still consistent. The rationale is that, although there is no exact match of the class index, since m is large, there is always some $\alpha_i$ that comes close to $\alpha_{\gamma_n}$, which is all that matters so far as consistency of the CMEP is concerned. The following theorem states that a similar result holds for the CMEP based on the MPPM.
Theorem 7 (consistency of CMEP in the unmatched case). Suppose that $\alpha_{\gamma_n}$ is independent of $\alpha_i, 1 \le i \le m$, and $\alpha_{\gamma_n} = O_P(\log m)$. Then, under the assumptions of Theorem 6, we have $\hat\theta_{n,\pi} - \theta_n \overset{P}{\longrightarrow} 0$; that is, the CMEP based on the MPPM is consistent, regardless of the choice of the working model $\pi$.
Note that, because $\gamma_n$ does not match any of the indexes 1 ≤ i ≤ m, it is reasonable to assume that $\alpha_{\gamma_n}$ is independent of $\alpha_i, 1 \le i \le m$. The additional assumption, $\alpha_{\gamma_n} = O_P(\log m)$, takes into account the fact that $\gamma_n$ is considered a random index. Note that if, instead, $\gamma_n$ is a fixed index, then we have $\alpha_{\gamma_n} = O_P(1)$ provided that $E(|\alpha_i|) < \infty$ for every $i$. To see a case where $\gamma_n$ is random, consider the following example: if the new subject is drawn from a pool of unmatched classes whose size grows polynomially in m, with independent $N(0, \sigma^2)$ random effects, then $\alpha_{\gamma_n}$ is bounded by the maximum of those effects, which is $O_P(\sqrt{\log m}) = O_P(\log m)$.
The proof of Theorem 7 is given in the supplementary material.

Asymptotic Optimality
Let us focus on the matched case with m → ∞. To simplify the notation, write $\gamma = \gamma_n$, $\theta = \theta_n$, and $\hat\theta_\pi = \hat\theta_{n,\pi}$. Our approach is to first establish a lower bound for a normalized limit of the MSPE of a classified predictor, that is, a predictor of $\theta$, $\check\theta$, based on a class matcher $\check\gamma$. We then show that the normalized MSPE of $\hat\theta_\pi$ attains the asymptotic lower bound. In a way, the approach is similar to the Cramér-Rao lower bound and the asymptotic optimality of the maximum likelihood or Bayes estimators (e.g., Lehmann and Casella 1998, chap. 6). Recall the notation $b_{ij}$ defined in Theorem 1, and the means $\bar y_{i\cdot}, \bar x_{i\cdot}, \bar y_{n\cdot}, \bar x_{n\cdot}$ defined in Sections 1 and 2.1.

Theorem 8. Suppose that (11) holds and there is a constant $h > 1$ such that, as m → ∞, a collection of rate conditions are satisfied, involving constants $g_1 > (2h-1)/4$, $g_2 > 2g + (5/2)h + 7/4$, and $g_3 > g + h/2 + 7/4$. Then, for any sequence $b_N$ satisfying the conditions in (I), the normalized MSPE of $\hat\theta_\pi$ attains the asymptotic lower bound established in (I). The result holds regardless of the choice of $\pi$.
Note that, if $\hat\beta$ is a (consistent) estimator of $\beta$ based on the training data, its accuracy, or effective sample size, is typically much higher than the number of classes, m. In fact, under some regularity conditions, we have $E(|\hat\beta - \beta|^2) = O(n^{-1})$, where $n = \sum_{i=1}^m k_i$ is the total sample size. Therefore, the conditions involving $E(|\hat\beta - \beta|^2)$ are expected to hold. The rest of the conditions concern the relative rates at which $m$, $k_*$, $k^*$, and $k_n$ increase. For the most part, they require that $k_*$ and $k_n$ increase at a (much) faster rate than $m$, and that $k^*$ increases at a faster rate than $k_n$. The conditions seem reasonable, at least, in applications to network data. The proof of Theorem 8 is given in the supplementary material.

Simulation Studies
We carry out a series of Monte-Carlo simulation studies to investigate the finite-sample performance of the proposed PBCMMP method and compare it with existing methods, including the CMMP method of Jiang et al. (2018) and the standard regression prediction (RP) method. We begin with an example under the same simulation setting as Jiang et al. (2018). We then study a number of more complex situations, including more covariate predictors, different working models, and noise in the random effects so that an exact match does not exist. The section concludes with a summary.
Let $c_n = 1$; that is, the new subject is thought to belong to the first community (i = 1), but there is a chance that this may be wrong. The true index, $\gamma_n$, satisfies (3), where the true value of p is 0.85. However, we pretend that this is unknown, and two proposed values of p are considered: 0.75 and 0.9. The following combinations of sample sizes are considered: m = 10 with k = 10, 50, 100; and m = 20 with k = 50, 100, 200, 400, where k denotes the common size of the training data classes. For each of the combinations, we consider G = 0.1, 1, where G = σ² is the variance of the random effects, and k_n = 1, 10 for the cases of m = 10, and k_n = 1, 10, 50 for the cases of m = 20, resulting in a total of 36 combinations of m, k, G, k_n.
There are two objectives of interest: identification of the true index, $\gamma_n$, and prediction of the true mixed effect, $\theta_n = 1 + 2x_{1,n} + 3x_{2,n} + \alpha_{\gamma_n}$. As in Jiang et al. (2018), unknown parameters are replaced by their REML estimators based on the training data. We run 100 simulations under each combination of m, k, G, k_n, and p values specified above, and report (i) the empirical MSPE, that is, the average of $(\hat\theta_n - \theta_n)^2$ over the simulation runs, where $\hat\theta_n$ corresponds to PBCMMP, CMMP, or RP; (ii) ratios of the empirical MSPEs, that is, $R_{c/p}$, the empirical MSPE of CMMP divided by that of PBCMMP, and $R_{r/p}$, the empirical MSPE of RP divided by that of PBCMMP; and (iii) the empirical probability of correct matching, that is, the proportion of times that the class index is matched correctly, for PBCMMP (PCM_p) and CMMP (PCM_c). The results for p = 0.75 are presented in Table 1; the results for p = 0.90 are deferred to the supplementary material. The numbers in the parentheses are empirical standard deviations for the empirical MSPE.
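For concreteness, the following is a minimal sketch of one simulation replication under this setting, reusing the hypothetical mppm_match and cmep_predict functions sketched in Section 1. The covariate distributions are our own simplifying assumption, and we plug in the true β, σ², τ² rather than REML estimates to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(2024)

def one_rep(m=10, k=10, k_n=10, G=1.0, tau2=1.0, p_true=0.85, p_work=0.75):
    """One simulation replication: generate training and new data,
    match via MPPM under working model (3), and predict theta_n."""
    beta = np.array([1.0, 2.0, 3.0])           # intercept and two slopes
    alpha = rng.normal(0.0, np.sqrt(G), m)     # community random effects
    # training data: x_ij = (1, x1, x2), assumed standard normal covariates
    X_list = [np.column_stack([np.ones(k), rng.normal(size=(k, 2))]) for _ in range(m)]
    y_list = [X @ beta + a + rng.normal(0, np.sqrt(tau2), k)
              for X, a in zip(X_list, alpha)]
    # true class of the new subject: c_n = 0 with prob p_true, else another class
    c_n = 0
    gamma_n = c_n if rng.random() < p_true else int(rng.integers(1, m))
    X_new = np.column_stack([np.ones(k_n), rng.normal(size=(k_n, 2))])
    y_new = X_new @ beta + alpha[gamma_n] + rng.normal(0, np.sqrt(tau2), k_n)
    # pseudo-prior (3) with the (possibly misspecified) working value p_work
    pi = np.full(m, (1 - p_work) / (m - 1))
    pi[c_n] = p_work
    i_hat = mppm_match(y_list, X_list, y_new, X_new, beta, G, tau2, pi)
    x_n = X_new[0]                              # predict theta at the first new point
    theta_hat = cmep_predict(x_n, i_hat, y_list, X_list, y_new, X_new, beta, G, tau2)
    theta_true = x_n @ beta + alpha[gamma_n]
    return i_hat == gamma_n, (theta_hat - theta_true) ** 2

hits, errs = zip(*(one_rep() for _ in range(100)))
print("prob. correct matching:", np.mean(hits), " empirical MSPE:", np.mean(errs))
```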
The ratio of MSPEs is a measure of relative efficiency comparing two predictors, with a ratio greater than 1 indicating that the denominator method (i.e., PBCMMP) is more efficient than the numerator method (CMMP or RP). It is seen that the majority of the ratios for CMMP are greater than 1, some much greater than 1; and all of the ratios for RP are greater than 1, some much greater than 1. This suggests that, overall, PBCMMP performs significantly better than CMMP and RP in terms of predictive performance. It is interesting to note that CMMP does not always perform better than RP, while PBCMMP does. It is also seen that, in terms of the probability of correct matching, PBCMMP performs much better than CMMP, which may explain the better predictive performance of PBCMMP.
Table 1 of the supplementary material shows very much the same pattern for p = 0.90. In fact, a similar pattern is found much more broadly, not just for the values of p presented here. For example, Figure 1 presents two plots of the ratio of the empirical MSPE of CMMP over that of PBCMMP, that is, $R_{c/p}$, for the cases of m = 20, k_n = 1. It is seen that, as long as p is not very small, the ratio is (well) above 1, especially when k is large.

More Covariate Predictors
In this subsection, we extend the model of the previous subsection by adding more covariates. The extended model can be expressed as

$$y_{ij} = \beta_0 + \beta_1 x_{1,ij} + \beta_2 x_{2,ij} + \beta_3 x_{3,ij} + \beta_4 x_{4,ij} + \alpha_i + \epsilon_{ij}, \quad i = 1, \dots, m, \; j = 1, \dots, k,$$

where $x_{1,ij}$ and $x_{2,ij}$ are generated from N(0, 1), $x_{3,ij}$ is generated from Bernoulli(0.5), and $x_{4,ij}$ is generated from Uniform(0, 1). The x's are then fixed throughout the simulation. The rest of the settings are the same as in the previous subsection. The results are presented in Tables 2 and 3 of the supplementary material. It is seen that the observed patterns are very much the same as those of Table 1.

Different Working Models
In this subsection, we consider a different type of working model from the one considered earlier. We compare PBCMMP with CMMP as well as with an ideal matching strategy, in which the working model is the true distribution of $\gamma_n$. The latter is not available, of course, in practice, but the comparison gives us some idea of the relative efficiency of a working model compared with the true distribution (see Theorem 3). Specifically, we let m = 20; the true $\gamma_n$ is generated from the 1 + Binomial(9, 0.01) distribution, while the working model is a different distribution on {1, ..., m}. Other settings are the same as in Section 4.1. The results are reported in Table 2, where BCMMP represents PBCMMP using the true distribution of $\gamma_n$ as the working model, and P_b, P_p, and P_c denote the empirical probabilities of correct matching for BCMMP, PBCMMP, and CMMP, respectively.
The big picture is quite similar to what has been observed in the previous two subsections. It is also seen that the matching probability under the true distribution of $\gamma_n$, which corresponds to $\gamma^*_n$ in, say, Lemma 1, is the highest in most cases. This is consistent with Lemma 1. It is also observed that the matching probability under the working model, which corresponds to $\hat\gamma_{n,\pi}$ in, say, Theorem 2, comes quite close to that of $\gamma^*_n$, and the two probabilities, P_b and P_p, are both much higher than P_c. These findings are consistent with the theory we have established, namely Theorems 2 and 4.

Noise in Random Effect
Finally, we consider a situation where there is no exact match between the random effect associated with the new observations and a random effect associated with the training data. Specifically, noise is added to the random effect of the new observations so that an exact match does not exist. This is practical in some situations. The simulation setting is the same as that of Section 4.1 except that the true random effect of the new observations is $\alpha_{\gamma_n} + e_n$, where $e_n$ is an additional noise distributed as N(0, G/10). It follows that the standard deviation of the noise is about 1/3 of that of the true random effect. The results for p = 0.9 are presented in Table 3; those for p = 0.75 are deferred to Table 4 of the Supplementary Material. Although there is no exact match of the random effect in this case, we can still obtain the empirical probability of matching the main part of the random effect, $\alpha_{\gamma_n}$, which we call the probability of approximate matching, denoted by PAM_p and PAM_c for PBCMMP and CMMP, respectively.
Once again, the results follow the same patterns that have been observed in the previous subsections; in particular, the probability of approximate matching behaves similarly to the probability of correct matching observed previously.

Summary
To summarize this section, we focus on the comparison between PBCMMP and CMMP. There are a total of 7 different scenarios, including those presented in the Supplementary Material. Denote the two cases in Section 4.1 corresponding to p = 0.75 and p = 0.90 by I-1 and I-2; the two cases in Section 4.2 corresponding to p = 0.75 and p = 0.90 by II-1 and II-2; the scenario of Section 4.3 by III; and the two cases in Section 4.4 corresponding to p = 0.75 and p = 0.90 by IV-1 and IV-2, respectively. It is seen that the minimum MSPE ratio is less than one, with the smallest about 0.43; from the first quartile (Q1) on, all of the summaries of the MSPE ratio are greater than one, with most of the maxima over 19. Overall, PBCMMP is seen to have a significant advantage over CMMP, both in terms of the probability of correct (or approximate) matching and in terms of the MSPE of the prediction.

Measure of Uncertainty
A standard measure of uncertainty, in the context of prediction, is the MSPE. Jiang and Torabi (2020) proposed a method of MSPE estimation, called Sumca, that is applicable to a broad range of problems involving complex predictors, including the current PBCMMP. In fact, in a similar application, Sun et al. (2018) applied the Sumca method to classified mixed logistic model prediction. The method draws on ideas from small area estimation (e.g., Rao and Molina 2015), where estimation of the area-specific MSPE has been extensively studied. Two of the main approaches are the Prasad-Rao linearization method (P-R; Prasad and Rao 1990) and resampling methods; see Rao and Molina (2015) for details. Sumca puts together the best parts of the two approaches. Specifically, it uses an analytic expression to compute the leading term of the MSPE estimator, which is similar to the P-R method, and a Monte-Carlo method to evaluate a lower-order, bias-correction term, which is similar to the resampling methods. See Jiang and Torabi (2020) for details. An alternative method is double bootstrapping (DB; Hall and Maiti 2006), one of the resampling methods noted above. DB is known to be computationally very intensive. On the theoretical side, both Sumca and DB are known to produce second-order unbiased MSPE estimators, that is, the order of the bias is o(m⁻¹).
A simulation study is carried out to evaluate the performance of the two MSPE estimators, Sumca and DB, under the setting of Section 4.1. We consider a case of relatively small sample sizes with m = k = k_n = 10, and a case of relatively large sample sizes with m = 20, k = 200, and k_n = 50. In each case, G = 0.1 and G = 1 are considered. We use the Monte-Carlo sample size K = 50 for computing the Sumca estimator; for DB, we use K_1 = K_2 = 50 as the bootstrap sample sizes for the two stages of DB. As noted, DB is computationally very intensive. Although for Sumca we have no computational difficulties even with larger K, we intentionally keep K, K_1, K_2 the same so that the results are more comparable when computational costs are put aside (see below).
Table 5 presents the percentage relative bias, defined as $\%\mathrm{RB} = 100 \times \{E(\widehat{\mathrm{MSPE}}) - \mathrm{MSPE}\}/\mathrm{MSPE}$, where MSPE is the true simulated MSPE and $E(\widehat{\mathrm{MSPE}})$ is the simulated mean of the MSPE estimator, either Sumca or DB. The %RB is a standard measure of performance for MSPE estimation (e.g., Jiang and Torabi 2020). The results are based on 100 simulation runs.
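In code, the %RB computation is a one-liner; a minimal sketch (function name ours):

```python
def percent_rb(mean_mspe_est, true_mspe):
    """Percentage relative bias of an MSPE estimator:
    100 * {E(MSPE_hat) - MSPE} / MSPE."""
    return 100.0 * (mean_mspe_est - true_mspe) / true_mspe
```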
It is seen that the performance of both Sumca and DB improves as G increases. Also, the performance of both Sumca and DB seems to get worse when m increases, especially when G is small. One explanation is that, when the sample sizes get larger, or when G gets smaller, the actual MSPE, which serves as the denominator of %RB, decreases. Therefore, more accuracy is required in the MSPE estimation, which drives the numerator, in order to keep the %RB small (in absolute value). As for the comparison between Sumca and DB, it is seen that Sumca performs better in most cases. One should also be reminded that DB is computationally much more expensive than Sumca. Roughly speaking, the computing time for DB is about 15 times that of Sumca in the current simulation study.

Prediction with Facebook Network Data
As a real-data validation, we apply our proposed method to large social network data regarding Facebook users (available at http://snap.stanford.edu/data/ego-Facebook.html). A node in the network represents a user and an edge a friendship between two users. From Facebook we obtained user profile information and network data from 10 ego-networks, consisting of 4039 nodes and 88,234 edges. For each node, feature vectors have been provided with their interpretations obscured. For instance, where the original dataset may have contained a feature "political=Democratic Party," the anonymized data simply contain "political=anonymized feature 1." Therefore, by using the anonymized data it is possible to determine whether two users have the same political affiliation, but not what their individual political affiliations represent. In this dataset, a feature is "1" if the user has this property in the profile, and "0" otherwise.
The data have been analyzed by several authors; see McAuley and Leskovec (2012), Bickel and Sarkar (2016), and Ma et al. (2018), among others. However, the focus was on the problems of determining the number of communities or clusters, and of identifying the users' social circles within the Facebook network. Here, we use the data to demonstrate the effectiveness of PBCMMP in predicting the number of friendships for a Facebook user, assuming, of course, that the latter is unknown. The PBCMMP is based on the working model (3) with p = 0.75 or p = 0.95. Because the dimension of the feature vectors differs across ego-networks, and each feature value is either 0 or 1, the proportion of features with value 1, that is, the number of features equal to 1 divided by the dimension of the feature vector, is used as a covariate. The outcome of interest is the log-transformed number of friendships, that is, the number of edges for each node.
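As an illustration, the covariate and the outcome described above can be constructed along the following lines; the function names and the assumed binary feature-matrix layout are ours, not a prescription from the paper.

```python
import numpy as np

def node_covariate(feat_matrix):
    """Proportion of 0/1 profile features equal to 1, per node.
    feat_matrix: (n_nodes, n_features) binary array for one ego-network."""
    return np.asarray(feat_matrix).mean(axis=1)

def node_outcome(degrees):
    """Log-transformed number of friendships (edges) per node."""
    return np.log(np.asarray(degrees, dtype=float))
```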
To assess the predictive performance of PBCMMP, and its comparison with CMMP and RP, we randomly selected 10% of the data from each community (see below) as testing data; the remaining 90% of the data were used as training data. The testing data has size 404 and the training data 3635. It is widely believed that there are 10 communities within the network associated with the Facebook data (e.g., McAuley and Leskovec 2012; Bickel and Sarkar 2016). Those 10 communities were used to divide the training data into m = 10 classes. For the testing data, the community information is known, but it is only used as a "prior" according to the description of our PBCMMP method [see Section 1, above (2)]. Table 6 reports the average squared prediction errors, that is, the average of the squared prediction errors over the subset of testing data in each community, for the three competing methods: PBCMMP (with p = 0.75 and p = 0.95, respectively), CMMP, and RP. Note that these are true prediction errors, which is a more convincing measure of predictive performance than the empirical MSPE based on simulation studies (see Section 4). The results clearly demonstrate the superiority of PBCMMP over CMMP and RP. It is remarkable that, in most cases, the average squared prediction error is zero (which means zero in every single case that was predicted). It is also seen that there is no difference, in most cases, under the different values of p for PBCMMP, p = 0.75 or p = 0.95. Also reported in Table 6 (in the parentheses) are the proportions of correctly matched class indexes for PBCMMP and CMMP. Note that, for the testing data, the class indexes are known, being the same as the community indexes. However, we pretend that this is unknown to us. Instead, we use the working probability model, (3), with either p = 0.75 or p = 0.95 as the tuning parameter, to carry out the PBCMMP. In the end, we can find out exactly how many class indexes are correctly determined using either PBCMMP or CMMP. It is seen that the matching proportion of PBCMMP is nearly perfect; in contrast, the matching proportion of CMMP is relatively poor. This explains the difference in the average squared prediction errors between PBCMMP and CMMP; namely, PBCMMP predicts better by matching correctly. Again, there is almost no difference between the different values of p for PBCMMP, p = 0.75 or p = 0.95.
It is also seen that, in terms of predictive performance, PBCMMP with p = 0.95 performs (much) better than RP for all communities; PBCMMP with p = 0.75 performs (much) better than RP for most communities (8 out of 10). On the other hand, CMMP does not perform better than RP for most communities (8 out of 10). Therefore, this real-data example fully demonstrates the power of PBCMMP.

Conclusion and Discussion
We have developed a pseudo-Bayesian strategy, called PBCMMP, that significantly improves the efficiency of CMMP. The strategy has the flexibility of using a working prior, typically chosen based on knowledge about the connection between the classes in the training data and the potential class of the new data. The superiority of PBCMMP is demonstrated theoretically via the established theory of consistency and asymptotic optimality, both in terms of class matching and in terms of prediction of the mixed effect associated with the new data; these results hold regardless of the choice of the working prior, subject to mild regularity conditions. It should be noted that such results of asymptotic analysis are rarely seen in the context of mixed effects models (e.g., Jiang 2007; McCulloch, Searle and Neuhaus 2008; Demidenko 2013; Jiang 2017).
The theoretical results are fully supported by the results of extensive simulation studies, in which we compare the finite-sample performance of PBCMMP with that of CMMP as well as the standard regression predictor (RP). A real-data application to the Facebook social network illustrates the striking improvement in prediction accuracy of PBCMMP over CMMP and RP.
A major advantage of the new PBCMMP method over the existing CMMP methods is that it enjoys double consistency, that is, consistency in terms of the class matching and consistency in terms of prediction of the mixed effect associated with the new data, and this property holds whether the number of classes in the training data, m, is bounded or not. In contrast, all of the previous CMMP methods only possess single consistency, in terms of the mixed effect prediction, under the assumption that m increases. The double consistency of PBCMMP not only improves the predictive performance of CMMP; as a by-product, it is also capable of correctly identifying the class index for the new observations, which in some cases may be of interest as well. Furthermore, in some cases, such as network data, the number of classes, m, is fairly small, or at most moderate. The previous CMMP methods do not have guaranteed performance in such situations, while our new PBCMMP method does, as we have demonstrated both theoretically and empirically. Furthermore, we have established asymptotic optimality of PBCMMP both in terms of the class matching and in terms of the prediction of the new mixed effect. This kind of theoretical result is rarely seen in the context of mixed model analysis. Basically, the asymptotic optimality assures that PBCMMP is asymptotically the best among the class of classified predictors that one could possibly get.
One area that deserves further research, both theoretically and empirically, is the MSPE estimation. Although the Sumca and DB methods are both known to be second-order unbiased, our simulation results have not found their finite-sample performance to be fully satisfactory; in particular, the relative bias can be substantial when G is small (see Section 5).

Table 2. Empirical MSPE, ratio of MSPE, and probability of correct matching.

Table 4. Mean probability of matching and six-number summary of the MSPE ratio.

Table 6. Average squared prediction error (proportion of correct matching).