Prediction Using Many Samples with Models Possibly Containing Partially Shared Parameters

Abstract: We consider prediction based on a main model. When the main model shares some parameters with several other helper models, we make use of the additional information. Specifically, we propose a Model Averaging Prediction (MAP) procedure that takes into account data related to the main model as well as data related to the helper models. We allow the data related to different models to follow different structures, as long as they share some common covariate effect. We show that when the main model is misspecified, MAP yields the optimal weights in terms of prediction. Further, if the main model is correctly specified, then MAP will automatically exclude all incorrect helper models asymptotically. Simulation studies are conducted to demonstrate the superior performance of MAP. We further implement MAP to analyze a dataset related to the probability of credit card default.


Introduction
Prediction is often the goal of statistical analyses. Based on a statistical model and existing data with both covariates and responses, that is, the labeled data, the usual practice is to estimate the unknown components of the model and use the resulting model to predict the response associated with a new set of covariates, that is, the unlabeled data. This practice works well when the labeled data and the new, unlabeled data share the same relation between the response (label) and the covariates. When additional labeled data are available whose dependence between the response and the covariates does not necessarily obey the same statistical rule as the original data, we usually cannot make use of them, because the information carried in such data may not benefit our prediction purpose.
However, it is not uncommon that even when data generated from different scenarios follow different models, they can still share some common components. For example, two datasets may both follow linear regressions and share a common covariate subset and its effect. Further, if the response of the second dataset is masked out and instead only a dichotomous variable indicating whether or not the response is positive is available, then we have two models sharing common covariate effects, where one model has a continuous response while the other a categorical one. The familiar mixed effect model can be viewed as one particular example as well. In this case, each cluster can be viewed as a sample from a population and observations in the same cluster follow the same distribution. However, different clusters share some common features with the same coefficients, commonly referred to as the fixed effect because it is the same across all clusters (Wu 2010). It is then natural to borrow information from observations in other clusters even if the main purpose is prediction in one specific cluster.
Such consideration directly leads us to consider prediction using multiple models from heterogeneous populations. Specifically, consider independent datasets from N populations. Let the jth dataset be (Y_{ji}, X_{ji}, Z^{[j]}_i) for i = 1, ..., n_j, where X_{ji}, j = 1, ..., N, i = 1, ..., n_j, are iid observations and Z^{[j]}_i, i = 1, ..., n_j, are iid observations, but for j_1 ≠ j_2, Z^{[j_1]}_i and Z^{[j_2]}_i can be different covariates. We describe the dependence of the response Y_{ji} on the covariates X_{ji}, Z^{[j]}_i with a conditional pdf
f_{Y_{ji}|X_{ji},Z^{[j]}_i}(y, x, z^{[j]}) = f_j(x^T β + z^{[j]T} α_j, y, γ_j), j = 1, ..., N, i = 1, ..., n_j, (1)
where y, x, z^{[j]} represent a realization of the random variables Y_{ji}, X_{ji}, Z^{[j]}_i. Thus, even though different functional forms f_j and different parameter values α_j, γ_j are allowed in different populations, the population specific models still share a common parameter β. Our purpose is to predict the response for a given set of covariates X, Z^{[1]}. Here, without loss of generality, we consider prediction in the first population. To distinguish the model associated with the first population from the ones associated with the remaining populations, we name the first model the main model, and the rest the helper models.
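For concreteness, the following minimal Python sketch (all parameter values, dimensions, and sample sizes are purely illustrative) simulates two such datasets: both share the covariate effect β of X, the first population has a continuous response, while the second only reveals whether its response is positive, as in the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.5, -0.3])          # covariate effect shared by all populations

def simulate_main(n):
    # main model: linear regression with its own covariate Z^[1] and effect alpha_1
    X = rng.normal(size=(n, 2))
    Z1 = rng.normal(size=(n, 1))
    alpha1 = np.array([0.8])
    Y = X @ beta + Z1 @ alpha1 + rng.normal(scale=0.5, size=n)
    return X, Z1, Y

def simulate_helper(n):
    # helper model: same beta, different covariate Z^[2]; only a binary
    # indicator of whether the latent response is positive is observed
    X = rng.normal(size=(n, 2))
    Z2 = rng.normal(size=(n, 2))
    alpha2 = np.array([0.4, -0.6])
    latent = X @ beta + Z2 @ alpha2 + rng.normal(scale=0.5, size=n)
    Y = (latent > 0).astype(int)
    return X, Z2, Y

X1, Z1, Y1 = simulate_main(100)
X2, Z2, Y2 = simulate_helper(200)
```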
To perform prediction, a natural practice is to first estimate β, α_1, γ_1 using samples from the first population, and then use the estimated parameter values and the functional form f_1 to form a prediction, for example, the mean calculated as ∫ y f_1(X^T β̂ + Z^{[1]T} α̂_1, y, γ̂_1) dμ(y), where β̂, α̂_1, γ̂_1 are the estimators of β, α_1, γ_1 and μ(·) is the probability measure of the response. One may also think of improving the quality of the parameter estimation, and hence the quality of the prediction, by using the Maximum Likelihood Estimator (MLE) based on all the data. This will improve the β estimator, and hence may further improve the estimation of the other parameters and the prediction itself. Such a consideration, however, rests on the premise that all the models are correct. As long as one model, whether it is the main model or one of the helper models, is misspecified, such practice runs the risk of giving too much trust to the possibly misspecified model and hence can cause deteriorated prediction performance.
We take a different approach to making use of the multiple samples, through model averaging. This approach has the advantage that it can achieve the best prediction in large samples even if the main model that the prediction is based on is misspecified. On the other hand, if the main model is correct, the procedure will automatically eliminate the influence of misspecified helper models and use only the correctly specified helper models to form the prediction. Given that in practice it is usually unknown whether any model is correctly or incorrectly specified, automatically eliminating misspecified models is a desirable feature. In addition, our procedure has the flexibility of incorporating population specific statistical model forms, population specific parameters, and even population specific response variable types. For example, we allow some populations to have continuous responses while others have categorical responses. Finally, our procedure does not require pooling different samples from different populations together in order to carry out the parameter estimation. This can be a very important advantage since in our era data sizes are easily too large to handle, and our procedure can thus be used to justify splitting the data first, performing estimation separately, and then pooling the results together. It in fact prescribes a method to pool the results together in an optimal way in terms of prediction performance.
In the literature, Bayesian model averaging has been widely studied for several decades; see Hoeting et al. (1999) for a comprehensive review. The model averaging method proposed in the current article belongs to the frequentist model averaging framework. The main difference between our method and the existing works on frequentist model averaging, such as Yang (2001), Hjort and Claeskens (2003), Hansen (2007), Liu and Okui (2013), Liu (2015), Chen et al. (2018), Zhang, Chiou, and Ma (2018), Mitra et al. (2019), Zhang and Xia (2019), Fang et al. (2019), Fang, Li, and Wang (2019), Fang, Li, and Xiao (2022), Li et al. (2016, 2018), and Feng et al. (2022), is that we adopt different models on different datasets, while existing model averaging methods consider different models using a single dataset. As far as we know, the current article is the first one in which different models are adaptively fitted to different datasets, and model averaging is then applied to improve the prediction of a target quantity. Our work is related to meta-analysis in that we also consider multiple datasets. However, the goal of meta-analysis is mainly parameter estimation, while here we aim at prediction. Further, meta-analysis generally assumes all the models are correctly specified, while we do not make such an assumption. Our work is also related to forecast combination. However, forecast combinations are generally based on a single dataset and do not pursue the asymptotic optimality and weight convergence as we do.
The remainder of the article is organized as follows. In Section 2, we describe our procedure. In Section 3, we provide the theoretical properties of our procedure. Sections 4 and 5 present the results of simulation studies and a real data example, respectively. Concluding remarks are in Section 6. The proofs of the theoretical results and additional numerical studies are in the supplementary material.

Prediction Procedure
The prediction procedure we propose is very simple. Given N populations, the models described in (1), and the observations {Y_{ji}, X_{ji}, Z^{[j]}_i} for i = 1, ..., n_j, j = 1, ..., N, to predict the response Y associated with the covariates X, Z^{[1]} from the first population, we carry out the following procedure.
Step 1: For j = 1, ..., N, fit the jth model in (1) to the jth dataset to obtain the parameter estimators β̂^{[j]}, α̂_j, and γ̂_j.
Step 2: For j = 1, ..., N, form the candidate prediction of Y under the main model with β̂^{[j]} plugged in, that is, Ŷ_j = ∫ y f_1(X^T β̂^{[j]} + Z^{[1]T} α̂_1, y, γ̂_1) dμ(y).
Step 3: Combine the Ŷ_j's to construct a function of w through Ŷ(w) = Σ_{j=1}^N w_j Ŷ_j, where w = (w_1, ..., w_N)^T is a vector of weights that satisfy 0 ≤ w_j ≤ 1 for j = 1, ..., N and Σ_{j=1}^N w_j = 1. Let the set of all such w's be W.
Step 4: Construct a cross-validation criterion to evaluate the prediction performance of any weight choice, CV(w) = n_1^{-1} Σ_{i=1}^{n_1} {Ŷ^{(−i)}_{1i}(w) − Y_{1i}}^2, where Ŷ^{(−i)}_{1i}(w) is calculated the same way as Ŷ(w) described above except that Y_{1i} is left out of the calculation, that is, it is the leave-the-ith-observation-out prediction of Y_{1i} under weight choice w.
Step 5: Select ŵ by minimizing the cross-validated average prediction error, that is, ŵ = argmin_{w∈W} CV(w). (3)
Step 6: Let the resulting prediction be Ŷ(ŵ). We term it the model averaging prediction (MAP).
The algorithm described above is rooted in a very basic idea of finding Ŷ(w) so that the expected squared error, that is, E[{Ŷ(w) − Y}^2], is minimized. Because we do not want to put our full trust in the model f_1(X^T β + Z^{[1]T} α_1, y, γ_1), we approximate E[{Ŷ(w) − Y}^2] in a model free fashion through a sample average, while mimicking the procedure of obtaining Ŷ(w) through the leave-one-out predictions Ŷ^{(−i)}_{1i}(w). Indeed, it is not difficult to show that CV(w) is an asymptotically unbiased estimator of E[{Ŷ(w) − Y}^2], the quantity we would like to minimize. We provide the detailed verification of this result in Section S.1.1 of the supplementary material. Let Ŷ^{(−i)}_{1i,j} be the prediction of Y_{1i} with Y_{1i} deleted using the jth model, ŷ_i = (Ŷ^{(−i)}_{1i,1}, ..., Ŷ^{(−i)}_{1i,N})^T, and l be an N × 1 vector of ones. Since Σ_{j=1}^N w_j = 1, we have CV(w) = w^T H w with H = n_1^{-1} Σ_{i=1}^{n_1} (ŷ_i − Y_{1i} l)(ŷ_i − Y_{1i} l)^T, (4) so minimizing CV(w) is a quadratic program and can be solved very quickly. Using cross-validation to determine the weights is known as stacking (Wolpert 1992; Breiman 1996; Van der Laan, Polley, and Hubbard 2007). In these seminal papers, the predictions being averaged were either from the same dataset, or the settings did not involve heterogeneous populations as considered here. In addition, the predictions to be averaged in our problem are calculated based on both the main model and the helper models, while they are typically based on only one model in classical stacking. Lastly, in the current article, we rigorously prove the optimality and weight assignment properties for combining predictions, while in stacking, these properties have not been established as far as we know.
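To illustrate the weighting step, the following minimal sketch (Python, assuming numpy and scipy are available, and taking the matrix of leave-one-out predictions as given) forms H as above and minimizes CV(w) = w^T H w over the weight simplex. A general-purpose constrained optimizer is used here in place of a dedicated quadratic-programming solver.

```python
import numpy as np
from scipy.optimize import minimize

def map_weights(loo_pred, y):
    """loo_pred: n1 x N matrix whose (i, j) entry is the leave-one-out
    prediction of y[i] from the jth model; y: length-n1 response vector."""
    n1, N = loo_pred.shape
    resid = loo_pred - y[:, None]           # rows are (yhat_i - Y_1i * l)^T
    H = resid.T @ resid / n1                # CV(w) = w' H w
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * N
    w0 = np.full(N, 1.0 / N)                # start from the simple average
    res = minimize(lambda w: w @ H @ w, w0, method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x

# usage sketch: w_hat = map_weights(loo_pred, Y1); the final prediction for a
# new observation is the weighted combination of its candidate predictions,
# e.g. pred_new @ w_hat.
```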
Note also that our model averaging procedure and purpose are very different from those of traditional model averaging, where a single population is concerned and the goal is to perform the best prediction when it is unclear which models are the best to use. In that case, the models under consideration often correspond to sub-models of a large model including all possible covariates, and hence the number of sub-models naturally diverges when the number of covariates diverges. Here, the total number of models N is often fixed, where one of them is the main model that we use to perform prediction in the main population and the remaining N − 1 helper models are used to help with the task. In fact, we can allow a single helper model, that is, N = 2. Of course, we also allow N to diverge; indeed, the properties will be established for a general N. In practice, we may only have a limited number of helper samples, which corresponds to a constant order of N. Finally, because in practice it is very unlikely that all N models happen to be correct, we recommend estimating the parameters in each model separately and then averaging, instead of imposing the common β restriction in the estimation via, for example, the generalized method of moments, so as to prevent the deterioration of the estimation due to incorrect models.

Theoretical Properties
It is not surprising that the theoretical properties of the above procedure derived below depend on whether or not the main model is misspecified, since the main model is the critical factor we rely on to perform prediction. Interestingly, although our prediction procedure also incorporates all the helper models, whether none, one, or more of the helper models are misspecified does not affect the validity of the theoretical results.

Theoretical Properties under Misspecified Main Model
To present the theoretical properties, we first formally define the risk function as R(w) ≡ E[{Ŷ(w) − E(Y|X, Z^{[1]})}^2]. Ideally, one would certainly aim at minimizing R(w) with respect to w in the set W. In Theorem 1, we show that this goal is essentially achieved by our procedure. Usually, the prediction risk decomposes as E[{Ŷ(w) − Y}^2] = R(w) + E[{Y − E(Y|X, Z^{[1]})}^2], where the second component is not controllable. Thus, we aim at minimizing R(w).
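For completeness, this decomposition follows from expanding the square, under the assumption that the new response Y is independent of the training sample given the covariates (X, Z^{[1]}):

```latex
\begin{aligned}
E\{\hat{Y}(w)-Y\}^2
&= E\{\hat{Y}(w)-E(Y\mid X,Z^{[1]})\}^2
 + E\{Y-E(Y\mid X,Z^{[1]})\}^2\\
&\quad+2\,E\big[\{\hat{Y}(w)-E(Y\mid X,Z^{[1]})\}\{E(Y\mid X,Z^{[1]})-Y\}\big]\\
&= R(w)+E\{Y-E(Y\mid X,Z^{[1]})\}^2.
\end{aligned}
```

The cross term vanishes because, conditionally on (X, Z^{[1]}) and the training data, the first factor is fixed while E(Y|X, Z^{[1]}) − Y has conditional mean zero.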
Before stating Theorem 1, we first describe some assumptions that are needed to formally establish the theoretical properties of our predictor. All limiting processes considered in this article correspond to n → ∞, where n = min_{1≤j≤N} n_j. We allow both N → ∞ and N fixed.

Assumption 1 (Regularity of estimators). There exist values α_j, γ_j, and β^{[j]} such that ‖α̂_j − α_j‖ + ‖γ̂_j − γ_j‖ + ‖β̂^{[j]} − β^{[j]}‖ = O_p{(n_j/N)^{-1/2}} for j = 1, ..., N.

Assumption 1 is a simple requirement on the regularity of the estimators being averaged. This excludes, for example, high dimensional estimators obtained under additional constraints such as sparsity, which generally have slower convergence rates. When a model is correct, the limiting parameter values α_j, γ_j, and β^{[j]} are naturally the true parameter values, while when a model is misspecified, these limiting parameter values also exist and usually satisfy certain properties depending on how the estimators are constructed. Indeed, when the estimators are MLEs, White (1982) has shown that α_j, γ_j, and β^{[j]} are the parameters that minimize the Kullback-Leibler distance between the true data generation procedure and the family of distributions parameterized by these parameters, under his conditions A1-A6. When α̂_j, γ̂_j, and β̂^{[j]} are M-estimators, α_j, γ_j, and β^{[j]} are the parameters that minimize the mean target function. When α̂_j, γ̂_j, and β̂^{[j]} are Z-estimators, α_j, γ_j, and β^{[j]} are the parameters at which the Z-estimating function has mean zero. Here, for simplicity, we write the convergence of the parameter estimators to their limiting values in the general case as an assumption. To accommodate uniform convergence under a possibly diverging N, we weaken the convergence rate assumption to (n_j/N)^{-1/2}, while in the common situation with a fixed N, this rate is n_j^{-1/2}.

We now define some limiting quantities when the estimators are replaced by their corresponding limits. Specifically, write g(X, Z^{[1]}; b, a, c) ≡ ∫ y f_1(X^T b + Z^{[1]T} a, y, c) dμ(y) for generic parameter values (b, a, c), and let Y*(w) ≡ Σ_{j=1}^N w_j g(X, Z^{[1]}; β^{[j]}, α_1, γ_1). We write the "risk" calculated under the limiting parameter values, instead of the estimated parameter values, as R*(w) ≡ E[{Y*(w) − E(Y|X, Z^{[1]})}^2], and let ξ ≡ inf_{w∈W} R*(w) be the minimum risk under the ideal weights if the limiting parameter values were known. To be explicit, we write E(Y|X, Z^{[1]}) as g_true(X, Z^{[1]}; β, α_1, γ_1). Here, we use "true" since the expectation is taken under the unknown true model.
Then, E(Y_{1i}|X_{1i}, Z^{[1]}_i) = g_true(X_{1i}, Z^{[1]}_i; β, α_1, γ_1). Under the main model, we write Y*_j = g(X, Z^{[1]}; β^{[j]}, α_1, γ_1), Y*_{1i,j} = g(X_{1i}, Z^{[1]}_i; β^{[j]}, α_1, γ_1), and Ŷ_j = g(X, Z^{[1]}; β̂^{[j]}, α̂_1, γ̂_1) as in Section 2.

Assumption 2 (Existence of fourth moments). The expectations E{g^4(X_{1i}, Z^{[1]}_i; β^{[j]}, α_1, γ_1)}, j = 1, ..., N, and E(Y^4_{1i}) exist.

Assumption 3 (Differentiability and boundedness of regression mean functions). The function g(X_{1i}, Z^{[1]}_i; β, α_1, γ_1) is differentiable with respect to β. Moreover, on the set {β: ‖β − β^{[j]}‖ ≤ C}, where C is a constant, there exists a positive constant c_1 such that for any 1 ≤ j ≤ N, E‖∂g(X_{1i}, Z^{[1]}_i; β, α_1, γ_1)/∂β‖^2 ≤ c_1.

Assumption 4 (Existence of expectations). The expectations E{sup_{1≤j≤N}(Ŷ_j − Y*_j)^2} and E[sup_{1≤j≤N}{Y*_j − E(Y|X, Z^{[1]})}^2] exist.

Assumptions 2-4 are technical conditions on the existence, differentiability, and boundedness of various moments, and are rather mild. Taking the linear model Y = X^T β + Z^{[1]T} α_1 + ε as an example, sufficient conditions for Assumptions 2 and 3 are that E(ε^4) and E{(X^T_{1i} β + Z^{[1]T}_i α_1)^4} exist. Assumption 4 is also a moment boundedness condition. When N is finite, it simply requires the existence of E(Ŷ_j − Y*_j)^2 and E[{Y*_j − E(Y|X, Z^{[1]})}^2] for all models. To handle a diverging N, we write the corresponding requirement in the expectation-of-the-supremum form. In practice, when the models under consideration are concrete and the distributions of the covariates are known, these assumptions take more detailed forms and need to be checked in a case-by-case fashion.
Assumption 5. ξ (n/N^3)^{1/2} → ∞.

Assumption 5 is a critical condition. It essentially requires the main model to be sufficiently misspecified. To see this, we first consider the fixed N case. In this case, if the main model had been correct, then ξ = 0, and hence Assumption 5 is violated. On the other hand, if the main model is misspecified with ξ ≍ 1, then Assumption 5 is satisfied. Here, the condition ξ n^{1/2} → ∞ (i.e., the rate n^{-1/2} for ξ) serves as the threshold between the correct and misspecified cases. To accommodate a possibly diverging N, the threshold is adapted to (n/N^3)^{-1/2}, which is larger than in the fixed N case. Similar assumptions can be found in eq. (7) of Ando and Li (2014) and Assumption 1(e) of Liu, Yao, and Zhao (2020).
Under the above assumptions, the procedure we described in Section 2 leads to the optimal weight choice, in that the risk of the prediction using the estimated weights is the same as the risk of the prediction using the best possible weights to the leading order. The result is stated in Theorem 1, with its proof given in Section S.1.2 of the supplementary material.
Note that the optimality of the MAP method established in Theorem 1 is among the class of all weighted averaged predictions. It does not include predictions based on parameters estimated from pooled data, such as "MLE all" discussed in Section 4. Note that estimating model parameters based on pooled data suffers from a critical drawback, in that it cannot automatically adapt to model misspecification. Thus, for such methods, a misspecified model can lead to deteriorated overall prediction performance.

Theoretical Properties under Correct Main Model
We now consider the case that the main model is correctly specified. Because the helper models can be correct or misspecified, we let D be the subset of {1, ..., N} that consists of the indices of the correctly specified models. Obviously 1 ∈ D. Let D^c be the complement of D. Write ŵ = (ŵ_1, ..., ŵ_N)^T. Let τ ≡ Σ_{j∈D} w_j, τ̂ ≡ Σ_{j∈D} ŵ_j, and let M be the cardinality of D^c.
In practice, when the model complexity does not increase with the sample size, the risk of averaging the misspecified models is typically of constant order. We hence also state Assumption 7 as an alternative to Assumption 6.
Assumption 6 obviously has similarity with Assumption 5. It imposes a relation between N, n, and the optimal risk when all weights are assigned to the misspecified models. For simplicity, consider the fixed N case. Assumption 6 implies that the risk based on the misspecified models only is much larger than n^{-1/2}. When the main model is misspecified, the optimal risk is automatically sufficiently large, hence we do not need to restrict the weight set in Assumption 5. However, when the main model is correctly specified, the deterioration of the risk has to result from the misestimation of the parameter β; hence, the correct models need to be excluded for this purpose. This is why we restrict the weights assigned to the correct models to be zero in Assumption 6. Assumption 6 serves as a separation between the performance of the correctly specified models and that of the misspecified models. Using any correct model in combination with the correct main model, the resulting risk in the limit will be zero. On the other hand, using any misspecified model, even when combined with the correct main model, the resulting risk in the limit will be much larger than n^{-1/2}, as required by the assumption. Thus, intuitively it is clear that none of the incorrect models will be chosen when we minimize the risk. We can also allow N to diverge, provided that Assumption 6 holds. For example, when N diverges, Assumption 6 is typically satisfied if N^{3/2} = o(n^{1/2}) and inf_{w∈W: Σ_{j∈D} w_j = 0} R*(w) ≍ 1. We summarize the result in Theorem 2, with its proof in Section S.1.3 of the supplementary material.
Theorem 2 establishes a kind of model selection consistency, in that the method will automatically exclude the misspecified models. We stress that such model selection consistency is based on the correctness of the main model f_1, and it only has the ability to exclude the misspecified helper models. For the case that the main model itself is misspecified, the selection consistency in Theorem 2 does not hold in general and only the prediction optimality in Theorem 1 applies. Regardless of whether a helper model is correct or not, the prediction always relies on the main model; hence the correctness or misspecification of the main model dominates the performance of the prediction procedure.
In modern applications, big data arise very often and it is common to perform prediction in each section of the data separately and then combine the predictions by a simple average; see, for example, Li, Lin, and Li (2013) and Battey et al. (2018). The problem with this practice is that as long as one model is misspecified, which is often the case due to the heterogeneity and complexity of the data, the prediction error of the simple average can deteriorate very quickly. However, using the model averaging procedure described in Section 2, the prediction properties described in Theorem 1 or 2 are automatically guaranteed and the procedure will yield better performance than the simple average. We summarize this property in Corollary 1 and provide a brief argument in Section S.1.4 of the supplementary material.
Corollary 1. Assume the number of models and the sample size satisfy n(N^{-4}M^2 + N^{-7}M^4) → ∞, where M is the number of misspecified models. Regardless of whether the main model is correct or not, under Assumptions 1-5 and 7, the risk of the model averaging procedure in Section 2 is smaller than the risk associated with the simple average to the first order.
Note that in Corollary 1, when the number of available models N increases, the number of misspecified models is also required to increase, even if the number of observations per model can be fixed. This is often naturally satisfied. In Section S.2 of the supplementary material, we further explore the variance of the averaged prediction Ŷ(ŵ) for both the misspecified and the correct main model cases, and establish that the variance of MAP, that is, var{Ŷ(ŵ)}, converges to zero under suitable conditions. We note that for the future observation Y, the prediction variance is var{Ŷ(ŵ) − Y} = var{Ŷ(ŵ)} + var(Y); thus, our prediction is optimal in the sense that the only remaining variability is the inherent variability associated with the randomness of the future observation.

Simulation Designs
We first consider linear regression models. We generate data from Y_{ji} = X^T_{ji} β + Z^{[j]T}_i α_j + ε_{ji}, with ε_{ji} ~ N(0, σ^2), (7) for i = 1, ..., n_j and j = 1, ..., N, where Y_{ji} is a continuous response variable, σ = 0.5, and (X^T_{ji}, Z^{[j]T}_i)^T is generated from a multivariate normal distribution with mean zero and an AR(1) correlation structure with correlation coefficient 0.5 and variance 4.
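A minimal Python sketch of this data-generating step is given below (the AR(1) covariance construction, σ, and β follow the description above; the α_j value shown is illustrative rather than the one used in the designs).

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_cov(dim, rho=0.5, var=4.0):
    # AR(1) correlation structure scaled to the stated variance
    idx = np.arange(dim)
    return var * rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_population(n_j, beta, alpha_j, sigma=0.5):
    # draw (X, Z^[j]) jointly from the multivariate normal described above,
    # then generate the continuous response from the linear model (7)
    p, q = len(beta), len(alpha_j)
    W = rng.multivariate_normal(np.zeros(p + q), ar1_cov(p + q), size=n_j)
    X, Z = W[:, :p], W[:, p:]
    Y = X @ beta + Z @ alpha_j + rng.normal(scale=sigma, size=n_j)
    return X, Z, Y

beta = np.array([0.5, 0.6, -0.61, -0.48])
alpha_1 = np.array([0.3, -0.2, 0.5, 0.1, -0.4])   # illustrative five-dimensional alpha
X1, Z1, Y1 = simulate_population(100, beta, alpha_1)
```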
We first set the number of models N = 7. In Design C1.1 (C means the main model is correctly specified), we generate data from (7), and let n_1 = 100, n_2 = 200, n_3 = 100, n_4 = 200, n_5 = 100, n_6 = 200, and n_7 = 100. We set the shared parameter β to be (0.5, 0.6, −0.61, −0.48)^T. The parameters α_j are assigned according to Section S.3.1 in the supplementary material. Note that now α_1, α_3, and α_5 are five-dimensional vectors, but α_2 and α_4 are six-dimensional vectors. When we estimate (β^T, α^T_2)^T of helper model 2 and (β^T, α^T_4)^T of helper model 4, we omit the last component of Z^{[2]}_i and Z^{[4]}_i, respectively, so helper models 2 and 4 are misspecified. In addition, when we estimate (β^T, α^T_6)^T of helper model 6, we omit the last component of Z^{[6]}_i; thus, helper model 6 is misspecified as well. In summary, in Design C1.1, we have the main model correctly specified, three helper models correctly specified, and three helper models misspecified.
Moreover, we increase the number of heterogeneous populations N to 15, and further increase the dimensions of the coefficients β and α_j and the sample sizes. In Design C1.5, we set n_1 = n_3 = n_5 = n_7 = n_9 = n_{11} = n_{13} = n_{15} = 400 and n_2 = n_4 = n_6 = n_8 = n_{10} = n_{12} = n_{14} = 800. The shared parameter is β = (0.5, 0.6, −0.61, −0.48, 0.97, 0.72, −1.1, −0.5)^T. We set the α_j according to Section S.3.1 in the supplementary material. When we estimate the parameters, we omit the last component of Z^{[m]}_i for m = 2, 4, and 6; we omit the second and last components of Z^{[m]}_i for m = 8, 10, 12, and 14. In summary, in Design C1.5, the main model and seven helper models are correctly specified, and the seven remaining helper models are misspecified.

Comparison Methods
For comparison, in addition to our MAP method, we also implement six other methods. The first comparison method is the simple average method, where we follow the same procedure as in the proposed method, except that we use the equal weights w_j = 1/N instead of using cross-validation to select the weights. The second comparison method is the "MLE main" method, where we ignore all helper models and simply perform the standard prediction using the MLE of the parameters from the main model. The third comparison method is the "MLE all" method, where we estimate the parameters by maximizing the composite log-likelihood Σ_{j=1}^N Σ_{i=1}^{n_j} log f_j(X^T_{ji} β + Z^{[j]T}_i α_j, Y_{ji}, γ_j), which imposes the common β across all datasets. Note that in the "MLE all" method, we use the helper models in the corresponding designs. The fourth and fifth comparison methods are AIC and BIC based model averaging methods, termed smoothed AIC and smoothed BIC, respectively (Buckland, Burnham, and Augustin 1997; Claeskens and Hjort 2008), where the averaging weights are set as w_j = exp(−AIC_j/2)/Σ_k exp(−AIC_k/2) and w_j = exp(−BIC_j/2)/Σ_k exp(−BIC_k/2), respectively, although AIC and BIC values are usually considered comparable only when different models are fitted to the same dataset. Finally, the sixth method is a meta-analysis based prediction, where we first obtain the meta-analysis based estimator of β, and then combine it with the estimators of α_1 and γ_1 to calculate the prediction for a new observation in the main population.
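As a small illustration of how the smoothed AIC/BIC weights are computed from the information criterion values of the fitted models (the numerical values below are hypothetical):

```python
import numpy as np

def smoothed_ic_weights(ic_values):
    # ic_values: AIC (or BIC) values, one per model
    ic = np.asarray(ic_values, dtype=float)
    ic = ic - ic.min()                 # subtracting the minimum leaves the weights
    w = np.exp(-ic / 2.0)              # unchanged but improves numerical stability
    return w / w.sum()

w_aic = smoothed_ic_weights([210.3, 215.1, 209.8])   # hypothetical AIC values
```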

Simulation Results
We compare all methods using a test dataset of 500 observations generated from the main model. Specifically, we calculate the mean prediction error R̂(ŵ) = n_new^{-1} Σ_{i=1}^{n_new} {Ŷ_i(ŵ) − E(Y_i|X_i, Z^{[1]}_i)}^2 on the testing data, where n_new = 500. We repeat the procedure 1000 times. The resulting average R̂(ŵ) values are provided in Tables 1-6. The "Gain" column in each table is the percentage decrease of the average R̂(ŵ) when comparing our MAP method with the one among the other six methods that has the smallest average R̂(ŵ).
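A minimal sketch of this evaluation is given below (the function names are ours, and the normalization of the "Gain" column by the best competitor's error is our reading of the definition above).

```python
import numpy as np

def mean_prediction_error(pred, cond_mean):
    # average squared distance between the averaged prediction and E(Y | X, Z^[1])
    return np.mean((np.asarray(pred) - np.asarray(cond_mean)) ** 2)

def gain(map_error, competitor_errors):
    # percentage decrease of MAP's error relative to the best competing method
    best = min(competitor_errors)
    return 100.0 * (best - map_error) / best
```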
In the last columns of Tables 1-6, we report the average τ̂'s for the designs where the main model is correctly specified. It is seen that the average τ̂'s are all above 0.8, except for C2.3 in Table 4, and when the sample sizes are larger, they are closer to 1. This performance verifies the consistency shown in Theorem 2. The τ̂ values of the designs with misspecified main models are not reported in the last columns of the tables, because there is no guarantee that the model averaging procedure will tend to put weights on any specific models when the main model is misspecified.
When the main model is misspecified, from the results in the tables, we see that our MAP method yields the smallest average R̂(ŵ) compared to all the competing methods, reflecting the theoretical result in Theorem 1. When the main model is correctly specified, our MAP method still has an advantage. The possible reason is that our MAP method asymptotically puts zero weights on the misspecified helper models, whereas the simple average, "MLE all", the smoothed AIC/BIC, and the meta-analysis methods depend on the misspecified helper models substantially, and "MLE main" does not use any information from the correctly specified helper models. We also note that "MLE main" tends to perform better than "MLE all", reflecting the harm due to the incorrectly specified helper models when they are blindly included into the analysis by "MLE all". For smaller sample sizes or larger N, we find that the gain from our MAP method is more significant. We also find that the gain under logistic regression is more significant than that under linear regression. In addition, we conduct simulation studies with N = 3 in Section S.3.2 of the supplementary material. The results show that the proposed MAP method still outperforms the competing methods.
Last, we inspect Tables 3 and 4. We can see that our MAP method still has obvious advantages over the other methods. Similar to the previous findings, when the sample size is smaller, the gain of our MAP method is more significant. For C-type designs, when the sample sizes are larger, the average τ̂'s are closer to 1.

Real Data Example
We analyze the "default of credit card clients" data from an important bank in Taiwan, publicly available at the UCI machine learning repository. We consider five populations of clients, with credit limits equal to 10k, 110k, 210k, 310k, and 410k, respectively. The sizes of the samples from the five populations are respectively 493, 588, 730, 272, and 78. Note that if our interest had been prediction in the group with credit limit, say, lower than 300k, and if we had evidence that the effects of the other covariates may be identical in the groups with credit limit less than and more than 300k, respectively, then we would have divided the sample into two groups.
The response variable is Y, with Y_{ji} = 1 if the ith client of the jth population defaulted, and Y_{ji} = 0 otherwise. The covariates X_{ji1}, ..., X_{ji5} are the payment ratios of the previous five months, which take numerical values between 0 and 1. The payment ratio is calculated as the payment amount of this month divided by the bill statement balance posted last month. If the bill statement balance is less than or equal to 0, or the payment amount is greater than the bill statement balance, then the payment ratio is set to 1. The other covariates include gender Z_{ji1}, with Z_{ji1} = 1 representing male and Z_{ji1} = 0 female; education level Z_{ji2}, with Z_{ji2} = 1 indicating university education or above and Z_{ji2} = 0 otherwise; marital status Z_{ji3}, with Z_{ji3} = 1 if married and Z_{ji3} = 0 otherwise; and age Z_{ji4}.
In Section S.3.5, we conduct group-wise logistic regressions as well as an overall logistic regression to examine the interaction effects between the payment ratios and the credit limits on the default risk. The results suggest that the payment effects across the different populations remain roughly the same. This agrees with Jeon, Kwon, and Choi (2017) and Fang and Chen (2019), both of which suggest that the coefficients of the payment ratios can be considered the same across different populations. We therefore analyze the dataset with the logistic models logit{pr(Y_{ji} = 1 | X_{ji}, Z_{ji})} = X^T_{ji} β + Z^T_{ji} α_j for j = 1, ..., 5, where Z_{ji} = (Z_{ji1}, ..., Z_{ji4})^T, the coefficient vector β of the payment ratios is common across the populations, and α_j is population specific. We consider rotating the roles of the five models. That is, each model has one opportunity to serve as the main model while the other four models serve as the helper models, so we have five settings. In the jth setting, we use the model associated with the jth population as the main model. Table 7 shows the weights obtained by our MAP method under all settings. It is interesting that in most settings, the weights of the helper models are larger than those of the main model, which means the helper models indeed help the prediction. Also, some weights can be exactly zero. To explain these zeros, similar to Feng, Liu, and Okui (2020) and by (4), we can rewrite our estimator as ŵ = argmin_{w∈W} {w^T H w + ζ(Σ_{j=1}^N |w_j| − 1)} under the constraint 0 ≤ w_j ≤ 1 for j = 1, ..., N, where ζ is the Lagrange multiplier. Thus, the l_1 penalty serves the purpose of encouraging sparsity of the weights.
To further check the performance of our MAP method, we randomly divide the data from the population of the main model into two parts of equal size, one for analysis and one for testing. We repeat the procedure 1000 times, and calculate the prediction errors for the methods compared in the simulation section, with the prediction error in the rth replication calculated as the average of {Ŷ^{{r}}_{ji} − Y^{{r}}_{ji}}^2 over the n_j/2 testing observations, where {r} denotes the rth replication and n_j is the size of the sample from the jth population. The above prediction error also measures the prediction risk E{Ŷ_{ji} − E(Y_{ji}|X_{ji}, Z^{[j]}_i)}^2, up to an additive constant that is the same for all methods. Figure 1 shows the prediction errors under the five settings. First, we can see that our MAP method performs the most robustly across all five settings. In all settings except the second and fourth, our MAP method outperforms the simple average method, and in the second and fifth settings they perform similarly. The "MLE main" method performs worse than MAP in all settings, possibly because it does not use information from the helper models. The "MLE all" method mostly loses to MAP as well: in the first and fifth settings, it performs much worse than MAP, and in the remaining settings, it performs similarly to or slightly better than MAP. The smoothed AIC and BIC methods have very similar performance, and they perform worse than MAP in the third and fifth settings, while having similar performance to MAP in the remaining settings. Finally, the meta-analysis method is in general worse than MAP as well, except in the second and fourth settings, where it performs similarly to MAP.
In this real data example, only a binary response variable Y_{ji} indicating the default status of a client is available; hence, all our models are logistic. If more detailed information had been available, for example, if in some samples the amount a client owes the credit card company had been known, then a linear regression model could have been applied. This would lead to a mixture of different models, including logistic and linear, similar to the simulation studies conducted in Section 4.

Concluding Remarks
In the context where a model of main research interest shares some parameters with several other models, we have developed a model averaging procedure to improve the prediction for a new observation from the main model. The procedure achieves the optimal prediction risk when the main model is misspecified; and if the main model is correctly specified, the sum of the weights assigned to the main model and the correct helper models converges to one. Numerical examples show that the procedure has excellent finite sample properties compared with the simple averaging method, the MLE using the main model only, and the MLE using all models.
As in most of the literature on optimal model averaging, we did not study the limiting distribution of the resulting averaged prediction because of the difficulties produced by the random weights and possible model misspecification. Using the bootstrap may be a feasible way to solve this problem, and this certainly warrants future work. In our procedure, we use a single model for each dataset. It is straightforward to extend our procedure to allow multiple models for each dataset, which can accommodate possible modeling difficulties. However, the theoretical properties of such a procedure are challenging to obtain due to the possible correlations between samples associated with different models. We leave this as an interesting future research topic. Last, inspired by the reviewers, we point out that the framework of a common β across different models does not impose an essential restriction, because we allow helper models to be misspecified. Indeed, when our method is used in a situation where β in one or more helper models differs from that in the main model, the results established for misspecified helper models apply automatically. In this sense, our framework is more general than it appears, in that β does not have to be common across all models.

Figure 1. Boxplot of prediction errors in the real data example. In Setting j, the main model is Model j, which is associated with the jth population.

Table 1. N = 7 and all models are linear. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 2. N = 7 and all models are logistic. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 3. N = 7, the main model is linear, four helper models are linear, and two helper models are logistic. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 4. N = 7, the main model is logistic, four helper models are logistic, and two helper models are linear. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 5. N = 15 and all models are linear. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 6. N = 15 and all models are logistic. MAP is the proposed method. Sim is the simple average method. MLEm is the MLE main estimator. MLEa is the MLE all estimator. sAIC is the smoothed AIC method. sBIC is the smoothed BIC method. MT is the meta-analysis method.

Table 7. Weights obtained by the MAP method in the data analysis.