Data Integration with Oracle Use of External Information from Heterogeneous Populations

Abstract It is common to have access to summary information from external studies. Such information can be useful for improving parameter estimation efficiency in an internal study of interest when properly incorporated. However, external studies may target populations different from the internal study, in which case incorporating the corresponding information may introduce estimation bias. We develop a penalized constrained maximum likelihood (PCML) method that simultaneously (a) selects the external studies whose information is useful for internal model fitting and (b) incorporates the corresponding information into internal estimation. The PCML estimator has the same efficiency as an oracle estimator that fully incorporates only the useful external information. We establish estimation consistency, the parametric rate of convergence, external information selection consistency, asymptotic normality, and oracle efficiency. An algorithm for implementation is provided, together with a data-adaptive tuning parameter selection. Supplementary materials containing details referred to throughout the article are available online.


Introduction
Data integration has become an active research area due to the increasing availability of data from many sources in the era of data science. Data from different sources oftentimes contain information that, if properly incorporated, can help make a better decision or a more accurate conclusion than using any single data source alone. For instance, in survey sampling, although a probability-based sample is desired to ensure representativeness of the population, the financial cost in practice may limit its sample size. On the other hand, nonprobability samples may be easy to obtain with large sizes and can then be used to improve the precision and/or accuracy of estimation. As another example, in genetics research, evidence from multiple genome-wide association studies can be integrated to better identify genetic determinants of diseases, whereas a single study alone may not have the desired power.
Statistical methods for data integration vary depending on many factors, including the types of information to be combined. Methods that can deal with summary or aggregate information are particularly attractive because they place fewer demands on data sharing and data storage, and because of ethical considerations such as maintaining the confidentiality and privacy of study participants. The need for statistical methods that can incorporate summary information from external studies into an internal analysis arises in many areas, and summary information has become widely available. For instance, in survey sampling, aggregate information such as stratified population means is oftentimes available from published census reports, and in biomedical and public health research, aggregate information such as demographic distributions and model fitting results is oftentimes available from published articles. Our development in this article is under such a setting: we incorporate useful summary information from external studies to improve estimation efficiency for an internal study that has individual-level data.
Some authors have considered similar settings, but most of them made the assumption that the internal and external study populations are the same (e.g., Imbens and Lancaster 1994; Wu and Sitter 2001; Chen, Sitter, and Wu 2002; Chaudhuri, Handcock, and Rendall 2008; Qin et al. 2015; Chatterjee et al. 2016; Huang, Qin, and Tsai 2016; Cheng et al. 2019; Gu et al. 2019; Han and Lawless 2019; Huang and Qin 2020; Zhang et al. 2020; Han, Taylor, and Mukherjee 2022). Such an assumption can be easily violated in practice since, for example, the demographic distribution and outcome prevalence often vary between study populations, in which case methods based on this assumption are no longer valid. In the presence of population heterogeneity, some authors proposed to shrink the internal study results toward the external information (Estes, Mukherjee, and Taylor 2018; Gu, Taylor, and Mukherjee 2021).
In this article we consider the setting where the focus is on the internal study population, and the goal of integrating external information is to assist the internal analysis and improve estimation efficiency. This may be the case, for example, when the internal study has a clear target population and is based on a careful design with well-controlled sampling. In this case, only the information from external studies that target the same population as the internal study should be incorporated, as otherwise the external information may lead to estimation bias for the internal analysis. Based on this consideration, we make it explicit that the internal study population is the target for inference and that external information that may introduce estimation bias should be discarded. It is worth pointing out that there are settings where it is desired to incorporate information from all available external studies, not only from some particular ones, but those settings are not the focus of this article.
In the possible presence of study population heterogeneity, we develop a method that can simultaneously select and incorporate the information from those external studies that target the same population as the internal study and discard the information from the rest. The external information is formulated as moment constraints on the internal study model. The constraints corresponding to external studies that target the same population as the internal study are valid and should be incorporated to improve efficiency, and those corresponding to the other external studies are invalid and should be discarded. This formulation turns the data integration problem into a selection of valid moment constraints. We then further formulate it as a variable selection problem by introducing nuisance parameters that represent the biases of the moment constraints under the internal data distribution and selecting the ones with zero biases. Variable selection can be achieved by shrinkage techniques that estimate some parameters exactly as zero through a penalization on the nuisance parameters.
The method we develop is penalized constrained maximum likelihood (PCML). Constrained maximum likelihood (CML) type methods have been considered by many authors for data integration when the internal and external study populations are the same (e.g., Qin 2000; Qin et al. 2015; Chatterjee et al. 2016; Huang, Qin, and Tsai 2016; Han and Lawless 2019; Zhang et al. 2020; Han, Taylor, and Mukherjee 2022). In the presence of population heterogeneity, we make use of adaptive group Lasso penalties (Tibshirani 1996; Zou 2006; Yuan and Lin 2006; Wang and Leng 2008) on the CML as a way to simultaneously select and incorporate useful external information into the internal analysis. There are situations where, for an external study, the vector of introduced nuisance bias parameters is not a zero vector but does have some zero components. Such an external study can provide some information that is still useful for internal efficiency improvement without introducing estimation bias, even if the study population is not exactly the same as the internal one. Section 2.4 contains a detailed discussion. To account for such situations, we consider both group-wise and component-wise shrinkage for selecting the moment constraints to ensure a maximal incorporation of useful information. The PCML method makes an oracle use of the external information in the sense that the PCML estimator has the same efficiency as the oracle CML estimator that knows which external information is useful and fully incorporates that information alone. Compared to a recently proposed two-step procedure (Sheng et al. 2021) that first conducts a hypothesis test for population heterogeneity and then assumes a nuisance model to link the external information to the internal study, our method simultaneously selects and incorporates the valid external information without specifying any additional models beyond the internal study model.
Our proposed method has implicit connections to some literature on penalized empirical likelihood (Tang and Leng 2010;Leng and Tang 2012;Chang, Tang, and Wu 2018), due to the connections between the CML-type methods and the empirical likelihood (Han and Lawless 2019). But the settings are different. In our data integration setting some external studies provide invalid moment constraints due to population heterogeneity, whereas the penalized empirical likelihood assumes all moment constraints are valid.
We provide a detailed theoretical investigation of the PCML method. Under a set of regularity conditions, including assumptions on the convergence rate of the tuning parameter, we establish the asymptotic properties of the PCML estimator as follows. First, estimation consistency is established by explicitly exploiting the saddle-point representation of the PCML method. Second, the convergence rate of the PCML estimator is shown to be the parametric √n-rate. Third, external study selection consistency is established by showing that the nuisance parameters representing the biases of the moment constraints are estimated exactly as zero with probability approaching one when the true biases are zero. Fourth, the asymptotic normal distribution is derived jointly for both the internal model parameters and the nuisance parameters representing the nonzero biases of the moment constraints. Last, the asymptotic variance of the PCML estimator for the internal model parameters is shown to equal the asymptotic variance of the oracle CML estimator.

Setting and Notation
We consider the setting where (a) an internal study collects individual-level data to fit a parametric regression model, (b) some external studies have fitted similar regression models using less detailed covariates with large sample sizes and their model fitting results are available, and (c) these external studies are conducted for possibly different populations. The aim is to incorporate the external information that is useful for improving the internal model fitting, since the uncertainty in the external information is low due to the large external sample sizes. One major challenge is how to identify and incorporate only the useful external information, because the information from external studies that do not target the internal study population may introduce estimation bias when incorporated. This setting is motivated by research in many areas, particularly biomedicine and public health. For example, the internal study collects new covariates such as newly discovered biomarkers, as well as certain conventional covariates such as demographic variables, to investigate their associations with a disease outcome. The internal study sample size may not be large due to budget or technical restrictions. On the other hand, the associations between the outcome and some of the conventional covariates have been established by external studies with large sample sizes, with results available in published articles. Such external information, if incorporated into the internal analysis, may substantially improve the internal model fitting.
To fix notation, let (Y_i, X_i^T, Z_i^T)^T, i = 1, . . . , n, denote the individual-level data from a random sample collected by the internal study, where Y is the outcome variable, X is the vector of conventional covariates that are typically collected by studies on the same outcome, and Z is the vector of covariates that are only collected by the internal study. We allow Z to be the null set if the internal study only collects X. The main interest is to fit a parametric regression model f(Y|X, Z; β) for the distribution f(Y|X, Z), where β is a q-dimensional vector of parameters with true value β_0 such that f(Y|X, Z; β_0) = f(Y|X, Z). With no additional information, β_0 can be estimated by the maximum likelihood estimator (MLE) β̂_MLE that maximizes the likelihood ∏_{i=1}^n f(Y_i|X_i, Z_i; β). Suppose there are K external studies on the same outcome Y that can potentially provide useful information to improve the internal model parameter estimation. In this article we consider K to be a fixed finite number. The kth external study, k = 1, . . . , K, used covariates X_(k) and fitted a model f_(k)(Y|X_(k); θ_(k)) for f_(k)(Y|X_(k)). Here, for generality, we allow X_(k) to be a possibly coarsened version of X, such as a subset or a categorization of some components of X; the subscript of f_(k) explicitly indicates that the kth external study population may be different from the internal study population; and θ_(k) is the vector of parameters for this model, which may be misspecified by the external study. Let h_(k)(Y, X_(k); θ_(k)) denote the d_k-dimensional score function for the model f_(k)(Y|X_(k); θ_(k)). The kth external study then provides an estimate θ̂_(k) that is the solution to the corresponding score equation. When the external study sample size is large, the uncertainty in θ̂_(k) is negligible compared to the internal study, and we will use the notation θ*_(k) instead of θ̂_(k), where θ*_(k) is the probability limit of θ̂_(k) under the external study.
The assumption that the external study uncertainty is negligible compared to the internal study has been made by many authors (e.g., Chaudhuri, Handcock, and Rendall 2008; Qin et al. 2015; Chatterjee et al. 2016; Huang, Qin, and Tsai 2016; Cheng et al. 2019). It is made based on the consideration that the internal study sample size is usually not large due to the collection of new covariates and, to improve estimation efficiency, the external studies to be considered usually have much larger sample sizes. Please see Section 7 for more discussion. In simulation studies in Section 5 we also show the performance when the external study sample sizes are not very large. The summary information from the kth external study is

E_(k){h_(k)(Y, X_(k); θ*_(k))} = 0, (1)

where the expectation E_(k)(·) is taken under f_(k)(Y|X_(k)).
It is worth pointing out that (1) is a very general way to summarize external study information, not only information derived from parametric models as above. For instance, many population registries and large databases provide outcome summary information, such as the mean, median, and standard deviation for continuous outcomes and the prevalence for binary outcomes, stratified by demographics such as age and sex. Such information can all be formulated in the form of (1) with different h_(k) functions. As an example, the disease prevalence information given by E_(k)(Y | X_(k) ∈ X) = θ*_(k), for a stratum defined by X_(k) ∈ X for some set X, can be summarized by (1) with h_(k)(Y, X_(k); θ_(k)) = (Y − θ_(k)) I(X_(k) ∈ X).

The CML Method Assuming Population Homogeneity
Hereafter we will use E(·) to denote expectations under the internal study data distribution. When all study populations are the same, (1) becomes 0 = E{h_(k)(Y, X_(k); θ*_(k))} = E{E[h_(k)(Y, X_(k); θ*_(k)) | X, Z]}. Thus, defining U_(k)(X, Z; β, θ*_(k)) = ∫ h_(k)(Y, X_(k); θ*_(k)) f(Y|X, Z; β) dY, we then have

E{U_(k)(X, Z; β_0, θ*_(k))} = 0, k = 1, . . . , K, (2)

which summarizes the information from the kth external study in the form of moment constraints under the internal study covariate distribution.
To incorporate the external summary information in (2) into estimating β_0, the CML method introduces a discrete distribution p_i ≥ 0 on the internal study covariate data (X_i^T, Z_i^T)^T, i = 1, . . . , n, and the CML estimator β̂_CML for β_0 is defined through

(β̂_CML, p̂_1, . . . , p̂_n) = argmax Σ_{i=1}^n {log f(Y_i|X_i, Z_i; β) + log p_i} subject to p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i g(X_i, Z_i; β) = 0, (3)

where g(X, Z; β) = (U_(1)(X, Z; β, θ*_(1))^T, . . . , U_(K)(X, Z; β, θ*_(K))^T)^T. CML-type estimators have been proposed and studied by many authors under different settings (e.g., Qin 2000; Qin et al. 2015; Chatterjee et al. 2016; Huang, Qin, and Tsai 2016; Han and Lawless 2019; Zhang et al. 2020; Sheng et al. 2021), and they are closely connected to the empirical likelihood literature (Qin and Lawless 1994; Owen 2001). When all study populations are the same, the CML estimator defined through (3) is more efficient than the MLE, and the efficiency gain comes from the integration of the external summary information. With heterogeneous populations, however, the CML method in general no longer works, in the sense that the CML estimator can be severely biased after incorporating the external information.

The PCML Method for Heterogeneous Populations
In the presence of heterogeneous populations, the moment constraints in (2) may no longer be valid. To account for this, we introduce unknown nuisance parameters γ_0(k) = E{U_(k)(X, Z; β_0, θ*_(k))}, k = 1, . . . , K, to represent the biases of the moment constraints resulting from the population difference. Thus, the moment constraints from all external studies can be reparameterized as

E{g(X, Z; β_0)} = γ_0, where γ_0 = (γ_0(1)^T, . . . , γ_0(K)^T)^T.

The zero components of γ_0 identify the external studies that are based on the same population as the internal study and whose summary information should be incorporated. It is desirable to estimate the zero components of γ_0 as exact zeros, which will simultaneously select the external studies that provide useful information and incorporate the information into internal model fitting. Shrinkage estimation techniques can help achieve this goal.
Among the many shrinkage techniques available in the literature that are capable of shrinking the parameter estimates to exactly zero, the Lasso (Tibshirani 1996) is one of the most widely used due to its simplicity and effectiveness. Zou (2006) developed the adaptive Lasso (aLasso) so that both the selection of the zero parameters and the estimation of the nonzero parameters are consistent and the final estimator is as efficient as if the zero parameters are removed from the model before estimation, the so-called oracle property (see also Fan and Li 2001). Therefore, we adopt the aLasso shrinkage to achieve our goal of data integration. Since we are considering multiple external studies, intuition suggests that the shrinkage needs to be carried out at the study level so that an external study should no longer be considered if it is for a different population. Such a group-wise shrinkage can be achieved based on the group Lasso (gLasso) developed by Yuan and Lin (2006). The adaptive version of group Lasso (agLasso) by Wang and Leng (2008) ensures the consistency of both group selection and parameter estimation, as well as the oracle property of the final estimator. Thus, we adopt the agLasso to deal with multiple external studies.
Based on all the considerations so far, we propose the PCML estimator β̂ for β_0 that is the β-component of (β̂, γ̂) defined through

(β̂, γ̂, p̂_1, . . . , p̂_n) = argmax [Σ_{i=1}^n {log f(Y_i|X_i, Z_i; β) + log p_i} − n Σ_{k=1}^K P_λn(γ_(k))] subject to p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i {g(X_i, Z_i; β) − γ} = 0, (4)

where

P_λn(γ_(k)) = λ_n ||γ_(k)|| / ||γ̃_(k)||^w (5)

is the agLasso penalty with tuning parameter λ_n > 0, ||·|| is the Euclidean norm, γ̃_(k) is some first-step consistent estimator of γ_0(k), and w > 0 is some user-specified positive number. The most natural choice for γ̃_(k) in the setting we consider is to take the corresponding components from γ̃ = n^{−1} Σ_{i=1}^n g(X_i, Z_i; β̂_MLE). A common choice for w is w = 1 or 2 (e.g., Zou 2006; Wang and Leng 2008).
Compared to the optimization in (3) for the CML estimator, the optimization in (4) for the proposed PCML estimator has an agLasso penalty that shrinks the estimate of γ 0 toward zero. When the degree of shrinkage is properly chosen through the tuning parameter λ n , some γ 0(k) will be estimated exactly as zeros and the corresponding information summarized in the moment constraints (2) will be automatically incorporated into the estimation of β 0 . Furthermore, when only the γ 0(k) corresponding to the external studies that are for the same population as the internal study are estimated as zeros, the resulting PCML estimator for β 0 will be consistent and have improved efficiency compared to the MLE. The penalization in (4) allows simultaneous selection of useful external information and estimation of β 0 incorporating that information.
Using the Lagrange multiplier method, it is easy to show that the PCML constrained optimization in (4) can be equivalently written as

(β̂, γ̂) = argmax_{β,γ} min_ρ [Σ_{i=1}^n log f(Y_i|X_i, Z_i; β) − Σ_{i=1}^n log{1 + ρ^T (g(X_i, Z_i; β) − γ)} − n Σ_{k=1}^K P_λn(γ_(k))], (6)

where p_i = 1/[n{1 + ρ^T (g(X_i, Z_i; β) − γ)}] and ρ is the Lagrange multiplier. The expression in (6) is the so-called saddle-point representation in the empirical likelihood literature (e.g., Owen 2001; Newey and Smith 2004) and is the expression used both for the derivation of the asymptotic properties and for the numerical implementation in later sections.

Group-wise Shrinkage versus Component-wise Shrinkage
The agLasso penalty (5) is based on the intuition that an external study should no longer be considered for information integration if its population is different from the internal study. The penalty (5) ensures that data integration is carried out in a group-wise manner at the study level. However, a further investigation reveals that not all components of (2) are necessarily violated when the internal and external study populations differ. Example 1 shows that (2) may still hold if the difference between the internal and external study populations is only in f(X). Example 2 shows that, in the presence of a difference in any of f(Y|X, Z), f(Z|X), and f(X), some components in (2) may still hold even though the rest do not.
Example 1. Suppose that the internal and external studies have different distributions for X but share the same distribution for both Y|(X, Z) and Z|X, and thus, they also share the same distribution for Y|X. Suppose that the external study used a correctly specified model f (Y|X; θ ), which implies that E[h(Y, X; θ * )|X] = 0. Note that in this case, due to the correct specification of f (Y|X; θ ), the moment equality is conditional on X and thus, holds regardless of the difference in the X distribution between the internal and external studies. Thus, the same calculation leading to (2) shows that E[U(X, Z; β 0 , θ * )|X] = 0, which then implies (2).
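As a quick numerical check of the mechanism in Example 1, the following sketch uses toy distributions of our own choosing (the internal models for X, Z, and Y and the least-squares score h below are illustrative assumptions, not the article's specification): because the external working model is correct conditionally on X, the implied constraint function U has mean zero under the internal population even though the X distributions differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Toy internal population (our own choice, not the article's):
# X ~ N(0,1), Z|X ~ N(X,1), Y|(X,Z) ~ N(0.5X + 0.5Z, 1), hence Y|X ~ N(X, 1.25).
X = rng.normal(0.0, 1.0, n)
Z = X + rng.normal(0.0, 1.0, n)

# An external study sharing Y|(X,Z) and Z|X but with a different X distribution
# fits the correctly specified model Y|X ~ N(theta1 + theta2*X, sigma^2); since
# the model is correct conditionally on X, theta* = (0, 1) regardless of f(X).
theta_star = np.array([0.0, 1.0])

# U(X, Z; beta_0, theta*) = E[h(Y, X; theta*) | X, Z] under the internal model,
# where h is the least-squares score (Y - theta1 - theta2*X) * (1, X)^T.
cond_mean_Y = 0.5 * X + 0.5 * Z
resid = cond_mean_Y - theta_star[0] - theta_star[1] * X
U = np.column_stack([resid, resid * X])

# Both components of E{U} are (numerically) zero, so the moment constraint (2)
# still holds even though the external f(X) differs from the internal one.
print(U.mean(axis=0))
```

Up to Monte Carlo error, both components are zero, matching the conclusion of Example 1.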
Example 2. Consider an external study whose data are generated as in Cases (a)-(c) below and that fitted a linear working model f(Y|X; θ) with θ = (θ_1, θ_2)^T. Some calculation shows that under the internal study E(X^2) = 0.5, E(Z) = 0, and E(XZ) = 0. Now consider three cases for the external study data distribution. (a) The distributions of Z|X and Y|(X, Z) are the same as the internal study while X ∼ N(0, √0.5^2). Some calculation shows that θ* = (0.75, 1.5)^T, which then leads to some, but not all, components of E{U(X, Z; β_0, θ*)} being zero. (b) The distributions of X and Y|(X, Z) are the same as the internal study while Z|X ∼ N(X + 0.5, 1^2). Some calculation shows that θ* = (0.75, 1.5)^T, which again leads to some zero and some nonzero components of E{U(X, Z; β_0, θ*)}. (c) The distributions of X and Z|X are the same as the internal study while Y|(X, Z) ∼ N(0.25 + 0.5X + 0.5Z, 1^2). Some calculation shows that θ* = (0.25, 0.5)^T, which also leads to a mix of zero and nonzero components of E{U(X, Z; β_0, θ*)}.

The implication of these two examples is that γ_0(k) may still have zero components even if the kth external study has a population different from the internal study so that γ_0(k) ≠ 0. In this case the external study still provides some useful information for efficiency gain. This observation is also easy to understand from a practical perspective. For example, the association between the same outcome and covariates may not differ much across populations with certain specific heterogeneity.
Therefore, for information integration, it may be beneficial to apply a component-wise shrinkage on γ_0(k) instead of a group-wise shrinkage, especially when no external study appears to be useful under group-wise shrinkage. A component-wise shrinkage in this case may help incorporate the useful information contained in a subset of the moment constraints from an external study that is not selected by the group-wise shrinkage. Component-wise shrinkage is easy to achieve by replacing the penalty Σ_{k=1}^K P_λn(γ_(k)) in (4) with its component-wise version that penalizes each component of γ separately. As a matter of fact, the component-wise shrinkage is a special case of the group-wise shrinkage based on the agLasso penalty in (5) with all group sizes equal to one, by pretending that each moment constraint came from a separate external study. There is no special treatment needed for component-wise shrinkage in either the asymptotic property investigation or the numerical implementation.
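To make the two penalization schemes concrete, here is a minimal sketch assuming the agLasso penalty form λ_n ||γ_(k)|| / ||γ̃_(k)||^w from (5); the function and variable names are ours, and component-wise shrinkage is obtained simply by using groups of size one.

```python
import numpy as np

def aglasso_penalty(gamma, gamma_tilde, groups, lam, w=2.0):
    """Adaptive group-Lasso penalty: sum_k lam * ||gamma_k|| / ||gamma_tilde_k||**w.

    gamma:       current value of the stacked bias parameters.
    gamma_tilde: first-step consistent estimate used for the adaptive weights.
    groups:      list of index arrays, one per external study.
    """
    return sum(
        lam * np.linalg.norm(gamma[idx]) / np.linalg.norm(gamma_tilde[idx]) ** w
        for idx in groups
    )

# Three moment constraints: study 1 contributes one, study 2 contributes two.
gamma = np.array([0.0, 3.0, 4.0])
gamma_tilde = np.array([2.0, 3.0, 4.0])

# Group-wise shrinkage: one group per study.
group_pen = aglasso_penalty(gamma, gamma_tilde,
                            [np.array([0]), np.array([1, 2])], lam=0.1)

# Component-wise shrinkage: every component is its own group of size one.
comp_pen = aglasso_penalty(gamma, gamma_tilde,
                           [np.array([j]) for j in range(3)], lam=0.1)

print(group_pen, comp_pen)
```

Note that the two schemes generally assign different penalty values to the same γ, which is what allows the component-wise version to retain individual near-valid constraints from an otherwise discarded study.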

Estimation Consistency and √ n-Convergence
We first establish the consistency of the proposed estimator (β̂, γ̂). The assumptions needed on the model f(Y|X, Z; β) and the moment function g(X, Z; β) are similar to those for the consistency of the MLE and the empirical likelihood estimator (e.g., Newey and McFadden 1994; Qin and Lawless 1994; Newey and Smith 2004). In addition, the penalty function needs to be small enough compared to the likelihood function, and this is achieved through an assumption on the tuning parameter λ_n.
Here Assumption 1(vi) is a functional central limit theorem and is a standard result in empirical process theory (Donsker's theorem; e.g., Andrews 1994; van der Vaart and Wellner 1996; van der Vaart 2000; Kosorok 2008). It is a uniform version of the standard central limit theorem that holds under typical regularity conditions (e.g., Newey and McFadden 1994). Assumption 1(vii) makes sure that the shrinkage effect in estimating the nonzero components of γ_0 disappears as n → ∞. Under Assumption 1, the consistency of (β̂, γ̂) is given by Theorem 1. The proof makes use of the saddle-point representation in (6). This proof, together with the proofs of all other theorems, is given in the supplementary materials.
To establish the √n-convergence of (β̂, γ̂) we need some additional assumptions.

The assumptions needed on f(Y|X, Z; β) and g(X, Z; β) are similar to those in Newey and McFadden (1994), Newey and Smith (2004), and Liao (2013). The √n-convergence requires that the tuning parameter converges to zero sufficiently fast so that the penalty term in (4) is asymptotically small compared to the likelihood term in (4). Assumption 2(v) ensures that λ_n converges to zero fast enough to obtain the √n-convergence of (β̂, γ̂). Under Assumptions 1 and 2, the √n-convergence of (β̂, γ̂) is given by Theorem 2. Theorem 2 also gives the √n-convergence of the Lagrange multiplier ρ̂ corresponding to (β̂, γ̂), and this result is oftentimes of independent interest. For example, the tuning parameter selection in Section 4.2 makes use of this result.

External Study Selection Consistency
Let K_{=0} = {k : γ_0(k) = 0, k = 1, . . . , K} and K_{≠0} = {k : γ_0(k) ≠ 0, k = 1, . . . , K} denote the index sets for the zero and nonzero groups in γ_0, respectively, corresponding to external studies that are for the same population as the internal study and those for different populations. Let K̂_{=0} = {k : γ̂_(k) = 0, k = 1, . . . , K} and K̂_{≠0} = {k : γ̂_(k) ≠ 0, k = 1, . . . , K} denote the index sets for the zero and nonzero groups in γ̂, respectively, corresponding to external studies that are selected by the PCML method for information integration and those that are not selected. The consistency of γ̂ from Theorem 1 implies that γ̂ falls into a shrinking neighborhood of γ_0 with probability approaching one, and thus, for those γ_0(k) ≠ 0 we must have γ̂_(k) ≠ 0 with probability approaching one. However, consistency of γ̂ alone does not imply γ̂_(k) = 0 with probability approaching one for those γ_0(k) = 0, and thus does not imply external study selection consistency. To ensure selection consistency, we impose a further condition on the convergence rate of the tuning parameter λ_n. This condition is based on taking γ̃ = n^{−1} Σ_{i=1}^n g(X_i, Z_i; β̂_MLE), which is √n-consistent for γ_0 under Assumptions 1 and 2. The condition ensures that λ_n does not converge to zero too fast, so that its shrinkage effect can shrink γ̂_(k) to exactly zero for those γ_0(k) = 0. This condition is the same as that in Zou (2006).
Theorem 4 (Asymptotic Normality) establishes, under Assumptions 1, 2, and 3, the joint asymptotic normal distribution of the estimators of the internal model parameters and of the nuisance parameters representing the nonzero biases of the moment constraints. From Theorem 4, some calculations lead to the asymptotic distribution for the PCML estimator β̂.
Theorem 5 (Oracle Efficiency). Under Assumptions 1, 2, and 3,

√n(β̂ − β_0) →_d N(0, (S_0 + G_0^T Ω_0^{−1} G_0)^{−1}). (7)

Compared to the MLE based on the internal study data alone, whose asymptotic variance is S_0^{−1}, the proposed PCML estimator β̂ has a smaller asymptotic variance because G_0^T Ω_0^{−1} G_0 is positive-definite. On the other hand, the asymptotic variance in (7) is the same as that of the oracle CML estimator defined in (3) with only g_{=0}(X, Z; β) used. In other words, the proposed estimator has the same efficiency as the oracle CML estimator incorporating only the useful external information. This optimal estimation efficiency for the parameters of interest, together with the external study selection consistency from Theorem 3, implies an oracle use of information from external studies in the presence of population heterogeneity.
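The efficiency comparison can be sanity-checked numerically: for any positive-definite matrices standing in for S_0 and Ω_0 and any conformable G_0 (random stand-ins below, since the precise definitions of these matrices are given in the theorem statements), (S_0 + G_0^T Ω_0^{−1} G_0)^{−1} is smaller than S_0^{−1} in the positive-semidefinite ordering.

```python
import numpy as np

rng = np.random.default_rng(2)
q, d = 3, 4  # dim(beta) and total dimension of the retained moment constraints

# Random positive-definite stand-ins for S_0 (information matrix) and Omega_0.
A = rng.normal(size=(q, q)); S0 = A @ A.T + np.eye(q)
B = rng.normal(size=(d, d)); Om0 = B @ B.T + np.eye(d)
G0 = rng.normal(size=(d, q))

V_mle = np.linalg.inv(S0)
V_pcml = np.linalg.inv(S0 + G0.T @ np.linalg.inv(Om0) @ G0)

# V_mle - V_pcml is positive semidefinite, i.e., the PCML estimator is at
# least as efficient as the MLE for every linear combination of beta.
min_eig = np.linalg.eigvalsh(V_mle - V_pcml).min()
print(min_eig >= -1e-10)  # True
```

This is just the matrix fact that adding a positive-semidefinite term to an information matrix can only shrink its inverse.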

Implementation Based on Saddle-Point Representation
The numerical implementation of the proposed PCML method is based on the saddle-point representation (6). For a given value of (β, γ), an inner loop computes the Lagrange multiplier ρ̂(β, γ) that solves the inner optimization in (6). When the given value (β, γ) is close to the true value (β_0, γ_0), which is indeed the case during the implementation if the initial value of (β, γ) is taken to be the consistent estimator (β̂_MLE, γ̃), the inner loop is a concave maximization with a unique maximizer (e.g., Han 2014). Thus, the inner loop can be easily implemented based on the Newton-Raphson algorithm, for which the initial value can be simply set as ρ = 0 because of Theorem 2.
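The inner loop can be sketched as follows. This is a minimal, hypothetical implementation assuming the inner problem is the empirical-likelihood-type concave maximization of Σ_i log{1 + ρ^T(g_i − γ)} over ρ; the rows of `G` stand for g(X_i, Z_i; β) − γ at the current outer-loop value.

```python
import numpy as np

def inner_loop_rho(G, max_iter=100, tol=1e-10):
    """Newton-Raphson for the inner-loop Lagrange multiplier rho.

    G: (n, d) array whose i-th row is g(X_i, Z_i; beta) - gamma.
    Maximizes the concave function sum_i log(1 + rho^T G_i), starting
    from rho = 0 as suggested by Theorem 2.
    """
    n, d = G.shape
    rho = np.zeros(d)
    for _ in range(max_iter):
        denom = 1.0 + G @ rho                 # n-vector of 1 + rho^T g_i
        scaled = G / denom[:, None]
        grad = scaled.sum(axis=0)             # gradient of the objective
        hess = -scaled.T @ scaled             # Hessian (negative definite)
        step = np.linalg.solve(hess, grad)
        t = 1.0                               # step-halving keeps 1 + rho^T g_i > 0
        while np.any(1.0 + G @ (rho - t * step) <= 0):
            t *= 0.5
        rho_new = rho - t * step
        if np.linalg.norm(rho_new - rho) < tol:
            return rho_new
        rho = rho_new
    return rho
```

At the solution the implied weights p_i = 1/[n{1 + ρ̂^T(g_i − γ)}] sum to one, which offers a convenient convergence check.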
The outer loop computes the PCML estimator (β,γ ) in the following steps.
Step 1. Given (β̂^(l), γ̂^(l)), set γ̂^(l+1)_(k) equal to zero if condition (8) holds, and equal to the root of (9), treated as an equation for γ_(k), if (8) does not hold.

Step 2. Set β̂^(l+1) equal to the root of (10), treated as an equation for β.
Step 3. Repeat Step 1 and Step 2 until convergence, such that ||β̂^(l+1) − β̂^(l)|| and ||γ̂^(l+1) − γ̂^(l)|| are smaller than some prespecified small number and K̂_{=0} no longer changes between iterations.

Equations (9) and (10) are the first-order conditions of the saddle-point representation (6) with respect to γ_(k) when γ_(k) ≠ 0 and with respect to β, respectively, treating ρ̂(β, γ) as an implicit function of β and γ. These equations can be solved based on the Newton-Raphson algorithm, for which the calculation of the Jacobian matrices of the left-hand sides of (9) and (10) again needs to treat ρ̂(β, γ) as an implicit function of β and γ. The expression of the Jacobian matrix for (10) is the same as that in Han and Lawless (2019), and the expression for (9) can be derived similarly. Details are omitted here due to their lengthy expressions.

Tuning Parameter Selection
The rate of convergence of the tuning parameter λ_n is crucial when deriving the asymptotic properties of the PCML estimator in Section 3, and Assumptions 2(v) and 3 specify sufficient conditions on the convergence rate that guarantee the √n-convergence and the oracle property of the PCML estimator. For practical implementation, however, we need an effective way of selecting a concrete value for the tuning parameter.
We now discuss how to select C > 0 by following the idea in Liao (2013). From the proof of Theorem 4 it is seen that υ is a (d + q)-dimensional standard Gaussian random vector, where d = dim(γ_0) = Σ_{k=1}^K d_k. On the other hand, under √n-consistent estimation, the left-hand side of (11) has the same asymptotic distribution as √n ρ̂_(k) = √n B_k ρ̂, where B_k is the d_k × d matrix that selects the components ρ_(k) from ρ. Therefore, to account for the study heterogeneity of the left-hand side of (11) and to normalize the linear combination of υ, we allow the C in the tuning parameter λ_n = C n^{−1/2−w/4} to be study-specific and choose C_(k) = ||B_k Υ̂||_F, where ||·||_F is the Frobenius norm and Υ̂ is an estimate of Υ with a preliminary PCML estimator plugged in. For the preliminary PCML estimator the tuning parameter can be taken as λ_n = n^{−1/2−w/4}, that is, with C = 1.
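This study-specific choice can be sketched as follows; `upsilon_hat` below is a placeholder matrix (in practice it would be the plug-in estimate of Υ described above), and since B_k merely selects the rows of Υ̂ belonging to study k, we slice instead of forming the selection matrix explicitly.

```python
import numpy as np

def study_tuning_params(upsilon_hat, dims, n, w=2.0):
    """lambda_{n,(k)} = C_(k) * n**(-1/2 - w/4) with C_(k) = ||B_k Upsilon_hat||_F.

    upsilon_hat: plug-in estimate of Upsilon (placeholder here).
    dims:        list of d_k, the number of moment constraints per study.
    """
    lambdas, start = [], 0
    for d_k in dims:
        # Frobenius norm of the rows of Upsilon_hat belonging to study k.
        C_k = np.linalg.norm(upsilon_hat[start:start + d_k, :], ord="fro")
        lambdas.append(C_k * n ** (-0.5 - w / 4.0))
        start += d_k
    return np.array(lambdas)

# Placeholder Upsilon_hat for two studies with d_1 = 1 and d_2 = 2 constraints.
lams = study_tuning_params(np.eye(3), dims=[1, 2], n=100, w=2.0)
print(lams)  # C_(1) = 1 and C_(2) = sqrt(2), each scaled by 100**(-1)
```

With w = 2 the common rate factor is n^{−1}, and each study's tuning parameter is rescaled by the magnitude of its block of Υ̂.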
Since external study 1 has the same data distribution as the internal study, we have γ_0(1) = 0, and thus incorporating the information from this external study should improve the efficiency of internal parameter estimation. For external study 2, some numerical calculation based on a sample size of 10^6 for both the internal and external studies shows that γ_0(2) = (−0.1651, −0.0036, −0.0957)^T. The second component of γ_0(2) is very close to zero, and thus part of the information available from external study 2 may be helpful for efficiency improvement as well. To evaluate the numerical performance of the PCML method, we consider three scenarios where the external summary information is available from (i) external study 1 only, (ii) external study 2 only, and (iii) both external studies. The MLE and the CML estimators are included for comparison. In each scenario, both the group-wise shrinkage and the component-wise shrinkage are applied. We take w = 2 in the penalty function (5).
For the external studies we consider two sample sizes, 50,000 and 3000, corresponding to large and moderate sizes, respectively. In both cases two internal sample sizes, n = 300 and 800, are considered. When the external sample size is 50,000, all replications use the same external study data. When the external sample size is 3000, in each replication the external data are regenerated together with the internal data. Table 1 contains the results for external sample size 50,000, and Table S1 in the supplementary materials contains the results for external sample size 3000, both based on 1000 replications. The observations from these two tables are very similar.

Simulation Observations
When only External Study 1 is used, the CML estimator CML-1 is the oracle CML estimator and, compared to the MLE, substantially reduces the empirical standard errors of the estimates of β_c, β_X2, and β_X3, the coefficients of the regressors used in External Study 1. This observation is in full agreement with the existing CML literature, where the efficiency improvement occurs mainly for the estimates corresponding to external study covariates. The PCML estimator with group-wise shrinkage (PCMLg-1) performs very close to CML-1, especially with n = 800. Even with n = 300, compared to the MLE, PCMLg-1 has substantially smaller empirical standard errors for the estimates of β_c, β_X2, and β_X3. The PCML estimator with component-wise shrinkage (PCMLc-1) performs almost identically to CML-1.
The closeness in performance in this case between PCMLg-1, PCMLc-1 and CML-1 is because the PCML method is able to automatically incorporate all the information available from External Study 1. For PCMLg-1, the selection rate of External Study 1 is 97.1% for n = 300 and 98.7% for n = 800, and for PCMLc-1 the selection rate of the entire External Study 1 is 99.6% for n = 300 and 99.8% for n = 800.
When only External Study 2 is used, the CML estimator CML-2 clearly has a large bias because the data distribution of this external study differs from that of the internal study. On the other hand, the CML estimator CML-2o, which uses only the second of the three moment constraints that External Study 2 provides, has little bias. In addition, CML-2o has a considerably smaller empirical standard error for the estimate of β_X1 compared to the MLE. This improved performance of CML-2o over the MLE arises because the second component of γ_0(2) is very close to zero. Thus, CML-2o may be treated as the oracle CML estimator in this case for comparison purposes. Here the PCML estimator with group-wise shrinkage (PCMLg-2) gives results identical to the MLE, since the group-wise shrinkage detected the distribution difference and made no use of the external information for both n = 300 and n = 800. The PCML estimator with component-wise shrinkage (PCMLc-2) performs almost identically to the oracle CML-2o when n = 800, showing the effectiveness of the component-wise shrinkage in integrating useful external information in the presence of population heterogeneity. The rate of estimating the second component of γ_0(2), and only this component, exactly as zero is 99.8% in this case. When n = 300, due to randomness in the internal data, PCMLc-2 sometimes also estimates the third component of γ_0(2), whose true value is −0.0957, as zero. Specifically, when n = 300 the rate of estimating the second component of γ_0(2) alone as zero is 75.4%, and the rate of estimating the second and third components, but not the first, as zero is 23.5%. Incorporating information from the third moment constraint leads to slight bias and larger empirical standard errors for PCMLc-2 compared to CML-2o, but these differences disappear when n = 800.
When both external studies are used, the CML estimator CML-12 has a large bias. Compared to both CML-1 and CML-2o, the oracle CML estimator CML-12o, which uses all moment constraints from External Study 1 and the second moment constraint from External Study 2, further reduces the empirical standard errors of certain estimates. The PCML estimator with group-wise shrinkage (PCMLg-12) performs almost identically to PCMLg-1, especially when n = 800, since the group-wise shrinkage correctly selected External Study 1 at a rate of 96.6% when n = 300 and 98.7% when n = 800, and never selected External Study 2.
The PCML estimator with component-wise shrinkage (PCMLc-12) performs almost identically to the oracle CML-12o when n = 800. When n = 300, PCMLc-12 has a slightly larger empirical standard error than CML-12o due to occasionally estimating the third component of γ_0(2) as zero. Specifically, when n = 300 the rate of correctly selecting External Study 1 together with only the second moment constraint from External Study 2 is 97.2%, and the rate becomes 99.4% when n = 800. Compared to PCMLc-1, PCMLc-12 shows better overall efficiency, especially when n = 800, due to the integration of additional useful information from External Study 2. Compared to PCMLc-2, the efficiency improvement of PCMLc-12 is substantial. Compared to PCMLg-12, PCMLc-12 clearly reduces the empirical standard error of the estimate of β_X1, corresponding to the covariate X1 that is used only by External Study 2.
Based on all these observations, the PCML method is very effective in incorporating useful external information in the presence of study population heterogeneity. In particular, the PCML estimator based on component-wise shrinkage can make partial use of the information from an external study that is not selected by the group-wise shrinkage. The numerical performance is overall excellent even with a small internal sample size.

Bootstrap for Inference
Table 1. Simulation results summarized based on 1000 replications with external study sample size 50,000. (Table body not reproduced; columns correspond to internal sample sizes n = 300 and n = 800.)

In finite samples, the standard error of the PCML estimator β̂ calculated from the asymptotic distribution (7) does not properly account for the finite-sample study selection error and thus may lead to poor inferences about β_0. A theoretical development and investigation of a method that takes the study selection error into account is challenging and beyond the scope of this article. Instead, we evaluate the performance of the bootstrap method for numerical calculation of the standard error. The results are summarized in Table 2. When n = 300 the bootstrap standard errors overall overestimate the empirical standard errors, but the overestimation becomes much milder when n = 800, in which case the difference is smaller for the component-wise shrinkage than for the group-wise shrinkage. In the presence of overestimation, the bootstrap leads to more conservative inference. Overall, when the internal sample size is not very small, the bootstrap method appears to have acceptable performance and provides a feasible way to calculate standard errors in the absence of other formal methods.
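As a sketch of the bootstrap standard error calculation evaluated here, with a generic `estimator` argument standing in for the PCML fitting routine (which is not reproduced in this snippet):

```python
import numpy as np

def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Nonparametric bootstrap standard errors: resample internal-study rows
    with replacement, refit, and take the standard deviation of the
    replicated estimates. `estimator` maps a data array to a parameter
    vector; in the article it would be the PCML fitting routine."""
    rng = np.random.default_rng(seed)
    n = len(data)
    p = np.atleast_1d(estimator(data)).size
    reps = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        reps[b] = np.atleast_1d(estimator(data[idx]))
    return reps.std(axis=0, ddof=1)
```

The article uses 200 bootstrap samples per replication, matching the default here.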

Data Application
Table 2. Results of the bootstrap method for standard error calculation, with external study sample size 50,000, 1000 replications, and 200 bootstrap samples for each replication. (Table body not reproduced; columns correspond to internal sample sizes n = 300 and n = 800.)

We apply the proposed method to study the association between the risk of developing high-grade prostate cancer (Gleason score ≥ 7) and certain risk factors. The effects of some commonly considered risk factors, including age, race, the prostate-specific antigen (PSA) level, the digital rectal examination (DRE) finding, and the prior biopsy result, have been studied extensively in the literature. Among these studies, Thompson et al. (2006) built an online risk calculator for the risk of developing high-grade prostate cancer, using data collected in the 1990s from 5519 men in the placebo group of the Prostate Cancer Prevention Trial (PCPT). This PCPT risk calculator is the first online prostate cancer risk assessment tool and remains among the most widely used. Detailed information about the study, including the model behind this risk calculator, is provided in Thompson et al. (2006). Recent research on the biological mechanisms related to the progression of prostate cancer shows that two specific biomarkers, TMPRSS2:ERG (T2:ERG) and prostate cancer antigen 3 (PCA3), may lead to better early detection of the disease (e.g., Tomlins et al. 2016). Therefore, it is of great interest to study the effects of both the aforementioned conventional risk factors and the new biomarkers on the risk of prostate cancer after adjusting for each other, as an update to the effect estimation typically done without considering the biomarkers.

We use part of the sample collected in Tomlins et al. (2016) as the internal data, which consists of 1218 men presenting for diagnostic prostate biopsy at seven community clinics throughout the United States. We fit the logistic regression model logit(P(Y = 1)) = β_c + β_1 log_2(X_1) + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + β_6 log_2(Z_1 + 1) + β_7 Z_2. Here Y is the high-grade prostate cancer status, X_1 is the PSA level (ng/ml), X_2 is age, X_3 is a binary indicator of an abnormal DRE result, X_4 is a binary indicator of negative previous biopsies, X_5 is a binary indicator of being African American, Z_1 is the PCA3 score, and Z_2 is a binary indicator dichotomized at the sample median of the T2:ERG score. When fitting this model, we incorporate the information available from Thompson et al. (2006) that led to the PCPT risk calculator, a logistic regression model given by logit(P(Y = 1)) = −6.2461 + 1.2927 log(X_1) + 0.0306 X_2 + 1.0008 X_3 − 0.3634 X_4 + 0.9604 X_5. This external information may help improve the accuracy of the effect estimation since the PCPT study has a fairly large sample size. There are some apparent differences between the internal study data distribution and the data distribution reported in Thompson et al. (2006). Of the 5519 men included in Thompson et al.'s analysis, 4.7% developed high-grade prostate cancer and 47.1% were aged 70 or older, while the corresponding numbers are 18.3% and 27.2% for the internal study cohort. For Thompson et al.'s cohort the median PSA level was 1.5 ng/ml and 88.6% had a PSA level ≤ 4.0 ng/ml; in contrast, for the internal study cohort the median PSA level is 4.6 ng/ml and 36.5% have a PSA level ≤ 4.0 ng/ml. The heterogeneity in the study cohorts can also be clearly seen from γ̂ = n^{-1} Σ_{i=1}^n g(X_i, Z_i; β̂_MLE).
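The external PCPT model quoted above can be evaluated directly from its published coefficients. A minimal sketch (the function name and interface are ours; note that the external model uses the natural log of PSA, whereas the internal model uses log base 2):

```python
import math

def pcpt_risk(psa, age, abnormal_dre, prior_negative_biopsy, african_american):
    """Predicted probability of high-grade prostate cancer from the PCPT
    logistic model reported in Thompson et al. (2006). The three indicator
    arguments are 0/1; psa is in ng/ml."""
    eta = (-6.2461
           + 1.2927 * math.log(psa)        # natural log, per the PCPT model
           + 0.0306 * age
           + 1.0008 * abnormal_dre
           - 0.3634 * prior_negative_biopsy
           + 0.9604 * african_american)
    return 1.0 / (1.0 + math.exp(-eta))    # inverse logit
```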
In this application we have γ̂ = (0.042, 0.056, 2.909, 0.006, −0.004, −0.004). The large component 2.909 clearly indicates cohort heterogeneity. On the other hand, the last three components of γ̂ are very close to zero, showing that part of the external information may be useful for improving the internal estimation. In our analysis the group-wise shrinkage did not lead to any information integration. The component-wise shrinkage estimated the last three components of γ, as well as the second component, to be exactly 0. Table 3 contains the analysis results. Due to the population heterogeneity, the CML estimates of the effects of DRE, prior biopsy, and race are quite different from the MLE. In contrast, the PCML estimates are considerably closer to the MLE, with some effect change observed for prior biopsy and race. For the MLE, the effects of all covariates but race (indicator of being African American) are highly significant. The internal study cohort contains only 81 African Americans, and this small number leads to the nonsignificance of the corresponding effect (p-value = 0.788). The PCML method incorporates part of the information from Thompson et al.'s (2006) cohort, which includes 175 African Americans. The information integration leads to a better estimate of the race effect together with a reduced standard error, resulting in significance (p-value = 0.002) that is in agreement with the general findings in the existing literature. Based on the PCML method, while having had previous negative biopsies is significantly associated with a decreased risk of high-grade prostate cancer, a higher PSA level, older age, an abnormal DRE result, being African American, and higher PCA3 and T2:ERG scores are all associated with significantly increased risk.
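The diagnostic γ̂ is simply the sample average of the external moment function evaluated at the internal MLE; a minimal sketch, with `g` standing in for whatever moment function the external constraints define:

```python
import numpy as np

def gamma_hat(g, X, Z, beta_mle):
    """gamma_hat = n^{-1} * sum_i g(X_i, Z_i; beta_MLE).
    Components far from zero flag heterogeneity between the internal
    cohort and the external study; components near zero suggest the
    corresponding external constraints may still be useful."""
    n = X.shape[0]
    return np.mean([g(X[i], Z[i], beta_mle) for i in range(n)], axis=0)
```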

Discussion
We considered two penalties in this article, the adaptive group Lasso (agLasso) penalty for group-wise selection of external studies and the adaptive Lasso (aLasso) penalty for component-wise selection of external study moment constraints. It is hard to give a general rule on when to use which penalty, as the choice may depend on many factors, including the respective covariates used by the internal and external studies, the forms in which they enter the models, and their dimensions. In our experience the group-wise selection is more conservative in selecting external information. Therefore, a possible approach is to first carry out a group-wise selection using the agLasso penalty; if none of the external studies is selected, or if the selected studies cover only a small subset of the internal study covariates, a component-wise selection using the aLasso penalty can then be employed. It is worth pointing out that alternative penalties may be considered to achieve the same theoretical properties. One example is the SCAD penalty (Fan and Li 2001), a widely used alternative to Lasso-type penalties. Another is the family of penalties proposed by Breheny and Huang (2009) that leads to bi-level variable selection and, when applied to our data integration setting, could achieve oracle external information selection while maintaining the group structure of the external studies. We will investigate this in a future project.
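For illustration only (the precise form of penalty (5) is defined earlier in the article), the textbook aLasso and agLasso penalty values, built from a preliminary estimate γ̃ with exponent w, can be sketched as:

```python
import numpy as np

def alasso_penalty(gamma, gamma_tilde, lam, w=2.0):
    """Component-wise adaptive Lasso: components whose preliminary
    estimates gamma_tilde_j are small get large weights |gamma_tilde_j|^(-w)
    and are shrunk harder, enabling component-wise constraint selection."""
    weights = np.abs(gamma_tilde) ** (-w)
    return lam * np.sum(weights * np.abs(gamma))

def aglasso_penalty(gamma, gamma_tilde, blocks, lam, w=2.0):
    """Group-wise adaptive Lasso: one weight per external-study block, so
    an entire study's constraint vector is kept or dropped together."""
    total, start = 0.0, 0
    for d_k in blocks:
        sl = slice(start, start + d_k)
        total += (np.linalg.norm(gamma_tilde[sl]) ** (-w)
                  * np.linalg.norm(gamma[sl]))
        start += d_k
    return lam * total
```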
Following the literature, we assumed that the external study uncertainty is negligible compared to that of the internal study. Under settings different from the one considered in this article, there have been recent developments on accounting for the uncertainty associated with external studies when their sample sizes are not much larger than that of the internal study (e.g., Han and Lawless 2019; Zhang et al. 2020). A very interesting finding is that, under certain scenarios, the external study uncertainty may reduce the internal estimation variance (Han and Lawless 2019), an observation similar to the fact that, for the inverse probability weighting method in the missing data literature, using estimated weights reduces the asymptotic variance compared to using the true weights (e.g., Robins, Rotnitzky, and Zhao 1994; Liang et al. 2004). This deserves an investigation under the setting considered in this article.
In our simulation studies, the results in Tables 1 and S1 (in the supplementary materials) showed the excellent performance of the PCML method for point estimation. For standard error calculation we evaluated the bootstrap method, whose overall numerical performance seems acceptable. In some unreported simulation studies of the bootstrap method, as a way to account for the external information uncertainty when the external sample size was 3000, for each bootstrap sample from the internal study data we generated a value θ̃^(k) from the normal distribution with mean θ̂^(k) and variance equal to the corresponding covariance matrix of θ̂^(k). We then used the bootstrap samples paired with the generated θ̃^(k)'s to compute the bootstrap PCML estimates, which led to the bootstrap standard error. However, this way of accounting for external information uncertainty considerably overestimated the empirical standard error. As a future research topic, we will investigate standard error calculation that properly accounts for the external information uncertainty.
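The unreported external-uncertainty variant described above can be sketched as follows; `theta_hat` and `Sigma_hat` are stand-ins for the reported external parameter estimate and its covariance matrix:

```python
import numpy as np

def draw_external_estimates(theta_hat, Sigma_hat, n_boot, seed=0):
    """For each bootstrap replication, draw a perturbed external estimate
    theta_tilde ~ N(theta_hat, Sigma_hat) to mimic external-study
    uncertainty. Each draw is then paired with one internal bootstrap
    sample when recomputing the PCML estimate (fitting not shown)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(theta_hat, Sigma_hat, size=n_boot)
```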
In this article we considered the number of external studies K to be finite and fixed. There are many practical settings where a large number of external studies may be available, for which a more appropriate theoretical framework would let K increase with the internal sample size. In this case, in addition to studies that provide invalid information, there may also be studies that provide redundant information when they have similar model structures and variables. Such redundant information not only affects the computation but also brings theoretical complications. We are currently developing methods that deal with this more general and more challenging setting. Some other extensions of the proposed method are also of interest. When the new covariates collected by the internal study are high-dimensional, variable selection may be needed to build a sparse internal model; this setting is similar to the one in Sheng et al. (2021) and can be handled by adding an additional penalty for variable selection. Another extension is to take the design of the studies into account. In this article the internal study data are a random sample, but in practice biased sampling, such as case-control sampling, is often used for data collection, and it is important to take such study designs into consideration.

Supplementary Materials
The online supplementary materials contain an Appendix and R code.
Appendix: Contains (a) detailed proofs of all theorems, (b) an expression of Equation (4) under our simulation setting in Section 5, and (c) some additional simulation results. (Appendix.pdf)

R code: R code is provided for computing the estimator PCMLc-12 in Table 1. (Rcode-PCMLc-12.R)