Estimation and Inference for High-Dimensional Generalized Linear Models with Knowledge Transfer

Abstract Transfer learning provides a powerful tool for incorporating data from related studies into a target study of interest. In epidemiology and medical studies, the classification of a target disease can borrow information across other related diseases and populations. In this work, we consider transfer learning for high-dimensional generalized linear models (GLMs). A novel algorithm, TransHDGLM, that integrates data from the target study and the source studies is proposed. The minimax rate of convergence for estimation is established, and the proposed estimator is shown to be rate-optimal. Statistical inference for the target regression coefficients is also studied: asymptotic normality of a debiased estimator is established, which can be used to construct coordinate-wise confidence intervals for the regression coefficients. Numerical studies show significant improvement in estimation and inference accuracy over GLMs that use only the target data. The proposed methods are applied to a real data study concerning the classification of colorectal cancer using gut microbiomes, and are shown to enhance the classification accuracy in comparison to methods that use only the target data. Supplementary materials for this article are available online.


Introduction
Generalized linear models (GLMs) are widely used in many areas of statistical applications (Hastie et al., 2009). In genetic applications and other medical studies, the number of covariates can be quite large, and high-dimensional GLMs are frequently adopted for classifying diseases and health-related outcomes. In the age of big data, the availability of public datasets makes it possible to improve the learning performance of a new study by incorporating information from existing ones. This is the goal of transfer learning, which aims to incorporate the knowledge from different but related studies to enhance the accuracy of the target study of interest (Torrey and Shavlik, 2010). Transfer learning has been successfully applied in a range of different fields, including pattern recognition, natural language processing, and drug discovery (Pan and Yang, 2009; Turki et al., 2017; Bastani, 2018). In particular, transfer learning for GLMs has been used in image classification and disease diagnosis (Hosny et al., 2018; Sevakula et al., 2018). However, little is known about its statistical guarantees.
In this paper, we study transfer learning for high-dimensional GLMs in the setting where data are available from a target study and multiple auxiliary studies. In the target study, we observe n_0 i.i.d. samples x_i^(0) ∈ R^p and y_i^(0) ∈ Y ⊆ R, i = 1, . . ., n_0, drawn from a GLM with parameter β ∈ R^p. The negative log-likelihood for the target data is

L^(0)(β) = (1/n_0) Σ_{i=1}^{n_0} {ψ((x_i^(0))^T β) − y_i^(0) (x_i^(0))^T β},   (1)

where ψ is the link function, which satisfies certain smoothness conditions. Additionally, we have observations from K different auxiliary studies. For k = 1, . . ., K, let (x_i^(k), y_i^(k)), i = 1, . . ., n_k, denote the observations from the k-th study, drawn from a GLM with parameter w^(k) ∈ R^p and the same link function ψ. The similarity between the k-th study and the target study is captured by the contrast vector δ^(k) = w^(k) − β: the smaller the magnitude of δ^(k), the higher the similarity. Let h denote the similarity level such that max_{1≤k≤K} ‖δ^(k)‖_q ≤ h for some fixed q ∈ [0, 1]. Specifically, q = 0 corresponds to exactly sparse contrast vectors, while for q > 0, δ^(k) can have many nonzero coefficients provided their magnitudes decay relatively fast. The range of q in consideration is flexible in applications, and our proposed method can adapt to q.
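To make this setup concrete, the following sketch simulates a target logistic GLM together with K auxiliary studies whose coefficients w^(k) = β + δ^(k) differ from β by small, sparse contrasts (the q = 0 case). All dimensions, sample sizes, and contrast scales are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n0, K, s, h = 200, 150, 3, 5, 4  # illustrative sizes; s = sparsity of beta

beta = np.zeros(p)
beta[:s] = 0.5                      # sparse target coefficients

def sample_logistic(n, coef):
    """Draw (X, y) from a logistic GLM with coefficient vector `coef`."""
    X = rng.normal(size=(n, p))
    prob = 1.0 / (1.0 + np.exp(-X @ coef))
    y = rng.binomial(1, prob)
    return X, y

X0, y0 = sample_logistic(n0, beta)

# Auxiliary studies: w^(k) = beta + delta^(k) with ||delta^(k)||_0 <= h,
# so each auxiliary coefficient vector is close to beta.
aux = []
for k in range(K):
    delta = np.zeros(p)
    idx = rng.choice(p, size=h, replace=False)
    delta[idx] = rng.normal(scale=0.1, size=h)  # small contrasts: high similarity
    aux.append(sample_logistic(300, beta + delta))
```

Smaller contrast scales correspond to a smaller similarity level h and hence more transferable auxiliary studies.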
The goal is to optimally estimate and make inference for the target parameter β ∈ R p based on the available data from the target and auxiliary studies.

Related work
In the conventional setting where only data from the target study are available, estimation for high-dimensional GLMs has been well studied. Van de Geer (2008) uses an ℓ1-penalty and derives an oracle inequality and estimation error rates. Negahban et al. (2012) studies M-estimators and proves estimation error rates under the restricted strong convexity condition. Huang and Zhang (2012) considers convex loss functions with weighted Lasso penalties. van de Geer et al. (2014) proposes a debiasing procedure for inference by computing the correction score via another Lasso on the Hessian matrix. Cai et al. (2020a) introduces a debiasing procedure for GLMs with binary outcomes via quadratic optimization. The idea of debiasing has also been generalized to tackle high-dimensional proportional hazards models (Fang et al., 2017), mixed-effects models (Bradic et al., 2020; Li et al., 2019), and multiple testing (Zhang and Cheng, 2017; Dezeure et al., 2017; Javanmard and Javadi, 2019; Ma et al., 2020).
Transfer learning has been studied in different models. Cai and Wei (2021) considers nonparametric classification, establishes the minimax optimal rate, and proposes an adaptive classifier. Tripuraneni et al. (2020a) proposes an algorithm for linear models that assumes all the auxiliary studies and the target study share a common, low-dimensional linear representation. Transfer learning in general functional classes has been studied in Tripuraneni et al. (2020b) and Hanneke and Kpotufe (2020). Li et al. (2020a) proposes methods for transfer learning in high-dimensional linear models and establishes the minimax optimal rate. Li et al. (2020b) introduces a method for estimation and edge detection in high-dimensional Gaussian graphical models with knowledge transfer. However, the methods established in the aforementioned two papers cannot be directly used, as the link functions in GLMs are nonlinear in general. Liang et al. (2020) studies high-dimensional classification with auxiliary outcomes in the setting where the same set of individuals is used to generate different outcomes, which differs from our setting.
A related but different problem is multi-task learning (Zhang and Yang, 2017), where the goal is to jointly estimate all the parameters for multiple tasks. Multi-task learning has been studied in various settings, including linear regression (Agarwal et al., 2012; Dondelinger et al., 2020) and graphical models (Chen et al., 2010; Danaher et al., 2014).
An optimal multi-task procedure does not necessarily yield an optimal estimator for the target task in transfer learning.

Our contributions
A novel algorithm is developed for estimation and inference in high-dimensional GLMs with knowledge transfer. The proposed method estimates the target parameter and the contrast vectors jointly via constrained ℓ1-minimization. The minimax rate of convergence is established, and the proposed estimator is shown to attain the optimal rate under mild conditions. The optimal rate for transfer learning is faster than the corresponding single-task rate under mild similarity conditions between the auxiliary and target tasks.
A debiasing method is introduced in the transfer learning setting. The debiased estimator of an individual coefficient is shown to be asymptotically normal and is then used for constructing confidence intervals. It is shown that this debiased estimator has a smaller remaining bias than the one in the single-task setting. As a result, the asymptotic normality holds under weaker sparsity conditions on β in transfer learning when the auxiliary studies are sufficiently informative. Consequently, inference for a given coefficient β_j is no longer restricted to the "ultra-sparse" regime for β. This reveals the benefit of transfer learning for statistical inference.

Organization
The rest of the paper is organized as follows. In Section 2, we introduce a transfer learning algorithm using a constrained ℓ1-minimization approach for estimation in GLMs. Section 3 provides the theoretical guarantees for our proposal and establishes the minimax lower bound. In Section 4, we introduce a debiasing procedure for inference on β_j and prove its asymptotic normality. To guarantee positive transfer, an aggregation procedure is developed in Section 5. Section 6 examines the numerical performance of our proposed algorithms in comparison to some existing methods; the results provide empirical evidence of the gain of transfer learning. The proposed methods are applied to analyze a microbiome data set for classifying colorectal cancer in Section 7; the results demonstrate the advantage of transfer learning. The proofs and additional numerical results are given in the supplementary materials (Li et al., 2021). The sub-Gaussian norm of a random variable u ∈ R is ‖u‖_{ψ2} = sup_{l≥1} l^{−1/2} E^{1/l}[|u|^l], and the sub-Gaussian norm of a random vector x ∈ R^p is ‖x‖_{ψ2} = sup_{‖v‖_2=1} ‖v^T x‖_{ψ2}.

Notation
Let z_α be the (1 − α)-th quantile of the standard normal distribution.

Transfer learning via constrained ℓ1-minimization
In this section, we introduce our proposed algorithm via constrained ℓ1-minimization. We begin with preliminaries and model setup in Section 2.1. The rationale behind the proposed method is described in Section 2.2, and the algorithm for estimating β is introduced in Section 2.3. The theoretical guarantees, including both the upper and matching lower bounds, are provided in Section 3.

Model setup
Formally, the target model can be written as

y_i^(0) | x_i^(0) ∼ f(y | x_i^(0)) ∝ exp{[y · (x_i^(0))^T β − ψ((x_i^(0))^T β)] / c(σ^(0))},   (2)

where β ∈ R^p is the target parameter of interest, c(σ^(0)) is a nuisance scale parameter, and ψ(·) is the cumulant generating function of y given x. The GLM is, first of all, a generalization of the linear model: setting ψ(μ) = μ²/2 and c(σ) = σ² in (2) recovers the (Gaussian) linear model. Model (2) also includes other popular models such as logistic, multinomial, and Poisson regression models. In the high-dimensional regime where p can be much larger than the sample size n_0, β is often assumed to be sparse, such that the number of nonzero elements of β, denoted by s, is much smaller than p. With i.i.d. samples {(x_i^(0), y_i^(0))}_{i=1}^{n_0} drawn from model (2), the general approach is to minimize the negative log-likelihood function (1) with some sparsity-inducing penalty.
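As an illustration of this single-task approach, the sketch below minimizes the ℓ1-penalized logistic negative log-likelihood by proximal gradient descent (ISTA). This is a generic solver sketch for the penalized problem, not the paper's specific algorithm; the step size, penalty level, and problem sizes are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_logistic(X, y, lam, n_iter=500):
    """l1-penalized logistic regression via proximal gradient (ISTA).

    Minimizes (1/n) * sum_i [psi(x_i'b) - y_i * x_i'b] + lam * ||b||_1,
    with psi(u) = log(1 + exp(u)), so psi'(u) is the sigmoid."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / (4n) bounds the Hessian's spectral norm.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))  # psi'(Xb)
        grad = X.T @ (mu - y) / n          # empirical score (gradient of NLL)
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(1)
n, p = 200, 50
beta = np.zeros(p); beta[:3] = 1.0
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
b_hat = lasso_logistic(X, y, lam=0.05)
```

With a step size of 1/L, each ISTA iteration is guaranteed not to increase the penalized objective, so the returned `b_hat` is at least as good as the zero initializer.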
In the context of transfer learning, we additionally observe {(x_i^(k), y_i^(k))}_{i=1}^{n_k}, k = 1, . . ., K, generated from the auxiliary models

y_i^(k) | x_i^(k) ∼ f(y | x_i^(k)) ∝ exp{[y · (x_i^(k))^T w^(k) − ψ((x_i^(k))^T w^(k))] / c(σ^(k))},   (3)

where w^(k) ∈ R^p is the coefficient vector for the k-th study, satisfying w^(k) = β + δ^(k). For convenience, we define δ^(0) = 0. As described in Section 1, we assume max_{1≤k≤K} ‖δ^(k)‖_q ≤ h for some constant q ∈ [0, 1]. We will introduce the estimator for β in the sequel.
As opposed to transfer learning for linear models, we see from (4) that there is no way to separate the estimation of β and {δ^(k)}_{k=1}^K in GLMs. This brings additional challenges in devising the algorithm and in the theoretical analysis. We propose a constrained optimization algorithm for jointly estimating the target parameter β and the contrast vectors {δ^(k)}_{k=1}^K. For a parameter vector b ∈ R^p, we denote the empirical score function of the k-th study by ∇L̂^(k)(b) = (1/n_k) Σ_{i=1}^{n_k} x_i^(k) {ψ'((x_i^(k))^T b) − y_i^(k)}, and consider

(β̂, δ̂^(1), . . ., δ̂^(K)) = argmin { ‖β‖_1 + Σ_{k=1}^K ‖δ^(k)‖_1 } subject to ‖∇L̂^(0)(β)‖_∞ ≤ λ_0, ‖∇L̂^(k)(β + δ^(k))‖_∞ ≤ λ_k, 1 ≤ k ≤ K, and ‖Σ_{k=0}^K n_k ∇L̂^(k)(β + δ^(k))‖_∞ ≤ λ_β,   (5)

where λ_β and λ_k, 0 ≤ k ≤ K, are tuning parameters that will be specified later. The objective function in (5) encourages sparse solutions. Notice that there are (K + 2) × p constraints in (5), while there are (K + 1) × p unknown parameters. All of these constraints are essential. Specifically, the constraint ‖∇L̂^(0)(β)‖_∞ ≤ λ_0 is inherited from the target model, imposing that β be identified as the true parameter of the target model. The constraint ‖∇L̂^(k)(β + δ^(k))‖_∞ ≤ λ_k comes from the score function of the k-th auxiliary study, imposing that δ^(k) be identified as w^(k) − β. The last constraint in (5) aggregates the moment equations for all the studies in use. It ensures that the estimation of β borrows information across auxiliary studies. Specifically, imagining that {δ^(k)}_{k=1}^K were known, the last constraint ensures that β is estimated based on N independent samples and hence can attain a faster convergence rate. We formalize the transfer learning algorithm in Section 2.3.

Estimation of the target parameter
Our proposed algorithm for estimating β is a detailed version of (5). In Step 1, an initial estimator of β is constructed by minimizing an ℓ1-penalized negative log-likelihood based only on the target data. In Step 2, we modify (5) by adding one more constraint that uses the initial estimator. We now introduce the detailed algorithm and then provide further comments on it. Let x_i^(k) be the i-th row of X^(k) and y_i^(k) be the i-th element of y^(k), k = 0, . . ., K.
Step 1: Compute an initial estimator β̂^(init). Step 2: Solve the modified version of (5) with the additional constraint. In comparison to (5), the last constraint in (7) is needed for technical convenience when ψ(μ) is nonlinear. This constraint is mild since λ_0 = o(1), and it can be removed if the target parameter satisfies ‖β‖_1 ≤ cλ_0^{−1} for some positive constant c. Computationally, the joint optimization in (6) is still a convex program.
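To illustrate the structure of the score constraints, the sketch below evaluates the empirical score functions at the true parameters for a logistic link and compares the per-study sup-norms with the aggregated, pooled-sample sup-norm. This is a numerical feasibility check of the constraint system, not the full convex program; all sizes and scales are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 30, 2
n = [100, 400, 400]                  # n_0, n_1, n_2
beta = np.zeros(p); beta[:3] = 1.0
# delta^(0) = 0; small, mostly-zero contrasts for the auxiliary studies
deltas = [np.zeros(p)] + [rng.normal(scale=0.05, size=p) * (rng.random(p) < 0.1)
                          for _ in range(K)]

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

data = []
for k in range(K + 1):
    X = rng.normal(size=(n[k], p))
    y = rng.binomial(1, sigmoid(X @ (beta + deltas[k])))
    data.append((X, y))

def score(X, y, b):
    """Empirical score of the logistic NLL: (1/n) X'(psi'(Xb) - y)."""
    return X.T @ (sigmoid(X @ b) - y) / X.shape[0]

N = sum(n)
# Per-study score sup-norms at the true parameters ...
per_study = [float(np.max(np.abs(score(*data[k], beta + deltas[k]))))
             for k in range(K + 1)]
# ... and the aggregated constraint, which pools all N samples
agg = float(np.max(np.abs(
    sum(n[k] * score(*data[k], beta + deltas[k]) for k in range(K + 1)) / N)))
```

Each per-study sup-norm fluctuates at the order √(log p/n_k), while the pooled quantity fluctuates at the order √(log p/N); this is the sense in which the aggregated constraint lets the estimation of β behave as if N samples were available.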
The proposed algorithm can also be used for multi-task GLM learning, where the goal is to jointly estimate β and {w^(k)}_{k=1}^K (Zhang and Yang, 2017). Specifically, after fitting β̂ and δ̂^(k) with the proposed algorithm, one can estimate w^(k) with β̂ + δ̂^(k). The corresponding convergence rate is implied by the results in Section 3.

Theoretical guarantees for estimation
In this section, we establish the minimax optimal rate and show that the proposed estimator is rate-optimal. Define the population Hessian matrices as Σ_b^(k) = E[ψ''((x^(k))^T b) x^(k) (x^(k))^T], k = 0, . . ., K, and write Σ_β = Σ_β^(0) for the target Hessian. We introduce the regularity conditions below.
Condition 3.1 ensures that the relevant eigenvalue quantity is bounded away from zero with high probability; it is mild for sub-Gaussian designs. The covariance matrices Σ^(k) can differ across studies; that is, the distributions of the covariates in different tasks are allowed to be heterogeneous. Condition 3.2 requires the random noises to be sub-Gaussian, which is typical in high-dimensional analysis for obtaining fast convergence rates. Condition 3.3 is a Lipschitz condition on the link function. Conditions 3.1, 3.2, and 3.3 are common in the study of GLMs; see Huang and Zhang (2012), Negahban et al. (2012), Cai et al. (2020b), and the references therein. They hold for linear, logistic, and multinomial models. Beyond GLMs, some other models for binary outcomes are also covered, such as model (1.1) in Cai et al. (2020a). The Poisson and log-linear models have heavy-tailed distributions and may not satisfy Condition 3.2. We note that our method is still applicable in those cases, but the convergence rate may not be as sharp as the one established in Theorem 3.1.
We now analyze the convergence rate of the estimator obtained by Algorithm 1. Formally, the parameter space we consider is Θ_q(s, h), where q ∈ [0, 1] enforces either a hard (q = 0) or soft (q ∈ (0, 1]) form of sparsity on the contrast vectors. Let n_min = min_{0≤k≤K} n_k and N = Σ_{k=0}^K n_k. In our theoretical analysis, the tuning parameter λ_β depends on the sparsity parameter s and on h. This choice is mainly for establishing a desirable ℓ1-error bound for the proposed estimator, which is needed in the debiasing step for statistical inference. As we show in Remark 3.1, if β is sufficiently sparse, then it suffices to choose λ_β = c_1 √(N log p), which is independent of h and s. In practice, the tuning parameters can be chosen by cross-validation. Next, we define a quantity T_{n_0,q} that will be used to characterize the rate of convergence.
We are now ready to present the theoretical guarantees for the output β̂ of Algorithm 1.
Theorem 3.1 establishes the convergence rate of β̂ under mild regularity conditions for any fixed q ∈ [0, 1]. We first highlight the gain of transfer learning over single-task GLM estimation. The minimax optimal rate for single-task GLMs is s log p/n_0. Theorem 3.1 implies that when N ≫ n_0 and T_{n_0,q} ∧ h² ≪ s log p/n_0, β̂ admits a faster convergence rate than the single-task minimax rate. This result implies that a significant amount of knowledge can be transferred from the auxiliary tasks to the target task when the similarities between the target and auxiliary studies are high. In fact, T_{n_0,q} ∧ h² is the minimax error rate for estimating a p-dimensional vector with sample size n_0 and ℓ_q-sparsity h; this term comes from the estimation of the contrast vectors. The condition T_{n_0,q} ∧ h² ≪ s log p/n_0 is guaranteed by h ≪ s when q = 0 and by h ≪ s√(log p/n_0) when q = 1. Hence, when the similarity between the auxiliary studies and the target study is high, the estimation performance can be improved by transfer learning. When q = 1, (9) recovers the convergence rate of Oracle Trans-Lasso in linear models (Li et al., 2020a). We also remark that the ℓ1-error bound in Theorem 3.1 is useful for conducting statistical inference for the target parameters; we illustrate this further in Section 4.
We now provide some discussion of the regularity conditions in Theorem 3.1. The condition s log p ≤ n_0 is standard for single-task sparse regression. The condition s log p ≤ √N is mild in the regime of interest, N ≫ n_0. As h is relatively small, a bounded T_{n_0,q} is not hard to satisfy in applications. Finally, the condition s log p · T_{n_0,q} ≤ c_1 guarantees the consistency of β̂ in ℓ1-norm.
Moreover, we establish the following lower bound, showing that our proposed algorithm makes full use of the auxiliary information: the convergence rate obtained in Theorem 3.1 is in fact minimax rate-optimal.
Theorem 3.2 (Minimax lower bound). Suppose β̂ is an estimator based on n_0 i.i.d. samples {(x_i^(0), y_i^(0))}_{i=1}^{n_0} drawn from model (2) and auxiliary samples {(x_i^(k), y_i^(k))}_{i=1}^{n_k}, k = 1, . . ., K, drawn from the auxiliary models (3).

Inference for the target parameters

In this section, we consider statistical inference on β_j for a given j ∈ [p] in the transfer learning setting. A debiasing method for the proposed estimator is introduced in Section 4.1, and its asymptotic normality is established in Section 4.2.

A debiased estimator
We introduce a debiased estimator for β_j based on β̂, the output of Algorithm 1. We use the target data for debiasing. Specifically, following the general debiasing recipe (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014), define

β̂_j^(db) = β̂_j + γ̂_j^T · (1/n_0) Σ_{i=1}^{n_0} x_i^(0) {y_i^(0) − ψ'((x_i^(0))^T β̂)},   (11)

where γ̂_j ∈ R^p is a correction score approximating the j-th column of the inverse Hessian Σ_β^{−1}, obtained by solving the constrained optimization

γ̂_j = argmin ‖γ‖_1 subject to the two constraints displayed in (12),

where c_1 and c_2 are two tuning parameters. In (12), the correction score γ̂_j is obtained via a constrained ℓ1-optimization based on the target Hessian matrix. The two constraints are linear, and therefore the optimization is convex and computationally efficient. The first constraint guarantees that γ̂_j approximates the j-th column of Σ_β^{−1}; the population Hessian matrix Σ_β is approximated by an empirical estimator based on the design of the target model and β̂. The second constraint bounds the magnitude of |(x_i^(0))^T γ̂_j| and is employed in justifying the Lyapunov central limit theorem for the sum of independent noises. Additionally, we point out that while the ℓ1-minimization in (12) encourages a sparse solution, the probabilistic limit of γ̂_j is not necessarily sparse. Indeed, we will see that the optimization in (12) is effective regardless of whether the j-th column of the true inverse Hessian Σ_β^{−1} is sparse or not. In other words, any feasible solution to (12) is a proper correction score for the debiasing task. A similar constraint has been studied in Zhu and Bradic (2018) for hypothesis testing in single-task high-dimensional linear models.
Here we extend this idea for constructing confidence intervals in high-dimensional GLMs, and further to the transfer-learning setting.
Our proposed debiasing scheme can also be used in single-task GLMs, in which case one can replace β̂ with, say, the single-task generalized Lasso estimator (Van de Geer, 2008).
In comparison, the Lasso-based debiasing for GLMs (van de Geer et al., 2014) requires β to be sparse. Another method, that of Cai et al. (2020a), computes the correction score under the same constraints as in (12), but its objective function is a quadratic function of γ. The theoretical benefits of the current method are demonstrated in detail in the next subsection.
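The one-step correction in (11) can be illustrated in a low-dimensional linear model, where the correction score can simply be taken as a column of the inverse empirical Hessian rather than being computed from the constrained ℓ1-program. The shrunken initial estimate below is a stand-in for a regularized (and hence biased) estimator; all quantities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, j = 500, 20, 0
beta = np.zeros(p); beta[j] = 1.0
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

# A crude initial estimate with deliberate shrinkage bias, standing in
# for a regularized estimator such as the transfer-learning output.
b_init = 0.8 * np.linalg.lstsq(X, y, rcond=None)[0]

# In low dimensions, take gamma_j as the j-th column of the inverse
# empirical Hessian; the paper's constrained l1-program produces an
# analogous correction score when p >> n.
Sigma_hat = X.T @ X / n
gamma_j = np.linalg.inv(Sigma_hat)[:, j]

residual_score = X.T @ (y - X @ b_init) / n    # empirical score at b_init
b_db_j = b_init[j] + gamma_j @ residual_score  # one-step bias correction
```

Because gamma_j here satisfies the first constraint exactly (gamma_j' Sigma_hat = e_j), the one-step correction removes the shrinkage bias of `b_init` in coordinate j.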
Next, we provide a variance estimator for the debiased estimator (11). In GLMs, variance estimation requires estimating σ_i² = Var(y_i^(0) | x_i^(0)). Our variance estimator is given as follows. For linear models, let σ̂² be the residual-based variance estimate. For models with c(σ) = 1 in (2), which include logistic, multinomial, Poisson, and log-linear models, let σ̂_i² = ψ''((x_i^(0))^T β̂). We then define the variance estimate V̂_j of β̂_j^(db) in (13). We establish the asymptotic distribution of β̂_j^(db) for a given 1 ≤ j ≤ p and show that the variance estimator V̂_j is consistent in the next subsection.
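For the logistic case, the plug-in variance can be sketched as below. The sandwich-style formula V_j = (1/n²) Σ_i (x_i' γ_j)² σ_i² is the standard form consistent with the limit described in Lemma 4.1; the paper's exact display (13) is not reproduced above, so treat this as an assumed, illustrative implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def variance_estimate(X, b_hat, gamma_j):
    """Plug-in variance of the debiased coordinate for a logistic GLM:
    V_j = (1/n^2) * sum_i (x_i' gamma_j)^2 * sigma_i^2,
    with sigma_i^2 = psi''(x_i' b_hat) = mu_i * (1 - mu_i)."""
    n = X.shape[0]
    mu = sigmoid(X @ b_hat)
    sigma2 = mu * (1.0 - mu)
    return float(np.sum((X @ gamma_j) ** 2 * sigma2) / n ** 2)

rng = np.random.default_rng(4)
n, p = 300, 10
X = rng.normal(size=(n, p))
b_hat = np.zeros(p); b_hat[0] = 1.0
gamma_j = np.linalg.inv(X.T @ X / n)[:, 0]  # low-dimensional correction score
V_j = variance_estimate(X, b_hat, gamma_j)
```

The resulting V_j is of order 1/n, which matches the n^{-1/2} length of the confidence intervals discussed below.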

Asymptotic normality
We next study the asymptotic distribution of β̂_j^(db) for a given 1 ≤ j ≤ p. We first show that the limiting distribution of β̂_j^(db) is normal in linear models, and then present the result beyond linear models.
In the following lemma, we prove that, with high probability, the variance estimator V̂_j in (13) converges to its limit, and that this limit is bounded below by a positive constant.
Lemma 4.1 (Asymptotic property of the variance estimator in linear models). Assume the conditions of Theorem 3.1 and ψ(μ) = μ²/2. Let V_j = (1/n_0²) Σ_{i=1}^{n_0} {(x_i^(0))^T γ̂_j}² σ_i². Then, for some positive constant c_0, V̂_j/V_j converges to one with high probability and V_j ≥ c_0/n_0. By Lemma 4.1, V_j is the probabilistic limit of V̂_j, and it is only a function of {x_i^(0)}_{i=1}^{n_0}. In Theorem 4.1, we decompose β̂_j^(db) − β_j into two parts: an asymptotically normal part z_j and a remaining bias rem_j. To obtain asymptotic normality, one needs the asymptotically normal part to dominate the bias term, that is, rem_j = o_P(√V_j). This leads to the following sparsity conditions for asymptotic normality:

s log p ≪ √N and √n_0 · T_{n_0,q} ≪ 1.   (14)

In the single-task setting, the minimax optimal rate in Cai and Guo (2017) implies that it is necessary to require s log p ≪ √n_0. We see that the requirement in (14) is much weaker when we have a large amount of auxiliary data (N ≫ n_0) and these data are similar to the target (√n_0 · T_{n_0,q} ≪ 1). The condition √n_0 · T_{n_0,q} ≪ 1 holds when h = o(√n_0/log p) if q = 0, and when h√log p = o(1) if q = 1. In words, when the similarity of the auxiliary studies is sufficiently high, i.e., when h is sufficiently small, the asymptotic normality of β̂_j^(db) requires weaker sparsity conditions than the debiased estimator in the single-task setting. Additionally, while we require a much weaker condition, the length of the proposed confidence interval in the transfer learning setting has the same order (n_0^{−1/2}) as in the single-task setting. In applications, these results imply more accurate coverage probabilities with the debiased transfer learning estimator without inflating the lengths of the confidence intervals. To summarize, the confidence interval in (16) is asymptotically valid for linear models when the conditions of Theorem 4.1 and (14) hold.
We remark that the results of Theorem 4.1 do not require sparsity of the inverse Hessian Σ_β^{−1}. When {Σ_β^{−1}}_{·,j} is sufficiently sparse, standard arguments can be leveraged to establish semi-parametric efficiency; our analysis does not assume a sparse Σ_β^{−1}, but semi-parametric efficiency is then not shown.
We now derive the asymptotic normality of the proposed β̂_j^(db) beyond linear models.
In this case, γ̂_j depends on β̂ and hence on y_i^(0), which creates technical difficulties in justifying the asymptotic normality in GLMs. For GLMs, we first impose a high-level Condition 4.1 and prove the main theorem; we verify this condition in different settings afterward.
Condition 4.1 (Independence of the correction score). There exists some γ_j^o ∈ R^p such that, conditioning on γ_j^o and {x_i^(0)}_{i=1}^{n_0}, the errors y_i^(0) − ψ'((x_i^(0))^T β) are independent with mean zero.
Condition 4.1 essentially requires that the estimated γ̂_j converge to a "deterministic" vector γ_j^o in ℓ1-norm. Here, "deterministic" means that γ_j^o is independent of the random noises y_i^(0) − ψ'((x_i^(0))^T β). We demonstrate how Condition 4.1 can be realized after presenting the following main theorem on the asymptotic normality of β̂_j^(db).
We first establish the consistency of the proposed variance estimator V̂_j in (13).
Lemma 4.2 (Asymptotic property of the variance estimator in GLMs). Assume the conditions of Theorem 3.1 and Condition 4.1. For V̂_j defined in (13), its limit V_j^o, and some positive constant c_0, V̂_j/V_j^o converges to one with high probability and V_j^o ≥ c_0/n_0. By Lemma 4.2, V_j^o is the probabilistic limit of V̂_j, and it is independent of the random noises by Condition 4.1. In fact, V_j^o is the variance of the asymptotically normal part of β̂_j^(db). In Theorem 4.2, we see that the remaining bias rem_j has an extra √log n_0 factor compared with the result for linear models (Theorem 4.1). This inflation comes from the uncertainty in the weights of the Hessian matrix, which is estimated based on β̂. The same extra term appears in Cai et al. (2020a) for the single-task debiased estimator. Implied by Theorem 4.2, the sparsity condition for asymptotic normality in GLMs is

s log p ≪ √(N/log n_0) and √(n_0 log n_0) · T_{n_0,q} ≪ 1.   (15)

With the target study only, the analysis in Cai et al. (2020a) requires s log p ≪ √(n_0/log n_0) for asymptotic normality. Again, this shows that transfer learning helps reduce the remaining bias when the auxiliary studies are sufficiently similar to the target one. We conclude that the confidence interval in (16) is asymptotically valid for GLMs when the conditions of Theorem 4.2 and (15) hold.
In the following, we verify Condition 4.1 in several cases: (i) for linear models, γ̂_j does not depend on the noises and the condition holds automatically; (ii) one can first split the n_0 target samples into two folds so that β̂ is independent of the fold used for debiasing; (iii) the j-th column of Σ_β^{−1} has at most s_j nonzero elements, with (s_j log p)² = o(n_0) and N ≲ n_0 log n_0.
To summarize, for linear models Condition 4.1 holds for free. For GLMs, Condition 4.1 can be guaranteed by a sample-splitting argument or by the sparsity of {Σ_β^{−1}}_{·,j}. The third statement demonstrates the benefit of the optimization (12) over the quadratic programming in Cai et al. (2020a): when s_j is sufficiently small, the sample-splitting technique can be avoided. In fact, sample splitting always leads to sub-optimal empirical performance, especially when the samples are limited. We will see from the numerical experiments that our proposal has reliable performance for both sparse and non-sparse inverse Hessian matrices.
We conclude this subsection by summarizing the construction of a two-sided (1 − α)-level confidence interval for β_j in Algorithm 2.
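The interval construction itself is elementary: given the debiased coordinate and its variance estimate, it is β̂_j^(db) ± z_{α/2}√(V̂_j). A minimal sketch (the numeric inputs are illustrative, not from the paper):

```python
from statistics import NormalDist
import math

def confidence_interval(b_db_j, V_j, alpha=0.05):
    """Two-sided (1 - alpha) interval: b_db_j +/- z_{alpha/2} * sqrt(V_j)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # e.g. 1.96 for alpha = 0.05
    half = z * math.sqrt(V_j)
    return b_db_j - half, b_db_j + half

lo, hi = confidence_interval(0.50, 0.0004)  # illustrative debiased estimate and variance
```

Since V̂_j is of order 1/n_0, the interval length is of order n_0^{−1/2}, matching the discussion above.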

Aggregated TransGLM with positive transfer warranty
As seen in the theoretical analysis, the performance of transfer learning always depends on the similarity level h, which is typically unknown. When h is large, incorporating the auxiliary studies into the analysis can reduce the estimation and inference accuracy for the target parameter. To guard against such "negative transfer", we propose an additional aggregation step based on the likelihood.
Given a collection of initial estimators, an aggregation procedure (Rigollet and Tsybakov, 2011; Dai et al., 2012) selects the best one, or a convex combination of the initial estimators, by minimizing certain empirical risk measures based on the observed data. Here our primary goal is to prevent negative transfer, and we propose a simple step that aggregates two initial estimators: the estimator obtained using the target samples only, and the estimator obtained using the combined data set. More specifically, we propose our final procedure, aggregated TransGLM, abbreviated "aTransGLM", which aggregates the transfer learning estimator β̂ with the single-task GLM Lasso β̂^(init); it is formally given below.
Algorithm 3: aTransGLM, an aggregated transfer learning algorithm. Input: β̂^(init), β̂, and ñ samples from the target study that are independent of (β̂^(init), β̂). Step 1: Threshold β̂ to obtain β̂_t. Step 2: Aggregate based on the likelihood. We show in the supplement (Li et al., 2021) that the truncated estimator β̂_t has the same convergence rate as β̂, while the sparsity of β̂_t is no larger than the order of s. This facilitates upper-bounding the ℓ1-error of the aggregated estimator and further prepares it for downstream statistical inference. In Step 2 of Algorithm 3, the independent target samples can be obtained by sample splitting of the target data before the analysis; hence, we consider ñ ≍ n_0.
The computed η̂ is a weight vector used to combine the two initial estimators. We also note that the optimization of η can be performed with Q-aggregation (Dai et al., 2012) or its variants, which achieve the same convergence rate with sharper constants. As an illustration, we focus on the more intuitive likelihood-based aggregation in Step 2.
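A minimal, selection-style version of this aggregation step can be sketched as follows: evaluate both candidates on held-out target data and put all weight on the one with smaller held-out negative log-likelihood. This is a simplified stand-in for the paper's Step 2 (which allows convex combinations), with illustrative data:

```python
import numpy as np

def logistic_nll(X, y, b):
    """Held-out logistic negative log-likelihood (averaged over samples)."""
    u = X @ b
    return float(np.mean(np.log1p(np.exp(u)) - y * u))

def aggregate(b_single, b_trans, X_hold, y_hold):
    """0/1-weight the candidate with smaller held-out NLL -- a simple
    selection-style aggregation guarding against negative transfer."""
    losses = [logistic_nll(X_hold, y_hold, b) for b in (b_single, b_trans)]
    eta = np.zeros(2)
    eta[int(np.argmin(losses))] = 1.0
    return eta[0] * b_single + eta[1] * b_trans, eta

rng = np.random.default_rng(5)
p = 20
beta = np.zeros(p); beta[:3] = 1.0
X_hold = rng.normal(size=(200, p))
y_hold = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_hold @ beta)))

b_good = beta + rng.normal(scale=0.05, size=p)  # close to the truth
b_bad = beta + rng.normal(scale=1.0, size=p)    # e.g. hurt by negative transfer
b_agg, eta = aggregate(b_bad, b_good, X_hold, y_hold)
```

By construction, the aggregated estimator's held-out loss never exceeds that of either candidate, which is the finite-sample analogue of the positive-transfer warranty.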
Theorem 5.1 shows that the aggregated estimator is guaranteed to be no worse than the single-task estimator with high probability, which demonstrates that the procedure provides a positive transfer warranty.
Let q ∈ [0, 1] be a fixed constant. Assume that the true parameters lie in Θ_q(s, h), s log p ≤ c_1 (n_0 ∧ √N), T_{n_0,q} ≤ c_1, s log p · T_{n_0,q} ≤ c_1, and N ≥ c_2 K n_0 for some positive constants c_1 and c_2. Then the guarantee of Theorem 5.1 holds with high probability. Theorem 5.1 essentially shows that the aggregated estimator converges at a rate no slower than those of β̂^(init) and β̂_t. In particular, it guarantees that the squared ℓ2-error of the aggregated estimator is of order s log p/n_0 with high probability as long as s ≥ 1. Hence, the performance of the aggregated estimator is robust to a large h, i.e., to low similarity levels. We also obtain the convergence rate in ℓ1-norm by utilizing the sparsity of β̂^(init) and of the thresholded estimator β̂_t.
The cost of aggregation is of order 1/n_0, which is negligible in most scenarios of interest. For example, when q = 0, as long as h ≥ 1 and s ≥ 1, the cost of aggregation is always dominated by the second term. Hence, in practice, there is almost no harm in performing the aggregation step.
The inference results based on the aggregated estimator can be proved similarly: one defines its debiased version as in (11) and applies the variance estimate V̂_j defined in (13). Implied by Theorem 5.2, the sparsity condition for asymptotic normality is given in (18). The requirement in (18) is always no worse than the sparsity requirement in the single-task setting, as a consequence of aggregation. The verification of Condition 4.1 proceeds as in Lemma 4.3. In the next section, we evaluate the numerical performance of the proposed estimators and the debiased estimator β̂_j^(db).
Simulation studies

In both settings (a) and (b), the design matrices are heterogeneous among studies. The target covariance matrix Σ^(0) is sparse in (a) but not in (b); hence, (b) provides a challenging setting for statistical inference.
To accommodate the practical setting where some auxiliary studies can be very far from the target study, we define A ⊆ {1, . . ., K} to be the set of informative studies. Specifically, we generate δ^(k) in two ways.
We see that in both (i) and (ii), {δ^(k)}_{k∈A} are sparser than {δ^(k)}_{k∈A^c}. Moreover, {δ^(k)}_{k∈A^c} are even denser than β, and we treat the studies in A^c as non-informative. In (i), δ^(k) is exactly sparse, and in (ii), δ^(k) is approximately sparse. We consider the four scenarios generated by crossing (a) and (b) with (i) and (ii), denoted by (a-i), (a-ii), (b-i), and (b-ii), respectively. Each configuration is replicated in 300 independent experiments. In the main paper, we report the two settings (a-i) and (b-i); the results for (a-ii) and (b-ii) are analogous and are reported in the supplementary materials (Section E).
We compare five methods numerically. The first is the generalized Lasso based on the target study only, denoted "GLM Lasso". The second is Algorithm 1, denoted "TransGLM". The third is Algorithm 1 based on the target and informative auxiliary studies; that is, we apply Algorithm 1 with {1, . . ., K} replaced by A. We denote this method "oracle TransGLM", as it depends on the oracle A. The fourth is Algorithm 3, denoted "aTransGLM". The last is a simple aggregated estimator, denoted "Simple-Agg": it first applies the GLM Lasso to each task and then aggregates the resulting K + 1 estimators using the optimization in Section 5. This method can be viewed as a meta-analysis paradigm with adaptive weights; it is widely used in applications for its simplicity, and we include it as another benchmark. For the inference results, we construct confidence intervals with oracle TransGLM, aTransGLM, and the single-task method of van de Geer et al. (2014). The detailed implementation of the different methods is described in the supplementary materials.

Classification errors
In every experiment, we evaluate the classification errors on an independent target sample of size 200. From Figure 1, we see that the performance of the single-task GLM Lasso does not change as the informative sample size changes. The oracle TransGLM significantly reduces the classification errors relative to the GLM Lasso as the informative sample size increases. It is never worse than the GLM Lasso because it never incorporates non-informative samples. The TransGLM method reduces classification errors when a significant proportion of the auxiliary samples are informative. This is because it uses all the auxiliary studies, and when few studies are informative, the errors can be large according to Section 4. The aTransGLM method also improves classification accuracy when the informative sample size is relatively large. On the other hand, the aggregation step in aTransGLM achieves robustness to negative transfer in the sense that its performance is never worse than that of the single-task GLM Lasso. When |A| is close to K, TransGLM has slightly smaller errors than aTransGLM, because TransGLM does not split the samples for aggregation while aTransGLM does. However, robustness can be more important than this mild gain in accuracy, and hence aTransGLM should be preferred over TransGLM in most practical applications. The "Simple-Agg" method offers limited improvement even when the informative samples are large, and its performance is very sensitive to the level of h. Comparing the plots at different levels of h, we see that the performances of oracle TransGLM, TransGLM, and aTransGLM degrade slightly as h increases, which agrees with our theoretical analysis. The overall performance also demonstrates that our method is robust to heterogeneous design matrices. The estimation errors are reported in the supplementary materials (Section E).
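The error metric above can be illustrated with a short sketch: the misclassification rate of the plug-in classifier 1{x′β > 0} on an independent target sample of size 200. The specific dimensions and signal values below are hypothetical, chosen only for illustration, not the paper's exact configuration.

```python
import numpy as np

def classification_error(beta, X_test, y_test):
    """Misclassification rate of the plug-in classifier 1{x'beta > 0}."""
    pred = (X_test @ beta > 0).astype(int)
    return float(np.mean(pred != y_test))

# Illustrative setup: an independent target sample of size 200
rng = np.random.default_rng(1)
p = 50
beta_true = np.zeros(p)
beta_true[:5] = 0.5                        # hypothetical sparse signal
X_test = rng.standard_normal((200, p))
prob = 1.0 / (1.0 + np.exp(-X_test @ beta_true))
y_test = rng.binomial(1, prob)
err_oracle = classification_error(beta_true, X_test, y_test)
```

Here `err_oracle` plays the role of the horizontal benchmark line in Figure 1: the average error obtained when the oracle β is plugged into the classifier.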

Confidence intervals
We construct 95% two-sided confidence intervals for β_j, j = 1, . . . , p. We compare our proposed debiased oracle TransGLM and debiased aTransGLM with the single-task inference method for GLMs (van de Geer et al., 2014).
In Table 1, we report the results in setting (a-i), where the inverse Hessian matrix Σ_β^{-1} is relatively sparse. All the methods have reliable coverage for β_j = 0. For β_j = 0.5, we see that the single-task method has coverage probabilities below the nominal level. This is mainly due to the large remaining bias of the single-task debiased estimators, which has been studied in Li (2020). The proposed debiased oracle TransGLM and debiased aTransGLM improve the coverage probabilities for β_j = 0.5 without inflating the lengths of the confidence intervals. The increased coverage probabilities are due to the smaller remaining bias of the debiased transfer learning estimators, which agrees with our theoretical results. In Table 2, we report the inference results in setting (b-i), which gives a non-sparse Σ_β^{-1}. For the true signals, the debiased transfer learning estimators have significantly higher coverage probabilities than the single-task debiased method. This again demonstrates the smaller remaining bias of the debiased transfer learning estimators.
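The coverage probabilities reported in Tables 1 and 2 are the fraction of replications whose interval contains the true coefficient. The following sketch (our own illustration, with arbitrary bias and standard-error values) shows the mechanism behind the tables: an estimator whose remaining bias is comparable to its standard error undercovers, while a well-debiased one attains roughly nominal coverage.

```python
import numpy as np

def coverage(beta_j, ests, ses):
    """Fraction of replications whose two-sided 95% CI covers beta_j."""
    z = 1.959964  # 97.5% standard-normal quantile
    return float(np.mean((ests - z * ses <= beta_j) & (beta_j <= ests + z * ses)))

rng = np.random.default_rng(0)
n_rep, se, beta_j = 300, 0.1, 0.5
ses = np.full(n_rep, se)
# well-debiased estimator: centered at the truth
unbiased = beta_j + se * rng.standard_normal(n_rep)
# estimator with remaining bias of about one standard error
biased = beta_j + se + se * rng.standard_normal(n_rep)
cov_unbiased = coverage(beta_j, unbiased, ses)
cov_biased = coverage(beta_j, biased, ses)
```

With 300 replications, as in the simulations, the unbiased estimator covers at close to the 95% nominal level, while the biased one falls noticeably short, mirroring the single-task undercoverage for β_j = 0.5.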
Table 1: Average coverage probabilities (standard deviations) for β_3 = 0.5 and β_13 = 0 in setting (a-i).

In the Zackular study, a number of covariates are selected using the single-task method with the 95% CI not including zero, and 18 covariates are selected using the transfer learning method with the 95% CI not including zero. In the Baxter study, 13 covariates are selected using the single-task method with the 95% CI not including zero, and 16 covariates are selected using the transfer learning method with the 95% CI not including zero.

For two sequences of positive numbers {a_n} and {b_n}, we write a_n ≲ b_n if a_n ≤ c b_n for some universal constant c ∈ (0, ∞), and a_n ≳ b_n if a_n ≥ c′ b_n for some universal constant c′ ∈ (0, ∞). We write a_n ≍ b_n if a_n ≲ b_n and a_n ≳ b_n. We use c, C, c_0, c_1, c_2, and so on to denote universal constants; their specific values may vary from place to place. For an integer k > 0, [k] denotes the set {1, 2, . . . , k}. For a vector v ∈ R^d and a subset S ⊆ [d], we use v_S to denote the restriction of v to the index set S. We write supp(v) := {j ∈ [d] : v_j ≠ 0}. Let ‖v‖_p = (∑_{j=1}^d |v_j|^p)^{1/p} for 0 < p ≤ ∞, and let ‖v‖_0 denote the number of non-zero coordinates of v. For a function f : R → R, ‖f‖_∞ denotes the essential supremum of |f|, and ḟ and f̈ denote the first and second derivatives, respectively.
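The vector notation above can be checked numerically; this small example (standard numpy, nothing paper-specific) computes supp(v), the ℓ_p norms, and a restriction v_S for a concrete vector.

```python
import numpy as np

v = np.array([3.0, 0.0, -4.0, 0.0, 1.0])

# supp(v): indices of the non-zero coordinates
supp = np.flatnonzero(v)               # indices 0, 2, 4

# ||v||_1, ||v||_2, ||v||_inf, and ||v||_0
l1 = np.sum(np.abs(v))                 # 3 + 4 + 1 = 8
l2 = np.sqrt(np.sum(v ** 2))           # sqrt(9 + 16 + 1) = sqrt(26)
linf = np.max(np.abs(v))               # 4
l0 = np.count_nonzero(v)               # 3 non-zero coordinates

# v_S: restriction of v to the index set S
S = np.array([0, 2])
v_S = v[S]                             # the coordinates (3, -4)
```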

The projection direction γ̂_j can adapt to the sparsity of the inverse Hessian. The advantage of γ̂_j is that it is robust to a non-sparse inverse Hessian and can achieve semi-parametric efficiency (van de Geer et al., 2014) when the inverse Hessian is sparse. An alternative is the quadratic optimization-based debiasing of Javanmard and Montanari (2014). We mention that Lemma 4.2 can be viewed as a generalization of Lemma 4.1 beyond linear models. This is because, in the case that ψ(μ) = μ, Condition 4.1 always holds with γ_j^o = γ̂_j. Hence, Lemma 4.2 recovers Lemma 4.1 when ψ(μ) = μ, i.e., in linear models.

Theorem 4.2 (Asymptotic normality for β̂_j^(db) in GLMs). Assume the conditions of Theorem 3.1 and Condition 4.1. It holds that β̂_j^(db) − β_j = rem_j + z_j,
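The decomposition in Theorem 4.2 directly yields coordinate-wise confidence intervals: once rem_j is asymptotically negligible, z_j drives the limiting normal distribution. A schematic form is given below, where ŝe_j denotes a consistent estimate of the standard deviation of z_j; this notation is ours, and the paper's exact variance estimator may differ.

```latex
\hat{\beta}^{(db)}_j - \beta_j = \mathrm{rem}_j + z_j,
\qquad \mathrm{rem}_j = o_P(\mathrm{se}_j),
\qquad z_j / \mathrm{se}_j \xrightarrow{d} N(0, 1),
```

so that an asymptotic level-(1−α) two-sided confidence interval takes the familiar form

```latex
\mathrm{CI}_j(\alpha)
  = \Big[\, \hat{\beta}^{(db)}_j - z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j,\;
            \hat{\beta}^{(db)}_j + z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j \,\Big].
```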

Lemma 4.3 (Sufficient conditions for Condition 4.1). Condition 4.1 holds if one of the following three statements holds: (i) ψ̇ is a positive constant.

Figure 1: Classification errors in setting (a-i) (first row) and setting (b-i) (second row). The horizontal line is the average classification error given by the oracle β.

Table 4: Significant covariates based on the single-task method or the proposed method at the 95% confidence level in the Zackular study. The p-values marked with * are significant at the 95% level.