Nonparametric Fusion Learning for Multiparameters: Synthesize Inferences From Diverse Sources Using Data Depth and Confidence Distribution

Abstract Fusion learning refers to synthesizing inferences from multiple sources or studies to make a more effective inference and prediction than from any individual source or study alone. Most existing methods for synthesizing inferences rely on parametric model assumptions, such as normality, which often do not hold in practice. We propose a general nonparametric fusion learning framework for synthesizing inferences for multiparameters from different studies. The main tool underlying the proposed framework is the new notion of depth confidence distribution (depth-CD), which is developed by combining data depth and confidence distribution. Broadly speaking, a depth-CD is a data-driven nonparametric summary distribution of the available inferential information for a target parameter. We show that a depth-CD is a powerful inferential tool and, moreover, is an omnibus form of confidence regions, whose contours of level sets shrink toward the true parameter value. The proposed fusion learning approach combines depth-CDs from the individual studies, with each depth-CD constructed by nonparametric bootstrap and data depth. The approach is shown to be efficient, general and robust. Specifically, it achieves high-order accuracy and Bahadur efficiency under suitably chosen combining elements. It allows the model or inference structure to be different among individual studies. And, it readily adapts to heterogeneous studies with a broad range of complex and irregular settings. This last property enables the approach to use indirect evidence from incomplete studies to gain efficiency for the overall inference. We develop the theoretical support for the proposed approach, and we also illustrate the approach in making combined inference for the common mean vector and correlation coefficient from several studies. The numerical results from simulated studies show the approach to be less biased and more efficient than the traditional approaches in nonnormal settings. The advantages of the approach are also demonstrated in a Federal Aviation Administration study of aircraft landing performance. Supplementary materials for this article are available online.


Introduction
Powerful data acquisition technologies have greatly enabled the simultaneous collection of data from different sources in many domains. It is often impossible or inappropriate to simply aggregate all the data to draw inference, due to concerns over storage, privacy, or cost constraints, or due to the desire to enhance inference by incorporating external or publicly available data sources. Instead, one needs to combine the inference results from the individual sources to draw an overall inference. Fusion learning refers to synthesizing inferences from multiple sources or studies to provide a more effective inference than that from any individual source or study alone.
Consider, for instance, the Federal Aviation Administration (FAA) study of aircraft landing performance examined in Section 7, with FAA guidelines recommending that the height of the aircraft at the crossing of the runway threshold be around 15.85 m and the touchdown distance be around 432 m from the runway threshold. To help assess whether aircraft landings generally follow these guidelines, we can simply test the hypothesis $H_0: \mu = (15.85, 432)'$, where $\mu$ is the mean vector of the height and distance. A sample of 2796 landing records (820 from Airbus and 1976 from Boeing) yields a combined sample mean of $(15.86, 432.95)'$ and a p-value of 0.942 from Hotelling's $T^2$ test. The finding would lead to the conclusion that aircraft landings generally comply with the FAA guidelines. Surprisingly, this conclusion appears to contradict the conclusion that we would draw from the two separate individual tests for Airbus and Boeing, with respective p-values 0.006 and 0.167. A closer examination of the scatterplots in Figure 1 of the two individual studies for Airbus and Boeing indicates that the two samples do not appear to follow the same distribution and neither follows an elliptical distribution, and that the Boeing sample seems to be truncated on the right. This casts doubt on the aforementioned conclusion of landing operations meeting the FAA guidelines, and suggests the need for a nonparametric test of the hypothesis that both landing operations from Airbus and Boeing meet the FAA guidelines, that is, $H_0: \mu_{\mathrm{Airbus}} = \mu_{\mathrm{Boeing}} = (15.85, 432)'$. More importantly, this example shows that blindly aggregating data from different data sources may not necessarily yield correct overall inferences. This example is discussed further in Section 7.

Outline of Proposed Approach and Highlights of Results
In this article, we develop an efficient nonparametric approach to fusion learning for making inference about the common features or parameters shared by different studies. This approach consists of two key parts: (a) we develop a general nonparametric inference procedure to ascertain a valid inference for each individual study by applying the notion of depth confidence distribution (depth-CD) and its associated depth confidence curve (depth-CV); specifically, in this article we construct a depth-CD using data depth and bootstrap, and show that the depth-CD is a comprehensive summary distribution of all the inferential information for the target parameter; (b) we derive an overall combined inference by suitably combining the depth-CVs from the individual studies. Our proposed approaches for individual-study inference and for the combined inference are completely nonparametric and data driven, and broadly applicable without any model assumptions. For instance, this can substantially broaden the scope of existing meta-analysis and evidence synthesis, where common practice routinely requires parametric models, often under the normality assumption; see, for example, Normand (1999) and Sutton and Higgins (2008). The proposed fusion learning framework is built on the depth confidence distribution (depth-CD), a new and powerful inference tool developed in this article by combining three distinct concepts: confidence distribution (e.g., Xie and Singh 2013; Schweder and Hjort 2016), data depth (e.g., Liu 1990; Liu, Parelius, and Singh 1999; Zuo and Serfling 2000), and bootstrap (Efron 1979). Simply put, a depth-CD is a sample-dependent distribution function defined on the space of the target parameter, which summarizes all the information from the data that is relevant for the inference of the parameter. Based on the evidence in the given data, a depth-CD can also be viewed as a reference function that reflects the plausibility or "confidence" associated with each possible parameter value in the parameter space. We investigate general properties of depth-CDs, in particular the following three, in Sections 3.2-3.4:

(P-1) a depth-CD is an omnibus form of confidence regions at all confidence levels;
(P-2) a depth-CD is an omnibus form of p-values for testing any parameter value in the entire parameter space;
(P-3) the contours of the level sets of a depth-CD shrink toward the true value of the parameter as the sample size increases.
These properties show that a depth-CD is useful in yielding all inference outcomes commonly sought in practice, and also that it is a versatile tool for nonparametric fusion learning. Under the proposed general depth-CD fusion learning framework, we develop an efficient nonparametric fusion learning approach by fusing the depth-CDs from individual studies where the depth-CD of each study is constructed from data depth and nonparametric bootstrap as described in Section 4. The fused output, similar to the individual input, remains a distribution function on the parameter space, which now depicts the level of "confidence" in assuming each possible parameter value in view of the totality of all available evidence gathered from all studies. This combined depth-CD, following P-1, 2, 3 above, can readily provide an overall inference such as confidence regions, p-values, or consistent point estimators.
The proposed fusion approach is shown to be efficient, general, and robust. More precisely, it is efficient, as it achieves high-order accuracy and Bahadur efficiency under suitably chosen combining elements, as shown in Section 5.1. It is general, as (a) it covers multiparameter settings, (b) it is nonparametric, and (c) it permits flexible choices of mappings of input functions, weighting schemes, and methods for deriving each individual depth-CD, across all studies. Such choices are often needed to account for the different circumstances or degrees of trustworthiness surrounding each individual study. It is robust, as it adapts efficiently to the fusion of heterogeneous studies, covering a wide range of complex and irregular studies, as investigated in Section 5.2. In fact, our fusion approach covers the particularly challenging setting where the target parameter may not even be estimable in some subset of studies, such as in the case of incomplete studies. Although the target parameter vector may not be estimable in incomplete studies, those studies often contain information in their data that can contribute to the overall inference of the target parameter, as the information among different component parameters is often correlated; see, for example, Liu, Liu, and Xie (2015). This data information from incomplete studies is often regarded as indirect evidence. Therefore, our fusion approach can incorporate both direct and indirect evidence, all in a nonparametric manner. This is a desirable property, since it gains efficiency in the overall inference, as shown in Section 6.1.
We present an extensive comparison study of our fusion method with several existing methods in the setting of making inference for a common mean vector, in three data scenarios. The results can be summed up as three advantages of our method, namely, in the absence of the normality assumption: (i) it preserves inference accuracy in hypothesis testing and confidence regions; (ii) its point estimator has less bias and is more efficient; and (iii) it achieves a gain of efficiency in the presence of heterogeneous studies. We also present a numerical study of our method in meta-analysis of correlation coefficients in Section 6.2. There we observe that traditional methods may yield misleading conclusions, while ours remains valid in both normal and nonnormal cases.
The rest of the article is organized as follows. Section 2 gives a general setup for fusion learning. Section 3 covers a brief review of confidence distribution (CD) and data depth, and then the development of the depth-CD and depth confidence curve (depth-CV) for multiparameter inference. Section 4 provides a concrete procedure for constructing a depth-CV using bootstrap and data depth. Section 5 develops nonparametric fusion learning by combining the depth-CDs derived from individual studies. Section 6 covers all simulation studies. Section 7 applies our fusion learning approach to the FAA study of aircraft landing performance. Section 8 contains further comments and discussions.

A General Problem Setting for Fusion Learning
We consider the problem of fusion learning in a general setting. Suppose that K independent studies are available for analysis to address the same scientific or business question. Let $\mathbf{X}_k = \{X_{k,1}, \ldots, X_{k,n_k}\}$ be the sample from the $k$th study, $k = 1, \ldots, K$, where the $X_{k,j}$'s follow an unknown $p_k$-dimensional multivariate distribution $F_k$. Assume that the parameter of interest $\theta_k$ is a finite-dimensional functional of $F_k$, which can be scalar or vector valued. Assume further that the studies share the common parameter of interest $\theta \equiv \theta_1 = \cdots = \theta_K$. The goal is to make an efficient inference for $\theta$ by fusing the information from all K studies, without assuming specific parametric forms for the distributions $F_k$. This setting covers, for example, the following two problems.

Example 1 (common mean inference). Let $\theta_k = \int x \, dF_k(x)$ be the mean of the distribution $F_k$; then $\theta \equiv \theta_1 = \cdots = \theta_K$ is the ($p$-dimensional) common mean of the K unknown distributions. We are interested in constructing a confidence region for $\theta$ or testing the hypothesis $H_0: \theta = \mu$ versus $H_1: \theta \neq \mu$ for a particular value $\mu$.
Example 2 (correlation inference). Consider the correlation coefficients of all pairs of components of the $p$-dimensional distribution $F_k$, and let $\theta_k$ collect all such pairwise correlation coefficients. Then $\theta \equiv \theta_1 = \cdots = \theta_K$ is a parameter vector of dimension $p(p-1)/2$. We are interested in testing the hypothesis $H_0: \theta = 0$ versus $H_1: \theta \neq 0$.
These two examples will be used throughout the development and the simulations to illustrate the proposed fusion approach. In various scenarios, these examples help showcase the merits of our approach, described briefly as efficient, general, and robust in the Introduction. To elaborate further, our fusion framework is general because it requires no specific parametric forms for the underlying distributions $F_k$. It is also robust because it permits a broad range of heterogeneity among studies: (i) the individual studies do not have to be homogeneous in terms of their designs, reporting formats, models, and inference methods; (ii) the studies can have different types of data (e.g., continuous, binary, or ordinal responses); (iii) the studies can be analyzed using different models, such as linear regression models for continuous outcomes in some studies and logistic regression models for binary outcomes in others; (iv) the individual studies can even use different inference methods, for instance, estimating the population location by the sample mean, the trimmed mean, or the median, as dictated by the specific situation of each study; and (v) our fusion framework does not require that the parameter $\theta_k$ be estimable in all studies.
To be more precise about the last point, our CD fusion approach applies even if the parameter of interest $\theta_k$ is not estimable in some studies, as long as there exists a known continuous mapping $f_k$ from the parameter space $\Theta$ (of $\theta$) to a lower-dimensional space $\Theta_k$ such that $\tilde{\theta}_k = f_k(\theta_k)$ is estimable. Similar formulations of partially estimable parameters also arose in the applications in Sutton and Higgins (2008) and Liu, Liu, and Xie (2015). Obviously, when all the $f_k$'s are identity mappings, this setting reduces to the case where all the $\theta_k$'s are estimable. Our fusion approach is thus adaptable to indirect evidence. Two numerical examples in Sections 6.1 and 7 illustrate how our approach gains efficiency in the final combined inference by incorporating indirect evidence.

Confidence Distribution (CD) and Confidence Curve (CV) for Scalar Parameter
The idea of the confidence distribution (CD) is born of the wish to use a sample-dependent distribution function, rather than a point estimate or a confidence interval estimate, to estimate an unknown parameter. For a scalar parameter $\theta \in \Theta$, a function $H_n(\cdot) \equiv H_n(\mathbf{X}_n, \cdot)$ is said to be a CD function for $\theta$ if it meets two requirements: (i) given a sample $\mathbf{X}_n$, it is a distribution function on $\Theta$; and (ii) at the true parameter value $\theta_o$, $H_n(\theta_o)$, as a function of the sample $\mathbf{X}_n$, follows the uniform distribution on (0,1) (Schweder and Hjort 2002; Singh, Xie, and Strawderman 2005). Essentially, (i) says that a CD function is a 'distribution estimate' dependent on the observed sample, and (ii) ensures that a CD function carries frequentist properties in terms of repeated sampling. For instance, under (ii), $(-\infty, H_n^{-1}(1-\alpha))$ is a $(1-\alpha)$ confidence interval, and $H_n(\theta_o)$ can be used as a p-value for testing the hypotheses $K_0: \theta \le \theta_o$ versus $K_1: \theta > \theta_o$. More precisely, given a dataset and an inferential procedure, a CD function represents a set of confidence intervals for all possible confidence levels. It describes a distribution of confidence associated with each $\theta$ value in $\Theta$.
Note that, conditional on the observed sample $\mathbf{X}_n$, a CD function $H_n(\theta) \equiv H_n(\mathbf{X}_n, \theta)$ is a distribution function on the parameter space $\Theta$. Let $\theta^*$ be a random variable following the distribution $H_n(\cdot)$. We refer to $\theta^*$ as a CD-random variable. Conditional on the given data, we can use simulated samples $\theta^*$'s from $H_n(\cdot)$ to carry out inference, as discussed in Section 4.
To illustrate the CD inference approach, we consider a simple example with a sample $x = \{x_i, i = 1, \ldots, n\}$ from $N(\theta, 1)$, where the mean $\theta$ is the parameter of interest. A natural CD for $\theta$ is $N(\bar{x}_n, 1/n)$, or equivalently its cumulative distribution function $H_n(\theta) = \Phi(\sqrt{n}(\theta - \bar{x}_n))$. Given a sample $x$, the function $H_n(\theta)$ is a distribution function on the parameter space $\Theta$, and it carries all commonly used inference outcomes. For instance, $(H_n^{-1}(\alpha/2), H_n^{-1}(1-\alpha/2))$ is a $(1-\alpha)$ confidence interval for $\theta$, for any $0 < \alpha \le 1$; the mean/median of $H_n$, $H_n^{-1}(0.5)\ (= \bar{x}_n)$, is a point estimate for $\theta$; and the tail mass $H_n(b)$ is a p-value for testing the one-sided hypothesis $K_0: \theta \le b$ versus $K_1: \theta > b$. The curve in Figure 2(a) is a CD function given a random sample of size $n = 20$. The dashed lines there help illustrate all types of inference outcomes from a CD function, including a point estimate of 0.11, a 90% confidence interval of $(-0.26, 0.48)$, and a p-value of 0.31 for testing $K_0: \theta \le 0$ versus $K_1: \theta > 0$. For ease of visualization of confidence intervals of different levels, the distributional form of a CD $H_n(\cdot) \equiv H_n(\mathbf{X}_n, \cdot)$ seen in Figure 2(a) can be expressed alternatively as a confidence curve (CV), seen in Figure 2(b), defined as $\mathrm{CV}_n(\theta) = 2\min\{H_n(\theta), 1 - H_n(\theta)\}$ (see Xie and Singh 2013, pp. 29 and 31; Schweder and Hjort 2016, pp. 10-14). While the CD function $H_n(\theta)$ represents the upper limits of one-sided confidence intervals, the confidence curve $\mathrm{CV}_n(\theta)$ gives an omnibus form of the limits of two-sided confidence intervals. In Figure 2(b), the two limits of a 90% confidence interval, identified by the two points on the confidence curve at the height $\alpha = 0.1$, are exactly the same as those obtained from Figure 2(a). Furthermore, following the duality between confidence intervals and hypothesis testing, $\mathrm{CV}_n(\theta_o)$ can serve as a p-value function for the two-sided hypothesis testing $K_0: \theta = \theta_o$ versus $K_1: \theta \neq \theta_o$, for any $\theta_o \in \Theta$. Also, the confidence curve peaks at the median of the CD function, that is, $\mathrm{CV}_n^{-1}(1) = 0.11$ as shown in Figure 2(b), which yields a median-unbiased estimate for $\theta$.
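As a quick numerical illustration of this scalar example, the following sketch (ours, in Python; the variable names are our own choices, not the paper's notation) computes the CD $H_n$, the confidence curve, a point estimate, a 90% interval, and the one-sided and two-sided p-values for $\theta = 0$:

```python
# A minimal sketch of the scalar CD/CV example for a N(theta, 1) sample.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20
x = rng.normal(loc=0.0, scale=1.0, size=n)   # sample from N(0, 1)
xbar = x.mean()

def cd(theta):
    """CD function H_n(theta) = Phi(sqrt(n) * (theta - xbar))."""
    return norm.cdf(np.sqrt(n) * (theta - xbar))

def cv(theta):
    """Confidence curve CV_n(theta) = 2 * min{H_n, 1 - H_n}."""
    h = cd(theta)
    return 2 * np.minimum(h, 1 - h)

theta_hat = xbar                                   # median of the CD
ci90 = (xbar + norm.ppf(0.05) / np.sqrt(n),        # invert H_n at 0.05 and 0.95
        xbar + norm.ppf(0.95) / np.sqrt(n))
p_one_sided = cd(0.0)   # lower-tail mass H_n(0): p-value for K_0: theta <= 0
p_two_sided = cv(0.0)   # CV_n(0): p-value for K_0: theta = 0 vs K_1: theta != 0
print(theta_hat, ci90, p_one_sided, p_two_sided)
```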
Without linking to the CD, the concept of the CV has actually been explored in Birnbaum (1961), Blaker (2000), and Blaker and Spjøtvoll (2000) for a scalar parameter $\theta$. In fact, Blaker and Spjøtvoll (2000) interpreted a CV as a summary of "how each parameter value is ranked in view of the data," with the ranking decreasing gradually from the peak along the tails. This ranking interpretation of the CV suggests a natural extension of the CD to the multiparameter setting by incorporating the notion of data depth, which was developed to establish a center-outward ordering of multivariate observations. We develop this extension after a brief review of data depth and its properties.

Data Depth and Center-Outward Ordering of Multivariate Data

Data depth is a way to measure how deep or central a given point is with respect to a multivariate sample cloud, say $\{\xi_1, \ldots, \xi_m\} \sim F$ in $\mathbb{R}^p$, or to its underlying distribution $F$. It naturally yields a measure of "outlyingness" and thus also a center-outward ordering of these multivariate points. Common depth functions include Mahalanobis depth (MD) (Mahalanobis 1936), halfspace depth (HD) (Hodges 1955; Tukey 1975), and simplicial depth (SD) (Liu 1990), among others.
Using simplicial depth as an example, the depth at a point $z$ in $\mathbb{R}^2$ is $\mathrm{SD}(z) = \binom{m}{3}^{-1} \sum_{1 \le i < l < k \le m} I\{z \in \triangle(\xi_i, \xi_l, \xi_k)\}$, the fraction of the triangles $(\xi_i, \xi_l, \xi_k)$ generated from the sample that contain $z$ inside. Clearly, a larger depth value indicates that the point lies more central within the data cloud, or closer to the center of the distribution.

[Figure 2. A confidence distribution function (a) and the corresponding confidence curve (b) for the mean parameter $\theta$ in the normal distribution $N(\theta, 1)$, obtained from a sample $x = \{x_i, i = 1, \ldots, 20\}$ from $N(0, 1)$. Illustrated is how to draw commonly used inference outcomes, such as a point estimate of 0.11, a 90% confidence interval of $(-0.26, 0.48)$, and a p-value of 0.31 for testing the hypothesis $K_0: \theta \le 0$ versus $K_1: \theta > 0$.]
By computing the depth values of all data points $\xi_i$ and then ordering the $\xi_i$'s by their descending depth values, we obtain the depth order statistics $\{\xi_{[1]}, \ldots, \xi_{[m]}\}$, ordered from the deepest (or most central) point $\xi_{[1]}$ to the most outlying point $\xi_{[m]}$. This center-outward ordering naturally gives rise to nested central regions expanding with increasing levels of probability coverage. The convex region spanning the deepest $(1-\alpha)$ fraction of the sample points is referred to as the $(1-\alpha)$-central region. Formally, the population and empirical versions of the $(1-\alpha)$-central region can be expressed, respectively, as $A_{(1-\alpha);F} = \{z : C_F(z|D_F) \ge \alpha\}$ and $A_{(1-\alpha);F_m} = \{z : C_{F_m}(z|D_{F_m}) \ge \alpha\}$. Here, $C_F(z|D_F)$ and $C_{F_m}(z|D_{F_m})$ are referred to as centrality functions, with, respectively,
$$C_F(z|D_F) = P_F\{\eta : D_F(\eta) \le D_F(z)\} \qquad (6)$$
and $C_{F_m}(z|D_{F_m})$ its empirical counterpart. The central regions $A_{(1-\alpha);F_m}$ are data driven and nonparametric, and are shown to be particularly useful for inference under asymmetric underlying distributions or nonstandard asymptotics. Lemma 1 below shows important properties of centrality functions. Its proof is in the appendix (supplementary material).
Lemma 1. Let $\eta$ be a random vector following a $p$-dimensional distribution $F$. The centrality function in Equation (6) satisfies the following properties:

(a) (Uniform transformation) The transformed variable $C_F(\eta|D_F)$ satisfies $C_F(\eta|D_F) \sim U(0,1)$, provided that the depth contours $\{\eta : D_F(\eta) = t\}$ all have probability zero w.r.t. $F$ for any $t > 0$.

(b) (Affine invariance) Let $A$ be a $p \times p$ nonsingular matrix and $b$ a $p \times 1$ constant vector, and let $\tilde{F}$ denote the distribution of $A\eta + b$. If the depth function is affine invariant, that is, $D_F(z) = D_{\tilde{F}}(Az + b)$ for any point $z \in \mathbb{R}^p$, then so is the centrality function, that is, $C_F(z|D_F) = C_{\tilde{F}}(Az + b|D_{\tilde{F}})$.
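The following minimal sketch (our own illustration, not code from the paper) computes an empirical centrality function based on the Mahalanobis depth and checks numerically that the empirical $(1-\alpha)$ central region has coverage close to $1-\alpha$, in line with Lemma 1(a):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
xi = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 2]], size=m)

mu, S_inv = xi.mean(axis=0), np.linalg.inv(np.cov(xi, rowvar=False))

def mahalanobis_depth(z):
    """Empirical Mahalanobis depth: MD(z) = 1 / (1 + (z - mu)' S^{-1} (z - mu))."""
    d = z - mu
    return 1.0 / (1.0 + d @ S_inv @ d)

depths = np.array([mahalanobis_depth(z) for z in xi])

def centrality(z):
    """Empirical centrality: fraction of sample points no deeper than z."""
    return np.mean(depths <= mahalanobis_depth(z))

# The empirical (1 - alpha) central region collects points with centrality >= alpha.
alpha = 0.10
in_region = np.array([centrality(z) >= alpha for z in xi])
print(in_region.mean())   # close to 1 - alpha, illustrating Lemma 1(a)
```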
Typically, depth functions have been used to rank sample points and provide a center-outward ordering in the sample space, as reviewed above. In this article, a depth function is used instead to rank parameter values and provide an ordering of all parameter values in the parameter space. Specifically, instead of applying the depth ordering to the sample points $\xi_i$ drawn from the distribution $F(\cdot)$, we apply it to the CD-random variables $\theta_i^*$ drawn from the confidence distribution $H_n(\cdot)$. This center-outward ordering in the parameter space can be interpreted as the plausibility of each parameter value relative to the others. This line of interpretation underlies the proposed CD fusion learning framework and justifies the resulting inferences, for example, using the central regions formed by the $\theta^*$'s as confidence regions for the parameter of interest $\theta$. This is elaborated further in Sections 3.2-3.4.

Depth-CD and Depth-CV: An Omnibus Form of Confidence Regions
The definition of a CD as a sample-dependent distribution function on the parameter space that can represent confidence regions for all possible confidence levels applies to a scalar parameter (as seen in Section 3.1.2) as well as a vector parameter. However, mathematical rigor for multiparameter CDs has so far been elusive, since the region created by the inversion of a multivariate cumulative distribution function may not be unique or suitable for providing any natural form of inference.
To this end, Singh, Xie, and Strawderman (2007), Xie and Singh (2013), and Schweder and Hjort (2016) have proposed to limit confidence regions to a certain subclass. In this article, we propose to consider the set of center-outward nested confidence regions derived using data depth, which gives rise to what we refer to as depth-CDs. The depth-CDs provide a natural extension of the CD concept from the scalar setting to the multiparameter setting.
As discussed in Section 3.1.1, a confidence curve (CV), as plotted in Figure 2(b), can provide two-sided confidence intervals of all levels for a scalar parameter, with the intervals expanding outward into the two tails as the level of confidence increases. The CV in Figure 2(b) clearly ranks parameter values in the parameter space from the center outward as the level $\alpha$ decreases. In fact, the CV defined in Equation (4) can be re-expressed, using data depth and its associated centrality function, as
$$\mathrm{CV}_n(\vartheta) = C_{H_n}(\vartheta|D) = P_{H_n}\{\theta^* : D_{H_n}(\theta^*) \le D_{H_n}(\vartheta)\}, \qquad (7)$$
where $D_{H_n}(\vartheta) = \min\{P_{H_n}(E) : E \text{ is a closed half-space in } \mathbb{R}^p \text{ and } \vartheta \in E\}$ is the half-space depth with $p = 1$, and $P_{H_n}$ is the probability measure corresponding to the CD $H_n(\cdot)$ on the parameter space. By extending Equation (7), we can directly define a CV for a parameter vector $\vartheta \in \Theta \subseteq \mathbb{R}^p$ as
$$C_{H_n}(\vartheta|D) = P_{H_n}\{\theta^* : D_{H_n}(\theta^*) \le D_{H_n}(\vartheta)\},$$
where $D$ is a depth function with the associated probability measure $P_{H_n}$ from a multivariate depth-CD $H_n(\cdot)$ on $\Theta$. Formally, we define the multivariate depth-CD and depth-CV as follows.

Definition 1. A function $H_n(\cdot) \equiv H_n(\mathbf{X}_n, \cdot)$ on $\Theta$ is a depth-CD for a parameter vector $\theta$ if: (i) it is a distribution function on the parameter space for any fixed sample set $\mathbf{X}_n$; and (ii) the region
$$R_{1-\alpha}(H_n) \equiv \{\vartheta : C_{H_n}(\vartheta) \ge \alpha\} \qquad (10)$$
is a confidence region for $\theta$ with a coverage probability of $(1-\alpha)$. Here, the centrality function associated with the depth $D$ and the CD $H_n(\cdot)$, that is, $C_{H_n}(\vartheta)$, is also referred to as a depth confidence curve (depth-CV). If the statement in (ii) holds only asymptotically, then we refer to $H_n$ and $C_{H_n}$ as an asymptotic depth-CD and asymptotic depth-CV, respectively.
Continuing with the half-space depth in the scalar setting, we see that the result in Lemma 1(a) resembles the familiar fact that $2\min\{G(Z), 1 - G(Z)\} \sim U(0,1)$ for a univariate random variable $Z$ with continuous cumulative distribution function $G$. This result ensures that $R_{1-\alpha}(H_n) = \{\vartheta : C_{H_n}(\vartheta) \ge \alpha\}$ leads to a $(1-\alpha)$ confidence region for a multiparameter $\theta$.
In conclusion, Lemma 1 ensures that the depth-CD and depth-CV (defined in Definition 1) can provide valid center-outward confidence regions of all levels. For convenience, we use the familiar bivariate normal distribution, for which the depth contours have explicit expressions, to illustrate the above framework of depth-CD inference in Example 3 below, although the normality assumption is not required in our general framework.
Example 3. Consider making inference for the mean parameter $\theta$ of a bivariate normal distribution based on a sample $\{Y_i\}_{i=1}^n$. Assuming that the covariance matrix $\Sigma$ is known, the bivariate normal distribution $BN(\bar{Y}_n, \Sigma)$ is a depth-CD for $\theta$, since: (I) $BN(\bar{Y}_n, \Sigma)$ is a sample-dependent distribution function on the parameter space of $\theta$; and (II) the depth contours of $BN(\bar{Y}_n, \Sigma)$, using any depth mentioned in Section 3.1.2, provide valid center-outward confidence regions of all levels.
For a given simulated sample $\{Y_i\}_{i=1}^{20}$ of size 20 under $\theta = (1,1)'$ and $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, using the Mahalanobis depth, we obtain the depth-CD on the parameter space $\Theta$ as the 3D surface plot in Figure 3(a). Projecting this 3D plot onto the two-dimensional plane below gives the depth contours in the gray-scale heat map of Figure 3(b), where the brighter the region, the larger the depth value. A depth contour in Figure 3(b) connects the points in $\Theta$ which have the same depth value $D_{H_n}(\cdot)$. Corresponding to Figure 3(b), a similar projection of the 3D surface plot of the depth-CV results in Figure 3(c), showing the contours which connect the points in $\Theta$ with the same centrality value $C_{H_n}(\cdot|D)$. For instance, the peak of the depth-CV corresponds to the deepest (or most central) point in Figure 3(c), which also corresponds to the highest point of the depth-CD in Figure 3(a) as well as the deepest point in Figure 3(b). The depth-CV in Figure 3(c) ranks the plausibility of each possible value in the bivariate parameter space $\Theta$. For instance, the black round dot lying on the contour with centrality value 0.9 implies that this particular parameter value is deeper than 90% of all possible parameter values w.r.t. the confidence distribution $H_n(\cdot)$, or, equivalently, more plausible than 90% of all possible parameter values in $\Theta$.
Inferences about $\theta$ can be derived from the depth-CD or the depth-CV with its contours in Figure 3. For example, the largest elliptical region within the contour of centrality value 0.1 (labeled with a solid triangle) in Figure 3(c) is a 90% confidence region for $\theta$. The deepest point, $(0.94, 0.92)$, marked by a cross in all three plots, can be considered the most plausible parameter value and thus a suitable point estimate for $\theta$. This point estimate is shown to be consistent in Section 3.4.
When $\Sigma$ is unknown, $BN(\bar{Y}_n, \hat{\Sigma})$ can be shown to be a depth-CD for $\theta$ asymptotically, where $\hat{\Sigma}$ is the sample covariance matrix. Similar illustration plots and asymptotic inferences can be drawn accordingly.

Depth-CD and Depth-CV: An Omnibus Form of p-Values
To show how the depth-CD and depth-CV give rise to an omnibus form of p-values, we first justify that, for a given $\vartheta \in \Theta$, the depth-CV $C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta)$ is a limiting p-value for testing the hypothesis $H_0: \theta = \vartheta$ versus $H_1: \theta \neq \vartheta$. Liu and Singh (1997) define a sequence of statistics $p_n$ to be a limiting p-value if $p_n \in [0,1]$ and $p_n$ satisfies: (a) $\limsup_{n\to\infty} P_F\{p_n \le t\} \le t$ for all $F$ under $H_0$ and any $t \in [0,1]$; and (b) $p_n \to 0$ in probability for all $F$ under $H_1$, as $n \to \infty$.
Note that (a) is required to ensure the testing size, and (b) to ensure that the asymptotic power of the test goes to 1 as the sample size increases. To see why $C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta)$ is a limiting p-value, we need the simple but useful result below.

Proposition 1. The statement that $R_{1-\alpha}(H_n(\mathbf{X}_n, \cdot))$ is a confidence region for $\theta$ with a coverage probability of $(1-\alpha)$ (requirement (ii) in Definition 1) is equivalent to the statement that $C_{H_n}(\theta_o) \equiv C_{H_n(\mathbf{X}_n,\cdot)}(\theta_o)$, as a function of the sample $\mathbf{X}_n$, follows the uniform distribution on [0,1], where $\theta_o$ is the true value of $\theta$.
In view of Proposition 1, $C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta)$ is a limiting p-value as long as $C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta) \to 0$ in probability for all $\vartheta \neq \theta_o$, which is a mild condition that holds generally. Such a CD p-value and the classical p-value share a similar idea, as both try to assess the degree of inconsistency between the given data and the target null hypothesis by comparing a fixed value with a reference distribution. But they are fundamentally different, since:

• a classical p-value is derived by comparing the observed value $t$ of a statistic $T$ with a reference distribution over the sample space, that is, the null distribution of $T$, say $F_{T,0}$;
• a CD p-value is derived by comparing a hypothesized value $\vartheta$ of the parameter $\theta$ with a reference distribution over the parameter space, that is, the depth-CD $H_n(\mathbf{X}_n, \cdot)$.
The key difference is that the assessment of statistical significance, namely, measuring the outlyingness of the value ($t$ or $\vartheta$) w.r.t. the reference distribution ($F_{T,0}$ or $H_n(\mathbf{X}_n, \cdot)$), is performed in different spaces (the sample space versus the parameter space).
The CD p-value has several advantages, including:

(A1) Given an inference procedure, the reference distribution $H_n(\mathbf{X}_n, \cdot)$ is determined solely by the sample $\mathbf{X}_n$, and it does not depend on the value specified in the null hypothesis. This is different from the classical p-value method, where the reference distribution must satisfy the null constraint and thus may vary depending on the null value.

(A2) Since the CD method does not need to rely on a test statistic, the CD p-value essentially serves simultaneously as a test statistic and a p-value. Thus, it compresses the usual three-step test procedure of the classical p-value approach into just one step, bypassing the need to (i) construct an explicit test statistic and (ii) establish or approximate its sampling distribution. This advantage will be elaborated further in Section 4, where the bootstrap is used to devise depth-CD functions.

(A3) The CD reference distribution $H_n(\mathbf{X}_n, \cdot)$ carries infinitely many p-values $\{C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta) : \vartheta \in \Theta\}$ for the set of hypothesis testing problems $\{H_0: \theta = \vartheta$ versus $H_1: \theta \neq \vartheta,\ \vartheta \in \Theta\}$. This implies that $C_{H_n}(\cdot)$ provides a distribution of p-values over $\Theta$; see, for example, Figure 3(c), where the contours of centrality values can be used as p-value contours. As a p-value for testing $\theta = \vartheta$ is generally viewed as the strength of evidence from the data in support of the assumption $\theta = \vartheta$, $C_{H_n}(\vartheta)$ can be viewed as a measure of the plausibility of assuming $\theta = \vartheta$: the smaller the value of $C_{H_n}(\vartheta)$, the less plausible $\theta = \vartheta$. In Figure 3(c), for example, the parameter value marked by the solid triangle is much less plausible than the one marked by the solid round dot.

To sum up, a depth-CD provides a simple but comprehensive summary of data evidence, in the sense that a single reference distribution $H_n(\mathbf{X}_n, \cdot)$ can express the plausibility of every $\theta$ value over the entire parameter space $\Theta$. In contrast, the reference distribution $F_{T,0}$ used in the classical p-value method expresses only the plausibility of the specific parameter value under the null hypothesis; for instance, it does not simultaneously provide the plausibility of $\theta$ values in the alternative parameter space.
Continuing with Example 3, for a given sample there, the centrality value $C_{H_n}(\theta_0)$ from the depth-CV, as in Figure 3(c), would be a p-value for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.

Depth-CD: Its Deepest Point as a Consistent Estimator
Given a dataset $\mathbf{X}_n$ and a depth-CD $H_n(\mathbf{X}_n, \cdot)$, we propose to use the deepest point of the depth-CD $H_n(\mathbf{X}_n, \cdot)$, or equivalently the maximum point of the centrality function $C_{H_n(\mathbf{X}_n,\cdot)}(\cdot)$, denoted by $\hat{\theta}_n^{\mathrm{MCE}}$, as a point estimate for the parameter of interest $\theta$. That is,
$$\hat{\theta}_n^{\mathrm{MCE}} = \arg\max_{\vartheta \in \Theta} C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta).$$
This estimate is referred to as a maximum centrality estimate (MCE). Note that, in $\mathbb{R}^1$, the MCE corresponds to the highest point of the CV in Figure 2(b) or, equivalently, the median (also the most central point) of the CD in Figure 2(a). The estimate $\hat{\theta}_n^{\mathrm{MCE}}$ extends to general multiparameter settings the idea of using the "median," or deepest point, of a CD function for point estimation.
We show below that $\hat{\theta}_n^{\mathrm{MCE}}$ is a consistent estimator under some mild conditions.

Proposition 2. Assume that, for any $\epsilon > 0$, $\Delta_n(\epsilon) \to 0$ in probability as $n \to \infty$, where $\Delta_n(\epsilon)$ denotes the diameter of the level set $\{\vartheta : C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta) \ge \epsilon\}$. Then $\hat{\theta}_n^{\mathrm{MCE}} \to \theta_o$ in probability, where $\theta_o$ is the true value of $\theta$.

The condition $\Delta_n(\epsilon) \to 0$ basically requires that the depth contours $\{\vartheta : C_{H_n(\mathbf{X}_n,\cdot)}(\vartheta) = \epsilon\}$ (e.g., the contours in Figure 3(c)) shrink to a single point (e.g., the cross in Figure 3(c)) as the sample size $n \to \infty$. For scalar parameters, $\Delta_n(\epsilon)$ is the distance between the two intersection points of the CV and the horizontal dashed line at height $\epsilon$ in Figure 2(b). The condition $\Delta_n(\epsilon) \to 0$ means that as information increases ($n \to \infty$), the depth-CD concentrates on a shrinking area of the parameter space whose measure decreases to zero. This is a mild condition which often holds in practice. Under this condition, Proposition 2 establishes that the estimate $\hat{\theta}_n^{\mathrm{MCE}}$ converges to the true value $\theta_o$. Thus, we have established Property (P-3) of depth-CDs stated in Section 1.

Constructing Depth-CDs From Nonparametric Bootstrap
This section provides a concrete approach to constructing a depth-CD using the nonparametric bootstrap and deriving inferences for the target parameter vector in an individual study. Given the sample $\{X_1, X_2, \ldots, X_n\}$, assume that $\hat{\theta}_n$ is an estimate of the target parameter $\theta$, where
$$\hat{\theta}_n \equiv \hat{\theta}_n(X_1, X_2, \ldots, X_n). \qquad (12)$$
Let $\hat{\theta}_n^* = \hat{\theta}_n^*(X_1^*, X_2^*, \ldots, X_n^*)$ be a bootstrap estimate of $\theta$, where $\{X_1^*, X_2^*, \ldots, X_n^*\}$ is a bootstrap sample drawn from $\{X_1, X_2, \ldots, X_n\}$ with replacement. Let $B_n$ and $B_n^*$ denote the sampling distributions of $\hat{\theta}_n$ and $\hat{\theta}_n^*$, respectively. Theorem 1 below shows that the bootstrap distribution $B_n^*$ is a depth-CD for $\theta$ asymptotically under regularity conditions, the first of which is:

(C1) Let $L_n$ be the distribution of $a_n(\hat{\theta}_n - \theta)$ for some positive sequence $a_n \to \infty$, as $n \to \infty$. Assume that $L_n$ converges $D$-regularly to a distribution $L$, namely, (i) $L_n$ converges weakly to $L$ as $n \to \infty$; and (ii) $\lim_{n\to\infty} \sup_{x \in \mathbb{R}^p} |D(L_n, x) - D(L, x)| = 0$, that is, the associated depth values converge uniformly.

Under (C1) and additional regularity conditions (C2)-(C4), Theorem 1 shows that the bootstrap distribution $B_n^*$ is a depth-CD, and hence justifies the validity of using $B_n^*$ to make inferences about the parameter vector $\theta$. For example, the deepest point of the distribution $B_n^*$ can be used as a point estimate of $\theta$, and the central region $R_{1-\alpha}(B_n^*)$, as defined in Equation (10), as a $(1-\alpha)$ confidence region for $\theta$. Moreover, for testing the hypothesis $\theta = \vartheta$ versus $\theta \neq \vartheta$, the value of the centrality function at $\vartheta$, that is, $C_{B_n^*}(\vartheta)$, can be used as a p-value. This p-value approach to hypothesis testing is fundamentally different from the traditional approaches, as mentioned in Section 3.3. First, the reference distribution here is the depth-CD $B_n^*$, which is fully determined by the sample; once the sample is given, it does not vary, unlike in the traditional approaches. Second, the depth-CD $B_n^*$ is a single reference distribution that provides a p-value for testing each parameter value in the entire parameter space $\Theta$. Third, in the derivation of the p-value, $C_{B_n^*}(\vartheta)$ does not rely on any test statistic; essentially, $C_{B_n^*}(\vartheta)$ simultaneously serves as a test statistic and as a p-value. This compresses into a single step, namely calculating the centrality of $\vartheta$ w.r.t. the depth-CD $B_n^*$, the usual three steps of traditional testing procedures, namely identifying a test statistic, establishing its sampling distribution, and then calculating the p-value. Fourth, the reference distribution $B_n^*$ is obtained by resampling directly from the empirical distribution, rather than from a null distribution that is usually restricted by parametric assumptions. This also explains why our CD inference here can be obtained without distributional assumptions on the sample.
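To make the procedure concrete, here is a minimal sketch, under our own assumptions (the mean vector as $\hat{\theta}_n$ and the Mahalanobis depth; neither choice is prescribed by the paper), of constructing a bootstrap depth-CD $B_n^*$, its centrality-based p-value, and the deepest-point (MCE) estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 100, 2000
X = rng.chisquare(df=3, size=(n, 2))          # a nonnormal bivariate sample

# Bootstrap draws of the mean estimator: these play the role of the
# CD-random variables theta* from the depth-CD B_n*.
idx = rng.integers(0, n, size=(B, n))
theta_star = X[idx].mean(axis=1)              # B x 2 bootstrap means

mu_b = theta_star.mean(axis=0)
S_inv = np.linalg.inv(np.cov(theta_star, rowvar=False))

def depth(z):
    """Mahalanobis depth w.r.t. the bootstrap distribution B_n*."""
    d = z - mu_b
    return 1.0 / (1.0 + d @ S_inv @ d)

star_depths = np.array([depth(t) for t in theta_star])

def centrality(vartheta):
    """Depth-CV value C_{B_n*}(vartheta): serves directly as a p-value for
    testing theta = vartheta versus theta != vartheta."""
    return np.mean(star_depths <= depth(vartheta))

theta_mce = theta_star[np.argmax(star_depths)]   # deepest point: MCE estimate
p_val = centrality(np.array([3.0, 3.0]))         # test H0: theta = (3, 3), the truth
print(theta_mce, p_val)
```

Other affine-invariant depths, such as the half-space or simplicial depth, could replace the Mahalanobis depth in `depth` at higher computational cost.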
The idea of connecting bootstrap to data depth for multiparameter inference is not new. For example, it has been used in Liu and Singh (1997) for hypothesis testing and in Yeh and Singh (1997) for deriving confidence regions. These two inference methods can be viewed as special cases in our depth-CD inference framework, since Theorem 1 implies that the bootstrap distribution is a depth-CD.
Theorem 2 is a direct consequence of Lemma 1 and Proposition 1, and it provides a procedure for constructing depth-CDs from pivot statistics.
Theorem 2. Assume that $A_n(\mathbf{X}_n)$ is a nonsingular matrix such that $A_n(\mathbf{X}_n)(\hat{\theta}_n - \theta)$ follows a distribution $Q_n$ that is free of all unknown parameters. Also assume that $\eta_n$ is a random vector, independent of the sample $\mathbf{X}_n$, following the distribution $Q_n$. Then, conditional on $\mathbf{X}_n$, the distribution function of $(\hat{\theta}_n - A_n(\mathbf{X}_n)^{-1}\eta_n)$ is a depth-CD for $\theta$, under the following conditions: (i) the depth $D$ is affine invariant; and (ii) the depth contours $\{\eta_n : D_{Q_n}(\eta_n) = t\}$ all have probability zero w.r.t. $Q_n$.
Theorem 2 shows that when the statistic $A_n(\mathbf{X}_n)(\hat{\theta}_n - \theta)$ is a pivot, a depth-CD can be easily derived using the inverse probability function. Returning to Example 3, where we make inference for the mean parameter of a bivariate normal distribution: in this case, we know that $\Sigma^{-1/2}(\bar{Y}_n - \theta)$ is a pivot following a bivariate standard normal distribution, and that the three depths mentioned in Section 3.2 are affine invariant. Thus, by Theorem 2, the bivariate normal distribution $BN(\bar{Y}_n, \Sigma)$ is a depth-CD for $\theta$. When $\Sigma$ is not known, $BN(\bar{Y}_n, \hat{\Sigma})$ is a depth-CD for $\theta$ asymptotically. In this example, the distribution of the pivot $\Sigma^{-1/2}(\bar{Y}_n - \theta)$, namely $Q_n$ in Theorem 2, is derived under certain distributional assumptions, that is, $Q_n$ is bivariate standard normal. But generally, as long as $A_n(\mathbf{X}_n)(\hat{\theta}_n - \theta)$ can be structured to have (approximately) a parameter-free distribution, Theorem 2 can be applied to construct depth-CDs and draw all forms of inference accordingly, as seen in Section 3.
With the distribution $Q_n$ as a prerequisite, Theorem 2 may be perceived as applicable only to deriving depth-CDs in the setting of parametric inference, but its precise formulation can actually shed light on the bootstrap approach of Theorem 1 and other general nonparametric approaches for deriving depth-CDs. Here is how Theorem 2 explains intuitively why the nonparametric bootstrap distribution is indeed a depth-CD. To avoid making assumptions about the distribution $Q_n$ of the statistic $A_n(\mathbf{X}_n)(\hat{\theta}_n - \theta)$, a natural choice is to use the bootstrap distribution of $A_n(\mathbf{X}_n)(\hat{\theta}_n^* - \hat{\theta}_n)$ to approximate $Q_n$, that is, to set $\eta_n = A_n(\mathbf{X}_n)(\hat{\theta}_n^* - \hat{\theta}_n)$ in Theorem 2. Assuming that this approximation is appropriate (e.g., under conditions (C1)-(C4)), Theorem 2 shows that the distribution of $(2\hat{\theta}_n - \hat{\theta}_n^*)\ (= \hat{\theta}_n - A_n(\mathbf{X}_n)^{-1}A_n(\mathbf{X}_n)(\hat{\theta}_n^* - \hat{\theta}_n))$ is a depth-CD for $\theta$. Note that $(2\hat{\theta}_n - \hat{\theta}_n^*)$ and $\hat{\theta}_n^*$ have an identical distribution (provided that the distribution of $(\hat{\theta}_n^* - \hat{\theta}_n)$ is symmetric; see condition (C3)), since, when centered around $\hat{\theta}_n$, the reflection image of $(2\hat{\theta}_n - \hat{\theta}_n^*)$ is exactly $\hat{\theta}_n^*$. Hence, the bootstrap distribution of $\hat{\theta}_n^*$ is also a depth-CD.

Combining Depth-CVs
We have shown that the very form of the depth-CD, being an all-encompassing distributional function estimate rather than a mere point or interval estimate, is the key feature that leads to the omnibus form of all inferences for a parameter. This feature will also be shown to underlie the great flexibility that makes depth-CDs particularly suited for combining inferences from different and even heterogeneous studies. For each individual study, we can obtain a depth-CD $H_{k,n_k}(\cdot) \equiv H_{k,n_k}(\mathbf{X}_k; \cdot)$ and its corresponding depth-CV $C_{H_{k,n_k}}(\cdot)$ (see Definition 1) for the parameter $\theta$, $k = 1, 2, \ldots, K$, from the K independent studies. Here we propose a general formula, in Equation (13), for synthesizing these K individual inference results to draw an overall and efficient inference for the parameter $\theta$:
$$C^{(c)}(\theta) = G_c\big(g_c(C_{H_{1,n_1}}(\theta), C_{H_{2,n_2}}(\theta), \ldots, C_{H_{K,n_K}}(\theta))\big). \qquad (13)$$
Here, $g_c(u_1, u_2, \ldots, u_K)$ is a continuous mapping from $[0,1]^K$ to $\mathbb{R}$ which is increasing in each coordinate, and $G_c(t) \equiv P\{g_c(U_1, U_2, \ldots, U_K) \le t\}$, where the $U_k$'s are iid random variables following the U[0,1] distribution. A special yet important case of Equation (13), to which we will return often later, is
$$g_c(u_1, u_2, \ldots, u_K) = \sum_{k=1}^K w_k\, \varphi(u_k), \qquad (14)$$
where $\varphi(\cdot)$ is a monotonically increasing "mapping function" and $w_k > 0$ is the weight assigned to the $k$th study. Fusion formulas similar to Equations (13) and (14) have been used in Singh, Xie, and Strawderman (2005) and Xie, Singh, and Strawderman (2011) to combine CDs for a scalar parameter. However, these do not apply directly to combining depth-CDs for multiparameter inference. If the combining formula in Equation (13) were applied directly to the depth-CDs themselves, the resulting function $G_c(g_c(H_{1,n_1}(\theta), H_{2,n_2}(\theta), \ldots, H_{K,n_K}(\theta)))$ would not yield any valid statistical inference. To mitigate this shortcoming, our proposal in Equations (13) and (14) combines the depth-CVs through their corresponding centrality functions $C_{H_{k,n_k}}(\theta)$'s to obtain $C^{(c)}(\theta)$ (cf. Equation (13)) and the desired overall inference.
Theorem 3 justifies that overall inferences based on the combined centrality function $C^{(c)}(\theta)$ can be made in ways similar to those based on the centrality functions from individual studies. For example, Theorem 3(a) shows that, like each individual centrality function, the combined function $C^{(c)}(\theta)$ is a single function on the parameter space, fixed given the samples, and it provides infinitely many p-values for testing all $\theta$ values in the entire parameter space. It expresses the relative ranking, or level of plausibility, of each $\theta$ value w.r.t. the totality of evidence collected from all studies. This expression of relative ranking of plausibility adapts readily to the common interpretation of a p-value. Theorem 3(b) describes a $(1-\alpha)$ confidence region for $\theta$ as the collection of parameter values whose $C^{(c)}(\theta)$ is no less than $\alpha$. Theorem 3(c) shows that the maximizer of the combined $C^{(c)}(\cdot)$ is a valid point estimator.
Our fusion learning does not rely on parametric assumptions if Equation (13) or (14) is applied to depth-CDs obtained from nonparametric approaches, such as the bootstrap. This fusion approach is broadly applicable: it is valid as long as the input functions $H_{k,n_k}(\theta)$'s are depth-CDs (exactly or asymptotically).

Higher-Order Accuracy of $C^{(c)}(\cdot)$ and Its Inference Results

The depth-CDs obtained by the bootstrap are not exact, in the sense that they satisfy requirement (ii) in Definition 1, or the statement in Proposition 1, only asymptotically. To see how this approximation accuracy affects the accuracy of the inference results, we consider the example of a univariate common mean problem, where a CD obtained by the regular bootstrap of Section 4 is $H_{k,n_k}(\theta) = P^*\{\bar{X}_k^* \le \theta\}$. Such a CD can yield confidence regions whose coverage probabilities approximate the nominal values. Better accuracy can be achieved by using the bootstrap t (Efron and Tibshirani 1994). This bootstrap method generates a second-order accurate CD $H_{k,n_k}^{(t)}(\theta) = P^*\{\bar{X}_k - S_k T_k^* \le \theta\}$, where $T_k^* = (\bar{X}_k^* - \bar{X}_k)/S_k^*$ and $S_k$ (resp., $S_k^*$) is an estimate of the standard deviation of $\bar{X}_k$ (resp., its bootstrap counterpart). It is natural to ask whether the improved accuracy of the input CDs carries over to the combined outcome. It is worth noting that our fusion approach in Equation (14) generally does preserve the order of accuracy of the individual depth-CDs, even if they are not exact. To state the result, we first define the order of accuracy for a depth-CD.

A depth-CD function $H_n(\cdot) \equiv H_n(\mathbf{X}_n, \cdot)$ on $\Theta \subseteq \mathbb{R}^p$ is said to be $j$th-order accurate if the random variable $C_{H_n}(\theta_o) \equiv C_{H_n(\mathbf{X}_n,\cdot)}(\theta_o)$, where $\theta_o$ is the true value of $\theta$, converges in distribution to the uniform distribution on (0,1) at the order of $n^{-j/2}$, that is, $P\{C_{H_n}(\theta_o) \le a\} - a = O(n^{-j/2})$ for any $a \in (0,1)$. If a depth-CD function $H_n(\cdot)$ is $j$th-order accurate, the coverage probability of $R_{1-\alpha}(H_n)$ in Equation (10) converges to its nominal level at the rate of $O(n^{-j/2})$.

Theorem 4 (Accuracy of $C^{(c)}(\cdot)$). For the $k$th study ($k = 1, 2, \ldots, K$), assume that $n_k/n$ converges to a constant $a_k$ and also that its depth-CD function $H_{k,n_k}(\theta)$ is $j$th-order accurate uniformly, in the sense that $P\{C_{H_{k,n_k}}(\theta_o) \le a\} - a = O(n^{-j/2})$ uniformly for all $a \in (0,1)$ as $n \to \infty$. Then the combined function $C^{(c)}(\theta)$ in its general form, Equation (13), is also $j$th-order accurate.
Our numerical studies in Section 6 show that, even in small-sample cases, the overall inferences are quite accurate when the input CD functions $H_{k,n_k}^{(t)}(\theta)$'s are obtained by the bootstrap t.
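As an illustration of the bootstrap-t construction described above, the following snippet (our own sketch, assuming a univariate $\chi^2_3$ sample; not code from the paper) computes $H_{k,n_k}^{(t)}(\theta)$ and a second-order accurate 90% interval:

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 30, 4000
x = rng.chisquare(df=3, size=n)
xbar, s = x.mean(), x.std(ddof=1) / np.sqrt(n)   # mean and its SE estimate

idx = rng.integers(0, n, size=(B, n))
xs = x[idx]
xbar_star = xs.mean(axis=1)
s_star = xs.std(axis=1, ddof=1) / np.sqrt(n)
t_star = (xbar_star - xbar) / s_star             # bootstrap-t statistics

def cd_t(theta):
    """Bootstrap-t CD: H_t(theta) = P*(xbar - s * T* <= theta)."""
    return np.mean(xbar - s * t_star <= theta)

# Second-order accurate 90% interval from the bootstrap-t CD.
lo = xbar - s * np.quantile(t_star, 0.95)
hi = xbar - s * np.quantile(t_star, 0.05)
print(cd_t(3.0), (lo, hi))
```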

Bahadur Efficiency of $C^{(c)}(\cdot)$
The fusion formula in Equation (13) provides a general class of fusion approaches for synthesizing nonparametric or parametric inferences. We show here that, among this general class, the specific form of Equation (14) with $w_k = 1$ for all $k$ and $\varphi(t) = \log(t)$ yields the most efficient combination in terms of achieving Bahadur efficiency. Following the ideas in Littell and Folks (1973) and Singh, Xie, and Strawderman (2005), we define the concept of a Bahadur slope for a depth-CV.

Definition 2. A nonnegative function $S_\lambda(b)$ is said to be the Bahadur slope of the depth-CV function $C_{H_n}(\cdot)$ along the direction $\lambda$, where $\lambda \in \mathbb{R}^p$ and $\|\lambda\| = 1$, if $S_\lambda(b) \equiv \lim_{n\to\infty} -\log\{C_{H_n}(\theta_o + b\lambda)\}/n$ almost surely for any nonzero $b \in \mathbb{R}$.
The Bahadur slope $S_\lambda(b)$ defined above reflects the rate, on an exponential scale, at which $C_{H_n}(\theta_o + b\lambda)$ decays toward zero as the sample size increases. The larger the slope, the more efficient the depth-CV in Bahadur's sense. In the multiparameter case, where the depth-CD $H_n(\theta)$ is a multivariate distribution, we need Bahadur slope functions $S_\lambda(b)$ along each direction $\lambda$ to characterize how fast the tails of the distribution decay to zero.
The Bahadur slope provides a means of assessing the efficiency of the proposed fusion method in Equation (13). Specifically, the theorem below establishes an upper bound on the Bahadur slope (i.e., the fastest possible rate of tail decay) of the combined function $C^{(c)}(\theta)$. It also identifies a specific combination formula that achieves exactly this bound.
Theorem 5. Under $\theta = \theta_o$ and $n_k = \{a_k + o(1)\}n$ as $n \to \infty$, the following inequality holds for any fused function $C^{(c)}(\theta)$ as defined in the general fusion formula, Equation (13):
$$\limsup_{n\to\infty} -\log\{C^{(c)}(\theta_o + b\lambda)\}/n \le \sum_{k=1}^K a_k S_{k,\lambda}(b),$$
where $S_{k,\lambda}(b)$ denotes the Bahadur slope of the depth-CV from the $k$th study along the direction $\lambda$. Furthermore, let $C^{\log}_{(c)}$ denote the fused function in its specific form, Equation (14), with $w_k = 1$ for all $k$ and $\varphi(t) = \log(t)$. Then,
$$\lim_{n\to\infty} -\log\{C^{\log}_{(c)}(\theta_o + b\lambda)\}/n = \sum_{k=1}^K a_k S_{k,\lambda}(b).$$

Theorem 5 states that the Bahadur slope of any combined function $C^{(c)}(\cdot)$ derived from Equation (13) has an upper bound, and that this upper bound is achieved by taking $w_k = 1$ for all $k$ and $\varphi(t) = \log(t)$ in Equation (14). In this case, the explicit formula for combining depth-CVs is
$$C^{\log}_{(c)}(\theta) = P\Big\{\chi^2_{2K} \ge -2\sum_{k=1}^K \log C_{H_{k,n_k}}(\theta)\Big\}. \qquad (18)$$
This formula turns out to be the same as Fisher's method for combining p-values. The optimality of this particular choice does not depend on the direction $\lambda$. Thus, an interesting implication is that if we use Equation (18) to combine depth-CVs, the highest Bahadur slope (or the fastest rate of tail decay) is achieved along every direction (i.e., along the line spanned by $\theta_o + b\lambda$ as $b$ varies). The optimality established in Theorem 5 is thus a global, rather than merely directional, property of Equation (18).
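In practice, Equation (18) reduces to a one-line computation once the individual centrality values are available; a minimal sketch (ours, not from the paper):

```python
import numpy as np
from scipy.stats import chi2

def combine_cv(cv_values):
    """Fisher-type combination, Equation (18): C = P(chi2_{2K} >= -2 sum log C_k)."""
    v = np.clip(np.asarray(cv_values, dtype=float), 1e-300, 1.0)  # guard log(0)
    return chi2.sf(-2.0 * np.sum(np.log(v)), df=2 * len(v))

# Example: centrality values of one hypothesized theta across K = 3 studies.
print(combine_cv([0.40, 0.25, 0.60]))   # combined p-value for that theta
```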

Fusion of Heterogeneous Studies
Our fusion framework is general and can cover complex and irregular settings containing heterogeneous studies. Study heterogeneity arises often in practice, due to different study designs, populations, or outcomes, as seen in the applications in Chen, Chatterjee, and Carroll (2013), Yang et al. (2014), Liu, Liu, and Xie (2015), Chatterjee et al. (2016), and Gao and Carroll (2017). In the presence of heterogeneous studies, the parameter of interest may not be estimable in some studies. These studies are often excluded from conventional analyses, which can result in a nonnegligible loss of information. Our fusion method in Equation (13) can be extended to incorporate heterogeneous studies in the analysis. The theoretical results established in the previous sections remain valid and applicable as well.
To accommodate heterogeneous studies, we give up the assumption that $\theta_k$ is estimable in each study. Instead, we assume only that a certain mapping of $\theta_k$, denoted by $\tilde{\theta}_k\ (= f_k(\theta_k))$ as in Equation (3), is estimable and that its corresponding depth-CD $H_{k,n_k}(\tilde{\theta}_k)$ for $\tilde{\theta}_k$ can be derived, say, using the bootstrap. With a minor modification, the general fusion formula in Equation (13) is still applicable for combining the depth-CDs of the different $\tilde{\theta}_k$'s to make an overall inference about the common parameter of interest $\theta$. More specifically,
$$C^{\mathrm{Het}}_{(c)}(\theta) = G_c\big(g_c(C_{H_{1,n_1}}(f_1(\theta)), C_{H_{2,n_2}}(f_2(\theta)), \ldots, C_{H_{K,n_K}}(f_K(\theta)))\big). \qquad (19)$$
Similar to Equation (14), a special case is
$$C^{\mathrm{Het}}_{(c)}(\theta) = G_c\Big(\sum_{k=1}^K w_k\, \varphi\big(C_{H_{k,n_k}}(f_k(\theta))\big)\Big). \qquad (20)$$
Theorem 6 shows how to use $C^{\mathrm{Het}}_{(c)}(\theta)$ in Equation (19) or (20) to make valid combined inference about $\theta$. Here, we require that $\theta$ be identifiable in the combined function $C^{\mathrm{Het}}_{(c)}(\theta)$. Following Rothenberg (1971) and Little, Heidenreich, and Li (2010), we say that $\theta$ is (locally) identifiable if for any $\theta \in \Theta$ there is no $\vartheta \neq \theta$ (in a neighborhood of $\theta$) such that $C^{\mathrm{Het}}_{(c)}(\mathbf{X}_1, \ldots, \mathbf{X}_K; \theta) = C^{\mathrm{Het}}_{(c)}(\mathbf{X}_1, \ldots, \mathbf{X}_K; \vartheta)$ almost surely.
Theorem 6 justifies the validity of using the modified combined depth-CV to draw overall inferences from heterogeneous studies; it is the counterpart of Theorem 3 for the case of homogeneous studies. In fact, the counterparts of Theorems 4 and 5 can also be established, yielding the same theoretical results of high-order accuracy and Bahadur efficiency under heterogeneous studies. In short, the combining in Equation (19) preserves the order of accuracy of each individual study, and it achieves Bahadur efficiency when $w_k = 1$ for all $k$ and $\varphi(t) = \log(t)$.
Compared to the meta-analysis of heterogeneous studies in Liu, Liu, and Xie (2015), our fusion method here is more general. Liu, Liu, and Xie (2015) requires normality of the distribution of the summary statistics, whereas our fusion method does not require any parametric distributional assumptions. If each individual depth-CD is derived using the nonparametric bootstrap, then the inference drawn from the combined function is also nonparametric. When it is reasonable to make an assumption about the underlying distribution, we can derive depth-CDs using Theorem 2, and our fusion method is still applicable and yields valid inferences.
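The following schematic sketch illustrates Equations (19) and (20) with Fisher's combination, using toy normal-based depth-CVs of our own construction (they are stand-ins for the bootstrap-based centrality functions of Section 4): two complete studies inform $\theta = (\theta_1, \theta_2)$ directly (identity mappings $f_k$), while an incomplete study informs only $f_3(\theta) = \theta_1 + \theta_2$:

```python
import numpy as np
from scipy.stats import chi2, norm

def make_mvn_cv(est, cov):
    """Toy depth-CV for a normal CD: centrality via Mahalanobis distance."""
    cov_inv = np.linalg.inv(cov)
    def cv(theta):
        d = np.asarray(theta) - est
        return chi2.sf(d @ cov_inv @ d, df=len(est))
    return cv

def make_scalar_cv(est, var):
    """Toy two-sided CV for a scalar normal CD: 2 * min{H, 1 - H}."""
    def cv(t):
        return 2 * norm.sf(abs((t - est) / np.sqrt(var)))
    return cv

study_cvs = [make_mvn_cv(np.array([1.0, 1.1]), np.eye(2) * 0.04),
             make_mvn_cv(np.array([0.9, 1.0]), np.eye(2) * 0.05),
             make_scalar_cv(2.05, 0.03)]           # observes only theta1 + theta2
mappings = [lambda t: t, lambda t: t, lambda t: t[0] + t[1]]

def combined_cv(theta):
    """C_Het(theta): evaluate each CV at f_k(theta), then Fisher-combine."""
    vals = [cv(f(theta)) for cv, f in zip(study_cvs, mappings)]
    return chi2.sf(-2.0 * np.sum(np.log(vals)), df=2 * len(vals))

print(combined_cv(np.array([1.0, 1.0])))   # plausibility of theta = (1, 1)
```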

Simulation Studies
To demonstrate the theoretical advantages of our fusion method, we conduct simulation studies on the common mean problem and the meta-analysis of correlation coefficients.

The Common Mean Problem
Making inference about the common mean parameter of multiple populations is referred to as the common mean problem. This problem has been investigated extensively; see, for example, Lin, Lee, and Wang (2007), Pal et al. (2007), and the references therein.
Traditional approaches rely on the assumption that the sample of each study is drawn from a normal distribution. The normality assumption, however, is often unrealistic in practice, and it can hardly be justified when the sample size is small. To the best of our knowledge, there has not been any systematic investigation of the common mean problem in general, nonnormal situations.
Our framework of fusion learning readily applies to the common mean problem, in both normal and nonnormal settings. In this section, we examine its numerical performance, in comparison with that of several existing methods associated with the well-known Graybill-Deal estimator (Graybill and Deal 1959). The numerical results show that without the normality assumption, our fusion method has the following advantages: (i) it preserves inference accuracy in hypothesis testing/confidence regions; (ii) its point estimator has less bias and is more efficient; and (iii) it achieves a gain of efficiency in the presence of heterogeneous studies.
In the multiparameter setting, the Graybill-Deal estimator is
$$\hat{\mu}_{GD} = \Big(\sum_{k=1}^K n_k S_k^{-1}\Big)^{-1} \sum_{k=1}^K n_k S_k^{-1} \bar{X}_k,$$
where $\bar{X}_k$ and $S_k$ are the sample mean vector and sample covariance matrix of the $k$th study. This estimator yields confidence intervals and p-values by considering the statistic (Lin, Lee, and Wang 2007)
$$\sum_{k=1}^K w_k T_k^2, \qquad (21)$$
where $T_k^2 = n_k(\bar{X}_k - \mu)' S_k^{-1} (\bar{X}_k - \mu)$. If $X_{k,j}$ follows a multivariate normal distribution, then the $T_k^2$'s are Hotelling's $T^2$ statistics and $\frac{n_k - p}{p(n_k - 1)} T_k^2 \sim F_{p, n_k - p}$. Thus, the statistic in Equation (21) follows a weighted convolution of multiple F distributions. We evaluate Equation (21) in the construction of confidence regions and hypothesis testing with $w_k \equiv 1$ (referred to as the GD method) and with $w_k = \{\mathrm{var}(T_k^2)\}^{-1}$, where $\mathrm{var}(T_k^2) = \{2p(n_k-1)^2(n_k-2)\}/\{(n_k-p-2)^2(n_k-p-4)\}$ (referred to as the JK method; Jordan and Krishnamoorthy 1995). If the normality assumption holds, both the GD and JK methods are exact, in the sense that the test (or confidence region) achieves the nominal Type I error rate (or coverage probability), since the exact distribution of Equation (21) is known. We also consider a method based on the central limit theorem (CLT). This method needs a weaker assumption, namely that $\bar{X}_k$ only approximately follows a normal distribution. The inference relies on the statistic
$$\big(\hat{\mu}_{GD} - \mu\big)' \Big(\sum_{k=1}^K n_k \hat{\Sigma}_k^{-1}\Big) \big(\hat{\mu}_{GD} - \mu\big),$$
where $\hat{\Sigma}_k = (n_k - 1)S_k/n_k$. This statistic asymptotically follows the $\chi^2$ distribution with $p$ degrees of freedom.
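For reference, here is a minimal sketch of the multivariate Graybill-Deal estimate; it follows the reconstruction above, and the precision weights $n_k S_k^{-1}$ are our reading of the formula rather than verified code from the paper:

```python
import numpy as np

def graybill_deal(xbars, covs, ns):
    """Precision-weighted common-mean estimate (sketch of the GD estimator):
    mu_GD = (sum n_k S_k^{-1})^{-1} sum n_k S_k^{-1} xbar_k."""
    W = [n * np.linalg.inv(S) for n, S in zip(ns, covs)]
    total_inv = np.linalg.inv(np.sum(W, axis=0))
    return total_inv @ np.sum([w @ xb for w, xb in zip(W, xbars)], axis=0)

rng = np.random.default_rng(4)
samples = [rng.multivariate_normal([1, 1], [[1, 1], [1, 2]], size=n)
           for n in (50, 80)]
xbars = [s.mean(axis=0) for s in samples]
covs = [np.cov(s, rowvar=False) for s in samples]
print(graybill_deal(xbars, covs, [50, 80]))
```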

Inference Accuracy in Hypothesis Testing/Confidence Regions

To assess inference accuracy, we present the null distribution of p-values in Figures 4-6 (based on 10,000 simulation replications). The deviation of this distribution from the U(0,1) distribution depicts the difference between the actual and nominal Type I error rates in hypothesis testing or, equivalently, the difference between the actual and nominal coverage probabilities of confidence regions. When the sample distribution is normal, Figure 4 shows that the null distribution of p-values aligns well with the U(0,1) distribution for all the methods considered, except that the CLT method is slightly off the target line. However, when the sample distribution is nonnormal, such as $\chi^2$, Figure 5 shows a notable deviation for the GD, JK, and CLT methods. More details on these deviations can be seen from the empirical values reported in Table 1 for a set of specific points. The numerical values in the table can also be viewed as the (nominal or actual) Type I error rates; boldfaced are the values with a notable deviation from their nominal levels. For example, when the nominal probability (or Type I error rate) is 0.05, the actual probability is 0.18, 0.18, and 0.25 for the GD, JK, and CLT methods, respectively. Such a substantial deviation indicates a nonnegligible loss of inference accuracy and raises serious concerns about using these methods for inference. Only our CD method yields a null distribution following the target distribution very closely. This example shows that our CD method, due to its nonparametric nature, is robust against violations of the normality assumption. In Scenario 3, we sample from a bivariate Cauchy distribution, whose mean does not exist, and our inference is instead on the location parameter. Since the moments of Cauchy distributions do not exist, it is not surprising to see in Figure 6 that the GD, JK, and CLT methods all exhibit an appreciable loss of inference accuracy. Again, our method remains approximately accurate when using the median as the estimate in Equation (12) to construct depth-CDs. The advantage of the CD method seen in Figure 6 is also confirmed numerically in Table 1, where its actual Type I error rates are quite close to the nominal levels. This example highlights the flexibility of our method in adapting easily to irregular situations where moments of the distribution do not exist.

Bias and Efficiency in Point Estimation
We compare our CD point estimator in Equation (15) and the Graybill-Deal estimator $\hat{\mu}_{GD}$ in estimating the common mean (or location) parameter $\mu = (\mu_1, \mu_2)'$. The distributions of the estimates (based on 1000 simulation replications) are presented as boxplots in Figure 7. When the sample distribution is normal, the first column shows that both estimators (i) are unbiased and, more interestingly, (ii) have comparable variabilities. More precisely, the standard errors of the GD and CD estimates are 0.125 and 0.126 for $\mu_1$, and 0.258 and 0.261 for $\mu_2$, respectively. This observation implies that although the CD method is nonparametric, it sustains negligible efficiency loss compared to the GD method, which does make use of the parametric assumption. When the sample distribution is $\chi^2$, the second column of Figure 7 shows that the variabilities of the two estimators are still comparable, but the GD estimator now shows a notable bias, whereas the CD estimator remains unbiased. When the sample distribution is Cauchy, the third column of Figure 7 shows that both estimators are unbiased, but the CD estimator has much smaller variability than the GD estimator, indicating that the CD method is more efficient. To summarize, in the absence of normality, the CD estimator outperforms the GD estimator in terms of both bias and efficiency.

Gain of Efficiency in the Presence of Heterogeneous Studies
We consider a setting of heterogeneous studies by replicating the two studies in Scenario 1 (bivariate normal) and assuming that the two replicated studies are irregular, in that only the sum of the two components of the random vector X is observed. We are interested in combining inferences from all four studies. Here, neither of the two marginal means μ1 and μ2 is estimable in all studies, but the sum μ1 + μ2 is. The GD estimator μ̂_GD can combine only the two regular studies and must discard the other two, whereas our CD estimator in Equation (20) can incorporate the two irregular studies as well. The same simulation is repeated under Scenario 2 (χ²). To visualize the gain of efficiency from combining the inferences from all four studies, we present in Figure 8 the boxplots of the GD and CD estimates of μ2 (based on 1000 simulation replications). The boxplots show that in both the normal and nonnormal cases, our CD estimator, by combining all studies, is less variable and thus achieves greater efficiency. Moreover, the CD estimator remains almost unbiased in the nonnormal case. This phenomenon highlights again the flexibility of our fusion method in accommodating a broad class of study heterogeneity.
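A minimal sketch of this data-generating setup follows, with hypothetical parameter values of our choosing; the irregular studies record only the component sum, so they carry only indirect evidence on μ1 and μ2.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0])                  # hypothetical (mu1, mu2)
cov = np.array([[1.0, 0.5], [0.5, 2.0]])

# Two regular studies: the full bivariate vector X is observed.
regular = [rng.multivariate_normal(mu, cov, size=100) for _ in range(2)]

# Two irregular studies: only X1 + X2 is recorded, so neither marginal
# mean is estimable from them, but mu1 + mu2 is.
irregular = [rng.multivariate_normal(mu, cov, size=100).sum(axis=1)
             for _ in range(2)]

print(regular[0].mean(axis=0))   # estimates (mu1, mu2)
print(irregular[0].mean())       # estimates mu1 + mu2
```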

Meta-analysis of Correlation Coefficients
In the social and behavioral sciences, correlation coefficients, being invariant to the measuring scale, are often used to represent the size of an effect. The meta-analysis of such an effect size has long been used as a tool to draw a more comprehensive conclusion on a bivariate association; see Schulze (2004) for an in-depth discussion. Classical meta-analysis inference methods for correlations, such as those based on Fisher's z-transformation, assume that the samples of (X_k, Y_k), k = 1, . . . , K, all follow bivariate normal distributions. When such an assumption is violated, the inference outcomes can be invalid. In what follows, we show that our CD fusion method readily applies to the meta-analysis of correlation coefficients without requiring any parametric assumptions.
To illustrate the CD fusion method, we use the Pearson sample correlation r as an estimate of the correlation coefficient ρ in Equation (12), and apply the regular bootstrap (with 2000 replicates) to construct a depth-CD in each study. To combine the depth-CDs, we use half-space depth and the Bahadur-efficient combination in Equation (18). We compare our method with a naive method and the Hedges-Olkin (HO) method (Schulze 2004). The naive method merges the datasets as if all the data were from a single source. It then calculates the sample correlation r and applies Fisher's z-transformation z = (1/2) log{(1 + r)/(1 − r)}, where z approximately follows a normal distribution with mean 0 and variance 1/(n − 3) under H0: ρ = 0. The HO method obtains the Fisher z-transformed statistic z_k from each study and combines them using z̄ = Σ_{k=1}^K (n_k − 3) z_k / Σ_{k=1}^K (n_k − 3). The inference is based on the fact that z̄ {Σ_{k=1}^K (n_k − 3)}^{1/2} approximately follows the N(0, 1) distribution. Figure 9 compares the three methods by examining the null distribution of p-values for testing the hypothesis H0: ρ = 0. When the samples of (X_k, Y_k) indeed follow a bivariate normal distribution, the upper row of Figure 9 shows that each method's p-value distribution approximates the U(0, 1) distribution quite well. This observation indicates that all three methods lead to valid inference in normal cases. In the absence of normality, we let X_k = Z_k and Y_k = Z_k², where Z_k ∼ N(0, 1). The lower row of Figure 9 shows that the p-value distributions of the naive and HO methods deviate substantially from the U(0, 1) distribution. More specifically, their Type I error rates (at α = 0.05) are 0.38 and 0.37, respectively. These results indicate that the two methods may lead to invalid inference in nonnormal cases. The p-value distribution of the CD method remains very close to the U(0, 1) distribution, which is indicative of its robustness to violations of the distributional assumptions.
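For reference, a minimal Python sketch of the two benchmark methods follows; the function names are ours, and the studies are simulated from the nonnormal setting X_k = Z_k, Y_k = Z_k² described above (under which ρ = 0).

```python
import numpy as np
from scipy.stats import norm

def fisher_z(r):
    """Fisher's z-transformation of a sample correlation r."""
    return 0.5 * np.log((1 + r) / (1 - r))

def naive_pvalue(xs, ys):
    """Naive method: pool all data, then test H0: rho = 0 via Fisher's z."""
    x, y = np.concatenate(xs), np.concatenate(ys)
    z = fisher_z(np.corrcoef(x, y)[0, 1])
    return 2 * norm.sf(abs(z) * np.sqrt(len(x) - 3))

def hedges_olkin_pvalue(xs, ys):
    """HO method: combine the study-level z_k with weights n_k - 3, then
    refer zbar * sqrt(sum of weights) to N(0, 1)."""
    zs = np.array([fisher_z(np.corrcoef(x, y)[0, 1]) for x, y in zip(xs, ys)])
    w = np.array([len(x) - 3.0 for x in xs])
    zbar = np.sum(w * zs) / np.sum(w)
    return 2 * norm.sf(abs(zbar) * np.sqrt(np.sum(w)))

# Hypothetical nonnormal studies: X ~ N(0, 1) and Y = X^2, so rho = 0.
rng = np.random.default_rng(3)
xs = [rng.normal(size=200) for _ in range(3)]
ys = [x**2 for x in xs]
print(naive_pvalue(xs, ys), hedges_olkin_pvalue(xs, ys))
```

Repeating the last three lines over many replications and tabulating the frequency of p-values below 0.05 reproduces the kind of Type I error inflation shown in the lower row of Figure 9.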

Case Study: Analysis of Aircraft Landing Performance
Recall from the Introduction the motivating example from the FAA project investigating whether aircraft landing operations generally comply with the FAA recommendation that the height of the aircraft at the crossing of the runway threshold be around 15.85 m and the touchdown distance be around 432 m from the runway threshold. This question can be addressed by testing the hypothesis in Equation (22), H0: μ = (15.85, 432), where μ is the mean vector for the height at the runway threshold and the touchdown distance. We are given the landing records of two fleets of aircraft, 820 from Airbus and 1976 from Boeing. In view of the large samples, an intuitive approach would be simply to apply Hotelling's T² test to the entire sample of 2796 landing records, pooling together both fleets. This yields a p-value of 0.942, which would suggest that there is no evidence that the landing performances fail to comply with the FAA recommendation. This intuitive approach to combining the two studies, however, is flawed, since it implicitly assumes that the two studies follow the same distribution and thus fails to account for the difference between the two studies, shown clearly in Figure 1. After all, it is only natural to expect differences in performance across aircraft of different makers and designs.
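For readers who wish to reproduce this style of analysis, a minimal one-sample Hotelling's T² test can be coded directly from its F-distribution relationship. The landing data themselves are not reproduced here, so the example below uses hypothetical numbers of our choosing.

```python
import numpy as np
from scipy.stats import f

def hotelling_t2_pvalue(x, mu0):
    """One-sample Hotelling's T^2 test of H0: E[X] = mu0.
    Uses (n - p) / (p (n - 1)) * T^2 ~ F(p, n - p) under normality."""
    n, p = x.shape
    diff = x.mean(axis=0) - mu0
    t2 = n * diff @ np.linalg.solve(np.cov(x, rowvar=False), diff)
    return f.sf((n - p) / (p * (n - 1)) * t2, p, n - p)

# Hypothetical pooled landing sample of size 2796; covariance values are ours.
rng = np.random.default_rng(4)
pooled = rng.multivariate_normal([15.9, 433], [[4, 10], [10, 900]], size=2796)
print(hotelling_t2_pvalue(pooled, np.array([15.85, 432.0])))
```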
To accommodate such potential study heterogeneity, our fusion learning method can synthesize the evidence from the two studies to provide a valid answer to the question raised. Specifically, this problem setting consists of two independent studies sharing a common bivariate mean parameter μ, that is, μ_A = μ_B = μ, where μ_A and μ_B are the means of the Airbus Study and the Boeing Study, respectively. We construct a depth-CD from each study to carry out separately the two tests of μ_A = μ_0 and μ_B = μ_0 with μ_0 = (15.85, 432), and then combine the two test results using Equation (14) to draw the overall inference for testing the hypothesis in Equation (22). Our fusion method yields a p-value of 0.008, indicating that the data provide strong evidence against the null hypothesis that the landings follow the FAA recommendation. This conclusion is drawn without assuming that the samples follow any particular (say, normal) distribution.
The seemingly contradictory results between the intuitive method and our fusion method may be best explained visually by the plots of the individual depth-CDs for Airbus (blue circles) and Boeing (black crosses) in Figure 10. The depth-CDs here are represented by the empirical distributions of their respective bootstrap estimates. The red triangle marks the null value μ_0 = (15.85, 432), which is clearly far from the centers of the two depth-CDs (the point estimates of the two respective means). The centrality values of μ_0 w.r.t. the two depth-CDs, or equivalently the two individual p-values, are 0.006 and 0.167. This finding implies low plausibility for the assumption that μ_A = μ_0 or that μ_B = μ_0. Thus, a small p-value (0.008) from our fusion method, leading to the rejection of H_0, is to be expected.
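To convey how such centrality values can be computed, the sketch below ranks the bootstrap estimates by depth and reads off the proportion at most as deep as μ_0. For brevity it uses Mahalanobis depth as a computational stand-in for the half-space depth used in the article, and the bootstrap cloud is hypothetical.

```python
import numpy as np

def mahalanobis_depth(points, cloud):
    """Mahalanobis depth of each row of `points` w.r.t. a bootstrap cloud;
    a simple stand-in here for half-space depth."""
    center = cloud.mean(axis=0)
    prec = np.linalg.inv(np.cov(cloud, rowvar=False))
    dev = points - center
    d2 = np.einsum('ij,jk,ik->i', dev, prec, dev)  # quadratic form per row
    return 1.0 / (1.0 + d2)

def centrality(mu0, boot):
    """Proportion of bootstrap estimates at most as deep as mu0 --
    read as the p-value of H0: mu = mu0 from the depth-CD."""
    d0 = mahalanobis_depth(mu0[None, :], boot)[0]
    return np.mean(mahalanobis_depth(boot, boot) <= d0)

# Hypothetical bootstrap cloud of mean estimates from one study.
rng = np.random.default_rng(5)
boot = rng.multivariate_normal([15.95, 435], [[0.002, 0], [0, 1.2]], size=2000)
print(centrality(np.array([15.85, 432.0]), boot))  # small: mu0 is peripheral
```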
We also plot in Figure 10 the depth-CD (green diamonds) obtained from the pooled data under the erroneous assumption that the two studies follow the same distribution. The red triangle μ_0 is near the center of this depth-CD, which suggests that μ_0 is a plausible target value, as also reflected in a large p-value of 0.956. This example shows that ignoring the heterogeneity of data sources or blindly aggregating data may mask important signals and lead to invalid and misleading conclusions.
Finally, to demonstrate the flexibility of our fusion method in handling more challenging situations, we suppose that the recordings of the variable "height" from the Airbus aircraft are not available. In this scenario, traditional methods can make inference about "height" based only on the landings of the Boeing aircraft. For example, applying Hotelling's T² test to the Boeing observations yields a p-value of 0.152. This again yields an incorrect conclusion. Unlike traditional methods, our fusion method can efficiently incorporate the information in the incomplete observations from the Airbus Study, as shown in Section 5.2. Combining the indirect evidence from Airbus with the direct evidence from Boeing, our method yields a p-value of 0.016. This result suggests strong evidence against the null hypothesis, consistent with the conclusion drawn from the complete data of both studies. Our analysis here shows that indirect evidence may contain valuable information (e.g., through the correlation between "height" and "distance" in this case), without which an incorrect inference outcome may be reached.

Discussion
We have used the concepts of depth-CD and depth-CV to develop a new framework for fusion learning for multivariate parameters, especially in nonparametric settings. This fusion learning framework imposes no model assumptions on the data or statistics in the individual studies. It has been shown to be efficient, general, and robust, by both theoretical properties and numerical studies. In nonnormal settings, it can reduce bias and improve efficiency in inference, as observed in our simulation studies. In addition, our fusion framework can easily adapt to complex heterogeneous study settings where existing methods fail. In particular, it can incorporate indirect evidence from heterogeneous studies for which the target parameter is not estimable, and thereby achieve an additional gain of efficiency, as illustrated in both our simulations and the case study. The phenomenon of incorporating indirect evidence to gain efficiency has also been observed, though only in normal or asymptotically normal settings, in, for example, Yang et al. (2014), Liu, Liu, and Xie (2015), Hoff (2019), Chen, Chatterjee, and Carroll (2013), Chatterjee et al. (2016), and Gao and Carroll (2017). The last three combined information from diverse studies through estimating equations, using large-sample central limit theorems under parametric models.
Our proposed formula for nonparametric fusion learning is versatile, as it permits flexible choices of the fusion elements, namely, the depth function D(·), the mapping function g(·), and the weighting scheme w_k's. Unless there are concerns that not all studies are equally trustworthy, we may use equal weights (w_k = 1) with the mapping function ϕ(·) = log(·) and half-space or simplicial depth for general implementations. This set of choices is used in our numerical studies, showing our fusion formula to compete well with the classical approaches in the normal case and to outperform them in nonnormal cases, in both gaining efficiency and reducing bias. This strong performance is rooted in the theoretical results established in Theorems 4 and 5, which ensure that the fusion recipe achieves high-order accuracy and optimal efficiency in Bahadur's sense.
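Under this set of choices, the combining step with ϕ(·) = log(·) and equal weights behaves like a Fisher-type combination of the studies' centrality values. A minimal sketch under that reading follows; the combining recipes in Equations (14) and (18) may differ in detail.

```python
import numpy as np
from scipy.stats import chi2

def combine_log(centralities):
    """Fisher-type combination with the log mapping and equal unit weights:
    -2 * sum(log c_k) is referred to a chi-square with 2K df."""
    c = np.asarray(centralities, dtype=float)
    return chi2.sf(-2.0 * np.sum(np.log(c)), df=2 * len(c))

# The Airbus and Boeing centrality values from the case study.
print(combine_log([0.006, 0.167]))   # approximately 0.008
```

Reassuringly, under this reading the two individual centrality values 0.006 and 0.167 combine to roughly 0.008, the fused p-value reported in the case study.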
Fusion approaches derived from geometric depths such as half-space or simplicial depth have the desirable property of being nonparametric, and hence broad applicability. Although efficient exact algorithms for computing half-space and simplicial depths are available thus far only for dimensions up to 3, as seen in Rousseeuw and Struyf (1998), the random approximation algorithm in Cuesta-Albertos and Nieto-Reyes (2008) is computationally efficient in any dimension. The refined sample geometric depth constructed in Einmahl, Li, and Liu (2015), by incorporating extreme value theory, can help mitigate the inherent complications in calculating sample depth outside the data region (which by definition is zero without refinement) or in breaking the increasing number of ties in high-dimensional settings. As computing technology continues its fast advances, alongside the ongoing research effort in the computational geometry community to develop efficient depth-computing algorithms, we believe that the concern over the computational feasibility of depth is likely to lessen gradually.
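To illustrate why the random approximation scales well, below is a minimal sketch of random-projection half-space (Tukey) depth in the spirit of Cuesta-Albertos and Nieto-Reyes (2008); the parameter values and function name are ours.

```python
import numpy as np

def random_tukey_depth(point, cloud, n_dirs=500, rng=None):
    """Approximate half-space depth: the minimum, over random unit
    directions u, of the fraction of the cloud at or beyond the point
    along u. The approximation upper-bounds the exact depth."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(size=(n_dirs, cloud.shape[1]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    tail = np.mean(cloud @ u.T >= point @ u.T, axis=0)
    return tail.min()

# A central point is deeper than a peripheral one (hypothetical cloud).
rng = np.random.default_rng(6)
cloud = rng.normal(size=(1000, 3))
print(random_tukey_depth(np.zeros(3), cloud, rng=rng),
      random_tukey_depth(np.array([3.0, 0.0, 0.0]), cloud, rng=rng))
```

Each additional direction only tightens the bound, and the cost is linear in both the sample size and the number of directions, regardless of the dimension.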
Similar to a Bayesian posterior distribution, a CD uses a (sample-dependent) distribution function on the parameter space to estimate the target parameter, and it contains the information for all possible inferences. But, unlike a Bayesian method, a CD method does not need to assume any prior distribution. In most cases, when the sample size is sufficiently large, a Bayesian posterior can be shown to be a CD under suitable regularity conditions (see, e.g., Xie and Singh 2013; Thornton and Xie 2020). In this case, the Bayesian posterior can be used as a CD to draw inferences and also as an input study in the CD fusion learning framework. Although the notion of CD is developed entirely within the frequentist framework, it has been shown to provide a common platform for unifying Bayesian, frequentist, and fiducial approaches, and also for making direct comparisons or combinations of inferences across these different paradigms. This also shows that Bayesian, frequentist, and fiducial (BFF) inferences are indeed much more congruous than they have historically been perceived to be in the scientific community, as argued in recent research (Kass 2011; Reid and Cox 2015; Hannig et al. 2016; Shen, Liu, and Xie 2018; Thornton and Xie 2020).

A depth-CV, obtained through a depth-CD (see Section 3.2), incorporates the idea of a centrality measure from data depth to form nested central regions expanding with growing probability mass. The capturing of these nested central regions, with their associated probability coverages, is key to making the depth-CV such a versatile and effective multivariate inference tool. This formulation of central regions expanding with growing probability is akin to the "quality index" and "multivariate spacings" considered in Liu and Singh (1993) and Li and Liu (2008) in the context of assessing the distribution underlying the data for quality control purposes. The depth-CV and the fusion learning method developed in this article can help broaden those two problem settings to make them more practical, especially in multivariate quality control.
The concept of depth-CD plays a key role in developing our nonparametric multivariate fusion framework. As a distribution function embedded in the parameter space, a depth-CD conveys the level of "confidence" in each possible parameter value, w.r.t. the given data. It is an omnibus form of all intrinsic inferences for a parameter, including the common inferences of point estimates, confidence intervals/regions, and p-values. This all-inclusive characteristic affords our fusion scheme the desirable theoretical and numerical properties seen in this article. Given that the depth-CD is a general multivariate extension of the CD, many challenging fusion learning problems in the scalar or normal setting that have been solved by combining CDs can be expected to be solvable by using depth-CDs when they arise in nonparametric multivariate settings. These include robust inference with outlying studies (Xie, Singh, and Strawderman 2011), exact inference for discrete data (Yang et al. 2016), efficient inference for heterogeneous studies or network meta-analysis (Clagget, Xie, and Tian 2014; Yang et al. 2014; Liu, Liu, and Xie 2015) or with external data, scalable split-conquer-combine approaches for massive data (Chen and Xie 2014), and individualized inference for a particular study (Shen, Liu, and Xie 2019). Cheng, Liu, and Xie (2017) gave a brief review of fusion learning via CDs.
For ease of presentation, this article has focused on the fusion of independent studies, with iid observations within each study. But we stress that our fusion framework is quite general and can remain applicable even if these assumptions are violated. For example, our fusion extends to possibly non-iid observations within a study. Specifically, since the fusion is on the inference for a common parameter shared by the studies, our method is valid as long as the estimate within each study converges to the common parameter, even if the observations are not identically distributed. In our bootstrap implementation, the method remains valid as long as the bootstrap works, as in the non-iid settings covered in Liu (1988). Our method can also be extended in the direction of fusing related studies, as seen in Li, Hung, and Xie (2020).
Our approach is shown to enable the fusion of multivariate inferences from a wide range of data sources, including irregular, incomplete, or otherwise heterogeneous studies of various types. The development of the depth-CV here may be further extended to cover different data types in the domains of directional data (data on circles/spheres) (Liu and Singh 1992) and functional data (López-Pintado and Romo 2009; Claeskens et al. 2014; Narisetty and Nair 2016; Fan and Liu 2019), where applications abound, for example, an efficient fusion of the different existing climate or weather forecasting approaches. These extensions would be worth exploring.

Supplementary Material
Supplementary materials containing all the technical proofs for this article are available online.