Distributed Censored Quantile Regression

Abstract This article discusses an extension of censored quantile regression to a distributed setting. With the growing availability of massive datasets, it is often an arduous task to analyze all the data efficiently with limited computational facilities. Our proposed method, which attempts to overcome this challenge, comprises two key steps, namely: (i) estimation of both the Kaplan-Meier estimator and the model coefficients in a parallel computing environment; (ii) aggregation of the coefficient estimates from the individual machines. We study the upper limit on the order of the number of machines for this computing environment, which, if respected, guarantees that the proposed estimator converges at a rate comparable to that of the oracle estimator. In addition, we provide two further modifications for distributed systems: (i) a communication-facilitated adaptation in the sense of Chen, Liu, and Zhang and (ii) a nonparametric counterpart along the direction of Kong and Xia for censored quantile regression. Numerical experiments are conducted to compare the proposed and existing estimators. The promising results demonstrate the computational efficiency of the proposed methods. Finally, for practical concerns, a cross-validation procedure is also developed to better select the hyperparameters for the proposed methodologies. Supplementary materials for this article are available online.


Introduction
Quantile regression (Koenker and Bassett 1978) has been a popular tool for statistical modeling, including the analysis of time-to-event data. It directly relates the quantiles of the event time to covariates of interest, with applications in areas including, for example, business analysis for airline alliances and app stores; see, for example, Gudmundsson and Rhoades (2001) and Jung, Baek, and Lee (2012).
Censored quantile regression has also become a popular class of models for various applications. One of the first models was studied by Powell (1984, 1986) for fixed censoring. To extend the censoring mechanism to random censoring, Ying, Jung, and Wei (1995) proposed a semiparametric method under the independent censoring assumption. Portnoy (2003) further relaxed the assumption on the censoring mechanism and proposed a recursive inference procedure based on the redistribution-to-the-right principle underlying the self-consistency of the Kaplan-Meier (KM) estimate. Subsequent works, including Peng and Huang (2008), Wang and Wang (2009), Wang, Zhou, and Li (2013), Kong and Xia (2017), Xu et al. (2017), and Backer, Ghouch, and Keilegom (2019), to name but a few, have been actively developed over the past two decades.
In the era of big data, advances in technology facilitate the collection of vast volumes of data. Usually, the full datasets of interest cannot be processed by a single, standalone computer due to memory or storage limits. To solve this problem, the seminal works of Jordan (2013) and Li, Lin, and Li (2013) discussed divide-and-conquer and distributed statistical methods.
One of their key contributions is the observation that, instead of analyzing the full dataset as a whole, one may split the massive dataset concerned into a number of smaller sub-datasets to which traditional inference procedures can be applied. The final estimates can then be derived from a suitable aggregation of these preliminary estimates from the individual sub-datasets; see, for example, Volgushev, Chao, and Cheng (2019).
Another variant of the divide-and-conquer technique applies when there is only one machine whose capacity can only handle samples of limited sizes: one can iteratively analyze small sub-datasets in a similar way as in mini-batch stochastic gradient descent (Chen, Liu, and Zhang 2019). Such a scenario is frequently encountered when individual machines are not equipped with large physical memory. In addition to studies on parametric models (Wang et al. 2018; Zhao, Zhang, and Lian 2019; Dobriban and Sheng 2019), Wróbel, Gudyś, and Sikora (2017) and Sikora, Wróbel, and Gudyś (2019) considered separate-and-conquer methods to estimate survival functions for censored data, and Zhang, Duchi, and Wainwright (2013, 2015), Szabó and van Zanten (2019), Xu, Shang, and Cheng (2018, 2019), and Kang and Jhun (2020) investigated nonparametric methods, providing the convergence rates of the estimators in various distributed models. Some very recent studies can be found in He et al. (2020), Fan, Lin, and Yin (2020), Yu, Chao, and Cheng (2020), and Galvao, Gu, and Volgushev (2020), among others.
We also observe the necessity of reducing the computational complexity of handling censored data for two reasons. First, with the development of technology, there is an increasing demand for efficient algorithms to handle large-scale censored data. Examples include, but are not limited to, healthcare (post-transplant mortality; Tang et al. 2022), engineering (device reliability; Kuhajda 2016; Yang et al. 2022), economics (unemployment duration; Danacica and Babucea 2010; Wang, Li, and Reddy 2019), and e-commerce marketing (customer lifetime modeling; Chen et al. 2018). Second, similar to the problem in kernel ridge regression mentioned in Zhang, Duchi, and Wainwright (2013), to the best of our knowledge, existing censored quantile regression algorithms typically incur high computational costs. Wang and Wang (2009) and Kong and Xia (2017) adopted a mass-redistribution approach in which the weight for each sample is calculated based on the Kaplan-Meier (KM) estimate. However, the associated computational complexity is $O(N^3)$, where $N$ denotes the sample size. This level of complexity is substantially higher than that of the estimation of the model coefficients itself and becomes a severe bottleneck. For example, as later discussed in our simulation study, it takes an extended period of time to directly apply the Wang and Wang (2009) method even when the sample size is 10,000. In addition, one can better select the bandwidth in cross-validation if both censored and uncensored observations are considered; the resulting parameter is expected to reduce the estimation error of the coefficients.
Observing this problem, we adapt the existing algorithms for censored data to a distributed fashion to facilitate the computation. As will be further elaborated in the sequel, the computational complexity of our proposed algorithms is much reduced compared with the undistributed version. If the computational facilities allow, one can also deploy our algorithms in a distributed environment to further save time, though the total complexity is unchanged. As a by-product, deploying the algorithms in a distributed environment can also, for instance, help ensure the security or privacy of the data: one exchanges only digested information, say, the aggregated gradient, rather than the raw user data (see Reddi et al. 2020). Our proposed algorithms are applicable in these scenarios.
In addition to the main concerns about computational complexity, there is also a bandwidth parameter for the local weights that requires tuning. Although Wang and Wang (2009) introduced a cross-validation procedure for a single machine, it only uses uncensored observations for evaluation. We are therefore also interested in developing further improvements to the relevant cross-validation procedure.
We address these issues in this article, whose contributions can be summarized as follows. (i) First, we formulate a distributed censored quantile regression model and provide theoretical guarantees in the form of a sufficient condition on the maximum order of the number of machines $S$. Although the existing analyses in Wang and Wang (2009) and Kong and Xia (2017) can be simply adapted by averaging the error bounds over the working machines, such a straightforward aggregation overlooks the independence among machines. Upon this observation, we provide a weaker sufficient condition on $S$. Also, as mentioned previously, the computational complexity with a single machine is $O(N^3)$; in contrast, the overall complexity of the distributed method is reduced to $O(N^3 S^{-2})$, which suggests the effectiveness of the proposed method in improving computational efficiency. Our numerical results show a 20-fold time reduction when using distributed censored quantile regression without compromising the estimation precision. To the best of our knowledge, there have been individual works on censored observations in large-scale learning (see, e.g., Zuo et al. 2020), but there is little theoretical justification of whether or not the proposed estimators converge as fast as the oracle one does. (ii) Second, in addition to extending the framework of Volgushev, Chao, and Cheng (2019) to censored observations, we also discuss the corresponding treatments for the communication-facilitated algorithm in the sense of Chen, Liu, and Zhang (2019) and for nonparametric censored quantile regression formulated under Kong and Xia (2017)'s model, so as to cater to the different needs of users of censored quantile regression models. (iii) Third, there is a bandwidth parameter that requires tuning in distributed censored quantile regression. To better select this hyperparameter, we propose an improved version of the cross-validation procedure considered in Wang and Wang (2009). Our promising numerical results demonstrate that our proposal can yield a reduction in mean square errors and better empirical coverage for the estimates. This generalized approach is applicable to many other cross-validation procedures that involve censored observations as well.
The structure of this article is as follows. In Section 2, we define the censored quantile regression problem concerned and introduce how to adapt locally weighted censored quantile regression to a distributed version. This section also states the main theorem, which quantifies the admissible number of working machines through the Bahadur representation of the estimates: when the number of working machines grows slowly with respect to the sample size, the proposed estimator is consistent and, upon suitable scaling, asymptotically normally distributed. Section 3 discusses the theoretical results for the communication-facilitated method for censored data (Chen, Liu, and Zhang 2019) and for the distributed version of nonparametric censored quantile regression (Kong and Xia 2017). Section 4 reports an extensive set of simulations to verify the numerical performance of our proposed methods. Two data analyses are also included, followed by some concluding remarks in Section 5.

Notation and Assumptions
In this section, we describe the mathematical formulations of the problems concerned and the corresponding inferential procedures. Let $T$ and $x$ denote the survival time and the $p \times 1$ vector of observable covariates, respectively. Define $F_0(t|x) = P(T > t|x)$ as the conditional survival function of $T$ given $x$, and let $C$ be the censoring variable, which is conditionally independent of $T$ given $x$. The censoring indicator is denoted as $\delta = I(T \le C)$. Instead of observing the true event time $T$, we only observe $Y = \min(T, C)$ and $\delta$ as the response. Our goal is to estimate the $p$-dimensional quantile coefficient $\beta_\tau$ for some quantile level $\tau \in (0, 1)$ satisfying
$$T = x^\top \beta_\tau + \epsilon(x),$$
where $\epsilon(x)$ is a random error whose $\tau$th conditional quantile is zero. Alternatively, the $\tau$th quantile of $T$ given $x$ is given by
$$Q_\tau(T \mid x) = x^\top \beta_\tau.$$
Denote $(Y_i, X_i, \delta_i)$ for $i = 1, \ldots, N$ as iid samples of $(Y, X, \delta)$. To estimate $\beta_\tau$ in a parallel computing environment with $S$ computers or machines, we divide the total $N$ samples into $S$ partitions, with each machine handling $n = N/S$ samples. Without loss of generality, we assume throughout the paper that $S$ divides $N$. Denote these $S$ sub-datasets as $\mathcal{S}_s$ for $s = 1, \ldots, S$.
In machine $s$, we implement the locally weighted censored quantile regression. As proposed in Wang and Wang (2009), this method leverages the redistribution-of-mass idea discussed in Efron (1967), which redistributes the probability mass $P(T > C \mid C, x)$ of the censored cases to the right. In the ideal situation where $F_0$ is known, the weight for the $i$th observation is
$$w_i(F_0) = \begin{cases} 1, & \delta_i = 1 \text{ or } 1 - F_0(C_i|X_i) > \tau, \\[4pt] \dfrac{\tau - \{1 - F_0(C_i|X_i)\}}{F_0(C_i|X_i)}, & \delta_i = 0 \text{ and } 1 - F_0(C_i|X_i) < \tau. \end{cases} \qquad (1)$$
For this specific machine $s$, the model parameter $\beta_\tau$ can be defined as the minimizer of
$$Q_{ns}(\beta, F_0) = \sum_{i \in \mathcal{S}_s} \Big[ w_i(F_0)\,\rho_\tau(Y_i - X_i^\top \beta) + \{1 - w_i(F_0)\}\,\rho_\tau(Y^{+\infty} - X_i^\top \beta) \Big], \qquad (2)$$
where $\rho_\tau$ corresponds to the celebrated check function $\rho_\tau(u) = u\{\tau - I(u < 0)\}$ proposed in the seminal paper of Koenker and Bassett (1978), and $Y^{+\infty}$ is a sufficiently large pseudo value defined below.
To understand the weight (1), one may consider the special case with no censoring, where all the weights $w_i(\cdot) = 1$ for $i = 1, \ldots, n$; the objective (2) then reduces to the original Koenker and Bassett (1978) version. When there are censored observations, however, the sign of each component of the negative subgradient of $Q_{ns}(\beta, F_0)$ with respect to $\beta$ may not be determined for cases where $C_i < X_i^\top \beta$ and $\delta_i = 0$. Hence, to redistribute the complementary probability mass $P(T_i > C_i \mid C_i, X_i)$ to the right, one needs to evaluate
$$P(T_i \le X_i^\top \beta_\tau \mid T_i > C_i, X_i) = \frac{\tau - \{1 - F_0(C_i|X_i)\}}{F_0(C_i|X_i)},$$
which coincides with $w_i(F_0)$ for the case when $\delta_i = 0$ and $1 - F_0(C_i|X_i) < \tau$. As the sign is preserved for any large value, we replace the unobserved $T_i$ by a sufficiently large number that exceeds all $X_i^\top \beta_0(\tau)$, which we denote as $Y^{+\infty}$. Since $F_0$ is unknown, a consistent estimator of $w_i(F_0)$ is needed to ensure that the estimating equation (2) is implementable. Denote by $h_n$ a sequence of constants converging to 0 as $n \to \infty$.
For a kernel function $K(x)$, define $K_{h_n}(x) = K(x/h_n)$. Then $F_0$ in the $s$th machine can be estimated by the locally weighted Kaplan-Meier estimator
$$\hat{F}_s(t|x) = \prod_{j \in \mathcal{S}_s:\, Y_j \le t,\, \delta_j = 1} \left\{ 1 - \frac{B_{sj}(x)}{\sum_{k \in \mathcal{S}_s} I(Y_k \ge Y_j)\, B_{sk}(x)} \right\}, \qquad (3)$$
where the $B_{sj}(x)$'s are nonnegative weights adding up to 1 for the $s$th working machine at $x$, with
$$B_{sj}(x) = \frac{K_{h_n}(X_j - x)}{\sum_{k \in \mathcal{S}_s} K_{h_n}(X_k - x)}.$$
After obtaining the estimate of $F_0$ in machine $s$, one can obtain the estimated coefficient $\hat{\beta}^s_\tau$ by solving
$$\hat{\beta}^s_\tau = \arg\min_{\beta}\, Q_{ns}(\beta, \hat{F}_s). \qquad (4)$$
The final estimator is obtained through the simple aggregation $\hat{\beta}_\tau = S^{-1} \sum_{s=1}^S \hat{\beta}^s_\tau$, which is the output of Algorithm 1.
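To make (1)-(4) concrete, the following is a minimal single-machine sketch in R together with the simple distributed average. It assumes a univariate covariate, the biweight kernel used later in our simulations, and a crude choice of the pseudo value $Y^{+\infty}$; the helper names are ours, for illustration only. The weighted fitting is done with the quantreg package, as discussed next.

```r
## A minimal sketch of locally weighted censored quantile regression on
## one machine (Wang and Wang 2009) and its simple distributed average.
## Assumptions: univariate X, biweight kernel, Y^{+infty} = max(Y) + 100.
library(quantreg)

biweight <- function(u) 15 / 16 * (1 - u^2)^2 * (abs(u) <= 1)

lcrq_fit <- function(X, Y, delta, tau, h) {
  n <- length(Y)
  atRisk <- outer(Y, Y, ">=")            # atRisk[k, j] = I(Y_k >= Y_j)
  ord <- order(Y)
  w <- rep(1, n)                         # uncensored points keep weight 1
  for (i in which(delta == 0)) {
    B <- biweight((X - X[i]) / h)        # local weights around X_i, as B_sj
    if (sum(B) == 0) next
    B <- B / sum(B)
    surv <- 1                            # locally weighted KM survival at Y_i
    for (j in ord) {
      if (delta[j] == 1 && Y[j] <= Y[i] && B[j] > 0)
        surv <- surv * (1 - B[j] / sum(B[atRisk[, j]]))
    }
    Fhat <- 1 - surv                     # estimate of P(T <= Y_i | X_i)
    w[i] <- if (Fhat < tau) (tau - Fhat) / (1 - Fhat) else 1
  }
  ## redistribution-of-mass: each censored row with w_i < 1 is split into
  ## (X_i, Y_i) with weight w_i and a pseudo point (X_i, Yinf) with 1 - w_i
  Yinf <- max(Y) + 100
  cen <- which(delta == 0 & w < 1)
  Xa <- c(X, X[cen]); Ya <- c(Y, rep(Yinf, length(cen)))
  wa <- c(w, 1 - w[cen])
  coef(rq(Ya ~ Xa, tau = tau, weights = wa))
}

## distributed estimator: fit each subset separately and average, as in (4)
dlcq <- function(X, Y, delta, tau, h, S) {
  folds <- split(seq_along(Y), rep_len(1:S, length(Y)))
  fits <- sapply(folds, function(id) lcrq_fit(X[id], Y[id], delta[id], tau, h))
  rowMeans(fits)
}
```

The double loop over censored points and event times makes the $O(n^3)$ cost of the weight computation visible; the quantile regression fit itself is comparatively cheap.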
As discussed in Wang and Wang (2009), weighted quantile regression is easy to implement using existing R packages. After obtaining the weight $w_{ni}$ for each sample in machine $s$, for each censored observation we append an additional pseudo observation $(X_i, Y^{+\infty})$ with weight $1 - w_{ni}$ to the dataset. After rearranging the dataset, one can carry out the estimation procedure with the function rq in the R package quantreg. Existing packages such as kernlab can also be applied to evaluate $w^s_{ni}$ for $s = 1, \ldots, S$. Besides the survival function $F_0$ of the event time $T$, we denote $1 - G(t|x) = P(C > t|x)$ as the conditional survival function of the censoring variable $C$ given $x$, and $h(x)$ as the marginal density of $x$. Through the locally weighted KM estimator, one can estimate $F_0$ provided that it is sufficiently smooth in the following sense.

Definition 1. Let $\|z\|$ denote the $L_2$ norm of a vector $z$. For a multi-index $r = (r_1, \ldots, r_p)$ with $[r] = \sum_{i=1}^p r_i$, denote by $D^r$ the differential operator $\partial^{[r]}/\partial x_1^{r_1} \cdots \partial x_p^{r_p}$. For $s > 0$, let $s'$ be the largest integer strictly smaller than $s$, and define
$$C^s_M(\mathcal{X}) = \left\{ f : \sup_{[r] \le s'} \sup_{x \in \mathcal{X}} |D^r f(x)| \le M,\ \sup_{[r] = s'} \sup_{x \ne x'} \frac{|D^r f(x) - D^r f(x')|}{\|x - x'\|^{s - s'}} \le M \right\}.$$
The set $C^s_M(\mathcal{X})$ is essentially the class of functions whose partial derivatives up to order $s'$ are uniformly bounded, and whose highest-order partial derivatives are Lipschitz of order $s - s'$. If a class of functions is smooth enough, that is, contained in $C^s_M(\mathcal{X})$, then its covering number grows slowly in $n$.
Upon this definition of smoothness, we impose the following assumptions, which are needed to establish the asymptotic results of the proposed method. (A2) The density $g$, as a function of $x$, and its derivative $g'$ belong to $C^{r_0}_M(\mathcal{D})$ for some $r_0 > 0$; the density $g$ is uniformly bounded. (A3) The distribution function $F_0(t|x)$, as a function of $x$, and its derivative belong to $C^{r_0}_M(\mathcal{D})$. Assumption (A1) is commonly adopted in the literature: a wide range of distributions, such as the Gaussian distribution, sub-Gaussian distributions, and distributions defined on compact sets, satisfy this assumption. The smoothness assumptions (A2) and (A3) are valid as long as the distribution functions are smooth enough. Assumption (A4) imposes mild restrictions on the kernel function, which are satisfied by many common kernels, including the Gaussian kernel. Assumption (A5) gives a range of possible rates at which $h_n$ converges to 0 as $n$ grows: the lower bound $nh_n^p/\log n \to \infty$ is essential as it ensures that not all the $K_{h_n}(X_i - x)$'s are almost zero, while the upper bound $nh_n^{p+s_0}/\log n < \infty$ is necessary to control the error of the KM estimator (see Kong and Xia 2017). Assumption (A6) controls the smoothness of the functions in relation to $p$. We remark that similar conditions are also required in Wang and Wang (2009) and Liang et al. (2012).

Preliminary Results
When $S = 1$, the distributed censored quantile regression reduces to the setting considered in Wang and Wang (2009); thus, $n^{1/2}(\hat{\beta}_\tau - \beta_\tau)$ is asymptotically normal. This is also true when $S$ is a finite fixed constant. The proposition below, a direct consequence of Wang and Wang (2009), characterizes the KM estimator in each working machine.

Proposition 2 (Theorem 2 of Wang and Wang 2009). Under Assumptions (A1) to (A6), for each $s = 1, \ldots, S$, $\hat{F}_s(t|x) - F_0(t|x)$ admits the linear representation (5) almost surely and uniformly in $x \in \mathcal{D}$ and $t > 0$.

As mentioned in Wang and Wang (2009), the KM estimator $\hat{F}_s$ is uniformly consistent for $F_0$, and it admits a convenient linear representation. This is an advantage over other existing methods. For instance, in Portnoy (2003), a simplex pivoting method is adopted in which a linear quantile regression model is fitted at a sequence of breakpoints; however, it is difficult to obtain the stochastic expansion of $\hat{F} - F_0$ using such a recursive method. Following the representation of the KM estimator, one can establish the Bahadur representation of $\hat{\beta}^s_\tau$ for each working machine, with remainder term $r^s_N$ for machine $s$; write $r_N = S^{-1}\sum_{s=1}^S r^s_N$. Taking a naïve average of the errors over all machines then yields the following consistency and weak convergence results.

Corollary 3 (Consistency through simple aggregation). If Assumptions (A1)-(A6) hold, and $S$ is chosen such that $r_N = o_p(1)$, then $\hat{\beta}_\tau - \beta_\tau \to 0$ in probability.

Corollary 4 (Asymptotic normality through simple aggregation). If Assumptions (A1)-(A6) hold, and $S$ is chosen such that $r_N = o_p(N^{-1/2})$, then the aggregated estimator $\hat{\beta}_\tau$ satisfies that $N^{1/2}(\hat{\beta}_\tau - \beta_\tau)$ converges to a Gaussian distribution as $N \to \infty$.
For the consistency result, a weaker assumption on $S$ is imposed compared with the convergence-in-distribution result: as long as the estimator in each working machine is consistent, their average is still consistent. For establishing the distributional convergence result, the remainder terms are affected by the number of samples in the individual machines; thus, although $\hat{\beta}^s_\tau$ converges weakly to a Gaussian distribution, there is no guarantee that $\hat{\beta}_\tau$ converges as fast as the oracle estimator. A more careful analysis is therefore required.

Main Results: Bahadur Representation
Corollaries 3 and 4 are consequences of the estimator in each working machine achieving asymptotic consistency and $N^{1/2}$-normality, respectively. These results, however, do not take into account the mutual independence amongst the machines, which can, in fact, result in smaller estimation errors upon aggregation. The sufficient condition on the number of machines $S$ can therefore be further relaxed. The following lemma presents an improved version of the Bahadur representation for $\hat{\beta}_\tau$ with a more refined remainder term.

Lemma 5. Under Assumptions (A1)-(A6), $\hat{\beta}_\tau$ admits a Bahadur representation whose residual term $R_N = S^{-1}\sum_{s=1}^S R^s_N$ is an average of the remainder terms $R^s_N$, $s = 1, \ldots, S$, from the individual working machines. When there is no censoring, the bound on $R_N$ reduces to the same one derived in Volgushev, Chao, and Cheng (2019).
To prove Lemma 5, we need to control the key term $R^s_N$. Together with the probability bounds considered in Wang and Wang (2009), we establish an analytical form for the expectation of $R^s_N$. Since the $R^s_N$'s are iid across individual machines, the average $R_N$ shares the same mean but has a smaller range, which leads to a tighter bound in Lemma 5 compared with $r_N$ in Corollary 3. A similar observation can be found in (S.6.8) for the improved aggregation in Volgushev, Chao, and Cheng (2019). Since our formulation handles partially observed censored data, some additional terms are involved.
Based on Lemma 5, one can establish the following asymptotic results.

Theorem 6 (Consistency). Under the conditions assumed in Lemma 5, when $S$ is chosen such that $R_N \to 0$, then $\hat{\beta}_\tau - \beta_\tau \to 0$.
Theorem 7 (Asymptotic normality). Under the conditions assumed in Lemma 5, when $S$ is chosen such that $R_N = o(N^{-1/2})$, $N^{1/2}(\hat{\beta}_\tau - \beta_\tau)$ converges weakly to a Gaussian distribution.

Compared with Corollaries 3 and 4, Theorems 6 and 7 relax the requirement on the number of machines $S$. Such a relaxation is significant. For instance, for the purpose of illustration, assume (improperly) that $h_n = 1$; then Corollary 4 requires $S = O(N^{1/3} \log^{-1} N)$, while Theorem 7 only requires $S = O(N^{1/2} \log^{-1/2} N)$; a similar observation holds for the consistency results. The order $O(N^{1/2} \log^{-1/2} N)$ also matches the necessary condition in Volgushev, Chao, and Cheng (2019) up to a slight difference.
It is also important to note another merit of the proposed method, namely the reduction in computational complexity. For $S = 1$, the complexity is $O(N^3)$. For a general $S$, each of the $S$ machines computes its local KM weights on only $n = N/S$ samples, so the corresponding order is $O(N^3 S^{-2})$, which is substantially lower than that in the single-machine setting when $S \to \infty$.
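The bookkeeping behind this order is elementary; assuming the $O(n^3)$ per-machine cost of the locally weighted KM weights noted above,
$$\underbrace{S}_{\#\text{machines}} \times O\!\Big(\big(\tfrac{N}{S}\big)^{3}\Big) = O\big(N^{3}S^{-2}\big), \qquad \text{e.g., } N = 10^{4},\ S = 10:\ 10^{12} \longrightarrow 10^{10} \text{ operations}.$$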
Remark 8. In this article, we consider the scenario where the full dataset is randomly partitioned across workers, in which case the sub-dataset assigned to each worker independently follows the same distribution. Our results in Lemma 5 and Theorems 6 and 7 still hold when this condition is slightly relaxed. Assume all samples are independent of each other and the distribution of $(X, T)$ is the same across all workers. When the distributions of the censoring variable $C$ differ across workers, $n^{1/2}(\hat{\beta}^s_\tau - \beta_\tau)$ still converges asymptotically to a zero-mean Gaussian distribution whose variance depends on the specific distribution of $C$ in each worker. Consequently, the final aggregated estimate $\hat{\beta}_\tau$ remains consistent and the conventional asymptotic normality is still preserved.

Communication during Training
The distributed algorithm introduced in the previous section executes the estimation procedure in each worker in a parallel manner, without communication among the workers before the final aggregation. One may thus ask whether the algorithm can be improved via communication among the workers. While Xu, Tian, and Gu (2016) and Fan, Guo, and Wang (2021) consider such communication for smooth loss functions, the algorithm in Chen, Liu, and Zhang (2019) can be viewed as a communication-efficient distributed algorithm tailored for quantile regression. Different from the scenario in Volgushev, Chao, and Cheng (2019), where a distributed system comprises several machines, Chen, Liu, and Zhang (2019) considered the case where only one machine with limited memory is available. Despite this, small adaptations allow the algorithm to be implemented in a distributed fashion as well.
To properly handle partially observed data due to right censoring, we approximate the original loss function by a smoothed counterpart in which the indicator $I(x < 0)$ inside the check function is replaced by a smooth approximation $H$, with $w_i$ defined in (1). Here, $H(x) = 1$ when $x \ge 1$ and $H(x) = 0$ when $x \le -1$; for example, as mentioned in Chen, Liu, and Zhang (2019), one can take
$$H(v) = 1/2 + 15(v - 2v^3/3 + v^5/5)/16, \qquad |v| \le 1.$$
As we shall discuss later, an iterative method is applied to estimate $\hat{\beta}_\tau$ with a decreasing sequence of bandwidths $h_g$ for $g = 1, \ldots, q$. Although an approximation error arises between the smoothed check function $K_H(x)$ and $\rho_\tau(x)$, it is dominated by the $N^{-1/2}$-order term if $S$ and $h_g$ are properly chosen.
As discussed previously, calculating the weights in Wang and Wang (2009) incurs a huge computational cost, especially when the sample size is large. Thus, we calculate them within the subsets $\mathcal{S}_s$ for $s = 1, \ldots, S$. After obtaining the weight for each sample, in each step $g$ for $g = 1, \ldots, q$, we calculate, for each subset of data $\mathcal{S}_s$, the gradient-type and Hessian-type quantities $U^g_s$ and $V^g_s$ in (6), for bandwidth $h_g$ and the current estimate $\hat{\beta}^{g-1}$. With a preliminary estimate $\hat{\beta}^{g-1}$ that is close to $\beta_\tau$, using the first-order optimality condition, one can update the estimate $\hat{\beta}^g$ as in (7). For the initialization, we adopt Wang and Wang (2009)'s estimate on $\mathcal{S}_1$ to obtain an initial $\hat{\beta}^0$. The detailed procedure is given in Algorithm 2. Unlike distributed censored quantile regression, where $\mathcal{S}_s$ is only used in the $s$th machine, all the data are used in each iteration under the communication-facilitated framework.
Algorithm 2 Distributed Algorithm with Communication (DC)
Input: data $(X_1, Y_1), \ldots, (X_N, Y_N)$; number of machines $S$; quantile level $\tau$; bandwidth $h_n$ for the KM estimator; bandwidth sequence $(h_1, \ldots, h_q)$.
Partition the $N$ samples randomly into $S$ subsets of equal size, denoted $\mathcal{S}_1, \ldots, \mathcal{S}_S$.
for $s = 1$ to $S$ do
  Obtain $\hat{F}_s$ using the samples in $\mathcal{S}_s$ based on (3). Calculate $w_i$ for $i \in \mathcal{S}_s$.
end for
Use the samples in $\mathcal{S}_1$ to solve the censored quantile regression as in (4). Set the solution as $\hat{\beta}^0$.
for $g = 1$ to $q$ do
  for $s = 1$ to $S$ do
    Calculate $U^g_s$ and $V^g_s$ using $h = h_g$ and $\hat{\beta}^{g-1}$.
  end for
  Aggregate $\sum_{k=1}^S U^g_k$ and $\sum_{k=1}^S V^g_k$. Set $\hat{\beta}^g$ using (7).
end for
Output: $\hat{\beta}^q$.
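A hedged R sketch of one communication round follows. Since we do not restate the displays (6) and (7) here, the sketch assumes the natural Newton-type step on the smoothed, weighted check loss, namely $\hat{\beta}^g = \hat{\beta}^{g-1} - (\sum_s V^g_s)^{-1} \sum_s U^g_s$, with $U$ a smoothed subgradient and $V$ a smoothed Hessian; the helper UV_subset and the mass-redistributed design are ours, for illustration only.

```r
## Hedged sketch of one round of Algorithm 2, assuming a Newton-type step
## on the smoothed, weighted check loss; H is the quintic smoother quoted
## from Chen, Liu, and Zhang (2019), dH its derivative.
H  <- function(v) ifelse(v >= 1, 1,
        ifelse(v <= -1, 0, 0.5 + 15 * (v - 2 * v^3 / 3 + v^5 / 5) / 16))
dH <- function(v) ifelse(abs(v) < 1, 15 * (1 - v^2)^2 / 16, 0)

## Per-subset summaries at the current iterate beta with bandwidth h;
## X is an n x p design matrix, w holds the weights from (1), and Yinf
## plays the role of the pseudo point Y^{+infty}.
UV_subset <- function(X, Y, w, beta, h, tau, Yinf) {
  cen <- w < 1                                   # censored, reweighted rows
  Xa  <- rbind(X, X[cen, , drop = FALSE])        # redistribution-of-mass
  Ya  <- c(Y, rep(Yinf, sum(cen)))
  wa  <- c(w, 1 - w[cen])
  r   <- drop(Xa %*% beta - Ya) / h              # scaled residuals
  list(U = crossprod(Xa, wa * (H(r) - tau)),     # smoothed subgradient
       V = crossprod(Xa * (wa * dH(r) / h), Xa)) # smoothed Hessian
}

## One aggregation step over the S subsets stored in the list `subsets`:
## UVs  <- lapply(subsets, function(d)
##           UV_subset(d$X, d$Y, d$w, beta, h_g, tau, Yinf))
## beta <- beta - solve(Reduce(`+`, lapply(UVs, `[[`, "V")),
##                      Reduce(`+`, lapply(UVs, `[[`, "U")))
```

Only the $p$-vector $U^g_s$ and the $p \times p$ matrix $V^g_s$ cross machine boundaries, which is what makes the scheme communication-efficient.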
The bottleneck of quantile regression with uncensored data in Chen, Liu, and Zhang (2019) is the quantile regression itself, but for censored quantile regression, the bottleneck lies in calculating the $w_{ni}$'s. We need to estimate the $w_{ni}$'s using offline sub-datasets; thus, an online version with both an online KM estimator and online coefficients is not applicable in the current framework. Chen, Liu, and Zhang (2019) provided the theoretical guarantee for uncensored data. Since the weights estimated for the censored observations are consistent, the modified version with censored data works asymptotically as well. Similar to distributed censored quantile regression, the communication-facilitated method for censored data reduces to the original method of Chen, Liu, and Zhang (2019) if the event times are fully observed.
Theorem 9. Under Conditions (A1)-(A6), when $S$ is chosen such that $N^{1/2}(\hat{\beta}^0 - \beta_\tau)$ converges weakly to a limiting Gaussian distribution, if the smoothing function $H$ is twice differentiable with bounded second derivative $H''$, and the decreasing sequence $\{h_g\}$ is taken such that $h_q N^{1/2} \to c > 0$ and $qN^{-1/2} \to 0$, then $N^{1/2}(\hat{\beta}^q - \beta_\tau)$ converges weakly to a Gaussian distribution up to a remainder $\tilde{R}_N = O(R_N)$, where $R_N$ is the remainder term defined in Lemma 5 for distributed censored quantile regression. For uncensored quantile regression, $R_N = 0$.
This communication-facilitated method demands less computational effort than distributed censored quantile regression since it does not need to solve an optimization problem in every machine individually. Assume that $q'$ iterations are needed for an optimization procedure with a random initialization; then the computational cost incurred is $O(Nq')$ in the original distributed method, whereas the cost becomes $O(nq' + Nq)$ for the algorithm with communication. These orders do not take into account the cost of obtaining the KM estimates. On the other hand, the smoothing in the communication-facilitated method results in a faster convergence rate by exploiting the information in the Hessian matrix, so $q$ is usually smaller than $q'$. Consequently, the computational cost is reduced in the communication-facilitated method.
As further elaborated in Section 4, this benefit comes with a small tradeoff in stability: for large $S$, if the weights $w_{ni}$ and the initial guess are not accurate enough, the algorithm may diverge; that is, $V$ defined in (6) is no longer invertible after some iterations. From the definition of $H$, since $H(v) = 1$ for $v \ge 1$, there is always a stable proportion of samples with $H(v) > 0$, leading to a stable $U^g_s$. However, $H'(v)$ is nonzero only when $v \in [-1, 1]$, so an inaccurate $w_{ni}$ results in an unstable $V^g_s$. When the elements of $\sum_{s=1}^S V^g_s$ are much smaller than those of $\sum_{s=1}^S U^g_s$, the update becomes unstable and consequently diverges. For a finite sample of size $N$, this iterative update is not as stable as the gradient-based optimization adopted in distributed censored quantile regression; a larger $n$ in each worker is required to ensure that $w_{ni}$ is accurate enough.
Asymptotically, such instability does not impair the convergence since we require $n \to \infty$ for each worker. In Section 4, we present a simple remedy to improve the quality of the initial guess.
It is noteworthy that in Fan, Guo, and Wang (2021), the original goal of a communication-efficient distributed algorithm is to overcome the bottleneck of $S = o(N^{1/2})$, and in Chen, Liu, and Zhang (2019), there is no constraint on $S$. However, in the context of censored quantile regression, the bottleneck is the estimation of the KM estimate: when $S$ is too large, the KM estimate in each subset of data carries a large bias, which further corrupts the whole estimation procedure. As a result, we still require $S = o(N^{1/2})$ in our scenario. For Fan, Guo, and Wang (2021) and Chen, Liu, and Zhang (2019), the computational cost does not change greatly whether one takes $S = 1$ or a large $S$; hence, if we follow the original idea in Chen, Liu, and Zhang (2019) and implement the algorithm in a single machine, we have no preference for $S$ as long as there is sufficient memory. This is not the case for censored quantile regression: for the algorithms proposed in this article, the overall cost is $O(N^3)$ when $S = 1$ and $O(N^3 S^{-2})$ for $S > 1$. Consequently, even if the algorithm is executed in a single machine, it is preferable to choose an appropriate $S$ to speed up the computation.
Remark 10. There is no significant difference between implementing the procedure in a single machine (Chen, Liu, and Zhang 2019) and in a distributed system (Volgushev, Chao, and Cheng 2019). If we fix the random seed when partitioning the full dataset, and fix the random seed when solving the optimization problem in each subset of data, the two implementations produce exactly the same final result. In our simulation study, we repeat each experiment 300 times, and each repetition runs in a single machine for all subsets of data.
For the distributed algorithm without communication and the nonparametric algorithm in the later sections, assuming that it takes $m$ seconds to process a single sub-dataset, the total time taken in the same machine (single thread) is expected to be $mS$ seconds, while the time needed in a distributed system is $m$ seconds plus the communication waiting time. In the communication-facilitated method, one can foresee more communications, in which case the waiting time will be longer.

Distributed Nonparametric Censored Quantile Regression
Another important extension in the recent literature on censored quantile regression is Kong and Xia (2017), which studied the Bahadur representation of nonparametric censored quantile regression with their proposed generalized KM estimator. To understand how nonparametric censored quantile regression changes in a distributed framework, we extend Lemma 5 to the nonparametric setting. Kong and Xia (2017) adopted local polynomials at $x$ to estimate $Q_\tau(x)$. For a multi-index $r$, denote $x^r = \prod_{i=1}^p x_i^{r_i}$ for simplicity. Unlike the semiparametric model, where $Q_\tau(x)$ is a linear function of $x$, we approximate $Q_\tau(X)$ locally at $x$ by the polynomial
$$Q_\tau(X) \approx \sum_{r \in A} \beta_r (X - x)^r.$$
Define $A = \{r \in \mathbb{Z}^p_{\ge 0} \mid [r] \le k\}$ as the set of multi-indices of the polynomial terms, and $\mu_n(x) = (\delta_n^{-[r]} x^r,\, r \in A)$, where the bandwidth $\delta_n$ converges to 0 as $n \to \infty$. A further kernel function is used to localize the observations around $x$ in the weighted fit. The corresponding algorithm is summarized in Algorithm 3.

Algorithm 3 Distributed Nonparametric Censored Quantile Regression
Input: data $(X_1, Y_1, \delta_1), \ldots, (X_N, Y_N, \delta_N)$; number of machines $S$; quantile level $\tau$; evaluation point $x$; bandwidths $h_n$ and $\delta_n$.
Partition the $N$ samples randomly into $S$ subsets of equal size, denoted $\mathcal{S}_1, \ldots, \mathcal{S}_S$.
for $s = 1$ to $S$ do
  Obtain $\hat{F}_s$ using the samples in $\mathcal{S}_s$ based on (3), and calculate the weights $w_i$ for $i \in \mathcal{S}_s$.
  Solve the locally weighted local-polynomial censored quantile regression at $x$ to obtain $\hat{\beta}^s_\tau(x)$.
end for
Aggregate $\hat{\beta}_\tau(x) = S^{-1} \sum_{s=1}^S \hat{\beta}^s_\tau(x)$.
Output: $\hat{\beta}_\tau(x)$.
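As a concrete, hedged illustration of the per-machine step in Algorithm 3 (univariate X only; the helper names are ours, ww_weights is a hypothetical routine returning the weights in (1), and biweight is as defined in the earlier sketch):

```r
## Hedged sketch of the per-machine local-polynomial fit at a point x0,
## with polynomial order k; the basis is rescaled by delta_n as in mu_n(x).
local_cqr <- function(X, Y, delta, tau, x0, k, delta_n, h) {
  w  <- ww_weights(X, Y, delta, tau, h)     # censoring weights, as in (1)
  kw <- biweight((X - x0) / delta_n)        # kernel localization weights
  keep <- kw > 0                            # drop points outside the window
  X <- X[keep]; Y <- Y[keep]; delta <- delta[keep]
  w <- w[keep]; kw <- kw[keep]
  Z <- sapply(0:k, function(r) ((X - x0) / delta_n)^r)  # scaled local basis
  Yinf <- max(Y) + 100                      # crude pseudo point, as before
  cen  <- which(delta == 0 & w < 1)
  Za <- rbind(Z, Z[cen, , drop = FALSE])    # pseudo rows for censored points
  Ya <- c(Y, rep(Yinf, length(cen)))
  wa <- c(w * kw, (1 - w[cen]) * kw[cen])   # censoring x kernel weights
  fit <- quantreg::rq(Ya ~ Za - 1, tau = tau, weights = wa)
  coef(fit)[1]                              # intercept estimates Q_tau(x0)
}
```

The distributed estimate at $x_0$ is then the average of local_cqr over the $S$ subsets, mirroring dlcq above.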
In the above algorithm, the KM estimator of Wang and Wang (2009) is presented instead of the one proposed by Kong and Xia (2017) to maintain the coherence of the article. Below are the technical assumptions imposed. (B3) The kernel function $K(\cdot)$ is a symmetric probability density function on $\mathbb{R}^p$ with finite second moments and bounded first-order derivatives. (B4) For the censoring variable $C$ with conditional distribution function $G(t|x)$, $G(Q_\tau(x)|x)$ is uniformly bounded away from 1 for all $x$ and $\tau \in [\tau^*, 1 - \tau^*]$, and $G(t|x)$ is Hölder-continuous in $(x, t)$. (B6) The kernel density $K(\cdot)$ is a symmetric density function on $\mathbb{R}^p$ with finite second moments and bounded first-order derivative, and $K \in C^{r_0}_M([0, 1])$. (B7) The smoothing parameter $h_n$ satisfies $h_n \to 0$. Assumptions (B1) and (B4)-(B7) are adapted from (A1)-(A7) for nonparametric estimation. Assumption (B2) relaxes the parametric model: in local nonparametric estimation, one only focuses on the $\tau$th quantile at a certain point $x$, so a global linear representation of $Q_\tau$ is not essential; we only require that $Q_\tau$ be a smooth function. Condition (B3) regulates the kernel function used in the local censored quantile regression.
Denote $\tau_n = (\log n/(n\delta_n^p))^{1/2}$ and $\tilde{\tau}_n = (\log n/(nh_n^p))^{1/2}$. We then have the following Bahadur representation of $\hat{\beta}_\tau(x) - \beta_\tau(x)$.

Theorem 11. Under Conditions (B1) to (B7), take $\delta_n$ and $h_n$ such that $\delta_n^{r_0}/\tau_n < \infty$; then $\hat{\beta}_\tau(x) - \beta_\tau(x)$ admits a Bahadur representation whose remainder term is controlled by the rates $\tau_n$ and $\tilde{\tau}_n$.

There are two bandwidth parameters $(\delta_n, h_n)$ in distributed nonparametric censored quantile regression. Fixing $\delta_n = \delta$ for all choices of $S$ and tuning $h_n$ in each machine, for proper choices of $S \to \infty$, we have the following result.

Corollary 12. Under the conditions adopted in Theorem 11, if $\delta$ is taken uniformly across the choices of $S$, the suitably scaled $\hat{\beta}_\tau(x) - \beta_\tau(x)$ converges weakly to a Gaussian distribution.

Cross-Validation
There are two main issues addressed by our cross-validation procedure, which we illustrate with the following example. The same setting, which we refer to as Model 1, was also studied in Example 2 of Wang and Wang (2009); the covariates, event times, and censoring times are simulated accordingly. The observation is $Y = \min(T, C)$, and the quantile levels considered are $\tau \in \{0.5, 0.7\}$; the censoring rate is approximately 40%. For all of our numerical implementations, we adopt the choice of Wang and Wang (2009) and carry out the kernel estimation with the biquadratic kernel $K(x) = 15(1 - x^2)^2 I(|x| \le 1)/16$ in the univariate case; for multivariate $x = (x_1, \ldots, x_p)$, we take the product kernel $K(x_1, \ldots, x_p) = \prod_{i=1}^p K(x_i)$. The first concern is that, in Wang and Wang (2009), only the uncensored observations are included in the loss function of the cross-validation; that is, all samples are used to estimate the parameters, but only uncensored observations enter the loss in the cross-validation procedure. In Model 1, a smaller $X_i$ gives a smaller $T_i$, in which case it is less likely that $C_i < T_i$ and hence that $\delta_i = 0$. As a result, the distribution of $(X_i, T_i, C_i)$ for the uncensored data differs from that of the whole dataset, and the loss function may not reflect the true mean square error.
To incorporate censored samples in the cross-validation, we define $\hat{\beta}^{-s}_{\tau,h}$ as the estimate using bandwidth $h$ and distributed censored quantile regression with all but the $s$th machine, and $\hat{F}^s_h$ as the KM estimator obtained in the $s$th machine using bandwidth $h$. We then choose the bandwidth by minimizing a cross-validated loss $Q_n(h_0, h_n)$ over the candidate values of $h_n$, where $h_0$ is a pre-specified bandwidth used to obtain a fixed weight $w_i$ for each sample while $h_n$ is tuned; otherwise, a bandwidth favored by its own weighting scheme is likely to be chosen. It is unfair to compare different bandwidth candidates if the weights $w_i$'s are not fixed. To circumvent this problem, we propose to use $h_0$ to calculate the $w_i$'s for all candidate $h_n$'s, while using $h_n$ itself to estimate the model coefficients.
From (4), ideally, one would choose $w_i(F_0)$ as the weight for each sample. Since $F_0$ is unknown, we adopt $h_0$ to calculate the weights in $Q_n(h_0, h_n)$. This may introduce an error in the weights, which further leads to an error when choosing $h_n$. However, as shown later in one of our numerical examples, since the estimated weights lie within a reasonable range of the best weights, the cross-validation result is not severely compromised.
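A hedged sketch of this tuning scheme follows; it is an illustration under our earlier helper names (ww_weights hypothetical, dlcq from the earlier sketch) rather than the exact criterion $Q_n(h_0, h_n)$:

```r
## Hedged sketch of the proposed cross-validation: validation weights come
## from the pilot bandwidth h0 only, so candidate bandwidths h are compared
## on a common footing; censored points enter the weighted check loss.
cv_select <- function(X, Y, delta, tau, h0, h_cand, S) {
  folds <- split(seq_along(Y), rep_len(1:S, length(Y)))
  loss <- sapply(h_cand, function(h) {
    err <- 0
    for (id in folds) {
      w   <- ww_weights(X[id], Y[id], delta[id], tau, h0)  # fixed weights (1)
      fit <- dlcq(X[-id], Y[-id], delta[-id], tau, h, S - 1)
      res <- Y[id] - cbind(1, X[id]) %*% fit
      err <- err + sum(w * res * (tau - (res < 0)))        # weighted check loss
    }
    err
  })
  h_cand[which.min(loss)]
}
```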
Another issue with the conventional cross-validation process is that it can still be time-consuming for large datasets. To simplify the cross-validation step, following the idea in Yuan (2006), we approximate the leave-one-out loss with a generalized cross-validation (GCV) criterion, where $\hat{f}(x) = \hat{\beta}^\top x$ is the estimate based on all the data and the leave-one-out estimate is based on all but the $i$th datum. Following (2.5) of Yuan (2006), we replace the derivative $\partial \hat{f}/\partial y$ by a divided-difference approximation with step size $\kappa$. Regarding the choice of $\kappa$, as suggested in Yuan (2006), this parameter needs to be reasonably small so that it gives an accurate approximation of $\partial \hat{f}/\partial y$. However, based on our numerical experience, the results are not sensitive to the choice of $\kappa$ within a reasonable range of magnitudes; see Figure 1. The change of $c_0$ affects the value of the relative loss when $h$ is close to 0, but the loss varies with $h$ in the same pattern across the $c_0$'s; the selected $h$ is thus approximately the same for all $c_0$'s.
When changing $\kappa$ while fixing $c_0$, the effect of $\kappa$ is small; this is consistent with Figure 1, in which the curves are virtually identical.

Effect of the Number of Working Machines
In this experiment, we again generate $\{(X_i, T_i, C_i), i = 1, \ldots, N\}$ from Model 1. Our aim is to verify that different choices of $S$ make little difference as long as $S$ lies within a proper range. To examine this, we consider $S = 5, 10, 20$, and $40$ with $N = 2000$. Each setting is carried out with 300 repetitions, and the mean square errors and empirical coverage probabilities are evaluated.
Since Volgushev, Chao, and Cheng (2019) have verified the performance of distributed quantile regression for uncensored data, we focus on our distributed locally weighted censored quantile regression (dlcq) and provide the omni result, computed from the fully observed (uncensored) data, as one benchmark, and the naïve result, which directly uses $(X_i, Y_i)$ in quantile regression, as another; see Wang and Wang (2009) for descriptions of the individual estimators/methods. In addition, we also compare our results with the Portnoy (2003) estimates (crq) as a further benchmark.
The results for $N = 2000$ are summarized in Tables 1 and 2 for the mean square errors and the empirical coverage probabilities, respectively. Statistics for the benchmark methods and the running times are presented in Tables 3 and 4, respectively; the total running time can be estimated based on Remark 10. In short, while dlcq attains a good set of mean square errors and empirical coverage probabilities, its running time is comparable to the other methods when $S$ is chosen properly. As shown in Tables 1 and 2, neither the mean square errors nor the empirical coverage probabilities vary significantly when the number of machines is small, for all the methods and all choices of $h$. This verifies our theorems, which guarantee the performance of dlcq. On the other hand, as in Wang and Wang (2009), dlcq outperforms crq in both mean square error and empirical coverage. One may also observe from Table 3 that the coverage probabilities for the intercept and the coefficient using crq are 87.7% and 90.0%, both well below the nominal level of 95.0%.
To examine the performance of the proposed methodology in multivariate cases, we also conduct a simulation experiment with $(S, N) = (40, 20000)$ for Model 1, and with $N = 2000$ for multiple choices of $S$ in a new model in which $X_1, X_2, \ldots, X_5$ are independent $N(0, 1)$ random variables, the event time follows the linear model (10) with $b_i = i$ for $i = 1, \ldots, 5$ and $b_0 = 2$, and $C \sim \mathrm{Unif}(0, 25)$. The data have a censoring rate of around 20%. The results in Tables 5-7 also demonstrate the effectiveness of the proposed algorithm. In terms of computation time, it takes around 0.6 sec when $S = 5$ and 0.26 sec when $S = 10$ to execute the censored quantile regression for each subset of data. Our empirical experience suggests that the computational complexity increases linearly in the data dimension $p$.

Communication-Facilitated Method and Nonparametric Method
For the communication-facilitated method, the model we adopt is the same as in the previous simulation study. There are two bandwidth parameters that need tuning. The first is the one in the censored quantile regression: we set $h_0 = n^{-1/3}$ to evaluate the generalized cross-validation and select the best bandwidth from $2^{-10}$ to $2^2$. The other bandwidth parameter is set as $h_g = \max(N^{1/2}, n_g^{-2})\,c_g$ for $g = 1, \ldots, q$; using generalized cross-validation, we select $c_g$ from $\{0.1, 1, 10, 100\}$ and take the total number of iterations $q = 15$ to make the results stable enough. Model 1 is considered in this experiment. The results, presented in Table 8, are similar to those of distributed censored quantile regression. Another observation is that, for $S = 20$ and $40$, the initial estimate using one fold of data tends to be far from the true value; thus, the conditions in Theorem 9 are not satisfied and the algorithm diverges. When $S = 5$ or $10$, the algorithm converges in all 300 repetitions.
One possible way to address the above instability issue is to improve the accuracy of the initial guess. More specifically, one may run the vanilla censored quantile regression on several batches and aggregate the results as the initial guess. We conduct a small experiment to verify this, using the same model as above with $S = 40$. Different numbers of subsets are used for generating the initial guesses, and we count, out of 300 trials, the number of times the algorithm fails.
The results are summarized in Table 9: as the number of subsets used for the initialization increases, the proposed algorithm is less likely to diverge.
For the nonparametric estimate, we adopt the following model, Model 2, to compare the performance of distributed nonparametric censored quantile regression with its corresponding oracle. In Model 2, the data are simulated with $b_0 = 2$, $b_1 = 1$, and $\sigma(X) = 0.2 + 2(X - 0.5)$. The main difference between Model 1 and Model 2 is that the definition of the error term is modified so that $Q_\tau(x)$ is a nonlinear function of $x$ in Model 2. Since nonparametric censored quantile regression is a local approximation, it does not require $Q_\tau(x)$ to be a linear function of $x$.
Denote $\hat{Q}_\tau(x)$ as the estimated value of the $\tau$th quantile $Q_\tau(x)$ at $x$. We follow Kong and Xia (2017) and use $\sum_{i=1}^{20} |\hat{Q}_\tau(x_i) - Q_\tau(x_i)|/20$ as the performance measure, with $x_i = 0.025, 0.075, \ldots, 0.975$. Since there are 20 different values of $x$, we repeat the nonparametric quantile regression 300 times for each $x$. The results are summarized in Table 10. When $\tau = 0.5$, the loss when using all the data in the generalized cross-validation is approximately 0.0375 for all chosen $S$'s, whereas the corresponding statistics are close to 0.06 when only the uncensored data are used. In other words, using all the data in the generalized cross-validation leads to better performance than using the uncensored data alone.
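In code, the measure reads as follows, where Qhat and Qtrue are placeholders for the fitted and true quantile curves:

```r
## Performance measure of Kong and Xia (2017): mean absolute deviation of
## the estimated conditional quantile over a grid of 20 design points.
x_grid <- seq(0.025, 0.975, by = 0.05)
mad_grid <- mean(abs(Qhat(x_grid) - Qtrue(x_grid)))
```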

Data Analysis
We present three real data analyses to demonstrate the performance of distributed censored quantile regression. Our goal is to validate that the distributed procedure has little effect on the estimates in real data when the number of workers $S$ is properly chosen. The first dataset, studied in Aalen, Borgan, and Gjessing (2008), was collected from the Medical Birth Registry of Norway. We attempt to model how the mother's age at first birth affects the median of the time from the first to the second birth. In some studies, for instance, Patilea and Van Keilegom (2020), this relationship is not considered. We select the subset of the data in which the first child was still alive at the birth of the second child. There are a total of 53,296 samples, amongst which 69.63% are censored observations. The coefficient estimates and their corresponding standard errors are reported in Table 11.
When applying distributed censored quantile regression to this dataset, there is no significant change in the final estimate or its estimated variance across the different numbers of machines $S$. When taking $S = 1$, which corresponds to the original method proposed in Wang and Wang (2009), the computational burden is extremely heavy for this dataset, and the analysis cannot in fact be completed with a standard computing facility. In contrast, the proposed distributed censored quantile regression easily produces estimates within a reasonably short period of time.
The second dataset originated from the KKBox's Churn Prediction Challenge on Kaggle. The goal of the competition is to predict whether a user will churn after their subscription expires, that is, not renew the subscription within 30 days after expiration. Similar to Kvamme, Borgan, and Scheel (2019), we study the subscription length instead of making churn predictions. With the subscription length as the response, we define $\delta = 1$ if there is no renewal within 30 days after the last renewal, and $\delta = 0$ if the last renewal is within 30 days of March 2017, the last date of the available data. Since our goal is to justify the effectiveness of the proposed algorithm rather than to conduct feature engineering, we follow Kvamme, Borgan, and Scheel (2019) to extract the covariates.
After pre-processing the data, we obtain 25,676 samples with proper user IDs and age information, with a resulting censoring rate of 24.2%. Unlike Kvamme, Borgan, and Scheel (2019), which uses prediction performance as the evaluation metric, our goal is to quantify the covariate effects, so we remove attributes with high correlations to avoid multicollinearity. The coefficient estimates and their standard deviations are summarized in Table 12. For both choices of $S$, the considered attributes are all important.
The last experiment focuses on a study conducted at the University Clinical Center in Ljubljana (Pohar and Stare 2006) and serves as a comparison with Wang and Wang (2009). The details are postponed to the appendix as the sample size is small; briefly speaking, our results are similar to those of Wang and Wang (2009).

Conclusion
In this article, we introduce distributed algorithms for censored quantile regression. In particular, we propose two methods, namely a distributed framework and its communication-facilitated adaptation, which can be regarded as extensions of Volgushev, Chao, and Cheng (2019) and Chen, Liu, and Zhang (2019), respectively. We also provide theoretical justifications for adopting a distributed framework in nonparametric censored quantile regression in the sense of Kong and Xia (2017).
Besides establishing rigorous theoretical justifications, we also propose an improved version of cross-validation and adapt generalized cross-validation to improve the hyperparameter selection with censored observations. Our numerical results suggest that these adaptations facilitate the selection of a better bandwidth parameter and a reduction in the computational cost.
There are a number of possible future extensions that we find interesting. First, Volgushev, Chao, and Cheng (2019) considered a sequence of $\tau$'s instead of a single $\tau$. The corresponding setup cannot simply be extended to our model directly, as the weights for different quantile levels need to be calculated separately; accelerating the calculation of the generalized KM estimator over a sequence of $\tau$'s is an interesting topic for the future. Second, we notice a possible connection between our proposal and specific scenarios considered in federated learning (see, for instance, Andreux et al. 2020), in which each party holds local data and does not exchange them with other parties, and the data are not necessarily randomly allocated across the parties. Whether in the horizontal case, where different parties hold data for different users, or in the vertical case, where each party holds some attributes for all users instead of all attributes, the existing results in distributed learning may not extend straightforwardly to situations of this sort; it is also an interesting topic to develop suitable procedures accordingly. Finally, one may also consider extending the algorithms discussed in this article to high-dimensional data as well as to data collected under left- or interval-censoring mechanisms.

Figure 1. Distributed quantile regression with m-fold cross-validation and generalized cross-validation using $S = m = 10$ and $N = 2000$ for one set of data: the loss for different choices of $c_0$ and $\kappa$. Left: $\kappa = 0.1$. Right: $c_0 = 1$.

Table 4. Computation time for the individual methods for each subset of data.

Table 8. Mean square errors ($\times 10^{-4}$) and empirical coverage probabilities ($\times 10^{-2}$) of the communication-facilitated method for different numbers of machines $S$ with $N = 2000$.

Table 9. Number of failures when running GCV using the communication-facilitated method, out of 300 experiments with $S = 40$.