Rank-Based Greedy Model Averaging for High-Dimensional Survival Data

ABSTRACT Model averaging is an effective way to enhance prediction accuracy. However, most previous works focus on low-dimensional settings with completely observed responses. To attain an accurate prediction for the risk effect of survival data with high-dimensional predictors, we propose a novel method: rank-based greedy (RG) model averaging. Specifically, adopting the transformation model with splitting predictors as working models, we doubly use the smooth concordance index function to derive the candidate predictions and optimal model weights. The final prediction is achieved by weighted averaging all the candidates. Our approach is flexible, computationally efficient, and robust against model misspecification, as it neither requires the correctness of a joint model nor involves the estimation of the transformation function. We further adopt the greedy algorithm for high dimensions. Theoretically, we derive an asymptotic error bound for the optimal weights under some mild conditions. In addition, the summation of weights assigned to the correct candidate submodels is proven to approach one in probability when there are correct models included among the candidate submodels. Extensive numerical studies are carried out using both simulated and real datasets to show the proposed approach’s robust performance compared to the existing regularization approaches. Supplementary materials for this article are available online.


Introduction
High-dimensional survival data is ubiquitous in many scientific fields such as genomics, medicine, and ecology. One of the major goals of survival analysis is to predict risk quantities of clinical interest from a set of patients' predictors. However, the analysis of high-dimensional survival data is usually complicated and challenging because it is subject to right-censoring: the survival time is not fully observed.
To address the challenges of survival analysis with high-dimensional data, a large variety of variable selection approaches have been extended from completely observed data to censored data. These approaches include LASSO, MCP, and SCAD. They select a subset of predictors and construct a single optimal model to formulate predictions. Hence, the success of these approaches is built upon correct model specification. However, identifying the correct model is a challenge, especially for complex data such as high-dimensional survival data. As a remedy, model averaging, which incorporates information by weighted averaging across a set of candidate submodels, can effectively mitigate the risk of misspecification. Worked examples on real data have been well studied, including predicting core inflation (Kapetanios, Labhard, and Price 2008; Gospodinov and Maasoumi 2021), risk factors for strokes (Volinsky et al. 1997), and cross-section earnings (Hansen and Racine 2012).
CONTACT Li-Xing Zhu lzhu@bnu.edu.cn Center for Statistics and Data Science, Beijing Normal University at Zhuhai, Zhuhai, China. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.
There is a large and growing body of literature on model averaging. For example, Buckland, Burnham, and Augustin (1997) constructed model averaging weights based on AIC or BIC scores. Hoeting et al. (1999) conducted a comprehensive review of Bayesian model averaging, which assigns posterior model probabilities to candidate submodels. There is also a large body of literature on model averaging from the frequentist perspective (Hansen and Racine 2012; Zhu et al. 2019; Feng et al. 2021). Under a high-dimensional regime, Ando and Li (2014, 2017) proposed delete-one cross-validation model averaging for linear and generalized linear regression models. There are several approaches to semiparametric and nonparametric model averaging, including varying coefficient model averaging (Li et al. 2018; Zhu et al. 2019) and Bayesian approaches to semiparametric model averaging (Wang and Li 2018; Chen and Gerlach 2021). Li et al. (2020) made a semiparametric model averaging prediction for multiple categories, coupled with the AdaBoost algorithm.
Despite this remarkable progress, model averaging for survival analysis is far less understood in the high-dimensional setting. Under the low-dimensional regime, a few model averaging approaches have been extended to survival data, such as Bayesian model averaging (Volinsky et al. 1997) and model averaging based on a focused information criterion (Hjort and Claeskens 2006; Sun et al. 2017). These may not be directly applicable to high-dimensional survival data due to some strong conditions, such as the local root-n assumption for predictor parameter estimation. He et al. (2020) recently developed a semiparametric Cox-based cross-validation model averaging for high-dimensional survival data. However, it may be more sensible to consider models with fewer structural restrictions to capture the relationship between survival time and predictors. In this line, Song et al. (2007) proposed rank-based estimation for the transformation survival model with multiple predictors. This approach directly exploits the monotonicity between the survival time and the linear predictor effect without requiring any parametric assumptions on the form of the baseline hazard function or the error distribution. Song and Ma (2010) extended this work to a penalized version. Cheng, Wei, and Ying (1997) considered predicting survival probabilities for transformation models. Several approaches have been proposed for transformation models with uncensored outcomes, including parametric estimation (Cavanagh and Sherman 1998) and goodness-of-fit testing (Allison, Hušková, and Meintanis 2018; Kloodt, Neumeyer, and Van Keilegom 2021). In addition, Fan et al. (2020) studied the statistical properties of rank-based estimators in increasing dimensions for the transformation model when the predictor dimension is smaller than the sample size. This setting may not be applicable in many scientific fields, such as genomics, where the number of predictors is much higher than the sample size. Furthermore, the estimators tend to be unstable with high-dimensional predictors, particularly with a small sample size.
In this article, we propose a novel rank-based greedy model averaging approach that aggregates a set of transformation models for high-dimensional survival data. Specifically, we rank the predictors based on their correlations with survival time from high to low, and then group predictors with similar values to formulate working transformation models, so as to reduce the dimension. These procedures are similar to those in Ando and Li (2014), except that we do not screen out any predictors that are marginally less correlated with survival time, to avoid missing important predictors. Following Song et al. (2007), we derive predictions by maximizing the concordance index in each submodel; it is interpreted as the agreement between the predicted risk and the survival time. We further choose the optimal model weights via a penalized concordance index function. As direct optimization of the concordance index is computationally expensive, we provide a smooth surrogate by replacing the indicator function with the scaled sigmoid function. Because we do not remove any predictors while constructing the candidate models, the weight optimization is still high-dimensional. To address this issue, a greedy algorithm (Frank and Wolfe 1956) is adopted. Theoretically, we explicitly express the error bound for the optimal weights in terms of the sparsity of submodels, the number of submodels, and the sample size, and establish that the summation of weights assigned to correct submodels goes to one in probability.
Compared with existing works, our proposed RG model averaging approach has several advantages. First, unlike traditional model averaging, the proposed RG approach combines multiple transformation models, covering a variety of interesting survival models, and thus enjoys great generality. Second, to the best of our knowledge, our RG approach is the first to incorporate the concordance index into model averaging. The procedure doubly exploits the concordance index in a natural and reasonable fashion and thus may provide more flexible predictive inference than parametric model averaging. Lastly, approximating the indicator function with the scaled sigmoid function is computationally efficient; it requires neither bandwidth nor knot selection. The greedy algorithm further improves our approach's performance, working well to balance prediction accuracy and model complexity.
The remainder of the article is organized as follows. In Section 2, we introduce rank-based learning for the transformation model and the proposed RG model averaging for handling high-dimensional survival data. A smoothed surrogate function for the concordance index and a greedy algorithm are provided in Section 3. We derive the error bound for the optimal weights and establish that the summation of correctly specified model weights approaches one in Section 4. We conduct simulation studies in Section 5 to assess the proposed approach's performance and make comparisons with the existing regularization approaches and Cox-based model averaging. Application of our approach is further illustrated with the lung cancer dataset in Section 6, and Section 7 concludes with remarks. Technical proofs, additional simulation results and discussions, and another application using an ovarian cancer dataset are provided in the supplementary materials.

Problem Setting and Method
Let T denote the failure time and C denote the censoring time. Let Z = (Z_1, ..., Z_p)^T be a p-vector of predictors, X = T ∧ C be the observed time, and Δ = I(T ≤ C) be the censoring indicator, where a ∧ b is the minimum of a and b, and I(·) is the indicator function. Assume that T and C are conditionally independent given Z. In practice, when the true data generating mechanism is very complicated and the model form cannot be easily decided, one usually has to adopt working models to make a prediction. To achieve greater flexibility, we specify the working model as the transformation model

h(T) = Z^T β + ε,

where h(·) is an unspecified monotone increasing function, ε is the random error with an unknown distribution, and β is a p-vector of unknown regression parameters. It is worth noting that the function h depends on Z only through a linear combination. This model includes many popular models as special cases, such as the proportional hazards model, the proportional odds model, and the accelerated failure time model. Without censoring, it also includes econometric models such as the binary choice model and the ordered discrete response model.
In practice, it is often critical to predict the risk for a specific subject with predictor z. The predictors represent potential risks that may shorten or extend the lifetime, and we formally define the risk effect as z^T β. To make a prediction, it is natural to first estimate β and then plug the estimate into the regression model. For i = 1, ..., n, let (X_i, Δ_i, Z_i) be independent and identically distributed copies of (X, Δ, Z). Khan and Tamer (2007) proposed a partial rank approach based on the concordance index between the predicted risk and the survival time. This approach does not require specification of the exact form of h or the distribution of ε. Formally, the concordance index function is given by

C(β) = {n(n − 1)}^{−1} Σ_{i≠j} Δ_j I(X_i > X_j) I(Z_i^T β > Z_j^T β).

Thus, the rank-based estimator has the form β̂ = arg max_β C(β).
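As a concrete illustration, here is a minimal Python sketch of the rank-based objective above. The paper provides no code, so the function and variable names are ours:

```python
import numpy as np

def concordance_objective(beta, X, delta, Z):
    """Rank-based objective C(beta): over ordered pairs (i, j) with j an
    observed event (delta[j] == 1) and X_i > X_j, count agreements between
    the ranks of the predicted risks Z_i^T beta and Z_j^T beta, normalized
    by n(n-1).  A sketch of the partial-rank objective; names are ours."""
    n = len(X)
    risk = Z @ np.asarray(beta, float)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j and delta[j] == 1 and X[i] > X[j]:
                total += risk[i] > risk[j]
    return total / (n * (n - 1))
```

A perfectly concordant risk ordering attains the maximal value of the objective among all comparable pairs; the estimator picks the beta maximizing this quantity.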
Note that C(β) = C(γβ) for any constant γ > 0. To avoid an identifiability problem and without loss of generality, motivated by Song et al. (2007), the first element of β is restricted to be in {−1, 1}. Many well-known variable selection approaches such as LASSO, MCP, and SCAD have been developed for use when the dimension p of the predictors Z is larger than the sample size n. They identify informative predictors and then formulate an optimal single model based on the selected predictors for further prediction. However, they may miss predictors that are almost as good as those selected, and they ignore model uncertainty. As a viable alternative, we propose a model averaging approach to predict the risk effect for high-dimensional survival data. The idea is to combine several candidate submodels by assigning optimal weights, which are determined by a well-defined criterion. Overall, it has the potential to enhance prediction accuracy.
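The scale normalization just described can be carried out mechanically; a small sketch (our own naming):

```python
import numpy as np

def normalize_beta(beta):
    """Scale normalization used for identifiability: since C(beta) is
    invariant to positive rescaling, divide by |beta[0]| so that the
    first coordinate lies in {-1, 1}."""
    b = np.asarray(beta, float)
    if b[0] == 0:
        raise ValueError("first coordinate must be nonzero to normalize")
    return b / abs(b[0])
```

Any two parameter vectors that are positive multiples of each other map to the same normalized representative, which is what makes the rank-based estimator well defined.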

Candidate Predictions
To emphasize that p depends on n, we rewrite it as p_n. For simplicity, we use [p_n] to denote the set {1, ..., p_n}. Ando and Li (2014, 2017) proposed a novel model averaging approach that entails ranking predictors and screening out less important ones to reduce dimensionality. In a similar spirit, we construct a series of low-dimensional candidate submodels; however, we do so without removing any predictors, as choosing the cut point for determining the number of predictors to retain is often subjective. To be more specific, we quantify the marginal importance of each predictor for the survival time based on the L_q-norm introduced by Hong et al. (2020). We then rank the p_n predictors from high to low based on their marginal importance and partition the ordered predictors into K_n nonempty sets, where K_n is some positive integer depending on n. The K_n nonempty sets are denoted as {A_k : k = 1, ..., K_n}. Furthermore, |A_k|, the cardinality of A_k, is assumed to be much smaller than n for k = 1, ..., K_n. For a p_n-dimensional vector a = (a_1, ..., a_{p_n})^T, let a_(k) denote the subvector of a indexed by the set A_k. Consequently, based on these index subsets A_k, we can construct K_n candidate submodels. Formally, the kth candidate submodel is given by

h(T) = Z_(k)^T β_k + ε_(k),

where β_k is an |A_k|-vector of unknown regression parameters and ε_(k) is the random error for the kth candidate submodel. The concordance index function for the kth candidate submodel is

C_k(β_k) = {n(n − 1)}^{−1} Σ_{i≠j} Δ_j I(X_i > X_j) I(Z_{i(k)}^T β_k > Z_{j(k)}^T β_k),

where Z_{i(k)} is the subvector of Z_i consisting of the coordinates specified by the set A_k. The corresponding rank-based estimator is given by β̂_k = arg max_{β_k} C_k(β_k). Based on the kth candidate submodel, the prediction of the risk associated with predictor z can be written as Q̂_k = z_(k)^T β̂_k.
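The ranking-and-grouping step can be sketched as follows. For simplicity, we use the absolute marginal correlation with the observed time as a stand-in for Hong et al.'s (2020) model-free importance measure, so the scores below are an assumption, not the paper's screening statistic:

```python
import numpy as np

def build_candidate_sets(Z, X, d):
    """Rank predictors by a marginal importance proxy (|correlation| with
    the observed time; the paper uses Hong et al.'s L_q-norm screening)
    and split the ordered indices into groups A_k of size d."""
    p = Z.shape[1]
    score = np.abs([np.corrcoef(Z[:, j], X)[0, 1] for j in range(p)])
    order = np.argsort(-score)          # high importance first
    return [order[k:k + d] for k in range(0, p, d)]
```

Every predictor ends up in exactly one group, so no predictor is screened out; only the ordering and the group size d are data-driven choices.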

Optimal Weights
Let ω = (ω_1, ..., ω_{K_n})^T be a weight vector in the simplex Δ_n = {ω : ω_k ≥ 0, Σ_{k=1}^{K_n} ω_k = 1}; then the weighted prediction is given by Σ_{k=1}^{K_n} ω_k Q̂_k. Denote the risk vector for the ith observation as R̂_i = (R̂_{i1}, ..., R̂_{iK_n})^T, where R̂_{ik} = Z_{i(k)}^T β̂_k. Considering the form of the joint model, we propose a rank-based criterion to derive the optimal weights,

L_n(ω) = {n(n − 1)}^{−1} Σ_{i≠j} Δ_j I(X_i > X_j) I(R̂_i^T ω > R̂_j^T ω).

Typically, the number of submodels becomes larger when the predictor dimension is higher, so we introduce a penalty to determine the model weights. As different models offer different information, inspired by Dai, Rigollet, and Zhang (2012), we adopt the entropy penalty

K(ω, π) = Σ_{k=1}^{K_n} ω_k log(ω_k / π_k),

where π ∈ Δ_n is a given prior. Intuitively, a submodel is penalized less when the prior information assigned to it is believed with more confidence. Overall, the optimal weights are given by ω̂ = arg max_{ω ∈ Δ_n} {L_n(ω) − λ K(ω, π)}, where λ > 0 is the tuning parameter. It should be noted that we impose the constraint Σ_{k=1}^{K_n} ω_k = 1, which has been adopted by most existing works, but not by Ando and Li (2014). With this constraint, K_n is allowed to diverge at the rate of the exponential of n, which will be discussed thoroughly in Section 4. When preparing the candidate predictions in the proposed model averaging approach, we do not need to specify the transformation function, as the concordance index depends only on the rank agreement between the survival times and the predicted risk effects; hence, the approach is robust against model misspecification at this first stage. When determining the model weights, we again do not need to specify the transformation function, so the approach is robust against model misspecification at the second stage as well. We therefore say that the approach is doubly robust against model misspecification. The discontinuity of the indicator function is computationally inconvenient, and we approximate it using a smooth surrogate in the next section.
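The penalized weight criterion can be evaluated as below. The Kullback-Leibler form of the entropy penalty is our reading of Dai, Rigollet, and Zhang (2012), and `cindex_fn` stands in for the rank-based criterion L_n; both names are ours:

```python
import numpy as np

def penalized_objective(w, pi, cindex_fn, lam):
    """Weight criterion sketch: concordance of the averaged risk, given by
    cindex_fn(w), minus lam times the entropy penalty K(w, pi), taken here
    as the KL divergence of the weights w from the prior pi."""
    w = np.asarray(w, float)
    pi = np.asarray(pi, float)
    kl = np.sum(np.where(w > 0, w * np.log(w / pi), 0.0))
    return cindex_fn(w) - lam * kl
```

With the noninformative equal prior, any w equal to pi incurs zero penalty, which matches the intuition that submodels given more prior confidence are penalized less.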

Smoothed Greedy Model Averaging
To tackle the computational challenge caused by the discontinuity of the indicator function, a smooth sigmoid function is used for approximation. For a more flexible approximation, we introduce a sequence of strictly positive, decreasing numbers σ_n satisfying lim_{n→∞} σ_n = 0 and use g_n(x) = g(x/σ_n), where g(x) = 1/(1 + e^{−x}), to approximate the indicator I(x > 0). The corresponding smoothed estimator β̃_k for β_k maximizes the smoothed concordance index (3.1), obtained by replacing the indicators of the risk-effect differences in C_k with g_n. Further, the smoothed surrogate (3.2) of (2.2) is constructed in the same way, and the rank-based weight estimator ω̂ is formulated as its maximizer, where R̃_j is the counterpart of R̂_j with β̂_k replaced by β̃_k.
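The smoothing step is essentially a one-liner; a sketch assuming the standard sigmoid g(x) = 1/(1 + e^{−x}):

```python
import numpy as np

def smooth_indicator(x, sigma_n):
    """Scaled sigmoid g_n(x) = g(x / sigma_n), with g the standard
    sigmoid, approximating the indicator I(x > 0) as sigma_n -> 0."""
    return 1.0 / (1.0 + np.exp(-x / sigma_n))
```

As sigma_n shrinks, the sigmoid steepens around 0 and the approximation to the indicator tightens, at the price of a less smooth objective surface.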
As C̃_k is continuously differentiable and low-dimensional, (3.1) can be maximized with respect to β_k using the Newton-Raphson algorithm or other nonlinear solvers; quite a few R packages also serve this purpose. In this article, we employ the "lbfgs" function in the "lbfgs" package. However, this cannot be applied to maximizing (3.2): intuitively, more candidate submodels need to be considered as the predictor dimension grows, so this is a high-dimensional optimization with a box constraint. To tackle this problem, we develop a greedy model averaging algorithm. The greedy algorithm has regained popularity in recent years due to its simplicity, effectiveness, and theoretical guarantees. It has been applied to classification (Mannor, Meir, and Zhang 2003), feature selection (Zhang 2009; Tsamardinos et al. 2019), and factor matrix estimation (Zhang et al. 2019). It addresses optimization over the convex hull of a basis set. The K_n-simplex Δ_n is commonly parameterized by the canonical basis {e_k}, where e_k is the kth vector of the canonical basis of R^{K_n}. The selection of a particular basis set allows for the incorporation of structures such as sparsity and nonnegativity (of the basis) into the solution. We summarize the proposed greedy estimation procedure in Algorithm 1.
Algorithm 1: Greedy algorithm for estimating the weight vector ω.
Initialize ω^(0) ∈ Δ_n. For ℓ = 0, 1, 2, ..., perform the greedy coordinate update until convergence. In the above algorithm, ‖·‖_∞ denotes the infinity norm, κ = 0.001, and the initial value can be chosen as 0 or e_1. The key feature of the algorithm is that, at each step, it greedily adds the best coordinate to the current representation. The iterate is then updated as a linear or convex combination of the previous iterates and the newly obtained atom (basis vector). Hence, the greedy algorithm results in at most ℓ + 1 nonzero components after ℓ iterations. The algorithm thus explicitly controls the number of submodels included and strikes a balance between prediction accuracy and model complexity.
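A minimal Frank-Wolfe style sketch of the greedy update follows. The classical step size 2/(ℓ + 2) and the stopping rule by a fixed step budget are our choices; Algorithm 1's exact step and stopping rules may differ:

```python
import numpy as np

def greedy_weights(grad_F, K, steps=50):
    """Frank-Wolfe style greedy update over the K-simplex: at iteration l,
    move toward the vertex e_k with the largest partial derivative of the
    objective F.  grad_F maps a weight vector to its K-vector gradient."""
    w = np.zeros(K)
    w[0] = 1.0                           # start at a vertex (e_1)
    for l in range(steps):
        k = int(np.argmax(grad_F(w)))    # best coordinate (new atom)
        gamma = 2.0 / (l + 2.0)          # classical Frank-Wolfe step size
        w = (1 - gamma) * w              # shrink previous iterate
        w[k] += gamma                    # mix in the chosen vertex
    return w
```

Each iteration touches at most one new coordinate, so the iterate stays sparse and always remains on the simplex, which is exactly the sparsity-control property discussed above.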
For a new observation with predictor z, the predicted risk under the kth candidate submodel can be obtained by Q̃_k = z_(k)^T β̃_k. Note that all the submodels share the same transformation function; in other words, the candidate risk effects are associated with the survival time via the same transformation function h(·). Hence, one can predict the final risk using Σ_{k=1}^{K_n} ω̂_k Q̃_k with the optimal weights ω̂. The concordance index is a widely used measure for evaluating the predictive accuracy of censored data regression; see, for example, Gönen and Heller (2005). The appeal of this formulation is that it is not necessary to specify the transformation function and the error distribution as long as the monotonicity property is satisfied. Consequently, however, we cannot predict the survival probability; this is the cost of implementing our proposed approach. Fortunately, all the risk quantities (e.g., survival probability and relative risk) involve the risk effect z^T β. In addition, predicting the risk effect is of great practical value for a large number of medical studies, especially those focusing on the evaluation of specific covariates' effects.
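Combining the candidate predictions is then a weighted sum; a sketch with hypothetical inputs (all names are ours):

```python
import numpy as np

def averaged_risk(z, groups, betas, w):
    """Final prediction sum_k w_k * z_(k)^T beta_k: each submodel scores
    its own subvector of z, and the scores are combined with weights w."""
    return sum(w[k] * z[groups[k]] @ betas[k] for k in range(len(groups)))
```

Because every submodel shares the same transformation function, the weighted sum of the candidate risk effects is itself a risk effect on the same scale.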

Theoretical Results
Before presenting the theoretical properties, we define a correctly specified model, one that includes all the predictors truly relevant to survival, as a correct model. Assume that β̃_k converges in probability to β*_k. When the kth candidate submodel is correct, β*_k is the true parameter value. When the kth candidate submodel is misspecified, β*_k is the value closest to the true value in some metric. For example, Struthers and Kalbfleisch (1986) defined β*_k as the solution of the population score function in the censoring case, and Le and Clarke (2022) defined β*_k as the minimizer of the relative entropy distance between the true distribution and the empirical distribution in the noncensoring case. Denote R*_i = (Z_{i(1)}^T β*_1, ..., Z_{i(K_n)}^T β*_{K_n})^T. To present the theoretical properties, we define the population risk function as

L(ω) = E[Δ_2 I(X_1 > X_2) I(R*_1^T ω > R*_2^T ω)].

The optimal weight vector is ω* = arg max_{ω ∈ Δ_n} {L(ω) − λ K(ω, π)}. We first derive an error bound for the ℓ_2-norm of the difference between the empirical and theoretical optimal weights, that is, ‖ω̂ − ω*‖_2. This involves bounding the supremum norm of the gradient function and establishing a cone condition. Before presenting the error bound, we give some conditions below.
C1 There exists a constant c_0 > 0 such that max_{1≤j≤p_n} |Z_j| ≤ c_0 almost surely.
C2 For each k = 1, ..., K_n, β̃_k converges in probability to β*_k.
Condition C1 concerns the boundedness of the predictors, which is assumed by Fang, Yang, and Liu (2017) and holds in most real applications. Condition C2 is a simple convergence property of each candidate model's parameter estimators. Indeed, Song et al. (2007) rigorously established this result if σ_n → 0 under certain regularity conditions on the predictors and the loss function, when the candidate submodel is correct. When the model form is misspecified or ignores some important predictors, limiting values also exist, and the estimators have asymptotic normality properties that may be proved following procedures similar to those given in Song et al. (2007). Here, for simplicity, we state the result as an assumption in general cases. Define the index set of active models as S* = {k : ω*_k ≠ 0}, with cardinality s_0 = |S*|. Lemma 1 then bounds, with probability at least 1 − δ, the magnitude of the gradient of the loss at the optimum point ω*. As in high-dimensional variable selection (Bickel, Ritov, and Tsybakov 2009), we derive a cone condition that applies to the difference between ω̂ and ω*. The following lemma shows that as long as λ is sufficiently large, ω̂ − ω* is guaranteed to lie in a restricted cone related to the priors.
Javanmard and Montanari (2014) showed a cone condition of this type with the constant 3, and Bühlmann and van de Geer (2011, sec. 6.13) used a more general definition replacing 3 with a constant L. We restrict ourselves to the case L = 1 + 2π_u/π_l, since the penalty is related to the prior. We further assume additional conditions to establish the error bound.
C3 Let Υ_n be the set [−1, 1]^{K_n}. For any q_1, q_2 ∈ Υ_n, there exists a constant M_1 bounding the corresponding maximum over the risk design matrix evaluated at q = q_1 − q_2.
Condition C3 places a constraint on the eigenvalues of the risk design matrix. This is similar to Assumption 3 of Bickel, Ritov, and Tsybakov (2009) and Condition A3 of Peng, Wang, and Wu (2016) for analyzing sparse least squares regression and the support vector machine, respectively. Condition C4 ensures that H(ω*) satisfies the compatibility condition (Javanmard and Montanari 2014) for the set S* with constant M_2. Based on Lemmas 1 and 2, we can obtain an error bound depending on n, K_n, Ā_u, s_0, π_u, and π_l.
Theorem 1 states that the bound on ‖ω̂ − ω*‖_2 holds with probability at least 1 − δ. When λ is of the order Ā_u (log K_n / n)^{1/2}, in the sense that the corresponding ratio tends to a constant λ_0 > 0, the upper bound scales as σ_n^{−2} s_0^{9/4} Ā_u^2 (log K_n / n)^{1/2}. This indicates that the weight estimator ω̂ approaches the optimal weight ω* at the rate σ_n^{−2} s_0^{9/4} Ā_u^2 (log K_n / n)^{1/2}. In the case where Ā_u is fixed and σ_n, s_0, and K_n have the orders n^{−ϑ}, n^{δ}, and e^{n^η}, respectively, ‖ω̂ − ω*‖_2 → 0 in probability if 8ϑ + 9δ + 2η < 2. Ando and Li (2014) pointed out that the model size K_n is only allowed to diverge at the rate of a polynomial of n, but not faster, whereas with our approach the order of K_n can be as large as the exponential of n. This is a by-product of restricting the summation of weights to 1. The proof of Theorem 1 relies on two essential steps, namely the bound for the supremum norm of S_n(ω*) and the cone condition, which are derived in Lemmas 1 and 2, respectively.
Next, we consider the case where there are correct models included among the candidate submodels. Let M be the subset of {1, ..., K_n} consisting of all the correct models. To prove the following weight selection consistency, we assume some additional conditions.
C5 The random error ε_i is distributed independently of R*_i and C_i.
C6 The set {R* ∈ S_R : P(Δ_i = 1 | R*_i = R*) > 0} has positive measure and is not contained in any proper linear subspace of R^{K_n}, where S_R is the support of R*_i.
The independence condition C5 requires that ε_i is independent of both R*_i and C_i. Condition C6 requires that the probability of censoring is not equal to one for all R*. Both are relatively mild; they were made for modeling censored survival data and are close to their counterparts, namely Assumptions I1 and I3 of Khan and Tamer (2007), respectively. Condition C7 is a key assumption guaranteeing the consistency of weight selection when there are correct models among the submodels. It imposes a relationship between s_0, Ā_u, K_n, and n.
Theorem 2 shows that all the weights are asymptotically assigned to the correct models if there are correct models among the candidate submodels. Fang, Li, and Xia (2021) also provided a similar weight consistency in the presence of correct models. It can be viewed as a kind of model selection consistency: we can aggregate all the correct submodels and ignore the misspecified submodels to make a prediction when correct submodels exist. We further establish the convergence of the greedy algorithm in the following theorem. Denote F(·) = L_n(·) − λ K(·, π). We assume that there exists a parameter ν_n such that, for all ω, ω† ∈ Δ_n,

F(ω†) ≥ F(ω) + ∇F(ω)^T (ω† − ω) − (ν_n / 2) ‖ω† − ω‖_2^2.

This is similar to the restricted smoothness property (Loh and Wainwright 2013).
Theorem 3. For the ℓth iteration in Algorithm 1, the suboptimality F(ω*) − F(ω^(ℓ)) is bounded by a term depending on ν_n, the diameter of Δ_n, and ℓ. Theorem 3 shows that the approximation error of the greedy algorithm depends on the smoothness parameter ν_n, the bound of the constrained set Δ_n, and the number of greedy steps. It indicates that the suboptimal solution approximates the optimal sparse solution, with a linear rate in terms of the iteration steps. As the number of greedy steps goes to infinity, the second term on the right-hand side of the inequality vanishes. Proofs of Lemmas 1-2 and Theorems 1-3 are provided in the supplementary materials.

Simulation Studies
In this section, we assess our approach's performance using various simulations. To make a comparison with existing competitors, we consider the following model-based prediction approaches for all simulated datasets: (a) Cox-LASSO, (b) Cox-MCP, (c) Cox-SCAD, (d) AFT-LASSO, (e) AFT-MCP, and (f) AFT-SCAD. The variable selection approaches are all executed with the R package "ncvreg." We also consider maximizing the penalized smooth concordance index (PSCI) and take LASSO as an example, since it attains the smallest prediction error compared with the other penalties in our settings. In addition, we make a comparison with the Cox-based ECV model averaging approach proposed by He et al. (2020). To demonstrate the robustness of the proposed approach, we consider three classes of transformation models: (a) Case I: a Cox model with baseline hazard function λ(t) = (t − 0.5)^2; (b) Case II: an AFT model with ε_i generated from a normal distribution with mean 0 and standard deviation 0.5; (c) Case III: a model that differs from Case II only in terms of the transformation function. Here, the high-dimensional predictor Z_i = (Z_{i1}, ..., Z_{ip_n})^T follows a p_n-dimensional normal distribution with mean 0 and covariance matrix Σ = (0.5^{|j−j′|}) for j, j′ = 1, ..., p_n. The sample sizes are n = 100 and 200, coupled with the predictor dimensions p_n = 1000 and 2000. In Case I, we set β_1 = −1, β_j = −0.2 for j = 2, ..., 15, and the others as 0. For Cases II and III, the coefficient values are the negatives of those in Case I. The censoring time is C_i = C̃_i ∧ τ, where C̃_i is generated from an exponential distribution with rate 0.12. The study duration τ is chosen to yield a censoring rate of 20%. Case III is neither a Cox model nor an AFT model; thus, the model-based approaches we consider all suffer from misspecification.
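A sketch of the Case II data-generating mechanism follows. The truncation at the study duration τ is omitted for brevity, and all function and parameter names are ours:

```python
import numpy as np

def simulate_aft(n, p, beta, rate=0.12, seed=0):
    """Case II sketch: AR(1) predictors with Sigma_{jj'} = 0.5^|j-j'|,
    log-scale AFT outcome log T = Z beta + eps with eps ~ N(0, 0.5^2),
    and exponential censoring with the stated rate (truncation at the
    study end tau is omitted here)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    T = np.exp(Z @ beta + rng.normal(0.0, 0.5, n))    # failure times
    C = rng.exponential(1.0 / rate, n)                 # censoring times
    X = np.minimum(T, C)                               # observed times
    delta = (T <= C).astype(int)                       # event indicator
    return X, delta, Z
```

The log link makes this an AFT model; replacing it with another monotone transformation of T yields the other cases within the same transformation-model family.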
Before implementing the proposed model averaging approach, we formulate a general means to construct the candidate submodels by partitioning the indices of the predictors [p_n]. Given that Hong et al.'s (2020) screening approach is model-free and easily implemented, we adopt it to rank the predictors by importance. Following "MCV1" of Ando and Li (2014), we fix the submodel predictor size as d and then group every 10 or 20 predictors to formulate a candidate submodel. This leads to a total of K_n = 100 or 50 candidate submodels for p_n = 1000, and K_n = 200 or 100 for p_n = 2000. In practice, we could choose an optimal model size K_n with a cross-validation procedure, but we do not pursue this, for simplicity and time-saving. To calibrate the importance of the candidate submodels, we choose one predictor as the "anchor" and restrict its estimated coefficient to 1; without loss of generality, we pick the first-ranked predictor for calibration. Thus, each submodel except the first contains 11 or 21 predictors. In the initial step, three types of priors π for the candidate submodels are adopted. The first is the equal constant prior, wherein each component of π is 1/K_n; we refer to this approach as ERG. Obviously, this noninformative prior cancels out the regularization term, freeing our approach of the tuning parameter λ. The second is the uniform prior (URG), for which we independently generate K_n uniform variates over (0, 1) and set each component of π to a uniform variate divided by the sum of the K_n variates. Clearly, such a uniform prior is also noninformative. The third is an informative prior, for which we sort the uniform variates in descending order (SRG) and assign more prior weight to the more effective submodels based on the ranking. The tuning parameters for the model averaging approaches associated with the uniform prior and the sorted uniform prior, as well as for PSCI, are chosen via a 3-fold cross-validation procedure. We set the smoothing parameter as σ_n = 0.01.
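The three priors can be generated as follows (a sketch; the explicit seeding is our addition):

```python
import numpy as np

def make_prior(K, kind, seed=0):
    """Three priors from the simulations: equal 1/K components (ERG),
    normalized uniform draws (URG), and the same draws sorted descending
    so higher-ranked submodels receive more prior mass (SRG)."""
    if kind == "equal":
        return np.full(K, 1.0 / K)
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=K)
    pi = u / u.sum()
    return np.sort(pi)[::-1] if kind == "sorted" else pi
```

All three priors live on the simplex, so each is a valid argument for the entropy penalty; only the sorted variant injects information about the submodel ranking.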
To examine the proposed approaches' prediction performance, we evaluate the risk effect for a subject with predictors drawn from a p_n-dimensional normal distribution with mean 0 and covariance matrix Σ. We also generate a testing dataset (X_{i,t}, Δ_{i,t}, Z_{i,t}) of sample size n_t = 200 to predict the concordance index. We calculate the squared errors for the concordance index and the risk effect, CI_se and RE_se, defined as the squared differences between the predicted and true values of the concordance index and of the risk effect, respectively. We report the mean (Mean) and the standard deviation (SD) of CI_se and RE_se over 100 replications. Table 1 summarizes the results for Case I. We can see that the proposed approaches outperform the competing approaches in predicting the concordance index. This is unsurprising, as the prediction procedure doubly maximizes the smooth concordance index. In terms of predicting the risk effect, although our proposed RG approaches perform worse than the ECV model averaging when p_n = 2000, they show some superiority, with lower means, when p_n = 1000. They also attain lower means than the Cox-based regularization approaches with p_n = 1000 and d = 10. The regularization approaches for the AFT model are much worse than those for the Cox model and the proposed RG approaches, as the specified model is not correct. The performance of PSCI is also significantly inferior to that of our approaches in most cases; specifically, its performance is sensitive to the sample size and improves dramatically as the sample size increases. We next turn to Table 2. Although the performance of the proposed RG approach with different priors is inferior to that of the AFT-based (correct) regularization approaches, it is still much better than that of the Cox-based regularization approaches, ECV model averaging, and PSCI. These findings indicate that the performance of our proposed approaches falls between that of the Cox- and AFT-based regularization approaches in most cases and is close to the optimal level, or sometimes even the best.
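One plausible reading of the two criteria as code; the paper's display equations for CI_se and RE_se are lost in this extraction, so the exact definitions below are assumptions:

```python
import numpy as np

def squared_errors(pred_risk, true_risk, pred_ci, true_ci):
    """Assumed reading of the two criteria: RE_se as the mean squared
    error between predicted and true risk effects over test subjects,
    and CI_se as the squared gap between predicted and true concordance
    indices.  The paper's exact (omitted) definitions may differ."""
    pred_risk = np.asarray(pred_risk, float)
    true_risk = np.asarray(true_risk, float)
    re_se = float(np.mean((pred_risk - true_risk) ** 2))
    ci_se = float((pred_ci - true_ci) ** 2)
    return re_se, ci_se
```

Both quantities are averaged over the 100 Monte Carlo replications to produce the Mean and SD columns reported in the tables.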
We further examine performance in Case III, where the underlying model is neither a Cox model nor an AFT model. As shown in Table 3, the model-based regularization approaches perform similarly to one another, but our approaches outperform them all, mainly because the specified models are wrong. ECV model averaging is also inferior to the proposed model averaging. These results further demonstrate the superiority of the proposed model averaging when all the submodels are misspecified. The results in Tables 1-3 also indicate that an informative prior (SRG) can improve the prediction quality of the model averaging approach, while a wrong prior (URG) may introduce side effects, inflating the means or SDs.
The candidate submodels in the above scenarios are all misspecified, as some important predictors are missing. By contrast, we now consider a situation in which the correct model is among the candidate submodels in Cases I-III, with sample sizes n = 100 and 200 and p_n = 1000. To construct the candidate submodels, the screening approach is applied only to the p_n − s nonsignificant predictors to obtain the ordered predictors. The first submodel is formulated with all 15 significant predictors for d = 10, and with the top 5 ranked predictors together with the 15 significant predictors for d = 20. The remaining submodels are constructed in a manner similar to that used before. As shown in Table 4, the performances improve once there is a correct model among the candidate submodels. The other findings coincide with the results described above. We also report the average of Σ_{k∈M*} ω_k, that is, ω_1. There is an obvious trend of the average ω_1 approaching 1 as the sample size increases. For example, in Case I with d = 10 and ERG, it increases dramatically from 0.6205 to 0.849 as the sample size grows from 100 to 200. These phenomena corroborate the weight selection consistency proved in Theorem 2.
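As a toy illustration of the reported quantity, the sketch below (hypothetical weights and synthetic predictions, not the paper's estimates) forms the model-averaged prediction and the weight mass placed on the correct submodels:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_t = 5, 200                         # submodels and test subjects (illustrative)
preds = rng.normal(size=(K, n_t))       # candidate risk-effect predictions

# hypothetical simplex weights, e.g. produced by the greedy weight search
weights = np.array([0.62, 0.20, 0.10, 0.05, 0.03])

final_pred = weights @ preds            # weighted average across submodels

correct = [0]                           # indices of correct submodels (assumed)
omega_1 = weights[correct].sum()        # the reported weight mass on correct models
```

Weight selection consistency says omega_1 should tend to 1 as n grows when a correct submodel is present.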
In the supplementary materials, we additionally (a) plot the model weights when all the submodels are misspecified with an equal prior and compare the running times of all the considered approaches, and (b) discuss the impact of the priors and the parameters λ and σ_n on the proposed approaches' performance. In summary, our proposed approaches predict well. When the actual model is purely a Cox model or an AFT model, they perform close to the optimal penalized approach or even better. When the Cox or AFT modeling structure is violated, they outperform the model-based regularization approaches. When the correct model is included among the candidate submodels, the proposed RG approach with different priors shows improved performance and outperforms its competitors in nearly all cases, except for Case II with n = 200 and p_n = 1000. Compared with PSCI, the main advantage of our approaches is that they are computationally efficient and robust, particularly with a small sample size.

Real Example
In this section, we demonstrate the predictive power of the proposed model averaging on a lung cancer dataset. We compare the proposed RG approach with the same competitors used in the simulation study, namely Cox-LASSO, Cox-MCP, Cox-SCAD, AFT-LASSO, AFT-MCP, AFT-SCAD, PSCI, and ECV model averaging, in terms of prediction power and time efficiency.
Lung cancer is one of the most malignant tumors, causing over 1,000,000 deaths each year worldwide (Ma et al. 2020). A principal subtype of lung cancer is lung adenocarcinoma (LUAD). Predicting LUAD risk is crucial for determining further treatment strategies. We downloaded the data from the Genomic Data Commons (GDC) Portal https://portal.gdc.cancer.gov. The full TCGA-LUAD dataset includes 18,176 genes and 508 samples, each with a survival time and survival status. During the study period, 183 patients died of lung cancer and the other 325 were censored, a censoring rate of 0.64. The Kaplan-Meier curve, with its 95% confidence interval, is plotted in Figure 1.
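A curve like that of Figure 1 is built from the (time, status) pairs alone; a minimal sketch of the Kaplan-Meier estimator on toy data (not the TCGA-LUAD data) is:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate S(t) at each distinct event time:
    S(t) = prod over event times u <= t of (1 - deaths(u) / at_risk(u))."""
    time, event = np.asarray(time), np.asarray(event)
    order = np.argsort(time)
    time, event = time[order], event[order]
    times, surv, s = [], [], 1.0
    for u in np.unique(time[event == 1]):
        at_risk = np.sum(time >= u)               # subjects still under observation
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
        times.append(u)
        surv.append(s)
    return np.array(times), np.array(surv)

t = [1, 2, 2, 3, 4]
d = [1, 1, 0, 1, 0]          # 1 = death observed, 0 = censored
times, surv = kaplan_meier(t, d)   # S drops only at observed event times
```

Censored subjects leave the risk set without contributing a drop, which is why the step sizes grow as the risk set shrinks.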
As an initial step, we apply the model-free screening approach of Hong et al. (2020) to rank the genes by importance, with correlations ordered from high to low. We select only the top 2000 genes and group every 10 or 20 together to formulate the candidate submodels, yielding 200 and 100 submodels, respectively. Our primary interest is in predicting patients' risk effect. The new predictor z is taken to be the column-wise median of the 2000 gene expressions. We add an "anchor" predictor, the top-ranked gene, to each candidate model for calibration. The anchor's coefficient is allowed to be 1 or −1, since in a real application we do not know whether the transformation function is strictly increasing or decreasing. The priors for the ranked submodels are chosen in a manner similar to that used in the simulation studies.
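The screen-then-group construction of candidate submodels can be sketched as follows; absolute marginal correlation is used here as a stand-in for the model-free screening of Hong et al. (2020), the data are synthetic, and the anchor-calibration step is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 2000, 300
X = rng.normal(size=(n, p))                       # synthetic "gene expressions"
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # toy response

# stand-in screening statistic: absolute marginal correlation with y
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
order = np.argsort(scores)[::-1]                  # ranked from high to low

d = 10                                            # genes per submodel
submodels = [order[k * d:(k + 1) * d] for k in range(p // d)]
```

Grouping every d = 10 of the 2000 ranked genes yields 200 candidate submodels, as in the text; d = 20 would yield 100.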
We first use the whole dataset to predict the risk effect. After performing RG model averaging of the submodels, the predicted risk effect associated with the predictor z over the first 200 iterations of the greedy algorithm is plotted in Figure 2. The greedy model averaging algorithm converges stably after 100 iterations. The proposed ERG, URG, and SRG all use 13 and 9 submodels for d = 10 and d = 20, respectively. The corresponding submodel weights under the proposed RG model averaging are summarized in Table 5; the first five submodels carry most of the weight. The genes selected by the proposed RG approach for d = 10 and d = 20 are summarized in Tables S3 and S4, respectively, which are delegated to the supplementary materials. In addition, to further evaluate predictive power, we use the concordance index as the metric. We randomly sample 305 observations for training and use the remainder for testing, repeating this process 100 times for the proposed and all competing approaches. Table 6 summarizes the risk effect predictions, concordance index, and running times of all the approaches. The regularization approaches tend to yield higher risks, except for Cox-MCP, whereas the proposed RG model averaging approaches yield lower risks; compared to ECV model averaging and PSCI, the proposed RG approaches have higher risks. Based on the results in Table 6, the proposed RG approaches outperform all the comparative approaches in terms of the concordance index. Although applying the proposed RG model averaging with different priors is time-consuming, the extra time is well spent: this flexible approach avoids a fixed parametric model assumption and can thus yield predictions closer to the real data, as evidenced by the higher concordance index values in Table 6.
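A greedy (Frank-Wolfe-style) search over the weight simplex can be sketched as below; the paper maximizes a smooth concordance-index criterion, while this illustration substitutes a generic utility function and synthetic candidate predictions:

```python
import numpy as np

def greedy_average(preds, utility, T=200):
    """At iteration t, move mass 2/(t+2) toward the single submodel whose
    inclusion most increases the utility of the averaged prediction."""
    K, n = preds.shape
    w, avg = np.zeros(K), np.zeros(n)
    for t in range(T):
        step = 2.0 / (t + 2)
        gains = [utility((1 - step) * avg + step * preds[k]) for k in range(K)]
        k_star = int(np.argmax(gains))          # best single submodel to mix in
        avg = (1 - step) * avg + step * preds[k_star]
        w *= 1 - step                           # shrink old weights...
        w[k_star] += step                       # ...and reward the chosen one
    return w, avg                               # w stays on the simplex

rng = np.random.default_rng(0)
preds = rng.normal(size=(6, 50))               # six candidate predictions (synthetic)
target = preds[0]                              # pretend submodel 0 is ideal
w, avg = greedy_average(preds, lambda a: -np.sum((a - target) ** 2), T=100)
```

Because each iteration adds at most one submodel, stopping after T iterations caps the number of active submodels, which is how the 13 and 9 active submodels above arise.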
Table 5. Model weights using the proposed RG model averaging in the lung cancer dataset.

Conclusion
Transformation models are widely used in many scientific fields and are highly flexible. We adopt them as the working candidate submodels and propose a novel RG model averaging approach to predict risk effects for high-dimensional survival data. The proposed approach does not require the rigid designation of a correct joint model; rather, it integrates a number of possible submodels, and the plausibility of each submodel is evaluated through its weight in the averaging step, making the procedure robust against model misspecification. Furthermore, coupled with the greedy algorithm, the procedure enables control of the trade-off between model complexity and predictive approximation quality via the number of iterations, which is useful in a wide variety of settings, especially high-dimensional analysis. Extensive numerical studies confirm the satisfactory and robust performance of our proposed model averaging.
We consider a low-dimensional setting within each submodel while allowing the model size K_n to diverge with n to accommodate high-dimensional scenarios. As an alternative, one could construct high-dimensional candidate models and estimate them with regularization-based methods using different tuning parameters. The optimality of submodel construction procedures is a very interesting topic for study, especially how to account for the randomness induced by the construction procedure. On the other hand, the independence between the error variable, the risk effect, and the censoring time may be relaxed; it is also possible to allow for heteroscedasticity. None of these issues is trivial, however, and we leave them to future research.
and the cardinality of S* by |S*| = s_0. Lemma 1 below gives an upper bound for the supremum norm of

Figure 1 .
Figure 1. Kaplan-Meier survival curve and its 95% confidence interval for the lung cancer dataset.

Figure 2 .
Figure 2. Prediction of the risk effect using the proposed RG model averaging in the first 200 iterations of the greedy algorithm, in the lung cancer dataset.
Table 6. Prediction of the risk effect and concordance index, and the running time (in minutes) of the proposed RG model averaging, ECV model averaging, various regularization approaches, and PSCI, in the lung cancer dataset.

Table 1 .
Simulation results over 100 runs for predictions of the risk effect and the concordance index using the proposed RG model averaging, ECV model averaging, various regularization approaches, and PSCI when the failure times are generated from Case I and the correct model is not included among the candidate submodels.

Table 2 .
Simulation results over 100 runs for predictions of the risk effect and the concordance index using the proposed RG model averaging, ECV model averaging, various regularization approaches, and PSCI when the failure times are generated from Case II and the correct model is not included among the candidate submodels.

Table 3 .
Simulation results over 100 runs for predictions of the risk effect and the concordance index using the proposed RG model averaging, ECV model averaging, various regularization approaches, and PSCI when the failure times are generated from Case III and the correct model is not included among the candidate submodels.

Table 4 .
Simulation results over 100 runs for predictions of the risk effect and the concordance index, and the mean and standard deviation of estimating the first component of weights using the proposed RG model averaging when the correct model is included among the candidate submodels.