A survival tree based on stabilized score tests for high-dimensional covariates

ABSTRACT A survival tree can classify subjects into different survival prognostic groups. However, when data contains high-dimensional covariates, the two popular classification trees exhibit fatal drawbacks. The logrank tree is unstable and tends to have false nodes; the conditional inference tree is difficult to interpret the adjusted P-value for high-dimensional tests. Motivated by these problems, we propose a new survival tree based on the stabilized score tests. We propose a novel matrix-based algorithm in order to tests a number of nodes simultaneously via stabilized score tests. We propose a recursive partitioning algorithm to construct a survival tree and develop our original R package uni.survival.tree (https://cran.r-project.org/package=uni.survival.tree) for implementation. Simulations are performed to demonstrate the superiority of the proposed method over the existing methods. The lung cancer data analysis demonstrates the usefulness of the proposed method.


Introduction
A classification tree is a tree-like model developed by a sequence of binary classifications (called splits) of the samples. A classification tree consists of internal nodes of the binary splits (branches) and terminal nodes of the classified groups (leafs). One of the major purposes of classification trees is to predict the outcome for a sample by identifying the most appropriate group. The tree provides a graphical tool for pediction, which is especially useful for clinicians/physicians to predict the disease outcomes for their patients.
Breiman et al. [5] described the classificatiion tree as a non-linear regression model, which becomes a popular idea in statistics and machine learning community. In the statistical literature, the deviance and the two-sample test are the two most commonly used measures to find the optimal spliting rules. See Everitt and Howell [21] for the overview of the classification tree for continuous and discrete outcomes. For survival outcomes, which are typical outcomes for fatal diseases, the two most commonly used measures are the significance tests based on the logrank statistic and Cox regression analysis.
The earliest works of Ciampi et al. [9,10] used a logrank test for classifications, including the applications to non-Hodgkin's lymphoma patients and breast cancer patients. While the literature shows a large number of criteria to replace the logrank test [4,31,38], the classification tree based on the logrank test remains the simplest for computation and interpretation. However, the practical application of a tree is feasible only by the software packages.
So far, the R package 'rpart' (Therneau and Atkinson [39]) is one of the most popular and well-developed tools to construct trees, along with the graphical add-on R package 'partykit' (Hothorn et al. [27]). These powerful packages allow various response types, including continuous, binary, and survival responses. For survival responses, significance tests under the conditional Cox model are employed for selecting the optimal splits. The resultant tree is called 'conditional inference tree' (Hothorn et al. [25,26]), which is often abbreviated as 'ctree'. Illustrations of the ctree are found in the well-known textbook of R (Hothorn and Everritt [24]).
High-dimensional covariates often arise while developing a survival prognostic prediction method with gene expressions [17,19,41,42]. Such data pose many challenges for users of a survival tree. When one applies the R package 'rpart' to high-dimensional gene expression data, no covariate is deemed significant in the usual scaling of P-values (e.g. 0.01 and 0.05) due to the adjustment for multiple tests. It should be emphasized that the usual scaling of P-values (e.g. 0.01 and 0.05) for gene screening remains useful without adjustment (e.g. Beer et al. [3]; Chen et al. [7]; Zhang et al. [44]). It is also too time-consuming to find the optimal tree in the presence of a number of covariates. Therefore, the univariate score tests have been suggested to select significant covariates [19,34,35,42], which do not require model fitting and are workable for the commonly used P-value thresholds. A predictor constructed from the selected genes is useful for predicting survival in lung cancers [3,7], ovarian cancer [20], and other cancers [30,8] for the usual scaling of the P-value threshold, such as 0.05, 0.01, and 0.001.
Motivated by the computational efficiency and interpretability (of P-value) in the univariate score tests, this article introduces a new tree construction algorithm of classifying survival data with the high-dimensional covariates. This proposed method differs from the conditional tree based on the multivariate Cox-regression (Hothorn et al. [25]) that is implmented in the existing R packages. However, the methods are similar in its principle to use 'P-value' as the threshold for determining the final tree. The proposed method also differs from the tree constructed by the logrank tests (called the logrank tree). We argue that the logrank tests are occasionally unstable and invalid for finding the best split in high-dimensional and small sample settings. The key novelty in the proposed method is the stabilization technique originally proposed by Witten and Tibshirani [42] for high-dimensional multiple tests. The stabilized score test is a minor correction to the logrank test, which however produces a large influence on a tree via multiple tests.
This article is constructed as follows. Section 2 reviews the background of the research. Section 3 proposes a new survival tree method. Section 4 includes the simulation studies. Section 5 contains a real data analysis. Section 6 concludes the article. Software implementation is given in Appendix.

Background
This section reviews the methods for classification trees for survival data. In order to motivate our newly proposed method, we point out some defects of the existing classification tree methods.

Notations
is a p-dimensional vector of covariates, and n is the number of samples. Define a hazard function h(t|w ij ) = −dlog P(T i > t|w ij )/dt, the instantaneous risk of death at t given the covariate information w ij . The w ij can be either {x ij > c} or {x ij ≤ c} for a cut-off value c.

Survival trees
Survival trees are supervised classification algorithms to develop prognostic models based on observed survival outcomes. A tree constructed from survival data can classify patients into different prognostic groups according to their risk of death. With a survival tree, physicians and clinicians can intuitively classify their patients to identify their survival prognoses.
A survival tree is constructed by an algorithm called 'recursive partitioning'. It is an algorithm to sequentially partition the samples into subgroups based on the data To construct a tree, we define a node by the split of the samples into two subgroups: {i; x ij ≤ c} and {i; x ij > c}, where c is a cut-off-value. The choices for the covariate j and the value of c are determined by maximizing the prognostic difference between two subgroups. The set {1, 2, . . . , n} is called a parent node (or root node). Two sets of partitioned samples are called child nodes ( Figure 1). A tree arises by recursively creating children from parents until some stopping criterion is met. Terminal nodes are the nodes without children. Obviously, different trees arise from different measures of the prognostic difference of the two child nodes.

Logrank tree
The logrank test has been the most commonly used way to measure the difference of the two groups in terms of survival prognosis. For constructing a survival tree, Ciampi  et al. [10] proposed the logrank test as a criterion for splitting a parent node into two child nodes. Since then, it has been routinely utilized for constructing a survival tree [4,31,38]. The logrank tree minimizes the P-value for testing H 0 : h(t|x ·j ≤ c) = h(t|x ·j > c) for each node in the tree. The recursive partitioning is terminated when the P-value gets larger than a pre-specified threshold.
However, the test results of the logrank test may lose the statistical meaning when the two child nodes have unusual sample allocations, the cases frequently occur in a tree construction algorithm for high-dimensional covariates. Here we give an example to describe the invalid results for the logrank test due to unbalanced sample sizes.
Suppose that one sample is allocated to Group 0 and other n − 1 to Group 1 ( Figure 2). In this setting, the Z-value for testing the two groups' difference is and the P-value= Pr(|Z| > z) where Z ∼ N(0, 1). Thus, the P-value < 0.05 if n ≥ 5. However, such a small P-value may not indicate the true difference of the two groups. More generally, if one group contains the earliest k deaths, and the other group contains the remaining samples, the Z-value is Again, it is likely to get the P-value < 0.05 if n ≥ 5. The false P-values generally arise in small samples, not restricted to the above cases. In order to avoid the above mentioned problems in the logrank tree, Leblanc and Crowley [31] applied a permutation test. However, the permutation test is not a realistic approach for high-dimensional covariates due to its extremely high computational cost. A more common approach is to set the minimum number of samples in each terminal nodes in order to insure that the sample sizes in {i; x ij > c} and {i; x ij ≤ c} are greater than a specfied number (e.g. 10). However, the sample size itself does not have statistical meaning, difficult to interpret, and unclear by censoring. Thus, the choice of the sample size is often subjectively or arbitrary chosen by a user. Or, a user may try different choices before getting a tree that he/she wants. A bit more systematic way is to prune the tree after a large tree is constructed under a very small sample size threshold. This strategy typically needs to choose a cost function and perform an additional cross-validation to optimize the prediction accuracy, which is extremely time-consuming for high-dimensional covariates. Our stragegy to correct the logrank test is different from these complex or unrealistic approaches.

Stabilized score test
It is well-known that the logrank test is related to a score test under the Cox model h j (t|w ij ) = h 0j (t)exp(β j w ij ), where w ij = I(x ij > c), and h 0j (·) is the baseline hazard function. Let R i = { ; t ≥ t i } be the risk set at time t i . For testing H 0j : β j = 0 vs. H 1j : β j = 0, the score statistic and its variance are derived from the log-partial likelihood j (β j ) under β j = 0, j , which is equivalent to the logrank test. Hence, the score and logrank tests share the same problem under unusual sample allocations.
Witten and Tibshirani [42] stabilized the Z-value by using a small constant d 0 ≥ 0 such that The value d 0 is regarded as a shrinkage parameter: a larger value d 0 shrinks the Z-value more toward zero ( Table 1). The value d 0 = 0 corresponds to the original score statistic or the logrank statistic. Since V 1/2 converges in probability to one, the asymptotic distribution of z d 0 j is equivalent to that of the logrank test. This means that a P-value can be computed with the reference to the standard normal distribution as in the logrank test. Table 1 shows that the stabilized Z-value mitigates the problem of the logrank test under the unbalanced sample allocation. Emura et al. [19] developed the compound.Cox R package that implements the computation of the stabilized Z-values in multiple tests. The function uni.score(.) in this package can efficiently compute many Z-values of the stabilized score test.

Conditional inference tree (ctree)
The ctree of Hothorn et al. [25] performs a recursive patition according to the tests for the hazard ratio via the conditional Cox model given the status of their ancesters. They select a covariate x j and its cut-off value c such that the conditional test for H 0 : h(t|x ·j ≤ c) = h(t|x ·j > c) leads to the greatest significance. The most important advantage, as advocated in Hothorn et al. [25], is that the statistically meaningful measure -the P-value, can control the tree size without the sample size contraint and the pruning process. However, the interpretational difficulty arises when the dimension p is large (Section 5).

Proposed method
This section proposes a new method for constructing a survival tree based on the stabilized score test. Recall that the stabilized score test mitigates the problem of the logrank test under unbalanced samples between two splitting groups (Section 2.4). Hence, the new algorithm is intended to improve upon the logrank tree. Moreover, we explain how this algorithm is efficiently performed for the purpose of constructing a survival tree.

Algorithm to find the optimal split
A tree is contructed by recursively partioning a parent node into two chidren by finding the optimal split. Hence, it is essential to develop an efficient algorithm to test a number of hypotheses for all the possible splits. Below, we explain how efficiently the stabilized score tests are performed.
We wish to decide an optimal split for testing It is then necessary to search over all possible covariates x ·j 's and their cut-offs by testing a number of hypotheses. Below, we propose an efficient computing method for performing multiple tests and finding the optimal split. Let c 1 < c 2 < · · · < c m be the m ordered values such that c 1 is the smallest value and c m is the largest value of x ij 's. Thus, all x ij 's are scaled into c 1 ≤ x ij ≤ c m so that c 1 , c 2 , . . . , c m−1 are the possible cut-off values for all the covariates, giving binary splits . . .
The (i, j) element of this matrix is denoted by w ij , i = 1, 2, . . . , n, and j = 1, 2, . . . , p(m − 1). Then, for a model h j (t|w ij ) = h 0j (t)exp(β j w ij ), we perform a test for H 0j : β j = 0. The univariate score statistic and its variances are . Finally, the optimal split is determined by To find the optimal split, it is necessary to perform high-dimensional tests on p(m − 1) covariates, which is computationally expensive. A naive method is to use the 'for loop' to compute them. However, consider a k-node tree having to compute over k · p(m − 1) tests. Then, one requires a huge computational burden due to triple loops for k, p, and m.
To overcome the problems of high-dimensional score tests, we extend the matrix-based computation technique of Emura et al. [19] to a tree algorithm. In order to apply their method, we first prepare the stacked matrices: Then, we define the binary split indicator matrix W = I(X > C).
We define a vector of score statistics and a vector of their variances as where δ = (δ 1 , . . . , δ n ) and . . .
Thus, the vector of Z-statistics is S/(V 1/2 + d 0 1). Here, the operators × and / are applied in the component-wise manner.
Above operations should be done on the vector S and V instead of computing S j and V j one-by-one through the loop for j = 1, 2, . . . , p(m − 1). For R users, one first prepares W, and then applies it to the uni.score(,d0 = ) function in the compound.Cox package. The output gives a vector of Z-values of dimension p(m − 1). The following example illusrate how this idea works. Example 3.1: (a toy dataset of n = 10) We give an example of using the proposed computing method. Consider a dataset of size n = 10: A split indicator matrix defining all the possible splits is shown as follows: We put this 'W' matrix into the function uni.score(.) in the compound.Cox package, which produces the vector of Z-values and P-values of the score tests: We see that I(x 5 > 1) leads to the greatest significant since the P-value of the score test is the smallest, serving as the optimal split.
To see the effect of the stabilized score test, we set d 0 = 0.1. Then, the greatest significance is shown to be I(x 3 > 2). Thus, the optimal split has been changed by stabilization.

Remark 3.1:
The proposed algorithm is suitable to deal covariates having the same scale (the example above). However, clinical covariates, such as tumor size, gender, and cancer stages, may have different scales. The simplest way to unify the scales is to transform all covariates into quantile levels, e.g. quartile levels, 1, 2, 3, or 4 ( ∼ 25th, 25th ∼ 50th, 50th ∼ 75th, or 75th ∼ percentile). A factorial covariate has to be transformed into either an ordinal covariate or dummy variables. A variable taking 0 and 1 has to be changed into 1 and 4 to conform this scale. In practice, the choice of c 1 < c 2 < · · · < c m may not need a large m. Clinical interpretations of the cut-off values c 1 , c 2 , . . . , c m−1 may also be relevant.

Algorithm to construct a survival tree
We have suggested our optimal splitting strategy by minimizing the P-value for each split. After recursively splitting nodes, one needs to stop splitting when certain conditions are met. We suggest our stoping rule to be 'P-value', unlike a commonly used rule based on the number of samples in a node (or uncensored samples in a node). We believe that Pvalues are statistically more meaningful than sample sizes. Especially for censored data, the sample size carries little information. For instance, a node with 10 uncensored samples is more informative than a node with 10 censored samples. If a tree is determined by the sample size criterion, it has to perform an additional pruning process. Furthermore, users have little sense to determine how a sample size is large or small. On the other hand, they may agree that 'P < 0.05', 'P < 0.01', and 'P < 0.001' are reasonable 'evidence' to have a split for a node. In summary, our survival tree adopts a single spliting/stoping criterion, the P-value, without restricting the node sizes or pruning a tree. This results in a very simple algorithm with an interpretable tree.
To develop our tree-construction algorithm, the initial step is to: Step 0: Set a P-value threshold, denoted as 0 < P < 1 (e.g. P = 0.05), and a shrinkage parameter d 0 > 0. Define a current (parent) node by all the samples in the dataset {1, 2, . . . , n}.
We propose the following recursive partitioning to construct a survival tree: Step 1: If the current node has n ≤ 2, set it as a terminal node. For n > 2 in the current node, calculate the Z-value z d0 jk for testing {i; Step 2: If P j * k * < P, create two child nodes by spliting the current node into {i; x ij * ≤ c k * } and {i; x ij * > c k * }: go back to Step 1, replacing the current node with the child nodes. If P j * k * ≥ P, stop the algorithm and define the current node as the terminal node.

Remark 3.2:
If the samples in the current node are all censored, one has P d0 j,k = 1 due to z d0 jk = 0 for all possible splits (j, k)'s. Thus, this node becomes a terminal node no matter how the sample size is large. If the current node has a high proportion of censored samples, it is likely to be a terminal node. However, we stick to the P-value for the stopping rule without referring to the censored proportion. This is because the P-value contains every information necessary to judge the decision on the binary split.

Selection of d 0
We suggest minimizing the Akaike Information Criterion (AIC) to choose d 0 for a given P-value threshold. The AIC of a tree can be computed from a Cox model having the terminal nodes as factorial covariates (computed by the AIC(.) function in R). If the tree has unbalanced samples in the terminal nodes, the Cox model may not converge. Thus, the value of d 0 is optimized among those models making the Cox model converges.
As d 0 increases, the likelihood tends to decrease while the number of terminal nodes decreases. The AIC achieves a good balance between the convergence, goodness-of-fit, and the model complexity. We do not recommend cross-validation that requires a high cost of computation. Different values of d 0 can produce the same AIC value because there are a limited number of trees coming from all the values of d 0 . In our data example, the values d 0 = 0, d 0 = 0.01, d 0 = 0.1, and d 0 = 0.2 were enough to produce all possible trees. Among them, d 0 = 0.01 was chosen as the best choice while d 0 = 0 and d 0 = 0.1 did not make their Cox models converge.

Survival risk prediction
A proposed survival tree can be used for prediction for a new sample not included in the training dataset. At each inner node, a positive Z-value (z d0 j > 0) implies that the group of {x j > c j } has a higher risk of death than {x j ≤ c j } does. Conversely, a negative Z-value (z d0 j < 0) implies that the group of {x j ≤ c j } has higher risk of death than {x j > c j } does. Thus, the signs of the Z-values in inner nodes determines the order of the death risks among the nodes.
We denote the ranks of the terminal nodes by R ∈ {1, 2, . . . , k}, where k is the number of terminal nodes in a tree. For instance, a terminal node has the highest risk R = k if it is judged 'higher risk' at all inner (ancestral) nodes. Similarly, the node of the lowest risk is assigned for R = 1. To determine all other intermediate ranks, we impose a rule: any child node judged higher risk than the other child cannot be reversed by further splits.
The prediction for a new sample is performed by allocating the rank of the sample, and then calibrating the survival probability based on the samples in the corresponding node (e.g. by the Kaplan-Meier estimator). In the numerical analyses of this article, we define the left node to be the higher risk group and the right node to the lower risk group. In this way, all the terminal nodes are ordered by their risks from the highest risk (the leftmost node) to the lowest risk (the rightmost node). The tree becomes further informative by adding the Kaplan-Meier survival plot made by the samples in each terminal node.
The prediction performance of a tree can be assessed through the assigned ranks of risk. If test samples are available (in addition to training samples), one can compute the predicted risk ranks for all the test samples. Then, one can evaluate the concordance between the predicted risk ranks and the survival outcomes for all the samples in the test dataset. We adopt the c-index that is easily computed from censored data through the concordance(.) function in the survival package [40].

Simulations
Monte Carlo simulations were conducted to assess the performance of the proposed method and the existing methods (logrank tree and ctree) under high-dimensional covariates. The simulation consists of several steps: generating training data, constructing a tree, generating testing data, and evaluating the performance of the tree.
Since the evaluation of classification trees involves a few different aspects, we chose the following criteria: (i) Selection ability: We examine if the tree contains a reasonable proportion of informative nodes. The tree may have inner nodes that are falsely chosen (false positive). We use the 'precision' defined as the proportion of the inner nodes whose split uses informative covariates (true positive). A larger value of the precision corresponds to a better selection ability. (ii) Classification ability: We examine if the tree has a good classification ability. We use the likelihood ratio (LR) test for the equivalence of the classified groups based on the testing data. We use the P-value obtained from Cox regression treating terminal nodes as factors. A small P-value corresponds to a better classification ability. (iii) Prognostic ability: We examine if the tree has an ability to correctly assign the prognositic risk ranks for the testing data. We adopt the c-index that is easily computed from censored data through the concordance(.) function in the survival package.

Simulation designs
Data were simulated as follows. We generated p = 100 discrete-valued gene expressions with We considered a sparse setting (q = 10) with β = 0.5 or 1, and a non-sparse setting (q = 30) with β = 0.05 or 0.1. The intra-cluster correlation of gene expressions was 0.8 for the first two clusters of size q, and 0 for the last cluster of size 100 − 2q (Emura et al. [17]). We first generated continuous gene expressions y ij ∼ Unif (−1.5, 1.5) from the R function X.pathway(., rho1 = 0.8, rho2 = 0.8) in the compound.Cox package. We then discretized them by the two scenarios: Balanced covariates: Unbalanced covariates: For the latter, we randomly selected j ∈ {2q + 1, . . . , 100} and replace x ij with 4. Figure 3 is an example for q = 5, where j ∈ {11, . . . , 100} are randomly chosen for replacement. Given the gene expressions, we generated T i from a Cox model h(t|x i ) = h 0 (t) exp(x i β) with h 0 (u) = 1. We also generated U i ∼ Unif (0, 1), and set t i = min{T i , U i } and δ i = I{T i ≤ U i }. The censoring percentage is around 50 ∼ 56%. Thus, we generated training data {(t i , δ i , x i ); i = 1, . . . , n} to develop trees under P-value thresholds of 0.01 and 0.005 (the conditional inference tree uses the adjusted thresholds (e.g. the adjusted value for 0.01 is 1 − (1 − 0.01) 100 ≈ 0.633)). We also generated independent and identically distributed test samples in order to measure prediction accuracy. All simulations were based on 100 repetitions, where each repetition develops a tree and tests its prediction performance.
Supplementary Materials include additional stimulations where the data were generated from the continuous gene expressions (without discretization). We observe that the effect of discretization is minimal. Tables 2 and 3 compare the performance of the proposed tree, the logrank tree, and the ctree.

Simulation result
If the covariates are balanced, the proposed tree and the logrank tree give almost identical performance in terms of the classification ability (LR test) and prognostic ability (c-index). While the ctree gives the lowest c-index, the classification ability is often the best. In fact, all the trees performed competitively well so that no clear-cut conclusion is derived to judge the best tree.
If the covariates are unbalanced, the performance of all the trees gets lower while the proposed tree exhibits a modest advantage over the logrank tree in terms of the LR test and the c-index. The improved results come from the fact that the proposed tree avoids false nodes by stabilizing the test results for unbalanced samples. This is exactly what we expected. However, the improvement under non-sparse setting is weaker than that under the sparse setting (Table 2 vs. Table 3). This means that the variance stabilization is effective specifically for the sparse setting.
We learn that the sizes of the tree (No. terminal nodes) are remarkably different among the three methods. In principle, a smaller tree is preferred to a larger tree if they have the same prediction performance. We observe that the logrank tree gives the largest tree while the ctree gives the smallest tree. In this respect, the ctree provides the most parsimonious model. However, the tree seems to be too small. Indeed, the low c-index for the ctree likely indicates the failure to have informative nodes in a tree. The proposed method gives an intermediate tree size.
Under the sparse setting ( Table 2) the precision is usually less than 50% for all trees. Hence, the majority of the nodes are falsely selected. Nonetheless, the proposed tree has higher precision than the logrank tree. This improvement is due to the conservative tests results of the proposed tree relative to the logrank tree. The ctree should have the highest precision due to the smallest number of terminal nodes in a tree; however, we were unable to calculate the precision from the output.
Based on the above findings, we conclude that the proposed method effectively selects nodes from high-dimensional and unbalanced covariates. The proposed tree has a better prognostic ability than the ctree and a more precise selection ability than the logrank tree. However, under the balanced covariates, the performance of the all the trees are equally well.

Data analysis
This section applies the proposed method to the lung cancer data to illustrate how a survival tree is constructed, comparing the results with other tree-based methods and model-based methods.

The lung cancer data
We consider the lung cancer data of Chen et al. [7] consisting of 125 lung cancer patients. Each patient has gene expressions from his/her tumor, coded as 1, 2, 3, or 4 ( ∼ 25th, 25th ∼ 50th, 50th ∼ 75th, or 75th ∼ percentile). The endpoint is overall survival (i.e. timeto-death due to any reason). During the follow-up, 38 patients died and other 87 patients were censored. In Chen et al. [7], the 125 patients were separated into 63 training and 62 testing samples. The difference between the training and testing sets was not found by significance tests [22].  We use the subset of the data containing 97 gene expressions as available in the Lung object in the compound.Cox R package [19]. Available are survival time (time to either death or censoring) in months, censoring indicator (1 = death, or 0 = censoring), index for training sample (TRUE = training sample, or FALSE = testing sample), and genes named by their symbols, VHL, IHPK1, . . . , RPL5. We shall construct survival trees based on the training samples (n = 63), and validate their performance by the testing samples (n = 62).

Results: the root node
We illustrate the advantage of the stabilized score test over the logrank tree owing to unbalanced samples. Figure 4 shows how the logrank test optimally splits the n = 63 samples into{ANXA5 ≤ 1} vs. {ANXA5 > 1}.
In Figure 4, we see that the Kaplan-Meier plots for the two groups are not convincingly separated due to the unbalanced sample sizes. On the other hand, the stabilized score tests gave {IRF4 ≤ 3} vs. {IRF4 > 3} as the best split. Indeed, the stabilized test gives a more convincing separation between the low and high risks (Figure 4).

Results: survival trees
From the n = 63 training sample, survival trees were developed from the following methods: (I) The logrank tree (using the score tests with d 0 = 0), (II) The proposed tree (using the stabilized score tests with d 0 = 0.01 chosen by AIC), (III) The conditional inference tree (ctree).
In (I) and (II), we set the stopping criterion of P = 0.01. Other choices of P were also examined (see Supplementary Materials for details). In (III), the choice of P shall be explained later.
(I) The logrank tree The root node splitted into{ANXA5 ≤ 1} and {ANXA5 > 1} ( Figure 5). Starting from the root node, the splitting process was recursively continued. The tree consisted of six inner nodes (including the root node) and seven terminal nodes. The Kaplan-Meier survival plots for the seven terminal nodes show that the prognosis groups are correctly ordered from the highest risk group (the leftmost node) to the lowest risk group (the rightmost group). One major concern is the too small sample size (n = 1) for the highest risk group ( Figure 5).
(II) The stabilized score (proposed) tree The tree based on d 0 = 0.01 ( Figure 6) has the smallest AIC value among all the possible d 0 ≥ 0. The root node was {IRF4 ≤ 3} and {IRF4 > 3} ( Figure 6). This split mitigates the major concern of the logrank tree. Except for the root node, the inner nodes are similar to those of the logrank tree.
The Kaplan-Meier survival plots for the six terminal nodes are consistent with the risk groups, correctly ordered from the highest risk group (the leftmost node) to the lowest risk group (the rightmost group).

(III) Conditional inference tree by the R function ctree(.)
The ctree under a threshold P = 0.01 was performed (by ctree(,control = ctree_control (alpha = 0.01, where alpha is the significance level for the splits). Consequently, the output returned the null tree consisting of only one terminal node of all samples since none of the splits reached the significance level of 0.01. However, this conclusion is not reasonable since the covariates should have some predictive ability for survival outcomes as previously reported [7,15,17,19]. To understand this phenomenon, one should note that the P-value threshold in ctree(.) is the Bonferroni adjusted value P Adj = 1 − (1 − P) p , where p is the number covariates and P is the non-adjusted P-value for individual tests (Hothorn et al. [26]). Since p = 97 and P = 0.01, we set the adjusted Pvalue threshold P Adj = 1 − (1 − 0.01) 97 = 0.62. However, the resultant ctree still had only two terminal nodes. Larger values of P Adj yielded more reasonable trees with 3 terminal nodes (0.64 ≤ P Adj ≤ 0.67, Figure 7a) and 4 terminal nodes (0.67 < P Adj ≤ 0.90, Figure 7b). See Supplementary Materials for our detailed process of choosing P Adj .
Kaplan-Meier survival plots in Figure 7 do not exhibit a clear separation between the risk groups. However, it turns out that they have some prognostic ability when applied to the test samples. The split between {DUSP ≤ 3} and {DUSP > 3} appears in the root node, which is the most significant split in the tree. However, the attached P-value to this split is 'P = 0.505', showing no evidence to separate the two groups in terms of survival prognosis. Thus, the input and output of the P-value thresholds are difficult to interpret, contradicting to the predictive performance. While the problem can be solved by adjusting the P-value, users may not know how to do it.

Validation results
We validated the prediction performance of the developed trees by the test samples of n = 62. For all the test samples, we predicted their risk groups using one of the trees. We then calculated the c-index, a concordance measure between the predicted risk ranks and the actual survival outcomes (Section 3.4). Higher values in the c-index means better prediction performance, and values less than 0.5 mean no prediction ability. Table 4 shows the c-index for the five different trees. All the trees have similar values of the c-index (0.564 ∼ 0.584). With these small differences in the performance, it is not reasonable to select the best tree based on the largest c-index. The c-index values all not far from 0.5 are reasonable considering the small training samples (n = 63), more than half of

0.600
Note: CC = compound covariate; Copula = copula method; CS = compound shrinkage; Wald = genes are selected by the Wald tests; score = genes are selected by the score tests. See [19] for details.
which were censored, and a number of possibly uninformative genes (p = 97). However, we notice that the tree structures are different, especially between the proposed tree and the ctree. In fact, this phenomenon is typical if prediction formulas are derived from a large number of predictors. That is, there could be a number of prediction schemes all yielding the same performance in prediction. Table 4 also shows the c-index for the six linear predictors as previously reported in Emura et al. [19]. Again, the c-index for these predictors are similar to those for the trees. However, we notice that trees' discrete predictors (ranking of 1, 2, . . . ) cannot be comparable to linear predictors' continuous predictors. Hence, this comparative c-index values may actually demostrate the advantage of the trees over the continuous predictor. The continous predictors usually need to be further splitted to yield clinically interpretable subgroups (e.g. good prognosis groups vs. poor prognosis group). Thus, we have validated all the trees by showing comparable predictive ability to the previously reported benchimark ability.

Conclusion of the data analysis
While we have validated all the trees (logrank tree, stabilized score tree, and ctree), the best one could not be selected by thier prediction performance. To discuss the best tree among the five trees, we search for the clinical interpretations for the trees. The proposed tree had an important gene 'IRF4' (Interferon Regulatory Factor) that is a strongly effective gene as a diagnostic and prognostic marker for human non-small cell lung cancer. Medical researchers pointed out that the gene 'IRF4' is upregulated in this cancer (Alvisi et al. [1]), that is, a larger value of 'IRF4' leads to a higher risk for lung cancer patients. This evidence demonstrates the clinical advantage of the proposed tree over other trees. Besides, the gene 'ANXA5' selected by the logrank tree and all other existing predictors does not have a clear interpretation since it may not be a predictive gene of lung cancer. Instead, it is a prognostic marker of cervical cancer, and head and neck cancer (Kang et al. [29]).
In summary, we have demonstrated the advantage of the proposed method (the stabilized score tree) over the logrank tree, the ctree, and other linear predictors in terms of the clinical relevance. However, all the methods exhibited similar numerical performance for prediction. Indeed, all the existing trees and prediction methods are highly sophisticated so that they all reached maximum levels in terms of quantitative prediction abilities.

Conclusions
We have proposed a new tree-based classification method motivated by some difficulties in the two most popular methods: the logrank tree and conditional inference tree. Motivated by the unusual detection of the survival difference by the logrank test, we stabilized the variance when computing the Z-value. This idea was previously proposed by Witten and Tibshirani [42] for testing a large number of covariates. The adaptation of their idea to a survival tree is a very reasonable approach, yet it has not been considered. We also have developed a computationally efficient method to find the optimal split among a large number of candidate splits. This extends the technique of Emura et al. [19].
Our simulations have demonstrated the improved precision and classification/ prognostic ability of the proposed method over the existing methods. Specifically, the proposed method is advantageous when a number of unbalanced covariates tend to yield false positive nodes in the logrank tests. This is exactly the case where the stabilized score tests work better than the logrank tests. However, this advantage diminishes when the false positive rate is reduced in the presence of a large number of informative covariates (the non-sparse setting).
By applying the proposed tree to the lung cancer data, we have demonstrated the advantage of the proposed method (the stabilized score tree) over the logrank tree, the ctree, and other continuous predictors in terms of the clinical relevance. However, all the methods exhibited similar performance for prediction.
The critical assumption made on survival data is that censoring mechanism is independent of survival time. The so-called 'independent censoring assumption' needs to be assumed in order to have valid results on the logrank test and score tests. In clinical survival data, some patients may drop out from the medical follow-up study due to their poor health, violating the independent censoring assumption [11,15,16]. Under such dependent censoring schenarios, the singificance of the tests could be false. Thus, in a classification tree, the treatment of survival data with dependent censoring is an important issue as discussed by Moradian et al. [36]. They suggested a new measure of the binary splitting criterion by adjusting for dependence between survival time and censoring time by a copula-graphic estimator; see [18,32,33] for the applications of the copula-graphic estimator to estimate survival difference. Recently, Emura and Hsu [18] studied the performance of the two-sample test under copula-based dependent censoring models. A measure that is free from the specification of a copula could also be considered under a competing risks framework [6]. A test based on the Fine-Gray model may also be considered by treating dependent censoring as a competing risk [2]. More explorations are necessary to deal with dependent censoring.
The so-called survival forest is a re-sampling version of a survival tree, which has gained popularity in recent years [25,28,36]. The extension of the proposed tree to develop a survival forest is a promissing future work. While there could be several advantages for survival forests over survival trees, we believe that survial trees are easier to use in clinical practice due to its computational and conceptual simplicity.
The logrank test remains the most popular nonparametric method to detect the survival difference between two groups due to its simple computation and interpretation. The stabilized score test can be regarded as a minor correction to the logrank test to mitigate some unusual cases, keeping essentially the same computation, interpretation, and asymptotic distribution as the logrank test. One could achieve an improved tree for early or late survival difference from the null via weighted logrank tests [23]. The stabilization test is applicable to the weighted logrank statistics by simply correcting its variance estimate. This is a relevant generalization of the proposed stabilized tree especially when users focus on early or late survival benefit.
We are reluctant to use deviance-based measures that need a likelihood function under some parametric models. However, we note that deviance-based trees are common in continous, binary, and count data, under the framework of the generalized linear model (GLM). Similarly, we have not considered entropy-based measures that needs a specific criterion, such as AIC, BIC, and Gini-index. For continous (normally distributed) data, Yanagawa and Tajiri [43] suggested the AIC-based criterion as an alternative to the twosample problem under small sample sizes. As for the correction to the small sample problems of the logrank tests in survival data, bootstrap and permutation methods are commonly used altervatives (Leblanc and Crowley [31]; Ditzhaus and Pauly [12]). There are growing interest in using the Mann-Whitney type effect for measuring survival difference under the independent censoring scenario [13,14,37] and the dependent censoring scenario [18,6]. These computationally expensive methods may provide some rooms for improvement.

Appendix. Computer software
We made our original R package 'uni.survival.tree' to implement the proposed methods. Presently, there are seven functions in this package. Due to the space limitation, we focus on the most important function 'uni.tree(.)' that produces a survival tree given a survival dataset {(t i , δ i , x i ); i = 1, 2, . . . , n}.
We also explain the function 'risk.classification(.)' that is useful for predicting the risk of death given a tree made by uni.tree(.) and test samples. We also briefly explain two functions for generating data, which was used in the simulations. We use the following styles for inputting the dataset (similar to the compound.Cox R package), • t.vec: a vector (t 1 , t 2 , . . . , t n ), • d.vec: a vector (δ 1 , δ 2 , . . . , δ n ), • X.mat: a matrix with the i-th row x i = (x i1 , . . . , x ip ) for i = 1, . . . , n, • P.value: the P-value threshold (0 < P < 1), • d0: the shrinkage parameter d 0 > 0 that stabilizes the score test, • score: = TRUE for the score test; = FALSE for the logrank test.
The dataset of n = 10 in Example 3.1 is used for illustration under P = 0.05. The code below creates t.vec, d.vec, and X.mat, and apply them to the uni.tree(.) function.
The output shows all the nodes created. The output 'node_status:' tells us whether the node is 'inner' or 'terminal'. First, we identify two inner nodes in the above output, corresponding to the root node and Node 2 in Figure 8. Their P-values are less than 0.05, indicating that these nodes have their child nodes. Next, we identify three terminal nodes in the above output, corresponding to Nodes 1, 3, and 4 in Figure 8. The P-values for the terminal nodes are all greater than 0.05, indicating that these nodes have no child. The output 'Risk score' can be used to assign the rank of death risk. The largest value '1' corresponds to the highest risk group (Node 1, Rank = 3) and the smallest value'−1.1' corresponds to the lowest risk group (Node 3, Rank = 1). To predict the risk of death for test samples, one has to identify the group that the samples belong (Section 3.4). The test samples are those who do not belong to the dataset. Nonetheless, we illustrate the predicted risk of death for all the same patients in the training dataset of n = 10. Inputting the