A Closer Look at the Kernels Generated by the Decision and Regression Tree Ensembles

Abstract Tree ensembles can be interpreted as implicit kernel generators, where the ensuing proximity matrix represents the data-driven tree ensemble kernel. The focus of our work is the utility of tree based ensembles as kernel generators that (in conjunction with a regularized linear model) enable kernel learning. We elucidate the performance of the tree based random forest (RF) and gradient boosted tree (GBT) kernels in a comprehensive simulation study comprising continuous and binary targets. We show that for continuous targets (regression), this kernel learning approach is competitive with the respective tree ensemble in higher dimensional scenarios, particularly in cases with a larger number of noisy features. For the binary target (classification), the tree ensemble based kernels and their respective ensembles exhibit comparable performance. We provide results from several real life datasets for regression and classification relevant to biopharmaceutical and biomedical applications that are in line with the simulations, to show how these insights may be leveraged in practice. We discuss the general applicability and extensions of the tree ensemble based kernels for survival targets and interpretable landmarking in classification and regression. Finally, we outline future research for kernel learning based on feature space partitionings.


Introduction
Decision and regression tree-based ensembles have been time-proven statistical machine learning mainstays (Schoelkopf and Smola 2001). Neural networks have recently emerged as the leading ML methods for prediction from unstructured data, for example, data ensuing from medical imaging or natural language processing (NLP) applications. However, tree-based ensembles such as the random forest (RF) (Breiman 2000) or gradient boosted trees (GBT) (Friedman 2001; Chen et al. 2020) are considered to be the current methods of choice for prediction from tabular (i.e., structured) data (Feng, Yu, and Zhou 2018; Shwartz-Ziv and Armon 2022).
In biopharmaceutical and biomedical applications, tabular data are ubiquitous and frequently used in prediction tasks with continuous and binary targets (i.e., regression and classification, respectively). For example, quantitative structure-activity relationship (QSAR) modeling is used to predict the biological activity of compounds of interest, to prioritize potential treatments for subsequent investigation. Here, the information contained in molecular descriptors (features) is used for the biological activity prediction (Svetnik et al. 2003; Svetnik et al. 2005; Kuzmin et al. 2011; Sheffield and Judson 2019; Verissimo et al. 2019). Biomarker development is another area where tree-based ensembles have shown their utility. In this context, a biomarker is considered to be "a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathological processes, or biological responses to a therapeutic intervention" (FDA-NIH Biomarker Working Group 2016). Examples from this space include biomarkers for prediction of disease state or progression in various therapeutic areas such as neuroscience, immunology, or oncology (see, e.g., Boulesteix et al. 2012; Gonzalez et al. 2017; Ishwaran and Lu 2019; Tarasova et al. 2019; Huang et al. 2021; Li et al. 2022).
Tree-based ensembles such as RF and GBT naturally furnish their respective kernels and can be interpreted within the framework of kernel learning. As usually applied, kernel methods fit linear models in nonlinear feature spaces that are induced by the kernels. The kernel interpretation of the RF and GBT was explored and expounded theoretically, to investigate their asymptotic properties, in Scornet (2016) and Chen and Shah (2018), respectively. On the other hand, there has been interest in the use of algorithms based on the RF kernel (Davies and Ghahramani 2014) in practice. For example, in Davies and Ghahramani (2014), the performance of a Bayesian (Mondrian) forest algorithm was found competitive in several regression tasks on various datasets from the UCI repository.
Our focus is the investigation of tree ensemble based kernels (including the RF and GBT kernel algorithms) used in conjunction with the ridge regularization penalty in regression and classification, and the elucidation of their performance characteristics beyond the investigations carried out in previous work. The remainder of the article is organized as follows: Section 2 introduces the theoretical framework of the tree ensemble based kernels (RF and GBT) for the targets of interest; Section 3 provides a motivational example using the well known Fisher Iris data; Section 4 details a simulation study that systematically evaluates the performance of the RF and GBT kernels in conjunction with ridge regression in various scenarios; Section 5 summarizes the results on real life datasets relevant to biopharmaceutical and biomedical applications; and Section 6 provides discussion, conclusions, and future research directions.

Terminology
Following Breiman (2000), Ishwaran and Lu (2019), Chen and Shah (2018), and Scornet (2016), we consider a supervised learning problem with a training set D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, where X_i ∈ R^p is a feature vector and Y_i can be a continuous or binary target. For continuous and binary targets, Y_i ∈ R and Y_i ∈ {0, 1}, respectively.

Kernels for Regression and Classification
Kernel methods in the machine learning literature are a class of methods formulated in terms of a similarity (Gram) matrix K. The entry K_{i,j} = k(X_i, X_j) represents the similarity between two points X_i and X_j. Kernel methods are well developed and there is a large body of references covering their different aspects (Herbich 2001; Schoelkopf and Smola 2001; Friedman, Hastie, and Tibshirani 2009). In our work we used a common kernel algorithm, namely kernel ridge regression (KRR), for both regression and classification. For two-class classification, we developed the KRR model with targets of -1 and 1 denoting the two classes. The predicted class label was obtained by thresholding around 0.
KRR is a kernelized version of traditional linear ridge regression with the L2-norm penalty. Given the kernel matrix K estimated from the training set, first the coefficients α of the (linear) KRR predictor in the nonlinear feature space induced by the kernel k(·,·) are obtained as

α = (K + λI_n)^{-1} Y,

where λ is the regularization parameter and I_n is the n × n identity matrix.
The KRR predictor h_KRR(X) is given as

h_KRR(X) = Σ_{i=1}^{n} α_i k(X_i, X),

where k(X_i, X) denotes the kernel between a training point X_i and the point X.
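To make the estimator concrete, the following minimal R sketch (ours, not the exact code used in our analyses) fits the KRR coefficients and evaluates the predictor for a precomputed kernel; the objects K, k_new, y, and lambda are assumed to be given.

# Fit alpha = (K + lambda*I_n)^(-1) Y for a precomputed n x n kernel K.
krr_coef <- function(K, y, lambda) {
  solve(K + lambda * diag(nrow(K)), y)
}

# Evaluate h(X) = sum_i alpha_i k(X_i, X) for a new point, given the
# n-vector k_new of kernel values between the training points and X.
krr_predict <- function(alpha, k_new) {
  sum(alpha * k_new)
}

# For two-class problems (classes coded -1/1), threshold at 0:
# class_hat <- ifelse(krr_predict(alpha, k_new) >= 0, 1, -1)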

Random Forest (RF) and the RF Kernel
Random Forest (RF) is defined as an ensemble of tree predictors grown on bootstrapped samples of a training set (Breiman 2000). We consider an ensemble of M tree predictors h(X, Θ_m, D_n), m = 1, ..., M, with each h(X, Θ_m, D_n) representing a single tree (D_n is the training set defined above). The Θ_1, Θ_2, ..., Θ_M are iid random variables that encode the randomization necessary for the tree construction (Scornet 2016; Ishwaran and Lu 2019).
The RF predictor is obtained by averaging over the individual trees:

h_RF(X) = (1/M) Σ_{m=1}^{M} h(X, Θ_m, D_n).

The RF kernel ensuing from the RF is defined as the probability that X_i and X_j are in the same terminal node R_k(Θ_m), with k = 1, ..., T, where T is the number of terminal nodes (Breiman 2000; Scornet 2016). It is estimated as

K_RF(X_i, X_j) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{T} I(X_i ∈ R_k(Θ_m)) I(X_j ∈ R_k(Θ_m)),
where I(•) denotes the indicator function.
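The RF kernel can be estimated directly from the terminal node assignments of a fitted forest. A minimal R sketch using the ranger package (our illustration, not the authors' published code) is:

library(ranger)

# Estimate the RF kernel as the fraction of trees in which two
# samples share a terminal node.
rf_kernel <- function(fit, X) {
  # n x M matrix of terminal node ids, one column per tree
  nodes <- predict(fit, data = X, type = "terminalNodes")$predictions
  M <- ncol(nodes)
  K <- matrix(0, nrow(nodes), nrow(nodes))
  for (m in seq_len(M)) {
    K <- K + outer(nodes[, m], nodes[, m], "==")
  }
  K / M
}

fit <- ranger(Species ~ ., data = iris, num.trees = 500)
K_rf <- rf_kernel(fit, iris)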

Gradient Boosted Trees (GBT) and the GBT Kernel
The GBT are (similarly to the RF) an ensemble of tree predictors. In contrast to the RF, the GBT ensemble predictor is obtained as a sum of weighted individual tree predictors h_m(X, D_n) through iterative optimization of an objective (cost) function (Friedman 2001; Chen and Guestrin 2016):

h_GBT(X) = Σ_{m=1}^{M} w_m h_m(X, D_n),

where the w_m are the weights of the individual trees. The objective function of GBT comprises a loss function; for extreme gradient boosting, a regularization term is added to control the model complexity. In our work we used the extreme gradient boosting (XGB) implementation of the GBTs (Chen and Guestrin 2016). The objective function of the XGB algorithm, L_XGB, is given in the Appendix. We use GBT and XGB interchangeably hereafter.
As for the RF kernel, the GBT kernel is defined as the probability that X_i and X_j are in the same terminal node R_k(h_m) (Chen and Shah 2018), estimated analogously as

K_GBT(X_i, X_j) = (1/M) Σ_{m=1}^{M} Σ_{k=1}^{T} I(X_i ∈ R_k(h_m)) I(X_j ∈ R_k(h_m)).
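The same construction applies to a fitted XGB model via its leaf indices; a sketch under the same assumptions as the RF version (the toy target is ours, for illustration only):

library(xgboost)

xgb_kernel <- function(bst, X) {
  # n x M matrix of leaf indices, one column per boosting round
  leaves <- predict(bst, as.matrix(X), predleaf = TRUE)
  M <- ncol(leaves)
  K <- matrix(0, nrow(leaves), nrow(leaves))
  for (m in seq_len(M)) {
    K <- K + outer(leaves[, m], leaves[, m], "==")
  }
  K / M
}

X <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species == "setosa")
bst <- xgboost(data = X, label = y, nrounds = 100,
               objective = "binary:logistic", verbose = 0)
K_xgb <- xgb_kernel(bst, X)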
In general, the tree-based ensemble kernels can be characterized by a feature map φ that maps the input feature space (R^p) to the space generated by the feature space partitioning ensuing from the tree ensemble (Ren et al. 2015; Balog et al. 2016; Gogic, Ahlberg, and Pandzic in press). The feature space induced by φ through random partitioning is also referred to as a random feature space (Balog et al. 2016; Fan et al. 2020). The mapping φ is defined as φ : R^p → {0, 1}^P, where P = MT is the number of terminal nodes across the tree ensemble. Thus, φ(X_i) can be represented by a P-dimensional vector where each entry corresponds to a particular terminal node in the tree ensemble. For each sample X_i, an entry of φ(X_i) encodes whether X_i belongs to the particular terminal node R_k. A tree ensemble based kernel that represents the random input feature space partitions can then be obtained via φ (represented as a P-dimensional column vector) as

K(X_i, X_j) = (1/M) φ(X_i)^T φ(X_j).

When the RF and XGB kernels are considered as special cases of tree based ensemble kernels, their definitions above coincide with this general form.
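Equivalently, the kernel can be computed through the feature map φ by one-hot encoding the (tree, terminal node) pairs; the sketch below (our illustration, recreating the ranger fit from above) shows that (1/M) ΦΦ^T reproduces the co-membership kernel.

library(ranger)
library(Matrix)

fit <- ranger(Species ~ ., data = iris, num.trees = 500)
nodes <- predict(fit, data = iris, type = "terminalNodes")$predictions
n <- nrow(nodes); M <- ncol(nodes)

# Map each tree's node ids to 1..T_m, then offset so columns are unique
ids <- apply(nodes, 2, function(v) match(v, sort(unique(v))))
offsets <- c(0, cumsum(apply(ids, 2, max)))[seq_len(M)]
cols <- sweep(ids, 2, offsets, "+")

# Sparse indicator matrix Phi: one row per sample, one column per leaf
Phi <- sparseMatrix(i = rep(seq_len(n), M), j = as.vector(cols), x = 1)
K_phi <- as.matrix(tcrossprod(Phi)) / M  # equals the co-membership kernel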

RF and GBT Kernel Predictors for Regression and Classification
The RF/XGB kernel predictors for regression were obtained by substituting the RF and XGB kernels, respectively, into the KRR predictor defined above, that is,

h_RF kernel(X) = Σ_{i=1}^{n} α_i K_RF(X_i, X), with α = (K_RF + λI_n)^{-1} Y,

and analogously for the XGB kernel. The RF and XGB kernel predictors for classification were obtained in the same way as for regression, by building a regression model with the target classes denoted as {−1, 1} and a class prediction threshold of 0.
The code for the simulation and real life data analysis was developed in the R programming language (R Core Team 2017). For the continuous and binary targets, the ranger (Wright and Ziegler 2017) implementation of RF and the xgboost (Chen et al. 2020) implementation of XGB were used, respectively. The regularization parameter λ was chosen as the minimum value such that the matrix K + λI_n was invertible.
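As an illustration of this choice of λ, a minimal sketch with a grid of our own choosing (not the exact procedure used in our experiments):

# Smallest lambda on a grid for which K + lambda*I_n is invertible.
choose_lambda <- function(K, lambdas = 10^seq(-10, 2)) {
  n <- nrow(K)
  for (lam in lambdas) {
    ok <- tryCatch({ solve(K + lam * diag(n)); TRUE },
                   error = function(e) FALSE)
    if (ok) return(lam)
  }
  stop("no lambda on the grid made K + lambda*I invertible")
}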
In the simulations, all algorithms were applied using their default parameters. To further elucidate the impact of tree depth on the simulation results, we carried out a sensitivity analysis where we doubled the minimum size of the tree terminal node to generate shallower trees for the RF. The doubled minimum tree node size equaled 10 and 2 for regression and classification, respectively. For the XGB, the maximum tree depth was set to 2 in the sensitivity analysis.

Motivating Example
As a motivating example, we show kernel matrices obtained from Fisher's Iris data. The Iris data consist of recordings of three Iris subspecies: Setosa, Versicolor, and Virginica (50 samples each, recorded on 4 numerical features). We compare the RF/GBT kernels and the Laplace kernel for this dataset. The Laplace kernel is defined as

k(X_i, X_j) = exp(−‖X_i − X_j‖₁/σ).

In Figure 1, the RF and GBT kernels are compared with the Laplace kernel for a couple of different values of the parameter σ.
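For reference, the Laplace kernel for the Iris features can be computed in a few lines of R (a sketch assuming the L1 form given above):

# Pairwise Laplace kernel; sigma is the bandwidth parameter.
laplace_kernel <- function(X, sigma) {
  D <- as.matrix(dist(X, method = "manhattan"))  # pairwise L1 distances
  exp(-D / sigma)
}

X <- as.matrix(iris[, 1:4])
K_lap1  <- laplace_kernel(X, sigma = 1)
K_lap10 <- laplace_kernel(X, sigma = 10)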
The similarity between the RF/GBT kernels (obtained as proximity matrices) and the Laplace kernel is assessed by the Mantel statistic, that is, the matrix correlation between two similarity matrices (Legendre and Legendre 2012). In Figure 1, the RF/GBT kernels capture the underlying structure of the data well and the three classes can be clearly distinguished. Similarly, the Laplace kernels also reflect (with varying success) the partitioning of the data into three classes. The Laplace kernel with the higher Mantel statistic with respect to the RF/GBT kernels appears to be the best in terms of target alignment.
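The Mantel statistic used here reduces to a correlation between the off-diagonal entries of two kernel matrices; a minimal sketch (the permutation test offered by, e.g., the vegan package is omitted):

# Matrix correlation between two symmetric similarity matrices.
mantel_stat <- function(K1, K2) {
  cor(K1[lower.tri(K1)], K2[lower.tri(K2)])
}
# e.g., mantel_stat(K_rf, K_lap1), reusing the kernels computed above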
Furthermore, the RF/GBT kernels could characterize a more precise similarity function (Balcan, Blum, and Srebro 2008), leading to more accurate classification results. When the RF/GBT kernels are compared to the Laplace kernel, observations from the same class are more similar (closer) to each other than those from different classes. This is demonstrated by the histograms shown in Figure 2, with the RF/GBT kernel histograms peaking at 1 and 0 for observations from the same class and those from different classes, respectively. In this motivating example we illustrated the potential utility of tree ensemble based kernels using the RF/GBT kernels, and compared them with the Laplace kernel. In more complicated simulated and real life scenarios, tree ensemble based kernels such as the RF/GBT kernels usually outperform the Laplace kernel. The Laplace kernel is isotropic (Balog et al. 2016), whereas the tree ensemble based kernels are anisotropic (Vens and Costa 2011) and thus adapt well to local data features (Chen and Shah 2018). Preliminary experiments with the Laplace kernel confirmed this notion. Therefore, in our manuscript we focus on the investigation of the utility of the RF/GBT kernels in building predictive models for regression and classification.

Simulation
Simulation scenarios for the performance evaluation of the RF/GBT kernels for regression were set up according to previously reported simulation benchmarks for continuous targets, including Friedman (Friedman 1991), Meier 1, Meier 2 (Meier, Van de Geer, and Bühlmann 2009), van der Laan (Van der Laan, Polley, and Hubbard 2007), and Checkerboard (Zhu, Zeng, and Kosorok 2015). These were also adapted for classification.

Simulation Setup
For each simulation scenario, the predictors were simulated from Uniform (Friedman, Meier 1, Meier 2, van der Laan) or Normal (Checkerboard) distributions.
Continuous targets were generated as Y_i = f(X_i) + ε_i, where ε_i denotes the error term. The definitions of f(X_i) for each simulation case are given below.
To generate a binary outcome Y_i, the continuous outcome was first centered by the median M of its marginal distribution to obtain a balanced two-class problem. The binary target was then generated as a Bernoulli variable with success probability p_i = prob(Y_i = 1|X_i), where p_i was calculated from the centered f(X_i) obtained from the continuous models.
To characterize the intrinsic complexity of the classification problems, we also calculated the Bayes error rate from a large sample of the continuous outcomes (n = 10^7) and subsequently applied the formula Error_Bayes = 1 − E[max_j Pr(Y = j|X)], j ∈ {0, 1}, according to James, Witten, and Tibshirani (2013). The j ∈ {0, 1} refer to the class indicators.
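In code, this Monte Carlo estimate is a one-liner once the true conditional probabilities p_i = Pr(Y_i = 1 | X_i) are available from the generative model (a sketch; the simulation of p itself follows the setups below):

# Bayes error = 1 - E[max_j Pr(Y = j | X)], estimated from a large sample.
bayes_error <- function(p) 1 - mean(pmax(p, 1 - p))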
The five functional relationships f (X i ) between the predictors and target for different simulation settings are specified as follows.
1. Friedman. The setup for Friedman was as described in Friedman (1991), with regression function f(X) = 10 sin(πX_1X_2) + 20(X_3 − 0.5)^2 + 10X_4 + 5X_5 and the remaining covariates acting as noise features. The Bayes error rate for the Friedman classification problem is 0.02; it is the least complex problem by this measure among those investigated.
2. Checkerboard. In addition to Friedman, we simulated data from a Checkerboard-like model with strong correlation, as in Scenario 3 of Zhu, Zeng, and Kosorok (2015). The (j, k) component of the covariance matrix Σ of the Normal predictors is equal to 0.9^{|j−k|}. The Bayes error rate for the Checkerboard classification problem is 0.18.
3. van der Laan. This setup was investigated in Van der Laan, Polley, and Hubbard (2007). The Bayes error rate for the van der Laan classification problem is 0.34, making it the most complex by this measure among those investigated.
4. Meier 1. This setup was investigated in Meier, Van de Geer, and Bühlmann (2009).
The Bayes error rate for the Meier 1 classification problem is 0.28.
5. Meier 2. This setup was also investigated in Meier, Van de Geer, and Bühlmann (2009).
The Bayes error rate for the Meier 2 classification problem is 0.19.

We used mean squared error (MSE) and classification accuracy to measure the prediction performance for continuous and binary data, respectively; smaller MSE and higher accuracy are preferred.
For each functional relationship f(X_i) (Friedman, Checkerboard, Meier 1, Meier 2, and van der Laan) and each outcome (continuous or binary), we simulated data from four scenarios with different sample sizes, n = 800 and n = 1600, and numbers of covariates, p = 20 and p = 40. Within each scenario, we simulated 200 datasets, and for each dataset we randomly chose 75% of samples as training data and the remaining 25% as test data.
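Schematically, each scenario can be organized as below (a skeleton; simulate_data stands in for the generators above and the fitting/evaluation steps are elided):

# One scenario: `reps` simulated datasets, each split 75%/25% train/test.
run_scenario <- function(simulate_data, n, p, reps = 200) {
  replicate(reps, {
    d <- simulate_data(n, p)                    # data.frame with target y
    idx <- sample(seq_len(n), size = floor(0.75 * n))
    train <- d[idx, ]; test <- d[-idx, ]
    # ... fit RF/XGB and their kernel counterparts on `train`,
    # ... return test MSE (regression) or accuracy (classification)
    NA_real_                                    # placeholder metric
  })
}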

Simulation Results
The performance of the RF/GBT methods and RF/GBT kernels on test data for the Friedman generative model for regression and classification is shown in Figure 3(a) and (b), respectively. To compare the RF/GBT kernels against RF/GBT directly, we also show box plots of the differences in performance measures between the RF kernel, GBT kernel, and the corresponding RF, GBT methods in Supplementary Figure 1. A reference horizontal line with y-axis value equal to zero was drawn in each plot. The further away the box plot is from the reference line (downward for MSE and upward for accuracy), the better the results of the RF/GBT kernel compared to RF/GBT. The figures showing the box plots of the performance metric differences for Checkerboard, Meier 1, Meier 2, and van der Laan are provided in Supplementary Figures 2-5. For completeness, the overall summary of the performance results from the primary analysis across all setups for continuous and binary targets is provided in Supplementary Tables 1 and 2, respectively.
With respect to the RF/GBT kernel vs. RF/GBT comparison, the kernels generally outperformed RF/GBT for regression. Furthermore, for regression, it tended to be the case that with the same sample size (fixed n), the smaller the signal-to-noise ratio (the larger the value of p), the larger the improvement from adopting the RF/GBT kernel approach over RF/GBT. In addition, with a fixed number of covariates, the results from the RF/GBT kernel were more accurate compared to RF/GBT as the sample size decreased. For classification, the performance was impacted by the target dichotomization, and generally RF/GBT was found comparable with the kernels. Specifically, for the Friedman data, the RF/GBT kernels performed slightly better than RF/GBT (Figure 3(b)). For Meier 1 (Supplementary Figure 3(d)) and Meier 2 (Supplementary Figure 4(d)), the RF kernel was marginally worse than the RF. The XGB kernel for Meier 1 and Meier 2 performed slightly better than XGB. For the Checkerboard (Supplementary Figure 2(d)), the RF kernel and RF performed about the same and the XGB kernel slightly outperformed XGB. For the van der Laan data (Supplementary Figure 5(d)), the RF/GBT and RF/GBT kernel classification performances were about the same.
When comparing the XGB based methods with those based on RF for regression, the XGB based methods performed slightly worse than the RF based methods (Figure 3(a), Supplementary Figures 2(a), 3(a), and 5(a)) across the different simulation scenarios. The Meier 2 dataset yielded comparable performance (Supplementary Figure 4(a)).
The results from the sensitivity analysis for Checkerboard, Meier 1, Meier 2, and van der Laan are shown in sub-figures (e)-(h) of Supplementary Figures 2-5. The corresponding numerical results are given in Supplementary Tables 3 and 4 for regression and classification, respectively. These results were in line with those from the primary analysis.

Regression and Classification (Continuous and Binary Outcome)
The performance of the RF/GBT vs. RF/GBT kernels was evaluated on six benchmark datasets from biopharmaceutical and biomedical applications (three datasets each for regression and classification). The datasets were obtained from the UCI repository: https://archive.ics.uci.edu/ml/index.php (Dua and Graff 2017). The datasets used are summarized in Table 1. Using these datasets, we predicted the continuous and binary targets for the regression and two-class classification tasks, respectively. We added results from additional regression and classification datasets across various areas of application in the Supplementary material. These additional datasets are summarized in Supplementary Table 5.

For the larger datasets (QSAR Bioconcentration and Protein Tertiary Structure), we randomly selected 2000 samples and split them into training and test sets, with 1500 and 500 samples, respectively. We repeated the analysis 200 times to evaluate the performance of the RF(GBT) and RF(GBT) kernel algorithms.
For the other datasets, we split the data into training and test sets in the ratio 3 to 1 and repeated the analysis 200 times.
Similarly, for classification we randomly split the data with the binary target. In the case of imbalanced data, we fixed the minority class (i.e., the class with the smaller number of samples) and then randomly sampled from the majority class (i.e., the class with the larger number of samples) an equal number of samples as in the minority class to create a balanced dataset. As for regression, we split the data into training and test sets in the ratio 3 to 1 and evaluated the performance of the RF(GBT) and RF(GBT) kernels in a binary classification setting. We repeated the analysis 200 times.
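The balancing step can be sketched as follows (our illustration; d is a data.frame and target names the binary class column):

# Downsample the majority class to the minority class size.
balance_classes <- function(d, target) {
  counts <- table(d[[target]])
  minority <- names(counts)[which.min(counts)]
  majority <- names(counts)[which.max(counts)]
  i_min <- which(d[[target]] == minority)
  i_maj <- sample(which(d[[target]] == majority), length(i_min))
  d[c(i_min, i_maj), ]
}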
As in the simulations, in our experiments with the real life data we used default parameters to train the tree-based ensembles.
Results from the real life problems mimic those obtained in simulation (Figure 4 for regression and Figure 5 for classification). In the regression setting, the RF/XGB kernels were generally competitive with RF/XGB across the datasets investigated. For classification, also in agreement with the simulation results, the performance was impacted by the dichotomization of the targets. Across the datasets, the RF/XGB kernels were comparable to, and in some cases slightly worse than, the RF/XGB algorithms.
The results obtained from the real life datasets for regression and classification are summarized in Tables 2 and 3, respectively. The results from the additional datasets are provided in Supplementary Tables 6 and 7 and Supplementary Figures 6-15. These results are also in accordance with those obtained from the datasets provided in the main text.

Discussion and Conclusions
Kernel generation via feature space partitioning algorithms has been viewed as a byproduct of the tree ensembles, and it has often been overlooked and underutilized in kernel learning (Marcus 2017). We investigated a kernel learning approach where the tree ensemble based kernels are plugged into regularized linear algorithms. In our contribution we systematically evaluated the RF and XGB kernel prediction models in conjunction with ridge regression in a comprehensive simulation study that included regression and classification targets. Although the tree ensemble based kernels are furnished by their respective tree ensembles, the prediction model is built in a different way than that of the tree ensembles. The difference lies in how the kernel and tree ensemble algorithms aggregate across the partitions of the feature space emanating from the recursive tree partitioning. The predictor obtained by using an ensemble based kernel is obtained from the kernel and a linear learning method (model), as opposed to averaging across the single trees (Ren et al. 2015; Balog et al. 2016); this is sometimes referred to as global refinement (see, e.g., Ren et al. 2015 and Gogic, Ahlberg, and Pandzic in press). In our investigation, RF/XGB underlies both RF/XGB and the RF/XGB kernels, where the kernel prediction model is a linear (ridge regression) model that capitalizes on the RF/XGB kernels.

In our simulations we show that for cases with a larger number of noisy features, tree ensemble kernel learning is superior to the ensembles themselves. This beneficial effect was found to be particularly consistent for continuous targets in regression. The simple ridge regression model that follows the kernel construction was found to be less prone to the noisy features in the simulation scenarios investigated. For classification we used the kernel ridge regression with target classes denoted as -1 and 1. As the dichotomization of the continuous target leads to loss of information (Fedorov, Mannino, and Zhang 2008; Friedman, Hastie, and Tibshirani 2009), the performance results were less pronounced for classification than for regression. There are other potential options for linear models that could have been used here. To this end, we also tried the regularized kernel logistic regression as implemented in the gelnet package (Sokolov et al. 2016) with the RF and RF kernel, yielding performance comparable to each other and to that of the kernel ridge regression.
The RF based models (RF and the RF kernel) performed as well as or slightly outperformed the XGB based methods (XGB and the XGB kernel) in our experiments. However, there is no free lunch in machine learning and, consequently, no universally optimal kernel (Wolpert 1996; Davies and Ghahramani 2014; Fernandez-Delgado et al. 2014). The tree based ensemble kernels should be competitive in situations where the data generating mechanism is conducive to recursive partitioning (Fan et al. 2020), for example, in the presence of feature interactions, as frequently found in biomedical and biomarker applications (Boulesteix et al. 2012). Other recent examples where the RF kernel has shown promise are studies of image classification in hyperspectral imaging (Zafari, Zurita-Milla, and Izquierdo-Verdiguier 2019) and face alignment from imaging data (Gogic, Ahlberg, and Pandzic in press).
In addition to the simulations, we have also shown that in real life applications the RF/XGB kernels in conjunction with ridge regression are competitive with their respective ensembles. Ridge regression for big datasets poses a computational challenge, with the bottleneck of inverting a large regularized kernel matrix (You et al. 2018). Recently, divide and conquer (DC) approaches have been proposed to address this issue (You et al. 2018). Furthermore, as the kernel matrix can be interpreted as a similarity matrix, a local approach to the KRR can be naturally applied. The kernel matrix interpretation can also be straightforwardly used to define prototypical (archetypal) points (observations), with insights into the geometry of a given problem (Pekalska, Paclik, and Duin 2001; Balcan, Blum, and Srebro 2008; Bien and Tibshirani 2011; Brophy and Lowd 2020). Research in this direction is germane to interpretable machine learning (Brophy and Lowd 2020; Bien and Tibshirani 2011).
Tree ensemble based kernel methods can be further adapted to the survival target (Ishwaran and Lu 2019; Wang et al. 2020), where the potential censoring of the data needs to be addressed. Elucidation of the properties of the survival target has attracted interest recently, driven by real life applications, particularly in the biomedical area (Ishwaran and Lu 2019). In related work, Chen investigated the survival forest kernel (Chen 2019) and leveraged it to warm-start kernel learning (Chen 2020). We carried out preliminary experiments with the RF kernel and the survival target, with results similar to those presented here for the continuous target (regression).
An in-depth investigation of these topics will be communicated in a separate contribution.
We investigated the RF and XGB kernels as facilitators of tree-based ensemble kernel learning. However, as the notion of kernel generation holds for feature space partitioning algorithms in general (Fan et al. 2020), other potential tree ensemble based kernels include those obtained from recent extensions of the RF, for example, oblique RFs, rotation forests, or mixup forests (Menze et al. 2011; Rodriguez, Kuncheva, and Alonso 2006; Rodriguez et al. 2020). Further understanding of the kernels ensuing from Bayesian approaches such as Bayesian random forests (e.g., Mondrian forests (Balog et al. 2016)), Bayesian additive regression trees (BART) (Linero 2017), and other nonparametric Bayesian partitions (Fan et al. 2020) is another interesting topic for future research. The Bayesian methods naturally provide prediction intervals to quantify the uncertainty around the prediction estimates, which can be harnessed in subsequent decision making.

Figure 1. RF kernel, XGB kernel, and the Laplace kernel for the Fisher Iris dataset. Corr denotes the matrix correlation given by the Mantel statistic and sigma is the σ parameter of the Laplace kernel.

Figure 2. Distributions of values from the RF kernel, XGB kernel, and the Laplace kernel with σ = 1 and 10 for the Fisher Iris dataset.
Figure 3. Comparison of MSE and classification accuracy using RF, RF kernel, XGB, and XGB kernel, with default and nondefault setups in RF and XGB, for data simulated from the Friedman setting, panels (a)-(d). The results for the default setup are shown in Figure 3(a)-(b) and for the sensitivity analysis in Figure 3(c)-(d); the performance metrics for the default setup for regression and classification are shown in Figure 3(a) and Figure 3(b), respectively.

Figure 4. Comparison of MSE using RF, RF kernel, XGB, and XGB kernel for the real data.

Figure 5. Comparison of accuracy using RF, RF kernel, XGB, and XGB kernel for the real data.

Table 1. Summary of the real life datasets.

Table 2. Results from the real life datasets for regression, MSE mean (s.d.).

Table 3. Results from the real life datasets for classification, accuracy mean (s.d.).