Fast and Exact Leave-One-Out Analysis of Large-Margin Classifiers

Abstract
Motivated by the Golub–Heath–Wahba formula for ridge regression, we first present a new leave-one-out lemma for the kernel support vector machines (SVM) and related large-margin classifiers. We then use the lemma to design a novel and efficient algorithm, named “magicsvm,” for training the kernel SVM and related large-margin classifiers and computing the exact leave-one-out cross-validation error. With “magicsvm,” the computational cost of the leave-one-out analysis is of the same order as that of fitting a single SVM on the training data. Based on extensive simulations and benchmark examples, we show that “magicsvm” is much faster than the state-of-the-art SVM solvers. The same idea is also used to speed up the computation of the V-fold cross-validation of the kernel classifiers.


Introduction
Among many existing classification methods, the kernel support vector machine (SVM, Cortes and Vapnik 1995; Vapnik 1995, 1999) is widely recognized as one of the most competitive classifiers. With extensive numerical studies, Fernández-Delgado et al. (2014) declared that the kernel SVM is one of the best among hundreds of popular classifiers, in the same league as random forest, boosting ensemble, and neural nets. The statistical view of the SVM reveals its connection to nonparametric function estimation in a reproducing kernel Hilbert space (RKHS, Hastie, Tibshirani, and Friedman 2009), which also suggests a unified derivation of many kernel classifiers based on a penalized loss formulation. Let $y = 1, -1$ denote the class label in a binary classification problem. Given a random sample $(y_i, x_i)_{i=1}^n$, the kernel SVM can be defined as a function estimation problem
\[ \min_{f \in H_K} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(1 - y_i f(x_i)\bigr)_+ + \lambda \|f\|_{H_K}^2, \tag{1} \]
where $(1 - u)_+ \equiv \max(1 - u, 0)$ is the so-called hinge loss and $f$ is found within an RKHS $H_K$ with reproducing kernel $K$. The classification rule for $x$ is $\mathrm{sgn}(f(x))$. One can replace the hinge loss with other classification calibrated margin-based loss functions (Bartlett, Jordan, and McAuliffe 2006) in problem (1), and the resulting classifier is
\[ \min_{f \in H_K} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i f(x_i)\bigr) + \lambda \|f\|_{H_K}^2. \tag{2} \]
Some popular margin-based loss functions in the literature include the logistic regression loss with $L(u) = \log(1 + e^{-u})$ and the squared hinge loss with $L(u) = (\max(1 - u, 0))^2$, among others. For ease of exposition, we do not include the intercept term throughout the article.
CONTACT Hui Zou zouxx019@umn.edu School of Statistics, the University of Minnesota, Minneapolis, MN 55455. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/TECH.
In this article, we consider the leave-one-out analysis of the kernel classifier defined in problem (2). Specifically, we use $\hat f_\lambda$ to denote the kernel classifier in problem (2), and use $\hat f_\lambda^{[-i]}$ to denote the leave-one-out classifier,
\[ \hat f_\lambda^{[-i]} = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j \ne i} L\bigl(y_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2. \]
We repeat the above for $i = 1, 2, \ldots, n$. The leave-one-out analysis is closely tied to jackknife resampling, which was proposed for bias and variance estimation of an estimator. Although bootstrap has replaced jackknife in statistical inference (Efron 1982), the leave-one-out analysis is still widely used in assessing the predictive accuracy of a model, that is, the cross-validation method. As early as 1969, it was shown that the leave-one-out cross-validation yields a nearly unbiased estimator of the prediction error (Luntz and Brailovsky 1969). Golub, Heath, and Wahba (1979) studied the leave-one-out analysis of ridge regression. Their result directly motivated us to conduct the research in this article, so it is necessary to review their work. The ridge regression is $\hat f_\lambda(x) = x^T \hat\beta_\lambda$ with
\[ \hat\beta_\lambda = \arg\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_2^2 = (X^T X + \lambda I)^{-1} X^T y, \]
so that the fitted values are $Hy$ with $H = X (X^T X + \lambda I)^{-1} X^T$, and $X$ is the matrix whose $i$th row is $x_i^T$. Consider the same ridge regression with the $i$th observation removed:
\[ \hat f_\lambda^{[-i]}(x) = x^T \hat\beta_\lambda^{[-i]}, \qquad \hat\beta_\lambda^{[-i]} = \arg\min_{\beta} \; \sum_{j \ne i} (y_j - x_j^T \beta)^2 + \lambda \|\beta\|_2^2. \]
Golub, Heath, and Wahba (1979) showed the following Golub-Heath-Wahba formula:
\[ \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat f_\lambda^{[-i]}(x_i)\bigr)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat f_\lambda(x_i)}{1 - h_{ii}} \right)^2, \tag{3} \]
where $h_{ii}$ is the $i$th diagonal element of matrix $H$. It is important to see that Equation (3) follows directly from the leave-one-out residual formula
\[ y_i - \hat f_\lambda^{[-i]}(x_i) = \frac{y_i - \hat f_\lambda(x_i)}{1 - h_{ii}}, \]
which can be extended to a family of linear smoothers including smoothing splines and kernel ridge regression (Wahba and Wold 1975; Wahba 1977; Craven and Wahba 1978; Hastie, Tibshirani, and Friedman 2009). In other words, the Golub-Heath-Wahba formula holds for a wide class of nonparametric regression methods as well. Equation (3) was recently used to improve Mallows's $C_p$ (Rosset and Tibshirani 2020). The leave-one-out analysis was also used as a predictive inference tool in Barber et al. (2021).
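The leave-one-out residual formula for ridge regression is easy to check numerically. The following is a minimal sketch, not the authors' code: it assumes the unnormalized ridge objective $\|y - X\beta\|^2 + \lambda\|\beta\|^2$ without intercept, uses simulated data, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 40, 5, 0.7          # illustrative sizes and ridge parameter
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Minimise ||y - X b||^2 + lam * ||b||^2 (this scaling is an assumption)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Shortcut: a single fit plus the diagonal of the hat matrix H.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
loo_shortcut = resid / (1.0 - np.diag(H))

# Brute force: refit n times, each time without observation i.
loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = ridge_fit(X[keep], y[keep], lam)
    loo_brute[i] = y[i] - X[i] @ b

max_gap = np.max(np.abs(loo_shortcut - loo_brute))
cv_shortcut = np.mean(loo_shortcut ** 2)   # right-hand side of the formula
cv_brute = np.mean(loo_brute ** 2)         # left-hand side of the formula
```

The two sets of leave-one-out residuals agree up to floating-point error, so the brute-force loop of n refits is unnecessary for ridge regression.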
The generalized cross-validation criterion was further proposed by using the average of the $h_{ii}$ in place of each $h_{ii}$ in the Golub-Heath-Wahba formula, to make the criterion more stable and computationally easier. The generalized cross-validation criterion has been widely used to select the smoothing parameter in generalized additive models; see discussions in Hastie and Tibshirani (1990), Wahba (1990) and implementations in the R packages gam (Hastie 2020), gss (Gu 2014), and mgcv (Wood 2021), among others.
A close inspection of the derivation of the Golub-Heath-Wahba formula reveals that its proof critically depends on two facts: (a) the loss function is the squared error loss, and (b) the regression method is a special linear smoother with the form $\hat y = S_\lambda y$, where $S_\lambda$ is the smoother matrix that only depends on the prechosen parameter $\lambda$ and the predictors, and $S_\lambda$ has a self-stable property (Fan et al. 2020). In fact, the Golub-Heath-Wahba formula does not hold for many powerful regression models such as random forest, gradient boosting and the SVM regression model because they are not self-stable linear smoothers. This is why the Golub-Heath-Wahba formula has long been considered a special property of some linear smoothers in regression (and not even all linear smoothers). In the kernel classifier case, the squared error loss is further replaced with a margin-based loss like the SVM hinge loss. Wahba (1999) proposed a generalized approximate cross-validation (GACV) to estimate the leave-one-out cross-validation error of SVM; however, GACV's derivation is based on an approximate Taylor expansion and a smooth approximation of the hinge loss. Thus, it basically follows the derivation of GACV for regression methods. It remains an open problem how to extend the Golub-Heath-Wahba formula to the kernel classifiers.
In Section 2, we develop a new leave-one-out lemma that generalizes the spirit of the Golub-Heath-Wahba formula to the kernel SVM and related classifiers. In Section 3, we apply the lemma to design a novel algorithm named magicsvm for computing the kernel classifier and its n leave-one-out cross-validated variants in the same order of computations as the single kernel classifier on the whole training data. We present numerical examples in Section 4. The technical proofs are relegated to an appendix.

The Leave-One-Out Lemma
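Lemma 1. Given a nonnegative convex loss function $L(\cdot)$, consider the corresponding margin-based classifier:
\[ \hat f_\lambda = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j=1}^{n} L\bigl(y_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2 \tag{4} \]
and its leave-one-out solution
\[ \hat f_\lambda^{[-i]} = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j \ne i} L\bigl(y_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2. \]
Create a response vector $\tilde y^{[i]}$ by letting $\tilde y^{[i]}_i = 0$ and $\tilde y^{[i]}_j = y_j$ for all $j \ne i$. Then we have
\[ \hat f_\lambda^{[-i]} = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j=1}^{n} L\bigl(\tilde y^{[i]}_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2. \tag{5} \]
In other words, $\hat f_\lambda^{[-i]}$ can be obtained by using Equation (4) and replacing $y_i$ with zero while keeping all other variables unchanged.
Lemma 1 can be easily proven by using the fact that $L(\tilde y^{[i]}_i f(x_i)) = L(0)$ is a constant that does not depend on $f$, so the proof details are omitted. In the next section we explain how to exploit this property to design a new algorithm for doing leave-one-out analysis of the kernel classifier.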
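Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only and is not the magicsvm implementation: it assumes the kernel logistic loss, a Gaussian kernel with an arbitrary bandwidth, the finite-dimensional form of the problem given by the representer theorem, and a simple majorization step as the solver. The function names `rbf_kernel` and `fit_kernel_logistic` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gaussian kernel matrix; the bandwidth gamma is an arbitrary choice."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def fit_kernel_logistic(K, y, lam, n_total, iters=500):
    """Minimise (1/n_total) * sum_j log(1 + exp(-y_j K_j'a)) + lam * a'Ka."""
    m = K.shape[0]
    kappa = 4.0  # the logistic loss derivative is (1/4)-Lipschitz
    step = np.linalg.inv(K / (n_total * kappa) + 2.0 * lam * np.eye(m))
    a = np.zeros(m)
    for _ in range(iters):
        u = y * (K @ a)
        z = y * (-1.0 / (1.0 + np.exp(u))) / n_total   # y_j * L'(u_j) / n
        a = a - step @ (z + 2.0 * lam * a)             # majorization step
    return a

rng = np.random.default_rng(1)
n, lam = 30, 0.1
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
K = rbf_kernel(X)

i = 7                               # observation to leave out
keep = np.arange(n) != i

# Brute force: drop observation i and refit on the remaining n - 1 points.
a_drop = fit_kernel_logistic(K[np.ix_(keep, keep)], y[keep], lam, n_total=n)
f_drop = K[:, keep] @ a_drop        # leave-one-out classifier at all n inputs

# Lemma 1: keep all n points but set the i-th response to zero.
y_tilde = y.copy()
y_tilde[i] = 0.0
a_zero = fit_kernel_logistic(K, y_tilde, lam, n_total=n)
f_zero = K @ a_zero

max_gap = np.max(np.abs(f_drop - f_zero))
```

Both fits produce the same function values at every training input, and the coefficient attached to the zeroed observation vanishes. Note that both fits keep the factor 1/n, which is what makes the two problems identical up to the constant L(0)/n.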
We further recognize that Lemma 1 can be naturally generalized to the leave-m-out validation, m > 1.It is worth noting that there is no leave-m-out generalization of the Golub-Heath-Wahba formula for ridge regression.So, this can be seen as an advantage of the leave-one-out lemma in this article.
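Lemma 2. Given a nonnegative convex loss function $L(\cdot)$, consider the corresponding margin-based classifier $\hat f_\lambda$ that is defined in problem (4). Let $[v]$ denote a subset of the training data that contains $m$ observations. Let $\hat f_\lambda^{[-v]}$ denote the fitted margin-based classifier after deleting the set $[v]$ from the training set, that is,
\[ \hat f_\lambda^{[-v]} = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j \notin [v]} L\bigl(y_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2. \]
Create a response vector $\tilde y^{[v]}$ by letting $\tilde y^{[v]}_j = 0$ for all $j \in [v]$ and $\tilde y^{[v]}_j = y_j$ otherwise. Then we have
\[ \hat f_\lambda^{[-v]} = \arg\min_{f \in H_K} \; \frac{1}{n} \sum_{j=1}^{n} L\bigl(\tilde y^{[v]}_j f(x_j)\bigr) + \lambda \|f\|_{H_K}^2. \]
Lemma 2 can be used for computing V-fold cross-validation. The details are given in the next section.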

Motivation
In this section, we show that the leave-one-out lemma enables us to further boost the computing efficiency of the kernel margin classifiers including the kernel SVM. We derive a new algorithm for the kernel classifier with a smooth loss function and then extend it to handle the kernel SVM. Thus, our discussion mainly focuses on the SVM. Let $K(\cdot, \cdot)$ be the reproducing kernel function of $H_K$. The representer theorem of reproducing kernels (Wahba 1990) indicates that the solution to problem (1) is $\hat f(x) = \sum_{i=1}^{n} \alpha^{\mathrm{SVM}}_i K(x_i, x)$ with
\[ \alpha^{\mathrm{SVM}} = \arg\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(1 - y_i K_i^T \alpha\bigr)_+ + \lambda \alpha^T K \alpha, \tag{6} \]
where $K$ is the kernel matrix such that the $(i, j)$th element is $K(x_i, x_j)$ and $K_i$ is the $i$th column of $K$. The resulting classification rule for $x_{\mathrm{new}}$ is the sign of $\sum_{i=1}^{n} \alpha^{\mathrm{SVM}}_i K(x_i, x_{\mathrm{new}})$. We assume the kernel matrix has full rank.
It is interesting to note that the current state-of-the-art algorithms typically solve the dual of problem (6) (Hastie, Tibshirani, and Friedman 2009).The dual problem is solved either by an interior point algorithm (Vanderbei 1999), which is implemented in an R package kernlab (Karatzoglou et al. 2004), or by being broken down into a series of smaller problems, namely sequential minimal optimization (Platt 1999;Fan, Chen, and Lin 2005), which is implemented in libsvm (Chang and Lin 2011) and interfaced in an R package e1071 (Meyer et al. 2019).
The "standard" state-of-the-art approaches fit the kernel SVM on the training data and then on each of the leave-one-out variants separately. As a result, a "standard" approach to the leave-one-out analysis applies the same base algorithm (n + 1) times, and thus the whole computation time is roughly (n + 1) times as large as the time of a single fit. When n is not small (e.g., n ≥ 50), the "standard" approach is considered to be too expensive to be useful in practice. As a shortcut, people often do V-fold cross-validation with V = 5 or V = 10.
Based on the leave-one-out lemma, we integrate the training and tuning of the kernel SVM such that the whole computation time is of the same order as that of fitting one classifier on the training set. Therefore, the leave-one-out analysis is not computationally prohibitive for the kernel classifiers. Note that the current state-of-the-art algorithms cannot benefit from the leave-one-out lemma because they work in the dual space. To apply the lemma, we develop a new algorithm as explained in the following subsection.

Exact Finite Smoothing Principle for SVM
Finding an efficient algorithm for the kernel SVM is interesting as the kernel SVM is the most representative example among the margin-based classifiers. Computing the SVM directly from problem (6) is typically hard since the objective is nonsmooth. This is why the current state-of-the-art algorithms solve the dual problem, not problem (6). The leave-one-out lemma, however, applies to the primal form of the SVM. After a careful examination of problem (6), we prove that we can compute the exact SVM solution to problem (6) by solving a finite number of smoothed versions of problem (6) followed by a projection step.
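For any $\delta > 0$, we define a $\delta$-smoothed hinge loss function
\[ L_\delta(u) = \begin{cases} 0, & u \ge 1 + \delta, \\ \dfrac{1}{4\delta}\,[u - (1 + \delta)]^2, & 1 - \delta < u < 1 + \delta, \\ 1 - u, & u \le 1 - \delta. \end{cases} \]
Treat $L_\delta(\cdot)$ as a new margin-based convex loss and the corresponding classifier in RKHS $H_K$ is
\[ \alpha^{\delta} = \arg\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} L_\delta\bigl(y_i K_i^T \alpha\bigr) + \lambda \alpha^T K \alpha. \tag{7} \]
It is natural to expect that as $\delta$ approaches zero, $L_\delta$ becomes closer and closer to the hinge loss, and consequently $\alpha^{\delta}$ approaches $\alpha^{\mathrm{SVM}}$. Proposition 1 further quantifies the quality of $\alpha^{\delta}$ as an approximate solution of the SVM in terms of the objective value, where $Q(\alpha)$ denotes the objective function of problem (6). When $\delta / (4 Q(\alpha^{\delta})) < \epsilon$, where $\epsilon$ is a prespecified small constant, say $\epsilon = 10^{-3}$, the optimal SVM objective value is within the interval $Q(\alpha^{\delta})[1 - \epsilon, 1]$. From a practical perspective, $Q(\alpha)$ essentially reaches the optimal SVM objective at $\alpha^{\delta}$.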
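To make the role of δ concrete, the following sketch evaluates a smoothed hinge loss on a grid. It assumes the piecewise form that is linear below 1 − δ, quadratic with coefficient 1/(4δ) on the band (1 − δ, 1 + δ), and zero above 1 + δ; this form is an assumption chosen to agree with κ = 2δ and with the δ/4 gap used in the text. The function names are illustrative.

```python
import numpy as np

def hinge(u):
    return np.maximum(1.0 - u, 0.0)

def smoothed_hinge(u, delta):
    """Linear below 1 - delta, quadratic on the band, zero above 1 + delta."""
    u = np.asarray(u, dtype=float)
    quad = (u - (1.0 + delta)) ** 2 / (4.0 * delta)
    return np.where(u <= 1.0 - delta, 1.0 - u,
                    np.where(u >= 1.0 + delta, 0.0, quad))

def smoothed_hinge_grad(u, delta):
    """Derivative of the smoothed hinge; continuous with slope 1/(2*delta) on the band."""
    u = np.asarray(u, dtype=float)
    slope = (u - (1.0 + delta)) / (2.0 * delta)
    return np.where(u <= 1.0 - delta, -1.0,
                    np.where(u >= 1.0 + delta, 0.0, slope))

u = np.linspace(-2.0, 4.0, 6001)           # grid with spacing 0.001
deltas = (0.5, 0.1, 0.01)

# Gap between the smoothed loss and the hinge loss for each delta.
gaps = {d: smoothed_hinge(u, d) - hinge(u) for d in deltas}

# Largest finite-difference slope of the derivative, to compare with 1/(2*delta).
lipschitz = {d: np.max(np.abs(np.diff(smoothed_hinge_grad(u, d))) / np.diff(u))
             for d in deltas}
```

Under this assumed form the smoothed loss never falls below the hinge loss and exceeds it by at most δ/4 (attained at u = 1), which is why the smoothed and exact objectives differ by no more than δ/4.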
Furthermore, we can take another step to obtain the exact SVM solution from $\alpha^{\delta}$. For the discussion, we assume there exists some $i$ such that $y_i K_i^T \alpha^{\mathrm{SVM}} \ne 1$. The assumption basically says that not all the training points are support vectors. We can easily verify this assumption before doing any serious computation. If $y_i K_i^T \alpha^{\mathrm{SVM}} = 1$ for all $i$, then $\alpha^{\mathrm{SVM}}$ must be $K^{-1} y$ for a positive definite kernel. By the Karush-Kuhn-Tucker (KKT) conditions of problem (6), one can see this happens if and only if $y_i w_i \in [0, \frac{1}{2n\lambda}]$ for all $i$, where $w_i$ is the $i$th element of $K^{-1} y$. By this simple check step, we either know the SVM solution is indeed $K^{-1} y$ or the assumption holds. In the latter case, we continue to use the following procedure.
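We first need to define several quantities. Define $E_0 = \{i : \ldots$ … The message of Lemma 4 is that the projected solution $\tilde\alpha^{\delta}$ will actually equal $\alpha^{\mathrm{SVM}}$ once $\delta$ is less than the threshold given in equation (8).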
Lemma 4. Suppose there exists some $i$ such that $y_i K_i^T \alpha^{\mathrm{SVM}} \ne 1$. It holds that $\tilde\alpha^{\delta} = \alpha^{\mathrm{SVM}}$ as long as $\delta$ is smaller than the threshold given in equation (8).
The threshold in theory may well depend on the training set and is unknown before we have the SVM solution. This issue can be handled by the following computation procedure. We solve problem (7) on a decreasing sequence of $\delta$ values, $\delta^{(d)}$, with $\delta^{(d+1)} = \tau \delta^{(d)}$ and $0 < \tau < 1$. For example, $\tau = 1/8$ is used in our implementation. After obtaining $\alpha^{\delta^{(d)}}$ we solve problem (9) to get $\tilde\alpha^{\delta^{(d)}}$ and then check whether $\tilde\alpha^{\delta^{(d)}}$ satisfies the KKT conditions of the SVM problem (6). If so, then it is the exact SVM solution. If not, we move on to the next $\delta$. Lemma 4 guarantees that the iterative process will terminate within a finite number of iterations. In our experiments, the iterative process stops after a few iterations. However, it remains an open theoretical question to quantify the dependence of the number of iterations on the sample size.
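We now develop an efficient algorithm for computing $\alpha^{\delta^{(d)}}$. We observe that the first-order derivative of $L_\delta(\cdot)$ is Lipschitz continuous:
\[ |L_\delta'(a) - L_\delta'(b)| \le \frac{1}{\kappa}\,|a - b| \quad \text{for all } a, b, \tag{10} \]
where $\kappa = 2\delta$. For a given $\delta$, we propose to solve problem (7) using the accelerated proximal gradient descent (Parikh and Boyd 2014). Define $\alpha^{(1)}$ to be an initial value. For each $k = 1, 2, \ldots$, the proximal gradient method updates $\alpha^{(k+1)}$ by
\[ \alpha^{(k+1)} = \arg\min_{\alpha} \; \bigl(z^{(k)}\bigr)^T K (\alpha - \alpha^{(k)}) + \frac{1}{2n\kappa} (\alpha - \alpha^{(k)})^T K K (\alpha - \alpha^{(k)}) + \lambda \alpha^T K \alpha, \tag{11} \]
where $z^{(k)}$ is an $n$-vector whose $i$th element is $y_i L_\delta'(y_i K_i^T \alpha^{(k)})/n$. It is easy to see that
\[ \alpha^{(k+1)} = \alpha^{(k)} - P_\lambda(K)^{-1} \bigl(K z^{(k)} + 2\lambda K \alpha^{(k)}\bigr), \quad \text{where } P_\lambda(K) = \frac{1}{n\kappa} K K + 2\lambda K. \]
For a fixed sample size $n$ and kernel matrix $K$, the proximal gradient method requires $O(1/(\epsilon\kappa))$ iterations of the update (11) to achieve a prescribed precision $\epsilon$ with regard to the objective function.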
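We can further boost the convergence rate using Nesterov's acceleration (Nesterov 1983, 2005, 2013; Beck and Teboulle 2009). Construct a sequence, $r_k$, such that $r_1 = 1$ and $r_{k+1} = \bigl(1 + \sqrt{1 + 4 r_k^2}\bigr)/2$. Take $\alpha^{(0)}$ and $\alpha^{(1)}$ as initial values. For each $k = 1, 2, \ldots$, we solve $\alpha^{(k+1)}$ as
\[ \alpha^{(k+1)} = \bar\alpha^{(k)} - P_\lambda(K)^{-1} \bigl(K \bar z^{(k)} + 2\lambda K \bar\alpha^{(k)}\bigr), \]
where $\bar\alpha^{(k)} = \alpha^{(k)} + \frac{r_k - 1}{r_{k+1}} (\alpha^{(k)} - \alpha^{(k-1)})$ and the $i$th element of $\bar z^{(k)}$ is $y_i L_\delta'(y_i K_i^T \bar\alpha^{(k)})/n$. The Nesterov-accelerated algorithm needs $O((\epsilon\kappa)^{-1/2})$ iterations, which is quadratically faster than the algorithm without Nesterov's acceleration. We note that the complexity of the update step (11) is $O(n^2)$. The genuine bottleneck of the algorithm is the inversion of matrix $P_\lambda(K)$, whose complexity is $O(n^3)$.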
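The plain and accelerated iterations can be compared on a toy problem. The sketch below is not the magicsvm implementation. It assumes the piecewise-quadratic smoothed hinge with κ = 2δ, takes the update to be the majorization step implied by the Lipschitz bound (written in the reduced form where a common factor K is cancelled, which requires K to be invertible), and uses Beck and Teboulle's momentum weights with the first iterate obtained by a plain step. All function names, the Gaussian kernel, and the simulated data are illustrative.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def smoothed_hinge(u, delta):
    quad = (u - (1.0 + delta)) ** 2 / (4.0 * delta)
    return np.where(u <= 1.0 - delta, 1.0 - u,
                    np.where(u >= 1.0 + delta, 0.0, quad))

def smoothed_hinge_grad(u, delta):
    slope = (u - (1.0 + delta)) / (2.0 * delta)
    return np.where(u <= 1.0 - delta, -1.0,
                    np.where(u >= 1.0 + delta, 0.0, slope))

def objective(alpha, K, y, lam, delta):
    f = K @ alpha
    return np.mean(smoothed_hinge(y * f, delta)) + lam * alpha @ f

def solve_smoothed_svm(K, y, lam, delta, iters, accelerate):
    n = K.shape[0]
    kappa = 2.0 * delta
    # Inverted once and reused by every iteration (the O(n^3) step).
    step = np.linalg.inv(K / (n * kappa) + 2.0 * lam * np.eye(n))
    alpha = np.zeros(n)      # current iterate
    bar = alpha.copy()       # extrapolated point fed to the update
    r = 1.0
    for _ in range(iters):
        z = y * smoothed_hinge_grad(y * (K @ bar), delta) / n
        new = bar - step @ (z + 2.0 * lam * bar)      # O(n^2) update
        if accelerate:
            r_next = (1.0 + np.sqrt(1.0 + 4.0 * r * r)) / 2.0
            bar = new + ((r - 1.0) / r_next) * (new - alpha)
            r = r_next
        else:
            bar = new
        alpha = new
    return alpha

rng = np.random.default_rng(2)
n, lam, delta = 40, 0.1, 0.1
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
K = rbf_kernel(X)

alpha_plain = solve_smoothed_svm(K, y, lam, delta, iters=2000, accelerate=False)
alpha_fast = solve_smoothed_svm(K, y, lam, delta, iters=2000, accelerate=True)
q_start = objective(np.zeros(n), K, y, lam, delta)
q_plain = objective(alpha_plain, K, y, lam, delta)
q_fast = objective(alpha_fast, K, y, lam, delta)
```

Both variants reach the same smoothed objective value here; the only O(n³) operation is the single matrix inversion, while each iteration costs O(n²).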
The projected solution $\tilde\alpha^{\delta^{(d)}}$ can be computed with essentially the same proximal gradient descent algorithm, in which we use an additional projection step (Parikh and Boyd 2014) to handle the equality constraints during the iterations.

The Integrated Algorithm for Computing SVM and the Leave-One-Out Analysis
We have shown that the exact SVM solution of problem (6) can be obtained by solving a finite sequence of problem (7) with smooth losses. In this section, we shall show that, with the accelerated proximal gradient descent as the base algorithm, Lemma 1 enables us to drastically cut down the whole computation time of fitting SVM on the training set and its leave-one-out variants.
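Let us consider the "standard" approach: the leave-i-out solution is
\[ \alpha^{[-i]} = \arg\min_{\alpha \in \mathbb{R}^{n-1}} \; \frac{1}{n} \sum_{j \ne i} L\bigl(y_j (K^{[-i]}_j)^T \alpha\bigr) + \lambda \alpha^T K^{[-i]} \alpha \]
for each $i = 1, \ldots, n$, where $K^{[-i]}$ is the kernel matrix with the $i$th row and column removed and $K^{[-i]}_j$ is the column of $K^{[-i]}$ that corresponds to the $j$th observation. As $K^{[-i]}$ differs for each $i$, $P_\lambda(K^{[-i]})$ needs to be inverted individually, and each inversion requires $O(n^3)$ operations.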
Based on Lemma 1 we know that $\alpha^{[-i]}$ can be obtained via
\[ \alpha^{[-i]} = \arg\min_{\alpha \in \mathbb{R}^{n}} \; \frac{1}{n} \sum_{j=1}^{n} L\bigl(\tilde y^{[i]}_j K_j^T \alpha\bigr) + \lambda \alpha^T K \alpha. \tag{12} \]
Therefore, for each $i$, we construct the response $\tilde y^{[i]}$ as instructed by Lemma 1, and we subsequently use the accelerated proximal gradient descent to solve problem (12). It is critically important to observe that the same kernel matrix $K$ appears in all of these slightly different versions of problem (12), which share the same predictors and differ only in the response $\tilde y^{[i]}$. Thus, we only invert $P_\lambda(K)$ once and store it, which avoids inverting an $(n - 1) \times (n - 1)$ matrix $n$ times.
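The saving from sharing one inverse can be sketched for a smooth loss. The example below is a hedged illustration rather than the package's algorithm: it assumes the squared hinge loss (κ = 1/2), a Gaussian kernel, simulated data, and the plain (non-accelerated) majorization step, and it stacks the n modified responses as columns of a matrix so that all leave-one-out fits share a single inverse. The helper `fit_dropped` is a hypothetical brute-force reference used only for comparison.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def sq_hinge_grad(u):
    """Derivative of the squared hinge loss max(1 - u, 0)^2."""
    return -2.0 * np.maximum(1.0 - u, 0.0)

def fit_all_leave_one_out(K, y, lam, iters=600):
    n = K.shape[0]
    kappa = 0.5
    step = np.linalg.inv(K / (n * kappa) + 2.0 * lam * np.eye(n))  # inverted once
    Y = y[:, None] * (1.0 - np.eye(n))  # column i is y with its i-th entry zeroed
    A = np.zeros((n, n))                # column i: coefficients of the i-th fit
    for _ in range(iters):
        Z = Y * sq_hinge_grad(Y * (K @ A)) / n
        A = A - step @ (Z + 2.0 * lam * A)
    return A

def fit_dropped(K, y, lam, i, iters=600):
    """Brute-force reference: refit on n - 1 points with a fresh inverse."""
    n = K.shape[0]
    keep = np.arange(n) != i
    Kd, yd = K[np.ix_(keep, keep)], y[keep]
    step = np.linalg.inv(Kd / (n * 0.5) + 2.0 * lam * np.eye(n - 1))
    a = np.zeros(n - 1)
    for _ in range(iters):
        z = yd * sq_hinge_grad(yd * (Kd @ a)) / n
        a = a - step @ (z + 2.0 * lam * a)
    return K[i, keep] @ a               # leave-one-out score at x_i

rng = np.random.default_rng(3)
n, lam = 25, 0.1
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)
K = rbf_kernel(X)

A = fit_all_leave_one_out(K, y, lam)
loo_scores = np.einsum("ij,ji->i", K, A)     # i-th leave-one-out fit at x_i
loo_error = np.mean(np.sign(loo_scores) != y)
```

The leave-one-out scores obtained from the shared inverse match the brute-force refits, so the n additional fits only add O(n²) work per iteration each instead of n further O(n³) inversions.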
Algorithm 1 summarizes the entire integrated algorithm for training and tuning SVM.We can also have a similar and even simpler procedure for other margin-based classifiers such as logistic regression and squared SVM.It is easy to verify that, in the Lipschitz condition (10), κ = 4 for logistic regression and κ = 1/2 for squared SVM.We use the same accelerated proximal gradient descent algorithm for computing the solution.See Algorithm 2 for details.
The computation of V-fold cross-validation can be likewise reduced by applying the leave-m-out formula in Lemma 2. For the sake of space, we opt not to repeat the algorithm here. We have implemented the magic SVM algorithm in an R package magicsvm. The package handles the leave-one-out analysis as …

… of the coordinates are zeros. In each example, the positive class was generated from a mixture Gaussian distribution $\sum_{k=1}^{10} 0.1\, N(\mu_{k+}, \sigma I)$ with each $\mu_{k+}$ drawn from $N(\mu_+, I)$, and likewise the negative class was assembled by $\sum_{k=1}^{10} 0.1\, N(\mu_{k-}, \sigma I)$ with each $\mu_{k-}$ from $N(\mu_-, I)$. We set $\mu = 2$ and $\sigma = 4$. We included 12 examples, with the sample sizes $n$ varying as {200, 300, 400} and the dimensions $p = 0.2n$ and $p = 0.5n$.
Table 1 compares the computation time of magicsvm with kernlab and libsvm. We selected λ by leave-one-out cross-validation, and then computed the objective values in problem (6). We observe that the objective values of the three packages are exactly the same. We also see that magicsvm is consistently faster than the two competitors. For example, when n = 400 and p = 200, libsvm took more than three hours to complete one round of leave-one-out cross-validation. The same results were obtained by magicsvm in about six minutes. The superiority of magicsvm is very clear in this example.
We further illustrate the performance of the magic LOOCV algorithm for fitting and tuning large-margin classifiers with smooth loss functions. We considered kernel logistic regression and kernel squared SVM. We used the same simulation data as for Table 1. We compared two approaches: (a) fitting each model using the APG algorithm and employing the ordinary LOOCV approach to select the tuning parameters, and (b) fitting and tuning each method using the integrated magic LOOCV algorithm introduced in Section 3.3. We plotted the ratio of the run times of the two approaches in Figure 1 to visualize the magnitude of the speed-up. We observe that the magic LOOCV algorithm dramatically accelerates the ordinary LOOCV approach: for example, when n = 400, the magic LOOCV algorithm speeds up the ordinary LOOCV approach by factors of about 10 and 12. The improvement grows roughly linearly with the sample size n.

Comparison Among Different V-Fold Cross-Validation
It is a popular claim that the leave-one-out cross-validation has very high variance compared with 5- or 10-fold cross-validation. According to this claim, once the overall bias-variance tradeoff is considered, one should prefer either 5- or 10-fold cross-validation over the leave-one-out. However, in the context of regression, several authors have shown that such a claim is in fact false, for example Burman (1989) and Zhang and Yang (2015). In the context of kernel learning, the comparison is missing, largely because of the expensive computation required by the standard leave-one-out analysis. Thanks to the leave-one-out lemma and the magic SVM algorithm presented in this work, it is now feasible and desirable to actually compare the performance of V-fold cross-validation for the kernel SVM. We shall demonstrate that leave-one-out cross-validation is better than 10-, 5-, and also 2-fold cross-validation in kernel learning. We used the simulated data from the mixture Gaussian distribution mentioned earlier in this section. In Table 2 we evaluated the bias, variance and mean squared error of each V-fold cross-validation error as an estimator of the generalization error of the kernel SVM. It can be seen that the leave-one-out cross-validation is the best. It has the least bias (as expected) and also has a variance similar to that of 5- or 10-fold cross-validation. It is interesting to observe that 2-fold cross-validation actually has the largest variance, which contradicts the belief that the variance of cross-validation increases with the number of folds.

Benchmark Data Applications
We further compared magicsvm with kernlab and libsvm on seven benchmark datasets which are available from the UCI machine learning repository (Dua and Graff 2019). The link to each dataset is provided in the supplementary materials. These real data examples have various combinations of sample size and dimension. We randomly split each dataset into a training set and a test set of equal sizes. For the sake of exposition, we only present the computation time because the three packages give identical results except for small differences due to implementation. From the left panel of Table 3, we observe that magicsvm is superior to kernlab and libsvm for the leave-one-out analysis. Taking the dataset arrhythmia as an example, the computation time of magicsvm is 48.5 sec, which is more than 15 times faster than kernlab and 40 times faster than libsvm.
We further used the three R packages to compute the 10-fold cross-validation error, and we observe that magicsvm is still much faster than the other two competitors. In addition, we notice that the run time of the leave-one-out analysis using magicsvm is roughly on the same level as that of using kernlab or libsvm to compute 10-fold cross-validation. In other words, magicsvm enables us to conduct leave-one-out cross-validation with the same computing resources that originally allowed only 10-fold cross-validation.

Discussions
In this article, we have developed leave-one-out and leave-some-out formulas for large-margin classifiers in RKHS, which lead to the design of a new, exact, and much faster algorithm magicsvm for training and tuning the kernel SVM and related classifiers. We have also shown that the LOOCV error, as an estimator of the generalization error of the SVM with a fixed regularization parameter, works as well as the 10-fold or 5-fold CV error. Therefore, with magicsvm the computation time should not be a factor preventing us from using leave-one-out cross-validation to estimate the generalization error and select the regularization parameter of the kernel classifier. The numerical experiments have clearly demonstrated the advantage of our new algorithm.

Supplementary Material
This file contains all the technical proofs and links to the R packages and benchmark data.

Figure 1. Magnitude of speed-up by using the magic LOOCV formula: ratios of the run time of the ordinary LOOCV approach and the run time of the integrated magic LOOCV approach. The left panel is for kernel logistic regression, and the right panel is for kernel squared SVM. The run time includes both fitting and tuning each method. The results are based on 20 independent runs. Computations were conducted on an Intel Xeon CPU E5-2680 (2.40GHz) processor.

Table 1. Simulated data: comparison of the three R solvers of kernel SVM: magicsvm, kernlab, and libsvm. The run time includes the leave-one-out analysis. The run time and objective values are averaged over 50 independent runs, and the standard errors of the objective values are given in parentheses. Computations were conducted on an Intel Xeon CPU E5-2680 (2.40GHz) processor.

Table 2. Comparisons of leave-one-out (LOO), 10-fold, 5-fold, and 2-fold CV as means for estimating the prediction error of the kernel SVM. The table reports the bias, variance, and root mean squared error (RMSE) of the estimates of the generalization error. The numbers are the average quantities over ten different λ values and over 50 independent runs.

Table 3. Run time (in seconds) comparison of leave-one-out and 10-fold cross-validation on seven UCI benchmark datasets.
NOTES: All times are averaged over 50 independent runs. Computations were carried out on an Intel Xeon CPU E5-2680 (2.40GHz) processor.