Choosing shape parameters for regression in reproducing kernel Hilbert space and variable selection

The Gaussian radial basis function (RBF) is a widely used kernel function in kernel-based methods. Its parameter, referred to as the shape parameter, plays an essential role in model fitting. In this paper, we propose a method to select the shape parameters of the general Gaussian RBF kernel. The method simultaneously serves for variable selection and regression function estimation. For the former, asymptotic consistency is established; for the latter, the estimation is as efficient as if the true or optimal shape parameters were known. Simulations and real examples illustrate the method's predictive performance in comparison with other popular methods.


Introduction
Over the past decades, kernel-based methods have become very popular in machine learning, especially in dealing with nonlinearity in data. See, for example, Wahba (1999), Schölkopf et al. (2002) and Hofmann et al. (2008). The kernel function plays a key role in kernel-based methods. Among the existing kernel functions, the Gaussian RBF kernel is arguably the most popular; it has the form K(x, x') = exp{−γ ‖x − x'‖²}, where γ > 0 is a tuning parameter of the kernel, called the shape parameter hereafter. In the literature, a pre-specified constant scalar is often assigned to γ. For example, Yuan and Chu (2006) and Konar and Chattopadhyay (2011) set γ = 1/2; Chang and Lin (2011) set γ = 1/p, where p is the dimension of x; and Jaakkola et al. (1999) set γ = 1/(2σ²), where σ² is the median of all the pairwise distances of the training sample. As demonstrated in Wang et al. (2003), however, an 'appropriate' value of γ is critical for the robust performance of these kernel-based methods: a too-small γ results in under-fitting, whereas a too-large γ risks over-fitting.
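As a concrete illustration of these pre-specified choices, the following sketch computes the median-based rule attributed above to Jaakkola et al. (1999), taking σ² as the median pairwise distance exactly as described in the text. The function name and the simulated data are hypothetical, not part of the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Shape parameter gamma = 1 / (2 * sigma^2), where sigma^2 is taken as the
    median of all pairwise distances of the training sample (as described above)."""
    sigma2 = np.median(pdist(X, metric="euclidean"))
    return 1.0 / (2.0 * sigma2)

# Hypothetical usage on simulated covariates
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
print(median_heuristic_gamma(X_train))
```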
Several methods have been proposed to choose γ. The most commonly used one is cross-validation (CV), e.g. Camps-Valls and Bruzzone (2005) and Chang and Lin (2011). Combined with a grid search, this approach works well when γ is the scalar parameter of the Gaussian RBF kernel. However, the target of this paper is the general Gaussian RBF kernel with a multidimensional shape parameter, introduced below, for which grid search is nearly impossible to implement because the number of grid points grows explosively as p gets large. Alternatively, Li et al. (2010) proposed an automatic parameter selection (APS) method using the gradient descent algorithm, and Damodaran (2018) selected the parameter by maximizing an independence measure. However, these methods are designed for classification problems.
Moreover, regression problems often involve variable selection, which is crucial to both interpretation and estimation efficiency, and kernel methods are no exception. Under the linear model assumption, Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO), and Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty. Under nonparametric settings, many methods are also available; see, for example, the multivariate adaptive regression splines (MARS) of Friedman (1991), the component selection and smoothing operator (COSSO) of Lin and Zhang (2006), and the methods of Ravikumar et al. (2009) and Huang et al. (2010) for the additive model.
In this paper, we consider the general nonparametric model

Y = g*(X) + ε,   (1)

where X ∈ X ⊂ R^p and g*(x) = E(Y | X = x). We first consider the case where g* belongs to the reproducing kernel Hilbert space (RKHS) generated by a general Gaussian RBF kernel, also called an anisotropic Gaussian kernel,

K_γ(x, x') = exp{ − Σ_{k=1}^p γ_k (x_k − x'_k)² },

where γ = (γ_1, ..., γ_p)^T with γ_k ≥ 0, and x_k is the kth element of x. Note that when γ_1 = γ_2 = ... = γ_p, the general Gaussian RBF kernel degenerates to the Gaussian RBF kernel. This kernel was first considered in Chang and Chou (2015), who proposed a method for selecting these shape parameters in classification problems. Hang and Steinwart (2021) investigated regression modelling within the RKHS associated with this kernel, while Jordan et al. (2021) formulated this kernel in a slightly different way. In this paper, we propose a procedure which can not only be used for the selection of γ but, more importantly, also serves as a variable selection method, since a zero γ_k is a natural indication of x_k being insignificant.
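To make the role of the coordinate-wise shape parameters concrete, here is a minimal sketch (not the authors' implementation) of the general Gaussian RBF kernel matrix; the function name and array shapes are illustrative.

```python
import numpy as np

def general_gaussian_kernel(X1, X2, gamma):
    """General Gaussian RBF kernel evaluated on all pairs of rows:
    K(x, x') = exp(-sum_k gamma_k * (x_k - x'_k)^2).

    X1: (n1, p) array, X2: (n2, p) array, gamma: (p,) nonnegative array.
    A zero gamma_k removes variable k from the kernel entirely.
    """
    diff = X1[:, None, :] - X2[None, :, :]            # (n1, n2, p) coordinate differences
    dist2 = np.einsum("ijk,k->ij", diff ** 2, gamma)  # gamma-weighted squared distances
    return np.exp(-dist2)
```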
The rest of this paper is organized as follows. Section 2 contains some background on the general Gaussian RBF kernel, its associated RKHS, and kernel ridge regression. In Section 3, the new method is presented along with the corresponding algorithm and some related practical issues. Asymptotic results, i.e. consistency theory for both variable selection and estimation, are given in Section 4. Simulation studies and empirical examples are presented in Sections 5 and 6, respectively. A brief discussion is provided in Section 7. All proofs are relegated to the supplement, and the computation code is available from https://github.com/Txin1994/generalGaussianKRR.

Reproducing kernel Hilbert space and kernel ridge regression
Let H be a Hilbert space of functions on X, write L_x(f) = f(x) for the evaluation functional, and let ‖·‖_H be the norm induced by the inner product ⟨·, ·⟩_H; see Aronszajn (1950). A Hilbert space H is an RKHS if and only if there exists a map K : X × X → R such that K(·, x) ∈ H and f(x) = ⟨f, K(·, x)⟩_H for all x ∈ X and f ∈ H. If such a K exists, then it is unique and is positive semidefinite (PSD), i.e. for all n ∈ N_+, x_1, ..., x_n ∈ X and a_1, ..., a_n ∈ R, Σ_{i=1}^n Σ_{j=1}^n a_i a_j K(x_i, x_j) ≥ 0. This PSD kernel is also called a reproducing kernel (RK). Every RK corresponds to a unique RKHS, and an RKHS is associated with a unique RK; thus, an RKHS can be defined by an RK K as the closure of the linear span of {K(·, x) : x ∈ X}.

Under model (1), g* is the minimizer of the mean squared error E{Y − g(X)}². The empirical least-squares loss plus a penalty based on the Hilbert norm is commonly used to estimate the regression function, i.e.

ĝ = argmin_{g ∈ H} ‖Y − g‖²_n + λ ‖g‖²_H,   (2)

where ‖Y − g‖²_n = (1/n) Σ_{i=1}^n {Y_i − g(X_i)}² and λ > 0 is a regularization parameter. When H is an RKHS associated with kernel K, estimator (2) is called the kernel ridge regression (KRR) estimator. By the representer theorem (Wahba 1999), the solution to problem (2) belongs to the linear space spanned by K(·, X_1), ..., K(·, X_n); that is, ĝ(·) = Σ_{i=1}^n α̂_i K(·, X_i), where α̂ = (α̂_1, ..., α̂_n)^T = (K + nλI)^{−1} Y with K = {K(X_i, X_j)}_{i,j=1}^n, Y = (Y_1, ..., Y_n)^T, and I the n × n identity matrix.
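The closed form implied by the representer theorem can be coded in a few lines. The sketch below assumes the standard solution α̂ = (K + nλI)^{−1}Y derived above; names are illustrative and this is not the authors' code.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Kernel ridge regression for a fixed kernel matrix K:
    alpha_hat = (K + n * lam * I)^{-1} y, so g_hat(x) = sum_i alpha_i K(x, X_i)."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(K_new, alpha):
    """Evaluate g_hat at new points; K_new[i, j] = K(x_new_i, X_j)."""
    return K_new @ alpha
```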

General Gaussian RBF kernel
We know from the above discussion that the kernel function is an essential part of an RKHS. In this paper, we focus on the general Gaussian RBF kernel. Let H_γ be the RKHS associated with the general Gaussian RBF kernel K_γ, i.e. K_γ(x, x') = exp{ − Σ_{k=1}^p γ_k (x_k − x'_k)² }.
Thus, there is another definition of H_γ in terms of the system {φ_k}_{k=1}^∞ of Lemma 2.1, where g_k = ⟨g, φ_k⟩_0. Moreover, for any g ∈ H_γ, we have the following.

Lemma 2.2: Suppose {φ_k}_{k=1}^∞ is given in Lemma 2.1. Then ⟨φ_k, φ_l⟩ = I_{{k=l}}, where I_{{k=l}} = 1 for k = l and I_{{k=l}} = 0 for k ≠ l.
Note that the KRR estimator ĝ_γ is completely determined by the kernel K_γ, in which γ determines the importance of each variable: the bigger γ_k is, the more important x_k is, and if γ_k = 0, then x_k is redundant. Motivated by this fact, we propose the following method to simultaneously estimate the regression function and select variables.
First, there is an over-fitting problem in selecting γ: if γ were chosen by minimizing the least-squares criterion directly, then the bigger γ is, the closer the fitted values ĝ(X_i) are to the observed Y_i. To fix this problem, we consider a cross-validation approach.
Partition the sample D into K subsets D_1, D_2, ..., D_K of approximately equal size; in our calculation below, we set K = 5. For each k = 1, 2, ..., K, calculate the KRR estimator ĝ_γ^{(−k)} on D − D_k. For given λ and τ, our selection of γ is given by

γ̂ = argmin_{γ_1, ..., γ_p ≥ 0}  Σ_{k=1}^K Σ_{i ∈ D_k} {Y_i − ĝ_γ^{(−k)}(X_i)}² + Σ_{j=1}^p p_τ(γ_j),   (3)

where p_τ(·) is a penalty function which can be either LASSO (Tibshirani 1996) or SCAD (smoothly clipped absolute deviation; Fan and Li 2001), and λ, τ > 0 are the regularization parameters. Finally, we obtain the KRR estimator ĝ by (2) with γ = γ̂ and the selected λ.

In this paper, we consider the situation where |A*| = p_0 is fixed, where A* denotes the index set of the truly important variables (see Section 4). Thus, for variable selection, we define the estimated set of important variables as Â = {k | γ̂_k ≠ 0}, and in our calculation below we take γ̂_k as zero if γ̂_k / ‖γ̂‖_∞ < 10^{−3}.

We choose the regularization parameter λ by generalized cross-validation (GCV). Let H = K_γ̂ (K_γ̂ + nλI)^{−1}, so that Ŷ = HY. Then the GCV estimator of the risk is

GCV(λ) = (1/n) ‖(I − H)Y‖² / {1 − tr(H)/n}².

As τ determines the number of important variables, it is feasible to choose it by the Bayesian information criterion (BIC). Note that Section 4 provides the rate restrictions on λ and τ; thus, we can select them by grid search over a reasonable range based on these rate restrictions. Algorithm 1 summarizes the complete estimation procedure.
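Before turning to Algorithm 1, note that the GCV score can be computed directly from the hat matrix H = K_γ̂(K_γ̂ + nλI)^{−1}. The sketch below uses the standard GCV definition given above and may differ in minor details from the paper's exact formula; the function name is illustrative.

```python
import numpy as np

def gcv_score(K, y, lam):
    """Standard GCV risk estimate for KRR with hat matrix H = K (K + n*lam*I)^{-1}:
    GCV(lam) = (1/n) * ||(I - H) y||^2 / (1 - tr(H)/n)^2."""
    n = K.shape[0]
    H = K @ np.linalg.inv(K + n * lam * np.eye(n))
    resid = y - H @ y
    denom = (1.0 - np.trace(H) / n) ** 2
    return (resid @ resid) / n / denom
```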

Algorithm 1:
1. Initialize γ.
2. For fixed τ and the current γ, obtain the KRR estimator with the best λ chosen by GCV.
3. For fixed λ and τ, update γ = γ̂ by solving (3) with the gradient descent algorithm.
4. For each τ in a reasonable range, repeat Steps 2 and 3 until convergence. Choose τ according to BIC, denoted by τ_0. The solutions γ̂ and ĝ_γ̂ corresponding to τ_0 are the final estimators.
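A compact sketch of Steps 2 and 3 of Algorithm 1, reusing the hypothetical helpers general_gaussian_kernel, krr_fit and gcv_score from the earlier sketches. The gradient of the cross-validated criterion (3) is approximated numerically here for brevity, the LASSO penalty is used, and all numerical choices (initialization, step size, grids, iteration counts) as well as the omission of the outer BIC loop over τ are illustrative simplifications rather than the paper's implementation.

```python
import numpy as np

def cv_objective(X, y, gamma, lam, tau, folds):
    """K-fold out-of-sample squared error of the KRR estimator plus a LASSO
    penalty on gamma (one possible form of criterion (3))."""
    loss = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        K_tr = general_gaussian_kernel(X[train_idx], X[train_idx], gamma)
        alpha = krr_fit(K_tr, y[train_idx], lam)
        K_te = general_gaussian_kernel(X[test_idx], X[train_idx], gamma)
        loss += np.sum((y[test_idx] - K_te @ alpha) ** 2)
    return loss + tau * np.sum(np.abs(gamma))

def update_gamma(X, y, gamma, lam, tau, folds, step=1e-3, n_steps=50):
    """Projected gradient descent on the cross-validated criterion, using a
    simple finite-difference gradient; gamma stays componentwise nonnegative."""
    eps = 1e-6
    for _ in range(n_steps):
        base = cv_objective(X, y, gamma, lam, tau, folds)
        grad = np.zeros_like(gamma)
        for k in range(len(gamma)):
            g_eps = gamma.copy()
            g_eps[k] += eps
            grad[k] = (cv_objective(X, y, g_eps, lam, tau, folds) - base) / eps
        gamma = np.maximum(gamma - step * grad, 0.0)   # projection onto gamma_k >= 0
    return gamma

def fit_general_gaussian_krr(X, y, tau, lam_grid, n_iter=10, n_folds=5):
    """Alternate Step 2 (lambda by GCV) and Step 3 (gamma update) for a fixed tau."""
    n, p = X.shape
    folds = np.array_split(np.random.permutation(n), n_folds)
    gamma = np.full(p, 1.0 / p)    # illustrative initialization
    lam = lam_grid[0]
    for _ in range(n_iter):
        K_full = general_gaussian_kernel(X, X, gamma)
        lam = min(lam_grid, key=lambda l: gcv_score(K_full, y, l))
        gamma = update_gamma(X, y, gamma, lam, tau, folds)
    return gamma, lam
```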

Asymptotics
We begin with a brief discussion of the 'true' shape parameter, referred to as γ* hereafter. Let G = {γ | γ_k ≥ 0 for k = 1, 2, ..., p} and, for any γ ∈ G, let A_γ = {k | γ_k > 0}; the set G_0 ⊂ G of shape parameters whose RKHS contains g* is obtained as a limit where δ tends to zero. G_0 is empty if g* does not belong to any RKHS generated by a general Gaussian RBF kernel. Thus, we consider two cases.

The first case is that G_0 is not empty; then g* ∈ ∩_{γ ∈ G_0} H_γ. Note that A_{γ_1} = A_{γ_2} = A* for any γ_1, γ_2 ∈ G_0 when g* ≠ 0, according to Proposition 4.1 and our proof in the Supplement. Thus, we can define γ*; see Berlinet and Thomas-Agnan (2011). Furthermore, it is easy to prove that H_{γ*} = ∩_{γ ∈ G_0} H_γ. For ease of explanation, we only consider this γ* in what follows.

The other case is when G_0 is empty. We then consider d_γ, the distance between g* and H_γ. By Proposition 3 of Hang and Steinwart (2021), we have d_γ ≤ Σ_{l ∈ A_γ} c_l γ_l^{−2κ_l} when the density function of X is bounded on a compact X. This implies that the distance between g* and H_γ can be made arbitrarily small if γ_k, k ∈ A*, are big enough. Similar to the previous discussion, we define γ*_δ (still denoted by γ*).

Theorem 4.2: Suppose Assumptions 1-4 in the Supplement hold and g* ∈ H_{γ*}. Let λ ∼ n^{−1} and τ = o(1). When γ = γ*, the KRR estimator satisfies

Remark 4.2:
If we only consider the estimation of the regression function, τ = o(1) is sufficient for the approximation. Moreover, our selection method is consistent whether or not g* is in H_{γ*}.

Simulations
In this section, we study the empirical performance of our proposed method (denoted by KRR_γ̂) for estimation and variable selection by comparing it with COSSO (Lin and Zhang 2006), MARS (Friedman 1991) and KRR_c (i.e. the KRR estimator with shape parameters γ_c = (1/c, 1/c, ..., 1/c)^T, where c is selected by CV). For COSSO and MARS, the R packages 'cosso' and 'earth' are used in the calculation. We use the mean squared error to measure estimation efficiency. As for variable selection, we use the average relative model size (Size, the number of selected important variables divided by the number of truly important variables) and the false discovery rate (FDR, the number of redundant variables selected divided by the number of truly uninformative variables) to evaluate the performance. The following five examples are considered. The covariates have the structure X_k = (W_k + rU)/(1 + r) for r ∈ [0, 1] and k = 1, 2, ..., p, and ε ∼ N(0, 1). Therefore, corr(X_k, X_l) = r²/(1 + r²) for k ≠ l; when r = 0, all variables are independent. The first and fourth examples were used by Lin and Zhang (2006) and He et al. (2018). The others have more general marginal distributions for the covariates.
Example 5.5: Let Y_i be generated in the same way as in Example 5.2, except that Y = g*(X) + 3X_6 ε.
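The covariate structure described before the examples can be simulated as follows. The marginal distributions of W_k and U are taken to be standard normal purely for illustration, whereas the paper varies the marginals across examples; the correlation r²/(1 + r²) holds for any i.i.d. choice with equal variances.

```python
import numpy as np

def simulate_covariates(n, p, r, rng):
    """X_k = (W_k + r*U) / (1 + r), giving corr(X_k, X_l) = r^2 / (1 + r^2) for k != l.
    W_k and U are i.i.d. standard normal here purely for illustration."""
    W = rng.normal(size=(n, p))
    U = rng.normal(size=(n, 1))
    return (W + r * U) / (1.0 + r)

rng = np.random.default_rng(0)
X = simulate_covariates(n=200, p=10, r=1.0, rng=rng)
# Empirical check: off-diagonal correlations should be near r^2/(1+r^2) = 0.5
print(np.corrcoef(X, rowvar=False)[0, 1])
```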
For each example, we consider a total of 16 settings with r = 0, 1, n = 100, 200, and p = 10, 20, 50, 100. For each setting, we replicate the simulation 200 times, and in the tables we use bold font to highlight the best-performing methods.
The results of Example 5.1 are given in Table 1. Note that this is an additive model, so it fits COSSO better than the other methods. Both KRR_γ̂ and COSSO have decent performance in selecting variables; COSSO performs slightly better than our proposed procedure due to the special additive structure of the model, while MARS tends to select too many variables. Although COSSO performs better than the others in terms of estimation, its associated standard errors (in parentheses) are the largest most of the time. Moreover, the selected shape parameters improve the estimation of KRR_γ̂ substantially over KRR_c, especially when the sample size is significantly larger than the number of variables. The results of Example 5.2 are listed in Table 2. The proposed KRR_γ̂ performs significantly better than the other methods in variable selection. MARS tends to select too many variables, and COSSO fails to give a decent estimation, especially when r = 0. Again, the shape parameters selected by our method improve on KRR_c substantially.
The results of Example 5.3 are given in Table 3. Example 5.3 is more complicated than the previous two examples. We can see that KRR_γ̂ still performs much better than MARS and COSSO in variable selection. Besides, the increase of p has little effect on MARS, but it adversely affects the estimation of the other methods. Again, our proposed method improves on the KRR_c estimator significantly, especially when the correlation between variables is strong.
The results of Example 5.4 are shown in Table 4. In variable selection, COSSO performs best when r = 0, whereas KRR_γ̂ performs best when r = 1. Besides, the estimation of KRR_γ̂ is much better than that of MARS and COSSO.
The results of Example 5.5 are shown in Table 5. Note that this model is complicated, with both the conditional mean and the conditional variance depending on the covariates. None of the procedures estimates it well, especially when n = 100. Although COSSO performs best in most cases, our KRR_γ̂ is not much worse. In summary, the proposed method has superior performance over its competitors in variable selection and model estimation, especially when the model involves interactions and strong dependence among variables or when the model is otherwise complicated.

Real examples
This section applies our proposed method to three real data sets: the baseball players' salary data, the Boston housing data, and the residential building cost data. The baseball players' salary data, available at http://lib.stat.cmu.edu/datasets/baseball.data, concerns the salaries of major league baseball players; there are 16 covariates and 263 observations. The Boston housing data is available in the R package 'mlbench'; it studies house prices using 13 predictors and 506 observations. The residential building cost data is available at https://archive.ics.uci.edu/ml/datasets/Residential+Building+Data+Set; it has 103 covariates and 372 observations. In our analysis, Box-Cox transformations are first employed to make the variables close to normally distributed. Then we randomly take 2/3 of the data as a training set and the rest as a testing set, apply the four methods to train the models, and make predictions on the testing set. Finally, we repeat this random splitting 200 times and use the average prediction errors to assess the performance of the methods. The results are given in Tables 6-9. Table 6 shows the prediction performance of the four methods; our method has much smaller prediction errors than the others. Table 7 reports the average number of selected variables for the different methods. Table 8 lists the relative selection frequency of each variable for the baseball salary data; the frequencies align with the common understanding of the importance of the variables. For example, the number of years in the major leagues is important for a player's salary. Besides, Table 9 lists the chosen γ̂ for the Boston housing data, showing that the γ̂_k differ across predictors, which can substantially improve on KRR predictions that use only one common γ_c.
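A sketch of the evaluation protocol described above (Box-Cox transformation, repeated 2/3-1/3 random splits, averaged squared prediction error). The fit_fn/predict_fn interface is hypothetical and stands in for any of the four compared methods; applying Box-Cox only to the shifted covariates is a simplification for illustration.

```python
import numpy as np
from scipy import stats

def evaluate_prediction(X, y, fit_fn, predict_fn, n_rep=200, train_frac=2/3, rng=None):
    """Repeated random-split evaluation: Box-Cox transform of the covariates,
    2/3 training / 1/3 testing split, averaged squared prediction error."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Box-Cox requires positive inputs, so each covariate is shifted before transforming
    Xt = np.column_stack([stats.boxcox(col - col.min() + 1.0)[0] for col in X.T])
    errors = []
    n = len(y)
    for _ in range(n_rep):
        idx = rng.permutation(n)
        n_tr = int(train_frac * n)
        tr, te = idx[:n_tr], idx[n_tr:]
        model = fit_fn(Xt[tr], y[tr])
        pred = predict_fn(model, Xt[te])
        errors.append(np.mean((y[te] - pred) ** 2))
    return np.mean(errors)
```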

Discussion
In this paper, we have proposed a method for selecting the shape parameters of a general Gaussian RBF kernel. The asymptotic properties of the estimation and the consistency of the variable selection are established. Compared with the ordinary Gaussian RBF kernel, the proposed method simultaneously improves the estimation and identifies the important variables. Compared with COSSO and MARS, the proposed method performs better in both variable selection and prediction, especially when the correlation among the variables is strong.
Our method is proposed for the situation where ε and X are independent. Example 5.5 shows that none of the methods above performs well when ε and X are dependent. If the conditional variance of ε given X could be estimated, a weighted least-squares estimation might help; a least absolute deviations (LAD) estimation could also reduce the effect of the conditional variance. This needs further study. Besides, we have only studied the situation where the dimension of the covariates is fixed. Extending the theory to high-dimensional covariates, where the dimension increases with the sample size, would be interesting, and the minimization problem (3) also needs further study when the dimension increases.