Locally penalized single-index model using B-splines and spherical coordinates

Abstract In this study, we focus on the estimation of the regression function in the single-index model based on B-splines using penalization techniques. We adopt a spherical-coordinates reparameterization of the index vector to deal with the identification problem of the single-index model. To provide a spatially adaptive method, two types of penalties are applied: one to the estimation of the index vector and one to the regression function. A special penalty, called the localized penalty, is introduced to handle the sparsity of the index vector through the spherical coordinates, and a total variation penalty is considered to control the smoothness of the link function. Using a coordinate descent algorithm with a grid search over the two tuning parameters, the entire solution paths of the index coefficients and the regression functions can be obtained efficiently. The performance of the proposed estimator is studied through both numerical simulations and real data sets. An R software package pbssim is available.


Introduction
The single-index model is one of the most popular methods in multivariate semiparametric regression. It is a generalized version of the linear regression model that does not specify the form of the link function. The single-index model can avoid both the model misspecification that parametric models may suffer and the so-called "curse of dimensionality" by reducing the dimensionality of the predictors to a one-dimensional space while still capturing important features in multidimensional data. The single-index model is applied in a variety of fields, such as discrete choice analysis in econometrics and dose-response models in biometrics (Härdle, Hall, and Ichimura 1993).
When large numbers of irrelevant predictors are included, the single-index model may suffer from lower prediction accuracy, and interpretation may become difficult. To address the problem, the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996) and its variants have been studied in this area. Wang and Yin (2008) proposed sparse minimum average conditional variance estimation (sMAVE), where local linear smoothing is adopted with the LASSO penalty under multiple-index models. Lu, Zhang, and Hu (2016) considered B-spline approximation with an adaptive LASSO penalty (Zou 2006), and Peng and Huang (2011) incorporated penalized least-squares estimation with a smoothly clipped absolute deviation (SCAD) (Fan and Li 2001) penalty function.
To penalize the index coefficients and the link function simultaneously, Zeng, He, and Zhu (2012) improved sMAVE by adding the derivative of the link function to the penalty function. The penalty function not only performs variable selection but also excludes data points with small derivatives. Yang, Lv, and Guo (2016) proposed least absolute deviation (LAD) regression with a SCAD penalty, and this approach showed higher efficiency under heavy-tailed error distributions. Recently, Kim (2018) combined truncated power spline estimation with the LASSO penalty for both variable selection and smoothing separately, using two tuning parameters.
This article considers a different approach and makes the following key contributions:
1. Although Wang and Yin (2008), Lu, Zhang, and Hu (2016), and Peng and Huang (2011) show promising results with penalized methods, they do not adopt a penalty term that can be adjusted for the smoothness of the estimated link function. Yang, Lv, and Guo (2016) used a single tuning parameter for variable selection and link-function smoothing at the same time, which may overlook the need to shrink the two sets of coefficients at different speeds. In addition, they considered a predefined bandwidth, which is also a crucial factor for managing the smoothness of the link function. In this article, we present a penalized single-index model estimator using spherical coordinates and B-splines. We consider a new type of penalty function that handles the sparsity of the index vector based on spherical coordinates. Several approaches in the literature adopt spherical coordinates to deal with the index vector satisfying the identifiability condition (Gaïffas and Lecué 2007). Moreover, the total variation penalty of B-splines is used to control the smoothness of the link function.
2. Kim (2018) considered the design matrix of the truncated power basis, which becomes dense as the number of interior knots increases, so several computational problems may occur. In addition, for the identifiability of the single-index model, the index coefficients must be rescaled to unit length in every updating step. We define the nonparametric function estimator using B-splines. The B-spline basis is a numerically superior alternative to the truncated power basis because its basis functions have small supports, so the design matrix is sparse and the information matrix has a banded structure (De Boor et al. 1978).
3.
Once we express the estimator as a linear combination of B-splines, both the spline coefficients and the index coefficients are obtained by minimizing a penalized square loss function. We propose an iterative coordinate descent algorithm that provides the entire solution paths of the index coefficients and the regression functions for all sets of tuning parameters.
The rest of the article is organized as follows. In Section 2, we define the PBS-SIM estimators based on the spherical coordinates and the total variation penalty. The details of the grid search updating method based on the coordinate descent algorithm and its implementation are described in Section 3. The performance of the proposed estimator is studied through both numerical simulations and real data sets, given in Section 4. Section 5 summarizes the conclusions of the article. In a document in the supplementary materials, we display the full set of results over the entire simulation scenarios. An R software package pbssim implemented with Rcpp is also available online.

Reparameterization of the single-index model using spherical coordinates
Suppose a set of observations $(x_1, y_1), \ldots, (x_N, y_N)$ follows the single-index model
$$y_i = f(\xi^\top x_i) + \epsilon_i, \quad i = 1, \ldots, N, \qquad (1)$$
where $x_i = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$ is a $p$-dimensional predictor vector, $y_i \in \mathbb{R}$ is a response, and $\epsilon_i$ is an independent error for $i = 1, \ldots, N$. We are interested in estimating an unknown smooth link function $f$ and an index vector $\xi = (\xi_1, \ldots, \xi_p) \in \mathbb{R}^p$. The index vector $\xi$ in the representation (1) is not uniquely defined, so we assume that the Euclidean norm of $\xi$ equals one, that is, $\|\xi\|_2 = 1$, to ensure the identifiability condition on $\xi$.
For convenience of enforcing the $\ell_2$-norm of $\xi$ to be 1, we adopt the spherical-coordinates reparameterization of $\xi$ (Natterer 2001). Let $S^{p-1}$ be the unit hypersphere defined by $S^{p-1} = \{\xi \in \mathbb{R}^p : \|\xi\|_2 = 1,\ \xi_p > 0\}$ and let $A^{p-1}$ be the set of spherical coordinates of $\xi \in S^{p-1}$, given as $\{(\alpha_1, \ldots, \alpha_{p-1}) : \alpha_1, \ldots, \alpha_{p-2} \in [0, \pi),\ \alpha_{p-1} \in [0, 2\pi)\}$. Let $q : A^{p-1} \to S^{p-1}$ be the map defined by
$$q(\alpha_1, \ldots, \alpha_{p-1}) = \big(\cos\alpha_1,\ \sin\alpha_1\cos\alpha_2,\ \ldots,\ \sin\alpha_1\cdots\sin\alpha_{p-2}\cos\alpha_{p-1},\ \sin\alpha_1\cdots\sin\alpha_{p-1}\big). \qquad (2)$$
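As a concrete illustration of the reparameterization, the map $q$ can be sketched in a few lines. This is an illustrative Python sketch assuming the standard hyperspherical convention above; the pbssim package itself is written in R:

```python
import math

def q(alpha):
    """Map spherical coordinates alpha (length p - 1) to a unit vector xi (length p)."""
    xi, s = [], 1.0            # s holds the running product sin(a_1) * ... * sin(a_{k-1})
    for a in alpha:
        xi.append(s * math.cos(a))
        s *= math.sin(a)
    xi.append(s)               # the last coordinate is the full product of sines
    return xi

xi = q([2 * math.pi / 3])      # p = 2, the true coordinate from Example 2.1
print(xi)                      # [cos(2*pi/3), sin(2*pi/3)]
print(sum(v * v for v in xi))  # equals 1 up to rounding, for any alpha
```

Because $q(\alpha)$ always has unit norm, any update of $\alpha$ automatically respects the identifiability constraint, which is the point of the reparameterization.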
Based on (2), our approach is to estimate the spherical coordinates $\alpha$ instead of $\xi$: because the function $q$ is one-to-one, $q(\alpha)$ always satisfies the identifiability condition on $\xi$.
We consider the general methodology of minimizing the square loss function
$$R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big\{y_i - f_\beta\big(q(\alpha)^\top x_i\big)\big\}^2, \qquad (4)$$
where $\theta = (\alpha, \beta)$ and $f_\beta(\cdot) = \sum_{j=1}^{J} \beta_j B_j(\cdot)$ is a linear combination of B-splines $B_1, \ldots, B_J$ having knots spaced on a subinterval $I \subset \mathbb{R}$, with dimension $J$. In general, cubic B-splines with degree $d = 3$ are used; however, we provide the solution for any degree $d \ge 1$, which can be a piecewise linear ($d = 1$) or piecewise quadratic ($d = 2$) spline.
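The B-spline basis entering the loss can be evaluated with the Cox-de Boor recursion. The following Python sketch (an illustration, not the package implementation) shows the small-support property that makes the design matrix sparse:

```python
def bspline(j, d, t, x):
    # Cox-de Boor recursion: value at x of the j-th B-spline of degree d on knots t
    if d == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    out = 0.0
    if t[j + d] > t[j]:
        out += (x - t[j]) / (t[j + d] - t[j]) * bspline(j, d - 1, t, x)
    if t[j + d + 1] > t[j + 1]:
        out += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * bspline(j + 1, d - 1, t, x)
    return out

d = 3                                              # cubic splines, as in the paper
knots = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]  # clamped knot vector on [0, 1]
J = len(knots) - d - 1                             # dimension of the spline space
vals = [bspline(j, d, knots, 0.4) for j in range(J)]
print(sum(vals))                      # B-splines sum to 1 on [0, 1)
print(sum(1 for v in vals if v > 0))  # only d + 1 = 4 are nonzero: small support
```

At any point, at most $d + 1$ basis functions are nonzero, which is why the information matrix has the banded structure noted in the introduction.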
One critical issue of the single-index model is that (4) is nonconvex with respect to the index vector $\xi$. This also applies to the spherical-coordinates reparameterization. Consider the following example for illustration with $p = 2$.
Example 2.1. The response is generated from the model $y_i = f(\xi^\top x_i) + \epsilon_i$, where $(z)_+ = z \cdot I(z \ge 0)$ for $z \in \mathbb{R}$, the $x_{ik}$ are randomly generated from a uniform distribution on $[-1.5, 1.5]$, and the $\epsilon_i$ are independent Gaussian errors. Each row of the first column of Figure 1 represents the B-spline fit obtained by an unpenalized least-squares estimator with 1 and 50 interior knots, respectively. The second column of Figure 1 shows $R(\theta)$ with respect to $\alpha_1 \in [0, 2\pi]$, which has a local minimum around the true value $\alpha_1 = 2\pi/3$ but is not a convex function. Moreover, it shows local oscillating behavior as the number of spline knots increases. Thus, both the number and the position of the knots are critical in finding the local minimum of $R(\theta)$ with respect to the spherical coordinates.
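A data set in the spirit of Example 2.1 can be generated as follows; the $(z)_+$ link and the error standard deviation 0.1 are illustrative assumptions, since the exact display is not reproduced here:

```python
import math, random

random.seed(1)
N, true_alpha = 200, 2 * math.pi / 3
xi = (math.cos(true_alpha), math.sin(true_alpha))     # q(alpha) for p = 2
data = []
for _ in range(N):
    x = (random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5))
    z = xi[0] * x[0] + xi[1] * x[1]                   # single index xi^T x
    y = max(z, 0.0) + random.gauss(0, 0.1)            # assumed (z)_+ link, assumed sd 0.1
    data.append((x, y))
print(len(data))
```

Plotting $R(\theta)$ over a grid of $\alpha_1$ values for such data reproduces the nonconvex, locally oscillating profile shown in Figure 1.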

Penalized B-spline single-index model estimator
To achieve the sparsity of the index vector and the smoothness of the spline function, we consider the penalized square loss function
$$R_\lambda(\theta) = R(\theta) + \lambda_\alpha P_\alpha(\alpha) + \lambda_\beta P_\beta(\beta). \qquad (5)$$
Here, $P_\alpha$ and $P_\beta$ are two penalty functions controlling predictor selection and smoothing, respectively. The details of the penalty functions are discussed in the next section. Define the penalized estimator of $\theta$ by
$$\hat{\theta}_\lambda = \operatorname*{argmin}_{\theta} R_\lambda(\theta). \qquad (6)$$
Then, we call $\hat{f}_\lambda = f_{\hat{\theta}_\lambda}$ the function estimator of the PBS-SIM, and $\hat{\xi}_\lambda = q(\hat{\alpha}_\lambda)$ is the estimated index vector obtained from the spherical coordinates.

Localized penalty for predictor selection
We propose a new type of penalty function, which we call the localized penalty, to handle the sparsity of the index vector through the spherical coordinates. One can easily check that $\xi$ becomes sparse when elements of $\alpha$ move toward certain polar coordinates, namely $\alpha_k \in \{0, \pi/2, \pi, 3\pi/2, 2\pi\}$ for $k \le p - 1$, which make $\sin\alpha_k$ or $\cos\alpha_k$ zero. The localized penalty function $P_\alpha$ in (7) is built on this observation. Figure 2 displays (7) with respect to $\alpha_1 \in [0, 2\pi]$ for $p = 2$; it has five local minimizers at $\{0, \pi/2, \pi, 3\pi/2, 2\pi\}$, which become the thresholding points as the tuning parameter $\lambda_\alpha$ increases.
To show the role of $P_\alpha$ more concretely, we consider a simple example. Focusing on $p = 2$, the penalized objective function (5) can be represented by a univariate function $r_{\lambda_\alpha}$ with respect to $\alpha_1$, ignoring terms that are constant in $\alpha_1$ with the other values fixed. The square loss term is obtained by a first-order approximation, and $b \in \mathbb{R}$ is the unpenalized minimizer of $r_0$. Figure 3 displays the thresholding functions corresponding to the localized penalty for different values of $\lambda_\alpha$. Interestingly, the unpenalized estimator $b$ shrinks to the adjacent point among $\{0, \pi/2, \pi, 3\pi/2, 2\pi\}$ as the tuning parameter $\lambda_\alpha$ increases. This approach yields sparsity of the index vector while resolving the identifiability problem itself, since $\xi = q(\alpha)$ always lies on $S^{p-1}$ for any $\alpha \in A^{p-1}$. Returning to Example 2.1, Figure 4 shows the solution path of the spherical coordinate $\alpha_1$ and its corresponding index vector over an increasing $\lambda_\alpha$ sequence with the localized penalty. As expected, the unpenalized estimator moves to the adjacent localized point $\pi/2$ as $\lambda_\alpha$ increases, which makes $\xi$ sparse. Moreover, we can provide an entire path of solutions of the index coefficients for any tuning parameter using the efficient grid search algorithm introduced in Section 3.
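The paper's exact form of $P_\alpha$ is given in (7); as a stand-in, the sketch below uses $|\sin 2\alpha|$, one penalty that vanishes exactly at $\{0, \pi/2, \pi, 3\pi/2, 2\pi\}$, to reproduce the thresholding behavior of Figure 3:

```python
import math

def threshold(b, lam, a=1.0, grid=20000):
    # minimize r(alpha) = a * (alpha - b)^2 + lam * |sin(2 * alpha)| over [0, 2*pi]
    best, best_val = 0.0, float("inf")
    for i in range(grid + 1):
        al = 2 * math.pi * i / grid
        val = a * (al - b) ** 2 + lam * abs(math.sin(2 * al))
        if val < best_val:
            best, best_val = al, val
    return best

b = 1.9                          # unpenalized minimizer, sitting between pi/2 and pi
for lam in (0.0, 0.5, 50.0):
    print(lam, threshold(b, lam))
# as lam grows, the minimizer moves from b to the nearest point in {0, pi/2, pi, 3pi/2, 2pi}
```

With $\lambda_\alpha = 0$ the minimizer is $b$ itself; for a large penalty it snaps to the nearest multiple of $\pi/2$, which is the sparsity mechanism described above.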

Total variation penalty for smoothing
To provide a spatially adaptive estimator, we consider an additional penalty for estimating $f$. Jhong, Koo, and Lee (2017) proposed the total variation penalty for B-splines
$$P_\beta(\beta) = \|D\beta\|_1, \qquad (9)$$
where $D$ is a $(J - d - 1) \times J$ jump matrix representing the sum of the jump sizes of the $d$th derivative of the B-spline function at the interior knots. When the spline knots are equally spaced on $[0, 1]$, (9) can be expressed with $D$ proportional to the $(d+1)$th-order forward difference matrix. The interested reader can refer to Jhong, Koo, and Lee (2017) for a general calculation of $D$ for any degree $d$ and any knot sequence.
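On equally spaced knots, the penalty therefore reduces (up to a constant) to an $\ell_1$ norm of $(d+1)$th-order forward differences of the spline coefficients. A minimal pure-Python sketch of the difference matrix:

```python
def diff_matrix(J, order):
    """order-th forward difference matrix, of shape (J - order) x J."""
    D = [[1.0 if i == j else 0.0 for j in range(J)] for i in range(J)]
    for _ in range(order):
        # one forward-difference step: row i becomes row (i + 1) minus row i
        D = [[hi - lo for hi, lo in zip(D[i + 1], D[i])] for i in range(len(D) - 1)]
    return D

def apply(D, v):
    return [sum(r * x for r, x in zip(row, v)) for row in D]

d, J = 3, 8                                # cubic splines -> penalize (d + 1)-th differences
D = diff_matrix(J, d + 1)
beta = [float(j) ** 3 for j in range(J)]   # coefficients following a cubic polynomial
print(len(D), len(D[0]))                   # 4 8
print(sum(abs(v) for v in apply(D, beta))) # 0.0: 4th differences of a cubic vanish
```

A coefficient sequence following a degree-$d$ polynomial incurs zero penalty, so the penalty shrinks only genuine roughness; knots whose jump is shrunk to zero can be pruned.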

Implementation
Minimization of (5) involves jointly optimizing $\alpha$ and $\beta$. Considering the distinct roles of the coefficients $\alpha$ and $\beta$ in the function as well as in the penalty terms, we update the coordinates alternately to minimize (5). Algorithm 3.1 describes the proposed estimation procedure for fixed tuning parameters $\lambda = (\lambda_\alpha, \lambda_\beta)$. Based on Algorithm 3.1, we find the set of optimal tuning parameters through a two-dimensional grid search. The grid search algorithm considers all combinations of tuning parameters, and we choose the final model by a model-selection criterion. In this article, we adopt the Bayesian information criterion (BIC) (Schwarz 1978), which is commonly used in the regression literature, in which $J_\lambda$ denotes the number of active coefficients. The R package pbssim we developed also provides the Akaike information criterion.
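The two-dimensional grid search can be sketched as follows; the BIC expression here is the common $N \log(\mathrm{RSS}/N) + \log(N)\, J_\lambda$ form and the fit function is a toy stand-in, both assumptions for illustration:

```python
import math

def bic(rss, n, k_active):
    # assumed BIC form: n * log(rss / n) + log(n) * k_active
    return n * math.log(rss / n) + math.log(n) * k_active

def grid_search(fit, lambdas_a, lambdas_b):
    """fit(la, lb) -> (rss, n, k_active); return the (la, lb) pair minimizing BIC."""
    best, best_bic = None, float("inf")
    for la in lambdas_a:
        for lb in lambdas_b:
            rss, n, k = fit(la, lb)
            value = bic(rss, n, k)
            if value < best_bic:
                best, best_bic = (la, lb), value
    return best

# toy stand-in: larger penalties inflate RSS but shrink the active set
toy = lambda la, lb: (10.0 + la + lb, 100, max(1, 10 - int(la + lb)))
print(grid_search(toy, [0, 1, 2], [0, 1, 2]))
```

In the package, `fit` corresponds to running Algorithm 3.1 once per $(\lambda_\alpha, \lambda_\beta)$ pair, which is what yields the full solution paths.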

Find $\hat{\theta}_\lambda$
For any coordinate $z \in \theta$, the penalized objective function (5) can be represented, holding the other coefficients fixed, as a function of the form
$$r(z) = a(z - b)^2 + s \sum_{m=1}^{M} c_m |z - d_m|, \qquad (10)$$
where $a > 0$, $b \in \mathbb{R}$, $c_1, \ldots, c_M$ are positive real numbers, $d_1, \ldots, d_M$ are real numbers with $d_1 \le d_2 \le \cdots \le d_M$, and $s > 0$. By Theorem 1 of Jhong, Koo, and Lee (2017), this univariate convex function always has a global minimum. Thus, the proposed coordinate descent algorithm iteratively updates the coefficients by the minimizer of (10) until convergence.
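A sketch of the univariate update: since the objective is convex in $z$, even a fine grid search recovers the global minimizer, and a large penalty weight pulls the solution onto a point $d_m$ (illustrative Python, not the package's exact update rule):

```python
def r(z, a, b, cs, ds, s):
    # univariate objective of the form a * (z - b)^2 + s * sum_m c_m * |z - d_m|
    return a * (z - b) ** 2 + s * sum(c * abs(z - d) for c, d in zip(cs, ds))

def minimize(a, b, cs, ds, s, lo=-10.0, hi=10.0, grid=40001):
    # convex in z, so a fine grid search finds the global minimum
    best, best_val = lo, float("inf")
    for i in range(grid):
        z = lo + (hi - lo) * i / (grid - 1)
        val = r(z, a, b, cs, ds, s)
        if val < best_val:
            best, best_val = z, val
    return best

print(minimize(1.0, 2.0, [1.0], [0.0], 0.0))    # s = 0: minimizer is simply b = 2.0
print(minimize(1.0, 2.0, [1.0], [0.0], 100.0))  # large s pulls the minimizer to d_1 = 0.0
```

In practice the minimizer of a quadratic plus weighted absolute deviations can be found in closed form by checking the kink points $d_1, \ldots, d_M$; the grid search above is only for exposition.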

Update $\alpha$
For the sake of illustration, we consider $p = 2$ and let $\alpha = \alpha_1$ in this section. Replacing $f$ by a first-order approximation, (5) can be represented with respect to $\alpha$, where $f_\alpha(x_i; \alpha) = \sum_{j=1}^{J} \tilde{\beta}_j B_j(q(\alpha)^\top x_i)$ is the B-spline function with the current coefficients $\tilde{\beta}$. Thus, combined with the localized penalty, $\alpha$ can be updated in the form of (10) with $a$ and $b$ obtained in (11). As the tuning parameter $\lambda_\alpha$ becomes large, the minimizer shrinks to one of the localized points $d_m$, which makes $\xi$ sparse.

Update $\beta$
For each $j = 1, \ldots, J$, the proposed coordinate descent algorithm partially optimizes with respect to $\beta_j$, holding the other coefficients fixed. Choose $1 \le j \le J$; then (5) can be represented as $R(\tilde{\alpha}, \tilde{\beta}_1, \ldots, \beta_j, \ldots, \tilde{\beta}_J)$. The univariate penalty function $P_\beta(\tilde{\beta}_1, \ldots, \beta_j, \ldots, \tilde{\beta}_J)$ takes various forms depending on the degree $d$ of the splines and the index $j$ of the B-spline coefficient. Using the entries of the jump matrix $D$ related to $\beta_j$, it can be represented as
$$\sum_{m=1}^{M_j} c_m |\beta_j - d_m|,$$
where $c_m > 0$ and $d_m \in \mathbb{R}$ for $m = 1, \ldots, M_j$. Thus, we update $\beta_j$ by the minimizer of (10) with suitable $a$, $b$, $c$, and $d$, which are obtained in (12) and (13).

Initialization
For any $\xi \in S^{p-1}$, it is obvious that $-\|x\|_2 \le \xi^\top x \le \|x\|_2$ for $x \in \mathbb{R}^p$. Thus, we place the initial interior knots of the spline within the range $[-\|x_{(N)}\|_2, \|x_{(N)}\|_2]$ in the first step, where $\|x_{(N)}\|_2 = \max_{1 \le i \le N} \|x_i\|_2$. Since knots that are no longer active are pruned as the tuning parameter $\lambda_\beta$ increases, it is better to start with a large number of initial knots. We typically choose the initial dimension $J$ as $J = N$ or $J = N/2$ in the default setting.
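The knot-range computation can be sketched as follows (illustrative Python; the uniform placement of interior knots is an assumption):

```python
import math, random

random.seed(0)
N, p = 50, 3
X = [[random.uniform(-1, 1) for _ in range(p)] for _ in range(N)]

# |xi^T x| <= ||x||_2 for any unit vector xi, so every index value lies in
# [-max_i ||x_i||_2, max_i ||x_i||_2]; place the initial interior knots there
r = max(math.sqrt(sum(v * v for v in x)) for x in X)
J = N // 2                                   # default initial dimension J = N / 2
interior = [-r + 2 * r * k / (J + 1) for k in range(1, J + 1)]
print(round(r, 3), len(interior))
```

Whatever index vector the algorithm visits, the projected data never leave this interval, so the initial basis covers all candidate fits.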
It is crucial to obtain appropriate initial index coefficients $\tilde{\xi}$ for the coordinate descent algorithm to converge and to avoid getting stuck in local minima. We adopt a model-based sliced inverse regression method (Scrucca 2011) for initializing the index vector. It is a dimension-reduction technique that exploits information from the inverse regression mean function. We choose $\tilde{\xi}$ as the direction of the central dimension-reduction subspace that explains the largest variance. When the dimension $p$ is large, another option is to choose the $\ell_1$-penalized estimator as the initial value of $\xi$. In our package pbssim, the packages msir (Scrucca 2011) and glmnet (Friedman, Hastie, and Tibshirani 2010) are used to compute the initial values.
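For the $\ell_1$-penalized initialization, a plain coordinate-descent lasso followed by rescaling to unit length conveys the idea; this is a Python stand-in for the glmnet-based computation in pbssim, and the data-generating model below is hypothetical:

```python
import math, random

def lasso_cd(X, y, lam, iters=100):
    # plain coordinate-descent lasso with soft-thresholding; a stand-in for glmnet
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual correlation for coordinate j
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                      for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = math.copysign(max(abs(rho) - lam, 0.0), rho) / z
    return beta

random.seed(2)
n, p = 100, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * x[0] - x[1] + random.gauss(0, 0.1) for x in X]   # hypothetical: two signal predictors
b = lasso_cd(X, y, lam=5.0)
norm = math.sqrt(sum(v * v for v in b))
xi0 = [v / norm for v in b]       # rescale to unit length for the initial index vector
print([round(v, 2) for v in xi0])
```

The rescaling step matters: the lasso solution itself has arbitrary scale, but the initial index vector must lie on $S^{p-1}$ before being converted to spherical coordinates.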

Simulations
We consider three example functions in the simulation studies to investigate the finite-sample performance of the proposed estimator. Figure 5 displays the underlying functions of Examples 4.1-4.3. These underlying functions are motivated by Zeng, He, and Zhu (2012), Yang, Lv, and Guo (2016), and Radchenko (2015). Each example is obtained from model (1), where the predictors are randomly generated as $U(-1, 1)$ and $\epsilon_i \sim N(0, 1)$, for $N = 100, 200$ and $p = 10, 30, 50$, with the true index vector $\xi = (1, 1, 2, 2, 3, 0, \ldots, 0)/\sqrt{19} \in \mathbb{R}^p$.
Example 4.1. $f(x) = x^2 \exp(x)$.
Example 4.3. $f(x) = 2x^3$.
We compare PBS-SIM with the LASSO for the single-index model (SIM-LASSO) of Zeng, He, and Zhu (2012), the high-dimensional single-index model with post-processing (HD-SIM-PST) of Radchenko (2015), the penalized single-index model based on smoothing splines (SIM-EST) of Kuchibhotla and Patra (2017), the adaptive LASSO spline estimator of the single-index model (SIM-ALASSO) of Lu, Zhang, and Hu (2016), and the SCAD-penalty-based single-index model (SIM-SCAD) of Peng and Huang (2011). SIM-LASSO and HD-SIM-PST are obtained using the code provided by the authors of the corresponding papers. SIM-EST is available in the simest package in R. SIM-ALASSO and SIM-SCAD are computed by the lars and ncvreg packages in R, respectively. To ease the computational burden, we compute all estimators using the default settings of each package.
We evaluate the methods on the accuracy of function estimation, index vector estimation, and variable selection. The mean squared error (MSE), mean absolute deviation (MAD), and maximum deviation (MXDV) criteria are considered as discrepancy measures between the underlying function and each function estimator. To evaluate the index vector estimation, we adopt the $\ell_1$ error of the estimates of $\xi$, given by $\|\hat{\xi} - \xi\|_1 = \sum_{k=1}^{p} |\hat{\xi}_k - \xi_k|$. We also report the average number of true positives (the number of noise predictors identified as noise) and the average number of false positives (the number of noise predictors identified as signal) to measure variable-selection performance.
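The three discrepancy measures can be computed on an evaluation grid as below (a sketch; the paper's exact grid and normalization are not reproduced):

```python
import math

def metrics(f_true, f_hat, grid):
    # MSE, MAD, and maximum deviation (MXDV) between the true and estimated link
    diffs = [f_hat(z) - f_true(z) for z in grid]
    mse = sum(e * e for e in diffs) / len(diffs)
    mad = sum(abs(e) for e in diffs) / len(diffs)
    mxdv = max(abs(e) for e in diffs)
    return mse, mad, mxdv

grid = [i / 100 for i in range(-100, 101)]
f = lambda z: z ** 2 * math.exp(z)            # the Example 4.1 link
f_hat = lambda z: z ** 2 * math.exp(z) + 0.1  # an estimator off by a constant 0.1
print(metrics(f, f_hat, grid))                # approximately (0.01, 0.1, 0.1)
```

MSE emphasizes large local errors, MAD averages them, and MXDV captures the worst-case deviation, which is why all three are reported together.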
Here we summarize a slice of the accuracy results, focusing on Example 4.1 with sample size $N = 100$ and dimension $p = 10, 30, 50$. In the supplementary material, we provide the full set of results over the entire simulation scenarios. Figure 6 displays the MSE, MAD, and MXDV results for the estimators over 100 repetitions. It is noteworthy that PBS-SIM achieves lower MSE than the other estimators in all experimental scenarios, which suggests that the proposed method performs better in identifying the true underlying link function. Overall, PBS-SIM performs comparably well with the other estimators on all discrepancy measures. In addition, the results for the index vector estimation errors and the variable selection performance are provided in Table 1, which reports the average estimation error of $\xi$ (standard error in parentheses), true positives, and false positives over 100 runs of PBS-SIM, SIM-EST, SIM-LASSO, HD-SIM-PST, SIM-ALASSO, and SIM-SCAD for Example 4.1 with $N = 100$ and $p = 10, 30, 50$. Under this simulation scenario, PBS-SIM still performs well, presenting the lowest estimation error for the index vector $\xi$. Moreover, the numbers of true positives and false positives show that PBS-SIM achieves variable selection under the various scenarios.

Car mileage data
We examine the car mileage data, which are available in the ISLR (James et al. 2017) package in R. The data contain the mileages of $N = 392$ cars and $p = 6$ covariates: cylinders, displacement, horsepower, weight, acceleration, and year. Before analyzing the data, we standardized the predictors to have mean zero and variance one.
The left panel of Figure 7 shows the PBS-SIM fit for the car mileage data with degree $d = 3$, computed on the whole data set. The x-axis and y-axis represent the linear combination of predictors with the estimated index vector, $q(\hat{\alpha})^\top x$, and the miles per gallon (mpg), respectively. In the right panel of Figure 7, we show the entire solution path of the six index coefficients over the tuning parameter $\lambda_\alpha$. The optimal tuning parameter $\hat{\lambda}_\alpha = 533.82$ is chosen by the BIC for the final PBS-SIM, and we obtain the corresponding index vector $\hat{\xi} = (0, 0, 0, -0.929, 0.076, 0.362)$. One can see that the number of cylinders, displacement, and horsepower of the cars do not seem to affect the mpg.
Table 2 and Figure 8 show the standard errors and the box plots of 100 bootstrap replications of the proposed estimates, respectively. Note that the central 95% percentile intervals (the 2.5th and 97.5th percentiles of the bootstrap distributions) contain zero for "cylinders," "displacement," and "horsepower." Table 3 shows the coefficient estimates for the car mileage data. Although SIM-EST does not have a variable selection procedure, the coefficient of cylinders is estimated to be close to zero. Also, weight, year, and horsepower were selected as relatively important predictors in SIM-EST. HD-SIM-PST removes cylinders and displacement from the model, the same as the proposed estimator. In addition, although the signs of its index coefficients are reversed, the overall importance of the predictor variables is consistent with PBS-SIM.

Crime data
In this example, we focus on high-dimensional data analysis. We consider the communities and crime data from the UCI Machine Learning Repository (Redmond and Baveja 2002). Figure 9 shows the PBS-SIM result for the crime data with degree $d = 3$. There appears to be a positive trend between the estimated index and the percentage of divorced women. Among the 93 predictors, 18 are estimated as signal and the other 75 are regarded as noise. Table 4 lists the 18 active predictors and the corresponding index coefficients estimated by PBS-SIM. The five predictors contributing most to the index are "PctFam2Par," "agePct65up," "HousVacant," "agePct16t24," and "PctOccupManu," with estimated index coefficients $-0.912$, $-0.239$, $-0.172$, $-0.154$, and $0.118$, respectively. "PctFam2Par" is the percentage of families with kids that are headed by two parents, which is not a very surprising result. When the percentage of women who are older than 65 or between 16 and 24, or the number of vacant households, is higher, the percentage of divorced women is lower. The percentage of women aged 16 and over who are employed in manufacturing has a positive effect on the percentage of divorced women.

Conclusion
In this article, we demonstrated that the spherical-coordinates reparameterization of the index vector is useful in the single-index model for handling both the identification problem and variable selection. The proposed localized penalty achieves sparsity of the index coefficients based on the spherical coordinates. Moreover, an additional total variation penalty on the B-splines provides a spatially adaptive estimator that controls the smoothness of the link function. The proposed coordinate descent algorithm provides the entire solution paths of the index coefficients and the link functions for all sets of tuning parameters. Numerical studies using both simulated and real data have illustrated that the finite-sample performance of the proposed estimator is satisfactory.
As we saw in the crime data, one potential extension of the proposed method would be to the high-dimensional setup where $p$ is much larger than $N$. In a simple pilot study, we have confirmed that the proposed coordinate descent algorithm can also work in the high-dimensional case. We leave these tasks for future study.

Figure 1. Illustration of Example 2.1. The first column represents the underlying function (blue line) with $N = 200$ data points and the B-spline fit (red line) obtained by an unpenalized least-squares estimator with 1 and 50 interior knots, respectively. The vertical gray lines indicate the locations of the knots. In the second column, we display $R(\theta)$ with respect to $\alpha_1 \in [0, 2\pi]$; the gray vertical line indicates the position of the true spherical coordinate $\alpha_1 = 2\pi/3$ in this example.

Figure 2. Localized penalty function $P_\alpha$ with respect to $\alpha_1 \in [0, 2\pi]$ for $p = 2$.

Figure 3. Localized thresholding functions for the tuning parameter $\lambda_\alpha$ increasing from 0 to $\pi/4$. For each plot, the x-axis represents the unpenalized estimator $b$ and the y-axis the corresponding minimizer of $r_{\lambda_\alpha}$. The blue lines show the local thresholding function of the localized penalty for each $\lambda_\alpha$, and the gray lines are the baselines with $\lambda_\alpha = 0$.

Figure 6. Boxplot results for Example 4.1 with sample size $N = 100$ and dimension $p = 10, 30, 50$. Each plot displays the MSE, MAD, and MXDV of the estimators over 100 repetitions.

Figure 7. Left: PBS-SIM result for the car mileage data. The data points display $\hat{\xi}^\top x_i$ for $i = 1, \ldots, N$, and the solid line indicates the PBS-SIM function estimate of $f$.

Figure 9. PBS-SIM result for the crime data. The data points display $\hat{\xi}^\top x_i$ for $i = 1, \ldots, N$, and the solid line indicates the PBS-SIM function estimate of $f$.