Improved Estimation of High-dimensional Additive Models Using Subspace Learning

Abstract Additive models have been widely used as a flexible nonparametric regression method that can overcome the curse of dimensionality. Using sparsity-inducing penalties for variable selection, several methods have been developed for fitting additive models when the number of predictors is very large, sometimes even larger than the sample size. However, despite good asymptotic properties, the finite sample performance of these methods may deteriorate considerably when the number of relevant predictors becomes moderately large. This article proposes a new method that reduces the number of unknown functions to be nonparametrically estimated by learning a predictive subspace representation shared by the additive component functions. The subspace learning is integrated with sparsity-inducing penalization in a penalized least squares formulation, and an efficient algorithm is developed for computation, involving Stiefel matrix manifold optimization and proximal thresholding operators on matrices. Theoretical convergence properties of the algorithm are studied. The proposed method is shown to be competitive with existing methods in simulation studies and a real data example. Supplementary materials for this article are available online.


Introduction
Consider a nonparametric regression problem where Y is the response variable and $X = (X_1, \dots, X_p)^T$ is the p-dimensional vector of predictor variables. We have a random sample of size n from the joint distribution of $(Y, X^T)^T$, denoted as $(y_i, x_i^T)^T$ with $x_i = (x_{i1}, \dots, x_{ip})^T$ for $i = 1, \dots, n$. The (nonparametric) additive model takes the form
$$ y_i = \mu + \sum_{j=1}^{p} f_j(x_{ij}) + e_i, \quad i = 1, \dots, n, \qquad (1) $$
where $\mu$, the $f_j$'s, and the $e_i$'s are the intercept term, unknown additive component functions, and zero-mean random errors, respectively. For identifiability, we assume that the $f_j$'s are centered, that is, $\int_{\mathcal{X}_j} f_j(x)\,dx = 0$, where $\mathcal{X}_j$ is the compact domain of the additive component $f_j$. Without loss of generality, we assume that $\mathcal{X} = \mathcal{X}_j = [0, 1]$ for all j. The additive model enjoys the interpretability of the linear model while being more flexible and, thus, has become a classical nonparametric regression model when there are multiple predictor variables. There is an extensive literature on additive models, for example, Friedman and Stuetzle (1981), Buja, Hastie, and Tibshirani (1989), Tjøstheim and Auestad (1994), and Linton and Nielsen (1995). Comprehensive book-length treatments of additive models include the monographs of Hastie and Tibshirani (1990) and Wood (2017). While the additive model has been useful in many practical applications, its advantage over an unstructured nonparametric model has also been studied theoretically. According to Stone (1982), the optimal rate of convergence (measured using the squared $L_2$ norm) for estimating a p-dimensional m-order Hölder smooth function is $O_p(n^{-2m/(2m+p)})$. The very slow rate of convergence for large dimension p is an indication of the so-called "curse of dimensionality." Stone (1985) showed that, when estimating an additive model with p variables in which each additive component belongs to a Hölder function class of order m, one can achieve the rate $O_p(n^{-2m/(2m+1)})$, the same as when estimating a univariate m-order Hölder smooth function, and thus, the additive model tames the curse of dimensionality.
Recent years have seen growing interest in using additive models when p is very large, sometimes even larger than the sample size n. In such cases, consistent estimation of p unknown additive component functions is generally impossible. However, if most of the additive components $f_j$ are zero, consistent estimation is still feasible. In this case, the model (1) can be written as
$$ y_i = \mu + \sum_{j \in J_0} f_j(x_{ij}) + e_i, \quad i = 1, \dots, n, \qquad (2) $$
where $J_0$ (a subset of $\{1, \dots, p\}$) is an unknown index set corresponding to the nonzero additive component functions. The cardinality $|J_0| = s$ of the set $J_0$ is much smaller than p. The predictor variables corresponding to nonzero additive component functions are referred to as relevant variables, and s is the number of relevant variables. By using variable selection through penalized estimation, various proposals have been developed to distinguish the nonzero additive component functions from the zero ones and to provide efficient estimation of the nonzero additive component functions. Lin and Zhang (2006) applied a penalty that equals the summation of the RKHS (reproducing kernel Hilbert space) norms of the additive component functions, while Ravikumar et al. (2009) and Huang, Horowitz, and Wei (2010) employed the summation of the $L_2$ norms of the additive component functions as the penalty. Meier, van de Geer, and Bühlmann (2009), Koltchinskii and Yuan (2010), Raskutti, Wainwright, and Yu (2012), Suzuki and Sugiyama (2013), and Tan and Zhang (2019) considered a penalty that is the summation of the RKHS norms (or other function class norms) and the $L_2$ norms of the additive component functions. All of these works have developed asymptotic theory to justify the usefulness of sparsity-inducing penalties in high-dimensional additive models. In particular, the latter three articles developed the minimax rate of convergence under different conditions; Huang, Horowitz, and Wei (2010) and Ravikumar et al. (2009) studied variable selection consistency. On the other hand, several authors developed computational algorithms to implement their own methods, including Meier, van de Geer, and Bühlmann (2009), Ravikumar et al. (2009), and Huang, Horowitz, and Wei (2010). Recent related extensions beyond the standard additive models include Lou et al. (2016) for sparse partially linear generalized additive models, Petersen, Witten, and Simon (2016) for the fused Lasso additive models, and Sadhanala and Tibshirani (2019) for additive models built with trend filtering.
Although incorporating variable selection allows the above-mentioned methods to provide effective estimation of high-dimensional additive models, their performance may deteriorate considerably when the number of relevant predictors s becomes moderately large. To illustrate, Figure 1 presents one replication result from the first setting of our simulation study in Section 4. In this setting, the sample size is n = 210, the total number of predictors is p = 100, and the number of relevant predictor variables is s = 20. Three of the above-mentioned methods are applied to the simulated data to estimate the additive component functions: HAM (Meier, van de Geer, and Bühlmann 2009), SAM (Ravikumar et al. 2009), and AGL (Huang, Horowitz, and Wei 2010, with the adaptive group Lasso penalty). Each panel in Figure 1 shows the results of estimating one of the nonzero additive component functions. The red dotted lines with large point size are the true curves, and the blue dashed, dot-dashed, and two-dashed lines are the estimation results of AGL, HAM, and SAM, respectively. In many cases the estimated functions are quite far away from the corresponding true functions (e.g., $f_4$, $f_7$, $f_8$, $f_{10}$, and $f_{11}$). We have observed similar finite sample behavior of existing variable-selection-based methods in many other simulation experiments, leading us to conclude that a very large sample size may be required for the nice asymptotic behavior of existing methods to "kick in." The poor finite sample behavior shown in Figure 1 can be understood by noticing that, on average, there are only roughly 10 data points for each unknown nonzero additive component function, which may not be sufficient for accurate estimation. This motivates us to reduce the number of unknown functions to be estimated while retaining as much information as possible about the additive component functions. Specifically, if the nonzero additive component functions sit in a low-dimensional subspace, then we only need to estimate the small number of basis functions of this subspace, thereby increasing the number of data points per unknown function to be estimated. This leads to the approach of model reduction through subspace learning (e.g., Wang and Poor 1998; Chen, Zhu, and Xing 2010).
Precisely, using subspace learning we assume that all the additive component functions belong to an unknown r-dimensional subspace, or approximately so. With a set of basis functions $\{\beta_1, \dots, \beta_r\}$ of this subspace, we write
$$ f_j(x) = \sum_{\nu=1}^{r} d_{\nu j}\, \beta_\nu(x), \quad j = 1, \dots, p, \qquad (3) $$
where the $d_{\nu j}$'s are the representation coefficients for the additive components. In accordance with the identifiability condition, we require
$$ \int_{\mathcal{X}} \beta_\nu(x)\,dx = 0 \quad \text{and} \quad \int_{\mathcal{X}} \beta_\nu(x)\beta_{\nu'}(x)\,dx = \mathbb{1}\{\nu = \nu'\}, \quad \nu, \nu' = 1, \dots, r, \qquad (4) $$
and obtain the following model
$$ y_i = \mu + \sum_{j=1}^{p} D_j^T \beta(x_{ij}) + e_i, \quad i = 1, \dots, n, \qquad (5) $$
where $\beta(x) = (\beta_1(x), \dots, \beta_r(x))^T$ and $D_j = (d_{1j}, \dots, d_{rj})^T$. We call this model the reduced additive model (RAM) and refer to (2) as the full model. The reduced additive model (5) turns the problem of estimating the p unknown functions in the full additive model into the problem of estimating the r unknown basis functions of the dimension-r subspace and the associated representation coefficients. When r is much smaller than p (and s), we substantially reduce the number of unknown functions to be estimated and, therefore, the difficulty of the problem. In general, the low-dimensional subspace representation given in (3) cannot be exact and only serves as an approximation, but we show in our simulation study that we can still benefit from using the reduced additive model in such situations. In this article we propose to combine the subspace learning idea with variable selection using a sparsity-inducing penalty such as the Lasso (Tibshirani 1996). After reducing the estimation of many unknown functions to that of a small number of basis functions of a low-dimensional subspace, we still need to estimate the p representation vectors $D_j$ in (5). Accurate estimation of these r-dimensional representation vectors is difficult for large p. To overcome this challenge, we assume that only a few predictor variables are relevant for predicting the response variable and, therefore, many of the $D_j$'s are zero vectors. We then apply penalized least squares with a sparsity-inducing penalty to filter out the zero vectors during estimation. The black solid lines in Figure 1 are the results of our proposed method with the adaptive group Lasso penalty (RAM-AGL). We observe that subspace learning significantly improves upon existing methods and yields estimates close to the true functions.
Since the subspace basis functions are unknown, existing computational methods for sparsity-induced optimization cannot be directly applied to our problem. We develop a novel alternating direction descent algorithm that involves Stiefel matrix manifold optimization and proximal thresholding operators on matrices.

The rest of this article is organized as follows. Section 2 presents the proposed penalized least squares estimation method with subspace learning. Section 3 develops the computational algorithm and analyzes its convergence properties. The results of a simulation study and the real data analysis are given in Sections 4 and 5, respectively.

Penalized Least Squares Estimation with Subspace Learning
Our reduced additive model (5) reduces the problem of estimating the large number of p additive component functions to estimating a much smaller number of r subspace basis functions $\beta_\nu$, $\nu = 1, \dots, r$. Following Stone (1985), we estimate functions nonparametrically by approximating them with spline functions, that is, by linear combinations of fixed spline basis functions $b_\ell$, $\ell = 1, \dots, q$ ($q \ge r$). These lower-level fixed basis functions $b_\ell(x)$ will be referred to as the hyper basis functions to distinguish them from the subspace basis functions $\beta_\nu(x)$ to be learned from the data. Using these hyper basis functions, we can write (up to spline approximation error) each subspace basis function as
$$ \beta_\nu(x) = \sum_{\ell=1}^{q} a_{\ell\nu}\, b_\ell(x), \quad \nu = 1, \dots, r, $$
or, in matrix form, $\beta(x) = A^T b(x)$ with $b(x) = (b_1(x), \dots, b_q(x))^T$ and $A = (a_{\ell\nu}) \in \mathbb{R}^{q \times r}$. To ensure that the subspace basis $\beta(x) = A^T b(x)$ satisfies the identifiability condition (4), we require the hyper basis functions to have zero integral, $\int_{\mathcal{X}} b(x)\,dx = 0$, and at the same time to be orthonormal, $\int_{\mathcal{X}} b(x) b(x)^T\,dx = I_q$. Under these requirements, $\beta$ satisfies (4) if and only if A has orthonormal columns, that is, $A^T A = I_r$. In Section S.1 of the supplementary materials we construct the hyper basis functions $b(\cdot)$ from commonly used B-spline functions so that they satisfy the above required properties, by applying an orthonormalizing and de-meaning transformation.
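To make the construction concrete, the following is a minimal R sketch of building q de-meaned and approximately orthonormal spline functions on a fine grid; it is not the construction of Section S.1 itself, and the grid size and the use of cubic B-splines are assumptions made here for illustration.

# A rough sketch: q hyper basis functions on [0,1], de-meaned and
# orthonormalized numerically on a grid (the exact construction is in
# Section S.1 of the supplement; grid size and B-spline choice are
# illustrative assumptions).
library(splines)

make_hyper_basis <- function(q, grid_size = 1000) {
  x <- seq(0, 1, length.out = grid_size)
  raw <- bs(x, df = q + 1, intercept = TRUE)            # q + 1 raw B-spline functions
  centered <- scale(raw, center = TRUE, scale = FALSE)  # subtract column means: approx. zero integral
  # after centering, the q + 1 columns are linearly dependent (their sum is
  # the zero function), so only q functions are kept
  Q <- qr.Q(qr(centered))[, seq_len(q), drop = FALSE]
  Q <- Q * sqrt(grid_size)                              # approx. orthonormal in L2([0, 1])
  list(grid = x, values = Q)                            # b_l evaluated on 'grid'
}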
The above setup enables us to transfer the problem of learning the subspace basis functions to the problem of estimating the coefficient matrix A. Given the random sample, let $B_i = (b(x_{i1}), \dots, b(x_{ip})) \in \mathbb{R}^{q \times p}$ collect the hyper basis functions evaluated at the ith observation. It follows that the model (5) can be written as
$$ y_i = \mu + \langle B_i, AD \rangle + e_i, \quad i = 1, \dots, n, \qquad (6) $$
where $D = (D_1, \dots, D_p) \in \mathbb{R}^{r \times p}$ and $\langle \cdot, \cdot \rangle$ denotes the Frobenius (trace) inner product. To estimate the unknown A and D, we consider minimizing the least squares objective
$$ \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \mu - \langle B_i, AD \rangle \big)^2. $$
We next combine the subspace learning approach described above with variable selection through a sparsity-inducing penalty to yield our final improved estimator of high-dimensional additive models. In light of (6), the jth additive component becomes a zero function, $f_j(\cdot) \equiv 0$, whenever the component representation vector satisfies $D_j = 0$. In other words, the jth predictor $X_j$ is irrelevant for the response Y if and only if $D_j = 0$. Borrowing ideas from group Lasso penalization (Yuan and Lin 2006), we consider the penalized least squares problem
$$ \min_{\mu, A, D}\ \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \mu - \langle B_i, AD \rangle \big)^2 + \lambda \sum_{j=1}^{p} \| D_j \|_2, \quad \text{subject to } A^T A = I_r, \qquad (7) $$
where $\sum_{j=1}^{p} \| D_j \|_2$ is the group Lasso penalty and $\lambda$ is the penalty parameter. The group Lasso penalty in (7) encourages a column-sparse solution for the matrix D. Ideally, the solution would satisfy $D_j = 0$ if and only if the true additive component function $f_j \equiv 0$. However, the group Lasso estimator, defined as the solution of (7), usually does not have variable selection consistency and also suffers from large biases when estimating the nonzero additive component functions. By extending the idea of the adaptive Lasso (Zou 2006), we recommend using the adaptive group Lasso penalty, which allows different levels of penalization for different columns of D using information provided by an initial estimator. The recommended adaptive group Lasso estimator solves the following problem
$$ \min_{\mu, A, D}\ \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \mu - \langle B_i, AD \rangle \big)^2 + \lambda \sum_{j=1}^{p} w_{nj} \| D_j \|_2, \quad \text{subject to } A^T A = I_r, \qquad (8) $$
where $w_{nj} = 1 / \| \widetilde{D}_j \|_2^{\iota}$ with some fixed constant $\iota$, for $j = 1, \dots, p$, and $\widetilde{D}$ is an initial estimator. The initial estimator $\widetilde{D}$ can be chosen as the group Lasso estimator defined by (7).
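As a concreteness check on the notation in (6) and (7), the R sketch below evaluates the penalized objective for given A and D; the array layout of the hyper basis evaluations (B[i, , ] being the q-by-p matrix B_i) and the function name are illustrative assumptions, not the authors' implementation.

# A minimal sketch of evaluating the penalized least squares objective (7)
# for given A (q x r) and D (r x p); B is assumed to be an n x q x p array
# with B[i, , ] = B_i (illustrative layout, not the authors' code).
ram_objective <- function(y, B, A, D, lambda) {
  n <- length(y)
  AD <- A %*% D                                           # q x p
  fitted <- vapply(seq_len(n),
                   function(i) sum(B[i, , ] * AD),        # Frobenius inner product <B_i, AD>
                   numeric(1))
  mu_hat <- mean(y - fitted)                              # profiled intercept
  loss <- sum((y - mu_hat - fitted)^2) / (2 * n)          # least squares part
  loss + lambda * sum(sqrt(colSums(D^2)))                 # plus group Lasso penalty on columns of D
}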
Remark. The orthonormality constraint in (4) facilitates both the interpretability and the computation of our model. Under this constraint, the $L_2$ norms of the additive component functions ($f_j$'s) are equal to the Euclidean norms of their representation vectors $D_j$. The penalty on $D_j$ in (7) and (8) is, therefore, a direct translation of an $L_2$ penalty on $f_j$, $j = 1, \dots, p$. The orthonormality constraint also balances the relative sizes of the factored matrices A and D, preventing D from being overly shrunk toward 0 by the penalty.

Alternating Direction Descent Algorithm
This section provides an algorithm for solving the penalized least squares problem (7). With a slight modification as explained at the end of this section, the same algorithm can be applied to solve (8).
Note that $\mu$ in (7) can be profiled out by taking its optimal value $\hat{\mu} = (1/n)\sum_{i=1}^{n} (y_i - \langle B_i, AD \rangle)$ when A and D are fixed. Thus, we can rewrite the objective function in (7) as
$$ \min_{A, D}\ \frac{1}{2n} \sum_{i=1}^{n} \big\{ (y_i - \bar{y}) - \langle B_i - \bar{B}, AD \rangle \big\}^2 + \lambda \sum_{j=1}^{p} \| D_j \|_2, \quad \text{subject to } A^T A = I_r, \qquad (9) $$
where $\bar{y} = (1/n)\sum_{i=1}^{n} y_i$ and $\bar{B} = (1/n)\sum_{i=1}^{n} B_i$. We propose to solve this minimization problem by an alternating direction descent method (e.g., Bunea, She, and Wegkamp 2012; He et al. 2018). The proposed Algorithm 1 generates a sequence of matrices $D^{(k)}$ and $A^{(k)}$, $k = 1, 2, \dots$, by alternately updating D and A, with the other matrix fixed. To facilitate our discussion, we let L(D, A) denote the objective function in (9) without the last penalty term.

Algorithm 1 (alternating direction descent, summarized): starting from an initial $(D^{(0)}, A^{(0)})$, repeat until convergence: update $A^{(k+1)}$ by the projected-gradient step on the Stiefel manifold with QR retraction described below; update $D^{(k+1)} = \mathrm{Prox}_{\beta^{(k)}}(D^{(k)})$ as in (14), where $\beta^{(k)}$ is obtained by Algorithm 2; set $k \leftarrow k + 1$. Output the matrices $A^{(k)}$ and $D^{(k)}$.

We first address the problem of updating A when D is fixed at $D^{(k)}$, taking into account the restriction that A has orthonormal columns. Without accounting for the geometric structure set by this restriction, an ad hoc algorithm may not provide a stable sequence of iterates (Absil, Mahony, and Sepulchre 2009). Therefore, we employ the more principled gradient-based matrix manifold optimization method (Edelman, Arias, and Smith 1998).
Let $T_A$ denote the tangent space of the Stiefel manifold at the point A. The Riemannian gradient can be computed by projecting the classical (Euclidean) gradient onto the tangent space $T_A$. For the Stiefel manifold $\mathrm{St}(r, q)$, the projection has the explicit form
$$ P_{T_A}(Z) = Z - A\, \mathrm{sym}(A^T Z), \quad \text{where } \mathrm{sym}(M) = (M + M^T)/2. $$
The current point $A^{(k)}$ is moved along the negative direction of the projected gradient $P_{T_{A^{(k)}}}(\partial L / \partial A)$. In addition, the QR decomposition is employed to keep the moved matrix on the Stiefel manifold. We denote by $\mathrm{qf}(A)$ the Q matrix of the QR decomposition $A = QR$; in this QR decomposition, the matrix Q has size q-by-r, and the matrix $R \in \mathbb{R}^{r \times r}$ is required to have positive diagonal elements. The update then becomes
$$ A^{(k+1)} = \mathrm{qf}\Big( A^{(k)} - \alpha^{(k)}\, P_{T_{A^{(k)}}}\big( \partial L(D^{(k)}, A^{(k)}) / \partial A \big) \Big), $$
where $\alpha^{(k)}$ is the stepsize selected by the Armijo rule (Absil, Mahony, and Sepulchre 2009, chap. 4.2). More precisely, the stepsize is set to $\alpha^{(k)} = (\eta_A)^{\rho}$ with some $\eta_A \in (0, 1)$, where $\rho$ is the minimal value among the positive integers $\{1, 2, \dots\}$ such that the sufficient decrease criterion
$$ L(D^{(k)}, A^{(k)}) - L(D^{(k)}, A^{(k+1)}) \ge \sigma_A\, \alpha^{(k)} \big\| P_{T_{A^{(k)}}}\big( \partial L(D^{(k)}, A^{(k)}) / \partial A \big) \big\|_F^2 $$
holds with some fixed $\sigma_A \in (0, 1)$.
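The two geometric ingredients of the A-update, the tangent space projection and the QR retraction, can be sketched in a few lines of R; this is a generic illustration of the standard operations on St(r, q), not the authors' implementation.

# A minimal sketch of the tangent space projection and QR retraction on the
# Stiefel manifold St(r, q) (A is q x r with orthonormal columns); generic
# formulas, not the authors' code.
proj_tangent <- function(A, Z) {
  S <- crossprod(A, Z)                        # A^T Z
  Z - A %*% ((S + t(S)) / 2)                  # Z - A sym(A^T Z)
}

qf_retract <- function(M) {
  qr_fit <- qr(M)
  Q <- qr.Q(qr_fit)
  R <- qr.R(qr_fit)
  Q %*% diag(sign(diag(R)), nrow = ncol(M))   # force R to have positive diagonal
}

# One manifold gradient step with stepsize alpha (grad_A = dL/dA):
# A_new <- qf_retract(A - alpha * proj_tangent(A, grad_A))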

Algorithm 2: The backtracking line search.
After updating A, the matrix $D^{(k)}$ is updated by the iterative thresholding algorithm (Beck and Teboulle 2009). The gradient of L(D, A) with respect to D is
$$ \frac{\partial L(D, A)}{\partial D} = -\frac{1}{n} \sum_{i=1}^{n} \big\{ (y_i - \bar{y}) - \langle B_i - \bar{B}, AD \rangle \big\}\, A^T (B_i - \bar{B}). $$
In the iterative thresholding algorithm, $L(D^{(k)}, A^{(k+1)})$ is replaced by the second-order surrogate
$$ Q^{(k)}(D) = L(D^{(k)}, A^{(k+1)}) + \Big\langle \frac{\partial L(D^{(k)}, A^{(k+1)})}{\partial D},\, D - D^{(k)} \Big\rangle + \frac{1}{2\beta^{(k)}} \| D - D^{(k)} \|_F^2, $$
where $\beta^{(k)}$ is the stepsize obtained by the backtracking line search given in Algorithm 2. This second-order surrogate $Q^{(k)}$ is coupled with the group Lasso penalty, and the update of D amounts to solving the proximal operator
$$ D^{(k+1)} = \mathrm{Prox}_{\beta^{(k)}}(D^{(k)}) = \arg\min_{D}\ Q^{(k)}(D) + \lambda \sum_{j=1}^{p} \| D_j \|_2. \qquad (14) $$
For the group Lasso penalty, solving the proximal operator is equivalent to applying column-wise soft thresholding to the matrix
$$ M^{(k)} = D^{(k)} - \beta^{(k)}\, \frac{\partial L(D^{(k)}, A^{(k+1)})}{\partial D}. $$
More precisely, the jth column of the updated matrix $D^{(k+1)} = \mathrm{Prox}_{\beta^{(k)}}(D^{(k)})$ has the analytical solution
$$ D_j^{(k+1)} = \Big( 1 - \frac{\lambda \beta^{(k)}}{\| M_j^{(k)} \|_2} \Big)_+ M_j^{(k)}, \quad j = 1, \dots, p, \qquad (15) $$
where we employ the function $(x)_+ = \max\{x, 0\}$. The above iterative procedure is summarized in Algorithm 1. When solving the adaptive group Lasso penalization problem (8), Algorithm 1 can be easily adapted with a slight modification, namely, replacing $\lambda$ with the column-specific $\lambda_j = \lambda w_{nj}$ in the updating formula (15) for $D_j^{(k+1)}$.
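A small R sketch of the column-wise soft-thresholding step (15) is given below; the helper name and inputs are illustrative.

# A minimal sketch of the column-wise soft thresholding in (15): M is the
# gradient-step matrix D^(k) - beta * dL/dD, and each column of the output is
# the corresponding column of D^(k+1) (illustrative helper, not the authors' code).
prox_group_lasso <- function(M, beta, lambda) {
  norms <- sqrt(colSums(M^2))
  shrink <- pmax(1 - lambda * beta / norms, 0)
  shrink[norms == 0] <- 0                     # guard against 0/0 for all-zero columns
  sweep(M, 2, shrink, `*`)                    # rescale each column
}
# For the adaptive penalty (8), replace lambda by the column-specific lambda * w_nj.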

Tuning Parameters
For our reduced additive models (RAM-GL and RAM-AGL), we need to select the dimension q of the hyper basis, the subspace dimension r, and the penalty parameter $\lambda$. To be computationally simple but still data adaptive, we follow a strategy used in Fan, Feng, and Song (2011) and Fan, Ma, and Dai (2014) and set $q = \lceil b n^{1/5} \rceil$, where b is a pregiven small positive integer and $\lceil \cdot \rceil$ denotes the ceiling function. In practice, we use b = 3 in all our numerical studies and find that this empirical rule works well in all of our experiments. On the other hand, r and $\lambda$ are selected by minimizing the Bayesian information criterion (BIC). Specifically, the BIC is computed from the residual sum of squares (RSS) and the degrees of freedom (df). Let $\hat{s}$ be the number of selected nonzero additive component functions. The degrees of freedom are computed as $\mathrm{df}(\lambda, r) = (\hat{s} + q - r) r$, which is the dimension of the manifold of q-by-$\hat{s}$ rank-r matrices (Shalit, Weinshall, and Chechik 2012).
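The tuning rule can be coded in a few lines of R; the specific BIC form below, n log(RSS/n) + log(n) df, is a common variant and an assumption made here for illustration, while df = (s_hat + q − r) r follows the formula in the text.

# A minimal sketch of a BIC used to select (r, lambda).  The form
# n * log(RSS / n) + log(n) * df is a common BIC variant and an assumption
# here; df = (s_hat + q - r) * r is the degrees of freedom given in the text,
# with s_hat the number of nonzero columns of the fitted D.
ram_bic <- function(rss, n, q, r, s_hat) {
  df <- (s_hat + q - r) * r
  n * log(rss / n) + log(n) * df
}
# (r, lambda) is chosen by minimizing ram_bic over a grid of candidate values.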

Convergence Analysis of the Proposed Algorithm
Using an alternating method to minimize an objective function can be found in many published works; see, for example, Huang, Breheny, and Ma (2012) and Bunea, She, and Wegkamp (2012), and more recently, Ma, Ma, and Sun (2020) and Yu, Gupta, and Kolar (2020). However, the convergence analysis of our proposed Algorithm 1 is not a direct extension of existing work, for three reasons: (i) the manifold structure on A imposes nonconvex constraints that the optimization algorithm needs to respect; (ii) in each step of updating one matrix with the other fixed, a direct closed-form solution is not available, and we use the manifold gradient method and the iterative thresholding method, respectively, to update; (iii) unlike many alternating minimization methods with both inner and outer loops (e.g., Bunea, She, and Wegkamp 2012), the proposed Algorithm 1 has only one iterative loop.
The nonconvexity makes it difficult to ensure that an algorithm converges to the global optimum. Further, the manifold structure poses extra challenges for an algorithm to converge even to a local optimum (Absil, Mahony, and Sepulchre 2009). In this section, we show that every accumulation point of the sequence generated by Algorithm 1 is a stationary point of (9). In addition, we characterize the speed of convergence to an accumulation point. Related results can be found in the literature on nonconvex optimization (e.g., Yun, Tseng, and Toh 2011; Beck 2015).
We first provide the precise definition of a stationary point in our context. Although the definition and the following analysis are based on the group Lasso penalization (7), the same results can be easily extended to the adaptive group Lasso penalization (8), since the penalty parameter $\lambda$ in (7) is considered fixed from the viewpoint of the computational algorithm, and the same is true for the $\lambda_j$ in (8), $j = 1, \dots, p$. We let $G(D)$ denote $\lambda \sum_{j=1}^{p} \| D_j \|_2$ for simplicity, and define $\partial G / \partial D$ as the set of subgradients of G with respect to D. Direct calculation shows that it has a closed form whose jth column $(\partial G / \partial D)_j$ can be written as
$$ (\partial G / \partial D)_j = \begin{cases} \lambda\, D_j / \| D_j \|_2, & D_j \neq 0, \\ \lambda\, B(0, 1), & D_j = 0, \end{cases} \quad j = 1, \dots, p, $$
where $B(0, 1)$ is the closure of the unit ball in $\mathbb{R}^r$ centered at 0. A pair (D, A) with $A^T A = I_r$ is then called a stationary point of (9) if $0 \in \partial L(D, A)/\partial D + \partial G/\partial D$ and the Riemannian gradient with respect to A vanishes, that is, $P_{T_A}(\partial L(D, A)/\partial A) = 0$.
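For intuition, the first-order condition in the D-block can be checked numerically as in the R sketch below; this diagnostic (and its name) is an illustration of the subgradient characterization above, not part of the authors' algorithm.

# A minimal sketch of checking approximate stationarity in D: for a nonzero
# column the gradient must cancel lambda * D_j / ||D_j||_2, and for a zero
# column its norm must not exceed lambda (illustrative diagnostic only).
stationarity_gap_D <- function(grad_D, D, lambda, tol = 1e-8) {
  p <- ncol(D)
  gaps <- numeric(p)
  for (j in seq_len(p)) {
    dj <- D[, j]
    gj <- grad_D[, j]
    dj_norm <- sqrt(sum(dj^2))
    gaps[j] <- if (dj_norm > tol) {
      sqrt(sum((gj + lambda * dj / dj_norm)^2))
    } else {
      max(sqrt(sum(gj^2)) - lambda, 0)
    }
  }
  max(gaps)   # near zero when the D-block first-order condition holds
}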
Theorem 1. Suppose n ≥ 2. Then the following statements hold.
(i) All accumulation points of the sequence $\{(D^{(k)}, A^{(k)})\}$ generated by Algorithm 1 are stationary points of (9). Furthermore, the values of the objective function are identical at all accumulation points.
(ii) There exists a positive constant $C_c$, depending on the algorithm parameters $\eta_A$, $\sigma_A$, $\eta_D$, and $\beta_{\min}$, such that the minimum of the gradient norms over the iterates $(D^{(k)}, A^{(k)})$, $k = 0, \dots, m$, is bounded by $C_c / \sqrt{m}$.

The condition n ≥ 2 in Theorem 1 is mild and usually satisfied in real applications. Indeed, if the sample size is n = 1, one can only estimate the intercept term $\mu$ in (9) as the observed response, and all the additive component functions are estimated as zero functions, which is a degenerate case.
The second statement of Theorem 1 implies that the norm of the gradients decreases at the rate of $1/\sqrt{m}$, where m is the total number of iterations. This result matches the typical convergence rate for smooth nonconvex objective functions (Beck 2015; Kasai, Sato, and Mishra 2018) when a gradient-based method is used. Different from these existing results, the objective function in our problem is both nonconvex and nonsmooth, and the geometric structure of the Stiefel manifold (Absil, Mahony, and Sepulchre 2009) has to be taken into account. The proofs of Theorem 1 and the auxiliary lemmas are deferred to Section S.3 of the supplementary materials.
It is worth mentioning that our result differs from that of He et al. (2018), who assumed a Lipschitz continuity condition on the loss function without proving it. Theorem 1 also shows that, for any given initial point $(D^{(0)}, A^{(0)})$, accumulation points of the generated sequence always exist and share an identical value of the objective function. This convergence result is stronger than that in He et al. (2018) for their manifold optimization algorithm.
Due to the nonconvexity of the objective function and the manifold constraint, convergence to the global optimum from an arbitrary initial point cannot be guaranteed. A similar theoretical gap also exists in a number of related works in statistics; see, for example, Bunea, She, and Wegkamp (2012) and Zhou, Li, and Zhu (2013). In practice, the proposed algorithm for our reduced model can be initialized by the truncated singular value decomposition of an estimated coefficient matrix obtained from the full model (e.g., HAM or GL). We find that this initialization provides very stable results.
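A minimal R sketch of this initialization is given below, assuming C_full is a q-by-p coefficient matrix estimated under the full model (e.g., by GL); the helper name is illustrative.

# A minimal sketch of initializing (A, D) from a full-model fit: take the
# truncated SVD of an estimated q x p coefficient matrix C_full and keep the
# leading r components (illustrative helper, not the authors' code).
init_from_full_model <- function(C_full, r) {
  sv <- svd(C_full, nu = r, nv = r)
  A0 <- sv$u                                           # q x r with orthonormal columns
  D0 <- diag(sv$d[seq_len(r)], nrow = r) %*% t(sv$v)   # r x p
  list(A = A0, D = D0)
}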

Simulation Study
In this section, we compare our proposed method with a few existing methods: the method of Huang, Horowitz, and Wei (2010) with the group Lasso penalty and with the adaptive group Lasso penalty (abbreviated as GL and AGL, respectively), the HAM (high-dimensional additive models) method of Meier, van de Geer, and Bühlmann (2009), and the SAM (sparse additive models) method of Ravikumar et al. (2009). For ease of presentation, our reduced additive model with the group Lasso penalty and with the adaptive group Lasso penalty is abbreviated as RAM-GL and RAM-AGL, respectively. We consider five simulation setups, two of which are presented in the main article; the remaining three are in the supplementary materials.
We could use these functions as the subspace basis functions directly. Instead, we construct subspace basis functions with more interesting shapes through an orthogonal transformation of these functions. Specifically, we randomly draw a 6-by-6 matrix A whose entries are sampled from the standard normal distribution. Let $Q_A = (q_{u\nu})_{6 \times 6}$ be the Q factor of the QR decomposition of the matrix A, so that $Q_A$ is an orthonormal matrix. In the next step, the columns $q_\nu$ of $Q_A$ are reordered according to the value of $q_{5\nu}^2 + q_{6\nu}^2$, so that, since the subspace basis functions are generated via $\beta_\nu(x) = \alpha(x)^T q_\nu$ for $\nu = 1, 2, \dots, 6$, the high-frequency functions (i.e., $\alpha_{31}(x)$ and $\alpha_{32}(x)$) have the least amplitude (contribution) in forming the first few subspace basis functions, especially $\beta_1$ and $\beta_2$. The reordering, therefore, makes the first few subspace basis functions smoother; a small R sketch of this construction is given after this paragraph. It is also straightforward to verify that these functions satisfy the identifiability condition (4).

In the first two settings, we set $\gamma = (30, 20, 0, 0, 0, 0)^T$ and $\gamma = (30, 20, 4, 3, 2, 1)^T$, respectively. The former gives an exact dimension-two model, while the latter is an approximate dimension-two model, for which the first two subspace basis functions dominate but the effects of the remaining basis functions are not exactly zero. We consider four combinations of the total number of additive components p and the sample size n: (p, n) = (100, 210), (400, 240), (700, 270), and (1000, 300). In both settings, the first s = 20 additive component functions in (5) (corresponding to j = 1, ..., 20) are set to be nonzero, and the remaining ones are set to zero. In total, 100 replicated datasets are generated in each setting. The tuning parameters of our reduced additive models are selected as in Section 3.2. For the other methods, the tuning parameters are selected by the BIC.
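For concreteness, the random orthonormal mixing referred to above can be generated as in the following R sketch; the seed and the ordering direction (smallest high-frequency weight first) are illustrative assumptions.

# A minimal sketch of generating the orthonormal mixing matrix Q_A: QR of a
# 6 x 6 standard normal matrix, with columns reordered so that the weight on
# the high-frequency functions (rows 5 and 6) is smallest for the leading
# columns (seed and ordering direction are illustrative assumptions).
set.seed(1)
A_rand <- matrix(rnorm(36), nrow = 6, ncol = 6)
Q_A <- qr.Q(qr(A_rand))
ord <- order(Q_A[5, ]^2 + Q_A[6, ]^2)   # smallest high-frequency weight first
Q_A <- Q_A[, ord]
# Subspace basis: beta_nu(x) = t(alpha(x)) %*% Q_A[, nu], nu = 1, ..., 6.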
The estimation results for the various methods are compared using the following metrics. The first metric is the square root prediction error (SRPE). In each simulation replicate, an additional test dataset $\{(y_i^{(\mathrm{test})}, (x_i^{(\mathrm{test})})^T)^T\}$ of size $n^{(\mathrm{test})} = 3000$ is generated from the same true model. After model fitting, the SRPE is evaluated over the test dataset as
$$ \mathrm{SRPE} = \Big\{ \frac{1}{n^{(\mathrm{test})}} \sum_{i=1}^{n^{(\mathrm{test})}} \big( y_i^{(\mathrm{test})} - \hat{y}_i^{(\mathrm{test})} \big)^2 \Big\}^{1/2}, $$
where $\hat{y}_i^{(\mathrm{test})}$ is the prediction from the fitted model. The second metric is the averaged squared $L_2$ norm of the estimation error across all true nonzero functions,
$$ \mathrm{MSE}_1 = \frac{1}{s} \sum_{j \in J_0} \| \hat{f}_j - f_j \|_2^2, $$
where $f_j(\cdot)$ and $\hat{f}_j(\cdot)$ are the true and the corresponding estimated additive component functions, respectively. The third metric is the averaged squared $L_2$ norm across all true zero functions,
$$ \mathrm{MSE}_2 = \frac{1}{p - s} \sum_{j \notin J_0} \| \hat{f}_j \|_2^2. $$
Variable selection is evaluated by the true positive rate (TPR) and the true negative rate (TNR), defined as the proportion of truly relevant variables that are selected and the proportion of truly irrelevant variables that are excluded, respectively. For our proposed reduced additive model, the average of the selected subspace dimension (Dim) is also computed.
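A compact R sketch of these metrics is given below; the grid-based evaluation of the component functions and the argument names are illustrative assumptions.

# A minimal sketch of the evaluation metrics.  f_hat and f_true are assumed
# to be n_grid x p matrices of estimated and true component functions on a
# grid in [0, 1]; J0 indexes the truly nonzero components; 'selected' is a
# logical vector of selected variables (layout and names are illustrative).
eval_metrics <- function(y_test, yhat_test, f_hat, f_true, J0, selected) {
  p <- ncol(f_true)
  zero_set <- setdiff(seq_len(p), J0)
  sq_l2 <- colMeans((f_hat - f_true)^2)          # grid approximation of ||f_hat_j - f_j||_2^2
  list(
    SRPE = sqrt(mean((y_test - yhat_test)^2)),
    MSE1 = mean(sq_l2[J0]),                      # averaged over true nonzero components
    MSE2 = mean(sq_l2[zero_set]),                # averaged over true zero components
    TPR  = mean(selected[J0]),                   # fraction of relevant variables selected
    TNR  = mean(!selected[zero_set])             # fraction of irrelevant variables excluded
  )
}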
The simulation results for the first setting (the exact dimension-two reduced additive model) and the second setting (the approximate dimension-two reduced additive model) are summarized in Tables 1 and 2, respectively. Across the various metrics, our method outperforms the methods of Meier, van de Geer, and Bühlmann (2009), Ravikumar et al. (2009), and Huang, Horowitz, and Wei (2010) with the same type of penalty. Our recommended method with the adaptive group Lasso penalty gives the best results. It is worth mentioning that in the second setting, the full model is correctly specified and the reduced additive model is not exactly correct, but our recommended method still shows better results. This setting illustrates the value of the reduced additive model when it serves only as an approximate model and there is a moderately large number of nonzero additive component functions.
As illustrated by the replication in Figure 1, the performance of our methods is consistently better in terms of estimation accuracy. Compared with the GL and AGL methods using the full model, our RAM-GL and RAM-AGL methods using the reduced additive model decrease $\mathrm{MSE}_1$ by more than 95% in the first setting and by more than 75% in the second setting. We observe that, when using the full model, the GL method has significantly larger $\mathrm{MSE}_1$ than the AGL method; this is due to GL failing to include relevant variables in the model. Our methods also decrease $\mathrm{MSE}_2$ significantly, roughly by more than 90% and 70% in the first and second settings, respectively. Both SAM and HAM achieve estimation accuracy similar to AGL but remain inferior to RAM-AGL. The ranking of the competing methods in terms of SRPE is the same as the ranking measured by $\mathrm{MSE}_1$. Overall, our recommended method RAM-AGL achieves comparably high accuracy of additive component estimation and comparably low prediction error, even when the reduced model is only approximately correct.

Regarding the accuracy of variable selection, SAM has high TPRs and low TNRs in both settings. Although the GL and AGL methods have high TNRs in most cases, their relatively low TPRs indicate that these methods falsely identify some truly relevant predictor variables as irrelevant. The TPR and TNR of our methods using subspace learning are generally higher than those of the other compared methods. In particular, RAM-GL and RAM-AGL can identify the relevant predictor variables in both simulation settings. Moreover, RAM-AGL has relatively higher TPR and TNR than AGL, especially when the proposed model is only an approximation. For our reduced additive models, the average selected subspace dimension is also reported in the rows named Dim in Tables 1 and 2. RAM-GL and RAM-AGL are able to identify the dimension of the true or approximate subspace in the first and second settings, respectively.

Section S.2 of the supplementary materials contains three additional simulation setups. In one setup, an approximate dimension-four model is considered, and the results are similar to those of the two settings presented here. We also consider two classical setups from Meier, van de Geer, and Bühlmann (2009) and Huang, Horowitz, and Wei (2010), where no reduced-rank structure is assumed. The results show that RAM-AGL still performs the best when the sample size is relatively small. As the sample size gets larger, the selected subspace dimension of the reduced model increases and the performance of RAM-AGL becomes very close to that of the most competitive method using the full model. The robust performance of RAM-AGL can be explained by its ability to adjust the bias-variance tradeoff by adjusting the model rank.
Section S.2 of the supplementary materials also reports a speed benchmark for the competing methods. The results show that all methods can finish one round of optimization within a reasonable amount of time (one second) in all the numerical settings. The reduced additive model generally has the same level of scalability as the full model, but it has one more tuning parameter, which may result in a longer actual running time.

Real Data Analysis
We apply the additive model to the bioethanol production dataset from the R package chemometrics (Filzmoser and Varmuza 2017) and compare the methods considered in the previous section. In large-scale industrial bioethanol production, glucose and ethanol are two important compounds. During the fermentation process, the ethanol concentration level steadily increases in the mash. The mash is then separated by distillation into a vapor phase high in alcohol and the stillage. By thoroughly distilling ethanol from the stillage, production loss can be avoided. The modeling target is to predict the concentration level of ethanol from near infrared (NIR) spectroscopy data. The NIR spectroscopy data are measured from the liquid sample by a Brimrose Luminar 5030 NIR spectrometer with an acousto-optic tunable filter. This technique has the advantage of incorporating rapid, nondestructive, and multi-constituent analyses into the fermentation process.
A total of n = 166 fermentation mashes are recorded. The response of the model is the ethanol concentration (in g/L). For each sample, the measured spectrum covers the wavelength range 1115-2285 nm in 5 nm increments, and the first-order derivative is computed at each wavelength point. This results in p = 235 predictors for the ethanol concentration. The left panel of Figure 2 shows one of the samples; its horizontal axis is the wavelength, and its vertical axis is the first-order derivative of the spectral reflectance. In this application area, Liebmann, Friedl, and Varmuza (2009) identified 15 predictors (wavelength points) that effectively predict the concentration level of ethanol. We aim to see whether the result of our data-driven method is consistent with the findings of domain experts.
We evaluate the performance of the methods considered in Section 4 on this dataset in terms of out-of-sample prediction error. Each predictor is scaled to the interval [0, 1], and equally spaced knots are placed inside the interval when constructing the spline basis for our methods. The dataset is then randomly split into two subsets, a training set and a test set, of size 149 (90%) and 17 (10%), respectively. Each model is trained using the training set, and the out-of-sample average squared prediction error (PE) is calculated by averaging the squared prediction errors when using the trained model to predict the test data. In total, 400 rounds of random splittings are carried out. The tuning parameters of our reduced additive models are selected using the method described in Section 3.2. For the other methods, the tuning parameters are selected by the BIC.

Table 3. Out-of-sample prediction performance of the real data analysis, based on 400 random splittings.

Table 3 reports the average PE and the number of selected variables; for our reduced additive model (RAM), the selected subspace dimension is also reported. The performance ranking of the methods is generally consistent with the simulation study. RAM with the adaptive group Lasso penalty (RAM-AGL) has the lowest prediction error among all methods, while GL has the worst prediction performance. AGL identifies a relatively small number of relevant predictors; however, its prediction performance is not as good as that of RAM-AGL, which can be explained by the finding in the simulation study that GL and AGL tend to falsely identify some relevant predictors as irrelevant. For most rounds of random splittings, a subspace dimension of one is selected for RAM-AGL, suggesting that RAM-AGL substantially reduces the number of unknown functions to be estimated and thereby increases the prediction power. Figure 3 depicts a raster plot showing the stability of variable selection by RAM-AGL. Its vertical and horizontal axes represent the 400 rounds of random splittings and the p = 235 predictors, respectively; in each row, the selected variables are marked in blue. The plot shows that RAM-AGL consistently selects variables at the same wavelengths across most of the random splittings.
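One round of the splitting scheme described above can be sketched in R as follows; fit_model() and predict_model() are hypothetical placeholders standing in for any of the compared methods.

# A minimal sketch of one round of the random splitting: 90% training, 10%
# test, with the out-of-sample squared prediction error computed on the
# held-out samples.  fit_model() and predict_model() are hypothetical
# placeholders, not functions from any package used in the article.
one_split_pe <- function(x, y, train_frac = 0.9) {
  n <- length(y)
  idx <- sample.int(n, size = floor(train_frac * n))
  fit  <- fit_model(x[idx, , drop = FALSE], y[idx])       # hypothetical fitting routine
  yhat <- predict_model(fit, x[-idx, , drop = FALSE])     # hypothetical prediction routine
  mean((y[-idx] - yhat)^2)                                # squared prediction error on the test set
}
# PE is then averaged over 400 such random splittings.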
Finally, we fit RAM-AGL to the full dataset. In the left panel of Figure 2, the colored points mark the wavelengths identified as relevant predictors. Most of the identified wavelengths lie around 1750 nm, the region where C-H bonds have their first overtone, consistent with Liebmann, Friedl, and Varmuza (2009). The right panel of Figure 2 plots all 22 estimated nonzero additive component functions; the function curves are colored according to the points in the left panel. These 22 functions all come from a one-dimensional subspace and are therefore proportional to each other.

Supplementary Materials
A supplemental pdf file: Details of the construction of the hyper basis functions, results of three additional simulation setups, and the technical proof of Theorem 1 with some auxiliary lemmas. Code: The R code used for the experiments in the article.