Monotone B-Spline Smoothing for a Generalized Linear Model Response

Various methods have been proposed for smoothing under the monotonicity constraint. We review the literature and implement an approach to monotone smoothing with B-splines for a generalized linear model response. The approach is expressed as a quadratic programming problem and is easily solved using the statistical software R. In a simulation study, we find that the approach performs better than competing approaches and is much faster to compute. The approach can also be used for smoothing under other shape constraints or mixed constraints. Supplementary materials, comprising the appendices and R code implementing the developed approach, are available online.


Motivation and Monotonic Regression Examples
In many fields of study, it is thought, based on subject matter knowledge, that a certain predictor has a monotonic effect on a response. For instance, the probability of death is assumed to be an increasing function of the severity of burn injury (Wolfe, Roi, and Margosches 1981). A brand's unit sales are a decreasing function of price (Brezger and Steiner 2008). A dose-response curve is often assumed to be nondecreasing (Müller and Schmitt 1988;Kelly and Rice 1990;Dette, Neumeyer, and Pilz 2005;Kong and Eubank 2006). Higher levels of a biomarker are often assumed to be associated with increasing disease risk (Ghosh 2007).
In this article, we implement an approach for estimating a monotone relationship between a continuous predictor and a generalized linear model (GLM) response using B-splines. The response variable follows a distribution in the exponential family, which includes the normal, binomial, Poisson, gamma, and other distributions. An invertible (link) function transforms the mean of the response to a possibly nonlinear form of the predictor. We use a linear combination of B-spline basis functions with unknown coefficients to approximate the nonlinear form. Thus, the link function transforms the response mean to a linear combination of B-spline basis functions, where we estimate the unknown coefficients under the monotonicity constraint. We consider an example of estimating a relationship thought to be monotonically increasing, the survival probability of prematurely born infants given the birth weight, which is presented in Section 4.

Wei Wang is Senior Statistician, Center for Outcomes Research, Children's Hospital of Philadelphia, Philadelphia, PA 19104 (E-mail: wangw@email.chop.edu). Dylan S. Small is Professor, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: dsmall@wharton.upenn.edu). The authors thank the referees and the editor for helpful comments, and thank Scott Lorch for the premature baby data and for providing insights about the dataset. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/tas.

Review of Semi/Nonparametric Monotonic Regression Literature
To model the dependence of a response on a predictor in a non/semiparametric way, kernels, local polynomials, or splines are popular techniques. The resulting regression estimate is not necessarily monotone and extra conditions or steps are required to produce a monotone estimate. A classical method to produce a monotone estimate is isotonic regression by the "Pool-Adjacent-Violators-Algorithm" (PAVA) (Brunk 1958;Barlow et al. 1972). Usually isotonic regression is not smooth and behaves like a step function, which is a constant function in some region and then jumps to a higher level. A hybrid approach is then to combine smoothing methods with isotonic regression to obtain both a smooth and a monotone estimate (Friedman and Tibshirani 1984;Mukerjee 1988;Mammen 1991;Ghosh 2007).
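Isotonic regression via PAVA, mentioned above, is simple enough to sketch. The following Python implementation is a standard textbook formulation (not code from the cited works) of the weighted nondecreasing least-squares fit:

```python
import numpy as np

def pava(y, w=None):
    """Pool-Adjacent-Violators Algorithm: weighted least-squares fit
    of a nondecreasing sequence to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [mean value, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backward while adjacent blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1, n1 = blocks.pop()
            v0, w0, n0 = blocks.pop()
            wt = w0 + w1
            blocks.append([(w0 * v0 + w1 * v1) / wt, wt, n0 + n1])
    # Expand block means back to one fitted value per observation.
    return np.concatenate([np.full(n, v) for v, _, n in blocks])

fit = pava([1.0, 3.0, 2.0, 4.0, 3.5])
```

The output exhibits the step-function behavior described above: the violating pairs are pooled into constant levels, here [1, 2.5, 2.5, 3.75, 3.75].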
Most previous research on monotone smoothing focuses on studying a normal response. Approaches have been proposed to study a more general GLM response using splines (Wang 2000;Leitenstorfer and Tutz 2007;Tutz and Leitenstorfer 2007).
These methods are flexible but involve complex computation. Wang (2000) defined a two-step iterative algorithm involving approximation of the "relative curvature curves" and the solution of differential equations. The approaches in Leitenstorfer and Tutz (2007) and Tutz and Leitenstorfer (2007) are based on B-splines and integrated B-splines, respectively; both use the boosting technique, which originated in machine learning for classification problems. Implementation of these methods can be computationally intensive. In our data analysis of a binary response and a continuous predictor from 12,607 observations, execution of both implementations exceeded the memory limit on a Mac OS X machine (10.7.5, 2 × 2.66 GHz 6-Core Intel Xeon, 24 GB 1333 MHz DDR3).

B-Splines for Monotonic Regression
In an effort to retain the good statistical properties of B-splines while reducing the computational burden, we implement a monotone smoothing approach for a GLM response using B-splines. B-splines are piecewise polynomials and are more flexible than single polynomials. A smooth function approximated by B-splines is usually written as a linear combination of B-spline basis functions (a brief definition is given in Appendix A). In the expansion, if the coefficients are in increasing order, the linear combination is increasing (de Boor 1978). This constraint on the coefficients is thus a sufficient condition for the approximated smooth function to be monotonically increasing.
This property of B-splines has been used in Kelly and Rice (1990), Kong and Eubank (2006), and Leitenstorfer and Tutz (2007). Kelly and Rice (1990) and Kong and Eubank (2006) modeled a normal response and solved the constrained optimization problems using Fortran routines. Leitenstorfer and Tutz (2007) used boosting techniques to enforce the constraint. Our approach is expressed as a well-studied quadratic programming problem and so can be easily solved using standard optimization packages. The approach is easily implemented as we illustrate using the R function solve.QP in the package quadprog.
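The de Boor ordering property is easy to check numerically. A Python sketch using scipy (analogous in spirit to, but not the same as, the R implementation used in the article; the knot vector and coefficient values are illustrative):

```python
import numpy as np
from scipy.interpolate import BSpline

# Cubic basis (degree k = 3) on [0, 1], one interior knot at 0.5;
# clamped boundary knots give P = 5 basis functions.
k = 3
t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1], dtype=float)
beta = np.array([0.0, 0.2, 0.5, 0.8, 1.0])   # nondecreasing coefficients

f = BSpline(t, beta, k)                       # f(x) = sum_p beta_p * phi_p(x)
vals = f(np.linspace(0.0, 1.0, 201))
# Nondecreasing coefficients imply a nondecreasing spline (de Boor 1978).
assert np.all(np.diff(vals) >= -1e-9)
```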

Outline
We present details of the model-fitting procedure in Section 2. In Section 3, we compare the performance of our estimator in simulations for a binary response with the following estimators: the conventional B-spline estimator and the estimators of Leitenstorfer and Tutz (2007) and Tutz and Leitenstorfer (2007). In the settings we tried, our estimator generally outperforms the others and also improves on computation time. In Section 4, we apply our approach to estimate the survival probability of infants as a function of birth weight, assuming the survival probability is monotonically increasing. The approach we develop can also be used for estimation under other shape constraints besides monotonicity (e.g., nonnegativity) or under multiple constraints simultaneously; this is discussed in Section 5.

APPROACH
Let y be the response with a distribution in the exponential family, which includes the Bernoulli/binomial, normal, Poisson, and other distributions (McCullagh and Nelder 1989). Let μ be the mean of y and let g be the link function. We assume g(μ) is connected with the predictor x through a smooth function f as g(μ) = f(x). In the following, however, we adopt the formulation in Leitenstorfer and Tutz (2007) and Tutz and Leitenstorfer (2007), g(μ) = β_0 + f(x), where β_0 is an intercept. In this way, f is confounded with β_0 and is only estimable under an identifiability constraint. Let x_i be the ith predictor value. We add the constraint Σ_i f(x_i) = 0 (Hastie and Tibshirani 1990; Wood 2006). In the case of a linear f, we have g(μ) = β_0 + βx, where β is the slope and the constraint requires that x is centered by its sample mean.
For a Bernoulli y, its canonical link function has the logistic form g(μ) = log{μ/(1 − μ)}. In the following, we expand f as a linear combination of B-spline basis functions and estimate the unknown coefficients by maximizing the log-likelihood function under the monotonicity constraint imposed on f. We focus on a Bernoulli response with the logistic link function.
Let the φ_p's be a series of B-spline basis functions with given knots. The expansion of f in the φ_p's gives

f(x) = Σ_{p=1}^{P} β_p φ_p(x) = φ(x)'β,   (1)

where φ(x) = (φ_1(x), . . . , φ_P(x))' and β = (β_1, . . . , β_P)' is the unknown coefficient vector. Suppose f has an increasing effect on y. By de Boor (1978) (see also Appendix A), the monotonicity constraint on the coefficients is then β_1 ≤ β_2 ≤ · · · ≤ β_P. Let A denote the (P − 1) × P matrix with elements −1 and 1 on the two main diagonals and 0 elsewhere, so that the pth element of Aβ is β_{p+1} − β_p. The constraint is thus

Aβ ≥ 0,   (2)

where 0 is a zero vector and the inequality ≥ holds elementwise. If f has a decreasing effect on the response, the constraint is instead (−A)β ≥ 0. Let ℓ(β) be the log-likelihood function. Our goal is to estimate β by maximizing ℓ(β) under the constraint (2). By definition, we have g(μ_i) = β_0 + φ(x_i)'β. Throughout the article, we use cubic B-splines. To control the smoothness of the fitted f, a penalty on the roughness of the fit, ∫ [f''(x)]² dx, is added to the log-likelihood for maximization over β. From (1), the penalty term can be expressed as β'Pβ, where P = ∫ (∂²φ/∂x²)(∂²φ/∂x²)' dx. Let λ be a fixed smoothing parameter associated with f. The estimate of β is obtained by maximizing the penalized log-likelihood ℓ(β) − (λ/2)β'Pβ under the constraint Aβ ≥ 0. For a fixed λ, we find the estimate by iteration: given the previous estimate, we approximate the log-likelihood and obtain the current estimate by solving a quadratic programming problem. A detailed derivation is presented in Appendix B. To choose λ, we use the common information criteria discussed at the end of this section.
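The matrix A and the constraint check are straightforward to construct numerically. A minimal Python sketch (illustrative; the article's own implementation is in R):

```python
import numpy as np

P = 5
# A has -1 on the main diagonal and 1 on the first superdiagonal:
# row p encodes beta_{p+1} - beta_p >= 0.
A = np.eye(P - 1, P, k=1) - np.eye(P - 1, P)

beta_up = np.array([0.0, 0.2, 0.5, 0.8, 1.0])     # increasing coefficients
beta_bad = np.array([0.0, 0.5, 0.2, 0.8, 1.0])    # one violating pair
assert np.all(A @ beta_up >= 0)                   # constraint satisfied
assert not np.all(A @ beta_bad >= 0)              # violation detected
```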
Let v_i be the variance of y_i, let η_i = g(μ_i) be the linear predictor, and let d_i = ∂μ_i/∂η_i and w_i = d_i²/v_i. Let β̂ be the estimate computed from the previous iteration, based on which we define μ̂ = (μ̂_1, . . . , μ̂_n)', D̂ = diag(d̂_1, . . . , d̂_n), and Ŵ = diag(ŵ_1, . . . , ŵ_n). Let y = (y_1, . . . , y_n)' and let B be the n × P matrix with ith row φ(x_i)'. As calculated in Appendix B, maximizing the penalized log-likelihood ℓ(β) − (λ/2)β'Pβ over β can be approximated by the iterative optimization problem

min_β (ỹ − Bβ)'Ŵ(ỹ − Bβ) + λβ'Pβ,   (3)

where ỹ = η̂ + D̂⁻¹(y − μ̂) is the working response. Given the estimate from the previous iteration, solving (3) alone for β gives the conventional B-spline estimate at the current iteration. The whole procedure can be viewed as a modified iteratively weighted least-squares procedure (see, e.g., O'Sullivan, Yandell, and Raynor 1986; Eilers and Marx 1996). Solving (3) under the constraint (2) for the current estimate of β is a quadratic programming problem (Goldfarb and Idnani 1982, 1983); the optimal solution must lie in the constrained region Aβ ≥ 0.

Common criteria for choosing the smoothing parameter λ include the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and generalized cross-validation (GCV). Calculation of all three criteria involves the "effective degrees of freedom" (df), obtained as the trace of the matrix H = B(B'ŴB + λP)⁻¹B'Ŵ (O'Sullivan, Yandell, and Raynor 1986). The AIC is twice the df minus twice the log-likelihood; the BIC replaces twice the df with log(n) times the df. The GCV is computed as Σ_i (y_i − μ̂_i)²/[v̂_i (n − df)²] using the parameter estimates. Usually a grid of smoothing parameter values is specified; the model is fitted at each value, the criterion is calculated, and the optimal smoothing parameter is the one at which the criterion attains its minimum.
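The whole iteration can be sketched end to end. The following Python code is a hedged illustration of the modified IWLS-with-QP idea on simulated Bernoulli data: it substitutes a discrete second-difference penalty for the integral roughness penalty, folds the intercept into the spline, and uses scipy's SLSQP solver in place of the dual quadratic programming routine solve.QP; all names and settings are illustrative, not the article's.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 400, 3
x = np.sort(rng.uniform(0, 1, n))
p_true = 1 / (1 + np.exp(-(4 * x - 2)))        # increasing true logit
y = rng.binomial(1, p_true)

# Cubic B-spline basis with clamped knots; P = 5 basis functions.
t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1], dtype=float)
P = len(t) - k - 1
B = np.column_stack([BSpline(t, np.eye(P)[p], k)(x) for p in range(P)])

A = np.eye(P - 1, P, k=1) - np.eye(P - 1, P)   # monotonicity: A @ beta >= 0
D2 = np.diff(np.eye(P), 2, axis=0)
Pen = D2.T @ D2                                # discrete stand-in for the integral penalty
lam = 1.0

beta = np.zeros(P)
for _ in range(25):                            # modified IWLS iterations
    eta = B @ beta
    mu = 1 / (1 + np.exp(-eta))
    w = np.maximum(mu * (1 - mu), 1e-10)       # w_i = d_i^2 / v_i for the logit link
    z = eta + (y - mu) / w                     # working response
    res = minimize(lambda b: (z - B @ b) @ (w * (z - B @ b)) + lam * b @ Pen @ b,
                   beta, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda b: A @ b}])
    if np.max(np.abs(res.x - beta)) < 1e-6:
        beta = res.x
        break
    beta = res.x

assert np.all(A @ beta >= -1e-5)               # coefficients (nearly) nondecreasing
```

Because the fitted coefficients are nondecreasing, the fitted curve B @ beta is monotone in x, which is the point of solving (3) under (2) rather than alone.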

SIMULATION
We conduct simulations to compare the performance of the monotone B-spline estimator (MB) with other estimators: the conventional unrestricted B-spline estimator (CB), the estimator proposed by Leitenstorfer and Tutz (2007) (BB: boosting B-spline), and the estimator proposed by Tutz and Leitenstorfer (2007) (BI: boosting integrated B-spline). Estimators BB and BI are implemented through R packages from http://www.statistik.lmu.de/institut/lehrstuhl/semsto/software/index.html. In both approaches, one crucial tuning parameter, which plays the role of a smoothing parameter, is selected using the AIC or BIC criterion. The other parameters take the values recommended in Leitenstorfer and Tutz (2007) and Tutz and Leitenstorfer (2007).
For the CB and MB estimators, we take 100 grid points of the smoothing parameter λ equally spaced between −8 and 6 on a log scale. We also use the information criterion AIC or BIC to choose the optimal λ. The models we will use are not of overly complicated forms, and so to save computation time, we use a small number of basis functions, 5, for all the estimators. The knots are equally spaced within the range of the data.
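The effective degrees of freedom driving the AIC/BIC selection over the λ grid can be computed directly from the hat-matrix formula in Section 2. A Python sketch with a synthetic basis matrix and weights (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, P = 200, 8
B = rng.uniform(size=(n, P))               # stand-in for the B-spline basis matrix
W = np.diag(rng.uniform(0.1, 0.25, n))     # stand-in for the IWLS weight matrix
D2 = np.diff(np.eye(P), 2, axis=0)
Pen = D2.T @ D2                            # stand-in for the roughness penalty matrix

def edf(lam):
    """Effective df = trace of H = B (B'WB + lam * Pen)^(-1) B'W."""
    M = np.linalg.solve(B.T @ W @ B + lam * Pen, B.T @ W)
    return np.trace(B @ M)

# AIC = 2*df - 2*loglik; BIC replaces 2*df with log(n)*df.
dfs = [edf(lam) for lam in (1e-8, 1.0, 1e4)]
assert dfs[0] > dfs[1] > dfs[2]            # heavier smoothing, fewer effective df
```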
The predictor x is simulated from the Uniform[0, 1] distribution and the response is simulated from a Bernoulli distribution. Components of the linear predictor are β_0 = 0.5 and f taking one of five functional forms f_1, . . . , f_5, where f_5(x) = 0.9x² + 0.1x. The functions f_1, . . . , f_5 are scaled by a factor of 4 and then shifted to range from −2 to 2. Shapes of the transformed functions are plotted in Figure 1.
We consider sample sizes 100 and 500 and run 1000 simulations in each scenario. Let x_j^m and x_j^M be the minimum and the maximum of the simulated x values in the jth simulation. Evaluation of the estimated mean functions is based on 500 equally spaced points within the common range [max_j x_j^m, min_j x_j^M]. At each point, we calculate the squared bias, variance, and mean squared error (MSE). The values averaged over the 500 points are summarized in Table 1 for each estimator. The associated computation time per simulation is shown in Table 2.
In Table 1, models 1-5 correspond to f_1-f_5, respectively. The smallest value is in bold font for each combination of model and n. The bias of the MB estimator is close to that of the CB estimator. The MB estimator has smaller variance, with a reduction in MSE except for model 4 when n = 500. Estimators BI and BB perform best for models 4 and 5, respectively. On the other hand, compared with CB, estimator BB has larger MSE in models 2-4 when n = 100 and in models 2 and 3 when n = 500. Estimator BI has larger MSE than CB in models 1, 3, and 5. The MB estimator has the smallest MSE for models 1-3. Table 2 shows the computation time per simulation associated with the estimators (AIC and BIC together) for each combination of model and n. When n = 100, the CB and MB estimators take about 1/4 to 1/3 of the time of BI, and between 1/7 and 1/4 of the time of BB. When n = 500, the CB and MB estimators are at least two times faster than BI and about nine times faster than BB.
In summary, estimator MB improves CB in nearly all of the settings in terms of the MSE and has the smallest MSE in most of the settings.

DATA ANALYSIS
We study data on premature infants born at a birth weight between 500 and 2000 g in Los Angeles and surrounding counties between 1997 and 1999. This is a subset of the data described in Lorch et al. (2012). We are interested in estimating the probability of survival given the birth weight, where survival refers to whether the infant is alive 60 months post gestation. In total, 11,396 infants survived and 1,211 died. For premature infants, survival is thought to be an increasing function of birth weight (Hack et al. 1991; Cooper et al. 1998). To describe the data, we group the birth weight into 16 bins of width 100 g; the minimum number of infants in a bin is 448. We then calculate the sample average of survival within each bin and plot it against the bin mid-points in Figure 2. In the figure, the survival rate increases sharply at low birth weights, and the rate of increase then declines, flattening as the birth weight approaches 2000 g. The estimated survival probabilities using estimators CB and MB are also shown in Figure 2. We use 12 equally spaced knots, and the optimal λ is chosen by AIC or BIC from a grid of 100 points equally spaced between −5 and 15 on the log scale. In the plot, the CB AIC and CB BIC estimates overlap, as do the two MB estimates. The CB estimates do not produce a monotone curve in the region between 1100 and 1400 g.
In total, our implementation of the approach took 50 min on the platform described at the end of Section 1.2. We also tried estimating the survival probability with estimators BB and BI. However, execution of both implementations halted with an error message indicating that the memory limit was exceeded, so analysis results for these approaches are not available.
To obtain standard deviations of the estimates and to build confidence intervals at each birth weight value, we use the bootstrap method as in Dilleen, Heimann, and Hirsch (2003). We resample the paired birth weight and survival status with 200 bootstrap replications. At each birth weight, we obtain the estimated standard deviation as the sample standard deviation of the 200 estimates. We use the 97.5% and 2.5% quantiles of the estimates as the upper and lower bounds of a 95% confidence interval. Figure 3 shows the standard deviations for all estimators and confidence intervals for the CB (AIC) and MB (AIC) estimators. In the plots, we observe less variability in the MB estimates.
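The resampling scheme can be sketched as follows; the data, the crude local-average estimator standing in for the CB/MB fits, and every constant here are synthetic placeholders, not the study data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(500, 2000, n)                 # synthetic birth weights (g)
p = 1 / (1 + np.exp(-(x - 900) / 200))        # synthetic survival probability
y = rng.binomial(1, p)

def estimate(xs, ys, x0, h=150.0):
    """Toy local-average estimator of P(survive | x0); a placeholder
    for refitting the CB/MB estimator on each bootstrap sample."""
    m = np.abs(xs - x0) <= h
    return ys[m].mean() if m.any() else np.nan

x0, n_boot = 1200.0, 200
stats = []
for _ in range(n_boot):                       # resample (x_i, y_i) pairs
    idx = rng.integers(0, n, n)
    stats.append(estimate(x[idx], y[idx], x0))
stats = np.array(stats)
sd = stats.std(ddof=1)                        # bootstrap standard deviation
lo, hi = np.quantile(stats, [0.025, 0.975])   # percentile 95% CI at x0
```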

SUMMARY AND DISCUSSION
A smooth function approximated by B-splines is usually written as a linear combination of B-spline basis functions. In the expansion, an increasing order of the coefficients is a sufficient condition for the approximated smooth function to be monotonically increasing (de Boor 1978). We implement a monotone smoothing approach based on B-splines for a GLM response using the R function "solve.QP" of the package "quadprog." In the simulation studies, the approach generally outperforms the other estimators and is much faster to compute.
Using B-spline properties (de Boor 1978), the approach can also be applied to estimate a smooth function under other shape constraints by changing the matrix A in (2). For example, nonnegativity of the coefficients β_p guarantees the estimate to be nonnegative; this constraint can be imposed by taking A in (2) to be the P × P identity matrix. If the estimate is required to be convex, that is, to have nonnegative second-order derivatives, we can look at the second-order differences of the coefficients.
For instance, assume the knots are equally spaced. Let A be the (P − 2) × P matrix whose pth row has elements 1, −2, and 1 in columns p, p + 1, and p + 2, and 0 elsewhere. Then (2) requires nonnegativity of the second-order differences of the β_p's, which in turn ensures convexity of the estimate. If the knots are not equally spaced, the structure of A remains the same but the values of the nonzero elements change according to the formulas for B-spline derivatives (de Boor 1978). The approach can also handle more than one shape constraint simultaneously by stacking multiple A matrices into one. [Received June 2013. Revised September 2014.]
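The second-difference construction is again easy to verify numerically (a Python sketch with illustrative values):

```python
import numpy as np

P = 6
# (P-2) x P second-difference matrix: row p encodes
# beta_p - 2*beta_{p+1} + beta_{p+2} >= 0.
A2 = np.diff(np.eye(P), 2, axis=0)

beta_convex = np.array([5.0, 3.0, 2.0, 2.0, 3.0, 5.0])   # convex sequence
beta_concave = -beta_convex
assert np.all(A2 @ beta_convex >= 0)       # convexity constraint satisfied
assert not np.all(A2 @ beta_concave >= 0)  # concave sequence rejected
```

Stacking constraint matrices, e.g., np.vstack([A1, A2]) for monotonicity together with convexity, gives the combined-constraint case mentioned above.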