Shape-Constrained Estimation Using Nonnegative Splines

We consider the problem of nonparametric estimation of unknown smooth functions in the presence of restrictions on the shape of the estimator and on its support using polynomial splines. We provide a general computational framework that treats these estimation problems in a unified manner, without the limitations of the existing methods. Applications of our approach include computing optimal spline estimators for regression, density estimation, and arrival rate estimation problems in the presence of various shape constraints. Our approach can also handle multiple simultaneous shape constraints. The approach is based on a characterization of nonnegative polynomials that leads to semidefinite programming (SDP) and second-order cone programming (SOCP) formulations of the problems. These formulations extend and generalize a number of previous approaches in the literature, including those with piecewise linear and B-spline estimators. We also consider a simpler approach in which nonnegative splines are approximated by splines whose pieces are polynomials with nonnegative coefficients in a nonnegative basis. A condition is presented to test whether a given nonnegative basis gives rise to a spline cone that is dense in the space of nonnegative continuous functions. The optimization models formulated in the article are solvable with minimal running time using off-the-shelf software. We provide numerical illustrations for density estimation and regression problems. These examples show that the proposed approach requires minimal computational time, and that the estimators obtained using our approach often match and frequently outperform kernel methods and spline smoothing without shape constraints. Supplementary materials for this article are provided online.


INTRODUCTION
We consider spline estimation problems with one or more shape constraints on the estimator. We demonstrate that nonnegative, monotone, and convex polynomial splines admit characterizations that lead to optimization models that are solvable with minimal running time using readily available software. The methods proposed in this article are applicable to a variety of problems; we concentrate on regression and density estimation problems with various shape constraints. In our numerical examples, we focus on nonnegative regression, isotonic regression and smoothing, and unconstrained density estimation. Further potential applications include the estimation of the arrival rate of a nonhomogeneous Poisson process and log-concave density estimation.
In each of the estimation problems considered in this article, the goal is to reconstruct a real-valued function f from finitely many observations (function values observed with noise, realizations of random variables, arrival times, etc.) under a collection of constraints on the shape of the function. Examples of such shape constraints include: (1) f be nonnegative over [a, b], or more generally, its graph lie in a specific bounded region (defined, e.g., by linear or polynomial inequalities); (2) f be monotone nondecreasing (or nonincreasing); and (3) f be convex (or concave). The function f is otherwise assumed to belong to some (possibly infinite-dimensional) functional space H. The optimal estimator shall minimize a given loss function over the set of shape-constrained functions from H. We propose a variant of the classical method of sieves to find the estimator via the solution of finite-dimensional optimization problems.
Our approach is based on convex (conic) optimization techniques. More specifically, we show that linear programming (LP), second-order cone programming (SOCP), semidefinite programming (SDP), and optimization over cones of nonnegative, monotone, or convex/concave functions in functional linear spaces can be employed to solve fairly complicated shape-constrained estimation problems in a conceptually more satisfying manner than existing approaches, without resorting to unnecessary approximations of the problem. Furthermore, our approach is flexible enough to handle a number of additional constraints, such as interpolation, betweenness, and multiple shape constraints over different intervals. While a lot of attention has been given to shape-constrained optimization, and also to employing (convex) optimization models in statistical estimation, no systematic study has appeared that demonstrates both the theoretical soundness and the computational efficiency of convex optimization in such a general setting.
In the remainder of this section, we give a broad overview of the vast literature on shape-constrained estimation, focusing on the existing numerical algorithms. In Section 2, we lay out a general model using sieves of cones for shape-constrained estimation problems. In Section 3, we consider the special case of polynomial splines, and we show how SOCP and SDP approaches can be used effectively to solve these problems. We also consider another approach that has been rediscovered several times in the literature. This approach is based on optimization over splines with nonnegative coefficients over some nonnegative basis. A condition is given that helps decide whether a given choice of nonnegative basis is appropriate. Section 4 discusses applications of the developed theory to a number of shape-constrained estimation problems. The corresponding numerical results are collected in Section 5, where the SDP/SOCP approach is compared with the nonnegative basis approach, unconstrained smoothing splines, and kernel estimators in density estimation and regression problems.

PAST WORK AND OUR CONTRIBUTION
There is a vast literature on shape-constrained estimation and learning problems, much of which is summarized in the following surveys: in Delecroix and Thomas-Agnan (2000), a survey of smoothing regression problems with shape constraints is presented; in the thesis (Meyer 1996), algorithms for shape-constrained regression and density estimation are considered; the text of Robertson, Wright, and Dykstra (1988) is a comprehensive survey of order-restricted estimation problems, with over 800 references; Turlach (2005) has a more recent review on shape-constrained spline smoothing.
There has also been substantial research on nonparametric estimation via optimization: the works of de Montricher, Scott, Tapia, and Thompson (de Montricher and Tapia 1975; Scott 1976; Scott, Tapia, and Thompson 1980; Thompson and Tapia 1990) are representative. Nemirovskii, Polyak, and Tsybakov (1984, 1985) provided a detailed analysis of the consistency of maximum likelihood (ML) estimators and the rate of convergence. However, unlike the developments in the asymptotic analysis of these estimation problems, the majority of the algorithmic techniques used to date are relatively simple and ad hoc, often specific to the (single) shape constraint involved in the problem, with little potential for generalization.
Perhaps because in most interesting function spaces the nonnegativity constraint is expressible only by an infinite collection of linear inequalities, there is a hesitation to tackle nonnegativity (or other shape constraints) directly in these spaces. The following three quotes from well-known references are typical in the literature:

"Attempting to impose monotonicity on polynomials quickly becomes unpleasant. . . ." (Ramsay 1988, p. 425)

"The nonnegativity constraint [in density estimation] is, in general, impossible to enforce when working with continuous densities which are not piecewise linear." (Thompson and Tapia 1990, p. 103)

"Constrained smoothing splines with infinitely many constraints (like m^(r)(x) ≥ 0 for all x) are difficult to compute. . . ." (Mammen and Thomas-Agnan 1999, p. 240)

In this article, we show that these concerns may not be all that warranted. In particular, in the case of univariate estimation problems, which is the focus of our article, we show that by using the modeling and algorithmic tools offered by SOCP, SDP, and related conic optimization problems, we can easily tackle an array of shape-constrained estimation problems and find optimal nonnegative, monotone, and convex spline estimators.
In the remainder of this section, we summarize some of the advantages of our method, most of which are not shared by earlier methods:
• General Scope: Techniques that do not involve optimization are usually designed to solve a specific problem, mostly involving a single shape constraint. They are computationally extremely efficient, but they also have limited scope. Methods based on numerical optimization can typically incorporate additional linear constraints without difficulty. This increases the types of constraints one may consider: periodicity and interpolation constraints are just two examples of such constraints; as they are expressible by linear equations and inequalities on the estimator, they can be added to any convex optimization model freely. Several approaches have been proposed for shape-constrained spline smoothing using splines of a specific degree, including Hildreth (1958) and Brunk (1958) (piecewise linear estimators), He and Shi (1996) (quadratic B-splines for monotone regression), and Dierckx (1980) and Turlach (2005) (cubic splines). All of these approaches exploit the fact that the considered shape constraints in the considered spaces are characterized by a finite collection of linear inequalities; hence, none of them generalizes to splines of higher degree than those they were developed for.
Our approach is based on a characterization of nonnegative polynomial splines of any degree that leads to computationally tractable optimization models of all of the considered estimation problems. This also takes care of our next concern.
• Direct Characterization of Nonnegativity: If the estimator f is required to be nonnegative, then a simple way of imposing this constraint is to write f = g^2, or f = exp(g), and then turn the attention to g (Good and Gaskins 1971). Of course, this is overly restrictive if nonnegativity is only required, say, over an interval. Moreover, linear constraints on f (such as periodicity and the requirement that f integrates to one) are transformed into nonlinear equations on g, rendering the transformed optimization problems difficult to handle. Additionally, the underlying space of g may not be the same as that of f; in particular, g may not belong to an easily characterizable finite-dimensional linear space even if f does.
• Avoiding Oversimplification and Approximation: At the time of some of the earlier works on shape-constrained estimation, SDP and SOCP methods were either not available or known, or were considered computationally too expensive; hence, these studies often used oversimplified, inexact optimization models. Most of the models reviewed and proposed in the surveys (Robertson, Wright, and Dykstra 1988) and (Delecroix and Thomas-Agnan 2000) and in the theses of Meyer (1996) and d'Aspremont (2004) only approximate shape constraints, and find optimal estimators in a proper subset of the functional cone of interest. These are usually sets of functions with nonnegative coefficients in a nonnegative basis. Ramsay's I-spline (Ramsay 1988) is based on the same idea.
Since nonnegative splines can be effectively handled by convex optimization (SOCP/SDP) methods, such approximations are no longer necessary. Nonetheless, we explore this approach as well, and compare it with models that use the entire cone of nonnegative functions. In this article, we examine estimation with nonnegative polynomial splines generated by the Bernstein polynomial basis (rather than the more popular B-splines), and compare the results to the SDP/SOCP approach. We will justify our choice of this basis over the B-spline basis in Theorem 2.
A strategy dual to these overly constrained approaches is imposing shape restrictions only on a finite subset of the domain of the unknown function to obtain optimization models that are simpler than the desired ones. For instance, Delecroix, Simioni, and Thomas-Agnan (1996), Mammen and Thomas-Agnan (1999), and Villalobos and Wahba (1987) considered polynomial splines with nonnegativity imposed on the kth derivative for some k, but nonnegativity is imposed only at the knot points, or at finitely many evenly spaced points. Consequently, the optimization is over a set of estimators larger than the set of interest.
Recently, some authors have applied SDP and SOCP techniques to a few shape-constrained estimation problems. Wang and Li (2008) used cubic splines for isotonic regression, and Fushiki, Horiuchi, and Tsuchiya (2006) have used SDP constraints with a log-likelihood objective function in parametric density estimation with nonnegative polynomials. Alizadeh et al. (2008) have used SOCP and SDP models to estimate the smooth arrival rate of a nonhomogeneous Poisson process based on observed arrival rates using cubic splines. Our work is an attempt to provide a general framework for a large number of shape-constrained estimation problems at a considerably higher level of generality than found in the above studies.

ESTIMATION IN GENERAL SPACES
In the most general setup, the goal of an estimation problem is to reconstruct an unknown (possibly multivariate) real-valued function f ∈ H in some space H (usually a reproducing kernel Hilbert space or a Sobolev space), based on finitely many observations z_1, z_2, . . . , z_N drawn from a subset D ⊆ R^k. Constraints on the shape of the function translate to the requirement that f belongs to a closed convex set K ⊆ H. (Only two of the commonly considered shape constraints are not convex: unimodality and log-concavity.) Finally, we seek a function f that minimizes a convex loss functional L(·|z_1, . . . , z_N). Hence, a shape-constrained estimation problem is an optimization problem of the form

    minimize    L(f | z_1, . . . , z_N)
    subject to  A_i f ≤ b_i,  i = 1, . . . , I,          (1)
                f ∈ K,

where A_1, . . . , A_I are H → R linear functionals, b_1, . . . , b_I are real numbers, and L and K are as explained above. A few remarks are in order:
1. Care has to be taken to ensure that the problem (1) has an optimal solution, meaning that the infimum is finite and is attained; this largely depends on the underlying space H and the loss function L. When a smooth estimator is sought, the Sobolev space W_2^d([a, b]) or a reproducing kernel Hilbert space is an appropriate choice for a number of commonly used loss functions (Kimeldorf and Wahba 1971). Moreover, in certain shape-constrained problems, it is also known that the optimal solution belongs to a specific finite-dimensional space, in particular to a space of polynomial splines with given knot points (Wahba 1990, chap. 1; Eggermont and LaRiccia 2001).
2. Constraints on the shape of f are expressed by the constraint f ∈ K (possibly together with some linear equations and inequalities). By the nature of shape constraints, K is usually a convex, pointed cone, that is, f ∈ K with f ≢ 0 implies that λf ∈ K for every λ ≥ 0 but −f ∉ K. For example, K can be the cone of nonnegative, monotone nondecreasing, or convex functions in H. Multiple shape constraints can be modeled by considering a cone K that is created by intersection, a Cartesian product, or a Minkowski sum of the cones modeling the individual shape constraints.
3. If H is finite-dimensional (which happens, for instance, when parametric models or finite-dimensional approximations of infinite-dimensional problems are considered), K is often the cone of nonnegative functions in H, which we denote by P. Many other shape constraints, especially in the univariate case, can be reduced to this case. For example, derivatives of f may be sign-constrained, ensuring monotonicity, convexity, or concavity of f .
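For example, if f(x) = p_0 + p_1 x + · · · + p_n x^n, then f is monotone nondecreasing on [a, b] exactly when f′(x) = p_1 + 2p_2 x + · · · + n p_n x^{n−1} is nonnegative on [a, b]; since the coefficient vector (p_1, 2p_2, . . . , n p_n) of f′ is a linear image of (p_0, . . . , p_n), any tractable description of the cone of nonnegative polynomials immediately yields one for the cone of monotone polynomials, and similarly for convexity via f″.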

SIEVES AND FINITE-DIMENSIONAL APPROXIMATIONS
The term sieve was coined by Grenander (1981) for a sequence of subsets S_1, S_2, . . . , S_m, . . . of some metric space of functions containing the unknown function; it is required that the union of the sets S_m be dense in this space. The main idea is that for any kind of estimation problem, instead of minimizing a given loss functional on the whole space, we search in S_m for some m, which is an easier problem. The density requirement ensures that some function in some S_m is a good approximation of the function that is being estimated. As m increases, it is assumed that the "complexity" of the functions under consideration also increases. The appropriate value of m is usually determined by techniques such as cross-validation. Note that the idea is almost independent of the loss functional to be minimized. Thus, sieve methods have been applied to regression, density estimation, and even arrival rate estimation.
In this article, we follow a restricted version of the sieve method. In particular, we require that the sets in the sieve sequence are all finite-dimensional convex cones, and that the sequence is nested or asymptotically nested. The sequence (K_m)_{m=1,2,...} is nested if K_m ⊆ K_{m+1} for every m, and it is asymptotically nested if for each m, the sequence has a nested infinite subsequence containing K_m.
Prior information, including assumptions on the shape of the estimator, is particularly useful when the number of samples is too small for the particular shape to be immediately clear based on the data alone. Therefore, we shall not be particularly concerned with the behavior of the estimators in the case when the number of samples reaches the asymptotic range. Nevertheless, we shall mention that existing results on the consistency of constrained estimators are applicable and prove the consistency of the estimators proposed in the article. The results of Geman and Hwang (1982) prove the consistency of the ML estimators of the present article. Dong and Wets (2000) considered a more general setup for density estimation, where the negative log-likelihood loss function can be replaced with other convex loss functions, including the least-squares and penalized log-likelihood functions. (Simultaneously, they make a case for the use of constrained ML estimators instead of penalized ML estimators.) There is one additional technical restriction that needs to be added to the conic version of the sieve method: in the optimization models obtained by replacing K with the finite-dimensional approximation K_m in (1), an additional constraint ‖f‖ ≤ B_m on the norm of f must be added to keep the set of feasible f's compact. (Note that since the optimization takes place in a finite-dimensional space, all norms are equivalent.) With this addition, the theorems of Geman and Hwang (1982) and Dong and Wets (2000) are directly applicable to conic sieves: as long as the bound B_m increases slowly enough with m, the resulting estimators are consistent. Such bounds are often a priori imposed on the problem. For instance, nonnegative splines (with given knot points) that integrate to 1 form a bounded set.
The most important examples of conic sieves for our purposes are the sieves of polynomial splines. Recall that a univariate polynomial spline f of degree (or order) n and continuity C^r (r ≤ n − 1) is a real-valued function on [a, b] = [a_0, a_m] defined piecewise on the intervals [a_i, a_{i+1}], i = 0, . . . , m − 1 with the following properties: (1) f is a polynomial of degree n over each interval [a_i, a_{i+1}], i = 0, . . . , m − 1; and (2) f has continuous derivatives up to order r over (a, b). The points a_i are called the knot points of the spline. Throughout the article, the sequence (a_0, . . . , a_m) will be abbreviated as a. The length max_{0≤i≤m−1} (a_{i+1} − a_i) of the longest subinterval in the knot point sequence is called the mesh size of a; it is denoted by ‖a‖. The linear space of all splines of degree n and knot point sequence a is denoted by S(n, a). In this article, we will always assume r = n − 1. Schumaker (1981) proposed a more general definition, where the existence of derivatives of different orders is required at different knot points. Wahba (1990) and others considered only natural splines, which are more restricted on the first and last subinterval of the domain. The methods proposed in this article can be adapted to these definitions without any difficulty.
Suppose that the degree n is fixed and K_m is chosen to be S(n, a_m), where a_1, a_2, . . . is an infinite sequence of knot point sequences with mesh sizes approaching 0. Then, by standard spline approximation results (Schumaker 1981), the union of the cones K_m ∩ P is dense in the cone of nonnegative continuous functions on [a, b]. (As before, P denotes the set of nonnegative functions.) The sequence (K_m)_{m=1,...} is not necessarily nested; however, it is not difficult to create a subclass that forms a nested or asymptotically nested sequence. For instance, we may start from one interval [a, b] and then recursively add the midpoint of the rightmost longest interval to the set of knots. The most straightforward way to obtain an asymptotically nested sieve is to consider knot point sequences that subdivide [a, b] uniformly.
For polynomial spline estimators, the subscript m of K m is the number of pieces of the spline. As we shall see, in the optimization models, the estimator is represented by the coefficients of its polynomial pieces, and the upper bound B m on the norm of the spline can be imposed by giving an upper and a lower bound on these coefficients. These only require adding linear inequalities to the optimization models.
In each spline space S(n, a), there may be a number of different convex cones that encode some type of shape constraint. Let us, for the moment, concentrate on the cone P[n,a] of nonnegative functions in S(n, a). This cone is the Cartesian product of cones of nonnegative polynomials of degree n (where the ith polynomial is required to be nonnegative on [a_i, a_{i+1}]), intersected with the linear space of C^r functions. Therefore, P[n,a] is a convex, but in general nonpolyhedral, cone. One of the main results of Section 3, and of the article, is that problems of the form (1) with K = P[n,a] can be solved efficiently for every n and a. When considering sufficiently differentiable functions, many shape constraints mentioned in the Introduction reduce to the nonnegativity (or nonpositivity) of derivatives. Hence, sieves of monotone nonincreasing or nondecreasing, convex or concave functions using polynomial approximations can be defined and characterized analogously to the sieves of nonnegative functions; it is sufficient to consider nonnegativity as the "universal" shape constraint.

REPRESENTATIONS INVOLVING SEMIDEFINITE AND SECOND-ORDER CONE CONSTRAINTS
In this section, we summarize some well-known results about the characterization of nonnegative polynomials, and apply this theory to characterize nonnegative splines. For more details on nonnegative polynomials, the reader is referred to Karlin and Studden (1966). As mentioned at the end of the previous section, this immediately gives rise to analogous characterizations of nonincreasing, nondecreasing, concave, and convex splines, by simply imposing nonnegativity on the derivatives of the spline. A generalization of these results, and examples of other functional spaces where nonnegative functions have analogous characterizations, can be found in Papp and Alizadeh (2011).
We will use the following notations and conventions: vectors (resp., matrices) are typeset boldface, their components (resp., entries) are denoted with the corresponding lowercase italic character. Indexing of vectors and matrices starts from 0 rather than 1. For example, the (n + 1)-dimensional row vector p could also be written as (p 0 , . . . , p n ).
The inequality X ⪰ 0 denotes that X is a positive semidefinite real symmetric matrix. The cone generated by the set U, defined as {αu : α ≥ 0, u ∈ U}, is denoted by cone(U); int S denotes the interior of the set S.
In this section, and throughout the article, we assume (for notational simplicity) that unknown polynomials in optimization models are represented in the standard monomial basis, even though this basis is numerically rather poorly behaved. This means that we also identify the polynomial function p with the coefficient vector p of the polynomial. There is no conceptual difficulty in modifying all theorems and algorithms in this article so that they involve polynomials represented in any other basis, such as some orthogonal polynomial basis.
The main results on the representation of nonnegative polynomials over an interval are summarized in the following two propositions; we split the odd- and even-degree cases into separate propositions for better readability. Below, x_ij and y_ij denote the entries of the matrices X and Y, and sums over out-of-range indices are understood to be empty.

Proposition 1 (Odd-degree case: Karlin and Studden 1966). Let p(x) = Σ_{i=0}^n p_i x^i be a polynomial of degree n = 2k + 1, and let a < b be real numbers. Then, p(x) ≥ 0 for all x ∈ [a, b] if and only if there exist symmetric (k + 1) × (k + 1) matrices X ⪰ 0 and Y ⪰ 0 satisfying

    p_ℓ = Σ_{i+j=ℓ−1} x_ij − a Σ_{i+j=ℓ} x_ij + b Σ_{i+j=ℓ} y_ij − Σ_{i+j=ℓ−1} y_ij    for all ℓ = 0, . . . , 2k + 1;    (2)

equivalently, p(x) = (x − a) v(x)^T X v(x) + (b − x) v(x)^T Y v(x), where v(x) = (1, x, . . . , x^k)^T.

Proposition 2 (Even-degree case: Karlin and Studden 1966). Let p(x) = Σ_{i=0}^n p_i x^i be a polynomial of degree n = 2k, and let a < b be real numbers. Then, p(x) ≥ 0 for all x ∈ [a, b] if and only if there exist a symmetric (k + 1) × (k + 1) matrix X ⪰ 0 and a symmetric k × k matrix Y ⪰ 0 satisfying

    p_ℓ = Σ_{i+j=ℓ} x_ij − ab Σ_{i+j=ℓ} y_ij + (a + b) Σ_{i+j=ℓ−1} y_ij − Σ_{i+j=ℓ−2} y_ij    for all ℓ = 0, . . . , 2k;    (3)

equivalently, p(x) = v(x)^T X v(x) + (x − a)(b − x) w(x)^T Y w(x), where v(x) = (1, x, . . . , x^k)^T and w(x) = (1, x, . . . , x^{k−1})^T.
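As a concrete illustration of how these characterizations enter an optimization model, the following Python sketch builds the constraints of Propositions 1 and 2 with the cvxpy modeling package. This is only a sketch under our own naming conventions; the article's experiments used AMPL with KNITRO/CPLEX, and the helper functions antidiag_sum and nonneg_poly_constraints below are ours.

    import cvxpy as cp

    def antidiag_sum(M, ell):
        # Sum of the entries M[i, j] with i + j == ell; empty sums are 0.
        m = M.shape[0]
        return sum(M[i, ell - i] for i in range(m) if 0 <= ell - i < m)

    def nonneg_poly_constraints(p, a, b):
        # Constraints certifying p(x) >= 0 on [a, b], where p is a cvxpy variable
        # holding the monomial-basis coefficients (p[0], ..., p[n]).
        n = p.shape[0] - 1
        cons = []
        if n % 2 == 1:                       # odd degree: Proposition 1
            k = (n - 1) // 2
            X = cp.Variable((k + 1, k + 1), PSD=True)
            Y = cp.Variable((k + 1, k + 1), PSD=True)
            for ell in range(n + 1):
                rhs = (antidiag_sum(X, ell - 1) - a * antidiag_sum(X, ell)
                       + b * antidiag_sum(Y, ell) - antidiag_sum(Y, ell - 1))
                cons.append(p[ell] == rhs)
        else:                                # even degree: Proposition 2
            k = n // 2
            X = cp.Variable((k + 1, k + 1), PSD=True)
            Y = cp.Variable((k, k), PSD=True) if k >= 1 else None
            for ell in range(n + 1):
                rhs = antidiag_sum(X, ell)
                if Y is not None:
                    rhs = (rhs - a * b * antidiag_sum(Y, ell)
                           + (a + b) * antidiag_sum(Y, ell - 1)
                           - antidiag_sum(Y, ell - 2))
                cons.append(p[ell] == rhs)
        return cons

    # Example use: minimize p(0) over cubics nonnegative on [0, 2]
    # that satisfy p(1) = 1 and p'(1) = -1.
    p = cp.Variable(4)
    constraints = nonneg_poly_constraints(p, 0.0, 2.0)
    constraints += [p[0] + p[1] + p[2] + p[3] == 1,
                    p[1] + 2 * p[2] + 3 * p[3] == -1]
    cp.Problem(cp.Minimize(p[0]), constraints).solve()

For quadratic and cubic pieces the matrices above are at most 2 × 2, so the same constraints can also be handed to a pure second-order cone solver after the reformulation described next.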
A useful property of quadratic and cubic polynomials nonnegative over an interval is that (following the above propositions) their characterizations involve only 2 × 2 positive semidefinite matrices. Positive semidefiniteness of 2 × 2 matrices can be translated to linear and quadratic (second-order cone) constraints using the following well-known fact: a symmetric matrix with diagonal entries x and z and off-diagonal entry y is positive semidefinite if and only if (x + z, x − z, 2y) ∈ Q^3, where Q^{k+1} = {(u_0, u_1, . . . , u_k) ∈ R^{k+1} : u_0 ≥ (u_1^2 + · · · + u_k^2)^{1/2}} denotes the second-order (Lorentz) cone.
Constraints of the form x ∈ Q^{k+1} are called second-order cone constraints, while constraints of the form X ⪰ 0 are semidefinite constraints. Page limitations do not allow us to give a comprehensive overview of SOCP and SDP, and deep knowledge of these fields is not necessary to apply the methods proposed in this article, but a brief description is included in Appendix A in the online supplementary materials, along with pointers to software available for the solution of SDPs and SOCPs. The reader is also encouraged to consult Alizadeh and Goldfarb (2003) for an accessible survey on SOCP and Wolkowicz et al. (2000) for an in-depth review of many aspects of SDP.
The above characterization of nonnegative polynomials easily extends to a characterization of nonnegative splines over [a 0 , a m ]: the nonnegativity of each polynomial piece is translated to a set of semidefinite constraints and linear equations, and the continuity of the derivatives translates to another finite set of linear equations.
The equalities in (2) and (3) suggest that we may run into serious numerical problems if the knot points of the spline are distributed unevenly, as the linear equations in the characterization will have coefficients that may differ by many orders of magnitude. This can be avoided by scaling: for each i = 0, . . . , m − 1, we apply an affine transformation on the ith polynomial piece of the spline that maps the interval [a_i, a_{i+1}] to [0, 1], and we represent the spline S between the knot points a_i and a_{i+1} by the coefficients of the thus-transformed polynomial p^(i), rather than by the original coefficients. By way of formulas, the resulting scaled representation of the spline S is the following:

    S(x) = p^(i)((x − a_i)/(a_{i+1} − a_i))    for x ∈ [a_i, a_{i+1}], i = 0, . . . , m − 1,    (4)

where each p^(i) is a polynomial defined on [0, 1], with coefficients p^(i)_0, . . . , p^(i)_n in the standard monomial basis. In the scaled representation of splines, the nonnegativity and continuity constraints are entirely independent of the location of the knot points, as nonnegativity is expressible as p^(i)(x) ≥ 0 for every i = 0, . . . , m − 1 and x ∈ [0, 1], and the continuity of the spline is simply expressed as p^(i+1)(0) = p^(i)(1) for every i = 0, . . . , m − 2.
The linear equations expressing the continuity of the higher-order derivatives also depend only on the ratios of the distances between consecutive knot points, (a_{i+1} − a_i)/(a_i − a_{i−1}).
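For instance, in the scaled representation (4), continuity of the first derivative at an interior knot a_{i+1} (i = 0, . . . , m − 2) reads (p^(i))′(1)/(a_{i+1} − a_i) = (p^(i+1))′(0)/(a_{i+2} − a_{i+1}), that is, (p^(i+1))′(0) = [(a_{i+2} − a_{i+1})/(a_{i+1} − a_i)] · (p^(i))′(1); only the ratio of the lengths of the adjacent subintervals appears in the equation.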
As an example, the complete list of constraints that characterize a nonnegative cubic spline of continuity C 2 , with knot points a 0 , . . . , a m , is provided in Theorem 3 in the online supplementary materials (Appendix B).

POLYHEDRAL CONES OF SPLINES
In this section, we examine polyhedral approximations of cones of nonnegative splines. We concentrate on models of the following form: we fix a basis U = {u_0, . . . , u_n} of degree n polynomials nonnegative over [0, 1] and then consider splines whose pieces, written in the basis U in the scaled representation (4), have all coefficients p^(i)_k nonnegative. Let P(U, a) denote the set of such splines with knot point sequence a. These are clearly a subset of all nonnegative splines.
As mentioned in the Introduction, the approach of approximating nonnegative splines by functions with nonnegative coefficients in a nonnegative basis (henceforth called the nonnegative basis method) is certainly not new. But to our knowledge, there has not been any systematic analysis of what bases may or may not be used in such an approach. The following simple example shows that this question is of prime importance. Consider, in our notation, the basis U = {1, x, . . . , x^n} of degree n polynomials. This is a nonnegative basis over [0, 1]. Since these functions are also monotone nondecreasing, it is immediately clear that for every knot point sequence a, the cone P(U, a) consists only of monotone nondecreasing functions. Therefore, optimization over P(U, a) will not yield useful estimators if a general (possibly decreasing) nonnegative estimator is sought. As we shall see below, using the Bernstein polynomial basis, defined by u_i(x) = (n choose i) x^i (1 − x)^{n−i} (i = 0, . . . , n), in place of the monomial basis is a theoretically sound choice.
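As a small illustration of the difference between the two bases, the following Python snippet (our example; the function name is ours) converts monomial coefficients on [0, 1] into Bernstein coefficients using the standard change-of-basis formula c_i = Σ_{j≤i} [C(i, j)/C(n, j)] a_j. The polynomial 1 − x is nonnegative but decreasing on [0, 1]; it has a negative monomial coefficient, so it lies outside the cone generated by the monomial basis, yet its Bernstein coefficients (1, 0) are nonnegative.

    from math import comb

    def bernstein_coefficients(a):
        # Monomial coefficients a = (a_0, ..., a_n) on [0, 1] -> Bernstein coefficients,
        # using c_i = sum_{j <= i} [C(i, j) / C(n, j)] * a_j.
        n = len(a) - 1
        return [sum(comb(i, j) / comb(n, j) * a[j] for j in range(i + 1))
                for i in range(n + 1)]

    print(bernstein_coefficients([1, -1]))         # 1 - x       -> [1.0, 0.0]
    print(bernstein_coefficients([0, 1, -1]))      # x(1 - x)    -> [0.0, 0.5, 0.0]
    print(bernstein_coefficients([1, -3, 3, -1]))  # (1 - x)^3   -> [1.0, 0.0, 0.0, 0.0]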
Our first result in this section is a sufficient condition for a sequence of polyhedral spline approximations to form a sieve. Recall that the mesh size of the knot point sequence a is denoted by ‖a‖.
Theorem 1. Consider a basis U = {u_0, . . . , u_n} of polynomials of degree n ≥ 1 such that each u_i is nonnegative over [0, 1], and assume that 1 ∈ int cone(U), where 1 denotes the constant polynomial 1. Furthermore, let {a_i} be an asymptotically nested sequence of knot point sequences in [0, 1] satisfying lim_{i→∞} ‖a_i‖ = 0. Then, the set ∪_i P(U, a_i) is a dense subcone of P ∩ C([0, 1]), the cone of nonnegative continuous functions over [0, 1]. See Appendix C in the online supplementary materials for the proof. As Bernstein polynomials of degree n sum to the constant polynomial 1, we can construct, for every n, a sieve that consists of polyhedral cones of (n − 1) times differentiable polynomial splines.
Corollary 1. For every n, the cone of polynomial splines of degree n whose pieces have nonnegative weights in the Bernstein polynomial basis is a dense polyhedral subcone of nonnegative continuous functions over [a, b] consisting entirely of (n − 1) times differentiable functions.

Henceforth, we shall call this subset of nonnegative splines piecewise Bernstein polynomial splines.
Our last observation about polyhedral sets of splines concerns B-splines. B-splines are particularly popular in the approximation and engineering literature because of their excellent theoretical and computational properties. However, as has recently been shown, cones generated by B-splines are proper subcones of the cones of piecewise Bernstein polynomial splines.
Theorem 2 (Papp 2011, theorem 3.6). For every positive integer n and knot point sequence a, the cone of functions generated by B-splines of degree n with knot points a is a subset of the cone of piecewise Bernstein polynomial splines of the same degree, with knot points a. For n ≥ 2, this containment is strict.
Note that if the knot points and the degrees are fixed, piecewise Bernstein polynomial splines and B-splines have the same degrees of freedom. Therefore, piecewise Bernstein polynomial splines provide a better approximation of nonnegative splines at no additional cost. Hence, we do not consider B-splines in this article any further.

KNOT POINT SELECTION
In each of the spline models above, we have assumed that a fixed sequence of knot points a 0 , . . . , a m is given. Finding the best selection of knot points is a central, but very difficult, problem. Ideally, we would make the knot points variables, and optimize over them as well as over the coefficients of the polynomials, but this would result in an intractable, nonconvex optimization problem.
A common practice is to use evenly spaced knot points. While this is a very crude method, it is also very simple, and it fits the conic sieve framework discussed in Section 2, as splines with evenly spaced knot points give rise to an asymptotically nested sieve. In our numerical examples, we used this method.
Another possibility is to place knot points at the data points. A theoretical result supporting this idea is that the optimal solutions to certain estimation problems (least-square regression with a penalty term for smoothing) are natural cubic splines with knots at the data points (Wahba 1990). We can start with a trivial subdivision (with two knot points, one at each endpoint of the domain), and add knot points one by one, at each step, subdividing one of the intervals with the largest number of data points in it. This gives rise to a nested sieve.
In either case, the optimal number of knot points can be found by common model selection procedures: validation on a test set (if there is one), cross-validation or k-folding (if there is no separate test set), or by using some information criterion, such as the Akaike or the Bayesian information criterion (AIC and BIC, respectively); see (Burnham and Anderson 2004). All models proposed in this article are solvable quickly enough that even using leave-one-out cross-validation is computationally feasible. In the computational experiments, when the knot point sequences were not nested, but asymptotically nested, we used leave-one-out cross-validation. Whenever we used nested sieves, and the objective function had no smoothing penalty, we used AIC c for model selection for simplicity.
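Recall that, with k effective parameters and N observations, AICc = AIC + 2k(k + 1)/(N − k − 1) = −2 log L̂ + 2k + 2k(k + 1)/(N − k − 1), where L̂ is the maximized likelihood; the correction term vanishes as N grows, recovering the ordinary AIC (Burnham and Anderson 2004).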

OPTIMIZATION MODELS FOR SHAPE-CONSTRAINED ESTIMATION
This section reviews a number of estimation problems that are solvable with the techniques proposed in the article. Numerical illustrations of these applications can be found in Section 5.

NONPARAMETRIC REGRESSION OF A NONNEGATIVE FUNCTION
In nonnegative regression, our goal is to estimate a function f based on data z_i = (x_i, y_i), i = 1, . . . , N, assumed to come from the model

    y_i = f(x_i) + ε_i,

where the ε_i are independent, identically distributed random variables with mean zero, and the function f is assumed to belong to a class of nonnegative functions F. The goodness of fit of f to the data is measured by a loss functional of the form

    L(f | z_1, . . . , z_N) = d(f) + λ s(f),

where the term d(f) measures the distance of the function values f(x_i) from the observations y_i, while s(f) is a penalty (or smoothing) term, weighted by λ ≥ 0, that penalizes "rough" or overly complex solutions. The smoothing term s(f) may be omitted. If present, it is typically chosen to be a roughness measure such as ∫ |f′|, ∫ |f″|, or ∫ (f″)². If f is estimated with nonnegative splines of degree 3 or less, then all the above choices of s(f) + d(f) lead to optimization models with only linear and second-order cone constraints.
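The following Python sketch shows one way such a model can be assembled; it is our illustration (the article's implementation used AMPL), it relies on the cvxpy package, and for brevity it imposes nonnegativity through the polyhedral piecewise-Bernstein cone described earlier rather than through the exact SDP/SOCP characterization. The function name is ours.

    import numpy as np
    import cvxpy as cp
    from math import comb

    def fit_nonnegative_spline(x, y, knots, degree=3):
        # Least-squares regression with a nonnegative spline of the given degree (>= 2)
        # and C^2 continuity; each piece is written in the Bernstein basis on [0, 1],
        # so nonnegative coefficients guarantee a nonnegative estimator.
        m, n = len(knots) - 1, degree
        h = np.diff(knots)                               # subinterval lengths
        C = cp.Variable((m, n + 1), nonneg=True)         # Bernstein coefficients of the pieces
        preds = []
        for xk in x:
            i = min(int(np.searchsorted(knots, xk, side="right")) - 1, m - 1)
            t = (xk - knots[i]) / h[i]                   # scaled coordinate in [0, 1]
            basis = np.array([comb(n, j) * t**j * (1 - t)**(n - j) for j in range(n + 1)])
            preds.append(C[i, :] @ basis)                # spline value at xk
        cons = []
        for i in range(m - 1):
            # C^0, C^1, and C^2 continuity at the interior knots; the common factors
            # n and n(n - 1) of the Bernstein derivatives cancel on both sides.
            cons += [C[i, n] == C[i + 1, 0],
                     (C[i, n] - C[i, n - 1]) / h[i] == (C[i + 1, 1] - C[i + 1, 0]) / h[i + 1],
                     (C[i, n] - 2 * C[i, n - 1] + C[i, n - 2]) / h[i]**2
                         == (C[i + 1, 2] - 2 * C[i + 1, 1] + C[i + 1, 0]) / h[i + 1]**2]
        cp.Problem(cp.Minimize(cp.sum_squares(cp.hstack(preds) - y)), cons).solve()
        return C.value

    # Example use with simulated data:
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.maximum(np.sin(3 * x), 0) + rng.normal(0, 0.1, 100)
    coeffs = fit_nonnegative_spline(x, y, np.linspace(0, 1, 11))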
Further possible constraints include periodicity and interpolation constraints, which can be modeled by adding linear equations to the set of constraints. This does not increase the difficulty level of the optimization models.

MONOTONE, CONVEX, AND CONCAVE REGRESSION
Suppose the estimator of the unknown function f belongs to a linear space of twice-differentiable functions whose first and second derivatives lie in a space where nonnegativity is easily characterized. Then, any combination of monotonicity, concavity, and convexity constraints on the estimator can be added to the optimization models without any difficulty, as these constraints reduce to sign constraints on the derivatives. When using cubic splines, convexity and concavity can be expressed by finitely many linear constraints, since f″ is piecewise linear.
In the presence of multiple shape constraints, further simplifications may be possible, especially when splines of low degree are used. For example, a concave function f is nonnegative over [a, b] if and only if f (a) ≥ 0 and f (b) ≥ 0. Consequently, in the presence of the concavity constraint, the nonnegativity constraint simplifies to two linear inequalities.

UNCONSTRAINED DENSITY ESTIMATION
One can formulate the estimation of a probability density function (pdf) from a finite set of independent samples {X 1 , . . . , X n } as an optimization problem involving nonnegative functions.
A pdf must be nonnegative and integrate to 1. We assume that the pdf to be estimated has finite support, say [a_0, a_m], and that it is continuous; therefore, it can be approximated by nonnegative polynomial splines of a fixed degree. When using a spline model, the condition that the pdf should integrate to 1 simplifies to a linear constraint, since the integral of a polynomial on a given interval is a linear function of the coefficients of the polynomial. For example, a cubic spline model can be constructed by adding the constraint ∫_{a_0}^{a_m} f(x) dx = 1 to the characterization of nonnegative cubic splines provided in Theorem 3 in Appendix B.
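Concretely, in the scaled representation (4), ∫_{a_0}^{a_m} f(x) dx = Σ_{i=0}^{m−1} (a_{i+1} − a_i) ∫_0^1 p^(i)(t) dt = Σ_{i=0}^{m−1} (a_{i+1} − a_i) Σ_{j=0}^{n} p^(i)_j/(j + 1), which is indeed linear in the coefficients p^(i)_j.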
Finally, the objective function needs to be determined. The most common and straightforward approach is maximum likelihood (ML) estimation. If the unknown pdf is denoted by f, this amounts to maximizing the likelihood function ∏_{i=1}^n f(X_i). This objective function will cause numerical problems, and furthermore, is not necessarily a concave function, which makes its maximization difficult. Instead, we will use the negative log-likelihood function −Σ_{i=1}^n log f(X_i) as the loss function, which is convex if f is a polynomial spline with a given knot sequence. This way, we obtain a convex optimization model with only linear and second-order cone or semidefinite constraints, depending on the degree of the spline. (The importance of the optimization model being convex is that local optima are global, making the optimization considerably easier.) It is also important to note that by constraining f to be a polynomial spline, the above ML optimization problem is always well defined: it has an optimal solution for every fixed set of knot points.
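A minimal sketch of the resulting model, again using cvxpy (our illustration, not the article's code): here B is a hypothetical N × q matrix whose kth row contains the values of the chosen spline basis functions at the sample X_k, w is the vector of their integrals over [a_0, a_m], and nonneg_constraints supplies whichever description of the nonnegative cone is used (nonnegative Bernstein coefficients, or the semidefinite characterization above).

    import cvxpy as cp

    def ml_density_fit(B, w, nonneg_constraints):
        # Maximum-likelihood spline density estimation: maximize the log-likelihood
        # over spline coefficients c subject to f >= 0 and the integral of f being 1.
        q = B.shape[1]
        c = cp.Variable(q)
        constraints = [w @ c == 1] + nonneg_constraints(c)   # density integrates to 1
        objective = cp.Maximize(cp.sum(cp.log(B @ c)))       # concave in c
        cp.Problem(objective, constraints).solve()
        return c.value

    # With a Bernstein-basis representation, nonnegativity is simply c >= 0:
    # ml_density_fit(B, w, lambda c: [c >= 0])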

UNIMODAL DENSITY ESTIMATION
Further constraints may be added to the density estimation problem of the previous section for unimodal density estimation. If the mode is known, we can place one of the knot points on the mode and then add constraints that the spline is increasing from the first knot point to the mode, and is decreasing from the mode to the last knot point.
If the mode is unknown, we have a nonconvex problem: the set of unimodal functions is not convex simply because the sum of two unimodal functions is not necessarily unimodal. In this situation, an approximate solution to the problem can be found by solving a sequence of optimization models, each with a different mode, and comparing the optimal solutions that correspond to the different modes. Finding the exact solution this way is still nontrivial, as the ML function is not a unimodal function of the mode.

NUMERICAL EXAMPLES
In this section, we have compiled results of a number of numerical experiments in which the SOCP-based sieves of cubic and quartic splines (with uniform, asymptotically nested knot points) were compared with piecewise Bernstein polynomial spline models, kernel methods, smoothing splines, and, in one case, parametric models. Owing to the large number of models and problems, as well as page limitations, these results are included only as numerical illustrations, which demonstrate the wide applicability and computational feasibility of the models described in Sections 3 and 4. A comprehensive empirical comparison of all the available methods in each of the shape-constrained estimation problems covered by our framework would require, and perhaps merits, a separate article. Several more results, along with more details on the experiments below, are included in Papp (2011, sec. 3.5).
The optimization models described in Sections 3 and 4 were implemented using the AMPL modeling language, and were solved using the nonlinear solvers KNITRO version 5.1 (Nocedal and Waltz 2003) or CPLEX version 12.3 (CPLEX 2011).

DENSITY ESTIMATION
Tests were conducted to compare our methods with a wide range of kernel methods for density estimation. The choice of benchmark distributions, as well as the experimental design, is based on Eggermont and LaRiccia (2001, chap. 8); the distributions are shown in Figure 1. The benchmark pdfs are defined in terms of φ_σ(x), the pdf of the normal distribution with mean zero and standard deviation σ, U([a, b]), the pdf of the uniform distribution on [a, b], and ψ_{α,β}, the pdf of the Beta density with parameters α and β (see Figure 1). For each benchmark density, random samples of size 100 were generated. Then, the optimal cubic spline densities were determined using the Bernstein polynomial-based and the SOCP-based methods outlined in Sections 3.1 and 3.2, and compared with the kernel estimates of Eggermont and LaRiccia (2001), which employ the Epanechnikov kernel and the normal density kernel, and 13 bandwidth selection methods. These bandwidth selection methods include the "optimal method," which simply determines the bandwidth that minimizes the L_1 error. This is clearly not a rational method, as it requires knowledge of the pdf being estimated, but it serves as a perfect benchmark, as it is an upper bound on the performance of all possible bandwidth selection methods. The process was repeated 100 times; the statistics of the L_1 distances of the estimators and the true pdfs are reported in Figure 2.

Figure 2. Boxplots showing the median, the range, and the interquartile range of the L_1 errors of "optimal" kernel and spline estimates in the density estimation benchmarks; dots mark outliers. OP-E and OP-N are the lower bounds on the errors of the best-possible kernel estimates using the Epanechnikov and the normal-density kernel, respectively; Bernstein and SOCP are actual errors from an untuned implementation of the proposed methods.
It is safe to conclude that in three of the examples, both the Bernstein polynomial-based and the SOCP-based method give results comparable to those of the best-possible kernel methods. There is a significant difference between the two families of methods only in the results for f_2 (favoring Bernstein and SOCP; Mann-Whitney test p-value < 10^{−10}), and for f_5 and f_6 (favoring kernel methods; p-values < 10^{−9}). The results are easy to interpret: the less smooth the estimated pdf is, the greater the disadvantage of cubic splines relative to the kernel methods; for very smooth pdfs, the spline estimators clearly outperform the kernel methods.
The difference between the Bernstein polynomial-based and the SOCP-based method is mostly insignificant, but the SOCP-based method did better in the least smooth examples, f_5 and f_6 (Mann-Whitney test p-values 1.4 · 10^{−4} and 0.023, respectively). The bounds OP-E and OP-N are not significantly different from each other (p-value > 0.749 for each test function).

Monotone Regression: Shape Constraint Versus Smoothing Penalty.
In this section, we outline an experiment we used to compare the effect of imposing shape restrictions on the estimator with the effect of using a smoothing penalty.
We simulated data using the model

    Y = f(X) + ε,    (5)

where f is a smooth function defined in terms of the error function erf(x) = (2/√π) ∫_0^x e^{−t²} dt, and ε is normally distributed with mean 0. The standard deviation σ of ε was varied in different experiments. The function f was chosen so that it is increasing, yet it has a number of essentially flat sections, as well as strictly increasing sections of various slopes. As a result, this function is likely to expose the shortcomings of regression methods that do not include monotonicity as a constraint in their model. With its step-function-like behavior, this function is also expected to expose the oscillation problems that often arise in estimation with polynomials.
Random samples of size 100 were drawn uniformly from the interval [0, 1]. Optimal cubic spline estimators were selected from the unconstrained and monotone increasing cubic splines with up to 100 uniformly placed knot points; the final number of knot points was selected by cross-validation.
We also considered smoothing splines: cubic splines that minimize the penalized residual sum-of-squares objective function

    Σ_{i=1}^N (y_i − f(x_i))² + λ ∫ (f″(x))² dx,

where λ > 0 is the smoothing parameter. The optimal number of knot points and the optimal λ were both determined by cross-validation. We considered both unconstrained and monotone increasing splines with the smoothing penalty.
We compared the resulting four estimators by measuring the L 1 and L 2 distances of the function f and the estimators. This process was repeated 100 times with two different noise levels: σ = 0.15 and σ = 0.3. Boxplots of the L 1 errors are shown in Figure 3; the plots corresponding to the L 2 distances are similar.
Sample plots of some of the optimal estimators are shown in Figure 4. These plots are typical in that the unconstrained splines generally showed considerably more oscillation than the monotone estimators (which, of course, cannot oscillate).
It is instructive to compare the distribution of the number of knot points of the four spline varieties considered; Figure 5 shows the empirical distribution for the σ = 0.15 case. It is obvious from the figure that unlike the shape constraint, smoothing markedly increases both the number of knot points and the variability in the number of knot points of the optimal spline estimators. The same trend was observed for the σ = 0.3 case (figure omitted).
It is also interesting to note that unconstrained smoothing splines did not yield monotone estimators in any of the 200 experiments.

Figure 3. Comparison of the unconstrained and the monotone spline estimators, with and without smoothing penalty, for the regression curves of a dataset simulated using the model (5) with noise levels σ = 0.15 and σ = 0.3. The boxplots show the median, the interquartile range, and the range of the L_1 distances between the estimators and the true regression function; dots mark outliers. Imposing monotonicity on the estimator improves the quality of the estimation significantly, both for unconstrained and for smooth estimators. The smoothing penalty does not have the same effect.

Figure 4. The true regression function (5), to be estimated, is the dashed curve; the solid curves are the estimators. Left to right: unconstrained cubic spline, unconstrained smoothing spline, monotone increasing spline, and monotone increasing smoothing spline. Smoothing reduces but does not eliminate oscillation; in the presence of the monotonicity constraint, smoothing does not yield noticeable improvement.

Mixed-Shape Constraints and Higher-Degree Splines.
In this section, we outline an experiment we used to compare the effect of imposing multiple shape restrictions on the estimator with the effect of increasing the degree of the spline estimator.
We simulated noisy data using the model Y = f(X) + ε, where f(x) = 1/(1 + e^{−10x}), x ∈ [0, 1], and ε is normally distributed with mean 0 and standard deviation 0.2. This function was chosen because it has a nearly linear increasing part and also a long, nearly horizontal part on the domain; this way it is likely that explicit monotonicity and concavity constraints will be required for a good-quality fit.
Random samples of size 50 were drawn uniformly from the interval [0, 1]. As a baseline for the evaluation of the quality of the estimators, for each sample, the least-squares optimal models from the one-parameter family Y = 1/(1 + e^{−bX}) + ε and the two-parameter family Y = 1/(a + e^{−bX}) + ε were computed. We compared these models with different shape-constrained polynomial spline models. These were obtained similarly to the spline estimators of the previous section, except that in this example, we compared splines of different degrees and parametric models rather than comparing smoothing splines with splines without smoothing. The comparisons were made by measuring the L_1 and L_2 distances of the function f and the estimators. This process was repeated 100 times.
Boxplots of the L 1 errors are shown in Figure 6; the plots corresponding to the L 2 distances are similar.
It is immediately clear from Figure 6 that the estimators benefit from imposing either shape constraint. Both the increasing and the concave cubic spline are significantly better estimates than the (otherwise unconstrained) nonnegative splines, and the concave increasing spline is significantly better than the concave-only estimator (Mann-Whitney test p-values < 7 · 10^{−5} for each pair); the concave increasing spline is also better than the increasing-only estimator, but the difference is not significant (p-value = 0.15). Quartic splines, on the other hand, did not yield better results than cubic ones in this example. Note that the spline estimators outperform even the two-parameter model. The optimal concave increasing spline usually had five or six effective parameters. As expected, the one-parameter model proved to be hard to match.

Figure 5. The empirical distribution of the number of knot points of the optimal spline estimators. Left to right: unconstrained cubic spline, unconstrained smoothing spline, monotone increasing spline, and monotone increasing smoothing spline. Smoothing markedly increases the number of knot points and the variability in the number of knot points.

Figure 6. Comparison of parametric and spline estimators for the regression curves of a simulated dataset. The parametric models are best least-square fits from a one- and a two-parameter family containing the true function. The remaining estimators are nonnegative splines with different combinations of additional shape constraints imposed on them. The boxplot shows the median, the interquartile range, and the range of the L_1 distances between the estimators and the true regression function; dots mark outliers. Each added shape restriction improves the quality of the estimation, but increasing the degree of the polynomial pieces does not help.

DISCUSSION
Two optimization-based approaches using conic sieves of shape-constrained polynomial splines were discussed, and compared with each other and with unconstrained smoothing splines and kernel methods in various applications of shape-constrained estimation problems. We proposed a novel approach, based on SDP and SOCP, and revisited the popular nonnegative basis approach. The choice of basis in the nonnegative basis approach is critical. We presented a condition to test whether a given basis is appropriate, and proved that piecewise Bernstein polynomial splines satisfy this condition.
Theoretically, the SDP/SOCP approach, which requires solving optimization problems with semidefinite and second-order conic constraints, is clearly superior to the basic nonnegative basis approach (which employs only linear constraints to represent polyhedral cones of splines), as it allows us to optimize precisely over the set that we want: the set of nonnegative splines. Using polyhedral sieves (following the nonnegative basis approach) results in simpler, linearly constrained models, but in most cases, it only allows us to optimize over a strictly smaller cone than necessary.
In some instances, the two approaches give very similar results, but we have also found examples where the SOCP approach is superior in practice, too. We exhibited cases when the two approaches are provably equivalent; these are problems with multiple shape constraints on low-degree splines.
The optimization models obtained from both approaches are easily solvable in a fraction of a second using readily available software. This was demonstrated using examples with simulated data; we refer the reader to the thesis of the first author for more detailed examples (Papp 2011). From the viewpoint of computational feasibility, these approaches are competitive with kernel methods and even with closed-form formulas, which are only available for very restricted special cases. We found that the nonnegative spline estimators match, and frequently outperform, state-of-the-art kernel estimators in density estimation problems.
We underline that the validity and algorithmic efficiency of the proposed method relied only on the fact that the set of nonnegative polynomial splines admits a characterization using only linear and semidefinite constraints, which made it easy to optimize all commonly used loss functions over them. This characterization of nonnegative polynomials is an immediate consequence of a classic result that nonnegative univariate polynomials can be written as sums of squares of polynomials. Hence, the method can be applied verbatim in other spaces of functions where nonnegative functions have a similar "sum of squares" characterization. An example is trigonometric polynomials: nonnegative trigonometric polynomials have a characterization analogous to nonnegative polynomials. Ramifications of this observation, along with many more examples (including certain families of rational functions and exponential families), can be found in (Papp and Alizadeh 2011) and (Papp 2011). The application of the same approach in multivariate estimation problems requires further study.

SUPPLEMENTARY MATERIALS
Proofs and some other technical details are provided in the supplementary materials as Appendices.