The Fast RODEO for Local Polynomial Regression

An open challenge in nonparametric regression is finding fast, computationally efficient approaches to estimating local bandwidths for large datasets, in particular in two or more dimensions. In the work presented here, we introduce a novel local bandwidth estimation procedure for local polynomial regression, which combines the greedy search of the regularization of the derivative expectation operator (RODEO) algorithm with linear binning. The result is a fast, computationally efficient algorithm, which we refer to as the fast RODEO. We motivate the development of our algorithm by using a novel scale-space approach to derive the RODEO. We conclude with a toy example and a real-world example using data from the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite validation study, where we show the fast RODEO’s improvement in accuracy and computational speed over two other standard approaches.


INTRODUCTION
Multivariate nonparametric regression is an invaluable tool for exploratory data analysis and inference, particularly local polynomial regression (LPR; Fan and Gijbels 1996), which has found widespread use. A key component of LPR and other nonparametric smoothers is the estimation of the bandwidth (i.e., smoothing) parameter or, in the multivariate case, the bandwidth matrix. With large amounts of data, for example, in image analysis and remote sensing, bandwidth estimation can be a computationally prohibitive process, in particular local bandwidth estimation, which we focus on here.
While considerable work has gone into the development of methods for global bandwidth selection (see Jones, Marron, and Sheather 1996 for a review of univariate methods), the development of computationally tractable methods for local bandwidth selection, especially in the multivariate setting, remains an open and challenging problem. One powerful approach to locally adaptive smoothing that should be mentioned is the propagation-separation approach to nonparametric smoothing (formally introduced as adaptive weights smoothing; Polzehl and Spokoiny 2000, 2006). The main idea behind this method is to find the largest possible neighborhood about each point in which the local parametric assumptions about the data are justified. While the method has many theoretically appealing properties and has shown strong performance in many practical applications, its focus is not on computational speed and scalability, and as such it is not as well suited to larger scale problems.
One promising new approach for fast local (and global) multivariate bandwidth selection is the regularization of the derivative expectation operator algorithm (RODEO; Lafferty and Wasserman 2008). The RODEO was developed as a general nonparametric method for bandwidth and variable selection in high-dimensional data; here, we focus on bandwidth estimation. The principle behind the algorithm is to repeatedly decrease the bandwidth by small increments and determine whether each decrease leads to a significant change in the estimator. Given a particular value of the bandwidth, this is equivalent to testing whether the derivative of the estimator, as a function of the bandwidth, is significant: if it is, the algorithm proceeds in a greedy manner to a smaller bandwidth; otherwise it stops. This procedure is repeated independently for each observation along each dimension; the final output is a collection of bandwidth estimates for each observation.
Herein, we introduce the fast RODEO, a novel approach to bandwidth estimation that combines the greedy search of the RODEO with linear binning (Fan and Marron 1994; Wand 1994). The integration of these methodologies results in faster, more efficient, and more scalable local bandwidth estimation than standard approaches. We also give a scale-space-based interpretation of the standard and fast RODEO algorithms, providing new insights into their behavior.
The application we focus on here is high spectral resolution light detection and ranging (HSRL). HSRL is a remote sensing technology that is commonly used to characterize clouds and small particles in the atmosphere called aerosols. The measurements collected by this system are backscattered light or photons from these clouds and particles. Thus, a single measurement at a given spatial location is the aggregation of measured light along the path from that particular point to the HSRL system. To obtain an actual estimate of cloud density or aerosol concentration at any given spatial location, the derivative of these raw measurements must be taken.
The dataset we consider here includes HSRL measurements of aerosols collected as part of the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite validation study (http://science.larc.nasa.gov/hsrl/CALIPSO-VAL.html). These approximately 300 vertical by 20,000 horizontal measurements were collected from an aircraft flying at an altitude of approximately 9 kilometers (km) over a 4-hour time period, with data collected at 0.5-sec intervals. Figure 1, taken from Rogers et al. (2009), shows a plot of the flight path, and Figure 2 the corresponding "unfolded" raw measurements. As mentioned above, we refer to these as raw measurements because we are interested in the derivative (which must be estimated) along the altitude axis, which gives the desired quantity, that is, aerosol concentration. Also note that the x-axis, labeled "Time (hrs)," corresponds to a time stamp of the spatial location of the plane measured from the beginning of the flight (as shown in Figure 1). These measurements have been preprocessed/corrected to adjust for movement of the plane, the effects of wind, and other environmental factors. While there is no ground truth available for the CALIPSO data to confirm that a local bandwidth is optimal, the distribution of measurements shown in Figure 2 suggests considerable variation in the scale of local features. As a result, a single global bandwidth risks smoothing out fine-scale features in some regions while undersmoothing others.
In Section 2, we provide an overview of the RODEO algorithm, the principles of binning in the context of LPR, and how these methods can be interpreted in scale-space. In Section 3, we describe the fast RODEO procedure, and in Section 4 we provide results on both a toy and a real-world example.

THE RODEO
In this section, we provide an overview of the RODEO algorithm. Let $Y_i \in \mathbb{R}$ and $X_i = (X_{i,1}, \ldots, X_{i,d}) \in \mathbb{R}^d$, $i = 1, \ldots, n$, denote the set of observed sample pairs. In the HSRL example, $Y_i$ would be the measured aerosol extinction and $X_i = (X_{i,1}, X_{i,2})$ the altitude and time stamp of that measurement. Next, define $m(x)$ to be the true, unknown function, so that
$$Y_i = m(X_i) + v(X_i)\,\epsilon(X_i).$$
Here, the smooth function $v(x)$ models the local variation in the data (assumed to be equal to a constant $\sigma$ for the purposes of the work presented here) and $\epsilon(x)$ is the iid random error with mean 0 and variance 1. Let $\hat m_H(x)$ denote the LPR estimate with diagonal bandwidth matrix $H = \mathrm{diag}(h_1, \ldots, h_d)$, and $x = (x_1, \ldots, x_d)$ some target point. It can be shown that the LPR solution has the form
$$\hat m_H(x) = e_1^T \left( X_x^T W_{x,H} X_x \right)^{-1} X_x^T W_{x,H} Y, \qquad (1)$$
where $e_1 = (1, 0, \ldots, 0)^T$, $Y = (Y_1, \ldots, Y_n)^T$, and $W_{x,H} = \mathrm{diag}\{K_H(X_1 - x), \ldots, K_H(X_n - x)\}$. Here $K_H(u) = \prod_{j=1}^d h_j^{-1} K(u_j / h_j)$ is assumed to be a product kernel. To illustrate the form of $X_x$, suppose $p = 1$, which corresponds to local linear regression (LLR); then
$$X_x = \begin{pmatrix} 1 & (X_1 - x)^T \\ \vdots & \vdots \\ 1 & (X_n - x)^T \end{pmatrix}. \qquad (2)$$
Following Lafferty and Wasserman (2008), we express the "effective kernel" as
$$G(u, x, H) = e_1^T \left( X_x^T W_{x,H} X_x \right)^{-1} \begin{pmatrix} 1 \\ u - x \end{pmatrix} K_H(u - x),$$
so that $\hat m_H(x) = \sum_{i=1}^n G(X_i, x, H) Y_i$. Recall that the RODEO algorithm looks to determine whether an infinitesimal decrease in the bandwidth will lead to a significant change in the estimator. This is done by testing whether the derivative of the estimator with respect to the bandwidth in the direction of the $j$th dimension, defined as
$$Z_j = \frac{\partial \hat m_H(x)}{\partial h_j} = \sum_{i=1}^n \frac{\partial G(X_i, x, H)}{\partial h_j}\, Y_i, \qquad (3)$$
is significant. The corresponding threshold is
$$\lambda_j = V_j \sqrt{2 \log n}, \qquad (4)$$
where
$$V_j^2 = \sigma^2 \sum_{i=1}^n \left( \frac{\partial G(X_i, x, H)}{\partial h_j} \right)^{2}. \qquad (5)$$
Several methods for estimating $\sigma$ are given in Lafferty and Wasserman (2008), the details of which are also provided in the supplemental material (Section 1, with additional discussion at the end of Section 2). To summarize: for each point $x \in \mathbb{R}^d$, the RODEO algorithm starts with a $d \times d$ bandwidth matrix $H = \mathrm{diag}(h_1^0, \ldots, h_d^0)$, with each $h_j^0 (> 0)$ sufficiently large that we are essentially performing standard linear regression, $j = 1, \ldots, d$. Then for each $j$, we calculate $Z_j$ and $V_j$. If $|Z_j| > \lambda_j$, we replace the $j$th diagonal element of $H$ by $\beta h_j$, $0 < \beta < 1$, and continue iterating in this way until the test is failed.
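As a concrete illustration of the LPR fit at a single target point, the following is a minimal local linear regression sketch with a diagonal bandwidth matrix and a Gaussian product kernel. The function and variable names are ours, not from the paper.

```python
# A minimal sketch of a local linear regression (LLR) fit at a single
# target point x, with diagonal bandwidth H = diag(h_1, ..., h_d) and a
# Gaussian product kernel. Names are ours, not the paper's.
import numpy as np

def llr_fit(X, Y, x, h):
    """Local linear estimate e1' (Xx' W Xx)^{-1} Xx' W Y at the point x."""
    n, d = X.shape
    U = X - x                                # centered design, rows (X_i - x)
    # product kernel: K_H(u) = prod_j K(u_j / h_j) / h_j, K Gaussian
    w = np.prod(np.exp(-0.5 * (U / h) ** 2) / (h * np.sqrt(2 * np.pi)), axis=1)
    Xx = np.hstack([np.ones((n, 1)), U])     # design matrix for p = 1
    W = np.diag(w)
    beta = np.linalg.solve(Xx.T @ W @ Xx, Xx.T @ W @ Y)
    return beta[0]                           # intercept = estimate at x
```

A useful sanity check is that LLR reproduces an exactly linear surface regardless of the bandwidth.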
This process is repeated for each observation. The algorithm is summarized in Figure 3. The final output is a set of local bandwidth estimates H * for each x.
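The greedy shrink-and-test loop can be sketched schematically as follows. We use a finite-difference approximation to the derivative of the estimator with respect to each bandwidth in place of the closed-form statistic $Z_j$, and an abstract threshold function; both are stand-ins for illustration, and all names are ours.

```python
# A schematic sketch of the greedy RODEO loop at one point x. We probe
# d m_hat / d h_j by a finite difference rather than the closed form in
# the paper, and treat the threshold as a supplied function. Names ours.
import numpy as np

def rodeo_at_point(estimate, x, d, h0, beta, n_steps, threshold):
    """estimate(x, h) -> m_hat; shrink h_j while |Z_j| exceeds threshold."""
    h = np.full(d, h0, dtype=float)
    active = set(range(d))
    for _ in range(n_steps):
        if not active:
            break
        for j in list(active):
            h_plus = h.copy()
            h_plus[j] *= 1.0 + 1e-4          # finite-difference probe in h_j
            Zj = (estimate(x, h_plus) - estimate(x, h)) / (h_plus[j] - h[j])
            if abs(Zj) > threshold(h, j):
                h[j] *= beta                 # significant change: keep shrinking
            else:
                active.discard(j)            # test failed: freeze h_j
    return h
```

In the full algorithm this loop is run independently at every point, yielding the collection of local bandwidth matrices $H^*$.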
Example 2.1. To provide a more concrete illustration of what the RODEO solution path will look like, consider the example in Figure 4. The data, shown in the $x,y$-plane, are sampled from $f(x) = \sin(5/x) + \epsilon$, where $\epsilon \sim N(0, 1/2)$ and $x \in (1/4, \pi)$; along the $z$-axis is a sequence of standard, fixed-bandwidth LPR smooths of the data, placed in order from smallest ("furthest" from us) to largest ("closest" to us) bandwidth. Collectively, this series of smooths creates a "scale-space surface" (Lindeberg 1994; Chaudhuri and Marron 2000; Godtliebsen, Marron, and Chaudhuri 2002; Holmström 2010). Note that, in general, any type of smoothing method can be used, not just LPR.
The line along the scale-space surface is the RODEO solution path, whose coordinates are given by $(x, \hat m_h(x), h)$, where the value of $h$ for each $x$ is determined by the RODEO (outlined in Figure 3). Visualized in this way, the RODEO solution path can be thought of as the result of taking the derivative along the $z$-axis (bandwidth direction) in Figure 4, computing the test statistics $Z$ and $V$, and proceeding as previously discussed.
As expected the size of the bandwidths is directly related to the fineness/coarseness of the local features in the data. In those regions with finer features, the RODEO moves further down the surface (toward smaller bandwidths) and in regions with coarser features it does not (remaining at larger bandwidths). Additional discussion on the relationship between the RODEO and scale-space can be found in the supplemental materials.

MULTIVARIATE BINNING
Binning has been successfully applied in the nonparametric regression setting to improve computational speed and efficiency in both the univariate (Fan and Marron 1994) and multivariate cases (Wand 1994) for fixed global bandwidths. We will show how the same principles can be applied to dramatically improve the computational speed of the RODEO algorithm and local bandwidth estimation. First, we provide an overview of multivariate binning in the context of LPR.
The principle behind binning is the following: assuming that two points located near one another will tend to be close in value, many of the kernel evaluations will be quite similar. Thus, the number of necessary evaluations can be greatly reduced without sacrificing much in the way of accuracy by performing these computations at a reduced number of representative locations. In the context of binning, this is achieved by first constructing an equally spaced grid spanning the range of the X i 's. The data are then approximated by "binning" each X i with its nearest grid point and fitting the LPR model to this grid of values. This has the effect of both reducing the number of calculations (assuming that the number of grid points is less than the original number of observations) and allowing all the kernel evaluations to be precomputed as the distance between the points on the grid is now fixed.
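The apportioning step described above can be sketched in one dimension as follows; the paper applies the same idea coordinate-wise in $d$ dimensions. The function and variable names are ours.

```python
# A minimal sketch of 1-D linear binning: each observation X_i splits
# its mass between the two neighboring grid points in proportion to
# proximity, producing grid counts c (for X) and d (for Y). Names ours.
import numpy as np

def linear_bin(X, Y, g_min, g_max, M):
    """Return (grid, c, d) for an equally spaced grid of M points."""
    grid = np.linspace(g_min, g_max, M)
    delta = (g_max - g_min) / (M - 1)            # binwidth
    c = np.zeros(M)
    d = np.zeros(M)
    t = (X - g_min) / delta                      # fractional grid position
    lo = np.clip(np.floor(t).astype(int), 0, M - 2)
    frac = t - lo                                # weight for the upper neighbor
    np.add.at(c, lo, 1.0 - frac); np.add.at(c, lo + 1, frac)
    np.add.at(d, lo, (1.0 - frac) * Y); np.add.at(d, lo + 1, frac * Y)
    return grid, c, d
```

Note that linear binning conserves total mass: the counts $c$ sum to $n$ and the counts $d$ sum to $\sum_i Y_i$.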
We now provide the details needed for developing the binned implementation of the RODEO algorithm. First, define the matrices
$$X_x^T W_{x,H} X_x \quad \text{and} \quad X_x^T W_{x,H} Y, \qquad (6)$$
which are the same matrices that appear in (1). For the remainder of this article, we suppress the subscripts $x$ and $H$ from the matrices $X_x$ and $W_{x,H}$. Following the notational convention of Wand (1994), let
$$\hat t_{1,\mathbf k} = \sum_{i=1}^n Y_i (X_i - x)^{\mathbf k} K_H(X_i - x), \qquad \hat s_{1,\mathbf k} = \sum_{i=1}^n (X_i - x)^{\mathbf k} K_H(X_i - x), \qquad (7)$$
where $\mathbf k = (k_1, \ldots, k_d)$ and, for a $d$-vector $v$, $v^{\mathbf k} = \prod_{j=1}^d v_j^{k_j}$. So in the case of LLR, using the notation in (6) and (7), we have that the entries of $X^T W X$ are the sums $\hat s_{1,\mathbf k}$ with $k_1 + \cdots + k_d \le 2$ and the entries of $X^T W Y$ are the sums $\hat t_{1,\mathbf k}$ with $k_1 + \cdots + k_d \le 1$. Next, let $g_{j1}, \ldots, g_{jM_j}$ denote the equally spaced grid with $M_j$ points along the $j$th coordinate axis, such that the interval $[g_{j1}, g_{jM_j}]$ contains the range of the $j$th coordinate of the data, $j = 1, \ldots, d$. Define $g_{\mathbf j} = (g_{1 j_1}, \ldots, g_{d j_d})$ to be a point on the grid with index $\mathbf j = (j_1, \ldots, j_d)$, and let the $j$th binwidth be denoted by $\delta_j = (g_{jM_j} - g_{j1})/(M_j - 1)$.
From here on we assume that linear binning is being used (see Hall and Wand 1994 for a discussion of some of the advantages of linear binning over other binning schemes). Once an appropriate grid has been constructed, each data point must be appropriately apportioned to its surrounding grid points. Figure 5 illustrates how linear binning works. Here the point marked "X" will have the top-right proportion of the shaded square assigned to the grid point $g_1$, the top-left to $g_2$, the bottom-right to $g_3$, and the bottom-left to $g_4$. This process is repeated for every observation, resulting in the set of grid counts $(c_{\mathbf j}, d_{\mathbf j})$ that represent the amount of $(X, Y)$ data near each grid point. Specifically, if $\omega_{\mathbf l}(x)$ is the weight that $x$ assigns to the grid point $g_{\mathbf l}$ as described above, then
$$c_{\mathbf j} = \sum_{i=1}^n \omega_{\mathbf j}(X_i), \qquad d_{\mathbf j} = \sum_{i=1}^n \omega_{\mathbf j}(X_i)\, Y_i.$$
The binned approximations of the quantities in (7) are
$$\tilde t_{1,\mathbf k}(g_{\mathbf j}) = \sum_{\mathbf l} \kappa_{\mathbf k, \mathbf l}\, d_{\mathbf j - \mathbf l}, \qquad \tilde s_{1,\mathbf k}(g_{\mathbf j}) = \sum_{\mathbf l} \kappa_{\mathbf k, \mathbf l}\, c_{\mathbf j - \mathbf l}, \qquad (9)$$
where the sum runs over indices $\mathbf l = (l_1, \ldots, l_d)$ with $|l_j| \le \min\{M_j - 1, \lfloor \tau h_j / \delta_j \rfloor\}$,
$$\kappa_{\mathbf k, \mathbf l} = \prod_{j=1}^d (l_j \delta_j)^{k_j}\, \frac{1}{h_j} K\!\left( \frac{l_j \delta_j}{h_j} \right),$$
and $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. The parameter $\tau$ corresponds to the interval $[-\tau, \tau]$ outside of which the kernel function $K$ is exactly or effectively 0. If $K$ has infinite support, then one could replace $K$ by $K \mathbf 1_{[-\tau, \tau]}$, where $\mathbf 1_I$ denotes the indicator function over the interval $I$. The parameter $\tau$ should be chosen so that it has a negligible effect on the estimation procedure; for example, in the case of a Gaussian kernel a reasonable choice would be $\tau \approx 4$.

Figure 5. An illustration of the linear binning procedure.
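The truncated, precomputable kernel weights on the grid can be sketched as follows for a Gaussian kernel; the function and variable names are ours.

```python
# A sketch of precomputing truncated Gaussian kernel weights on the
# grid, keeping only lags |l| <= floor(tau * h / delta); tau ~ 4 makes
# the truncation error negligible for a Gaussian kernel. Names ours.
import numpy as np

def kernel_weights(h, delta, M, tau=4.0):
    """Centered weights kappa_l = K((l*delta)/h)/h on lags -L..L."""
    L = min(M - 1, int(np.floor(tau * h / delta)))   # truncation lag
    lags = np.arange(-L, L + 1)
    u = lags * delta / h
    return lags, np.exp(-0.5 * u ** 2) / (h * np.sqrt(2 * np.pi))
```

Because the grid is equally spaced, these weights depend only on the lag $l$ and can be computed once per bandwidth and reused for every grid point.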
The key observation is that the quantities in (9) are simply convolutions of κ 1 k,l with the counts d j−l and c j−l , respectively, and thus can be rapidly calculated using the fast Fourier transform (FFT). Wand (1995) showed that there is a considerable speed advantage using the FFT over a direct approach.
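The FFT evaluation of these convolutions can be sketched in one dimension as follows (in $d$ dimensions the same idea applies with a $d$-dimensional FFT); the names are ours.

```python
# A sketch of evaluating the binned sums s_tilde(g_j) = sum_l kappa_l *
# c_{j-l} for all grid points at once via FFT-based convolution. The
# kernel kappa is assumed odd-length and centered on lag 0. Names ours.
import numpy as np

def binned_sums_fft(kappa, c):
    """FFT convolution of a centered kernel with the grid counts c."""
    M, L = c.size, kappa.size // 2
    n = M + kappa.size - 1                     # full linear-convolution length
    s = np.fft.irfft(np.fft.rfft(kappa, n) * np.fft.rfft(c, n), n)
    return s[L:L + M]                          # central 'same' part, on the grid
```

A direct check against `np.convolve` confirms that the FFT route computes the same sums; the speed advantage comes from the $O(M \log M)$ cost versus $O(M L)$ for the direct sum.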

BINNING THE RODEO
Before proceeding with the details of the fast RODEO, let us briefly return to Figure 4 that provided an illustration of the RODEO solution path from the scale-space perspective. Recall the scale-space surface shown here is simply a sequence of standard, fixed bandwidth LPR smooths of the data, placed in order from smallest to largest bandwidth. From this perspective, the test statistics Z (3) and V (5) are derived from the partial derivative along the bandwidth-axis.
In a similar fashion, the fast RODEO approach uses binning to rapidly compute the scale-space surface and the corresponding test statistics Z and V.
With this in mind, we now provide details on the derivation of the fast RODEO. We begin by introducing some notation: in (10) we define the matrices $T_{2,j}$, $S_2$, and $S_{l,j}$, $j = 1, \ldots, d$, $l = 3, 4, 5$, which are built from $X$, $Y$, the weight matrix $W$, and its bandwidth derivatives $\partial W / \partial h_j$. Writing $G_j(x, H) = \left( \partial G(X_1, x, H)/\partial h_j, \ldots, \partial G(X_n, x, H)/\partial h_j \right)^T$, it can be shown that $Z_j$ in (3) can be written as
$$Z_j = G_j(x, H)^T Y. \qquad (11)$$
From (11), the variance $V_j^2$ in (5) is calculated as
$$V_j^2 = \sigma^2\, G_j(x, H)^T G_j(x, H). \qquad (12)$$
Expanding (11) and (12) and using the definitions of the matrices in (10), we obtain the expressions in (13). Of particular interest is that the elements of the matrices in (13) can be represented in a form similar to $\hat t_{1,\mathbf k}$ and $\hat s_{1,\mathbf k}$ in (7); these kernel-weighted sums are given in (14). Here, $e_j$ is a $d$-dimensional vector with the $j$th element equal to 1 and 0's elsewhere. Using the notation in (14), the matrices in (10) can be written out explicitly for LLR. Following the discussion in Section 2.2, the quantities in (14) can be expressed in terms of their binned approximations in (15), where the binned kernels are defined analogously to those in (9) and $e_{ij}$ is the $i$th element of the vector $e_j$. Letting $\tilde T_{2,j}$, $\tilde S_2$, and $\tilde S_{l,j}$, $j = 1, \ldots, d$, $l = 3, 4, 5$, be the binned approximations of the matrices in (10) and plugging these into (13), we get the binned approximations $\tilde Z_j$ and $\tilde V_j$ of $Z_j$ and $V_j$. As was discussed in Section 2.2, the key observation here is that the quantities in (15) are simply convolutions of the counts $d_{\mathbf j - \mathbf l}$ and $c_{\mathbf j - \mathbf l}$ with their respective kernels, $\kappa_{\mathbf k, \mathbf l}$ and $\kappa^l_{\mathbf k, \mathbf l, j}$, and so can be quickly and efficiently computed using FFTs.
Remark 3.1. It is important to highlight the relationship between binning and the dimensionality of the data. While the framework we provide generalizes to an arbitrary number of dimensions, it must be kept in mind that the number of grid points grows exponentially with the dimension. Once the number of grid points exceeds the number of observations (as will quickly be the case), binning is no longer a computationally efficient approach.
It is worth mentioning that the largest number of dimensions we have considered in practice is five.
As part of future work, we plan to build a primarily C++-driven version of the fast RODEO to be applied to five-dimensional image data, that is, spatial (x, y, z), spectral (λ), and temporal (t). As these data are densely sampled, the binning-based fast RODEO remains a sensible and computationally efficient approach.
Remark 3.2. An additional point to note is the interplay between the scale of features in the data, the bandwidth, and grid size. As with any implementation of binning, careful considerations need to be made regarding grid size. If the grid is too coarse, we run the risk of masking features potentially present in the data and no choice of bandwidth will help in this situation; if the grid is too fine we increase the number of computations, reducing the efficiency of the algorithm.
In our implementations of the fast RODEO, the choice of grid size is primarily dictated by the features we believe to be present and the size of the dataset. In the HSRL example in Section 4.2, we used bins that were roughly 5 × 12 (recall the dataset itself is approximately 300 × 20,000) since we did not believe that any of the features of interest would be smaller than this. In general, it is recommended that several grid sizes be considered and the results visualized before moving forward with any analysis.
As for the effect of spatially nonuniform data, we have found that binning, in general, is not affected any more or less than non-binning-based methods (e.g., kriging). For an example of the use of the fast RODEO for interpolation (as well as kriging) on spatially nonuniform data, see Cambaliza et al. (in press).
Interesting work by Raykar, Duraiswami, and Zhao (2010) studies the problem of grid size selection and provides exact estimates of the difference between binned and unbinned kernel estimators. This error measure is then used as a criterion for selecting grid size. While not implemented here, in principle similar ideas could be extended to the fast RODEO.

THE FAST RODEO ALGORITHM
In terms of implementation, there are two primary differences between the RODEO algorithm as described in Section 2.1 and what we propose here: first, the range of candidate bandwidths, from largest to smallest, is selected in advance; second, the number of bandwidths across this range over which to search, call this $n_c$, must be specified. Once these values have been selected, what follows is the same as in the standard RODEO algorithm. That is, for every $t \in \{0, \ldots, n_c - 1\}$ and corresponding $h_j(t) = \beta^t h_j^0$, $j = 1, \ldots, d$, along the solution path, we compute the binned approximations of the RODEO statistics, $\tilde Z_j^{(t)}(g)$ and $\tilde V_j^{(t)}(g)$, for each point $g$ on the grid over the entire solution path. The algorithm then proceeds as usual using these precomputed values. This is outlined in Figure 6.
In general, the following guidelines are suggested when selecting values for the parameters $h_j^0$, $\beta$, and $n_c$. First, the initial bandwidth $h_j^0$ is selected so that the effective neighborhood of the kernel function covers (nearly) the entire hypercube formed by the $M_1 \times \cdots \times M_d$ grid, so that we are effectively performing a (global) linear regression. Next, the parameters $\beta$ and $n_c$ are selected so that the effective neighborhood of the kernel function associated with the smallest bandwidth $\beta^{n_c - 1} h^0$ in the RODEO solution path covers only a single (or a small number of) grid point(s).
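The guidelines above can be turned into a small helper: given an initial bandwidth $h^0$ (covering essentially the whole grid) and a target smallest bandwidth $h_{\min}$ (on the order of a binwidth), choose $\beta$ so that the ladder $h(t) = \beta^t h^0$ ends at $h_{\min}$ after $n_c$ steps. The function and names are ours.

```python
# A small helper for the parameter guidelines: solve beta so that the
# geometric bandwidth ladder h(t) = beta^t * h0, t = 0..n_c-1, runs from
# h0 down to h_min, and return the full ladder. Names are ours.
import numpy as np

def bandwidth_ladder(h0, h_min, n_c):
    beta = (h_min / h0) ** (1.0 / (n_c - 1))
    return beta, h0 * beta ** np.arange(n_c)
```

For example, spanning $h^0 = 8$ down to $h_{\min} = 0.5$ in $n_c = 9$ steps gives $\beta = 2^{-1/2} \approx 0.707$.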
The cost of computing the solution path of the fast RODEO is $O(5 d n_c M_1 \log M_1 \cdots M_d \log M_d)$. This can be seen by recalling that the computational complexity of performing an FFT over an $M_1 \times \cdots \times M_d$ grid is $O(M_1 \log M_1 \cdots M_d \log M_d)$. For each of the five matrices in (10) and each of the $d$ dimensions, the FFT of the corresponding grids must be calculated, so the resulting complexity for a fixed $t$ and bandwidth $h(t) = \beta^t h^0$ is $O(5 d M_1 \log M_1 \cdots M_d \log M_d)$. Repeating this calculation for each $t \in \{0, \ldots, n_c - 1\}$ gives the above complexity. Note that this is only a factor of $5 d n_c$ larger than the corresponding binned implementation of LPR with a fixed global bandwidth.

TOY EXAMPLE
The toy example we consider here provides a simple simulation of HSRL measurements in a two-dimensional setting (similar to those shown in Figures 1 and 2). One sample dataset from our simulation is shown in Figure 7. The plot on the left shows the ground truth, with each "hotspot" corresponding to a region of high concentration of a gas or aerosol. The plot on the right is the normalized running sum, going from top to bottom of the plot on the left, with the addition of Gaussian noise (details below). Recall from Section 1 that the observed HSRL measurements correspond to the plot on the right, with the goal being to estimate the partial derivative of these data (along the y-direction) to retrieve the concentrations shown in the plot on the left. The purpose of these simulations is to test the fast RODEO's speed and accuracy as compared to a standard bandwidth selection approach (here we use the R package locfit (Loader 2012), which uses generalized cross-validation), and an approach that uses simple differencing (commonly used in the remote sensing community; see Remark 4.1) across datasets of varying size and feature complexity.
Remark 4.1. The simple differencing approach to aerosol concentration estimation consists of two steps. To start, let $y_{i,j}$ be the raw value, with $i = 1, \ldots, n$ indexing the time stamp (i.e., the spatial location, as discussed in Section 1 and illustrated in Figure 1) and $j = 1, \ldots, m$ the index of the current altitude. The first step is to calculate
$$D_{ij} = \frac{y_{i,j} - y_{i,j-1}}{\text{change in altitude}}, \qquad i = 1, \ldots, n, \quad j = 2, \ldots, m,$$
with $D_{i1} = D_{i2}$. The second step is to perform a LOWESS smooth (Cleveland 1981) on the $D_{ij}$'s across altitudes, that is, for each $i$, smooth $D_{ij}$, $j = 1, \ldots, m$. The output from this second step is the estimate of the aerosol concentration.
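The two-step differencing estimator can be sketched as follows; we substitute a simple moving average for the LOWESS smooth in the second step, and all names are ours.

```python
# A sketch of the two-step differencing estimator: difference the raw
# profiles along altitude, then smooth each profile. A moving average
# stands in for the LOWESS smooth of the paper. Names are ours.
import numpy as np

def simple_difference(y, dz, window=5):
    """y: (n_times, m_altitudes) raw profiles; dz: altitude spacing."""
    D = np.empty_like(y, dtype=float)
    D[:, 1:] = (y[:, 1:] - y[:, :-1]) / dz     # first differences / dz
    D[:, 0] = D[:, 1]                          # D_i1 = D_i2, as in the text
    # step 2: smooth each profile along altitude (moving-average stand-in)
    kern = np.ones(window) / window
    return np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, D)
```

On a noiseless linear ramp in altitude, the interior of the output recovers the constant slope exactly, with only edge effects from the moving average.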
Next, letting $u_1, u_2 \sim \mathrm{Unif}[-7, 7]$, the mean, that is, the location of the hotspot, is sampled as $\mu = (u_1, u_2)$. Define
$$G = \left\{ \left( -8 + \frac{16 i}{n_g - 1},\; -8 + \frac{16 j}{n_g - 1} \right) \right\}_{i,j=0}^{n_g - 1}$$
to be a regularly spaced $n_g \times n_g$ grid, with $n_g$ taken to be 400 or 800. We also considered smaller and larger grid sizes; however, we found that with grid sizes larger than $800 \times 800$, locfit began to run into memory issues, so we do not report those results (the fast RODEO's computational time remains approximately the same for $1200 \times 1200$ grids).
Let $f(x; \mu, \Sigma) = \mathrm{Norm}(x; \mu, \Sigma)$, where $\mathrm{Norm}(x; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance $\Sigma$, evaluated at $x$; let $f_{ij} = f(G_{ij}; \mu, \Sigma)$ be the evaluation of $f$ at the $ij$th point in $G$; and let $F = \{f_{ij}\}_{i,j=1}^{n_g}$ be the associated grid. If $F_k$ is one realization of $F$ for a given $\mu$, $\Sigma$, and $n_g$, and $s_k \sim \mathrm{Unif}[10, 20]$, iid (where $\mathrm{Unif}[a, b]$ denotes the uniform distribution on $[a, b]$), then
$$F = \sum_{k=1}^{n_f} s_k F_k$$
corresponds to one random sample from our simulation, an example of which is shown in the left plot of Figure 7 (i.e., these are the true, unobserved concentrations we are trying to estimate). The parameter $n_f$ is the number of "features," taken to be 20 or 40, allowing for varying levels of complexity, and $s_k$ is the relative magnitude of the sampled feature. An example with $n_f = 3$ is shown in Figure 8 ($s_k$ set to 1).
Note, each F has associated with it an n g , n f pair. As the calculations are effectively the same regardless of the choice of these parameters, we do not include them in the notation.
Once $F$ has been sampled, the normalized running sum is calculated from top to bottom as
$$Y_{ij} = \frac{16}{n_g - 1} \sum_{l=1}^{i} F_{lj},$$
where $16/(n_g - 1)$ is the grid spacing. Putting this all together, the points that make up the observed data in our simulations are $Y^* = Y + E$, where $E = \{\epsilon_{ij}\}_{i,j=1}^{n_g}$, $\epsilon_{ij} \sim \mathrm{Norm}(0, \sigma^2)$, $\sigma = 1/2$ (similar experiments were run using $\sigma = 0.25$ and 1, yielding similar results).
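The generator described above can be sketched compactly as follows. We assume isotropic hotspots with widths drawn from a range of our choosing (the text allows a general covariance $\Sigma$), and all names are ours.

```python
# A compact sketch of the toy-data generator: randomly placed, randomly
# scaled Gaussian hotspots on an n_g x n_g grid over [-8, 8]^2, then a
# scaled running sum down each column plus Gaussian noise. The isotropic
# hotspot widths (sd) are our simplification; other names follow the text.
import numpy as np

def simulate_hsrl(n_g=100, n_f=20, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    ax = np.linspace(-8.0, 8.0, n_g)
    xx, yy = np.meshgrid(ax, ax)
    F = np.zeros((n_g, n_g))
    for _ in range(n_f):                       # one hotspot per draw
        mu = rng.uniform(-7.0, 7.0, size=2)    # hotspot location
        s = rng.uniform(10.0, 20.0)            # relative magnitude s_k
        sd = rng.uniform(0.3, 1.0)             # hotspot width (our choice)
        F += s * np.exp(-((xx - mu[0]) ** 2 + (yy - mu[1]) ** 2) / (2 * sd ** 2))
    Y = (16.0 / (n_g - 1)) * np.cumsum(F, axis=0)   # running sum, top to bottom
    return F, Y + rng.normal(0.0, sigma, size=F.shape)
```

The returned pair corresponds to the left (ground truth $F$) and right (noisy observations $Y^*$) plots of Figure 7.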
To measure algorithm performance, we calculate the mean squared error (MSE) between the actual values $F_{ij}$ and the predicted values $\hat m_M^{(y)}(G_{ij})$, $M \in \{F, S, D\}$, where $F$ is the fast RODEO, $S$ is the plug-in bandwidth approach, and $D$ is simple differencing. The superscript $(y)$ denotes the partial derivative with respect to $y$ (as this is the quantity of interest in HSRL studies). The estimated fits from one of the simulations (the corresponding raw data are shown in the right-hand plot of Figure 7) are shown in Figure 9. The MSE for a single realization from our simulation is then calculated as
$$\mathrm{MSE}_M = \frac{1}{n_g^2} \sum_{i,j=1}^{n_g} \left( F_{ij} - \hat m_M^{(y)}(G_{ij}) \right)^2,$$
where $M$ denotes the method used. For a given number of features $n_f$ and grid size $n_g$, we generate 10 examples, and for each example we randomly sample the Gaussian error matrix $E$ 10 times. Thus, for a given $n_f$, $n_g$ pair, we have 100 simulated samples. Overall performance is then calculated as
$$\mathrm{MSE}_{\mathrm{total}} = \frac{1}{100} \sum_{k=1}^{100} \mathrm{MSE}_k,$$
where the $\mathrm{MSE}_k$'s correspond to the MSEs for a given $n_f$, $n_g$, and $Y^*$. The standard errors across the $\mathrm{MSE}_k$'s were also calculated. For algorithm speed, we computed the average time and standard deviation, in seconds, for each $n_f$ and $n_g$. The results of the simulations are shown in Figures 10 and 11. In each figure, the points correspond to the average MSE across the simulations and the whiskers to two standard deviations.

Figure 9. The results from a fit to the data shown in the left-hand plot of Figure 7 using the fast RODEO, standard LPR, and simple differencing.
Note that in practice, to obtain better estimates with simple differencing, it is common to take a running average of the HSRL measurements over 1-min time intervals (corresponding to 120 measurements) and perform the differencing on the average. The purpose of this is to average out noise that may be present in a single measurement. For this toy example, we found that the averaging windows minimizing MSE were 10 and 20 measurements for $n_g$ equal to 400 and 800, respectively. Of course, in practice ground truth would (most likely) not be available, so this represents approximately the best possible performance for this method.
In each case, we see that the fast RODEO performs better, in terms of MSE, than standard LPR, and both generally outperform simple differencing. Additionally, the standard error of the MSEs is much smaller for the fast RODEO than for the other two approaches, showing its consistent performance across simulations. In terms of speed, as we go from $n_g = 400$ to $n_g = 800$, the fast RODEO has over a fivefold speed-up over standard LPR (with simple differencing outperforming both). It is worth noting that the fast RODEO code is almost exclusively implemented in R, whereas the standard LPR code is primarily written in C. As mentioned earlier, memory issues arose with matrices over $800 \times 800$ for standard LPR; however, for the successful runs we did have at $1200 \times 1200$, the rapid increase in compute time continued for standard LPR while the fast RODEO's remained about the same.

APPLICATION TO THE CALIPSO HSRL MEASUREMENTS
Next, we apply the fast RODEO and simple differencing to the real HSRL measurements collected in the CALIPSO study discussed in Section 1. We did not include results for standard LPR due to memory issues. It is important to emphasize that an absolute ground truth is not available for these results, so the discussion here is largely qualitative. The results using the fast RODEO and differencing are shown in Figures 12 and 13. In terms of overall trend, both approaches yield similar predictions of aerosol concentration. The major difference is that the estimates from the fast RODEO suggest that many small-scale features may be present, while the differencing approach tends to smooth these out. As mentioned above, while we have no way to absolutely confirm the presence of these regions, from the simulation results in Figure 9 we see that the differencing approach (middle plot) does have a tendency to smooth out finer features actually present in the image (compare to the ground truth in the left-hand plot of Figure 7).
The potential "loss" of features is illustrated in Figure 14; here we have zoomed in on the region falling between 500 and 1200 m and 5.6 and 5.9 h for both the fast RODEO (top) and simple differencing (bottom). In the fast RODEO, we can see a distinct series of hot spots, while in the simple differencing approach these features are largely smoothed out. Here, the advantage of using the fast RODEO is that if these hot spots are real, they will be completely missed by simple differencing.
Of course, the number of time points being averaged over in the differencing approach may be causing some of this feature loss. The challenge is that if we reduce the number of points being averaged over, then we risk increasing the variability of the estimates in regions where no discernible features are present. The effect is that rather than drawing the scientists' eye to regions more likely to contain signal of interest, they now have many (possibly) irrelevant regions to look through.

SUPPLEMENTARY MATERIALS
The supplementary material contains details on the selection of the variance parameter σ as well as a scale-space-based interpretation of the fast RODEO algorithm.

[Received January 2013. Revised June 2014.]