Bias assessment in local regression

Abstract Local polynomial regression is a convenient method for smoothing scatterplots, with readily available software. However, it is well known that variable amounts of bias are induced by the smoothing operation. This article proposes a simple visualization tool, based on approximate confidence intervals, which can alert the data analyst to regions of the regression function domain that might be susceptible to unacceptably large levels of bias, and possibly indicate a need for a less automatic smoothing approach. A simulation study verifies the accuracy of the confidence bands, and the method is illustrated with several real datasets.


The need for visualizing potential bias in a smoother
The problem of smoothing a scatterplot has been considered by numerous authors, going back to the pioneering work of Nadaraya (1964) and Watson (1964), which introduced kernel methods for regression. The books by Wand and Jones (1995) and Fan and Gijbels (1996) provide excellent introductions to the methodology. Research continues in the area. For example, Li, Lu, and Ullah (2003) studied the use of local polynomial regression (LPR) for estimating derivatives. Other research has focused on the performance of LPR and how to achieve bias reduction, e.g., Choi and Hall (1998), Choi, Hall, and Rousson (2000), Cheng et al. (2018), He, Jiang, and Wang (2019) and He and Huang (2009), and on efficient computation, e.g., Samarov (2015). Confidence bands have also been proposed for the resulting curve estimates, such as those suggested by Hall and Horowitz (2013). Numerous packages exist for computing LPR in modern software: in R (R Core Team 2019), the locfit (Loader 2013) and KernSmooth (Wand 2020) packages, as well as the loess function in the stats package, are capable of fitting LPR.
The main challenge in LPR is deciding how much smoothing to do. The mean squared error (MSE), whose two components are bias and variance, is usually employed to assess LPR performance. When a scatterplot is under-smoothed, the bias is small and the variance is large; when it is over-smoothed, the bias is large and the variance is small. This is the well-known bias-variance tradeoff, and minimizing MSE corresponds to balancing the two. As far as we can tell, however, no software tool currently exists that can indicate a potential bias problem in a given LPR application.
In the present article, we propose a new visualization tool which can accompany an LPR plot to signal how much bias there may be in the fitted curve. The organization of the article is as follows. After a brief review of LPR, we develop the methodology for pointwise confidence intervals for the bias of LPR estimates in Sec. 2. In that section, we also indicate how different kinds of bias-reduced local regression estimators could be used in the estimation of bias.
In Sec. 3, we use simulations to demonstrate the use and accuracy of our new tool; some examples illustrate the application of this proposed method. In Sec. 4, we consider two real datasets and illustrate the use of the new visualization tool for diagnostic purposes. The last section wraps up our paper with a brief discussion.

Pointwise confidence intervals for bias
To motivate the assessment plot technique, and to make the present work self-contained, it is useful to outline the main results of the LPR methodology. Suppose $X$ and $Y$ are two random variables, from which a sample of independent data points is collected, denoted $(x_i, y_i)$, $i = 1, 2, \ldots, n$. We define the vectors $x$ and $y$ to hold $x_i$ and $y_i$ as their respective $i$th elements. The model we consider is
$$y_i = g(x_i) + \varepsilon_i,$$
where the noise elements $\varepsilon_i$ are independent with mean 0 and variance $\sigma^2$, and $g(x)$ is assumed to be smooth enough for a local polynomial estimator to be consistent. Typically, this entails two orders of differentiability beyond the degree of the local polynomial used in the regression estimator.
The $k$th degree local polynomial estimator is (Fan and Gijbels 1996)
$$\hat{g}(x) = e_1^\top (X_x^\top K_x X_x)^{-1} X_x^\top K_x\, y,$$
where
$$X_x = \begin{bmatrix} 1 & x_1 - x & \cdots & (x_1 - x)^k \\ \vdots & \vdots & & \vdots \\ 1 & x_n - x & \cdots & (x_n - x)^k \end{bmatrix},$$
and $e_1^\top = [1\ 0\ \cdots\ 0]$ is a $(k+1)$-vector. $K_x$ is a diagonal matrix with $i$th diagonal element $K_h(x_i - x)$. The kernel function $K_h(x)$ is often a symmetric probability density function with scale parameter $h$; in accord with convention, we will refer to $h$ as the bandwidth. The Gaussian kernel is a reasonable choice, though other functions have slightly better theoretical properties; the choice of kernel function does not have as large an effect on the quality of the resulting estimate as do the bandwidth $h$ and the degree of the local polynomial. As usual, we require a bandwidth consonant with the requirement $h = O(n^{-1/(2k+3)})$ when $k$ is odd, and $h = O(n^{-1/(2k+5)})$ when $k$ is even. This minimizes the asymptotic MSE at points in the interior of the interval on which the estimate of $g(x)$ is being computed.
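As a concrete illustration of the estimator above, the following Python/NumPy sketch (our own minimal implementation for exposition, not the software discussed later; the Gaussian kernel and the helper name `lp_row` are our choices) computes the smoothing weight vector $e_1^\top (X_x^\top K_x X_x)^{-1} X_x^\top K_x$ at a single point and applies it to data:

```python
import numpy as np

def lp_row(x0, x, h, k):
    """Weights c such that c @ y is the degree-k local polynomial
    estimate of g(x0) with Gaussian kernel bandwidth h; in the
    notation of the text, c = e1' (X'KX)^{-1} X'K."""
    X = np.vander(x - x0, N=k + 1, increasing=True)  # ith row: [1, (xi-x0), ..., (xi-x0)^k]
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # Gaussian kernel weights K_h(xi - x0)
    XtK = X.T * w                                    # X' K
    return np.linalg.solve(XtK @ X, XtK)[0]          # first row of (X'KX)^{-1} X'K

# Example: local linear estimate of a sine curve at x = 0.5
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100)
g_hat = lp_row(0.5, x, h=0.1, k=1) @ y
```

Since a degree-$k$ local polynomial fit reproduces polynomials up to degree $k$ exactly, the weights returned by `lp_row` sum to one and recover a linear trend without error, which is a convenient sanity check.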

Confidence interval for bias and the assessment plot
We first define the matrix
$$C_{x,h,k} = (X_x^\top K_x X_x)^{-1} X_x^\top K_x,$$
which makes the dependence of the smoothing weights on the bandwidth $h$ and degree $k$ explicit. The estimator of $g(x)$ at a point $x$ at which we wish to assess bias is then
$$\hat{g}_1(x) = e_1^\top C_{x,h_1,k_1}\, y, \tag{1}$$
where we are using local polynomial degree $k_1$ with an estimated MSE-optimal bandwidth $h_1$. The bias of the estimator $\hat{g}_1(x)$ is
$$B(x) = E[\hat{g}_1(x)] - g(x) = e_1^\top C_{x,h_1,k_1}\, g - g(x), \tag{2}$$
where $g = E[y] = [g(x_1)\ \cdots\ g(x_n)]^\top$. In order to set up an estimator for this quantity, we first specify another bandwidth $h_2$ and another local polynomial degree $k_2$, and define another estimator of $g(x)$:
$$\hat{g}_2(x) = e_1^\top C_{x,h_2,k_2}\, y. \tag{3}$$
The estimator $\hat{g}_2(x)$ should be designed to be less biased than $\hat{g}_1(x)$, which will occur if $h_2 \le h_1$ and $k_2 \ge k_1$, with at least one of these inequalities holding strictly. Other bias-reduced estimators for $g(x)$ could be used, and the approach for two of the possibilities is sketched in the next subsection. We propose to estimate the bias $B(x)$ by replacing $g$ everywhere it appears in Eq. (2) with $\hat{g}_2$:
$$\hat{B}(x) = e_1^\top C_{x,h_1,k_1} H_{h_2,k_2}\, y - \hat{g}_2(x) = e_1^\top A_x\, y, \tag{4}$$
where $H_{h,k}$ is defined as the $n \times n$ matrix whose $i$th row is $e_1^\top C_{x_i,h,k}$, and
$$A_x = C_{x,h_1,k_1} H_{h_2,k_2} - C_{x,h_2,k_2}.$$
We require
$$h_2 \le h_1 \quad \text{and} \quad k_2 \ge k_1, \text{ with at least one inequality strict.} \tag{5}$$
Under these conditions, consistency of $\hat{B}(x)$ is assured, via the asymptotic variance results of Ruppert and Wand (1994), as long as $nh_2 \to \infty$. The leading term in the variance of $\hat{B}(x)$ can be shown to be $O_p((nh_2)^{-1})$, and the leading term in the bias of $\hat{B}(x)$ is $O_p(h_2^{k_2+1})$ if $k_2$ is odd, and $O_p(h_2^{k_2+2})$ if $k_2$ is even. The conditions of Eq. (5) ensure that, for large enough $n$, the bias in $\hat{B}(x)$ will be small relative to the standard deviation of $\hat{B}(x)$. The variance of the bias estimator can be written as
$$\mathrm{Var}(\hat{B}(x)) = \sigma^2\, e_1^\top A_x A_x^\top e_1.$$
A pointwise confidence interval for the true bias $B(x)$ can then be constructed. On the basis of the theoretical results of Martins-Filho and Yao (2009) and Masry and Fan (1997), as well as empirical evidence from numerical experiments that we have conducted, we conjecture that a central limit theorem holds for $\hat{B}(x)$. With a large sample, the approximate $100(1-\alpha)\%$ confidence interval would be
$$\hat{B}(x) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{B}(x))},$$
where $z_a$ satisfies $P(Z > z_a) = a$ for standard normal $Z$.
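The bias estimator and its pointwise interval are straightforward to compute once the smoothing weight vectors are available. The sketch below (again our own Python/NumPy illustration under the Gaussian kernel, not the authors' code; `lp_row` and `bias_ci` are hypothetical helper names) builds the hat matrix for the less-biased fit row by row and evaluates the estimated bias and its interval at one point:

```python
import numpy as np

def lp_row(x0, x, h, k):
    # e1' C_{x0,h,k}: degree-k local polynomial weights at x0 (Gaussian kernel)
    X = np.vander(x - x0, N=k + 1, increasing=True)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    XtK = X.T * w
    return np.linalg.solve(XtK @ X, XtK)[0]

def bias_ci(x, y, x0, h1, k1, h2, k2, sigma2, z=1.96):
    """B_hat(x0) = e1'(C1 H2 - C2) y, with Var = sigma^2 e1' A A' e1."""
    H2 = np.array([lp_row(xi, x, h2, k2) for xi in x])      # ith row: e1' C_{xi,h2,k2}
    a = lp_row(x0, x, h1, k1) @ H2 - lp_row(x0, x, h2, k2)  # e1' A_x
    B_hat = a @ y
    se = np.sqrt(sigma2 * (a @ a))
    return B_hat, (B_hat - z * se, B_hat + z * se)

# Oversmoothed local linear fit (h1 = 0.15) of a sine curve: near the
# peak at x = 0.25 the smoothing bias is substantial and negative.
x = np.linspace(0.0, 1.0, 101)
y = np.sin(2 * np.pi * x)  # noiseless, so B_hat isolates the smoothing bias
B_hat, ci = bias_ci(x, y, x0=0.25, h1=0.15, k1=1, h2=0.05, k2=2, sigma2=0.05**2)
```

In this configuration the estimated bias at the peak is clearly negative, reflecting the flattening induced by the large first-stage bandwidth.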
To evaluate the confidence interval and the variance of the bias, an estimate of $\sigma^2$ is needed. A number of candidates are available in the literature, but a simple, approximately unbiased estimator, very similar to one of Rice (1984), which seems to work well for moderate to large samples, is based on perturbing the (sorted) covariate observations slightly. Specifically, the sorted covariate values are perturbed in consecutive pairs so that $x^*_{2j-1} = x^*_{2j}$ for $j = 1, 2, \ldots, n/2$, ignoring $x_n$ if $n$ is odd. The perturbed dataset $\{x^*, y\}$ now has replicates, so that an estimate of the pure error sum of squares (SSPE) can be obtained.
Dividing SSPE by $n/2$ gives an unbiased estimate of $\sigma^2$ when the amount of perturbation required is 0; otherwise, the estimator can be biased. Dette, Munk, and Wagner (1998) argued that there are better ways to estimate $\sigma^2$, but this estimator has the benefit of simplicity and seems to work sufficiently well in the simulation studies we discuss in the next section. Fan and Gijbels (1995a) suggested a method for choosing the order of the local polynomial using the asymptotic characteristics of the bias and variance of the regression estimator; in that study, the authors used a fixed bandwidth. On the other hand, Fan and Gijbels (1995b) suggested a relatively refined data-driven bandwidth selection procedure, based on an approximation of the bias and variance of the estimator. This procedure is capable of choosing a constant or variable bandwidth in a pilot estimation procedure. The approximation of the bias for a $p$th order local polynomial estimator uses a $(p+a)$th order local polynomial fit in the pilot estimator, where $a$ is a positive integer. Fan and Gijbels (1995b) used $a = 2$ in their work to demonstrate that this "two stage" method outperforms the usual "one stage" method.
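A sketch of this pseudo-replicate variance estimator, under our reading of the pairing scheme (consecutive sorted observations treated as replicates; the function name `sigma2_hat` is our own), is:

```python
import numpy as np

def sigma2_hat(x, y):
    """Estimate the noise variance by treating consecutive observations
    (sorted by x) as pseudo-replicates: SSPE divided by the n/2 pairs."""
    ys = np.asarray(y)[np.argsort(x)]
    m = (len(ys) // 2) * 2              # ignore the last point if n is odd
    d = ys[1:m:2] - ys[0:m:2]           # within-pair response differences
    sspe = np.sum(d ** 2) / 2.0         # pure error sum of squares
    return sspe / (m // 2)

rng = np.random.default_rng(7)
x = rng.uniform(size=2000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=2000)
s2 = sigma2_hat(x, y)                   # close to the true value 0.3^2 = 0.09
```

When the paired covariate values are far apart, the within-pair differences pick up signal as well as noise, which is the source of the bias noted above.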
Our proposal is to plot the pointwise confidence bands for the bias to assist in determining whether bias reduction is necessary, or in some cases, whether it is possible without increasing the variability in the regression function estimator to an unacceptable level. In practice, one might apply our visualization tool using a fixed bandwidth and make comparisons under various scenarios, or use a data-driven bandwidth and two different polynomial degrees. The tool is available in the MPV package of Braun and MacQueen (2021) for use in R (R Core Team 2019) as the function BiasVarPlot().
An example of the bias assessment plot is shown in Figure 1. The data were simulated using Scenario 1 from our simulation study (see the next section); the target function is displayed in the upper panel, and the pointwise 95% confidence bands for the bias are displayed in the lower panel. Since we know the true target function, we can calculate the true bias; this is plotted in the lower panel as well. We see that it lies within the confidence bands, and that the confidence bands indicate a substantial amount of bias near $x = 0$.

Alternative bias-reduced estimators of the regression function
In this subsection, we show how alternative forms of the visualization tool described earlier can be constructed using other bias-reduced estimators for g(x). We consider one of the data sharpening estimators by Choi, Hall, and Rousson (2000) as well as the more recent estimator proposed by Cheng et al. (2018).

Data sharpening
The data sharpened regression estimator we consider is constructed by first applying LPR of degree $k_1$ with bandwidth $h_1$ to obtain a vector of fitted values $\hat{g}_1$. In the notation of the previous section, $\hat{g}_1 = H_{h_1,k_1}\, y$, and the residuals are $y - H_{h_1,k_1}\, y$. The sharpened responses are obtained by adding the residuals to the original responses:
$$y^* = y + (y - H_{h_1,k_1}\, y) = (2I - H_{h_1,k_1})\, y.$$
The bias-reduced regression function estimator for this form of data sharpening is given by
$$\hat{g}^*_2(x) = e_1^\top C_{x,h_1,k_1}(2I - H_{h_1,k_1})\, y.$$
Defining $H^*_{h,k}$ as the $n \times n$ matrix whose $i$th row is $e_1^\top C_{x_i,h,k}(2I - H_{h,k})$, and defining
$$A^*_x = C_{x,h_1,k_1} H^*_{h_1,k_1} - C_{x,h_1,k_1}(2I - H_{h_1,k_1}),$$
we can write the resulting estimator of the bias in $\hat{g}_1(x)$ as $\hat{B}^*(x) = e_1^\top A^*_x\, y$. The variance of this bias estimator is again easily obtained: $\mathrm{Var}(\hat{B}^*(x)) = \sigma^2 e_1^\top A^*_x A^{*\top}_x e_1$, and the arguments leading to the construction of the pointwise confidence intervals for $B(x)$ are as before, since the sharpened estimator $\hat{g}^*_2(x)$ is known to have less bias than $\hat{g}_1(x)$, whereas the variance of $\hat{g}^*_2(x)$ retains the same asymptotic order as the variance of $\hat{g}_1(x)$.
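In the same numerical setting as before, the sharpening step amounts to one extra matrix multiplication. The Python/NumPy sketch below (our own illustration, not the estimator's published implementation; noiseless data are used so that the remaining error is pure smoothing bias) contrasts the original fit $H_{h_1,k_1}\, y$ with the fit to the sharpened responses $(2I - H_{h_1,k_1})\, y$:

```python
import numpy as np

def lp_row(x0, x, h, k):
    # e1' C_{x0,h,k}: degree-k local polynomial weights at x0 (Gaussian kernel)
    X = np.vander(x - x0, N=k + 1, increasing=True)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    XtK = X.T * w
    return np.linalg.solve(XtK @ X, XtK)[0]

x = np.linspace(0.0, 1.0, 101)
y = np.sin(2 * np.pi * x)                           # noiseless test curve
H = np.array([lp_row(xi, x, 0.15, 1) for xi in x])  # hat matrix H_{h1,k1}
y_sharp = 2.0 * y - H @ y                           # sharpened responses (2I - H) y
g1 = H @ y                                          # original local linear fit
g2 = H @ y_sharp                                    # fit to the sharpened responses
# At the peak x = 0.25 (index 25), sharpening recovers much of the
# amplitude lost to oversmoothing.
```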

Cheng et al.'s regression approach
We consider the first of the (local linear) regression estimators considered by Cheng et al. (2018), referred to as $\hat{m}_B(x)$ in that paper. We will refer to this estimator as $\tilde{g}_2(x)$, and we express it as
$$\tilde{g}_2(x) = \sum_{j=1}^{B} a_j\, e_1^\top C_{x,g_j,1}\, y,$$
where $g_1, g_2, \ldots, g_B$ is a sequence of bandwidths and $a_1, \ldots, a_B$ are the combination weights specified by Cheng et al. (2018). Guidance from Cheng et al. (2018) suggests that the value of $B$ need not be large, say 10, and the $g_j$'s could be chosen as equally spaced in the range $h_1 \pm h_1/2$, where $h_1$ is a bandwidth that has been selected by rule-of-thumb or direct plug-in methods, for example. Defining $\tilde{H}_B$ as the $n \times n$ matrix whose $i$th row is $e_1^\top \sum_{j=1}^{B} a_j C_{x_i,g_j,1}$ allows us to carry out the same steps in constructing a confidence interval for the true bias as before, defining
$$\tilde{A}_x = C_{x,h_1,k_1} \tilde{H}_B - \sum_{j=1}^{B} a_j C_{x,g_j,1},$$
and writing the resulting bias estimator as $\tilde{B}(x) = e_1^\top \tilde{A}_x\, y$, which now has variance $\mathrm{Var}(\tilde{B}(x)) = \sigma^2 e_1^\top \tilde{A}_x \tilde{A}_x^\top e_1$.

Bias-corrected confidence intervals for the regression function
One of the reviewers suggested a consideration of pointwise confidence intervals for the regression function itself, using a bias-corrected estimator of the form $\hat{g}_{BC}(x) = \hat{g}_1(x) - \hat{B}(x)$. Using Eqs. (1), (3), and (4), we can rewrite this estimator as
$$\hat{g}_{BC}(x) = e_1^\top (C_{x,h_1,k_1} - C_{x,h_1,k_1} H_{h_2,k_2} + C_{x,h_2,k_2})\, y.$$
Pointwise confidence intervals for $g(x)$ can then be obtained straightforwardly. We would expect the coverage behavior of such intervals to be superior to that of the uncorrected form based on $\hat{g}_1(x)$, which possesses larger bias and has variance $\sigma^2 e_1^\top C_{x,h_1,k_1} C_{x,h_1,k_1}^\top e_1$. Pointwise 95% confidence intervals were computed for the simulated data example of Figure 1 and plotted together with the bias-corrected local linear regression function in Figure 2. The bands are somewhat wider, but they also tend to be more accurate in this particular case. We will see in the next section that this is not an accident; coverage accuracy is improved at the expense of a slight increase in interval width.
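The bias-corrected estimator is a single linear functional of the data, so it can be evaluated directly from the same building blocks. A minimal Python/NumPy sketch (again our own illustration on noiseless data; `g_bc` is a hypothetical helper name):

```python
import numpy as np

def lp_row(x0, x, h, k):
    # e1' C_{x0,h,k}: degree-k local polynomial weights at x0 (Gaussian kernel)
    X = np.vander(x - x0, N=k + 1, increasing=True)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    XtK = X.T * w
    return np.linalg.solve(XtK @ X, XtK)[0]

def g_bc(x, y, x0, h1, k1, h2, k2):
    """g_BC(x0) = e1'(C1 - C1 H2 + C2) y, i.e. g1_hat(x0) - B_hat(x0)."""
    H2 = np.array([lp_row(xi, x, h2, k2) for xi in x])
    c1 = lp_row(x0, x, h1, k1)
    c2 = lp_row(x0, x, h2, k2)
    return (c1 - c1 @ H2 + c2) @ y

x = np.linspace(0.0, 1.0, 101)
y = np.sin(2 * np.pi * x)                   # noiseless test curve
uncorrected = lp_row(0.25, x, 0.15, 1) @ y  # flattens the peak (true value 1)
corrected = g_bc(x, y, x0=0.25, h1=0.15, k1=1, h2=0.05, k2=2)
```

At the peak, the correction removes most of the attenuation produced by the oversmoothed first-stage fit, consistent with the coverage improvement reported below.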
We note that regression function confidence intervals can be constructed similarly, using other bias estimators in place of $\hat{B}(x)$, such as the data sharpened version $\hat{B}^*(x)$ or the Cheng et al. (2018) version $\tilde{B}(x)$.

Simulation study
In this section, we investigate the finite-sample behavior of the proposed local polynomial bias estimator $\hat{B}(x)$ with pointwise confidence intervals for the bias. Our simulation study design is summarized in Table 1. Target functions from He and Huang (2009) are employed since they represent different noise-to-signal ratio magnitudes. The higher the noise-to-signal ratio, the more challenging it is for the estimator in terms of choosing the optimal local polynomial degree and bandwidth.
There are several factors affecting the behavior of the confidence interval estimates, including: the bandwidth $h_1$, the degree of the local polynomial $k_1$, and the kernel for $\hat{g}_1(x)$; the corresponding quantities for $\hat{g}_2(x)$; the nominal coverage level; and the sample size. Details regarding the simulation study design are given in Table 2, and the steps taken are outlined in the Supplementary Materials for this article.
We plotted the pointwise coverages for each of the values of x in the domain of interest. The complete set of results is displayed in the Supplementary Material document. Figure 3 is included here for illustrative purposes, as a typical example of what the other cases look like. We also computed pointwise coverages for the uncorrected and bias-corrected 95% confidence intervals described in Sec. 2.3, and plotted these as functions of the predictor variable x.
For the 95% confidence intervals for bias, we observe that in almost all of the simulation scenarios, the actual coverage randomly fluctuates about the nominal coverage, though in a few cases, it might be somewhat conservative in that actual coverage slightly exceeds the nominal level. The same thing can be said of the 95% confidence intervals based on the bias-corrected regression function estimator. On the other hand, the coverage for uncorrected regression function confidence intervals is often substantially below the nominal value, and sometimes less than 60%. Thus, the bias-correction goes a long way to improve coverage accuracy.

Illustrative examples
In this section, we apply the proposed method to some real data. In each example, an LPR is fit and the diagnostic tool is used to visualize the possible extent of the bias in the fit.

Application to kiwi resistance data
The first example is the kiwi resistance dataset (Harker and Maindonald 1994). This dataset contains 128 measurements of electrical resistance (in ohms) for varying concentrations of kiwi juice. We should expect electrical resistance to decrease monotonically with juice concentration.
The left and middle panels of Figure 4 display the plots of the local polynomial curve estimates along with nominal 95% pointwise confidence intervals for both the curve (top) and the bias (bottom).
In each panel, different polynomial degrees $k_1$, $k_2$ and bandwidths $h_1$, $h_2$ are used. The first two panels in the top row show pointwise 95% confidence bands for local linear regression with respective bandwidths $h_1 = 4.8$ and 2.4. The value $h_1 = 4.8$ was chosen because it is the smallest value that gives a monotonically decreasing local linear fit over the inner two-thirds of the range of the data.

Table 1. Target functions, $\sigma$ and noise-to-signal ratios considered in the simulation study.
We observe first that none of the regression function estimates is monotonically decreasing, though the first one increases only in the regions near the boundaries. The other two estimates have spurious bumps in the interior of the range. The confidence intervals for the bias are in agreement that there is a positive amount of bias in the vicinity of 30% juice concentration. There is also an indication (strong, in the first plot) of bias at the left boundary; some of this would be attributable to a known boundary bias effect. There is also some indication of bias at the right boundary, as well as in the region just below 20% concentration.
The bias-corrected confidence intervals corresponding to the three cases plotted in Figure 4 are plotted in the corresponding panels of Figure 5. In each panel, the solid black curve represents the bias-corrected regression estimate, and the gray curve is the original uncorrected estimate. The dashed curves represent the pointwise 95% confidence bands.
Each row represents a scenario. What can be observed in Figure 5 is that the confidence bands are somewhat wider than those for the uncorrected estimates, and the corrected pointwise estimates are clearly nonmonotonic over the entire range. However, the confidence bands do not contradict the hypothesis of a monotonically decreasing regression function other than for the extreme left end of the range.

Application to seismic timing data
Our next example concerns seismic timing measurements recorded by geophones placed along a transect in central Alberta (Maindonald and Braun 2020). There are two variables: "distance" gives the location along the transect, and "thickness" represents the measured thickness of a subterranean geological stratum.
We consider the use of the direct plug-in bandwidth for local linear regression in this case. This bandwidth is $h_1 = 8.6$. The plot of the regression function estimate with confidence bands is shown in the left panel of Figure 6, and the bias confidence bands are shown in the right panel. The latter plot shows clear evidence of bias through the latter half of the range of the data.

Figure 4. Kiwi resistance data. Top panels: local polynomial estimates with $k_1 = 1$ and $h_1 = 4.8$; $k_1 = 1$ and $h_1 = 2.4$; and $k_1 = 2$ and $h_1 = 4.8$, respectively. Bottom panels: pointwise confidence interval bands for bias with $k_2 = 1$, $k_2 = 2$ and $k_2 = 2$, respectively, and $h_2 = 2.4$ (all panels).

The 95% pointwise confidence bands using a bias-corrected regression function, shown in Figure 7, are almost as narrow as those using the uncorrected regression function, but there is a clear difference in performance over the right half of the data range.

Concluding remarks
We have proposed a new support tool for use when fitting LPRs to scatterplot data: an estimator of bias based on the difference between a less biased pilot estimator and the expected value of the local polynomial estimator applied to the pilot estimator. By calculating the exact variance of the bias estimator, we constructed pointwise confidence bands for the bias as a function of the covariate. Simulations at a variety of parameter settings for three different target functions provide evidence for the accuracy of the tool, in that the actual confidence interval coverage appears to match the nominal coverage very well. Confidence interval coverage for bands created from the bias-corrected estimators also matches nominal coverage very well.
In the two real data examples, the usefulness of our assessment plots is clear: they can help us to visualize the possible magnitude of the bias in particular local polynomial estimates. The tool also demonstrates the well-known bias-variance tradeoff, and it gives clear indications of where the LPR could fail in the sense that the bias would be unacceptably large: that is, in regions of high curvature.
We remark further that our results contrast with what would be expected if a bootstrap approach were taken; Hall and Horowitz (2013) point out that the bootstrap estimate of the bias is inconsistent in this context, the inconsistency being due to random bias in the bootstrap estimator. Hall and Horowitz (2013) also remind us that there are challenges in making good bandwidth choices. We agree, and suggest that our tool be used as in our examples, with consideration of several polynomial degree and bandwidth scenarios. In cases, such as the kiwi resistance data, where the bias cannot be adequately reduced, an adaptive approach would be recommended.

Figure 6. Seismic timing data. Left panel: local polynomial estimate with $k_1 = 1$ and $h_1 = 8.6$. Right panel: pointwise confidence interval plot for bias using $k_2 = 2$ and bandwidth 8.6.

Figure 7. Seismic timing data. Pointwise 95% confidence bands for the regression function using a bias-corrected local linear regression estimate. The gray curve is the original uncorrected curve.