Selecting Local Models in Multiple Regression by Maximizing Power

This paper considers multiple regression procedures for analyzing the relationship between a response variable and a vector of covariates in a nonparametric setting where both tuning parameters and the number of covariates need to be selected. We introduce an approach which handles the dilemma that with high dimensional data the sparsity of data in regions of the sample space makes estimation of nonparametric curves and surfaces virtually impossible. This is accomplished by abandoning the goal of trying to estimate true underlying curves and instead estimating measures of dependence that can determine important relationships between variables.


Introduction
In this paper, we analyze the relationship between a response variable Y and a vector of covariates X = (X 1 , X 2 , . . . , X d ) T by using "signal to noise" descriptive measures of dependencies between Y and X. This approach addresses a dilemma of nonparametric statistical analysis (the curse of dimensionality), that with high dimensional data, sparsity of data in large regions of the sample space makes estimation of nonparametric curves and surfaces virtually impossible. On the other hand, if the goal is to find regions of strong dependence between variables, parametric methods may provide evidence for such relationships. This paper develops hybrid methods between parametric and nonparametric procedures that are designed to lift the curse of dimensionality by uniting the goal of analyzing dependencies between Y and the X's with the goal of finding a good model.
These methods are based on measures of dependencies between variables rather than on the estimation of some "true" curve or surface. Such measures can, for instance, be based on the currently available procedures that are defined in terms of tuning parameters.
However, instead of asking what value of the tuning parameter will bring us closest to the "truth", we ask what value of the tuning parameter will give us the best chance of finding a relationship between the variables, if it exists. For instance, consider the case where the tuning parameter is the window size h = (h 1 , . . . , h d ) T of local regions in IR d within which we do local parametric fits of Y to X. Our approach consists of finding the "best" window size h by maximizing a local signal to noise ratio where the signal is an estimate of a measure of dependence and the noise is the estimated standard error (SE) of that estimate. By dividing by the noise we lift the curse of dimensionality because the noise will be very large if the tuning parameter is such that the local region in the sample space is nearly empty.
Here we focus on the problem of exploring the relationship between the response Y and a covariate of interest X 1 while controlling for covariates X 2 , X 3 , . . . , X d . Applied research abounds with situations where one is interested in the relationship between two variables while controlling for many other factors. The usual approach to dealing with the curse of dimensionality is to assume either a simple parametric relationship (usually linear) or a nonparametric but additive relationship between the covariates and response. Our approach allows for nonparametric modeling of the interaction effects and hence can find more subtle patterns in the data.
Our analysis methods begin with a reduction step, a procedure to analyze the multivariate data in a family of subsets of the covariate space, subsets that vary in both dimension and size within each dimension. This step is motivated by the question "Over which subsets of the covariates do these data provide evidence of a significant relationship between the response and the covariate of interest?" Once these subsets (called features) are extracted from the data, the analysis can go in different directions depending on the objective. Inspection of the features is itself a useful exploratory data analysis tool. Scale space analysis is possible by varying the minimum feature size.
The idea of using the signal to noise to choose a good procedure is motivated by the result (Pitman (1948), Pitman (1979), Serfling (1980), Lehmann (1999)) that the asymptotic power of asymptotically normal test statistics T n for testing H 0 : θ = θ 0 versus H 1 : θ > θ 0 when θ is contiguous to θ 0 can be compared by considering their efficacies, defined as or the pre-limit version See Doksum and Schafer (2006), who considered the single covariate case and selected subsets using estimates of the nonparametric efficacy where H 0 is a specified independence property, P is the probability distribution of (X, Y ), and SD(T n ) is either SD P (T n ) or SD H 0 (T n ). Let SE(T n ) denote a consistent estimator for SD(T n ) in the sense that SE(T n )/SD(T n ) P −→ 1. We refer to estimates of EFF(T n ) as a signal to noise ratio or t-statistic. If T n P −→ T (P ) for some functional T (P ) ≡ θ, then EFF resembles the familiar Wald test statistic By Slutsky's Theorem, EFF L −→ N(0, 1) in general, as is the case of t W for parametric models. We could refer to " EFF L −→ N(0, 1)" as the Wald-Slutsky phenomenon. In some frameworks, if we consider ( EFF) 2 , this is related to the Wilks phenomenon (Fan, Zhang, and Zhang (2001)). By not squaring EFF we have a much simpler bias problem that makes it possible to address new optimality questions.
Most work in the nonparametric testing literature starts with a class of tests depending on a bandwidth h (or other tuning parameter) which tends to zero and then finds the fastest possible rate that an alternative can converge to the null hypothesis and still be detectable by one of the tests in the given class. These rates and alternatives depend on h and the sample size n. See, for instance, the development in Section 3.2 of Fan, Zhang, and Zhang (2001).
We consider alternatives not depending on h and ask what h maximizes the probability of detecting the alternative. We find that this h does not tend to 0 unless the volume of the set on which E(Y |X) is nonconstant tends to zero as n → ∞.
This paper is organized as follows. Section 2 defines and motivates our signal to noise criterion and Section 3 develops asymptotic properties. Section 4 describes how we incorporate variable selection into the procedure for bandwidth selection. Section 5 explains how critical values can be approximated via simulating the distribution of the criterion under random permutations of the data. Section 6 shows results from Monte Carlo simulations and analysis of real data.

Local Efficacy in Nonparametric Regression
We consider (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X n , Y n ) i.i.d. as (X, Y ) ∼ P , X ∈ IR d , Y ∈ IR, and write Y = µ(X) + ǫ where µ(X) ≡ E(Y |X) is assumed to exist and where ǫ = Y − µ(X). We will focus on finding dependencies between Y and the covariate of interest X 1 for X in a neighborhood of a given covariate vector x 0 from a targeted unit (e.g. patient, component, DNA sequence). We want to know if a perturbation of x 1 will affect the mean response for units with covariate values near x 0 . Formally, we test H (1) 0 :"X 1 is independent of X 2 , X 3 , . . . , X d , Y " versus H (1) 1 :"µ(x) is not constant as a function of x 1 ∈ IR." Our test statistics will be based on linear fits for x restricted to subregions of the sample space. The subregion that best highlights the dependencies between Y and covariates will be determined by maximizing signal to noise ratios. We illustrate such procedures and their properties by first considering a linear model fit locally over the neighborhood N h (x 0 ) of x 0 where h = (h 1 , h 2 , . . . , h d ) are bandwidths in each of the d dimensions. The true relationship between Y and X is such that E[Y | X = x] ≡ µ(x) and Var(Y | X = x) ≡ σ 2 are unknown. The local linear model gives fitted values as a function of x ∈ N h (x 0 ), denoted where β j = β j (h) are sample coefficients depending on x 0 and h. Thus, for coefficients We let β 1 (h) be the test statistic T n for testing H and develop some of the properties of the efficacy and estimated efficacy for this test statistic. Let V(h) ≡ P (X ∈ N h (x 0 )) and assume throughout that 0 < V(h) ≤ 1.

The Signal
Formally, the signal is where E h is expected value for the conditional distribution P h of (X, Y ) given X ∈ N h (x 0 ).
With Var h and Cov h also being conditional, Similarly, β(h) is the local linear least squares estimator where X h is the design matrix (X ij ), and Y h = {Y i : i ∈ I}. It follows that conditionally given X ≡ (X 1 , . . . , X n ), , and X D is the full design matrix (X ij ) n×(d+1) , we see that by the law of large numbers, Because E( β 1 (h) | X) P −→ β 1 (h) as n → ∞, this shows that the estimated local signal β 1 (h) is conditionally consistent given X as n → ∞ and d → ∞ when

The Noise
The denominator of the efficacy, the noise, is the asymptotic standard deviation σ 2 0 ( β 1 (h)) of β 1 (h) under H 0 . We derive a formula and develop a consistent estimator in what follows.
Let n h be the number of X i that fall in N h (x 0 ). Note that where Thus by Slutsky's Theorem and the Central Limit Theorem where with e = Y − β(h) T X.
We have shown the following.
Proposition 1. Suppose that h and d are fixed in n, that V(h) > 0, 0 < Var h (Y ) < ∞, and By using Equation (15) and Liapounov's Central Limit Theorem we can allow d and h to depend on n when we consider the asymptotic distribution of √ n[ β 1 (h) − β 1 (h)]. We have shown the following.
The sample variance s 2 1 (h 1 ) calculated using {X i1 : X i1 ∈ [x 01 − h 1 , x 01 + h 1 ]} is our estimate of σ 2 1 = Var h H 0 (X 1 ). It is consistent whenever nh 1 → ∞ as n → ∞. Note that with where h (−1) = (h 2 , . . . , h d ) and σ 2 L (h (−1) ) is the contribution to the variance of the error due to lack of linear fit to µ 0 (X) under H 0 . Let µ (−1) L (X) = β 0 + d j=2 β j X j be the locally linear fit based on then a natural estimator for σ 2 e is where n(h (−1) ) is the number of data points in D(h (−1) ). The following can be shown (see Appendix).

The Efficacy Criterion
The usual procedures for selecting bandwidths h involve minimizing mean squared predication or estimation error when predicting the response Y 0 or estimating the mean E(Y 0 | x 0 ) of a case with covariate vector x 0 . The disadvantage is that significant relationships may escape procedures based on h selected in this way because they will be based on small bandwidths which lead to large noise. Here we propose to choose the bandwidth to minimize the probability of Type II error, that is, maximize power among all t-tests with the same asymptotic significance level. We thereby maximize the probability of finding significant relationships between Y and a specified covariate. This procedure automatically selects h to keep the noise small.
Because T n = β 1 (h) is asymptotically normal and E H 0 (T n ) = 0, we know (Pitman (1948), Pitman (1979), Serfling (1980), Lehmann (1999)) that for contiguous alternatives the h that maximizes the asymptotic power is the h that maximizes the absolute efficacy, where Because The efficacy optimal h for testing H For fixed alternatives that do not depend on n with β 1 (h) > 0, this h (0) 1 will not satisfy Remark 1. The definition of the efficacy optimal h makes sense when the relationships between Y and covariates are monotone on N h (x 0 ). To allow for possible relationships such in Section 3 we will use neighborhoods of the form [ The properties of the procedures are not changed much by the introduction of the extra tuning parameters λ 1 , λ 2 , . . . , λ d . See Doksum and Schafer (2006) for the case d = 1.
Write d n for d to indicate that d may depend on n. Using the results of Sections 2.1 and 2.2, we have the following.
Proposition 3. In addition to the assumptions of Proposition 2, assume that as n → ∞, Corollary 1. Suppose G (a grid) denotes a finite set of vectors of the form (h, x 0 ), then, under the assumptions of Proposition 3, the maximizer of t 1 (h, x 0 ) over G is asymptotically optimal in the sense that Remark 2. Hall and Heckman (2000) used the maximum of local t-statistics to test the global hypothesis that µ(·) is monotone in the d = 1 case. Their estimated local regression slope is the least squares estimate based on k nearest neighbors and the maximum is over all intervals with k at least 2 and at most m. They established unbiasedness and consistency of their test rule under certain conditions.
Because the asymptotic power of the test based on β 1 (h) tends to one for all h with β 1 (h) > 0 and |h| > 0, it is more interesting to consider Pitman contiguous alternatives H (1) 1n where β 1 (h) depends on n and tends to zero at a rate that ensures that the limiting power is between α and 1, where α is the significance level. That is, we limit the parameter set to the set where deciding between H 1n as will be shown below.

Optimal Bandwidths for Pitman Alternatives with Fixed Support
Under H (1) 0 , We consider sequences of contiguous Pitman alternatives with β 1j (h) ∝ cn −1/2 , such as where γ n ≡ cn −1/2 , c = 0, and |Cov h (X j , r(X))| > b, for b > 0. Here, . As in the case of fixed alternatives, the maximizer h 1 is asymptotically optimal in the sense of Corollary 1.

Comparison with Mean Squared Error
There is a large literature on selecting bandwidths by minimizing mean squared error (MSE).
Here MSE can be expressed as the following.
Proof. MSE is variance plus squared bias where the conditional squared bias is as given.
The conditional variance of the local least squares estimate If finding significant dependencies is the goal, we prefer maximizing Equation (27) to minimizing Equation (36) because the local MSE (36), as well as its global version, focuses on finding the h that makes µ L (x) close to the true unknown curve µ(x). Using results of Ruppert and Wand (1994) and Fan and Gijbels (1996), page 302, we can show that under regularity conditions (the Hessian of µ(x) exists), the bandwidths minimizing (36) tend to zero at the rate n −1/(d+4) . By plugging such bandwidths into Equation (27) we see that this will in many cases, make it nearly impossible to find dependencies. Various semiparametric assumptions have brought the rate of convergence to zero of the bandwidths to n −1/5 which still makes it likely that dependencies will be missed. By using (27) we have a simple method for finding bandwidths that focus on finding dependencies for any type of alternative.
However there are alternatives where h → 0 makes sense. We construct these next.

Bandwidths for Alternatives with Shrinking Support
We consider sequences of alternatives where the set A n = {x : |µ(x)−µ Y | > 0} is a connected region whose volume tends to zero as n → ∞. One such set of alternatives is given by Pitman alternatives of the form where X and ǫ are uncorrelated, ǫ has mean zero and variance σ 2 , and each W j (·) has support [−1, 1]. We assume that each X j is continuous, in which case the hypothesis holds with probability one when γ j θ j = 0 for all j. We consider models where θ j = θ (n) j → 0 as n → ∞, and γ j may or may not depend on θ j and n. For these alternatives the neighborhood where E(Y |X) is non-constant shrinks to achieve a Pitman balanced model where the power converges to a limit between the level α and 1 as n → ∞. Note, however, that the alternative does not depend on h. We are in a situation where "nature" picks the neighborhood size θ ≡ (θ 1 , . . . , θ d ) T , and the statistician picks the bandwidth h. This is in contrast to Blyth (1993), Fan, Zhang, and Zhang (2001), and Ait-Sahalia, Bickel, and Stoker (2001) who let the alternative depend on h.
We next show that for fixed h with |h| > 0, EFF 1 (h, x 0 ) → 0 as max j {γ j θ j } → 0 in model (37). First note that Next note for h k > 0 fixed, γ k bounded from above, and θ k → 0, because W k ((X k −x 0k )/θ k ) tends to zero in probability as θ k → 0 and any random variable is uncorrelated with a constant. This heuristic can be verified (see the appendix) by a change of variable. This result is stated more precisely as follows.
Proposition 5. If h is fixed with |h| > 0, then as max k {γ k θ k } → 0 in model (37), Proposition 5 shows that for model (37), fixed h leads to small EFF 1 (h, x 0 ). Thus we turn to the h 1 → 0 case. If h 1 > 0, then observations (X 1 , Y ) with X 1 outside [x 01 − θ 1 , x 01 + θ 1 ] do not contribute to the estimation of β 1 (h). Thus choosing a smaller h 1 may be better even though a smaller h 1 leads to a larger variance for β 1 (h). This heuristic is made precise by the next results which provide conditions where h 1 = θ 1 is the optimal choice among h 1 Theorem 1. Assume that X 1 is independent of X 2 , . . . , X d , that the density f 1 (·) of X 1 has a bounded, continuous second derivative at x 0 , and that f 1 (x 0 ) > 0. Then, in model (37), which is maximized subject to h 1 ≥ θ 1 by h 1 = θ 1 .
Theorem 2. Suppose the joint density f (x) of X has a bounded continuous Hessian at x 0 , and that f (x 0 ) > 0, then in model (37), as |h| → 0 and |θ| → 0 with h 1 ≥ θ 1 , conclusion The case where h 1 < θ 1 remains; then some of the data in the neighborhood where the alternative holds is ignored and efficacy is reduced. Under suitable conditions on W j , j = 1, . . . , d, the optimal h 1 equals θ 1 ; that is, 2h 1 is the length of the interval where the X 1 term in the additive model (37) is different from zero. Details of the justification can be constructed by extending the results of Doksum and Schafer (2006) to the d > 1 case.
Remark 3. Fan (1992), Fan (1993), and Fan and Gijbels (1996) considered minimax kernel estimates for models where Y = µ(X)+ǫ in the d = 1 case and found that the least favorable distribution for estimation of µ(x 0 ) using asymptotic mean squared error has Y = µ 0 (X) + ǫ where b n = c 0 n −1/5 , for some positive constants c and c 0 . Similar results were obtained for the white noise model by Donoho and Liu (1991a,b). Lepski and Spokoiny (1999), building on Ingster (1982), considered minimax testing using kernel estimates of µ(x) and a model with normal errors ǫ and Var(ǫ) → 0 as n → ∞. Their least favorable distributions (page 345) has µ 0 (x) equal to a random linear combinations of functions of the form

Variable Selection
Suppose we want to test H (1) 0 that X 1 and Y are unrelated and wonder whether we should keep X j , j ≥ 2, in the model when we construct a t-statistic for this testing problem. In experiments where confounding variables are possibly present, it would seem unreasonable to keep or exclude X j on the basis of power for testing H (1) 0 because confounding could lead to dropping X j in situations where the relationship between X 1 and Y is spurious. For this reason it would seem more reasonable to base the decision about keeping X j on the strength of its relationship to (X 1 , Y ) and thus we ask whether X j contributes to accurate prediction of the response Y conditionally given X 1 . More generally we consider dropping several variables and consider fits to Y over subsets of IR d that vary both in dimension and size (tuning parameter) within each dimension and we select the dimension and size which maximizes efficacy in the X 1 direction while controlling for spurious correlation.
We define a feature S k to be a subset of the covariate space which yields the maximal absolute t-statistic for some x ∈ S k . In other words, among all of the subregions over which linear models are fit, we discard those that do not maximize the absolute t-statistic for some covariate vector x. The remaining regions are denoted S 1 , S 2 , . . . , S r . Assume the features are placed in decreasing order of the value of absolute t-statistic for the variable of interest.
These features are subsets of IR d ordered with respect to their relevance to the relationship between X 1 and Y . Define with S ′ 1 ≡ S 1 . Then ∪ k S ′ k = ∪ k S k and the S ′ k are disjoint. There are competing goals: We want the model to be parsimonious (not include too many covariates), especially since we are focusing on cases where d is large. But, we don't want to exclude any covariate which, if included, changes the picture of the (X 1 , Y ) relationship.
Thus if X 1 and X 2 are closely related, as are X 2 and Y , there will also be an apparent relationship between X 1 and Y . To avoid making false claims regarding the strength of the relationship between X 1 and Y , we tentatively include X 2 , and let the analysis show whether that weakens the (X 1 , Y ) relationship. This analysis of the (X 1 , Y ) relationship after correcting for other X ′ s can be accomplished in the following way: Consider the quantity the expected squared prediction error when X 1 = x 1 , where (X, Y ) is independent of (X i , Y ), 1 ≤ i ≤ n, and µ (S) j (x) stands for the model fit for model j based on a subset of covariates that are restricted to subregion S. We seek the subset of the covariates which minimizes this criterion. If there is no relationship between X 2 and Y when X 1 = x 1 , then X 2 will be excluded based on comparisons of γ(x 1 , µ (S) j ) for different models j; this could lead to different covariates included for different values of X 1 . This is implemented as follows. Within each S k , m different linear models are fit; these models differ in which covariates are included, but the notation is consistent in the sense that "model j" always refers to the same list of covariates. Let µ jk (x) denote the fitted value at covariate value x ∈ S k from model j fit over S k and let µ j (x) ≡ µ jk (x) if x ∈ S ′ k . Each x lies in exactly one S ′ k , so this is uniquely defined. The mean squared prediction error for a model fit is commonly approximated by the leaveone-out cross validation score based on the fitted value µ −i (X i ) from the fit with (X i , Y i ) removed. If µ is a linear model fit using least squares then the cross-validation prediction error is where h i is the i th diagonal element of the hat matrix from the linear model.
Here, we do not fit a global linear model, but the fitted value µ (S) j (X i ) is the result of the fit from some linear model. Let h ij denote the diagonal element corresponding to X i of the hat matrix from the linear model fit which gives us µ Note how this automatically adjusts for the differing degrees of freedom in the different models since (1 − h ij ) = n − d − 1. We can now estimate γ(·, µ (S) j ) by the smooth γ(·, µ) of γ (S) ij versus the observed value of X 1 for each data point X i , i = 1, 2, . . . , n. For a given X 1 = x 1 , the model j within S k with the smallest γ(x 1 , µ (S k ) j ) is selected. Figure 5 To summarize, by selecting the features as described above, we are simultaneously selecting the number of variables to include and the size of candidate neighborhoods for computing t-statistics in a given direction, here X 1 . We select the neighborhoods where the t-statistics are maximized and we use conditional prediction error to select the variables in such a way that we are protected against using models that produce spurious correlations.
Remark 4. A great number of tests of model assumptions and variable selection procedures are available in a nonparametric setting, e.g. Azzalini, Bowman, and Hardle (1989), Raz (1990), Eubank and LaRiccia (1993), Hardle and Mammen (1993), Bowman and Young (1996), Hart (1997), Stute (1997), Lepski and Spokoiny (1999), Fan, Zhang, and Zhang (2001), Polzehl and Spokoiny (2002), Zhang (2003), Samarov, Spokoiny, and Vial (2005), among others. One class of tests of whether the jth variable has an effect looks at the difference between the mean µ(x) of Y given all the x's and the mean µ −j (x −j ) of Y given all but the jth variable. Specifically, for some weight functions w(X), measures of the form are considered (Doksum and Samarov (1995), Ait-Sahalia, Bickel, and Stoker (2001)). Similar measures compare prediction errors of the form Our efficacy measure is a version of µ(x) − µ −j (x −j ) adjusted for its estimability while our measure of spurious correlation is a conditional version of the above prediction errors.

Critical Values
First consider testing H (1) 0 :"X 1 is independent of X 2 , X 3 , . . . , X d , Y " against the alternative H (1) is not constant as a function of x 1 in a neighborhood of a given covariate vector x 0 from a targeted unit of interest. Our procedure uses the data to select the most efficient t-statistic adjusted for spurious correlation, say t 1 (x 0 , (X, Y)), where (X, Y) = ((X ij ) n×d , (Y i ) n×1 ) are the data. Let X * 1 be a random permutation of (X 11 , . . . , X n1 ) and X * = (X * 1 , X 2 , . . . , X d ) where X j = (X 1j , . . . , X nj ) T . Then, for all c > 0, Next, we select B independent random permutations X * 1 , . . . , X * B and use the (1 − α) sample quantile of |t 1 (x 0 , (X * k , Y))|, for k = 1, 2, . . . , B, as the critical value. As B → ∞, this quantile converges in probability to the level α critical value. Figure 1 gives as example of the simulated distribution of |t 1 (x 0 , (X, Y))| under the null hypothesis for x 0 = (0.4, 0.4, 0.4).
Here B = 500, and the model is described in Section 6, stated in Equation (48). Note that although t 1 (x 0 , (X, Y)) is selected using absolute values of efficacy, we can still perform valid one-sided tests by following the above procedures without the absolute values.
In this case the alternative is that µ(x) is increasing (or decreasing) as a function of x 1 in a neighborhood of x 0 .

Next consider testing H
(1) 0 against the alternative H (1) that X 1 is not independent of X 2 , . . . , X d , Y . If x (1) , . . . , x (g) is a set of grid points we can use the sum, sum of squares, maximum, or some other norm, of 0 versus H (1) . We may also instead of grid points use X 1 , . . . , X n or a subsample thereof. Again the permutation distribution of these test statistics will provide critical values.
Finally, consider the alternative that µ(x) is monotone in x 1 for all x 1 ∈ IR. We would proceed as above without the absolute values, use a left sided test for monotone decreasing, and a right sided test for monotone increasing, see Hall and Heckman (2000) who considered the d = 1 case.

The Analysis Pipeline; Examples
In this section we describe practical implementation of the data analysis pipeline developed in the previous sections, using both simulated and real data. The simulations are built around the following functions, plotted in Figure 2. The algorithm can be described as follows. Fix a covariate vector x 0 ∈ [0, 1] 3 and a neighborhood which includes x 0 , but is not necessarily centered at x 0 . A linear model is to be fit over this region. The slope in the X 1 direction β 1 , is estimated using least squares, giving us β 1 . The t-statistic for β 1 follows easily. One can now imagine repeating this for all possible neighborhoods, and finding that neighborhood which results in the largest absolute t-statistic for β 1 . We then imagine doing this for all possible values of x 0 . In practice, a reasonable subset of all possible neighborhoods will be chosen, and models fit over these regions. Here, we lay down a grid of values for each of the covariates, based on evenly spaced quantiles of the data. The number of grid points can be different for each covariate; we choose to have a larger number for the covariate of interest (we use 15) than for the other covariates (we use 5). Thus, we have an interior grid made of up of 15 × 5 × 5 = 375 covariate vectors.
A neighborhood is formed by choosing two of these points and using them as the corners of the region. Thus, there are a total of 375 × 374/2 = 70125 potential neighborhoods; some will be discarded due to sparsity of data.
We consider a simple case first. Set where ǫ is normal with mean zero and variance σ 2 = 0.02. A sample of size n = 1000 is taken. The random variables X 1 , X 2 , X 3 are i.i.d. U(0, 1). Figure 3 Figure 3(a) except that the only regions plotted are those that yield the maximum absolute t-statistic for some x 0 . A convenient computational aspect of this approach is that wide ranges covariate values x 0 will share the same region: We can describe fixing x 0 and finding the optimal region for that x 0 , but in fact we only need to fit all of these 70125 candidate models, calculate the t-statistic for each, and then discard any fit which is not the maximum for some x 0 . In this example there are 18 regions remaining following this process. The regions represented in this plot are the previously defined features, S 1 , S 2 , . . . , S r . We call this plot the "t-statistic plot for variable of interest X 1 ." A more useful way of plotting the remaining regions is shown in Figure 4. We will refer to this graph as the "feature plot for variable of interest X 1 ." In each plot, there is one light, horizontal line for each feature; note that they are labeled with numbers going from 1 to 18.
The vertical axis gives β 1 for that local model. Each dot is an observed data point, and the shade of the dots on one line (i.e., in one region) again represents the extent of that region in the other two covariates.
From this plot, one can pick out dependencies in the data. Consider the regions labeled "1," "2," and "3." For each of these, the range of values of X 1 in the region is approximately 0.5 to 0.7 (look at the third plot). This is capturing the steep downslope in f 1 (x) for x in that range. But, the dependence between X 2 and Y (characterized here by the slope β 2 ) varies with X 2 . Ignoring this fact when modeling Y as a function of X 1 would mask this downslope. This is seen clearly when we look at these three regions in the second plot in Figure 4. The three regions correspond to different ranges of values of X 2 : Region "1" is approximately 0.5 to 0.75, where f 2 (x) is starting to level off, region "2" is extends up to 1.0, where f 2 (x) is almost flat, and region "3" goes from 0 up to 0.5, where f 2 (x) is the steepest.
Since the relationship between X 3 and Y is linear, there is no such pattern to be found in the last of the three plots. The light horizontal lines in this first plot show that our procedure chooses the largest possible bandwidths in the direction X 1 , that is, the response Y is modeled to be linear in this direction, as we know it should be.
Figure 5(a) shows the estimates of the expected squared prediction error γ(·, µ j ) for each of four models, relative to this quantity for the "null" model, the model which only uses the mean of Y within that region to predict the response. We call this plot the "CV plot." In this case, the best choice is to use all three predictors, as evidenced by the solid black line being the lowest for all values of X 1 . Figure 5(b) (the "slope plot") shows, for each X 1 value in the data set, the estimated slope β 1 for the region S ′ k in which that observation lies. In practice, if the variable selection procedure chose a simpler model for a particular subset S k , that model would be used in the slope plot.

Other Cases
We considered alterations to the simple simulation model described in the previous section.
First, consider a case with Y = f 1 (X 1 ) + f 2 (X 2 ) + ǫ, so that Y is not a function of X 3 , but now take (X 1 , X 3 ) to be bivariate normal, each with mean 0.5 and SD 1, and with correlation √ 0.5. X 2 is still U(0, 1), and independent of X 1 and X 3 . Again, n = 1000. Figure 6(a) shows the CV plot for this case. Note that since conditional on X 1 , X 3 and Y are independent, the variable selection procedure is indicating that X 3 could be excluded. Contrast this with the second case where (X 1 , X 3 ) have the same bivariate normal distribution, but now Y = f 2 (X 2 ) + f 3 (X 3 ) + ǫ; see Figure 6(b) for the CV plot. Here, the best choice is to include all three variables since excluding X 3 would lead to misleading conclusions regarding the strength of the relationship between X 1 and Y .

Analysis of Currency Exchange Data
The original motivation for this study was to address a question in the study of currency exchange rates regarding the nature of the relationship between the volume of trading and the return. Data were obtained on the Japanese Yen to U.S. dollar exchange rate for the period of January 1, 1992 to April 14, 1995, a total of 1200 trading days. The response variable used was today's log volume with three covariates: 1) today's log return, 2) yesterday's log volume, and 3) yesterday's log return. The first of these, today's log return, is set as the covariate of interest. Figure 7 shows the feature plot for this data set. One interesting result is that when today's log return is positive, the coefficient for today's log return is positive; see feature 8 at the top of the plot. And when today's log return is negative, those coefficients are mostly negative; see features 1,3,4, and so forth, on the left side of the plot.
When variable selection is applied, we see that there is evidence that yesterday's log return is not important; see Figure 8(a). Finally, the slope plot (Figure 8(b)) shows again how the estimated slope β 1 for the coefficient for today's log return abruptly switches from negative to positive once that variable becomes positive. This is an important finding that may have been missed under a standard multiple regression fit to this surface. This becomes clearer when inspecting the level plot shown in Figure 8(c). The plot shows one dot for each vector x in the data set, the horizontal axis gives today's log return, and the vertical axis is the fitted value µ(x). The superimposed solid curve is a smooth of the plot. For comparison, also plotted (as a dashed line) is the level curve for the d = 1 case where the only covariate is today's log return, as in Karpoff (1987). we see some evidence of the "Karpoff effect." An interesting aspect of this plot is that when all three covariates are included in the analysis, the slope on the left of the level curve is significantly smaller than the slope on the right. This seems to imply that there would be a way to exploit yesterday's log volume in an effort to predict whether today's log return will be positive or negative. But, this artifact is removed by excluding today's log volume as one of the covariates: See the level curve corresponding to "Two Covariates." Using only yesterday's information, it now does not seem possible to utilize yesterday's log volume to predict if today's log return will be positive or negative.

Proof of Lemma 1
The following implies Lemma 1.
Proof. If we condition on X ≡ (X 1 , . . . , X n ), then X 0 ∈ N h (x 0 ) no longer are random events.
We can adapt Hastie (1987), equation (16) to find Here E h ( µ L (X i )) = µ L (X i ) and Lemma 2 follows by the iterated expectation theorem.
Lemma 1 is a special case with µ and µ L computed under the null hypothesis.

Proof of Proposition 5
Proof. We compute expected values using the joint density of X given X ∈ N h (x 0 ). Thus, with k = j, where with x (−j) = (x 1 , . . . , x j−1 , x j+1 , . . . , x d ) T , is the marginal density of X j given X ∈ N h (x 0 ).

The change of variables
because the terms in EFF 1 other than γ k Cov h (X j , Y ) are fixed as γ k θ k → 0. Finally, (b) follows from Proposition 3.

Proof of Theorem 1
Proof. Because of the independence of the X 1 and X 2 , . . . , X d , where V(h 1 ) ≡ P (x 01 − h 1 ≤ X 1 ≤ x 01 + h 1 ). The result now follows from the proof of Theorem 2.1 in Doksum and Schafer (2006).

Proof of Theorem 2
Proof. The proof can be constructed by using the fact that for small |h|, X 1 , . . . , X d given X ∈ N h (x 0 ) are approximately independent with U(x 0j − h j , x 0j + h j ), 1 ≤ j ≤ d, distribu-tions. This can be seen by Taylor expanding f (x) around f (x 0 ) and noting that A similar approximation applies to moments. Now use the proof of Theorem 1 with appropriate small error terms.    Figure 6(a) is the CV plot for the case where X 1 and X 3 are dependent, but Y is not a function of X 3 , comparing competing models for the purposes of variable selection. Figure 6(b) is the CV plot for the case where X 1 and X 3 are dependent, but Y is a function of X 3 , not of X 1 .   (Figure 8(a)), the slope plot ( Figure 8(b)), and the level plot (Figure 8(c)) from the analysis where the response is today's log volume.