A bootstrap test for constant coefficients in geographically weighted regression models

ABSTRACT Statistical tests for whether some coefficients really vary over space play an important role in using the geographically weighted regression (GWR) to explore spatial non-stationarity of the regression relationship. In view of some shortcomings of the existing inferential methods, we propose a residual-based bootstrap test to detect the constant coefficients in a GWR model. The proposed test is free of the assumption that the model error term is normally distributed and admits some useful extensions for identifying more complicated spatial patterns of the coefficients. Some simulation with comparison to the existing test methods is conducted to assess the test performance, including the accuracy of the bootstrap approximation to the null distribution of the test statistic, the power in identifying spatially varying coefficients and the robustness to collinearity among the explanatory variables. The simulation results demonstrate that the bootstrap test works quite well. Furthermore, a real-world data set is analyzed to illustrate the application of the proposed test.


Introduction
Due to its simple implementation and ease of interpretation, the geographically weighted regression (GWR) technique, originally proposed by Brunsdon et al. (1996), has become a popular tool for exploring spatial non-stationarity of a regression relationship in geo-referenced data analysis and has been applied to a variety of fields including geography, ecology, econometrics, epidemiology, and environmental science. In this methodology, a spatially varying coefficient model is locally calibrated by the weighted least-squares procedure based on the observations y i ; x i1 ; . . . ; x iq È É n i¼1 of the response Y and the explanatory variables X 1, X 2 , . . ., X q collected at the geographical locations (u i , v i ) (i = 1, 2, . . ., n). In the model, β j (u,v) (j = 1, 2, . . ., q) are unknown functions of the location co-ordinate (u,v), and ε is the error term with mean zero and variance σ 2 . The model can include a spatially varying intercept by setting X 1 ;1.
Model (1) allows the regression coefficients to vary over space and therefore can be used to explore spatial non-stationarity of the regression relationship via the spatial variation patterns of the estimated coefficients. It is a common practice in the GWR literature that the estimated coefficients are mapped on the geographic region to show the degree of spatial non-stationarity in the regression relationship. In this sense, the coefficient estimators play an important role in reflecting spatial non-stationarity of the regression relationship and the analysis results depend largely upon the estimators of the underlying coefficients.
As a local fitting technique, however, the GWR estimator of each coefficient varies with the focal location no matter whether the underlying coefficient is constant or varies over the space. The excessive flexibility of this model calibration method may lead to spurious variation and even sign reversals of the estimated coefficients (Jetz et al. 2005, Páez et al. 2011. Moreover, collinearity among the explanatory variables may also result in sign changes and strong correlation in the coefficient estimators Tiefelsdorf 2005, Wheeler andCalder 2007). On the other hand, if the coefficients are of very different degrees of smoothness with, for example, some coefficients being constant and the others varying over space, a global bandwidth selected by some commonly used data-driven methods, such as CV, GCV, and AIC c cannot yield accurate estimators for all of the coefficients, which has been theoretically proved and empirically confirmed in the statistical literature of varying coefficient models (Fan and Zhang 1999) and has also been noted in the GWR literature (Farber and Páez 2007, Yang et al. 2011. In summary, there are several causes of inaccurate coefficient estimators and consequently misleading interpretation on the spatial variation patterns of the regression relationship. To alleviate the adverse effect of the collinearity and obtain more accurate estimators of the coefficient, Wheeler (2007) proposed the ridge regression and Bárcena et al. (2014) used the regularization methods for calibrating the GWR models. Furthermore, Wheeler (2009) presented the geographically weighted lasso for coefficient penalization and model selection. Recently, Brunsdon et al. (2012) and Gollini et al. (2015) developed a new ridge estimation approach, called locally compensated ridge GWR, to locally fit the ridge regression at the locations where the collinearity is a problem. Moreover, Wang et al. (2008) suggested the local-linear GWR to reducing the bias and boundary effect of the coefficient estimators. Yang et al. (2011Yang et al. ( , 2012 proposed the flexible bandwidth GWR method to estimate the coefficients with different degrees of smoothness. In addition to the above improved estimation methods, statistical tests for judging whether some of the coefficients vary over the space are fundamental in achieving a valid interpretation of spatial non-stationarity of the regression relationship. In particular, if the test indicates that some coefficients are constant over space, then we can, on one hand, conclude that the impact of the corresponding explanatory variables on the response is spatially stationary and, on the other hand, identify a more appropriate model, i.e. the mixed GWR model (Brunsdon et al. 1999) to fit the given data.
In fact, since the inception of GWR, much attention has been paid to such statistical inferences. In the pioneering work by Brunsdon et al. (1996Brunsdon et al. ( , 1998, two kinds of permutation test were proposed respectively based on the selected bandwidth size and the sample variances of the estimated coefficients at all of the observation locations. However, these two tests can only detect global stationarity of the regression relationship because it is under the null hypothesis that all of the coefficients are constant such that any permutations of the observations y i ; x i1 ; . . . ; x ip È É n i¼1 among the sampling locations are equally likely to occur and then the p-values of the tests could be derived. Shortly afterwards, Brunsdon et al. (1999) developed a set of analytically derived tests for the null hypothesis of no spatial stationarity by comparing the residual sum of squares from the GWR estimation of Model (1) (called GWR model henceforth) with that from the ordinary least-squares estimation of the corresponding global linear model. Furthermore, they proposed so-called mixed GWR models with the back-fitting calibration procedure on which the aforementioned tests can be used to identify one or more constant coefficients in the full GWR model in Equation (1). Leung et al. (2000) also introduced residual sum of squares based statistics to test for global stationarity of the regression relationship and another statistic that is based on the sample variance of each coefficient estimator at the sampling locations to test whether each coefficient varies spatially. In both papers, the null distributions of the test statistics were approximated by an F distribution with proper degrees of freedom. This kind of approximation roots in the distributional theory of the quadratic forms in normal variables and is closely related to the assumption that the model error term follows a normal distribution. Furthermore, when formulating the numerator and (or) denominator of each statistic as the quadratic forms of the model errors, the assumption that the fitted values of the response variable are unbiased estimators of the expectations of the corresponding observations has to be made. This assumption is in general untrue because it has been proved in the statistical nonparametric literature that the local-constant kernel estimation like GWR always produces biased estimators of the regression function except for the case of ordinary linear models (e.g. Zhang 1999, 2008). Thirdly, when using an F distribution to approximate the null distributions of the test statistics, the dependence between the numerator and the denominator of each statistic is neglected although they are generally co-dependent in the framework of nonparametric estimation. Mei et al. (2004) employed the sample variance based statistic to identify a mixed GWR model and suggested a more accurate approach, called three-moment χ 2 approximation, to derive the p-value of the test. Although the requirement of independence between the numerator and denominator of the test statistic is avoided in this approximation, the assumptions of the normality of the model error term and the unbiasedness of the coefficient estimators are still retained. Páez et al. (2002aPáez et al. ( , 2002b) developed a maximum likelihood estimation and inferential framework for the GWR model. However, the assumption that the model error term follows a heteroscedastic normal distribution is still needed.
The above discussions suggest that there is still much room for improvement in the existing inferential methods of the GWR technique. One of the possible ways is to derive the asymptotic null distributions of the test statistics. However, it has been recognized that the limiting distributions may not necessarily provide a valid approximation to the null distributions in the case of finite sample size (e.g. Fan and Jiang 2007). Driven by many sophisticated statistical inferential problems in a variety of areas and fueled by the power of modern computers, Monte Carlo methods are a popular method to simulate the distribution of a complex statistic. The most attractive one is perhaps the bootstrap method originally developed by Efron (1979). Because of its solid theoretical basis and fewer restrictions on the underlying distribution of the error term, the bootstrap method has been widely applied to the literature of parametric and nonparametric regression analysis for statistical inferences. This method has also been used in the GWR literature to test for zero coefficients in a mixed GWR model (Mei et al. 2006) and to investigate coefficient nonstationarity via comparing the basic GWR with some standard spatially autocorrelated regressions (Harris et al. 2015). In this paper, we focus on the residual sum of squares based statistic in Brunsdon et al. (1999) and propose a bootstrap procedure to approximate its null distribution to achieve the task of testing for the constant coefficients in GWR models. In view of the model calibration methods, although the local-linear GWR in Wang et al. (2008) performs better than the conventional GWR in reducing the bias and the boundary effect of the coefficient estimators, it is very computationally expensive because both the coefficients and their partial derivatives should be simultaneously estimated. Here, we still use the conventional GWR technique proposed by Brunsdon et al. (1996) to calibrate Model (1) and the two-step procedure derived by Fotheringham et al. (2002, Chapter 3) to fit a mixed GWR model because of their computational efficiency.
The rest of this paper is organized as follows. In Section 2, the estimation procedures for a GWR model and a mixed GWR model are briefly described. In Section 3, the residual sum of squares based statistic is introduced to test the null hypothesis that some of the coefficients in the GWR model are constant and a residual-based bootstrap procedure is proposed to compute the p-value of the test. Furthermore, some issues including the computational strategy and some possible extensions of the test are discussed. A simulation study with comparison to the existing test methods is conducted in Section 4 to evaluate the performance of the test under different error distributions and different degrees of collinearity among the explanatory variables. The Boston housing price data are analyzed in Section 5 to demonstrate the application of the proposed test. The paper is then concluded with some remarks.

2.
A brief review of the conventional estimation approaches of GWR and mixed GWR models 2.1. Estimation of a GWR model and the resulting residual sum of squares Let y i ; x i1 ; . . . ; x iq È É n i¼1 be the observations independently collected at the locations (u i , v i ) (i = 1, 2, . . ., n) and d 0i be the Euclidean distance between a given focal location (u 0 , v 0 ) and (u i , v i ). The GWR estimation of Model (1) is to minimize the objective function with respect to β j (u 0 , V 0 ) (j = 1, 2, . . ., q), where K h ( . ) = K( . /h) with K( . ) being a kernel function and h being a bandwidth. Denote (2) W h ðu 0 ; υ 0 Þ ¼ DiagðK h ðd 01 Þ; K h ðd 02 Þ; . . . ; K h ðd 0n ÞÞ; (3) and βðu 0 ; υ 0 Þ ¼ ðβ 1 ðu 0 ; υ 0 Þ; β 2 ðu 0 ; υ 0 Þ; . . . ; β q ðu 0 ; υ 0 ÞÞ T : Then, the GWR estimator of the coefficient vector β(u 0 , v 0 ) iŝ Especially, taking the focal location (u 0 , v 0 ) = (u i , v i ) (i = 1, 2, . . ., n), respectively, the fitted values of the response variable Y at the n sampling locations are computed bŷ . . . ; x iq Þ is the ith row of the matrix X in Equation (2). The fitted vector of the response variable Y at the sampling locations can be denoted bŷ The residual vector is then and the residual sum of squares is where I is the identity matrix of order n. The size of the bandwidth h can be determined by the data-driven methods, such as CV, GCV, and AIC c . Here, the AIC c procedure is used to select the bandwidth. That is, let where tr( . ) stands for the trace of a matrix. The optimal size of h is

Estimation of a mixed GWR model and the resulting residual sum of squares
After properly re-ordering the explanatory variables, a mixed GWR model in its sample form (Brunsdon et al. 1999) stands as where β j (u,v) (j = 1,2, . . ., r) are spatially varying coefficients and β j (j = r + 1, r + 2, . . ., q) are constant coefficients. Here, the errors ε i (i = 1, 2, . . ., n) are also assumed to be independent and identically distributed random variables with mean zero and variance σ 2 . Fotheringham et al. (2002, chapter 3) proposed a two-step calibration method for the mixed GWR model. This method is less computationally intensive than the back-fitting approach suggested in Brunsdon et al. (1999) and admits the explicit expressions of the estimators for both the constant and the spatially varying coefficients. What follows is the description of the two-step method with the resulting residual sum of squares.
For the mixed GWR model in Equation (9), let be the design matrices corresponding to the spatially varying coefficient part and the constant coefficient part, respectively, and let be the vectors of the spatially varying coefficients and the constant coefficients, respectively. According to the two-step estimation, the estimator of β c iŝ (3) with (u 0 ,v 0 ) replaced by (u i ,v i ) (i = 1,2, . . ., n) and Y is the observation vector in Equation (2). With the estimatorsβ c andβ v ðu i ; υ i Þ (i = 1,2, . . ., n), the fitted vector of the response variable Y in the mixed GWR model is then The residual vector is thereforê and the residual sum of squares is In the above estimation of the mixed GWR model, the optimal size of the bandwidth h can also be selected by the AIC c criterion. That is, let The optimal size of h is

A residual-based bootstrap test for constant coefficients in the GWR model
In this section, we first focus on testing the null hypothesis H 0 : some coefficients in Model (1)

Construction of the test statistic
Based on the data set y i ; , fit the GWR model in Equation (1) (the alternative model) according to the estimation procedure described in Subsection 2.1 and compute the residual sum of squares where RðH 1 Þ ¼ ½I À Sðh 1 Þ T ½I À Sðh 1 Þ and h 1 is the optimal bandwidth size selected by the AIC c criterion in Equation (8). Specifically, it is assumed that one wants to test whether the last qr coefficients in Model (1) are constant. Then the null hypothesis is which leads to the null model of the mixed GWR model in Equation (9). Calibrate the null model with the bandwidth size h 1 selected under the alternative model and compute the residual sum of squares where The test statistic is then constructed as The above statistic has been used in the GWR literature (Brunsdon et al. 1999, Leung et al. 2000, Mei et al. 2006) for comparing the goodness-of-fit of two related models. In the nonparametric regression literature, this statistic is called a generalized likelihood ratio statistic originally proposed by Fan et al. (2001) and has been widely used to perform inference for various nonparametric or semiparametric models (Fan and Jiang 2007). It has been proved in many situations that this statistic holds a useful property, called Wilks' phenomenon, meaning that its limiting distribution is free of the nonparametric smoothing methods used for calibrating the model. As pointed out by Fan and Huang (2005), when the generalized likelihood ratio statistic is used for hypothesis testing, it is essential to calibrate both the null and the alternative models with the same bandwidth size unless the null model is a full parametric model which does not need a bandwidth in its estimation. Otherwise, the adjustment of the bandwidth may make the residual sums of squares from the null and alternative models incomparable, leading to a loss in the power of the test. Furthermore, as noted in Fan and Jiang (2007) and Cai (2007), for a given data set, since one is not certain whether the null model is correct, the fit from the alternative model should be used to generate the residuals for bootstrap sampling because this fit results in consistent residuals under both the null and alternative models. For these reasons, we use in this paper the bandwidth size h 1 selected in the GWR model (alternative model) to fit the mixed GWR model (null model) in the construction of the test statistic in Equation (19), which meets the above two requirements simultaneously.
Intuitively, if the null hypothesis H 0 in Equation (17) is indeed true, the difference between RSS M (h 1 ) and RSS(h 1 ) should be small. Otherwise, RSS M (h 1 ) tends to be significantly larger than RSS(h 1 ). Hence, a large value of the statistic T supports rejection of the null hypothesis and the p-value of the test is then where t is the observed value of T obtained by Equation (19) and P H 0 means that the probability is computed according to the null distribution of T. Given a significance level α, if p < α, then reject the null hypothesis H 0 ; otherwise, do not reject H 0 .

A bootstrap procedure for computing the p-value of the test
To overcome the shortcomings of the existing inferential methods in approximating the null distribution of the test statistic T, we suggest a bootstrap procedure to compute the p-value of the test. Here, it is only assumed that the model errors ε 1 , ε 2 ,. . ., ε n are independent and identically distributed random variables with mean zero and variance σ 2 . The bootstrap procedure in our setting is as follows.
Step 2: Fit the null model (i.e. the mixed GWR model in Equation (9)) using the twostep calibration method and the same bandwidth size h 1 in Step 1. Then, compute the fitted vector of the response variableŶ M ¼ S M ðh 1 ÞY in Equation (11) and the residual sum of squares RSS M ðh 1 Þ ¼ Y T RðH 0 ÞY in Equation (18).
Step 4: Draw a bootstrap sample ε Ã ¼ ðε Ã 1 ; ε Ã 2 ; . . . ε Ã n Þ T with replacement fromε c 1 ε 1c ;ε 2c ; . . . ;ε nc ð Þ T and compute the bootstrap sample of the response variable by Step 5: Based on the bootstrap data, re-calibrate both the alternative and null models by their respective estimation approaches with the same bandwidth size h 1 . Compute the residual sums of the squares RSS*(h 1 ) and RSS Ã M ðh 1 Þ of the alternative and the null models, respectively. Then a bootstrap value of the statistic T is obtained Step 6: Repeat Steps 4 and 5 for b times and obtain b bootstrap values of the statistic T, which we denote by t Ã 1 ; t Ã 2 ; . . . ; t Ã b . The p-value in Equation (20) is then estimated bŷ where I( . ) is the indicator function.

Some comments on the test
(1) Testing a globally stationary regression relationship. As one of the most important inferences in the GWR literature, the test for a globally stationary regression relationship can provide the information that a spatially varying coefficient model is really necessary for the given data. The proposed bootstrap test is applicable to this case. In fact, the null hypothesis in this case is H 0 : all of the coefficients in Model (1) are constant, which corresponds to the ordinary linear regression model . . . ; n: In this case, the above null model is simply calibrated by the least-squares procedure. As a result, the matrix S M (h 1 ) and the residual sum of squares RSS M (h 1 ) in Step 2 do not depend on the bandwidth size h 1 and have the forms of respectively, where X is the design matrix in Equation (2).
(2) A simplified algorithm to compute the bootstrap value of the test statistic.
Step 5 says that both the alternative and the null models must be recalibrated based on the bootstrap data y . However, the calibrations of the alternative and null models do not need to be performed repeatedly. In fact, it is noted that both the original data and the bootstrap data have the same observations of the explanatory valuables and the corresponding sampling locations. Furthermore, the bandwidth size used for re-calibrating the two models is unchanged in each repeat of Steps 4 and 5. Therefore, re-calibrating the alternative and the null models will result in the same matrices S(h 1 ) and S M (h 1 ) and therefore the same matrices R(H 0 ) and R(H 1 ). Based on this fact, the residual sums of squares RSS Ã ðh 1 Þ and RSS Ã M (h 1 ) can be simply computed by Therefore, in each of the b repeats of computing the bootstrap value in Step 6, it is only needed to replace Y* with the updated one, which can largely reduce the computational complexity of the bootstrap test.
(3) Some possible extensions of the test. Although the test is proposed mainly for identifying the constant coefficients in a GWR model, it admits some extensions for testing more complex structures of the coefficients.
One extension is to test the null hypothesis that some coefficients, for example, the last qr coefficients in Model (1), are linear combinations of some known functions of the spatial coordinate (u,v). Specifically, the null hypothesis can be expressed as H 0 : β j ðu; υÞ ¼ X I j k¼1 θ jk g jk ðu; υÞ for j ¼ r þ 1; r þ 2; . . . ; q; (21) where for each j = r + 1, r + 2, . . ., q, θ jk (k = 1, 2, . . ., l j ) are unknown constant parameters and g jk (u,v) (k = 1, 2, . . ., l j ) are known linearly independent functions of the spatial coordinate (u,v). Under the null hypothesis, the null model is of the form . . . ; n: Let z ijk ¼ g jk ðu i ; v i Þx ij ; k ¼ 1; 2; : : : ; l j ; j ¼ r þ 1; r þ 2; . . . ; q; i ¼ 1; 2; : : : ; n: The null model becomes which is a mixed GWR model because z ijk (k = 1, 2, . . ., l j ; j = 1,2, . . ., q; i = 1, 2, . . ., n) are all known. Therefore, the proposed bootstrap test is applicable to the null hypothesis in Equation (21). When each g jk (u,v) in Equation (21) is a power function of u and v, the coefficients β j (u,v) (j = r + 1, r + 2, . . ., q) are polynomial functions of the spatial coordinate. Especially, when the null hypothesis includes all the coefficients (i.e. r = 0), the resulting null model is in fact a geographical expansion model which has been comprehensively studied in Casetti (1972Casetti ( , 1997 and Jones and Casetti (1992). Therefore, as a special case of the null hypothesis in Equation (21), the bootstrap test can be used to identify whether an expansion model is suitable for a given data set.
Another extension is to test the hypothesis of one mixed GWR model against another mixed GWR model. In this case, the alternative model is a mixed GWR model. The null model may be diverse. For example, in order to evaluate whether some explanatory variables in the constant-coefficient part of the alternative model do influence the response variable, one can set such a null model that the coefficients corresponding to these explanatory variables are all zero, resulting in the same test method in Mei et al. (2006).

A simulation study
In this section, some simulation with comparison to the existing test methods in the GWR literature is conducted to assess the performance of the proposed bootstrap test. The simulation focuses on the validity of the bootstrap approximation to the null distribution of the test statistic and the power of the test in detecting the spatially varying coefficients under normal and some typical non-normal error distributions. Furthermore, in view of the fact that collinearity among the explanatory variables may have an adverse impact on the coefficient estimators Tiefelsdorf 2005, Wheeler andCalder 2007), it is therefore of interest to evaluate the influence of the collinearity on the performance of the test.

Design of the simulation
(1) The spatial layout. The spatial region for the simulation is taken to be a square with the length of each side being 10 units and a Cartesian coordinate system is installed in such a way that its origin is located at the bottom-left corner of the square and the two axes coincide with the mutually orthogonal two sides of the square. The sampling locations are m × m lattice points with their spatial coordinates under the Cartesian system being ðu i ; υ i Þ ¼ 10 m À 1 mod ði À 1; mÞ; 10 m À 1 intði À 1; mÞ ; i ¼ 1; 2; . . . ; m 2 ; where mod(a,b) and int(a,b) stand for the remainder and the integer part of a divided by b, respectively. In the simulation, m = 16 and 21 are designated, resulting in the sample sizes n = 256 and 441, respectively.
(3) The error distribution. In order to examine the effect of the error distribution on the performance of the test, we consider the following three different types of the error distribution whose scales are so adjusted that they all have a common variance σ 2 = 1: (i) standard normal distribution N(0,1); (ii) uniform distribution UðÀ ffiffi ffi 3 p ; ffiffi ffi 3 p Þ; (iii) transformed and centered χ 2 distribution 1 2 χ 2 2 ð Þ À 1. (4) The way of generating the observations of the explanatory variables. Let Z 1 , Z 2 , Z 3 , and Z 4 be independent random variables with the common distribution U(0,1). The explanatory variables are taken to be where 0 γ,τ, ρ < 1. The following four cases are considered: (i) γ = τ = δ = 0. In this case, the explanatory variables are mutually independent.
Given each set of the values for γ, τ, and δ, the observations of the explanatory valuables are generated as follows. Draw independently a sample (z 1 , z 2 , z 3 , z 4 ) T from the distribution U(0,1) and the observation of the explanatory variables (x 1 , x 2 , x 3 , x 4 ) T is computed according to Equation (23). Furthermore, the corresponding observation of the response variable is derived according to Model (22). (5) The null hypothesis. The null hypothesis to be tested in the simulation is where β 3 and β 4 are two constants. This null hypothesis corresponds to c = 0 in the coefficients β 3 (u,v) and β 4 (u,v). (6) Selection of the kernel function and bandwidth. Throughout the simulation, the kernel function is chosen as the Gaussian kernel, i.e. K t ð Þ ¼ 1 ffiffiffiffi 2π p e À t 2 2 ; and the optimal bandwidth size for the alternative model is selected by the AIC c criterion in Equations (8).

Simulation results with analysis
For each of the experimental settings, we run N = 500 replications and, in each replication, b = 500 bootstrap samples were drawn to compute the p-value.
(1) Validity of the bootstrap approximation to the null distribution of the test statistic. Under the null hypothesis (i.e. setting c = 0 in the coefficients β 3 (u,v) and β 4 (u,v)) and for each of the experimental settings, we calculated the rate of rejecting the null hypothesis among N = 500 replications at the significance levels α = 0.01, 0.05, and 0.10, respectively. The rate is an estimator of the type I error of the test under the given setting and significance level.
Moreover, the bootstrap test was compared with the test in Brunsdon et al. (1999) where the test statistic is the same as that in Equation (19) except for an additional scaling factor, but the null distribution of the statistic is approximated by an F distribution. It should also be pointed out that, in the present simulation, the null model was calibrated by the two-step method described in Section 2.2 instead of the originally proposed back-fitting procedure in Brunsdon et al. (1999). Furthermore, although the degrees of freedom of the foregoing F distribution are derived under the assumption of the normally distributed error terms, the formulas can be routinely used to compute the degrees of freedom for other error distributions.
For each of the experimental settings, the rejection rate was computed for both test methods and the results are reported in Table 1, where the letters 'B' and 'F' refer respectively to the bootstrap test and the test in Brunsdon et al. (1999) which we call F-test henceforth.
It is known from Table 1 that, for the bootstrap test, the rejection rates under the null hypothesis are reasonably close to the respective significance levels in all of the experimental settings even for the small sample size of n = 256, indicating that the bootstrap method yields an accurate approximation to the null distribution of the test statistic at least on the right tail of the null distribution on which the p-value of the test is computed. Furthermore, although the bootstrap approximation seemingly performs best for the normal error distribution, it can also account well for nonnormal error distributions in the sense that the rejection rates do not show an evident deviation from the respective significance levels for both the short-tailed distribution of U À ffiffi ffi 3 p ; ffiffi ffi 3 p À Á and the skew distribution of 1 2 χ 2 (2)−1. More interestingly, although the collinearity among the explanatory variables may lead to spurious correlation between the GWR estimators of the spatially varying coefficients (Wheeler and Tiefelsdorf 2005), it does not bring about obvious influence on the rejection rates under the null hypothesis, which indicates that the bootstrap approximation to the null distribution of the test statistic is rather robust to the collinearity among the explanatory variables.
For the F-test, however, the rejection rates are much lower than their respective significance levels in all of the experimental settings, which indicates that the derived F distribution may have a heavier tail on the right than the real null distribution of the test statistic, leading to an inaccurate approximation to the null distribution. Although the rejection rates under the larger significance levels tend to increase with the sample size increasing, they are still much smaller than the corresponding significance levels. One more finding is that, under the null hypothesis, the rejection rates of the F-test are little influenced not only by the collinearity between the explanatory variables but also by the non-normality of the error distribution.
(2) Power of the bootstrap test. When c ≠ 0, the alternative hypothesis that all of the coefficients in Model (22) vary over the space is true. In this situation, given a significance level α, the rate of rejecting the null hypothesis in N = 500 replications is an estimator of the test power. Here, we conduct some simulation to assess the power of the bootstrap test via a comparison with that of the F-test.
Set the significance level α = 0.05 and the values of c to be from 0.1 to 0.7 by an increment of 0.1. Here, we only focused on the cases where the explanatory variables X 1 , X 2 , X 3 , and X 4 are mutually independent and two of them are highly correlated with ρ ðX 1 ;X 2 Þ ¼ ρ ðX 2 ;X 3 Þ ¼ ρ ðX 3 ;X 4 Þ ¼ 0:8; respectively. Furthermore, the power of the F-test was also computed in each experimental setting for the purpose of comparison. The power functions for both tests are depicted in Figure 1.
It can be observed from Figure 1 that the bootstrap test is of high power in identifying the varying coefficients in the GWR model no matter what the model error term follows a normal or a non-normal distribution. In contrast, the F-test shows evidently lower power in all of the settings. Especially, when the difference between the null and alternative hypotheses is small, the F-test is much more conservative than the bootstrap test. For other aspects, both tests show similar characteristics. Specifically, the power increases rapidly with either the sample size increasing or the alternative hypothesis deviating from the null hypothesis gradually. Compared to the results in the case of mutually independent explanatory variables, the power of both tests is somewhat influenced by the collinearity among the explanatory variables in complicated ways. Generally, there is a loss of power especially in the case of ρ ðX 2 ;X 3 Þ ¼ 0:8, where the collinearity exists between the explanatory variables in the null and alternative hypotheses, respectively; in the case of ρ ðX 3 ;X 4 Þ ¼ 0:8, where the correlated explanatory variables are all in the alternative hypothesis, however, the power increases slightly for both tests.
To further understand the impact of the collinearity on the power of the tests, we also run the simulation for negative correlation between the explanatory variables. That is, the parameters γ, τ, and δ take the corresponding negative values, respectively. The results for both tests are in general similar to those mentioned above, but for the case of ρ ðX 2 ;X 3 Þ ¼ À0:8; the power of both tests shows an evident increase  compared to the case of ρ ðX 2 ;X 3 Þ ¼ 0:8: The results for the case of ρ ðX 2 ;X 3 Þ ¼ À0:8 are illustrated in the last row of Figure 1.
In addition, the bootstrap test was compared with the test suggested in Leung et al. (2000) (called L-test henceforth) in the simulation study. In the L-test, the test statistic is constructed based on the sample variance of the GWR estimators of each coefficient at all of the sampling locations (see Equation (65) in Leung et al. 2000) and its null distribution is also approximated by an F distribution. It should be noted that since each coefficient is separately tested for a constant in the L-test, testing the null hypothesis in Equation (24) with the L-test amounts to simultaneously testing the following two null hypotheses H 03 : β 3 u; υ ð Þ ¼ β 3 and H 04 : In all of the experimental settings, the rejection rates under the null hypotheses H 03 and H 04 were also computed with the L-test. Because of the limited space, the results are omitted here, but those for n = 441 are attached as a supplementary material. It is found from the results that the L-test yields an excessively high type I error. That is, when the null hypotheses are indeed true, the rejection rates in all settings are much larger than the corresponding significance levels, indicating that an F distribution cannot provide an accurate approximation to the null distributions of the test statistics. Although the results improve with the sample size increasing, the rejection rates are still generally two or three times higher than the corresponding significance levels even for the lager sample size of n = 441 (see Table A in the supplementary material). In terms of the test power, the L-test yields a more complex pattern than the bootstrap test (see Figure A in the supplementary material). Generally speaking, the power of L-test is higher than that of the bootstrap test especially for small value of c. With the value of c increasing, however, the power of L-test for H 04 becomes lower than that of the bootstrap test. Moreover, the impact of the collinearity on the power seems different for these two tests with the most serious case being ρ ðX 3 ;X 4 Þ ¼ 0:8 for the L-test, but ρ ðX 2 ;X 3 Þ ¼ 0:8 for the bootstrap test.
In summary, the simulation study demonstrates that the bootstrap test outperforms the existing F-test and L-test in terms of type I error and test power. The bootstrap test can well approximate the null distribution of the test statistic even for small sample size and is free of the normality assumption on the error distribution. Furthermore, it is powerful in identifying the spatially varying coefficients. However, collinearity among the explanatory variables has an adverse impact on the power of the test, although the influence tends to be alleviated with the sample size increasing. Similar conclusions have been reached by Páez et al. (2011) in the empirical study of estimation of GWR models.

Application to Boston housing data
The proposed bootstrap test is illustrated by an application to the Boston housing data set given in Hurrison and Rubinfeld (1978) and corrected for a few minor errors by Gilley and Pace (1996). The data set consists of the median value (MEDV in $1000) of owneroccupied homes in 506 US census tracts in the Boston area in 1970 as well as the following explanatory variables: For the mixed GWR model in Equation (26), of further interest is to test whether some explanatory variables with constant coefficients influence the response MEDV significantly, which amounts to the tests for identifying which constant coefficients are zero. As mentioned in Section 3.3, the bootstrap test is applicable to such kind of tests. With the Gaussian kernel and the bandwidth selection criterion in Equation (15), the optimal size of the bandwidth for the mixed GWR model in Equation (26) is h M = 0.68. We first performed the tests with 1000 bootstrap replications for the null hypothesis that each single constant coefficient in Model (26) is zero. The resulting p-values are listed in Table 3.
It is known from Table 3 that, for each single explanatory variable with constant coefficient, INDVS has not significant impact on the response MEDV. Furthermore, we added one or more other constant coefficients into the null hypothesis β 2 = 0, the resulting p-values are all less than or equal to 0.005, indicating that INDVS is the only variable that does not influence MEDV significantly.
By removing the insignificant explanatory variable INDVS from Model (26) and reordering both the spatially varying coefficients and the constant coefficients, we finally specified a mixed GWR model for the Boston housing data as MEDV ¼ β 0 u; υ ð Þ þ β 1 u; υ ð ÞRM þ β 2 u; υ ð ÞDIS þ β 3 u; υ ð ÞLSTAT þ β 4 CRIM þ β 5 NOX þ β 6 AGE þ β 7 RAD þ β 8 TAX þ β 9 PTRATIO þ β 10 BK þ ε: The above model can be calibrated by the two-step method in Fotheringham et al. (2002, chapter 3) and can be used to explain the spatial characteristics of the regression relationship between the median value of owner-occupied homes and the related explanatory variables. This content is beyond the scope of this paper and is therefore omitted.

Final remarks
Understanding whether some of the coefficients in a GWR model really vary over the space is essential to validly explain spatial non-stationarity of the regression relationship. To achieve this task, we propose in this paper a residual-based bootstrap test to identify constant coefficients in a GWR model. The simulation study shows that the bootstrap test outperforms the existing test methods in terms of type I error and test power. Moreover, the bootstrap test is applicable to diverse hypotheses and provides a useful way of building a possible mixed GWR model for a georeferenced data set. In terms of future research, it should be noted that only the effect of global collinearity between the explanatory variables is evaluated in the simulation study whereas the impact of local collinearity on the bootstrap and other tests remains to be investigated. Furthermore, a more important issue is how to alleviate the adverse impact of global or local collinearity on the tests and deserves to be further studied. Moreover, the geographically and temporally weighted regression (GTWR), a promising extension of the GWR model, has recently been developed to account for local effects in both space and time (Huang et al. 2010, Fotheringham et al. 2015 and has attracted a growing interest on its applications. However, the similar inferential problems to those in the GWR model are far from being solved. The bootstrap test might be a potential candidate for achieving the task, but the details still need to be clearly addressed in view of the complexity of GTWR.