Specification Testing of Regression Models with Mixed Discrete and Continuous Predictors

Abstract This article proposes a nonparametric projection-based adaptive-to-model specification test for regressions with discrete and continuous predictors. The test statistic is asymptotically normal under the null hypothesis and omnibus against alternative hypotheses. The test behaves like a locally smoothing test as if the number of continuous predictors were one, and it can detect local alternative hypotheses distinct from the null hypothesis at the rate that can be achieved by existing locally smoothing tests for regressions with only one continuous predictor. Because of the model adaptation property, the test can fully use the model structure under the null hypothesis, so the dimensionality problem can be significantly alleviated. A discretization-expectation ordinary least squares estimation approach for the partial central subspace in sufficient dimension reduction is developed as a by-product of the test construction. We suggest a residual-based wild bootstrap method that fully uses the null model and thus gives an approximation closer to the limiting null distribution than existing bootstrap approximations. We conduct simulation studies to compare the test with existing tests and analyze two real data examples for illustration.


Introduction
Specification testing plays a crucial role in decision-making in numerous fields, such as econometrics, medicine, and bioinformatics, as inferences based on misspecified models can lead to misleading conclusions. Data frequently contain discrete predictors such as family size, gender, choices made by economic agents, and so forth. Consistent model specification tests that are directly applicable in the presence of discrete predictors would therefore be of value to applied researchers. In this article, we check whether the generalized partially parametric single-index model holds, namely
Y = G(β⊤X, W, γ) + ε, (1)
where Z = (X⊤, W⊤)⊤ with X ∈ R^{p_1} being the p_1-dimensional continuous covariates and W ∈ R^{p_2} being the p_2-dimensional categorical predictors taking values in {θ_1, ..., θ_c}, G(·) is a known regression function, (β⊤, γ⊤)⊤ is an unknown vector of parameters, and ε is the model error, independent of the covariates (X⊤, W⊤)⊤. The notation "⊤" denotes the transpose operator on a vector or a matrix throughout this article. Model (1) generalizes the parametric single-index model in Guo, Wang, and Zhu (2016): the latter cannot handle categorical predictors, whereas our setting can. A practical example is program evaluation, where the propensity score is usually parametrically specified as a linear probability, probit, or logit model. However, policy evaluation based on an incorrectly specified propensity score function may lead to incorrect conclusions in further analysis; see, for example, Li, Li, and Liu (2016) and the references therein.
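For concreteness, the following minimal sketch generates data from a model of the form (1); the logistic choice of G, the single binary categorical predictor, and all dimensions and coefficients are illustrative assumptions, not values taken from the article.

```python
# A minimal data-generating sketch for a model of the form (1), assuming a
# logistic link G and one binary categorical predictor (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, p1 = 200, 4
beta = np.ones(p1) / np.sqrt(p1)      # index direction for the continuous part
gamma = 0.5                           # coefficient attached to the discrete part

X = rng.normal(size=(n, p1))          # continuous predictors
W = rng.integers(0, 2, size=n)        # one binary categorical predictor
eps = rng.normal(scale=0.2, size=n)   # error, independent of (X, W)

def G(index, w, gamma):
    # a known (here: logistic) regression function G(beta'X, W, gamma)
    return 1.0 / (1.0 + np.exp(-(index + gamma * w)))

Y = G(X @ beta, W, gamma) + eps       # data satisfying the null model (1)
```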
Another practical example is a financial market dataset, where the linear error correction model in Wahab and Lashgari (1993) was used to describe the dynamics of spot and futures prices in the presence of arbitrage. This model implicitly assumes that pricing errors are reduced at a speed that is independent of the magnitude of the price deviation between spot and futures. However, this model may not describe the relationship sufficiently, as Gaul and Theissen (2015) found that the speed of price adjustment increases almost monotonically with the magnitude of the price deviation. They then developed a partial linear regression model and proposed a test for the adequacy of the linear error correction model. In summary, to accurately and efficiently use the simple parametric model (1) rather than a complicated nonparametric model in applications, it is necessary to develop a suitable and efficient specification test before further statistical analysis.
Several tests for parametric regression models with mixed discrete and continuous predictors have attracted widespread attention in the literature. Racine, Hart, and Li (2006) developed a conditional moment test for the significance of categorical predictor variables in nonparametric regression models. Hsiao, Li, and Racine (2007) used the ideas of Fan and Li (1996) and Zheng (1996) to develop a nonparametric kernel-based model specification test with mixed discrete and continuous data. Mora and Moro-Egido (2008) developed two specification tests, based on moment conditions and on the difference between parametric and nonparametric estimates, respectively, for an ordered discrete choice model against general alternative models. Li, Li, and Liu (2016) proposed a consistent model specification test based on the k-nearest-neighbor estimation method, which can also deal with this case. As far as we know, these existing tests are mainly based on locally smoothing techniques. A typical problem is that such locally smoothing tests can only detect local alternatives at the slow rate O(n^{-1/2} h^{-p_1/4}), where h is the bandwidth going to 0. When the dimension p_1 of X is large, this rate deteriorates further, and thus locally smoothing tests suffer severely from the curse of dimensionality.
A straightforward approach to overcome the curse of dimensionality is to first project the high-dimensional covariates onto one-dimensional spaces and then aggregate over all the projection-based tests. Zhu and Li (1998) developed an unweighted integral of expectations over all one-dimensional directions via this idea when all predictors are continuous. Lavergne and Patilea (2008) adopted a similar idea and further proposed a nonparametric method by exploring an optimal direction. Lavergne and Patilea (2012) proposed a smooth version of the integrated conditional moment test over all projection directions. Other relevant references include Wong, Fang, and Zhu (1995) and Xia (2009).
There are several works on justifying regression by approximation under model misspecification. For example, White (1982) presented a general theory of maximum likelihood estimation in misspecified models. Cule, Samworth, and Stewart (2010) studied the maximum likelihood estimation of a multi-dimensional log-concave density in misspecified models. Abadie, Imbens, and Zheng (2014) discussed an alternative parameter that corresponds to the approximation of the conditional expectation under misspecified models. Lv and Liu (2014) derived novel asymptotic expansions of the AIC and BIC principles in misspecified generalized linear models. James (2017) studied a misspecified model fitted with heteroscedastic errors in astronomy. Demirkaya et al. (2022) suggested the generalized Bayesian information criterion to characterize the impacts of both model misspecification and high dimensionality on model selection. In detail, these methods only provide estimators of the best approximation and do not estimate the true mean function. Yet a proper interpretation of the model is both necessary and challenging, and this problem deserves to be discussed and solved.
In this article, we propose a projection-based adaptive-to-model test (PAMT). First, the new test shares the model adaptation property of existing adaptive-to-model tests such as Guo, Wang, and Zhu (2016), Zhu, Guo, and Zhu (2017), and Tan, Zhu, and Zhu (2018), which adapt to the underlying model to alleviate the dimensionality problem. However, to the best of our knowledge, existing adaptive-to-model tests are limited to problems with continuous predictors. Moreover, those tests still involve nonparametric estimation or empirical processes in high-dimensional spaces to accommodate alternative hypotheses with no dimension reduction structure. Second, similarly to any projection-based test, such as Zhu, Fang, and Bhatti (1997) and Lavergne and Patilea (2012), our test projects the continuous predictors onto a one-dimensional subspace of the partial central subspace (e.g., Chiaromonte, Cook, and Li 2002; Feng et al. 2013) at any direction. This makes the test involve only a univariate predictor while keeping the omnibus property. Compared with Zhu, Fang, and Bhatti (1997) and Lavergne and Patilea (2012), our method fully adopts the dimension reduction structure under the null hypothesis. It derives a closed formula by choosing a Gaussian kernel function, avoiding the difficulty of numerically calculating the integral. To achieve the above goal, we develop a Discretization-Expectation Ordinary Least Squares estimation (DEOLS) to identify the partial central subspace; the details are presented in the next section. We suggest a residual-based wild bootstrap method to determine critical values, which also fully uses the information in the hypothetical model so that the approximation can be as close to the sampling null distribution as possible.
The materials in this article are organized as follows. In Section 2, we construct the test. Since partial sufficient dimension reduction plays a crucial role in the model adaptation of the proposed test, we develop a discretization-expectation ordinary least squares estimation (DEOLS). Section 3 presents asymptotic properties of the new test. Section 4 includes simulation studies and the analyses of two real datasets for illustration. The regularity conditions and all the proofs of the theoretical results are given in the supplementary materials.

Basic Construction
As the data structure under the alternative hypothesis is unknown, we may consider the tautological regression model
Y = g(X, W) + ε, (2)
where g(X, W) = E(Y|X, W) with g(·, ·) being an unknown and smooth function. It is worth noting that, because the function g is unknown, any purely nonparametric regression model (2) can be reformulated as a special partially multi-index regression model:
Y = g(B⊤X, W) + ε, (3)
where B is a p_1 × q orthogonal matrix with an unknown number q such that 1 ≤ q ≤ p_1, g(·, ·) is still an unknown smooth function, and E(ε|X, W) = 0. For identifiability, we assume that the matrix B satisfies B⊤B = I_q. Based on this observation, we consider the alternative model (3), which covers more model structures and is widely used in the sufficient dimension reduction field.
We reformulate the hypotheses as
H_0: E(Y|X, W) = G(β⊤X, W, γ) almost surely for some (β⊤, γ⊤)⊤, versus
H_1: E(Y|X, W) ≠ G(β⊤X, W, γ) on a set of positive probability for all (β⊤, γ⊤)⊤.
The null and alternative models can then be unified. Under the null hypothesis, q = 1 and B = β/‖β‖_2, where ‖A‖_2 denotes the L_2-norm of a vector A throughout this article. Under the alternative hypothesis, q ≥ 1. This offers us a way to construct a test that is automatically adaptive to the null and alternative models by consistently determining q and estimating B under both the null and the alternative models.
According to Lemma 1 of Escanciano (2006) and Lemma 1 of Li, Chiu, and Zhu (2019), we obtain the following lemma that offers the equivalence between the original hypothesis and the one with the projected predictors.
Lemma 1. Consider a random variable U ∈ R with E(U²) < ∞, a continuous random vector Z ∈ R^q, and a discrete random vector W. Then E(U|Z, W) = 0 almost surely if and only if the projected-predictor moment condition in (4) holds, where f_α(·) is the conditional density function of (α⊤Z, W) given α, and μ(·) is the marginal density of α.
The above equality entails the quantity in (5), where f_β(β⊤X, W) is the density function of (β⊤X, W). Under the alternative hypothesis H_1, since E(ε|B⊤X, W) ≠ 0, even when q = 1 we have (6). Combining the formulas (5) and (6), we obtain (7). Both quantities in (4) and (7) offer ways, though with different model structures, to reject the null hypothesis for large values of a test statistic derived from V. Under the null and alternative hypotheses, these two quantities depend on β⊤X and B⊤X, respectively. Thus, under the null hypothesis, V depends only on the univariate predictor β⊤X, while under the alternatives it depends on the univariate projected predictor α⊤B⊤X, which makes the proposed test omnibus. Therefore, this hybrid construction can not only significantly avoid the curse of dimensionality but also enjoy the omnibus property. To achieve the model adaptation property, we need an automatic identification of the matrix B according to the null and alternative hypotheses. Sufficient dimension reduction methods can accommodate this requirement. Moreover, we also develop theoretical results for its estimator: with probability going to one, the estimated dimension q̂ satisfies (1) under the null hypothesis, q̂ = 1 and B = κβ for some nonzero constant κ; (2) under the alternative hypothesis, q̂ = q ≥ 1. Therefore, we will show that the empirical version of V in (7) can achieve the purpose of dimension reduction. The next step is thus to construct an estimator B̂_q̂ of B such that, under the null hypothesis, B̂_q̂ converges to κβ and, under the alternative hypotheses, to B.

Suppose we have iid random samples {x_i, w_i, y_i}_{i=1}^n drawn from the joint distribution of (X, W, Y). To handle the nonparametric estimation involved in the empirical version of V, we adopt the popularly used approach developed by Racine and Li (2004). For an unordered predictor, we use a variation on Aitchison and Aitken's (1976) kernel function
l̄(w_is, w_js, λ_s) = 1 if w_is = w_js, and λ_s otherwise, (8)
where λ_s ∈ [0, 1] is a smoothing parameter. In particular, when λ_s = 0, l̄(w_is, w_js, λ_s) becomes an indicator function, and when λ_s = 1, l̄(w_is, w_js, λ_s) becomes a uniform weight function so that the predictor does not affect the nonparametric estimation results. Following the suggestion of Hsiao, Li, and Racine (2007), for an ordered predictor we adopt the kernel
l̃(w_is, w_js, λ_s) = λ_s^{|w_is − w_js|}. (9)
Again, λ_s = 0 yields an indicator function and λ_s = 1 a uniform weight function. Combining the formulas (8) and (9), we derive the product kernel function L(w_i, w_j, λ) given in (10).

Let B̂_q̂, defined later, be a partial sufficient dimension reduction estimator with an estimated structural dimension q̂ of q. When q̂ = 1, we estimate E(ε|B⊤X, W) at any (B̂_q̂⊤x_i, w_i) by the leave-one-out kernel estimator in (11), where K_h(·) = K(·/h)/h with K being a univariate kernel density function and h a bandwidth, and ε̂_i = y_i − G(β̂⊤x_i, w_i, γ̂) with β̂ and γ̂ being the nonlinear least squares estimators of β and γ, respectively. The density function f(·) of (B⊤X, W) at any (B̂_q̂⊤x_i, w_i) can be estimated by (12), and V_2 is estimated by V_{2n}. When q̂ > 1, V_{2n} may involve some difficulty of numerical calculation due to the high-dimensional integral. Motivated by Li, Chiu, and Zhu (2019), we choose the Gaussian kernel function K(t) and take the measure μ(·) to be Gaussian as well to derive a closed formula. To be precise, let K(t) = (2π)^{−1/2} exp(−t²/2), and consider α ∼ N(0, h²I_q̂), where I_q̂ denotes the identity matrix of dimension q̂, so that the density function of α is μ(α) = (2π)^{−q̂/2} h^{−q̂} exp(−α⊤α/(2h²)). By applying a computation parallel to Lemma 2 of Li,
Chiu, and Zhu (2019), the integral has the closed formula given in (12). Therefore, V_{2n} takes the form in (13). Combining the formulas (11) and (13), a non-standardized test statistic V_n is constructed as in (14).

Remark 1. The test statistic proposed in Hsiao, Li, and Racine (2007) is given in (15), where K_h̄(·) denotes the p_1-dimensional kernel K̄ scaled by the bandwidth vector h̄ = (h̄_1, ..., h̄_{p_1}). There are two differences between the formulas (14) and (15). First, we use the estimated projections B̂_q̂⊤x_i rather than the original p_1-dimensional predictors x_i. We will prove that under H_0, in probability, q̂ = 1 and B̂_q̂ → β/‖β‖_2, which implies that our test reduces the dimension p_1 down to 1. Second, under the alternative hypothesis, our test is an aggregation over the univariate projected predictors at all projection directions in the Euclidean space {α⊤B̂_q̂⊤X : α ∈ R^q̂}.
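To make the discrete-predictor weighting concrete, the sketch below implements the unordered kernel (8), the ordered kernel (9), and the product kernel (10), together with one plausible leave-one-out double-sum form of the non-standardized statistic for the case q̂ = 1; the routine V1n_sketch is our own illustrative assumption and should not be read as the article's exact formula (14).

```python
import numpy as np

def l_unordered(wi, wj, lam):
    # variation on Aitchison-Aitken's kernel, formula (8):
    # weight 1 when the categories match, lambda_s otherwise
    return np.where(wi == wj, 1.0, lam)

def l_ordered(wi, wj, lam):
    # ordered-predictor kernel, formula (9): lambda_s^{|w_is - w_js|}
    return lam ** np.abs(wi - wj)

def L_product(Wi, Wj, lams, ordered):
    # product kernel (10) over the p2 discrete predictors
    out = 1.0
    for s, lam in enumerate(lams):
        ker = l_ordered if ordered[s] else l_unordered
        out = out * ker(Wi[..., s], Wj[..., s], lam)
    return out

def V1n_sketch(resid, proj, W, h, lams, ordered):
    # One plausible leave-one-out empirical version when q-hat = 1 (assumption):
    # (1/(n(n-1))) * sum_{i != j} e_i e_j K_h(B'x_i - B'x_j) L(w_i, w_j, lambda)
    n = len(resid)
    d = (proj[:, None] - proj[None, :]) / h
    K = np.exp(-0.5 * d**2) / (np.sqrt(2 * np.pi) * h)   # Gaussian kernel K_h
    Lw = L_product(W[:, None, :], W[None, :, :], lams, ordered)
    M = K * Lw * np.outer(resid, resid)
    np.fill_diagonal(M, 0.0)                             # leave-one-out: drop i = j
    return M.sum() / (n * (n - 1))
```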
Remark 2. In the absence of the covariate W, the model adaptive test of Guo, Wang, and Zhu (2016) is defined as in (16), where K_h(·) denotes the q-dimensional kernel K scaled by the bandwidth h.
The comparison between the tests in (14) and (16) reveals two differences. First, under the null hypothesis H_0, V_n reduces to V_n^GWZ with probability going to one. Therefore, the proposed test shares the advantages of Guo, Wang, and Zhu's (2016) test. Second, under the alternative hypotheses, the proposed test still involves only univariate predictors at the projection directions in R^q̂, whereas Guo, Wang, and Zhu's (2016) test still involves high-dimensional nonparametric estimation. Hence, the proposed test V_n can also greatly alleviate the curse of dimensionality.
Next, we will develop a discretization-expectation ordinary least squares estimation (DEOLS) for identifying and estimating B in Section 2.2 and suggest a criterion to estimate the dimension q in Section 2.3.

Discretization-Expectation Ordinary Least Squares Estimation
Since the function g(·, ·) is unknown, for any q × q orthogonal matrix C, the unknown function g(·, ·) can be written as g(B⊤X, W) = g̃(B̃⊤X, W) with B̃ = BC and g̃(·, ·) = g(C·, ·). This illustrates that the matrix B cannot be identified in the model (3). But sufficient dimension reduction theory ensures that we can identify the space spanned by the column vectors of B (see Chiaromonte, Cook, and Li 2002). Hence, identifying a basis matrix B of this space is sufficient for constructing the test statistic described herein. Estimating B is therefore an estimation problem related to the partial central subspace, namely S^{(W)}_{Y|X}. More precisely, the partial central subspace is defined as the intersection of all subspaces S such that Y ⊥⊥ X | (P_S X, W), where ⊥⊥ stands for "independent of" and P_(·) indicates a projection operator with respect to the standard inner product; the structural dimension is q = dim{S^{(W)}_{Y|X}}. Chiaromonte, Cook, and Li (2002) first developed partial sliced inverse regression estimation (PSIR) under the homogeneous covariance condition. Li, Cook, and Chiaromonte (2003) developed the subpopulation ordinary least squares estimation (pOLS) for subpopulation mean functions. Wen and Cook (2007) suggested optimal partial inverse regression estimation (OPIRE) by taking a minimum discrepancy approach for regressions with mixed continuous and categorical predictors. Li, Li, and Zhu (2010) proposed groupwise dimension reduction (GDR), which can also deal with this problem. Hilafu and Yin (2013) proposed a projective resampling partial SIR (PRPSIR) for multivariate regressions with categorical predictors. Liu, Chiaromonte, and Li (2017) suggested a structured ordinary least squares (sOLS) estimation to estimate the structured mean dimension reduction subspace. Some of these methods are computationally demanding since the resulting estimators need to be solved via iterative algorithms, that is, by iteratively estimating the nonparametric components and the parametric components; see, for example, Li, Li, and Zhu (2010) and Guo et al. (2015). Tuning parameters are required by PSIR (Chiaromonte, Cook, and Li 2002), GDR (Li, Li, and Zhu 2010), and OPIRE (Wen and Cook 2007). Although pOLS (Li, Cook, and Chiaromonte 2003) and sOLS (Liu, Chiaromonte, and Li 2017) can overcome these two difficulties, they mainly focus on the case where each group contributes a direction within each subpopulation.
In this section, we develop a discretization-expectation ordinary least squares estimation (DEOLS). Suppose the observations are split into c subpopulations according to the value of W; for w = θ_1, ..., θ_c, let (X_w, Y_w) denote an observation from the subpopulation with W = w, n_w the corresponding subsample size, and S_{Y_w|X_w} the central subspace for the regression of Y_w on X_w. Li, Cook, and Chiaromonte (2003) proved that the partial central subspace S^{(W)}_{Y|X} is the sum of the subpopulation central subspaces S_{Y_w|X_w}, w = θ_1, ..., θ_c. The estimation steps are as follows.

Step 1. (Discretization) For each w = θ_1, ..., θ_c, discretize the response Y_w into the indicators Y_w(t) = I(Y_w ≤ t).

Step 2. (Ordinary least squares) Denote the subpopulation covariance matrices of the predictors as Σ_w = cov(X_w), w = θ_1, ..., θ_c, and let b_w(t) be the ordinary least squares direction from regressing Y_w(t) on X_w. Under the linearity condition (see Li 1991), the normalized directions b_w(t)/(‖b_w(t)‖_2 + ν) belong to S^{(W)}_{Y|X}, where ν is any positive constant. Thus, define M_w(t) as the outer product of the normalized direction with itself.

Step 3. (Expectation) Define the target matrix M by aggregating M_w = E{M_w(Ỹ_w)} over the subpopulations w = θ_1, ..., θ_c, where Ỹ_w is an independent copy of Y_w. Thus, M is a p_1 × p_1 positive semidefinite matrix satisfying Span(M) ⊆ S^{(W)}_{Y|X}. If Span(M) = S^{(W)}_{Y|X}, B consists of the eigenvectors associated with the nonzero eigenvalues of M.
Step 4. (Estimation) The target matrix M can be estimated by its sample analogue M_n, in which Σ_w, the directions b_w(t), and the expectations are replaced by their empirical counterparts computed from the subsample {(x_{w,i}, y_{w,i})}_{i=1}^{n_w}, with the discretization points t running over the observed responses y_{w,i}.
Then, when q is given, B̂_q is formed by the eigenvectors of M_n associated with its q largest eigenvalues.
Remark 3. DEOLS can be obtained without any iterative algorithms, which are required by Li, Li, and Zhu (2010), and is computationally inexpensive without any tuning parameter selection, which is needed by the methods in Chiaromonte, Cook, and Li (2002), Li, Li, and Zhu (2010), and Wen and Cook (2007). Furthermore, DEOLS can deal with a general structure for regressions with categorical predictors, whereas pOLS (Li, Cook, and Chiaromonte 2003) and sOLS (Liu, Chiaromonte, and Li 2017) mainly focus on the case where each group contributes a direction within each subpopulation. Due to its OLS structure, it is easy to derive that, for each w = θ_1, ..., θ_c, the estimated direction b_{w,n}(t) converges in distribution, at the root-n_w rate, to a normal limit with covariance matrix Σ_w(t), where the notation →_d denotes convergence in distribution. Section 2.3 shows that DEOLS can be easily used to construct a criterion to determine the structural dimension q.

Remark 4. In Step 2, a constant ν is added in the denominator of M_{w,n}(t). This modification serves two purposes. First, it avoids the ratio 0/0 when b_w(t) = 0_{p_1×1} for some t. Second, to alleviate the dominating effect of directions with large values, we use a nonlinear transformation of the original directions b_w(t) to b_w(t)/(‖b_w(t)‖_2 + ν). Theoretically, the value of ν does not affect the identification, but in practice it might affect the estimation efficiency to a certain extent. We find that this is a minor issue even in finite sample scenarios and thus recommend ν = 1 in the numerical studies.
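A minimal sketch of the DEOLS construction is given below, under the assumptions that b_w(t) is the ordinary least squares slope of I(Y_w ≤ t) on X_w within each subpopulation, that directions are normalized by ‖b_w(t)‖_2 + ν, and that subpopulations are weighted by their empirical frequencies; these choices are our reading of Steps 1-4, not a verbatim implementation of the article's formulas.

```python
import numpy as np

def deols_matrix(X, Y, W, nu=1.0):
    # Discretization-Expectation OLS: pool normalized OLS directions over the
    # subpopulations of W and over discretization points t of the response.
    n, p1 = X.shape
    M = np.zeros((p1, p1))
    for w in np.unique(W):
        idx = (W == w)
        Xw, Yw = X[idx], Y[idx]
        nw = idx.sum()
        Xc = Xw - Xw.mean(axis=0)                      # center the predictors
        Sigma_w = (Xc.T @ Xc) / nw                     # cov(X_w)
        Sinv = np.linalg.pinv(Sigma_w)
        Mw = np.zeros((p1, p1))
        for t in Yw:                                   # expectation over an independent copy of Y_w
            Z = (Yw <= t).astype(float)                # discretized response I(Y_w <= t)
            bw = Sinv @ (Xc.T @ (Z - Z.mean())) / nw   # OLS direction b_w(t)
            bw = bw / (np.linalg.norm(bw) + nu)        # normalization with ridge constant nu
            Mw += np.outer(bw, bw)
        M += (nw / n) * (Mw / nw)                      # weight by the empirical P(W = w) (assumption)
    return M

def deols_directions(X, Y, W, q, nu=1.0):
    # eigenvectors of M_n associated with its q largest eigenvalues
    M = deols_matrix(X, Y, W, nu)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:q]]
```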

Structural Dimension Estimation
As commented above, the structural dimension q is essential to identify the partial central subspace. Here we recommend the thresholding double ridge ratio criterion (TDRR), which is inspired by the idea of Zhu et al. (2020). Define the two ridge ratio sequences in (19) based on λ̂_1 ≥ ··· ≥ λ̂_{p_1}, the eigenvalues of the matrix M_n, with λ̂_{p_1+1} = λ̂_{p_1+2} = 0 being two artificial "eigenvalues," and with c_{1n} and c_{2n} being proper ridges that converge to 0 such that the dimension q can be consistently estimated. An estimator q̂ of q is then defined through (19) with a threshold 0 < τ < 1. Following the suggestion of Zhu et al. (2020), we adopt τ = 0.5 as a rule of thumb, and we recommend the ridge values c_{1n} = c_{2n} = 0.04(log(n))^{1/10} p_1/√n in the numerical studies.
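Because the exact double ridge ratio formulas follow Zhu et al. (2020) and (19), the sketch below only illustrates a single ridge-ratio thresholding rule built from the eigenvalues of M_n with the recommended ridge value; it is an assumption-laden stand-in for the TDRR criterion, not the authors' exact estimator.

```python
import numpy as np

def ridge_ratio_dimension(M_n, n, tau=0.5):
    # A simple ridge-ratio thresholding rule (illustrative stand-in for TDRR):
    # append two artificial zero eigenvalues and locate the first eigenvalue
    # drop whose ridged ratio falls below the threshold tau.
    p1 = M_n.shape[0]
    c_n = 0.04 * (np.log(n) ** 0.1) * p1 / np.sqrt(n)    # recommended ridge value
    lam = np.sort(np.linalg.eigvalsh(M_n))[::-1]
    lam = np.concatenate([lam, [0.0, 0.0]])              # two artificial "eigenvalues"
    ratios = (lam[1:] + c_n) / (lam[:-1] + c_n)          # ridged eigenvalue ratios
    below = np.where(ratios < tau)[0]
    return int(below[0]) + 1 if below.size else 1        # estimated structural dimension
```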
The algorithm is very easy to implement, and the consistency of q̂ is presented in the following proposition.
Proposition 1. In addition to the regularity conditions in the Appendix in the supplementary materials, assume that c_{1n} → 0, c_{2n} → 0, and n c_{1n} c_{2n} → ∞. Then (i) under the null hypothesis H_0, P(q̂ = 1) → 1; (ii) under the alternative hypothesis H_1, P(q̂ = q) → 1. This proposition ensures that the estimated dimension q̂ is consistent for the true dimension under both the null and alternative hypotheses.

Asymptotics under the Null Hypothesis
We state the asymptotic results under the null hypothesis.
Theorem 1. Under H_0 and the regularity conditions in the Appendix in the supplementary materials, nh^{1/2}V_n converges in distribution to N(0, s²) for some constant s² > 0. Further, s² can be consistently estimated by an estimator ŝ², in which B̂_1 denotes the first column of the estimated matrix B̂_q̂.
Remark 5. It can be proved that ŝ² is consistent for the asymptotic variance of nh^{1/2}V_n under the null hypothesis; more details can be found in the justification of this theorem in the Appendix in the supplementary materials. Under the alternative hypothesis, ŝ² may not be consistent for the variance of nh^{1/2}V_n. But this is a minor loss, as its merit is that the estimator only uses a univariate predictor. This differs from Lavergne and Patilea (2012) and Hsiao, Li, and Racine (2007), who used high-dimensional nonparametric estimation for the variance, which encounters the dimensionality problem.
Under H_0, Theorem 1 and the Slutsky theorem yield that the standardized test statistic T_n = {nh^{1/2}V_n}²/ŝ², given in (22), converges in distribution to χ²_1, the chi-square distribution with one degree of freedom.

A Power Study
We now study the sensitivity of the test statistic T_n under the alternative hypothesis. Consider the following sequence of local alternative models:
H_{1n}: Y = G(β⊤X, W, γ) + C_n g(B⊤X, W) + ε, (23)
where g(·) is a smooth function and C_n controls the deviation from the null model.
Lemma 2. Under the regularity conditions in the Appendix in the supplementary materials, for the sequence of models in (23), as n → ∞, the estimator q̂ obtained from (19) satisfies P(q̂ = q) → 1.
Theorem 2. Under the regularity conditions in the Appendix in the supplementary materials, we have the following asymptotic results.
(i) Under H_{1n} with fixed C_n > 0, if q = 1, V_n converges in probability to some constant C_0 > 0, where the notation →_p denotes convergence in probability. (ii) Under H_{1n} with C_n n^{1/2}h^{1/4} → ∞ and C_n → 0, for any constant c_0, P(T_n ≥ c_0) → 1.

Remark 6. Together with the results in Theorems 1 and 2, we can see the difference between our proposed test and Hsiao, Li, and Racine's (2007) test. First, our proposed test has convergence rates that do not depend on the dimension p_1 under both the null and alternative models, while Hsiao, Li, and Racine's (2007) test does not share this property. Specifically, under the null hypothesis, V_n converges to its weak limit at the rate of order nh^{1/2}, whereas Hsiao, Li, and Racine's (2007) test has the much slower convergence rate of order nh^{p_1/2}. Moreover, T_n can detect local alternatives distinct from the null at the rate of order n^{−1/2}h^{−1/4}, which is much faster than the rate n^{−1/2}h^{−p_1/4} that Hsiao, Li, and Racine's (2007) test can achieve. Additionally, the convergence rate of the new test statistic is free of the dimension of the discrete predictor vector. The existing adaptive-to-model tests, such as Guo, Wang, and Zhu (2016), can be extended to handle the cases with discrete predictors using a conventional frequency estimation method that splits the samples into several subsets ("cells"). However, when using this frequency approach in finite-sample applications, there may not be enough observations remaining in each cell to produce reliable nonparametric estimators, which results in a commensurate loss in estimation efficiency. Therefore, it is expected that frequency-based procedures (which correspond to λ_s = 0) would suffer from low power due to the discrete predictors. Hsiao, Li, and Racine (2007) also found this phenomenon for locally smoothing testing methods and proposed a nonparametric kernel-based model specification test with mixed discrete and continuous data. Their test has a convergence rate of order nh^{p_1/2} that is also free of the dimension of the discrete predictor vector. Our test yields the faster rate of order nh^{1/2}, which is free of the dimensions of both the discrete and continuous predictor vectors. This is not surprising because, as λ_s → 0 for s = 1, ..., p_2 under the null hypothesis, Guo, Wang, and Zhu's (2016) test and our test should be close to each other when n is sufficiently large. Therefore, smoothing the discrete predictors does not lead to power gains asymptotically when all predictors are relevant. However, in finite-sample scenarios, a test that smooths the discrete predictors may be more powerful than a frequency-based test. More details can be found in Hsiao, Li, and Racine (2007).

Residual-based Wild Bootstrap
In some applications, the asymptotic normal approximation does not work well in finite sample settings. Hsiao, Li, and Racine (2007) also found this phenomenon for locally smoothing testing methods. In the spirit of the wild bootstrap, first suggested by Wu (1986), we develop a modified residual-based wild bootstrap method to approximate the sampling null distribution of T_n, which fully uses the dimension reduction model structure under the null hypothesis.
The algorithm for the p-value determination is as follows.

Step 1. Generate a sequence of iid variables U = {u_i}_{i=1}^n from a two-point (Bernoulli-type) distribution with zero mean and unit variance, and generate y*_i = G(β̂⊤x_i, w_i, γ̂) + ε*_i with ε*_i = ε̂_i × u_i.

Step 2. Based on the bootstrap sample {(x_i, w_i, y*_i)}_{i=1}^n, construct the bootstrap statistic T̃_n(U) in the same way as T_n, where ε̂*_i = y*_i − G(β̂*⊤x_i, w_i, γ̂*) with {β̂*, γ̂*} being the nonlinear least squares estimators of β and γ based on the bootstrap sample, and B̂_1 denotes the first column of the estimated matrix B̂_q̂ defined in Section 2.2.
Step 3. Repeat Steps 1-2 m times to obtain T̃_n(U_j), j = 1, ..., m. The p-value is estimated by p̂ = (1/m) Σ_{j=1}^m I{T̃_n(U_j) ≥ T_n}, where T_n is defined in (22). Whenever p̂ ≤ α for a given significance level α, reject H_0; equivalently, the critical value is determined as the (1 − α) × 100% upper percentile of all T̃_n(U_j), j = 1, ..., m.
Remark 7. It is worth pointing out that, in this algorithm, we only use the univariate B̂_1⊤X with B̂_1 being the first column of the matrix B̂_q̂ defined in Section 2.2. This differs from the traditional wild bootstrap, which uses the vector B̂_q̂⊤X; more details can be found in Hsiao, Li, and Racine (2007) and Guo, Wang, and Zhu (2016). This choice makes the approximation as close to the sampling null distribution as possible.
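The residual-based wild bootstrap above can be summarized by the schematic sketch below; the routines fit_null and test_stat are placeholders for the nonlinear least squares fit and the statistic in (22), and the Mammen two-point weight distribution is a common wild-bootstrap choice used here purely for illustration (the article's exact two-point law may differ).

```python
import numpy as np

def wild_bootstrap_pvalue(x, w, y, fit_null, test_stat, m=400, seed=0):
    # fit_null(x, w, y)  -> (fitted values G_hat, residuals e_hat)  [placeholder]
    # test_stat(x, w, y) -> scalar test statistic T_n               [placeholder]
    rng = np.random.default_rng(seed)
    n = len(y)
    T_obs = test_stat(x, w, y)
    G_hat, e_hat = fit_null(x, w, y)

    # Mammen's two-point distribution: mean 0, variance 1
    a, b = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    p_a = (np.sqrt(5) + 1) / (2 * np.sqrt(5))

    T_boot = np.empty(m)
    for j in range(m):
        u = np.where(rng.random(n) < p_a, a, b)   # iid bootstrap weights u_i
        y_star = G_hat + e_hat * u                # y*_i = G_hat_i + e_hat_i * u_i
        T_boot[j] = test_stat(x, w, y_star)       # refit the null model and recompute T
    return np.mean(T_boot >= T_obs)               # bootstrap p-value
```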
The following theorem presents the consistency of the conditional distribution approximation.
Theorem 3. Under the conditions in Theorem 1 and the null hypothesis, for almost all sequences {(y_1, x_1, w_1), ..., (y_n, x_n, w_n), ...}, the conditional distribution function of the standardized bootstrap statistic, given the data, converges uniformly to Φ(·), where Φ(·) denotes the cumulative distribution function of a standard normal random variable.

Simulations
In this section, we conduct Monte Carlo simulations to evaluate the finite sample performance of the proposed test. Write T_n as T_n^DEOLS to stress the use of DEOLS. We focus on two types of numerical experiments below. Study 1 involves mixed discrete and continuous predictors to compare T_n^DEOLS with HLR's test (Hsiao, Li, and Racine 2007). As our method can also be applied in the absence of the discrete predictors W, Study 2 considers models with only continuous predictors to compare T_n^DEOLS with the bootstrap versions of ZH's test (Zheng 1996), SZ's test (Stute and Zhu 2002), and GWZ's test (Guo, Wang, and Zhu 2016). These tests can be viewed as representatives of locally smoothing, globally smoothing, and dimension-reduction-based tests. Without confusion, HLR's, ZH's, SZ's, and GWZ's tests are written as T_n^HLR, T_n^ZH, T_n^SZ, and T_n^GWZ, respectively.
For all simulations, we use the univariate Gaussian kernel function for K(·), while the kernel for the discrete predictors is defined in formula (10). We consider several examples with different sample sizes n = 50, n = 100, and n = 200, where the empirical sizes and powers are computed via 2000 replications at the significance level α = 0.05. When the bootstrap method is used, the test statistics are computed from 400 wild bootstrap samples. Throughout the simulations, the data X_1, ..., X_n are generated iid from the multivariate normal distribution N(0, Σ_1) or N(0, Σ_2), where Σ_1 = I_{p_1×p_1} and Σ_2 = (σ_{ij})_{p_1×p_1} with σ_{ij} = 0.5^{|i−j|}, respectively. The value a = 0 corresponds to the null hypothesis and a ≠ 0 to the alternatives. The dimensions p_1 = 4 and p_1 = 8 correspond to low- and high-dimensional scenarios, respectively. For T_n^DEOLS, some unreported results show that the empirical sizes tend to be slightly larger than 0.05. Based on the simulation studies conducted in Guo, Wang, and Zhu (2016) and Li, Zhu, and Zhu (2021), and on our numerical studies about the choice of the adjustment factor, aiming to balance significance level maintenance and power performance, we recommend using an adjusted version T̃_n^DEOLS, which is asymptotically equivalent to T_n^DEOLS.

Study 1. In this study, we consider four examples to examine the performances of T̃_n^DEOLS and T_n^HLR for models with mixed discrete and continuous predictors.

Example 1. Consider a model whose null part is linear in β⊤X and W and whose departure from the null is controlled by the constant a, where W takes values in {0, 1} with P(W_i = l) = 0.5 for l = 0, 1, and ε is generated from the univariate standard normal distribution N(0, 1). To check the sensitivity of the test to the bandwidth selection, we conduct a simulation to choose h_0 and λ_0 in h = h_0 n^{−1/(4+1)} and λ = λ_0 n^{−2/(4+1)}. Figure 1 reports the empirical sizes and powers for the different bandwidths when a = 1 and n = 100. We can see that the empirical sizes are relatively robust against the different bandwidths and the empirical powers are not very sensitive to the bandwidths. We therefore recommend h_0 = 1 and λ_0 = 1 in the later simulation studies. Hsiao, Li, and Racine (2007) used cross-validation to select the tuning parameters, so T_n^HLR is very time-consuming, especially when the dimension of the predictor vector is high. Based on our very limited numerical studies (not reported in this article), when p_1 = 8 and n = 100, the CPU time for computing Hsiao, Li, and Racine's (2007) test is more than 450 times that of computing the proposed test statistic. For computational efficiency, when p_1 = 8, we choose h̄ in the formula (15) by adding a constraint on the bandwidth components.

The empirical sizes and powers of T_n^HLR and T̃_n^DEOLS are reported in Table 1. Some findings are as follows. First, T̃_n^DEOLS can control the empirical sizes well; T_n^HLR tends to be conservative, especially when p_1 = 4, but when p_1 = 8, the significance level is better maintained. Second, the empirical powers of the proposed test T̃_n^DEOLS increase reasonably with larger a, and the test is significantly and uniformly more powerful than T_n^HLR, especially when the sample size is small. For p_1 = 4 and p_1 = 8, the dimension of X has less influence on T̃_n^DEOLS than it does on T_n^HLR. Regardless of whether X ∼ N(0, Σ_1) or X ∼ N(0, Σ_2), T_n^DEOLS works better than T_n^HLR.
Similar conclusions hold for the comparison between T̃_n^DEOLS with λ = 0 and T_n^HLR with λ = 0. To further check the performance of T̃_n^DEOLS against the influence of correlation between the discrete and continuous predictors, and its performance with multivariate discrete predictors, we design the following two models, referred to as Example 2.
Here p_1 = 4, 8, β = (1, 1, ..., 1)⊤/√p_1, X follows N(0, Σ_1), and ε is drawn independently from N(0, 1). It is obvious that X and W are correlated in the above two cases. The results are reported in Table 2. Again, the significance level maintenance of T_n^HLR is an issue: especially when p_1 = 4, the empirical sizes are larger than 0.0835. T̃_n^DEOLS works better in this respect and also delivers higher power.
In Examples 1 and 2, the hypothetical and alternative models share the same projection direction. To further assess the performance of T̃_n^DEOLS against the influence of the dimension of the discrete predictor vector W and of different distributions of the error term ε, we consider the following models with two projection directions under the alternative hypothesis.
The results in Tables 3 and 4 show that the correlation structure of X does not deteriorate the power performance, and the heavy tail of the error term does not have a significant impact on the performance of our test. When the dimension is p_1 = 4, the empirical sizes of T_n^HLR exceed 0.08 under the different sample sizes, while T̃_n^DEOLS maintains the significance level well with powers comparable to those of T_n^HLR. As the dimension increases to p_1 = 8, although T_n^HLR improves its significance level maintenance, its empirical powers get lower, especially with the small sample size n = 50. These findings suggest that the proposed test is robust against the correlation structure of X and the different error distributions. The comparison between the results in Tables 3 and 4 shows that the dimension of W negatively influences our test performance: when the dimension of W increases, the empirical powers become slightly lower. Comparing these results with those in Examples 1-3, we can see that T_n^HLR in many cases maintains the significance level as well as in the previous examples, and its power becomes reasonably lower as the structural dimension q increases. But T̃_n^DEOLS is still superior to its competitor T_n^HLR in these examples. As commented above, the null models in Examples 1-3 are all linear. We now consider a nonlinear model under the null hypothesis against alternative models with complex structures.
Example 4. We generate the data from the following model, where X is generated iid from N(0, Σ_1), W takes values in {0, 1} with P(W = l) = 0.5 for l = 0, 1, and ε follows N(0, 1). Here p_1 = 8 and p_2 = 1. Under the null model, β_1 = (2, 0, ..., 0)⊤. Under the alternative hypothesis, q = 8 and B = (β_1/2, β_2, ..., β_q) with β_i being the unit vector whose ith element is 1, for i = 2, ..., p_1. Note that the alternative model is a full model without dimension reduction structure. Simulation results are presented in Table 5. The performance of T_n^HLR goes in the opposite direction: it becomes very liberal while having almost no power. Comparing Tables 1-5, we observe that the power performance of T̃_n^DEOLS in Example 4 is similar to that in Examples 1-3. These results imply that even for the purely nonparametric alternative regression model (p_1 = q = 8), the proposed test can still have gains in power performance even though λ = 0.
To further explore the performance of our proposed test when the underlying models have the same projection direction and when they have two directions, respectively, we check the quality of the estimator q̂ based on Example 1 with q = 1 and Example 3 with q = 2 under the alternatives. The results are reported in Table 6, which gives the frequencies of the estimated structural dimensions, with 1000 replications under the null and the alternatives, respectively, and sample size n = 200. It is observed that when the dimension p_1 is small, the estimation yields correct decisions with large proportions. Although underestimation occurs under the alternatives when the dimension p_1 becomes large, our test can still work with high powers compared with the other competitors when the dimensions are high.
Study 2. Examples 1-4 concentrate on models with mixed discrete and continuous predictors; beyond this, the proposed test also performs very well on models with only continuous predictors.
Example 5. Consider the following model having the same projection direction in both the hypothetical and alternative models, where β = (1, 1, ..., 1)⊤/√p_1, X is generated iid from N(0, Σ_1), and ε is drawn independently from N(0, 1). Some unreported results show that the empirical sizes are maintained well and the empirical powers with a = 1 are stable under different bandwidths h. We therefore still recommend h = n^{−1/(4+1)} as in Study 1.
In Table 7, we report the simulation results of our test T̃_n^DEOLS and Zheng's (1996) test T_n^ZH. The results confirm that the proposed test can detect the alternatives distinct from the null at a faster rate of convergence than Zheng's (1996) test T_n^ZH does. To confirm that T̃_n^DEOLS is still omnibus rather than directional, we consider the following example with more than one direction under the alternative hypothesis.
From Figure 2, we can see that T̃_n^DEOLS performs uniformly better than its competitor T_n^SZ, because T_n^SZ is a directional test. The empirical powers of T̃_n^DEOLS approach 1 as a and the sample size n become larger. Comparing Table 7 and Figure 2, we observe that the power performance of T̃_n^DEOLS in Example 6 is similar to that in Example 5. This means that the proposed test retains the advantage of dimension reduction and is still omnibus.
The following example considers a nonlinear model under the null hypothesis against the model with a higher structural dimension under the alternative hypothesis to make the comparison more comprehensive.
Example 7. Data are generated from a model with a nonlinear null part, where p_1 = 8, X is from N(0, Σ_1), and ε is from N(0, 1), independent of X. Note that the model under the alternative is a full model without dimension reduction structure (p_1 = q = 8).
The results are reported in Figure 3.
Figure 3 indicates the advantage of T̃_n^DEOLS over the competitors in both significance level maintenance and power performance. Comparing the results of this example with those of Example 5, the structural dimension has a negative, but not significant, effect on T̃_n^DEOLS. In contrast, it significantly impacts T_n^GWZ despite its dimension reduction nature. Further, T_n^ZH, unsurprisingly, suffers severely from the dimensionality problem.

Two Real Data Examples
Dataset 1 (Auto MPG). This dataset is obtained from the UCI Machine Learning Repository; it was analyzed by Quinlan (1993), Xia (2007), and Guo, Wang, and Zhu (2016) and is available at http://archive.ics.uci.edu/ml/datasets/Auto+MPG. There are 406 observations in the original dataset. After removing the units with a missing response or predictor values, 392 sample points are left. The response variable is miles per gallon (Y). There are seven predictors: the number of cylinders (W_1), engine displacement (X_1), horsepower (X_2), vehicle weight (X_3), acceleration (X_4), model year (X_5), and origin of the car (1 = American, 2 = European, 3 = Japanese). Following the suggestion in Xia (2007), since the origin of the car contains more than two categories, we set W_2 = 1 if a car is from America and 0 otherwise, and W_3 = 1 if it is from Europe and 0 otherwise. Thus, (W_2, W_3) = (1, 0), (0, 1), (0, 0) correspond to an American car, a European car, and a Japanese car, respectively. To facilitate the explanation, all continuous predictors are standardized separately. Quinlan (1993) used a linear regression model to fit this dataset. As the first step of data analysis, it is necessary to check whether the linear model is adequate; this is a typical model specification testing problem. The value of the test statistic is T̃_n^DEOLS = 47.8541 and the p-value is about 0. Hence, the linear model may not be tenable. That is, although the linear regression model is simple and explicable, it is fragile. Xia (2007) believed that fitting this dataset with a more complicated multi-index model could be reasonable. This result is also consistent with that in Guo, Wang, and Zhu (2016). To further explore the model structure, we use the TDRR criterion in (19) to estimate the number q of linear combinations of the predictors, and obtain q̂ = 2. The residual plot of the linear model fit against B̂⊤X in Figure 4 shows a nonlinear pattern. To see whether a more complicated model could be more suitable, we consider the following check. Split the dataset randomly into a training set D_train of n_t = 262 points and a test set D_test of n_p = 130 points. To avoid random errors, we repeat the experiment 20 times and take the average of the results. Inspired by Xia (2007), we consider a partial linear multi-index model as in (25), where f is an unknown smooth function. By the estimation technique for partial linear single-index models, say Zhu and Xue (2006), combined with the dimension reduction described in Section 2.2, we derive the estimator β̂ of β using the training dataset D_train. The estimator B̂ of B is obtained by DEOLS in Section 2.2, and the function f is estimated by the nonparametric kernel estimator (see, e.g., Racine and Li 2004). To assess the modeling accuracy, we fit the training set with the linear model and with the partial linear multi-index model in (25), respectively. The mean squared errors (MSE), MSE = Σ_{i=1}^{n_t}(y_i − ŷ_i)²/n_t for y_i ∈ D_train, are 16.7966 and 9.3552, respectively. Further, we use the test set D_test to compute the predicted mean squared error (PMSE), PMSE = Σ_{i=1}^{n_p}(y_i − ŷ_i)²/n_p for y_i ∈ D_test; the values are 17.8226 and 9.8915, respectively. We also display the residual plots against B̂⊤X on the test set D_test for these two models in Figure 4. We see no obvious curve patterns except for a heteroscedastic structure. Therefore, the partial linear multi-index model in (25) would be plausible for this dataset.
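The training/test comparison described above can be organized as in the short sketch below, where fit_model and predict_model are placeholders for whichever fit is being assessed (the linear fit or the partial linear multi-index fit in (25)); the 262/130 split and the 20 repetitions mirror the description in the text.

```python
import numpy as np

def mse_pmse(X, W, y, fit_model, predict_model, n_train=262, reps=20, seed=0):
    # Randomly split into training/test sets, fit on the training part, and
    # report the averaged in-sample MSE and out-of-sample PMSE over `reps` splits.
    rng = np.random.default_rng(seed)
    n = len(y)
    mses, pmses = [], []
    for _ in range(reps):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        model = fit_model(X[tr], W[tr], y[tr])                # placeholder fitting routine
        yhat_tr = predict_model(model, X[tr], W[tr])
        yhat_te = predict_model(model, X[te], W[te])
        mses.append(np.mean((y[tr] - yhat_tr) ** 2))          # MSE on the training set
        pmses.append(np.mean((y[te] - yhat_te) ** 2))         # PMSE on the test set
    return float(np.mean(mses)), float(np.mean(pmses))
```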
Dataset 2 (Yacht Hydrodynamics). The dataset is available at http://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics from the UCI Machine Learning Repository. The response variable is the residuary resistance per unit weight of displacement (Y). There are six continuous predictors: longitudinal position of the center of buoyancy (X_1), prismatic coefficient (X_2), length-displacement ratio (X_3), beam-draught ratio (X_4), length-beam ratio (X_5), and Froude number (X_6). Testing for linearity of the underlying model gives the value T̃_n^DEOLS = 52.1794 of the proposed test statistic, and the p-value is 0. Hence, the null hypothesis is rejected. The structural dimension is estimated to be q̂ = 1 via the TDRR criterion. We then consider fitting the logarithm of Y with a polynomial of B̂⊤X as in (26), where B̂ is determined by DEOLS. The value of the test statistic is T̃_n^DEOLS = −0.7695. As it should be nonnegative at the population level, we compute the p-value for |T̃_n^DEOLS| = 0.7695, which is 0.4416. The residual plot against B̂⊤X in Figure 5 shows no obvious pattern. Additionally, we compute the MSEs based on the linear model and the parametric model in (26), obtaining 78.4501 and 0.0553, respectively. Thus, it is reasonable to fit this dataset with the regression model in (26).

Conclusion
In this article, we develop a nonparametric projection-based adaptive-to-model specification test, which can be used to check regression models with both discrete and continuous predictors. As it can fully use the information of the dimension reduction structure under the null hypothesis and of the general model structure under the alternatives, the test shows advantages in significance level maintenance and power performance. A topic worth studying in the future is that, although the limiting null distribution is tractable, its performance is not good when used to determine critical values unless the sample size is huge. This may indicate that the convergence rate to the weak limit is still not sufficiently fast. An important issue is then to construct a version of V_n with better asymptotic properties under the null hypothesis.

Figure 1. The empirical size surface of T̃_n^DEOLS.


Figure 3. The empirical sizes and powers curves for Example 7. The solid, dashed-dotted, dashed, and dotted lines represent the results of T̃_n^DEOLS and the competing tests, respectively.

Figure 4. The residual plots against B̂⊤X with B̂ = (B̂_1, B̂_2) obtained from DEOLS. The upper part of this figure presents the residual plots of the linear model with the full data. The middle part and the lower part show, respectively, the residual plots of the linear model and the partial linear multi-index model with the test set.

Figure 5. The residual plot against B̂⊤X obtained from DEOLS for Yacht Hydrodynamics.

Table 3. Empirical sizes and powers of T̃_n^DEOLS and T_n^HLR for H_31 of Example 3.

Table 4. Empirical sizes and powers of T̃_n^DEOLS and T_n^HLR for H_32 of Example 3.

Table 7. Empirical sizes and powers of T̃_n^DEOLS and its competitors for Example 5.
Figure 2. The empirical sizes and powers curves for Example 6. The solid and dashed lines represent the results of T̃_n^DEOLS and T_n^SZ, respectively.
