Efficient Estimation for Models With Nonlinear Heteroscedasticity

Abstract. We study efficient estimation for models with nonlinear heteroscedasticity. In two-step quantile regression for heteroscedastic models, motivated by several undesirable issues caused by the preliminary estimator, we propose an efficient estimator that constrainedly weights information across quantiles. When the weights are optimally chosen under certain constraints, the new estimator simultaneously eliminates the effect of the preliminary estimator and achieves good estimation efficiency. When compared with the Cramér-Rao lower bound, the relative efficiency loss of the new estimator has a conservative upper bound, regardless of the model design structure. The upper bound is close to zero in practical situations. In particular, the new estimator can asymptotically achieve the optimal Cramér-Rao lower bound if the noise has either a symmetric density or the asymmetric Laplace density. Monte Carlo studies show that the proposed method has substantial efficiency gains over existing ones. In an empirical application to GDP and inflation rate modeling, the proposed method has better prediction performance than existing methods.


Introduction
Consider the heteroscedastic model

Y_t = X_t^T β + σ_t e_t,  σ_t = s(X_t, α),  (1)

where β ∈ R^p, s(X_t, α) > 0 is some nonlinear function with unknown α ∈ R^k, and the noise e_t is independent of the covariates X_t. On one hand, the linear location function framework is widely used to model many important financial and economic variables. For example, in finance, the celebrated capital asset pricing model (CAPM) asserts that the risk premium (relative to risk-free interest rates) of an individual security or portfolio equals the market risk premium multiplied by the beta; in economics, the well-known Fisher hypothesis states that nominal asset returns move one-to-one with the inflation rate; and a third example is the predictability of inflation from interest rates (Fama 1975). On the other hand, numerous ARCH models impose different nonlinear structures on σ_t, such as square-root heteroscedasticity σ_t = (α_0 + α_1 X_{t1}^2 + · · · + α_p X_{tp}^2)^{1/2}, linear heteroscedasticity σ_t = α_0 + α_1 X_{t1} + · · · + α_p X_{tp}, and exponential-type heteroscedasticity σ_t = exp{α_0 + α_1 X_{t1} + · · · + α_p X_{tp}}, among others; see Bollerslev (2008) for more examples.
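For concreteness, the three scale functions just listed can be coded directly. The sketch below is illustrative only; the scalar-covariate parameterization with alpha = (a0, a1) is our simplification, not the paper's setup.

```python
import numpy as np

# Illustrative sketch of the three scale functions s(X_t, alpha) named above,
# written for a scalar covariate with alpha = (a0, a1).
def sqrt_scale(x, a0, a1):
    # square-root heteroscedasticity: sigma_t = (a0 + a1 * x^2)^(1/2)
    return np.sqrt(a0 + a1 * np.asarray(x) ** 2)

def linear_scale(x, a0, a1):
    # linear heteroscedasticity: sigma_t = a0 + a1 * x
    return a0 + a1 * np.asarray(x)

def exp_scale(x, a0, a1):
    # exponential-type heteroscedasticity: sigma_t = exp(a0 + a1 * x)
    return np.exp(a0 + a1 * np.asarray(x))

x = np.array([0.0, 1.0, 2.0])
print(sqrt_scale(x, 1.0, 0.25))  # approximately [1. 1.118 1.414]
```

All three forms are positive on their intended parameter ranges, which is what s(X_t, α) > 0 requires.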
Since Koenker and Bassett (1978), quantile regression has become a powerful tool for studying heteroscedastic models. Let Q_e(τ) be the τth quantile of e_t. The conditional τth quantile of Y_t given X_t is the nonlinear function involving parameters (β, α, Q_e(τ)):

Q_{Y_t|X_t}(τ) = X_t^T β + s(X_t, α) Q_e(τ).

However, since nonlinear quantile regression usually requires stringent conditions and is technically and computationally challenging, the literature has almost exclusively focused on linear heteroscedasticity s(X_t, α) = X_t^T α, so that Q_{Y_t|X_t}(τ) = X_t^T γ(τ), γ(τ) = β + Q_e(τ)α. This linear quantile representation plays an essential role in the quantile regression literature on heteroscedastic models (Koenker and Bassett 1982; Gutenbrunner and Jurečková 1992; Koenker and Zhao 1994, 1996; Zhou and Portnoy 1998; Xiao and Koenker 2009; Zhao and Xiao 2014; Zhu, Zheng, and Li 2018). The linearity enables linear quantile regression [throughout, ρ_τ(z) = z(τ − 1_{z<0}) is the τth quantile check loss]:

(β̂(τ), q̂(τ)) = argmin_{b,q} ∑_{t=1}^n ρ_τ(Y_t − X_t^T b − σ̂_t q),  σ̂_t = s(X_t, α̂_0),  (5)

where α̂_0 is some preliminary estimator of α.
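Because the objective in (5) is linear in (b, q), the minimization is a linear program. Below is a minimal numerical sketch (not the paper's implementation) of check-loss regression, with the true scale σ_t plugged in for σ̂_t and an assumed square-root-type scale function, purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(Z, y, tau):
    """argmin_theta sum_t rho_tau(y_t - Z_t' theta), cast as a linear program.
    Standard LP formulation of the check loss; illustrative only."""
    n, p = Z.shape
    # variables: theta (free), u >= 0, v >= 0 with residual r = u - v,
    # so that rho_tau(r) = tau*u + (1 - tau)*v
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([Z, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# toy version of (5) with the true scale used in place of the estimated one
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0.0, 3.0, size=n)
sigma = np.sqrt(0.25 + (X - 1.5) ** 2)   # an assumed square-root-type scale
Y = 2.0 * X + sigma * rng.normal(size=n)
Z = np.column_stack([X, sigma])          # regress on (X_t, sigma_t) as in (5)
beta_hat, q_hat = quantile_regression(Z, Y, tau=0.5)
print(beta_hat, q_hat)                   # beta_hat near 2, q_hat near Q_e(0.5) = 0
```

The residual-splitting trick (r = u − v with u, v ≥ 0) is the standard way to express the piecewise-linear check loss as a linear objective.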
As in Equation (4), denote by β̃(τ) the estimate obtained by replacing σ̂_t with the true σ_t = s(X_t, α) in Equation (5). Our first contribution is to establish a connection between β̂(τ) and β̃(τ) of the form

β̂(τ) = β̃(τ) + n^{−1/2} H(α̂_0|τ) + o_p(n^{−1/2}),  (6)

where H(α̂_0|τ) = O_p(1) quantifies the effect of α̂_0 at quantile τ. Thus, unlike the linear heteroscedasticity case, α̂_0 has a nonnegligible effect under nonlinear heteroscedasticity. In Equation (6), the presence of the nonnegligible term H(α̂_0|τ) poses several undesirable issues for statistical inference. First, in Section 2.1 it is shown that H(α̂_0|τ) deviates from zero at the rate of √n(α̂_0 − α), and thus any large estimation error in α̂_0 will propagate to β̂(τ). Second, it can be difficult to derive the asymptotic normality of β̂(τ) and conduct related statistical inference, as we need to investigate the joint behavior of (β̃(τ), α̂_0), where the preliminary estimator α̂_0 usually requires a two-step estimating procedure and its distributional effect on β̂(τ) can be nontrivial. Third, when a different method is used to obtain α̂_0, a completely new analysis must be carried out; consequently, researchers could reach different conclusions just because they used different preliminary estimators. Fourth, if we are given α̂_0 from prior experience without knowing the exact method used to construct it, Equation (6) makes statistical inference infeasible.
Motivated by the aforementioned issues, our main contribution is to construct an efficient estimator of β that simultaneously achieves two goals: (i) it is immune to the effect of the preliminary estimator α̂_0; and (ii) it achieves good estimation efficiency, and even asymptotically attains the Cramér-Rao bound in some situations. Our method is based on constrainedly and optimally combining information across multiple quantiles. Given N quantiles 0 < τ_1 < · · · < τ_N < 1, we consider a new estimator β̄(ω) of the form

β̄(ω) = ∑_{i=1}^N ω_i β̂(τ_i),  subject to  ∑_{i=1}^N ω_i = 1 and ∑_{i=1}^N ω_i H(α̂_0|τ_i) = 0.  (7)

By Equation (6), the constraint ∑_{i=1}^N ω_i H(α̂_0|τ_i) = 0 ensures that the undesirable term H(α̂_0|τ) is eliminated, and the constraint ∑_{i=1}^N ω_i = 1 ensures the consistency of β̄(ω). We further propose the constrainedly optimally weighted quantile average estimator (COWQAE) by minimizing the limiting covariance matrix of √n[β̄(ω) − β] subject to the constraints (7). Although the idea of combining multiple single-quantile regression estimates has been studied in the literature, our approach makes novel theoretical and methodological contributions. Existing works focused on the simple homoscedastic linear model (Koenker 1984; Koenker and Portnoy 1987; Portnoy and Koenker 1989; Zhao and Xiao 2014; Wang and Wang 2016), the nonlinear homoscedastic model (Jurečková and Procházka 1994), the linear model with linear heteroscedasticity (Gutenbrunner and Jurečková 1992; Koenker and Zhao 1994; Zhao and Xiao 2014; Zhu, Zheng, and Li 2018), and the time-varying coefficient longitudinal model (Kim, Zhao, and Xiao 2018). It is nontrivial to generalize existing methods to the nonlinear heteroscedasticity case; the main issue is the nonnegligible effect of using the estimated heteroscedasticity σ̂_t in Equation (5). While this undesirable effect exists at each individual quantile, with multiple quantiles we can eliminate it via the constraints in (7).
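The constrained weight optimization is an equality-constrained quadratic program with a closed-form KKT solution. The sketch below is illustrative only: the matrix J is a stand-in for the asymptotic covariance of the per-quantile estimates (the paper's exact objective is given in Theorem 2); here we use the classical quantile-process kernel [min(τ_i, τ_j) − τ_i τ_j]/[f_e(Q_e(τ_i)) f_e(Q_e(τ_j))] for standard normal noise.

```python
import numpy as np
from scipy.stats import norm

# Minimize w' J w subject to sum(w) = 1 and sum(w * Q_e(tau_i)) = 0,
# with J an assumed stand-in covariance kernel (standard normal noise).
N = 9
taus = np.arange(1, N + 1) / (N + 1)
q = norm.ppf(taus)                      # Q_e(tau_i)
f = norm.pdf(q)                         # f_e(Q_e(tau_i))
J = (np.minimum.outer(taus, taus) - np.outer(taus, taus)) / np.outer(f, f)

A = np.vstack([np.ones(N), q])          # the two linear constraints
b = np.array([1.0, 0.0])
# KKT system of the equality-constrained quadratic program:
# [2J A'; A 0] [w; lambda] = [0; b]
K = np.block([[2 * J, A.T], [A, np.zeros((2, 2))]])
w = np.linalg.solve(K, np.concatenate([np.zeros(N), b]))[:N]
print(np.round(w, 3))                   # weights satisfying both constraints
```

For a symmetric noise density the solved weights come out symmetric in i ↔ N+1−i, so the second constraint is satisfied automatically; for asymmetric densities it actively tilts the weights.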
To our knowledge, no existing work in the literature has used a weighted average of multiple quantile regression estimates with special constraints on the weights to suppress the effect of the preliminary estimator.
We show that the proposed COWQAE has good estimation efficiency. When the number of quantiles goes to infinity, the relative efficiency loss (compared to the Cramér-Rao lower bound) can be factorized into two components, one depending only on the noise distribution and the other depending only on model design structure. We further derive a conservative upper bound for the relative efficiency loss, and it is illustrated that the upper bound is usually quite small for practical situations. In particular, the COWQAE can asymptotically achieve the Cramér-Rao lower bound under either case: (i) e t has a symmetric density or (ii) e t has the asymmetric Laplace density. These results are new to the literature. Both simulation studies and an empirical application to GDP and inflation rate modeling show that the proposed method outperforms existing ones.
While the proposed method is based on a constrainedly weighted average of multiple single-quantile regression estimates, a different framework, the composite quantile regression (CQR), which combines loss functions across quantiles, has also been extensively studied (Koenker 1984; Zou and Yuan 2008; Kai, Li, and Zou 2010, 2011; Bradic, Fan, and Wang 2011; Jiang, Jiang, and Song 2012; Xu, Wang, and Huang 2014), all for homoscedastic models. We envision that a weighted CQR approach can be developed for the nonlinear heteroscedastic model (1) by imposing appropriate constraints on the weights to eliminate the effect of the preliminary estimate α̂_0, and we leave this for future work.
We introduce some notation. The notation →_p means convergence in probability. For random vectors U and V, write cov(U, V) for the cross-covariance matrix and var(U) = cov(U, U) ∈ R^{p×p}. All technical proofs are gathered in the online Appendix.

Main Results
For e_t, denote its density, distribution, and quantile functions by f_e(u), F_e(u), and Q_e(τ), respectively. For s(X_t, α) as a function of α, denote its gradient by ṡ(X_t, α) = ∂s(X_t, α)/∂α and its Hessian matrix by s̈(X_t, α) = ∂²s(X_t, α)/∂α∂α^T.

Remark 1. In Equation (1), we assume that X_t^T β does not include a constant intercept. Otherwise, if there is an intercept β_0, then when σ̂_t ≈ σ is close to a constant (i.e., approximate homoscedasticity), the quantile regression (5) becomes approximately singular, since β_0 and σq can be combined into one parameter β_0 + σq and we cannot identify them separately. In practice, we are more interested in the slope parameters, which measure the effects of covariates. See Section 4 for details on how to deal with models with an intercept.

Motivation: Effect of Preliminary Estimatorα 0 of α
Recall β̂(τ) in Equation (5) and the infeasible estimator β̃(τ) obtained by replacing σ̂_t with the true σ_t in Equation (5). To study their asymptotic properties, we impose the following assumptions.

Assumption 1. {(X_t, e_t)}_{t∈Z} is a strictly stationary and ergodic process.
Assumption 2. The density function f e (x) of e t is positive for all x ∈ R, twice differentiable, and bounded on {x : 0 < F e (x) < 1}.
Assumption 5. The estimator α̂_0 of α is √n-consistent, that is, α̂_0 − α = O_p(n^{−1/2}).

We briefly comment on Assumptions 1-5. In Equation (1), consider the most important special case in which the covariates X_t are lagged values of Y_t, which includes many widely used ARCH models (Bollerslev 2008). Under appropriate conditions (Wu 2005), Y_t = G(e_t, e_{t−1}, . . .) for some function G, and thus Assumption 1 holds. Assumption 2 is a typical condition in quantile regression (Koenker 2005). Assumption 3 is a technical condition ensuring the negligibility of the remainder terms in the Taylor expansions; it is quite mild and is satisfied by widely used variants of ARCH models. Assumption 4 prevents multicollinearity in the scaled predictors. The √n-consistency of the preliminary estimator in Assumption 5 is a standard condition in two-step estimation problems; see Section 3 for discussions on obtaining such preliminary estimators.
Throughout, Σ_1 = var(X_1/σ_1) ∈ R^{p×p} and Σ_2 = cov(X_1/σ_1, ṡ(X_1, α)/σ_1).

The representation (9) reveals several undesirable issues with β̂(τ). First, due to the nonnegligible term √n(α̂_0 − α) = O_p(1), a large estimation error in α̂_0 may ruin β̂(τ). Second, it can be difficult to conduct statistical inference: by Equation (9), establishing the asymptotic normality of √n[β̂(τ) − β] requires the joint behavior of (β̃(τ), α̂_0); the preliminary estimator α̂_0 usually requires a two-step estimating procedure (see Section 3 for examples), and it can be challenging to characterize the covariance between β̃(τ) and α̂_0. Third, statistical inference is sensitive to the choice of the preliminary estimator α̂_0: different conclusions could be reached just because a different preliminary estimator was used. Fourth, it is infeasible to make statistical inference if α̂_0 is given from prior experience without knowing the exact method used to construct it.

Constrainedly Optimally Weighted Quantile Average Estimator
While the effect of α̂_0 exists at each individual quantile τ (as demonstrated in Section 2.1), we can combine information across quantiles to eliminate it. Throughout, let τ_i = i/(N + 1), i = 1, . . . , N, be N quantiles, and consider a new estimator that linearly weights β̂(τ_1), . . . , β̂(τ_N) with weight ω = (ω_1, . . . , ω_N)^T:

β̄(ω) = ∑_{i=1}^N ω_i β̂(τ_i).  (10)

Under the first constraint ∑_{i=1}^N ω_i = 1, this type of estimator has been studied for the simple homoscedastic linear model (Koenker 1984; Koenker and Portnoy 1987; Portnoy and Koenker 1989; Zhao and Xiao 2014; Wang and Wang 2016) and the linear model with linear heteroscedasticity (Gutenbrunner and Jurečková 1992; Koenker and Zhao 1994; Zhao and Xiao 2014; Zhu, Zheng, and Li 2018). For the nonlinear heteroscedasticity model (1), we introduce the second constraint ∑_{i=1}^N ω_i Q_e(τ_i) = 0 to suppress the effect of α̂_0. Under the two constraints, and using the representation (9), we can obtain the asymptotic behavior of β̄(ω) free of the effect of α̂_0.

Remark 2. For standard homoscedastic models, that is, σ_t ≡ 1 in (1), we replace σ̂_t in Equation (5) by the constant 1 and modify (10) accordingly. As in Zhao and Xiao (2014), the optimal weight in Theorem 3 then involves only the first constraint, and the resulting estimator can asymptotically achieve the Cramér-Rao bound, regardless of whether the density is symmetric or not.
Remark 3. In Equation (9), consider p = k = 1 and write the leading terms as U − cV, where U corresponds to the infeasible estimator and cV captures the contribution of α̂_0. If c·cov(U, V) ≤ 0 (e.g., if α̂_0 is obtained from an independent dataset so that cov(U, V) = 0), then var(U − cV) − var(U) ≥ c² var(V), and a large value of |c| leads to poor performance of β̂(τ). As a special example, consider σ_t = (α² + 0.2² X_t²)^{1/2} and X_t ~ Uniform(0, 1); for α = 0.3, 0.1, 0.01, we have c = −Σ_1^{−1}Σ_2 ≈ −0.4, −2, −26, respectively. Therefore, there could be a potentially substantial loss of efficiency (proportional to c²) in using β̂(τ). On the other hand, if c·cov(U, V) > c² var(V)/2, β̂(τ) can perform even better than β̃(τ), though the potential efficiency gain 2c·cov(U, V) − c² var(V) is unlikely to be very large, due to the term c² var(V) growing at rate c². In practice, c is determined by the underlying model and we have no control over it. Moreover, it may be infeasible to evaluate var(V) and cov(U, V), especially when α̂_0 is given as a priori knowledge without specifying how it was obtained. Therefore, it is desirable to eliminate such unknown effects via Equation (10).

Remark 4. By Equation (9), if Q_e(τ) = 0, then β̂(τ) = β̃(τ) + o_p(n^{−1/2}). In particular, if e_t has a symmetric density, then the median-quantile-regression estimator β̂(0.5) performs as well as the infeasible estimator β̃(0.5). In this case, while β̂(0.5) is not affected by the preliminary estimator α̂_0, in general it is not an efficient estimator of β, since it only uses the information from the single median quantile. For example, consider estimation of β in the special case Y_t = β + e_t. If e_t ~ N(0, 1), the sample mean is the maximum likelihood estimator (MLE) and much superior to the sample median. For a continuous random variable Z with quantile function Q_Z(τ), τ ∈ (0, 1), we have E(Z) = ∫_0^1 Q_Z(τ) dτ, so with appropriate weights across many quantiles, (10) can approximate the sample mean, leading to better performance than β̂(0.5). On the other hand, if e_t has a standard Laplace distribution, then the sample median is the MLE and much superior to the sample mean, and by assigning unit weight to β̂(0.5) and zero weights to the other quantiles, β̄(ω) becomes exactly the sample median and is more efficient than the sample mean. These phenomena are confirmed in our simulation studies in Section 3. In fact, by Theorem 5, if e_t has a symmetric density, then β̄(ω) can asymptotically achieve the Cramér-Rao bound.
As illustrated in Remarks 3 and 4, the new estimator (10) serves two important purposes: eliminating potentially unknown and severe effects from the preliminary estimator α̂_0, and combining information across quantiles to improve estimation efficiency.
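The integral identity E(Z) = ∫_0^1 Q_Z(τ) dτ behind the mean-approximation argument is easy to check numerically: the equal-weight average of sample quantiles over a fine grid tracks the sample mean. Values below are illustrative.

```python
import numpy as np

# Equal-weight average of sample quantiles at tau_i = i/(N+1) approximates
# the sample mean, reflecting E(Z) = \int_0^1 Q_Z(tau) d tau.
rng = np.random.default_rng(1)
z = rng.normal(loc=1.5, scale=2.0, size=100_000)
N = 99
taus = np.arange(1, N + 1) / (N + 1)
quantile_avg = np.quantile(z, taus).mean()
print(quantile_avg, z.mean())  # both close to the true mean 1.5
```

This is why, for Gaussian-like noise, spreading weight over many quantiles can recover nearly the efficiency of the sample mean, while for Laplace noise concentrating weight at τ = 0.5 is preferable.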
Theorem 2. Under the conditions in Theorem 1, √n[β̄(ω) − β] is asymptotically normal with covariance matrix S(ω)Σ_1^{−1} for each ω in the constraint set Ω.

In Theorem 2, Σ_1^{−1} depends on the model structure and the scalar value S(ω) depends on the weight ω ∈ Ω. Naturally, the optimal weight ω* can be obtained as ω* = argmin_{ω∈Ω} S(ω). We call β̄(ω*) the constrainedly optimally weighted quantile average estimator (COWQAE). As in Zhao and Xiao (2014), the optimal weight admits a closed-form expression, and using this optimal weight we have the asymptotic normality of β̄(ω*) in Theorem 3.

Remark 5. In Equation (5), β̂(τ) remains unchanged if we replace σ̂_t by σ̂_t/c for any c > 0 (this only rescales q̂ by the same factor). The optimal weights and the asymptotic normality in Theorem 3 are invariant under the scale change that replaces (e_t, σ_t) in (1) by (ce_t, σ_t/c).
Assumption 6. As a function of τ, f_e(Q_e(τ)) ∈ G(0, 1) and satisfies condition (15).

In Assumption 6, Equation (15) is adopted from Zhao and Xiao (2014). It can be verified that many commonly used distributions with support on R satisfy this condition, including, for example, all the distributions in Section 3.

Comparison to Cramér-Rao Lower Bound
It is well known that the Cramér-Rao bound is the smallest possible variance for any unbiased estimator of the parameter.
By the independence between e_t and X_t, the Fisher information matrix for (β, α) is a block matrix (assuming X_t/σ_t is centered so that E(X_t/σ_t) = 0) built from Σ_1, Σ_2, and Σ_3, where Σ_1 and Σ_2 are defined in Theorem 1, Σ_3 = E[ṡ(X_t, α)ṡ(X_t, α)^T / s(X_t, α)²], and three additional scalar quantities are defined in Theorem 4. By the block matrix inverse, the Cramér-Rao bound for β, denoted by CR_β, is the corresponding β-block of the inverse information matrix.

To compare the asymptotic covariance Λ_β of the COWQAE [see (16)] with CR_β, note that Λ_β CR_β^{−1} ≥ I_p, where I_p is the p × p identity matrix. Naturally, we can define the relative efficiency loss (REL) of Λ_β, when compared with the Cramér-Rao bound CR_β, as REL(Λ_β) = Λ_β CR_β^{−1} − I_p.

Theorem 5.
i. As N → ∞, REL(Λ_β) factorizes into the product of N_noise, depending only on the noise distribution, and N_design, depending only on the model design structure.
ii. We have Λ_β = CR_β (i.e., REL(Λ_β) = 0) if e_t has either a symmetric density or the asymmetric Laplace density.
iii. Regardless of the model design structure, the upper bound REL(Λ_β) ≤ N_noise I_p always holds.

By Theorem 5(i), REL(Λ_β) can be factorized into two independent factors: N_noise, depending only on the noise distribution, and N_design, depending only on the model design. We can analyze N_noise and N_design separately to evaluate the potential efficiency loss.
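The block-matrix-inverse step is the generic Schur-complement identity: for a partitioned positive-definite information matrix I = [[A, B], [B^T, D]] (A for β, D for the nuisance α), the β-block of I^{−1} equals (A − B D^{−1} B^T)^{−1}. The check below uses a random positive-definite stand-in, not the paper's Σ quantities.

```python
import numpy as np

# Schur-complement identity behind the Cramér-Rao bound for beta.
rng = np.random.default_rng(2)
p, k = 2, 3
M = rng.normal(size=(p + k, p + k))
I_full = M @ M.T + (p + k) * np.eye(p + k)   # positive definite by construction
A, B, D = I_full[:p, :p], I_full[:p, p:], I_full[p:, p:]

cr_beta = np.linalg.inv(A - B @ np.linalg.inv(D) @ B.T)
top_left = np.linalg.inv(I_full)[:p, :p]
print(np.allclose(cr_beta, top_left))        # True
```

The Schur-complement form makes explicit that estimating the nuisance α can only enlarge (never shrink) the bound for β, since B D^{−1} B^T is positive semidefinite.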
By Theorem 5(ii), under one of the two cases listed,β(ω * ) can asymptotically achieve the Cramér-Rao bound. That is, by optimally weighting information across quantiles we can simultaneously eliminate the effect ofα 0 and achieve optimal estimation efficiency.
Remark 7. Beran (1974) and Bickel (1982) pioneered the work on adaptive estimators that can asymptotically achieve the information bound. In the time series context, adaptive likelihood-based methods are studied in, for example, Kreiss (1987) for ARMA models and Linton (1993) for ARCH models, where it is assumed that the noise has a symmetric density with finite fourth moment. The finite fourth moment condition excludes some heavy-tailed distributions, such as the Student-t distribution with degrees of freedom d ≤ 4, and the symmetric density condition ensures that the effect from preliminary estimators vanishes. In the asymmetric density case, these methods generally will not work for heteroscedastic models such as Equation (1), due to the nonvanishing effect from the preliminary estimator. While our method also cannot achieve the information bound for asymmetric densities, Theorem 5 establishes an upper bound for the potential efficiency loss which, as illustrated in Examples 1-2, is quite small in practical situations. Furthermore, our proposed method is computationally appealing, as it involves only simple linear quantile regressions.
By Theorem 5(iii), the potential efficiency loss has the upper bound N_noise I_p, which depends only on the noise distribution. Since N_noise = 0 for any symmetric density, below we calculate N_noise for some asymmetric densities only. For illustration, consider a scalar-valued parameter (p = 1), so that REL(Λ_β) = Λ_β/CR_β − 1 can be interpreted as the percentage efficiency loss relative to the Cramér-Rao bound.
Example 1 (Mixture of normal). Consider the mixture of normal distribution pN(1 − p, 1) + (1 − p)N(−p, σ 2 ) for parameters p ∈ [0, 1], σ 2 > 0. This distribution has zero mean. Figure 1 plots N noise for two cases: fixed p = 0.8 and varying σ 2 , and fixed σ 2 = 3 and varying p. In both cases, N noise is close to zero for the range of parameters considered.
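The exact formula for N_noise involves quantities not reproduced in this excerpt, but its key Cramér-Rao ingredient, the location Fisher information of the mixture density I(f) = ∫ (f′(x)/f(x))² f(x) dx, can be computed by direct numerical integration:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Example 1's density: p*N(1-p, 1) + (1-p)*N(-p, sigma2), which has mean zero.
p, sigma2 = 0.8, 3.0
s = np.sqrt(sigma2)

def f(x):
    return p * norm.pdf(x, 1 - p, 1) + (1 - p) * norm.pdf(x, -p, s)

def fprime(x):
    # derivative of each Gaussian component: pdf * (-(x - mu)/var)
    return (-p * norm.pdf(x, 1 - p, 1) * (x - (1 - p))
            - (1 - p) * norm.pdf(x, -p, s) * (x + p) / sigma2)

mean_check, _ = quad(lambda x: x * f(x), -20, 20)
info, _ = quad(lambda x: fprime(x) ** 2 / f(x), -20, 20)
print(mean_check, info)  # mean is zero; info is the location Fisher information
```

The zero-mean check confirms the identifiability normalization, and I(f) is the reciprocal of the Cramér-Rao variance bound in the pure location case.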
Example 2 (Skew normal). Skew normal distributions are widely used to model data with asymmetric and/or nonnormal distributions. The density is f_e(x|μ, α) = 2φ(x − μ)Φ(α(x − μ)), where μ is the location parameter, α is the slant parameter, and φ(x) and Φ(x) are the standard normal density and distribution functions, respectively. By Anděl, Netuka, and Zvára (1984), f_e(x|μ, α) is the stationary density of the autoregressive process {e_t} defined in (20). For μ = −√(2/π)·α/√(1 + α²), so that the distribution has zero mean, we plot N_noise in Figure 2 as a function of the slant parameter α. For α ∈ [−2, 2], the maximal efficiency loss is bounded above by 10%. Note that |α| = 2 corresponds to |ρ| ≈ 0.89 in (20), which is quite extreme and close to the unit-root case. For the even more extreme |α| = 3 (corresponding to |ρ| ≈ 0.95, which is unlikely in practice), the upper bound is about 28%.
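Equation (20) is not reproduced in this excerpt; the process of Anděl, Netuka, and Zvára (1984) admits the stochastic-representation form e_t = ρ|e_{t−1}| + √(1 − ρ²) η_t with η_t ~ N(0, 1), whose stationary law is skew normal with slant α = ρ/√(1 − ρ²). We state this form here as an assumption and verify it by simulation against the skew-normal mean and standard deviation:

```python
import numpy as np

# Assumed form of the omitted Equation (20): e_t = rho*|e_{t-1}| + sqrt(1-rho^2)*eta_t.
rng = np.random.default_rng(3)
alpha = 2.0
rho = alpha / np.sqrt(1 + alpha**2)      # |alpha| = 2  <->  rho ~ 0.894
c = np.sqrt(1 - rho**2)
n = 200_000
e = np.empty(n)
e[0] = 0.0
eta = rng.normal(size=n)
for t in range(1, n):
    e[t] = rho * abs(e[t - 1]) + c * eta[t]

# skew-normal moments with delta = rho: mean delta*sqrt(2/pi), var 1 - (2/pi)*delta^2
theo_mean = rho * np.sqrt(2 / np.pi)
print(e[1000:].mean(), theo_mean)        # both about 0.71
```

The simulated mean and spread match the skew-normal values, consistent with the stated stationary law.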
As illustrated in Examples 1-2, the upper bound for the efficiency loss is usually quite small in practical situations. We stress that the upper bound holds regardless of the model design structure and thus tends to be very conservative, as it is further multiplied by the factor N_design ≤ I_p; see Theorem 5(i). For a scalar-valued parameter, N_design ∈ [0, 1].

Inferences
As discussed in Section 2.1, while it can be difficult to use β̂(τ) to conduct statistical inference, it is straightforward to use β̄(ω*) to address inference problems. Recall v̂ and Ĥ in Section 2.4 and σ̂_t = s(X_t, α̂_0). By Theorem 6, we can estimate Σ_1 and J_N by their sample analogues Σ̂_1 and Ĵ_N. We can use Equation (24) and Proposition 1 to construct a confidence interval for each component of β. Also, suppose we are interested in the null hypothesis H_0: Aβ = B for some matrix A ∈ R^{r×p} and vector B ∈ R^r, where A has full rank r ≤ p. By Equation (24) and Proposition 1, we can form the χ²-type statistic T_n = n(Aβ̄(ω̂*) − B)^T [A V̂ A^T]^{−1} (Aβ̄(ω̂*) − B), where V̂ denotes the estimated asymptotic covariance matrix of √n[β̄(ω̂*) − β]. Under H_0, T_n converges in distribution to the χ² distribution with r degrees of freedom.
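The χ²-type (Wald) test just described can be sketched with stand-in numbers; here V_hat plays the role of the estimated asymptotic covariance (which in the paper would be built from Σ̂_1 and Ĵ_N), and the particular values are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

# Wald-type test of H0: A beta = B using a stand-in estimate and covariance.
n, r = 400, 2
beta_hat = np.array([1.02, -0.48, 0.05])
V_hat = np.diag([0.9, 1.1, 1.0])                  # assumed estimated covariance
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # H0: beta_1 = 1, beta_2 = -0.5
B = np.array([1.0, -0.5])

diff = A @ beta_hat - B
T_n = n * diff @ np.linalg.inv(A @ V_hat @ A.T) @ diff
p_value = chi2.sf(T_n, df=r)
print(T_n, p_value)  # small T_n, large p-value: fail to reject H0 at the 5% level
```

Comparing T_n with the χ²_r critical value (or reading off the p-value) gives the test decision directly.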
Remark 8. Under standard conditions for least-squares estimation, it is not difficult to establish the √n-consistency of (α̂_0, β̂_0). To conserve space, we omit the details.

Remark 9.
Under different identifiability conditions, other methods can be used to obtain (α̂_0, β̂_0). For example, if we instead assume in (27) that the median of e_t is zero, then (α̂_0, β̂_0) can be obtained by the same method discussed above with the least-squares regressions replaced by least-absolute-deviation (LAD), that is, median quantile, regressions. For example, for Heteroscedasticity 1, by Remark 5 we may assume without loss of generality that e_t² has median one; then we can obtain α̂_0 via argmin_{a_0,a_1,a_2} ∑_{t=1}^n |ε̂_t² − a_0 − a_1 X_{t1}² − a_2 X_{t2}²|.
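The LAD fit of squared residuals can also be cast as a linear program. The sketch below is illustrative, not the paper's code: with the normalization median(e_t²) = 1, the conditional median of ε_t² = σ_t² e_t² is exactly σ_t² = a_0 + a_1 X_{t1}² + a_2 X_{t2}², so median regression of ε̂_t² on (1, X_{t1}², X_{t2}²) recovers α.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import csr_matrix, eye, hstack

def lad_fit(Z, y):
    """argmin_a sum_t |y_t - Z_t' a| as a linear program (sparse formulation)."""
    n, p = Z.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = hstack([csr_matrix(Z), eye(n), -eye(n)], format="csr")
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# illustrative data with sigma_t^2 = a0 + a1*X1^2 + a2*X2^2
rng = np.random.default_rng(4)
n = 6000
X1, X2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
a_true = np.array([0.5, 1.0, 1.5])
sigma2 = a_true[0] + a_true[1] * X1**2 + a_true[2] * X2**2
e = rng.normal(size=n) / np.sqrt(0.4549)   # median(e^2) = 1 (chi2_1 median ~ 0.4549)
eps2 = sigma2 * e**2                       # squared first-stage residuals
a_hat = lad_fit(np.column_stack([np.ones(n), X1**2, X2**2]), eps2)
print(np.round(a_hat, 2))                  # roughly (0.5, 1.0, 1.5)
```

LAD is used here precisely because the multiplicative noise e_t² is heavy-tailed and asymmetric; least squares on ε̂_t² would target the mean, which requires a different normalization.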
For the noise e_t, we consider ten distributions, all with zero mean for identifiability. We compare the following six methods for estimating (β_1, β_2):

2. WLS: Weighted least squares with estimated weights s(X_{1t}, X_{2t}, α̂_0)^{-2}.
3. WLS-T: Weighted least squares with the true weights.
4. WLAD-T: Weighted LAD with the true weights. Since we include an intercept and use the true weights, we do not require e_t to have median zero.

Table 1. Relative MSE (COWQAE always has MSE one) for Heteroscedasticity 1 in (28). In each cell, the first number is the relative MSE for β_1 and the bracketed number is the relative MSE for β_2. Bottom two rows: average of relative MSE over the ten distributions.

For each estimate β̂_j of β_j, we calculate its mean squared error (MSE) by averaging 1000 realizations of (β̂_j − β_j)², and Tables 1-3 present the relative MSE, that is, MSE(method)/MSE(COWQAE), for the three forms of heteroscedasticity in Equations (28)-(30), using sample sizes n = 200 and n = 500. Clearly, COWQAE with the estimated optimal weight has almost indistinguishable performance from COWQAE-T with the true optimal weight, and both deliver substantially better performance than the other methods. It is remarkable that COWQAE significantly outperforms WLS and WLAD even when the true weights are used in WLS and WLAD. To further appreciate the performance of COWQAE, we take the average of the relative MSE over the ten distributions and include these averages at the bottom of Tables 1-3. COWQAE outperforms the other methods by a large margin, and the result is consistent across the three forms of heteroscedasticity, the two sample sizes, and the two parameters.

Application: GDP and Inflation Rate Modeling
We consider two datasets, both available at the Federal Reserve Bank of St. Louis website (https://fred.stlouisfed.org). The first dataset contains seasonally adjusted quarterly gross domestic product (GDP) growth rates (in annual percentage) from April 1947 to January 2019. The second dataset contains quarterly inflation rates, calculated as the percentage change of the seasonally adjusted total Consumer Price Index, during the same period. There are n = 288 observations. The plots in Figure 3 suggest possible heteroscedasticity.

Table 3. Relative MSE (COWQAE always has MSE one) for Heteroscedasticity 3 in (30). In each cell, the first number is the relative MSE for β_1 and the bracketed number is the relative MSE for β_2. Bottom two rows: average of relative MSE over the ten distributions.

Model with intercept:
In all subsequent models, we include an intercept. As discussed in Remark 1, the intercept may cause numerical issues. To address this, we first use OLS to obtain the estimated intercept, denoted by β̂_0^LS, subtract β̂_0^LS from the response Y_t, use the de-intercepted response in our COWQAE estimation procedure to obtain the estimated slope parameters, and finally add β̂_0^LS back to the model as the intercept.

We consider GDP forecasting. For the model Y_t = β_0 + X_t^T β + s(X_t, α)e_t, we divide the data into two parts: the first J observations (X_1, Y_1), . . . , (X_J, Y_J) are used for model fitting (i.e., obtaining the estimates β̂_0, β̂), and the remaining n − J observations (X_{J+1}, Y_{J+1}), . . . , (X_n, Y_n) are used to evaluate the out-of-sample forecasting performance based on the prediction error ∑_{t=J+1}^n (Y_t − β̂_0 − X_t^T β̂)².

As in Section 3, we compare COWQAE to four other methods: OLS, WLS with estimated weights, LAD, and WLAD with estimated weights. For the preliminary estimate, we follow the same procedure as in Section 3. For ease of comparison, we consider the relative prediction error, defined as the prediction error of a method divided by the prediction error of COWQAE, so that COWQAE always has relative prediction error one. We evaluate the relative prediction error using J = 100, 105, 110, . . . , 200.

Three types of models are considered: (i) GDP on past GDP; (ii) GDP on the past inflation rate; and (iii) GDP on past GDP and the past inflation rate. Our goal is not to select the best model for GDP forecasting; rather, we aim to compare the forecasting performance of our proposed estimation method with that of other estimation methods under various given models. For GDP on past GDP, the partial autocorrelation function suggests significant correlation at lag one, slightly significant correlation at lag two, and barely significant correlation at lag three; thus, we consider up to two lags of past GDP.
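The fit/forecast split and the relative prediction error can be sketched as follows. The synthetic data and the OLS-versus-mean comparison are illustrative stand-ins only; they are not the paper's five estimation methods or the GDP data.

```python
import numpy as np

# Fit on the first J observations, score squared prediction error on the rest,
# and report each method's error relative to a baseline.
rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
y = 0.5 + 0.5 * x + np.sqrt(0.2 + 0.1 * x**2) * rng.normal(size=n)

def ols(xs, ys):
    Z = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(Z, ys, rcond=None)[0]

def mean_only(xs, ys):            # deliberately misspecified comparator
    return np.array([ys.mean(), 0.0])

def pred_error(J, fit):
    b = fit(x[:J], y[:J])
    return np.sum((y[J:] - b[0] - b[1] * x[J:]) ** 2)

rels = [pred_error(J, mean_only) / pred_error(J, ols) for J in (100, 150, 200)]
print(np.round(rels, 3))          # all above one: the better fit predicts better
```

As in the text, a ratio above one for every J indicates uniformly better out-of-sample performance of the benchmark method.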
In all models, we use the most widely used square-root heteroscedasticity. We then have the following models:

Model 1: GDP_t = β_0 + β_1 GDP_{t−1} + (α_0 + α_1 GDP²_{t−1})^{1/2} e_t
Model 2: GDP_t = β_0 + β_1 GDP_{t−1} + β_2 GDP_{t−2} + (α_0 + α_1 GDP²_{t−1} + α_2 GDP²_{t−2})^{1/2} e_t
Model 3: GDP_t = β_0 + β_1 Inflation_{t−1} + (α_0 + α_1 Inflation²_{t−1})^{1/2} e_t
Model 4: GDP_t = β_0 + β_1 GDP_{t−1} + β_2 Inflation_{t−1} + (α_0 + α_1 GDP²_{t−1} + α_2 Inflation²_{t−1})^{1/2} e_t

The relative prediction errors for Models 1-4 are reported in Figure 4. The relative prediction errors almost always exceed one (except for very few cases where they lie in [0.95, 1]), suggesting almost uniformly better performance of COWQAE over the other methods in all models and for all values of J considered. Also, since most of the relative prediction errors lie in [1.05, 1.10], COWQAE has about a 5%-10% efficiency gain in terms of prediction error. Note that we have the decomposition Y_t − β̂_0 − X_t^T β̂ = (β_0 − β̂_0) + X_t^T(β − β̂) + s(X_t, α)e_t. Since s(X_t, α)e_t appears in the prediction error of all methods, it may shadow the better performance of (β̂_0, β̂), especially when s(X_t, α)e_t has a large magnitude. Thus, a 5%-10% efficiency gain could actually mean substantially better performance of (β̂_0, β̂).

Next we examine the estimated slope coefficients in detail. For illustration, we consider β_1 in Model 1. For each t = 5, 10, . . . , 85, we use the data in the 50-year moving time window [t, t + 200] to estimate β_1, and Figure 5 plots the series of β̂_1 based on the five estimation methods: OLS (thin solid), WLS (thin dotted), LAD (thin dashed), WLAD (thin dot-dashed), and COWQAE (thick solid), along with the 95% confidence interval (thick dotted) for COWQAE at each t. The series β̂_1 from the other four estimation methods are mostly contained within the confidence interval, but there are occasional exceptions (suggesting significant differences).
Since the confidence interval lies above zero for all t, we have strong evidence to reject the null hypothesis H 0 : β 1 = 0 for each moving time window, and we conclude that GDP is positively correlated with past GDP.
Using J = 100 subsamples for model fitting, Figure 6 plots COWQAE predicted GDP vs. observed GDP. The points are reasonably scattered around the diagonal line.