Heteroscedasticity Robust Panel Unit Root Tests

This article proposes new unit root tests for panels where the errors may be not only serially and/or cross-correlated, but also unconditionally heteroscedastic. Despite their generality, the test statistics are very simple to implement, requiring only minimal corrections, and their limiting distributions under the null hypothesis are completely free of nuisance parameters. Monte Carlo evidence suggests that the new tests perform well in small samples, also when compared to some existing tests. Supplementary materials for this article are available online.


INTRODUCTION
Researchers are by now well aware of the potential hazards involved when using augmented Dickey-Fuller (ADF)-type unit root tests in the presence of unattended structural breaks in the deterministic trend. But breaks in the trend are not the only way in which structural instability may arise. Take, for example, the financial literature concerned with testing the efficient market hypothesis, which requires that stock returns are stationary; there, violations of the otherwise so common homoscedasticity assumption are the rule rather than the exception. In fact, as Poon and Granger (2003, p. 481) stated in their review of this literature:
There are several salient features about financial time series and financial market volatility that are now well documented. These include fat tail distributions of risky asset returns, volatility clustering, asymmetry and mean reversion, and comovements of volatilities across assets and financial markets. More recent research finds correlation among volatility is stronger than that among returns and both tend to increase during bear markets and financial crises.
Time-varying volatility is therefore an important source of structural instability. However, as the quotation makes clear, there is not only time variation in the volatility of individual stock returns, but also a great deal of similarity in the volatility of different assets and markets, a finding that has been confirmed by numerous studies (see, e.g., McMillan and Ruiz 2009). When it comes to unit root testing, while the latter is typically ignored, the most common way to accommodate the former is to assume that the time variation is in the conditional variance only (see Seo 1999), which is of course not very realistic. Indeed, Pagan and Schwert (1990), Loretan and Phillips (1994), Watson (1999), and Busetti and Taylor (2003), to mention a few, all provided strong evidence against the unconditional homoscedasticity assumption for most financial variables, including exchange rates, interest rates, and stock returns.

A previous version of this article was presented at a workshop at Deakin University. The author would like to thank seminar participants, and in particular Christoph Hanck, Rolf Larsson, Paresh Narayan, Stephan Smeekes, Jean-Pierre Urbain, Shakeeb Khan (the Editor), an Associate Editor, and two anonymous referees for many valuable comments and suggestions. Financial support from the Jan Wallander and Tom Hedelius Foundation under research grant number P2009-0189:1 is gratefully acknowledged.
Of course, unconditional heteroscedasticity is not only a common feature of financial data, but can be found also in most macroeconomic variables. For example, Kim and Nelson (1999), McConnell and Perez Quiros (2000), and Koop and Potter (2000) all showed how the volatility of U.S. gross domestic product (GDP) has declined since the early 1980s. Busetti and Taylor (2003) considered 16 series of industrial production and 17 series of inflation. Consistent with the evidence for GDP, most series are found to exhibit a significant decline in variability around the early 1980s. Similar results were reported by Cavaliere and Taylor (2009a). Chauvet and Potter (2001) reported decreased volatility also in consumption, income, and aggregate employment, the latter finding being confirmed by the results of Warnock and Warnock (2000). Sensier and van Dijk (2004) considered no less than 214 monthly macroeconomic time series. According to their results, around 80% of the series were subject to a significant break in volatility. The evidence also suggests that there is a great deal of similarity in the volatility within groups of series, such as industries, sectors, and real and nominal variables. Hence, again, the unconditional variance is not only time varying, but also co-moving.
The current article can be seen as a response to the above observations. The purpose is to devise a panel-based test procedure that is flexible enough to accommodate not only the "usual suspects" of serial and cross-correlation, but also general (unconditional) heteroscedasticity. The tests should also be simple. In particular, they should not suffer from the same computational drawback as many of the existing time series tests, which typically involve some kind of resampling and/or nonparametric estimation (see Beare 2008; Cavaliere and Taylor 2009b). The way we accomplish this is by exploiting the information contained in the similarity of both the variance and level of the cross-sectional units, a possibility that has not received any attention in the previous literature. In fact, except for the recent study of Demetrescu and Hanck (2011), this is, to our knowledge, the only panel investigation to address the issue of heteroscedasticity.
Two tests based on the new idea are proposed. One is designed to test for a unit root in a single time series, while the other is designed to test for a unit root in a panel of multiple time series. Both test statistics are very convenient in that no prior knowledge regarding the structure of the heteroscedasticity is needed, and still the implementation is extremely simple, requiring only minimal corrections. The asymptotic analysis reveals that while the test statistics have asymptotic distributions that are free of nuisance parameters under the unit root null, this is not the case under the local alternative, in which power generally depends on both the heteroscedasticity and the cross-correlation. Results from a small Monte Carlo study show that the asymptotic results are borne out well in small samples. In fact, the tests have excellent small-sample properties, even when compared to the competition.
The rest of the article is organized as follows. Section 2 introduces the model, while Section 3 presents the test statistics and their asymptotic distributions, which are evaluated in small samples in Section 4. Section 5 concludes.

MODEL AND ASSUMPTIONS
Consider the panel data variable $y_{i,t}$, where $t = 1, \ldots, T$ and $i = 1, \ldots, N$ index the time series and cross-sectional units, respectively. The data-generating process (DGP) of this variable is given by

$$y_{i,t} = \alpha_i + u_{i,t}, \qquad u_{i,t} = \rho_i u_{i,t-1} + v_{i,t}, \qquad (1)$$

with $v_{i,t}$ following a stationary autoregressive process of known order $p_i < \infty$ with possibly cross-section dependent and/or groupwise heteroscedastic errors. In particular, it is assumed that $v_{i,t}$ admits the following general dynamic common factor model:

$$\phi_i(L)\, v_{i,t} = e_{i,t}, \qquad (2)$$

where the serially uncorrelated error $e_{i,t}$ has the strict common factor structure

$$e_{i,t} = \lambda_i' F_t + \sigma_{m,t}\, \varepsilon_{i,t}, \qquad (4)$$

where $\phi_i(L) = 1 - \sum_{k=1}^{p_i} \phi_{ki} L^k$ is a polynomial in the lag operator $L$, $F_t$ is an $r$-dimensional vector of common factors, and $m = 1, \ldots, M+1$ indexes $M+1$ distinct variance groups. As with $p_i$, initially we will assume that $r$ (the number of common factors) and $M$ (the number of variance groups) are known; some possibilities for how to relax these assumptions in practice will be discussed in Section 3.3. In the assumptions that follow, $\lfloor x \rfloor$ signifies the integer part of $x$ and $\|A\| = \sqrt{\mathrm{tr}(A'A)}$ is the Frobenius (Euclidean) norm of the matrix $A$.
Assumption 1. (a) $\varepsilon_{i,t}$ is independent and identically distributed (iid) across both $i$ and $t$ with $E(\varepsilon_{i,t}) = 0$, $E(\varepsilon_{i,t}^2) = 1$, and $E(\varepsilon_{i,t}^4) < \infty$; (b) $F_t$ is iid across $t$ with $E(F_t) = 0$, $E(F_t F_t') = \Sigma_t > 0$, and $E(\|F_t\|^4) < \infty$; (c) $\varepsilon_{i,t}$ and $F_s$ are mutually independent for all $i$, $t$, and $s$; (e) $\phi_i(L)$ has all its roots outside the unit circle; (f) $E(u_{i,s}) < \infty$ for all $i$ and $s = -p_i, \ldots, 0$.
Consider $\sigma_{m,t}^2$. It is assumed that the cross-section can be divided into $M+1$ cross-section homoscedastic groups with the $m$th group containing the units $i = N_{m-1}+1, \ldots, N_m$, where $N_0 = 0$ and $N_{M+1} = N$. That is, the heteroscedasticity across the cross-section is made up of $M$ discrete jumps between otherwise homoscedastic groups. As a matter of notation, in what follows, if a variable, $x_{m,i}$ say, depends on both $m$ and $i$, since the latter index runs across groups, the dependence on $m$ will be suppressed; that is, $x_{m,i}$ will be written as $x_i$. Note also that the sequential group structure assumed here, in which units $i = N_{m-1}+1, \ldots, N_m$ are assigned to group $m$, is without loss of generality. In fact, as we demonstrate in Section 3.3, in applications it is actually quite useful to think of the cross-section as being ordered.
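To fix ideas, the groupwise variance structure can be illustrated with a small simulation of the null DGP. This is a sketch of ours, not the paper's design: the group sizes, break date, factor dimension, and volatility paths below are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r = 20, 200, 1       # cross-section units, time periods, common factors
N1 = 10                    # group threshold: units 0..N1-1 form variance group 1

# Group volatility paths sigma_{m,t}: group 2 breaks upward mid-sample,
# group 1 stays homoscedastic (an arbitrary illustrative choice).
sigma = np.ones((2, T))
sigma[1, T // 2:] = 1.5

F = rng.standard_normal((T, r))           # common factors F_t
lam = rng.standard_normal((N, r))         # loadings lambda_i
eps = rng.standard_normal((T, N))         # iid idiosyncratic shocks

group = (np.arange(N) >= N1).astype(int)  # group index m(i) for each unit
e = F @ lam.T + sigma[group, :].T * eps   # e_{i,t} = lam_i'F_t + sigma_{m,t} eps_{i,t}
y = np.cumsum(e, axis=0)                  # rho_i = 1 for all i: the unit root null
```

Under the null and with $p_i = 0$, the errors $v_{i,t}$ coincide with $e_{i,t}$, so the panel `y` is a pure unit root panel with cross-correlated, groupwise heteroscedastic innovations.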
$\sigma_m(s)$ and $\Sigma(s)$ are bounded and square-integrable with a finite number of points of discontinuity. The parameter of interest, $\rho_i$, is assumed to satisfy Assumption 3.
Assumption 3. $\rho_i = 1 + c_i/(T N^{\kappa})$, where $\kappa \geq 0$, and $c_i$ is iid across $i$ with $E(c_i) = \mu_c$ and $\mathrm{var}(c_i) = \sigma_c^2 < \infty$. $c_i$ is independent of $\varepsilon_{j,t}$ and $F_s$ for all $i$, $j$, $t$, and $s$.

Hence, if $c_i = 0$, then $\rho_i = 1$, and therefore $y_{i,t}$ is unit root nonstationary, whereas if $c_i < 0$ ($c_i > 0$), then $\rho_i$ approaches one from below (above) and therefore $y_{i,t}$ is "locally" stationary (explosive) (in the mean). The relevant null hypothesis here is given by $H_0: c_1 = \cdots = c_N = 0$, which is equivalent to $H_0: \mu_c = \sigma_c^2 = 0$.

Remarks.

1. The assumed groupwise structure of the heteroscedasticity not only seems like a reasonable scenario in many applications (see Section 3.3 for a discussion), but is also one of the strengths of the approach, as it allows for consistent estimation of the group-specific variances (see Section 3.1). This is facilitated by Assumption 2(a), which states that the groups increase proportionally with $N$. In this regard, it is quite obvious that while in theory $q$ (the proportion) can be made arbitrarily small, in applications the value of $q$ is not irrelevant, as larger groups are expected to increase the precision of the estimated group-specific variances. In Section 4, we discuss the importance of $q$ in small samples (see also the supplement to this article).

2. While the heteroscedasticity across the cross-section needs to have the groupwise structure, the heteroscedasticity across time is essentially unrestricted. In the derivations, we assume that $\sigma_{m,t}^2$ and $\Sigma_t$ are nonrandom, which allows for a wide class of deterministic variance models, including models with breaks and linear trends. However, provided that $\sigma_{m,t}^2$, $\Sigma_t$, $\varepsilon_{i,t}$, and $F_t$ are independent with $E(\sigma_{m,t}^2) > 0$ and $E(\Sigma_t) > 0$, $\sigma_{m,t}^2$ and $\Sigma_t$ can also be random. Conditional heteroscedasticity, including generalized autoregressive conditional heteroscedasticity (GARCH), is permitted too, as heteroscedasticity of this form does not affect the asymptotic results.

3.
The assumption in Equation (4) is quite common (see, e.g., Pesaran 2007; Pesaran, Smith, and Yamagata 2009), and ensures that the serially uncorrelated error $e_{i,t}$ has a strict common factor structure. This is more restrictive than the approximate factor model considered by Bai and Ng (2010), where the idiosyncratic error is allowed to be "mildly" cross-correlated, but is necessary for the proofs. Except for this, however, the above model is actually quite general when it comes to the allowable serial and cross-sectional dependencies. Note in particular how the error in Equation (2) admits a general dynamic factor structure of potentially infinite order.

4. The assumption that the mean and variance of $c_i$ are equal across $i$ is not necessary and can be relaxed as long as the cross-sectional averages of $E(c_i)$ and $\mathrm{var}(c_i)$ have limits, $\mu_c$ and $\sigma_c^2$ say, respectively. The assumption that $c_i$ is independent of $\varepsilon_{j,t}$ and $F_s$ can also be relaxed somewhat. In particular, note that since under the null of a unit root $c_1 = \cdots = c_N = 0$, correlation between $c_i$ and $\varepsilon_{j,t}$ and/or $F_s$ is only an issue under the alternative hypothesis, and then it can be shown that in the typical panel parameterization of $\rho_i$ with $\kappa = 1/2$, the effect of correlation between $c_i$ and $\varepsilon_{j,t}$ is negligible, although otherwise this is not necessarily so. Unfortunately, such a correlation greatly complicates both the derivation and interpretation of the results, and we therefore follow the usual practice in the literature and assume that Assumption 3 holds.

A Time Series Test
The time series test statistic that we consider, henceforth denoted by $\tau_i$, is very simple and can be seen as a version of the Lagrange multiplier (LM) test statistic for a unit root (see Remark 5). The implementation consists of four steps:

1. Obtain $R_{i,t}$ as the ordinary least-squares (OLS) residual from a regression of $\Delta y_{i,t}$ onto $(\Delta y_{i,t-1}, \ldots, \Delta y_{i,t-p_i})$.
2. Obtain $\hat{F}_t$ and $\hat{\lambda}_i$ by applying the principal components method to $R_{i,t}$.
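The first two steps can be sketched as follows. This is a minimal sketch of ours, assuming the standard principal-components normalization of Bai and Ng ($F'F/T = I_r$); the function names are hypothetical, and the demo data are arbitrary.

```python
import numpy as np

def adf_residuals(y, p):
    """Step 1: OLS residuals from regressing dy_t on (dy_{t-1}, ..., dy_{t-p})."""
    dy = np.diff(y)
    d = dy[p:]
    if p == 0:
        return d
    # Column k holds dy lagged k+1 periods, aligned with d.
    X = np.column_stack([dy[p - k - 1:len(dy) - k - 1] for k in range(p)])
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    return d - X @ beta

def pca_factors(R, r):
    """Step 2: principal-components estimates of the factors and loadings."""
    T, N = R.shape
    _, vecs = np.linalg.eigh(R @ R.T)        # eigenvalues in ascending order
    F = np.sqrt(T) * vecs[:, -r:][:, ::-1]   # top-r eigenvectors; F'F/T = I_r
    lam = R.T @ F / T                        # loadings by OLS given F
    return F, lam

# Demo on simulated data (a unit root series, arbitrary lag order).
rng = np.random.default_rng(1)
y_demo = np.cumsum(rng.standard_normal(120))
R_demo = adf_residuals(y_demo, 2)
F_demo, lam_demo = pca_factors(rng.standard_normal((60, 8)), 2)
```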

Remarks.
5. $\tau_i$ can be seen as a version of the true LM test statistic (based on known parameters) for testing $H_0: c_i = 0$. Unfortunately, the asymptotic distribution of that statistic depends on $\sigma_m(s)$ and $\Sigma(s)$, which means that critical values can only be obtained for a particular choice of $(\sigma_m(s), \Sigma(s))$.³ The use of accumulated weighted first differences (as when replacing $w_{m,t} R_{i,t-1}$ with $\sum_{s=p_i+2}^{t} w_{m,s} R_{i,s}$) eliminates this dependence.

6. The formula for $\tau_i$ reveals some interesting similarities with results obtained previously in the literature. In particular, note how $\tau_i$ is basically the squared, weighted, and "defactored" equivalent of the conventional ADF test. It can also be regarded as a generalization of the LM test developed by Ahn (1993), and Schmidt and Phillips (1992), who considered the problem of testing for a unit root in a pure time series setting with homoscedastic errors. The way that $R_{wi,t}$ is accumulated up to levels is very similar to the approach of Bai and Ng (2004). But while in that article the accumulation is just an artifact of their proposed correction for cross-section dependence, here it is used as a means to recursively weight the observations, and ultimately to remove the effect of the heteroscedasticity (see Beare 2008, for a similar approach).

7. As is well known, under certain conditions (such as normality), the LM test statistic possesses some optimality properties. While the test statistic considered here is not the true LM statistic, it is similar and is therefore expected to inherit some of these properties. Apart from the resemblance to the true LM statistic, the use of a squared test statistic has the advantage that it does not rule out the possibility of explosive units ($c_i > 0$), which seems like a relevant scenario in many applications, especially in financial economics, where data can exhibit explosive behavior.
Explosive behavior is also more likely if $N$ is large, which obviously increases the probability of extreme events regardless of the application considered. This is particularly relevant when constructing pooled panel test statistics (see Section 3.2), in which case positive and negative values of $c_i$ may otherwise cancel out, causing low power.

8. $\tau_i$ can be modified to incorporate information regarding the direction of the alternative. Suppose, for example, that one would like to test $H_0: c_i = 0$ versus $H_1: c_i < 0$. A very simple approach would be to first test $H_0$ versus $H_1$. If $H_0$ is not rejected, then we conclude that $y_{i,t}$ is unit root nonstationary and proceed no further, whereas if $H_0$ is rejected, then the testing continues by checking the sign of $\sum_{t=p_i+2}^{T} \Delta R_{wi,t} R_{wi,t-1}$. Only if the sign is negative do we conclude that the evidence is in favor of stationarity rather than explosiveness. Of course, because of the data snooping, this test would not be correctly sized. A more appropriate testing approach involves replacing $\tau_i$ with a one-sided version based on the indicator function $1(A)$ of the event $A$. The asymptotic null distribution of the resulting joint test statistic can be worked out by using the results of Abadir and Distaso (2007).

9. In regression analysis with heteroscedastic errors it is common practice to use $\hat{e}_{i,t}^2$ to estimate $\sigma_{m,t}^2$. However, as Lopez (2001) pointed out, while unbiased, because of its asymmetric distribution, $\hat{e}_{i,t}^2$ is a very imprecise estimator of $\sigma_{m,t}^2$. The group-based estimator $\hat{\sigma}_{m,t}^2$ uses more information and is therefore expected to perform much better.

¹ Because $p_i$ is not restricted to be equal across the cross-section, the panel data variable $\hat{e}_{i,t}$ will in general be unbalanced. Therefore, to simplify this step of the implementation, one may apply the principal components method to the balanced panel comprising the last $T - \max\{p_1, \ldots, p_N\} - 1$ observations.
² A formal derivation is available upon request.
³ Another problem with the true LM test statistic is that the feasible version is based on using weighted OLS (WLS) to estimate the first-step regression. However, since the weight $w_{m,t}$ is not known, this calls for the use of iterated WLS.
Beare (2008) used a recursive weighting scheme similar to the one considered here, but where $\sigma_{m,t}^2$ is estimated in a nonparametric fashion, which not only leads to a very slow rate of consistency but also complicates implementation.
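Remark 9's point, that pooling squared residuals across a homoscedastic group is far more precise than any single unit's squared residual, is easy to verify numerically. The sketch below is illustrative only: the variance path, break date, and group size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
T, Nm = 500, 25                     # periods and units in one variance group
# True group variance path: an upward break at mid-sample (illustrative).
sigma2 = np.where(np.arange(T) < T // 2, 1.0, 2.25)

e = np.sqrt(sigma2)[:, None] * rng.standard_normal((T, Nm))

single = e[:, 0] ** 2               # e_{i,t}^2: unbiased for sigma2_t but very noisy
pooled = (e ** 2).mean(axis=1)      # group average: sampling variance shrinks ~1/N_m

mse_single = np.mean((single - sigma2) ** 2)
mse_pooled = np.mean((pooled - sigma2) ** 2)
```

With $N_m = 25$ units in the group, the pooled estimator's mean squared error is roughly $N_m$ times smaller, which is the precision gain the group structure buys.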
The asymptotic distribution of τ i is given in the following theorem.
Theorem 1. Under the conditions laid out in Section 2, given $\kappa = 0$, as $N, T \to \infty$, where $\to_d$ signifies convergence in distribution, $i = N_{m-1}+1, \ldots, N_m$, and $W_{\varepsilon,i}(s)$ and $W_F(s)$ are two independent standard Brownian motions.
To appreciate fully the implications of Theorem 1, note that since $V_{R,i}(s)$ satisfies the differential equation $dV_{R,i}(s) = dW_{\varepsilon,i}(s) + c_i w_m(s) B_{R,i}(s)\,ds$, the numerator of the asymptotic test distribution can be expanded accordingly. This expansion illustrates how the presence of $c_i$ has two effects. The first is to shift the mean of the test statistic and is captured by the first term on the right-hand side, whereas the second is to affect the variance and is captured by $V_{R,i}(s)$ (which depends on $c_i$). It also illustrates how the local asymptotic power depends on both the heteroscedasticity and the cross-section dependence, as captured by $\sigma_m^2(s)$ and $\lambda_i' B_{F,i}(s)$, respectively. Whether this dependence implies higher local power in comparison to the case with homoscedastic errors cannot be determined unless a particular choice of $\sigma_m^2(s)$ and $\lambda_i' B_{F,i}(s)$ is made. However, we see that if $\sigma_m^2(s) = \sigma_m^2$ and $\lambda_i = 0$, then the dependence on these nuisance parameters disappears. We also see that whenever $\sigma_m^2(s)$ is not constant and/or $\lambda_i \neq 0$, this is no longer the case (if $c_i \neq 0$). Note in particular that even if $\sigma_m^2(s) = \sigma_m^2$, unless $\lambda_i = 0$, the dependence on $\sigma_m^2$ will not disappear. In Section 3.2, we elaborate on the power implications of the heteroscedasticity.
Under the unit root hypothesis that $c_i = 0$, $dV_{R,i}(s) = dW_{\varepsilon,i}(s)$, and hence $V_{R,i}(s) = W_{\varepsilon,i}(s)$, which in turn implies that the limiting distribution simplifies as $N, T \to \infty$. Hence, the asymptotic null distribution of $\tau_i$ does not depend on $\sigma_m(s)$ or $\lambda_i' B_{F,i}(s)$. It is therefore completely free of nuisance parameters. In fact, closer inspection reveals that the asymptotic distribution of $\tau_i$ is nothing but the squared ADF test distribution, which has been studied before by, for example, Ahn (1993), who also tabulated critical values (Table 1).

Remarks.
10. The fact that the asymptotic null distribution of $\tau_i$ is free of nuisance parameters is a great operational advantage that is not shared by many other tests. Consider, for example, the (adaptive) test statistic analyzed by Boswijk (2005), which allows for heteroscedasticity of the type considered here (but no cross-section dependence). Unlike that of $\tau_i$, the asymptotic distribution of this test statistic depends on $\sigma_m^2(s)$, which is of course never known in practice. Similarly, while the unit root tests of Pesaran (2007) and Pesaran, Smith, and Yamagata (2009) do allow for cross-section dependence in the form of common factors (but no heteroscedasticity), their asymptotic null distributions depend on $B_{F,i}(s)$, whose dimension is generally unknown, which greatly complicates the implementation. In fact, the only test statistic that we are aware of that comes close to ours in terms of generality while still having a completely nuisance parameter free null distribution is the Cauchy-based t-statistic of Demetrescu and Hanck (2011).

11. The new test statistic can be applied under very general conditions when it comes to the dynamics and heteroscedasticity of the errors. However, it cannot handle models with a unit-specific trend that needs to be estimated. The reason is that while the effect of the estimation of the intercept is negligible, this is not the case with the trend, whose estimation introduces a dependence on $\sigma_m^2(s)$. One way to circumvent this problem is to use bootstrap techniques (see, e.g., Cavaliere and Taylor 2009b). However, this does not fit well with the simple and parametric flavor of our approach. A feasible alternative in cases with trending data is to assume a common trend slope, which can then be removed by using data that have been demeaned with respect to the overall sample mean. That is, instead of using $y_{i,t}$ when constructing $R_{wi,t}$, one uses $y_{i,t} - \bar{y}$, where $\bar{y} = \sum_{j=1}^{N} \sum_{k=p_j+2}^{T} y_{j,k}/NT$.
The asymptotic distribution reported in Theorem 1 is unaffected by this.⁴ Some limited heterogeneity can also be permitted in this way by assuming that the trend slope, $\beta_i$ say, is "local-to-constant" instead of (1), and independent of all the other random elements of the DGP. The results reported in Theorem 1 (and also Theorem 2) hold also in this case, provided again that $y_{i,t}$ is replaced with $y_{i,t} - \bar{y}$.⁵

A Panel Data Test
Given that the asymptotic null distribution of $\tau_i$ is free of nuisance parameters, the various panel unit root tests developed in the literature for the case of homoscedastic and/or cross-correlation free errors can also be applied to the present, more general case. In this section, we consider the following panel version of $\tau_i$. The main reason for looking at a between rather than a within type test statistic is that such statistics are known to have relatively good local power properties (see Westerlund and Breitung 2012). The asymptotic distribution of $\tau$ is given in Theorem 2.

⁴ A detailed proof is available upon request.
⁵ Some confirmatory Monte Carlo results are available upon request.
Theorem 2. Under the conditions of Theorem 1, given $\kappa = 1/2$, as $N, T \to \infty$ with $N/T \to 0$, where $\chi_k^2(d)$ denotes a noncentral chi-square distribution with $k$ degrees of freedom and noncentrality parameter $d$.
In agreement with the results reported in Theorem 1, the asymptotic local power of $\tau$ depends on both the heteroscedasticity and the cross-section dependence, as captured by $\sigma_m^2(s)$, $\Sigma(s)$, and $\lambda_i$. In fact, Theorem 2 is the first result that we are aware of that shows how power is affected by all three parameters.⁶ The first thing to note is that while this is not the case for $\sigma_m^2(s)$, the effect of $\lambda_i$ and $\Sigma(s)$ is negligible (as $N \to \infty$). This means that our defactoring approach is effective in removing the effect of the common component, not only under the null but also under the local alternative. As in Section 3.1, unless a particular choice of $\sigma_m^2(s)$ is made, its effect on power cannot be determined. However, we see that if $N < \infty$, unless $\lambda_i = 0$, the dependence on $\sigma_m^2(s)$ will not disappear even if $\sigma_m^2(s) = \sigma_m^2$, which is again in agreement with Theorem 1.
To illustrate how a particular choice of $\sigma_m^2(s)$ can affect power, let us consider as an example the case when $\sigma_m(s) = \sigma(s) = 1 + 1(s > b)/4$. With $b = 1/2$ this is the discrete break case considered in the simulations (see Section 4). The minimal value of $\theta^2$ is obtained by setting $b = 0$ or $b = 1$, in which case $\sigma(s)$ is constant. It follows that in this particular example $\theta^2$, and hence also power, is going to be higher with heteroscedasticity ($b \in (0, 1)$) than without ($b = 0$ or $b = 1$). The maximal value of $\theta^2$ is obtained by setting $b = 1/2$.
Under the null hypothesis that $c_i$ and all its moments are zero, $q_{N,m}(u) = 0$, so that the result in Theorem 2 reduces to a central chi-square distribution with one degree of freedom.

Remarks.
12. While most research on the local power of panel unit root tests assumes that $N \to \infty$ (and also $T \to \infty$) and only reports results for the resulting first-order approximate power function (see, e.g., Moon, Perron, and Phillips 2007), which depends only on $\mu_c$, Theorem 2 also accounts for $\sigma_c^2$, and is therefore expected to produce more accurate predictions, a result that is verified using Monte Carlo simulation in Section 4. The theorem also shows how power is driven mainly by $\mu_c$, and that the effect of $\sigma_c^2$ goes to zero as $N \to \infty$.

13. While nonnegligible for $\kappa = 0$, $\tau_i$ has negligible local power for $\kappa > 0$. Intuitively, when $\kappa > 0$, the deviations of $\rho_i$ from the hypothesized value of one are too small for $\tau_i$ to detect. The fact that $\tau$ has nonnegligible local power for $\kappa = 1/2$ means that it is able to pick up deviations that are undetectable by $\tau_i$. Thus, as expected, the use of the information contained in the cross-sectional dimension leads to increased power. The fact that the assumptions are the same (except for the requirement that $N/T$ should go to zero) suggests that this power advantage does not come at the cost of added (homogeneity) restrictions.

14. The condition that $N/T \to 0$ as $N, T \to \infty$ is standard even when testing for unit roots in homoscedastic and cross-sectionally independent panels. The reason for this is the assumed cross-sectional heterogeneity of the DGP, whose elimination induces an estimation error in the time series dimension, which is then aggravated when pooling across $N$. The condition that $N/T \to 0$ prevents this error from having a dominating effect (see Westerlund and Breitung 2011, for a detailed discussion). Demetrescu and Hanck (2011) considered a setup that is similar to ours. However, they require $N/T^{1/5} \to 0$, which obviously puts a limit on the allowable sample sizes. For example, if $N = 10$, then $T > 100{,}000$ is required for $T > N^5$ to be satisfied. Hence, unless one is using long-span high-frequency data, this approach is not expected to perform well.

Uncertainty Over the Groups.
A problem in applications is how to pick the groups for which to compute the variances. The most natural approach is to exploit a natural grouping of the cross-section when one exists. For example, a common case is when we have repeated observations on each of a number of countries, industries, sectors, or regions, where it may be reasonable to presume that the variance is constant within groups but potentially different across them.
If there is uncertainty regarding the equality of the groupwise variances, then the appropriate approach will depend on the extent of a priori knowledge regarding the grouping. In this section, we assume that the researcher has little or no such knowledge; in the supplement we consider the case when the researcher knows which units belong to which group, such that the problem reduces to testing the equality of the groupwise variances. If nothing is known, then the number of groups and their members can be estimated by quasi-maximum likelihood (QML). To fix ideas, suppose that $M = 1$, such that there are only $M + 1 = 2$ groups. The problem of determining the groups therefore reduces to the problem of consistent estimation of the group threshold $N_1$. For this purpose, it is convenient to treat $\hat{\sigma}_{1,t}^2$ and $\hat{\sigma}_{2,t}^2$ as functions of the candidate threshold $n = \lceil qN \rceil, \ldots, \lfloor (1-q)N \rfloor$. In this notation, the QML objective function is $QML(n)$, and the proposed estimator of $N_1$ is its minimizer $\hat{N}_1$. Proposition 1 shows that $\hat{N}_1$ is consistent for $N_1$.
Proposition 1. Under the conditions of Theorem 1, $\hat{N}_1$ is consistent for $N_1$.

The proposed QML estimator $\hat{N}_1$ is analogous to the OLS-based breakpoint estimator considered by, for example, Bai (1997), and Bai and Perron (1998) in the time series case. The reason for using QML is that OLS can only be used in the case of a break in the mean. To implement the minimization, the following two-step approach may be used.

1. Compute $\hat{\sigma}_i^2 = \sum_{t=1}^{T} \hat{e}_{i,t}^2/T$ for each $i$, and order the cross-section units accordingly.
2. Obtain $\hat{N}_1$ by grid search at the minimum of $QML(n)$ based on the ordered data.
If $M > 1$, then the one-at-a-time approach of Bai (1997) may be used. The objective function is identical to $QML(n)$. Let $\hat{N}_1$ be the value that minimizes $QML(n)$. $\hat{N}_1$ does not necessarily estimate $N_1$; however, it is consistent for one of the thresholds. Once $\hat{N}_1$ has been obtained, the sample is split at this estimate, resulting in two subsamples. We then estimate a single threshold in each of the subsamples, but only the threshold associated with the largest reduction in $QML(n)$ is kept. Denote the resulting estimator by $\hat{N}_2$. If $\hat{N}_2 < \hat{N}_1$, then we switch subscripts so that $\hat{N}_1 < \hat{N}_2$. Now, if $M = 2$, the procedure stops here. If, on the other hand, $M > 2$, then the third threshold is estimated from the three subsamples separated by $\hat{N}_1$ and $\hat{N}_2$. Again, a single break point is estimated for each of the subsamples, and the one associated with the largest reduction in the objective function is retained. This procedure is repeated until estimates of all $M$ thresholds are obtained. When $M$ is unknown, before an additional threshold is added, we test whether it leads to a significant reduction in the objective function (see Bai and Perron 1998, for a similar approach).⁷
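The ordering and grid-search steps can be sketched as follows. The objective below is a simplified Gaussian pseudo-likelihood with constant within-group variances, a stand-in of ours for the paper's $QML(n)$ (whose exact form is not reproduced here); the function names and simulation settings are arbitrary.

```python
import numpy as np

def order_units(e):
    """Order units by full-sample variance (the paper's ordering device)."""
    return np.argsort((e ** 2).mean(axis=0))

def qml_split(e, q=0.15):
    """Grid search for a single group threshold after ordering the units.
    Objective: a concentrated Gaussian pseudo-likelihood (up to -2/T scaling),
    an illustrative stand-in for QML(n)."""
    T, N = e.shape
    e = e[:, order_units(e)]
    lo, hi = int(np.ceil(q * N)), int(np.floor((1 - q) * N))   # trimming
    best_n, best_obj = lo, np.inf
    for n in range(lo, hi + 1):
        s1 = (e[:, :n] ** 2).mean()                  # group-1 variance estimate
        s2 = (e[:, n:] ** 2).mean()                  # group-2 variance estimate
        obj = n * np.log(s1) + (N - n) * np.log(s2)
        if obj < best_obj:
            best_n, best_obj = n, obj
    return best_n

# Demo: 30 units; after ordering, 18 have variance 1 and 12 have variance 4.
rng = np.random.default_rng(3)
e_demo = rng.standard_normal((400, 30))
e_demo[:, 18:] *= 2.0
n_hat = qml_split(e_demo)
```

With a variance gap this large, the grid search recovers the true threshold; for $M > 1$, the same routine would be applied recursively to each subsample, as described above.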

Remarks.
15. In contrast to conventional data clustering, which is computationally very demanding, the above minimization approach is very fast. The key is the use of $\hat{\sigma}_i^2$ as an ordering device for the cross-section (see Lin and Ng 2012, for a similar approach). This makes the grouping problem analogous to the problem of breakpoint estimation in time series, for which there is a large literature (see Perron 2006, for an overview). The one-at-a-time approach of Bai (1997) is computationally very convenient and leads to tests with good small-sample performance (see Section 4). Note also that the ordering can be done while ignoring the heteroscedasticity in $\Sigma_t$ (see Lin and Ng 2012).
16. An alternative to the above break estimation approach to the grouping is to use K-means clustering (see, e.g., Lin and Ng 2012). In this case, we begin by randomly initializing the grouping. Each unit is then reassigned, one at a time, to the other groups. If the value of the objective function decreases, the new grouping is kept; otherwise, one goes back to the original grouping. This procedure is repeated until no unit changes group. In our simulations, the proposed approach was not only faster but also led to huge gains in performance when compared to clustering.
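The reassignment scheme just described can be sketched as follows. This is an illustrative implementation of ours: `gaussian_obj` is a generic groupwise Gaussian objective, not the paper's exact QML criterion, and all names and settings are assumptions.

```python
import numpy as np

def gaussian_obj(e, g):
    """Groupwise Gaussian pseudo-likelihood objective (smaller is better)."""
    obj = 0.0
    for m in np.unique(g):
        s2 = (e[:, g == m] ** 2).mean()
        obj += (g == m).sum() * np.log(s2)
    return obj

def kmeans_groups(e, n_groups, objective, max_iter=50, seed=0):
    """One-at-a-time reassignment clustering as described in the text: move a
    unit to another group and keep the move only if the objective falls."""
    rng = np.random.default_rng(seed)
    N = e.shape[1]
    g = rng.integers(0, n_groups, size=N)   # random initial grouping
    best = objective(e, g)
    for _ in range(max_iter):
        changed = False
        for i in range(N):
            for m in range(n_groups):
                if m == g[i]:
                    continue
                trial = g.copy()
                trial[i] = m
                if np.bincount(trial, minlength=n_groups).min() == 0:
                    continue                # never empty a group entirely
                obj = objective(e, trial)
                if obj < best:
                    g, best, changed = trial, obj, True
        if not changed:
            break
    return g

# Demo: 20 units in two variance groups (variances 1 and 9, illustrative).
rng = np.random.default_rng(4)
e_demo = rng.standard_normal((300, 20))
e_demo[:, 10:] *= 3.0
g_hat = kmeans_groups(e_demo, 2, gaussian_obj)
```

Because each accepted move strictly lowers the objective, the final grouping is never worse than the random initialization, but unlike the ordered grid search it requires many objective evaluations, which is the computational drawback noted in Remark 16.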

Lag Selection.
It is well known that if the true lag order $p_i$ is unknown but the estimator $\hat{p}_i$ is such that $\hat{p}_i \to \infty$ and $\hat{p}_i = o(T^{1/3})$, then conventional unit root test statistics have asymptotic distributions that are free of nuisance parameters pertaining to the serial correlation of the errors, even if $p_i$ is infinite and the errors are (unconditionally) heteroscedastic (Cavaliere and Taylor 2009b). One possibility is therefore to set $\hat{p}_i$ as a deterministic function of $T$. Of course, in practice it is often preferable to use a data-driven approach. In this context, Pötscher (1989) showed that if the errors are unconditionally heteroscedastic, the consistency of the conventional Bayesian information criterion (BIC) cannot be guaranteed. In this subsection, we therefore propose a new information criterion that is robust in this regard. The idea is the same as that of Cavaliere et al. (2012); that is, rather than modifying the information criterion (applied to the original data), we use the heteroscedasticity-corrected variables $\hat{R}_{i,t-1}$ and $\Delta \hat{R}_{i,t}$ as input. In particular, we consider an information criterion that can be seen as a version of the modified BIC (MBIC) of Ng and Perron (2001), evaluated using the estimated sample variance from the corrected regression. The maximum $p$ to be considered is denoted by $p_{\max}$ and is assumed to satisfy $p_{\max} \to \infty$ and $p_{\max} = o(T^{1/3})$. The estimator $\hat{p}_i$ is the value of $p$ that minimizes this criterion. In Section 4, we use Monte Carlo simulations to evaluate the small-sample performance of the tests based on this lag selection procedure.
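Since the exact MBIC expression is not reproduced above, the sketch below follows the general construction of Ng and Perron (2001): a BIC-type penalty augmented with a data-dependent term $\tau_T(p)$, applied here to a generic series standing in for the corrected variables. The function name, penalty details, and demo settings are our assumptions, not the paper's exact criterion.

```python
import numpy as np

def mbic_lag(R, p_max):
    """Modified-BIC lag selection in the spirit of Ng and Perron (2001).
    ADF-type regression of dR_t on R_{t-1} and p lags of dR, estimated on a
    common effective sample for all candidate p."""
    dR = np.diff(R)
    T = len(dR)
    y = dR[p_max:]                  # dependent variable, common sample
    Rlag = R[p_max:-1]              # lagged level aligned with y
    Te = len(y)
    best_p, best_ic = 0, np.inf
    for p in range(p_max + 1):
        X = np.column_stack([Rlag] + [dR[p_max - k:T - k] for k in range(1, p + 1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / Te
        tau = beta[0] ** 2 * (Rlag @ Rlag) / s2        # data-dependent term
        ic = np.log(s2) + np.log(Te) * (tau + p) / Te  # MBIC-type criterion
        if ic < best_ic:
            best_p, best_ic = p, ic
    return best_p

# Demo: a random walk whose first differences follow an AR(1).
rng = np.random.default_rng(5)
u = np.zeros(400)
for t in range(1, 400):
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
p_hat = mbic_lag(np.cumsum(u), p_max=6)
```

Fixing the effective sample across candidate orders is what makes the criterion values comparable; in the paper's setting, `R` would be replaced by the heteroscedasticity-corrected series.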

Selection of the Number of Common Factors.
In the Appendix, we verify that Assumption C of Bai (2003) and of Bai and Ng (2002) is satisfied, suggesting that neither the estimation of the common component nor the selection of the number of common factors is affected by the heteroscedasticity. Therefore, if the number of common factors, r, is unknown, as is typically the case in practice, any of the information criteria proposed by Bai and Ng (2002) can be applied to R_{i,t}, giving r̂.
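A minimal sketch of one such criterion, the IC_1 of Bai and Ng (2002), applied to a T × N panel. The principal components construction follows the standard normalization F′F/T = I; the choice of IC_1 (rather than the other Bai-Ng criteria) matches the simulation design described later.

```python
import numpy as np

def ic1_num_factors(X, k_max):
    """Bai-Ng (2002) IC_1 criterion for selecting the number of factors.
    X is a T x N panel (columns = cross-section units)."""
    T, N = X.shape
    # Eigen-decomposition of XX'; leading eigenvectors estimate the factors
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]
    ics = []
    for k in range(k_max + 1):
        if k == 0:
            V = np.mean(X ** 2)
        else:
            F = np.sqrt(T) * eigvecs[:, :k]     # factors, F'F/T = I
            L = X.T @ F / T                     # loadings
            V = np.mean((X - F @ L.T) ** 2)     # average squared residual
        penalty = k * (N + T) / (N * T) * np.log(N * T / (N + T))
        ics.append(np.log(V) + penalty)
    return int(np.argmin(ics))
```

On data generated with a single strong factor plus idiosyncratic noise, the criterion recovers r = 1.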
If p_i, r and the groups are all unknown, then a joint procedure should be used. In the simulations, we begin by estimating r given a maximum lag augmentation of p_max. Once r̂ has been obtained, the MBIC is applied to obtain p̂_i for each unit. Given r̂ and p̂_1, . . . , p̂_N, the groups are selected by applying the QML to ε̂_{i,t}.
statistics are constructed as t-ratios of the null hypothesis of a unit root. The difference is that while the first is a time series test, the second is a panel test. Thus, the most relevant comparison here is between τ_i and τ_IV,i, on the one hand, and between τ and τ_IV, on the other. Moreover, in view of the apparent robustness of the sign function to heteroscedasticity (see Demetrescu and Hanck 2011, Proposition 1), the sign-based S_NT statistic, which in construction is very close to τ_IV, is also simulated. A number of other tests were attempted (including the ADF^ĉ_ê(i) and P^ĉ_ê statistics of Bai and Ng 2004) but are not included, as their performance was dominated by that of the above tests.
In the simulations, the groups and the lag augmentation order are selected as described in Section 3.3, using the QML and the MBIC, respectively. Consistent with the results of Ng and Perron (1995), the maximum number of lags is set equal to p_max = 4(T/100)^{2/9}. All tests are based on the same lag augmentation. The trimming parameter in the QML is set to q = 0.15 (see, e.g., Bai 1997). The number of factors is determined using the IC_1 information criterion of Bai and Ng (2002), with the maximum number of factors set to 4(min{N, T}/100)^{2/9} (Bai and Ng 2004).
Two tests of Shin and Kang (2006) were also considered: in the first, the data are weighted by the inverse of the conventional estimator of the cross-sectional covariance matrix of the errors, while the second is based on a modified covariance matrix estimator that is supposed to work better in small samples. The results were, however, very similar, and we therefore only report the results from the latter test. Also, the test that we denote here by τ_IV,i is the weighted time series test that is used in constructing τ_IV. While Demetrescu and Hanck (2011) did not present any asymptotic results for this test, in view of their Proposition 2, it seems reasonable to expect it to share the robustness of τ_IV. See Demetrescu and Hanck (2011, sec. 4) for a comprehensive Monte Carlo comparison with their tests.
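The two sample-size rules of thumb used in the simulations can be computed directly. The text does not state whether the value 4(T/100)^{2/9} is truncated or rounded to an integer, so truncation is assumed here.

```python
def max_lags(T):
    """Rule of thumb used for the maximum lag order: p_max = 4(T/100)^(2/9),
    truncated to an integer (truncation is an assumption; see lead-in)."""
    return int(4 * (T / 100) ** (2 / 9))

def max_factors(N, T):
    """Maximum number of common factors: 4(min{N,T}/100)^(2/9),
    truncated to an integer."""
    return int(4 * (min(N, T) / 100) ** (2 / 9))
```

For example, T = 100 gives p_max = 4, and T = 200 gives 4 · 2^{2/9} ≈ 4.67, hence p_max = 4 after truncation.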
The results are reported in Tables 1-3. Table 1 contains the  size results, while Tables 2 and 3 contain the power results. The information content of these tables may be summarized as follows.
• As expected, τ_i and τ are generally correctly sized. Of course, the accuracy is not perfect, and some distortions remain. In particular, both tests seem to have a tendency to reject too often. Fortunately, the distortions are never very large, and they vanish quickly as N and T increase, which is just as expected, because T > N (corresponding to the requirement that asymptotically N/T should go to zero) in the simulations. We also see that none of the tests seems to be affected much by the specification of σ²_t.
• The most serious distortions are found for τ_IV, which is generally quite severely undersized, especially when φ_1 = 0.5 and the errors are serially correlated. Of course, with T = N² the requirement that T > N⁵ (corresponding to N/T^{1/5} → 0 as N, T → ∞) is not even close to being satisfied, and therefore the poor performance of τ_IV in this case does not come as a surprise. There is also no tendency for the distortions to become smaller as N and T increase. In fact, the distortions actually increase with the sample size. τ_IV,i is also undersized, although not to the same extent as τ_IV.
• When κ = 0 the power of τ is increasing in the sample size, which is to be expected, as the rate of shrinking of the local alternative in this case is not fast enough to prevent the test statistic from diverging with N. The same reasoning should in principle apply to τ_IV. However, this is not what we observe. In fact, on the contrary, unless a and b are relatively large in absolute value, the power of this test is actually decreasing in the sample size, which again is probably because T < N⁵. The choice κ = 1/2 is similarly too fast for τ_i and τ_IV,i to have nonnegligible power.
• When κ = 0 the power of the time series tests increases slightly in T, and for the panel tests there is a corresponding power increase in N and T when κ = 1/2. For larger sample sizes, however, the power is quite flat in N and T, which is in accordance with our expectations, since for these combinations of tests and values of κ there should be no dependence on the sample size, at least not asymptotically.
• The best power among the time series tests is obtained by using τ_i, and among the panel tests τ is clearly the preferred choice. Of course, given their size distortions under the null and the fact that the reported powers are not size-adjusted, τ_IV,i and τ_IV are actually expected not to perform as well as the new tests. However, the difference in power is far larger than the required compensation for size. In fact, it is not uncommon for the new tests to be more than twice as powerful as the other tests. To take an extreme example, consider the case when κ = 1/2 and a = b = −5, in which the power of τ can be up to 30 times larger than that of τ_IV!
• Power depends on the nature of the heteroscedasticity. This is particularly clear when κ = 1/2, a = −3 and b = −1, in which case the power of τ rises from about 20% in case 1 to about 30% in cases 2 and 3, which is just as expected based on the example given in Section 3.2. Unreported results suggest that there is also some variation in power coming from the location of the variance break, which is again in accordance with our expectations.
• The power of τ when a = b = −2 and a = b = −10 is about the same as when a = −3 and b = −1, and a = −15 and b = −5, respectively, which confirms the theoretical prediction that the power of this test should be driven mainly by μ_c. However, there is also a slight tendency for power to decrease as we go from a = b = −10 to a = −15 and b = −5, although the effect vanishes as N and T increase. This is in agreement with Theorem 2, suggesting that there should be a second-order effect working through σ²_c.
The results reported in Tables 1-3 are just a small fraction of the complete set of results that was generated. Some of the results that for reasons of space are not reported here, most of which pertain to the selection of the groups, can be found in the supplement. The following summarizes the findings based on those results:
• The QML group selection approach works very well in small samples.
• The "cost" in terms of test performance of having to estimate the groups is very small. Indeed, the size and power results based on known groups are almost identical to the ones reported in Tables 1-3.
• As pointed out by a referee, when N_m − N_{m−1} = 1, such that the number of groups is equal to N, then R̂_{wi,t} = ŵ_{i,t}ε̂_{i,t} = sign(ε̂_{i,t}), and sign-based tests have been shown to be robust to certain types of heteroscedasticity (see Demetrescu and Hanck 2011). Hence, while our theoretical results require that the size of each group goes to infinity with N, the new tests are still expected to perform quite well even when the groups are very small, and this is also what we find.
• While size accuracy is basically unaffected by the size of the groups, as expected, accounting for the group structure can lead to substantial gains in power.
All in all, the results reported here and in the supplement lead us to conclude that the new tests perform well in small samples, also when compared to some of the existing tests. They should therefore be a valuable addition to the already existing menu of panel unit root tests.

CONCLUDING REMARKS
This article focuses on the problem of how to test for unit roots in panels where the errors are contaminated by time series and cross-section dependence and, in addition, general unconditional heteroscedasticity. In the article, we assume that the heteroscedasticity is deterministic, as when considering models of permanent shifts or linear time trends, but it could also be stochastic. Conditional heteroscedasticity in the form of, for example, GARCH is permitted too. Moreover, the heteroscedasticity is not restricted to the time series dimension, but can also emanate to some extent from the cross-sectional dimension. The assumed DGP is therefore very general, and much more so than in most other panel data studies. In fact, the only other panel study that we are aware of to consider such a general setup is that of Demetrescu and Hanck (2011).
Two tests are proposed. One is designed to test for a unit root in a single cross-section unit, while the other is designed to test for a unit root in the whole panel. Both tests are remarkably simple, requiring no prior knowledge regarding the heteroscedasticity, and still only minor corrections are needed to obtain asymptotically nuisance parameter free test distributions. This invariance is verified in small samples using Monte Carlo simulation. It is found that the new tests show smaller size distortions than the tests of Demetrescu and Hanck (2011) and, at the same time, have much higher power. These findings suggest that they should be a useful addition to the existing menu of unit root tests.

APPENDIX: PROOFS
Lemma A.1. Under Assumptions 1-3, whose first difference is given by

It follows that
where e_{i,t} is defined in Equation (4). Consider R_{i,t}, which we can expand as

Note that by Taylor expansion of exp(x) about
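Given that ρ_i = exp(c_i/T) in the proof of Theorem 1, the expansion presumably intended here is the first-order Taylor expansion of exp(x) about x = 0 (this reconstruction is an assumption, as the display itself is missing from the source):

```latex
\exp(x) = 1 + x + O(x^2), \qquad \text{so that} \qquad
\rho_i = \exp(c_i/T) = 1 + \frac{c_i}{T} + O(T^{-2}).
```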
In the proof of Theorem 1, we show that R_{i,t−1}/√T satisfies an invariance principle, and therefore that the first term on the right-hand side is O_p(1/(N^κ√T)). As for the second term, by the Wold decomposition, 1/φ(L) can be expanded as φ*(L). Hence, since e_{i,t} is serially uncorrelated, we have that E(e_{i,s}x_{k,s−1}) = 0 for all i, k and s. The variance of The presence of the factors therefore does not affect the rate of consistency of the estimator. Making use of this result, we obtain Since e_{i,t} = λ_i′F_t + ε_{i,t}, we have that the error incurred by applying the principal components method to R_{i,t} rather than to e_{i,t} is negligible (a detailed proof is available upon request). To also show that the estimation is unaffected by the heteroscedasticity in ε_{i,t}, we verify Assumption C in Bai (2003). Toward this end, note that, since N_m = λ_m N, which is bounded under our assumptions, and therefore so is which is bounded too, because Σ_{t=1}^T σ²_{m,t}/T → ∫₀¹ σ²_m(r)dr < ∞ as T → ∞. Similarly, since ∫₀¹ Σ(r)dr > 0, Σ_{t=1}^T F_tF_t′/T converges to a positive definite matrix. These results imply that Assumption C of Bai (2003) holds, and therefore that the principal component estimators λ̂_i and F̂_t of λ_i and F_t, respectively, are unaffected by the heteroscedasticity.
Let us now consider R̂_{wi,t} = ŵ_{m,t}ε̂_{i,t} = ε̂_{i,t}/σ̂_{m,t}. Here, following Bai and Ng (2004, p. 1154), with v_t and d_i implicitly defined. Hence, For ease of notation and without loss of generality, in what follows we focus on the first group, containing the first N₁ cross-section units. Let us therefore consider σ̂²_{1,t}, which is σ̂²_{m,t} for the first group. Letting c_{0i,t} = (ε²_{i,t} − σ²_{1,t}) and using c̄_{s1,t} = Σ_{i=1}^{N₁} c_{si,t}/N₁ to denote the average c_{si,t} for this group, we have with an obvious definition of C_{1,t}. This result, together with a second-order Taylor expansion of the inverse square root of σ̂²_{1,t}, yields To work out the order of the remainder, we are going to make use of Lemma 1 of Bai and Ng (2004), which states that ||v_t|| = O_p(1/C_NT) and
Remark A.1. The expansion in Equation (A.7) has to be of second order, because otherwise the resulting remainder in the numerator of τ would not necessarily be negligible. For the numerator of τ_i, Σ_{t=p_i+2}^T ΔR̂_{wi,t}R̂_{wi,t−1}/T, a first-order expansion is enough.
We begin with c̄_{01,t}, whose order is given by as follows from using E(ε²_{i,t} − σ²_{1,t}) = 0, cross-section independence, and N₁ = λ₁N = O(N). As for c̄_{21,t}, by using (a + b)² ≤ 2(a² + b²) and ||AB|| ≤ ||A|| ||B|| for arbitrary scalars a and b, and conformable matrices A and B, it is clear that c_{2i,t} = a²_{i,t} ≤ 2||λ_i′H⁻¹||² ||v_t||² + 2||d_i||² ||F_t||², and therefore where, by the Cauchy-Schwarz inequality, It follows that and so, via the cross-section independence of ε_{i,t}, The same argument can be used to show that c̄_{11,t}, c̄_{31,t} and c̄_{91,t} are O_p(1/T). It remains to consider c̄_{51,t} and c̄_{61,t}. The orders of these are as follows: (A.14) As the above calculations make clear, c̄_{01,t}, Σ_{i=1}^{N₁} ε_{i,t}d_i′HF_t/N₁ in c̄_{41,t}, and c̄_{61,t} are the leading terms in C_{1,t}. Hence, by direct substitution of Equations (A.8)-(A.12) into the definition of C_{1,t}, where the order of the remainder follows from the fact that Because c̄_{01,t} + c̄_{41,t} + c̄_{61,t} = O_p(1/C_NT), the remainder term in Equation (A.7) is O_p(1/C³_NT), and therefore The result in Equation (A.16) suggests that R̂_{i,t} can be expanded as The order of C_{1,t}R̂_{wi,t} is equal to that of C_{1,t}, and therefore Similarly, since C_{1,t} and a_{i,t} are both O_p(1/C_NT), which is clearly O_p(1/C²_NT). The other terms are all of higher order than this. It follows that with an obvious definition of g_{i,t}, and where the second equality uses the approximation ΔR̂_{wi,t} = ε_{i,t} + (ρ_i − 1)ŵ_{1,t}R_{i,t−1} = ε_{i,t} + o_p(1), which is valid in the present case (details are available upon request). This establishes the first of the two required results.
The second result requires more work. Note first that, by using Equation (A.16), where again all terms that are O_p(1/C²_NT) can be treated as negligible. Consider Via F̂_s = HF_s + v_s, and then a leading-term approximation, ŵ³_{1,s}(c̄_{01,s} + · · · + c̄_{91,s})ΔR̂_{wi,s}, (A.21) where Σ_{s=p_i+2}^t ŵ³_{1,s}c̄_{01,s}ΔR̂_{wi,s}/√T is again not of sufficiently low order to be ignored. Let us therefore consider Σ_{s=p_i+2}^t ŵ³_{1,s}c̄_{61,s}ΔR̂_{wi,s}/√T. Because ε_{i,t} is serially uncorrelated and independent of F_t, we have E(ε_{i,s}x_{k,s−1}) = 0. Moreover, Σ_{s=p_i+2}^t ŵ²_{1,s}ε_{k,s}ε_{i,s}x_{k,s−1}/ Similarly, since Σ_{s=p_i+2}^t ŵ³_{1,s}ε_{k,s}x_{k,s−1}R_{i,s−1}/T^{3/2} = O_p(1), we can show and therefore which is O_p(1/C²_NT). The same steps can be used to show that Hence, since −2Σ_{k=1}^{N₁} ε_{k,s}d_k′HF_s/N₁ is the leading term in c̄_{41,s}, Thus, while nonnegligible in R̂_{wi,t}, c̄_{41,s} and c̄_{61,s} are in fact negligible in R̂_{wi,t−1}/√T. We now verify that the remaining terms in Equation (A.21) are negligible too. In so doing, note that, as in the above, the effect of (ρ_i − 1)ŵ_{1,s}R_{i,s−1} in ΔR̂_{wi,s} will generally be dominated by that of ε_{i,s}. Hence, unless otherwise stated, in what follows we work with the approximation ΔR̂_{wi,s} = ε_{i,s} + o_p(1).
Both R_{i,t}/√T and R̂_{wi,t}/√T satisfy an invariance principle. By using this and the fact that It is also not difficult to show that and similarly, which are all O_p(1/C²_NT). The terms involving a_{i,t} can be evaluated in a similar fashion. In particular, we have where we have made use of the fact that Σ_{s=p_i+2}^t ŵ³_{1,s}R_{k,s−1}ΔR̂_{wi,s}/T = O_p(1) and the results of Bai and Ng (2004, p. 1158), from which we deduce that Hence, by the Cauchy-Schwarz inequality, ||Σ_{s=p_i+2}^t ŵ³_{1,s}F̂_sΔR̂_{wi,s}v_s′||² ≤ (Σ_{s=p_i+2}^t ŵ⁶_{1,s}(ΔR̂_{wi,s})²||F̂_s||²)(Σ_{s=p_i+2}^t ||v_s||²), suggesting that Also, by the cyclical property of the trace, tr((d_k′F̂_s)²) = tr(d_k′F̂_sF̂_s′d_k) = tr(d_kd_k′F̂_sF̂_s′). Therefore, since As for the remaining term, according to Bai (2003, p. 166), v_s = H⁻¹Σ_{k=1}^N λ_kε_{k,s}/N + o_p(1), where Σ_Λ = lim_{N→∞} Σ_{k=1}^N λ_kλ_k′/N. By using this and the fact that ε_{i,s} is cross-section independent, we obtain As for the term involving c̄_{81,s}, where the second equality uses ΔR̂_{wi,s} = ε_{i,s} + o_p(1), while the third equality holds because The remaining term can be written as and therefore Thus, by combining the results, ŵ³_{1,s}(c̄_{11,s} + · · · + c̄_{91,s})ΔR̂_{wi,s} Moreover, by using the same steps as in the above, the fourth term on the right-hand side of Equation (A.19) is The order of the remainder is not the sharpest possible, but it is enough to show that it is o_p(1/N^κ). Consider the first term on the right-hand side. By cross-section independence, which in turn implies The same steps can be used to show that which is clearly O_p(1/C²_NT). As for Σ_{s=p_i+2}^t ŵ⁵_{1,s}c̄²_{41,s}ε_{i,s}/√T, since ε_{i,s} is independent across i and also of F_s, Σ_{s=p_i+2}^t F_sF_s′ŵ³_{1,s}ε_{i,s}ε²_{k,s}/√T = O_p(1). Hence, since tr((d_k′F_s)²) = tr(d_kd_k′F_sF_s′), we get Thus, by adding the terms, The sixth and last term on the right-hand side of Equation (A.19), This establishes the second result, and hence the proof of the lemma is complete.
Proof of Theorem 1. Consider again the group consisting of the first N₁ cross-section units. By Lemma A.1, with κ = 0, As for ΔR̂_{wi,t}/√T, we have, by a simple modification of Lemma 2 in Bai and Ng (2004), where ρ_i = exp(c_i/T) and e_{i,t} = λ_i′F_t + ε_{i,t}, by recursive substitution, as T → ∞, where with W_{ε,i}(s) and W_F(s) being two independent Brownian motions. Hence, since R̂_{wi,t} = Σ_{k=p_i+2}^t ΔR̂_{wi,k}, we obtain as N, T → ∞. This result, together with the continuous mapping theorem, implies The proof is completed by noting that as N, T → ∞.
Proof of Theorem 2. Consider the numerator of τ. By definition, R̂_{wi,t} = Σ_{k=p_i+2}^t ΔR̂_{wi,k}, with a similar result applying to Σ_{t=p_i+2}^T R̂_{wi,t−1}ΔR̂_{wi,t}. Write Hence, focusing on the first group, comprising the first N₁ units, Let us therefore consider R̂²_{wi,T}. Clearly, by Lemma A.1, ŵ³_{1,s}c̄_{01,s}ε_{i,s}.
As for the first term on the right-hand side, Moreover, as in the proof of Theorem 1, It follows that Also, since, by the proof of Lemma A.1, It follows that Similarly, since (c̄_{01,t} + c̄_{41,t} + c̄_{61,t}) = O_p(1/C_NT) and Thus, by adding the results, By using κ = 1/2 and the definition of R̂_{wi,t}, As for the first term, since κ = 1/2 in this case, Moreover, since ε_{i,t} and F_t are independent, E(ε_{i,s}e_{i,k}) = E(ε_{i,s}(λ_i′F_k + ε_{i,k})) = σ_{1,k}E(ε_{i,s}ε_{i,k}), which is zero if s ≠ k and σ_{1,k} if s = k. Letting g_1(v) = w_1(v) Let us also define with q_i(u) and q̄_1(u) being the corresponding integrals after taking N → ∞. That is, q_i(u) = q̄_1(u) = μ_c g_1(u). By using this and the fact that E(c²_i) = μ²_c + σ²_c, we obtain and by application of Corollary 1 of Phillips and Moon (1999), with λ₁N = N₁, we can further show that c_iw_{1,t}R_{wi,t−1}R_{i,t−1} →_p λ₁∫₀¹ q̄_1(u)du (A.41) as N, T → ∞, where →_p signifies convergence in probability. Thus, considering the whole cross-section, Clearly, E(R̂_{wi,t−1}ε_{i,t}) = 0. As for the variance, with an obvious definition of p_{N,i}(r). The limit of (A.43) is therefore given by R̂²_{wi,t−1} → λ₁∫₀¹ p_{N,i}(r)dr (A.44) as T → ∞. The conditions of Theorem 2 of Phillips and Moon (1999) hold. Hence, using p_i(r) to denote the limit of p_{N,i}(r), (A.45) as N, T → ∞. Note also that, since the effect of the common component in p_{N,i}(r) vanishes as N → ∞, R̂_{wi,t−1}ε_{i,t} is asymptotically uncorrelated across groups, and thus independent by normality. The same is true for R̂_{wi,t−1}ΔR̂_{wi,t} if we assume that N/T → 0, such that O_p(√N/C²_NT) = o_p(1). It follows that as N, T → ∞ with N/T → 0, and where the definition of p̄_1(u) is analogous to that of q̄_1(u). As for the denominator of τ, by using the same steps as before, via Corollary 1 of Phillips and Moon (1999),
Proof of Proposition 1.
For notational simplicity, in this proof, when context is clear, the dependence ofσ 2 1,t (n) andσ 2 2,t (n) on n will be suppressed.
The criterion function is QML(n) = Σ_{t=p_i+2}^T [n log σ̂²_{1,t}(n) + (N − n) log σ̂²_{2,t}(n)], from which we obtain QML(n) − QML₀(N₁) where x_t = σ²_{1,t}/σ²_{2,t} − 1. Hence, since From here it is easy to see that if n = N₁, then Note that QML₀(N₁) is independent of n. Hence, minimizing QML(n) is equivalent to minimizing [QML(n) − QML₀(N₁)]. Because of this, and the fact that the order of magnitude of [QML(n) − QML₀(N₁)] increases by a factor of √NT when n < N₁, the minimum is obtained by setting n = N₁, as was to be shown.
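For concreteness, the minimization of QML(n) over candidate group sizes can be sketched as follows. Two simplifying assumptions are made for illustration: the units are ordered so that the two variance groups are contiguous, and the (quasi-) ML variance estimates σ̂²_{1,t}(n) and σ̂²_{2,t}(n) are taken to be cross-sectional averages of squared residuals; neither is spelled out in the fragment above.

```python
import numpy as np

def qml_break(E, q=0.15):
    """Estimate the group boundary n by minimizing
    QML(n) = sum_t [ n*log(sigma1_t^2(n)) + (N-n)*log(sigma2_t^2(n)) ].

    E: T x N matrix of residuals, units ordered so the groups are contiguous.
    q: trimming fraction (q = 0.15 in the simulations)."""
    T, N = E.shape
    lo, hi = int(np.ceil(q * N)), int(np.floor((1 - q) * N))
    best_n, best_val = lo, np.inf
    for n in range(lo, hi + 1):
        s1 = np.mean(E[:, :n] ** 2, axis=1)   # sigma1_t^2(n), length T
        s2 = np.mean(E[:, n:] ** 2, axis=1)   # sigma2_t^2(n), length T
        val = np.sum(n * np.log(s1) + (N - n) * np.log(s2))
        if val < best_val:
            best_n, best_val = n, val
    return best_n
```

With two groups whose error variances differ markedly, the criterion locates the true boundary.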