“Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods”

Abstract The synthetic control (SC) method, a powerful tool for estimating average treatment effects (ATEs), is increasingly popular in fields such as statistics, economics, political science, and marketing. The SC method is particularly suitable for estimating ATEs with a single (or a few) treated unit(s), a fixed number of control units, and large pre- and post-treatment periods (which we refer to as “long panels”). To date, there has been no formal inference theory for the SC ATE estimator with long panels under general conditions. Existing work mostly uses placebo tests for inference, or permutation methods when the post-treatment period is small. In this article, we derive the asymptotic distributions of the SC and modified synthetic control (MSC) ATE estimators using projection theory. We show that a properly designed subsampling method can be used to obtain confidence intervals and conduct inference, whereas the standard bootstrap cannot. Simulations and an empirical application that examines the effect of an e-tailer's opening of a physical showroom demonstrate the usefulness of the MSC method in applications. Supplementary materials for this article are available online.

We give some explanations of the above derivations. Hilbert space projection onto convex sets was studied by Zarantonello (1971) and extended to general econometric model settings by Fang and Santos (2018). The projection operator $\Pi_\Lambda : \mathbb{R}^N \to \Lambda$ ($\Lambda$ is a convex subset of $\mathbb{R}^N$) can be viewed as a functional mapping. Zarantonello (1971) showed that $\Pi_\Lambda$ is (Hadamard) directionally differentiable, and its directional derivative at $\beta_0 \in \Lambda$ is $\Pi_{T_{\Lambda,\beta_0}}$, the projection onto the tangent cone of $\Lambda$ at $\beta_0$. Hence, the second equality of (A.3) follows from a functional Taylor expansion; the third equality follows from the fact that $\Pi_{T_{\Lambda,\beta_0}}$ is positively homogeneous of degree one, i.e., $\alpha\,\Pi_{T_{\Lambda,\beta_0}}(\theta) = \Pi_{T_{\Lambda,\beta_0}}(\alpha\theta)$ for all $\alpha \ge 0$ and all $\theta \in \mathbb{R}^N$; and the last line follows from $\sqrt{T_1}(\hat\beta_{OLS} - \beta_0) \xrightarrow{d} Z_1$ and the continuous mapping theorem, because projection is a continuous mapping.
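In display form, the chain just described reads as follows (our reconstruction of (A.3) from the surrounding text; in the actual derivation the projections are taken under the $X'X/T_1$ norm rather than the Euclidean norm):
$$
\sqrt{T_1}\,(\hat\beta_{T_1} - \beta_0)
= \sqrt{T_1}\,\bigl(\Pi_\Lambda(\hat\beta_{OLS}) - \Pi_\Lambda(\beta_0)\bigr)
= \sqrt{T_1}\,\Pi_{T_{\Lambda,\beta_0}}(\hat\beta_{OLS} - \beta_0) + o_p(1)
= \Pi_{T_{\Lambda,\beta_0}}\bigl(\sqrt{T_1}(\hat\beta_{OLS} - \beta_0)\bigr) + o_p(1)
\xrightarrow{d} \Pi_{T_{\Lambda,\beta_0}}(Z_1).
$$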
The term 'tangent cone' is analogous to calling the derivative of a function at a given point the 'tangent line' of the function at that point. Here, the functional derivative of the mapping $\Pi_\Lambda$ at $\beta_0$ is the projection $\Pi_{T_{\Lambda,\beta_0}}$ onto a cone (rather than onto a line); this cone is therefore called the 'tangent cone' of $\Lambda$ at $\beta_0$ and is denoted by $T_{\Lambda,\beta_0}$. For readers' convenience, we give the formal definition of the tangent cone of $\Lambda$ at $\theta \in \mathbb{R}^N$ below:
$$
T_{\Lambda,\theta} = \overline{\{\alpha(\lambda - \theta) : \alpha \ge 0,\ \lambda \in \Lambda\}},
$$
where for any set $A \subset \mathbb{R}^N$, $\overline{A}$ is the closure of $A$ ($\overline{A}$ is the smallest closed set that contains $A$).
Using the above definition, one can easily check that for our synthetic control estimation problem, the tangent cone of $\Lambda$ at $\beta_0$ coincides with the asymptotic range of $\sqrt{T_1}(\hat\beta_{T_1} - \beta_0)$. A small numerical illustration follows.
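As a numerical illustration (ours, not part of the original derivation; it uses the Euclidean norm and a hypothetical boundary point), take $\Lambda = \{b \in \mathbb{R}^3 : b \ge 0\}$, an MSC-style constraint set. The difference quotient of $\Pi_\Lambda$ at a boundary point $\beta_0$ converges to the projection onto $T_{\Lambda,\beta_0}$, which leaves directions unrestricted in coordinates where $\beta_{0,j} > 0$ and truncates negative directions where $\beta_{0,j} = 0$:

```python
import numpy as np

beta0 = np.array([0.5, 0.0, 0.0])       # boundary point: last two coordinates are 0
theta = np.array([-0.3, 0.4, -0.7])     # an arbitrary direction

def proj_orthant(b):                     # Euclidean projection onto {b >= 0}
    return np.maximum(b, 0.0)

def proj_tangent_cone(th):               # coordinates with beta0_j = 0 must stay >= 0
    out = th.copy()
    at_boundary = (beta0 == 0)
    out[at_boundary] = np.maximum(out[at_boundary], 0.0)
    return out

for t in [1e-1, 1e-3, 1e-5]:
    dq = (proj_orthant(beta0 + t * theta) - beta0) / t
    print(t, dq, proj_tangent_cone(theta))
# The difference quotient converges to the tangent-cone projection (-0.3, 0.4, 0).
```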

B.1 A projection of the unconstrained estimator
We write the regression model in matrix form: $Y = X\beta_0 + u$, where $Y$ and $u$ are both $T_1 \times 1$ vectors, $X$ is of dimension $T_1 \times N$ and has full column rank, and $\beta_0$ is of dimension $N \times 1$.
We assume that the true parameter $\beta_0 \in \Lambda$, where $\Lambda$ is a closed and convex set ($\Lambda = \Lambda_{SC}$ or $\Lambda_{MSC}$ in our applications). We denote the constrained least squares estimator as $\hat\beta_{T_1}$, i.e.,
$$
\hat\beta_{T_1} = \arg\min_{\beta \in \Lambda} \|Y - X\beta\|^2,
$$
where $\|A\|^2 = A'A$ for a vector $A$.
We denote the unconstrained least squares estimator as $\hat\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^N}(Y - X\beta)'(Y - X\beta)$, i.e., $\hat\beta_{OLS} = (X'X)^{-1}X'Y$. By the definition of $\hat\beta_{OLS}$, we may write $Y = X\hat\beta_{OLS} + \hat u$, so that
$$
f(\beta) = \|Y - X\beta\|^2 = \|X\hat\beta_{OLS} + \hat u - X\beta\|^2 = \|\hat u\|^2 + (\hat\beta_{OLS} - \beta)'X'X(\hat\beta_{OLS} - \beta),
$$
where we dropped a cross term in the third equality because $\hat u'X = 0$ (the least squares residual $\hat u$ is orthogonal to $X$). Since $\|\hat u\|^2$ is unrelated to $\beta$, the minimizer of $f(\beta)$ is identical to the minimizer of $(\hat\beta_{OLS} - \beta)'X'X(\hat\beta_{OLS} - \beta)$. Thus, we have
$$
\hat\beta_{T_1} = \arg\min_{\beta \in \Lambda}(\hat\beta_{OLS} - \beta)'X'X(\hat\beta_{OLS} - \beta) = \arg\min_{\beta \in \Lambda}(\hat\beta_{OLS} - \beta)'(X'X/T_1)(\hat\beta_{OLS} - \beta),
$$
where the second equality follows since $T_1 > 0$.
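This equivalence can be checked numerically. The following minimal sketch (our illustration with simulated data; not the paper's code) solves the constrained least squares problem directly and as a projection of $\hat\beta_{OLS}$ under the $X'X$ norm, for the MSC-style constraint $\beta \ge 0$:

```python
# Check: constrained LS over Lambda = {beta >= 0} equals the projection of
# beta_OLS onto Lambda under the norm induced by X'X.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T1, N = 200, 4
X = rng.normal(size=(T1, N))
beta_true = np.array([0.6, 0.4, 0.0, 0.0])      # true beta on the boundary
Y = X @ beta_true + rng.normal(size=T1)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)    # unconstrained OLS

def sse(beta):                                   # constrained LS objective
    r = Y - X @ beta
    return r @ r

def proj_obj(beta):                              # (b_ols - b)' X'X (b_ols - b)
    d = beta_ols - beta
    return d @ (X.T @ X) @ d

bnds = [(0.0, None)] * N                         # MSC-style constraint: beta >= 0
b1 = minimize(sse, np.full(N, 0.25), bounds=bnds).x
b2 = minimize(proj_obj, np.full(N, 0.25), bounds=bnds).x
print(np.round(b1, 4), np.round(b2, 4))          # the two solutions coincide
```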

B.2 The uniqueness of the (modified) synthetic control estimator
We first give the definition of a strictly convex function. A function $f$ is said to be strictly convex if $f(\alpha x + (1-\alpha)y) < \alpha f(x) + (1-\alpha)f(y)$ for all $0 < \alpha < 1$ and all $x \neq y$, $x, y \in D$, where $D$ is the domain of $f$.
Under the assumption that the data matrix $X_{T_1 \times N}$ has full column rank, we show below that $f(\beta) = \|Y - X\beta\|^2$ is a strictly convex function. Since the objective function is strictly convex and the constraint sets for $\beta$, $\Lambda_{SC}$ and $\Lambda_{MSC}$, are convex, the constrained minimization problem has a unique (global) minimizer. To see this, we argue by contradiction. Suppose that we have two minimizers $z_1 \neq z_2$. Then for any $0 < \alpha < 1$, the point $\alpha z_1 + (1-\alpha)z_2$ belongs to the (convex) constraint set, and by strict convexity $f(\alpha z_1 + (1-\alpha)z_2) < \alpha f(z_1) + (1-\alpha)f(z_2) = f(z_1)$. This contradicts the fact that $z_1$ and $z_2$ are two minimizers. Hence, we must have $z_1 = z_2$, and the minimizer is unique.
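To see strict convexity directly, one can verify by expanding the squares that the quadratic objective satisfies the exact identity
$$
f(\alpha x + (1-\alpha)y) = \alpha f(x) + (1-\alpha)f(y) - \alpha(1-\alpha)\,\|X(x - y)\|^2,
$$
and full column rank of $X$ implies $\|X(x-y)\|^2 > 0$ whenever $x \neq y$, which yields the strict inequality for all $0 < \alpha < 1$.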

Appendix C: Three useful lemmas
In this supplementary appendix, we prove three lemmas: Lemmas C.1 and C.2 are used to prove Theorem 3.1, and Lemma C.3 concerns Theorem 4.1.
Lemma C.1 Under the same conditions as in Theorem 3.1, we have $\hat\beta_{T_1} - \beta_0 = o_p(1)$.

Proof: Fix $\epsilon > 0$ and suppose that $\|\hat\beta_{T_1} - \beta_0\| \ge \epsilon$, which gives (C.1), where the strict inequality is due to the uniqueness of the projection and the assumption that $\epsilon > 0$, which implies that $\hat\beta_{T_1} \neq \beta_0$. By simple algebra (adding/subtracting terms), we obtain (C.2). By (C.1) and (C.2), we know that the sum of the last two terms in (C.2) is negative, i.e., (C.3) holds. The minimum eigenvalue of a square matrix $A$ is denoted by $\lambda_{\min}(A)$. The third equality uses Lemma C.2, which is proved at the end of this appendix.
By writing $X'X/T_1 = E(X_tX_t') + [X'X/T_1 - E(X_tX_t')]$, the second term in (C.3) can be rewritten as (C.4). By the definition of $\hat\beta_{T_1}$ and Lemma 1.1 in Zarantonello (1971), we obtain (C.5). By a law of large numbers, (C.6) holds, where we used the Lipschitz continuity of projection operators, and $\Pi_{\Lambda,T_1}$ is the projection onto $\Lambda$ with respect to the aforementioned random norm $\|a\|_{T_1} = \sqrt{a'(X'X/T_1)a}$ (Zarantonello, 1971). Hence, we have $D_{2,2,T_1} = o_p(1)$. Combining $D_{2,2,T_1} = o_p(1)$ and (C.6), we obtain (C.7) (this argument is used twice in (C.8) below), where the second inequality above follows from $D_{T_1} = D_{1,T_1} + D_{2,T_1} \ge \epsilon^2\lambda_{\min}(X'X/T_1) + o_p(1)$ by (C.4) and (C.7). Hence, (C.8) yields a contradiction, so $\|\hat\beta_{T_1} - \beta_0\| < \epsilon$ with probability approaching one. This concludes the proof of Lemma C.1.
Lemma C.2 Let $A$ be an $N \times N$ positive definite matrix, and let $S^N = \{a \in \mathbb{R}^N : \|a\| = 1\}$ denote the unit sphere in $\mathbb{R}^N$. Then we have $\inf_{a \in S^N} a'Aa = \lambda_{\min}(A)$.
Proof: Let $v_1, \ldots, v_N$ be $N$ eigenvectors of $A$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_N$, so that $Av_j = \lambda_j v_j$ for $j = 1, \ldots, N$. Since $v_1, \ldots, v_N$ form an orthonormal basis for $\mathbb{R}^N$, any $a \in S^N$ can be written as $a = \sum_{j=1}^N c_j v_j$ with $\sum_{j=1}^N c_j^2 = 1$, so that $a'Aa = \sum_{j=1}^N \lambda_j c_j^2 \ge \lambda_{\min}(A)$, which implies (i) $\inf_{a \in S^N} a'Aa \ge \lambda_{\min}(A)$. On the other hand, pre-multiplying $Av_j = \lambda_j v_j$ by $v_j'$, we get $\lambda_j = v_j'Av_j \ge \inf_{a \in S^N} a'Aa$ for all $j = 1, \ldots, N$, which implies (ii) $\lambda_{\min}(A) \ge \inf_{a \in S^N} a'Aa$. Combining (i) and (ii), we finish the proof of Lemma C.2.
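A quick numerical check of Lemma C.2 (illustrative only; the matrix is randomly generated):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B.T @ B + 5.0 * np.eye(5)           # positive definite by construction

lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
a_min = V[:, 0]                          # unit eigenvector for lambda_min
print(a_min @ A @ a_min, lam[0])         # the quadratic form attains lambda_min

draws = rng.normal(size=(100_000, 5))    # random directions on the unit sphere
draws /= np.linalg.norm(draws, axis=1, keepdims=True)
print(np.einsum('ij,jk,ik->i', draws, A, draws).min())  # >= lambda_min, close to it
```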
The theoretical result presented in Theorem 4.1 is pointwise. That is, Theorem 4.1 holds true for a fixed vector $\beta_0 \in \Lambda$. However, one may be concerned about whether Theorem 4.1 also holds uniformly locally at $\beta_{0,T_1}$ as $\beta_{0,T_1}$ approaches the boundary of the set $\Lambda$ (as $T_1 \to \infty$). If the limiting distribution of $\hat A$ depends discontinuously on $\beta_0$ when $\beta_0$ is at the boundary of $\Lambda$, then the test may fail to adequately control size when $\beta_0$ is close to the boundary of $\Lambda$. In the case of the MSC method, $\beta_0$ is at the boundary of $\Lambda$ if $\beta_{0,j} = 0$ for some $2 \le j \le N$. To examine this issue, we consider a sequence of local parameters of the form $\beta_{0,T_1} = \beta_0 + c/\sqrt{T_1}$, where $\beta_{0,j} \ge 0$ for all $j \in \{2, \ldots, N\}$, $\beta_{0,i} = 0$ for at least one $i \in \{2, \ldots, N\}$, and $c_j \ge 0$ for all $j \in \{2, \ldots, N\}$.
Lemma C.3 states that, under this local parameter sequence, (C.11) holds, where $\hat A$ is defined in (4.2), and $P_{\beta_0 + c/\sqrt{T_1}}$ denotes the distribution of the data associated with $\beta = \beta_0 + c/\sqrt{T_1} \in \Lambda$, with $\beta_0$ at the boundary of $\Lambda$. Equation (C.11) proves Lemma C.3, and it implies that our analysis delivers inference procedures with reliable size control.
Appendix D: Asymptotic theory with non-stationary data

D.1 The trend-stationary data
The trend-stationary data generating process can also be motivated using a factor-model framework. Let $\{y^0_{it}\}$, for $i = 1, \ldots, N$ and $t = 1, \ldots, T$, be generated by some common factors, with one of the factors being a time trend and the remaining factors being weakly dependent stationary variables. Following Hsiao, Ching, and Wan (2012), we assume that $y^0_t = (y^0_{1t}, y^0_{2t}, \ldots, y^0_{Nt})'$ is generated via a factor model $y^0_t = B f_t + \epsilon_t$, where $f_t = (f_{1t}, \ldots, f_{Kt})'$ is a $K \times 1$ vector of common factors, $B$ is an $N \times K$ matrix of factor loadings, and $\epsilon_t = (\epsilon_{1t}, \ldots, \epsilon_{Nt})'$ is an $N \times 1$ vector of idiosyncratic errors. We assume that $f_{1t} = t$ and all other factors are stationary variables.
Let $\hat\alpha_{T_1}$ and $\hat\beta_{T_1}$ be the SC/MSC estimators of $\alpha_0$ and $\beta_0$ based on (D.3). Then it is straightforward to show (D.4). Thus, using (3.7) and (D.4), we obtain the corresponding expansion, where $M_{T_1}$ is a diagonal scaling matrix whose remaining diagonal elements all equal $\sqrt{T_1}$.
To establish the asymptotic distribution of $\sqrt{T_2}(\hat\Delta_1 - \Delta_1)$, we make the following assumptions.
We assume that $w_t$ is a $\rho$-mixing process whose mixing coefficient $\rho(\tau)$ satisfies $\rho(\tau) \le C\lambda^\tau$ for some finite positive constant $C$ and some $\lambda \in (0,1)$. Assumptions D1 and D2 are not restrictive. They require that $(z_t, v_{1t})$ be a weakly dependent stationary process so that a law of large numbers and a central limit theorem hold for its (partial) sums. If $E(z_tz_t')$ is not invertible, we can remove the linearly dependent regressors and redefine $z_t$ as a subset of $(1, \eta_{2t}, \ldots, \eta_{Nt})'$ such that Assumption D1 holds. Assumption D3 further imposes an exponential decay rate on the $\rho$-mixing coefficients. Many ARMA processes are known to be $\rho$-mixing with an exponential decay rate, as illustrated below.
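As a standard illustration (stated here for concreteness, not as part of our assumptions): for a stationary Gaussian AR(1) process, the $\rho$-mixing coefficient equals the maximal correlation coefficient, which decays geometrically, so Assumption D3 holds:
$$
w_t = \phi\, w_{t-1} + e_t, \qquad |\phi| < 1, \qquad e_t \sim \text{iid } N(0, \sigma_e^2),
$$
$$
\rho(\tau) = |\phi|^{\tau} \le C\lambda^{\tau} \quad \text{with } C = 1 \text{ and } \lambda = |\phi| \in (0, 1).
$$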
Our derivation below is based on Andrews (1999). By adding/subtracting $\gamma_0$ and inserting the identity matrix $I_{N+1} = M_{T_1}M_{T_1}^{-1}$, we can write $A(\gamma)$ in expanded form, where the fourth equality follows from $(AB)' = B'A'$ and $Z_3$ is a zero-mean, finite-variance, $(N+1) \times 1$ vector of normal random variables. It is easy to show that $J_{T_1} \xrightarrow{p} J_{tr}$, where $J_{tr}$ is an $(N+1) \times (N+1)$ positive definite matrix. Hence, taking the limit as $T_1 \to \infty$ and letting $\hat\lambda$ denote the resulting minimizer, note that the last equality in (D.10) defines a projection. That is, for the time-trend model, the projection of $\theta \in \mathbb{R}^{N+1}$ onto a convex set $\Lambda$ is defined as in (D.11). Thus, we have just shown that $\hat\lambda$ converges to this projection. By Assumption D3 and the proofs of Theorem 3.2 and Lemma 1 in Li and Bell (2017), we know that $\hat\gamma - \gamma_0$ is asymptotically independent of $T_2^{-1/2}\sum_{t=T_1+1}^{T} v_{1t}$. Therefore, applying the projection theory to (D.6), we immediately have the following result.
Under Assumptions D1 to D3 and noting that $\gamma_0 \in \Lambda$, we have the stated asymptotic distribution by (D.12), where $Z_3$ is the weak limit of $M_{T_1}(\hat\gamma_{OLS} - \gamma_0)$ as described in Assumption C1, and $Z_2$ is independent of $Z_3$ and is normally distributed with zero mean and variance $\Sigma_v$.

D.2 The unit-root non-stationary data
Here we only consider unit-root processes without drift, because the asymptotic theory for a unit-root process with a drift is the same as in the trend-stationary case: the drift term generates a time-trend component that dominates the other components. Therefore, we assume that, in the absence of treatment, the outcome variables follow unit-root processes without drift, $y^0_{jt} = y^0_{j,t-1} + \eta_{jt}$, where $\eta_{jt}$ is a zero-mean, weakly dependent stationary process that satisfies Assumption D4 below.
Here, $B_u$ is a (scalar) standard Brownian motion generated by the partial sums of the $u_{1t}$'s ($B_u$ is independent of $B_\eta$). The convergence results presented in Assumption D4 hold jointly; then, by the continuous mapping theorem, we obtain the joint weak limit, where $D_{T_1} = T_1\,\mathrm{Diag}(T_1^{-1/2}, 1, \ldots, 1)$ is the $N \times N$ diagonal matrix defined in Section 3.4.
Remark D.1 Cointegration theory is well developed in the literature. Primitive conditions ensuring that Assumptions D4 and D5(i) hold can be found in many published papers, e.g., Stock and Watson (1993).
Recall that $\phi = \lim_{T_1,T_2 \to \infty} T_2/T_1$. It can be shown that when $\phi = 0$, $\hat A_1 = o_p(1)$; therefore, we only need to consider the case $\phi > 0$, using Assumption D4(ii). For the unit-root data process, define $J_I$ analogously to $J_{tr}$. Similar to the derivation of (D.11), we can show that, for the unit-root process, the projection of $\theta \in \mathbb{R}^N$ onto a convex set $\Lambda$ is defined analogously. Similar to the derivations of (D.9) and (D.12), we can show the corresponding weak limit. Noting that $T/T_1 = 1 + T_2/T_1 \to 1 + \phi$, we obtain the limit of $\hat A$. It can be shown that $\hat A_1$ and $\hat A_2$ are asymptotically independent of each other. This completes the proof of Theorem 3.5.

Appendix E: Additional simulation results
In this supplementary appendix, we report some additional simulation results. In Section 5.4, we compute $MSE(\hat\Delta_1)$ for four different methods. We also compute the squared biases and variances of these estimators. The results show that variances dominate squared biases in the sense that more than 96% of each MSE comes from the variance. We show the results for the modified synthetic control (MSC) and HCW methods in Table 8. The results for the original synthetic control (OSC) and the synthetic control (SC) methods are similar and are not presented here.

E.1 Estimation and inference for large N
In Section 5.4, we report $MSE(\hat\Delta_1) = M^{-1}\sum_{j=1}^{M}(\hat\Delta_{1,j} - \Delta_1)^2$. We also computed the squared bias and variance of $\hat\Delta_1$, where $Bias(\hat\Delta_1) = M^{-1}\sum_{j=1}^{M}\hat\Delta_{1,j} - \Delta_1$ and $M = 10{,}000$ is the number of simulations. It is easy to check that the identity $MSE(\hat\Delta_1) = (Bias(\hat\Delta_1))^2 + Var(\hat\Delta_1)$ holds. To save space, we report the ratios $Var(\cdot)/MSE(\cdot)$ for the modified synthetic control (MSC) and HCW methods, as they dominate the original synthetic control (OSC) and synthetic control (SC) methods in most cases. Table 8 reports the variance-to-MSE ratios for the case where $u_{it}$ (defined in (5.1)) is uniformly distributed. We see from Table 8 that the ratios $Var(\cdot)/MSE(\cdot)$ are greater than 99% in all cases. Therefore, the squared biases are negligible compared to the variances.
The negligible squared biases may be partly due to the symmetric distribution of $u_{it}$. Thus, we next replace $u_{it}$ by an asymmetric $\chi^2_1$ distribution (normalized to have zero mean and unit variance). The variance-to-MSE ratios for the chi-square distributed $u_{it}$ case are given in Table 9, where we see that the ratios indeed drop, but they remain greater than 96% in all cases considered. The results show that variance is the main component of MSE; a small Monte Carlo sketch of this decomposition follows.
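The following toy Monte Carlo (our sketch; the DGP is a simple stand-in, not one of DGP1-DGP8 from the main text) illustrates the decomposition $MSE = Bias^2 + Var$ and the variance-to-MSE ratio reported in Tables 8 and 9:

```python
import numpy as np

rng = np.random.default_rng(2)
M, T2 = 10_000, 50
Delta1 = 1.0                             # true ATE in this toy DGP

est = np.empty(M)
for j in range(M):
    # chi^2_1 errors normalized to zero mean, unit variance (asymmetric case)
    u = (rng.chisquare(1, size=T2) - 1.0) / np.sqrt(2.0)
    est[j] = Delta1 + u.mean()           # a stand-in ATE estimator

mse = np.mean((est - Delta1) ** 2)
bias = est.mean() - Delta1
var = est.var()
print(mse, bias**2 + var, var / mse)     # identity holds; the ratio is near 1
```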
We also compute the MSE for estimating the counterfactual outcome $y^0_{1t}$, where $y^0_{1t,j}$ and $\hat y^0_{1t,j}$ are the generated outcome data and its estimator at the $j$th replication. The results are given in Table 10. We see the same ranking as in the case of $MSE(\hat\Delta_1)$ reported in Table 3: only for DGP6 with $N = 11$ and $N = 21$ does the HCW method have smaller MSE than the modified synthetic control (MSC). For all other cases, the modified synthetic control (MSC) has smaller MSE than HCW.
We report estimated coverage probabilities of the modified synthetic control (MSC) and HCW methods for DGP7 and DGP8 discussed in Section 5.5. The results are given in Table 11. They are similar to the cases of DGP5 and DGP6: while the HCW confidence intervals significantly over-cover $\Delta_1$ for large $N$, the modified synthetic control (MSC) method has more accurate coverage probabilities than the HCW method.

E.2 Inference when $T_2$ is small
In this section, we consider the case of large $T_1$ (100, 200) and small $T_2$ (3, 5). We use Andrews' (2003) end-of-sample stability test. The simulation results are reported in Table 12. Andrews' (2003) test is expected to give good estimated sizes when $T_1$ is large. We see from Table 12 that the test is oversized for $T_1 = 100$; as expected, its estimated sizes improve as $T_1$ increases to 200. Another result from Table 12 is that, for fixed $T_1$, the estimated sizes deteriorate as $T_2$ increases. This is understandable because the test is designed for large $T_1$ and small $T_2$.
Recall that a test is said to be consistent if, when the null hypothesis is false, the probability of rejecting the (false) null hypothesis converges to one as the sample size goes to infinity ($T_2 \to \infty$). As Andrews (2003) points out, this statistic does not yield a consistent test for small values of $T_2$. While a large $T_1$ helps to give better estimated sizes, it does not increase the power of the test. Therefore, we only consider $T_1 = 100$ for the power calculations, because for $T_1 = 200$ or even larger $T_1$ the powers of the test are similar. When $T_1$ is large, the power of the test increases with $T_2$ and also depends on the magnitude of $\sum_{t=T_1+1}^{T}(\Delta_{1t} - \Delta_{1,0})$ under $H_1$. From Table 12, we see that the estimated power increases with $T_2$ as well as with $\alpha_0$ (the magnitude of $\Delta_{1t}$). However, a large $T_2$ adversely affects the estimated sizes of Andrews' (2003) test.
We also conducted simulations of Andrews' (2003) test under DGP1 using $T_1 = 90$ and $T_2 = 20$ (the same $T_1$ and $T_2$ as in our empirical data). Based on 10,000 simulations with $\alpha_0 = 0$, the estimated sizes are 0.1660 and 0.1964 for nominal levels 5% and 10%, respectively. We see that, for the $T_2 = 20$ case, $T_1 = 90$ is not large enough for the test to have good estimated sizes, because an error term of order $T_2/T_1$ is not negligible, which renders Andrews' (2003) test invalid in our context. Therefore, the end-of-sample stability test and the subsampling testing procedure are complements to each other: the former can be used when $T_2$ is small, while the latter is preferred when $T_2$ is not small.
Remark E.1 For our (modified) synthetic control ATE estimator with panel data, a large $T_2$ invalidates Andrews' (2003) test due to an error term of order $T_2/T_1$ becoming non-negligible.
This differs from the time series model considered by Andrews (2003), where, when $T_2$ is also large, testing for a possible structural break at $T_1$ becomes a simple and standard problem.
Appendix F: Explanation of why the subsampling method works for a wide range of subsample sizes

In this appendix, we explain why the subsampling method works well for our estimated ATE for a wide range of subsample size $m$ values.
F.1 An illustrative example

Consider the simple example of Andrews (2000): $Y_i$, $i = 1, \ldots, n$, are iid with mean $\mu_0 \ge 0$ and unit variance. The constrained least squares estimator of $\mu_0$ is $\hat\mu_n = \max\{\bar Y_n, 0\}$, where $\bar Y_n = n^{-1}\sum_{i=1}^n Y_i$. It is easy to show that, when $\mu_0 = 0$,
$$
\hat S_n = \sqrt{n}(\hat\mu_n - \mu_0) \xrightarrow{d} \max\{Z, 0\}, \tag{F.1}
$$
where $Z$ denotes a standard normal random variable. Let $Y^*_i$ be random draws from $\{Y_j\}_{j=1}^n$. Then a bootstrap analogue of (F.1) is $\hat S^*_n = \sqrt{n}(\hat\mu^*_n - \hat\mu_n)$, where $\hat\mu^*_n = \max\{\bar Y^*_n, 0\}$ and $\bar Y^*_n = n^{-1}\sum_{i=1}^n Y^*_i$. Andrews (2000) shows that this standard resampling bootstrap, as well as several parametric bootstrap methods, does not work in the sense that, when $\mu_0 = 0$, $\hat S^*_n = \sqrt{n}(\hat\mu^*_n - \hat\mu_n)$ does not converge to $\max\{Z, 0\}$, the limiting distribution of $\hat S_n$. In fact, Andrews (2000) shows that $\hat S^*_n$ converges to a distribution that lies to the left of $\max\{Z, 0\}$.
Andrews (2000) also suggests a few resampling methods that overcome the problem. One particularly easy-to-implement method is a parametric subsampling method. Specifically, for values of $m$ satisfying $m \to \infty$ and $m/n \to 0$ as $n \to \infty$, one can use $\hat S^*_m = \sqrt{m}(\hat\mu^*_m - \hat\mu_n)$, where $\hat\mu^*_m = \max\{\bar Y^*_m, 0\}$, $\bar Y^*_m = m^{-1}\sum_{i=1}^m Y^*_i$, and $Y^*_i = \bar Y_n + v_i$ with $v_i$ iid $N(0, 1)$. To see that the subsampling method indeed works, we have, conditional on the data, the expansion (F.2), where the second equality follows from the definition of $\hat\mu^*_m$, the third equality follows from adding and subtracting $\sqrt{m}\,\mu_0$, the fourth equality follows from $\max\{a, b\} - c = \max\{a - c, b - c\}$, and the sixth equality follows from $m/n = o(1)$, $\sqrt{n}(\bar Y_n - \mu_0) = O_p(1)$, and $o(1)O_p(1) = o_p(1)$. The last equality follows from the fact that $Y^*_i - \bar Y_n = v_i$ is iid $N(0, 1)$. If instead $Y^*_i$ is iid with mean $\bar Y_n$ and unit variance but is not normally distributed, then we need $m$ to be large so that $\sqrt{m}(\bar Y^*_m - \bar Y_n) \xrightarrow{d} N(0, 1) \equiv Z$ by virtue of a central limit theorem argument (as $m \to \infty$).
Comparing (F.1) and (F.2), we see that the subsampling method works under the very mild conditions that $m \to \infty$ and $m/n \to 0$ as $n \to \infty$. A small simulation sketch of this comparison follows.
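The following simulation sketch (ours; the statistic definitions follow (F.1) and (F.2)) contrasts the naive $m = n$ bootstrap with parametric subsampling in this example with $\mu_0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, R = 100, 10, 5000

def S(sample_mean, center, size):
    # sqrt(size) * (max{mean, 0} - max{center, 0})
    return np.sqrt(size) * (max(sample_mean, 0.0) - max(center, 0.0))

S_n, S_boot, S_sub = [], [], []
for _ in range(R):
    Y = rng.normal(0.0, 1.0, size=n)      # mu_0 = 0: the boundary case
    S_n.append(S(Y.mean(), 0.0, n))       # sqrt(n)(mu_hat - mu_0)
    Yb = rng.choice(Y, size=n)            # naive resampling bootstrap, m = n
    S_boot.append(S(Yb.mean(), Y.mean(), n))
    Ys = Y.mean() + rng.normal(size=m)    # parametric subsample draw, m << n
    S_sub.append(S(Ys.mean(), Y.mean(), m))

# Subsampling quantiles track those of S_n; the bootstrap's sit to the left.
for q in (0.90, 0.95):
    print(q, np.quantile(S_n, q), np.quantile(S_sub, q), np.quantile(S_boot, q))
```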

F.2 Testing for zero ATE by subsampling method
We conduct simulations to examine the finite-sample performance of the subsampling method.
We generate $Y_i$ iid $N(0, 1)$ (i.e., $\mu_0 = 0$) for $i = 1, \ldots, n$, choose $n = 100$, and conduct 5,000 simulation replications. In each replication, we draw 2,000 subsample statistics and sort them in ascending order as $\hat S^*_{m,(1)} \le \cdots \le \hat S^*_{m,(2000)}$. We then obtain the right-tail $\alpha$-percentile value $\hat S^*_{((1-\alpha)(2000))}$ and record the rejection rate as the percentage of replications in which $\hat S_n$ is greater than or equal to $\hat S^*_{((1-\alpha)(2000))}$, for $\alpha \in \{0.01, 0.05, 0.1, 0.2\}$. We consider two cases: (i) we generate $Y_i$ iid $N(0, 1)$ and $Y^*_i = \bar Y_n + v_i$ with $v_i$ iid $N(0, 1)$; and (ii) we generate $Y_i$ uniformly distributed over $[-\sqrt{3}, \sqrt{3}]$ (so that it has zero mean and unit variance). The results for the two cases are almost identical; for brevity, we only report the normally distributed $v_i$ case in Table 13. First, we see that the subsampling method with $5 \le m \le 20$ works well. Second, we see clearly that using $m = n$, or $m$ close to $n$ ($m \ge 50$), does not work. For example, when $m = n$, the estimated rejection rates are double the nominal levels. Andrews (2000) shows that the distribution of $\sqrt{n}(\hat\mu^*_n - \hat\mu_n)$ lies to the left of that of $\sqrt{n}(\hat\mu_n - \mu_0)$; hence, the bootstrap method leads to over-rejection of the null hypothesis. Our simulation results verify Andrews' theoretical analysis.
These simulation results may seem to contradict the simulation results reported in Section 5, where the subsampling method performs fine even for $m = n$. We explain this seemingly contradictory result in the next subsection.

F.3 Not all parameters are at the boundary
Our simulations reported in Section 5 correspond to the case of $\beta_{0,j} > 0$ for $j = 2, \ldots, 7$ and $\beta_{0,j} = 0$ for $j = 8, \ldots, 11$. The constrained estimators $\hat\beta_{T_1,j}$ ($\hat\beta^*_{m,j}$) for $j = 8, 9, 10, 11$ can cause problems for the standard bootstrap method. However, our ATE estimator also depends on $\hat\beta_{T_1,j}$ ($\hat\beta^*_{m,j}$) for $j = 1, \ldots, 7$, which do not take the boundary value 0. This helps the subsampling method for large values of $m$. More importantly, our ATE estimator also contains a term unrelated to $\hat\beta_{T_1}$ (see the second term on the right-hand side of (4.5)), and the existence of this term further improves the performance of the subsampling method when $m$ is close to or equal to $n$. This is why, in our simulations, the subsampling method seems to work fine even when $m = n$. To numerically verify this conjecture, we generate iid $Z_1, Z_2 \sim N(0, \sigma^2_v)$ random variables and add them to $\hat S_n$ and $\hat S^*_m$, i.e., $\tilde S_n = \hat S_n + Z_1$ and $\tilde S^*_m = \hat S^*_m + Z_2$. We then repeat the simulations to compute the estimated sizes. The results for $\sigma_v = 1$ and $\sigma_v = 5$ are reported in Table 14. We observe that the performance of the subsampling statistic $\tilde S^*_m$ improves significantly over $\hat S^*_m$ for $m = 50$ and $m = 100$. Consider the case of $\sigma_v = 1$ and $m = n$: the rejection rates based on $\tilde S^*_m$ are about 20% higher than the nominal levels, whereas they were 100% higher than the nominal levels based on $\hat S^*_m$. From Table 14, we also see that when $\sigma^2_v$ is large, $Z_1$ and $Z_2$ become the dominating components of $\tilde S_n$ and $\tilde S^*_m$, so the subsampling method works well for all values of $m$, including $m = n$. The estimated sizes for $\sigma^2_v = 1$ are only slightly oversized compared to those for $\sigma^2_v = 25$. This shows that the significant improvement in the estimated sizes (over the case of $\sigma^2_v = 0$) does not require adding a regular component with a large, dominating variance. A simulation sketch of this experiment follows.
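A sketch of the Table 14 experiment (our stand-in; it uses the simple mean example of Appendix F.1 rather than the full ATE estimator, and each resample draw gets its own regular component):

```python
import numpy as np

rng = np.random.default_rng(4)
n, R, B, sigma_v = 100, 2000, 200, 1.0

rej_plain = rej_tilde = 0
for _ in range(R):
    Y = rng.normal(0.0, 1.0, size=n)              # mu_0 = 0: boundary case
    mu_hat = max(Y.mean(), 0.0)
    S_n = np.sqrt(n) * mu_hat                     # sqrt(n)(mu_hat - mu_0)
    # m = n resampling critical values
    S_star = np.array([np.sqrt(n) * (max(rng.choice(Y, n).mean(), 0.0) - mu_hat)
                       for _ in range(B)])
    # add an independent "regular" normal component to both statistics
    Z1 = rng.normal(0.0, sigma_v)
    S_star_tilde = S_star + rng.normal(0.0, sigma_v, size=B)
    rej_plain += S_n >= np.quantile(S_star, 0.95)
    rej_tilde += S_n + Z1 >= np.quantile(S_star_tilde, 0.95)

print(rej_plain / R, rej_tilde / R)   # the latter is closer to the 5% level
```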

G.2 Adding covariates
We collect monthly data on the unemployment rate (Unemp), labor force (LF), and average weekly earnings (Inc) for Columbus and linearly extrapolate them to weekly data. The data are downloaded from the Bureau of Labor Statistics website (bls.gov). The estimation model is
$$
y_{1t} = x_t'\beta_0 + z_{1t}'\gamma_0 + v_{1t}, \tag{G.2}
$$
where $x_t = (1, y_{2t}, \ldots, y_{Nt})'$ and $\gamma_0$ is a $k \times 1$ vector of parameters, $k$ being the dimension of $z_{1t}$. We consider three cases of adding covariates: (i) $z_{1t} = (Unemp_t, LF_t, Inc_t)'$, i.e., add the three covariates linearly to the regression model; (ii) add both the three covariates and their squared terms, i.e., a total of six additional regressors; (iii) additionally add the three cross-product terms of the covariates, i.e., a total of nine additional regressors (3 linear, 6 quadratic terms). Since opening a showroom has no (or a negligible) effect on $z_{1t}$, we can use the above model to predict post-treatment counterfactual sales for the treated city. Specifically, we estimate model (G.2) under the restriction $\beta_j \ge 0$ for $j \ge 2$ using the pre-treatment data $t = 1, \ldots, T_1$ (there are no restrictions on the other parameters). Let $\hat\beta_{T_1}$ and $\hat\gamma_{T_1}$ denote the corresponding estimators.
We estimate the counterfactual outcome $y^0_{1t}$ by
$$
\hat y^0_{1t} = x_t'\hat\beta_{T_1} + z_{1t}'\hat\gamma_{T_1} \tag{G.3}
$$
for $t = T_1 + 1, \ldots, T$, and estimate the ATE by $T_2^{-1}\sum_{t=T_1+1}^{T}(y_{1t} - \hat y^0_{1t})$. Note that in (G.3) we use the treated unit's own covariates $z_{1t}$ in estimating the counterfactual outcome $y^0_{1t}$. We do not need to use the control units' covariates to form a synthetic path for $z_{1t}$, because $z_{1t}$ is exogenous in the sense that the treatment event does not affect (or has a negligible effect on) the covariates' evolution for the treated unit. Figure 5 plots the actual and counterfactual sales paths for Columbus with the three covariates added linearly to the regression model, i.e., $z_{1t}$ is of dimension three (adjusted $R^2 = 0.52$). The ATE becomes 69.7%, which is quite close to the original result of 67%. However, the adjusted $R^2$ decreased slightly from 0.528 to 0.520, indicating that the three covariates have no additional explanatory power for sales. Obtaining virtually the same ATE estimate even with added covariates supports our original ATE estimation result. For cases (ii) and (iii), $z_{1t}$ is of dimension six and nine, and the resulting adjusted $R^2$ values fall to 0.495 and 0.478, respectively. Therefore, adding quadratic terms of the three covariates does not give additional power for predicting Columbus' sales. A sketch of this estimation follows.
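A minimal sketch of estimating (G.2) and forming the counterfactual (G.3) (our illustration; the data shapes and function name are assumptions, and we use a generic bound-constrained optimizer rather than the paper's estimation code):

```python
import numpy as np
from scipy.optimize import minimize

def msc_ate_with_covariates(y1, Y_ctrl, Z, T1):
    """y1: (T,) treated outcome; Y_ctrl: (T, N-1) control outcomes;
    Z: (T, k) treated unit's covariates; T1: pre-treatment length."""
    T = len(y1)
    X = np.column_stack([np.ones(T), Y_ctrl])   # x_t = (1, y_2t, ..., y_Nt)'
    W = np.column_stack([X, Z])
    N, k = X.shape[1], Z.shape[1]

    def sse(p):                                  # pre-treatment sum of squares
        r = y1[:T1] - W[:T1] @ p
        return r @ r

    # beta_j >= 0 for j >= 2; intercept and gamma are unrestricted
    bnds = [(None, None)] + [(0.0, None)] * (N - 1) + [(None, None)] * k
    p_hat = minimize(sse, np.zeros(N + k), bounds=bnds).x
    y0_hat = W[T1:] @ p_hat                      # counterfactual, (G.3)-style
    ate_hat = np.mean(y1[T1:] - y0_hat)
    return ate_hat, y0_hat
```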

G.3 Selecting control units based on covariate matching
In this subsection, we first select cities whose covariates are close to those of the treated city. Then we select the number of control cities by comparing adjusted $R^2$ values. Finally, we estimate the ATE using the selected control units. We explain this procedure in more detail below.
For each $j = 1, 2, 3$ (corresponding to Unemp, LF, Inc), we regress $z_{1,jt}$ on $z_{i,jt}$ using the pre-treatment data and obtain the goodness-of-fit measure $R^2_{i,j}$ for $i = 2, \ldots, 11$. We obtain a total R-square for city $i$ as $R^2_i = R^2_{i,1} + R^2_{i,2} + R^2_{i,3}$. We sort these in non-increasing order: $R^2_{(2)} \ge R^2_{(3)} \ge \cdots \ge R^2_{(11)}$. The corresponding sales are denoted by $y_{(2),t}, \ldots, y_{(11),t}$ for $t = 1, \ldots, T_1$. Next, we regress $y_{1t}$ on $y_{(2),t}$ and obtain an adjusted $\bar R^2_{(2)}$. Then, we regress $y_{1t}$ on $(y_{(2),t}, y_{(3),t})$ and obtain an adjusted $\bar R^2_{(2),(3)}$. We continue in this way until we regress $y_{1t}$ on all of $(y_{(2),t}, \ldots, y_{(11),t})$, and we choose the model with the largest adjusted $\bar R^2$. For Columbus, the model that selects seven cities (Portland, Houston, and Atlanta are not selected) gives the largest adjusted $\bar R^2$. Using the seven selected cities as the control group, the modified synthetic control method's estimation result is plotted in Figure 6. The ATE estimate is 68.5%, which is quite close to the original result of 67%. This robustness check shows that our ATE estimation result is not sensitive to the selection of control units. A code sketch of the selection procedure is given below.
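A code sketch of this selection procedure (our illustration; the array shapes and helper names are assumptions):

```python
import numpy as np

def r2(y, X):
    """R^2 of regressing y on X (with intercept)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    resid = y - Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]
    return 1.0 - resid.var() / y.var()

def adj_r2(y, X):
    n, k = len(y), X.shape[1] + 1          # k counts the intercept
    return 1.0 - (1.0 - r2(y, X)) * (n - 1) / (n - k)

def select_controls(z1, zc, y1, yc):
    """z1: (3, T1) treated covariates; zc: (Nc, 3, T1) control covariates;
    y1: (T1,) treated sales; yc: (Nc, T1) control sales (pre-treatment)."""
    # total R^2 of each control city's covariates on the treated city's
    scores = [sum(r2(z1[j], zc[i, j][:, None]) for j in range(3))
              for i in range(zc.shape[0])]
    order = np.argsort(scores)[::-1]        # sort in non-increasing order
    best_k, best = 1, -np.inf
    for k in range(1, len(order) + 1):      # nested regressions on top-k cities
        a = adj_r2(y1, yc[order[:k]].T)
        if a > best:
            best, best_k = a, k
    return order[:best_k]                   # indices of the selected controls
```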
G.4 Allowing for $v_{1t}$ to be serially correlated

As discussed in Section 6.2, when testing the null that $v_{1t}$ is serially uncorrelated, we obtain a p-value of 0.0963, which is not strong evidence in support of the null hypothesis. In this section, we allow $v_{1t}$ to follow an AR(1) process: $v_{1t} = \rho_v v_{1,t-1} + \xi_t$, where $\xi_t$ is serially uncorrelated. The estimated confidence intervals are given in Table 16. Comparing Table 16 with Table 5, we observe that the results are similar, although the estimated confidence intervals reported in Table 16 are wider than those in Table 5.