Estimation and Inference of FAVAR Models

The factor-augmented vector autoregressive (FAVAR) model is now widely used in macroeconomics and finance. In this model, observable and unobservable factors jointly follow a vector autoregressive process, which further drives the comovement of a large number of observable variables. We study the identification restrictions for FAVAR models, and propose a likelihood-based two-step method to estimate the model. The estimation explicitly accounts for factors being partially observed. We then provide an inferential theory for the estimated factors, factor loadings, and the dynamic parameters in the VAR process. We show how and why the limiting distributions are different from the existing results. Supplementary materials for this article are available online.


INTRODUCTION
Since the seminal work of Sims (1980), vector autoregressive (VAR) models have played an important role in macroeconomic analysis. Because the number of parameters in a VAR system increases rapidly with the number of variables, a degree-of-freedom problem arises when too many variables are included in the system. On the other hand, too few variables may not fully capture the dimension of the structural shocks. These problems may explain some puzzling empirical results in the body of VAR research. For example, various studies commonly find that a contractionary monetary policy often leads to an increase of the price level, rather than the decrease that standard economic theory predicts (see Sims 1992; Christiano, Eichenbaum, and Evans 1999). Sims (1992) proposed a plausible interpretation of this puzzle, suggesting that it results from the VAR analysis not fully capturing the relevant information. Including more series in a VAR model is limited by the loss of degrees of freedom. (The Bayesian method is an alternative (Doan, Litterman, and Sims 1984; Litterman 1986; Sims 1993): by imposing some prior restrictions, the usual VAR model can accommodate more variables (e.g., Leeper, Sims, and Zha 1996).) Furthermore, as Stock and Watson (2005) pointed out, it is doubtful that larger VAR models with some potentially incredible restrictions would be superior to smaller ones. Bernanke, Boivin, and Eliasz (2005) proposed the factor-augmented vector autoregressive (FAVAR) model to address the dilemma arising from the information deficiency and the degree-of-freedom problem in traditional VAR models. In contrast with such models, the FAVAR model includes unobserved low-dimensional factors in the autoregression. These factors, which may not be captured by specific macroeconomic aggregates, are thought to contain the bulk of information about an economy.
With the inclusion of these unobserved factors, the FAVAR model is rich in information but remains tractable in terms of the number of parameters, owing to the low dimension of the factors. Such an approach of using a small number of factors to summarize the information in a large number of indicators has been used in many papers, for instance, Bernanke and Boivin (2003) and Stock and Watson (2002). The FAVAR model is now widely used in economic applications. (For example, Boivin, Giannoni, and Mihov (2009), Bianchi, Mumtaz, and Surico (2009), Forni and Gambetti (2010), Moench (2008), and Ludvigson and Ng (2009), to name a few. Large-dimensional factor models are also increasingly used outside macroeconomics and finance, for example, Fan, Liao, and Mincheva (2011), Fan, Liao, and Mincheva (2013), and Tsai and Tsay (2010).) Despite its wide applicability, important issues remain to be addressed.
We first derive the number of restrictions needed in the presence of observable factors, and then consider how to impose these restrictions. Two types of restrictions may be considered: one involves restrictions on the sample moments of the factor process, the other on its population moments. The first type is more appropriate when the factors are a sequence of fixed constants, as in Bai and Li (2012). The second type is more appropriate when the factors are a random sequence. A similar issue was discussed by Anderson (2003, p. 571). In FAVAR models, since the factors are stochastic processes, restrictions on the population variance are more reasonable than restrictions on the sample variance. An important result of this article is that the two types of restrictions, although asymptotically equivalent, lead to different limiting distributions for the estimated factors and factor loadings, as well as for the estimated parameters in the VAR process.
The second issue is estimation and the related inferential theory. In the FAVAR literature, Bernanke, Boivin, and Eliasz (2005) and Boivin, Giannoni, and Mihov (2009) suggested a two-step method to estimate an FAVAR model, in which the factors are extracted first and their dynamics are estimated next. There are no studies on the inferential theory of the FAVAR model. This deficiency makes it difficult to construct confidence intervals for the impulse response function and to interpret the subsequent economic analysis. Possibly for this reason, Bernanke, Boivin, and Eliasz (2005) also considered a Bayesian method to estimate the model. However, the computational burden of the Markov chain Monte Carlo (MCMC) method in this context is formidable for many researchers.
In this article, we consider the identification, estimation, and inferential theory of the FAVAR models. We contribute to the FAVAR literature in several ways. First, we investigate the identification problem of the FAVAR model. Due to the presence of partially observable factors, the identification problem here differs from those in standard factor models. We consider three sets of identification conditions. Unlike the usual identification conditions that are imposed on the sample variance of factors, we put the conditions on the variance of the innovations to factors. These conditions are similar to those in the standard structural VAR literature. Second, we propose a likelihood-based two-step method to estimate the FAVAR model, which explicitly takes into account that some of the factors are observed. Using the maximum likelihood (ML) method instead of the principal components (PC) method in the first step gives better estimates of the unobserved factors. (See Bai and Li (2016) for a comparison of the finite sample performance of the ML and PC methods.) In addition, we find that the iterative estimation procedure advocated by Boivin, Giannoni, and Mihov (2009) can be avoided.
Third, we establish the statistical theory of the two-step estimators including consistency, convergence rates, and the asymptotic representations. We also give an inferential theory for the impulse response functions. Based on this theory, the confidence intervals of the impulse response function can be easily constructed.
There are several studies related to our work. Stock and Watson (2005) considered the identification and estimation issues in dynamic factor models. Their identification strategies share with ours the feature that some conditions are imposed on the variance of the innovations. But the remaining conditions are different: their conditions are imposed on the vector moving average representation, and ours are imposed on the original factor representation. Which identification strategy is preferred depends on the specific application. Bernanke, Boivin, and Eliasz (2005) suggested a timing-exclusion strategy for identification. Their strategy may lead to over-identification. Han (2015) proposed a statistic to test the over-identification restrictions. There are additional studies considering the bootstrap method to construct confidence intervals for factor-augmented models, such as Goncalves and Perron (2014), Shintani and Guo (2011), and Yamamoto (2011). Our theoretical results also pave the way for future studies in this direction.
The rest of the article is arranged as follows. Section 2 introduces the FAVAR model with its identification problem, examines three sets of identification restrictions, and presents some regularity conditions. Section 3 states our two-step estimation procedure. Section 4 presents the asymptotic properties of our estimators. Section 5 focuses on the impulse response function and its confidence intervals. Section 6 investigates the finite sample properties of our estimators. Section 7 concludes. Technical proofs are delivered in the online appendix. Throughout the article, the norm of a vector or matrix A is the Frobenius norm, that is, ‖A‖ = [tr(A′A)]^{1/2}.

THE FAVAR MODELS
Let g_t be a vector of observable factors, and f_t a vector of latent factors, both of low dimension. The FAVAR model assumes that g_t and f_t jointly follow a VAR process. That is, letting h_t = (f_t′, g_t′)′, h_t is characterized by a VAR(K) process for some K,

h_t = Φ_1 h_{t−1} + Φ_2 h_{t−2} + ··· + Φ_K h_{t−K} + u_t,    (2.1)

where Φ_1, Φ_2, ..., Φ_K are matrices of coefficients and u_t is the innovation. In general, neither f_t nor g_t alone is a finite-order VAR process. The FAVAR model further assumes that a large number of observable variables z_t = (z_1t, z_2t, ..., z_Nt)′, of dimension N × 1, is affected by h_t through a factor model

z_t = Λ f_t + Γ g_t + e_t,    (2.2)

where Λ and Γ are the factor loadings with Λ = (λ_1, ..., λ_N)′ and Γ = (γ_1, ..., γ_N)′, and e_t = (e_1t, e_2t, ..., e_Nt)′ is the idiosyncratic error; λ_i is of dimension r_1 × 1 and γ_i of r_2 × 1, for all i = 1, 2, ..., N. Throughout, we assume f_t is of dimension r_1 × 1, g_t of r_2 × 1, and h_t of r = r_1 + r_2. We consider estimating the factors f_t and factor loadings, the variances of the idiosyncratic errors e_it, and the dynamic parameters in the h_t process, and derive their limiting distributions under various identification restrictions. Model (2.1)–(2.2) is the FAVAR model proposed by Bernanke, Boivin, and Eliasz (2005). Equation (2.1) is a standard specification of a VAR(K) model, except that the variables f_t are unobservable. The inclusion of unobservable factors is crucial to the FAVAR model. These unobservable factors usually capture the information of structural shocks that are important to the economy but cannot be well represented by specific macroeconomic aggregates. As mentioned before, omitting unknown structural shocks may be a primary reason for the failure of the traditional VAR model in some empirical applications. Equation (2.2) specifies that the common factors h_t are related to the observable data z_t by a factor model.
This approach is a plausible way to model the relation between the observable variables z_t and the latent variable f_t, given the diffusion nature of common shocks in h_t. The FAVAR model can be considered a special case of Forni et al. (2000), but with more structure.
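To fix ideas, the data-generating process (2.1)–(2.2) can be simulated in a few lines. All dimensions and parameter values below are illustrative assumptions (a VAR(1), i.e., K = 1, with a simple stationary coefficient matrix), not values taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
r1, r2, N, T = 2, 1, 50, 200      # latent factors, observed factors, series, periods
r = r1 + r2

Phi = 0.5 * np.eye(r)             # VAR(1) coefficient matrix, stationary by construction
Lam = rng.normal(size=(N, r1))    # loadings on the latent factors f_t
Gam = rng.normal(size=(N, r2))    # loadings on the observed factors g_t

# Equation (2.1): h_t = Phi h_{t-1} + u_t, with h_t = (f_t', g_t')'
h = np.zeros((T, r))
for t in range(1, T):
    h[t] = Phi @ h[t - 1] + rng.normal(size=r)
f, g = h[:, :r1], h[:, r1:]

# Equation (2.2): z_t = Lam f_t + Gam g_t + e_t
z = f @ Lam.T + g @ Gam.T + rng.normal(size=(T, N))
```

In an application, z would be the observed panel and g the observed factor series; f is latent and must be estimated.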

The Number of Identification Restrictions Needed
Model (2.1)–(2.2) cannot be fully identified without additional restrictions. To see this, for any invertible r_1 × r_1 matrix M_11 and any r_1 × r_2 matrix M_12, the model can be written as

z_t = Λ f_t + Γ g_t + e_t = (Λ M_11^{−1})(M_11 f_t + M_12 g_t) + (Γ − Λ M_11^{−1} M_12) g_t + e_t,

so we obtain two observationally equivalent models. Since the total number of free parameters in M_11 and M_12 is r_1² + r_1 r_2, we need at least r_1² + r_1 r_2 restrictions to identify the parameters. A subsequent question is whether r_1² + r_1 r_2 restrictions are enough. To answer this question, we first define some notation for ease of exposition. Let H = (h_1, h_2, ..., h_T). The following proposition shows that the preceding question has a definite answer.
Proposition 1. Suppose that H is of full column rank; then the number of restrictions needed to fully identify model (2.1)–(2.2) is r_1² + r_1 r_2.

Proof. Let M be any invertible r × r rotation matrix, partitioned as M = [M_11, M_12; M_21, M_22], where M_11 and M_22 are r_1 × r_1 and r_2 × r_2 square matrices, respectively. Then Equation (2.2) can be written as z_t = (Λ, Γ)M^{−1}Mh_t + e_t. Let h†_t = Mh_t. If M is a qualified rotation matrix, the lower r_2 elements of h†_t should be g_t. This gives (M_21, M_22 − I_{r_2})H = 0. If H is of full column rank, by post-multiplying H′(HH′)^{−1}, we have M_21 = 0 and M_22 = I_{r_2}. This result indicates that, to fully identify the parameters, we only need to uniquely determine the matrices M_11 and M_12, whose number of free parameters is exactly r_1² + r_1 r_2. This proves the proposition.
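The rotation argument above can be checked numerically: rotating the latent factors by any invertible M_11 and any M_12, while adjusting the loadings accordingly, leaves the common component of z_t unchanged. A small sketch with illustrative dimensions and random draws:

```python
import numpy as np

rng = np.random.default_rng(1)
r1, r2, N, T = 2, 1, 20, 100
f = rng.normal(size=(T, r1))
g = rng.normal(size=(T, r2))
Lam = rng.normal(size=(N, r1))
Gam = rng.normal(size=(N, r2))

M11 = rng.normal(size=(r1, r1)) + 2 * np.eye(r1)   # invertible rotation
M12 = rng.normal(size=(r1, r2))

f_new = f @ M11.T + g @ M12.T                      # rotated latent factors
Lam_new = Lam @ np.linalg.inv(M11)                 # adjusted loadings
Gam_new = Gam - Lam_new @ M12

common = f @ Lam.T + g @ Gam.T                     # original common component
common_rot = f_new @ Lam_new.T + g @ Gam_new.T     # rotated model, same fit
assert np.allclose(common, common_rot)
```

The free parameters in (M_11, M_12) number r_1² + r_1 r_2, matching the restriction count in the proposition.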

Identification Restrictions
The identification problem brings both advantages and disadvantages to the FAVAR model. On the one hand, it causes difficulties in interpreting the model in a universal way; on the other hand, the model has the flexibility to fit specific situations through a careful design of the identification strategy. In what follows, we consider three sets of identification restrictions that we think are of practical relevance. We first introduce the following notation: write u_t = (ε_t′, υ_t′)′ and partition its variance matrix as Σ = var(u_t) = [Σ_εε, Σ_ευ; Σ_υε, Σ_υυ], where ε_t and υ_t are the innovations corresponding to f_t and g_t, respectively. We consider the following three sets of identification restrictions.
IRa. The underlying parameter values θ satisfy: Σ_εε = I_{r_1}, Σ_ευ = 0, and N^{−1}Λ′Σ_ee^{−1}Λ = Q, where Q is a diagonal matrix with its diagonal elements being distinct and arranged in descending order.
IRb. The underlying parameter values θ satisfy: Σ_εε = I_{r_1}, Σ_ευ = 0, and Λ_1 is a lower triangular matrix, where Λ_1 is the upper r_1 × r_1 submatrix of Λ.
IRc. The underlying parameter values θ satisfy: Σ_ευ = 0 and Λ_1 = I_{r_1}, where Λ_1 is the upper r_1 × r_1 submatrix of Λ.
Each set of identification restrictions imposes r_1² + r_1 r_2 restrictions. There are no restrictions on Σ_υυ, as υ_t is the reduced-form residual of the observable g_t. In the next subsection, we explain why it is possible to assume Σ_ευ = 0.
Remark 1. In factor analysis, Anderson (2003, p. 571) considers both types of restrictions, E(f_t f_t′) = I_{r_1} and T^{−1}Σ_{t=1}^T f_t f_t′ = I_{r_1}. The former is a population restriction, the latter a sample-version restriction. In our case, since we have dynamics in h_t, the restrictions concern the errors ε_t corresponding to f_t. Because we assume the errors are random, it is reasonable to impose population restrictions rather than sample-version restrictions. However, as we will show, though E(ε_t ε_t′) = I_{r_1} and T^{−1}Σ_{t=1}^T ε_t ε_t′ = I_{r_1} are asymptotically equivalent in a certain sense, they imply different limiting distributions for the estimated factor loadings and the estimated factors f_t. The population-version restriction implies a larger variance than the sample-version restriction.
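The distinction in Remark 1 is easy to see in simulation: innovations drawn with population variance I_{r_1} have a sample variance that differs from I_{r_1} in any finite sample, and enforcing the sample version amounts to an extra data-dependent rotation. A sketch (sample size and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, r1 = 200, 2
eps = rng.normal(size=(T, r1))          # population variance is exactly I_{r1}
S = eps.T @ eps / T                     # sample variance: close to, but not equal to, I

# Rotating by S^{-1/2} enforces the sample-version restriction exactly
w, V = np.linalg.eigh(S)
S_inv_half = V @ np.diag(w ** -0.5) @ V.T
eps_std = eps @ S_inv_half

assert np.allclose(eps_std.T @ eps_std / T, np.eye(r1))   # sample version holds
assert not np.allclose(S, np.eye(r1))                     # population draw misses it
```

The rotation S^{-1/2} is O_p(T^{-1/2}) away from the identity, which is the source of the extra variance term under the population restriction.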
Consider any model with innovations u†_t = (ε†_t′, υ_t′)′ whose variance matrix Σ† is unrestricted. One can choose a rotation matrix A = [A_11, A_12; 0, I_{r_2}] such that the variance matrix of the new innovation u_t = (ε_t′, υ_t′)′ = A u†_t is Σ = AΣ†A′ = [I_{r_1}, 0; 0, Σ_υυ]; that is, the new innovations satisfy Σ_εε = I_{r_1} and Σ_ευ = 0. We use the notation without the dagger to denote the new parameters; note that the observable factor g_t and the corresponding innovation υ_t do not change. Consequently, the identification restrictions we impose on the innovations, as stated in the previous subsection, are reasonable. The new factor f_t is a linear combination of f†_t and g_t. With some appropriate restrictions on the new loadings, the factor f_t can now have economic meaning under the additional identification restrictions.
The three different identification restrictions in the previous subsection can be interpreted as follows.
IRa requires that Λ′Σ_ee^{−1}Λ be diagonal, a condition often used in maximum likelihood estimation; see Lawley and Maxwell (1971). This identification condition is important for the statistical analysis, and it can also be of economic interest in some specific cases, as pointed out in Bai and Ng (2013). For example, suppose Λ is block diagonal, Λ = [π_1, 0; 0, π_2], where π_i is a column vector of N_i elements with N_1 + N_2 = N. In this case, the first factor only affects the first N_1 variables, and the second factor only affects the next N_2 variables. Each variable is affected by only a single factor, but we do not need to know which variable is affected by which factor; Λ′Σ_ee^{−1}Λ remains diagonal under arbitrary cross-sectional permutation of the individuals.
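The permutation-invariance point can be illustrated directly: for a block-diagonal Λ and diagonal Σ_ee, the matrix Λ′Σ_ee^{−1}Λ is diagonal, and it stays diagonal after any reshuffling of the cross-section. A sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N2 = 5, 7
N = N1 + N2
pi1 = rng.normal(size=(N1, 1))
pi2 = rng.normal(size=(N2, 1))

# block-diagonal loadings: each variable loads on exactly one factor
Lam = np.block([[pi1, np.zeros((N1, 1))],
                [np.zeros((N2, 1)), pi2]])
Sig_ee_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=N))   # diagonal error variances

Q = Lam.T @ Sig_ee_inv @ Lam / N
assert np.allclose(Q - np.diag(np.diag(Q)), 0)   # off-diagonals are zero

# diagonality survives an arbitrary permutation of the cross-section
perm = rng.permutation(N)
Qp = Lam[perm].T @ Sig_ee_inv[np.ix_(perm, perm)] @ Lam[perm] / N
assert np.allclose(Qp - np.diag(np.diag(Qp)), 0)
```

Each cross-product in the off-diagonal entries multiplies a loading by a zero, regardless of the ordering of the variables.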
IRb shares with IRa the feature of imposing restrictions on the variance of u_t. In addition, it restricts Λ_1 to be a lower triangular matrix. This allows IRb to endow the unobserved factors with economic meaning. Under IRb, only the first unobservable factor affects the first variable, only the first two unobservable factors affect the second variable, and so on. This scheme somewhat resembles the recursive identification in structural VAR analysis. Through careful selection of the first r_1 variables, the unobservable factors become interpretable. For example, Geweke and Zhou (1996), in their study of pricing errors in the Arbitrage Pricing Theory, used such recursive identification conditions. IRc restricts the upper r_1 × r_1 submatrix of the factor loadings Λ to be an identity matrix. Since more restrictions are imposed on the factor loadings, IRc relinquishes the requirement that the innovations to the unobservable factors be orthogonal and have unit variance. Under IRc, among the first r_1 series, the first is affected only by the first unobservable factor, the second only by the second unobservable factor, and so on, each with unit loading.
Overall, the identification restrictions considered in this article share the feature that they impose restrictions on the loadings and on the variance of the innovations to h_t. This contrasts with the usual identification conditions in factor models, which impose restrictions on the loadings and the sample variance of the factors; see Anderson and Rubin (1956) and Bai and Li (2012) for traditional identification conditions. Imposing restrictions on the innovations instead of on the factors themselves is important and reasonable because the components of f_t are correlated while the innovations ε_t can be assumed uncorrelated, in the same spirit as structural VAR analysis.

Assumptions
To analyze model (2.1)–(2.2), we make the following assumptions:
Assumption A. The factor h_t = (f_t′, g_t′)′ admits the VAR representation (2.1), where u_t is an iid process with u_t = Σ^{1/2}ς_t, E(ς_t) = 0, var(ς_t) = I_r, Σ > 0, E(‖ς_t‖⁴) < ∞, and the elements of ς_t are mutually independent. In addition, all the roots of the polynomial equation |Φ(L)| = 0, with Φ(L) = I_r − Φ_1 L − Φ_2 L² − ··· − Φ_K L^K, are outside the unit circle.
Assumption B. There exists a positive constant C large enough such that ‖λ_i‖ ≤ C and ‖γ_i‖ ≤ C for all i; furthermore, lim_{N→∞} N^{−1}Λ′Σ_ee^{−1}Λ = Q exists and is a positive-definite matrix, where Σ_ee is defined in Assumption C.
Assumption C. E(e_t) = 0; E(e_t e_t′) = Σ_ee = diag(σ_1², σ_2², ..., σ_N²); E(e_it⁴) < ∞ for all i and t. The e_it are independent over i and t, and the N × 1 vector e_t is identically distributed over t. Furthermore, e_it is independent of u_s for all i, t, and s.
Assumption A imposes regularity conditions on the factors. It requires the factor h_t to be stationary over t, and it guarantees that H = (h_1, h_2, ..., h_T) is of full column rank; so under Assumption A, Proposition 1 holds. Assumption B concerns the factor loadings and is standard. Notice that Assumption B requires the columns of Λ to be linearly independent; otherwise, Q would be a singular matrix. Assumption C centers on the idiosyncratic errors. Under Assumption C, correlations over time and over the cross-section are ruled out, and heteroscedasticity over time is also precluded. This assumption can be relaxed to a great extent; in fact, the analysis of this article can be extended to approximate factor models (Chamberlain and Rothschild 1983). Assumption D requires σ_i² to be estimated over a compact set. This assumption is due to the high nonlinearity of the likelihood function, and it is common in the literature for nonlinear problems.
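The root condition in Assumption A is routinely checked through the companion form: all roots of |Φ(L)| = 0 lie outside the unit circle exactly when all eigenvalues of the companion matrix lie strictly inside it. A sketch (the helper name is ours):

```python
import numpy as np

def is_stationary(Phis):
    """Check Assumption A's root condition for a VAR(K) with coefficient
    matrices Phis = [Phi_1, ..., Phi_K], via the companion matrix."""
    K = len(Phis)
    r = Phis[0].shape[0]
    comp = np.zeros((K * r, K * r))
    comp[:r, :] = np.hstack(Phis)          # first block row: Phi_1, ..., Phi_K
    comp[r:, :-r] = np.eye((K - 1) * r)    # identity blocks below the diagonal
    return np.max(np.abs(np.linalg.eigvals(comp))) < 1

assert is_stationary([0.5 * np.eye(2)])        # stable VAR(1)
assert not is_stationary([1.1 * np.eye(2)])    # explosive VAR(1)
```

The same check applies to the estimated Φ̂_k before computing impulse responses.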

ESTIMATION
In this section, we propose a two-step method to estimate the underlying structural parameters that satisfy IRa, IRb, or IRc. Some alternative methods can also be used: Bernanke, Boivin, and Eliasz (2005) considered the MCMC method, and Boivin, Giannoni, and Mihov (2009) considered the iterated principal components–ordinary least squares (PC-OLS) method. Our method directly takes into account that g_t is observable, so no iteration is necessary. Also, the ML-based method is more efficient than the PC-based one.
To gain insight into our method, write (2.2) in matrix form as

Z = FΛ′ + GΓ′ + E,    (3.1)

where Z = (z_1, ..., z_T)′, F = (f_1, ..., f_T)′, G = (g_1, ..., g_T)′, and E = (e_1, ..., e_T)′. Applying the quasi-maximum likelihood (ML) estimation method to the model, we obtain the quasi-ML estimates (QMLE) Λ̃, Γ̃, Σ̃_ee, and F̃.
Estimation under IRa: the intermediate estimates are transformed by a rotation matrix R chosen so that the restrictions in IRa hold. Then Φ̂_p = RΦ̃_pR^{−1} for p = 1, 2, ..., K, and Σ̂_υυ = Σ̃_υυ.
Estimation under IRb: Let Σ̃_{εε·υ}^{1/2}Λ̃_1′ = QR be the QR decomposition of Σ̃_{εε·υ}^{1/2}Λ̃_1′, with Q an orthogonal matrix and R an upper triangular matrix, where Λ̃_1 is the upper r_1 × r_1 submatrix of Λ̃. The parameters are estimated by Λ̂ = Λ̃Σ̃_{εε·υ}^{1/2}Q. Then Φ̂_p = RΦ̃_pR^{−1} for p = 1, 2, ..., K, and Σ̂_υυ = Σ̃_υυ.
Estimation under IRc: The parameters are estimated by Λ̂ = Λ̃Λ̃_1^{−1}, with the factors rotated accordingly (Remark 3 details the factor formula).
Remark 2. The innovations υ_t do not involve any identification problem and hence are the same under the different identification restrictions, because the factors g_t are observable. As a result, the estimator Σ̂_υυ is the same under the different identification restrictions. For the innovations ε_t, however, the variance matrix is restricted to be an identity matrix under IRa and IRb, so we only need to estimate Σ_εε under IRc. The estimator Σ̂_εε is useful in the construction of the impulse response function in Section 5.
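The QR step under IRb can be verified mechanically: for any intermediate loading matrix and any positive-definite stand-in for Σ̃_{εε·υ}, the rotated upper block Λ̃_1Σ̃_{εε·υ}^{1/2}Q equals R′ and is therefore lower triangular, as IRb demands. A sketch with illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(4)
r1, N = 3, 12
Lam_t = rng.normal(size=(N, r1))                 # intermediate estimate, stands in for Lam~
A = rng.normal(size=(r1, r1))
S = A @ A.T + r1 * np.eye(r1)                    # stand-in for Sigma~_{ee.v}, positive definite

# symmetric square root of S
w, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(w)) @ V.T

Q, R = np.linalg.qr(S_half @ Lam_t[:r1].T)       # the QR step from the text
Lam_hat = Lam_t @ S_half @ Q                     # rotated loadings

# the upper r1 x r1 block equals R', hence lower triangular
assert np.allclose(Lam_hat[:r1], R.T)
assert np.allclose(np.triu(Lam_hat[:r1], 1), 0)
```

The algebra behind the check: Λ̃_1Σ̃^{1/2}Q = (Σ̃^{1/2}Λ̃_1′)′Q = (QR)′Q = R′Q′Q = R′.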
Remark 3. We explain how we recover f_t from the intermediate estimates, that is, how to obtain f̂_t from f̃_t, using the formulas given above. We take IRc as the example.
From the estimation procedure, it is seen that Λ̃_1^{−1} corresponds to R_11. Also notice that, by Σ_υε = 0, we have Σ_υυ^{−1}Σ_υε = −Σ_gg^{−1}Σ_gf R_11. So the term −Σ̃_υυ^{−1}Σ̃_υε is an estimator of Σ_gg^{−1}Σ_gf R_11. This justifies the formula F̂ = (F̃ − GΣ̃_υυ^{−1}Σ̃_υε)Λ̃_1 in IRc.
Remark 4. The parameters Λ, Γ, Σ_ee, Φ_1, ..., Φ_K, and Σ can also be estimated by the state-space method using the Kalman smoother, as in Watson and Engle (1983), Quah and Sargent (1992), and Doz, Giannone, and Reichlin (2012) (though the latter article considers homoscedastic e_it, it can be extended to heteroscedastic errors). But the state-space method is computationally more demanding than the two-step method here; that is perhaps the reason that Doz, Giannone, and Reichlin (2011) subsequently also considered a two-step method. Furthermore, it can be shown that, due to the static relationship between z_it and h_t, there is no asymptotic efficiency gain from using the Kalman smoother (see Bai and Li (2016)). None of these articles studies the limiting distributions of the estimators.
Throughout the article, we use symbols with a hat to denote the final estimators (e.g., λ̂_i, f̂_t, Φ̂_k) and symbols with a tilde to denote the intermediate estimators (e.g., λ̃_i, f̃_t, Φ̃_k). Since σ_i² does not have the identification problem, the intermediate estimator and the final estimator are the same. For this reason, we use the two symbols interchangeably, that is, σ̂_i² = σ̃_i².
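As a rough illustration of the two-step logic, the sketch below substitutes principal components for the article's quasi-ML step 1 (so it ignores the heteroscedasticity weighting by Σ_ee^{−1} that the ML step provides) and runs OLS for the VAR in step 2. The function name and all defaults are ours, not the article's:

```python
import numpy as np

def favar_two_step(z, g, r1, K=1):
    """Simplified two-step FAVAR estimator (PC stand-in for the quasi-ML step).
    z: (T, N) panel; g: (T, r2) observed factors; r1: number of latent factors."""
    T = z.shape[0]
    # Step 1: strip the observed-factor part of z, then extract principal components
    Pg = g @ np.linalg.solve(g.T @ g, g.T)            # projection matrix onto g
    z_perp = z - Pg @ z
    U, s, Vt = np.linalg.svd(z_perp, full_matrices=False)
    f_hat = np.sqrt(T) * U[:, :r1]                    # estimated latent factors (up to rotation)
    # Step 2: OLS for the VAR(K) of h_t = (f_hat_t', g_t')'
    h = np.hstack([f_hat, g])
    Y = h[K:]
    X = np.hstack([h[K - k: T - k] for k in range(1, K + 1)])
    Phi_hat = np.linalg.solve(X.T @ X, X.T @ Y).T     # r x (K*r) stacked coefficients
    return f_hat, Phi_hat
```

Because g is observed, no iteration between the two steps is needed, which is the point made in the Introduction; the latent factors are recovered only up to the rotation resolved by IRa–IRc.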

ASYMPTOTIC PROPERTIES OF THE ESTIMATORS
In this section, we deliver the asymptotic results for the two-step estimators. The following proposition states that the two-step estimators are individually consistent.
Proposition 2. Under Assumptions A–D, when N, T → ∞, with any one of the identification conditions (IRa, IRb, or IRc), we have λ̂_i →p λ_i, γ̂_i →p γ_i, σ̂_i² →p σ_i², and Φ̂_p →p Φ_p.
To give the asymptotic representations for the factor loadings, we introduce the following notation. Let V be an r_1 × r_1 matrix; its expression, which differs across the identification restrictions, is given in the online appendix. Let Q = lim_{N→∞}(Λ′Σ_ee^{−1}Λ)/N; K_r is the r-dimensional commutation matrix such that K_r vec(M) = vec(M′) for any r × r matrix M; D_1 is the matrix such that veck(M) = D_1 vec(M) for any symmetric matrix M, where veck(M) is the operator that stacks the elements of M below the diagonal into a vector; and P_1 = [I_p, 0_{p×q}] with p = (r_1 + 1)r_1/2 and q = r_1(r_1 − 1)/2. Given the consistency, Theorem 1 gives the asymptotic representation (4.1) of the estimator λ̂_i, whose first term involves V and whose second term involves averages of φ_t e_it, where φ_t = f_t − Σ_fg Σ_gg^{−1} g_t and Σ_φφ = E(φ_t φ_t′), with Σ_fg and Σ_gg defined in (2.4).
Remark 5. Consider the limiting distribution under IRa. The restrictions under IRa are similar to those for the principal components estimator, yet the limiting distribution here differs from that of the usual PC estimator in several ways. First, because of the presence of the observable g_t, the "regressor" f_t is projected onto g_t, and the projection error φ_t enters the distribution. Second, there is an extra term V in the limiting distribution. To better understand this term, consider the situation in which g_t is absent and the dynamics in h_t are also absent, so that h_t = f_t = ε_t. The restriction E(ε_t ε_t′) = I_r becomes E(f_t f_t′) = I_r. If one instead assumes the sample-version restriction T^{−1}Σ_t f_t f_t′ = I_r, then the first term (the one involving V) disappears. This result is consistent with that of Bai and Li (2012), where the sample-version restriction is considered. Thus restrictions on the sample covariance and restrictions on the population covariance lead to different limiting distributions for the estimated factor loadings; restrictions on the population covariance of f_t imply a larger limiting variance for λ̂_i. Finally, because we allow dynamics in h_t, the first term V involves the innovations ε_t rather than f_t.
Also, in the absence of g_t and without dynamics in h_t (so that h_t = f_t = ε_t), Equation (4.2) is also the asymptotic representation of the PC estimator under the population restriction E(f_t f_t′) = I_r, except that in the definition of V the matrix Q = (Λ′Σ_ee^{−1}Λ)/N is replaced by Λ′Λ/N, because the PC estimator treats the error covariance matrix as an identity matrix. This result is also consistent with that of Bai (2003), who considers a limiting distribution of the form √T(λ̂_i − Rλ_i) for some rotation matrix R. If we let R = I_r + V, then the representation is equivalent to that of Bai (2003).
Under IRb, the population restriction E(ε_t ε_t′) = I_{r_1} continues to affect the limiting distribution. Now V itself is composed of two expressions; the second expression in V is analogous to a term in Bai and Li (2012). Under IRc, there are no restrictions on the population variance of ε_t; instead, the restrictions are imposed on the factor loadings. The limiting distribution is analogous to that of Bai and Li (2012) under IC1.
Remark 6. Theorem 1 shows that the asymptotic representation of λ̂_i under the different IRs has a similar expression, which justifies studying the asymptotic properties of λ̂_i under the different IRs in a unified framework. The symbol φ_t in the asymptotic representation is the residual from projecting f_t on g_t; hence it is orthogonal to g_t. The expression of V differs across the identification restrictions.
To derive the limiting distribution of λ̂_i, we consider the covariance between the first and second terms on the right-hand side of (4.1). Under IRa, V only involves the VAR innovation products ε_t ε_t′, and the second term only involves φ_t e_it, so the two terms are asymptotically independent because of the absence of correlation between ε_t ε_t′ and φ_t e_it. (The lack of correlation implies asymptotic independence because each term is asymptotically normal.) Under IRc, we only need to estimate λ_i for i > r_1, so the second term only involves e_it with i > r_1; but V involves ξ_t φ_t′ with ξ_t = (e_1t, e_2t, ..., e_{r_1 t})′, so the first term is asymptotically independent of the second. Under IRb, for i > r_1 these two terms are asymptotically independent for the same reasons as under IRa and IRc; but for i ≤ r_1, the two terms are correlated. Based on the preceding analysis and Theorem 1, we have the following corollary.
Corollary 1. Under the assumptions of Theorem 1, √T(λ̂_i − λ_i) converges in distribution to a normal limit under each of IRa, IRb, and IRc. The notations B_Q, P_1, D_{r_1}, D_2, D_3, Σ_1, S_{r_1}, and Σ_φφ are defined in the paragraph before Theorem 1.
Remark 7. The term V in the asymptotic representation under IRa and IRb involves ε_t ⊗ ε_t − vec(I_{r_1}), where ε_t is a component of u_t. The limiting variances therefore depend on the kurtosis of ε_it, the ith element of ε_t. If normality of u_t is assumed, then the terms G and H are zero since W = 0, and the limiting distribution in the preceding corollary simplifies. Similarly, with the normality assumption on u_t, the asymptotic variances in Corollaries 3 and 4 and Theorems 7–9 also simplify. The simplified limiting distributions can be found in an earlier (online) version of this article. Now we consider the asymptotic results for γ̂_i − γ_i. We have the following theorem.
In the theorem, η_t = g_t − Σ_gf Σ_ff^{−1} f_t and Σ_ηη = E(η_t η_t′), where Σ_gf and Σ_ff are defined in (2.4). Similar to Theorem 1, the asymptotic representation of γ̂_i under the different IRs also has a unified expression. Symmetric to the symbol φ_t in Theorem 1, the symbol η_t here is the residual from projecting g_t on f_t. The matrix W has a unified expression under the different IRs. If the population restriction E(υ_t ε_t′) = 0 is replaced by the sample version T^{−1}Σ_{t=1}^T υ_t ε_t′ = 0, then the first term W disappears.
Note that the two terms in the asymptotic representation of γ̂_i − γ_i are asymptotically independent. The asymptotic independence follows from the absence of correlation between υ_t ε_t′ and η_t e_it, similar to the case of (4.1) under IRa.
Corollary 2. Under the assumptions of Theorem 2, we have the limiting distribution of γ̂_i, where Σ_εε and Σ_υυ are defined in (2.4).
Having derived the asymptotic results for the loadings, we consider the estimation of the unobservable factors f̂_t. The asymptotic result for f̂_t − f_t involves both the V and W matrices, and is stated in the following theorem.
In the asymptotic representation of f̂_t − f_t, the first term, V, and W are asymptotically independent of each other. To see this, first consider V and W. Under IRa, both involve only u = (u_1, ..., u_T). Combining the results under IRa and IRc, we have E[vec(V)vec(W)′] = 0 under IRb in view of the expression of V. We next show that the first term is asymptotically independent of both V and W. Since W only involves u while the first term only involves e, they are independent under all IRs. Under IRa, V is independent of the first term for the same reason. Under IRc, V involves ξ_t φ_t′ over t, where ξ_t = (e_1t, ..., e_{r_1 t})′, while the first term involves summations of e_it over i, so they are uncorrelated because φ_t and e_it are independent, given that u_t is independent of e_it. The previous two cases imply that under IRb, V is also asymptotically uncorrelated with the first term. Given the above analysis, we derive the limiting distribution in the following corollary.

Corollary 3. Under the assumptions of Theorem 3, we have
the limiting distributions of f̂_t under each of IRa, IRb, and IRc.
Remark 8. In the absence of g_t, the term √T W g_t drops out. Furthermore, in the absence of factor dynamics (so that h_t = f_t = ε_t), Theorem 3 characterizes the MLE of f_t under the population restriction E(f_t f_t′) = I_r when IRa or IRb is used. In particular, the principal components estimator of f_t (with Σ̂_ee replaced by an identity matrix) has an analogous representation, in which the matrix Q = Λ′Σ_ee^{−1}Λ/N in the definition of V is replaced by Λ′Λ/N.
For the estimator σ̂_i², we have the following theorem and corollary.

Theorem 4. Under Assumptions A–D, when N, T → ∞, we have the asymptotic representation of σ̂_i² − σ_i².
In addition, √T(σ̂_i² − σ_i²) →d N(0, (2 + κ_i)σ_i⁴), where κ_i is the excess kurtosis of e_it. Under normality of e_it, the limiting distribution reduces to N(0, 2σ_i⁴). Notice that σ_i² does not have the identification problem; consequently, its asymptotic representation does not depend on the identification restrictions. We then consider the asymptotic representation of Φ̂_k − Φ_k, which is stated in the following theorem.
Theorem 5. Under Assumptions A–D, when N, T → ∞ and √T/N → 0, we have the asymptotic representation of √T(Φ̂_k − Φ_k). If the factors f_t were observed, the asymptotic representation would be the standard one for VAR estimation. Since f_t is unobservable, however, the asymptotic representation of √T(Φ̂_k − Φ_k) has two extra terms, −√T BΦ_k + √T Φ_k B. Theorem 5 thus shows that the inferential theory of standard VAR models cannot be applied to the FAVAR model.
Given Theorem 5, we have the following corollary.
Corollary 4. Under the assumptions of Theorem 5, we have the limiting distribution of √T vec(Φ̂_k − Φ_k), where V_k denotes the (k, k)th r × r submatrix of [E(ψ_t ψ_t′)]^{−1} and J is the limiting variance of √T vec(B). Under IRa, J is expressed in terms of D_4 and D_5, which are r² × r_1² and r² × r_1r_2 matrices, respectively, such that vec(B) = D_4 vec(V) + D_5 vec(W); D_6 = (I_r ⊗ Φ_k − Φ_k ⊗ I_r)K_r, with K_r the r-dimensional commutation matrix.
Remark 9. This article assumes h_t follows a finite-order VAR(K) process. This may not be appropriate in certain settings. For example, if the idiosyncratic errors contain unit roots, then differencing the data is necessary; but the factor process h_t may then be over-differenced, and a finite-lag VAR for h_t will not be appropriate. Our model does not apply to this case. A possibility is to allow K to increase slowly with T, as in Berk (1974), but the analysis would be much more complicated. When the errors e_it are all stationary AR(1) processes, there are two ways to proceed. Method 1 is to model the AR(1) process, in which case the likelihood function needs to be modified. Method 2 is to ignore the serial dependence and use the same likelihood function. It can be shown that both approaches give consistent estimates of the factor loadings and the factors, but the limiting distributions will differ. In a non-FAVAR context, Bai and Li (2016) studied some related issues.

IMPULSE RESPONSE FUNCTION
The impulse response function plays an important role in VAR analysis. In this section, we construct confidence intervals for the impulse response functions of model (2.1). Let Φ = (Φ_1, Φ_2, ..., Φ_K). Theorem 5 gives the asymptotic representation of √T vec(Φ̂ − Φ) up to o_p(1).
Let D_9 be a K²r² × r² matrix satisfying vec(I_K ⊗ B) = D_9 vec(B). Given this result, we have the asymptotic representation of √T vec(Φ̂ − Φ). By definition, J is the limiting variance of √T vec(B), and Σ = E(u_t u_t'). Under the assumption that the process h_t is stationary, model (2.1) has a vector MA(∞) representation. Given the asymptotic results for √T vec(Φ̂ − Φ), the limiting distribution of Ψ̂_s − Ψ_s for all s can be derived in the standard way (see Hamilton 1994, p. 336). The limiting result is stated in the following theorem.

Theorem 6. Under Assumptions A–D, when N, T → ∞ and √T/N → 0, we have the limiting distribution of √T vec(Ψ̂_s − Ψ_s), where Υ_s is defined recursively, with Ψ_0 = I_r and Ψ_s = 0 for s < 0.
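The MA coefficients Ψ_s can be computed from the VAR coefficients by the standard recursion Ψ_s = Σ_{k=1}^{K} Φ_k Ψ_{s−k}, with Ψ_0 = I_r and Ψ_s = 0 for s < 0. As an illustration (a minimal numpy sketch; the function name is ours, not from the article):

```python
import numpy as np

def ma_coefficients(Phi_list, n_terms):
    """MA(infinity) coefficients Psi_s implied by VAR coefficients Phi_1..Phi_K.

    Standard recursion: Psi_0 = I_r, Psi_s = 0 for s < 0, and
    Psi_s = sum_{k=1}^{K} Phi_k @ Psi_{s-k} for s >= 1.
    """
    r = Phi_list[0].shape[0]
    Psi = [np.eye(r)]
    for s in range(1, n_terms):
        acc = np.zeros((r, r))
        for k, Phi_k in enumerate(Phi_list, start=1):
            if s - k >= 0:
                acc = acc + Phi_k @ Psi[s - k]
        Psi.append(acc)
    return Psi
```

For a VAR(1) with coefficient matrix A, the recursion reproduces Ψ_s = A^s, which provides a quick sanity check.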
We notice that the above impulse response functions are derived from nonorthogonal shocks. In the analysis of some structural models, impulse response functions for orthogonal shocks are required. For this purpose, we consider decomposing Σ = var(u_t). Let P be the lower triangular matrix obtained from the Cholesky decomposition of Σ, and let ω_t be the corresponding structural shocks, defined by the relation u_t = Pω_t. Then the moving average expression (5.1) can be rewritten with C_s = Ψ_s P being the impulse response function corresponding to the structural shocks ω_t.
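The orthogonalization step can be sketched as follows (a hedged numpy illustration, with names of our choosing): compute P by Cholesky decomposition and set C_s = Ψ_s P.

```python
import numpy as np

def structural_irf(Psi_list, Sigma):
    """Orthogonalized impulse responses C_s = Psi_s @ P, where
    Sigma = P P' (P lower triangular from Cholesky) and u_t = P omega_t."""
    P = np.linalg.cholesky(Sigma)
    return [Psi_s @ P for Psi_s in Psi_list]
```

At horizon 0, C_0 = P, so the impact responses to the orthogonal shocks are given directly by the Cholesky factor; as usual, the ordering of the variables therefore matters.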
Remark 10. There are cases in which no Cholesky decomposition is needed. For instance, in the application of Bernanke, Boivin, and Eliasz (2005), g_t is a scalar, the federal funds rate. Then Σ_υυ is a scalar and hence a diagonal matrix. So under IRa and IRb, Σ̂ is diagonal, implying that the innovations u_t are mutually orthogonal and hence can be interpreted as structural shocks. Under IRc, however, Σ̂ is not diagonal because Σ̂_εε is not diagonal.
Next, we derive the limiting distribution of √T vec(Ĉ_s − C_s), on the basis of which confidence intervals for the impulse response function can be constructed.
By definition, C_s is related to both Ψ_s and P. The limiting distribution of Ψ̂_s − Ψ_s is given in Theorem 6. The limiting distribution of P̂ − P can be derived from the following theorem, since by definition P is related to Σ_εε and Σ_υυ.
Theorem 7. Under the assumptions of Theorem 6, we have the following limiting results, where D⁺_{r_2} is the Moore–Penrose inverse of the r_2-dimensional duplication matrix, D_r is defined in Corollary 1, and W_υ is an r_2-dimensional diagonal matrix whose ith element is κ†_i − 3, where κ†_i is the fourth moment of the ith element of Σ_υυ^{-1/2} υ_t. In addition, under IRc, we also have a corresponding result in which W_ε is defined similarly to W_υ.
Further, based on Theorems 6 and 7, we can derive the limiting distribution of √T vec(Ĉ_s − C_s), as stated in the following theorem.
Then, based on (2.2) and (5.2), the impulse response function of the observable variables z_t with respect to the structural shocks ω_t is obtained for each i and for all k ≥ 0. Note that it has two components, which arise from estimating the loadings (λ_i, γ_i) and the MA(∞) coefficients C_k. From the asymptotic representations of (λ̂_i − λ_i), (γ̂_i − γ_i), and (Ĉ_k − C_k), taking into account their covariances, we obtain the following theorem on the impulse response function.

FINITE SAMPLE PROPERTIES
In this section, we run Monte Carlo simulations to investigate the finite sample properties of the two-step estimators. To save space, we consider only IRb and IRc, which are of more practical relevance. In the factor analysis literature, many studies, such as Bai and Li (2012, 2016) and Doz, Giannone, and Reichlin (2012), investigate the finite sample properties of the QMLE, that is, Λ̃, F̃, and Σ̃_ee. Consequently, in this article we focus instead on the performance of the estimator Φ̂. Notice that Φ̂ is closely related to the impulse response function, which on many occasions is the primary tool of economic analysis. Hence, the finite sample properties of Φ̂ deserve special attention.
The factors are assumed to follow a VAR(1) process, where h_t = (f_t', g_t')' and u_t is an iid N(0, Σ) process. The matrix Σ is restricted by the identification conditions IRb and IRc and takes the corresponding forms. Throughout the simulations, the numbers of unknown and known factors, r_1 and r_2, are set to 2 and 1 (so r = r_1 + r_2 = 3). In addition, the parameter Φ is fixed at 0.7I_r. All the factor loadings are generated independently from N(0, 1) (where Λ is N × 2 and Γ is N × 1). To make the underlying factor loadings satisfy the identification restrictions, we set the (1, 2)th element of Λ to 0 under IRb and the upper 2 × 2 block of Λ to the identity matrix under IRc. After the factor loadings are obtained, the data are generated with e_t = (e_1t, e_2t, ..., e_Nt)' and e_it ∼ N(0, σ_i²).

After the data (Z, G) are constructed, we need to determine the number of unknown factors r_1 before estimation. There are two approaches. One is to first estimate the total number of factors from Z (denoted r̂) by the information criterion proposed by Bai and Ng (2002), and then set r̂_1 = r̂ − r_2, where r_2 is the number of known factors. A better approach is to estimate the number of unknown factors r̂_1 directly from the transformed data ZM_G. The second approach is adopted in the simulations. Once r_1 is determined, we use the method described in the previous section to estimate the parameters.
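The data-generating process just described can be sketched as follows (a minimal numpy version under IRb-style restrictions; the burn-in length, seed, and unit idiosyncratic variances are our illustrative choices, not the article's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r1, r2 = 50, 200, 2, 1
r = r1 + r2                      # total number of factors
Phi = 0.7 * np.eye(r)            # VAR(1) coefficient, as in the design
Sigma_u = np.eye(r)              # innovation variance (illustrative choice)

# factors: h_t = Phi h_{t-1} + u_t, with a burn-in to approximate stationarity
burn = 100
h = np.zeros((T + burn, r))
for t in range(1, T + burn):
    h[t] = Phi @ h[t - 1] + rng.multivariate_normal(np.zeros(r), Sigma_u)
h = h[burn:]
F, G = h[:, :r1], h[:, r1:]      # unobserved and observed factors

# loadings drawn from N(0, 1); zeroing Lambda's (1, 2) entry mimics IRb
Lam = rng.standard_normal((N, r1))
Lam[0, 1] = 0.0
Gam = rng.standard_normal((N, r2))

# observations: z_t = Lambda f_t + Gamma g_t + e_t, with e_it ~ N(0, 1) here
Z = F @ Lam.T + G @ Gam.T + rng.standard_normal((T, N))
```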
The identification condition IRb has a so-called sign problem (see Bai and Li (2012) for an illustration). To eliminate it, after the estimated factors F̃ are obtained, we compute the correlation coefficient between each column of F̃ and the corresponding column of F. If the coefficient is negative, we multiply that column of F̃ and the corresponding column of Λ̂ by −1. In practice, this treatment is infeasible because F is unobserved; however, the sign problem can be fixed by other means, see Stock and Watson (2005). We consider combinations of N = 50, 100, 200 and T = 50, 100, 200, 500. All results are based on 1000 repetitions. Table 1 reports the root mean square error (RMSE) of each element of Φ. The last element of each row is the average of the other nine elements under IRb. On the whole, the RMSE decreases as the sample size grows. More concretely, Table 1 shows that the RMSE is closely linked to the time dimension T and little affected by the cross-sectional size N. Take Φ_11 as an example. When T = 200 and N = 50, 100, 200, the corresponding RMSEs are 0.0611, 0.0572, and 0.0557, which are roughly equal. When T increases to 500, the corresponding RMSEs are 0.0373, 0.0343, and 0.0335, still roughly equal across N but dramatically lower than those for T = 200. This result is consistent with Theorem 5.
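The sign-normalization step (feasible only in simulations, where F is known) can be sketched as follows; the function name is ours:

```python
import numpy as np

def fix_signs(F_hat, F_true, Lam_hat):
    """Flip any column of F_hat (and the matching column of Lam_hat) whose
    correlation with the corresponding column of F_true is negative."""
    F_hat, Lam_hat = F_hat.copy(), Lam_hat.copy()
    for j in range(F_true.shape[1]):
        if np.corrcoef(F_hat[:, j], F_true[:, j])[0, 1] < 0:
            F_hat[:, j] *= -1.0
            Lam_hat[:, j] *= -1.0
    return F_hat, Lam_hat
```

Flipping the same columns of F̂ and Λ̂ leaves the common component F̂Λ̂' unchanged, which is why the sign indeterminacy is harmless for fitted values but must be resolved before comparing estimates with the true factors.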
Aside from consistency, we are also concerned with the limiting distribution of √T vec(Φ̂ − Φ), which, as seen in the last section, directly affects the confidence intervals of the impulse response function. To this end, we compute the t-statistic for each Φ_ij in each simulation and count the number of times its absolute value exceeds 1.96, the 5% critical value of the standard normal distribution, over the 1000 repetitions. Table 2 reports the actual significance level corresponding to the 5% nominal size for each Φ_ij. As in Table 1, we average the results over the nine elements of Φ and report the average in the last column. From Table 2, we find that, unlike in Table 1, the actual significance level depends on both N and T. When the sample size is small, say N = 50 and T = 50, the size distortion is somewhat larger; for Φ_22, the actual significance level is 0.086. However, as the sample grows, the distortion gradually decreases (see the last column). When N = 200 and T = 500, all elements of Φ have satisfactory size.
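The size computation just described amounts to the following (a sketch; `t_stats` would collect the t-statistics of one element of Φ across replications):

```python
import numpy as np

def empirical_size(t_stats, crit=1.96):
    """Fraction of replications in which |t| exceeds the 5% normal critical
    value; to be compared with the 0.05 nominal level."""
    t_stats = np.asarray(t_stats)
    return np.mean(np.abs(t_stats) > crit)
```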
The results under IRc are similar to those under IRb and are reported in Tables 3 and 4. We do not repeat the detailed analysis.

CONCLUDING REMARKS
This article considers the identification, estimation, and inferential theory of the FAVAR model. Three sets of identification restrictions are considered. We propose a likelihood-based two-step method to estimate the parameters. Consistency, convergence rates, asymptotic representations, and limiting distributions have been established. The impulse response function and its confidence intervals are also provided. An important result from our theory is that if the identification conditions are imposed on the population variance rather than on the sample variance of the factor process, an additional term, arising from the distance between the sample variance and the population variance, enters the final asymptotic representations. Consequently, the limiting variances of the estimators are larger. We have studied the ways in which this distance affects the limiting distributions. The finite sample Monte Carlo simulations confirm our theoretical results.
The analysis of this article assumes constant parameters. In empirical applications with a long time span, it is likely that a structural change occurs, either in the dynamics of h t , or in the factor loadings ( , ). It is of interest to develop inference procedures allowing for this possibility, as in Chen, Dolado, and Gonzalo (2014), Cheng, Liao, and Schorfheide (2013), and Han and Inoue (2015).

APPENDIX: TECHNICAL MATERIALS FOR THE ASYMPTOTIC RESULTS
In this appendix, we provide detailed derivations for the asymptotic results under IRa. The derivations for the asymptotic results under IRb and IRc, as well as the theorems in Section 5, are delegated to the online supplement. Throughout the appendix, we use K̄ to denote K + 1 and T̄ to denote T − K − 1. To facilitate the analysis, we introduce the following auxiliary identification condition (an intermediate step in the analysis).

AU1: The underlying parameter values θ* = (Λ*, Γ*, F*, Φ*, Σ_ee) satisfy (1/N)Λ*'Σ_ee^{-1}Λ* = Q*, (1/T)Σ_{t=1}^{T} f*_t f*_t' = I_{r_1}, and (1/T)Σ_{t=1}^{T} f*_t g_t' = 0, where Q* is a diagonal matrix whose diagonal elements are distinct and arranged in descending order.

APPENDIX A: THE ASYMPTOTIC RESULTS OF THE QMLE
In this appendix, we show that the QMLEs λ̃_i, σ̃_i², f̃_t, and Φ̃_k are consistent estimators of λ*_i, σ_i², f*_t, and Φ*_k, respectively, under AU1. We also derive their asymptotic representations.
Proposition A.1. Under Assumptions A–D, together with AU1, the stated consistency results hold. Let Y = ZM_G and let y_t denote the tth column of Y; the above equation is then equivalent to the corresponding expression in terms of y_t. Bai and Li (2012) derived the asymptotic representations of λ̃_i, f̃_t, and σ̃_i² for the case g_t ≡ 1. When g_t is a general random vector, as in the present context, the derivation is the same, since the term eG(G'G)^{-1}g_t is essentially negligible. Using the arguments of Bai and Li (2012), the second term on the right-hand side is zero because Σ_{t=1}^{T} g_t f*_t' = G'F* = 0. Consider the third term.
where h̃_t = (f̃_t', g_t')' and h*_t is defined analogously, and the matrix in question is partitioned into blocks J_11, J_12, J_21, and 0. The first term of J_11 is O_p(N^{-1}) + O_p(T^{-2}), as shown in Bai and Li (2012). Consider the second term. By (A.7), the first term of the resulting expression is O_p(N^{-1/2}T^{-1/2}) + O_p(T^{-1}), since (1/T)Σ_{t=K}^{T} f*_{t−p} f*_{t−q}' = O_p(1) and A = O_p(N^{-1/2}T^{-1/2}) + O_p(T^{-1}), as shown in Bai and Li (2012). The second and third terms are also O_p(N^{-1/2}T^{-1/2}) + O_p(T^{-1}), which can be proved similarly to Lemma C.1(e) of Bai and Li (2012). Given these results, the second term of J_11 is O_p(N^{-1}) + O_p(T^{-1}). The last term can be shown to be of the same magnitude by similar arguments. Summarizing these results, we have J_11 = O_p(N^{-1}) + O_p(T^{-1}). The terms J_12 and J_21 can be shown to be O_p(N^{-1/2}T^{-1/2}) + O_p(T^{-1}) similarly to J_11. Then (a) follows.
Consider (b). The left-hand side of (b) can be written out explicitly; its two nonzero terms are O_p(N^{-1}) + O_p(T^{-1}), as shown in (a). Then (b) follows. Consider (c). The left-hand side of (c) can be written out similarly, so it suffices to consider the first term, which, by (A.7), can be decomposed further; the term involving Σ_ee and Λ̃ can be shown to be O_p(N^{-1/2}T^{-1/2}) + O_p(T^{-1}) similarly to Lemma C.1(e) of Bai and Li (2012). Given these results, (c) follows. This completes the proof of Lemma A.1.
Now we consider the following auxiliary identification restrictions (denoted AU2), in which the loading restrictions are the same as in AU1 but the factor restrictions are imposed on the population: (1/N)Λ'Σ_ee^{-1}Λ = Q, E(f_t f_t') = I_{r_1}, and E(f_t g_t') = 0, where Q is a diagonal matrix whose diagonal elements are distinct and arranged in descending order.
Note that the parameter values θ and θ* are different; the superscript star marks the values satisfying AU1. Different identification restrictions imply different rotations. Because AU1 and AU2 are asymptotically the same (the former imposes the sample moment restriction (1/T)Σ_t f_t f_t' = I_{r_1}, the latter the population moment restriction E(f_t f_t') = I_{r_1}), θ and θ* are also asymptotically the same. That is why the deviation of the MLE from θ also converges to zero in probability, which will be proved below.
The following lemma is useful to our analysis.
Lemma A.2. Let Q be an r × r matrix satisfying the stated conditions, where V is an r × r diagonal matrix with strictly positive and distinct elements arranged in decreasing order, and D is also diagonal. Then Q must be a diagonal matrix with diagonal elements either −1 or 1, and V = D.
Lemma A.2 is proved in Bai and Li (2012).

Proof of Proposition A.3. We show that λ*_i and λ_i are close to each other because AU1 and AU2 are asymptotically the same. Different identification restrictions imply different rotations. Let R be the rotation matrix that transforms (λ*_i', γ*_i')' to (λ_i', γ_i')', giving (A.12). As mentioned in the main text, because the factors g_t are observed, the block R_12 is fixed at 0 and the block R_22 is fixed at I_{r_2}. So equation (A.12) reduces to a simpler system, which gives (A.13). The last equation of (A.13) can be rewritten; post-multiplying both sides by g_t' and summing over t, using Σ_{t=1}^{T} g_t f*_t' = 0, we obtain (A.15). Substituting (A.15) into (A.14) gives (A.16). The first equation of (A.13) shows Λ = Λ* R_11, so the left-hand side of (A.16) converges to I_{r_1} in probability. Thus R_11^{-1} R_11^{-1}' →p I_{r_1}. Applying Lemma A.2 to this limit yields the asymptotic behavior of R_11. Substituting (A.15) into the above equation and noticing λ_i = R_11 λ*_i, we obtain the representation of γ̃_i − γ_i. Now consider f̃_t − f_t. By (A.13) and R_11^{-1} = I_{r_1} + U, the corresponding expansion follows. Moreover, by (A.13) together with R_11 = (I_{r_1} + U)^{-1} and U = O_p(T^{-1/2}), and in addition by (A.14) and (A.12), the remaining terms can be bounded. Given the above results, by (A.26), we have the last expression of Proposition A.3. This completes the proof of Proposition A.3.
The asymptotic result for Φ̃_k under AU2 is given in the following proposition.
Proposition A.4. Under Assumptions A–D, together with the identification condition AU2, we have the following asymptotic representation.

Proof of Proposition A.4. Start from the expression for Φ̃_k − Φ_k. By R_11^{-1} = I_{r_1} + V + O_p(T^{-1}) and R_21 = −W + O_p(T^{-1}), we have R = I_r − B + O_p(T^{-1}). Substituting the preceding two results into (A.27), and using Proposition A.2 to rewrite the result, together with the fact that R − I_r = O_p(T^{-1/2}), we obtain the two stated representations. This completes the proof of Proposition A.4.

APPENDIX B: THE ASYMPTOTIC RESULTS AND THEIR PROOFS UNDER IRa
As in the main text, we use (Λ, Γ, F) to denote the underlying parameters satisfying IRa. Let R be the rotation matrix that transforms (λ_i', γ_i')' into its counterpart; then we have (B.2). The last equation in (B.2) can be rewritten accordingly. Note that the rotation matrix R here is nonrandom. To see this, both AU2 and IRa impose restrictions on the loadings and on the covariance of h_t; hence the rotation matrix R, which transforms the underlying parameters from AU2 to IRa, involves only the loadings and the covariance of h_t, and is thus nonrandom. This is in contrast with the rotation matrix in Appendix A, which is random since AU1 involves f_t. Post-multiplying both sides by g_t' and taking expectations, using E(f_t g_t') = 0, we obtain the stated equation. Define φ_t = R_11^{-1} f_t. From the above results, φ_t has an alternative expression. The following lemmas will be used in the subsequent proofs.

Proof of Theorems
Then R − I_r is O_p(T^{-1/2}), since the B terms are O_p(T^{-1/2}). This result, together with Φ̂_k − Φ_k = O_p(T^{-1/2}) + O_p(N^{-1}), implies V = O_p(N^{-2}) + O_p(T^{-1}). Given this result, together with the rotation relation linking Φ_k under IRa to its counterpart under AU2, we obtain the corresponding expansion. Substituting (B.36) into this equation, together with the analogous rotation relations for u_t and h_t and Proposition A.4, we obtain the asymptotic representation in Theorem 5. This completes the proof of Theorem 5.

SUPPLEMENTARY MATERIALS
The supplementary materials include Appendices C–E: detailed derivations for the asymptotic results under IRb (Appendix C), under IRc (Appendix D), and for the impulse response function (Appendix E).