Martingale Difference Divergence Matrix and Its Application to Dimension Reduction for Stationary Multivariate Time Series

ABSTRACT In this article, we introduce a new methodology to perform dimension reduction for a stationary multivariate time series. Our method is motivated by the consideration of optimal prediction and focuses on the reduction of the effective dimension in the conditional mean of the time series given the past information. In particular, we seek a contemporaneous linear transformation such that the transformed time series has two parts, with one part being conditionally mean independent of the past. To achieve this goal, we first propose the so-called martingale difference divergence matrix (MDDM), which can quantify the conditional mean independence of V ∈ R^p given U ∈ R^q and also encodes the number and form of linear combinations of V that are conditionally mean independent of U. Our dimension reduction procedure is based on the eigen-decomposition of the cumulative martingale difference divergence matrix, which is an extension of MDDM to the time series context. Interestingly, there is a static factor model representation for our dimension reduction framework, and it has a subtle difference from the existing static factor models used in the time series literature. Some theory is also provided about the rates of convergence of the eigenvalues and eigenvectors of the sample cumulative MDDM in the fixed-dimensional setting. Favorable finite sample performance is demonstrated via simulations and real data illustrations in comparison with some existing methods. Supplementary materials for this article are available online.


Introduction
A central problem in the modeling and inference of multivariate time series is the reduction of dimensionality of parameters. In the time domain, several dimension reduction methods have been proposed, including the canonical correlation analysis by Box and Tiao (1977), the factor models by Peña and Box (1987), the scalar component analysis by Tiao and Tsay (1989), the independent component analysis by Back and Weigend (1997), the principal component analysis by Stock and Watson (2002), and the dynamic orthogonal component analysis by Matteson and Tsay (2011). In these works, linear combinations are sought to make linearly transformed series have simpler dynamic structure, which can be captured by parsimonious parametric models. In the spectral domain, dimension reduction methods have been developed by Geweke (1977), Brillinger (1981), Stoffer (1999), Ombao, von Sachs, and Guo (2005), and Eichler, Motta, and von Sachs (2011), among others.
In this article, we propose a new methodology to perform dimension reduction for a strictly stationary multivariate time series. Our proposal is motivated by the consideration of optimal prediction. Let Y_t ∈ R^p, t ∈ Z, be a mean zero p-variate stationary time series; then the optimal predictor of Y_{n+1} given the past information set F_n = σ(Y_n, Y_{n-1}, . . .) is E(Y_{n+1}|F_n) in the mean squared error sense. This leads us to focus on the dimension reduction of E(Y_{n+1}|F_n), which we intend to do without imposing any parametric or linear structure. In particular, we seek a contemporaneous linear (invertible) transformation of Y_t, say M ∈ R^{p×p}, such that MY_t = Z_t = (Z_{1t}^T, Z_{2t}^T)^T, where Z_{1t} ∈ R^s and Z_{2t} ∈ R^{p-s}, and E(Z_{2(n+1)}|F_n) = E(Z_{2(n+1)}). In other words, the transformed series can be separated into two parts, with one part being conditionally mean dependent on the past and the other part being conditionally mean independent of the past. Thus, the modeling task for the whole series Y_t is reduced to that for the lower dimensional series Z_{1t}, and dimension reduction is achieved without loss of prediction accuracy. Interestingly, our new method can be formulated equivalently in a factor model framework. Representing multiple time series in terms of several static or dynamic factors is quite popular and the literature is large; see Peña and Box (1987), Forni, Hallin, and Reichlin (2000, 2005), Bai and Ng (2002), Bai (2003), Stock and Watson (2005), Pan and Yao (2008), Lam, Yao, and Bathia (2011), and Lam and Yao (2012), among others. A distinction from the static factor models in the existing literature is that our error process, that is, e_t = Y_t − E(Y_t|F_{t−1}), is a vector martingale difference sequence, which is stronger than the usual vector white-noise assumption.
This implies that the effective number of factors under our model could be different from (more precisely, is equal to or larger than) the number of factors in the factor models described by Peña and Box (1987) and Pan and Yao (2008). A more detailed discussion of the difference is provided in Section 4.2.
To quantify conditional mean (in)dependence for a multivariate time series, we extend the notion of martingale difference divergence (MDD), recently proposed by Shao and Zhang (2014) to measure the conditional mean dependence of a univariate response with respect to a vector covariate, in several aspects. First, we consider a multivariate response variable and generalize MDD to a matrix-valued quantity called MDDM (martingale difference divergence matrix). Second, we define the cumulative MDDM by taking the sum of MDDMs at several lags to account for the underlying time series structure, either jointly or in a pairwise fashion. To determine the number and the form of linear combinations that are conditionally mean independent of the past, we perform the eigen-decomposition of the sample cumulative MDDM and use a ratio-based estimator, as adopted in Lam, Yao, and Bathia (2011) and Lam and Yao (2012). Note that the inference in the latter work is based on a linear analog of the cumulative MDDM, which only measures linear dependence. Since non-Gaussian and nonlinear time series are prevalent in various applied areas, our methodology has a built-in advantage over the ones that rely on linear dependence measures for the dimension reduction of multivariate nonlinear time series.
The rest of the article is organized as follows. Section 2 provides a review of martingale difference divergence and its sample estimate. In Section 3, we introduce the definition of martingale difference divergence matrix and its properties. Our dimension reduction methodology for conditional mean is presented in Section 4, which includes an extension of principal component analysis to principal conditional mean component analysis, and factor model representation as well as some discussion of practical issues and related work. Simulation results are gathered in Section 5. Section 6 presents two real data illustrations and Section 7 concludes. Technical details are relegated to supplementary material.
A word on notation. Let i = √−1 be the imaginary unit. For x ∈ C^p, we use x* for the conjugate transpose of x (the conjugate for scalars). The scalar product of vectors x and y is denoted by ⟨x, y⟩, and |x|_p denotes the Euclidean norm of x ∈ R^p.

Review of Martingale Difference Divergence
For U ∈ R^q and V ∈ R, where q is a fixed positive integer, Shao and Zhang (2014) proposed the so-called martingale difference divergence (MDD) and its standardized version, martingale difference correlation (MDC), to measure the conditional mean independence of V on U, that is,

E(V | U) = E(V) almost surely. (1)

Specifically, MDD(V|U) is defined as the nonnegative number that satisfies

MDD(V|U)^2 = (1/c_q) ∫_{R^q} |g_{V,U}(s) − g_V g_U(s)|^2 / |s|_q^{1+q} ds, (2)

where g_{V,U}(s) = E[V e^{i⟨s,U⟩}], g_V = E(V), g_U(s) = E(e^{i⟨s,U⟩}), and c_q = π^{(1+q)/2}/Γ((1+q)/2). The definition can be regarded as an extension of distance covariance (Székely, Rizzo, and Bakirov 2007), since a similar weighting function is used, and it inherits many desirable properties of distance covariance. For example,

MDD(V|U)^2 = −E[(V − E V)(V′ − E V′)|U − U′|_q],

where (V′, U′) is an independent copy of (V, U).
Recently, Park, Shao, and Yao (2015) made an extension of MDD to allow a multivariate response. If V ∈ R^p, p ≥ 1, then the characteristic function-based definition (2) still applies. Under the assumption that E(|V|_p^2 + |U|_q) < ∞, one has

MDD(V|U)^2 = −E[(V − E V)^T(V′ − E V′)|U − U′|_q],

which is a scalar-valued quantity. Most of the properties mentioned in Shao and Zhang (2014) still hold for this more general definition. Assume that we have a random sample (U_k, V_k)_{k=1}^n from the joint distribution of (U, V). Let V̄_n = (1/n)Σ_{k=1}^n V_k, a_{kl} = V_k V_l, ā_{k·} = (1/n)Σ_{l=1}^n a_{kl} = V_k V̄_n, ā_{·l} = (1/n)Σ_{k=1}^n a_{kl} = V̄_n V_l, ā_{··} = (1/n^2)Σ_{k,l=1}^n a_{kl} = V̄_n V̄_n, and A_{kl} = a_{kl} − ā_{k·} − ā_{·l} + ā_{··} = (V_k − V̄_n)(V_l − V̄_n). Further let b_{kl} = |U_k − U_l|_q. Based on the above quantities, the sample martingale difference divergence MDD_n (Shao and Zhang 2014) is defined as the nonnegative number that satisfies

MDD_n(V|U)^2 = −(1/n^2) Σ_{k,l=1}^n A_{kl} b_{kl} = (1/c_q) ∫_{R^q} |g^n_{V,U}(s) − g^n_V g^n_U(s)|^2 / |s|_q^{1+q} ds,

where g^n_{V,U}(s) = (1/n)Σ_j V_j e^{i⟨s,U_j⟩}, g^n_V = (1/n)Σ_j V_j, and g^n_U(s) = (1/n)Σ_j e^{i⟨s,U_j⟩}. The second equality in the above equation is shown in Theorem 2 of Shao and Zhang (2014), and it implies that the simpler algebraic form is equivalent to an empirical plug-in version. The above definition applies to the case p = 1. When p > 1, the sample MDD is defined as the nonnegative number that satisfies the same characteristic function-based expression. It turns out that the definition can be further simplified as

MDD_n(V|U)^2 = −(1/n^2) Σ_{k,l=1}^n (V_k − V̄_n)^T (V_l − V̄_n) |U_k − U_l|_q,

which can be shown by a straightforward calculation, and the details are omitted. Note that in general MDD_n(V|U)^2 is a biased estimator of MDD(V|U)^2; however, the bias is expected to be asymptotically negligible when p is fixed. It is indeed possible to adopt the U-centering idea (Székely and Rizzo 2014; Park, Shao, and Yao 2015) to define an unbiased estimator, but it unfortunately complicates the asymptotic analysis in Section 4.
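To make the algebraic form concrete, the sample statistic can be sketched in a few lines of NumPy. This is our own minimal illustration (the function name and array conventions are not from the article or any package):

```python
import numpy as np

def sample_mdd2(V, U):
    """Sample MDD_n(V|U)^2 via the simple algebraic form
    -(1/n^2) * sum_{k,l} A_kl * |U_k - U_l|_q, where A_kl is the
    centered product matrix of the responses."""
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if V.ndim == 1:
        V = V[:, None]
    if U.ndim == 1:
        U = U[:, None]
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                 # center the response
    A = Vc @ Vc.T                           # A_kl = (V_k - Vbar) . (V_l - Vbar)
    # pairwise Euclidean distances b_kl = |U_k - U_l|_q
    b = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1))
    return float(-(A * b).sum() / n**2)
```

When V is conditionally mean dependent on U the statistic is positive in samples, while a constant V gives exactly zero, since its centered products vanish.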

Martingale Difference Divergence Matrix
For U ∈ R^q and V ∈ R^p, it is possible that there exists a linear combination α^T V, for some nonzero α ∈ R^p, such that E(α^T V|U) = E(α^T V) almost surely. In the presence of such a relationship, the modeling of the conditional mean of V as a function of U can be simplified, as the effective dimension of E(V|U) can be reduced via linear transformation and separating out the part that is conditionally mean independent of U. To this end, we introduce a new matrix object, the so-called martingale difference divergence matrix (MDDM), which can be viewed as an extension of martingale difference divergence from a scalar to a matrix.
Definition 1. For V ∈ R^p and U ∈ R^q, the martingale difference divergence matrix is defined as

MDDM(V|U) = (1/c_q) ∫_{R^q} (g_{V,U}(s) − g_V g_U(s)) (g_{V,U}(s) − g_V g_U(s))* / |s|_q^{1+q} ds.

Note that the (i, i)th entry of the p × p matrix MDDM(V|U) equals MDD(V_i|U)^2, whereas the (i, j)th entry is

(1/c_q) ∫_{R^q} (g_{V_i,U}(s) − g_{V_i} g_U(s)) (g_{V_j,U}(s) − g_{V_j} g_U(s))* / |s|_q^{1+q} ds;

that is, MDDM(V|U) is a Hermitian matrix and thus has real eigenvalues. Here, for notational simplicity, we opt to use the notation MDDM instead of MDDM^2 in our definition. Below we provide a simple and equivalent expression for MDDM.
Lemma 1. Suppose E(|V|_p^2 + |U|_q) < ∞. Then

MDDM(V|U) = −E[(V − E V)(V′ − E V′)^T |U − U′|_q],

where (V′, U′) is an iid copy of (V, U). Therefore, MDDM(V|U) is a real, symmetric, and positive semidefinite matrix.
By elementary matrix algebra, it is not difficult to show from Lemma 1 that α^T MDDM(V|U) α = MDD(α^T V|U)^2 for any α ∈ R^p. Lemma 1 thus implies the following theorem, which states that the rank of the MDDM is closely tied to the number of linear combinations of V that are conditionally mean independent of U.
Theorem 1. There exist p − s linearly independent combinations of V that are conditionally mean independent of U if and only if rank(MDDM(V|U)) = s.

Remark 1. We shall provide a discussion on a possible analog of MDDM to measure the linear dependence between two vectors V ∈ R^p and U ∈ R^q. Define

L(V|U) = cov(V, U) cov(U, V) = E[(V − E V)(U − E U)^T] E[(U − E U)(V − E V)^T].

It is easy to show that L(V|U) is a real, symmetric, and positive semidefinite matrix. Then there exists a nonzero α ∈ R^p such that cov(α^T V, U) = 0 (i.e., a linear combination of V is uncorrelated with U) if and only if L(V|U) is singular. Further, suppose the number of linearly independent combinations of V that are uncorrelated with U is p − r; then r = rank(L(V|U)). Since conditional mean independence implies uncorrelatedness, it is not difficult to show that rank(MDDM(V|U)) ≥ rank(L(V|U)).
Given a random sample (U_k, V_k)_{k=1}^n from the joint distribution of (U, V), the sample martingale difference divergence matrix MDDM_n can be defined as

MDDM_n(V|U) = −(1/n^2) Σ_{k,l=1}^n (V_k − V̄_n)(V_l − V̄_n)^T |U_k − U_l|_q.
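The sample MDDM is also a few lines of NumPy. The sketch below uses our own naming; the identity α^T MDDM_n(V|U) α = MDD_n(α^T V|U)^2 makes its symmetry and positive semidefiniteness easy to check numerically:

```python
import numpy as np

def sample_mddm(V, U):
    """Sample MDDM_n(V|U) = -(1/n^2) sum_{k,l} (V_k - Vbar)(V_l - Vbar)^T |U_k - U_l|_q.

    V : (n, p) response matrix; U : (n,) or (n, q) covariate matrix.
    Returns a symmetric positive semidefinite p x p matrix.
    """
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                 # center the responses
    b = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1))
    # (Vc^T b Vc)_{ij} = sum_{k,l} Vc[k,i] * b[k,l] * Vc[l,j]
    return -(Vc.T @ b @ Vc) / n**2
```

The quadratic form a @ M @ a for any direction a recovers the sample MDD of the corresponding linear combination of V.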

Dimension Reduction for Conditional Mean
As we mentioned in Section 1, our goal is to seek a linear transformation of Y_t such that the transformed series can be separated into two parts, with one part being conditionally mean independent of the past. Mathematically, we look for linear combinations of Y_t, say α^T Y_t, that are conditionally mean independent of F_{t−1} = σ(Y_{t−1}, Y_{t−2}, . . .). As we only have a finite stretch of observations from the process Y_t, t ∈ Z, we shall approximate the conditional mean independence of α^T Y_t on F_{t−1} by that on F_{t−1,t−k_0} = σ(Y_{t−1}, . . . , Y_{t−k_0}), where k_0 is a prespecified fixed integer. This practice is quite common in time series analysis, and it is consistent with the notion that for weakly dependent time series the main dependence is at short lags. The approximation can in fact be supported by certain time series models. For example, if the time series follows a VAR(k_0) model, then the conditional distribution of Y_t given F_{t−1} is identical to the conditional distribution of Y_t given F_{t−1,t−k_0}, so there is no loss of information in this approximation. In the sequel, we define the so-called cumulative MDDM to quantify the conditional mean independence of Y_t on its recent past F_{t−1,t−k_0}.
Definition 2. The cumulative MDDM is defined as

Λ_{k_0} = Σ_{j=1}^{k_0} MDDM(Y_t | Y_{t−j}).

Since each MDDM(Y_t|Y_{t−j}) is a positive semidefinite matrix, Λ_{k_0} is also positive semidefinite. Note that MDDM(Y_t|Y_{t−j}) depends on the time lag j but not on t, due to strict stationarity. The sample estimate of Λ_{k_0} is given by

Λ̂_{k_0} = Σ_{j=1}^{k_0} MDDM_n(Y_t | Y_{t−j}).

Remark 2. Our definition follows the common practice in time series analysis, where the cumulative contribution from various lags is added up in a pairwise fashion. Alternatively, we could adopt a joint approach; that is, we can define

Λ^{(2)}_{k_0} = MDDM(Y_t | (Y_{t−1}^T, . . . , Y_{t−k_0}^T)^T)

and its sample estimate Λ̂^{(2)}_{k_0} analogously. Writing Λ^{(1)}_{k_0} ≡ Λ_{k_0}, for a given k_0 it seems difficult to know whether inference based on Λ^{(1)}_{k_0} or Λ^{(2)}_{k_0} is preferred for the given series at hand. We shall compare the finite sample performance of inferences based on Λ^{(j)}_{k_0}, j = 1, 2, in simulation studies.
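Both versions can be computed from a data matrix of the series. The following self-contained NumPy sketch uses our own function names and the (assumed) convention of using the maximal overlapping sample at each lag:

```python
import numpy as np

def sample_mddm(V, U):
    # -(1/n^2) sum_{k,l} (V_k - Vbar)(V_l - Vbar)^T |U_k - U_l|
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    n = V.shape[0]
    Vc = V - V.mean(axis=0)
    b = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1))
    return -(Vc.T @ b @ Vc) / n**2

def cumulative_mddm(Y, k0, joint=False):
    """Pairwise version: sum_{j=1}^{k0} MDDM_n(Y_t | Y_{t-j}).
    Joint version: MDDM_n(Y_t | (Y_{t-1}^T, ..., Y_{t-k0}^T)^T)."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    if joint:
        # stack the k0 lags side by side as one covariate vector
        U = np.hstack([Y[k0 - j : n - j] for j in range(1, k0 + 1)])
        return sample_mddm(Y[k0:], U)
    M = np.zeros((p, p))
    for j in range(1, k0 + 1):
        M += sample_mddm(Y[j:], Y[:-j])     # response Y_t, covariate Y_{t-j}
    return M
```

Both outputs are symmetric p × p matrices, and the pairwise version is positive semidefinite as a sum of positive semidefinite summands.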

Principal Conditional Mean Components (PCMC)
As outlined in Section 1, dimension reduction for the conditional mean can be achieved once we identify the number and the form of linear combinations of Y_t that are conditionally mean independent of the past. It turns out that such information is encoded in Λ_{k_0} (or Λ^{(2)}_{k_0}); see Theorem 1. Inspired by the work of Hu and Tsay (2014), who proposed the concept of principal volatility component analysis, we shall introduce the so-called principal conditional mean component analysis.
Since Λ_{k_0} is a real symmetric positive semidefinite matrix, its eigenvalues {λ_j}_{j=1}^p are either zero or positive. We shall assume that

λ_1 ≥ λ_2 ≥ · · · ≥ λ_s > 0 = λ_{s+1} = · · · = λ_p.

Let γ_j be an orthonormal eigenvector of Λ_{k_0} corresponding to the eigenvalue λ_j. Then, for j = s + 1, . . . , p, we have MDD(γ_j^T Y_t | Y_{t−h})^2 = 0 for h = 1, . . . , k_0; that is, γ_j^T Y_t is conditionally mean independent of each of Y_{t−1}, . . . , Y_{t−k_0}. Since all these linear combinations live in the null space of Λ_{k_0}, they can be readily estimated based on the eigen-decomposition of Λ̂_{k_0}.
Let (λ̂_j, γ̂_j)_{j=1}^p be the p pairs of eigenvalues and eigenvectors of Λ̂_{k_0}, where λ̂_1 ≥ λ̂_2 ≥ · · · ≥ λ̂_p and the eigenvectors {γ̂_j}_{j=1}^p are orthonormal. To estimate s, the rank of Λ_{k_0}, we adopt a ratio-based estimator following the practice of Lam, Yao, and Bathia (2011) and Lam and Yao (2012), that is,

ŝ = argmin_{1≤j≤R} λ̂_{j+1}/λ̂_j, (6)

where R is an integer satisfying s ≤ R < p. There are other methods developed for the estimation of s; see, for example, Bai and Ng (2002, 2007) and Hallin and Liska (2007) for information criteria-based approaches, Bathia, Yao, and Ziegelmann (2010) for a bootstrap approach, and Hu and Tsay (2014) for a sequential testing approach. As mentioned by Lam and Yao (2012), the ratio-based method can be viewed as an enhanced scree test (Cattell 1966), and it is very easy to implement. We use the ratio-based estimator in part because of the connection of our dimension reduction framework with the ones by Lam, Yao, and Bathia (2011) and Lam and Yao (2012); see Section 4.2 for details. It is also worth noting that Lam and Yao (2012) showed that the ratio-based estimator can still work even when p is large and grows to infinity in the asymptotics. For j = 1, . . . , p, we can then estimate γ_j by γ̂_j. Since γ̂_j might be replaced by −γ̂_j, the results stated below concerning a comparison of eigenvectors implicitly assume that signs have been chosen appropriately. Below we further impose suitable moment and weak dependence assumptions on Y_t.
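The eigen-decomposition plus the ratio-based rule can be sketched as follows (a minimal illustration with our own function name; in practice the input would be the sample cumulative MDDM):

```python
import numpy as np

def pcmc(M_hat, R):
    """Eigen-decomposition of a sample cumulative MDDM together with the
    ratio-based estimate of the number of factors s.

    Returns (s_hat, eigenvalues in descending order, matching eigenvectors).
    """
    lam, gam = np.linalg.eigh(M_hat)        # ascending eigenvalues
    lam, gam = lam[::-1], gam[:, ::-1]      # reorder to descending
    ratios = lam[1:R + 1] / lam[:R]         # lambda_{j+1}/lambda_j, j = 1..R
    s_hat = int(np.argmin(ratios)) + 1      # argmin of the eigenvalue ratios
    return s_hat, lam, gam
```

For instance, a spectrum with two large eigenvalues followed by near-zero ones produces a sharp drop in the ratio at j = 2, so ŝ = 2.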
(C2) Let (Y_t)_{t∈N} be a strictly stationary and β-mixing process. Assume that there exists δ > 0 such that E|Y_t|^{2+δ} < ∞ and the β-mixing coefficients satisfy Σ_{k=1}^∞ β(k)^{δ/(2+δ)} < ∞.

Theorem 2 is analogous to Proposition 1 in Lam and Yao (2012), but is stated for our cumulative MDDM. It suggests that the empirical eigenvalues and eigenvectors obtained by the eigen-decomposition of the sample cumulative MDDM are indeed reasonable estimators of their population counterparts for large sample sizes. The above theory is developed for the fixed p case; theory for the growing p setting seems very challenging and is left for future research. Nevertheless, we examine the finite sample performance of the ratio-based estimator in the large p setting in simulation studies.

Factor Model Representation
In this subsection, we shall provide a factor model representation for our dimension reduction framework. Our static factor model is closely related to the one used in Peña and Box (1987), Pan and Yao (2008), Lam, Yao, and Bathia (2011), as well as Lam and Yao (2012). We provide a brief review of the latter first.
Factor Model with White Noise Error (FM-WNE).
Y_t = A X_t + ε_t,

where Y_t is a p × 1 observed vector of time series, X_t is an r × 1 latent factor time series, which is assumed to be stationary and not a white noise, A is a p × r constant factor loading matrix, r ≤ p is the number of factors, and {ε_t} is a p × 1 white-noise sequence with mean zero. It is important to note that the matrix A is not uniquely identified: if we replace A and X_t by AQ and Q^{−1}X_t for any invertible matrix Q, we still get the same Y_t. Following the practice in Pan and Yao (2008), Lam, Yao, and Bathia (2011), and Lam and Yao (2012), we assume that A^T A = I_r, and let B be a p × (p − r) matrix for which (A, B) forms a p × p orthogonal matrix. Even with the above assumption, the matrix A is not unique, but the factor loading space M(A), which is the linear space spanned by the columns of A, is uniquely defined. Note that M(B) can be interpreted as the null space of the factor loading space M(A). The main inferential task is to estimate r, the number of factors, and the factor loading space M(A) or A.
Once an estimator of A, say Â, is obtained, it is natural to estimate the factor process X_t by X̂_t = Â^T Y_t, and the residuals are ε̂_t = Y_t − ÂX̂_t = (I_p − ÂÂ^T)Y_t. The next step is to fit a parsimonious model to X̂_t, which may be achieved by rotating X̂_t appropriately, or equivalently modeling H^T X̂_t, where H is an r × r orthogonal matrix.
Based on the above model assumptions, we derive that

B^T Y_t = B^T ε_t and A^T Y_t = X_t + A^T ε_t, (7)

where {B^T ε_t} and {A^T ε_t} are both white-noise sequences. This implies that certain linear combinations of Y_t (i.e., those corresponding to B^T Y_t) are white noise. Assuming that the cross-correlations between X_t and ε_t are zero at all lags, we can derive that cov(Y_{t+k}, Y_t) = A cov(X_{t+k}, X_t) A^T for k = 1, 2, . . .; thus the columns of B are the orthonormal eigenvectors of cov(Y_{t+k}, Y_t) corresponding to zero eigenvalues. This observation led them to focus on the following matrix:

L_{k_0} = Σ_{k=1}^{k_0} cov(Y_{t+k}, Y_t) cov(Y_{t+k}, Y_t)^T, (9)

and their estimation of r and M(A) is based on the eigen-analysis of a sample estimate of the positive semidefinite matrix L_{k_0}. It is interesting to note that the matrix L_{k_0} they defined is basically a linear counterpart of our cumulative MDDM Λ_{k_0}; see Remark 1. Thus, the main difference between the two is that L_{k_0} encodes the information about the number and form of linear combinations of Y_t that are uncorrelated with Y_{t−1}, . . . , Y_{t−k_0} in a pairwise fashion, whereas Λ_{k_0} encodes the number and form of linear combinations of Y_t that are conditionally mean independent of Y_{t−1}, . . . , Y_{t−k_0} in a pairwise fashion.
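For comparison with the cumulative MDDM, a plug-in sample analog of L_{k_0} takes only a few lines of NumPy (a sketch with our own naming and the simple 1/n normalization for the autocovariances):

```python
import numpy as np

def sample_L(Y, k0):
    """Sample version of L_{k0} = sum_{k=1}^{k0} cov(Y_{t+k}, Y_t) cov(Y_{t+k}, Y_t)^T."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    L = np.zeros((p, p))
    for k in range(1, k0 + 1):
        S = Yc[k:].T @ Yc[:-k] / n          # lag-k sample autocovariance matrix
        L += S @ S.T                        # accumulate S S^T (positive semidefinite)
    return L
```

Like the pairwise cumulative MDDM, the result is symmetric positive semidefinite, so the same eigen-analysis and ratio-based estimator apply to either matrix.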
In time series analysis, cov(Y_{t+j}, Y_t) is used to measure the linear dependence at lag j, whereas MDDM(Y_{t+j}|Y_t) is used to measure the conditional mean dependence at lag j. If the time series is Gaussian, then the second-order property fully characterizes the joint distribution, and autocovariances at all lags are sufficient to characterize the joint dependence. However, for non-Gaussian and nonlinear time series, the second-order property may not be sufficient to characterize the serial dependence, which has motivated the development of various nonlinear dependence measures in the literature; see Hong (1999), Granger, Maasoumi, and Racine (2004), and Zhou (2012), among others. Most of these nonlinear dependence measures are, however, scalar-valued, and do not seem directly useful for dimension reduction.
Factor Model with Martingale Difference Error (FM-MDE).
Our dimension reduction framework admits the representation

Y_t = Ã Z_t + η_t, (10)

where Ã is the p × s factor loading matrix and Z_t is the s-dimensional latent factor process. Similar to the factor model with white-noise error, only the factor loading space M(Ã) is uniquely defined; note that M(Ã) is the column space of (γ_1, . . . , γ_s). For the convenience of discussion, we assume that there is a p × (p − s) matrix B̃ such that (Ã, B̃) forms a p × p orthogonal matrix. Following the argument in the derivation of (7), we have

B̃^T Y_t = B̃^T η_t, (11)
Ã^T Y_t = Z_t + Ã^T η_t, (12)

where {B̃^T η_t} and {Ã^T η_t} are martingale difference sequences. Based on (12), it is easy to see that if we can obtain good estimates of Ã and B̃ (or the corresponding column spaces), say Â and B̂, then we can estimate Z_t by Ẑ_t = Â^T Y_t, and the residuals are η̂_t = Y_t − ÂẐ_t = (I_p − ÂÂ^T)Y_t. As the two factor models (FM-WNE and FM-MDE) appear to have the same form, it pays to mention their differences. On one hand, our latent factor process Z_t is measurable with respect to F_{t−1}, and its contemporaneous linear combination ÃZ_t is the conditional mean of Y_t given the past information by definition. Since E(Y_t|F_{t−1}) has the interpretation of being the optimal predictor of Y_t given F_{t−1} (in the mean squared error sense), our dimension reduction is well motivated by the consideration of optimal prediction. By contrast, the process X_t in FM-WNE is not necessarily measurable with respect to F_{t−1}, and AX_t may not be the conditional mean. On the other hand, the estimation methods are different for these two factor models. Under the FM-WNE, we seek contemporaneous linear combinations (i.e., B) that make the transformed sequence B^T Y_t a white-noise sequence, whereas under the FM-MDE, contemporaneous linear transformations (i.e., B̃) are sought to make B̃^T Y_t a martingale difference sequence; compare (7) and (11). Due to the different requirements on the error sequence, the matrix objects that encode the information about the dimension of the latent factor process and the factor loading space are different.
Under the FM-WNE, we take advantage of the assumptions on the second-order properties of (X_t, ε_t), and the inference is based on the matrix L_{k_0}, which encodes linear dependence, whereas under the FM-MDE, we naturally focus on the cumulative MDDM Λ_{k_0}, which characterizes conditional mean independence. To shed some light on the difference, we consider the following state space model.

Example 1. Consider the model Y_t = A W_t + ε_{1t}, where (ε_{1t})_{t∈Z} is a p-dimensional iid mean zero error process. Let W_t = h(ε_{2t}, ε_{2(t−1)}, . . .), t ∈ Z, be an r-dimensional nonlinear stationary causal process, where {ε_{2t}}_{t∈Z} is the r-dimensional mean zero iid innovation process that is independent of (ε_{1t})_{t∈Z}. Assume that W_t is not a white-noise sequence. Note that several models used in the simulation studies of Lam, Yao, and Bathia (2011) and Lam and Yao (2012) fall into the above framework. Table 1 shows the detailed representation under the two factor models. Although the latent processes under the two models (i.e., X_t and Z_t) are different, it is easy to see that r = s and M(A) = M(Ã); that is, the two factor loading spaces are identical. It would be interesting to see which inference method (i.e., the one based on L_{k_0} vs. the one based on Λ_{k_0}) delivers a better estimate of the factor loading space in this case, and we shall address this question in our simulations.
In general, a white-noise sequence is not necessarily a martingale difference sequence, but a martingale difference sequence must be a white-noise sequence under a finite second moment assumption. This fact implies that for a stationary time series Y_t that admits both representations (i.e., FM-WNE and FM-MDE), the two could coincide, as demonstrated in the following proposition.

Proposition 1. Suppose that Y_t admits the FM-WNE representation, where X_t is measurable with respect to F_{t−1} and {ε_t} is a martingale difference sequence with respect to F_{t−1}. Then Y_t also admits the FM-MDE representation with s = r and M(Ã) = M(A).

In some cases, s can be strictly larger than r, as shown in the following example.

Example 2. Consider the model Y_t = A_1 W_{1t} + A_2 W_{2t} + ε_{1t}, where (i) W_{1t} is an r-dimensional stationary causal process as defined in Example 1 and is not a white-noise sequence; (ii) W_{2t} is a q-dimensional vector white-noise sequence but not a martingale difference sequence, and the ε_{1t} are iid mean zero; and (iii) the three processes {W_{1t}}, {W_{2t}}, and {ε_{1t}} are mutually independent. An example of a white-noise sequence that is not a martingale difference sequence can be found in Shao (2011).

Let W_{2t} = (W_{3t}^T, W_{4t}^T)^T, where W_{3t} is of dimension r and W_{4t} is of dimension q − r, and reformulate the model accordingly. It is easy to see that under the framework of FM-WNE, A_2 W_{2t} + ε_{1t} is a vector white noise, so M(A) = M(A_1) and the number of factors is r, whereas under the framework of FM-MDE, the conditional mean of W_{2t} given the past is generally nonzero, so s > r. We shall examine the performance of our dimension reduction method for this example in Section 6.

Related Work and Practical Issues
As pointed out by a referee, our work is to some extent related to Park, Sriram, and Yin (2009, 2010), who extended the sufficient dimension reduction framework from random samples to the time series setting. Specifically, the latter authors focused on univariate time series and considered the estimation of the central subspace (Park, Sriram, and Yin 2010) and the central mean subspace (Park, Sriram, and Yin 2009) of Y_t given the lagged observations (Y_{t−1}, . . . , Y_{t−d}), where d is assumed to be fixed and possibly unknown. For the central subspace, they estimated it by minimizing a Kullback-Leibler distance, and for the central mean subspace, they used a variant of MAVE (minimum average variance estimation), which was proposed by Xia et al. (2002) and shown to be applicable to time series data. While the work by Park, Sriram, and Yin (2009, 2010) mainly focuses on the dimension reduction of the covariates, which are naturally defined as the lagged observations (Y_{t−1}, . . . , Y_{t−d}) in the time series setting, our work focuses on the dimension reduction of the multivariate response Y_t, and the two are thus quite different in terms of the goal and the setting. In particular, (1) the parameter Park et al. target is the column space associated with the central subspace or central mean subspace, whereas we want to estimate the space spanned by linear combinations that make the response conditionally mean independent of the past information. In a sense, our dimension reduction is closer in spirit to the recently developed envelope models by Cook, Li, and Chiaromonte (2010), which also remove the redundant linear combinations of the response that are not related to covariates. But the latter was done under a linear model and for random samples, whereas we do not make any parametric/linear assumptions and our reduction is formulated in a time series setting; (2) our dimension reduction is based on the spectral decomposition of a sample matrix, and no smoothing is involved, whereas nonparametric estimation and smoothing are required in Park, Sriram, and Yin (2009, 2010) since the targeted space is different.
In practical implementation, we need to come up with a choice of k_0. In theory, k_0 can be chosen as the smallest positive integer that makes rank(Λ_{k_0}) = rank(Λ_{k_0+j}) for all j ≥ 1; since Λ_{k_0} is a sum of positive semidefinite matrices, its rank is nondecreasing in k_0. Thus, the determination of k_0 itself is a nontrivial task. If the series follows a vector autoregressive model, then the partial autocorrelation function provides an indication of the magnitude of k_0. Alternatively, we can first look at the lag-j sample MDD, that is, MDD_n(Y_t|Y_{t−j})^2, accompanied by a standard error estimated from, say, the moving block bootstrap, and choose k_0 as the largest j such that the jth MDD is still significantly different from zero. For time series that exhibit seasonal dependence patterns, we often want to let k_0 be an integer multiple of the period (see Section 6.2) to capture the conditional mean dependence at seasonal lags. We leave a more careful and data-driven choice of k_0 and its impact on dimension reduction to future work.
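The moving-block-bootstrap standard error for the lag-j sample MDD can be sketched as follows. This is our own minimal implementation; the block length, replication count, and resampling convention are illustrative assumptions, not the article's prescription:

```python
import numpy as np

def sample_mdd2(V, U):
    # -(1/n^2) sum_{k,l} (V_k - Vbar)^T (V_l - Vbar) |U_k - U_l|
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    if V.ndim == 1:
        V = V[:, None]
    if U.ndim == 1:
        U = U[:, None]
    n = V.shape[0]
    Vc = V - V.mean(axis=0)
    b = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1))
    return float(-((Vc @ Vc.T) * b).sum() / n**2)

def lag_mdd2(Y, j):
    # lag-j sample MDD^2: response Y_t, covariate Y_{t-j}
    return sample_mdd2(Y[j:], Y[:-j])

def mbb_se(Y, j, block_len=10, n_boot=200, seed=0):
    """Moving block bootstrap standard error of the lag-j sample MDD^2."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    n_blocks = -(-n // block_len)           # ceil(n / block_len)
    stats = []
    for _ in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n]
        stats.append(lag_mdd2(Y[idx], j))
    return float(np.std(stats))
```

One would then declare lag j significant when lag_mdd2(Y, j) is large relative to its bootstrap standard error, and take k_0 as the largest significant lag.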
An additional practical and methodological issue is that after we obtain the estimated rank of Λ_{k_0}, say ŝ, and the estimated null space of Λ_{k_0}, spanned by M̂_1 = [γ̂_{ŝ+1}, . . . , γ̂_p], it would be useful to verify that the transformed series M̂_1^T Y_t is indeed conditionally mean independent of F_{t−1} (or of F_{t−1,t−k_0} in a pairwise fashion). To this end, one can look at the magnitude of H_n = Σ_{j=1}^{k_0} MDD_n(M̂_1^T Y_t | Y_{t−j})^2 and perform a significance test. Under the null that the transformed series is conditionally mean independent of the past, H_n is expected to be small, and its significance can presumably be assessed by using a block bootstrap approach. We shall also leave a rigorous investigation of this topic to future study.

Numerical Simulations
In this section, we examine the finite sample performance of our MDDM-based dimension reduction approach via simulations. In particular, we compare with the method in Lam and Yao (2012), which is based on the linear dependence metric L_{k_0} (see (9)). In our simulations, we tried Λ^{(j)}_{k_0}, j = 1, 2, for several values of k_0 to assess the sensitivity of our dimension reduction method with respect to the choice of k_0 and of the cumulative MDDM employed. Even though our theory is developed for the fixed p case, we also investigate the finite sample performance for the large p case by Monte Carlo experiments.
Recall that for both methods, two steps are involved in the estimation. The first step is the estimation of the true number of factors, that is, r or s, using the ratio-based estimator (see (6)), where we set R = p − 1 in the small-p setting and R = [p/2] or [p/3] in the large-p setting. The second step is the estimation of the factor loading space, that is, M(A) (or M(Ã)), once r (or s) is estimated; this is achieved by performing the eigen-decomposition of the sample cumulative MDDM as described in Section 4.1. For each example, we replicate the simulation 1000 times and use the following criteria to measure the accuracy of our dimension reduction method: (i) the proportion of replicates in which the number of factors is estimated correctly; (ii) the D-distance between the estimated and true factor loading spaces, as used in Lam and Yao (2012), which takes values in [0, 1] and equals zero if and only if the two spaces coincide; and (iii) the RMSE, which measures the overall closeness of the estimated signal ÂX̂_t to the true signal AX_t. A smaller value of RMSE indicates more accurate estimation of the underlying factor series. We shall investigate the following examples.
Example 3. Example 3 is adapted from the simulation study of Pan and Yao (2008) with slight modification, so that the model falls into the framework of Example 1. We define the factor series X_t = (x_{1t}, x_{2t}, x_{3t})^T with

x_{1,t} = 0.8 x_{1,t−1} + e_{1,t}, x_{2,t} = e_{2,t} + 0.9 e_{2,t−1} + 0.3 e_{2,t−2},

where e_{i,t}, i = 1, 2, 3, are all iid standard normal variables. The observed data Y_t = (Y_{1,t}, . . . , Y_{p,t})^T are defined by Y_t = AX_t + ε_t, where ε_{i,t}, i = 1, 2, . . . , p, are iid standard normal variables and independent from {e_{j,t}}, j = 1, 2, 3. We consider several different combinations of (p, n, k_0), that is, p = 5, 10, 20, n = 300, 600, 1000, and k_0 = 1, 3. For the above data-generating process, the true numbers of factors r and s are both 3, and the factor loading matrix is A = Ã = (I_3, 0_{p−3})^T. Note that when k_0 = 1, Λ_{k_0} and Λ^{(2)}_{k_0} become the same, so some results are identical in Table 2.
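A data-generating process in the spirit of Example 3 can be sketched as below. Note that the specification of x_{3,t} does not survive in the text above, so the third factor here uses a placeholder AR(1) recursion chosen by us purely for illustration; the rest follows the stated design Y_t = AX_t + ε_t with A = (I_3, 0_{p−3})^T:

```python
import numpy as np

def gen_example3(n, p, seed=0):
    """Simulate data in the spirit of Example 3 (requires p >= 3).

    x1: AR(1) with coefficient 0.8; x2: MA(2) with coefficients (0.9, 0.3);
    x3: placeholder AR(1) -- a hypothetical stand-in, since the article's
    x_{3,t} specification is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    burn = 100                                  # burn-in for stationarity
    e = rng.standard_normal((3, n + burn))
    x1 = np.zeros(n + burn)
    x3 = np.zeros(n + burn)
    for t in range(1, n + burn):
        x1[t] = 0.8 * x1[t - 1] + e[0, t]
        x3[t] = -0.5 * x3[t - 1] + e[2, t]      # hypothetical third factor
    x2 = e[1].copy()                            # MA(2): e_t + 0.9 e_{t-1} + 0.3 e_{t-2}
    x2[1:] += 0.9 * e[1, :-1]
    x2[2:] += 0.3 * e[1, :-2]
    X = np.stack([x1, x2, x3])[:, burn:]        # 3 x n factor series
    A = np.vstack([np.eye(3), np.zeros((p - 3, 3))])   # A = (I_3, 0)^T
    eps = rng.standard_normal((p, n))           # iid standard normal errors
    Y = A @ X + eps                             # Y_t = A X_t + eps_t
    return Y.T, A, X.T
```

Feeding the simulated Y into the sample cumulative MDDM and the ratio-based estimator then reproduces the kind of Monte Carlo comparison summarized in Table 2.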
It appears from Table 2 that when p increases or n decreases, the ability to correctly identify the true number of factors diminishes and the D-distance gets larger for all methods. It might be expected that the method based on L_{k_0} performs the best, since Y_t is generated from a Gaussian linear time series and all dependence of Y_t can be effectively captured by autocovariance matrices. However, when k_0 = 1, our MDDM-based approach is superior to Lam and Yao's L_{k_0}-based counterpart in terms of the probability of correctly estimating the true number of factors and the D-distance. For k_0 = 3, the performance of the L_{k_0}-based and MDDM-based methods is similar. It is interesting to note that when k_0 increases from 1 to 3, the performance of the L_{k_0}-based method improves substantially, showing its sensitivity with respect to the choice of k_0, whereas for Λ_{k_0} (or Λ^{(2)}_{k_0}), the performance is quite stable with respect to k_0.
Example 4. In this example, the linear ARMA model for X_t in Example 3 is replaced by a nonlinear one, in which X_t = (x_{1,t}, x_{2,t}, x_{3,t})^T includes the exponential AR component

x_{2,t} = (0.5 e^{−0.4 x_{2,t−1}^2} + 0.4) x_{2,t−1} + e_{2,t}.

Then the data are Y_t = AX_t + ε_t, where A = (I_3, 0_{p−3})^T and the ε_{i,t}, e_{i,t} are iid standard normal and independent of each other. As in Example 3, we consider n = 300, 600, 1000, p = 5, 10, 20, and k_0 = 1, 3. According to Theorem 5.1 and Example 5.1 in Shao and Wei (2007), X_t admits a stationary solution and the model falls into the framework of Example 1. For this example, the true numbers of factors r and s are still 3 and the two factor loading spaces coincide. From Table 3, we can see that when k_0 = 1, our method outperforms the L_{k_0}-based one, with a smaller D-distance and a higher proportion of correctly estimated numbers of factors. When k_0 = 3, all methods are fairly comparable in estimating both the true number of factors and the factor loading matrix. Overall, the findings are similar to those in Example 3.

Table 3. Mean and standard error (in brackets) of the D-distance and the estimated numbers of factors for Example 4.

Example 5. This example is adopted from Lam, Yao, and Bathia (2011) and addresses the large-p case, where p can exceed n. More specifically, three factors X_t = (x_{1t}, x_{2t}, x_{3t})^T are generated as in that article. For the factor loading matrix A, the first p/2 elements of each column are iid U(−2, 2), kept fixed once generated, and the remaining elements are set to 0. The data are then Y_t = AX_t + ε_t. We consider (p, n) = (100, 100), (100, 200), (400, 200) with k_0 = 1 and 5. Again, the true numbers of factors r and s are 3 and the model falls into the framework of Example 1.

Table 4. Mean and standard error (in brackets) of the D-distance and the estimated numbers of factors for Example 5.

According to Table 4, the performance of the L_{k_0}-based and MDDM-based methods is very much comparable for k_0 = 1 and 5.
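The exponential AR recursion for x_{2,t} above can be simulated directly. A minimal sketch (sample size, burn-in, and seed are ours); the autoregressive coefficient 0.5 e^{−0.4 x^2} + 0.4 always lies in (0.4, 0.9), which is what keeps the recursion stable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, burn = 600, 100
e = rng.standard_normal(n + burn)

# x_t = (0.5 * exp(-0.4 * x_{t-1}^2) + 0.4) * x_{t-1} + e_t
x = np.zeros(n + burn)
for t in range(1, n + burn):
    x[t] = (0.5 * np.exp(-0.4 * x[t - 1] ** 2) + 0.4) * x[t - 1] + e[t]
x = x[burn:]
```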
It appears that when p = 400 and n = 200, none of the methods succeeds, as the proportion of correctly estimated numbers of factors is low. This phenomenon might be due to the use of the ratio-based estimator. A more sophisticated method for estimating the number of factors in the large-p setting was recently developed by Li, Wang, and Yao (2017).
Example 6. In this example, we replace the data-generating process of w_t in Example 5 with a nonlinear one, where the error term is generated from N(0, 0.25) and the remaining quantities are as defined in Example 5. Other parts of the model, such as A, are exactly the same as in Example 5, along with the combinations of (p, n, k_0). From Table 4, our MDDM-based approach outperforms the L_{k_0}-based counterpart in all cases and under both criteria, with the advantage more pronounced when k_0 = 1. When k_0 = 5, the second cumulative MDDM variant is slightly inferior to the first, but still superior to the L_{k_0}-based one, especially for the case (p, n) = (400, 200). We speculate that part of the reason the L_{k_0}-based approach does not perform well is that it only captures linear auto-dependence. In the presence of strong nonlinearity in the factor series and relatively low noise (compared to Example 5), the inability of the L_{k_0}-based method to accommodate nonlinear dependence is amplified. It is also worth noting that for both cumulative MDDM variants and L_{k_0}, the performance can depend on k_0 to some extent.
Example 7. Example 7 is constructed following Example 2, where r and s are different. The factor loading matrix A = (A_1, A_2) is a 10 × 3 matrix. For each column of A, the first five elements are iid U(−2, 2) and the remaining five are set to 0. The component series are generated with e_{1,t} following N(0, 8^2) and e_{3,t} following N(0, 1.5^2). The data are then generated by Y_t = AX_t + ε_t = A_1(W_{1t} + W_{3t}) + A_2 W_{4t} + ε_t, where ε_t ∼ N(0, 0.5 × I_{10}) is independent of {e_{i,t}}, i = 1, 3. Note that W_{1t} is from a stationary MA(1) model, and W_{it}, i = 3, 4, are consecutive observations from a stationary nonlinear MA model. Here p = 10, n = 50, 100, 200, and k_0 is either 1 or 3. Theoretically, r equals 2 but s equals 3, so the true numbers of factors r and s differ in this example. Not only do the numbers of factors differ, but the factor loading spaces differ as well. This is because W_{3t} and W_{4t} are white-noise sequences but not martingale difference sequences; see Example 2.3 in Shao (2011). According to Table 5, the two methods clearly target different numbers of factors, and the proportion of correctly estimated numbers of factors increases as n increases. This example confirms that our MDDM-based method seeks different linear combinations of Y_t than the L_{k_0}-based counterpart, and that the true numbers of factors inferred from L_{k_0} and the cumulative MDDM can differ.
Our limited simulation results suggest that (1) for dimension reduction of the conditional mean, the MDDM-based approach can outperform the L_{k_0}-based one in both the linear Gaussian case and the nonlinear case in the small-p setting, and the L_{k_0}-based approach seems noticeably inferior when k_0 is small or when nonlinearity in the series is strong; and (2) in the large-p setting, our MDDM-based method remains competitive with the L_{k_0}-based one and can outperform it when the factor series is nonlinear.

The performance of the L_{k_0}- and MDDM-based methods depends on the choice of k_0, the number of lags involved; in this case, k_0 = 3 often delivers the optimal forecasting error (unreported results show that increasing k_0 beyond 3 does not reduce the forecasting error significantly). Noticeably, the MDDM-based approach appears less sensitive to the choice of k_0 than the L_{k_0}-based ones. It is worth noting that the gain in forecasting accuracy of the MDDM-based approach (as compared to the L_{k_0}-based one) is entirely due to the use of a different matrix to extract the number of factors and the factor loading matrix, as we apply the same procedure of fitting a VAR model to the estimated factor series.
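Both approaches fit the same VAR model to the estimated factor series before forecasting. A minimal sketch of a one-step VAR(1) forecast by least squares; the function name, interface, and row-vector convention are ours:

```python
import numpy as np

def var1_forecast(X):
    """Fit x_t ~ x_{t-1} B by least squares on an estimated factor
    series X (rows = time, columns = factors) and return the
    one-step-ahead forecast x_{n+1} = x_n B.  Row-vector convention:
    B is the transpose of the usual VAR(1) coefficient matrix."""
    Z, Y = X[:-1], X[1:]                       # lagged and current values
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # Y ~ Z @ B
    return X[-1] @ B
```

In practice one would choose the VAR order by an information criterion; the VAR(1) case shown here is only the simplest instance of the procedure described in the text.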

Seven City Temperature Series
The second dataset we analyze is the monthly temperature series for seven cities in Eastern China: Nanjing, Dongtai, Huoshan, Hefei, Shanghai, Anqing, and Hangzhou. The series run from January 1954 to December 1998; a portion of this dataset was analyzed by Pan and Yao (2008). The length of the series is n = 540 and the dimension is p = 7. Figure 2 suggests that there exists strong seasonal dependence. Following the approach in Section 6.1, we use the first 80% of the data, that is, the first 433 observations, as the training set; the remaining 107 observations form the testing set. Using a rolling-window approach, we obtain Ŷ^{(1)}_{433+j} based on (Y_j, ..., Y_{432+j}) for j = 1, ..., 107 and then report the average forecasting error over the 107 periods. The one-step-ahead forecast is implemented using the following procedure: (1) for j = 1, ..., 107, we apply PCMC to the training data (Y_j, ..., Y_{432+j}) and estimate the number of factors and the factor series; (2) a (possibly multivariate) seasonal ARIMA (0, 0, 1) × (0, 1, 1)_{12} model is fitted to the estimated factor series. The order of this particular model was determined by checking ACF plots of the estimated factor series, as shown in Figure 3 for one particular training sample; the model fits the estimated factor series quite well, as the residual autocorrelations are fairly small for most training samples; see Figure 4 for the average absolute ACF of the residual series after fitting the above seasonal model to the factor series estimated by the MDDM-based approach. Similar findings apply to the L_{k_0}-based one.
(3) The one-step-ahead forecast is then Â X̂^{(1)}_{433+j}, where Â is the estimated factor loading matrix and X̂^{(1)}_{433+j} is the one-step-ahead prediction from the seasonal ARIMA model fitted to the lower-dimensional factor series. In Table 7, we present the average forecasting error and the proportions of the estimated numbers of factors based on L_{k_0} and the cumulative MDDM. The choices k_0 = 12, 24, 36 reflect the fact that the time series is apparently seasonal with period 12.
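The rolling-window evaluation in steps (1)-(3) can be skeletonized as follows. The window bookkeeping matches the text (fit on Y_j, ..., Y_{j+train_len−1}, forecast the next observation), while `forecaster` stands in for the dimension-reduction-plus-seasonal-ARIMA pipeline; all names here are ours:

```python
import numpy as np

def rolling_one_step(Y, train_len, forecaster):
    """Rolling-window one-step-ahead evaluation: for each window start j,
    fit on Y[j : j + train_len], predict the next observation, and return
    the average forecasting error over all windows."""
    n = len(Y)
    errs = []
    for j in range(n - train_len):
        window = Y[j:j + train_len]
        pred = forecaster(window)                        # one-step-ahead forecast
        errs.append(np.linalg.norm(Y[j + train_len] - pred))
    return float(np.mean(errs))
```

For the temperature data, n = 540 and train_len = 433 would yield the 107 evaluation periods described in the text.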
It can be seen from Table 7 that the L_{k_0}-based and MDDM-based approaches deliver different numbers of factors, and that the performance of both methods is stable with respect to the choice of k_0 = 12, 24, 36. The MDDM-based method has substantially smaller forecasting errors than the L_{k_0}-based method, as shown in Figure 5. This suggests that using a one-factor model, as estimated by the MDDM-based approach for all training samples, can lead to more accurate forecasting. Note that the underlying factor series is nonstationary, which is not covered by our theory; nevertheless, this shows the potential applicability of our approach to nonstationary time series. For both real data examples, it would be interesting to fit multivariate nonlinear time series models to the estimated factor series after applying the MDDM-based dimension reduction method. We did not pursue this step since there seems to be no general guidance on how such modeling can be conducted.

Summary and Conclusion
In this article, we propose the so-called martingale difference divergence matrix (MDDM) to quantify the conditional mean (in)dependence of a random vector V ∈ R^p on U ∈ R^q. The MDDM encodes the number and form of linear combinations of V that are conditionally mean independent of U. Building on this property, we generalize MDDM to the time series context and introduce the cumulative MDDM, which can approximately quantify the conditional mean independence of Y_t given the past information F_{t−1}. Dimension reduction for a multivariate time series is then achieved by estimating the number and form of linear combinations that are conditionally mean independent of the past, which are encoded in the cumulative MDDM. Compared to the linear dependence metric used in Lam, Yao, and Bathia (2011) and Lam and Yao (2012), our cumulative MDDM is a natural nonlinear analog and can capture unknown nonlinear mean dependence. We also present a static factor model representation for our dimension reduction framework and discuss its subtle difference from the static factor model studied in the previous literature. Since in practice we typically do not know a priori whether nonlinear dependence exists, it may be safer and more robust to use our MDDM-based dimension reduction approach. In terms of implementation, since the sample MDDM has a V-statistic form, it can be readily calculated, and the estimation of the number of factors and the factor loading matrix can be conveniently implemented by spectral decomposition of the sample cumulative MDDM. For MDDM-based dimension reduction of the conditional mean, we provide theoretical results justifying the validity of our method. Although our theory covers only the case of fixed p, we investigate the finite sample performance in both small-p and large-p settings in our simulation studies.
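To illustrate the V-statistic form mentioned above, here is a sketch of the sample MDDM, assuming the definition MDDM_n(V|U) = −(1/n^2) Σ_{i,j} (V_i − V̄)(V_j − V̄)^T ||U_i − U_j|| (the exact definition appears earlier in the article; the function name and interface are ours):

```python
import numpy as np

def sample_mddm(V, U):
    """V-statistic form of the sample MDDM (assumed definition):
    -(1/n^2) * sum_{i,j} (V_i - Vbar)(V_j - Vbar)^T ||U_i - U_j||,
    for V of shape (n, p) and U of shape (n, q).  Returns a p x p
    symmetric positive semidefinite matrix."""
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                  # center V
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # pairwise ||U_i - U_j||
    return -(Vc.T @ D @ Vc) / n ** 2
```

The double sum reduces to two matrix products, so the cost is O(n^2) pairwise distances plus dense linear algebra, which is why the estimator is easy to compute in practice.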
In the small-p setting, our limited simulation results show that our MDDM-based approach can be as effective as its linear counterpart in Lam and Yao (2012) for linear Gaussian time series and can outperform the latter for nonlinear time series. The merits of MDDM-based dimension reduction are further supported by two real data illustrations. Although the simulations suggest our approach may still be useful in the large-p setting, there is currently no theoretical support. As seen in Li, Wang, and Yao (2017), there can be complications with the ratio-based estimator (see (6)) in the high-dimensional setting, and one may have to resort to random matrix theory to derive a sensible estimator of the number of factors. In addition, the finite sample simulation results and data illustrations show that in some cases the results can depend on the choice of k_0, the number of lags included in the cumulative MDDM. It would be desirable to develop a data-driven rule for choosing k_0 beyond visual inspection of the (partial) autocorrelation plots. Furthermore, strict stationarity is assumed throughout, and it would be interesting to extend the MDDM-based methodology to allow nonstationary series; see Pan and Yao (2008), Motta, Hafner, and Ombao (2011), and Eichler, Motta, and von Sachs (2011). An extension to dimension reduction for the conditional variance-covariance matrix using MDDM and its variants would also be interesting. Research along these directions is well underway.

Supplementary Materials
The supplemental materials contain the proofs of Lemma 1, Theorem 1, Theorem 2, and Proposition 1.