A Dynamic Structure for High-Dimensional Covariance Matrices and Its Application in Portfolio Allocation

ABSTRACT Estimation of high-dimensional covariance matrices is an important research topic. In this article, we propose a dynamic structure and develop an estimation procedure for high-dimensional covariance matrices. Asymptotic properties are derived to justify the estimation procedure, and simulation studies are conducted to demonstrate its performance when the sample size is finite. In an empirical study of a financial application, portfolio allocation based on dynamic high-dimensional covariance matrices significantly outperforms the market from 1995 to 2014. Our proposed method also outperforms portfolio allocation based on the sample covariance matrix, on covariance matrices from factor models, and on the shrinkage estimator of the covariance matrix. Supplementary materials for this article are available online.


Introduction
Covariance matrix estimation is an important topic in statistics and econometrics, with wide applications in many disciplines such as economics, finance and psychology. A traditional approach is based on the sample covariance matrix. However, the sample covariance matrix is not a good choice when the dimension is large, especially when its inverse is required, as is often the case when constructing a portfolio allocation in finance. This is because estimation errors accumulate when the inverse of the sample covariance matrix is used to estimate the inverse of the covariance matrix. When the size of the covariance matrix is large, the cumulative estimation error becomes unacceptable even if the estimation error of each entry is tiny.
In recent years there have been various attempts to address high-dimensional covariance matrix estimation. Usually, a sparsity condition is imposed to control the trade-off between variance and bias; see Wu and Pourahmadi (2003), El Karoui (2008), Bickel and Levina (2008a, 2008b), Lam and Fan (2009), Fan, Liao, and Mincheva (2011), and the references therein. Fan, Fan and Lv (2008) considered a different approach, imposing a factor model and estimating the covariance matrix based on this structure.
Most of the literature on high-dimensional covariance matrix estimation assumes that the covariance matrix is constant over time. However, in many applications covariance matrices are dynamic. For example, today's optimal portfolio allocation may not be optimal tomorrow, or next month. Therefore, when applying the formula for Markowitz's optimal portfolio allocation (Markowitz 1959), the covariance matrix used should be dynamic and allowed to change over time.
To introduce a dynamic structure for covariance matrices, one cannot simply assume that each entry of a covariance matrix is a function of time, because this would not serve prediction well. Instead, we start with an approach stimulated by Fan, Fan and Lv (2008), which is based on the Fama-French three-factor model (Fama and French 1992, 1993), where y_t is the excess return of an asset and X_t is the vector of the three factors at time t. To make (1.1) more flexible, we allow the coefficient a to depend on the values of the three factors at time t − 1. To avoid the so-called 'curse of dimensionality', we assume this dependence is through a linear combination of the values of the three factors at time t − 1. This motivates a dynamic structure for the covariance matrix of a random vector Y_t through an adaptive varying coefficient model, which we now introduce.
Suppose (X_t^T, Y_t^T)^T, t = 1, …, n, is a time series, where Y_t is a p_n-dimensional vector and X_t is a q-dimensional factor. An underlying assumption throughout this paper is that p_n → ∞ as n → ∞, while q is fixed. We also assume that X_t, t = 1, …, n, is a stationary Markov process.
We assume model (1.3), in which Φ(X_{t−1}^T β) is a factor loading matrix varying with the index X_{t−1}^T β, and {ε_t, t = 1, …, n} are random errors independent of {X_t, t = 1, …, n}. We assume that, for each k = 1, …, p_n, the conditional variance of ε_{k,t} follows the GARCH-type recursion (1.4) for some integers m and s. Let F_t be the σ-algebra generated by {(X_l^T, ε_l^T) : l ≤ t}. The main focus of this paper is on the conditional covariance matrix (1.5), where Σ_x(X_{t−1}) = cov(X_t | X_{t−1}). In (1.5), Φ(·), β, Σ_x(·), α_{k,i} and γ_{k,j}, i = 0, …, m, j = 1, …, s, are unknown and need to be estimated. Not only does (1.5) introduce a dynamic structure for cov(Y_t | F_{t−1}), it also reduces the number of unknowns from p_n(p_n + 1)/2 to p_n q + q² unknown functions and q + s + m + 1 unknown parameters.
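The display equations (1.3)-(1.5) themselves are not preserved in this copy; based on the surrounding description, the structure can plausibly be sketched as follows (a hedged reconstruction, not the authors' exact displays):

```latex
% Adaptive varying-coefficient factor model, as described around (1.3):
Y_t \;=\; g(X_{t-1}^{T}\beta) \;+\; \Phi(X_{t-1}^{T}\beta)\,X_t \;+\; \varepsilon_t,
\qquad t = 1,\dots,n,
% with componentwise GARCH-type conditional variances (1.4) for the errors,
% \sigma^2_{k,t} = \alpha_{k,0} + \sum_{i=1}^{m}\alpha_{k,i}\,\varepsilon^2_{k,t-i}
%                 + \sum_{j=1}^{s}\gamma_{k,j}\,\sigma^2_{k,t-j},
% so that the conditional covariance matrix (1.5) would take the form
\operatorname{cov}(Y_t \mid \mathcal{F}_{t-1})
 \;=\; \Phi(X_{t-1}^{T}\beta)\,\Sigma_x(X_{t-1})\,\Phi(X_{t-1}^{T}\beta)^{T}
 \;+\; \Sigma_{0,t},
\qquad \Sigma_{0,t} = \operatorname{diag}\bigl(\sigma^2_{1,t},\dots,\sigma^2_{p_n,t}\bigr).
```

This form is consistent with the plug-in estimator of E(Y_t | F_{t−1}) used in Section 6 and with the componentwise GARCH estimation in Section 2.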
We remark that model (1.3) is interesting in its own right, since it combines single-index modelling (Härdle et al. 1993; Carroll et al. 1997; Yu and Ruppert 2002; Xia and Härdle 2006; Kong and Xia 2014) and varying coefficient modelling (Fan and Zhang 1999, 2000; Fan et al. 2003; Sun et al. 2007; Zhang et al. 2009; Li and Zhang 2011; Sun et al. 2014). In this paper, as a by-product, an estimation procedure for (1.3) is proposed and an iterative algorithm is developed for implementation.
This paper is organised as follows. We begin in Section 2 with a description of the proposed estimation procedure for cov(Y_t | F_{t−1}). A discussion of bandwidth selection is given in Section 3. In Section 4 we provide asymptotic properties of the estimation procedure. An iterative algorithm to implement the estimation procedure is suggested in Section 5. Using the proposed dynamic structure for covariance matrices and the developed estimation procedure, we outline a process for constructing a portfolio allocation based on the formula for Markowitz's optimal portfolio in Section 6. The performance of the estimation procedure and the portfolio allocation is assessed by simulation studies in Section 7. In Section 8, we apply the portfolio allocation methodology to a data set consisting of 49 industry portfolios, freely available from Kenneth French's website. We find that the proposed methodology works surprisingly well. All detailed proofs are relegated to the appendix.

Estimation procedure
In this section, we introduce an estimation procedure for cov(Y_t | F_{t−1}). We first estimate β, Φ(·), Σ_x(·), α_{k,i} and γ_{k,j}, and denote the resulting estimators by β̂, Φ̂(·), Σ̂_x(·), α̂_{k,i} and γ̂_{k,j} for i = 0, …, m and j = 1, …, s. Let Σ̂_{0,t} be Σ_{0,t} with α_{k,i} and γ_{k,j} replaced by α̂_{k,i} and γ̂_{k,j}, respectively; these estimators are then combined to form the estimator of cov(Y_t | F_{t−1}). Throughout this paper, for any function f(x), we use ḟ(x) to denote its derivative. For any functional matrix F = (f_{ij}(x)), we define its derivative as Ḟ = (ḟ_{ij}(x)). For any integers p and q, we use 0_{p×q} to denote a p × q matrix with each entry being 0, and 1_p to denote a p-dimensional vector with each component being 1.

Estimation of β
A Taylor expansion gives a local linear approximation of the unknown functions. This, together with the idea of least squares estimation, brings us to a local discrepancy function, in which h is a bandwidth, and g_j, ξ_j, A_j and B_j denote g(X_j^T β), ġ(X_j^T β), Φ(X_j^T β) and Φ̇(X_j^T β), respectively. Minimising this discrepancy under the identifiability conditions, we use the corresponding value of β as the estimator and denote it by β̂.

Estimation of Φ(•) and g(•)
Once an estimate β̂ has been obtained, the estimators of Φ(·) and g(·) can be constructed row by row through a standard univariate varying coefficient model for each component of Y_t. By (1.3), for each k = 1, …, p_n, we have a synthetic univariate varying coefficient model. Applying local linear estimation for standard varying-coefficient models then yields, for any given u, the estimators, where h_1 is a bandwidth.
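As an illustration, local linear estimation of a varying-coefficient model of this kind can be sketched as below. The function names, the simulation-free weighted least squares formulation, and the kernel choice are ours; this is a minimal numerical sketch under stated assumptions, not the paper's implementation.

```python
import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 (1 - z^2)_+, the kernel used in the paper."""
    return 0.75 * np.maximum(1.0 - z**2, 0.0)

def local_linear_vc(u, y, u_obs, X, h):
    """Local linear fit at the point u of the varying-coefficient model
       y_t = g(u_t) + Phi(u_t)^T x_t + error, where u_t is the index value.
       Returns (g_hat, phi_hat), with phi_hat of length q."""
    n, q = X.shape
    w = epanechnikov((u_obs - u) / h)          # kernel weights around u
    # design: intercept, slope of g, coefficients, slopes of the coefficients
    D = np.column_stack([np.ones(n), u_obs - u, X, (u_obs - u)[:, None] * X])
    sw = np.sqrt(w)
    # weighted least squares via a square-root-weighted ordinary LS problem
    beta = np.linalg.lstsq(D * sw[:, None], y * sw, rcond=None)[0]
    g_hat = beta[0]
    phi_hat = beta[2:2 + q]
    return g_hat, phi_hat
```

Because the minimiser of the locally weighted sum of squares is a weighted least squares solution, it has the closed form used by the iterative algorithm of Section 5.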

Estimation of Σ x (•)
To estimate E(X_t | X_{t−1} = u) and E(X_t X_t^T | X_{t−1} = u) for any given u, we use local constant estimators. This gives us an estimator of Σ_x(u), where h_2 is a bandwidth.
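A minimal sketch of this local constant (Nadaraya-Watson) step, assuming a product Epanechnikov kernel in the q factor dimensions; the function name and kernel detail are our own choices:

```python
import numpy as np

def nw_cov(u, X, h):
    """Local constant estimator of Sigma_x(u) = cov(X_t | X_{t-1} = u)
       from a q-dimensional factor series X of shape (n, q):
       Sigma_x(u) = E(X_t X_t^T | u) - E(X_t | u) E(X_t | u)^T."""
    X_prev, X_curr = X[:-1], X[1:]
    # product Epanechnikov kernel weights centred at u
    z = (X_prev - u) / h
    w = np.prod(0.75 * np.maximum(1.0 - z**2, 0.0), axis=1)
    w = w / w.sum()
    m1 = w @ X_curr                          # estimate of E(X_t | X_{t-1} = u)
    m2 = (X_curr * w[:, None]).T @ X_curr    # estimate of E(X_t X_t^T | X_{t-1} = u)
    return m2 - np.outer(m1, m1)
```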

Estimation of Σ
By (1.4), we have a synthetic GARCH model, which can be written equivalently with η_{k,t} = r²_{k,t} − σ²_{k,t}, γ_{k,i} = 0 when i > s, and α_{k,i} = 0 when i > m. Once α_{k,i} and γ_{k,j} have been estimated, substituting them into (2.5) and setting σ²_{k,l} = r²_{k,l} for l ≤ max(m, s), we obtain an estimator σ̂²_{k,t} of σ²_{k,t} and hence an estimator Σ̂_{0,t} of Σ_{0,t}. For each k, let θ_k denote the vector of GARCH parameters. We use a quasi-maximum likelihood approach to estimate θ_k. We define the negative quasi log-likelihood function Q_{k,n}(θ_k), where the σ²_{k,t}(θ_k) are recursively defined by (2.5) with suitable initial values. Minimising Q_{k,n}(θ_k) with respect to θ_k over a compact set Λ defined in (B3) in Appendix A, we use the minimiser θ̂_k to estimate θ_k.
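A minimal sketch of the quasi-maximum likelihood step for one component, assuming a GARCH(1,1) specification (m = s = 1) and a sample-variance initialisation (the paper's exact initial values are not preserved in this copy); the coarse grid search merely stands in for minimisation over the compact set Λ:

```python
import numpy as np

def neg_quasi_loglik(theta, r):
    """Negative Gaussian quasi log-likelihood for the recursion
       sigma2_t = a0 + a1 * r_{t-1}^2 + g1 * sigma2_{t-1},
       with sigma2_1 initialised at the sample variance of r (our assumption)."""
    a0, a1, g1 = theta
    n = len(r)
    sigma2 = np.empty(n)
    sigma2[0] = np.var(r)
    for t in range(1, n):
        sigma2[t] = a0 + a1 * r[t - 1]**2 + g1 * sigma2[t - 1]
    return np.mean(np.log(sigma2) + r**2 / sigma2)

def fit_garch11(r, grid=np.linspace(0.02, 0.95, 20)):
    """Quasi-MLE by a coarse grid search over (a1, g1), with the intercept
       a0 moment-matched to the sample variance; a sketch only."""
    best, best_q = None, np.inf
    for a1 in grid:
        for g1 in grid:
            if a1 + g1 >= 0.999:        # stay inside the stationarity region
                continue
            a0 = np.var(r) * (1.0 - a1 - g1)
            q = neg_quasi_loglik((a0, a1, g1), r)
            if q < best_q:
                best, best_q = (a0, a1, g1), q
    return best
```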

Bandwidth selection
The choice of the bandwidth h, used in the estimation of β, is not crucial. According to some numerical analysis not presented here for brevity, the accuracy of the estimator β̂ is not very sensitive to h, as long as h lies within a reasonable range. In the computational algorithm for estimating β (see Section 5), we recommend choosing a bandwidth h equal to around 20% of the range in (3.1), where β̃ is a randomly chosen initial estimate of β. We update h on subsequent iterations by replacing β̃ in (3.1) with the most recent estimate of β. This approach is employed in the simulation studies and real data analysis of this paper.
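This rule of thumb amounts to a one-liner; the function name and the default fraction are our own illustration of the 20% recommendation:

```python
import numpy as np

def initial_bandwidth(X, beta, frac=0.2):
    """Rule-of-thumb initial bandwidth: a fraction (around 20%) of the
       range of the index values X_t^T beta under an initial beta."""
    index = X @ beta
    return frac * (index.max() - index.min())
```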
We now focus on the selection of the bandwidth h_1, used in the estimation of g(·) and Φ(·). The proposed bandwidth selection is based on a k-nearest neighbours bandwidth, with k selected by cross-validation. We define the cross-validation statistic CV(k), in which ĝ^{(t−1)}(·) and Φ̂^{(t−1)}(·) are the respective estimators of g(·) and Φ(·) using a k-nearest neighbours bandwidth based on (X_l^T, Y_l^T), l = 1, …, t − 1, and M is a look-back integer parameter such that M < n − 1.
Hence, denoting the k that minimises CV(k) by k̂, we use a k̂-nearest neighbours bandwidth in the estimation of g(·) and Φ(·). The bandwidth h_2 in the estimation of Σ_x(·), or of E(X_t | X_{t−1} = u), can be selected by cross-validation in a similar way.
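The rolling one-step-ahead cross-validation idea can be sketched as follows, with a univariate Nadaraya-Watson smoother standing in for the full varying-coefficient fit; the function names and toy smoother are our own simplifications:

```python
import numpy as np

def knn_bandwidth(u_obs, u, k):
    """k-nearest-neighbour bandwidth: distance from u to its k-th nearest design point."""
    return np.sort(np.abs(u_obs - u))[k - 1]

def cv_knn(y, u_obs, ks, M):
    """Rolling one-step-ahead CV(k): for each t > M, fit using data up to t-1
       only, predict y_t, and accumulate squared errors; return the best k."""
    n = len(y)
    scores = []
    for k in ks:
        err = 0.0
        for t in range(M, n):
            h = knn_bandwidth(u_obs[:t], u_obs[t], min(k, t))
            w = 0.75 * np.maximum(1.0 - ((u_obs[:t] - u_obs[t]) / (h + 1e-12))**2, 0.0)
            yhat = (w @ y[:t]) / w.sum() if w.sum() > 0 else y[:t].mean()
            err += (y[t] - yhat)**2
        scores.append(err)
    return ks[int(np.argmin(scores))]
```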

Asymptotic properties
In this section, we present the asymptotic properties of the proposed estimators.
We first introduce notation that will be used throughout this paper. For any matrix A = (a_{ij})_{m×N}, we use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of A, respectively. The trace of A is denoted by tr(A), the Frobenius norm of A by ‖A‖_F, and the spectral norm (also called the operator norm) and element-wise norm are defined accordingly.
Theorem 1. Under assumptions (A1)-(A5), (B1)-(B4), (C1) and (C3) in Appendix A, there exist C > 0 and a small ε > 0 such that parts (I)-(IV) hold, where Z is a compact subset of the range of X_t^T β.
Remark 1. Theorem 1 shows that β̂ − β = o_P(n^{−1/2}) when p_n diverges to ∞ as n → ∞. It indicates that the index β is estimated at a rate faster than the usual rate n^{−1/2}, which is the optimal rate when p_n is fixed. This is known as a 'blessing of high dimensionality'.
The main interest of this paper is to estimate cov(Y_t | F_{t−1}). To measure the accuracy of an estimator M̂ of a matrix M of size p_n, we use the entropy loss norm proposed by James and Stein (1961). To facilitate our presentation, we focus on the convergence of ĉov(Y_{n+1} | F_n).
Theorem 2. Under assumptions (A1)-(A5), (B1)-(B4) and (C1)-(C4) in Appendix A, there exist C > 0 and ε > 0 for which the stated error bound holds with the stated probability.
Fan, Fan and Lv (2008) and Fan, Liao and Mincheva (2011) showed that an estimator of a covariance matrix based on a certain structure can achieve a faster convergence rate than the sample covariance matrix. Theorem 2 tells the same story. There are three terms in the bound: the first two tell us how the nonparametric smoothing steps in estimating Φ(·) affect the performance of ĉov(Y_{n+1} | F_n), and the third evaluates the influence of the conditional covariance matrix Σ_x(X_n). It turns out that even though q-dimensional smoothing is required, its effect is small and often negligible if p_n is large.
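The entropy loss of James and Stein (1961) is standard; in the present notation it would read as follows (our reconstruction of the elided display):

```latex
% Entropy loss for an estimator \widehat{M} of a p_n x p_n matrix M:
L\bigl(\widehat{M}, M\bigr)
 \;=\; \operatorname{tr}\!\bigl(M^{-1}\widehat{M}\bigr)
 \;-\; \log\det\!\bigl(M^{-1}\widehat{M}\bigr)
 \;-\; p_n .
```

This loss is zero exactly when M̂ = M and penalises both over- and under-estimation of the covariance in a scale-invariant way, which is why it is natural for comparing covariance estimators of growing dimension.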

Computational algorithm
To implement the proposed estimation procedure for cov(Y_t | F_{t−1}), the hardest part is to compute an estimate of β, which is equivalent to minimising the local discrepancy function under the identifiability conditions. We now introduce the proposed iterative algorithm for this minimisation. The algorithm alternates between two steps: (Step 1) with β held fixed, minimise the local discrepancy with respect to the local parameters; (Step 2) minimise with respect to β, denote the minimiser by β̃, and define β̂ = β̃/‖β̃‖ when the first component of β̃ is positive and β̂ = −β̃/‖β̃‖ otherwise.
The β̂ obtained at convergence is the final estimate of β.
The proposed iterative algorithm is easy to implement, as the minimisers in both Step 1 and Step 2 have closed forms. Once an estimate of β is obtained, the remaining computation of ĉov(Y_t | F_{t−1}) is straightforward.

Portfolio allocation
In this section, we briefly describe the construction of an estimated optimal portfolio allocation based on the proposed dynamic structure and the associated estimation procedure. Since the formula for the optimal portfolio allocation contains E(Y_t | F_{t−1}), we first introduce its estimator: taking the conditional expectation of (1.3) suggests a natural plug-in estimator, which we use. Our estimated optimal portfolio allocation builds on the mean-variance optimal portfolio of Markowitz (1952, 1959). The allocation vector w of p_n risky assets, to be held between times t − 1 and t, is defined as the solution to the mean-variance optimisation problem in which δ is the target return imposed on the portfolio. The solution ŵ is available in closed form.
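The closed form referred to here is the classical two-constraint Markowitz solution (minimise w'Σw subject to w'1 = 1 and w'μ = δ); the sketch below uses the standard Lagrangian formula, not the paper's own display:

```python
import numpy as np

def markowitz_weights(mu, Sigma, delta):
    """Closed-form minimum-variance weights subject to w'1 = 1 and w'mu = delta.
       Standard two-constraint Markowitz solution via Lagrange multipliers."""
    p = len(mu)
    ones = np.ones(p)
    Si_1 = np.linalg.solve(Sigma, ones)      # Sigma^{-1} 1
    Si_mu = np.linalg.solve(Sigma, mu)       # Sigma^{-1} mu
    a, b, c = ones @ Si_1, ones @ Si_mu, mu @ Si_mu
    lam1 = (c - b * delta) / (a * c - b**2)
    lam2 = (a * delta - b) / (a * c - b**2)
    return lam1 * Si_1 + lam2 * Si_mu
```

In the paper's procedure, Σ would be replaced by the dynamic estimator ĉov(Y_t | F_{t−1}) and μ by the plug-in estimator of E(Y_t | F_{t−1}).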

Simulation studies
In this section, we use a simulated example to show how well the proposed estimation procedure and portfolio allocation work. We use a_{i,j}(·) to denote the (i, j)th entry of Φ(·).
We generate 1000 data sets from model (1.3) together with (1.4). We repeat this for the following combinations of n and p_n: {n = 1000, p_n = 50}, {n = 1000, p_n = 100}, {n = 2000, p_n = 50} and {n = 2000, p_n = 100}. The coefficient functions are specified in terms of fixed parameters Ξ_{j,k}, which are simulated independently from a uniform distribution on [−1, 1] and held fixed throughout all simulations. For t = 1, …, n + 1, we generate X_t independently from a uniform distribution on [−1, 1]^q, Z_t from a p_n-variate standard normal distribution, and ε_t through ε_t = Σ_{0,t}^{1/2} Z_t. Once X_t and ε_t have been generated, Y_t can be generated through (1.3) for t = 1, …, n + 1.
We initially pretend that (X_{n+1}^T, Y_{n+1}^T) is unknown to us, so it is not used in the estimation of cov(Y_{n+1} | F_n). The purpose of generating the additional data point (X_{n+1}^T, Y_{n+1}^T) is to enable us to calculate the 1-period simple return in (7.1). To evaluate the performance of an estimator M̂ of a matrix M, we use the metric defined below. We also use the Sharpe ratio to evaluate the performance of ŵ, where SD{R(ŵ)} is the standard deviation of R(ŵ); we assume a zero risk-free rate for simplicity.
We first examine how well the estimation procedure works by estimating cov(Y_{n+1} | F_n). The kernel function in the estimation procedure is taken to be the Epanechnikov kernel K(u) = 0.75(1 − u²)_+, and the bandwidths are selected by the methodology described in Section 3. The results are presented in Tables 1 and 2. We now examine the performance of the proposed portfolio allocation, using a target return δ = 1%, by computing the return as described in (7.1). To see how much can be gained by exploiting the dynamic structure, we compare with portfolio allocations based on Markowitz's formula in which the covariance matrix is estimated by the sample covariance matrix and by the estimator proposed by Fan, Fan and Lv (2008). The mean, standard deviation and Sharpe ratio of the returns are presented in Table 3. In each situation considered, the Sharpe ratio of the proposed portfolio allocation is much larger than those of the other two portfolio allocations, which suggests a significant gain from making use of the dynamic structure of the covariance matrix.

Real data analysis
In this section, we apply the dynamic structure for covariance matrices to a real data set. We use the term Face (Factor model with an Adaptive-varying-coefficient-model structure Covariance matrix Estimator) to denote the proposed portfolio allocation. This name was chosen because the estimator will 'face' the markets today based on what happened yesterday and adapt according to the dynamic structure. We compare Face with the allocation based on the sample covariance matrix (denoted by Sam) and the allocation proposed by Fan, Fan and Lv (2008) (denoted by Fan). In all three cases, we use the same target return δ = 1%. We also make a comparison with the market portfolio (denoted by Market), since this serves as an important benchmark indicating whether we are in a bull or bear market. In this section, the kernel function used in the construction of Face is again the Epanechnikov kernel, and the bandwidths are selected by the method described in Section 3.
All data used can be freely downloaded from Kenneth French's website http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html and was accessed on 2nd April 2015. The response variable Y_t is chosen to be the vector of the daily returns of p_n = 49 industry portfolios (value weighted) minus the risk-free rate. The observable factors x_{1,t}, x_{2,t} and x_{3,t} are taken to be the market, size and value factors, respectively, from the Fama-French three-factor model.
The labelling, along with a brief description, of Y_t = (y_{1,t}, …, y_{49,t})^T and X_t = (x_{1,t}, x_{2,t}, x_{3,t})^T can be found in Table 4 and Table 5, respectively.
There are various advantages of using portfolio returns for y_{k,t} rather than individual stocks: we avoid having to merge different sources of data; we avoid survivorship bias (which would arise if we only picked companies that did not go bankrupt); and we reduce exposure to company-specific risk.
A further benefit is that the results we give are entirely reproducible since the data is free and presented in a spreadsheet format.
We compare the three portfolio allocations (Face, Sam and Fan), along with the market portfolio, year by year from 1995 to 2014 using a simple trading strategy. We trade on each trading day, with approximately T = 252 trading days per year. At the beginning of each year we assume an initial balance of 100 pounds; although this initial choice is arbitrary, it is a useful way of comparing performance during the course of a year. We assume no transaction costs, allow short selling, and assume that all possible portfolio allocations are attainable. Our trading strategy consists of forming a portfolio allocation ŵ at the end of each trading day and holding it until the end of the next trading day. Between day t − 1 and day t, we obtain the portfolio return R_t(ŵ), where ŵ is formed based on (X_{t−j}^T, Y_{t−j}^T), j = 1, …, n, for some look-back integer n. With the realised returns R_t(ŵ), t = 1, …, T, we can calculate the annualized Sharpe ratio, where R_{f,t} is the risk-free rate on day t. Hence, for each year, and for each of the four trading strategies, we compute an annualized Sharpe ratio and the balance at the end of the final trading day of the year. We repeat this using n = 100, 300 and 500. From the annualized Sharpe ratios presented in Figure 4 and the balances in Table 6, it is clear that Face performs significantly better than the other three.
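The annualized Sharpe ratio described above can be computed with a small helper; the annualization factor 252 matches the approximate number of trading days mentioned here, and a constant risk-free rate is our simplification (the paper uses the daily R_{f,t} series):

```python
import numpy as np

def annualized_sharpe(returns, rf=0.0, periods=252):
    """Annualized Sharpe ratio of daily portfolio returns:
       sqrt(periods) times the mean excess return over its standard deviation."""
    excess = np.asarray(returns) - rf
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)
```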
We remark that although Face, Sam and Fan are all constructed from Markowitz's formula, the difference between them lies in how the covariance matrix of returns, which appears in Markowitz's formula, is estimated. Neither Sam nor Fan takes the dynamic feature of the covariance matrix into account in its estimation, but Face does. This is the fundamental reason why Face performs significantly better than Sam and Fan. One may argue that if Sam and Fan used fewer observations in their moving window to estimate the covariance matrix, they would begin to capture the dynamic feature and thereby improve. However, when constructing Face, Sam and Fan, we tried a variety of n ranging from 100 to 500 and found that Face always performs better. This suggests that even if Sam and Fan use only the observations in a carefully chosen moving window, Face still outperforms them.
To get a tangible idea of whether the covariance matrix is dynamic, we plot the estimated intercept and the coefficients of x_{1,t}, x_{2,t} and x_{3,t}, interpreted as the impact of the factors, for each of the first four components of Y_t in Figure 3. One can see that these coefficients are dynamic rather than constant, which implies that the covariance matrix is also dynamic.
In 2009, the market performs best, but still with very little profit.

Figure 3: Estimated intercept and coefficient functions for the market, size and value factors, for the first four industry portfolios (Agriculture, Food Products, Candy & Soda, and Beer & Liquor) on the first day of trading.
Figure 4: Annualized Sharpe Ratios.

Figure 5: Performance of the four trading strategies (Face, Sam, Fan and Market) using n = 500 during 2007, 2008 and 2009 in terms of end-of-day balances, assuming an initial balance of 100 pounds at the start of each year.

Table 6: The first two columns show the year and the balance on the final trading day when investing in the market portfolio. The balances on the final trading day for Face, Sam and Fan are grouped according to n = 100 (columns 3-5), n = 300 (columns 6-8) and n = 500 (columns 9-11).
The density function f(x) of X_t is bounded away from zero and twice differentiable on X, and the joint densities of X_1 and X_k, for all k ≥ 2, are bounded.
Assumption A5. V_p − V = o(1) as p_n → ∞, for some q × q symmetric positive definite V such that λ_min(V) is bounded away from zero.
For the bandwidths h, h 1 , h 2 and the dimension p n , we require the following assumptions.

Assumption C1. (i) The bandwidth h and h
Assumption C3. The dimension p_n satisfies p_n ≤ Cn^{d/2−2−2ε} for some constants C > 0 and ε > 0.

Our aim is to estimate cov(Y_t | F_{t−1}). Fan, Fan and Lv (2008) and Fan, Liao and Mincheva (2013) showed that, by incorporating the factor structure into the covariance matrix, the resulting estimator has a better convergence rate than the usual sample covariance matrix under the norm ‖·‖_Σ. To establish the convergence rate of ĉov(Y_t | F_{t−1}) − cov(Y_t | F_{t−1}) under the norm ‖·‖_Σ, we impose the following assumption.

Assumption C4. V_{2,p} − V_2 = o(1) as p_n → ∞, for some q × q symmetric positive definite V_2 such that λ_min(V_2) is bounded away from zero.
The assumptions are standard. The strong mixing condition in Assumption (A2) can be relaxed to α(k) ≤ ck^{−β} with a large constant β. Assumptions (B1) and (B2) guarantee the existence of the 2d-th moment of ε_{k,1}. For simplicity, we do not impose the conditions that ensure the finiteness of the d-th moment of σ²_{k,1}; for more details, see Lindner (2009). Assumption (C4) requires that the factors be pervasive, that is, that they impact every individual time series. It was also imposed in Fan, Fan and Lv (2008) and Fan, Liao and Mincheva (2011).

Appendix B: Proof of Theorem 1 (I)-(III)
For ease of presentation, we give some notation, involving a small constant c_0 > 0. For a random sequence a_n, a_n = Ō_{a.s.}(b_n) for some sequence b_n means that the corresponding bound holds almost surely, where ε is defined in Assumption (C3).
To prove Theorem 1, the following lemma is useful.
Lemma B.1. Assume that Conditions (A1)-(A3) and (C3) in Appendix A hold, with d as defined in (C3). Then there exists a constant C > 0 such that the stated bound holds. The proof of Lemma B.1 can be adapted from the proof of Lemma 6.1 in Fan and Yao (2003).
Of course, some constants involved in that proof need to be modified. The following lemma gives the asymptotic representation of Γ̂(z).
Lemma B.2. Suppose that Assumptions (A1)-(A4) in Appendix A hold. Then the stated representation holds. Using a Taylor expansion, we obtain a decomposition into two terms. (I) Consider the term Ω_h(z; b). Following the proof of Theorem 5.3 in Fan and Yao (2003), there exists a large C > 0 such that the stated bound holds. (II) By direct matrix calculations, we can establish the corresponding bound. Combining (I) and (II) completes the proof.
The following lemma, Lemma B.3, gives the asymptotic relationship between β m+1 and β m , where β m is the mth step estimator based on our procedure in Section 2.
We decompose U_n into a main term and remainder terms. (a) Consider the main term U_{n1}.
Proof of Theorem 1 (II) and (III). Lemma B.2 tells us that, for each component, the stated representation holds for some constant C > 0.
(a) Consider the term Ω_{h_1}(z; b). Following the proof of Theorem 5.3 in Fan and Yao (2003), there exists a large C > 0 such that the limiting matrix is positive definite; therefore, Ω_{h_1}(z; b) is positive definite almost surely. (b) By Lemma B.1, the corresponding bound holds. Therefore, combining (a) and (b), there exists a large C > 0 such that the stated bound holds. This completes the proof of Theorem 1(II). Theorem 1(III) can be proven analogously.

Appendix C: Proof of Theorem 1 (IV)
Before proving Theorem 1(IV), we first give the convergence rate of the difference between the estimated residuals ε̂_t and the true residuals ε_t.
Lemma C.1. Suppose that Assumptions (A1)-(A5), (B1)-(B4), (C1) and (C3) in Appendix A hold. Then there exist C > 0 and a small ε > 0 such that the stated bound holds. Hence, there exists a large constant C > 0 such that the bound follows, where sup_{z∈Z}(‖g(z)‖_∞ + ‖Φ(z)‖_∞) = O(1) is used in the last terms. For any v > 0, we have a tail inequality; taking v = C(h_1² + δ_{3n}) for a large constant C > 0, it follows from parts (II) and (III) of Theorem 1 that there exists a constant C > 0 such that the stated bound holds. This completes the proof of Lemma C.1.

Now we prove Theorem 1(IV). Define the quasi log-likelihood function as in Section 2; for convenience, denote the true value of θ_k by θ_{k,0}. First, we consider the consistency of θ̂_k. Recall the observed quasi log-likelihood function, where σ²_{k,t}(θ) is defined in Section 2. Following the proof of Theorem 7.1 in Francq and Zakoïan (2009), we establish results (a1)-(a4), where (a4) states that for any θ ≠ θ_{k,0} there exists a neighbourhood U(θ) such that the lim inf condition holds. By the proof of Theorem 7.1 in Francq and Zakoïan (2009), we only need to prove (a1).
We have the relationship σ²_{k,t} = c_{k,t} + B σ²_{k,t−1}. Condition (B2) and the compactness of Λ imply that ρ = sup_{θ∈Λ} ρ(B) < 1, where ρ(B) denotes the spectral radius of B. Furthermore, σ²_{k,t} can be expanded recursively. Let σ̃²_{k,t}(θ) be the vector obtained by replacing σ²_{k,t−i}(θ) by its initialised version in σ²_{k,t}(θ), and let c̃_{k,t} be the vector obtained by replacing ε²_{k,t−i} by r²_{k,t−i} and replacing r²_{k,1}, …, r²_{k,2−m} by the initial values. As a result, for t ≥ m + 1, we obtain a geometric bound for some constant C > 0, and hence a uniform bound over θ ∈ Λ, where θ* lies between θ̂_k and θ_{k,0}. Suppose we have shown that there exist two positive constants as in (A.4) and (A.5), where C_2 is defined in (A.5). Then, for each x > 0, the corresponding tail inequality holds, and the proof of Theorem 1(IV) follows immediately from (A.4) and (A.5). To establish (A.4) and (A.5), it suffices to prove the following five parts, (b1)-(b5), each asserting the existence of a constant C > 0 such that the corresponding bound holds. It is not hard to see that (A.4) follows from (b1) and (b2), and (A.5) follows from (b3)-(b5).
We now prove them separately.
(ii) Let K_{h_2,t}(u) = K_{h_2}(X_{t−1} − u) and let φ(X_t) be a function bounded uniformly over X_t ∈ X.
By following the proof of Theorem 5.3 in Fan and Yao (2003), we can see that there exists a large C > 0 such that the uniform bound over u ∈ X holds. By setting φ(X_t) = 1, X_j, and X_j X_k (j, k = 1, …, q), part (ii) follows.
with the β in the kernel function replaced by b. First, randomly choose an initial estimate for β, denoted by β̂, such that ‖β̂‖ = 1 and the first component of β̂ is positive. Then iterate between the following two steps until convergence. (Step 1) If this is the first iteration, let β_0 = β̂; otherwise, set β_0 equal to the β̂ obtained from Step 2 of the previous iteration. Minimise the local discrepancy with respect to g_1, ξ_1, A_1, B_1, …, g_{n−1}, ξ_{n−1}, A_{n−1}, B_{n−1}, and denote the minimiser by ĝ_1, ξ̂_1, Â_1, B̂_1, …, ĝ_{n−1}, ξ̂_{n−1}, Â_{n−1}, B̂_{n−1}.
To have a better idea of what the data look like, we plot the observations from 3rd January 1995 to 31st December 2014 of the three factors and the risk-free rate in Figure 1, and the first four components of Y_t, corresponding to the industrial sectors Agriculture, Food Products, Candy & Soda, and Beer & Liquor, in Figure 2. The plots show clearly that there are periods of large volatility around the 2008-2009 financial crisis. We will see that Face performs reasonably well even during that period, whilst the others do not.
It is interesting to take a closer look at the performance of the four strategies in the volatile period 2007-2009, during which the financial crisis took place. Still assuming an initial balance of 100 pounds at the start of each year, and using n = 500, we plot the balances at the end of each trading day in Figure 5. During 2007, Face, Sam and Fan all perform reasonably well, with Face slightly better; the market does not make much profit and is beaten by the other three. In 2008, Face continues to do well whilst the other three make no profit at all. In 2009, although Face does not do very well during some periods, it adapts to market changes quickly and almost breaks even. The reason Face can adapt to market changes quickly is that it takes into account the dynamic feature of the covariance matrix of returns. On the other hand, both Sam and Fan do very poorly; in fact, they have lost almost all their money by the end of the year.

Figure 1: Returns of the three factors and the risk-free rate R_f.

Figure 2: Returns of the first four components of Y_t.

Figure 4: Performance of the four trading strategies (Face, Sam, Fan and Market) in terms of the annualized Sharpe ratio, using sample sizes n = 100, n = 300 and n = 500.

Figure 5: Trading strategies during the financial crisis.

Assumption A3. (i) The kernel function K(z) is a symmetric density function which is bounded with bounded support and satisfies the Lipschitz condition; (ii) the density function f_b(z) of X^T b is twice differentiable and bounded away from zero on {z = x^T b}.

Table 2 :
Mean and Standard Deviation of

Table 3 :
Means, Standard Deviations and Sharpe Ratios. We denote the proposed portfolio allocation by ŵ, the portfolio allocation formed by Markowitz's formula using the sample covariance matrix by ŵ_1, and the portfolio allocation formed by Markowitz's formula using the estimated covariance matrix from Fan, Fan and Lv (2008) by ŵ_2.

Table 5 :
Description of the Fama and French factors
Next, we consider the convergence rate of sup_{1≤k≤p_n} ‖θ̂_k − θ_{k,0}‖. The proof of this part is based on a standard Taylor expansion of Q_{k,n}(θ) at θ_{k,0}. Since θ̂_k converges to θ_{k,0}, which lies in the interior of the parameter space, the score equation holds at θ̂_k. Note that {ε_{k,t}, t ≤ n} are strictly stationary and α-mixing with geometric rate (see also Lindner 2009). It follows from Theorem 2(ii) of Liu, Xiao and Wu (2013) that there exist positive constants C_1, C_2 and C_3 such that, for all x > 0, the stated exponential inequality holds. Hence, taking x = Cδ_{2n} for a large constant C > 0, we obtain a bound of order C_1 n^{1−d/2} p_n C^d (log n)^{d/2} + C_2 p_n exp(−C_3 C² log n). It then follows that, for i = 1, …, m + s + 1, the i-th component of the difference satisfies the stated bound; combined with the moment condition (finiteness of the d-th moments) and the geometric decay ρ^t, this yields the bound for ĉov(Y_{n+1} | F_n).