Shrinkage Estimation for Multivariate Hidden Markov Models

ABSTRACT Motivated by a market environment that changes over time, we consider high-dimensional data, such as financial returns, generated by a hidden Markov model that allows for switching between different regimes or states. To obtain more stable estimates of the covariance matrices of the different states, which may be driven by a number of observations that is small compared to the dimension, we modify the expectation-maximization (EM) algorithm so that it yields shrinkage estimators for the covariance matrices. The final algorithm turns out to produce better estimates not only for the covariance matrices but also for the transition matrix. It results in a more stable and reliable filter that allows for reconstructing the values of the hidden Markov chain. In addition to a simulation study performed in this article, we also present a series of theoretical results that include dimensionality asymptotics and provide the motivation for certain techniques used in the algorithm. Supplementary materials for this article are available online.


Introduction
Hidden Markov models (HMMs) are a popular class of models for time series that locally (within a state) behave like iid data, but repeatedly change the data-generating mechanism. The dependence in time is generated by a Markov process, usually with a finite state space, that controls the switching between states but is not directly observable. The structure of and inference for HMMs are well understood (see, e.g., Cappé, Moulines, and Rydén 2005), and there are tried and tested algorithms available, for example, the expectation-maximization (EM) algorithm for calculating model parameter estimates, or the Viterbi algorithm for filtering, which allows us to approximate the values of the hidden Markov process. However, these methods may run into problems in high dimensions, which we address in this work.
HMMs have been applied in many fields. For illustration, we pick out one particular area of application in this article: modeling high-dimensional time series from finance with a view toward portfolio analysis. A simple but widespread model for the vector of returns of all assets in a stock portfolio is based on the assumption that the data are independent random vectors with common covariance matrix Σ and, hence, volatility matrix Σ^{1/2}. If the data are assumed to be Gaussian, this would correspond to a discrete-time version of the Black-Scholes model for stock prices. However, if the market environment changes, for example, if it moves to a more volatile state, the covariance matrix changes too. This behavior can be modeled by an HMM with a finite number, say K, of states represented by the different covariance matrices Σ_k, k = 1, …, K. Additionally, the HMM is particularly interesting for analyzing financial returns as it is able to describe certain particularities ("stylized facts," see, e.g., Rydén, Terasvirta, and Asbrink 1998) such as departures from the normality assumption and the existence of dependence between the data, leading to volatility clusters. The dependence between the data is inherent to the Markovian structure of the hidden switching mechanism. Even if the observations behave locally (within a state) as iid Gaussian random vectors, the distribution is not normal, but a mixture of normals. In the following, we do not, however, assume normality of the conditional distributions of the returns given the hidden states, to allow for more flexibility in modeling the tail distribution of the data.
A prime task in fitting HMMs to data is the estimation of model parameters like the covariance matrices Σ_k, k = 1, …, K, their inverses, or the transition probability matrix of the hidden Markov chain. For high-dimensional data, which are common in portfolio management applications, this is no trivial task and requires some particular considerations. In this article, we illustrate with some applications to simulated and real data the instability of parameter estimates that work well in low-dimensional situations. For covariance matrix estimates, this does not come as a surprise, as it is well known that a large-scale sample covariance matrix is not stable and is not even guaranteed to be invertible if the dimensionality of the data is large with respect to the amount of data available (see, e.g., sec. 8.4 in Härdle and Simar 2007). This problem is even more pronounced in the context of switching regimes. Indeed, some of the states could be seldom visited, in which case the effective sample size for estimating the covariance matrix of that regime will be very small.
For HMMs, this instability of covariance estimates in particular influences the M-step of the EM algorithm, where Gaussian (pseudo-) likelihoods have to be calculated involving the inverses of the covariance matrix estimates. Indeed, if the estimate of the covariance matrix is numerically unstable, then a slight perturbation in the data will result in large changes in the inverse of this matrix (Fiecas et al. 2010). This has an unexpectedly dramatic effect on estimates of the transition matrix and on the reconstruction of the hidden process, which may break down more or less completely in high dimensions (compare Figure 2) if the standard estimation and filtering algorithms are applied in a straightforward manner.
We propose a remedy for that kind of problem by combining EM-type estimation of HMMs with stable estimates of covariance or, alternatively, precision matrices in the M-step of the algorithm. In this article, we illustrate this idea with a particular approach based on shrinkage. Note that shrinkage for dependent data has been widely investigated, for example, in the frequency domain of time series, where the technology has been successfully applied to shrinkage estimation of the spectral density matrix (Boehm and von Sachs 2008; Fiecas and Ombao 2011). We extend the work of Sancetta (2008) to the situation of multivariate hidden Markov models. Sancetta (2008) introduced a shrinkage estimator for covariance matrices based on dependent data, which extended the popular shrinkage method of Ledoit and Wolf (2004) for iid multivariate data. This shrinkage estimator for large-scale covariance matrices guarantees an estimate that is invertible, positive-definite, more numerically stable, and has smaller mean squared error than the sample covariance matrix. Implemented as part of the M-step of the EM algorithm in HMM inference, it also leads to reliable estimates of the transition matrix and to reliable filtering results.
This article is organized as follows. In Section 2, we first give the details of our model and then discuss the well-known standard EM algorithm for numerical approximation of maximum Gaussian pseudo likelihood estimates, which also serves to introduce notation. In Section 3, we discuss the relationship between a penalized likelihood approach and shrinkage. Then, we generalize the results of Sancetta (2008) to HMMs, and we describe how to modify the EM algorithm to get more stable estimates of covariance matrices, transition probabilities, and other characteristics of the model. A simulation study is carried out in Section 4, complemented by a numerical (portfolio) analysis of a real dataset in Section 5. All of the proofs are in the supplementary material for this article.

Hidden Markov Models and the EM Algorithm
In this section, we first introduce the model for the data including the basic assumptions needed. Then, we give a brief review of the common EM algorithm for numerical approximation of the maximum likelihood parameter estimates, which also serves to introduce some terminology and notation. This standard approach does not work well in high dimensions, which provides the motivation for the modifications in Section 3.

The Model
Let S_t be a finite-state Markov chain taking K different values. To simplify notation later on, we choose as the set of states {e_1, …, e_K}, where e_i is the unit vector in R^K having the ith entry equal to 1. For i, j = 1, …, K, we denote the transition probabilities P(S_t = e_j | S_{t−1} = e_i, S_{t−2}, …) = P(S_t = e_j | S_{t−1} = e_i) = a_ij. Then, our model for the data-generating process is the following:

X_t = ∑_{k=1}^K S_{t,k} (μ_k + Σ_k^{1/2} ε_t),  (1)

where the ε_t are iid with mean 0 and unit covariance matrix I_p, and S_{t,k} = 1 iff S_t = e_k, and = 0 else. At each time instant t, exactly one of the components of S_t is equal to 1 and the other components are 0. We observe X_1, …, X_T, but not the hidden state process S_1, …, S_T. We make the following assumptions, where in what follows we use ||·|| to denote the Euclidean norm of a p-dimensional vector:

(A) 1. S_t is aperiodic, irreducible, and stationary;
2. the noise process (ε_t) is independent of the hidden process (S_t);
3. Eε_{t,i}^4 ≤ κ_ε, i = 1, …, p, for some constant κ_ε < ∞, and Eε_{t,i}^8 < ∞, i = 1, …, p.

Note that the first part of (A)3 requires uniformity with respect to i even in case p → ∞ and, therefore, does not follow from the second part. For discrete Markov chains with a finite number of states, in particular for our hidden process, aperiodicity and irreducibility already imply the existence of a unique stationary version S_t with stationary distribution π = (π_1, …, π_K), where π_k = P(S_t = e_k). Moreover, from Lemma 1 of Francq and Roussignol (1997), the stochastic process S_t is also α-mixing with exponentially decreasing rate. We have included stationarity in the assumptions above for convenience, but all results also hold for hidden processes starting in an arbitrary state, as they are always asymptotically stationary.
The observed process X t defined in Equation (1) inherits stationarity as well as α-mixing with exponentially decreasing rate from the process S t , which follows from (A)1 and (A)2 as in Lemma 1 of Francq and Roussignol (1997). Both properties are needed as key ingredients to derive the consistency and asymptotic normality of the maximum likelihood estimates by a straightforward adaptation of standard arguments given, for example, by Douc, Moulines, and Rydén (2004), compare also Tadjuidje Kamgaing (2013). Additionally, the existence of the mth moment of ε t implies the existence of the mth moment of X t .
The data-generating mechanism defined in (1) can be considered as a multivariate generalization of the white noise driven by hidden Markov chains as introduced by Francq and Roussignol (1997). It can further be regarded as a multivariate version of the CHARME model introduced by Stockis, Franke, and Tadjuidje Kamgaing (2010), with constant state means and covariances. However, it differs from the approach of Yang (2000), which focused on (trend) VAR hidden Markov models as opposed to the hidden Markov volatility model we are considering. Nevertheless, we point out that Yang (2000) and Francq and Zakoïan (2001) gave conditions for the existence of a stationary solution as well as for geometric ergodicity in these more complex models. Additionally, Francq and Roussignol (1997) investigated the asymptotic behavior of maximum likelihood estimators of the model parameters. However, in high-dimensional situations, the nice asymptotic properties of the estimates do not help for realistic sample sizes, as it is not guaranteed that straightforward implementations lead to reasonable results.
We propose to apply model (1) and the following estimation algorithms to high-dimensional time series data, where one possible area of application is portfolio management in finance. At first glance, the assumption of conditional independence of the observations given the states, which follows from the independence of the ε_t and (A)2, seems to be somewhat restrictive. However, as mentioned above, Rydén, Terasvirta, and Asbrink (1998) showed that the dependence structure of the data induced by model (1) is already rich enough to cover most of the well-known stylized facts of financial time series.
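As an illustration, the regime-switching mechanism of model (1) can be simulated in a few lines. The following is a minimal sketch (all names are our own, not from the article's code), using Gaussian innovations for simplicity even though the model does not require them:

```python
import numpy as np

def simulate_hmm(T, mus, sigmas, A, rng):
    """Simulate T observations from model (1): in regime k the observation is
    X_t = mu_k + Sigma_k^{1/2} eps_t, with eps_t iid, mean 0, covariance I_p,
    and the regime chosen by a hidden Markov chain with transition matrix A."""
    K, p = len(mus), len(mus[0])
    # start the chain from the stationary distribution of A (left Perron vector)
    evals, evecs = np.linalg.eig(A.T)
    pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    pi /= pi.sum()
    roots = [np.linalg.cholesky(S) for S in sigmas]   # Sigma_k^{1/2}
    state = rng.choice(K, p=pi)
    X, states = [], []
    for _ in range(T):
        eps = rng.standard_normal(p)                  # iid noise, unit covariance
        X.append(mus[state] + roots[state] @ eps)
        states.append(state)
        state = rng.choice(K, p=A[state])             # Markov transition
    return np.array(X), np.array(states)
```

With, for example, two well-separated volatility regimes, the returned state path exhibits exactly the volatility clustering discussed above.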

The EM Algorithm
To estimate the parameters of model (1), we adopt a Gaussian pseudo maximum likelihood approach. Let θ denote the vector of all parameters determining the means μ_k and covariance matrices Σ_k, k = 1, …, K, of the different regimes, as well as the transition probabilities of the Markov chain. Let

f_k(x) = f(x; μ_k, Σ_k) = (2π)^{−p/2} |Σ_k|^{−1/2} exp(−(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k))  (2)

denote the conditional probability density of X_t given S_t = e_k if it were Gaussian. Correspondingly, let L(X | S, θ) = ∏_{t=1}^T ∏_{k=1}^K f_k(X_t)^{S_{t,k}} denote the conditional likelihood given the whole hidden sample path S = (S_1, …, S_T), and let L(S | θ) denote the probability of observing this particular path. Then, by the law of total probability,

L(X | θ) = ∑_S L(X | S, θ) L(S | θ)  (3)

is the Gaussian pseudo likelihood that we have to maximize.
In the classical context of HMMs, consistency and asymptotic normality of maximum likelihood estimates can be derived by a straightforward adaptation of standard arguments given, for example, by Douc, Moulines, and Rydén (2004) and Tadjuidje Kamgaing (2013), as the strict stationarity of the process X t , which is the main ingredient of those arguments, is here trivially guaranteed by that of the hidden process S t and the assumptions on the ε t . In this article, however, we are interested in the "p ∼ T " context, where we have to deal with finite sample problems caused by the instability of high-dimensional sample covariance matrices. These problems vanish for T → ∞ and fixed p and are not covered by standard asymptotic theory that we do not discuss here. In the next sections, we develop an estimation algorithm that is able to cope with those problems for large p and relatively small T .
The direct numerical maximization of the pseudo maximum likelihood in Equation (3) would be quite cumbersome. We instead follow the popular approach of an expectation-maximization (EM) algorithm where the hidden process S_t is interpreted as missing data. The EM algorithm can be traced back to Hartley (1958) and has been presented by Baum et al. (1970) in a general framework. Finally, it was made popular through the seminal work of Dempster, Laird, and Rubin (1977), and since then various variants of this algorithm have been proposed in the literature. The motivation for the EM algorithm is the observation that the implementation of maximum likelihood becomes much easier if an oracle uncovers the hidden variables S_1, …, S_T. In that case, we could maximize the complete likelihood L_c, that is, the likelihood given the observations X_1, …, X_T and the hidden data S_1, …, S_T. L_c frequently has a much simpler structure than the incomplete likelihood based on the observations only, allowing for fast numerical calculations or even explicit analytical formulas for the maximum likelihood estimates. For model (1), the complete log-likelihood is given by

ℓ_c(θ) = ∑_{k=1}^K ∑_{t=1}^T S_{t,k} log f_k(X_t) + ∑_{k=1}^K S_{1,k} log π_k + ∑_{t=2}^T ∑_{i,j=1}^K S_{t−1,i} S_{t,j} log a_ij.  (4)

Note that ℓ_c decomposes into K + 1 different terms depending only on μ_k, Σ_k, k = 1, …, K, respectively, on the transition probabilities of the hidden Markov chain, such that they may be maximized separately. For the means and covariance matrices, we get as explicit oracle estimates the sample mean and the covariance matrix of all observations generated from the kth regime, that is,

μ_k^o = (∑_{t=1}^T S_{t,k})^{−1} ∑_{t=1}^T S_{t,k} X_t,  (5)

Σ̃_k^o = (∑_{t=1}^T S_{t,k})^{−1} ∑_{t=1}^T S_{t,k} (X_t − μ_k^o)(X_t − μ_k^o)^T,  (6)

for ∑_{t=1}^T S_{t,k} ≠ 0. Otherwise, we set μ_k^o = 0 and Σ̃_k^o = 0 for convenience. This happens with exponentially decreasing probability and will be asymptotically negligible. In practice, however, we may run into numerical problems if we have states that are rarely visited, such that the effective sample size ∑_{t=1}^T S_{t,k} is small, though not 0.
Therefore, in the following algorithms we prefer to use the biased estimator of the variance-covariance matrix

Σ_k^o = π_k^o Σ̃_k^o = T^{−1} ∑_{t=1}^T S_{t,k} (X_t − μ_k^o)(X_t − μ_k^o)^T,  (7)

where π_k^o = T^{−1} ∑_{t=1}^T S_{t,k} is the oracle estimate of π_k = P(S_t = e_k). Here and in the following, an upper index o always designates oracle estimates for which we assume the state variables S_1, …, S_T to be known.
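Assuming an oracle has revealed the hidden path, the oracle quantities μ_k^o, Σ_k^o = π_k^o Σ̃_k^o, and π_k^o can be computed as in the following sketch (illustrative names, not the article's code):

```python
import numpy as np

def oracle_estimates(X, states, K):
    """Oracle sample means mu_k^o and the biased covariances
    Sigma_k^o = pi_k^o * tilde-Sigma_k^o (Equation (7)-style: divide by T,
    not by the regime sample size), given the revealed state path."""
    T, p = X.shape
    mus, covs, pis = [], [], []
    for k in range(K):
        mask = states == k
        n_k = mask.sum()
        pis.append(n_k / T)
        if n_k == 0:               # never-visited state: set to zero by convention
            mus.append(np.zeros(p))
            covs.append(np.zeros((p, p)))
            continue
        mu = X[mask].mean(axis=0)
        R = X[mask] - mu
        covs.append(R.T @ R / T)   # scaling by T gives pi_k^o * tilde-Sigma_k^o
        mus.append(mu)
    return mus, covs, np.array(pis)
```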
In the iterations of the EM algorithm, the complete Gaussian pseudo log-likelihood as a function of the parameter of interest θ is replaced by its conditional expectation given only the observations (X_1, …, X_T), where the conditional expectation is calculated with respect to a dummy parameter value θ′ (see, e.g., Baum et al. 1970 for more details on this issue). Neglecting constants that do not depend on the model parameters, we get

Q(θ; θ′) = E_{θ′}[ℓ_c(θ) | X_1, …, X_T].  (8)

The basic idea of the algorithm is now the following: In the E-step, we calculate Q(θ; θ′) as a function of θ, pretending that the model parameter θ′, which is used for calculating the conditional expectation, is known and coincides with the parameter estimate θ̂^(n) from the previous iteration step. In the course of this step, we derive estimates Ŝ_t of the hidden state variables, where we make use of the forward and backward variables (Rabiner 1989).
In the M-step, we replace the hidden states in the definition of Q(θ; θ′), again with θ′ = θ̂^(n), by their estimates from the E-step. Then, we maximize with respect to θ to get an update θ̂^(n+1) of the model parameter estimate.
The details of the E- and M-steps are discussed in the following sections.

E-Step
The first part of the EM algorithm provides approximations Ŝ_t of the hidden state variables S_t. The procedure is quite standard (see, e.g., Rabiner 1989) and is given here only for the sake of completeness. For readability, we write θ instead of the actually used parameter value θ̂^(n) of the previous, nth iteration. Recall that θ contains all the information about the means μ_k and covariance matrices Σ_k of the conditional distributions of X_t given S_{t,k} = 1, k = 1, …, K, as well as the transition probabilities a_ij and the corresponding stationary distribution π_1, …, π_K of the state variables. Moreover, f_k(x) = f(x; μ_k, Σ_k) given by (2) denotes the density of X_t in state k, where the parameters μ_k, Σ_k of those Gaussian densities are part of θ. We introduce the forward-backward variables as auxiliary quantities that may be calculated from θ and the data using the following recursions.

Forward variables:

α_1(k) = π_k f_k(X_1), α_{t+1}(k) = (∑_{i=1}^K α_t(i) a_ik) f_k(X_{t+1}), t = 1, …, T − 1.

Backward variables:

β_T(k) = 1, β_t(k) = ∑_{j=1}^K a_kj f_j(X_{t+1}) β_{t+1}(j), t = T − 1, …, 1.

State variables: From the forward and backward variables, given the data and the parameter θ, we get approximations of the conditional expectations of the state variables,

Ŝ_{t,k} = α_t(k) β_t(k) / ∑_{j=1}^K α_t(j) β_t(j), k = 1, …, K,

and, for t = 1, …, T − 1,

ξ_ij(t) = α_t(i) a_ij f_j(X_{t+1}) β_{t+1}(j) / ∑_{l,m=1}^K α_t(l) a_lm f_m(X_{t+1}) β_{t+1}(m).

Note that the ξ_ij(t) provide approximations of the bivariate conditional distribution of adjacent state variables given the data and parameter θ:

ξ_ij(t) ≈ P_θ(S_t = e_i, S_{t+1} = e_j | X_1, …, X_T).
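These recursions can be sketched as follows; the per-step scaling constants guard against the numerical underflow that plagues unscaled forward-backward implementations (a standard device, compare Rabiner 1989). All names are illustrative:

```python
import numpy as np

def forward_backward(logf, A, pi):
    """Scaled forward-backward recursions.
    logf: (T, K) array of log state densities log f_k(X_t); A: transition
    matrix; pi: initial distribution. Returns gamma (state posteriors, the
    hat-S_{t,k}) and xi (pairwise posteriors xi_ij(t))."""
    T, K = logf.shape
    f = np.exp(logf - logf.max(axis=1, keepdims=True))  # stabilized densities
    alpha = np.empty((T, K)); beta = np.empty((T, K)); c = np.empty(T)
    alpha[0] = pi * f[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):       # forward: alpha_t(k) = (sum_i alpha_{t-1}(i) a_ik) f_k(X_t)
        alpha[t] = (alpha[t - 1] @ A) * f[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):  # backward: beta_t(i) = sum_j a_ij f_j(X_{t+1}) beta_{t+1}(j)
        beta[t] = (A @ (f[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * A[None] * (f[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi
```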

M-Step
In the M-step, we calculate an update θ̂^(n+1) of the parameter value θ̂^(n) = θ used in the E-step. Using the state variables from the E-step, we first get updated estimates of the transition probabilities,

â_ij = ∑_{t=1}^{T−1} ξ_ij(t) / ∑_{t=1}^{T−1} Ŝ_{t,i},  (9)

and, correspondingly, of the stationary state probabilities π̂_k = T^{−1} ∑_{t=1}^T Ŝ_{t,k}, k = 1, …, K. The state means are updated as

μ̂_k = ∑_{t=1}^T Ŝ_{t,k} X_t / ∑_{t=1}^T Ŝ_{t,k},  (10)

and the estimates of the covariance matrices are updated as Σ̃_k = π̂_k^{−1} Σ̂_k, where

Σ̂_k = T^{−1} ∑_{t=1}^T Ŝ_{t,k} (X_t − μ̂_k)(X_t − μ̂_k)^T.

Finally, for the hidden path reconstructions like those shown in Figure 2, we make use of the Viterbi algorithm, described in Appendix A, to compute the optimal state sequence in a hidden Markov model from a sequence of observed outputs. It is based on the maximization of the single best state sequence and uses the dynamic programming methodology; see, for example, Rabiner (1989) for more details.
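The Viterbi reconstruction mentioned above admits a compact dynamic-programming sketch in log-space (illustrative names; the article's Appendix A may differ in details):

```python
import numpy as np

def viterbi(logf, A, pi):
    """Most probable hidden path given log state densities logf (T, K),
    transition matrix A, and initial distribution pi."""
    T, K = logf.shape
    logA = np.log(A)
    delta = np.log(pi) + logf[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA   # scores[i, j]: best path in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logf[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):        # backtrack along stored argmax pointers
        path[t - 1] = back[t, path[t]]
    return path
```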

A Modified EM Algorithm Using Shrinkage
In this section, we propose a modification of the EM algorithm where we replace the likelihood function with a penalized likelihood. With a properly chosen penalty term, the M-step will yield well-conditioned estimates of the covariance matrices. In particular, the resulting estimators will have the form of the linear shrinkage estimators of Ledoit and Wolf (2004) and Sancetta (2008), which are proven and tested estimators capable of handling the instability of large covariance matrices. Indeed, these maximum penalized likelihood estimates will be obtained using a data-adaptive regularization parameter, as pointed out by Yuan and Huang (2009) in the iid case. We first review this kind of relationship between shrinkage and penalized likelihood as the basis for further discussion. In the following, we use as a matrix norm ||·|| the (scaled) Frobenius norm: ||A||^2 = p^{−1} tr(A A^T).

Penalized Maximum Likelihood and Shrinkage
For the moment, let us consider the case K = 1 in model (1), that is, the observed data X_1, …, X_T are iid, and there are no hidden state variables. Then θ just parameterizes the common mean μ and covariance matrix Σ of the data, and the complete and incomplete log-likelihoods both coincide with ℓ(X | θ) = ∑_{t=1}^T log f(X_t; μ, Σ). Let θ_0, μ_0, Σ_0 denote the true unknown parameter values. Instead of the log-likelihood, we consider the penalized log-likelihood

ℓ_p(X | θ) = ℓ(X | θ) − (λT/2) J(θ | θ_0),  (11)

with J(θ | θ_0) = log |Σ| + α tr(Σ^{−1}), where α = α(θ_0) is, up to a factor, the expectation of the trace of the sample covariance matrix Σ̂: α = p^{−1} tr(Σ_0) = p^{−1} E_{θ_0} tr(Σ̂). We have chosen the penalty term J(θ | θ_0) according to Yuan and Huang (2009) with the special choice of target matrix α I_p, that is, the target matrix is a multiple of the p × p identity matrix. Up to a constant independent of Σ, J(θ | θ_0) can be interpreted as the entropy loss between Σ and α I_p. Yuan and Huang (2009) have shown by purely analytical arguments that the penalized log-likelihood is maximized by the sample mean X̄_T as an estimate of μ and by a weighted average of the sample covariance matrix Σ̂ and the target matrix α I_p,

Σ̂_s = (1 − W) Σ̂ + W α I_p, with W = λ/(1 + λ).  (12)

This is the kind of shrinkage estimator considered by Ledoit and Wolf (2004), who proposed to use an optimal weight W_0 = argmin_{0≤W≤1} E_{θ_0} ||Σ̂_s − Σ_0||^2. In the language of penalized maximum likelihood, this means that the regularization parameter λ is chosen adaptively to the underlying distribution as λ_0 = λ(θ_0) = W_0/(1 − W_0). Ledoit and Wolf (2004) derived an explicit formula for the optimal shrinkage weight W_0, and they discussed how to estimate the unknown quantities in the expression for W_0 to get an approximation Ŵ_0 calculated from the data. Sancetta (2008) extended their work to dependent data.
In the framework of Yuan and Huang (2009), the final shrinkage estimate of Ledoit and Wolf (2004) is a penalized maximum likelihood estimate with regularization parameter λ chosen adaptively from the data to get a small estimation error for the covariance matrix in terms of the Frobenius norm.
Note that, in the case of no shrinkage, that is, W = 0, we have Σ̂_s = Σ̂, so that the shrinkage estimate with weight W_0 is always at least as good as the sample covariance matrix. The proportionality factor α is chosen such that the expectation of the trace of Σ̂_s for all W coincides with the expectation of the trace of Σ̂; that is, the shrunken matrix is, in a certain sense, of the same size as the sample covariance matrix, but more regular. In particular, the condition number of the regularized estimator, which is the ratio of the largest to the smallest eigenvalue, is closer to one as a result of the reduction in the dispersion of the eigenvalues of the estimator.
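A minimal sketch of this shrinkage step for a given weight W (its data-driven choice is the subject of the following sections); names are illustrative:

```python
import numpy as np

def shrink_to_identity(S_hat, W):
    """Linear shrinkage of a sample covariance S_hat toward alpha * I_p with
    alpha = tr(S_hat) / p, Equation (12)-style. W in [0, 1] is the shrinkage
    weight; the trace (total variance) is preserved for every W."""
    p = S_hat.shape[0]
    alpha = np.trace(S_hat) / p                       # target scale
    return (1.0 - W) * S_hat + W * alpha * np.eye(p)  # weighted average
```

For example, shrinking diag(10, 1) with W = 0.5 keeps the trace at 11 while the condition number drops from 10 to 7.75/3.25 ≈ 2.4, illustrating the reduced eigenvalue dispersion.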

Shrinkage With Several Regimes
We now return to our hidden Markov model given by Equation (1). Here, the complete log-likelihood ℓ_c given by Equation (4) is an important feature of the EM algorithm described in Section 2.2. It is based on the assumption that an oracle provides the hidden variables S_1, …, S_T. To get more stable estimates of the covariance matrices, we modify the log-likelihood by adding regularization terms in the spirit of Section 3.1 that enforce a better condition of the covariance estimates. As ℓ_c decomposes into K + 1 summands, we may add separate penalties for each of the K covariance matrix estimates, that is, we consider as the analog to Equation (11) the penalized log-likelihood

ℓ_{p,c}(θ) = ℓ_c(θ) − (T/2) ∑_{k=1}^K λ_k J_k(θ | θ_0),  (13)

where J_k(θ | θ_0) = log |Σ_k| + α̃_k tr(Σ_k^{−1}) with α̃_k = p^{−1} tr(Σ_{0,k}). Here, θ_0 again denotes the true underlying parameter, and Σ_{0,k}, k = 1, …, K, are the corresponding covariance matrices. As in the case of one regime (K = 1) and iid data, we get from Yuan and Huang's arguments (Yuan and Huang 2009) that, analogously to (12), the estimates of the covariance matrices maximizing ℓ_{p,c} are shrunken versions of the sample covariance matrices Σ̃_k^o based on the data from the K individual regimes introduced in Equation (6), that is,

Σ̃_k^s = (1 − W_k) Σ̃_k^o + W_k α̃_k I_p, with W_k = λ_k/(1 + λ_k).

As discussed in Section 2.2, we prefer to work with Σ_k^o = π_k^o Σ̃_k^o instead of Σ̃_k^o, compare Equation (7). For the corresponding shrinkage estimator, we have to modify the scaling factor α̃_k accordingly, and, using the optimal weights W_{0,k} in the sense of Ledoit and Wolf (2004), we get as shrinkage estimates

Σ_k^s = (1 − W_{0,k}) Σ_k^o + W_{0,k} α_k I_p, with α_k = p^{−1} tr(π_k Σ_{0,k}).

As the optimal weights W_{0,k} are not known, we have to estimate them from the data. In the rest of this section, we show that, and how, the estimation method of Sancetta (2008), which generalizes the approach of Ledoit and Wolf (2004) to dependent data, can be applied to hidden Markov models.
Note that we are considering a complete penalized likelihood for later use in the EM-algorithm, that is, we are looking at the situation where an oracle uncovers the hidden variables.
First, we discuss how to replace the scaling factor α_k, which depends on the underlying distribution, by an estimate. A suitable choice is α_k^o = p^{−1} tr Σ_k^o, as the following lemma shows.

Lemma 1. Let the data be generated from model (1), either for fixed p or for p → ∞ as T increases, satisfying assumption (A). Consider the assumptions
(B1) max_{1≤i≤p} |μ_{0,k,i}| as well as ||Σ_{0,k}^{1/2}||^2 = p^{−1} tr Σ_{0,k} are uniformly bounded in p, and
(B2) π_k > 0 and tr(Σ_{0,k}) ≥ c p^r for some c > 0, 0 ≤ r ≤ 1, and all p.
Then, it follows for some 0 < β < 1 and all k = 1, …, K that the relative error of tr(Σ_k^o) as an estimate of tr(π_k Σ_{0,k}) converges to 0 if p^{1−r} = o(√T).
Note that assumptions (B1) and (B2) are automatically satisfied for fixed p and nonsingular Σ_{0,k}. In general, the second part of (B1) is satisfied if the largest eigenvalue λ_max(Σ_{0,k}) is bounded uniformly in p, though this is not a necessary condition. (B1) also includes, for example, the situation where λ_max(Σ_{0,k}) grows with a rate up to p, as long as all except finitely many of the eigenvalues are uniformly bounded in p. As tr(Σ_{0,k}) ≥ p λ_min(Σ_{0,k}), the second part of assumption (B2) is automatically satisfied if the smallest eigenvalue satisfies λ_min(Σ_{0,k}) ≥ c/p^{1−r}, that is, if it does not converge to 0 too fast with increasing dimension p. However, (B2) is weaker and even allows for singular Σ_{0,k}.
The optimal shrinkage weight W_{0,k} minimizes the mean squared error

E_{θ_0} ||(1 − W) Σ_k^o + W α_k I_p − π_k Σ_{0,k}||^2.

Assuming the existence of all necessary moments, we get an explicit formula for W_{0,k} directly from Proposition 1 of Sancetta (2008).
Lemma 2. Under model (1), for k = 1, …, K,

W_{0,k} = E ||Σ_k^o − π_k Σ_{0,k}||^2 / E ||Σ_k^o − α_k I_p||^2.  (14)

The optimal shrinkage estimate Σ_k^s, which uses the optimal weights of Lemma 2, is, however, not a feasible estimate because α_k is not known. Referring to Lemma 1, we replace it in the denominator by its oracle estimate α_k^o. This choice of an estimator of the denominator of W_{0,k} has been discussed in detail by Ledoit and Wolf (2004); it leads to the sample estimator ||Σ_k^o − α_k^o I_p||^2 of the denominator, where the expectation (risk) is just replaced by its sample analog (loss), see Equation (16) further below.
To get a direct estimate of the numerator in Equation (14), we follow Sancetta's approach. From Lemma 5 in the supplementary material on "Proofs and Technical Lemmas," we have, up to asymptotically negligible terms,

E ||Σ_k^o − π_k Σ_{0,k}||^2 ≈ p^{−1} ∑_{i,j=1}^p var(T^{−1} ∑_{t=1}^T S_{t,k} (X_{t,i} − μ_{0,k,i})(X_{t,j} − μ_{0,k,j})),

where, similar to the arguments in Propositions 7.3.3 and 7.3.4 of Brockwell and Davis (1991), replacing the estimates μ_k^o in the definition of Σ_k^o by their true values μ_{0,k} has an asymptotically negligible effect. Note that the term on the right-hand side is of order O(p/T) by the discussion preceding Lemma 5.
Setting y_{t,ij} = S_{t,k}(X_{t,i} − μ_{0,k,i})(X_{t,j} − μ_{0,k,j}), the above expression is given by

p^{−1} T^{−1} ∑_{i,j=1}^p var(T^{−1/2} ∑_{t=1}^T y_{t,ij}),

where, asymptotically, for any fixed k, var(T^{−1/2} ∑_{t=1}^T y_{t,ij}) is well known as the long-run variance (see, e.g., Hall 2005) of the time series y_{t,ij}. We note that this time series is stationary with, by Lemma 3 in the supplementary material, exponentially decaying autocovariances c_{ij,k}(t), which implies the existence of a continuous spectral density f_{ij,k}(ω). Let the corresponding sample autocovariances be

c̄_{ij,k}(t) = T^{−1} ∑_{s=1}^{T−t} (y_{s,ij} − ȳ_{ij})(y_{s+t,ij} − ȳ_{ij}), with ȳ_{ij} = T^{−1} ∑_{t=1}^T y_{t,ij}.

From Lemma 3, the variance of the sample mean T^{−1} ∑_{t=1}^T y_{t,ij} is asymptotically T^{−1} f_{ij,k}(0). Following Sancetta (2008), we approximate the spectral densities f_{ij,k}(0) = ∑_{t=−∞}^∞ c_{ij,k}(t) at frequency 0 by applying a popular nonparametric estimator, namely, via windowing in the time domain (see, e.g., Andrews 1991): for some kernel K(u) and bandwidth B_T,

f̄_{ij,k}(0) = ∑_{|t|≤T−1} K(t/B_T) c̄_{ij,k}(t).  (15)

By the same kind of arguments used in showing Lemma 5, replacing the unknown means μ_{0,k} by μ_k^o has an asymptotically negligible effect, such that we may instead consider the feasible estimates f̂_{ij,k}(0), defined as in (15) with μ_k^o replacing μ_{0,k} in the definition of the y_{t,ij}. We finally get as an oracle estimator of the optimal shrinkage weights W_{0,k}:

W_{0,k}^o = p^{−1} T^{−1} ∑_{i,j=1}^p f̂_{ij,k}(0) / ||Σ_k^o − α_k^o I_p||^2,  (16)

and the optimal shrinkage estimator is given by

Σ̂_k^s = (1 − W_{0,k}^o) Σ_k^o + W_{0,k}^o α_k^o I_p.  (17)

We are now in the position to state our main result (whose proof is in the supplementary material).
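The Sancetta-style weight estimate can be sketched as follows, with a Bartlett window standing in for the generic kernel K(u); the window choice and all names are our own assumptions, not the article's code:

```python
import numpy as np

def shrinkage_weight(Y, S_hat, alpha_hat, B):
    """Equation (16)-style oracle shrinkage weight for one regime k.
    Y: (T, p, p) array with Y[t] = S_{t,k} (X_t - mu_k)(X_t - mu_k)^T.
    Numerator: kernel-windowed long-run variance of each entry at frequency 0;
    denominator: scaled squared Frobenius distance of S_hat to alpha_hat * I."""
    T, p, _ = Y.shape
    Z = Y - Y.mean(axis=0)                       # centered entrywise series y_{t,ij}
    num = (Z * Z).sum() / T                      # sum over (i,j) of lag-0 autocovariances
    for t in range(1, min(B, T - 1) + 1):
        w = 1.0 - t / (B + 1)                    # Bartlett window K(t / B_T)
        num += 2.0 * w * (Z[:-t] * Z[t:]).sum() / T
    num /= (p * T)                               # p^{-1} T^{-1} sum_{ij} f_hat_{ij}(0)
    den = ((S_hat - alpha_hat * np.eye(p)) ** 2).sum() / p  # scaled Frobenius norm^2
    return float(np.clip(num / den, 0.0, 1.0))   # keep the weight in [0, 1]
```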

Theorem 1.
Under assumption (A) on the Markov chain S_t and the residuals ε_t, and under assumption (B1) of Lemma 1, let the time window K(u) be continuous, symmetric, nonnegative, and, for u > 0, decreasing with K(0) = 1. Assumptions (A1) and (A2), respectively, derive from classical assumptions of shrinkage theory, which would quite obviously cease to be of interest if the shrinkage targets α_k I_p were equal to the "truth" π_k Σ_{0,k}. We just recall that the target has been chosen as a constrained version of the truth in order to regularize the latter: some sort of deliberate "model misspecification" to achieve regularization. Quite naturally, as in Sancetta (2008), the larger the parameter γ (< 2) of this model misspecification, the faster p is allowed to grow with T, and thus the more important it is to use shrinkage. To avoid misunderstandings, recall that we use a scaled Frobenius norm in (A2), such that, even asymptotically, the shrinkage target has to remain different from the truth.
The above theorem states that W_{0,k}^o is asymptotically equivalent to W_{0,k}, and that the error of Σ̂_k^s is, in probability, asymptotically as small as the corresponding error of the shrinkage estimator Σ_k^s = (1 − W_{0,k}) Σ_k^o + W_{0,k} α_k I_p, which is based on the "true" optimal weights; that is, with high probability, Σ̂_k^s will be as much of an improvement over Σ_k^o as Σ_k^s improves the nonshrunken estimator. Note that we do not state an assertion about consistency of the covariance estimates, which is not necessarily of interest, and in fact not possible, if the dimensionality p is of the same order as the sample size T. Instead, we show via Theorem 1(c) what is appropriate in this general framework, namely, that shrinkage provides an improvement compared to the original estimate Σ_k^o. If, however (e.g., for fixed p), the latter is mean-square consistent (compare Lemma 1(c)), then Σ̂_k^s will inherit this property.

Shrinkage as Part of an EM Algorithm
The common covariance matrix estimators Σ̃_k^o are an essential part of the EM algorithm of Section 2.2, but they face the aforementioned numerical problems. Even in a low-dimensional setting, some of these matrices may be numerically very close to singular, or even not invertible at all, in particular if, by chance, there are only a few observations from a given regime k.
As a remedy, we use our developments of the shrinkage estimator of Section 3 to modify the EM algorithm in the following manner. First, in the definition of Q given by Equation (8), we replace the complete log-likelihood ℓ_c by its penalized version ℓ_{p,c} from Equation (13), such that now

Q(θ; θ′) = E_{θ′}[ℓ_{p,c}(θ) | X_1, …, X_T].

Then, in the E-step, we again calculate Q(θ; θ̂^(n)), where θ̂^(n) is the parameter approximation from the previous iteration. Note that the penalty term is independent of θ̂^(n) except for the factors α_k, which are linear in the corresponding covariance matrices. Using exactly the same formulas as in Section 2.2.1, the E-step again provides updates of Ŝ_t that replace S_t in the complete penalized log-likelihood.
In the M-step of Section 2.2, we used maximum likelihood estimates based on the complete log-likelihood, where the hidden S_{t,k} were replaced by their approximations from the previous E-step. Now, we use instead maximum penalized likelihood estimates with a data-adaptive regularization parameter. As the penalty term only influences the choice of the covariance matrices, the estimates given by Equations (9) and (10) of Section 2.2.2 for a_ij and μ_k remain unchanged. For estimating the covariance matrix Σ_k, we started from the rescaled sample covariance matrix (7) as the maximizer of the complete log-likelihood, with the hidden state variables S_{t,k} replaced by their conditional expectations from the E-step. In the modified version of the EM algorithm, we analogously start from the shrunken covariance matrix estimate given by Equation (17), which is motivated by maximization of the complete penalized log-likelihood including a data-adaptive choice of the optimal regularization parameter. Again, we replace the hidden state variables S_{t,k} by their conditional expectations from the E-step. Hence, we consider in the M-step the covariance matrix estimates

Σ̂_k^s = (1 − Ŵ_{0,k}) Σ̂_k + Ŵ_{0,k} α̂_k I_p, with α̂_k = p^{−1} tr Σ̂_k, k = 1, …, K,  (18)

and with Ŵ_{0,k} defined as W_{0,k}^o in Equation (16), with Ŝ_{t,k} replacing S_{t,k} everywhere in the definition. Equations (9), (10), and (18) together provide the components of the updated parameter vector θ̂^(n+1) in the modified EM algorithm.
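Within one M-step, this covariance update then amounts to the following sketch, where gamma holds the conditional expectations Ŝ_{t,k} from the E-step and W the estimated shrinkage weights (illustrative names, not the article's code):

```python
import numpy as np

def m_step_covariances(X, gamma, mus, W):
    """Equation (18)-style modified M-step covariance update: soft-assignment
    sample covariance per regime, shrunken toward (tr/p) * I with weight W[k].
    gamma: (T, K) posterior state probabilities; mus: updated state means."""
    T, p = X.shape
    K = gamma.shape[1]
    sigmas = []
    for k in range(K):
        R = X - mus[k]
        # hat-Sigma_k = T^{-1} sum_t gamma_{t,k} (X_t - mu_k)(X_t - mu_k)^T
        S_k = (gamma[:, k, None] * R).T @ R / T
        alpha_k = np.trace(S_k) / p
        sigmas.append((1 - W[k]) * S_k + W[k] * alpha_k * np.eye(p))
    return sigmas
```

For W[k] > 0, each update is a positive-definite matrix whenever the soft-assignment covariance has positive trace, which is exactly the stability gain over the unshrunken M-step.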
The modified EM algorithm turns out in simulations to provide numerically more stable covariance matrix estimates, which in turn results in considerable improvements of the estimates of the transition probabilities and of the reconstructions of the hidden state variables, too. Note, however, that we lose the monotonicity of the original numerical algorithm, for which the incomplete log-likelihood, that is, the logarithm of (3), evaluated at θ̂(n), increases in n. This feature is not uncommon for modifications of EM algorithms using penalized likelihoods; compare, for example, Green (1990). The reason for the loss of monotonicity is not the use of a penalty term or of shrinkage as such, but the use of data-adaptive regularization. If instead we fixed λ_k in Equation (13) as well as α_k in the subsequent definition of J_k, the EM algorithm would retain its monotonicity, now with regard to the penalized incomplete log-likelihood. This can easily be shown by standard arguments for EM algorithms as, for example, in Franke et al. (2011), using the fact that the penalty term of the complete log-likelihood does not depend on the hidden variables and, hence, is the same for the incomplete log-likelihood, too.

Numerical Simulations
In this section, we verify our proposed methodology using simulated data. The data are multivariate time series of dimension p = 20 or 50 generated from K = 2 states or, where stated, K = 3 states, with sample size T = 128 or 256. To circumvent the problem of identifiability of the states, the transition probability matrices were constructed to have equal entries along the diagonal and equal entries among the off-diagonal elements. This choice may seem restrictive. However, considering the sample size in relation to the dimensionality and taking into account the hidden process, one can easily see that the effective sample size for one of the states could be smaller than T/2 or T/3. This gives a flavor of how well the procedure works under challenging conditions. Any other choice of the transition probabilities would make the hidden process either less stable (with many fluctuations of the hidden process) or unbalanced. The instability may increase the difficulty in recovering the estimated hidden path, but will not affect the estimation of the covariance matrices if we increase the sample size.
We considered the model defined in Equation (1) to simulate the data; the true covariance matrices for each state are given in Appendix B. The transition probability matrices for the hidden state variables S_t have equal diagonal entries and equal off-diagonal entries, with stationary distribution π = (1/3, 1/3, 1/3) for the three-state simulation study. We investigated multiple scenarios in this study: (1) we assumed that the state variables were known and fixed (performance of the oracle), (2) we estimated all of the parameters in the model, including the hidden state variables, (3) we let the means differ between states, and (4) we simulated data from the multivariate t distribution. The oracle setting in Simulation 1 allowed us to assess our estimation of each state's covariance and precision matrix without the added challenge of estimating the hidden states. In Simulations 2-4, we estimated the states using the (modified) EM algorithm. Simulation 2 is a copy of Simulation 1, except that the hidden states must be estimated. Simulation 3 investigated the impact of different means between states, and Simulation 4 investigated the performance of the methodologies when the underlying distribution has heavy tails.
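The data-generating process described above can be sketched as follows; the helper `simulate_hmm` and the specific transition matrix values are illustrative assumptions, not taken from the article.

```python
import numpy as np

def simulate_hmm(T, A, covs, rng):
    """Simulate a hidden Markov chain and zero-mean Gaussian observations.

    Illustrative reconstruction of the simulation design in the text:
    the state covariance matrices `covs` switch according to the
    transition matrix `A`.
    """
    K, p = covs.shape[0], covs.shape[1]
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(K)  # uniform start; matches the symmetric design
    for t in range(1, T):
        s[t] = rng.choice(K, p=A[s[t - 1]])
    X = np.stack([rng.multivariate_normal(np.zeros(p), covs[s[t]])
                  for t in range(T)])
    return s, X

# Two-state example: equal diagonal entries, equal off-diagonal entries
A = np.array([[0.9, 0.1], [0.1, 0.9]])
```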
We compared the performance of our proposed method relative to the traditional EM algorithm, which uses the sample covariance matrix. We assessed the performance of the estimators by looking at the percentage relative improvement in average loss (PRIAL) given by Equation (19). The expectation is computed by averaging the corresponding statistics over 100 Monte Carlo samples. Finally, note that we have a label-switching problem, and so we cannot directly evaluate the performance of estimating each state's covariance matrix. To address this, we assessed the performance of the estimates by looking at (1) the estimates of the transition probability matrix (because it is symmetric by design and, hence, not affected by the label-switching problem) and (2) the marginal covariance matrix, given by Σ̄ = Σ_{k=1}^{K} π_k Σ_k.
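Assuming the standard definition of the PRIAL (percentage improvement of the average squared Frobenius loss relative to the benchmark estimator), a minimal computation looks like this; the helper name `prial` is ours.

```python
import numpy as np

def prial(losses_benchmark, losses_estimator):
    """Percentage relative improvement in average loss (PRIAL).

    Both arguments are sequences of squared Frobenius-norm losses,
    one entry per Monte Carlo sample (100 samples in the article's study).
    100 means the estimator is exact on average; 0 means no improvement
    over the benchmark (here, the sample covariance matrix).
    """
    avg_benchmark = np.mean(losses_benchmark)
    avg_estimator = np.mean(losses_estimator)
    return 100.0 * (avg_benchmark - avg_estimator) / avg_benchmark
```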

The Performance of the Oracle
We show the estimated PRIALs in Table 1. In all cases, we see that shrinkage improves the estimates of the covariance matrices for each state, with larger sample sizes corresponding to smaller PRIALs and higher dimensionality corresponding to larger PRIALs. The simplest setting is T = 256 and p = 20 with two states, as it yields the largest effective sample size for estimating the parameters; indeed, this is the only setting where our proposed method yielded the smallest relative improvement. The numerical instability of high-dimensional sample covariance matrices becomes a concern when we consider the precision matrix, which is necessary for evaluating the likelihood function. We inverted the shrinkage estimator and the sample covariance matrix and investigated how well these inverses perform in estimating the true precision matrix. To assess performance, we again used the PRIAL, defined as in Equation (19) with all matrices replaced by their inverses. The PRIALs for the shrinkage estimator when estimating the precision matrix showed a substantial improvement over the sample covariance matrix in all simulation settings. Finally, we point out that in many of the simulation settings the sample covariance matrices Σ̂_k were not always invertible, so that PRIALs could not be computed, whereas the shrinkage estimates were always invertible.
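The invertibility issue can be reproduced in a few lines: when the (effective) sample size is below the dimension, the sample covariance matrix is rank-deficient, while linear shrinkage toward a positive multiple of the identity is always positive definite. The specific numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 30, 50                      # fewer observations than dimensions
X = rng.standard_normal((T, p))
S = np.cov(X, rowvar=False)        # sample covariance: rank at most T - 1 < p
lam = 0.2                          # illustrative fixed shrinkage weight
Sigma_shrunk = (1 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)

rank_S = np.linalg.matrix_rank(S)  # < p, so S has no inverse
# The shrunken estimate is positive definite: its smallest eigenvalue is
# bounded below by lam * trace(S) / p > 0.
eigmin = np.linalg.eigvalsh(Sigma_shrunk).min()
```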

Performance of the Modified EM Algorithm
In practice, the states are not known and must be estimated. We show these results in Tables 2-4. In Table 2, we see that, in the two-state scenario, the two estimators have comparable performance in estimating the transition probability matrix. However, in all dimensions and sample sizes, the estimate of the marginal covariance matrix was improved when we employed shrinkage via the modified EM algorithm. As expected, we see larger benefits of our proposed method at higher dimensions (for a fixed sample size). The benefits are least pronounced in the easiest considered case of p = 20 and T = 256. The benefits of our proposed method are most pronounced when we considered three states. Just as in the oracle scenario in Simulation 1, here the classical EM algorithm did not converge because of singular estimates of the covariance matrix.
In Table 3, we show the results when the means of the two states also differ. These results are comparable to the previous setting, when the means were truly zero. Indeed, we see that in all cases our proposed method yielded improved estimates of the covariance matrix marginal over the states, and improved estimates of the transition probability matrix, too. This is most pronounced in the most difficult setting of p = 50 and T = 128.

Table: Results for simulation: Frobenius norm of the difference between the true transition probability matrix and the estimated transition probability matrix, and the PRIAL for the estimate of the marginal covariance matrix. Entries with a hyphen (-) denote that some of the sample covariance matrix estimates were singular, so that the error rate and the PRIALs could not be calculated.
In the previous scenarios, data were simulated from the multivariate normal distribution. Results whenever the data were simulated from the multivariate t distribution are shown in Table 4. In this scenario, we again see the benefits of shrinkage in estimating the transition probability matrix, and this is most pronounced in the most difficult setting of p = 50 and T = 128.
In summary, we have illustrated the benefits of our proposed method for obtaining stable estimates of the transition probability matrix and the covariance matrices under challenging scenarios. The main benefits are in the event when the dimensionality is large and when the effective sample size is small. Indeed, because our proposed method yields invertible estimates of covariance matrices, we can fit our model in cases when the classical EM algorithm would yield unstable estimates or would simply fail.

Analysis of the U.S. Portfolio Dataset
Portfolio datasets are inherently high-dimensional. The present dataset, a publicly available U.S. industry portfolio dataset (it can be downloaded at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html), consists of monthly returns from p = 30 different industry sectors taken from the NYSE, NASDAQ, and AMEX. A description of each industry sector can be obtained from the website. We investigated the performance of our proposed method on the data restricted to January 2000 through December 2011, yielding T = 144 time points. Thus, the number of parameters in the model greatly exceeds the amount of data available.
We took the log-transform of the data to induce a lighter tail in the distribution and so satisfy the moment conditions of our model. The transformed log-returns were then mean-centered. We optimized the penalized likelihood via our modified EM algorithm from 1000 random initializations and report results based on the initialization that yielded the largest penalized likelihood. We considered up to five states and, for each number of states, computed the Akaike information criterion (AIC); the model with K = 2 yielded the lowest AIC.
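The AIC comparison across candidate numbers of states can be sketched as follows; the parameter count below (transition rows, state means, symmetric covariances) is our assumption about the model's free parameters, not a formula taken from the article.

```python
def hmm_aic(loglik, K, p, include_means=True):
    """AIC = -2 * log-likelihood + 2 * (number of free parameters).

    Assumed parameter count for a K-state HMM in dimension p:
    - transition matrix: K rows, each summing to one  -> K * (K - 1)
    - state means (optional if the data are centered) -> K * p
    - symmetric covariance matrices                   -> K * p * (p + 1) / 2
    """
    n_trans = K * (K - 1)
    n_means = K * p if include_means else 0
    n_covs = K * p * (p + 1) // 2
    return -2.0 * loglik + 2 * (n_trans + n_means + n_covs)
```

With p = 30, the covariance matrices alone contribute 465 parameters per state, which makes plain the mismatch with T = 144 observations noted above.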
The estimated shrinkage weights were 0.123 for State 1 and 0.320 for State 2. Comparing the two states, State 1 represents a state of lower volatility and stronger correlations in the industry portfolios. In particular, the games and recreation industry has larger correlations with the other industries in State 1 than in State 2. Moreover, the chemicals, textiles, construction, steel, machinery, electrical equipment, automobiles, transportation equipment, metal mining, oil, and utility industries have portfolios that are highly correlated with one another; the latter three industries have stronger correlations within this block of industries in State 1 than in State 2. This block of industry portfolios is correlated with another block, namely, the industries regarding business supplies, transportation, wholesale, retail, restaurants and hotels, banking and trading, and an industry labeled as "other." The estimated transition probability matrix for this two-state model suggests that the industry portfolios are more likely to move to, and stay in, the less volatile State 1 than in State 2. In Figure 1, we can see that State 1 occurred during important events in the history of U.S. industry, such as the dot-com bubble collapse (early 2000s), 9/11, and the 2008 financial crisis. We point out that the grouping of companies into industry portfolios is a coarse classification. This aggregation into industry classifications, in addition to the data being observed monthly as opposed to daily, does not allow us to detect the precise dates at which specific companies are most affected by changes in the economy. Nevertheless, we were able to capture transitions of the U.S. industry between states of stability and of high risk in the market. We also performed the analysis without our proposed modified EM algorithm. As before, we used 1000 random initializations for the EM algorithm and used the initialization that yielded the largest log-likelihood.
A comparison of the resulting path reconstructions is shown in Figure 2. We see that, though the two results had some agreement in the estimated states, our modified EM algorithm yielded parameter estimates that would result in a more interpretable and more stable reconstruction of the underlying state sequence.
We conclude with a remark that the number of states picked up by AIC should not be taken too seriously. While AIC is well known to asymptotically choose slightly too large models, for small sample sizes that is no longer true. Here, AIC is only able to get hold of major features of the data structure neglecting finer elements (compare, e.g., Steinberg, Gasser, and Franke 1985). In our case, the sample size T = 144 is small for multivariate time series data. There likely are more than two states, but the sample size does not suffice to discover them.

Outlook
For stabilizing the EM algorithm for high-dimensional HMMs, we have used a particularly popular shrinkage approach, motivated as a maximum penalized likelihood approach with a data-adaptive regularization parameter. Our goal was to develop one simple and feasible solution for the problems encountered with the direct implementation of the standard algorithm and to obtain some theoretical basis for it. For that purpose, we modified the EM algorithm to incorporate a penalty term that results in estimates of the covariance matrices that are, from the point of view of theory and practice, well-established for estimating high-dimensional covariance matrices.
Other methods for obtaining stable estimates of the covariance matrices or, depending on the intended application, of the precision matrices could be used as well for replacing the plain empirical covariance matrices and their inverses in the standard version of the M-step. One possibility would be to incorporate additional knowledge into the covariance or precision matrices, such as sparsity or a graphical model (see Friedman, Hastie, and Tibshirani 2007; Bickel and Levina 2008; Fan, Feng, and Wu 2009; Cai, Liu, and Luo 2011, and the references therein). However, such additional knowledge is not always available or trustworthy. Recently, Ledoit and Wolf (2012) proposed nonlinear shrinkage estimation that is capable of estimating the precision matrix directly, but the computations required are much more involved.
Data-adaptive regularization is not new, and indeed there has been work on, for example, adaptive lasso techniques (e.g., Fan, Feng, and Wu 2009; Leng, Tran, and Nott 2014). The data-adaptive regularization we have developed in this work led to the loss of the monotonicity property of the EM algorithm, a common consequence of modifying EM algorithms (see, e.g., Green 1990). As suggested by one of the anonymous referees, one could potentially circumvent this problem by considering instead an iterative adaptive lasso approach such as the one proposed by Sun, Ibrahim, and Zou (2010), which uses the expectation conditional maximization (ECM) algorithm. This approach has a penalty term that changes per iteration, but it also presents new challenges such as the selection of hyperparameters, which is not done automatically as in our approach. The investigation of these alternatives is postponed to future research.
We restricted the discussion to HMMs that are special cases of more general Markov switching models, for example, Markov switching AR or GARCH processes. These more complex models that allow for additional structure due to direct dependence of the observations on each other provide more flexibility, but, of course, estimation and filtering suffers from the same problems as in the case of HMMs. However, the basic model (1) already exhibits the major challenges. A further discussion of the more complex situation of general Markov switching models would go beyond the scope of this article.