A Stochastic Approximation-Langevinized Ensemble Kalman Filter Algorithm for State Space Models with Unknown Parameters

Abstract Inference for high-dimensional, large scale and long series dynamic systems is a challenging task in modern data science. The existing algorithms, such as the particle filter or sequential importance sampler, do not scale well to the dimension of the system and the sample size of the dataset, and often suffer from the sample degeneracy issue for long series data. The recently proposed Langevinized ensemble Kalman filter (LEnKF) addresses these difficulties in a coherent way. However, it cannot be applied to the case that the dynamic system contains unknown parameters. This article proposes the so-called stochastic approximation-LEnKF for jointly estimating the states and unknown parameters of the dynamic system, where the parameters are estimated on the fly based on the state variables simulated by the LEnKF under the framework of stochastic approximation Markov chain Monte Carlo (MCMC). Under mild conditions, we prove its consistency in parameter estimation and ergodicity in state variable simulations. The proposed algorithm can be used in uncertainty quantification for long series, large scale, and high-dimensional dynamic systems. Numerical results indicate its superiority over the existing algorithms. We employ the proposed algorithm in state-space modeling of the sea surface temperature with a long short-term memory (LSTM) network, which indicates its great potential in statistical analysis of complex dynamic systems encountered in modern data science. Supplementary materials for this article are available online.


Introduction
State space models (SSMs) are ubiquitous for statistical modeling of complex dynamic systems in many fields of science and technology, such as biology, finance, and systems and control. Consider an SSM of the form
x_t = g(x_{t−1}, ϕ) + u_t,  u_t ∼ N(0, U_t(ζ_x)),
y_t = H_t(ν) x_t + η_t,  η_t ∼ N(0, V_t(ζ_y)),    (1)
where t ∈ N indexes the stages of the system, x_t ∈ R^p represents the system state at stage t, y_t ∈ R^{N_t} represents the system observations at stage t and the sample size N_t ∈ N is allowed to vary with t, and ϕ ∈ R^{d_ϕ}, ν ∈ R^{d_ν}, ζ_x ∈ R^{d_{ζ_x}}, and ζ_y ∈ R^{d_{ζ_y}} are unknown parameters which are assumed to be invariant with respect to stage t. For convenience, we let θ = (ϕ^T, ν^T, ζ_x^T, ζ_y^T)^T denote the vector of all unknown parameters of the model. Both the model error u_t ∈ R^p and the observation error η_t ∈ R^{N_t} are zero-mean Gaussian random variables, and their covariance matrices are parameterized by ζ_x and ζ_y, respectively. Note that the dimension of η_t can change with stage t, while ζ_y is invariant with respect to it. The nonlinear propagator g(·, ·): R^p × R^{d_ϕ} → R^p contains the parameter vector ϕ ∈ R^{d_ϕ}, and the linear propagator H_t(·): R^{d_ν} → R^{N_t×p} contains the parameter vector ν ∈ R^{d_ν}. The propagator H_t(ν) relates the state variable to the measured variable and yields the expected value of the observation given the state. For a linear-Gaussian system, the filtering distribution can be computed exactly by the Kalman filter; when the state dimension p is large, however, the recursion becomes computationally infeasible, as it requires storage and calculation of matrices of the state dimension. In this case, the ensemble Kalman filter (EnKF) (Evensen 1994) can be employed, which represents the filtering distribution by an ensemble of equally weighted random samples and replaces the covariance matrix by the sample covariance matrix of the ensemble. The ensemble is propagated forward according to the state evolution equation and readjusted according to a linear regression when new data arrive. Unlike the Kalman filter, the EnKF also works when the state evolution equation is nonlinear. Because of its conceptual simplicity and ease of implementation, the EnKF has been widely used in atmospheric and oceanic sciences, where the applications are usually termed data assimilation. Unfortunately, as shown in Law, Tembine, and Tempone (2016), the EnKF converges only to a mean-field filter, which provides the optimal linear estimator of the conditional mean but not the filtering distribution, except for linear systems in the large sample limit. Similar results can be found in Le Gland, Monbet, and Tran (2009), Bergou, Gratton, and Mandel (2019), and Kwiatkowski and Mandel (2015). Other than the EnKF, the particle filter (Gordon, Salmond, and Smith 1993) has also been used to infer the filtering distribution for (1). However, it becomes impractical when the state dimension is high and/or the number of stages is large; it often suffers from the sample degeneracy issue (Cappé et al. 2004) in these cases.
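To make the data-generating structure of model (1) concrete, the following minimal sketch simulates a short trajectory from an SSM of this form. The particular propagator g(x, ϕ) = ϕ tanh(x), the fixed observation matrix, and the diagonal covariances are illustrative assumptions, not choices made in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N_t, T = 5, 3, 50                       # state dimension, observations per stage, number of stages

def g(x, phi):
    # illustrative nonlinear propagator (an assumption, not the article's g)
    return phi * np.tanh(x)

phi = 0.9                                  # propagator parameter phi
H = rng.standard_normal((N_t, p))          # fixed linear observation operator H_t(nu)
U = 0.1 * np.eye(p)                        # model error covariance U_t(zeta_x)
V = 0.2 * np.eye(N_t)                      # observation error covariance V_t(zeta_y)

x = rng.standard_normal(p)                 # initial state
states, observations = [], []
for t in range(T):
    x = g(x, phi) + rng.multivariate_normal(np.zeros(p), U)    # state equation
    y = H @ x + rng.multivariate_normal(np.zeros(N_t), V)      # measurement equation
    states.append(x.copy())
    observations.append(y)
```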
Quite recently, Zhang, Song, and Liang (2021) proposed a Langevinized EnKF (LEnKF) algorithm for estimating the states of the model (1) by reformulating the EnKF under the framework of Langevin dynamics. The LEnKF inherits the forecast-analysis procedure from the EnKF and the use of mini-batch data from the stochastic gradient Langevin-type algorithms, which make it scalable with respect to both the state dimension and the sample size. Moreover, the LEnKF overcomes the sample degeneracy issue suffered by the particle filter and the ergodicity issue suffered by the EnKF. The LEnKF works well for large scale, high-dimensional and long series dynamic systems.
Toward simultaneous estimation of the states and parameters for the dynamic system (1), the maximum likelihood method (Dee and da Silva 1999; Gibson and Ninness 2005) and the sequential Bayesian method (Dee 1995; Stroud and Bengtsson 2007) have been developed. However, these methods are full likelihood-based and do not scale well for large scale and high-dimensional problems, because evaluation of the full likelihood requires a complete scan of all available data and integrating out all state variables. Recently, Aicher et al. (2019) proposed a particle buffered stochastic gradient MCMC method, where the states are simulated using the particle filter and the parameters are simulated using a stochastic gradient MCMC algorithm. Although the stochastic gradient MCMC step is scalable for big data by running with mini-batches, the particle filter step is not. Moreover, the particle filter step can suffer from the sample degeneracy issue (Cappé et al. 2004) for high-dimensional and long series dynamic systems.
By the state augmentation method (Anderson 2001; Baek et al. 2006; Gillijns and De Moor 2007), the EnKF can also be used for parameter estimation, where the state vector is augmented by the parameters and then a new EnKF is constructed for the state-augmented model. Since the EnKF itself suffers from an ergodicity issue, the state augmentation method does as well. Empirically, the state augmentation method often works reasonably well for the parameters that enter additively in the state evolution equation, but it can be problematic for the stochastic and multiplicative parameters. The stochastic parameters refer to those controlling the variance of the system, that is, ζ_x and ζ_y in (1), and the multiplicative parameters refer to those multiplied by the state variables. For the former, the state augmentation method often converges to arbitrary values that depend on the initial condition and noise realization (DelSole and Yang 2010); for the latter, the state augmentation method often leads to one realization of the model that is dynamically unstable (Yang and DelSole 2009).
Based on the ergodicity theory of the LEnKF (Zhang, Song, and Liang 2021), we propose the so-called stochastic approximation-Langevinized EnKF (SA-LEnKF) algorithm, which jointly estimates the states and parameters of the dynamic system (1) under the framework of stochastic approximation MCMC. Each iteration of the SA-LEnKF consists of two steps: (i) LEnKF-step, which simulates the state variables conditioned on the current parameter estimate; and (ii) SA-step, which updates the parameter estimate by the stochastic approximation method (Robbins and Monro 1951) based on the newly simulated state variables. Under mild conditions, we prove its consistency in parameter estimation and ergodicity in state simulations. The SA-LEnKF works well for all types of parameters: additive, multiplicative and stochastic, and can also be conveniently used for uncertainty quantification for the dynamic system (1). As an advantage inherited from the LEnKF, the SA-LEnKF can work well for long series, large scale and high-dimensional dynamic systems.
In summary, our contribution in this work is 2-fold: (i) We propose the SA-LEnKF algorithm for online learning of the dynamic system with unknown parameters, and provide theoretical guarantees for its consistency in parameter estimation and ergodicity in state simulations. SA-LEnKF works well under various scenarios of dynamic systems: high-dimensional, large-scale and long series; while the existing algorithms might work less satisfactorily under one or more of these scenarios. (ii) We provide, compared to the existing work, a more general proof for the convergence of adaptive stochastic gradient MCMC algorithm and a more explicit convergence rate for it as well. In particular, our theory relaxes the compactness condition on parameter space used in the existing work, provides detailed guidance for the settings of the learning rate sequence (for state variable simulation) and the step size sequence (for parameter estimation), and has many implications for further developments of adaptive stochastic gradient MCMC.
The rest of this article is organized as follows. Section 2 describes the SA-LEnKF algorithm and proves its convergence. Section 3 illustrates the use of the SA-LEnKF with some simulated examples. Section 4 applies the SA-LEnKF to the problem of sea surface temperature modeling with an LSTM network, where the state dimension is ultrahigh. Section 5 concludes the article with a brief discussion.

A Brief Review of LEnKF
To have a better description of the LEnKF algorithm, let us start with a Bayesian inverse problem for the linear regression model
y = Hx + η,    (2)
where y ∈ R^N, x ∈ R^p is an unknown continuous parameter vector, H ∈ R^{N×p} is a known matrix, and η ∈ R^N is a zero-mean Gaussian random error vector. Our goal for this problem is to sample from the posterior distribution π(x|y). To accommodate the case that N is extremely large, we assume that y can be partitioned into B = N/n independent and identically distributed blocks, each consisting of n observations and each with a positive definite error covariance matrix V. The assumption on V rules out the case that y contains some linearly dependent observations/variables. Let π(x) denote the prior density function of x, which is assumed to be differentiable with respect to x. Through subsampling and Langevin diffusion, the LEnKF reformulates the model (2) as a state-space model (3), where H_k ∈ R^{n×p} is a submatrix of H extracted according to the corresponding observation set y_k. At each iteration k, the state x_k evolves according to an Euler-discretized Langevin equation of the prior distribution, and the measurement varies with subsampling. To generate samples from the posterior π(x|y), the LEnKF simulates the dynamic system (3) as described in Algorithm 1.

Algorithm 1: LEnKF for Linear Inverse Problems
(i) Initialization: Initialize an ensemble {x_0^{a,1}, x_0^{a,2}, ..., x_0^{a,m}} of size m by drawing from the prior π(x), and input the value of K, where K denotes the total number of iterations.
for k = 1, 2, ..., K do
(ii) Subsampling: Draw a mini-batch sample, denoted by (y_k, H_k), of size n from the full dataset of size N. Set Q_k = ε_k I_p, R_k = 2V_k, and compute the Kalman gain matrix K_k.
(iii) Analysis: Draw v_k^i ∼ N_n(0, (n/N) R_k) for i = 1, 2, ..., m and calculate the analysis ensemble {x_k^{a,i} : i = 1, 2, ..., m}.
end
The ergodicity of Algorithm 1 has been established in Theorem 1 of Zhang, Song, and Liang (2021). It was shown that if the eigenvalues of V_k are uniformly bounded with respect to k, log π(x) is differentiable with respect to x, and the learning rate satisfies ε_k = O(k^{−α}) for some 0 < α < 1, then lim_{k→∞} W_2(μ_k, μ_*) = 0, where μ_k denotes the empirical distribution of x_k^a, μ_* = π(x|y) denotes the target posterior distribution, and W_2(·, ·) denotes the second-order Wasserstein distance. In the proof, it was derived that
x_k^a = x_{k−1}^a + (ε_k/2) Σ_k ∇_x log π(x_{k−1}^a | y) + e_k,    (6)
where Σ_k = (n/N)(I − K_k H_k) and e_k is a zero-mean Gaussian random error vector with covariance var(e_k) = ε_k Σ_k. That is, the LEnKF forms a preconditioned stochastic gradient Langevin dynamics (SGLD) algorithm (Li et al. 2016), where the preconditioner Σ_k is adapted with the mini-batch; note that ε_k Σ_k = (n/N)(I − K_k H_k) Q_k is exactly the inverse of the Fisher information matrix of the conditional posterior distribution π(x_k^a | x_{k−1}^a, y). The preconditioner rescales the parameter updates according to the geometry of the conditional distribution π(x_k^a | x_{k−1}^a, y) and generally improves the convergence of the simulation.
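The forecast-analysis mechanics behind Algorithm 1 can be sketched as follows for the linear inverse problem (2). This is a minimal sketch under stated assumptions: the Langevin forecast uses the forecast covariance Q_k = ε_k I_p, the mini-batch measurement noise covariance is taken as (n/N)R_k with R_k = 2V_k as above, and a standard perturbed-observation Kalman gain and analysis update built from these matrices are used; the exact scalings of the original algorithm are given in Zhang, Song, and Liang (2021).

```python
import numpy as np

def lenkf_step(X_a, y_k, H_k, V_k, eps_k, grad_log_prior, N, rng):
    """One forecast-analysis update of the ensemble X_a (m x p); a sketch only.

    Assumptions: Q_k = eps_k * I_p, R_k = 2 * V_k, perturbed observations with
    covariance (n/N) * R_k, and a standard Kalman gain built from these matrices.
    """
    m, p = X_a.shape
    n = len(y_k)
    Q_k = eps_k * np.eye(p)
    R_scaled = (n / N) * 2.0 * V_k
    # Kalman gain for the assumed forecast/measurement covariances
    K_k = Q_k @ H_k.T @ np.linalg.inv(H_k @ Q_k @ H_k.T + R_scaled)
    X_new = np.empty_like(X_a)
    for i in range(m):
        # forecast: one Euler-discretized Langevin move under the prior
        w = rng.multivariate_normal(np.zeros(p), Q_k)
        x_f = X_a[i] + 0.5 * eps_k * grad_log_prior(X_a[i]) + w
        # analysis: adjust the forecast toward a perturbed mini-batch observation
        v = rng.multivariate_normal(np.zeros(n), R_scaled)
        X_new[i] = x_f + K_k @ (y_k + v - H_k @ x_f)
    return X_new
```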
The LEnKF makes use of both techniques, subsampling and forecast-analysis, which make it scalable with respect to the sample size N and the state dimension p, respectively. As analyzed in Zhang, Song, and Liang (2021), for a high-dimensional problem, for which one typically sets the mini-batch size n much smaller than p, the computational complexity of the LEnKF for K iterations is O((n^2 p + mnp)K). In contrast, if (6) is directly simulated as a preconditioned SGLD algorithm, the computational complexity will be O((p^3 + np^2)mK) for mK iterations, where O(p^3) and O(np^2) represent the costs for generating e_k and computing Σ_k ∇ log π(x_{k−1}^a | y), respectively. The former requires an LU decomposition of Σ_k, which has a computational complexity of O(p^3). The LEnKF gets around this issue with the forecast-analysis procedure, making it scalable with respect to the state dimension p. It is interesting to note that the computational complexity of the LEnKF is exactly the same as that of a parallel SGLD algorithm (Welling and Teh 2011), which consists of m chains, each evolving via the equation
x_k = x_{k−1} + (ε_k/2) ∇_x log π(x_{k−1} | y) + ẽ_k,
where ∇_x log π(x_{k−1} | y) is as defined in (6), ẽ_k ∼ N(0, ε_k I_p), and each chain is updated based on the same mini-batch data y_k at each iteration k. As shown in Theorem A.2 (see also Remark 4), the LEnKF can converge faster than the parallel SGLD algorithm, since all eigenvalues of the preconditioner Σ_k are less than 1. Algorithm 1 can be extended to the dynamic system (1) based on the Bayesian formula
π(x_t | y_{1:t}) = f(y_t | x_t) π(x_t | y_{1:t−1}) / ∫ f(y_t | x_t) π(x_t | y_{1:t−1}) dx_t.
That is, to express the filtering distribution π(x_t | y_{1:t}) as a posterior distribution, the predictive distribution π(x_t | y_{1:t−1}) needs to be used as the prior of x_t. To estimate the gradient ∇ log π(x_t | y_{1:t−1}), Zhang, Song, and Liang (2021) employed an importance resampling procedure. To make the algorithm scalable for big data problems, Zhang, Song, and Liang (2021) assumed that y_t ∈ R^{N_t} can be partitioned into B_t = N_t/n_t blocks such that y_t = {ỹ_{t,1}, ..., ỹ_{t,B_t}} and each block consists of n_t observations. Let y_{t,k} ∈ R^{n_t} denote a block randomly drawn from the set {ỹ_{t,1}, ..., ỹ_{t,B_t}}; then y_{t,k} = H_{t,k} x_t + v_{t,k} holds, where H_{t,k} denotes a submatrix of H_t corresponding to the block y_{t,k}, v_{t,k} ∼ N(0, V_t), V_t is positive definite, and the v_{t,k}'s are mutually independent. Finally, they showed that the filtering distribution π(x_t | y_{1:t}) can be simulated by applying Algorithm 1 to a reformulated dynamic system, in which x̃_{t−1,k−1} denotes a sample drawn from π(x_{t−1} | x_{t,k−1}, y_{1:t−1}) at iteration k of stage t through an importance resampling procedure from X_{t−1}. Here X_{t−1} denotes the set of all samples collected at stage t−1, which approximates the filtering distribution π(x_{t−1} | y_{1:t−1}) as in a particle filter. However, due to the finiteness of X_{t−1}, the term −U_t^{−1}(x_{t,k−1} − g(x̃_{t−1,k−1})) can be biased as an estimator of ∇ log π(x_{t,k−1} | y_{1:t−1}), although the bias vanishes as |X_{t−1}| becomes large. We define the bias b_{t,k−1} as the difference between the expectations of this term under E(·) and E_{X_{t−1}}(·), where E(·) and E_{X_{t−1}}(·) denote the expectation with respect to the true density π(x_{t−1} | x_{t,k−1}, y_{1:t−1}) and its empirical estimate based on X_{t−1}, respectively. See Zhang, Song, and Liang (2021) (equation (S17)) for the detail. It is reasonable to assume that b_{t,k−1}, as a function of X_{t−1}, might be nonzero at an individual stage but has a mean of zero over all stages; that is, E_t(b_{t,k−1}) = 0, where E_t(·) denotes the expectation with respect to the stages t = 1, 2, .... More generally, we can assume b_{t,k−1} is a zero-mean short-range dependent random error. Zhang, Song, and Liang (2021) also argued that, unlike the particle filter, the LEnKF is less bothered by the sample degeneracy issue, although they both involve an importance resampling procedure at each stage. In particular, the importance resampling procedure in the LEnKF aims to draw a sample from the pool X_{t−1} for evaluating the gradient ∇_{x_t} log π(x_t | y_{1:t−1}), while that in the particle filter aims to draw a sample from the pool X_{t−1} for approximating the posterior π(x_t | y_{1:t}). As a consequence, the LEnKF is less bothered by the sample degeneracy issue, especially when the sample size of each stage is large. In summary, the LEnKF is very attractive in computation; it can work well for long series, large-scale and high-dimensional dynamic systems.
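The importance resampling step can be sketched as follows: a particle is drawn from the pool X_{t−1} with weights proportional to the Gaussian transition density N(x_{t,k−1}; g(x_{t−1}), U_t), which is one natural way to approximate a draw from π(x_{t−1} | x_{t,k−1}, y_{1:t−1}) when X_{t−1} approximates π(x_{t−1} | y_{1:t−1}); the specific weight choice here is an illustrative assumption.

```python
import numpy as np

def resample_previous_state(X_prev, x_cur, g, U_inv, rng):
    """Draw one particle from the pool X_prev (a list of p-vectors),
    approximating pi(x_{t-1} | x_{t,k-1}, y_{1:t-1}).  A sketch: weights are
    taken proportional to the transition density N(x_cur; g(x_prev), U)."""
    log_w = np.array([
        -0.5 * (x_cur - g(x_prev)) @ U_inv @ (x_cur - g(x_prev))
        for x_prev in X_prev
    ])
    log_w -= log_w.max()                 # stabilize before exponentiation
    w = np.exp(log_w)
    w /= w.sum()
    idx = rng.choice(len(X_prev), p=w)
    return X_prev[idx]
```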

The SA-LEnKF Algorithm
It is known that the stochastic approximation MCMC algorithm can be used for parameter estimation in the presence of missing data; see, for example, Gu and Kong (1998). For the dynamic system (1), the state variables can be viewed as missing data. However, a direct application of the stochastic approximation MCMC algorithm to the system (1) leads to the so-called SA-LEnKF-Offline algorithm, described as Algorithm 4 in the supplementary materials. Algorithm 4 works in a similar way to the particle buffered stochastic gradient MCMC algorithm (Aicher et al. 2019). In each sweep, the state variables x_{1:T} = {x_1, x_2, ..., x_T} are first simulated by the LEnKF based on the current estimate of θ, where T ∈ N denotes the total number of stages, and then the estimate of θ is updated in a stochastic approximation procedure based on the newly simulated state variables x_{1:T}. Obviously, such an algorithm can be inefficient when T is large. Moreover, the offline algorithm does not seem to comply with the spirit of stochastic approximation, for which the data information from all stages is not necessarily used at each iteration of the stochastic approximation procedure when θ is invariant for the dynamic system. However, how to perform online parameter estimation for the dynamic system (1) is still unclear, particularly when a stochastic gradient MCMC algorithm is used for state variable simulations.
Therefore, assuming that θ is invariant for the dynamic system (1), θ can be estimated online by solving the estimating equation (9) at each stage t, where the estimate obtained at stage t−1 is passed on to stage t as the initial value. Next, by viewing the state variable x_t as a latent variable for the dynamic system (1), we have
∇_θ log π(y_t | y_{1:t−1}, θ)
= ∫ ∇_θ log π(y_t, x_{t−1} | y_{1:t−1}, θ) π(x_{t−1} | y_{1:t}, θ) dx_{t−1}
= ∫∫ ∇_θ log [π(y_t | x_{t−1}, θ) π(x_{t−1} | y_{1:t−1})] π(x_t | y_{1:t}, θ) π(x_{t−1} | x_t, y_{1:t}, θ) dx_{t−1} dx_t
= ∫∫ ∇_θ log π(y_t | x_{t−1}, θ) π(x_t | y_{1:t}, θ) π(x_{t−1} | x_t, y_{1:t−1}, θ) dx_{t−1} dx_t,    (10)
where the first equality follows from Fisher's identity (see, e.g., Cappé, Moulines, and Ryden 2005), and π(x_{t−1} | x_t, y_{1:t}, θ) = π(x_{t−1} | x_t, y_{1:t−1}, θ), used in obtaining the last equality, follows from the Markov property of the model (1). Therefore, (9) can still be solved under the framework of stochastic approximation MCMC along with the iterations of the LEnKF. Note that the state variable x_{t−1} used in solving Equation (9) is not drawn from the filtering distribution π(x_{t−1} | y_{1:t−1}) but from the conditional filtering distribution π(x_{t−1} | x_t, y_{1:t−1}, θ). As shown in Zhang, Song, and Liang (2021), such a state variable can be drawn via an importance resampling procedure from the pool X_{t−1}. Based on the above derivation, the SA-LEnKF algorithm can be described as follows. As in Zhang, Song, and Liang (2021), we assume that at each stage t, y_t has been partitioned into B_t = N_t/n_t blocks such that y_t = {ỹ_{t,1}, ..., ỹ_{t,B_t}}, each block consists of n_t observations, y_{t,k} = H_{t,k} x_t + v_{t,k} holds for any block y_{t,k} drawn from this set, v_{t,k} ∼ N(0, V_t), V_t is positive definite, and the v_{t,k}'s are mutually independent. It is easy to derive that, for any θ, conditioned on the state x_{t−1}, y_{t,k} follows a multivariate normal distribution N_{n_t}(H_{t,k}(ν) g(x_{t−1}, ϕ), H_{t,k}(ν) U_t(ζ_x) H_{t,k}(ν)^T + V_t(ζ_y)). The pseudo-code of the SA-LEnKF algorithm is given in Algorithm 2 where, for notational simplicity, we have suppressed the parameters contained in H_{t,k}(ν), g(x_{t−1}, ϕ), U_t(ζ_x) and V_t(ζ_y). In the algorithm, we let K denote the number of iterations performed at each stage; let {ε_{t,k} : t = 1, 2, ..., k = 1, 2, ..., K} denote the learning rate sequence, which is positive and whose subsequence, for any t, satisfies the conditions given in Lemma A.2; and let {γ_{t,k} : t = 1, 2, ...; k = 1, 2, ..., K} denote the step size sequence, whose subsequence, for any t, satisfies the conditions given in Assumption 1 and Lemma A.2. In addition, we let k_0 denote the common burn-in period of each stage; that is, at each stage the LEnKF is considered as converged after k_0 iterations and the samples are collected from then on.
Regarding the parameter updating step of Algorithm 2, we have the following remark.

Remark 1. The gradient in (15) can be estimated based on Fisher's identity, which implies the following procedure (in the notation of Algorithm 2): simulate the required state variable from the corresponding conditional distribution, and denote the samples by {x_{t,k}^j : j = 1, 2, ..., m′}; the gradient is then estimated by averaging over these samples. The simulation can be done by a short run of the Metropolis-Hastings algorithm or SGLD.
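The imputation described in Remark 1 can be sketched as follows: a short random-walk Metropolis-Hastings run targets the conditional distribution of the state to be imputed (known up to a normalizing constant), and the gradient from Fisher's identity is estimated by averaging over the retained draws. The function names, the proposal scale, and the number of retained draws below are illustrative assumptions.

```python
import numpy as np

def fisher_gradient_estimate(log_cond_density, grad_theta_term, x_init,
                             n_iter=20, n_keep=5, prop_sd=0.1, rng=None):
    """Estimate E[grad_theta_term(x)] with x distributed proportionally to
    exp(log_cond_density(x)), using a short random-walk MH run (a sketch)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x_init, dtype=float)
    log_p = log_cond_density(x)
    kept = []
    for it in range(n_iter):
        x_prop = x + prop_sd * rng.standard_normal(x.shape)
        log_p_prop = log_cond_density(x_prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:    # MH acceptance step
            x, log_p = x_prop, log_p_prop
        if it >= n_iter - n_keep:                         # retain the last few draws
            kept.append(x.copy())
    return np.mean([grad_theta_term(x_s) for x_s in kept], axis=0)
```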

Convergence Analysis
The SA-LEnKF is essentially an adaptive preconditioned SGLD algorithm if we view the LEnKF as a preconditioned SGLD sampler, as explained in Section 2.1. The algorithm is said to be "adaptive" because the parameter θ changes from iteration to iteration. In what follows, we establish its consistency in parameter estimation and ergodicity in state variable simulations.
Algorithm 2: SA-LEnKF-Online for simultaneous state and parameter estimation
(i) (Initialization) Initialize an ensemble {x_{1,0}^{a,1}, x_{1,0}^{a,2}, ..., x_{1,0}^{a,m}} of size m by drawing from the prior π(x_1), set X_t = ∅ for t = 1, 2, ..., T, and set k_0 as the common burn-in period of each stage.
for t = 1, 2, ..., T do
for k = 1, 2, ..., K do
(ii) (Subsampling) Draw a mini-batch data (that is, a block), denoted by (y_{t,k}, H_{t,k}), of size n_t from the full dataset of size N_t. Set Q_{t,k} = ε_{t,k} I_p, R_t = 2V_t, and compute the Kalman gain matrix. For t = 1, the forecast is made under the prior distribution π(·) of x_1; for t > 1, it is made under the predictive distribution implied by the current parameter estimate θ_{t,k}, where θ_{t,k} denotes the estimate of θ obtained at iteration k of stage t.

Consistency of Parameter Estimation
Theorem 2.1 concerns the consistency of parameter estimation; its proof follows directly from that of Theorem A.1.
Theorem 2.1 (Convergence of θ_{t,k}). Suppose that the dynamic system (1) is stationary; the covariance matrix V_t satisfies the condition λ_l ≤ inf_t λ_min(V_t) ≤ sup_t λ_max(V_t) ≤ λ_u for some positive constants λ_l and λ_u, where λ_max(·) and λ_min(·) denote, respectively, the largest and smallest eigenvalues of a matrix; and the conditions of Theorem A.1 hold for any stage t. If we set γ_{t,k} = c_1/(c_2 + (tk)^β) for some β ∈ (0, 1] and some constants c_1 > 0 and c_2 ≥ 0, then there exist a constant, a stage t_0, and a root θ* ∈ {θ : Σ_{t=2}^{T} ∇_θ log π(y_t | y_{1:t−1}, θ) = 0} such that the bound (16) holds for any t ≥ t_0.

Theorem 2.1 provides a non-asymptotic analysis for the convergence of θ_{t,k}. Since γ_{t,k} = c_1/(c_2 + (tk)^β) decreases in both t and k, lim_{t→∞} E‖θ_{t,k} − θ*‖^2 = 0 holds for all k = 1, 2, ..., K. Theorem A.1 provides an explicit expression for the constant in (16), which helps us to understand how the learning rate sequence {ε_{t,k}} interacts with the step size sequence {γ_{t,k}} to influence the convergence of θ_{t,k}. Refer to Remark 3 for more discussions on this issue.
Theorem 2.1 requires that V_t is positive definite, which, as stated previously, rules out the case that y_t contains some linearly dependent observations. More or less, this is a technical condition, as linearly dependent observations are rare in practical settings. Most often, we assume the observations are iid, for which V_t = σ_t^2 I holds for some constant σ_t^2 > 0. The settings of the sequences {ε_{t,k}} and {γ_{t,k}} will be discussed in Remark 2. The other assumptions of Theorem A.1 are quite regular for studying the convergence of an adaptive stochastic gradient MCMC algorithm. Refer to Section A.3 of the Appendix for more explanations on them.
We note that the proof of Theorem 1 in Deng et al. (2019) essentially relies on a compactness condition on parameter space (see Proposition 1 of Deng et al. 2019), and assumes a constant learning rate. In our proof of Theorem 2.1, the compactness condition is removed and a decaying learning rate is assumed. The latter leads to the ergodicity of state variable simulations, the basic requirement toward accurate estimation of the filtering distributions. Our proof also provides guidance for the settings of the learning rate and step size sequences, see Lemma A.2 and Remark 2.

Ergodicity of State Variable Simulations
Theorem 2.2 concerns the ergodicity of state variable simulations; its proof follows directly from that of Theorem A.2.
Theorem 2.2 (Ergodicity of SA-LEnKF-Online). Let φ(x) be a test function satisfying |φ(x)| < c(1 + ‖x‖) for some constant c. Further, assume that the conditions of Theorem 2.1 and Assumption 6 (given in the Appendix) hold, and that the learning rate sequence {ε_{t,k}} is chosen to satisfy the conditions prescribed in Theorem A.2; then the state variable simulation at each stage is ergodic in the sense of Theorem A.2.

Remark 2. Assumption 6 is a quite standard assumption for establishing the ergodicity of stochastic gradient MCMC; see, for example, Chen, Ding, and Carin (2015) and Li et al. (2016). The learning rate sequence {ε_{t,k}} can typically be set in the order ε_{t,k} ∝ 1/k^α for some α ∈ (0, 1], which satisfies the conditions prescribed in Theorem 2.2. Further, by Lemma A.2, which constrains the decaying rates of the sequences {ε_{t,k}} and {ω_{t,k}}, we can set ε_{t,k} ∝ 1/k^α and ω_{t,k} ∝ 1/(tk)^β for some α, β ∈ (0, 1] with β ≤ α ≤ 2β to ensure the convergence of the SA-LEnKF algorithm. In this article, we set 0 < α = β < 1 in all examples.
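For concreteness, the schedules described in this remark can be generated as in the short sketch below; the constants match the settings used in the first simulated example and are otherwise illustrative.

```python
def learning_rate(k, c=0.01, alpha=0.6):
    # epsilon_{t,k} proportional to 1/k^alpha, e.g., 0.01 / k^0.6
    return c / k**alpha

def step_size(t, k, c=2e-4, beta=0.6):
    # gamma_{t,k} proportional to 1/(t k)^beta, decreasing in both t and k
    return c / (t * k)**beta
```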
Theorems 2.1 and 2.2 imply that the SA-LEnKF can be used for an approximate quantification of uncertainty for dynamic systems. In particular, we employ the following procedure for approximating the 95% confidence interval of φ̄_t; for simplicity, we describe the procedure for the scalar case only: (i) calculate the weighted path average φ̄_t^{(j)} for each chain j; (ii) calculate the mean and standard deviation of the φ̄_t^{(j)}'s across the m chains; and (iii) form the confidence interval as in (18). Regarding this procedure, we have a few remarks. First, it follows from (17), the central limit theorem, and Slutsky's theorem that (18) holds as m → ∞ and K → ∞, where →_d denotes convergence in distribution. The parallel simulation nature of the SA-LEnKF algorithm simplifies the asymptotic normality theory required in uncertainty quantification. Second, we ignore the dependence of the particles in deriving (18) and in estimating the standard deviation σ_t; these particles are approximately independent when X_t is large for each t. At each stage t, the chains interact with each other at the importance resampling step only via resampling from the same set X_{t−1} of particles. Our numerical results indicate that the above uncertainty quantification procedure works well and σ̂_t provides a good approximation to σ_t. Third, it follows from Theorem 1 of Song et al. (2020) that, for simplicity of computation, φ̄_t^{(j)} can also be replaced by the simple path average.
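A minimal sketch of this uncertainty quantification procedure is given below. It assumes that the per-chain path averages are learning-rate-weighted averages over the post-burn-in iterates (setting the weights to be equal gives the simple path average mentioned at the end of the paragraph), and it forms the 95% interval from the across-chain mean and standard deviation; this is one plausible implementation of the described steps, not a verbatim transcription of (17) and (18).

```python
import numpy as np

def confidence_interval(phi_samples, eps, k0, z=1.96):
    """phi_samples: array of shape (m, K) with phi evaluated at chain j,
    iteration k, for a fixed stage t.  eps: learning rates for k = 1..K.
    k0: burn-in iterations to discard.  Returns (estimate, lower, upper)."""
    w = np.asarray(eps[k0:], dtype=float)
    w = w / w.sum()
    chain_means = phi_samples[:, k0:] @ w          # (i) weighted path average per chain
    m = len(chain_means)
    center = chain_means.mean()                    # (ii) across-chain mean
    sd = chain_means.std(ddof=1)                   # ... and standard deviation
    half = z * sd / np.sqrt(m)                     # (iii) normal-approximation interval
    return center, center - half, center + half
```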

A Dynamic Linear Model with Stochastic Parameters
Consider the dynamic linear model (19), where the state evolution matrix M(α) ∈ R^{p×p} is a tri-diagonal matrix with α_1 on the main diagonal, and α_2 and α_3 on the first upper and lower sub-diagonals, respectively. In the simulation, the initial state was drawn from N(μ_0, Σ_0), where μ_0 = −2·1_p, Σ_0 = I_p, and 1_p denotes a constant vector of ones. Our goal is to jointly estimate the states x_{1:T} = (x_1, x_2, ..., x_T) and the stochastic parameter θ = {σ}.
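The following sketch constructs the tri-diagonal matrix M(α) and simulates a trajectory. Since the display of model (19) is abbreviated here, the state equation x_t = M(α)x_{t−1} + u_t with u_t ∼ N(0, σ²I_p), the identity observation operator, and the numerical values of p, T, σ, and α below are all assumptions for illustration.

```python
import numpy as np

def tridiag_M(alpha, p):
    """Tri-diagonal M(alpha): alpha[0] on the main diagonal, alpha[1] and
    alpha[2] on the first upper and lower sub-diagonals, respectively."""
    M = np.diag(np.full(p, alpha[0]))
    M += np.diag(np.full(p - 1, alpha[1]), k=1)
    M += np.diag(np.full(p - 1, alpha[2]), k=-1)
    return M

rng = np.random.default_rng(1)
p, T, sigma = 50, 200, 0.5                                   # illustrative sizes and noise level
alpha = (0.5, 0.2, 0.2)                                      # illustrative (alpha_1, alpha_2, alpha_3)
M = tridiag_M(alpha, p)
x = rng.multivariate_normal(-2.0 * np.ones(p), np.eye(p))    # x_0 ~ N(mu_0, Sigma_0)
X, Y = [], []
for t in range(T):
    x = M @ x + sigma * rng.standard_normal(p)               # assumed state equation
    y = x + rng.standard_normal(p)                           # assumed measurement equation
    X.append(x.copy())
    Y.append(y)
```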
The SA-LEnKF-Online (Algorithm 2), SA-LEnKF-Offline (Algorithm 4 in the supplementary material), and state-augmented EnKF (Algorithm 3 in the supplementary material) algorithms were applied to this example. For SA-LEnKF-Online, we set the ensemble size m = 20, the stage iteration number K = 20, the learning rate sequence ε_{t,k} = 0.01/k^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., T, and the step size sequence γ_{t,k} = 0.0002/(tk)^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., T. At each stage t, the state was estimated by averaging over the ensembles generated during the last K − k_0 iterations, and the resulting estimate was denoted by x̂_t. The accuracy of x̂_t was measured by the root mean squared error RMSE_t(x) = ‖x̂_t − x_t‖_2/√p. Similarly, at each stage, we estimated θ by averaging the estimates obtained in the last K − k_0 iterations, and denote the resulting estimate by θ̂_t. We measured the accuracy of θ̂_t by RMSE_t(θ) = ‖θ̂_t − θ‖_2/√d_θ, where d_θ denotes the dimension of θ. The final estimate of θ was given by θ̂ = (2/T) Σ_{t=T/2+1}^{T} θ̂_t. For the SA-LEnKF-Offline algorithm, we set the ensemble size m = 20, the sweep number L = 100, the stage iteration number K = 20, the state learning rate ε_{l,t,k} = 0.01/k^{0.6} for k = 1, 2, ..., K, l = 1, 2, ..., L, and t = 1, 2, ..., T, and the parameter learning rate γ_l = 0.001/l^{0.6} for l = 1, 2, ..., L. At each sweep l and each stage t, the state was estimated by averaging over the ensembles generated during the last K − k_0 iterations, and the resulting estimate was denoted by x̂_{l,t}. By default, the last sweep estimate x̂_{L,t} was used as the final estimate of x_t. The accuracy of x̂_{L,t} was measured by RMSE_t(x) = ‖x̂_{L,t} − x_t‖_2/√p. We let θ̂_l denote the estimate of θ obtained at sweep l, and measured its accuracy by RMSE_l(θ) = ‖θ̂_l − θ‖_2/√d_θ. The final estimate of θ was given by θ̂ = (2/L) Σ_{l=L/2+1}^{L} θ̂_l. Since the offline algorithm has the same settings of m and K as the online algorithm, the offline algorithm is expected to be about L times slower than the online algorithm. This setting is certainly not optimal. For many problems, the offline algorithm can be significantly accelerated by reducing the stage iteration number or the number of sweeps.
For the state-augmented EnKF, we set the ensemble size m = 20 and σ_θ = 0.01 (see Algorithm 3 in the supplementary materials for its definition), a setting under which the parameter estimate converged fastest and was most accurate for this example. At each stage, the state and parameter were estimated by averaging over the ensemble. The final estimate of θ was obtained by averaging the estimates produced in the stages T/2 + 1, T/2 + 2, ..., T. The state and parameter estimation procedures prescribed above for the three algorithms have also been used in the other examples of this article.
Let x_t^{(i)} denote the ith component of x_t. Figure 1 compares the estimates and 95% confidence bands of x_t^{(i)}, produced by the above three algorithms for one dataset simulated from (19), along with stage t for components 10, 25, and 40. The confidence bands were calculated based on (18). The comparison indicates that both SA-LEnKF-Online and SA-LEnKF-Offline work very well for this example, providing good point estimates and good uncertainty quantification as well. In contrast, the state-augmented EnKF performs less satisfactorily for this example; it fails to capture the true pattern of the observations in the early stages.

Figure 2 shows that both the SA-LEnKF-Online and SA-LEnKF-Offline algorithms significantly outperform the state-augmented EnKF algorithm in parameter and state estimation for this example. The SA-LEnKF-Online and SA-LEnKF-Offline perform similarly for this example; their parameter estimates converge to the same value and their state estimates have almost identical root mean squared errors at each stage t. As mentioned previously, the stochastic parameter is usually hard to estimate. It is exciting to see that both SA-LEnKF-Online and SA-LEnKF-Offline produced accurate estimates of σ for this example.

Figure 3 compares the coverage probabilities of the 95% confidence intervals produced by the three algorithms for the states x_{1:T}. Figure 3(a) shows the results for one dataset, where, for each stage t, the coverage probability was calculated by averaging over the coverage status of each component of x_t. Figure 3(b) shows the coverage probabilities averaged over 10 independent datasets. The coverage probabilities produced by the two SA-LEnKF algorithms are very close to the nominal level 95%, while those by the state-augmented EnKF are only about 40%.

Table 1 summarizes more numerical results produced by the three algorithms, where k_0 = K − 1 was set for the two SA-LEnKF algorithms. MRMSE(x), MCP and MRMSE(θ) denote the results for one dataset, and Ave-MRMSE(x), Ave-MCP and Ave-MRMSE(θ) denote the results averaged over 10 datasets. For the online algorithm, MRMSE(x) denotes the averaged root mean squared error of the state estimates, MCP denotes the averaged coverage probability of the states, and MRMSE(θ) denotes the averaged root mean squared error of the parameter estimates, where the average was taken over the stages t = 51, 52, ..., T. For the offline algorithm, they are defined in the same way, but only the estimates obtained in the last sweep were used. The comparison shows that for this example, SA-LEnKF-Online and SA-LEnKF-Offline produced almost the same state and parameter estimates, while the latter is much more costly in CPU time than the former. The state-augmented EnKF is poor in both state and parameter estimation, although it is very fast.

A Dynamic Nonlinear Model with Multiplicative Parameters
The Lorenz-96 model (Lorenz 1996) was developed by Edward Lorenz in 1996 to study difficult questions regarding the predictability of weather. The model is given by
dx^{(i)}/dt = (x^{(i+1)} − x^{(i−2)}) x^{(i−1)} − x^{(i)} + F,  i = 1, 2, ..., p,
where x^{(−1)} = x^{(p−1)}, x^{(0)} = x^{(p)}, and x^{(p+1)} = x^{(1)}. The parameter F is known as a forcing constant, and F = 8 is a common value known to cause chaotic behavior of the model. To generate the true states x_1, x_2, ..., x_T, we initialized x_0^{(i)} = 20 for i = 1, 2, ..., p and added to x_0^{(20)} a small perturbation of 0.1. We solved the differential equation using the fourth-order Runge-Kutta numerical method with a time interval of Δt = 0.01. Let L(·) denote the state evolution equation obtained by the Runge-Kutta numerical method. At each stage t and for each component i, we added to x_t^{(i)} a white noise, that is, a standard Gaussian random error. At each stage t, only N_t = p/2 randomly chosen state values are observable, and they are observed with additive white noise. In summary, we can write the corresponding state space model in the form of (1), where g(X_{t−1}; θ) = θ L(X_{t−1}) is the nonlinear state evolution operator, H_t is a random selection matrix, T = 100, σ = 1 is known, and θ is an unknown parameter. Figure 4 illustrates the chaotic behavior of the model with a simulated dataset. Our goal for this example is to jointly estimate the states and the parameter θ.
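The Lorenz-96 time derivative and the fourth-order Runge-Kutta propagator L(·) used to generate the true states can be sketched as follows, with F = 8 and Δt = 0.01 as described above; the state dimension p below is an illustrative choice.

```python
import numpy as np

def lorenz96(x, F=8.0):
    """Lorenz-96 derivative: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F,
    with cyclic indexing of the components."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.01, F=8.0):
    """One fourth-order Runge-Kutta step, i.e., the propagator L(x)."""
    k1 = lorenz96(x, F)
    k2 = lorenz96(x + 0.5 * dt * k1, F)
    k3 = lorenz96(x + 0.5 * dt * k2, F)
    k4 = lorenz96(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

p = 40                                    # illustrative state dimension
x = np.full(p, 20.0)
x[19] += 0.1                              # small perturbation to the 20th component
trajectory = []
for _ in range(100):
    x = rk4_step(x)
    trajectory.append(x.copy())
```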
The SA-LEnKF-Online, SA-LEnKF-Offline and state-augmented EnKF algorithms were applied to this example. For SA-LEnKF-Online, we set the ensemble size m = 50, the stage iteration number K = 20, the learning rate sequence ε_{t,k} = 0.5/k^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., T, and the step size sequence γ_{t,k} = 0.0002/(tk)^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., T. For SA-LEnKF-Offline, we set the ensemble size m = 50, the number of sweeps L = 100, the stage iteration number K = 20, the learning rate sequence ε_{l,t,k} = 0.5/k^{0.6} for k = 1, 2, ..., K and l, t = 1, 2, ..., 100, and the step size sequence γ_l = 0.001/l^{0.6} for l = 1, 2, ..., L. For the state-augmented EnKF, we set the ensemble size m = 50 and σ_θ = 0.01.

Figure 5 compares the three algorithms in state estimation for one dataset simulated above. It indicates that the two SA-LEnKF algorithms work very well in both state estimation and uncertainty quantification for the example. In contrast, the state-augmented EnKF performs less satisfactorily; it needs more stages to capture the pattern of the observations than the SA-LEnKF algorithms. Figure 6(a) shows that the two SA-LEnKF algorithms produced very accurate estimates of θ, while the estimates by the state-augmented EnKF have relatively large variations. Figure 6(b) shows that the three algorithms produced about the same accurate state estimates, but the state-augmented EnKF converges slightly more slowly. Figure 6(c) shows the state coverage probabilities calculated based on one dataset, where the coverage probability at each stage was calculated by averaging the coverage status of all components of the state variable. Figure 6(d) shows the state coverage probabilities averaged over 10 datasets, which were obtained by further averaging the coverage probabilities in plot (c) over 10 datasets. The state estimation results by the two SA-LEnKF algorithms are very impressive: their coverage probabilities are almost identical to the nominal level 95%. In contrast, the coverage probabilities by the state-augmented EnKF algorithm are much lower, only about 80% over most stages. Table 2 summarizes more numerical results by the three algorithms, where k_0 = K − 1 was set for the two SA-LEnKF algorithms. In summary, the two SA-LEnKF algorithms work extremely well for the Lorenz-96 model, while the state-augmented EnKF works less satisfactorily for it. In terms of CPU time, the SA-LEnKF-Offline is much more costly than the other two algorithms.
A Dynamic Linear Model with Multiplicative and Stochastic Parameters

The SA-LEnKF-Online, SA-LEnKF-Offline, and state-augmented EnKF algorithms were applied to this example. For SA-LEnKF-Online, we set the ensemble size m = 20, the stage iteration number K = 20, k_0 = K − 1, the learning rate ε_{t,k} = 0.01/k^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., 100, and the step size γ_{t,k} = 0.0002/(tk)^{0.6} for k = 1, 2, ..., K and t = 1, 2, ..., T. For SA-LEnKF-Offline, we set the ensemble size m = 20, the stage iteration number K = 20, k_0 = K − 1, the sweep number L = 100, the learning rate ε_{l,t,k} = 0.01/k^{0.6} for k = 1, 2, ..., K and l, t = 1, 2, ..., 100, and the step size γ_l = 0.002/l^{0.6} for l = 1, 2, ..., L. For the state-augmented EnKF algorithm, we set the ensemble size m = 20 and σ_θ = 0.01.

Figure 7 compares the three algorithms in state estimation for one dataset simulated from (19). Both SA-LEnKF-Offline and SA-LEnKF-Online work very well for the dataset, while the state-augmented EnKF fails to capture the pattern of the observations in the early stages. Compared to Figure 1, we conclude that including more parameters might adversely affect the performance of the state-augmented EnKF, but this is not the case for the two SA-LEnKF algorithms. Figure 8 compares the three algorithms in parameter estimation, where plots (a), (b), (c), and (d) show the estimates of α_1, α_2, α_3, and σ, respectively. Both SA-LEnKF algorithms work reasonably well for this example, while the state-augmented EnKF does not; the parameter estimates produced by the state-augmented EnKF can be far from their true values even with a large number of stages. Figure 9(a) shows that in parameter estimation, the two SA-LEnKF algorithms perform similarly and they both significantly outperform the state-augmented EnKF. Figure 9(b) shows that in state estimation, SA-LEnKF-Offline and SA-LEnKF-Online perform almost the same, both significantly outperforming the state-augmented EnKF. Figure 9(c) shows the coverage probabilities of the 95% confidence intervals of the state estimates, where, for each stage t, the coverage probability was calculated by averaging over the coverage status of each component of x_t. Figure 9(d) shows the coverage probabilities averaged over 10 datasets. The coverage probabilities produced by the two SA-LEnKF algorithms are very close to the nominal level 95%, while those by the state-augmented EnKF are only about 30%. Table 3 summarizes more numerical results by the three algorithms. In summary, the two SA-LEnKF algorithms work very well for this example, and they both significantly outperform the state-augmented EnKF in state and parameter estimation. Again, in terms of CPU time, the SA-LEnKF-Offline is much more costly than the other two algorithms.

Sea Surface Temperature Modeling
Consider an autoregressive model of order q, denoted by AR(q).
Let ψ_t = h_{t+1} denote the target/response vector at stage t. The LSTM network with s hidden neurons is defined by the following set of equations:
z_t = χ(W_z h_t + R_z y_{t−1} + b_z),
i_t = sigmoid(W_i h_t + R_i y_{t−1} + b_i),
f_t = sigmoid(W_f h_t + R_f y_{t−1} + b_f),
o_t = sigmoid(W_o h_t + R_o y_{t−1} + b_o),
c_t = f_t ◦ c_{t−1} + i_t ◦ z_t,
y_t = o_t ◦ χ(c_t),
where W_(·) ∈ R^{s×qd}, R_(·) ∈ R^{s×s}, and b_(·) ∈ R^s for (·) = f, i, o, z, and the activation function χ(·) applies to vectors pointwise and is commonly set to tanh(·). The activation functions sigmoid and tanh apply to vectors pointwise, and ◦ represents elementwise matrix multiplication. In terms of LSTM networks, h_t ∈ R^{qd} is called the input vector, c_t ∈ R^s is called the state vector, y_t ∈ R^s is called the output vector, and i_t, f_t, and o_t are called the input, forget and output gates, respectively. For initialization, we set y_0 = 0 and c_0 = 0. Given the output vector y_t, we can model the target output ψ_t as
ψ_t = W y_t + b + η_t,
where W ∈ R^{d×s}, b ∈ R^{d×1}, and η_t ∼ N(0, σ^2 I_d).
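A NumPy sketch of one pass through this LSTM cell is given below; the shapes follow the definitions above, with χ taken to be tanh. The dictionary-based parameter layout is an implementation convenience, not the article's notation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(h_t, y_prev, c_prev, params):
    """One LSTM update: input h_t in R^{qd}, previous output y_prev in R^s,
    previous cell state c_prev in R^s.  params holds W_g (s x qd), R_g (s x s),
    and b_g (s,) for each gate g in {z, i, f, o}."""
    z = np.tanh(params['W_z'] @ h_t + params['R_z'] @ y_prev + params['b_z'])
    i = sigmoid(params['W_i'] @ h_t + params['R_i'] @ y_prev + params['b_i'])
    f = sigmoid(params['W_f'] @ h_t + params['R_f'] @ y_prev + params['b_f'])
    o = sigmoid(params['W_o'] @ h_t + params['R_o'] @ y_prev + params['b_o'])
    c_t = f * c_prev + i * z               # cell state update
    y_t = o * np.tanh(c_t)                 # output vector
    return y_t, c_t

def predict_target(y_t, W, b):
    """Mean of the measurement equation: psi_t = W y_t + b (+ Gaussian noise)."""
    return W @ y_t + b
```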
For convenience, we group the coefficient matrices and weight vectors of the LSTM model into a single vector of dimension d_x = ds + d + 4sqd + 4s^2 + 4s. By the state-augmentation approach, with which one can significantly reduce the number of parameters to be estimated in the stochastic approximation step, we can rewrite the LSTM model as a state-space model with a linear measurement equation, given in (24). The resulting model (24) contains four parameters, namely, σ, σ_x, σ_c, and σ_y. Since σ_x, σ_c, and σ_y are all artificial parameters introduced for the purpose of state-space modeling, we fix them to small constants in the simulations. We treat only σ as the unknown parameter, which measures the variation of the observations. To make the notation consistent with other parts of the article, we denote by θ = {σ} the collection of unknown parameters in (24).
With the above formulation, the SA-LEnKF-Online algorithm was applied to train the LSTM model. In the parameter updating step, Remark 1 is followed in evaluating the gradient ∇_θ log p(ψ_t | Z_{t−1}, θ). The SA-LEnKF-Offline algorithm can also be applied to the model, but it is too costly in CPU time for such a large-scale model. In principle, the state-augmented EnKF algorithm can also be applied to train the model, for which the state-augmented dynamic system is given by (25), where H̃_t is chosen such that H̃_t Z̃_t = ξ_t holds. In the state-augmented EnKF algorithm, σ_θ^2 is pre-specified.
Sea Surface Temperature. The dataset was downloaded from http://iridl.ldeo.columbia.edu, which records the sea temperatures of 200 months since January 1960 with the latitude from 14.5S to 5.5S and the longitude from 150.5E to 159.5E. The sea surface has been gridded by every single latitude and longitude, and thus the data dimension at each month is d = 100. For this dataset, we set q = 6 and T = 200, that is, modeling the data of the first 200 months using an AR(6) LSTM model. The data were scaled into the range (−1, 1) in preprocessing and then scaled back to the original range when reporting the results. The SA-LEnKF-Online algorithm was applied to the dataset with the ensemble size m = 10, the stage iteration number K = 20, α = 0.9, the number of hidden neurons s = 400, the learning rate ε_{t,k} = 0.00005/max{8, k}^{0.85} for k = 1, ..., K and t = 1, ..., T, and the step size γ_{t,k} = 0.0000001/(tk)^{0.85} for k = 1, 2, ..., K and t = 1, 2, ..., T. In the simulation, we fixed log σ_x, log σ_c and log σ_y at −10.01, and initialized log σ at −10.01. The Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) was used in the imputation step as prescribed in Remark 1, where a Gaussian random walk proposal with a variance of 0.01 was employed, the sampler was run for 20 iterations, and the last sample was output as the imputed value. The proposed method took 9091 CPU seconds on a personal computer with 16GB RAM and a 2.9GHz Intel Core i7 processor. The results are summarized in Figures 10-12. Figure 10(a) shows the fitting error along with stages, which shows that the algorithm can converge very fast. The occasional small bumps of the fitting error curve reflect the dependence of the fitting error on the observations rather than a lack of convergence. Figure 10(b) shows the estimates of the parameter σ along with stages. Figure 11 shows the 95% confidence bands produced by the SA-LEnKF-Online algorithm at three selected spatial locations along with stages, which indicate the excellence of the SA-LEnKF-Online algorithm in data fitting and uncertainty quantification. Figure 12 compares the heat maps of the true sea temperatures and the fitted sea temperatures by the SA-LEnKF-Online algorithm at 12 selected months. It indicates again the excellence of the SA-LEnKF-Online algorithm in data fitting.
For comparison, we have applied the state-augmented EnKF (Algorithm 3 in the supplementary material) to the example with the formulation (25). In this algorithm, the covariance matrix C_t of Z̃_t needs to be calculated at each stage. Since Z̃_t has a dimension even higher than d_x (= 1,641,700), direct calculation of such a huge covariance matrix caused our computers to run out of memory. By contrast, in the SA-LEnKF-Online algorithm, the corresponding matrix Q_t is diagonal, which avoids the memory issue suffered by the state-augmented EnKF. This example demonstrates that the SA-LEnKF-Online can work well for ultrahigh-dimensional problems.

Conclusion and Future Works
This article has developed the SA-LEnKF-Online algorithm as an effective and efficient algorithm for jointly estimating the states and parameters of long series, large scale and high-dimensional dynamic systems. Under mild conditions, we prove its consistency in parameter estimation and ergodicity in state simulations. The SA-LEnKF-Online algorithm has been successfully used in uncertainty quantification for dynamic systems with additive, multiplicative, and stochastic parameters. The algorithm has also been successfully used in state space modeling of the sea surface temperature with the LSTM network, which indicates its great potential in modeling highly complex dynamic systems encountered in modern data science.
This paper considers only the case that the observation error follows a Gaussian distribution. As future work, the proposed algorithm can be easily extended to the case of non-Gaussian observations by introducing a latent variable to the system. To illustrate the idea, let us consider the inverse problem (2). For this problem, the latent variable model can be formulated as in (26), where z is the observed data following a non-Gaussian distribution ρ(·), y is the latent Gaussian random variable, and x is the parameter. To adapt the LEnKF to simulating from the posterior π(x|z), we only need to add an imputation step in Algorithm 1, between the forecast step and the analysis step. The imputation step is to simulate a latent variable y from the distribution π(y|x, z) ∝ ρ(z|y)f(y|x).
By an identity established in Song et al. (2020), we will be able to show that
∇_x log π(x|z) = ∫ ∇_x log π(x|y, z) π(y|x, z) dy = ∫ ∇_x log π(x|y) π(y|x, z) dy,
where the last equality holds because π(x|y, z) = π(x|y) under the hierarchical structure (26). That is, ∇_x log π(x|y) forms an unbiased estimator of ∇_x log π(x|z) provided that y ∼ π(y|x, z), and the imputation thus leads to an unbiased estimate of the log-posterior gradient involved in (6) even though z is not Gaussian. Therefore, the extended LEnKF is valid, and samples from the target posterior π(x|z) can be generated by it. A further extension of the algorithm to data assimilation problems is straightforward.
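To illustrate the imputation idea numerically, the sketch below estimates ∇_x log π(x|z) for a hypothetical Poisson observation model z | y ∼ Poisson(exp(y)) with y | x ∼ N(Hx, σ²I) and a flat prior on x; the latent y is imputed by a short Metropolis-Hastings run targeting π(y|x, z) ∝ ρ(z|y)f(y|x), and ∇_x log π(x|y) is evaluated at the imputed value. The Poisson likelihood, the flat prior, and the proposal settings are assumptions made only for this example.

```python
import numpy as np

def grad_x_log_posterior(x, z, H, sigma2, n_mh=20, prop_sd=0.1, rng=None):
    """Estimate grad_x log pi(x|z) by imputing y ~ pi(y|x, z) and evaluating
    grad_x log pi(x|y); assumes z|y ~ Poisson(exp(y)), y|x ~ N(Hx, sigma2*I),
    and a flat prior on x (illustrative choices)."""
    rng = rng or np.random.default_rng()

    def log_py_given_xz(y):
        # log rho(z|y) + log f(y|x), up to additive constants
        return np.sum(z * y - np.exp(y)) - np.sum((y - H @ x) ** 2) / (2.0 * sigma2)

    # imputation step: short random-walk MH run targeting pi(y|x, z)
    y = H @ x
    logp = log_py_given_xz(y)
    for _ in range(n_mh):
        y_prop = y + prop_sd * rng.standard_normal(y.shape)
        logp_prop = log_py_given_xz(y_prop)
        if np.log(rng.uniform()) < logp_prop - logp:
            y, logp = y_prop, logp_prop

    # grad_x log pi(x|y) for the Gaussian y|x model with a flat prior on x
    return H.T @ (y - H @ x) / sigma2
```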
Like the SA-LEnKF-Online algorithm, the extended SA-LEnKF-Online algorithm can be efficiently used in the analysis of many non-Gaussian dynamic systems that are of long series, large scale and high dimension. For example, it can be applied to analyze spatio-temporal count data with a dynamic Poisson model, see for example, Wikle (2002) and Katzfuss, Stroud, and Wikle (2019). It can also be applied to learn latent representations for dynamic social networks with a latent space model, for example, dynamic social network latent (DSNL) model (Sarkar and Moore 2005), dynamic latent distance (DLD) model (Sewell and Chen 2015), or the more general dynamic graph neural networks (see Skarding, Gabrys, and Musial 2021 for a recent survey).

A.1. Stochastic Approximation
The stochastic approximation algorithm (Robbins and Monro 1951) is the prototype of many adaptive algorithms. It aims to solve an expectation equation of the form
h(θ) = ∫_X H(θ, x) f_θ(x) dx = 0,
where f_θ(x) denotes a probability density function parameterized by θ, and H(θ, x) and h(θ) are called the random-field and mean-field functions, respectively. As a special case of the stochastic approximation algorithm, the stochastic approximation MCMC algorithm works by iterating between the following two steps: (i) simulate x_{k+1} from the transition kernel T_{θ_k}(x_k, ·), which admits f_{θ_k}(x) as the invariant distribution, where θ_k denotes a working estimate of θ obtained at iteration k and x_k denotes the sample drawn at iteration k; and (ii) update the working estimate by setting θ_{k+1} = θ_k + ω_{k+1}(H(θ_k, x_{k+1}) + ρ(θ_k, x_{k+1})), where {ω_k} denotes the step size sequence and ρ(·, ·) denotes a possible higher-order bias term.
In particular, the algorithm samples x from the transition kernel T_{θ_k}(·, ·) instead of sampling from the distribution f_{θ_k}(·) exactly, which leads to a Markov state-dependent noise H(θ_k, x_{k+1}) − h(θ_k). In addition, it allows for a higher-order bias term ρ(·, ·) in parameter updating. The limiting value of θ_k will not be affected by the bias term, provided that the bias term satisfies certain conditions, for example, ρ(θ, x) is bounded. See Benveniste, Métivier, and Priouret (1990) for more discussions on this issue.
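A generic sketch of this two-step iteration is given below, with a single random-walk Metropolis-Hastings move playing the role of the transition kernel T_{θ_k}(·, ·); the function names and the step size schedule are illustrative.

```python
import numpy as np

def sa_mcmc(H, log_f, theta0, x0, n_iter=1000, prop_sd=0.1, omega0=0.1, beta=0.6, rng=None):
    """Generic stochastic approximation MCMC (a sketch).
    H(theta, x): random-field function; log_f(x, theta): log density of f_theta."""
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        # (i) one Metropolis-Hastings move leaving f_theta invariant
        x_prop = x + prop_sd * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < log_f(x_prop, theta) - log_f(x, theta):
            x = x_prop
        # (ii) stochastic approximation update with a decreasing step size
        omega_k = omega0 / k**beta
        theta = theta + omega_k * H(theta, x)
    return theta, x
```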

A.2. An Adaptive LEnKF Algorithm
By the proof of Theorem 1 in Zhang, Song, and Liang (2021), the LEnKF algorithm forms a type of preconditioned SGLD algorithm. Therefore, the SA-LEnKF algorithm can be written in the general form (A2), where ε_k denotes the learning rate, e_k is a zero-mean Gaussian random error with covariance var(e_k) = ε_k Σ_k, ∇_x l̂(θ_{k−1}, x_{k−1}) denotes an unbiased estimate of ∇_x l(θ_{k−1}, x_{k−1}) = ∇_x log π(x_{k−1} | y_k, θ_{k−1}), and, by Equation (10), the parameter update term can be viewed as an unbiased estimator of ∫ ∇_θ log π(y_k | x_{k−1}, θ_{k−1}) π(x_{k−1} | x_k, y_{1:k−1}, θ_{k−1}) dx_{k−1}. Convergence of adaptive stochastic gradient MCMC algorithms has been studied in a number of papers; see, for example, Deng et al. (2019), Deng, Lin, and Liang (2020), and Kim, Song, and Liang (2022).
For the LEnKF algorithm, as shown in Zhang, Song, and Liang (2021), the preconditioner Σ_k can be expressed in closed form, which implies that Σ_k has bounded eigenvalues for all k ≥ 1 as long as V_k is positive definite and has bounded eigenvalues for all k ≥ 1. Moreover, 0 < λ_min(Σ_k) ≤ λ_max(Σ_k) < 1 for all k ≥ 1, where λ_min(·) and λ_max(·) denote, respectively, the smallest and largest eigenvalues of a matrix. Since Σ_k can be a function of θ_{k−1}, we define L(θ_{k−1}, ·) accordingly.

A.3. Convergence of θ k for Adaptive LEnKF
Notation: We use E x [u(θ, x)] to denote the expectation of u(θ, x) with respect to the conditional distribution f (x|θ ), and use E[u(·)] to denote the expectation with respect to the joint distribution of all the variables involved in the integrand u(·).
This section studies the convergence of the adaptive LEnKF algorithm prescribed by (A2). We made the following assumptions.
Assumption 1. The step size sequence {ω_k}_{k∈N} is a positive, decreasing sequence of real numbers that converges to zero and satisfies Σ_k ω_k = ∞. There exist δ > 0 and a stationary point θ* such that ⟨θ − θ*, h(θ)⟩ ≤ −δ‖θ − θ*‖^2 for any θ ∈ Θ, where ‖·‖ denotes the default L_2 norm.
A practical choice of {ω_k} satisfying Assumption 1 is ω_k = Ã/(k^β + B̃) for constants Ã > 0, B̃ ≥ 0, and a suitably chosen exponent β, where a ∧ b = min(a, b). As implied by (15) and (A1), δ increases linearly with the sample size N. Therefore, it is easy to have 2δÃ > 1 for a moderate value of Ã and a large value of N, and we just choose β ∈ (0, 1] in this article. See Benveniste, Métivier, and Priouret (1990, p. 244) for a generalized version of this assumption. The smoothness and dissipativity conditions require that, for any x, x′, x″ ∈ X and θ, θ′ ∈ Θ, the following inequalities are satisfied, where θ* is a stationary point as defined in Assumption 1.
The smoothness and dissipativity assumptions are standard in studying the convergence of stochastic gradient MCMC algorithms, and they have been used in many papers such as Deng et al. (2019) and Raginsky, Rakhlin, and Telgarsky (2017). We note that M, m, and b increase linearly with the sample size N. Therefore, the assumption m > 1 can be easily satisfied by increasing the sample size N. It follows from (A7) and (A8) that the gradient noise can be decomposed into two components, where the ξ_k's are mutually independent white noise or martingale difference noise, the ζ_k's are short-range dependent random errors, {ξ_k} and {ζ_k} are independent, and they satisfy the conditions E(ζ_k) = 0, E‖ζ_k‖^2 ≤ B^2, and cov(ζ_k, ζ_{k±l}) = 0 for any k, k ± l ∈ N_+ and any l > l_0, (A10) where B denotes a positive constant, F_k = σ{θ_1, x_1, θ_2, x_2, ..., θ_k, x_k} denotes a σ-field, l_0 denotes the dependence range, and Cov(·, ·) denotes the covariance operator.
Assumption 3 is a relaxation of the gradient noise conditions considered in many existing works such as Chen, Ding, and Carin (2015), Li et al. (2016), Raginsky, Rakhlin, and Telgarsky (2017), and Deng et al. (2019). In general, ζ_k can be interpreted as the gradient estimation bias associated with iterations; for example, if the iterations are partitioned according to stages for the proposed adaptive LEnKF algorithm, we might have E(ζ_k | k ∈ S_t) ≠ 0, while E[E(ζ_k | k ∈ S_t)] = 0 still holds, where S_t denotes the set of iterations of stage t. We note that the condition E‖ζ_k‖^2 ≤ B^2 can be further relaxed to E‖ζ_k‖^2 ≤ M^2 E‖x_k‖^2 + M^2 E‖θ_k‖^2 + B^2, while the results of Theorems A.1 and A.2 still hold with minor modifications of some parameter values there.
Assumption 4. There exist positive constants M and B such that Lemma A.1 is a replacement of Proposition 2 of Deng et al. (2019).
Proof. This theorem can be proved following the proof of Theorem 1 in Deng et al. (2019), but the proofs of Proposition 2 and Lemma 1 in Deng et al. (2019) no longer hold for the Langevinized EnKF algorithm. In the above, we have reestablished them under the conditions given in this theorem.
The expression of the constant in Theorem A.1 can be derived based on the proof in Deng et al. (2019). The λ_0 and k_0 are as defined in Lemma A.3. Their values depend on the sequences {ε_k} and {ω_k}, and the constants M, B, m, b, ς_1, and ς_2 given in the assumptions. The second term of the constant is obtained by applying the Cauchy-Schwarz inequality to bound the expectation E⟨θ_k − θ*, T_{θ_{k−1}} μ_{θ_{k−1}}(x_k)⟩, where we further bound E‖θ_k − θ*‖^2 according to Lemma A.2 and bound E‖T_{θ_{k−1}} μ_{θ_{k−1}}(x_k)‖^2 according to Assumption 5.
Remark 3. Algorithm 2 contains multiple stages, where the step size sequence {γ_{t,k} : t = 1, 2, ..., k = 1, 2, ..., K} is set in the form γ_{t,k} = c_1/(c_2 + (tk)^β) as suggested in Theorem 2.1. As implied by Theorem A.1, there exists a stage t_0 such that (16) holds for any stage t ≥ t_0. Lemma A.3 and Theorem A.1 together characterize how the constants (M, B, m, b, δ, ς_1, ς_2), the learning rate sequence {ε_{t,k}}, and the step size sequence {γ_{t,k}} influence the convergence of {θ_{t,k}}. More precisely, they influence its convergence rate, measured by the constant and k_0 in Theorem A.1, directly or via the upper bounds G_x and G_θ, while G_x and G_θ are determined by (M, B, m, b, δ, ς_1, ς_2), {ε_{t,k}} and {ω_{t,k}} as indicated by Equations (A21) and (A22). The dimensions of x and θ can also influence the convergence of {θ_{t,k}} through the constants (M, B, m, b, δ, ς_1, ς_2), although implicitly. In Assumption 2, M is defined via the Lipschitz constants of ∇_x L(θ, x) and thus M ≥ max{sup{‖∇_x^2 L(θ, x)‖ : x ∈ X, θ ∈ Θ}, sup{‖∇_θ ∇_x L(θ, x)‖ : θ ∈ Θ, x ∈ X}}, where the two terms in max{·, ·} increase with respect to the dimensions of x and θ, respectively. Similarly, as implied by Assumption 5, ς_1 and ς_2 also depend on the dimensions of θ and x. As shown in Lemma A.1, ‖x*‖^2 ≤ b/m and B = M b/m, which indicates the dependence of m, b and B on the dimensions of x and/or θ.

A.4. Ergodicity of Adaptive LEnKF
To establish the ergodicity of the adaptive LEnKF, we need more assumptions. Consider the Poisson equation
L υ(x) = φ(x) − φ̄,    (A25)
which characterizes the fluctuation between φ and φ̄, where υ(x) is a functional solving the Poisson equation and L is the infinitesimal generator of the Langevin diffusion. The solution functional υ(x) characterizes the difference between φ(x) and the mean φ̄. Following Chen, Ding, and Carin (2015) and Li et al. (2016), we make Assumption 6 on υ(x). Refer to Chen, Ding, and Carin (2015) for more explanations on this assumption.
Assumption 6. There exists a sufficiently smooth solution υ(x) to (A25) and a function V(x) such that the derivatives satisfy ‖D^j υ‖ ≲ V^{p_j}(x) for some constant p_j > 0, where j ∈ {0, 1, 2, 3}. In addition, V^p has a bounded expectation, that is, sup_k E[V^p(x_k)] < ∞; and V^p is smooth, that is, sup_{s∈(0,1)} V^p(sx + (1 − s)x′) ≲ V^p(x) + V^p(x′) for all x, x′ ∈ X and p ≤ 2 max_j{p_j}.
Theorem A.2 (Ergodicity of Adaptive LEnKF). Let φ(x) be a function satisfying |φ(x)| < c(1 + ‖x‖) for some constant c. Further, assume that the conditions of Theorem A.1 and Assumption 6 hold, and that the learning rate sequence {ε_k : k = 0, 1, ...} satisfies the prescribed decay conditions; then the MSE upper bound (A26) holds.

Remark 4. As implied by the MSE upper bound (A26) and the fact that the maximum eigenvalue of the preconditioner Σ_{k+1} is less than 1 (see Section A.2), the LEnKF algorithm has a smaller MSE upper bound than the parallel SGLD algorithm, although they have the same computational complexity as mentioned in the main body of the article. For the parallel SGLD algorithm, E‖Ṽ_k‖^2 in (A26) needs to be multiplied by a factor ‖Σ_{k+1}^{−1}‖^2, which is greater than 1.

Supplementary Materials
The supplementary materials describe the state-augmentation EnKF algorithm and the SA-LEnKF-Offline algorithm.