A scalable quasi-Newton estimation algorithm for dynamic generalised linear models

ABSTRACT This research develops a scalable computing method based on a quasi-Newton algorithm for several estimation problems in dynamic generalised linear models (DGLMs). The new method is developed by applying the principle of maximising a pointwise penalised quasi-likelihood (PPQ) for a DGLM fitted to observed data, often of massive size. The statistical and computational challenges involved in this development are tackled by exploiting the specific block structure and sparsity of the underlying projection matrix. The resulting maximum PPQ estimator of the state vector in the DGLM is shown to be consistent and asymptotically normal under regularity conditions. Numerical studies and real data applications are conducted to assess the performance of the developed method.

Developing a scalable statistical computing (SSC) procedure is currently in demand for applying a DGLM to data analysis and statistical inference, where the underlying observed data may be massive and high dimensional. An effective approach for this development lies in deriving an SSC algorithm capable of being run in a parallel computing environment, and in deriving a statistical inference procedure capable of efficiently pooling information from the multiple parallel computing outcomes simultaneously. Before embarking on a detailed development, we give a brief review of statistical computing for DGLMs. Two computational methodologies are commonly used for estimation in DGLMs. The first is based on the Bayesian approach, which often entails using various Markov chain Monte Carlo (MCMC) algorithms for computing the parameter and state vector estimates. Since MCMC essentially involves sequential computing, it is difficult to implement in a parallel computing environment. The second methodology is based on the maximum quasi-likelihood approach, where the Kalman filter, extended Kalman filter, particle filter and Kalman smoother (KS), among others, are often used in the context of DGLM estimation. In this paper we propose a quasi-Newton estimation algorithm for DGLMs based on the maximum quasi-likelihood methodology. The proposed algorithm becomes computationally scalable once the high-dimensional Hessian-like matrix in the algorithm is properly transformed to be block-diagonal. Suppose an n-dimensional response vector sequence {y_t}_{t=1}^T with y_t ∈ R^{n×1} and a state sequence {α_t}_{t=1}^T with α_t ∈ R^{p×1} are related through the model in (1), where the natural parameter θ_t = X_t α_t ∈ R^{n×1} is a vector function with p × n design matrix X_t, and the functions c(·) and b(·) have known forms.
CONTACT: Lixing Zhu, lzhu@math.hkbu.edu.hk. Supplemental data for this article can be accessed at https://doi.org/10.1080/10485252.2022.2085263.
The state evolves as α_t = F_t α_{t−1} + ξ_t, where F_t (∈ R^{p×p}) is a known design matrix, the random error ξ_t ∼ N(0, Σ_t) with N(·, ·) denoting a p-variate normal distribution, the initial state is ξ_0 ∼ N(0, Σ_0), and both Σ_t and Σ_0 are p × p variance matrices.
Using the properties of exponential families, the mean and variance functions are E(y_t | α_t) = h(θ_t) and Var(y_t | α_t) = ∂h(θ_t)/∂θ_t, where h(·) is a twice continuously differentiable function. Note that we allow the design matrix X_t to involve y_s with s < t. Define a DGLM as being specified by (1) with the following conditions: (C1) Conditional on α_t, the current response y_t is independent of α_{t−1}, . . . , α_0. Namely, f(y_t | α_t, α_{t−1}, . . . , α_0) = f(y_t | α_t), t = 1, . . . , T.
(C1) is implied when the errors {ξ_t}_{t=1}^T are mutually independent. The conditional density functions in (C1) are understood to be conditional on y_s with s < t as well if the design matrix X_t contains past responses. The conditional distribution of α_t given α_{t−1} is α_t | α_{t−1} ∼ N(F_t α_{t−1}, Σ_t) for the given matrix Σ_t.
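As a concrete illustration of the model above, the Poisson case with canonical link (mean exp(θ_t)) can be simulated in a few lines. This is a minimal sketch, not the authors' code; in particular, each X_t is taken here as an n × p array so that θ_t = X_t α_t conforms, and all names are illustrative.

```python
import numpy as np

def simulate_poisson_dglm(T, F, Sigma, X, alpha0, rng):
    """Simulate a Poisson DGLM sketch:
    state:       alpha_t = F alpha_{t-1} + xi_t,  xi_t ~ N(0, Sigma)
    observation: theta_t = X_t alpha_t,  y_t | alpha_t ~ Poisson(exp(theta_t))."""
    p = alpha0.size
    a = alpha0
    alphas, ys = [a], []
    for t in range(T):
        a = F @ a + rng.multivariate_normal(np.zeros(p), Sigma)  # state transition
        theta = X[t] @ a                        # natural parameter theta_t
        ys.append(rng.poisson(np.exp(theta)))   # canonical Poisson link
        alphas.append(a)
    return np.array(alphas), np.array(ys)
```

For instance, with T = 10, p = 2 and n = 3 the function returns an (11, 2) state array (including the initial state) and a (10, 3) count array.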
Computation for estimation in the above-defined DGLM is challenging, particularly when some or all of (n, p, T) are large. This is tackled in this paper as follows. First, we develop a pointwise penalised quasi-likelihood (PPQ) function to replace the classic penalised quasi-likelihood (PQ) function for point and interval estimation of the states and parameters. Second, we explore ways to optimise the PPQ so as to achieve effective and efficient parameter estimation and the associated error analysis. Third, by exploiting a block sparse projection matrix (BSPM), we further develop a scalable quasi-Newton (SQN) algorithm to compute the relevant estimates.
We also investigate the relevant convergence properties of the estimators computed from the SQN algorithm. These properties make it possible to assess the performance improvement of the new method over existing methods. We further demonstrate how the SQN algorithm can be implemented in a parallel computing environment through simulation studies. Finally, we apply the proposed method to analyse a real data set on the CSI Internet financial index for illustration. This paper is organised as follows. Section 2 describes the PPQ estimation procedure for DGLMs. Section 3 presents the SQN method for computing the PPQ estimators, as well as the convergence properties of the algorithm. Section 4 discusses the statistical errors and asymptotic properties of the SQN method. Section 5 introduces dimension reduction techniques used in high-dimensional DGLMs. Section 6 displays several simulation results. Section 7 discusses the findings and conclusions.

Pointwise penalised quasi-likelihood
In this section, we present the PPQ function of DGLMs. Define the stacked vectors y*_T and α, where a_0 is the first component of α, assumed to be known beforehand. The conditional distribution of α given y*_T, f(α | y*_T), is given in Fahrmeir and Wagenpfeil (1997, 9); see there for details. For any p(T + 1)-dimensional vector α, by taking the logarithm of f(α | y*_T), we obtain the PPQ as an objective function based on (1). For the sufficiently well-behaved model (1), it is easy to see that α̂ = arg max_α PPQ(α) holds. Therefore, we may name α̂ the maximum PPQ estimator, or MPPQE, of α. Note that the first term in PPQ(α) measures the goodness of fit using θ_t = X_t α_t, and the second term penalises the quantity in the third term, which measures the smoothness of {α_t}_{t=1}^T.
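To make the multi-term structure concrete, a PPQ objective for the Poisson case can be sketched as below; the Gaussian transition penalty and the function names are our illustrative assumptions, not the paper's display.

```python
import numpy as np

def ppq_poisson(alpha, y, X, F, Sigma_inv, Sigma0_inv, a0):
    """Sketch of PPQ(alpha) for a Poisson DGLM:
    fit term : Poisson log-likelihood with theta_t = X_t alpha_t (up to a constant)
    penalty  : deviation of alpha_0 from the known a_0, plus roughness of the
               increments alpha_t - F alpha_{t-1}."""
    d0 = alpha[0] - a0
    pen = 0.5 * d0 @ Sigma0_inv @ d0
    fit = 0.0
    for t in range(1, alpha.shape[0]):
        theta = X[t - 1] @ alpha[t]                     # natural parameter
        fit += y[t - 1] @ theta - np.exp(theta).sum()   # log-lik up to a constant
        d = alpha[t] - F @ alpha[t - 1]                 # state increment
        pen += 0.5 * d @ Sigma_inv @ d
    return fit - pen
```

A state path with alpha_t = F alpha_{t-1} incurs no roughness penalty, so perturbing one state away from the smooth path strictly lowers the objective.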
We will develop a scalable quasi-Newton algorithm to implement the calculation in (4). To facilitate this development, we rewrite PPQ(α) as in (5). Denote y = (a_0′, y_1′, . . . , y_T′)′, μ(α) = {α_0′, h(X_1α_1)′, . . . , h(X_Tα_T)′}′, and the (p + nT) × (p + nT) block-diagonal covariance matrix Σ(α). We also pool all design matrices into a p(T + 1) × (p + nT) block-diagonal matrix X = blockdiag{I_{p×p}, X_1, . . . , X_T}, where I_{p×p} is the p × p identity matrix; and denote by D(α) another (p + nT) × (p + nT) block-diagonal matrix whose tth diagonal block is the derivative of h(X_tα_t). The quasi-score function is the derivative of the first term in (5) for PPQ(α) with respect to α. Denote further the (p + nT) × (p + nT) block-diagonal weight matrix W(α); the quasi-Fisher information matrix then follows (t = 1, . . . , T). Thus, the quasi-score function and quasi-information matrix of PPQ(α) are, respectively, S_PPQ(α) and F_PPQ(α). Note that the form of α_t is symmetric in the second and third terms of (3). We then have E(Kα) = 0 and E(S_PPQ(α)) = 0. The quasi-likelihood equation of PPQ(α) is S_PPQ(α) = 0. The maximum PPQ estimate of α is then the solution of this equation, which requires a numerical method. Regarding numerical approaches for estimation in DGLMs, Fahrmeir and Wagenpfeil (1997) proposed some numerical smoothing methods illustrated with real applications; Triantafyllopoulos (2009) reviewed the numerical strategies developed; Das and Dey (2013) presented applications of DGLMs with numerical approximation; and Migon, Schmidt, Ravines, and Pereira (2013) introduced a combination sampling algorithm to deal with the problem.

A scalable quasi-Newton method for pointwise PQ estimation
As a closed-form solution α̂ of (9) does not exist, iterative algorithms are often used to obtain approximations. Commonly used iterative methods for solving nonlinear systems are Newton-Raphson (NR) and Gauss-Newton (GN). Given an initial vector α̂^(0), these methods produce the sequence in (10), depending on whether NR or GN is the underlying method. Without affecting the convergence of (10), it is possible to premultiply F_PPQ(α̂^(k)) there by a positive definite learning-rate matrix, which results in a gradient descent method. Computing F_PPQ(α̂^(k))^{−1} in the aforementioned methods involves inverting a p(T + 1) × p(T + 1) matrix, which is challenging when p and/or T is large.
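The generic iteration in (10) can be sketched as follows, with score and fisher standing in for S_PPQ and F_PPQ (or their GN analogues) and a scalar lr playing the role of the learning rate; all names are illustrative.

```python
import numpy as np

def newton_iterate(score, fisher, alpha0, lr=1.0, tol=1e-10, max_iter=50):
    """Quasi-Newton iteration alpha_{k+1} = alpha_k + lr * F(alpha_k)^{-1} S(alpha_k).
    Solving the linear system avoids forming the inverse explicitly."""
    a = alpha0.astype(float)
    for _ in range(max_iter):
        step = np.linalg.solve(fisher(a), score(a))  # Newton direction
        a = a + lr * step
        if np.linalg.norm(step) < tol:               # stop once the step stalls
            break
    return a
```

For a quadratic objective with score S(a) = b − Aa and information A, one step lands exactly on the solution A⁻¹b; the convergence theory below sharpens this behaviour for the nonlinear case.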
We now propose a new SQN method to compute the maximum PPQ estimates that does not involve inverting a high-dimensional matrix directly. The proposed method is obtained by adding to (10) a secant iteration for finding δ̂_α^(k), for a given iteration matrix T_S that will be constructed through (14) and explained in the discussion leading to (14). Also, d_k is a function of α̂^(k) and will be defined in (14) below. Combining the two iterative equations, we can iteratively express the solution from the proposed SQN method for k ≥ 0. Here m_k equals the r specified in (13). We can also rewrite the SQN-based solution in terms of A_m(α) and G_m(α), given below. The number of elements in α increases with T and with p, the dimension of α_t. To fit DGLMs, we would repeatedly solve a system of p(T + 1) equations, which requires a scalable procedure when p and/or T is large. Note that in the above iterations, a properly constructed iteration matrix T_S(α) (in short, T_S) plays a pivotal role in making the proposed SQN procedure scalable. To see this, it is reasonable to assume the existence of r (< p(T + 1)) computationally feasible decompositions of F_PPQ as in (13), where π_i is a related projection matrix, J_i is an n_i × n_i principal submatrix of F_PPQ, J_{−i} is the complementary principal submatrix, and L_i equals the transpose of K_i. We then construct T_S through the additive and multiplicative procedures in (14), defined, respectively, with P_i = R_i J_i^{−1} R_i F_PPQ, and B ≡ B_r given below.
The SQN procedure using the first T_S in (14) can be seen as an overlapping block-Jacobi method (see Gander and Hajian 2015); we call it the additive SQN method. The SQN method using the second T_S in (14) can be seen as a generalisation of the block Gauss-Seidel method (see Liu and Keyes 2015); we call it the multiplicative SQN method. Both procedures have been shown to possess a higher convergence rate than other methods in the literature. The pseudo code for implementing these two methods is presented in Algorithm 1 of the supplementary material. Denoting the computing complexity of a standard QN method by ϕ(p(T + 1)), an increasing function of p(T + 1), the computing complexity of the proposed SQN method is ϕ(p(T + 1)/r) for the given function ϕ(·). Next we present a key condition and several lemmas for establishing the convergence results for the proposed SQN methods.
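One additive (overlapping block-Jacobi) correction can be sketched as below: each block restricts the score to an index set, solves against the corresponding principal submatrix J_i of F_PPQ, and the corrections are summed. This is an illustrative sketch of the idea, not the paper's Algorithm 1; blocks may overlap.

```python
import numpy as np

def additive_block_step(Fmat, S, blocks):
    """Additive correction delta = sum_i R_i' J_i^{-1} R_i S, where R_i restricts
    to the indices in block i and J_i = Fmat[block i, block i]."""
    delta = np.zeros_like(S, dtype=float)
    for idx in blocks:
        J = Fmat[np.ix_(idx, idx)]                # principal submatrix J_i
        delta[idx] += np.linalg.solve(J, S[idx])  # local solve, prolonged back
    return delta
```

Each local solve is independent of the others, which is what makes the additive variant natural to run in parallel; when Fmat is exactly block-diagonal on non-overlapping blocks, the summed correction coincides with the full solve.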
(A1): Let ‖ · ‖_2 be the Euclidean matrix norm. Assume F_PPQ(α) is a symmetric positive definite (SPD) matrix in a radius-r_0 neighbourhood of the vector α̂ such that S_PPQ(α̂) = 0. Then there exists an L (> 0) such that the stated bound holds. The following result provides the convergence of α̂^(k) using the T_a procedure.
Theorem 3.1: Assume that (A1) holds, m_k ∈ N and m < ∞. Also let S be a neighbourhood of α̂ satisfying α̂^(0) ∈ S. Then the sequence of iterates α̂^(k) converges to α̂ and satisfies the stated bound. The next result shows the convergence using the T_μ procedure.

Theorem 3.2:
Under the same conditions as in Theorem 3.1, the sequence {α̂^(k)}_{k≥0} obtained from the T_μ procedure converges to α̂ and satisfies the stated bound. Theorems 3.1 and 3.2 show that the iterates α̂^(k) computed from the proposed SQN methods converge to the maximum PPQ estimator α̂ as k → ∞, provided that α̂ exists.

Approximation properties of the SQN algorithm
In this section, we study the statistical approximation properties of the SQN-based estimator iterate α̂^(k), including consistency, asymptotic unbiasedness and asymptotic normality as k → ∞. Let E(·) denote expectation and, in particular, let P_α̂(·) and E_α̂(·) denote probability and expectation evaluated with respect to the probability distribution of α̂, the solution of S_PPQ(α) = 0. We need the following definitions before giving the approximation properties: α̂^(k) is asymptotically normal if there exists a positive definite matrix Σ_α such that the suitably normalised α̂^(k) converges in distribution to a p(T + 1)-variate normal law as the iteration number k goes to infinity. Here Σ_α is called the asymptotic variance of the estimator α̂.

Definition 4.4 (Sieve ratio):
For the quasi-scores S_Q(·) and S_PPQ(·), the sieve quasi-score ratio is defined, for α̂_P, α̂_Q ∈ S_6 in (S3.2) of the supplementary material, where S_6 refers to the neighbourhood in Theorems 3.1 and 3.2. Here S_PPQ(·) is the quasi-score function in (8), and S_Q(·) is the quasi-score function in (6) or in other PQ estimators. Here other PQ estimators refer to the PPQ estimators determined through the KS and MCMC methods.
We now investigate the asymptotic behaviours of the SQN-based estimation iterates under the following assumptions.
(A2): Let F_PPQ(·) be the information matrix in (8) and F_Q(·) the information matrix in (7) or in other PQ estimators. F_PPQ(α̂_P) and F_Q(α̂_Q) are invertible for α̂_P, α̂_Q ∈ S_6. Here other PQ estimators refer to the PPQ estimators computed with the KS and MCMC methods.
(A3): There exists a sieve quasi-score ratio S_s(·, ·) satisfying, for α̂_P, α̂_Q ∈ S_6 in (S3.2), the stated bound, where C_S is a sieve constant with C_S ∈ (0, 1]. (A4): There exist a sequence of eigenfunctions φ_Pi and corresponding eigenvalues λ_Pi over i such that the stated eigen-decomposition holds. For the sum of the eigenvalues λ_Pi, we have the stated inequality involving the spectral radius function and the sieve constant C_S.
Condition (A4) gives the relationship between the sieve constant C_S and trace(F_Q(·)). Note that the relationship between ‖α̂_P − α̂‖_2 and ‖α̂_Q − α̂‖_2 is a very important ingredient of our proposed method. We use two Taylor expansions, one for S_PPQ(·) and one for S_Q(·). In order to control the two remainder terms F_PPQ(α̂_P)(α̂_P − α̂) and F_Q(α̂_Q)(α̂_Q − α̂), and to make the ratio ‖α̂_P − α̂‖_2 / ‖α̂_Q − α̂‖_2 less than 1, we proceed in two steps: first, the sieve quasi-score ratio S_s(·, ·) introduced in Definition 4.4 relates the two score functions S_PPQ(·) and S_Q(·); second, an eigenvalue decomposition of F_Q(·) links S_s(·, ·) to the eigenvalues of F_Q(·) and F_PPQ(·). From this relationship, we obtain the following theorem.
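For concreteness, the two expansions around α̂ can be written as below; this is our sketch, assuming the usual conventions that the quasi-information is the negative derivative of the quasi-score and that S_PPQ(α̂) = 0.

```latex
S_{\mathrm{PPQ}}(\hat\alpha_P)
  = \underbrace{S_{\mathrm{PPQ}}(\hat\alpha)}_{=\,0}
    - F_{\mathrm{PPQ}}(\hat\alpha_P)(\hat\alpha_P-\hat\alpha)
    + o\!\left(\lVert\hat\alpha_P-\hat\alpha\rVert_2\right),
\qquad
S_{\mathrm{Q}}(\hat\alpha_Q)
  = S_{\mathrm{Q}}(\hat\alpha)
    - F_{\mathrm{Q}}(\hat\alpha_Q)(\hat\alpha_Q-\hat\alpha)
    + o\!\left(\lVert\hat\alpha_Q-\hat\alpha\rVert_2\right).
```

Inverting the information matrices in these displays is what turns a bound on the sieve ratio S_s(·, ·) and on the eigenvalues of F_Q(·) and F_PPQ(·) into a bound on ‖α̂_P − α̂‖_2 / ‖α̂_Q − α̂‖_2.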
Recall that α̂^(k) is the estimated value of α̂ at the kth iteration. We then get the following results. The first theorem can be proved by Lemma 1 of the supplementary material together with Theorem 3.1 for T_a, and Lemma 1 with Theorem 3.2 for T_μ; more details can be found in the Appendix. The second theorem can be proved using Lemma 1 of the supplementary material and Theorems 3.1 and 4.2 for T_a, or Lemma 1 of the supplementary material and Theorems 3.2 and 4.2 for T_μ.

Theorem 4.4 (Asymptotic normality): Suppose that the sequence {α̂^(k)} satisfies the stated conditions.
For the given matrix F_PPQ(α̂), the stated decomposition holds; we then obtain the asymptotic normality of α̂^(k). The above theorems can be proved by checking the characteristic function of α̂^(k); see the supplementary material.

A brief discussion on dimension reduction
Consider high-dimensional DGLMs where p ≫ n, i.e. the number pT of parameters is larger than the sample size nT in (19). In this situation, fitting the DGLM entails sparse and shrinkage estimation of α, which usually requires O(npT²) operations. A sparse and shrinkage estimate of α can be obtained by maximising PPQ(α) plus a negative penalty term, which can still be computed using the proposed SQN algorithm with little modification. Namely, the new estimator α̂ is the stated maximiser, where PPQ(·) is from (5) and p_λ(·) is a penalty function with tuning parameter λ (> 0). Many different p_λ(·) can be used, e.g. the following commonly used ones. Lasso (Tibshirani 1996) uses an L_1-penalty, namely p_λ(|α_tj|) = λ|α_tj|. Adaptive lasso (Alasso) uses p_λ(|α_tj|) = λω_tj|α_tj|, where ω_tj is a weight. Group lasso (Glasso) is another alternative (Yuan and Lin 2006), in which λ is the tuning parameter, α = (α′_{I_1}, α′_{I_2}, . . . , α′_{I_G})′, K = diag{K_1, K_2, . . . , K_G} is a p(T + 1) × p(T + 1) SPD matrix, G is the number of groups, K_i is a p_i × p_i sparse SPD matrix, and ‖v‖_K = (v′Kv)^{1/2} for v ∈ R^{p(T+1)×1}. Glasso is more robust with small-sized group structures; the R package grplasso can be used to fit it. Other methods include MCP (Zhang 2010), whose penalty derivative is p′_λ(t) = (λ − t/a)_+, where a (> 0) is a shape parameter; SCAD (Fan and Li 2001), which uses a penalty function with derivative p′_λ(t) = λ{I(t ≤ λ) + (aλ − t)_+ /((a − 1)λ) I(t > λ)}; and SICA (Lv and Fan 2009), which uses a penalty function with derivative p′_λ(t) = λa(a + 1)/(a + t)².
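The penalty derivatives just listed are easy to code; the forms below follow the standard definitions in the cited papers, with illustrative default shape parameters (these defaults are our assumption, not the paper's settings).

```python
def scad_deriv(t, lam, a=3.7):
    """SCAD derivative (Fan and Li 2001) for t >= 0:
    flat at lam on [0, lam], then linear decay to 0 at a*lam."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

def mcp_deriv(t, lam, a=3.0):
    """MCP derivative (Zhang 2010) for t >= 0: (lam - t/a)_+."""
    return max(lam - t / a, 0.0)

def sica_deriv(t, lam, a=1.0):
    """SICA derivative (Lv and Fan 2009) for t >= 0: lam * a(a+1)/(a+t)^2."""
    return lam * a * (a + 1.0) / (a + t) ** 2
```

All three decay towards zero for large coefficients, which reduces the bias that the flat L1 penalty of the Lasso induces on large signals.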

Simulation studies
The performance of the SQN method is examined via several measurement indicators: the mean squared error (MSE) of α̂, the bias of α̂, the coverage rate of confidence intervals of α̂_j, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). We also consider the time cost and apply empirical dimension reduction techniques in cases of high-dimensional data.

Simulation settings
We use the natural link function, i.e. θ_t = X_t α_t, in each DGLM used in our simulation studies.
This facilitates the simulation study by properly choosing the tuning parameter C_S. Observe that S_Q(α) and S_PPQ(α) are zero or close to zero; for simplicity, the same settings are used in both the numerical and Monte Carlo methods here. We make use of the available software packages (Rmpi, snowfall, MPICH2 and PETSc) in the simulation. The computation is implemented on a cluster of three computers, each with a quad-core 2.67 GHz processor, 8 GB RAM and 1 TB HDD. Eight threads are used to examine the performance of the SQN method, namely r = 8. We compare the proposed SQN method with several existing ones, briefly described here. The NR and GN methods for DGLMs are presented in (10), while the Levenberg-Marquardt (LM) method numerically solves a damped version of the quasi-likelihood equation in which λ is a penalty parameter; cf. Guo (2012). The SQN method with procedure T_a becomes the Newton block-Jacobi (NJ) method when it has no overlapping sample blocks in Algorithm 1 of the supplementary material; this can also be regarded as a Newton-type block coordinate descent method, cf. Frommer and Renaut (1999). The KS method for DGLMs follows Migon et al. (2013); see there for details.
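The LM comparison method amounts to solving a damped linear system at each step; a minimal sketch under that reading (the identity damping and names are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def lm_step(Fmat, S, lam):
    """One Levenberg-Marquardt-style step: solve (F + lam * I) delta = S.
    lam = 0 recovers the undamped (quasi-)Newton step; large lam shrinks
    the step towards a scaled gradient direction."""
    return np.linalg.solve(Fmat + lam * np.eye(Fmat.shape[0]), S)
```

In practice the damping λ is tuned, e.g. by cross-validation.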
Furthermore, the estimation of α may also be carried out by a Bayesian method, where MCMC is expected to be used for the involved computing. Here we will not go into details of the Bayesian method but briefly describe two MCMC methods, the Gibbs sampling and the Metropolis-Hastings (MH) algorithms, in the context of estimating α by posterior random samples. The Gibbs sampler starts from α^(0) and proceeds in steps; see Fahrmeir (1992) for more details. The Metropolis-Hastings (MH) algorithm is described next.
Here f(α_q | y*_T) is the proposal transition density, simply chosen as a normal one, and the related acceptance rate takes the usual MH form.
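The MH scheme just described, with a normal proposal, can be sketched in scalar form; the scalar state, step size, and names here are illustrative simplifications (the paper's α is a vector).

```python
import math
import random

def mh_sample(logpost, x0, n_iter, step=0.5, seed=1):
    """Random-walk Metropolis-Hastings with a normal proposal.
    Accept with probability min(1, exp(logpost(prop) - logpost(x)))."""
    random.seed(seed)
    x = x0
    lp = logpost(x)
    samples = []
    for _ in range(n_iter):
        prop = x + random.gauss(0.0, step)            # normal proposal
        lp_prop = logpost(prop)
        if math.log(random.random()) < lp_prop - lp:  # MH acceptance test
            x, lp = prop, lp_prop
        samples.append(x)
    return samples
```

Targeting a standard normal (logpost(x) = −x²/2), the sample mean settles near zero after a few thousand iterations.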

Low-dimensional simulations
With {n = 5000, p = 10, α̂^(0) = (1, . . . , 1)′_{p(T+1)}}, Table 1 displays the comparison of the SQN method with four existing methods: NR, GN, Levenberg-Marquardt (LM) and Newton block-Jacobi (NJ), as described above. The penalty parameter of the LM method is chosen by cross-validation. With T ∈ {50, 100, 200}, we present MSEs, biases and coverage rates of 95% pointwise confidence intervals. The basic message is that when T is small, say T = 50, there is no single winner in each case and the SQN does not perform satisfactorily overall. However, the performance changes with larger T: the SQN is clearly the winner, with significant improvements in MSE and bias, when T = 200. Also, for all six distributions, the SQN performs robustly. Interestingly, the coverage rates of the confidence intervals are similar for all competitors.
With initial α̂^(0) = (1, . . . , 1)′_{p(T+1)} and n = 5000, Table 2 lists the comparison of the SQN method with the KS, Metropolis-Hastings (MH), Gibbs sampling and parallel tempering (PT) methods. With T ∈ {300, 500, 800}, the performance of the SQN is again robust across all distributions; when T = 300 or 500 it works well, resulting in small MSEs, while when T = 800 the PT has a smaller MSE. Regarding the bias, the SQN is always the winner with the smallest bias, and regarding the coverage rate all methods perform similarly. In terms of MSEs and biases, the Gibbs, the PT and the SQN obtain smaller values than the LM and the NJ. In terms of coverage rate, the LM, the Gibbs and the SQN obtain more stable values than the NJ and the PT. Therefore, the SQN is overall a recommendable method. Figure 1 shows the pointwise confidence intervals of E(y_ti) for the DGLMs with all six data generating distributions. The pointwise confidence intervals perform consistently well, although they are not symmetric. Figure 2 presents the confidence intervals of E(y_ti) for the DGLMs with the Poisson and Weibull distributions, where T = 200. The confidence intervals accurately reflect the variability; their ranges span from −3 to 7 and from −1 to 3, respectively.
We record the computing cost for each of our simulations. As an example, Figure 3 shows the time costs of the Poisson DGLMs with (n, p, T, r) = (200, 5, 5000, 8). The computational cost of the SQN is lower than those of the others; moreover, the SQN is nearly as efficient as the PT. We also study the effects on simulation performance of the sieve constant C_S and the algorithm computing tolerance ε, with (b, l_b) = (1, 12) in the relevant projection matrix and (p, n, T) = (150, 200, 2000) in the models. Figure 4 displays the performance of the SQN with different C_S and ε for the Pareto-DGLM as an example (Figure 4 caption: performance comparisons of the SQN method solving Pareto-DGLMs for different sieve constants C_S; here (b, l_b, p, n, T) = (1, 12, 150, 200, 2000)). Across different ε, the indicators (MSE, bias, AIC and BIC) are smallest at C_S = 1; across different C_S, the four indicators are largest with ε = 10^{−1} and smallest with ε = 10^{−3}. In summary, the SQN returns a desirable outcome throughout this designed simulation study.

High-dimensional simulations
As discussed in Section 5, the SQN method needs to be coupled with a dimension reduction criterion for high-dimensional data analysis. The dimension reduction criteria considered here include Lasso, Alasso, Glasso, MCP, SCAD and SICA. We choose the penalty parameter λ through eight-fold cross-validation and evaluate five measurement indicators (MSE, bias, coverage rate, AIC and BIC). The settings are {p = 1000, 1200, 1500; T = 1000; n = 600}. Let a = 3 ln(nT)/(nT)^{1/2}, 4 ln(nT)/(nT)^{1/2} and 5 ln(nT)/(nT)^{1/2}, respectively, in the penalties of MCP, SCAD and SICA. Also set G = 6 in Glasso and ω_tj = 1/(pT) in Alasso. The results in Table 3 show that Lasso, Alasso, MCP, SCAD and SICA have similar performances. As expected, Glasso performs best among the six methods.
We discuss the number of selected α states when the dimension p is given. Figure 5 displays the results of various methods in high-dimensional DGLMs. Panels (a) and (b) are the plots for the DGLMs with the binomial distribution, Panels (c) and (d) for the gamma distribution, and Panels (e) and (f) for the Pareto distribution.
The DGLMs with the Poisson, Pareto and Weibull distributions show similar trends; we generate them with different n. With (b, l_b) = (3, 46) in the relevant projection matrix and T = 100, we employ Glasso as a dimension reduction technique to examine the performance of the SQN with various C_S and p. We also use this simulation to check the performance of the SQN with different C_S and ε. Our theoretical results restrict C_S to the range (0, 1]; in this simulation, we also try values larger than 1 to see which C_S performs well. Table 4 summarises the estimation results. The MSEs and biases of α̂ are basically stable across the different dimensions p. The interesting observation concerns the choice of the sieve constant C_S: when C_S = 1, which is on the boundary of the range in our theoretical result, the MSEs and biases are smaller than those with C_S = 0.5 and 1.5, while the AIC and BIC results are similar for the various p and C_S. Further, we generate samples from the DGLM with the Weibull distribution, where p ∈ {50, 60, 70, 80, 90}, and choose a suitable block sparse projection matrix (BSPM) with the related b and l_b. We apply Glasso with l_b = 16 and b ∈ {1, 2, 3, (nT)/ln(nT)}. Figure 6 displays the comparison of the SQN method with different b and p. With fixed l_b, the MSEs of the SQN with b = 1 are slightly smaller than the others, while the bias with b = 2 is smaller than the others. For fixed b, the MSEs and biases with p = 50 and 60 are smaller than the others. Moreover, AIC and BIC take smaller values with (b, p) = (2, 60). Thus, the performance of the proposed method is better in this case.

Discussion
The SQN-based estimation proposed in this paper has good asymptotic properties with regard to the parallel iterative computing procedure. The key element in this success is the replacement of the quasi-Fisher information matrix by a BSPM, which reduces the computing complexity and increases the computing speed while still maintaining the estimation accuracy. However, some problems remain unresolved, for example, the choice of block length in the BSPM, the optimisation of the sparse blocks, and the acceleration problem in ultra-high-dimensional scenarios. The approximation properties in Section 4 would still hold when some sparsity assumptions are imposed. Since the proofs of these properties would be technically much more complicated but do not involve new statistical ideas, we defer them to a future study.
We may opt to use distributed MCMC and distributed resampling methods instead in big data analysis. The distributed MCMC methods include PT, Pop-MCMC and particle MCMC, among others. The PT method has some obvious drawbacks with respect to information exchange, while Pop-MCMC and particle MCMC might be good choices. For distributed resampling, a subblock method to resample the whole data set would be useful. All these deserve further study.