A comparison of three algorithms in the filtering of a Markov-modulated non-homogeneous Poisson process

ABSTRACT A Markov-modulated non-homogeneous Poisson process (MMNPP), whose intensity process is designed to capture both the cyclical and nonrecurring trends, is considered for modelling the total count of cyber incidents. Extending the Expectation-Maximisation (EM) algorithm for the current MMPP literature, we derive the filters and smoothers to support the MMNPP online parameter estimation. A scaling transformation is introduced to address the numerical issue for large data sizes whilst maintaining accuracy. The filter- and smoother-based EM algorithms are then benchmarked to the maximum likelihood-based EM algorithm at the theoretical level. The differences emerge in the E-step of the EM procedure. Both the filtering and smoothing schemes, in conjunction with the change-of-measure technique, avoid the computing complication caused by the hidden regimes. In contrast to the usual EM algorithm, the said two algorithms could be implemented given only the incident counts data without the specific times of jumps. Within the data compiled by the U.S. Department of Health and Human Services, the filter-based algorithm performs better than the algorithm involving smoothers. The benchmarked algorithm may do well in calibration under the presence of extreme incident counts with an extremely low frequency; however, overfitting may occur. For most practical applications involving 2 or 3 regimes, both algorithms are superior when it comes to efficiency, real-time update, and low computational cost. The benchmarked algorithm is better when there are more regimes under relatively closer intensities. Overall, the filter-based algorithm gives better estimation, especially if there is a low-frequency regime and the flexible binning of the data set is an important consideration.


Introduction
The Markov-modulated Poisson process (MMPP) has a pivotal role in the realm of hidden Markov models (HMMs).It functions as a Poisson process with an intensity parameter guided by an unseen continuoustime irreducible Markov chain.This paper delves into the Markov-modulated non-homogeneous Poisson process (MMNPP) as detailed in Avanzi et al. (2021), which is the MMPP's augmentation.An MMNPP features a component that could be introduced to its frequency perturbation measure, modulated by recognisable factors, thereby enhancing its modelling flexibility.This structure endows the MMNPP with advantageous attributes of the MMPP, and at the same time its augmented perturbation measure captures favourably the cyclical and non-recurring trends of the process.Estimating the HMM parameters is quite In natural sciences, it is used to model birthrelated events (Leroux & Puterman, 1992) and photon arrivals (Burzykowski et al., 2003).It has found relevance in inventory management (Arts et al., 2016;Ghanmi, 2016), communication networks (Heffes & Lucantoni, 1986;Kawashima & Saito, 1990;Muscariello et al., 2005), and insurance (Asmussen, 1989;Guillou et al., 2013;Wei et al., 2010).The MMNPP, which is a non-homogeneous MMPP, has also farreaching impact in various scientific undertakings.For instance, Thayakaran and Ramesh (2013) innovatively incorporated covariates (e.g.temperature and humidity) into the MMPP framework.In ecology, Langrock et al. (2013) utilised the MMNPP for double-observer line transect surveys.The MMNPP embeds a seasonality component in the study of Guillou et al. (2015) giving novel perspectives in calibrating insurance risk models.
In addition to the above-mentioned applications, we consider the MMNPP setting for cyber risk assessment as a new application.A real-world data set will be used to test the applicability of two EM algorithms developed in this paper.The MMPP model has an inherent ability to handle time-series counting data.As demonstrated by Ihler et al. (2007), Markov-modulated Poisson provides a robust and accurate framework for detecting anomalous events with sequential counting data occurring in the context of internet navigation logs.Meanwhile, Carnevali et al. (2019) stressed the MMPP's aptness for cyber risk modelling in terms of dynamic prediction of a system's state and progression.In particular, the marked MMPP model is usable in inferring a system's current state and in forecasting its progression by learning from the observed events.Thus, it enhances software reliability and offers robust comparison vis-à-vis traditional methods for either event state-dependent timing or event-type categorisation.Husák et al. (2018) further observed that Markov models still function capably even when encountering unfamiliar states and transitions.Unlike alternative discrete modelling approaches, such as attack graphs and Bayesian network techniques, Markov models do not require a comprehensive knowledge concerning intrusion detection and attack forecasting.This particular characteristic underlines the versatility of HMMs, which are powerful tools for detecting and anticipating cyber threats on digital infrastructures, as delineated in Amin et al. (2021) and Chadza et al. (2020).
Our investigation extends to leveraging the HMM in capturing cyberattack patterns.In a recent work Li and Mamon (2023a), a framework was developed for cyber risk modelling, wherein the cyberattacks occurrences as well as their inter-arrival and duration are captured by an HMM.
To succeed in applying the MMPP and MMNPP, a crucial challenge to surmount is their parameter estimation because the parameters are driven by a latent process.Various strategies have been proposed for this purpose and include moment-matching (Gusella, 1991), likelihood-based approaches for discrete-time MMPP (Deng & Mark, 1993), and Markov chain Monte Carlo (MCMC) sampling (Rydén, 2008).Both the maximum likelihood estimation (MLE) and the Expectation-Maximisation (EM) algorithms stand out as the preferred inference tools for HMMs.Rydén (1994) initially introduced the likelihood-based method for the MMPP, where the robust consistency of the maximum likelihood estimator was affirmed.A subsequent EM algorithm was further crafted by Rydén (1996).The EM's prominence is attributed to its computational efficacy.Nevertheless, it is essential to recognise that the EM algorithm does not always guarantee convergence to the maximum likelihood estimate.The EM algorithm may only attain a local maximum or even a likelihood function's saddle point.Despite the convergence issue, it is pointed out in Rydén (2008) that the EM algorithm remains the most straightforward and fastest method for obtaining a point estimate.With the aid of this algorithm, it is sufficient to compare models using only their penalised maximum likelihoods.
Another well-established tool for estimation covering the family of HMM-modulated models is the reference probability technique.This refers to a procedure where the estimation is carried out under an ideal reference probability after a suitable probability measure change.Then, the estimates are linked back to the realworld measure (i.e. the probability under which the data are observed).In our case, the MMPP becomes a homogeneous Poisson process with a unit intensity under the reference measure.The results are then transformed back to the real world with the original probability measure.Krishnamurthy and Elliot (1997) applied the change of measure technique and calculated the filters for the MMPP.In contrast to the traditional MLE-based EM algorithm (Rydén, 1996), the filters of the E-step are causal because they involve fixed-interval smoothing (i.e. a forward pass and a backward pass).This helps in the substantial reduction of the memory cost and enhances the suitability for parallel-computing implementation (Krishnamurthy & Elliott, 1993).However, the inherent computational instability becomes a problem when different schemes are used to approximate the discrete-time recursions.Through the Clark transformations (Clark, 1978), the numerical stability of the discrete-time state estimation could be boosted (Malcolm et al., 2003).Elliott and Malcolm (2005) proposed a new smoothing scheme via Clark transformations to tackle the technical difficulties caused by the presence of reciprocal terms in earlier smoothing method (Snyder, 1972).In Elliott and Malcolm (2008), the EM algorithms are presented for the MMPP parameter estimation via robust filtering and smoothing techniques.
In this research, we develop the MMNPP setting and consider three EM-driven algorithms.The use of the reference probability methods draws inspiration from the implementation of the MMPP version developed by Elliott and Malcolm (2008).A detailed comparison amongst these algorithms follows with the MLE-based EM algorithm in Avanzi et al. (2021) serving as a benchmark.Both the MLE and reference probability methods have their own unique advantages and disadvantages.The MLE, grounded on traditional probabilistic notions, emphasises the likelihood computation of incident arrivals.However, the intricate architecture of the unknown hidden states of the intensity process induces significant complications for the MLE.Additionally, due to the complexity of the associated likelihood formulae, researchers have strived to improve the MLE's computational efficiency (Roberts et al., 2006;Rydén, 1996).Conversely, the appealing attribute of the reference probability method is its capacity to transform the MMNPP into a homogeneous Poisson process with unit intensity, thereby circumventing challenges posed by the hidden states.Nonetheless, the reference probability method entails the construction of an appropriate density which is not necessarily trivial.There could also be issues surrounding the discretisation and numerical stability in the numerical implementation.
The three algorithms considered in our analysis differ in their filtering or smoothing techniques.In the two proposed algorithms involving the reference probability method, one uses filters and the other one employs smoothers.The filtering and smoothing techniques are used in conjunction with the MLEbased algorithm.Given an estimation time point, smoothing utilises all the data before and after this time point whilst the filtering technique only considers the data before this time point.The benefit of doing this is that a biased estimate is less likely to occur than when using filtering.However, this also makes it less efficient when accommodating new information, which is an important aspect when quick and frequent parameter updates are required.In summary, our three algorithms have the combined flexibility of these two perspectives: the first one makes use of filtering with a reference measure, the second one involves smoothing with a reference measure, and the third one adopts smoothing with the MLE.
Our primary contribution lies in customising the reference probability method for the MMNPP as well as in refining the filter-and smoother-based EM algorithms that could cater to large data sets.This broadens the array of tools available for the MMNPP implementation.In addition, a thorough comparison between our proposed algorithms and the MLE-based methods is conducted both at the theoretical and practical levels using simulated data.Our findings indicate that, given a data set with notable characteristics (e.g. a pronounced intensity difference, recurring regime switches, and a limited number of regimes), our algorithms offer increased accuracy and computational efficiency compared to the MLE-based method.However, for data sets with infrequent regime changes, the MLE-based method proves to be performing well in parameter estimation.
The rest of this paper is organised as follows.Section 2 details the MMNPP model and the reference probability method.Filtering and smoothing approaches are explored in Section 3. Section 4 derives the filtered and smoothed estimates and introduces the associated EM algorithms.Section 5 presents the MLE-based EM algorithm and offers a theoretical comparison of the three methodologies.Detailed numerical applications are showcased in Section 6, where we undertake an exhaustive comparison using simulated data.Finally, Section 7 concludes the paper.

Model description and change of measure
Suppose ( , F, P) is a complete probability space, where P is the probability measure in the real world.All processes are defined in this probability space.Let M := {M t } t≥0 be a continuous-time finite-state, hidden Markov process.Without loss of generality, the state space of M t is associated with {e 1 , e 2 , . . ., e K } ⊆ R K , where e i = (0, . . ., 0, 1, 0, . . ., 0) with the '1' in the ith position and denotes a vector or matrix transpose.Let Q denote the constant rate transition matrix of the Markov chain.The dynamics of M t have the semi-martingale representation, where W := {W t } t≥0 is a martingale with respect to the natural filtration generated by M and takes values in R K .We consider a non-homogeneous Poisson process (NPP) N := {N t } t≥0 with the intensity function where λ = (λ 1 , λ 2 , . . ., λ K ) ∈ R K with {λ k } K k=1 ≥ 0 and η(t) ≥ 0 is an exposure process.Note that η is an external exposure process under a weak assumption that is adequately approximated by a finite number of constant-valued intervals.When estimating parameters λ and Q, we assume η to be known.In practice, η can be used to calibrate seasonal or non-recurring trends.For example, η can be defined to incorporate the external covariates, temperature, sea level pressure and relative humidity, to model the rainfall bucket tip times data (Thayakaran & Ramesh, 2013).We define an operational time function ρ(t) := t 0 η(s)ds.Note that η is a monotonically increasing function of t with ρ(0) = 0. Avanzi et al. (2021) established the equivalence between the MMNPP and MMPP.The MMNPP is equivalent to the MMPP with the conditional Poisson component defined on the operational time ρ(t).The hidden Markov process does not govern the operational time.From the definition of the MMNPP, the number of incident arrivals over the time interval [t 1 , t 2 ] ⊆ R + follows a Poisson distribution with the rate t 2 t 1 λ s ds.We define the following right-continuous, complete filtrations for further discussions: In the MMNPP, the Markov process and F M are unobservable whilst the Poisson process and F N are observable.After setting up the filtrations, we present the Doob-Meyer decomposition (Cohen & Elliott, 2015) for where V := {V t } t≥0 is an martingale with respect to G.
Given the dependent and complicated link between N and M, we wish to construct a reference probability measure P † to simplify our later derivations.The change of measure technique is widely used in the filtering models; for more details, see Elliott et al. (2008).Under an appropriate reference measure P † , the MMNPP, N, is a homogeneous Poisson process with unit intensity, which is independent of the underlying Markov process, M.Moreover, Elliott and Malcolm (2008) proved that the Markov process, M, remains unchanged under the new probability measure P † , i.e. following the dynamics in (1).Before constructing P † , we define the G-adapted process := { 0,t } t≥0 } as 0,t := 1 where (Pascucci, 2011), the solution to the stochastic differential equation (SDE) (4) is ( 5 ) Given the solution form (5), 0,T has the factorisation 0,T = 0,t t,T , where With T being the time horizon, we define the reference measure P † via the Radon-Nikod ym derivative dP dP The compensated process of N under P † is which is a local martingale under P † .

Filters and smoothers
In this section, we shall present the derivation of the filters and smoothers for the hidden Markov process M.
The major results can be found in Theorems 3.1-3.3.In particular, Theorem 3.1 provides the Zakai stochastic differential equation governing the dynamics of the unnormalised filter over time.Theorems 3.1 and 3.2 will be described in Subsections 3.1 and 3.2.Our results are modified versions of those under the MMPP described in Elliott and Malcolm (2008).Suppose where E † is the expectation under After notation simplification, an equivalent representation of ( 7) is where 1 = (1, 1, . . ., 1) ∈ R K .Note that γ (G t M t ) will be used in the filter estimation later.Define Consider the following diagonal matrix The recursive Zakai equation governing p t is given by the following theorem.

Robust filters
In this subsection, we shall use the gauge transformation technique introduced in Clark (1978) to derive the robust filters for the hidden Markov states and the ordinary differential equation (ODE) that governs the robust filter.The relevant result is presented in Theorem 3.2.Malcolm et al. (2003) demonstrated why the filters generated by gauge transformation are robust.They proved that one can guarantee nonnegative estimated probabilities p t using the robust filters by choosing a maximum grid step to be no greater than a given bound; see (41).The robustness was also illustrated with the numerical example in comparison with the conventional Euler-Maruyama discretisation scheme (Kloeden & Platen, 1992) which produced negative probability estimates.By applying the gauge transformation, we can also be evaluating the stochastic integral with dN in (10), which makes the implementation difficult in practice.Therefore, we shall introduce a rotation or transformation, matrix t = diag(ψ 1 t , ψ 2 t , . . ., ψ K t ) ∈ R K×K , where The inverse of the transformation, −1 satisfies an SDE given in the following result.
Lemma 3.1: The matrix −1 satisfies where 0 = −1 0 = I and I ∈ R K×K is the identity matrix.
Proof: See Appendix B.
Definition 3.2: The transformed process p := { p t } t≥0 is defined as We then derive the ODE that governs the dynamics of p with the following theorem.
Theorem 3.2: p is governed by Proof: See Appendix C.
With p t = t p t , the expectation of the Markov states in (9) is given by This gives the robust filters of the hidden Markov process M.

Smoothing scheme
In this subsection, we shall exploit the duality between the extra-information process and its transformed process.A linear forward ODE for the smoothed estimates and a backward ODE for the extra-information process are derived in Theorem 3.3.In the context of smoothing, for any given t ∈ [0, T], we consider estimates given observations up to time T.That is, our purpose is to calculate Note that by the factorisation of 0,T and the tower property, Now, consider the following smoothed estimates for the unnormalised and normalised hidden Markov states.
Definition 3.3: Definition 3.4: Next, we define the extra-information process and its transformed process.
The corresponding transformed process is given by Remark 3.1: As observed from the above definition, v t incorporates information from t to T, which explains why it is called as 'extra-information process.' Consequently, we have the following formulae and ODEs for the smoothed estimates.
where p t satisfies the following forward linear ODE equation and v t satisfies the following backward linear ODE Proof: See Appendix D.
Note that the duality is illustrated by the equality between (20) and (21).

EM algorithm
The EM algorithm switches between two steps.The first is the expectation-step (E-step), in which a function is set up for the expectation of the log-likelihood evaluated using the current parameter estimates.The second is the maximisation-step (M-step), in which parameters that maximise the function created in the E-step are calculated.Such parameters are then plugged into the E-step and the next iteration continues.
Suppose Y ⊂ F and {P θ , θ ∈ } is a family of probability measures defined on a measurable space ( , F). Assume that each measure is absolutely continuous with respect to a reference measure P 0 , with representing the parameter space.The likelihood function associated with the estimation of θ, given the available information encapsulated in Y, is The corresponding maximum likelihood estimator of θ is defined as The objective is to find an estimator of θ that optimises the conditional expectation of the likelihood ratio.However, obtaining the maximum likelihood estimate directly is generally infeasible, especially for complicated density functions.This necessitates the utility of iterative or numerical strategies (e.g.EM algorithm) to approximate the 'true' parameter estimate.
The EM procedure works as follows: Step 1. Initialise l = 0 and select an initial value, θ 0 .
Step 4. Increment l by 1 and repeat from Step 2 until a stopping criterion is satisfied.
Regarding the convergence of the EM algorithm, the likelihood derived from the MMNPP belongs to the exponential family of likelihoods, as represented in (77).Wu (1983) established that the exponential family meets the conditions necessary for the convergence of the EM algorithm.This provides a theoretical justification for our employment of the EM algorithm.
In this section, we shall consider both robust filterbased and robust smoother-based EM algorithms to estimate the model parameters in Subsections 4.1 and 4.2, respectively.These two algorithms share the same M-step given by (29) in Subsection 4.1.In contrast, the uniqueness of the E-step, which differentiates the two algorithms, are explained in the respective subsection that discusses each algorithm.

Robust filter-based EM algorithm
The filter-based estimates of λ and Q are given in Theorem 4.1, where the quantities used to compute them are governed by stochastic integrals.To simplify the calculation, the robust filters using the transformed gauge process are presented recursively in Theorem 4.2.For practical purposes, both discretised and explicit robust filters expressed recursively are calculated in Theorem 4.3 and Corollary 4.1.At the end of this subsection, we present a scaling technique for numerical implementation and the algorithm procedures in Table 1.
We first define the following processes for parameter estimation for i = 1, 2, . . ., K.
(i) The number of transitions from e i to e j of the process M up to time t is (ii) The number of jumps or incident arrivals with the process M in e i up to time t is The occupation time of the process M in state e i until time t is (iv) The occupation time of the process M in state e i relevant to η until time t is The number of claims, N tm at time t m for m = 0, 1, . . ., Y; The exposure function values η(t 1 ), η(t 2 ) . . ., η(t Y ); The dimension of the hidden Markov chain, K; Tolerance values, 1 , 2 .1: Set up the initial values for the starting EM algorithm pass k = 0.For 1 ≤ i, j ≤ K, q i,j (0) and λ i (0).Set the initial estimate differences max 1≤i≤K Calculate ˘ m,m−1 with ( 43) and ( 57); 6: Calculate s tm with (56) 7: Calculate q i,j (k) and λ i (k) with ( 59) and ( 60), where 1 ≤ i, j ≤ K; 16: Update q i,j and λ i used in Step 4-13; 17: end while 18: return q i,j , λ i and s tm .
Remark 4.1: In the context of cyber risk, J i,j t describes the number of transitions within the underlying cybersecurity environment.This is the number of shifts from state e i to e j up time t.We count the number of cyber attacks and denote it by as O i t throughout [0, t] when the cybersecurity environment is in state e i .Moreover, T i t represents the amount of time the cybersecurity environment is in state e i up to time t.The quantity T * i t is simply called an auxiliary process as it does not have a direct interpretation.It bears semblance to T i t with the η function included in the integrand.
According to Avanzi et al. (2021), the parameter estimates of Q and λ arising from the M-step of the EM algorithm are In the subsequent discussion, we introduce how to use the filters in carrying out the E-step.Generally, we cannot compute the recursive filtered dynamics of O i , J i,j , T i and T * i directly.However, a plausible way is to first compute the associated filtered product quantities, for example, γ (O i t M t ) and then obtain γ (O i t ) = γ (O i t M t ), 1 .The estimates using the product quantities are given in the following theorem.
Theorem 4.1: For i, j ∈ {1, 2, . . ., K} and i = j, the estimates q i,j and λ i are given by and where To simplify the calculations, we define the gauge transformed processes (Clark, 1978) as we did in Theorem 3.2.
Theorem 4.2: The gauge transformed processes in Definition 4.1 are governed by Proof: See Appendix F.

Remark 4.2: Note that we provide two expression for γ (O i t m M t m
).The first one in (36) requires stochastic integration involving dN t whilst the second one in (37) avoids such complexity by using integration by parts.According to our numerical implementation experience, we find the algorithm based on ( 36) is less likely to be interrupted in the numerical analysis.Nevertheless, we can have two theoretical versions to enrich the application tools.In the discussion that follows, the discretisation approximations will be derived based on both ( 36) and ( 37).
Remark 4.3: So far, we have built the filters based on continuous-time dynamics.However, in practice to compute the estimates q i,j and λ i using the filters in Theorem 4.1, we need to discretise the filtered quantities.By building an approximated discrete-time version of the filters in recursion (see Theorem 4.3 below), it will be more manageable to implement the EM algorithm.See further Grimm et al. (2020) on the issue of implementing continuous-time filters.From the standpoint of data processing, the continuous-time filters are also recursive because the data set is grouped into batches.New filtered updates are produced after an algorithm pass that processes an entire batch of data over the interval [0, t].When a new data set is available, the most recent updates become the initial estimates of the next algorithm pass.
Here, we consider a discretisation 0 where Y is a positive integer.Denote the interval length as so that the discrete-time unnormalised recursion for p process is stable (Elliott & Malcolm, 2005), i.e.
Based on Theorem 4.3, we can also express the robust filter estimates, γ (•)'s, in explicit forms.
Corollary 4.1: For each m = 1, 2, . . ., Y and i = j ∈ {1, 2, . . ., K}, the discretised explicit formulae for the robust filters are expressed as: where by convention Proof: The proof could be completed by mathematical induction; see Appendix H.
Ideally, we could use the expression in Corollary 4.1 to calculate the robust filters and then obtain the estimates of the hidden Markov states and the parameter estimates with Theorem 4.1.However, when the sample size of the data is large, we often find that the values of filter estimates could exceed the computation power for large m.To resolve this problem, we observe our ultimate goal, which is to calculate s t m 's, Q and λ.We find that we could obtain the values of s t m 's by the following iterative formula derived from ( 9) and ( 44), As for Q and λ, we could adjust the formula by multiplying some constants without affecting the ultimate estimates.In particular, we denote the adjusted filter estimates as γ (•) after replacing m,m−1 by Similarly, we have γ . Therefore, we could calculate our target with the adjusted filters, The robust filter-based EM algorithm is summarised in Table 1.

Robust smoother-based EM algorithm
In this subsection, we demonstrate the use of smoother-based algorithm to calculate the E-step.The main results are presented in Theorem 4.4 and Corollary 4.2.In many applications of the EM algorithm, the E-step is completed using the smoothed estimates rather than the online filtered estimates.For example, Shumway and Stoffer (1982) used the conventional Kalman smoothed estimators in the EM algorithm to derive a simple recursive procedure for estimating the parameters by the MLE.The smoothing scheme typically used is referred to as the 'fixed-interval smoother'.It could be difficult to compute the smoothing schemes for MMPPs, for example, Snyder (1972).The major challenge is to develop the backwards dynamics, which usually involves constructing the stochastic integrals evolving backwards in time.However, an applicable solution is to exploit the duality between the forward and backward robust system.Such an approach is used in Elliott and Malcolm (2000) and Elliott et al. (2009) to develop smoothing algorithms without calculating backward stochastic integration.In the following, we shall implement the approach based on duality and construct the EM algorithm for our MMNPP framework.
We start with the vector v t in Definition 3.5, which incorporates the process of extra-information from t to T. Due to the difficulty of computing the dynamics of v t , we consider its dual process v t such that By discretising the backward ODE in ( 23) for v t and reverting v t back into v t based on Definition 3.5, we have With the duality, we derive the smoothed estimates of the desired parameters q i,j and λ i in the following theorem.
Theorem 4.4: The smoother-based update equations for q i,j and λ i are , and (63) Proof: See Appendix I.
Remark 4.4: Note that by using integration by parts, we have (65).The smoothed estimates in Theorem 4.4 are calculated without the stochastic integrals.In applications, we could use the Euler method with the time partitions defined in Subsection 4.1 to approximate the integration involved in ( 63) and ( 64) with the discretised estimates in ( 44) and (62).However, there are situations when such an approximation based on (65) leads to a negative lambda estimation.Therefore, the default choice is to use (64) in estimating λ.
With the Euler discretisation method, we are able to compute the values of the integrals in Theorem 4.4.For example, The rest of the integrals could be approximated similarly.Again, we encounter the same problem that p t m and v t m could go beyond the calculation range.We find that the same adjustment in (57) could still resolve the problem.By ( 50) and ( 62), we have for Next, we define the adjusted estimates and with m = 0, 1, 2, . . ., Y.It is immediate to establish the relation Therefore, by replacing p and v by p and v, respectively, in the integrals in Theorem 4.4, we summarise the above method of discretisation and adjustments in the following corollary.

Corollary 4.2:
The smoother-based update equations for q i,j and λ i are approximated by The robust smoother-based EM algorithm is summarised in Table 2.

Introduction to Algorithm III and comparison
In this section, we shall first introduce the third EM algorithm constructed by Avanzi et al. (2021) in Section 5.1.Then, we compare the three EM algorithms in estimating the parameters of the MMNPP from the theoretical perspective in Section 5.2.

Introduction to Algorithm III
For simplicity, we refer to the robust filter-based and smoother-based EM algorithms as Algorithm I and Algorithm II, respectively.The third algorithm is denoted as Algorithm III.The theory of Algorithm III is adapted from Rydén (1996) who provided a fundamental building block of the EM algorithm for the MMPP model.Avanzi et al. (2021) employed the framework of the MMPP by introducing the operating time function ρ(t) := t 0 η(s)ds and an indicator function of whether there is a change of η(s) at the partitioned time points t m .A brief introduction to the theory behind Algorithm III will be illustrated.For more details, see Avanzi et al. (2021).
Similar to Algorithms I and II in the M-step, Algorithm III calculates the parameters λ and Q using the first equality in ( 29) and ( 28) with quantities J i,j , O i , T i and T * i defined in ( 24)-( 27).However, the method used to calculate the estimators of four quantities is based on the probabilistic perspective.Focusing on Algorithm III, consider the time interval [0, T] with a partition 0 The time points T k 's, are the times of incident arrivals or occurrences of change in the exposure function.The estimators of J i,j , O i , T i and T * i are written as where i, j = 1, 2, . . ., K. To calculate the probabilities in ( 33)-( 76), Avanzi et al. (2021) derived the probability of regime transitions with or without incident arrival at T l , given M T l−1 .To reduce the computational cost, a scaled forward-backward recursive methodology proposed by Roberts et al. (2006) is employed to derive the four estimators recursively.The recursion is built by decomposing [0, T] into disjoint incident inter-arrival times.Algorithm III requires matrix calculations and applies the Padé's method (Moler & Van Loan, 2003) to approximate the matrix exponential.After fitting the parameters, the estimation of hidden regimes is given only at incident arrival times.The technical details of the computation leading to ( 73)-( 76) via the MLE method and the equations cited in Table 3 are provided in Appendix J. Table 3 depicts the steps for Algorithms III.

A comprehensive comparison of the three algorithms
The elements of contrast and comparison amongst Algorithms I, II, and III are multifaceted and stem from their underlying computational and theoretical structures as portrayed in Sections 2-5.1.A summary of these differences and similarities is displayed in Table 4.
(i) Methodological fundamentals.The calculation of the estimators under Algorithm III utilises the probabilities under the real-world measure.This

1:
Set up the initial values for the starting EM algorithm pass l = 0.For 1 ≤ i, j ≤ K, q i,j (0) and λ i (0), Algorithm pass number update: l = l + 1; 4: for k = 1, 2, . . ., x N do 5: Calculate a k , g(k) and h(k) with (J.8), (J.9) and (J.10); 6: end for 7: Calculate J , O, T and T * ; using (J.11)-(J.14);8: Calculate q i,j (l) and λ i (l) with (J.15), where 1 ≤ i, j ≤ K; 9: Update q i,j and λ i used in Step 5; 10: end while 11: Estimate regime s t k at time t k as e i where ith component of g(k) is the largest component, for 1 ≤ k ≤ x N .12: return q i,j , λ i and s t k .emphasises the distributional characteristics of the MMNPP, and the adoption of the MLEbased approach.On the other hand, in Algorithms I and II the complexity of the regimeswitching intensities is addressed by a change of probability-measure technique.In turn, the MMNPP becomes a homogenous Poisson process with unit intensity facilitating the derivations of estimators in continuous time.
(ii) Causality and information integration.In the Estep, the causality of filters is a distinct feature.and II have superior computational efficiency.These advantages will be illustrated using simulated data in Subsection 6.2.2.

Numerical implementation and comparison
In this section, we first present the numerical implementation results of Algorithms I and II.The analysis in Section 6.1 will be based on the data compiled by the U.S. Next, a comparison of three algorithms will be presented in terms of multiple simulated data sets.
In particular, we want to deal with the following questions: (i) How efficient are the algorithms?(ii) What are their computational costs?(iii) How is the accuracy influenced by the size of the data?(iv) How does the partition size m,m−1 influence the accuracy of Algorithms I and II? (v) How does the differences between λ i 's impact the accuracy?(vi) How do the transition rates impact the accuracy?(vii) Could the algorithms identify the correct regime?Note that the algorithm efficiency means that the estimation process is not interrupted and the parameters converge before a specified iteration limit.By computational cost, we mean the time needed to perform one iteration step.The questions raised will be illustrated in Section 6.2.

Numerical implementation of Algorithms I and II
In this subsection, we present the process of implementing the MMNPP using Algorithms I and II to the HHS data.This is an empirical study with detailed steps on how to apply these algorithms to a realworld data set.This segues into the various comparison aspects of the algorithms highlighted in Subsection 6.2.
Before processing the data, we first introduce several notations which will be useful in later analysis.Recall that we could obtain λ and s t m 's by using Algorithms I and II in Tables 1 and 2. Suppose i m is the order of the component that is the largest component of s t m , for i = 1, . . ., T. Given this assumption, the state estimate M t m will be e i m .Denote x m as the observed number of incidents in the mth batch and x m symbolises the corresponding estimates of x m , which is calculated as x m = λ i m η(t m ).For simplicity, note that the representation of x m ignores the change in η for batches that contain a year-end.
In the process of selecting the proper number of hidden regimes K, we shall consider the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC) (Schwarz, 1978), respectively calculated as The log-likelihood function L is expressed as In our HHS data, we apply the simplified L as Furthermore, to measure the estimation accuracy under the MMNPP model with different regime numbers, we perform the residual analysis by using statistics of time-weighted residuals.We shall use (x m − x m )/ m,m−1 rather than the raw residuals as the range of x m generally increase with the partition size m,m−1 .The time-weighted version of residuals could better facilitate the comparison of models with diverse partition sizes.In particular, we shall calculate termaveraged residuals (TAR) averaged absolute residuals (AAR) and root mean squared residuals (RMSR), given by In addition, three statistical tests will be applied to the raw residuals to check the goodness of fit.As we set equal partition sizes in the discussion that follows, it is equivalent to using time-weighted or raw residuals.The details and purposes of the tests are listed as follows: (i) Wald-Wolfowitz (WW) runs test (Wald & Wolfowitz, 1940), which checks whether there is a consistent pattern.The WW's null hypothesis is that the elements of the sequence are mutually independent.(ii) Ljung-Box (LB) auto-correlation test (Ljung & Box, 1978), which evaluates whether the data are independently distributed with zero correlation.Retaining the null hypothesis indicates the independence of the data while rejecting it suggests that the data are not independently distributed and exhibit serial correlation.
(iii) Bartlett's test (Bartlett, 1967), which assesses whether the residuals are white noise.The null hypothesis is that all population variances are equal against the alternative that at least two population variances are different.
Note that the results of the LB test are influenced by the number of lags and significance level.In the following discussion, we shall use a lag of 3, which is recommended in Hassani and Yeganegi (2020) when the length of the time series is below 500.

Data description.
The HHS data set covers 3485 medical data breaches from 2010 to 2020.It includes various aspects of the data-breach incident including the date, the institution name, the number of individuals affected, the incident type, etc.For example, on 18 June 2010, a laptop computer was stolen from the University of Kentucky, which led to the possible leakage of the protected health information (PHI) of approximately 2,027 individuals.The information included demographic information, specifically, names, addresses, dates of birth, social security numbers, other identifiers, and clinical information.Li and Mamon (2023b) presented the pre-analysis of why the MMNPP model is chosen to model the data-breach-incident arrivals from the point of view of the heavier-tailed distribution of inter-arrival times (relative to exponential distribution) and non-stationary incident counts for each 14-day batch.This is based on the assumption that there is no regime switching over each interval, which is reasonable because the underlying regime could represent the state of the cybersecurity environment.It will a much longer time than 14 days to experience the impact (if any) of any reform and regulation amendment and to identify a data breach.
Concentrating only on Algorithms I and II, the detailed computational steps are given in Appendix L. The algorithms are numerically executed with the following essentials: • A time unit of 14 days is selected, thereby establishing the time horizon T as 287; • In fitting the HHS data, we apply Equations ( 51) and ( 71 It is surprising to find that the third components of λ's for both algorithms differ significantly.We also calculate the residual statistics for the two cases in Table 5, where Q 1 and Q 3 are the first and third quartiles.The residual statistics indicate a better fit for Algorithm I and the TAR, median and mean of the residuals are close to 0. Figure 1 delineates the fitting results in the assessment of the estimation accuracy of the two algorithms.The first and second columns display the results of Algorithm I and II, respectively.Observing Figure 1(a,b), a notable overlap is apparent in the hidden states estimated by both algorithms.In   λ k and Q i,j largely remain constant for k, i, j = 1, 2. On the other hand, the maximum differences in absolute-value terms between λ k 's and between Q i,j 's fall below 0.55 and 0.017 for k, j = 3 in both algorithm applications.The estimations are resilient to bias from the outlier with Algorithm II evincing more robustness.Simulated confidence interval (CI) bands under both algorithms are displayed in Figure 1(e,f).Within (50,100), the paths of the accumulated incident counts, indicated by solid purple lines, closely align with the lower bound of the CI bands.The 99% CI band generated by Algorithm I scarcely encompasses the path of N t m , and the 95% CI band does not do well occasionally.For Algorithm II the CI bands perform poorly as N t m even goes outside of the 99% CI for some time periods.
Regarding the quantile-quantile (Q-Q) plots, majosrity of the residuals are aligned with the diagonal lines, except for a few points significantly deviating due to the outlier.The normality of residuals is affirmed via a two-sample two-tailed Kolmogorov-Smirnov (KS) test (Pratt & Gibbons, 1981), wherein the null hypothesis is that the two samples originate from an identical population.With p-values of 0.1588 and 0.0715 for Algorithms I and II, respectively, the residuals from both models exhibit normality.This indicates that the model has sufficiently explained the data structure.Adjustments to the residuals could be achieved by controlling the values of η.Nonetheless, Algorithm I ultimately delivers a superior fit compared to Algorithm II.
Concerning the large incident count during (279, 280], several observations merit discussion.We previously noted that the estimation process is disrupted for large K's when using Algorithms I and II.Neither algorithm could estimate extreme-valued and low-frequency incident counts despite their robustness.In contrast, Algorithm III is operable for large K's.Specifically when K ≥ 5, it produces a λ k that fits the exceedingly low-frequency x m (Li & Mamon, 2023b).This low-frequency challenge will be revisited in Subsection 6.2.6.It could be concluded that both algorithms support a credible MMNPP implementation on the HHS data.Nevertheless, Algorithm I is superior to Algorithm II in the context of goodness of fit.

Comparison of three algorithms
In this subsection, we shall compare the three algorithms via the simulated data.Based on the results of the comparison, our purpose is to provide guidance for the users of three algorithms; see Subsection 6.2.8.
The aspects that will be compared come from the questions raised at the beginning of Section 6.The details could be found in Subsections 6.2.2-6.2.7.Data used for the discussion are simulated using the algorithm constructed in Section 6.2.1.Simulated data sets can better help us discuss the regime-number selection and accuracy as we know the true values of parameters.Besides, we can have sufficient data to resolve the problems raised.

Data simulation
A data set that is generated from an MMNPP is obtained using the algorithm in Table 6.Regarding the inputs of this algorithm, we shall simulate the parameters with uniform distributions.We choose the range of the parameters comparable to the estimation results of the binned HHS data with a batch length of 14 days.The details of the ranges are specified in each following subsection.With the same reasoning, we have the exposure function η(t) increasing by 0.088 every 26 The exposure process defined as η(t) = 1 + 0.088( t/26 − 1); The time horizon T.

1:
Set up the initial state for the hidden Markov process, M t0 = e 1 ; 2: Simulate the Markov chain M t over [0, T] with CMCSimulation function in R package spuRs (Jones et al., 2018) time units, which is analogous to (78), where y calculates the smallest integer value that is at least y.The intercept of η(t) is adjusted so that we have η(t) = 1 for the first 26 time units.Unless specified, let T = 260, which is approximately 10 years if the time unit is a fortnight.For simplicity, we set the interval partition with equal length.Even though Algorithms I and II will not be affected by the intervals with no incident counts, Algorithm III could only provide the hidden regime estimate s t at the incident arrival times.Therefore, we could not have an incident count estimate x t for the intervals with no arrivals using Algorithm III.To make the results more comparable, we add a loop to adjust m,m−1 so that over each interval, we have a non-zero incident count.Note that the comparison is still not a level-playing field to the first two algorithms and Algorithm III as Algorithms I and II only utilise the information incident arrivals after being batched.We shall present a separate comparison between Algorithms I and II with varying m,m−1 .
To resolve the comparison questions, we shall discuss the estimation results based on 200 data sets for each K value.For each algorithm, we set the tolerance values, 1 = 10 −2 and 2 = 10 −3 .Also, we shall force the loop of algorithm pass to end at 200, 300 or 400 for K = 2, 3 or 4, respectively.Ideally, the loop will end when the tolerance value is reached.However, we find that λ(k) and Q(k) may not converge even after 10,000 iterations especially when the tested K is larger than true K.We investigate the features of these non-convergent cases.In normal cases, the components of λ(k) are usually very distinct and appear in ascending order while those of λ(k) in non-convergent cases often do not follow this rule.For example, when the MMNPP model with K = 4 is used for a data set whose true regime is two, λ = (3.68, 3.85, 9.52, 8.53).Therefore, we establish some trigger points to end the estimation iterations when they could not converge.Remark 6.1:The rate or speed of convergence in the EM algorithm is a particularly important consideration.The factors typically affecting this rate include the likelihood function's form, the initial values, and the chosen stopping criteria.From our empirical observations, it is evident that the differences in intensity and the number of regimes primarily impacted the number of iterations, and hence, the rate of convergence.Both of these factors are inseparably linked to the likelihood's functional form.Altering the initial values or the tolerance levels does not notably change the number of iterations.As highlighted earlier, when the difference in intensities is minimal or when the chosen number of regimes exceeds the true value, the algorithms struggle to discern between closely estimated intensities, often resulting in non-convergence.

Algorithm efficiency and computational cost
In this subsection, we address two central questions: (i) How efficient are the algorithms?(ii) What are their computational costs?Efficiency of algorithms.To gauge the efficiency of our algorithms, we simulate the MMNPP data using the model parameters displayed in Table 1 (Appendix K).The ratio between consecutive λ i values, denoted as λ i+1 /λ i , ranges from 1.1 to 3.5.Moreover, the minimum long-run proportion of hidden regimes is set to 3%.
The outcomes are reported in Table 7.A primary consideration is whether the algorithm runs uninterruptedly; this is indicated by the label 'working' or 'not working'.For our 200 data sets, the algorithms's running were rarely interrupted, with 'not working' rates typically under 5%.However, there is a small possibility that the algorithm stops due to the numerical approximation of the theoretical values.This could be attributed to negative λ values, which could happen when the differences between regime intensities are negligible or when the occupation time of a given regime is low.For data sets with these features, parameter estimation could prove challenging.
Similar to our implementation on HHS data, we use ( 51) and (71) when applying Algorithms I and II.We then evaluate if the parameter estimation has convergence within iteration limits and given tolerance thresholds.Based on over half of the simulated data sets, convergence is achieved but not if using Algorithm II with K = 4. Success in estimation is marked by the convergence of sequence of estimates and if the residuals ( x m − x m ) passed three statistical tests, viz., the WW runs, LB auto-correlation, and Bartlett, at a 5% significance level.Algorithms I and III have similar success rates, faring better than Algorithm II, especially for K = 3, 4.Even with the added work of constructing reference probability measures in Algorithms I and II, it is remarkable that Algorithm I performed with more efficiency.For K = 3 and K = 4, Algorithm III converges faster than Algorithm I (convergence rates: 83.5% vs. 99.5% for K = 3; and 54.0% vs. 98.5% for K = 4).The high rate of non-convergence in Algorithms I and II, especially as K rises, could be concerning.For K = 4, the small differences between the λ i values and the longrun proportions of hidden regimes affect the accuracy of estimating parameters in the MMNPP.See further Subsections 6.2.5 and 6.2.6.

Computational cost.
Besides efficiency, the computational cost -quantified as seconds per iterationis a major matter.As shown in Table 8, Algorithms I and II are considerably faster than Algorithm III.This difference in speed could be attributed to their operational design.Whilst Algorithm III deals with each specific arrival time, Algorithms I and II used only the incident counts for each interval.This distinction elucidates why Algorithm III, whilst being the most time-consuming, needs fewer iterations to attain convergence.Based on our simulation-driven analysis, Algorithm I emerges as the algorithm of choice for its ability to combine efficiency and reduced computational demand.

Impact of data size
In this subsection, we confront the next question: (iii) How is the accuracy influenced by the size of the data?We investigate the effect of the data size, symbolised by T, on the accuracy of the estimations for K = 2. Our analysis encompasses data simulations covering 5, 10, and 20 years.
Our findings reveal an overarching trend: As data size grows, state estimation accuracy improves substantially.When extending the data size to 20 years, all algorithms consistently perform with accuracy rates above 80%, with Algorithms I and II's rates even exceeding 90%.Furthermore, Algorithm I distinctly stands out as the most optimal choice for a 5-year data size based on state-estimation rates.
However, a peculiar pattern surfaces regarding λ.Whilst increasing the data size from 5 to 10 years bolsters accuracy, extending it to 20 years surprisingly diminishes accuracy.This decline is attributed to the compounded error in estimating the external exposure function η(t).In real-world applications, the true challenge lies in making a balancing act: enhancing the data size for improved state accuracy and being cautious of the potential distortions from η(t).Our simulation methodologies as well as a comprehensive analysis of our findings are available in Appendix M.

Impact of partition sizes
This subsection tackles the question: (iv) How does the partition size m,m−1 influence the accuracy of Algorithms I and II?
As explained in Section 6.2.1, we alter the partition size to ensure a positive incident count for each interval when using Algorithm III.This constrains the application of Algorithms I and II which have partition flexibility.We examine the empirical impact of partition sizes on Algorithms I and II under K = 2 and T = 260.
Model parameters are presented in Table 1.Ten partition sizes are examined.Then, following the MMNPP model fitting, a residual analysis is conducted.A significant finding is that a narrow partition size creates issues with the algorithm execution.The algorithms' performance under various partition sizes is shown in Table 9.In essence, larger partitions are preferable for Algorithm I and narrower ones for Algorithm II although such partitions are still within reasonable boundaries.
The estimation accuracy with respect to partition size is another point of interest.The Shapiro-Wilk Thus, we recommend fairly large partition sizes for Algorithm I and small ones for Algorithm II, ensuring neither is too broad or narrow.A partition size that is too narrow may lead to the failure of model implementation.The extreme opposite (i.e.too large of a partition size) requires a restrictive assumption that there is no regime switching within the partitioned interval.

Impact of intensity difference
In this subsection, the focus is placed on the question: (v) How does the differences between λ i 's impact the accuracy?
We use simulated data sets with two distinct regimes, i.e.K = 2.It is crucial to note that the intensity value of an MMNPP is inherently tied to the selected time unit.This gives flexibility in modifying the intensity by altering the time unit.For the first regime, an intensity of λ 1 = 5 is assigned.
The convergence of the EM algorithm exhibits variation with the intensity ratios.For ratios 1.1, 1.2, and 1.3, the maximum iterations to achieve convergence are 369, 320, and 201, respectively.In contrast to scenarios where the ratio surpasses 2.0, there would be fewer than 50 passes.This variability in convergence, especially the high non-convergence rate could be seen in Table 7 due to the negligible intensity differences.
The application of three statistical tests to residuals enables an assessment of the estimations' validity.
There is a pronounced decline in the success rate at the ratio 16.0, plummeting to merely 22%.This decreased rate correlates with a high probability of substantial number of incident arrivals within singularly partitioned intervals.The consequence is an excess of computational capacities due to elevated exponential components of φ k m,m−1 .For instance, in a data set generated with an intensity ratio of 16.0, the exponential component of regime 2 reaches 739.92; this leads to an exceptionally large value for φ k m,m−1 .The large incident counts often correlate with big residuals, challenging the non-rejection of the LB test's null hypothesis.Based on our experiment, large intensity differences yield high accuracy rates, but these may pose inherent challenges in algorithm implementation.Conversely, for smaller ratio values, λ 2 λ 1 ∈ [1.1, 2.0], the success rate is generally over 80%.
To study the estimation accuracy further, we present the state accuracy rate medians with a line plot in Figure 2(a).A rapid rise in accuracy, from 55.8% to 80.0%, occurs as the ratio λ 2 λ 1 ranges from 1.1 to 1.5.An added increase of 10% accuracy is evident as the ratio changes from 1.5 to 2.0.However, for the ratios above 2.0, the accuracy improvements dissipate, with only a 0.4% increase noted when increasing the ratio 8.0 to the ratio 16.0.See the box plots of absolute transition rate estimation errors in Figure 2(b,c).These graphs indicate that more robust estimations could result as the ratio grows.We also display the box plots for λ 1 and λ 2 in Figure 2(d,e) with an adjustment to the estimation error of λ 2 based on the λ ratio.The errors for λ 1 are in the range of 0.0-0.5 whilst λ 2 's larger ratios present fewer outliers; most scaled errors are within 0 and 0.75.
In summary, large intensity differences boost state estimation accuracy, with a rate surpassing 90% for ratios of 2.0.Nonetheless, huge intensity differences may bring marginal enhancement in state estimation but could also induce numerical complications.Ratios below 1.2 may render the MMNPP ineffective in state differentiation.This would make the addition of a distinct state superfluous and only cause long convergence times.

Impact of transition rate
In this subsection, we focus on the question: (vi) How do the transition rates impact estimation accuracy?
Our exploration is motivated by the outcomes of Algorithm III using the HHS data.One regime is exclusively associated with a single batch, possessing an occupation time of just 0.37%.Yet, Algorithm I was unable to pin down this nuanced regime, particularly when a large number of regimes was used.To find out whether such an occurrence only happens by chance and to gauge algorithmic performance across a spectrum of intensity rates, we undertake the ensuing analysis.
We begin by removing the data sets that do not reach regime 2 because of that regime's low long-run proportion. Around 85% of the data sets remain, and we find that Algorithm I has higher success rates for long-run proportions of 10% and 5%. However, for the 1% case, Algorithm III shows a higher success rate. This suggests that Algorithm III might be more suitable for detecting regimes with lower occurrences. The next step is to evaluate the quality of the estimations.
Various statistics are calculated for each data set, with the mean values presented in Table 10. For instance, with Algorithm I and an occupation time of 10%, the mean of λ_1 is 5.0079. Normally, the λ_1 estimates are close to the actual value owing to the significant long-run proportion of regime 1. For λ_2, whilst Algorithm III produces an estimate closer to the actual value, it still differs from the true λ_2 by about 4. This suggests that estimating the intensity of a regime that occurs less frequently could be a hurdle with limited data. In terms of hidden regime estimation, Algorithm I has a higher state-estimation accuracy rate. Further evaluations using specificity (true negative rate) and sensitivity (true positive rate), with regime 2 as the positive case, show that Algorithm III has lower values for both metrics. In the scenario with a 1% long-run proportion, Algorithm I identifies about 85% of the instances of regime 2.
To assess the combined effect of estimating both the intensities and the transition rates, we introduce three residual metrics: TAR, AAR, and RMSR. Although Algorithm I might not yield the most accurate estimate of λ_2, its performance under the three residual metrics suggests a more comprehensive estimation.
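The acronyms are not expanded in this section; the sketch below assumes their natural readings, namely the total absolute residual (TAR), average absolute residual (AAR), and root-mean-square residual (RMSR), each computed from the gap between the observed and fitted incident counts.

```python
import numpy as np

def residual_metrics(observed, fitted):
    """Assumed definitions of TAR, AAR and RMSR for count residuals."""
    r = np.asarray(observed, dtype=float) - np.asarray(fitted, dtype=float)
    tar = np.abs(r).sum()             # total absolute residual
    aar = np.abs(r).mean()            # average absolute residual
    rmsr = np.sqrt((r ** 2).mean())   # root-mean-square residual
    return tar, aar, rmsr
```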
Notably, λ_2 is set to 20, which is four times λ_1. This means that regime 2 tends to generate more incident counts. In scenarios with pronounced outliers, it is instructive to understand how these algorithms handle extreme incident counts. Accordingly, we compute three metrics from the residuals of the top 1% of incident counts; see the last three rows of Table 10, where x^{(0.99)} denotes the 99% sample percentile of the incident counts. Across the compared scenarios, Algorithm I consistently outperforms Algorithm III, with the exception of the sum of residuals for a 1% long-run proportion of regime 2. This observation is consistent with the relatively more accurate λ_2 presented in the second row of Table 10.
For a seldom-occurring regime that might be inadequately represented or inaccurately calibrated due to limited data, the following supplementary MMNPP simulation is conducted. We set the parameters to K = 2, T = 260, λ = (5, 10)', q_{1,2} = 0.08, and q_{2,1} = 0.12. Noise is introduced by choosing a random time unit (e.g. (12, 13]) and infusing it with 65 additional incident arrival times. This simulates an outlier in the incident count during the given time unit, thereby replicating the HHS data pattern. When applying Algorithms I–III with K = 3 to the data set, only Algorithm III consistently detects the outlier with the added regime; this holds across different random seeds. Such a phenomenon prompts reflection on the risks of overfitting when outliers might be introduced erroneously. If these outliers are genuine, Algorithm III appears to have an advantage.
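To make the supplementary experiment reproducible in outline, the following sketch simulates the stated two-regime process and injects the 65-arrival outlier. It assumes time-homogeneous intensities (no exposure component) and illustrative function names; the paper's own routine is the one summarised in Table 6.

```python
import numpy as np

def simulate_mmpp_counts(lam, Q, T, rng):
    """Counts per unit interval of a two-regime MMPP.

    lam: (K,) regime intensities; Q: (K, K) generator; T: integer horizon.
    Counts are drawn as Poisson with the intensity integrated exactly over
    each unit interval along the simulated chain trajectory.
    """
    K = len(lam)
    t, state = 0.0, 0
    integrated = np.zeros(T)          # integral of the intensity per interval
    while t < T:
        rate = -Q[state, state]
        dwell = rng.exponential(1.0 / rate) if rate > 0 else T - t
        end = min(t + dwell, T)
        a = t
        while a < end:                # spread the dwell over unit intervals
            i = int(a)
            b = min(end, i + 1.0)
            integrated[i] += lam[state] * (b - a)
            a = b
        t = end
        if t < T:                     # jump to the next regime
            probs = np.maximum(Q[state], 0.0)
            state = rng.choice(K, p=probs / probs.sum())
    return rng.poisson(integrated)

rng = np.random.default_rng(0)
lam = np.array([5.0, 10.0])
Q = np.array([[-0.08, 0.08], [0.12, -0.12]])
counts = simulate_mmpp_counts(lam, Q, 260, rng)
counts[rng.integers(0, 260)] += 65    # inject the 65-arrival outlier
```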
In other words, when dealing with a rarely occurring regime that is not treated as an outlier to be discarded, Algorithm I generally offers a more efficient and meaningful estimation than Algorithm III. However, for the precise estimation of the associated intensity, Algorithm III seems to have an edge. Indeed, there is an inherent challenge in capturing or accurately calibrating the model when a regime appears infrequently and data availability is an issue.

Selecting the number of regimes
In this subsection, we shall answer the last question: (vii) Could the algorithms identify the correct number of regimes?
Specifically, we assess the ability of the three algorithms to determine the correct number K of regimes using either the AIC or the BIC. For our analysis, only K = 2 and 3 are considered. Many studies in various fields have shown that these smaller numbers of regimes could suitably fit various data sets (e.g. Burzykowski et al., 2003; Chang et al., 2011; Muscariello et al., 2005; Ramesh, 1998).
The data used for this evaluation are grouped into two categories. The first category, termed the random group, is produced from the parameters given in Table 1. This group has a λ ratio varying from 1.25 to 2.23, with the smallest long-run proportion being 7.8%. The second category, termed the fixed group, utilises predetermined parameters. For K = 2, the parameters are set as λ_1 = 5, λ_2 = 10, q_{1,2} = −q_{1,1} = 0.08, and q_{2,1} = −q_{2,2} = 0.12. For K = 3, we have λ = (5, 10, 20)' with q_{i,j} = 0.05 for 1 ≤ i ≠ j ≤ 3. For each scenario, such as the fixed group with K = 2, we simulate 400 data sets.
In order to analyse accurately the efficiency of correctly selecting the number of regimes K across the 400 simulated data sets, we develop a consistent procedure using the AIC or BIC for each data set. Specifically, when the true K is 2, parameters for K = 2, 3, and 4 are estimated using the algorithm in question. We exclude K ≥ 5 as the majority of the estimation processes in these cases fail to converge within 200 algorithm passes. Similarly, for a true K value of 3, the potential K values considered are between 2 and 4. Illustrating the procedure using the AIC and Algorithm I when K = 2, our initial step involves fitting the 400 simulated data sets using Algorithm I for K = 2, 3, and 4. Subsequently, for each data set, we discard the candidate K values where the algorithm fails, that is, when the associated residuals show significant evidence leading to the rejection of the null hypothesis. The K value that corresponds to the smallest criterion value among the remaining candidates is then selected. The results, presented in Table 11, indicate good performance with respect to the BIC across all three algorithms, with most regime-selection rates surpassing 66%, except in the random group with K = 3. This exception might arise from the small intensity differences and the reduced long-run proportion in the random group. For the data sets in the random group with K = 3, the AIC is a better criterion than the BIC, whose selection rate falls below 52%. The AIC and BIC are comparable for Algorithm III.
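The selection rule can be summarised in a few lines. The sketch below assumes the usual definitions AIC = 2p − 2 log L̂ and BIC = p log n − 2 log L̂, with p = K + K(K − 1) free parameters (K intensities plus the off-diagonal transition rates) for a K-regime fit; candidate K values whose residual tests reject the null are discarded before minimisation, as described above.

```python
import numpy as np

def select_K(fits, n, criterion="BIC"):
    """Pick K by AIC/BIC among candidates whose residual tests pass.

    fits: dict mapping a candidate K to (loglik, residual_tests_passed).
    n   : number of observations (partitioned intervals).
    """
    scores = {}
    for K, (loglik, passed) in fits.items():
        if not passed:                     # residual-based rejection
            continue
        p = K + K * (K - 1)                # intensities + transition rates
        penalty = 2 * p if criterion == "AIC" else p * np.log(n)
        scores[K] = penalty - 2 * loglik
    return min(scores, key=scores.get) if scores else None
```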
Thus, for data sets that could be termed 'decent', characterised by significant intensity differences and non-trivial long-run proportions, the BIC emerges as a reliable tool to determine the number of regimes K = 2, 3; for Algorithm III, the AIC is also effective. However, for less 'decent' data sets, both the AIC and the BIC may be less reliable.

Implications from the comparison
Based on our findings and the comprehensive comparison detailed in Subsections 6.2.2-6.2.7 and 5.2, we outline below some guidelines on algorithm selection under various scenarios.
Base case: specific incident-arrival times and a small number of regimes. Consider a data set in which specific incident-arrival times are available. A small number of regimes, either two or three, could give a valid fit that is adequate for many disciplines. For instance, in actuarial science (Guillou et al., 2013), a two-state MMPP was chosen to model industrial damages from fire incidents. When the number of incidents is not massively scaled and each regime's occupation time exceeds 10% with an intensity ratio in the interval [2.0, 8.0], all three algorithms provide satisfactory results. Although the use of the BIC facilitates the selection of the number of regimes for any algorithm, Algorithm III is recommended when computational cost is not a barrier; Algorithm III is resilient to interruptions caused by negative λ's.

Handling large-scale incident counts or quick parameter updating.
When dealing with the modelling of large-scale total incident counts, or in situations demanding quick parameter updates with new data, Algorithm I stands out. This is due to its efficiency and residual statistics (i.e. the TARs, AARs, and RMSRs plotted in Figure 1). Although larger partition sizes could enhance accuracy, it is assumed that no regime switching occurs within a partitioned interval. Other tools, such as simulated CI bands and Q–Q plots, could also strengthen the definiteness of the selection decision. Whilst larger data sizes generally enhance estimation results, it is crucial to scrutinise the external exposure η to ascertain whether the estimation error magnifies over the long term.

Challenges with extremely low long-run proportions.
In a scenario where a regime is associated with an extremely low long-run proportion (i.e. 1% or less), Algorithm I is preferred for its superior accuracy in estimating the hidden regimes and for its superb residual statistics. Algorithm III should be considered as well owing to its notable ability to produce a better intensity estimate in this particular scenario. The conventional regime-number selection tools, namely the AIC and BIC, may fail in this scenario, so a thorough analysis, including residual and CI band plotting, becomes essential. In Li and Mamon (2023b), which deals with the HHS data implementation using Algorithm III, the estimated occupation time of one regime is 0.37%; surprisingly, Algorithms I and II are ineffective in identifying such a low long-run regime even with an increased number of regimes.
Without a known true model as a benchmark for direct comparison, assessing the overall estimation efficacy across algorithms may not yield unequivocal conclusions. Algorithm III, whilst demonstrating practical utility in certain scenarios, must be applied with care in order to mitigate potential overfitting; see Subsection 6.2.6. The model must be chosen in such a way that it reasonably adapts to the main characteristics of the input data; choosing a suitable model and preventing overfitting will safeguard predictive accuracy when the model is applied to unseen data.
Absence of specific arrival times. Should all conditions remain the same as the base case other than the absence of specific arrival times, Algorithms I and II are favoured. These algorithms tend to do better despite the uncertainties brought about by the unavailability of specific arrival times. Algorithm I would be the preferred choice, as illustrated in Table 7, if superior algorithm efficiency is of prime importance.

Dealing with data having a high number of regimes.
For data sets with a high number of regimes (i.e. K ≥ 4), Algorithm III appears to be suitably applicable to a broader spectrum of data types. Algorithms I and II may get entangled in an endless loop if the intensity differences are not substantial enough. Although Algorithm III could produce estimates within a restricted number of iterations, utilising a high-regime model may be unnecessary given the trade-off between complexity and accuracy as K increases.

Intensity difference considerations.
With respect to intensity differences, a high intensity ratio aids parameter estimation. However, extremely large differences might necessitate time-unit adjustments to accommodate the computational power demanded by Algorithm I. When the intensity ratio drops below 1.5, the accuracy of both the regime-number selection and the hidden regime estimation is adversely impacted.

MMNPP model considerations.
Empirical evidence highlights certain limitations of the algorithms, particularly in the MMNPP's accurate estimation of the incident counts. In numerous instances, the levels of the residuals are notably large, often spanning the range −10 to 10, with standard deviations around 3 for T = 260 and K = 2. An intriguing observation is the significant reduction in the daily estimation error when the time unit is reverted to a daily unit from a 14-day period. Furthermore, using term-averaged incident counts could mitigate the observed errors, but the length of the averaging period and the data's stylised facts must be carefully considered to optimise the MMNPP's modelling performance.
The application of the MMNPP to various counting processes should be strategically aligned with the modeller's purpose. In economic contexts, hidden states are commonly used to model a dynamic economic environment, and such hidden states are frequently modelled using two or three states. The MMNPP enables the inference of these hidden states from incident counts. Our algorithmic approach for the MMNPP estimation achieves an impressive state estimation accuracy of up to 96% when the intensity ratio hits 3.0. With the MMNPP, the number of incidents across specific periods could be aggregated, as this would not demand exact intensity estimations, as shown in Figure 1. Yet, it must be noted that there is a potential for deterioration in the estimation of the intensities owing to the bias in estimating η(t), especially with large data sets. There are situations, though, when the MMNPP does not perform well; this happens in particular when the precise incident numbers in each subinterval are sought and when the incident count per partition is very high, leading to non-negligible estimation errors.

Remark 6.2: Whilst the MMNPP has proven to be an effective modelling tool under certain conditions, there are instances when its performance may be suboptimal. This is especially the case when the precise incident numbers in each subinterval carry a lot of uncertainty. The guidelines and considerations articulated herein are meant for a systematic selection of a model and algorithm, which must be tailored to the data set and the modeller's objective.

Conclusion
In this paper, we applied the change of reference probability measure to the MMNPP considered in Avanzi et al. (2021), whose intensity process is designed to capture both cyclical and nonrecurring trends. By modifying the EM algorithm framework of the MMPP proposed by Elliott and Malcolm (2008), we derived the filters and smoothers for parameter estimation and established the corresponding EM algorithms. We also improved the two algorithms by scaling the transformation-pertinent matrix Φ_{m,m−1}, which resolves the numerical problem for large data sizes without affecting accuracy.
Our proposed filter-based and smoother-based EM algorithms (i.e. Algorithms I and II) were compared with the MLE-based EM algorithm (referred to as Algorithm III) from a theoretical perspective. The differences mainly emerged in the E-step. The first two algorithms avoid the complication caused by the hidden regimes via the change-of-measure technique, in contrast to Algorithm III. Also, the filters in the E-step are causal for Algorithm I and non-causal for Algorithms II and III; this affects the efficiency of updating the parameters with new information. Algorithms I and II partition the time horizon [0, T] differently in the recursive calculation of the E-step. The partition is more flexible and allows a finer interval length for Algorithms I and II, whilst the partition points of Algorithm III are incident arrival times or change points of the external exposure. This implies that our proposed two algorithms could be carried out knowing only the incident counts over certain periods but not the specific jump times.
We illustrated the implementation of our two algorithms using the HHS data and compared the results with those of Algorithm III. Within our real-data example, the two algorithms supported an MMNPP implementation, with Algorithm I performing better than Algorithm II. These algorithms, however, have calibration difficulty when incident counts are large and occur with extremely low frequency. Such a calibration problem does not affect Algorithm III, but the precaution must be taken that Algorithm III could have an overfitting issue.
A simulation study was conducted to compare the algorithms under several criteria. As revealed by this study, Algorithms I and II are recommended if the following are of prime consideration: efficiency, quick parameter updates with new information, and low computational cost when the number of regimes is limited (e.g. 2 or 3). The last consideration is important in modelling processes in network theory (Muscariello et al., 2005), rainfall arrivals (Thayakaran & Ramesh, 2013), and extreme climate events (Guillou et al., 2015). On the other hand, Algorithm III is recommended for data with a large number of regimes and relatively closer intensities. Regime-number selection based on the BIC is plausible for data sets with relatively big intensity differences and non-extreme long-run proportions. With large data sizes, better estimates could be obtained for the transition matrix of the underlying Markov chain. However, the impact on intensity parameter estimation is unclear, and the accuracy of the exposure η(t) must be factored in. In our example, Algorithm I favours greater partition sizes whilst Algorithm II prefers the opposite. If the data set contains a regime occurring with low frequency, Algorithm I provides a better overall estimation.
• Employing the quasi-Poisson GLM, we compute η(i) for year i using η(i) = −173.8819 + 0.08751323 × year_i. (78)
• The tolerance parameters are ε_1 = 10^{−3} and ε_2 = 10^{−5};
• The initial parameters q_{i,j}(0) and λ_i(0) are provided in Appendix K;
• The partition size Δ_{m,m−1} is assigned the values of 1.0 and 0.7 for Algorithms I and II, respectively;
• The number of regimes, K, is set to 3 for both algorithms.

Implementation results. The parameter estimates of Algorithm I are given by λ = (3.805, 5.281, 8.048)',
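For completeness, the reported settings above can be collected in a small configuration sketch. The function below simply evaluates the affine fit in (78); the coefficient values are those reported, whilst the variable and key names are illustrative rather than taken from the paper's code.

```python
def eta(year):
    """External exposure for year i, per Eq. (78) (quasi-Poisson GLM fit)."""
    return -173.8819 + 0.08751323 * year

settings = {
    "tolerances": (1e-3, 1e-5),    # epsilon_1 and epsilon_2
    "partition_size": {"Algorithm I": 1.0, "Algorithm II": 0.7},
    "num_regimes": 3,              # K for both algorithms
}
```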

Figure 1. Fitting results of the MMNPP-3 on the HHS data using Algorithms I and II.

In Figure 1(c,d), most of the raw residuals scatter randomly about the red dashed horizontal axis, albeit with one residual point exceeding 30 in each plot. The actual incident count during (279, 280] registers at 58, which is deemed an outlier. Substituting …

Figure 2. Analysing the influence of the differences between intensity rates on accuracy under Algorithm I.

Table 1. The robust filter-based EM algorithm summary.

Table 2. The robust smoother-based EM algorithm summary.

Table 4. A comparative summary of the three algorithms.
Algorithm II, being smoother-based, demands additional work during updates. This is due to the calculation of the extra-information process v_t defined in Definition 3.5, which requires the availability of information from t to T. The updates to the smoother-based parameters could be achieved using the methods by Elliott and Malcolm (2008).
Specifically, Algorithm I employs causal filters and uses solely prior information. It adaptively incorporates new data, as illustrated by the recursive expressions involving γ(·) in Theorem 4.1. In contrast, Algorithms II and III utilise non-causal filters, which complicate the updating stage with new information.
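The causal updating can be illustrated schematically. The snippet below is a generic discrete-time sketch, not the γ(·) recursion of Theorem 4.1: the filtered regime distribution is propagated through exp(QΔ) over one partition interval and reweighted by the Poisson likelihood of the newly observed count, so every update draws on past information only. The likelihood assumes no regime switch within the interval, consistent with the partitioning assumption noted earlier.

```python
import numpy as np
from scipy.linalg import expm
from scipy.stats import poisson

def causal_update(p_prev, Q, lam, delta, count):
    """One causal filter update over a partition interval of length delta."""
    pred = p_prev @ expm(Q * delta)         # propagate the hidden chain
    lik = poisson.pmf(count, lam * delta)   # per-regime count likelihood
    post = pred * lik
    return post / post.sum()                # normalised filtered probabilities

# Example: two regimes, unit partition, an observed count of 7
Q = np.array([[-0.08, 0.08], [0.12, -0.12]])
p = causal_update(np.array([0.5, 0.5]), Q, np.array([5.0, 10.0]), 1.0, 7)
```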

Table 5. Residual statistics of the final MMNPP model under Algorithms I and II with the HHS data.

Table 6. The algorithm for MMNPP data simulation.
Algorithm: simulate MMNPP data. Input: the intensity parameter λ and the transition matrix Q;

Table 9. Success rates for various partition sizes.

Table 11. Rates for correctly selecting the number of regimes.