On regime changes of COVID-19 outbreak

The COVID-19 pandemic had a severe impact on societies, causing large-scale economic disruption and a high death toll worldwide. The first cases were detected in China, but the virus soon spread quickly worldwide, and the intensity of newly reported infections grew rapidly almost everywhere during this initial period. Later, despite all imposed measures, the intensity shifted abruptly multiple times during the two-year period between 2020 and 2022, causing waves of very high infection rates in almost every part of the world. To address this problem, we model the data heterogeneity as multiple consecutive regime changes. The study develops a model based on automatic regime change detection combined with the linear birth-death process for long-run data fits. The results are empirically verified on data for 38 countries and US states for the period from February 2020 to April 2022. Finally, the properties of the initial phase (conditions) of infection development are studied.


Introduction
The initial information about COVID-19 was scarce and intimidating at the beginning of 2020. Not surprisingly, public interest focused on almost real-time infection spread trackers and prediction tools. Most of them were based on existing epidemic modelling theories, incorporating, separately or in combination, specific features such as spatial details or realistic population mixing structures, individual-based network models, and simple SIR-type models that incorporate the effects of reactive behaviour changes or inhomogeneous mixing [11].
There is also a long tradition of using stochastic epidemic models to simulate and predict the transmission dynamics of infectious diseases. They are important when the number of infectious individuals is small or when the transmission and recovery rates vary with time for various reasons. Very popular stochastic models in epidemiology are continuous time Markov chains (CTMCs), stochastic differential equations (SDEs) and branching process approximations. A detailed review of stochastic epidemic modelling is available in [2,14].
To implement a real-data stochastic application, we use the theory of branching processes for modelling the COVID-19 outbreak [5,16,30]. Estimates of the extinction probability based on them have been applied frequently to populations, genetics, cellular processes, and epidemics on networks. Some methods for the calculation of disease extinction thresholds in deterministic and stochastic models are summarised in [3]. They estimate the probability of a major outbreak for the susceptible-infectious-recovered (SIR) model when the population size is large and a small number of infectious individuals is considered.
Important assumptions when using branching processes are that each infected individual (usually called a particle in branching theory) spreads the disease (gives birth) independently of others and that everyone has an equal probability of getting infected. For a small number of infectious individuals and a large population size, these assumptions are realistic, and similar approximations are very good in many cases. Another important reason for selecting a stochastic solution is the lack of knowledge about population immunity, in line with early assumptions [28].
Another possible approach is based on multi-type branching processes with inhomogeneous Poisson immigration [26]. The advantage of this model is that the random arrival of infected persons makes a serious contribution to infection spread. However, the branching process reduces to a single type with immigration when modelling COVID-19 outbreaks during the initial period outside Wuhan. The branching reproduction starts after multiple independent arrivals of infected index patients. Local clusters then emerge from them after a time delay depending on the incubation period, estimated within the range of two to nine days with 95% confidence [22].
This dominance of immigration continues until the infection gains local dynamics and surpasses the imported cases. To describe this inhomogeneous dynamic, caused by different social factors, the data are split into consecutive intervals. The initial conditions for every interval can be considered together with the immigration. They cannot change the critical parameter of the branching reproduction, but they impact the extinction probabilities [24,32].
Finally, public reactions and measures against the spread of contagious diseases are an important part of the response to them. The actions may include quarantine at a local and national level, border closures, physical distancing, etc. In the data models, all preventive actions are treated as constraints on free infection spread. An estimate of the intervention effect could be obtained by a stochastic SIR network epidemic model with preventive dropping of edges [7].
In accordance with all these assumptions, we designed and implemented a computational model for automatic fit and short-term prediction of COVID-19 infection, incorporating all significant changes in ambient conditions and natural trends. It is based on the linear birth-death process X(t), t > 0, starting from one particle, X(0) = 1, [32]. It is combined with change-point theory to implement dynamic detection of the changes in infection spread rates. The model is calibrated and verified on data from 38 countries. Finally, empirical explanations for the detected changes in initial conditions and infection dynamics are sought.

Data modelling
In general, the branching process Z(t), t ≥ 0, Z(0) = Z_0, with probability generating function (p.g.f.) F(t, s) = E[s^{Z(t)} | Z(0) = Z_0], |s| < 1, F(t, 1) = 1, is defined by the initial condition, the lifetime of particles and the offspring number with p.g.f. h(s), |s| ≤ 1. Suppose the p.g.f. U_0(s) describes the number of particles Z_0 = Z(0) at the initial moment (say, today). The branching intensity depends on the lifetime of particles. Given the independence of particle evolution, the assumption that the lifetime is exponentially distributed with parameter K guarantees the Markov property of the branching process Z(t). This means that the entire past information of the process up to the time moment t > 0 is contained in its current state Z(t). Thus, the probabilities of future events are completely determined by the present state of the process: if its current state is known, its past behaviour provides no additional information for determining the probabilities of future events.
The linear birth-death process X(t), t > 0, X(0) = 1, provides locally the best explanation of the branching evolution starting with a single particle. The probability of a birth (successful transmission) by an infectious individual is denoted by p, 0 < p < 1; each lifetime ends either with the birth of two daughter particles (with probability p) or with the demise of the parent (with probability 1 − p). The offspring number is therefore defined by a random variable η with p.g.f. h(s) = (1 − p) + p s^2.
The ultimate extinction probability is traditionally denoted by q := lim_{t→∞} P(X(t) = 0), and the mean of the infinitesimal offspring number is m = f′(1) = K(2p − 1), where f(s) = K(h(s) − s) is the infinitesimal generating function. The mathematical expectation is then E[X(t)] = e^{mt}. A branching process X(t) is classified as supercritical if m > 0 (1/2 < p < 1), critical when m = 0 (p = 1/2), and subcritical for m < 0 (0 < p < 1/2). The extinction probability of the linear birth-death process up to a finite time 0 < t < ∞ is given by the value P(X(t) = 0) in its explicit form [32].
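The classification above can be checked numerically. The following Python sketch (illustrative only, not the paper's R implementation) simulates the linear birth-death process through its embedded exponential event clock and compares the empirical mean with E[X(t)] = e^{mt}; all parameter values are arbitrary.

```python
import math
import random

def simulate_birth_death(p, K, t_max, rng):
    """One path of the linear birth-death process X(t), X(0) = 1.
    Each particle lives Exp(K); an event is a split into two
    particles with probability p or a death with probability 1 - p."""
    x, t = 1, 0.0
    while x > 0:
        # total event rate is K * x, so the next event time is exponential
        t += rng.expovariate(K * x)
        if t > t_max:
            break
        x += 1 if rng.random() < p else -1
    return x

rng = random.Random(1)
p, K, t_max, n = 0.6, 1.0, 2.0, 20000
m = K * (2 * p - 1)                       # infinitesimal mean, here 0.2 > 0
paths = [simulate_birth_death(p, K, t_max, rng) for _ in range(n)]
print(sum(paths) / n, math.exp(m * t_max))  # empirical mean vs e^{mt}
```

With p > 1/2 the process is supercritical, and the empirical mean over many paths approaches e^{mt} ≈ 1.49 for these parameters.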
Due to the Markov property, the information on the first time interval of approximation t ∈ [0, t_1) is provided by (U_0, h_0, K_0). Thus, for the inhomogeneous branching process Z(t), t > 0, Z(0) > 1, the information on the first time interval depends on the epidemic's evolution, i.e. the probability of a successful transmission is p = p_0 in the time interval 0 ≤ t < t_1 and K = K_0. Then the medical research specialist can recalculate the new parameters (U_1, h_1, K_1) and the new transmission probability for the next interval of approximation, where U_1(s) = E[s^{Z(t_1)}] describes the real data at the time moment t_1 > 0. The p.g.f. U_1(s) and the random variable Z(t_1) define the initial conditions for the next interval of approximation, t ∈ [t_1, t_2). Also, the intensity of reproduction and the offspring number can differ from one interval of approximation to another, given respectively by (h_1, K_1) on the interval t_1 ≤ t < t_2, and so on.
An important part of any pandemic onset is its development during the first days of contagion. In the case of purely imported infection, the local branching process begins only with immigration from outside and depends on the incubation period. During the first days, the pandemic evolution is dominated by the arrival intensity, which defines the initial conditions of the newborn branching process. This initial situation must be taken into account because the first COVID-19 virus variant had a rather long incubation period and all initial cases around the world, except in China, were imported.
Appropriate probability distributions for modelling the initial conditions are the Negative Binomial (N-B) and Poisson (Po) distributions. Conveniently, the subsequent development in the case of the linear birth-death process is defined in [24,32]. The N-B distribution may better explain the stochastic fluctuations, mainly due to the dominance of super-spreading contagion [23]. However, because the initial immigration of infected persons from China was a limited event compared to the overall population and passenger traffic until the end of February 2020, the Poisson distribution is selected as the default.
The most commonly used empirical estimator of epidemic dynamics is the basic reproduction number, R_0, known also as a threshold in deterministic epidemic theory. It is equivalent to the expected number of newborn particles in branching processes. This parameter predicts a disease outbreak if R_0 > 1. Epidemiologists calculate R_0 by tracing data on individual-level contacts at the onset of the epidemic. It is computed by averaging the number of test-confirmed secondary cases over many diagnosed individuals. Not surprisingly, this deterministic approach requires a complete history, and it is not a reliable real-time measure of the development of the outbreak due to inevitable sampling bias. Bias could arise from different factors, such as underestimation of contact patterns, regional and local factors, methods and errors in testing, etc. A possible solution could be found in more innovative and complex modelling, as in [25,31].
In addition, we remark that the basic reproduction factor R_0 is a deterministic constant and does not reflect changes in behaviour and social restrictions. For this reason, it is more appropriate to use the effective reproduction factor R_t at any moment t > 0 to model inhomogeneous development. However, the population growth rate r_t per unit of time is selected as a more convenient empirical measure. It is widely used in demographics and is already known in the theory of branching processes [16]. For infection occurring in the first interval of approximation of the linear birth-death process with mean offspring number m, the relation is R_t = e^{mt} = 1 + r_t. Empirically, r_t is obtained from time series data by B1.
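As a minimal illustration of the growth rate series (the paper's exact estimator B1 is defined in the appendix and is not reproduced here), one common convention computes r_i from consecutive cumulative counts, so that R_i = 1 + r_i:

```python
def growth_rates(cum):
    """Daily growth rates r_i = (C_i - C_{i-1}) / C_{i-1} computed
    from a cumulative case series; the effective factor is 1 + r_i."""
    return [(b - a) / a for a, b in zip(cum, cum[1:]) if a > 0]

# hypothetical cumulative counts over five days
cum = [100, 120, 150, 180, 198]
r = growth_rates(cum)
print(r)  # [0.2, 0.25, 0.2, 0.1]
```

The resulting series (r_1, r_2, ...) is the input to the change point analysis described below.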
With a prolonged pandemic, the daily growth rate series (r_1, r_2, ...) exhibits the trend and seasonality typical of time series. A possible local fit of a similar inhomogeneous birth-death process with linear time-dependent birth and death rates can be obtained by applying a generalised Gompertz growth model [4]. Another possible approach is to use an Autoregressive Integrated Moving Average (ARIMA) model, as in [1,13]. However, for inhomogeneous data, regime changes over time cause temporal changes of the ARIMA model parameters, as in [33].
To construct an automatic computational tool, the regime changes in the growth rate series have to be detected first. They are identified by applying change point analysis. Initially, the cumulative sum control chart (CUSUM) method was used at the early stage of the process. It is a very efficient non-parametric test to detect small shifts in the mean of a process, introduced by E. S. Page in 1953–1955. Due to its simplicity, it is widely used in many different change point applications. In epidemics it has been successfully used in the early stages of infections [36]. It works as an established continuous inspection scheme to detect an unknown location parameter k where the mean value μ of i.i.d. x_i, i = 1, ..., n, changes significantly. In statistical terminology, this means testing the null hypothesis H_0 of a constant mean against the alternative H_A that there exists 1 ≤ k ≤ n at which the mean before and after k differs. The standard methods are based on properly standardised cumulative sums (CUSUMs) S_k = x_1 + ... + x_k. The null hypothesis is rejected if a statistic W_n derived from the cumulative sums is too large.
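A compact sketch of the standardised CUSUM statistic described above (the paper's actual computations use the R changepoint library; this Python version is illustrative only, applied to hypothetical growth rate increments):

```python
import math

def cusum_stat(x):
    """Standardised CUSUM statistic for a single change in mean:
    W_n = max_k |S_k - (k/n) S_n| / (sigma_hat * sqrt(n)),
    returned together with the maximising location k."""
    n = len(x)
    mean = sum(x) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    s, total = 0.0, sum(x)
    best_w, best_k = 0.0, 0
    for k, v in enumerate(x[:-1], start=1):
        s += v
        w = abs(s - k * total / n) / (sigma * math.sqrt(n))
        if w > best_w:
            best_w, best_k = w, k
    return best_w, best_k

# hypothetical increment series: flat, then shifted upwards after k = 8
flat = [0.1, -0.2, 0.0, 0.1, -0.1, 0.2, 0.0, -0.1]
shifted = flat + [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
w_flat, _ = cusum_stat(flat)
w_shift, k = cusum_stat(shifted)
print(w_flat, w_shift, k)  # the shifted series gives a much larger W_n at k = 8
```

In practice the rejection threshold for W_n comes from the relevant limit theorems; the sketch only locates the most likely change point.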
There are different computational modifications for the estimation of W_n, generally following the original definition. The software selected in this work relies on the likelihood-style ratio of the CUSUM functionals, instead of the differences, for the estimation of the magnitude in the computation of W_n. It is proposed, and the required limit theorems are proved, in [12]. The computational implementation relies on the changepoint library in the R statistics environment [20].
The CUSUM method was applied to the stationary time series of daily growth rate changes, r_i − r_{i−1}, i = 1, ..., k, ..., n, without any distributional assumptions [20]. It works well at the very beginning of the outbreak, when the data are highly variable, small in size, and hence clearly non-normal. However, in the later stages of process development, as the number of segments grew, CUSUM began to fail: it produced bias towards the early stage of the outbreak, neglecting the later development. Empirical attempts with other change point tests for the mean value did not produce any improvement.
For this reason, the conceptual hypothesis is changed: the regime changes are assumed to be simultaneous changes in the mean and variance of independent normally distributed increments. The dedicated computation again relies on a likelihood-ratio procedure penalised by information criteria [10]. The software implementation is based on the function cpt.meanvar from the changepoint package in R [20]. The advantage of this computational library is the implementation of several methods for optimising the multiple segmentation process for large data. The available methods are Pruned Exact Linear Time (PELT), Binary Segmentation and Segment Neighbourhood [6,18,21,29]. The At Most One Change (AMOC) method is also implemented, but it is mainly associated with CUSUM. The previous three provide a useful tool for dealing with large data sets with many change points.
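The joint mean-variance criterion can be sketched with a single-split normal likelihood-ratio scan. This is an illustrative Python analogue of what cpt.meanvar does for one change point (AMOC-style), not the penalised multi-segment PELT/BinSeg machinery of the R package; the data are hypothetical.

```python
import math

def nll(x):
    """Segment cost: n * log(sigma_hat^2), i.e. twice the negative
    normal log-likelihood up to constants, with MLE mean and variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return n * math.log(max(var, 1e-12))  # guard against zero variance

def best_meanvar_split(x, min_seg=2):
    """Best single change point in mean and variance jointly,
    by maximising the likelihood-ratio gain over the split index."""
    n = len(x)
    whole = nll(x)
    best_gain, best_k = 0.0, None
    for k in range(min_seg, n - min_seg + 1):
        gain = whole - (nll(x[:k]) + nll(x[k:]))
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_gain, best_k

# hypothetical series: calm low-variance regime, then a volatile wave
calm = [0.0, 0.1, -0.1, 0.05, -0.05, 0.1, 0.0, -0.1]
wave = [1.0, 2.5, 0.5, 3.0, 1.5, 2.8, 0.8, 2.2]
gain, k = best_meanvar_split(calm + wave)
print(gain, k)  # split detected at k = 8, the calm/wave boundary
```

In the penalised multiple-change setting, a split is accepted only when the gain exceeds an information-criterion penalty, which is what PELT and Binary Segmentation optimise efficiently.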
After all regime changes are detected, the data in the segments between two consecutive change points are fitted with new process parameters. Because the length of the segments varies randomly, starting from a few elements, a precise real-time application of ARIMA is not feasible. However, the Markov property of the linear birth-death process Z(t), t > 0, Z(0) > 1, enables a quick regime switch, allowing real-time implementation in the segment time window T = T_i, i = 1, ..., n. The linear birth-death process in the selected segment is determined by the average probability estimate p_T. It is obtained directly from the local mean rate parameter r_T in the studied interval T by Equation (1). The formula is derived from the ultimate extinction probability q in the critical and supercritical process. In the latter case, the probability is centred at 1 + r_t = 2, requiring L = 2 to obtain e^{−r_T L} = q in Equation (1). Finally, an additional correction of L is introduced to adjust the probability estimate to an optimal value. It is a country-dependent small constant c, computed by iterative optimisation based on minimisation of the mean square error (MSE), B2. Another possible option for data adjustment is correction of the branching intensity parameter K = K_t, t = t_i, after every regime change. Both corrections are simple and computationally cheap substitutes for the blending method adopted in demographics, i.e. passing from one curve to another [8].
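One consistent reading of Equation (1), implied by e^{−r_T L} = q together with the supercritical extinction probability q = (1 − p)/p of the linear birth-death process, is p_T = 1/(1 + e^{−r_T L}). The sketch below implements this reconstruction, not necessarily the paper's exact formula, and the correction constant c is shown only schematically as an additive adjustment to L.

```python
import math

def p_from_rate(r_T, L=2.0, c=0.0):
    """Average transmission probability for a segment from its mean
    growth rate r_T, assuming q = e^{-r_T L} and p = 1 / (1 + q).
    c is a hypothetical small country-dependent correction to L."""
    q = math.exp(-r_T * (L + c))
    return 1.0 / (1.0 + q)

print(p_from_rate(0.0))    # 0.5 -> critical regime (r_T = 0)
print(p_from_rate(0.35))   # above 0.5 -> supercritical regime
print(p_from_rate(-0.2))   # below 0.5 -> subcritical regime
```

Note the sanity check at r_T = 0: the mapping returns p = 1/2 exactly, matching the critical classification of the process.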
When the initial condition is either Poisson or Negative Binomial, the solutions for the linear birth-death process can be obtained analytically, as shown in [32]. However, the results obtained from the available numerical experiments are preferred for automated computations when the explicit determination of intermediate initial conditions is not critical. The simulator generates trajectories from 10,000 independent realisations, as explained in the provided Supplemental material, extending the realisation in [32]. The integrated results over all paths are obtained as averaged values. The computational procedure can easily be adjusted linearly by changing, together or separately, the scalars K_t and L.

Results
The performance of the combination of the linear birth-death process and change point analysis is tested on real data on the cumulative number of COVID-infected patients, separated by region. The data consist of reported daily new cases in 38 different entities. They were selected intentionally to represent geographical, political and cultural diversity. The main focus is on data from countries considered open and relatively effective in their COVID-19 response. The other criteria were variety in policy measures (the very different approaches of Sweden and Kuwait, for example), population density (North Dakota in comparison to Singapore and New York), and total population over a large territory in countries like the USA, Canada, Australia, Russia and India, against cities like Hong Kong, Singapore and New York, or densely populated countries like Japan and the Netherlands. The expectations of a common policy in the European Union are also considered. In general, the geographical distribution is as follows: The major data source is the dedicated public Johns Hopkins database and maps [14]. The data are preprocessed for irregularities, and growth rate values are computed. Detailed explanations are available in the provided Supplemental material.
The COVID-19 outbreak in every country began with independent arrivals of newly infected persons. Their number and intensity varied for multiple reasons, but the probability distribution of daily infections is expected to follow either the Negative Binomial or the Poisson distribution, even if the process remains completely exhaustive, i.e. the infection dies out without transmission while the transmission probabilities remain infinitesimally small, but not deterministically zero. Thus, even with a permanently single count of infected persons, a new super-spreader cluster could emerge, triggering a new branching process. Another possible source of renewed infection could be immigration.
By assumption, we have to distinguish secondary pandemic outbursts from the initial conditions at the first infected person's arrival. In terms of modelling, however, every new wave of infection is separated from the preceding and following calm periods by clear regime changes. For the data fit software, the moment t_i when a change point occurs is considered both the end of the previous time window and the initial condition for a linear birth-death process with new parameters.

Model initialisation
The early days after the arrival of the first infected particles are of high importance for further pandemic development. Due to the long incubation period, and with no registered local infection yet, the new daily counts are assumed to form a Poisson counting process. Then, when the first local branching clusters appear, the rate of newly infected persons accelerates. At this moment neither branching nor immigration dominates. In case of local transmission collapse for any reason, the process N(t), t > 0, giving the number of particles alive remains homogeneous Poisson (HP) with constant parameter λ > 0 and mean λt, such that P(N(t) = n) = e^{−λt} (λt)^n / n!, n = 0, 1, 2, ....
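The HP assumption can be illustrated by simulating the counting process from its exponential inter-arrival times and comparing with the p.m.f. above (illustrative Python, with arbitrary λ and t):

```python
import math
import random

def poisson_pmf(n, lam, t):
    """P(N(t) = n) for a homogeneous Poisson process with rate lam."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

rng = random.Random(7)
lam, t, trials = 3.0, 2.0, 50000

def count_arrivals():
    """Simulate N(t) by accumulating Exp(lam) inter-arrival times."""
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n, s = n + 1, s + rng.expovariate(lam)
    return n

counts = [count_arrivals() for _ in range(trials)]
empirical = counts.count(6) / trials
print(round(empirical, 3), round(poisson_pmf(6, lam, t), 3))
```

The empirical frequency of observing exactly six events agrees with the closed-form probability, confirming that the simulated counting process is HP with mean λt = 6.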
However, as the process continues, two possible scenarios emerge, separately or combined. The first is that immigration evolves as an inhomogeneous Poisson process similar to that reported in [17]. The other occurs when the branching continues to expand, causing growing domination of locally transmitted infection. The daily numbers N(t), t > 0, are time dependent. To generalise both cases, the considered distribution is non-homogeneous Poisson (NHP) with mean Λ(t) = ∫_0^t λ(u) du.

The location and estimation of the state before the first regime change are important tasks for the definition of initial conditions and the initialisation of computations. Because no detailed specific rule exists, the first change point following either the 10th day after the first infection or the first 50 reported cases is expected to capture the initial conditions completely. As the data show, this is a time when exponential growth has just begun in most of the observed data series. However, the hypothesis of a homogeneous Poisson distribution of daily data before the first change point is not confirmed by the empirical data from all 38 studied areas. The maximum likelihood estimation (MLE) shows that the mean parameter λ of the homogeneous Poisson distribution is a biased estimator. Thus, although nλ yields a good approximation for cumulative counts at the end of a period of length n, the predictions for previous days are wrong. The MLE mean values for the homogeneous Poisson distribution are computed by the function fitdistr in the package MASS in R [35]. The obtained results are shown in Table A1.
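A standard way to simulate the NHP scenario is Lewis-Shedler thinning. The sketch below (illustrative only, with a hypothetical linearly growing intensity) checks that the mean number of arrivals matches Λ(t):

```python
import random

def nhp_thinning(rate, rate_max, t_max, rng):
    """Arrival times of a non-homogeneous Poisson process by
    Lewis-Shedler thinning: propose HP(rate_max) arrivals and keep
    each with probability rate(t) / rate_max."""
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)
        if t > t_max:
            return arrivals
        if rng.random() < rate(t) / rate_max:
            arrivals.append(t)

rng = random.Random(3)
rate = lambda t: 2.0 * t          # hypothetical intensity growing in time
runs = [len(nhp_thinning(rate, rate_max=20.0, t_max=10.0, rng=rng))
        for _ in range(2000)]
mean_count = sum(runs) / len(runs)
print(round(mean_count, 1))  # close to Lambda(10) = integral of 2t du = 100
```

Thinning requires only an upper bound rate_max on the intensity over the window, which makes it convenient when λ(t) is estimated segment by segment.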
For further analysis, this initial time window is split again into just two parts by the already adopted change point model, separating the homogeneous and non-homogeneous processes. In general, as a rule of thumb, the first time-ordered sample and the remaining part are usually HP and NHP distributed, respectively. We also note that in cases without significant immigration, the second period could be treated as a linear birth-death process. To distinguish them more clearly, in the following text we denote the first part in time, which is HP distributed, as the Initial condition. The following part is denoted as the mixing stage, referring to the still serious impact of infection arriving from abroad. The prediction fits for both intervals are significantly improved using the library poisson in [9]. Results with simulated trajectories for the United Kingdom (UK) are shown in Figure 1.
From samples obtained in this way, several additional conclusions on the initial phase of the pandemic can be drawn. First, possible undocumented infections during the first weeks of the pandemic can be inferred from low data quality. This conclusion is confirmed by examples like the time series from the Netherlands, where the HP-distributed subsection is missing. Another hypothesis of data inconsistency follows from the difference between the projected trajectory of the HP process with optimal λ and the real data. In the UK, the empirical data lie constantly outside the 3σ confidence intervals of the modelled HP process with λ obtained directly from MLE; see Figure 1(a). Note that this is exactly the period before lockdown, when chaotic travel to and from the country occurred.
The next property noticed from the initial condition (IC) data is that the parameter λ for the initial condition does not depend on population size; however, λ weighted by population density depends on geographical location and on the ratio of the IC time length in days to the number of days after 1 February 2020 at which the IC is detected. The conclusions are obtained from a Log-gamma Generalised Linear Model (GLM) with 34 degrees of freedom [15]. The optimal model is obtained after investigation of different scenarios. The selection is based on a stepwise procedure minimising the Akaike Information Criterion (AIC), a 10-fold cross-validation procedure, and residual heteroscedasticity. The final model has a root mean square error (RMSE) of 165.33 and a bias of −5.37. The details of the computational process are available in Appendix B.

Data fit performance
The data modelling begins with the data obtained from the initial change point split period, including the initial condition and mixing stage periods. Following [32] strictly, the initialising values would have to be computed from the estimate of the HP distribution for the initial condition. However, due to data differences, the initial point is set at the first change point. The starting value for every trajectory is computed as the approximate value of nλ_whole, where n is the size of the whole period until the first detected regime change. This homogeneous Poisson parameter λ_whole is used for simplicity, despite being biased and not explaining the data trajectories from the initial moment t = 0 to any t < n. However, it is good enough and easy to compute at the moment n, where all possible trajectories are aggregated. The computed values are shown in Table A1.
The fits are repeated regularly, multiple times, with different time series lengths during the period 2020-2022. First, the computational algorithm relies on automatic, non-inferred one-step detection of all possible change points. Then the local average probabilities p_T for every segment are computed from the previously computed growth rates for all t. At the next step, all 10,000 trajectories are executed consecutively, ordered by time. This splits the cumulative stochastic error of the process between all available trajectories (even zeros, if they exist). Thus, the expected number of infected people E(X_t) and the variation V(X_t) are, respectively, the empirical mean and dispersion over all resulting trajectories at the moment t in the interval t_i < t < t_{i+1}.
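The trajectory-averaging step can be sketched as follows. This is an illustrative Python miniature of the procedure, with a few short segments, assumed p values and K = 1, Gillespie daily updates, and the empirical mean over paths; the paper's simulator uses 10,000 realisations with real, estimated segment parameters.

```python
import math
import random
import statistics

def step_day(x, p, K, rng):
    """Advance the linear birth-death process by one day (Gillespie)."""
    t = 0.0
    while x > 0:
        t += rng.expovariate(K * x)
        if t > 1.0:
            break
        x += 1 if rng.random() < p else -1
    return x

def trajectory(x0, segments, rng):
    """Daily path across consecutive segments; each segment is
    (days, p), and by the Markov property the new regime simply
    continues from the level reached at the change point."""
    x, path = x0, []
    for days, p in segments:
        for _ in range(days):
            x = step_day(x, p, 1.0, rng)
            path.append(x)
    return path

rng = random.Random(5)
segments = [(5, 0.65), (5, 0.45)]   # hypothetical growth wave, then decline
paths = [trajectory(10, segments, rng) for _ in range(3000)]
day = 4                              # last day of the first segment
vals = [path[day] for path in paths]
est = statistics.mean(vals)
expected = 10 * math.exp(1.0 * (2 * 0.65 - 1) * 5)   # x0 * e^{m t}
print(round(est, 1), round(expected, 1))
```

The empirical mean across paths at the end of the first segment matches the branching expectation x_0 e^{mt}, and the empirical dispersion over the same paths provides V(X_t).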
This approach provides an easy-to-implement and effective automatic algorithm for obtaining a good estimate of the expected value at a distant time moment and short-term future predictions. In addition, it is easy to adjust the final result by applying the correction value c, selected as the value with minimal overall MSE. It is executed autonomously, without supervised modelling, for all 38 countries. When the change points from this first iteration are obtained by the BinSeg segmentation algorithm, the fit is in good agreement with the overall trend (see Figure 2(a)). However, in some local time intervals, mainly during non-linear expansion due to two overlapping waves, this first iteration shows very low flexibility due to the large size of the determined change point segments. For these cases, local re-computation with a more sensitive segmentation algorithm such as PELT significantly improves the goodness of fit of the model (see Figure 2(b)). The errors are shown to be non-autocorrelated by ACF diagnostics and normally distributed (the Kolmogorov-Smirnov test for standardised values yields a p-value of 0.8467).
Another major empirical observation from regime change detection in the studied cases is that the frequency of detected change points is usually higher at the beginning and at the end of the period, when the infection rate is either strongly increasing or decreasing (Figure 3). This is in direct relation to the long unsegmented periods during the non-linear expansion of daily cases. Knowing this trend, frequent change points can be used as time location predictors of wave arrival or extinction. Visually, it is easily detectable in the Australia data (see Figure 3).
After repeating the model at multiple different times, the change point split for the first days proved stable and precise for the countries with a strong initial wave. The variations in the detected locations of change points are negligible. However, the location of regime changes in the early days shifts for some countries. The common observation in these cases is a delayed great shock. Such countries are Germany and Japan: their infection rate shifted extremely towards 2022 due to a strong reduction of the initial wave in 2020. Thus, when data from 2022 are included, the change points in early 2020 change.
There are cases when the model requires enforced recalibration. It is technically easy to apply a dynamic level correction by resetting the process with new values, due to the Markov property. Such a case arises in France, where the data were updated by a reduction of the cumulative infection numbers in a single day (see Figure 4(a)). Because it is a one-step negative correction, the change point analysis detects only the occurrence of the correction, not its amplitude. Thus, the fit continues to follow the trend of the average local probability, requiring an intensity reset. The subsequent recalibration is only a technical operation, and its effect on the results is limited to the magnitude of the overall intensity. Another, more complicated correction of the model is required when the measurement regime changes to a clear weekly periodicity (see Figure 4(b)).

Conclusion
The COVID-19 outbreak was a multiple-wave pandemic driven by different viral variants. The previous sections successfully described pandemic modelling fitted to the overall number of infected persons, without distinguishing data acquisition and reporting. However, focusing mainly on modelling, we omitted a precise discussion of limitations. The distributions may be oversimplified: for example, overdispersion can affect both the Poisson and negative binomial fits in real data scenarios. We also do not comment on the different viral variants, or on the role the instability of the reproduction factor can play. These additional considerations could be modelled with appropriate data. The simplicity of the software implementation allows concurrent computation of any sub-variant models. Further inference could be derived from a comparison and analysis of aggregated data.
For this reason, we share the complete software implementation, with a dedicated supplements part, to make further independent implementations possible. They may require additional development considered neither in this work nor in the software. Another possible extension is the implementation of geospatial regional modelling to study local differences and their impact at the aggregated global level. This could easily be implemented using separate modelling of regional data and subsequent aggregation. It would improve the precision of the predicted overall values by including time-lagged differences. The simulator source code can be downloaded from here.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Table A1. Maximum-likelihood fitting of Poisson distribution for the initial condition.
The set of predictors in any model for every scenario is obtained from a step-wise procedure by AIC statistics, defined as follows: AIC := n ln(RSS/n) + 2k. The preselected explanatory variables are the following columns from Table A1: λ_IC, Data_IC, Days.tot, Population, Density, Locat.
The selection between different scenarios is based on observation of the residuals' heteroscedasticity and on finding the optimal MSE-bias proportion by 10-fold cross-validation. Thus, the final GLM relies on the Log-Gamma log link log(E(Y)) = Xβ with weights of squared population density. The final combination of predictors comprises the location categorical parameter and the fraction Days.tot/Days_IC. They are confirmed with quite high Z-value probabilities.