Monitoring aggregate warranty claims with dynamically designed CUSUM and EWMA charts

Statistical monitoring of warranty claims data using dynamic probability control limits has been shown to be effective in early detection of unforeseen reliability problems that emerge at the design and manufacturing phases. As the discrepancy between abnormal patterns and the normal pattern in aggregate warranty claims is usually small (especially at the early stage), we develop two new dynamic monitoring schemes that adopt CUSUM-type and EWMA-type statistics, named DyCUSUM and DyEWMA, respectively, to better address the warranty claims monitoring problem. Three effective algorithms – that is, the Monte Carlo simulation, Markov chain, and near-enumeration algorithms – are proposed to progressively determine control limits for the two schemes. In particular, comparison studies show that the near-enumeration algorithm attains a higher approximation accuracy with a lower computational burden and is thus recommended. In-depth simulation experiments are then conducted to assess the performance of the schemes. We find that the DyEWMA scheme has superior and robust detection performance in various situations, whereas the DyCUSUM scheme is less effective and could even be ineffective in certain cases, compared with a Shewhart-type counterpart. Some specific suggestions are also provided to facilitate implementation of the proposed monitoring schemes. Improved schemes that combine the moving window approach to mitigate the 'inertia' problem are further discussed. Finally, a real case study is presented.


Introduction
Nowadays, due to rapid technological advances and fierce market competition, consumers are increasingly demanding superb product performance and quality/reliability. In response, manufacturers have dedicated considerable efforts, based on modern quality and reliability philosophies, to implementing design for reliability and conducting up-front testing and inspection so as to prevent products with serious reliability problems from entering the market (Wu and Meeker 2002). Despite the efforts made in the up-front stage, reliability problems - many of them arising at the design and manufacturing phases - might still occur. This is partly because some reliability problems cannot be completely observed or perceived before products are put into use for a certain period of time. The occurrence of such problems for products in the field often results in considerable economic losses and reputational damage (Motabar et al. 2018). Hence, it is critical for manufacturers to possess the capability of stopping reliability problems in their infancy, when the affected population is relatively small. A common practice adopted by many manufacturers is to regularly monitor their sold product units in order to detect unforeseen reliability problems as early as possible (Yashchin 2012; Hong, Zhang, and Meeker 2018). Nowadays, almost all consumer durables are sold with after-sales services (such as warranty and maintenance), with which malfunctioning products under normal usage will be rectified or even recalled by manufacturers (Wu 2012; Wang and Xie 2018). In this regard, warranty claims data, which are collected mainly for financial reporting purposes, can be useful for early detection of reliability problems (Wu 2012). This is because reliability problems that emerge at the design and/or manufacturing phases can be reflected by certain abnormal patterns in warranty claims data (Wu and Meeker 2002; Lawless, Crowder, and Lee 2012). For example, Harley-Davidson recalled more than 27,000 motorcycles from its 2016
models due to a problem with the clutch master cylinder, after observing a spike in relevant warranty claims (Thomas 2016). A recent recall concerning Ford Explorer SUVs was issued after Ford observed 70 warranty claims (as of December 31, 2022) alleging an issue with the powertrain control module; Ford became aware of this issue on August 31, 2022, and trends in warranty claims suggested that impacted vehicles were produced from June 2019 to April 2022 (Anderson 2023). The two examples demonstrate the importance - from a manufacturer's perspective - of being able to analyse and identify the underlying patterns in warranty claims data.
Statistical process monitoring (SPM) - a popular tool for monitoring the stability of a process and detecting possible abnormal variations of the process (Montgomery 2020), also widely known as the control chart - can be used for monitoring warranty claims. In general, a control chart visualises a statistic that quantifies some concerned features of a process over time, and issues an alarm once the charting statistic exceeds pre-specified control limits, indicating that the process is likely to be experiencing certain abnormal variations. However, early detection of reliability problems by promptly capturing abnormal patterns hidden in warranty claims data is highly challenging. Unlike industrial quality control, the monitored population in the warranty claims monitoring problem (i.e. the population of sold units that are still under warranty, also known as the warranted base) is subject to a high degree of heterogeneity as well as multiple time-varying characteristics (Li et al. 2020). Specifically, the warranted base is a mixture of units produced and sold in the past periods, whose warranties still apply. Due to the dynamics of new sales and out-of-warranty units, the warranted base itself, along with the proportions of units of different ages (time in service) and of different batches (production periods) in the base, is evolving over time. As an illustration, Figure 1 shows how the warranted base and the number of units of four equally partitioned age groups (corresponding to zones I, II, III, and IV, respectively) evolve over the warranty life cycle. Such multiple time-varying characteristics distinguish the warranty claims monitoring problem from traditional industrial quality control problems and render the modelling and analysis difficult.

Figure 1. Evolution of the warranted base (the red line) and its composition over the warranty life cycle (Li et al. 2020). The production and sales process in this illustration follows from scenario 1 reported in Section 4.1.
In the literature, the warranty claims monitoring problem has received limited attention. Wu and Meeker (2002) make an early attempt by considering simultaneous use of multiple Shewhart-type charts for monitoring claim counts of product units stratified by different batches and ages. Lawless, Crowder, and Lee (2012) develop a single-chart cumulative sum (CUSUM) procedure by directly monitoring aggregate warranty claims collected in each time period regardless of production batch and product age, which avoids the issue of inadequate data due to stratification of product units. Zhou, Chinnam, and Korostelev (2012) and Zhou et al. (2017) suggest the use of hazard rate models to integrate upstream supply chain quality/testing information (as explanatory covariates) with warranty claims data for improved detection of reliability problems. Focusing on unit-specific heterogeneity (due to variations in usage and environment), Huang, Jiang, and Shi (2021) capture the over-dispersion by a hierarchical Poisson-gamma model and present a log-likelihood based chart for warranty claims monitoring. Gupta and Chattopadhyay (2022) utilise the labour code priority index measure to address early detection of reliability problems based on two-dimensional warranty claims data. In addition, some recent studies take a new view that treats warranty claims data as profiles. Song et al. (2022) propose an empirical likelihood ratio chart and Wang et al. (2022) develop two nonparametric regression schemes to monitor such profiles. Shang et al. (2017) and He et al. (2021) adopt profile monitoring methods to deal with Phase I change-point detection for warranty-claims profiles. Song et al. (2023) develop two semiparametric control schemes to dynamically monitor small-sample-size profiles with count response and arbitrary design.
However, the studies above do not adequately account for the underlying time-varying characteristics in the warranty claims monitoring problem. In particular, their methods either determine control limits before the control charts are activated, relying entirely on historical data, or simply assume that the entire collection of warranty claims data for each batch of product units is readily available, ignoring the process of practical data collection. These issues severely constrain the applications of their methods. Moreover, some monitoring schemes (e.g. the one in Lawless, Crowder, and Lee 2012) adopt constant control limits, which would induce additional variability that in turn reduces detection capability. In a recent study, Li et al. (2020) advocate the necessity of dynamic control-chart design in the warranty claims monitoring problem. They develop a dynamic control charting scheme whose control limit is determined period by period, taking into account the latest (sales) information, while controlling the false signal rate in each time period at a desired level. They demonstrate that the dynamic scheme works well in various situations and outperforms other existing static schemes (those with a constant control limit). Nevertheless, the scheme in Li et al. (2020) is essentially a simple Shewhart-type control chart with dynamic probability control limits (DPCLs).
In reality, an undesirable increase in the warranty claim rate (function) across product age, caused by reliability problems, can hardly be recognised as a large shift (at least, it is small at the early stage). Moreover, when a claim-rate change occurs, the resultant shift in the underlying pattern of aggregate warranty claims remains small for a long time, due to a high proportion of units manufactured before the change in the mixed monitored population. Therefore, the warranty claims monitoring problem features the detection of small shifts. Shewhart-type control charts are known to be effective in detecting large shifts but relatively insensitive to small shifts, while memory-type schemes that incorporate all past process information, e.g. the exponentially weighted moving average (EWMA) and CUSUM control charts, are excellent alternatives for detecting small and moderate changes (Montgomery 2020). It is natural to consider some memory-type schemes instead of the Shewhart-type scheme for faster detection of reliability problems. However, the dynamic design of these advanced (and also more complicated) schemes is not as straightforward as that of the Shewhart-type scheme in Li et al. (2020).
The dynamic design approach has been increasingly adopted in industrial quality control and public health surveillance. Shen et al. (2013) take an early step to apply the probability control limit approach in a dynamic way and, based on that, they develop an EWMA chart with DPCLs to monitor Poisson count data without any assumption on the distribution of time-varying sample sizes. To determine the DPCLs along with the latest sample-size information so as to control the conditional false signal/alarm rate (CFSR) at each step, Shen et al. (2013) introduce simulation-based and Markov chain procedures incorporated into their EWMA-type scheme. Similar ideas have been widely adopted by subsequent research; see, e.g. Zhang and Woodall (2015), Huang et al. (2016), Shen et al. (2016), Yang, Zou, and Wang (2017), Sogandi et al. (2019), Aytaçoğlu and Woodall (2020), Driscoll, Woodall, and Zou (2021), and Aytaçoğlu, Driscoll, and Woodall (2022), among others. Despite the fact that these studies are merely concerned with time-varying sample (population) sizes, they provide the foundation for developing more advanced control charts for dynamically monitoring warranty claims.
In this paper, we extend the research in Li et al. (2020) by developing two new proposals using memory-type statistics with DPCLs - specifically, dynamically designed CUSUM-type (DyCUSUM) and EWMA-type (DyEWMA) schemes. The statistics are built upon an aggregate warranty claims forecasting model in Li et al. (2020) - which is a discrete-time version of those in Xie and Ye (2016), Xie, Shen, and Zhong (2017), and Wang et al. (2022) - by coupling stochastic product sales and failure processes. The model is able to make one-period-ahead forecasts based on the latest sales information collected. To facilitate progressive determination of the DPCLs for our schemes, we propose a new near-enumeration algorithm that can fully leverage the discrete nature of the warranty claims monitoring problem, in addition to adapting the Monte Carlo simulation and Markov chain algorithms in Shen et al. (2013). A small-scale comparison study shows that the near-enumeration algorithm can attain a higher approximation accuracy with a lower computational burden, compared with the other two algorithms, and is thus a promising alternative for similar monitoring problems. In-depth simulation studies and a real case study are also presented to demonstrate the effectiveness of the proposed schemes. We find that the DyEWMA scheme has superior and robust performance, whereas the DyCUSUM scheme could be completely ineffective in certain scenarios. Some specific suggestions on the implementation of the proposed schemes are further provided.
The remainder of this paper is organised as follows. Section 2 formulates the warranty claims monitoring problem of interest. Section 3 introduces the proposed memory-type schemes as well as three algorithms for determining their DPCLs. Section 4 conducts a thorough simulation-based investigation and comparison of the detection performance of various dynamic monitoring schemes. In addition, a modified scheme is further developed to mitigate the 'inertia' problem. In Section 5, a real-life case study is presented to illustrate the application of the proposed schemes. Finally, Section 6 concludes the paper. Performance evaluation and comparison of the three algorithms are relegated to the Appendix.

Problem formulation
In this work, the formulation of the warranty claims monitoring problem follows directly from Li et al. (2020).We consider a new product that is sold with a non-renewing free repair warranty of length w.That is, if a unit sold at period j fails at period k ∈ (j, j + w], then it will be instantly repaired by the manufacturer at no cost to the consumer, and its remaining warranty period becomes w − (k − j).
We introduce the following setting in line with the current practice in which sales and warranty data are often collected in a grouped form. Let V_i denote the number of product units that are manufactured at period i (i = 1, 2, . . ., p), where p is the final period of production. Further let D_{i,j} represent the number of units produced at period i and sold at period j (j = i, i + 1, . . ., l), where l is the last sales period. In general, we have p ≤ l; that is, production ceases no later than the end of sales. The total number of units sold at the jth period can thus be expressed as S_j = Σ_{i=1}^{j∧p} D_{i,j}, where j ∧ p = min{j, p}. It should be noted that this setting well accommodates the make-to-stock and make-to-order scenarios. In the former scenario, a sales delay between production and sales can be observed, so that D_{i,j} can be nonzero for any j ≥ i; clearly, D_{i,j} might be zero for certain periods, especially when j is much larger than i or the discrete time interval is too short. In the latter scenario, among all D_{i,j}'s with j ≥ i, only one or several of them would be nonzero and the others would always be zero. Another noteworthy point is that although the sales process terminates at the lth period, warranty claims could still be generated during (l, l + w], where l + w is the time the warranty of units sold exactly at period l expires. In this manner, the monitoring schemes could cover the whole warranty life cycle - from the first period to the (l + w)th period.
At the beginning of period k, the size of the monitored population (i.e. the warranted base) can be evaluated by

Q_k = Σ_{j=1∨(k−w)}^{k−1} S_j = Σ_{j=1∨(k−w)}^{k−1} Σ_{i=1}^{j∧p} D_{i,j},  (1)

where 1 ∨ (k − w) = max{1, k − w} and Q_1 = 0. The size of the warranted base is evolving over time, due to the dynamics of new sales and out-of-warranty units (see Figure 1 for an illustration). More importantly, the composition of the warranted base is heterogeneous and time-varying, reflected by D_{i,j} in Equation (1). At the very beginning, most sold units are young; while at the very end, the ages of most units in the warranted base are near w. The underlying heterogeneity and time-varying features render the warranty claims monitoring problem more complicated than others, posing difficulty in the design of dynamic monitoring schemes.
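For concreteness, Equation (1) can be evaluated directly from the production-sales matrix D_{i,j}. The following minimal Python sketch is our own illustration, not code from Li et al. (2020); the function name `warranted_base` and the toy matrix are assumptions:

```python
import numpy as np

def warranted_base(D, w, k):
    """Size of the warranted base Q_k at the start of period k:
    all units sold in periods j = max(1, k - w), ..., k - 1 are still
    covered by a warranty of length w (cf. Equation (1)).
    D[i-1, j-1] holds D_{i,j}, the number of units produced at
    period i and sold at period j."""
    if k <= 1:
        return 0  # Q_1 = 0: no units sold before period 1
    lo = max(1, k - w)
    return int(D[:, lo - 1:k - 1].sum())

# Toy example: 3 production periods, 5 sales periods, warranty w = 2.
D = np.array([
    [10, 5, 0, 0, 0],   # units produced at period 1
    [ 0, 8, 4, 0, 0],   # units produced at period 2
    [ 0, 0, 6, 3, 2],   # units produced at period 3
])
S = D.sum(axis=0)        # S_j = sum over i of D_{i,j}
```

With w = 2, only the last two sales periods contribute; for instance, Q_3 = S_1 + S_2.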
The focus of early detection, via dynamically monitoring aggregate warranty claims over time, is essentially on an abrupt change (increase) in the warranty claim rate of unknown magnitude at an unknown time point τ. Let λ_0(a), a > 0, be the reference claim rate of a product that is designed or expected by the manufacturer and is known a priori, and λ(a | i), a > 0, be the field claim rate for units manufactured at period i; both λ_0(a) and λ(a | i) are functions of unit age a. In this setting, early detection of reliability problems can be formulated as a test of the hypothesis

H_0: λ(a | i) = λ_0(a) for all a > 0 and all i,  (2)

versus

H_1: λ(a | i) > λ_0(a) for all a > 0,  (3)

from a certain point of time onwards (i.e. i > τ). Note that although the problem formulation accommodates both parametric and nonparametric forms of λ_0(a), we adopt a parametric one to support the development of our monitoring schemes. In practice, the parameters of λ_0(a) can be estimated via the maximum likelihood estimation (MLE) method, if historical warranty data of previous product generations or reliability test data of the current generation are available (Wu 2012).
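As a toy illustration of the MLE step (under the strong simplifying assumption of a constant reference claim rate λ_0(a) ≡ λ_0, which need not hold in practice), the estimate has a closed form; the function name and the data below are hypothetical:

```python
def mle_constant_rate(claims, exposures):
    """MLE of a constant claim rate from historical warranty data:
    with Poisson claim counts, lambda_0_hat is the total number of
    claims divided by the total unit-periods of exposure."""
    return sum(claims) / sum(exposures)

# Hypothetical historical data: per-period claim counts and the
# corresponding unit-periods at risk.
lam0_hat = mle_constant_rate([2, 3, 4], [100, 150, 200])
```

For richer parametric forms (e.g. a power-law rate), the likelihood would instead be maximised numerically.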
For the uth unit produced at period i and sold at period j (j ≥ i), let X^(u)_{i,j,k} be a random variable denoting its cumulative number of warranty claims up to period k, where k = j + 1, j + 2, . . ., and u = 1, 2, . . ., D_{i,j}. In particular, we suppose that warranty claims do not appear in the period that the uth unit is sold; that is, X^(u)_{i,j,j} = 0. We further assume that upon receiving a warranty claim, a minimal repair action is applied to restore the failed unit to a working state, and the repair duration is negligible. In this scenario, the warranty claim process can be modelled by a non-homogeneous Poisson process with claim rate λ(k − j). Hence, X^(u)_{i,j,k} follows a Poisson distribution with mean

E[X^(u)_{i,j,k}] = Σ_{a=1}^{κ} λ(a | i),  (4)

where κ = (k − j) ∧ w.
Let M_k denote the number of warranty claims received exactly in the kth period (M_1 = 0). Note that M_k contains eligible warranty claims generated by all units sold before, and including, period k − 1. According to Li et al. (2020), the mathematical expression of M_k is given by

M_k = Σ_{j=1∨(k−w)}^{k−1} Σ_{i=1}^{j∧p} Σ_{u=1}^{D_{i,j}} (X^(u)_{i,j,k} − X^(u)_{i,j,k−1}).  (5)

It can be verified that the numbers of incremental warranty claims in different periods are independent of each other, due to the independent increment property of the Poisson process (Li et al. 2020). Thus, it is clear that the mean of M_k is

E[M_k] = Σ_{j=1∨(k−w)}^{k−1} Σ_{i=1}^{j∧p} D_{i,j} (E[X^(u)_{i,j,k}] − E[X^(u)_{i,j,k−1}]),  (6)

where E[X^(u)_{i,j,k}] is given by Equation (4). Li et al. (2020) reveal that M_k is a key quantity for early detection of reliability problems during the warranty life cycle. In this work, we propose two dynamically designed control charting schemes based on M_k, utilising memory-type statistics - namely, the CUSUM-type and the EWMA-type.
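To make Equation (6) concrete: in the discrete-time model, a unit sold at period j contributes claim rate λ(k − j) to period k while still under warranty, so E[M_k] reduces to a weighted sum of sales volumes. A minimal Python sketch of this calculation (our own; it assumes a constant in-control rate purely for illustration, and the names are ours):

```python
def expected_claims(S, lam0, w, k):
    """In-control mean E[M_k^0]: each unit sold at period j contributes
    claim rate lam0(k - j) to period k while its age k - j is at most
    the warranty length w (cf. Equations (4)-(6))."""
    lo = max(1, k - w)
    return sum(S[j - 1] * lam0(k - j) for j in range(lo, k))

lam0 = lambda a: 0.02        # assumed constant IC claim rate per period
S = [100, 150, 120]          # sales volumes in periods 1-3
mu_4 = expected_claims(S, lam0, w=24, k=4)   # ~ 370 * 0.02 = 7.4
```

Shrinking the warranty window drops older units from the sum, mirroring the out-of-warranty dynamics of the warranted base.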

Two memory-type statistics
We first develop a CUSUM-type statistic for warranty claims monitoring purposes, in line with the one proposed by Lawless, Crowder, and Lee (2012) and used in their constant-control-limit CUSUM scheme. Our CUSUM-type statistic can be written as

W_k = max{0, W_{k−1} + M_k − ψ E[M^0_k]},  (7)

where W_0 = 0, ψ ≥ 1, and E[M^0_k] represents the in-control (IC) mean of M_k. In general, ψ might follow a functional form derived from the log-likelihood-ratio statistic, involving the IC and out-of-control (OC) claim rates. Because the form of the OC claim rate is usually unknown, here we simply set ψ to a constant, called the reference factor, and its effect will be discussed later in Section 4.
Further define

Z_k = (M_k − E[M^0_k]) / √Var(M^0_k),

where Var(M^0_k) represents the variance of M_k in the IC state. Then, our EWMA-type statistic, with a lower reflecting barrier at zero, can be expressed as

G_k = max{0, (1 − θ)G_{k−1} + θ Z_k},  (8)

where G_0 = 0 and θ ∈ (0, 1] is the smoothing factor. The effect of θ will be discussed later in Section 4 as well. It is worth mentioning that the Shewhart-type statistic in Li et al. (2020) can be viewed as a special case of G_k in (8), with θ = 1.
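The two recursions in Equations (7) and (8) are simple one-line updates. A minimal sketch (the helper names are ours, and the default ψ and θ values are placeholders, not recommendations):

```python
import math

def cusum_update(W_prev, M_k, mu0, psi=1.1):
    """DyCUSUM-type statistic, Equation (7):
    W_k = max(0, W_{k-1} + M_k - psi * E[M_k^0])."""
    return max(0.0, W_prev + M_k - psi * mu0)

def ewma_update(G_prev, M_k, mu0, var0, theta=0.25):
    """DyEWMA-type statistic, Equation (8), with a lower reflecting
    barrier at zero: standardise M_k, then smooth with factor theta."""
    Z_k = (M_k - mu0) / math.sqrt(var0)
    return max(0.0, (1.0 - theta) * G_prev + theta * Z_k)
```

Setting `theta=1.0` in `ewma_update` recovers the Shewhart-type statistic of Li et al. (2020); for Poisson claim counts, `var0` equals `mu0`.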

Determination of probability control limits
The control limits of conventional CUSUM or EWMA charts are usually constant. However, the use of a constant control limit would result in time-varying CFSRs, even in the simplest case with a constant sample size (Driscoll, Woodall, and Zou 2021). As mentioned earlier, the warranty claims monitoring problem of interest exhibits multiple time-varying characteristics, not merely time-varying sample sizes. This would make the CFSR at each period uncontrollable, which is undesirable in real applications. In order to maintain the CFSR at a stable level (or let it vary over time according to a specified pattern), it is suggested to adopt time-varying control limits.
The control limits of the proposed dynamic monitoring schemes are calculated in such a way that the conditional probability that the monitoring statistic exceeds its control limit, given no prior alarms, is equal to (or not greater than) a pre-specified false signal rate α. Mathematically, the control limits H_{W,k} and H_{G,k} for monitoring statistics W_k and G_k in Equations (7) and (8), respectively, satisfy

Pr{W_k > H_{W,k} | W_2 ≤ H_{W,2}, . . ., W_{k−1} ≤ H_{W,k−1}} = α,  (9)

and

Pr{G_k > H_{G,k} | G_2 ≤ H_{G,2}, . . ., G_{k−1} ≤ H_{G,k−1}} = α.  (10)

It is customary to refer to H_{W,k} and H_{G,k} as the probability control limits (Driscoll, Woodall, and Zou 2021). In practice, the probability control limit in each period can be progressively determined right after the latest data regarding production, sales, and warranty claims are obtained. In this sense, the proposed dynamic monitoring schemes - that is, the DyCUSUM and DyEWMA schemes - can be implemented in a dynamic way without making any prediction regarding future sales volumes. This characteristic makes the DyCUSUM and DyEWMA schemes fairly general and adaptive to complex application scenarios.
However, due to the intricacy of the conditional probability, analytically solving for H_{W,k} and H_{G,k} is intractable. We thus develop three effective algorithms, namely, Alg #1: the Monte Carlo simulation-based algorithm, Alg #2: the Markov chain algorithm, and Alg #3: the near-enumeration algorithm, to facilitate progressive determination of H_{W,k} and H_{G,k} for the DyCUSUM and DyEWMA schemes under an acceptable level of α. The first two algorithms are inspired by the corresponding methods in Shen et al. (2013) and adapted for the warranty claims monitoring problem, while the near-enumeration algorithm is newly proposed. In what follows, we detail the fundamental ideas and computational procedures of the three algorithms, putting a particular emphasis on the near-enumeration algorithm. We note that only the computational procedures for the DyEWMA scheme are presented here; those for the DyCUSUM scheme can be designed in a similar manner and are thus omitted for brevity.

Simulation-based algorithm
Monte Carlo simulation is a simple yet effective way to approximate the DPCLs (see, e.g. Shen et al. 2013; Huang et al. 2016). In this subsection, we present a simulation-based procedure, applied period by period in a dynamic fashion, to determine probability control limits in the warranty claims monitoring problem.
Let us start from the very beginning and consider that S_1 units are sold in the first period. The monitoring actually starts from the second period, as warranty claims (if any) do not appear until the second period. However, at the end of the first period, we already know that the number of warranty claims, M_2, received in the second period follows the Poisson distribution with mean E[M^0_2], where E[M^0_2] can be evaluated by Equation (6) based on the previous sales volume S_1. Therefore, we can determine the upper control limit H_{G,2} by randomly generating M_{2,i} (i = 1, 2, . . ., R) from Poisson(E[M^0_2]) and then, with G_1 = 0, calculating the corresponding pseudo EWMA-type statistics, G_{2,i}, based on Equation (8). Let O_{G_2} = (G_{2,1}, G_{2,2}, . . ., G_{2,R}). According to Equation (10), H_{G,2} is equal to the 100(1 − α)th percentile of the elements in O_{G_2}. Note that the number of simulation runs, R, should be sufficiently large to guarantee a high approximation accuracy. The actual value of M_2 is available at the end of the second period. We can then compute G_2 and compare it with H_{G,2}. An OC signal is triggered if G_2 > H_{G,2}; otherwise, the process is regarded as IC and we proceed to the next period.
According to Equation (10), we should ensure that only those G_{2,i} satisfying G_{2,i} ≤ H_{G,2} are kept to determine H_{G,3} for the third period. We use a vector Õ_{G_2} to store those elements in O_{G_2} that satisfy this condition. Likewise, we would also know that M_3 ∼ Poisson(E[M^0_3]) at the end of the second period. Hence, we can obtain a vector O_{G_3} in which the ith element, G_{3,i}, is calculated via Equation (8) from a randomly generated M_{3,i} and a value of G_2 randomly drawn from Õ_{G_2}; the control limit H_{G,3} and the limits for subsequent periods are then determined in the same manner as above.
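The simulation-based procedure can be sketched in a few lines of Python. This is a simplified illustration under a given sequence of in-control means; the function name and default arguments are our own assumptions. Each period, R pseudo statistics are propagated through Equation (8), the 100(1 − α)th percentile gives the DPCL, and only non-signalling trajectories are retained:

```python
import numpy as np

rng = np.random.default_rng(2024)

def mc_dpcl(mu0_seq, theta=0.25, alpha=0.005, R=100_000):
    """Monte Carlo approximation of the DPCLs H_{G,k} for the DyEWMA
    scheme; mu0_seq lists the in-control means E[M_k^0], k = 2, 3, ..."""
    G = np.zeros(R)                     # pseudo statistics, G_1 = 0
    limits = []
    for mu0 in mu0_seq:
        M = rng.poisson(mu0, size=G.size)
        Z = (M - mu0) / np.sqrt(mu0)    # Poisson: variance = mean
        G = np.maximum(0.0, (1 - theta) * G + theta * Z)
        h = np.quantile(G, 1 - alpha)   # 100(1 - alpha)th percentile
        limits.append(float(h))
        G = G[G <= h]                   # keep only non-signalling paths
    return limits
```

Note that the retained sample shrinks by roughly a factor of 1 − α each period, so R must be large for long monitoring horizons.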

Markov chain algorithm
In addition to the simulation-based approach, one can adopt the Markov chain approach to determine the DPCLs, which is often more accurate and stable. Following the standard practice of Markov chain approaches, we need, first of all, to discretise the region of the monitoring statistic G_k, for which we should know its lower and upper bounds (denoted by L_k and U_k, respectively). The lower bound is clearly L_k = 0, according to the form of G_k in Equation (8). However, the upper bound U_k is difficult to determine as it is time-varying in the current context. In this study, U_k is specified by solving Pr{G_k ≤ U_k} = 1 − ε, where ε > 0 is a sufficiently small constant (we set ε = e^−16). Then, we partition the interval [L_k, U_k] into V equal-length subintervals, where the lower and upper bounds of the ith subinterval are (i − 1)U_k/V and iU_k/V, respectively. Usually, V should be large enough to guarantee an acceptable approximation accuracy.
If G_k falls into the ith subinterval, i = 1, 2, . . ., V, then we say that G_k is in the ith state, with G_k = 0 corresponding to the 0th state (i.e. i = 0). In total, there are V + 1 discrete states, from 0 to V. Suppose that the probability density within each subinterval (state) i is concentrated as a probability mass at the midpoint (i − 1/2)U_k/V. The transition probability from each state of G_{k−1} to each state of G_k, Pr{G_k in state i | G_{k−1} in state j}, can then be obtained by calculating the corresponding probability of M_k.
To initialise the proposed Markov chain procedure, we again start from k = 2. At the second period, given M_2 ∼ Poisson(E[M^0_2]), the conditional probability that G_2 falls into the ith state, p_{2,i}, can be calculated by evaluating the probabilities of the values of M_2 that map into state i under Equation (8), yielding P_2 = (p_{2,0}, p_{2,1}, . . ., p_{2,V})^T. The control limit can then be approximated by H_{G,2} = V_2 U_2/V, where

V_k = min{v : Σ_{i=0}^{v} p_{k,i} ≥ 1 − α}.  (12)

If G_2 > H_{G,2}, an OC signal should be triggered; otherwise, we proceed to the next period.
For the third period, we know that G_3 is partially dependent on G_2. The condition G_2 ≤ H_{G,2} requires us to only keep the first V_2 + 1 states of G_2 and store their normalised probabilities p̃_{2,j} = p_{2,j}/Σ_{i=0}^{V_2} p_{2,i} in P̃_2 = (p̃_{2,0}, . . ., p̃_{2,V_2})^T. Then, the conditional probability vector P_3 = (p_{3,0}, p_{3,1}, . . ., p_{3,V})^T can be calculated by

p_{3,i} = Σ_{j=0}^{V_2} p̃_{2,j} Pr{G_3 in state i | G_2 in state j},  i = 0, 1, . . ., V.  (13)

Once P_3 is obtained, the control limit H_{G,3} can be approximated by H_{G,3} = V_3 U_3/V, where V_3 is recalculated by Equation (12) with P_3. Subsequent steps repeat as above. The Markov chain procedure is briefly described in Algorithm 2.
Algorithm 2 Markov chain procedure for determining H_{G,k}
1. Initialise k = 2 and P_1 = [1].
2. Set the lower bound L_k to 0, specify the upper bound U_k by Pr{G_k ≤ U_k} = 1 − ε, and partition [L_k, U_k] into V equal-length subintervals.
3. Compute the conditional probability vector P_k from P̃_{k−1} and the probabilities of M_k, and specify the control limit as H_{G,k} = V_k U_k/V with V_k given by Equation (12).
4. If G_k ≤ H_{G,k}, keep the first V_k + 1 states, and calculate the normalised probabilities p̃_{k,j}, j = 0, 1, . . ., V_k, storing them in P̃_k. Then, let k = k + 1, and go to Step 2. Otherwise, an OC signal should be issued.
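A compact numerical sketch of the Markov chain idea is given below. It is our own simplification, not the paper's Algorithm 2: the zero state is merged into the first bin, the Poisson support is truncated at roughly mu0 + 10·sqrt(mu0), and the state space is re-discretised every period; all function names are ours.

```python
import math
import numpy as np

def poisson_pmf(mu, mmax):
    """Poisson probabilities Pr(M = 0), ..., Pr(M = mmax), computed
    iteratively to avoid factorials."""
    p = np.empty(mmax + 1)
    p[0] = math.exp(-mu)
    for m in range(1, mmax + 1):
        p[m] = p[m - 1] * mu / m
    return p

def markov_step(P_prev, mids_prev, mu0, theta=0.25, alpha=0.005, V=200):
    """One period of a Markov-chain-style DPCL computation: push the
    conditional state distribution of G_{k-1} through the EWMA
    recursion (8) and re-bin the result onto V subintervals."""
    mmax = int(mu0 + 10 * math.sqrt(mu0) + 10)   # truncate the Poisson tail
    pm = poisson_pmf(mu0, mmax)
    Z = (np.arange(mmax + 1) - mu0) / math.sqrt(mu0)
    # all reachable (previous state, claim count) combinations of G_k
    G = np.maximum(0.0, (1 - theta) * mids_prev[:, None] + theta * Z[None, :])
    Wt = P_prev[:, None] * pm[None, :]
    edges = np.linspace(0.0, G.max() + 1e-12, V + 1)
    idx = np.clip(np.searchsorted(edges, G.ravel(), side='right') - 1, 0, V - 1)
    P = np.bincount(idx, weights=Wt.ravel(), minlength=V)
    cdf = np.cumsum(P)
    j = min(int(np.searchsorted(cdf, 1 - alpha)), V - 1)
    h = float(edges[j + 1])                      # control limit H_{G,k}
    keep = P[:j + 1] / cdf[j]                    # condition on no alarm
    mids = (edges[:j + 1] + edges[1:j + 2]) / 2  # retained state midpoints
    return h, keep, mids
```

Starting from P_1 = [1] and the single state G_1 = 0, repeated calls yield H_{G,2}, H_{G,3}, and so on.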

Near-enumeration algorithm
The Monte Carlo and Markov chain approaches are among the most common ways of computing control limits and evaluating control chart performance when dealing with complex SPM problems. The rationale behind the two approaches lies in approximation. As pointed out by Huang et al. (2016), owing to the period-by-period iterative specification of control limits, the approximation error will propagate to subsequent periods. Thus, the algorithm arguments R (for the simulation-based approach) and V (for the Markov chain approach) should be sufficiently large in order to attain a satisfactory accuracy, inevitably requiring excessive computational efforts. Therefore, an exact method with an acceptable computational burden is needed for determining DPCLs. In the warranty claims monitoring problem, the modelling and analysis hinge on the Poisson distribution. Given the discreteness of the Poisson distribution, a natural idea for specifying DPCLs is through enumeration; however, applying an exhaustive enumeration is impractical in real scenarios. In this work, we propose a near-enumeration approach that is able to obtain sufficiently accurate control limits. The basic idea and computational procedure are described below.
The enumeration approach requires us to consider all possible cases that might happen.To this end, we execute the enumeration process in the following two steps.
• The first step is to identify all possible observations of warranty claims M_k received in a single period k. The minimal number of warranty claims is naturally zero. On the other hand, although an intuitive upper bound on the number of warranty claims is Q_k (the warranted base in period k), the magnitude of Q_k is often huge in real applications, especially for consumer electronics. We thus specify a reasonable upper bound U*_k in a way similar to that in the Markov chain procedure; that is, Pr{M_k ≤ U*_k} = 1 − ε*, where ε* > 0 is also a sufficiently small constant (we set ε* = e^−16 as well). As the probability of M_k falling below a sufficiently small value is negligible, a lower bound L*_k can be specified analogously and viewed as an approximate minimum of M_k. In this sense, the possible range of M_k is from L*_k to U*_k, and the occurrence probability of each value can be easily derived from the Poisson distribution of M_k.
• The second step is to further identify all possible values of the monitoring statistic G_k. For this purpose, it is natural to first enumerate all possible combinations of M_k and G_{k−1}. Then, for each combination, we can calculate the corresponding G_k and its probability of occurrence based on Equation (8). After this, we can sort out the unique values of G_k and determine their probabilities by combining the results with the same G_k. Essentially, this forms a conditional distribution of G_k given that G_{k−1} is IC and, on this basis, we can determine the control limits exactly.
A problem here is that the set of unique values (states) of G_k keeps expanding over time, and becomes unacceptably large after a few periods. Hence, we need to compress its scale when it exceeds a certain limit. In this study, if the number of combinations of M_k and G_{k−1} becomes greater than a sufficiently large number J (say, J = 100,000), then we discretise the region of G_k into V* + 1 discrete states (including state 0), similar to that in the Markov chain approach. The original value of each G_k > 0 is then replaced by the midpoint of the subinterval that it falls in, and the corresponding probability can be obtained accordingly. Here, V* should be relatively large to achieve a high degree of accuracy while maintaining a tolerable computational burden. As long as the number of possible combinations is no larger than J at period k, full information on the states of G_k given the states of G_{k−1} will be kept; otherwise, we only keep V* + 1 representative states of G_k. This is why we call it the near-enumeration approach.
To implement the near-enumeration procedure, we start from k = 2 again. As G_1 = 0 with probability one, in order to determine the control limit H_{G,2}, we only need to consider all possible observations of M_2 (stored in O*_{M_2} = (L*_2, L*_2 + 1, . . ., U*_2)^T) and their corresponding occurrence probabilities (stored in P*_{M_2}). Here we can directly obtain O*_{G_2} and P*_{G_2} via Equation (8), combining the probabilities associated with the same G_2 and then rescaling them so that the overall probability equals one.
At the third period, again, we first identify O*_{M_3} and P*_{M_3} in a similar way to that of the second period. The only difference is that here we use the probability 1 − ε* when specifying the bounds U*_3 and L*_3. Given O*_{G_2} and P*_{G_2}, we can directly enumerate all combinations of M_3 and G_2, and then calculate the corresponding G_3 and the associated probability for each combination (note that each probability is the product of the corresponding elements in P*_{M_3} and P*_{G_2}). We store the results in O*_{G_3} and P*_{G_3}, respectively. Further, we update O*_{G_3} - in an accurate or approximate manner depending on its size - by only retaining the unique values in it and then sorting them in ascending order; we also update P*_{G_3} correspondingly. Then, the control limit H_{G,3} can be determined from the resulting conditional distribution of G_3 according to Equation (10), and subsequent periods proceed in the same manner.
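The near-enumeration idea can be sketched as follows. This is an illustrative simplification with hypothetical names: the truncation constant and the compression rule stand in for ε*, J, and V* in the text, and the exact bookkeeping differs from the paper's procedure.

```python
import math
import numpy as np

def near_enum_step(states, mu0, theta=0.25, alpha=0.005,
                   J=100_000, Vstar=2000):
    """One period of a near-enumeration-style DPCL computation.
    `states` maps each retained value of G_{k-1} to its conditional
    probability. Values of M_k are enumerated up to a truncation point
    (standing in for U_k^*), exact G_k values are accumulated, and the
    state space is compressed to Vstar bins only when the number of
    (M_k, G_{k-1}) combinations exceeds J."""
    mmax = int(mu0 + 12 * math.sqrt(mu0) + 12)   # tail truncation
    pm = [math.exp(-mu0)]
    for m in range(1, mmax + 1):
        pm.append(pm[-1] * mu0 / m)
    new = {}
    for g_prev, p_prev in states.items():
        for m in range(mmax + 1):
            z = (m - mu0) / math.sqrt(mu0)       # Poisson: variance = mean
            g = max(0.0, (1 - theta) * g_prev + theta * z)
            new[g] = new.get(g, 0.0) + p_prev * pm[m]
    if len(states) * (mmax + 1) > J:             # compress when too large
        vals = np.fromiter(new.keys(), float)
        prob = np.fromiter(new.values(), float)
        edges = np.linspace(0.0, vals.max() + 1e-12, Vstar + 1)
        idx = np.clip(np.searchsorted(edges, vals, side='right') - 1,
                      0, Vstar - 1)
        mids = (edges[idx] + edges[idx + 1]) / 2
        new = {}
        for g, p in zip(mids, prob):
            new[g] = new.get(g, 0.0) + p
    # sort the states, locate the control limit, condition on no alarm
    cum, h, kept = 0.0, None, {}
    for g, p in sorted(new.items()):
        cum += p
        kept[g] = p
        if cum >= 1 - alpha:
            h = g
            break
    if h is None:                                # truncation safeguard
        h = max(kept)
    total = sum(kept.values())
    return h, {g: p / total for g, p in kept.items()}
```

Most of the work is dictionary arithmetic on exact values, which is consistent with the observation below that the algorithm relies largely on basic arithmetic operations.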

Remarks on the three algorithms
According to our evaluations and comparisons of the three algorithms through small-scale simulation experiments (see the Appendix for details), we make the following remarks: (1) compared with the traditional practice of adopting a fixed (probability) control limit, the DPCLs calculated by any of the three algorithms offer significant improvements in controlling the CFSRs without any prior information on future sample sizes; (2) due to the randomness inherent in Monte Carlo simulation, the actual CFSRs generated by the simulation-based algorithm fluctuate around the pre-specified level most heavily among the three algorithms; and (3) the results of the near-enumeration and Markov chain algorithms exhibit a high degree of consistency, while the actual CFSRs attained by the near-enumeration algorithm are slightly closer to the pre-specified level.
Furthermore, we find that the near-enumeration algorithm is computationally much more efficient than the Markov chain algorithm. As an algorithm that returns sufficiently accurate DPCLs, its computational efficiency is acceptable, mainly because many of the calculations it involves are basic arithmetic operations. It is thus able to attain a higher approximation accuracy with a lower computational burden. Therefore, the near-enumeration algorithm is employed to determine DPCLs throughout this work. We believe that it is also applicable to similar problems concerning the dynamic design of control charts in the discrete case. Table 1 provides a brief summary of the major strengths and limitations of the three algorithms.

Performance study
In this section, we conduct a thorough investigation and comparison of the overall detection performance of the proposed DyCUSUM and DyEWMA schemes, together with the dynamic Shewhart-type scheme (hereafter, DyShewhart) in Li et al. (2020). For this purpose, we consider different tuning parameters: ψ = 1.00, ψ = 1.10, and ψ = 1.25 for DyCUSUM, and θ = 0.10, θ = 0.25, and θ = 0.50 for DyEWMA. We first demonstrate the superiority of the proposed schemes in the current context and further provide guidelines on parameter selection. It should be noted that we do not include other alternative schemes in the literature (e.g. the constant-control-limit CUSUM scheme in Lawless, Crowder, and Lee 2012), because Li et al. (2020) have shown that such a CUSUM scheme is impractical and also less effective than the DyShewhart scheme.

Simulation setting
For a fair comparison, we follow the simulation setting in Li et al. (2020), which is briefly introduced below; interested readers may refer to Li et al. (2020) for more details. Consider a manufacturer that produces and sells a specific product, periodically collecting and analysing the production data, sales data, and warranty claims data for early detection of potential reliability problems. The production is supposed to last for p = 130 weeks. In particular, two types of production scenarios are considered in the simulation study. In the first scenario, the weekly production volume V_i first increases, then levels off, and finally decreases, whereas in the second scenario, a relatively stable production mode comes with a much larger production volume. In the specifications of the two scenarios, v_{1i} ∼ U(−150, 150) and v_{2i} ∼ U(−300, 300) are uniformly distributed integers corresponding to the fluctuations in production volumes.
Moreover, selling all units produced in a week is assumed to take 10-30 weeks, and the sales are thus assumed to be randomly distributed across these weeks. After the production terminates, the sales of remaining units will continue for at most 26 extra weeks, so that the product life cycle is l = 156 weeks. All units are sold with a free repair warranty of length w = 52 weeks. In this study, the power law process is employed to characterise the claim rate; that is, λ(a) = (β/η)(a/η)^{β−1}, a ≥ 0, where β > 0 is the shape parameter and η > 0 is the scale parameter. To set up the simulation, the parameters of λ_0(a) are set to β_0 = 3 and η_0 = 100 for the first production scenario, corresponding to an increasing claim rate, and β_0 = 1 and η_0 = 1000 for the second scenario, with a constant claim rate. Assume that the delay between the occurrence of a failure and the resulting warranty claim is negligible. Hence, under the minimal repair policy, the occurrence of warranty claims obeys a non-homogeneous Poisson process. Furthermore, we assume that at time τ, some reliability problem induces an instantaneous shift in the scale parameter from η_0 to η_1 = (1 − ρ)η_0 (with 0 < ρ < 1 representing the magnitude of the change), while the shape parameter remains unchanged. This results in an undesired increase in the claim rate and consequently a certain abnormal pattern in warranty claims. In particular, we consider three cases of the change point τ, i.e.
τ = 0, τ ∼ U(15, 20), and τ ∼ U(75, 90), which correspond to the problem emerging before mass production, within (0, w], and within (w, l], respectively. Note that we provide some flexibility on τ by allowing it to be uniformly distributed within a certain range, instead of being a constant. In each case of τ, we examine three cases of ρ, viz., ρ = 0.10, ρ = 0.25, and ρ = 0.50, which correspond to slight, moderate, and significant changes, respectively. In total, there are nine types of OC cases in the simulation study that cover a broad scope of changes occurring at different stages of the life cycle with different magnitudes.
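Under the power-law specification above, the claim rate, its NHPP mean function, and the simulated shift in the scale parameter can be sketched as follows; the function names are ours, and the formulas follow the setting just described.

```python
def power_law_rate(a, beta, eta):
    """Power-law-process claim rate: lambda(a) = (beta/eta) * (a/eta)**(beta-1)."""
    return (beta / eta) * (a / eta) ** (beta - 1)

def expected_claims(a, beta, eta):
    """NHPP mean function up to age a: Lambda(a) = (a/eta)**beta."""
    return (a / eta) ** beta

# Baseline parameters for production scenario 1 (increasing claim rate)
beta0, eta0 = 3.0, 100.0

# At the change point, a reliability problem shrinks the scale parameter:
rho = 0.25                    # magnitude of the change
eta1 = (1.0 - rho) * eta0     # eta_1 = (1 - rho) * eta_0; shape unchanged
```

Since η_1 < η_0, the post-change rate λ_1(a) exceeds λ_0(a) at every age a > 0, which produces the abnormal pattern in the aggregate claims.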

Performance comparison
For comparison purposes, we focus on the detection powers of the proposed monitoring schemes in the different OC cases described above. We use the overall signal probability (SP) over a given period of time as the performance metric. Specifically, in the current context, we are primarily concerned with the overall true signal probability (TSP) - namely, the probability of having at least one true alarm from the (τ + 1)th week up to some point in time, given no prior alarms. The TSP reflects the detection capability of a monitoring scheme after the change point. Clearly, a higher TSP is preferred for early detection purposes. Meanwhile, we need to control the CFSR in each time period at a desired level. Here, the maximum desired CFSR of each monitoring scheme is pre-specified as α = 0.0027, a value commonly used in the literature. For other values of α, the results are similar and are thus omitted for space considerations. It is noteworthy that the overall false signal probability (FSP) - namely, the probability of having at least one false alarm triggered before and in the (τ + 1)th week - can be directly calculated from the attained CFSRs.
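The relation between the per-period conditional signal rates and an overall SP is a simple complement-product identity, sketched below; treating the weekly conditional rates as given inputs is our assumption.

```python
def overall_signal_prob(cond_rates):
    """Probability of at least one signal over the horizon, given the
    per-period conditional signal rates (e.g. the attained CFSRs when the
    process is in control)."""
    no_signal = 1.0
    for p in cond_rates:
        no_signal *= (1.0 - p)   # survive period after period without a signal
    return 1.0 - no_signal

# e.g. the overall FSP over 52 in-control weeks with the CFSR held at 0.0027
fsp_52_weeks = overall_signal_prob([0.0027] * 52)
```

This is why controlling each week's CFSR at α directly pins down the overall FSP curve, while the TSP must be estimated from simulated OC runs.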
In this performance study, as noted previously, the DyCUSUM and DyEWMA schemes are designed based on the near-enumeration approach (Alg #3) with J = 100,000 and V* = 10,000. This approach is also used to evaluate their SP metrics in order to guarantee accurate and reliable results, whereas the same metrics of the DyShewhart scheme can be directly calculated from the underlying Poisson distributions. To obtain converged results, the simulation study is conducted based on 50,000 random runs for each single case.
By comparing the DyCUSUM and DyEWMA schemes, we find that the overall FSPs of the latter are higher than those of the former. The DyEWMA scheme is much more robust to different tuning parameters in terms of the overall FSPs, while the DyCUSUM scheme designed with a moderately large ψ (say, ψ = 1.25) can even be ineffective in a few cases (having significantly lower overall FSPs). Except for these ineffective DyCUSUM cases, the overall FSPs of the two schemes are comparably close. We then proceed to compare their overall TSPs. In addition to the results (for specific time periods) in Tables 2 and 3, Figures 2 and 3 further illustrate their overall TSPs (the averages from 50,000 random runs) over time under production scenarios 1 and 2, respectively. Note that for the curves of overall TSPs, a faster growth corresponds to a higher detection power. From Figures 2 and 3, we can see that the DyCUSUM scheme is heavily affected by its tuning parameter ψ. Its performance can be poor with a large ψ, even far worse than the DyShewhart scheme in some cases, especially under scenario 2. In addition, even with ψ = 1.00, the DyCUSUM scheme, which has a simple 'observed − expected' form, can be inferior to the DyShewhart scheme in certain cases; see Figure 3(i). This means that a CUSUM scheme whose monitoring statistic has a form similar to the one in Lawless, Crowder, and Lee (2012), even when dynamically designed, is not reliable (especially for a large ψ). When using the DyCUSUM scheme, the reference factor ψ should therefore be selected with great caution.
Unlike the DyCUSUM scheme, the tuning parameter θ of the DyEWMA scheme appears to have much less effect on the overall TSPs, since the curves of the DyEWMA scheme with different values of θ are very close, especially for θ = 0.10 and θ = 0.25. Moreover, the DyEWMA scheme is always better than the DyShewhart scheme, and possesses greater advantages when the magnitude of the claim-rate change, ρ, is small, which is to be expected. Even for a large change, say ρ = 0.50, the DyEWMA scheme can still slightly outperform the DyShewhart scheme. In particular, the DyEWMA scheme designed with θ = 0.10 is almost always superior to, or at least comparable to, the best of all other schemes in all cases we consider. The results show that the DyEWMA scheme, especially with θ = 0.10, has superior and robust detection performance and is thus a highly promising option for monitoring aggregate warranty claims.
Furthermore, we have the following observations that apply to all dynamic monitoring schemes:
• The overall TSP after the underlying change remains relatively low for a certain period of time in most cases. This is mainly because (a) only a small proportion of units produced after τ have been sold during this period of time, owing to the sales delay; and (b) the change in the claim rate due to the simulated shift in the scale parameter η_0 is quite small when the age a is small. After certain periods, however, the overall TSP grows sharply to, and then stays at, one. In particular, the overall TSP grows more sharply when ρ is larger, all else being equal. This is consistent with our expectation, because it would be easier and faster to detect a larger change in the claim rate.
• The overall TSP grows more slowly when the claim-rate change occurs later in the life cycle, and it thus takes more weeks for this probability to reach one (or a half). This can be explained by the fact that a late occurrence of reliability problems results in mixed warranty claims from the units produced both before and after τ, which in turn increases the difficulty of early detection. This implies that it is more difficult to detect a reliability problem emerging at a later stage of the life cycle, relative to one occurring earlier.
• By comparing Figures 2 and 3, we find that the curves of the DyEWMA and DyShewhart schemes under scenario 1 exhibit similar trends and relations to those under scenario 2, whereas those of the DyCUSUM scheme appear to be highly impacted by the production scenario. In general, the overall TSPs under scenario 2 grow earlier and more sharply than those under scenario 1. The detection power regarding a given claim-rate change depends on the expected number of warranty claims each week. Moreover, as noted earlier, the overall FSPs under scenario 2 (in Table 3) are larger than those under scenario 1 (in Table 2). This implies that the CFSRs under scenario 2 are generally higher than those under scenario 1, which further widens the gap in detection power between the two scenarios.
To examine the variability of the results, we further look at the standard deviations of the overall SPs over the 50,000 random runs; see Tables 2 and 3 for some details (the complete results of the standard deviations are available upon request). The standard deviations of the overall TSPs are related to their averages. To be specific, when the average overall TSP is moderately large (say, between 0.3 and 0.7), the standard deviation is large; when it approaches either extreme (0 or 1), the standard deviation gradually drops off. Moreover, we find that both τ and ρ can impact the standard deviations (e.g. the standard deviations become larger as ρ increases), although the impact of ρ is more significant than that of τ. In general, the degree of variation in the results is acceptable. The differences among the schemes in the standard deviations of the overall TSPs are inapparent in most cases (this observation does not apply to the DyCUSUM scheme with ψ = 1.25 when its performance is rather poor), while the standard deviations of the DyShewhart scheme appear to be slightly smaller in some cases. These findings indicate that the variability of the results comes mainly from the inherent randomness in simulation, rather than from the monitoring schemes.
So far, we have carried out an in-depth performance analysis of the DyCUSUM and DyEWMA schemes under scenarios 1 and 2. In addition, we have conducted a more comprehensive simulation study including more scenarios, particularly variations of the two scenarios with changes in η_0 and β_0, among others. Detailed numerical results can be found in the Online Supplement. We find, for example, that the overall TSP for the DyEWMA scheme grows more sharply when η_0 is smaller. Most importantly, the DyEWMA scheme designed with θ = 0.10 remains superior to, or at least comparable to, the best of all other schemes under these variational scenarios. In general, these results are highly consistent with those under scenarios 1 and 2 above.

Optimal choices of tuning parameters
Optimising the tuning parameters for CUSUM and EWMA charts is an essential issue in classic SPM research, as well-designed CUSUM and EWMA charts can be especially powerful in specific OC cases. In the previous subsection, we considered three choices of the tuning parameters (i.e. ψ = 1.00, 1.10, 1.25 for DyCUSUM and θ = 0.10, 0.25, 0.50 for DyEWMA, respectively), which represent cases with different magnitudes of claim-rate change. This provides a broad picture of the effect of different tuning parameters on the performance of the monitoring schemes. Although the best of the three choices is usually satisfactory (at least, not bad), it is interesting to explore the globally optimal values of the two parameters and examine to what extent the performance of the DyCUSUM and DyEWMA schemes can be further improved.
Distinguishing the TSP curves that correspond to a number of different tuning parameters and then figuring out the best one may not be easy. For parameter optimisation purposes, we adopt a performance indicator - the delay until the true-alarm signal is triggered after the claim-rate change (denoted by DuTS) - which is analogous to the concept of average run length in classic SPM research. In addition, we examine two auxiliary indicators - the delays until the overall TSP exceeds 0.5 and 0.8 after the claim-rate change (denoted by DuP_{0.5} and DuP_{0.8}). The three indicators are easy to understand and can be used to evaluate the detection power of any dynamic monitoring scheme. Our simulation studies show that the optimal results of the three indicators are highly consistent across different cases. The results using the optimal tuning parameters for the proposed DyCUSUM and DyEWMA schemes, in terms of the average DuTS over 50,000 random runs, are summarised in Table 4 under production scenario 1. The results using the previously suggested tuning parameters are also included for comparison purposes. Similar results for DuP_{0.5} and DuP_{0.8} and for scenario 2 are observed and thus omitted for space considerations.
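Given simulated run records, the DuTS indicator can be computed as sketched below; the bookkeeping (each run's alarm weeks kept in a list) is our assumption, not the authors' data structure.

```python
def delay_until_true_signal(alarm_weeks, tau):
    """Delay until the first true alarm after the change point tau (DuTS).

    Alarms at or before week tau are false alarms and are ignored; if no
    true alarm occurs, None is returned (in practice such a run would be
    censored at the end of the life cycle).
    """
    true_alarms = [w for w in alarm_weeks if w > tau]
    return (min(true_alarms) - tau) if true_alarms else None

def average_duts(runs, tau):
    """Average DuTS over the simulated runs that produced a true alarm."""
    delays = [d for d in (delay_until_true_signal(r, tau) for r in runs)
              if d is not None]
    return sum(delays) / len(delays) if delays else None
```

DuP_{0.5} and DuP_{0.8} would be computed analogously, as the first week at which the estimated overall TSP curve crosses 0.5 or 0.8.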
According to Table 4, we observe that the optimal values (or the optimal ranges at the current numerical precision) of the tuning parameters (ψ_opt and θ_opt) are directly related to the OC case. In accordance with the traditional wisdom on parameter selection for classic CUSUM and EWMA charts, smaller (larger) values of ψ and θ are generally preferable for detecting smaller (larger) shifts. In particular, given the same ρ, when the claim-rate change occurs later (i.e. τ is larger), the resulting shift in the underlying pattern of aggregate warranty claims appears to be smaller for a long time, due to the higher proportion of units manufactured before τ in the mixed monitored population. Moreover, we find that the optimal performance of the DyEWMA scheme is always slightly superior (with smaller DuTS) to that of the DyCUSUM scheme. Focussing on the DyEWMA scheme, we find that it maintains a similar performance across different θ. This is because the shift size is not constant in a dynamic environment but varies within a certain range over time. As a result, the DyEWMA scheme with the previously suggested θ = 0.10 can achieve performance quite close to the optimum across different cases. This confirms that θ = 0.10 is a rather robust and satisfactory setting for the DyEWMA scheme.

Improvement via the moving window approach
Earlier studies (e.g. Lawless, Crowder, and Lee 2012; Li et al. 2020) have pointed out that there is a so-called 'inertia' problem in monitoring warranty claims based on M_k. After the underlying change occurs, M_k actually contains a large proportion of claims from units that were manufactured long before τ, mixed with claims from units produced after τ. The units manufactured well before τ may still be sold after τ and continually generate eligible warranty claims. As long as these sold units are still under warranty, they are in the warranted base and the warranty claims generated by them need to be counted in M_k. Such a highly heterogeneous population, causing the inertia problem, inevitably makes the monitoring schemes less sensitive to any change. It is worth clarifying that this problem is different from the 'inertia' issue discussed in classic SPM research on EWMA/CUSUM charts, which is beyond the scope of this work.
To mitigate the inertia problem, following Lawless, Crowder, and Lee (2012) and Li et al. (2020), we modify the current monitoring schemes in a straightforward way - that is, by confining our attention to the warranted units that are manufactured during the B (B ≥ 1) most recent periods only (analogous to a moving window approach). In other words, in the kth period, only the sold units manufactured within periods [1 ∨ (k − B), k − 1] are considered as the population under monitoring. As a result, M_k in Equation (5) can be modified to M̃_k, the number of claims received in period k from the warranted units manufactured within periods [1 ∨ (k − B), k − 1]. Then, the monitoring statistics W_k and G_k in Equations (7) and (8), respectively, can be computed by replacing M_k with M̃_k, and the corresponding DPCLs can also be obtained in a similar way. In fact, M_k can be viewed as the limiting case of M̃_k when B is sufficiently large.
The moving window approach mitigates the inertia problem by proactively excluding units from the monitored population once B periods have passed since their manufacturing. However, this simple idea comes at the expense of some information loss, especially when B is small: only the claims from units manufactured in the B most recent weeks are used, with the rest discarded. In other words, there can be no claims from those (OC) units with age a > B, and the exclusion of such information reduces the detection power. Therefore, a sensible balance is required when choosing the value of B. In this study, we examine the effect of B on the performance of the various monitoring schemes, as in Li et al. (2020). For simplicity, we directly follow the simulation setting in Section 4.1. We find that the effect of B is highly consistent across the schemes. For space considerations, we take the DyEWMA scheme designed with θ = 0.10 as an example and display in Figure 4 the effect of B on its overall TSPs under scenario 1.
According to Figure 4, the effect of B on the performance of the schemes hinges on the value of τ. Consistent with the observations in Li et al. (2020), we find that the way the TSP curve changes with B for τ = 0 is distinct from that for τ ∼ U(15, 20) or τ ∼ U(75, 90).
• When τ = 0 (i.e. the claim-rate change happens at the very beginning), all units are produced under the OC condition. Because all warranty claims are generated by the OC product population, it is better to make use of as many warranty claims data as possible for early detection purposes. This is confirmed by the first row of Figure 4, where a larger B corresponds to a steeper TSP curve, implying a higher detection capability. For τ = 0, the limiting case (i.e. B = ∞, corresponding to the original model without truncation of production periods) is the optimal case. This means that if the reliability problem is expected to occur very early (say, τ = 0), the manufacturer should involve all warranty claims data, just as in the original model.
• However, when the claim-rate change occurs at a later stage of the life cycle, the impact of B becomes more complicated. Different from the case of τ = 0, B = ∞ is no longer the best option for τ ∼ U(15, 20) or τ ∼ U(75, 90); rather, it is mostly inferior to the others. More critically, no obvious optimal choice of B can be found in this case in terms of the TSP curve.
The second and third rows of Figure 4 indicate that, over time, the curves dominate their counterparts in an alternating fashion (in ascending order of B). This is because a smaller B excludes the units produced before τ from the monitored product population faster, resulting in a larger detection power at the early stage after the change; on the other hand, a small B also causes a low detection power at later stages due to the insufficient information utilised. The situation is reversed for a larger B. In general, small (resp. large) values of B are preferable for detecting large (resp. small) rate changes; choosing a reasonable value of B can improve the detection capability as long as B is not too small.
In addition, we observe that the curves approach the limiting case faster when ρ is larger, and that the monitoring process terminates earlier if a relatively small B is adopted. This explains why some of the curves in the case of τ ∼ U(75, 90) are shorter. The same observations can be drawn for the other monitoring schemes.
Although there is no clear dominant choice across different cases, Figure 4 shows that a reasonable option of B should be between 10 and 30 (of course, one may also consider the DuTS indicator). In particular, compared with B = 10 and B = 20, B = 30 is a fairly safe option if we use B = ∞ as a baseline; the curve of B = 30 always increases faster than, or at least overlaps with, that of B = ∞ in all cases, whereas the curves for B = 10 and B = 20 can be inferior under certain conditions (though they can be even better than B = 30 in some other cases). This suggests that one can safely consider B = 30 when adopting the moving window approach to improve detection capability.
It is interesting to further compare the performance of the monitoring schemes when the moving window is taken into account. Some results on the performance comparison of the schemes for particular choices of B (e.g. B = 10, B = 20, and B = 30) are available in the Online Supplement. We find that the relationship among the DyEWMA schemes with different θ is similar across the choices of B; the DyEWMA scheme with θ = 0.10 is always highly competitive. However, the case is quite different for the DyCUSUM schemes. The performance of the DyCUSUM scheme with ψ = 1.25 improves significantly when using these choices of B, whereas ψ = 1.00 turns out to be the worst among the three cases of ψ. This observation further supports our previous recommendation of the DyEWMA scheme with θ = 0.10, which also performs fairly well with the original M_k.

A real-life case study
To illustrate the practical application of the proposed dynamic monitoring schemes, we apply them to a real warranty claims dataset introduced by Li et al. (2020). This dataset, collected from a home appliance manufacturer, relates to one of its main products and contains the production data and associated warranty claims data of the units produced between March 2015 and July 2018 (about 175 weeks). In total, 8,349,101 units were produced and 808,820 warranty claims were received up to the collection date. The product is sold with a one-year (52-week) warranty, under which any failures under normal usage will be repaired free of charge by the company. As the average claim rate is unacceptably high, the company would like to find out, through warranty data analysis, whether some reliability problems occurred during the design and manufacturing stages.
Based on the claims data of the units manufactured within the first 30 weeks, Li et al. (2020) find, via the MLE method, that the baseline claim rate follows a power law process (see Equation (14)) with shape parameter β_0 = 3.053 and scale parameter η_0 = 113.300 weeks; that is, λ_0(a) = 1.637 × 10^{−6} × a^{2.053}, a ≥ 0. For more details, interested readers may refer to Li et al. (2020). They then apply the DyShewhart scheme to the dataset and observe the first OC signal in the 134th week when considering the maximum desired CFSR α = 0.0027. Moreover, after the first OC point, many points fall beyond the control limit of the DyShewhart chart, indicating that the process might have gone OC.
Since the previous section has demonstrated the superiority of the proposed memory-type schemes (especially the DyEWMA scheme) over the DyShewhart scheme in detecting unforeseen reliability problems, it is natural to apply them to the same dataset and compare the performances. Moreover, their implementation in practical applications is simple with the assistance of a well-encapsulated computer programme for iterative determination of control limits (our MATLAB program is available from the corresponding author upon reasonable request). We just need to compute the monitoring statistic each week based on the number of warranty claims collected and then compare it with the corresponding control limit. If the monitoring statistic is larger than the control limit, an alarm is triggered, followed by the necessary root-cause-finding procedures; otherwise, we continue the monitoring process by collecting the latest sales volume and computing a new control limit for the next week. We then proceed to the next week and repeat the process above. A flow chart illustrating the implementation of the DyEWMA scheme is given in Figure 5 (the steps for the DyCUSUM scheme are similar). It is worth emphasising that our proposed near-enumeration algorithm can return a sufficiently accurate control limit in less than a second, which is fairly efficient for real applications.
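The weekly loop just described can be sketched as follows. The EWMA-type recursion G_k = θM_k + (1 − θ)G_{k−1} is an assumed form for illustration, and each week's control limit h_k would in practice be produced by the near-enumeration algorithm from the latest sales data.

```python
def run_monitoring(claim_stream, limits, theta=0.10):
    """Weekly DyEWMA-style monitoring loop (a sketch of the flow in Figure 5).

    `claim_stream` yields the weekly aggregate claim counts M_k and `limits`
    the corresponding dynamic probability control limits h_k.  Returns the
    weeks at which an alarm is triggered.
    """
    g, alarms = 0.0, []
    for k, (m_k, h_k) in enumerate(zip(claim_stream, limits), start=1):
        g = theta * m_k + (1 - theta) * g   # update the monitoring statistic
        if g > h_k:
            alarms.append(k)                # alarm: start root-cause finding
    return alarms
```

In a deployment, the loop body would also pull in the latest sales volume and call the control-limit routine before the comparison, exactly as in the flow chart.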
In Figure 6, we plot two charts corresponding to the DyEWMA scheme with θ = 0.10 and the DyCUSUM scheme with ψ = 1.00, respectively, which are representative among their respective counterparts. Although the two charts look very different, they sound alarms at similar points in time. Both trigger an OC signal around the 120th week with subsequent points going dramatically beyond the control limit; this observation is consistent with that for the DyShewhart chart, but the DyEWMA and DyCUSUM charts signal earlier. In addition, the two charts reveal that the warranty claims data prior to the 120th week appear to be slightly unsteady, as they also sound alarms around the 60th and 90th weeks, respectively. In contrast, the DyShewhart chart does not indicate any OC condition before the 134th week (see Li et al. 2020). Generally speaking, the DyEWMA and DyCUSUM charts are more likely to discern underlying OC conditions. Therefore, it is quite necessary for the company to initiate a thorough investigation into the possible root cause(s) of such an abnormal pattern. It is also worth mentioning that the control limit in Figure 6(a) rapidly approaches and then stays around a constant. This demonstrates that G_k (if in control) approximately follows a normal distribution for sufficiently large values of E[M_k], as in the case of the DyShewhart chart.
We have also checked other choices of tuning parameters for the two charts (detailed results are omitted for brevity), and found that the DyEWMA chart with θ = 0.25 or θ = 0.50 looks similar to that in Figure 6(a), whereas the DyCUSUM chart almost fails to work even for ψ = 1.10. Again, this reveals that the DyCUSUM scheme is not as reliable as the DyEWMA scheme.
It is viable to further improve the detection power by incorporating the moving window approach into the original schemes. Figure 7 displays the DyEWMA chart with θ = 0.10 and a sensible choice of B = 30. As a comparison, we also show in the figure the corresponding DyEWMA chart for B = 10, in order to demonstrate the performance of a small B. From Figure 7, we find that the DyEWMA chart for B = 30 resembles the original one, which well supports our previous findings with the original schemes, but its detection sensitivity is improved (all the OC signals occur about 10 weeks earlier). On the other hand, the DyEWMA chart for B = 10 becomes unsteady (the associated G_k fluctuates dramatically from the early weeks onwards). In other words, a small B, representing a significant information loss, can lead to ineffective process monitoring and is thus not recommended. Overall, in real applications, it is good practice to examine several choices of B for enhanced early detection.

Conclusions
This paper focuses on the early detection of reliability problems that emerge at the design and/or manufacturing stages, through dynamically monitoring aggregate warranty claims with suitable SPM methods. To improve the performance in detecting small changes in the warranty claims monitoring problem, we modify traditional CUSUM and EWMA schemes by incorporating the dynamic design approach and thus adopting DPCLs instead of a constant control limit. The proposed monitoring schemes, named DyCUSUM and DyEWMA, work well even in complex situations without any prior knowledge of future sales, which is an essential step forward compared with traditional static schemes. To achieve the dynamic design of the DyCUSUM and DyEWMA schemes, we develop three algorithms (i.e. Monte Carlo simulation, Markov chain, and near-enumeration) to progressively determine control limits while controlling the CFSRs at a desired value. After examining the three algorithms, we find that the near-enumeration algorithm is a good choice, as it can achieve a higher approximation accuracy with a lower computational effort. We believe that it is applicable not only to the warranty claims monitoring problem, but also to similar problems concerning the dynamic design of control charts in the discrete case.
A thorough performance evaluation of the proposed DyCUSUM and DyEWMA schemes, based on the overall SP metrics, is carried out, taking into account different production-sales processes and OC cases. The results show that the DyEWMA scheme exhibits superior and robust detection performance; in particular, the DyEWMA scheme designed with θ = 0.10 is almost always superior to, or at least comparable to, the best of all the other schemes in all the cases we consider. By contrast, the DyCUSUM scheme can be completely ineffective in a few cases, especially when designed with a large ψ. Consequently, the DyCUSUM scheme should be used with great caution. The optimal tuning parameters of the two schemes are also investigated, and guidelines on parameter selection are provided. Furthermore, to mitigate the 'inertia' problem, we adopt the moving window idea to modify the DyCUSUM and DyEWMA schemes. The results show that the detection capability of the schemes can indeed be improved by choosing a reasonable value of B (B cannot be too small). It is noteworthy that, when considering the moving window, the DyEWMA scheme remains highly competitive among all schemes under consideration.
Although the implementation of dynamically designed control charts seems more complicated than that of traditional ones, there is no substantial difficulty in applying them with the assistance of a well-encapsulated computer programme, as we have shown in the case study. Nevertheless, this paper has several limitations. First, individual heterogeneity is not considered in the aggregate warranty claims forecasting model. In reality, individual units might exhibit heterogeneity in claim rates due to variations in operating conditions and environments. Second, we implicitly assume that the time lag between the occurrence of a failure and the resulting warranty claim is negligible. In practice, however, reporting delays might occur for various reasons, from either the consumer's or the manufacturer's side (Wu 2012), in which case the observed number of claims over a period differs from the actual number of claims that have occurred. Although these assumptions help simplify the model and facilitate development of the monitoring schemes, a more realistic warranty claims forecasting model is needed.
From the SPM perspective, several future avenues might be interesting to explore. First, profile monitoring has been a promising topic in SPM research. The new perspective of warranty-claims profile monitoring deserves further investigation (Song et al. 2022, 2023). Second, modern warranty databases generally contain rich information on each claim. It is thus interesting to simultaneously monitor multiple aspects of warranty claims (e.g. claim counts and the servicing cost of each claim), which might contribute to systematic and refined decision support for early detection of reliability problems.

In the second experiment, a sequence of 30 random sample sizes is drawn from n_t ∼ U(10, 20).
From Table A1, we can observe that the DPCLs attained by the three algorithms show a high degree of agreement.
The control limits vary significantly in the first few periods (samples) and then tend to be relatively stable (and close to the corresponding constant control limit given the sample-size distribution). In particular, under the constant-sample-size setting, the control limits derived by Alg #2 and Alg #3 reach the steady value after a certain number of periods, whereas those derived by Alg #1 fluctuate constantly within a small range due to the randomness inherent in Monte Carlo simulation. By contrast, the control limits under the varying-sample-size setting change over time as n_t varies, in order to maintain the CFSR at the desired level. For a valid comparison of the approximation accuracy of the algorithms in determining the DPCLs, we evaluate the actual CFSRs over time with the determined DPCLs through an exhaustive enumeration. The actual CFSRs of the various monitoring schemes are plotted in Figures A1 and A2 for the two sample-size settings, respectively. In addition to the DyCUSUM and DyEWMA schemes, their respective static versions, referred to as FixCUSUM and FixEWMA, which use a constant control limit, are also included for comparison purposes. For the FixCUSUM and FixEWMA schemes, we determine the control limit for each specified sample-size distribution in a similar way to Ryan and Woodall (2010), Jiang, Shu, and Tsui (2011), and Zhou et al. (2012), among others. By maintaining the average run length at approximately 370 (i.e. an average false signal rate of 0.0027), the constant control limit of the FixCUSUM scheme is derived as 19.000 in the first experiment and 20.296 in the second, and as 1.124 and 1.113, respectively, for the FixEWMA scheme.
From Figures A1 and A2, we can see that the actual CFSRs of the dynamic monitoring schemes based on Alg #2 and Alg #3 exhibit a high degree of consistency and are well controlled at or near the pre-specified level (α = 0.0027), even under the varying-sample-size setting. Alg #3 is slightly better than Alg #2, because the actual CFSRs obtained by Alg #3 exhibit less fluctuation and, more importantly, are generally close to but not exceeding 0.0027, whereas those obtained by Alg #2 occasionally go beyond this maximum permissible level. On the other hand, although the actual CFSRs obtained by Alg #1 also fluctuate around 0.0027, they are subject to much larger variations than those of Alg #2 and Alg #3 and exceed 0.0027 more frequently. The performance of Alg #1 is especially poor when applied to the DyCUSUM scheme in the first experiment (see Figure A1(a)). This makes the simulation-based approach less attractive for dynamically determining probability control limits, compared with the other two algorithms.
For the FixCUSUM and FixEWMA schemes, we observe that their actual CFSRs always grow from zero and gradually approach the pre-specified level over a certain number of periods (6 to 15 in our experiments). Obviously, the corresponding CFSRs are uncontrolled and severely suppressed during the early periods, which is undesirable because early shifts in the process cannot be detected effectively. After this early stage, the actual CFSRs may reach their long-term average (i.e. approximately 0.0027) and then remain stable (under the constant-sample-size setting; see Figure A1) or fluctuate around the average (under the varying-sample-size setting; see Figure A2). It should be emphasised, however, that Figure A2 represents only an ideal situation in which the sample-size distribution is known in advance. Exact information on future sample sizes is more often unavailable for offline determination of control limits. When the actual sample-size sequence (distribution) deviates from the assumed one, the outcome would be much worse. As a result, traditional FixCUSUM and FixEWMA schemes are not recommended in most cases, especially when sample sizes are time-varying and unpredictable.
In general, the proposed dynamic monitoring schemes perform quite well in controlling the CFSRs and can be used for any sample-size sequence without prior information on future sample sizes. Nevertheless, their actual performance depends largely on the specific algorithm adopted. Among the three algorithms, as noted above, Alg #2 and Alg #3 appear to be satisfactory options, especially the latter, whereas Alg #1 cannot output sufficiently reliable control limits. Although Alg #3 adopts a near-enumeration idea that seems time-consuming at first glance, its computational efficiency is acceptable. To reveal the computational efficiency of the three algorithms, we conduct additional simulation experiments. Table A2 reports the average time (in seconds) required by the algorithms to determine each control limit, based on 30 runs, each with a sequence of 100 random sample sizes drawn from n_t ∼ U(10, 20) and n_t ∼ U(100, 200). The three algorithms are programmed in MATLAB R2018b and run on a 3.60 GHz Intel Xeon E5-1650 machine.
Table A2 shows that, using the suggested algorithm arguments, Alg #3 enjoys a comparable (or even lower) computation time than Alg #1 and runs far faster than Alg #2. This is because the near-enumeration algorithm leverages the discreteness of the Poisson distribution, so that many of the calculations involved in Alg #3 are basic arithmetic operations. By contrast, when determining the transition probability matrix at each period for Alg #2, we have to calculate (V + 1)^2 entries in terms of a Poisson distribution. Moreover, the Markov chain approach itself is relatively inefficient in the discrete setting, as many of the entries are actually zero or identical; many of the associated calculations are thus redundant. Because Alg #3 attains higher approximation accuracy with a lower computational burden, it is employed for dynamically determining probability control limits throughout this work.
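To make the near-enumeration idea concrete, the sketch below propagates the exact (discretised) in-control distribution of an EWMA statistic of Poisson claim counts period by period, and reads off each dynamic probability control limit as the smallest retained value whose right-tail mass does not exceed α. The grid width, the tail truncation of the Poisson support, and the function names are our own illustrative assumptions; the paper's algorithm is defined over its specific charting statistic:

```python
import math
from collections import defaultdict

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

def near_enum_dpcl_ewma(mus, theta=0.10, alpha=0.0027, grid=1e-3):
    """Near-enumeration sketch: track the in-control distribution of the
    EWMA statistic (conditional on no signal so far) on a discretised
    support, and set each period's limit so the tail mass stays <= alpha."""
    dist = {0.0: 1.0}                              # statistic value -> probability
    limits = []
    for mu in mus:
        kmax = int(mu + 8.0 * math.sqrt(mu)) + 1   # truncate Poisson support
        new = defaultdict(float)
        for z, p in dist.items():
            for x in range(kmax + 1):
                znew = theta * x + (1.0 - theta) * z
                znew = round(znew / grid) * grid   # collapse onto the grid
                new[znew] += p * poisson_pmf(x, mu)
        # limit = smallest value whose strict right-tail mass is <= alpha
        total = sum(new.values())
        tail, h = 0.0, max(new)
        for v in sorted(new, reverse=True):
            if tail + new[v] > alpha * total:
                h = v
                break
            tail += new[v]
        limits.append(h)
        # condition on survival: keep values at or below the limit, renormalise
        surv = {v: p for v, p in new.items() if v <= h}
        s = sum(surv.values())
        dist = {v: p / s for v, p in surv.items()}
    return limits
```

Because every update involves only the discrete Poisson support and basic arithmetic, the per-period cost stays modest; for a constant claim rate the limits settle near a steady value, mirroring the behaviour reported for the constant-sample-size setting.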

Figure 2. Overall TSPs of various dynamic monitoring schemes after the claim rate change under scenario 1.

Figure 3. Overall TSPs of various dynamic monitoring schemes after the claim rate change under scenario 2.

Figure 4. Effect of B on the overall TSPs of the DyEWMA scheme (θ = 0.10) after the claim rate change under scenario 1.

Figure 5. Step-by-step implementation of the DyEWMA scheme.

Figure 6. DyEWMA chart (θ = 0.10) and DyCUSUM chart (ψ = 1.00) for the real dataset. Each arrow points to the starting point of a run of successive points plotted above their respective control limits, which correspondingly triggers an OC signal.

Figure A1. Actual CFSRs of various monitoring schemes in the first simulation experiment.

Figure A2. Actual CFSRs of various monitoring schemes in the second simulation experiment.
R, and store them in O_{G_k}.
4. Set the control limit H_{G,k} as the 100(1 − α)th percentile of the R elements in O_{G_k}.
5. If G_k ≤ H_{G,k}, keep the elements in O_{G_k} that are less than or equal to H_{G,k} and store them in O_{G_k}. Then let k = k + 1 and go to Step 2. Otherwise, an OC signal should be issued.
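Under our reading of these steps, the Monte Carlo version (Alg #1) can be sketched as follows: R in-control replicates of the charting statistic are simulated forward; each period the control limit is taken as the 100(1 − α)th percentile of the surviving replicates, and replicates above the limit are discarded before the next update. The EWMA statistic, the Poisson claim-count model, and the names below are illustrative assumptions:

```python
import numpy as np

def mc_dpcl_ewma(mus, theta=0.10, alpha=0.0027, R=100_000, seed=7):
    """Monte Carlo sketch of DPCL determination for an EWMA statistic.
    The replicate pool shrinks by roughly a factor (1 - alpha) per period,
    which is negligible over a typical monitoring horizon."""
    rng = np.random.default_rng(seed)
    z = np.zeros(R)                       # R surviving in-control replicates
    limits = []
    for mu in mus:
        x = rng.poisson(mu, size=z.size)  # simulated period-t claim counts
        z = theta * x + (1.0 - theta) * z
        h = np.quantile(z, 1.0 - alpha)   # 100(1 - alpha)th percentile
        limits.append(float(h))
        z = z[z <= h]                     # keep only replicates that survive
    return limits
```

In use, the observed statistic G_k would be updated alongside this pool and an OC signal issued as soon as G_k exceeds H_{G,k}. The sampling noise in the percentile estimate is what causes the limits from this algorithm to fluctuate from period to period.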
G_2. The control limit H_{G,2} is exactly the maximum element in O*_{G_2} by our definition. If we observe G_2 > H_{G,2} in use, then an OC signal should be triggered; otherwise, we proceed to the third period. In this case, we derive a paired vector P**_{G_2} by adding up the probabilities in P*_{G_2}. If G_3 ≤ H_{G,3} is observed, then we move on to the next period and store the first V*_3 elements of O*_{G_3} in a new vector, storing them in O*_{G_k} and P*_{G_k}, respectively. Then let k = k + 1 and go to Step 2. Otherwise, an OC signal should be issued.

Table 1. Strengths and limitations of the three algorithms.
Alg #2: Markov chain
Strengths: able to attain sufficiently accurate and stable DPCLs; adapted from a usual routine for offline determination of control limits.
Limitations: less efficient in the discrete case; computationally intensive and highly time-consuming when V is relatively large.

Alg #3: Near-enumeration
Strengths: able to attain sufficiently accurate and stable DPCLs; able to obtain results in less than a second; fully leverages the discreteness of warranty claims.
Limitations: applicable only to the discrete case; embeds both exhaustive enumeration and discretisation for approximation, which increases the complexity.

Table 2. Overall SP results of the DyCUSUM and DyEWMA schemes under scenario 1.

Table 3. Overall SP results of the DyCUSUM and DyEWMA schemes under scenario 2.

Table 4. DuTS values of the DyCUSUM and DyEWMA schemes using the optimal and referenced tuning parameters under scenario 1.

Table A1. DPCLs of the two monitoring schemes for the first 30 samples in the two simulation experiments.

Table A2. Computation time (in seconds) required by the three algorithms.