Directional fault classification for correlated high-dimensional data streams using hidden Markov models

Abstract Modern manufacturing systems are often equipped with sensor networks that generate high-dimensional data at high velocity. These data streams offer valuable information about the industrial system’s real-time performance. If a shift occurs in the manufacturing process, fault diagnosis based on the data streams becomes a fundamental task as it identifies the affected data streams and provides insights into the root cause. Existing fault diagnostic methods either ignore the correlation between different streams or fail to determine the shift directions. In this paper, we propose a directional fault classification procedure that incorporates the between-stream correlations. We suggest a three-state hidden Markov model that captures the correlation structure and enables inference about the shift direction. We show that our procedure is optimal in the sense that it minimizes the expected number of false discoveries while controlling the proportion of missed signals at a desired level. We also propose a deconvolution-expectation-maximization (DEM) algorithm for estimating the model parameters and establish the asymptotic optimality for the data-driven version of our procedure. Numerical comparisons with an existing approach and an application to a semiconductor production study show that the proposed procedure works well in practice.


Introduction
Rapid development of sensor technology has made it possible for modern manufacturing systems to be equipped with networks of sensors such as thermometers, vibroscopes, displacement meters and flow meters. These sensors continuously generate high-volume data at high velocity. Such data are high-dimensional in nature, so in the literature they are often referred to as high-dimensional data streams (HDDS). Because of their ability to indicate the industrial system's performance on a real-time basis, HDDS are regarded as having great potential for improving the practice of industrial quality control. Properly analyzing HDDS, however, can be challenging for several reasons. First, with the number of data streams in the hundreds or thousands, abnormal (extremely large or small) observations that trigger out-of-control (OC) signals are likely at each time point, rendering typical metrics such as average run length (ARL) in traditional statistical process control (SPC) inapplicable. Second, OC data streams tend to occur in clusters and clumps because data sourced from sensors in close vicinity are often correlated. Such correlation needs to be taken into account appropriately. Third, in order to have a timely assessment of the system performance, efficient computing is necessary for processing the fast-flowing data streams. Consequently, many conventional multivariate SPC techniques (e.g., Hotelling's T² or principal components analysis) that require large matrix computation would be too cumbersome to handle HDDS.
The existing literature on SPC methods for HDDS has largely focused on online monitoring, of which the major goal is to determine whether the production process has experienced an OC event. See, for instance, Mei (2010), Ross, Tasoulis, and Adams (2011), Liu, Mei, and Shi (2015), Zou et al. (2015), Xian, Wang, and Liu (2018), Yan, Paynabar, and Shi (2018), Zhang, Chen, and Wu (2020) and Li et al. (2021) for related methods. We refer readers to Woodall and Montgomery (2014), Weese et al. (2016) and Megahed and Jones-Farmer (2015) for comprehensive overviews of this topic. More recently, there has been growing interest in post-signal diagnostic problems involving HDDS, in which the primary purpose is to identify the data streams affected by the OC event after an OC signal has been delivered. The task of efficient fault diagnosis is fundamentally important in quality engineering, especially in situations involving HDDS, as it would be extremely labor intensive (if not practically infeasible) for quality engineers to examine the thousands of data streams one by one. It is also beneficial if the diagnostic procedure can accurately indicate the shift directions of the OC streams. With this directional information, quality engineers are able to shorten the repair process significantly, saving time and reducing costs. In the literature, a number of diagnostic approaches have been proposed. Some have made use of variable selection techniques for multiple linear regression by setting up the fault diagnosis problem such that the selected variables are regarded as OC streams (e.g., Wang and Jiang 2009; Zou and Qiu 2009; Capizzi and Masarotto 2011; Zou, Jiang, and Tsung 2011). Although this approach has been shown to outperform many traditional SPC methods, it would be too computationally expensive for typical HDDS applications in which there could be thousands of input data streams generating observations every second (see, e.g., De Oca et al. 2010; Spiegelhalter et al. 2012; Woodall and Montgomery 2014; Li et al. 2020). Others have analyzed the HDDS problem under the framework of large-scale multiple testing (e.g., Li et al. 2020; Xiang et al. 2021a). We find this approach particularly relevant to post-signal diagnostics as it not only identifies nearly all the OC streams but also excludes the irrelevant in-control (IC) streams. A major limitation of the existing multiple testing-based methods is that they assume independence between data streams, which is rarely the case in practice. To address this limitation, Xiang et al. (2021b) proposed a reliable diagnostic procedure that permits correlation between data streams by incorporating a two-state hidden Markov model (HMM). Their approach is able to describe the correlation structure reasonably well, but the two-state HMM is not equipped to determine the shift directions, making the fault diagnosis less informative. Xiang et al. (2021a) have also considered between-stream correlations and evaluated their proposed approach using simulations. Although the numerical results have indicated that their approach is able to handle correlated data streams, the authors assumed that the correlation matrix is completely known in their data-driven method, rendering the method less useful in practice. Furthermore, in their numerical study, the authors could only implement the independent version of their approach (i.e., a diagonal working correlation matrix is used) due to the insurmountable difficulty of integral computations in high dimensions.
In this article, we develop a computationally fast directional diagnostic procedure that handles correlated HDDS. Our contribution to the literature is three-fold. First, we model the HDDS correlation structure by a three-state HMM, which not only conveniently enables inference about the shift directions but also makes the theoretical analysis tractable. By formulating the diagnosis problem as multiple three-class comparisons, we derive an oracle decision rule corresponding to the optimal directional fault classification procedure that minimizes the expected number of false signals while controlling the proportion of missed discoveries among all the OC streams at a pre-specified level. Second, we devise recursive formulas for the numerical implementation of our procedure, making the computation manageable for most HDDS applications. Third, we suggest a deconvolution-expectation-maximization (DEM) algorithm for estimating the model parameters and establish the asymptotic optimality for the data-driven version of our procedure, eliminating the assumption made in Xiang et al. (2021a) that the correlation matrix must be specified. Comparisons with a state-of-the-art method show that the proposed procedure works well in various scenarios. It should be pointed out that, in addition to their ubiquity in manufacturing environments, HDDS are increasingly common in computer network monitoring (Denby et al. 2007; Jeske et al. 2009), bioterrorism surveillance (Rolka et al. 2007; Shmueli and Burkom 2010) and business activity monitoring (Jiang, Au, and Tsui 2007; Bisgaard 2012). Our proposed procedure is useful in those applications as well.
The remainder of this article is organized as follows. In Section 2, the proposed procedure and its theoretical properties are described in detail. In Section 3, the numerical performance of our procedure is evaluated and compared with an existing approach. Section 4 demonstrates the proposed procedure with a real semiconductor production dataset. Section 5 summarizes the work and suggests future research directions. Additional simulation results and technical proofs are included in the supplementary file.

Problem setup
Let $t$ be the time index and $\{X_t = (X_{1t}, X_{2t}, \ldots, X_{mt})^T : t\}$ be the monitored data streams of $m$ quality characteristics. When the production process is IC, $X_t$ follows an IC distribution, which can be estimated using an IC dataset collected beforehand. The problem of IC distribution estimation is often considered in phase I SPC. We refer to Qiu (2014) for commonly used phase I SPC methods. Throughout our discussion, we assume that the IC distribution is known. After an OC shift occurs in the process, an online monitoring control chart can signal the shift, and a small number of OC observations, denoted by $\{X_j^{OC}, j = 1, 2, \ldots, n\}$, are available for fault diagnosis. In a manufacturing setting, some components of the manufacturing system might still be working stably, as an OC event usually involves only a few parts of the system. Hence, considering the two possible shift directions, each component of $X^{OC}$ can have three latent states: IC, OC with a positive shift, and OC with a negative shift. Denote this unobservable latent state by $\theta$. A quality engineer's major concern in this setting is to infer $\theta$ based on the observed values of $X^{OC}$. By this motivation, we assume that $\{X_j^{OC}\}$ are generated from the following mixture model:

$$X_j^{OC} \mid \mu \sim N(\mu, \Sigma), \qquad \mu_i \mid \theta_i \sim I(\theta_i = 0)\,\delta_0(\mu_i) + I(\theta_i = 1)\,h_1(\mu_i) + I(\theta_i = -1)\,h_2(\mu_i), \qquad P(\theta_i = k) = p_k, \tag{1}$$

where $N(\mu, \Sigma)$ denotes the normal distribution with mean $\mu$ and covariance matrix $\Sigma$, $\delta_x(\cdot)$ denotes the Dirac function with the unit mass concentrated at $x$, $I(\cdot)$ is an indicator function, $h_1(\cdot)$ and $h_2(\cdot)$ are probability density functions with support in $(0, +\infty)$ and $(-\infty, 0)$ respectively, $p_k \ge 0$ for $k = 0, \pm 1$, and $\sum_{k=-1}^{1} p_k = 1$. $\mu$ is the OC mean with some nonzero components due to the process shift. $\theta_i$ represents the status of data stream $i$, with the probabilities of being IC, having shifted positively, and having shifted negatively equal to $p_0$, $p_1$ and $p_{-1}$, respectively. The hierarchical structure of model [1] has been widely adopted in the literature for approximating high-dimensional distributions (e.g., Efron 2004; Sun and Cai 2007, 2009; Cai and Sun
2009) and it is particularly relevant to our fault diagnosis problem for the following reasons. First, the three-state specification serves our purpose of directional fault classification. Second, in quality control applications, the IC state of quality characteristics is often well defined, whereas the OC state includes a range of possibilities. This is consistent with the mixture distribution of $\mu \mid \theta$ in model [1]. Third, the dependence structure of the HDDS is often too complex to be described adequately. With the hierarchical structure in [1], the dependence between data streams can be induced by the correlation among the $\theta_i$'s, which is more tractable. More importantly, the formulation of $\{\theta_i\}$ as a Markov process is able to characterize the clusters-and-clumps phenomenon in fault analysis (more detailed discussion on this later). Here we further assume that $\{X_j^{OC}\}$ are independent over time. In cases where the observations are temporally correlated, we can remove the correlation in advance by some decorrelation technique (e.g., Apley and Tsung 2002; Qiu, Li, and Li 2020). We also assume that $\Sigma$ is the $m \times m$ identity matrix. In Section 3, we consider extensions to cases where $\Sigma$ is non-diagonal or the distribution of $X_j^{OC} \mid \mu, \theta$ is non-normal. In many HDDS applications, OC streams often occur in clusters because the physical locations of their corresponding data sources (e.g., sensors) are within close vicinity. This dependence structure is informative for our fault diagnosis but often ignored in the existing research. We adopt the following three-state HMM to capture such a dependence structure. We assume that $\{\theta_i, i = 1, 2, \ldots, m\}$ form a stationary, irreducible and aperiodic Markov chain. Denote the initial distribution by $\pi^0 = (\pi_0^0, \pi_1^0, \pi_{-1}^0)^T$, the stationary distribution by $\pi = (\pi_0, \pi_1, \pi_{-1})^T$, and the transition probabilities by $a_{kl} = P(\theta_{i+1} = l \mid \theta_i = k)$, collected in the transition matrix $A$. The transition probabilities $\{a_{kl}\}$ do not depend on $i$ and satisfy the standard constraints: $0 \le a_{kl} \le 1$ for $k, l \in \{0, \pm 1\}$, and $\sum_{l=-1}^{1} a_{kl} = 1$ for $k \in \{0, \pm 1\}$. Note that the stationary distribution $\pi$ can be determined from the transition matrix via the relationship $\pi^T A = \pi^T$. In the special case of $\pi^0 = \pi$, we have $P(\theta_i = k) = \pi_k$.

The goal of our fault diagnosis is to accurately identify all the OC streams and correctly determine their shift directions. To achieve this goal, we consider a related three-class multiple testing problem in which a decision rule is a measurable map $\delta = (\delta_1, \delta_2, \ldots, \delta_m)^T : \mathcal{X} \to \{0, 1, -1\}^m$, where $\mathcal{X}$ is the sample space. If $\delta_i = 0$, then $\theta_i$ is deemed to have taken the value 0, i.e., the $i$-th data stream is classified as IC. Similarly, if $\delta_i = 1$ or $-1$, then the $i$-th data stream is classified as OC with a positive or negative shift, respectively. For a given decision rule $\delta$, the outcomes of problem [2] can be categorized as in Table 1. Based on this categorization, the missed discovery rate (MDR) is defined as

$$\mathrm{MDR}(\delta) = \frac{E\left[\sum_{i=1}^m I(\theta_i \ne 0,\ \delta_i \ne \theta_i)\right]}{E\left[\sum_{i=1}^m I(\theta_i \ne 0)\right]}.$$

By this definition, it is a missed discovery if an OC stream is correctly flagged but its shift direction is misspecified. MDR is similar to the concept of false discovery rate (FDR) in the multiple testing literature. However, it is not appropriate to use FDR in our fault diagnosis problem, because an FDR-based approach is only concerned with the proportion of IC streams misclassified as OC relative to all the discoveries, paying no attention to missed OC signals.
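As a concrete illustration of model [1] and the three-state HMM, the following sketch simulates the latent chain, the stream means and the OC sample mean. The transition matrix, the gamma-type shift densities (shape 2, location 0.05, scale 0.5) and all variable names are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3000, 10                        # number of streams, number of OC observations
states = np.array([0, 1, -1])          # IC, positive shift, negative shift

# Illustrative transition matrix A (rows indexed by the current state)
A = np.array([[0.75, 0.125, 0.125],
              [0.60, 0.300, 0.100],
              [0.60, 0.100, 0.300]])

# Latent chain {theta_i}: start in the IC state (initial distribution (1, 0, 0))
theta = np.empty(m, dtype=int)
idx = 0
for i in range(m):
    theta[i] = states[idx]
    idx = rng.choice(3, p=A[idx])

# Stream means: point mass at zero for IC streams; gamma-type shift densities
# (hypothetical choices) bounded away from zero for OC streams
mu = np.zeros(m)
up, down = theta == 1, theta == -1
mu[up] = 0.05 + rng.gamma(2.0, 0.5, size=up.sum())
mu[down] = -(0.05 + rng.gamma(2.0, 0.5, size=down.sum()))

# OC observations X_j | mu ~ N(mu, I_m); the sample mean is sufficient for mu
X = mu + rng.standard_normal((n, m))
Xbar = X.mean(axis=0)
```

With larger diagonal entries in the transition matrix, the simulated OC streams occur in longer runs, reproducing the clusters-and-clumps behavior described above.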
In addition to controlling the MDR at a desired level, a reasonable fault classification procedure should keep the expected number of false discoveries (EFD) to a minimum. The formal definition is

$$\mathrm{EFD}(\delta) = E\left[\sum_{i=1}^m I(\delta_i \ne 0,\ \theta_i \ne \delta_i)\right].$$

In our fault classification problem, the goal is to have as few false discoveries as possible while controlling the proportion of missed signals. Therefore, we consider the following constrained optimization problem:

$$\min_{\delta}\ \mathrm{EFD}(\delta) \quad \text{subject to} \quad \mathrm{MDR}(\delta) \le \alpha, \tag{3}$$

where $\alpha \in (0, 1)$ is specified beforehand. In the following subsections, we derive the solution to problem [3].

The oracle procedure
Throughout our discussion in this subsection, we assume that the parameters $A$, $\pi^0$ and the functions $h_1(\cdot)$, $h_2(\cdot)$ in model [1] are known. Since the sample mean $\bar X = \sum_{j=1}^n X_j^{OC}/n$ is a sufficient statistic for $\mu$, we use it to construct our solution to problem [3]. Define $H_i^k(\bar X) = P(\theta_i = k \mid \bar X)$ for $k = 0, \pm 1$, and let $H_i^{\max}(\bar X) = \max_{k = \pm 1} H_i^k(\bar X)$. The constrained optimization problem [3] can then be written in terms of these conditional probabilities. The related penalized objective function is $L_{\bar X}(\lambda, \delta)$, where $\lambda > 0$ is a penalty parameter. It is easy to verify that, for a given $\lambda > 0$, $L_{\bar X}(\lambda, \delta)$ is minimized by a rule $d_\lambda$ that thresholds $\Lambda_i(\bar X) = H_i^0(\bar X)/H_i^{\max}(\bar X)$. It is intuitively sensible that the decision rule $d_\lambda$ for the multiple testing problem [2] is based on $\Lambda_i(\bar X)$, as it compares the conditional probabilities $\{P(\theta_i = k \mid \bar X), k = 0, \pm 1\}$. It remains to find the $\lambda$ value such that the decision rule $d_\lambda$ controls the MDR at the desired level $\alpha$. The next theorem gives the details.
Theorem 1 shows that $\lim_{\lambda\to\infty} \mathrm{MDR}(d_\lambda)$ is the infimum of all the MDR levels achievable by $d_\lambda$, and that this lower bound is strictly positive. Hence, the condition $\alpha > \lim_{\lambda\to\infty} \mathrm{MDR}(d_\lambda)$ is necessary for $d_\lambda$ to achieve the required MDR level. By Theorem 1, the optimal decision rule $d_{\lambda^*}$ is obtained by spending all the available MDR. Similar tradeoffs also exist in the multiple testing literature (e.g., Cai and Sun 2009).

Table 1. Classification of tested hypotheses.
The use of the decision rule $d_{\lambda^*}$ requires the calculation of $H_i^k(\bar X)$, which can be greatly simplified by the forward-backward algorithm described below. Let $f(\cdot)$ and $f(\cdot \mid \cdot)$ denote the joint and conditional likelihood, respectively. Define $\alpha_{i+1}(k) = f(\bar X_1, \bar X_2, \ldots, \bar X_{i+1}, \theta_{i+1} = k)$ and $\beta_i(k) = f(\bar X_{i+1}, \bar X_{i+2}, \ldots, \bar X_m \mid \theta_i = k)$. By using the Markov property repeatedly, we have the recursive computation

$$\alpha_{i+1}(k) = \Big[\sum_{l} \alpha_i(l)\, a_{lk}\Big]\, f(\bar X_{i+1} \mid \theta_{i+1} = k), \tag{5}$$

and similarly,

$$\beta_i(k) = \sum_{l} a_{kl}\, f(\bar X_{i+1} \mid \theta_{i+1} = l)\, \beta_{i+1}(l). \tag{6}$$

For initialization, we set $\alpha_1(k) = \pi_k^0 f(\bar X_1 \mid \theta_1 = k)$ and $\beta_m(k) = 1$. Taken together, we have

$$H_i^k(\bar X) = P(\theta_i = k \mid \bar X) = \frac{\alpha_i(k)\,\beta_i(k)}{\sum_{l} \alpha_i(l)\,\beta_i(l)}.$$

Next, we discuss the computational issue of finding $\lambda^*$, as $\mathrm{MDR}(d_\lambda)$ is not straightforward to compute algorithmically. Our idea is to obtain an estimate of $\mathrm{MDR}(d_\lambda)$, denoted by $\widehat{\mathrm{MDR}}(d_\lambda)$, and then estimate $\lambda^*$ accordingly. More specifically, [4] yields a moment estimator for the MDR.
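The recursions above can be sketched as follows. This is a minimal implementation for a generic emission matrix; the function name and the per-step scaling (used to avoid numerical underflow for large m) are our choices, and the posteriors it returns are the $H_i^k$ quantities.

```python
import numpy as np

def forward_backward(emis, A, pi0):
    """Posterior state probabilities P(theta_i = k | X) for an HMM.

    emis: (m, K) array with emis[i, k] = f(X_i | theta_i = k)
    A:    (K, K) transition matrix; pi0: (K,) initial distribution.
    Each forward step is rescaled to sum to one for numerical stability.
    """
    m, K = emis.shape
    alpha = np.empty((m, K)); beta = np.empty((m, K)); c = np.empty(m)
    alpha[0] = pi0 * emis[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for i in range(1, m):                      # forward pass, recursion [5]
        alpha[i] = (alpha[i - 1] @ A) * emis[i]
        c[i] = alpha[i].sum(); alpha[i] /= c[i]
    beta[m - 1] = 1.0
    for i in range(m - 2, -1, -1):             # backward pass, recursion [6]
        beta[i] = (A @ (emis[i + 1] * beta[i + 1])) / c[i + 1]
    post = alpha * beta                        # proportional to alpha_i(k) beta_i(k)
    return post / post.sum(axis=1, keepdims=True)
```

The cost is O(mK²) time and O(mK) memory, which is what makes the procedure feasible for streams numbering in the thousands.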

The Data-Driven procedure
The model parameters fA, p 0 g and density functions fh 1 ðÁÞ, h 2 ðÁÞg are seldom known in practice.In this subsection, we suggest ways for estimating them and thus obtain a data-driven version of our diagnostic procedure.
We first consider the estimation of $h_1(\cdot)$, $h_2(\cdot)$ given an estimated transition matrix $\hat A$. By the data generating mechanism in model [1], we can write $\bar X_i = \mu_i + \epsilon_i$, where $\mu_i \sim p_0\,\delta_0(\mu_i) + p_1 h_1(\mu_i) + p_{-1} h_2(\mu_i)$, $\epsilon \sim N(0, \Sigma/n)$, and $\mu$ and $\epsilon$ are independent. Given the observed value of $\bar X$, we are interested in estimating the density of $\mu_i$. This is a typical setup for density deconvolution problems (e.g., Diggle and Hall 1993). A natural estimator for $h(\mu)$ is the deconvoluting kernel estimator. The choice of the kernel function $K$ and bandwidth $s$ is crucial for the success of deconvoluting kernel estimators. In our implementation in Sections 3 and 4, we adopt the sinc kernel $K(x) = \sin(x)/(\pi x)$ suggested by Delaigle and Hall (2006). As for $s$, we select its value using the approach proposed in Delaigle and Gijbels (2004). The idea of using density deconvolution techniques for estimating $h$ was initially proposed in Sun and McLain (2012), where the i.i.d. two-class case was considered. Notably, the correlation among $\{\mu_i\}$ has no influence on the optimal bandwidth choice or the optimal rate of mean squared error for the kernel estimator when the distribution of $\epsilon_i$ is normal (i.e., supersmooth). See Kulik (2008) for a detailed discussion of deconvolution under dependence. Therefore, we can use the deconvoluting estimator as if $\{\mu_i\}$ were i.i.d.
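A numerical sketch of the deconvoluting kernel estimator with the sinc kernel is given below. The sinc kernel's Fourier transform equals 1 on [-1, 1] and 0 elsewhere, so the inverse-transform integral is truncated at |t| = 1/s. The discretization of the integral, the grid, and the function name are our assumptions; data-driven bandwidth selection (Delaigle and Gijbels 2004) is not implemented here.

```python
import numpy as np

def deconv_density(xbar, n, s, grid):
    """Deconvoluting kernel density estimate of mu from Xbar_i = mu_i + eps_i,
    with eps_i ~ N(0, 1/n), evaluated at the points in `grid`."""
    t = np.linspace(-1.0 / s, 1.0 / s, 1001)        # support of the sinc kernel's FT
    dt = t[1] - t[0]
    # Empirical characteristic function of the observations
    psi_hat = np.exp(1j * np.outer(t, xbar)).mean(axis=1)
    psi_eps = np.exp(-t ** 2 / (2 * n))             # cf of the N(0, 1/n) noise
    ratio = psi_hat / psi_eps
    # Inverse Fourier transform, approximated by a Riemann sum over t
    dens = (np.exp(-1j * np.outer(grid, t)) @ ratio).real * dt / (2 * np.pi)
    return dens
```

Because the normal characteristic function decays fast, the truncation at |t| = 1/s is what keeps the division by the noise characteristic function stable; smaller bandwidths widen the integration range and amplify noise.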
Next, we estimate $\{f(\bar x_i \mid \theta_i = k), k = 0, \pm 1\}$. It is clear that $f(\bar x_i \mid \theta_i = 0)$ is the density of $N(0, 1/n)$. The densities $\{f(\bar x_i \mid \theta_i = k), k = \pm 1\}$ can be written in terms of $h_1$ and $h_2$ and estimated accordingly. Next, we describe our procedure for estimating the transition matrix $A$ and initial distribution $\pi^0$. Since $\{\theta_i\}$ are unobservable, the maximum likelihood estimators for $A$ and $\pi^0$ are not readily available. The expectation-maximization (EM) algorithm is useful in such a situation. The key step is that, given the values of $A^{(t-1)}$ and $\pi^{0(t-1)}$ at the $t$-th iteration of the EM loop, we can update them by

$$a_{kl}^{(t)} = \frac{\sum_{i=1}^{m-1} P(\theta_i = k,\ \theta_{i+1} = l \mid \bar X, \pi^{0(t-1)}, A^{(t-1)})}{\sum_{i=1}^{m-1} P(\theta_i = k \mid \bar X, \pi^{0(t-1)}, A^{(t-1)})}.$$

After $\hat A$ is obtained by the EM algorithm, we can again update our estimate of the stationary distribution $\pi$ and subsequently our estimates of $h_1(\cdot)$, $h_2(\cdot)$. This deconvolution-expectation-maximization (DEM) iteration continues until convergence. Note that our estimation of $h_1(\cdot)$, $h_2(\cdot)$ only requires $\hat\pi$. Therefore, we only need an initial estimate of the stationary distribution $\pi$ to start the DEM loop. The estimator proposed in Jin and Cai (2007) has been shown to be consistent and to work well in large-scale hypothesis testing. We adopt their estimate of $\pi$ in our DEM initialization. The full description of the DEM algorithm is given below.

1. Compute the initial stationary distribution $\pi^{(0)}$ according to Jin and Cai (2007).
2. At the $r$-th iteration ($r = 1, 2, \ldots$), do the following: (i) (E-step) compute the conditional probabilities of the latent states given the current parameter estimates; (ii) (M-step) update the parameters accordingly.

One could initialize the stationary distribution $\pi^{(0)}$ in step 1 by random assignment, but our numerical experience shows that the stationary distribution estimator given in Jin and Cai (2007) is a better initialization choice, in the sense that the algorithm converges much faster.
Our data-driven diagnostic procedure is the same as the oracle procedure except that the quantities $H_i^k(\bar X)$, $H_i^{\max}(\bar X)$ and $\Lambda_i(\bar X)$ are replaced by their estimates. The following theorem shows that the data-driven procedure is asymptotically optimal.
Theorem 3. Denote our data-driven diagnostic procedure by $d_D$. Under the regularity conditions stated in the supplementary file, if $\alpha > \lim_{\lambda\to\infty} \mathrm{MDR}(d_\lambda)$, then the MDR and EFD of $d_D$ converge to those of the oracle procedure.

Simulations
We evaluate the numerical performance of our procedure in this section. Denote our stepwise oracle and data-driven procedures by SOP and SDP, respectively. In the recent fault diagnostic literature, Xiang et al. (2021a) have developed a fault classification method that can estimate the shift directions but ignores the between-stream correlations. We denote the oracle and data-driven versions of their procedure by XO and XD, respectively. We use the performance of XO and XD as benchmarks for SOP and SDP.
Throughout this section, the desired level of MDR is fixed at 10% (i.e., $\alpha = 0.1$), the dimension is $m = 3000$, and the number of OC observations $n$ varies among 1, 2, 5, 10, 20 and 30. In order to assess the numerical performance of the four procedures, the following replicated simulations are conducted for each procedure. After $\mu$ and $\theta$ are given, the actual MDR and EFD values of each procedure are calculated based on 1,000 replicated simulations of $\bar X$. This whole process is then repeated 100 times, rendering 100 sets of $\mu$ and $\theta$ values along with 100 pairs of MDR and EFD values. The averages of these 100 MDR and EFD values are reported as the final metrics. We consider three scenarios: (i) $\bar X \mid \mu, \theta$ is normal with diagonal $\Sigma$, (ii) $\bar X \mid \mu, \theta$ is normal with non-diagonal $\Sigma$, and (iii) $\bar X \mid \mu, \theta$ is non-normal. The initial distribution $\pi^0$ is set to $(1, 0, 0)^T$ throughout this section. Regarding the choice of $h_1(\cdot)$, $h_2(\cdot)$, we consider both symmetric and asymmetric shifts.

Symmetric shifts: $h_1(\mu_i) = h_2(-\mu_i) = \mathrm{Gamma}(2, 0.05, 0.5)$, where $\mathrm{Gamma}(a, b, c)$ denotes the density of the gamma distribution with shape parameter $a$, location parameter $b$ and scale parameter $c$. Asymmetric shifts: $h_1(\mu_i) = \mathrm{Gamma}(3, 0.05, 0.5)$ and $h_2(-\mu_i) = \mathrm{Gamma}(2, 0.05, 0.5)$. The use of a location parameter in the gamma distributions ensures that the magnitude of the OC signals is bounded from below by a positive number. We have set a larger average shift size in the asymmetric case (1.55 for the positive shift and 1.05 for the negative shift) than in the symmetric case (1.05 for both positive and negative shifts).
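The three-parameter gamma shift densities can be sampled by adding the location parameter to a standard gamma draw; the helper name below is ours. With shape 3 (resp. 2), location 0.05 and scale 0.5, the mean shift magnitude is 3(0.5) + 0.05 = 1.55 (resp. 2(0.5) + 0.05 = 1.05), matching the averages quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)

def rgamma_shift(shape, loc, scale, size, rng):
    """Draw shift magnitudes from Gamma(shape, loc, scale): the location
    parameter shifts the support to (loc, +inf), so shifts are bounded
    away from zero."""
    return loc + rng.gamma(shape, scale, size)

pos = rgamma_shift(3, 0.05, 0.5, 100_000, rng)    # positive shifts, mean 1.55
neg = -rgamma_shift(2, 0.05, 0.5, 100_000, rng)   # negative shifts, mean -1.05
```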

Normal cases with diagonal Σ
Let

$$A_1 = \begin{pmatrix} 0.75 & 0.125 & 0.125 \\ 0.6 & 0.3 & 0.1 \\ 0.6 & 0.1 & 0.3 \end{pmatrix}.$$

The stationary distribution is given by $\{\pi_0 = 0.706, \pi_{\pm 1} = 0.147\}$. In this case, the non-null proportion is relatively low. Our simulation results are shown in Figure 1. It can be seen from the figure that (i) all four procedures can achieve the specified MDR level with moderate $n$ values, and (ii) SOP and SDP outperform XO and XD, respectively, in both the symmetric and asymmetric cases.
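The stationary distribution quoted for A_1 can be checked by solving π^T A = π^T numerically, e.g., via the leading left eigenvector (the function name is ours):

```python
import numpy as np

A1 = np.array([[0.75, 0.125, 0.125],
               [0.60, 0.300, 0.100],
               [0.60, 0.100, 0.300]])

def stationary(A):
    """Stationary distribution of a transition matrix A: the left eigenvector
    for eigenvalue 1, normalized to sum to one (solves pi^T A = pi^T)."""
    w, v = np.linalg.eig(A.T)
    pi = v[:, np.argmax(w.real)].real
    return pi / pi.sum()

pi = stationary(A1)   # approximately (0.706, 0.147, 0.147)
```

For a stochastic matrix the largest eigenvalue is 1, so taking the eigenvector with the largest real eigenvalue recovers the stationary distribution.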
Next, let the transition matrix be

$$A_2 = \begin{pmatrix} 0.7 & 0.15 & 0.15 \\ 0.3 & 0.5 & 0.2 \\ 0.3 & 0.2 & 0.5 \end{pmatrix},$$

with stationary distribution $\{\pi_0 = 0.5, \pi_{\pm 1} = 0.25\}$, making the proportions of null and non-null cases equal. In addition, with relatively large diagonal entries in $A_2$, the non-null cases are likely to occur in clusters. Our simulation results are summarized in Figure 2. It can be seen that our procedures outperform XO and XD in terms of EFD by a large margin. This is expected because XO and XD do not take the between-stream correlations into account.
With $A_1$ or $A_2$, the chances of transitioning from the IC status to a positive or negative shift status are equal. To examine an unequal case, we consider a third transition matrix $A_3$ whose stationary distribution is $\{\pi_0 = 0.625, \pi_1 = 0.2375, \pi_{-1} = 0.1375\}$. Our simulation results with $A_3$ are presented in Figure 3, which shows that SOP (SDP) produces fewer false discoveries than XO (XD) while controlling the MDR at the required level for $n \ge 5$.

Normal cases with non-diagonal Σ
In this subsection, we evaluate our procedure in cases where $\Sigma$ is non-diagonal. It is worth noting that, even with a diagonal $\Sigma$ in our hierarchical model [1], the components of $\bar X$ are still correlated because of the correlation between the $\theta_i$'s. The conditional correlations (between components of $\bar X \mid \theta$) induced by a non-diagonal $\Sigma$ can further increase the between-stream correlations. Specifically, we consider the autoregressive covariance structure $\Sigma = (\sigma_{st})$ with $\sigma_{st} = \rho^{|s-t|}$, where $\rho$ is chosen to be 0.3, 0.5 or 0.7. The comparison with XO and XD is shown in Figure 4, where the asymmetric shifts and transition matrix $A_1$ are used. It can be seen from the figure that (i) SOP outperforms XO in terms of EFD while XO is better at controlling the MDR for small $n$ values, and (ii) SDP and XD have comparable EFD while XD controls the MDR better when $n$ is small (particularly in the case of $\rho = 0.7$). The primary reason why our procedure does not control the MDR well under high conditional correlations is that the recursive formulas [5]-[6] are no longer valid: the Markov property of the $\{\theta_i\}$ process does not carry over to $\{\bar X_i\}$. If $H_i^k(\bar X)$ is computed directly rather than by the recursive formulas, our oracle procedure remains the same, its optimality still holds as stated in Theorem 1, and it is able to control the MDR at its nominal level. This issue is investigated further, with additional numerical studies, in the supplementary file.
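The autoregressive structure used here, Σ_{st} = ρ^{|s−t|}, can be constructed and sampled as follows (a small sketch with illustrative dimensions and names):

```python
import numpy as np

def ar1_cov(m, rho):
    """Autoregressive covariance structure: Sigma[s, t] = rho ** |s - t|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Draw one vector of correlated noise with this covariance
rng = np.random.default_rng(2)
Sigma = ar1_cov(200, 0.5)
L = np.linalg.cholesky(Sigma)      # Sigma is positive definite for |rho| < 1
z = L @ rng.standard_normal(200)
```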

Non-normal cases
Our assumption that $\bar X = \sum_j X_j^{OC}/n$ follows a normal distribution is mainly justified by the central limit theorem, as $\bar X$ is an average of the OC observations. In practice, however, this normality assumption might be violated. In this subsection, we examine our procedure in cases where the observations are non-normal. Specifically, we generate $X_j^{OC}$ (i) from the $m$-dimensional $t$ distribution with 5 degrees of freedom shifted by $\mu$, and (ii) by $X_j^{OC} = (X'_j - 3)/\sqrt{3} + \mu$, where $X'_j \sim \mathrm{Gamma}_m(3, 0, 1)$. Here $t_m(5)$ and $\mathrm{Gamma}_m$ denote the $m$-dimensional $t$ distribution with 5 degrees of freedom and the $m$-dimensional gamma distribution, respectively. Our results are shown in Figure 5, where the asymmetric OC shifts and transition matrix $A_1$ are specified. It can be seen from the figure that (i) all four procedures are able to achieve the required MDR level as $n$ increases, (ii) SOP consistently outperforms XO, and (iii) SDP has fewer false discoveries than XD with moderate or relatively large $n$.

Real data
We apply our proposed procedure to a sensor dataset collected from a semiconductor manufacturing process (SMP). The dataset is available from the UCI Machine Learning Repository (http://archive.ics.uci.edu), and was automatically recorded by a computer system that manages the entire SMP from producer requests to laboratory analysis. The dataset contains 1,567 observations on 453 measured attributes (i.e., $m = 453$). Each observation is labeled as pass or fail based on the manufacturer's in-house testing. There are 1,463 pass observations. The remaining 104 fail observations are regarded as OC data (i.e., $n = 104$). All attributes are numerical, so their values being too high or too low provides directional information about possible faults. Our goal is to identify the OC streams and determine their shift directions using the OC data.
There are missing values in the dataset, and we impute them using mean imputation. To ensure that our normality assumption is not severely violated, we transform the data by $X_{ij} = \Phi^{-1}(\hat F_i(X'_{ij}))$, where $\{X'_{ij}\}$ are the original OC observations, $\hat F_i$ is the empirical cumulative distribution function (CDF) based on the pass observations, and $\Phi^{-1}$ denotes the inverse CDF of the standard normal distribution. Figure 6 shows the between-stream correlations after the transformation. It can be seen that there are nonzero between-stream correlations, particularly among nearby data streams. This clustering suggests that our hidden Markov model is appropriate to use here.
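The transformation X_ij = Φ^{-1}(F̂_i(X'_ij)) can be sketched as below. Dividing ranks by n + 1 so the empirical CDF never reaches 0 or 1 is our implementation choice to keep the inverse normal CDF finite; the function name is ours.

```python
import numpy as np
from statistics import NormalDist

def normal_score(x_oc, x_ic):
    """Map each OC observation through Phi^{-1}(F_hat(x)), where F_hat is the
    empirical CDF of the in-control (pass) sample for the same stream."""
    nd = NormalDist()
    x_ic = np.sort(np.asarray(x_ic, dtype=float))
    n = len(x_ic)
    ranks = np.searchsorted(x_ic, np.asarray(x_oc, dtype=float), side="right")
    u = np.clip(ranks / (n + 1), 1 / (n + 1), n / (n + 1))
    return np.array([nd.inv_cdf(ui) for ui in u])
```

Applied stream by stream, this makes the transformed pass data approximately standard normal, so values far from zero in the fail data indicate directional faults.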
Next, we estimate the model parameters using the fail observations. $\hat h_1(\cdot)$ and $\hat h_2(\cdot)$ are depicted in Figure 7. It can be seen from the figure that both positive and negative shifts occurred in the process, suggesting that our three-state specification for the latent variable is reasonable for this dataset. The transition probabilities estimated by the DEM algorithm are

$$\hat A = \begin{pmatrix} 0.77 & 0.22 & 0.01 \\ 0.32 & 0.68 & 0.00 \\ 1.00 & 0.00 & 0.00 \end{pmatrix}.$$

Next, we apply our data-driven procedure SDP with MDR level 0.1; the diagnostic results are shown in Figure 8. There are 249 data streams classified by SDP as OC. Notably, our diagnostic results do not depend solely on the observed value of $\bar X_i$: some streams whose observed $\bar X_i$ values are relatively close to 0 are still determined to be OC. This is because a stream is more likely to be classified as OC if its neighboring streams are OC. For comparison, the diagnostic results given by XD are shown in Figure 9.

Summary
We have proposed a fault diagnostic procedure for high-dimensional data streams. A major feature of the proposed procedure is that it allows the data streams to be correlated and enables inference about the shift directions. By setting up the directional diagnostic problem as multiple comparisons, we have shown that our procedure is optimal in the sense that it achieves the minimum expected number of false discoveries while controlling the missed discovery rate at a desired level. We have also established that the data-driven version of the proposed procedure is asymptotically optimal. There are ways to further generalize our procedure. For instance, our procedure is concerned with mean shifts only; designing a diagnostic procedure for variance-covariance shifts is certainly an interesting direction to pursue. Additionally, in many HDDS applications, not all the data streams can be easily collected and stored due to limited computer memory. Future research is needed to develop effective diagnostic procedures for such situations.
Note to practitioners: This article is motivated by the problem of data-driven fault diagnosis for complex industrial systems in which data streams are generated by networks of sensors. Our proposed method controls the missed discovery rate at a pre-specified level while keeping the number of false signals to a minimum. It is suitable for quality engineering applications in which faults tend to occur in clusters and clumps and the data contain directional information (i.e., values of certain quality variables are too high or too low). It may not be appropriate for applications in which faults (when they occur) scatter randomly or directional diagnosis is not of interest.

$\hat\pi = (\hat\pi_0, \hat\pi_1, \hat\pi_{-1})^T$ is the stationary distribution derived from $\hat A$. Since $h_1$ and $h_2$ have support in $(0, \infty)$ and $(-\infty, 0)$ respectively, we can estimate them by $\hat h_1(\mu) = I(\mu > 0)\,\hat h(\mu)(1 - \hat\pi_0)/\hat\pi_1$ and $\hat h_2(\mu) = I(\mu < 0)\,\hat h(\mu)(1 - \hat\pi_0)/\hat\pi_{-1}$, in which $\hat\pi$ is replaced with $\pi^{(r-1)}$. Denote the estimates by $\hat A^{(r)}$ and $\hat\pi^{0(r)}$ after the EM loop converges. (c) Update the stationary distribution estimate $\pi^{(r)}$ using $\hat A^{(r)}$.

About the authors

Figure 4. Diagnostic results given by XO, XD, SOP and SDP in cases where Σ is non-diagonal, the shifts are asymmetric, and the transition matrix is A_1.

Figure 5. Diagnostic results given by XO, XD, SOP and SDP in cases where the observations are non-normal, the shifts are asymmetric, and the transition matrix is A_1.

Figure 7. The estimated densities ĥ_1(·) and ĥ_2(·) for the semiconductor data.

Yan
He received his B.Sc. degree in Statistics from East China Normal University (ECNU), Shanghai, China. He is now working toward the Ph.D. degree at the School of Statistics at ECNU, advised by Professor Dongdong Xiang. His current research focuses on large-scale multiple testing and machine learning theory. Yicheng Kang received the B.Sc. degree in mathematics from Wuhan University, Wuhan, China, in 2008, the M.Sc. degree in mathematics from the University of British Columbia, Vancouver, Canada, in 2010, and the Ph.D. degree in statistics from the University of Minnesota, Twin Cities, USA, in 2013. He is currently an Assistant Professor at the Farmer School of Business, Miami University, Oxford, OH, USA. His research areas include image analysis, image-based process monitoring, industrial analytics, statistical quality control and transportation analytics. Fugee Tsung received the B.Sc. degree from National Taiwan University, Taipei, Taiwan, and the M.Sc. and Ph.D. degrees from the University of Michigan, Ann Arbor, MI, USA. He is currently a Chair Professor with the Department of Industrial Engineering and Decision Analytics and the Director of the Quality and Data Analytics Laboratory, the Hong Kong University of Science and Technology, Hong Kong. His research interests include quality analytics in advanced manufacturing and service processes, industrial big data, and statistical process control, monitoring, and diagnosis. Dr. Tsung is a Fellow of the Institute of Industrial Engineers (IIE), a Fellow of the American Society for Quality (ASQ), a Fellow of the American Statistical Association (ASA), an Academician of the International Academy for Quality (IAQ), an Elected Member of the International Statistical Institute (ISI), and a Fellow of the Hong Kong Institution of Engineers (HKIE). Dongdong Xiang received his Ph.D. degree in statistics from East China Normal University, Shanghai, China. He is currently a Professor at the School of Statistics at East China Normal University (ECNU). His research interests are statistical process control, large-scale multiple testing and machine learning. He has over 30 publications in journals such as Journal of the Royal Statistical Society - Series B, Technometrics, Journal of Quality Technology, Naval Research Logistics, and Statistica Sinica.

Figure 8. The diagnostic results for the semiconductor data using SDP.

Figure 9. The diagnostic results for the semiconductor data using XD.
One way to obtain $\hat\lambda^*$ from $\widehat{\mathrm{MDR}}(d_\lambda)$ is to threshold $\Lambda_i(\bar X)$: we have $\hat\lambda^* = 1/\Lambda_{(c)}(\bar X)$, where $\Lambda_{(1)}(\bar X) \le \Lambda_{(2)}(\bar X) \le \cdots \le \Lambda_{(m)}(\bar X)$ are the ordered values of $\Lambda_i(\bar X)$ and $c = \max\{j : \ldots\}$. We now have all the elements of our stepwise oracle procedure (SOP), summarized below. 1. Compute $\alpha_i(k)$ and $\beta_i(k)$ by the recursive formulas [5]-[6] with the initialization $\alpha_1(k) = \pi_k^0 f(\bar X_1 \mid \theta_1 = k)$ and $\beta_m(k) = 1$ for $k = 0, \pm 1$. 2. Compute $H_i^k(\bar X)$, $H_i^{\max}(\bar X)$ and $\Lambda_i(\bar X)$. The next theorem shows that our stepwise procedure retains the nice properties of $d_{\lambda^*}$.

The idea of the deconvoluting kernel estimator is briefly described next. Let $\psi$, $\psi_\mu$ and $\psi_\epsilon$ denote the characteristic functions of $\bar X_i$, $\mu_i$ and $\epsilon_i$, respectively. We have $\psi = \psi_\mu \psi_\epsilon$. The kernel estimator for $\psi(t)$ is $\hat\psi_{\bar X}(t)\,\psi_K(st)$, where $\psi_K(\cdot)$ is the Fourier transform of the kernel function $K(\cdot)$ and $s$ is the bandwidth parameter. The density of $\mu_i$ can then be estimated by the inverse Fourier transform of $\hat\psi_{\bar X}(\cdot)\,\psi_K(s\,\cdot)/\psi_\epsilon(\cdot)$.