Robust multivariate control chart based on shrinkage for individual observations

Abstract A robust multivariate quality control technique for individual observations is proposed, based on robust reweighted shrinkage estimators. A simulation study is carried out to assess the performance of the method and to compare it with the classical Hotelling approach and with the robust alternative based on the reweighted minimum covariance determinant estimator. The results show the appropriateness of the method even when the dimension or the Phase I contamination is high, with both independent and correlated variables, and reveal additional advantages in computational efficiency. The approach is illustrated with two real-data examples from production processes.


Introduction
Statistical process control charts are the most popular tool for monitoring production quality in industry, since they allow searching for abnormal or out-of-control behavior. In the univariate case, the classical approach is to use control charts that study the behavior of the mean and the variability of the process at the same time. These are known as the Shewhart control charts (Montgomery 1997; Nelson 1984; Shewhart 1941; Western Electric Company 1956). These control charts allow defining a quality control process that consists of two distinct phases. In Phase I, historical observations are used to create the control charts and to test retrospectively whether the process was in control, detecting abnormal behavior. By removing the out-of-control data, the parameters of the in-control process can be estimated. Once this is accomplished, in Phase II, future samples obtained during the manufacturing process can be monitored.
Nowadays, it is very common that the available dataset is characterized by more than one variable. In this case, monitoring the variables one by one using univariate control charts can be misleading, and the variability due to their relationship would not be taken into account. Therefore, multivariate statistical process control techniques are more appropriate. Harold Hotelling introduced the first approach within this area in his pioneering paper (Hotelling 1947). It was basically an extension of the Shewhart control charts to the multivariate case. The problem divides into two cases: (i) when there are groups or subsamples in the data, and (ii) when the data consist of individual observations. In this paper, we focus on the latter case. For the multivariate sample x = {x_1, ..., x_n} in Phase I, where x_i represents a p-dimensional vector of measurements, the Hotelling T² statistic is defined as:

T²_i = (x_i − μ)ᵗ Σ⁻¹ (x_i − μ), i = 1, ..., n. [1]

When the process is in control, the sample is assumed to be independent and to follow a multivariate normal distribution N_p(μ, Σ), where μ is the location vector and Σ is the covariance matrix. The T² statistic is interpreted as the weighted distance of any point from the process mean under stable conditions. A large value of T² indicates that the process has shifted in some way.
This definition has a close resemblance to the Mahalanobis distance. Therefore, the appropriate probability limits may be obtained using the known distribution of the corresponding statistic. In the case of p > 1 dimensions, the limits are ellipsoids that depend on the pairwise correlations between the variables (Mason and Young 2002; Mitra 2008; Tracy, Young, and Mason 1992). The idea is to calculate the value of the Hotelling T² statistic for each observation; if any value falls outside the p-dimensional ellipsoid contour, then it is considered out-of-control.
In the definition of the Hotelling T² statistic in Eq. [1], since μ and Σ are not known, the sample mean vector and the sample covariance matrix are used to estimate location and dispersion. These estimators, however, are sensitive to the presence of special causes. In fact, Sullivan and Woodall (1996, 1998) showed that a T² chart based on the classical sample estimators is not effective in detecting a shift or a trend in the mean vector. That is the main reason why, in this paper, attention is centered on the idea of using robust estimators of location and covariance in the definition of the Hotelling T² control chart.
In this paper, we propose an alternative robust Hotelling T² procedure, using the robust shrinkage reweighted (SR) estimators introduced in Cabana, Lillo, and Laniado (2019) and Cabana, Lillo, and Laniado (2020). The rest of the paper is organized as follows. Section 2 introduces the background on the classical Hotelling T² and the RMCD control charts. Section 3 describes the proposed approach based on the SR estimators; the quantiles of the proposed robust statistic are estimated through a Monte Carlo technique and the control limit of the proposed control chart is derived. Next, Sec. 4 studies the performance of the new method and compares it with the classical Hotelling T² approach and the RMCD under several simulation schemes. Section 5 illustrates the performance through real dataset examples and, finally, Sec. 6 provides some conclusions.

Multivariate Hotelling T 2
In practice, it is known that the assumption that the Phase I data come from an in-control process is not always valid. Abnormal observations in Phase I can inflate the control limits and reduce the power to detect process changes in Phase II. For this reason, the first step in Phase I is to investigate those observations that turn out to be outside the control limits. If the out-of-control observations are found to be due to an identified assignable cause that can be removed, they should be eliminated and the control limits recalculated. This iterative re-estimation procedure in Phase I can eliminate the effect of a small number of very extreme observations. However, it will fail to detect more moderate outliers, which is known as the masking effect. This fact motivates the use of robust estimators of the mean vector and the covariance matrix in order to determine an appropriate control limit for the Phase II data.
In the case of multivariate individual observations, the Hotelling T² statistic is defined by using in Eq. [1] the classical sample estimators:

T²_i = (x_i − x̄)ᵗ C⁻¹ (x_i − x̄), i = 1, ..., n, [2]

where x_i = (x_{i1}, ..., x_{ip})ᵗ, for i = 1, ..., n, are the p-variate Phase I observations with sample mean x̄ = n⁻¹ ∑_{i=1}^n x_i and sample covariance matrix C = (n − 1)⁻¹ ∑_{i=1}^n (x_i − x̄)(x_i − x̄)ᵗ. For each individual observation, the value of the T² statistic is compared with a control limit usually derived by assuming that the data are independent multivariate normal, i.e., N_p(μ, Σ), with mean μ and covariance matrix Σ. The observations are assumed not to be correlated over time. Under these normality and independence assumptions, the control limit for Phase I data is based on a Beta distribution and, for a Phase II observation that is independent of the Phase I data, on an F distribution (Tracy, Young, and Mason 1992). Note that the definition in Eq. [2] uses the classical sample mean and sample covariance matrix, which are very sensitive to the presence of outliers. Thus, this classical definition of the Hotelling T² statistic can be affected if abnormal observations are present in Phase I (Rousseeuw and Van Zomeren 1990). Sullivan and Woodall (1996, 1998) proposed a procedure based on estimating the mean vector and the covariance matrix using successive differences, which is more effective than the classical estimators in detecting a process shift when certain conditions apply. However, it is not effective in detecting multiple multivariate outliers because its breakdown point tends to zero, so it will not be considered here (Jensen, Birch, and Woodall 2007). Following this idea, some alternatives have been proposed in the literature to avoid the negative effect of outliers, such as the trimmed approach introduced by Alfaro and Ortega (2008). However, this method has the disadvantage of being computable only up to dimension p = 5.
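As an illustration, the classical statistic of Eq. [2] and the Beta- and F-based control limits can be sketched as follows (a minimal Python/NumPy/SciPy sketch; the scaling constants follow the Tracy, Young, and Mason (1992) results for individual observations):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X):
    """Classical T^2 of Eq. [2] for each individual Phase I observation."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    C = np.cov(X, rowvar=False)            # (n-1)-denominator sample covariance
    Cinv = np.linalg.inv(C)
    D = X - xbar
    return np.einsum('ij,jk,ik->i', D, Cinv, D)

def phase1_limit(n, p, alpha=0.01):
    """Phase I control limit, based on the Beta distribution."""
    b = stats.beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)
    return (n - 1) ** 2 / n * b

def phase2_limit(n, p, alpha=0.01):
    """Phase II limit for a new independent observation, based on F."""
    f = stats.f.ppf(1 - alpha, p, n - p)
    return p * (n + 1) * (n - 1) / (n * (n - p)) * f
```

A useful sanity check is the identity ∑_i T²_i = (n − 1)p, which follows from the trace of C⁻¹ applied to the scatter matrix.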
Motivated by this, Vargas (2003) proposed a robust control chart to identify outliers in Phase I based on two robust estimators: the minimum covariance determinant (MCD) and the minimum volume ellipsoid (MVE) estimators, both proposed by Rousseeuw (1985). However, when different estimators are used in the T² statistic, the distributional assumption need not hold. Moreover, the exact distribution of the statistic based on the alternative robust estimators is not available. Hence, the control limits for the robust T² control chart in Phase I should be obtained empirically. Vargas (2003) and Jensen, Birch, and Woodall (2007) proposed to estimate the control limits by simulation.
The later study of Chenouri, Steiner, and Variyath (2009) followed some of these ideas. The authors proposed to use the reweighted MCD (RMCD) estimator (Lopuhaa and Rousseeuw 1991; Rousseeuw and Van Zomeren 1990; Willems et al. 2002) to compute the robust control chart for the Phase II data. The motivation is that the RMCD estimators inherit the nice properties of the initial MCD estimators, such as affine equivariance, robustness, and asymptotic normality, while achieving a higher efficiency. Because of the reweighting step in the calculation of the estimates, the result is not unduly influenced by outliers and there is no need to identify outliers in the Phase I data. The authors concluded that their multivariate robust Hotelling T² chart based on RMCD estimates is a good alternative to the classical multivariate T² control chart for Phase II data.
However, there are some disadvantages to using the MCD or the RMCD estimators. The MCD estimators of location and scatter are determined by the subset of size h = ⌊nc⌋ (where 0.5 ≤ c ≤ 1) for which the covariance matrix has the smallest possible determinant. Thus, the method depends on the parameter h, which in turn depends on c, a value that must be defined by the user or practitioner in practice. This parameter directly affects the performance of the method, the breakdown value and the efficiency. Here, 1 − c represents the (asymptotic) breakdown point of the MCD estimator and thus of the RMCD. The MCD estimator attains its highest possible finite-sample breakdown point (bdp) when h = ⌊(n + p + 1)/2⌋ (Leroy and Rousseeuw 1987). It is important to note that the computation of the estimates is very challenging, especially for large datasets or high dimension, which is why approximate algorithms have to be used for this task. The problem is that this results in worse performance, in terms of consistency and bdp, than the exact theoretical estimator would have had, and it gets worse as the sample size n and/or the dimension p of the sample increase (Hawkins and Olive 2002; Stromberg, Hössjer, and Hawkins 2000). In fact, in the statistical software packages, the MCD only runs for dimension p ≤ 30. Also, the approximate algorithm Fast-MCD still requires substantial running times for large p, because the number of candidate solutions grows exponentially with the dimension p of the sample; as a consequence, the procedure becomes computationally expensive for even moderately sized problems. Cabana, Lillo, and Laniado (2019) proposed a robust estimator of the location and covariance matrix to deal with the problem of outliers in multivariate data. It is based on a notion frequently used in finance and portfolio optimization (see Ledoit and Wolf 2004; Chen, Wiesel, and Hero 2011; Couillet and McKay 2014; Steland 2018), known as shrinkage, defined as follows:

Ê_Sh = (1 − η)Ê + η T̂. [3]

Robust shrinkage reweighted control chart
In Eq. [3], Ê is the prior estimator of a parameter θ, and the shrinkage is done toward a target estimator T̂. This way, Ê_Sh is a convex combination that relies on the fact that "shrinking" Ê toward T̂ helps to reduce the estimation error: although the target estimator is usually biased, it also has less variance than the estimator Ê. Therefore, under general conditions, there exists a shrinkage intensity η, a number between 0 and 1, such that the resulting shrinkage estimator has less estimation error than Ê (James and Stein 1961). Cabana, Lillo, and Laniado (2019) proposed to use shrinkage to meet the positive-definiteness condition that the covariance estimator used in the Mahalanobis distance must satisfy. Since the shrinkage estimator provides a tradeoff between bias and variance, and in the case of covariance matrices the shrinkage estimate is always positive definite and well conditioned, we use the shrinkage estimator in our proposal as well. Let X = {x_1, ..., x_n} be the n × p data matrix, with n the sample size and p the number of variables. For location, the authors defined a shrinkage over the multivariate median μ̂_MM, called the L₁-median, which is a robust and highly efficient estimator of central tendency (Lopuhaa and Rousseeuw 1991; Vardi and Zhang 2000; Oja 2010). The L₁-median is defined as:

μ̂_MM = argmin_{μ ∈ ℝᵖ} ∑_{i=1}^n ‖x_i − μ‖₂.

The shrinkage estimator over μ̂_MM was proposed by considering it as the sample estimator Ê and ν e as the target estimator T̂ in the definition of Eq.
[3], where e is the p-dimensional vector of ones. Then, the shrinkage estimator over the multivariate L₁-median is:

μ̂_Sh = (1 − η) μ̂_MM + η ν e. [4]

The scaling factor ν and the intensity η should minimize the expected quadratic loss E[‖μ̂_Sh − μ‖₂²], where ‖x‖₂² = ∑_{j=1}^p x_j². The parameters ν and η are estimated robustly and optimally, minimizing the estimation error, as proposed in Proposition 2 (p. 6) of Cabana, Lillo, and Laniado (2019). To estimate them in practice, the authors propose to use the asymptotic distribution of the L₁-median μ̂_MM (Bose 1995; Bose and Chaudhuri 1993; Möttönen et al. 2010). On the other hand, the authors also propose an adjusted comedian matrix Ŝ_Sh, based on the classical definition of the comedian from Falk (1997), whose entries are the pairwise comedians com(X_j, X_k) = med[(X_j − med(X_j))(X_k − med(X_k))]. The comedian matrix is a robust alternative to the covariance matrix, but in general it is not positive (semi-)definite (see Falk 1997). Thus, Cabana, Lillo, and Laniado (2019) applied the shrinkage approach to the comedian, and a robust and well-conditioned estimate is obtained (Ledoit and Wolf 2003a, 2003b, 2004; DeMiguel, Martin-Utrera, and Nogales 2013):

Σ̂_Sh = (1 − η) Ŝ_Sh + η ν I. [5]

In this case, the target is the scaled identity matrix, as suggested in the literature. The optimal expressions for the parameters η and ν are described in Proposition 3 (p. 10) of Cabana, Lillo, and Laniado (2019). In this paper, the robust estimators of location μ̂_Sh and covariance matrix Σ̂_Sh based on shrinkage are used to define a robust Mahalanobis distance that is able to discover outliers with high precision in the vast majority of the simulation scenarios studied in that paper, with Gaussian data as well as with skewed or heavy-tailed distributions. The behavior under correlated and transformed data showed that the approach is approximately affine equivariant.
With highly contaminated data, the method is shown to have a high breakdown value, even in high dimension or with a large level of contamination. Furthermore, its significantly low computational times show the advantages of the proposal.
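To make the construction concrete, the following Python/NumPy sketch computes the L₁-median by Weiszfeld iterations (a standard algorithmic choice, not taken from the paper), Falk's comedian matrix, and an Eq. [5]-style shrinkage of the comedian toward a scaled identity. The target scale ν = trace(Ŝ)/p and a user-supplied η are simple illustrative choices; the paper obtains η and ν optimally from Propositions 2 and 3 of Cabana, Lillo, and Laniado (2019):

```python
import numpy as np

def l1_median(X, tol=1e-8, max_iter=500):
    """L1-(spatial) median: argmin over mu of sum_i ||x_i - mu||_2,
    computed with Weiszfeld iterations."""
    mu = X.mean(axis=0)
    for _ in range(max_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)
        w = 1.0 / d                       # inverse-distance weights
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu_new

def comedian(X):
    """Falk's comedian matrix: COM_jk = med{(X_j - med X_j)(X_k - med X_k)}.
    Robust, but not positive (semi-)definite in general."""
    Z = X - np.median(X, axis=0)
    p = X.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            S[j, k] = np.median(Z[:, j] * Z[:, k])
    return S

def shrink_covariance(S, eta):
    """Eq. [5]-style shrinkage toward the scaled identity; for eta large
    enough the result is positive definite and well conditioned.
    nu = trace(S)/p is an illustrative target scale."""
    p = S.shape[0]
    nu = np.trace(S) / p
    return (1.0 - eta) * S + eta * nu * np.eye(p)
```

Note how the shrinkage step repairs indefiniteness: an indefinite comedian estimate becomes positive definite once enough weight is placed on the scaled identity target.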
These robust estimators are also studied in Cabana, Lillo, and Laniado (2020), where they are used to define a robust regression approach based on the reweighted versions of μ̂_Sh and Σ̂_Sh (Eqs. [4] and [5]). To define those reweighted estimators, the associated robust squared Mahalanobis distance must be defined for each observation x_i, i = 1, ..., n:

d²(x_i) = (x_i − μ̂_Sh)ᵗ Σ̂_Sh⁻¹ (x_i − μ̂_Sh).

Cabana, Lillo, and Laniado (2020) defined w_i = w(d²(x_i)), a weight function depending on the robust squared Mahalanobis distance, as an indicator function that assigns weight 1 to those x_i whose distance is less than a certain quantile of the chi-square distribution with p + 1 degrees of freedom. Then, the shrinkage reweighted (SR) estimators μ̂_SR and Σ̂_SR for the location and covariance matrix, respectively, are:

μ̂_SR = (∑_{i=1}^n w_i x_i) / (∑_{i=1}^n w_i),  Σ̂_SR = (∑_{i=1}^n w_i (x_i − μ̂_SR)(x_i − μ̂_SR)ᵗ) / (∑_{i=1}^n w_i − 1). [6]

The reweighting step in the SR estimators has the advantage of keeping the properties of the initial shrinkage estimators, such as robustness, approximate affine equivariance and high breakdown value, while achieving a higher efficiency.
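A reweighting step of this kind can be sketched as follows (Python/NumPy/SciPy; the 0.975 cutoff of the χ²_{p+1} distribution and the plain mean and covariance of the retained points are illustrative assumptions, standing in for the exact weighting of Eq. [6]):

```python
import numpy as np
from scipy import stats

def sr_reweight(X, mu0, Sigma0, quantile=0.975):
    """One reweighting step: give weight 1 to the observations whose
    robust squared Mahalanobis distance (w.r.t. the initial shrinkage
    estimates mu0, Sigma0) is below the chi^2_{p+1} quantile, weight 0
    otherwise, and recompute location and covariance from the kept points."""
    n, p = X.shape
    D = X - mu0
    d2 = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Sigma0), D)
    w = d2 <= stats.chi2.ppf(quantile, p + 1)   # indicator weights
    Xk = X[w]
    mu_sr = Xk.mean(axis=0)
    Sigma_sr = np.cov(Xk, rowvar=False)
    return mu_sr, Sigma_sr, w
```

On clean data almost all points keep weight 1, so little efficiency is lost, while gross outliers receive weight 0 and cannot distort the final estimates.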
In the statistical process control procedure, with the proposed estimators there is no need to identify outliers in the Phase I data, because the estimators are not influenced by them. Therefore, using them provides robust Phase I estimates of location and covariance that serve to compute a robust control chart for the Phase II data. Next, the main contribution of this paper, a robust quality control approach based on the SR estimators, is described in detail.
Let X = {x_1, ..., x_n} be the n × p data matrix in Phase I, with n the sample size and p the number of variables, and let us assume that it follows a multivariate normal distribution N_p(μ, Σ). Since we are in the case of individual observations, it is well known (Wilks 1962) that for a new observation x_k in Phase II,

T²_k ~ [p(n + 1)(n − 1) / (n(n − p))] F_{p, n−p},

where T²_k is defined as in Eq. [2]. In this paper, we propose to robustify the definition of the Hotelling T² control chart based on Phase I data, replacing the classical estimators with the robust shrinkage reweighted (SR) estimators μ̂_SR and Σ̂_SR defined in Eq. [6]. The proposed robust Hotelling T² for the Phase II observation x_k is:

T²_SR(x_k) = (x_k − μ̂_SR)ᵗ Σ̂_SR⁻¹ (x_k − μ̂_SR). [7]

Note that this definition is similar to that of a squared robust Mahalanobis distance. It is well known that the distribution of the classical squared Mahalanobis distance, i.e., with the sample mean and covariance matrix, is chi-square with p degrees of freedom. When different estimators are used, this need not hold; nevertheless, in the literature the 0.975 quantile of the chi-square distribution is usually used as a threshold for the distance to detect multivariate outliers. In this paper we apply Monte Carlo simulations to estimate the quantiles of the distribution of T²_SR for several combinations of sample sizes n and dimensions p, similarly to Chenouri, Steiner, and Variyath (2009), but considering a wider range of n and p. For each dimension, a smooth curve relating the sample size to the quantiles of T²_SR is fitted and used to estimate appropriate quantiles for different Phase I sample sizes.
The simulation study is organized as follows. In order to estimate the 99% and 99.9% quantiles of T²_SR for a given Phase I sample size n and dimension p, M = 10,000 samples of size n from a standard multivariate normal distribution N_p(0, I_p) are generated. For each data set of size n, the SR location and covariance matrix estimates are computed, μ̂_SR(m) and Σ̂_SR(m), m = 1, ..., M. In addition, for each data set, we randomly generate a new observation x_{k,m} from N_p(0, I_p), which is treated as a Phase II observation, and calculate the corresponding T²_SR(x_{k,m}) value as given by Eq. [7].
With these M simulated values T²_SR(x_{k,m}), the empirical distribution function of T²_SR can be obtained for the different combinations of p = 2, ..., 30 and n = 20, ..., 300. By inverting the empirical distribution function, the Monte Carlo estimates of the 99% and 99.9% quantiles of T²_SR are obtained.
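The Monte Carlo scheme just described can be sketched as follows (Python/NumPy; for brevity the sample mean and covariance stand in for the SR estimates and M is reduced, so the numbers only illustrate the procedure, not the values of Table 1):

```python
import numpy as np

def mc_quantiles(n, p, M=2000, q=(0.99, 0.999), estimator=None, rng=None):
    """Monte Carlo estimates of the Phase II T^2 quantiles for Phase I
    size n and dimension p: for each of M replicates, estimate location
    and covariance from an N_p(0, I) Phase I sample, draw one Phase II
    observation and record its squared distance.  `estimator` maps the
    data to (mu, Sigma); the default sample estimates are only a
    stand-in for the SR estimator of the paper."""
    rng = np.random.default_rng(rng)
    if estimator is None:
        estimator = lambda X: (X.mean(axis=0), np.cov(X, rowvar=False))
    t2 = np.empty(M)
    for m in range(M):
        X = rng.normal(size=(n, p))            # Phase I replicate
        mu, Sigma = estimator(X)
        d = rng.normal(size=p) - mu            # one Phase II observation
        t2[m] = d @ np.linalg.solve(Sigma, d)
    return np.quantile(t2, q)
```

For large n the estimated quantiles should approach the asymptotic chi-square quantiles, which is exactly the limiting behavior the fitted curves of Eq. [8] are designed to capture.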
Figures A2-A7 in the Appendix show the scatterplots of the empirical 99% and 99.9% quantiles of T²_SR against the sample size n for the selected dimensions p = 5, 10, 30, to illustrate the results. The graphical results for the different dimensions are similar, and they suggest that the quantiles can be modeled using a family of regression curves f(n) of the form given in Eq. [8], where a₃ = χ²_{p,1−α}, as recommended in Chenouri, Steiner, and Variyath (2009), and a₁ and a₂ are constants that depend on p and on α as well. Fitting the curve to the data in each case makes it possible to predict the quantiles of T²_SR(x_k) for any Phase I sample size n. Note that, as n increases, f(n) approaches χ²_{p,1−α}, the quantile used in the literature. Table 1 shows the resulting least-squares estimates of the parameters a₁ and a₂ of the curves for each dimension p = 2, ..., 30 considered and for both the 99% and 99.9% quantiles. In practice, using Eq. [8] and the values of the constants from Table 1, the 99% and 99.9% quantiles can be computed for any dimension between 2 and 30 and any Phase I sample size. The estimated curves fit the data well in all cases, yielding R² values of at least 98%. Furthermore, the significance tests yielded p-values less than 0.05, which allows us to conclude that there is a statistically significant linear relationship between the quantiles and 1/n. For dimensions p > 30, the practitioner can simulate the control limits following the described steps, and a similar behavior is expected. Since the method is compared with another that has this dimension restriction, the results of this paper only consider dimensions p ≤ 30, but the proposed method does not have any restriction with respect to the dimension. The simulations were done in Matlab on a PC with a 3.40 GHz Intel Core i7 processor and 32 GB RAM.
The implementation in practice to construct the T²_SR control chart is as follows:

Phase I
1. Save the sample size n and the dimension p of the data. Select the confidence level 1 − α, with α = 0.01, 0.001.
2. Using the Phase I data, compute the shrinkage reweighted (SR) estimates of location and covariance matrix.
3. For the desired α and p values, choose the least-squares estimates a₁ and a₂ from Table 1 and compute the control limit using Eq. [8].

Phase II
1. Compute the value T²_SR(x_k) of Eq. [7] for each new observation x_k, and plot each value on a control chart with the limit derived in Phase I (step 3).
2. Interpret the chart and look for out-of-control observations or patterns. Diagnose the process if needed.
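Putting the two phases together, the monitoring loop can be sketched as follows (Python/NumPy/SciPy; the asymptotic χ²_{p,1−α} quantile is used here as a stand-in for the finite-sample limit of Eq. [8], whose constants a₁ and a₂ come from Table 1, and any robust (μ, Σ) estimator can be plugged in for the SR estimates):

```python
import numpy as np
from scipy import stats

def sr_chart(phase1_X, phase2_X, alpha=0.01, estimator=None):
    """Phase I: estimate location/covariance and the control limit.
    Phase II: compute T^2 for each new observation and flag signals.
    `estimator` maps the Phase I data to (mu, Sigma); the default
    sample estimates are a stand-in for the SR estimates of Eq. [6]."""
    if estimator is None:
        estimator = lambda X: (X.mean(axis=0), np.cov(X, rowvar=False))
    mu, Sigma = estimator(phase1_X)
    p = phase1_X.shape[1]
    limit = stats.chi2.ppf(1 - alpha, p)   # stand-in for Eq. [8]
    D = phase2_X - mu
    t2 = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Sigma), D)
    return t2, limit, t2 > limit
```

In use, the boolean vector marks the Phase II observations to be investigated for assignable causes.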

Performance simulation study
The performance of the proposed SR control chart is compared to that of the classical Hotelling T² and the RMCD control charts. The Phase I data are generated from a multivariate normal distribution, considering the presence of outliers. The behavior is studied through shifts in the process mean vector in the Phase II data, for both independent and correlated variables. The shifts are defined through the non-centrality parameter (ncp)

ξ² = (μ_A − μ)ᵗ Σ⁻¹ (μ_A − μ),

where μ and μ_A represent the in-control and the out-of-control mean vectors, respectively. The larger the value of ξ², the more extreme the outliers are. The in-control (outlier-free) Phase I data are generated from a multivariate normal distribution N_p(0, Σ). Then, some abnormal observations generated from N_p(μ_I, Σ) are included. The Phase II data are generated from N_p(μ_II, Σ). Since there is no change in the covariance structure, the ncp in Phase I is ξ²_I = ‖μ_I‖² and in Phase II is ξ²_II = ‖μ_II‖². Different scenarios are considered, varying the following elements: the number of observations (n), the number of variables (p), the proportion of outliers (π), the proportion of dimensions to be shifted (ϑ) and the parameters of the distributions, i.e., μ_I, μ_II and Σ. This simulation study is analogous to the one in Chenouri, Steiner, and Variyath (2009), with three main improvements: (i) those authors studied the problem up to dimension p = 10, and here we consider higher dimensions; (ii) they considered the identity matrix Σ = I_p as the covariance structure of both the clean data and the outliers, and here we also study correlated variables, as in the comparison study of Alfaro and Ortega (2009); and (iii) we additionally study not only the magnitude of the shift but also where it happens.

Table 1. Least-squares estimates of the regression parameters a₁, a₂ and R², for dimensions p = 2, ..., 30 and confidence levels 1 − α = 0.99, 0.999, where a₃ = χ²_{p,1−α}.
The Phase I and II data generation models we propose are the following:

Case A: Phase I: (1 − π)N_p(0, I_p) + πN_p(μ_I, I_p); Phase II: N_p(μ_II, I_p).
Case B: Phase I: (1 − π)N_p(0, Σ) + πN_p(μ_I, Σ); Phase II: N_p(μ_II, Σ).

The Phase I sample sizes considered are n = 50, 150, with dimensions p = 2, 6, 10, 30. The confidence level studied is set to α = 0.01. The level of contamination in the Phase I data is characterized by π = 0, 0.1, 0.2. The process mean shifts considered in Phase I are ξ²_I = 5, 30 and in Phase II are ξ²_II = 0, 5, 10, 15, 20, 25, 30. Two covariance structures Σ are considered: (Case A) the identity matrix, to compare the methods when there are different-sized changes in the averages of all the variables while the variables are independent, and (Case B) the p × p matrix with 1 on the main diagonal and 0.9 elsewhere, to analyze whether the correlation level affects the detection probability of each alternative. Table 2 summarizes the different combinations of the other parameters in the simulation study, for both Cases A and B.
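The generation of a contaminated Phase I sample under either case can be sketched as follows (Python/NumPy; spreading the shift equally over the coordinates so that ‖μ_I‖² equals the desired non-centrality is one concrete choice, made here for illustration):

```python
import numpy as np

def gen_phase1(n, p, prop_out, ncp, case='A', rng=None):
    """Phase I data: (1 - pi) N_p(0, Sigma) + pi N_p(mu_I, Sigma).
    Case 'A' uses Sigma = I_p; case 'B' uses 1 on the diagonal and
    0.9 elsewhere.  The shift mu_I is spread equally over all
    coordinates with ||mu_I||^2 = ncp."""
    rng = np.random.default_rng(rng)
    Sigma = np.eye(p) if case == 'A' else 0.9 * np.ones((p, p)) + 0.1 * np.eye(p)
    mu_I = np.full(p, np.sqrt(ncp / p))
    n_out = int(round(prop_out * n))
    clean = rng.multivariate_normal(np.zeros(p), Sigma, size=n - n_out)
    outliers = rng.multivariate_normal(mu_I, Sigma, size=n_out)
    return np.vstack([clean, outliers])
```

The Phase II data are generated analogously, with a single component N_p(μ_II, Σ) in place of the mixture.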
The following results show the performance when the process mean shifts happen in all dimensions. We also studied the case in which the shift happens only in some percentage of the dimensions, to see whether the performance of the control chart is affected not only by the magnitude of the shift but also by where it happens. We defined a parameter ϑ for this purpose, with values 100%, 75%, 50% and 25%, representing the scenarios in which the shift happens in all the dimensions, most of the dimensions, half of the dimensions and some of the dimensions, respectively. Note that this is done for each dimension considered in the simulations, i.e., p = 2, 6, 10, 30, each sample size n = 50, 150, and each combination from Table 2 of the percentage of outliers π and the process mean shifts; and, of course, it is studied in both Cases A and B, i.e., without and with correlations. Since the results with ϑ = 75%, 50%, 25% are very similar to those with ϑ = 100%, and to avoid unnecessarily extending the paper, we include only the results for ϑ = 100%, i.e., when the shift happens in all dimensions. For both Cases A and B described below, and for all the combinations of dimension, sample size, percentage of outliers and shift, the proposed control chart behaves similarly, not only when the shift happens in all dimensions but also when it happens in some percentage of them, as set by the parameter ϑ. The results for ϑ = 75%, 50%, 25% can be found in the Supplementary Material.

Case A
The performance of the control charts is studied by means of the probability of signal, estimated as the proportion of T²_SR(x_k) values (Eq. [7]) that fall above the control limit, based on 10,000 simulations. The probability of signal is also obtained for the classical Hotelling T² control chart and for the robust procedure based on the RMCD proposed by Chenouri, Steiner, and Variyath (2009). Figures A8-A17 in the Appendix show the results for all the different scenarios in the simulation study, in which the probability of signal is obtained for each Phase II non-centrality parameter. Figure A8 shows the resulting probability of signal when the Phase I dataset is outlier-free, i.e., π = 0, and the sample size is n = 50. In all cases, the probability of detecting the Phase II shifts increases with the size of the shift, i.e., with the non-centrality parameter ξ²_II. In the low-dimensional case p = 2, all methods behave similarly. However, when the dimension of the data increases, the T²_RMCD method loses performance. Meanwhile, in these cases, the proposed method T²_SR behaves similarly to the classical Hotelling T² in Phase II, which is efficient since there are no Phase I outliers. On the other hand, Figure A9 shows the results when the sample size increases to n = 150 and there are still no outliers in Phase I. For the three methods the probabilities of signal are similar, although the lowest values are those of T²_RMCD, and when the dimension increases a slightly better performance of T²_SR can be noted. When outliers are included in the Phase I simulated data and they are near the center of the distribution (ξ²_I = 5), while the sample size is low (n = 50), the performance of the robust alternatives is slightly better than that of the classical approach in low dimension p = 2, both for 10% and 20% contamination, as shown by Figures A10 and A11.
But when the dimension increases, T²_RMCD starts to worsen, especially when 20% outliers are considered. Meanwhile, the probability of signal of T²_SR remains higher than those of the classical T² and the other robust alternative, even in high dimension or with high contamination. Figures A12 and A13 show the behavior of the methods when the sample size increases to n = 150. In this case, although it is difficult to detect outliers when they are near the center of the data (ξ²_I = 5), the robust alternatives perform better than the classical approach for all the dimensions considered. However, when the dimension increases, T²_RMCD decreases its probability of signal. T²_SR behaves well in this case because the increase in dimension or level of contamination does not greatly affect its performance. Figures A14 and A15 show the performance when the non-centrality parameter in Phase I is large (ξ²_I = 30) and the sample size is n = 50. In this case, both robust alternatives substantially outperform the classical approach. But as the dimension increases, T²_RMCD loses performance, while T²_SR remains robust, even in the presence of a high level of contamination.
When the sample size is increased to n = 150 (Figures A16 and A17), the performance of the robust approaches is better than with the lower sample size, but T²_RMCD still worsens with the increase of the dimension or the level of contamination and is outperformed by T²_SR. The overall outcome is that, if the variables are independent, i.e., when the covariance structure is the identity matrix, the outliers and mean shifts have a negative impact on the classical Hotelling T² approach, and the use of robust alternatives is needed. The robust T²_RMCD approach performs well only when the dimension or the level of contamination is low, while the proposed T²_SR approach performs better in most cases, even when the dimension or the level of contamination increases.
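The probability-of-signal estimate used throughout this comparison can be sketched as follows (Python/NumPy/SciPy; clean N_p(0, I) Phase I data, the sample estimates as a stand-in for the SR estimates, and the asymptotic χ² limit, so the numbers only illustrate the mechanics of the comparison, not the paper's figures):

```python
import numpy as np
from scipy import stats

def prob_signal(n, p, ncp_II, alpha=0.01, M=2000, rng=None):
    """Proportion of Phase II observations, shifted with non-centrality
    ncp_II, whose T^2 value exceeds the control limit."""
    rng = np.random.default_rng(rng)
    limit = stats.chi2.ppf(1 - alpha, p)
    mu_II = np.full(p, np.sqrt(ncp_II / p))    # Phase II mean shift
    hits = 0
    for _ in range(M):
        X = rng.normal(size=(n, p))            # clean Phase I sample
        mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
        d = mu_II + rng.normal(size=p) - mu    # one Phase II draw
        hits += d @ np.linalg.solve(Sigma, d) > limit
    return hits / M
```

As expected, the estimated probability of signal is near α for ξ²_II = 0 and grows toward 1 as the Phase II shift increases.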

Case B
Figures A18-A20 in the Appendix show the resulting probabilities of signal with respect to the Phase II non-centrality parameter. The results do not change greatly with the sample size, so we show only the results associated with n = 150. The same happens with the different values of the Phase I contamination π, so we show the most significant results, with π = 20%. With outlier-free Phase I data, the three methods behave similarly, but when a certain percentage of outliers is introduced, the classical T² method and the robust alternative T²_RMCD start worsening. As the contamination level increases, the results remain more or less the same, but their behavior gets worse with the increase of the dimension p. Meanwhile, the proposed method T²_SR remains robust to the presence of outlying observations, even when the dimension increases. In general, compared with the results when the variables are independent (Case A), in the presence of correlated features (Case B) a decrease in the probability of signal is noted for all methods, but mostly for the classical Hotelling T² and T²_RMCD. Table 3 shows the resulting computational times in seconds for the three methods, depending on the dimension p and the sample size n. The results are the average times over both simulation schemes, i.e., with independent and with correlated variables.

Computational times
The results show that when the dimension of the data or the sample size increase, the computations become more expensive, but the running time of T 2 SR is very low compared to the classical method and the other robust alternative, specially in high dimension, where T 2 and T 2 RMCD are 6 times slower than the proposed method T 2 SR :
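The direction of this gap can be reproduced qualitatively with off-the-shelf estimators. The SR estimator is not part of standard libraries, so the sketch below uses scikit-learn's LedoitWolf shrinkage estimator as a stand-in for a shrinkage-based fit and MinCovDet for the MCD fit; the sample sizes and the exact ratios are illustrative, not the paper's benchmark.

```python
import time

import numpy as np
from sklearn.covariance import LedoitWolf, MinCovDet

def fit_time(estimator, X, repeats=3):
    """Return the best-of-`repeats` wall-clock time to fit `estimator` on X."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        estimator.fit(X)
        times.append(time.perf_counter() - t0)
    return min(times)

rng = np.random.default_rng(0)
n, p = 150, 30                    # largest setting considered in the simulations
X = rng.standard_normal((n, p))   # in-control Phase I sample

t_shrink = fit_time(LedoitWolf(), X)
t_mcd = fit_time(MinCovDet(random_state=0), X)
print(f"shrinkage fit: {t_shrink:.4f}s, MCD fit: {t_mcd:.4f}s")
```

The shrinkage fit is closed-form, while FAST-MCD must search over many subsets, so the MCD fit is substantially slower; this is consistent with the direction of the timings reported in Table 3.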

Case study
To show the performance of the proposed method in practice, two real examples are selected. The first is also studied in Chenouri, Steiner, and Variyath (2009). The data consist of four variables that measure proper alignment in the final assembly of automobiles, a characteristic known to influence the customer perception of quality. The alignment depends on four angles, namely the front-right and front-left wheel camber and caster. In this example, large volumes of data are available because all final inspections include the measurement of the four angles. Automobiles with any of these characteristics out of specification, or out-of-control, should be reworked before shipment. To monitor the alignment, a multivariate Hotelling T² control chart could be used. However, the process is known to produce occasional outliers that should be reworked, so it is better to use a robust control chart procedure. In this example, the Phase I data consist of 186 vehicles produced on January 2nd. The first step of the proposed method is to compute the Phase I robust SR estimators of location and covariance matrix. The next step is to compute the control limits using Eq. [8]. For n = 186, the resulting values of the control limits are f_{4,0.990} = 13.293 and f_{4,0.999} = 18.809. These control limits are close to the asymptotic classical control limits 13.277 and 18.467, from the chi-square distribution with 4 degrees of freedom, because the sample size n = 186 is reasonably large. They are also slightly lower than the limits estimated by the approach based on the RMCD estimators. Since the robust SR estimates are not influenced by outliers, we do not need to take any action, such as removing outlying points and reconstructing the control limits and the chart. Using the Phase I robust SR estimates, the Phase II control chart for future observations can be obtained.
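The asymptotic reference values quoted above can be checked directly: for large n the classical T² limits approach the chi-square quantiles with p degrees of freedom. A quick verification with SciPy (the SR limits f_{4,0.990} and f_{4,0.999} come from the paper's fitted quantile curves and are not reproduced here):

```python
from scipy.stats import chi2

p = 4  # the four alignment angles
limit_99 = chi2.ppf(0.990, df=p)   # asymptotic 99% limit
limit_999 = chi2.ppf(0.999, df=p)  # asymptotic 99.9% limit
print(round(limit_99, 3), round(limit_999, 3))  # 13.277 18.467
```

As the text notes, both asymptotic values sit slightly below the corresponding SR limits 13.293 and 18.809.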
Figure A1 shows the control chart for the 100 vehicles produced on January 12, the Phase II data. A mean shift in the process can be seen, assuming that the covariance structure remains the same, because a large number of T² values exceed the 99% and 99.9% control limits.
More specifically, 45 Phase II observations exceed the 99% control limit, whereas 21 observations exceed the more restrictive 99.9% control limit. Table 4 contains the results, together with those obtained on this dataset with the classical T² control chart and the RMCD-based control chart studied in Chenouri, Steiner, and Variyath (2009). With the RMCD-based control chart, no outliers are detected in Phase I because of the reweighting step of the RMCD estimates, and 43 (99%) and 18 (99.9%) observations are signaled as out-of-control in Phase II. The proposed method detected a few more out-of-control observations than the classical approach and the robust alternative. Since the Phase I sample is almost outlier-free, the Phase II pattern is broadly similar for the three control charts.
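Once the Phase I estimates are fixed, Phase II monitoring reduces to evaluating the quadratic form for each new observation and comparing it against the limits. Below is a minimal sketch on synthetic data; the location, covariance and shift used here are illustrative stand-ins (in the application they would be the Phase I SR estimates), while 13.293 is the SR 99% limit quoted in the text.

```python
import numpy as np

def t2_stats(X, mu, Sigma):
    """Hotelling-type T^2 statistic for each row of X, given plug-in estimates."""
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    # row-wise quadratic form diff_i' Sigma^{-1} diff_i
    return np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)

rng = np.random.default_rng(1)
mu, Sigma = np.zeros(4), np.eye(4)   # stand-ins for the Phase I SR estimates
shift = 2.0                          # hypothetical mean shift in every variable
X2 = rng.multivariate_normal(mu + shift, Sigma, size=100)

t2 = t2_stats(X2, mu, Sigma)
flagged_99 = t2 > 13.293             # SR 99% control limit from the text
print(int(flagged_99.sum()), "of 100 points signal at the 99% limit")
```

With a shift this large, the majority of Phase II points exceed the limit, mimicking the out-of-control pattern of Figure A1.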
A second case, also used in Quesenberry (2001), Vargas (2003) and Alfaro and Ortega (2008), is studied. The original data consist of 11 quality variables measured on 30 products from a production process. Vargas (2003) considers the first two variables and uses them to compare six different methods, while Alfaro and Ortega (2008) use the same example to illustrate the performance of their proposed approach. Almost all methods proposed by these authors detect observation 2 as out-of-control in Phase I. They then propose to modify two of the observations (16 and 24), turning them into outliers, to test the ability of the methods to detect them in Phase I. Let us call this modified dataset, which contains three outliers, X_new. Only one of the six methods from Vargas (2003) and the proposed method from Alfaro and Ortega (2008) were able to detect the three outliers in X_new through the computation of the control limits; all the other methods, including the classical approach, failed. Since their methodologies are rather different from ours (they perform the detection in Phase I), we propose to use X_new in Phase I and also generate n = 100 Phase II observations containing 10% and 20% of randomly generated outliers, to check the performance of the classical T², RMCD and SR. We repeated the simulation process M = 1000 times, and the average True Positive Rates (TPR) and False Positive Rates (FPR) in Phase II are shown in Table 5. The classical T² does not detect any of the three outliers in Phase I (as reported by Vargas 2003 and Alfaro and Ortega 2008); that is, these outliers are missed due to the masking effect. This obviously worsens its performance in Phase II, failing to detect the out-of-control observations in both cases (10% and 20%). For both RMCD and SR the procedure is different: we do not need to remove outlying points in Phase I as in the classical T².
Using the corresponding Phase I robust estimates, the Phase II control chart for future observations can be obtained. We compute the 99% control limit for both methods, which is the more demanding choice for the FPR (with the 99.9% control limit the FPR would be lower or equal). The results show the minimum possible FPR for both methods, i.e. no observation is incorrectly signaled as out-of-control. Regarding the TPR, i.e. the rate of correctly signaled observations, it is lower for RMCD than for the proposed method SR, and it reduces significantly when the level of contamination increases, while the TPR of SR remains high even under the higher contamination. These results show an advantageous performance of our proposed method SR over the other two.
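The Phase II evaluation reduces to comparing the chart's signals against the known contamination labels. The small helper below sketches how the per-replication TPR and FPR behind Table 5 would be computed; the function and variable names are ours, not the paper's.

```python
import numpy as np

def tpr_fpr(signaled, is_outlier):
    """True/false positive rates of a control chart's Phase II signals.

    signaled:   boolean array, True where the chart exceeds its control limit.
    is_outlier: boolean array, True for observations generated as outliers.
    """
    signaled = np.asarray(signaled, dtype=bool)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    tpr = signaled[is_outlier].mean()    # outliers correctly signaled
    fpr = signaled[~is_outlier].mean()   # in-control points wrongly signaled
    return tpr, fpr

# toy check: both outliers caught, one of three clean points falsely flagged
tpr, fpr = tpr_fpr([True, True, True, False, False],
                   [True, True, False, False, False])
print(tpr, fpr)
```

In the simulation, these rates would then be averaged over the M = 1000 replications to produce the Table 5 entries.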

Conclusions
In this paper, a multivariate robust control chart is proposed as an alternative to the classical Hotelling T² approach. The computation of the chart is based on the robust shrinkage estimators introduced by Cabana, Lillo, and Laniado (2019) and further studied in Cabana, Lillo, and Laniado (2020). The proposed control chart is obtained by plugging these robust estimators of location and covariance matrix into the definition of the classical statistic. Since the estimators have good properties such as robustness, approximate affine equivariance and a high breakdown value, the resulting control chart inherits these advantages.
The quantiles of the proposed robust statistic are computed by means of Monte Carlo simulations considering a wide range of dimensions and sample sizes, extending the previous study of Chenouri, Steiner, and Variyath (2009) up to dimension p = 30. The estimated quantiles are then modeled to approximate the control limits for any sample size, in order to ease the use of the proposed methodology in practice.
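The Monte Carlo scheme behind such control limits can be sketched as follows: repeatedly draw an in-control Phase I sample, fit the robust estimators, draw one future in-control observation and record its plug-in T² value; the empirical (1 − α) quantile of those values estimates the control limit. Since the SR estimator is not publicly packaged, the sketch below substitutes scikit-learn's LedoitWolf shrinkage estimator, so the numbers illustrate the procedure rather than reproduce the paper's quantiles.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def mc_control_limit(p, n, alpha, M=1000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of a plug-in T^2
    statistic under in-control N_p(0, I) data (shrinkage stand-in for SR)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(M)
    for m in range(M):
        phase1 = rng.standard_normal((n, p))   # in-control Phase I sample
        est = LedoitWolf().fit(phase1)
        x_future = rng.standard_normal(p)      # one in-control Phase II point
        d = x_future - est.location_
        stats[m] = d @ est.precision_ @ d      # plug-in T^2 value
    return np.quantile(stats, 1.0 - alpha)

limit = mc_control_limit(p=5, n=50, alpha=0.01)
print(round(limit, 2))
```

In the paper, the quantiles obtained this way for each (p, n) pair are then fitted with a smooth curve in n, which is what allows interpolating the control limits for sample sizes not simulated.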
A simulation study is carried out, considering different Phase I scenarios in which the data are contaminated, to assess the effect on the proposed control chart computations and the capacity to detect Phase II process shifts. The method is compared with the classical Hotelling T² approach and the robust alternative T²_RMCD. Two different correlation schemes are considered: independent and highly correlated variables. Other parameters are also varied to explore different scenarios, such as the dimension, the sample size, the percentage of contamination, the magnitude of the contamination and where it happens, i.e. in all dimensions or in some percentage of them. The results under these scenarios showed that the classical Hotelling T² is highly sensitive to the presence of Phase I outliers. Meanwhile, the robust alternative T²_RMCD shows good results in general, but only when the dimension p or the contamination level is low, and its performance worsens when either increases. On the other hand, the proposed approach T²_SR outperforms the classical method and the other robust alternative in the vast majority of cases, showing special advantages in high dimension or when the contamination level increases, even in the presence of high correlations. The proposed robust control chart performs robustly not only when the shift happens in all dimensions, but also when it happens in some percentage of them. Furthermore, T²_SR shows additional advantages in terms of computational time.
Finally, the proposed method is illustrated with two case studies: one from the automotive industry, consisting of four variables measured on 186 vehicles, and the other from a production process, consisting of two variables measured on 30 products. For the first example, the robust SR control chart is computed for the Phase II data using the robust shrinkage reweighted estimators of location and covariance matrix obtained from the Phase I data. The control limits based on the quantiles estimated in this paper suggest the presence of several outliers at both confidence levels considered, 99% and 99.9%. The results are similar to those obtained with the other methods because the Phase I data were almost outlier-free, with the proposed T²_SR control chart detecting a few more observations as out-of-control. In the second example, additional out-of-control observations are included in Phase I to test the method's performance, as in Vargas (2003) and Alfaro and Ortega (2008). Phase II observations are generated containing 10% and 20% of outliers, and the process is repeated 1000 times to obtain the TPR and FPR of the methods in detecting the abnormal behavior. The classical method is affected by the masking effect and fails to detect any of the Phase II out-of-control observations. The two robust approaches show the lowest possible FPR. However, the TPR is significantly lower for RMCD than for SR, and it worsens as the contamination level increases, while the TPR of SR remains high in both cases. These results demonstrate the good performance of the robust SR control chart in real applications. In this paper, the case of individual observations is considered, analogously to Chenouri, Steiner, and Variyath (2009). The case in which the data are divided into subgroups with equal or different sample sizes is left for future research, to study whether the proposed approach has advantages in this scenario as well.
On the other hand, as stated in Ledoit and Wolf (2004), when the dimension p is larger than the number n of available observations, the sample covariance matrix is not even invertible. When the ratio p/n is less than one but not negligible, the sample covariance matrix is invertible but numerically ill-conditioned, which means that inverting it amplifies the estimation error dramatically. In this case, however, their shrinkage estimator is asymptotically well-conditioned and more accurate than the classical estimator. This fact motivates the further study of the proposed approach in the "large p, small n" scenario. Since this case was not the focus of the present paper, and neither the classical approach based on the sample covariance matrix nor the RMCD method is suitable when n < p, we leave the exploration of the robust control chart based on shrinkage in this alternative scenario for future research.

About the authors
Elisa Cabana is a postdoctoral researcher at IMDEA Networks Institute in Madrid, Spain. Prior to joining IMDEA she obtained her PhD in Mathematical Engineering at the University Carlos III of Madrid. She also collaborates with the uc3m-Santander Big Data Institute. Her areas of interest include robust methods for data analysis, outlier detection, Machine Learning and Artificial Intelligence.
Rosa E. Lillo is a Full Professor at the University Carlos III of Madrid and director of the uc3m-Santander Big Data Institute in Madrid, Spain. Her areas of research are stochastic processes and their applications, stochastic ordering, reliability, Bayesian inference, robust methods for data analysis and functional data.

Figure A1. T²_SR control chart for the 100 Phase II observations collected on January 12. The dashed and solid horizontal lines represent the control limits based on the 99% and 99.9% quantiles, respectively.
Figure A2. Simulated quantiles of T²_SR and fitted curve for p = 5 and α = 0.01.
Figure A3. Simulated quantiles of T²_SR and fitted curve for p = 5 and α = 0.001.
Figure A4. Simulated quantiles of T²_SR and fitted curve for p = 10 and α = 0.01.
Figure A5. Simulated quantiles of T²_SR and fitted curve for p = 10 and α = 0.001.
Figure A6. Simulated quantiles of T²_SR and fitted curve for p = 30 and α = 0.01.
Figure A7. Simulated quantiles of T²_SR and fitted curve for p = 30 and α = 0.001.
Figure A8. Probability of signal when Phase I data is of size n = 50 and outlier free (Combination 1 of Case A in Table 2).

E. CABANA AND R. E. LILLO
Figure A9. Probability of signal when Phase I data is of size n = 150 and outlier free (Combination 1 of Case A in Table 2).
Figure A10. Probability of signal when Phase I data is of size n = 50 and has 10% of outliers with n²_I = 5 (Combination 2 of Case A in Table 2).
Figure A11. Probability of signal when Phase I data is of size n = 50 and has 20% of outliers with n²_I = 5 (Combination 4 of Case A in Table 2).
Figure A12. Probability of signal when Phase I data is of size n = 150 and has 10% of outliers with n²_I = 5 (Combination 2 of Case A in Table 2).
Figure A13. Probability of signal when Phase I data is of size n = 150 and has 20% of outliers with n²_I = 5 (Combination 4 of Case A in Table 2).
Figure A14. Probability of signal when Phase I data is of size n = 50 and has 10% of outliers with n²_I = 30 (Combination 3 of Case A in Table 2).
Figure A15. Probability of signal when Phase I data is of size n = 50 and has 20% of outliers with n²_I = 30 (Combination 5 of Case A in Table 2).
Figure A16. Probability of signal when Phase I data is of size n = 150 and has 10% of outliers with n²_I = 30 (Combination 3 of Case A in Table 2).

Figure A17. Probability of signal when Phase I data is of size n = 150 and has 20% of outliers with n²_I = 30 (Combination 5 of Case A in Table 2).
Figure A18. Probability of signal when Phase I data is of size n = 150 and is outlier-free (Combination 1 of Case B in Table 2).
Figure A19. Probability of signal when Phase I data is of size n = 150 and has 20% of outliers with n²_I = 5 (Combination 4 of Case B in Table 2).
Figure A20. Probability of signal when Phase I data is of size n = 150 and has 20% of outliers with n²_I = 30 (Combination 5 of Case B in Table 2).