Digital Twin simulation models: a validation method based on machine learning and control charts

The adoption of simulation models as Digital Twins (DTs) has gained prominence in recent years and represents a revolution in decision-making. By mirroring the behaviour of physical systems, DTs enable increasingly faster and more efficient decisions. On the other hand, ensuring the validity of these simulation models over time remains a challenge, since traditional validation approaches have limitations when the model is updated periodically. Thus, the present work proposes an approach based on the constant assessment of these models through Machine Learning and control charts. To this end, we suggest a monitoring tool that uses the K-Nearest Neighbors (K-NN) classifier, combined with a p-control chart, to periodically assess the validity of DT simulation models. The proposed approach was tested in several theoretical cases and also implemented in a real case study. The findings suggest that the proposed tool can monitor the DT functioning and identify possible special causes that could compromise its results. Finally, we highlight the wide applicability of the proposed tool, which can be used in different DT models, including near/real-time models with different characteristics regarding connection, integration, and complexity.


Introduction
The use of Digital Twins (DTs) to support decision-making has gained prominence in recent years due to technological advancements and the search for increasingly intelligent and efficient tools (Tao and Zhang 2017). The DT concept is based on the virtualisation of physical systems through digital models that are highly integrated with them. The DT allows mirroring these systems through virtual models connected with physical resources by sensors, intelligent devices, and databases, allowing more efficient and effective decisions (Alam and Saddik 2017; Tao and Zhang 2017). In this scenario, DT applications can be found in the most diverse sectors, such as manufacturing, services, logistics, and healthcare (Wright and Davidson 2020).
In particular, Zhong et al. (2017) report that, in the Industry 4.0 context, DT-based decisions have become one of the pillars of this new industrial era, which is based on the digitisation of processes and highly intelligent decision-making. Moreover, the use of DTs to optimise decisions has been widely disseminated by several authors, representing a revolution in the decision-making process (Semeraro et al. 2021). In this case, Stavropoulos and Mourtzis (2022) reinforce that the DT contributes to better products and processes since it allows decisions based on the results and behaviours of physical systems. Finally, it is important to highlight that the DT benefits can be potentialised by integrating it with other Industry 4.0 pillars, such as Cloud Technology, the Internet of Things (IoT), and Big Data (Tao et al. 2018).
Moreover, for the design of DT models, in addition to commercial packages that often accompany physical equipment, Santos et al. (2022a) highlight the use of computer simulation, with emphasis on Discrete Event Simulation (DES) and Agent-Based Simulation (ABS). The authors reveal that this approach has stood out due to the flexibility and cost-effectiveness of simulation software and packages, which combine good analysis techniques with graphic resources and easy integration with physical systems. Mourtzis (2020) also highlights that process digitalisation through simulation has a fundamental role in the future of operations towards Industry 4.0, while Rodič (2017) reveals that simulation is living its DT era.
However, when considering the use of simulation as DT, we must guarantee several characteristics and functionalities of the simulation model. First, Wright and Davidson (2020) point out that the virtual model must be able to capture physical changes and adapt accordingly. Furthermore, considering the dynamic behaviour of the systems, it is necessary to guarantee the correct functioning of these DT models. In this sense, in addition to validating the model during the building phase, its validity over time must be ensured (Tao and Zhang 2017; Wright and Davidson 2020). Santos et al. (2022a) point out the need for methods and systems aimed at the periodic assessment of DT models, representing a fertile field of research.
On the one hand, traditional validation during model building is critical to ensure that the virtual model matches the system behaviour and, in this regard, several techniques in the literature address qualitative and, mainly, quantitative methods (Sargent 2013). In this context, it is worth emphasising the wide use of hypothesis tests to assess the correspondence of the model with the modelled systems. On the other hand, when considering simulation-based DTs, Meng et al. (2013) report that ensuring their validity becomes an even more critical and complex task since the model is continuously updated according to the physical changes. In this case, Santos et al. (2022a) reveal that traditional validation approaches cannot deal with this dynamic characteristic of the models. Moreover, the authors state that, although periodic assessment is one of the requirements for the use of DT simulation models, only 1.3% of applications adopt assessment procedures, which are based on adaptations of conventional validation methods, such as periodic hypothesis testing. More details can be found in the literature review by Santos et al. (2022a).
However, it is important to highlight that the adoption of periodic hypothesis tests has limitations, since the difficulty of monitoring grows as the number of key variables to be monitored grows. In other words, this approach is unsuitable for complex models, which demand multiple hypothesis tests and may be inefficient and time-consuming. In addition, it provides only a punctual assessment and does not consider the history of the DT's operation over time. Finally, comparing DT data with the physical system using only descriptive statistics (i.e. mean and standard deviation) may result in erroneous statements, since different statistical distributions can have similar parameters (Chen et al. 2019). Therefore, considering that there are no approaches in the literature focused on monitoring DT simulation models and that the adaptation of conventional validation techniques has several limitations, we intend to fill this theoretical and practical gap, which was also highlighted by Zhuang, Liu, and Xiong (2018) and Wright and Davidson (2020).
This work aims to propose a new approach for monitoring DT simulation models. One alternative for assessing complex systems with several variables is the use of classifiers and, in this case, we adopted the K-Nearest Neighbors (K-NN) classifier. Furthermore, to monitor the K-NN results and, consequently, the behaviour of DTs over time, we also propose the use of a control chart. This way, it is possible to monitor complex DT models and assess several variables simultaneously, creating a versatile and practical monitoring tool. It is important to note that this work does not simply adopt techniques widely disseminated in the literature, but combines them to create an efficient tool that addresses a practical and theoretical gap.
First, we adopted the K-NN to compare physical and virtual systems, since classifiers have stood out as valuable tools for classifying and comparing datasets with different characteristics and complexities. Among the classifiers, we highlight the K-NN for its practicality and robustness, being widely used by researchers and practitioners (Kumar et al. 2020; Wang, Tsai, and Lin 2021). Furthermore, to monitor the K-NN accuracy over time, we adopted the control chart, one of the main process monitoring techniques, which is already widely used by decision-makers (Montgomery 2009).
We highlight that the presented approach allows an easier, faster, and more robust validation of DT simulation models without the limitations and difficulties associated with the adaptation of traditional validation methods, as previously discussed. Through the proposed tool, decision-makers can monitor the DT validity during its operation, even considering complex models with different operating characteristics, which can contribute to the growing adoption of DTs and fill the gap in the literature.
The rest of this paper is organised as follows: Section 2 provides a literature review to clarify the main concepts and themes covered in this work. The proposed approach is described in Section 3. Section 4 is dedicated to the implementation of the proposed approach in theoretical and real cases. Finally, Section 5 presents the conclusions and future directions.

Simulation-based DTs
The use of computer simulation as DT models is not recent, but it has been gaining prominence in recent years due to technological advances (Santos et al. 2021). For Wright and Davidson (2020), what differentiates a traditional computational model from a DT approach is the ability to connect it to the physical systems through their data and extend the use of the model over time scales. By connecting the model to the physical systems, we have a synchronised copy that adapts according to the current physical states. Finally, Santos et al. (2022b) add that DT responses can be autonomous or semi-autonomous (human decisions are required), and both real-time and near real-time approaches are reasonable.
According to Tao and Zhang (2017), although there are different levels of DT, which vary according to the level of integration between the physical and virtual environments, they are all based on four main components: (i) the Physical System (PS), composed of humans, materials, and processes; (ii) the Virtual System (VS), which consists of models that represent the physical behaviour; (iii) the Service System (SS), which includes the structure capable of allowing communication between the physical and virtual environments; and (iv) the DT Data (DTD), a set of data and information that is transmitted between the systems. Figure 1 illustrates the DT structure.
DTs have a fundamental role in the search for smarter and more efficient decisions, and we can highlight their use in different scopes and with different objectives. DTs can be adopted before and/or during the processes' operation phase (Santos et al. 2022a). Adopting DT models before the operation phase is associated with the systems' design and configuration, where it is possible to test solutions, propose scenarios, carry out what-if analyses, and validate the systems' behaviour (Leng et al. 2021; Liu et al. 2020). On the other hand, the adoption of DTs during the processes' operation phase is associated with important objectives such as evaluating the physical behaviour over time, adjusting the process and reconfiguring its parameters, and performing analyses based on the process' current status, among others (Alam and Saddik 2017; Leng et al. 2020; Tao and Zhang 2017).
In both approaches, before and during the operation phase, the DT model must be able to connect with the physical systems and mirror their main behaviours in a timely manner, evaluating, optimising, and predicting decisions (Semeraro et al. 2021). Furthermore, DT-based decisions also allow coevolution between the physical and virtual environments. For Tao et al. (2019), this approach allows continuous improvement of the process, and a mutual evolution between the physical and virtual parts is expected.
Moreover, according to Santos et al. (2022b), since DT commercial packages often involve significant investments, the use of simulation models as DTs may be a cheaper and more flexible alternative. In this sense, unlike traditional simulation approaches, the authors point out some requirements to consider a model as a DT: (i) the model must be connected with the physical systems and synchronised according to their changes; (ii) there is an automatic data flow between the model and the modelled systems; (iii) the model should be updated periodically (near/real-time), and it can act on the systems (autonomous) or just suggest decisions; and (iv) the DT-based decision is periodic and the model should have a user-friendly interface.
Regarding the use of DT simulation models, two different stages must be considered: DT building and DT operation (Santos et al. 2022b). During the building phase, we must build the digital model, carry out its integration with the physical systems, and validate it to guarantee that the model represents the physical behaviours. In the operation phase, we assume that the model is validated and that it will be periodically updated to support decisions. In other words, the building phase focuses on developing the structure composed of the PS, VS, SS, and DTD, while the operation phase guarantees that all components work correctly. In this case, Zhuang, Liu, and Xiong (2018) and Wright and Davidson (2020) highlight that ensuring the model's validity during its operation is a challenge and represents a fertile field for research. Wright and Davidson (2020) reveal that the use of DTs is associated with high-impact decisions, which illustrates the importance of ensuring the reliability of their results. To this end, Tao and Zhang (2017) highlight that we must frequently compare the physical and virtual results to guarantee valid and accurate systems during the DT operation. The authors state that verification, validation, and accreditation (VV&A) routines should be carried out periodically. According to Sargent (2011), the verification and validation steps seek to ensure the correct functioning of the computational model and its satisfactory correspondence with the physical systems, respectively. Moreover, accreditation is related to the decision-maker's assessment of the model's reliability. However, despite the importance of ensuring the DT's VV&A, Zhuang, Liu, and Xiong (2018) and Onggo et al. (2018) reveal that there is still a lack of methods and techniques for this purpose.
The literature presents some approaches that address the assessment of DTs, such as the work by Zhang, Qi, and Tao (2022), which describes a framework to evaluate some important characteristics of DT models before and after the building phase. However, the authors do not evaluate the model during its use to support decisions. Moreover, Liu et al. (2022) proposed a mechanism that evaluates the reliability of an automated machine's DT, but they assess just one quality characteristic by monitoring the systems' behaviour at each manufacturing stage. Therefore, the observed works do not include proposals for monitoring simulation-based DTs throughout their life cycle in decision support, which would require a periodic analysis of several important variables to guarantee the validity of the results.
Furthermore, Santos et al. (2022a) carried out a systematic literature review on applications involving simulation-based DTs to support decisions in productive systems. Among the main results, the authors highlight that the periodic assessment of DT models is neglected by the vast majority of authors. The only approach observed was proposed by Cho et al. (2019) and refers to the adaptation of traditional validation methods, where periodic hypothesis tests are used to assess the DT model validity. At each decision-making interval, the authors conduct hypothesis tests to ensure that the model is valid. Although this assessment method is widely adopted in traditional simulation approaches (Sargent 2013), we highlight that it is not focused on DT models.
Some limitations and challenges regarding the periodic assessment of DT simulation models should be highlighted. First of all, the frequency of model updates and the model's complexity impact the choice of the accuracy assessment technique. Therefore, when adopting traditional statistical approaches such as hypothesis testing, decision-makers may face some difficulties: subjective choices of the monitoring frequency; the use of different simultaneous tests, considering variables of different natures (i.e. numerical and categorical, integer and real); and difficulty in considering the DT's behaviour over time, since these tests do not consider the system's past states. Finally, it is important to mention that the execution of several hypothesis tests can be time-consuming and culminates in a lower overall confidence level.

K-NN and control chart to support validation of simulation-based DTs
Since there is a need to adopt advanced techniques and tools to assess simulation-based DTs over time, we highlight the K-Nearest Neighbors (K-NN) classifier as an alternative to compare DT and physical systems. The K-NN is a Machine Learning technique that relies on the labels of the 'K' closest neighbour samples (training data) to classify test data points (Kumar et al. 2020). Although K-NN is one of the simplest Machine Learning algorithms, we chose it because of its robustness (Lee et al. 2020). Furthermore, according to Kumar et al. (2020) and Zhang et al. (2021), this classifier is suitable for problems with multiple classes of data, especially considering unknown distributions. The algorithm is a supervised learning method (Lee et al. 2020) that uses the Euclidean distance to define the nearest neighbours and then predicts the label of the test data points according to the labels of their nearest neighbours (Gong, Su, and Tseng 2020). Ghosh (2006) highlights that the K-NN performance depends on the value of the neighbourhood parameter 'K'. Thus, the 'K' value will vary according to the characteristics of the application, and choosing the optimum value is a difficult task (Hassanat, Abbadi, and Alhasanat 2014). Zhang et al. (2021) state that a larger 'K' can suppress noise in the data but increases the algorithm's complexity, while a small 'K' may amplify the noise, resulting in overfitting. Figure 2 shows the K-NN classification procedure and the importance of the choice of the 'K' value. In this case, for 'K' = 3, we note that the nearest neighbours mostly have 'Label 2', while for 'K' = 7, 'Label 1' is predominant.
Furthermore, the K-NN accuracy indicates how well the algorithm can classify the data. According to Boontasri and Temdee (2020), this accuracy will depend on the data training process, which is a critical step of supervised methods (Wang et al. 2020). Thus, we expect the accuracy to increase as the algorithm correctly classifies the data (target accuracy of 100%). On the other hand, if no label has a clear majority among the close neighbours, the algorithm assigns a random label to the test data point and the accuracy tends to 50% (Kumar et al. 2020). Therefore, we will use the K-NN to classify and compare physical and DT data to assess the similarity between both and, consequently, the validity of the DT model. Moreover, since the proposed approach suggests monitoring the K-NN accuracy over time, it is necessary to adopt a monitoring technique. In this case, we highlight the control charts.
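For illustration, a minimal sketch of this idea follows, assuming the scikit-learn library and emulated one-dimensional samples (the array names and distribution parameters are hypothetical): when the DT mirrors the physical behaviour, the cross-validated K-NN accuracy stays close to 50%.

```python
# Minimal sketch (hypothetical data): K-NN accuracy as a similarity measure
# between physical and DT outputs.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
phys = rng.normal(10.0, 2.0, size=(50, 1))   # emulated physical observations
virt = rng.normal(10.0, 2.0, size=(50, 1))   # emulated DT observations (same behaviour)

X = np.vstack([phys, virt])
y = np.array(["real"] * len(phys) + ["virtual"] * len(virt))

acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
# Accuracy close to 0.5 means the classifier cannot tell the two sources apart
# (the DT mirrors the physical data); accuracy near 1.0 means they clearly differ.
print(f"K-NN accuracy: {acc:.2f}")
```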
According to Montgomery (2009), control charts are one of the primary statistical process control techniques and are helpful for monitoring output variables in systems subject to undesired sources of variability. Initially proposed by Walter Shewhart in 1924, control charts are used to signal the presence of these sources of variability. Basically, samples are regularly taken from the process and the value of a monitoring statistic is calculated. This statistic is then plotted on a chart and, if it falls beyond the control limits, the process is considered to be out of control and corrective actions must be taken (Zwetsloot and Woodall 2021).
Control charts have been used in several areas, such as manufacturing, healthcare, and services, and different types of charts have been developed (Abbas et al. 2019; Chukhrova and Johannssen 2019). Considering the K-NN accuracy as the process parameter to be monitored, we adopted the p-chart. This choice is justified because the K-NN accuracy may be obtained from the proportion of errors and successes of the classifier when trying to separate physical and virtual data. In this case, there is a binomial distribution, and Chukhrova and Johannssen (2019) highlight that p-charts are widely used to monitor the proportion of observations with some particular characteristic.
The p-chart control limits, called the Upper Control Limit (UCL) and Lower Control Limit (LCL), and the Centre Line (CL) depend on the proportion 'p' of a characteristic evaluated over a given dataset. Details of their calculation can be consulted in the work of Abbas et al. (2019). It is important to highlight that a control chart study is divided into two main phases: phase I corresponds to the determination of the UCL, LCL, and CL, while phase II is dedicated to process monitoring (Chukhrova and Johannssen 2019). Figure 3 illustrates a typical p-chart.
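For reference, the textbook p-chart limits take the form below (Montgomery 2009), where p̄ is the average proportion estimated in phase I, n is the sample size, and L is the control-limit width in standard deviations (typically L = 3); the exact formulation used by Abbas et al. (2019) may differ in notation.

```latex
\mathrm{CL} = \bar{p}, \qquad
\mathrm{UCL} = \bar{p} + L\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}, \qquad
\mathrm{LCL} = \bar{p} - L\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}
```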

Proposed approach
It is important to highlight that this paper does not address the steps related to DT building. Therefore, we assume that there is a DT simulation model already verified and validated during its building phase and ready to support decision-making. In other words, we focus on ensuring the validity of the DT model over time. We also reinforce that this work focuses on the DT results to validate it. This approach is in line with Tao and Zhang (2017), who stated that one of the ways to assess the DT validity is by comparing its results with those of the physical environment. In this case, if there are problems in the model or in any other DT component (software/hardware), we expect these issues to impact the DT results.
Thus, we suggest an approach based on three main phases: (i) definition of the DT evaluation variables (in which the key system variables are selected and prepared); (ii) monitoring interface building and configuration (which involves the creation/configuration of an interface that classifies physical and DT data using the K-NN and plots the classifier accuracy on a p-control chart); and (iii) periodic monitoring (where the interface is updated periodically).

Definition of the DT evaluation variables
The evaluation variables are the DT parameters that represent how similar the virtual model is to the physical environment. Thus, it is important to highlight that the correct choice of these variables is as relevant as the functioning of the proposed monitoring system. According to Sargent (2013), the evaluation variables of a simulation model are related to the model's purpose. Therefore, when considering the use of simulation models as DTs, we must choose the variables that directly impact decision-making. Some of the most frequent evaluation variables of DTs of productive processes are process times, waiting times, and resource utilisation rates, among others.
Moreover, the variables must be chosen in pairs; that is, it is important to ensure that each variable is available in both the virtual and physical systems in the same proportion, so that the two environments can be compared and the DT validity inferred. Data can be collected in the physical system through sensors, smart devices, and databases, among others, while the model must be programmed to present the same information. It is important to note that the evaluation variables are related to the model's output and not to the input data, which are also collected over time. As highlighted before, the proposed approach allows different types of variables, i.e. numerical and categorical, integer and real. Finally, although the variables must be collected in pairs and in the same proportion, the time resolution of the physical systems and the DT model is not necessarily the same. Therefore, we reinforce that the variables do not need to be collected at the same time, but they need to be available for updating the monitoring tool, as described in the following sections.
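To make the expected data layout concrete, a small illustrative example follows (the column names and values are hypothetical): each row holds one observation of the evaluation variables, and a label column records whether the row comes from the physical system or from the DT.

```python
# Illustrative dataset layout (hypothetical column names and values).
import pandas as pd

dataset = pd.DataFrame({
    "process_time":  [12.4, 11.9, 12.1, 12.6],   # e.g. from sensors / model output
    "waiting_time":  [3.2, 2.9, 3.5, 3.1],
    "resource_util": [0.81, 0.78, 0.83, 0.80],
    "source":        ["real", "real", "virtual", "virtual"],  # same proportion of both
})
```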

Monitoring interface building/configuration
The monitoring interface was planned to be a user-friendly tool and was coded in Python. The interface must carry out some activities, as illustrated in Figure 4. Furthermore, to better understand and reproduce the proposed solution, the complete code can be found in the supplementary material, where all the libraries, variables, and logic are described.
Firstly, the physical and DT data are collected as a dataset, both in the same proportions. We opted to use datasets from Microsoft Excel® since it is a widely used platform and compatible with most software and hardware solutions. Moreover, before being classified, all the data must be scaled, since the system might have variables with different characteristics, i.e. numerical and categorical data, discrete and continuous, real and integer, among other combinations. Several scaling techniques could be adopted and we chose the Robust Scaler. First, if there are categorical variables, the algorithm transforms them into numerical ones. Then, the Robust Scaler transforms all evaluation variables, one by one, by removing the median of each dataset and scaling the data according to the interquartile range. This way, we guarantee that all data are on the same scale. More details about this method can be consulted in Raju et al. (2020). It is important to highlight that, although randomness is expected in simulation studies, the presence of possible outliers in the data does not compromise the usability of the proposed tool, since the Robust Scaler scales the data according to the median of the dataset (more robust to outliers than the mean). After this step, the dataset is ready for classification through the K-NN.
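A minimal sketch of this preparation step is shown below, assuming a pandas dataset read from an Excel file and scikit-learn's RobustScaler; the file name, the 'source' label column, and the use of one-hot encoding for categorical variables are assumptions made for illustration.

```python
# Sketch of the data preparation step (hypothetical file and column names).
import pandas as pd
from sklearn.preprocessing import RobustScaler

data = pd.read_excel("dt_monitoring_sample.xlsx")   # physical + DT data, same proportions
labels = data.pop("source")                          # 'real' / 'virtual' labels

# Encode categorical evaluation variables as numbers before scaling.
features = pd.get_dummies(data)

# RobustScaler centres each variable on its median and scales by the
# interquartile range, limiting the influence of possible outliers.
scaled = RobustScaler().fit_transform(features)
```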
The K-NN is trained from the dataset through cross-validation. This technique helps to minimise bias during the training process by using random sampling. It starts by randomly dividing a given dataset into 'Z' folds, where one fold is used to test the model (test set) and the rest is used to train the model (training set). Then, the algorithm repeats the process to train and test 'Z' times (Boontasri and Temdee 2020). The cross-validation flow is illustrated in Figure 5. Moreover, the optimal 'K' value was also defined using cross-validation, as suggested by Zhang et al. (2021). In this case, for each dataset reading (each observation), the algorithm tests some 'K' values and chooses the one with the highest classification accuracy.
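A sketch of this selection step follows, reusing the `scaled` features and `labels` from the previous sketch; testing three multiples of √m is an assumption made here for illustration.

```python
# Sketch of the 'K' selection by cross-validation (continues the previous sketch).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

m = len(scaled)                                  # dataset size
base = max(1, int(round(np.sqrt(m))))            # K = sqrt(m) as a starting point
candidates = [base * i for i in (1, 2, 3) if base * i < m]

scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             scaled, labels, cv=5).mean()
          for k in candidates}
best_k = max(scores, key=scores.get)             # 'K' with the highest accuracy
accuracy = scores[best_k]                        # value later plotted on the p-chart
```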
At each observation, the dataset, which contains data labelled 'real' and 'virtual', is read by the algorithm, which carries out the training process. In this case, the classifier learns the behaviour of both labelled datasets. It is important to highlight that the training set and the test set (see Figure 5) might be composed of both labels ('real' and 'virtual'). After training, the K-NN tries to separate the physical and DT data. If the DT data are perfectly realistic, the K-NN algorithm will not be able to identify which data come from the DT and which come from the physical system, and its accuracy tends to 50%; conversely, if the DT results diverge from the physical behaviour, the classifier separates the two labels more easily and the accuracy approaches 100%. The intermediate values in Table 1 were obtained from the interpolation of these extreme values and are presented to give an order of magnitude to the proposed approach.
The K-NN accuracy is calculated and stored over time. The algorithm accumulates the first 'j' observations (each with sample size 'n') and uses them to define the p-chart control limits (phase I of the control chart). During this phase, we must ensure that there are no special causes in the process (the model and the real systems are validated and operating as planned). Then, it is possible to plot the p-chart and use it to monitor the DT over time. Moreover, the described activities are cyclical and must be carried out periodically. Thus, after the chart is plotted, it is updated with each new observation (phase II of the control chart). Furthermore, as stated before, the desired K-NN accuracy is around 50%, which indicates that the DT is valid and presents results close enough to those of the physical systems. Therefore, the p-chart allows the monitoring of the accuracy over time and, if any element that makes up the DT suffers a functional failure, it is expected that the control chart will identify this effect. It is important to highlight that, although the minimum K-NN accuracy value is 50% for the DT model to be considered valid, we expect variability in the measurement of this accuracy, given the stochastic behaviour of this monitoring variable. Therefore, the control chart may present values above and below 50%, but they must remain within the defined control range for the model to be considered valid.
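A compact sketch of the two control chart phases is given below; the accuracy values are made up, and the limit formula follows the standard p-chart expression quoted earlier.

```python
# Sketch of phase I (limit estimation) and phase II (monitoring) of the p-chart.
import numpy as np

def pchart_limits(phase1_acc, n, L=3.0):
    """Estimate CL/UCL/LCL from in-control phase-I K-NN accuracies."""
    p_bar = float(np.mean(phase1_acc))
    sigma = np.sqrt(p_bar * (1.0 - p_bar) / n)
    return p_bar, min(1.0, p_bar + L * sigma), max(0.0, p_bar - L * sigma)

# Made-up phase-I accuracies (j = 25 observations, each with sample size n = 100).
phase1 = np.random.default_rng(0).normal(0.55, 0.04, 25).clip(0, 1)
cl, ucl, lcl = pchart_limits(phase1, n=100)

def out_of_control(acc, ucl=ucl, lcl=lcl):
    """Phase II: signal when a new accuracy falls outside the control limits."""
    return acc > ucl or acc < lcl

print(out_of_control(0.82))   # a clearly higher accuracy would trigger a signal
```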

Periodic monitoring
Periodic monitoring is the main objective of the proposed approach, and it is important to define some issues, such as the monitoring frequency and the actions to take when problems are identified in the functioning of the DT.
On the one hand, the definition of the monitoring frequency does not depend on the characteristics of the DT. Santos et al. (2022a) reveal that, when it comes to using simulation models as DTs, several approaches are feasible, such as model updates in real or near real-time and constant or variable update intervals. However, given the ease of data collection during DT operation, the monitoring should be performed whenever the monitoring sample size is reached. In other words, data are collected continuously and, each time the monitoring sample size is reached, the process described in Figure 4 is carried out again, forming a cyclical process.
On the other hand, when considering actions regarding possible DT problems, it is important to note that the proposed tool serves exclusively to indicate when the DT does not behave as expected; it does not act to correct the system. When the DT model is considered invalid, the problems may go beyond the model, such as failures in the physical systems or communication interruptions. Therefore, when observing out-of-control signals indicating that the DT model is not valid, it is important to check whether it is a false alarm or a problem to be solved. If problems in its functioning are confirmed, the user must return to the DT building phase to fix the issues and validate the model before using it to support decisions. Once operating again, the DT should be monitored over time. Therefore, we highlight three stages of the DT: (I) working correctly; (II) verification required, where the decision-maker must verify possible false alarms; and (III) readjustment required, where one or more problems were confirmed. Figure 6 illustrates the proposed tool architecture and the DT stages.

Experimental results and discussion
The proposed approach was first applied to theoretical cases to demonstrate its applicability. In this case, we evaluated the behaviour of the monitoring tool when subject to special causes that would impact the DT results. This analysis would be impractical in a real case study since inducing special causes would not be feasible. Then, after demonstrating the tool's functionality, we applied it to a real case study, where the objective was to monitor a DT of a production line over time, ensuring its correct functioning. The results are presented below.

Parameters definition
Considering the proposed approach, some parameters must be defined before implementing it: (i) the number of folds ('Z') adopted during the cross-validation of the classifier; (ii) the number of neighbours ('K') considered for the classification by the K-NN; (iii) the sample size ('n') collected in each system observation considering physical and virtual data; and (iv) the number of initial observations ('j') used in phase I of the control chart. Considering the 'Z' value, we adopted 5-fold cross-validation (the library default option), as suggested by Li, Tan, and Ng (2019). Moreover, we carried out some tests with other values of 'Z' (i.e. 10, 15, and 20) and they did not result in more accurate classifications. Therefore, we considered 'Z' = 5 reasonable for our study.
Regarding the number of neighbours 'K', Hassanat, Abbadi, and Alhasanat (2014) state that several works adopt K = √m, where 'm' is the dataset size. We therefore configured the algorithm to automatically test multiples of this value and choose the one that results in the highest classification accuracy. Hence, depending on the data, we might have different values of 'K' for different system observations. It is important to highlight that 'm' (the dataset size for the K-NN classification) and 'n' (the sample size for the p-chart) are the same, and their value was defined through the equation proposed by Montgomery (2009). We considered a fraction (p) of 0.5 (the target value) of the assessed characteristic, a distance of the control limits from the centre line (L) of 3 standard deviations, and a magnitude of the process shift (δ) of 0.15 (to guarantee a strong DT accuracy according to Table 1). Therefore, we obtained n = 100 and, considering that the dataset is composed of physical and virtual data, the monitoring tool was configured to periodically collect samples of 50 from both the physical systems and the DT, totalling a sample of 100.
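Assuming the usual sample-size expression for a p-chart from Montgomery (2009), in which a shift of magnitude δ is to be detected with control limits L standard deviations wide, the value n = 100 can be reproduced as:

```latex
n = \left(\frac{L}{\delta}\right)^{2} p\,(1-p)
  = \left(\frac{3}{0.15}\right)^{2} \times 0.5 \times (1-0.5)
  = 400 \times 0.25
  = 100
```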
The number of initial observations considered in phase I of the p-chart was defined as j = 25. This value is suggested by Montgomery (2009) and has been widely adopted by several authors, being a reference in works focused on process control and monitoring. It is important to highlight that phase I is focused on chart building and that it is necessary to ensure that the process is under control, that is, free from special causes. Since phase I is not a cyclical activity, the algorithm was programmed to define the chart limits based on the 25 initial observations and then just monitor the process based on the chart already built. Table 2 summarises the parameters adopted for the proposed monitoring tool.

Theoretical cases
We emulated physical and DT data for five theoretical cases considering standard statistical distributions.
For each case, we adopted the three proposed steps: the selection of the evaluation variables, the building/configuration of the monitoring interface, and the periodic monitoring. We selected simple cases (using Lognormal, Bimodal, and Poisson distributions) and more complex ones (multivariate normal distributions with 3 and 5 variables). This choice is justified because these distributions are commonly used in simulation projects. Thus, the objective was not to test the proposal on all possible distributions but to demonstrate its applicability and evaluate the behaviour of the monitoring tool when significant changes occur in the system. Table 3 presents the distribution parameters adopted, including the 5 × 5 covariance matrix of the multivariate case with five variables: [[96.6, 63.7, 66.4, 67.2, 40.8], [63.7, 67.6, 52.8, 46.7, 42.7], [66.4, 52.8, 77.3, 53.4, 34.9], [67.2, 46.7, 53.4, 53.3, 35.4], [40.8, 42.7, 34.9, 35.4, 34.2]]. We emulated 100 observations with a sample size of 100 for each case, representing the DT and physical data collected over time. Thus, the distribution data represent the physical and DT evaluation variables (in the same proportions), and they were stored in different datasets to input the monitoring interface. For cases I, II, and III, the dataset has one data column representing a single evaluation parameter ('x'), while in cases IV and V there are three and five data columns representing three ('x1', 'x2', and 'x3') and five ('x1', 'x2', 'x3', 'x4', and 'x5') evaluation variables, respectively. The evaluation variables were chosen according to the characteristics of each distribution, and it was not possible to consider the importance of the variables for the DT operation since they are theoretical cases. Figures 7 and 8(a-c) illustrate the comparison of what would be physical and DT data, considering cases I and IV, respectively. Cases II and III present a behaviour similar to case I, while case V is similar to case IV but with two more variables.
As stated before, the monitoring interface was configured to scale the data and assess the optimal 'K' value through cross-validation, repeating the procedure at each observation. Figure 9 illustrates the definition of the 'K' value, considering K = √m and its multiples. The optimum value is the one with the highest K-NN accuracy.
Once trained, the K-NN accuracy was calculated for each observation. For all theoretical cases, the first 25 observations were used to build the control chart limits, while the others were plotted on the chart, representing data collected over time. Moreover, we induced a special cause (a 10% variation) at observation No. 50 (only in the dataset that would represent physical data) to assess the performance of the monitoring tool in identifying such a cause. It is important to note that a 10% difference between the physical and the virtual environments is generally considered common in simulation projects. However, considering that the use of DTs is associated with decisions of significant impact, it is important to monitor their functioning and identify signals that may be causing this variation. Figure 10(a-e) presents the control charts for cases I to V, respectively. For all cases, we observe that the monitoring interface works as expected. At first, the K-NN average accuracy for all cases represents a strong/very strong DT validity, demonstrating that the classifier can identify the similarity between what would be the virtual and physical data. Moreover, when considering a slight variation in the dataset that would represent the physical data, we noted that the classifier could identify this special cause, as shown by the out-of-control signal on the charts. Finally, we replicated the control charts 1000 times to analyse their behaviour, maintaining the special cause in sample No. 50. In this case, as shown in Figure 11, we noticed that in about 97% of the charts there was the presence of a special cause in sample No. 50, considering case I. The other cases presented a similar behaviour. Therefore, we concluded that the tool works as expected and can be adopted in real case studies.
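A minimal sketch of how such an experiment could be reproduced for a case-I-style scenario is given below; the Lognormal parameters are hypothetical, since Table 3 is not reproduced here, and only the generation of the observations with the induced 10% variation is shown.

```python
# Sketch of a theoretical-case experiment: physical and DT data follow the same
# Lognormal distribution, and a 10% variation is injected into the "physical"
# data at observation No. 50 (hypothetical distribution parameters).
import numpy as np

rng = np.random.default_rng(1)
n_obs, half_sample = 100, 50            # 100 observations, 50 physical + 50 virtual each

observations = []
for obs in range(1, n_obs + 1):
    phys = rng.lognormal(mean=3.0, sigma=0.25, size=half_sample)
    virt = rng.lognormal(mean=3.0, sigma=0.25, size=half_sample)
    if obs == 50:                       # induced special cause: 10% variation
        phys = phys * 1.10
    observations.append((phys, virt))
# Each (phys, virt) pair would then be scaled, classified by the K-NN, and its
# accuracy plotted on the p-chart, as in the sketches above.
```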

Real case study
Considering the use of the proposed approach in a real study object, we opted for a DT adopted by a medium-sized Fast Fashion company. The Fast Fashion segment has high demand variability and lacks faster and more efficient decision-making tools. Thus, the DT was planned as a DES model to optimise the operational planning of one of the production lines. The line produces three different products (clothing items), whose demand varies throughout the year, and the production flow is divided into 8 workstations (Process B to Process I), in addition to the Reception (A) and Shipping (J) areas. The process is mostly manual and part of the product transport is carried out automatically using an automated guided vehicle (AGV). Figure 12 illustrates the floor plan of the line with the production flow, as well as its 3D simulation model built using FlexSim®.
The DT was previously built, verified, and validated by the team responsible for the project and is already operational. It is a near real-time and non-autonomous approach: the DT is updated weekly and provides guidelines for decision-making. In addition to the DES model, the DT also has a tool based on Artificial Intelligence (AI) and a decision-making dashboard. The AI algorithm, based on Artificial Neural Networks (ANN), predicts the weekly demand behaviour considering the demand history available in the local database. Based on the forecasted demand, the DES model tests different weekly resource planning strategies and indicates the best decision regarding resource sizing (physical and human). The decision dashboard integrates the physical and virtual environments and provides a user-friendly interface for the decision-maker through automated buttons, where the user can run the AI and DES models and also view the guidelines for decision-making. Figure 13 shows the DT architecture; more details about the DT structure, components, and main characteristics can be found in Santos et al. (2021).
In addition to the demand history, the local database also contains production data collected through radio frequency identification (RFID) tags, which capture information about the products while they are being processed. Therefore, we carried out the proposed three-step approach: first, we selected the DT evaluation variables, followed by the configuration of the monitoring tool, and, finally, the periodic assessment of the DT. The evaluation variables selected were the waiting times and process times for each product, considering the main workstations (Process C to I), as well as the transport time by AGV (totalling 14 variables). These variables are continuously collected by the RFID tags (physical data), while the DES model was configured to periodically offer the same information from the DT (virtual data). Then, both physical and virtual data are periodically stored in different datasets.
The average weekly demand is around 750 clothing items (including the three product types). The monitoring tool was planned to collect the physical data in samples of 50 during the week, while the DT data are collected in the same proportion, totalling datasets of 100. Thus, if the line produces 750 items in a week, for example, about 15 samples will be collected. The monitoring tool was configured to be updated at each sampling. At each observation, the K-NN accuracy was obtained and, after the first two weeks of data collection, it was possible to define the control limits from the first 25 observations. Then, the next samples were plotted on the control chart as they were collected. For this work, we recorded the first month of the monitoring tool's operation, as illustrated by the control chart shown in Figure 14.
We note that the monitoring interface works as expected considering its use in a real case study. The K-NN average accuracy represents a strong DT validity and, despite the complexity of the real DT (14 evaluation variables), the monitoring tool can compare the physical and virtual environments and identify their similarities. Moreover, it is also possible to assess the control chart performance to guarantee its results. In this case, Aebtarm and Bouguila (2011) and Leoni and Costa (2018) suggest some chart assessment indicators: (i) the plotting illustration and (ii) the Average Run Length (ARL). According to the authors, these refer to the visual quality of the chart and the rate of false alarms, respectively. The ARL is obtained by Equation (1).
ARL = 1 / pr    (1)

where the parameter pr represents the probability that a point falls out of control (considering the Binomial distribution).
Regarding the plotting illustration, Aebtarm and Bouguila (2011) state that the control chart should clearly illustrate the behaviour of a system, making it possible to identify whether something is right or wrong. As shown in Figure 14, the monitoring tool fulfils this role. On the other hand, considering the ARL, Leoni and Costa (2018) highlight that a larger ARL means a lower false alarm rate. In this case, we obtained an ARL of about 714, considering a pr of 0.0014 (obtained from the Binomial distribution and considering the chart parameters: CL = 0.61, UCL = 0.76, and LCL = 0.46), which means that, if the process is in control, an out-of-control point is expected to appear only once every 714 observations. According to Aebtarm and Bouguila (2011), an ARL of 370 is usually taken as a reference in the literature when considering three-standard-deviation chart limits. Therefore, the proposed approach appears to be feasible.
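The pr value can be sketched with the Binomial distribution as follows; treating 76 and 46 classifier "successes" out of n = 100 as the counts corresponding to the UCL and LCL is an assumption made here for illustration, and the exact figure may differ slightly from the reported 0.0014.

```python
# Sketch of the false-alarm probability and ARL check for the real-case chart
# (CL = 0.61, UCL = 0.76, LCL = 0.46, sample size n = 100).
from scipy import stats

n, p = 100, 0.61                  # classified points per sample and in-control proportion
pr = stats.binom.sf(76, n, p) + stats.binom.cdf(45, n, p)   # P(count > 76) + P(count < 46)
arl = 1.0 / pr
print(pr, arl)                    # on the order of 0.001-0.002, i.e. an ARL of several hundred
```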
Furthermore, we can compare the proposed approach with the current DT validation practices. As highlighted in this work, the only validation method adopted by researchers is the use of hypothesis tests to verify the similarity between the DT and the physical systems. Therefore, considering the same dataset, which contains data referring to one month of operation of the DT, we carried out a comparative analysis to assess the advantages of the proposed tool. To evaluate the DT operation over time, it would be necessary to periodically perform hypothesis tests for each evaluation variable.
First, for each sample of 100, we tested normality through the Anderson-Darling test and then performed the appropriate hypothesis tests. Considering the monitored period, none of the samples fitted the normal distribution (P-value of the Anderson-Darling test < 0.05). We therefore adopted the Mann-Whitney hypothesis test for each evaluation variable to compare the physical and virtual data. In this case, the test indicates that the DT is valid when the P-value ≥ 0.05. Figure 15 illustrates the monitoring procedure using this approach for 2 of the 14 evaluation variables of the real case.
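For comparison, a sketch of this periodic-test baseline for a single evaluation variable is shown below, using SciPy; the arrays are hypothetical stand-ins for one sample of 50 physical and 50 virtual values. Note that SciPy's Anderson-Darling routine returns a statistic and critical values rather than a P-value, so the 5% critical value is used here.

```python
# Sketch of the baseline validation by periodic hypothesis tests for one variable
# (the real case would repeat this for all 14 variables at every sample).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
phys_wait = rng.gamma(shape=2.0, scale=1.5, size=50)   # hypothetical physical waiting times
virt_wait = rng.gamma(shape=2.0, scale=1.5, size=50)   # hypothetical DT waiting times

# Normality check (Anderson-Darling): compare the statistic with the 5% critical value.
ad = stats.anderson(np.concatenate([phys_wait, virt_wait]), dist="norm")
normal = ad.statistic < ad.critical_values[2]          # index 2 = 5% significance level

# If normality is rejected, compare the samples with the Mann-Whitney test.
stat, p_value = stats.mannwhitneyu(phys_wait, virt_wait, alternative="two-sided")
dt_valid = p_value >= 0.05                             # DT considered valid for this variable
```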
Although it is possible to assess the DT validity through periodic hypothesis tests, this approach might be unfeasible since it requires several tests when considering several evaluation variables. For the real case addressed in this work, about 840 hypothesis tests would be necessary to assess all 14 evaluation variables over a month of operation. Moreover, for each sample, the decision-maker must analyse 14 P-values from the 14 hypothesis tests to guarantee that the DT is operating as expected. If we consider a DT that updates at shorter time intervals, or an even more complex DT with more evaluation variables that may require different tests, this analysis may become unfeasible. On the other hand, as highlighted before, the proposed monitoring tool allows a faster and more practical assessment of the DT. In this case, with just one parameter (the K-NN accuracy), which is plotted on the control chart, the decision-maker can quickly assess the validity of the DT model, regardless of the system complexity.

Conclusions
Although the use of simulation models as DTs stands out as a flexible and efficient alternative for decision-making, we highlight that traditional validation methods do not consider some important characteristics of this approach, mainly related to the need for periodic evaluation of the models. Considering that we did not find any works focused on this purpose in the literature, this work proposed an approach based on Machine Learning and control charts to allow the assessment of DT simulation models during their operational phase. More precisely, the K-NN classifier was used to compare the DT model with the physical systems and, using the p-control chart, it was possible to monitor the validity of the DT over time.
The proposed approach is based on three main steps, starting with the selection of the evaluation variables to be monitored. Then, there is the building/configuration of the monitoring interface, which compares physical system data with DT data through the K-NN, allowing the DT validity assessment. The K-NN accuracy is plotted on the p-chart over time to allow the DT assessment during its use to support decisions. Finally, the last step corresponds to constant and periodic monitoring.
The proposed approach was initially tested on five theoretical cases to assess its applicability. In this case, data were emulated representing the DT and the physical systems for five standard probability distributions. It was possible to carry out experiments where the system was stressed to assess the tool's ability to identify possible problems in the DT functioning. In addition, the approach was implemented in a real case study, referring to a DT adopted by a medium-sized Fast Fashion company. As a result, the tool proved able to monitor the DT functioning and identify possible special causes that could compromise its results.
We conclude from this work the great versatility and flexibility of the proposed approach, which is compatible with different DT approaches, contemplating near/real-time models with different characteristics regarding connection, integration, and complexity. Moreover, we highlight that the proposed work fills a theoretical and practical gap. First, through a single tool, we can monitor complex models in a faster, simpler, and more robust way. Furthermore, with more reliable models, we expect to contribute to the growing adoption of simulation-based DTs to support decisions.
As future work, we suggest using other Machine Learning techniques and control charts in place of the K-NN and the p-control chart, respectively. In this case, a comparative analysis with the proposed approach is suggested to verify advantages, limitations, and possible improvements for the monitoring tool. Furthermore, since this work focused on simulation-based DTs, we also suggest works that adopt the proposed tool for monitoring DTs of equipment, which might be helpful for studies involving maintenance and setup planning, for example.

Figure 3. Illustration of a typical p-chart.

Figure 4. Activities that compose the monitoring interface.

Figure 7. Comparison of the physical and DT data for case I (considering one variable).

Figure 8. Comparison of the physical and DT data for case IV, considering three variables: x1 (a), x2 (b), and x3 (c).

Figure 10. Control charts for theoretical cases: Case I (a), Case II (b), Case III (c), Case IV (d), and Case V (e).

Figure 11. Chart replication to identify the special cause (case I).

Figure 12. Production line and its 3D model (real case).

Figure 15. Real case assessment through periodic hypothesis tests.

Table 3. Distribution parameters for theoretical cases.