Deep Recurrent Entropy Adaptive Model for System Reliability Monitoring

The aim of this article is to develop a methodology for measuring the degree of unpredictability in dynamical systems with memory, i.e., systems with responses dependent on a history of past states. The proposed model is generic, and can be employed in a variety of settings, although its applicability here is examined in the particular context of an industrial environment: gas turbine engines. The given approach consists in approximating the probability distribution of the outputs of a system with a deep recurrent neural network; such networks are capable of exploiting the memory in the system for enhanced forecasting capability. Once the probability distribution is retrieved, the entropy or missing information about the underlying process is computed, which is interpreted as the uncertainty with respect to the system's behavior. Hence, the model identifies how far the system dynamics are from its typical response, in order to evaluate the system reliability and to predict system faults and/or normal accidents. The validity of the model is verified with sensor data recorded from commissioning gas turbines, belonging to normal and faulty conditions.


I. INTRODUCTION
AS TECHNOLOGY advances, the complexity of the physical environment around us grows at an increasing rate. This is especially the case in complex industrial environments, where the expansion of the so-called Industry 4.0 is raising the degree of interactivity between an ever larger number of subsystems. Complex systems typically display dynamics that can only be modeled through computationally expensive approaches [1], and do not allow for reduction to simple systems of equations. This is a result of their dynamics being produced by the interaction of their many subsystems, which is known in complexity theory as emergent phenomena. Thus, in a complex system, subsets of its subsystems cannot explain the global emergent dynamics, making prediction tasks particularly challenging.
Perrow studied the issue of low system predictability within the setting of industrial plants. In his influential work [2], he referred to these systems with the descriptive term of highly coupled systems. In the same publication, numerous examples of low-probability and low-predictability accidents or system faults are given. These accidents are termed normal accidents because, although uncommon, they are bound to occur in the long run. Because the unpredictability of such systems is a consequence of subsystem interaction, adding automatic safety mechanisms may further increase their complexity [2], potentially generating new types of normal accidents. Hence, the approach proposed in this article consists in identifying abnormal behavior in a system, instead of characterizing particular faulty conditions; there are far too many potential faults, some of which can be regarded as normal accidents. On many occasions these originate from concomitant causes such as control system failure, human error, or sensor misreadings (see Section III-A).
The method studied here utilizes a measure to estimate unpredictability or missing information with respect to a system. When the unpredictability of the system's state variables increases, the behavior of the system is regarded as anomalous. For such a task, an uncertainty measure known as information entropy has long been established [3]. Indeed, when Shannon proposed this measure, he also proved that it is the only possible formulation that satisfies the three basic reasonable requirements with which any measure of information should comply [4]. The entropy measure is a function that takes as input a probability function, and yields as output a value indicating how much is unknown about the underlying generating process; it is a measure of the amount of missing information with respect to the system at hand [4].
Finally, entropy analysis has received substantial attention in the domain of biomedical signal analysis [5], [6], and a fair level of attention for condition monitoring of industrial systems [7], [8]. A number of modified entropy measures are surveyed in [9]. As stated, any alternative entropy formulation must satisfy the three basic requirements described by Shannon [4]. Hence, generally, these newly proposed measures are reformulations of information entropy over different probability functions.
Since entropy measurement depends on a probability function, any entropy-based condition monitoring method needs an approach to estimate it. This can be done in various ways. One common scheme is direct approximation from the response signals of the system [10]. Nonetheless, when sequential observations in a time series are not independent, this may result in a biased estimate of the probability function. Another method is to use a model to predict the state of the system. For example, the errors between the model output and the actual system values can be treated as samples of a random variable. This strategy has been used successfully in [11] as a measure of a driver's workload level, and later applied to calibrate steering control models [12]. Here, an alternative method is utilized, in which a model, by way of an artificial neural network (ANN), directly yields a probability function as output.
Generally, system dynamics in industrial environments can be modeled by linear systems of differential equations. But linear approximations are only accurate in individual and restricted operational regimes. In most cases, system dynamics are better characterized by nonlinear equations or by fractional operators with memory [13], [14]. All these methods are even more relevant when the response of the system examined diverges from its typical behavior; when higher and unpredictable subsystem interactivity may bifurcate its response toward chaotic dynamics or dynamics exhibiting larger hysteresis [15].
ANNs are appropriate for modeling linear as well as nonlinear dynamics [16], but their classical formulation does not accommodate memory properties. Nonetheless, for sequential data analysis, recurrent neural networks (RNNs) can be employed. RNNs are a specialized type of ANN that maintain context (or memory) when trained with sequential data [17]. Examples of their applicability include weather forecasting [18], stock market prediction [19], and machine translation [20]. One pertinent aspect of any ANN is that, when trained as a classifier, it approximates not only the forecasted output but also its probability distribution. Hence, in this article, an RNN is employed to approximate a probability distribution, with which the entropy measure can be computed. This is done with the aim of characterizing system predictability, as a way of identifying anomalous system behavior.
Further, to provide a case study for validation purposes, industrial gas turbines (IGTs) are considered. IGTs are a true example of complex systems organized in many subsystems: compressor, combustors, pumps, fuel supply, ignition system, lubrication, and drain modules, etc. [21]. As complex machines, IGTs can display a wide-ranging repertoire of anomalous behavior (see Section III-A). The proposed methodology is validated here for the particular case of condition monitoring of IGTs. Nevertheless, it has general applicability to other systems from varying domains, such as biomedical data analysis and any high-risk technology generating sequential data or sensor data from arrays of sensor networks [22], [23]. A review of other information fusion methods for IGT diagnostics can be found in [24].
The remainder of this article is organized into the following sections. In Section II, the required background elements of information theory and RNNs are introduced. Next, in Section III, the characteristics of the data used are specified. Section IV describes the proposed model and the parameter fitting procedure, while in Section V the model is tested and validated with real-world sensor data recorded from IGTs. Finally, Section VI concludes this article.
[Fig. 1 (caption fragment): $P_2(X_2)$ in (b) has a higher entropy ($H_2(X_2) = 2$) than $P_1(X_1)$ in (a) ($H_2(X_1) = 1$). In both cases $E(X_{1,2}) = 0.5$.]

A. Entropy or Missing Information
Although the concept of entropy originated in the study of thermodynamics, today it is often used in multiple fields that are not related to energy transfer, e.g., language analysis [25] and redundancy estimation in the genome [26]. Most of these usages emerge from the information theory interpretation of entropy, which is the topic examined here.
Information entropy (or simply entropy) is a measure of the amount of missing information with respect to a process, understood as a random variable. Given an information source modeled by a random variable $X$, for which a probability distribution $P(X)$ is known, and assuming that $X$ has $n \in \mathbb{N}$ possible outcomes, the entropy $H_\beta(X)$ is defined as
$$H_\beta(X) = -\sum_{i=1}^{n} P(x_i) \log_\beta P(x_i) \tag{1}$$
where $x_i$ denotes the $i$th outcome of $X$, with $i \in \{1, \ldots, n\}$. When $\beta = 2$, the units of $H_2$ are bits. $H_\beta(X)$ does not depend on the underlying process $X$ being divided into parts, as that does not change the amount of uncertainty about the process [4]. This property reflects a very important idea: the missing information about a system is equal to the missing information of each of its subsystems plus the missing information corresponding to subsystem interaction, i.e., the emerging complexity in a tightly coupled system [2].
A descriptive way to understand the definition in (1) is to compare it with classical statistical measures. Consider two random variables, both with uniform probabilities: $X_1$, with two equally likely outcomes, and $X_2$, with four (see Fig. 1). Since $X_2$ has a higher number of equally likely outcomes than $X_1$, $H_2(X_2) > H_2(X_1)$ ($X_2$ is less predictable and less reliable). In contrast, the standard deviations satisfy $\sigma(X_1) > \sigma(X_2)$. At the same time, both random variables exhibit the same expected value $E(X_1) = E(X_2) = 1/2$. $H_\beta$ is a measure of the potentiality of a system (see Fig. 2), and not a static simplification like the mean and variance.
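The comparison between entropy and classical summary statistics can be checked numerically. The following is a minimal sketch, not the authors' code: the outcome values of the two variables are assumptions chosen only so that both have mean 1/2 (the text fixes the entropies and the expected value, not the supports), and the function name `entropy` is illustrative.

```python
import numpy as np

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution, with 0*log(0) taken as 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

# Hypothetical supports (assumption): two and four equally likely outcomes,
# both centered on 1/2.
x1 = np.array([0.0, 1.0])
x2 = np.array([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
p1 = np.full(x1.size, 1.0 / x1.size)
p2 = np.full(x2.size, 1.0 / x2.size)

h1, h2 = entropy(p1), entropy(p2)             # entropies in bits
mean1, mean2 = float(p1 @ x1), float(p2 @ x2)
std1 = float(np.sqrt(p1 @ (x1 - mean1) ** 2))
std2 = float(np.sqrt(p2 @ (x2 - mean2) ** 2))

print(f"H2(X1)={h1:.3f} bits, H2(X2)={h2:.3f} bits")
print(f"E(X1)={mean1:.3f}, E(X2)={mean2:.3f}")
print(f"sigma(X1)={std1:.3f}, sigma(X2)={std2:.3f}")
```

Under these assumed supports, the entropy ordering and the standard deviation ordering point in opposite directions while the means coincide, which is the contrast drawn in the text.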
With respect to condition monitoring techniques, although machine learning algorithms based on classical statistics have been successfully applied for particular operational regimes [27], they are challenged when hardware-controlled or environmental variations dominate the sensor data. By contrast, entropy-based techniques are able to characterize how little is known about the system, irrespective of the summary statistics [10]. Increased entropy can indicate system anomalies in different operational regimes, such as varying system workload.

B. Recurrent Neural Networks
ANNs and information entropy originated from participants of the influential Macy Conferences on cybernetics (1941-60) [28], which established principal foundations toward interdisciplinary science. Today, both computational approaches are deeply intertwined, as ANNs employed as classifiers are often trained by means of an entropy loss function [see (2)].
ANNs allow for the modeling of complex systems without the need for extensive domain knowledge, and have been used for condition monitoring purposes through supervised [7], [29] and unsupervised approaches [8], [30]. When the input features are sequential in nature, one common approach is to use recurrent models with a delayed feedback loop in each layer, i.e., RNNs. Each layer in an RNN includes a memory block that maintains information across an arbitrary number of iterative steps.
The first ANNs with recurrent properties can be traced back to the 1980s [31], but it was not until much later that models robust to the vanishing gradients problem were developed: notably, the long short-term memory (LSTM) model [17] and models based on gated recurrent units (GRUs) [32]. In both architectures, the degree of memory retention is controlled by a gating mechanism. Although the relative efficacy of these methods is still under debate [33], LSTM models seem more suitable when large amounts of data are available. Thus, in this article, the LSTM architecture is adopted. Novel applications of LSTM cells to model complex systems include medical diagnosis [34] and monitoring superconducting magnets [35], [36]. Several improvements of the original LSTM model can be found in the literature. Here, the LSTM implementation incorporated in the TensorFlow library [37], [38] is considered.
In the literature, different optimization algorithms have been proposed to train ANNs. These range from the Levenberg-Marquardt methods [39] to the Widrow-Hoff rule [40]. RNNs are typically trained by using a variant of the standard backpropagation algorithm, which is named backpropagation through time (BPTT). The choice of loss function depends on the particular problem. For regression problems, the mean squared error (MSE) can be employed, while for classification tasks the cross-entropy measure is more suitable. The latter is defined as follows:
$$H_\beta(p, q) = -\sum_{i=1}^{n} p_i \log_\beta q_i \tag{2}$$
where $p_i$ are the true probabilities inferred from the labels in the training data and $q_i$ are approximated by the model through a softmax layer stacked after the recurrent layer. Additionally, when sufficient data are available, it is possible to augment the effectiveness by stacking one recurrent layer after another, leading to a deep learning model (see Fig. 4).
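The cross-entropy loss for a softmax output can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the logits are made-up values standing in for the output of a recurrent layer, natural logarithms are used (so the loss is in nats), and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities; the max is subtracted for stability."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between true probabilities p and model probabilities q."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)  # avoid log(0)
    return float(-(p * np.log(q)).sum())

# Hypothetical logits for a three-class problem (assumption).
logits = np.array([2.0, 0.5, -1.0])
q = softmax(logits)

# One-hot "true" probabilities inferred from a label pointing at class 0.
p = np.array([1.0, 0.0, 0.0])
loss = cross_entropy(p, q)

print("q =", np.round(q, 4), " loss =", round(loss, 4))
```

With one-hot labels the sum collapses to the negative log-probability that the model assigns to the labeled class, so the loss shrinks as the softmax output concentrates on that class.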

A. Elements of Industrial Gas Turbines
In this article, sensor data recorded from IGT engines are employed [see Fig. 3(a)]. Essentially, an IGT engine transforms fuel energy into shaft power. When the engine is in operation, shaft rotation causes the rotor blades to turn in the air compressor. The rotors' action induces the intake of atmospheric air from the compressor inlet [see Fig. 3(b)], and its subsequent pressurization as it flows through the various rotor and stator blades, which are sequentially spaced within the compressor. The resulting high-pressure air is then released into the combustors, where the temperature of the compressed air is greatly increased by burning liquid or gas fuel. Subsequently, the high pressure and temperature cause the air to expand toward the power generator part of the engine, which consists of another series of rotors connected to the same shaft. A part of the generated shaft power is employed to compress new atmospheric air via a self-sustaining mechanism, which is initiated by an auxiliary electric motor. The new air prevents the flow from reverting backward and stalling the engine.
There are two types of IGT engines. In the first kind, the residual air from the power generator is directly pushed out through the exhaust as new air comes in. These engines have a single shaft and are typically employed for electrical power generation. Unlike single-shaft engines, twin-shaft engines have a supplementary shaft, immediately after but decoupled from the first shaft. This mechanism yields more control over the applied shaft torque. Thus, twin-shaft engines, such as the ones investigated here, are generally used for mechanical power generation, for example, on offshore drilling rigs.

B. Data Characteristics
The data were recorded from ten twin-shaft IGT engines, situated in different geographical locations with varying environmental conditions. The dataset was prepared by Siemens to analyze the performance and to monitor the condition of their machines. Typically, a customer can choose to install a number of sensors on an IGT. The raw data from these sensors are recorded in real time and stored by the OEM for future use, without any preprocessing or filtering. The majority of the data correspond to IGTs in healthy operating conditions. The dataset also includes message logs annunciated by the control system (such as warnings and automatic shutdowns) and site visit reports written by the operators. These allow the identification of data segments corresponding to a faulty machine. The data from healthy machines were used to train the model (see Section IV), while the samples from faulty machines were employed for the case studies (see Section V). Further information on the dataset, such as the geographical location of the engines, the exact position of the sensors, and their specifications, is confidential. Measurements from 35 sensors were employed in this study; these coincide with health indicators known to be correlated with different IGT component conditions [27] (see Table I). The sampling period was 1 min, and the data comprised several months of recordings from each engine. Besides the data corresponding to normal engine functionality, data from different faulty conditions were collected (see Section V). Prior to their use for RNN training, the data were curated: data segments with missing sensors were discarded, and the data were standardized (see Section IV-B). The curated data were split into training, development, and testing sets (see Table II).
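The curation steps described above (discarding incomplete data, standardizing, and splitting) can be sketched as follows. This is only an illustration on synthetic data: the 80/10/10 chronological split, the use of training-set statistics for standardization, and the row-wise removal of missing values are all assumptions, since the actual proportions and procedure are not specified here.

```python
import numpy as np

# Synthetic stand-in for raw sensor data: 1000 one-minute samples, 35 sensors.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=(1000, 35))
raw[10, 3] = np.nan  # simulate a missing sensor reading

# Discard samples with missing sensors (simplification: row-wise removal,
# whereas the text refers to discarding whole data segments).
clean = raw[~np.isnan(raw).any(axis=1)]

# Hypothetical chronological split (assumption): 80% / 10% / 10%.
n_train = int(0.8 * len(clean))
n_dev = int(0.1 * len(clean))
train, dev, test = np.split(clean, [n_train, n_train + n_dev])

# Standardize each sensor with statistics from the training set only
# (assumption), so no information leaks from the development or test sets.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

def standardize(x):
    """Return zero-mean, unit-variance features relative to the training set."""
    return (x - mu) / sigma

train_z, dev_z, test_z = standardize(train), standardize(dev), standardize(test)
print(train_z.shape, dev_z.shape, test_z.shape)
```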

IV. ANN MODEL ARCHITECTURE
In this section, two RNN models are described: a regressor and a classifier. The regressor is used to demonstrate the feasibility of the methodology and to select the appropriate RNN architecture. The classifier, referred to as the Deep Recurrent Entropy Adaptive Model (DREAM), is an extension of the regressor, and is able to approximate the probability distribution of the outputs of a system and compute its entropy. The DREAM is used as an algorithm for system unreliability evaluation (see Section V).

A. Deep Regressor RNN Model
The regressor RNN is designed to predict the shaft power produced by an IGT (S-35 in Table I) at time step $\langle t \rangle$ from the inputs of the sensors S-01, ..., S-34 within a window of length $L_W$: $\{\langle t \rangle - L_W + 1, \ldots, \langle t \rangle\}$. Thus, the model uses as input the last $L_W$ observations of the states of the machine. The last layer of the model is fully connected (i.e., a dense layer) and generates a 1-D output: the predicted standardized shaft power. The regressor was optimized by minimizing the MSE over the training data through BPTT, with the adaptive moment estimation (Adam) algorithm. Hyperparameter tuning was conducted by random search [41]. This approach is known to be more effective than grid search, since for the same number of evaluations more values of each particular hyperparameter are tested. In total, the RNN was trained 50 times with different hyperparameter combinations. The combination that resulted in the lowest error on the development set (see Table II) is the one reported in this article. The optimized hyperparameters include the window length, Adam's learning rate, the number of cells per LSTM layer, and the number of LSTM layers. The retention probability of a dropout regularization method [42], which acts by randomly switching off a percentage of the inputs at each layer during training, was also tuned. The tuned hyperparameters are shown in Table II. Memory windows longer or shorter than $L_W = 5$ resulted in reduced RNN performance, but this value is highly dependent on the sampling rate.
For the examined data, a model with two LSTM layers yields the best performance. Dropout regularization is not significantly effective in this case, possibly because the degree of overfitting is low for the selected number of epochs, which was chosen through the early-stopping method; more epochs decreased performance on the development set. Nevertheless, as it makes the model more robust to sensor misreadings, dropout is retained in the model. For the final hyperparameter selection (see Table II), the respective MSE values over the training, development, and testing sets are 0.0069, 0.0084, and 0.0103.
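The windowed input described above can be assembled with plain NumPy before being passed to a recurrent layer. The sketch below is illustrative only: `make_windows` is a hypothetical helper, the toy arrays stand in for standardized sensor data, and the resulting (samples, time steps, features) layout is the one commonly expected by recurrent layers, which is assumed rather than taken from the article.

```python
import numpy as np

def make_windows(inputs, target, window_len):
    """Pair each target at step t with the inputs at steps t-window_len+1..t.

    inputs: array of shape (n_steps, n_features), e.g., sensors S-01..S-34.
    target: array of shape (n_steps,), e.g., standardized shaft power (S-35).
    Returns X of shape (n_windows, window_len, n_features) and y of shape
    (n_windows,), where n_windows = n_steps - window_len + 1.
    """
    inputs = np.asarray(inputs)
    target = np.asarray(target)
    n_steps = inputs.shape[0]
    if window_len > n_steps:
        raise ValueError("window_len exceeds the number of time steps")
    n_windows = n_steps - window_len + 1
    # Row i of idx holds the time indices i, i+1, ..., i+window_len-1.
    idx = np.arange(n_windows)[:, None] + np.arange(window_len)[None, :]
    return inputs[idx], target[window_len - 1:]

# Toy data (assumption): 20 time steps, 34 input sensors.
inputs = np.arange(20 * 34, dtype=float).reshape(20, 34)
target = np.arange(20, dtype=float)
X, y = make_windows(inputs, target, window_len=5)
print(X.shape, y.shape)
```

Each window ends at the time step whose output is predicted, so no future observations leak into the input.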

B. Deep Recurrent Entropy Adaptive Model
Based on the regressor model (see Section IV-A), a classifier is constructed. In the DREAM, the fully connected layer with 1-D output is replaced by a 10-D softmax layer, i.e., a fully connected 10-D output layer (with outputs $y^{\langle t \rangle,1}, y^{\langle t \rangle,2}, \ldots, y^{\langle t \rangle,10}$) followed by the softmax function
$$\sigma(y^{\langle t \rangle,j}) = \frac{e^{y^{\langle t \rangle,j}}}{\sum_{k=1}^{10} e^{y^{\langle t \rangle,k}}}, \quad j \in \{1, \ldots, 10\}. \tag{3}$$
To train the DREAM, the standardized output (from Sensor S-35) was converted to a categorical variable. This was done by binning the numerical values into ten classes (see Table III), where classes 2 to 9 cover two standard deviations away from the mean in each direction (≈ 95% of the data). This bucketing strategy represents a range of arbitrary width within a finite number of classes, which was 10 in this case; this number keeps the number of RNN parameters to a minimum while yielding acceptable resolution. Sensor signals from IGTs under constant load display fluctuations, due to instantaneous variability of the dynamics and measurement noise. The number of buckets was chosen so that the corresponding signals stay within the same bucket while the machine is kept at constant load. This effect can be observed by comparing Figs. 5 and 6, where the signal of an IGT is displayed for a period of 21 days during which the load was altered multiple times. The central bins (2 to 9) embed the typical IGT dynamics, while the boundary bins (1 and 10) characterize outlier behavior. The loss function is the cross-entropy (2), where the true conditional probabilities are given by the corresponding class; for a particular training set input sample $x^{\langle t \rangle} \in \mathbb{R}^{34}$ with class label $\tilde{k}$, the assigned conditional probability function is
$$p_j^{\langle t \rangle} = \begin{cases} 1, & j = \tilde{k} \\ 0, & \text{otherwise.} \end{cases}$$
The approximated probability function is estimated by the RNN from (3): $q_j^{\langle t \rangle} = \sigma(y^{\langle t \rangle,j})$, satisfying $\sum_{j=1}^{10} q_j^{\langle t \rangle} = 1$. The fundamental element of the DREAM relies on the following property of the entropy measure [4]:
$$-\sum_{j=1}^{10} p_j^{\langle t \rangle} \log_\beta p_j^{\langle t \rangle} \le -\sum_{j=1}^{10} p_j^{\langle t \rangle} \log_\beta q_j^{\langle t \rangle}.$$
Specifically, the cross-entropy is always greater than or equal to the entropy, and only equal when $p_j^{\langle t \rangle} = q_j^{\langle t \rangle}$ for all $j$.
Accordingly, as the optimization process minimizes the cross-entropy, it approximates the true conditional probability function of the output variable for a particular input, and consequently its entropy. In the limit, the entropy of the approximated probability function $q^{\langle t \rangle}$ equals that of the true conditional probability function, where the limit denotes the ideal case of unrestricted training data, representative of the underlying dynamics, over an idealized recurrent model with the capacity to fully learn the conditional probabilities in the data.
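The property that the cross-entropy is never smaller than the entropy, with equality only when both distributions coincide, can be verified numerically. The following sketch draws random ten-class distributions from a Dirichlet distribution purely for illustration (an assumption, unrelated to the IGT data) and uses natural logarithms.

```python
import numpy as np

def entropy(p):
    """Entropy in nats, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def cross_entropy(p, q):
    """Cross-entropy in nats of q relative to p (terms with p = 0 vanish)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(q[nz])).sum())

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))  # an arbitrary "true" distribution

# Gap between cross-entropy and entropy for many random candidate models q.
gaps = [cross_entropy(p, rng.dirichlet(np.ones(10))) - entropy(p)
        for _ in range(1000)]
min_gap = min(gaps)

# The gap closes when the model reproduces the true distribution.
self_gap = cross_entropy(p, p) - entropy(p)

print(f"smallest gap over random q: {min_gap:.4f}; gap at q = p: {self_gap:.1e}")
```

Minimizing the cross-entropy with respect to the model output therefore pushes it toward the distribution that generated the labels, which is the mechanism the DREAM relies on.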
The DREAM was trained with the same hyperparameter set as in the regressor model (see Table II), as it was found that further hyperparameter tuning does not provide significant difference in performance. The respective values of the cross-entropy loss over the training, development, and testing sets are 0.036, 0.059, and 0.082.
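The conversion of the standardized output into ten classes can be sketched as below. Table III is not reproduced here, so the bin edges are an assumption: classes 2 to 9 are taken as eight equal-width bins spanning −2 to +2 standard deviations, with classes 1 and 10 collecting everything beyond. The helper names are hypothetical.

```python
import numpy as np

# Assumed edges: 9 edges between -2 and +2 give 8 central bins of width 0.5.
EDGES = np.linspace(-2.0, 2.0, 9)

def to_class(z):
    """Map standardized values to class labels 1..10.

    np.digitize returns 0 below the first edge and len(EDGES) at or above
    the last edge, so adding 1 yields boundary classes 1 and 10.
    """
    return np.digitize(z, EDGES) + 1

def one_hot(labels, n_classes=10):
    """Turn 1-based class labels into one-hot probability vectors."""
    return np.eye(n_classes)[np.asarray(labels) - 1]

z = np.array([-3.0, -2.0, -0.1, 0.0, 1.99, 2.0, 5.0])
classes = to_class(z)
targets = one_hot(classes)
print(classes)
```

The one-hot rows play the role of the assigned conditional probabilities used as targets for the cross-entropy loss.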
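The computational graph of the model is shown in Fig. 4. The trained model, when used for feed-forward prediction, contains an additional step to compute the entropy. For simpler interpretation, the entropy is normalized so that $H^{\langle t \rangle} \in [0, 1]$:
$$H^{\langle t \rangle} = -\sum_{j=1}^{10} q_j^{\langle t \rangle} \log_{10} q_j^{\langle t \rangle}. \tag{8}$$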
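The additional entropy step applied to the softmax output can be sketched as follows. This assumes that normalization is achieved by dividing by the logarithm of the number of classes (equivalently, using base-10 logarithms for ten classes), which maps a uniform output to 1 and a fully confident output to 0; the function name is illustrative.

```python
import numpy as np

def normalized_entropy(q):
    """Entropy of softmax outputs along the last axis, scaled to [0, 1]."""
    q = np.asarray(q, dtype=float)
    n_classes = q.shape[-1]
    # Replace zeros by 1 inside the log so those terms contribute exactly 0.
    safe = np.where(q > 0, q, 1.0)
    return -(q * np.log(safe)).sum(axis=-1) / np.log(n_classes)

uniform = np.full(10, 0.1)   # the model has no preference among the bins
confident = np.eye(10)[3]    # all probability mass on a single bin
batch = np.stack([uniform, confident])

h = normalized_entropy(batch)
print(h)
```

Values near 0 indicate that the model is certain about the output bin, whereas values near 1 indicate that all bins appear equally plausible to it.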

V. CASE STUDIES AND RESULTS
Typically, fault diagnosis of IGTs has been based on identifying specific types of faults [27]. In a real industrial setting, there are innumerable things that could potentially go wrong; thus, the human cannot be completely removed from the diagnosis loop. Indeed, some types of faults are associated with human error and are virtually impossible to predict algorithmically. For example, occasionally the operators shut off valves accidentally, and this is identified by a safety system as a particular type of mechanical problem [2]. As the response of the operators to the system warnings does not correspond to the actual issue, which is being ignored, a normal accident may be generated through a successive chain of inadequate decisions by the operators.
An actual example of a normal accident in an IGT is that of debris left inside the fuel ring supplying the combustors, which caused perplexity among the operators. Fuel flow induced the debris to move inside the ring, affecting different combustors at different times. A number of erroneous mechanical causes were considered, all based on previous characterizations of particular faults from sensor readings. An engineer was able to determine the problem only upon dismantling the engine and inspecting the fuel ring.
Therefore, the interest here is in identifying abnormal system behavior as such, so that potentially faulty systems are flagged for further analysis by human experts. In this section, the DREAM (see Section IV-B) is validated through sensor data from a healthy IGT engine and four faulty IGT cases, exhibiting rotor damage, rotor vibration, compressor damage, and worn shaft bearing vibration, respectively.

A. Unimpaired IGT Engine
This case study illustrates the behavior of a healthy IGT engine, and corresponds to independent testing data (see Table II) spanning a period of approximately 21 days. This data segment was selected because it displays an engine subject to extensive load variations. Nonetheless, the regressor model (see Section IV-A) is able to accurately predict the power output (see Fig. 5).
The same data are analyzed with the DREAM (see Section IV-B): Fig. 6(a) shows the prediction for the shaft power as a categorical variable (see Table III), while Fig. 6(b) shows the entropy resulting from the softmax layer output [see (8)]. The visual thresholds (dashed lines) in Fig. 6(b) were empirically set from the currently tested healthy and faulty cases; more than 98% of the samples satisfy $H^{\langle t \rangle} < 0.34$ for this case study. The exceptions are sporadic spikes due to operator-induced variations. Moreover, all the faulty conditions studied next consistently display an average $H^{\langle t \rangle}$ over 0.5. In practice, these threshold values depend on the examined system and may also reflect system degradation through time. Hence, they should be tuned in an adaptive manner according to the type of system and unit being monitored.
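One possible way of tuning such thresholds per unit is sketched below. It is not the procedure used in this article, where the thresholds were set empirically: the quantile rule, the rolling-mean window of 60 samples, and the synthetic beta-distributed entropy values are all assumptions introduced for illustration.

```python
import numpy as np

def healthy_threshold(h_healthy, coverage=0.98):
    """Entropy level below which the given fraction of healthy samples fall."""
    return float(np.quantile(h_healthy, coverage))

def flag_anomalous(h, threshold, window=60):
    """Flag time steps whose rolling-mean entropy exceeds the threshold.

    Averaging over a window (here 60 one-minute samples, an assumption)
    suppresses sporadic spikes such as operator-induced variations.
    """
    kernel = np.ones(window) / window
    rolling = np.convolve(h, kernel, mode="valid")
    return rolling > threshold

# Synthetic normalized entropy values (assumption): mostly low for a healthy
# unit and centered around 0.6 for a faulty one.
rng = np.random.default_rng(0)
h_healthy = rng.beta(2, 10, size=5000)
h_faulty = rng.beta(6, 4, size=500)

thr = healthy_threshold(h_healthy)
flags_healthy = flag_anomalous(h_healthy, thr)
flags_faulty = flag_anomalous(h_faulty, thr)

print(f"threshold={thr:.3f}, healthy flagged={flags_healthy.mean():.2%}, "
      f"faulty flagged={flags_faulty.mean():.2%}")
```

Recomputing the threshold from recent healthy data of each unit would be one way to follow gradual degradation, in line with the adaptive tuning suggested in the text.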

B. Rotor Blade Damage
The blades of the first rotor after the combustors are exposed to the highest temperatures, and are thus more likely to suffer damage. The case of an IGT with this particular condition is analyzed in Fig. 7. The DREAM exhibits lower predictive capacity and higher entropy. In the same figure, the time signal segments with higher variability and entropy correspond to human-induced unpredictability; the operators were trying to restart the machine in order to mitigate vibrations.

C. Rotor Bearing Vibration in Compressor
In another case study, intense rotor vibrations at the compressor turbine were affecting an IGT engine. This was caused by blockage in the lube oil supply to the bearings. Fig. 8 shows the corresponding entropy analysis for this case; while the entropy prediction is larger than for the previous case, the variability is lower [see Fig. 8(b)], evidencing the reduced predictive capacity of the RNN [see Fig. 8(a)].

D. Compressor Damage
Because compressor discharge air temperature and pressure determine the performance of the combustion in an IGT, compressor efficiency and IGT performance are directly related. The presented case study corresponds to an IGT engine suffering compressor damage, possibly as a result of ice formation inside the compressor through cold air intake. The data analysis for this case is shown in Fig. 9. The DREAM associates large entropy values with this engine after being started; hence, the damaged engine is a system with high unpredictability. The model is unsure as to which bin the forecasts should be associated with, because the sensor inputs do not correspond to the typical behavior in the training set. Interestingly, as a result of the reduced real efficiency due to the damage in the compressor, the DREAM overestimates the power output. To date, there has been no established methodology for detecting compressor icing cases.

E. Worn Bearing Vibration in Power Turbine
Furthermore, the case study of an IGT engine experiencing severe shaft bearing vibration due to lack of lubrication is analyzed. In this case, the actual power output of the engine was not available. Nonetheless, unlike other model-based methods, where the actual output is strictly required in order to calculate the residuals that act as a health indicator [43], the DREAM can predict not only the power output, but also the entropy level in the absence of the real output measurements. The DREAM prediction [see Fig. 10(a)] shows when the machine was being shut down and restarted by the operators to reduce vibration levels. As a result of the stabilization effect of the first restart sequence, the entropy was reduced [see Fig. 10(b)]. However, the DREAM indicates relatively low system predictability (i.e., slightly higher entropy), suggesting that there is some system change, e.g., due to the replacement of the faulty bearing.

VI. CONCLUSION
In nearly all industries, technological development goes hand in hand with an increase in complexity, which is generally associated with a larger number of subsystems, resulting in higher subsystem interactivity. Frequently, considering that the human operators in industrial environments typically interact with each of the subsystems independently, the nature and degree of the complexity is not obvious or intuitive. In addition, as the relevance of the so-called Industry 4.0 unfolds, cognitive computing techniques are gaining importance, and system complexity is likely to increase even more; thus, normal accidents may become more likely.
With this in mind, a measure to determine the predictability of a complex system, which requires little or no knowledge about the underlying dynamics, is introduced here. The measure consists of using an RNN to estimate the conditional probability of an output variable, given the system inputs. From the probability function, the information entropy is calculated, indicating how certain the artificial neural network is about the system output, or how predictable the system is. The algorithm is named Deep Recurrent Entropy Adaptive Model (DREAM).
The use of an artificial neural network with memory (i.e., a recurrent one) is justified both theoretically, since most real-world complex systems present memory and/or hysteresis effects, and experimentally, since hyperparameter tuning showed that the chosen memory window yields better results than a memoryless model. For the case of systems that do not present significant memory effects, the recurrent model can be replaced with a nonrecurrent neural network architecture, e.g., a convolutional neural network.
The proposed DREAM was validated by testing its efficacy in the particular context of industrial gas turbine engines; it is shown that the model is able to discriminate between normal dynamics, corresponding to healthy engines, and anomalous dynamics, corresponding to engines presenting different fault conditions. Moreover, for the case of healthy engines, the given approach can be used as a tool for system efficiency prediction.
The potential applicability of this research covers other types of industries, for example, high risk technologies such as nuclear, financial technologies (FinTech), and biomedical imaging. Future work is intended toward extending the approach to a broader range of complex systems.