Explainable Fault Diagnosis Using Invertible Neural Networks-Part I: A Left Manifold-based Solution

The series comprises two parts, articulating two novel avenues of research on intelligent fault diagnosis (FD) for nonlinear feedback control systems. In Part I of the series, we design a novel FD paradigm by elaborating an invertible neural network (INN) for feedback control systems.


I. INTRODUCTION
Regarded as an application technique, fault diagnosis (FD) is now undergoing a dramatic evolution due to the rapid development of machine learning (ML) [1]-[4]. This evolution is also a further step toward the digitalization and intelligence of modern systems. Over the past two decades, one of the most common realizations of ML-based FD approaches utilizes neural networks (NNs), and the research results have been intensively published in IEEE Transactions journals [5]-[8]. In particular, the work in [1] brings FD techniques to a tipping point in this evolution by discovering a bridge between unsupervised and supervised ML-based designs for nonlinear dynamic systems. Moreover, applying FD in practical systems allows engineers to evaluate its effectiveness and feasibility [9].
Nevertheless, some fundamental questions of FD for feedback control systems have not been addressed well, and have not even received sufficient awareness and attention. For example, the research on
• What benefits are offered by control theory when designing ML-based FD approaches for feedback control systems?
• How does ML learn efficient system dynamics from a reference signal to system inputs and outputs?
• How to elaborate an ML method that specializes in a feedback control system? In which, and to what sort of, problem can ML provide a solution?
• Which kinds of ML-based representation features contribute to FD performance?
• How to avoid the overfitting problem when training data are limited for an unknown system nonlinearity?
remains open, although both the IEEE Computational Intelligence Society and the International Federation of Automatic Control (IFAC) Technical Committee sponsor a number of special issues to call for more attention to these unexplored issues. In our opinion, nonlinear system identification (and the residual generator built on it) is in fact a key driver that can support and boost the development of ML-based explainable FD approaches. This can be argued by retracing the development of data-driven FD approaches for both open-loop [10], [11] and closed-loop systems [12] over the past twenty years. As we can observe, control theory [13]-[16] is leveraged to complete system identification and construct the residual generator, where the least squares method is an efficient tool in the implementation procedure. Of course, a number of alternatives to least squares are also capable of performing system identification as well as establishing the residual generator. In general, representative data, ideally without bias or outliers, are required to build an equivalent dynamic model of the original system [17]. On the basis of this condition, principal component analysis with instrumental variables was proposed in [18] to achieve consistent estimates of a dynamic
system. Later, [19] linked the identification result in [18] to the model-based residual generator, thereby developing an integrated FD scheme. In the meantime, projection-based subspace identification was proposed in [20] for a feedback system; it was also illustrated in [21] that the projection seeks a specific solution of the residual generator under the framework of robust control. Recently, [22] investigated few-shot learning-based system identification and bounded, in probability, the deviation of the identified results from the true system parameters.
In the 1990s and 2000s, [23] showed that the stable kernel representation (SKR) and stable image representation (SIR) of nonlinear feedback control systems can fulfill the construction of the observer and controller, respectively. Both the SKR and SIR can be intuitively understood as two projections of mutual compensation [21]. Specifically, an SKR maps the system data onto a subspace such that system uncertainties come to the surface. In this obtained subspace, the sensitivity to unknown faults vs. the robustness to system uncertainties can be explored to design FD algorithms. On the other hand, an SIR maps the system data onto another subspace, in which the system dynamics can be well described. With the aid of the Youla parameterization, an SIR and its induced term contribute to reformulating the controller for nonlinear systems [24]. In the past ten years, a number of data-driven and ML-based SKRs and SIRs have been developed, such as [1], [25]-[27], for designing both observers and controllers of nonlinear systems.
Although we are in a new era of artificial intelligence in which ML is the dominant technique in FD research, control theory is still necessary and must be present for addressing dynamic systems [1]. The abuse of ML, as well as NNs, would preclude the possibility of developing explainable FD approaches for dynamic systems. In this series, we design two ML frameworks to obtain two manifolds, named the invertible left and right manifolds (ILM and IRM), respectively. A noteworthy fact is that the ILM of Part I is obtained through a projection that is essentially a specific version of SKRs. To sum up, the contributions of Part I are threefold:
1) By specially formulating the master and slave objective functions, a unified ML framework of the residual generator towards solving the optimal FD problem is developed for a nonlinear feedback system.
2) In order to show how to design an explainable FD algorithm, we specifically devise an invertible neural network (INN) structure that generates an ILM.
3) In the obtained ILM, the unknown system uncertainties can resurface; based on this, an explainable FD approach with optimal FD power, together with its link to the residual generator, is developed.
More importantly, all the designs and analysis of this study are guided by control theory, which has
1) overcome the obstacle (i.e., how to achieve faithful learning) existing in both ML-based parameter identification and FD for nonlinear systems;
2) answered the aforementioned open problems.
The rest of Part I proceeds as follows. Section II presents the background and formulates the identification problems under consideration. Section III details the explainable FD concept using an ILM, where interpretability is the main focus. Section IV provides the implementation procedures and some alternative learning strategies of the proposed FD approach. In Section V, two applications to nonlinear systems are presented using the proposed ILM-based FD solutions. We conclude this study in Section VI with some remarks and an introduction to Part II.
Notations: Vectors are written in bold. Nonlinear systems and operators are denoted by bold calligraphic letters such as P. H(θ, •) represents an NN framework with hyperparameter θ. The cross product returns an integrated space. N(µ, Σ) denotes a normal distribution with mean µ and variance Σ. ◦ is the composition of two operators. E(•) is the expectation operator.

II. BACKGROUND AND PROBLEM FORMULATION
Different from the commonly used system descriptions in the time or z domain, a "signal space" is used in both Parts I and II to describe a feedback control system. This section first revisits the signal-space description that was popular in the 1990s. Then, several system characteristics, together with their definitions, are introduced. The formulation of FD tasks is finally given through the master and slave objective functions. Let us consider a nonlinear system G in Fig. 1 with input u(k) ∈ R^{k_u} and output y(k) ∈ R^{k_y}, where k is the time instant. Its state-space equation can be described by

A. Feedback Systems Described via A Signal Space
where x(k) ∈ R^{k_x} represents the unknown system state; φ and ϕ are two nonlinear mappings. The feedback system is equipped with a controller K such that u(z) = K(y(z)). For the sake of simplicity, k or z, referring to the domains, will be omitted in the sequel. At the same time, different kinds of signal spaces will be introduced to describe the feedback control system {G, K}. Given an initial condition x(0) ∈ X, (1) and (2) can be rewritten as where U and Y are the signal spaces corresponding to u and y, respectively. In this study, cross-product spaces [28] will be widely used for introducing the complex mappings.

B. Desirable Characteristics
Any signal s considered in this study belongs to a set S that is a signal space, denoted as s ∈ S. Because these signals are considered in {G, K}, the concept of system stability should be taken into account when defining signal spaces. For example, S consists of two subsets that have a complementary relation [28], i.e., where the superscripts "s" and "us" respectively refer to the stable and unstable subsets. In order to explore the desirable characteristics of the feedback control system {G, K}, several definitions on the s-related parts are introduced as follows.
Fig. 2: Kernel representations of a feedback control system.

1) Stable kernel representation (SKR):
In (4), the superscript "s" is usually formed by defining a norm on the signal vector, representing a signal with finite norm. For the FD purpose, the kernel representations of {G, K}, denoted as K_G and K_K in Fig. 2, can be used to parameterize two nonlinear observers through where the residual signals, z_G ∈ Z_G and z_K ∈ Z_K, are not necessarily equal to zero. Corresponding to (5a)-(5b), the two SKRs are introduced as follows.
Definition 1. The two operators, K_G and K_K, are called SKRs if the following conditions hold. In order to simplify notations, a compact description of K_G and K_K is defined as which generates the following projections:
2) Coprimeness: Similar to linear systems, the coprimeness of K_{G,K} is given as follows.
Definition 2. The operator K_{G,K} is said to be coprime iff for x_0 ∈ X, a right inverse K#_{G,K} exists such that where K#_{G,K} is the following mapping: It should be mentioned that the coprimeness of SKRs is a necessary condition for the feedback control system's well-posedness [28]. Without loss of generality, all SKRs used in this study are assumed to be coprime. In fact, (9) presents a generalized version of the Bezout identity.
Remark 1. For a feedback control system, it is unnecessary to distinguish u and y through specific definitions of system inputs and outputs because of behavioral equations [29].
Remark 1 motivates us to develop two manifolds for both system modeling and FD, where u and y are treated equally. In this study, the so-called "manifold" is a topological space whose definition can be found in mathematics. To start with, the well-definedness of K_G is introduced as follows [24].
for all x_0 ∈ X and u ∈ U^s.
In Definition 3, the well-definedness indicates the existence of the inverse function of K_G(u, •), i.e., In mathematics, we say that K_G(u, •) and K#_G(u, •) are homeomorphic. In fact, the ILM to be learned in Part I is obtained via a homeomorphic (also called isomorphic) projection/operator.

C. Problem Statements
A feedback control system {G, K}, as encountered in practice, is subject to noise. Therefore, G in (3) becomes where d ∈ R^{k_y} includes the stochastic noise-related terms. It is emphasized here that d is generally unknown and independent of u and y. It is a general fact that
F1): The signal-to-noise ratio of {G, K} is high, i.e., both u and y have higher power levels than d.
F2): d is unknown, but it is reasonable to assume that d ∼ N(µ, Σ), where µ and Σ are unknown but constant.
To conduct the FD tasks, a trustworthy identification of K_G, denoted as K̂_G, is necessary, based on which a residual generator is obtained such that [25] where r ∈ R^{k_y} is the residual signal. Here, we introduce a remark on the physical interpretation of (14).
Remark 2. In an ideal condition, K̂_G(u, y) in (14) could mathematically be equivalent to K_G when {G, K} is fault-free. However, the limited samples used in practice for learning K̂_G(u, y_d) cause a modeling error (including both approximation and estimation errors [1]), thereby resulting in r → z_G. In addition, no generality is lost by assuming z_G ∼ N(µ, Σ) because, in the time domain, both z_G and d only represent stochastic effects.
F1) inspires us to define the master optimization objective, which is Correspondingly, a slave optimization objective can be formulated, according to F2), as In (15) and (16), r is a time-series signal obtained via the INN. Most existing studies circumvent a fundamental difficulty, i.e., how to optimize the slave objective when µ and Σ are unknown. In fact, extreme pursuit of optimizing the master objective has detrimental effects on both parameter identification and FD because of overfitting [1], i.e., it results in incorrect or misleading solutions. This is one of the hardest puzzles in designing reasonable FD approaches for nonlinear feedback systems.
Part I of the series will resort to an ILM to solve both the master and slave objectives simultaneously. Then, the faithful designs of parameter identification and explainable FD algorithms for nonlinear feedback systems will be achieved.

III. EXPLAINABLE FAULT DIAGNOSIS USING AN INVERTIBLE LEFT MANIFOLD
In this section, we first introduce the faulty feedback system of interest. Then, an overall framework of the ILM-based FD solution is outlined. In order to help readers understand the proposed method, an INN is taken as an example to detail several functional modules with interpretability.

A. The Overall Framework
Taking possible faults into account, (1) becomes where f(k) represents both the actuator and sensor faults, and the changes in φ and ϕ denote the structural faults. Alternatively, (17) is further described by In the following, we state an overall framework of parameter identification for estimating K_G and, based on this, set forth an explainable FD framework for detecting the faults presented in (17). Then, interpretability with strict theoretical support is tied in with the overall framework.
Let P be a projection generated by any ML approach.A schematic description of the proposed unified framework is elaborated in Fig. 3, where the total loss function, L tt , is defined as follows.
where λ is a tuning factor, L_1 is defined as and L_2 could be formulated via Fig. 3 presents a bird's-eye view of the proposed FD approach and its learning procedures, in which
• The forward mapping P_tt is a bijective function, ensuring that r spans an ILM. Besides, it solves both the master and slave objectives simultaneously.
• The P_mo part is always invertible, and its inverse function is denoted as P^{-1}_mo. Its loss function is defined via L_1, corresponding to the master objective.
• The P_so part (i.e., P_2) is also invertible and could be a linear or nonlinear function. Its loss function is defined via L_2, corresponding to the slave objective.
• The reference signal q should be Gaussian distributed.
For the sake of brevity, we choose q ∼ N(0, I).
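To make the interplay of the master and slave objectives concrete, here is a minimal numpy sketch of a total loss of the form L_tt = L_1 + λ·L_2, where L_1 penalizes the residual magnitude and L_2 penalizes the non-Gaussianity of q by moment matching against N(0, I). The function names and the moment-matching surrogate are illustrative assumptions, not the paper's exact definitions in (19)-(21).

```python
import numpy as np

def master_loss(r):
    """L_1: mean squared residual norm (approximation error)."""
    return float(np.mean(np.sum(r**2, axis=1)))

def slave_loss(q):
    """L_2: a moment-matching surrogate penalizing deviation of q
    from N(0, I), i.e., zero mean and identity covariance."""
    mu = q.mean(axis=0)
    cov = np.cov(q, rowvar=False)
    return float(np.sum(mu**2) + np.sum((cov - np.eye(q.shape[1]))**2))

def total_loss(r, q, lam=1.0):
    """L_tt = L_1 + lambda * L_2, mirroring the tuning factor lambda."""
    return master_loss(r) + lam * slave_loss(q)

rng = np.random.default_rng(0)
q = rng.standard_normal((5000, 2))   # reference signal, already close to N(0, I)
r = 0.1 * q                          # a small, Gaussian residual
print(total_loss(r, q))              # small: both objectives nearly satisfied
```

Driving L_1 alone to zero is exactly the overfitting trap discussed later; the L_2 term keeps the residual's distribution anchored to the noise model.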
Remark 3. In the system sense, the specific structure of P_tt is the same as the well-definedness of a feedback system {G, K}, where P_mo is used for designing a residual generator, and P_so contributes to ensuring the reasonableness of the obtained residual generator by functionally evaluating how much information hidden in U^s × Y^s_D is useful.
Based on the collected training data, whose signals are stable, one can obtain
Fig. 3: An overview of the proposed approach for parameter identification and FD using an ILM.
and its inverse is As we can observe, P_mo contains the K̂_G to be learned, which generates a kernel space Z_G, i.e., In fact, P_tt is a left-multiplication operator, delivering It indicates that the above are topological spaces or manifolds that must be invertible, and P_tt is a homeomorphism (i.e., a continuous bijection with a continuous inverse) between the two topological spaces above. Therefore, we name the proposed approach in Part I of the series "an ILM-based solution" that can be used, for instance, for both the parameter identification and FD purposes. In the sequel, the functional modules presented in Fig. 3, together with the loss functions defined in (19)-(21), will be explained.

B. Functional Modules with Interpretability
In Fig. 3, the basic configuration of the FD architecture using P_tt is elaborated. It consists of two functional modules: 1) an observer P_mo that can generate a raw residual signal r; 2) a calibration operator P_so that tries to remove the noise-related component in ŷ_d such that r = z_G. Here, we shall emphasize that the two aforementioned functional modules are the basic components. Because G_d is a nonlinear dynamic system, other functional modules, such as unit-time delay operators [1], need to be integrated into P_tt in Fig. 3, especially for purely data-driven FD algorithms. In addition, P_so can be designed to have the same invertible structure as P_mo.
1) Invertible structure of P_mo: According to Fig. 3, P_mo generates the following mapping: where P_mo is a composite operator, denoted as Similarly, the estimation process is According to (28) and (29), it is easy to verify that det(P_mo) = det(P^{-1}_mo) = 1.
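The unit-determinant, exactly invertible structure claimed for P_mo can be illustrated with an additive coupling layer, a standard building block of INNs. The coupling function t_net below is a hypothetical stand-in for the learned nonlinear part P_1; the Jacobian of this map is unit triangular, so det = 1 in both directions, mirroring det(P_mo) = det(P^{-1}_mo) = 1. This is a minimal sketch, not the paper's exact architecture.

```python
import numpy as np

def t_net(x1):
    # Hypothetical coupling function; in the paper's setting this role
    # is played by the learned nonlinear part P_1 (e.g., Tanh networks).
    return np.tanh(x1 @ np.array([[0.5, -0.3], [0.2, 0.8]]))

def forward(x):
    """Additive coupling: y1 = x1, y2 = x2 + t(x1).
    The Jacobian is unit triangular, so det = 1."""
    x1, x2 = x[:, :2], x[:, 2:]
    return np.concatenate([x1, x2 + t_net(x1)], axis=1)

def inverse(y):
    """Exact inverse: x2 = y2 - t(y1), no equation solving needed."""
    y1, y2 = y[:, :2], y[:, 2:]
    return np.concatenate([y1, y2 - t_net(y1)], axis=1)

x = np.random.default_rng(1).standard_normal((4, 4))
x_rec = inverse(forward(x))
print(np.allclose(x, x_rec))  # True: invertibility holds by construction
```

Invertibility here is structural, so it holds for any continuous t_net; this is why continuity of P_1 is the only condition needed in Theorem 1.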
The basic properties of P_mo are developed in Theorem 1, which serves as the basis for the interpretability of the proposed ILM-based FD architecture.
Theorem 1. Consider an internally stable feedback system {G_d, K} with K_G and K#_G. Then, P_mo given in Fig. 3 is a homeomorphism between U^s × Y^s_D and U^s × R^s if the system input u is given and P_1 is a continuous mapping. It, therefore, has the following properties: 1) P_mo is a bijection; 2) P_mo is continuous; 3) P^{-1}_mo is also continuous.
Proof: The complete proof is detailed in Appendix A.
Note that P_mo is an observer-based realization of the residual generator [25]. Combining (32) with (14) yields the identification result of an SKR K̂_G, with a well-definedness counterpart.
Remark 4. The concept behind (33) and (34) is that control theory not only guides the design of P_mo and K̂_G, but also guarantees interpretability. Moreover, P_mo greatly simplifies the identification process with effective learning.
Remark 5. We should also emphasize that the residual signal r alone does not contain enough information to reconstruct u and the noise-free y. Therefore, for Y^s_D and Z^s_G (or R^s) to be homeomorphic, a necessary condition is the presence of u, as customized in Fig. 3.
2) Invertible structure of P_so: In (19), striving to minimize As discussed in [25], if P_1 has a higher degree of nonlinear complexity than G, the problem of overfitting will arise because P_1 obtains an overly small empirical error, i.e., P_1 has learned the noise information. This makes clear that r → 0 is not a credible learning objective. Such a challenge, largely due to the unknown µ and Σ, motivates the introduction of P_so in Fig. 3, together with the loss function L_2 in (19).
As observed from Fig. 3, P_so is formulated via P_2. Due to the property that linear transformations of Gaussian variables are also Gaussian distributed, P_2 constructed through an NN can normalize r such that by choosing linear activation functions, provided P_mo is well trained. However, if P_mo is not well trained, the difference between P_2(r) and N(0, I) will increase the loss L_2 defined in (19). Because P_2 in this study is chosen as a linear operator, its derivative with respect to both W and b always exists. Therefore, P_so is also invertible. In addition, it is worth emphasizing that the introduction of an invertible P_so can make a big difference, as summarized in the following Remark 6.
Remark 6. 1) By penalizing the non-Gaussianity of r, P_so can avoid the overfitting presented in (35). In other words, it helps achieve a faithful estimation of ŷ with a constraint on the shape of the probability density function (PDF) of r. 2) L_2 contributes to adjusting both P_so and P_mo simultaneously so that, for the unseen test data, the uncertainties of y_d and r remain the same.
In addition, the manner of optimizing P_2 is not limited to (36). In Appendix B, we present an alternative version adopted in the open-source code.
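As a concrete illustration of a linear, invertible P_so, the following numpy sketch whitens a correlated Gaussian residual r into q = W(r − µ) with W = Σ^{-1/2}. Since W is nonsingular, the map is invertible, and q is approximately N(0, I). This is only one admissible linear realization of P_2, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
A = np.array([[2.0, 0.5], [0.5, 1.0]])          # so Sigma = A @ A.T
r = rng.standard_normal((20000, 2)) @ A.T + mu_true  # correlated residual

# Estimate mean/covariance from data and build the linear map q = W (r - mu).
mu = r.mean(axis=0)
Sigma = np.cov(r, rowvar=False)
# Inverse matrix square root via eigendecomposition (Sigma is SPD).
eigvals, eigvecs = np.linalg.eigh(Sigma)
W = eigvecs @ np.diag(eigvals**-0.5) @ eigvecs.T

q = (r - mu) @ W.T
print(np.round(q.mean(axis=0), 2))              # ~ [0, 0]
print(np.round(np.cov(q, rowvar=False), 2))     # ~ identity
```

The inverse map r = W^{-1} q + µ always exists because W is built from strictly positive eigenvalues, matching the requirement that P_so be invertible.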
3) Invertible structure of P_tt: Next, the whole structure of P_tt is introduced through the following Theorem 2.
Theorem 2. Consider an internally stable feedback system {G_d, K} with K_G and K#_G. Given a system input u, P_tt in Fig. 3 is a homeomorphism between U^s × Y^s_D and U^s × Q if P_1 is a continuous mapping and P_2 is an invertible operator. It, therefore, has the following properties: 1) P_tt is a bijection; 2) P_tt is continuous; 3) P^{-1}_tt is also continuous.
Proof: According to the structure presented in Fig. 3, it can be verified that Because P_2 is invertible and P_mo is a homeomorphism, P_tt must be a homeomorphism, which completes the proof.
Corresponding to Theorem 2, two manifolds with the same topological structure are sketched in Fig. 4, where the bottom one (i.e., the ILM) can be directly used for FD. We can define y_{d,0} ∈ Y_0 for any given u_0. Then, two coordinate charts can be utilized to describe Fig. 4 in a locally Euclidean space, i.e., It indicates that the proposed scheme can eliminate the dynamics caused by u so that the variation of y_d caused by d can be estimated and evaluated. Mathematically, conditional entropy can be used to quantify the information levels of the two manifolds in Fig. 4. According to (C.3), we know that because det(P_mo) = 1. In addition, for a well-trained P_tt, the first equality in (39) yields and the second one gives It is interesting to note that in the proposed ILM-based solution, both r and q can be applied to the FD tasks, where
• r has the same uncertainties as d; based on r, optimal FD performance can be achieved.
• A final entropy change, a constant, can be calculated because the invertible P_so plays a normalization role. Therefore, q, regarded as a normalized r, also delivers FD power.
Remark 7. Different from existing ML-based parameter identification and FD approaches, the proposed framework is a first attempt to ensure optimal FD power by quantifying the degree of uncertainty that r should carry. As described by (15) and (16), the proposed approach can effectively learn the intrinsic topological structure of a feedback system {G, K} so that the influence caused by the unknown d can be reconstructed. It is expected that the proposed architecture in Part I using an INN can open a novel avenue for both parameter identification and FD, especially for explainable ML-based designs.

IV. DATA-DRIVEN FD VERSIONS WITH AN ILM
This section focuses on the implementation procedures of the ILM-based FD solution, where the results presented in Section III serve as the theoretical foundation. By using two banks of external delay operators [1], a static model is introduced, based on which both off-line training and online test algorithms for a feedback system are developed.

A. External Delay Operators
Consider the system state x(k) in (17). In general, x(k) is unknown but can be estimated from the past and current data, i.e., [1], where m is a mixed data vector whose definition is In addition, P_est is the projection of m onto the space of x.
In the last two equations of (43), multiple delay operators (denoted as z^{-1} such that z^{-1}(u(k)) = u(k-1) and z^{-1}(y(k)) = y(k-1)) can be adopted to expand the horizon. As pointed out in [30], the dynamic system can then be reduced to a static model. Mathematically, this can be interpreted by defining a composite operator ϕ as follows.
In fact, using external delay operators allows us to construct a residual generator of {G, K} through a static model, i.e., (46)
Fig. 5: A static model-based realization of an observer with the aid of delay operators that can eliminate the difference dynamics from the reference signal γ(k) to both u(k) and y(k).
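A minimal sketch of the delay-operator construction: stacking the current input with delayed inputs and outputs turns the dynamic relation into a static regression. The exact composition of m(k) below is an assumption for illustration and may differ from the paper's definition.

```python
import numpy as np

def stack_delays(u, y, s):
    """Form m(k) = [u(k), ..., u(k-s), y(k-1), ..., y(k-s)] for each k >= s,
    so the dynamic system reduces to a static regression on m(k).
    (The exact content of m is an illustrative assumption.)"""
    rows = []
    for k in range(s, len(u)):
        past_u = u[k - s:k + 1][::-1]        # u(k), u(k-1), ..., u(k-s)
        past_y = y[k - s:k][::-1]            # y(k-1), ..., y(k-s)
        rows.append(np.concatenate([past_u, past_y]))
    return np.asarray(rows)

u = np.arange(6, dtype=float)                # toy input:  u(k) = k
y = 10 * np.arange(6, dtype=float)           # toy output: y(k) = 10k
M = stack_delays(u, y, s=2)
print(M.shape)      # (4, 5): 3 stacked inputs + 2 delayed outputs per row
print(M[0])         # row for k = 2: u(2), u(1), u(0), y(1), y(0)
```

Each row of M is one static regressor m(k); a static model fitted on (m(k), y(k)) pairs then plays the role of the residual generator in (46).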

B. Optimal Fault Diagnosis Performance
Taking the INN H(θ, •) as an example to generate P_tt, the following projection contains a residual generator for G_d described by where the optimal parameter θ* can be obtained according to Similar to (19), L_1 and L_2 in (49) are redefined by where n is the number of training samples. Due to the invertibility of P_so, the following two test statistics regarding (48) are equivalent:
• T² defined on r(k): where χ² denotes the chi-square distribution, and Σ_n can be estimated via
• T² defined on q(k): because q → N(0, I).
According to the test statistic defined in (51) or (53), a unified threshold J_{th,T²} can be set to where β represents the user-defined significance level, mathematically equal to the acceptable false alarm rate (FAR).
Based on the analysis from (42) to (54), the following development concerns online FD, i.e., T²(r(online)) − J_{th,T²} < 0 ⟹ Fault-free; otherwise ⟹ Faulty; (55) or Moreover, the following properties are essential for achieving faithful FD tasks. (58)
• In the online FD phase, the fault of interest will result in a deviation f_rel, i.e., In general, f is also independent of d so that The above relationships can help readers check the optimal FD performance of the proposed ILM-based solution. To be specific, due to the effect caused by P_so, one obtains (61) In addition, combining (51) with (60) yields where the subscript i refers to the i-th element, and the last equation holds because the noises are mutually independent. Therefore, the optimal FD power of the proposed ILM-based scheme can be illustrated because of for a given f in a feedback system {G_d, K}.
Remark 8. The physical interpretation of (63) is that, for the given training data pairs {m(k), y_d(k)}, the loss function of the INN used is closely related to the FD performance. Mathematically, because the lower bound of r is d, which is guaranteed via P_so, a smaller L_tt returns larger FD power for the proposed ILM-based scheme.
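The T² monitoring logic of (51)-(55) can be sketched as follows: Σ is estimated from fault-free residuals, T²(k) = r(k)ᵀ Σ⁻¹ r(k), and the threshold is set from the acceptable FAR β. For brevity, the threshold below is taken as the empirical (1 − β) quantile of fault-free T² values; under exact Gaussianity, the chi-square quantile of (54) would be used instead. The fault magnitude is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
r_train = rng.standard_normal((10000, 3))          # fault-free residuals
Sigma_inv = np.linalg.inv(np.cov(r_train, rowvar=False))

def t2(r):
    """T^2(k) = r(k)^T Sigma^{-1} r(k), computed row-wise."""
    return np.einsum('ij,jk,ik->i', r, Sigma_inv, r)

beta = 0.01                                        # acceptable FAR
# Empirical (1 - beta) quantile of fault-free T^2 as the threshold J_th.
J_th = np.quantile(t2(r_train), 1 - beta)

r_faulty = r_train[:100] + np.array([5.0, 0.0, 0.0])  # additive fault f
alarms = t2(r_faulty) > J_th
print(alarms.mean() > 0.5)                         # most faulty samples flagged
```

With r Gaussian, the same decision can equivalently be made on the normalized q, which is why (51) and (53) share the unified threshold.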

C. Basic Descriptions and Procedures of INNs-based FD
Based on the analysis and descriptions in Sections III and IV, we formulate two algorithmic procedures, covering the off-line training and the online test. The detailed Python codes of both Algorithms 1 and 2 can be found on the public website.
Algorithm 1 Procedures of the ILM-based FD approach: Off-line learning
1: Construct an INN H(θ, •) according to Fig. 3, where P_so is a linear operator and the INN is a homeomorphism if the activation function of P_1 is continuous (a diffeomorphism if the activation function of P_1 is differentiable);
2: Collect the measurements u and y_d from {G_d, K};
3: Pre-process u and y_d to obtain m and y;
4: Define the input and output of H(θ, •) to be [m^T y_d^T]^T and [m^T q^T]^T, respectively;
5: Set the loss function of this INN as L_tt, as defined in (50);
6: Update θ via (49) and obtain θ*;
7: Define a residual signal using q or r;
8: Choose a test statistic such as T²;
9: Return a threshold J_{th,T²} via (54);
10: (If necessary to diagnose faults) extract the fault features through q_{d,f} or r_{d,f}.
As observed from Algorithms 1 and 2, both r and q can be used for the FD purpose. They achieve the same FD performance because, in Part I, P_so is chosen as a linear invertible module.
Algorithm 2 Procedures of the ILM-based FD approach: Online application
1: Read the online input and output data and obtain their stacked forms;
2: Calculate the online q or r;
3: Obtain the test statistic in real time;
4: Make an FD decision according to (55) or (56);
5: (If necessary) identify the fault through q(online) or r(online).
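Algorithm 2 can be condensed into a single online decision function. The residual map below is a toy stand-in for the trained INN (a one-step prediction error of a hypothetical static relation y = 2u); it is used only to show the decision flow of (55), not the paper's actual model.

```python
import numpy as np

def online_fd_step(u_win, y_win, residual_fn, Sigma_inv, J_th):
    """One online iteration of Algorithm 2: stack the latest window,
    generate the residual, form T^2, and return the FD decision.
    residual_fn stands in for the trained mapping (m, y_d) -> r."""
    m = np.concatenate([u_win, y_win[:-1]])       # stacked regressor
    r = residual_fn(m, y_win[-1])
    T2 = float(r @ Sigma_inv @ r)
    return 'Faulty' if T2 > J_th else 'Fault-free'

# Toy stand-in: the "true" static relation is y = 2*u, so the residual
# is the one-step prediction error on the newest sample.
residual_fn = lambda m, y_now: np.atleast_1d(y_now - 2.0 * m[0])
Sigma_inv = np.array([[1.0]])
J_th = 6.63                                       # ~ chi-square(1) 99% quantile

print(online_fd_step(np.array([1.0, 0.9]), np.array([1.8, 2.01]),
                     residual_fn, Sigma_inv, J_th))   # small residual
print(online_fd_step(np.array([1.0, 0.9]), np.array([1.8, 5.0]),
                     residual_fn, Sigma_inv, J_th))   # large residual
```

In the actual scheme, residual_fn would be the P_mo part of the trained INN evaluated on the stacked data of Step 1, and J_th would come from (54).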

V. CASE STUDIES
In order to illustrate the superior learning ability of INNs and the improved FD power of the proposed ILM-based FD approach, this section presents two case studies: the first considers a static model, and the second considers an electrical traction system. The readers can refer to the two case studies on the website "https://github.com/sunwenxin/Explainable-Fault-Diagnosis-Using-Invertible-Neural-Networks-Part-I-A-Left-Manifoldbased-Solution.git", where both Python and Matlab codes are provided.

A. A Static Model
Let us consider a static model described by In (64), d is the unknown noise following a Gaussian distribution. In order to extract the nonlinear relationship in (64) whilst constructing an observer, an INN is defined as follows.
where the activation functions of P_mo are chosen as 'Tanh', P_so is a linear operator, and the other configurations of H(θ, •) are presented in Table I. Under a given u, 1 × 10^4 samples are collected for training the INN given in Table I. For the purpose of demonstrating the faithful identification ability, nonlinear least squares using a traditional fully connected NN [11] without the slave objective is considered for a fair comparison. This means that, except for the network structure, all configurations of both the proposed INN method and the traditional NN scheme proposed in [11] are the same. In order to provide an intuitive view of the learning ability, we present the estimation error of y_d in Figs. 6 and 7 through contour plots of the PDFs. Owing to the effect caused by L_1, the traditional NN-based least squares tries to fit the unknown nonlinear relationship through (35). As shown in Fig. 6, the excessive pursuit of (35) results in a too small empirical error, i.e., overfitting. This situation occurs very easily, especially for traditional NNs equipped with a large number of neurons. However, the proposed ILM-based scheme achieves superior performance in estimating y_d. For example, Fig. 7 shows that the influence of L_2 can avoid the overfitting problem. In the online test, 1 × 10^3 samples are used, where a fault, f = 1, is introduced to y_1 from the 501st to the 1000th time instant. The FD result for f using the traditional NN-based nonlinear least squares is shown in Fig. 8, where the FARs are high due to the significant difference between the PDFs of d and r. Corresponding to the training results in Fig. 7, Fig. 9 shows the satisfactory FD result using the proposed approach.

B. A Feedback System
In the second example, a direct-current motor located in two closed loops is taken into account, whose feedback structure is shown in Fig. 10, where 'SU' represents a subunit. Both the proportional-integral controllers and SU_j, j = 1, ..., 4, are summarized in Table II.
In the two control loops, we choose the part highlighted by blue dashed lines as G_d, and the unknown noise d is also considered. To address the system dynamics, 10 delay operators are used to expand the horizon. According to (47), another INN is constructed to identify the SKR K̂_G(m, y_d), whose configuration is shown in Table III. In addition, a similar traditional NN-based least squares scheme with the same configuration is also constructed for comparative analysis.
In the second case study, a set of 1 × 10^4 samples under normal conditions is utilized for training the two NNs; the training results are shown in Fig. 11. As presented in the top picture of Fig. 13, the feedback system is subject to a varying reference signal γ during the test phase, and a fault f = 1 is added to the outer (i.e., speed) control loop from the 801st step. As observed from the system dynamics in Fig. 13, f has only a slight influence on the system outputs because of the two control loops.
Based on the obtained traditional NN model that carries noise information, the residual generator is sensitive to unseen data. Consistent with the analysis, Fig. 14 clearly shows that the traditional NN-based FD scheme fails in monitoring the faulty system operation, i.e., it has a high missed detection rate after the 801st time instant. Its FD power is not satisfactory when detecting the f introduced into the speed control loop. The detection result using the proposed ILM-based approach is presented in Fig. 15; simple observation illustrates its superiority. Beyond the experimental results, some tips and notes on the improved performance in both identification and FD using the proposed ILM-based scheme are given as follows.
1) For a nonlinear system, P_so can be chosen as a linear operator; however, we do not suggest a linear P_mo. A linear P_mo has a limited learning ability and, owing to its large approximation error, cannot produce an overly small estimation error on the training data.
2) For fitting a nonlinear relationship, the ReLU activation function is strongly discouraged in an INN. The main reason is that, within a local Euclidean space, the linear pieces of ReLU limit the learning ability.
3) If a traditional NN (such as the P_mo part) does not overfit, e.g., because of a small number of neurons, an INN with the same configuration would not show its superior ability. We encourage interested readers to reflect on this fact through the effects caused by the master and slave objectives.
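Tip 2 can be checked numerically: wherever the activation pattern of a ReLU network is fixed, the network is exactly affine, so its second differences vanish, while a tanh network retains local curvature. The tiny one-hidden-layer nets below use hand-picked weights purely for illustration.

```python
import numpy as np

# One-hidden-layer nets with hand-picked weights; the ReLU kinks sit at
# x = 0.5 and x = -0.5, so on [0.30, 0.31] the ReLU net is exactly affine.
W1 = np.array([[1.0, 1.0, -1.0]])
b1 = np.array([-0.5, 0.5, -0.5])
W2 = np.array([[1.0], [0.5], [-0.7]])

def net(x, act):
    return (act(x.reshape(-1, 1) @ W1 + b1) @ W2).ravel()

relu = lambda z: np.maximum(z, 0.0)
x = np.linspace(0.30, 0.31, 5)           # a small local Euclidean neighbourhood

d2_relu = np.diff(net(x, relu), 2)       # vanishes: locally an affine map
d2_tanh = np.diff(net(x, np.tanh), 2)    # nonzero: genuine local curvature
print(np.abs(d2_relu).max(), np.abs(d2_tanh).max())
```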

VI. CONCLUSIONS AND DISCUSSIONS
In this first part of the series, we have proposed an ILM-based FD scheme for a nonlinear feedback system by developing a novel INN. Different from the existing ML approaches, the INN developed in Part I consists of P_mo and P_so. In an invertible framework, P_mo serves as an optimizer to minimize the approximation error, and P_so contributes to enhancing the generalization ability by minimizing the estimation error. More importantly, all the designs are guided by control theory, ensuring the interpretability of the proposed FD method. With the aid of an invertible data manifold, this study successfully links intelligent learning to the well-established system identification framework. At the same time, it also promotes the development of explainable FD approaches.
We would like to emphasize that behind the design of Part I, the core technique applied to the establishment of the proposed FD structure is still control theory, i.e., an SKR to parameterize an observer for a feedback system. Part I ends with the following essential comments and remarks.
• A stricter proof of the invertibility of P_tt, as well as P_mo, can be completed by using the Banach fixed-point theorem as in our previous study [1]. The conditions appearing in the proof procedures can guide the choice of rational nonlinear projection operators.
• For both nonlinear open-loop and closed-loop systems, different operating points should be covered when collecting a reliable training data set, and a uniformly distributed system input is strongly recommended to excite the global system nonlinearities.
• Classification-based ML methods are incapable of FD for a dynamic system or feedback system. Consistent with control theory, an ideal learning-based residual generation must eliminate the system input-induced change, so that the remainder (i.e., the so-called residual signal) is related only to noises and faults.
• There exists an IRM that is the complement of the ILM developed in Part I. For a feedback system, the IRM can be explored to describe the system dynamics from a reference signal to the system inputs and outputs.
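The first remark can be illustrated with a toy fixed-point inversion: if P(x) = x + g(x) and g is a contraction, the Banach fixed-point theorem guarantees a unique solution of y = P(x), computable by simple iteration. The contraction g below is an arbitrary example, not an operator from the paper.

```python
import numpy as np

# If P(x) = x + g(x) with g a contraction (Lipschitz constant < 1), the
# Banach fixed-point theorem gives a unique solution x of y = P(x), found
# by iterating x <- y - g(x).
def g(x):
    return 0.4 * np.tanh(x)          # |g'(x)| <= 0.4 < 1, hence a contraction

def P(x):
    return x + g(x)

def P_inv(y, iters=60):
    x = y                            # any initial guess converges
    for _ in range(iters):
        x = y - g(x)                 # contraction mapping iteration
    return x

y = 1.3
x = P_inv(y)
print(abs(P(x) - y))                 # ~0: P was inverted numerically
```

The Lipschitz condition on g is exactly the kind of condition that guides the choice of rational nonlinear projection operators.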
The last remark above motivates us to design another INN in Part II, where an IRM will be learned for both the identification and FD purposes.

APPENDIX A

Continuity is a necessary condition for a homeomorphism. By using a continuous mapping P_1, it is easily verified that r = P_1 u + y_d in Fig. 3 is also a continuous mapping. In mathematical notation, this is written as

APPENDIX B
AN ALTERNATIVE SOLUTION TO OPTIMIZING P_2
Let P and P_P be the PDFs of r and P_2 r, respectively. In order to simplify the notation, the entropy, denoted as H, defined on r is introduced as follows.

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$
 define the composite operator and data matrix, respectively.
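Since Appendix B works with the entropy H of the residual PDF, a minimal plug-in estimate of H from samples may help fix ideas; the histogram estimator below is a generic stand-in and the Gaussian residual is synthetic.

```python
import numpy as np

def hist_entropy(samples, bins=50):
    """Plug-in (histogram) estimate of the differential entropy H.

    A generic estimator standing in for the entropy defined on r; the
    paper's exact definition and estimator may differ.
    """
    p, edges = np.histogram(samples, bins=bins, density=True)
    w = np.diff(edges)               # bin widths
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask]) * w[mask]))

rng = np.random.default_rng(3)
r = rng.normal(0.0, 1.0, size=20000)     # synthetic residual samples
H = hist_entropy(r)
print(H)   # for a standard Gaussian, roughly 0.5*log(2*pi*e) ≈ 1.419
```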

Fig. 1: A schematic description of nonlinear feedback control systems.

Fig. 4: Two invertible manifolds when u is given, where the two blue solid lines represent bijective functions.

Fig. 6: Training results using the traditional NN-based nonlinear least squares, where overfitting is obviously present.
Fig. 7: Training results using the proposed ILM-based approach, where overfitting is alleviated.

Fig. 10: A direct-current motor in the two control loops.

Fig. 11: Training results using the NN-based nonlinear least squares, where overfitting is obviously present.

Fig. 12: Training results using the proposed ILM-based approach, where overfitting is alleviated.
Because r = y_d − ŷ_d ∈ R^{k_y}, one can know

$$\lim_{u \to u_0,\; y_d \to y_{d0}} r = P_1 u_0 + y_{d0}. \quad \text{(A.2)}$$

Therefore, the mapping P_mo given in (28) is also continuous. According to the structural diagram on the right-hand side of Fig. 3, P_mo^{−1} is also continuous. Moreover, combining (A.3) with (A.4), one can infer that the two equalities

$$P_{mo}^{-1} \circ P_{mo} = P_{mo} \circ P_{mo}^{-1} = I \quad \text{(A.5)}$$

are valid, which completes this proof.

TABLE I: Configurations of the INN for a static model

TABLE II: Descriptions of the different modules used in Fig. 10

TABLE III: Configurations of the INN for the nonlinear feedback system