Reliable AI through SVDD and rule extraction

Abstract: This paper addresses how Support Vector Data Description (SVDD) can be used to detect safety regions with zero statistical error. It provides a detailed methodology for applying SVDD in real-life applications, such as vehicle platooning, by addressing common machine learning problems such as parameter tuning and the handling of large data sets. Intelligible analytics for knowledge extraction through rules is also presented, targeted at understanding the safety regions of the system parameters. Results are shown by feeding simulated data to the training of different rule extraction mechanisms.

• Safety regions are tuned on the basis of the radius of the SVDD hypersphere.
• A simple rule extraction method from SVDD is compared with LLM and DT.

The article is organized as follows: first, SVDD and negative SVDD are introduced in detail, focusing also on how to choose the best model parameters (Section 2.2) and how to handle large data sets (Section 2.3). Section 3 is then devoted to rule extraction: LLM and DT are presented, and how to extract intelligible rules from SVDD is explained. Finally, an application example is proposed in Section 4.

SUPPORT VECTOR DATA DESCRIPTION
Characterizing a data set in a complete and exhaustive way is an essential preliminary step for any subsequent action performed on it. Having a good description of a data set means being able to easily understand whether a new observation contributes to the information brought by the rest of the data or is totally irrelevant. The task of data domain description is precisely to identify a region, a border, enclosing a certain type of information as precisely as possible, i.e. without adding misinformation or empty spaces. This idea is realized mathematically by a circumference (a sphere or a hypersphere, depending on the dimension of the data space) that encloses as many points as possible within as little area (volume) as possible. Indeed, SVDD can also be used to classify a specific class of target objects, i.e. to identify a region (a closed boundary) into which objects that should be rejected are not allowed.
This section is organized as follows: SVDD is introduced as in [29], focusing first on the normal description and then on the description with negative examples [30]. Then we will focus on two proposed algorithms for solving two problems involving SVDD: fast training of large data sets [4] and autonomous detection of SVDD parameters [32]. Finally, the last subsection is devoted to two original methods for finding zero False Negative Rate (FNR) regions with SVDD.
Figure 1. SVDD with (a) linear kernel, (b) polynomial kernel, (c) Gaussian kernel and the respective parameters. In red are plotted the support vectors (with α_i < C) of the description.

Theory
Let {x_i}, i = 1, . . . , N, with x_i ∈ R^d, d ≥ 1, be a training set for which we want to obtain a description. We want to find a sphere (a hypersphere) of radius R and center a, with minimum volume, containing all (or most of) the data objects.

Normal Data Description
To find the decision boundary which captures the normal instances and at the same time keeps the hypersphere's volume at a minimum, it is necessary to solve the following optimization problem [30]:

min_{R,a} R^2   subject to   ||x_i − a||^2 ≤ R^2, ∀i.   (1)

To allow for the possibility of outliers in the training set, analogously to what happens for soft-margin SVMs [1], slack variables ξ_i ≥ 0 are introduced and the minimization problem becomes [30]:

min_{R,a,ξ} R^2 + C Σ_i ξ_i   (2)

subject to   ||x_i − a||^2 ≤ R^2 + ξ_i,  ξ_i ≥ 0, ∀i,   (3)

where the parameter C controls the influence of the slack variables and thereby the trade-off between the volume and the errors. The optimization problem is solved by incorporating the constraints (3) into equation (2) using the method of Lagrange multipliers for inequality constraints [10]:

L(R, a, ξ; α, γ) = R^2 + C Σ_i ξ_i − Σ_i α_i (R^2 + ξ_i − ||x_i − a||^2) − Σ_i γ_i ξ_i,   (4)

with Lagrange multipliers α_i ≥ 0 and γ_i ≥ 0. According to [29], L should be minimized with respect to R, a, ξ_i and maximized with respect to α_i and γ_i.
Setting the partial derivatives of L with respect to R, a and ξ_i to zero gives the constraints [8]:

Σ_i α_i = 1,   a = Σ_i α_i x_i,   C − α_i − γ_i = 0,   (5)

and substituting (5) into (4) gives the dual problem of (2) and (3):

max_α  Σ_i α_i (x_i · x_i) − Σ_{i,j} α_i α_j (x_i · x_j)   (7)

subject to   0 ≤ α_i ≤ C,  Σ_i α_i = 1.   (8)

Maximizing (7) under (8) determines all the α_i, from which the parameters a and ξ_i can be deduced.
A training object x_i and its corresponding α_i satisfy one of the following conditions [29], [30]:

||x_i − a||^2 < R^2  ⟹  α_i = 0,
||x_i − a||^2 = R^2  ⟹  0 < α_i < C,
||x_i − a||^2 > R^2  ⟹  α_i = C.

Since a is a linear combination of the objects with the α_i as coefficients, only the objects with α_i > 0 are needed in the description: these objects are therefore called the support vectors (SV) of the description. By definition, R^2 is the squared distance from the center of the sphere to (any of the support vectors on) the boundary, i.e. to objects with 0 < α_i < C. Therefore R^2 = ||x_k − a||^2 for any x_k ∈ SV_{<C}, the set of support vectors with α_k < C.
To test a new object z, it is necessary to calculate its squared distance T_a(z) = ||z − a||^2 from the center of the sphere and compare it with R^2: z is accepted when T_a(z) ≤ R^2.

Figure 2. Negative SVDD applied to a two-spirals shaped data set [21]. It is interesting to note that to change the target class it is only necessary to flip the labels. The asterisked points are the SVs on the edge, depending on the respective class.
As is common in machine learning theory [33], the method can be made more flexible [29], [30] by replacing all the inner products (x_i · x_j) with a kernel function K(x_i, x_j) satisfying Mercer's theorem. The data are mapped into a higher-dimensional space via a feature map, where the spherical classification described above is computed. The polynomial kernel and the Gaussian kernel are discussed in [29], [30].
An example description by SVDD with different kernel functions on a 2-dimensional Gaussian data set is shown in Fig. 1. The 1000 data points are generated by a Gaussian distribution with mean [0, 0] and unit variance. Figures are drawn with Matlab, and the description boundary is shown by a 2D contour plot.
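As a concrete illustration of the derivation above, the dual (7)-(8) can be solved numerically with a general-purpose constrained optimiser. The sketch below is a minimal Python illustration, not the paper's Matlab code: function names, tolerances and parameter values (C, σ, data size) are illustrative choices, and R^2 is recovered from the boundary support vectors as described in the text.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X, Y, sigma):
    """Gaussian kernel matrix K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def fit_svdd(X, C=0.2, sigma=2.0):
    """Solve the SVDD dual: maximize sum_i a_i K_ii - sum_ij a_i a_j K_ij
    under 0 <= a_i <= C and sum_i a_i = 1."""
    N = len(X)
    K = rbf(X, X, sigma)
    obj = lambda a: -(a @ np.diag(K) - a @ K @ a)
    cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}
    res = minimize(obj, np.full(N, 1.0 / N), bounds=[(0.0, C)] * N,
                   constraints=cons)
    alpha = res.x

    def dist2(Z):
        # squared distance ||phi(z) - a||^2; K(z, z) = 1 for the RBF kernel
        return 1.0 - 2.0 * rbf(Z, X, sigma) @ alpha + alpha @ K @ alpha

    sv = alpha > 1e-6                         # support vectors
    on_edge = sv & (alpha < C - 1e-6)         # boundary SVs: 0 < alpha_i < C
    R2 = dist2(X[on_edge]).mean() if on_edge.any() else dist2(X[sv]).max()
    return alpha, R2, dist2

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                  # toy 2D Gaussian data
alpha, R2, dist2 = fit_svdd(X)
inside = dist2(X) <= R2 + 1e-8                # accept z when dist2(z) <= R^2
```

With C = 0.2 at most a handful of points can take α_i = C (bounded outliers), so most of the training set falls inside the recovered sphere.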

Negative Examples Data Description
When two (or more) classes of data are available and it is necessary to identify a specific one among the others, SVDD can be trained to recognize objects that should be included in the description from those that should be rejected. This task of SVDD can be very useful in real-world applications where, for example, a safety region must be determined (see Section 4).
In the following, the target objects are enumerated by the indices i, j and the negative examples by l, m. We assume that target objects are labeled y_i = 1 and outlier objects are labeled y_l = −1.
In the same way as before, we want to solve the following optimization problem:

min_{R,a,ξ} R^2 + C_1 Σ_i ξ_i + C_2 Σ_l ξ_l   (14)

subject to   ||x_i − a||^2 ≤ R^2 + ξ_i,  ||x_l − a||^2 ≥ R^2 − ξ_l,  ξ_i ≥ 0, ξ_l ≥ 0.   (15)

The constraints (15) are again incorporated into equation (14) and the Lagrange multipliers α_i, α_l, γ_i, γ_l are introduced, giving the Lagrangian L [30]. Setting the partial derivatives of L with respect to R, a, ξ_i and ξ_l to zero gives the new constraints [30]:

Σ_i α_i − Σ_l α_l = 1,   a = Σ_i α_i x_i − Σ_l α_l x_l,   0 ≤ α_i ≤ C_1,  0 ≤ α_l ≤ C_2,   (17)

and substituting (17) into the Lagrangian we obtain, similarly to before, the dual problem of (14) and (15):

max_α  Σ_i α_i (x_i · x_i) − Σ_l α_l (x_l · x_l) − Σ_{i,j} α_i α_j (x_i · x_j) + 2 Σ_{l,j} α_l α_j (x_l · x_j) − Σ_{l,m} α_l α_m (x_l · x_m),   (18)

subject to the constraints (17). Solving this optimization problem determines the α_i and α_l, and all the objects of the data set can then be classified according to their Lagrange coefficients. As before, a new point z is tested on the basis of its distance from the center compared with the squared radius, where R^2 is calculated as the squared distance from the center a of any SV on the edge (0 < α_i < C_1, 0 < α_l < C_2). Similarly to before, all the inner products (x_i · x_j) can be replaced with a kernel function K(x_i, x_j) [29], [30], [33] to obtain a more flexible description.
An example of negative SVDD is shown in Fig. 2: a Gaussian kernel with σ = 3 is used, and the parameters C_1 and C_2 are both set to 0.25.
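The negative-SVDD dual can be solved with the same numerical machinery as the normal one: pairing each multiplier with its label y ∈ {+1, −1} gives the same quadratic form, with the equality constraint Σ_i y_i α_i = 1 and per-class upper bounds C_1, C_2. The sketch below is an illustrative Python stand-in, not the paper's code; data, names and tolerances are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X, Y, sigma):
    """Gaussian kernel matrix K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def fit_neg_svdd(X, y, C1=0.25, C2=0.25, sigma=3.0):
    K = rbf(X, X, sigma)
    Yk = np.outer(y, y) * K                       # label-signed kernel matrix
    obj = lambda a: -((a * y) @ np.diag(K) - a @ Yk @ a)
    cons = {"type": "eq", "fun": lambda a: a @ y - 1.0}
    bounds = [(0.0, C1) if yi > 0 else (0.0, C2) for yi in y]
    a0 = np.where(y > 0, 1.0 / (y > 0).sum(), 0.0)  # feasible starting point
    alpha = minimize(obj, a0, bounds=bounds, constraints=cons).x
    beta = alpha * y                              # signed centre coefficients

    def dist2(Z):
        return 1.0 - 2.0 * rbf(Z, X, sigma) @ beta + beta @ K @ beta

    caps = np.where(y > 0, C1, C2)
    edge = (alpha > 1e-6) & (alpha < caps - 1e-6)  # SVs on the edge
    R2 = dist2(X[edge]).mean() if edge.any() else dist2(X[alpha > 1e-6]).max()
    return dist2, R2

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (80, 2)),         # target objects
               rng.normal(0, 1, (40, 2)) + 4.0])  # negative examples
y = np.hstack([np.ones(80), -np.ones(40)])
dist2, R2 = fit_neg_svdd(X, y)
acc = np.mean((dist2(X) <= R2) == (y > 0))        # agreement with the labels
```

On well-separated classes like these, even a coarse numerical solution of the dual places the negative examples outside the sphere.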

Autonomous Detection of SVDD Parameters with RBF kernel
Like most machine learning models, SVDD is strongly influenced by the choice of model parameters. It is necessary to find the best trade-off between error and covering by choosing suitable C_1 and C_2, and the best kernel parameter σ that avoids overfitting or underfitting.
For this work we focus on the RBF kernel, since it is well known to be the kernel function that performs best in applications [29].
The method used to find the best model parameters is inspired by the work presented in [32], which proposes an autonomous detection of the normal SVDD parameters based only on the training set, since in normal SVDD it is not possible to use cross-validation: only true positives and false negatives can occur during training. In our work, instead, we combined some of the techniques in [32] with cross-validation to find the best C_1, C_2 and σ parameters for negative SVDD.
The regularisation parameters C_1 and C_2 are lower bounded by 1/N_1 and 1/N_2 respectively, where N_1 is the number of target objects and N_2 the number of negative examples (N_1 + N_2 = N) [29], [30], [32]. When no errors are expected in one class of the training set we can set the corresponding C_i = 1 (i = 1, 2), indicating that all objects of the target class should be accepted (C_1 = 1) and all outliers should be rejected (C_2 = 1). The value range is therefore 1/N_1 ≤ C_1 ≤ 1 and 1/N_2 ≤ C_2 ≤ 1.

The second parameter to be optimised is the kernel width σ. For high values of σ the shape of the SVDD becomes spherical, with the risk of underfitting, while for small values of σ too many objects become support vectors and the model is prone to overfitting.

Figure 3. For too small or too high values of σ the optimization criterion λ (our metric for the 'best error') is high. Note also the behavior of the SVs, which is very similar to the one described in [29], [30].

The search for the best parameters is performed by constructing a grid over C_1, C_2 and σ, on which holdout cross-validation is performed. The optimization criterion is chosen according to [32], selecting, for each triple C_1, C_2 and σ in the grid, the parameters that minimize both the misclassification error e and the deviation of the radius R from 1. The idea behind this criterion is that minimizing the misclassification error means reducing the number of support vectors [29], [30] (and so reducing overfitting), while constraining the radius to be close to 1 means choosing a small σ [32] (and so reducing underfitting). The balance between these two terms therefore seems the best criterion for finding the best parameters (see Fig. 3).
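The grid search with a holdout split can be sketched as follows. This is a simplified Python illustration: scikit-learn's OneClassSVM is used as an RBF-SVDD stand-in (its single ν parameter plays the role of the regularisation constant, and gamma = 1/σ^2), the criterion is plain holdout misclassification rather than the paper's exact λ, and all grid values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)),        # target class
               rng.normal(0, 1, (60, 2)) + 4.0])  # negative examples
y = np.hstack([np.ones(200), -np.ones(60)])
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

best_params, best_err = None, np.inf
for nu in (0.05, 0.1, 0.2):                       # plays the role of C
    for gamma in (0.1, 0.5, 1.0):                 # gamma = 1 / sigma^2
        m = OneClassSVM(nu=nu, gamma=gamma).fit(X_tr[y_tr == 1])
        err = np.mean(m.predict(X_ho) != y_ho)    # holdout misclassification
        if err < best_err:
            best_params, best_err = (nu, gamma), err
```

The same loop structure carries over to the full (C_1, C_2, σ) grid once a negative-SVDD trainer is plugged in.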

Fast Training SVDD
Training on large data sets is a problem that affects many optimization and machine learning methods, and SVDD is no exception. To overcome this problem, a method based on iterative training on the SVs only is proposed in [4]. The method iteratively samples from the training data set with the objective of updating a set of support vectors called the master set of support vectors (SV*). During each iteration, the method updates SV* and the corresponding threshold value R^2 and center a. As the threshold value R^2 increases, the volume enclosed by SV* increases. The method stops iterating and provides a solution when the threshold value R^2 and the center a converge. At convergence, the members of the master set of support vectors SV* characterize the description of the training data set.
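The iterative sampling idea behind Fast SVDD [4] can be sketched as follows: train on a small random sample, keep its support vectors, merge them with the next sample and retrain until the threshold stabilises. OneClassSVM is again used as an RBF-SVDD stand-in (its offset plays the role of the threshold); names, tolerances and parameters are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def fast_svdd(X, sample_size=100, nu=0.1, gamma=0.5, tol=1e-3,
              max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    master = np.empty((0, X.shape[1]))        # master set of SVs (SV*)
    prev_thr, model = None, None
    for _ in range(max_iter):
        batch = X[rng.choice(len(X), sample_size, replace=False)]
        work = np.vstack([master, batch])     # merge SV* with the new sample
        model = OneClassSVM(nu=nu, gamma=gamma).fit(work)
        master = work[model.support_]         # updated SV*
        thr = float(model.offset_)            # stand-in for the threshold
        if prev_thr is not None and abs(thr - prev_thr) < tol:
            break                             # threshold has converged
        prev_thr = thr
    return model, master

rng = np.random.default_rng(2)
X_big = rng.normal(size=(20000, 2))           # "large" toy data set
model, sv_master = fast_svdd(X_big)
```

Each fit only sees the current sample plus the running SV* set, so the per-iteration cost stays bounded even for very large data sets.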

Zero FNR Regions with SVDD
The search for safety regions is a well-known task in machine learning [11], [12], [13], whose main focus is to avoid false negatives, i.e., including unsafe points in the safe region. In this section, two methods for finding zero-FNR regions are proposed: the first is based simply on reducing the SVDD radius until only safe points are enclosed in the SVDD shape; the second performs successive iterations of the SVDD on the safe region until no negative points remain.

Radius Reduction
Since the shape of the SVDD is a sphere also in the space transformed via the feature map, it is reasonable to expect that reducing the volume of the sphere reduces the number of misclassified negative points. We implemented this simple procedure in Matlab and tested it on several data sets (see Fig. 4).
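A minimal sketch of the radius-reduction idea (illustrative Python, not the paper's Matlab implementation): after training, shrink the acceptance threshold until no negative (unsafe) point falls inside the region, i.e. FNR = 0 on the labelled set. OneClassSVM's decision function stands in for R^2 − ||z − a||^2; data and parameters are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_safe = rng.normal(0, 1, (300, 2))
X_unsafe = rng.normal(0, 1, (80, 2)) + 2.5    # partially overlapping class
m = OneClassSVM(nu=0.05, gamma=0.5).fit(X_safe)

scores_safe = m.decision_function(X_safe)
scores_unsafe = m.decision_function(X_unsafe)

# accept z only when decision_function(z) >= t; the smallest t that expels
# every unsafe point is just above the highest unsafe score
t = max(0.0, float(scores_unsafe.max()) + 1e-9)

covered = float(np.mean(scores_safe >= t))    # fraction of safe points kept
fnr = float(np.mean(scores_unsafe >= t))      # zero by construction
```

Raising t corresponds to shrinking the SVDD radius: coverage of the safe class is traded for a guaranteed zero FNR on the available labels.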

SVDD Zero FNR Iterative Procedure
Here we present another algorithm for finding zero-FNR regions with SVDD. The idea is to perform successive SVDDs on the safe region found by a preliminary SVDD, so as to eliminate the unsafe points. Convergence is reached either when a fixed number of iterations is hit or when the condition on the FNR is satisfied.

Algorithm 2 ZeroFNRSVDD
The data set X × Y is divided into a training set X_tr × Y_tr and a test set X_ts × Y_ts. We implemented this algorithm in Matlab and tested it using data from [19]. Fig. 5 reports an example with a 2-dimensional Gaussian data set.
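The iterative procedure of Algorithm 2 can be sketched as follows: refit the SVDD on the points accepted by the previous round until no unsafe point is accepted or a maximum number of iterations is reached. This is an illustrative Python stand-in (OneClassSVM in place of SVDD); function names, data and parameters are assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def zero_fnr_svdd(X, y, nu=0.1, gamma=0.5, max_iter=20):
    """y = +1 safe, -1 unsafe. Returns the final model and its FNR on X."""
    Xc, yc = X, y
    model = None
    for _ in range(max_iter):
        model = OneClassSVM(nu=nu, gamma=gamma).fit(Xc[yc == 1])
        inside = model.predict(Xc) == 1
        if not np.any(inside & (yc == -1)):   # no unsafe point accepted
            break
        Xc, yc = Xc[inside], yc[inside]       # iterate on the accepted region
    fnr = float(np.mean(model.predict(X[y == -1]) == 1))
    return model, fnr

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (300, 2)),          # safe
               rng.normal(0, 1, (100, 2)) + 3.0])   # unsafe
y = np.hstack([np.ones(300), -np.ones(100)])
model, fnr = zero_fnr_svdd(X, y)
```

Each refit is driven only by the currently accepted points, so the region can only shrink toward the core of the safe class.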

RULE EXTRACTION
We now consider how to make the SVDD explainable, in order to make its inherent logic explicit and to use the extracted rules for further safety envelope tuning as in [12].
Suppose we have an information vector I and a classification problem with two classes, ω = 0 or ω = 1. Let ℵ = {(I_k, ω_k), k = 1, . . .} be a data set corresponding to the collection of events representing the evolution of a dynamical system (ω) under different system settings (I(·)). The classification problem consists of finding the best boundary function f(I(·), ·) separating the points I_k in ℵ according to the two classes ω = 0 and ω = 1. In the case of SVDD, the best boundary f is simply the shape of the hypersphere. Although the hypersphere is quite intelligible in itself (a center and a radius are enough to describe it), it is still interesting to have a rule-based description of its shape.

Logic Learning Machine
The derivation of f(I(·), ·) in a rule-based shape is made by DT and LLM (the analysis was performed through the Rulex software suite, developed and distributed by Rulex Inc. (http://www.rulex.ai/)). Both are based on a set of intelligible rules of the type if (premise) then (consequence), where (premise) is a logical product (AND, ∧) of conditions and (consequence) provides a class assignment for the output. In the present study, the two classes correspond to the presence or absence of anomalous patterns. LLM rules are obtained through a three-step process. In the first phase (discretisation and latticisation), each variable is transformed into a string of binary data in a proper Boolean lattice, using the inverse only-one code binarisation; all the strings are then concatenated into one single large string per sample. In the second phase (shadow clustering), a set of binary values, called implicants, is generated, allowing the identification of groups of points associated with a specific class. (An implicant is defined as a binary string in a Boolean lattice that uniquely determines a group of points associated with a given class. It is straightforward to derive from an implicant an intelligible rule having in its premise a logical product of threshold conditions based on cut-offs obtained during the discretisation step. The optimal placement of these cut-offs is therefore an important phase for extracting the highest information gain before clustering [2].) During the third phase (rule generation), all the implicants are transformed into collections of simple conditions and eventually combined into a set of intelligible rules. The reader interested in shadow clustering and in algorithms for efficient rule generation is referred to [15] and the references therein.

Rules extraction from SVDD
As far as SVDD is concerned, intelligible rules are derived as follows: after an SVDD has been computed and tested, a new data set of observations is generated and classified via the SVDD. The resulting labeled data set is then exported to Rulex, and an LLM algorithm with zero error, or a DT algorithm, is run over the data, yielding the set of intelligible rules. Algorithm 3 summarizes the procedure:

Algorithm 3 IRulesSVDD
1: Apply Algorithm 1 or Algorithm 2 to the X × Y data set
2: Randomly generate a new data set X_new as a copy of X
3: Classify X_new into Y_new with [a, R^2] from Algorithm 1/Algorithm 2
4: Apply the LLM/DT algorithm
5: Find an explained safety region R
6: return R

For example, in the case of vehicle platooning (see Section 4), among the first three rules by covering (i.e. the number of points covered by a rule r) extracted with LLM from SVDD (Algorithm 2) there is: if ((N < 6) ∧ (PER > 0.08 ∧ PER ≤ 0.46)) then safe. As in [12], we applied these rules with the goal of maximizing the number of safe points (that is, the number of points in the target class) while keeping the FNR at zero. This is possible by performing rule tuning as in [12], but SVDD allows for much more flexibility. Figure 6 shows, as an example, a summary of the rules extracted with LLM from SVDD (Algorithm 2) in the case of vehicle platooning (see Section 4.2). Each circle represents a rule, and the larger the circle, the larger the number of points covered by the respective rule. In this example the classification is into two classes, green and red, and the input features are shown in the outer crown. The high number of rules is an indication of the complexity of the system: in a two-dimensional analogy, a large number of rectangles (rules) is needed to best approximate the complicated shape of the SVDD. We discuss these concepts in more detail in Section 4, dedicated to applications.
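Algorithm 3 can be sketched with a decision tree in place of LLM (Rulex is proprietary): label a fresh data set with the trained SVDD, fit a DT on the SVDD labels and read off the rules. OneClassSVM again stands in for the SVDD of Algorithm 1/2; names, depth and data are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))
svdd = OneClassSVM(nu=0.1, gamma=0.5).fit(X)   # stand-in for Algorithm 1/2

X_new = rng.normal(size=(500, 2))              # fresh data set, copy-like of X
y_new = svdd.predict(X_new)                    # classification via the SVDD

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_new, y_new)
rules = export_text(tree, feature_names=["x1", "x2"])   # intelligible rules
agreement = float(np.mean(tree.predict(X_new) == y_new))
```

Each root-to-leaf path in `rules` is one if-then rule (a logical product of threshold conditions), and `agreement` measures how faithfully the rule set mimics the SVDD boundary.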

APPLICATIONS
Finally, in this section we investigate how SVDD works in real classification problems. We first focus on a simple example concerning the stability certification of dynamical systems through the ROA [14], where the focus is on the performance of rule extraction, and then move on to a much more complex and safety-relevant automotive example of a cyber-physical system [22]: vehicle platooning [23].

ROA inference
The concept of Region of Attraction (ROA) is fundamental in the stability analysis of dynamical systems [20], [35], and it is topical when the safety of cyber-physical systems should be preserved with zero (probabilistic) error [12], [13].
The ROA is typically derived through the level sets of Lyapunov functions, but here we estimate it through negative SVDD: we define the target class as the set of stable points and the negative class as the unstable ones. We consider the Van der Pol oscillator in reverse time: the stability region is depicted in blue in Figure 7. The system has one equilibrium point at the origin and an unstable limit cycle on the border of the true ROA. The simulation of the dynamical system is developed in C [18], and the data set is composed of 300000 points (x_1, x_2) with the corresponding labels (+1 stable, −1 unstable). Due to the large size of the data set, a Fast SVDD as in Section 2.3 is required. We implemented the negative SVDD and tested it on this data set: we obtained good results (in terms of zero FNR) without using Algorithm 1 or Algorithm 2, thanks to the good separation between the two classes. Figure 7 shows the SVDD shape (in yellow) together with the performance indices, where ACC = (TP + TN)/(TP + TN + FP + FN) is the accuracy of the model, FNR = FN/(FN + TP) is the False Negative Rate and FPR = FP/(FP + TN) is the False Positive Rate. A set of intelligible rules is then extracted as described in Section 3 (LLM and DT) and tested on several extractions of data sets of different sizes (see Figure 8), all drawn from the same data set [18], with the aim of profiling the largest region in terms of "safe points", that is, the precision on the target class TP/(TP + FP). We made 10^3 successive extractions from the data set (with different sizes, from 8% up to 50% of the total points): for each of them the FNR is almost zero and the precision on the target class is high, i.e. there is a good percentage of safe points. We can see that the performance of the rules extracted with DT after applying SVDD is clearly inferior to the others.
This is due to the fact that DT generates fewer rules than LLM, and the constraint imposed by the shape of the SVDD prevents it from generating rules with high coverage (i.e., it produces only small rectangles).
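The data-generation step for this experiment can be sketched by integrating the Van der Pol oscillator in reverse time and labelling an initial condition +1 (stable) when its trajectory converges to the origin. This is an illustrative Python sketch, not the paper's C simulator [18]; μ, the escape radius, the horizon and the test points are assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

MU = 1.0

def vdp_reverse(t, x):
    # time-reversed Van der Pol: the origin is asymptotically stable,
    # the (now unstable) limit cycle bounds the true ROA
    return [-x[1], -(MU * (1 - x[0] ** 2) * x[1] - x[0])]

def escaped(t, x):
    # stop early once the trajectory has clearly left the ROA
    return np.linalg.norm(x) - 10.0
escaped.terminal = True

def label(x0, t_end=20.0):
    sol = solve_ivp(vdp_reverse, (0.0, t_end), x0, events=escaped, rtol=1e-6)
    # converged to the origin -> stable (+1), otherwise unstable (-1)
    return 1 if np.linalg.norm(sol.y[:, -1]) < 1e-2 else -1

labels = {(0.5, 0.5): label([0.5, 0.5]),   # well inside the ROA
          (3.0, 3.0): label([3.0, 3.0])}   # outside the limit cycle
```

Sweeping `label` over a grid of initial conditions yields the (x_1, x_2, ±1) triples on which the negative SVDD is trained.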

Vehicle Platooning
Vehicle Platooning (VP) is taken as a reference here as being representative of one of the most challenging CPS of the automotive sector [22]. The main goal in VP is finding the best trade-off between performance (i.e. maximising speed and minimising vehicles reciprocal distance) and safety (i.e. avoiding collision) [9]. Most of the literature on this topic focuses on advanced control schemes while abstracting the communication medium. Delay of communication is typically considered as fixed or described through probabilistic models. This allows the analytical derivation of stability models under some hypotheses of the dynamical system [17], but it may be unreliable under realistic conditions. Two branches are evident from the literature in this respect: the derivation of simple models of the delay bound that guarantees safety (see, e.g. Section IV.C of [34]) and extensive simulation with visualisation of safety regions under subsets of parameters when addressing realistic communication [7], [27], and realistic vehicles [25].
The following scenario is considered. Given the platoon at a steady state of speed and reciprocal distance of the vehicles, a braking is applied by the leader of the platoon [25], [34]. The behaviour of the dynamical system is investigated with respect to the following metrics. Safety refers to collisions between adjacent vehicles (in the study, a collision is actually registered when the reciprocal distance between vehicles reaches a lower bound, e.g. 2 m). For both safety and driving comfort, string stability (SS) is also important: it means that speed and acceleration fluctuations should be attenuated downstream along the string of vehicles.

Figure 9. Scatter plots of the quantities of the platooning dynamical system as in [11], [12], [13]. Non-collision points are plotted in blue, collision points in red.

The dynamics of the system are generated by the following differential equations [34]:

d'_i = v_{i−1} − v_i,   m_i v'_i = F_i − a_i − b_i v_i^2,

where v_i is the speed of vehicle i, m_i the mass of vehicle i, d_i the distance of vehicle i from the preceding vehicle i − 1, a_i the tyre/road rolling resistance, b_i the aerodynamic drag and F_i the control law. The behaviour of the dynamical system is synthesised by a feature vector whose components are as follows: N + 1 is the number of vehicles in the platoon (subscript i = 0 denotes the leader); ι = [d, v, a] are the vectors of reciprocal distance, speed and acceleration of the vehicles, respectively (ι(0) denotes the quantities sampled at time t = 0, after which a braking force is applied by the leader [25]; simulations are set up so as to manage possible transient periods and achieve a steady state of ι before applying the braking); m is the vector of weights of the vehicles; F_0 is the braking force applied by the leader; q is the vector of quality measures of the communication medium (fixed delay and packet error rate (PER) are considered in the study); and p is the vector of tuning parameters of the control scheme.
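A minimal longitudinal platoon simulation consistent with the quantities named above can be sketched as follows. This is an illustrative Python sketch: the control law F_i and all numerical settings (masses, gains, braking force, gap) are assumptions, not the model of [34]; the leader brakes at t = 0 and a run is labelled unsafe when any reciprocal distance drops below 2 m.

```python
import numpy as np
from scipy.integrate import solve_ivp

N_FOLLOW, M, D0, V0 = 3, 1500.0, 20.0, 25.0   # illustrative settings
A_ROLL, B_DRAG, F_BRAKE = 200.0, 0.5, -6000.0
KP, KD = 800.0, 1200.0                         # toy control gains

def platoon(t, s):
    v = s[:N_FOLLOW + 1]                       # speeds, index 0 = leader
    d = s[N_FOLLOW + 1:]                       # gaps d_i to vehicle i-1
    dv = np.empty(N_FOLLOW + 1)
    # leader brakes until it stops
    dv[0] = (F_BRAKE - A_ROLL - B_DRAG * v[0] ** 2) / M if v[0] > 0 else 0.0
    for i in range(1, N_FOLLOW + 1):
        # toy gap-and-speed tracking law, NOT the control scheme of [34]
        F = KP * (d[i - 1] - D0) + KD * (v[i - 1] - v[i])
        dv[i] = (F - A_ROLL - B_DRAG * v[i] ** 2) / M
    dd = v[:-1] - v[1:]                        # d_i' = v_{i-1} - v_i
    return np.concatenate([dv, dd])

s0 = np.concatenate([np.full(N_FOLLOW + 1, V0), np.full(N_FOLLOW, D0)])
sol = solve_ivp(platoon, (0.0, 15.0), s0, max_step=0.05)
min_gap = float(sol.y[N_FOLLOW + 1:].min())
collision = min_gap < 2.0                      # safety label for this run
```

Sweeping the initial speed, gap, braking force and control gains over a grid and recording the `collision` label for each run produces a data set of the kind used for the SVDD safety-region analysis.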
Our goal is to determine the largest region of parameters with no false negatives (i.e. points predicted as no-collision that correspond to a collision in reality). To do this, we applied the two algorithms proposed in Section 2.4 to the sample of 7567 points described above (a Fast SVDD is used, see Section 2.3), using an RBF kernel with C_1 = 1, C_2 = 1 and σ determined by cross-validation. The results are shown in Table 1, where FNR is the usual False Negative Rate, % safe is the percentage of safe points (computed as the precision on the positive class, TP/(TP + FP)), #iter the number of algorithm iterations, #time (s) the time in seconds to convergence, R^2 the squared radius of the hypersphere, and #SV the number of support vectors determined.
We then tested the performance of the algorithms on 10^3 different extractions of subsets with sizes ranging from 8% to 50% of the total points available for testing; 11 × 10^3 trials in total. We compared them with LLM and DT as in [12] (see Figure 10), which required a rule extraction step (see Section 3). LLM and DT are tuned according to [12] (Section 4.4). The procedure can be briefly summarized as follows: (1) manually inspect the most relevant regions for safety; (2) train LLM/DT with zero error when developing the rules; (3) progressively remove unsafe points from the original data set until only safe points are obtained.
The analysis shows that SVDD yields the best safety region in the chosen ranges of parameters: up to 70% of safe points with almost zero FNR for Algorithm 1, and up to 80% for Algorithm 2. The comparison with the other methods shows that the rules extracted from SVDD are the better ones, but due to the complex form of the SVDD boundary function a higher number of them is required: with LLM, 674 rules for Algorithm 1 and 771 for Algorithm 2; with DT, 229 rules for Algorithm 1 and 314 for Algorithm 2; against 5 rules for LLM and 3 rules for DT without using SVDD. The rules are applied all together in logical OR (∨).

CONCLUSION AND FUTURE WORK
The study shows how SVDD can be a very useful method for identifying safety regions, even in complex applications such as VP. This paper also provides a detailed methodology for dealing with practical problems in machine learning, such as parameter tuning and the handling of large data sets. In addition, a more thorough explanation of negative SVDD has been provided. The proposed approach could thus be applied to a wide range of applications. In the future, it will be interesting to study a method for direct rule extraction from SVDD, like the one developed for SVMs in [16].