GAMA: A Multi-graph-based Anomaly Detection Framework for Business Processes via Graph Neural Networks

INTRODUCTION
Anomaly detection, also known as outlier detection or novelty detection, focuses on identifying rare, unexpected and suspicious instances within a swarm of normal data points [1]. With a wide range of applications, including spam detection, financial fraud detection, and intrusion detection in cybersecurity [2], it has major implications for assisting practitioners and decision-makers in discovering, managing, and avoiding anomalous patterns in data. With recent developments in information technology, enterprises are increasingly relying on process-aware information systems (PAISs) to manage their operations. However, anomalies in processes are inevitable due to numerous root causes, such as system failures and operator errors. Detecting anomalies in business processes is critical to the successful operation of a business. Moreover, low-quality event logs containing anomalies hinder our ability to extract valuable information from them. For example, process mining (PM) provides techniques to comprehend and enhance processes in various application fields [3]. The output of process mining techniques using low-quality event logs may be of poor quality, potentially reducing the accuracy of any decisions based on it. Therefore, anomaly detection techniques should be used to detect and remove anomalies from the logs.
Event logs encompass multiple perspectives such as activities, resources, data, and time, with complex intrinsic dependencies between them. At the same time, there are different categories of anomalies in event logs, such as Skip, Insert, Rework, Early, Late and Attribute [4], where the first five categories can be referred to as control flow anomalies, which break the normal control flow dependencies. Anomalies related to resources, data, and time are categorized as Attribute. Capturing complex dependencies among multiple perspectives and detecting various categories of anomalies in business processes are challenging.
Although graph-agnostic approaches can be applied directly to traces, they ignore the structural information of business processes, which may affect anomaly detection performance. Fig. 1 gives a toy example. It can be seen that the two traces differ in terms of their event sequences. However, both can be generated from the same process. Therefore, graph-based approaches, which take the structure of a business process into consideration, have a better generalization capability. Generally, in these approaches, normal process models are given or discovered from a clean log, and then conformance checking [22] is used to compare the differences between traces and the feasible behaviors of the normal process models for anomaly detection [23], [24], [25], [26]. Since anomalies can be detected from a probabilistic perspective, process models are extended with probabilistic information in the form of a likelihood graph [27], Bayesian networks [28], hidden Markov models (HMMs) [29], [30] and variable order Markov models (VOMMs) [31] for anomaly detection. However, these approaches all rely on a clean dataset to construct a normal process model, which is often not feasible in practice. With the rise of graph neural networks, anomaly detection based on graph encoding has aroused researchers' interest [32]. However, the graphs derived from traces often rely solely on control flows, neglecting attribute information. While attributes can be incorporated as edge features in the graph, the underlying relationships between attributes and control flows, specifically the patterns of attribute changes with respect to control flows, are not adequately captured. As a result, existing graph neural network-based methods are ineffective in detecting anomalies at the attribute level. Additionally, the graphs extracted directly from individual traces lack crucial structural information about the underlying process.
To address the aforementioned challenges, in this paper, we propose a multi-Graph-based Anomaly detection fraMework for business processes via grAph neural networks, named GAMA, to detect both trace-level and attribute-level anomalies effectively. To obtain a graph with comprehensive structural information about the process, GAMA introduces a unique approach: it derives a multi-graph for each trace by constructing a global graph using the entire event log. This global graph contains detailed structural information that encompasses the entire process. Moreover, within the multi-graph structure, GAMA adopts a meticulous modeling methodology where each attribute is represented as an independent graph. This approach ensures that attribute information is accurately captured and reflected, resulting in a comprehensive representation of the data. GAMA leverages graph neural networks (GNNs) [33] to learn the graph embeddings. These GNNs are integrated into a multi-graph encoder, enabling the extraction of hidden representations for the nodes (attribute values). These hidden representations are subsequently decoded into probabilistic maps using a multi-sequence decoder to detect anomalies. To effectively capture the intrinsic relationships between attributes, GAMA incorporates an attention mechanism that operates across the graphs for each attribute. Three teacher forcing styles that take into account the unique characteristics of business processes are introduced. These customized techniques are specifically designed to accommodate the unique properties and intricacies of business processes. By leveraging these tailored teacher forcing methods, GAMA improves the accuracy and effectiveness of the decoding process.
The main contributions of our work are as follows:
• We propose a novel method to convert a trace into a multi-graph which proficiently captures both the structural information of the process and the attribute information in a comprehensive manner.
• We design the framework GAMA, which uses a multi-graph encoder and a multi-sequence decoder on the multi-graph to detect anomalies in terms of reconstruction errors.
• Considering the characteristics of business processes, three teacher forcing styles are designed to enhance the capability to reconstruct normal behavior and improve detection performance.
• We perform extensive experiments on both synthetic and real-world datasets. The experiment results show that GAMA achieves the best performance in detecting trace-level and attribute-level anomalies.
The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 gives the preliminaries and notations used in our study. Then the proposed framework GAMA is elaborated in Section 4. Section 5 reports the experiment results on synthetic logs and real-life logs. Finally, we conclude our work and discuss future work in Section 6.

RELATED WORK
Anomaly detection approaches in business processes can be graph-agnostic or graph-based.A brief summary of the related work is given in Table 1.

Graph-agnostic Approaches
Some authors propose graph-agnostic approaches that simply treat traces as sequences. These approaches can be further divided into three categories, i.e., i) machine learning-based, ii) information theory-based, and iii) reconstruction-based approaches.
Machine Learning-based Approaches. These approaches detect anomalies using traditional machine learning anomaly detection algorithms. In [5], the normalized longest common sub-sequence (NLCS) between traces is calculated and k nearest neighbor (KNN) is applied to detect anomalies. The LOF algorithm based on k nearest neighbor imputation (KNNI) is used in [6] for real-time business process monitoring to predict abnormal termination. Outlier-aware clustering algorithms are used in [7]. In [8], [9], activities and traces are considered as words and sentences, respectively, and traces are encoded as vectors using the word2vec [43] encoding method. After encoding, the random forest algorithm is applied in [8] to classify traces into normal and abnormal. Similarly, the authors in
[9] apply a one-class SVM to classify traces into normal and abnormal.
Information Theory-based Approaches. In statistics, leverage is a widely used metric to detect anomalies. However, traditional leverage-based approaches cannot be applied to detect anomalies in business processes. In [11], [12], a new approach based on statistical leverage [44] is proposed to detect trace-level anomalies in business processes, which also takes into account that traces may be of different lengths. The authors in [10], [13] apply statistical leverage to the online detection of anomalous traces in business process event streams.
Reconstruction-based Approaches. These methods first train a model that can reconstruct normal behaviors, and then detect anomalies based on the reconstruction error. An autoencoder is used in [14], [15], [16], [17], [18] to detect anomalies. GRASPED [18] introduces a well-designed teacher forcing method and an attention mechanism to improve its detection performance. Guan et al. [34] employ an autoencoder as a feature extractor and utilize a multilayer perceptron (MLP) as an anomaly score generator, effectively harnessing the potential of a limited number of labeled anomalies. Since a model obtained by an autoencoder frequently runs the risk of over-fitting, a denoising autoencoder is applied in [19], [20]. Inspired by the use of LSTMs [45] in [46], [47], [48] to predict the next event in an event sequence, BINet is proposed in [4], [21], which is based on gated recurrent units (GRUs) [49] to predict the next event in the event sequence. BINet detects anomalies by predicting the probability of the attributes of the next event; if the probability is lower than a threshold, the attribute is detected as an anomaly.
Graph-agnostic approaches can be applied directly to traces.However, a business process is a structured set of activities that have inherent relationships.These approaches do not take this information into account, which can limit the performance improvement of anomaly detection.

Graph-based Approaches
Graph-based approaches rely on a graph that models the relationships among activities to detect anomalies.
Behavior Conformance-based Approaches. A straightforward graph-based approach is to utilize conformance checking [22] to compare the differences between traces and the legitimate behaviors of corresponding process models [35], [36], [37], [38], [39], [40], [41], [42], which are provided or discovered from a clean log using process discovery algorithms [50]. In addition to relying on process models alone, the authors in [23], [24], [25], [26] combine process models and association rule learning to detect and analyze anomalies in business processes to improve the accuracy of anomaly detection.
Probabilistic Graphical Model-based Approaches. A process model can be extended with probabilistic information, such as the likelihood of activity execution transitions. In [27], traces are mapped onto a likelihood graph. The probability of an attribute value is used to determine whether it is anomalous. In [28], Bayesian networks are automatically inferred from Petri nets, which allows the detection of non-obvious and interdependent temporal anomalies. In [29], the event logs are analyzed using a hidden Markov model (HMM). Moreover, in [30], three sequence analysis techniques based on windowing, the Markov model and the hidden Markov model are used to detect anomalies in business processes. Variable order Markov models (VOMMs) are used in [31] to predict the probability of the execution of each activity in a trace.
Graph Neural Network-based Approaches. With the successful applications of graph neural networks (GNNs), researchers have started to apply these techniques to the anomaly detection problem, since business processes can be naturally modelled as graphs. By using GNNs, graph embeddings can be obtained and anomalies can be detected in terms of reconstruction errors. For example, GAE uses an edge-conditioned convolution (ECC) to obtain a better graph encoding and then detects anomalies [32]. However, current GNN-based approaches often construct a graph primarily based on control flow, neglecting the proper integration of attributes into the graph representation. As a consequence, they fall short in detecting attribute-level anomalies effectively. Moreover, the straightforward graph generation from individual traces further exacerbates the issue by lacking crucial structural information about the underlying process. These limitations hinder the comprehensive detection of anomalies. GAMA addresses these limitations. Unlike methods that rely on a single graph to model control flows, GAMA constructs multiple graphs, one for each attribute, to ensure more comprehensive embeddings. A significant advantage of GAMA lies in its generation of a multi-graph derived from a global graph constructed using the entire event log. This global graph integration enables GAMA's multi-graph to accurately reflect the structural information of the process, leading to more precise and insightful analyses. Additionally, GAMA captures the intrinsic relationships between attributes through attention layers that operate across the graphs for each attribute.

PRELIMINARIES
Following the widely accepted notations, we adopt bold uppercase characters (e.g., A) to denote matrices, bold lowercase characters (e.g., b) to indicate vectors and normal lowercase characters (e.g., c) as scalars.
In this section, we formalize the definition of log.

Log
As a foundation for the following sections, we first define the terms log, trace, event, and attribute.

Definition 3.1 (Log, Trace, Event). Let A = {a_1, a_2, ..., a_A} be a set of attributes, where A = |A| represents the number of attributes. V_a is the set of possible values for the attribute a ∈ A. An event e = (v_1, v_2, ..., v_A) is a tuple with one value for each attribute, where v_i ∈ V_{a_i}. A trace t is a sequence of events, and an event log L is a set of traces. Note that |t| is the number of events in trace t, and |L| is the number of traces in log L.

It should be noted that Activity is a special attribute reflecting the control flow in attribute set A. Let t_{e,a} denote the value of attribute a of event e in trace t, and e_a denote the value of attribute a of event e.

METHODOLOGY
In this section, we elaborate on the proposed framework GAMA.

Architecture of GAMA
The architecture of GAMA is shown in Fig. 2. The entire framework is based on an autoencoder-like unsupervised deep learning model, which consists of four essential components: i) graph generator: it converts a trace t into multiple graphs, where each attribute corresponds to one graph. For the sake of illustration, we default the first attribute a_1 to Activity (control flow perspective); ii) multi-graph encoder: it leverages graph attention networks (GATs) [51] to embed the graph corresponding to each attribute, obtaining the hidden representation of each node; iii) multi-sequence decoder: it attempts to reconstruct the values of each attribute of each event in trace t; iv) anomaly score calculator: it calculates anomaly scores for each attribute of each event in trace t.
Inspired by the teacher forcing method [52], we provide the ground truth as input for the GRUs in the multi-sequence decoder. Three different teacher forcing styles are proposed to guide the reconstruction of attribute values.
Obviously, the combination of a multi-graph encoder and a multi-sequence decoder forms a straightforward autoencoder that effectively accomplishes the processes of compression and reconstruction. In other words, it compresses the input trace t into low-dimensional representations and subsequently generates the probability distribution ^t p^a_e for each attribute a of every event e within trace t based on these condensed representations. Formally, the training procedure of the autoencoder can be characterized as minimizing the following reconstruction errors (i.e., cross-entropy loss [53]):

Loss = − Σ_{t ∈ L} Σ_{e ∈ t} Σ_{a ∈ A} log ^t p^a_e[t_{e,a}],

where ^t p^a_e[t_{e,a}] denotes the probability that the value of attribute a of event e in trace t is t_{e,a}.
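The loss above can be sketched in a few lines. This is a minimal illustration with hypothetical data structures (nested lists of per-attribute probability dictionaries), not the paper's implementation:

```python
import math

def reconstruction_loss(prob_maps, trace):
    # prob_maps[e][a]: predicted distribution (value -> probability) for
    # attribute a of event e; trace[e][a]: the ground-truth value.
    loss = 0.0
    for e, event in enumerate(trace):
        for a, value in enumerate(event):
            loss -= math.log(prob_maps[e][a][value])  # -log p of the truth
    return loss

# a two-event trace with two attributes (Activity, Resource)
trace = [("a", "r1"), ("b", "r2")]
probs = [
    [{"a": 0.9, "b": 0.1}, {"r1": 0.8, "r2": 0.2}],
    [{"a": 0.2, "b": 0.8}, {"r1": 0.3, "r2": 0.7}],
]
loss = reconstruction_loss(probs, trace)
```

Perfectly reconstructed values (probability 1) contribute zero loss; the lower the probability assigned to the ground truth, the larger the loss.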
Numerous previous studies [14], [15], [19], [54] have concluded that the magnitude of reconstruction errors is a powerful indicator of anomalies since anomalies do not follow the patterns of the majority and cannot be precisely reconstructed.Therefore, based on this indicator, anomaly scores can be computed.
For ease of reference, this section focuses solely on the reconstruction process of a single trace. Therefore, we simplify the notation by omitting the trace identifier, denoting ^t p^a_e as p^a_e.

Graph Generator
By utilizing existing process discovery techniques [55], [56], [57], logs can be converted to graph-based representations.However, it is non-trivial to convert a trace into a graph-based representation.Recognizing that anomalous patterns inherently exhibit distinct behavior across various attributes, we generate a graph for each attribute respectively.
We construct multiple directed graphs for trace t in log L to reflect the structural process information as follows. When generating the directed edges of the graphs, we only consider the control flow perspective. First, we compute the number of occurrences of each directly-follows relation (b, c), a measure employed by the α-algorithm [55]; this relation signifies that activity b is directly followed by activity c within the log L. Second, a directed global graph G(L) is generated, containing the activities of log L as nodes. An edge b → c is present in G(L) if and only if the number of occurrences of the directly-follows relation (b, c) is no less than β * |L|, where β is a user-chosen threshold to filter out infrequent relations (i.e., noise). Third, a directed event graph G(t) is generated, containing the events of trace t as nodes. An edge e → e′ is present in G(t) if e_Activity → e′_Activity is an edge in G(L). Fourth, to comprehensively capture the directly-follows relations between events, an edge e → e′ is present in G(t) if event e is directly followed by event e′ in trace t. Finally, a distinct graph can be derived for each attribute in A from the directed event graph G(t): we obtain multiple graphs by replacing each node of G(t), i.e., event e, with the attribute value e_a for every attribute a ∈ A.
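The graph-construction steps above can be sketched as follows. This is a simplified illustration that treats traces as activity sequences only; the function names and the toy log are ours:

```python
from collections import Counter

def build_global_graph(log, beta):
    """Global graph G(L): keep edges (b, c) whose directly-follows
    count in the log is at least beta * |log|."""
    df = Counter()
    for trace in log:  # each trace is a list of activity names
        for b, c in zip(trace, trace[1:]):
            df[(b, c)] += 1
    return {edge for edge, n in df.items() if n >= beta * len(log)}

def build_event_graph(trace, global_edges):
    """Event graph G(t): edge e -> e' if the corresponding activities
    are connected in G(L) (step three), plus the directly-follows
    edges of the trace itself (step four). Nodes are event indices."""
    edges = set()
    for i, b in enumerate(trace):
        for j, c in enumerate(trace):
            if i != j and (b, c) in global_edges:
                edges.add((i, j))
    edges.update((i, i + 1) for i in range(len(trace) - 1))
    return edges

log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
G_L = build_global_graph(log, beta=0.5)   # keeps relations seen >= 1.5 times
G_t = build_event_graph(["a", "b", "c"], G_L)
```

With β = 0.5 the infrequent relations (a, c) and (c, b), each seen once, are filtered out as noise, while (a, b) and (b, c) survive.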

Example 2 (Example 1 continued). Consider attribute set A
and trace t in Example 1. We assume that the number of occurrences of each directly-follows relation in log L is shown in Fig. 3a(i). Considering β * |L| = 50, the adjacency matrix of the directed global graph G(L) is shown in Fig. 3a(ii). Fig. 3a(iii) and Fig. 3a(iv) present the adjacency matrix of the directed event graph G(t) after step three and step four, respectively. Fig. 3b illustrates the evolution of the directed event graph G(t). The multiple graphs of trace t are shown on the right in Fig. 3b.

Multi-graph Encoder
Given the A graphs generated from trace t, where each graph contains |t| nodes, we assign a distinct one-hot embedding, positional encoding and GAT to each graph. Next, we introduce these three components in detail, using attribute a as an example. Since neural networks cannot interpret symbolic values directly, we must transform them into numbers. One-hot embedding [58], which transforms categorical variables into binary vectors, is one technique to accomplish this. After one-hot embedding, the values of the nodes are converted to |V_a|-dimensional vectors, where |V_a| is the number of possible values for attribute a.
In our architecture, we use GATs to encode the nodes (i.e., the attribute values) to obtain hidden representations of the nodes. Compared to RNNs, which are commonly applied to solve sequence problems, GATs can better aggregate the information of events that have strong correlations in the trace. For example, assume that the first event and the last event in a trace are strongly correlated (i.e., in the log, the activity of the first event is often directly followed by the activity of the last event). In the graph, there exists a directed edge from the first event to the last event, so the last event is able to aggregate the information of the first event. However, from a sequence perspective, they are far apart; RNNs will experience the vanishing gradient problem [59], and the last event cannot aggregate the information of the first event well.
The order of events is very important: if the order of events in a trace changes, the trace may become anomalous. However, GATs do not consider the order of nodes, losing the positional information, which is vital to reconstruct a complete trace. In order for the GATs to utilize the order of events, some information about the relative or absolute position of the events in the trace must be injected into the one-hot embeddings. In this work, we use the positional encoding (PE) proposed in [60]:

PE(pos, 2i) = sin(pos / 10000^(2i/d)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),

where pos is the position of the event in the trace and i indexes the dimensions of the one-hot embedding f^a_pos, whose dimensionality is d. The positional encodings are added to the one-hot embeddings. After positional encoding, a set of node features {f^a_1, f^a_2, ..., f^a_|t|} is input into the GAT, which outputs a set of hidden representations of the nodes {h^a_1, h^a_2, ..., h^a_|t|}. We average these vectors to obtain the initial hidden state of the corresponding GRU in the multi-sequence decoder:

s^a_0 = (1 / |t|) Σ_{e=1}^{|t|} h^a_e.

Overall, the multi-graph encoder outputs |t| * A hidden representations {h_1, h_2, ..., h_{|t|*A}} and A initial hidden states {s^1_0, s^2_0, ..., s^A_0} for the GRUs in the multi-sequence decoder, where |t| is the length of trace t and A is the number of attributes.
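The sinusoidal positional encoding of [60] can be sketched as follows; the length and dimension values are illustrative:

```python
import math

def positional_encoding(length, dim):
    """Sinusoidal positional encodings:
    PE(pos, 2i)   = sin(pos / 10000^(2i/dim)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))."""
    pe = [[0.0] * dim for _ in range(length)]
    for pos in range(length):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(length=4, dim=8)
# pe[pos] is added element-wise to the one-hot feature of the event at
# position pos before it enters the GAT
```

Each position receives a distinct, deterministic pattern of sines and cosines, so two nodes with identical one-hot embeddings but different positions become distinguishable to the GAT.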

Multi-sequence Decoder
In this subsection, the notation s_e = GRU(s_{e−1}, x_e) represents the process by which the input x_e at the current time step e and the hidden state s_{e−1} from the previous time step e − 1 are fed into a GRU. This GRU operation generates the updated hidden state s_e at the current time step e.
The multi-sequence decoder, in contrast to the multi-graph encoder, decodes the hidden representations into probability maps. The higher the probability of an attribute value, the more likely it is to be normal.

First, the outputs of the multi-graph encoder {h_1, h_2, ..., h_{|t|*A}} go through the attention layer to generate c^a_e. The attention mechanism serves as a vital link connecting the encoder and the decoder: it identifies which attributes of which events are relevant to the next target attribute value and gives high attention weights to those encoded attribute values.
eg^a_{ei} = (W^a_q s^a_{e−1})^T (W^a_k h_i),

α^a_{ei} = exp(eg^a_{ei}) / Σ_j exp(eg^a_{ej}),    (6)

where eg^a_{ei} represents the energy state, and W^a_q and W^a_k, which convert s^a_{e−1} and h_i into d-dimensional vectors (i.e., W^a_q s^a_{e−1} and W^a_k h_i), are learnable matrices. In Eq. (6), the energy states eg^a_{ei} computed at event e are normalized using softmax to obtain the corresponding attention weights α^a_{ei}, which intuitively reflect the significance of each encoded attribute value h_i during reconstruction. A higher value of α^a_{ei} indicates that h_i is more important for predicting the current attribute value.
Then, the context vector c^a_e can be obtained by directly weighting the outputs of the multi-graph encoder with their respective attention weights:

c^a_e = Σ_i α^a_{ei} h_i.
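A minimal sketch of this attention step, using plain Python lists and illustrative identity projections in place of the learned matrices W^a_q and W^a_k:

```python
import math

def attention_context(s_prev, H, W_q, W_k):
    """Project the decoder state and encoder outputs, compute dot-product
    energies, softmax them into weights, and form the context vector."""
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]
    q = matvec(W_q, s_prev)
    energies = [sum(qi * ki for qi, ki in zip(q, matvec(W_k, h))) for h in H]
    m = max(energies)                            # for numerical stability
    exps = [math.exp(x - m) for x in energies]
    total = sum(exps)
    weights = [x / total for x in exps]          # the alpha_ei
    dim = len(H[0])
    context = [sum(w * h[d] for w, h in zip(weights, H)) for d in range(dim)]
    return weights, context

I = [[1.0, 0.0], [0.0, 1.0]]                     # toy projection matrices
weights, c = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], I, I)
```

The encoder output most aligned with the decoder state receives the larger weight and dominates the context vector.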
Next, in order to better reconstruct the target attribute value at event e, we introduce the teacher forcing method which is widely used in the field of natural language processing.
In the case of the first attribute (Activity), the prediction of the probability distribution for the attribute value at the current event e is guided solely by the previous ground truth attribute value (i.e., activity name) t_{e−1,1}: its embedding vector tf^1_{e−1}, together with c^1_e, is input into the GRU (see the blue part in Fig. 2):

s^1_e = GRU(s^1_{e−1}, [tf^1_{e−1}; c^1_e]).

The initial hidden state of the GRU is the output s^1_0 of the multi-graph encoder. Finally, the probability distribution p^1_e, which is the probability distribution over all possible values of the first attribute (Activity) at event e, can be calculated by

p^1_e = softmax(W^1 s^1_e + b^1),

which represents the linear layer and the softmax layer.
In the case of the other attribute a, three different teacher forcing styles are proposed to guide the reconstruction of attribute values.
i) Activity name (AN): We consider that the current attribute value depends mainly on the current activity name. Therefore, at the current event e, the ground truth activity name t_{e,1} is used to guide the prediction of the probability distribution p^a_e, which can be calculated by

s^a_e = GRU(s^a_{e−1}, [tf^1_e; c^a_e]),    p^a_e = softmax(W^a s^a_e + b^a),

where tf^1_e is the embedding vector of t_{e,1}.

ii) Previous attribute value (PAV): We consider that the current attribute value depends mainly on the previous attribute value. Therefore, the previous ground truth attribute value t_{e−1,a} is used to guide the prediction of the probability distribution p^a_e, which can be calculated by

s^a_e = GRU(s^a_{e−1}, [tf^a_{e−1}; c^a_e]),    p^a_e = softmax(W^a s^a_e + b^a),

where tf^a_{e−1} is the embedding vector of t_{e−1,a}.

iii) Fusion of activity name and previous attribute value (FAP): We consider that the current attribute value depends both on the current activity name and the previous attribute value. Therefore, the fusion of the current ground truth activity name t_{e,1} and the previous ground truth attribute value t_{e−1,a} is used to guide the prediction of the probability distribution p^a_e, which can be calculated by

s^a_e = GRU(s^a_{e−1}, [tf^1_e; tf^a_{e−1}; c^a_e]),    p^a_e = softmax(W^a s^a_e + b^a).
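How the decoder input differs across the three styles can be sketched as follows. We assume concatenation as the combination operator, which may differ from the paper's exact fusion; the embedding and context values below are toy numbers:

```python
def gru_input(style, tf_act_e, tf_attr_prev, c):
    """Assemble the decoder GRU input for attribute a at event e under
    the three teacher forcing styles. Concatenation (list +) stands in
    for the fusion operator."""
    if style == "AN":    # ground-truth activity name of the current event
        return tf_act_e + c
    if style == "PAV":   # ground-truth value of this attribute at event e-1
        return tf_attr_prev + c
    if style == "FAP":   # fusion of both signals
        return tf_act_e + tf_attr_prev + c
    raise ValueError(f"unknown style: {style}")

x = gru_input("FAP", tf_act_e=[0.1, 0.2], tf_attr_prev=[0.3], c=[0.4, 0.5])
```

The assembled vector x would then be fed to the GRU together with the previous hidden state, as in s^a_e = GRU(s^a_{e−1}, x).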

Anomaly Score Calculator
The trained model can be applied to identify anomalies after the training phase. We input trace t into the trained model to obtain the probability distribution p^a_e over all possible values of attribute a of event e.
Typically, compared to a normal attribute value, the probability of an anomalous attribute value is lower. Based on this idea, the anomaly score for the value of attribute a of event e in trace t is defined as the sum of all probabilities in the probability distribution p^a_e that are greater than the probability of t_{e,a} (i.e., p^a_e[t_{e,a}]). To calculate the anomaly score for attribute a of event e, the following formula can be applied:

S_{t,e,a} = Σ_{i : p^a_{e,i} > p^a_e[t_{e,a}]} p^a_{e,i},

where p^a_{e,i} is the i-th probability in the probability distribution p^a_e.
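A minimal sketch of the anomaly score, assuming the probability distribution is represented as a dictionary from attribute values to probabilities (the distribution values below are illustrative):

```python
def anomaly_score(prob_dist, true_value):
    """Score = sum of probabilities strictly greater than the probability
    assigned to the observed value t_{e,a}."""
    p_true = prob_dist[true_value]
    return sum(p for p in prob_dist.values() if p > p_true)

dist = {"approve": 0.6, "reject": 0.25, "escalate": 0.15}
s_normal = anomaly_score(dist, "approve")    # most likely value -> score 0
s_anom = anomaly_score(dist, "escalate")     # unlikely value -> high score
```

The score is 0 for the most probable value and approaches 1 as the observed value becomes increasingly unlikely under the model.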

Anomaly Detection
By applying a threshold τ, the anomaly scores are mapped to labels 0 or 1, with 0 indicating normal and 1 indicating anomalous. We label an attribute value as anomalous whenever its anomaly score exceeds τ (i.e., attribute-level detection). The likelihood of an attribute value being anomalous increases with its anomaly score.

Example 4 (Example 3 continued).
Given that the threshold τ is 0.8, since the anomaly score S_{t,e,a} = 0.75 is less than 0.8, the corresponding label for attribute a of event e in trace t is L_{t,e,a} = 0, indicating normal.
The following rule can be utilized to adapt our method to trace-level anomaly detection: if some attribute of some event in a trace is anomalous, the trace is anomalous. Formally,

L_t = 1 if ∃ e ∈ t, a ∈ A such that L_{t,e,a} = 1; otherwise L_t = 0.
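Both levels of labeling can be sketched together; `scores[e][a]` is a hypothetical nested-list layout for the per-attribute anomaly scores of one trace:

```python
def label_attributes(scores, tau):
    """Threshold anomaly scores into 0/1 attribute labels; the trace is
    anomalous iff any attribute of any event is labeled anomalous."""
    labels = [[1 if s > tau else 0 for s in event] for event in scores]
    trace_label = int(any(l for event in labels for l in event))
    return labels, trace_label

# two events with two attributes each; one score exceeds tau = 0.8
labels, trace_label = label_attributes([[0.75, 0.2], [0.9, 0.1]], tau=0.8)
```

A single anomalous attribute (here the first attribute of the second event) is enough to flag the whole trace.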

EXPERIMENTS
In this section, we empirically evaluate the effectiveness of GAMA on both synthetic and real-life datasets.GAMA is implemented in Python, and the source code is accessible at https://github.com/guanwei49/GAMA.All experiments are conducted in an unsupervised manner, meaning that no prior knowledge of the process is provided, and clean event logs are unavailable.The models are trained on event logs containing anomalies, and subsequently, anomaly detection is performed on the same event logs.

Compared Methods
To verify the superiority of the proposed method, we compare GAMA with the following state-of-the-art methods:
• OC-SVM [61]: It transforms traces into vector representations and applies one-class SVM [62] to find anomalies.
• Naive [41]: It marks all traces that are infrequent in the event log as anomalies.
• Sampling [41]: It selects a sample of the event log and mines the process model to detect anomalies by comparing the differences between the traces and the mined process model.
• GAE [32]: It transforms traces into graphs and detects anomalies based on the reconstruction errors of the edges between nodes.
• DAE [19]: It transforms traces into vector representations and detects anomalies based on reconstruction errors in the denoising autoencoder.
• VAE [14]: It transforms traces into vector representations and detects anomalies based on reconstruction errors in the variational autoencoder.

Parameter Settings and Metrics
GAMA is implemented based on PyTorch [63]. In our experiment, we apply two-layer GATs for encoding and two-layer GRUs for decoding. The first layer of the GATs has four heads and the second layer has one head. Dropout [64] is applied after each layer of the GATs and GRUs to counteract overfitting. We initialize weights using the initialization introduced in [65], and train the proposed model for a maximum of 20 epochs with the Adam optimizer [66]. Unless otherwise stated, the hidden layer size of the GATs and GRUs is 64, the learning rate is 0.0002, and β, which controls the complexity of the graph, is 0.01.
For a fair comparison, we adopt the widely used metric F-score, which is the harmonic mean of Precision and Recall, to evaluate the performance of these methods. These metrics are defined as follows: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-score = 2 * Precision * Recall / (Precision + Recall), where TP, FN and FP represent true positives, false negatives and false positives respectively. To be fair, we use a grid search method to determine the optimal threshold value τ and use this threshold to calculate the F-score for each method. A method has excellent anomaly detection performance when its F-score is close to 1.
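The metric computation, for concreteness:

```python
def f_score(tp, fp, fn):
    """F-score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 correctly flagged anomalies, 2 false alarms, 2 missed anomalies
score = f_score(tp=8, fp=2, fn=2)
```

Being a harmonic mean, the F-score is dragged toward the weaker of the two components, so a method cannot score well by optimizing Precision or Recall alone.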

Artificial Anomalies
As in previous studies [4], [12], [19], [27], [41], we inject artificial anomalies into event logs to verify the effectiveness of GAMA. Six anomaly types in [4] which frequently arise in real business processes are manually injected. Fig. 5 shows the different anomalous traces obtained by applying the six anomaly types to a normal trace. These anomaly types are defined as follows:
• Skip: A sequence of events has been skipped.
• Insert: A sequence of random events has been inserted.
• Rework: A sequence of events has been executed a second time.
• Early: A sequence of events has been executed too early, and hence is skipped later.
• Late: A sequence of events has been executed too late, and hence is skipped earlier.
• Attribute: Some attribute value has been replaced by an incorrect value in some events.
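The control-flow anomaly types can be sketched as single-event edits on an activity sequence. This is a deliberately simplified illustration (real injections affect sequences of events, and Attribute anomalies touch other perspectives than control flow):

```python
import random

def inject_anomaly(trace, kind, rng=None):
    """Apply a simplified, single-event version of a control-flow
    anomaly type to a trace (a list of activity names)."""
    rng = rng or random.Random(0)
    t = list(trace)
    i = rng.randrange(len(t))
    if kind == "Skip":
        del t[i]                         # the event is skipped
    elif kind == "Insert":
        t.insert(i, "random_activity")   # placeholder for a random event
    elif kind == "Rework":
        t.insert(i + 1, t[i])            # the event is executed again
    elif kind in ("Early", "Late"):
        t.insert(rng.randrange(len(t)), t.pop(i))  # the event is moved
    return t
```

Skip shortens the trace, Insert and Rework lengthen it, and Early/Late preserve its multiset of activities while perturbing their order.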

Dataset
Both synthetic logs and real-life logs are applied to evaluate our method.
For synthetic logs, we adopt eight different business process models (Paper, P2P, Small, Medium, Large, Huge, Gigantic, Wide) [4] to generate synthetic logs, using simulation technology.Paper and P2P are handmade process models and the others are random process models generated by the PLG2 tool [67] with a different number of activities and varying size ranges.We also extend a likelihood graph [27] to generate causally dependent event attributes.For each process model, the number of attributes in the synthetic logs varies from 2 to 5, producing 4 * 8 = 32 synthetic logs free of anomalies.
For real-life logs, we consider six widely used event logs: i) Billing: it contains events that are related to the billing of medical services that have been provided by a hospital.ii) Receipt: it contains the records of the execution of the receiving phase of the building permit application process in an anonymous municipality.iii) Sepsis: it contains events of sepsis cases from a hospital.iv) RTFMP: it contains the execution of the road traffic fine management process.v) Permit: it contains events related to travel permits (including all related events of relevant prepaid travel cost declarations and travel declarations).vi) Declaration: it contains events related to international travel declarations.
We inject AP percent of artificial anomalies (i.e., AP percent of the traces are anomalous) into both the synthetic and the real-life logs. In our experiments, we evaluate the robustness of our method by considering different values of AP, specifically 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, and 0.45. In the end, we obtain 9 * 32 = 288 synthetic logs and 9 * 6 = 54 real-life logs.
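The injection step itself can be sketched as follows, assuming a `corrupt` function such as the anomaly transformations described above; `inject_anomalies` and its signature are our illustration, not the paper's implementation.

```python
import random

def inject_anomalies(log, ap, corrupt, seed=0):
    """Corrupt round(ap * len(log)) randomly chosen traces and return
    (trace, is_anomalous) pairs."""
    rng = random.Random(seed)
    anomalous = set(rng.sample(range(len(log)), round(ap * len(log))))
    return [(corrupt(t) if i in anomalous else t, i in anomalous)
            for i, t in enumerate(log)]

log = [["A", "B", "C"]] * 100
labeled = inject_anomalies(log, ap=0.3, corrupt=lambda t: t[:-1])
print(sum(lab for _, lab in labeled))  # 30
```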

Performance Evaluation
Table 2 reports the trace-level and attribute-level anomaly detection performance of the various approaches on the synthetic logs. We make the following observations.
In terms of trace-level anomaly detection, the proposed GAMA framework (-AN, -PAV and -FAP) achieves the best performance on all synthetic logs, which demonstrates the suitability of graph attention networks for the business process anomaly detection task. The teacher forcing style AN performs relatively poorly compared to PAV and FAP. OC-SVM has the lowest F-scores on most of the synthetic logs; as expected, it is not specifically designed for business process anomaly detection. Although DAE, VAE and LAE are autoencoder-based, similar to our approach, they do not consider structural process information, resulting in unsatisfactory detection performance. Although GAE takes advantage of GNNs, the graphs it transforms from traces do not contain information about the process underlying the event log. Sampling takes the structural process information into account and therefore performs relatively well. In terms of attribute-level anomaly detection, the proposed GAMA framework (-AN, -PAV and -FAP) also achieves the best performance on all synthetic logs, which demonstrates the superiority of GAMA. GAMA with the teacher forcing style AN achieves the best performance on all synthetic logs, which indicates that the attribute values of an event greatly depend on the activity name of the previous event in the synthetic logs. VAE performs the worst; its F-scores are always lower than 0.25.

Impact of Anomaly Percentage
Next, we evaluate the impact of the anomaly percentage. The F-score is averaged over all synthetic logs.
In terms of trace-level anomaly detection, Fig. 6a shows that GAMA-PAV has the best performance: its F-score hovers around 0.95 and barely varies with the anomaly percentage. The F-scores of Sampling, GAMA-AN and GAMA-FAP decrease as the anomaly percentage increases, which indicates that when the anomaly percentage is too high, the teacher forcing styles AN and FAP partially fail. Furthermore, as the anomaly percentage increases, the accuracy of the process model discovered by Sampling degrades, so its detection performance worsens. On the contrary, the F-scores of OC-SVM, Naive, GAE, DAE, VAE, LAE and BINet increase as the anomaly percentage increases. As expected, when a detector is weak, an increase in the anomaly percentage leads to an increase in Precision (if anomalous traces are selected at random, Precision is approximately equal to AP), and thus the F-score also increases.
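The Precision ≈ AP argument can be made concrete with a small calculation: for a detector that flags traces essentially at random, Precision equals AP while recall stays roughly constant, so the F-score grows with AP. A minimal sketch (the recall value 0.5, i.e., a detector flagging half of all traces, is an arbitrary illustration):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A random detector that flags half of the traces: precision ~= AP,
# recall ~= 0.5, so the F-score rises with the anomaly percentage.
for ap in (0.05, 0.25, 0.45):
    print(ap, round(f_score(ap, 0.5), 3))  # 0.091, 0.333, 0.474
```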
In terms of attribute-level anomaly detection, Fig. 6b shows that GAMA outperforms the other methods at every anomaly percentage. Similar to the trace-level results, the F-score of GAMA-PAV barely varies with the anomaly percentage, while the F-scores of GAMA-AN and GAMA-FAP decrease as it increases, which suggests that the teacher forcing styles AN and FAP are better suited to datasets with a low anomaly percentage.

Performance Evaluation
Table 3 reports the anomaly detection performance of different approaches for trace-level anomaly detection and attribute-level anomaly detection on real-life logs.We make the following observations.
In terms of trace-level anomaly detection, the proposed GAMA framework (-AN, -PAV and -FAP) achieves the best performance on the Billing, Receipt, Sepsis and RTFMP datasets, which demonstrates the superiority of GAMA in the anomaly detection task. GAMA-AN and GAMA-FAP achieve the best performance on all real-life datasets. However, due to the particularities of the Permit and Declaration datasets, BINet is more effective than GAMA-PAV on them, although GAMA-PAV exhibits comparable results. Specifically, in Permit and Declaration the activity name and attribute values of the current event depend primarily on the previous activity name and exhibit no long-range dependencies, which BINet handles explicitly. The GATs used in GAMA are better suited to capturing long-range dependencies; moreover, the teacher forcing style PAV, unlike AN and FAP, does not feed the activity name of the previous event into the network, so it does not outperform BINet here. Similar to the results shown in Table 2, OC-SVM and GAE perform the worst.
In terms of attribute-level anomaly detection, the GAMA framework (-AN, -PAV and -FAP) also achieves the best performance on the Billing, Receipt, Sepsis and RTFMP datasets. On Permit, GAMA-AN performs the best, but GAMA-PAV and GAMA-FAP do not outperform BINet. On Declaration, BINet performs better than GAMA. These results further indicate that the activity name and attribute values of the current event in Permit and Declaration depend primarily on the previous activity name and exhibit no long-range dependencies, which BINet handles explicitly.

Impact of Anomaly Percentage
We then evaluate the impact of the anomaly percentage. The F-score is averaged over all real-life logs. In terms of trace-level anomaly detection, Fig. 7a shows that the proposed GAMA framework (-AN, -PAV and -FAP) has the best performance; its F-score is higher than that of the other methods at every anomaly percentage, which further suggests that GAMA is more robust. Furthermore, the F-scores of all methods increase as the anomaly percentage increases. There are two likely reasons for this: i) the real-life datasets may contain natural anomalies that are detected but not labeled; as the percentage of injected anomalies increases, the share of such unlabeled anomalous traces shrinks, improving the measured detection performance; ii) an increase in the anomaly percentage results in an increase in Precision (if anomalous traces are selected at random, Precision is roughly equal to AP), and thus the F-score also increases. Although Sampling performs relatively well on the synthetic logs (see Fig. 6a), it performs poorly on the real-life logs, which illustrates the difficulty of mining real-life process models with process discovery algorithms.
In terms of attribute-level anomaly detection, Fig. 7b shows that the F-score of GAMA is significantly higher than that of the other methods at every anomaly percentage.

Critical Difference Diagram
Fig. 8a and Fig. 8b show critical difference (CD) diagrams [68] for trace-level and attribute-level anomaly detection, respectively, to visualize the results at a 95 percent confidence level. A bold horizontal line groups methods that do not exhibit significant differences.
In terms of trace-level anomaly detection, the critical differences show that GAMA, which takes full account of the structural process information within and between different attributes, performs significantly better than the other methods, and there is no significant difference in performance between the three teacher forcing styles. GAE, a method that also utilizes GNNs for anomaly detection, performs significantly worse. This discrepancy arises because GAE generates a graph directly from individual traces, which lacks structural information about the process. In contrast, GAMA derives a multi-graph for each trace from a global graph constructed from the entire event log; this global graph contains comprehensive structural information about the process, enabling GAMA to outperform the other methods.
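The contrast between GAE's per-trace graphs and GAMA's globally informed multi-graphs can be sketched as follows. This is our simplified illustration of the idea (the function names are hypothetical); the actual construction also covers multiple attribute perspectives and edge filtering.

```python
from collections import Counter

def build_global_graph(log):
    """Count directed activity-to-activity transitions over the whole log."""
    edges = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges

def trace_graph(trace, global_edges):
    """Connect two events of a trace iff their activities are connected in
    the global graph, so each per-trace graph inherits log-wide structure."""
    edges = [(i, j) for i, a in enumerate(trace) for j, b in enumerate(trace)
             if i != j and (a, b) in global_edges]
    return list(trace), edges

log = [["A", "B", "C"], ["A", "C", "B"], ["A", "B", "C"]]
nodes, edges = trace_graph(["A", "B", "C"], build_global_graph(log))
print(edges)  # [(0, 1), (0, 2), (1, 2), (2, 1)]
```

A per-trace graph built in isolation would only contain the consecutive edges (0, 1) and (1, 2); the extra edges here come from behavior observed elsewhere in the log.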
In terms of attribute-level anomaly detection, GAMA performs significantly better than the other methods, and the teacher forcing style AN is significantly better than PAV and FAP. VAE performs significantly worse than the other methods, indicating that the hidden representations of traces do not follow a normal distribution.
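The average ranks underlying a CD diagram can be computed as below. This is a minimal sketch without tie handling; a full analysis would use the Friedman test and the Nemenyi post-hoc procedure [68], and the method names and scores here are made up for illustration.

```python
def average_ranks(f_scores):
    """Average rank of each method across datasets (best F-score gets rank 1)."""
    methods = list(f_scores)
    n = len(next(iter(f_scores.values())))
    totals = {m: 0 for m in methods}
    for d in range(n):
        # Rank methods on dataset d by descending F-score.
        for r, m in enumerate(sorted(methods, key=lambda m: -f_scores[m][d]), 1):
            totals[m] += r
    return {m: totals[m] / n for m in methods}

scores = {"GAMA": [0.95, 0.93, 0.97],
          "BINet": [0.80, 0.85, 0.82],
          "OC-SVM": [0.40, 0.35, 0.45]}
print(average_ranks(scores))  # {'GAMA': 1.0, 'BINet': 2.0, 'OC-SVM': 3.0}
```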

Effectiveness of Three Teacher Forcing Styles
We evaluate the effectiveness of the three teacher forcing styles. The F-scores are presented in Table 4; the best two results are shown in bold and the best results are underlined. '-' indicates that no teacher forcing method is introduced.
Compared to not introducing a teacher forcing method, incorporating any of the teacher forcing styles significantly enhances the detection performance of GAMA. This finding is strong evidence for the effectiveness of the teacher forcing styles specifically designed for business processes. In terms of attribute-level anomaly detection, the teacher forcing style AN always has the best performance, which indicates that the current attribute values depend mainly on the current activity name. As expected, the performance of the teacher forcing style FAP usually lies between AN and PAV and is not far from the best results, which makes FAP, in our view, the most worthwhile teacher forcing style.
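The general mechanism of teacher forcing, feeding the ground-truth previous token into the decoder instead of the model's own prediction, can be sketched generically. The sketch illustrates the principle only, not the specific AN, PAV or FAP styles; the toy model and its deliberate error are ours.

```python
def decode(step, targets, h0, teacher_forcing=True):
    """Generic decoding loop: at each step the model sees either the
    ground-truth previous token (teacher forcing) or its own prediction."""
    h, prev, out = h0, targets[0], []
    for t in range(1, len(targets)):
        pred, h = step(prev, h)
        out.append(pred)
        prev = targets[t] if teacher_forcing else pred
    return out

# Toy model that predicts the next letter but always errs after 'B'.
step = lambda tok, h: ("X" if tok == "B" else chr(ord(tok) + 1), h)
print(decode(step, list("ABCD"), None, teacher_forcing=True))   # ['B', 'X', 'D']
print(decode(step, list("ABCD"), None, teacher_forcing=False))  # ['B', 'X', 'Y']
```

With teacher forcing, the single mistake stays local; without it, the error propagates into every later step, which is precisely why teacher forcing helps a reconstruction model learn normal behavior.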

Ablation Study on Positional Encoding
It is well known that GATs do not consider the order of nodes and thus lose positional information, which is vital for reconstructing a complete trace. We verify whether introducing positional encoding provides useful information that improves the detection performance of GAMA.
From Table 5, we can see that the introduction of positional encoding (PE) improves the detection performance of GAMA (-AN, -PAV and -FAP) for both trace- and attribute-level anomaly detection, on both synthetic and real-life datasets. This suggests that GATs are indeed position-insensitive and that positional encoding significantly enhances their ability to encode graphs transformed from sequence-like event traces.
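One standard way to inject such positional information is the sinusoidal encoding of Vaswani et al., added to each node's features before the GAT layers. Whether GAMA uses exactly this scheme is not restated here, so treat the sketch as illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: a unique vector per event index,
    added to node features so position-insensitive GATs can recover order."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=8, d_model=4)
# A node's input would then be feature_vector[pos] + pe[pos].
```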

Impact of β
Note that the parameter β controls the number of directed edges in the generated multi-graphs. In this section, we investigate the impact of β on the anomaly detection performance by varying its value. The F-score is averaged over all the Small logs. From Fig. 9a and Fig. 9b, we can see that the F-scores of GAMA (-AN, -PAV and -FAP) first increase and then decrease as β increases, for both trace- and attribute-level anomaly detection. As expected, when β is small, some directed edges between nodes without any real relationship are retained in the generated graph, and these edges substantially impair the GATs' ability to encode the nodes. As β increases, such meaningless edges are removed, but edges between genuinely related nodes are also incorrectly removed, preventing the GATs from aggregating information from some useful nodes. Although the performance of GAMA varies with β, the variation never exceeds 0.025, which means our proposed GAMA is not overly sensitive to β.
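One plausible reading of β is a relative frequency cutoff on the edges of the global graph. The sketch below is our assumption, not necessarily GAMA's exact rule: it keeps a directed edge only if its count is at least β times the count of the most frequent outgoing edge of the same source node.

```python
from collections import defaultdict

def prune_edges(edge_counts, beta):
    """Keep edge (src, dst) only if its count is >= beta times the count of
    src's most frequent outgoing edge (assumed semantics of the threshold)."""
    max_out = defaultdict(int)
    for (src, _), c in edge_counts.items():
        max_out[src] = max(max_out[src], c)
    return {e: c for e, c in edge_counts.items() if c >= beta * max_out[e[0]]}

edges = {("A", "B"): 90, ("A", "C"): 5, ("B", "C"): 40}
print(prune_edges(edges, beta=0.1))  # {('A', 'B'): 90, ('B', 'C'): 40}
```

A small β keeps the rare edge A→C (likely noise); a large β would also start dropping genuine but less frequent transitions, matching the rise-then-fall curve described above.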

Impact of Hidden Layer Size
Next, we explore the impact of the hidden layer size on detection performance. The F-score is averaged over all the Small logs.
From Fig. 10a and Fig. 10b, we can see that the F-scores of GAMA (-AN, -PAV and -FAP) first increase and then decrease as the hidden layer size increases, for both trace- and attribute-level anomaly detection. As expected, the capacity of the model is limited by the hidden layer size. If the hidden layer size is too small, the model capacity is insufficient and the model cannot learn useful information. If the hidden layer size is too large, the model capacity becomes excessive and the model also learns to reconstruct anomalous behaviors.

CONCLUSIONS AND FUTURE WORK
In this paper, we propose GAMA, a multi-graph-based anomaly detection framework for business processes via graph neural networks. Our approach comprehensively incorporates structural process information by transforming each trace into a multi-graph with the assistance of a global graph, and utilizes GNNs to effectively learn the embedding of this multi-graph. The intrinsic relationships between different attributes are captured by aggregating the multiple graphs with an attention mechanism. GAMA is trained in an unsupervised manner (i.e., no data labels are required) and is independent of any prior knowledge of the process, which makes it easy to employ. Inspired by the teacher forcing method in natural language processing, three teacher forcing styles are designed to enhance GAMA's ability to reconstruct normal behaviors and thus improve detection performance. The effectiveness of GAMA is demonstrated through experiments on both trace- and attribute-level anomaly detection on both real-life and synthetic datasets. With an appropriate hidden layer size, GAMA can still capture normal patterns even when trained on a dataset containing anomalies, and it does not require a clean dataset, which is rarely available in the real world.
In many real-world anomaly detection applications, a limited number of labeled anomalies is typically available. These labeled anomalies may initially come from deployed detection systems, e.g., some successfully detected fraudulent transactions, and can often serve as prior knowledge for training anomaly detection models. Future work will concentrate on using such a limited number of labeled anomalies to improve the performance of anomaly detection models.

Fig. 2: The architecture of GAMA (using a trace involving eight events as an example).

Fig. 4: Probability distribution, where tf^1_{e-1} is the embedding vector of t_{e-1,1}. The concatenation [c^1_e ∥ tf^1_{e-1}] of tf^1_{e-1} and c^1_e is input into the GRU (see the blue part in Fig. 2). The initial hidden state of the GRU is the output s^1_0 of the multi-graph encoder. Finally, the probability distribution p^1_e over all possible values of the first attribute (Activity) at event e can be calculated.
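The decoding step described in this caption can be sketched with a bare numpy GRU cell; all sizes, parameter matrices, and inputs below are illustrative placeholders standing in for the paper's symbols, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab = 8, 16, 5  # illustrative sizes only

# Randomly initialized GRU parameters (training is out of scope here).
Wz, Uz, Wr, Ur, Wh, Uh = (rng.standard_normal(s) * 0.1
                          for s in [(d_h, d_in), (d_h, d_h)] * 3)
Wo = rng.standard_normal((vocab, d_h)) * 0.1

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

c_e = rng.standard_normal(4)        # stands in for c^1_e
tf_prev = rng.standard_normal(4)    # stands in for tf^1_{e-1}
x = np.concatenate([c_e, tf_prev])            # [c^1_e || tf^1_{e-1}]
h = gru_step(x, np.zeros(d_h))                # zeros stand in for s^1_0
logits = Wo @ h
p_e = np.exp(logits) / np.exp(logits).sum()   # distribution over activity values
```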

• LAE [14]: It transforms traces into vector representations and detects anomalies based on reconstruction errors in an LSTM-based autoencoder.
• BINet [4]: It detects anomalies by predicting the attribute values of the next event.
DAE, VAE, LAE and BINet support both attribute-level and trace-level anomaly detection, whereas OC-SVM, Naive and Sampling only support trace-level anomaly detection.

Fig. 6: F-score under different AP over synthetic logs ((a) trace-level and (b) attribute-level anomaly detection).

Fig. 9: F-score under different β over Small logs.

TABLE 1 :
A brief summary of related works

TABLE 2 :
F-score over synthetic logs, where 'T' and 'A' represent trace- and attribute-level anomaly detection, respectively

TABLE 3 :
F-score over real-life logs, where 'T' and 'A' represent trace- and attribute-level anomaly detection, respectively

TABLE 4 :
Effectiveness of three teacher forcing styles

TABLE 5 :
Ablation study on positional encoding