APEX: Characterizing Attack Behaviors from Network Anomalies

Networks regularly face various threats and attacks that manifest in their communication traffic. Recent works proposed unsupervised approaches, e.g., using a variational autoencoder, that are not only effective in detecting anomalies in network traffic, but also practical as they do not require ground truth or labeled data. However, the problem of characterizing anomalies into different attack behaviors remains less explored; in this work, we study this specific problem. We develop APEX, a framework that employs data mining approaches in a semi-supervised way to extract attack patterns from anomalous traffic and link them to specific attack types. APEX comprises two levels of mining: the first level extracts patterns in anomalous network flows, and the second level characterizes behaviors in the extracted patterns into different attack classes. We carry out extensive experiments on real network traces obtained from the MAWI traffic archive. The evaluations demonstrate that APEX is effective in extracting distinguishable behaviors of network attacks from anomalous traffic and provides useful insights to security analysts investigating the anomalies.


I. INTRODUCTION
Networks are facing an increasing number of threats, attacks and data breaches in recent times. These are often carried out in different stages, such as reconnaissance (scanning), brute-force login attempts, malware download, C&C (command and control) communications, data exfiltration, distributed denial of service (DDoS) attacks, etc. Analyzing network traffic helps in detecting and identifying the threats and attacks faced by organizations. Past works have attempted to detect (often specific) attacks using supervised learning approaches that model attack detection as a binary classification problem [1], [2], where labeled data of two classes of network traffic, normal and anomalous, are gathered and provided for training the models. However, network traffic (of, say, enterprises) tends to be noisy due to the large number of users, the evolving landscape of applications and protocols (e.g., adoption of TLS 1.3 and HTTP/3), the increasing use of new smart devices, etc. Therefore, the definition of 'normal' changes with time and space, i.e., with the network where the model is deployed. Besides, labeling each traffic flow of a network as normal/anomalous is labour-intensive and unsustainable for model maintenance (retraining). Therefore, researchers have proposed unsupervised approaches for network anomaly detection [3], [4]. While unsupervised models are effective in detecting anomalies, they do not explain what these anomalies are. The anomalies could be due to scans, low-rate login attempts, floods, etc. Without such explanations, an analyst who investigates the anomalies would find it hard to understand the nature of the threat affecting the network.
The problem we study here goes beyond anomaly detection: how to characterize anomalies into different attack types. We assume an anomaly detector provides us with anomalies, and from there we address the problem of classifying these anomalies into attacks. In this work, we focus on this specific problem of characterizing attack behaviors from network anomalies.
We propose and develop APEX (Attack Pattern EXtractor), a novel semi-supervised framework for analyzing, detecting and identifying attack behaviors from network anomalies. Specifically, we consider an existing anomaly detection solution, e.g., GEE [3], that detects and ranks anomalies using an unsupervised model. APEX takes the detected anomalies from such a solution and proceeds to function in multiple phases. First, it defines and extracts features from anomalies that capture different kinds of network attack behaviors. Subsequently, APEX employs an efficient and effective data mining technique called FIM (frequent itemset mining) at two levels to extract and group anomalies into a small number of attack-behavioral patterns. The first level of mining extracts attack patterns from five-tuple anomalous flows in an unsupervised way. The mined patterns are then grouped according to attack types using labeled information. Subsequently, another level of mining (second level) is carried out on the grouped patterns to extract behavioral patterns of different attack classes. These behavioral patterns can be used to identify different attack types from anomalies and also provide useful insights to a security analyst. We carry out comprehensive experiments using real network traffic obtained from the MAWI archives [5]. APEX achieves comparable performance with the best-performing supervised model, while providing behavioral patterns that characterize and help visualize the different attacks.
Section II discusses the background and related works. We present an overview of the APEX framework in Section III. In Section IV, we present the APEX framework in detail, describing the features and the extraction of behavioral patterns. Section V presents the experiments and analyzes the results.

II. BACKGROUND AND RELATED WORKS
Network anomaly detection has been an important research problem for decades, and many statistical approaches have been explored for this purpose [6]-[8]. One well-known approach is to apply supervised learning models. For example, Bilge et al. [1] proposed a botnet detection system using a Random Forest (RF) classifier on NetFlow data. Pang et al. [9] applied reinforcement learning to anomaly detection based on partially labeled network anomalies. Although supervised learning is an effective approach, it requires ground truth of existing intrusions, whereas the ever-evolving nature of networks (home, enterprise, etc.) as well as of the threat landscape leads to newer threats and attacks. This trend may shift the decision boundary differentiating the pre-defined attack and benign traffic, and thus reduce the effectiveness of supervised anomaly detectors. Moreover, precise labeling of raw network traffic flows is labour-intensive and impractical [10], making it challenging to use supervised approaches in real deployments. In comparison, unsupervised approaches are often more practical for anomaly detection, given the continuous evolution of network traffic and the unavailability of ground truth. Unsupervised deep learning models such as the Autoencoder [11] and the Variational Autoencoder (VAE) [12] achieve robust performance on complex and noisy data, leading to the recent exploration of these models for network anomaly detection. For example, the authors in [13] leveraged a VAE to learn latent representations of high-dimensional and sparse network data, improving anomaly detection performance over a series of one-class classification algorithms. Kitsune [14] uses an ensemble of autoencoders to achieve unsupervised online detection of anomalous network traffic. Yet another recent work, GEE [3], presents an unsupervised anomaly detection framework using a VAE, with an additional gradient-based fingerprinting technique that attempts to explain the anomalies in network traffic.
In spite of the promising performance, most existing anomaly detection solutions tend to stop at binary classification, without looking deeper into the exact attributions of the anomalies detected, e.g., the specific attack types or different behavior patterns of mixed anomalies. Even though there are works like [15] and [16] that explore ways to explain and comprehend anomalies, those focusing on exact categorization of anomalies to attack classes are rare.
Data mining is a class of approaches for identifying data patterns, and frequent itemset mining (FIM) is one of the well-known approaches for data mining [17]. The patterns discovered using FIM are often in the form of association rules defined by a few critical feature items, which are extracted from a series of frequently occurring transactions [18]. Similarly, in network traffic analysis, each network flow can be treated as a transaction, with protocol-related and flow-level features modeled as items. The pattern extraction of anomalous traffic flows is then equivalent to mining the anomalous association rules from flow-level data. A recent work [19] proposed an FIM-based framework for detecting attacks specifically in distributed IoT networks. They mined patterns to identify spatial and temporal correlations from aggregated alerts generated by home networks; these mined patterns were then fed to a supervised classifier to detect different stages of attacks. Ozawa et al. [20] applied association rule learning to discover regular patterns of scan attacks in IoT networks. They mined the association rules of scan attacks based on the frequent items in the headers of TCP SYN packets, and used such rules to interpret the attack behaviors. Shone et al. [21] introduced a framework to extract the rules associated with anomalous traffic flows in backbone networks. They filtered the network flows by means of a set of traffic meta-data, jointly contributed by multiple histogram-based anomaly detectors, and then extracted the rules of suspicious behaviors from the filtered flows using the Apriori algorithm. However, these works still do not provide a solution to explain the anomalies or to attribute the anomalies to specific attack classes for generic networks (such as enterprises, homes, etc.). Thus, in this work, we explore a systematic way of extracting behavior patterns and identifying attacks, so as to comprehend and explain anomalies in network traffic.

III. OVERVIEW OF APEX
As APEX is based on FIM, we first give a brief background of FIM, followed by an overview of APEX. Finally, we present the threat model.

A. Frequent Itemset Mining (FIM)
In FIM, each field of a transaction is called an item, and a set of k items is called a k-itemset, where k is also referred to as the length of the itemset. Note that the terms itemset and pattern are used interchangeably here. The support of an itemset is the fraction of transactions in which the respective pattern is found. An itemset is called a frequent itemset if its support is at least a given minimum support θ (0 ≤ θ ≤ 1). Note that the support is sometimes used as an absolute value, in which case it is referred to as the support count (see Fig. 2).
FIM encompasses the mining of different sets of patterns. We can extract all possible frequent itemsets (FI), also called a "lattice". Although a lattice provides a comprehensive overview of all the patterns, the number of patterns is very high; more importantly, lower-length itemsets are usually subsets of higher-length itemsets, and thus redundant. For example, <<a,c>> and <<a,b,d>> are two subsets of the itemset <<a,b,c,d>>. Alternatively, we can mine for closed frequent itemsets (CFI) and maximal frequent itemsets (MFI), both of which are subsets of FI. The itemsets in CFI do not have any superset with the same support, whereas the itemsets in MFI do not have any frequent superset. While MFI produces fewer itemsets compared to CFI, the length of the itemsets (the number of items) in MFI, which is related to the information content, is also higher. Therefore, in APEX, we mine for MFIs in both levels of mining.
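To make these notions concrete, the following minimal sketch contrasts the full FI lattice with the much smaller MFI set. It uses the mlxtend library for illustration; APEX itself uses the FPMax algorithm [23], and any FIM implementation would do here.

```python
# Minimal FIM sketch: FI lattice vs. maximal frequent itemsets (MFI).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpmax

transactions = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "d"],
    ["a", "c"],
    ["a", "b", "d"],
]

# One-hot encode the transactions into a boolean DataFrame, as mlxtend expects.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# All frequent itemsets (the "lattice"): many, with redundant subsets.
print(apriori(df, min_support=0.5, use_colnames=True))

# Maximal frequent itemsets: no frequent superset exists; here only <<a,b,c,d>>.
print(fpmax(df, min_support=0.5, use_colnames=True))
```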
B. APEX Framework: An Overview

APEX aims to achieve two goals: i) to characterize the detected anomalies; ii) unlike supervised models, to be less dependent on labeled data, while not compromising the detection capability. Fig. 1 presents the APEX framework, consisting of a training phase and a testing (or operational) phase. Both phases operate on anomalous traffic flows, which are obtained by employing existing anomaly detection solutions.
APEX performs two levels of mining in the training phase. First, the anomalous traffic flows obtained as output from the anomaly detection solution, say GEE [3], are mined with a set of features using FIM to extract different patterns (Section IV-A). In Fig. 2, we list real example patterns mined from the anomalous flows detected by GEE on MAWI data (see Section V-A for details on the dataset used). The extracted patterns reflect different attack profiles coming from or related to different sources. We aggregate patterns from the first level of mining according to attack types such as Scan, Flood, etc. We refer to these class-wise aggregated patterns as fingerprints. For example, the mined patterns corresponding to Scan in Fig. 2 are some of the fingerprints for that particular attack class.
Next, in order to have concise patterns representing each attack class, a second level of mining is carried out on the fingerprint patterns of the respective attack classes (Section IV-B). Fig. 3 illustrates fingerprints (rows 1 to 10) and the patterns mined from them (rows I and II). Since there are many such patterns for each class, we identify the top m patterns as the behavioral patterns of each attack class and use them as identifiers in the testing phase. Examples of behavioral patterns of each class are shown in Table I. In the testing phase, anomalous five-tuple traffic flows go through the first level of mining, and the extracted patterns are compared with the behavioral patterns of the different attack classes for possible identification of attacks present in the network traffic. As the mining may extract multiple patterns (associated with multiple attack types) from the network flows, we propose and evaluate three majority-voting based matching schemes, namely Exact Match, Best Match and Overall Match, to identify the attack type originating from a source IP address (Section IV-C).

C. Threat Model
In this work, we consider different kinds of network attacks that manifest as multiple flows. These include port scans, network scans, C&C communications, volumetric attacks such as floods and DDoS, brute-force login attempts, etc. The network flows of such a multi-flow attack might occur between two hosts (point-to-point attack) or across multiple hosts (e.g., one server being attacked by a botnet). Our work focuses on detecting and characterizing such attacks. On the other hand, single-flow attacks, such as an SQL injection attack that uses just one network flow, are considered outside the scope of our work, since for a data mining framework these attacks have very low support (equivalent to that of noise).
In the context of this work, we define four high-level attack classes (Scan, Heavy, Flood, and Low-rate) based on the availability of the MAWI traffic [5]. We provide a detailed introduction of these four attack classes in Section V-A. To demonstrate the framework design and workflow of APEX, we use the MAWI traffic as an example. However, it is important to note that the APEX framework is generic and can be used with other definitions of attack classes.

IV. ATTACK MINING AND IDENTIFICATION

A. First-Level Mining: Extracting Fingerprints
The anomalous traffic flows from a network anomaly detector form the input to APEX, specifically to the first level of mining (used in both the training and testing phases). APEX takes the top-ranked anomalous source IP addresses from an anomaly detection solution and performs mining on their corresponding five-tuple flows. For mining these five-tuple anomalous network flows, we define a set of features, each of which becomes an item in FIM.

IP Destination: This feature is useful in identifying the direction of the attack. If the item is mined as a wildcard (represented by a "*"), then the source is connecting to multiple destinations, depicting a typical behavior of network scans (for example, rows 1 to 3 in Fig. 2). On the other hand, if the destination IP address is unique, the flow is a one-to-one connection; e.g., row 4 in Fig. 2. We demonstrate a pattern from Flood-like flows related to subnets later in Section V-D.
Protocol: Different protocols (e.g., TCP and UDP at the transport layer) are numbered and used as categorical attributes.

Source Port and Destination Port: Similar to protocol, we use port numbers as categorical features, since they indicate the applications in use. If a port number is repeated in several flows, it will be extracted as a frequent item for that set of flows. Some common ports, for example 80 and 443 used for HTTP(S), display such a behavior; this is illustrated in Fig. 2.

Average IP Packet Length (AIPL) and Majority Packet Length (MPL): AIPL is the average length of the IP packets in a flow, and MPL is the packet length that occurs most often in the flow; both are categorized into discrete bins (e.g., S, M, L). In Heavy flows, these features tend to be large (rows 4 to 6 in Fig. 2).

Simplified Flag Sequence: A simplified representation of the sequence of TCP flags observed in a flow; for example, Scans typically exhibit SYN-dominated sequences, which helps differentiate them from Low-rate attacks.

Flow Size in Packets (FSP): FSP, the total number of packets in a flow, is categorized into three bins. Heavy flows carry a large number of packets; on the other hand, Scans and Low-rate attacks manifest as smaller size flows. Fig. 2 illustrates this contrast between Scan and Heavy.

Flow Size in Bytes (FSB): FSB is the total size of a flow in bytes, and is also categorized into three bins. This feature is similar to FSP, but can capture subtle differences. For instance, while both Scan and Low-rate attacks usually have a small number of packets in their five-tuple flows, Low-rate flows may have larger packets due to the presence of application data (e.g., an SSH brute-force attempt would contain key-exchange messages, user authentication data, etc.).

Flow Duration: The duration of the flow is also divided into three discrete bins. In short bursty flows such as Floods, the flow duration is likely to be short, whereas in Low-rate attacks the flows could prevail for a longer period of time than Scans. Going back to the SSH brute-force example, flows may have a longer duration depending on, say, the number of user authentication attempts carried out by the attacker [22]; similar behavior is possible for other forms of dictionary attacks as well.

Mining. Once the features are extracted, the anomalous five-tuple flows represented using the above set of features are mined with FIM to extract patterns; a simplified encoding of a flow into an FIM transaction is sketched below. As mentioned earlier, we mine for MFI using the FPMax algorithm [23], as MFI tends to provide a minimal set of patterns with higher lengths (more frequent items). As the first-level mining is performed without using the attack labels, this part of the training is unsupervised. Note that we only mine flows originating from a single source IP address at a time; this aligns well with anomaly detection solutions, which often predict the anomalous sources in a given time slot, since aggregation of flows at the source level gives more information for better classification [2], [3].
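As an illustration of this encoding, the sketch below converts one five-tuple flow record into an FIM transaction of "feature=value" items. The bin boundaries and record fields here are illustrative placeholders, not APEX's actual boundaries (which come from Table III).

```python
# Illustrative sketch: encoding a five-tuple flow into an FIM transaction.
# Bin boundaries and record fields are placeholders, not the Table III values.

def bin_value(value, bounds, labels):
    """Map a numeric value to a categorical bin label (e.g., S/M/L)."""
    for bound, label in zip(bounds, labels[:-1]):
        if value <= bound:
            return label
    return labels[-1]

def flow_to_transaction(flow):
    """Represent one flow as a set of 'feature=value' items. Items that are
    not frequent across a source's flows simply drop out of the mined
    pattern, which the paper renders as a wildcard '*'."""
    return {
        f"dst_ip={flow['dst_ip']}",
        f"proto={flow['proto']}",
        f"src_port={flow['src_port']}",
        f"dst_port={flow['dst_port']}",
        f"fsp={bin_value(flow['packets'], [3, 50], ['S', 'M', 'L'])}",
        f"fsb={bin_value(flow['bytes'], [300, 50_000], ['S', 'M', 'L'])}",
        f"dur={bin_value(flow['duration'], [1.0, 60.0], ['S', 'M', 'L'])}",
    }

# A short TCP flow toward port 445, typical of a scan probe.
probe = {"dst_ip": "198.51.100.7", "proto": 6, "src_port": 40001,
         "dst_port": 445, "packets": 1, "bytes": 60, "duration": 0.0}
print(sorted(flow_to_transaction(probe)))
```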

B. Second-Level Mining: Extracting Behaviors
The second-level mining is executed only in the training phase. As the first step, we aggregate the itemsets (patterns) obtained as output from the first-level mining into different attack types. To do so, we assume the availability of labeled attack traffic. Note that labeled attack flows are regularly detected by different intrusion detection solutions as well as with the help of analysts; therefore, obtaining this small labeled set is not as difficult as obtaining labels on all flows (which is required for supervised learning). We refer to the grouped and labeled attack patterns as fingerprints of attack classes. A few real fingerprint patterns of the Scan and Heavy attack classes are shown in Fig. 2.
Motivation. The number of fingerprints for a given class can be quite large; even for one type of attack, there can be numerous fingerprint patterns from the first level of mining. For example, consider a set of bots scanning for vulnerable hosts in a subnet on a specific port, say 445. The first-level mining may generate a pattern for each of the bots, resulting in many patterns. However, another round of mining on these patterns could result in a few patterns that represent the attack behavior more concisely. The task of the second level of mining, then, is to extract the behaviors from these fingerprint patterns. Fig. 3 provides examples of second-level mining with respect to the Flood and Low-rate attack classes.

Features for second-level mining. A closer look at the patterns of different attack classes obtained from the first level of mining (refer to Fig. 2 and Fig. 3) shows distinct behaviors in some features. For instance, in Scans (rows 1 to 3 in Fig. 2), FSB, FSP, and flow duration tend to be small, whereas in Heavy (rows 4 to 6 in Fig. 2), the average IP packet length (AIPL) and FSB tend to be large in most patterns. While Scan and Low-rate attacks (rows 6 to 10 in Fig. 3) appear similar and are thus challenging to differentiate, they show differences in the flag sequences. Similarly, Heavy and Flood display similar behaviors, but items such as flow duration tend to differ between these attacks.
In addition to the features used in the first level of mining, we include the support of the extracted patterns as a feature for the second level of mining. In mining, the support indicates the number of times a captured pattern is repeated in the transaction database, and it varies with the dominance of the pattern and, correspondingly, of the attack. For instance, a flooding attack is a burst of flows within a short time period, and thus the support of the corresponding captured pattern will be high. In the second-level mining, we categorize the support into four bins (see Table III). We also include the support as a numerical feature: in some attacks, the same number of flows can be seen in the network traces (e.g., a malware might scan 10 different ports on one target before moving to another target), and the exact support value can help to detect those repetitive patterns.
If the destination IP address is a unique address (not a wildcard), we mark it as "V" before feeding it to the second level of mining, so as to capture flows with unique destinations. Rows 1, 3-5, and 6-10 of Fig. 3 illustrate such conversions, and the corresponding second-level patterns are in rows I and II.

Mining. The main objective of the second level of mining is to extract the items repeated across the patterns from the first level of mining (fingerprints). As explained before, these key items can be used as identifiers (signatures) in the detection of attack types. Similar to the first level, we mine for MFI in the second level, for the same reason that MFI generates more informative patterns of higher length. In Fig. 3, at the end of the fingerprint patterns of the Flood and Low-rate attack classes (i.e., in rows I and II), we depict the results from the second level of mining.

Out of the patterns from the second level of mining (which describe the frequent items in the attack patterns), we select the top m patterns as the identifiers that capture the behavior of the corresponding attack type. We refer to them as "behavioral patterns" and use them in the testing process as the signatures of the attack classes; a minimal sketch of the second-level step is given below. Table I provides examples of behavioral patterns with respect to each attack class.
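The following is a minimal sketch of the second-level mining over a handful of toy Low-rate fingerprints. The support bins, item names, and addresses are illustrative, not the actual boundaries of Table III.

```python
# Sketch: second-level mining over a class's fingerprints. Each first-level
# pattern becomes a transaction; its support is added both as a binned item
# and as an exact numeric item (to catch repetitive counts, e.g. "support=10").
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpmax

def fingerprint_to_transaction(pattern_items, support_count):
    # Mark concrete (non-wildcard) destinations as "V" so that flows with
    # unique destinations are captured as one shared item.
    items = {"dst_ip=V" if i.startswith("dst_ip=") and not i.endswith("*") else i
             for i in pattern_items}
    bins = [(100, "T"), (1_000, "S"), (10_000, "M")]  # illustrative 4-bin split
    label = next((lab for bound, lab in bins if support_count <= bound), "H")
    items |= {f"support_bin={label}", f"support={support_count}"}
    return items

fingerprints = [  # (first-level pattern, support count) -- toy Low-rate data
    ({"dst_ip=192.0.2.7", "proto=6", "dst_port=22", "fsp=S", "dur=L"}, 10),
    ({"dst_ip=192.0.2.9", "proto=6", "dst_port=22", "fsp=S", "dur=L"}, 10),
    ({"dst_ip=192.0.2.9", "proto=6", "dst_port=23", "fsp=S", "dur=L"}, 12),
]
txns = [fingerprint_to_transaction(p, s) for p, s in fingerprints]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(txns).transform(txns), columns=te.columns_)
print(fpmax(df, min_support=0.6, use_colnames=True))  # candidate behavioral patterns
```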

C. Attack Identification
Attack identification is carried out in the testing phase. First, similar to training, anomalous five-tuple flows are obtained from a network anomaly detection solution. From these anomalous flows, the set of features used for the first level of mining is extracted. Subsequently, these transactions are mined for MFI (first level of mining) with the same minimum support as in training. The resulting testing patterns are then cross-checked against the behavioral patterns of each class using the following matching schemes to identify the attack class. Specifically, in the matching schemes defined below, we evaluate whether the items of the behavioral patterns (obtained in the training phase) are contained in a testing pattern, as the former (which are from the second level of mining) usually have fewer items than the latter (which is from the first level of mining). For our description below, we use the behavioral patterns shown in Table I as examples.

Exact Match: A behavioral pattern matches a testing pattern only if all of its items are contained in the testing pattern; the testing pattern is assigned to the class(es) of the exactly matching behavioral pattern(s).

Best Match: For each class, the testing pattern is scored against each of the class's behavioral patterns by the fraction of the behavioral pattern's items it contains, and the highest-scoring pattern of each class is taken as that class's best match; the testing pattern is then classified to the class with the highest score. For example, given one testing pattern and the behavioral patterns in Table I, the best matching patterns are 1b for Scan, 2a for Heavy, 3a for Flood and 4a for Low-rate. The corresponding matching scores for the best matching patterns are: Scan: 1/8 (only the protocol TCP matches), Heavy: 8/9 (there is a match for all items in 2a, except for flow duration), Flood: 7/9 (all items in 3a match except for destination port and flow duration), Low-rate: 2/10 (only destination IP address and protocol match in 4a). Thus, the testing pattern will be classified as Heavy.
Overall Match: Here, the matching percentages of all m behavioral patterns of a class with the testing pattern are evaluated, and the final score of the class is taken as their average:

$$\text{score}(c) = \frac{1}{m} \sum_{j=1}^{m} \frac{|B_{c,j} \cap P|}{|B_{c,j}|},$$

where $B_{c,j}$ is the $j$-th behavioral pattern of class $c$ and $P$ is the testing pattern. For an example testing pattern matched against the behavioral patterns in Table I, Low-rate attains the highest overall score of 0.77 and is therefore selected as the attack class.
After all the patterns corresponding to a source IP address are evaluated, the source is classified to an attack class based on a majority vote, i.e., to the attack class to which the majority of its patterns are classified. In case the voting is equal for two or more attack classes, the number of matching frequent items in the corresponding behavioral patterns is considered; i.e., the attack class of the behavioral pattern(s) with more matched frequent items will be selected. Note that the behavioral patterns considered for this final decision depend on the matching scheme: in Exact Match, all the matching behavioral patterns are considered; in Best Match, the top-scoring behavioral pattern is considered; in Overall Match, all the behavioral patterns are considered. A simplified sketch of Best Match and the source-level vote is given below.
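In the sketch, patterns are modeled as plain item sets, and all names are illustrative; it is a simplification of the scheme above, not APEX's exact implementation.

```python
# Simplified sketch of Best Match and the source-level majority vote.
# B maps each attack class to its top-m behavioral patterns (item sets).
from collections import Counter

def match_score(behavioral, testing):
    """Fraction of the behavioral pattern's items found in the testing pattern."""
    return len(behavioral & testing) / len(behavioral)

def best_match(testing, B):
    """Classify one testing pattern to the class whose best-scoring
    behavioral pattern is highest. (Overall Match would instead average
    the scores over all m patterns of a class, per the formula above.)"""
    scores = {cls: max(match_score(b, testing) for b in patterns)
              for cls, patterns in B.items()}
    return max(scores, key=scores.get)

def classify_source(testing_patterns, B):
    """Majority vote over all patterns mined from one source IP address."""
    votes = Counter(best_match(t, B) for t in testing_patterns)
    return votes.most_common(1)[0][0]
```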

D. Summary of APEX
We summarize APEX from the perspective of the training and testing operations. As described before, both phases perform the first-level mining on anomalous network flows; these common steps are listed in Algorithm 1. The training phase mines and returns the behavioral patterns, as shown in Algorithm 2. The testing phase, Algorithm 3, uses the behavioral patterns for matching against the patterns mined from the test data.

Algorithm 3 Testing phase of APEX
1: Patterns_1 ← Extract level-one patterns (Algorithm 1)
2: Let B be the behavioral patterns from the training phase
3: Results ← Apply matching scheme M on each pattern P ∈ Patterns_1, across the m behavioral patterns in B_c (∀ B_c ∈ B) (Section IV-C)
4: return Results

V. EXPERIMENTS

A. Dataset and Attack Classes

The MAWI traffic [5] consists of seven randomly selected days from Sep. 2018 to Dec. 2018; there are more than 1 billion packets and 60 million flows in total. For each day, a pcap trace is collected for 15 minutes. We perform data cleaning to remove source IP addresses that have very few packets (< 20).
The labelling process is carried out as follows. First, in order to focus on IP addresses that are more likely to be anomalous, we apply a VAE-based unsupervised anomaly detector, specifically GEE [3]. A data point for GEE is a traffic session identified by all flows with the same source IP address within the 15-minute pcap trace; in other words, each traffic session is a single feature vector in GEE. Subsequently, the source IP addresses that are assigned the highest anomaly scores by GEE are selected for labelling. Since GEE is a deep learning model, we run GEE with different initial seeds multiple times, and perform the same process each time. To identify the ground truth of these top anomalies, we carried out a comprehensive and labour-intensive task. In addition to writing rules and heuristics (e.g., to look for a high count/percentage of SYN packets in a session), we manually went through the traffic data to identify sophisticated attacks. For example, there are multiple sessions with repeating short connections (starting and terminating like normal connections), which are possibly password spraying attacks given that they target applications such as SSH, Telnet, etc. [22]; these are hard to identify using rules. At the end of the process, we have 640 anomalous sessions labelled. The traffic characteristics of some source IP addresses are determined to be not malicious even though they are given high anomaly scores by the VAE model. We label these IP addresses as Normal and do not include them in the evaluations.

Attack Classes. The following high-level attack classes are defined based on the anomalies detected in the MAWI traffic.

Scan: Host and network scanning are integral components of any malware and are used to look for potential vulnerabilities. While the basic idea of scanning is obvious, a large variety of scanning strategies exist, and it is non-trivial to automatically identify scanning traffic in a large network traffic trace.

Heavy: This is a traffic class for high-volume flows. In order to differentiate different high-volume source IP addresses, we identify a source (IP address) as Heavy only if there is at least one five-tuple flow with sufficiently high volume based on a threshold. In our experiments, the threshold is a flow size in bytes greater than that of 99% of all flows in the original data, within the interval of interest.

Flood: This is also a traffic class for high-volume flows. The difference between Flood and Heavy is that, instead of one (or a few) mammoth flow(s), Flood is characterized by a large number of flows to one or different destinations. In aggregation, the flows are voluminous.

Low-rate: These are malicious flows characterized by a large count, where each flow only transfers a small number of packets at a relatively low rate. An example of such a Low-rate attack is slowloris [24]. C&C communications as well as brute-force login attempts also fall under this class.
The attack class distribution of the data we use is given in Table II. We have published the labels and information of the anomalous source IP addresses.
The boundary values for the categorical features used in our experiments are listed in Table III. They are derived from our understanding of the attack classes and the empirical feature distributions in the MAWI traffic. The letters T, S, M, L, and H denote Tiny, Small, Medium, Large, and High, respectively.

B. Evaluation Metrics
For an attack class c, let $TP_c$, $TN_c$, $FP_c$, and $FN_c$ denote the counts of true positives, true negatives, false positives, and false negatives, respectively. If C denotes the set of all classes, then the overall Precision, Recall and F1-Score are defined as:

$$\text{Precision} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} (TP_c + FP_c)}, \quad \text{Recall} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} (TP_c + FN_c)}, \quad \text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (1)$$

Since data in the attack classes can be imbalanced, we also use the Weighted F1-Score, which is based on the F1-Score of each class. Given that $N_c$ is the number of data points belonging to attack class c, the F1-Score of class c is:

$$\text{F1-Score}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}, \quad \forall c \in C;$$

and,

$$\text{Weighted F1-Score} = \sum_{c \in C} \frac{N_c}{\sum_{c' \in C} N_{c'}} \cdot \text{F1-Score}_c \quad (2)$$

For our evaluations below, we analyze based on the overall F1-Score (Eq. 1) and the Weighted F1-Score (Eq. 2).
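Assuming the overall scores are the micro-averages written above, both metrics map directly onto scikit-learn's averaging modes; a small sanity-check sketch with toy labels:

```python
# Eq. 1 and Eq. 2 correspond to scikit-learn's 'micro' and 'weighted'
# F1 averaging, respectively (labels below are toy values).
from sklearn.metrics import f1_score

y_true = ["Scan", "Scan", "Heavy", "Flood", "Low-rate", "Heavy"]
y_pred = ["Scan", "Heavy", "Heavy", "Flood", "Low-rate", "Heavy"]

print(f1_score(y_true, y_pred, average="micro"))     # overall F1-Score (Eq. 1)
print(f1_score(y_true, y_pred, average="weighted"))  # Weighted F1-Score (Eq. 2)
```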

C. Results
We carry out three sets of experiments to investigate the best matching scheme of APEX, and also to compare the detection performance of APEX with supervised machine learning models. m, the number of behavioral patterns used for matching, is the most important parameter of APEX's matching schemes. Except in Experiment A, where we analyse the effect of m, m is set to 7 in all other experiment scenarios. Further, we mine for MFIs with the minimum support set to 0.1 in both levels of mining and in all experiments.
1) Experiment A: Analysis of matching schemes and the impact of m: We compare the effect of m on the performance of the different matching schemes. Recall that m denotes the number of behavioral patterns mined in the training phase; the patterns mined in the testing (operational) phase are matched against these behavioral patterns. As generating these behavioral patterns requires labeled attack traffic, and hence human assistance, high values of m are not preferred. Hence, we experiment with m set to 3, 5, 7, 9, and 11. The 09-01 dataset is used in the training phase, and the 09-06 dataset in the testing phase. The experiment is repeated for all values of m, and we compare the three matching schemes (Section IV-C) in their ability to identify the different attack classes. The corresponding results are shown in Fig. 4.
In Exact Match, when m increases, the number of behavioral patterns available to identify the attack class also increases, and thus the probability of getting an accurate match improves. This is evident from Fig. 4, where the overall F1-Score of Exact Match increases from ∼0.3 to over 0.6 with increasing m. However, as the number of behavioral patterns m increases, it becomes harder to characterize attacks; besides, extracting more behavioral patterns would require more labeled attack flows. Best Match and Overall Match consistently score over 0.5 for different values of m. On average, Best Match shows better performance than Overall Match: in terms of the overall F1-Score, Best Match averages ∼0.6, whereas Overall Match averages 0.57. Henceforth, we use Best Match in the remaining experiments.
2) Comparison with supervised approaches: Existing literature does not provide a similar semi-supervised approach targeting the same problem. Therefore, we compare APEX with several supervised machine learning models that are widely adopted for classification: Random Forest, k-Nearest Neighbor (k-NN) and Naive Bayes. Observe that supervised models require labels for all traffic flows, whereas APEX requires labels of only attack flows, which can be obtained by analyzing the top-ranked anomalies of an anomaly detection model. Since supervised models are expected to outperform unsupervised/semi-supervised models, our goal with this set of experiments is to analyse how close APEX is to supervised models in identifying attack classes. We use the Scikit-learn library [25] to construct the machine learning models and tune the key parameters to achieve consistent and promising performance. For Random Forest, we set the number of trees to 10, and the tree depth is automatically determined by growing trees until all leaves contain fewer than 2 samples; for k-NN, we set the number of neighbors, k, to 7; for Naive Bayes, as there are not many parameters to adjust, we use the default settings.
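A sketch of how these baselines might be instantiated with the stated hyperparameters follows; the Naive Bayes variant (Gaussian) is our assumption, as only default settings are mentioned.

```python
# Sketch of the supervised baselines with the stated hyperparameters
# (Scikit-learn [25]); GaussianNB is an assumed variant.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=10, min_samples_split=2),
    "k-NN": KNeighborsClassifier(n_neighbors=7),
    "Naive Bayes": GaussianNB(),
}

# Flow-level training/testing (X_*, y_* would be the engineered feature
# matrices of Section V-C); per-flow predictions are then majority-voted
# per source IP address, as described below.
# for name, clf in models.items():
#     clf.fit(X_train, y_train)
#     flow_preds = clf.predict(X_test)
```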
To have a fair comparison between the supervised learning models and APEX, we engineer the features as follows. The supervised models are trained and tested on five-tuple flows, where Source IP address, Destination IP address, Source Port, Destination Port and Protocol (used in APEX) become the identifiers to extract the flows. The numerical features, such as AIPL, FSB, FSP and Flow Duration, are also used as features in the machine learning models. Two categorical features are engineered to fit the machine learning models: the Simplified Flag Sequence is transformed into the counts of the basic flags over all the IP packets within a flow, and Majority Packet Length is converted into a three-dimensional binary feature vector through one-hot encoding, i.e., the categorical values S, M, L are converted to [0,0,1], [0,1,0] and [1,0,0], respectively. Since with the machine learning models the classification is performed at the flow level, a majority-voting based scheme is employed to assign the attack class originating from the corresponding sources. As mentioned before, for APEX, the minimum support is set to 0.1, and m is set to 7, as it provides a manageable number of behavioral patterns with a reasonable performance.

a) Experiment B: Effect of training dataset size on performance: We usually expect learning algorithms to perform better with more training data; in this experiment, we evaluate APEX and the supervised machine learning algorithms with two splits of the network traces. The first split uses network traces of four days, i.e., 09-01, 09-06, 10-06, and 11-07, for training, and the remaining three days for testing. The second split uses five days, i.e., 09-01, 09-02, 09-06, 10-06, and 11-07, to train the models and the remaining two days to test. The results are shown in Fig. 5.
Observe that, when the size of the training data increases, unlike for k-NN and Naive Bayes, the performance of both Random Forest and APEX improves. Importantly, with more training data, the extent of APEX's performance improvement is greater than that of Random Forest; in fact, with the 5:2 ratio, APEX outperforms both k-NN and Naive Bayes. Finally, the results show that APEX achieves performance comparable to the supervised machine learning models, despite being semi-supervised.
b) Experiment C: Performance on the entire dataset: In this experiment, we merge all the source IP addresses in the seven days of network traces and split them into training and testing sets at a ratio of 70%:30%, while keeping the same training:testing ratio for each class. As observed from Fig. 6, when a comprehensive training dataset is used, APEX achieves as high a performance as Random Forest; APEX also outperforms the other two supervised models. Besides identifying attack classes, APEX also provides the behavioral patterns of the attacks. These patterns provide useful context to analysts; in addition, analysts can use these patterns to quickly form rules to take mitigation actions.

D. Attack Patterns: A Visual Representation
Subnet pattern. We use a topological diagram (Fig. 7) to demonstrate a Flood-type attack observed from the subnet-related patterns (based on the first level of mining) in the MAWI dataset. In this instance, patterns are formed between multiple sources and a single destination subnet, 163.94.0.0/16, over ports 80 (HTTP) and 443 (HTTPS). Note that both the average packet length (AIPL) and the majority packet length (MPL) are in the "L" (large) category, and the support of these patterns goes beyond 10,000. Hence, these patterns are most likely from a volumetric attack such as an HTTP Flood.
Observe that the traffic flows are from web servers (ports 443 and 80) to the subnet 163.94.0.0/16; that is, they are responses to connections initiated by this particular subnet. Therefore, the subnet 163.94.0.0/16 is the anomalous entity in this example. We highlight that the datasets do not always contain bi-directional flows, and therefore no patterns from the subnet 163.94.0.0/16 to the web servers were observed.

Pattern distribution. We present an example using a Sankey diagram to visually demonstrate the feature distribution of Scan attack patterns (Fig. 8) from the MAWI traffic. The patterns are derived from the first level of mining, as elaborated in Section IV-A.
In the Sankey diagram, a pattern is viewed as a flow connecting a sequence of vertical bars of feature values from left to right. The left-most bar indicates the attack class; the rest of the columns are grouped by feature, with each bar representing a categorical feature value. The lengths of the bars differ based on the fraction of the five-tuple flows (corresponding to patterns) passing through them; in other words, a bar's length indicates how many flows contributing to the patterns possess that specific feature value.
The pattern distributions presented in Sankey diagrams are useful in explaining results. In the case of Scan (Fig. 8), most feature values in the patterns are unique; the AIPL, FSB, FSP, Flow duration and MPL are in the "S" (small) category for most patterns, and the SYN flag (the last bar in the figure) is also present in the majority of the extracted patterns. These unique feature values help to generate clear behavioral patterns in the second level of mining. For instance, consider the top behavioral pattern for Scan in Experiment C (Section V-C), which is specifically <<_, *, 6, *, _, S, S, S, S, T, S, _, _>> (the order of features is the same as in Fig. 3, and "_" indicates that there is no frequent item). Note that this behavioral pattern of Scan has many unique items. This observation aligns with the Sankey diagram of the Scan attack class (Fig. 8), where the feature values are concentrated.
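For readers who wish to reproduce such a diagram, a toy Plotly sketch in the style of Fig. 8 is given below; the link values are made up for illustration and are not the MAWI counts.

```python
# Toy Plotly sketch of a Sankey diagram in the style of Fig. 8: each link's
# width is the number of flows passing from the attack class through a
# categorical feature value (values below are illustrative only).
import plotly.graph_objects as go

labels = ["Scan", "proto=TCP", "proto=UDP", "FSP=S", "FSP=M", "flags=SYN"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 1, 1, 2, 3],   # indices into `labels` (left to right)
        target=[1, 2, 3, 4, 3, 5],
        value=[80, 20, 70, 10, 20, 90],  # flow counts through each feature value
    ),
))
fig.update_layout(title_text="Feature distribution of Scan attack patterns")
fig.show()
```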

VI. CONCLUSION
In this work, we proposed APEX, a data mining framework to characterize network anomalies and classify them into different attack classes. APEX employs two levels of mining: the first level extracts attack patterns from anomalous network flows, and the second level extracts a more concise representation of the attack patterns, called behavioral patterns, which are then used in the identification of attacks. The experiments conducted on real network traffic show that APEX achieves performance similar to that of the best-performing supervised model. In addition, APEX provides a set of concise patterns matching the attack classes as output. The patterns thus generated (which can also be visualized) give an analyst more context on the detected anomalies to carry out further investigations.