This material is the replication package needed to reproduce all the results reported in our empirical study. We plan to make the replication package available under a suitable open license upon acceptance.
LADEmpStudy
A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection
Requirements:
Python 3.10.2
Create the environment: python -m venv transformers_env
Activate the environment: source transformers_env/bin/activate
Install dependencies using the following command: pip install -r requirements.txt
Log Template Extraction:
Log event templates are extracted using the Drain template extraction technique.
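For reference, the snippet below is a minimal sketch of how Drain is typically invoked through the logparser package (our copy is stored under scripts/logparser); the file names, log format, regular expressions, and parameter values shown here are illustrative only and may differ from those used by our preprocessing scripts.

from logparser import Drain  # Drain implementation, as shipped with the logparser package

# Illustrative values only; the settings used in our scripts may differ.
input_dir  = "data/"                       # directory containing the raw log file
output_dir = "data/parsed/"                # directory where extracted templates are written
log_file   = "HDFS_2k.log"                 # raw log file to parse (hypothetical name)
log_format = "<Date> <Time> <Pid> <Level> <Component>: <Content>"  # HDFS log line format
regex      = [r"blk_-?\d+", r"(\d+\.){3}\d+(:\d+)?"]               # masks block IDs and IP addresses
st         = 0.5                           # similarity threshold
depth      = 4                             # depth of the parse tree

parser = Drain.LogParser(log_format, indir=input_dir, outdir=output_dir,
                         depth=depth, st=st, rex=regex)
parser.parse(log_file)  # writes the structured log and the extracted templates as CSV files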
This material contains:
A main script, main.py, to reproduce the results of this empirical study. The corresponding command is given below:
python main.py
This command will load the experimental settings defined in ladempstudy/experimental_settings.py. Please check this file to set the parameters before launching the experiments. Based on these settings, the script will execute hyperparameter tuning and testing for the corresponding ML techniques on the given datasets.
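For illustration, a hypothetical excerpt of such settings is sketched below; the actual variable names and values in ladempstudy/experimental_settings.py may differ.

# Hypothetical excerpt; check ladempstudy/experimental_settings.py for the real names and values.
DATASETS = ["Hadoop", "HDFS", "BGL", "Spirit"]        # datasets to include in the run
TECHNIQUES = ["RF", "SVM", "LSTM", "LogRobust"]       # ML techniques to tune and test
RUN_HYPERPARAMETER_TUNING = True                      # execute the hyperparameter tuning phase
RUN_TESTING = True                                    # execute the testing phase
N_RUNS = 10                                           # repetitions per configuration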
Folder ladempstudy: It contains 4 scripts used to reproduce results in this empirical study.
The script hyperparameter_tuning.py is used to run the hyperparameter tuning phase (section Hyperparameter Tuning Phase in the paper).
The script test.py is used to run the testing phase (section Testing Phase in the paper).
The script commons.py contains the methods shared by the above two scripts.
The script experimental_settings.py contains all the required settings for running the experiments.
Folder data: It contains the preprocessed data from the Hadoop dataset. Note: the remaining datasets (HDFS, Fdataset, BGL, Hades, Thunderbird, Spirit) can be automatically downloaded and preprocessed by executing the main script main.py; to download them manually, please check the scripts provided in the scripts/preprocessing folder.
Folder scripts: It contains 18 scripts related to the implementation and the evaluation of the different machine learning techniques considered in this empirical study. The folder also contains the code of the Drain parser, which we store under the logparser folder.
8 Python scripts containing the implementations of the different machine learning techniques (RF, SVM, LSTM, LogRobust, NeuralLog, DeepLog, OC-SVM, Logs2Graphs).
Folder postprocessing: It contains 4 scripts for generating boxplots.
Folder preprocessing: It contains 8 scripts for downloading, parsing, and preprocessing datasets.
Folder configs: It contains 9 configuration files in JSON format with all the hyperparameter settings adopted for each machine learning technique used in this study.
It also contains one subdirectory, best_hyperparameter_config, with 9 configuration files in JSON format holding the best hyperparameter setting for each of the eight ML techniques.
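As an example, one of these configurations could be inspected as follows; the file name and keys shown are hypothetical and only serve to illustrate how the JSON files can be loaded.

import json

# Hypothetical file name and keys; check the configs/ folder for the actual files.
with open("configs/best_hyperparameter_config/lstm.json") as f:
    best_config = json.load(f)

print(best_config)  # e.g. {"hidden_size": 128, "num_layers": 2, "learning_rate": 0.001}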
Folder utils: It contains utility scripts.
Folder results: It contains the different results (CSV and PDF files). The folder contains 2 subfolders, semi and supervised, holding the experimental results collected from the semi-supervised and supervised (both traditional and deep) ML techniques used in this study.
semi: It contains the results collected for OC-SVM, DeepLog, and Logs2Graphs from both the hyperparameter tuning and testing phases.
supervised: It contains the results collected for the supervised ML techniques (i.e., RF, SVM, LSTM, LogRobust, NeuralLog1, and NeuralLog2) from both the hyperparameter tuning and testing phases.
Each of the above folders contains sensitivity boxplots generated on 6 datasets w.r.t. detection accuracy and time (7 x 2 = 14 plots) in .pdf format. It also contains 4 line charts in .pdf format showing the impact of data imbalance on the 4 log-event-based datasets. Further, it contains 4 CSV files (2 per dataset type, log event or session) containing variability values w.r.t. detection accuracy and training time.