ELNIDS: Ensemble Learning based Network Intrusion Detection System for RPL based Internet of Things

The Internet of Things is realized by a large number of heterogeneous smart devices which sense, collect and share data with each other over the internet in order to control the physical world. Due to the open nature, global connectivity and resource-constrained nature of smart devices and wireless networks, the Internet of Things is susceptible to various routing attacks. In this paper, we propose an architecture of an Ensemble Learning based Network Intrusion Detection System named ELNIDS for detecting routing attacks against the IPv6 Routing Protocol for Low-Power and Lossy Networks. We implement four different ensemble-based machine learning classifiers: Boosted Trees, Bagged Trees, Subspace Discriminant and RUSBoosted Trees. To evaluate the proposed intrusion detection model we have used the RPL-NIDDS17 dataset, which contains packet traces of Sinkhole, Blackhole, Sybil, Clone ID, Selective Forwarding, Hello Flooding and Local Repair attacks. Simulation results show the effectiveness of the proposed architecture. We observe that the ensemble of Boosted Trees achieves the highest Accuracy of 94.5% while the Subspace Discriminant method achieves the lowest Accuracy of 77.8% among classifier validation methods. Similarly, an ensemble of RUSBoosted Trees achieves the highest Area under ROC value of 0.98 while the lowest Area under ROC value of 0.87 is achieved by an ensemble of Subspace Discriminant among all classifier validation methods. All the implemented classifiers show acceptable performance results.


I. INTRODUCTION
Advancement in the development of low-powered tiny embedded devices has facilitated the growth of a new networking paradigm called the Internet of Things (IoT) [1], in which anything can communicate with anyone at any time. IoT consists of objects also known as "Things" (e.g. humans, animals etc.) which carry smart devices with built-in intelligence that gives them the capability to connect and share information over the internet and control the physical world [2]. Smart devices share information to make decisions and perform actuating tasks. IPv6 enables this communication by providing each smart device with a unique IP address, thereby making it globally addressable [3], [4]. IoT enables many applications that make human life better; however, along with these benefits it also carries risks to users' security and privacy [5], [6]. In order to standardize IoT, different organizations have proposed several protocol standards in the past decade. The most popular ones include the 802.15.4 standard for the physical and MAC layers by IEEE, the IPv6 Routing Protocol for Low-Power and Lossy Networks (RPL) [7] for the network layer by IETF, and CoAP for the application layer by IETF [4], [8]. As most IoT applications are based on tiny resource-constrained (memory, processing, communication and energy) devices which are expected to operate for a long time, low-power protocols are needed. 6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks) networks fulfill these critical needs of IoT by enabling nodes or smart devices to operate on low power while maintaining cost-effective wireless personal area networks (WPAN). In order to provide efficient routing in 6LoWPAN [9] networks, the RPL protocol has been standardized [7].
While giving major benefits in routing, the RPL protocol also suffers from various security and privacy risks. Due to the open and self-organizing nature of IoT, nodes are vulnerable to insider and outsider attacks. There is a large body of literature on routing attacks specific to wireless sensor networks (WSN). Such attacks can also be performed on 6LoWPAN networks. In addition, some attacks newly tailored for RPL are also present in the literature. Many solutions for securing the RPL protocol have been proposed in the literature [10]. These include Intrusion Detection Systems (IDS) and trust-based secure RPL protocols. These solutions provide security against only a very small number of attacks, which is a major concern when it comes to securing IoT. In this paper, we focus on the development of a Network Intrusion Detection System (NIDS) named ELNIDS which provides defense against seven types of routing attacks. ELNIDS is based on ensemble learning and uses four types of classifiers, namely Boosted Trees [11], Bagged Trees [12], Subspace Discriminant [13] and RUSBoosted Trees [14]. We have proposed the architecture for ELNIDS and performed a performance analysis of individual classifiers using different [...]
RPL uses four types of control messages (DIO, DIS, DAO, DAO-ACK) for creating and maintaining the DODAG (Destination-Oriented Directed Acyclic Graph). Routes between DODAG nodes are selected and optimized using an Objective Function (OF). An OF uses various metrics and constraints in order to select the optimal path and parent among different preferred choices. Nodes are assigned a rank value (16 bit) which represents the node's individual position with respect to the DODAG root. The rank concept is used to maintain the parent-child relationship, as well as to prevent loops in the network.
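The role of rank in parent selection and loop prevention can be illustrated with a minimal sketch. It is a simplification rather than a full Objective Function: the function name `select_parent` is hypothetical, the per-hop increase of 256 is assumed from the RFC 6550 default, and the only rule modelled is that a node may choose a parent only among neighbors advertising a strictly lower rank.

```python
# Simplified sketch of rank-based parent selection in RPL (not a full OF).
INFINITE_RANK = 0xFFFF        # largest 16-bit rank: node has not joined a DODAG
MIN_HOP_RANK_INCREASE = 256   # assumed rank step per hop (RFC 6550 default)


def select_parent(own_rank, neighbor_ranks):
    """Return (parent_id, new_rank), or (None, own_rank) if no neighbor qualifies.

    neighbor_ranks maps a neighbor id to the rank it advertised in its DIO.
    Only neighbors with a strictly lower rank are eligible, so a node can
    never pick one of its own descendants as parent and create a loop.
    """
    eligible = {n: r for n, r in neighbor_ranks.items() if r < own_rank}
    if not eligible:
        return None, own_rank
    # Prefer the lowest advertised rank; break ties by neighbor id.
    parent = min(eligible, key=lambda n: (eligible[n], n))
    new_rank = min(eligible[parent] + MIN_HOP_RANK_INCREASE, INFINITE_RANK)
    return parent, new_rank
```

A node that has not yet joined calls `select_parent(INFINITE_RANK, ...)`, so every neighbor that is already in the DODAG is eligible. Several of the attacks considered in this paper, such as Sinkhole, work by advertising a falsely attractive rank or route to exactly this kind of selection logic.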

B. RPL-NIDDS17 dataset
The RPL-NIDDS17 [15] is a synthetic dataset created using the NetSim [16] tool. NetSim is capable of simulating various networking environments, e.g. IoT, MANET, FANET, VANET etc. To create the dataset, the IoT network scenario is configured with sensor nodes, a gateway, a router, and a wired node. For every attack, packet captures were retained in separate CSV files. Finally, all the CSV files were merged to form the complete RPL-NIDDS17 dataset. The dataset consists of 20 features and 2 additional labelling attributes. RPL-NIDDS17 contains traces of attacks including Sinkhole, Blackhole, Sybil, Clone ID, Selective Forwarding, Hello Flooding and Local Repair attacks. Features of the dataset have been classified into three categories, namely flow, basic and time. Table I shows the full description of the RPL-NIDDS17 dataset and Table II shows the description of the part of the dataset used in this study.

II. RELATED WORK
In [17] an architecture for a specification-based IDS is proposed for detecting rank and local repair attacks. The proposed IDS uses a distributed placement strategy for placing monitoring modules. No simulation study is done in support of the IDS performance analysis. Further, the authors of [17] proposed an extension to this work in [18]. In that work, a specification-based IDS is proposed to detect rank, sinkhole, local repair, neighbour and DIS attacks. The proposed specification-based IDS uses a hybrid placement strategy. The main limitations of this work include the added communication overhead, the prior requirement of a network trace and a fall in IDS accuracy when it operates for a long time. Raza et al. [19] proposed a hybrid anomaly-based IDS named SVELTE. It uses several IDS modules with a firewall that provides security against malicious traffic from the outside network. SVELTE is capable of defending against sinkhole, selective forwarding and spoofing or alteration attacks. SVELTE has several limitations including synchronization issues, the need for strategic placement of IDS modules, a high false positive rate and vulnerability to coordinated attacks. It performs well in terms of packet delivery ratio, control packet overhead, energy consumption and true positive rate.
In [20] a compression header analyzer based IDS named CHA-IDS is proposed. It uses a signature-based detection mechanism which is embedded in the border router. It requires high memory and energy consumption. Moreover, it cannot locate the attacker. A signature-based IDS to detect the DIS attack and the Version number attack is proposed in [21]. The proposed IDS requires detection and monitoring modules to be placed on the nodes themselves, as in the case of hybrid detection schemes. However, the authors consider using two types of additional nodes. The first type is IDS routers, which carry detection and firewall modules. The second type is IDS detectors, which are responsible for monitoring and sending malicious traffic information to the router nodes. Kfoury et al. [22] proposed an IDS for detecting Sinkhole, Version number, and HELLO flooding attacks specific to the RPL protocol. The authors used a Self-Organizing Map for clustering the attack and normal traffic. The methodology behind the labelling of clusters is not elaborated in depth in this work. In addition, the proposed IDS is not evaluated in terms of implementation overhead, i.e. node resources.

III. PROPOSED WORK
In this paper, ensemble learning [23] methods are used to develop IDS [24] modules. This is because ensemble learning provides advantages in the case of classification problems. The main advantages include better prediction and model stability.
978-1-7281-1253-4/19/$31.00 © 2019 IEEE
Ensemble methods help in improving classification results by combining multiple models. Thus, using multiple models helps in gaining better prediction accuracy. The aggregated output of the ensemble is typically less noisy than the output of any single constituent model. In addition, ensemble models help avoid overfitting by utilizing bagging methods. The RPL-NIDDS17 dataset from Zenodo has been used to train and test the classifiers, and results have been compared in terms of Accuracy and Area under the ROC (receiver operating characteristic) [25] curve. Accuracy refers to the ratio of the number of correct predictions to the total number of predictions. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the decision threshold is varied, and the Area under ROC is the area under this curve.
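The two metrics can be made concrete with a short sketch. It is illustrative only: the function names are hypothetical, labels are assumed to be encoded as 1 for attack and 0 for normal, each classifier is assumed to output a numeric attack score, and both classes are assumed to be present in the evaluated set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by sweeping the decision threshold.

    y_true holds 1 for attack and 0 for normal; scores are attack scores.
    Assumes at least one instance of each class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit instances from highest to lowest score; tied scores share a point.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An Area under ROC of 1.0 corresponds to a classifier that ranks every attack instance above every normal one, while 0.5 corresponds to random ranking.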

A. Experiment Flow Design
Fig. 1 shows the experimental flow design followed during this work. In the first step, the RPL-NIDDS17 dataset is preprocessed by applying cleaning, encoding and scaling methods. Cleaning refers to handling missing values, encoding is used to handle nominal data by one-hot encoding, i.e. conversion from nominal to numeric form, and scaling is used to scale the concerned feature between 0 and 1. The preprocessed dataset is divided into train and test sets. In the second step, ensemble classifiers (Bagged Trees, Boosted Trees, Subspace Discriminant and RUSBoosted Trees) are trained with the train set. The main reason behind the selection of these classifiers is that they perform well on different types of datasets, i.e. balanced and imbalanced. We conducted experiments with other ensembles including AdaBoost and Random Forest and found better results with the selected four ensembles (Bagged Trees, Boosted Trees, Subspace Discriminant and RUSBoosted Trees) in the case of RPL-NIDDS17. In the third step, the trained models are tested using the test set. In the testing phase, the models classify each input test instance into the attack or normal class. Classifier details are given in Table III.
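The three preprocessing operations of the first step can be sketched as follows. This is a standard-library illustration, not the Pandas code used for the experiments; the function names are hypothetical, and each instance is assumed to be a dictionary mapping feature names to values.

```python
def clean(rows):
    """Drop every instance that has a missing value (None or empty string)."""
    return [r for r in rows
            if all(v is not None and v != "" for v in r.values())]


def one_hot(rows, column):
    """Replace a nominal column with one 0/1 column per observed category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}={c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Scale a numeric column to the range [0, 1]."""
    values = [float(r[column]) for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    scaled = []
    for r, v in zip(rows, values):
        new_row = dict(r)
        # A constant column carries no information; map it to 0.0.
        new_row[column] = (v - lo) / span if span else 0.0
        scaled.append(new_row)
    return scaled
```

In practice the minimum and maximum used for scaling are usually taken from the train set only and then reused on the test set, so that no information from the test instances leaks into training. The sketch scales whatever rows it is given and leaves that choice to the caller.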

B. Ensemble Learning based Network Intrusion Detection System
We propose a signature-based NIDS architecture named ELNIDS for detecting routing attacks on RPL such as Sinkhole, Blackhole, Sybil, Clone ID, Selective Forwarding, Hello Flooding and Local Repair. Fig. 2 shows the architecture of ELNIDS. The proposed IDS architecture consists of the sniffer, sensor events/traffic repository, feature extraction module, analysis engine, signature database, user interface and alarm/attack notification manager.
The sniffer is responsible for listening to all the packet transmissions within the 6LoWPAN [...] In addition, the analysis engine constantly sends information to the user interface, where the traffic is monitored regularly. The user interface logs all the information collected from the analysis engine in the form of log reports. The signature database contains signature information which is used by the analysis engine while performing pattern matching. It is directly connected to the analysis engine. The main reason for using a dedicated voting scheme is to generalize the idea of prediction aggregation, which additionally improves overall IDS performance.
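The idea of aggregating predictions through voting can be sketched as below. The voting rule itself is an assumption: plain majority voting is used here for illustration, the function name `majority_vote` and the labels "Attack" and "Normal" are hypothetical, and ties are resolved in favour of "Attack" as one possible conservative choice.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier labels for one traffic instance into one label.

    predictions is a list such as ["Attack", "Normal", "Attack", "Attack"],
    one entry per classifier. Ties are broken in favour of "Attack"
    (an assumed policy, so that a split vote still raises an alarm).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    return "Attack" if "Attack" in winners else sorted(winners)[0]
```

With four classifiers a 2-2 split is possible, which is why the tie-breaking policy has to be stated explicitly. A weighted vote, for example by each classifier's validation accuracy, would be another reasonable variant.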

IV. EXPERIMENTAL SETUP
The performance assessment has been carried out on a machine running 64-bit Windows 10 Pro and equipped with an Intel® i7-7700 quad-core CPU with a 3.60 GHz clock speed and 12 GB of main memory. Matlab 2017b is used for the implementation and evaluation of the ensemble classifiers. Dataset preprocessing is performed using the Pandas library of the Python programming language.
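The evaluation in Section V relies on hold-out and k-fold cross-validation. As a rough illustration of how such splits are formed, the following sketch generates the index sets; it does not reproduce the partitioning performed by the Matlab classification learner, and the function names and the fixed random seed are assumptions.

```python
import random


def hold_out_split(n_samples, test_fraction, seed=0):
    """Return (train_idx, test_idx) for a single random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_samples, k, seed=0):
    """Yield k (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

A 30% hold-out corresponds to `hold_out_split(n, 0.3)`, and 5-fold or 10-fold cross-validation to `k_fold_splits(n, 5)` or `k_fold_splits(n, 10)`, with the reported metric averaged over the folds.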

V. RESULTS AND DISCUSSION
We have used all 20 features of the dataset for the performance analysis of the classifiers. In stage 1, we perform preprocessing of the dataset features. We removed all the instances which contained missing values and then converted all the nominal or symbolic features to numeric form using one-hot encoding. Then all the features are scaled between 0 and 1, i.e. normalized. In stage 2, the classification learner module of Matlab 2017b is used for the evaluation of the ensemble classifiers. Every classifier is evaluated with four validation methods: 30% hold-out, 40% hold-out, 5-fold and 10-fold cross-validation. In stage 3, all the evaluation results are tabulated and compared, and the practicality of ELNIDS is generalized.
In this paper, we emphasized using ensemble-based machine learning models for creating a network intrusion detection system. We proposed an architecture for a network intrusion detection system which we call ELNIDS. The proposed architecture is capable of detecting Sinkhole, Blackhole, Sybil, Clone ID, Selective Forwarding, Hello Flooding and Local Repair attacks. We implemented four different classifiers: ensembles of Boosted Trees, Bagged Trees, Subspace Discriminant and RUSBoosted Trees. To evaluate the performance of the classifiers we used the RPL-NIDDS17 dataset, which contains traces of routing attacks on the RPL protocol. The simulation results show that the ensemble classifiers based on Boosted Trees and RUSBoosted Trees achieve the best performance in terms of Accuracy and Area under ROC, respectively. Thus, the overall classifier performance evaluation results show the effectiveness of ELNIDS. In future, we plan to implement and evaluate ELNIDS on smart nodes. In addition, we aim to develop lightweight defense solutions for securing the Internet of Things.

Figure 3: AUC achieved in case of Boosted Trees

Figure 5: AUC achieved in case of Subspace Discriminant

Figure 6: AUC achieved in case of RUSBoosted Trees

Table I: Full dataset description

Table II: Part of the dataset used in this study

Table IV: Comparison of accuracy and AUC values achieved with ensemble classifiers