Deep Reinforcement Learning Based Active Queue Management for IoT Networks

The Internet of Things (IoT) finds applications in home, city, and industrial settings. Current networks are transitioning to a fog/edge architecture to provide the capacity needed for IoT. However, in order to handle the enormous amount of traffic generated by IoT devices and to reduce queuing delay, novel self-learning network management algorithms are required at fog/edge nodes. Active Queue Management (AQM) is a well-known intelligent packet-dropping technique for differentiated QoS. In this paper, we propose a new AQM scheme based on Deep Reinforcement Learning (DRL) and introduce a scaling factor in our reward function to achieve a trade-off between queuing delay and throughput. We choose Deep Q-Network (DQN) as the baseline for our scheme, and we compare our approach with various AQM schemes by deploying them at the interface of a fog/edge node and simulating them with different bandwidth and round trip time (RTT) configurations. The simulation results show that our scheme outperforms the other AQM schemes in terms of delay and jitter while maintaining above-average throughput, and they verify that DRL-based AQM is efficient in managing congestion.


Introduction
The Internet of Things (IoT) refers to the connectivity of billions of devices for smart automation in the future Internet, made possible by the accelerated convergence of technologies such as wireless communication, embedded systems, big data analysis, and real-time processing. With the high demand for IoT devices [17], such as smart hubs, cameras, and healthcare devices, recent studies predict that by 2022 a typical family home will connect a large number of IoT devices to smart city applications [12]. The challenge of processing the large volume of data from IoT devices in the cloud while meeting their latency requirements has given rise to the fog/edge computing architecture. The edge gateways in this architecture often manage traffic congestion through Active Queue Management (AQM) techniques, whose performance is highly sensitive to the tuning of their configuration parameters. Machine learning techniques provide a viable approach to adaptively configuring AQM parameters; employing one such technique, a deep reinforcement learning based AQM scheme is proposed in this paper.

Fog/Edge Computing Architecture for Internet of Things
The growing size of IoT data and the low-latency requirements of IoT applications are driving the current IoT-cloud architecture toward a fog/edge computing architecture that connects IoT devices with a tiered computing infrastructure; for example, smart home devices are connected with smart city infrastructure. This architecture deploys computing nodes at the edge of the network to perform local computation and provide storage capacity. It provides low-latency communication between IoT devices and servers, improves the user's experience, and achieves resilient services [22]. Figure 1 shows the basic fog/edge computing architecture with the following three tiers [26]:
- Tier-1 Things/End Devices: This tier contains IoT devices such as sensors and end users' (EU) mobile devices, including smart phones, tablets, and smart watches. These end devices are also known as Terminal Nodes (TNs).
- Tier-2 Fog: The fog/edge nodes of this tier consist of network routers, switches, gateways, and Access Points (APs). In the fog/edge computing architecture, these edge nodes have higher computation power and storage capacity than traditional network devices in order to provide EUs with better Quality of Service (QoS).
- Tier-3 Cloud: This tier includes general cloud servers and storage devices, which are capable of providing the necessary storage and computing resources.
In the middle tier, the fog/edge nodes carry the traffic generated by both the cloud servers and the end devices through efficient network scheduling and resource management. The edge gateways need to manage the potential congestion of the traffic they carry from numerous IoT devices by employing queue management techniques such as AQM [21]. Unlike First-In-First-Out (FIFO), where all packets are dropped in case of buffer overflow, AQM avoids buffer overflow by selectively dropping packets when the queue builds up, so that the affected TCP flows can slow down in response. The performance of AQM is highly sensitive to how well it adapts its parameters, such as the packet-drop probability, to the network state. Recently, Deep Reinforcement Learning (DRL) has shown promising results for network resource management [23].

Motivation and Objective
The IoT network consists of various types of end devices with stochastic state transitions, which makes it a suitable domain for DRL, a technique that has shown remarkable performance in highly stochastic environments. Further, the discrete state transitions of an IoT network environment can be represented as a finite Markov Decision Process (MDP), as discussed in Sect. 3. Since the MDP is the main principle [26] of reinforcement learning, this provides another motivation for applying DRL to IoT networks. Our objective is to use DRL to make adaptive packet-drop decisions for AQM. This is the first scheme that implements a fine-grained state-aware AQM at the interface of fog/edge nodes for the IoT environment; the fog/edge nodes have enough computing power to perform the training phase of the DNN. To enhance active queue management, we propose a DRL-based AQM scheme for efficient queue management and study the trade-off between queuing delay and throughput while maintaining QoS in terms of low jitter. Our system is designed around a Deep Q-Network (DQN), including a main Q-network and a target network trained using experience replay. It selects a packet drop or no-drop action at the packet departure stage depending on the current state of the queue, which consists of the current queue length, dequeue rate, and queuing delay. After an action is selected, a reward is calculated based on several factors that are discussed in detail in Sect. 3. All experiments are conducted on an IoT-based topology. The main contributions of this paper can be summarized as follows:
- The design guidelines of a DQN-based AQM scheme (DQN-AQ) that applies the DRL technique to efficient queue management in order to achieve an optimized trade-off between queuing delay and throughput while maintaining QoS metrics such as jitter.
- An implementation in ns-3 integrated with tiny-dnn of our proposed DQN-based AQM scheme and an IoT network consisting of a fog/edge device, a cloud server, and 21 different IoT devices, considering their actual characteristics such as sleep time, active time, average packet size, mean data rate, and peak/mean rate.
- A thorough performance evaluation and analysis of our proposed scheme in comparison with well-known AQM schemes, namely P-FIFO, RED, PIE, CoDel, and FQ-CoDel.
The rest of this paper is organized as follows. In Sect. 2, related works, including widely used AQM schemes, ML-applied AQM algorithms, and applications of traffic management for IoT, are described. The system model used for this scheme is discussed in Sect. 3. The overview and detailed design of our proposed AQM scheme are presented in Sect. 4. Next, our IoT simulation platform, including topology, scenarios, and IoT characteristics, is explained in Sect. 5. Finally, we evaluate and analyze our experimental results in Sect. 6 and conclude the paper with a summary and future work in Sect. 7. To improve the flow of this manuscript, we have provided the definitions of acronyms in Table 1.

Related Works
AQM is one of the intelligent network management methods deployed at the network interface of a router or a switch [21]. In traditional queue management methods such as drop-tail, a slew of packets is dropped only when the queue is full, causing low throughput, high latency, and jitter. In contrast, AQM employs a random early drop technique that randomly selects and drops packets even before the queue is full. This allows responsive flows such as TCP to back off and avoid queue overflow, which also mitigates the excessive buffering problem known as bufferbloat. The performance benefit of AQM is highly sensitive to the tuning of its parameters, such as the packet-drop probability, which can be adjusted to the environment. There are proposals for either minimal parameter tuning or self parameter tuning. In this section, we discuss several AQM schemes, including the recent approach of machine learning based AQM. We also review network scheduling algorithms and AQM performance studies in IoT networks.

Active Queue Management Schemes
Traditional queue management cannot deal with the rapidly changing state of a queue, particularly in oversized buffers that suffer from bufferbloat. In this section, we discuss three well-known AQM schemes: RED [11], CoDel [27], and PIE [28], which are recommended by the IETF [5].

Random Early Detection (RED)
RED is one of the early classic AQM schemes for congestion avoidance. It has three parameters that adapt to the queue state: two thresholds, th_min and th_max, and a packet marking probability p. If the average queue size L_avg is less than th_min, the packet is enqueued. If L_avg is between th_min and th_max, RED calculates p and marks the packet, either by dropping it or by modifying the packet header, depending on the transport protocol. If L_avg is greater than th_max, RED drops the packet rather than modifying the packet header in order to control L_avg. Variants of RED, including Adaptive RED (ARED) and Weighted RED (WRED), have been developed for better QoS [30].
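To make the thresholds concrete, the following is a minimal, illustrative Python sketch of RED's marking decision (not the ns-3 implementation used in our experiments). The maximum marking probability max_p and the linear growth between the thresholds follow the classic RED design; the scaling of the probability by the count of packets since the last mark is omitted.

```python
import random

def red_decision(avg_qlen, th_min, th_max, max_p):
    """Illustrative RED marking decision based on the average queue size.

    Returns 'enqueue', 'mark', or 'drop'. Real RED additionally scales the
    marking probability by the number of packets since the last mark.
    """
    if avg_qlen < th_min:
        return "enqueue"                      # queue is short: always accept
    if avg_qlen >= th_max:
        return "drop"                         # average above th_max: force drop
    # Between the thresholds, the marking probability grows linearly up to max_p.
    p_b = max_p * (avg_qlen - th_min) / (th_max - th_min)
    return "mark" if random.random() < p_b else "enqueue"
```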

Controlled Delay (CoDel)
CoDel is a packet-sojourn-time based AQM that tracks the (local) minimum queuing delay experienced by the packets. The minimum packet-sojourn time can only be updated when a packet is dequeued, and to keep this minimum up to date, CoDel measures it within a certain interval T_int, which is set to 100 ms by default. CoDel assumes a target delay T_target, and when the queuing delay exceeds T_target throughout T_int, a packet is dropped and the next drop time t_new is set by the following control law:

t_new = t_curr + T_int / sqrt(N_drops)    (1)

where t_new is the new next drop time, t_curr is the current next drop time, and N_drops is the number of packets dropped since the dropping state was entered. The next drop time refers to the time after which subsequent incoming packets will be dropped. When the queuing delay falls below T_target, the controller stops dropping packets. A distinct feature of CoDel is that it drops packets at the packet departure stage. FlowQueue-CoDel (FQ-CoDel) [15] is a variation of CoDel that classifies flows into one of 1024 sub-queues by hashing the 5-tuple of source and destination IP addresses, source and destination port numbers, and protocol number. Each sub-queue is implemented based on CoDel, and a byte-based deficit round-robin (DRR) mechanism is deployed to serve the sub-queues. FQ-CoDel has a distinct parameter, quantum, which is the number of bytes to dequeue on each round of the scheduling algorithm. The default quantum is 1514 bytes, which corresponds to the Ethernet maximum transmission unit (MTU) plus the 14-byte hardware header length.
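The control law of Eq. (1) can be made concrete with a short sketch. The following Python fragment is a simplification of CoDel's per-dequeue decision under the definitions above; it is not the ns-3 implementation, and details such as reusing the drop count between consecutive dropping states are omitted.

```python
import math

T_TARGET = 0.005   # target sojourn time (s); CoDel's default is 5 ms
T_INT    = 0.100   # interval (s); 100 ms by default

def next_drop_time(t_curr, n_drops):
    # Control law of Eq. (1): drops get closer together as N_drops grows.
    return t_curr + T_INT / math.sqrt(n_drops)

def on_dequeue(sojourn, now, st):
    """Simplified CoDel decision at packet departure.
    st: dict with keys 'first_above', 'dropping', 'n_drops', 't_next',
    e.g. initialised as {'first_above': None, 'dropping': False,
                         'n_drops': 0, 't_next': 0.0}."""
    if sojourn < T_TARGET:
        st.update(first_above=None, dropping=False)    # delay back under target
        return "forward"
    if st["first_above"] is None:                      # delay just crossed the target
        st["first_above"] = now + T_INT
        return "forward"
    if not st["dropping"]:
        if now < st["first_above"]:                    # not above target for a full interval
            return "forward"
        st.update(dropping=True, n_drops=1, t_next=next_drop_time(now, 1))
        return "drop"
    if now >= st["t_next"]:                            # scheduled drop inside dropping state
        st["n_drops"] += 1
        st["t_next"] = next_drop_time(st["t_next"], st["n_drops"])
        return "drop"
    return "forward"
```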

Proportional Integral Controller Enhanced (PIE)
PIE is a burst-tolerant AQM scheme that, similar to CoDel, drops packets depending on the queuing delay, but it is more robust through additional parameters. For every update interval T_update, PIE calculates the packet-drop probability p based on the queuing delay and two statically configured parameters, α and β, set to 0.125 and 1.25 by default, which determine the fine balance between latency offset and latency jitter. PIE estimates the queuing delay through the dequeue rate using the following formula:

cur_del = qlen / avg_drate    (2)

where cur_del is the current queuing delay, qlen is the current queue length in bytes, and avg_drate is the average dequeue rate. To overcome rate fluctuation, a common issue in wireless networking, PIE measures the dequeue rate periodically. And since PIE calculates the drop probability periodically, any short-term burst arriving within the period can pass through without incurring extra drops. A maximum burst allowance parameter B_max, set to 100 ms by default, specifies the duration for which a burst is allowed to bypass the random drop process. These AQM schemes are effective in alleviating the bufferbloat and network congestion problems. However, their performance is sensitive to parameter tuning, which can be improved through the machine learning techniques discussed in the next section.
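As a rough illustration of the periodic update described above, the sketch below combines the delay estimate of Eq. (2) with a proportional-integral style probability update using the α and β defaults from the text. It is not the ns-3 implementation: the burst allowance B_max, PIE's auto-tuning of α and β, and other details are omitted, and the T_UPDATE and DELAY_REF values are illustrative placeholders.

```python
ALPHA, BETA = 0.125, 1.25   # default parameters from the text
T_UPDATE    = 0.015         # update interval (s), an illustrative value
DELAY_REF   = 0.020         # target queuing delay (s), an illustrative value

class PieEstimator:
    def __init__(self):
        self.p = 0.0            # current drop probability
        self.prev_delay = 0.0   # delay at the previous update

    def estimate_delay(self, qlen_bytes, avg_drate):
        # Eq. (2): current delay = backlog in bytes / average drain rate.
        return qlen_bytes / avg_drate if avg_drate > 0 else 0.0

    def update(self, qlen_bytes, avg_drate):
        """Called every T_UPDATE: the alpha term reacts to the latency offset
        from the reference, the beta term to the latency trend."""
        cur_delay = self.estimate_delay(qlen_bytes, avg_drate)
        self.p += ALPHA * (cur_delay - DELAY_REF) + BETA * (cur_delay - self.prev_delay)
        self.p = min(max(self.p, 0.0), 1.0)   # keep the probability in [0, 1]
        self.prev_delay = cur_delay
        return self.p
```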

Machine Learning-Based AQM and Network Traffic Scheduling Algorithms
Recently, machine learning techniques, especially reinforcement learning, have been used to deal with network problems such as AQM, network traffic control, and resource control.
In [8], Bouacida et al. implemented the LearnQueue AQM algorithm based on RL for wireless networking. In their scheme, the Q-table is updated and the policy of the Q-function is optimized by dynamically manipulating the buffer size using Q-learning at a certain interval. Their scheme was tested in small deployment scenarios of only two and three nodes. Bisoy et al. [7] suggested an AQM scheme based on a shallow neural network with one hidden layer of three neurons to deal with the non-linearity of the networking system and to reduce queuing delay, but their work did not address the trade-off between throughput and delay performance. In [32], Vucevic et al. proposed the Reinforcement Learning-Queuing Delay Limitation (RL-QDL) AQM algorithm. The RL agents receive topology information from the bandwidth broker that manages resource and QoS provisioning based on the QoS expectations in egress routers (ERs). It assumes class based queuing (CBQ) supporting three different classes to provide end-to-end QoS to users with different types of services: expedited forwarding (EF), assured forwarding (AF), and best effort (BE).
In [18], Jin et al. proposed an AQM algorithm based on RL and CoDel for optimized traffic distribution over a grid of operating points. A Q-learning-applied RED scheme was proposed in [16] to optimize the parameters of RED. Chen et al. proposed a DRL-based optimized computation offloading policy for scheduling by deploying a double DQN at the edge node [9]. Their approach showed an optimal trade-off between task delay and packet drops. In [10], Chinchali et al. investigated real cellular network traffic data and introduced a DRL-based scheduler focusing on High Volume Flexible Time (HVFT) traffic, including software and data updates to mobile IoT devices, large data transfers from IoT sensors to cloud servers, and pre-fetched ultra-high quality and bit-rate video traffic. As the baseline DRL model, they used the Deep Deterministic Policy Gradient (DDPG) algorithm to train actor and critic networks, and the RL agent resides at the cell tower for HVFT control. In [34], Xu et al. applied DRL to network traffic engineering by deploying an actor-critic method with prioritized experience replay. They compared their algorithm with widely used baseline solutions such as Shortest Path (SP), Load Balance (LB), and Network Utility Maximization (NUM), and verified that their model performs better than the given baselines.
Although there are several significant ML-based schemes for scheduling and queue management, current ML-based AQM schemes do not consider stochastic environments such as IoT networks.

Traffic Management for IoT Applications
Along with the growth of IoT markets and technologies, traffic scheduling and management for IoT applications have flourished recently.
Kua et al. analyzed the performance of several AQM schemes, such as FQ-CoDel and PIE, in an IoT-enabled smart home testbed including low- and high-rate IoT applications, video call traffic, bulk file transfers, and a multimedia streaming server [20]. They considered throughput and round trip time (RTT) as the measures of AQM efficiency. There are also schemes proposed to reduce IoT service delay through edge computation offloading policies [35,36]. In [36], Zhang et al. considered a G/M/1 queue as the task buffer at the edge computing server, served in a First Come First Served (FCFS) and non-preemptive manner. The authors used a locality-first policy and a probability-based policy, setting a probability parameter to make the offloading decision. Yousefpour et al. used a load sharing approach between edge nodes to reduce IoT service delay as an edge offloading policy [35]. They considered different types of requests from IoT devices and the length of the queue serving as the task buffer for load sharing.
In [38], Zhao et al. introduced multiple tiers in the fog/edge layer to achieve a desired throughput-delay trade-off per user. Their design consists of fog access nodes (FANs), which cache a subset of popular content, and fog control nodes (FCNs), which are connected to FANs through backhaul connections and offer higher computing and storage capacity than FANs. Zheng et al. investigated multiple service frequency constraints in wireless channels [39]. They scheduled packets arriving at wireless links using round-robin and maximum-weight scheduling. The links are categorized into stages indicating the priority level; maximum-weight scheduling at the same stage gives service priority to the link with the longer queue.
In [6], Bhandari et al. applied priority queues at the cluster head to prioritize and aggregate incoming packets. The proposed mechanism guaranteed QoS requirements in terms of latency and reliability. Al-Kashoash et al. proposed optimization-based hybrid congestion alleviation (OHCA), which combines both traffic and resource controls in IPv6 over Low-power Wireless Personal Area Networks (6LoWPAN) to utilize network resources effectively [3]. Zhao et al. proposed a different approach from the above schemes, which deploys cloudlets at optimal places to minimize access delay using Software Defined Networking (SDN) [37]. The control plane on the SDN controllers assigns cloudlets to APs based on the average cloudlet access delay among all placement cases.
In this section, we discussed a wide variety of AQM schemes and reviewed related works on network traffic and resource management based on ML techniques. Unlike the above research on traffic management for IoT applications, which is largely related to task scheduling or management, our approach deals with the flow of packets at the physical queue level in the network device's interface. To the best of our knowledge, this is the first work to design a deep reinforcement learning based AQM employing DQN.

Deep Q-Network
In reinforcement learning, a software agent interacts with the environment in a sequence of discrete time steps t = 1, 2, 3, ..., and the environment is generally stochastic. At each time step t, the agent selects an action a_t ∈ A(s) from the action space A(s) in the given state s_t ∈ S, where S is the state space. After taking the action a_t, the agent receives a reward r_t ∈ R ⊂ ℝ and observes a new state s_{t+1}. By iterating this interaction, each observation in the sequence or trajectory can be expressed as a finite MDP that begins as follows:

s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, …

The transition of states in an MDP satisfies the Markov property, i.e., the memoryless property of a stochastic process in which future states depend only on the present state and not on the trajectory of past events. The agent at the network queue can observe the current queue state, such as the queue length, and decide whether or not to drop a packet as packets arrive at the queue. The queuing delay is affected by the agent's action, which allows us to derive a reward and observe the next state. Therefore, the problem can be modeled as an MDP, and we can apply RL to the sequence representation at each time step t.
The two most significant features of RL are trial-and-error search and delayed reward, meaning that an action may affect not only the immediate reward but also subsequent rewards [31]. Ultimately, with these features, the agent selects an optimal action in any given state by maximizing the total reward. In order to evaluate the value of actions, we use Q-learning, a common method of Temporal Difference (TD) learning that updates the model at the end of each time step [33]. Q-learning finds an optimal policy for the expected cumulative future reward computed in the current state. The action-value function, known as the Q-function, gives a value for every action in each state and is updated by the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]    (3)

where r_t is the reward after taking action a_t in state s_t, a′ is the action expected to maximize the Q-value, α is the learning rate, and γ is a discount factor for future reward. Since in our design packet drop/no-drop are discrete actions, our policy π = p(a|s) is deterministic and maps each state s to an action a. The goal of the agent is to select an action that maximizes the cumulative future reward through the Q-function, so the optimal Q-function can be expressed as follows:

Q*(s, a) = max_π E[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, π ]    (4)

which is the maximum expected cumulative reward acquired by the policy π after taking action a in state s. We employ a deep neural network (DNN) as a function approximator that outputs the predicted Q-value Q(s, a|θ), where θ are the weights of the neurons of the DNN. In order to minimize the difference between the predicted Q-value and the optimal Q-value, the loss function is defined as:

L(θ) = E[ ( r_t + γ max_{a′} Q(s_{t+1}, a′|θ) − Q(s_t, a_t|θ) )² ]    (5)

However, when reinforcement learning uses a neural network as a function approximator, training the Q-function is susceptible to instability or divergence. The reason for the instability is that the neural network receives highly correlated data as sequential inputs, which may cause over-fitting or convergence to a local minimum. The divergence is due to the non-stationary target: both the predicted Q-value and the target value depend on the same weights, so the target also changes as the neural network is trained, making it difficult for the network to converge. To overcome these issues, DQN was developed, and it is the baseline model of our system [25]. There are two main features of DQN: experience replay and network separation.
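For readers less familiar with the update of Eq. (3), a minimal tabular sketch of the same rule is shown below (the α and γ values are illustrative); our scheme replaces the table with the DNN approximator described next.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99                      # learning rate and discount factor (illustrative)
Q = defaultdict(lambda: [0.0, 0.0])           # two actions per state: 0 = no-drop, 1 = drop

def q_update(s, a, r, s_next):
    """One temporal-difference update of the action-value table, Eq. (3)."""
    td_target = r + gamma * max(Q[s_next])    # bootstrapped target
    Q[s][a] += alpha * (td_target - Q[s][a])  # move Q(s, a) toward the target
```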
In DQN, the concept of experience replay is introduced to reduce the correlation between observations. Instead of training the neural network on the sequence in order, we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time step t in the replay memory D_t = {e_1, e_2, …, e_t}. To update the neural network, mini-batches of experiences (s, a, r, s′) ∼ U(D) are drawn uniformly at random from the replay memory. Training the neural network with a batch of experiences at iteration i minimizes the following loss function, which is similar to equation (5):

L_i(θ_i) = E_{(s,a,r,s′)∼U(D)} [ ( r + γ max_{a′} Q(s′, a′|θ_i^-) − Q(s, a|θ_i) )² ]    (6)

We can keep the target stationary by deploying two neural networks: the main Q-network and a target network Q̂. The parameters θ_i^- of the target network Q̂ are updated from the parameters θ_i of the main Q-network periodically instead of at every time step. With these features, DQN has shown outstanding performance compared with previous Q-learning based algorithms, as shown in Sect. 6.
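The following PyTorch sketch illustrates the replay-memory and target-network mechanism described above for our 3-feature state and 2-action output. It is an illustrative reimplementation rather than the tiny-dnn/ns-3 code used in our experiments, and the hyper-parameter values (learning rate, replay size, batch size) are placeholders.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    # 3 state features (queue length, dequeue rate, queuing delay) -> 2 action values
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 2))
    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())        # start from identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                          # replay memory D
gamma, batch_size = 0.99, 32

def train_step():
    """One mini-batch update of the main Q-network using the loss of Eq. (6)."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)          # uniform sampling U(D)
    s, a, r, s_next = map(torch.tensor, zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # frozen target network
        target = r.float() + gamma * target_net(s_next.float()).max(1).values
    loss = nn.functional.mse_loss(q_sa, target)        # TD-error as MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically copy the main-network weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```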

Process of Selecting Action
In our proposed DQN-based AQM scheme (DQN-AQ), we consider three parameters to define the state for RL: the current queue length in packets L, the dequeue rate R_deq, and the queuing delay d, where the dequeue rate is calculated using PIE's dequeue rate calculation method [28]. The state s_t at each time step t is defined as s_t = {L_t, R_{deq,t}, d_t}, which is the input to a multilayer perceptron (MLP) with two hidden layers of 64 neurons each in the Q-network. The main Q-network is used to select an action; it returns two probabilities in its output (drop/no-drop probability) and decides a packet drop/no-drop action a_t according to the higher probability. The action occurs at the packet departure stage of the queue, as in CoDel's dropping action [27]. In order to find a better action in a given state, we use an explore/exploit strategy that allows the agent either to take an action based on its own selection (exploit) or to take a random action chosen uniformly with a certain probability (explore). For the explore/exploit strategy, we use the ε-greedy algorithm, starting from a high random-action probability: in the first episode of our network simulation, the exploration probability is set to 90%, and it decreases to 0% over the episodes according to the episode index. The agent takes an action at the periodic interval T_int, and only when there is queuing delay in the queue. Figure 2 shows the process of selecting an action.
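A minimal sketch of this ε-greedy selection step is shown below; it is illustrative rather than the tiny-dnn implementation. The linear decay from 90% to 0% across episodes is an assumption about the exact schedule, and q_net can be any network mapping the 3-feature state to two action values (e.g., the QNet sketched above).

```python
import random
import torch

def select_action(q_net, state, episode, num_episodes):
    """epsilon-greedy choice over the two actions (0 = no-drop, 1 = drop).
    The exploration probability decays from 0.9 in the first episode toward 0
    in the last one (a linear schedule is assumed here)."""
    epsilon = max(0.0, 0.9 * (1 - episode / max(1, num_episodes - 1)))
    if random.random() < epsilon:
        return random.randint(0, 1)                     # explore: uniform random action
    with torch.no_grad():
        q_values = q_net(torch.tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())           # exploit: highest-valued action
```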

Reward Engineering
After taking an action, the RL agent waits for the next state s_{t+1} during the interval T_int. The selected action is evaluated by a reward function. The most important consideration in designing the reward function is to optimize the trade-off between queuing delay and drop rate. Further, the reward function should avoid leading to an infinite packet drop or no-drop state. We used LearnQueue's reward function [8] as our baseline. Our proposed reward function has two main components, delay_reward and enq_reward, corresponding to the queuing delay and the packet drop rate respectively. The delay_reward component is computed from delay_ref, the desired queuing delay, curr_delay, the current queuing delay, and a scaling factor that weights delay_reward against enq_reward. The enq_reward component is computed from enq_rate, the enqueue rate during T_int, which is calculated from N_enq, the number of enqueued packets, and N_drops, the number of dropped packets; when N_enq is 0, enq_rate is set to 0. The scaling factor can be used to configure the weight between the queuing delay and the packet drop rate for the agent. Unlike LearnQueue's reward function, which defines max_delay as the maximum time required to drain the queue depending on the AP's physical data rate, we define min_delay as the delay when the link between the edge device and the cloud gateway is fully utilized:

min_delay = L_byte / R_phy

where L_byte is the current queue length of the edge device in bytes and R_phy is the physical bandwidth (data rate) of the point-to-point (P2P) link connected to the edge device. Finally, the reward is calculated as the clipped sum of both components: reward = clip(delay_reward + enq_reward, −1, +1). To avoid divergence of the reward and to regularize it, we clip the reward to the constant minimum and maximum values −1 and +1.
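The sketch below assembles the reward only as far as the text above specifies it: the clipped sum, the min_delay normalizer, and an enq_rate read as the fraction of arriving packets that were enqueued (the most natural interpretation of the definition above, and therefore an assumption). The exact functional forms of delay_reward and enq_reward and the placement of the scaling factor are not reproduced here; they are passed in as already-computed values.

```python
def enqueue_rate(n_enq, n_drops):
    """enq_rate over T_int, assumed here to be the enqueued fraction of arrivals;
    defined as 0 when nothing was enqueued, as stated in the text."""
    total = n_enq + n_drops
    return n_enq / total if n_enq > 0 else 0.0

def min_delay(qlen_bytes, drain_rate_bytes_per_s):
    """Delay to drain the current backlog when the edge-to-cloud P2P link is
    fully utilized (units assumed consistent: bytes and bytes per second)."""
    return qlen_bytes / drain_rate_bytes_per_s

def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

def reward(delay_reward, enq_reward):
    """Final reward: the clipped sum of the two components, as defined above."""
    return clip(delay_reward + enq_reward)
```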

Training Process
In our DQN model of AQM, the agent stores an experience e_t = (s_t, a_t, r_t, s_{t+1}) as a tuple in the replay memory at each time step t. Once the number of experiences in the replay memory exceeds the mini-batch size, the agent draws a uniform random sample of experiences from the memory. Following (6), the agent minimizes the temporal-difference error (TD-error) using the mean square error (MSE) loss function. For the Q-network, we employ the rectified linear unit (ReLU) activation function, defined as f(x) = max(0, x), at each hidden layer, and a softmax function at the output layer to convert the outputs into action probabilities. Algorithm 1 describes the training process of DQN-AQ; at time step t, if the selected action is to drop the packet, a_t = 1, otherwise a_t = 0. The interval of each time step is T_int. For example, if T_int is 1 ms, the for loop in line 4 of Algorithm 1 runs every 1 ms, so that t = 1, 2, 3 corresponds to 1 ms, 2 ms, 3 ms. When t reaches the simulation time T, both the simulation and the training process terminate. In the first episode, we initialize the weights of the MLPs using the Xavier initializer [13], and we optimize the loss using the Adam optimizer [19] to train the model. Since we follow the principle of DQN introduced in [25] and there is no concept of "game over" in the network simulation, the discounted future reward is always added to the immediate reward. To avoid an infinite packet drop/no-drop action in certain states, we control the reward function. After giving the reward, we initialize the dequeue rate depending on the queuing delay, following PIE's dequeue rate calculation scheme [28].
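Putting the pieces together, the following sketch mirrors the structure of Algorithm 1 as described in this subsection, reusing select_action, replay, train_step, and sync_target from the earlier sketches. The env object and its methods (queue_state, apply_action, advance, observe_reward) are hypothetical stand-ins for the ns-3 queue, and the target-network synchronization period is an assumption.

```python
def run_episode(env, q_net, episode, num_episodes,
                t_int=0.001, sim_time=100.0, target_sync_every=1000):
    """One training episode: act every T_int while there is queuing delay,
    store the experience, and update the Q-network from the replay memory."""
    t, step = 0.0, 0
    while t < sim_time:
        state = env.queue_state()             # (queue length, dequeue rate, delay)
        if state[2] > 0:                      # act only when there is queuing delay
            action = select_action(q_net, state, episode, num_episodes)
            env.apply_action(action)          # 1 = drop at departure, 0 = no-drop
            env.advance(t_int)                # hypothetical: run the simulator for T_int
            r = env.observe_reward()          # reward computed over the elapsed interval
            next_state = env.queue_state()
            replay.append((state, action, r, next_state))
            train_step()                      # mini-batch MSE update with Adam (see above)
            if step % target_sync_every == 0:
                sync_target()                 # periodic target-network refresh
            step += 1
        else:
            env.advance(t_int)                # no action taken this interval
        t += t_int
```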

Experimental Setup
We built our IoT simulation platform considering a real IoT infrastructure, based on the IoT device classification and traffic characteristics reported in [29]. In that study, IoT network traffic was measured for over three weeks and the IoT devices were classified by clustering their traffic characteristics through K-means clustering. The daily average load of IoT traffic is found to be 66 kbps, which is mostly TCP traffic, with more than 50% of the TCP traffic using the HTTPS protocol [29]. A similar traffic pattern is also found in another study of a smart home consisting of sensors, hubs, plugs, and electronics [4]. We used five characteristics in our simulation: sleep time, active time, average packet size, mean data rate, and peak/mean rate. As shown in Fig. 3, all IoT devices are connected to a single edge device through wireless links in the simulated topology, and the edge device is connected to the cloud gateway through a physical P2P link. For communication between the cloud gateway, storage, and data center, carrier-sense multiple access (CSMA) is used on the cloud side. The camera-type devices use UDP for real-time monitoring traffic, while devices in all other categories use HTTPS over TCP, which is the common traffic type found in [29]. We configured TCP CUBIC [14] and the Proportional Rate Reduction (PRR) recovery algorithm [24] for TCP congestion control, since they are used by default in Linux kernels.
We assume that IoT devices send packet bursts at their peak transmission rate, causing queuing delay and packet loss. At the mean data rate, IoT traffic rarely causes queuing delay or packet loss, due to periodic packet transmission and small packet sizes. We configured the same distance from each IoT device to the edge device. We regulate the round-trip time (RTT) by controlling the baseline delay of the P2P link, and we set a low link capacity between the edge device and the cloud gateway in order to measure the performance of the AQM schemes from various perspectives, assuming narrow bandwidth for the IoT devices.
We simulated five other AQM schemes, P-FIFO, RED, PIE, CoDel, and FQ-CoDel, to compare their results with the proposed DQN-AQ scheme. All AQM schemes are deployed on the interface of the edge device connected to the cloud gateway through the P2P link, which is the bottleneck link in the network. The priority-FIFO-fast queuing discipline (P-FIFO), the default priority queue in Linux, is deployed on all other interfaces. We conducted experiments by varying the RTT, the P2P link capacity, the reward scaling factor, and the number of connected IoT devices. Tables 2 and 3 provide the parameter values for the simulation environment and the AQM setup respectively. We kept the default parameters for the AQM schemes, but adjusted PIE's mean packet size and dequeue threshold parameters to the topology. In Table 3, maxSize is the maximum queue size, meanPktSize is the mean packet size set by users to determine each AQM's behaviour, and delay_ref is the desired queuing delay. In RED, the parameter LInterm determines the maximum probability of dropping a packet. In PIE, dq_threshold is the dequeue threshold that indicates the amount of data in the queue needed to calculate the dequeue rate. Since PIE's authors recommend a dq_threshold value of 10 times the meanPktSize, we set dq_threshold to 3500 [28]. In CoDel, minByte is the minimum number of bytes accumulated in the queue before allowing a packet drop. In FQ-CoDel, the variable Flows is the number of sub-queues for classifying flows. Table 4 lists the setup parameters of the DQN-based AQM algorithm.

Comparison on Queuing Delay and Occupancy
We first study the queuing delay and queue occupancy of each AQM scheme. The delay reported here is the RTT that the flows of the IoT devices experience during the simulation time, collected for each AQM scheme. We can observe that P-FIFO mostly experiences high delay, over one second, in both the small and the large IoT network, because it merely classifies packets without mitigating the effect of congestion. In both networks, DQN-AQ outperforms the other AQM schemes by consistently showing lower delay. RED and PIE adapt their queues to keep the delay around 500 ms. In contrast, DQN-AQ shows a higher probability of queuing delay under 400 ms than the other AQM schemes, with its peak delay never exceeding 500 ms. If a queuing delay of at most 500 ms is required under 1 Mbps bandwidth, P-FIFO, RED, PIE, CoDel, FQ-CoDel, and DQN-AQ meet this requirement in 4.11%, 99.83%, 99.99%, 73.87%, 79.48%, and 100% of the cases respectively in the small network, and in 1.09%, 100%, 100%, 47.27%, 48.79%, and 100% of the cases respectively in the large network. DQN-AQ is the only AQM scheme that consistently shows the lowest delay (less than 400 ms) in both networks for 99% of the cases. Figure 5 shows the queue occupancy of all the AQM schemes during the simulation time of 100 s. It is evident that DQN-AQ does not reach 50% of its maximum queue size (50 packets), which indicates its optimal behaviour for maximizing the future cumulative reward.

Comparison on Different Bandwidth
We analyze the performance of the AQM schemes in the small network using 1 Mbps and 0.5 Mbps P2P link capacity between the edge device and the cloud gateway. Figure 6 is a boxplot of the throughput for each flow. Since we simulated a variety of IoT devices with traffic characteristics ranging from hard real-time to non-real-time stochastic traffic, a good number of outlier points are observed, but the boxplot still summarizes the distributions effectively. In Fig. 6, the median throughputs of the AQM schemes are relatively similar. FQ-CoDel shows a higher median throughput for 0.5 Mbps bandwidth, but it also incurs the second highest delay, as shown in Fig. 7. DQN-AQ shows the lowest median delay while maintaining throughput comparable with the other AQM schemes for 1 Mbps bandwidth. It also shows comparable delay-throughput performance for 0.5 Mbps bandwidth.

Comparison on Different Baseline RTT
We compare the AQM schemes for different baseline RTTs: 0, 10, 20, 50, and 100 ms. Figure 8 shows boxplots of throughput for the different baseline RTTs. We can observe that the interquartile range of the boxplots decreases as the baseline RTT increases. P-FIFO shows the lowest median throughput and FQ-CoDel shows the highest median throughput for all RTT test cases. DQN-AQ shows consistently moderate and comparable median throughput. In order to better understand the throughput performance, we also studied the mean, maximum, and most frequent delays and created their boxplots in Figs. 9, 10, and 11. P-FIFO shows the highest mean delay, which is expected considering its lowest mean throughput. As a trade-off for its throughput, FQ-CoDel shows high delay for every RTT case, as shown in Fig. 9. In general, we can observe that DQN-AQ shows low values for both the frequent and the maximum delays, and that the different RTT values have no significant impact on the queuing delay of the AQM schemes, as shown in the different delay statistics.

Analysis of Different IoT Device Categories
We set different transport layer protocols for different categories of IoT devices. For smart cameras, we set the UDP protocol, assuming they are used for real-time monitoring, surveillance, and storing streaming data in the cloud servers. For the other categories, consisting of Hubs, Switches & Triggers, Air quality sensors, Healthcare devices, Light bulbs, and Electronics, we set the TCP protocol, since the HTTPS protocol is mostly used by IoT devices in [29] for reliable communication.

UDP Based IoT Flows
We studied the performance of the UDP flows generated by smart cameras from the experiments of Sect. 6.3. Figures 12, 13, and 14 show the boxplots of throughput, mean delay, and mean jitter per UDP flow respectively. FQ-CoDel shows good throughput across all device categories, as discussed above, but it shows very low throughput for UDP flows for all RTT values, as shown in Fig. 12. DQN-AQ shows comparable throughput performance for UDP flows. Jitter is a useful metric to measure the QoS experienced by end users engaged in UDP based real-time communication such as video streaming and Voice over IP (VoIP). In Fig. 14, we can observe that DQN-AQ introduces low jitter irrespective of the RTT values.

TCP Based IoT Flows
TCP is known to be responsive to delay and congestion; hence an early drop by AQM causes TCP flows to back off before they contribute to imminent congestion as it starts building up. We studied the performance of TCP flows separately to understand their characteristic behaviour under AQM. Figures 15 and 16 show the throughput and delay of the 45 TCP flows of the IoT devices. Unlike UDP, the TCP flows carry small amounts of mainly sensor traffic; hence the overall throughput of the TCP flows is smaller than that of the UDP flows. P-FIFO shows higher delays than the other schemes and consequently the lowest throughput, which confirms that TCP is responsive to AQM as expected. Unlike the case of UDP flows, FQ-CoDel shows better throughput for TCP flows due to the responsive behaviour of TCP. Further, the throughput of TCP flows under DQN-AQ is comparable to their throughput under the other AQM schemes. However, DQN-AQ incurs a significantly lower delay than the other AQM schemes, which clearly demonstrates the benefit of DQN-AQ.

Analysis of Reward Trajectory for Model Optimization
We analyze our reward function and how it affects different performance metrics, as shown in Table 5. Figure 17 shows the growth of the reward for the trained network model and for a non-trained model that starts from random weights and is trained at every step during the first episode. We can observe that the cumulative reward of the trained model is much higher than that of the non-trained model. The non-trained model's cumulative reward increases almost linearly, since it is trained in parallel through each simulation time step. In Fig. 17b, the non-trained model exhibits unstable behaviour due to the randomness of its actions. Compared to the non-trained model, the trained model shows 22.362 ms shorter delay and 0.57841 ms shorter jitter. Table 5 shows the sensitivity of the different performance metrics to the reward scaling factor. A higher scaling factor indicates that the delay (delay_reward) is preferred over the drop rate (enq_reward) in the reward function. In an AQM, a low packet-drop rate usually causes longer delays in the queue, because dropping fewer packets allows more packets to enter the queue and accumulate there. Since our scheme still resorts to the drop-tail function when the queue becomes full, arriving packets are dropped not only by the early drop action but also due to queue overflow. Since multiple factors cause packet drops, the packet drop rate is not always inversely proportional to the mean delay in our experiments. However, as shown in Table 5, a higher scaling factor, which gives preference to delay, causes the mean delay to decrease. Table 6 shows the average performance of the different AQM schemes for the different performance metrics when the reward scaling factor is set to 0.6, which gives delay a slight edge over the drop rate. It is evident that DQN-AQ incurs the lowest mean delay and jitter while maintaining an average throughput above all other AQM schemes. This clearly demonstrates that our proposed DQN-AQ scheme achieves better delay-throughput performance due to its effective prediction of future rewards when deciding on the packet drop action, and its ability to adapt its decision to achieve a weighted balance between delay and drop rate.

Summary
In this paper, we proposed the design of a Deep Reinforcement Learning based Active Queue Management scheme. As the baseline model of the design, we selected Deep Q-Network, since the state transition in a queue is discrete and can be expressed as a finite Markov Decision Process, which is the fundamental principle of reinforcement learning. We considered three parameters to define the state: queue length, dequeue rate, and queuing delay. We defined delay reward and enqueue reward functions and introduced a scaling factor to achieve the trade-off between delay and throughput. We evaluated the performance of our scheme by implementing it in an IoT simulation platform consisting of an IoT network with a fog/edge device and a cloud server, simulated in ns-3 with the tiny-dnn package. The IoT network reflects real IoT devices' characteristics, including periodic operating cycle, actual transmission rate, packet size, and peak rate. We deployed the AQM scheme at the interface of the fog/edge node connected to the cloud gateway. We evaluated the performance of our scheme through simulation and compared it with the well-known schemes P-FIFO, RED, PIE, CoDel, and FQ-CoDel. In general, P-FIFO performed worse than our scheme, since the P-FIFO-enabled queue works passively. FQ-CoDel showed reasonably good throughput at the expense of high delay. The proposed DQN-based AQM scheme exhibited low queuing delay in most cases while preserving above-average throughput in the stochastic IoT environment. We clearly demonstrated that the scaling factor in our reward function is effective in tuning toward the desirable trade-off between delay and throughput.

Future Work
We identified two broad areas in which to extend the DQN-AQ scheme proposed in this paper: making the AQM energy efficient, and optimizing the reinforcement learning model.

Energy Efficient AQM Design
Energy efficient design plays a significant role in future networks, since the power consumption of the network grows significantly as the number of connected IoT devices increases. In terms of an energy efficient AQM, we expect that this can be achieved by increasing the interval of the packet-drop probability calculation and by optimizing memory usage.

Optimized Reinforcement Learning Model
Selecting an optimal model and tuning hyper-parameters are always major challenges in deep learning applied work. The deep learning field is growing fast, and the performance of its algorithms keeps improving. Hyper-parameter tuning of the selected model can improve performance as much as selecting a good model can, which opens a new area of research on Automatic Machine Learning (AutoML) that can be explored for AQM design.