Energy-Efficient Multidimensional Trajectory of UAV-Aided IoT Networks With Reinforcement Learning

This article proposes a multidimensional search space (or directional space) with more degrees of freedom (DOFs) to increase the energy efficiency of a battery-limited unmanned aerial vehicle (UAV) in an Internet of Things (IoT) data collection scenario. The UAV navigates from the initial to the goal point while collecting data from IoT sensors on the ground. Owing to the limited battery power of UAVs, trajectory optimization is a crucial practical problem. Based on the available directional space, the direction of the UAV along the navigation trajectory is optimized using reinforcement learning (RL). The objective of RL is to maximize the energy efficiency of the UAV as a long-term reward by selecting the optimal direction. Moreover, a practical energy consumption model and environment are presented. Simulation results verify that the proposed multidimensional trajectory for the UAV achieves higher energy efficiency than the benchmark models.


I. INTRODUCTION
RECENTLY, unmanned aerial vehicle (UAV)-aided communication networks have attracted significant interest because of their potential to provide reliable high-rate connectivity, which is promising for future wireless networks [1]-[10]. On the one hand, UAVs can be deployed to support existing networks and improve the Quality of Service (QoS) for ground users by establishing Line-of-Sight (LoS) links [11], [12]. On the other hand, UAVs can be beneficially applied to several potential applications in next-generation wireless networks owing to their high mobility, low cost, and flexible deployment [13].
The Internet of Things (IoT) provides massive connectivity among physical objects, e.g., vehicles and wearable devices such as various sensors, which are key components of future networks [14]-[16]. However, the transmission reliability of IoT devices remains challenging owing to long distances and power constraints [17]. UAVs are expected to overcome these problems because of their high mobility and flexibility. UAVs can navigate over large areas, collecting data from IoT sensors and transmitting them to other IoT devices or directly to a data center [18]. The high feasibility of LoS connections can improve the transmission efficiency and power usage of IoT devices [11], [13]; thus, their service life can be extended. Nonetheless, the limited battery power of UAVs [19] is a crucial challenge despite the benefits of UAV application in IoT data collection. A survey discussing UAVs in public safety communications from an energy efficiency perspective and addressing the limited-battery-power problem is presented in [20]. Some previous studies proposed trajectory optimization to reduce the energy consumption of UAVs and thus extend the flight time [21]-[23]. However, most of these studies considered only a fixed-altitude scenario in a two-dimensional (2-D) environment with a limited action space (e.g., up, down, right, and left). Such a scenario is impractical in an actual environment, in which a UAV moves in a three-dimensional (3-D) directional space. Moreover, a limited directional space forces the UAV to perform more direction changes and increases its traveling cost, which results in higher energy consumption. Therefore, a 3-D directional space and a 3-D environment should be designed for UAVs to address this problem. Several studies have proposed the optimization of 3-D trajectories to reduce the travel cost and improve the energy efficiency of UAVs [25].
However, the directional space of UAVs in these studies was limited (e.g., ascend, descend, right, left, forward, and backward).
In addition, owing to the high mobility of UAVs, which causes the environment to change over time, the trajectory optimization problem of UAVs is difficult to solve using classical optimization techniques (e.g., mathematical derivation) [27]-[30]. Motivated by advances in machine learning, this article uses a type of machine learning called reinforcement learning (RL) to optimize the trajectory and maximize the energy efficiency of UAVs. RL is a form of dynamic learning that determines the optimal actions of an agent interacting with its environment [31]. RL employs trial-and-error processes to make informed decisions based on imperfect information about the environment. Based on these characteristics, RL is suitable for addressing challenges in dynamic environments with sequential decisions, such as UAV communication problems. Therefore, a multidimensional trajectory for the UAV is proposed and integrated with RL to maximize the energy efficiency of the UAV during IoT data collection.

A. Related Works
A comparison of related studies is presented in Table I. Some of these studies optimized the trajectory of UAV-assisted communication networks. An optimized 2-D trajectory for a UAV was presented in [21] to minimize energy consumption using the bisection search algorithm and the block coordinate descent (BCD) method. Moreover, the energy consumption of UAVs during IoT data collection was minimized by jointly optimizing the communication scheduling, the transmission power allocation for IoT devices, and the UAV trajectory [22]. The hovering positions, trajectory, and scheduling of UAVs were optimized for reliable and flexible emergency communication in [34]. In [35], nonorthogonal multiple access (NOMA) precoding and the UAV trajectory were jointly optimized to maximize the sum rate using an iterative algorithm. Additionally, a joint optimization of 3-D UAV placement and a path-loss compensation factor was proposed by Shakoor et al. [36] to maximize the energy efficiency of the UAV. Furthermore, several studies have considered 3-D environments for trajectory optimization in UAV communication networks using RL. In [26], an optimized 3-D trajectory of a UAV was presented to improve the system capacity using the constrained deep Q-network (cDQN) algorithm. The fairness and throughput of a UAV-assisted communication system were improved by optimizing the 3-D UAV trajectory using a deep RL (DRL) approach [32]. A UAV control policy based on a deep deterministic policy gradient (UC-DDPG) was proposed in [33] to improve the energy efficiency of UAVs and provide a fair coverage service. Additionally, in [37], 3-D trajectories with six degrees of freedom (DOFs) for multi-UAV-assisted communication were optimized using RL (specifically, Q-learning) to maximize user satisfaction. However, none of these studies utilized RL for trajectory optimization in UAV-aided IoT networks. Separately, RL has been used to optimize the trajectory of UAVs in IoT data-collection missions [24].
In [38], the height of a UAV was optimized using RL to maximize the data throughput collected from sensor devices and the energy efficiency of the UAV in a wireless charging scenario. However, none of the aforementioned studies considered an extended directional space for 3-D UAV trajectory optimization. Furthermore, none of them considered a UAV energy consumption model based on actual practical scenarios (e.g., the avionics energy of UAVs).

B. Contributions
The key contributions of this article are summarized as follows.
1) A larger directional space for the UAV is proposed to increase energy efficiency. In practical conditions, limited DOFs result in a longer travel distance for the UAV to reach a specific location. The energy consumed in traveling and changing direction therefore shortens the flight time, and a shorter flight time means less collected data, i.e., lower energy efficiency. Compared with previous studies [24], [26], and [37], which considered only four or six directions and thus a longer traveled distance, the proposed model provides 26 directions that reduce the travel distance and energy consumption. First, the directional space in the 2-D scenario is increased to eight directions; subsequently, a 3-D directional space with 26 directions is proposed.
2) A 3-D environment that considers the ground contour is designed for UAV flight navigation, which makes the scenario more practical compared with previous studies [24], [26], [37].
3) An energy consumption model is derived based on [19]. The proposed model considers practical scenarios, such as avionics, hovering, traveling, and communication energy consumption. To the best of our knowledge, none of the aforementioned studies considered such practical energy consumption.
4) The effects of the directional space and flight environment of the UAV on energy efficiency are investigated. A higher DOF decreases the traveling distance and the energy required to change direction, which reduces the cumulative energy consumption; thus, the energy efficiency of the UAV in collecting IoT sensor data is improved. Moreover, a larger flight environment (3-D space) yields lower energy efficiency than 2-D space, which is reasonable in practice, since the travel energy consumption of the UAV increases with the size of the travel space.

C. Paper Organizations
The remainder of this article is organized as follows. Section II presents the system model of a UAV in a 2-D environment. Furthermore, the system model of a UAV in 3-D space is presented in Section III. In Section IV, an energy consumption model for an actual rotary-wing UAV is presented.
The RL model of optimized trajectory for UAV is presented in Section V. The simulation results are discussed in Section VI. Finally, concluding remarks are given in Section VII.

D. Notations
In this article, the Euclidean norm of a vector is denoted by ||·||, {·} represents an array, and R^M denotes the space of M-dimensional real vectors.

II. SYSTEM MODEL OF FIXED-ALTITUDE UAV IN 2-D SPACE
Consider a battery-limited UAV flying at a fixed altitude of H m above K IoT sensors from the initial point L_I to the goal point L_F, as shown in Fig. 1. The sensors, denoted by K = {K_1, K_2, . . . , K_K}, are randomly distributed in the region M ∈ R^3 between the initial and goal points. As shown in Fig. 1, the UAV communicates with the IoT sensors on the ground by providing direct connectivity from the sky. The UAV is assumed to apply time-division multiple access (TDMA) to collect the uploaded data sequentially from the sensors. Specifically, the UAV can only receive uploaded data from a single sensor at each time t. This article focuses on the optimized trajectory of the UAV when collecting data for higher energy efficiency. In addition, we assume that the UAV communicates without the assistance of a central controller and has partial global knowledge of the environment. In other words, the sensor locations in the environment are locally known.

A. Signal Model
The environment of the proposed model M can be modeled using a 3-D Cartesian coordinate system. The location of sensor K_k is denoted as P_k at (a_k, b_k, 0). The initial and goal points can be represented by L_I = (x_I, y_I, H) and L_F = (x_F, y_F, H), respectively. Moreover, the location of the UAV at time t is denoted by L_t = (x_t, y_t, H). Subsequently, these locations are projected onto the ground plane, which can be expressed as p_k = (a_k, b_k), l_I = (x_I, y_I), l_F = (x_F, y_F), and l_t(t) = (x(t), y(t)). The Euclidean distance between the UAV and sensor K_k at t is expressed as

d_k(t) = \sqrt{H^2 + \|l_t(t) - p_k\|^2}. \quad (1)

In contrast to the channel characteristics of terrestrial communications, air-to-ground (A2G) channels are likely to be dominated by LoS links [15]. Therefore, the communication between the UAV and sensors is considered to be monopolized by the LoS links. The channel gain between the UAV and sensor K_k can be expressed as

h_k(t) = \alpha_0 d_k^{-\zeta_{pl}}(t) = \frac{\alpha_0}{\left(H^2 + \|l_t(t) - p_k\|^2\right)^{\zeta_{pl}/2}}, \quad (2)

where α_0 denotes the power loss of the channel at the reference distance d_0 = 1 m and ζ_pl ≥ 2 is the path-loss exponent. The channel gain h_k(t) depends on the location of the UAV at t.
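As a minimal sketch (not the authors' code), the distance and LoS channel-gain model above can be coded as follows; the values of `alpha0` and `zeta_pl` are illustrative defaults, and the coordinates are arbitrary.

```python
import math

def channel_gain(uav_xy, uav_h, sensor_xy, alpha0=1e-5, zeta_pl=2.0):
    """LoS channel gain between a fixed-altitude UAV and a ground sensor.

    Assumes h_k(t) = alpha0 * d_k(t)^(-zeta_pl), where alpha0 is the channel
    power at the 1 m reference distance (illustrative value, an assumption).
    """
    dx = uav_xy[0] - sensor_xy[0]
    dy = uav_xy[1] - sensor_xy[1]
    d = math.sqrt(uav_h ** 2 + dx ** 2 + dy ** 2)  # Euclidean distance
    return alpha0 * d ** (-zeta_pl)
```

As expected from the model, the gain decays monotonically as the UAV moves away from the sensor's ground projection.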

B. Data Transmissions
Based on the assumption that the UAV collects sensor data using the TDMA scheme, sensor K_k has the maximum channel gain for each t ∈ [0, T] when satisfying the following:

K_k = \arg\max_{k \in K} h_k(t) = \arg\min_{k \in K} \|l_t(t) - p_k\|. \quad (3)

Therefore, the received data rate of the UAV from sensor K_k at t can be expressed as

C_k(t) = B \log_2\left(1 + \frac{P_k h_k(t)}{\sigma^2}\right) = B \log_2\left(1 + \frac{\gamma_0}{\left(H^2 + \|l_t(t) - p_k\|^2\right)^{\zeta_{pl}/2}}\right), \quad (4)

where P_k denotes the transmit power of sensor K_k, σ^2 denotes the noise power, and γ_0 = P_k α_0/σ^2 denotes the signal-to-noise ratio (SNR). Moreover, considering that only a specific sensor can communicate with the UAV and upload data at t, the established connection is indicated by

\rho_k(t) \in \{0, 1\}, \quad (5)

where ρ_k(t) is an indicator of communication between the sensor and the UAV at t. In addition, ρ_k(t) = 1 if K_k ∈ K communicates and uploads its data to the UAV; otherwise, ρ_k(t) = 0. Note that only one sensor can communicate with the UAV at each t during the flight time T, which can be expressed as

\sum_{k=1}^{K} \rho_k(t) \le 1, \quad \forall t \in [0, T]. \quad (6)

Based on (3) and (4), the received data rate of the UAV at t depends on the minimum distance between the UAV and sensor K_k, as well as the established communication between them. In other words, the UAV trajectory directly affects the achievable data rate. Thus, the total received data rate during T can be expressed as

C(t) = \sum_{k=1}^{K} \rho_k(t) C_k(t), \quad (7)

where C(t) denotes the total uploading data rate, and q(t) represents the trajectory of the UAV at t. Furthermore, the reception quality of the data should be ensured when a sensor such as K_k is selected by assuming C_k(t) ≥ r_0, where r_0 is a predefined minimum target rate. Thus, the total received data rate C(t) of the UAV satisfies

C(t) \ge r_0. \quad (8)
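The TDMA selection rule and the resulting rate can be sketched as below, assuming the Shannon-type form B log2(1 + P_k h_k/σ²); every parameter value here is illustrative, not taken from the paper's simulation settings.

```python
import math

def best_sensor_rate(uav_xy, uav_h, sensors, p_k=0.01, sigma2=1e-14,
                     alpha0=1e-5, zeta_pl=2.0, bandwidth=2e6):
    """Pick the sensor with maximum channel gain (TDMA: one link per slot)
    and return (index, rate) with C_k = B * log2(1 + P_k * h_k / sigma^2).
    All default parameter values are illustrative assumptions."""
    def gain(s):
        d2 = uav_h ** 2 + (uav_xy[0] - s[0]) ** 2 + (uav_xy[1] - s[1]) ** 2
        return alpha0 * d2 ** (-zeta_pl / 2)

    # The max-gain sensor is the min-distance sensor under the LoS model.
    k = max(range(len(sensors)), key=lambda i: gain(sensors[i]))
    snr = p_k * gain(sensors[k]) / sigma2
    return k, bandwidth * math.log2(1 + snr)
```

A usage example: with sensors at (0, 0) and (80, 80) and the UAV near the origin, the first sensor is scheduled because it has the smaller UAV-to-sensor distance.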

III. SYSTEM MODEL OF OPTIMIZED-ALTITUDE UAV IN 3-D SPACE

A. Signal Model
Under realistic scenarios, the exploration environment is most likely to be in a 3-D coordinate space. Therefore, in this article, we consider the terrain of the environment. This means that the K IoT sensors distributed in the environment M ∈ R^3 are located at different coordinates and heights. Moreover, the UAV maneuvers along the x, y, and z axes. Similar to the 2-D system, the UAV is assumed to fly from the initial to the goal point while collecting data from sensors randomly distributed in the environment. In addition, P_k denotes the location of sensor K_k at (a_k, b_k, c_k). The location of the UAV at t can be represented as L_t = (x_t, y_t, H_t). Therefore, the distance between the UAV and sensor K_k at t can be expressed as

d_k(t) = \sqrt{(x(t) - a_k)^2 + (y(t) - b_k)^2 + (H(t) - c_k)^2}, \quad (9)

where x(t) and y(t) denote the UAV coordinates at t, and the UAV height at t is denoted by H(t). Additionally, the channel gain between the UAV and sensor K_k is given by

h_k(t) = \alpha_0 d_k^{-\zeta_{pl}}(t), \quad (10)

where α_0 denotes the power loss of the channel at the reference distance d_0 = 1 m and ζ_pl ≥ 2 is the path-loss exponent. The location of the UAV at t affects the channel gain h_k(t).

B. Data Transmission
As mentioned previously, the UAV is assumed to collect sensor data using the TDMA scheme. Moreover, sensor K_k has the maximum channel gain when the distance between the UAV and the sensor is minimum for each t ∈ [0, T], which can be expressed as

K_k = \arg\min_{k \in K} \|L_t - P_k\|, \quad (11)

where P_k and L_t denote the locations of the sensor and UAV, respectively, at t. Furthermore, the received data rate of the UAV from sensor K_k at t is given by

C_k(t) = B \log_2\left(1 + \frac{\gamma_0}{\left((x(t) - a_k)^2 + (y(t) - b_k)^2 + (H(t) - c_k)^2\right)^{\zeta_{pl}/2}}\right), \quad (12)

where (x(t), y(t), H(t)) denotes the location of the UAV at t, and a_k, b_k, and c_k represent the location of the sensor along the x, y, and z axes, respectively. Subsequently, the total received data rate of the UAV during T can be calculated using (7) and (8).
IV. ENERGY CONSUMPTION MODEL FOR ROTARY-WING UAVS

Generally, the energy consumption of a UAV is divided into two main components: 1) the communication energy and 2) the propulsion energy. Propulsion energy is required to support the operation and maneuvering of the UAV. In this study, the propulsion energy consumption was modeled based on actual conditions that consider the physical model of the UAV. Propulsion energy is composed of three parts: 1) travel; 2) hovering; and 3) avionics energy [19]. The traveling energy, denoted by E_t, represents the amount of energy consumed by the UAV when moving from one location to another. The traveling energy depends on the distance traveled by the UAV, which can be expressed as

E_t(t) = \frac{n_l g}{r} q(t), \quad (13)

where n_l denotes the mass of the UAV framework and battery, g denotes the gravitational constant, q(t) denotes the distance traveled by the UAV at t, r denotes the lift-to-drag ratio, and n denotes the number of rotors. In addition, the energy consumption of the UAV during hovering was considered. The hovering energy of the UAV can be expressed as

E_h(t) = \sqrt{\frac{(n_l g)^3}{2 \rho \zeta n}}\, T_H, \quad (14)

where ρ denotes the air density constant, ζ denotes the spinning area of the blade disc of one rotor, and T_H denotes the hovering time. In addition, the energy consumption of the avionic components is given by

E_a(t) = P_{avio} \frac{q(t)}{v(t)}, \quad (15)

where P_avio denotes the power of the avionic components, q(t) denotes the travel distance of the UAV, and v(t) denotes the UAV velocity at t. In addition to the propulsion energy, the energy consumed by the UAV to receive and transmit data was considered. The communication energy of the UAV can be expressed as

E_c(t) = \sum_{k=1}^{K} \rho_k(t) P_{UAV} \frac{B_k}{C_k(t)}, \quad (16)

where P_UAV denotes the power of the UAV, ρ_k(t) denotes the indicator of the scheduled transmission of sensor k, C_k(t) denotes the achievable data rate at t, and B_k denotes the volume of data stored at sensor K_k.
Therefore, the total energy consumed by the UAV while collecting data from the sensors from the initial to the goal point can be expressed as

E_{total}(t) = E_t(t) + E_h(t) + E_a(t) + E_c(t), \quad (19)

where E_t, E_h, E_a, and E_c denote the travel, hovering, avionics, and data communication energy of the UAV, respectively. Based on the received data rate C(t) in (8) and the total energy consumption of the UAV in (19), the energy efficiency of the UAV can be calculated by dividing the received data rate by the total energy consumed by the UAV during flight:

\eta_{EE}(t) = \frac{C(t)}{E_{total}(t)}. \quad (20)

Based on (19) and (20), the energy efficiency of the UAV is affected by the total received data rate and the traveled distance of the UAV.
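A heavily hedged sketch of the four energy terms and the efficiency ratio follows. The exact closed forms in [19] are not reproduced here; the travel term uses a weight-over-lift-to-drag approximation and the hovering term uses actuator-disc induced power, both common rotary-wing assumptions. The physical defaults match the simulation section (1.46 kg, r = 3, n = 4, ρ = 1.225, ζ = 0.0507), while `p_avio` and `p_uav` are illustrative.

```python
import math

def uav_energy(q, v, t_hover, t_comm, n_l=1.46, g=9.807, r=3.0,
               n=4, rho=1.225, zeta=0.0507, p_avio=5.0, p_uav=1.0):
    """Sketch of travel + hovering + avionics + communication energy (J).
    The travel and hovering forms are standard approximations, assumed here."""
    e_travel = (n_l * g / r) * q                      # work against drag over distance q
    p_hover = math.sqrt((n_l * g) ** 3 / (2 * rho * zeta * n))
    e_hover = p_hover * t_hover                       # induced power while hovering
    e_avionics = p_avio * q / v                       # avionics power over flight time q / v
    e_comm = p_uav * t_comm                           # radio energy while communicating
    return e_travel + e_hover + e_avionics + e_comm

def energy_efficiency(total_bits, total_energy):
    """eta_EE = collected data / total consumed energy (bits per joule)."""
    return total_bits / total_energy
```

The ratio makes the trade-off explicit: a shorter trajectory lowers the denominator, while more collected data raises the numerator.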

V. REINFORCEMENT LEARNING MODEL OF ENERGY-EFFICIENCY-BASED OPTIMIZED TRAJECTORY FOR UAV
The problem formulation in this study is to maximize the energy efficiency η_EE(t) in the long term by designing an optimized trajectory of the UAV. The optimization of the UAV trajectory for maximizing the energy efficiency can be expressed as

\max_{q(t)} \; \eta_{EE}(t) = \frac{C(t)}{E_{total}(t)} \quad (20)
\text{s.t. } C_k(t) \ge r_0, \quad 0 \le t \le T, \quad (21)

where υ(t), q(t), E_total, and T denote the velocity, travel distance, total energy consumption, and flight time of the UAV, respectively. Based on the above formulation, the energy efficiency depends critically on the trajectory of the UAV. Considering an actual scenario in which information related to the locations of the sensors and the amount of data to be transmitted is unavailable to the UAV, the problem is modeled as a Markov decision process (MDP). The MDP is a well-known model that can address a partially observed environment with uncertainties and challenges. To solve the problem formulated in (20) and (21), we convert it into a multiperiod decision scheme with finite states and actions.
A. Fixed-Altitude UAV

1) State Space: With this MDP model, the UAV is considered a learning agent that aims to learn an optimized trajectory through RL. The limited flight time T of the UAV is discretized into M time slots. Thus, the step size is defined as δ = T/M. Furthermore, environment M can be divided into M_1 = L_1/(δω) × M_2 = L_2/(δω) tiles, where δω m is the length of each side. In addition, L_1 and L_2 denote the length and width of the environment (m), respectively, and ω denotes the speed of the UAV (m/s). Fig. 2(a) shows a projection of environment M on a horizontal ground plane for a fixed-altitude UAV. As Fig. 2(a) shows, environment M is divided into M_1 × M_2 tiles, and each sensor K_k ∈ K is placed in a single tile.
Based on the discretized region of the considered environment, the state set S of the UAV is represented as S = {s(1), s(2), . . . , s(M_1 M_2)}, where each state s(m) ∈ S refers to a small tile. Fig. 2(a) depicts a UAV agent navigating through the tiles from the initial to the goal point while collecting data from sensors randomly distributed within the environment. Owing to its limited battery energy, the finite flight time of the UAV requires the observation of the optimal trajectory. Considering that the UAV can only partially observe the environment, RL aids in making accurate decisions for an optimized trajectory.

2) Action Space: In this scenario, the UAV agent has a maximum of eight actions in each state, that is, {West, North West, North, North East, East, South East, South, South West}. For clarity, the UAV action set is denoted by A = {W, NW, N, NE, E, SE, S, SW}. Because the UAV flies at a fixed altitude of H m, it moves along the x and y axes. Note that in practice, the UAV can select any direction θ, that is, θ ∈ [0, 2π].
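The eight-direction action set and the grid transition can be sketched as follows; the compass-to-offset mapping and the 5 × 5 grid size are illustrative choices, and out-of-grid moves are simply clamped here (one possible boundary-handling assumption).

```python
# Eight compass directions of the 2-D action set A = {W, NW, N, NE, E, SE, S, SW},
# mapped to unit tile moves (x grows east, y grows north; an assumed convention).
ACTIONS_2D = {
    "W": (-1, 0), "NW": (-1, 1), "N": (0, 1), "NE": (1, 1),
    "E": (1, 0), "SE": (1, -1), "S": (0, -1), "SW": (-1, -1),
}

def step_2d(state, action, m1=5, m2=5):
    """Apply an action on the M1 x M2 tile grid, clamping at the boundary."""
    dx, dy = ACTIONS_2D[action]
    x = min(max(state[0] + dx, 0), m1 - 1)
    y = min(max(state[1] + dy, 0), m2 - 1)
    return (x, y)
```

Note that the four diagonal actions are exactly what a four-action space (up/down/left/right) lacks: reaching a diagonal neighbor takes one move here instead of two.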
3) Reward Formulation: RL rewards are designed to determine the best solutions that satisfy the constraints asserted by the learning agent [11]. In this article, the UAV as an agent aims to maximize energy efficiency during data collection within a finite flight time. During the RL process, a score is assigned to action a that the UAV executes at time t, and a reward r is collected at a future time t′. The score estimates the importance of the action in reward production. A fly-hover-and-communicate design [22] is also considered in this article, in which the data stored at sensor K_k are fetched when communication with the UAV is established. Moreover, the UAV hovers when collecting sensor data. By ensuring the communication quality, the UAV collects data from the sensor only when it moves to the sensor location at tile m ∈ M [Fig. 2(a)]. As previously mentioned, the UAV has a finite flight time, and the maximum time required for the UAV to move from one tile to another can be expressed as

T_s = \frac{T_{max}}{M_1 M_2}, \quad (23)

where T_max denotes the maximum flight time of the UAV, and M_1 M_2 denotes the number of tiles, each representing a time slot. In addition, the maximum time required by the UAV to fetch sensor data is defined by

T_H = \frac{B_k}{C_k(t)}, \quad (24)

where B_k denotes the volume of data stored at sensor K_k, and C_k(t) represents the received data rate at the UAV at t. Therefore, the total time required by the UAV in one tile can be expressed as

T_{req} = T_s + T_H, \quad (25)

where T_s denotes the time required for the UAV to move from one tile to another. Because the main objective is to maximize the energy efficiency of the UAV for collecting data within a finite flight time, the reward function of the UAV agent can be expressed as

R(t) = \begin{cases} \eta_{EE}(t), & C_k(t) \ge r_0, \\ 0, & \text{otherwise}, \end{cases} \quad (26)

where r_0 denotes the minimum UAV data rate. In addition, to ensure that the UAV recognizes the goal point L_F, a reward is received by the UAV when it arrives at its destination during the learning process.
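The per-tile timing and the reward can be sketched as below. The exact expressions are assumptions consistent with the text: the flight budget split evenly over tiles, hover time as data volume over rate, and a reward that falls to zero when the minimum rate r_0 is not met (the paper's exact penalty may differ).

```python
def slot_time(t_max=1800.0, m1=5, m2=5):
    """Maximum per-tile travel time: the flight budget split over the tiles."""
    return t_max / (m1 * m2)

def hover_time(b_k, c_k):
    """Time to fetch B_k bits at rate C_k (fly-hover-and-communicate)."""
    return b_k / c_k

def reward(rate, energy, r0):
    """Energy-efficiency reward; the zero branch below the minimum rate r0
    is an assumption about the paper's reward shaping."""
    return rate / energy if rate >= r0 else 0.0
```

For example, a 30-min (1800 s) budget over a 5 × 5 grid leaves at most 72 s per tile, and fetching 2 Mb at 1 Mb/s adds 2 s of hovering.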

B. Optimized Altitude UAV
1) State Space: Compared with the fixed-altitude scenario, the optimized-altitude UAV learns the optimal 3-D trajectory. The environment M is discretized into M_1 = L_1/(δω) × M_2 = L_2/(δω) × M_3 = L_3/(δω) voxels, with a length of δω m on each side. Here, L_1, L_2, and L_3 denote the height, length, and width of the environment (m), respectively, and ω denotes the speed of the UAV (m/s). Fig. 2(b) shows a projection of environment M in 3-D space for the optimized-altitude UAV scenario. As shown in Fig. 2(b), environment M is divided into M_1 × M_2 × M_3 voxels, and each sensor K_k ∈ K is placed in a single voxel.
By referring to the discretized region of the considered environment, the state set S of the UAV can be represented as S = {s(1), s(2), . . . , s(M_1 M_2 M_3)}, where each state s(m) ∈ S refers to a voxel. Fig. 2(b) shows a UAV agent navigating through the voxels from the initial to the goal point while collecting data from sensors randomly distributed within the environment. Owing to its limited battery energy, the finite flight time of the UAV requires the observation of an optimal 3-D trajectory. RL can aid the UAV in making accurate decisions for an optimized 3-D trajectory.
2) Action Space: Because the UAV can navigate in a 3-D space, the action space of the UAV is modeled on a Bloch sphere [Fig. 2(b)]. In this scenario, the UAV agent has a maximum of 26 actions in each state, e.g., {Ascend, Descend, West, North West, North, North East, East, South East, South, South West, Ascend West, Descend North West}. For clarity, the action set of the UAV is denoted by A = {A, D, W, NW, N, NE, E, SE, S, SW, AW, ANW, AN, ANE, AE, ASE, AS, ASW, DW, DNW, DN, DNE, DE, DSE, DS, DSW}. Because the UAV flies at altitude H(t) m at t, it moves along the x, y, and z axes. Note that in practice, the UAV can select its path along any direction in θ and φ, that is, θ ∈ [0, 2π], φ ∈ [0, 2π].
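The 26 directions are exactly the unit moves on a 3-D grid excluding "stay in place" (3³ − 1 = 26): eight horizontal compass moves, ascend, descend, and the eight ascend- and eight descend-combined diagonals. A sketch, with clamped boundaries as an illustrative assumption:

```python
from itertools import product

# All unit voxel moves except staying put: 3^3 - 1 = 26 directions,
# covering the 8 compass moves, ascend/descend, and their 16 combinations.
ACTIONS_3D = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

def step_3d(state, action, m1=5, m2=5, m3=5):
    """Apply one of the 26 moves on the voxel grid, clamping at boundaries."""
    return tuple(min(max(s + a, 0), m - 1)
                 for s, a, m in zip(state, action, (m1, m2, m3)))
```

For instance, the pure ascend move is (0, 0, 1), and a diagonal such as (-1, 0, 1) reaches in one step a voxel that a six-direction space would need two steps to reach.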
3) Reward Formulation: Similar to the fixed-altitude UAV scenario, the maximum time required for the UAV to move from one voxel to another can be expressed as

T_s = \frac{T_{max}}{M_1 M_2 M_3}, \quad (27)

where T_max denotes the maximum flight time of the UAV, and M_1 M_2 M_3 denotes the number of voxels, each representing a time slot. Moreover, the maximum hovering time T_H, total time T_req, and reward function R(t) can be obtained using (24)-(26).

C. Reinforcement Learning-Based Optimized Trajectory Design
RL was adopted in this study to solve the sequential decision-making of the UAV during the data-collecting mission. RL addresses this problem by maximizing the immediate and long-term rewards. The maximum rewards are achieved from the best action performed by the agent. The best action is selected under a policy,¹ which is updated using a Q-table that records the experience of the agent taking an action in a state and collecting the reward. At the beginning of the learning process, the state-action values in the Q-table are initialized to zero. In a state S_t at t, the UAV as the learning agent consecutively implements the following steps: select an action A_t from action set A, move to the next state S_{t+1}, receive a reward R_{t+1}, and update the Q-value in the Q-table. In this article, the agent state is represented by the location of the UAV (X_t, Y_t, H_t) at t with action set A, as mentioned in Sections V-A and V-B. Based on the current state (X_t, Y_t, H_t) and the selected action A_t, the reward is collected, and the UAV moves to a new state (X_{t+1}, Y_{t+1}, H_{t+1}). The reward can be considered as feedback from the environment for the given action. In terms of maximizing the energy efficiency, the reward is a function of the total achievable throughput divided by the total UAV energy consumption over time T, as defined in (26), Section V-A.
In this article, a model-free temporal difference (TD) method² is used to update the Q-value because the UAV requires multiperiod decision-making. First, the Q-value is updated using an off-policy method known as Q-learning, in which the behavior policy need not coincide with the policy being evaluated and improved. The Q-value in Q-learning can be updated using

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a \in A} Q(S_{t+1}, a) - Q(S_t, A_t)\right], \quad (28)

where α ∈ [0, 1] denotes the learning rate (or step size), and γ ∈ [0, 1] represents the discount factor that determines the importance of future rewards. Furthermore, for comparison, the Q-value is also updated using an on-policy method, i.e., SARSA, which learns the optimal policy by learning the Q-value for the current policy and all state-action pairs. Specifically, the policy used in learning is the same as that being evaluated and improved. In this method, the Q-value is updated using

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]. \quad (29)

The updated Q-value in (29) depends on the variables S_t, A_t, S_{t+1}, and A_{t+1}. SARSA iteratively estimates the Q-value under policy π and simultaneously improves π in a greedy manner based on Q^π.

¹A policy defines the behavior of a learning agent at a fixed instant in time [7], which denotes a set of rules for that agent.
²The TD method improves the estimation of future reactions by extracting information from observations of sequential stochastic processes that model the environment of the agent. Q-learning and state-action-reward-state-action (SARSA) are TD-based methods [11].
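The two TD updates, (28) and (29), can be sketched with a dictionary-backed Q-table (missing entries read as the zero initialization); the defaults α = 0.5 and γ = 0.98 follow the simulation settings.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.98):
    """Off-policy TD update, Eq. (28): bootstrap with max over next actions."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - q.get((s, a), 0.0))

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.98):
    """On-policy TD update, Eq. (29): bootstrap with the action actually taken."""
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0))
```

The only difference is the bootstrap target: the greedy maximum in Q-learning versus the next action chosen by the current policy in SARSA, which is why the two methods can converge to different policies in the same environment.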
According to the updated Q-values, the UAV as an agent exploits what it has already experienced to receive a maximum reward. However, the agent must also explore unmapped or altered environments to determine whether it can execute better action selections in the future. Exploration motivates the agent to learn from the environment and enables the learning process to escape a local optimum. Specifically, exploration provides the UAV the chance to perform a specific action, observe a reward, and update its action choices. Meanwhile, exploitation assists the UAV by utilizing the knowledge that is already available to obtain the best action. Exploitation drives the convergence of the learning process. Thus, the selection between exploration and exploitation is an aspect that enables the UAV to obtain beneficial actions during the learning process. By introducing a parameter 0 < ε < 1, ε-greedy exploration is utilized to better balance exploration and exploitation:

A_t = \begin{cases} \arg\max_{a \in A} Q(S_t, a), & \text{with probability } 1 - \varepsilon, \\ \text{random selection}, & \text{with probability } \varepsilon. \end{cases} \quad (30)

The agent selects the action that it considers to yield the maximum long-term reward with probability 1 − ε. In contrast, the agent selects an action uniformly at random with probability ε. Action selection based only on Q-values may fall into a local optimum when ε = 0.
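The ε-greedy rule above can be sketched as follows; the dictionary-backed Q-table and the injectable random source are illustrative implementation choices.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Pick argmax_a Q(state, a) with probability 1 - epsilon,
    otherwise a uniformly random action (exploration)."""
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))  # exploit
```

With ε = 0 the rule is purely greedy and can get stuck in a local optimum, as noted above; with ε = 1 it degenerates to a uniform random walk.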
To elaborate on the Q-learning-based trajectory optimization with a 3-D directional space, we present an example.
Example: When the UAV begins exploring from the initial location at state S_0 = (5, 1, 1), one of the neighboring states {(4, 1, 1), (4, 1, 2), (4, 2, 1), (4, 2, 2), (5, 2, 1), (5, 2, 2)} is explored by selecting an action from A according to the state-action values, i.e., max_{a∈A} Q(S_0, a). Because all Q-values are initialized to zero at the start of Q-learning, an action is selected randomly or according to some preset criteria. Assume that action "AN," which represents the Ascend North direction, is selected; thus, the next state becomes S_1 = (4, 1, 2). The corresponding reward and Q-value Q(S_0, A_0 = AN) are updated accordingly. In this example, the Q-value is increased by the collected reward value using (28). The Q-value is iteratively updated during learning based on the selected action that results in the maximum long-term reward.
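The example's first update can be traced numerically. With all Q-values initialized to zero, the update reduces to α times the collected reward; the reward value 1.0 below is illustrative, while α = 0.5 and γ = 0.98 follow the simulation settings.

```python
# Worked example: first Q-learning update from S0 = (5, 1, 1) with action
# "AN" (Ascend North), landing in S1 = (4, 1, 2). All Q-values start at 0,
# so the new Q(S0, AN) is simply alpha * reward (reward value assumed).
alpha, gamma = 0.5, 0.98
q = {}                                   # Q-table, empty means all zeros
s0, a0, s1, r = (5, 1, 1), "AN", (4, 1, 2), 1.0
best_next = 0.0                          # max_a Q(S1, a) is 0 at initialization
q[(s0, a0)] = q.get((s0, a0), 0.0) + alpha * (
    r + gamma * best_next - q.get((s0, a0), 0.0))
print(q[(s0, a0)])                       # prints 0.5
```

Subsequent visits to S0 would blend this stored value with new bootstrapped targets, gradually propagating the long-term reward back along the trajectory.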

VI. PERFORMANCE EVALUATION
In this simulation, the exploration region was considered in a 3-D space with the upper limit set to 100 × 100 m² for the 2-D trajectory scenario and 100 × 100 × 100 m³ for the 3-D trajectory scenario. The considered region M was divided into 5 × 5 tiles for the 2-D trajectory and 5 × 5 × 5 voxels for the 3-D trajectory. The initial location of the UAV was set as L_I = (0, 0, 0), whereas the goal location was L_F = (100, 100, 100). Based on actual applications [22], the maximum flight time T of the quadrotor UAV was set to 30 min with a speed ω of 5.56 m/s. The mass of the quadrotor UAV, consisting of its framework and battery, n_l, was 1.46 kg. Because the considered UAV was a quad-rotor UAV, the number of rotors n was four, and the area of the spinning blade disc of one rotor ζ was 0.0507 m². In addition, the lift-to-drag ratio r of the UAV was 3. The UAV navigated from L_I to L_F with a gravitational acceleration of 9.807 m/s² and an air density of 1.225 kg/m³. As described in Section II, K sensors were randomly distributed in region M. The number of sensors K simulated in this study was divided into three scenarios: 2, 4, and 6. The UAV communicated with sensor K_k with a transmit power P_UAV of 30 dBm, whereas the transmit power of each sensor P_k was 10 dBm. The sensors shared a total communication bandwidth of B = 2 MHz with a noise power σ² of −110 dBm. The channel power gain α_0 was −50 dBm, and the path-loss exponent ζ_pl was 2. As an RL agent, the UAV learned through the environment M at a rate of α = 0.5. The discount factor of the reward was set to γ = 0.98. The overall simulation parameter values are presented in Table II. The performance of the proposed model was evaluated using two baseline RL methods: 1) Q-learning and 2) SARSA. 1) Q-learning is an off-policy method in which the optimal policy is learned from the best action, implemented greedily, independently of the current policy.
2) SARSA is an on-policy method that learns the optimal policy by learning the Q-values of the current policy for all states and actions.

Moreover, the proposed extended directional space was compared with spaces of limited DOFs in terms of energy efficiency. Fig. 4(a) shows the optimal policy π*_SARSA for the optimized trajectory of the fixed-altitude UAV with eight DOFs under the SARSA method, whereas Fig. 4(b) presents the optimal policy for the same scenario under the eight-DOF-based Q-learning method. As the figure shows, the optimal policies produced by SARSA and Q-learning in an identical environment differ, and different optimal policies can yield different sums of rewards.

The main loops of the two learning procedures, as recovered from the listings, are as follows.

Algorithm 1 (Q-learning-based trajectory optimization, excerpt):
 3: for state S_t ∈ S do
 4:   Select action A_t according to the policy π derived from Q_Q-learning(S_t, A_t) of Eq. (30).
 5:   The UAV takes action A_t.
 6:   The UAV receives a reward R_{t+1} and moves to the next state S_{t+1}.
 7:   Observe the elapsed time T_s.
 8:   Update the Q-values based on Eq. (25).
 9:   Update the state S_t, action A_t, and elapsed time T_s: S_t ← S_{t+1}; A_t ← A_{t+1}; T_s ← T_s + T_req.
10: end for (the episode ends when S_{t+1} == L_F or T_s ≥ T)
11: ep ← ep + 1
12: end while
13: Determine the optimal policy.

Algorithm 2 (SARSA-based trajectory optimization, excerpt):
 3: for state S_t ∈ S do
 4:   Select action A_t according to the policy π derived from Q_SARSA(S_t, A_t) of Eq. (30).
 5:   The UAV takes action A_t.
 6:   The UAV receives a reward R_{t+1} and moves to the next state S_{t+1}.
 7:   Observe the elapsed time T_s.
 8:   Select the next action A_{t+1} from the state S_{t+1} following the policy derived from Q_SARSA(S_{t+1}, A_{t+1}) as in Eq. (30).
 9:   Obtain the updated Q-values based on Eq. (29).
10:   Update the state S_t, action A_t, and elapsed time T_s: S_t ← S_{t+1}; A_t ← A_{t+1}; T_s ← T_s + T_req.
11: end for (the episode ends when S_{t+1} == L_F or T_s ≥ T)
12: ep ← ep + 1
13: end while

A. Energy Efficiency of Fixed-Altitude UAV Under 2-D Trajectory

Fig. 5 shows the energy efficiency of the fixed-altitude UAV under the 2-D trajectory with two randomly distributed sensors, where higher rewards indicate a better performance. "Iterations" indicates the training iteration steps that optimize the action Q-values (Section V-C); in this study, the number of iteration steps was set to 10 000. As the figure shows, the extended directional space with eight DOFs achieved a higher energy efficiency than the four-DOF benchmark model [24]. Moreover, Q-learning converged faster than SARSA in this scenario owing to the limited flight time T of the UAV. Specifically, during the learning process, SARSA required more actions than Q-learning, as shown in Algorithm 1 for Q-learning and Algorithm 2 for SARSA. This occurred because SARSA learns the optimal policy based on the current policy, which requires more time, whereas Q-learning learns from the best action selected greedily. Consequently, under SARSA the UAV fails to reach the optimal result within the finite flight time T.

Fig. 6 compares the R values of the proposed eight-DOF model with those of the four-DOF model of a related study [24] under Q-learning and SARSA for the four-sensor scenario. The figure shows trends similar to those for the two-sensor case in Fig. 5: the eight-DOF UAV outperformed the four-DOF UAV in maximizing the energy efficiency, and the proposed model under Q-learning achieved a better performance by converging faster than SARSA. Furthermore, the two additional sensors resulted in a higher energy efficiency of the UAV because a larger amount of data could be collected. The eight-DOF model with Q-learning also achieved higher rewards than the eight-DOF SARSA and four-DOF models, following a trend similar to the previous results (see Figs. 5 and 6), and a larger number of sensors directly yielded a higher number of collected rewards.

Fig. 8(a) depicts the optimal policy π*_SARSA for the optimized 3-D UAV trajectory with the 26-DOF SARSA method, and Fig. 8(b) shows the optimal policy for the optimized 3-D UAV trajectory with the 26-DOF-based Q-learning method. Similar to the optimized trajectory of the fixed-altitude UAV, the optimal policies produced by SARSA and Q-learning in identical environments differ, and different optimal policies can yield different sums of rewards.

Fig. 9 shows the energy efficiency of the UAV under the 3-D trajectory with two randomly distributed sensors. "Energy efficiency" refers to the average sum of rewards in (27), where higher rewards indicate a better performance, and "iterations" again indicates the training iteration steps that optimize the action Q-values (Section V-C), set to 10 000. As the figure shows, the extended directional space with 26 DOFs achieved a higher energy efficiency than the six-DOF benchmark models [26], [37]. Moreover, Q-learning converged faster than SARSA in this scenario because of the limited flight time T of the UAV: SARSA required more actions than Q-learning during the learning process (Algorithms 1 and 2), because SARSA learns the optimal policy based on the current policy, which requires more time, whereas Q-learning learns from the best action selected greedily. Consequently, under SARSA the UAV fails to reach the optimal result within the finite flight time T.

Fig. 10 compares the R values of the proposed 26-DOF model with those of the six-DOF models of related studies [26], [37] under Q-learning and SARSA for the four-sensor scenario. The figure shows trends similar to those for the two-sensor case in Fig. 9: the 26-DOF UAV outperformed the six-DOF UAV in maximizing the energy efficiency, and the proposed model under Q-learning achieved a better performance by converging faster than SARSA. Furthermore, the two additional sensors resulted in a higher energy efficiency of the UAV because a larger amount of data could be collected.

Fig. 11 depicts the R values of the proposed 26-DOF and six-DOF [26], [37] models under Q-learning and SARSA. As the figure shows, the 26-DOF model with Q-learning achieved higher rewards than the 26-DOF SARSA and six-DOF models, following a trend similar to the previous results (Figs. 9 and 10). Moreover, the results indicated that the number of collected rewards was directly affected by the number of sensors.

Fig. 11. Energy efficiency of the optimized-altitude UAV with respect to the iteration steps when the number of sensors K = 6.
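The convergence gap discussed above follows from the two update rules themselves. The sketch below contrasts the off-policy Q-learning target with the on-policy SARSA target; the dictionary-based Q-table and the numeric values are illustrative, not the article's exact formulation.

```python
ALPHA, GAMMA = 0.5, 0.98  # learning rate and discount factor (Table II)

def q_learning_update(Q, s, a, r, s_next, actions):
    # Off-policy: bootstrap from the greedy (maximizing) action in s_next,
    # regardless of which action the behavior policy will actually take.
    target = r + GAMMA * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action a_next actually selected by the
    # current (e.g., epsilon-greedy) policy, so exploratory moves are
    # reflected in the learned values.
    target = r + GAMMA * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (target - Q.get((s, a), 0.0))

actions = ["a", "b"]
Qq = {("s2", "a"): 1.0}   # suppose action "a" already looks good in s2
Qs = dict(Qq)
q_learning_update(Qq, "s1", "a", 0.0, "s2", actions)
sarsa_update(Qs, "s1", "a", 0.0, "s2", "b")  # policy explored the worse "b"
```

When the policy explores a suboptimal next action, the SARSA target is smaller than the greedy Q-learning target (here 0.0 versus 0.98), which is consistent with the slower convergence observed for SARSA under the UAV's finite flight time T.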

VII. CONCLUSION
This article proposed a multidimensional search space that employs higher DOFs to optimize the trajectory of a UAV. The trajectory was optimized using an RL method, specifically Q-learning, to maximize the energy efficiency of UAV-aided IoT networks. In addition, the energy consumption model of the UAV was designed by considering practical environments. The performance of the proposed model was compared with models using limited DOFs and with the SARSA method. The proposed extended DOFs exhibited a higher energy efficiency than the limited DOFs and SARSA, and Q-learning achieved faster convergence than SARSA. The following points can be considered in future studies.
1) Intersensor interference can be considered in the data transmission scheme.
2) A deep-neural-network-based RL scheme (DRL), e.g., [32], [33], can be used to further improve the energy efficiency of UAV-aided IoT networks.
3) For better performance, a cooperative multiagent RL (MARL) scheme can also be applied to this scenario, as in [19], [21], and [37].
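As a pointer for direction 2), the tabular update used in this article generalizes to parameterized value functions, and DQN-style DRL replaces the Q-table with a neural network trained on the same Bellman target. The sketch below uses a simple linear approximator as a minimal stand-in; the feature map, dimensions, and learning rate are assumptions for illustration, not taken from [32], [33].

```python
ALPHA, GAMMA = 0.1, 0.98   # illustrative learning rate; discount from Table II

def featurize(state):
    # Hypothetical feature map: normalized voxel coordinates plus a bias term.
    x, y, z = state
    return [x / 5.0, y / 5.0, z / 5.0, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LinearQ:
    """Per-action linear Q-function Q(s, a) = w_a . phi(s); a DQN replaces
    this linear map with a neural network trained on the same TD target."""

    def __init__(self, actions, dim=4):
        self.w = {a: [0.0] * dim for a in actions}

    def value(self, state, action):
        return dot(self.w[action], featurize(state))

    def update(self, s, a, r, s_next):
        # Semi-gradient TD(0): the same bootstrapped Bellman target that
        # DQN minimizes with stochastic gradient descent.
        target = r + GAMMA * max(self.value(s_next, b) for b in self.w)
        td_error = target - self.value(s, a)
        phi = featurize(s)
        self.w[a] = [wi + ALPHA * td_error * pi for wi, pi in zip(self.w[a], phi)]

q = LinearQ(["AN", "N"])
q.update((5, 1, 1), "AN", 1.0, (4, 1, 2))   # placeholder reward of 1.0
```

Because the weights are shared across all states through the features, one update also shifts the values of nearby states, which is the generalization property that makes DRL attractive for larger 3-D voxel grids than the 5 × 5 × 5 case simulated here.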