Multi-UAV Navigation for Partially Observable Communication Coverage by Graph Reinforcement Learning

In this paper, we aim to design a deep reinforcement learning (DRL) based control solution for navigating a swarm of unmanned aerial vehicles (UAVs) around an unexplored target area under partial observation, where the UAVs serve as Mobile Base Stations (MBSs) providing optimal communication coverage for ground mobile users. To handle the information loss caused by partial observability, we introduce a novel network architecture named Deep Recurrent Graph Network (DRGN), which obtains extra spatial information through graph-convolution-based inter-UAV communication and utilizes historical features with a recurrent unit. Based on DRGN and maximum-entropy learning, we propose a stochastic DRL policy named Soft Deep Recurrent Graph Network (SDRGN). In SDRGN, a heuristic reward function is elaborated, which is based on the local information of each UAV instead of global information; thus, SDRGN reduces the training cost and enables distributed online learning. We conducted extensive experiments to design the structure of DRGN and examine the performance of SDRGN. The simulation results show that the proposed model outperforms four state-of-the-art DRL-based approaches and three heuristic baselines, and demonstrate the scalability, transferability, robustness, and interpretability of SDRGN.

distance are interconnected and could communicate with low latency. Taking into account the dynamic topology and graph characteristics of FANET, we regard each UAV as a node and each connection in FANET as an edge. Then we use the graph attention network (GAT) [17] as the convolutional kernel to extract adjacent information through the edges. The proposed communication mechanism between neighbors alleviates the influence of partial observation at a low cost, and we name it GAT-based FANET (GAT-FANET). To further alleviate the information loss in the partially observable environment, we process the graph data with a memory unit based on the gated recurrent unit (GRU) [18], which records long-term historical information. Based on the intuition that a stochastic policy is more robust than a deterministic one in a partially observable environment, we design a multi-agent deep reinforcement learning (MADRL) algorithm named Soft Deep Recurrent Graph Network (SDRGN), which learns a DRGN-based stochastic policy with the soft Bellman function [19]. Finally, we design a heuristic reward function based on the local information obtained by each UAV individually. Experiments show that the policy trained with the designed reward function performs well in terms of the global metrics. As a result, SDRGN achieves better performance than previous works at a lower training cost and can be fine-tuned online in a distributed manner.
The main contributions of this paper are as follows: 1) Based on simulations, we design a spatial-temporal-aware network architecture named Deep Recurrent Graph Network (DRGN) for our partially observable environment, which obtains spatial information with GAT-FANET and historical information from the memory unit. 2) We propose a novel maximum-entropy reinforcement learning algorithm named SDRGN that learns stochastic policies with the DRGN network. To our knowledge, this is the first work that learns a graph-based stochastic policy in the field of multi-UAV navigation. 3) As we assume that global information is not available during training, we design a heuristic reward function that only evaluates local information but encourages the UAV swarm to perform well in terms of the global metric. 4) We conducted extensive simulations to validate the effectiveness of the heuristic reward function, and analyzed several characteristics of SDRGN, including performance, scalability, transferability, robustness, and interpretability.
The rest of the paper is organized as follows. In Section 2, we review recent works related to multi-UAV navigation. The system model and problem statement are defined in Section 3. We describe our approach in Section 4. Experimental studies including network architecture evaluation and analysis are given in Section 5. Finally, we draw conclusions and discuss future work in Section 6.

UAV Deployment and UAV Ad-Hoc Network
Some recent works have made in-depth studies on the deployment of UAVs to make it more practical in real-world tasks. The authors in [20] proposed a multi-UAV control model that could increase the deployment coverage of UAVs with energy efficiency. [21] proposed a decentralized solution for multi-UAV deployment by adjusting a set of parameters that control the UAVs' behaviors and actions. [22] presented a method to minimize the deployment delay of the UAVs and the overall delay. [23] introduced a framework for optimizing the deployment and mobility of multiple UAVs to control the overall deployment with regard to the energy efficiency of the ground equipment.
With recent progress in Ad-hoc networks [3], where all nodes within the communication range could establish connections, it is practical to design multi-UAV control methods under the assumption that each UAV can communicate with neighboring UAVs with low latency. In recent years, the Ad-hoc network for multi-UAV deployment, known as the Flying Ad-hoc Network (FANET), has been shown to outperform other network structures [24], [25].

Cooperative Exploration and Path Planning
A control policy with high exploration efficiency is necessary when the UAV swarm is deployed in an unexplored environment under limited observation. The most widely-used exploration strategy may be ε-greedy, which selects actions randomly with probability ε in each decision, where ε often decays as the policy converges. In recent years, many advanced methods have been proposed for multi-UAV cooperative exploration. For instance, [26] developed a game theory framework suitable for multi-UAV collaborative search and monitoring. [27] proposed a cooperative exploration strategy specialized for the UAV-UGV-combined system, which could minimize the total exploration distance under energy consumption and functional constraints. [28] used multiple UAVs to build a SLAM-based collaborative exploration system, which could reduce the amount of shared data by only exchanging the frontier points of the computed local grid map.
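As a concrete illustration of the ε-greedy strategy described above, consider the minimal sketch below. This is generic pseudocode made runnable, not code from the paper; the decay constants are illustrative assumptions.

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, eps_start=0.9, eps_end=0.05, rate=1e-4):
    """Exponentially decay epsilon toward eps_end as training progresses
    (the schedule form and constants are assumptions for illustration)."""
    return eps_end + (eps_start - eps_end) * math.exp(-rate * step)
```

With `epsilon = 0`, the rule is purely greedy; as `step` grows, `decayed_epsilon` shrinks the exploration probability toward its floor.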
Before the emergence of deep learning, path planning for multiple UAVs was mainly studied with heuristic methods and reinforcement learning (RL) methods. [29] proposed a multi-objective optimization algorithm for multi-UAV task assignment and path planning, in which a genetic algorithm (GA) was adopted to minimize the inference time of the policy. [30] designed a variety of path planning algorithms based on the information of each station relative to a fixed station, to find a path suitable for the central UAV. [31] proposed mean-field game (MFG) control theory to achieve fast positioning and low flight consumption, in which two partial differential equations are solved by machine learning methods. [32] treated the UAV swarm as a single agent and achieved optimal navigation with the on-policy RL method SARSA [33], which yields a lightweight policy with linear time complexity.

DRL for Multi-UAV Navigation
Deep reinforcement learning (DRL) is a powerful tool that uses deep neural networks (DNNs) to learn RL policies for decision-making problems. Compared with the computationally lighter RL-based methods, DRL-based approaches can handle high-dimensional data and extract complicated features with DNNs, which leads to better flexibility and performance. Recent advances in edge computing have significantly improved the computing power of onboard computers, so many works have tried to use DRL models to control multi-UAV navigation in tasks of real-world complexity. Different from RL methods that typically treat the whole UAV swarm as a single agent [32], recent DRL-based works seek to control each UAV in a decentralized manner, hence their performance improves with the progress of multi-agent deep reinforcement learning (MADRL).
Independent Q-learning (IQL) [34] is probably the simplest and most commonly applied method in the MADRL field. It decomposes the multi-agent problem into multiple simultaneous single-agent tasks by learning a decentralized Q-learning model for each agent. [35] designed UAV longitudinal and lateral Q-learning fuzzy controllers to solve the multi-UAV formation control problem. [36] handled the problem of flocking of small fixed-wing UAVs with IQL, where parameter sharing among the DRL models of all UAVs is applied to speed up model convergence.
To address the environmental instability caused by training multiple policies simultaneously in IQL, [37] introduced the paradigm of centralized training and decentralized execution (CTDE), which trains each decentralized policy with a centralized critic network granted the global state of the environment, then executes the policies in a decentralized manner. [38] proposed a CTDE MADRL model to provide secure communications by jointly optimizing the trajectories of UAV-MBSs, which applies the self-attention mechanism [39] in MADDPG to improve the efficiency of information aggregation among UAVs. [15] adopted MADDPG to handle the problem of navigating a group of UAVs as mobile base stations to provide long-term communication coverage for ground mobile users in a target area.
Note that the CTDE approach requires global information in the training phase, which is normally infeasible in real-world tasks. To achieve fully decentralized training, DGN [40] utilized the GAT [17] network to aggregate information from neighboring agents (instead of all agents). As mentioned in Section 2.1, with the development of FANET it is much easier to achieve communication among adjacent UAVs, so it is practical to use the GAT structure in multi-UAV collaboration scenarios. Most recently, [16] adopted DGN to solve the UAV-MBSs problem introduced by [15].
The differences between this paper and previous works are summarized as follows. [15] proposes the UAV-MBS problem and solves it with a CTDE approach, which assumes evenly distributed points of interest (PoIs) and requires global information during training. By contrast, we assume that the model can only obtain a partial observation in the training phase, and the PoIs are randomly distributed. [16] adopts DGN for the UAV-MBS problem, yet the network structure is not well-studied and its performance has no significant advantage over model-free heuristic methods; in contrast, we design GAT-FANET, a network structure appropriate for inter-UAV communication in the UAV-MBSs task, and introduce a memory unit for recording temporal information in the network. Besides, the works mentioned above handle the UAV-MBSs problem with deterministic policies, while we propose to learn stochastic policies with the designed network structure. Experiments show that the stochastic method is more robust than previous deterministic policies in our partially observable environment. Further, [14], [15], [16] tested their algorithms with relatively small UAV teams (from 3 to 10), whilst we tested the model in environments with up to 40 UAVs, validating the better scalability of our proposed model.

System Model
We introduce a scenario of multi-UAV navigation control for fair communication coverage under partial observation. A group of N UAVs navigate at a fixed altitude H and serve as mobile communication base stations to provide communication services to ground users. The simulation world is a continuous 2D map, as shown in Fig. 1, and we assume that the map size is L × L units. We set D_Com as the maximum communication distance within which a UAV can communicate with other UAVs and ground users. As all UAVs fly at the same altitude, they are interconnected in the Ad-hoc network when their relative distance is less than D_Com, and a UAV can communicate with any ground user whose distance to the center of the UAV on the 2D map is less than the observe range R_Obs = √(D_Com^2 − H^2). As real-world conditions might affect communication quality, we assume that a ground user can obtain a stable communication service from a UAV if its distance to the UAV on the 2D map is less than the coverage range R_Cov. If the distance is more than R_Cov and less than R_Obs, the ground user can be observed by the UAV but is provided with an unstable communication service. Different from prior works [37], [41] assuming that the ground users are evenly distributed, we use randomly distributed PoIs to represent the ground users.
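The geometric relations above can be sketched in a few lines. This is a minimal illustration under the stated assumptions (all UAVs at the same altitude); the function names are ours, not from the paper.

```python
import math

def observe_range(d_com, h):
    """Ground-projected observe range R_Obs = sqrt(D_Com^2 - H^2)
    for a UAV flying at altitude h with communication distance d_com."""
    return math.sqrt(d_com ** 2 - h ** 2)

def fanet_adjacency(positions, d_com):
    """Adjacency matrix of the FANET: since all UAVs fly at the same
    altitude, two UAVs are connected when their 2D distance is below
    d_com. `positions` is a list of (x, y) tuples."""
    n = len(positions)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(positions[i], positions[j]) < d_com:
                adj[i][j] = 1
    return adj
```

For example, with `d_com = 5` and `h = 3` the observe range on the ground is 4 units.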
Specifically, we consider a task where N UAVs navigate to cover K PoIs for T timeslots. At the beginning of each task, the positions of the PoIs and UAVs are randomly assigned. Since each UAV can only observe PoIs within a limited observe range, in order to cover more PoIs, the UAVs should cooperate to explore the distribution of PoIs and choose the trajectories that maximize the team's interest in a decentralized manner. Table 1 lists important notations used in this paper.

Observation Space and Action Space
Each UAV i at timeslot t obtains a local observation o_t^i from the environment. As depicted in Section 3.1, the UAV can observe the PoIs and other UAVs within circles of radius R_Obs and D_Com centered on itself, respectively.
As the number of PoIs and UAVs in the observation area dynamically changes during the task, to keep the dimension of the observation space consistent, we transform the continuous circular observation area into discrete pixels, in which the value of each pixel indicates the number of PoIs or UAVs at that position. Specifically, the observation consists of 4 elements: the PoI map M_PoI, the UAV map M_UAV, the binary encoding of the current location e_pos, and the current velocity v. The PoI map and the UAV map are pixel maps in which each pixel records the number of PoIs and UAVs at the corresponding position in the circular observable area. Since we set 1 unit as the fundamental unit of the pixel map, the PoI map and UAV map are vectors of length ⌊πR_Obs^2⌋ and ⌊πD_Com^2⌋, where ⌊·⌋ denotes the floor operation. The binary encoding of location e_pos is a vector of length 2d that consists of the binary codes of the UAV's current position on the continuous world, (⌊x_t^i⌋, ⌊y_t^i⌋), where d is a hyper-parameter chosen so that 2^d is greater than the world size L. The velocity vector v is a 2D vector that indicates the normalized velocity. Therefore the observation is a vector of length ⌊πR_Obs^2⌋ + ⌊πD_Com^2⌋ + 2d + 2, and can be represented as

o_t^i = (M_PoI | M_UAV | e_pos | v),

where (·|·) denotes the concatenation operation. The UAV controls its movement by applying a certain drag to adjust its velocity. Hence we design the action as a 2D drag vector. Concretely, we evenly divide the 2D plane into 8 directions and set the maximum drag of the UAV to 1 unit. The UAV's action space, which is shown in Fig. 2, consists of 17 actions: 1 zero-drag action, and 16 actions denoting the maximum-drag and half-drag for each of the 8 possible directions.
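The observation layout and the 17-action drag space can be sketched as follows. This is an illustrative reconstruction of the encoding described above; the helper names are ours.

```python
import math

def binary_encoding(x, y, d):
    """2d-length binary encoding of the floored position; the
    hyper-parameter d is chosen so that 2**d exceeds the map size L."""
    bits = lambda v: [float(b) for b in format(int(v), f"0{d}b")]
    return bits(x) + bits(y)

def build_observation(poi_map, uav_map, x, y, vx, vy, d):
    """Concatenate the four elements: (M_PoI | M_UAV | e_pos | v).
    Total length = len(poi_map) + len(uav_map) + 2d + 2."""
    return poi_map + uav_map + binary_encoding(x, y, d) + [vx, vy]

def action_space():
    """17 drag actions: zero drag, plus max-drag (1.0) and half-drag
    (0.5) for each of the 8 evenly spaced directions."""
    actions = [(0.0, 0.0)]
    for k in range(8):
        ang = k * math.pi / 4
        for mag in (1.0, 0.5):
            actions.append((mag * math.cos(ang), mag * math.sin(ang)))
    return actions
```

Note the pixel maps here are flattened lists of counts, matching the vector lengths ⌊πR_Obs²⌋ and ⌊πD_Com²⌋ stated above.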

Evaluation Metrics and Problem Statement
We next introduce three global metrics to evaluate the performance of the UAV swarm, then state the goal of this problem. Following [15], we evaluate the performance of a policy from the perspectives of coverage, fairness, and energy consumption. The first metric is the coverage index, which measures how every PoI was covered by any UAV in the past t timeslots. At any timeslot t, if PoI k is in the coverage area of any UAV, we call it "covered"; otherwise it is not covered. Specifically, the coverage index at timeslot t can be represented as

c_t = (1 / (tK)) Σ_{k=1}^{K} w_t(k),

where K is the number of PoIs and w_t(k) is the number of timeslots that PoI k was covered until timeslot t; thus, c_t ∈ [0, 1] always holds. We refer to the coverage index at the last timeslot T as the final coverage index, denoted as c_T = c_t|_{t=T}, to evaluate the performance of the UAV team in an episode. However, the final coverage index could be high even when a small subset of PoIs is never covered; thus, fairness of coverage is very important in many critical circumstances. For instance, in case of earthquakes or storms, we hope that even an isolated PoI can have the opportunity to obtain communication services. Besides, as we consider a partially observable environment, we want to evaluate the exploration capability of the policy by verifying whether it can cover the remote/isolated PoIs. Therefore, following [15], we use Jain's fairness index to describe the geographical fairness of the coverage of PoIs, as

f_t = (Σ_{k=1}^{K} w_t(k))^2 / (K Σ_{k=1}^{K} w_t(k)^2).

Obviously, f_t ∈ [1/K, 1] always holds. Further, if all w_t(k) are equal, f_t takes the maximum value 1; if only 1 PoI is covered, f_t is 1/K. We refer to the fairness index at the last timeslot T as the final fairness index, denoted as f_T = f_t|_{t=T}, to evaluate the coverage fairness and the exploration level in an episode.
Lastly, following the previous work [14], we assume that the energy consumption is linear in the flight distance and define the energy consumption of each UAV as

e_t^i = e_0 + κ · l_t^i,   (4)

where e_0 = 0.5 is the hovering energy consumption, l_t^i ∈ [0, 1] is the normalized flight distance in the last transition, and κ = 0.5 is a coefficient. We herein define the energy index of the UAV swarm in a task by averaging over all N UAVs and all timesteps:

e_t = (1 / (tN)) Σ_{τ=1}^{t} Σ_{i=1}^{N} e_τ^i.

We combine the three metrics and define the overall objective as the coverage-fairness-energy (CFE) score, represented as

CFE_t = (c_t · f_t) / e_t.   (6)

In a word, the goal of the problem is to learn a decentralized policy π* that can execute under partial observation to maximize the CFE score of the whole episode, that is

π* = argmax_π CFE_T,

where T is the last timeslot of the episode. In addition to the above metrics, we add a constraint that the model will only be granted local information as training samples. This means that the learning objective, the CFE score, cannot be used to formulate the objective function in the training phase. This setting is relevant to the keyword decentralized training (DT) in MADRL, which requires less training cost and brings possibilities for online learning after the model is deployed.
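The three indices and the CFE score can be computed as in the short sketch below. The exact normalizations are our reading of the definitions above (per-PoI coverage counts w_t(k), Jain's index, linear energy model); treat them as a sketch rather than the paper's exact code.

```python
def coverage_index(w, t):
    """c_t: w[k] is the number of timeslots PoI k was covered up to t."""
    K = len(w)
    return sum(w) / (t * K)

def fairness_index(w):
    """Jain's fairness index over the per-PoI coverage counts."""
    total = sum(w)
    if total == 0:
        return 0.0
    return total ** 2 / (len(w) * sum(v * v for v in w))

def energy_step(l, e0=0.5, kappa=0.5):
    """Per-step energy: hovering cost plus a term linear in flight distance."""
    return e0 + kappa * l

def cfe_score(c, f, e):
    """Coverage-fairness-energy score, assumed form c * f / e following [15]."""
    return c * f / e
```

As a sanity check, equal coverage counts give fairness 1, while covering a single PoI out of K gives fairness 1/K.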

Proposed Fully-Decentralized DRL Solutions for Multi-UAV Navigation
In this section, we present Soft Deep Recurrent Graph Network (SDRGN), a fully decentralized DRL-based multi-UAV control solution for partially observable communication coverage. In the solution, all UAVs share the same policy for path planning and control in a decentralized manner with Ad-hoc network communication. Different from previous works that adopt the CTDE framework [37], [42], [43] and therefore require global observation during the training phase, our approach only uses local information as the training samples and thus meets the partially observable constraints in both the training stage and the testing stage. We first introduce a novel network architecture named Deep Recurrent Graph Network (DRGN), then show how to train a DRGN-structured stochastic policy with maximum-entropy learning, and finally design a heuristic reward function to support decentralized training.

Deep Recurrent Graph Network
The network architecture of Deep Recurrent Graph Network (DRGN) for MADRL is shown in Fig. 3.
First, DRGN applies an encoder to process the raw input:

e_i = ENC(o_i),

where o_i and e_i are the local observation and its embedding for agent i, and ENC is the encoder function shared by all agents. We use a fully connected layer as the encoder. Then, DRGN embeds the communication protocol into the network architecture with the graph convolution mechanism. A UAV i is a node in the graph with node embedding e_i. Node embeddings are passed through the edges between nodes so that each node can obtain information from itself and its neighboring nodes. For convenience, we define G_i as the set of node i and its one-hop neighbors:

G_i = {j | ∀j s.t. A(i, j) = 1},   (9)

where A is the adjacency matrix of FANET. The observations and embeddings of all UAVs in G_i can be represented as O_{G_i} and E_{G_i}, respectively. In a subgraph G_i, the importance of each neighboring node to node i may be quite different. For example, when UAV i does not find any PoI in its observation area, the neighboring UAV that observes the largest number of PoIs should be of the greatest importance, while other nodes need less attention. To this end, we adopt graph attention (GAT) [44] as the convolution kernel to process the graph data, which utilizes the self-attention mechanism [39] to decide the importance of each node in subgraph i by Eq. (10), and aggregates the neighboring information for node i by Eq. (11):

a_ij = softmax_j((W_Q e_i)^T (W_K e_j)),   (10)

g_i = Σ_{j∈G_i} a_ij · W_V e_j,   (11)

where a_ij is the attention weight that determines the importance of node j to node i, and g_i is the output embedding for node i after the information aggregation. W_Q, W_K, W_V are learnable matrices that map the embedding e_i into the query, key, and value vectors [39], respectively. Note that j ∈ G_i in Eq. (11) denotes that the self-attention operation is only executed between nodes interconnected in the graph.

[Table 2 note: GAT-FANET comprises two GAT layers with a skip connection; the costs of activation functions and biases are omitted when computing MACCs. The input dimension, hidden dimension, number of neighbors, and action dimension are denoted by d_in, d_hid, n_nei, and d_act, respectively, and d'_hid = 3·d_hid is the output dimension after the skip connection.]
To improve GAT's expressive capability, we implement the self-attention mechanism with M heads, i.e., we execute M independent graph-attention kernels in parallel and concatenate their outputs:

g'_i = (g_i^1 | g_i^2 | ... | g_i^M).

Based on simulations, we establish a highly efficient GAT-based network structure to execute learnable communication among adjacent nodes in FANET, which we name GAT-FANET. We stack two GAT layers together to provide a two-hop perception field for each node. To illustrate, node i in the second GAT layer can obtain its neighbor nodes' embeddings from the first GAT layer, which contain the information aggregated from nodes that are two hops from i. Note that node i only communicates with its one-hop neighbors. To accelerate the training process and prevent over-fitting, skip connections [45] are created over the two GAT layers. Specifically, the outputs of the encoder, the first GAT layer, and the second GAT layer are concatenated as the output of GAT-FANET, as shown in Fig. 3b.
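A single attention head of the kind used in GAT-FANET can be sketched in plain Python. This is a didactic reconstruction of the query-key-value attention described around Eqs. (10) and (11), not the paper's implementation; the scaling by √d is a common stabilization assumption.

```python
import math

def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(w[r][c] * v[c] for c in range(len(v))) for r in range(len(w))]

def attention_weights(e, i, neighbors, wq, wk):
    """Softmax-normalized dot-product attention of node i over its
    one-hop subgraph G_i (node i included in `neighbors`)."""
    q = matvec(wq, e[i])
    scores = []
    for j in neighbors:
        k = matvec(wk, e[j])
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def aggregate(e, i, neighbors, wq, wk, wv):
    """One head's output g_i: attention-weighted sum of value-projected
    neighbor embeddings, as in Eq. (11)."""
    alpha = attention_weights(e, i, neighbors, wq, wk)
    out = [0.0] * len(wv)
    for a, j in zip(alpha, neighbors):
        v = matvec(wv, e[j])
        out = [o + a * x for o, x in zip(out, v)]
    return out
```

Multi-head attention simply runs M such heads with independent weight matrices and concatenates their outputs.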
After GAT-FANET, each node i gets an embedding g'_i that aggregates information from other nodes in its two-hop subgraph. Based on the intuition that storing history information could alleviate the information loss induced by the partially observable constraint, we design a memory unit for DRGN to record the historical graph embedding, which is shown in Fig. 3c. Specifically, we choose the gated recurrent unit (GRU) [18] as the memory unit to utilize its long-term memory ability:

h_t^i = GRU(g'^i_t, h_{t-1}^i),

where g'^i_t is node i's embedding output from GAT-FANET, and h_{t-1}^i is the hidden state of UAV i's GRU in the last timeslot.
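The memory unit's update can be illustrated with a scalar GRU cell. This is a toy sketch with hypothetical scalar weights to show the gate structure; the paper uses a standard vector-valued GRU layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(g, h_prev, w):
    """Scalar sketch of h_t = GRU(g'_t, h_{t-1}).
    `w` holds six hypothetical scalar weights for the gates."""
    z = sigmoid(w["wz"] * g + w["uz"] * h_prev)          # update gate
    r = sigmoid(w["wr"] * g + w["ur"] * h_prev)          # reset gate
    h_tilde = math.tanh(w["wh"] * g + w["uh"] * (r * h_prev))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde              # interpolate old/new
```

The update gate z controls how much of the historical state h_{t−1} is kept versus overwritten by the new GAT-FANET embedding, which is exactly the long-term memory behavior the paper relies on.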
Finally, we use a linear transform to process the output of the GRU to calculate the Q-values. For each agent, DRGN only takes the observation of the subgraph and the GRU's last hidden state h as input, and outputs the Q-values for all possible actions A in one forward propagation, i.e., Q(A|O, h): O × h → R^|A|, which decreases the computational complexity.
The multiply-accumulate operations (MACCs) of DRGN on each UAV are presented in Table 2. It can be seen that GAT-FANET and the memory unit occupy the main computational overhead of DRGN. For the sake of inference speed, we do not stack more GAT layers in GAT-FANET; in the memory unit, we parallelize the computations inside the GRU. The overall MACCs of DRGN is the sum of the per-component costs listed in Table 2.

Learn Maximum-Entropy Policies With DRGN
In this section, we present Soft Deep Recurrent Graph Network (SDRGN), a novel MADRL algorithm based on maximum-entropy theory that learns DRGN-structured stochastic policies for partially observable communication coverage. As mentioned in Section 4.1, DRGN is a neural network that maps from the state space to the action space. Inspired by soft Q-learning [19], instead of estimating the expected return of each action like other communication-based MADRL algorithms [40], [46], we predict the probability distribution over all possible actions. Specifically, the unbounded output of DRGN is processed with a temperature-softmax:

P(A|O, h) = softmax(Q(A|O, h) / α),   (15)

where α is a temperature hyper-parameter that is positively correlated with the degree of exploration. The process of the UAVs interacting with the environment can be summarized as follows: at each timeslot, every UAV i obtains an observation o_t^i and calculates the probabilities of all actions. We use the multinomial sampling strategy to sample an action from the output probability distribution. After the actions are executed, each agent obtains a reward r^i and a new observation o_{t+1}^i. Following DQN [47], we use a target network for calculating the learning target of the model. The target network is a copy of the learned DRGN network; its parameters are updated by directly copying the parameters of the trained model every few iterations. We update the model by minimizing the squared temporal difference of the soft Bellman function [19]:

L = (1 / (SN)) Σ_S Σ_{i=1}^{N} (r_t^i + γ V(O_{t+1}, h_{t+1}^i) − Q(a_t^i | O_t, h_t^i))^2,

where O_t = {o_t^j | ∀j ∈ G_t^i} and G^i is the subgraph of i, h_t^i is the last hidden state of agent i's GRU, S is the size of the mini-batch, and N is the number of UAVs in the sampled experience. The value function V(O_t, h_t^i) is defined as

V(O_t, h_t^i) = α log Σ_{a∈A} exp(Q'(a | O_t, h_t^i) / α),

where Q' denotes the target network.

[Algorithm 1 excerpt: ... T ← T + 1; 7: for agent i = 1 to N do; 8: obtain the observations O_t^i of the subgraph G_i; 9: select the action based on Eq. (15);]
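The temperature-softmax policy and the soft value function can be sketched as below (numerically stabilized by subtracting the maximum Q-value; α = 0.2 as in the experimental settings). This is a generic soft Q-learning sketch, not the paper's code.

```python
import math

def soft_policy(q_values, alpha=0.2):
    """Boltzmann policy P(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def soft_value(q_values, alpha=0.2):
    """Soft state value V = alpha * log sum_a exp(Q(a) / alpha),
    as in soft Q-learning [19]."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))
```

A small α makes the policy nearly greedy, while a large α flattens the distribution toward uniform exploration, matching the stated role of the temperature.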
[Algorithm 1, continued: 10: execute the action a_t^i, then obtain the corresponding reward r^i.] According to the realistic situation, we propose two strategies for replaying the experience and training the model. The first training scheme is designed for circumstances in which centralized training is feasible. It integrates the experience of all agents in a timeslot into a tuple (o_t, h_t, A_t, a_t, r_t, o_{t+1}, h_{t+1}, A_{t+1}) and samples it as a whole, where A_t is the adjacency matrix of FANET. In this way, we reduce memory usage by storing experiences without duplication. Besides, since the sampled experiences belong to the same timeslot, the stability of the training process is improved. After the centralized training, the learned policy can be executed in a distributed manner, which is similar to the centralized training and decentralized execution algorithms [37], [42], [43]. The second training method is designed for distributed training circumstances, e.g., online fine-tuning of the model with the onboard computer during the task. This method is similar to other independent learning algorithms such as independent Q-learning [34]: it stores the experience of each agent individually and samples it randomly to train the model. Pseudocode for the centralized training process of SDRGN is presented in Algorithm 1; the distributed training version is omitted for its simplicity.
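The centralized replay scheme, which stores one joint tuple per timeslot, can be sketched as follows. The field layout follows the tuple described above; the class name and capacity default are ours.

```python
import random
from collections import deque

class JointReplayBuffer:
    """Centralized scheme: each entry is one timeslot's joint experience
    (o_t, h_t, A_t, a_t, r_t, o_{t+1}, h_{t+1}, A_{t+1}) for all agents,
    stored once to avoid duplicating shared data across agents."""

    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)   # old entries are evicted first

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # uniform sampling of whole timeslots keeps all agents' data aligned
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)
```

The distributed scheme would instead push one tuple per agent per timeslot, at the cost of duplicating the neighborhood observations shared between adjacent UAVs.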

The Heuristic Reward Function
As mentioned in Section 3.3, the objective of the multi-UAV system is to maximize the global metric called the CFE score, defined in Eq. (6). A very intuitive idea is to directly use this metric as the reward [15], i.e., r_t = CFE_t. However, due to the partially observable setting of our problem, the global metrics are not available in the distributed training process. Besides, although we could obtain them in the centralized training phase, due to the partial observability constraint there is an information gap for the decentralized model to predict the global metric, which could induce an ultra-unstable training process. Therefore, we seek to design a reward function that only depends on the local information that can be obtained by the decentralized policy. We expect the reward function to encourage the UAV swarm to achieve a better CFE score, i.e., a larger coverage index c_t and fairness index f_t while minimizing the energy index e_t. As the objective of DRL is to maximize the expected return R_T = Σ_{i=1}^{N} Σ_{t=1}^{T} γ^{T−t} r_t^i, ideally, we expect to design a reward function r^i that satisfies

argmax_π E[R_T] = argmax_π CFE_T.

As the desired reward function plays a similar role to the heuristic function in the A* algorithm, we name it the heuristic reward function. The reward function mainly consists of an individual term, a teamwork term, and an energy term.
The individual term r_self is defined as the number of PoIs that are exclusively covered by the agent itself. Note that PoIs covered by multiple UAVs do not contribute to this term, which is supposed to encourage the UAVs to explore more PoIs on the map:

r_self^i = n_poi^i,

where n_poi^i denotes the number of PoIs exclusively covered by agent i.
The teamwork term r_team is defined as the average number of PoIs covered by the other UAVs in the one-hop adjacency graph G_i of UAV i, which is expected to encourage connectivity and cooperation among adjacent UAVs:

r_team^i = n_poi^onehop / n_onehop,

where n_onehop denotes the number of one-hop neighboring UAVs, and n_poi^onehop denotes the number of PoIs covered by these one-hop neighbors.
The energy term r_energy is the energy consumption of UAV i, which is expected to reduce the energy consumption and improve the flight efficiency:

r_energy^i = e_t^i,

where e_t^i is UAV i's energy consumption in the last timeslot, defined in Eq. (4).
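Putting the three terms together, the per-UAV reward can be sketched as follows. The unweighted combination and the optional weight parameters are our assumptions for illustration, not values from the paper.

```python
def heuristic_reward(n_poi_self, n_poi_onehop, n_onehop, energy, penalty,
                     w_team=1.0, w_energy=1.0):
    """Sketch of the local heuristic reward for one UAV.
    n_poi_self: PoIs exclusively covered by this UAV (r_self)
    n_poi_onehop / n_onehop: PoIs covered by one-hop neighbors and their
        count (r_team); energy: e_t^i from Eq. (4) (r_energy);
    penalty: boundary penalty p. Weights w_team, w_energy are hypothetical."""
    r_self = n_poi_self
    r_team = n_poi_onehop / n_onehop if n_onehop else 0.0
    return r_self + w_team * r_team - w_energy * energy - penalty
```

Note that every input is locally observable by the UAV itself or its one-hop neighbors, which is what makes distributed training feasible.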
The overall reward function can be represented as

r_t^i = r_self^i + r_team^i − r_energy^i − p^i,

where p^i is an additional term that penalizes UAV i when it flies outside the 2D map. Specifically, the penalty p is applied whenever the UAV's position leaves the boundary of the 2D map.

Experimental Settings

We set the learning rate to 1e-4 and use Adam [48] as the optimizer. The experience replay buffer size is 5e4, and the batch size is 128. The number of neurons in all hidden layers is 256, and all attention kernels have 4 heads. The model is updated 4 times every 100 environmental timeslots. The target network is initialized with the same parameters as the learned network and is updated by copying the parameters of the model every 500 environmental steps. The discount factor γ is 0.99, and the temperature parameter α for SDRGN and MAAC is 0.2. For deterministic policies, the exploration strategy is ε-greedy, in which the value of ε is initially 0.9 and exponentially decays to 0.05 at around 30,000 episodes. For stochastic models such as SDRGN and MAAC, the exploration strategy is ε-multinomial, and ε exponentially decays to 0.
For each case, we train our models for 160,000 episodes. Every 100 episodes, an evaluation of the model is executed by running the frozen model with the minimum ε for 100 episodes and calculating the averaged value of all global metrics.
We use the CFE score defined in Eq. (6) as the major metric for evaluation. Other global metrics such as coverage index, fairness index, and energy index are also considered.

DRL and Non-DRL Baselines
We compare SDRGN with four state-of-the-art DRL algorithms: DQN [47], CommNet [46], MAAC [42], and DGN [40]. Besides, to verify the effectiveness of our maximum-entropy learning method from Section 4.2, we also train a DRGN-structured deterministic policy with the same training settings as DGN. To distinguish it from SDRGN, we name it DRGN. Discussions of the tested DRL baselines are as follows: 1) DQN is a simple yet strong DRL approach widely used in large-scale multi-agent tasks such as [49]. 2) MAAC is a recent work that improves the scalability of MADDPG with the self-attention mechanism.
Since the most relevant work [15] adopts MADDPG to control multi-UAV navigation, we choose MAAC as the main baseline for comparison. 3) CommNet is a centralized approach that performs communication among all UAVs during training and execution; we compare with it to show the superiority of our GAT-FANET-based communication protocol. 4) DGN is a recently proposed algorithm that also adopts GAT as the building block of its communication structure; we compare with it to examine the necessity of the memory unit. The analysis of the characteristics of all tested DRL algorithms is given in Table 3.
Meanwhile, we compare our approach with three non-DRL baselines: 1) Random: At each timeslot t, each UAV randomly selects an action. 2) MB-Greedy: At each timeslot t, each UAV assumes that the other UAVs choose "zero-drag", then tries all possible actions in a simulated environment to find the action a_i^t that maximizes the reward r_i^t. Since a simulated environment is required to estimate the reward of each action, this policy is named Model-Based Greedy (MB-Greedy).
3) MB-GA: In this policy, the objective is to find a joint action a_t that maximizes the joint reward Σ_i r_i^t. As the joint action space is exponential in the number of UAVs, we adopt a genetic algorithm (GA) to find a near-optimal joint action at each timeslot t. Since this policy also needs an environment model, we name it Model-Based GA (MB-GA).
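As a rough illustration, the MB-GA search over the joint action space could be sketched as below; the operator choices (one-point crossover, per-gene mutation, elitist selection) and all parameters are assumptions, since the paper does not specify the GA configuration:

```python
import random

def ga_joint_action(reward_fn, n_uavs, n_actions,
                    pop_size=40, generations=50,
                    mutation_rate=0.1, seed=0):
    """Hypothetical sketch of the MB-GA baseline: search for a joint action
    (one discrete action index per UAV) that maximizes the joint reward,
    instead of enumerating the exponentially large joint action space."""
    rng = random.Random(seed)
    # Each individual is a candidate joint action: one action index per UAV.
    pop = [[rng.randrange(n_actions) for _ in range(n_uavs)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=reward_fn, reverse=True)
        elite = scored[:pop_size // 2]          # keep the fitter half
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_uavs)      # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_uavs):             # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_actions)
            children.append(child)
        pop = elite + children
    return max(pop, key=reward_fn)
```

Here `reward_fn` stands in for the simulated environment model that scores a joint action.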
In Table 3, DT denotes the capability of decentralized training, DE denotes decentralized execution, Ad-hoc denotes whether the model communicates during execution, and Stochastic denotes a stochastic policy.
Since the MB-Greedy and MB-GA policies enumerate numerous possible actions at each timeslot to choose the action with the highest reward, they are considered strong baselines with near-optimal performance, at the expense of prohibitively slow inference.

Neural Network Convergence and Reward Function Effectiveness
As the multi-head GAT layer is the key element of GAT-FANET, its convergence is crucial to the performance of the SDRGN model. Since GAT calculates attention weights over all nodes in a dynamic subgraph, we use the attention weight that GAT assigns to the node itself as an indicator of convergence. As shown in Fig. 4, each head of GAT converges to a certain value, and the differences between the values of different heads indicate the necessity of using multiple heads. We then show the convergence of all tested DRL methods by illustrating the trend of the evaluation reward over the training phase in Fig. 5. We notice that SDRGN converges faster, to a higher reward, and more stably than the other DRL algorithms.
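To make the convergence indicator concrete, the following toy sketch computes single-head GAT attention over a fully connected subgraph and returns each node's attention weight to itself; the function name, shapes, and parameterization are illustrative, not the paper's implementation:

```python
import numpy as np

def gat_self_attention(h, a_src, a_dst):
    """Toy single-head GAT attention over a fully connected subgraph,
    returning the weight each node assigns to itself (the convergence
    indicator described above).

    h            : (n, d) node embeddings, already linearly transformed
    a_src, a_dst : (d,) halves of the attention vector, as in GAT
    """
    # Unnormalized attention logits: e_ij = LeakyReLU(a_src.h_i + a_dst.h_j)
    logits = (h @ a_src)[:, None] + (h @ a_dst)[None, :]   # (n, n)
    logits = np.where(logits > 0, logits, 0.2 * logits)    # LeakyReLU(0.2)
    # Softmax over each node's neighborhood (here: all nodes incl. self)
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return np.diag(alpha)   # alpha_ii: attention to self
```

When all node embeddings are identical, every node attends uniformly, so the self-attention weight is 1/n; during training, the indicator drifting to a stable non-uniform value is the signal of convergence used here.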
Despite SDRGN's superiority in terms of reward, a remaining question is that, since we train the DRL models with a heuristic reward function that only evaluates local performance, the model's performance is not guaranteed in terms of global metrics. In other words, the heuristic reward function should satisfy Eq. (19). As can be seen in Fig. 6, as the episodic reward grows, the final coverage and fairness indices improve and the final energy index decreases; thus, the effectiveness of the designed reward function is empirically demonstrated.
To be more intuitive, we run a converged SDRGN model in the simulated environment for 1,600 timeslots and show the trend of the global metrics during testing in Fig. 7. We observe that the coverage and fairness indices quickly converge to their maxima at around timeslot 400 and never fall afterwards. The initial energy index is close to the maximum value of 1, possibly because the UAVs need to explore the map to figure out the distribution of PoIs; it then quickly drops to 0.7 at timeslot 100 and keeps decreasing slowly. As a result, the SDRGN policy distributedly controls each agent in the UAV swarm to optimize the global CFE score during the task. These curves indicate that a good strategy for multi-UAV navigation can be learned with our designed reward function.

Finding Appropriate Structure for GAT-FANET
Next, we present the experimental results of our search for an appropriate network structure for GAT-FANET. We test three communication structures in DGN and DRGN: one GAT layer (one-hop), two stacked GAT layers (two-hop), and two stacked GAT layers with a skip connection (two-hop+SC). We adopt the CFE score as the metric, and the CFE score curves of DGN/DRGN with different GAT-FANET structures are shown in Fig. 8. There are two observations. First, stacking two GAT layers slows down the convergence of the model in the early stages because it introduces more trainable parameters, but in the long run it improves performance, possibly because it enlarges the perception field of each UAV. Second, a skip connection across the two GAT layers slightly improves performance and speeds up training, as it alleviates gradient vanishing. Therefore, we adopt two-hop+SC as the structure of GAT-FANET for DGN, DRGN, and SDRGN, as shown in Fig. 3b.
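The two-hop+SC wiring can be sketched as follows; to keep the example short, the GAT layer is replaced by a simplified mean-aggregation graph layer, so this illustrates only the stacking and skip-connection structure, not the attention mechanism itself (all names are illustrative):

```python
import numpy as np

def graph_layer(h, adj, w):
    """Stand-in for one GAT layer: uniform aggregation over neighbors
    (adjacency is assumed to include self-loops), linear map, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    agg = (adj @ h) / np.maximum(deg, 1)   # mean over each neighborhood
    return np.maximum(agg @ w, 0)

def two_hop_sc(h, adj, w1, w2):
    """Two stacked graph layers with a skip connection across both,
    mirroring the two-hop+SC variant: out = layer2(layer1(h)) + h."""
    out = graph_layer(graph_layer(h, adj, w1), adj, w2)
    return out + h   # skip connection eases gradient flow through both hops
```

The second hop lets information travel two edges in FANET (the enlarged perception field), while the additive skip keeps a direct gradient path to the input, which is the vanishing-gradient argument above.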

Comparing With DRL and Non-DRL Baselines
In this section, we evaluate the performance of our proposed approaches, DRGN and SDRGN, and compare them with the state-of-the-art DRL algorithms and non-DRL baselines described in Section 5.2. For each DRL algorithm, we choose the best learned model and test it in environments with 20 UAVs for 100 episodes.
We first compare DRGN and SDRGN with other DRL methods. It can be seen from Fig. 9 that, in terms of CFE score, DRGN outperforms all other DRL baselines, and SDRGN further improves the performance of DRGN.
Then, to better evaluate the performance of the learned policies, we analyze each component of the CFE score, as presented in Table 4, and make the following observations: First, the policies with GAT-FANET (DGN, DRGN, and SDRGN) outperform the other baselines in terms of coverage and fairness, which indicates that better cooperative exploration and path planning are achieved by graph-based communication. It also verifies that an effective communication protocol can be learned through backpropagation with the proposed GAT-based network structure. Besides, we notice that CommNet, a communication-based DRL method, fails to achieve better coverage and fairness indices than the non-communication methods (DQN and MAAC), possibly due to the poor capacity of its communication protocol.
Second, by comparing DRGN and DGN, we see that DRGN improves coverage by 0.018 and fairness by 0.038 at the cost of 0.017 more energy overhead, resulting in a 0.031 improvement in CFE score, which proves the necessity of the memory unit in our partially observable environment.
Third, we find that SDRGN decreases energy consumption by 0.027 relative to DRGN, at the expense of slightly degraded coverage and fairness, and achieves the best CFE score. An interesting finding is that DQN achieves the lowest energy consumption among the DRL methods, which is intuitive since DQN cannot obtain extra information from other UAVs, so its only way to gain a higher reward is to cover the PoIs within its observation range and reduce its own energy consumption. The non-DRL methods MB-Greedy and MB-GA also achieve low energy consumption, yet they are uncompetitive in terms of coverage and fairness, because these policies decide the optimal action based only on the reward of the current step. This problem could be partially mitigated by considering multiple future steps with a heuristic search approach such as beam search, at the expense of prohibitively higher computation costs.

Scalability and Transfer-Ability
Although some previous works could effectively control multi-UAV navigation, the scale of the UAV swarm is small, typically 3 to 10 [37], [41]. Such a small scale cannot meet the requirements of many realistic missions, so the learned model needs to be tested with larger groups of UAVs. Moreover, in real-world tasks, the number of UAVs may change dynamically due to emergencies such as energy shortage or being attacked. Therefore, it is necessary to compare the performance of the trained model under different UAV scales. To this end, we test the performance of all DRL methods with the number of UAVs varying in {5, 10, 15, 20, 25, 30, 35, 40}.
We first compare the scalability and transferability of the DRL methods. During the training phase, there are consistently 20 UAVs in the environment, while in the testing phase the scale of UAVs varies. For each case, we test 100 episodes and calculate the averaged CFE score, as shown in Fig. 10. We have two observations: First, as the number of UAVs grows, the performance of DRGN and SDRGN increases linearly, validating their scalability. Besides, regardless of the number of UAVs, SDRGN consistently performs better than the other DRL methods, verifying its good transferability.
Second, when the number of UAVs is greater than 20, the performance gap between DGN and DRGN narrows. This is partially due to the increased UAV density in the environment, which allows each UAV to obtain more information from GAT-FANET and makes the memory unit less necessary. By contrast, when the number of UAVs is less than 20 and the UAV density is relatively low, the memory unit largely compensates for the information loss in the partially observable environment, and DRGN significantly outperforms DGN.

Robustness and Interpretability of GAT-FANET
After verifying the performance of SDRGN, we now dive deeper into the performance and the principle of GAT-FANET, which is a key element in SDRGN.
We first examine the robustness of GAT-FANET. In the training phase, we assume that communication is reliable. However, in the real world, communications within the FANET can be interrupted for various reasons. Thus, we assume that the communication between adjacent nodes is interrupted with probability p, and test the performance of DGN, DRGN, and SDRGN under different p. As can be seen from Fig. 11, the performance of each model deteriorates as the communication drop rate p grows. (In the results table, the best and second-best models are indicated by bold and underline, respectively.) We also notice that the performance degradation of DGN is smaller than that of DRGN and SDRGN. Our insight into this phenomenon is that the memory units in DRGN and SDRGN take the node embeddings aggregated from GAT-FANET as input, and random communication drops may harm the temporal correlation of the input samples. Nevertheless, even when communication is entirely unavailable (i.e., p = 1.0), the GAT-FANET based algorithms still outperform the other DRL baselines (MAAC, DQN, and CommNet), and our proposed SDRGN still achieves the best performance, which demonstrates the robustness of GAT-FANET.

We then consider the interpretability of GAT-FANET. To this end, we design an intuitive example, shown in Fig. 12a, in which each UAV is controlled by our GAT-FANET based policy. Due to the partial observability, UAV 0 cannot directly observe the PoIs around UAV 4. Since 5 neighboring UAVs are connected with UAV 0 while only UAV 4 has observed the PoIs, the GAT-FANET of UAV 0 should learn to pay the most attention to UAV 4 in order to take the optimal action of moving rightwards. To figure out whether GAT-FANET has learned to extract valuable information from neighboring nodes, we visualize the attention weights of UAV 0's GAT layer at the current timeslot, as shown in Fig. 12b. It can be seen that head 1 assigns an attention weight of 0.7 to UAV 4, while the other neighboring UAVs receive at most 0.43.
This indicates that GAT-FANET has learned to automatically identify valuable neighbors.

Practicality of the DRL-Based Approach
Using DRL to control multi-UAV navigation in a distributed manner, especially using neural networks to express the communication protocol among UAVs as GAT-FANET does, requires more computation than prior RL approaches [32] and may raise concerns about computing efficiency. We hereby examine whether an onboard computer can run the proposed DRL-based policy in real time.
Thankfully, with the development of edge computing, there are already many lightweight devices with high matrix computing capability and low energy cost. To validate the practicality of our DRL-based approach, we deployed several DRL policies on a Jetson Nano, which is widely used in autonomous UAV control [50]. To figure out how each component of SDRGN affects the inference speed, we start by testing DQN and then sequentially add the proposed modules into the model. Each model is executed for 500,000 timeslots to calculate the mean and standard deviation of the inference time; the results are shown in Table 5. Note that one-hop and two-hop denote adding one GAT layer and two stacked GAT layers, respectively, and SC denotes the skip connection across the two GAT layers. DRGN adds a memory unit on top of DGN, and SDRGN additionally performs multinomial sampling on the Q-values.
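The timing procedure could be sketched as below; `policy_fn` and `obs` are placeholders for the deployed model and its input, and the warm-up phase is an assumption not stated in the text:

```python
import time
import statistics

def profile_policy(policy_fn, obs, n_steps=500_000, warmup=100):
    """Hedged sketch of the timing procedure: run the frozen policy for many
    timeslots and report the mean and standard deviation of the per-step
    inference time, as in the Table 5 measurements."""
    for _ in range(warmup):          # warm up caches before timing
        policy_fn(obs)
    samples = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        policy_fn(obs)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)
```

On-device, `policy_fn` would wrap a forward pass of the deployed network, and the step count would match the 500,000 timeslots reported above.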
Based on Table 5, we have the following insights: First, all tested DRL policies can be executed in real time, which empirically proves the practicality of our DRL approach. Second, the GAT layer is the bottleneck of inference speed, hence we do not stack more GAT layers in GAT-FANET to enlarge the perception field. We leave the inference acceleration of the GAT layer as future work.

CONCLUSION
In this paper, the UAV-MBS task is redefined as a partially observable problem in which global information is not available during the training phase. To handle the information loss caused by partial observability, we introduce a spatial-temporal-aware network, DRGN, whose communication structure, GAT-FANET, is designed based on network architecture search, and whose GRU-based memory unit provides long-term memory capability. Inspired by maximum-entropy learning, we propose a novel DRGN-structured stochastic policy named SDRGN, which shows better performance, scalability, transferability, robustness, and interpretability than previous DRL methods in our environment. Since our model is trained with a heuristic reward function based only on the local information obtained by each UAV individually, SDRGN greatly reduces the training cost and enables distributed online learning while the UAV swarm is on a mission. As future work, we would like to speed up the inference of the model and extend SDRGN to an actor-critic style to support a continuous action space.