Multi-Agent Reinforcement Learning Based 3D Trajectory Design in Aerial-Terrestrial Wireless Caching Networks

This paper investigates a dynamic 3D trajectory design of multiple cache-enabled unmanned aerial vehicles (UAVs) in a wireless device-to-device (D2D) caching network with the goal of maximizing the long-term network throughput. By storing popular content at the nearby mobile user devices, D2D caching is an efficient method to improve network throughput and alleviate backhaul burden. With the attractive features of high mobility and flexible deployment, UAVs have recently attracted significant attention as cache-enabled flying base stations. The use of cache-enabled UAVs opens up the possibility of tracking the mobility pattern of the corresponding users and serving them under limited cache storage capacity. However, it is challenging to determine the optimal UAV trajectory due to the dynamic environment with frequently changing network topology and the coexistence of aerial and terrestrial caching nodes. In response, we propose a novel multi-agent reinforcement learning based framework to determine the optimal 3D trajectory of each UAV in a distributed manner without a central coordinator. In the proposed method, multiple UAVs can cooperatively make flight decisions by sharing the gained experiences within a certain proximity to each other. Simulation results reveal that our algorithm outperforms the traditional single- and multi-agent Q-learning algorithms. This work confirms the feasibility and effectiveness of cache-enabled UAVs which serve as an important complement to terrestrial D2D caching nodes.


I. INTRODUCTION
I N RECENT years, the proliferation of data-intensive wireless applications, such as augmented reality and mobile online gaming, leads to an explosion of network traffic. To alleviate the backhaul overload caused by duplicate content transmission, device-to-device (D2D) caching is a promising approach by storing popular content at the users' cache. In that way, the content requests can be served via D2D communications without incurring the cost of using cellular bandwidth [1]. However, designing wireless D2D caching systems face two major challenges. Firstly, the content requests may not be satisfied due to the limited cache storage of D2D users. Secondly, caching at static nodes may not be effective in a mobile environment. Although caching contents at multiple D2D nodes or base stations (BSs) may resolve this challenge, it still suffers from signaling overhead and additional storage cost. Therefore, there is a need for a flexible deployment of cache-enabled BSs that can track the users' movement to effectively transmit the requested contents. Unmanned aerial vehicles (UAVs) as aerial BSs [2] in which popular contents can be cached provide several benefits to the cellular network. Firstly, UAVs can provide wider wireless coverage due to high line-of-sight (LoS) probabilities at high altitudes [3]. Secondly, cache-enabled UAVs can be dynamically deployed and moved to deliver the requested files to the desired users, thereby improving caching efficiency [4]. However, the operation time of UAVs is constrained by its limited battery capacity. Therefore, how to design an efficient UAV trajectory to achieve a high overall content delivery performance is a critical issue.
In this paper, we consider an aerial-terrestrial wireless caching network in which popular contents can be cached at UAVs and ground mobile users. In this case, to utilize the channel and storage resources efficiently, the trajectory design should consider both the movement of the ground user and the behavior of other UAVs. This makes the problem of finding optimal trajectories more complex and challenging. In short, two key problems are addressed in this paper: 1) how to design a cooperative moving strategy for multiple cache-enabled UAVs taking into account the user mobility; and 2) how to improve network throughput by efficiently allocating the cache storage capacity at the UAVs.

A. Related Work
Deploying cache-enabled UAVs in the presence of terrestrial networks has been discussed in [5]- [12]. In [5], the authors considered the joint caching and resource allocation of cache-enabled UAVs that can serve ground users over licensed and unlicensed bands. An effective UAV spectrum allocation scheme was proposed to allocate appropriate bandwidth with the objective to maximize the number of stable queue users. The authors of [6] investigated the UAV-assisted content caching and 0018-9545 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
transmission problems for the wireless virtual reality networks, whose goal is to find the optimal content and cache storage capacity at the ground BSs. To maximize the quality-of-experience (QoE) of ground users, a machine learning (ML) framework was proposed to determine the UAVs' position and optimal cache contents at the UAV [7]. A blockchain-based approach was proposed to solve the node failure and network connectivity problem for maintaining the reliability requirements of a drone-caching network [8]. Some studies on cache-enabled wireless networks focused on the design of secure transmission schemes by adjusting the UAV trajectories [9] and performing interference management [10]. In addition, there exist works focusing on utilizing UAVs to assist terrestrial D2D networks. In [11], a UAV transmission resource allocation problem was considered in which the UAV acts as a carrier to transfer energy to the D2D pairs. The authors of [12] considered a UAV enabled caching in which contents can be transferred via a terrestrial BS or a UAV. However, to our best knowledge, the design of aerial-terrestrial wireless caching network in which contents can be cached at both UAVs and terrestrial D2D users has not been explored in the existing literature. Furthermore, the trajectory design or placement problem for optimizing the performance of the UAV-assisted network has received tremendous attention under different setups, such as coordinate multipoint (CoMP) architectures [13], UAV-enabled multicasting systems [14], and UAV sensing systems [15]. To fully exploit the benefit of the UAV, several works have studied the trajectory design taking into account the user mobility [16], [17]. In these works, the UAV trajectory was designed by assuming that the positions of ground users are known [16] or predictable [17] in a given period. However, in practice the users may move randomly and independently, resulting in unpredictable mobility patterns. Moreover, the UAV trajectory design in the existing works was solved offline either in 2D space [16], [17] or by separately designing the altitude and horizontal location [18]. Different from these works, in this paper the 3D movement design for UAVs is adjusted in an online fashion taking into account the time dynamics of user positions.
To realize the highly maneuverable autonomous UAVs, MLbased solutions are desired for the UAV control without human intervention, which has been considered as a use case in the 3 rd Generation Partnership Project (3GPP) [19]. The K-means clustering algorithm was adopted to partition the ground users into K clusters, in which the UAV can be initially placed at the centroid of the cluster [20], [21]. For the online UAV trajectory design with mobile ground users, the Markov decision process (MDP) has been widely applied to model the UAV control decision problem, which can be solved by reinforcement learning (RL) techniques. In RL, an agent can learn the optimal policy by interacting with the unknown environment (e.g., user movements, channel variations). A joint K-means clustering and single RLbased algorithm for the multi-UAV deployment was proposed in [22]. However, the single RL requires a centralized controller with full knowledge of network information from each UAV, which is infeasible for highly dynamic aerial networks [23]. Compared to the single RL, multi-agent reinforcement learning (MARL) has been shown to provide a more effective learning performance, especially when only local information is available [24]. By considering the individual and applicationspecific information, MARL solves the sequential multi-agent decision making problem in distributed manners. The authors of [25] investigated a cellular network in which multiple UAVs transmit their sensory data to terrestrial nodes. By utilizing MARL, a multi-UAV trajectory design algorithm was developed, whose goal is to optimize the number of successful data transmissions. In [26], an MARL-based approach was proposed to solve the joint trajectory design and power control problem that aims to maximize the instantaneous transmit rate. In the above works, each UAV is regarded as an independent learning agent that conducts the standard single agent RL algorithm with no interaction between other UAVs. However, it is revealed that appropriate cooperation between UAVs can substantially improve the long-term performance [27]. Therefore, the cooperation and learning for the online trajectory design of multiple cache-enabled UAVs still require further investigation.

B. Contribution
As discussed above, invoking MARL to UAV-assisted wireless networks offers a promising solution for intelligent UAV control and resource allocation. However, the research gap still exists in investigating the aerial-terrestrial wireless caching networks, which is worthy of further study. In this paper, we aim to develop an MARL framework for the continuous operation of multiple cache-enabled UAVs. Specifically, we consider a downlink wireless caching network that allows the coexistence of terrestrial D2D caching and aerial UAV-to-Device (U2D) caching using sub-6 GHz and millimeter wave (mmWave) spectrum bands, respectively. Due to higher data transmission rates in LoS as compared to sub-6 GHz, content delivery over the U2D links is prioritized first. We assume that each UAV can communicate with ground users and other UAVs in the absence of a central controller. Based on the proposed framework, the main contributions of this paper can be summarized as follows: r We develop a novel framework for 3D UAV trajectory design in which multiple cache-enabled UAVs are deployed to serve ground users with an unpredictable mobility pattern. Meanwhile, we formulate the network throughput maximization problem by optimizing the UAVs' initial positions and their dynamic movements. We show that the formulated problem is NP-hard due to the coupled association constraint.
r We exploit the MARL technique to design the optimal trajectories of multiple UAVs in a distributed manner, in which each UAV acts as an agent and conducts a decision algorithm independently. Utilizing the local state measurements, the proposed online UAV control scheme autonomously adjusts the real-time UAV position without prior knowledge of content request statistics or channel models. Moreover, unlike the existing MARL-based trajectory design based on fully independent agents, in the proposed scheme agents can share the gained experiences within a certain proximity. By optimally selecting the size of cooperative region, the proposed algorithm can strike a balanced tradeoff between independent action selection in r Simulation results confirm the effectiveness of integrating cache-enabled UAVs into the terrestrial D2D wireless caching network. Also, it is demonstrated that our proposed learning algorithm can significantly improve the network throughput as well as the U2D link utilization over other existing state-of-the-art solutions. Finally, we provide some valuable insights for the design of learning parameters (e.g., learning rate) and network parameters (e.g., cache storage capacity).

C. Organization
The rest of the paper is organized as follows. In Section II, we introduce the system model. Section III presents the content delivery service and the problem formulation. In Section IV, we provide the RL-based trajectory design with independent agents. We propose the cooperative MARL-based trajectory design in Section V. Simulation results are given in Section VI. We conclude the paper in Section VII. Additionally, the list of notations is given in Table I.

A. Network Model
We consider an aerial-terrestrial wireless caching network, where K UAVs equipped with cache storage are deployed to serve ground mobile users, as shown in Fig. 1. Let U and K be the sets of all users and UAVs, respectively. Besides, one ground BS is located at the center of the geographical area while all the UAVs are covered by the ground BS. We assume that the locations of mobile users are spatially distributed as a homogeneous Poisson point process (HPPP) Φ u with density λ u [28]. The HPPP is widely used in the performance analysis of mobile networks due to its mathematical tractability [29].
The user content requests are served via one of the following links: U2D links, D2D links, and BS-to-device (B2D) links. The UAVs operate at the mmWave band while the B2D and D2D links operate at the same sub-6 GHz band. We also assume that the UAVs are connected to the ground BS via high-speed backhaul links. We will discuss the link selection policy for the content delivery in Section III.
Let h k,t ∈ [h min , h max ] denote the altitude of the k th UAV at the t th time slot, where h min and h max are the lowest and the highest altitudes for all the UAVs, respectively. The horizontal coordinate of the k th UAV at the t th time slot is denoted by w k,t = (x k,t , y k,t ) and the coordinate of the u th user at the t th time slot is denoted by q u,t = (r u,t , s u,t ). We consider a realistic Gauss Markov mobility model [30] for the movement of ground users. In the Gauss Markov mobility model, both the values of velocity and moving direction at time t are calculated based on the values of velocity and moving direction at time t − 1. Namely, the velocity and direction for the u th user at the t th time slot can be written as where α ∈ [0, 1] reflects the degree of randomness in the mobility pattern [31]. Also,v andθ denote the asymptotic mean value of velocity and direction for all users when t approaches infinity, respectively. Parametersv andθ are independent stationary Gaussian processes with zero mean and unit variance.
In our system, we focus on the case where the mobility model incorporates randomness and memory by setting 0 < α < 1.

B. Channel Model
Two different channel models are considered in our system, namely UAV-to-ground channel and ground-to-ground channel.

1) UAV-to-Ground Channel Model:
Compared to the propagation of terrestrial communications, the UAV-to-ground channel is highly dependent on the altitude and the elevation angle. For the propagation model, we adopt the log-normal shadowing channel [7], in which LoS links and non-line-of-sight (NLoS) links can be modeled with corresponding channel parameters. The LoS and NLoS pathloss (in dB) from the k th UAV to the u th user at the t th time slot are given by where is the free-space path loss at reference distance d 0 and d u u,k,t is the distance between the k th UAV and the u th user at the t th time slot, i.e., Also, f c and c are the carrier frequency and the speed of light, respectively. Here, μ LoS and μ NLoS are the large-scale path loss exponents for the LoS and NLoS links, respectively. χ σ LoS and χ σ NLoS represent the Gaussian random variables with zero mean for the LoS and NLoS links, respectively. σ LoS and σ NLoS are, respectively, the standard deviations for the LoS and NLoS links. The LoS probability can be modeled as a logistic function of the elevation angle φ u,k,t [32], i.e., where X and Y are environment-dependent parameters (e.g., urban or rural). Also, the elevation angle is given by . Therefore, the average path loss for the U2D links can be expressed as ) . (7) According to [33], the interference can be neglected if the distance is large enough in the mmWave UAV networks. The signal-to-noise ratio (SNR) for the U2D link from the k th UAV to the u th user is given by where P UAV denotes the transmission power of a UAV and σ 2 denotes the power of additive white Gaussian noise (AWGN). In addition, |g u,k,t | 2 denotes the small-scale fading gain, which follows a Nakagami-m distribution to characterize a wide range of fading environments.
2) Ground-to-Ground Channel Model: For the terrestrial links, we consider the general power-law propagation model and the small-scale Rayleigh fading channel. We denote the channel gain of the D2D link between the u th user and the m th user at the t th time slot as |g d u,m,t | 2 . Besides, the channel gain of the B2D link between the u th user and the ground BS at the t th time slot is denoted as |g c u,t | 2 . It is assumed that |g d u,m,t | 2 and |g c u,t | 2 are independent and identically distributed exponential random variables with mean μ d and μ c [34]. For the D2D links, the receivers experience both inter-D2D interference and cross-tier interference from the ground BS. Therefore, the signal-to-interference-plus-noise ratio (SINR) of the D2D link achieved by the u th user from the m th user at the t th time slot can be expressed as where Φ denotes the set of D2D transmitters; β represents the path loss exponent; P D2D and P BS denote the transmission power of the D2D user and the ground BS, respectively. Besides, is the distance between the u th user and the m th user; d c u,t = r u,t 2 + s u,t 2 is the distance between the u th user and the ground BS which is located at the origin.
On the other hand, for the B2D links, we ignore the interference from other D2D links since the transmit power of D2D users is much less than that of the ground BS. Hence, the SNR of the B2D link at the u th user at the t th time slot is given by

C. Caching Model
Suppose that each ground user and UAV have finite cache storage capacity M q and M w = cM q with integer c > 1, respectively. Also, we consider a requested content library F = {f 1 , f 2 , . . ., f N } that consists of N equal-sized content files. It is assumed that the ground BS has the entire requested content library. We denote the content request probability of f i as p i , which satisfies the Zipf law [35], [36]. The request probability of f i is given by where κ is the popularity factor. A large κ implies that the content files are concentrated distribution. On the contrary, a smaller κ implies that the content request probability is more evenly distributed.
We adopt the geographic caching strategy [37] for both U2D and D2D links. In the U2D caching, each UAV stores content to the cache storage capacity constraints at UAVs and users, respectively [38].
The following cases are considered for calculating the cache hit probability, which is defined as the probability that a user can receive the requested file. 1) Self-Request Cache Hit: In this case, the requested file has been cached in its own local device. The cache hit probability can be written as 2) D2D Cache Hit: The probability of having the requested file cached at a D2D user depends on the D2D user density λ u and the area size. According to [39], the cache hit probability of file f i within distance R d can be written as 3) U2D Cache Hit: Recall that the UAVs can adjust their positions according to the users' positions. Thus, for the infinite time horizon, the UAVs can also be considered to be spatially distributed as an HPPP. In principle, the UAV cache hit probability p UAV hit depends on the average number of UAVs in which the user-UAV distance is within the UAV's transmission range R u . Clearly, the actual value of p UAV hit is affected by the UAV trajectory design. The upper bound of p UAV hit is obtained when the requested user is within the transmission range of all the K UAVs. Hence, we have

A. Content Delivery
Let the requested file of the u th user at the t th time slot be denoted as τ u t . The sets of cached files at the k th UAV and the m th user are denoted by C k t and D m t , respectively. The following actions will be performed when receiving a content request from a ground user.
r Case 1: The requested file is stored at the k th UAV located in the U2D transmission range, i.e., where ∧ is the logical "and". The u th user can obtain τ u t from the k th UAV via the U2D link.
r Case 2: There does not exist a UAV caching τ u t within the U2D transmission range, i.e., Then, there exist two possible sub-cases: r If the requested file τ u t is stored at the m th ground user within the D2D transmission range, i.e., the u th user receives τ u t from the m th user via the D2D link.
r Otherwise, τ u t is delivered to the u th user by the ground BS. Next, we analyze the successful transmission probability under different transmission modes.

1) U2D Links:
If the required content is stored in one or more UAVs, the user is assumed to associate with the nearest UAV. Let k f i ,u denote the index of the nearest UAV caching the content f i requested by the u th user. To simplify the notation, we use k instead of k f i ,u hereafter. Given the U2D SNR threshold η UAV , the successful transmission probability is given bỹ where is the lower incomplete Gamma function. Also, m and b denote the shape parameter and scale parameter of the Gamma distribution, respectively.
For the Rayleigh fading channel, which can be obtained by setting m = 1 in the Nakagami fading channel, the channel gain |g u,k,t | 2 follows the exponential distribution with unit mean. Since the UAVs are assumed to be distributed according to HPPP with parameter Kq u i , the probability density function (PDF) of the distance between the u th user and the k th UAV is given by [35]. Then, we can rewrite (18) as where q u,k,t (q u , r) is the successful transmission probability conditioned on d u u,k,t = r. 2) D2D Links: Let Φ i and Φ −i denote the sets of D2D users with and without content f i requested by the u th user, respectively. Considering the interference from the other D2D links and the ground BS, we rewrite ference from other D2D users with and without content f i , respectively. Also, −β represents the interference from the ground BS. Then, the successful transmission probability of D2D links conditioned on d d u,m,t = r is given by where denotes the Laplace transform of I i and η D2D is the SINR threshold for D2D links.
According to [40], we can have L I i (s, r)| s=r β η D2D given by where B (x, y, z) where B(x, y) du denotes the Beta function. Since the D2D users form an HPPP with parameter . Then, the successful transmission probability of the D2D links can be obtained as 3) B2D Links: Recall that the ground BS has the entire requested content library. The successful transmission probability of the B2D link can be obtained from (19) by setting K = 1 and q u i = 1, i.e., where η BS is the SNR threshold for B2D links.
With the expressions of the successful transmission probabilities reported above, in the following subsection we define the performance metric for the considered aerial-terrestrial wireless caching network.

B. Problem Formulation
As discussed previously, we assume that the U2D links have the highest priority in order to best utilize the mmWave spectrum resource. The throughput of the U2D link can be written as where B UAV is the system bandwidth of UAV transmission and N k is the number of users receiving data from all the UAVs. If the content request is fulfilled by the D2D links, the throughput of the D2D links can be written as where B D2D is the system bandwidth of D2D transmission and N u is the number of users receiving data through D2D links. If both the U2D and the D2D links cannot provide the requested content, the user acquires the content from the ground BS. The throughput of the B2D links can be written as where B BS is the system bandwidth of cellular transmissions and N b is the number of users receiving data from the ground BS. Then, the total network throughput can be calculated as the sum of throughput obtained from all the links, i.e., From (25) and (28), it can be observed that the network throughput is substantially affected by the positions of UAVs. More specifically, the positions of UAVs not only affect the U2D successful transmission probability but also the UAV cache hit probability. It is worth noting that (14) only gives the upper bound of the UAV cache hit probability. The real value of the UAV cache hit probability depends on the real-time positions of all the UAVs and all the ground users. Thus, the UAV trajectory optimization problem is formulated as where (29b) and (29c) indicate the constraints of the limited area, with [x min , x max ] and [y min , y max ] representing the x-axis and y-axis movement range of the UAVs, respectively; (29d) is the altitude constraint for the UAVs. Given the maximum UAV velocity v max , (29e) represents the flight velocity constraint, in whichẋ k,t ,ẏ k,t , andḣ k,t denote the first derivatives of x k,t , y k,t , and h k,t , with respect to t, respectively. Furthermore, (29f) and (29g) are the association constraints to ensure that each user is only served by one UAV at most, with ρ u,k denoting the binary association indicator of the u th user and the k th UAV. Note that problem (29a) is a non-convex mixed integer programming problem because of the complicated mathematical expression of the successful transmission probability and the integer association constraint in (29f). Such problems are generally difficult to obtain the solution in polynomial computational complexity [41]. In the following sections, we propose an MARL-based method for optimizing the UAV trajectory strategy.

IV. REINFORCEMENT LEARNING FOR 3D UAV TRAJECTORY DESIGN WITH INDEPENDENT AGENTS
Generally, the proposed RL framework for UAV trajectory design consists of two phases. In the first phase, initial UAVs positions are calculated by the K-means clustering algorithm. In the second phase, the UAVs positions are dynamically adjusted based on the Q-learning algorithm to achieve the maximum network throughput.

A. UAV Trajectory Initialization Based on K-Means Clustering
From (8), the distance between the UAVs and the users plays a key role in network performance. To minimize the distance between the UAVs and the users, we apply the K-means clustering algorithm to divide all the users into K clusters and find the initial UAVs positions. As an unsupervised learning method, K-means clustering algorithm can be employed to cluster the objective nodes into several groups and find the centroid of each group, which is widely adopted in recent works for UAV trajectory design [22]. We denote the cluster set of all the users as S = {S 1 , S 2 , . . ., S K }. The objective of the K-means clustering algorithm is to minimize the overall squared distance between each user and its nearest centroid, which is given by where is the centroid point of cluster k. We select c k from the final results of the K-means clustering algorithm to be the initial UAVs positions.

B. Independent Q-Learning Based Trajectory Design
Although the K-means clustering algorithm can provide the initial UAVs positions, it has the following limitations. First, the K-means clustering algorithm may converge to a local minimum and thus becomes inefficient when the network consists of a large number of mobile nodes [22]. Second, the clustering-based approach fails to take into account other key factors, such as the UAV cache hit probability and the successful transmission probability. To deal with these limitations, we propose to adopt the Q-learning algorithm in the second phase of our proposed method. Hereafter, the "UAV" and "agent" are used interchangeably.
The UAV trajectory design problem can be formulated as a MDP, which is composed of five tuples (S, A, R, P, γ), where S = {s 1 , . . ., s I } denotes a finite set of states, A = {a 1 , . . ., a J } denotes a finite set of actions, R = {r(s i , a j )|s i ∈ S, a j ∈ A} is the set of immediate reward r(s i , a j ) when action a j is selected while in state s i , P = {p(s, s, a j )|s, s ∈ S, a j ∈ A} is the transition probability for an agent that moves from the state s to the state s when performing action a j , and γ ∈ (0, 1] is the discount factor. Next, we show how the MDP can be solved using independent Q-learning. In the conventional Q-learning algorithm, the UAVs can be regarded as independent agents that learn the optimal action independently. Hereafter, we refer to this kind of Q-learning method as independent Q-learning.
To learn the optimal action, the first step is to let the agent recognize the current system state through interacting with the environment. In particular, the interaction experience is represented by a Q-value, which is an estimate of the expected reward for performing a certain action at a certain state. The Q-values are usually stored in a look-up table called Q-table for reusing the gained knowledge. In the learning phase, the Q-learning algorithm iteratively updates the Q-values by the following rule: where Q k (s, a) and Q * k (s, a) are the old and new Q-values, respectively, for the k th agent performing action a at state s. Here, r k (s, a) denotes the reward for the k th agent and μ is the learning rate which determines how much the new observation overwrites the old one. Also, γ is the discount factor, which represents the impact degree of the future reward on the current decision.
Next, we describe the MDP design for the proposed trajectory design using the Q-learning method.
1) State: We consider the 3D position of the UAV to be the state in our system. Hence, the state of the k th UAV is denoted by s k = (x k , y k , h k ). To simplify the learning process, we adopt a discrete state-space approximation in which the state space is quantized into discrete regions.
2) Action: a k = (d a k , l a k ) denotes the action of the k th UAV, which consists of the direction d a k and the moving distance l a k of the UAV. We consider that, at any time slot, the UAV can fly to an adjacent grid in one of the six directions: up, down, right, left, forward, and backward. Also, the action space for the moving distance at each time slot is discretized into z levels associated with corresponding flight velocities ranging from [0, v max ].
3) Reward: In this work, the reward can be interpreted as how the action affects the total network throughput T total . The reward function is designed in a way that encourages the UAV to take actions that maximize its own throughput. At a time slot, if the SNR for the U2D link is larger than a minimum SNR threshold, the UAV receives a reward r k , which is defined as the throughput provided by the k th UAV. Otherwise, the UAV receives a zero reward. Hence, the reward function r k can be expressed as where is the set of users who are located within the transmission range of the k th UAV which caches the requested file τ u i .
Additionally, we adopt the ε-greedy policy [42] for the action selection in the Q-learning algorithm. More specifically, the UAV takes the action that has the largest Q-value with a probability 1 − ε for a given current state while randomly selecting other actions with a probability ε |A|−1 .

A. Overview
Independent Q-learning methods present inefficiency in terms of learning speed and effectiveness when they are adopted in multi-UAV systems. Since each of the UAVs learns the policy only from its own experience, the learned knowledge of a single UAV (e.g., channel environment and user mobility) cannot be reused by other UAVs. It is time wasted for all the UAVs to explore the unknown environment when some of them are near to each other.
To deal with this problem, we propose a distributed cooperative mechanism for the learning process of Q-learning. In the proposed mechanism, UAVs within the same cooperative region (defined formally later) work together to achieve a common goal by sharing the past experience. More specifically, these coordinated UAVs exchange the obtained reward and update the shared Q-table. Hence, unnecessary learning processes can be reduced by reusing the past experiences learned in the same cooperative region. Figure 2 shows illustrations of different Q-learning methods for multi-UAV systems. In the single Q-learning, the action selection is performed in a centralized manner in which the ground BS maintains a global Q-table. On the other hand, in the conventional MARL, each UAV distributedly learns its own policy based on its own Q-table. Unlike the above two independent Q-learning methods, the proposed cooperative MARL allows UAVs to select actions in a coordinated fashion by updating the shared Q-table. Hence, the proposed algorithm is executed on both the ground BS and the UAVs.

B. Algorithm Design
Now we present the proposed cooperative MARL (CMARL) based solution for multi-UAV trajectory design in the aerialterrestrial wireless caching network. Each UAV learns an optimal policy for controlling and optimizing its trajectory from its own experience as well as from other nearby UAVs within the same region.
The whole observed area is divided into N A equal-size cooperative regions. We denote the cooperative region number at which the k th UAV is located as c r k ∈ {1,. . .,N A }, ∀k ∈ K. For  r Uncoordinated state to uncoordinated state: In this case, each UAV performs flight decisions independently according to its own Q-table. The action a k of the k th UAV is only affected by its own reward r k . In fact, this case is equivalent to that in the conventional MARL. Therefore, the update rule is the same as (32).
r Coordinated state to coordinated state: When the UAVs move from a coordinated state to another coordinated state, the optimal action is selected based on the shared Q-table.
The update rule in this case can be written as where Q(s k , a k ) and Q * (s k , a k ) are the old and new Q-values, respectively. Also, r(s k , a k ) represents the obtained shared reward when executing the action a k at the state s k , which is calculated as the average reward over all cooperative UAVs. That is, r(s k , a k ) = a k ), where G k = {j ∈ K|c r j = c r k } denotes the set of UAVs within the cooperative region c r k .
r Coordinated state to uncoordinated state: If the UAV moves from a coordinated state to an uncoordinated state, the new Q-value should take into account the policy from its own Q-table. Specifically, both the immediate reward and the expected discounted reward stored in its own Q-table are incorporated into the update equation, i.e., s, a )] . (35) r Uncoordinated state to coordinated state: Last, in case that the UAV moves from an uncoordinated state to a coordinated state, the update equation incorporates the immediate shared reward and the expected discounted reward stored in the shared Q-table. Thus, we have The details of the proposed CMARL based 3D UAV trajectory design algorithm are summarized in Algorithm 1. Lines 2-6 initialize the positions of UAVs by using the K-means algorithm. In lines 9-19, the Q-table is updated using the aforementioned cooperative mechanism. Lines 23-29 determine the optimal action and the corresponding 3D UAV position.
It is worth pointing out that state discretization is employed by using the grid approach in order to obtain an appropriate size of state space. To deal with the problem induced by large state space, the proposed method can be extended by simply replacing the Q-table with deep neural network [42]. Moreover, in this paper we assume a perfect communication link between the ground BS and the UAVs. However, in realistic environments, UAVs may fly out of the wireless coverage of the ground BS [3]. In this case, more complex Q-value updating mechanisms are required for ensuring the learning stability, which will be our future work.

C. Analysis of the Proposed CMARL Based Algorithm
The feasibility of utilizing on-device RL for UAV flight control has been demonstrated in [23], [25]. Also, reducing power consumption is one of the major challenges in UAV networks. According to [12], the energy cost of the UAV communication link is proportional to the transmission data size. We note that the proposed method requires message exchange between the UAV and the ground BS when updating the shared Q-table. Clearly, the extra transmission overhead for updating the Q-value of a single state-action pair is relatively small compared to the content data transmission and thereby the related energy cost can be neglected.
1) Convergence Analysis: The convergence analysis can be divided into two cases based on the state transition.
r Non-cooperative Case: In this case, the UAV moves from an uncoordinated state to an uncoordinated state. Recall that the agent in this case only considers its own reward during the update of Q-value. Therefore, this case can be regarded as non-cooperative MARL in which multiple agents execute the single Q-learning independently. For the convergence of the non-cooperative MARL, it has been proved in [43] that the non-cooperative MARL Algorithm 1: Proposed CMARL Based 3D UAV Trajectory Design Algorithm. Input: Number of UAVs K; number of cooperative regions N A . Output: UAV's horizontal position w k , UAV altitude h k . 1: Initialize Q k (s, a) = 0 and Q(s k , a k ) = 0, ∀s ∈ S, ∀s k ∈ S, ∀a ∈ A, ∀a k ∈ A. 2: Initialize K centroids at random. 3: Divide all the users into K clusters according to (30). 4: for each UAV agent k do 5: Obtain the initial position of the k th UAV according to (31). 6: end for 7: for each step of episode do 8: for each UAV agent k do 9: Select action a k based on ε-greedy policy. 10: Observe the state s k and receive the reward r k . 11: if Uncoordinated → uncoordinated then 12: Update its own Q- if In a coordinated state then 24: Selectâ k = arg max a k Q(s k , a k ) as the optimal action for a given state s k . 25: Obtain w k and h k fromâ k . 26: else 27: Selectâ = arg max a Q k (s, a) as the optimal action for a given state s.

28:
Obtain w k and h k fromâ. 29: end if 30: end for 31: return w k , h k ; can converge to the optimal policy ∪ K k=1 π * k (s k ) under the following conditions. a 1 , a 2 =a 1 )|∀(k, j ∈ {1, 2}, a), Here, π * k (s k ) andπ k (s k ) refer to the optimal policy from the global perspective and local perspective, respectively. If condition (a) is satisfied, the global optimal action a * k of the k th UAV is equal to the local optimal actionâ k in which each UAV selects the optimal action independently. In addition, let a * be the set of actions including a * k . Condition (b) states that there does not exist another action set a † such that the Q-value with respect to a † is larger than that to a * . Last, from condition (c), we know that there must be at most one global optimal action that maximizes the Q-values of all the UAVs. That is, no coordination between the UAVs is required to achieve an optimal equilibrium. r Cooperative Case: On the other hand, the UAV whose Q-value update involves the shared Q-table is classified as a cooperative case. By regarding UAVs as nodes, the interaction topology of information exchange between agents can be described as a graph, with each edge representing the interaction between two UAVs. Also, each edge is assigned a nonnegative weight indicating the strength of interaction. Let (i, j) denote the edge between node i and node j. It has been proved in [44] that the convergence of the cooperative MARL is guaranteed when the weight of the edge (i, j), denoted by w(i, j), is chosen by the Metropolis criterion [45], which is given by where K denotes the number of neighboring agents; H represents the set of edges; N (i) = {1 ≤ j ≤ K : (i, j) ∈ H} denotes the set of neighboring agents of the i th node; d(i) = |N (i)| is the degree of the i th node of the graph. Recall that, in our proposed CMARL, agents within the same cooperative region can communicate with each other. Hence, the proposed design is a particular case of the Metropolis weights where the communication graph for the agents within a cooperative region is always a fully connected graph with d(i) = d(j) = K − 1 in (37) and (38). 2) Complexity Analysis: For the single Q-learning algorithm, the computational complexity is O(T Q ), where T Q denotes the time required to converge to an optimal solution. Then, for the conventional MARL algorithm, due to having K independent UAVs, the computational complexity is O(KT Q ). In our proposed CMARL algorithm, additional complexity is required for each UAV to handle the shared Q-table. Then, the complexity for each UAV is O(T Q + K c T Q ), where K c denotes the number of UAVs within a cooperative region. Clearly, K c depends on the actual environment and the total number of cooperative regions N A . In the worst case, we have K c = K and thereby the total computational complexity of CMARL is O((K 2 + K)T Q ). However, although the complexity for one UAV is linearly related to the UAV number in the network, it is worth noting that the UAVs are usually too far away from a UAV to affect its flight decision. According to our simulation results, which will be presented later, the optimal network performance can be achieved by selecting an appropriate cooperative region size such that K c ≈ 1. In this case, the complexity of the proposed CMARL is O(KT Q ), which is the same as the conventional MARL algorithm.   II  COMPLEXITY AND EXECUTION TIME OF DIFFERENT RL ALGORITHMS   TABLE III  SIMULATION PARAMETERS users' mobility traces from [46]. The execution time is measured by over 500 iterations. It is observed that the execution time of the proposed CMARL is close to that of the conventional MARL, which is consistent with our complexity analysis.

VI. SIMULATION RESULTS
In this section, the network performances of the proposed CMARL based trajectory design algorithm are evaluated. We use MATLAB to implement the proposed algorithm based on the system model described in Section II. In our simulations, we consider that the ground users are uniformly and independently distributed within an area of size 1000 m × 1000 m. The UAVs can dynamically move in the 3D space with adjustable altitude ranging between 60 and 80 m. The maximum velocities of users and UAVs are set to 1 and 20 m/s, respectively. The discount factor is set as 0.8 due to the relatively fast convergence. We adopt the time-decayed exploration rate ε = 1 Δ [47], where Δ denotes the number of episodes. The default parameter values for simulation are given in Table III.
We compare the proposed CMARL with another three algorithms: r K-means: In this case, the initial positions of UAVs are determined by the K-means clustering algorithm and then remain unchanged in the rest of the time slots.  r RL: After the initialization by the K-means clustering algorithm, the UAV trajectory is determined by a centralized controller executing the single Q-learning algorithm.
r MARL: The third one is that of using the conventional MARL algorithm in which each UAV learns its optimal policy in a distributed and independent manner. Fig. 3 plots an example of the 3D deployment of four UAVs in our simulations. The ground users are divided into four clusters, each of which is associated with one UAV. We can observe that some users are not served by any UAV due to the limited U2D transmission range. Each UAV's state can be represented by a 3D grid point as shown in the figure. Fig. 4 compares the total network throughput for different UAV trajectory design algorithms. In addition to the aforementioned algorithms, we also compare the performance of the pure D2D scheme in which the contents can only be cached at D2D users (without UAV caching). Hence, it is reasonable that the U2D SNR threshold does not affect the throughput performance. For the K-means algorithm, it is observed that there is almost no performance gain when the U2D SNR threshold η UAV is high. The reason is that the static UAV placement using the clustering scheme cannot make adaptations to the real-time location-dependent channel conditions. On the other hand, we can observe that the network throughput can be significantly improved by adopting the RL-based UAV trajectory design algorithms. Furthermore, we observe that different RL-based algorithms achieve a similar performance when the U2D SNR threshold is low. This is due to the fact that the maximum throughput is limited by the available bandwidth of U2D links. On the other hand, when the U2D SNR threshold is high, the performance gain of the RL-based algorithms is not significant since a large portion of users are served by a D2D transmitter or the ground BS. Finally, our proposed CMARL significantly outperforms all the compared algorithms, regardless of the U2D SNR threshold, which demonstrates the benefits of our proposed cooperative UAV trajectory design. For example, when η UAV = 0 dB, the network throughput can be increased by 66.4% compared with the conventional MARL algorithm. As shown in the green dashed line with downward triangles, the pure D2D scheme is observed to achieve the same throughput performance as CMARL when the cache storage capacity of each user is increased to 380 files, which is 9.5 times larger than that of CMARL. Fig. 5 shows the network throughput performance with different numbers of UAVs. All of the cases can provide better network performance as the number of UAVs increases except the pure D2D scheme. It is observed that, when the number of UAVs increases, the distributed schemes (MARL and CMARL) can provide significant performance improvement, whereas the performance improvement is not obvious when the centralized schemes (RL and K-means) is applied. Also, as the UAV number increases, the performance improvement of CMARL is higher compared to that of MARL. This shows the effectiveness of the proposed cooperative mechanism in exchanging UAVs' reward information. Clearly, the performance gains of using UAVs tend to decrease with the increase of the UAV number. Since the coverage of each UAV is limited, there exists a minimum number of UAVs that can provide ubiquitous coverage to the ground users such that the throughput performance is maximized. It is observed that CMARL converges to a stable performance when K = 10. Again, we see that our proposed CMARL can provide the highest throughput among all compared algorithms, regardless of the UAV number. Fig. 6 shows the throughput performance obtained within a given number of iterations. We can see that, as the number of iterations increases, the throughput performance increases until convergence. Furthermore, compared to the MARL algorithm, the proposed CMARL algorithm presents a similar convergence speed while achieving around 50% throughput improvement. This result is consistent with our complexity analysis presented in Section V. Finally, it can be seen that the learning rate of 0.9 used for all the RL-based algorithms outperforms that of 0.8 and 0.99. This can be explained by the fact that a large learning rate (e.g., μ = 0.99) will hinder convergence while a small learning rate (e.g., μ = 0.8) leads to slow convergence. Fig. 7 shows the throughput ratio of each UAV with different UAV numbers. The throughput ratio of each UAV is defined as the throughput per UAV provided relative to the total network throughput. From the figure shown, the throughput ratio decreases as the UAV number increases. This is reasonable, since the number of average connected users per UAV decreases with an increased UAV number. Furthermore, we can see that the throughput ratio of CMARL is larger than that of all other algorithms, which confirms the effectiveness of the proposed scheme for improving the U2D link utilization. We note that although the throughput ratio of each UAV decreases when the UAV number increases, the total network throughput can be improved as shown in Fig. 5. Then, Fig. 8 demonstrates a similar phenomenon for U2D SNR thresholds. For the same reason, the throughput ratio decreases with the increase of U2D  SNR threshold. Again, we observe that CMARL outperforms all other algorithms in terms of throughput ratio. In particular, when η UAV = 0 dB, the throughput ratio can be increased by 36.1% compared with the conventional MARL algorithm.
We present the impact of cooperative region number N A on the throughput performance of CMARL as shown in Fig. 9. We can see that N A = 4 leads to the highest throughput when the number of UAVs K is less than 10. However, CMARL setting N A = 9 yields the best performance when K = 10. This indicates that N A needs to be adjusted according to K for achieving the optimal throughput performance. To further demonstrate this, we plot the throughput performance of each UAV with respect to different N A and K, as shown in Fig. 10. The expected number of UAVs within a cooperative region can be expressed byK c = K N A . In the case ofK c 1, as shown in the upper left corner of Fig. 10, it is difficult for a UAV to find another UAV to cooperate with, since the total number of cooperative regions is much larger than the UAV number. As a consequence, the UAV will tend to update its own Q-table. Essentially, this case can be regarded as the conventional MARL in which all the agents independently learn their own policy. On  the other hand, in the case ofK c ≈ K, as shown in the lower right corner of Fig. 10, UAVs are highly likely to cooperate with each other, since the UAV number is much larger than the total number of cooperative regions. Therefore, the UAVs will tend to select actions according to the shared Q-table, which can be regarded as the single Q-learning where a central Q-table is used to achieve the global optimality. Remarkably, the best performance is achieved whenK c ≈ 1, i.e., N A ≈ K, which is in line with the results shown in Fig. 9. This result actually indicates that, by setting appropriate cooperative region number N A , the proposed CMARL can achieve the optimal balance between the distributed and centralized RL systems. Fig. 11 shows the throughput performance of the proposed CMARL with different cache-related parameters. We can see that the throughput increases with the cache storage capacity of UAV. This is because a larger UAV cache storage capacity will result in a higher UAV cache hit probability. Furthermore, the throughput performance increases as the popularity factor κ increases, which is consistent with previous results reported in [35]. The reason is that, when the content requests are more concentrated, the U2D cache hit probability increases. Also, Fig. 11. Network throughput using the proposed CMARL algorithm with different cache storage capacities. a linear relationship between the throughput performance and the cache storage capacity can be observed when N = 1000. However, when N = 500, a log-linear relationship between the throughput performance and the cache storage capacity is observed. In particular, as the cache storage capacity increases, the throughput performance for different popularity factors κ becomes closer and eventually converges to a constant when N = 500. The reason is that when M w = N , all the contents can be stored on the UAVs, resulting in the maximum UAV cache hit probability.

VII. CONCLUSION
In this paper, we have proposed a multi-UAV trajectory design algorithm based on reinforcement learning (RL) for cacheenabled UAVs to dynamically learn their optimal 3D positions while maximizing the network throughput. A cooperative multiagent RL (CMARL) method has been developed to overcome the inefficiency of independent RL in multi-agent systems. Using the proposed CMARL trajectory design algorithm, each UAV can autonomously decide whether or not the flight decisions should be coordinated with other UAVs. Simulation results have shown that the proposed trajectory design algorithm can improve the network throughput and the UAV-to-Device (U2D) link utilization compared to the conventional RL algorithms. Also, our results have revealed that the proposed cooperative method can achieve the optimal balance between the distributed and centralized RL systems with an appropriate number of cooperative regions. These results can provide important design guidelines for high-throughput aerial-terrestrial wireless networks.