Multiagent Federated Reinforcement Learning for Resource Allocation in UAV-Enabled Internet of Medical Things Networks


Abstract—In the 5G/B5G network paradigms, intelligent medical devices known as the Internet of Medical Things (IoMT) have been used in the healthcare industry to monitor remote users' health status, such as elderly monitoring, injuries, stress, and patients with chronic diseases. Since IoMT devices have limited resources, mobile edge computing (MEC) has been deployed in 5G networks to enable them to offload their tasks to the nearest computational servers for processing. However, when IoMTs are far from network coverage or the computational servers at the terrestrial MEC are overloaded or emergencies occur, these devices cannot access computing services, potentially risking the lives of patients. In this context, unmanned aerial vehicles (UAVs) are considered a prominent aerial connectivity solution for healthcare systems. In this article, we propose a multiagent federated reinforcement learning (MAFRL)-based resource allocation framework for a multi-UAV-enabled healthcare system. We formulate the computation offloading and resource allocation problems as a Markov decision process game in federated learning with multiple participants. Then, we propose an MAFRL algorithm to solve the formulated problem, minimize latency and energy consumption, and ensure the quality of service. Finally, extensive simulation results on a real-world heartbeat data set show that the proposed MAFRL algorithm significantly minimizes the cost, preserves privacy, and improves accuracy compared to the baseline learning algorithms.

Index Terms—Emergency, federated learning (FL), healthcare, Internet of Medical Things (IoMT), multiagent RL (MARL), unmanned aerial vehicle (UAV).

I. INTRODUCTION
Recently, the advent of 5G and beyond 5G (B5G) has emerged as a promising paradigm for the healthcare industry to increase reliability, provide smart services, and reduce the end-to-end (E2E) delay. The B5G network infrastructure enables ultradense Internet of Things (IoT) devices to be deployed in various industries, including healthcare sectors. In the B5G era, intelligent medical devices are expected to be deployed in healthcare systems to monitor chronic, pandemic, and epidemic diseases that exist at different times in the world. The Internet of Medical Things (IoMT) is an emerging technology in the healthcare industry that allows people, smart medical devices, and real-time applications to collaborate and exchange healthcare data via wireless networks [1]. In the healthcare industry, IoMT technology enables the interconnection of personal medical IoT devices and healthcare providers to provide better E2E services (accuracy, speed, and disease prediction), improve quality of life, and reduce cost, thus providing better service to society [2], [3]. The IoMT creates new opportunities for the healthcare industry because of its scalability, genericity, mobility, and flexibility.
Nevertheless, resource limitations in IoMT devices, network congestion, data privacy breaches, and E2E transmission delay are critical issues in IoMT, which affect the E2E communication and data delivery performance in the healthcare system. The emerging mobile edge computing (MEC) paradigm in 5G and B5G empowers the healthcare system by allocating resources to the patients or health monitoring IoT devices in the edge layer. Resource management in the healthcare system is critical to satisfy the Quality of Service (QoS) and save patients' lives. It allows emergency data communication in IoMT to reduce the delay of emergency packet delivery and avoid network congestion [3]. The MEC technologies enable the beyond-wireless body area networks (WBANs) devices to offload their data to the nearest node and get it processed there. With beyond-WBANs (BWBANs), heterogeneous bio-IoT devices are deployed on the human body and generate delay-sensitive medical packets. The MEC system handles delay-sensitive medical packet transmission in BWBANs by categorizing random packet arrivals at each computational node (gateway) into emergent alarms and nonemergent routines [4]. Ning et al. [5] studied the cost minimization of the MEC-enabled 5G health monitoring system for IoMT under two subnetworks: 1) intra-WBANs and 2) BWBANs.
2327-4662 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
However, the conventional MEC network cannot fully meet the healthcare system's requirements, since most IoMT devices are mobile. In the event of an emergency, or when IoMT devices are out of network coverage, patients' lives may be at risk. In this context, unmanned aerial vehicles (UAVs) are a promising technology in 5G and B5G that can support ultrareliable low-latency communication (URLLC), mobility, network coverage enhancement, and public safety communication. UAVs have gained popularity in various areas, such as post-disaster recovery, agriculture, and healthcare [6]. The modern healthcare industry is expected to rely heavily on UAVs to collect and transfer medical data from IoMT devices to base stations (BSs) and transmit medical data to patients and physicians, particularly in areas with emergency scenarios or no physical infrastructure coverage. UAVs can support healthcare systems in providing medical treatments and diagnoses to patients at any time and location. They can also help IoMT devices function properly by sharing communication and computation resources, such as energy, spectrum, computing, and storage [7]. Many previous studies reported that UAV-based healthcare systems could mitigate the challenges of health monitoring and control, reduce the burden on the healthcare system, facilitate the administration of medical vaccines and patient authentication, and automatically disinfect contaminated areas, particularly in cases of pandemics like COVID-19 [8], [9].
Furthermore, researchers have attempted to solve various problems in healthcare systems using machine learning (ML) approaches, such as reinforcement learning (RL), deep RL (DRL), and deep learning. These approaches have been increasingly applied to real-world optimization problems of resource management, computation offloading, localization, and privacy preservation, particularly in wireless communication networks, such as the IoT, MEC, UAVs, and smart healthcare networks [10]. Among these approaches, RL has become an attractive method for constructing optimal dynamic treatment regimes in healthcare to monitor chronic diseases [11]. A multimodal RL algorithm has been used to maximize the battery life of IoT devices through data compression, energy-efficient communication, and latency minimization in medical IoT systems, particularly for emergency cases [12].
However, the aforementioned approaches have limitations, such as high energy consumption, communication cost, and latency when uploading a massive volume of data to the computational server. Furthermore, learning is performed on a central server after offloading all the data, which compromises patients' privacy. In this context, the federated learning (FL) paradigm has gained traction by allowing heterogeneous edge nodes to train models locally while only the model updates are aggregated centrally, thereby protecting data privacy. FL has been used in wireless communications to empower distributed services and address privacy concerns [13]. More specifically, FL has recently been integrated with RL for healthcare applications that rely on IoMT to address these issues. In addition to preserving the privacy of medical data, it builds robust, high-accuracy models and supports decentralization [14].
Nevertheless, there are insufficient research attempts to address the problems of healthcare systems when IoMT devices are out of network coverage, when the computational nodes are overloaded, and/or when the terrestrial network is affected by either artificial or natural disasters. The healthcare system faces challenges in controlling emergencies and saving patients' lives in these cases. Therefore, motivated by the limitations of the current schemes in the literature, we propose a new multiagent federated RL (MAFRL) framework for efficient resource allocation in a multi-UAV-enabled IoMT network to minimize the time delay and energy consumption of medical data processing. The main contributions of this article are summarized as follows.
1) We propose an MAFRL framework for a multi-UAV-enabled IoMT for healthcare systems to ensure the QoS of patients' health monitoring devices, preserve security, and minimize the system cost in terms of latency and energy consumption. Healthcare IoT devices train their own models and send them to the computational node [UAV cluster head (UCH)/BS] to aggregate the global model. To enhance network coverage in suburban and remote areas and manage failure/loss of communication in the healthcare system, we deploy a multi-UAV system that patrols a location, providing different resources for healthcare entities within the UAVs' coverage areas.
2) We formulate a joint computation offloading, Age of Information (AoI), and resource allocation optimization problem and transform it into a Markov decision process (MDP), modeling it as a multiagent RL (MARL) problem. Each computational node (BS, UCH) can make its decision based on its own observations to allocate resources.
3) We develop an MARL-based resource allocation algorithm that incorporates the FL model in multi-UAV-enabled IoMT to solve the formulated optimization problem.
4) We conduct extensive simulations to evaluate the performance of the proposed algorithm against the existing benchmarks, using the heartbeat data set.

The remainder of this article is organized as follows: Section II introduces the related work. Section III presents the proposed system model and the optimization problem. Section IV discusses the proposed MAFRL. The proposed solution is discussed in Section V. Section VI presents the performance evaluation and analysis. Finally, we present our conclusion in Section VII.

II. RELATED WORK
Several research attempts have investigated the optimal computation offloading and resource allocation problem in the healthcare system and UAV-enabled emergency communications [7], [15], [16].
In recent years, MEC has been used in the healthcare industry to assist IoMT, in which the MEC servers or nearest edge nodes allocate resources to compute medical data generated by IoMT devices, ensuring high QoS and saving patients' lives [17]. Ning et al. [5] proposed a potential game-based decentralized approach to minimize the overall system cost of IoMT. They focused on patient costs depending on three metrics: 1) medical criticality; 2) AoI; and 3) energy consumption. AoI is a metric that captures the time elapsed since the last successfully received update packet at a medical IoT device (MID) was generated at its source; in particular, it refers to the freshness of information [18], [19], [20]. AoI is an E2E metric that can characterize latency in status update systems and applications. Furthermore, various ML approaches have been applied to improve the smartness of healthcare services, enhance resource management, and control user data privacy. In particular, MARL has been widely used to empower distributed MEC systems [21]. Besides, FL has been proposed to allow privacy preservation through cooperative model training among geographically dispersed users [13], [22], [23]. It is widely used in smart healthcare to protect patient information, monitor patient health, and empower remote healthcare [24], [25]. FL has also been applied in UAV-enabled networks for optimization problems, such as privacy preservation, deployment, placement, and resource management. Yang et al. [26] introduced an FL-based UAV-enabled network to protect end users' privacy by keeping the data used for training local and exchanging only the model parameters. The authors jointly formulated the device selection, UAV placement, and resource management problems. Then, they applied a multiagent asynchronous advantage actor-critic (A3C) algorithm to enhance the FL convergence speed and efficiency. Elayan et al. [27] presented a deep FL paradigm to monitor and analyze patient data utilizing IoT devices in order to protect medical data privacy and facilitate decentralization. Lim et al. [28] proposed FL-based edge computing to enable privacy-preserving collaborative model training among distributed IoT devices/users to develop smart healthcare applications. They introduced a dynamic smart incentive mechanism to allow the sustainable participation of users in the system. Albaseer et al. [14] studied a fully decentralized FL-enabled double deep Q-network (DDQN) to empower edge nodes. The DDQN was deployed to obtain a stable and sequential clinical treatment policy in the IoT E-health system.
Recent advances in MARL and FL have emerged as powerful solutions to optimization and resource allocation problems in MEC networks. Yu et al. [21], Yang et al. [26], Xu et al. [29], and Zhu et al. [30], [31] proposed MAFRL frameworks for resource allocation, computation offloading, and privacy preservation. Yu et al. [21] proposed a new FL-enabled two-timescale DRL framework to minimize the total delay in data transfer and the use of network resources by jointly optimizing data offloading, resource allocation, and service caching placement. Xu et al. [29] studied an MAFRL framework for secure resource allocation and an incentive mechanism for an intelligent cyber-physical system with heterogeneous devices; the problems of communication, computation, and data resource allocation are formulated as a Stackelberg game. IoT devices generate time-sensitive (age-sensitive) data that are offloaded to the nearest edge node for further processing. To handle the AoI, ensure data freshness, allocate resources, and adjust schedules, a hierarchical FL-based multiagent actor-critic (MAAC) framework was designed in [30] and [31], resulting in improved system performance. The AoI optimization problem is formulated as an MDP and solved by combining edge FL with the MAAC learning approach, where the edge devices and central controllers collaborate and learn from their observations.
To summarize, the approaches in [5], [14], [21], [26], and [27] were proposed for other resource management environments and cannot be directly applied to ensure the QoS of IoMT equipment and resource allocation in a UAV-enabled IoMT infrastructure, where the edge IoMT devices collaborate through FL. Moreover, most of these approaches neglect emergency cases in the air-to-ground (ATG) environment of the healthcare system. This work is the first attempt to utilize MARL with FL for resource allocation in a UAV-enabled IoMT network.

III. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, we first describe a multiagent ATG network comprising communication, computation, and energy models. As shown in Fig. 1, the multiagent ATG network provides reliable resource allocation, computation offloading, and association in healthcare systems. The lower layer consists of smart devices, such as sensors, smartwatches, and other medical IoMT devices, which generate data and monitor patients' health conditions. The middle layer of edge computing nodes includes UCHs, BSs, and intelligent ambulances/vehicles, which provide resources, extend network coverage, maintain network sustainability, and relay data to the software-defined networking (SDN) controller. The SDN controller manages the network infrastructure, resource management, association, and computation offloading.

A. Communication Model
As depicted in Fig. 1, we consider a clustered UAV-enabled ground network that connects multiple heterogeneous smart devices for IoMT in the smart city. The clustered UAV network is equipped with MEC servers and controlled by a UCH to provide resources for multiple heterogeneous MIDs. MIDs are deployed to monitor and diagnose patient health status updates and report to the healthcare center. We assume there are K small cells in a smart city, with N BSs and I MIDs randomly distributed in each small cell; some MIDs are mobile and far from the network coverage. The sets of BSs and MIDs in a particular cell are denoted as N = {1, 2, . . . , N} and I = {1, 2, . . . , I}, respectively. The UAV network is clustered into M clusters, where each cluster consists of J UAVs that fly at a fixed altitude H_j > 0 over a small cell in the city to serve ground MIDs with various applications. The set of clusters and the set of UAVs in each cluster are given as M = {1, 2, . . . , M} and J_m = {1, 2, . . . , J}, respectively. In this work, UAVs are deployed to support the ground network when the BS is overloaded or malfunctioning and to provide reliable emergency communications (e.g., healthcare issues). Despite its robustness, the ground network may not reach remote urban areas. Therefore, UAVs can extend network coverage and stream temporary events. In our scenario, resource-constrained MIDs offload their medical data to either BSs or UAVs for execution through wireless links. The deployed multi-UAV network maintains network coverage, provides resources to MIDs, and relays data to central medical servers. The MEC server at the UAV or ground network processes different operations in each time slot t of the appointed horizon T = {1, 2, . . . , T}.

Without loss of generality, UAVs, BSs, and MIDs are assumed to be in 3-D Cartesian coordinates. In time slot t, the horizontal coordinate of MID i is denoted as z_i(t) = (x_i(t), y_i(t), 0), and v_n(t) = (x_n(t), y_n(t), 0) is the location coordinate of BS n. Also, the position of UCH j projected on the horizontal plane is given by u_j(t) = (x_j(t), y_j(t), H_j). The Euclidean distance between UCH j and MID i at time slot t is expressed as

d_ij(t) = ||u_j(t) − z_i(t)|| = sqrt((x_j(t) − x_i(t))^2 + (y_j(t) − y_i(t))^2 + H_j^2).

For efficient and reliable communication between UCH j and MID i, the ATG communication link can be modeled by path loss with a specific probability in both line of sight (LoS) and non-LoS (NLoS) [32]. The LoS connection probability between UCH j and MID i at time slot t depends on the environment, the angle of elevation, and the altitude and location of both the UCH and the MID [33], and is calculated as

P^LoS_ij(t) = 1 / (1 + ς_1 exp(−ς_2 [θ_ij(t) − ς_1]))

where ς_1 and ς_2 are constants depending on the environment and θ_ij(t) denotes the angle of elevation between UCH j and MID i, given by θ_ij(t) = (180/π) arcsin(H_j / d_ij(t)). The average path loss over the LoS and NLoS connections at time slot t is expressed as

L̄_ij(t) = P^LoS_ij(t) L^LoS_ij(t) + (1 − P^LoS_ij(t)) L^NLoS_ij(t)

where L^LoS_ij(t) and L^NLoS_ij(t) represent the path losses of the LoS and NLoS links between UCH j and MID i, respectively. Based on the above analysis, the LoS and NLoS path losses can be calculated, respectively, as

L^LoS_ij(t) = 20 log_10(4π f_c d_ij(t) / c) + η_LoS
L^NLoS_ij(t) = 20 log_10(4π f_c d_ij(t) / c) + η_NLoS

where η_LoS and η_NLoS denote the excess losses over free-space propagation for the LoS and NLoS connections, respectively. In addition, c is the speed of light and f_c is the carrier frequency. Therefore, substituting the above expressions yields the closed-form expression for the average path loss L̄_ij(t). The communication channel between UCH and MID follows a quasistatic fading model: the channel coefficients are constant within each time slot but may vary across time slots [34], [35]. The channel coefficient between UCH j and MID i at time slot t is denoted as h_ij(t).

It can be expressed as

h_ij(t) = sqrt(α_ij(t)) h̃_ij(t)

where α_ij(t) is the large-scale channel gain and h̃_ij(t) is the small-scale fading. The large-scale gain can be expressed as α_ij(t) = α_0 d_ij^{−β}(t), where α_0 is the average power gain at a 1-m distance and β is the path loss exponent. The small-scale fading contains LoS and NLoS components and follows Rician fading with Rician factor K [34]:

h̃_ij(t) = sqrt(K/(K+1)) h̃^LoS_ij(t) + sqrt(1/(K+1)) h̃^NLoS_ij(t).

Let φ^w_ij(t) ∈ {0, 1}, ∀j ∈ J, ∀i ∈ I, denote the offloading decision variable, where φ^w_ij(t) = 1 implies that MID i decides to offload its computation task to the associated UCH j on subchannel w in time slot t, while φ^w_ij(t) = 0 represents that the task is executed by MID i itself. When MID i decides to offload a computation task to UCH j at time slot t, MID i must be within the coverage radius of UCH j, i.e., φ^w_ij(t) d_ij(t) ≤ r_max, where r_max is the maximum coverage radius of the UCHs. We use a predefined path during swarming to avoid UAV collisions in our scenario. Assume that UAV m flies at a fixed altitude h_m after the launch phase, ∀m. To efficiently utilize the space, the UAVs are clustered into groups such that h_{c1} = h_{c2} if and only if UAV c1 and UAV c2 belong to the same group. Let d_min be the minimum distance required for two UAVs to avoid collisions. To ensure that two UAVs belonging to different groups never collide, the altitude separation between groups must satisfy |h_{c1} − h_{c2}| ≥ d_min. Since UAVs in various groups do not collide, focusing on a single group is acceptable [36]. Let J_i denote the set of UCHs that cover MID i, expressed as J_i = {j ∈ J : d_ij(t) ≤ r_max}. Moreover, we assume that each MID i's computation task can be executed by, and connected to, at most one UCH at time slot t, i.e., Σ_{j=1}^{J} φ^w_ij(t) = 1, ∀i ∈ I.

Since the probability of an LoS link is much higher than that of an NLoS link in multi-UAV-enabled IoMT systems, the achievable data rate during task offloading from MID i to UCH j in the tth time slot, in bits per second (bps), can be expressed as

R_ij(t) = b_ij(t) B log_2(1 + p_ij(t) ||h_ij(t)||^2 / (Σ_{î≠i} p_î(t) ||h_îj(t)||^2 + δ^2))

where b_ij(t) and p_ij(t) represent the fraction of bandwidth (radio resource) allocated between MID i and UCH j and the transmission power of MID i for offloading the task to UCH j at time slot t, respectively. δ^2 is the noise power and Σ_{î≠i} p_î(t) ||h_îj(t)||^2 is the interference from the other MIDs î at time slot t. We consider 0 ≤ b_ij(t) ≤ 1, and the radio resource allocation should fulfill Σ_{i∈I} b_ij(t) ≤ 1, ∀j ∈ J. We suppose that the ATG network adopts OFDMA multiple access [37]. The operational frequency band B is divided into W equal subchannels of bandwidth b = B/W [Hz], which are assigned to the MIDs. Each MID then offloads tasks to only one UCH using the subchannel assigned to it in time slot t [38]. Each UCH can serve up to I MIDs at time slot t. We define each UCH's available subchannels as w ∈ W = {1, . . . , W}.
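To make the link model concrete, the sketch below evaluates the LoS probability, the probability-weighted ATG path loss, and the offloading rate in Python. The environment constants (ς_1 = 9.61, ς_2 = 0.16), the carrier frequency, and the excess-loss values are illustrative assumptions, not the paper's simulation settings.

```python
import math

def los_probability(h_uav, d_2d, s1=9.61, s2=0.16):
    """LoS probability of the ATG link as a logistic function of the
    elevation angle; s1 and s2 play the role of the constants ς_1, ς_2
    (illustrative urban values assumed here)."""
    theta = math.degrees(math.atan2(h_uav, d_2d))  # elevation angle θ_ij in degrees
    return 1.0 / (1.0 + s1 * math.exp(-s2 * (theta - s1)))

def avg_path_loss_db(h_uav, d_2d, fc=2e9, eta_los=1.0, eta_nlos=20.0):
    """Probability-weighted average of the LoS/NLoS path losses (dB):
    free-space loss at the 3-D distance plus the excess losses η_LoS, η_NLoS."""
    c = 3e8
    d_3d = math.hypot(h_uav, d_2d)                        # 3-D UCH-MID distance
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * d_3d / c)
    p_los = los_probability(h_uav, d_2d)
    return p_los * (fspl + eta_los) + (1.0 - p_los) * (fspl + eta_nlos)

def achievable_rate(b_frac, bandwidth_hz, p_tx, gain, interference, noise):
    """Offloading rate R_ij(t) = b_ij * B * log2(1 + SINR), where the SINR
    divides the received power by co-channel interference plus noise."""
    sinr = p_tx * gain / (interference + noise)
    return b_frac * bandwidth_hz * math.log2(1.0 + sinr)
```

As expected, the LoS probability falls as the MID moves away from the UCH (lower elevation angle), and the rate degrades monotonically with interference.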
Each MID has distinct medical data and task sizes during time slot t. The MIDs in the ATG system generate time-dependent patients' medical data and offload them to the UCH or central controller (SDN) for further processing. Each MID i has a computation-intensive task X_i(t) to be executed, denoted by a three-tuple X_i(t) = {D_i(t), C_i(t), T^max_i}, where D_i(t) is the input data size, C_i(t) is the required computational capacity (CPU cycles), and T^max_i is the maximum tolerable time of task X_i(t). To ensure the QoS of the MID, the input task should be finished before its maximum latency; otherwise, the patient's life and health would be at risk.
The MIDs collect/generate different medical health data, such as physiological condition assessments and the health status of patients [i.e., the electrocardiogram (ECG)]. We used the ECG data set [39], [40], which comprises the normal (N), supraventricular ectopic (S), ventricular ectopic (V), fusion (F), and unknown (Q) main classes, each with many subclasses. From these classes, we employed certain groups with five beat classifications: normal beats, atrial premature beats (APBs), left bundle branch block (LBBB), right bundle branch block (RBBB), and premature ventricular contraction (PVC). The ECG data utilized in this scenario have distinct levels of risk, requiring different treatments. For example, LBBB is associated with a higher risk of death, demanding immediate treatment, and this type of data is labeled as high-medical-criticality data. Although RBBB and PVC are not life-threatening, they do raise the risk of death in those who have already experienced heart failure or a heart attack; such data can be labeled as medium-medical-criticality data. On the other hand, normal beats and APBs are not severe; thus, such data can be characterized as low-medical-criticality data. In this scenario, each MID executes these data locally or offloads them to the UCH/BS when it does not have sufficient resources. Each distributed MID collaboratively trains the local FL model to offload its data to the UCH/BS. The MIDs that monitor patients' health status are prioritized based on the severity level of the medical data to be offloaded: the MID with the most critical data offloads first and gets its results processed faster. This article focuses on the medical data collected by MIDs for monitoring patients' health status, prioritized according to their severity or criticality level. As mentioned above, the medical data X_i(t) of MID i are offloaded/transmitted to the MEC server based on their medical criticality and AoI.
1) Medical Emergent Data: The medical data collected by the MIDs have different criticality/seriousness levels, indicating the health severity index of patients from a medical perspective [5]. Any health monitoring data collected/generated by MIDs can be categorized into discrete medical criticality classes [4], expressed as D = {1, 2, . . . , D}. Let the variable κ_id(t) ∈ {−1, 0, 1} denote the medical criticality class used to prioritize the offloading requests of MIDs, where κ_id(t) = −1 denotes low-criticality medical data, κ_id(t) = 0 denotes medium-criticality medical data, and κ_id(t) = 1 indicates high-criticality medical data of class d, ∀d ∈ D. Since patients' health information is sensitive, and both time and criticality-class dependent, examining these factors is critical for health monitoring. Health monitoring data from a class with highly emergent data should always be given priority for execution and offloading over data from a lower emergent/nonemergent class. In this work, we consider a linear form of medical criticality: the medical criticality X̃_i(t) ∈ X_i(t) of the health monitoring data ν_i(t) labeled in class d, ∀d ∈ D, increases linearly with its criticality class κ_id(t). Once the medical criticality of the MIDs' data has been determined, each MID will offload tasks, upload model parameters, and be allocated resources based on its priority. In this scenario, if the same MID generates/collects tasks with different medical criticality classes, the task with the highest class is offloaded/executed first; otherwise, the tasks of MID i are offloaded/executed on a first-come, first-served basis. In this way, we can also improve the QoS satisfaction of MIDs.
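As a minimal sketch, the priority rule above can be implemented with a stable sort over (task, κ) pairs, so that equally critical tasks keep their first-come, first-served arrival order; the tuple layout is a hypothetical one chosen for illustration.

```python
def order_offloading(tasks):
    """Order pending MID tasks so high-criticality data (κ = 1) is
    offloaded/executed first. Python's sorted() is stable, so tasks with
    equal criticality retain arrival (FCFS) order, as the scheme requires.

    tasks: list of (task_id, criticality) with criticality in {-1, 0, 1}.
    """
    return sorted(tasks, key=lambda task: -task[1])
```

For example, `order_offloading([("a", 0), ("b", 1), ("c", -1), ("d", 1)])` schedules the two high-criticality tasks first, in arrival order.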
2) Age of Information: Patient medical information monitored by an MID is time sensitive, and AoI measures the freshness (from generation to arrival at the desired node) of health monitoring information. The MIDs continuously offload the generated/collected information to the nearest computational node/medical server. Therefore, to accurately update patient health information in real time, the MEC system must allocate resources efficiently based on patient health information to maximize efficiency and minimize latency. Moreover, resource allocation needs to be efficient to meet the QoS and AoI requirements of the MIDs. Whenever an MID generates medical information, it is time stamped, and the time stamp is used to manage the AoI. The AoI of packet ν_i(t) can be computed as ψ_i(t) = t − τ_i(t), where ψ_i(t) is the AoI at time slot t and τ_i(t) is the time stamp of the most recently updated medical data from MID i. Upon delivery, the numerical value of ψ_i(t) equals the transmission latency; it indicates the freshness of the medical information.
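A small sketch of this AoI bookkeeping follows. The reset-to-transmission-latency behavior on packet delivery matches the statement that ψ_i(t) equals the transmission latency, while the unit-per-slot growth between deliveries is a common discrete-slot modeling assumption.

```python
def aoi(t_now, tau_last):
    """Instantaneous AoI ψ_i(t) = t − τ_i(t): time elapsed since the most
    recently received update from MID i was generated (time slots)."""
    return t_now - tau_last

class AoITracker:
    """Per-MID AoI that grows by one each slot and resets to the packet's
    transmission latency when a fresh update is delivered."""
    def __init__(self):
        self.age = 0

    def step(self, received=False, latency=1):
        # No delivery this slot: the stored information ages by one slot.
        # Delivery: AoI drops to the latency the fresh packet experienced.
        self.age = latency if received else self.age + 1
        return self.age
```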
3) Local Computation: The MID i can compute its tasks locally using its own resources or offload computation-intensive tasks to the edge server. In the case of φ^w_ij(t) = 0, MID i computes its tasks locally, and when φ^w_ij(t) = 1, it offloads its tasks to the edge server. The edge servers then process the tasks offloaded by the MIDs and return the results to them. When MID i decides to compute its medical data locally, the completion time of X_i(t) is expressed as

T^loc_i(t) = C_i(t) / f_i(t)

where f_i(t) stands for the computation capacity of MID i, which must satisfy 0 ≤ f_i(t) ≤ f^max_i, where f^max_i is the maximum CPU capacity of MID i. The energy consumption during local task execution at MID i is calculated as

E^loc_i(t) = κ̃_i C_i(t) f_i^2(t)

where κ̃_i is the effective capacitance coefficient of the CPU. Therefore, the energy consumption of MID i at time slot t, including local computation and the offloading of medical data, is E_i(t) = E^loc_i(t) + p_i(t) T^tr_ij(t), where p_i(t) is the transmission power of MID i at time slot t, constrained as 0 ≤ p_i(t) ≤ p^max_i, where p^max_i is the maximum power capacity of MID i. Let ω_t and ω_e denote the weighting parameters of latency and energy consumption, respectively. The local computation cost of task X_i(t) of MID i is expressed as

Z^loc_i(t) = ω_t T^loc_i(t) + ω_e E^loc_i(t).
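The local-execution expressions can be combined into a single cost helper; the effective-capacitance energy model is the one stated above, while the CPU figures used below are illustrative.

```python
def local_cost(cycles, f_local, kappa, w_t, w_e):
    """Weighted local cost ω_t·T_loc + ω_e·E_loc of task X_i(t), with
    latency T_loc = C_i/f_i and energy E_loc = κ_i·C_i·f_i² from the
    effective-capacitance CPU model.

    cycles: required CPU cycles C_i(t); f_local: CPU frequency f_i(t) [Hz];
    kappa: effective capacitance coefficient; w_t, w_e: cost weights.
    """
    t_loc = cycles / f_local               # local execution latency [s]
    e_loc = kappa * cycles * f_local ** 2  # local execution energy [J]
    return w_t * t_loc + w_e * e_loc
```

For a 10^9-cycle task on a 1-GHz CPU with κ = 10^-28, the latency is 1 s and the energy 0.1 J, giving an equal-weight cost of 0.55.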

4) MID Medical Data Offloading Model:
When MID i determines to offload tasks to UCH j or BS n based on the current policy and other MIDs' information, the time delay cost is evaluated in three phases: 1) transmission time; 2) execution time; and 3) outcome delay. When MID i decides to offload its medical data or the updated local model to the associated UCH j at time slot t, the transmission time of the data is expressed as

T^tr_ij(t) = D_i(t) / R_ij(t).

All associated MIDs share the computational resource blocks on UCH j at time slot t. The UCH's MEC computing capacity^1 is F_j. The execution time of the offloaded medical data task is calculated as

T^exe_ij(t) = C_i(t) / f_ij(t)

where f_ij(t) is the computational capacity allocated by UCH j to MID i at time slot t. The computational resource constraint of UCH j is expressed as

Σ_{i∈I} f_ij(t) ≤ F^max_j

where F^max_j denotes the maximum available computational resource of UCH j. Since the computation outcome is small, its feedback delay is negligible; as a result, the total time needed to complete the task is expressed as

T_ij(t) = T^tr_ij(t) + T^exe_ij(t).

Hence, the actual task completion time T_ij(t) must be less than or equal to the upper-bound deadline, that is, T_ij(t) ≤ T^max_i. In general, the lower the value of T_ij(t), the higher the QoS satisfaction of the MIDs. Note that tasks with different medical criticality levels have different upper-bound deadlines.
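A sketch of the resulting delay model and the deadline (QoS) check; omitting the outcome-return phase here is an assumption, justified by computation results typically being much smaller than the input data.

```python
def offload_completion_time(data_bits, rate_bps, cycles, f_alloc):
    """T_ij(t): uplink transmission time D_i/R_ij plus edge execution time
    C_i/f_ij on the UCH's MEC server (result-return delay neglected)."""
    return data_bits / rate_bps + cycles / f_alloc

def meets_deadline(data_bits, rate_bps, cycles, f_alloc, t_max):
    """QoS check: the actual completion time must not exceed the task's
    criticality-dependent upper-bound deadline T_max."""
    return offload_completion_time(data_bits, rate_bps, cycles, f_alloc) <= t_max
```

For instance, a 1-Mbit task at 1 Mb/s needing 10^9 cycles on a 1-GHz allocation completes in 2 s, which satisfies a 2-s deadline but violates a 1.5-s one.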

B. Energy Consumption
This section discusses the energy consumption when MID i offloads a task to UCH j in various phases, including transmission, execution, flying, and hovering. First, we assume each MID i and UCH j adopts discrete transmit power control. We adopt the power transmission model between MID i and UCH j following [37].
At time slot t, the transmission energy that UCH j consumes to transmit the MID i task is expressed as follows: where p_ij(t) denotes the transmission power allocated by UCH j to MID i. Likewise, the energy consumption of UCH j to execute the task offloaded from MID i at time slot t is expressed as follows: where κ_j represents the CPU-dependent effective capacitance coefficient of UCH j. Moreover, the transmission power of UCH j at time slot t must satisfy the following constraint:
¹The UAVs can compute offloaded tasks assigned by the UCH in the clustered network.
where P_j^max represents the maximum transmission power of UCH j. Accordingly, the total energy consumption of UCH j to complete the medical tasks of MID i is expressed as follows: The energy consumption of UCH j at time slot t is calculated using the energy consumption of flying, hovering, and execution [37]. The first goal of this research is to optimize the offloading and resource allocation decisions to efficiently allocate resources to MIDs while minimizing the MIDs' energy consumption and latency. However, this could result in an unfair process because one UCH may serve more MIDs than others. To address this unfairness issue, we use the fairness level among the UCHs that serve MIDs and among the MIDs themselves to regulate the fairness of UCH coverage, as in [37].
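The fairness regulation above follows [37]; one common choice for such a metric (assumed here purely for illustration) is Jain's fairness index over the number of MIDs served per UCH:

```python
def jain_fairness(loads):
    """Jain's index: 1.0 when all UCHs serve equal loads, 1/n when a
    single UCH serves everything."""
    n = len(loads)
    total = sum(loads)
    if total == 0:
        return 1.0  # no load at all: treat as perfectly fair
    return total * total / (n * sum(x * x for x in loads))
```

A scheduler can then penalize allocations whose index falls below a target threshold, pushing MIDs toward less-loaded UCHs.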

C. Problem Formulation
In this work, the main objective is to optimize the resource allocation and computation offloading to minimize the latency and energy consumption while ensuring privacy and minimizing training costs in the ATG network environment at time slot t. Each MID can generate/collect health information from patients and ordinary users (e.g., athletes) in different network coverage areas. The overall energy consumption and latency of MID i to compute tasks at time slot t is expressed as follows: where ω_t and ω_e denote the weights of latency and energy consumption, respectively, and ω_e + ω_t = 1. Therefore, the optimization problem is expressed as follows: where A = {φ_ij^w(t)}_{i∈I,j∈J}, F = {f_ij(t)}_{i∈I,j∈J}, P = {p_ij(t)}_{i∈I,j∈J}, and B = {b_ij(t)}_{i∈I,j∈J}.
The objective function (19a), as computed in (18), is the sum of the normalized values of the power consumption and latency of MID i when computing medical data using the resources allocated from UCH j at time slot t. Constraint (19b) represents the binary offloading/association and MC class indicator of the task. Constraint (19c) denotes that MID i either computes a medical task locally or offloads the task to one associated UCH j at time slot t. Constraint (19d) denotes the bandwidth resource. Constraints (19e) and (19f) state that the bandwidth fraction allocated to MID i should be less than or equal to 1 and that the total radio resource/bandwidth allocated should be less than or equal to the maximum bandwidth B in the system, respectively. To control the task computation latency of the MID, constraint (19g) states that the actual transmission latency of each class of data must be less than or equal to its maximum tolerable latency, including transmission and execution latency. Furthermore, constraints (19h) and (19i) determine the maximum amounts of computation and power resources that can be allocated to MID i from UCH j at time slot t. At time slot t, the UCH's energy consumption must be less than or equal to its maximum energy budget, as determined by constraint (19j). Constraints (19k) and (19l) represent the fairness levels of UCH coverage and of the MIDs. Finally, constraint (19m) states that the AoI of MID i's data at time slot t cannot exceed a specified threshold ψ_max, which depends on the resource allocation.
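A feasibility check over the resource budgets can be sketched as below; the variable names and the constraint grouping are our simplification of a subset of (19d)-(19i), not the full formulation:

```python
def feasible(bandwidth, compute, power, B_max, F_max, P_max):
    """Check a candidate allocation for one UCH against its budgets:
    total bandwidth <= B_max, total compute <= F_max, and each MID's
    transmit power <= P_max (a simplified subset of (19d)-(19i))."""
    return (sum(bandwidth) <= B_max
            and sum(compute) <= F_max
            and all(p <= P_max for p in power))
```

Such a predicate can be used to mask out invalid actions before an agent's action is executed in the environment.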
The optimization problem (P1) in (19) is a mixed-integer nonlinear programming (MINLP) problem, wherein the computation offloading indicator A is a binary variable, while the bandwidth B, transmission power P, and computation resource F allocation ratios are real positive numbers. Besides, problem (P1) is NP-hard due to the nonconvexity of the objective function and the binary decision variables. To handle this, problem (P1) is decomposed into two subproblems [41]. The first subproblem relates to the offloading and bandwidth allocation of MIDs, whereas the second concerns computation and power allocation. The problem cannot be solved directly in a dynamic network environment: in the ATG network, the numbers of healthcare users/MIDs, ABSs, and BSs increase the problem complexity over time. For these reasons, ML approaches, particularly DRL, are popular and efficient methods for finding an optimal policy under the curse of dimensionality in a complex dynamic system. Therefore, we exploit MARL to tackle the challenges of the optimization problem (P1) in this article.

A. Hierarchical FL Model
In this section, we present the basics of multiagent FRL in multi-UAV-enabled IoMT networks. The learning model has a global FL model and a local FL model with global and local parameters. Due to the multilayer and heterogeneous nature of the ATG network infrastructure, we utilize a hierarchical FL approach [42], [43]. DRL helps control the resource allocation aspects to ensure optimal energy and latency of this hierarchical FL system in the dynamic ATG network.
We assume that each MID has its own private data set (i.e., sensitive medical data) that it wants to offload to its associated UCH/BS at time slot t in order to get it computed within deadlines while minimizing costs. To preserve the privacy of these medical data, each MID is required to participate in the FL model training with good-quality model updates or high levels of accuracy. In our scenario, we utilize hierarchical FL, in which MIDs perform local FL model training on their local raw data sets in a private manner without exchanging personal data with other MIDs. The MIDs download the FL model parameters from the UCH/BS, train the local FL models on their training data sets with the help of DRL, and upload them to the UCH/BS through the assigned bandwidth/channel. The MEC servers on the UCHs/BSs serve as cluster-level aggregators; they collect the local FL model parameters from the connected MIDs, aggregate them, and upload the aggregated model parameters to the SDN for global model aggregation. Finally, the aggregated global model parameters are sent back to the UCHs/BSs and MIDs for the next round of training. The UAVs/BSs allocate resources, execute the tasks received from the MIDs, and send the results back to them.
The main objective of FL is to minimize the overall loss function with respect to the local data of the MIDs. Let Q_j and Q_i represent the global parameters of the global FL model of UCH j and the local parameters of the local FL model of the ith MID. Each associated ith MID owns a data set D_i = {d_1, d_2, ..., d_{D_i}} of size D_i, and the overall data size is D = Σ_{i=1}^{I} D_i. The data can be generated by a MID that monitors patients in the healthcare system, and each MID trains on its data locally with the local model using stochastic gradient descent (SGD) [44], [45]. Without loss of generality, for each input data sample n at the ith MID associated with the jth UCH, the loss function g_n(Q_j, c_{i,n}, ĉ_{i,n}) determines the FL error over the input vector c_{i,n} on the learning model Q and the scalar output ĉ_{i,n}. The overall loss function on the data set of the ith MID is expressed as follows: The average global loss with respect to the local data sets is expressed as follows [44], [45]: The goal of FL is to optimize the global loss function G(Q_j) by finding the minimal weighted average of the local losses G_i(Q_j) [44]. In our scenario, the local model is the DRL-trained model on the MIDs and UCHs, and the global model is the aggregated model on the UCHs and SDN. The FL training process has the following phases.
1) Broadcasting the Global Model: The SDN broadcasts the global model to the active FL entities (i.e., BS, UCH, and MID). In this phase, the SDN first broadcasts the global model to the UCHs; second, each UCH transmits the received global model to its associated MIDs at time slot t.
2) Local Model Training and Updating: At each time slot t, MID i trains a local FL model with its own parameters Q_i(t) on its data set D_i by using SGD, Q_i(t) = Q_i(t − 1) − η∇G_i(Q_i(t − 1)), where η is the learning step. Then, the MID uploads its locally trained model parameters Q_i(t) to UCH j for local FL model aggregation. UCH j then aggregates the local FL model parameters and trains its own parameters Q_j(t) on its data set D_j by using SGD, Q_j(t) = Q_j(t − 1) − η∇G_j(Q_j(t − 1)), where η is the learning step. As shown in Fig. 2, the UCHs/BSs receive the individual models from the MIDs, aggregate them, and upload the aggregated model to the SDN. The SDN then receives the individual models from the BSs and UCHs and performs the global aggregation used to manage the allocated resources in this work.
3) Global Aggregation: The SDN aggregator receives the FL model parameters from the UCHs and executes the global model aggregation by averaging and updating the global model parameters Q(t) as follows:
Hence, the global model aggregated on the UCH is expressed as follows: Because the radio resource is limited, the federated averaging (FedAvg) algorithm [46] is adopted for model aggregation.
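The local SGD update of phase 2, the weighted global loss, and the size-weighted FedAvg-style aggregation can be sketched as follows; the toy quadratic loss used in the example below is our own illustrative assumption:

```python
import numpy as np

def sgd_step(q, grad, eta):
    """One local update: Q(t) = Q(t-1) - eta * grad G(Q(t-1))."""
    return q - eta * grad(q)

def global_loss(local_losses, data_sizes):
    """Weighted average global loss G(Q) = sum_i (D_i / D) * G_i(Q)."""
    D = sum(data_sizes)
    return sum(d / D * g for g, d in zip(local_losses, data_sizes))

def fedavg(params, data_sizes):
    """Data-size-weighted aggregation of local model parameters."""
    D = sum(data_sizes)
    return sum((d / D) * p for p, d in zip(params, data_sizes))
```

For instance, one SGD step from q = 0 on the toy loss G(q) = 0.5(q − 3)² with η = 0.1 moves the parameter to 0.3, and FedAvg weights each client's parameters by its share of the total data.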

V. PROPOSED SOLUTION
The optimization problem described in (19) is difficult to handle due to its NP-hard and nonconvex nature. Besides, ATG network environments are characterized by the high mobility and dynamism of network entities, which leads to the curse of dimensionality and higher optimization complexity. The problem is complex and time consuming to address with traditional optimization techniques [47]. Model-free RL is a well-known optimization approach for many problems in dynamic contexts; it can deal with the decision-making problem by using a dynamic programming approach to learn an optimized policy [48]. Thus, integrating model-free RL with FL can enhance scalability and patients' data privacy at different layers and minimize training time and communication overhead in multi-UAV-enabled IoMT network environments. Furthermore, several works have adapted the FL model to the ATG network environment [49], [50], where FL is used to update the parameters between the ABSs/UAVs and edge IoT devices, but these works face different challenges. First, when MIDs upload locally trained models to ground MEC servers and/or UAV servers, the UAVs' resource consumption varies across UCHs, resulting in delays when updating the global model. Second, the MIDs generate/collect more delay-sensitive (emergent) data in the healthcare system than in other systems; the patients are mobile and have different resource demands, and the privacy issue is critical. Therefore, to address the above difficulties and handle the agents' actions in continuous space, we propose an MAFRL framework. In this framework, the UCHs and MIDs are agents that can observe the environment and take actions. MIDs are associated with the nearest computational nodes (i.e., UCHs and BSs), and resources are allocated from the associated UCHs/BSs while considering the fairness of the UCHs, the emergent data of the MIDs, and their AoI. The proposed framework has the following merits. Overall, the proposed framework in multi-UAV-enabled IoMT is used to minimize the medical data processing latency and energy consumption on the different computational nodes to ensure the QoS of the MIDs and to rescue emergencies in the healthcare system. We adopt a deep deterministic policy gradient (DDPG) algorithm to solve this problem [51].

A. MADDPG Algorithm
To handle the complexity of the optimization problem, the objective function shown in (19) is transformed into an MDP. The MDP model is a sequential decision-making process [52] defined by the four-tuple ⟨S, A, P, R⟩, where S is the set of states, A is the action space, P is the state transition function, and R is the reward function. Each agent tries to maximize its expected reward. We use model-free RL with the FL technique to handle the association between MIDs and UCHs, the computation offloading, and the resource allocation problems given in (19). The agents interact with the network environment to constantly update their policies based on their observations. The state, action, state transition, and reward function are defined as follows.
1) State Space: The state of an agent at time slot t, denoted as s_ij(t) ∈ S, is composed of the criticality level of the MID data (emergent or not) κ_id(t), the maximum tolerable latency of the task T_i^max, the connection status/channel strength between MID and UCH ζ_ij(t), the UCH coverage υ_j(t), which depends on the fairness level and resource capacity, and the available resource blocks of the UCH ϑ_j(t) at time slot t, which is expressed as follows:
2) Action Space: Each agent selects an appropriate action a_ij(t) ∈ A determined by the observed state s_ij(t) and the current policy π. The agents select the computational node φ_ij(t), bandwidth resource b_ij(t), computation resource f_ij(t), and transmission power p_ij(t) based on the priorities set in the MC classes, which is expressed as follows: The joint action of all agents is denoted a(t) = {a_ij(t)}_{i∈I,j∈J}.
3) Reward Function: The target of the agents in the healthcare system is to maximize the long-term reward while decreasing the delay and energy consumption. The agents receive rewards based on the state transition probability; the reward function is defined as follows: From (19) and (29), ω_e and ω_t are the weighting parameters of energy and latency, respectively. The overall reward in the system is expressed as follows: In this scenario, the optimization problem is complex and multiobjective and involves a large state space and a continuous action space, making it difficult to solve using a single-agent RL algorithm [53]. Therefore, to address this optimization problem, we apply the multiagent DDPG algorithm, which is capable of dealing with both continuous action spaces and mixed cooperative-competitive environments [53]. In general, the MADDPG learning framework is an extension of the DDPG learning framework that combines DQN and the actor-critic algorithm under centralized training with decentralized execution to produce a hybrid learning framework. There are target and evaluation networks for both the actor and critic networks. Using the policy gradient approach, the actor network generates the agent's action at time slot t. Then, this action is evaluated by the critic network (Q-value function). We define π_l = {π_1, ..., π_L} and θ_l = {θ_1, ..., θ_L} as the sets of policies and parameters of the agents, respectively. The policy gradient of agent l is expressed as follows: where M is the experience replay buffer, which stores the transition tuples. The critic network is updated by minimizing the loss function, which is expressed as follows: where y = r_l + γ Q^π_l(s′, a_1, ..., a_L)|_{a_l = π_l(s_l)} and γ ∈ [0, 1] is the discount factor. The actor network is updated by minimizing the agent l's policy gradient, where H and k denote the size of the mini-batch and the index of the samples.
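The critic target y = r_l + γQ′ and the mean-squared critic loss minimized over a mini-batch can be sketched numerically; the networks themselves are abstracted away here, with arrays standing in for batched Q-values:

```python
import numpy as np

def critic_targets(rewards, next_q, gamma):
    """Bootstrapped targets y = r + gamma * Q'(s', a') from the target nets."""
    return rewards + gamma * next_q

def critic_loss(q_values, targets):
    """Mean-squared Bellman error minimized when updating the critic."""
    return float(np.mean((targets - q_values) ** 2))
```

For a batch with rewards [1, 2], target-network values [10, 0], and γ = 0.9, the targets are [10, 2]; a critic currently predicting [9, 2] incurs a loss of 0.5.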
Algorithm 1 MADRL-Based Resource Allocation
1: Initialize: Available resources of UCH and MID, MIDs' data types, computing, transmission power, and bandwidth of UCH/SDN.
2: Initialize: The weights of the actor and critic networks with random parameters θ, a random process N for action exploration, and the size of each agent's replay memory buffer.
3: for episode 1 to V do /* V maximum episodes */
4:  Initial state space of each agent S = {S_1, ..., S_L}
5:  for each iteration t = 1, 2, ..., T do
6:   Each agent receives initial state s_l(t)
7:   Each agent l selects action a_l(t) = π_l(s_l(t)) + N
8:   All agents execute action a(t) = {a_1, ..., a_L}, receive reward r(t), and obtain new state s_l(t + 1) ∼ s′_l
9:   Store {s_l(t), a_l(t), r(t), s′_l} into its replay memory
10:  Each agent uploads the tuples from its replay memory to the higher/upper agent's replay memory
11:  Merge lower-agent tuples into the higher-agent replay memory
12:  Download tuples from the higher agent to the lower agent
13:  s_l(t) ← s′_l
14:  for agent l = 1 to L do
15:   Randomly select a mini-batch of transitions
In multi-UAV-enabled IoMT networks, a multiagent DRL algorithm is employed for MID association, computation offloading, and resource allocation. It consists of two procedures: 1) data collection and 2) training. We begin by initializing the available resources at the UCH and MID, the UCH's resource blocks (i.e., bandwidth, computation, and transmission power), the actor and critic network parameters with random weights θ, and the replay memory buffer (lines 1 and 2). The agents collect data based on observations (lines 3 to 14). Following the completion of Algorithm 1, the FL framework algorithm proceeds as follows. 1) The SDN server broadcasts the global model parameters Q(t − 1) to the UCHs/BSs, and the UCH FL model parameters are set as Q_j(t) = Q(t − 1). 2) Each UCH j updates its FL model parameters using SGD according to (24).
3) UCH j broadcasts the FL model parameters Q_j(t − 1) to the associated MIDs i. 4) Each MID i associated with UCH j updates its local model parameters frequently by the gradient of the loss function G_i(Q_j(t − 1)); at each iteration t, the FL local model parameters Q_i(t) of MID i are calculated using (23). 5) The MID i associated with UCH j uploads its updated local parameters to UCH j. 6) Each UCH j calculates the aggregation of the uploaded FL model parameters of the MIDs and its own FL model parameters Q_j(t) according to (26). Then, UCH j can broadcast the updated model and uploads the FL model parameters to the SDN. 7) The SDN calculates the aggregation of the uploaded model parameters of the UCHs using (25) and then broadcasts the updated model to the UCHs.
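One round of the hierarchical MID → UCH → SDN aggregation described in steps 1)-7) can be sketched on a toy scalar model; the quadratic local losses and the equal-weight averaging at each level are our illustrative simplifications:

```python
def hierarchical_round(global_q, clusters, eta):
    """One FL round: each 'MID' takes one SGD step on a toy local loss
    0.5*(q - target)^2, each 'UCH' averages its MIDs' models, and the
    'SDN' averages the UCH models into the new global model."""
    uch_models = []
    for targets in clusters:                  # one list of MID targets per UCH
        local_models = [global_q - eta * (global_q - t) for t in targets]
        uch_models.append(sum(local_models) / len(local_models))
    return sum(uch_models) / len(uch_models)  # SDN-level aggregation
```

With η = 1 each MID jumps straight to its local optimum, so a network with clusters [[2, 4], [6]] aggregates to cluster means 3 and 6 and a global model of 4.5.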

B. Complexity Analysis
We now analyze the complexity of the MAFRL algorithm, in which each FL entity (i.e., MID and UCH) maintains its own policy and independently decides which action to select. The input and output dimensions are determined by the dimensions of the observation space and the action space. Let Z and M represent the hidden layers and the output dimensions, respectively. Then, the complexity of each actor is O(|M|² Z). The computational complexity between agents is O(|M| · J · I²), and the final policy issuing at each time slot is estimated as O(|M|² · Z). Increasing the number of agents in the local and global FL models does not influence an individual agent's computational complexity.
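As a rough sanity check on the per-actor cost, the parameter count of a fully connected actor grows with the squared layer width, consistent with the O(|M|² Z) estimate. The dimensions below are illustrative (the hidden sizes 128 and 256 match the simulation setup; the input/output sizes are our assumption):

```python
def mlp_param_count(in_dim, hidden_dims, out_dim):
    """Number of weights + biases in a fully connected network."""
    dims = [in_dim] + list(hidden_dims) + [out_dim]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

# Hypothetical 4-dimensional state, 3-dimensional action.
n_params = mlp_param_count(4, [128, 256], 3)
```

The 128 × 256 inner layer dominates the count, which is why widening the hidden layers is far more expensive than adding agents.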

VI. PERFORMANCE EVALUATION
In this part, we evaluate the performance of the proposed MAFRL algorithm in multi-UAV-enabled IoMT networks with different parameter settings.

A. Simulation Setup
The simulations are conducted using the Python 3.8 environment, PyTorch, and TensorFlow 2.1.0 on a Dell laptop equipped with an Intel Core i9-11950H CPU @ 2.60 GHz (16 CPUs), 32-GB RAM, and a 16-GB NVIDIA T600 GPU running 64-bit Microsoft Windows.
The deployment and parameter configuration of the multi-UAV and IoMT networks mainly follow the work in [37]. The UAV networks are deployed in smart cities to serve smart healthcare centers with coverage radius r_j = 800 m, where the MIDs are distributed over a 1.0 km × 1.0 km communication range. The MIDs are randomly distributed in the IoMT networks. One UAV cluster serves a maximum of 100 MIDs at a time slot t, and the UAVs fly at a fixed altitude H_j = 100 m. The subchannel bandwidth B/W is 80 kHz. For the probabilistic channel model, ς_1 = 9.61, ς_2 = 0.16, f = 2 GHz, η_LoS = 1, and η_NLoS = 20.
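The parameters ς_1, ς_2, η_LoS, and η_NLoS are the usual constants of the probabilistic air-to-ground LoS model (e.g., Al-Hourani et al.); assuming that model, which the paper does not write out explicitly, the LoS probability as a function of the elevation angle can be sketched as:

```python
import math

def p_los(theta_deg, s1=9.61, s2=0.16):
    """LoS probability vs. elevation angle theta (degrees), using the
    sigmoid air-to-ground model commonly paired with these constants.
    The functional form is our assumption; the text lists only the
    parameter values."""
    return 1.0 / (1.0 + s1 * math.exp(-s2 * (theta_deg - s1)))
```

The probability increases monotonically with the elevation angle, so MIDs directly under a UAV enjoy near-certain LoS while distant ones mostly see NLoS attenuation η_NLoS.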
Each UCH has 25-dBm transmission power, 15-GHz/s computation capacity, and a channel bandwidth of 50 MHz. The size of the medical data of the MIDs is distributed in [100, 12000] kBps, with the required CPU cycles distributed in [0.5, 1.5] Gcycles.
In this proposed framework, we employ fully connected neural networks for the critic and actor. For each agent, we deploy two hidden layers in both the actor and critic networks, set to 128 and 256 neurons, respectively. We set the size of the mini-batch to 256 and the replay memory buffer to 10^5. We use the ReLU and sigmoid activation functions for the hidden layers and the output layer, respectively, and the Adam optimizer for the RL loss function. The learning framework is constructed from the UCHs, MIDs, and SDN.
From the simulation results, we can see that the proposed MAFRL algorithm performs better than MADDPG and the other algorithms. This is because the proposed MAFRL enables the agents to learn cooperative policies and reach the optimal policies at different layers by sharing and updating their models. As a result, the cost of communication latency and energy consumption is reduced, and the average system reward is higher than with the other algorithms. Moreover, we observe in Fig. 3 that none of the algorithms converges and becomes stable before 200 episodes. The proposed MAFRL algorithm converges and becomes stable after 200 episodes, while the MADDPG, DDPG, and greedy algorithms converge after around 300, 450, and 650 episodes, respectively. The reason is that the DDPG algorithm learns a noncooperative policy and mainly focuses on optimizing its own policy; therefore, the computational costs of the DDPG and greedy algorithms are higher than those of the multiagent algorithms. The proposed MAFRL algorithm obtains the optimal offloading/association and resource allocation policy, resulting in a higher reward value and accuracy, ensuring medical data privacy, and minimizing costs compared to the three baseline algorithms. The proposed MAFRL algorithm improves the system reward by 6.89%, 9.68%, and 19.35% compared with the MADDPG, DDPG, and greedy algorithms, respectively. The simulation results show that the MAFRL, MADDPG, and DDPG algorithms achieve better accuracy than the greedy algorithm. However, the convergence speeds and accuracy rates of the baselines are lower than those of the MAFRL algorithm. It can be seen that, after 50 communication rounds, the accuracy rate of the proposed MAFRL algorithm is greater than 98%. Therefore, the proposed algorithm outperforms the other algorithms.
As shown in Fig. 5(a), we evaluate the system cost with respect to an increasing number of learning episodes. The system cost of all compared algorithms is higher at the beginning due to the limited learning experience in the high-dimensional state and action spaces, but it gradually decreases as the number of learning episodes increases. The proposed MAFRL algorithm has a lower system cost than the baseline algorithms, which is a significant advantage in minimizing the latency and energy consumption of the local MIDs and the UAVs. Although the MADDPG algorithm has a larger system cost than the MAFRL algorithm, its cost is lower than those of the DDPG and greedy algorithms. The greedy algorithm's system cost is the worst of the four algorithms due to the lack of cooperation among agents and its inability to handle the continuous action space in dynamic multi-UAV-enabled IoMT networks. The FL model in the proposed MAFRL algorithm synchronizes the local and global models with low communication and energy consumption, allowing it to outperform the baseline algorithms in terms of system cost minimization. Fig. 5(b) shows the system cost with respect to the clients' or MIDs' data size. We can observe that increasing the MID data size increases the system cost in all algorithms. However, the proposed algorithm reduces the system cost by 16.33%, 25.12%, and 35.17% compared with the MADDPG, DDPG, and greedy algorithms, respectively. This implies that the proposed MAFRL algorithm can minimize the computational latency and energy consumption. Fig. 5(c) shows the system cost versus the number of participating/associated MIDs. As the number of MIDs increases, the system cost of all algorithms gradually increases. However, the MAFRL and MADDPG algorithms can make better decisions than the DDPG and greedy algorithms. The numbers of MIDs and edge servers are not equivalent; due to this, there is a scarcity of resources to compute all computational tasks simultaneously. Hence, the MAFRL and MADDPG algorithms can minimize the system cost more than the DDPG and greedy algorithms. We can observe that the proposed MAFRL algorithm outperforms all the baseline algorithms, reducing the system cost by 32.4%, 61.5%, and 68.7% compared with the MADDPG, DDPG, and greedy algorithms, respectively.
As shown in Fig. 6(a), the overall computation latency of all algorithms decreases as the allocated CPU cycles of the edge node increase. When the allocated resource, i.e., the CPU cycles per task, increases, the MID task can be processed quickly, and the task's waiting time at the edge server can be reduced; time-sensitive tasks or data can then get better priority. The communication latency of the DDPG and greedy algorithms decreases more slowly than that of the proposed MAFRL and MADDPG algorithms because of their higher communication overhead. Even though the overall communication latency of the MADDPG algorithm is lower than that of the DDPG and greedy algorithms, the proposed MAFRL algorithm performs better than all the baselines at different CPU cycles.
Fig. 6(b) shows the total energy consumption versus the CPU cycles of all tasks computed by the edge nodes. We observe that the energy consumption increases for all algorithms as the allocated computation resources increase. The MIDs generate time-sensitive data and tasks, which require more computational resources. The MAFRL and MADDPG algorithms consume less power than the DDPG and greedy algorithms. Generally, the proposed algorithm reduces energy consumption by 56.84%, 68.45%, and 73.63% compared with the MADDPG, DDPG, and greedy algorithms, respectively.
The AoI is one of the primary metrics for time-sensitive task processing in the healthcare system. Fig. 7 shows the impacts of the MIDs and CPU cycles on the AoI. Fig. 7(a) shows the average AoI of the MID tasks versus the number of MIDs for the proposed MAFRL and baseline algorithms. As the number of MIDs increases, the average AoI increases for all algorithms. This implies that many MIDs request more resources and offload tasks frequently, while the edge server (aerial MEC server) cannot allocate resources and compute tasks simultaneously; the resulting higher waiting time increases the average AoI. Besides, the proposed MAFRL algorithm achieves a lower average AoI than the baseline algorithms. Furthermore, the DDPG and greedy algorithms sharply increase the average AoI with an increasing number of MIDs. In the proposed MAFRL and MADDPG algorithms, the agents cooperate to minimize the computational cost or maximize the rewards; due to this, the average AoI is also reduced more than with the DDPG and greedy algorithms, which are noncooperative and cannot efficiently minimize computational costs. Generally, the proposed MAFRL algorithm can reduce the average AoI by up to 29.5%, 39.5%, and 46.4% compared with the MADDPG, DDPG, and greedy algorithms, respectively. Fig. 7(b) shows the impact of the CPU cycles on the AoI. We can observe that the AoI decreases as the CPU cycles increase in all algorithms. This indicates that the MID tasks can be computed more frequently when the CPU cycles allocated from the edge nodes increase. The MADDPG algorithm has a lower AoI than the DDPG and greedy algorithms; however, it is higher than that of the proposed MAFRL algorithm.
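The AoI dynamics behind Fig. 7 can be sketched with a simple slotted model; the simplification (age grows by one per slot and resets to one when an update is delivered) is ours:

```python
def aoi_trace(delivery_slots, horizon):
    """Per-slot Age of Information: +1 each slot, reset to 1 on delivery."""
    age, trace = 0, []
    for t in range(horizon):
        age = 1 if t in delivery_slots else age + 1
        trace.append(age)
    return trace

# Sparser deliveries -> longer sawtooth ramps -> higher average AoI,
# mirroring the effect of resource scarcity as the number of MIDs grows.
avg_aoi = sum(aoi_trace({2, 5}, 7)) / 7
```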
Fig. 8 depicts the system performance in terms of communication latency and energy consumption with increasing bandwidth allocation. From Fig. 8(a), we observe that the communication latency of the MIDs decreases gradually in all algorithms as the allocated bandwidth increases. Bandwidth resources significantly reduce the communication latency of time-sensitive applications. Fig. 8(b) shows that the overall energy consumption in the proposed scenario decreases gradually in all algorithms as the bandwidth increases. We observe that the impact of bandwidth resources on energy consumption is less significant. The energy consumption of the proposed MAFRL algorithm is lower than that of the baseline algorithms. Generally, bandwidth resources impact communication latency more than energy consumption. Therefore, the proposed MAFRL algorithm can reduce the overall system cost compared to the baseline algorithms.
Fig. 9 demonstrates that, as the number of MIDs increases, the task processing latency increases for all medical criticality levels. This is because, when the number of MIDs increases, the network may experience increased data traffic and processing demands, potentially leading to task completion delays. The figure also shows that tasks with varying latency requirements are served according to their priority, ensuring QoS while minimizing cost, because our proposed framework prioritizes task processing based on medical criticality.
The simulation results presented in this work show that the proposed MAFRL algorithm outperforms the baseline algorithms in terms of communication latency, energy consumption, system cost, accuracy, and system reward in the configured scenario.

VII. CONCLUSION
In this article, we proposed an MAFRL framework for resource allocation and task offloading in a multi-UAV-enabled IoMT network to minimize communication latency and energy consumption. We formulated a joint optimization problem for the resource allocation and task offloading problems. We then transformed the optimization problem into an MDP model and used an MARL algorithm to solve it. The proposed framework uses a distributed FL-based DRL algorithm. It provides distributed computing, allowing local model training on healthcare data without sending sensitive raw data to the edge servers (AMEC servers), and aggregates the models on the UCH servers and SDN. Through this, the privacy of sensitive healthcare data can be protected. Simulation results show that the MIDs can obtain resources and offload sensitive tasks to the UCHs using a model-free algorithm at a minimum computational cost. The multiagent algorithms achieve better performance, and the proposed MAFRL algorithm outperforms the baselines while ensuring the privacy of the MIDs. Furthermore, we analyzed the algorithm under various parameter settings. The simulation results demonstrated that the proposed MAFRL framework outperforms the baseline algorithms in terms of accuracy, convergence, communication latency, energy consumption, and AoI.

Fig. 4 shows the classification accuracy of the heartbeat data set under the different algorithms. In this simulation, we used a 0.5 MID participation rate in the FL training, chosen through trial and error. The accuracy of all algorithms increases rapidly in the first 15 rounds and gradually converges after 30 rounds as the number of communication rounds or global updates increases. When a MID performs poorly, it indicates a lack of resources and poor data quality. Without loss of generality, the quality of the training data, the number of communication rounds, and the number of MIDs affect the training accuracy.

Fig. 5. Effect of training episodes, data size, and MIDs on system cost. (a) System cost versus episodes. (b) System cost versus data size. (c) System cost versus MIDs.

Fig. 7. Effect of MID and CPU on AoI. (a) Impact of MID. (b) Impact of CPU.
Multiagent Federated Reinforcement Learning for Resource Allocation in UAV-Enabled Internet of Medical Things Networks. Abegaz Mohammed Seid, Member, IEEE, Aiman Erbad, Senior Member, IEEE, Hayla Nahom Abishu, Graduate Student Member, IEEE, Abdullatif Albaseer, Member, IEEE, Mohamed Abdallah, Senior Member, IEEE, and Mohsen Guizani, Life Fellow, IEEE.
Each active UCH j ∈ Û(t) runs a local update algorithm based on its local data set D_j and the global model of the SDN, Q(t − 1), whose output is the updated model Q_j(t); likewise, each active MID i ∈ Î(t) runs a local update based on its data set D_i and the global model of UCH j, Q_j(t), whose output is the updated model Q_i(t). The SDN and UCH aggregate the local updates Q_j(t) and Q_i(t), respectively, computing their weighted averages as the updated models.
1) As part of healthcare systems, UAVs are deployed as ABSs, providing various resources to MIDs in smart cities, including energy, bandwidth, and computing. They can enhance network coverage, increase communication efficiency, enable emergency communication, and restore malfunctioning networks damaged by natural or man-made disasters. 2) To protect the patients' information collected by the MIDs and preserve their privacy, the agents share only local model parameters rather than patient data. The agents send the trained local parameters to the UCH, and the UCH aggregates the local agents' parameters and sends them to the SDN/central controller. Finally, the SDN sends the updated global model parameters to the UCHs, and each UCH likewise sends the updated global model parameters to its associated MIDs. It is worth noting that each MID is assumed to have sufficient computing and communication resources to perform the FL model update.
Algorithms 1 and 2 describe the DRL training and MAFRL execution phases. The RL and MAFRL algorithms have training and execution phases, in which data sets/training data are acquired by interacting with the multi-UAV-enabled IoMT networks through FL techniques. In Algorithm 1, each agent acts, receives the reward, and creates a new state (line 8); the experience is stored in the replay memory buffer, and in the training phase (lines 14 to 20) we employ policy updates. The inputs are the maximum numbers of iterations (i.e., T_SDN, T_UCH, T_MID), the set of UCHs J and associated MIDs I, and the learning rate η. Algorithm 2 emphasizes the FL execution phase, which contains the local training and global aggregation stages.