Caching and Computing Resource Allocation in Cooperative Heterogeneous 5G Edge Networks Using Deep Reinforcement Learning

In this work, we explore a framework for a 5G non-standalone (NSA) heterogeneous network, to meet heterogeneous content requests for users moving in vehicles. We consider that an enhanced NodeB (eNB) acts as a macrocell and next-generation NodeBs (gNBs) act as the small cells. To reduce the downstream latency, the popular contents are fetched, in full or in part, from the core network and cached (stored) at the eNB and gNBs. Along with the caching resources, computing resources are required at the eNB and gNBs for content compression and decompression, which in turn reduces the caching resource requirement. The eNB and gNBs cooperatively decide on the resources (caching and computing) to be allocated. In this network planning approach, first we compute the optimal coverage radius of the eNB and gNBs. Thereafter, we identify the optimal number of non-overlapping gNBs under the coverage area of the eNB. Finally, we propose a novel deep Q-network (DQN)-based algorithm to train the centralized controller agent so as to identify an optimal policy for caching and computing resource allocation. In case the content popularity and road traffic conditions change, the agent can be trained again so as to identify a new optimal policy. We also explore the resource allocation policy using other optimization techniques, such as pattern search, genetic algorithm, and multi-start search. The proposed DQN-based algorithm is scalable and shows an average percentage gain of 66.52%, 76.31%, and 53.64% in terms of computation time to identify an optimal policy for caching and computing resource allocation, over the pattern search, genetic algorithm, and multi-start search techniques, respectively.

Owing to limited built-in processing power, memory, storage space, and battery life, mobile devices often perform below par. Mobile cloud computing (MCC), typically enabled by a group of dispersed servers offering significantly higher processing, memory, storage, energy, and networking resources, is envisioned as a promising solution to these challenges. Offloading computation-intensive jobs to the cloud is an effective way to get around these limitations. However, even then, the strict latency and network throughput requirements may not be satisfied for various applications conforming to different service classes, viz., ultra-reliable low-latency communication (URLLC), enhanced mobile broadband (eMBB), and massive machine-type communication (mMTC).
To address the challenges of latency and network throughput, edge computing, wherein the servers are provided close to the end users, is considered an attractive solution. In an edge-computing-assisted communication network, the contents are selectively pre-allocated (cached) by the content providers at the edge servers. Typically, dedicated servers are co-located with base stations (BSs) or packet gateways to facilitate computing and storage. The popular contents frequently requested by the users are fetched from the core network and cached at the edge servers. This leads to a significant reduction in the service time and, owing to the decrease in backhaul traffic, reduced network congestion. A European Telecommunications Standards Institute (ETSI) white paper details the implementation of the edge computing system architecture for fourth generation (4G) and fifth generation (5G) networks [1].
Other than employing cache-enabled content-centric networks, another popular approach to achieve high throughput is network densification by deploying a multi-tier heterogeneous network (HetNet) [2]. A HetNet includes small BSs, such as micro, pico, and femto BSs, that are deployed together with the traditional macro BS. HetNets decrease the communication distance by bringing the network closer to the users, thereby increasing the area spectral efficiency and network capacity. In this paper, we consider a 5G non-standalone (NSA) network where both the 4G enhanced NodeB (eNB) and the 5G next-generation NodeB (gNB) coexist in the same network infrastructure [3]. NSA helps reduce the capital expenditure of the 5G network by reusing the existing long term evolution (LTE) sites.
1932-4537 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Motivated by the aforementioned benefits of edge caching and heterogeneous networks, in this paper, we investigate a novel methodology for mobility-aware cooperative content caching in HetNets. We address the network planning problem for a HetNet with a focus on providing content-centric service to users requesting heterogeneous contents while moving in vehicles. We consider that primarily the gNBs, located on the roadside, provide network coverage to the users (Fig. 1), and the majority of the network service requests are met by them. In areas where gNB network coverage is not available, if required, the service requests are met by the eNB. The gNBs and eNB are equipped with a local server that contains cache storage and a processor providing computing power. The computing power supports the processing required for compressing (decompressing) the contents. We consider that the eNB has a wide service area and acts as a macro BS, whereas the gNBs act as the small BSs providing high QoS (characterized by outage probability and data rate). The outage probability represents the probability of the received signal being less than a predefined threshold [4]. We consider two scenarios that may be experienced by the users. The first scenario relates to straight roads (considered to be freeway type) wherein no gap (i.e., void) exists between the consecutive gNBs (Fig. 1). In this case, the gNBs cooperatively assign the caching and computing resources to be allocated for the contents requested by the users. The second scenario relates to curved (or straight) roads (freeway) wherein there exist voids between the consecutive gNBs (Fig. 1). When a user passes through a void, the network services are provided by the eNB. In the second scenario, the gNBs and the eNB cooperatively assign the caching and computing resources for the contents. We investigate a joint framework wherein optimal network planning for the 5G NSA HetNet is executed only during infrastructure deployment, whereas the optimal caching and computing resource allocation policy is determined over short time intervals, depending upon the variation in data size, compressibility and popularity of contents, and mobility of users. In our network setting, we consider that a passive optical network (PON) based shared backhaul infrastructure connects the eNB and gNBs to the core network (Fig. 1). The PON consists of an optical line terminal (OLT) located at the central office, optical network units (ONUs) placed along with the eNB and gNBs, and a passive splitter (PS) placed in-between the OLT and ONUs [5].
In recent times, reinforcement learning (RL), a subcategory of machine learning (ML), has proven effective for making intelligent decisions regarding caching and computing resource allocation in 5G edge networks as compared to traditional approaches. Q-learning [6] is simple and one of the most popular algorithms for solving problems cast in the RL framework. However, the Q-learning algorithm fails when the state and action spaces are large. Moreover, Q-learning also suffers when the state and action spaces are continuous. Thus, the Q-learning algorithm is well suited only for small, discrete sets of states and actions. As a large number of contents are demanded by the HetNet users, decisions regarding caching and computing resource allocation need to be taken for many cooperative gNBs and the eNB. The involvement of a large number of contents and cooperative gNBs results in a large number of states in the decision-making process for optimal resource allocation, thereby making the Q-learning algorithm inefficient for present-day large-scale 5G NSA HetNets. Deep reinforcement learning (DRL) algorithms overcome the aforementioned challenges. The deep Q-network (DQN) algorithm, introduced by DeepMind [7], is one of the most suitable algorithms for DRL [8]. This motivates us to employ a DQN-based optimal resource allocation strategy. Moreover, in a dynamic 5G edge network supporting heterogeneous content requests of various sizes and dynamic vehicular traffic conditions, it becomes indispensable to take decisions (regarding optimal caching and computing resource allocation) in a timely manner. Furthermore, the resource allocation scheme should also be scalable, i.e., it should be able to provide a solution for the large number of contents involved in present-day large HetNets. This further motivates us to propose an effective DQN-based framework for large-scale 5G HetNets.
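As a concrete illustration of the two DQN ingredients relied upon above (experience replay and a periodically synchronized target network), the following minimal sketch trains a linear Q-approximator, standing in for a deep network, on a toy chain MDP. The environment, dimensions, and hyperparameters are illustrative assumptions, not the paper's actual controller setup.

```python
import random
from collections import deque

import numpy as np

# Toy MDP standing in for the controller's allocation decisions: 5 states on a
# line, action 1 moves right (reward on reaching the last state), action 0 left.
N_S, N_A, GAMMA = 5, 2, 0.9

def step(s, a):
    s2 = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_S - 1 else 0.0)

def one_hot(s):
    v = np.zeros(N_S)
    v[s] = 1.0
    return v

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (N_S, N_A))   # online Q weights (linear approximator)
W_tgt = W.copy()                       # target-network copy
buf = deque(maxlen=500)                # experience replay buffer
eps, lr = 0.3, 0.05                    # epsilon-greedy exploration, step size

s = 0
for t in range(3000):
    if t % 10 == 0:                    # short episodes with random restarts
        s = int(rng.integers(N_S - 1))
    a = int(rng.integers(N_A)) if rng.random() < eps else int(np.argmax(one_hot(s) @ W))
    s2, r = step(s, a)
    buf.append((s, a, r, s2))
    s = int(rng.integers(N_S - 1)) if s2 == N_S - 1 else s2
    if len(buf) >= 32:
        # Sample a minibatch from replay and do SGD on the TD error.
        for (bs, ba, br, bs2) in random.Random(t).sample(list(buf), 32):
            target = br + GAMMA * np.max(one_hot(bs2) @ W_tgt)  # TD target
            td_err = (one_hot(bs) @ W)[ba] - target
            W[:, ba] -= lr * td_err * one_hot(bs)
    if t % 100 == 0:
        W_tgt = W.copy()               # periodic target-network sync

greedy = [int(np.argmax(one_hot(i) @ W)) for i in range(N_S - 1)]
print(greedy)
```

After training, the greedy policy moves toward the rewarding state from every non-terminal state; in the paper's setting, states would instead encode cache occupancy and content attributes, and actions would encode caching/computing allocations.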

A. Contributions
The main contributions of this paper are summarized as follows:
• For infrastructure planning of a 5G NSA HetNet, we compute the optimal eNB radius (constrained by the coverage probability), the optimal radius of gNBs (satisfying the desired QoS characterized by outage probability and data rate), and the optimal number of non-overlapping gNBs within the coverage radius of the eNB (minimizing the coverage gaps between the consecutive gNBs).

B. Organization of the Paper
The rest of the paper is organized as follows. Section II presents the related works regarding content caching in 5G edge networks. Section III discusses the system model and problem formulations. Section IV describes the specific relations used to compute the optimal coverage radius of the eNB and gNBs, and the optimal number of non-overlapping gNBs within the coverage area of the eNB. Section V presents a brief overview of RL and DRL. Here, we also discuss the proposed caching and computing resource allocation policy. Section VI illustrates the simulation results. Finally, we conclude the paper in Section VII.

II. RELATED WORKS
In general, the caching strategies in 5G edge networks can be broadly categorized into two groups, viz., content caching without ML techniques and content caching with ML techniques.

A. Content Caching Without ML
Authors in [9] propose a graph coloring approach to optimize the placement of popular contents in the memory of a small BS. In [10], the authors propose an incentive-based caching strategy in a 5G-enabled edge network, taking into consideration a competitive environment with multiple 5G mobile network operators and multiple content providers. Authors in [11] propose a separable assignment problem-based framework to determine the caching strategy so as to maximize the cache hit ratio for cooperative content caching in 5G/6G networks enabled with edge servers. In [12], authors propose a cache placement algorithm based on a heuristic greedy algorithm to maximize the energy efficiency of content caching. Authors in [13] propose an integer NLP to obtain the optimal quality of experience by placing content units at appropriate small BSs.

B. Content Caching With ML
In recent years, various studies have leveraged ML techniques, including deep learning and DRL, for effective content caching in 5G edge networks. Various algorithms of the DRL framework, such as DQN, actor-critic, and deep deterministic policy gradient (DDPG), have been utilized in the literature to solve DRL-based content caching in 5G edge networks. In [14], the authors address the problem of content caching in three-tier wireless networks, consisting of multiple BSs, relays, and users. The authors model the reward of the RL agent as a combination of cache hit rate and cache storage cost, and train the agent by utilizing DQN to maximize the reward. In [15], the authors propose a proactive sequence-aware content caching strategy to reduce the network load and improve the user experience. Specifically, the authors use a deep learning-based framework employing a convolutional neural network to implement a proactive caching scheme at the network edge. The authors in [16] propose a digital twin-based technology to create a virtual model of the vehicular edge network to replicate and analyze the behaviour of the physical system. The authors utilize the framework of this virtual model and investigate a DDPG learning-based solution to optimize cache resource allocation. The authors in [17] propose a size-adaptive content caching algorithm based on the DRL framework that adapts to the varying content size and arrival rate, thereby improving the cache hit ratio. In their DRL-based methodology, the authors utilize an actor-critic framework and model the reward of the RL agent as a cache hit ratio to optimize the cache replacement policy. In [18], the authors propose a DRL-based framework to optimize the computational resource usage and energy consumption of the edge computing nodes in the Internet of vehicles environment. The authors utilize the DDPG algorithm of the DRL framework for the allocation of caching and computing resources. The authors in [19] propose a content caching policy that maximizes the caching benefit while minimizing the caching cost. The authors utilize the actor-critic method to train the RL agent to determine the caching policy. In [20], the authors propose a DRL-based management framework employing the DQN algorithm for virtual cache slicing and content placement to optimize the cache resource allocation. In another work [21], the authors utilize a DRL-based framework employing a dueling DQN technique to cache contents in unmanned aerial vehicles (UAVs) and determine the location of UAVs in a UAV-assisted cellular network.
None of the aforementioned works addresses NSA HetNet infrastructure deployment with due consideration of the optimal eNB radius, the optimal gNB radius, and the optimal number of gNBs within the eNB, coupled with the identification of an optimal resource (caching and computing) allocation policy for heterogeneous contents. A few studies (e.g., [14], [16], [18], and [19]) employ cooperative content caching with cooperation among the gNBs only. On the contrary, we consider joint cooperation between the eNB and gNBs of a 5G NSA HetNet for efficient content caching. Moreover, a mathematical optimization formulation for content caching in a realistic network scenario, wherein there exist coverage gaps between the consecutive gNBs, has not been considered in the aforementioned works. The proposed optimization framework, incorporating coverage gaps and employing joint cooperation between the eNB and the gNBs, is thus applicable to vehicles moving on straight as well as curved roads. In some studies (e.g., [14], [15], [17], and [19]), the mapping between the frameworks of mathematical optimization and ML to obtain effective content caching strategies was not presented. To the best of our knowledge, this is the first work that maps the NLP-based optimal resource allocation problem for content caching in an NSA HetNet setting into an efficient novel DQN-based framework to achieve a low computation time and high scalability. Unlike other works, we have demonstrated the performance results with due consideration of 3GPP standards for different 4G and 5G outdoor propagation scenarios.

III. SYSTEM MODEL AND PROBLEM FORMULATIONS
In this section, we discuss the system model and optimization problem formulations. In the first subsection, we present the system model. The next subsection discusses the mathematical formulation for four different optimization problems, viz., maximizing the coverage radius of the eNB (problem P1), maximizing the coverage radius of the gNBs (problem P2), minimizing the number of gNBs under the coverage area of the eNB (problem P3), and minimizing the overall downstream latency (problem P4).

A. System Model
In the system model (Fig. 1), we consider that all users, moving in vehicles, have to be provided network services.The users request heterogeneous contents with variable data sizes, where the heterogeneity arises due to a variety of applications, such as emails, 4K videos, games, augmented reality (AR), and virtual reality (VR) applications.
In our network setting, the eNB and gNBs are connected to the core network by means of a PON. Even though all contents are available at the core network, parts of them are cached at the gNBs and the eNB to improve QoS (lower overall downstream latency). As the coverage radius of the gNBs is small, a vehicle needs to pass through multiple gNBs before the desired content is downloaded. The part of the content that cannot be cached at the gNBs and eNB is fetched from the core network. The gNBs and eNB cooperatively allocate caching and computing resources for the contents.
We consider that the popularity of contents is relatively non-time-varying over a certain duration, e.g., for a few hours or days depending on the contents, such as news along with short videos, movies, or music [22]. Further, a cognitive engine in the 5G edge network can be used to predict the content popularity periodically, or whenever necessary, for the users moving in vehicles by utilizing ML models or other learning methods [22], [23]. Thus, we consider that the content popularities of users entering the eNB at a given time frame are known beforehand. The centralized controller agent can be trained so as to identify an optimal policy for cooperative resource allocation. Thereafter, under similar content requirements and road traffic conditions, the learned policy can be reused. Further, if the content requirements and road traffic conditions change, the agent can be trained again so as to identify a new optimal policy in a short computation time. In our framework, the road traffic condition is represented by the recommended speed of the vehicles moving on freeway roads [24].
Set C denotes the set of all contents requested by the users, i.e., C = {c_1, c_2, ..., c_l, ..., c_L}, where L is the total number of contents. Each content has a specific data size d_l (l = 1, 2, ..., L), and the content sizes are represented by a set D = {d_1, d_2, ..., d_l, ..., d_L}. The popularity and compressibility of content c_l are denoted by p_l and f_l, respectively, wherein the compressibility depends on the type of application and the compression algorithm used. Thus, each content c_l has three attributes, viz., the content size (d_l), content popularity (p_l), and content compressibility with a unit computing resource (f_l). The attributes of all contents are represented by matrix C_mat. The rows of C_mat represent the different contents c_l for l = 1, 2, ..., L, and the columns of C_mat represent the three attributes of content c_l, viz., d_l, p_l, and f_l. We consider that the total number of gNBs under the coverage of the eNB is N_gnb, and the set of gNBs is denoted by G = {g_1, g_2, ..., g_i, ..., g_Ngnb}. The total number of voids formed between the consecutive gNBs within the coverage radius of the eNB is denoted by N_v, and the set of voids is denoted by V = {ϑ_1, ϑ_2, ..., ϑ_j, ..., ϑ_Nv}. The cache storage and computing power resources available at the local server of a gNB are denoted by S_m^g and S_c^g, respectively. On the other hand, the cache storage and computing power resources available at the eNB are denoted by S_m^e and S_c^e, respectively.
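As a small illustration of this representation, the attribute matrix C_mat can be assembled row-wise from the per-content attributes; the numeric values below are hypothetical.

```python
import numpy as np

# Hypothetical attributes for L = 4 contents: size d_l (units), popularity p_l,
# and compressibility f_l with a unit computing resource.
d = np.array([800.0, 350.0, 120.0, 60.0])
p = np.array([0.45, 0.30, 0.15, 0.10])   # popularities sum to 1
f = np.array([0.6, 0.5, 0.8, 0.9])

# Row l of C_mat holds (d_l, p_l, f_l), matching the definition in the text.
C_mat = np.column_stack([d, p, f])
print(C_mat.shape)                        # one row per content, three columns
```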

B. Problem Formulations
In this subsection, we discuss the theoretical framework for four different optimization problems (P1-P4) to be solved sequentially. However, we discuss the specific relations to be used for problems P1, P2, and P3 in Section IV. The problem
to maximize the coverage radius of the eNB, i.e., problem P1, is presented as

(P1): maximize r_enb,
subject to:
P_cov(r_enb) > γ_th, (3)
r_enb > 0. (4)

The objective function of P1 is to maximize the coverage radius of the eNB (r_enb). Here, R_enb is the optimum radius of the eNB, obtained by solving P1. Constraint (3) guarantees that the coverage probability of the eNB (P_cov), which is a function of the radius r_enb, is greater than the threshold γ_th. Constraint (4) puts a lower bound on the decision variable r_enb. Next, we consider the optimization problem to maximize the coverage radius of a gNB, constrained by the outage probability and data rate (referred to as problem P2). The mathematical representation of the optimization problem is shown below.
(P2): maximize r_gnb,
subject to:
P_out(r_gnb) < ϵ_th, (6)
D = BW log_2(1 + SNR) > D_th, (7)
r_gnb < r_enb, (8)
r_gnb > 0. (9)

The objective function of P2 is to determine the optimal coverage radius of a gNB. Here, r_gnb is the coverage radius of the gNB, and R_gnb is the optimal radius of the gNB, obtained by solving P2. Constraint (6) specifies that the outage probability (P_out), which is a function of r_gnb, is less than the threshold ϵ_th. Constraint (7) guarantees that the data rate obtained in a cell of the gNB is greater than the threshold D_th. Here, BW represents the communication bandwidth, and SNR is the signal-to-noise ratio. The SNR is a function of the coverage radius of the gNB (i.e., r_gnb). Thus, constraints (6) and (7) together guarantee the required QoS in the small cell of the gNB. Constraint (8) specifies that the radius of the gNB (r_gnb) is less than the radius of the eNB (r_enb). Constraint (9) specifies a lower bound on the decision variable r_gnb.
After determining the optimal coverage radii of the eNB and gNB, utilizing P1 and P2 respectively, we next describe the optimization problem P3 to compute the optimum (minimum) number of non-overlapping gNBs under the coverage of the eNB. The optimum number of gNBs is computed such that each of the gNBs (within the coverage radius of the eNB) satisfies the desired QoS constraints characterized by outage probability and data rate. Furthermore, the computed optimal number of non-overlapping gNBs ensures that the coverage density (i.e., the ratio of the areas covered by the gNBs and the eNB) remains larger than a threshold value. The coverage density (D_cov) is given as

D_cov = (N_gnb π R_gnb²) / (π R_enb²) = N_gnb R_gnb² / R_enb², (10)

where R_gnb is the radius of the non-overlapping gNBs satisfying the QoS constraints, and N_gnb is the number of gNBs. Problem P3 is presented as

(P3): minimize N_gnb,
subject to:
D_cov > χ, (12)
r_gnb ≤ R_gnb, (13)
N_gnb ≥ 2, N_gnb ∈ Z⁺. (14)

The objective function of P3 provides the minimum number of non-overlapping gNBs (N*_gnb) that can be placed within the coverage radius of the eNB (R_enb). The number of non-overlapping gNBs (N_gnb) is a function of the outage probability (P_out) and the data rate (D). Constraint (12) guarantees that the coverage density is greater than the threshold χ, ensuring that the number of voids is minimized in the geographical area of service spanned by the eNB. Constraint (13) ensures that the gNB radius is less than (or equal to) R_gnb, signifying that the QoS constraints are satisfied in each of the gNBs. Constraint (14) specifies the lower bound on the decision variable N_gnb; N_gnb ≥ 2 guarantees the QoS requirements of outage probability and data rate within the coverage of the gNBs.
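Given the monotonicity of the coverage density in N_gnb, the smallest feasible N_gnb for P3 can be found by a direct scan over constraint (12) starting from the lower bound of constraint (14). The radii and threshold below are hypothetical values, not the paper's results.

```python
R_enb = 2000.0   # optimal eNB radius from P1 (metres, hypothetical)
R_gnb = 300.0    # QoS-feasible gNB radius from P2 (metres, hypothetical)
chi = 0.05       # coverage-density threshold (hypothetical)

def coverage_density(n_gnb):
    # D_cov = (N_gnb * pi * R_gnb^2) / (pi * R_enb^2), i.e., eq. (10).
    return n_gnb * R_gnb**2 / R_enb**2

n_gnb = 2                                # constraint (14): N_gnb >= 2
while coverage_density(n_gnb) <= chi:    # constraint (12): D_cov > chi
    n_gnb += 1
print(n_gnb, coverage_density(n_gnb))
```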
After solving problems P1 to P3 in sequence, we obtain a network model (similar to that in Fig. 2), wherein the optimal eNB radius and the optimal number, radii, and positions of non-overlapping gNBs are identified. With regard to minimizing the overall downstream latency (i.e., problem P4), we consider two scenarios that may be experienced by users. The first scenario relates to the users moving along a straight road, wherein the services are provided by the gNBs along the entire stretch of the road with no void between the consecutive gNBs' coverage (problem P4.1). The gNBs cooperate to assign the caching and computing resources. The second scenario relates to curved (or straight) roads where users experience voids between consecutive gNBs' coverage (problem P4.2). As the users move through the voids, network services are provided by the eNB. In this case, the eNB and gNBs cooperatively assign caching and computing resources. Minimizing the overall downstream latency for the first scenario is presented in the following as problem P4.1, along the lines of the model used in [25].
Subject to,
The objective function of P4.1 is to minimize the overall downstream latency. For the sake of problem formulation, we consider the length of the road in each gNB to be 2R_gnb. x_{c_l,g_i} and y_{c_l,g_i} are the decision variables representing the caching and computing resources, respectively, to be allocated to store content c_l at gNB g_i, where c_l ∈ C (set of contents) and g_i ∈ G (set of gNBs). The speed of the vehicles is v. t_g and t_c represent the delay experienced by a user to obtain one unit of content from the gNB and the core network, respectively. t_p^g represents the processing delay at a gNB required by one unit of computing resource. N_g(c_l) denotes the minimum number of consecutive gNBs along the road (i.e., the index of the farthest gNB) required for caching content c_l. Constraint (16) ensures that the caching resource x_{c_l,g_i} is limited by the caching resource available at a gNB (i.e., S_m^g). Constraint (17) guarantees that the computing resource y_{c_l,g_i} does not exceed the computing power available at a gNB (i.e., S_c^g). Constraint (18) puts a bound on x_{c_l,g_i}, and ensures that the content cached at a gNB does not exceed the amount of content that can be downloaded by a user moving with velocity v when passing through gNB g_i. Constraint (19) guarantees that N_g(c_l) is less than the total number of available gNBs (i.e., N_gnb). Equation (20) gives N_g(c_l), where ceil denotes rounding off to the higher integer value. Constraint (21) shows that N_g(c_l) belongs to the set of positive integers. Constraints (22) and (23) put lower bounds on the decision variables x_{c_l,g_i} and y_{c_l,g_i}, respectively.
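The interplay of constraints (16) and (18) with the gNB count N_g(c_l) can be sketched numerically. The reading of the per-gNB download bound as (dwell time) / (per-unit delay) is an assumption made for illustration, as is every numeric value; the paper's exact equation (20) is not reproduced here.

```python
import math

# Hypothetical P4.1 parameters for a single content c_l.
v = 25.0        # vehicle speed (m/s)
R_gnb = 300.0   # gNB coverage radius (m); road length per gNB is 2*R_gnb
t_g = 0.002     # delay to fetch one unit of content from a gNB (s/unit)
S_g_m = 500.0   # caching resource available per gNB (units)
d_l = 1200.0    # content size (units)

# Units downloadable while crossing one gNB: dwell time divided by per-unit
# delay (an assumed reading of the download bound behind constraint (18)).
per_gnb_limit = (2 * R_gnb / v) / t_g

# The per-gNB cache allocation is capped by constraints (16) and (18).
x_cap = min(S_g_m, per_gnb_limit)

# Minimum number of consecutive gNBs needed to hold d_l (in the spirit of eq. (20)).
N_g = math.ceil(d_l / x_cap)
print(x_cap, N_g)
```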
The optimization problem to minimize the overall downstream latency for the second scenario is presented as problem P4.2, shown as
Subject to,
The objective function of P4.2 is to minimize the overall downstream latency. N_ϑ(c_l) denotes the number of voids encountered by a user while downloading content c_l. Decision variable x_{c_l,ϑ_j} denotes the part of content c_l to be cached at the eNB corresponding to void ϑ_j. Decision variable y_{c_l,ϑ_j} denotes the computing resource to be assigned at the eNB for content c_l, corresponding to void ϑ_j. t_e denotes the delay for fetching one unit of content from the eNB, and t_p^e is the processing delay at the eNB required by one unit of computing resource. Constraint (25) assumes that the number of voids (i.e., N_ϑ(c_l)) is always less than the number of gNBs (i.e., N_g(c_l)) by unity. Equation (26) gives the expression to compute N_g(c_l). For the sake of problem formulation, we consider the length of the road in each gNB to be 2R_gnb and the length of each void to be L_ϑ. Constraint (27) guarantees that for content c_l, the combined parts of the content cached at the gNBs (x_{c_l,g_i}) and the eNB (x_{c_l,ϑ_j}) do not exceed the content size d_l. Constraint (28) guarantees that the caching resource allocated at the eNB for different contents (with reference to all the voids ϑ_j ∈ V) does not exceed the maximum available caching resource (i.e., S_m^e). Constraint (29) ensures that the computing resource allocated at the eNB for different contents (with reference to all the voids) does not exceed the maximum available computing power (i.e., S_c^e). Constraint (30) guarantees that the size of the content cached at the eNB does not exceed the maximum
content amount that can be downloaded while users travel through void ϑ_j with velocity v. Constraints (31) and (32) put lower bounds on the decision variables x_{c_l,ϑ_j} and y_{c_l,ϑ_j}, respectively. Furthermore, problem P4.2 also has additional constraints with reference to the gNBs (i.e., constraints (16) to (23)), as described in problem P4.1. The detailed derivation of the mathematical expression utilized in equation (26) of problem P4.2 is presented in the Appendix.
IV. COMPUTATION OF ENB RADIUS, GNB RADIUS, AND NUMBER OF NON-OVERLAPPING GNBS

In this section, we discuss the specific relations used to compute the optimal coverage radius of the eNB (i.e., problem P1), the optimal coverage radius of the gNBs (i.e., problem P2), and the minimum number of non-overlapping gNBs under the coverage area of the eNB (i.e., problem P3).

A. Computation of Optimal Coverage Radius of eNB
With regard to the computation of the optimal service area of the eNB (i.e., problem P1), two propagation environments (urban and rural) are considered. We utilize the realistic path-loss (PL) models mentioned in the 3GPP 4G standards [26]. The PL (in dB) for the urban propagation environment (PL_urban^4g) is given by the relation

PL_urban^4g = 40(1 − 4×10⁻³ h_enb) log₁₀(r_enb) − 18 log₁₀(h_enb) + 21 log₁₀(f_4g) + 80 + X_σ4g. (34)

Here, h_enb is the height of the eNB, f_4g is the carrier frequency (in MHz), r_enb is the distance from the eNB (in km), X_σ4g is the shadowing random variable, and σ_4g is the variance of the shadowing random variable (in dB).
The closed-form expression for the cell coverage probability (P_cov) is given in [27] in terms of the complementary error function (erfc) [28], defined as

erfc(x) = (2/√π) ∫_x^∞ e^{−t²} dt.

Here, P_min is the receiver sensitivity and P_r^4g is the received signal power. P_r^4g is expressed as

P_r^4g = P_t^4g − PL^4g,

where P_t^4g is the transmitted signal power from the eNB, and PL^4g is the PL represented through equation (34) or (35), depending on the propagation environment (urban or rural). c_2 is a constant. Thus, constraint (3) in Section III can be presented in terms of the above closed-form expression. The solution of P1 provides the optimal coverage radius of the eNB, i.e., R_enb. We solve P1 using the constrained Nelder-Mead optimization technique [29].
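The paper solves P1 with constrained Nelder-Mead; since P_cov decreases monotonically with r_enb, an equivalent minimal sketch is a bisection on P_cov(r) = γ_th. The PL model below (a generic 128.1 + 37.6 log₁₀(r_km) macro-cell form) and all numeric values are stand-in assumptions, not the paper's equations.

```python
import math

P_t, P_min, sigma = 46.0, -100.0, 8.0   # dBm, dBm, dB (hypothetical)
gamma_th = 0.9                           # coverage-probability threshold

def p_cov(r_km):
    # Pr[P_r > P_min] at radius r under log-normal shadowing, using
    # P_r = P_t - PL(r) and a simplified distance-only PL model.
    margin = P_t - (128.1 + 37.6 * math.log10(r_km)) - P_min
    return 0.5 * math.erfc(-margin / (sigma * math.sqrt(2)))

lo, hi = 0.01, 100.0                     # bracket in km
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if p_cov(mid) >= gamma_th else (lo, mid)
R_enb = lo                               # largest radius meeting constraint (3)
print(round(R_enb, 3))
```

With these illustrative numbers the feasible radius comes out around 1.6 km; with the paper's equations (34)-(40) the same monotone structure holds, which is what makes a direct-search method such as Nelder-Mead effective.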

B. Computation of Optimal Coverage Radius of gNBs
With regard to the computation of the optimal coverage radius of the gNBs (i.e., problem P2), three 3GPP 5G PL models, viz., RMa, UMa, and UMi-street canyon, are considered [30]. The received signal power (P_r^5g) is computed as

P_r^5g = P_t^5g − PL^5g, (42)

where P_t^5g is the transmitted signal power and PL^5g is the PL (in dB). We consider that the gNBs are operated in the FR1 frequency band [31]. Specifically, in this paper, we consider that the gNBs are operated at a carrier frequency of 6 GHz. We derive the closed-form mathematical expressions of PL^5g for the RMa, UMa, and UMi-street canyon propagation environments, considering a carrier frequency of 6 GHz, as equations (43), (44), and (45), respectively. Here, PL_rma, PL_uma, and PL_umi are the PL in the RMa, UMa, and UMi-street canyon propagation scenarios, respectively. r_gnb is the distance from the gNB and d_bp is the breakpoint distance [30]. X_σ5g is a random variable following a log-normal distribution [32], as given in equation (46), where σ_5g is the standard deviation of X_σ5g. G(θ, φ) is the antenna gain, where θ and φ are the elevation and azimuth angles, respectively. The outage probability (P_out) is expressed as

P_out = P(P_r^5g < β), (47)

where P denotes the probability and β is the received signal power threshold. Using equations (42) and (47), the outage probability is expressed as equation (48). Utilizing equations (46) and (48), the modified expression for the outage probability is given by equation (49). For high QoS in the cell of the gNB, the outage probability should be less than the threshold ϵ_th; ϵ_th → 0 results in a small outage probability, leading to high QoS. Using equation (49), constraint (6) in Section III is presented as P_out(r_gnb) < ϵ_th. The obtained data rate (D) is presented as

D = BW log₂(1 + SNR),

where BW is the channel bandwidth and n_0 is the standard deviation of the noise. Thus, constraint (7) in Section III is presented as D > D_th, where D_th is the data rate threshold and P_r^5g is obtained from equation (42). Substituting the expressions of PL for the RMa, UMa, and UMi-street canyon propagation environments
from equations (43), (44), and (45), respectively, we observe that the data rate D in equation (52) depends on the coverage radius of the gNB (i.e., r_gnb). Therefore, computing the maximum value of r_gnb that satisfies equation (53) ensures that the data rate requirement is satisfied within the entire coverage of the gNB for the RMa, UMa, and UMi-street canyon propagation environments. We solve P2 using the constrained Nelder-Mead technique.
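To make the outage behaviour concrete, the sketch below evaluates P_out = Pr[P_r^5g < β] under log-normal shadowing via erfc, using a simplified UMa-LOS-style PL form at 6 GHz. The PL form, powers, gain, threshold, and σ_5g are hypothetical stand-ins, not the paper's equations (44) and (49).

```python
import math

P_t, G, beta, sigma = 30.0, 14.8, -60.0, 6.0   # dBm, dBi, dBm, dB (hypothetical)

def pl_uma(r_m, f_ghz=6.0):
    # Simplified UMa-LOS-style path loss (assumption, not the paper's eq. (44)).
    return 28.0 + 22.0 * math.log10(r_m) + 20.0 * math.log10(f_ghz)

def p_out(r_m):
    # P_out = Pr[P_t + G - PL(r) - X_sigma < beta], with X_sigma ~ N(0, sigma^2);
    # the Gaussian tail is written via the complementary error function.
    margin = P_t + G - pl_uma(r_m) - beta
    return 0.5 * math.erfc(margin / (sigma * math.sqrt(2)))

for r in (100.0, 300.0, 500.0):
    print(r, round(p_out(r), 4))   # outage grows with the cell radius
```

The maximum r_gnb with P_out below ϵ_th (and the rate above D_th) then gives the optimal gNB radius, mirroring the structure of P2.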
The higher frequencies of the FR1 band result in higher PL. However, this challenge is alleviated by using directional beams. Beamforming is done using an advanced antenna system (AAS) that typically consists of a rectangular planar array (RPA). With an RPA consisting of M rows of antenna elements along the x axis and N columns of antenna elements along the y axis, the array factor (AF) for the RPA is given as in equation (54) [33], [34]. Here, λ_c is the carrier wavelength, and d_x and d_y are the spacings between the antenna elements along the x and y axes, respectively. β_x and β_y are the phase shifts applied to the elements along the x and y axes, respectively. By applying different phases to the antenna elements, the beam can be steered in the desired direction.
The denominator of equation (57) is computed using a numerical integration technique [35]. For low reflection losses in the antenna array, the gain is stated as below. The gains for different dimensions (M × N) of the RPA, considering 0.5λ_c spacing (i.e., d_x = d_y = 0.5λ_c), are shown in Table III.
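The array factor of equation (54) is straightforward to evaluate numerically. The sketch below assumes element spacings expressed in wavelengths and uniform excitation; at broadside with zero phase shifts, |AF| equals M·N, and choosing β_x and β_y appropriately steers the maximum toward a desired direction.

```python
import cmath
import math

def array_factor(theta, phi, M=8, N=8, dx=0.5, dy=0.5, bx=0.0, by=0.0):
    """|AF| of an M x N rectangular planar array (equation (54)).

    dx, dy are element spacings in units of the carrier wavelength;
    bx, by are the per-element phase shifts (radians) used to steer.
    """
    k = 2.0 * math.pi                      # wavenumber in 1/wavelength units
    psi_x = k * dx * math.sin(theta) * math.cos(phi) + bx
    psi_y = k * dy * math.sin(theta) * math.sin(phi) + by
    af = sum(cmath.exp(1j * (m * psi_x + n * psi_y))
             for m in range(M) for n in range(N))
    return abs(af)
```

For example, setting bx = −k·dx·sin(θ0) moves the AF maximum from broadside to elevation θ0 in the φ = 0 plane, which is the steering mechanism the paragraph describes.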

C. Computation of Optimal Number of Non-Overlapping gNBs
With reference to finding the optimal number of non-overlapping gNBs that can be placed within the service area of the eNB (i.e., problem P3), we utilize the concept of circle packing [36]. The theoretical upper bound of the coverage density (D_cov) is given by Groemer's inequality [36] in equation (59). Utilizing equations (10) and (59), R_gnb can be rewritten as equation (60). The constraints (12) and (13) in Section III are represented using equations (59) and (60), respectively. The solution of P3 provides the minimum number of non-overlapping gNBs satisfying the QoS constraints within the eNB coverage (i.e., N_gnb). In this paper, we use a graphical approach to solve P3.
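A minimal sketch of this packing bound, assuming the simplified form N·r² ≤ D_cov·R² with the hexagonal packing density π/√12 standing in for the exact Groemer expression of equation (59):

```python
import math

def max_gnbs(R_enb_km, r_gnb_km, density=math.pi / math.sqrt(12)):
    """Upper bound on the number of non-overlapping gNB discs of radius
    r_gnb that fit inside the eNB disc of radius R_enb.

    Uses the hexagonal packing density pi/sqrt(12) ~ 0.9069 as a
    simplified stand-in for the Groemer bound: N * r^2 <= density * R^2.
    """
    return math.floor(density * (R_enb_km / r_gnb_km) ** 2)
```

This is a bound, not a construction: it says how many gNBs could fit at best, which is what the graphical solution of P3 then refines against the QoS-feasible radius.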

V. DRL-BASED EDGE CACHING AND COMPUTING POLICY
With reference to minimizing the overall downstream latency (i.e., problems P4.1 and P4.2), in this section, we present a brief overview of RL and DRL, followed by an algorithmic description of our proposed method (policy).

A. Brief Overview of RL and DRL
The framework of RL consists of an environment, an agent, actions, and rewards. The RL framework is mathematically governed by the Markov decision process (MDP) [37]. The agent interacts with the environment at discrete time steps t = 0, 1, 2, …, t_f, where t_f is the final time step. At each time step t, the agent receives a state s_t ∈ S from the environment, where S is the set of all states. Based on the observed state, the agent takes an action a_t ∈ A, where A is the set of all actions. The agent then receives a scalar reward r_t ∈ R (the set of real numbers) from the environment, and observes the next state s_{t+1} at time step t + 1. The goal of the agent is to maximize the long-term cumulative discounted reward (R_cum), given below. The discount factor γ (where 0 ≤ γ ≤ 1) signifies that the reward at time step t is more valuable than rewards at future instances. An RL algorithm achieves the goal by learning a policy π, which is a function from S → A. The action-value function Q^π(s_t, a_t), or the Q-function for policy π, is stated below. The Q value Q^π(s_t, a_t) indicates which action a ∈ A in state s ∈ S will lead to the highest discounted reward. The Q values are updated using Bellman's equation, presented in equation (63).
Here, Q_pr is the present Q value, α is the learning rate, Q(s_t, a_t) is the maximum expected future reward, and Q_new is the updated Q value. The Q-learning algorithm learns the optimum policy π* by selecting the action a that offers the highest Q(s, a) corresponding to each state s. The optimal value of the Q function, i.e., Q*(s_t, a_t), is computed using the equation below, where E_π is the expectation operator and V^π is the state-value function. Utilizing the optimum Q values, the optimal policy π*(s) can be obtained using the relation below. The DQN algorithm is a special category of DRL algorithm that approximates the Q function by utilizing a deep neural network (DNN), also known as a Q-network. A DNN can be used to approximate various functions and thus acts as a universal function approximator [38]. The Q-network takes the present state s_t as input and offers a Q value as output for each action a_t. Further, the DQN algorithm utilizes an experience buffer E of size D. The experience buffer stores D transitions in the form of tuples (s_t, a_t, r_t, s_{t+1}), i.e., (present state, action, reward, next state), sampled from the environment [39]. The approximation of the Q values utilizing the Q-network is expressed as below, where w are the parameters or weights of the neural network.
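The tabular form of the Bellman update in equation (63) can be illustrated on a toy two-state MDP (the MDP, learning rate, and discount factor here are illustrative, not from the paper):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Bellman update (equation (63)):
    Q_new = Q_pr + alpha * (r + gamma * max_a' Q(s', a') - Q_pr)."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Toy 2-state MDP: action 1 in state 0 earns the reward, action 0 does not.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
for _ in range(200):
    q_update(Q, 0, 1, 1.0, 1)   # action 1 in state 0 yields reward 1
    q_update(Q, 0, 0, 0.0, 0)   # action 0 stays in state 0, no reward

greedy = max(Q[0], key=Q[0].get)   # greedy action in state 0
```

After repeated updates, Q(0, 1) converges toward the true return of 1, so the greedy policy picks the rewarding action — exactly the mechanism the Q-learning paragraph describes.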
At each time step during training, the agent samples a minibatch of N experiences (s_t, a_t, r_t, s_{t+1}) from the experience buffer E. The agent then computes the target Q values for each of the N samples using the modified version of Bellman's equation. Here, max_a' Q(s_{t+1}, a'; w') is the maximum Q value over all possible actions a' in state s_{t+1}. The agent learns the Q values by minimizing the mean square error (MSE) between the predicted Q value and the target Q value, where the MSE loss function is expressed as below. The gradient of the loss function is computed through backpropagation [40]. Further, the weights w of the Q-network are updated by utilizing stochastic gradient descent. The trained Q-network is utilized for estimating the Q values for new states and actions that the agent encounters during its interaction with the environment. The RL agent utilizes the learned Q values coupled with an epsilon-greedy action selection strategy to determine the optimal policy. In the epsilon-greedy strategy, the agent selects a random action with probability ε̂ (a hyper-parameter known as the exploration factor, where 0 ≤ ε̂ ≤ 1) or selects the action corresponding to the highest Q value with probability 1 − ε̂. Initially, the value of ε̂ is set high (i.e., ε̂ → 1), thereby promoting the agent to explore the environment to determine a better policy. ε̂ is gradually reduced over episodes to encourage the agent to exploit the experience it has learned through interactions with the environment.
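The three DQN ingredients described above — the experience buffer, the target computation, and epsilon-greedy selection — can be sketched framework-free as follows (a real DQN would back these with a neural network and gradient descent):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer E of capacity D storing (s, a, r, s_next) tuples."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # old transitions are evicted

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, n):
        return random.sample(self.buf, min(n, len(self.buf)))

def epsilon_greedy(q_values, eps, rng=random):
    """Random action with probability eps, else the argmax action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_targets(batch, q_next_fn, gamma=0.9):
    """Targets y = r + gamma * max_a' Q(s', a'; w') for a sampled batch."""
    return [r + gamma * max(q_next_fn(s_next))
            for (_, _, r, s_next) in batch]
```

The MSE between `td_targets` and the network's predictions is what backpropagation would then minimize.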

B. Proposed Algorithm for DRL-Based Caching and Computing Resource Allocation Policy
In this subsection, we describe our proposed algorithm for the cooperative caching and computing resource allocation policy with reference to problem P4.1. In this regard, we present the proposed modelling of the state, action, and reward for mapping the NLP-based optimization into an efficient DRL framework. The centralized controller acts as the agent, whereas the eNB, gNBs, and vehicles act as the environment. A pictorial representation of the proposed methodology utilizing the DRL framework (discussed in Section V-A) is presented in Fig. 3. The state s ∈ S received from the environment at a given instant contains all decision variables (i.e., x_{c_l,g_i} and y_{c_l,g_i}) which need to be optimized. The set of actions A is considered as below, where |·| denotes the cardinality (i.e., the total number of elements in s). We consider that at each time step t, the agent takes a discrete action k (k = 1, 2, …, |s| + 1) and the environment progresses to the next state s', where s(k) is the k-th element of s and δ is the step size. The next state s' is obtained by adding the step size δ to the k-th element of the present state s; i.e., when the agent executes action k, the decision variable at the k-th index in s, which corresponds to a caching resource (x_{c_l,g_i}) or a computing resource (y_{c_l,g_i}), is incremented by δ. However, the next state is equal to the previous state (i.e., s' = s) if the agent takes the action k = |s| + 1. Once the agent reaches the optimal state s* in a particular time step t of an episode, the agent takes the action k = |s| + 1 in the subsequent time steps and retains the same state (i.e., retains the values of the decision variables). In
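The state-progression rule above can be sketched directly (δ = 0.1 and the three-variable state are illustrative choices, not the paper's values):

```python
def env_step(s, k, delta=0.1):
    """Environment progression: action k in 1..|s| increments the k-th
    decision variable by delta; action k = |s|+1 leaves the state as is."""
    s_next = list(s)          # copy so the previous state is preserved
    if 1 <= k <= len(s):
        s_next[k - 1] += delta
    return s_next

state = [0.0, 0.0, 0.0]       # e.g. [x_{c1,g1}, y_{c1,g1}, x_{c2,g1}]
state = env_step(state, 2)    # increment the 2nd decision variable
hold = env_step(state, 4)     # k = |s|+1 = 4: retain the state
```

The "hold" action is what lets the agent sit at the optimal state for the remainder of an episode once it has been reached.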
our proposed algorithm, we model the reward function R_eff as follows. We model the reward R_1 as the inverse of the objective function of problem P4.1, signifying that as the delay reduces, the reward increases. This motivates the agent to minimize the delay so as to maximize the reward. We define the reward R_2 as in equation (74), which ensures that if the caching constraint for the gNBs is satisfied, the scaling factor is λ_1; otherwise, the scaling factor is λ_2. If the caching constraint of the gNBs is satisfied, the agent receives a small positive reward; else, it receives a large negative reward. As observed from equations (72) and (74), if the caching constraint is satisfied, R_2 becomes negative, leading to an increase in the overall effective reward R_eff; otherwise, the effective reward decreases. This modelling motivates the agent to satisfy the caching constraints of the gNBs. Following the same logical argument, we define rewards R_3 and R_4 using equations (75) and (76), respectively.
Reward R_3 motivates the agent to satisfy the computing constraints of the gNBs. Reward R_4 ensures that the amount of content cached at the gNBs is less than (or equal to) the content that can be downloaded as users travel with velocity v. We consider the same scaling factors (λ_1 and λ_2) for equations (74), (75), and (76). This modelling ensures that the agent gives equal weight to all the constraints, and the reward is maximized when all the constraints are satisfied simultaneously.
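A minimal sketch of this reward shaping, with illustrative scaling values standing in for λ_1 and λ_2 and a simplified sign convention (small positive bonus per satisfied constraint, large penalty per violated one):

```python
def constraint_reward(satisfied, lam_pos=1.0, lam_neg=100.0):
    """Per-constraint shaping term in the spirit of equations (74)-(76):
    a small positive reward when the constraint holds, a large negative
    reward otherwise. lam_pos/lam_neg stand in for lambda_1/lambda_2."""
    return lam_pos if satisfied else -lam_neg

def effective_reward(delay, constraints):
    """R_eff sketch: R_1 = 1/delay plus one shaping term per constraint
    (caching, computing, and download-feasibility in the paper)."""
    r1 = 1.0 / delay
    return r1 + sum(constraint_reward(c) for c in constraints)
```

With equal factors for every constraint, no single constraint dominates, and the reward is maximal only when the delay is low and all constraints hold — the behaviour the paragraph argues for.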
The agent executes Algorithm 1 (for environment progression) in an episode. The inputs to Algorithm 1 are the action set (A), the maximum number of time steps per episode (T_f), and the episode break reward threshold (R_break). The output of the algorithm is the total episode reward (R_ep). In Algorithm 1, R_total is obtained from Algorithm 2 (steps 13 and 19). The proposed reward module (Algorithm 2) is used to compute the total reward (R_total) received by the agent at the end of each time step t. The inputs to Algorithm 2 are the present state (s), the time step (t), the number of contents (L), and the number of gNBs required (N_g(c_l) ∀ c_l ∈ C). The output of Algorithm 2 is R_total. In order to bound the total reward, our proposed reward module incorporates a reward bound R_β. Steps (9) and (12) assign the maximum negative and maximum positive rewards that can be received by the agent, respectively. As seen from Algorithm 1, the episode ends when the agent reaches the final time step or when the episode reward falls below R_break. R_break is introduced to facilitate early termination of the episode in case huge negative rewards are received by the agent. Thus, incorporating R_break helps to determine the resource allocation faster and, therefore, leads to a lower computation time.
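The early-termination logic of Algorithm 1 can be sketched as a small episode loop (the step rewards and thresholds below are illustrative):

```python
def run_episode(step_fn, t_f, r_break):
    """Algorithm 1 skeleton: play up to t_f time steps, but end the
    episode early once the cumulative reward drops below r_break."""
    r_ep, t = 0.0, 0
    while t < t_f and r_ep > r_break:
        r_ep += step_fn(t)    # step_fn returns R_total for time step t
        t += 1
    return r_ep, t

# A run that keeps hitting infeasible points terminates after 3 of 100 steps:
r_ep, steps = run_episode(lambda t: -50.0, t_f=100, r_break=-120.0)
```

When every step is heavily penalized (constraints violated), the loop stops almost immediately instead of burning the full T_f budget, which is the computation-time saving the paragraph attributes to R_break.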
The agent is trained using Algorithm 3, which utilizes the frameworks of Algorithm 1 (proposed environment progression) and Algorithm 2 (proposed reward module). The inputs to Algorithm 3 are the maximum number of episodes (η_f) and the Q-network structure (i.e., the DNN architecture). The output of Algorithm 3 is the optimum resource allocation policy (π*), obtained after completing all η_f episodes. For training the agent, the exploration factor ε̂ is initialized to ε̂_in (ε̂_in → 1), and its value is reduced over episodes using step 8. The agent selects an action k using the epsilon-greedy action selection strategy (step 10). The next state s' and reward R_total are obtained by utilizing Algorithm 1 and Algorithm 2, respectively (step 11 of Algorithm 3). The transition (s, k, R_total, s') is stored in the experience buffer E (step 12 of Algorithm 3). The sampling of a mini-batch of N samples from E and the computation of the target value of each sample are carried out in steps 13 and 14, respectively. The weights w of the Q-network are updated by minimizing the loss function (step 15). We use the same logic to obtain the optimal resource allocation policy for problem P4.2; its reward function R_eff can be stated as below, where obj is the objective function and λ_m is the scaling factor for each constraint (cons_m) of problem P4.2. Computational complexity: Algorithm 1 runs for a maximum of T_f time steps, so its computational complexity is O(T_f). The complexity of Algorithm 2 is O(L N_g(c_l)). The proposed DQN-based algorithm (i.e., Algorithm 3) incorporates Algorithms 1 and 2.
Algorithm 3 runs for a maximum of η_f episodes. In each episode, Algorithms 1 and 2 run for T_f time steps, resulting in complexities of O(T_f) and O(T_f L N_g(c_l)), respectively. The complexity of Algorithm 3 for one episode is estimated as the sum of the complexities of Algorithms 1 and 2 executed for T_f time steps, i.e., O(T_f + T_f L N_g(c_l)). For η_f episodes and a mini-batch of N transitions, the overall complexity of the proposed algorithm is estimated as O(η_f N (T_f + T_f L N_g(c_l))).

VI. SIMULATION RESULTS AND DISCUSSIONS
In the first subsection, we compute the optimal coverage radius of eNB.In the second subsection, we compute the optimal coverage radius of gNBs under RMa, UMa, and UMi-street canyon propagation scenarios.Here, we also illustrate the graphical solution to obtain the minimum number of non-overlapping gNBs within the coverage area of the eNB, satisfying the outage probability and data rate constraints.In the third subsection, we present the analysis of the overall downstream latency and comparison results.

A. Numerical Analysis of eNB Coverage Radius
To obtain the optimal eNB coverage radius, we use the simulation parameters shown in Table V. The graphical representations of the PL for the 3GPP 4G urban and rural propagation environments, using equations (34) and (35) respectively, are presented in Fig. 4. The variation in coverage probability, obtained using equation (36), is shown in Fig. 5. By solving problem P1, we obtain the optimal coverage radius of the eNB for the urban and rural scenarios as 1.423 km and 1.977 km, respectively.
Fig. 5. Coverage probability of 3GPP 4G rural and urban propagation environments.

TABLE VI
LIST OF PARAMETERS USED TO COMPUTE GNB RADIUS

Fig. 6. PL for 3GPP 5G RMa, UMa and UMi-street canyon propagation scenarios.

B. Numerical Analysis of gNB Coverage Radius
To compute the optimal coverage radius of the gNBs, we use the parameters specified in Table VI. We consider a 64 × 64 RPA at the gNBs; the corresponding antenna array gain is taken from Table III. The PL for the 3GPP 5G RMa, UMa, and UMi-street canyon scenarios is presented in Fig. 6. We obtain the optimum coverage radii that satisfy the QoS constraints of outage probability and data rate, and show them in Table VII. In Fig. 7, we show the coverage density versus the number of non-overlapping gNBs (i.e., N_gnb), obtained using Groemer's inequality relation (59). We use the graphical solution method to determine the optimal number of non-overlapping gNBs, i.e., the solution of optimization problem P3 (Fig. 8), with due consideration of the 4G and 5G scenarios. The optimal number of non-overlapping gNBs is obtained at the intersection of the horizontal line (representing the optimal coverage radius of the gNB satisfying the QoS constraints) and the curved line (representing the variation of the gNB radius with the number of gNBs). In Table VIII, we show the minimum number of non-overlapping gNBs satisfying the QoS constraints that can be placed within the coverage area of the eNB.

C. Numerical Analysis of Downstream Latency
This subsection presents the analysis of the overall downstream latency. The simulation parameters used in this regard are shown in Table IX. The content sizes are assumed to be uniformly distributed between 5 GB and 10 GB, i.e., d_l ∼ U[5, 10] GB. The content popularity is assumed to follow a Zipf distribution with a skewness factor of 0.7, similar to existing works [22], [41]. Fig. 9(a) shows the overall downstream latency corresponding to the 5G RMa, UMa, and UMi-street canyon propagation scenarios, and Fig. 9(b) shows the number of cooperative gNBs needed for these propagation conditions. We show the impact of the speed of vehicles on the overall downstream latency in Fig. 10(a) for the 4G urban and 5G UMa scenarios; Fig. 10(b) shows the number of cooperative gNBs needed for these conditions. We consider that vehicles move with the recommended speed [24]. For vehicles moving with different speeds, the required contents may not be optimally cached at the eNB/gNBs; however, the contents can always be downloaded from the core network. Thus, the minimal latency (computed considering the recommended speed) may not be obtained for scenarios with deviations from the recommended speeds.

TABLE X
PROPOSED DNN ARCHITECTURE
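The Zipf popularity model assumed in the simulations can be sketched as follows; the only parameter taken from the text is the skewness factor of 0.7.

```python
def zipf_popularity(num_contents, skew=0.7):
    """Request probability of each content under a Zipf law with the
    skewness factor used in the simulations (0.7): p_l proportional
    to 1 / l^skew for rank l = 1..L."""
    weights = [1.0 / (l ** skew) for l in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pop = zipf_popularity(20)   # 20 contents, as in the comparison study
```

The most popular content is requested 2^0.7 ≈ 1.62 times as often as the second-ranked one, which is what makes caching a small set of top-ranked contents at the edge effective.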
We present the DNN architecture used for obtaining the results of the proposed methodology in Table X. The proposed DNN architecture consists of an input layer, three hidden layers, and an output layer. The states sampled from the experience buffer E at each training iteration act as the input to the DNN, and the estimated Q value for each action is its output. The number of neurons in the input layer is |s| (i.e., the number of elements in s), and the number of neurons in the output layer is |A| (i.e., the number of actions). Further, we consider the Tanh [42] activation function for the neurons. The range of Tanh lies between −1 and 1, thereby helping to normalise the inputs. Moreover, Tanh is a non-linear activation function that enables the agent to learn the complex relationships between the inputs and outputs, thereby enabling the RL agent to make efficient decisions based on different input values. As Tanh is also a smooth and differentiable function, it makes training the neural network through backpropagation easier. The parameters used for training the DNN are given in Table XI. Fig. 11 illustrates that the proposed DNN architecture with 800 neurons in each of the three hidden layers yields a higher average reward than other DNN architectures (with different numbers of neurons in each hidden layer). Fig. 12 illustrates that the proposed DNN architecture with three hidden layers yields higher average rewards than DNN architectures with other numbers of hidden layers. Fig. 13 illustrates that the proposed DNN architecture with a learning rate of 0.0001 yields a higher average reward than DNN architectures with other learning rates. From Fig. 11, Fig. 12, and Fig. 13, we observe that the proposed DNN architecture (Table X) with the training parameters shown in Table XI yields better performance than the other DNN architectures.
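The shape of such a Tanh Q-network can be sketched with a scaled-down forward pass (8 hidden units per layer instead of 800, random illustrative weights, and illustrative |s| = 4 and |A| = 5):

```python
import math
import random

def mlp_forward(x, layers):
    """Forward pass of a fully connected Q-network: Tanh on the hidden
    layers, linear output (one Q value per action)."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = [sum(wij * hj for wij, hj in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        h = z if i == len(layers) - 1 else [math.tanh(v) for v in z]
    return h

def make_layer(n_in, n_out, rng):
    """Random layer with a simple 1/sqrt(n_in) weight scale."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
# |s| inputs -> three hidden layers -> |A| Q-value outputs
net = [make_layer(4, 8, rng), make_layer(8, 8, rng),
       make_layer(8, 8, rng), make_layer(8, 5, rng)]
q_values = mlp_forward([0.1, 0.2, 0.3, 0.4], net)
```

Because Tanh squashes every hidden activation into (−1, 1), the Q-value outputs stay well scaled regardless of the raw state magnitudes, which is the normalising effect the paragraph mentions.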
In this paper, we also solve the problem of minimizing the overall downstream latency using pattern search [43], a genetic algorithm [44], and the multi-start search technique [45]. We show the comparison in performance (in terms of computation time) among the proposed algorithm, the pattern search algorithm, the genetic algorithm, and the multi-start search technique in Figs. 14, 15, and 16, respectively, considering a maximum of 20 contents (in line with [25]).
Fig. 14 shows that the pattern search algorithm becomes infeasible (more than 700 s) as the number of contents exceeds 15. For a large number of contents, pattern search becomes computationally expensive as the size of the search pattern and the number of required function evaluations increase. On the other hand, in our proposed methodology, if the RL agent receives a huge negative reward during training (i.e., the constraints are not satisfied), the episode is terminated early. The proposed modelling enables the agent to recognize the infeasible points (i.e., points that result in a negative reward) in a shorter computation time and motivates the agent to identify the point which maximizes the reward (i.e., the point which minimizes the delay and satisfies all constraints). Thus, the proposed modelling of the objective function and the constraints, coupled with the early termination of an episode for infeasible cases (Section V-B), reduces the computation time and enhances scalability. Fig. 15 shows that the genetic algorithm becomes infeasible when the number of contents exceeds 10. The genetic algorithm becomes ineffective for a large number of contents, as evaluating the fitness of each candidate solution in the population is computationally intensive; this leads to a higher computation time and lower scalability. Fig. 16 shows that the multi-start search technique becomes infeasible when the number of contents exceeds 15. For high-dimensional problems (i.e., a large number of contents), obtaining multiple solutions and evaluating them using multi-start search becomes computationally expensive; thus, the multi-start search technique also leads to a higher computation time and lower scalability. The comparison of the proposed algorithm with the existing pattern search, genetic algorithm, and multi-start search, in terms of the overall downstream latency achieved, is presented in Fig. 17. Figs. 14, 15, 16, and 17 show that our proposed algorithm can offer a solution for the resource allocation problem with a lower computation time, albeit with only a marginal increase in the overall downstream latency. It may also be observed (Figs. 14, 15, and 16) that, contrary to the existing techniques, the proposed algorithm can offer a solution even for large-scale HetNets (with a large number of contents). The average percentage gains of the proposed algorithm, in terms of computation time, over pattern search (Fig. 14), the genetic algorithm (Fig. 15), and the multi-start search technique (Fig. 16) are observed as 66.52%, 76.31%, and 53.64%, respectively. The solution of the proposed algorithm is also validated by the duality principle of convex optimization [46]. The primal problem (i.e., P4.1) is solved along with its dual, presented in equation (78). In Fig. 18, we show the norm difference between the dual and primal solutions. The positive norm difference (indicated in Fig. 18) shows that the obtained solution of the primal is greater than that of the dual (i.e., p* > d*). Thus, using the duality framework of convex optimization, the solution of the proposed algorithm is validated. Fig. 19 presents the downstream latency obtained when there exist voids between successive gNB coverage areas, and the eNB and the gNBs cooperate to allocate caching and computing resources for the contents. Here, we consider the content size to be uniformly distributed between 1 GB and 5 GB, i.e., d_l ∼ U[1, 5] GB. The length of the void and the other parameters related to the eNB are taken from Table IX. We consider a smaller content size compared to the first case (i.e., P4.1), as the number of decision variables becomes very large in problem P4.2. The increase in computational complexity restricts us from obtaining results for large problem sizes (i.e., large data sizes and a large number of contents).

VII. CONCLUSION
In this paper, we considered a heterogeneous 5G NSA network wherein both an eNB and gNBs coexist. Multiple gNBs and the eNB cooperatively allocate caching and computing resources for the contents to minimize the overall downstream latency, thereby improving the QoS. In this regard,
first, we computed the optimum coverage radius of the eNB, constrained by the coverage probability. For the same, we utilized the urban and rural propagation scenarios mentioned in the 3GPP standards for 4G. Thereafter, the optimum radius of the gNBs satisfying the QoS constraints (data rate and outage probability) was determined. For computing the optimal gNB radius, we considered the RMa, UMa, and UMi-street canyon propagation scenarios mentioned in the 3GPP standards for 5G. Going further, we adopted a novel approach to compute the optimum number of non-overlapping gNBs required to provide high QoS within the coverage area of the eNB. We also performed a detailed analysis of the overall downstream latency and of the impact of various parameters affecting it. We compared the performance of the proposed DQN-based algorithm with that of pattern search, the genetic algorithm, and the multi-start search technique in terms of computation time, scalability, and overall downstream latency. The proposed DQN-based algorithm always offered a solution with a lower computation time, and it can also provide a solution for large-scale 5G HetNets with a large number of heterogeneous contents; we experienced only a marginal increase in the overall downstream latency. Finally, we also validated the solution of the proposed algorithm using the duality principle of the convex optimization technique.

Fig. 2 .
Fig. 2. An example network model, where the big circle represents the eNB and the small circles represent the gNBs. The black straight line represents a road without any voids between consecutive gNBs. The red straight line represents a road with voids between consecutive gNBs.

PL_4g_urban = 40(1 − 0.004 h_enb) log10(r_enb) − 18 log10(h_enb) + 21 log10(f_c) + 80 (in dB),

where f_c is the carrier frequency in MHz, h_enb is the eNB antenna height in m, and r_enb is in km. For a carrier frequency of 900 MHz, an eNB antenna height of 15 m, and a variance of 10 dB, the PL relation becomes PL_4g_urban = 120.869 + 16.325 ln(r_enb) (in dB). The PL (in dB) experienced in the rural propagation environment (PL_4g_rural), with a carrier frequency of 900 MHz, an eNB antenna height of 45 m, and a variance of 10 dB, becomes PL_4g_rural = 96.779 + 14.797 ln(r_enb).

Fig. 3 .
Fig. 3. Pictorial representation of the proposed DQN-based algorithm utilizing DRL framework of Section V-A.

Algorithm 1: Proposed Algorithm for Environment Progression
1 Inputs: Action set (A), maximum time steps per episode (T_f), episode break reward threshold (R_break)
2 Output: Total episode reward (R_ep)
3 Start
4 Initialize the state s
5 Set time step t = 0
6 Initialize reward array R_array (of size 1 × T_f) to zeroes
7 Initialize the total episode reward R_ep = 0
8 Start playing the episode
9 while t ≤ T_f and R_ep > R_break do
10   if action = k and k ≤ |s| then
11     s(k) = s(k) + δ
12   else s' = s
13   R_array(1, t) = R_total (Algorithm 2)
…
19   R_array(1, t) = R_total (Algorithm 2)
…
24   if t > T_f or R_ep ≤ R_break then
25     End the episode
26 end

Algorithm 2: Proposed Reward Module
1 Input: State (s), time step (t), number of contents (L), number of gNBs for all contents (N_g(c_l) ∀ c_l ∈ C)

Fig. 9 .
Fig. 9. (a) Overall downstream latency for RMa, UMa, and UMi-street canyon scenarios. (b) Number of cooperative gNBs needed with reference to the above propagation conditions.

Fig. 10 .
Fig. 10. (a) Impact of vehicle speed on the overall downstream latency for the urban propagation scenario with 4G and UMa 5G. (b) Number of cooperative gNBs needed for the above scenario.

Fig. 11 .
Fig. 11. Comparison in performance of the proposed DNN architecture with other DNN architectures for different numbers of neurons in each of the three hidden layers.

Fig. 12 .
Fig. 12. Comparison in performance of the proposed DNN architecture with other DNN architectures for different numbers of hidden layers.

Fig. 13 .
Fig. 13. Comparison in performance of the proposed DNN architecture with other DNN architectures for different learning rates.

Fig. 16 .
Fig. 16. Comparison between the proposed algorithm and the multi-start search technique.

Fig. 17
Fig. 17. Comparison among the proposed algorithm and other existing techniques in terms of the overall downstream latency.

Fig. 18 .
Fig. 18. Difference of norms between the dual and primal solutions.
• To the best of our knowledge, this is the first work that maps the non-linear programming (NLP)-based caching and computing resource allocation in a 5G NSA HetNet to an efficient DQN-based framework. The proposed DQN-based framework reduces the overall downstream latency, and offers low computation time and high scalability. The DQN-based framework allocates optimal resources in the cooperative gNBs and the eNB for contents demanded by the users.
• The proposed methodology also takes into account the practical network deployment scenario with/without coverage gaps between consecutive gNBs. The proposed framework can also be employed for vehicles moving on straight as well as curved roads.
• The proposed resource allocation policy is validated using the duality framework of convex optimization.

TABLE I
DESCRIPTION OF SYMBOLS USED IN PROBLEMS P4.1 AND P4.2

TABLE II
DESCRIPTION OF SYMBOLS USED FOR COMPUTATION OF OPTIMAL COVERAGE RADIUS OF ENB (c_1 is a constant given as …)

Algorithm 3 (fragment): 11 Execute action k on the environment and obtain the next state s' (utilizing Algorithm 1) and reward R_total (utilizing Algorithm 2); compute target values y_ĵ for each sample (ĵ = 1, 2, …, N) using y_ĵ = r_ĵ + γ max_k' Q(s', k'; w') (modified Bellman's equation); 12 Store transition (s, k, R_total, s') in experience buffer E; 13 Sample a mini-batch of N transitions from E; 14 …

TABLE V LIST
OF PARAMETERS USED TO COMPUTE ENB RADIUS

TABLE VIII OPTIMAL
NUMBER OF NON-OVERLAPPING GNBS THAT CAN BE PLACED WITHIN THE COVERAGE RADIUS OF ENB

TABLE IX LIST
OF PARAMETERS USED TO COMPUTE OVERALL DOWNSTREAM LATENCY

TABLE XI PARAMETERS
USED FOR TRAINING THE DNN