Deep Reinforcement Learning for RSMA-Based Multi-Functional Wireless Networks

The upcoming sixth generation (6G) is expected to support a wide range of applications that require efficient sensing, accurate localization, and reliable communication capabilities. Furthermore, 6G is expected to catalyze new use cases that operate in extreme and hazardous environmental conditions and rely on ultra-small, low-cost wireless devices. Thus, developing sustainable multi-functional wireless networks that can incorporate billions of low-power devices and support their sensing and communication requirements on top of energy harvesting capability is of paramount importance. Motivated by this, we consider in this work a rate-splitting multiple access (RSMA)-based multi-functional wireless network with sensing, energy harvesting, and communication capabilities. We employ trust region policy optimization (TRPO), a deep reinforcement learning (DRL) algorithm, to efficiently allocate the available resources and manage the interference between the three functionalities. TRPO is capable of learning a near-optimal policy for the resource allocation problem in a complex and dynamic environment. This enables us to obtain near-optimal transmit precoders, power splitting ratios, and rate-splitting among the common and private rates in a multiple access setting. Simulation results demonstrate the effectiveness of RSMA in mitigating the interference in such multi-functional networks and its capability to accommodate the rate and energy harvesting requirements of the devices while still sensing multiple targets.


I. INTRODUCTION
The sixth generation (6G) is envisaged to unleash the full potential of abundant autonomous services comprising past as well as newly emerging trends. More precisely, 6G is envisioned to bring novel disruptive wireless technologies and innovative network architectures. It is further envisaged that 6G will ultimately realize the next generation of connectivity, driven by the evolution from connected everything to connected intelligence. Besides that, supporting high-precision sensing for tactile and haptic applications, to provide the required sensory experience at different levels, is expected to be a key area of innovation in 6G networks [1]. Nevertheless, the scarcity of spectrum resources leads to severe spectrum congestion, which hinders achieving the high data rates needed to perform both sensing and communication. Furthermore, deploying a large number of devices would require a significant amount of power, placing a considerable burden on the energy consumption of the network. Moreover, in extreme and inaccessible conditions, low-power devices require frequent battery replacement or recharging, which is cost-inefficient, difficult, and hazardous. Therefore, it is of paramount importance to design sustainable multi-functional wireless networks that can incorporate billions of low-power devices and support their sensing and communication requirements on top of energy harvesting capability.
Simultaneous wireless information and power transfer (SWIPT) has been proposed to allow a signal to carry both energy and information simultaneously, enabling seamless integration of energy harvesting and wireless information transmission while mitigating interference. However, since devices cannot perform simultaneous energy harvesting and data detection on the same received signal, various SWIPT receiver architectures have been developed, including separate receiver hardware, power splitting, time switching, and antenna switching mechanisms, to coordinate and schedule the energy harvesting and data decoding functionalities [2]. Among these, the power splitting architecture achieves the best rate/energy tradeoff; hence, it is adopted in this work. Likewise, 6G is envisaged to enable localization and sensing as a service on top of the communication functionality, known as integrated sensing and communication (ISAC), by sharing the same time-frequency-spatial resources. In such systems, wireless signals are used not only to transmit information but also to sense the surrounding targets and environment. Thus, efficient waveform design and resource allocation are of paramount importance to reduce the interference incurred between sensing and communication in ISAC systems.
Extensive research has focused on optimizing resources for SWIPT and ISAC systems separately, with solutions that accommodate sensing and communication in allotted orthogonal radio resources such as the time/frequency/space/code domains [3]. Beamforming techniques have also been proposed for joint MIMO radar-communication systems [4], [5]. Likewise, non-orthogonal multiple access (NOMA) has been employed to multiplex sensing and communication signals in the power domain, but it presents significant challenges such as high complexity and sensitivity to channel disparity [6]. Recently, rate-splitting multiple access (RSMA) has been proposed as a generalized multiple access scheme that includes power-domain NOMA and MIMO linear precoding as special cases while outperforming them. RSMA splits user messages into common and private parts, where the common parts are jointly encoded and the private parts are independently encoded into multiple private streams. Owing to its better management of the interference between sensing and communication, RSMA has been considered in ISAC-based systems [7].
On the other hand, the optimization of SWIPT-based transmission has been studied in various works. Examples include the joint design of transmit beamforming and receive power splitting for a multiuser MISO broadcast system in [8], SWIPT-enabled NOMA systems in [9], and the integration of RSMA with SWIPT using a reconfigurable intelligent surface (RIS) in [10]. These works aimed to improve the efficiency of SWIPT-based systems under power constraints.
Despite the existing efforts on energy harvesting, sensing, and communication, the above works did not consider multi-functional networks wherein a single base station transmits signals that simultaneously serve communication, sensing, and energy harvesting users. Furthermore, no existing work has investigated the use of RSMA to allocate the available resources among the three functionalities. With this motivation, the aim of this work is to employ RSMA for multi-functional networks that encompass communication, sensing, and harvesting requirements. In order to allocate the available resources, we formulate an optimization problem that couples the maximization of the sum rate, the minimization of the mean squared error (MSE) of the beampattern matching design, and the energy harvesting requirements. In complex and dynamic environments, researchers often turn to advanced optimization tools such as deep reinforcement learning (DRL) to solve intricate resource allocation problems. In this work, we propose the use of trust region policy optimization (TRPO) [11], a robust and effective DRL algorithm for continuous control problems. It uses a trust region constraint to ensure stability and avoid drastic policy changes during optimization [11]. Employing TRPO enables us to learn a near-optimal policy for the resource allocation problem in a complex and dynamic environment. Finally, by solving the problem with TRPO, we obtain the transmit precoders, power splitting ratios, designed beampattern, and rate-splitting among the common and private rates.

II. SYSTEM MODEL
We consider a downlink multi-user MISO system consisting of a multi-functional base station (BS) equipped with a uniform linear array (ULA) of N transmit antennas. The system serves K information decoding (ID) users, U radar targets to be detected at azimuth angles of interest θ_u, ∀u ∈ U = {1, 2, ..., U}, and L energy harvesting (EH) users, each equipped with a power splitting circuit to simultaneously decode information and harvest energy. Without loss of generality, all users are assumed to have a single antenna. We denote by h_k^{ID} ∈ C^{N×1} the channel vector from the BS to the k-th ID user and by h_l^{EH} ∈ C^{N×1} the channel vector from the BS to the l-th EH user.
In order to provide efficient management of the incurred interference, RSMA is adopted to allocate the resources among the ID and EH users as well as for the sensing waveform. Specifically, the message u_k of the k-th ID user and the message m_l of the l-th EH user are split into common and private parts. The common parts are jointly encoded into a single common stream s_c to be decoded by all users, where E(|s_c|^2) is assumed to be 1, while the private messages of the ID users are encoded separately into independent private streams {s_1^{ID}, s_2^{ID}, ..., s_K^{ID}} and those of the EH users into {s_1^{EH}, s_2^{EH}, ..., s_L^{EH}}. Then, in order to reduce the effect of the incurred interference, all streams are precoded at the BS, and the transmitted signal from the antenna array is

x = w_c s_c + Σ_{k=1}^{K} w_k^{ID} s_k^{ID} + Σ_{l=1}^{L} w_l^{EH} s_l^{EH},

where w_c ∈ C^{N×1}, w_k^{ID} ∈ C^{N×1}, and w_l^{EH} ∈ C^{N×1} represent the precoding vectors for the common stream, the k-th ID user private stream, and the l-th EH user private stream, respectively. The received signal at the i-th user of type j ∈ {ID, EH} is

y_i^j = (h_i^j)^H x + n_i^j,

where n_i^j denotes the additive white Gaussian noise (AWGN), assumed to have zero mean and variance σ_{j,i}^2. At each ID user, the common stream s_c is decoded first, treating the interference from all private streams as noise. Therefore, the achievable rate at the k-th ID user due to decoding the common stream is

R_{c,k}^{ID} = log_2( 1 + |(h_k^{ID})^H w_c|^2 / ( Σ_{k'=1}^{K} |(h_k^{ID})^H w_{k'}^{ID}|^2 + Σ_{l=1}^{L} |(h_k^{ID})^H w_l^{EH}|^2 + σ_{ID,k}^2 ) ).
Assuming that the k-th ID user successfully decodes the common stream and removes its contribution via successive interference cancellation (SIC), it then decodes its private stream s_k^{ID}, treating all other private streams as noise. Therefore, the achievable rate at the k-th ID user due to decoding the private stream is

R_k^{ID} = log_2( 1 + |(h_k^{ID})^H w_k^{ID}|^2 / ( Σ_{k'≠k} |(h_k^{ID})^H w_{k'}^{ID}|^2 + Σ_{l=1}^{L} |(h_k^{ID})^H w_l^{EH}|^2 + σ_{ID,k}^2 ) ).
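To make the rate expressions above concrete, the following sketch evaluates the common-stream and private-stream rates at an ID user. The dimensions, noise level, and random (unoptimized) channels and precoders are illustrative assumptions, not values or designs from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, L = 10, 2, 2        # antennas, ID users, EH users (illustrative)
sigma2 = 1.0              # AWGN variance at the ID receiver (illustrative)

# Random channels and (unoptimized) precoders, for illustration only.
h_id = rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))
w_c = rng.normal(size=N) + 1j * rng.normal(size=N)
w_id = rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))
w_eh = rng.normal(size=(L, N)) + 1j * rng.normal(size=(L, N))

def common_rate(k):
    """Rate of the common stream at ID user k: all private streams are noise."""
    h = h_id[k]
    sig = np.abs(h.conj() @ w_c) ** 2
    interf = sum(np.abs(h.conj() @ w) ** 2 for w in w_id) \
           + sum(np.abs(h.conj() @ w) ** 2 for w in w_eh)
    return np.log2(1 + sig / (interf + sigma2))

def private_rate(k):
    """Rate of the private stream at ID user k after SIC removes the common stream."""
    h = h_id[k]
    sig = np.abs(h.conj() @ w_id[k]) ** 2
    interf = sum(np.abs(h.conj() @ w_id[i]) ** 2 for i in range(K) if i != k) \
           + sum(np.abs(h.conj() @ w) ** 2 for w in w_eh)
    return np.log2(1 + sig / (interf + sigma2))

# The decodable common rate is the worst case over users that must decode s_c
# (here only the ID users are shown; the EH users enter the same way).
R_c = min(common_rate(k) for k in range(K))
```

Note how SIC changes only the interference term: the common stream disappears from the private-rate denominator, while all other private streams remain.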

IEEE Global Communications Conference: Mobile and Wireless Networks
In contrast, at the l-th EH user, the received signal is first divided into two parts using a power split ratio α_l ∈ (0, 1) before the detection process. Specifically, the portion √α_l · y_l^{EH} is forwarded to the detection circuit, while the remaining portion √(1 − α_l) · y_l^{EH} is used for energy harvesting. The detection procedure at an EH user mirrors that of the ID users; consequently, the achievable rates due to decoding the common and private streams at the l-th EH user are R_{c,l}^{EH} = log_2(1 + ρ_{c,l}^{EH}) and R_l^{EH} = log_2(1 + ρ_{l,l}^{EH}), where δ_l^2 is the variance of the zero-mean complex Gaussian random variable representing the circuit noise.
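The power-splitting step just described, together with the non-linear harvesting model used later in this section, can be sketched as follows. The circuit constants `a`, `b`, and the saturation power `E_max` are illustrative defaults, and `y_power` is an arbitrary received power:

```python
import numpy as np

def harvested_energy(E_rx, a=6400.0, b=0.003, E_max=0.2):
    """Non-linear (sigmoid-saturation) harvesting model. a and b are
    harvester-circuit constants, E_max is the saturation power, and the
    offset beta zeroes the model's output at zero input power (units: mW)."""
    beta = 1.0 / (1.0 + np.exp(a * b))
    logistic = 1.0 / (1.0 + np.exp(-a * (E_rx - b)))
    return (E_max / (1.0 - beta)) * (logistic - beta)

def split_receive(y_power, alpha):
    """Power splitting: a fraction alpha of the received power goes to the
    information decoder, the remaining 1 - alpha to the energy harvester."""
    return alpha * y_power, (1.0 - alpha) * y_power

to_decoder, to_harvester = split_receive(y_power=0.01, alpha=0.6)
E_l = harvested_energy(to_harvester)
```

The model is zero at zero input, increases monotonically, and saturates at `E_max`, which is what distinguishes it from the idealized linear harvesting model.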
The SINRs ρ_{c,l}^{EH} and ρ_{l,l}^{EH} of the common and private streams at the l-th EH user are given by, respectively,

ρ_{c,l}^{EH} = α_l |(h_l^{EH})^H w_c|^2 / ( α_l ( Σ_{k=1}^{K} |(h_l^{EH})^H w_k^{ID}|^2 + Σ_{l'=1}^{L} |(h_l^{EH})^H w_{l'}^{EH}|^2 + σ_{EH,l}^2 ) + δ_l^2 ),

ρ_{l,l}^{EH} = α_l |(h_l^{EH})^H w_l^{EH}|^2 / ( α_l ( Σ_{k=1}^{K} |(h_l^{EH})^H w_k^{ID}|^2 + Σ_{l'≠l} |(h_l^{EH})^H w_{l'}^{EH}|^2 + σ_{EH,l}^2 ) + δ_l^2 ).

In order to guarantee that the common stream s_c is successfully decoded by all ID and EH users, its rate has to be set to the worst-case scenario, i.e.,

R_c = min{ R_{c,1}^{ID}, ..., R_{c,K}^{ID}, R_{c,1}^{EH}, ..., R_{c,L}^{EH} }.

Then, R_c is shared among all ID and EH users, where each user's contribution to the common rate has to be optimized. Therefore, the total achievable rate of the i-th user of type j ∈ {ID, EH} is

R_{tot,i}^{j} = C_i^j + R_i^j,

where C_i^j denotes the contribution of the i-th ID or EH user to the common rate. In addition to information decoding, the l-th EH user harvests energy from the remaining part of the received signal, √(1 − α_l) · y_l^{EH}. Adopting the non-linear harvesting model, the harvested energy at the l-th EH user is expressed as [9]

E_l = ( E_l^{max} / (1 − β_l) ) ( 1 / (1 + e^{−a_l (E_l^{rx} − b_l)}) − β_l ),

where β_l = 1/(1 + e^{a_l b_l}), the parameters a_l and b_l are constants related to the harvester circuit, and E_l^{max} denotes the maximum power that can be harvested by the l-th EH user. Furthermore, E_l^{rx} is the received signal power at the l-th EH user, written as

E_l^{rx} = (1 − α_l) ( |(h_l^{EH})^H w_c|^2 + Σ_{k=1}^{K} |(h_l^{EH})^H w_k^{ID}|^2 + Σ_{l'=1}^{L} |(h_l^{EH})^H w_{l'}^{EH}|^2 ).

It is worth noting that the precoders for the EH and ID users have to be designed carefully in order to meet several requirements: the transmit power budget, the minimum rate requirements of the ID and EH users, the minimum harvested energy requirements, and matching the desired beampattern for target detection. Designing a highly directional transmit beampattern is equivalent to designing the covariance matrix of the probing signals, i.e., of the communication and harvesting signals, R_x = W W^H, where W = [w_c, w_1^{ID}, ..., w_K^{ID}, w_1^{EH}, ..., w_L^{EH}] ∈ C^{N×(K+L+1)}. In this work, we employ the MSE of the beampattern matching design as the metric for the sensing functionality:

MSE(W, γ) = (1/M) Σ_{m=1}^{M} | γ P_d(θ_m) − v^H(θ_m) R_x v(θ_m) |^2,

where θ_m is the azimuth angle of the m-th grid point out of M grid points covering the detection range [−π/2, π/2]. Additionally, P_d(θ_m) denotes the desired beampattern level at θ_m, which is assumed to be known in advance and generated using beampattern synthesis methods for the given ULA parameters and the azimuth angles of the targets of interest. Furthermore, γ > 0 represents the scaling factor of P_d(θ_m), and

v(θ_m) = [1, e^{j(2π/λ) d sin(θ_m)}, ..., e^{j(2π/λ) d (N−1) sin(θ_m)}]^T ∈ C^{N×1}

is the steering vector of the ULA at the BS, where d represents the distance between adjacent array elements. Following the work in [12], the desired beampattern is designed as

P_d(θ_m) = 1 if |θ_m − θ_u| ≤ Δ for some u ∈ U, and P_d(θ_m) = 0 otherwise,

where 2Δ denotes the beamwidth. It can be noted from the MSE metric that the better the matching between the desired and transmit beampatterns, the lower the MSE and, hence, the higher the signal-to-noise ratio (SNR) at the sensing targets.
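The beampattern-matching MSE can be evaluated numerically as follows. The sketch uses illustrative targets at ±40°, a 5° half-beamwidth, and a random precoder matrix rather than an optimized design:

```python
import numpy as np

def steering(theta, N=10, d_over_lambda=0.5):
    """ULA steering vector for azimuth angle theta (radians), spacing lambda/2."""
    n = np.arange(N)
    return np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(theta))

def beampattern_mse(W, P_d, thetas, gamma):
    """MSE between the scaled desired beampattern gamma * P_d and the
    transmit beampattern v^H R_x v, with covariance R_x = W W^H."""
    R_x = W @ W.conj().T
    bp = np.array([np.real(steering(t, W.shape[0]).conj() @ R_x
                           @ steering(t, W.shape[0])) for t in thetas])
    return np.mean(np.abs(gamma * P_d - bp) ** 2)

# Illustrative desired beampattern: unit level within a half-beamwidth of each target.
M, delta = 181, np.deg2rad(5)
thetas = np.linspace(-np.pi / 2, np.pi / 2, M)
targets = np.deg2rad([-40, 40])
P_d = np.array([1.0 if any(abs(t - u) <= delta for u in targets) else 0.0
                for t in thetas])

rng = np.random.default_rng(1)
W = rng.normal(size=(10, 5)) + 1j * rng.normal(size=(10, 5))  # random, not optimized
mse = beampattern_mse(W, P_d, thetas, gamma=1.0)
```

Since R_x is Hermitian positive semidefinite, the quadratic form v^H R_x v is real and nonnegative, so taking the real part only discards numerical round-off.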
Assuming that the BS has perfect knowledge of the CSI of the different users, the sum rate maximization problem coupled with the MSE minimization for the considered system model is formulated as

(P1): max_{W, α, C, γ}  Σ_{k=1}^{K} R_{tot,k}^{ID} + Σ_{l=1}^{L} R_{tot,l}^{EH} − λ · MSE(W, γ),  subject to (P1.a)-(P1.g),

where 0 ≤ λ ≤ 1 is the regularization parameter that sets the trade-off between the sum rate and the MSE of the communication and sensing functionalities. Moreover, C = [C_1^{ID}, ..., C_K^{ID}, C_1^{EH}, ..., C_L^{EH}] denotes the common rate vector to be optimized. Constraints (P1.a), (P1.b), and (P1.g) ensure that the common stream is successfully decoded by all ID and EH users, while constraints (P1.c) and (P1.d) represent the minimum rate requirements of the ID and EH users, respectively. Furthermore, (P1.e) sets the minimum required energy to be harvested by the l-th EH user. Finally, (P1.f) ensures that the transmitted power does not exceed the power budget of the ULA. The primary challenge posed by the non-convex problem (P1) is the interdependence between the power split ratios, precoders, and common rate vector. To address this, we employ a model-free approach based on DRL, which can handle inaccuracies in both modeling and online decision-making while tackling (P1).

III. DEEP REINFORCEMENT LEARNING-BASED SOLUTION

A. MDP Formulation
The Markov decision process (MDP) is defined by the tuple ⟨S, A, P, R, γ⟩. In our formulation, the state space S comprises the channel vectors, the minimum harvested energy requirements, and the minimum rate requirements of the users. The action space A comprises the precoding vectors, the power split ratios, the scaling factor, and the common rate vector. We assume a stochastic state transition governed by the transition probability function P [13].

B. Reward Function and Policy Optimization
The reward function R encourages the agent to maximize the achievable rates of the ID and EH users while minimizing the radar MSE, where a penalty term ζ for violating constraints is given as the ratio of the number of satisfied constraints to the total number of constraints. The agent's goal is to maximize the expected cumulative reward J(θ) by finding the optimal policy π(a|s) that maximizes the state-action value function Q(s, a). This function estimates the expected cumulative reward obtained by taking action a in state s and following the optimal policy thereafter.
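The exact combination of the objective and the penalty ζ is not reproduced here, so the sketch below assumes a multiplicative penalty on the regularized objective, which is one common choice; the function name and arguments are illustrative:

```python
def reward(sum_rate, mse, satisfied, total, lam=1e-4):
    """Reward sketch: regularized objective (sum rate minus lam * MSE),
    scaled by the fraction of satisfied constraints (zeta in the paper).
    The multiplicative combination is an assumption, not the paper's formula."""
    zeta = satisfied / total          # ratio of satisfied constraints
    return (sum_rate - lam * mse) * zeta

r = reward(sum_rate=20.0, mse=5.0, satisfied=6, total=7)
```

With this shaping, a fully feasible solution receives the unpenalized objective, while each violated constraint proportionally shrinks the reward, steering the agent toward the feasible region without a hard rejection.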
Algorithm 1 TRPO
1: Initialize policy parameters θ and value function parameters ϕ
2: while not converged do
3:    Collect a set of trajectories D by running the current policy π_θ
4:    Compute the advantages A^{π_θ}(s_t, a_t) for each state-action pair in D using a value function estimator, e.g., GAE with value function V_ϕ^{π_θ}
5:    Update the value function parameters ϕ using D
6:    Compute the surrogate objective function L(θ) using D and the advantages A^{π_θ}(s_t, a_t)
7:    Compute the policy update θ_new by solving the constrained optimization problem
8:    Update the policy parameters θ ← θ_new
9: end while
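The GAE computation in step 4 can be sketched as follows. The discount and GAE factors match the training settings used in this work, while the short rollout and value estimates are illustrative:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.
    `values` has length len(rewards) + 1 (bootstrap value appended);
    advantages are exponentially weighted sums of TD errors."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last = delta + gamma * lam * last                       # backward recursion
        adv[t] = last
    return adv

# Illustrative 3-step rollout with a terminal bootstrap value of 0.
adv = gae_advantages(rewards=[1.0, 0.5, 2.0], values=[0.2, 0.1, 0.4, 0.0])
```

The factor λ_GAE trades bias for variance: λ_GAE = 0 reduces to one-step TD errors, while λ_GAE = 1 recovers Monte Carlo returns minus the value baseline.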

C. DRL Agent
We use the TRPO algorithm to train the DRL agent; TRPO is a policy gradient method that optimizes the policy parameters by gradient ascent on the expected reward. TRPO maintains stability by optimizing a surrogate objective function under a trust region constraint, ensuring the updated policy stays close to the previous one [11]. TRPO is well suited to problems with continuous action spaces, making it an effective choice for optimizing the precoding vectors W, power split ratios α_l, scaling factor γ, and common rate vector C.

1) Design of the TRPO Agent:
The TRPO agent consists of a policy network and a value network. The policy network maps system states to the action space, while the value network estimates the expected rewards for state-action pairs [11]. The system state consists of the channel vectors of the ID and EH users, the required energy ξ_l of the l-th EH user, and the minimum rate requirements R_th,k^{ID} and R_th,l^{EH}. The action space includes the precoding vectors W, power split ratios α_l, scaling factor γ, and common rate vector C. The policy network has two hidden layers with 256 neurons each and an output layer for each element of the action space; it uses ReLU activation functions for the hidden layers and a softmax for the output layer. The value network, with a similar structure, estimates the expected rewards using ReLU activation functions.
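As a framework-agnostic illustration of the architecture just described, a numpy forward pass might look as follows. The hidden sizes and activations follow the paper, while the weight initialization, input dimension, and output dimension are placeholders:

```python
import numpy as np

def init_mlp(in_dim, hidden=256, out_dim=8, seed=0):
    """Weights for a 2-hidden-layer MLP (sizes per the paper; init is illustrative)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward_policy(params, state):
    """Two ReLU hidden layers followed by a softmax output layer."""
    x = state
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)      # ReLU hidden layers
    W, b = params[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

params = init_mlp(in_dim=24)                 # placeholder state dimension
probs = forward_policy(params, np.ones(24))
```

In practice the continuous actions (precoders, split ratios) would be drawn from a parameterized distribution rather than read directly from the softmax; the sketch only mirrors the layer structure stated above.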
2) Training the TRPO Agent: During training, the TRPO agent interacts with the environment by selecting actions according to the current policy and receiving rewards as feedback. The agent uses this experience to update the policy and value networks via the TRPO algorithm, optimizing the surrogate objective function under a trust region constraint to maintain stability [11]. The policy optimization problem in TRPO can be written as

max_θ E[ (π_θ(a|s) / π_{θ_old}(a|s)) A^{π_{θ_old}}(s, a) ],

subject to a constraint ensuring the new policy remains close to the previous one:

E[ D_KL( π_{θ_old}(·|s) || π_θ(·|s) ) ] ≤ D_KL^{max}.

TRPO employs the conjugate gradient method to obtain the search direction and performs a line search to find the step size satisfying the constraint [11]. The TRPO algorithm iteratively performs the steps illustrated in Algorithm 1.
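A minimal sketch of this update for a single-state categorical policy is shown below. The conjugate-gradient search direction is taken as given (here a random vector), so this only illustrates the surrogate/KL bookkeeping and the backtracking line search, not full TRPO:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def surrogate_and_kl(theta_old, theta, actions, advantages):
    """Importance-ratio surrogate objective and KL(old || new) for a
    single-state categorical policy parameterized by logits."""
    p_old, p_new = softmax(theta_old), softmax(theta)
    ratios = p_new[actions] / p_old[actions]
    surr = np.mean(ratios * advantages)
    kl = np.sum(p_old * np.log(p_old / p_new))
    return surr, kl

def line_search(theta_old, direction, actions, advantages, kl_max=0.01, backtracks=10):
    """Backtrack along `direction` until the KL constraint holds and the
    surrogate improves; keep the old parameters if no step qualifies."""
    surr_old, _ = surrogate_and_kl(theta_old, theta_old, actions, advantages)
    step = 1.0
    for _ in range(backtracks):
        theta = theta_old + step * direction
        surr, kl = surrogate_and_kl(theta_old, theta, actions, advantages)
        if kl <= kl_max and surr > surr_old:
            return theta
        step *= 0.5
    return theta_old

rng = np.random.default_rng(0)
theta0 = np.zeros(4)                         # 4 discrete actions (illustrative)
acts = rng.integers(0, 4, size=32)
advs = rng.normal(size=32)
theta1 = line_search(theta0, direction=rng.normal(size=4), actions=acts, advantages=advs)
```

At the old parameters all importance ratios equal one, so the surrogate reduces to the mean advantage; the line search then shrinks the step until both the KL limit and monotone improvement hold.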

IV. SIMULATION RESULTS
In this section, simulation results are presented to demonstrate the performance of the proposed framework. Without loss of generality, we assume the following parameters: N = 10, K = 2, L = 2, and U = 2. Furthermore, all noise variances are normalized to 1 and P_tx = 25 dB; hence, the transmit power also denotes the transmit SNR. The spacing d of the ULA elements is chosen as λ/2. Unless otherwise stated, the users' minimum rate requirements are fixed to R_th,k^{ID} = R_th,l^{EH} = R_th = 1 bps/Hz. For the energy harvester module, the parameters are set as follows [10]: a_l = a = 6400, b_l = b = 0.003 mW, E_l^{max} = E^{max} = 0.2 mW, and ξ_l = ξ = −16 dBm. The channels associated with the ID and EH users are assumed to consist only of the non-line-of-sight (NLoS) component, i.e., independent and identically distributed (i.i.d.) complex Gaussian entries with a given variance. For beampattern synthesis, we use the desired beampattern of Section II with beamwidth 2Δ = 10°, M = 181 grid points, and target azimuth angles of interest 40° and −40°. During agent training, we set the experience horizon H to 512, the minibatch size B to 128, the entropy loss weight to 0.01, the discount factor to 0.99, the Kullback-Leibler (KL) divergence limit D_KL to 0.01, the advantage estimation method to generalized advantage estimation (GAE), and the GAE factor to 0.95.

Fig. 1 presents the convergence of the TRPO agent for different regularization parameter values λ over 1,500 episodes. It can be observed that the overall performance improves as the episode count increases. Interestingly, convergence with λ = 10^{-4} is faster than with λ = 10^{-6}. Moreover, once convergence is achieved, the average reward obtained with λ = 10^{-6} is higher than that with λ = 10^{-4}, indicating that a lower value of λ leads to a higher objective value. This observation is consistent with the fact that the objective is the difference between the sum rate and the MSE, with λ serving as the weight of the MSE: as λ decreases, the weight of the MSE diminishes, and the objective value increases.

Fig. 2 illustrates the beampattern approximation performance for RSMA-enabled multi-functional networks resulting from solving problem (P1) for R_th = 1 bps/Hz, P_tx = 25 dB, ξ = −16 dBm, and two regularization parameter values, λ = 10^{-5} and λ = 10^{-4}. The proposed scheme is capable of approximating the desired beampattern for radar detection while still satisfying the energy harvesting requirements. Furthermore, the beampattern design depends strongly on the selected regularization value. From Figs. 2a and 2b, it can be noted that the higher the value of λ, the lower the MSE. On the other hand, higher regularization values degrade the sum rate: the sum rate drops from 25.21 bps/Hz to nearly 16.96 bps/Hz as λ increases from 10^{-5} to 10^{-4}. This reveals a trade-off between the beampattern approximation MSE and the sum rate.

Finally, Fig. 3 depicts the sum rate performance for two different regularization parameters. We also compare the results obtained by the DRL solution with iterative sequential quadratic programming (SQP) as a benchmark. As the figure reveals, the sum rate increases with the transmit power, and the DRL solution achieves superior performance compared to the benchmark. Furthermore, the sum rate increases as the regularization parameter decreases, i.e., with higher priority on the rate performance, while still satisfying the minimum harvested energy level.

V. CONCLUSION
In this paper, we considered an RSMA-based multi-functional wireless network with sensing, energy harvesting, and communication capabilities. To efficiently allocate the available resources and manage the interference between the three functionalities, we employed trust region policy optimization (TRPO), a deep reinforcement learning (DRL) algorithm. TRPO can efficiently learn a near-optimal policy for the resource allocation problem in a complex and dynamic environment, enabling us to obtain near-optimal transmit precoders, power splitting ratios, and rate-splitting among the common and private rates in a multiple access setting. Simulation results demonstrated the effectiveness of RSMA in mitigating the interference in such multi-functional networks and its capability to accommodate the rate and energy harvesting requirements of the devices while still sensing multiple targets.

Fig. 3. Sum rate performance versus transmit SNR for different regularization parameters (benchmark: SQP).