Buffer-Aided Relay Selection for Cooperative Hybrid NOMA/OMA Networks With Asynchronous Deep Reinforcement Learning

This paper investigates asynchronous reinforcement learning algorithms for joint buffer-aided relay selection and power allocation in the non-orthogonal-multiple-access (NOMA) relay network. With the hybrid NOMA/OMA transmission, we investigate joint relay selection and power allocation to maximize the throughput under a delay constraint. To solve this complicated high-dimensional optimization problem, we propose two asynchronous reinforcement learning-based schemes: the asynchronous deep Q-learning network (ADQN)-based scheme and the asynchronous advantage actor-critic (A3C)-based scheme. The A3C-based scheme achieves better performance and robustness when the action space is large, while the ADQN-based scheme converges faster with a small action space. Moreover, a-priori information is exploited to improve the convergence of the proposed schemes. The simulation results show that the proposed asynchronous learning-based schemes can learn from the environment and achieve good convergence.

I. INTRODUCTION

Non-orthogonal multiple access (NOMA) has been applied in many areas, such as the Internet of Things (IoT) [1], vehicle-to-everything [2], heterogeneous networks [3], signal detection [4], and edge computing [5]. Of particular interest to this paper is NOMA in cooperative networks, where many works have appeared. For example, in [6], a two-stage max-min relay selection algorithm with fixed power allocation based on users' quality of service (QoS) and channel quality was proposed to reduce the outage probability. In [7], the outage performance was enhanced by a relay selection scheme with adaptive power allocation in cooperative NOMA networks. In [8], based on the instantaneous channel state, two two-stage schemes with fixed and adaptive power allocation for cooperative NOMA networks were proposed to reduce the outage probability. In [9], the diversity orders of single-stage and two-stage relay selection algorithms were obtained with full-duplex (FD)/half-duplex (HD) relays in cooperative NOMA networks.
The buffer-aided relay scheme has been widely applied in cooperative networks to improve the system performance such as the outage probability and throughput [10]- [13]. The buffer-aided relaying technique has been implemented in NOMA transmissions in [14]- [16]. The throughput performance was enhanced by an adaptive transmission mode selection scheme in [14]. In [15], the average transmission rate was improved by using NOMA transmission in a buffer-aided cooperative network. In [16], a buffer-state-based relay selection was proposed to reduce the outage and average packet delay in the NOMA network.
To further improve the throughput, the hybrid NOMA/OMA transmission mode has been investigated in various works. In [17], a buffer-aided relaying system with adaptive power allocation was proposed to improve the outage performance in the hybrid downlink NOMA/OMA network. In [18], based on the instantaneous channel state information (CSI) and the buffer state, the outage probability was reduced by giving high priority to the NOMA transmission mode for uplink buffer-aided networks. In [19], the average sum rate and delay performance were improved by giving higher priority to relay-to-user transmissions in the downlink cooperative network with hybrid NOMA/OMA relay-to-user transmissions. Similarly, in [20], prioritization-based buffer-aided relay selection was proposed to enhance the throughput and outage performance for downlink hybrid NOMA/OMA cooperative networks. Of particular relevance to this paper are [19] and [20], which also investigate relay selection in the 2-hop
buffer-aided NOMA relay network. In both works, NOMA is only applied over the relay-to-destination hop, while time-division-multiple-access (TDMA) is applied to the source-to-relay hop. By dividing one time slot into two sub-slots, the source can send two data packets in two sub-slots to the selected relay. This, however, doubles the data rate over the corresponding link. In this paper, we assume fixed-rate transmission over every link and TDMA is not applied. Instead, hybrid NOMA/OMA transmission is applied over both the source-to-relay and relay-to-destination hops.
On the other hand, while employing buffers at the relays improves the data throughput, it may increase the packet delay, which is a crucial issue in future wireless communications [20]-[22]. In this paper, we investigate joint buffer-aided relay selection and power allocation in hybrid NOMA/OMA networks to improve the data throughput under delay constraints. This is a complicated high-dimensional problem. Deep reinforcement learning algorithms such as the deep Q-network (DQN) can be introduced to solve such complicated high-dimensional problems [23], [24]. Various works have been done in this area. For example, the outage performance was improved by applying a DQN-based relay selection scheme in [25]. In [26], a DQN-based hybrid relay selection algorithm was proposed to improve robustness and speed up learning. Although DQN has high sample efficiency in training, the impact of overestimation is particularly serious in Q-learning algorithms [27], [28]. To address this problem, we introduce the asynchronous DQN algorithm. Moreover, in high-dimensional systems with a large action space, the DQN algorithm suffers from high sample complexity, which leads to slow convergence and local maxima [29]. We therefore investigate the asynchronous advantage actor-critic (A3C) algorithm to further improve the convergence performance with a large action space [30]. This paper investigates the joint optimization of relay selection and power allocation to maximize the throughput in cooperative NOMA networks, subject to the delay constraint, by using asynchronous reinforcement learning algorithms with a-priori information. The main contributions of this paper are summarized as follows:
• We investigate the hybrid NOMA/OMA transmission mode in buffer-aided cooperative networks. Unlike existing works [19], [20], the hybrid NOMA/OMA transmission is applied over both the source-to-relay and relay-to-user hops, and the delay constraint and power allocation are also considered.
• We propose the asynchronous DQN-based and A3C-based schemes to optimize the throughput with the delay constraint. A-priori information is used to reduce the exploration range of the proposed schemes and improve the convergence performance during training.
• The simulation results show that with a-priori information, both the A3C-based scheme (A3C-PI) and the asynchronous DQN-based scheme (ADQN-PI) outperform the benchmark. ADQN-PI achieves high performance with a small action space, while A3C-PI performs significantly better with a large action space.
• For comparison, we also show the results of OMA-only and NOMA-only relay selection in the simulation, based on similar reinforcement learning algorithms.
The results show that the proposed hybrid NOMA/OMA scheme takes advantage of both the OMA-only and NOMA-only approaches. The rest of the paper is organized as follows: Section II introduces the system model and the problem formulation. The tuples of reinforcement learning and the a-priori information are defined in Section III. Section IV investigates the ADQN-PI algorithm. The A3C-PI scheme is proposed in Section V. Simulation results verify the proposed schemes in Section VI. Finally, Section VII concludes the paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As shown in Fig. 1, we consider a two-hop buffer-aided relay network which consists of one source S, K HD decode-and-forward (DF) relays R_k, where k ∈ {1, . . . , K}, and two users U_1 and U_2. We assume that there is no direct link between S and the two users, as in [31], and each relay node is equipped with two data buffers B_{k,1} and B_{k,2} of size L for U_1 and U_2, respectively, where L denotes the maximum number of packets stored in a buffer. The buffers follow the First-In-First-Out rule. Moreover, we assume that S knows the instantaneous CSI of all channels and the buffer states, and makes the decision at each time slot, as in [32], [33]. All channels are assumed to be quasi-static Rayleigh fading: the channel coefficient g_{ij} = h_{ij} d_{ij}^{-β/2} between nodes i and j remains unchanged during one time slot and varies from one time slot to another [13], where h_{ij} is the fading coefficient, d_{ij} is the distance between the two nodes, and β is the path loss exponent.

A. Transmission Mode
At a given time slot, in terms of the S → R_k links, the transmission can operate in two modes: OMA and NOMA. If the transmission operates in the OMA mode, one single data packet for user U_n, n ∈ {1, 2}, can be transmitted from source S to the corresponding buffer B_{k,n} at a single relay R_k. The received signal at relay R_k is given by

y_{R_k} = √P g_{SR_k} x_S^{(O)} + n_{R_k},

where P denotes the transmit power at each transmitter, x_S^{(O)} denotes the packet from source S, and n_{R_k} denotes the additive white Gaussian noise (AWGN) with variance σ_n^2 at relay R_k. The channel capacity between source S and relay R_k is then given by

C_{SR_k} = log2(1 + P|g_{SR_k}|^2 / σ_n^2).

Otherwise, if the transmission operates in the NOMA mode, source S can transmit one packet for user U_{n_1} to the corresponding buffer B_{k_1,n_1} of relay R_{k_1}, and one packet for user U_{n_2} to the corresponding buffer B_{k_2,n_2} of relay R_{k_2} simultaneously, where k_1, k_2 ∈ {1, . . . , K}, n_1, n_2 ∈ {1, 2}, and g_{SR_{k_1}} ≤ g_{SR_{k_2}}. The superimposed information symbol at source S is then

x_S^{(N)} = √(αP) x_{SR_{k_1}} + √((1−α)P) x_{SR_{k_2}},

where α ∈ (0, 1) is the power allocation coefficient, and x_{SR_{k_1}} and x_{SR_{k_2}} are the signals for relays R_{k_1} and R_{k_2}, respectively. The received signals at relays R_{k_1} and R_{k_2} are

y_{R_{k_i}} = g_{SR_{k_i}} x_S^{(N)} + n_{R_{k_i}}, i ∈ {1, 2},

where n_{R_{k_1}} and n_{R_{k_2}} denote the AWGN with variance σ_n^2 at relays R_{k_1} and R_{k_2}, respectively. We apply the successive interference cancelation (SIC) scheme in [34]^1 to decode the signal for relay R_{k_1} first, and then remove x_{SR_{k_1}} from the received signal at relay R_{k_2} and decode x_{SR_{k_2}}. Thus, the channel capacities for S → R_{k_1} and S → R_{k_2} are given by

C_{SR_{k_1}} = log2(1 + αP|g_{SR_{k_1}}|^2 / ((1−α)P|g_{SR_{k_1}}|^2 + σ_n^2)),
C_{SR_{k_2}} = log2(1 + (1−α)P|g_{SR_{k_2}}|^2 / σ_n^2),

respectively. Notice that because both signals x_{SR_{k_1}} and x_{SR_{k_2}} are decoded at relay R_{k_2} via the SIC scheme, k_1 = k_2 is possible for S → R_k NOMA transmissions. On the other hand, the transmissions between relay R_k and the users can also operate in two modes: OMA and NOMA.
If the transmission operates in the OMA mode, one single data packet for user U_n, n ∈ {1, 2}, can be transmitted from relay R_k to the corresponding user U_n. The channel capacity between relay R_k and user U_n is given by

C_{R_k U_n} = log2(1 + P|g_{R_k U_n}|^2 / σ_n^2).

Otherwise, in the NOMA mode, the signals for users U_1 and U_2 are superimposed. We assume g_{R_k U_{n_1}} ≤ g_{R_k U_{n_2}}, where n_1, n_2 ∈ {1, 2}. After applying SIC, we obtain the channel capacities for R_k → U_{n_1} and R_k → U_{n_2} as

C_{R_k U_{n_1}} = log2(1 + αP|g_{R_k U_{n_1}}|^2 / ((1−α)P|g_{R_k U_{n_1}}|^2 + σ_n^2)),
C_{R_k U_{n_2}} = log2(1 + (1−α)P|g_{R_k U_{n_2}}|^2 / σ_n^2),

respectively.

^1 Notice that imperfect SIC will affect the system performance. As in the existing literature (e.g. [18]-[20], [34]), the perfect SIC assumption provides a clear benchmark scenario for performance comparison. Detailed discussion of imperfect SIC is beyond the scope of this work and can be found in [35], [36].
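To make the capacity expressions above concrete, the following sketch computes the OMA and NOMA capacities of one hop. It is an illustrative reconstruction assuming unit-bandwidth Shannon capacities and noise variance σ_n² as in the model; the function names are ours.

```python
import math

def oma_capacity(g, P, sigma2=1.0):
    """OMA mode: one packet over a single link with channel coefficient g."""
    return math.log2(1.0 + P * abs(g) ** 2 / sigma2)

def noma_capacities(g_weak, g_strong, alpha, P, sigma2=1.0):
    """NOMA mode with SIC: the weak link (|g_weak| <= |g_strong|) receives
    power share alpha and is decoded first, treating the other signal as
    interference; the strong link then decodes interference-free.
    Returns (C_weak, C_strong)."""
    c_weak = math.log2(1.0 + alpha * P * abs(g_weak) ** 2
                       / ((1.0 - alpha) * P * abs(g_weak) ** 2 + sigma2))
    c_strong = math.log2(1.0 + (1.0 - alpha) * P * abs(g_strong) ** 2 / sigma2)
    return c_weak, c_strong
```

With P/σ_n² = 20 dB (P = 100, σ_n² = 1), `noma_capacities(0.5, 1.0, 0.8, 100.0)` shows the usual NOMA behavior: the weak link is interference-limited while the strong link keeps a high rate.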

B. Problem Formulation
For both NOMA and OMA modes in buffer-aided cooperative networks, the link between nodes i and j is available for a single data transmission when its channel capacity satisfies

C_{ij} ≥ η, (8)

where η is the target rate. Moreover, we consider the delay constraint in buffer-aided relay networks. The delay of packet ε is defined as the period between packet ε being successfully transmitted from the source and arriving at the corresponding user. For example, if packet ε is transmitted from source S to relay R_k successfully at time slot t, and then successfully arrives at the corresponding user at time slot t + 2, it takes three time slots (t, t + 1 and t + 2) for the transmission, and its delay is 3. Based on (8), if the link between nodes i and j satisfies C_{ij} ≥ η, the OMA mode can be applied to transmit one single packet from node i to node j. On the other hand, if the NOMA mode is applied for a transmit node i and two selected receiver nodes j_1 and j_2, the channel capacities should satisfy C_{ij_1} ≥ η and C_{ij_2} ≥ η simultaneously.
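As a tiny illustration of this delay convention, the inclusive slot count used in the three-slot example above can be written as follows (the function name is ours):

```python
def packet_delay(slot_sent_from_source, slot_arrived_at_user):
    """Delay of a packet: the number of time slots from the slot in which it
    leaves source S up to and including the slot in which it reaches the
    user (inclusive count, matching the t, t+1, t+2 example)."""
    return slot_arrived_at_user - slot_sent_from_source + 1
```

A packet sent at slot t and delivered at slot t + 2 therefore has delay 3, and it satisfies the delay constraint only if this value does not exceed the target delay Δ.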
Furthermore, the power allocation coefficient α is a crucial parameter in NOMA transmissions. Therefore, to maximize the throughput with the delay constraint for the buffer-aided relay hybrid NOMA/OMA network, we formulate optimization problem (9), where u(t) = 0 denotes S → R_k transmissions and u(t) = 1 denotes R_k → U_n transmissions, N is the number of time slots, v(t) = 1 denotes the OMA mode, v(t) = 2 denotes the NOMA mode, μ(·) = 1 if the enclosed condition holds and μ(·) = 0 otherwise, Δ denotes the target delay, ε(t) denotes the delay of the packet which arrives at node j in the OMA transmission mode at time slot t, ε_1(t) and ε_2(t) denote the delays of the packets which arrive at nodes j_1 and j_2 in the NOMA transmission mode at time slot t, respectively, l_{k,j}(t) denotes the state of the corresponding buffer at relay R_k for user node j when an OMA R_k → U_n transmission is selected and j denotes user U_n, and l_{k,U_1}(t) and l_{k,U_2}(t) denote the states of the two buffers at relay R_k. Constraint (9a) shows the selection of NOMA/OMA for the transmission at each time slot; the subsequent constraints show that the corresponding buffer should not be empty for R_k → U_n transmissions, while (9f) shows that each buffer can store no more than L packets, and (9g) gives the range of the power allocation coefficient in the NOMA mode. To be specific, if the OMA mode is selected for an R_k → U_n transmission, the buffer at relay R_k for user U_n should not be empty. On the other hand, if the NOMA mode is selected for R_k → U_n transmissions, neither of the two buffers at relay R_k should be empty. The objective function ensures that the capacity satisfies (8) for each selected transmission link, and the throughput only counts the packets arriving at the corresponding user within the delay constraint. According to (9), to maximize the throughput with the delay constraint, we need to optimize the selection of the transmission mode and the corresponding nodes at each given time slot.
When the NOMA transmissions are not available, or available NOMA transmissions are not the optimal selection when considering the delay constraint at a given time slot, the system can switch to the OMA mode to reduce the outage.
For the NOMA transmission mode, the optimal power allocation coefficient α is required to guarantee the availability of links for data transmission. In terms of the transmit node, if we choose an S → R_k transmission with NOMA, we need to perform relay pairing to select two available relays to receive packets from S. On the other hand, in a successful R_k → U_n NOMA transmission, relay R_k can transmit one packet to each user from the corresponding buffer. Therefore, the optimal decision should also consider the impact of the buffer states to maximize the throughput.
Moreover, the delay constraint is introduced into the optimization function. To be specific, in a time-varying system, the optimal buffer states for maximizing the throughput are affected by the channel states, and the transmission mode selection should satisfy the delay constraint. Due to the relay pairing, power allocation, NOMA/OMA switching, delay constraint, buffer states and time-varying system in the proposed network, the optimization function in (9) is a complicated non-convex high-dimensional optimization problem to achieve the long-term performance.

III. REINFORCEMENT LEARNING FRAMEWORK AND A-PRIORI INFORMATION

A. Overview
To solve the complicated non-convex high-dimensional optimization problem in (9), we apply reinforcement learning algorithms in buffer-aided relay hybrid NOMA/OMA networks to achieve the long-term performance. Reinforcement learning generally consists of states, actions and rewards, which need to be defined carefully for the implementation in buffer-aided relay hybrid NOMA/OMA networks. Moreover, a-priori information can help reinforcement learning schemes reduce the action-state space and improve the convergence.

B. State
The system state of the buffer-aided relay hybrid NOMA/OMA network is characterized by the buffer states and the channel states. The buffer state l_{k,U_n}(t) denotes the number of packets stored in the buffer at relay R_k for user U_n, where n ∈ {1, 2} and k ∈ {1, . . . , K}. In terms of the channel states, directly applying the CSI in reinforcement learning would cause an infinite number of states due to the continuous values of the CSI. Thus, the channel state is quantized into availability indicators, where c_k denotes the availabilities of the S → R_k OMA transmission and the R_k → U_n NOMA/OMA transmissions at time slot t, and up_{k_1 k_2} denotes the availability of the NOMA transmission between source S and the relay pair {R_{k_1}, R_{k_2}}. Therefore, the system state is given by

s(t) = {l_{k,U_n}(t), c_k(t), up_{k_1 k_2}(t)}. (11)

In reinforcement learning, the environment transits from state s(t) to the next state s(t + 1) by taking an action; the action set is defined in the following subsection.
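One possible flattening of the state in (11) into a neural-network input is sketched below. The ordering and names are our own choices for illustration; any fixed ordering of the same components would serve.

```python
def system_state(buffer_lens, link_flags, pair_flags):
    """Flatten the buffer occupancies l_{k,U_n}(t), the per-relay link
    availability indicators c_k(t), and the relay-pair NOMA availability
    indicators up_{k1,k2}(t) into one discrete state vector."""
    return tuple(buffer_lens) + tuple(link_flags) + tuple(pair_flags)
```

For example, with two buffers holding 2 and 0 packets, both per-relay links available and one relay pair unavailable, the state is the tuple (2, 0, 1, 1, 0).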

C. Action
In buffer-aided relay hybrid NOMA/OMA networks, an action not only selects links for data transmission but also decides the transmission mode. Moreover, an action includes the value of the power allocation coefficient α for the NOMA transmission mode. Furthermore, in the NOMA mode for S → R_k transmissions, the relay pair selection should also be considered in an action. Therefore, we define the action as a(t) = a_{v,α,i,j,j_1,j_2}, where v = 0 denotes the NOMA mode and v = 1 denotes the OMA mode, α ∈ (0, 1), i denotes the transmit node, j denotes the receiver node in the OMA mode, and j_1 and j_2 denote the receiver nodes in the NOMA mode, respectively. To achieve stable convergence, we consider a discrete action space as in [37], where α is quantized into δ power levels. Based on the action and state, the reinforcement learning algorithm can be applied to make decisions in the buffer-aided relay hybrid NOMA/OMA network. To optimize the decisions to achieve the maximum throughput with the delay constraint, we introduce the reward to help train the reinforcement learning algorithm.
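The size of this discrete action space can be made tangible with a small enumeration sketch. This is one plausible indexing under our own assumptions (α quantized as m/δ for m = 1, …, δ−1, node labels of our choosing); the paper's exact indexing may differ.

```python
from itertools import product

def build_action_space(K, delta):
    """Enumerate discrete hybrid NOMA/OMA actions a_{v,alpha,i,j,j1,j2}:
    OMA actions pick one link; NOMA actions pick a transmit node, two
    receivers (relay/buffer pair or both users) and a quantized alpha."""
    relays = [f"R{k}" for k in range(1, K + 1)]
    users = ["U1", "U2"]
    alphas = [m / delta for m in range(1, delta)]  # alpha in (0, 1)
    actions = []
    # OMA: S -> buffer (R_k, U_n), or R_k -> U_n.
    for r in relays:
        for u in users:
            actions.append(("OMA", None, "S", (r, u)))
            actions.append(("OMA", None, r, u))
    # NOMA: S -> two (relay, buffer) targets, or R_k -> both users.
    for a in alphas:
        for (r1, u1), (r2, u2) in product(product(relays, users), repeat=2):
            actions.append(("NOMA", a, "S", (r1, u1), (r2, u2)))
        for r in relays:
            actions.append(("NOMA", a, r, "U1", "U2"))
    return actions
```

With the simulation parameters K = 4 and δ = 10 this enumeration already yields hundreds of actions, which motivates both the a-priori pruning of Section III-E and the A3C scheme for large action spaces.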

D. Reward
Reinforcement learning algorithms require a set of rewards to evaluate the action-state space. In the proposed networks, the reward is designed to help the reinforcement learning algorithms maximize the throughput with delay constraint. We consider giving a positive bonus as the reward if there is a packet arriving at the corresponding user within the target delay. Furthermore, negative rewards are considered when unavailable links are selected for data transmissions. Therefore, the reinforcement learning algorithm can learn the experience from the total reward during training and evaluate the action-state space.
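The reward shaping described above can be sketched as follows. The numeric bonus and penalty values are our own illustrative choices, not values given in the paper.

```python
def reward(delivered_within_delay, selected_invalid_link,
           bonus=1.0, penalty=-1.0):
    """Illustrative reward: a positive bonus when a packet reaches its user
    within the target delay, a negative reward for selecting an unavailable
    link, and zero otherwise."""
    if selected_invalid_link:
        return penalty
    return bonus if delivered_within_delay else 0.0
```

Summing this reward over an episode then tracks exactly the delay-constrained throughput that the optimization problem maximizes, minus a penalty term for invalid selections.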

E. A-Priori Information
Reinforcement learning needs to explore the state-action space and evaluate the rewards during training. However, based on the states in (11) and actions in Section III-C, we can obtain a high-dimensional action-state space. Therefore, the range of exploration is quite large in the proposed network, and easily leads to many local optima for reinforcement learning algorithms during training. To reduce the range of exploration and improve the convergence efficiency, a-priori information is introduced for the proposed schemes.
For a given state, the action space is still large. However, we can remove the invalid actions for a given state to reduce the exploration range. We assume that an action for the OMA transmission between nodes i and j is valid at time slot t when the conditions in (13) hold: the OMA transmission requires an available link for a single packet transmission, an empty buffer is invalid for R_k → U_n transmissions, whilst a full buffer is invalid for S → R_k transmissions. On the other hand, we assume that an action for the NOMA transmission from node i to nodes j_1 and j_2 is valid at time slot t when the conditions in (14) hold, where U_{n_1} and U_{n_2} denote the corresponding users for the selected buffers in NOMA transmissions: a valid NOMA transmission requires two available links for the transmit node, and the corresponding buffers must be not empty for R_k → U_n transmissions and not full for S → R_k transmissions. Furthermore, we consider that the buffer state can also serve as a-priori information. Considering the target delay, a trade-off between different buffer states is required to ensure the delay constraint. Moreover, the channel gains also have an impact on the delay: strong links may almost always be available, while weak links have a low probability of being available. Therefore, we introduce the target buffer length ξ_{k,n} in (15) for user U_n at relay R_k, where n ∈ {1, 2}, to improve the convergence performance of reinforcement learning. If the buffer state l_{k,n}(t) > ξ_{k,n} at time slot t, we assume that the corresponding actions, which transmit a packet to buffer n at relay R_k, are invalid. Based on the a-priori information in (13), (14) and (15), we can remove invalid actions from the action set at a given time slot. Therefore, the exploration efficiency is improved by reducing the action-state space.
With high exploration efficiency, the proposed algorithm can improve the performance and converge faster, which will be shown in the simulation section.
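The pruning rules of (13)-(15) amount to simple predicates over link availability and buffer occupancy. A minimal sketch, with names and argument shapes of our own choosing, is:

```python
def valid_oma(link_available, is_source_hop, buf_len, L):
    """OMA validity as in (13): the link must be available; the target
    buffer must not be full for S->R, and the source buffer must not be
    empty for R->U."""
    if not link_available:
        return False
    return buf_len < L if is_source_hop else buf_len > 0

def valid_noma(link1_ok, link2_ok, is_source_hop, buf1, buf2, L):
    """NOMA validity as in (14): both links available, and both involved
    buffers not full (S->R) or not empty (R->U)."""
    if not (link1_ok and link2_ok):
        return False
    if is_source_hop:
        return buf1 < L and buf2 < L
    return buf1 > 0 and buf2 > 0

def respects_target_length(buf_len, xi):
    """A-priori rule based on the target buffer length xi_{k,n} in (15):
    actions that push a buffer beyond xi are treated as invalid."""
    return buf_len <= xi
```

Filtering the action set through these predicates at each time slot is what shrinks the exploration range before the agent ever evaluates a Q-value or policy probability.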

IV. ASYNCHRONOUS DQN-BASED SCHEME
In this section, we propose the asynchronous DQN-based scheme (ADQN) for joint buffer-aided relay selection and power allocation. Moreover, we further develop the ADQN scheme with the a-priori information (ADQN-PI). In the DQN, an agent takes an action and observes the next state of the system. With the ε-greedy strategy, the agent explores the system randomly in the exploration mode; in the exploitation mode, the agent takes the action estimated by the neural network. A DQN consists of two neural networks: the prediction and target deep neural networks. The prediction network is used to estimate the action for the current state in the exploitation mode, and the target network is used to estimate the target value for updating the prediction network. The target value function in the DQN is given by

y(t) = r_{s(t),a(t)} + ρ max_a Q_Tar(s(t + 1), a), (16)

where r_{s(t),a(t)} denotes the reward for s(t) and a(t), ρ denotes the discount coefficient, and Q_Tar(s(t + 1), a) denotes the Q-value estimated by the target network. The prediction network then updates its weights by minimizing the loss between the predicted value and the target value. The loss function of the DQN-based scheme is given by

ζ(t) = (y(t) − Q_Pre(s(t), a(t)))^2, (17)

where Q_Pre(s(t), a(t)) denotes the value estimated by the prediction network. Then, we apply the widely used Adam optimizer [38] to minimize the loss and obtain the gradient to update the weights of the prediction neural network. To update the target network, we copy the parameters from the prediction network to the target network periodically.
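The target-value and loss computations described above, together with ε-greedy action selection, can be sketched numerically. This is a minimal sketch assuming tabulated Q-values in place of the two neural networks; the function names are ours.

```python
import random

def dqn_target(reward_t, next_q_values, rho):
    """Target value: r + rho * max_a Q_Tar(s', a), with next_q_values the
    target network's Q-values for all actions at the next state."""
    return reward_t + rho * max(next_q_values)

def dqn_loss(pred_q, reward_t, next_q_values, rho):
    """Squared TD error between the prediction network's value and the
    target value."""
    return (dqn_target(reward_t, next_q_values, rho) - pred_q) ** 2

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore a random action with probability epsilon, otherwise exploit
    the argmax of the prediction network's Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In the full scheme, the gradient of this loss with respect to the prediction network's weights is what the Adam optimizer accumulates, and the target table is refreshed periodically from the prediction table.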
To mitigate the overestimation problem in the conventional DQN, we apply the multi-threaded asynchronous DQN scheme to achieve better convergence [30]. We design multiple agents for the asynchronous scheme, each with its own copy of the environment. The agents use the same networks to explore the same environment in parallel. However, because we apply the ε-greedy strategy and the channel coefficients vary independently from one time slot to another, the agents can explore different parts of the environment in their own threads. The framework of the asynchronous reinforcement learning is shown in Fig. 2. To be specific, in the proposed asynchronous DQN scheme, the two neural networks are shared by all agents. At time slot t, a local agent applies the ε-greedy strategy with the a-priori information to explore the environment, and then obtains the current state s(t), the current action a(t), the corresponding reward r(t), and the next state s(t + 1). The agent forms these elements into a sample {s(t), a(t), r_{s(t),a(t)}, s(t + 1)}. The loss from this sample can be calculated based on (17), and the Adam method is used to obtain the gradient based on the loss. After exploring the environment for W_P time slots, the accumulated gradient from this agent is used to update the prediction neural network. Due to the multiple agents in the asynchronous DQN algorithm, different gradients are obtained to update the network. After updating the prediction network W_T times, we copy the parameters from the prediction network to the target network. The pseudo code of the proposed asynchronous DQN-based scheme with a-priori information is shown in Algorithm 1. We build the two deep neural networks with three fully-connected layers of 64, 64 and 32 neurons, respectively. The computational complexity of the proposed ADQN algorithm for training is O(W_T × W_P), measured by the number of loop iterations as in [39], [40]; the a-priori information does not introduce extra complexity.
Moreover, after training the algorithm, the complexity of ADQN for making decisions depends on the structure of the neural network. Unlike existing works, the number of relays or buffer size does not affect the computational complexity for making decisions after training.
For the asynchronous DQN-based scheme, each agent generates samples and calculates the gradient in its own environment, and then updates the neural networks independently.

Algorithm 1 ADQN-PI:
1: Initialize the variables.
2: Initialize the shared prediction network θ_p and target network θ_t.
3: Initialize the thread-specific prediction network θ'_p and target network θ'_t.
4: repeat for each agent thread:
5: for w = 1, · · · , W_T do
6:   Synchronize the thread-specific networks θ'_p = θ_p and θ'_t = θ_t.
7:   for t = 1, · · · , W_P do
8:     Use the ε-greedy strategy with the a-priori information to decide the exploration/exploitation mode, and then select a(t) based on θ'_p or randomly.
9:     Get the reward r_{s(t),a(t)} and the next state s(t + 1).
10:    Form the sample {s(t), a(t), r_{s(t),a(t)}, s(t + 1)}.
11:    Calculate the loss ζ(t) of the sample, based on the loss function (17) with θ'_p and θ'_t.
12:    Obtain and accumulate the gradient with the Adam method.
13:  end for
14:  Update the shared prediction network θ_p with the accumulated gradient.
15: end for
16: Copy the parameters from the prediction network to the target network, θ_t = θ_p.
17: until final convergence.
Thus, compared with the conventional DQN, the asynchronous DQN-based scheme can not only explore more parts of the environment within one training iteration, but also mitigate the correlation between successive experiences and the overestimation problem. Notice that DQN is a value-based reinforcement learning algorithm with high sample efficiency [27]. Therefore, DQN can converge well in environments with a small action space. However, it propagates the impact of rewards to the related action-state space by only using the target value. Although we apply the asynchronous method to the DQN, the robustness and stability remain open problems for the proposed DQN-based scheme in buffer-aided cooperative NOMA networks with a large action space. Therefore, we consider the A3C-based scheme to achieve high performance with a large action space in the next section.
V. A3C-BASED SCHEME

Because policy-based reinforcement learning algorithms are effective and stable for function approximation, we consider combining the advantages of the value-based and policy-based schemes to improve the convergence performance and robustness of training. Therefore, we introduce the actor-critic scheme to solve the optimization problem in (9). In the actor-critic algorithm, there are two networks: the actor network and the critic network. Unlike the prediction and target networks in the DQN algorithm, the actor network is used to evaluate the probabilities over the action space based on the policy, and the critic network is used to evaluate the advantage of each action-state pair. To be specific, the input of both networks is the current state s(t) at time slot t; the outputs of the actor network are the probabilities of all actions for s(t), and the output of the critic network is the state value Q, which is used to determine the average value of s(t). Thus, the estimated value function in A3C is given by

V(s(t)) = r_{s(t),a(t)} + γ r_{s(t+1),a(t+1)} + · · · + γ^{W_A − t} Q(s(W_A); θ_c), (18)
where γ is the discount coefficient and θ_c is the weight matrix of the critic network. Therefore, we can use the average value and the estimated value to evaluate the advantage of taking action a(t) for state s(t) at time slot t. The advantage function is given by

A(t) = V(s(t)) − Q(s(t); θ_c),

which helps the agent understand the direction of updating the networks. To be specific, the advantage function evaluates the advantage or disadvantage of actions for the policy from the actor network. The actor network is a deep neural network with weight matrix θ and policy π. It aims to optimize the policy π, i.e., the action probability set for each state, to achieve the maximum throughput with the delay constraint. In the actor-critic algorithm, the action with the maximum probability is selected for the corresponding state. The loss function of the actor network is given by

ζ_a(t) = −log π(a(t)|s(t); θ) A(t). (20)

The critic network provides the value function to obtain the loss of the actor network in (20); it aims to evaluate the policy π from the actor network. The loss function of the critic network is given by

ζ_c(t) = (V(s(t)) − Q(s(t); θ_c))^2. (21)

Moreover, we apply the RMSProp method [41], a widely used optimization method for deep neural networks, to obtain the gradients based on these losses in the actor-critic algorithm [30]. The framework of the actor-critic algorithm is shown in Fig. 3.
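The n-step value estimate (18), the advantage, and the per-sample actor (20) and critic (21) losses can be sketched in a few lines. This is an illustrative sketch with scalar inputs standing in for network outputs; the function names are ours.

```python
def n_step_values(rewards, bootstrap_q, gamma):
    """Estimated values as in (18): working backwards from the critic's
    bootstrap value Q(s(W_A)), V(s_t) = r_t + gamma * V(s_{t+1})."""
    values, v = [], bootstrap_q
    for r in reversed(rewards):
        v = r + gamma * v
        values.append(v)
    return list(reversed(values))

def a3c_losses(log_prob, value_est, critic_value):
    """Advantage A = V(s) - Q(s; theta_c), then the actor loss (20)
    -log pi(a|s) * A and the critic loss (21) A**2 for one sample."""
    advantage = value_est - critic_value
    actor_loss = -log_prob * advantage
    critic_loss = advantage ** 2
    return actor_loss, critic_loss
```

Accumulating the gradients of these two losses over W_A slots, then pushing them asynchronously to the shared networks, is exactly the per-thread update that A3C performs.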

Algorithm 2 A3C-PI:
1: Initialize the variables.
2: Initialize the shared actor network θ and critic network θ_c.
3: Initialize the accumulated gradients ν = 0 and ν_c = 0.
4: repeat for each agent thread:
5: Synchronize the thread-specific networks θ' = θ and θ'_c = θ_c.
6: for t = 1, · · · , W_A do
7:   Use the actor network to select a(t) with policy π, applying the a-priori information to avoid invalid actions.
8:   Take a(t), get the reward r_{s(t),a(t)} and the next state s(t + 1), and store the sample {s(t), a(t), r_{s(t),a(t)}}.
9: end for
10: Obtain the state value Q(s(W_A)) from the critic network and calculate the estimated value for each state within the W_A time slots based on (18).
11: Calculate the losses of the actor and critic networks based on (20) and (21).
12: Apply the RMSProp method to calculate the gradients and accumulate them in ν and ν_c.
13: Asynchronously update θ with ν and θ_c with ν_c.
14: ν = 0 and ν_c = 0.
15: until final convergence.

The A3C is a multi-threaded asynchronous algorithm, where each agent has its own copy of the environment and can use its own copy of the shared networks to explore its environment in parallel.
To be specific, each agent performs actor-critic learning asynchronously in its own thread. At a given time slot t, a local agent uses its actor network to estimate the action a(t) with policy π. Notice that the a-priori information is introduced to help the agent avoid invalid actions. After receiving the action a(t), the environment takes a(t) to move to the next state s(t + 1), obtains the corresponding reward r_{s(t),a(t)}, and feeds the reward back to the agent. Finally, the agent generates a sample {s(t), a(t), r_{s(t),a(t)}} and stores it in the memory.
After repeating this for W_A time slots, the local agent obtains the state value Q(s(W_A)) from the critic network, and the estimated value for each state within the W_A time slots can then be calculated based on (18). Therefore, the losses of the actor and critic networks can be obtained based on (20) and (21). Then, the local agent applies the RMSProp method to calculate the gradients and accumulate them. The shared networks are updated with the accumulated gradients from each agent asynchronously, and the updated weights are then sent back to each agent. The pseudo code of the proposed A3C-based scheme with a-priori information is shown in Algorithm 2. We build the two deep neural networks with three fully-connected layers of 256, 128 and 128 neurons, respectively. The computational complexity of the proposed A3C-based algorithm for training is O(W_A), measured by the number of loop iterations as in [39], [40].

Remark 1: From (20) and (21), it is clear that the actor network and the critic network are trained separately. As the training iterations increase, the actor network learns to optimize its policy to find the optimal estimation of the action-state space, while the critic network learns to evaluate the actions estimated by the actor network efficiently. Therefore, the critic network can help the actor network converge more stably. Moreover, compared with conventional reinforcement learning algorithms, the asynchronous method can help them explore the environment more efficiently and converge faster [30]. However, it is difficult for both the actor network and the critic network to converge well when the action space is not large enough to generate a sufficient number of samples for evaluating the action-state pairs. This analysis is verified in the simulation section.
The computational complexity of the proposed reinforcement learning at the training stage depends on the layer structure of the neural networks and the amount of the training data. After the training stage, the neural network can immediately output the relay selection decision given the input states, making it an attractive scheme in communications.
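The W_A-step worker update described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the function name and signature are hypothetical, the discount factor γ and entropy weight β are generic A3C hyperparameters, and the comments point to the paper's equations (18), (20), (21) only by analogy with the standard n-step A3C losses.

```python
import numpy as np

def a3c_targets_and_losses(rewards, values, bootstrap_value,
                           log_probs, entropies, gamma=0.99, beta=0.01):
    """One W_A-step A3C update on a worker thread (sketch).

    rewards, values, log_probs, entropies: length-W_A arrays collected
    during the rollout; bootstrap_value is the critic's Q(s(W_A)).
    Returns the actor and critic losses whose gradients would then be
    accumulated (e.g. with RMSProp) and applied to the shared networks.
    """
    W = len(rewards)
    returns = np.empty(W)
    R = bootstrap_value                  # critic's estimate of the final state
    for t in reversed(range(W)):         # discounted n-step returns, cf. (18)
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - values        # A(s,a) = R - V(s)
    # policy-gradient loss with entropy regularization, cf. (20)
    actor_loss = -np.mean(log_probs * advantages + beta * entropies)
    # value-regression loss for the critic, cf. (21)
    critic_loss = np.mean(advantages ** 2)
    return actor_loss, critic_loss
```

Because the returns bootstrap from the critic's value of the last state, each worker can update after only W_A slots instead of waiting for an episode to end, which is what makes the asynchronous exploration efficient.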

VI. SIMULATION RESULTS
The results of the proposed asynchronous learning-based schemes are shown in this section. The simulation parameters include system parameters and learning parameters for the buffer-aided cooperative NOMA systems and the deep reinforcement learning algorithms. Unless otherwise stated, the system parameters are set as follows: the number of relays K = 4, the buffer size L = 10, the transmit power to noise ratio for all transmitters P/σ_n^2 = 20 dB, the number of power levels δ = 10, the path loss exponent β = 3, and the target rate η = 1 bps/Hz. Moreover, as there is no suitable existing benchmark, we design a Max-Min-SNR joint relay selection and power allocation scheme as the benchmark. The Max-Min-SNR scheme uses the discrete power level coefficient δ for power allocation, and can switch between the NOMA and OMA modes in both hops. The decision of the Max-Min-SNR scheme is made as follows.
1) First, the scheme checks all valid NOMA mode actions. In each NOMA mode action, two links are selected for transmission, and we define the one with the smaller SNR as the "weak link". Priority is then given to the NOMA mode action that has the strongest "weak link" among all "weak links".
2) If all NOMA transmission actions are invalid, priority is given to the valid OMA transmission action that uses the link with the strongest SNR.
3) Otherwise, no action can be taken.

It is clearly shown in Fig. 4 that the A3C-PI scheme converges to 0.85 packets/time slot after 3,000 training iterations, while the A3C-based scheme only achieves about 0.7. Without the a-priori information, the reinforcement learning based scheme needs to explore a large action-state space to find the optimal solution, and it is much more difficult to avoid local optima. Furthermore, a large number of relays and power levels leads to a large action space, and both the A3C-PI and A3C-based schemes outperform the ADQN-PI and ADQN schemes in this case. These results show that, with a large action space, the A3C algorithm improves the convergence performance by combining the advantages of the value-based and policy-based schemes.
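The Max-Min-SNR benchmark's decision rule (steps 1-3 above) can be sketched as follows. The function name and the encoding of actions as `(action, SNR)` pairs are assumptions for this example; the source describes the rule, not this interface.

```python
def max_min_snr_decision(noma_actions, oma_actions):
    """Max-Min-SNR benchmark decision (sketch).

    noma_actions : list of (action, (snr_link1, snr_link2)) pairs for the
                   *valid* NOMA-mode actions (two links per action).
    oma_actions  : list of (action, snr) pairs for the *valid* OMA-mode
                   actions.
    Returns the chosen action, or None when no action is valid.
    """
    if noma_actions:
        # 1) prefer the NOMA action whose "weak link" (smaller SNR of the
        #    two selected links) is strongest among all weak links
        return max(noma_actions, key=lambda x: min(x[1]))[0]
    if oma_actions:
        # 2) fall back to the OMA action using the strongest single link
        return max(oma_actions, key=lambda x: x[1])[0]
    return None  # 3) no valid action can be taken
```

Unlike the learning-based schemes, this rule is myopic: it maximizes the weakest instantaneous SNR per slot without accounting for buffer states or the delay constraint, which is why it serves as a baseline.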
As we can see in Fig. 5, the ADQN-PI scheme converges to 0.48 packets/time slot after 2,000 training iterations, while the A3C-PI scheme only achieves about 0.43. Compared with Fig. 4, we reduce the action space by decreasing the number of relays and power levels to improve the convergence efficiency, and the DQN-based scheme achieves better results than the A3C-based scheme due to its higher sample efficiency. It can also be seen that both proposed algorithms converge less stably in a small action space, because of the impact of the small number of power levels. Moreover, both the ADQN-PI and A3C-PI schemes achieve higher throughput than the two schemes without the a-priori information. This result confirms that reducing the range of exploration can still help reinforcement learning algorithms converge in a small action space.
The results in Fig. 6 compare the throughput with the delay constraint vs. the target rate for the proposed schemes and the benchmark. All proposed schemes outperform Max-Min-SNR significantly: the A3C-PI and ADQN-PI schemes achieve throughputs of 0.89 and 0.6 at the target rate η = 1 bps/Hz, respectively, while Max-Min-SNR only achieves 0.09. This clearly indicates that the proposed schemes can optimize the throughput with the delay constraint. Moreover, the A3C-PI scheme with fixed α = 0.7 only achieves a throughput of 0.5 at η = 1 bps/Hz, while the ADQN-PI scheme with fixed α = 0.7 achieves only 0.38. These results show that power allocation can improve the throughput in NOMA transmissions.

Fig. 7 indicates the impact of the target delay on the proposed schemes and the Max-Min-SNR scheme. The A3C-PI and ADQN-PI schemes achieve throughputs of 0.85 and 0.57 when the target delay Δ = 10 time slots, while Max-Min-SNR only obtains 0.09. One advantage of the proposed schemes is that the learning-based algorithms can search for solutions in environments with different requirements; therefore, the proposed schemes achieve high delay-constrained throughput under different target delays. Moreover, both ADQN-PI and A3C-PI perform better than the learning-based schemes with fixed α, which again shows that power allocation is an efficient way to improve the throughput. Notice that all schemes can only achieve a nonzero delay-constrained throughput when Δ ≥ 2, because a packet takes at least two time slots to arrive at the corresponding user.

Fig. 8 shows the throughput vs. the transmit power to noise ratio P/σ_n^2 for the proposed schemes and the Max-Min-SNR scheme with the delay constraint. It is clearly shown that A3C-PI and ADQN-PI achieve throughputs of 0.88 and 0.6 when P/σ_n^2 = 25 dB, while Max-Min-SNR only achieves a throughput of 0.1. Due to the target rate requirement, a low P/σ_n^2 leads to a large number of outages in transmissions.
With low transmit power, many links are unavailable for transmission at a given time slot. Although the proposed schemes can learn to avoid selecting unavailable links, a packet still requires many time slots to arrive at the corresponding user, and the delay is difficult to minimize.

Fig. 9 shows the throughput with the delay constraint vs. the power level coefficient δ for the proposed schemes and the Max-Min-SNR scheme. It is clear that the A3C-PI and ADQN-PI schemes achieve throughputs of 0.7 and 0.5 when δ = 4, while Max-Min-SNR only achieves a throughput of 0.06. The results indicate that higher throughput is achieved with a larger power level coefficient, because the power allocation can perform more efficiently in a discrete action space with a large δ. We observe that the uptrends are not perfectly smooth for all schemes, because a larger discrete power allocation coefficient does not always yield a correspondingly more useful action space.

Fig. 10 compares the delay-constrained throughput of the hybrid NOMA/OMA, NOMA-only and OMA-only schemes. The NOMA-only and OMA-only schemes apply only NOMA and OMA transmissions, respectively, using ADQN in a manner similar to the hybrid NOMA/OMA scheme. It is clearly shown that, at high P/σ_n^2, the hybrid scheme performs similarly to the NOMA-only approach. This is because, when the channel SNR is high, NOMA is selected by the learning to ensure high data throughput. On the other hand, at low P/σ_n^2, the hybrid scheme performs similarly to the OMA-only approach. This also matches the intuition that channels with low SNR cannot support NOMA transmission, and the learning process has successfully captured this. Therefore, the proposed hybrid NOMA/OMA scheme takes advantage of both the OMA-only and NOMA-only approaches.
VII. CONCLUSION

This paper proposed two asynchronous learning algorithms for joint hybrid NOMA/OMA relay selection and power allocation in buffer-aided delay-constrained networks. We compared the asynchronous DQN-based scheme and the A3C-based scheme under different action spaces, and a-priori information was exploited to further improve the learning process. The comparison shows that the A3C-PI scheme outperforms ADQN-PI for large action spaces, while ADQN-PI outperforms A3C-PI for small action spaces. The simulation results also demonstrated the advantages of exploiting the a-priori information in the learning. In future work, we will investigate the scalability of the proposed learning schemes in dynamic scenarios with varying parameters, including the number of relays, buffer lengths and others.