Deep Reinforcement Learning-Based Relay Selection in Intelligent Reflecting Surface Assisted Cooperative Networks

This letter proposes a deep reinforcement learning (DRL) based relay selection scheme for cooperative networks with an intelligent reflecting surface (IRS). We consider a practical phase-dependent amplitude model in which the IRS reflection amplitudes vary with the discrete phase shifts. Furthermore, we apply relay selection to reduce the signal loss over distance in IRS-assisted networks. To solve the complicated problem of joint relay selection and IRS reflection coefficient optimization, we introduce DRL to learn the solution from the environment and reduce the computational complexity. Simulation results show that the proposed DRL-based algorithm significantly improves the throughput compared to random relay selection and random reflection coefficient methods.

optimization of the IRS reflection coefficients and transmit power allocation was proposed to maximize the orthogonal frequency division multiplexing (OFDM) achievable rate in [6].
However, the high computational complexity of optimizing the phase shifts of the IRS is a complicated problem for practical implementation [7]. Fortunately, the deep reinforcement learning (DRL) algorithm can be used to solve complicated problems and reduce the computational complexity for wireless communications without a training data set [8]. Therefore, the phase shifts were optimized via the DRL algorithm to enhance the received signal-to-noise ratio (SNR) and reduce the computational complexity in [9]. In [10], a DRL-based joint design of the transmit beamforming matrix and phase shifts was proposed to improve the sum rate in IRS-assisted networks. Most related works, however, assume the reflection amplitude is fixed, which is not practical because the reflection amplitude varies with the phase shift, according to the practical phase shift model in [11]. Furthermore, the above DRL-based schemes only consider the continuous phase shift design. Therefore, in this letter, we consider discrete phase shift variables and the practical phase shift model to design our system.
On the other hand, a cooperative relay network is an attractive technology to improve the outage performance in wireless communications [12]. To amalgamate the benefits of IRS-assisted and relay-assisted networks, a hybrid half-duplex (HD) decode-and-forward (DF) relay and IRS network with continuous phase shifts and fixed reflection amplitude was investigated to improve the achievable rate in [13]. To further enhance the achievable rate, [14] proposed optimization of continuous phase shifts with fixed reflection amplitude for a hybrid IRS and full-duplex (FD) DF relay network. Moreover, relay selection is an efficient way to harvest the diversity gain in cooperative communications [15]. Motivated by this, [16] utilized a DRL-based relay selection scheme to enhance the outage performance.
In this letter, therefore, we propose DRL-based relay selection in IRS-assisted cooperative networks (DRL-RI) to maximize the throughput with discrete phase shifts and the practical phase-dependent amplitude model. The main contributions of this letter are listed as follows:
• We propose joint relay selection and optimization of the IRS reflection coefficients for cooperative networks, whilst considering the discrete phase shifts and the practical phase shift model.
• We introduce the DRL algorithm to solve the complicated non-convex optimization problem and thereby reduce the computational complexity of optimization in wireless networks.
• Simulation results show that the proposed DRL-based scheme can achieve a higher throughput than the random relay-selection/reflection-coefficients methods.
The rest of this letter is organized as follows. Section II introduces the system model and the problem formulation. The DRL-based algorithm is proposed in Section III. Simulation results verify the proposed scheme in Section IV. Finally, Section V concludes this letter.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As shown in Fig. 1, we consider a two-hop IRS-assisted cooperative network composed of one source S, one destination D, K HD DF relays R_k (k ∈ {1, ..., K}), and one IRS I with M reflecting elements. Each of the S, D, and R_k nodes is equipped with a single omnidirectional antenna. The IRS is equipped with a controller that determines the phase shift of each reflecting element in a given time slot. We assume there is no direct link between S and D on account of the signal loss over distance. Moreover, the channels of the S → R_k and R_k → D links are assumed to be non-line-of-sight (NLoS) Rayleigh fading channels, while the channels from and to I are assumed to be Rician fading with pure line-of-sight (LoS) components [14], [17]. Therefore, the channel coefficient h_ij between nodes i and j is given in (1), where K_ij denotes the Rician factor between nodes i and j. For the NLoS Rayleigh fading channels, h̄_ij = ḡ_ij d_ij^{-ᾱ/2}, ij ∈ {SR_k, R_k D}, where ḡ_ij is modeled as complex-Gaussian small-scale fading with zero mean and unit variance, d_ij denotes the distance between nodes i and j, and ᾱ denotes the path loss exponent for the NLoS Rayleigh fading channels; all channels are assumed to remain unchanged during the two hops. On the other hand, for the LoS Rician fading channels, ĥ_ij = ĝ_ij d_ij^{-α/2}, where α denotes the path loss exponent for a LoS Rician fading channel, and ĝ_ij can be expressed as in (2), where β_0 is the path loss at the reference distance D_0 = 1 m [18], and ψ_ij ∈ [0, 2π] is the angle of departure (AoD) or angle of arrival (AoA) of the signal between nodes i and j.
At the first hop S → R_k, the source S transmits the signal x_S to both I and relay R_k, and I reflects the incident signal to R_k. Thus, the received signal at relay R_k is given by
y_{R_k} = √(P_S) (h_{SR_k} + ĥ_{IR_k}^T Θ ĥ_{SI}) x_S + n_{R_k}, (3)
where P_S denotes the transmit power at S, n_{R_k} denotes the additive white Gaussian noise (AWGN) with variance σ_n^2 at R_k, and Θ = diag(η_1 e^{jθ_1}, η_2 e^{jθ_2}, ..., η_M e^{jθ_M}) denotes the diagonal reflection matrix of the IRS, with η_m ∈ [0, 1] and θ_m ∈ [0, 2π] denoting the reflection amplitude and phase shift of the mth reflecting element of I, respectively. Without loss of generality, v = [v_1, ..., v_M] denotes the reflection coefficient vector of the IRS, such that η_m = |v_m| and θ_m = arg(v_m) for the mth IRS element [11]. Notice that the reflection amplitude varies with the phase shift. Therefore, in this letter we apply the practical model to obtain the amplitude and phase shift from the reflection coefficient as in [11, Fig. 3(b)] with the effective resistance R = 2 Ω. Moreover, we assume that the phase shifts are discrete variables for implementing the IRS in practice as in [19], and the phase shift of each IRS element is restricted to
θ_m ∈ {0, 2π/L, ..., 2π(L−1)/L}, (4)
where L denotes the number of phase quantization levels. Based on (3), the received SNR at R_k for the first-hop transmission is
γ_{R_k} = P_S |h_{SR_k} + ĥ_{IR_k}^T Θ ĥ_{SI}|^2 / σ_n^2. (5)
Therefore, the channel capacity for the first-hop transmission is C_{SR_k} = log_2(1 + γ_{R_k}). At the second hop R_k → D, relay R_k transmits the decoded signal x_{R_k} to both I and D, and I reflects the incident signal to D. Thus, the received signal at D is given by
y_D = √(P_R) (h_{R_k D} + ĥ_{ID}^T Θ ĥ_{R_k I}) x_{R_k} + n_D, (6)
where P_R denotes the transmit power of node R_k and n_D denotes the AWGN with variance σ_n^2 at D. Therefore, the received SNR at D can be given as
γ_D = P_R |h_{R_k D} + ĥ_{ID}^T Θ ĥ_{R_k I}|^2 / σ_n^2. (7)
Thus, the channel capacity for the second-hop transmission is C_{R_k D} = log_2(1 + γ_D). Moreover, we assume that the transmission of each hop is available when the corresponding channel capacity satisfies
C_ij ≥ ϑ, (8)
where ϑ denotes the target rate. This means that if C_ij satisfies (8), the corresponding link can support a single packet transmission from node i to node j in a given time slot. Since DF relays are considered, a packet is transmitted from S to D successfully when min{C_{SR_k}(t), C_{R_k D}(t+1)} ≥ ϑ at time slots t and t+1.
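As an illustration of the system model above, the following Python sketch draws hypothetical channels and evaluates the first-hop SNR and capacity in (5) for a discrete-phase IRS. The phase-dependent amplitude curve is only in the spirit of [11]; the parameter values eta_min, phi, and kappa, the unit large-scale gains, and the random channel draws are all assumptions for illustration, not the exact model for R = 2 Ω.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 16                        # number of IRS reflecting elements
L = 4                         # 2-bit quantization -> L = 4 phase levels
P_over_N = 10 ** (35 / 10)    # transmit power to noise ratio, 35 dB

def practical_amplitude(theta, eta_min=0.2, phi=0.43 * np.pi, kappa=1.6):
    """Phase-dependent reflection amplitude (illustrative parameters)."""
    return (1 - eta_min) * ((np.sin(theta - phi) + 1) / 2) ** kappa + eta_min

# Hypothetical channel draws: Rayleigh direct S->R_k link, and unit-modulus
# pure-LoS S->I and I->R_k links with unit large-scale gain for simplicity.
h_sr = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
h_si = np.exp(1j * 2 * np.pi * rng.random(M))
h_ir = np.exp(1j * 2 * np.pi * rng.random(M))

# Discrete phase shifts as in (4), with the amplitude tied to the phase.
theta = 2 * np.pi * rng.integers(0, L, size=M) / L
eta = practical_amplitude(theta)
v = eta * np.exp(1j * theta)            # reflection coefficient vector

# First-hop received SNR as in (5), and the resulting channel capacity.
gamma = P_over_N * np.abs(h_sr + np.sum(v * h_si * h_ir)) ** 2
C = np.log2(1 + gamma)
```

Note that choosing a phase level for an element also fixes its amplitude through `practical_amplitude`, which is exactly the coupling that makes the optimization in this letter harder than the fixed-amplitude case.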
To investigate the maximum throughput in IRS-assisted cooperative networks with the practical phase shift model, the joint relay selection and reflection coefficient optimization can be formulated as in (9), where T denotes the number of time slots observed at the destination, and μ(·) = 1 if the enclosed condition holds and μ(·) = 0 otherwise. With the relay selection, the discrete phase shift variables, and the relation between the phase shifts and reflection amplitudes, (9) is a complicated non-convex optimization problem [9] that is hard to solve. The exhaustive search algorithm for maximizing the throughput has a high complexity of O(K L^M). In addition, the existing IRS optimization schemes usually require high computational complexity to find the solution [20]. To solve the optimization problem in (9) with low complexity, we introduce DRL in the following section.
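The objective in (9) can be made concrete with a short sketch: the function below counts the fraction of slot pairs whose two-hop capacities both meet the target rate, mirroring the indicator μ(·). The capacity sequences in the usage line are made-up numbers for illustration only.

```python
import numpy as np

def throughput(C_first_hop, C_second_hop, target_rate=0.5):
    """Average packets/time slot delivered end to end: a packet succeeds
    when min{C_SRk(t), C_RkD(t+1)} >= target_rate, as required by the
    DF relaying condition."""
    c1 = np.asarray(C_first_hop)
    c2 = np.asarray(C_second_hop)
    success = np.minimum(c1, c2) >= target_rate   # the indicator mu(.)
    return success.mean()

# Hypothetical per-slot capacities (bps/Hz) over four slot pairs.
print(throughput([0.8, 0.3, 1.2, 0.6], [0.9, 0.7, 0.4, 0.6]))  # -> 0.5
```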

III. DEEP REINFORCEMENT LEARNING BASED OPTIMIZATION SCHEME
To avoid the overestimation problem, the double deep Q-learning network (DDQN) is applied in this letter. Firstly, there is an agent in the DDQN algorithm that makes decisions to optimize the relay selection and IRS reflection coefficients for the proposed network. The agent can apply the ε-greedy strategy to explore the network by making decisions randomly, and then learn the decision policy from its exploration experience. Secondly, when the agent selects the exploitation mode, it makes decisions from its stored experience. We can model the proposed system as a Markov decision process (MDP) [16]. In DDQN, the algorithm maintains two different Q-tables, A and B, to store its experience. Q-table A is updated as
Q_A(s(t), a(t)) ← Q_A(s(t), a(t)) + ρ (r_{s(t),a(t)} + δ Q_B(s(t+1), argmax_a{Q_A(s(t+1), a)}) − Q_A(s(t), a(t))), (10)
where r_{s(t),a(t)} is the reward of the MDP that evaluates the corresponding state s(t) and action a(t), ρ ∈ (0, 1) denotes the learning rate of the Q-tables in the DDQN, δ ∈ (0, 1) denotes the discount rate in the DDQN, and argmax_a{Q_A(s(t+1), a)} denotes the action with the maximum Q-value for the next state s(t+1) in Q-table A. In the proposed scheme, the reward is given to the agent when a packet arrives at the destination successfully. To reduce the impact of the overestimation problem, Q-table B is updated symmetrically as
Q_B(s(t), a(t)) ← Q_B(s(t), a(t)) + ρ (r_{s(t),a(t)} + δ Q_A(s(t+1), argmax_a{Q_B(s(t+1), a)}) − Q_B(s(t), a(t))). (11)
Since the dimension of the action-state space is high in the proposed MDP, it is difficult to form and update the Q-tables of the DDQN. To solve this problem, a deep neural network (DNN) is introduced into the DDQN as the function approximator instead of the Q-tables. Similar to the Q-tables, the DNN receives the state as the input and outputs the actions as the decisions for the proposed network. This significantly reduces the computational complexity of estimating the optimal decision for IRS-assisted communication. Moreover, the DNN can use the gradient descent algorithm to update the neural network for high performance in a high-dimensional environment. In this letter, we apply Adam [21] as the adaptive-learning-rate iterative optimization algorithm to calculate the gradients for the DNN.
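A minimal tabular sketch of the double Q-table update may help: the greedy action for the next state is selected by one table but evaluated by the other, which is the mechanism behind (11) and its twin update for the other table. The state/action encodings and reward values below are hypothetical.

```python
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, rho=0.1, delta=0.9):
    """One tabular double Q-learning step: the greedy next-state action is
    chosen by table QA but evaluated by table QB. Calling this with the
    tables swapped gives the symmetric update for the other table."""
    a_star = max(actions, key=lambda x: QA[(s_next, x)])
    QA[(s, a)] += rho * (r + delta * QB[(s_next, a_star)] - QA[(s, a)])

# Hypothetical usage: alternate which table is updated at each step.
QA, QB = defaultdict(float), defaultdict(float)
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])  # updates A
double_q_update(QB, QA, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])  # updates B
```

Decoupling action selection from action evaluation in this way is what curbs the overestimation bias of single-table Q-learning.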
In the proposed scheme, every T time slots the agent generates a sample for each time slot as {s(t), a(t), r_{s(t),a(t)}, s(t+1)}, and then selects W samples randomly for training the DNNs to avoid the overfitting problem. Two neural networks are designed for the proposed scheme, the prediction network and the target network, which provide the estimation value Q_P(s(t), a(t)) and the target value Q_T(s(t+1), argmax_a Q_P(s(t+1), a)), respectively. Thus, we can calculate the loss between the prediction network and the target network, and then obtain the gradients via the Adam algorithm to update the prediction network. The loss function in the proposed algorithm is given by
Loss = (1/W) Σ_{w=1}^{W} (r_{s(t),a(t)} + δ Q_T(s(t+1), argmax_a Q_P(s(t+1), a)) − Q_P(s(t), a(t)))^2. (12)
After updating the prediction network V times, we copy the weights from the prediction network to update the target network. The pseudo code of the proposed DRL-RI scheme is shown in Algorithm 1. The computational complexity of the proposed algorithm is O(V(T + W)) for each iteration during training. After training, the computational complexity of the DRL-based algorithm for making decisions is much smaller than that in training, because it only depends on the structure of the neural network without any further learning. Thus, the proposed algorithm can reduce the complexity significantly compared with conventional methods such as SDR, whose complexity is O(K(M + 1)^6) [9].

11: Get value Q_T(s(t+1), argmax_a Q_P(s(t+1), a)) from the target network based on s(t+1).
12: end for
13: Use the loss function (12) to update the prediction network.
14: end for
15: Update the target network.
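The training procedure of Algorithm 1 can be sketched end to end as follows. Linear Q-networks and plain SGD stand in for the DNNs and Adam, and the state/action dimensions, buffer contents, and hyperparameter values are all assumptions for illustration, not the settings used in the letter.

```python
import numpy as np

rng = np.random.default_rng(1)

S_DIM, A_DIM = 8, 4       # hypothetical state and action dimensions
W_BATCH, V_SYNC = 32, 100 # minibatch size W and target-sync period V
RHO, DELTA = 1e-3, 0.9    # learning rate and discount rate

# Linear Q-networks stand in for the prediction and target DNNs.
w_pred = rng.standard_normal((S_DIM, A_DIM)) * 0.1
w_targ = w_pred.copy()

def q(w, s):
    """Q-values for all actions given state s."""
    return s @ w

def train_step(replay, w_pred, w_targ):
    """One prediction-network update using the squared-error loss (12)."""
    idx = rng.integers(0, len(replay), size=W_BATCH)
    grad = np.zeros_like(w_pred)
    for i in idx:
        s, a, r, s_next = replay[i]
        a_star = int(np.argmax(q(w_pred, s_next)))   # argmax by prediction net
        y = r + DELTA * q(w_targ, s_next)[a_star]    # evaluated by target net
        err = q(w_pred, s)[a] - y
        g = np.zeros(A_DIM)
        g[a] = err
        grad += np.outer(s, g)
    w_pred -= RHO * grad / W_BATCH                   # plain SGD in place of Adam
    return w_pred

# Fill the replay buffer with random transitions, run V_SYNC updates,
# then copy the prediction weights into the target network.
replay = [(rng.standard_normal(S_DIM), int(rng.integers(A_DIM)),
           float(rng.random()), rng.standard_normal(S_DIM))
          for _ in range(200)]
for _ in range(V_SYNC):
    w_pred = train_step(replay, w_pred, w_targ)
w_targ = w_pred.copy()
```

The per-sample target `y` uses the prediction network to pick the greedy next action and the target network to score it, which is the same decoupling as in the tabular DDQN updates.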

IV. SIMULATION RESULTS
Simulation results of the proposed DRL-based scheme are shown in this section. Unless otherwise stated, we set the system parameters as follows: the number of relays K = 5, the transmit power to noise ratio P/σ_n^2 = P_S/σ_n^2 = P_R/σ_n^2 = 35 dB, the number of IRS elements M = 16, the path loss exponents α = 2 and ᾱ = 2.5, the Rician factor K = 10 dB for the links with Rician fading, the target rate ϑ = 0.5 bps/Hz, the discount coefficient δ = 0.9, the number of time slots for updating the prediction network T = 500, the training sample number W = 32, and the iteration number for updating the target network V = 100. Moreover, the suggested IRS quantization is log_2(L) = 2 bits based on [5], [19].
Fig. 2 shows that the proposed scheme achieves approximately 0.1 packets/time slot at the beginning, and converges to about 0.4 packets/time slot after 13,000 training iterations. This result indicates that, due to the high-dimensional space of the hybrid relay and IRS networks, the DRL-based scheme needs many iterations to explore the environment, and the convergence is not very stable during training. However, the DRL algorithm finally converges and obtains a solution because it can learn from the exploration experience. Moreover, after training, the proposed scheme obtains a low-complexity DNN for making decisions [9], which can be implemented to reduce the computational complexity in hybrid relay and IRS networks. Fig. 3 shows the comparison of throughput versus different target rates between the proposed scheme, the IRS reflection coefficient optimization scheme with random relay selection (Random RS), and the relay selection scheme with random IRS reflection coefficients (Random IRS). It is shown that the proposed scheme significantly outperforms the other two schemes. The DRL-RI scheme achieves about 0.4 packets/time slot when the target rate ϑ = 0.5 bps/Hz, while Random RS and Random IRS achieve 0.17 and 0.12 packets/time slot, respectively.
The proposed DRL-RI scheme can not only optimize the reflection coefficients for the IRS, but also optimize the relay selection to reduce the outage probability.
Thus, the proposed scheme can amalgamate the benefits of relay selection and the IRS to achieve a high throughput. Fig. 4 shows the comparison of throughput versus different transmit power to noise ratios between the proposed scheme, Random RS, and Random IRS. The proposed DRL-RI scheme achieves approximately 0.45 packets/time slot when the transmit power to noise ratio P/σ_n^2 = 40 dB, while Random RS and Random IRS only achieve 0.35 and 0.31 packets/time slot, respectively. It is clearly shown that the performance of all algorithms improves as the transmit power to noise ratio increases. This is because the SNR is directly proportional to the transmit power to noise ratio, based on (5) and (7).

V. CONCLUSION
This letter investigated the throughput maximization problem in IRS-assisted cooperative networks with joint relay selection and discrete IRS reflection coefficient optimization. We applied the DRL algorithm to learn from the environment to map the relation between the optimization variables and the throughput, and to solve the non-convex optimization problem in which the IRS reflection amplitudes vary with the discrete phase shifts. Compared with the random relay selection algorithm and the random IRS reflection coefficient algorithm, the proposed scheme obtains a significant performance gain. This result shows the benefits of joint relay selection and IRS reflection coefficient optimization in reducing the signal loss over distance, and provides a potential way to solve complicated optimization problems in wireless communications with low computational complexity.