DRL-Based Sequential Scheduling for IRS-Assisted MIMO Communications

Efficient resource allocation strategies are pivotal in vehicular communications as connected devices steeply increase in scenarios with much more stringent requirements. In this work, we propose a deep reinforcement learning (DRL)-based sequential scheduling approach for sum-rate maximization in the uplink of intelligent reflecting surface (IRS)-assisted multi-user (MU) multiple-input multiple-output (MIMO) vehicular communications. We formulate the scheduling task as a partially observable Markov decision process (POMDP) and propose a novel stream-level sequential solution based on the proximal policy optimization (PPO) algorithm. We consider a realistic imperfect channel state information (ICSI) model and assess the proposal in several communication setups comprising both spatially uncorrelated and correlated links. Simulation results show that the proposed DRL-based sequential scheduling approach is a robust alternative to more computationally demanding benchmarks.


Dariel Pereira-Ruisánchez, Student Member, IEEE, Óscar Fresnedo, Member, IEEE, Darian Pérez-Adán, Member, IEEE, and Luis Castedo, Senior Member, IEEE

Index Terms: Scheduling, intelligent reflecting surfaces, deep reinforcement learning, PPO, resource allocation.

I. INTRODUCTION
Over the last few years, vehicle-related technologies have evolved to support advanced applications related to driving assistance and collision avoidance tasks. The high mobility and critical conditions under which these applications must work impose latency, reliability, and throughput requirements that cannot be achieved with current wireless technologies [1], [2].
IRS-assisted MIMO systems have been regarded as enablers of the next generation of vehicle-to-everything (V2X) communications. These technologies set the basis for providing ubiquitous coverage while supporting high-throughput, ultra-reliable, and low-latency transmissions. On the one hand, MIMO systems achieve significant spatial multiplexing and enable more efficient spectrum usage, higher data rates, and privacy [3]. In addition, the use of IRSs allows smart control of the communication environment and reduces signal degradation in high-frequency bands, which include millimeter wave (mmWave) and terahertz (THz) [4], [5]. The deployment of unmanned aerial vehicle (UAV)-carried IRSs is an attractive solution to enhance ground communications by creating artificial links between vehicles and roadside units (RSUs) that would be obstructed otherwise [6], [7].
Although the deployment of IRS-assisted MIMO systems might be a game-changing paradigm, several deployment considerations remain open problems. Recently, the joint optimization of the precoders and the IRS phase-shift matrix has attracted the most research interest. However, results in [8] show that the performance of some promising solutions significantly degrades when the number of transmitted streams increases beyond the number of receiving antennas. As explained in [9] and [10], the next generations of vehicular communications will face highly heterogeneous and dynamic scenarios where large numbers of users, sensors, and vehicles will compete for the available resources. Hence, satisfactory communication performance will be unfeasible without appropriate resource allocation techniques.
As discussed in [11], the selection of the co-scheduled users significantly affects performance in non-orthogonal transmissions like those in IRS-assisted MIMO systems. However, selecting the set of co-scheduled users that optimizes the system performance is a non-deterministic polynomial time problem because the set of feasible solutions grows exponentially with the number of users. As a result, many existing scheduling approaches consider sub-optimal heuristics to reduce the search over the high-dimensional solution space [9], [12]. These approaches use surrogate objective functions based on spatial compatibility metrics and perform the user scheduling sequentially. However, they only consider the immediate effect of incorporating a given user and disregard long-term effects over the final result. In addition, some of the spatial metrics considered cannot be extended to all the scenarios since they depend on channel properties like the channel correlation, which provides no useful information in uncorrelated channel models [9], [12].
Data-driven scheduling approaches have recently gained attention due to the ability of artificial neural networks (ANNs) to solve high-dimensional resource allocation problems in an efficient and flexible way [13], [14]. In this regard, DRL approaches, which combine reinforcement learning (RL) with ANN-based function approximations, stand as the most appealing alternatives [10]. DRL frameworks learn how to solve optimization problems by continuously interacting with the environment, becoming a robust alternative for rapidly time-varying channels. Authors in [11] use a dueling double deep Q-network (D3QN) formulation to find the user association strategies that maximize the long-term downlink performance of a cellular network. The scheduling problem is addressed as a tree-structured combinatorial problem, and an adaptation of the deep Q-network (DQN) framework is employed. In [15], a multi-agent formulation based on DQN is considered to address the resource allocation in UAV-enabled communications. The results in [11] and [15] show that DRL-based approaches outperform conventional alternatives while offering a better trade-off between execution time and adaptability. In addition, DRL approaches in [11] and [15] leverage the attention to long-term rewards in order to achieve near-optimal solutions. However, these implementations suffer from the curse of dimensionality [16], [17] since they consider combinatorial approaches whose set of actions comprises all the feasible scheduling combinations. As a result, the set of actions increases exponentially with the number of users, thus leading to unfeasible storage and computing requirements.
Because of the limitations of existing data-driven schedulers, and the inability of heuristic methods to perform long-term analysis, we propose an innovative approach that combines the best of the sequential formulations and DRL algorithms to handle the user scheduling in the uplink of IRS-assisted multi-stream (MS) MU MIMO communications. We developed an efficient and robust scheduling framework such that practical strategies for the joint optimization of the IRS matrix and MIMO precoders could be later derived from its output.
The remainder of this paper is structured as follows. Section II details some theoretical fundamentals of RL, analyzes the most relevant existing DRL-based scheduling approaches, and presents the main contributions of our work. Section III introduces the IRS-assisted MS MU MIMO system model and the scheduling optimization problem. Section IV introduces the proposed sequential DRL-based solution. Section V presents the results of simulation experiments, and Section VI is devoted to the conclusions.

A. Notation
Throughout this work, the following notation will be employed: $a$ is a scalar, $\mathbf{a}$ is a column vector, and $\mathbf{A}$ represents a matrix. Notice that we use the scalar notation for actions and states in Section II-A, but their formats vary according to the RL formulation of the problems. $[\mathbf{A}]_{i,:}$ and $[\mathbf{A}]_{:,j}$ stand for the $i$-th row vector and the $j$-th column vector of $\mathbf{A}$, respectively. $[\mathbf{A}]_{i,j}$ is the entry where the $i$-th row and the $j$-th column of $\mathbf{A}$ meet. Transpose, conjugate transpose, and the squared Frobenius norm of $\mathbf{A}$ are represented by $\mathbf{A}^T$, $\mathbf{A}^H$, and $\|\mathbf{A}\|_F^2$, respectively. $\hat{\mathbf{A}}$ represents the estimate of the matrix $\mathbf{A}$, and $\widehat{\mathbf{A}\mathbf{B}}$ stands for the estimate of the result of the matrix operation between $\mathbf{A}$ and $\mathbf{B}$. Calligraphic letters are employed to denote sets and tuples. $|\mathcal{R}|$ stands for the cardinality of a set $\mathcal{R}$. $\mathbf{I}_N$ indicates an $N \times N$ identity matrix, and $\mathcal{I}_N$ denotes the set of integers from 1 to $N$. We use $\mathbf{0}$ to represent zero-valued vectors or matrices alike, with dimensions that can be easily inferred from the context.

The operator $\mathrm{blkdiag}(\cdot)$ constructs a block diagonal matrix from its input matrices, the operator $\mathrm{diag}(\cdot)$ constructs a diagonal matrix from an input vector, and $\mathrm{flatten}(\cdot)$ is the operator that reshapes any matrix $\mathbf{V} \in \mathbb{C}^{A \times B}$ into a vector $\mathbf{v} \in \mathbb{C}^{AB}$. Finally, $\mathrm{OR}(\cdot)$ computes the element-wise binary OR operation between the binary-valued input vectors. The mathematical relationships presented in the following sections hold for all the consecutive time steps $t$ that fit within one coherence block. Hence, for the sake of simplicity, sub-index $t$ is used only where necessary to avoid ambiguities.
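The operators defined above can be illustrated with a minimal NumPy sketch. The helper names mirror the operators in the text; the row-major ordering used by `flatten` and the toy inputs are assumptions made here for illustration only.

```python
import numpy as np

def blkdiag(*blocks):
    """Place the input matrices along the diagonal of an all-zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols), dtype=complex)
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

def flatten(V):
    """Reshape an A x B matrix into a length-AB vector (row-major order assumed)."""
    return np.asarray(V).reshape(-1)

def OR(*vectors):
    """Element-wise OR of binary-valued vectors, returned as 0/1 integers."""
    stacked = [np.asarray(v, dtype=bool) for v in vectors]
    return np.logical_or.reduce(stacked).astype(int)

# Two 2x1 blocks give a 4x2 block-diagonal matrix.
P = blkdiag(np.ones((2, 1)), 2 * np.ones((2, 1)))
# diag(.) builds a diagonal matrix from a vector of unit-modulus entries.
Theta = np.diag(np.exp(1j * np.array([0.0, np.pi])))
# OR merges a scheduling vector with a one-hot vector.
xi_next = OR([1, 0, 0, 0], [0, 0, 1, 0])
```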

A. Reinforcement Learning (RL) Fundamentals
As stated in [16], RL is a computational approach to learning-by-interacting, i.e., mapping situations to the actions that maximize a numerical reward function. Most RL problems are formalized in terms of a Markov decision process (MDP), where the learning and decision-maker element (the agent) interacts with the external components (the environment) through actions that affect the subsequent states and rewards. Hence, RL problems are characterized by a dynamics function $p(s_{t+1}, r_t \mid a_t, s_t)$ such that, in every time instant $t$, the next state $s_{t+1}$ and the reward $r_t$ are conditioned by the effect of taking an action $a_t \in \mathcal{A}$ in the current state $s_t$. $\mathcal{A}$ is the set of feasible actions and $|\mathcal{A}|$ is the number of feasible actions.
The policy $\pi(\cdot)$ is a critical element of RL approaches. The policy is the decision-making rule that returns the probability $\pi(a_t \mid s_t)$ of taking a given action $a_t$ while being in a given state $s_t$. RL algorithms aim to maximize a function of the expected long-term reward, which depends on the sequence of states and actions taken. Hence, training in RL focuses on learning the policy that maximizes that reward function. The state-value function $V^{\pi}(s_t)$ and the action-value function $Q^{\pi}(s_t, a_t)$ are the most commonly used reward functions. The former computes the expected return when starting in a state $s_t$ and following the policy $\pi(\cdot)$, while the latter evaluates the expected return starting from $s_t$, taking the action $a_t$, and following policy $\pi(\cdot)$ afterward.
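For concreteness, the two functions can be written in the usual discounted form found in [16]. The discount factor $\gamma \in [0, 1]$ and the episode horizon $T$ are symbols introduced here for illustration and are not part of the notation used elsewhere in this work.

```latex
V^{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{i=0}^{T-t} \gamma^{i}\, r_{t+i} \,\middle|\, s_t\right],
\qquad
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{i=0}^{T-t} \gamma^{i}\, r_{t+i} \,\middle|\, s_t, a_t\right],
\qquad
V^{\pi}(s_t) = \sum_{a \in \mathcal{A}} \pi(a \mid s_t)\, Q^{\pi}(s_t, a).
```

The last identity shows how the two functions relate for a discrete action set: the state value is the policy-weighted average of the action values.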
The tuples $\mathcal{E}_t = (s_t, a_t, r_t, s_{t+1})$, $\forall t$, that store the interactions between the RL elements are commonly termed experiences. The structure of the experience tuples may change according to the RL algorithm.
Conventional tabular approaches to RL problems have proven efficient when considering low-dimensional and discrete state and action spaces [16]. However, these algorithms are unfeasible in optimization problems with continuous or arbitrarily large search spaces. In this regard, the ANN-based DRL algorithms are appealing alternatives. The use of ANNs for the approximations of the policy and reward functions enables a wide range of new approaches to high-complexity problems like the one we are addressing in this work. In this case, the objective function is a determining factor since the trainable parameters of the ANNs are updated according to it. Hence, several recent advances in DRL have been motivated by the search for better objective functions.

B. Related Works
We next analyze three existing approaches to user scheduling in IRS-assisted communications where DRL-based techniques have been considered. Although these works also address the configuration of the IRS matrix, we focus only on the scheduling stage, which is the scope of this work.
Authors in the complementary works [18] and [19] address the user scheduling task in the uplink of an IRS-assisted communication system. They consider a sum-rate maximization problem and propose a solution based on the neural combinatorial optimization (NCO) framework. NCO is a stochastic method widely used to handle RL formulations where the solutions comprise the combinations of the optimization variables. This framework overcomes the high dimensionality of combinatorial problems by considering a sequential recurrent structure. The simulation results show that the proposed algorithms achieve near-optimal performances in the considered scenarios. However, the authors in [18] and [19] make some questionable assumptions. First, they assume a fixed number of scheduled users along all the channel realizations, although the optimal number of scheduled users varies according to the characteristics of the communication channels. Second, when analyzing the impact of the imperfect CSI (ICSI), they use ICSI models where the channel matrices (from the users to the IRS and from the IRS to the base station (BS)) are individually estimated. However, because of the passive nature of the IRSs, these estimated matrices cannot be obtained separately in practical situations [20].
Authors in [21] propose a solution based on the PPO framework to the scheduling problem in the downlink of IRS-assisted vehicular communications. Unlike [18] and [19], the authors in [21] aim to maximize the minimum average rate experienced by the vehicles. The PPO algorithm is simpler to implement and tune, and the results in this paper show that it enables an efficient approach for user scheduling in the considered simulation scenarios. However, this approach cannot be extended to denser configurations. The major drawback of [21] is related to the formulation of the actions and the scalability of the resulting set. The authors in [21] propose a combinatorial approach where the feasible actions comprise all user scheduling combinations. Hence, considering denser networks with more competing vehicles might lead to intractable computing requirements.
Notice that all previous related works consider single-antenna users and single-stream transmissions. This simplification limits the scheduling capabilities to the user level and disregards the advantages of considering stream-level scheduling. Besides, they only assess uncorrelated fading channels while, as explained in [22], practical channels are generally spatially correlated. Because of the antennas' non-uniform radiation patterns and the physical propagation environment, some spatial directions are more likely to carry strong signals.

C. Overcoming the Limitations
Based on the limitations of the previous related works, we have developed the present research, whose main contributions can be summarized as follows:
• We propose a sequential DRL-based scheduling framework for IRS-assisted MIMO communications. This proposal leverages the attention to long-term rewards in RL algorithms and remains feasible for highly populated communication systems.
• We formulate the scheduling optimization problem as a partially observable MDP (POMDP) where observations are composed of the estimated channel state information (CSI) matrices, and actions are related to selecting the user at each step of the sequential scheduler. We develop a DRL solution based on PPO, which efficiently handles the high-dimensional continuous states and discrete actions.
• We extend the scheduling optimization to the stream level, which provides higher flexibility and enhanced performance with respect to conventional user-level scheduling approaches.
• We consider a realistic ICSI model where the individual channel matrices in the cascaded channels are jointly estimated. Besides, we evaluate performance over several communication scenarios, including both correlated and uncorrelated channels.

In this work, we have selected a PPO-based approach as it stands as a game-changing framework regarding stability, simplicity, and scalability in DRL algorithms. As we will explain later, PPO is an innovative policy gradient algorithm that introduces several improvements to overcome major constraints like sample inefficiency and the instability of policy updates. During the initial phases of the investigation, we also considered other DRL-based algorithms. However, those lacked some desired features (e.g., DQN and advantage actor-critic (A2C)) or were too complex to be considered for practical implementations (e.g., trust region policy optimization (TRPO)).
In addition, we formalize the scheduling problem as a POMDP because the interactions between the RL elements are always affected by the uncertainty introduced by the considered ICSI conditions. Although we will not distinguish between the terms states and observations, we assume that the observed rewards are affected by a non-observable element: the CSI estimation errors.

III. SYSTEM MODEL AND OPTIMIZATION PROBLEM
Let us consider the uplink of an IRS-assisted MS MU MIMO vehicular communication system, where the communication between $K$ vehicles and an RSU is aided by a UAV-carried IRS, as shown in Fig. 1. The set of all the connected vehicles is $\mathcal{K} = \{k : k = 1, \ldots, K\}$. We assume each vehicle uses $N_t$ antennas to send up to $N_s$ data streams to the RSU equipped with $N_r$ antennas. We consider the vehicles to transit in a dark zone, i.e., there are no direct paths between the vehicles and the RSU. Hence, coverage is provided through the vehicle-IRS-RSU cascaded channels. According to this system model, the signal received at the RSU is given by

$$\mathbf{y} = \mathbf{H}_{\mathrm{IR}} \boldsymbol{\Theta} \mathbf{H}_{\mathrm{VI}} \mathbf{P} \boldsymbol{\Xi} \mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{x} = [\mathbf{x}_1^T, \ldots, \mathbf{x}_K^T]^T \in \mathbb{C}^{K N_s}$ stacks the symbols transmitted by the $K$ vehicles, with $\mathbf{x}_k \in \mathbb{C}^{N_s}$, $\forall k$. Vector $\mathbf{n} \in \mathbb{C}^{N_r}$ stands for the receive complex-valued additive white Gaussian noise (AWGN), modeled as $\mathbf{n} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sigma^2 \mathbf{I}_{N_r})$. We assume every $\mathbf{x}_k$ follows a circularly symmetric complex Gaussian distribution, i.e., $\mathbf{x}_k \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{I}_{N_s})$.
The diagonal matrix $\boldsymbol{\Xi} = \mathrm{diag}(\boldsymbol{\xi}) \in \mathbb{C}^{N_s K \times N_s K}$ is the stream-level scheduling matrix. The vector $\boldsymbol{\xi} = [\xi_{1,1}, \ldots, \xi_{K,N_s}]^T$ is a binary scheduling vector, such that $\xi_{k,n_s}$ takes the value one if the $n_s$-th stream of the $k$-th vehicle is scheduled and zero otherwise. We define the set of scheduled vehicles as $\mathcal{S} = \{k \in \mathcal{K} : \sum_{n_s=1}^{N_s} \xi_{k,n_s} \geq 1\}$, such that it contains all the vehicles with at least one scheduled stream according to $\boldsymbol{\xi}$.
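As a small illustration of the stream-level scheduling vector, the sketch below builds the scheduling matrix and recovers the set of scheduled vehicles. The sizes and the example vector are hypothetical; the stream ordering follows the definition of the scheduling vector given above.

```python
import numpy as np

K, Ns = 3, 2                          # illustrative sizes, not the simulated setup
# Entries ordered as [xi_{1,1}, xi_{1,2}, xi_{2,1}, xi_{2,2}, xi_{3,1}, xi_{3,2}].
xi = np.array([1, 0, 0, 0, 1, 1])

Xi = np.diag(xi)                      # stream-level scheduling matrix

# A vehicle is scheduled if at least one of its Ns streams is scheduled.
scheduled = [k + 1 for k in range(K) if xi[k * Ns:(k + 1) * Ns].any()]
```

Here vehicle 1 transmits one stream, vehicle 3 transmits two, and vehicle 2 stays silent, which a user-level scheduler could not express.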
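The matrix $\mathbf{P} = \mathrm{blkdiag}(\mathbf{P}_1, \ldots, \mathbf{P}_K) \in \mathbb{C}^{N_t K \times N_s K}$ stacks all the individual precoders $\mathbf{P}_k \in \mathbb{C}^{N_t \times N_s}$. We assume the IRS to have $N$ reflecting elements. Hence, we model its phase-shift matrix as the diagonal matrix $\boldsymbol{\Theta} = \mathrm{diag}(\boldsymbol{\theta}) \in \mathbb{C}^{N \times N}$, where $\boldsymbol{\theta} = [e^{j\theta_1}, \ldots, e^{j\theta_N}]^T \in \mathbb{C}^{N}$ is the vector that stacks the phase shifts introduced by the elements of the IRS.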
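The matrix $\mathbf{H}_{\mathrm{IR}} \in \mathbb{C}^{N_r \times N}$ is the channel response from the IRS to the RSU, and $\mathbf{H}_{\mathrm{VI}} = [\mathbf{H}_{\mathrm{VI}_1}, \ldots, \mathbf{H}_{\mathrm{VI}_K}] \in \mathbb{C}^{N \times K N_t}$ stacks the channel response matrices of the links between the vehicles and the IRS. Hence, the cascaded channel matrix $\mathbf{H}_{\mathrm{C}}$ can be defined as

$$\mathbf{H}_{\mathrm{C}} = \mathbf{H}_{\mathrm{IR}} \boldsymbol{\Theta} \mathbf{H}_{\mathrm{VI}} \in \mathbb{C}^{N_r \times K N_t}. \tag{2}$$

Note that the response of the cascaded channel can be rewritten as

$$\mathbf{H}_{\mathrm{C}} = \sum_{n=1}^{N} e^{j\theta_n} \mathbf{H}_{\mathrm{com}_n},$$

where $\mathbf{H}_{\mathrm{com}_n} = [\mathbf{H}_{\mathrm{IR}}]_{:,n} [\mathbf{H}_{\mathrm{VI}}]_{n,:}$ is the combined channel matrix response for the $n$-th element of the IRS.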
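The equivalence between the matrix product form of the cascaded channel and its sum over IRS elements can be checked numerically. The following NumPy sketch uses small illustrative dimensions and unit-variance Gaussian entries, both chosen here only to exercise the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Nt, Nr, N = 3, 2, 4, 8              # illustrative sizes, not the simulated setup

def crandn(*shape):
    """Unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H_IR = crandn(Nr, N)                   # IRS -> RSU
H_VI = crandn(N, K * Nt)               # vehicles -> IRS, K blocks of size N x Nt
theta = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N))  # unit-modulus phase shifts

# Product form: H_C = H_IR * diag(theta) * H_VI.
H_C = H_IR @ np.diag(theta) @ H_VI

# Sum form: one rank-one combined channel per IRS element
# (np.outer does not conjugate its arguments).
H_com = [np.outer(H_IR[:, n], H_VI[n, :]) for n in range(N)]
H_C_sum = sum(t * Hn for t, Hn in zip(theta, H_com))
```

The sum form matters because only the combined matrices, not the two individual links, are assumed to be available at the RSU.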
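We next introduce the ICSI model to be considered. As shown in [20], estimates of the channel matrices $\hat{\mathbf{H}}_{\mathrm{IR}}$ and $\hat{\mathbf{H}}_{\mathrm{VI}}$ cannot be obtained individually due to the passive nature of the IRS elements. Alternatively, estimates of the combined channel response $\hat{\mathbf{H}}_{\mathrm{com}_n} = \widehat{[\mathbf{H}_{\mathrm{IR}}]_{:,n} [\mathbf{H}_{\mathrm{VI}}]_{n,:}}$, $\forall n$, are feasible at the RSU. Such estimates can be modeled as [20]

$$\hat{\mathbf{H}}_{\mathrm{com}_n} = [\mathbf{H}_{\mathrm{IR}}]_{:,n} [\mathbf{H}_{\mathrm{VI}}]_{n,:} + \mathbf{E}_{\mathrm{com}_n}, \quad \forall n,$$

where $\mathbf{H}_{\mathrm{IR}}$ and $\mathbf{H}_{\mathrm{VI}}$ are the true channel matrices, and the matrices $\mathbf{E}_{\mathrm{com}_n}$, $\forall n$, contain the estimation errors, which are assumed to be zero-mean Gaussian distributed with covariance matrices $\mathbb{E}[\mathbf{e}_{\mathrm{com}_n} \mathbf{e}_{\mathrm{com}_n}^H] = \sigma_{\mathrm{com}_n}^2 \mathbf{I}_{K N_t N_r}$, $\forall n$, where $\mathbf{e}_{\mathrm{com}_n}$ is the vectorized version of the $n$-th estimation error matrix. Hence, the estimated cascaded channel matrix can be represented as

$$\hat{\mathbf{H}}_{\mathrm{C}} = \sum_{n=1}^{N} e^{j\theta_n} \hat{\mathbf{H}}_{\mathrm{com}_n}. \tag{4}$$

Notice that $\hat{\mathbf{H}}_{\mathrm{C}}$ stacks all the individual estimated cascaded channels such that $\hat{\mathbf{H}}_{\mathrm{C}} = [\hat{\mathbf{H}}_{\mathrm{C}_1}, \ldots, \hat{\mathbf{H}}_{\mathrm{C}_K}]$.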
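A minimal sketch of this ICSI model is given below: each combined channel is perturbed by an independent complex Gaussian error, and the estimated cascaded channel is assembled from the noisy combined matrices. The error variance value, the sizes, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, Nt, Nr, N = 2, 2, 3, 4              # illustrative sizes
sigma2_com = 0.01                      # assumed error variance, equal for all IRS elements

def crandn(shape, var=1.0):
    """Circularly symmetric complex Gaussian samples with the given variance."""
    return np.sqrt(var / 2.0) * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

H_IR = crandn((Nr, N))                 # true IRS -> RSU channel
H_VI = crandn((N, K * Nt))             # true vehicles -> IRS channel

# Estimated combined channels: true rank-one product plus estimation error.
H_com_hat = [
    np.outer(H_IR[:, n], H_VI[n, :]) + crandn((Nr, K * Nt), sigma2_com)
    for n in range(N)
]

def estimated_cascaded(theta):
    """Weighted sum of the estimated combined channels for a phase-shift vector."""
    return sum(t * Hn for t, Hn in zip(theta, H_com_hat))

H_C_hat = estimated_cascaded(np.ones(N))   # identity IRS matrix
H_C_true = H_IR @ H_VI                     # true cascaded channel for the same IRS matrix
error_norm = np.linalg.norm(H_C_hat - H_C_true)
```

Because the errors of the $N$ combined channels add up in the sum, the error in the estimated cascaded channel grows with the number of reflecting elements for a fixed per-element variance.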
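Next, the vector of estimated symbols at the RSU, $\hat{\mathbf{x}}$, can be obtained by linearly filtering the received signal, i.e., $\hat{\mathbf{x}} = \mathbf{W}^H \mathbf{y}$, where $\mathbf{W}^H$ is the RSU receiving filter matrix that stacks all the individual receiving filter matrices $\mathbf{W}_k^H$, $\forall k$. We assume $\mathbf{W}_k^H$, $\forall k$, to be the minimum mean square error (MMSE) filters, which are computed as in [23] and [22]. Now, we can formulate the optimization problem to determine the scheduling vector that maximizes the system sum-rate ($\Lambda_{\boldsymbol{\xi}}$) as follows:

$$\max_{\boldsymbol{\xi}} \; \Lambda_{\boldsymbol{\xi}} \quad \text{s.t.} \quad \xi_{k,n_s} \in \{0, 1\}, \; \forall k, n_s, \tag{6}$$

with

$$\Lambda_{\boldsymbol{\xi}} = \sum_{s \in \mathcal{S}} R_s,$$

where $\mathcal{S}$ is the set of scheduled vehicles and $R_s$ is the individual rate of the $s$-th vehicle, which depends on the interference-plus-noise matrix seen by that vehicle. For greater clarity, Table I summarizes the main system model parameters and their descriptions.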
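As a hedged illustration of how a sum-rate of this kind could be evaluated for a given scheduling vector, the sketch below uses the textbook MIMO rate $\log_2 \det(\mathbf{I} + \mathbf{G}_k^H \mathbf{X}_k^{-1} \mathbf{G}_k)$ with interference treated as noise. This expression, the helper name `sum_rate`, and the choice of applying the scheduling by zeroing precoder columns are assumptions made for illustration; they may differ from the exact rate expression used in this work.

```python
import numpy as np

def sum_rate(H_C, P_blocks, xi, Nt, Ns, sigma2=1.0):
    """Sum of per-vehicle rates for a binary stream-level scheduling vector xi."""
    K = len(P_blocks)
    Nr = H_C.shape[0]
    # Effective channel of each vehicle: cascaded channel times precoder,
    # with the columns of non-scheduled streams set to zero.
    G = []
    for k in range(K):
        mask = np.diag(xi[k * Ns:(k + 1) * Ns])
        G.append(H_C[:, k * Nt:(k + 1) * Nt] @ P_blocks[k] @ mask)
    total = 0.0
    for k in range(K):
        # Interference-plus-noise matrix seen by vehicle k.
        X = sigma2 * np.eye(Nr, dtype=complex)
        for j in range(K):
            if j != k:
                X += G[j] @ G[j].conj().T
        M = np.eye(Ns) + G[k].conj().T @ np.linalg.solve(X, G[k])
        total += np.log2(np.linalg.det(M).real)
    return total

# Toy check: one single-antenna vehicle with unit channel, power, and noise.
H_toy = np.array([[1.0 + 0j]])
P_toy = [np.array([[1.0 + 0j]])]
rate_on = sum_rate(H_toy, P_toy, [1], Nt=1, Ns=1)    # log2(1 + 1)
rate_off = sum_rate(H_toy, P_toy, [0], Nt=1, Ns=1)   # nothing scheduled
```

A vehicle with no scheduled stream contributes $\log_2 \det(\mathbf{I}) = 0$, so summing over all vehicles is equivalent to summing over the scheduled set.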
Let us recall that we must use the estimated channel matrices for computing the entries of $\boldsymbol{\xi}$, $\boldsymbol{\Theta}$, $\mathbf{P}$, and $\mathbf{W}^H$. However, we will use the true channel matrices when determining the system sum-rate values for a realistic analysis of the system performance. Some conventional approaches require computing the matrices $\boldsymbol{\Theta}$, $\mathbf{P}$, and $\mathbf{W}^H$ in every scheduling step. They also require executing an alternating optimization algorithm that iterates between the scheduling and system optimization stages. However, the proposed DRL-based framework enables us to fully separate the scheduling optimization task and determine these matrices only for the final scheduling vector $\boldsymbol{\xi}$. As we explain later, our proposal entails no knowledge of the IRS and precoder matrices along the steps in the online scheduling algorithm. Since matrices $\mathbf{P}$ and $\mathbf{W}^H$ are computed for a given scheduling vector, we can assume that columns/rows related to the non-scheduled streams take zero-valued entries and do not impact the sum-rate.

TABLE I SYSTEM MODEL PARAMETERS
During the training of the PPO-based scheduler and subsequent performance analysis, we compute the IRS matrix $\boldsymbol{\Theta}$ and the precoders' matrix $\mathbf{P}$ using the DCB-DDPG framework introduced in [8]. The simulation results in [8] show that this framework achieves near-optimal performance in several communication scenarios, even under ICSI conditions, which allows us to focus on the scheduling framework. However, note that the proposed PPO-based scheduler is independent of the implementation used for IRS and precoder optimization, so other alternatives could also be considered.
As stated in [8], the computed precoders meet individual power constraints $\|\mathbf{P}_k\|_F^2 \leq \Omega_k$, $\forall k$, where $\Omega_k$ represents the available power at the $k$-th vehicle. For simplicity, we assume the same power constraint value for all vehicles, i.e., $\Omega_k = \Omega$, $\forall k$. Without loss of generality, we also set the noise variance $\sigma^2$ equal to one. Therefore, the signal-to-noise ratio (SNR) in dB per user is given by $\mathrm{SNR} = 10 \log_{10}(\Omega)$.
In this work, two types of scenarios will be considered depending on whether or not a spatial correlation exists between the channel responses of the different vehicles. In the following subsections, we describe the channel models for these two types of scenarios.

A. Spatially Uncorrelated Channel Modeling
When considering spatially uncorrelated setups, we assume the links between the vehicles and the IRS follow an uncorrelated Rayleigh fading model: the entries of $\mathbf{H}_{\mathrm{VI}_k}$, $\forall k$, are independent and identically distributed (i.i.d.) random variables such that $\mathbf{H}_{\mathrm{VI}_k} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \beta_k \mathbf{I}_N)$, where $\beta_k$ is the average channel gain for the vehicle $k$.
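We also assume that the IRS is installed so that a line-of-sight (LoS) to the RSU exists. Therefore, a Rician fading channel model is adopted to describe $\mathbf{H}_{\mathrm{IR}}$. As in [24], [25], [26], $\mathbf{H}_{\mathrm{IR}}$ is given by

$$\mathbf{H}_{\mathrm{IR}} = \sqrt{\frac{\psi}{1 + \psi}} \, \mathbf{H}_{\mathrm{IR}}^{\mathrm{LOS}} + \sqrt{\frac{1}{1 + \psi}} \, \mathbf{H}_{\mathrm{IR}}^{\mathrm{NLOS}},$$

where $\psi$ is the Rician factor, which is set to $\psi = 3$. $\mathbf{H}_{\mathrm{IR}}^{\mathrm{LOS}}$ and $\mathbf{H}_{\mathrm{IR}}^{\mathrm{NLOS}} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{I}_N)$ are the LoS component and the non-line-of-sight (NLoS) uncorrelated Rayleigh fading component, respectively. For simplicity, we assume the RSU is equipped with a uniform linear array (ULA), and $\mathbf{H}_{\mathrm{IR}}^{\mathrm{LOS}}$ is computed as in [22].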
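A Rician channel of this kind is commonly generated by weighting a deterministic LoS matrix and a random NLoS matrix with $\sqrt{\psi/(1+\psi)}$ and $\sqrt{1/(1+\psi)}$, respectively. The sketch below follows that common form; the all-ones LoS matrix is a hypothetical placeholder and not the ULA response computed as in [22].

```python
import numpy as np

def rician_channel(H_los, psi=3.0, rng=None):
    """Combine a fixed LoS matrix with an i.i.d. Rayleigh NLoS component."""
    rng = np.random.default_rng() if rng is None else rng
    H_nlos = (rng.standard_normal(H_los.shape)
              + 1j * rng.standard_normal(H_los.shape)) / np.sqrt(2)
    return np.sqrt(psi / (1.0 + psi)) * H_los + np.sqrt(1.0 / (1.0 + psi)) * H_nlos

Nr, N = 4, 8                                   # illustrative sizes
H_los = np.ones((Nr, N), dtype=complex)        # hypothetical LoS placeholder
H_IR = rician_channel(H_los, psi=3.0, rng=np.random.default_rng(0))
```

With $\psi = 3$, three quarters of the average channel power comes from the LoS component; as $\psi$ grows, the channel approaches the deterministic LoS matrix.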
At the same time, two different spatially uncorrelated situations will be considered for the vehicle-IRS links, as shown in Fig. 2. The one on the left side of Fig. 2 (constant distance) assumes the average channel gains $\beta_k = 1$, $\forall k$. As in [22], we consider that $\beta_k$ represents the macroscopic large-scale fading related to distance-dependent path loss. Hence, this condition represents vehicles located at the same distance from the IRS. On the right side, distances to the IRS are randomly distributed (varying distance). In this configuration, the average path gains $\beta_k$ are drawn at random for each vehicle. In practice, we use $|\beta_k|$ to avoid negative values.

B. Spatially Correlated Channel Modeling
In the second type of scenario, we assume the IRS reflecting elements are spatially correlated. The channels between the vehicles and the IRS are now described as $\mathbf{H}_{\mathrm{VI}_k} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{R}_k)$, $\forall k$, where $\mathbf{R}_k \in \mathbb{C}^{N \times N}$ is the positive semidefinite spatial correlation matrix for the $k$-th user. On the other hand, the correlated Rayleigh fading component of the channel between the IRS and the RSU is characterized by the spatial correlation matrix $\mathbf{R}_{\mathrm{IR}}$. As in [21], we assume the IRS is equipped with a ULA. Hence, the entries of the spatial correlation matrices are computed using the simplified expression (10) taken from [22], where $\varphi$ is the nominal angle in radians, $\sigma_{\varphi}$ is the angular standard deviation in radians, and $d_H$ is the antenna spacing measured in multiples of the wavelength. Equation (10) holds for computing the entries of the spatial correlation matrices $\mathbf{R}_k$, $\forall k$, and $\mathbf{R}_{\mathrm{IR}}$, by considering the nominal angles $\varphi_k$, $\forall k$, and $\varphi_{\mathrm{IR}}$, respectively. We assume $\sigma_{\varphi} = 0.17$ rad ($\approx 10^{\circ}$) for all the correlation matrices since it is a reasonable value in urban cellular networks [22]. Besides, we set the antenna spacing $d_H = 0.5$, which is a commonly used value [22].

Fig. 3 illustrates the two spatially correlated situations considered for the vehicle-IRS links. In both configurations, vehicles are gathered in $G$ sets ($\mathcal{G}_g$, $\forall g$) with nominal angles $\varphi_{\mathcal{G}_g}$ such that $\varphi_k \sim \mathcal{N}_{\mathbb{R}}(\varphi_{\mathcal{G}_g}, 0.05)$, $\forall k \in \mathcal{G}_g$. Hence, vehicles in the same set have similar nominal angles and, thus, similar correlation matrices. Forcing this condition makes scheduling more challenging since vehicles with similar spatial properties interfere strongly with one another. As in the spatially uncorrelated case, there is a configuration where the vehicles are located at the same distance (correlated constant distance) and another where the distances are different (correlated varying distance).
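The correlation model and the sampling of correlated channels can be sketched as follows. The closed form used here is the Gaussian local scattering approximation for a ULA that is commonly associated with [22]; whether it matches (10) term by term, including the unit gain `beta`, is an assumption. The group angle and the reading of 0.05 as a variance are also assumptions.

```python
import numpy as np

def local_scattering_corr(N, phi, sigma_phi=0.17, d_H=0.5, beta=1.0):
    """N x N ULA correlation matrix under the Gaussian local scattering approximation."""
    idx = np.arange(N)
    diff = idx[:, None] - idx[None, :]                    # element index difference l - m
    phase = np.exp(2j * np.pi * d_H * diff * np.sin(phi))
    decay = np.exp(-0.5 * (sigma_phi * 2 * np.pi * d_H * diff * np.cos(phi)) ** 2)
    return beta * phase * decay

def correlated_rayleigh(R, n_cols, rng):
    """Draw an N x n_cols matrix whose columns are distributed as CN(0, R)."""
    w, U = np.linalg.eigh(R)                              # R is Hermitian
    R_half = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.conj().T
    W = (rng.standard_normal((R.shape[0], n_cols))
         + 1j * rng.standard_normal((R.shape[0], n_cols))) / np.sqrt(2)
    return R_half @ W

rng = np.random.default_rng(0)
phi_group = np.deg2rad(30.0)                              # hypothetical group nominal angle
phi_k = rng.normal(phi_group, np.sqrt(0.05))              # 0.05 treated as a variance (assumption)
R_k = local_scattering_corr(N=16, phi=phi_k)
H_VI_k = correlated_rayleigh(R_k, n_cols=2, rng=rng)      # N x Nt channel of one vehicle
```

Vehicles whose nominal angles are close produce nearly identical correlation matrices, which is what makes co-scheduling them costly in terms of interference.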

IV. PPO-BASED SEQUENTIAL SCHEDULING
This section describes the proposed sequential scheduling based on the PPO framework. First, some key features of this DRL-based framework are reviewed. Next, the state, action, and reward spaces considered for the optimization problem are defined. Finally, the algorithmic solution is derived.

A. PPO Framework
PPO is a model-free policy-gradient framework that was first introduced in [27] and has attracted a great deal of research interest. As explained in [27] and [28], PPO achieves a balanced performance. It combines the sample efficiency and policy improvement reliability of algorithms like TRPO while performing computationally tractable policy updates. Some of the features of a PPO agent which make it an appropriate choice for solving our optimization problem are the following:
• Actor-critic: called this way due to the interactions between the ANN-based approximations of the reward and policy functions. The critic, $\nu : (s_t, \vartheta_v) \to V(s_t)$, is the framework element that learns to map the state into an approximation of the state-value function. On the other hand, the actor, which maps $(s_t, \vartheta_\pi) \to a_t$, learns actions that maximize the critic's output. The trainable parameters of the critic, $\vartheta_v$, and the actor, $\vartheta_\pi$, are updated by performing stochastic gradient descent updates over a joint objective function. As explained in [28] and [29], using a critic function approximation provides better stability since the variance of reward values used for training decreases. This reduction may lead to a faster convergence of the actor and critic networks than other actor-only policy gradient algorithms.
• Stochastic actions: the actor in PPO is composed of three stages. The first is the ANN-based approximation of the policy and computes a probability mass function for the feasible actions, i.e., $\pi(a \mid s_t, \vartheta_\pi)$, $\forall a \in \mathcal{A}$. The second stage defines a discrete probability distribution from the computed probability mass function. Finally, a random sampler selects an action ($a_t$) according to the defined probability distribution. This stochastic selection of the discrete actions has natural capabilities for balancing exploration and exploitation during the training phase. Besides, stochastic actions are more desirable in POMDPs since the probabilistic selection of actions prevents policies from getting stuck in wrong actions provoked by erroneous state observations.
• Multi-environment on-policy training: on-policy algorithms like PPO only use training experiences generated with the current policy. To enhance the efficiency of the sample generation task, PPO performs a multi-environment process where several agents with the same current policy run in parallel over different environments. These experiences are stored in a replay buffer (which we will call $\mathcal{R}$), which refills with the up-to-date collected data at every training stage.
• Clipped objective function: this is the most relevant feature of PPO algorithms. The objective function used by PPO to train the policy network avoids destructively large updates by clipping the output of one of the elements of the objective function. As updates are limited to not significantly altering the existing policy, multiple updates are possible over the same up-to-date collected data. Hence, it provides higher stability for training since it prevents the ANN-based policy from suffering the effects of vanishing or exploding gradients.
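Two of the features listed above, the stochastic selection of discrete actions and the clipped objective, can be sketched in a few lines. The clipped surrogate follows the standard form from [27]; the clipping range `eps = 0.2`, the softmax output layer, and the toy numbers are assumptions for illustration rather than the configuration used in this work.

```python
import numpy as np

def sample_action(logits, rng):
    """Softmax over the actor outputs, then draw one discrete action index."""
    z = logits - np.max(logits)                 # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))       # probability mass function over actions
    action = rng.choice(len(probs), p=probs)    # random sampler
    return action, probs[action]

def clipped_surrogate(pi_new, pi_old, advantages, eps=0.2):
    """PPO clipped surrogate objective averaged over a batch (to be maximized)."""
    ratio = np.asarray(pi_new) / np.asarray(pi_old)
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

rng = np.random.default_rng(0)
action, prob = sample_action(np.array([0.1, 2.0, -1.0]), rng)

# First sample: ratio 3 with positive advantage is clipped to 1.2.
# Second sample: ratio 0.5 with negative advantage is clipped to -0.8.
objective = clipped_surrogate([0.9, 0.15], [0.3, 0.3], [1.0, -1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the probability ratio outside the interval $[1 - \epsilon, 1 + \epsilon]$, which is what allows several gradient updates over the same batch of experiences.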

B. State, Action and Reward
In order to solve the scheduling optimization problem in (6), we introduce the following states, actions, and rewards. The state vector $\mathbf{s}_t$ comprises the entries of the current scheduling vector $\boldsymbol{\xi}_t$ and the entries of the estimated cascaded channel matrix $\hat{\mathbf{H}}_{\mathrm{C}}$. Hence, the state vector is constructed such that

$$\mathbf{s}_t = \left[\boldsymbol{\xi}_t^T, \, \mathrm{flatten}(\hat{\mathbf{H}}_{\mathrm{C}})^T\right]^T.$$

The estimated cascaded channel matrix $\hat{\mathbf{H}}_{\mathrm{C}}$ is computed as in (4) by considering the estimated combined channel matrices $\hat{\mathbf{H}}_{\mathrm{com}_n}$, $\forall n$, and assuming an IRS matrix $\boldsymbol{\Theta} = \mathbf{I}_N$ along all the training steps. This way, the information in the state vectors is only affected by variations in the estimated combined channel matrices. We consider each training episode to fit within one channel coherence block. Hence, $\hat{\mathbf{H}}_{\mathrm{C}}$ does not vary during each training episode.
For every episode, the initial state is a state vector where all the entries related to the scheduling vector equal zero, i.e., no stream is scheduled. On the other hand, we consider a state $\mathbf{s}_t$ as terminal if the scheduling vector, $\boldsymbol{\xi}_t$, equals the scheduling vector of the previous state, $\boldsymbol{\xi}_{t-1}$. After reaching a terminal state, the environment resets to an initial state and a new episode starts.
The dimension of the state space vectors is $D_{\mathrm{state}} = K N_s + N_r K N_t$. We assume the state space to be continuous-valued because the entries of $\hat{\mathbf{H}}_{\mathrm{C}}$ can take any complex value, although the entries of $\boldsymbol{\xi}_t$ are binary-valued. Notice that we assume signal processing techniques that handle complex-valued entries. Otherwise, the imaginary and real parts should be treated as independent inputs, leading to vectors twice the size.
The action vector $a_t$ is the binary one-hot encoding of the stream to be included in the scheduling of the next state. Hence, the action vector is such that the scheduling vector of the current state, $\xi_t$, and the scheduling vector of the next state, $\xi_{t+1}$, are related as follows: $\xi_{t+1} = \mathrm{OR}(\xi_t, a_t)$. According to this formulation of actions, the size of the set of feasible actions grows only linearly with the number of streams (i.e., $|\mathcal{A}| = KN_s$). Hence, it overcomes the scalability constraint of previous combinatorial approaches [11], [15], [21]. The dimension of the action vectors is then $D_{\mathrm{action}} = KN_s$.
The reward $r_t$ is determined as a function of the sum-rate since it is the metric we aim to maximize. In this formulation, rather than using the sum-rate value itself, we calculate $r_t$ as the difference between the values after and before taking the action $a_t$, i.e., $r_t = \Lambda_{\xi_{t+1}} - \Lambda_{\xi_t}$, where $\Lambda_{\xi}$ denotes the system sum-rate obtained with the scheduling vector $\xi$. The ANNs used for function approximation are sensitive to the scale of the features. If we use the sum-rate values as rewards, the difference between the scales of actions and rewards can hinder learning and destabilize training across consecutive time steps. By computing the rewards as the difference between the sum-rate values, we force the scales of state, action, and reward values to remain similar.
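The state transition, the terminal condition, and the difference-based reward can be sketched as a single environment step. The function `step` and the objective `toy_sum_rate` are hypothetical stand-ins: in the actual system the sum-rate requires computing the IRS and precoder matrices, which this toy objective does not attempt.

```python
import numpy as np

def step(xi, action_index, sum_rate):
    """One scheduling step: returns (xi_next, reward, terminal)."""
    a = np.zeros_like(xi)
    a[action_index] = 1                              # one-hot action vector
    xi_next = np.logical_or(xi, a).astype(xi.dtype)  # xi_{t+1} = OR(xi_t, a_t)
    reward = sum_rate(xi_next) - sum_rate(xi)        # sum-rate after minus before
    terminal = bool(np.array_equal(xi_next, xi))     # scheduling did not change
    return xi_next, reward, terminal

def toy_sum_rate(xi):
    """Illustrative objective only: per-stream gains minus a pairwise penalty."""
    gains = np.array([3.0, 2.0, 1.0, 0.5])
    n = xi.sum()
    return float(gains @ xi - 0.4 * n * (n - 1))

xi = np.zeros(4, dtype=int)                  # initial state: nothing scheduled
xi, r0, done0 = step(xi, 0, toy_sum_rate)    # schedule stream 0
xi, r1, done1 = step(xi, 1, toy_sum_rate)    # schedule stream 1
xi, r2, done2 = step(xi, 1, toy_sum_rate)    # repeat stream 1: terminal state
```

Selecting an already scheduled stream leaves the scheduling vector unchanged, which is how the agent ends an episode under this formulation.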

C. The PPO Algorithm
In this section, we present the algorithm used during the offline training of the PPO agent when solving the scheduling optimization problem in (6). In particular, the scheduling policies that maximize the system sum-rate in the considered communication scenarios are learned by following Algorithm 1.
During the initialization stage, the actor and the critic are created with random initial parameters $\vartheta_\pi$ and $\vartheta_v$, respectively. Next, we perform the sample collection stage (lines 4 to 18), where we store the experience tuples that we will use later to train the actor and the critic. We generate the samples by considering the current actor policy running over $E$ parallel environments. The interactions in each environment $e$ are stored in order in a replay buffer $R_e$. We consider extended experience tuples whose structure is $(s_t, a_t, r_t, \chi_t, \pi_{\mathrm{old}}(a_t|s_t), V_t, A_t)$, where $\chi_t \in \{0, 1\}$ equals one if the next state $s_{t+1}$ is terminal, and $\pi_{\mathrm{old}}(a_t|s_t)$ stands for the probability of taking action $a_t$ in the current state with the current actor policy. When $\chi_t = 1$ (i.e., the next state is terminal), the environment resets and $s_{t+1}$ becomes an initial state. Let us recall that a new estimated cascaded channel matrix $\hat{H}_C$ is generated for each initial state, ensuring the exploration of the state space. The state values $V_t$ and the advantage values $A_t$ are computed backward starting from the last visited state as in lines 15 and 17, respectively. Note that for computing these two values, we use the terms in the reduced tuples $(s_t, a_t, r_t, \chi_t, \pi_{\mathrm{old}}(a_t|s_t), \sim, \sim)$ stored in line 13. We use a conventional and simple approach to the advantage function since no improvement was achieved when evaluating more complex alternatives.
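The backward computation of the state values and advantages can be sketched as follows. The exact expressions in lines 15 and 17 of Algorithm 1 are not reproduced here, so this sketch assumes one common simple choice: discounted returns that are cut at terminal transitions, and advantages equal to the return minus the critic estimate. The function name, the `bootstrap` argument, and the discount value are assumptions.

```python
import numpy as np

def returns_and_advantages(rewards, dones, critic_values, gamma=0.99, bootstrap=0.0):
    """Backward pass over one replay buffer (assumed simple advantage form)."""
    T = len(rewards)
    V = np.zeros(T)
    running = bootstrap                  # value assumed after the last stored step
    for t in reversed(range(T)):
        # chi_t = 1 means the next state is terminal, so no value flows back.
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        V[t] = running
    A = V - np.asarray(critic_values, dtype=float)
    return V, A

rewards = [1.0, 0.5, 0.0, 2.0]
dones = [0, 0, 1, 0]                     # chi_t flags
critic = [1.0, 0.4, 0.1, 1.5]            # critic estimates for the visited states
V, A = returns_and_advantages(rewards, dones, critic, gamma=0.9)
print(V)  # [1.45 0.5  0.   2.  ]
```

The third step has $\chi_t = 1$, so its return ignores the reward collected after the environment reset.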
After completing the sample collection stage, we combine all the replay buffers into one ($R$). Samples in this buffer are used over $I_u$ iterations to update the trainable parameters of the actor and the critic networks. At each iteration, the order of the tuples in the buffer $R$ is randomized, as it is no longer relevant. Besides, randomizing reduces the correlation between the samples within a mini-batch, improving the training performance.
In every iteration, we divide the randomized data into $M$ mini-batches ($B_m$, $\forall m$). Lines 24 to 35 describe the training process by considering the samples within one mini-batch. For every experience tuple $E_i$, we determine the probability of taking the action $a_i$ according to the current actor policy, $\pi_{\mathrm{upd}}(a_i|s_i)$. Note that this value differs from the stored value $\pi_{\mathrm{old}}(a_i|s_i)$ since the policy used for sample collection changes along the update iterations. We compute the ratio between these two probabilities, $\rho_i$, as stated in line 28. Next, we compute $\Phi_i$, which is the entropy of the probability distribution of the actor for the state $s_i$.
Finally, by considering all the experience tuples in the mini-batch and the computed values related to them, we determine the individual terms of the joint objective function $L$. The term $L^{\mathrm{CLIP}}$ that we aim to maximize is the main part of the objective function in PPO since it ensures the policy updates remain stable [27], [28]. This term is given in (14), where $\epsilon$ is the hyper-parameter that determines the clipping interval (see [27], [28] for further details on the clip(·) operator). The other term we aim to maximize is the entropy bonus $H$, which is computed from the entropy values $\Phi_i$. Including this entropy-related term in the objective function enhances the exploratory behavior of the agent. Finally, we aim to minimize the difference between the estimates computed by the critic and the real state values of the sampled states, which is measured by the critic loss $L^{V}$. Later, we back-propagate the value computed for the joint objective function $L$.
The entire training algorithm runs over $I_a$ iterations. The expected result of this algorithm is a trained actor capable of predicting near-optimal scheduling vectors for unseen channel realizations.
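The three terms of the joint objective can be sketched in NumPy for one mini-batch. The clipped surrogate follows the standard PPO form described in [27]; the averaging over the mini-batch, the sign convention of the combined loss, and the default values of `eps`, `lam_H`, and `lam_V` are assumptions and may differ from the expressions and Table II values used in this work.

```python
import numpy as np

def ppo_terms(p_new, p_old, adv, probs_new, v_pred, v_target,
              eps=0.2, lam_H=0.01, lam_V=0.5):
    """Clipped surrogate, entropy bonus, critic loss, and combined loss."""
    rho = p_new / p_old                              # probability ratio rho_i
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
    L_clip = np.mean(np.minimum(rho * adv, clipped * adv))
    # Entropy Phi_i of the actor distribution for each sampled state.
    phi = -np.sum(probs_new * np.log(probs_new + 1e-12), axis=1)
    H = np.mean(phi)
    L_V = np.mean((v_target - v_pred) ** 2)          # squared critic error
    # Assumed sign convention: a loss to be minimized by gradient descent.
    loss = -L_clip - lam_H * H + lam_V * L_V
    return L_clip, H, L_V, loss

p_new = np.array([0.6, 0.2])                 # pi_upd(a_i | s_i)
p_old = np.array([0.3, 0.4])                 # pi_old(a_i | s_i)
adv = np.array([1.0, -1.0])                  # advantages A_i
probs_new = np.array([[0.6, 0.4], [0.2, 0.8]])
v_pred = np.array([1.0, 0.0])
v_target = np.array([1.5, 0.0])

L_clip, H, L_V, loss = ppo_terms(p_new, p_old, adv, probs_new, v_pred, v_target)
```

In this toy mini-batch the ratios are 2 and 0.5, so both samples are affected by the clipping interval [0.8, 1.2]: the first contributes 1.2 instead of 2, and the second contributes -0.8 instead of -0.5.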
Table II shows the configuration parameters considered for the training of the PPO agent. The selected numbers of algorithm and update iterations provide enough training to reach a good performance of the PPO agent since no significant improvement was observed beyond this point. Using four parallel environments and 128 sample collection steps at each iteration ensures a proper exploration of the state space. Mini-batches with 64 entries constitute an adequate trade-off between complexity and learning speed. We selected the different coefficients, the learning rate, and the discount factor through a grid search approach. The values obtained are similar to those proposed in [27].
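With the values quoted above (four environments, 128 collection steps, mini-batches of 64 entries), each algorithm iteration yields 512 samples and eight mini-batches per update iteration. A minimal sketch of the shuffling and splitting step is given below; the generator name `minibatches` and the use of index arrays are assumptions.

```python
import numpy as np

E, T, batch_size = 4, 128, 64        # values quoted in the text
n_samples = E * T                    # size of the combined buffer R
M = n_samples // batch_size          # number of mini-batches per update iteration

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays of a freshly shuffled buffer, one per mini-batch."""
    order = rng.permutation(n_samples)       # randomize the tuple order
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(n_samples, batch_size, rng))
print(n_samples, M, len(batches))    # 512 8 8
```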
The training described in Algorithm 1 is expected to run mostly offline so that the trained actor can later be deployed in a practical scenario. We next describe in Algorithm 2 the online behavior of the deployed actor. The expected result is to obtain the scheduling vector $\xi_t$ that maximizes the system sum-rate for each estimated cascaded channel $\hat{H}_C$.
During the online stage, the observed interactions improve scheduling because they allow the trained actor to adapt to changes in the communication environment. However, in this stage, we only compute the system sum-rate for the scheduling vector in the terminal state. Therefore, we must use a different reward assignment, where $r_t = \Lambda_{\xi_t}$ if $s_{t+1}$ is a terminal state and $r_t = 0$ otherwise. This episodic reward is less efficient than the one used for offline training but enables the actor to continue learning without sacrificing the scalability and performance of the scheduling algorithm.
Notice that, during online interactions, we select actions deterministically to find the scheduling vectors that maximize system performance. Therefore, we limit the exploration of the action space to avoid unpredictable or undesirable behaviors. As explained in [16], model-based offline training can be deployed simultaneously to ensure sufficient action exploration, such that both kinds of experiences contribute to the learning process. On the other hand, exploration of the state space is guaranteed by the dynamic behavior of the communication conditions.
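The online behavior can be sketched as a loop that queries the actor, takes the most probable action, and stops once the scheduling vector no longer changes. The names `schedule_online`, `episodic_reward`, and `toy_actor` are hypothetical; in particular, `toy_actor` is a stand-in for the trained network and only mimics an actor that eventually re-selects a scheduled stream.

```python
import numpy as np

def schedule_online(actor_probs, H_c, n_streams):
    """Deterministic sequential scheduling with at most n_streams actor calls."""
    xi = np.zeros(n_streams, dtype=int)      # initial state: nothing scheduled
    for _ in range(n_streams):
        probs = actor_probs(xi, H_c)
        a = int(np.argmax(probs))            # deterministic action selection
        if xi[a] == 1:                       # xi would not change: terminal state
            break
        xi[a] = 1
    return xi

def episodic_reward(xi, next_is_terminal, sum_rate):
    """Online reward: the sum-rate at the end of the episode, zero otherwise."""
    return sum_rate(xi) if next_is_terminal else 0.0

def toy_actor(xi, H_c):
    """Stand-in policy: prefers unscheduled streams whose gain exceeds 1."""
    gains = np.linalg.norm(H_c, axis=0)
    scores = np.where(xi == 1, 1.0, gains)
    return scores / scores.sum()

H_c = np.diag([3.0, 2.0, 0.5, 0.2])          # toy channel, one column per stream
xi_final = schedule_online(toy_actor, H_c, n_streams=4)
print(xi_final)                              # [1 1 0 0]
```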

D. Actor and Critic ANN Structures
Fig. 4 shows the structures of the actor and the critic that we propose to use as parts of the PPO framework. As explained above, the actor comprises three stages: an ANN, a probability distribution stage, and a random sampler. The dimensions of the input and output layers of the actor ANN equal $D_{\mathrm{state}}$ and $D_{\mathrm{action}}$, respectively. We use two fully connected hidden layers with $2D_{\mathrm{state}}$ neurons each. We ran several tests with different configurations, and no improvement was observed when using bigger setups. We use the rectified linear unit (ReLU) function as activation in the hidden layers. Besides, we use the softmax function at the output layer to obtain the normalized probabilities over the feasible actions.
In the critic ANN, the dimension of the input layer also equals $D_{\mathrm{state}}$. We use three fully connected hidden layers with identical shapes to those in the actor. The output layer dimension is one since this network aims to predict the state-value function for a given state. Hence, we use the linear activation function at this output layer. Finally, we use the Adam optimizer in both actor and critic ANNs since this algorithm has proven to be computationally efficient and robust for supervised and DRL problems [30], [31], [32].
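The layer shapes described above can be checked with a small forward-pass sketch. It uses plain NumPy with real-valued inputs and a He-style random initialization, both of which are simplifying assumptions; the helper names `init_mlp` and `forward` are hypothetical, and no training is performed.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive fully connected layers."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, out="linear"):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)       # ReLU in the hidden layers
    W, b = params[-1]
    z = x @ W + b
    if out == "softmax":                     # normalized action probabilities
        z = np.exp(z - z.max())
        return z / z.sum()
    return z                                 # linear output for the critic

K, N_s, N_t, N_r = 10, 2, 2, 8
D_state, D_action = K * N_s + N_r * K * N_t, K * N_s
rng = np.random.default_rng(0)

# Actor: two hidden layers of 2*D_state neurons; critic: three such layers.
actor = init_mlp([D_state, 2 * D_state, 2 * D_state, D_action], rng)
critic = init_mlp([D_state, 2 * D_state, 2 * D_state, 2 * D_state, 1], rng)

s = rng.standard_normal(D_state)             # placeholder real-valued state
pi = forward(actor, s, out="softmax")        # probabilities over K*N_s actions
v = forward(critic, s)                       # scalar state-value estimate
```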

E. Convergence Analysis
As in [8], [33], [34], we perform a convergence analysis to demonstrate the suitability of our proposed algorithm to solve the optimization problem in (6). We begin this evaluation by assessing the system's reward performance. Fig. 5 shows the normalized sum-rate values calculated at the algorithm iterations. Notice that we train a single agent capable of handling various network configurations and channel models. This agent matches the dimensions of the setup employed in most of the simulations considered in Section V. However, we have also considered different setups (such as those with fewer users or varying the number of IRS elements and receiving antennas at the RSU) to assess the generalizability of our proposed solution.
To evaluate the convergence metric across different scenarios and channel conditions, we utilize 1000 channel samples that were not visited during the training process. We employ the maximum sum-rate values as normalization factors to enable a general representation encompassing all the simulation setups.
As shown in Fig. 5, the sum-rate values improve and nearly converge for two learning rate configurations. On the considered simulation setups, the best performance is achieved when the learning rate $\mu_c$ equals 0.001. These simulation results demonstrate the proper behavior of the actor network since it continuously improves at predicting the scheduling vectors that maximize system performance.
Next, we perform a convergence analysis based on the critic loss ($L^{V}$). This parameter measures the difference between the critic network approximation and the actual state-value function, so lower is better. The simulation results in Fig. 6 show the algorithm's convergence with respect to this metric. As in the previous experiment, $\mu_c = 0.001$ offers the best trade-off between convergence speed and stability.
The previous results demonstrate that the PPO-based framework provides reliable and stable solutions for the sequential scheduling problem in the considered scenarios. Both the actor and critic ANNs steadily learn from the interactions and contribute to the system performance in unseen channel realizations. The convergence of general PPO solutions is also proven in [35], demonstrating the robustness of this approach.

V. SIMULATION RESULTS
In this section, we present the results of computer simulations, which validate the use of the PPO framework to find the scheduling vector that maximizes the system sum-rate in the uplink of an IRS-assisted MS MU-MIMO system. We considered the scenarios presented in Sections III-A and III-B, and the following benchmarks:
• User-level exhaustive search (ES): this method evaluates all the $2^K$ feasible user-level scheduling vectors and exhaustively searches for the one that results in the highest sum-rate value. ES is expensive because the IRS matrix, the precoder matrices, and the resulting sum-rate values must be computed for each possible scheduling vector.
• User-level greedy direct (GD): this method first schedules the vehicle with the highest rate and, in the following steps, schedules the vehicle that enhances the system sum-rate the most. The algorithm stops when no performance improvement can be achieved. GD might require computing the IRS and precoder matrices up to $\frac{K(K+1)}{2} - 1$ times.
• Stream-level GD: this method first schedules the stream that provides the highest rate and then schedules the stream that enhances the system sum-rate the most in each subsequent step. The algorithm stops when no further performance improvement can be achieved. Stream-level GD may require computing the IRS and precoder matrices up to $\frac{KN_s(KN_s+1)}{2} - 1$ times.
• Greedy indirect (GI): this method starts by scheduling the vehicle with the highest cascaded channel norm. In the next steps, it selects the vehicles based on a spatial compatibility metric until reaching $K_{\max}$ vehicles. GI computes the IRS and precoder matrices only for the final scheduling vector [12, Algorithm 2].
• Maximum channel gain (MG): this method sequentially schedules the $K_{\max}$ vehicles with the highest cascaded channel norms. The IRS and precoder matrices are computed only for the final scheduling vector.
As in [12], we constrain the number of scheduled vehicles in the GI and MG methods to be $K_{\max} = N_r / N_s$. However, simulation results demonstrate that scheduling more vehicles could lead to higher system sum-rate values in several communication setups. We do not use a stream-level exhaustive search benchmark since the number of feasible scheduling vectors becomes intractable in the considered configurations. For comparison, we include two versions of the proposed PPO-based sequential scheduling, namely the stream-level approach explained in the previous section and a user-level version where, if a vehicle is scheduled, all its streams are scheduled. To ensure fairness between the algorithms, we use the same approach to compute the IRS and precoder matrices in all cases (i.e., [8, Algorithm 1]).
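To give a sense of scale, the number of objective evaluations implied by the ES and GD benchmarks can be computed for the setup with $K = 10$ and $N_s = 2$. The bound used for GD is the stream-level expression quoted above; applying the same form with $K$ in place of $KN_s$ for the user-level variant is an assumption. The function names are hypothetical.

```python
def es_evaluations(n):
    """Exhaustive search over all binary scheduling vectors of length n."""
    return 2 ** n

def gd_max_evaluations(n):
    """Upper bound n(n+1)/2 - 1 quoted for greedy direct scheduling."""
    return n * (n + 1) // 2 - 1

K, N_s = 10, 2
user_es = es_evaluations(K)               # 1024
stream_es = es_evaluations(K * N_s)       # 1048576
user_gd = gd_max_evaluations(K)           # 54
stream_gd = gd_max_evaluations(K * N_s)   # 209
print(user_es, stream_es, user_gd, stream_gd)
```

Each of these evaluations involves computing the IRS and precoder matrices, which is why a stream-level exhaustive search is impractical in this configuration.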
Figs. 7 and 8 show the achievable sum-rate values obtained in the spatially uncorrelated and correlated channel scenarios, respectively. We considered a setup with $K = 10$ vehicles employing $N_t = 2$ antennas to send $N_s = 2$ streams each, an IRS with $N = 30$ scattering elements, and an RSU with $N_r = 8$ receiving antennas. In the correlated channel scenarios, we set $G = 5$. The nominal angles for the different groups are selected uniformly in the interval from 0 to $\pi/2$ rad to avoid mirror angles [22]. As shown, the increment of the sum-rate values is smaller in the correlated scenarios since the interference between the spatially correlated vehicles steeply increases with the SNR values.
In the scenarios considered in Figs. 7 and 8, the stream-level PPO scheduling significantly outperforms the benchmarks and the user-level PPO scheduling. The stream-level PPO scheduling selects the specific streams per vehicle to schedule, achieving better interference control. This flexibility is fundamental in scenarios like the ones considered, where the number of competing streams is higher than the number of receiving antennas. The stream-level GD benchmark also leverages this capability and outperforms most user-level approaches. Note that GD benchmarks are also sequential. However, they require computing the IRS and precoder matrices for all the evaluated scheduling vectors. These computing costs and delays can limit their usage in rapidly varying vehicular communications.
The gap between stream-level approaches and their user-level counterparts is more evident for high SNR values, especially in the correlated scenarios. In those cases, selecting only specific streams helps alleviate the inter-vehicle interference and fully leverage the transmission power without affecting other vehicles with similar spatial properties.
The performance of the user-level PPO scheduling is close to that of the ES benchmark and better than more computationally demanding approaches like the user-level GD benchmark. Besides, although the GD approaches are competitive in scenarios where the average channel gains are the key aspect (like in the uncorrelated varying distance scenario), they fail at assessing the long-term effects of the sequential scheduling in more complex setups (like in the correlated scenarios).
The performance of the GI and MG benchmarks is limited by the $K_{\max}$ value, which is lower than the optimal number of scheduled vehicles in several scenarios. Figs. 9 and 10 show the average numbers of vehicles scheduled in the uncorrelated varying distance and correlated varying distance scenarios, respectively. In both setups, the flexibility of the stream-level PPO scheduling enables allocating more vehicles with at least one stream per vehicle without affecting the system sum-rate. Besides, this proposal makes no prior assumption about the number of vehicles to schedule, enabling it to adapt to the characteristics of the different scenarios. User-level PPO schedules an average number of vehicles similar to the ES benchmark. Unlike the GD approaches, both PPO-based algorithms leverage the capability of RL for long-term analysis. The greedy behavior of GD tends to allocate first the vehicles (streams) with the best communication conditions and disregards the effect this has on the final scheduling vector. Because of this, the capability of the user-level GD to schedule more users steeply decreases for large SNR values in both channel configurations. This effect is more evident in the spatially correlated scenarios since vehicles with high channel gains strongly interfere with others of similar spatial properties.
The following experiments enable us to analyze how the spatial correlation affects the scheduling performance. For this purpose, we considered a setup with $K = 10$ vehicles having $N_t = 2$ antennas to send $N_s = 2$ streams each, an IRS with $N = 30$ scattering elements, an RSU with $N_r = 8$ receiving antennas, and an SNR of 10 dB. We considered the number of groups of vehicles $G$ ranging from 1 to 10. Note that for $G = 1$, all the vehicles are grouped together and share similar nominal angles and spatial correlation matrices. At the other extreme, for $G = 10$, their spatial properties are well defined.
Figs. 11 and 12 show the results obtained for constant distance and varying distance scenarios, respectively. Again, the GI and MG benchmarks perform poorly since they schedule a predefined number of vehicles ($K_{\max} = 4$) regardless of the communication conditions. For lower values of $G$, scheduling $K_{\max}$ vehicles leads to significant interference and, therefore, system performance degradation.
As observed in previous experiments, the user-level PPO scheduling and the ES and user-level GD benchmarks perform similarly. The performance of these approaches initially improves with the increase in the number of groups since vehicles can be properly separated. However, from $G = 3$ to $G = 10$, the sum-rate values saturate. Although the nominal angles become more distributed, the degrees of freedom to manage the interference remain limited by the number of receiving antennas at the RSU. The proposed stream-level PPO scheduling reaches the highest performance over the whole range of $G$ values. Besides, it keeps improving with the increase of $G$, even in the interval where most benchmarks stagnate.

A. Imperfect CSI
In the simulations so far, we have considered perfect CSI (PCSI) to ensure a fair comparison with the different benchmark algorithms since they disregard the effects of channel estimation errors. We next present some computer experiments to illustrate the capability of the proposed stream-level PPO scheduling to address the optimization problem in (6) while considering ICSI.
We evaluated the performance of stream-level PPO scheduling by considering a simulated online stage where channel matrices are estimated with an error of variance $\sigma^2_{\mathrm{com}}$. We consider this variance $\sigma^2_{\mathrm{com}}$ common to all the estimated combined channel matrices (i.e., $\sigma^2_{\mathrm{com}_n} = \sigma^2_{\mathrm{com}}$, $\forall n$). Fig. 13 shows the sum-rate values obtained with the trained PPO-based scheduling agent in a spatially correlated varying distance setup with SNR = 10 dB, $K = 10$, $N_t = 2$, $N_s = 2$, $N = 30$, $N_r = 8$, and $G = 5$. The figure also shows the sum-rates of three benchmarks: stream-level PPO with PCSI, user-level ES, and random scheduling.
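A channel estimate with a given error variance can be simulated as in the following sketch. It assumes an additive, circularly symmetric complex Gaussian error, which is a common choice but not necessarily the exact ICSI model defined for this work; the function name `estimate_channel` and the matrix size are placeholders.

```python
import numpy as np

def estimate_channel(H, sigma2, rng):
    """Return H plus a complex Gaussian error of variance sigma2 per entry.

    Assumed additive error model, used here only for illustration.
    """
    err = np.sqrt(sigma2 / 2.0) * (rng.standard_normal(H.shape)
                                   + 1j * rng.standard_normal(H.shape))
    return H + err

rng = np.random.default_rng(0)
# Placeholder channel matrix with unit-variance complex entries.
H = (rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200))) / np.sqrt(2)
H_hat = estimate_channel(H, sigma2=0.25, rng=rng)

# The empirical error power should be close to the requested variance.
empirical = np.mean(np.abs(H_hat - H) ** 2)
```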
The results in Fig. 13 show that stream-level PPO outperforms ES in several of the considered scenarios. However, the performance of the proposed solution in ICSI conditions deteriorates for high SNR values because, in this regime, scheduling the wrong vehicles can lead to significant interference. Nevertheless, the proposed solution outperforms the random scheduling benchmark, even for the most challenging setting ($\sigma^2_{\mathrm{com}_n} = 0.25$ and SNR = 15 dB).
During the online stage, the trained agent keeps learning. Although the system performance initially degrades because of the ICSI, it improves by observing up-to-date interactions. This continuous-learning capability makes our PPO proposal a robust alternative to less flexible classical approaches.

B. Computational Complexity
In this subsection, we present a computational complexity analysis of the stream-level PPO scheduling based on the required multiplications. As stated in [36], this is a high-level metric since it disregards other less time-consuming operations. The complexity analysis for Algorithms 1 and 2, which occur at different moments and have different resource limitations, is addressed separately. Algorithm 1 corresponds to the training stage, which is mostly executed offline, and Algorithm 2 implements the online stage, where the trained PPO-based agent is used to predict the best scheduling vectors for the estimated channel matrices.
The highest computational complexity in Algorithm 1 is in the actor and critic ANNs. The computational complexity for ANNs with fully connected layers is bounded by a term that is linear in the number of hidden layers and quadratic in $\zeta$, the number of neurons in the widest layers [34], [36]. We disregard the number of layers because it does not depend on the communication parameters, and its value is generally small compared to $\zeta$. In both the actor and the critic, $\zeta$ equals $2D_{\mathrm{state}}$. Hence, the complexity of the sample collection in Algorithm 1 is in the order of $O(I_a T (K^2 N_s^2 + N_r^2 K^2 N_t^2))$, where $I_a$ and $T$ stand for the number of algorithm iterations and sample collection time steps, respectively. On the other hand, the computational complexity during the ANN updates is in the order of $O(I_a |B_m| (K^2 N_s^2 + N_r^2 K^2 N_t^2))$, where $|B_m|$ stands for the size of the mini-batches. Finally, the general complexity of Algorithm 1 is in the order of $O((I_a T + I_a |B_m|)(K^2 N_s^2 + N_r^2 K^2 N_t^2))$.
During the online scheduling stage, the computational complexity is remarkably lower, which is suitable for the stringent latency requirements of vehicular communications. In this stage, the trained actor is used to predict the scheduling vector in a sequential fashion with up to $KN_s$ forward passes of the ANN. Since the number of hidden neurons in the actor's widest layers equals $2D_{\mathrm{state}}$, the computational complexity of this stage is in the order of $O(KN_s (K^2 N_s^2 + N_r^2 K^2 N_t^2))$.
Table III compares the computational complexity of the proposed solution to some of the best-performing benchmarks. A significant part of the complexity in these scheduling algorithms is in computing the objective function for a given scheduling vector, i.e., finding the IRS and precoder matrices, and calculating the sum-rate. We reference the complexity of the algorithm used for this purpose in [8, Algorithm 1], which is in the order of $O(K^2 N_t^2 N^2 + N_r^2 N^2)$. Notice that our proposed solution performs the IRS and precoder optimization only after finding the right scheduling vector, while both benchmarks evaluate multiple scheduling options.
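The online cost can be made concrete by counting the weight multiplications of the actor for the setup with $K = 10$, $N_s = 2$, $N_t = 2$, and $N_r = 8$. This is an illustrative count derived from the layer shapes in Section IV-D, not a value taken from Table III; it treats each complex multiplication as one operation and ignores biases and activations. The helper name `mlp_multiplications` is hypothetical.

```python
def mlp_multiplications(sizes):
    """Weight multiplications in one forward pass of a fully connected ANN."""
    return sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))

K, N_s, N_t, N_r = 10, 2, 2, 8
D_state = K * N_s + N_r * K * N_t        # 180
D_action = K * N_s                       # 20

# Actor: input, two hidden layers of 2*D_state neurons, output.
per_pass = mlp_multiplications([D_state, 2 * D_state, 2 * D_state, D_action])
online_max = D_action * per_pass         # up to K*N_s forward passes

# Order expression quoted for the online stage (constants dropped).
order_term = K * N_s * (K**2 * N_s**2 + N_r**2 * K**2 * N_t**2)

print(per_pass, online_max, order_term)  # 201600 4032000 520000
```

The exact count and the order expression differ by a constant factor, as expected, since the latter drops the factor introduced by $\zeta = 2D_{\mathrm{state}}$ and the cross terms.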
Fig. 14 shows the number of required multiplications for evaluating several network setups. We start with a simplified configuration (1x) where $K = 5$, $N_s = 2$, $N_t = 2$, $N_r = 4$, and $N = 15$, and gradually increase these values by a linear factor (2x, 3x, ..., 6x). As shown in the figure, the computational complexity of the proposed solution is lower in all setups. Moreover, the difference compared to both benchmarks increases with the size of the network parameters. For the largest scenario (6x), the computational complexities of the stream-level GD and ES benchmarks are almost four and eight orders of magnitude larger, respectively.

VI. CONCLUSION
We have investigated a sequential DRL-based scheduling approach for the sum-rate maximization in the uplink of an IRS-assisted MU MIMO communication system. The optimization problem is formulated as a POMDP, and a PPO-based framework is proposed to address the scheduling task in a sequential fashion. The scheduling capabilities have been extended to the stream level, and this proposal has been tested in several scenarios with spatially correlated and uncorrelated channels and ICSI conditions. We assessed the proposed stream-level PPO-based sequential scheduler against several benchmarks. Our main findings can be summarized as follows:
• The proposed scheduler outperforms the considered benchmarks in terms of system sum-rate in all the evaluated scenarios. Besides, it allows scheduling more vehicles by selecting the appropriate streams to allocate.
• The proposed solution has proven robust in ICSI conditions since it achieves a competitive performance relative to the optimal user-level scheduling, even for high estimation errors.
• The sequential DRL-based scheduling formulation enables us to overcome the scalability limitations of combinatorial approaches, as it reduces the number of feasible actions from $2^{KN_s}$ to $KN_s$.

Fig. 1. Uplink of an IRS-assisted MIMO communication between several connected vehicles and an RSU.

Algorithm 1 (fragment): compute the new values of the trainable parameters $\vartheta_\pi$ and $\vartheta_v$ by performing a stochastic gradient descent update. The parameters $\lambda_H$ and $\lambda_V$ stand for adjustable coefficients.

Algorithm 2 (fragment): scheduling vector $\xi_t$ from terminal $s_t$. 2: while state $s_t$ is not TERMINAL: 3: get $a_t \leftarrow (s_t, \vartheta_\pi)$

TABLE III. COMPUTATIONAL COMPLEXITY