Hybrid IRS-Assisted Secure Satellite Downlink Communications: A Fast Deep Reinforcement Learning Approach

Quynh Tu Ngo, Senior Member, IEEE, Khoa Tran Phan, Member, IEEE, Abdun Mahmood, Senior Member, IEEE, and Wei Xiang, Senior Member, IEEE

Abstract-This paper considers a secure satellite downlink communication system with a hybrid intelligent reflecting surface (IRS). A robust design problem for the satellite and IRS joint beamforming is formulated to maximize the system's worst-case secrecy rate, considering practical models of the outdated channel state information and IRS power consumption. We leverage deep reinforcement learning (DRL) to solve the problem by proposing a fast DRL algorithm, namely the deep post-decision state-deterministic policy gradient (DPDS-DPG) algorithm. In DPDS-DPG, the prior known system dynamics are exploited by integrating the PDS concept into the traditional deep DPG (DDPG) algorithm, resulting in faster learning convergence. Simulation results show a faster learning convergence of 50% for DPDS-DPG compared to DDPG, with a comparable achievable system secrecy rate. Additionally, the results demonstrate system secrecy rate gains of 52% and 35% when employing active IRS and hybrid IRS, respectively, over conventional passive IRS, thereby supporting secure communications.

Index Terms-Hybrid IRS, satellite downlink communications, physical layer security, fast reinforcement learning, robust design.

I. INTRODUCTION
Integrating satellites into terrestrial communication systems has been identified as a critical solution in the new era of 6G. However, satellite communications are particularly vulnerable to security risks, including eavesdropping, due to their broadcast nature and wide coverage area. These risks are especially pronounced in the downlink, where sensitive information is transmitted from the satellite to the intended receivers on the ground. To secure wireless transmissions, upper-layer cryptographic encryption and physical layer security (PLS) approaches can be employed. The latter, which relies on channel coding to exploit the random characteristics of the wireless medium, has become attractive in future wireless systems [1]. Recently, PLS techniques have become even more effective with the deployment of intelligent reflecting surfaces (IRS) in wireless communications [2]. An IRS is a metasurface of reflective elements that can alter incoming signals and redirect them to desired locations. Therefore, an IRS can be exploited to enhance the signal strength at legitimate users while weakening it at eavesdroppers. However, a significant challenge in employing an IRS lies in the design of its tunable elements, i.e., the control of amplitude amplification and phase shifting.
Research has explored utilizing IRS with PLS techniques to bolster wireless communication security, as exemplified by [3] and related references. Most notably, for IRS-aided multiple-input multiple-output (MIMO) systems [4], [5] or IRS-aided multiple-input single-output (MISO) systems in the presence of multiple eavesdroppers [6], [7], the joint design of the base station (BS) beamforming and IRS coefficients is achieved through traditional optimization approaches, i.e., alternating optimization or semidefinite relaxation, under the assumption of perfect transmit channel state information (CSI). In [8], the PLS performance of a cache-enabled IRS-aided satellite network was analyzed through the system secure transmission probability. The computational complexity of these approaches increases significantly with larger IRS sizes. Furthermore, the CSI in practical IRS systems can be outdated due to processing and signal propagation delays, which adds further dynamics and complexity when accounted for. Overall, traditional optimization approaches are not effective for designing large-scale, highly dynamic IRS-assisted communication systems.
Deep reinforcement learning (DRL) has been utilized to design IRS systems [9], [10], [11], [12], [13], [14]. To jointly design the transmit beamforming and IRS phase shifts, [9] and [10] employ the deep deterministic policy gradient (DDPG) algorithm. In [9], DDPG is incorporated into the beamforming design with a low-complexity implementation and shows superior performance compared to the weighted minimum mean square error algorithm. Similarly, [10] proposes the soft actor-critic (SAC) algorithm, which achieves a higher average reward with lower variance than DDPG. Both the DDPG and SAC algorithms achieve performance comparable to existing optimization algorithms with shorter running times. A twin-DDPG learning algorithm is proposed in [11] to solve the joint optimization of the UAV trajectory and IRS beamforming in a secure IRS-assisted mmWave UAV communication system. Investigating the secure transmission of an IRS-aided MIMO full-duplex system, [12] employs DDPG to design the transmit beamforming and IRS phase shifts, considering hardware impairments at the transceivers and the IRS. Leveraging deep Q-learning, [13] proposes deep PDS-PER learning based secure beamforming for an IRS-assisted MISO system, where the post-decision state (PDS) and prioritized experience replay (PER) are employed to enhance the learning efficiency of the deep Q-learning algorithm. In [14], a Dyna architecture using actor-critic DRL enhances security in energy-efficient wireless body area networks (WBANs) against active eavesdroppers; this approach effectively decreases the eavesdropping rate, intercept probability, sensor energy consumption, and transmission latency, thereby boosting overall transmission security in IRS-aided WBANs.
The above results show the potential benefits of leveraging DRL in the beamforming design process for secure IRS-assisted wireless communication systems. While the aforementioned works [9], [10], [11], [12], [13], [14] consider IRS with passive elements only, it is worth mentioning that hybrid IRS with both active and passive elements has been introduced in [15], [16], [17], [18], [19], [20]. A hybrid IRS with even a small number of active elements can significantly improve the network performance compared to employing only a passive IRS. Nonetheless, the challenge in deploying a hybrid IRS to secure wireless communications lies in the design of the IRS amplitude amplification and phase shifting, as well as the number of active elements, so as to respect the power consumption of a low-power IRS. Hence, this paper considers a hybrid IRS-assisted secure multiuser MISO satellite downlink system and designs robust satellite beamforming as well as the IRS configuration under practical outdated CSI and power consumption models. The main contributions of this paper are summarized as follows:
1) The secrecy performance of the hybrid IRS-assisted multiuser MISO satellite communication system is characterized through the worst-case secrecy sum-rate under outdated CSI. The sum-rate maximization based beamforming design is then formulated as a non-convex optimization problem with satellite and IRS power budget constraints. Due to the high dynamics and dimensionality of the system, the optimization problem is challenging to solve directly; we address this issue by reformulating it as a reinforcement learning problem.
2) We develop a fast learning algorithm for the beamforming design using DRL. Our approach integrates the post-decision state into the actor-critic based DDPG algorithm, improving learning efficiency by using only the states as inputs to the critic's deep neural networks (DNNs), unlike conventional DDPG, which processes both states and actions.
3) The computational complexity and performance of the proposed algorithm and DDPG are compared analytically and numerically. The results confirm the better learning efficiency of our fast learning algorithm, with secrecy performance of the hybrid IRS-aided satellite downlink system comparable to that achieved with DDPG. Simulation results also validate the secrecy gain of employing a hybrid IRS versus a passive IRS.
The remainder of this paper is organized as follows. Section II describes the proposed system model and the hybrid IRS power consumption model. The beamforming design is formulated in Section III. Section IV proposes the DPDS-DPG learning based secure beamforming. Section V presents the simulation investigations, and conclusions are drawn in Section VI. A comprehensive list of the variables used in this paper is presented in Table I.
Notations: Matrices and vectors are denoted by boldface capital and lowercase letters, respectively. Tr(·) and (·)^H represent the trace and Hermitian transpose operations. I_M is the identity matrix of size M. diag{a} denotes the diagonal matrix with the elements of a on its diagonal. C^{M×N} denotes the space of M×N complex-valued matrices. [x]^+ ≜ max{0, x}. E[·] denotes expectation, and ∇ denotes the gradient.

II. SYSTEM MODEL

A. Network Model
Consider a downlink hybrid IRS-assisted satellite communication system, as depicted in Fig. 1, which consists of a GEO satellite (S) with L antennas serving N single-antenna users (U) on the ground. A hybrid IRS (I) is employed to assist the communications between the satellite and the users. There exist K non-colluding single-antenna eavesdroppers (E) trying to wiretap the transmission.
The IRS comprises M elements, including M_a active elements and M_p passive elements, with M = M_a + M_p. Active elements are capable of amplifying as well as reflecting incident signals, while passive elements only reflect incident signals. Let a ∈ {0, 1}^{M×1} with a_i = 1 or a_i = 0 indicate the i-th element being active or passive, respectively. Let γ_i = α_i^{a_i} e^{jθ_i} denote the reflection coefficient of the i-th element. Hence, for the i-th active element, γ_i = α_i e^{jθ_i}, where α_i ∈ (1, α_max] is the amplitude amplification factor with α_max being a predefined maximum, and θ_i ∈ [0, 2π] is the phase shift. For the i-th passive element, we have γ_i = e^{jθ_i}. The IRS interaction matrix is defined as the diagonal matrix Φ = diag{γ_1, . . ., γ_M}. Define the matrix Ψ = diag{a_1 γ_1, . . ., a_M γ_M} containing the reflection coefficients of the active elements only. In our model, the IRS operates in full-duplex mode, where the passive elements create no self-interference or noise amplification [22], and the active elements work in a full-duplex amplify-and-forward (AF) manner [15], [16], [18], [19], [20]. Although the delay induced by the active elements is longer than that of the passive elements, both delays are much shorter than the symbol period [22]. Hence, the signals from the reflection channels and the direct links can be assumed to be received simultaneously at the user in a single time slot for combining.
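To make this parameterization concrete, the following minimal NumPy sketch (an illustration of ours, not part of the paper's codebase) builds the interaction matrices Φ and Ψ from the element-type vector a, the amplification factors α, and the phase shifts θ:

```python
import numpy as np

def irs_matrices(a, alpha, theta):
    """Build the IRS interaction matrices from per-element parameters.

    a     : (M,) binary vector, 1 for active, 0 for passive elements
    alpha : (M,) amplitude amplification factors (only used where a_i = 1)
    theta : (M,) phase shifts in [0, 2*pi]
    Returns (Phi, Psi): gamma_i = alpha_i^{a_i} * exp(j*theta_i), with
    Psi keeping only the active-element coefficients a_i * gamma_i.
    """
    gamma = (alpha ** a) * np.exp(1j * theta)   # passive: alpha^0 = 1
    Phi = np.diag(gamma)                        # full interaction matrix
    Psi = np.diag(a * gamma)                    # active elements only
    return Phi, Psi

# Example: M = 4 elements, the first two active with amplification 2.
a = np.array([1, 1, 0, 0])
alpha = np.array([2.0, 2.0, 1.0, 1.0])
theta = np.random.uniform(0, 2 * np.pi, 4)
Phi, Psi = irs_matrices(a, alpha, theta)
```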
Let N = {1, 2, . . ., N}, K = {1, 2, . . ., K}, and M = {1, 2, . . ., M} denote the user set, eavesdropper set, and IRS element set, respectively. Let h_SU_n ∈ C^{L×1}, h_SE_k ∈ C^{L×1}, H_SI ∈ C^{M×L}, h_IU_n ∈ C^{M×1}, and h_IE_k ∈ C^{M×1} denote the channel coefficients from the satellite to the n-th user, the k-th eavesdropper, and the IRS, and from the IRS to the n-th user and the k-th eavesdropper, respectively. The channel coefficients consist of small-scale fading, where the satellite-related channel coefficients adhere to the Shadowed-Rician fading model described in [23], whose probability distribution function is given by

f(x) = (2bm / (2bm + ω))^m (1/(2b)) exp(−x/(2b)) 1F1(m, 1, ωx/(2b(2bm + ω))),

where b represents the average power of the scattered components, ω represents the average power of the line-of-sight components, m represents the Nakagami parameter, and 1F1(·, ·, ·) denotes the confluent hypergeometric function. Furthermore, the terrestrial channel coefficients follow the Rayleigh fading model. Let W = [w_1, . . ., w_N] ∈ C^{L×N} denote the beamforming matrix at the satellite, where w_n ∈ C^{L×1} is the beamforming vector for the n-th user. We assume the satellite has a maximum transmit power P_S,max, i.e., Tr(W^H W) ≤ P_S,max. Let s_n denote the transmitted symbol for the n-th user, which is assumed to follow a complex normal distribution with zero mean and unit variance, s_n ∼ CN(0, 1). The received signal at the n-th user can be expressed as

y_n = (h_SU_n^H + h_IU_n^H Φ H_SI) Σ_{j∈N} w_j s_j + n_SI,n + h_IU_n^H Ψ n_I + n_n,

where n_SI,n and h_IU_n^H Ψ n_I represent the full-duplex residual self-interference and the noise amplification created by the active elements, respectively, n_I ∼ CN(0, σ_I² I_M) is the thermal noise at the IRS, and n_n is the complex noise at the user with power σ_U_n². In the following, we assume perfect SI cancellation, i.e., n_SI,n = 0. The received signal at the k-th eavesdropper wiretapping the aforementioned transmission can be given by

y_E_k = (h_SE_k^H + h_IE_k^H Φ H_SI) Σ_{j∈N} w_j s_j + h_IE_k^H Ψ n_I + n_E_k,

where n_E_k is the complex noise at the eavesdropper with power σ_E_k². In practice, the CSI is outdated by the time the satellite/IRS transmits/reflects the signal due to transmission and processing delays. Since the IRS configuration is mainly based on the CSI, it is necessary to account for outdated CSI [13], [24], [25]. The actual channel coefficients are adopted from the real-time CSI model [26], i.e.,

h_SU_n = ρ ĥ_SU_n + √(1 − ρ²) Δh_SU_n,

and similarly for h_SE_k, h_IU_n, and h_IE_k, where ρ ∈ [0, 1] is the outdated CSI coefficient, ĥ_SU_n, ĥ_SE_k, ĥ_IU_n, and ĥ_IE_k represent the estimated CSI vectors, and Δh_SU_n, Δh_SE_k, Δh_IU_n, and Δh_IE_k represent the corresponding CSI error vectors. The CSI error vectors are assumed to take any value inside multi-dimensional complex ellipsoids, e.g., E_SU_n = {Δh_SU_n : Δh_SU_n^H Q_SU_n Δh_SU_n ≤ 1}, where the matrices Q_SU_n specify the shape and size of the ellipsoids and are assumed to be known [27]. Note that perfect CSI is assumed for the channel between the satellite and the IRS, H_SI. This assumption is based on the fact that the IRS acts as a passive device, solely reflecting the signal from the satellite to the users; as the IRS does not have the capability to send the pilot sequences necessary for channel estimation at the satellite, perfect CSI is assumed for the satellite-IRS link.
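As a concrete illustration of this CSI model, the sketch below (a hypothetical helper of ours) draws an actual channel realization from an estimate; since the model only bounds the error by the ellipsoid Δh^H Q Δh ≤ 1, we sample it uniformly inside that set:

```python
import numpy as np

def outdated_csi(h_est, rho, Q):
    """Draw h = rho*h_est + sqrt(1-rho^2)*dh, with the error dh confined
    to the ellipsoid dh^H Q dh <= 1 (Q Hermitian positive definite).

    h_est : (L,) estimated channel; rho : outdated-CSI coefficient in [0, 1].
    """
    L = h_est.size
    # Sample a point uniformly inside the complex unit ball (real dim 2L),
    # then map it through Q^{-1/2} so that dh^H Q dh <= 1 holds.
    u = np.random.randn(L) + 1j * np.random.randn(L)
    u *= np.random.rand() ** (1 / (2 * L)) / np.linalg.norm(u)
    C = np.linalg.cholesky(Q)                 # Q = C C^H
    dh = np.linalg.inv(C).conj().T @ u        # (C^{-1})^H u
    return rho * h_est + np.sqrt(1 - rho ** 2) * dh
```

With the simulation setting of Section V, Q = (1/0.01²)I, this confines the error to a ball of radius 0.01, and ρ = 1 recovers non-outdated CSI.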
The achievable rate of the n-th user is given by

R_n = log2(1 + |(h_SU_n^H + h_IU_n^H Φ H_SI) w_n|² / (Σ_{j∈N, j≠n} |(h_SU_n^H + h_IU_n^H Φ H_SI) w_j|² + σ_I² ‖h_IU_n^H Ψ‖² + σ_U_n²)). (6)

The achievable rate of the k-th eavesdropper when wiretapping the n-th user's signal can be expressed as

R_k,n = log2(1 + |(h_SE_k^H + h_IE_k^H Φ H_SI) w_n|² / (Σ_{j∈N, j≠n} |(h_SE_k^H + h_IE_k^H Φ H_SI) w_j|² + σ_I² ‖h_IE_k^H Ψ‖² + σ_E_k²)). (7)

Proposition 1: The worst-case individual achievable secrecy rate for the n-th user can be expressed as in (8).
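The following sketch evaluates a rate of the form (6)/(7) for a given receiver; it is an illustrative helper under the signal model above (the noise powers are assumed scalar inputs):

```python
import numpy as np

def rate(h_d, h_r, H_SI, Phi, Psi, W, n, sigma2_I, sigma2):
    """log2(1 + SINR) as in (6)/(7) for a receiver with direct channel
    h_d (L,) and IRS channel h_r (M,), decoding the n-th stream of W (L,N)."""
    g = h_d.conj() @ W + h_r.conj() @ Phi @ H_SI @ W   # effective (N,) gains
    signal = np.abs(g[n]) ** 2
    interference = np.sum(np.abs(np.delete(g, n)) ** 2)
    amp_noise = sigma2_I * np.linalg.norm(h_r.conj() @ Psi) ** 2
    return np.log2(1 + signal / (interference + amp_noise + sigma2))
```

The per-user secrecy rate then follows as [R_n − max_k R_k,n]^+, which Proposition 1 evaluates at the worst-case CSI error vectors to obtain (8).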
Proof: Please see Appendix A.
Note that the worst-case individual achievable secrecy rate of the n-th user in (8) depends on the estimated CSI instead of the outdated CSI, which comprises the estimated CSI and the corresponding CSI error vector. This outcome is expected due to the bounded nature of the CSI error vectors.
The system worst-case secrecy rate is

R^sec = Σ_{n∈N} R_n^sec. (9)

The system's worst-case secrecy rate is primarily determined by the satellite's beamforming vectors, the IRS interaction matrix, and the estimated CSI. Therefore, the focus of the secure transmission design should be on developing a joint beamforming strategy for the satellite and IRS to achieve the optimal worst-case secrecy rate for the entire system.

B. Hybrid IRS Power Consumption Model
RF circuits are used to implement IRS passive elements [22], [28], [29]; hence, the power consumed by a passive element is mainly due to the RF circuit and the control circuit for that element. Let P_c denote the power consumption of the RF and control circuits; the total power consumed by M_p identical passive elements is P_I,p = M_p P_c. Active elements amplify the signal power in addition to reflecting the signal. Hence, besides the power consumption of the RF and control circuits of an IRS element, power is dissipated in the components enabling the amplification capability, denoted as P_a, and power is required for the signal amplification itself, which is modeled as a function of the incident signal power [16], i.e., P_i^out = α_i² P_i^in / η, where η and P_i^in respectively denote the amplifier efficiency and the power of the signal incident on the i-th active element. The total power consumption of the M_a active elements can be written as

P_I,a = M_a (P_c + P_a) + Σ_{i∈M} a_i α_i² P_i^in / η. (10)

Let P_I,max denote the maximum power budget at the IRS; then P_I,p + P_I,a ≤ P_I,max. The total power consumption of the hybrid IRS can be expressed as

P_I = P_I,p + P_I,a = M_p P_c + M_a (P_c + P_a) + Σ_{i∈M} a_i α_i² P_i^in / η. (11)
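A minimal sketch of this power model follows; note that the amplifier-efficiency symbol η is our reconstruction (the original symbol was lost in extraction), and the function is illustrative:

```python
import numpy as np

def hybrid_irs_power(a, alpha, P_in, P_c, P_a, eta):
    """Total hybrid IRS power (10)-(11): M_p*P_c for passive elements,
    M_a*(P_c + P_a) for active-element circuits, plus the amplification
    power sum a_i * alpha_i^2 * P_in_i / eta over active elements."""
    M_a = int(np.sum(a))
    M_p = a.size - M_a
    P_passive = M_p * P_c
    P_active = M_a * (P_c + P_a) + np.sum(a * alpha ** 2 * P_in) / eta
    return P_passive + P_active
```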

III. BEAMFORMING DESIGN

A. Problem Formulation
This paper aims to jointly design the satellite beamforming matrix W and the hybrid IRS interaction matrix Φ to maximize the system worst-case secrecy rate. The design of the IRS interaction matrix includes the IRS phase shift tuning, the amplitude amplification factors, and the optimal number of active elements. The optimization problem can be formulated as

max_{W, Φ, a} R^sec (12a)
s.t. Tr(W^H W) ≤ P_S,max, (12b)
     P_I,p + P_I,a ≤ P_I,max, (12c)
     α_i ∈ (1, α_max], ∀i ∈ M with a_i = 1, (12d)
     θ_i ∈ [0, 2π], ∀i ∈ M. (12e)

The constraints in (12b) and (12c) are imposed to satisfy the maximum power budgets at the satellite and IRS, respectively. The constraint in (12d) is set on the amplitude amplification factors of the IRS active elements. Lastly, the constraint in (12e) is on the phase shifts of all IRS elements. It is observed that (12) is a non-convex optimization problem, which is challenging to solve using traditional optimization techniques. To achieve a robust design, reinforcement learning will be employed.
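Since Section V-A later adjusts the raw actor outputs for feasibility, one simple way to enforce (12b), (12d), and (12e) is the following projection sketch (a hypothetical helper; the paper does not specify its exact adjustment rule):

```python
import numpy as np

def project_constraints(W, alpha, theta, a, P_S_max, alpha_max):
    """Map raw outputs onto the feasible set of (12): rescale W to meet
    the transmit-power budget (12b), clip active-element amplification
    factors into (1, alpha_max] (12d), and wrap phases into [0, 2*pi) (12e)."""
    p = np.real(np.trace(W.conj().T @ W))
    if p > P_S_max:
        W = W * np.sqrt(P_S_max / p)
    # Small offset keeps alpha strictly above 1 on active elements.
    alpha = np.where(a == 1, np.clip(alpha, 1.0 + 1e-6, alpha_max), 1.0)
    theta = np.mod(theta, 2 * np.pi)
    return W, alpha, theta
```

The IRS power budget (12c) can be checked afterwards with the hybrid_irs_power helper above, e.g., by scaling down the amplification factors until the budget holds.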

B. Reinforcement Learning Approach
In this section, problem (12) is modeled as a reinforcement learning problem, with the environment being the secure hybrid IRS-assisted satellite downlink system. The key elements of the problem are described as follows.
Action space: Let A denote the action space. Since problem (12) aims to find the optimal satellite beamforming matrix and IRS interaction matrix, the action a_t ∈ A is defined in terms of the beamforming vectors w_n,t, the amplitude amplification factors α_i,t, and the phase shifts θ_i,t, i.e.,

a_t = {w_n,t, α_i,t, θ_i,t : n ∈ N, i ∈ M}.

State space: Let S and s_t ∈ S represent the system state space and the system state at time step t, respectively. s_t includes the CSI of all channels, denoted as h_t, and the action from the previous time step, i.e.,

s_t = {h_t, a_{t−1}}.

Reward function: The reward function signals to the agent how good the secure beamforming is after the agent takes an action. Since the goal of problem (12) is to maximize the system worst-case secrecy rate, the immediate reward is defined as the achievable secrecy rate, i.e., r_t = R_t^sec. At each time step t, the agent obtains the current state s_t from the environment and executes an action a_t on the environment based on its policy π, which is a mapping from states to actions; it then receives an immediate reward r_t and the next state s_{t+1} from the environment. The accumulated reward represents the long-term reward, defined as R_t = Σ_{i=t}^∞ Λ^{i−t} r_i, where Λ ∈ (0, 1) is the reward discount factor. The agent's goal is to learn a policy π that yields an optimal action a* maximizing the expected long-term reward.
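Because the DNNs operate on real-valued vectors (see the real/imaginary split in Section V-A), the complex action must be flattened; the layout below is our assumption for illustration, matching the action dimension D_a = 2NL + 2M counted later:

```python
import numpy as np

def unpack_action(x, L, N, M):
    """Split a real action vector into (W, alpha, theta).
    Assumed layout: [Re(W), Im(W), alpha, theta], D_a = 2*N*L + 2*M."""
    w_re = x[:L * N].reshape(L, N)
    w_im = x[L * N:2 * L * N].reshape(L, N)
    W = w_re + 1j * w_im
    alpha = x[2 * L * N:2 * L * N + M]
    theta = x[2 * L * N + M:2 * L * N + 2 * M]
    return W, alpha, theta
```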
The value function of a state-action pair, which measures the expected long-term reward starting from the pair (s_t, a_t) under a given policy π, is defined as

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t].

IV. DEEP POST-DECISION STATE-DETERMINISTIC POLICY GRADIENT LEARNING BASED ROBUST SECURE BEAMFORMING
The reinforcement learning problem described in Section III has high-dimensional system states with dynamic characteristics, i.e., real-time CSI, as well as continuous states and actions. One DRL method that can handle such conditions is DDPG learning [30]. DDPG is an actor-critic learning algorithm in which the policy structure, namely the actor, selects actions, and the estimated value function, namely the critic, gives insight into how good those actions are. DDPG learning will be utilized for the agent to learn the optimal policy for (12). Then, to improve the learning efficiency, we propose the DPDS-DPG algorithm, which employs the post-decision state, a well-known fast learning concept [31], [32], at the critic of the DDPG approach.

A. DDPG Based Learning
The DDPG based learning algorithm structure is depicted in Fig. 2, in which the actor and critic each contain two DNNs, namely an evaluation network and a target network.
Let μ : S → A denote a function specifying the policy, which maps states to actions deterministically, i.e., only the most appropriate action is output for given system states. Let Θ_μ, Θ_μ', Θ_Q, Θ_Q' respectively denote the parameters of the actor evaluation and target networks, and of the critic evaluation and target networks. The actor network takes a state s_t as input and outputs an action

a_t = μ(s_t; Θ_μ) + n_t,

where a random noise n_t ∼ N(0, 1) is added for exploration. To evaluate actions, the critic network takes a state-action pair (s_t, a_t) as input and outputs the state-action value Q(s_t, a_t; Θ_Q). In DDPG, the agent learns in mini-batches: the transitions <s_t, a_t, r_t, s_{t+1}> are stored in a replay buffer and sampled into mini-batches <s_i, a_i, r_i, s_{i+1}> before the actor and critic are updated at each time step. The critic network parameters Θ_Q are updated using mini-batch gradient descent, i.e.,

Θ_Q ← Θ_Q − α_Q ∇_{Θ_Q} (1/B) Σ_i (y_i − Q(s_i, a_i; Θ_Q))², (18)

where y_i = r_i + Λ Q(s_{i+1}, μ(s_{i+1}; Θ_μ'); Θ_Q') is the target value, B is the mini-batch size, and α_Q is the critic network learning rate. The actor network parameters Θ_μ are updated using the mini-batch policy gradient, i.e.,

Θ_μ ← Θ_μ + α_μ (1/B) Σ_i ∇_a Q(s_i, a; Θ_Q)|_{a=μ(s_i)} ∇_{Θ_μ} μ(s_i; Θ_μ), (19)

where α_μ is the actor network learning rate. To stabilize the learning process, the target networks, which are copies of the actor and critic networks, compute the target values. A soft update is used for the target network parameters [30],

Θ' ← τ Θ + (1 − τ) Θ', (20)

where τ << 1 constrains the target values to change slowly.
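A compact PyTorch sketch of the updates (18)-(20) follows; the network modules, optimizers, and batch tensors are assumed to exist with the obvious shapes, so this is a sketch rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, actor_targ, critic, critic_targ,
                actor_opt, critic_opt, batch, Lambda=0.99, tau=0.005):
    """One DDPG update step following (18)-(20); critic(s, a) returns Q(s, a)."""
    s, a, r, s_next = batch
    with torch.no_grad():                          # target value y_i
        y = r + Lambda * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)      # mini-batch MSE, (18)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()       # gradient ascent on Q, (19)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)   # soft update (20)
```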

B. Proposed DPDS-DPG Learning
In DDPG, the Q-value is a function of the state-action pair and is evaluated by the critic. To accelerate the learning, the proposed DPDS-DPG algorithm defines a post-decision state, and the value function evaluated by the critic is a function of only the post-decision states. The structure of the proposed DPDS-DPG learning algorithm is shown in Fig. 3.
The PDS is defined as the intermediate state that captures all the known information about the transition from state s_t to s_{t+1} [32]. Hence, it is the state immediately after the agent takes action a_t in state s_t and before the transition to the next state s_{t+1}. Let s̃_t denote the PDS at time step t. Immediately after the action is taken, the system CSI has not changed; only the satellite beamforming vectors and the IRS interaction matrix are affected. Thus:
- State at time step t: s_t = {h_t, a_{t−1}};
- PDS at time step t: s̃_t = {h_t, a_t};
- State at time step t + 1: s_{t+1} = {h_{t+1}, a_t}.
In DPDS-DPG learning, the transition from state s_t to s_{t+1} goes through the intermediate PDS s̃_t. First, the agent takes action a_t in state s_t and obtains a known reward r_t^k depending on the current state and action. The agent then transitions to the PDS s̃_t with a known transition probability P_k(s̃_t|s_t, a_t). Lastly, the agent transitions from the PDS to state s_{t+1} with an unknown transition probability P_u(s_{t+1}|s̃_t) and receives an unknown reward r_t^u. This transition captures the random system dynamics, as the transition from the PDS to the next state depends on the channel statistics. The transition probability from s_t to s_{t+1} can thus be modeled as a controlled Markov process and factorized into known and unknown components as

P(s_{t+1}|s_t, a_t) = Σ_{s̃_t} P_u(s_{t+1}|s̃_t) P_k(s̃_t|s_t, a_t). (22)

The reward can correspondingly be expressed as

r_t(s_t, a_t) = r_t^k(s_t, a_t) + E_{s̃_t}[r_t^u(s̃_t)]. (23)

Let V(s̃) denote the value function defined over the PDS. The PDS value function measures the expected long-term reward starting from the PDS s̃_t under a policy π, i.e.,

V^π(s̃_t) = E[Σ_{i=t}^∞ Λ^{i−t} r_i^u | s̃_t], (24)

which satisfies

V(s̃_t) = E_{s_{t+1}}[r_t^u(s̃_t) + Λ Q(s_{t+1}, μ(s_{t+1}))]. (25)

Based on the above definition of the PDS value function and the transition probability factorization, the relationship between the state-action value function Q(s_t, a_t) and the PDS value function V(s̃_t) can be expressed as

Q(s_t, a_t) = r_t^k(s_t, a_t) + Σ_{s̃_t} P_k(s̃_t|s_t, a_t) V(s̃_t). (26)

Substituting (22), (23), and (25) into (26) recovers the usual output of the critic, i.e.,

Q(s_t, a_t) = r_t(s_t, a_t) + Λ Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) Q(s_{t+1}, μ(s_{t+1})). (27)

Remark 1: In DPDS-DPG learning, the critic network estimates V^π(s̃_t) instead of Q^π(s_t, a_t), which requires less information since V^π(s̃_t) depends only on the PDS. Hence, the learning process becomes more efficient than that of DDPG, with a faster convergence rate.
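The following PyTorch sketch outlines how a PDS critic can be trained using (25)-(26); make_pds (forming s̃ = {h, a} by replacing the action inside the state) and known_reward (the differentiable known component r^k) are assumed helper callables, so this is a sketch rather than the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def dpds_dpg_update(actor, actor_targ, critic_V, critic_V_targ,
                    actor_opt, critic_opt, batch,
                    make_pds, known_reward, Lambda=0.99):
    """One DPDS-DPG update: the critic estimates V over post-decision
    states, and Q is recovered via (26) as Q = r^k + V(pds)."""
    s, a, r_u, s_next = batch                # r_u: unknown reward component
    with torch.no_grad():
        a_next = actor_targ(s_next)
        # Target from (25), expanding Q(s', mu(s')) via (26).
        y = r_u + Lambda * (known_reward(s_next, a_next)
                            + critic_V_targ(make_pds(s_next, a_next)))
    v_loss = F.mse_loss(critic_V(make_pds(s, a)), y)   # PDS critic, cf. (28)
    critic_opt.zero_grad(); v_loss.backward(); critic_opt.step()

    a_mu = actor(s)                          # actor ascends Q(s, mu(s))
    actor_loss = -(known_reward(s, a_mu)
                   + critic_V(make_pds(s, a_mu))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Note how the critic input is only the PDS (state-sized), which is the source of the efficiency gain stated in Remark 1.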
Let Θ_V and Θ_V' respectively denote the parameters of the PDS critic evaluation and target networks. Similar to (18), the update rule for Θ_V can be expressed as

Θ_V ← Θ_V − α_V ∇_{Θ_V} (1/B) Σ_i (y_i − V(s̃_i; Θ_V))², (28)

where α_V is the PDS critic network learning rate, and Θ_V' is updated from Θ_V using the soft update in (20).
The training algorithm for secure beamforming based on DPDS-DPG learning is described in Algorithm 1. During the initialization phase, the four networks of the actor and critic are generated, with the target network parameters initialized as copies of the actor and critic network parameters. The replay buffer is utilized to address the assumption of independently and identically distributed samples in DRL. During each episode, random noise is generated for action exploration, and the system states, i.e., the CSI of all users, are observed. The learning agent then takes the action a_t produced by the actor network, i.e., choosing the satellite beamforming vectors and the amplification factors and phase shifts for the hybrid IRS elements, computes the immediate reward r_t^k, and observes the transition to the PDS s̃_t and then to the next state s_{t+1}. The reward function, PDS value function, and state-action value function are updated using (23), (25), and (26), respectively. The transition <s_t, a_t, r_t, s_{t+1}> is then collected and stored in the replay buffer; the oldest samples are discarded when the replay buffer is full. Mini-batches of transitions are sampled and used to train the actor and critic networks. The target values y_i are computed by the target networks, and the actor and critic network parameters are updated following (19) and (28), respectively. Finally, the target network parameters are soft-updated using (20). The above process is repeated until the neural networks converge, at which point the DPDS-DPG learning model is obtained.
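A skeleton of this training loop is sketched below; the env and agent interfaces are assumptions of ours standing in for the MATLAB environment and the DNN updates described above:

```python
import collections
import random

def train(env, agent, episodes=500, steps=100,
          buffer_size=100_000, batch_size=64):
    """Skeleton of Algorithm 1. env.reset/env.step expose the CSI-based
    states and the secrecy-rate reward; agent wraps the four DNNs."""
    buffer = collections.deque(maxlen=buffer_size)  # drops oldest when full
    for ep in range(episodes):
        s = env.reset()
        for t in range(steps):
            a = agent.act(s, explore=True)      # actor output plus noise
            s_next, r = env.step(a)             # r = worst-case secrecy rate
            buffer.append((s, a, r, s_next))
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                agent.update(batch)             # (19), (28), then (20)
            s = s_next
```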
It is worth noting that the training process of DPDS-DPG requires a certain level of computational power. Hence, the training can be performed offline. After the learning model is obtained, the satellite beamforming and the hybrid IRS configuration are computed online. The learning model parameters only need to be updated when there are significant changes in the system.

C. Computational Complexity Analysis
Let L_a and L_c denote the numbers of hidden layers of the actor and critic neural networks, respectively, and let Z_l denote the number of neurons in the l-th layer. In the training stage, the computational complexity of a single neural network per time step [13], e.g., the actor network, is O(Σ_{l=0}^{L_a} Z_l Z_{l+1}), where Z_0 denotes the number of input-layer neurons. The computational complexity of the actor and critic updates in DDPG per time step is O(|S| × |A|) [33]. In DPDS-DPG, the PDS s̃_t has the same dimension as the state s_t; thus, the computational complexity of the actor and critic updates per time step is O(2|S| × |A|). The computational complexities of DDPG and DPDS-DPG per training time step can therefore be respectively expressed as

C_DDPG = O(Σ_{l=0}^{L_a} Z_l Z_{l+1} + Σ_{l=0}^{L_c} Z_l Z_{l+1} + |S| × |A|),
C_DPDS-DPG = O(Σ_{l=0}^{L_a} Z_l Z_{l+1} + Σ_{l=0}^{L_c} Z_l Z_{l+1} + 2|S| × |A|).

The computational complexity is significantly reduced after the training stage, once the learning model is obtained. In the online working mode, when choosing the satellite beamforming vector and configuring the hybrid IRS, only a forward pass of the actor network is required, with complexity O(Σ_{l=0}^{L_a} Z_l Z_{l+1}).
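For illustration, the per-time-step count can be computed as below; the layer sizes in the comment are hypothetical placeholders, and the function does not claim to reproduce the exact constants reported in Section V-B:

```python
def training_complexity(actor_sizes, critic_sizes, state_dim, action_dim,
                        pds=False):
    """Per-time-step complexity estimate from Section IV-C: the sum of
    Z_l * Z_{l+1} over consecutive layers of each network, plus the
    update term |S|*|A| (doubled when a PDS critic is used)."""
    def nn_ops(sizes):
        return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    update = (2 if pds else 1) * state_dim * action_dim
    return nn_ops(actor_sizes) + nn_ops(critic_sizes) + update

# Hypothetical example: two hidden layers of width 710 per network.
# training_complexity([356, 710, 710, 52], [408, 710, 710, 1], 356, 52)
```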

V. SIMULATION RESULTS
In this section, the performance of the proposed DPDS-DPG algorithm is evaluated. The secure hybrid IRS-assisted satellite-terrestrial network (STN) simulator is set up with the key parameters listed in Table II.
The estimated CSI ĥ_SU, ĥ_SE, H_SI are randomly generated following Shadowed-Rician fading with parameters {m, b, ω} = {4, 0.126, 0.835} [23], and ĥ_IU, ĥ_IE are randomly generated following Rayleigh fading. Large-scale fading is omitted in the simulator. The CSI error vectors are generated with values inside a ball of radius 0.01, i.e., Q_SU_n = (1/0.01²)I.
The original DDPG algorithm [30] with the DNN structure proposed in [9] is selected as the baseline. First, the proposed DNN structure for the DPDS-DPG algorithm is described in detail, followed by the performance evaluation.

A. DNN Structure
The actor and critic networks are fully connected DNNs, each consisting of an input layer, two hidden layers, and an output layer. The input dimensions of the actor and critic networks are defined as the sizes of the state space and action space, respectively. The output dimensions of the actor and critic networks are defined as the sizes of the action space and the Q-value function space, respectively. Since the states involve complex-valued matrices but the neural networks only take real-valued inputs, each complex value is split into its real and imaginary parts, which are fed into the network as independent inputs. For example, the CSI of the satellite to n-th user link is separated into its real part, with dimension 1 × L, and its imaginary part, with dimension 1 × L, which are used as independent inputs to the actor network and contribute 2L entries to the state. Similarly, the total number of entries formed by the CSI is 2MN + 2MK + 2ML + 2NL + 2KL; the satellite beamforming matrix forms 2NL entries; and the IRS interaction matrix forms 2M entries. Thus, the state space dimension is D_s = 2MN + 2MK + 2ML + 2KL + 4NL + 2M, and the action space dimension is D_a = 2NL + 2M. The Q-value function space has dimension 1. All hidden layers of the actor and critic networks are identically constructed, with 2(D_s − 1) neurons in each hidden layer. Similar to [9], the tanh activation function is used in each DNN to handle negative inputs. The Adam optimizer is used for both the actor and critic networks with an adaptive learning rate that decays at rate ζ over time. The power constraints (12b) and (12c), and the constraints on the amplification factors and phase shifts of the IRS, are enforced at the output of the actor network: after the agent decides on the satellite beamforming matrix W and the IRS interaction matrix Ψ, these matrices are adjusted to ensure compliance with the constraints. The maximum amplitude amplification coefficient of the active elements of the hybrid IRS is set to 1 [19]. The values of the other hyper-parameters are given in Table III.
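The dimension bookkeeping above is easy to verify; the following snippet evaluates D_s and D_a for the first scenario of Section V-B (10-element IRS, L = N = K = 4):

```python
def space_dims(L, N, K, M):
    """State/action dimensions from Section V-A:
    D_s = 2MN + 2MK + 2ML + 2KL + 4NL + 2M,  D_a = 2NL + 2M."""
    D_s = 2*M*N + 2*M*K + 2*M*L + 2*K*L + 4*N*L + 2*M
    D_a = 2*N*L + 2*M
    return D_s, D_a

print(space_dims(L=4, N=4, K=4, M=10))   # -> (356, 52)
```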

B. Implementation Description
The system environment is implemented in MATLAB, using the Inmarsat Broadband Global Area Network (BGAN) 4-F1 satellite as the GEO satellite, at an altitude of 35,786 km above the Earth's surface. On the ground, a hybrid IRS is deployed at the center of a circle with a radius of 2 km, with the users and eavesdroppers randomly distributed within the circle. The system environment provides the CSI as the system states for the DRL framework. The DRL framework is implemented using PyTorch, with the network parameters, architecture, and hyper-parameters detailed earlier. MATLAB's user-friendly interface complements PyTorch's efficiency, balancing ease of use with computational power and streamlining the development and training of the DRL models.
Regarding computational complexity, for illustrative purposes, the neural networks of the DRL platform described in Section V-A are considered with a 10-element hybrid IRS, of which two elements are active, and L = N = K = 4. In this scenario, the computational complexities during training with 100 time steps are C_DDPG = O(9,657,548,800) and C_DPDS-DPG = O(9,758,720,000). Furthermore, the computational complexity of the proposed scheme in the online working mode, which involves the selection of the satellite beamforming vector and the configuration of the hybrid IRS, is estimated to be O(1,493,184).

C. Comparisons With Baseline
To evaluate the learning efficiency of the proposed DPDS-DPG algorithm against the baseline, we compare the learning curves when the system is set up with the parameters in Table II and two active elements on the IRS. The results shown in Figs. 4 and 5 are the averages of the instant rewards per episode (solid lines) and the average rewards per hundred episodes (dash-dotted lines). It is observed that the proposed algorithm with PDS converges faster than the DDPG based algorithm in all training scenarios. The reason is that in the proposed DPDS-DPG, the critic evaluates the value function of the PDS only, while in the DDPG baseline, the critic evaluates the value function of both state and action. The effect of the learning rate on the performance of both algorithms is shown in Fig. 4. With the same decaying rate of 0.00001, a smaller learning rate of 0.0001 significantly affects the performance of both algorithms, with the average rewards decreasing by up to 25%. This result also confirms that a suitable learning rate for the DNNs is 0.001 [9], [13]. The decaying rate, on the other hand, does not affect the learning performance as much as the learning rate. As shown in Fig. 5, a larger decaying rate of 0.0001 only reduces the average reward by at most 9% with respect to a decaying rate of 0.00001. In both figures, the proposed DPDS-DPG reaches convergence after half the number of episodes required by the DDPG algorithm, implying a 50% improvement in learning efficiency.
Fig. 6 shows the effect of the satellite transmit power budget on the system secrecy rate. We consider two system scenarios: (i) L = N = K = 4 and a 10-element IRS with 2 active elements; and (ii) L = N = K = 10 and a 20-element IRS with 4 active elements. For both scenarios, the satellite transmit power and IRS power budgets are P_S,max = 45 dBm and P_I,max = 10 dBm, respectively. In Fig. 6, the larger the satellite transmit power budget, the higher the achieved system secrecy rate. As expected, the results of the DPDS-DPG and DDPG algorithms are comparable, which reflects that the DPDS-DPG algorithm speeds up the learning process during training without degrading the satellite and IRS beamforming performance.

D. Impact of Outdated CSI
In Fig. 7, we inspect the effect of the outdated CSI coefficient ρ on the system secrecy rate when L = N = K = 4, M/M_a = 10/2, P_S,max = 45 dBm, and P_I,max = 10 dBm. When ρ increases, the CSI becomes less outdated, with ρ = 1 corresponding to non-outdated CSI. The results show that the system secrecy rate increases when the CSI is less outdated, since more accurate CSI helps the optimization of the satellite and IRS beamforming achieve a higher system secrecy rate.

E. Impact of Hybrid IRS
Fig. 8 shows the effect of different IRS structures on the system secrecy rate. The system is set up with L = N = K = 10, P_S,max = 45 dBm, and P_I,max = 15 dBm. It can be observed that more IRS elements enhance the secrecy performance regardless of the IRS type. The performance is significantly improved when active elements are enabled: for an IRS with 30 elements, enabling 12 active elements yields a 36.27% higher secrecy rate, and enabling all active elements yields a 63.42% higher secrecy rate. The maximum system secrecy rate improvements are 52% when employing an active IRS and 35% when employing a hybrid IRS, compared to employing a passive IRS. Note that these results are from simulating ideal scenarios where the IRS power budget is sufficient to enable all active elements and to supply the power amplification circuits of these elements. In Fig. 9, we investigate the relationship between the number of active elements and the amplification factor under a limited IRS power budget of 10 dBm. A passive IRS with up to 100 elements can be powered with 10 dBm; under the same budget, enabling only 2 active elements on a 50-element IRS gives better performance. As the number of active elements increases, the system secrecy rate first increases to an optimal value and then decreases. The reason for this behavior is that when M_a > M_a*, increasing the number of active elements decreases the power budget left for amplification.

VI. CONCLUSION
This work investigated the worst-case secrecy sum-rate of a hybrid IRS-assisted satellite downlink communication system under outdated CSI. The satellite beamforming matrix and the hybrid IRS interaction matrix were jointly designed to maximize the system secrecy rate. Due to the high system dynamics and dimensionality, DRL was employed to solve the beamforming problem. A fast DRL algorithm, named DPDS-DPG, was proposed to robustly design the optimal satellite beamforming and IRS configuration. Simulation results showed a 50% improvement in learning efficiency for the proposed DPDS-DPG algorithm compared to the conventional DDPG algorithm. The performance gains of employing a hybrid IRS over a passive IRS for supporting secure communications were also verified through simulation: the system secrecy rate increased by 52% when employing an active IRS and by 35% when employing a hybrid IRS, compared to using a passive IRS.

APPENDIX A PROOF OF PROPOSITION 1
The worst-case individual achievable secrecy rate from the satellite to the n-th user in the presence of K non-colluding eavesdroppers is

R_n^sec = [ inf_{Δh_SU_n ∈ E_SU_n, Δh_IU_n ∈ E_IU_n} R_n − max_{k∈K} sup_{Δh_SE_k ∈ E_SE_k, Δh_IE_k ∈ E_IE_k} R_k,n ]^+. (31)

Introduce new variables ḣ_U_n and ḣ_E_k defined as

ḣ_U_n = inf_{Δh_SU_n ∈ E_SU_n, Δh_IU_n ∈ E_IU_n} |(h_SU_n^H + h_IU_n^H Φ H_SI) w_n|, (32)
ḣ_E_k = sup_{Δh_SE_k ∈ E_SE_k, Δh_IE_k ∈ E_IE_k} |(h_SE_k^H + h_IE_k^H Φ H_SI) w_n|.

Lower and upper bounds on these quantities, (33) and (34), are obtained by applying the triangle inequality in (33a) and (34a), and the Cauchy-Schwarz inequality together with the boundedness of the CSI error vectors, i.e., Δh_SU_n^H Q_SU_n Δh_SU_n ≤ 1, in (33b) and (34b). From the results of (33) and (34), the infimum in (32) can be expressed through the estimated CSI as in (35); similarly, the supremum bound is obtained in (36). Using the results of (35) and (36) in (31), with R_n from (6) and R_k,n from (7), arrives at (8).
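The key bounding step can be sketched as follows (our reconstruction under the ellipsoid assumption Δh^H Q Δh ≤ 1, for a generic estimate ĥ and beamformer w):

```latex
\begin{align}
\bigl|(\hat{h} + \Delta h)^{H} w\bigr|
  &\ge \bigl|\hat{h}^{H} w\bigr| - \bigl|\Delta h^{H} w\bigr|
  && \text{(triangle inequality)} \\
\bigl|\Delta h^{H} w\bigr|
  &= \bigl|(Q^{1/2}\Delta h)^{H} Q^{-1/2} w\bigr|
   \le \sqrt{w^{H} Q^{-1} w}
  && \text{(Cauchy-Schwarz, } \Delta h^{H} Q \Delta h \le 1\text{)}
\end{align}
% The infimum over the ellipsoid is thus lower-bounded by
% [ |\hat{h}^{H} w| - \sqrt{w^{H} Q^{-1} w} ]^{+}, and the supremum is
% upper-bounded analogously with a plus sign, so both bounds depend
% only on the estimated CSI.
```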

Fig. 1. System model of the hybrid IRS-assisted satellite downlink communication system with multiple eavesdroppers.

Fig. 3. Structure of the proposed DPDS-DPG learning with the post-decision state in the critic architecture.

Fig. 9. System secrecy rate versus the number of active elements under a limited IRS power budget of 10 dBm.

TABLE II. KEY PARAMETERS OF THE ENVIRONMENT SIMULATOR.