Robust Lane Change Decision Making for Autonomous Vehicles: An Observation Adversarial Reinforcement Learning Approach

Reinforcement learning holds the promise of allowing autonomous vehicles to learn complex decision making behaviors through interacting with other traffic participants. However, many real-world driving tasks involve unpredictable perception errors or measurement noise, which may mislead an autonomous vehicle into making unsafe decisions or even cause catastrophic failures. In light of these risks, to ensure safety under perception uncertainty, autonomous vehicles are required to cope with the worst-case observation perturbations. Therefore, this paper proposes a novel observation adversarial reinforcement learning approach for robust lane change decision making of autonomous vehicles. A constrained observation-robust Markov decision process is presented to model lane change decision making behaviors of autonomous vehicles under policy constraints and observation uncertainties. Meanwhile, a black-box attack technique based on Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations efficiently. Furthermore, a constrained observation-robust actor-critic algorithm is advanced to optimize autonomous driving lane change policies while keeping the variations of the policies attacked by the optimal adversarial observation perturbations within bounds. Finally, the robust lane change decision making approach is evaluated in three stochastic mixed traffic flows with different densities. The results demonstrate that the proposed method can not only enhance the performance of an autonomous vehicle but also improve the robustness of lane change policies against adversarial observation perturbations.


I. INTRODUCTION
In recent years, autonomous driving has attracted significant attention because of its potential to revolutionize the automobile industry [1], [2]. However, safety remains a major challenge for the development of autonomous vehicles [3], [4], [5]. Undesirable decision making behaviors of autonomous vehicles may endanger lives and cause enormous economic loss. As one of the most advanced artificial intelligence technologies, reinforcement learning (RL) has succeeded in a series of challenging decision making tasks (e.g., Go and StarCraft II) [6], [7], [8]. Hence, applying RL to the decision making task of autonomous driving has become a hot topic for researchers [9].

This work was supported by Schaeffler Hub for Advanced Research (SHARE@NTU) under program Smart Mechatronic Lab for Industrial Collaborative Robotics in Manufacturing (No. I2001E0067 (Schaeffler)-PA5 and I2001E0067 (IAF-ICP)-PA5). (Corresponding author: Chen Lv) X. He, H. Yang, Z. Hu and C. Lv are with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798. (e-mail: xiangkun.he@ntu.edu.sg, haohan001@e.ntu.edu.sg, zhongxu.hu@ntu.edu.sg, lyuchen@ntu.edu.sg)
While existing RL based decision making methods for autonomous vehicles have achieved many compelling results [10], [11], [12], [13], real-world driving tasks involve unavoidable measurement errors or sensor noise, which may mislead an autonomous vehicle into making suboptimal decisions or even cause catastrophic failures. In light of these risks, autonomous vehicles are required to ensure that their decision making systems can handle the natural observation uncertainties from the sensing and perception system, especially adversarial perturbations. However, few studies address this challenge.
Therefore, in this paper, a novel observation adversarial RL (OARL) approach for robust lane change decision making is proposed to improve the performance of an autonomous vehicle while enhancing the robustness of driving policies against adversarial observation perturbations. The main contributions of this paper are summarized as follows:
• A constrained observation-robust Markov decision process (COR-MDP) is advanced to model lane change decision making behaviors of an autonomous vehicle under policy constraints and observation perturbations. Meanwhile, a black-box attack technique with Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations efficiently.
• A constrained observation-robust actor-critic (COR-AC) algorithm is presented to optimize lane change policies and minimize the Jensen-Shannon (JS) divergence based average variation distance of the policies attacked by the optimal adversarial observation perturbations.
Three testing cases with different traffic flow densities are implemented to evaluate the performance of our robust lane change decision making approach through Simulation of Urban MObility (SUMO) [14], [15]. The results demonstrate that the proposed OARL method is effective and outperforms the competitive baselines.
The rest of this paper is arranged as follows. Related work is reviewed in Section II. The proposed OARL method for robust decision making of autonomous vehicles is described in Section III. Implementation details of our method are provided in Section IV. The evaluation results and analyses are discussed in Section V. Conclusions are drawn in Section VI.
©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

II. RELATED WORK
According to the driving behaviors (e.g., lane change, acceleration or deceleration) or tasks (e.g., overtaking or ramp merging) considered in existing studies, RL based decision making of autonomous vehicles can roughly be divided into three categories: longitudinal, lateral and coordinated decision making [9]. RL based longitudinal decision-making methods generally adopt an RL algorithm to determine the speed modes of autonomous vehicles, such as speed keeping, acceleration and deceleration [11], [16], [17], [18].

A. Reinforcement Learning based Lateral Decision Making for Autonomous Vehicles
RL based lateral decision making approaches for autonomous vehicles mostly employ an RL algorithm to learn lane change behaviors or select target lanes. One popular paradigm is lateral decision making with the deep Q-network (DQN) or its variants. A lane change decision-making framework for autonomous vehicles is developed to learn risk sensitive driving policies using risk-awareness prioritized replay DQN in [12]. A lane change decision making method is presented for autonomous vehicles through DQN with safety verification in [19]. A harmonious lane-changing decision making approach based on DQN is advanced to improve overall traffic efficiency in [20]. A DQN method with rule-based constraints is developed for lane change decision making of autonomous vehicles in [21]. A lane change decision-making approach for autonomous vehicles is developed via double DQN with the structure of Deep Sets in [22]. A lane change decision making method based on a partially observable Markov decision process and DQN is introduced for autonomous vehicles in [23]. The above methods are simple but effective. Moreover, combined with rule based constraints, the driving safety of autonomous vehicles can be guaranteed. However, these schemes cannot necessarily find the optimal driving policies.
In addition to the DQN based paradigms, there are lateral decision making approaches for autonomous driving built on other RL algorithms. A proximal policy optimization (PPO) based lane change decision-making method is presented for autonomous driving in [13]. A multi-objective approximate policy iteration algorithm is proposed to implement lane change decision making of an autonomous vehicle in [24]. A lane change decision-making scheme based on attention-based hierarchical deep RL is proposed for autonomous vehicles in [25]. Although these methods may achieve better performance than the DQN based schemes, none of them studies the robust decision-making problem of autonomous vehicles.

B. Reinforcement Learning based Coordinated Decision Making for Autonomous Vehicles
RL based coordinated decision making schemes usually leverage an RL algorithm to determine longitudinal and lateral driving behaviors of autonomous vehicles simultaneously. A longitudinal and lateral coordinated decision making approach based on the AlphaGo Zero algorithm is developed for autonomous vehicles in [26]. The requested speed and target lane can be determined simultaneously by the five decision making behaviors of the RL agent. A DQN based decision making method is advanced in [27], which can simultaneously determine discrete speed modes and lane change behaviors of an autonomous vehicle. An optimization embedded RL with an actor-critic framework is presented to determine longitudinal and lateral coordinated decision making behaviors for autonomous vehicles in [28]. A coordinated decision making method based on the deep deterministic policy gradient algorithm is developed to determine throttle and steering maneuvers for autonomous driving in [29]. Unfortunately, the above methods mostly assume that the state observations are free of unexpected perturbations. Such an assumption can hardly hold in real-world scenarios.

III. OBSERVATION ADVERSARIAL REINFORCEMENT LEARNING FOR ROBUST DECISION MAKING
A. Robust Lane Change Decision Making Framework for Autonomous Vehicles
Since most existing lane change decision-making frameworks for autonomous vehicles do not take perception uncertainty into account, a robust lane change decision making framework with the OARL algorithm is proposed to cope with adversarial perturbations on state observations in autonomous driving, as shown in Fig. 1. The ego vehicle, shown in red, is an autonomous vehicle. The longitudinal decision making of the ego vehicle is implemented by the intelligent driver model (IDM) in SUMO. The vehicles of other colors are social vehicles, whose longitudinal driving behaviors are determined by the IDM of SUMO and whose lane change maneuvers are performed via the LC2013 model [30] in SUMO. Moreover, the output of the ego vehicle is discrete and includes lane keeping, left lane changing and right lane changing.
Our RL autonomous driving agent seeks to maximize the expected return while satisfying the policy constraints. In Fig. 1, the block with respect to COR-MDP and COR-AC is used for optimizing the robust driving policy and interacting with the environment. Its input includes state s, reward r and the optimal adversarial observation perturbations $\Delta^*$, and t denotes the time step. The optimal adversarial observation perturbations $\Delta^*$ contain the optimal adversarial multiplicative perturbation $\Delta_m^*$ and the optimal adversarial additive perturbation $\Delta_a^*$. The output is the action a based on the policy π(a|s).
The block with regard to the black-box attacks is employed to approximate the optimal adversarial perturbations. The input of this block includes state s and the policy π(a|s), and its output is the optimal adversarial perturbation. Additionally, the block associated with the environment is leveraged to generate state s and reward r. Its input is the action a based on the policy π(a|s), and the output contains state s and reward r.

B. Constrained Observation-robust Markov Decision Process
To model the decision making behaviors of RL based autonomous driving agent under policy constraints and observation perturbations, the proposed COR-MDP is introduced in this section.
Definition 1: A COR-MDP can be characterized via a 7-tuple (S, A, p, r, c, ∆, γ). S is the set of states, called the state space. A is the set of actions, called the action space. p is the transition probability distribution of the next state $s' \in S$ given the current state $s \in S$ and action $a \in A$. $r: S \times A \to \mathbb{R}$ represents the reward function, and c denotes the constraint function. ∆ indicates the observation perturbation. γ ∈ (0, 1) is the discount factor.
COR-MDP attempts to solve the following problem: where T is timestep, and is an expected minimum deviation.

C. Black-Box Attack with Bayesian Optimization
In this section, the black-box attack based on Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations.
Bayesian optimization is a black-box optimization algorithm based on Bayes' theorem [31]. This approach works by building a probabilistic model of the objective function, called the surrogate model, which is then searched efficiently through an acquisition function before candidate samples are selected for evaluation on the real objective function [32], [33].
The JS divergence is a symmetrized and smoothed version of the Kullback-Leibler (KL) divergence [34], [35]. More importantly, the JS divergence between two probability distributions is always finite and, when the base-2 logarithm is used, bounded by 1. Hence, JS divergence is employed to measure the average variation distance of the policies attacked by the observation perturbations. The optimization objective with JS divergence can be designed as: where $D_{JS}$ represents the distance based on JS divergence, $D_{KL}$ denotes KL divergence, and where $\tilde{a}$, $\tilde{s}$ and $\tilde{s}'$ are the action, the state and the next state perturbed by observation perturbations, respectively. Therefore, our black-box attack approach is formulated as: where $\Delta$ represents the observation perturbation, $\Delta_m$ and $\Delta_a$ are the multiplicative perturbation and the additive perturbation, $\Delta_m^0$ and $\Delta_a^0$ are the reference values of the multiplicative perturbation and the additive perturbation, and $\delta_m$ and $\delta_a$ are the desired bounds of the multiplicative perturbation and the additive perturbation, respectively.
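To make the policy-variation measure concrete, the following minimal Python sketch computes the JS divergence between the action distribution at a clean state and at a perturbed state. It is an illustration under stated assumptions rather than the formulation used in this paper: the perturbation model `perturb` (elementwise scaling by the multiplicative term plus the additive term) and the three-action `toy_policy` are hypothetical stand-ins, and the base-2 logarithm is chosen so that the divergence lies in [0, 1].

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and at most 1 with log base 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def perturb(state, delta_m, delta_a):
    """ASSUMED perturbation model: elementwise scaling plus an offset."""
    return [dm * s + da for s, dm, da in zip(state, delta_m, delta_a)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(state):
    """HYPOTHETICAL 3-action policy standing in for pi(a|s)."""
    logits = [state[0] - state[1], 0.0, state[1] - state[0]]
    return softmax(logits)


def policy_variation(state, delta_m, delta_a):
    """JS divergence between the clean and the attacked action distributions."""
    attacked = toy_policy(perturb(state, delta_m, delta_a))
    return js_divergence(toy_policy(state), attacked)
```

With the identity perturbation (all multiplicative terms equal to 1 and all additive terms equal to 0) the variation is zero, and two distributions with disjoint support reach the upper bound of 1.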
Algorithm 1 outlines the black-box attack method using Bayesian optimization. The acquisition function is designed through the upper confidence bound (UCB) [36]. Additionally, a Gaussian process is leveraged to build the surrogate model for the optimization objective in our algorithm.

Algorithm 1 Black-box attack with Bayesian optimization
for i = 1, 2, ..., I do
    Find new adversarial observation perturbation via optimizing the acquisition function UCB(·) over Gaussian process model:
    Augment data to memory M:
    Update the Gaussian process model.
end for
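The loop of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work: the zero-mean Gaussian process with a squared-exponential kernel, the random candidate search over the acquisition function, and all names and default values (`bayes_opt_attack`, `kappa`, `n_candidates`, the kernel length scale, the noise level) are assumptions. The `objective` argument stands for the policy variation to be maximized, and `bounds` for the box around the reference perturbation.

```python
import math
import random


def rbf(x, y, length=0.5):
    """Squared-exponential kernel between two perturbation vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_gp(X, y, noise=1e-4):
    """Precompute K^-1 and K^-1 y for a zero-mean Gaussian process surrogate."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    K_inv = [solve(K, col) for col in eye]  # K is symmetric, so K^-1 is too
    alpha = solve(K, y)
    return K_inv, alpha


def ucb(x, X, K_inv, alpha, kappa):
    """Upper confidence bound: posterior mean + kappa * posterior std."""
    k = [rbf(xi, x) for xi in X]
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    quad = sum(k[i] * sum(K_inv[i][j] * k[j] for j in range(len(k)))
               for i in range(len(k)))
    std = math.sqrt(max(rbf(x, x) - quad, 0.0))
    return mean + kappa * std


def bayes_opt_attack(objective, bounds, iters=10, n_init=3,
                     n_candidates=100, kappa=2.0, seed=0):
    """Approximately maximize objective(delta) over the box given by bounds."""
    rng = random.Random(seed)

    def sample():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    X = [sample() for _ in range(n_init)]   # memory of tried perturbations
    y = [objective(x) for x in X]           # and their objective values
    for _ in range(iters):
        K_inv, alpha = fit_gp(X, y)         # refresh the surrogate on the memory
        candidates = [sample() for _ in range(n_candidates)]
        x_next = max(candidates, key=lambda x: ucb(x, X, K_inv, alpha, kappa))
        X.append(x_next)                    # augment the memory
        y.append(objective(x_next))
    best = max(range(len(y)), key=lambda i: y[i])
    return X[best], y[best]
```

The sketch refits the surrogate at the top of each iteration instead of at the bottom, which is equivalent to the ordering in Algorithm 1. Maximizing the acquisition function by random candidate search keeps the example dependency-free; a gradient-based or library optimizer would be a natural replacement in practice.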

D. Constrained Observation-Robust Actor-Critic
To learn the robust optimal lane change policy, the proposed COR-AC algorithm is introduced in this section. COR-AC attempts to solve the following optimization problem: where $\Delta^*$ represents the optimal adversarial observation perturbation.
A policy iteration (PI) scheme is employed to solve COR-MDP, which is called constrained observation-robust PI (COR-PI). COR-PI consists of policy evaluation and policy improvement, which are updated iteratively until convergence.
According to Lagrange duality theory [37], the Lagrange function of the optimization problem (6) can be derived as: where λ is the dual variable of the RL agent.
1) Constrained Observation-Robust Policy Evaluation: The action-value function $Q^\pi(\cdot)$ with adversarial observation perturbations can be learned iteratively under a fixed policy, starting from any action-value function $Q^\pi(\cdot): S \to \mathbb{R}^{|A|}$ and repeatedly applying a modified Bellman backup operator $T^\pi$ given via: where $V^\pi(\cdot)$ is the expected value function with adversarial observation perturbations. Since the policy model outputs a discrete action distribution, the expectation of the value function $V^\pi(\cdot)$ can be calculated directly.
To speed up training, the COR-AC algorithm adopts two parameterized action-value functions with network parameters $\phi_z$, $z \in \{1, 2\}$. The action-value function parameters can be updated via minimizing the following loss function of the critic network: where $T_s$ represents a transition sampled from the replay buffer D, and $y_t$ is the target value of the action-value function at time step t. To avoid overestimating the value function, the smaller of the two $Q^\pi(\cdot)$ values is used to train the critic network. With Eq. (8) and Eq. (9), $y_t$ can be defined in terms of $\bar{Q}^\pi(s_{t+1}; \bar{\phi}_z)$, where $\bar{Q}^\pi(\cdot)$ is the target action-value function, and $\bar{\phi}_z$ is the network parameter of the target action-value function. The network parameters of the target action-value function can be updated once per parameterized action-value function update via polyak averaging: where ρ is a hyperparameter between 0 and 1.
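The clipped double-Q target and the polyak update described above can be illustrated with a short Python sketch. It is a simplified sketch under assumptions: the constraint and adversarial-perturbation terms of the value function are omitted, the done signal is used to cut off bootstrapping, and the convention that ρ weights the old target parameters is assumed rather than taken from the equations of this paper. All function names are hypothetical.

```python
def expected_min_q(policy_probs, q1, q2):
    """Exact expectation over discrete actions of the smaller of two Q estimates."""
    return sum(p * min(a, b) for p, a, b in zip(policy_probs, q1, q2))


def td_target(reward, done, gamma, next_policy_probs, q1_next, q2_next):
    """One-step target; bootstrapping is cut when the episode has ended.

    ASSUMPTION: constraint and perturbation terms are left out for brevity.
    """
    bootstrap = expected_min_q(next_policy_probs, q1_next, q2_next)
    return reward + gamma * (1.0 - done) * bootstrap


def polyak_update(target_params, online_params, rho):
    """Soft update; ASSUMED convention: rho weights the old target parameters."""
    return [rho * t + (1.0 - rho) * o
            for t, o in zip(target_params, online_params)]
```

Because the action space has only three elements, the expectation over actions is an exact weighted sum rather than a sampled estimate, which is the property the text refers to when it says the expectation can be calculated directly.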
2) Constrained Observation-Robust Policy Improvement: In COR-PI, policy improvement refers to optimizing and updating the policies of the RL agent. The RL agent attempts to maximize the expected return of the policy while satisfying the nonlinear constraint c(·).
With Eq. (7), the Lagrange dual function can be written as: Furthermore, the Lagrange dual problem associated with the problem (6) can be represented as: The optimal policy $\pi^*$ and the optimal dual variable $\lambda^*$ can be approximated iteratively. First, given a fixed $\lambda$, solve for the best policy $\pi^*$ by maximizing $L(\pi, \lambda)$. Then, plug in $\pi^*$ and find $\lambda^*$ by minimizing $L(\pi^*, \lambda)$. Therefore, with Eq. (14), the following expressions can be derived: The value function $V^\pi(\cdot)$ is implicitly defined through the action-value function $Q^\pi(\cdot)$, the policy $\pi(\cdot)$ and the constraint $c(\cdot)$. With the double $Q(\cdot)$ trick in Eq. (11), Eq. (9) and Eq. (15), the policy model parameters θ can be optimized via maximizing the following objective function of the actor network: Additionally, with Eq. (16), the dual variables can be updated via minimizing the following loss function:

IV. ALGORITHM IMPLEMENTATION
Algorithm 2 outlines the proposed OARL method in detail. $d_t$ is the done signal, which indicates whether the ego vehicle has collided at time step t. The proposed method optimizes the autonomous driving RL agent via the following main procedure. The initial network parameters of the actor and critic are sampled from a random distribution. In each iteration, the RL agent first collects the data of M timesteps and stores them in buffer D. The environment contains the state transition probability and the reward function used to generate the data trajectories. The optimal adversarial observation perturbations $\Delta^*$ are found by the black-box attack based on Bayesian optimization. Then the policies of the RL agent are updated iteratively.
When the vehicle in front is close and driving slowly, the ego vehicle will perform lane change maneuvers to ensure

Algorithm 2 Observation Adversarial Reinforcement Learning
Reset state s0.
We select the relevant states of the six nearest social vehicles in the lane where the ego vehicle is located and in the lanes on both sides of the ego vehicle as the observations of the ego vehicle. When the X-axis position of a social vehicle on the left or right of the ego vehicle is greater than or equal to that of the ego vehicle, we consider it a left front vehicle or right front vehicle, and otherwise a left rear or right rear vehicle. The state of the autonomous driving agent includes 16 dimensions, and the detailed description is provided in Fig. 2 and Table I. The social vehicles perform lane change maneuvers via the LC2013 model [30] during the training and testing of the RL agent.
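The front/rear classification described above can be sketched as a small Python helper that selects, for the ego lane and each adjacent lane, the nearest vehicle ahead and the nearest vehicle behind. This is an illustrative sketch only: the function name, the tuple layout `(x, lane, speed)`, and the convention that the lane index increases toward the left are assumptions, and the ego vehicle itself is expected to be excluded from the input.

```python
def nearest_neighbors(ego_x, ego_lane, vehicles):
    """Nearest front and rear vehicle in the ego, left and right lanes.

    vehicles: iterable of (x, lane, speed) tuples for social vehicles.
    ASSUMED convention: the lane index increases toward the left.
    Returns a dict mapping (side, position) to (gap, speed), or None if empty.
    """
    result = {(side, pos): None
              for side in ("same", "left", "right")
              for pos in ("front", "rear")}
    sides = {0: "same", 1: "left", -1: "right"}
    for x, lane, speed in vehicles:
        side = sides.get(lane - ego_lane)
        if side is None:
            continue  # more than one lane away: not observed
        # A vehicle at the same or a larger X position counts as "front".
        pos = "front" if x >= ego_x else "rear"
        gap = abs(x - ego_x)
        best = result[(side, pos)]
        if best is None or gap < best[0]:
            result[(side, pos)] = (gap, speed)
    return result
```

The six (gap, speed) pairs returned here correspond to the twelve distance and velocity entries of the observation; the remaining entries are the ego vehicle's own acceleration, yaw rate, velocity and lane index.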
Moreover, the action of the autonomous driving RL agent is discrete and includes lane keeping, left lane changing and right lane changing.

One challenge of this work is to learn robust lane change policies from scratch with no prior knowledge being applied. Therefore, the reward function plays a crucial role in optimizing the policies of the autonomous driving RL agent. Efficiency, comfort and safety are considered in designing the reward function.
To encourage the ego vehicle to enhance transport efficiency, the reward function r(·) is designed as $v_0/35$. This means that the autonomous driving agent is able to increase the reward by running at high speed. To avoid the ego vehicle following the front vehicle all the time, if the distance between the ego vehicle and the front vehicle is less than 30 meters, the reward of the agent will be reduced by 0.1. In terms of autonomous driving safety, both collision and vehicle dynamics stability are considered. According to the upper limit for the desired yaw rate given in [38], if the yaw rate of the ego vehicle exceeds the upper limit $k \mu g / v_0$, the reward of the agent will be reduced by 0.05. Here k is the dynamic factor proposed in [39], μ is the adhesion coefficient, and g represents gravitational acceleration. Additionally, if the ego vehicle is involved in a collision, the reward of the agent will be reduced by 0.1. To avoid frequent lane changes at high speeds, when the ego vehicle performs a lane change maneuver at a speed of more than 20 m/s, the reward of the agent will be reduced by $v_0/350$ (see Algorithm 3).

The actor and critic networks each use a single fully connected hidden layer of size 128. All activation functions in hidden layers are ReLU. The inputs and outputs of the neural networks have 16 and 3 dimensions respectively. The main hyperparameters of our algorithm are provided in Table IV of the Appendix.
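The reward shaping described in this section and listed in Algorithm 3 can be written as a short Python function. This is a sketch for illustration: the function and argument names are hypothetical, the default values of `k` and `mu` are placeholders because their values are not stated here, and the yaw rate is taken in degrees per second and converted with the exact value of π, whereas Algorithm 3 uses the factor 3.14/180.

```python
import math

GRAVITY = 9.81  # gravitational acceleration in m/s^2


def lane_change_reward(v0, d1, yaw_rate_deg, changed_lane, collided,
                       k=0.85, mu=0.9):
    """Reward following Algorithm 3.

    v0: ego speed (m/s); d1: gap to the front vehicle in the same lane (m).
    k and mu defaults are HYPOTHETICAL placeholders, not values from the paper.
    """
    reward = v0 / 35.0                       # efficiency: faster is better
    if d1 < 30.0:                            # discourage tailing the leader
        reward -= 0.1
    # v0 > 30 is tested first, so the division by v0 is never hit at standstill.
    if v0 > 30.0 and abs(math.radians(yaw_rate_deg)) > k * mu * GRAVITY / v0:
        reward -= 0.05                       # dynamics instability
    if changed_lane and v0 > 20.0:
        reward -= v0 / 350.0                 # high-speed lane change
    if collided:
        reward -= 0.1                        # collision
    return reward
```

At a speed of 35 m/s with a free lane ahead, the function returns the maximum per-step reward of 1.0, and a lane change at that speed costs 0.1, the same magnitude as the collision penalty.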

V. TESTING RESULTS AND PERFORMANCE EVALUATION

A. Environment
In this section, simulation tests on the SUMO platform are implemented to verify the performance of the proposed robust lane change decision-making method for autonomous vehicles. We employ SUMO to create three stochastic mixed traffic flows with different densities in highway scenarios.
Fig. 3 illustrates our evaluation scheme. P denotes the probability of emitting a vehicle each second. $P_n$, $P_l$ and $P_h$ are defined as the probabilities of emitting a vehicle each second in mixed traffic flows with normal, low and high densities respectively, and are set to 0.14, 0.035 and 0.245 respectively. Our method and the baseline approaches are evaluated in both training and testing. The policy models are trained and tested in the mixed traffic flow with normal density. Moreover, the mixed traffic flows with low and high densities are only leveraged to evaluate the policy models.
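As a rough illustration of what these emission probabilities imply, the Python sketch below draws one independent Bernoulli trial per second and computes the expected number of emitted vehicles. It is a simplified stand-in for the simulator's own traffic generation: the function names are hypothetical, and a single emitting flow is assumed. Under that assumption the normal-density setting yields about 504 vehicles per hour on average, and the low and high settings correspond to 0.25 and 1.75 times the normal emission rate.

```python
import random

# Per-second emission probabilities quoted in the text.
EMISSION_PROB = {"low": 0.035, "normal": 0.14, "high": 0.245}


def emission_times(prob, horizon_s, seed=0):
    """Seconds at which a vehicle is emitted, one Bernoulli trial per second."""
    rng = random.Random(seed)
    return [t for t in range(horizon_s) if rng.random() < prob]


def expected_vehicles(prob, horizon_s):
    """Mean number of vehicles emitted over the horizon by a single flow."""
    return prob * horizon_s
```

Because the number of vehicles on the road is random under this scheme, repeated evaluation episodes see different traffic configurations even at a fixed density setting.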

B. Baseline
The DQN and PPO based autonomous driving lane change decision making algorithms are implemented as classical baseline methods. Additionally, since soft actor-critic (SAC) with discrete actions [40] is a state-of-the-art discrete action RL algorithm, it is adopted as a state-of-the-art baseline scheme.

C. Evaluation
Fig. 5 shows the performance of each algorithm during training in the highway scenario based on stochastic mixed traffic flow with normal density. The final performance of the different schemes is given in Table II. Bold numbers indicate the best result in each column of Table II. All the algorithms are evaluated for five trials with different random seeds in stochastic mixed traffic flow with normal density. The solid curve corresponds to the mean and the shaded region represents the standard deviation.
Fig. 5 and Table II show that the robust lane change decision making method based on OARL outperforms the baseline schemes by a large margin, both in terms of learning efficiency and final performance. We compute the average metrics over the final 2000 time steps (10 episodes × 200 time steps). Moreover, the average return of one episode is computed over the final 2000 time steps. It can be found that the OARL approach performs comparably to the SAC method and outperforms the DQN and PPO schemes in terms of the final speed in stochastic mixed traffic flow with normal density. For example, in contrast to the DQN, PPO and SAC schemes, OARL gains 31.52%, 9.31% and 0.83% improvements with respect to the final return respectively. In addition, compared with the DQN, PPO and SAC methods, the collision safety of OARL is enhanced by about 83.33%, 250.00% and 16.67% respectively. It can be seen that PPO is superior to OARL in terms of the final driving speed. However, the collision safety of the PPO method is the worst.

Eq. (2) is utilized to measure the robustness of policy models against adversarial observation perturbations. We evaluate the final policy models trained by each method with different random seeds. Additionally, the average metrics are computed over 40000 time steps (200 episodes × 200 time steps). Table III shows the test results of the different policy models. OARL policies outperform DQN, PPO and SAC policies in the three stochastic mixed traffic flows with different densities, especially in terms of the robustness metric. For instance, in contrast to DQN, PPO and SAC policies, OARL gains 16.25%, 24.83% and 7.10% improvements with respect to return in mixed traffic flow with low density respectively. Meanwhile, compared with the DQN, PPO and SAC methods, the traffic efficiency of OARL policies is improved by about 10.73%, 19.11% and 1.97% respectively. It can be inferred that, to ensure transport efficiency, the autonomous vehicle based on OARL policies performs more lane changes to overtake than one driven by the baseline policies. Additionally, the robustness metric of OARL policies remains almost unchanged under adversarial observation perturbations.
In the stochastic mixed traffic flow scenario with normal density, the average return of OARL policies exceeds that of DQN, PPO and SAC policies. Hence, although PPO and SAC policies each have one metric on which they are superior to OARL policies, OARL policies have better overall performance than the baseline policies.
In the stochastic mixed traffic flow scenario with high density, OARL policies perform comparably to SAC policies and outperform DQN and PPO policies in terms of transport efficiency under adversarial observation perturbations. Moreover, in contrast to DQN, PPO and SAC policies, OARL gains 257.14%, 16.28% and 8.70% improvements with respect to return respectively. Compared with DQN, PPO and SAC policies, the collision safety of OARL policies is improved by about 332.14%, 28.57% and 28.57% respectively. The robustness of OARL policies against adversarial observation perturbations is clearly superior to that of DQN, PPO and SAC policies. Hence, it can be seen that the proposed method performs consistently in the three different highway scenarios.
Furthermore, Fig. 6 visualizes the performance of DQN, PPO, SAC and OARL policies in the stochastic mixed traffic flows with low and high densities. It can be seen that OARL policies outperform the baseline policies by a large margin in terms of return, robustness and collision safety. Moreover, the performance and robustness of OARL policies are scarcely influenced by adversarial observation perturbations. This means that the proposed robust lane change decision-making approach with OARL is able to improve the performance and generalization of the autonomous driving RL agent while keeping its decision-making behaviors robust against observation uncertainties.

D. Ablation
In this section, we evaluate the impact of the nonlinear constraint on the performance of the OARL agent. A scheme called actor-critic (AC) is implemented by removing the terms associated with the constraint in OARL. The AC and OARL methods are assessed in stochastic mixed traffic flow with normal density. Moreover, we train 5 different instances with different random seeds.
As shown in Fig. 7, the proposed OARL algorithm outperforms the AC scheme by a large margin in terms of average return. It can be found that the AC algorithm fails to make any progress during policy model training. We suggest two possible explanations for this phenomenon: (1) our constraint setting encourages the RL agent to explore and avoid falling into a local optimum; (2) updating policy gradients in more directions may be beneficial for model performance.
Additionally, the performance of our OARL scheme with a double hidden layer based network (DHLN) is evaluated in stochastic mixed traffic flow with normal density. It can be seen from Fig. 7 that OARL with a single hidden layer based neural network performs comparably to OARL with DHLN in terms of average return.

VI. CONCLUSION
This paper introduces a novel OARL approach for robust lane change decision making of autonomous vehicles. A COR-MDP is presented to model lane change decision making behaviors of autonomous vehicles under policy constraints and observation uncertainties. Meanwhile, the black-box attack technique with Bayesian optimization is implemented to find the optimal adversarial observation perturbations efficiently. Furthermore, a COR-AC algorithm is advanced to optimize autonomous driving lane change policies while keeping the variations of the policies attacked by the optimal adversarial observation perturbations within bounds.
The experimental results in three stochastic mixed traffic flows with different densities demonstrate that the proposed scheme can make lane change decisions robustly under observation uncertainties. In comparison with three baseline methods, the policy models trained by the proposed algorithm show superior generalization and robustness against adversarial observation perturbations.
Future work involves evaluating the robust lane change decision making approach with OARL in more scenarios. Moreover, OARL with continuous actions will be investigated to cope with the longitudinal decision making problem of autonomous vehicles.

Fig. 1. Framework of the proposed robust lane change decision-making approach for autonomous driving.

Fig. 2. Illustration of states of the OARL based autonomous driving agent.

Algorithm 3 Reward Function Design for RL Agent
Input: State and action of RL agent.
r(·) = v0/35.  ▷ Encourage agent to be more efficient
if d1 < 30 then
    r(·) = r(·) − 0.1.  ▷ Encourage lane change behavior
end if
if |3.14 · ω0/180| > k · μ · g/v0 and v0 > 30 then
    r(·) = r(·) − 0.05.  ▷ Penalize dynamics instability
end if
if Vehicle changes lane and v0 > 20 then
    r(·) = r(·) − v0/350.  ▷ Penalize high-speed lane change
end if
if Collision occurs then
    r(·) = r(·) − 0.1.  ▷ Penalize collision
end if
Output: r(·)

Parameters (Unit)    Definition
a0 (m/s^2)           Longitudinal acceleration of autonomous vehicle
ω0 (rad/s)           Yaw rate of autonomous vehicle
v0 (m/s)             Velocity of autonomous vehicle
v1 (m/s)             Velocity of vehicle in front in same lane
d1 (m)               Distance from vehicle in front in same lane
v2 (m/s)             Velocity of vehicle behind in same lane
d2 (m)               Distance from vehicle behind in same lane
v3 (m/s)             Velocity of vehicle in front in left lane
d3 (m)               Distance from vehicle in front in left lane
v4 (m/s)             Velocity of vehicle behind in left lane
d4 (m)               Distance from vehicle behind in left lane
v5 (m/s)             Velocity of vehicle in front in right lane
d5 (m)               Distance from vehicle in front in right lane
v6 (m/s)             Velocity of vehicle behind in right lane
d6 (m)               Distance from vehicle behind in right lane
l_index              Index of lane in which autonomous vehicle is located

Fig. 3. Schematic diagram of evaluation method using SUMO-based mixed traffic flow with a random number of vehicles.

Fig. 4. Illustration of model evaluation scheme. The autonomous driving RL agent observes the perturbed state $\tilde{s}_t$ rather than the state $s_t$ in model testing.
As shown in Fig. 4, unlike the model training stage of OARL, the autonomous driving RL agent observes the state

TABLE I
STATE OBSERVED BY AUTONOMOUS DRIVING RL AGENT.

TABLE II
FINAL PERFORMANCE OF DIFFERENT ALGORITHMS IN MODEL TRAINING.

TABLE IV
THE MAIN HYPERPARAMETERS OF THE PROPOSED ALGORITHM.