Robotic Control in Adversarial and Sparse Reward Environments: A Robust Goal-Conditioned Reinforcement Learning Approach

Xiangkun He, Member, IEEE, and Chen Lv, Senior Member, IEEE

Abstract—With deep neural network-based function approximators, reinforcement learning holds the promise of learning complex end-to-end robotic controllers that map high-dimensional sensory information directly to control policies. However, a common challenge, especially in robotics, is sample-efficient learning from sparse rewards, in which an agent must find a long sequence of "correct" actions to achieve a desired outcome. Unfortunately, inevitable perturbations on observations make this task even harder to solve. Here, this article advances a novel robust goal-conditioned reinforcement learning approach for end-to-end robotic control in adversarial and sparse reward environments. Specifically, a mixed adversarial attack scheme is presented to generate diverse adversarial perturbations on observations by combining white-box and black-box attacks. Meanwhile, a hindsight experience replay technique considering observation perturbations is developed to turn failed experiences into successful ones and to generate the policy trajectories perturbed by the mixed adversarial attacks. Additionally, a robust goal-conditioned actor-critic method is proposed to learn goal-conditioned policies while keeping the variations of the perturbed policy trajectories within bounds. Finally, the proposed method is evaluated on three tasks with adversarial attacks and sparse reward settings. The results indicate that our scheme can ensure robotic control performance and policy robustness on adversarial and sparse reward tasks.


Impact Statement—In recent years, reinforcement learning has been an impressive component of modern artificial intelligence and is still under vigorous development. Nonetheless, compared to supervised learning, which has been widely applied in a variety of domains, reinforcement learning has not been broadly accepted and deployed in real-world problems. One key factor is an agent's trustworthiness, of which policy robustness is an essential part. Additionally, designing the reward function requires both domain-specific knowledge and reinforcement learning expertise, which limits the applicability of reinforcement learning. Unfortunately, in some real-world tasks, on account of reward function design complexities and inevitable perception errors, agents have to learn under sparse rewards and observation uncertainties. However, so far there are few studies that cope with this challenge. Our approach contributes to the foundation for the realization of trustworthy and efficient artificial intelligence, potentially bringing reinforcement learning a step closer to real-world deployment.

I. INTRODUCTION
With the development of emerging technologies such as artificial intelligence and 5th-generation mobile communication technology (5G), data-driven robotic control has become a research hotspot that can lead to a dramatic breakthrough in the next generation of the robot industry [1], [2], [3]. Conventional robotic control approaches employ a hierarchical control architecture that mainly consists of sensing, planning, and control modules [4], [5], [6]. In addition, these methods require accurate kinematics or dynamics models that are difficult to acquire, especially for complex robotic control [7], [8], [9].
With deep neural networks as function approximators, reinforcement learning (RL) methods have demonstrated their worth in a series of challenging tasks, from games to robotic control [10], [11], [12], [13]. High-quality end-to-end robot controllers can be implemented through RL methods without a prior hierarchical framework, kinematics, or dynamics models [14]. An RL-based robot control approach using a mixture of actor-critic experts was proposed in [15]. A residual RL method for robot control was designed by combining a learnable parametrized model with a conventional feedback controller in [16]. A multiagent advantage actor-critic algorithm was developed for snake robot control in [17]. An RL method with Lyapunov stability theory was presented to guarantee closed-loop stability of the robot controller in [18]. A prediction-guided RL algorithm was developed for multiobjective continuous robot control tasks in [19].
While existing RL-based robot control methods have achieved many compelling results [20], [21], [22], they mostly rely on carefully crafted reward functions that require both domain-specific knowledge and RL expertise. This reward engineering restricts the real-world applicability of RL. Furthermore, since it is very difficult to design the reward function for some tasks, such as Go, agents have to learn in sparse reward environments. Hence, a common challenge in RL, especially for robotics, is sample-efficient learning from sparse rewards, in which an agent has to find a long sequence of "correct" actions to achieve a desired outcome. Many studies have made efforts to solve this tricky issue [23], [24], [25]. One popular way of tackling sparse reward tasks is the hindsight experience replay (HER) technique [26], which copes with this problem by converting failed experiences into successful ones through relabeling the goals.
The abovementioned studies generally assume that the agents' sensing and perception systems are free of uncertainties. Nonetheless, this assumption can barely hold in real-world situations. The observations of robots include inevitable perturbations that naturally arise from unpredictable stochastic noises or sensing errors. If a robot's control policies are not robust, its performance may be degraded by observation uncertainties, especially in sparse reward environments. In other words, perturbations on observations make it more difficult for agents to discover a trajectory of "correct" actions that obtains a positive sparse reward signal.
In recent years, some researchers have tried to improve the robustness of RL-based policies in robotic control [27], [28], [29]. For example, Pinto et al. [30] presented a robust adversarial RL algorithm for robotic control by modeling uncertainty as an adversarial agent. Tessler et al. [31] proposed a robotic control approach with action-robust RL by structuring a probabilistic action robust Markov decision process (MDP) and a noisy action robust MDP. Pattanaik et al. [32] introduced a state-adversarial robust RL scheme to improve the robustness of robotic control policies via a gradient-based adversarial attack technique. However, few works simultaneously try to cope with observational uncertainties and sparse rewards in robotic control tasks. Consequently, there is still room for advancement and refinement.
In this article, we advance a novel robust goal-conditioned reinforcement learning (RGCRL) approach for end-to-end robotic control in adversarial and sparse reward environments. The RGCRL scheme is based on two key insights. First, robust policies facilitate the agent in making a long sequence of "correct" decisions to attain a positive sparse reward signal under uncertain perturbations. Second, combining white-box attacks [33] and black-box attacks [34], [35], [36] can generate diverse adversarial samples. Specifically, this article's main contributions are summarized as follows.
1) A mixed adversarial attack scheme is proposed by combining white-box and black-box attacks, which aims to maximize the average variation distance on attacked policies while generating diverse adversarial samples on observations.

2) An HER technique considering observation perturbations is developed to turn a failed experience into a successful one and produce the policy trajectories perturbed by the mixed adversarial attacks.

3) A robust goal-conditioned actor-critic (RGCAC) method is presented to optimize goal-conditioned policies and keep the variations of the perturbed policy trajectories within bounds in sparse reward environments.

The proposed approach's framework is illustrated in Fig. 1. The module with respect to the mixed adversarial attacks aims to generate diverse adversarial samples on observations. Its input contains the state s, goal g, and the goal-conditioned policy π(s ⊕ g), where ⊕ denotes concatenation. The output is the optimal adversarial attack Δ*.
The HER module considering observation perturbations provides the hindsight experiences and the policy trajectories under the mixed adversarial attacks. Its input includes the state s_t, goal g, action a_t, reward r_t, next state s_{t+1}, goal-conditioned policy π, and optimal adversarial attack Δ*, where t and k represent time steps and s̃ denotes the state perturbed by the mixed adversarial attacks. The output contains samples for policy optimization and adversarial attacks.
The module associated with the RGCAC algorithm attempts to optimize the end-to-end robust control policies against observation perturbations. Its input includes the hindsight experiences and the perturbed policy trajectories. The output is the robotic control policy.
In addition, the module with regard to the robotic control environment produces the transition data. Its input is the action a based on the goal-conditioned policy π(s ⊕ g), and its output includes the state s, goal g, and reward r.
Three testing cases with adversarial and sparse reward settings are executed to benchmark the performance of our robotic control method in the Franka Emika Panda robot environment [37]. The results indicate that the RGCRL scheme can learn robust control policies against observation uncertainties while improving robot performance.
The rest of this article is organized as follows. The preliminaries and the methodology of our method are introduced in Sections II and III, respectively. Section IV analyzes the testing results of the proposed algorithm on the different robot tasks. Finally, Section V concludes this article.

II. PRELIMINARIES

A. Markov Decision Process
An MDP is a mathematical formalism for modeling the sequential decision making of an agent, and it is also a straightforward paradigm for the problem of learning from interaction to attain a goal. A standard MDP can be expressed as the 5-tuple (S, A, r, p, γ) based on the state space S, action space A, reward function r : S × A → R, transition probability function p : S × A × S → R, and discount factor γ ∈ (0, 1).
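For concreteness, one interaction step under this tuple can be represented as a small container (a minimal sketch; the names are illustrative, not from this article):

```python
# Minimal sketch of one (s, a, r, s') interaction under discount gamma.
from typing import NamedTuple
import numpy as np

class Transition(NamedTuple):
    state: np.ndarray       # s_t in S
    action: np.ndarray      # a_t in A
    reward: float           # r(s_t, a_t)
    next_state: np.ndarray  # s_{t+1} ~ p(. | s_t, a_t)
    done: bool              # whether s_{t+1} is terminal

GAMMA = 0.98  # discount factor gamma in (0, 1)
```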

B. Reinforcement Learning
The MDP is the theoretical foundation of RL, and the RL agent aims to learn a policy that maximizes the expected return. In the RL problem, the expected return can be denoted by

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\Big] \qquad (1)$$

where π indicates the agent's policy, t is the time step, T is the last time step, state s_t ∈ S, and action a_t ∈ A. Therefore, the central optimization problem of RL can be represented as

$$\pi^{*} = \arg\max_{\pi} J(\pi) \qquad (2)$$

where π* is the optimal policy.

III. METHODOLOGY

A. Mixed Adversarial Attacks
In this section, in order to generate diverse adversarial samples, the mixed adversarial attack method is introduced by combining black-box and white-box attacks. The mixed adversarial attack scheme can be described as

$$\text{attack} = \begin{cases} \text{black-box attack}, & \text{with probability } \omega \\ \text{white-box attack}, & \text{with probability } 1 - \omega \end{cases} \qquad (3)$$

where ω represents the mixing probability.
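As a minimal sketch, the mixing rule amounts to one Bernoulli draw per attack; `black_box_attack` and `white_box_attack` are the two schemes detailed in the sketches below:

```python
import numpy as np

def mixed_attack(state, goal, policy, omega=0.5, rng=None):
    """Draw the black-box attack with probability omega, else the white-box one.

    `black_box_attack` and `white_box_attack` (sketched below) each return a
    perturbation pair (Delta_m, Delta_a) applied as s~ = Delta_m * s + Delta_a.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < omega:
        return black_box_attack(state, goal, policy)
    return white_box_attack(state, goal, policy)
```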
1) Black-Box Attack With Bayesian Optimization: We implement a Bayesian-optimization-based black-box attack technique that has high query efficiency for approximating adversarial samples. Black-box attack methods do not require information about the objective function's architecture or parameters; they only observe input-output correspondences by querying the model. However, a black-box attack typically requires a large number of attempts to find effective adversarial samples. Bayesian optimization, a black-box optimization approach founded on Bayes' theorem, is particularly well suited to problems involving expensive queries. This technique works by building a surrogate model of the objective function that can be efficiently searched via an acquisition function before candidate samples are selected for assessing the real objective function.
The Bayesian-optimization-based black-box attack method is described in Algorithm 1. The upper confidence bound (UCB) method is chosen to design the acquisition function in our scheme, and the Gaussian process is adopted to build the surrogate model. Moreover, in this article, the objective function f(·) of the black-box attack scheme measures the variation distance between the clean and attacked policy outputs:

$$f(\Delta, s, g, \pi) = \big\lVert \pi(\tilde{s} \oplus g) - \pi(s \oplus g) \big\rVert \qquad (4)$$

where the observation uncertainty is represented as Δ = [Δ_m, Δ_a], the perturbed state is denoted as s̃ = Δ_m s + Δ_a, Δ_m indicates the multiplicative uncertainty, and Δ_a is the additive uncertainty.
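To make the loop concrete, a minimal sketch of the Bayesian-optimization attack follows, assuming the variation-distance objective (4); the Gaussian-process surrogate and UCB acquisition mirror Algorithm 1, while the bounds, query budget, and UCB weight κ are illustrative stand-ins for the actual hyperparameters:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def black_box_attack(state, goal, policy, n_queries=30, n_candidates=256,
                     delta_m=0.05, delta_a=0.05, kappa=2.0,
                     rng=np.random.default_rng(0)):
    dim = state.size
    # Search box for Delta = [Delta_m, Delta_a] around the clean observation.
    lo = np.concatenate([np.full(dim, 1 - delta_m), np.full(dim, -delta_a)])
    hi = np.concatenate([np.full(dim, 1 + delta_m), np.full(dim, delta_a)])

    def f(delta):  # objective (4): how much the perturbation shifts the policy
        d_m, d_a = delta[:dim], delta[dim:]
        s_tilde = d_m * state + d_a
        return np.linalg.norm(policy(s_tilde, goal) - policy(state, goal))

    X = rng.uniform(lo, hi, size=(5, 2 * dim))        # initial random queries
    y = np.array([f(x) for x in X])
    gp = GaussianProcessRegressor(normalize_y=True)   # surrogate model of f
    for _ in range(n_queries):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(n_candidates, 2 * dim))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + kappa * sigma)]  # UCB acquisition
        X = np.vstack([X, x_next])                    # augment query memory
        y = np.append(y, f(x_next))
    best = X[np.argmax(y)]
    return best[:dim], best[dim:]                     # (Delta_m, Delta_a)
```

Each round fits the surrogate to all perturbations evaluated so far, so expensive policy queries are spent only on candidates the acquisition function deems promising.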

2) White-Box Attack With Lagrange Duality Theory:
The white-box attack technique requires complete knowledge of the objective function to find successful adversarial samples. In this article, the white-box attack method is transformed into a constrained optimization problem that can be solved by Lagrange duality theory.
Specifically, the white-box attack scheme is formulated as

$$\max_{\Delta} f(\Delta, s, g, \pi) \quad \text{s.t.} \quad \lVert \Delta_m - \bar{\Delta}_m \rVert \le \delta_m, \;\; \lVert \Delta_a - \bar{\Delta}_a \rVert \le \delta_a \qquad (5)$$

where the perturbation reference value is denoted as Δ̄ = [Δ̄_m, Δ̄_a], Δ̄_m and Δ̄_a represent the multiplicative and additive uncertainties' reference values, and δ_m and δ_a are the multiplicative and additive uncertainties' bounds, respectively.
Therefore, the Lagrangian of the abovementioned constrained optimization task can be obtained as

$$\mathcal{L}(\Delta, \alpha) = f(\Delta, s, g, \pi) - \alpha_m \big(\lVert \Delta_m - \bar{\Delta}_m \rVert - \delta_m\big) - \alpha_a \big(\lVert \Delta_a - \bar{\Delta}_a \rVert - \delta_a\big) \qquad (6)$$

where α = [α_m, α_a] is the dual variable. With Lagrange duality theory, the white-box attack can be formulated as

$$\min_{\alpha \ge 0} \max_{\Delta} \mathcal{L}(\Delta, \alpha). \qquad (7)$$

Hence, the adversarial perturbation Δ can be updated in J steps by

$$\Delta^{j+1} = \Delta^{j} + \eta \nabla_{\Delta} \mathcal{L}(\Delta^{j}, \alpha^{j}) \qquad (8)$$

where η is the learning rate for optimizing Δ and j = 1, . . ., J.
The dual variable α is updated in J steps through

$$\alpha^{j+1} = \max\big(0, \; \alpha^{j} - \xi \nabla_{\alpha} \mathcal{L}(\Delta^{j+1}, \alpha^{j})\big) \qquad (9)$$

where ξ is the learning rate for approximating α.
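A minimal PyTorch sketch of this primal-dual loop follows, assuming the same variation-distance objective as (4), references Δ̄_m = 1 and Δ̄_a = 0, and illustrative values for J, η, ξ, δ_m, and δ_a:

```python
import torch

def white_box_attack(state, goal, policy_net, J=20, eta=0.01, xi=0.01,
                     delta_m=0.05, delta_a=0.05):
    s = torch.as_tensor(state, dtype=torch.float32)
    g = torch.as_tensor(goal, dtype=torch.float32)
    d_m = torch.ones_like(s, requires_grad=True)   # reference Delta_m = 1
    d_a = torch.zeros_like(s, requires_grad=True)  # reference Delta_a = 0
    alpha = torch.ones(2)                          # dual variables, alpha >= 0
    with torch.no_grad():
        pi_clean = policy_net(torch.cat([s, g]))
    for _ in range(J):
        s_tilde = d_m * s + d_a                    # perturbed observation
        var_dist = torch.norm(policy_net(torch.cat([s_tilde, g])) - pi_clean)
        c_m = torch.norm(d_m - 1.0) - delta_m      # constraint residuals of (5)
        c_a = torch.norm(d_a) - delta_a
        lagrangian = var_dist - alpha[0] * c_m - alpha[1] * c_a   # (6)
        grad_m, grad_a = torch.autograd.grad(lagrangian, [d_m, d_a])
        with torch.no_grad():
            d_m += eta * grad_m                    # primal ascent step (8)
            d_a += eta * grad_a
            alpha += xi * torch.stack([c_m, c_a])  # dual descent step (9)
            alpha.clamp_(min=0.0)                  # keep alpha >= 0
    return d_m.detach(), d_a.detach()
```

Because the gradient of (6) with respect to α is the negated constraint residual, the dual descent step grows α whenever the perturbation leaves its bound, pulling Δ back toward the feasible region.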

B. HER Considering Observation Perturbations
On sparse reward tasks, the reward depends on whether the policy trajectory enables the agent to reach the desired goal g, and only a successful policy trajectory triggers a positive reward. In most cases, the successful policy trajectories collected by the agent are insufficient for training.
The HER technique considering observation perturbations aims to provide the hindsight experiences and the perturbed policy trajectories.
To generate the hindsight experiences for addressing the sparse reward problem, the HER scheme converts failed experiences into successful ones by relabeling the goals. Specifically, the HER method replaces the desired goals g in the training data with the achieved goals g' based on the states in failed experiences. Here, the desired goal g denotes the real target that the agent attempts to attain, while an achieved goal g' represents a state that the agent has actually reached. When g is displaced by an achieved goal g', the corresponding failed experiences are assigned desirable rewards, which facilitates the agent's policy learning in sparse reward environments. The policy trajectory based on the hindsight experience in an episode can be expressed as

$$\tau = \{\pi(s_0 \oplus g'), \ldots, \pi(s_{T-1} \oplus g')\}. \qquad (10)$$

Additionally, we construct the perturbed policy trajectories via the sampled hindsight experiences and the mixed adversarial attacks. Hence, the perturbed policy trajectory in an episode can be represented as

$$\tilde{\tau} = \{\pi(\tilde{s}_0 \oplus g'), \ldots, \pi(\tilde{s}_{T-1} \oplus g')\}. \qquad (11)$$
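As a minimal sketch, relabeling with the episode's last achieved goal (the "final" strategy) can be written as follows; the dictionary keys, the distance threshold, and the choice of relabeling strategy are illustrative assumptions rather than details fixed by this article:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Sparse reward of Section IV: 0 on success, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) < threshold else -1.0

def relabel_episode(episode):
    """episode: list of dicts with keys s, a, s_next, achieved_goal, goal.

    Substitute the desired goal g with the achieved goal g' of the final
    state, then recompute the sparse reward so the trajectory succeeds.
    """
    hindsight_goal = episode[-1]["achieved_goal"]   # g <- g'
    relabeled = []
    for tr in episode:
        r = sparse_reward(tr["achieved_goal"], hindsight_goal)
        relabeled.append({**tr, "goal": hindsight_goal, "r": r})
    return relabeled
```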

C. Robust Goal-Conditioned Actor-Critic
This section introduces the proposed RGCAC algorithm that enables the robotic agent to learn the robust goal-conditioned control policy.

1) Robust Goal-Conditioned MDP:
A robust goal-conditioned MDP (RGCMDP) is proposed to model agent behaviors under uncertainties and sparse rewards; it extends the standard MDP of Section II-A with a goal g and adversarial perturbations on observations, and its optimization task is defined as follows.
To handle the worst-case situation, the RGCMDP aims to solve for the optimal policies under optimal adversarial attacks on observations. The agent tries to learn goal-conditioned policies while keeping the variations of the perturbed policy trajectories within bounds. The optimization task can be formulated as

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t, g)\Big] \quad \text{s.t.} \quad \lVert \tilde{\tau} - \tau \rVert \le \beta \qquad (12)$$

where π represents a goal-conditioned policy π(s ⊕ g), π* denotes an optimal goal-conditioned policy π*(s ⊕ g), and β is an upper bound.

2) Robust Goal-Conditioned Policy Evaluation:
The action-value function Q^π(s_t, a_t, g) at time step t can be computed under a given goal g and a fixed agent policy iteratively via the Bellman backup operator

$$\mathcal{T}^{\pi} Q^{\pi}(s_t, a_t, g) = r(s_t, a_t, g) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[ Q^{\pi}\big(s_{t+1}, \pi(s_{t+1} \oplus g), g\big) \big]. \qquad (13)$$

The RGCAC algorithm leverages two action-value functions with parameters φ_z, z ∈ {1, 2}, to speed up training. The action-value function parameters can be learned by minimizing the following critic network loss function:

$$J_{Q}(\phi_z) = \mathbb{E}_{\mathcal{T}_b \sim \mathcal{M}}\big[ \big( Q^{\pi}(s_t, a_t, g; \phi_z) - y \big)^{2} \big] \qquad (14)$$

where T_b denotes a minibatch trajectory sampled from the HER memory M, and y represents the action-value function's target value.
To alleviate the overfitting problem, a smoothing regularization scheme [38] is adopted by adding a small amount of stochastic noise to the action selected by a deterministic policy. Moreover, with the state s_t, goal g, and deterministic policy π, the agent's action with the random noise can be designed as

$$\tilde{a}_t = \pi(s_t \oplus g; \theta) + c\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \qquad (15)$$

where θ denotes the actor network's parameters, c is a weight coefficient, and N(·) represents the Gaussian distribution.
To mitigate the action-value function's overestimation issue, the minimum of the two target action-value estimates is utilized to optimize the critic network's parameters. Hence, based on (13), y can be written as

$$y = r_t + \gamma (1 - d_t) \min_{z=1,2} \bar{Q}^{\pi}\big(s_{t+1}, \tilde{a}_{t+1}, g; \bar{\phi}_z\big) \qquad (17)$$

where ã_{t+1} is the smoothed action (15) evaluated at s_{t+1}, Q̄^π(·) represents the target action-value function, φ̄_z denotes the target action-value function's network parameters, z ∈ {1, 2}, and d_t indicates whether s_{t+1} is terminal.

TABLE I
MAIN HYPERPARAMETERS OF THE RGCRL AND BASELINE ALGORITHMS
To stabilize model training, the target action-value function's network parameters are renewed through Polyak averaging

$$\bar{\phi}_z \leftarrow \mu \phi_z + (1 - \mu) \bar{\phi}_z \qquad (18)$$

where μ denotes an interpolation factor between 0 and 1.
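Putting (13)-(18) together, the critic step can be sketched as follows (a minimal PyTorch sketch; the network handles, optimizers, and the values of γ, c, and μ stand in for the corresponding entries of Table I):

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_t, critic1, critic2, critic1_t, critic2_t,
                  opt1, opt2, gamma=0.98, c=0.2, mu=0.95):
    s, a, r, s_next, g, d = batch            # minibatch T_b from HER memory M
    with torch.no_grad():
        sg_next = torch.cat([s_next, g], dim=-1)
        noise = c * torch.randn_like(a)      # smoothing regularization (15)
        a_next = (actor_t(sg_next) + noise).clamp(-1.0, 1.0)
        q_next = torch.min(critic1_t(sg_next, a_next),  # min of twin targets
                           critic2_t(sg_next, a_next))
        y = r + gamma * (1.0 - d) * q_next   # target value (17)
    sg = torch.cat([s, g], dim=-1)
    loss = F.mse_loss(critic1(sg, a), y) + F.mse_loss(critic2(sg, a), y)  # (14)
    opt1.zero_grad(); opt2.zero_grad()
    loss.backward()
    opt1.step(); opt2.step()
    for net, net_t in ((critic1, critic1_t), (critic2, critic2_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - mu).add_(mu * p.data)   # Polyak averaging (18)
```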

3) Robust Goal-Conditioned Policy Improvement:
The policy improvement aims to optimize and update the agent's policies. Our robotic control method attempts to maximize the agent's expected return while keeping the changes of the policy trajectories perturbed by the mixed adversarial attacks within certain ranges in sparse reward environments.
The Lagrangian of the constrained optimization problem (12) can be written as

$$\mathcal{L}(\pi, \lambda) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t, g)\Big] - \lambda \big( \lVert \tilde{\tau} - \tau \rVert - \beta \big) \qquad (19)$$

where λ ≥ 0 is the dual variable.
With (19) and Lagrangian duality [39], the constrained optimization problem's Lagrange dual function can be represented as

$$g(\lambda) = \max_{\pi} \mathcal{L}(\pi, \lambda). \qquad (20)$$

In addition, the Lagrange dual problem concerning problem (12) can be expressed as

$$\min_{\lambda \ge 0} g(\lambda) = \min_{\lambda \ge 0} \max_{\pi} \mathcal{L}(\pi, \lambda). \qquad (21)$$

Under the given state s and goal g, with (21), the optimal goal-conditioned policy π*(s ⊕ g) and the optimal dual variable λ* can be found iteratively: fix a dual variable λ and optimize the goal-conditioned policy π(s ⊕ g) by maximizing (19); then, with the optimal goal-conditioned policy π*(s ⊕ g), the optimal dual variable λ* can be attained by minimizing (20). As a result, the following expressions are obtained:

$$\pi^{*} = \arg\max_{\pi} \mathcal{L}(\pi, \lambda) \qquad (22)$$

$$\lambda^{*} = \arg\min_{\lambda \ge 0} \mathcal{L}(\pi^{*}, \lambda). \qquad (23)$$

A double action-value function trick is employed to mitigate the expected return's estimation error problem. At time step t, the average action-value function under the goal g can be represented as

$$\bar{Q}^{\pi}(s_t, a_t, g) = \frac{1}{2} \sum_{z=1}^{2} Q^{\pi}(s_t, a_t, g; \phi_z). \qquad (24)$$

Hence, with (22) and (24), the optimal policy of the agent can be learned by maximizing the following actor network objective function:

$$J_{\pi}(\theta) = \mathbb{E}_{\mathcal{T}_b \sim \mathcal{M}}\Big[ \bar{Q}^{\pi}\big(s_t, \pi(s_t \oplus g; \theta), g\big) - \lambda \big\lVert \pi(\tilde{s}_t \oplus g; \theta) - \pi(s_t \oplus g; \theta) \big\rVert \Big]. \qquad (25)$$

In addition, with (23), the dual variable λ can be learned by minimizing the following objective function:

$$J(\lambda) = \mathbb{E}_{\mathcal{T}_b \sim \mathcal{M}}\Big[ \lambda \big( \beta - \big\lVert \pi(\tilde{s}_t \oplus g; \theta) - \pi(s_t \oplus g; \theta) \big\rVert \big) \Big]. \qquad (26)$$
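The alternating primal-dual updates (22)-(26) can then be sketched as a pair of gradient steps (a minimal PyTorch sketch; `log_lam` is a learnable scalar whose exponential keeps λ ≥ 0, a common substitute for explicitly projecting the dual variable, and β stands in for the Table I bound):

```python
import torch

def actor_update(batch, actor, critic1, critic2, opt_actor, log_lam, opt_lam,
                 beta=0.1):
    s, g, s_tilde = batch                       # clean and attacked states
    sg = torch.cat([s, g], dim=-1)
    sg_tilde = torch.cat([s_tilde, g], dim=-1)
    a = actor(sg)
    q_avg = 0.5 * (critic1(sg, a) + critic2(sg, a))       # average Q (24)
    variation = torch.norm(actor(sg_tilde) - a, dim=-1)   # perturbed drift
    lam = log_lam.exp().detach()                # lambda fixed for the pi-step
    actor_loss = (-q_avg + lam * variation).mean()        # maximize (25)
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    dual_loss = log_lam.exp() * (beta - variation.detach().mean())  # (26)
    opt_lam.zero_grad(); dual_loss.backward(); opt_lam.step()
```

Minimizing (26) raises λ whenever the observed policy variation exceeds β and relaxes it otherwise, so the penalty in (25) adapts to how badly the bound of (12) is violated.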

D. Algorithm Implementation
Algorithm 2 introduces our RGCRL approach in detail: in each episode, the agent samples an initial state s_0 and a desired goal g; for each time step t = 1, 2, . . ., T, it determines an action with the policy, executes a_t in the environment, and receives the transition s_{t+1}, r_t, d_t ∼ p(s_{t+1}|s_t, a_t); the transition trajectory is then saved in the HER memory M, a batch of the perturbed policy trajectories is constructed, and the actor, critic, and target action-value networks are updated via (25), (14), and (18), respectively. Moreover, d_t indicates whether s_{t+1} is terminal.
The actor and critic networks are implemented with two fully connected hidden layers of sizes {256, 256}, and ReLU is employed as the activation function in the hidden layers. The main hyperparameters of the proposed RGCRL algorithm are provided in Table I.
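A minimal PyTorch sketch of this architecture follows; the Tanh output squashing is an assumption for bounded control commands, not stated above:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """pi(s (+) g; theta): concatenated state-goal input -> action."""
    def __init__(self, obs_goal_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # assumed: actions in [-1, 1]
        )

    def forward(self, sg):
        return self.net(sg)

class Critic(nn.Module):
    """Q(s, a, g; phi): concatenated state-goal and action -> scalar value."""
    def __init__(self, obs_goal_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_goal_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, sg, a):
        return self.net(torch.cat([sg, a], dim=-1))
```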

IV. EXPERIMENTS

A. Experimental Environment
To benchmark our RGCRL method, we leverage the Franka Emika Panda robot environment [37], which consists of the Franka Emika Panda robotic arm model, the PyBullet physics engine [40], and OpenAI Gym [41]. The three robotic control tasks, reach, push, and pick-and-place, are illustrated in Fig. 2. The green shaded spaces in Fig. 2 denote the target positions, and the green cubes represent the objects in the push and pick-and-place tasks.
Specifically, on the reach task, the robotic arm has to move its end-effector to a specified position. The agent's input (i.e., state and goal, 9 dimensions) consists of the position and speed of its end-effector (6 dimensions) and the target position (3 dimensions). The output (i.e., action) is the end-effector control command (3 dimensions, one each for movement along the x, y, and z axes).
On the push task, the robotic arm tries to push an object to a target position on the table surface. The robotic agent's input (21 dimensions) consists of its end-effector's position and speed (6 dimensions); the object's position, orientation, and linear and rotational speed (12 dimensions); and the target position (3 dimensions). The robotic agent's output is the end-effector control command (3 dimensions, one each for movement along the x, y, and z axes).
Moreover, on the pick-and-place task, the robotic arm attempts to pick up a cube and place it at a target position above the table. The robotic agent's input (22 dimensions) consists of its end-effector's position and speed (6 dimensions); the object's position, orientation, and linear and rotational speed (12 dimensions); the target position (3 dimensions); and the opening of the end-effector (1 dimension, the distance between the fingers). The robotic agent's output (4 dimensions) is the end-effector control command (3 dimensions, one each for movement along the x, y, and z axes) and the finger control command (1 dimension, for movement of the fingers). The sparse reward setting is leveraged for all three robotic control tasks, i.e., a reward of 0 is received when the object to move is at the target location and −1 otherwise.
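For reference, a minimal interaction sketch is shown below, assuming the panda-gym package that backs the environment of [37] (the v1 environment IDs and the goal-conditioned dict observation keys are panda-gym conventions, not specified in this article):

```python
import gym
import panda_gym  # registers PandaReach-v1, PandaPush-v1, PandaPickAndPlace-v1

env = gym.make("PandaPickAndPlace-v1")
obs = env.reset()
state, goal = obs["observation"], obs["desired_goal"]
for t in range(50):                     # one episode = 50 time steps
    action = env.action_space.sample()  # placeholder for pi(s (+) g)
    obs, reward, done, info = env.step(action)
    # sparse reward: 0 when the object is at the target, -1 otherwise
env.close()
```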

B. Performance Evaluation

1) Baseline:
To evaluate the proposed scheme, we implement comparisons with state-of-the-art off-policy RL algorithms, including twin delayed deep deterministic policy gradient (TD3) [38] and soft actor-critic (SAC) [42]. Additionally, since robust deep deterministic policy gradient (RDDPG) [32] is one of the state-of-the-art robust RL algorithms, it is also employed to test the proposed solution. However, it is intractable for traditional RL methods to attain the desired performance in sparse reward environments. Hence, by combining HER [26] with the TD3, SAC, and RDDPG algorithms, three baselines are implemented to benchmark the proposed RGCRL technique, denoted TD3-HER, SAC-HER, and RDDPG-HER, respectively. The main hyperparameters of the baseline algorithms are given in Table I.
2) Model Training Performance: We perform five different runs of each method with different random seeds. On the reach, push, and pick-and-place tasks, the robotic agents are trained for 10, 200, and 800 epochs, respectively. One epoch includes 50 episodes, and one episode contains 50 time steps. Fig. 3 shows the performance of each agent during training across the three robotic control tasks; the solid curve indicates the mean, and the shaded region represents the standard deviation. The results indicate that, overall, the proposed RGCRL scheme performs comparably to the baselines on the reach and push tasks and surpasses them on the pick-and-place task by a large margin, both in terms of final performance and learning speed. For instance, on the reach and push tasks, the final success rates of the TD3-HER, SAC-HER, RDDPG-HER, and RGCRL agents are all 100%. Additionally, on the pick-and-place task, the final success rates of the TD3-HER, SAC-HER, RDDPG-HER, and RGCRL agents are about 50%, 40%, 40%, and 100%, respectively.
3) Model Testing Performance: As depicted in Fig. 4, in contrast to the training phase of the policy model, in model testing the robotic agent receives the state s̃_t perturbed by the mixed adversarial attacks instead of the clean state s_t.
We assess the final neural network models trained via each algorithm under five different random seeds. Each model is tested for 100 episodes on the three tasks. Here, as in model training, one episode includes 50 time steps. Equation (4) is adopted to measure the policy robustness against the adversarial attacks on observations: an agent's policy with a smaller value of (4) shows stronger robustness against observation perturbations.
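A minimal sketch of this measurement follows, assuming an `attack` routine that returns the perturbation pair (Δ_m, Δ_a) as in the sketches of Section III:

```python
import numpy as np

def robustness_score(policy, attack, episodes):
    """Average the policy variation (4) between clean and attacked
    observations over the test episodes; smaller means more robust."""
    scores = []
    for ep in episodes:                      # each ep: list of (state, goal)
        for s, g in ep:
            d_m, d_a = attack(s, g, policy)
            s_tilde = d_m * s + d_a          # perturbed observation
            scores.append(np.linalg.norm(policy(s_tilde, g) - policy(s, g)))
    return float(np.mean(scores))
```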
Figs. 5 and 6 show the performance and robustness of the TD3-HER, SAC-HER, RDDPG-HER, and RGCRL agents during testing across the three robotic control tasks under the mixed adversarial attacks on observations. The RGCRL agent clearly outperforms the TD3-HER, SAC-HER, and RDDPG-HER agents in terms of success rate and policy robustness on all three tasks.
Quantitatively, we report the average metrics over 100 episodes in Table II for each agent on the three robotic control tasks under the mixed adversarial attacks on observations. In each row of Table II, the bolded number represents the best result. For example, on the reach task, in comparison with the TD3-HER, SAC-HER, and RDDPG-HER agents, the RGCRL agent gains approximately 0.503%, 2.041%, and 0.000% improvements, respectively, in success rate. Compared with the TD3-HER, SAC-HER, and RDDPG-HER agents, the policy robustness of the RGCRL agent is enhanced by approximately 96.842%, 98.413%, and 96.000%, respectively.
On the push task, in contrast to the TD3-HER, SAC-HER, and RDDPG-HER agents, the RGCRL agent enhances the success rate by approximately 17.160%, 85.047%, and 18.563%, respectively. In comparison with the TD3-HER, SAC-HER, and RDDPG-HER agents, the policy robustness of the RGCRL agent is improved by approximately 97.753%, 98.667%, and 97.283%, respectively.
On the pick-and-place task, compared with the TD3-HER, SAC-HER, and RDDPG-HER agents, the RGCRL agent gains approximately 0.503%, 2.041%, and 0.000% improvements, respectively, in terms of success rate. Compared with the TD3-HER, SAC-HER, and RDDPG-HER agents, the RGCRL agent's policy robustness is improved by approximately 551.724%, 329.546%, and 339.535%, respectively.
Taken as a whole, the proposed RGCRL approach surpasses the baseline schemes by a large margin in terms of both success rate and policy robustness, especially on the complicated tasks. Furthermore, our method performs consistently across all three tasks, which means the RGCRL algorithm can also provide more stable performance than the baselines.

V. CONCLUSION
This article proposes a novel RGCRL scheme for end-to-end robotic control in adversarial and sparse reward environments. First, a mixed adversarial attack method is advanced to generate diverse adversarial perturbations on observations by combining white-box and black-box attacks. Second, an HER technique considering observation perturbations is developed to turn a failed experience into a successful one and generate the policy trajectories perturbed by the mixed adversarial attacks. Third, an RGCAC method is introduced to learn goal-conditioned policies and keep the variations of the perturbed policy trajectories within bounds.
Evaluation of the policy models is carried out on the three robotic control tasks with adversarial attacks and sparse reward settings. The results demonstrate that the RGCRL approach enables the agent to learn efficiently from sparse rewards. Additionally, compared to the three baselines, the RGCRL agent achieves superior success rate and policy robustness under the mixed adversarial attacks.
While we have demonstrated the potential of the proposed RGCRL technique, some limitations remain. Consequently, future work involves evaluating our approach in more scenarios. In addition, the proposed algorithm will be applied to end-to-end robotic control tasks in the real world.

Fig. 1. Illustration of the proposed RGCRL framework for robotic control in adversarial and sparse reward environments.

Fig. 3. Training curves obtained via the TD3-HER, SAC-HER, RDDPG-HER, and RGCRL approaches on the three robotic control tasks with sparse rewards.

Fig. 4. Illustration of the model testing scheme. In model testing, the robotic agent observes the state s̃ perturbed by the mixed adversarial attacks rather than the clean state s.

Fig. 5. Success rate of different robotic agents under the mixed adversarial attacks on observations. (a) Reach task. (b) Push task. (c) Pick-and-place task.

Fig. 6. Robustness of different robotic agents on the three robotic control tasks under the mixed adversarial attacks on observations.

TABLE II
EVALUATION OF DIFFERENT ROBOTIC AGENTS ON THE THREE ROBOTIC CONTROL TASKS UNDER THE MIXED ADVERSARIAL ATTACKS ON OBSERVATIONS