Risk-Aware Multi-Armed Bandits With Refined Upper Confidence Bounds

The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma in sequential decision making and always treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if its variance is high. Hence, the variation of the reward should be taken into account to make the arm-selection process risk-aware. In this letter, the mean-variance metric is adopted to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem in which the rewards follow Gaussian distributions, and develop a concentration inequality on the variance to design a Gaussian risk-aware upper confidence bound algorithm. Furthermore, we extend this algorithm to a novel asymptotic risk-aware upper confidence bound algorithm by developing an upper confidence bound of the variance based on the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve $\mathcal{O}(\log(T))$ regret. Finally, numerical results demonstrate that our algorithms outperform several risk-aware MAB algorithms.


I. INTRODUCTION
The multi-armed bandit (MAB) models the online sequential decision-making problem in various applications including financial portfolio design, online recommendation, and crowdsourcing systems. In a standard MAB, the random reward of an arm can be observed once it is played by an agent, and the objective is to maximize the cumulative reward over a certain number of plays. The main challenge is how to balance the tradeoff between exploiting the existing knowledge of arms to obtain a considerable cumulative reward and further exploring arms to gain more information for potentially higher revenue. The standard MAB always treats the arm with the highest expected reward as the optimal choice. However, in many applications, not only the expected rewards of arms but also the uncertainty of the rewards imposed by their variations is important. For instance, in clinical trials, instead of choosing a treatment which achieves the best average therapeutic result but may occasionally lead to unacceptably poor outcomes, a treatment that works consistently well for every patient is more reliable and hence desirable. Therefore, for such applications, the tradeoff between the expected rewards and the variances of arms should be considered in a risk-aware MAB framework.
One of the mainstream measures for balancing the tradeoff between maximizing the expected reward and minimizing the uncertainty of the reward is the mean-variance (MV) [1]. This paradigm considers a linear combination of the mean and the variance of the reward when determining the optimal arm. In [2], an algorithm is designed based on the MV paradigm to minimize the learning regret by deriving a lower confidence bound of the MV. However, the proposed MV lower confidence bound (MV-LCB) algorithm achieves an $\mathcal{O}(\log^2(T))$ learning regret, which is worse than that of the classical risk-neutral MAB algorithms. In [3]-[5], finer analyses of the theoretical performance of MV-LCB are presented, and a new definition of the learning regret is derived for the MV metric. By extending MV-LCB, a new algorithm is designed to reach an $\mathcal{O}(\log(T))$ regret. However, the regret bound only holds for a limited class of reward distributions. Another way to measure the risk is to use the conditional value at risk (CVaR) [6]-[9]. In [6], it is proved that the learning regret of the proposed CVaR-based scheme is $\mathcal{O}(\log(T^2))$.
In this letter, we study the risk-aware MAB problem based on the MV paradigm and propose two novel risk-aware MAB algorithms. The key difference of the proposed algorithms from the existing ones (which utilize the Hoeffding inequality to derive the confidence bound of the reward variance) is that a finer confidence bound is derived by determining the distribution of the sample variance. Focusing on MABs with continuous reward distributions, we first build a finer upper confidence bound (UCB) of the MV under a Gaussian reward assumption and design a Gaussian risk-aware upper confidence bound (GRA-UCB) algorithm to solve the risk-aware MAB. We prove that GRA-UCB reaches an $\mathcal{O}(\log(T))$ regret. Next, utilizing the asymptotic distribution of the empirical variance and extending the GRA-UCB algorithm, a novel asymptotic risk-aware upper confidence bound (ARA-UCB) algorithm is designed for general sub-Gaussian reward distributions and is also proved to achieve an $\mathcal{O}(\log(T))$ learning regret. Both proposed algorithms perform numerically well.

II. PROBLEM FORMULATION
Consider a risk-aware MAB problem with $K$ arms, i.e., $\mathcal{K} = \{1, 2, \ldots, K\}$. Playing each of the arms yields a continuous random reward sampled from an independent distribution. In risk-aware MABs, the risk of receiving a very low reward is considered, and the agent prefers playing the arm with a higher mean and lower uncertainty. To measure the risk, in [2], the MV of an arm $i$ is defined as
$$\text{MV}_i = \sigma_i^2 - \rho\mu_i, \qquad (1)$$
where $\mu_i$ and $\sigma_i^2$ are the mean and the variance of the reward of arm $i$, respectively. In (1), $\rho \ge 0$ is a risk-tolerance factor introduced to balance the reward-risk tradeoff. As $\rho \to \infty$, the risk-aware MAB problem degenerates to a risk-neutral MAB, and when $\rho = 0$, the problem aims to find the arm with the lowest risk. Defining the arm played at round $t$ under the arm-selection policy $\pi$ as $\pi(t)$, the observed reward is denoted by $r_{\pi(t)}(t)$. After playing arms for $T$ rounds, the cumulative MV under the policy $\pi$ can be calculated as
$$\eta_\pi(T) = \frac{1}{T}\sum_{t=1}^{T}\bigl(r_{\pi(t)}(t) - \hat{\mu}_\pi(T)\bigr)^2 - \rho\,\hat{\mu}_\pi(T), \qquad (2)$$
where $\hat{\mu}_\pi(T) = \frac{1}{T}\sum_{t=1}^{T} r_{\pi(t)}(t)$. The objective of the risk-aware MAB problem is to minimize $\eta_\pi(T)$ under the given risk-tolerance factor $\rho$. In order to evaluate an arm-selection policy, the optimal policy $\pi^*$, which has full knowledge of the arms, is used as the benchmark, and the performance gap between the optimal policy and a proposed policy $\pi$ is referred to as the learning regret, defined as
$$\text{Reg}_\pi(T) = \eta_\pi(T) - \eta_{\pi^*}(T). \qquad (3)$$
In risk-aware MABs, as proved in [5], simply playing the arm with the lowest MV may not be the optimal policy but is only a proxy of the optimal policy, which is in general intractable.
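As a concrete illustration of the criterion in (1), the sketch below (with hypothetical arm parameters, not taken from the letter) shows how a high-mean, high-variance arm loses to a slightly lower-mean, low-variance arm once $\rho$ weights the mean against the variance:

```python
# Illustration of the mean-variance (MV) criterion MV_i = sigma_i^2 - rho * mu_i.
# Arm parameters below are hypothetical; lower MV is better.
def mean_variance(mu, var, rho):
    """MV of an arm: variance minus rho times the mean."""
    return var - rho * mu

arms = [(5.0, 9.0), (4.5, 1.0), (2.0, 0.5)]  # (mu_i, sigma_i^2), hypothetical
rho = 1.0
mvs = [mean_variance(mu, var, rho) for mu, var in arms]
best = min(range(len(arms)), key=lambda i: mvs[i])
# best == 1: the (4.5, 1.0) arm wins, despite arm 0 having the highest mean
```

With $\rho = 0$ the same code would select arm 2 (smallest variance), and for large $\rho$ it would select arm 0 (largest mean), matching the two limiting regimes described above.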

III. PROPOSED ALGORITHM
Let us first define some notation used in the proposed algorithms. Denoting by $r_i(t)$ the reward received by playing arm $i$ at round $t$, the empirical mean $\hat{\mu}_i(t)$ and the empirical variance $s_i^2(t)$ are calculated as
$$\hat{\mu}_i(t) = \frac{1}{\tau_i(t)}\sum_{d=1}^{\tau_i(t)} r_i(t_i(d)), \qquad (4)$$
$$s_i^2(t) = \frac{1}{\tau_i(t)-1}\sum_{d=1}^{\tau_i(t)} \bigl(r_i(t_i(d)) - \hat{\mu}_i(t)\bigr)^2, \qquad (5)$$
where $t_i(d)$ represents the round at which the $d$-th reward of arm $i$ is observed by the agent and $\tau_i(t)$ denotes the number of times arm $i$ has been played up to round $t$. Accordingly, the empirical MV can be calculated as $\hat{\eta}_i(t) = s_i^2(t) - \rho\hat{\mu}_i(t)$. We design two risk-aware bandit algorithms, namely GRA-UCB and ARA-UCB, depending on whether knowledge of the reward distribution is available. Both algorithms are index-based policies that assign an index to each arm. The indexes estimate the UCB of the MV based on the historical observations, and the arm with the lowest index is played. The key difference is that GRA-UCB assumes the reward follows a Gaussian distribution, so the UCB is derived specifically for this case, whereas ARA-UCB derives an asymptotic UCB of the variance in the absence of knowledge of the reward distribution.
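The per-arm empirical mean, variance, and MV can be maintained incrementally rather than recomputed from scratch each round; the sketch below uses Welford's online update (an implementation choice of ours, not prescribed by the letter) and assumes the unbiased $\tau_i(t)-1$ normalization for the variance:

```python
# Minimal sketch of per-arm empirical statistics, maintained online with
# Welford's update. Class and method names are illustrative.
class ArmStats:
    def __init__(self):
        self.n = 0        # tau_i(t): number of plays of this arm
        self.mean = 0.0   # empirical mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, reward):
        """Incorporate one observed reward r_i(t)."""
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def var(self):
        """Unbiased empirical variance s_i^2(t)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def empirical_mv(self, rho):
        """Empirical mean-variance s_i^2(t) - rho * mu_hat_i(t)."""
        return self.var() - rho * self.mean
```

Both proposed algorithms only need these per-arm quantities (plus, for ARA-UCB, an analogous running estimate of the fourth central moment) to compute their indexes.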

A. Gaussian Risk-Aware Upper Confidence Bound (GRA-UCB)
Suppose that the rewards of the arms follow Gaussian distributions with different means and variances. We utilize the following facts to design an arm-selection algorithm.
Fact 1 (Hoeffding inequality): Let $(X - \mu)$ be a sub-Gaussian random variable. Define the empirical mean over $n$ samples as $\hat{\mu}_n$ and $\mu = \mathbb{E}[X]$; then
$$\Pr\{\mu \ge \hat{\mu}_n + \delta\} \le e^{-2n\delta^2}. \qquad (6)$$
Fact 1 states that the probability that the empirical mean deviates from the true mean by $\delta$ after $n$ samples is bounded by $e^{-2n\delta^2}$. Setting $\delta = \sqrt{\frac{\log t}{\tau_i(t)}}$, we have
$$\Pr\left\{\mu_i \ge \hat{\mu}_i(t) + \sqrt{\frac{\log t}{\tau_i(t)}}\right\} \le t^{-2}. \qquad (7)$$
Fact 2: Let $X$ be a Gaussian random variable with variance $\sigma^2$. Define the empirical variance over $n$ samples as $s_n^2$; based on [10], we have
$$\Pr\left\{\sigma^2 \le \frac{(n-1)s_n^2}{\chi^2_{1-\alpha,n-1}}\right\} = 1 - \alpha, \qquad (8)$$
where $\chi^2_{1-\alpha,n-1}$ satisfies $\Pr\{Y \ge \chi^2_{1-\alpha,n-1}\} = 1-\alpha$ for $Y \sim \chi^2_{n-1}$, i.e., it is the lower $100\alpha$ percentage point of the chi-square distribution with $(n-1)$ degrees of freedom.
Fact 2 gives the confidence interval of the variance of a random variable when the variable follows a Gaussian distribution. It implies that with probability $100(1-\alpha)\%$, the confidence interval constructed from the sample variance contains the true value of $\sigma^2$.
In the GRA-UCB algorithm, each arm is assigned an index which is the UCB of its MV. Based on Facts 1 and 2, the index of arm $i$ is defined as
$$B_i(t) = \frac{(\tau_i(t)-1)\,s_i^2(t)}{\chi^2_{1-\alpha,\tau_i(t)-1}} - \rho\left(\hat{\mu}_i(t) + \sqrt{\frac{\log t}{\tau_i(t)}}\right), \qquad (9)$$
where the first and second terms represent the UCBs of the variance and the mean based on (8) and (7), respectively. In each round, after calculating the indexes, the arm with the lowest $B_i(t)$ is played and its number of plays is updated as $\tau_i(t+1) = \tau_i(t) + 1$. Subsequently, the empirical mean and variance are updated based on (4) and (5).
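The GRA-UCB index can be sketched as follows. For an exact chi-square quantile one would use `scipy.stats.chi2.ppf`; to stay standard-library-only, the sketch approximates it with the Wilson-Hilferty transform, and all numeric inputs are illustrative. Under the convention of Fact 2, $\chi^2_{1-\alpha,n-1}$ is the lower $100\alpha$ percentage point (the $\alpha$-quantile), so the variance UCB is inflated when an arm has few samples:

```python
# Sketch of the GRA-UCB index: chi-square UCB of the variance minus
# rho times the UCB of the mean. The arm with the LOWEST index is played.
# chi2_quantile uses the Wilson-Hilferty approximation (a sketch choice;
# scipy.stats.chi2.ppf would be exact).
import math
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation of the chi-square p-quantile."""
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))) ** 3

def gra_ucb_index(mean, var, n, t, rho, alpha):
    """B_i(t) for an arm with empirical mean/variance over n plays at round t."""
    # chi^2_{1-alpha, n-1} in Fact 2 is the alpha-quantile of chi^2_{n-1}
    var_ucb = (n - 1) * var / chi2_quantile(alpha, n - 1)
    mean_ucb = mean + math.sqrt(math.log(t) / n)
    return var_ucb - rho * mean_ucb
```

For example, with $\alpha = 0.05$ and ten samples with $s^2 = 2$, the variance term evaluates to roughly $9 \cdot 2 / \chi^2_{0.05,9} \approx 5.4$, noticeably above the point estimate, and it tightens as $\tau_i(t)$ grows.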

B. Asymptotic Risk-Aware Upper Confidence Bound (ARA-UCB)
GRA-UCB is designed under the assumption that the rewards follow independent Gaussian distributions. However, when the reward distribution is unknown, the confidence interval presented in (8) no longer applies. Following the idea of GRA-UCB, in this section we utilize the asymptotic distribution of the empirical variance to derive a confidence interval without any prior assumption on the reward distribution.
Fact 3: Let $X$ be a continuous random variable with mean $\mu$, variance $\sigma^2$, and $\mu_4 = \mathbb{E}[(X-\mu)^4]$. According to [11], the asymptotic distribution of the empirical variance is
$$\sqrt{n}\,(s_n^2 - \sigma^2) \xrightarrow{d} \mathcal{N}(0, \mu_4 - \sigma^4). \qquad (10)$$
Based on Fact 3, in the following lemma, we develop an asymptotic UCB on the variance.
Lemma 1: Applying Fact 3, for a sufficiently large $n$, an asymptotic confidence interval of $\sigma^2$ can be derived as
$$\Pr\{\sigma^2 \le v_n^u\} \ge 1 - \alpha, \qquad (11)$$
where $v_n^u = s_n^2 + z_{1-\alpha}\sqrt{\frac{\mu_4 - s_n^4}{n}}$, $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard Gaussian distribution, and $s_n^4$ is the empirical estimate of $\sigma^4$.

Proof: The distribution presented in Fact 3 can be reformulated as a standard Gaussian distribution as
$$\frac{\sqrt{n}\,(s_n^2 - \sigma^2)}{\sqrt{\mu_4 - \sigma^4}} \xrightarrow{d} \mathcal{N}(0, 1). \qquad (12)$$
Therefore, we have
$$\Pr\left\{\frac{\sqrt{n}\,(s_n^2 - \sigma^2)}{\sqrt{\mu_4 - \sigma^4}} \ge -z_{1-\alpha}\right\} \to 1 - \alpha. \qquad (13)$$
Consequently, estimating $\sigma^4$ by its empirical counterpart $s_n^4$, the one-sided $(1-\alpha)$ asymptotic confidence intervals of the variance $\sigma^2$ are established as
$$\Pr\{\sigma^2 \le v_n^u\} \to 1 - \alpha, \qquad \Pr\{\sigma^2 \ge v_n^l\} \to 1 - \alpha, \qquad (14)$$
where $v_n^l = s_n^2 - z_{1-\alpha}\sqrt{\frac{\mu_4 - s_n^4}{n}}$. From (14), we obviously have $\Pr\{\sigma^2 \ge v_n^u\} \le \alpha$, which completes the proof.
Since $\mu_4$ is unknown, an estimate of $\mu_4$ is required. We substitute the empirical fourth central moment $\hat{\mu}_4$ into the asymptotic UCB provided in Lemma 1 to define a refined bound of the MV compared with the GRA-UCB policy. Therefore, an alternative index is assigned to each arm to determine the arm selection. The index of arm $i$ is defined as
$$D_i(t) = s_i^2(t) + z_{1-\alpha}\sqrt{\frac{\hat{\mu}_{4,i}(t) - s_i^4(t)}{\tau_i(t)}} - \rho\left(\hat{\mu}_i(t) + \sqrt{\frac{\log t}{\tau_i(t)}}\right), \qquad (15)$$
where $\hat{\mu}_{4,i}(t)$ is the empirical estimate of $\mu_4$ for arm $i$. Similar to the GRA-UCB algorithm, the agent plays the arm with the smallest index.
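The ARA-UCB index in (15) can be computed directly from the observed rewards; the sketch below is an illustrative, standard-library-only rendering, with a small non-negativity guard on the estimated asymptotic variance for tiny sample sizes (a choice of this sketch, not part of the letter):

```python
# Sketch of the ARA-UCB index: asymptotic Gaussian UCB of the variance
# (Lemma 1 with the empirical fourth moment substituted for mu_4)
# minus rho times the UCB of the mean. Inputs are illustrative.
import math
from statistics import NormalDist

def ara_ucb_index(rewards, t, rho, alpha):
    """D_i(t) for an arm with observed reward list `rewards` at round t."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / (n - 1)   # s_i^2(t)
    mu4_hat = sum((r - mean) ** 4 for r in rewards) / n     # empirical mu_4
    z = NormalDist().inv_cdf(1.0 - alpha)                   # z_{1-alpha}
    spread = max(mu4_hat - var ** 2, 0.0)                   # guard for small n
    var_ucb = var + z * math.sqrt(spread / n)               # v^u from Lemma 1
    mean_ucb = mean + math.sqrt(math.log(t) / n)
    return var_ucb - rho * mean_ucb
```

Unlike the GRA-UCB index, no distributional quantile table is required: only the first, second, and fourth empirical moments of each arm enter the computation.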

IV. THEORETICAL ANALYSIS
In this section, we present the regret analysis of the proposed algorithms. We provide a complete analysis of the learning regret of the proposed GRA-UCB algorithm. For the ARA-UCB algorithm, Fact 2 is replaced with Lemma 1 to derive the learning-regret bound; the analysis is similar to the GRA-UCB case and the theoretical bound still holds. In Section II, we mentioned that the optimal arm-selection policy is intractable in risk-aware bandits. Therefore, we measure the learning regret with respect to an approximated policy $\tilde{\pi}^*$ which keeps playing the arm with the smallest MV.

A. Learning Regret of GRA-UCB
Lemma 2 (Lemma 2 of [5]): The learning regret of a policy $\pi$ with respect to the approximated policy $\tilde{\pi}^*$ is bounded as

Lemma 3 (Theorem 1 of [5]): With $\sigma^2_{\max} = \max_{i\in\mathcal{K}} \sigma_i^2$, the learning regret difference can be bounded as

Lemma 4: The expected number of plays of any suboptimal arm $i \ne i^*$ in GRA-UCB can be upper bounded by

Proof: First, we upper bound $\tau_i(T)$ as

where $I_t$ represents the index of the arm played at round $t$, $\mathbb{1}\{A\}$ is an indicator function which equals one if $A$ is true and zero otherwise, and $l_i$ is an arbitrary positive integer. Thus, the last term of (20) will always be positive since $\Delta_{i,i^*} - 2\rho c_t \ge 0$. Consequently, to ensure $B_i(t) - B_{i^*}(t) \le 0$, the first term of (20) needs to be negative and the second term of (20) needs to be positive. Therefore, continuing from (19) and (20), we have

Applying Facts 1 and 2, $\mathbb{E}[\tau_i(T)]$ can be written as

which completes the proof.

Theorem 1: Combining Lemmas 2, 3, and 4, the learning regret of the GRA-UCB algorithm can be upper bounded as

B. Learning Regret of ARA-UCB
In ARA-UCB, similar to the analysis of GRA-UCB, we need to bound the number of plays of suboptimal arms. The following lemma provides an upper bound on $\mathbb{E}[\tau_i(T)]$.
Lemma 5: The expected number of plays of any suboptimal arm $i \ne i^*$ in ARA-UCB for sub-Gaussian rewards can be upper bounded by

Proof: According to (15) and (19), we have

Hence, the last term of (26) will always be positive. To ensure $D_i(t) - D_{i^*}(t) \le 0$, we need the first term of (26) to be negative and the second term of (26) to be positive. Consequently, continuing from (26), we have

According to Lemma 1, we can conclude that

where $\beta > 1$ is a scalar. Given a sufficiently large number of samples $L_i$ of arm $i$, Lemma 1 holds with $\beta = 1$. However, when $\tau_i(t) < L_i$, which means $\tau_i(t)$ is not sufficiently large, the actual confidence bound in (14) is smaller than the desired bound. Since the reward is sub-Gaussian, $\beta$ cannot go to infinity and decreases as more samples are collected. Applying Fact 1 and (28), and setting $\alpha = Ct^{-1}$ with $C > 0$, the expected value of $\tau_i(T)$ in (27) can be written as

which completes the proof.

Theorem 2: Combining Lemmas 2, 3, and 5, the learning regret of the ARA-UCB algorithm for sub-Gaussian rewards can be upper bounded as

V. NUMERICAL RESULTS
We simulate the proposed algorithms and compare them with several benchmarks. Two reward distributions are considered: a Gaussian distribution and a truncated Gamma distribution with range [0, 10]. Since different reward distributions with different parameters are considered, different risk-tolerance factors ($\rho = 1$ and $\rho = 2.5$, respectively) are used to ensure the optimal arm is risk-sensitive with a relatively high mean and low variance. Moreover, the number of arms in both experiments is set to $K = 10$. To evaluate the performance of GRA-UCB and ARA-UCB against existing schemes, we simulate three algorithms, namely MV-LCB [2], MV-UCB [5], and CVaR [6]. Finally, we calculate the learning regret by running each simulation 100 times and averaging the results.
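A minimal harness for this kind of experiment is sketched below; the arm parameters, horizon, and the simple greedy-on-empirical-MV baseline are illustrative stand-ins, not the letter's exact experimental setup.

```python
# Minimal simulation harness for a risk-aware Gaussian bandit.
# All parameters and the greedy baseline are illustrative.
import random

def run_episode(arms, rho, horizon, seed=0):
    """arms: list of (mu, sigma) pairs; returns per-arm play counts."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    sqsums = [0.0] * k

    def empirical_mv(i):
        m = sums[i] / counts[i]
        v = max(sqsums[i] / counts[i] - m * m, 0.0)  # biased variance; fine for a sketch
        return v - rho * m

    for t in range(horizon):
        # play each arm once first, then act greedily on the empirical MV
        i = t if t < k else min(range(k), key=empirical_mv)
        mu, sigma = arms[i]
        r = rng.gauss(mu, sigma)
        counts[i] += 1
        sums[i] += r
        sqsums[i] += r * r
    return counts

arms = [(5.0, 3.0), (4.5, 1.0), (2.0, 0.7)]  # (mu_i, sigma_i), hypothetical
counts = run_episode(arms, rho=1.0, horizon=2000)
```

With $\rho = 1$, arm 1 (mean 4.5, variance 1) has the smallest MV, so a good risk-aware policy should concentrate plays on it; replacing the greedy rule with the GRA-UCB or ARA-UCB index yields the evaluated algorithms, and averaging the per-round MV gap over repeated seeds yields the regret curves.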
In Fig. 1, we observe that both proposed algorithms attain low learning regret, because GRA-UCB is designed for Gaussian rewards and ARA-UCB is model-agnostic. In addition, although MV-UCB performs relatively well, its performance is sensitive to its parameter $b$, which needs to be tuned for better performance, and no clear guideline is provided in the literature. Moreover, our proposed schemes outperform MV-LCB because the confidence bound used in MV-LCB is not tight enough.
In Fig. 2, benefiting from the asymptotic upper confidence bound, ARA-UCB attains the lowest learning regret. Additionally, we note that because a truncated Gamma distribution is used to generate the rewards while MV-UCB relies on an assumption of a symmetric reward distribution, the derived confidence bound of MV-UCB is inaccurate.

VI. CONCLUSION
In this letter, we studied a different type of MAB, namely the risk-aware MAB. Instead of considering only the expected reward when making decisions, the variance of each arm is also taken into account to reduce the risk of receiving low rewards.
The key novelty of this letter is building the confidence bound of the MV by identifying the distribution of the sample variance, and designing a risk-aware MAB algorithm for the case when the reward distribution is unknown. The proposed algorithms achieve logarithmic learning regret and perform numerically well compared with several benchmarks.