Globally Informative Thompson Sampling for Structured Bandit Problems with Application to CrowdTranscoding

The multi-armed bandit is a widely studied model for sequential decision-making problems. The most studied model in the literature is the stochastic bandit, wherein the reward of each arm follows an independent distribution. However, there is a wide range of applications where the rewards of different alternatives are correlated to some extent. In this paper, a class of structured bandit problems is studied in which the rewards of different arms are functions of the same unknown parameter vector. To minimize the cumulative learning regret, we propose a globally-informative Thompson sampling algorithm to learn and leverage the correlation among arms, which can handle unknown multi-dimensional parameters and non-monotonic reward functions. Our studies demonstrate that the proposed algorithm achieves a significant improvement in learning speed. In particular, the designed algorithm is used to solve an edge transcoder selection problem in crowdsourced live video streaming systems and shows superior performance compared to existing schemes.


I. Introduction
The multi-armed bandit (MAB) is a classical model for sequential decision making, which can be applied to a wide range of applications including cognitive radio networks [1], crowdsourcing systems [2], and online recommendation [3]. In an MAB problem, each machine (referred to as an arm) generates a random reward when it is played, and the agent aims to maximize the received cumulative reward by sequentially pulling arms. The challenge of solving the MAB problem is how to balance the exploration-exploitation (EE) dilemma. If the agent only exploits the learned knowledge to select a seemingly good arm, the opportunity of gaining a higher cumulative reward may be missed. However, if the agent keeps exploring every arm, arms with lower rewards are inevitably played frequently.
In the classical MAB problem, the reward of an arm is assumed to follow an independent probability distribution. Therefore, the observations from one arm cannot reveal any information about other arms. This type of MAB can be categorized as the non-informative bandit, and many algorithms have been designed to solve it, including the upper confidence bound (UCB) [4] and Thompson sampling (TS) [5]. However, this assumption does not hold in many applications. For example, in a news recommendation system, a company needs to decide which news to push to users so that the click-through rate is maximized. To identify the most appealing news for different users [6], the correlation of user preferences shall be considered, since users with similar ages or occupations may have similar preferences.
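For reference, classical (non-informative) TS maintains an independent posterior per arm. The following is a minimal illustrative sketch for Bernoulli rewards with Beta(1, 1) priors; the arm means and horizon are arbitrary choices for demonstration:

```python
import random

def bernoulli_thompson(true_means, rounds, seed=0):
    """Classic Thompson sampling baseline: one independent Beta posterior per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    wins = [1] * k    # Beta alpha parameters (prior successes + 1)
    losses = [1] * k  # Beta beta parameters (prior failures + 1)
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one sample from each arm's posterior and play the argmax.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = bernoulli_thompson([0.2, 0.5, 0.8], rounds=2000)
```

Because each posterior is updated only from its own arm's observations, a pull of one arm tells the agent nothing about the others, which is exactly the limitation the structured model below removes.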
Considering the correlation among different arms in decision-making problems, the structured bandit model [7] has been introduced. One model of structured bandits is the contextual MAB, in which the correlation of arms is modelled under the assumption that the reward distributions are conditioned on context information, and hence arms with similar context information generate close rewards. In [8], a contextual MAB was modelled under the assumption that the expected reward of an arm is a linear function of the contextual information and unknown parameters. In [9], a TS-based algorithm was proposed to solve the contextual MAB problem under the same reward function assumption. Moreover, in [10], instead of explicitly assuming the mapping from the context to the reward to be linear, Gaussian process regression was used to model the relationship between the context and the reward.
Contextual MAB relies on knowledge of the context information of the environment. However, this information might be unavailable in some problems due to privacy concerns [11]. In this paper, we study a different model of structured bandit problems, assuming the context information is hidden and the reward mappings are functions of one or multiple unknown parameters shared by all the arms. In [12], a structured bandit problem is modelled in which each group of arms shares the same unknown parameter. In [13] and [14], the unknown parameter is estimated using the knowledge of the reward functions and the empirical mean rewards. However, the algorithms proposed in [13] and [14] are limited to monotonic expected reward functions.
In this paper, we study a model that removes the restrictions imposed by state-of-the-art works on the properties of the expected reward function and the number of parameters. To solve the structured bandit problem, we first build a confidence set, i.e., a set of parameters whose expected rewards are close to the empirical mean rewards of all arms; a novel technique is then designed to estimate the true value of the unknown parameter based on the established confidence set. In addition, a novel TS-based algorithm is proposed to handle the EE dilemma. Simulation results demonstrate that the proposed globally-informative Thompson sampling (GI-TS) algorithm can solve the structured bandit problem with a noteworthy improvement in learning regret compared with the existing benchmarks.
We apply the proposed GI-TS algorithm to tackle the transcoder selection problem for crowdsourced live streaming (CLS) services as a case study. Recently, edge transcoding has been introduced for CLS to utilize the abundant computational resources at the user end for quality-of-experience enhancement [15]. However, optimal transcoder selection for seamless video transcoding remains a challenge due to the performance uncertainties of devices. Since edge devices with similar specifications and attributes shall offer similar transcoding performance, the transcoder selection problem can be modelled as a structured bandit problem. Numerical results confirm the effectiveness of the proposed GI-TS approach for this case study.
The rest of this paper is organized as follows. Section II describes the system model. The detailed GI-TS algorithm is presented in Section III. The simulation setup and results are provided in Section IV, followed by conclusions in Section V.

II. System Model
Consider a structured bandit problem with K arms, i.e., arm k ∈ K = {1, 2, ..., K}. At each round, the agent must select one of the K alternatives to play, and the reward of the played arm is observed. For example, in the edge transcoder selection problem, a transcoder needs to be selected at each round to perform transcoding, and each edge transcoder can be treated as an arm. Define the arm played at round t as k_t, where k_t ∈ K. The observed reward of playing arm k_t is denoted by R_{k_t}(t), a random variable following an unknown probability distribution. As a structured bandit problem, the expectation of R_{k_t}(t) is

E[R_{k_t}(t)] = μ_{k_t}(θ*),   (1)

where θ* = [θ_1, θ_2, ..., θ_N] is a fixed but unknown parameter vector that belongs to a parameter set Θ and is shared by all the arms. The expected reward function μ_k(·) of each arm is known and can be of any form. The observed rewards are assumed to be sub-Gaussian with variance σ², and the variance is assumed to be known to the agent. This assumption is common in the MAB literature ([7], [15]) and enables us to utilize Hoeffding's inequality, which is the foundation of the proposed algorithm (discussed in the next section).
The goal of the agent is to select arm k_t at each round t to maximize the total reward over any T rounds. Assume there is an oracle policy that knows the true value of θ* and always selects the optimal arm k* = argmax_{k∈K} μ_k(θ*). The cumulative reward of this policy is Σ_{t=1}^{T} μ_{k*}(θ*). If the agent misses the optimal arm at round t, a reward loss μ_{k*}(θ*) − μ_{k_t}(θ*) is incurred. Therefore, the expected cumulative learning regret is defined as

Reg(T) = E[ Σ_{t=1}^{T} ( μ_{k*}(θ*) − μ_{k_t}(θ*) ) ].   (2)

It should be noticed that minimizing the cumulative learning regret is equivalent to maximizing the cumulative reward.
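As a toy illustration of this model, the sketch below defines hypothetical reward functions that all share a scalar θ* (any known forms are allowed, including non-monotonic ones), a noisy pull, and the cumulative regret of (2) against the oracle; the functions and constants are arbitrary examples, not the paper's setting:

```python
import math
import random

# Hypothetical known reward functions; all arms share the same scalar theta*.
REWARD_FNS = [
    lambda th: 2.0 - 0.5 * th,   # arm 0: decreasing
    lambda th: math.sin(th),     # arm 1: non-monotonic (allowed by the model)
    lambda th: 0.3 * th,         # arm 2: increasing
]

def play(arm, theta_star, sigma=2.0, rng=random):
    """Observed reward: true mean plus Gaussian (hence sub-Gaussian) noise."""
    return REWARD_FNS[arm](theta_star) + rng.gauss(0.0, sigma)

def regret(arm_sequence, theta_star):
    """Cumulative learning regret of a played sequence against the oracle."""
    means = [f(theta_star) for f in REWARD_FNS]
    best = max(means)
    return sum(best - means[a] for a in arm_sequence)
```

For example, with θ* = 4.0 the oracle always plays arm 2, so a sequence of arm-2 pulls incurs zero regret while every pull of arm 0 adds a constant loss.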

III. Proposed Algorithm
In the structured bandit, the agent knows the forms of the reward functions. Therefore, if the unknown parameter vector can be accurately estimated, the optimal arm can be determined with certainty. Moreover, since all arms share θ*, no matter which arm is played, the observed reward can be used to refine the estimation of θ*. In this regard, inspired by [7], a TS-based algorithm called GI-TS is designed to exploit this feature of the setting and solve the structured bandit problem.

A. Confidence Set Construction
Define the empirical mean reward of arm k at round t as μ̂_k(t), and the number of times that arm k has been played up to round t as n_k(t). GI-TS starts estimating the unknown parameter vector by constructing a confidence set based on the following fact.

Fact 1 (Hoeffding's inequality): Let Z be a σ²-sub-Gaussian random variable with mean μ(θ*). Then

P( |μ̂ − μ(θ*)| ≥ δ ) ≤ 2 exp( −nδ² / (2σ²) ),   (3)

where μ̂ represents the empirical mean of Z and n denotes the number of samples. Choosing δ = σ√(2α ln t / n), the Hoeffding bound becomes

P( |μ̂ − μ(θ*)| ≥ σ√(2α ln t / n) ) ≤ 2 t^{−α},   (4)

which implies that a gap between the empirical mean and the true mean larger than this threshold becomes increasingly unlikely as t increases. Here α is introduced to control the decay rate of the bound.

According to Fact 1, for each arm k ∈ K, we choose the set of parameters whose predicted mean rewards are close to the empirical mean reward. This set is defined as

Θ_k(t) = { θ ∈ Θ : |μ_k(θ) − μ̂_k(t)| ≤ σ√(2α ln t / n_k(t)) }.   (5)

Since θ* is shared by all the arms in their reward functions, the constructed Θ_k(t) of all arms are utilized to estimate θ* more precisely. We define the set of parameters satisfying condition (5) for every arm as the confidence set Θ_t, calculated as the intersection of the Θ_k(t) of all arms:

Θ_t = ∩_{k∈K} Θ_k(t).   (6)
An example of confidence set construction is illustrated in Fig. 1.
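The construction of (5)-(6) can be sketched over a discretized parameter grid as follows; the reward functions, grid, and constants are illustrative assumptions (the parameter set may be continuous in general):

```python
import math

def confidence_set(grid, reward_fns, emp_means, pulls, t, sigma=2.0, alpha=3.0):
    """Keep every grid point theta whose predicted mean is within the Hoeffding
    radius sigma*sqrt(2*alpha*ln(t)/n_k) of every arm's empirical mean; the
    returned list is the intersection of the per-arm sets."""
    kept = []
    for theta in grid:
        ok = True
        for k, f in enumerate(reward_fns):
            if pulls[k] == 0:
                continue  # no observations of this arm yet: it imposes no constraint
            radius = sigma * math.sqrt(2 * alpha * math.log(t) / pulls[k])
            if abs(f(theta) - emp_means[k]) > radius:
                ok = False
                break
        if ok:
            kept.append(theta)
    return kept

# Illustrative use: two arms whose empirical means are consistent with theta = 1.
fns = [lambda th: th, lambda th: 2 * th]
grid = [i * 0.5 for i in range(11)]  # 0.0, 0.5, ..., 5.0
cs = confidence_set(grid, fns, emp_means=[1.0, 2.0], pulls=[50, 50], t=100)
```

Points far from the value consistent with the observations are excluded by at least one arm, so the intersection concentrates around the true parameter as the pull counts grow.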

B. Confidence Set Analysis
In this section, we prove that the parameter vector θ* falls into the constructed confidence set with high probability.
Theorem 1: The probability that θ* belongs to the confidence set is bounded as

P( θ* ∈ Θ_t ) ≥ 1 − 2K t^{1−α}.   (7)

Proof: Based on (5), define Θ_k^C(t) as the complement of Θ_k(t). Taking the union bound over the arms and over the possible numbers of plays m ∈ {1, ..., t}, the probability that θ* does not belong to the confidence set satisfies

P( θ* ∉ Θ_t ) ≤ Σ_{k∈K} Σ_{m=1}^{t} P( |μ̂_k(t) − μ_k(θ*)| > σ√(2α ln t / m) ) ≤ Σ_{k∈K} Σ_{m=1}^{t} 2 t^{−α} = 2K t^{1−α},

which completes the proof.
Theorem 1 indicates that, as t increases, the probability of θ* falling into the confidence set Θ_t converges to 1 at a superlinear rate when α > 2.

C. Unknown Parameter Estimation
After establishing the confidence set, we need to estimate the unknown parameter vector θ*. Since the confidence set maintains a group of parameters whose corresponding expected rewards are close to the empirical mean rewards of every arm, it can be used to estimate θ*. Moreover, the more rounds the agent plays, the higher the probability that the true value of θ* falls into this set, as presented in (7).
To utilize the knowledge of the reward functions, we do not estimate θ* by directly solving the equations μ̂_k(t) = μ_k(θ): doing so incurs high computational complexity and hardly guarantees a sufficiently accurate solution. Besides, the reward functions can be non-monotonic, and hence directly solving the equations may lead to multiple solutions.
In GI-TS, denote a parameter in the confidence set by θ_n. The estimated parameter θ̂_t is calculated as the weighted sum

θ̂_t = Σ_n w_{θ_n}(t) θ_n,   (8)

where w_{θ_n}(t) represents the weight of θ_n at round t, calculated as

w_{θ_n}(t) = GB_n(t)^{−1} / Σ_m GB_m(t)^{−1},   (9)

with GB_n(t) = Σ_{k∈K} |μ_k(θ_n) − μ̂_k(t)|. To estimate θ*, the candidates θ_n are uniformly selected within the confidence set. The logic is that, instead of simply taking the mean of the confidence set, a parameter whose predicted rewards are closer to the empirical mean rewards should have a greater impact on the estimate. Therefore, the estimation of θ* is accomplished by calculating the weighted sum of the parameters inside the confidence set.
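The weighted estimation step can be sketched as follows; the inverse-total-gap normalization below is an assumed concrete choice consistent with the stated logic (candidates closer to the empirical means receive larger weights):

```python
def estimate_theta(conf_set, reward_fns, emp_means, eps=1e-9):
    """Weighted average of candidate parameters: each candidate's weight is the
    inverse of its total gap to the empirical mean rewards, normalized to sum
    to one. eps avoids division by zero for a perfect match."""
    gaps = []
    for theta in conf_set:
        gap = sum(abs(f(theta) - mu) for f, mu in zip(reward_fns, emp_means))
        gaps.append(gap)
    inv = [1.0 / (g + eps) for g in gaps]
    z = sum(inv)
    weights = [w / z for w in inv]
    return sum(w * theta for w, theta in zip(weights, conf_set))

# Illustrative use: the candidate matching the empirical mean dominates the estimate.
fns = [lambda th: th]
est = estimate_theta([0.9, 1.0, 1.1], fns, [1.0])
```

A perfectly matching candidate gets an almost-infinite (then normalized) weight, so the estimate collapses onto it; otherwise the estimate interpolates between near-consistent candidates.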

D. Arm Selection
After calculating θ̂_t, we need to utilize it for arm selection to minimize the learning regret. However, we still need to balance the tradeoff between exploring the reward of each arm and exploiting the learned knowledge to play the best arm, because the empirical mean rewards can be inaccurate, especially at the early stage. Therefore, a TS-based scheme, namely GI-TS, is designed to handle the EE dilemma.
Consider (5) and (6). If θ* ∈ Θ_t and θ̂_t ∈ Θ_t, then for each arm the triangle inequality gives |μ_k(θ*) − μ_k(θ̂_t)| ≤ 2σ√(2α ln t / n_k(t)). Since the confidence set is the intersection of the Θ_k(t) of different arms, the worst case occurs when n_k(t) = t/K, which leads to the largest possible gap between μ_k(θ*) and μ_k(θ̂_t). Therefore, in the worst case, the gap between μ_k(θ̂_t) and μ_k(θ*) is upper bounded as

|μ_k(θ̂_t) − μ_k(θ*)| ≤ 2σ√(2αK ln t / t),   (12)

which means the estimated reward μ_k(θ̂_t) will converge to the real reward as t increases.
In the absence of prior and likelihood knowledge, we take the posterior distribution of arm k's reward to be Gaussian. According to the previous discussion, we choose the estimated reward μ_k(θ̂_t) as the mean of the distribution, since it converges to the real mean as more samples are observed. The variance of the posterior distribution is set to the square of the worst-case bound in (12), i.e., s_t² = 8ασ²K ln t / t, given the fact that any play is informative for all arms and hence helps to refine the estimation of θ*. Therefore, we generate an independent sample R̃_k(t) from N( μ_k(θ̂_t), s_t² ) for each arm. The agent then plays the arm with the highest R̃_k(t).
The pseudocode of GI-TS is presented in Algorithm 1.
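As a concrete illustration, the full GI-TS loop can be sketched as follows over a discretized parameter grid, assembling the steps of Sections III-A to III-D. The empty-set fallback, the inverse-gap weight normalization, and the exact posterior-variance constant are illustrative assumptions consistent with the text, not the authors' exact pseudocode:

```python
import math
import random

def gi_ts(reward_fns, theta_star, grid, rounds, sigma=2.0, alpha=3.0,
          init_rounds=10, seed=0):
    """One run of a GI-TS sketch; returns the sequence of played arms."""
    rng = random.Random(seed)
    K = len(reward_fns)
    sums = [0.0] * K
    pulls = [0] * K
    played = []

    def pull(k):
        r = reward_fns[k](theta_star) + rng.gauss(0.0, sigma)
        sums[k] += r
        pulls[k] += 1
        played.append(k)

    # Initialization: play every arm a few times in round-robin order.
    for t in range(init_rounds):
        pull(t % K)

    for t in range(init_rounds, rounds):
        emp = [sums[k] / pulls[k] for k in range(K)]
        # Confidence set: grid points consistent with every arm's empirical mean.
        conf = []
        for th in grid:
            if all(abs(f(th) - emp[k]) <=
                   sigma * math.sqrt(2 * alpha * math.log(t) / pulls[k])
                   for k, f in enumerate(reward_fns)):
                conf.append(th)
        if not conf:
            conf = list(grid)  # fallback if the intersection is empty (assumption)
        # Inverse-gap weighted estimate of theta*.
        inv = [1.0 / (sum(abs(f(th) - emp[k]) for k, f in enumerate(reward_fns))
                      + 1e-9)
               for th in conf]
        z = sum(inv)
        theta_hat = sum(w / z * th for w, th in zip(inv, conf))
        # Posterior sampling: mean from theta_hat, variance shrinking with t.
        var = 8 * alpha * sigma ** 2 * K * math.log(t) / t
        samples = [f(theta_hat) + rng.gauss(0.0, math.sqrt(var))
                   for f in reward_fns]
        pull(max(range(K), key=lambda k: samples[k]))
    return played

# Illustrative run: three arms sharing a scalar theta*; arm 0 is optimal at 1.5.
fns = [lambda th: th, lambda th: 2.0 - th, lambda th: 1.0]
grid = [i * 0.1 for i in range(31)]
played = gi_ts(fns, theta_star=1.5, grid=grid, rounds=3000, sigma=1.0)
```

Note that the posterior variance depends on the total round count t rather than the per-arm pull count, reflecting the fact that every observation refines the shared estimate of θ*.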

IV. Simulation Results

A. Simulation Setup
In this section, we evaluate the performance of GI-TS. We first generate a range of different reward functions to demonstrate the robustness and performance of the proposed algorithm, and three scenarios are set up by changing the number of arms and the length of the unknown parameter vector. The settings are depicted in Table I. In addition, we apply GI-TS to solve a transcoder selection problem in CLS systems, which can be modelled as a structured bandit problem.
In the simulations, α is set to 3. We choose the variance of each arm in all simulations as σ² = 4, and R_k is drawn from N(μ_k(θ*), 4). The number of initialization rounds for arm selection is set to t_h = 10.

TABLE I
Simulation Scenarios

                 Number of arms   Length of θ*
1st scenario     3                1
2nd scenario     10               1
3rd scenario     3                2

B. Benchmarks
Several existing algorithms are simulated as benchmarks. First, the classical UCB and TS schemes are simulated. In addition, we simulate the UCBc and TSc schemes, which are specially designed for the structured bandit problem [15]. In these two schemes, a confidence set of parameters is first built with the knowledge of the reward functions, and some arms are identified as competitive arms with the help of this set. After that, either UCB or TS is applied to select one of the competitive arms only.

C. Simulation Results

1) 1st scenario: In Fig. 2, the reward functions of the arms versus θ are presented. Arm 1 is optimal when θ* ∈ [2.5, 5], arm 2 is optimal when θ* ∈ [0, 1], and arm 3 is optimal when θ* ∈ [1, 2.5]. With this setting, we test the proposed algorithm and the benchmarks with varying θ*. In Fig. 3, it is apparent that GI-TS and TSc outperform the classical schemes by exploiting the correlation among the arms, and hence the learning regret is greatly reduced: no matter which arm is played, the observed reward helps to refine the confidence set, and thus the estimate of θ*, in every round, improving the learning speed. In addition, our proposed GI-TS algorithm reaches the lowest regret among all the benchmarks, because GI-TS directly estimates the value of θ* to find the optimal arm with the help of the reward function information.
2) 2nd scenario: We further test the proposed algorithm in a more complex setting with 10 arms whose reward functions are similar to those in the 1st scenario. According to Figs. 4 and 5, GI-TS outperforms all the benchmarks for different θ*, and the learning regret of GI-TS is only around one-fifth of that incurred by TSc. Compared with the 1st scenario, GI-TS scales well when more arms are involved. This is because, instead of using the number of plays of an arm to set the variance of its reward distribution, as TSc does, GI-TS uses the total number of rounds t, based on the fact that the confidence set is updated in every round no matter which arm is played. This novel method improves the convergence rate and helps to achieve a competitively small learning regret.
3) 3rd scenario: In Figs. 6 and 7, we choose different pairs of θ*. The results demonstrate that GI-TS scales well when one more parameter is added compared with the 1st scenario, and GI-TS still outperforms all the benchmarks, reaching smaller learning regrets with a higher convergence speed.

D. Case Study: Transcoder Selection Problem
In this section, we study a decision-making problem of edge transcoder selection for CLS services, which can be modelled as a structured bandit problem.
The revolution of the Internet, driven by powerful mobile devices and social networks, has greatly enriched the sources of live video, and CLS has emerged as a new type of video service that both serves tremendous numbers of viewers and receives videos from various sources. To provide viewers with adaptive-bitrate videos, it is necessary to transcode the heterogeneous original live videos into industry-standard representations. To meet the massive transcoding demand, edge computing has been proposed to assist CLS by incentivizing end viewers' devices to serve as candidate transcoders [16]. However, as discussed in Section I, a candidate device can go offline during transcoding, which severely deteriorates the transcoding performance; hence more stable devices should be selected as transcoders.
According to [17], the online durations of candidate devices follow the Pareto distribution, which implies that the longer a device has been online, the more likely it is to remain online. The cumulative distribution function of a Pareto random variable X with parameters β and x_m is

F(x) = P(X ≤ x) = 1 − (x_m / x)^β,  x ≥ x_m.   (13)

For simplicity, assume x_f ≥ x_m always holds, and define S_f(x_m) = 1 − (x_m / x_f)^{β_f} as the transcoding stability of edge device f, i.e., the probability that the transcoder will remain online after the selection. In addition, x_f and β_f are predefined parameters which vary due to the heterogeneity of the transcoders.
Based on the above settings, each edge device can be regarded as an arm, and the expected reward function of arm f can be written as μ_f = S_f(x_m). Moreover, the form of each reward function is known, with θ* = x_m shared by all the arms. Therefore, this transcoder selection problem can be modelled as a structured bandit problem and solved by the proposed GI-TS algorithm. More specifically, in each round, the transcoder k with the highest sampled reward R̃_k(t) is selected for live video transcoding.
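Under these definitions, the reward model of a transcoder can be sketched as follows; treating each observation as a Bernoulli online/offline outcome with mean S_f(x_m) is an illustrative assumption (any sub-Gaussian reward with this mean fits the model), and the numeric parameters are arbitrary:

```python
import random

def stability(x_m, x_f, beta_f):
    """Transcoding stability S_f(x_m) = 1 - (x_m / x_f)^beta_f, the expected
    reward of edge device f; x_m is the parameter shared by all arms."""
    return 1.0 - (x_m / x_f) ** beta_f

def sample_reward(x_m, x_f, beta_f, rng=random):
    """One observed reward: 1 if the device stays online, 0 otherwise
    (a Bernoulli draw with mean S_f(x_m), assumed for illustration)."""
    return 1 if rng.random() < stability(x_m, x_f, beta_f) else 0
```

Since every device's stability is a known function of the same unknown x_m, each observed reward, whichever transcoder produced it, narrows down the confidence set over x_m used by GI-TS.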
In Fig. 8, the transcoder selection problem is simulated assuming there are 10 transcoders. The simulation runs for 20000 rounds, and the learning regret is calculated as the average of 100 independent experiments. To build a challenging scenario in which the expected rewards of the arms are close to each other, we set θ* = 22.5. The result demonstrates that the proposed GI-TS algorithm outperforms all the benchmarks and achieves the lowest learning regret.
V. Conclusions

In this paper, a new MAB model called the structured bandit is studied, in which the arms are correlated by sharing the same unknown parameter vector in their reward functions. To maximize the cumulative reward in this new bandit problem, we design an algorithm to accurately estimate the true value of the unknown parameter with the knowledge of the reward functions and the historical reward observations. Simulation results demonstrate that the proposed GI-TS algorithm outperforms all the benchmarks by reaching a low cumulative learning regret. The GI-TS algorithm is robust and scales well when more arms are added or a multi-dimensional unknown parameter is considered. As future work, we envision extending the proposed structured bandit model to consider risk awareness [18] in the decision making, aiming to identify the most stable arm with a relatively higher mean reward and lower reward variation.
References