From External to Internal Regret

External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper, we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting.


Introduction
The motivation behind regret analysis can be viewed as follows: we design a sophisticated online algorithm that deals with various issues of uncertainty and decision making, and sell it to a client. Our online algorithm runs for some time and incurs a certain loss. We would like to avoid the embarrassment of our client coming back to us and claiming that, in retrospect, we could have incurred a much lower loss had we used his simple alternative policy π. The regret of our online algorithm is the difference between the loss of our algorithm and the loss using π. Different notions of regret quantify differently what is considered to be a "simple" alternative policy.
At a high level one can split alternative policies into two categories. The first consists of alternative policies that are independent of the online algorithm's action selection, as is done in external regret. External regret, also called the best expert problem, compares the online algorithm's cost to the best of N actions in retrospect (see Hannan (1957); Foster and Vohra (1993); Littlestone and Warmuth (1994); Freund and Schapire (1995, 1999); Cesa-Bianchi et al. (1993)). This implies that the simple alternative policy performs the same action in all time steps, which indeed is quite simple. Nonetheless, one important application of external regret to online algorithm analysis is a general methodology for developing online algorithms whose performance matches that of an optimal static offline algorithm, by modeling the possible static solutions as different actions.
The second category consists of alternative policies that consider the online sequence of actions and suggest a simple modification to it, such as "every time you bought IBM, you should have bought Microsoft instead." This notion is captured by internal regret, introduced in Foster and Vohra (1998). Specifically, internal regret allows one to modify the online action sequence by changing every occurrence of a given action i to an alternative action j. Specific low internal regret algorithms were derived by Hart and Mas-Colell (2000), Foster and Vohra (1997, 1998, 1999), and Cesa-Bianchi and Lugosi (2003); the approachability theorem of Blackwell (1956) plays an important role in some of these algorithms.
One of the main contributions of our work is to show a simple online way to efficiently convert any external regret algorithm into an internal regret algorithm. Our guarantee is somewhat stronger than internal regret and we call it swap regret, which allows one to simultaneously swap multiple pairs of actions. (If there are N actions total, then swap regret is bounded by N times the internal regret.) Using known results for external regret we can derive a swap regret bound of O(N√(T log N) + N log N), and with additional optimization we are able to reduce this regret bound to O(√(NT log N) + N log N log T). We also show an Ω(√(NT)) lower bound for the case of randomized online algorithms against an adaptive adversary.
The importance of internal regret is due to its tight connection to correlated equilibria, introduced by Aumann (1974). For a general-sum game with any finite number of players, a distribution Q over the joint action space is a correlated equilibrium if every player would have zero internal regret when playing it. In a repeated game scenario, if each player uses an action selection algorithm whose internal regret is sublinear in T, then the empirical distribution of the players' actions converges to a correlated equilibrium (see, e.g., Hart and Mas-Colell (2000)). In fact, we point out that the deviation from a correlated equilibrium is bounded exactly by the average swap regret of the players.
We also extend our internal regret results to the partial information model, also called the adversarial multi-armed bandit (MAB) problem in Auer et al. (2002b). In this model, the online algorithm only gets to observe the loss of the action actually selected, and does not see the losses of the actions not chosen. For example, if you are driving to work and need to select which of several routes to take, you only observe the travel time on the route actually taken. If we view this as an online problem, each day selecting which route to take on that day, then this fits the MAB setting. Furthermore, the route-choosing problem can be viewed as a general-sum game: your travel time depends on the choices of the other drivers as well. Thus, if every driver uses a low internal-regret algorithm, then the uniform distribution over observed traffic patterns will converge to a correlated equilibrium. For the MAB problem, our combining algorithm requires an additional assumption on the base external-regret MAB algorithm: a smoothness in behavior when the actions played are taken from a somewhat different distribution than the one proposed by the algorithm. Luckily, these conditions are satisfied by existing external-regret MAB algorithms such as that of Auer et al. (2002b). For the multi-armed bandit setting, we derive an O(√(N³T log N) + N² log N) swap-regret bound. Thus, after T = O((1/ε²) N³ log N) rounds, the empirical distribution on the history is an ε-correlated equilibrium. (The work of Hart and Mas-Colell (2001) also gives a multi-armed bandit algorithm whose internal regret is sublinear in T, but does not derive explicit bounds.)
One can also envision broader classes of regret. Lehrer (2003) defines a notion of wide range regret that allows for arbitrary action-modification rules, which might depend on history, and also Boolean time selection functions (which determine which subset of times is relevant). Using the approachability theorem, he shows a scheme that in the limit achieves no regret (i.e., regret is sublinear in T). While Lehrer (2003) derives the regret bounds in the limit, we derive finite-time regret bounds for this setting. We show that for any family of N actions, M time selection functions and K modification rules, the maximum regret with respect to any selection function and modification rule is bounded by O(√(TN log(MK)) + N log(MK)). Our model also handles the case where the time selection functions are not Boolean, but rather real-valued in [0, 1].
This latter result can be viewed as a generalization of the sleeping experts setting of Freund et al. (1997) and Blum (1997). In the sleeping experts problem, we again have a set of experts, but on any given time step each expert may be awake (making a prediction) or asleep (not predicting). This is a natural model for combining a collection of if-then rules that only make predictions when the "if" portion of the rule is satisfied, and this setting has had applications in domains ranging from managing a calendar (Blum, 1997) and text categorization (Cohen and Singer, 1999) to learning how to formulate web search-engine queries (Cohen and Singer, 1996). By converting each such sleeping expert into a pair (expert, time-selection function), we achieve the desired guarantee that for each sleeping expert, our loss during the time that expert was awake is not much more than its loss in that period. Moreover, by using non-Boolean time-selection functions, we can naturally handle prediction rules that have varying degrees of confidence in their predictions and achieve a confidence-weighted notion of regret.
We also study the case of deterministic Boolean prediction in the setting of time selection functions. We derive a deterministic online algorithm whose number of weighted errors, with respect to any time selection function from our class of M selection functions, is at most 3·OPT + 2 + 2 log₂ M, where OPT is the number of weighted errors of the best constant prediction for that time selection function.
Recent related work. Comparable results can be achieved based on independent work appearing in the journal version of Stoltz and Lugosi (2003): specifically, the results regarding the relation between external and internal regret in Stoltz and Lugosi (2004) and the multi-armed bandit setting in Cesa-Bianchi et al. (2004). In comparison to Stoltz and Lugosi (2004), we are able to achieve a better swap regret guarantee in polynomial time (a straightforward application of Stoltz and Lugosi (2004) to swap regret would require time complexity Ω(N^N); alternatively, they can achieve a good internal-regret bound in polynomial time, but then their swap regret bound becomes worse by a factor of √N. On the other hand, their work is applicable to a wider range of loss functions, which also captures scenarios arising in portfolio selection.) We should stress that the above techniques are very different from the techniques proposed in our work.

Model and Preliminaries
We assume an adversarial online model where there are N available actions {1, . . ., N}. At each time step t, an online algorithm H selects a distribution p^t over the N actions. After that, the adversary selects a loss vector ℓ^t ∈ [0, 1]^N, where ℓ^t_i ∈ [0, 1] is the loss of the i-th action at time t. In the full information model, the online algorithm receives the loss vector ℓ^t and experiences a loss ℓ^t_H = Σ_{i=1}^N p^t_i ℓ^t_i. In the partial information model, the online algorithm receives (ℓ^t_{k^t}, k^t), where k^t is distributed according to p^t, and ℓ^t_H = ℓ^t_{k^t} is its loss. The loss of the i-th action during the first T time steps is L^T_i = Σ_{t=1}^T ℓ^t_i, and the loss of H is L^T_H = Σ_{t=1}^T ℓ^t_H.

The aim in the external regret setting is to design an online algorithm that approaches the best action, namely, that has loss close to L^T_min = min_i L^T_i. Formally, we would like to minimize the external regret R = L^T_H − L^T_min.

We introduce a notion of a time selection function. A time selection function I is a function mapping each time step to [0, 1]; that is, I : {1, . . ., T} → [0, 1]. The loss of action j using time-selector I is L^T_{j,I} = Σ_t I(t)ℓ^t_j. Similarly we define L^T_{H,I}, the loss of the online algorithm H with respect to time selection function I, as L^T_{H,I} = Σ_t I(t)ℓ^t_H, where ℓ^t_H is the loss of H at time t. This notion of experts with time selection is very similar to the notion of "sleeping experts" studied in Freund et al. (1997). Specifically, for each action j and time selection function I, one can view the pair (j, I) as an expert that is "awake" when I(t) = 1 and "asleep" when I(t) = 0 (and "partially awake" when I(t) ∈ (0, 1)).
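As a concrete illustration of these definitions (not part of the paper's development), external regret can be computed directly from the algorithm's distributions p^t and the adversary's loss vectors ℓ^t. The function name and data layout below are our own.

```python
def external_regret(ps, losses):
    """ps[t][i]: probability H puts on action i at time t;
    losses[t][i]: loss of action i at time t, in [0, 1]."""
    T, N = len(losses), len(losses[0])
    # L^T_H = sum_t sum_i p^t_i * l^t_i
    L_H = sum(ps[t][i] * losses[t][i] for t in range(T) for i in range(N))
    # L^T_min = min_i sum_t l^t_i
    L_min = min(sum(losses[t][i] for t in range(T)) for i in range(N))
    return L_H - L_min
```

For example, a uniform player facing losses that always favor one action accumulates external regret T/2 after T steps.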
We also consider modification rules that modify the actions selected by the online algorithm, producing an alternative strategy we will want to compete against. A modification rule F takes as input the history and an action choice and outputs a (possibly different) action. (We denote by F^t the function F at time t, including any dependency on the history.) Given a sequence of probability distributions p^t used by an online algorithm H, and a modification rule F, we define a new sequence of probability distributions f^t = F^t(p^t), where f^t_i = Σ_{j: F^t(j)=i} p^t_j. The loss of the modified sequence is L_{H,F} = Σ_t Σ_i f^t_i ℓ^t_i. Similarly, given a time selection function I and a modification rule F, we define L_{H,I,F} = Σ_t Σ_i I(t) f^t_i ℓ^t_i.

In our setting we assume a finite class of N actions, {1, . . ., N}, a finite set F of K modification rules, and a finite set I of M time selection functions. Given a sequence of loss vectors, the regret of an online algorithm H with respect to the N actions, the K modification rules, and the M time selection functions, is

max_{I∈I, F∈F} { L_{H,I} − L_{H,I,F} }.

Note that the external regret setting is equivalent to having a single time-selection function (I(t) = 1 for all t) and a set F^ex of N modification rules F_i, where F_i always outputs action i. For internal regret, the set F^in consists of N(N − 1) modification rules F_{i,j}, where F_{i,j}(i) = j and F_{i,j}(i′) = i′ for every i′ ≠ i. We define a slightly extended class of internal regret which we call swap regret. This case has F^sw include all N^N functions F : {1, . . ., N} → {1, . . ., N}, where the function F swaps the current online action i with F(i) (which can be the same or a different action).
A few simple relationships between the different types of regret: since F^ex ⊆ F^sw and F^in ⊆ F^sw, both external and internal regret are upper-bounded by swap regret. Also, swap regret is at most N times larger than internal regret. On the other hand, even for N = 3, there are simple examples that separate internal and external regret (see Stoltz and Lugosi (2003)).
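A hedged illustration (our own naming and data layout, not the paper's): swap regret of a fixed sequence of play can be computed by choosing, independently for each action i, the best target j for the probability mass placed on i.

```python
def swap_regret(ps, losses):
    """ps[t][i]: probability on action i at time t; losses[t][i]: loss of i at t.
    Swap regret: sum over i of the best gain from rerouting i's mass to some j."""
    T, N = len(losses), len(losses[0])
    total = 0.0
    for i in range(N):
        # gain of replacing every play of action i by action j
        gains = [sum(ps[t][i] * (losses[t][i] - losses[t][j]) for t in range(T))
                 for j in range(N)]
        total += max(gains)  # j = i is allowed, so each term is >= 0
    return total
```

Because the maximization decomposes over actions, the best of the N^N swap functions is found in O(N²T) time rather than by enumerating all of them.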

Correlated Equilibria and Swap Regret
We briefly sketch the relationship between correlated equilibria and swap regret.
A game G = ⟨M, (A_i), (s_i)⟩ has a finite set M of players, where each player i has a set A_i of N actions and a loss function s_i that maps the action of player i and the actions of the other players to a real number. (We have scaled losses to [0, 1].) The aim of each player is to minimize its loss. A correlated equilibrium is a distribution P over the joint action space with the following property. Imagine a correlating device draws a vector of actions a using distribution P over ×A_i, and gives player i the action a_i from a. (Player i is not given any other information regarding a.) The probability distribution P is a correlated equilibrium if, for each player, playing the suggested action is a best response (provided that the other players do not deviate).
We now define an ε-correlated equilibrium.
Definition 2 A joint probability distribution P over ×A_i is an ε-correlated equilibrium if for every player j and for any function F : A_j → A_j we have E_{a∼P}[s_j(a_j, a_{−j})] ≤ E_{a∼P}[s_j(F(a_j), a_{−j})] + ε, where a_{−j} denotes the joint actions of the other players.
The following theorem relates the empirical distribution of the actions performed by each player, their swap regret, and the distance from a correlated equilibrium (see also Foster and Vohra (1997, 1998) and Hart and Mas-Colell (2000)).
Theorem 3 Let G = ⟨M, (A_i), (s_i)⟩ be a game and assume that for T time steps each player follows a strategy that has swap regret of at most R(T, N). Then the empirical distribution Q of the joint actions played by the players is an (R(T, N)/T)-correlated equilibrium, and the loss of each player equals, by definition, its expected loss on Q.
The above states that the payoff of each player is its payoff in some approximate correlated equilibrium. In addition, it relates the swap regret to the distance from a correlated equilibrium. Note that if the average swap regret vanishes then the procedure converges, in the limit, to a correlated equilibrium (see Hart and Mas-Colell (2000) and Foster and Vohra (1997, 1999)).
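To make Definition 2 concrete, the deviation ε of a given joint distribution can be computed directly: for each player, the maximization over functions F decomposes over the recommended action. The following is a self-contained sketch with our own naming (Q as a dictionary from joint-action tuples to probabilities).

```python
def ce_epsilon(Q, losses, n_players, n_actions):
    """Smallest eps for which Q is an eps-correlated equilibrium.
    Q: dict mapping joint-action tuples to probabilities (summing to 1);
    losses[j][a]: loss of player j under joint action tuple a."""
    eps = 0.0
    for j in range(n_players):
        dev = 0.0
        for rec in range(n_actions):
            # expected loss of player j on draws where it is told to play `rec`
            base = sum(p * losses[j][a] for a, p in Q.items() if a[j] == rec)
            # best expected loss if it instead deviates to a fixed alternative
            best = min(sum(p * losses[j][a[:j] + (alt,) + a[j + 1:]]
                           for a, p in Q.items() if a[j] == rec)
                       for alt in range(n_actions))
            dev += base - best
        eps = max(eps, dev)
    return eps
```

For a two-player coordination game where mismatching costs 1, the distribution placing half its mass on each matched profile has ε = 0, while mass on a mismatched profile yields ε > 0.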

Generic reduction from external to swap regret
We now give a black-box reduction showing how any algorithm A achieving good external regret can be used as a subroutine to achieve good swap regret as well. The high-level idea is as follows. We will instantiate N copies of the external-regret algorithm. At each time step, these algorithms will each give us a probability vector, which we will combine in a particular way to produce our own probability vector p. When we receive a loss vector ℓ, we will partition it among the N algorithms, giving algorithm A_i a fraction p_i (p_i is our probability mass on action i), so that A_i's belief about the loss of action j is Σ_t p^t_i ℓ^t_j, which matches the cost we would incur putting i's probability mass on j. In the proof, algorithm A_i will in some sense be responsible for ensuring low regret of the i → j variety. The key to making this work is that we will be able to define the p's so that the sum of the losses of the algorithms A_i on their own loss vectors matches our overall true loss.
To be specific, let us formalize what we mean by an external regret algorithm.
Definition 4 An algorithm A has external regret R(L_min, T, N) if for any sequence of T losses ℓ^t such that some action has total loss at most L_min, for any action j ∈ {1, . . ., N} we have

L^T_A = Σ_{t=1}^T ℓ^t_A ≤ L^T_j + R(L_min, T, N).

We assume we have N algorithms A_i (which could all be identical or different) such that A_i has external regret R_i(L_min, T, N). We combine the N algorithms as follows. At each time step t, each algorithm A_i outputs a distribution q^t_i, where q^t_{i,j} is the fraction it assigns action j. We compute a vector p^t such that p^t_j = Σ_i p^t_i q^t_{i,j}. That is, p = pQ, where p is the row-vector of our probabilities and Q is the matrix of the q_{i,j}. (We can view p as a stationary distribution of the Markov process defined by Q; it is well known that such a p exists and is efficiently computable.) For intuition into this choice of p, notice that it implies we can consider action selection in two equivalent ways. The first is simply using the distribution p to select action j with probability p_j. The second is to select algorithm A_i with probability p_i and then use algorithm A_i to select the action (which produces distribution pQ).
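The combination step can be sketched in code. The following is a minimal illustration, not the paper's implementation: we use a Hedge-style multiplicative-weights learner as the external-regret subroutine and compute p = pQ by power iteration (the Hedge weights are strictly positive, so Q is a positive row-stochastic matrix and the iteration converges). All names and the learning rate are our own choices.

```python
import math

class Hedge:
    """A standard multiplicative-weights external-regret learner (our choice
    of subroutine; any algorithm with the guarantee of Definition 4 works)."""
    def __init__(self, n, eta=0.1):
        self.w = [1.0] * n
        self.eta = eta
    def distribution(self):
        s = sum(self.w)
        return [x / s for x in self.w]
    def update(self, loss):  # loss: per-action loss vector
        self.w = [x * math.exp(-self.eta * l) for x, l in zip(self.w, loss)]

def stationary(Q, iters=500):
    """Solve p = pQ by power iteration; Q is row-stochastic, entries > 0."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return p

class SwapRegretReduction:
    """N external-regret learners A_1..A_N combined as in the reduction."""
    def __init__(self, n, eta=0.1):
        self.algs = [Hedge(n, eta) for _ in range(n)]
        self.p = [1.0 / n] * n
    def distribution(self):
        Q = [a.distribution() for a in self.algs]  # row i is A_i's q_i
        self.p = stationary(Q)
        return self.p
    def update(self, loss):
        # A_i is charged the loss vector scaled by p_i, as in the text
        for p_i, alg in zip(self.p, self.algs):
            alg.update([p_i * l for l in loss])
```

On a sequence where action 1 always incurs loss 1 and action 0 loss 0, every A_i shifts its row of Q toward action 0, and the stationary distribution p concentrates there.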
When the adversary returns ℓ^t, we return to each A_i the loss vector p^t_i ℓ^t. So, algorithm A_i experiences loss (p^t_i ℓ^t) · q^t_i = p^t_i (q^t_i · ℓ^t). Now we consider the guarantee that we have for algorithm A_i, namely, for any action j,

Σ_{t=1}^T p^t_i (q^t_i · ℓ^t) ≤ Σ_{t=1}^T p^t_i ℓ^t_j + R_i(L_min, T, N).   (1)

If we sum the losses of the N algorithms at any time t, we get Σ_i p^t_i (q^t_i · ℓ^t) = p^t Q^t ℓ^t, where p^t is the row-vector of our probabilities, Q^t is the matrix of the q^t_{i,j}, and ℓ^t is viewed as a column-vector. By design of p^t, we have p^t Q^t = p^t. So, the sum of the perceived losses of the N algorithms is equal to our actual loss p^t ℓ^t. Therefore, summing equation (1) over all N algorithms, the left-hand side sums to L^T_H. Since the right-hand side of equation (1) holds for any j, we have that for any function F : {1, . . ., N} → {1, . . ., N},

L^T_H ≤ Σ_{i=1}^N Σ_{t=1}^T p^t_i ℓ^t_{F(i)} + Σ_{i=1}^N R_i(L_min, T, N) = L^T_{H,F} + Σ_{i=1}^N R_i(L_min, T, N).

We have therefore proven the following theorem.
Theorem 5 For any N algorithms A_i with regret R_i, for every function F : {1, . . ., N} → {1, . . ., N}, the above algorithm satisfies

L^T_H ≤ L^T_{H,F} + Σ_{i=1}^N R_i.

A typical optimized experts algorithm, such as in Littlestone and Warmuth (1994), Freund and Schapire (1995), Auer et al. (2002b), and Cesa-Bianchi et al. (1993), will have R = O(√(L_min log N) + log N). (Alternatively, Corollary 14 can also be used to deduce the above bound.) We can immediately derive the following corollary.
Corollary 6 Using an optimized experts algorithm as each A_i, for every function F : {1, . . ., N} → {1, . . ., N}, we have that

L^T_H ≤ L^T_{H,F} + O(N√(T log N) + N log N).

We can perform a slightly more refined analysis of the bound by letting L^i_min be the minimum loss of an action in A_i. Note that Σ_{i=1}^N L^i_min ≤ T, since we scaled the losses given to algorithm A_i at time t by p^t_i. By concavity of the square-root function, this implies that Σ_{i=1}^N √(L^i_min) ≤ √(NT). The only problem is that algorithm A_i needs to "know" the value of L^i_min to set its internal parameters correctly. One way to avoid this is to use an adaptive method of Auer et al. (2002a). We can also avoid this problem using the standard doubling approach of starting with L_min = 1 and, each time our guess is violated, doubling the bound and restarting the online algorithm. The external regret of such a resetting optimized experts algorithm would be O(√(L_min log N) + log N log T). Going back to our case of N multiple online algorithms A_i, we derive the following.

Corollary 7 Using resetting optimized experts algorithms as the A_i, for every function F : {1, . . ., N} → {1, . . ., N}, we have that

L^T_H ≤ L^T_{H,F} + O(√(NT log N) + N log N log T).

One strength of the above general reduction is its ability to accommodate new regret minimization algorithms. For example, using the algorithm of Cesa-Bianchi et al. (2005) one can get a more refined regret bound, which depends on the second moment.
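The doubling approach can be sketched as follows; this is our own illustrative code (the tuning eta = sqrt(2 ln n / guess) is a standard choice, and the names are assumptions, not the paper's), showing a Hedge-style learner that restarts each time the guess for L_min is violated.

```python
import math

def resetting_hedge_loss(losses, n):
    """Run a Hedge-style learner on the loss sequence, restarting with a
    learning rate tuned to the current guess of L_min whenever the guess is
    violated (guesses 1, 2, 4, ...). Returns the algorithm's total loss."""
    total, guess = 0.0, 1.0
    w = [1.0] * n
    phase = [0.0] * n            # per-action loss within the current phase
    for loss in losses:
        eta = math.sqrt(2.0 * math.log(n) / guess)
        s = sum(w)
        total += sum(wi / s * li for wi, li in zip(w, loss))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
        phase = [c + li for c, li in zip(phase, loss)]
        if min(phase) > guess:   # best action exceeded the guess: restart
            guess *= 2.0
            w = [1.0] * n
            phase = [0.0] * n
    return total
```

Each restart at least doubles the guess, so there are O(log T) phases; this is the source of the extra log T factor in the resetting bound.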

Lower bound for swap regret
Notice that while good algorithms for the experts problem achieve external regret O(√(T log N)), our swap-regret bounds are roughly O(√(TN log N)). Or, to put it another way, for external regret one can achieve regret εT by time T = O(ε^{−2} log N), whereas we need T = O(ε^{−2} N log N) to achieve swap regret εT (or an ε-correlated equilibrium). A natural question is whether this is best possible. We give here a partial answer: a lower bound of Ω(√(TN)), but in a more adversarial model. First, one tricky issue is that for a given stochastic adversary, the optimal policy for minimizing loss may not be the optimal policy for minimizing swap regret. For example, consider a process in which {0, 1} losses are generated by an almost fair coin, but with slight biases that change each day so that the optimal policy for minimizing loss uses each action T/N times. Because of the variance of the coin flips, in retrospect, most actions can be swapped with an expected gain of Ω(√((T log N)/N)) each, giving a total swap regret of Ω(√(TN log N)) for this policy. However, a policy that just picks a single fixed action would have swap regret only O(√(T log N)) even though its expected loss is slightly higher.
We show a lower bound of Ω(√(TN)) on swap regret, but in a different model. Specifically, we have defined swap regret with respect to the distribution p^t produced by the player, rather than the actual action a^t selected from that distribution. In the case that the adversary is oblivious (does not depend on the player's action selection), the two models have the same expected regret. However, we will consider a dynamic adversary, whose choices may depend on the player's action selection in previous rounds. In this setting (a dynamic adversary, and regret defined with respect to the action selected from p^t rather than p^t itself) we derive the following theorem.
Theorem 8 There exists a dynamic adversary such that for any randomized online algorithm A, the expected swap regret of A is at least (1 − λ)√(TN)/128, for T ≥ N and λ = NT e^{−cN} for some constant c > 0.
Proof We start by describing the adversary. At time t, the loss of action i is determined as follows. If algorithm A has selected action i fewer than 8T/N times so far, then we flip a fair coin and set ℓ^t_i = 1 on heads and ℓ^t_i = 0 on tails. Otherwise, we again flip a coin but set ℓ^t_i = 1 in either case (the coin now is just to help with the analysis later). We call actions of the first type randomized actions, and those of the second type 1-loss actions. Call an action that never becomes 1-loss "untouched", and one that does "touched". Since each touched action must have been played at least 8T/N times, at most N/8 actions can be touched, so there must be at least 7N/8 untouched actions. Also, let T_R denote the number of times the algorithm plays a randomized action (which could be a random variable depending on the algorithm). Notice that the expected loss of the algorithm is E[T_R]/2 + (T − E[T_R]) = T − E[T_R]/2.

We break the argument into two cases based on E[T_R]. The simpler case is E[T_R] ≤ T/2 (i.e., the algorithm plays many 1-loss actions). In that case, the expected loss of the algorithm is at least 3T/4. On the other hand, with probability at least 1 − Ne^{−N/32} ≥ 1 − λ there is some action of total loss at most T/2 (because even if the algorithm could decide which actions to touch knowing the future coin flips, with probability at least 1 − Ne^{−N/32} at least N/4 actions have at most T/2 heads, and the algorithm can only touch N/8 of them). So, the expected regret is at least (1 − λ)T/4 ≥ (1 − λ)√(TN)/128.

We now analyze the case that E[T_R] > T/2. Let T_i denote the set of time steps in which the algorithm plays i, let T_i = |T_i|, and let GOOD_i denote the set of actions whose loss in time steps T_i is at most T_i/2 − √(T_i)/2; i.e., GOOD_i = {j | Σ_{t∈T_i} ℓ^t_j ≤ T_i/2 − √(T_i)/2}. Let us first assume that none of the sets GOOD_i is empty. Denote by SR the swap regret, i.e., SR = Σ_{i=1}^N max_j Σ_{t∈T_i} (ℓ^t_i − ℓ^t_j). The expected swap regret of the algorithm, E[SR], is then at least the difference between its expected loss and E[Σ_{i=1}^N (T_i/2 − √(T_i)/2)], i.e., E[SR] ≥ E[Σ_{i=1}^N √(T_i)/2], where we use the fact that Σ_i T_i = T and T_R ≤ T. In expectation, the number of actions i such that T_i ≥ T/(4N) is at least N/16 − N/32 = N/32, and therefore E[SR] ≥ (N/32) · (1/2)√(T/(4N)) = √(TN)/128.

It remains to show that with probability 1 − λ, every set GOOD_i is nonempty. First, note that for T′ coin tosses, with probability at least 1/7 there are at most T′/2 − √(T′)/2 heads. Fix an action i and any given value of T_i. This implies that even if the algorithm could decide which (at most N/8) actions to touch after the fact, with probability at least 1 − e^{−(1/7−1/8)²N/2} at least one action with the desired loss remains untouched, and hence GOOD_i ≠ ∅. Summing over all i and all possible values of T_i yields a failure probability of at most NT e^{−cN} = λ. We complete the proof by noting that this failure event decreases the bound by at most a factor of 1 − λ.

Reducing External to Swap Regret in the Partial information model
In the full information setting the learner gets, at the end of each time step, full information on the costs of all the actions.In the partial information (multi-arm bandit) model, the learner gets information only about the action that was selected.In some applications this is a more plausible model regarding the information the learner can observe.
The reduction in the partial information model is similar to that of the full information model, but with a few additional complications. We are given N partial information algorithms A_i. At each time step t, each algorithm A_i outputs a distribution q^t_i. Our master online algorithm combines them into some distribution p^t which it uses. Given p^t it receives a feedback, but now this includes information about only one action: it receives (ℓ^t_{k^t}, k^t), where k^t is distributed according to p^t. We take this feedback and distribute to each algorithm A_i a feedback (c^t_i, k^t), such that Σ_i c^t_i = ℓ^t_{k^t}. The main technical difficulty is that now the action selected, k^t, is distributed according to p^t and not q^t_i. (For example, it might be that A_i has q^t_{i,j} = 0 but it receives feedback about action j. From A_i's point of view this is impossible! Or, more generally, A_i might start noticing it seems to have a very bad random-number generator.) For this reason, for the reduction to work we need to make a stronger assumption about the guarantees of the algorithms A_i, which luckily is implicit in the algorithms of Auer et al. (2002b). Since the results of Auer et al.
(2002b) are stated in terms of maximizing gain rather than minimizing loss, we will switch to this notation, e.g., define the benefit of action j at time t to be b^t_j = 1 − ℓ^t_j. We start by describing our MAB algorithm SR_MAB. Initially, we are given N partial information algorithms A_i. At each time step t, each A_i gives a selection distribution q^t_i over actions. Given all the selection distributions we compute an action distribution p^t. We keep two sets of gains: one is the real gain, denoted by b^t_i, and the other is the gain that the MAB algorithm A_i observes, g^t_{A_i}. Given the action distribution p^t, the adversary selects a vector of real gains b^t. Our MAB algorithm SR_MAB receives a single feedback (b^t_{k^t}, k^t), where k^t is a random variable that equals j with probability p^t_j. Algorithm SR_MAB, given this feedback, returns to each A_i a pair (g^t_{A_i}, k^t), where the observed gain g^t_{A_i} is based on b^t, p^t and q^t_i. Again, note that k^t is distributed according to p^t, which may not equal q^t_i: it is for this reason that we need an MAB algorithm satisfying certain properties (stated in Lemma 9).
In order to specify our MAB algorithm SR_MAB, we need to specify how it selects the action distribution p^t and the observed gains g^t_{A_i}. As in the full information case, we compute an action distribution p^t such that p^t_j = Σ_i p^t_i q^t_{i,j}. That is, p = pQ, where p is the row-vector of our probabilities and Q is the matrix of the q_{i,j}. Given p^t, the adversary returns the feedback (b^t_{k^t}, k^t); namely, the real gain of our algorithm is b^t_{k^t}. We return to each algorithm A_i an observed gain of g^t_{A_i} = p^t_i b^t_{k^t} q^t_{i,k^t} / p^t_{k^t}. (In general, define g^t_{i,j} = p^t_i b^t_j q^t_{i,j} / p^t_j if j = k^t, and g^t_{i,j} = 0 if j ≠ k^t.) First, we will show that Σ_i g^t_{A_i} = b^t_{k^t}. From the property of the distribution p^t we have that

Σ_i g^t_{A_i} = Σ_i p^t_i b^t_{k^t} q^t_{i,k^t} / p^t_{k^t} = (b^t_{k^t} / p^t_{k^t}) Σ_i p^t_i q^t_{i,k^t} = b^t_{k^t}.

This shows that we distribute our real gain among the algorithms A_i; that is, the sum of the observed gains equals the real gain. In addition, it bounds the observed gain that each algorithm A_i receives: 0 ≤ g^t_{A_i} ≤ b^t_{k^t} ≤ 1. In order to describe the guarantee that each external regret multi-arm bandit algorithm A_i is required to have, we need the following additional definition. At time t let X^t_{i,j} be a random variable such that X^t_{i,j} = g^t_{i,j} / q^t_{i,j} if j = k^t and X^t_{i,j} = 0 otherwise. The expectation of X^t_{i,j} is

E[X^t_{i,j}] = p^t_j · (p^t_i b^t_j q^t_{i,j} / p^t_j) / q^t_{i,j} = p^t_i b^t_j.

Lemma 9 (Auer et al. (2002b)) There exists a multi-arm bandit algorithm A_i such that for any sequence of observed gains g^t_{i,j} ∈ [0, 1] it outputs selection distributions q^t_i, and for any sequence of selected actions k^t, any action r, and any parameter γ ∈ (0, 1],

Σ_{t=1}^T g^t_{A_i} ≥ (1 − γ) Σ_{t=1}^T X^t_{i,r} − (N ln N)/γ,

where X^t_{i,j} is a random variable such that X^t_{i,j} = g^t_{i,j} / q^t_{i,j} if j = k^t and X^t_{i,j} = 0 otherwise.

Note that in Auer et al. (2002b) the action distribution is identical to the selection distribution, i.e. p^t ≡ q^t, and the observed and real gains are identical, i.e., g^t ≡ b^t. Auer et al.
(2002b) derive the external regret bound by taking the expectation with respect to the action distribution (which is identical to the selection distribution). In our case we separate the real gain from the observed gain, which adds another layer of complication. (Technically, the distribution p^t is a random variable that depends on the observed actions k^1, . . ., k^{t−1} as well as the observed gains b^1_{k^1}, . . ., b^{t−1}_{k^{t−1}}. We will slightly abuse notation by referring directly to p^t, but one should interpret it as conditioned on the observed actions k^1, . . ., k^{t−1}.) We define the benefit of SR_MAB to be B_{SR_MAB} = Σ_{t=1}^T b^t_{SR_MAB}, and for a function F : {1, . . ., N} → {1, . . ., N} we define B_{SR_MAB,F} = Σ_{t=1}^T Σ_{i=1}^N p^t_i b^t_{F(i)}. We now state our main theorem regarding the partial information model.
Theorem 10 Given a multi-arm bandit algorithm satisfying Lemma 9 (such as the algorithm of Auer et al. (2002b)), it can be converted into a master online algorithm SR_MAB such that for every function F : {1, . . ., N} → {1, . . ., N},

E[B_{SR_MAB}] ≥ E[B_{SR_MAB,F}] − O(√(N³ B_max log N) + N² log N),

where the expectation is over the observed actions of SR_MAB and B_max bounds the maximum benefit of any algorithm, B_max ≥ max_{i,j} B_{i,j}.

Proof Let the total observed gain of algorithm A_i be G_{A_i} = Σ_{t=1}^T g^t_{A_i}. By Lemma 9, this implies that for any action r, after taking the expectation over the observed actions, we have

E[G_{A_i}] ≥ (1 − γ) E[B_{i,r}] − (N ln N)/γ,

where B_{i,r} = Σ_{t=1}^T p^t_i b^t_r, B_max ≥ max_{i,j} B_{i,j}, and γ = min{√((N ln N)/B_max), 1}. For swap regret, we compare the expected benefit of SR_MAB to that of Σ_{i=1}^N max_j B_{i,j}. Since the observed gains are split so that B_{SR_MAB} = Σ_{i=1}^N G_{A_i}, taking the expectation over the observed actions yields

E[B_{SR_MAB}] = Σ_{i=1}^N E[G_{A_i}] ≥ Σ_{i=1}^N E[B_{i,F(i)}] − γ Σ_{i=1}^N max_j E[B_{i,j}] − N(N ln N)/γ ≥ E[B_{SR_MAB,F}] − O(√(N³ B_max log N) + N² log N),

where E[B_{i,F(i)}] denotes the expectation of B_{i,F(i)} over the observed actions.
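The gain-splitting step of SR_MAB is easy to sketch in isolation (our own function name; the formula is the one given for g^t_{A_i}): given the stationary p, the selection matrix Q, the selected action k and its benefit, each A_i is credited p_i b_k q_{i,k} / p_k, and the credits sum to the real gain whenever p = pQ.

```python
def split_gain(p, Q, k, b_k):
    """Observed gains g_{A_i} = p_i * b_k * q_{i,k} / p_k for the selected
    action k with real benefit b_k. p must satisfy p_j = sum_i p_i * Q[i][j]."""
    return [p[i] * b_k * Q[i][k] / p[k] for i in range(len(p))]
```

Since Σ_i p_i q_{i,k} = p_k, the credits sum to b_k exactly, matching the computation in the text.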

External Regret with Time-Selection Functions
We now present a simple online algorithm that achieves a good external regret bound in the presence of time selection functions, generalizing the sleeping experts setting. Specifically, our goal is, for each action a and each time-selection function I, that our total loss during the time steps selected by I should be not much more than the loss of a during those time steps. More generally, this should be true for the losses weighted by I when I(t) ∈ [0, 1]. The idea of the algorithm is as follows. Let R_{a,I} be the regret of our algorithm with respect to action a and time selection function I; that is, R_{a,I} = Σ_t I(t)(ℓ^t_H − ℓ^t_a). Let R̃_{a,I} be a less strict notion of regret in which we multiply our loss by some β < 1, that is, R̃_{a,I} = Σ_t I(t)(βℓ^t_H − ℓ^t_a). What we will do is give each action a and time selection function I a weight w_{a,I} that is exponential in R̃_{a,I}. We will prove that the sum of our weights never increases, and thereby be able to easily conclude that none of the R̃_{a,I} can be too large.
Specifically, for each of the $N$ actions and the $M$ time-selection functions we maintain a weight $w^t_{a,I}$, initialized to $w^0_{a,I} = 1$. We update these weights using the rule
$$w^{t+1}_{a,I} = w^t_{a,I}\,\beta^{I(t)(\ell^t_a - \beta\ell^t_H)}, \qquad (3)$$
so that $w^t_{a,I} = \beta^{-\tilde{R}^t_{a,I}}$, where $\tilde{R}^t_{a,I} = \sum_{t' \leq t} I(t')(\beta\ell^{t'}_H - \ell^{t'}_a)$. For the inductive step we show that the sum of the weights can only decrease; here we use that for any $\beta \in [0,1]$ and $x \in [0,1]$ we have $\beta^x \leq 1 - (1-\beta)x$ and $\beta^{-\beta x} \leq 1 + (1-\beta)x$. For the refined bound with a parameter $\beta_I$ per time-selection function, we take $\tilde{R}^t_{a,I} = \sum_{t' \leq t} I(t')(\beta_I\ell^{t'}_H - \ell^{t'}_a)$, and then let $w^t_a = \sum_I (1-\beta_I)I(t)w^t_{a,I}$, $W^t = \sum_a w^t_a$ and $p^t_a = w^t_a/W^t$. The proof of Claim 11 holds in a similar way, and from it one can derive, analogously, the more refined regret bound.
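A minimal runnable sketch of this exponential-weights scheme, assuming losses and time-selection values lie in $[0,1]$ (the function name `time_selection_hedge` is ours):

```python
def time_selection_hedge(losses, selectors, beta=0.9):
    """Exponential weights over (action, time-selection) pairs.

    losses[t][a] in [0, 1]; selectors is a list of functions I(t) -> [0, 1].
    Each weight w[a][i] is kept exponential in the less-strict regret
    R~_{a,I} = sum_t I(t) * (beta * loss_H - loss_a)."""
    n = len(losses[0])
    m = len(selectors)
    w = [[1.0] * m for _ in range(n)]
    history = []
    for t, loss in enumerate(losses):
        sel = [I(t) for I in selectors]
        # action a's weight at time t mixes its pair weights by I(t)
        wa = [sum(sel[i] * w[a][i] for i in range(m)) for a in range(n)]
        W = sum(wa)
        p = [x / W for x in wa] if W > 0 else [1.0 / n] * n
        loss_H = sum(p[a] * loss[a] for a in range(n))
        history.append(loss_H)
        # update rule (3): w <- w * beta^{I(t) * (loss_a - beta * loss_H)}
        for a in range(n):
            for i in range(m):
                w[a][i] *= beta ** (sel[i] * (loss[a] - beta * loss_H))
    return history, w
```

The invariant from Claim 11 — the total weight starts at $NM$ and never increases — can be checked directly on the returned weights.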

Arbitrary time selection and modification rules
In this section we combine the techniques from Sections 3 and 6 to derive a regret bound for the general case, where we assume that there is a finite set $\mathcal{I}$ of $M$ time-selection functions and a finite set $\mathcal{F}$ of $K$ modification rules. Our goal is to design an algorithm such that for any time-selection function $I \in \mathcal{I}$ and any $F \in \mathcal{F}$, the loss $L_{H,I}$ is not much larger than $L_{H,I,F}$.
We maintain at time $t$ a weight $w^t_{j,I,F}$ per action $j$, time-selection function $I$ and modification rule $F$. Initially $w^0_{j,I,F} = 1$. We set
$$w^{t+1}_{j,I,F} = w^t_{j,I,F}\,\beta^{I(t)p^t_j(\ell^t_{F(j)} - \beta\ell^t_{H,j})},$$
and let $W^t_{j,F} = \sum_I I(t)w^t_{j,I,F}$, $W^t_j = \sum_F W^t_{j,F}$, and $\ell^t_{H,j} = \sum_F W^t_{j,F}\,\ell^t_{F(j)}/W^t_j$.
We use the weights to define a distribution $p^t$ over actions as follows. We select a distribution $p^t$ such that
$$p^t_j = \sum_i p^t_i \sum_{F:\,F(i)=j} W^t_{i,F}/W^t_i;$$
i.e., $p^t$ is the stationary distribution of the associated Markov chain, whose transition probability from $i$ to $j$ is $\sum_{F:\,F(i)=j} W^t_{i,F}/W^t_i$. Notice that the definition of $p^t$ implies that the loss of $H$ at time $t$ can be viewed either as $\sum_i p^t_i\ell^t_i$ or as $\sum_j p^t_j \sum_F (W^t_{j,F}/W^t_j)\ell^t_{F(j)} = \sum_j p^t_j\ell^t_{H,j}$. The following claim bounds the magnitude of the weights.
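The stationary distribution can be computed by solving $p = pQ$ for the row-stochastic matrix $Q$ above; here is a short sketch via power iteration (the function name `master_distribution` is ours, and for a periodic chain one would instead solve the linear system directly):

```python
def master_distribution(W, mod_rules, iters=500):
    """Build Q[i][j] = sum over rules F with F(i) = j of W[i][f] / W_i,
    then return (approximately) the stationary p with p = pQ.

    W[i][f] >= 0 is the weight of modification rule f at action i;
    mod_rules[f] is the rule itself, given as a map from actions to actions."""
    n = len(W)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Wi = sum(W[i])
        for f, F in enumerate(mod_rules):
            Q[i][F[i]] += W[i][f] / Wi
    # power iteration on the row-stochastic matrix Q
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return p
```

For two actions and the four modification rules $\{0,1\}\to\{0,1\}$ with equal weights, the chain is uniform and the sketch returns the uniform distribution.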
Claim 15 For every action $j$, at any time $t$ we have $0 \leq \sum_{I,F} w^t_{j,I,F} \leq MK$.
Proof This clearly holds initially at $t = 0$. For any $t \geq 0$ we show that $\sum_{I,F} w^{t+1}_{j,I,F} \leq \sum_{I,F} w^t_{j,I,F}$. Recall that for $\beta \in [0,1]$ and $x \in [0,1]$ we have $\beta^x \leq 1 - (1-\beta)x$ and $\beta^{-\beta x} \leq 1 + (1-\beta)x$; expanding the update with these inequalities, and using in the second-to-last equality the identity $\sum_F \ell^t_{F(j)} W^t_{j,F} = \ell^t_{H,j} W^t_j$, shows that the sum of the weights does not increase.
The following theorem derives the general regret bound.
Theorem 16 For every time-selection function $I \in \mathcal{I}$ and modification rule $F \in \mathcal{F}$, we have
$$L_{H,I} \;\leq\; L_{H,I,F} + O\big(\sqrt{N L_{max}\log(MK)} + N\log(MK)\big),$$
where $L_{max}$ bounds the loss under any time selection. Proof By Claim 15, $w^T_{j,I,F} \leq MK$ for every action $j$, which gives $\beta\sum_t I(t)p^t_j\ell^t_{H,j} \leq \sum_t I(t)p^t_j\ell^t_{F(j)} + \log(MK)/\log(1/\beta)$. Summing over all actions $j$, the left-hand sum is $L_{H,I}$. Therefore $\beta L_{H,I} \leq L_{H,I,F} + N\log(MK)/\log(1/\beta)$, where $L_{H,I}$ is the cost of the online algorithm at time selection $I$ and $L_{H,I,F}$ is the cost of the modified output sequence at time selection $I$. Optimizing for $\beta$ we derive the theorem.

Boolean Prediction with Time Selection
In this section we consider the case in which there are two actions $\{0,1\}$, and the loss function is such that at every time step one action has loss one and the other has loss zero. Namely, we assume that the adversary returns at time $t$ an action $o^t \in \{0,1\}$, and the loss of action $a^t$ is 1 if $a^t \neq o^t$ and 0 if $a^t = o^t$. Our objective here is to achieve good bounds with a deterministic algorithm.
For each time-selection function $I \in \mathcal{I}$, action $a \in \{0,1\}$, and time $t$, our online Boolean prediction algorithm maintains a weight $w^t_{a,I}$. Initially we set $w^0_{a,I} = 1$ for every action $a \in \{0,1\}$ and time-selection function $I \in \mathcal{I}$. At time $t$, for each action $a \in \{0,1\}$, we compute $w^t_a = \sum_I I(t)w^t_{a,I}$, and predict $a^t = 1$ if $w^t_1 \geq w^t_0$ and $a^t = 0$ otherwise. The weighted error of our online Boolean prediction algorithm during time-selection function $I \in \mathcal{I}$ is $\sum_{t:\,o^t \neq a^t} I(t)$.
Following our prediction we observe the adversary action $o^t$. If no error occurred (i.e., $a^t = o^t$) then all the weights at time $t+1$ equal the weights at time $t$. If an error occurred (i.e., $a^t \neq o^t$) then for every action $b$ and every $I \in \mathcal{I}$ we set $w^{t+1}_{b,I} = w^t_{b,I}\,2^{cI(t)}$, where $c = 1/2$ if $b = o^t$ and $c = -1$ if $b \neq o^t$. Proof (of Claim 17) Clearly this holds at time $t = 0$. When an error is made, we have $w^t_{error} \geq w^t_{correct}$, where $correct = o^t$ and $error = 1 - o^t$. The additive change in the weights is at most $(\sqrt{2}-1)w^t_{correct} - w^t_{error}/2 \leq (\sqrt{2} - 3/2)\,w^t_{correct} < 0$, which completes the proof.
For a time-selection function $I \in \mathcal{I}$, let $v_{a,I} = \sum_{t:\,o^t = a} I(t)$. The preferred action for a time-selection function $I$ is 1 if $v_{1,I} \geq v_{0,I}$ and 0 otherwise. Let $OPT(I)$ be the weighted error of the preferred action during time-selection function $I$. W.l.o.g., assume that the preferred action for $I$ is 1, which implies that $OPT(I) = v_{0,I}$. By Claim 17 we have that $w^t_{1,I} \leq 2M$. The total decrease in $w^t_{1,I}$ is bounded by a factor of $2^{-v_{0,I}}$. Letting $x$ denote the total weighted increase, since $w^T_{1,I} \leq 2M$ we have $2^{x/2 - v_{0,I}} \leq 2M$, which implies that $x \leq 2v_{0,I} + 2 + 2\log_2 M$. The weighted error of our online Boolean prediction algorithm during time-selection function $I \in \mathcal{I}$, i.e., $\sum_{t:\,a^t \neq o^t} I(t)$, is at most $x + v_{0,I}$, while the preferred action makes only $v_{0,I}$ weighted errors. This implies that the weighted error of our algorithm during time-selection function $I$ is bounded by $3v_{0,I} + 2 + 2\log_2 M$, which establishes the following theorem.
Theorem 18 For every $I \in \mathcal{I}$, our online algorithm makes at most $3\,OPT(I) + 2\log_2 M + 2$ weighted errors.
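A minimal runnable sketch of this deterministic predictor (the function name `boolean_predict` is ours):

```python
import math

def boolean_predict(outcomes, selectors):
    """Deterministic Boolean prediction with time-selection functions.

    outcomes[t] in {0, 1}; selectors is a list of functions I(t) -> [0, 1].
    On a mistake, the correct action's weights grow by 2^{I(t)/2} and the
    wrong action's weights shrink by 2^{-I(t)}."""
    m = len(selectors)
    w = {a: [1.0] * m for a in (0, 1)}
    weighted_errors = [0.0] * m
    for t, o in enumerate(outcomes):
        sel = [I(t) for I in selectors]
        w1 = sum(s * x for s, x in zip(sel, w[1]))
        w0 = sum(s * x for s, x in zip(sel, w[0]))
        pred = 1 if w1 >= w0 else 0
        if pred != o:
            for i in range(m):
                weighted_errors[i] += sel[i]
                w[o][i] *= 2.0 ** (sel[i] / 2.0)   # correct action, c = 1/2
                w[1 - o][i] *= 2.0 ** (-sel[i])    # wrong action, c = -1
    return weighted_errors
```

The mistake bound of Theorem 18 can then be checked against $v_{0,I}$ and $v_{1,I}$ computed from the outcome sequence.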

Conclusion and open problems
In this paper we give general reductions by which algorithms achieving good external regret can be converted to algorithms with good internal (or swap) regret, and in addition develop algorithms for a generalization of the sleeping experts scenario including both real-valued time-selection functions and a finite set of modification rules.
A key open problem left by this work is whether it is possible to achieve swap regret that has a logarithmic or even sublinear dependence on $N$. Specifically, for external regret, existing algorithms achieve regret $\epsilon T$ in time $T = O(\frac{1}{\epsilon^2}\log N)$, but our algorithms for swap regret achieve regret $\epsilon T$ only by time $T = O(\frac{1}{\epsilon^2}N\log N)$. We have shown that sublinear dependence is not possible against an adaptive adversary when swap regret is defined with respect to the actions actually chosen from the algorithm's distribution, but we do not know whether there is a comparable lower bound in the distributional setting (where swap regret is defined with respect to the distributions $p^t$ themselves), which is the model we used for all the algorithms in this work. In particular, an algorithm with lower dependence on $N$ would imply a more efficient (in terms of number of rounds) procedure for achieving an approximate correlated equilibrium.
proof of the claim. We use the above claim to bound the weight of any action $a$ and time-selection function $I$. Corollary 12 For every action $a$ and time selection $I$ we have $w^t_{a,I} = \beta^{L^t_{a,I} - \beta L^t_{H,I}} \leq MN$, where $L_{H,I} = \sum_t I(t)\ell^t_H$ is the loss of the online algorithm with respect to time-selection function $I$. A simple algebraic manipulation of the above implies the following theorem. Theorem 13 For every action $a$ and every time-selection function $I \in \mathcal{I}$ we have $L_{H,I} \leq \big(L_{a,I} + \log(NM)/\log(1/\beta)\big)/\beta$. We can optimize for $\beta$ in advance, or do it dynamically using Auer et al. (2002a), establishing: Corollary 14 For every action $a$ and every time-selection function $I \in \mathcal{I}$ we have $L_{H,I} \leq L_{a,I} + O(\sqrt{L_{min}\log(NM)} + \log(MN))$, where $L_{min} = \max_I \min_a\{L_{a,I}\}$. Remark: One can get a more refined regret bound of $O(\sqrt{L_{min,I}\log(NM)} + \log(MN))$ with respect to each time-selection function $I \in \mathcal{I}$, where $L_{min,I} = \min_a\{L_{a,I}\}$. This is achieved by keeping a parameter $\beta_I$ for each time-selection function $I \in \mathcal{I}$; as before, we then set $w^t_{a,I} = \beta_I^{-\tilde{R}^t_{a,I}}$. For every time-selection function $I \in \mathcal{I}$ we set the weight of action $b$ to $w^{t+1}_{b,I} = w^t_{b,I}\,2^{cI(t)}$, where $c = 1/2$ if $b = o^t$ and $c = -1$ if $b \neq o^t$. We establish the following claim. Claim 17 At any time $t$ we have $0 \leq \sum_{a,I} w^t_{a,I} \leq 2M$.
$w^t_{a,I} = \beta^{-\tilde{R}^t_{a,I}}$, where $\tilde{R}^t_{a,I}$ is the "less-strict" regret mentioned above up to time $t$. At time $t$ we define $w^t_a = \sum_I I(t)w^t_{a,I}$, $W^t = \sum_a w^t_a$ and $p^t_a = w^t_a/W^t$. Our distribution over actions at time $t$ is $p^t$. The following claim shows that the weights remain bounded. Claim 11 At any time $t$ we have $0 \leq \sum_{a,I} w^t_{a,I} \leq NM$. Proof Initially, at time $t = 0$, the claim clearly holds. Observe that at time $t$ we have the following identity: $\sum_{a,I} I(t)w^t_{a,I}\ell^t_a = \sum_a w^t_a\ell^t_a = W^t\ell^t_H$.