Adaptive Algorithm for Multi-Armed Bandit Problem with High-Dimensional Covariates

Abstract This article studies an important sequential decision making problem known as the multi-armed stochastic bandit problem with covariates. Under a linear bandit framework with high-dimensional covariates, we propose a general multi-stage arm allocation algorithm that integrates both arm elimination and randomized assignment strategies. By employing a class of high-dimensional regression methods for coefficient estimation, the proposed algorithm is shown to have near optimal finite-time regret performance under a new study scope that requires neither a margin condition nor a reward gap condition for competitive arms. Building on a newly verified benefit of the margin, our algorithm automatically adapts to the margin and gap conditions and attains optimal regret rates simultaneously for both study scopes, without or with the margin, up to a logarithmic factor. Besides the desirable regret performance, the proposed algorithm simultaneously generates useful coefficient estimation output for competitive arms and is shown to achieve both estimation consistency and variable selection consistency. Promising empirical performance is demonstrated through extensive simulations and two real data examples. Supplementary materials for this article are available online.


Introduction
Sequential decision making problems are commonly encountered optimization tasks with important modern applications. For example, in medical service, a physician must decide the appropriate dose level for prescriptions, with the hope of maximizing patients' well-being and preventing adverse effects; in online service, a news website must recommend "top" news articles from multiple candidate articles to upcoming visitors to attract more readers; in financial service, a lending firm must decide whether and under what terms to approve upcoming applicants' loan requests, so as to reduce overall default rates. These decision making problems can be formulated as the multi-armed stochastic bandit problem: at each user visit, an agent must choose one of the candidate decision arms (e.g., news articles) and then observe a reward (e.g., 1 for reading and 0 for not reading) from the chosen arm, where the reward follows some unknown distribution; the primary target is to maximize the overall reward over a certain number of visits.
The classic settings (Robbins 1954; Berry and Fristedt 1985; Lai and Robbins 1985; Lai 1987; Gittins 1989; Auer, Cesa-Bianchi, and Fischer 2002) typically assume that the reward distribution of each arm is homogeneous. See, for example, Bubeck and Cesa-Bianchi (2012), Lattimore and Szepesvári (2020), Chan (2020), and references therein for a recent overview of algorithm efficiencies under related settings. In many real applications, we have access to extra covariate information from users of the service, which holds promise for personalized service. In personalized medical service, for example, the treatment effect can depend on a patient's medical profile, such as age, medical history, and genetic information; in personalized online service, a reader's interest in news article contents may also be associated with information such as location and browsing history. This promising variation of sequential decision making problems that incorporates user-space covariates is known as the multi-armed bandit problem with covariates.
CONTACT: Ching-Kang Ing, cking@stat.nthu.edu.tw, Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.
Initiated by Woodroofe (1979), bandit problems with covariates tend to be classified into two categories according to assumptions on the mean reward functions. The first category is referred to as the nonparametric bandit problem with covariates, in which the mean reward functions are assumed to satisfy mild smoothness conditions. Notably, Yang and Zhu (2002) studied strong consistency properties of a class of randomized allocation algorithms. Rigollet and Zeevi (2010) and Perchet and Rigollet (2013) proposed arm-elimination type algorithms and established their near minimax rates for cumulative regrets. Related recent work in this category can be found in Qian and Yang (2016a, 2016b), Guan and Jiang (2018), and Reeve, Mellor, and Brown (2018).
The second category is called the parametric linear bandit problem with covariates, where the mean reward functions take a linear form with unknown arm-specific parameters.
In this category, Goldenshluger and Zeevi (2009, 2013) and Bastani and Bayati (2020) considered fixed-dimensional and high-dimensional covariates, respectively, and showed that their forced sampling algorithms with exploitation achieve (near) minimax rates when a margin condition (Tsybakov 2004) and a constant gap condition are imposed. However, the performance of their algorithms remains unknown in more general scenarios where these two conditions are possibly violated. A detailed discussion of these conditions is given in Section 6 to exhibit the valuable connection and critical difference between our work and the literature.
In this article, we propose a multi-stage arm allocation algorithm with arm elimination and randomized allocation to solve the linear bandit problem with high-dimensional covariates. We particularly study the integration of a class of stepwise-type high-dimensional regression methods into the proposed approach and develop new technical tools to analyze the non-iid samples induced by the arm allocation of the bandit algorithm. Our work significantly extends the theoretical understanding under the parametric framework; the main contributions are outlined as follows.
First, this article investigates a new study scope that does not necessarily require the margin condition or the constant gap condition for competitive arms (the arms with positive probabilities of being optimal), and demonstrates a finite-time regret analysis that shows near minimax optimal performance of the proposed algorithm (Section 5.2). To our knowledge, no other existing algorithm is known to work under this new study scope (see also the discussion in Section 6.1). By the discovery of an intriguing connection between the margin and the gap conditions, our new results on regret analysis also synergistically complement the existing literature and together verify the "benefit" of margin conditions in a minimax sense: if satisfied, they can lead to significantly improved regret rates. Second, our algorithm enjoys adaptive performance, in that it automatically captures the regret benefit under the margin and constant gap conditions and always maintains near-optimal performance regardless of whether these conditions are satisfied (Section 6). This appears to be the first study to exhibit such an adaptive phenomenon for linear bandits with high-dimensional covariates. Third, we show that the outputs of our bandit algorithm possess desirable statistical properties, including parameter estimation consistency and variable selection consistency for competitive arms (Section 5.3). Note that variable selection consistency with simultaneous optimal regret guarantees (without or with the margin and constant gap conditions) has not been reported elsewhere in the literature. Also, promising applications of our proposal are demonstrated through two real data examples on drug dose assignment and news article recommendation.
It is worth noting that bandit problems have been studied under other related settings. The examples include best policy matching (e.g., Langford and Zhang 2008; Agarwal et al. 2014), arm-space (with or without user-space) contextual bandits (e.g., Auer, Ortner, and Szepesvári 2007; Abbasi-Yadkori, Pál, and Szepesvári 2011), difficulty links between simple and cumulative regret minimization (Bubeck, Munos, and Stoltz 2011), the multi-class banditron (e.g., Kakade, Shalev-Shwartz, and Tewari 2008; Beygelzimer, Orabona, and Zhang 2017), Bayesian-type approaches (e.g., May et al. 2012; Laber et al. 2018), and bandits with delayed feedback (e.g., Bistritz et al. 2019; Arya and Yang 2020), among many others (see, e.g., Cesa-Bianchi and Lugosi 2006; Bubeck and Cesa-Bianchi 2012; Zhou 2015; Lattimore and Szepesvári 2020 for bibliographic remarks, surveys, and references therein). However, these alternative settings and the corresponding algorithms do not address the main issue of this study. For example, Lattimore and Szepesvári (2020, chap. 23) studied a general arm-space setting for sparse contextual linear bandits, where the (possibly infinitely many) arms share the same unknown sparse coefficient vector. The cumulative regret of the algorithm designed for that setting increases at a polynomial rate with respect to the arm feature dimension. In contrast, our study framework focuses on a user-space setting with a finite and relatively small number of arms, each having its own individual sparse coefficients. As will be seen, the optimal arm depends on the user covariates, and the corresponding cumulative regret has the desirable logarithmic rate in terms of the user covariate dimension.
In fact, our study is in line with the very fruitful research topic known as dynamic treatment regimes (DTR; e.g., Murphy 2003; Qian and Murphy 2011; Goldberg and Kosorok 2012; McKeague and Qian 2014; Laber et al. 2014; Shi et al. 2018, among many important others). Rather than considering an iid sample with multi-time point decision rules, this article focuses on single-time point decisions for sequentially arriving users and aims to achieve guaranteed near optimal cumulative rewards for all these users as a whole.
In the remainder of the article, we provide the basic settings of the bandit problem with high-dimensional covariates in Section 2. The main algorithm and the integrated stepwise-type coefficient estimation are described in Sections 3 and 4, followed by a theoretical investigation in Section 5. The benefit of the margin condition and the algorithm's adaptive performance are studied in Section 6. Simulation and real data evaluation are given in Sections 7 and 8, respectively.
We close this section by briefly summarizing the notation consistently used in this article: n for the user visit index and N for the total number of visits; k for the stage index and K for the total number of stages; i for the arm index, I for a chosen arm, and l for the total number of arms.

Setting for Linear Bandits with High-Dimensional Covariates
In many applications, as opposed to the classical setting with homogeneous distributions, the reward from a decision arm often depends on many user covariates. In the following, we develop a new algorithm to solve the sequential decision making problem with linear mean reward structures in high-dimensional settings. Suppose there are l candidate decision arms (l ≥ 2) and let N be the total number of user visits. Given a user covariate vector X ∈ R^p and arm i (1 ≤ i ≤ l), we consider linear model structures in which the observed reward satisfies Y_i = X^T β_i + ε_i, where β_i ∈ R^p is the true coefficient vector for arm i. We assume the sparsity condition in which only a subset of elements in X is associated with Y_i; define the set of relevant variables for arm i to be V_i = {1 ≤ j ≤ p : β_{ij} ≠ 0} with q_i = |V_i|. Our problem of interest works like the classical setting but with the necessary incorporation of the covariates. At each user visit n (1 ≤ n ≤ N), a user covariate vector X_n ∈ R^p is first revealed, where the X_n's are iid from some unknown distribution (same as X) with domain X ⊂ R^p. Let I_j be the chosen arm at each visit point j (1 ≤ j < N), and let Y_{i,j} be the reward if arm i is chosen. Then, given the observable information {(X_j, I_j, Y_{I_j,j}), 1 ≤ j ≤ n − 1} and the current covariate vector X_n, a bandit algorithm is applied to choose an arm I_n and receive the corresponding reward Y_{I_n,n} = X_n^T β_{I_n} + ε_{I_n,n}, where ε_{i,n} is the random error of arm i and is not necessarily independent of X_n.
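As a concrete illustration of this reward model, the following sketch simulates a hypothetical environment; the dimensions, sparsity level, coefficient values, and noise scale are illustrative choices of ours, not settings taken from the paper.

```python
import numpy as np

def make_environment(l=3, p=50, q=5, theta=1.0, sigma=0.1, seed=0):
    """Build a toy linear bandit: each arm i has a sparse coefficient
    vector beta_i with q nonzero entries; rewards are X^T beta_i + noise."""
    rng = np.random.default_rng(seed)
    betas = np.zeros((l, p))
    for i in range(l):
        support = rng.choice(p, size=q, replace=False)
        betas[i, support] = rng.uniform(0.5, 1.5, size=q)  # nonzero on the support
    def draw(n):
        # iid covariates on a bounded domain with ||X_n||_inf <= theta
        return rng.uniform(-theta, theta, size=(n, p))
    def reward(x, i):
        # observed reward Y_{i,n} = X_n^T beta_i + eps_{i,n}
        return float(x @ betas[i] + rng.normal(0.0, sigma))
    return betas, draw, reward
```

A bandit algorithm only ever sees `draw` and the reward of the single arm it pulls; the matrix `betas` plays the role of the unknown {β_1, ..., β_l}.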

Definitions and Assumptions
Before introducing the algorithm evaluation, we first give key assumptions. For x ∈ X, define the optimal mean reward f*(x) = max_{1≤i≤l} x^T β_i. Assume that the set I = {1, ..., l} of all candidate arms can be partitioned into a set of competitive arms I^o and a set of noncompetitive arms I^u. Let T_i be the competitive region where arm i ∈ I is optimal:
T_i = {x ∈ X : x^T β_i = f*(x)}. (1)
As given in Assumption 1, we define arm i to be a competitive arm in I^o if it is an optimal arm with a positive probability bounded away from zero.

Assumption 1 (Competitive arms).
There is a positive constant c_1 such that for each arm i ∈ I^o, P(X ∈ T_i) > c_1.
As given in Assumption 2, we define arm i to be a noncompetitive arm in I^u if it is always a sub-optimal arm with a gap of at least ζ_N from the optimal reward. Here we allow I^u to be an empty set. If I^u = ∅, then Assumption 2 simply reduces to a null assumption, which is also the case in the settings of Goldenshluger and Zeevi (2013). If I^u ≠ ∅, ζ_N is allowed to approach zero as N → ∞.
Assumption 2 (Noncompetitive arms). Each arm i ∈ I^u satisfies that, with probability 1, max_{1≤j≤l} X^T β_j − X^T β_i ≥ ζ_N.
We also assume in Assumption 3 that the covariates satisfy a version of the restricted isometry property (RIP; Candes and Tao 2005). The RIP condition and its related variants have often been used in the analysis of high-dimensional linear regression methods (e.g., Meinshausen and Yu 2009; Zhang 2010, 2011b). By the nature of our targeted bandit problem with covariates, an "oracle" allocation strategy (the benchmark in the regret definition that knows the competitive regions of all the competitive arms) always delivers a competitive arm in that arm's own competitive region; it is then natural to impose conditions on the arms' own competitive regions, since under the "oracle" benchmark, each competitive arm's data points all fall within its own competitive region. Specifically, for each arm i ∈ I^o, define the conditional second moment matrix on the competitive region, Σ_i = E(X X^T | X ∈ T_i), and let λ_i(m) denote its smallest eigenvalue restricted to m-sparse directions.
Assumption 3. There exists a constant c* > 0 such that for each arm i ∈ I, λ_i(q*) > c*, where q* := C_1 max_{1≤i≤l} q_i for some constant C_1 > 1.
In Assumption 3, q* serves as an upper bound on all the q_i's of the same order as max_{i∈I} q_i; a sufficient condition for Assumption 3 is that the minimum eigenvalues of the Σ_i's, denoted by λ_min(Σ_i), are bounded away from zero.
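One common formalization of the quantity λ_i(q*) in Assumption 3 is a sparse minimum eigenvalue of the conditional second moment matrix: the smallest value of u^T Σ_i u over unit vectors u with at most q* nonzero entries. By eigenvalue interlacing, this equals the minimum of λ_min over all size-q* principal submatrices, which the following brute-force sketch computes (our illustration under that reading of the assumption; feasible only for small p).

```python
import numpy as np
from itertools import combinations

def sparse_min_eigenvalue(Sigma, m):
    """Smallest eigenvalue of Sigma over m-sparse unit directions; by
    eigenvalue interlacing this is the minimum of lambda_min over all
    size-m principal submatrices (brute force, small p only)."""
    p = Sigma.shape[0]
    return float(min(np.linalg.eigvalsh(Sigma[np.ix_(J, J)])[0]
                     for J in combinations(range(p), m)))
```

Since every principal submatrix's smallest eigenvalue is at least λ_min(Σ_i), the sufficient condition stated above (λ_min(Σ_i) bounded away from zero) indeed implies the sparse version.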
In addition, we assume bounded reward coefficients such that ||β_i||_1 ≤ b for some constant b > 0, and a sub-Gaussian condition on the random errors such that E(e^{v ε_{i,n}} | X_n) ≤ exp(v^2 σ^2 / 2) for all v ∈ R. For simplicity, we consider a bounded domain X with ||X_n||_∞ ≤ θ for some constant θ > 0, though this may be extended to covariates with a sub-Gaussian distribution.

Algorithm Evaluation
Let i*(x) = argmax_{i∈I} f_i(x) be the arm with the maximum mean reward f_i(x) = x^T β_i given x, and define f*(x) = f_{i*(x)}(x). Without knowledge of the random errors, the "oracle" (but clearly not implementable) benchmark chooses the optimal arm I*_n := i*(X_n) at each visit point n. To evaluate algorithm performance, define the cumulative regret R_N that measures the shortfall of the algorithm in cumulative mean reward compared to the "oracle" benchmark:
R_N = E[ Σ_{n=1}^{N} ( X_n^T β_{I*_n} − X_n^T β_{I_n} ) ]. (2)
It is desirable for an allocation strategy to have a guaranteed finite-time upper bound on the cumulative regret. Note that at each visit point n, only the reward of the chosen arm can be observed while the rewards of all the other arms are not observable: we inevitably encounter incomplete information under the bandit settings.
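The regret in (2) can be made concrete with a short sketch: given the realized covariates, the chosen arms, and the true coefficients, it sums the per-visit gaps between the oracle mean reward and the chosen arm's mean reward. The helper below is our hypothetical illustration, not code from the paper.

```python
import numpy as np

def cumulative_regret(X, chosen, betas):
    """R_N = sum over visits of ( f*(X_n) - X_n^T beta_{I_n} ): the oracle
    mean reward minus the chosen arm's mean reward at each visit."""
    means = X @ betas.T                           # (N, l): every arm's mean reward
    oracle = means.max(axis=1)                    # f*(X_n), the oracle benchmark
    picked = means[np.arange(len(chosen)), chosen]
    return float(np.sum(oracle - picked))
```

Note that only mean rewards enter: the random errors cancel in expectation, which is why (2) is stated through X_n^T β_i rather than the noisy observations.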
In addition, a useful but less discussed question of interest in the linear bandit problem is whether the devised algorithm outputs meaningful variable selection results for the competitive arms. Suppose that at the end of running an allocation strategy, the algorithm outputs a set of estimated competitive arms Î^o, and for each arm i ∈ Î^o, an associated estimate β̂_i = (β̂_{i1}, β̂_{i2}, ..., β̂_{ip})^T of β_i; the estimated set of important variables is defined as V̂_i = {1 ≤ j ≤ p : |β̂_{ij}| > 0}. Then we say an algorithm is variable selection consistent if
P( Î^o = I^o and V̂_i = V_i for all i ∈ I^o ) → 1 as N → ∞. (3)
It is also desirable to establish that the algorithm is coefficient estimation consistent, that is, for each competitive arm i ∈ I^o, β̂_i converges to β_i in probability; such properties are less commonly studied in bandit problems. In our bandit problem setting, these results provide asymptotic theoretical guarantees on the algorithm output for an analyst who may subsequently use the output to understand relevant variables and design new offline policies.

A Useful Example
In our following study, we will first focus on the study scope from Section 2.1, that is, the class of l-armed bandit reward function (or coefficient) sets with joint distributions P_{X,ε} of (X_n, ε_{1,n}, ..., ε_{l,n}) that satisfy all the conditions in Section 2.1. Each member of the class is characterized by a set of coefficients {β_1, ..., β_l} together with a distribution P_{X,ε}. Later, in Section 6, we will present another study scope that imposes two additional assumptions: a margin condition and a constant gap condition on competitive arms. In general, more assumptions lead to a smaller class and a potentially lower (minimax) optimal regret rate; as will be seen, the different study scopes lead to different optimality results (and different algorithmic designs).
To facilitate an appreciation of the generality and the challenges of the study scope in Section 2.1, we next present a useful example. Given l = 2 and q, define a subclass consisting of all two-armed bandit pairs of coefficients {β_1, β_2} with P_{X,ε} satisfying the following scenarios. Treating the first elements of β_1 and β_2 as intercept terms, we define the pair such that β_1 and β_2 have q nonzero elements besides the intercept, κ > 0, ω ∈ (−κ, κ), and κ√q is upper bounded by a positive constant. Also denote the covariates by X = (1, X_1, ..., X_{p−1})^T, where X_1, ..., X_{p−1} are iid Uniform[−1, 1]; conditional on X_n, the random errors ε_{1,n} and ε_{2,n} satisfy the sub-Gaussian condition. This gives simple scenarios for comparing the mean reward functions f_1(X) and f_2(X). For convenience, we denote this bandit subclass by P. All members of P satisfy the assumptions in Section 2.1 and indeed fall within the intended study scope (as shown by Propositions 7 and 8 in Supplement A.1). We can then construct a sequence of its members with both coefficient parameters κ and ω indexed by N; this example gives the properties in Proposition 1.
Proposition 1. Consider the sequence of the class members constructed above from P, with constants α > α′ > 0 and δ_N = N^{−α′}.
Proposition 1 reflects a philosophy for our proposed study in which a newly designed algorithm may ideally be able to handle increasingly close competitive arms as N gets larger, so that, to some extent, it parallels the statistical thinking that a larger sample size allows the detection of increasingly smaller treatment effects. The class P will also be helpful in establishing a regret lower bound (to be shown in Section 5.2).
Noting the polynomially decreasing δ_N in (4) and (5), it will be seen in Section 6.1 that the study scope of Section 2.1 and the associated algorithm design differ from the existing literature. On one hand, Bastani and Bayati (2020) designed novel algorithms with provable optimality under the additional margin condition and constant gap condition for competitive arms. On the other hand, neither of these two additional conditions is necessarily satisfied under the scope of Section 2.1, and the literature has not yet shown how to design a generally near optimal algorithm for it. We defer to Section 6.1 a detailed discussion of the connection between the different study scopes, without or with the two conditions.
Furthermore, it would be interesting for a newly designed algorithm to simultaneously perform optimally when these additional conditions are imposed: that is, can an algorithm adaptively achieve near optimality in both worlds of the different study scopes, and attain the potential regret "benefit" when the additional conditions are satisfied? Our efforts to address this question are presented in Section 6.2.

A Multi-Stage Algorithm in High Dimensions
Our proposed algorithm divides the total visit points into K + 1 stages, with stage 0 being the initial forced sampling stage. Here ⌈·⌉ is the ceiling function, and stage K may have a sample size less than 2N_{K−1}. We set c_0 = 32 θ^2 c_ρ c_2^{−2} (or an upper bound of it) for Section 5, where c_ρ > 0 is a constant (to be given in Theorem 1). Given stage k, define A_{k,i} = {n : Ñ_{k−1} + 1 ≤ n ≤ Ñ_k, I_n = i} to be the set of visit points at which arm i is chosen; A_{0,i} is defined similarly for the initial stage. Let X_N = (X_1, X_2, ..., X_N)^T be the N × p matrix containing all the user covariates, and let y_N = (y_1, y_2, ..., y_N)^T be the vector containing the reward responses from the chosen arms. For an index set A = {j_1 < j_2 < ... < j_{|A|}}, let X_A and y_A be the corresponding covariate design sub-matrix from X_N and reward response sub-vector from y_N, respectively; that is, row_n(X_A) = row_{j_n}(X_N) and row_n(y_A) = row_{j_n}(y_N) for 1 ≤ n ≤ |A|. We can apply a specified high-dimensional linear regression method with tuning parameter ξ to obtain the coefficient estimator β̂(X_A, y_A, ξ). In the following discussion, unless stated otherwise, we use the high-dimensional Interactive Greedy Algorithm (IGA, Qian et al. 2019), a method generalized from stepwise-type regression (e.g., Zhang 2011a, 2011b; Ing and Lai 2011). Here, ξ is the tuning parameter for IGA and regulates the estimator sparsity along the solution path. It is closely related to the penalty term of the high-dimensional information criterion (Ing and Lai 2011), which is used to overcome potential overfitting problems associated with the orthogonal greedy algorithm. We offer a brief description of the coefficient estimation by IGA in Section 4. Then, given arm i, β̂_i := β̂(X_{A_{0,i}}, y_{A_{0,i}}, ξ_0) are the estimated coefficients from stage 0; we set β̂_{i,k} := β̂(X_{A_{k−1,i}}, y_{A_{k−1,i}}, ξ_k) to be the coefficients used in stage k and estimated from the data of its previous stage, where the ξ_k's are the respective tuning parameters.
• Prescreen arms using the initial sampling data to generate the arm set S̃_n.
• If k > 1, eliminate arms in S̃_n to generate the set of "promising" arms Ŝ_n; otherwise, set Ŝ_n = S̃_n.
• Define Î_n = argmax_{i∈Ŝ_n} X_n^T β̂_{i,k}. Perform randomized allocation with parameter h ≥ 1 to choose an arm I_n from Ŝ_n and receive the reward Y_{I_n,n}.
We are now ready to describe the details of the proposed multi-stage algorithm, shown in Algorithm 1. Specifically, Step 1 is the initial sampling of stage 0, which allocates each arm an equal number of times.
Step 2 shows that for each visit point n of a given stage k, after the observation of the covariate X_n ∈ R^p, there are two arm screening substeps: (6) prescreens out uncompetitive arms, and (7) performs an extra elimination step to generate the "promising" arms used in the subsequent randomized allocation substep. We set the parameters δ_N = 2θ b_0 and Δ_k = 2θ b_k with b_0 = (q* · 2c_ρ log p_N / τ_0)^{1/2} and b_k = (q* · 2c̃_ρ log p_N / N_k)^{1/2}, k ≥ 2, for Section 5, where c_ρ and c̃_ρ are positive constants (to be given in Theorems 1 and 2). Here q* can also be replaced by a general upper bound s* (s* ≥ q*); its implication for the analysis is given in Remark 6 of Section 6.2.
In the last substep of Step 2, define Î_n = argmax_{i∈Ŝ_n} X_n^T β̂_{i,k}, where any tie-breaking rule may apply. Let h ≥ 1 be a randomization parameter. Then, under the randomized allocation scheme, we choose an arm i from Ŝ_n with probability 0 < p_{n,i} ≤ 1, where Σ_{i∈Ŝ_n} p_{n,i} = 1 and the allocation probabilities depend on Î_n and h. In particular, h = 1 corresponds to simple randomization among the arms in Ŝ_n. We use h = 1 in the theoretical development for simplicity.
Step 3 updates the coefficient estimates after the current stage. In Step 4, the algorithm moves to the next stage and continues in a stage-wise fashion until the end of the N user visits. Step 5 then outputs the estimated set of competitive arms and the associated coefficient estimates. To handle the scenario in which the last stage K has a small sample size, we use the data of the last two stages for these final estimates.
Remark 1. Algorithm 1 includes the arm prescreening substep (6) for all stages. If I^u = ∅, the algorithm can be further simplified by removing this substep. However, if I^u ≠ ∅, the optimal arm may be eliminated by a noncompetitive arm, and the analysis argument (outlined in Section 5.1 and Proposition 3 for obtaining "good" events) may not hold without this substep. The use of randomized allocation with h > 1 (as opposed to h = 1) is mainly motivated by potentially more efficient exploitation of the estimated promising arms in practice. A similar empirical idea for randomization has also been used for the nonparametric bandit problem with covariates (e.g., Qian and Yang 2016b); the feature of (nonuniform) randomized allocation, together with the embedded key arm-elimination technique (Perchet and Rigollet 2013), can provide additional practical flexibility for an algorithm to further use the reward function estimation. All theoretical results of the proposed algorithm remain the same for upper bounded h; we demonstrate the empirical performance with h > 1 in the numerical studies.
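Putting Steps 1-5 together, the stage-wise flow of Algorithm 1 can be sketched as follows. This is a simplified illustration rather than the paper's implementation: `fit` stands in for the IGA estimator, `delta_N` and `Deltas[k]` stand in for the prescreening and elimination thresholds, `draw` and `reward` are a hypothetical environment, and only simple randomization (h = 1) is shown.

```python
import numpy as np

def run_multistage(draw, reward, l, stages, fit, delta_N, Deltas, seed=0):
    """Schematic of the multi-stage algorithm with h = 1: stage-0 forced
    sampling, then per-visit prescreening, elimination, and randomized
    allocation, refitting each arm's coefficients after every stage."""
    rng = np.random.default_rng(seed)
    # Step 1 -- stage 0: forced sampling gives each arm an equal share of visits.
    buf = {i: ([], []) for i in range(l)}
    for n, x in enumerate(draw(stages[0] * l)):
        i = n % l
        buf[i][0].append(x)
        buf[i][1].append(reward(x, i))
    beta_hat = [fit(np.array(buf[i][0]), np.array(buf[i][1])) for i in range(l)]
    for k, n_k in enumerate(stages[1:], start=1):
        buf = {i: ([], []) for i in range(l)}
        for x in draw(n_k):
            est = np.array([x @ b for b in beta_hat])
            # Step 2a -- prescreen out arms far below the current best estimate.
            S = [i for i in range(l) if est[i] >= est.max() - delta_N]
            # Step 2b -- for k > 1, extra elimination among prescreened arms.
            if k > 1:
                top = max(est[i] for i in S)
                S = [i for i in S if est[i] >= top - Deltas[k]]
            # Step 2c -- simple randomization (h = 1) among "promising" arms.
            i_n = S[rng.integers(len(S))]
            buf[i_n][0].append(x)
            buf[i_n][1].append(reward(x, i_n))
        # Step 3 -- refit each arm's coefficients on its current-stage sample.
        for i in range(l):
            if buf[i][0]:
                beta_hat[i] = fit(np.array(buf[i][0]), np.array(buf[i][1]))
    return beta_hat
```

In the paper the stage lengths, thresholds, and estimator follow the specific choices of Sections 3-5; here they are left as caller-supplied placeholders.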

Coefficient Estimation
As IGA is embedded in Algorithm 1 and plays an important role in coefficient estimation, we next briefly describe the main steps of IGA, summarized in Algorithm 2, to keep the article self-contained.
Given the input design matrix X ∈ R^{m×p} and response vector y ∈ R^m, define the objective function Q(β) = (2m)^{−1} ||y − Xβ||_2^2. Let e_j ∈ R^p be the unit vector with the jth element being one. Then from Algorithm 2 (Stepwise coefficient estimation), following initialization (Step 1), the forward selection in Step 2 selects one variable into the active set G^{(r)} and drives down the objective function Q(β) in a stepwise fashion; that is, the criterion (8), based on min_α Q(β^{(r)} + α e_j), essentially considers all the candidate variables one by one and finds those that rank high in the reduction of Q(β). Alternatively, to avoid repeated optimization of the objective function and to significantly reduce computation time, we can replace (8) and φ^{(r)} with the gradient-based criterion (9), which selects the variable g maximizing |∇_g Q(β)| (so the forward step depends on ||∇Q(β)||_∞), where ∇Q(β) is the gradient vector and ∇_g Q(β) is its gth element. Without additional information on the true variables, it suffices to set ρ = 1.
Step 3 is the backward elimination step, which checks whether some variables become redundant after a new variable is included by forward selection. This forward-backward iteration continues until the addition of any new variable no longer significantly reduces the objective function, as shown in Step 4.
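The forward-backward scheme can be sketched as follows. This is a simplified stand-in of our own, not the released IGA implementation: it uses a gradient-based forward pick in the spirit of (9), refits by least squares on the active set, and replaces the tuning parameter ξ with a crude `eps` threshold for both stopping and backward elimination.

```python
import numpy as np

def forward_backward_greedy(X, y, max_steps=10, eps=1e-6):
    """Simplified forward-backward stepwise regression.
    Forward: add the variable with the largest |gradient| of
    Q(beta) = ||y - X beta||^2 / (2m). Backward: drop any active variable
    whose removal increases Q by at most eps. Stop when the best forward
    step no longer reduces Q by more than eps."""
    m, p = X.shape
    def refit(active):
        b = np.zeros(p)
        if active:
            b[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
        return b
    def Q(b):
        r = y - X @ b
        return float(r @ r) / (2 * m)
    active, beta = [], np.zeros(p)
    for _ in range(max_steps):
        grad = X.T @ (X @ beta - y) / m          # gradient of Q at beta
        j = int(np.argmax(np.abs(grad)))         # gradient-based pick, cf. (9)
        if j in active:
            break
        new_beta = refit(active + [j])
        if Q(beta) - Q(new_beta) <= eps:         # stopping rule, cf. Step 4
            break
        active, beta = active + [j], new_beta
        # backward elimination, cf. Step 3
        for g in list(active):
            rest = [a for a in active if a != g]
            b = refit(rest)
            if Q(b) - Q(beta) <= eps:
                active, beta = rest, b
    return beta, sorted(active)
```

The backward pass keeps the estimate sparse when forward selection happens to admit a spurious variable: once the true support is covered, removing a redundant variable barely changes Q, so it is dropped.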
Remark 2. Given X, y, and ξ , the output of Algorithm 2 gives the coefficient estimator β(X, y, ξ).The parameter ξ regulates the solution sparsity: a larger ξ tends to provide a sparser solution.
In empirical studies, instead of giving explicit values for ξ, we use the number of steps to determine the solution sparsity, which is automatically selected by 10-fold cross-validation (CV) on (X, y) under the mean squared error criterion. The package implementing the IGA method with CV is publicly available on GitHub. Also, in the description of Algorithm 1, we use the stage-specific sample A_{k,i} for coefficient estimation to make the proofs more concise. In practice, we recommend including all historical data from previous stages, so that β̂_{i,k+1} is estimated from all the samples accumulated for arm i up to stage k.

Understanding Algorithm Performance
To understand the performance of the proposed algorithm, it is helpful to study how the algorithm estimates the conditional mean rewards and the coefficients and how these estimates are associated with "good" events on arm selection.In Section 5.1, we outline the analysis strategy for the cumulative regret upper bounds, which consist of four main steps.We provide the upper and lower bounds on the cumulative regret in Section 5.2, and establish the variable selection and coefficient estimation consistency properties in Section 5.3.

Outline of Main Analysis Steps
The first main step is regret decomposition via partitioning of the sample space into properly defined events. Specifically, let R_{N0} and R_{N1} be the regrets accumulated in stage 0 and in the following stages, respectively, so that R_N = R_{N0} + R_{N1}. Also, for 2 ≤ k ≤ K, define events F_k and U_k on the coefficient estimation errors. The whole sample space can be partitioned with these events to further decompose the cumulative regret. To provide upper bounds for the decomposed regrets, we need to understand the properties and implications of these events, as shown in the next two main steps.
In the second main step, we intend to achieve the following specific objective (1): under "good" events, the regret can be upper-bounded via its connection with the coefficient/reward estimation errors. We divide the analysis of this step into two substeps, studying (1a) the arm prescreening behavior and (1b) the arm elimination behavior. Substeps (1a) and (1b) are summarized in Propositions 2 and 3, respectively, whose proofs are relegated to Supplement A.2.
Proposition 2. Given stage k (k ≥ 1), if the event U_k holds, then at any visit point n (Ñ_{k−1} + 1 ≤ n ≤ Ñ_k), the optimal arm I*_n remains in S̃_n, and any noncompetitive arm i ∈ I^u is excluded from S̃_n.
Proposition 3. Given stage k (k ≥ 2), if the event U_k holds, then at any visit point n (Ñ_{k−1} + 1 ≤ n ≤ Ñ_k), the optimal arm I*_n remains in Ŝ_n; in addition, any "promising" arm i ∈ Ŝ_n belongs to the arm set U_{n,k} of arms whose mean rewards are close to the optimum.
The two propositions above suggest that, with the arm prescreening and elimination procedures, the event U_k on the coefficient estimation errors leads to the "good" event that the algorithm always keeps the optimal arm while all the other remaining arms lie in U_{n,k}, thereby restricting the regret of each step to within 2Δ_k and achieving objective (1). Therefore, to study the maintenance of "good" events for arm selection, it is important to understand the coefficient estimation errors.
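The logic behind these propositions can be illustrated with a toy computation: if every arm's estimated mean reward at x is within Δ/2 of its truth, then a rule that keeps arms within Δ of the estimated best can never discard the optimal arm, and every survivor's true mean is within 2Δ of the optimum, which is the per-visit regret control used above. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def surviving_arms(x, beta_hat, Delta):
    """Arms kept when eliminating any arm whose estimated mean reward at x
    falls more than Delta below the estimated best."""
    est = beta_hat @ x
    return set(np.flatnonzero(est >= est.max() - Delta))
```

For instance, with true means (1.0, 0.8, 0.2), estimated means (0.9, 0.95, 0.3), and Δ = 0.3, the estimation errors are at most Δ/2 = 0.15, and the truly optimal arm 0 indeed survives even though arm 1 has the larger estimate.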
Due to the nature of necessarily evolving arm allocation in sequential decision making, only one response from the selected arm is revealed while responses from all the other arms are not available; the accumulated data for each arm are not iid random samples anymore (as opposed to regular settings in high-dimensional regression problems), which poses unique challenges in studying the statistical properties of the estimated coefficients.With the multi-stage approach and stage-wise arm elimination, we also employ randomized arm allocation to help partly overcome the technical issues (besides empirical performance considerations, to achieve a balance between exploration and exploitation).
In the third main step, we intend to achieve the specific objective (2): the (conditional) probabilities of violating the "good" events are relatively small. For this purpose, we establish Theorems 1 and 2 (see below), which are proved through four substeps: (2a) randomized allocation with "random" samples, (2b) sample size determination, (2c) covariate "design matrix" properties, and (2d) coefficient estimation upper bounds; the details are also relegated to Supplement A.2. Note that ξ_0 and the ξ_k's correspond to the tuning parameter ξ in Algorithm 2 used to compute β̂_i and the β̂_{i,k}'s, respectively; recall that p_N = p ∨ N.
Theorem 1. Suppose Assumptions 1-3 hold. Then there exists a positive constant c_r such that, given ξ_0 = c_r log p_N / τ_0, it holds with probability less than l/N^3 that
||β̂_i − β_i||_1 > c_ρ ( q* (q_i + log N + q_{i,0} log p_N) / τ_0 )^{1/2} for some i ∈ I,
where q_{i,0} = |J_{i,0}|, J_{i,0} = {j ∈ V_i : |β_{i,j}| < c_β (log p_N / τ_0)^{1/2}}, and c_ρ, c_β > 0 are some constants.
Theorem 2. Suppose Assumptions 1-3 hold. Then there exists a positive constant c̃_r such that given ξ_k = c̃_r √(log p_N / N_k), it holds with probability less than l/N^3 that ‖β̂_{i,k} − β_i‖_1 > c̃_ρ √(q* (q_i + log N + q̃_{i,k} log p_N) / N_k) for some i ∈ I_o, where q̃_{i,k} = |J̃_{i,k}|, J̃_{i,k} = {j ∈ V_i : |β_{i,j}| < c̃_β √(log p_N / N_k)}, and c̃_ρ, c̃_β > 0 are some constants.
These two theorems suggest that with the proposed algorithm, given U_k, the probability of violating F_{k+1} (or U_{k+1}) regarding the coefficient estimation errors is small; consequently, since U_{k+1} always implies the "good" arm selection events in the next stage, as shown in the propositions for objective (1), the same probability bound applies to violating these "good" events, thereby achieving objective (2).
As the last main step, we combine the decomposed regrets from Propositions 2 and 3 (objective (1)) with Theorems 1 and 2 (objective (2)), and subsequently assemble the cumulative regret upper bound shown next in Section 5.2.1.

Upper and Lower Bounds on Cumulative Regret
We demonstrate here the near minimax optimal regret performance of the proposed algorithm, where the upper bound and the lower bound are given in Sections 5.2.1 and 5.2.2, respectively.

Upper Bound
The analysis efforts briefly summarized in Section 5.1 enable us to provide the following finite-time regret analysis for (2).
Theorem 3. Suppose Assumptions 1-3 hold. Then there exist positive constants C_21 and C_22 such that the cumulative regret of Algorithm 1 satisfies (13) with C_21 = 4θb c_0 + 6θb and C_22 = 8θ c_ρ^{1/2}; in particular, if ψ = 0 and p = o(N^ζ) for some constant ζ > 0 with fixed l and q*, then (14) holds for any large enough N.

In Theorem 3, the upper bound of (13) consists of two components. Roughly speaking, the first component is mainly attributed to the initial forced sampling, which generates initial crude estimates for the coefficients and ensures good performance for the prescreening of the uncompetitive arms; the second component, arising mainly from the much more refined arm elimination stages for the competitive arms, is usually the dominating term, as shown by (14).
Note that under additional conditions (to be introduced in Section 6.1), existing algorithms (Goldenshluger and Zeevi 2013; Bastani and Bayati 2020) indicate that with an exploitation-based strategy, it can be ensured in the regret analysis that the optimal arm, within its competitive region with a certain constant reward gap, is exclusively selected with high probability. However, such an analysis argument is not technically feasible here. To overcome this difficulty, we employ arm elimination and randomized allocation to carefully control regret accumulation in a stage-wise fashion, thereby circumventing the need for these additional conditions. The new technical challenges inherited in the regret analysis are naturally shared with the simultaneous establishment of variable selection consistency to be shown in Section 5.3.

Lower Bound
We then seek to address whether it is possible for any alternative algorithm to achieve a regret rate much slower than that of (14). For this purpose, recall the bandit subclass P defined from the example of Section 2.3, which has been verified to satisfy all the conditions of Section 2.1.
Theorem 4. For any admissible bandit strategy, there is a positive constant C_3 such that for any large enough N, we can always find some class member in P under which the cumulative regret is at least C_3 N^{1/2}.

The regret lower bound in Theorem 4 implies that the upper bound in Theorem 3 is essentially not improvable in N (up to a logarithmic factor), and that our proposed algorithm has near minimax optimal performance under the study scope of Section 2.1.
Remark 3. In the upper-bound regret analysis, it is assumed that ‖X_n‖_∞ is bounded above by a constant θ > 0, which is involved in setting the coefficients of the algorithm parameters. This condition can be relaxed to allow element-wise sub-Gaussian conditions on the covariates. Specifically, assume that for all covariates X_n = (X_{n,1}, X_{n,2}, . . ., X_{n,p})^T, there exists some constant σ_X > 0 such that E(exp(vX_{n,j})) ≤ exp(σ_X^2 v^2 / 2) for any v ∈ R and every j. Then the following Proposition 4 shows that the regret contributed by A^c is relatively negligible.
Proposition 4. Given the sub-Gaussian conditions on the covariates, the regret accumulated on A^c, the complement of the event that all covariates are bounded by θ_N, is negligible relative to the main regret bound.

By treating A^c as a "bad" event in our regret decomposition, Proposition 4 suggests that we can just focus on the "good" event A in which all covariates are bounded by θ_N = c_x σ_X √(log p_N) and replace the constant θ by θ_N instead; as a result, the algorithm analysis under event A can be performed similarly, with the mild price on the regret rate of extra multiplicative factors of log p_N.
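As a quick numerical sanity check of this truncation argument (with Gaussian covariates standing in for general sub-Gaussian ones, and an illustrative choice c_x = 2 that is not the paper's constant), the event that a covariate vector exceeds θ_N in sup-norm is indeed rare:

```python
import numpy as np

# Monte Carlo check: for sub-Gaussian covariates, the "bad" event
# {||X_n||_inf > theta_N} with theta_N = c_x * sigma_X * sqrt(log p_N)
# occurs on only a tiny fraction of visits.
rng = np.random.default_rng(0)
N, p, sigma_X, c_x = 1000, 200, 1.0, 2.0
p_N = max(p, N)                                  # p_N = p v N
theta_N = c_x * sigma_X * np.sqrt(np.log(p_N))   # ~5.26 here

X = rng.normal(scale=sigma_X, size=(N, p))
bad_frac = np.mean(np.abs(X).max(axis=1) > theta_N)
print(bad_frac)  # typically 0 for these sizes
```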

Variable Selection and Coefficient Estimation Consistency
The proposed algorithm also generates a consistently estimated set of competitive arms Î_N and their consistently estimated coefficients, as shown in Theorem 5. Here, q̃_i is the number of variables with relatively weak signals. Note that the coefficient estimation error bound in Theorem 2 includes the slight price of an extra additive log N term; this reflects the subtle need for the bandit algorithm to simultaneously achieve the desired finite-time regret guarantees. However, this extra log N term can be removed for the coefficient estimation consistency in Theorem 5, which matches a known result for a regular sparse high-dimensional regression setting (that is, O_p(√((q_i + q̃_i log p_N)/N))).
Theorem 5. Under the same conditions of Theorem 3, the algorithm output of the estimated competitive arms satisfies P(Î_N = I_o) → 1 as N → ∞. In addition, the output of coefficient estimation for each arm i ∈ Î_N satisfies ‖β̂_i − β_i‖_1 = O_p(√((q_i + q̃_i log p_N)/N)), where q̃_i = |J̃_i| and J̃_i = {j ∈ V_i : |β_{i,j}| < c̃_β √(log p_N / N)}.

Combined with a beta-min condition, we further establish coefficient estimation and variable selection consistency simultaneously for the competitive arms in Theorem 6. Therefore, the proposed bandit algorithm also achieves the desired property (3).

Theorem 6. Suppose an arm i ∈ I_o satisfies min_{j ∈ V_i} |β_{i,j}| ≥ 4c̃_β √(log p_N / N). Then under the same conditions of Theorem 3, the output of coefficient estimation for arm i ∈ I_o satisfies 1. coefficient estimation consistency and 2. variable selection consistency.

The variable selection consistency of Theorems 5 and 6 also relies on results from the finite-time analysis, which shows the desired sparsity recovery with high probability. Indeed, it is shown in Supplement A.4 that for any large enough N, P(Î_N ≠ I_o) ≤ 3K/N, together with a corresponding high-probability bound for every i ∈ I_o.

Remark 4. From the proofs of Theorems 1 and 2, we can see that the positive constants c_r, c_ρ, c_β, c̃_r, c̃_ρ, c̃_β exist. Given that there are constants c_d, c_f > 0 associated with the IGA method, as shown in Lemma 1 of Supplement B, we can set these constants accordingly.
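The variable selection statement can be illustrated by simple thresholding of estimated coefficients: coordinates whose estimates clear a beta-min-style threshold of order √(log p_N / N) are declared selected. This is a sketch of the consistency idea only, with illustrative values for the constant c_β and the estimate, not the algorithm's actual selection rule.

```python
import numpy as np

def selected_support(beta_hat, thresh):
    """Report the coordinates of beta_hat whose magnitude exceeds the
    threshold -- a toy model of variable selection by thresholding."""
    return set(np.flatnonzero(np.abs(beta_hat) > thresh))

N, p_N, c_beta = 10_000, 10_000, 1.0
thresh = c_beta * np.sqrt(np.log(p_N) / N)       # ~0.03 here
beta_hat = np.array([0.8, 0.0, 0.5, 0.02, 0.0])  # a noisy estimate
print(selected_support(beta_hat, thresh))        # strong signals {0, 2} recovered
```

Coordinates 0 and 2 clear the threshold while the near-zero estimate 0.02 (a weak signal below the beta-min level) is screened out.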

Benefit of Margin Condition
A margin condition is known as an assumption that regulates the complexity and rates of convergence for classification and estimation problems (Mammen and Tsybakov 1999; Tsybakov 2004; Audibert and Tsybakov 2007). To fully appreciate the contribution of our new algorithm design in this work and discern its distinction from the existing literature, it is helpful to consider and discuss a margin condition under linear bandits with covariates. In particular, a margin condition has been assumed and carefully studied in earlier work under both the fixed-dimension setting (Goldenshluger and Zeevi 2013) and the targeted high-dimensional setting (Bastani and Bayati 2020); their corresponding bandit algorithms are well designed to optimally solve the problem under both a margin condition and a constant gap condition. We next define these conditions in terms of the gap, for x ∈ X, between the optimal mean reward and the sub-optimal mean rewards.

Assumption 4. There exists a positive constant L such that given any δ > 0, the probability that this gap lies in (0, δ] is at most Lδ.

Assumption 4 requires that, except for a subset of the domain with small probability close to the decision boundary, the optimal mean reward can be separated from the sub-optimal rewards by an arbitrarily small δ. Alongside the margin condition, earlier work also assumes the following constant gap condition.
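A margin condition of this linear form can be checked numerically. The sketch below estimates, for a hypothetical two-arm linear model with uniform covariates (coefficients chosen purely for illustration), the probability that the reward gap falls in (0, δ]; it shrinks roughly linearly in δ, as Assumption 4 requires.

```python
import numpy as np

# Monte Carlo illustration of a margin condition: estimate
# P(0 < |x^T (beta_1 - beta_2)| <= delta) for decreasing delta and
# observe approximately linear decay in delta.
rng = np.random.default_rng(1)
beta_diff = np.array([1.0, -1.0])                # illustrative coefficient gap
X = rng.uniform(-1, 1, size=(200_000, 2))
gap = np.abs(X @ beta_diff)

for delta in (0.4, 0.2, 0.1):
    print(delta, np.mean((gap > 0) & (gap <= delta)))
```

For this example the exact probability is δ − δ²/4, so halving δ roughly halves the mass near the decision boundary, consistent with a margin constant L close to 1.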
Assumption 5. There are positive constants h, c_1 > 0 such that for each arm i ∈ I_o, P(X ∈ T_i) > c_1, where T_i denotes the region of covariates on which arm i is optimal by a reward gap of at least h.

First, we discover that the margin condition of Assumption 4 and the gap condition of Assumption 5 are closely related. Indeed, as shown in the first statement of Proposition 5, if we impose the margin condition in addition to the conditions of Section 2.1, then the resulting study scope becomes largely equivalent to that of Bastani and Bayati (2020), since it is guaranteed that Assumption 5 is also satisfied.
The second statement of Proposition 5 implies that the study scope of Bastani and Bayati (2020) is subsumed in (and is smaller than) that of Section 2.1. In particular, neither Assumption 4 nor Assumption 5 is necessarily satisfied under the study scope of Section 2.1 with Assumption 1: indeed, as an example, the bandit class P of the example in Section 2.3, together with Proposition 1, implies the following result.

Proposition 6. Assumptions 1-3 are satisfied for all the class members in P, but neither Assumption 4 nor Assumption 5 holds for all the members in P.
Consequently, in light of the connection illustrated by Proposition 5, the key difference in the study scopes and the regret bounds for Section 2.1 from the existing literature lies in the margin condition.In a synergistic manner, our regret bounds in Section 5.2 complement earlier results with the margin condition (Goldenshluger and Zeevi 2013;Bastani and Bayati 2020), and together verify the benefit of a margin condition to achieve a significantly improved regret rate (from polynomial to logarithmic).
Remark 5. The discussion above resolves the seemingly contradictory optimal regret rates for the bandit problem with high-dimensional covariates: in Section 5.2, we show that the near N^{1/2} rate is optimal and is achievable by Algorithm 1, but the existing literature (Bastani and Bayati 2020) shows that the near log N rate is optimal and is achievable by an exploitation-based algorithm. There is no conflict here, since the study scope of Section 2.1 imposes no assumption on the margin (or the related constant gap condition); hence, under this more "difficult" situation without assuming the margin, it is natural that the optimal regret rate is higher than the logarithmic rate; Theorem 4 has shown that no algorithm is able to give a regret rate lower than N^{1/2}. To some extent, this observation of different optimal regret rates is reminiscent of the intriguing debates on the optimal convergence rates (and their associated classifier rules) for nonparametric classification in the statistics literature, as discussed by Tsybakov (2004, p. 146): How fast can the convergence of classifiers be and how does one construct the classifiers that have optimal convergence rates?... Yang (1999) claims that the optimal rates are quite slow (substantially slower than n^{−1/2}) and they are attained with plug-in rules; Mammen and Tsybakov (1999) claim that the rates are fast (between n^{−1/2} and n^{−1}) and they are attained by ERM (empirical risk minimization rules) and related classifiers.... In fact, there is no contradiction since different classes of joint distributions of (X, Y) are considered. Yang (1999) ... do not impose assumption on the margin. Therefore, it is not surprising that they get rates slower than n^{−1/2}: one cannot obtain a rate faster than n^{−1/2} with no assumptions on the margin.... On the contrary, Mammen and Tsybakov (1999) ... show what can be achieved when ... assumption on the margin holds. In this case the fast rates (up to n^{−1}) are realizable.
Therefore, the results presented in this section for the targeted bandit problem with covariates pleasantly join the celebrated group of known benefits by margin conditions (if satisfied) as exhibited in nonparametric estimation and nonparametric bandit problems (Tsybakov 2004;Audibert and Tsybakov 2007;Rigollet and Zeevi 2010;Perchet and Rigollet 2013).

Achieving Regret Benefit Adaptively
An important question naturally arises from our discussion in Section 6.1: since it is usually unknown whether the margin condition (or the closely related constant gap condition) holds, is it possible to design a bandit algorithm that adaptively achieves the regret benefit from the margin condition? That is, does there exist an algorithm that can simultaneously perform optimally under both of the study scopes, without or with assuming the margin, and automatically take advantage of the desirable regret benefit if the margin condition is satisfied? To a large extent, this question also resembles the spirit of adaptive performance with respect to the margin proposed for classical classification and estimation problems (Tsybakov 2004). In the following, we provide an affirmative answer and show that our proposed algorithm indeed adapts to the two different study scopes and always attains near optimal regret rates (up to a logarithmic factor), regardless of whether the margin condition holds.

Assumption 6. If I_u ≠ ∅, Assumption 2 holds with ψ = 0.
Like Assumptions 4 and 5, Assumption 6 above for noncompetitive arms was also used in Bastani and Bayati (2020); it considers a special case of Assumption 2. Our study scope in this section, similar to that of Bastani and Bayati (2020), is now devised to be the bandit class that imposes Assumptions 4 and 6 in addition to those of Section 2.

Theorem 7. Suppose Assumptions 4 and 6 and the conditions of Theorem 3 hold. Then there exists a positive constant C̃_2 such that the cumulative regret of Algorithm 1 is at most C̃_2 l q*^2 log p_N log N, with C̃_2 = 4θb c_0 + 6θb + 32θ^2 c̃_ρ.
Using the same algorithm designed in Section 3, Theorem 7 shows that under the margin condition, our algorithm also enjoys a nearly optimal regret rate up to a logarithmic factor (the lower bound is given by Goldenshluger and Zeevi 2013); for example, if l and q* are upper bounded and p = o(N^ζ) for some constant ζ > 0, then the regret upper bound in Theorem 7 simplifies to O((log N)^2). The upper bound here slightly improves on the result in Bastani and Bayati (2020) by removing an additive term of O((log p)^2). This result, together with Theorems 3 and 4, confirms that our proposed algorithm simultaneously enjoys near optimal performance under both study scopes given in Sections 2.1 and 6.
In addition, as the conditions of Theorem 3 are still satisfied here, the variable selection consistency results of Theorem 6 for the proposed algorithm continue to hold under the margin.

Remark 6. For studying Algorithm 1 in the previous two sections, to help maintain the "good" events of arm elimination and selection required by Propositions 2 and 3 with high probabilities, the coefficients used in the parameters τ_0, δ_N, and ε_k involve q*, an upper bound of max_{i∈I} q_i of the same order. We can also replace q* with a general upper bound s* (s* ≥ q*) in setting these coefficients; the proofs then remain largely the same, although as a mild compromise, q* should be replaced by s* in the regret upper bounds of Theorems 3 and 7 as well. We note that the use of a general upper bound s* in setting algorithm parameter coefficients for theoretical development was also required in the related literature; for example, the regret bound in Theorem 7 becomes O(l s*^2 log p_N log N), and the quadratic rate in s* matches the result of Bastani and Bayati (2020), which required both Assumptions 4 and 5. In addition, the regret lower bounds with the margin (Goldenshluger and Zeevi 2013) and without the margin (Theorem 4) are both in terms of N only. It remains unclear whether s* can be unknown to an algorithm and whether a matching bound in s* can be obtained. We leave these as challenging open questions for future investigation.

Simulation
We next evaluate the performance of the proposed bandit algorithms on simulated data. For brevity, the multi-stage type algorithms described in Section 3 are abbreviated as "MS." We considered IGA and lasso as the methods for coefficient estimation and denote the corresponding bandit algorithms by MS-IGA and MS-lasso. For comparison, we used the MS algorithm without any covariates (denoted by MS-simple); that is, the mean reward estimates in Algorithm 1 were replaced by the simple averages of the accumulated response values of each arm. We also considered the bandit algorithm in Bastani and Bayati (2020) as a useful benchmark (denoted by B-lasso). Due to the page limit, all simulation settings and results are relegated to Supplement C, where we evaluate the performance of the proposed algorithms and perform a sensitivity analysis on parameter choices.

Real Data Evaluation
We next use two real datasets to evaluate the performance of the proposed algorithm. One challenge naturally arises due to the incomplete nature of the datasets for the bandit setting: unlike simulation, for each user visit, we only observe the user response to one selected arm. To account for such limited feedback, the following two datasets require different evaluation strategies, which will be described in their respective sections. In addition, to achieve faster computation for MS-IGA, we used the gradient version of Algorithm 2, which replaces criterion (8) with (9). The parameters were chosen in the same way as discussed in Supplement C.

Warfarin Dose Assignment
Warfarin is a widely used anticoagulant, and its appropriate dosing is important for the prevention of adverse events (International Warfarin Pharmacogenetics Consortium 2009). The warfarin dataset (available from https://www.pharmgkb.org) contains 6922 patient records, each of which has covariate information including demographic variables (e.g., gender, ethnicity, age), clinical background variables (e.g., height, weight, comorbidities, medication, smoking), and genotypic variables (CYP2C9 and VKORC1 genetic variants). We converted categorical variables to corresponding binary indicators and replaced missing values with the respective sample means, which resulted in 127 covariates for each patient. In addition, the continuous outcome variable was the stable therapeutic dose of warfarin, and we included 6037 patients for bandit algorithm evaluation after removing records with missing dose values.
To generate bandit arms, we categorized the outcome variable by grouping it into l (l = 2, 3, 4) categories, using the l-quantiles as break points (that is, the median for l = 2, tertiles for l = 3, and quartiles for l = 4), so that each arm (or category) in the dataset corresponds to approximately the same number of patients. Since the outcome variable is the doctor-prescribed steady-state dose value that gave stable anticoagulation levels, if the therapeutic dose value fell in the category of an arm i*, we set this arm i* to be the patient's optimal arm with reward 1, while all the other arms j (j ≠ i*) were considered sub-optimal with reward 0. This setting allowed us to evaluate any bandit algorithm: an algorithm incurs no regret if it chooses i* for the patient, and incurs unit regret otherwise. We randomized the order of patient visits and ran the bandit algorithms sequentially over the whole dataset to record the final per-round regret r_N, the sample size of each chosen arm n_i, and the number of selected variables nVar_i (i = 1, . . ., l). The experiment was repeated 100 times with permuted visit orders; the averaged results are summarized in Figure 1 and Table 1.
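The arm construction and the unit-regret evaluation can be sketched as follows, on synthetic doses standing in for the real dataset (the lognormal distribution and all sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
dose = rng.lognormal(mean=1.5, sigma=0.4, size=1000)  # synthetic stand-in doses

def optimal_arms(dose, l):
    """Bin a continuous outcome into l quantile-based categories; the bin
    containing a patient's therapeutic dose is that patient's optimal arm."""
    cuts = np.quantile(dose, np.linspace(0, 1, l + 1)[1:-1])  # inner l-quantiles
    return np.digitize(dose, cuts)  # arm index in {0, ..., l-1}

arms3 = optimal_arms(dose, 3)
counts = np.bincount(arms3, minlength=3)
print(counts)  # arms are roughly balanced by construction

# unit-regret evaluation of an arbitrary chosen-arm sequence
chosen = rng.integers(0, 3, size=dose.size)  # a uniformly random policy
per_round_regret = np.mean(chosen != arms3)
print(per_round_regret)
```

A uniformly random policy picks the wrong bin about two-thirds of the time under three arms, which gives the baseline that covariate-aware algorithms should beat.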
The boxplots in Figure 1 show that MS-simple, which ignores covariates, yielded the least favorable performance in all three scenarios, indicating the effectiveness of using covariate information in choosing warfarin doses. Together with Table 1, we observe that MS-IGA performed better than MS-lasso in these scenarios; MS-IGA also performed very competitively compared to the benchmark and had reduced variability in per-round regret. In addition, the averaged sample sizes of different arms appear more balanced for MS-IGA than for the benchmark, particularly under the 3-arm and 4-arm scenarios; to some extent, this may reflect the less greedy nature of the proposed algorithm. MS-IGA often selected fewer variables than the benchmark; the exceptions come from arm 3 of the 3-arm scenario and arm 4 of the 4-arm scenario, as these arms were chosen less often than the other candidate arms by the benchmark.

News Article Recommendation
In the following, we use the Yahoo! front page user click log dataset (version 2.0; Yahoo! Academic Relations 2011; available from http://webscope.sandbox.yahoo.com). The complete set includes about 28 million user visits to the news front page from October 2 to 16, 2011, and each user visit record has 135 binary user covariates and a pool of candidate news articles. One article is chosen uniformly at random from the pool and is displayed to the user; the binary user response to the selected article is also recorded, with 1 for click and 0 for nonclick. As the candidate pools of news articles are dynamic and the popularity of a news article can change in the long run, to account for these complications in algorithm evaluation, we adopted a screening strategy similar to May et al. (2012) and only considered short-term performance using data collected on the first day (October 2, 2011), with a three-article set (ids 563115, 563846, 565822) as the stationary candidate arms. Accordingly, we retained the user visit records where the candidate pool contained all three articles and the displayed article was one of them. The resulting reduced dataset contained 148,341 user visits for subsequent bandit algorithm evaluation.
Unlike the warfarin dose data, since a randomly selected news article is displayed at each visit, we should not assume the optimal arm is known. Instead, we applied the unbiased offline evaluation strategy developed in Li et al. (2010) to evaluate a bandit algorithm. That is, for each user visit, if the arm chosen by the algorithm matched the displayed arm, we kept this visit as a "valid" data point for algorithm use; otherwise, this visit record was ignored and not accessible by the algorithm. Accordingly, each algorithm ran through the dataset sequentially until N "valid" data points were obtained with N = 30,000; the resulting "valid" data were used to calculate the click-through rate (CTR) as an unbiased evaluation of the bandit algorithm performance. We ran the MS-simple, B-lasso, and MS-IGA algorithms over a random permutation of the reduced dataset and repeated the experiment 100 times. We used the averaged CTR from a complete random strategy (which chose arms uniformly at random) to generate each algorithm's relative CTR by computing the ratio between the algorithm's CTR and that of the complete random strategy. We then summarize the numerical results in Figure 2 and Table 2.
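The replay evaluation can be sketched in a few lines. This is a minimal version of the Li et al. (2010) idea; the log format (covariates, displayed arm, click) and the toy click model below are illustrative assumptions, not the Yahoo! dataset's schema.

```python
import random

def replay_ctr(log, policy, n_valid):
    """Unbiased offline replay evaluation: a logged visit counts only when
    the policy's choice matches the (uniformly random) displayed arm; CTR
    is averaged over the first n_valid matched visits."""
    clicks, used = 0, 0
    for x, displayed, click in log:
        if policy(x) == displayed:
            clicks += click
            used += 1
            if used == n_valid:
                break
    return clicks / used

# toy log: arm 1 always yields a click, arms 0 and 2 never do
rng = random.Random(0)
log = [(None, a, int(a == 1))
       for a in (rng.randrange(3) for _ in range(3000))]
print(replay_ctr(log, lambda x: 1, n_valid=500))  # 1.0: always picking arm 1
```

Because the displayed arm is uniform over the pool, the matched visits form an unbiased sample of how the evaluated policy would have performed online.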
Compared to the complete random strategy, we observe from the plots in Figure 2 that MS-simple (without considering covariates) significantly improves the averaged CTR by about 4%. MS-IGA further improves the averaged CTR, which can be attributed to the use of user covariates in the reward modeling, while the benchmark surprisingly underperforms. The very unbalanced arm sample sizes from the benchmark suggest that its observed result could again be due to the greedier nature of the benchmark, which is designed to emphasize arm exploitation more than the MS-type algorithms do; as a numerical check, we then revised the benchmark by keeping the lasso as the coefficient estimation method (with the same tuning parameter setting as B-lasso) but adopting our MS-type algorithm instead (we thus denote it by MS-B-lasso). Interestingly, as shown in Table 2, MS-B-lasso performs competitively with MS-IGA in this case, with less sparse variable selection outcomes and reasonably balanced sample sizes.

Discussion
We study the bandit problem with high-dimensional covariates by designing an adaptive algorithm with arm elimination and randomized allocation. The algorithm enjoys near minimax optimal regret performance under both study scopes (without or with the margin) and demonstrates adaptive performance via one unified algorithm. We also establish simultaneous coefficient estimation and variable selection consistency for the output of the proposed algorithm. The extensive numerical studies indicate that our proposal holds promise for real applications in personalized medical and online services. The previous discussion implicitly assumes that the total number of visits N is known a priori; if N is unknown, the proposed approach can be extended by employing the "doubling argument" (e.g., Cesa-Bianchi and Lugosi 2006; Perchet and Rigollet 2013). Although we only used IGA (as opposed to lasso) for Algorithm 1 to help achieve variable selection consistency with improved coefficient estimation consistency, we expect that popular shrinkage-type regression methods such as the adaptive lasso, SCAD, and MCP (Fan and Li 2001; Zou 2006; Zhang 2010) could be other promising coefficient estimation candidates to integrate into bandit problem algorithms; a comprehensive and rigorous investigation of their theoretical and numerical properties could be of independent interest and is left for future studies.
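The standard doubling argument mentioned above can be sketched generically: restart a fixed-horizon algorithm on phases of geometrically growing length until the visits run out. This is a generic sketch under the assumption that `run_phase` stands for one run of the horizon-aware algorithm; it is not tied to Algorithm 1's internals.

```python
def doubling_schedule(run_phase):
    """Wrap a fixed-horizon routine so it handles an unknown total horizon:
    run it on phases of length 1, 2, 4, ..., restarting at each phase.
    Returns a runner that records the phase lengths actually used."""
    def run(total_visits):
        horizons, n, k = [], 0, 0
        while n < total_visits:
            h = min(2 ** k, total_visits - n)  # truncate the last phase
            run_phase(h)                        # fixed-horizon run of length h
            horizons.append(h)
            n += h
            k += 1
        return horizons
    return run

runner = doubling_schedule(lambda h: None)  # no-op phase for illustration
print(runner(10))  # [1, 2, 4, 3]: phases double until the visits run out
```

Since each phase pays the fixed-horizon regret at its own length and the lengths grow geometrically, the total regret typically matches the known-horizon rate up to a constant (or logarithmic) factor.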
3. Find the estimated coefficients for the next stage by computing β̂_{i,k+1} for each i ∈ I.
4. Set k = k + 1. Repeat steps 2-4 until the end of the N user visits.
5. Obtain the estimated set of competitive arms Î_N = ∩_{n=Ñ_{K−2}+1}^{N} Ŝ_n and output the estimated coefficients β̂_i = β̂_{i,K} for all i ∈ Î_N.

Figure 1. Boxplots of per-round regret from different bandit algorithms using warfarin dose data with 100 random permutations. Left panel: 2 arms; middle panel: 3 arms; right panel: 4 arms.

Figure 2. Averaged relative CTR with news article recommendation data.
where the alternative choice of estimated coefficients with the larger sample B_{k−1,i} (which includes all historical data of arm i) is given in Remark 2 of Section 4.
1. Set the initial sampling stage with sample size N_0. Choose each arm an equal number of times τ_0. For each arm i ∈ I, compute the initial estimated coefficient β̂_i. Set k = 1.
2. At stage k, perform the following substeps at n

Table 1. Averaged algorithm performance using warfarin dose data with 100 random permutations.

Table 2. Averaged algorithm performance with news article recommendation data.