Variable Selection Via Thompson Sampling

Abstract. Thompson sampling is a heuristic algorithm for the multi-armed bandit problem which has a long tradition in machine learning. The algorithm has a Bayesian spirit in the sense that it selects arms based on posterior samples of reward probabilities of each arm. By forging a connection between combinatorial binary bandits and spike-and-slab variable selection, we propose a stochastic optimization approach to subset selection called Thompson Variable Selection (TVS). TVS is a framework for interpretable machine learning which does not rely on the underlying model being linear. TVS brings together Bayesian reinforcement learning and machine learning in order to extend the reach of Bayesian subset selection to nonparametric models and large datasets with very many predictors and/or very many observations. Depending on the choice of a reward, TVS can be deployed in offline as well as online setups with streaming data batches. Tailoring multi-play bandits to variable selection, we provide regret bounds without necessarily assuming that the arm mean rewards are unrelated. We show very strong empirical performance on both simulated and real data. Unlike deterministic optimization methods for spike-and-slab variable selection, the stochastic nature of TVS makes it less prone to local convergence and thereby more robust.


Interpretable Machine Learning
A fundamental challenge in statistics that goes beyond mere prediction is to glean interpretable insights into the nature of real-world processes by identifying important correlates of variation. Unfortunately, many of today's most powerful machine learning prediction algorithms lack the capacity to perform variable screening in a principled and/or reliable way.
For example, deep learning (DL) is widely accepted as one of the best performing artificial intelligence (AI) platforms. However, DL prediction mappings lack an intuitive algebraic form which renders their interpretability/explainability (i.e. insight into the black box decision process) far from straightforward. Substantial effort has recently been devoted to enhancing the explainability of machine (deep) learning through the identification of key variables that drive predictions (Garson (1991); Olden and Jackson (2002); Zhang et al.). A variable can be important because its change has a causal impact or because leaving it out reduces overall prediction capacity (Jiang and Owen (2003)). Such leave-one-covariate-out type inference has a long tradition, going back to at least Breiman (2001). In random forests, for example, variable importance is assessed by the difference between prediction errors in the out-of-bag sample before and after noising the covariate through a permutation. Lei et al. (2018) propose the LOCO method which gauges local effects of removing each covariate on the overall prediction capability and derive an asymptotic distribution for this measure to conduct proper statistical tests. There is a wealth of literature on variable importance measures, see Fisher et al. (2019) for a recent overview. In Bayesian forests, such as BART (Chipman et al. (2010)), one keeps track of predictor inclusion frequencies and outputs an average proportion of all splitting rules inside a tree ensemble that split on a given variable. In deep learning, one can construct variable importance measures using network weights (Garson (1991); Ye and Sun (2018)). Owen and Prieur (2017) introduce a variable importance measure based on the Shapley value and Hooker (2007) investigates diagnostics of black box functions using functional ANOVA decompositions with dependent covariates. While useful for ranking variables, importance measures are less intuitive for model selection and are often not well-understood theoretically (with a few exceptions including Ishwaran et al. (2007); Kazemitabar et al. (2017)).
This work focuses on high-dimensional applications (either very many predictors or very many observations, or both), where computing importance measures and performing tests for predictor effects quickly becomes infeasible. We consider the non-parametric regression model which provides a natural statistical framework for supervised machine learning. The data setup consists of a continuous response vector Y^(n) = (Y_1, · · · , Y_n)' that is linked stochastically to a fixed set of predictors x_i = (x_i1, · · · , x_ip)' for 1 ≤ i ≤ n through Y_i = f_0(x_i) + ε_i with ε_i iid ∼ N(0, σ²), (1) where f_0 is an unknown regression function. The variable selection problem occurs when there is a subset S_0 ⊂ {1, · · · , p} of q_0 = |S_0| predictors which exert influence on the regression function f_0 and we do not know which subset it is. In other words, f_0 is constant in directions outside S_0 and the goal is to identify active directions (regressors) in S_0 while, at the same time, permitting nonlinearities and interactions. The traditional Bayesian approach to this problem starts with a prior distribution over the 2^p sets of active variables. This is typically done in a hierarchical fashion by first assigning a prior distribution π(q) on the subset size q = |S| and then a conditionally uniform prior on S, given q, i.e. π(S | q) = 1/\binom{p}{q}. This prior can be translated into the spike-and-slab prior where, for each coordinate 1 ≤ i ≤ p, one assumes a binary indicator γ_i for whether or not the variable x_i is active and assigns a prior P(γ_i = 1 | θ) = θ with θ ∼ Beta(a, b) for some a, b > 0. (2)
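To make the hierarchical prior concrete, here is a minimal Python sketch (with illustrative values of p, a and b that are not from the paper) of drawing a set of active variables from the Beta-Bernoulli spike-and-slab prior (2).

```python
import numpy as np

rng = np.random.default_rng(0)
p, a, b = 20, 1.0, 1.0                   # illustrative dimension and Beta hyperparameters

theta = rng.beta(a, b)                   # global inclusion probability theta ~ Beta(a, b)
gamma = rng.binomial(1, theta, size=p)   # gamma_i | theta ~ Bernoulli(theta) for each coordinate
S = np.flatnonzero(gamma)                # the active set S = {i : gamma_i = 1}
print(theta, S)
```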
This paper introduces Thompson Variable Selection (TVS), a stochastic optimization approach to subset selection based on reinforcement learning. The key idea behind TVS is that variable selection can be regarded as a combinatorial bandit problem where each variable is treated as an arm. TVS sequentially learns promising combinations of arms (variables) that are most likely to provide a reward. Depending on the learning tool for modeling f_0 (not necessarily a linear model), TVS accommodates a wide range of rewards for both offline and online (streaming batches) setups. The fundamental appeal of active learning for subset selection (as opposed to MCMC sampling) is that those variables which provided a small reward in the past are less likely to be pulled again in the future. This exploitation aspect steers model exploration towards more promising combinations and offers dramatic computational dividends. Indeed, similarly as with backward elimination, TVS narrows down the inputs contributing to f_0 but does so in a stochastic way by learning from past mistakes. TVS aggregates evidence for variable inclusion and quickly separates signal from noise by minimizing a regret motivated by the median probability model rule (Barbieri and Berger, 2004). We provide regret bounds which do not necessarily assume that the arm outcomes are unrelated. In addition, we show strong empirical performance and demonstrate the potential of TVS to meet the demands of very large datasets.
This paper is structured as follows. Section 2 revisits known facts about multi-armed bandits. Section 3 develops the bandits framework for variable selection and Section 4 proposes Thompson Variable Selection and presents a regret analysis. Section 5 presents two implementations (offline and online) on two benchmark simulated datasets. Section 6 presents a thorough simulation study and Section 7 showcases TVS performance on real data. We conclude with a discussion in Section 8.

Multi-Armed Bandits Revisited
Before introducing Thompson Variable Selection, it might be useful to review several known facts about multi-armed bandits. The multi-armed bandit (MAB) problem can be motivated by the following gambling metaphor. A slot-machine player needs to decide between multiple arms. When pulled at time t, the i-th arm gives a random payout γ_i(t). In the Bernoulli bandit problem, the rewards γ_i(t) ∈ {0, 1} are binary and P(γ_i(t) = 1) = θ_i. The distributions of rewards are unknown and the player can only learn about them through playing. In doing so, the player faces a dilemma: exploiting arms that have provided high yields in the past and exploring alternatives that may give higher rewards in the future.
More formally, an algorithm for MAB must decide which of the p arms to play at time t, given the outcome of the previous t − 1 plays. A natural goal in the MAB game is to minimize regret, i.e. the amount of money one loses by not playing the optimal arm at each step. Denote with i(t) the arm played at time t, with θ* = max_{1≤i≤p} θ_i the best average reward and with ∆_i = θ* − θ_i the gap between the rewards of an optimal action and a chosen action.
The expected regret after T plays can then be written as E[R(T)] = Σ_{j=1}^{p} ∆_j E[k_j(T)], where k_j(T) is the number of times an arm j has been played up to step T. There have been two main types of algorithms designed to minimize regret in the MAB problem: Upper Confidence Bound (UCB) of Lai and Robbins (1985) and Thompson Sampling (TS) of Thompson (1933). Tracing back to Agrawal (1995), UCB consists of computing, at each round t and for each arm i, a reward index (e.g. an upper bound of the mean reward of the considered arm that holds with high confidence) and then selecting the arm with the largest index. Thompson Sampling, on the other hand, is a Bayesian-inspired heuristic algorithm that achieves a logarithmic expected regret (Agrawal and Goyal, 2012) in the Bernoulli bandit problem. Starting with a non-informative prior θ_i ∼ Beta(1, 1), independently for 1 ≤ i ≤ p, this algorithm: (a) updates the distribution of θ_i to Beta(a_i(t) + 1, b_i(t) + 1), where a_i(t) and b_i(t) are the number of successes and failures of the arm i up to time t, (b) samples θ_i(t) from these posterior distributions, and (c) plays the arm with the highest θ_i(t). Agrawal and Goyal (2012) extended this algorithm to the general case where rewards are not necessarily Bernoulli but general random variables on the interval [0, 1]. Scott (2010) and Sabes and Jordan (1996) proposed a Randomized Probability Matching variant which allocates observations to arms according to their probability of being the best arm.
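As a concrete illustration of steps (a)-(c), the following minimal Python sketch simulates Bernoulli Thompson sampling on a toy problem with made-up mean rewards (this is the classical single-play algorithm, not yet TVS).

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([0.2, 0.5, 0.7])       # unknown mean rewards of the arms (made up)
p, T = len(theta_true), 1000
a = np.ones(p)                               # Beta(1, 1) priors: a_i = successes + 1
b = np.ones(p)                               # b_i = failures + 1

for t in range(T):
    theta_draw = rng.beta(a, b)              # (b) sample theta_i(t) from the current posteriors
    i = int(np.argmax(theta_draw))           # (c) play the arm with the highest sampled value
    reward = rng.binomial(1, theta_true[i])  # observe a Bernoulli reward
    a[i] += reward                           # (a) posterior update of the played arm
    b[i] += 1 - reward

print(a / (a + b))                           # posterior means; the best arm's estimate concentrates near 0.7
```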
The MAB problem is most often formulated as a single-play problem, where only one arm can be selected at each round. Komiyama et al. (2015) extended Thompson sampling to a multi-play scenario, where at each round t the player selects a subset S_t of L < p arms and receives binary rewards of all selected arms. For each 1 ≤ i ≤ p, these rewards γ_i(t) are iid Bernoulli with unknown success probabilities θ_i, where γ_i(t) and γ_j(t) are independent for i ≠ j and where, without loss of generality, θ_1 > θ_2 > · · · > θ_p. The player is interested in maximizing the sum of expected rewards over drawn arms, where the optimal action is playing the top L arms S_0 = {1, . . . , L}. The regret depends on the combinatorial structure of arms drawn and, similarly as before, is defined as the gap between an expected cumulative reward and the optimal drawing policy, i.e. E[R(T)] = E[Σ_{t=1}^{T} (Σ_{i∈S_0} θ_i − Σ_{i∈S_t} θ_i)]. Fixing L, the number of arms played, Komiyama et al. (2015) propose a Thompson sampling algorithm for this problem and show that it has a logarithmic expected regret with respect to time and a linear regret with respect to the number of arms. Our metamorphosis of multi-armed bandits into a variable selection algorithm will ultimately require that the number L of arms played is random and that the rewards at each time t can be dependent. Combinatorial bandits (Gai et al. (2012)) can be seen as a generalization of multi-play bandits, where any arbitrary combination of arms S (called a super-arm) is played at each round and where the reward r(S) can be revealed for the entire collective S (a full-bandit feedback) or for each contributing arm i ∈ S (a semi-bandit feedback), see e.g. Wang and Chen (2018).

Variable Selection as a Bandit Problem
Bayesian model selection is regarded as more or less synonymous with finding the MAP (maximum-a-posteriori) model Ŝ = arg max_S π(S | Y^(n)). Even when the marginal likelihood is available, this model can be computationally unattainable for p as small as 20. In order to accelerate Bayesian variable selection using multi-armed bandit techniques, one idea immediately comes to mind. One could treat each of the 2^p models as a base arm.
Assigning prior model probabilities according to θ_i ∼ Beta(a_i, b_i) for 1 ≤ i ≤ 2^p for some a_i > 0 and b_i > 0 (chosen, for instance, to correspond to the marginals of a Dirichlet distribution), one could play a game by sequentially trying out various arms (variable subsets) and collect rewards to prioritize subsets that were suitably "good". Identifying the arm with the highest mean reward could then serve as a proxy for the best model. This naive strategy, however, would not be operational due to the exponential number of arms to explore.
Instead of the MAP model, it has now become standard practice to report the median probability model (MPM) (Barbieri and Berger (2004)) consisting of those variables whose posterior inclusion probability π_i ≡ P(γ_i = 1 | Y^(n)) is at least 0.5. More formally, the MPM is defined, for π = (π_1, . . . , π_p)', as S_MPM = {i : π_i ≥ 0.5}. (3) This model is the optimal predictive model in linear regression under some assumptions (Barbieri et al., 2018). Obtaining the π_i's, albeit easier than finding the MAP model, still requires exploring the model space. We will instead treat each of the p variables (rather than each of the 2^p models) as an arm and each subset S as a play opportunity. This is appealing for at least two reasons: (1) there are far fewer arms, which can be explored more efficiently, and (2) the quantity r_π(S) can be regarded as a mean reward of a combinatorial arm (more below) which, given π, has the MPM as its computational oracle.
We view Bayesian spike-and-slab selection through the lens of combinatorial bandit problems by treating the variable selection indicators γ_i's in (2) as Bernoulli rewards. From now on, we will refer to each θ_i as an unknown mean reward, i.e. a probability that the i-th variable exerts influence on the outcome. In sharp contrast to (2), which deploys one θ for all arms, each arm i ∈ {1, . . . , p} now has its own prior inclusion probability θ_i, i.e. P(γ_i = 1 | θ_i) = θ_i with θ_i ∼ Beta(a_i, b_i), independently for 1 ≤ i ≤ p. (4)
In the original spike-and-slab setup (2), the mixing weight θ served as a global shrinkage parameter determining the level of sparsity and linking coordinates to borrow strength (Rockova and George (2018)). In our new bandit formulation (4), on the other hand, the reward probabilities θ_i serve as a proxy for the posterior inclusion probabilities π_i whose distribution we want to learn by playing the bandits game. Recasting the spike-and-slab prior in this way allows one to approach Bayesian variable selection from a more algorithmic (machine learning) perspective.

The Global Reward
Before proceeding, we need to define the reward in the context of variable selection. One conceptually appealing strategy would be to collect a joint reward R(S_t) (e.g. goodness of model fit) reflecting the collective effort of all contributing arms and then redistribute it among arms inside the super-arm S_t played at time t. One example would be the Shapley value (Shapley (1953); Owen and Prieur (2017)), a construct from cooperative game theory for the attribution problem that distributes the value created by a team to its individual members. The Shapley value has become a popular method for prediction attribution in machine learning (Sundararajan and Najmi (2019)). We try a different route. Instead of collecting a global reward first and then redistributing it, our strategy consists of first collecting individual rewards γ_i^t ∈ {0, 1} for each played arm i ∈ S_t and then weaving them into a global reward R(S_t). We assume that the γ_i^t's are iid from (4) for each i ∈ {1, . . . , p}. Unlike traditional combinatorial bandits that define the global reward R(S_t) = Σ_{i∈S_t} γ_i^t as a sum of individual outcomes (Gai et al., 2012), we consider a global reward for variable selection motivated by the median probability model.
One natural choice would be a binary global reward R(S_t) = ∏_{i∈S_t} γ_i^t ∏_{i∉S_t} (1 − γ_i^t) ∈ {0, 1} for whether or not all arms inside S_t yielded a reward and, at the same time, none of the arms outside S_t did. Assuming independent arms, the expected reward equals E[R(S_t)] = ∏_{i∈S_t} θ_i ∏_{i∉S_t} (1 − θ_i) = r_θ(S_t) and has the "median probability model" as its computational oracle, as can be seen from (3). However, this expected reward is not monotone in the θ_i's (a requirement needed for regret analysis) and, due to its dichotomous nature, it penalizes all mistakes (false positives and negatives) equally.
We consider an alternative reward function which also admits a computational oracle but treats mistakes differentially. For some 0 < C < 1, we define the global reward R_C(S_t) for a subset S_t at time t as R_C(S_t) = Σ_{i∈S_t} [γ_i^t log(1 + C) + (1 − γ_i^t) log C]. (5) Similarly as R(S_t) (defined above), this reward is maximized for the model which includes all the positive arms and none of the negative arms, i.e. arg max_S R_C(S) = {i : γ_i^t = 1}.
Unlike R(S_t), however, the reward penalizes subsets with false positives, a penalty of log(1/C) for each, and there is an opportunity cost of log(1 + C) for each false negative.
The expected global reward depends on the subset S_t and the vector of yield probabilities θ = (θ_1, . . . , θ_p)' through r_θ^C(S_t) ≡ E[R_C(S_t)] = Σ_{i∈S_t} [θ_i log(1 + C) + (1 − θ_i) log C]. (6) Note that this expected reward is monotone in the θ_i's and is Lipschitz continuous. Moreover, it also has the median probability model as its computational oracle.
Lemma 1 Denote with S_O = arg max_S r_θ^C(S) the computational oracle. Then we have S_O = {i : θ_i ≥ log(1/C) / log[(1 + C)/C]}. (7) With C = (√5 − 1)/2, the oracle is the median probability model {i : θ_i ≥ 0.5}.
Proof: It follows immediately from the definition of R_C(S_t) and the fact that log(1/C) = log(1 + C) when C = (√5 − 1)/2, so that the threshold in (7) equals 0.5. Note that the choice of C = (√5 − 1)/2 incurs the same penalty/opportunity cost for false positives and negatives since log(1 + C) = − log(C). The existence of the computational oracle for the expected reward r_θ^C(S) is very comforting and will be exploited in our Thompson sampling algorithm introduced in Section 4.
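The following Python sketch illustrates Lemma 1 numerically, using the additive form of the global reward written in (5)-(6); it verifies that C = (√5 − 1)/2 yields the 0.5 threshold of the median probability model. The helper names are ours, not from the paper.

```python
import numpy as np

def global_reward(gamma, S, C):
    """R_C(S): each selected arm contributes log(1+C) if it yields a reward and log(C) otherwise."""
    gamma = np.asarray(gamma)
    S = np.asarray(S, dtype=int)
    return float(np.sum(gamma[S] * np.log(1 + C) + (1 - gamma[S]) * np.log(C)))

def oracle(theta, C):
    """Computational oracle S_O = {i : theta_i >= log(1/C) / log((1+C)/C)} from Lemma 1."""
    tau = np.log(1 / C) / np.log((1 + C) / C)
    return np.flatnonzero(np.asarray(theta) >= tau), tau

C = (np.sqrt(5) - 1) / 2             # golden-ratio choice: log(1+C) = -log(C)
theta = [0.9, 0.6, 0.4, 0.1]
S_O, tau = oracle(theta, C)
print(tau)                           # 0.5, so the oracle is the median probability model
print(S_O)                           # arms with theta_i >= 0.5, i.e. {0, 1}

gamma = [1, 1, 0, 0]
print(global_reward(gamma, S_O, C))  # adding a non-rewarding arm would lower R_C by log(1/C)
```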

The Local Rewards
Having introduced the global reward (5), we now clarify the definition of local rewards γ t i . We regard S t as a smaller pool of candidate variables, which can contain false positives and false negatives. The goal is to play a game by sequentially trying out different subsets and reward true signals so that they are selected in the next round and to discourage false positives from being included again in the future. Denote with S the set of all subsets of {1, . . . , p} and with D the "data" at hand consisting of |D| observations (Y i , x i ) from (1).
We introduce a feedback rule which, when presented with data D and a subset S t , outputs a vector of binary rewards r(S t , D) = (γ t i : i ∈ S t ) for whether or not a variable x i for i ∈ S t is relevant for predicting or explaining the outcome. This feedback is only revealed if i ∈ S t . We consider two sources of randomness that implicitly define the reward distribution r(S t , D): (1) a stochastic feedback rule r(·) assuming that data D is given, and (2) a deterministic feedback rule r(·) assuming that data D is stochastic.
The first reward type has a Bayesian flavor in the sense that it is conditional on the observed data D, where rewards can be sampled using Bayesian stochastic computation (i.e. MCMC sampling). Such rewards are natural in offline settings with Bayesian feedback rules, as we explore in Section 5.1. As a lead example of this strategy in this paper, we consider a stochastic reward based on BART (Chipman et al. (2010)).
In particular, we use the following binary local reward r(S t , D n ) = (γ t i : i ∈ S t ) where γ t i = I(M th sample from the BART posterior splits on the variable x i ).
The mean reward θ i = P(γ t i = 1) = P[i ∈ F | D n ] can be interpreted as the posterior probability that a variable x i is split on in a Bayesian forest F given the entire data D n .
The stochastic nature of the BART computation allows one to regard the reward (9) as an actual random variable, whose values can be sampled from using standard software.
Since BART is run only with variables inside S t (where |S t | << p) and only for M burn-in MCMC iterations, computational gains are dramatic (as we will see in Section 5.1).
The second reward type has a frequentist flavor in the sense that rewards are sampled by applying deterministic feedback rules on new streams (or bootstrap replicates) D_t of data. Such rewards are natural in online settings, as we explore in Section 5.2. As a lead example of this strategy in this paper, we assume that the dataset D_n consists of n = sT observations and is partitioned into batches D_1, . . . , D_T of size s. One could think of these batches as new independent observations arriving in an online fashion or as manageable snippets of big data. The 'deterministic' screening rule we consider here is running BART for a large number M of MCMC iterations and collecting an aggregated importance measure IM(i; D_t, S_t) for each variable. (This rule is deterministic in the sense that computing it again on the same data should in principle provide the same answer. One could, in fact, deploy any other machine learning method that outputs some measure of variable importance.) We define IM(i; D_t, S_t) as the average number of times a variable x_i was used in a forest, where the average is taken over the M iterations, and we then reward those arms which were used at least once on average, i.e. γ_i^t = I(IM(i; D_t, S_t) ≥ 1). (10) The mean reward θ_i = P(γ_i^t = 1) can then be interpreted as the (frequentist) probability that BART, when run on s = n/T observations arising from (1), uses a variable x_i at least once on average over M iterations. We illustrate this online variant in Section 5.2.
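The two reward mechanisms can be summarized by the following sketch; bart_split_indicators and bart_importance are hypothetical stand-ins for the BART-derived quantities in (9) and (10), faked here with random draws, since the paper obtains them from the BART MCMC output.

```python
import numpy as np

rng = np.random.default_rng(2)

def bart_split_indicators(S_t):
    """Hypothetical stand-in: whether the retained BART posterior sample splits on each variable in S_t."""
    return rng.binomial(1, 0.3, size=len(S_t))

def bart_importance(S_t):
    """Hypothetical stand-in: average number of splits on each variable in S_t over M iterations."""
    return rng.gamma(shape=1.0, size=len(S_t))

def offline_reward(S_t):
    # (9): gamma_i^t = 1 iff the retained BART posterior sample splits on x_i
    return dict(zip(S_t, bart_split_indicators(S_t)))

def online_reward(S_t):
    # (10): gamma_i^t = 1 iff x_i was used at least once on average over the M iterations
    im = bart_importance(S_t)
    return dict(zip(S_t, (im >= 1.0).astype(int)))

S_t = [3, 7, 11]
print(offline_reward(S_t))
print(online_reward(S_t))
```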
Remark 1 Instead of binary local rewards (8), one can also consider continuous rewards

Introducing Thompson Variable Selection (TVS)
This section introduces Thompson Variable Selection (TVS), a reinforcement learning algorithm for subset selection in non-parametric regression environments. The computation alternates between Choose, Reward and Update steps that we describe in more detail below.
The unknown mean rewards will be denoted with θ_i and the ultimate goal of TVS is to learn their distribution once we have seen the 'data'. To this end, we take the combinatorial bandits perspective (Chen et al. (2013), Gai et al. (2010)) where, instead of playing one arm at each play opportunity t, we play a random subset S_t ⊆ {1, . . . , p} of multiple arms. Each such super-arm S_t corresponds to a model configuration and the goal is to discover promising models by playing more often the more promising variables.
Similarly as with traditional Thompson Sampling, the t-th iteration of TVS starts off by sampling mean rewards θ_i(t) ∼ Beta(a_i(t), b_i(t)) from a posterior distribution that incorporates past reward experiences up to time t (as we discussed in Section 2).

The Choose Step then decides which arms will be played in the next round. While the single-play Thompson sampling policy dictates playing the arm with the highest sampled expected reward, the combinatorial Thompson sampling policy (Wang and Chen (2018)) dictates playing the subset that maximizes the expected global reward, given the vector of sampled probabilities θ(t) = (θ_1(t), . . . , θ_p(t))'. The availability of the computational oracle (from Lemma 1) makes this step awkwardly simple as it boils down to computing S_O in (7).

Algorithm 1 (Table 1): Thompson Variable Selection with BART
Define the threshold τ_C = log(1/C) / log[(1 + C)/C] for some 0 < C < 1 and pick M, a, b > 0.
Initialize a_i(0) := a and b_i(0) := b for 1 ≤ i ≤ p. For t = 1, . . . , T:
  Choose Step C: Set S_t = ∅ and for i = 1, . . . , p do
    C1: Sample θ_i(t) ∼ Beta(a_i(t), b_i(t)) and add i to S_t if θ_i(t) ≥ τ_C.
  Reward Step R: Collect local rewards γ_i^t for each i ∈ S_t from (9) (offline) or (10) (online).
  Update Step U: If γ_i^t = 1 then set a_i(t + 1) = a_i(t) + 1, else b_i(t + 1) = b_i(t) + 1.
(C2 is an alternative Choose Step for when the model size q* is known; see (13).)
Unlike multi-play bandits where the number of played arms is predetermined (Komiyama et al. (2015)), this strategy allows one to adapt to the size of the model. We do, however, consider a variant of the computational oracle (see (13) below) for when the size q* = |S*| of the 'true' model S* = arg max_S r_θ^C(S) is known. The Choose Step is then followed by the Reward Step (step R in Table 1) which assigns a prize to the chosen subset S_t by collecting individual rewards γ_i^t (from (9) for the offline setup or from (10) for the online setup). Finally, each TVS iteration concludes with an Update Step which updates the beta posterior distribution (step U in Table 1).
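A minimal Python sketch of one way to implement the Choose / Reward / Update loop of Algorithm 1; reward_fn is a user-supplied placeholder standing in for the BART-based rewards (9)-(10), and the toy usage at the bottom is purely illustrative.

```python
import numpy as np

def tvs(reward_fn, p, T, C=(np.sqrt(5) - 1) / 2, a0=1.0, b0=1.0, seed=0):
    """Thompson Variable Selection sketch: Choose / Reward / Update over T rounds.

    reward_fn(S_t) must return a dict {i: 0 or 1} of local rewards for the arms in S_t.
    """
    rng = np.random.default_rng(seed)
    a = np.full(p, a0)
    b = np.full(p, b0)
    tau = np.log(1 / C) / np.log((1 + C) / C)   # selection threshold from Lemma 1
    for t in range(T):
        theta_t = rng.beta(a, b)                # sample mean rewards from the beta posteriors
        S_t = np.flatnonzero(theta_t >= tau)    # Choose step C1: play arms above the threshold
        for i, g in reward_fn(S_t).items():     # Reward step R: collect local rewards
            if g == 1:                          # Update step U: beta posterior update
                a[i] += 1
            else:
                b[i] += 1
    pi = a / (a + b)                            # proxies for posterior inclusion probabilities
    return np.flatnonzero(pi >= 0.5), pi        # report the (approximate) median probability model

# toy usage: arms 0 and 1 are "signal" and reward with high probability when played
rng = np.random.default_rng(3)
toy_reward = lambda S: {int(i): int(rng.random() < (0.9 if i < 2 else 0.1)) for i in S}
S_hat, pi = tvs(toy_reward, p=10, T=300)
print(S_hat)
```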
The fundamental goal of Thompson Variable Selection is to learn the distribution of the mean rewards θ_i's by playing a game, i.e. sequentially creating a dataset of rewards by sampling from beta posterior distributions that incorporate past rewards and the observed data D. One natural way to distill evidence for variable selection is through the means π(t) = (π_1(t), . . . , π_p(t))' of these beta distributions, which serve as a proxy for posterior inclusion probabilities. Similarly as with the classical median probability model (Barbieri and Berger (2004)), one can deem important those variables with π_i(t) above 0.5 (this corresponds to one specific choice of C in Lemma 1). More generally, at each iteration t TVS outputs a model Ŝ_t which satisfies Ŝ_t = arg max_S r_{π(t)}^C(S). From Lemma 1, this model can be simply computed by truncating the individual π_i(t)'s. Upon convergence, i.e. when the trajectories π_i(t) stabilize over time, TVS will output the same model Ŝ_t. We will see from our empirical demonstrations in Section 5 that the separation between signal and noise (based on the π_i(t)'s) and the model stabilization occur fast. Before our empirical results, however, we will dive into the regret analysis of TVS.

Regret Analysis
Thompson sampling (TS) is a policy that uses Bayesian ideas to solve a fundamentally frequentist problem of regret minimization. In this section, we explore regret properties of TVS and expand current theoretical understanding of combinatorial TS by allowing for correlation between arms. Theory for TS was essentially unavailable until the path-breaking paper by Agrawal and Goyal (2012) where the first finite-time analysis was presented for single-play bandits. Later, Leike et al. (2016) proved that TS converges to the optimal policy in probability and almost surely under some assumptions. Several theoretical and empirical studies for TS in multi-play bandits are also available. In particular, Komiyama et al. (2015) extended TS to multi-play problems with a fixed number of played arms and showed that it achieves the optimal regret bound. Recently, Wang and Chen (2018) introduced TS for combinatorial bandits and derived regret bounds for Lipschitz-continuous rewards under an offline oracle. We build on their development and extend their results to the case of related arms.
Recall that the goal of the player is to minimize the total (expected) regret under time horizon T, defined as Reg(T) = E[Σ_{t=1}^{T} (r_θ^C(S*) − r_θ^C(S_t))], (12) where the expectation is taken over the randomness of the drawing policy. Choosing C as in Lemma 1, one has log(1 + C) = − log(C) = D and thereby Reg(T) = D · E[Σ_{t=1}^{T} (Σ_{i∈S*} (2θ_i − 1) − Σ_{i∈S_t} (2θ_i − 1))]. Note that (2θ_i − 1) is positive iff i ∈ S*. Upper bounds for the regret (12) under the drawing policy of our TVS Algorithm 1 can be obtained under various assumptions.
Assuming that the size q* of the optimal model S* is known, one can modify Algorithm 1 to confine the search to models of size up to q*. Denoting I = {S ⊂ {1, . . . , p} : |S| ≤ q*}, one plays the optimal set of arms within the set I, i.e. replacing the computational oracle in (7) with S_O^{q*} = arg max_{S∈I} r_θ^C(S). We denote this modification with C2 in Table 1. It turns out that this oracle can also be easily computed, where the solution consists of (up to) the top q* arms that pass the selection threshold in (7), i.e. S_O^{q*} collects the q* arms with the largest θ_i among {i : θ_i ≥ log(1/C)/log[(1 + C)/C]}, or all of them if there are fewer than q*. (13)
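A small sketch of computing the size-constrained oracle in (13) (the helper name oracle_topq is ours, not from the paper).

```python
import numpy as np

def oracle_topq(theta, q_star, tau=0.5):
    """S_O^{q*}: the (up to) q* largest theta_i among the arms with theta_i >= tau."""
    theta = np.asarray(theta)
    passing = np.flatnonzero(theta >= tau)              # arms above the selection threshold
    order = passing[np.argsort(theta[passing])[::-1]]   # sort passing arms by decreasing theta
    return order[:q_star]

# tau = 0.5 corresponds to the choice C = (sqrt(5) - 1)/2 from Lemma 1
print(oracle_topq([0.9, 0.55, 0.7, 0.2], q_star=2))     # -> arms 0 and 2
```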
We have the following regret bound which, unlike the majority of existing results for Thompson sampling, does not require the arms to have independent outcomes γ t i . The regret bound depends on the amount of separation between signal and noise.
Lemma 2 Define the identifiability gap ∆_i = min{θ_j : θ_j > θ_i, j ∈ S*} for each arm i ∉ S*. Algorithm 1 with the computational oracle S_O^{q*} in C2 achieves the following regret bound for any ε > 0 such that ∆_i > 2ε for each i ∉ S* and for some constant C > 0.
Proof: Since I is a matroid (Kveton et al. (2014)) and our mean reward function is Lipschitz continuous and depends only on the expected rewards of the revealed arms, one can apply Theorem 4 of Wang and Chen (2018).
Assuming that the size q* of the optimal model is unknown and that the rewards γ_i^t are independent, we obtain the following bound for the original Algorithm 1 (without restricting the solution to at most q* variables).

Lemma 3 Define the maximal reward gap
and for each arm i ∈ {1, . . . , p} define η i ≡ max S:i∈S Assume that γ t i 's are independent for each t. Then the Algorithm 1 achieves the following regret bound for some constant C > 0 and for any ε > 0 such that ∆ S > 2B(q 2 + 2)ε for each S.
Proof: Follows from Theorem 1 of Wang and Chen (2018).
We now extend Lemma 3 to the case when the rewards obtained from pulling different arms are related to one another. Gupta et al. (2020) introduced a correlated single-play bandit version of Thompson sampling using pseudo-rewards (upper bounds on the conditional mean reward of each arm). Similarly as with structured bandits (Pandey et al. (2007)), we instead interweave the arms by allowing their mean rewards to depend on S_t, i.e. instead of a single success probability θ_i we now have θ_i(S_t) ≡ P(γ_i^t = 1 | S_t). (14) We are interested in obtaining a regret bound for Algorithm 1 assuming (14), in which case the expected global reward (6) writes as r_{θ(S)}^C(S) = Σ_{i∈S} [θ_i(S) log(1 + C) + (1 − θ_i(S)) log C]. To this end we impose an identifiability assumption, which requires a separation gap between the reward probabilities of signal and noise arms.
Assumption 1 Denote with S* = arg max_S r_{θ(S)}^C(S) the optimal set of arms. We say that S* is strongly identifiable if there exists 0 < α < 1/2 such that (i) for all i ∈ S* we have θ_i(S*) ≥ θ_i(S) > 0.5 + α for all S such that i ∈ S, and (ii) for all i ∉ S* we have θ_i(S) < 0.5 − α for all S such that i ∈ S.
Under this assumption we provide the following regret bound (Theorem 1, whose proof is given in Appendix A).

Interpretable Machine Learning with TVS
This section serves to illustrate Thompson Variable Selection on benchmark simulated datasets and to document its performance. While various implementations are possible (by choosing different rewards r(S t , D) in (8)), we will focus on two specific choices that we broadly categorize into offline variants for when p >> n (Section 5.1) and streaming/online variants for when n >> p (Section 5.2).

Offline TVS
As a lead example in this section, we consider the benchmark Friedman dataset (Friedman (1991)), where the mean function is f_0(x) = 10 sin(π x_1 x_2) + 20(x_3 − 0.5)² + 10 x_4 + 5 x_5, with a vastly larger number of p = 10 000 predictors. Due to the considerable number of covariates, feeding all 10 000 predictors into a black box to obtain variable importance may not be computationally feasible and/or reliable. However, TVS can overcome this limitation by deploying subsets of predictors. For instance, we computed variable importance using the BART method (using the option sparse=TRUE for variable selection) with D ∈ {10, 50} trees and M ∈ {5 000, 50 000} MCMC iterations and plotted it in Figure 1. While increasing the number of iterations certainly helps in separating signal from noise, it is not necessarily obvious where to set the cutoff for selection. One natural rule would be selecting those variables which have been used at least once on average over the M iterations. With D = 50 and M = 50 000, this rule would identify 4 true signals, leaving out the quadratic signal variable x_3. The computation took around 8.5 minutes.
The premise of TVS is that one can deploy a weaker learner (such as a forest with fewer trees) which generates a random reward that roughly captures signal and is allowed to make mistakes. With reinforcement learning, one hopes that each round will be wrong in a different way so that mistakes will not be propagated over time. The expectation is that by (a) feeding only a small subset S_t into a black box and (b) reinforcing positive outcomes, one obtains a more principled way of selecting variables and speeds up the computation.
We illustrate the effectiveness of this mechanism below.
We use the offline local binary reward defined in (9). We start with a non-informative prior a_i(0) = b_i(0) = 1 for 1 ≤ i ≤ p and choose D = 10 trees in BART so that variables are discouraged from entering the model too wildly. This is a weak learner which does not seem to do perfectly well for variable selection even after very many MCMC iterations (see Figure 1(d)). We use the TVS implementation in Table 1 with a dramatically smaller number M ∈ {100, 500, 1 000} of MCMC burn-in iterations for BART inside TVS. We will see below that a large M is not needed for TVS to unravel signal even with as few as 10 trees.

Online TVS
As the second TVS example, we focus on the case with many more observations than covariates, i.e. n ≫ p. As we already pointed out in Section 3.2, we assume that the dataset D_n = {(Y_i, x_i) : 1 ≤ i ≤ n} has been partitioned into mini-batches D_t of size s = n/T. We deploy our online TVS method (Table 1 with C2) to sequentially screen each batch and transmit the posterior information onto the next mini-batch through a prior. This should be distinguished from streaming variable selection, where new features arrive at different time points (Foster and Stine (2008)). One can treat the beta posterior Beta(a_i(t), b_i(t)) obtained after screening the batch D_t as a new prior for the incoming data points in D_{t+1}, where a_i(t) and b_i(t) aggregate the binary rewards collected so far. Parsing the observations in batches will be particularly beneficial when processing the entire dataset (with overwhelmingly many rows) is not feasible for the learning algorithm. TVS leverages the fact that applying a machine learning method T times using only a subset of s observations and a subset S_t of variables is a lot faster than processing the entire data.
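A sketch of the online bookkeeping described above: the beta posterior after screening mini-batch D_t becomes the prior for D_{t+1}; screen_batch is a placeholder for the batch-level screening rule (10), faked here by a toy rule.

```python
import numpy as np

def online_tvs_pass(batches, p, screen_batch, C=(np.sqrt(5) - 1) / 2, seed=0):
    """One pass over mini-batches D_1, ..., D_T; the posterior after batch t is the prior for batch t+1."""
    rng = np.random.default_rng(seed)
    a, b = np.ones(p), np.ones(p)
    tau = np.log(1 / C) / np.log((1 + C) / C)
    for D_t in batches:
        theta_t = rng.beta(a, b)
        S_t = np.flatnonzero(theta_t >= tau)
        for i, g in screen_batch(D_t, S_t).items():   # binary rewards from screening this batch
            a[i] += g
            b[i] += 1 - g
    return a / (a + b)                                # inclusion-probability proxies after the pass

# toy usage: 20 batches, variable 0 is rewarded whenever it is screened
toy_screen = lambda D_t, S: {int(i): int(i == 0) for i in S}
pi = online_tvs_pass(batches=range(20), p=5, screen_batch=toy_screen, seed=4)
print(pi)
```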
While the posterior distribution of the θ_i's (treating the rewards as data) after one pass through the entire dataset will have seen all the data D_n, the θ_i's can be interpreted as the frequentist probability that the screening rule picks a variable x_i having access to only s measurements.
We illustrate this sequential learning method on a challenging simulated example from (Liang et al., 2018, Section 5.1). We assume that the explanatory variables have been obtained from x_ij = (e_i + z_ij)/2 for 1 ≤ i ≤ n and 1 ≤ j ≤ p, where e_i, z_ij are iid N(0, 1). This creates a sample correlation of about 0.5 between all variables. The responses are then generated from (1) with σ² = 0.5 using a nonlinear mean function in the first five covariates. This is a challenging scenario due to (a) the non-negligible correlation between signal and noise, and (b) the non-linear contributions of some of these signal variables. Unlike Liang et al. (2018), who set n = p = 500, we make the problem considerably more difficult by choosing n = 20 000 and p = 1 000. We would expect a linear model selection method to miss these two nonlinear signals. Indeed, the Spike-and-Slab LASSO method (using λ_1 = 0.1 and λ_0 ∈ {0.1 + k × 10 : k = 1, . . . , 10}) only identifies variables x_1, x_2 and x_5. Next, we deploy the BART method with variable selection by setting sparse=TRUE (Linero (2018)) and 50 trees in the BART software (Chipman et al. (2010)); results with 10 trees were not nearly as satisfactory. The choice of 50 trees for variable selection was recommended in Bleich et al. (2014) and was seen to work very well. Due to the large size of the dataset, it might be reasonable to first inquire about variable selection from smaller fractions of data.
We consider random subsets of different sizes s ∈ {100, 500, 1 000} as well as the entire dataset and we run BART for M = 20 000 iterations. Figure 3 depicts BART importance measures (average number of times each variable was used in the forest). We have seen BART separating the signal from noise rather well on batches of size s ≥ 1 000 and with MCMC iterations M ≥ 10 000. The scale of the importance measure depends on s and it is not necessarily obvious where to make the cut for selection. A natural (but perhaps ad hoc) criterion would be to pick variables which were on average used at least once. This would produce false negatives for smaller s and many false positives (29 in this example) for s = 20 000. The Hamming distance between the true and estimated model as well as the computing times are reported in Table 1. This illustrates how selection based on the importance measure is difficult to automate. While visually inspecting the importance measure for M = 20 000 and s = 20 000 (the entire dataset) in Figure 3d is very instructive for selection, it took more than 70 minutes on this dataset. To enhance the scalability, we deploy our reinforcement learning TVS method for streaming batches of data.
We use the online local reward (10) with BART (with 10 trees, sparse=TRUE, and M iterations) on batches of data D_n of size s. This is a weaker learning rule than the one considered in Figure 3 (50 trees). Choosing s = 100 and M = 10 000, BART may not be able to obtain perfect signal isolation on a single data batch (see Figure 3(a), which identifies only one signal variable). However, by propagating information from one batch onto the next, TVS is able to tease out more signal (Figure 4(a)). Comparing Figure 3 with Figure 4, several observations can be made from the timing and performance comparisons presented in Table 2. When the batch size is not large enough, repeated runs will not help. The Hamming distance in all cases only consists of false negatives and can be decreased by increasing the batch size or increasing the number of iterations and rounds. Computationally, it seems beneficial to increase the batch size s and supply enough MCMC iterations.
Variable selection accuracy can also be increased with multiple rounds.

Simulation Study
We further evaluate the performance of TVS in a more comprehensive simulation study.
We compare TVS with several related non-parametric variable selection methods and with classical parametric ones. We assess these methods based on the following performance criteria: False Discovery Proportion (FDP) (i.e. the proportion of discoveries that are false), Power (i.e. the proportion of true signals discovered as such), Hamming Distance (between the true and estimated model) and Time.

Offline Cases
For a more comprehensive performance evaluation, we consider the following 4 mean functions f 0 (·) to generate outcomes using (1). For each setup, we summarize results over 50 datasets of a dimensionality p ∈ {1 000, 10 000} and a sample size n = 300.
• Linear Setup: The regressors x_i are drawn independently from N(0, Σ), where Σ = (σ_jk)_{j,k=1}^{p} with σ_jj = 1 and σ_jk = 0.9^{|j−k|} for j ≠ k (see the design-generation sketch below). Only the first 5 variables are related to the outcome, which is generated from (1) with σ² = 5 via a linear mean function in these 5 variables.
• Friedman Setup: The Friedman setup was described earlier in Section 5.1. In addition, we now introduce correlations of roughly 0.3 between the explanatory variables.
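A minimal sketch of generating the correlated design for the Linear Setup above (only the design; the response is then drawn from (1) as described).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 1000

# Covariance Sigma with unit variances and sigma_jk = 0.9^{|j-k|} off the diagonal
idx = np.arange(p)
Sigma = 0.9 ** np.abs(idx[:, None] - idx[None, :])

# Draw the regressors x_i ~ N(0, Sigma) via a Cholesky factor
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])   # roughly 0.9 for neighbouring columns
```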
We run TVS with M = 500 and M = 1 000 internal BART MCMC iterations and with T = 500 TVS iterations. As two benchmarks for comparison, we consider the original BART method (in the R-package BART) and a newer variant called DART (Linero and Yang (2018)) which is tailored to high-dimensional data. We also include the Spike-and-Slab LASSO (Rockova and George (2018)) implemented in the R-package SSLASSO with lambda1=0.1 and the spike penalty ranging from λ_1 to the number of variables p (i.e. lambda0 = seq(1, p, length=p)). We report the set of variables chosen by the SSLASSO function after the regularization path has stabilized, using the model output.
We report the average performance (over 50 datasets) for p = 10 000 in Figure 6 and the rest (for p = 1 000) in the Appendix. Recall that the model estimated by TVS is obtained by truncating the π_i(500)'s at 0.5. In terms of the Hamming distance, we notice that TVS performs best across-the-board. DART (with D = 50) performs consistently well. We also implement a stopping criterion for TVS based on the stabilization of the inclusion probabilities π_i(t). One possibility is to stop TVS when the estimated model Ŝ_t obtained by truncating the π_i(t)'s at 0.5 has not changed for, say, 100 consecutive TVS iterations. With this convergence criterion, the convergence time differs across the different data set-ups. Generally, TVS is able to converge in ∼200 iterations for p = 1 000 and ∼300 iterations for p = 10 000. While the computing times are faster, TVS may be more conservative (lower FDP but also lower Power). The Hamming distance is hence a bit larger, but comparable to TVS with 500 iterations (Appendix B.1).

Online Cases
We now consider a simulation scenario where n ≫ p, i.e. p = 1 000 and n = 10 000. As described earlier in Section 5.2, we partition the data into mini-batches D_b for b = 1, . . . , n/s with s ∈ {500, 1 000} and M ∈ {500, 1 000}, using D = 10 trees. In this study, we consider the same four set-ups as in Section 6.1. We implemented TVS with a fixed number of rounds r ∈ {1, 5, 10} and a version with a stopping criterion based on the stabilization of the inclusion probabilities π(t). This means that TVS will terminate when the estimated model Ŝ_t obtained by truncating the π_i(t)'s at 0.5 has not changed for 100 consecutive iterations. The results using the stopping criterion are reported in Figure 7 and the rest is in the Appendix (Section B.2). As before, we report the best configuration for BART and DART, namely D = 20 for BART and D = 50 for DART (both with 50 000 MCMC iterations). For both methods, there are non-negligible false discoveries and the timing comparisons are not encouraging. In addition, we could not apply BART and DART with n ≥ 50 000 observations due to insufficient memory. For TVS, we found the batch size s = 1 000 to work better, as well as running the algorithm for enough rounds until the inclusion probabilities have stabilized (Figure 7 reports the results with the stopping criterion). The results are very encouraging. While SSLASSO's performance is very strong, we notice that in the non-linear setup of Liang et al. (2018) it suffers from false non-discoveries.

HIV Data
We will apply (offline) TVS on a benchmark Human Immunodeficiency Virus Type I (HIV-I) dataset described and analyzed in Rhee et al. (2006) and Barber et al. (2015). This publicly available dataset contains drug resistance measurements together with genotype information coded as binary markers x_ij ∈ {0, 1} for whether or not the j-th mutation has occurred in the i-th sample. In an independent experimental study, Rhee et al. (2005) identified mutations that are present at a significantly higher frequency in virus samples from patients who have been treated with each class of drugs as compared to patients who never received such treatments. While, as with any other real data experiment, the ground truth is unknown, we treat this independent study as a good approximation to the ground truth. Similarly as Barber et al. (2015), we only compare mutation positions since multiple mutations in the same position are often associated with the same drug resistance outcomes.
For illustration, we now focus on one particular drug called Lopinavir (LPV). There are p = 206 mutations and n = 824 independent samples available for this drug. TVS was applied to this data for T = 500 iterations with M = 1 000 inner BART iterations.
In Figure 8, we use blue and red to differentiate mutations whose positions were identified by Rhee et al. (2005) from mutations which were not identified, respectively.
From the plot of inclusion probabilities in Figure 8a, it is comforting to see that only one unidentified mutation has a posterior probability π j (t) stabilized above the 0.5 threshold.
Generally, we observe the experimentally identified mutations (blue curves) to have higher inclusion probabilities.

Durable Goods Marketing Data Set
Our second application examines a cross-sectional dataset described in Ni et al. (2012) consisting of durable goods sales data from a major anonymous U.S. consumer electronics retailer. The dataset features the results of a direct-mail promotion campaign in November 2003 where roughly half of the n = 176 961 households received a promotional mailer with $10 off their purchase during the promotion time period (December 4-15). If they did purchase, they would get 10% off on a subsequent purchase through December. The treatment assignment was random. The data contains p = 146 descriptors of all customers including prior purchase history, purchase of warranties, etc. We will investigate the effect of the promotional campaign (as well as other covariates) on December sales. In addition, we will interact the promotion mail indicator with customer characteristics to identify the "mail-deal-prone" customers. We work with 114 variables whose names and descriptive statistics are reported in Section D in the Appendix. We interact the promotion mail indicator with these variables to obtain p = 227 predictors. Due to the large volume of data (n ≈ 180 000), we were unable to run DART and BART (BART package implementation) due to memory problems. This highlights the need for TVS as a variable selector which can handle such voluminous data.
Unlike the HIV-I data in Section 7.1, there is no proxy for the ground truth. To understand the quality of TVS selection, we added 227 normally distributed knockoffs. The knockoffs are generated using the create.second_order function in the knockoff R package (Patterson and Sesia (2018)) using a Gaussian distribution with the same mean and covariance structure (Candes et al. (2018)). We run TVS with a batch size s ∈ {1 000, 2 000, 5 000} and M = 1 000 inner iterations until the posterior probabilities have stabilized. The inclusion probabilities are plotted in Figure 9 for two cases: (a) without knockoffs (the first row) and (b) with knockoffs (the second row). It is interesting to note that, apart from one setting with s = 1 000, the knockoff trajectories are safely suppressed below 0.5 (dashed line).

Supplementary Material
A Proof of Theorem 1

We will denote with S* = arg max_S r_θ^C(S) the optimal model. First, we define the reward gap of a set of arms S_t as ∆_{S_t} = r_{θ(S*)}^C(S*) − r_{θ(S_t)}^C(S_t) and write the expected regret in (12) as Reg(T) = E[Σ_{t=1}^{T} ∆_{S_t}]. Before proceeding, we need to introduce some notation. We denote with N_i(t) = Σ_{k<t} I(i ∈ S_k) the number of times an arm i has been pulled up to time t. Next, \hat{μ}_i(t) denotes the empirical mean of an arm i, i.e. the proportion of times an arm i has yielded a reward (γ_i^t = 1) when pulled (i ∈ S_t). We will be using the following usual Chernoff-Hoeffding bounds, which we state without a proof.
Lemma 4 (Chernoff-Hoeffding Bound) Let X_1, ..., X_n be independent Bernoulli random variables with E[X_i] = p_i and denote with X̄ = (1/n) Σ_{i=1}^{n} X_i and μ = (1/n) Σ_{i=1}^{n} p_i. Then, for any 0 < λ < 1 − μ, we have P(X̄ ≥ μ + λ) ≤ exp{−n d(μ + λ, μ)}, and, for any 0 < λ < μ, P(X̄ ≤ μ − λ) ≤ exp{−n d(μ − λ, μ)}. Similarly as in other regret bound proofs (Komiyama et al., 2015; Wang and Chen, 2018), we will bound the expected regret separately on the intersection of combinations of the following events, where α is the constant from Assumption 1. First, we focus on the term Reg_1(T). The following Lemma finds an upper bound on Reg_1(T). Lemma 5 Under Assumption 1, the TVS sampling policy in Table 1 satisfies the following bound. Proof: We will first prove the first inequality; the second inequality for the term in (38) can be obtained analogously. With θ_i(S_t) as in (14), we denote with θ̄_i the corresponding averaged reward probability. Similarly as in the proof of Lemma 3 in Wang and Chen (2018), we denote with τ_1, τ_2, ... the times such that i ∈ S_t, define τ_0 = 0, and write the corresponding decomposition. This concludes the proof of (45). The second term can be bounded analogously, which concludes the proof of the Lemma.
Next, we focus on the term Reg_2(T). To bound this term, we will be using the following Lemma (Lemma 4 from Wang and Chen (2018)) which we, again, state without a proof.
Lemma 6 Denote with θ_i(t) the mean reward for an arm i sampled from Beta(a_i(t), b_i(t)) during the step C1 in Table 1. Using the TVS algorithm from Table 1, we have the following two inequalities for any base arm i. Proof: The proof relies on the observation that the θ_i(t)'s only depend on the values a_i(t) and b_i(t). The proof is then the same as in Lemma 4 in Wang and Chen (2018).
The following lemma bounds the regret term (23).
Lemma 7 Using the TVS algorithm from Table 1, we have the following bound: We can then directly apply Lemma 6 to write Finally, we focus on the following term Lemma 8 Let i ∈ S * and let f * i (j, s) be the probability that after j pulls of an arm i, s of those pulls result in a reward. If s ≤ 0.5j , then Proof: We denote with τ i j the j th time such that the arm i has been pulled (i.e. θ i (t) > 0.5). We denote the probability of yielding a reward at time τ j as p j = P γ τ j i = 1|S τ j and, for a given j and s write f * j,s (p 1 , · · · , p j ) := f * i (j, s). Since we are studying one particular arm, we have dropped the subscript i without any loss of generality. Consider now a vector of binary indicators B = (B 1 , B 2 , B 3 , · · · , B j ) ∈ {0, 1} j where B k = γ τ k i ∈ {0, 1} for whether or not the k th pull of an arm i has yielded a reward. Denoting |B| = p j=1 B k , we can write f * j,s (p 1 , · · · , p j ) = B:|B|=s We want to show that p = (p 1 , . . . , p j ) = arg max f * j,s (p 1 , · · · , p j ) when p i = 0.5 + α for all 1 ≤ i ≤ j. First, we notice is that this is a multi-linear polynomial in the sense that ∂f * j,s (p 1 ,··· ,p j ) ∂p k is independent of p k . Keeping every other coordinate constant, the value p k that maximizes f * j,s (·) in the k th direction has to be either 0.5 + α or 1. The vector (p 1 , · · · , p j ) maximizing f * j,s (p 1 , · · · , p j ) will thus have each coordinate p * k either equal to 1 or 0.5+α. Let r ∈ N∪{0} be the number of coordinates k for which p k = 1 and j −r be the number of coordinates k for which p k = 0.5 + α (notice that r ≤ s). Since f * j,s (p 1 , · · · , p j ) is a symmetric polynomial (i.e. the value of the function is not affected by a permutation of its argument) we assume, without loss of generality, that p 1 = p 2 = · · · = p r = 1 and p r+1 = p r+2 = · · · = p j = 0.5 + α. In this case, we have the constraint on the binary indicators B where the first r indices have to be 1 and the remaining s − r 1 s can be anywhere between the index r + 1 and j (j − r indices). Therefore, we have It is sufficient to prove that this function is maximized at r = 0. We have (using the assumption s ≤ j/2 ) = 1 − (1/2 − α)r + αj (j − r)(1/2 + α) < 1 since 1/2 − α ≥ 0 and α > 0. Since this is true for all r, the function f * j,s,r+1 (p 1 ,··· ,p j ) f * j,s,r (p 1 ,··· ,p j ) is maximized at r = 0. This concludes the proof.
Lemma 9 Let i ∈ S * and let τ i j be the j th time such that θ i (t) > 0.5. Suppose that Assumption 1 is true, then the TVS Algorithm 1 with C = ( where constants C 1 , C 2 > 0 are not related to Algorithm 1. Proof: We denote with τ i j the j th time such that the arm i has been pulled (i.e. θ i (t) > 0.5). First, we consider the time interval [τ i j , τ i j+1 ). For any t ∈ [τ i j , τ i j+1 ) we know that the arm i has been played j times and, thereby, θ i (t) comes from a beta distribution The parameters of the beta distribution are only updated if the arm i is pulled and the distribution thus does not change until we reach the iteration τ j+1 . Therefore, givenμ i (τ i j ) the expected difference between τ i j+1 − τ i j has a geometric distribution with an expectation . We let F n,p (·) and f n,p (·) denote the cumulative distribution function (CDF) and the probability density function of a Binomial distribution with parameters (n, p). We now recall the following fact (see e.g. Fact 3 in Agrawal and Goyal (2012)) about the CDF F beta α,β (x) of a beta distribution with parameters (α, β). We have the following identity which links the CDF of a beta distribution and a CDF of a binomial distribution: for all positive integers α, β. Let f * i (j, s) be the probability that after j pulls of an arm i, s out of those j pulls result in a reward. Here, we have the following relationship a i (τ j ) = s+1 and b i (τ j ) = j + 1 − s. Using the identity (27) and given s successes among j pulls, we can write p i,j (0.5) = F beta s+1,j−s+1 (0.5) = 1 − F j+1,0.5 (s) and thereby First, we consider the case when j ≤ 8 α . In the following calculations, we will use the result from Lemma 8. Let p max = α + 0.5 and R = pmax 1−pmax . Using the fact F j+1,0.5 (s) ≥ 0.5F j,0.5 (s) and F j,0.5 (s) ≥ 1/2 when s ≥ j/2 (since the median of Binomial distribution 44 with parameters (j, 1/2) is either j/2 or j/2 ) we have where from (34) to (35) we have used the following two facts. First, using the definition of d(·, ·) in Lemma 4 and the fact that d(p 1 , p 2 ) > (p 1 − p 2 ) 2 we obtain (1 − p max ) j 1/2 j R j/2 ≤ (1 − p max ) j 1/2 j R j/2 = e j log(1−pmax)−j log(1/2)+j/2 log(pmax)−j/2 log(1−pmax) = e −j{ 1 Second, since p max = 0.5 + α, we have R R−1 = pmax 2α ≤ 1 2α . When j > 8 α , we will divide the sum Σ(0, j) ≡ j s=0 f * i (j,s) F j+1,0.5 (s) into 4 pieces and bound each one of them Σ ( (1/2 + α/2)j , j) ≤ 1 + 1 e α 2 j/4 − 1 , where c 2 > 0 and c 3 > 0 are constants unrelated to Algorithm 1. This will complete the proof. We now prove the bounds in the last display. We start with the first inequality in 45 (37). When s ≤ (j + 1)/2 − (j + 1)/4, we use the following bound for the Binomial CDF (Jeřábek (2004)) for some c 2 > 0. When s ≥ (j + 1)/2 − (j + 1)/4 we use that fact that, for some c 3 > 0, Altogether, we arrive at the following bound (using again Lemma 8 and denoting p max = 1/2 + α and R = pmax 1−pmax ) Now we bound the first term in (41) to obtain where we have used the following facts. First, for any x > 1 we have Second, j/2 + 1/2 − j/2 + 1 < 3 and (similarly as before) Finally, since R/(R − 1) ≤ 1 2α and 1/(R − 1) = 1−2α 4α , we have .
For the second term in (41), we notice that is equal to the probability that the total number of successes is less than j/2 − 1. Here, we invoke Lemma 4 and note that the success probability of each pull is always greater than 1/2 + α and the difference between the average success probability over the j pulls and 1/2 is thereby greater than α. Hence, We put the two terms together to finally obtain ≤ c 2 e −α 2 j 1 − 2α 4α 2 (j + 1) + c 3 e −2α 2 j .
Next, to bound the term Σ( j/2 , j/2 ) in (38), we use Lemma 8 and the fact that p max > 1/2 to find that for s = j/2 where we used the assumption j ≥ 1/α > 2.
Now, denoting C_1 = 2 + 2c_3 and C_2 = c_2, we get the statement in the Lemma. This concludes our proof.
Using Lemma 9, we can achieve a similar bound in Lemma 6 of Wang and Chen (2018), Lemma 10 Under Assumption 1, the TVS Algorithm 1 with C = ( √ 5 − 1)/2 satisfies the following property. For any signal arm i ∈ S * , the expected number of total pulls before the given arm i is pulled 8 log(T ) α 2 times is bounded by where C = C 1 + C 2 1−2α 32 α for some C 1 > 0 and C 2 > 0 not related to the Algorithm 1.
Proof: We use the notation from Lemma 9, where τ i j is the time when the arm i has been pulled for the j th time. Denoting with T = 8 log(T ) α 2 , we want to find an upper bound for and using Lemma 9 we obtain 1 + 1 e α 2 j/4 − 1 + e −α 2 j/2 C 1 + C 2 1 − 2α 4α 2 (j + 1) . (44) First, we note that Using these lemmas we can prove the following lemma about Reg 3 (T ).

Proof: We start with the following facts: Reg_3(T) ≤ Σ_{t=1}^{T} ∆_{S_t} E[I(A(t) ∩ D^c(t))] and
where we used the fact that on the event D c (t), there exists at least one arm i ∈ S t such that N i (t) ≤ 8 log(T ) α 2 . We now decompose the sum above into signal arms and noise arms If i ∈ S t \S , then S t contributes to the regret but this can only happen 8 log(T ) α 2 times so the total regret contribution of pulling a subset S t including an arm i before N i (t) > 8 log(T ) α 2 is bounded by max S:i∈S . There are p − q * noise arms i / ∈ S * and the term (46) can be thus bounded by (p − q * )∆ max 8 log(T ) α 2 . If i ∈ S ∩ S t , then the arm i contributes to the regret when S t = S . However, by Lemma 9 and Lemma 10 we can bound the expected number of pulls of an arm i before N i (t) reaches 8 log(T ) α 2 . This means that the contribution to the regret when i ∈ S ∩ S t is bounded by ∆ max 8 log(T ) α 2 1−e −α 2 /2 + 8 α 2 1 e 2α −1 + e −1 1−e −α/8 . Because there are q signal arms, we can combine (45) and (46) to arrive at the bound in the statement of this lemma.
We now put the various pieces together to finally prove Theorem 1.

Proof: We start by noticing that
Thereby we can write the regret in a form which leads to the following decomposition: Reg(T) = Reg_1(T) + Reg_2(T) + Reg_3(T). Now, we bound Reg_1(T), Reg_2(T) and Reg_3(T) with Lemma 5, Lemma 7, and Lemma 11, respectively. This gives us Theorem 1.

B.1 Offline Cases
Figure 10 below shows simulation results for p = 1 000 under the same settings as Figure 6. In addition, convergence diagnostics (number of iterations until convergence) for the simulation study from Section 6.1, using the convergence criterion "Ŝ_t stays the same for 100 consecutive TVS iterations", are included in Table 4. In addition to the convergence criterion, we tried different numbers of trees (D) for DART and BART. We also considered a different variable selection rule, i.e. the Median Probability Model using the inclusion probabilities of BART and DART (as mentioned in Linero (2018)). Due to space constraints, we showed only the best settings for BART and DART in Section 6.1. The following tables present the entire simulation study and show that TVS yields better Hamming distance and computational speed gains. Tables 5-8 present the offline simulation study, one table for each data setup.

B.2 Online Cases
In Table 9, we report convergence diagnostics of the simulation from Section 6.2, where the convergence criterion is chosen as " S t stays the same for 100 consecutive TVS iterations".  In addition to the results shown in Section 6.2, we tried a different number of trees D for DART and BART. We also considered a different variable selection rule, i.e. the Median Probability Model using the inclusion probability of BART and DART (as mentioned in Linero (2018)). Due to space constraints, we showed only the best settings for BART and DART in Section 6.2. Now we present additional simulation results for n = 50 000 and n = 100 000. Since BART and DART cannot be run with such large n, we only show the results for TVS.