Mini-batch Metropolis-Hastings MCMC with Reversible SGLD Proposal

Traditional MCMC algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis-Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that this gives rise to approximately a tempered stationary distribution. We prove that the algorithm preserves the modes of the original target distribution and derive an error bound on the approximation with mild assumptions on the likelihood. To further extend the utility of the algorithm to high dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in both low dimensional models and high dimensional neural network applications. Particularly in the latter case, compared to popular optimization methods, our method is more robust to the choice of learning rate and improves testing accuracy.


Introduction
Since its inception, Markov chain Monte Carlo (MCMC) sampling has been an indispensable tool in Bayesian modeling for obtaining parameter estimates and their uncertainty.However, traditional MCMC algorithms do not scale well to large data as they typically involve expensive computation using the full dataset.Additionally, scaling classical MCMCs toward modern high-dimensional applications can be problematic.The computational bottleneck led researchers to pursue lower accuracy, higher efficiency trade-offs such as variational inference.Despite its computational efficiency, theoretical guarantees for asymptotic convergence of variational approximations are given typically for specific models, and the objective function can contain multiple local optima trapping commonly used optimization algorithms [8,12,20].In comparison, MCMC techniques have the potential to navigate non-convex surfaces and find better local optima in the process.As the amount of data continues to grow rapidly, the need for scalable MCMC methods for large-scale learning tasks remains critical.In this paper, we propose an MCMC algorithm that is scalable in both the size of the dataset and the dimension of the parameter space.Our algorithm leverages the traveling property of an MCMC sampler to find better solutions to optimization problems in machine learning.
The search for scalable MCMC methods has largely proceeded in two directions.The first approach divides the data into manageable batches and performs MCMC on each batch in parallel.To collectively process the results, most methods either require different machines to communicate with each other in different rounds of MCMC iteration [1], or combine the posterior distribution from each batch to approximate the target posterior [22,30,27].Our work follows the second line of approach, which uses subsamples, or mini-batches, of the full data in each iteration of the MCMC algorithm.The key in analyzing such an algorithm is to understand the noise and bias introduced by the mini-batches.
The broad class of pseudo-marginal algorithms [3] use mini-batches of data to accelerate computation in the Metropolis-Hastings (MH) algorithm [13,5,19,23].The exact posterior (or some close approximation) is maintained by constructing an unbiased and nonnegative estimator, which can have a nontrivial form or require carefully chosen lower bound on the likelihood.Another class of methods performs approximate tests in the MH acceptance step using mini-batches.To control the approximation error, an adaptive approach is usually adopted to sequentially increase the size of a batch until an error bound is met [4,15,9].Approaches based on non-reversible MCMC have also been proposed [7].In practice, some of these methods were tested on large datasets with hundreds of parameters, but further scaling up in parameter dimension toward deep machine learning models would be challenging.
In another direction, past few years have witnessed the rise of stochastic gradient based MCMC algorithms which have shown strong potential in large-scale machine learning applications.These algorithms are developed from diffusion-based MCMC and approximate the gradient with noisy estimates based on mini-batches of data ( [25]), a notable example being the Stochastic Gradient Langevin Dynamics (SGLD) and other variants [31,2,10,17].Many studies have since analyzed the convergence of SGLD by viewing the algorithm as a discrete-time simulation of a continuous stochastic differential equation (SDE) [28,24].Unlike algorithms such as MALA which uses the MH acceptance test to correct the errors in discretizing a continuous system (e.g.[26]), SGLD completely avoids the costly computation of the MH ratio by using a shrinking step size.In practice, this implies the algorithm eventually converges to a local optimum.
We propose a general mini-batch MH algorithm whose invariant distribution approximates a tempered version of the target posterior.By augmenting the system with a variable related to the subsampling procedure, we show our algorithm is a reversible Markov chain thus has an invariant distribution.The idea of augmenting the system to sample a tempered posterior was also explored by [18] to heuristically design a mini-batch Metropolis sampler, but their algorithm differs in the use of mini-batches and they did not offer theoretical support for the method.[11] introduced a mini-batch Gibbs sampler capable of exact sampling from certain graphical models.Finally, a connection between tempering and subsample variance was also mentioned in [5].Here, we provide a rigorous theoretical foundation for mini-batching in MH.
We emphasize that our aim here is not Bayesian inference from the exact posterior.Rather, we exploit the tempered posterior with an efficient MCMC sampler to obtain better solutions from a global optimization.
With mild assumptions on the likelihood and allowing the parameter dimension to grow at a suitable rate, we provide full theoretical analysis to i) show the invariant distribution of our algorithm approximately preserves the modes of the true posterior, which is an important property for optimization tasks, and ii) bound the distance between the invariant distribution and the tempered posterior.To further enhance the utility of our algorithm in high dimensional applications, we design a proposal function based on Reversible Stochastic Gradient Langevin Dynamic (RSGLD) to make the calculation of MH ratio computationally efficient while ensuring reasonable acceptance probability.We show that the proposal significantly enhances acceptance probability in regions with strong gradient information and explores flat regions in a way similar to random walk.Empirically, we demonstrate the tempering effect inherent to our algorithm helps the Markov chain jump out of local optima and travel between differently modes more easily.Most importantly, we show our mini-batch MH algorithm combined with the RSGLD proposal can be applied to efficiently train neural networks.
The rest of the paper is organized as follows.In Section 2, we introduce our algorithm and provide theoretical analysis of its stationary distribution.In the high dimensional setting, we also design a proposal function called RSGLD and show that adding the reverse move significantly increases the acceptance probability when the gradient is strong.In Section 3, we demonstrate with an array of examples from simple Gaussian models to neural networks with > 10 5 parameters that our algorithm combines the traveling property of an MCMC sampler and the computational efficiency of stochastic optimization methods, thus showing good promise for optimization tasks in deep machine learning applications.In the neural network examples, our algorithm shows higher accuracy overall and better stability for larger learning rates compared to other popular optimization methods.

Methods
We first introduce our algorithm and outline its connection to tempering using an augmented variable.We then show under appropriate assumptions, the stationary distribution of the mini-batch MH approximately preserves the modes of the target posterior and is close to a tempered posterior in distribution.In the high dimensional setting, we design a proposal function that can navigate a complex surface guided by gradient information and ensure the acceptance probability does not diminish too quickly as the dimension grows.

MH MCMC with batch tempering (MHBT)
Under the usual Bayesian setting, let x = (x 1 , . . ., x n ) ∈ X n , X ⊂ R p , be iid samples drawn from distribution p(•|θ * ), where θ * ∈ Θ ⊂ R d denotes the parameters.Let π 0 (θ) be the prior on θ.We are interested in sampling from the target posterior π(θ) ∝ π 0 (θ) n i=1 p(x i |θ) using the MH algorithm.In each iteration of classical MH, given some proposal function q(•), a move from θ to θ is accepted with probability given by the MH ratio, For large n, the evaluation of π(•) is costly.
We next derive our algorithm, MH MCMC with batch tempering (MHBT), using an augmented system1 .Consider an auxiliary variable τ ∈ {0, 1} n with I(τ ) = {i : where ν m,n is the uniform distribution over I m and c n is a scaling constant that will be explained soon.Performing the classical MH algorithm on the augmented pair (θ, τ ) with the above proposal and π, simple algebra shows the acceptance probability is given by which can be calculated efficiently using a new mini-batch I(τ ) of the data.Since the stationary distribution of this Markov chain is π, marginalizing (1) over τ (with τ in the batches suppressed for clarity), where T = n/c n is the temperature.In this sense, the mini-batch stationary distribution is approximately a tempered version of the posterior, up to a bias term.Unlike pseudo-marginal MCMCs, we do not require constructing an unbiased estimate of the likelihood, which leads to improved computational efficiency.The bias becomes small (i.e. the bias term becomes close to 1) as n increases for appropriate m and c n since μI (θ) − µ(θ) becomes small.c n controls the trade-off between approximation error and the tempering amount -a smaller c n leads to a smaller error but a higher temperature.The choice of c n and the exact error rate will also be discussed in Section 2.2.We summarize the mini-batch MH algorithm in Algorithm 1 (with τ suppressed for simplicity).

Preservation of local optima and convergence to tempered posterior
In this section, we analyze the properties of the stationary distribution π(θ).In particular, we show the convergence rate of the bias term in (3) in terms of the two tuning parameters m and c n .Throughout the rest of the paper, for two positive sequences a n and b n , we use the notation • 2 denote the 1 , 2 norm for vectors, and • op denotes the operator norm of a matrix.a is the greatest integer smaller than or equal to a. a ∨ b = max{a, b}.E θ * is the expectation taken over the data which is generated by the true parameter θ * .Consider the regime where both n and m are large with m ≤ n.We will also allow the dimension d to grow at some suitable rate with respect to n.We assume the likelihood function p(x|θ) := p θ (x), x ∈ X , belongs to a parametric family satisfying the following conditions.
Assumption 1.There exist a function L and a vector of measurable function T such that Assumption 2. There exists a measurable function M such that | log p θ (x) − log p θ (x)| ≤ M (x) θ − θ 1 for all θ, θ ∈ Θ and x ∈ X .In addition, there exists The above assumptions are mild and require the log likelihood log p θ (x) to be suitably smooth in both θ and x.Unlike some pseudo-marginal MCMC algorithms [3,19], we do not require the likelihood to be bounded.We show in Appendix B that these assumptions can be carried over to a number of commonly used models in statistics and machine learning, such as mixtures of exponential family distributions, linear regression with random feature vectors, and classification tasks with fully connected neural networks (which include logistic regression as a special case).In the exponential family example, T (•) and M (•) are in fact functions of the sufficient statistic.In the neural network example, the constant L(θ) is related to the network complexity measure.
For large n, it suffices to consider the population log likelihood µ θ := E θ * log p θ (X).Let θ 0 be a stationary point of µ θ such that it represents a well-separated local optimum in the following sense.Assumption 3. µ θ is twice continuously differentiable in θ. θ 0 ∈ Int(Θ) and the Hessian of µ θ at θ 0 has eigenvalues λ i (H θ 0 ) < 0 for all i = 1, . . ., d.
Theorem 1. Suppose θ 0 is a stationary point of µ θ satisfying Assumption 3.For some α > 0, let c n → ∞ be a sequence such that dc 2+α n log cn m → 0, t be a fixed constant with t ∈ (0, 1/2), and . Then under Assumptions 1, 2, for large n, with probability at least 1 − η n .Here R n = {θ ∈ Θ : The theorem states that with high probability, the supremum of log π(θ) in the shrinking ball B(θ 0 ; δ n ) is larger than any point in the surrounding region R n by a constant margin.This guarantees with high probability π(θ) has a local optimum lying in a shrinking neighborhood centered at θ 0 .The preservation of local optima is important for optimization tasks.
We can further bound the distance between π(θ) and the tempered posterior π 1/T (θ) with one more assumption.Assumption 4. Θ is compact.

Both Theorems 1 and 2 require dc 2+α
n log cn m → 0, meaning c n and d need to go to infinity at a controlled rate.The convergence regime in both theorems covers a wide spectrum of batch size m, from ω(1) to O(n).
2. For a given m, if d is fixed, we can choose c n to be a value close to but smaller than √ m to make sure the temperature is not too high while the convergence holds.In Section 3.1, we show using numerical experiments that the choice of c n is very robust in low dimensional models.
3. The convergence requirement has a linear dependence on d.If m = n γ for some fraction γ, d can also go to infinity at the rate of n raised to some fractional power.

MHBT with stochastic gradient based proposal for neural networks
In large-scale machine learning tasks such as training deep neural networks (DNN), the high dimensionality and complex nature of the loss function surface have posed significant challenges for designing an MCMC sampler that can i) efficiently navigate the high dimensional surface, ii) result in a reasonable acceptance probability in the MH test, and iii) be computationally feasible.Recent studies on stochastic gradient MCMC have demonstrated their potential in training DNNs [10,17,33].However, these methods are derived from continuous-time SDEs, and each discretization step introduces some error which ideally could be corrected with an MH acceptance test.Many of these methods require a shrinking learning rate in order to circumvent the MH test.In this section, we propose and analyze a stochastic gradient-based proposal with appropriate MH correction, which is computationally efficient for DNN applications.

Proposal with Reversible Stochastic Gradient Langevin Dynamics (RSGLD)
Our goal is to design a proposal function that can explore a complex high dimensional surface efficiently guided by gradient information.We will start by considering the proposal used in SGLD, which has been widely adopted in the literature for large-scale training tasks.Let ĝI (θ) = 1 |I| i∈I ∇ θ i (θ) be the average gradient of mini-batch I, the proposal move for SGLD is given by where is the learning rate, N (0, I d ) is the iid Gaussian noise.Note that we have written the learning rate in a form that is consistent with the convention for SGD, so differs from the learning rate in the convention for SGLD by a factor of n.The original SGLD avoids the MH correction step since it is costly to compute using the full data.
In practice, in addition to the computational efficiency issue, another difficulty arises from the acceptance probability as d increases.Using (5) as the proposal in Algorithm 1, it can be treated as a mini-batch version of the MALA algorithm [26] (and the more general Hamiltonian MCMC).It is known that in these full-batch algorithms, needs to scale like d − 1 4 n −1 to maintain a reasonable acceptance probability [21].As an illustration, we consider using (5) as the proposal in Algorithm 1 to sample from the d-dimensional Gaussian N (0, I d ), where d = 1, 10, 10 2 , 10 3 , and n = 10 4 , m = 1000, c n = 20.In Figure 1(a), we computed the average acceptance probability for the first 2000 iterations initializing at the origin and then selected the largest learning rate with average acceptance probability at least 0.5 and 0.1.was chosen from a grid that scales like d − 1 4 n −1 .As can be seen, quickly diminishes to below 10 −7 when the dimension reaches 10 3 , if we still want to maintain a reasonable acceptance probability.Such a small learning rate results in very slow convergence and is therefore usually infeasible for practical use.
Our proposal, Reversible Stochastic Gradient Langevin Dynamics (RSGLD), is based on SGLD but enhances the acceptance probability by allowing the sampler to move in the direction of either ascending or descending gradient with an adjusted Gaussian noise.Using RSGLD as the proposal in Algorithm 1 gives us a mini-batch MH algorithm that both utilizes gradient information and is computationally efficient.Our proposal modifies (5) in two ways: i) a coin flip decides whether the move will be in the positive or negative direction of the gradient.For convenience, we will henceforth refer to a move in the positive (or negative) gradient direction as a forward (or backward ) step; ii) the backward step is coupled with a larger Gaussian noise.The new state θ is sampled by for some constant β ≥ 1. Denote this proposal q I (θ → θ ), then where φ(•; Σ) is the density of a multivariate Gaussian with zero mean and covariance matrix Σ.
In Algorithm 1, the acceptance probability for moving from (θ t , I t ) → (θ , I ) becomes Similar to the argument in Section 2.1, we can show using an auxiliary variable the above mini-batch MH algorithm is closely related to a tempered MCMC.We refer to Appendix C for details.
As an illustration to show both the backward step and its associated, enlarged Gaussian noise increase the acceptance probability, we used the same Gaussian setting as before (sampling from N (0, I d ), where d = 10, 10 2 , 10 3 , and n = 10 4 , m = 1000, c n = 20) and tested β = 1, which corresponds to only adding the backward move; and β = 2, which increases the size of the Gaussian noise in the backward move.In Figure 1(b)-(d), we can see both adding the backward move and increasing the Gaussian noise significantly improve the acceptance probability, and the trend is consistent for different dimensions.

Analysis of acceptance probability
In this section, we show that the RSGLD proposal leads to larger proposal ratio, thus increasing the MH ratio and acceptance probability overall.To focus on the behavior of the algorithm, we take the data x as given and fixed, and the only randomness lies in the selection of data batch and the Gaussian perturbation.Let Z ∼ N (0, I d ) and H I (θ) be the Hessian matrix of ĝI (θ) on mini-batch I.We assume the following conditions hold.Assumption 5. sup I∈Im,θ H I (θ) op ≤ λ , where • op is the operator norm.Assumption 6.For every θ, all batches give similar gradients.More specifically, for any two batches I and J, Proposition 1.For large n, suppose Assumptions 5 and 6 hold.Then depending on where the sampler is in the landscape of the target likelihood, we have the following approximations for the proposal ratio q J (θ →θ) q I (θ→θ ) , where θ is the current parameter value to be updated and I is the current batch.
Case 2).Assume ĝI (θ) 2 = 0, and the learning rate is small enough such that = o(d −1 ) for large d.Then we have q J (θ →θ) q I (θ→θ ) = 1 + o P (1) for both directions in (6).We defer the proof to Appendix D. 1.In this proposition, we consider the behavior of the proposal ratio in different regions of the landscape.The condition in Case 1) means the sampler is at a location where gradient information is strong.Simple rearranging in (6) shows in this case, the gradient part dominates the Gaussian noise.In Case 2), the sampler has reached a flat region of the landscape.
√ η 0 , the rate of which no longer depends on d and scales better than before (d − 1 4 n −1 ).Sparse ĝI (θ) (such as in typical neural networks) and large β can allow for even larger learning rates.
3. The result in Case 1) implies it is more likely for the MH step to accept a forward move than a backward move when the gradient is strong.This is a desirable property in optimization tasks for maintaining efficiency.In particular, the proposal ratio is lower bounded by 1 in the forward direction and hence will no longer shrink the overall MH ratio to zero.In Case 2), the proposal in the sampler behaves like a random walk if the learning rate is sufficiently small.

Convergence to known posterior
We first examined the convergence behavior of MHBT compared to the conventional MCMC sampler using the full dataset (termed full batch MCMC).As the analysis in Section 2.2 suggests, MHBT converges to a tempered version of the original posterior distribution.In order to explicitly measure the distance from this posterior, we considered d-dimensional (d = 2 and 5) Gaussian distributions with unknown mean θ, known covariance I d , where the prior of θ was set to be N (0, I d ).We generated n = 10 5 samples from this distribution with each true θ * i = 2.It follows then the posterior of θ given the data x is N n n+1 x, (n + 1) −1 I d , where x is the sample average.Raising the posterior to temperature T changes the variance to T n+1 I d .Mini-batch sampling was performed with Algorithm 1, setting the proposal q(•) as a Gaussian random walk with step size δ and mini-batch size m = 1000.Full batch MCMC was performed on the tempered posterior also with the same type of random walk proposal.The same step size δ was chosen for both algorithms and the average acceptance probability was around 0.3.
Figure 2(a) shows the total variation distance between the sampled distributions and true tempered posterior for the two MCMC algorithms on d-dimensional Gaussian, as the number of iterations increases.The distance was calculated by running 10 5 independent MCMC chains and taking the same number of independent samples from the tempered distribution, followed by discretization to group the values into d-dimensional histograms.The results shown correspond to c n = 20, which is smaller than √ m as discussed in Remark 1, although we note that a range of c n values (5-30) led to very similar results.For both d = 2 and 5, MHBT converges at a rate almost identical to full batch MCMC to the tempered posterior.

Gaussian mixture
To illustrate the tempering effect of MHBT and examine the accuracy of the approximation in Section 2.2, we consider an example in [31].We generated n = 10 5 samples from a 2-component mixture Gaussian model with parameters θ = (θ 1 , θ 2 ) following: We sampled θ using Algorithm 1, where the proposal q(•) is the Gaussian random walk with step size δ.We set the mini-batch size m to 1000.There remain two tuning parameters in the algorithm: c n and δ.We chose c n to be 20 and δ such that the average acceptance probability was around 0.3.Very similar results can be obtained by a range of c n values (e.g.5-30).Figure 2(b)-(c) show the sampled θ from 10 5 iterations and the contour plot of the tempered log posterior, log π T (θ) ∝ 1/T log π(θ).We can see that the two modes in these plots coincide well.
Figure 2(d)-(f) compare the trajectory of MHBT with that of the full batch MCMC in one of the two dimensions.The latter sampling was performed on the original posterior distribution, and the step size of the random walk was chosen so that the average probability was around 0.3.We fixed θ 1 = 0 and increased θ 2 from 0.5 to 4 so that the two modes in the posterior distribution became increasingly separated.In each case, MHBT is capable of visiting the two modes of the distribution whereas the full batch MCMC is trapped in one of the modes.This highlights the effect of tempering brought about by the mini-batch algorithm, which makes the landscape smoother and easier for the sampler to travel.

Neural networks Fully connected neural networks
We first tested MHBT with RSGLD on the standard MNIST handwritten digit classification task.The dataset was loaded directly from TensorFlow tutorial and consists of 55,000 instances for training and 10,000 instances for testing.We considered a neural network containing one hidden layer with 600 nodes and ReLU activation function (∼ 4 × 10 5 parameters).The outputs from the layer are connected to a 10-class softmax layer for classification.In this case, the log likelihood function is the negative of the cross entropy loss.The batch size was set to 100.We compared the performance of our method with a number of popular optimization methods in the neural network literature for a range of learning rates.In each training, we started RSGLD with a large β to initiate the moves and gradually decreased it as the training progressed.
Choosing β.Throughout training, we monitored the overall acceptance probability for each epoch, where by convention one epoch equals the total number of iterations it takes to step through the whole training dataset (in this case 55000/100 = 550 iterations).We decreased β according to the following adjustment phase once the acceptance probability became larger than 0.4 at the end of each epoch.During the adjustment phase, we ran 100 forward steps using the current parameter values and computed the MH acceptance probability.If the average probability of these forward steps exceeded 0.7, we decreased β by 5%.The maximum reduction allowed in each adjustment phase was 50%.The next epoch of training was then run with the new β value.On the other hand, when the average probability for one epoch dropped below 0.2, we increased β by 5%.We observed that in all experiments, β eventually stabilized to some constant slightly larger than 1.
Comparison with other methods.We performed extensive comparison with SGD and SGLD using various learning rates and multiple rounds of training to assess the stability of each method.Each round of training lasted 2.75 × 10 5 iterations (500 epochs), and all the parameters were initialized with independent N (0, 0.03) distribution.The same batch size (100) was used for all the methods.In this high dimensional setting, we explored a range of c n values around the batch size and show results using c n = 100.We additionally tested c n = 50, 200 under the same settings; the results are very similar thus omitted.
Table 1 shows the prediction errors of the three methods on the testing set, using the top class from the softmax layer as the predicted label.Each number is the median error obtained from 30 training rounds with the corresponding standard deviation shown in parentheses.Overall, the performance of RSGLD improves with large learning rate and eventually achieves better accuracy (smaller error) than that attainable by SGD or SGLD at any learning rate.RSGLD shows substantially better stability for large learning rate than the other two methods.In particular, when the learning rate is 0.2 or larger, SGD and SGLD can fail to converge completely for a significant fraction of the training rounds, which explains the large standard deviations.In general, the standard deviation of errors increases with the learning rate for all the methods, showing stability is hard to achieve with a large learning rate although it can lead to faster convergence and potentially better prediction.As explained in [32], using a large learning rate can help algorithms maintain a trajectory high from the valley floor and more easily overcome energy barriers as they explore the loss surface with stochastic gradients.In this sense, the stability of RSGLD under large learning rates is beneficial for training DNNs.We also observe that in all the experiments, the backward step in RSGLD was much less likely to be accepted compared to the forward step, which is discussed in Case 1) of Proposition 1 and is desirable for optimization efficiency.Since the forward step is identical to SGLD, this suggests a main reason for improvement offered by RSGLD lies in the algorithm being able to select a more efficient trajectory through the parameter space via the MH correction step.1: MNIST top class prediction error (%) on the testing set using 30 training rounds for each learning rate.Each number is the median error with standard deviation in parentheses.
In addition to checking the average performance of the methods from multiple training rounds, we also examine the lowest prediction error achieved under each learning rate from 30 rounds of training.Since SGD and SGLD did not converge most of the time under large learning rates, showing the average or median error would make the plot scale badly.Fig 3(a) shows a trend similar to Table 1 with RSGLD outperforming the other two methods for large learning rates.Overall the lowest error is achieved by RSGLD with learning rate around 0.4-0.5.Examples of detailed testing error trajectories for various methods are shown in Fig 3(b), where for each method we selected the learning rate with the best performance.We have further included RMSprop [29] with learning rate 0.005 and Adam [14] with learning rate 0.001 for comparison.The learning rate was chosen by optimization via grid search for these two methods.

Convolutional neural networks (CNN)
We next tested a standard three-layer CNN on the CIFAR-10 RGB image dataset [16], the detailed architecture of which is listed in Appendix Table 3.The network has around 4 × 10 6 parameters.The dataset consists of 60000 32 × 32 RGB images in 10 classes, with 50000 for training and 10000 for testing.All parameters were initialized independently with N (0, 0.02) distribution.The same batch size and c n were used, and the same schedule was used for decreasing β as in the last example.Similar to the comparison performed on MNIST, we used 20 rounds of independent training for each learning rate to check the accuracy and stability of RSGLD, SGD and SGLD, with each round lasting for 10 5 iterations.As shown in Table 2, RSGLD consistently outperformed the other two methods and the margin of difference becomes larger as the learning rate increases.

Conclusion
In this paper, we study an efficient MH-MCMC algorithm which uses mini-batches of data.We draw connections between the stationary distribution of this Markov chain and the tempered posterior, and provide the approximation errors for a general class of likelihood functions.We also propose RSGLD, a stochastic gradient based proposal to help the sampler navigate complex high dimensional surface with reasonable acceptance probability in the MH acceptance test.Empirically, we demonstrate the algorithm has good convergence behavior and the tempering effect helps move between well separated modes in classical low dimensional models.We demonstrate the efficacy of RSGLD in training neural networks with the MNIST and CIFAR-10 datasets and show that compared to popular optimization methods, we achieve improved accuracy and stability when the learning rate is large.

Appendix A Proofs of the main theorems
In this section, we first prove Theorems 1 and 2 in the main paper.We start the analysis by first showing two concentration lemmas.For brevity, we will write E θ * (•) as E(•).C, C 1 , . . .are general constants and might be different in every appearance.Lemma 3. Let c n be a sequence going to infinity such that c 2 n /m → 0 and Ω ⊂ Θ be a bounded subset.Under Assumptions 1-2, where C 1 is a constant depending only on θ * and δ 1 , Proof.We first consider fixed θ.Let ˜ θ (X i ) = log p θ (X i ) − µ θ and (Y 1 , . . ., Y n ) be an independent copy of (X 1 , . . ., X n ), then for s > 0, Letting x 2 , we have where for 0 < s ≤ δ 1 /L 0 .Putting all the parts together, To show uniform concentration, consider a n -covering of the set Ω with centers {θ 1 , . . ., θ N }, where N = K( n ) −d diam(Ω) d for some constant K since Ω is bounded.For any θ ∈ B 1 (θ j , n ), where B 1 denotes the 1 ball, sup by Assumption 2, and sup It follows then for 4s n < δ 2 , again by Assumption 3. Next note using the same calculation as in (13), for 2s , and Now with the n -covering, where We can now provide a uniform bound for the term I∈Im e cn(μ I (θ)−µ θ ) .Lemma 4. For some α > 0, let c n be a sequence going to infinity at a rate such that dc 2+α n log c n /m → 0. For θ ∈ Ω, Ω being a compact subset of Θ, for any fixed t > 0, and with g θ (X I(1) , . . ., X I(m) ) = e cn(μ I (θ)−µ θ ) .Thus Ū is a U-statistic.We first provide a bound on its expectation, Since log(1 taking The same rate can be obtained for the second term in (20).Overall we have Next we derive the concentration of the U-statistic Ū around its expectation.Let gθ (X 1 , . . ., X m ) = sup Noting the symmetry of g θ , we can first rewrite Ū − E( Ū ) as where {i 1 , . . ., i n } is a permutation of {1, . . ., n} and for n m = n m .Then for any fixed t > 0, It remains to calculate the second moment of g2 θ .Using (22), By Lemma 3, the first integral is bounded by using a similar calculation as (21), taking The second integral can be calculated in the same way to obtain the same order.Thus (24) and (25) imply Together with (22), we obtain the required bound in one direction.
The proof for the other direction is similar noting inf θ∈Ω U (θ) ≥ U , where for some normalizing constant C(x).Observe that maximizing log π(θ) is equivalent to maximizing S n (θ) = µ θ + c −1 n log U (θ).
Assumption 3 implies there exist 0 , δ 0 > 0 such that for all θ ∈ B(θ 0 ; δ 0 ), so the local optimum is well separated.Lemma 4 shows log U (θ) is uniformly small in θ for θ ∈ B(θ 0 ; δ 0 ).Taking fixed t and t < 1/2, for large enough n, with probability at least 1 − η n .Now we have sup with probability at least 1 − η n .Similarly, with probability at least 1 − η n .Putting these parts together, with probability at least 1 − η n , taking . The required result follows.
Further suppose Assumption 5 holds, we can prove Theorem 2.
Proof of Theorem 2: Let C(x) and C(x) be the normalizing constants for π 1/T (θ) and π(θ) respectively.Then It follows that The required bound follows by Lemmas 3 (applied to m = n) and 4.

Appendix B Applications
In this section, we illustrate our assumptions and results in Section 2.2 can be applied to a number of widely used models in statistics.

Mixture of exponential family distributions
We consider the problem of clustering with a K-component mixture model of exponential family distributions having a density function, each with parameter φ k = (φ k,1 , . . ., φ k,p ) ∈ Φ ⊂ R p .Let α = (α 1 , . . ., α K ) ∈ Λ be the unknown mixture proportions.Then collectively the set of parameters is given by θ = (α 1 , . . ., α K , φ 1 , . . ., φ K ) ∈ Θ = Λ × Φ K .We observe data points x 1 , . . ., x n , each drawn independently from the mixture distribution according to some true parameters θ * .The goal is to estimate the parameters without observing the class labels of the data points.The likelihood function is given by In this case, we can replace Assumptions 1 and 2 with the following conditions.
Assumption 7. Mixture of exponential family distributions.
Condition 1 ensures the data is from a real K-component mixture and there is no model selection issue; conditions 2 and 4 are commonly used regularity conditions; condition 3 is satisfied by many commonly occurring exponential family distributions including multivariate Gaussian, chi-squared distribution, and gamma distribution.First note that h(•) introduces an extra log h(x) in log p θ (x), which can be handled in Lemma 3 using standard concentration inequalities such as Bernstein's inequality using condition 3. Since the term is data dependent only, the convergence rate is dominated by the rest of log p θ (x) that depends on θ.For convenience, we will omit h(x) from now on.
To check Assumption 1, let Taking the derivative with respect to t, it is easy to check that Furthermore, there exists δ 1 > 0 such that by condition 4 in Assumption 7.
To see that Assumption 2 holds, similarly taking the derivative of f (t, θ) with respect to θ, it is easy to check By condition 2 and 3, there exists δ 2 > 0 such that E θ * e δ 2 M (X) < ∞.

Linear regression
In linear regression, we observe n data points with z 1 = (y 1 , x 1 ), . . ., z n = (y n , x n ), y i ∈ R, x i ∈ R d .We have y i = θ, x i + i , where i are iid Gaussian noise with unknown variance σ 2 .Here both θ and σ are the parameters.We consider x i as feature vectors generated iid from some likelihood p 0 (•), which does not depend on the parameters θ or σ.The likelihood function for a data point (x, y) is given by We assume the following conditions hold.
3. The likelihood p 0 (x) satisfies Assumption 1, but with the Lipschitz constant independent of θ and σ.The feature vector is bounded in the sense that x 2 < ∞.
Under Assumption 8, it is easy to verify Assumptions 1 and 2.

Classification with fully connected neural networks
We are given n data points z 1 = (y 1 , x 1 ), . . ., z n = (y n , x n ), where y i ∈ {1, . . ., K} are labels and x i ∈ R q are features (e.g.pixels in images) generated iid from some likelihood p 0 (•), which does not depend on θ.We consider the popular deep learning classification task with N fully connected layers.In the -th layer, the input x ( −1) undergoes an affine transformation followed by a nonlinear transformation by an activation function σ(•).The output of the -th layer is then given by where W ( ) is the weight matrix, b ( ) is the bias vector in the -th layer.Here x (0) corresponds to the input feature vector; the last layer is the softmax function being the j-th row of W (N ) .x (N ) can be interpreted as prediction probabilities.Training a neural network involves minimizing some loss function between the labels y = (y 1 , . . ., y n ) and the predictions.We consider the commonly used cross entropy loss, where θ is the collection of W ( ) , b ( ) , = 1, . . ., N .In this way we can interpret −H(y, θ) as the sum of log likelihood p θ (y|x), with y coming from a multinomial distribution with parameters specified by θ and the features x.The logistic regression is a special case of this.
Next we show that Assumptions 1 and 2 are satisfied if the following hold. .
Using the fact that log( exp(t k ) K j=1 exp(t j ) ) is Lipschitz in 2 norm, the above is bounded by Continuing the same way with x (N −1) − x(N−1) 2 , we can show using condition 2 in Assumption 9.

Appendix D Improved acceptance probability with RS-GLD
We now give the proof of Proposition 1, which calculates the proposal ratio of RSGLD.
Proof of Proposition 1. Case 1) and a forward move in (6).
In this case we have where ĝJ (θ )

Figure 3 :
Figure 3: (a) Lowest error rate in % achieved by the three methods out of 30 training rounds using various learning rates.(b)Examples of testing error trajectories using different training methods.

Now we are ready to prove Theorem 1 .
Proof of Theorem 1: It follows from Equation (4) in the paper that log π(θ) = c n µ θ + log U (θ) + C(x)

Assumption 9 .
Activation function and operator norms of weight matrices.1.The activation function σ(•) is bounded and Lipschitz continuous in 2 norm.

Table 2 :
CIFAR-10 top class prediction error (%) on the testing set using 20 training rounds for each learning rate.The numbers shown are the median/lowest errors out of 20 rounds.