Asynchronous and Distributed Data Augmentation for Massive Data Settings

Abstract Data augmentation (DA) algorithms are slow in massive data settings due to multiple passes through the entire data. We address this problem by developing a DA extension that exploits asynchronous and distributed computing. The extended DA algorithm is called Asynchronous and Distributed (AD) DA with the original DA as its parent. Any ADDA is indexed by a parameter r ∈ (0, 1) and starts by dividing the entire data into k disjoint subsets and storing them on k processes. Every iteration of ADDA augments only an r-fraction of the k data subsets with some positive probability and leaves the remaining (1 − r)-fraction of the augmented data unchanged. The parameter draws are obtained using the r-fraction of new and (1 − r)-fraction of old augmented data. We show that the ADDA Markov chain is Harris ergodic with the desired stationary distribution under mild conditions on the parent DA algorithm. We demonstrate that ADDA is significantly faster than its parent for many (k, r) choices in three representative models. We also establish the geometric ergodicity of the ADDA Markov chain for all the three models, which yields asymptotically valid standard errors for estimates of desired posterior quantities. Supplementary materials for this article are available online.


Introduction
DA algorithms are a popular choice for Bayesian inference using Markov chain Monte Carlo. A variety of DA algorithms have been developed over the years for sampling-based posterior inference in a large class of hierarchical Bayesian models. The development of DA algorithms remains popular because they are simple, easily implemented, and numerically stable; however, DA algorithms are very slow in massive data applications because they pass through the full data in every iteration. Taking advantage of asynchronous and distributed computations, we propose the ADDA as a scalable generalization of any DA-type algorithm. ADDA is useful for fitting flexible Bayesian models to arbitrarily large data sets, retains the convergence properties of its parent DA, and reduces to the parent DA under certain assumptions.
Consider the setup of DA algorithms. In a typical DA application, we augment "missing data" to the observed data and obtain the "augmented data" model, where the term missing data is interpreted broadly to include any additional parameters or latent variables. Under the augmented data model, the imputation (I) step draws the missing data from their conditional distribution given the observed data and the current parameter value. The I step is followed by the prediction (P) step that draws the "original" parameter from its conditional distribution given the augmented data. This completes one cycle of a DA algorithm. Starting from some initial value of the parameter, the I and P steps are repeated sequentially to obtain a Markov chain for the missing data and parameter. The marginal parameter samples from the DA algorithm form a Markov chain with the posterior distribution of parameters given the observed data as its stationary distribution [16,42].
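To make the I/P cycle concrete, here is a minimal sketch of a generic DA loop in R; draw_missing and draw_parameter are hypothetical model-specific samplers for f(D | θ, D_obs) and f(θ | D, D_obs), not part of the original algorithm descriptions.

```r
# Generic DA loop (sketch): alternate the I and P steps and record the
# marginal parameter draws, which form the DA Markov chain.
run_da <- function(theta0, data_obs, n_iter, draw_missing, draw_parameter) {
  theta <- theta0
  draws <- vector("list", n_iter)
  for (t in seq_len(n_iter)) {
    d_mis <- draw_missing(theta, data_obs)    # I step: D ~ f(D | theta, D_obs)
    theta <- draw_parameter(d_mis, data_obs)  # P step: theta ~ f(theta | D, D_obs)
    draws[[t]] <- theta
  }
  draws
}
```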
The convergence of the DA Markov chain to its stationary distribution can be extremely slow.
This problem has been addressed by developing generalizations of many DA algorithms based on missing data models indexed by a "working parameter" that is identified under the augmented data model but not under the observed data model. Parameter expanded DA chooses any working parameter, conditional DA chooses the working parameter that leads to maximum acceleration in the convergence to the stationary distribution, and marginal DA averages over values of the working parameter drawn from its prior distribution [51,17]. All these generalizations accelerate convergence to the stationary distribution by relaxing many restrictions in the observed data model. Modern massive data settings, however, pose severe computational challenges for DA and its parameter expanded versions in that the size of the augmented missing data can be really large, resulting in a prohibitively slow I step in every iteration of the DA or its extension.
The inefficiency of DA-type algorithms in massive data settings has received significant attention. The Expectation-Maximization (EM) algorithm is the deterministic version of DA, and its inefficiency is due to the slow E step [14]. Cappé and Moulines [10] have addressed this issue by developing the family of online EMs that modify the E step using stochastic approximation. They compute the conditional expectation of the complete data log likelihood using a small fraction of the full data, and the online E step increases in accuracy as a greater fraction of the full data is processed. The stochastic counterparts of online EMs are the suite of Monte Carlo algorithms based on subsampling [23,6,12,13,62,39] and stochastic gradient descent [57,3,8,15,7,37]; see Nemeth and Fearnhead [30] for a recent overview. All these algorithms are extremely scalable, have convergence guarantees, and are broadly applicable; however, their performance is sensitive to the specification of tuning parameters in hierarchical Bayesian models, including the subsample size, gradient step-size, and learning rate. Their optimal performance requires proposal tuning, which is a major limitation in their use as off-the-shelf algorithms for Bayesian inference.
Distributed Bayesian inference approaches based on the divide-and-conquer technique also rely on DA and exploit its ease of implementation. A typical application involves dividing the full data into smaller subsets, obtaining parameter draws in parallel on all the subsets using a DA-type algorithm, and combining the parameter draws from all the subsets. The combined parameter draws are used as a computationally efficient alternative to draws from the true posterior distribution.
A variety of algorithms have been developed over the years at an increasing level of generality and applicability [46,27,29,48,61,60,21,59,54,33]. There are two major limitations in all these methods. First, the combined parameter draws do not provide a Markov chain with the target posterior based on the entire data as the stationary distribution, which implies that quantification of the Monte Carlo error is challenging. Second, the theoretical guarantees of these methods are based on a normal approximation of the posterior distribution as the subset sample size and the number of subsets tend to infinity. No guarantees are available about the distance between the distribution of the combined parameter draws and the target posterior distribution.
Addressing both these problems, the proposed ADDA offers a simple approach for bypassing problems due to the slow I step using asynchronous and distributed computations, while at the same time producing a single Markov chain with the target posterior as a marginal of its stationary distribution. The key observation which enables such a construction is that in many DA settings, the missing data can be partitioned into several sub-blocks such that two conditions are satisfied.
First, all these sub-blocks are mutually independent given the original parameter and the observed data. Second, the conditional posterior distribution (given the original parameter) of each sub-block depends only on an exclusive subset of the observed data; see the representative examples in Sections 3.1–3.3 for greater details.
An ADDA algorithm starts by reserving (k + l) computing units called processes, where l ≪ k, and k and l are the numbers of workers and managers, respectively. The processes could be cores in a CPU, threads in a GPU, or machines in a cluster. Every implementation of ADDA divides the full data into k smaller disjoint subsets, stores them on the k workers, and performs the I step in parallel on the k workers. The workers communicate the results of their I steps to the manager processes that are responsible for performing the P step. The managers also track progress of the algorithm by maintaining the latest copies of I step results for every worker. Let an r ∈ (0, 1) and an ε ∈ (0, 1) be given. Then, the managers wait for all workers to return their results with probability ε. With probability 1 − ε, the managers wait to receive the I step results from an r-fraction of the workers and perform the P step using an augmented data model that depends on the r-fraction of new I step results and the (1 − r)-fraction of old I step results. This sequence of I and P steps is repeated until convergence, which generalizes the classical DA-type algorithms by using r as an additional working parameter. If r = 1, then we recover a distributed generalization of the parent DA algorithm.
The parameter ε is introduced as a theoretical device for ensuring that any ADDA algorithm produces a Markov chain. Any positive ε ensures that the missing data sub-block corresponding to each worker has a positive probability of being updated at every iteration. This is crucial for establishing theoretical results about the ADDA chain, including Harris ergodicity and geometric ergodicity. In practice, ε can be set to a small number for faster computations.
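The roles of r and ε in the manager's wait rule can be sketched as follows; the random finishing order here is a stand-in for the actual asynchronous completion times, which depend on the computing environment.

```r
# Which workers' I-step results does the manager use in one iteration?
# With probability eps it waits for all k workers; otherwise it uses the
# first ceiling(r * k) workers that finish.
updated_workers <- function(k, r, eps) {
  if (runif(1) < eps) {
    seq_len(k)                      # wait for all workers
  } else {
    finish_order <- sample.int(k)   # stand-in for random finishing times
    sort(finish_order[seq_len(ceiling(r * k))])
  }
}
set.seed(1)
updated_workers(k = 10, r = 0.4, eps = 0.05)  # e.g., 4 of the 10 blocks refreshed
```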
It is important to compare and contrast the proposed ADDA algorithm with the asynchronous Gibbs sampling algorithms that have been developed in machine learning, motivated by topic models. Newman et al. [31] developed the first such algorithm, which has been extended by exploiting various synchronization schemes and computer architectures [2,25]. The only similarity between this class of algorithms and ADDA is that the data are partitioned and stored on the workers, which run the sampling algorithm locally. The manager process is absent in asynchronous Gibbs sampling because every worker draws its local set of parameters and latent variables from the local full conditional without waiting for the full Gibbs cycle to finish. This also implies that the parameter and latent variable updates do not form a Markov chain. In simple Gaussian models, this strategy produces draws from a distribution that fails to converge to the target [18].
Terenin et al. [50] fix this issue by introducing a correction step that allows a worker to accept or reject a draw with a probability based on the corresponding acceptance ratio. This additional Metropolis step increases the computational burden and the amount of information that needs to be communicated among the workers, but the authors are able to establish convergence to the desired stationary distribution under additional regularity conditions. The sequence of iterates generated by this exact algorithm still does not form a Markov chain in general, and hence one does not have access to the standard Markov chain central limit theorem (CLT) based approaches for quantifying the MCMC standard error (see the discussion below). More recently, these ideas have been extended to Bayesian variable selection using Gaussian spike-and-slab priors [4]. Developing a general theoretical understanding of these asynchronous Gibbs algorithms is still an active area of research.
On the other hand, any ADDA algorithm inherits several attractive properties of its parent DA.
First, being a generalization of the parent DA for a valid choice of (k, r), any ADDA is easy to implement and numerically stable. Second, ADDA is the stochastic counterpart of the distributed EM, is independent of the computer architecture, and has broad applicability [47]. Finally, while the marginal sequence of original parameter values generated by the ADDA algorithm is not a Markov chain (unlike the DA algorithm), the joint sequence of original and augmented parameter values is a Markov chain whose stationary distribution has the targeted posterior as a marginal.
This facilitates a simpler theoretical analysis of the Monte Carlo error based on the familiar tools for analyzing Markov chains in sampling-based posterior inference; see Sections 3.1–3.3 below.
We now describe the structure and key contributions of this paper. The general framework for the ADDA algorithm is developed in Section 2.1, and Harris ergodicity of the resulting Markov chain is established in Theorem 2.1. We then develop ADDA extensions of three representative DA algorithms in Sections 3.1–3.3. There are two kinds of massive data settings that are encountered in modern applications. The first is the "massive n setting", where the number of observations or samples n is extremely large, typically of the order of millions or larger. For this setting, we consider two illustrative examples: the Pólya-Gamma DA algorithm [38] for Bayesian logistic regression and the marginal DA algorithm [51] for linear mixed-effects modeling. For both algorithms, the number of augmented parameters is equal to the number of observations, which makes the I step computationally very expensive and provides motivation for applying the ADDA methodology.
The corresponding ADDA extensions are specified in Algorithms 1 and 3. The second big data setting is the "massive p setting", where the number of variables p in the dataset is extremely large.
For this setting, we consider the Bayesian lasso DA algorithm [40] for high-dimensional regression, where the number of augmented parameters equals the (large) number of regression parameters p. This provides another motivation for applying the ADDA methodology, and the corresponding ADDA for the Bayesian lasso is presented in Algorithm 2.
We then undertake extensive simulation studies for each of the three representative examples to understand and evaluate the performance of the respective ADDA extensions depending on the number of workers k and the update fraction r. The two hallmarks of ADDA are distributed and asynchronous processing. Parallel processing by distributing the independent draws of the appropriate sub-blocks of the missing data to the k workers obviously leads to faster computations with no adverse impact on the mixing of the Markov chain [58]. The asynchronous updates, however, come with a tradeoff. Smaller values of the fraction r lead to faster computations per iteration but also slow down the mixing of the Markov chain; therefore, compared to the DA algorithm, the same number of iterations of the ADDA algorithm takes much less time but provides less accurate estimates of desired posterior quantities. The idea, as demonstrated by our extensive simulations in Section 4, is that typically the computational gain is quite significant with a comparatively small loss of accuracy. For example, the MovieLens ratings data analysis in Section 4.3 using logistic regression and linear mixed-effects models shows that the ADDA algorithm is anywhere between three and five times faster and only around 2% less accurate than its parent after 10,000 iterations; see Figures 9–12.
As mentioned above, the iterates generated by the ADDA algorithm form a Markov chain whose stationary distribution has the target posterior as a marginal. This potentially allows us to directly leverage standard MCMC approaches to quantify the Monte Carlo standard error of the resulting estimators of posterior quantities. All asymptotically consistent estimators of the MCMC standard error, however, rely on the existence of a Markov chain CLT. A standard method for establishing the existence of such a CLT is to show geometric ergodicity; that is, that the distribution of the Markov chain converges geometrically fast to the stationary distribution as the number of iterations increases. It is well known that establishing geometric ergodicity of statistical continuous state space Markov chains is in general very challenging [20]. The drift and minorization analysis typically needed for this purpose can be quite involved and "a matter of art", requiring the analysis to be heavily adapted to the structure of individual Markov chains.
While geometric ergodicity has been established for the Markov chains corresponding to all the three DA algorithms considered in the paper, adapting these analyses to establish a similar result for the ADDA extensions is not feasible given the complications introduced by the asynchronous updates. The main reason is that the subset of missing data that is updated in any given iteration depends on the current parameter value, which implies that the marginal parameter process of the ADDA chain is not Markov in general. Also, establishing minorization conditions, which are integral parts of the original DA proofs, becomes very challenging. Through a fresh analysis which only uses appropriate drift conditions that are unbounded off compact sets, we establish geometric ergodicity for all the three ADDA extensions in Theorems 3.1, 3.2 and 3.3.


Asynchronous and Distributed Data Augmentation (ADDA)

The General Framework
Consider the setup of a general ADDA algorithm. Assume that we have chosen an r ∈ (0, 1) and reserved k worker and 1 manager processes. Let θ be the model parameter, D_obs and D be the observed and missing data, and D_aug = {D_obs, D} be the augmented data. Let M and Θ denote the missing data and the parameter spaces, respectively. We assume that both sets are equipped with countably generated σ-algebras and respective σ-finite measures µ and ν. Let f(D, θ | D_obs) denote the joint posterior density (with respect to µ × ν) of D and θ, and let f(D | θ, D_obs) and f(θ | D, D_obs) denote the conditional posterior densities of the missing data given the parameter and of the parameter given the augmented data, respectively. Let D_i denote the missing data subset distributed to worker i so that D = (D_1, ..., D_k). We assume that D_1, ..., D_k are mutually independent given (θ, D_obs), so that f(D | θ, D_obs) factors over the workers. If the parameter θ can also be divided into conditionally independent blocks, one could introduce more than one manager process, but for ease of exposition we restrict ourselves to a setting with one manager process.
(1) For t = 0, 1, ..., the manager
(M-a) waits to receive updated D_i^(t+1)'s from an r-fraction of the workers with probability 1 − ε, and with probability ε waits to receive all the updated D_i^(t+1)'s;
(M-b) creates D^(t+1) by replacing the relevant D_i^(t)'s with the newly received D_i^(t+1)'s;
(M-c) draws θ^(t+1) from f(θ | D^(t+1), D_obs); and
(M-d) sends θ^(t+1) to all the worker processes and resets t = t + 1.

(2) For t = 0, 1, ..., the worker i (i = 1, ..., k)
(W-a) waits to receive θ^(t) from the manager process; and
(W-b) draws D_i^(t+1) from f(D_i | θ^(t), D_obs), sends D_i^(t+1) to the manager process, resets t = t + 1, and goes to (W-a) if θ^(t+1) is not received from the manager before the draw is complete; otherwise, it truncates the sampling process, resets t = t + 1, and goes to (W-b).
The general steps for an ADDA algorithm are easily summarized into Asynchronous and Distributed (AD) I and P steps. The ADDA algorithm initializes (D, θ) at (D^(0), θ^(0)), and the AD-I and AD-P steps in the (t + 1)th iteration for t = 0, 1, ... are as follows.

AD-I step: Worker i (i = 1, ..., k) draws D_i^(t+1) from f(D_i | θ^(t), D_obs) and sends D_i^(t+1) to the manager (if the draw is finished before receiving θ^(t+1)).

AD-P step:
As soon as the manager receives the required fraction (r with probability 1 − ε and 1 with probability ε) of updated D_i^(t+1)'s, it draws θ^(t+1) from f(θ | D^(t+1), D_obs). Note that if the r-fraction update regime is chosen, then the manager sets D_i^(t+1) = D_i^(t) for the remaining (1 − r)-fraction. As soon as the workers receive θ^(t+1), they stop any ongoing activity and proceed with the AD-I step for the (t + 2)th iteration.
One cycle of ADDA consists of the AD-I step followed by the AD-P step, and they are repeated sequentially to obtain the {(D^(t), θ^(t)) : t = 0, 1, ...} chain for the missing data and parameter.
We emphasize again that at the end of iteration t + 1, with probability 1 − ε, only an r-fraction of D_1, ..., D_k are replaced by the manager with new draws received from the workers, whereas the remaining (1 − r)-fraction remain unchanged. The AD-I and AD-P steps in ADDA are distributed generalizations of the I and P steps in its parent DA. If r = 1, then the AD-I and AD-P steps reduce to the I and P steps of their parent DA.
ADDA inherits the simplicity and stability of its parent DA. Its implementation requires communication between the manager and worker processes that is easily implemented using software packages such as the parallel package in R. The most important advantage of ADDA is that its cycle is faster than the parent DA cycle by a factor of O(k) for every r. The main reason is that the k workers perform the AD-I steps in parallel on a (1/k)-fraction of the full data, resulting in an O(k)-fold speed-up over the I step of the parent DA. The P steps of ADDA and its parent DA are identical given the missing data, so their timings are also similar. If r is sufficiently small, then ADDA offers a simple and general strategy for scaling any existing DA-type algorithm to massive data settings by choosing a sufficiently large k. The next section carefully investigates crucial theoretical properties of the ADDA algorithm, which guarantee that draws from the ADDA algorithm can be used to effectively approximate relevant posterior quantities.
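A serial simulation of one ADDA cycle makes the staleness of the unrefreshed blocks explicit; it reuses updated_workers() from the sketch above, and draw_missing_block and draw_parameter are again hypothetical model-specific samplers.

```r
# One ADDA cycle (serial simulation): refresh only the blocks in S and keep
# the remaining blocks at their old (stale) values before the P step.
adda_cycle <- function(D, theta, data_shards, k, r, eps,
                       draw_missing_block, draw_parameter) {
  S <- updated_workers(k, r, eps)   # blocks refreshed in this iteration
  for (i in S) {
    D[[i]] <- draw_missing_block(theta, data_shards[[i]])  # AD-I on worker i
  }
  theta <- draw_parameter(D, data_shards)                  # AD-P on the manager
  list(D = D, theta = theta)
}
```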
The ADDA can be interpreted as a hybrid Gibbs sampler. The AD-P step is a systematic scan step for sampling the parameter θ from its conditional posterior distribution given D, but the AD-I step is a "random subset scan" step which updates a random subset of the missing data blocks D_1, ..., D_k. Other hybrid versions of systematic and random scan Gibbs steps have been considered in the literature. Backlund et al. [5] consider a DA setup where the parameter θ is partitioned into two blocks. They construct and study a hybrid sampler that performs a systematic scan step for the entire missing data D at each iteration and updates exactly one of the two parameter blocks with fixed probabilities s and 1 − s, respectively. A general (non-DA) setting with multiple blocks is considered in [26], where a sampler updates a single coordinate in each iteration. The choice of the block to update is made randomly based on a vector of fixed probabilities which depend on the block index that was updated in the previous iteration; therefore, this strategy subsumes the traditional systematic and random scan samplers as special cases. The ADDA is significantly different from the above strategies and updates a randomly chosen subset of the conditionally independent blocks of the missing data D, while performing a systematic scan update of the entire parameter θ. Note also that the probabilities of choosing various relevant subsets at a given iteration can in general depend on the current value of the parameter θ.
A simple observation shows that the {(D^(t), θ^(t))}_{t=0}^∞ process in ADDA is Markov. For an ADDA algorithm with r ∈ (0, 1), the knowledge of (D^(t), θ^(t)) makes the entire past irrelevant for generating (D^(t+1), θ^(t+1)). Specifically, (D^(t+1), θ^(t+1)) is conditionally independent of {(D^(j), θ^(j))}_{j=0}^{t−1} given (D^(t), θ^(t)); therefore, the process {(D^(t), θ^(t))}_{t=0}^∞ is still Markov. It is also easily seen that both the AD-P step and the AD-I step leave f(D, θ | D_obs) invariant. This is promising, but to use the ADDA iterates to approximate relevant posterior expectations and quantiles, we need to establish Harris ergodicity. The following theorem shows that the ADDA Markov chain is Harris ergodic as long as ε is positive, the original DA chain is Harris ergodic, and its transition kernel satisfies a mild absolute continuity condition. The proof of the theorem is given in Supplementary Section A.1.

Theorem 2.1. Let K_DA denote the transition kernel of the parent DA chain, which is equivalent to the ADDA chain with r = 1, and let Π(· | D_obs) denote the joint posterior distribution of D and θ. Assume that K_DA((D, θ), ·) is absolutely continuous with respect to Π for every (D, θ). Then, if ε > 0 and the parent DA chain is Harris ergodic, the ADDA chain is Harris ergodic.

The assumption ε > 0 is needed because it ensures that there is a positive probability that each worker returns an updated value at each iteration. This allows us to leverage the Harris ergodicity of the parent DA chain to establish relevant properties of the ADDA chain. If ε = 0, then assumptions regarding the computer architecture and the processing power of the workers would need to be made to guarantee Harris ergodicity; for example, see Terenin et al. [50]. This can get quite messy and tricky to ensure in real-life computing. Using ε > 0 is an elegant and clean way of side-stepping these issues. Of course, smaller values of ε are preferred for faster computations. Theorem 2.1 is a fundamental step towards understanding theoretical properties of the {(D^(t), θ^(t))} process in ADDA but is still far from the end. An important next step is to establish a CLT for the {(D^(t), θ^(t))} process in order to provide asymptotically valid standard error bounds for MCMC-based approximations. As discussed before, the standard method of establishing a CLT is to establish geometric ergodicity; however, proving geometric ergodicity can be quite challenging in general. A proof of geometric ergodicity is not available for a large majority of statistical Markov chains. For Markov chains where a proof is available, it is typically based on drift and minorization conditions, which are specifically tailored for the particular chain at hand.
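The invariance claim above can be verified with a standard partial-Gibbs computation. For a fixed update set S ⊆ {1, ..., k} (a sketch; in ADDA the realized S is random), starting from (D, θ) ∼ f(·, · | D_obs) and refreshing only the blocks in S from their conditional gives

∫ f(D_S, D_{S^c}, θ | D_obs) f(D'_S | θ, D_obs) µ(dD_S) = f(D_{S^c}, θ | D_obs) f(D'_S | θ, D_{S^c}, D_obs) = f(D'_S, D_{S^c}, θ | D_obs),

where the middle step uses f(D'_S | θ, D_obs) = f(D'_S | θ, D_{S^c}, D_obs), which holds because the blocks D_1, ..., D_k are mutually independent given (θ, D_obs).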
In the next section, we illustrate the application of ADDA to logistic regression, high-dimensional variable selection, and linear mixed-effects modeling. For the three examples, the {(D^(t), θ^(t))} process in the parent DA is known to be geometrically ergodic. Extension of each of these proofs to ADDA is difficult for at least one of two main reasons. First, each cycle of ADDA updates (with probability 1 − ε) a random r-fraction of D_1, ..., D_k. This random subset update significantly complicates the transition density in terms of establishing a minorization condition involved in the proof of geometric ergodicity. Second, some of the parent DA geometric ergodicity proofs focus on the marginal process {θ^(t)}, which is a Markov chain in the parent DA setting, and extend the result to the full process {(D^(t), θ^(t))}. This strategy fails in the ADDA setting because the {θ^(t)} process is not in general Markov. We provide results establishing geometric ergodicity for the ADDA chains in the three above-mentioned models in the next section. A key idea in the three proofs is to identify and establish geometric drift conditions, where the drift functions are unbounded off compact sets. Some of these proofs are established under weaker conditions than those required for the parent DA chains.

ADDA for Bayesian Logistic Regression
Consider the implementation of Pólya-Gamma DA for Bayesian logistic regression [38]. Let n be the sample size, β = (β_1, ..., β_p)^T ∈ R^p be the regression coefficients, and y_i, s_i, x_i = (x_i1, ..., x_ip)^T be the response, number of trials, and covariates for sample i, respectively, where y_i ∈ {0, 1, ..., s_i}. The hierarchical model for logistic regression is

y_i | β ∼ Binomial(s_i, {1 + e^{−x_i^T β}}^{−1}) (i = 1, ..., n), β ∼ N(µ_β, Σ_β). (1)

The DA algorithm augments the model in (1) with n Pólya-Gamma random variables ω_1, ..., ω_n specific to the n samples, and cycles between the following I and P steps for a large number of iterations starting from a given value of β:

PG.1 (I step) Draw ω_i given β from PG(s_i, |x_i^T β|) for i = 1, ..., n; and
PG.2 (P step) Draw β given ω_1, ..., ω_n from N(m_ω, V_ω), where V_ω = (X^T Ω X + Σ_β^{−1})^{−1}, m_ω = V_ω (X^T κ + Σ_β^{−1} µ_β), Ω = diag(ω_1, ..., ω_n), and κ = (y_1 − s_1/2, ..., y_n − s_n/2)^T,

where PG is the Pólya-Gamma distribution.
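A sketch of one parent PG DA cycle in R, using the rpg sampler from the BayesLogit package; the prior mean mu_b and prior precision Sigma_b_inv follow the normal prior in (1) as reconstructed above, and are assumptions of this sketch.

```r
library(BayesLogit)  # rpg() draws Polya-Gamma random variables

# One cycle of the parent PG DA (sketch): PG.1 then PG.2.
pg_da_cycle <- function(beta, X, y, s, mu_b, Sigma_b_inv) {
  omega <- rpg(num = nrow(X), h = s, z = as.vector(X %*% beta))  # PG.1 (I step)
  V <- solve(crossprod(X * omega, X) + Sigma_b_inv)              # V_omega
  m <- V %*% (crossprod(X, y - s / 2) + Sigma_b_inv %*% mu_b)    # m_omega
  drop(m + t(chol(V)) %*% rnorm(ncol(X)))                        # PG.2 (P step)
}
```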
The ADDA algorithm with the Pólya-Gamma DA algorithm as its parent modifies step PG.1.
Let s = (s_1, ..., s_n) and ω = {ω_1, ..., ω_n}. Then, following the notation of Section 2.1, the Pólya-Gamma DA has D_obs = {y, s, X}, D = ω, and θ = β. For many datasets, the sample size n can be really large. For example, the MovieLens data analyzed in Section 4.3 has 10 million samples. Instead of updating all n ω_i's in every iteration, ADDA updates only an r-fraction of them (with probability 1 − ε) and reduces to its parent DA when r = 1. The ADDA algorithm runs after we randomly split the n samples into k disjoint subsets and store them on k worker processes. Let n_i be the number of samples assigned to worker i, (y_j(i), s_j(i), x_j(i), ω_j(i)) be the jth augmented sample on worker i (j = 1, ..., n_i), and D_i = {ω_1(i), ..., ω_{n_i}(i)}. Note that the ω_j(i)'s in D_i are rearranged to match the ordering of samples in D_obs so that D = ω = {ω_1, ..., ω_n}.
The convergence analysis of the parent PG DA chain has been performed in the literature. In particular, Choi and Hobert [11] show that the parent PG DA chain is uniformly ergodic (using the marginal β^(t) chain) if proper priors are used. Following this work, Wang and Roy [56] show the geometric ergodicity of the PG DA chain with flat priors. The uniform ergodicity proof for the parent DA chain in Choi and Hobert [11] for the binary and proper prior setting is based on a minorization argument on the marginal {β^(t)} chain; however, this proof strategy fails for ADDA because the {β^(t)} process for Φ_ADPG is not Markov. As a result, we take a different approach where we establish a geometric drift condition using a function of (β, ω) which is unbounded off compact sets. Wang and Roy [56] follow a similar strategy for proving geometric ergodicity of the parent DA chain with improper priors, but they focus on the marginal ω chain instead of the (β, ω) chain. Furthermore, the previous two works consider the parent DA chain when the response is binary, whereas Theorem 3.1 deals with the ADDA chain in a more general binomial setting.

ADDA for Bayesian High Dimensional Variable Selection
Consider Bayesian high dimensional variable selection using a Bayesian lasso shrinkage prior [40].
This DA is different from the Pólya-Gamma DA in that the I step in BL.1 draws p latent variables τ_1, ..., τ_p instead of n. In high-dimensional problems, n ≪ p, and updating τ_1, ..., τ_p in every iteration is time consuming even when n is small. Following Section 2.1, the Bayesian lasso DA has D_obs = {y, X}, D = τ = {τ_1, ..., τ_p}, and θ = (β, σ²). Unlike the Pólya-Gamma DA, the computational bottleneck in the I step arises when p is large. The ADDA algorithm starts by partitioning {1, ..., p} into k disjoint subsets denoted as P_1, ..., P_k.
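The worker-side AD-I step can be sketched as below, assuming the Park-Casella form of the conditional, 1/τ_j | β, σ² ∼ Inverse-Gaussian(√(λ²σ²/β_j²), λ²), with λ the lasso regularization parameter; each worker draws only its own block of τ's.

```r
library(statmod)  # rinvgauss(n, mean, shape)

# AD-I step on worker i for the Bayesian lasso (sketch): draw the block of
# tau's for the coordinates in P_i given the current (beta, sigma2).
lasso_i_step <- function(beta_block, sigma2, lambda) {
  inv_tau <- rinvgauss(length(beta_block),
                       mean  = sqrt(lambda^2 * sigma2 / beta_block^2),
                       shape = lambda^2)
  1 / inv_tau  # worker i's block of tau draws, returned to the manager
}
```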
(AD-P step) With probability 1 − ε, wait to receive updated D_i^(t+1)'s from an r-fraction of the workers, set D_i^(t+1) = D_i^(t) for the remaining workers, and sample (β^(t+1), σ^{2(t+1)}) given D^(t+1) and D_obs. With probability ε, wait to receive updated values from all the workers, and then proceed to sampling (β^(t+1), σ^{2(t+1)}) given D^(t+1) and D_obs.
Similar to the Pólya-Gamma DA extension, these ADDA cycles generate the Markov chain Φ_ADBL = {(τ^(t), β^(t), σ^{2(t)})}_{t=0}^∞. The following theorem establishes the geometric ergodicity of Φ_ADBL when ε > 0, and the proof is given in Supplementary Materials Section A.5.

Theorem 3.2. If ε > 0 and n ≥ 3, the Markov chain Φ_ADBL described in Algorithm 2 is geometrically ergodic.
Geometric ergodicity of a three-block version of the parent DA chain was established in [22], and geometric ergodicity for the parent DA chain described in BL.1 and BL.2 above was established in [41]. Both these proofs make use of drift and minorization. In particular, [41] prove results on the marginal (β, σ²) chain and leverage them to establish geometric ergodicity of the parent DA chain.
As mentioned previously, this route is not available to us, as the marginal (β, σ²) process is not Markov, and establishing minorization with the asynchronous updates is nontrivial. As such, similar to the Pólya-Gamma ADDA setting, our proof of Theorem 3.2 is based on establishing a geometric drift condition using a drift function that is unbounded off compact sets.

ADDA for Linear Mixed-Effects Modeling
Consider the setup for linear mixed-effects models [51]. Let n be the number of observations, m be the number of subjects, (y_i, X_i, Z_i) be the response, fixed effects covariates, and random effects covariates, respectively, for subject i, p and q be the numbers of fixed and random effects, β ∈ R^p be the fixed effect, and Σ be a q × q positive definite covariance matrix. Then, the linear mixed-effects model assumes that

y_i = X_i β + Z_i b_i + e_i, b_i ∼ N(0, Σ), e_i ∼ N(0, σ² I) (i = 1, ..., m), (3)

where b_i ∈ R^q and e_i ∈ R^{n_i} are the random effect and idiosyncratic error for subject i, b_i and e_i are mutually independent, Σ is the covariance matrix of the random effects, σ² is the error variance, I is an identity matrix, y_i ∈ R^{n_i}, n = Σ_{i=1}^m n_i, and the dimensions of X_i, Z_i, e_i, I are chosen appropriately. The b_i's are unobserved, and the interest lies in inference on θ = {β, Σ, σ²}.
The DA algorithm for posterior inference on θ is based on modified random effects. Let Γ be a non-singular q × q matrix, γ = vec(Γ) stack the columns of Γ into a q²-dimensional vector, p̃ = p + q², y^T = (y_1^T, ..., y_m^T) ∈ R^n, Σ̃ = Γ^{−1} Σ (Γ^{−1})^T, α = (β, γ), and X̃ ∈ R^{n×p̃} be the matrix with [X_i (d_i^T ⊗ Z_i)] as its ith row block, where d_i = Γ^{−1} b_i for i = 1, ..., m. The prior distributions on (σ², α) and Σ are assigned as

σ² ∼ M/χ²_a, α | σ² ∼ N(0, σ² V_α^{−1}), Σ ∼ IW(W, s), (5)

where V_α ≻ 0, W ≻ 0, ≻ 0 denotes a symmetric positive definite matrix, and IW stands for the inverse Wishart distribution. Then, the DA algorithm for posterior inference in linear mixed-effects modeling starts with some given value of θ and repeats the following I and P steps for a large number of iterations:

LM.1 (I step) Draw d_i given θ and y_i from N(m_{d_i}, V_{d_i}) for i = 1, ..., m, where m_{d_i} and V_{d_i} are the conditional posterior mean and covariance of d_i implied by (3) and are given in (6); and
LM.2 (P step) Draw Σ, σ², and α = (β, γ) given d_1, ..., d_m, y_1, ..., y_m in a sequence of three steps (7), where α̂ = (X̃^T X̃ + V_α)^{−1} X̃^T y, ŷ = X̃ α̂, and Γ = unvec(γ).

This DA is different from the previous two DA algorithms in Sections 3.1 and 3.2 due to the presence of random effects. Following Section 2.1, we have D_obs = {(y_i, X_i, Z_i) : i = 1, ..., m}, D = (d_1, ..., d_m), and θ = (α, Σ, σ²). The main computational bottleneck here is the repeated sampling of d_1, ..., d_m in the I step LM.1 when m is large. For example, the MovieLens data analyzed in Section 4.3 has m ≈ 72,000. The ADDA-based extension of this algorithm starts by partitioning {1, ..., m} into k disjoint subsets denoted as P_1, ..., P_k. Let m_i = |P_i|, {(y_j(i), X_j(i), Z_j(i), d_j(i)) : j ∈ P_i} be the augmented sample on worker i, and D_i = {d_j(i) : j ∈ P_i} corresponding to the disjoint partitioning. Define the statistics S_dd(i), S_xx(i), S_xy(i) for worker i as the sums over its m_i samples of d_j(i) d_j(i)^T, X̃_j(i)^T X̃_j(i), and X̃_j(i)^T y_j(i), respectively. Then, the augmented data sufficient statistics {X̃_1(i), ..., X̃_{m_i}(i), S_dd(i), S_xx(i), S_xy(i)} for worker i play an important role in both the DA and ADDA algorithms. In particular, α̂ and ŷ in the P step of the parent DA in (7) become

α̂ = (S_XX + V_α)^{−1} S_XY and ŷ = X̃ α̂, (9)

where S_XX = S_xx(1) + ... + S_xx(k), S_XY = S_xy(1) + ... + S_xy(k), and S_DD = S_dd(1) + ... + S_dd(k). (10)

Here, S_DD enters the scale parameter of the inverse Wishart density in (7). Note that X̃ is obtained by arranging X̃_1(i), ..., X̃_{m_i}(i) (i = 1, ..., k) in the order of samples in X and binding them along the rows.
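A sketch of the worker-side computation of these statistics, under the assumption (stated above) that they are sums of d_j d_j^T, X̃_j^T X̃_j, and X̃_j^T y_j over the worker's samples:

```r
# Worker-side sufficient statistics for Algorithm 3 (sketch; the summand
# definitions are assumptions consistent with the text above).
worker_stats <- function(d_list, Xt_list, y_list) {
  S_dd <- Reduce(`+`, lapply(d_list, tcrossprod))       # sum_j d_j d_j^T
  S_xx <- Reduce(`+`, lapply(Xt_list, crossprod))       # sum_j Xt_j^T Xt_j
  S_xy <- Reduce(`+`, Map(crossprod, Xt_list, y_list))  # sum_j Xt_j^T y_j
  list(S_dd = S_dd, S_xx = S_xx, S_xy = S_xy)
}
```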
The ADDA extension of the linear mixed-effects DA, at any iteration, updates (with probability 1 − ε) only an r-fraction of D_1, ..., D_k, and with probability ε updates all of D_1, ..., D_k. Clearly, the ADDA extension reduces to the parent DA in steps LM.1 and LM.2 when r = 1. For a given k and r, the ADDA extension of the marginal DA initializes (d_1, ..., d_m, α, Σ, σ²) at (d_1^(0), ..., d_m^(0), α^(0), Σ^(0), σ^{2(0)}), and the AD-I and AD-P steps in the (t + 1)th iteration are as follows:

(AD-I step) Draw d_j(i)^(t+1) given θ^(t) = (α^(t), Σ^(t), σ^{2(t)}) and y_j(i) from N(m_{d_j(i)}, V_{d_j(i)}) on worker i for j = 1, ..., m_i, where m_{d_j(i)}, V_{d_j(i)} are obtained by replacing (y_i, X_i, Z_i) in (6) with (y_j(i), X_j(i), Z_j(i)). Compute D_i^(t+1) from the sampled d_j(i)^(t+1) values and send D_i^(t+1) to the manager process if these draws are finished before receiving θ^(t+1) = (α^(t+1), Σ^(t+1), σ^{2(t+1)}) (i = 1, ..., k).

(AD-P step) Draw Σ^(t+1), σ^{2(t+1)}, and α^(t+1) given D^(t+1) and D_obs using the sequence of three steps in LM.2, where α̂^(t+1), ŷ^(t+1), S_DD, and S_XX are computed based on (9) and (10) using the latest d_j(i) values.

These cycles generate the Markov chain Φ_ADLME = {(d_1^(t), ..., d_m^(t), α^(t), Σ^(t), σ^{2(t)})}_{t=0}^∞. The convergence analysis of the linear mixed-effects model parent DA chain is extremely challenging. In fact, the literature on geometric ergodicity of the parent DA chain described in LM.1 and LM.2, to the best of our knowledge, focuses only on the special case when Σ^(t) is diagonal and Γ is fixed to be the identity matrix; for example, see [45] and [1]. Following [45,1], we assume that Γ = I_q for the convergence analysis; however, unlike these analyses of the parent DA chain, we do not restrict Σ to be a diagonal matrix. Algorithm 3, without the Γ updates (i.e., with Γ fixed at I_q), provides a Markov chain {(d_1^(t), ..., d_m^(t), α^(t), Σ^(t), σ^{2(t)})}_{t=0}^∞. In this setting, from the priors used in (5), only the prior for α given σ² is modified to the prior for β given σ² as N(0, σ² V_β^{−1}) for some V_β ≻ 0. We need the following assumptions for our geometric ergodicity result.
(c) The matrix V_β from the prior distribution on β and the degrees of freedom a from the prior distribution on σ² satisfy a condition involving (p + mq).

Assumption (a) is quite reasonable, especially since the ADDA algorithm is geared towards applications with large m, n and typically small values of p and q (such as the MovieLens data analyzed in Section 4.3.3). Assumptions (b) and (c) on the prior hyper-parameters s, a, V_β might seem restrictive, but it is important to keep in mind that we are in a challenging setting with a general non-diagonal Σ for which convergence results are unavailable even in the parent DA setting. As discussed previously, the asynchronous updates in the ADDA setting present additional challenges in the analysis; however, we are still able to establish geometric ergodicity in this more general ADDA setting, as described in the result below.

Theorem 3.3. If ε > 0 and Assumption 3.1 holds, the Markov chain Φ_ADLME described in Algorithm 3 (adjusted for Γ = I_q) is geometrically ergodic.
The proof of Theorem 3.3 is provided in Supplementary Materials Section A.6.

Setup
We evaluate the numerical accuracy and computational efficiency of ADDA algorithms using their parent DA algorithms as the benchmarks. In all our simulated and real data analyses, k ∈ {10, 25, 50} and r ∈ {0.20, 0.40, 0.60, 0.80}, where the first two and last two values of r represent small and large choices of r, respectively. Every experiment with a given choice of (n, k, r) is replicated ten times. Before running an ADDA algorithm, the full data are split (based on the samples or parameter dimensions) into k disjoint subsets that are stored on k worker machines. The parent DA algorithm uses the full data. We use the choice ε = 0 in our simulations and ε = 0, 0.01, 0.10 in the real data analyses. The choice ε = 0 is best for the fast computations required for multiple simulation replications; however, it is not clear if the ADDA chain is Harris ergodic with this choice because Theorem 2.1 requires that ε > 0. In our simulations with 20,000 iterations for ADDA algorithms with ε = 0, we have rarely encountered a situation where a worker never sends its I step results to the manager. Our observations from the real data analyses suggest that one could set ε to be 10^{−4} or 10^{−5} to maintain theoretical convergence guarantees while compromising very little on the computational gains.
The (empirical) accuracy of an ADDA algorithm relative to its parent DA algorithm is defined using the total variation distance. Let η = h(θ) = (η_1, ..., η_c) be a function of the underlying parameter θ with c components that is of interest to the practitioner. Denote the posterior density estimates of η_j using t iterations of the ADDA algorithm and its parent DA as q^t_rk(η_j) and p^t(η_j), respectively. Then, the accuracy metric based on η_j for the ADDA algorithm at the end of t iterations is defined as

Acc_rk,j(t) = 1 − TV(p^t, q^t_rk), (12)

where r and k are given, TV(p^t, q^t_rk) is the total variation distance between p^t and q^t_rk, and Acc_rk,j(t) ∈ [0, 1] because TV(p^t, q^t_rk) ∈ [0, 1] for every r, k, t. We estimate Acc_rk,j(t) using kernel density estimation in two steps. First, we select the first t draws of η_j from the ADDA algorithm and its parent DA and compute kernel density estimates of q^t_rk and p^t in R. Second, these estimates are used to numerically approximate the integral in (12) and obtain the Acc_rk,j(t) (j = 1, ..., c) estimates. The overall accuracy metric estimate based on η for the ADDA algorithm is computed by averaging the estimates of Acc_rk,1(t), ..., Acc_rk,c(t) as

Acc_rk(t) = (1/c) Σ_{j=1}^c Acc_rk,j(t). (13)

The larger the Acc_rk(t) estimate, the better an ADDA algorithm is at approximating its parent DA algorithm.
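A sketch of the two-step estimate of Acc_rk,j(t) for a scalar component, using base-R kernel density estimates on a common grid:

```r
# Estimate Acc = 1 - TV(p_t, q_t) from the first t draws of a scalar eta_j.
acc_metric <- function(draws_da, draws_adda) {
  rng <- range(draws_da, draws_adda)
  p <- density(draws_da,   from = rng[1], to = rng[2], n = 1024)
  q <- density(draws_adda, from = rng[1], to = rng[2], n = 1024)
  tv <- 0.5 * sum(abs(p$y - q$y)) * diff(p$x[1:2])  # numerical TV integral
  1 - tv
}
```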
The Monte Carlo standard errors in an application of an ADDA algorithm and its parent DA algorithm are computed using the overlapping batch means method. This method is implemented in the mcse function of the mcmcse R package [19,53]. Following the notation from the previous paragraph, let SE^A_rk,j(t) and SE^P_j(t) respectively be the outputs of the mcse function using the first t draws of η_j from the ADDA algorithm for a given (r, k) and from its parent DA algorithm (j = 1, ..., c).
Then, a metric for comparing their overall Monte Carlo standard errors is

SE_rk(t) = (1/c) Σ_{j=1}^c |SE^A_rk,j(t) − SE^P_j(t)|. (14)

The smaller the value of SE_rk(t), the better the agreement between the Monte Carlo standard errors of the ADDA algorithm and its parent DA.
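A sketch of the corresponding computation with the mcmcse package; the averaging over the c components matches (14) as reconstructed above.

```r
library(mcmcse)  # mcse() with method = "obm" gives overlapping batch means SEs

# SE agreement metric (sketch): columns of the draw matrices index eta_1..eta_c.
se_metric <- function(adda_draws, da_draws) {
  se_adda <- apply(adda_draws, 2, function(x) mcse(x, method = "obm")$se)
  se_da   <- apply(da_draws,   2, function(x) mcse(x, method = "obm")$se)
  mean(abs(se_adda - se_da))
}
```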
We now undertake an extensive experimental evaluation of the proposed ADDA framework using both simulated datasets (Section 4.2) and the MovieLens data (Section 4.3). The simulation study focuses on using the metrics in (13) and (14) to evaluate the accuracy of the ADDA algorithm in approximating its parent DA algorithm. Due to the relatively small sample/missing data size used to facilitate multiple replications for various (k, r) choices, a runtime comparison is not undertaken in Section 4.2. For the MovieLens data in Section 4.3, however (with up to 10^7 samples), we present both an accuracy and a runtime comparison for both the logistic regression and the linear mixed-effects model settings. All ADDA algorithms are implemented in R using the parallel package. The ease of implementation and reproducibility drive this choice. For this data, our ADDA implementations using the parallel package show three to five times gains in run-times for all the choices of (k, r) (see Figure 12). We expect the run-time gains to be even better if the asynchronous updates for the ADDA algorithms are implemented carefully using MPI [32]; however, this requires significant additional programming effort and will be addressed in future research.

Simulated Data Analyses
This section focuses on two broad classes of applications. The first class consists of logistic regression and linear mixed-effects models, where the computational bottlenecks arise due to massive sample size; see Sections 4.2.1 and 4.2.3. In these models, the samples are partitioned into k subsets and stored on the worker machines. Section 4.2.2 focuses on high-dimensional variable selection using the Bayesian lasso, where the parent DA is inefficient due to the augmentation of p local shrinkage parameters specific to the regression coefficients. Unlike the former two models, here we partition the augmented parameters into k subsets and store them on the worker machines.

Logistic Regression
We evaluate the empirical performance of the ADDA Algorithm 1 for logistic regression in (1) with p = 10 and s_i = 10 for every i. The entries of β are set to −2 and 2 alternately starting from −2, and the entries of X are independent standard normal random variables. We use them to simulate y_1, ..., y_n using the hierarchical model in (1). The rows of X and y are randomly split into k subsets that are stored on the worker machines. The ADDA and its parent run for 20,000 iterations, and β draws are collected at the end of every iteration.
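The data-generating step can be sketched as follows; the value of n is illustrative, since the study uses several values of n.

```r
# Simulate data for the logistic regression study (sketch): p = 10, s_i = 10,
# beta alternating between -2 and 2, standard normal covariates.
set.seed(42)
n <- 1e4; p <- 10; s <- 10                  # n is illustrative
beta <- rep(c(-2, 2), length.out = p)
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, size = s, prob = plogis(drop(X %*% beta)))
shard_id <- sample(rep(seq_len(25), length.out = n))  # k = 25 random subsets
shards <- split(seq_len(n), shard_id)
```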
The Acc_rk(t) metric is evaluated for η = β and η = P(Y = 1 | x_1 = 1, ..., x_p = 1), respectively (Figure 1). The Acc_rk(t) values for β depend on the n, r values but are fairly insensitive to the choice of k. For every n, k, t, the Acc_rk(t) values for β are (initially) the largest when r = 0.80 and decrease with r; however, the differences between the Acc_rk(t) values for different choices of r, k, n are negligible after t = 5000 for β. On the other hand, the Acc_rk(t) values for P(Y = 1 | x_1 = 1, ..., x_p = 1) are comparatively insensitive to the choice of n, r, k, and the differences for different choices of r, k, n are negligible after t = 2000.
The SE_rk(t) metric for β and P(Y = 1 | x_1 = 1, ..., x_p = 1), respectively, decays to 0 as t increases (Figure 2). For every n, k, and t, the SE_rk(t) values for β are the largest when r = 0.2 but quickly decay to 0 as t increases. The SE_rk(t) values for P(Y = 1 | x_1 = 1, ..., x_p = 1) quickly decay to 0 as t increases and are comparatively insensitive to the choices of (n, r, k). This shows that the SE_rk(t) values for both parameters are close to 0 for every n, k, and r if t is sufficiently large and that the Monte Carlo standard errors of parameter estimates obtained using ADDA Algorithm 1 and its parent are comparable in massive data settings.
Note that Theorem 3.1 says that the ADDA Markov chain converges to the same stationary distribution as the parent DA chain for any (r, k). Since the Acc_rk(t) values for P(Y = 1 | x_1 = 1, ..., x_p = 1) and β converge to 1 for a sufficiently large t irrespective of the choices of r and k, our empirical results support the assertion in Theorem 3.1.

High Dimensional Variable Selection
We evaluate the empirical performance of the ADDA Algorithm 2 for Bayesian variable selection.
We vary n and p such that n = p and n ∈ {50, 500, 5000}. The first 0.90p entries of β in (2) are set to 0, and the next 0.10p entries are set to −2 and 2 alternately starting from −2. The entries of X are independent standard normal random variables. We use them to simulate y_1, ..., y_n using the model in (2) with σ² = 0.01. The columns of X are randomly split into k subsets that are stored on the worker machines. The ADDA algorithm and its parent DA are run for 20,000 iterations, and β draws are collected at the end of every iteration.

MovieLens Data
We use the MovieLens ratings data (http://grouplens.org) that contain more than 10 million movie ratings from about 72 thousand users, where each rating ranges from 0.5 to 5 in increments of 0.5. Starting with Perry [35], a modified form of these data has been used for illustrating scalable inference in linear mixed-effects models [48,47,49]. Specifically, a movie's category is defined using its genre: the action category includes action, adventure, fantasy, horror, sci-fi, or thriller genres; the children category includes animation or children; the drama category includes crime, documentary, drama, film-noir, musical, mystery, romance, war, or western; and the comedy category includes the comedy genre only.
Three predictors are defined using this data. The first predictor encodes a movie's category using three dummy variables with the action category as the baseline. If a movie belongs to C categories, then its category predictor is 1/C for each of the C categories. The second and third predictors capture a movie's popularity and a user's mood. The popularity predictor of a movie equals logit{(l + 0.5)/(r + 1)}, where r users rated the movie and l of them gave a rating of 4 or higher. The user's mood predictor equals 1 if the user gave the previous 30 rated movies a rating of 4 or higher and 0 otherwise. The second and third predictors are treated as numeric variables.
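The predictor definitions translate directly into code; a sketch:

```r
# MovieLens predictors (sketch of the definitions above).
popularity <- function(l, r) qlogis((l + 0.5) / (r + 1))  # logit of smoothed share
mood <- function(last30) as.numeric(all(last30 >= 4))     # previous 30 ratings
category_weights <- function(C) rep(1 / C, C)             # 1/C for each category
```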
We model this data using the logistic regression and linear mixed-effects models in Sections 4.3.2 and 4.3.3 below. Before fitting the logistic regression model, the response in the MovieLens data is modified to define the 0/1-valued response: if the user's rating in a given sample is greater than 3, then the response is 1; otherwise, the response is 0. No modification is required for the application of the linear mixed-effects model. The original PG DA and marginal DA algorithms for logistic regression and linear mixed-effects modeling, respectively, are slow if we use the full MovieLens data; therefore, we analyze randomly selected subsets of the MovieLens data to facilitate posterior computations and estimation of the Acc_rk(t) and SE_rk(t) metrics.

Logistic Regression
We randomly select two subsets of users in the MovieLens data such that the total numbers of ratings in them are about 10^6 and 10^7, respectively. The MovieLens data with modified 0/1-valued responses are analyzed using the logistic regression model in (1) with six predictors, including an intercept. There are three dummy variables for the movie categories and one each for a movie's popularity and a user's mood. The β vector is 6-dimensional. The responses and predictors in the modified data are collected into the response vector y and design matrix X. We apply the ADDA Algorithm 1 used in Section 4.2.1 with two modifications: the number of iterations is 10,000 instead of 20,000, and ε = 0, 0.01, 0.10. This setup is replicated ten times by randomly choosing the users in each replication.
Accuracy comparisons. The Acc_rk(t) and SE_rk(t) metrics are evaluated for β and P(Y = 1 | x), where x = (0, 0, 0, 0, 1, 0) is the predictor for a popular movie.

Linear Mixed-Effects Modeling

Accuracy comparisons. We evaluate the Acc_rk(t) and SE_rk(t) metrics for β, Σ, and σ², respectively. When ε > 0 and r = 0.2, the Acc_rk(t) values for σ², β, and Σ converge to 1 quickly for every n, k, t (Figures 9 and 11). For smaller t values, Acc_rk(t) is relatively low for n = 10^6 and r = 0.2; however, the differences between the Acc_rk(t) values disappear for a sufficiently large t given ε > 0.
On the other hand, when ε = 0, the convergence of Acc_rk(t) to 1 is very slow compared to the previous cases. For example, the Acc_rk(t) values for σ² are noticeably small when r ∈ {0.20, 0.40} and k = 50. This is a consequence of violating the assumptions of Theorem 3.3; therefore, we recommend choosing a small but positive ε in practice.
Monte Carlo standard error comparisons. Theorem 3.3 also impacts the decay of the SE_rk(t) values to 0 when ε = 0. Specifically, if ε > 0, then the SE_rk(t) values for σ², β, and Σ converge to 0 quickly for every n, k, t (Figures 10 and 11). This pattern is maintained for σ² even when ε = 0. The SE_rk(t) values for β and Σ decay to 0 relatively slowly for every k when n is large, r = 0.20, and ε = 0; however, if t is sufficiently large, then the SE_rk(t) value is small for every choice of n, r, k, and ε.
Run-time comparisons. The ADDA Algorithm 3 is three to five times faster than its parent DA for all the choices of n, r, k, and ε (Figure 12b). For a given k, the gain in run-time increases with n for every ε and r. On the other hand, for a given n, the gain in run-time decreases with increasing k for every ε and r due to the increased communication among the manager and worker processes.
Similar to our observations in logistic regression, a distributed (or parallel) implementation of the parent DA is two to three times slower than the asynchronous implementations of ADDA with r < 1 and ε < 1 for every choice of n, k, and ε. Unlike the case for logistic regression, varying loads on the cluster have a minimal impact on the run-times.
Conclusions. If ε is positive and t is sufficiently large, then the Acc_rk(t) and SE_rk(t) values of Algorithm 3 are close to 1 and 0, respectively, for every n, r, k. The ADDA chain corresponding to Algorithm 3 is also three to five times faster than its parent DA depending on (k, r). The asynchronous computations are more efficient than parallel computations in that ADDA with r < 1 is faster than the distributed implementation of the parent DA. We conclude that ADDA Algorithm 3 is a promising alternative to the marginal DA for accurate and efficient Bayesian inference in linear mixed-effects models for massive data.

Discussion
It is interesting to explore extensions of our ADDA schemes and theoretical results. First, linear mixed-effects models have been an important class of models for evaluating the performance of a new class of DA algorithms [28,51,52]. The excellent empirical performance of ADDA in Sections 4.2.3 and 4.3.3 suggests that it is possible to extend the ADDA scheme to other classes of hierarchical Bayesian models. For example, if the random effects in a linear mixed-effects model are assigned a multivariate t distribution, then the I and P steps of the new model are similar to those in Section 3.3 [36]. This implies that Algorithm 3 is easily extended to robust versions of linear mixed-effects models.
Second, our real data analysis illustrates the dependence of the accuracy and Monte Carlo standard errors on the choice of (k, r, ε). Due to varying loads on worker processes in a cluster, some workers return their I step results to the manager more often than others. Our empirical results suggest that this variability decreases with n. For example, if we define the empirical estimate of r for a worker as the fraction of the number of times the manager accepts its I step results over the total number of DA iterations, then the empirical estimates of r are closer to the true r as n increases for every k and worker process; see Figure 13. Furthermore, the Bernstein-von Mises theorem implies that the posterior distribution becomes increasingly Gaussian as n increases. We are exploring techniques for balancing the Monte Carlo error (depending on t, k, r, ε) and the statistical error (depending on n) to provide some guidance on the choice of k and the total number of MCMC iterations so that a desired accuracy is achieved by the ADDA algorithm.
Our DA extensions based on ADDA exploit two features of the augmented data model. First, the k augmented data subsets are mutually independent given the original parameter and the observed data. Second, the conditional posterior distribution of the missing data on any subset depends only on the local copy of the observed data. Both assumptions are violated in models for time series data, including hidden Markov models (HMMs) and Gaussian state-space models.
DA has been widely used for posterior inference and predictions in these models, but it suffers from similar computational bottlenecks in massive data settings to those highlighted in this paper. Recently, this problem has been addressed for HMMs using online EM [9,24] and the divide-and-conquer technique [54]. As part of future research, we will focus on extending the original DA algorithms using ADDA for scalable posterior inference and predictions in HMMs and related state-space models.
Here I_A(·) is the indicator function. By repeating the above argument, with ε > 0 and letting N go to infinity, it follows that Pr((D^(n), θ^(n)) ∈ A for all n | (D^(0), θ^(0))) = 0.
Then by Theorem 6 of [44], the ADDA chain is Harris recurrent.

A.2 The switched ADDA chain and geometric ergodicity
Let K_ADDA denote the transition kernel of the ADDA chain described in Section 2.1, and K̃_ADDA denote the transition kernel of the switched ADDA chain, where the AD-P step is performed before the AD-I step. Hence, the one-step dynamics of the switched ADDA chain from (θ, D) to (θ', D') is described as follows. The manager draws θ' from f(· | D, D_obs) and distributes it to the k workers. The ith worker generates a draw D'_i from f(· | θ', D_obs) and sends it to the manager (the draw is truncated if the next θ value is received from the manager before the draw is finished). As soon as the manager receives the new D'_i's from the required fraction of workers (r with probability 1 − ε, and 1 with probability ε), it assigns D'_i = D_i for the remaining fraction and proceeds with the next θ draw. For both the ADDA and the switched ADDA chain, the marginal θ-process is not in general Markov. However, it can be easily seen that the marginal D-process for both chains is Markov. This follows from the fact that for both chains, the distribution of the latest θ value depends only on the latest D value.

For c ≥ a, the required bound holds, where the first and last inequalities follow from e^{−x} ∈ (0, 1] for any x ≥ 0, and the second inequality follows from the results in Case 1.
A.5 Proof of Theorem 3.2

We will prove geometric ergodicity of the switched ADDA chain, which is enough to establish geometric ergodicity of the ADDA chain based on Lemma A.1.
Define the following drift function, where w > 0 and 0 < s < 1/2 are constants to be specified later. It can be easily seen that this drift function is unbounded off compact sets in the parameters (β, σ², τ). Thus, we proceed to show that, by choosing proper w and s, there exist constants 0 ≤ ρ < 1 and ψ < ∞ such that the geometric drift inequality (32) holds. The expectation in (32) is with respect to the transition kernel of Φ_ADBL. Under the assumption ε > 0, for a fixed update fraction r, there is a probability of at least p_m that τ_j is updated at a given iteration. Hence the corresponding term can be bounded, where A is an arbitrary positive number and the last inequality holds because √(xy) ≤ (x + y)/2 for any positive numbers x and y. Similarly, we can establish the analogous bound with B an arbitrary positive number.
We then calculate the expectation of (36) conditional on (τ*, σ²). By Proposition A1 in [34], the relevant moments can be bounded, where [·]_ii is the (i, i)th entry of a matrix and λ_X is the maximum eigenvalue of X^T X. The remainder term goes to zero as s → 1/2, which leads to the required bound. Then, for the outer expectation of (33), we first establish the corresponding inequalities. The proof of (32) is then completed by letting ρ = 1 − p_m/2, with ψ as defined above.
A.6 Proof of Theorem 3.3

We will prove geometric ergodicity of the switched ADDA chain, which is enough to establish geometric ergodicity of the ADDA chain based on Lemma A.1.
By Assumption 3.1, there exists a symmetric positive definite matrix Q such that Z_i^T Z_i ⪯ Q. Also, since we fix Γ as the identity matrix, b is the same as d, Σ̃ is the same as Σ, and α is simply β (since the γ entries are fixed), so in the following we only use b, Σ, and β. We now define some matrices that will be used later in the proof. When Γ is fixed as the identity, the conjugate priors take the forms σ² ∼ M/χ²_a, β | σ² ∼ N(0, σ² V_β^{−1}), and Σ ∼ IW(W, s), and the corresponding posterior conditional distributions follow; in particular, the conditional of b_i given (σ², Σ, β) and the event that worker i successfully sends its update to the manager is the normal full conditional implied by (3). We proceed by defining a drift function V(b, β, σ², Σ) involving y − Zb and bound its expectation, where this expectation is with respect to the transition kernel of Φ_ADLME (adjusted for fixed Γ = I_q). By the definition of Φ_ADLME, the expectation can be written as a sum over the update patterns, and using linearity, we proceed to simplify each term in (48).




Figure 12: Gain in run times for the logistic regression and linear mixed-effects models used in the MovieLens data analysis for r = 0.20, 0.40, 0.60, 0.80, 1.00 and k = 10, 25, 50. The distributed (or parallel) implementation of the parent DA corresponds to r = 1.00 (or ε = 1.00). The gain is defined as the ratio of run times of the parent DA and its ADDA-based extension. The dashed black line indicates equal run times for an ADDA algorithm and its parent.

Figure 13: Estimated r for every worker process vs. the r specified in the logistic regression model for the MovieLens data. The estimated r for a worker is defined as the fraction of the number of times the manager accepts its I step results over the total number of DA iterations.