Accelerating Large-Scale Statistical Computation With the GOEM Algorithm

ABSTRACT Large-scale data analysis problems have become increasingly common across many disciplines. While a large volume of data offers more statistical power, it also brings computational challenges. The orthogonalizing expectation–maximization (EM) algorithm by Xiong et al. is an efficient method for dealing with large-scale least-square problems from a design point of view. In this article, we propose a reformulation and generalization of the orthogonalizing EM algorithm. Computational complexity and convergence guarantees are established. The reformulation of the orthogonalizing EM algorithm leads to a reduction in computational complexity for least-square problems and penalized least-square problems. The reformulation, named the GOEM (generalized orthogonalizing EM) algorithm, can incorporate a wide variety of convex and nonconvex penalties, including the lasso, the group lasso, and the minimax concave penalty. The GOEM algorithm is further extended to a wider class of models, including generalized linear models and Cox's proportional hazards model. Synthetic and real data examples are included to illustrate its use and its efficiency compared with standard techniques. Supplementary materials for this article are available online.


Introduction
In the machine learning and statistics communities, linearly structured estimators are among the most thoroughly developed and widely adopted techniques in statistical model building. From classical methods such as linear regression and generalized linear models, to more modern, flexible methods such as local polynomial regression, Gaussian process models, and artificial neural networks, the estimation problem is eventually reduced to a linear form, where a solution can be obtained by solving a system of equations. Despite its simple form, least-square estimation can be computationally challenging on the massive scales currently faced by practitioners. In such massive-scale applications, traditional estimation procedures tend to be slow or intractable. Various methods have been proposed to ease the computational demand of "big data" problems, such as leverage score subsampling (Drineas, Mahoney, and Muthukrishnan 2006), divide-and-recombine (Guha et al. 2012), preconditioning via the subsampled randomized Hadamard transform (Boutsidis and Gittens 2013; Lu et al. 2013), and simulation-based regression (Wang, Polydorides, and Bertsekas 2009; Wang and Bertsekas 2013), among many others. While these methods have many benefits, they rely on randomized or approximate algorithms: they often guarantee only an approximate solution with high probability, they do not fully use all of the available data for a given model (in the sense that their computed solutions are less efficient than the exact solution), and they can be difficult to extend to more general models.
On a different note, variable selection is one of the most studied subjects in statistics. Modern statistical methods for variable selection often involve penalized likelihood, as selection properties for such methods can be established. Penalties such as the lasso and nonnegative garrote have been widely used. Nonconvex penalties such as smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP) offer alternatives with attractive properties such as diminished bias and the oracle property. Computation for these penalties can be challenging, especially for nonconvex penalties, and this is amplified when the number of observations becomes very large. Thus, there is a need for a unified framework with convergence guarantees for penalized estimation for a wide variety of models and penalties for large-scale data.
This article presents a generic algorithm, called the generalized orthogonalizing expectation–maximization (GOEM) algorithm, for efficient penalized estimation of a wide array of statistical models and penalties. This algorithm is a generalization of the orthogonalizing expectation–maximization (OEM) algorithm proposed by Xiong et al. (2016), which borrows the computational benefit of an orthogonal design and combines it with the EM algorithm for solving least-square problems. The OEM algorithm is computationally efficient, uses all available data, and does not produce a randomized solution. However, it is not directly applicable to nonlinear cases. The GOEM algorithm improves upon and extends the OEM algorithm, including the incorporation of nonlinear models.
We discuss more perspectives of the OEM algorithm in Section 2, and how it can be generalized in Section 3. A convergence analysis is also provided in Section 3. In Section 4, we demonstrate the efficiency of our algorithm with various examples. Section 5 gives some concluding remarks.

Review of the OEM Algorithm
Consider the regression model $y = X\beta + \varepsilon$, where $X$ is an $n \times p$ regression matrix, $y \in \mathbb{R}^n$ is the response vector, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression coefficients, and $\varepsilon$ is the vector of random errors with zero mean. The ordinary least-square (OLS) estimator of $\beta$ is $\hat\beta = \arg\min_\beta \|y - X\beta\|^2$, where $\|\cdot\|$ denotes the $\ell_2$ norm. Solving this can be computationally intensive if the regression matrix $X$ is not column orthogonal and the dimensions $n$ and $p$ are big. The OEM algorithm makes use of the facts that inverting a diagonal matrix is trivial, and that adding noninformative pseudo observations does not alter the solution when observations are assumed independent. Let $\Delta$ be a matrix of pseudo observations whose response values $z$ are missing. The maximum likelihood estimate of the augmented problem maximizes the marginal likelihood of the observed data. If $\Delta$ is designed such that the augmented regression matrix $X_c = (X', \Delta')'$ is column orthogonal, an EM algorithm can be used to solve the augmented problem efficiently, similar to Healy and Westmacott (1956). The OEM algorithm achieves this with two steps:
Step 1. Find an appropriate augmentation matrix $\Delta$.
Step 2. Iteratively solve the orthogonal design with missing data by the EM algorithm.
Step 2.1. E-step: impute the missing responses $z$ by $z = \Delta\beta^{(t)}$, where $\beta^{(t)}$ is the current estimate.
Step 2.2. M-step: solve $\beta^{(t+1)} = \arg\min_\beta \|y - X\beta\|^2 + \|z - \Delta\beta\|^2$.
An augmentation matrix can be found via the active orthogonalization procedure, which starts with a positive definite diagonal matrix $S$ and constructs $\Delta$ by ensuring that $\Delta'\Delta = dS^2 - X'X$ is positive semidefinite for some constant $d \geq \lambda_1(S^{-1}X'XS^{-1})$, where $\lambda_1(\cdot)$ denotes the largest eigenvalue. In fact, $\Delta$ can be used conceptually and never needs to be explicitly computed, because the EM iterations in the second step have closed-form solutions that depend only on $\Delta'\Delta$. Specifically, let $A = \Delta'\Delta$, $u = X'y + A\beta^{(t)}$, and let $(d_1, \ldots, d_p)$ be the diagonal elements of $X_c'X_c$; then the update for the regression coefficients has the simple form $\beta_j^{(t+1)} = u_j/d_j$ for $j = 1, \ldots, p$. The OEM algorithm can be extended to solving the penalized estimator $\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + P_\lambda(\beta)$ when the penalty $P_\lambda$ is decomposable. The only change is in Step 2.2, where $\beta$ is instead updated as $\beta^{(t+1)} = \arg\min_\beta \|y - X\beta\|^2 + \|z - \Delta\beta\|^2 + P_\lambda(\beta)$. Since $X_c$ is column orthogonal and $P_\lambda$ is decomposable, a simple closed-form solution can be obtained. For more details, see Xiong et al. (2016).
The overall computational complexity of the OEM algorithm is about $np^2 + 2kp^2$ flops, where $k$ is the number of iterations until convergence. As a reference, the Cholesky decomposition is one of the fastest decomposition methods, and its complexity is about $np^2 + p^3/3$ flops, which is on par with the OEM algorithm in the leading term. However, when the regression matrix $X$ is well-conditioned, the OEM algorithm often converges in only a few iterations and slightly outperforms the Cholesky decomposition by reducing the second leading term in the complexity. This is particularly useful when $p$ is moderately large. Xiong et al. (2016) showed convergence of the algorithm when $X_c'X_c = dI_p$ with $d \geq \lambda_1(X'X)$, which is in accordance with the requirement that $\Delta'\Delta$ be positive semidefinite. In fact, we only need $d > \lambda_1(X'X)/2$ if we associate the iteration with the idea of boosting.
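To make the closed-form update concrete, the following is a minimal R sketch of the OEM iteration for the unpenalized case, taking $S = I$ so that $\Delta'\Delta = dI_p - X'X$ and $X_c'X_c = dI_p$; the function name, the default choice of $d$, and the convergence check are illustrative assumptions, not the implementation of Xiong et al. (2016).

```r
# Minimal sketch of the OEM iteration for least squares (unpenalized case),
# assuming active orthogonalization with S = I, so that X_c'X_c = d I_p and
# the element-wise update is beta^(t+1) = u / d with u = X'y + A beta^(t).
oem_ls <- function(X, y, d = NULL, maxit = 500, tol = 1e-7) {
  p   <- ncol(X)
  XtX <- crossprod(X)                  # X'X, computed once (~ n p^2 flops)
  Xty <- crossprod(X, y)               # X'y
  if (is.null(d)) d <- max(eigen(XtX, symmetric = TRUE, only.values = TRUE)$values)
  A    <- d * diag(p) - XtX            # A = Delta'Delta; Delta itself never formed
  beta <- rep(0, p)
  for (t in seq_len(maxit)) {
    u        <- Xty + A %*% beta       # E-step folded into the closed-form M-step
    beta_new <- as.vector(u) / d       # trivial "inversion" of X_c'X_c = d I_p
    # For a decomposable penalty, replace the line above by its closed form,
    # e.g. soft-thresholding of u / d for the lasso.
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```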

Alternative Interpretation of the OEM Algorithm
Step 2.2 in the OEM algorithm attains the update pattern $\beta^{(t+1)} = \beta^{(t)} + X'e^{(t)}/d$, where $e^{(t)} = y - X\beta^{(t)}$ denotes the vector of residuals after $t$ iterations. This suggests that the algorithm builds a strong learner (the orthogonal projection) from a sequence of weak learners obtained by projecting the residual $e$ onto the column vectors of $X$ and scaling them. This is equivalent to decomposing the inverse matrix $(X'X)^{-1}$ as the series $(X'X)^{-1} = \frac{1}{d}\sum_{t=0}^{\infty}(I_p - X'X/d)^t$ and approximating it by a finite sum. In order for the decomposition to be valid, it is clear that $d > \lambda_1(X'X)/2$ would suffice. The alternative interpretation can also reduce the computational complexity for large-scale problems when $p$ is large: since it only requires matrix-vector products, the complexity is reduced to approximately $2knp$ flops. Figure 1 gives an illustration of the first two iterations. In the figure, dashed lines represent projections onto basis vectors after appropriate scaling, and the star indicates the position of the exact solution obtained by a direct orthogonal projection. This alternative view of OEM offers some insight into its properties. First, the algorithm is guaranteed to converge as long as $d > \lambda_1(X'X)/2$. Proposition 1. If $d > \lambda_1(X'X)/2$, the sequence $\beta^{(t)}$ produced by the OEM algorithm always converges to $X^+y$, where $X^+$ is the Moore-Penrose pseudoinverse of $X$. In particular, if $X$ has full column rank, the sequence converges to $(X'X)^{-1}X'y$.
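The residual-boosting view leads to an even simpler sketch that never forms $X'X$: each iteration uses only two matrix-vector products, matching the approximately $2knp$ flop count above. Again, the function name and stopping rule are illustrative assumptions.

```r
# Minimal sketch of the residual-boosting view of OEM: each iteration needs
# only two matrix-vector products (X beta and X'e), so the cost is about
# 2 k n p flops after k iterations; d is assumed to satisfy d > lambda_1(X'X)/2.
oem_boost <- function(X, y, d, maxit = 500, tol = 1e-7) {
  beta <- rep(0, ncol(X))
  for (t in seq_len(maxit)) {
    e    <- y - X %*% beta             # current residual vector e^(t)
    step <- crossprod(X, e) / d        # weak learner: scaled projection X'e / d
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) break
  }
  beta
}
```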
The convergence speed can also be characterized. In the following proposition, we give an upper bound on the convergence rate of this sequence and provide the optimal value of $d$ with respect to this upper bound.
[Figure 1. Illustration of the first two iterations: panel (a) shows the response $y$, the basis vectors $x_1$ and $x_2$, and the residual $e^{(1)}$; panel (b) shows the residuals $e^{(1)}$ and $e^{(2)}$.]
Proposition 2. Let $\hat\beta$ be the exact solution and let $\lambda_1 \geq \cdots \geq \lambda_r$ denote the nonzero eigenvalues of $X'X$. If $d > \lambda_1(X'X)/2$, we have
$$\|\beta^{(t)} - \hat\beta\| \leq \Big(\max\big\{|1 - \lambda_1/d|,\ |1 - \lambda_r/d|\big\}\Big)^{t}\,\|\beta^{(0)} - \hat\beta\|,$$
where $\|\cdot\|$ is the $\ell_2$-norm and $r$ is the column rank of $X$. The optimal $d$ with respect to this upper bound is $(\lambda_1 + \lambda_r)/2$.
Since choosing $d$ for the OEM algorithm only involves computing the largest eigenvalue (and also the smallest if the optimal rate is desired), it can be more efficient than solving OLS with decomposition methods, which require factorizing the full matrix. An efficient way of computing the largest and smallest eigenvalues is the Golub-Kahan-Lanczos bidiagonalization (GKLB) procedure (Golub and Kahan 1965). When using GKLB, only a few iterations of matrix-vector multiplication are typically needed, since $d$ does not need to be exactly the optimal value $(\lambda_1 + \lambda_r)/2$ for the algorithm to converge. Preconditioning can also be used to accelerate the convergence.
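As a rough illustration of how the scaling factor can be obtained from matrix-vector products alone, the sketch below uses a plain power iteration on $X'X$; this is only a simple stand-in for the GKLB procedure referenced above, and the number of iterations and the safety margin on $d$ are assumptions.

```r
# Hedged sketch: a plain power iteration on X'X to approximate lambda_1,
# used here as a simple stand-in for Golub-Kahan-Lanczos bidiagonalization;
# only matrix-vector products with X and X' are required.
approx_lambda1 <- function(X, iters = 20) {
  v <- rnorm(ncol(X))
  v <- v / sqrt(sum(v^2))
  lambda <- 0
  for (i in seq_len(iters)) {
    w      <- crossprod(X, X %*% v)    # w = X'(X v)
    lambda <- sqrt(sum(w^2))           # current estimate of lambda_1
    v      <- w / lambda
  }
  lambda
}
# A conservative choice of the scaling factor is then, e.g., d = 1.05 * lambda,
# which satisfies both d >= lambda_1 and the relaxed condition d > lambda_1 / 2.
```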

The GOEM Algorithm
In general, the GOEM algorithm stacks two levels of operations: boosting and searching. Starting from an estimate, it boosts a subproblem to obtain a search direction, and it then searches along that direction for the next estimate. We start by introducing the GOEM algorithm for the unpenalized case.

GOEM Algorithm for General Loss Without Penalty
Consider a general model with parameters $\theta$ estimated as $\hat\theta = \arg\min_\theta \sum_{i=1}^n V(y_i, f(x_i)'\theta)$, where $V$ is a loss function and we assume $\theta$ is linearly related to the transformed predictors $f(x_i)$. For simplicity of presentation, we assume $f$ is the identity function; the extension to more general transformations is straightforward. Many popular methods fit into this framework. Possible examples include: 1. Least squares, such as linear regression, spline regression, and kernel regression: $V(y_i, x_i'\theta) = (y_i - x_i'\theta)^2$. 2. Maximum likelihood with likelihood function $P$, such as the generalized linear model and the Cox proportional hazards model: $V(y_i, x_i'\theta) = -\log P(y_i \mid x_i'\theta)$. Assuming the loss $V$ is second-order differentiable in $x_i'\theta$, a candidate estimator can be found by solving the estimating equation $\psi(\theta) = \sum_{i=1}^n \partial V(y_i, x_i'\theta)/\partial\theta = 0$. Solving this with the standard Newton's method involves two steps: 1. start with an initial $\theta^{(0)}$; 2. iteratively update $\theta^{(t)} = \theta^{(t-1)} - (\nabla\psi(\theta^{(t-1)}))^{-1}\psi(\theta^{(t-1)})$. Because of the linear structure, $\nabla\psi(\theta^{(t-1)})$ has the form $X'W(\theta^{(t-1)})X$ and $\psi(\theta^{(t-1)})$ has the form $X'z(\theta^{(t-1)})$, where $z(\theta^{(t-1)}) = (z_i(\theta^{(t-1)}))$. Solving each iteration with the standard Newton's method is computationally intensive, as it requires reweighting the Hessian matrix and computing the inverse $(\nabla\psi(\theta^{(t-1)}))^{-1}$. The GOEM algorithm instead decomposes this matrix inverse and approximates it by a finite sum. An attractive property is that the constant $d$ need only be specified once at the beginning: for subsequent reweighted Hessians, the initial $d$ can simply be rescaled so that $d > \lambda_1(X'WX)/2$ is satisfied. The steps are outlined as follows:
Step 1. Compute a scaling factor $d > \lambda_1(X'X)/2$ using the GKLB procedure.
Step 2. Iterate between a boosting step and a searching step.
Step 2.1. Boosting step: approximately solve the subproblem $p = (X'WX)^{-1}X'z$.
Step 2.1.1. Adjust the scaling factor by $d^{(t)} = \max(|W|)\,d$, where $\max(|W|)$ is the largest absolute value among the diagonal entries of $W$.
Step 2.1.2. Set $p^{(0)} = 0$ and update $k \geq 1$ times: $p^{(j)} = p^{(j-1)} + \big(X'z - X'WXp^{(j-1)}\big)/d^{(t)}$, $j = 1, \ldots, k$.
Step 2.2. Searching step: perform a line search along the direction $-p^{(k)}$ to obtain the next estimate.
This algorithm can be thought of as a quasi-Newton procedure. Essentially, the above algorithm uses the finite sum $\frac{1}{d^{(t)}}\sum_{j=0}^{k-1}\big(I_p - X'WX/d^{(t)}\big)^j$ as an approximation to $(X'WX)^{-1}$. In practice, the choice of $k$ can have a meaningful impact on the speed of convergence. We have noticed empirically that in the initial iterations a small $k$ suffices to provide a sufficient proportion of the full Newton step, yet in later iterations a larger $k$ is required. As such, our strategy is to choose the initial $k$ to be small and increase it by 2 at each iteration until we reach $k_{\max}$, the maximum number of inner iterations. Any value of $k_{\max}$ greater than or equal to 2 will suffice. In practice, we have found that choosing the initial $k$ to be 2 and $k_{\max}$ to be a moderate number such as 50 works well. Note that the search direction determined by the Hessian matrix $X'WX$ is only meaningful when the objective function is locally convex in the search region. In the general regression setting, this requirement is equivalent to $V$ being locally convex in $x'\theta$, so that all diagonal entries of $W$ are nonnegative. When the Hessian matrix is meaningful, this finite-sum approximation preserves some of the curvature information. In particular, let $U\Lambda U'$ be the spectral decomposition of the matrix $(X'WX)^{-1}$; it is easy to see that the finite sum can be decomposed as $U\Lambda_k U'$ with $\Lambda_k \to \Lambda$ as $k \to \infty$. When the Hessian matrix is not meaningful, in that $V$ is not locally convex in $x'\theta$, the above algorithm still yields an improvement. This claim is substantiated by the following theorem: Theorem 1. For any positive integer $k$, the search direction $-p^{(k)}$ is a descent direction.
Since the search direction is a descent direction, convergence of the algorithm can be achieved in conjunction with standard line search or trust-region techniques. In fact, there is no requirement on the diagonal matrix $W$: different choices of $W$ can be used while maintaining the descent property of $p$. The matrix $W$ can be chosen flexibly according to the structure of the problem. If $W(\theta)$ is the diagonal matrix of second derivatives of $V$ with respect to $x_i'\theta$, this corresponds to the original GOEM algorithm, with the hope of preserving local Hessian curvature information whenever possible. If $W(\theta) = I$, this corresponds to searching guided by a sense of global curvature, completely ignoring local curvature.
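The sketch below illustrates one outer GOEM iteration for unpenalized logistic regression, with $W(\theta) = \mathrm{diag}\{p_i(1 - p_i)\}$ playing the role of the local curvature weights; the inner loop is the truncated finite-sum solve of Step 2.1, and the simple step-halving rule stands in for any standard line search. The function name, the default $k$, and the backtracking rule are assumptions, not the authors' implementation.

```r
# Minimal sketch of one GOEM outer iteration for unpenalized logistic
# regression; d is assumed to come from Step 1, i.e. d > lambda_1(X'X)/2.
goem_logistic_step <- function(X, y, theta, d, k = 10) {
  eta  <- as.vector(X %*% theta)
  prob <- 1 / (1 + exp(-eta))
  grad <- crossprod(X, prob - y)            # psi(theta) = X'z(theta)
  w    <- prob * (1 - prob)                 # diagonal of W(theta)
  d_t  <- max(abs(w)) * d                   # rescaled scaling factor
  q    <- rep(0, ncol(X))                   # inner iterate p^(0) = 0
  for (j in seq_len(k)) {                   # truncated finite-sum solve
    r <- grad - crossprod(X, w * (X %*% q)) # residual of (X'WX) q = X'z
    q <- q + as.vector(r) / d_t
  }
  loglik <- function(th) {                  # logistic log-likelihood
    e <- as.vector(X %*% th)
    sum(y * e - log1p(exp(e)))
  }
  alpha <- 1                                # crude backtracking along -q
  while (loglik(theta - alpha * q) < loglik(theta) && alpha > 1e-8) {
    alpha <- alpha / 2
  }
  theta - alpha * q
}
```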

Penalized GOEM Algorithm
The GOEM algorithm can be further extended to include a penalty function $P_\lambda(\theta)$. Suppose we want to estimate the regression parameters $\theta$ by minimizing the penalized objective function $\sum_{i=1}^n V(y_i, x_i'\theta) + P_\lambda(\theta)$. Due to the orthogonality of the augmented regression matrix, the M-step updates are simple for a wide range of penalties, including the lasso, elastic net, SCAD, MCP, and group lasso, among many others.
We observe that in the unpenalized GOEM algorithm we solve a quadratic subproblem in each iteration. The quadratic subproblem is solved inexactly, while ensuring that the inexact solution gives a descent direction. The penalized GOEM algorithm proceeds along the same lines. For simplicity of presentation, denote $g(\theta) = \sum_{i=1}^n V(y_i, x_i'\theta)$ and $f(\theta) = g(\theta) + P_\lambda(\theta)$. Given an initial estimate $\tilde\theta^{(t-1)}$, the first step is to try to solve the quadratic subproblem
$$\min_\theta\; g(\tilde\theta^{(t-1)}) + \nabla g(\tilde\theta^{(t-1)})'(\theta - \tilde\theta^{(t-1)}) + \tfrac{1}{2}(\theta - \tilde\theta^{(t-1)})'H(\theta - \tilde\theta^{(t-1)}) + P_\lambda(\theta). \quad (1)$$
With the special structure of regression problems, $H$ has the form $X'WX$, where $W$ is a diagonal matrix with positive diagonal entries. The matrix $H$ can be the Hessian of $g(\theta)$ or an approximation of the Hessian. Since the minimization is taken with respect to $\theta$, it is easy to see that (1) is equivalent to the penalized weighted least squares
$$\min_\theta\; \tfrac{1}{2}(X\theta - \tilde y)'(X\theta - \tilde y) + P_\lambda(\theta)$$
for a suitably defined working response $\tilde y$ (with the weights absorbed into the regression matrix and $\tilde y$). Denote the weighted least-squares part by $\hat g(\theta) = (X\theta - \tilde y)'(X\theta - \tilde y)/2$ and the new objective function by $\hat f(\theta) = \hat g(\theta) + P_\lambda(\theta)$. The second step is to approximately solve $\min_\theta \hat f(\theta)$. To do this, consider a further quadratic approximation to $\hat g(\theta)$:
$$\hat g(\theta) \leq \hat g(\theta^{(j)}) + \nabla\hat g(\theta^{(j)})'(\theta - \theta^{(j)}) + \tfrac{d}{2}\|\theta - \theta^{(j)}\|^2.$$
Optimizing with respect to this upper bound, the solution is given as
$$\theta^{(j+1)} = \mathrm{prox}_{P_\lambda/d}\big(\theta^{(j)} - \nabla\hat g(\theta^{(j)})/d\big),$$
where $\mathrm{prox}_{P}$ is the proximal operator of $P$, $\mathrm{prox}_P(u) = \arg\min_\theta \tfrac{1}{2}\|\theta - u\|^2 + P(\theta)$. Applying the quadratic upper bound and proximal operator iteratively $k$ times, we obtain an inexact solution $\theta^{(k)}$. Note that unlike in the unpenalized version, where we only require $d > \lambda_1/2$, here we require $d > \lambda_1$. This ensures sufficient descent when the weighted least squares are contaminated by the penalty $P_\lambda$, so that we can quantitatively measure the distance of the inexact solution from the exact minimum $\hat f(\theta^*)$, where $\theta^* = \arg\min_\theta \hat f(\theta)$.
Finally, we get a search direction from the inexact solution, $p^{(t,k)} = \theta^{(k)} - \tilde\theta^{(t-1)}$, and perform a line search to get the next update $\tilde\theta^{(t)} = \tilde\theta^{(t-1)} + \alpha\, p^{(t,k)}$.
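For a lasso penalty $P_\lambda(\theta) = \lambda\|\theta\|_1$, the proximal operator is soft-thresholding, and the boosting step reduces to $k$ proximal-gradient updates on the weighted least-squares subproblem. The sketch below assumes the weights have already been absorbed into $X$ and the working response $\tilde y$, and that $d > \lambda_1(X'X)$ as required in the penalized case; names and defaults are illustrative assumptions.

```r
# Soft-thresholding: the proximal operator of the lasso penalty.
soft_threshold <- function(u, t) sign(u) * pmax(abs(u) - t, 0)

# Minimal sketch of the penalized boosting step: k proximal-gradient updates
# on the (pre-weighted) least-squares subproblem, returning the search
# direction p^(t,k) = theta^(k) - theta^(t-1).  Requires d > lambda_1(X'X).
penalized_boost <- function(X, ytilde, theta0, d, lambda, k = 10) {
  theta <- theta0
  for (j in seq_len(k)) {
    grad  <- crossprod(X, X %*% theta - ytilde)        # gradient of ghat(theta)
    theta <- soft_threshold(theta - as.vector(grad) / d, lambda / d)
  }
  theta - theta0
}
```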
The procedure can be summarized in the following steps:
Step 1. Compute a scaling factor $d > \lambda_1(X'X)$ using the GKLB procedure.
Step 2. Iterate between a boosting step and a searching step.
Step 2.1. Boosting step: approximately solve the penalized weighted least-squares subproblem by applying the proximal update $k$ times, yielding an inexact solution $\theta^{(k)}$.
Step 2.2. Searching step: propose the search direction $p^{(t,k)} = \theta^{(k)} - \tilde\theta^{(t-1)}$, and perform a line search to obtain $\tilde\theta^{(t)} = \tilde\theta^{(t-1)} + \alpha\, p^{(t,k)}$.
When $P_\lambda$ is convex in $\theta$, Theorem 2 shows that the search direction $p^{(t,k)}$ is a descent direction. The theorem is derived from standard analysis of proximal algorithms (e.g., Bertsekas 2010) and the work of Lee, Sun, and Saunders (2014).
Theorem 2 gives a global bound on the number of iterations needed to guarantee that the inexact solution is a good direction. Intuitively, it says that although we do not require $\arg\min_\theta \hat f(\theta)$ to be solved exactly, in general the approximation needs to be reasonably good. Since this is a global bound, fewer iterations are usually needed for an actual descent. Sharper bounds may also be obtained with different analytical tools, for example, by checking the duality gap (Schmidt, Roux, and Bach 2011).

Numerical Examples
This section applies the GOEM algorithm to different problems and compares its efficiency with various state-of-the-art methodologies. We demonstrate how our algorithm can be applied to linear regression, logistic regression, and survival analysis. For ordinary least squares, we compare with a Cholesky decomposition least-squares solver. For penalized least squares, we compare with glmnet (Friedman, Hastie, and Tibshirani 2010). For logistic regression, we compare with glmnet and Liblinear (Fan et al. 2008), as they are among the fastest solvers for penalized and unpenalized logistic regression. For survival analysis, we compare with fastcox, which has state-of-the-art computational speed for lasso-penalized Cox proportional hazards models. Our experimental setup involves real and synthetic data. Evaluation on the synthetic data can be used to determine which situations are most suitable for GOEM. We also investigate the performance of GOEM on modern large-scale datasets. All simulations and numerical experiments in this article were conducted on a 64-bit machine with an Intel Xeon E5-2470 2.30 GHz CPU and 128GB of main memory under a Linux operating system.

Linear Regression
The synthetic datasets were generated from a standard normal distribution. The response variable was generated according to $y = X\theta + \varepsilon$, where $\varepsilon$ is standard normal and independent of $X$; half of the elements of $\theta$ were drawn from a standard normal distribution independent of $X$ and $\varepsilon$, and the remaining half are exactly zero. The signal-to-noise ratio in all simulations was set to 5. The convergence precision for all methods was set to a tolerance of $10^{-7}$. The run times for OEM and Cholesky decomposition-based least squares are in Table 1. The Cholesky routines used are from the Eigen C++ library (Guennebaud et al. 2010) with an R wrapper.

OEM and GOEM were also compared with the algorithm of Meier, van de Geer, and Bühlmann (2008), as implemented in the R package grplasso, and with the algorithm of Yang and Zou (2015), as implemented in the R package gglasso. Run times for single values of the tuning parameter for the lasso and group lasso are in Table 1. As the magnitude of the tuning parameter can affect the run time of the algorithms, two choices were investigated: $\lambda_{25}$, the tuning parameter for which approximately 25% of the variables are nonzero, and $\lambda_{75}$, the tuning parameter for which approximately 75% of the variables are nonzero. As can be seen in Table 1, the GOEM algorithm tends to be relatively more effective when more variables are nonzero. This could be due in part to the fact that glmnet uses an active-set approach to screen out variables and reduce computation time when few variables are nonzero. GOEM could benefit from such an approach as well; however, it is competitive even without it.

The second experiment investigates the performance in the computation of pathwise solutions for regularized least squares. The run times for the computation of paths of 25 and 50 tuning parameters are in Table 2. The OEM algorithm tends to perform best when many tuning parameters are required; however, when the dimension and the number of observations grow, the GOEM algorithm eventually becomes more efficient. An investigation into the computational complexity reveals this trend as well. As such, GOEM is more suitable when the problem size is quite large.
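For reference, the following is a hedged R sketch of the synthetic-data setup described at the start of this subsection; the particular way the coefficients are rescaled to reach a signal-to-noise ratio of 5 is an assumption, since the exact calibration is not spelled out above.

```r
# Hedged sketch of the linear-regression simulation setup: standard normal
# design, half of the coefficients nonzero, standard normal errors, and the
# signal rescaled (one plausible calibration) to hit the target SNR.
gen_linear <- function(n, p, snr = 5) {
  X      <- matrix(rnorm(n * p), n, p)
  theta  <- c(rnorm(p %/% 2), rep(0, p - p %/% 2))          # half nonzero
  theta  <- theta * sqrt(snr) / sd(as.vector(X %*% theta))  # calibrate signal
  list(X = X, y = as.vector(X %*% theta) + rnorm(n), theta = theta)
}
```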

Binary Response
For data with a binary response, we consider the generalized linear model. Denote the response variable as $y$, the data as $x$, and the coefficients as $\theta$. The conditional probability of class membership is modeled as $P(y = 1 \mid x) = g(x'\theta)$, where $g = \mathrm{logit}^{-1}$ is the inverse logit link function. Therefore, the loss takes the form $V(y_i, x_i'\theta) = -y_i x_i'\theta + \log\{1 + \exp(x_i'\theta)\}$.

The synthetic datasets were generated from a standard normal distribution. The response variable was generated according to the logistic model $P(y = 1) = (1 + \exp(X\theta))^{-1}$, where half of the elements of $\theta$ were drawn from a normal distribution independent of $X$ with standard deviation 0.05, so as to prevent an overly strong signal. The other half of $\theta$ was set to be exactly zero. The convergence precision was set to a tolerance of $10^{-5}$ for OEM, glmnet, and Liblinear. The run times for OEM, glmnet, and Liblinear for regularized logistic regression, for $p \in \{500, 2500, 5000\}$ and $n \in \{1 \times 10^4, 5 \times 10^4, 1 \times 10^5\}$, are in Table 3.

Two real datasets were used to compare the efficiency and performance of the approximate-Hessian OEM algorithm with other state-of-the-art methods. The first dataset has 11 million instances of 28 features that describe the kinematic properties of particle collisions (Baldi, Sadowski, and Whiteson 2014). The data are available as a 7.5 Gigabyte file from https://archive.ics.uci.edu/ml/datasets/HIGGS. The response is an indicator of whether a Higgs boson has been produced by the collision. The second dataset used for evaluation is the Reuters Corpus Volume I (RCV1) dataset (Lewis et al. 2004). RCV1 contains 20,242 newswire stories in various categories. The features are the unique words present in each of the stories, 47,236 words in total after stop words were removed. The data were preprocessed using principal components analysis, and a varying number of components was used to evaluate run times versus the number of features.
To evaluate the performance of the methods over training time, the stopping criterion or maximum number of iterations was varied for each method, and predictive performance was evaluated on a testing dataset independent of the training dataset. For the Higgs dataset, one million instances were used for training and one million instances were used for testing. Four million instances were used for training in the supersymmetric particle dataset and one million were used for testing. For $\ell_1$-regularized logistic regression, because the tuning parameter is specified differently for Liblinear, the tuning parameters were chosen for each method so that all methods converge to the same estimated coefficients. As can be seen in Figure 2, GOEM converges more rapidly than both glmnet and Liblinear on the Higgs dataset. The run times of the various methods versus the number of principal components are compared in Figure 3 for both logistic regression and $\ell_1$-regularized logistic regression. It can be seen that GOEM is competitive for both a small and a large number of components.

Survival Analysis
For survival analysis, we consider the Cox proportional hazards model. Denote the observed data as $(y_i, x_i, \delta_i)$, $i = 1, \ldots, n$, where $y_i$ is the sorted observation time and $\delta_i \in \{0, 1\}$ is a failure indicator. Under the proportional hazards assumption, the conditional failure probability can be modeled as $\exp(x_i'\theta)\big/\sum_{j \in R_i} \exp(x_j'\theta)$, where $R_i$ is the risk set for failure time $i$. In this case, $W(\theta)$ is not a diagonal matrix, but it can be well approximated by its diagonal elements, since its off-diagonal elements are of the order $1/n$ (Hastie and Tibshirani 1990, chap. 8), and a direct calculation gives these diagonal elements in closed form. In our experimentation, we also investigated the GOEM algorithm with $W(\theta) = I$; this approximation enjoys computational advantages when the data are very tall and not too wide.

The design matrix was simulated from a standard normal distribution, as in the experiments for linear regression and logistic regression. The true survival time was generated from an exponential distribution with mean $\exp\{x_i'\theta\}$. The coefficients $\theta$ were generated in the same manner as for the logistic regression simulations. The censoring time was generated independently of the survival time such that there is a 50% censoring rate. From Table 4, we see that the latter version (with $W(\theta) = I$) tends to be slow when the ratio of $p$ to $n$ is relatively large and becomes more efficient when this ratio decreases. All run times reported are the total computation time for a sequence of 50 values of $\lambda$, averaged over 10 independent runs.
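As a hedged sketch of the survival-data setup just described, the code below draws censoring times from the same exponential distribution as the survival times, which yields roughly 50% censoring by symmetry; this particular censoring mechanism is an assumption, since only the target censoring rate is stated above.

```r
# Hedged sketch of the Cox-model simulation setup: standard normal design,
# coefficients as in the logistic simulations, exponential survival times
# with mean exp(x_i' theta), and censoring times from the same distribution
# (one simple way to obtain roughly 50% censoring).
gen_cox <- function(n, p) {
  X     <- matrix(rnorm(n * p), n, p)
  theta <- c(rnorm(p %/% 2, sd = 0.05), rep(0, p - p %/% 2))
  mu    <- exp(as.vector(X %*% theta))       # stated mean of the survival time
  time  <- rexp(n, rate = 1 / mu)            # true survival times
  cens  <- rexp(n, rate = 1 / mu)            # independent censoring times
  list(X = X, y = pmin(time, cens), delta = as.integer(time <= cens))
}
```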
The squared error of the estimated coefficients $\hat\beta$ was computed as $\frac{1}{p}\|\hat\beta_{\mathrm{OLS}} - \hat\beta\|^2$, which serves to evaluate the precision of an estimate by comparing the estimated solution with the ordinary least-square solution. The results are presented in Table 5 and reveal some interesting findings. While it may seem contradictory that the leverage score method should be slower than the Cholesky decomposition in some settings, this is in fact because the fast leverage score approximation requires a QR decomposition of a matrix of dimension $r_1 \times n$. Furthermore, in higher-dimensional settings, a very large subsample size is required for fast leverage score-based methods to achieve reasonable precision. It is up to the user to decide whether the improvement in run time is worth the loss in precision when choosing between subsampling methods and the GOEM algorithm.

Conclusion
We have introduced the GOEM algorithm, which extends the OEM algorithm to a wider class of models and results in improved computational speed. It works by iteratively boosting and searching for a better estimate, leveraging the computational benefits of the OEM algorithm. We have demonstrated its performance on a variety of problems with both synthetic and real datasets. Experimental results show that our method is highly efficient and is competitive with state-of-the-art methods for a wide class of models and penalties. In addition to speed, the GOEM algorithm is also intuitive, with a natural geometric interpretation. Furthermore, it is simple to implement, which makes it appealing for practitioners who may wish to extend it further or to implement it on a much larger scale than is possible in the R computing framework. Beyond our preliminary convergence analysis, further fine-tuning of the method could improve the overall computational efficiency. Ill-conditioning of matrices is a potential drawback of the GOEM algorithm, but it can be mitigated by using a preconditioner.