Iterative proportional scaling revisited: a modern optimization perspective

This paper revisits the classic iterative proportional scaling (IPS) from a modern optimization perspective. In contrast to the criticisms made in the literature, we show that based on a coordinate descent characterization, IPS can be slightly modified to deliver coefficient estimates, and from a majorization-minimization standpoint, IPS can be extended to handle log-affine models with features not necessarily binary-valued or nonnegative. Furthermore, some state-of-the-art optimization techniques such as block-wise computation, randomization and momentum-based acceleration can be employed to provide more scalable IPS algorithms, as well as some regularized variants of IPS for concurrent feature selection.


Background
Count data are ubiquitous in modern statistical applications. Such data are often cross-classified into contingency tables, where iterative proportional scaling (IPS) can be applied as a standard tool (Fienberg and Meyer, 2006). IPS was firstly introduced by Deming and Stephan (1940) to adjust a contingency table to obey prescribed column and row marginals, the problem of which is referred to as matrix raking nowadays. In general, IPS can be applied to Kullback-Leibler (KL) divergence minimization with linear constraints, and Poisson log-linear model fitting on multi-way tables (Ireland and Kullback, 1968;Bishop et al., 1975), and there is an interesting duality between these two types of problems (Good, 1963;Csiszár, 1975). Theoretical studies regarding the convergence properties of IPS undergo a long history and we refer to Pukelsheim (2014) for a comprehensive survey. To name a few, Fienberg (1970) took a geometric approach, assuming that all entries in the input table are positive, while Haberman (1974) and Bishop et al. (1975) were among the first to use the ascent property of the associated log-likelihood. It is worth mentioning that the theoretical studies become more challenging if IPS operates on a table with zero entries (Sinkhorn and Knopp, 1967;Csiszár, 1975).
Although the standard version of IPS is derived for contingency tables, it has some quite useful and popular extensions. For example, Darroch and Ratcliff (1972) proposed the generalized iterative scaling (GIS) which can fit Poisson log-affine models with non-negative designs. Later, Pietra et al. (1997) developed a more relaxed improved iterative scaling (IIS), without the requirement that all rows of the design matrix must sum to one. The class of IPS algorithms are widely used in the areas of Markov random fields and Gibbs distributions, natural language processing, matrix factorization, econometrics, boosting, and others (Sinkhorn, 1964;McCallum et al., 2000;Lahr and De Mesnard, 2004;Phillips et al., 2004;Kurras, 2015).
Today, IPS is less frequently used by statisticians, largely due to the outstanding performance and generality of Newton-type algorithms. In fact, Newton-Raphson is the default routine in most software for fitting generalized linear models (GLMs), and can have quadratic convergence in contrast to the linear convergence rate of IPS (ignoring the cost difference per iteration). Consider a five-way table of size 10 × 10 × 10 × 10 × 10. The associated Poisson model including all up to three-way interactions has 8,146 independent parameters. The corresponding Hessian matrix has more than 10 7 entries, leading to prohibitively high computational cost when using Newton-Raphson. Quasi-Newton methods are more computationally economical, but still fail easily when the model is of high dimensionality. First-order methods (such as gradient descent) are typically more scalable; in our problem, however, there exists no universal stepsize due to the unbounded Hessian. This means that some line search method must be adopted, but frequent function and gradient evaluations can be expensive on big data.
The traditional IPS algorithm is potentially helpful in this regard. In comparison with Newton-type and gradient descent methods, IPS has certain benefits in computation. We illustrate the procedure via a toy example. Consider a three-way table m ijk of size m 1 × m 2 × m 3 with the categorical variables denoted by X, Y, Z and assume a Poisson log-linear model (XY, Y Z, XZ) with margins m ij+ , m i+k and m +jk as sufficient statistics (Agresti, 2012). Starting with initial counts [µ (0) ijk ] 1≤i≤m 1 ,1≤j≤m 2 ,1≤k≤m 3 , IPS adjusts the estimated counts in a multiplicative manner iteratively: µ for all (j, k) in the t-th epoch. Clearly, all calculations are "in-place" (with no need of auxiliary variables) and no line search is required. The procedure has memory efficiency, scalability and implementation ease. In addition, it converges within one step once the solution has a closed form expression.
IPS is subject to some serious criticisms in the literature. Although IPS produces expected cell counts, it does not deliver any coefficient estimate nor asymptotic covariance estimate. The elegant scaling procedure has however a somewhat narrow scope and may encounter difficulties say in the scenarios with features not necessarily bi-leveled or nonnegative, or in shrinkage estimation. Finally, though cost-effective per iteration, IPS often requires a large number of iterations to converge.
This paper attempts to investigate IPS from a modern optimization perspective to improve and generalize the classic method and overcome the aforementioned obstacles. Our main contributions are threefold. First, we are able to show that IPS implicitly involves a coefficient-update step and adding it back leads to a novel coordinate descent characterization of the procedure. To the best of our knowledge, this is the first fix of IPS to produce coefficient estimates. Second, we reveal an interesting connection of IPS to majorization-minimization (MM) algorithms (Hunter and Lange, 2004). The MM principle successfully generalizes IPS to handle arbitrary features, with the celebrated GIS and IIS taken as two particular instances. Third, we employ some state-of-the-art optimization techniques, such as block-wise computation, randomization and momentum-based acceleration, to develop highly scalable IPS algorithms (without using parallel computation), as well as some sparse variants of IPS for concurrent feature selection.
The rest of the paper is organized as follows. Notations and model assumptions are introduced in Section 1.2. Section 2 describes a coefficientdriven IPS based on coordinate descent, and discusses its convergence properties, efficient implementation and acceleration. Section 3 shows several effective ways of constructing MM surrogate functions to generalize IPS. Section 4 develops sparse IPS for high dimensional estimation. Section 5 shows the accuracy and efficiency brought by blockwise descent, randomization, reparametrization, and momentum-based acceleration in simulations and real data experiments.
After the acceptance of our paper, we noticed the work by Anna and Tamás (2015), which showed how to update the coefficients in IPS under the assumption of strictly positive counts. In Section 2 we give a more general result and more efficient algorithms. They also proposed two generalized procedures for multinomial likelihood optimization based on bisection or coarse-to-fine search. See Section 3 for some fast optimization algorithms that can operate on general designs.

Notation and model setting
Notation. Given N ∈ N, we define [N] = {1, 2, . . . , N}. We use bold lower-case and upper-case symbols to denote column vectors and matrices, and j ∈ [p]. The inner product between two vectors x and y is x, y = x T y, their Hadamard product is denoted by x • y, and their component-wise division is denoted by x ⊘ y. Given any matrix X, we denote by x i+ the ith row sum j x ij . For notational ease, we extend all scalar functions and operations in a component-wise manner. For example, given x = [x i ] ∈ R N and y ∈ R N , exp(x) stands for [exp(x i )], log(x) = [log x i ], and x y indicates that x i ≥ y i for all i ∈ [N]. The design matrix X ∈ R N ×p is frequently partitioned into columns (features) and rows: Given two square matrices X and Y , X Y means X − Y is positive semi-definite. We denote by x 1 and x 2 the ℓ 1 -norm and the ℓ 2 -norm of x, respectively. The spectral norm and the infinity norm of a matrix X are defined as X 2 = σ max (X) (the largest singular value of X) and X ∞ = max i j |x ij |, respectively.
Given a contingency table model, one can vectorize all cell counts and introduce a design matrix in the framework of Poisson log-affine models. Concretely, let X k (1 ≤ k ≤ r) be the kth categorical variable taking values in [m k ] and r be the total number of categorical variables. Introduce dummy variables X (ℓ k ) k I(X k = ℓ k ) for ℓ k = 2, 3, . . . , m k with X k = ℓ 1 as the baseline, and let X k = [X ]. Then the model matrix containing all main effects has form [1, X 1 , X 2 , . . . , X r ]. Similarly, the model matrix including all two-way interactions has the following form [X 1 * X 2 , . . . , X 1 * X r , X 2 * X 3 , . . . , X 2 * X r , . . . , ]. Higher-order interactions can be included as well. Given an arbitrary X ∈ R N ×p , a log-linear model with mean µ ∈ R N satisfies µ = exp(Xβ), where β denotes the coefficient vector. In some applications, e.g., rate data analysis and matrix raking, an extra offset q 0 is required to specify a log-affine model (Lauritzen, 1996): We allow β j to take ±∞, i.e., β ∈R p withR = [−∞, ∞] (the extended real line). Throughout the paper, we will not consider overdispersion or inflated zeros.
Let n ∈ R N with n i ≥ 0 denote the observed entries. The maximum likelihood (ML) estimation problem of µ according to the log-affine model can be formulated by whereR X = { p j=1 β j x j : β j ∈R} denotes the closure of the range of X. Equivalently, the loss can be D KL (n µ) or the deviance function. For convenience, the constraint region is denoted byM {µ | µ = q • exp(η), η ∈ R X }. As suggested by Lauritzen (1996), theoretically it is more convenient to consider (1) in a slightly more restricted manner in order to guarantee the convergence of IPS: where M * {µ | l(µ) < +∞}, meaning that n i > 0 implies For simplicity, the following assumptions are made throughout the paper: Assumption (i) is trivial. (ii) and (iii) are without loss of generality, since one can drop the corresponding trivial predictors and/or observations. (iv) ensures finite likelihood for at least one point inM, and typically holds in real-life applications (otherwise one could add a mild ℓ 2 -type penalty to make the criterion strongly convex, cf. Section 2.2). It is worth mentioning that, under (iv), one can remove certain observations to ensure a positive q without affecting the coefficient estimation. In fact, for any µ = [µ i ] ∈M ∩ M * , q i = 0 implies µ i = 0 and n i = 0; so the ith observation does not contribute to the objective function in (1) and can be excluded in optimization. This can greatly simplify computation and analysis.
To extend IPS, we specify three types of design matrices: (3) Clearly, design matrices derived from contingency table models are special cases of (a). But real applications can go much beyond this binary setting.

A Coordinate Descent Characterization
As mentioned in Section 1, IPS is commonly used for matrix raking to find a table µ that not only matches the marginals of a reference table n ∈ R N but is closest to an initial (prior) q ∈ R N in the sense of relative entropy or KL divergence: The features x j are binary-valued in the table setup (but not so in general) and x j , n give prescribed marginals. In certain applications there is no need to provide n, since only the constraint values are needed. When 1 ∈ R(X), 1, µ is a constant and so the objective can be reduced to µ i log(µ i /q i ). Without the binary restriction on the design, (4) defines the general problem of maximum entropy with linear constraints, and has widespread applications in statistical mechanics, information theory, natural language processing, and ecology (Berger et al., 1996;Dudík et al., 2004;Elith et al., 2011). The duality between the maximum entropy problem (4) and the maximum likelihood problem (1) is well known by statisticians, and indeed many take the ML route to study the properties of IPS.

Coefficient-driven IPS
IPS is often criticized for not being able to deliver a coefficient estimate. We will show, however, that this is not true and IPS includes an implicit coefficient-update step.
First, although the objective function of IPS is conventionally formulated with respect to the unknown mean vector µ (or log µ) in the literature, it is arguably more insightful to rewrite (1) in terms of β: where we used µ i (β) = q i exp(x T i β). The convenience can be partially observed from the initialization condition of IPS (Fienberg and Meyer, 2006), which requires µ (0) to take the form of q • exp(η (0) ) for some η (0) in R X . Here, we still use l(·) to denote the loss by abuse of notation, and when 1 ∈ R(X), it is easy to see that (5) amounts to minimizing the G 2 -statistic 2 n i log(n i /µ i (β)). For this convex optimization problem, we can design a simple cyclic coordinate descent (CD) algorithm, which updates β j (1 ≤ j ≤ p) according to the following formula with the other coordinates fixed Define Algorithm 1 shows the details of the cyclic coordinate update. The solution to (7) has a closed form in some cases. For example, if we assume the design is derived from a contingency table model, or more generally, X is binary (cf. (3)), then the following crucial fact Step 5 in Algorithm 1 then becomes or equivalently, µ , which is exactly the IPS algorithm used

Algorithm 1 IPS-CD
Input n, q and X Initialize β (0) ∈ R p , t ← 0 1: while not converged do 2: 3:   (t) in matrix raking. In the literature, Haberman (1974) shows that IPS is a cyclic ascent method in updating log µ, but to the best of our knowledge, formulating IPS as cyclic CD on β is new.
Algorithm 1 provides more flexibility in initialization. For instance, µ (0) = q is unnecessary; rather, starting with an arbitrary β (0) ∈ R p suffices. In our experience, a properly chosen initial point can reduce the computational time substantially in large-scale data problems. The algorithm not only offers an easy fix of IPS to yieldβ, but also suggests efficient ways to update multiple components of µ at one time. Concretely, when β j (1 ≤ j ≤ p) is changed, all the µ i with x ij = 0 can be updated. It is thus extremely helpful in implementation to consolidate the features and use a design matrix with full column rank (which can be easily obtained by QR or LU decomposition). For example, on the aforementioned homogeneous association model (XY, XZ, Y Z) with X, Y, Z taking two levels, each epoch of the ordinary IPS updates all cell values according to the minimal sufficient statistics in 4 × 3 = 12 steps (cf. Section 1.1), while IPS-CD updates β and the associated cells in 7 steps; see Figure 1 for an illustration.
Some people conjectured that IPS attempted to minimize Pearson's X 2statistic, but this is not true-see Theorem 1. It is well known that X 2 converges faster than G 2 to the asymptotic χ 2 -distribution. Nicely, our coordinate descent standpoint can modify IPS in a simple way to solve the problem In fact, similar to the derivation of (7), given the other coordinates β by solving the equation and with a binary design, the equation has a closed-form solution and the resultant iterative scaling of µ is The procedure can be conveniently used for testing associations and interactions in contingency tables. Its differences from the ordinary version (10) are seen in the term n • n ⊘ µ (t,j−1) in place of n, and the additional factor of 1/2.

Convergence properties
The CD characterization facilitates theoretical studies of the convergence properties of IPS. First, we have a natural outcome for Algorithm 1 for a general design X.
Theorem 1. For the sequence of iterates {β (t) } ∞ t=0 generated by Algorithm 1, the associated function values l(β (t) ) are monotonically non-increasing for t ≥ 0. In particular, if the first column of X corresponds to the intercept, Similar results have been obtained on contingency tables (e.g., Bishop et al. (1975)). The theorem directly follows from the coordinate descent nature of the algorithm design, and is not restricted to tables. Moreover, because of the convexity of the problem, under some regularity conditions, the convergence of the sequence of iterates is readily at hand.
Theorem 2. Suppose that there exists a unique solutionβ to problem (5) of finite norm. Then the sequence of iterates {β (t) } ∞ t=0 generated by Algorithm 1 has a unique limit pointβ, and the rate of convergence is at least linear.
The linear convergence rate was shown by Fienberg (1970) for matrix raking. Our theorem builds upon the theory of descent methods (Luo and Tseng, 1992). In consideration of recent advances (Tseng, 2001;Razaviyayn et al., 2013), one can possibly relax the assumption of Theorem 2 in certain ways, but we will not pursue further in this work. We refer to Fienberg and Rinaldo (2012) and Wang et al. (2016) for detailed studies of the existence and uniqueness of MLE. In practice, we could add a mild ridge-type penalty (cf. Section 4), which ensures the condition and enhances numerical stability.

Randomization and block-wise computation
Recently, CD algorithms have received a lot of attention in statistics (high dimensional statistics, in particular) due to their simplicity and low-complexity operations at each iteration. But the characteristic of not updating all variables together may also make them take more iterations and require more stringent conditions to converge. Instead of the cyclic update in Algorithm 1, one can choose the coordinate with the largest derivative in magnitude, j = arg max j |∇ j l(β (t) )|, the so-called Gauss-Southwell (G-S) rule. G-S can successfully reduce the number of iterations, and has been analyzed in depth in the literature (see, e.g., Bertsekas (2015) and Nutini et al. (2015)). However, it is inefficient in large problems since all partial derivatives have to be computed-indeed, with the full gradient vector available, one could update the whole vector β at each step.
An effective way to speed the convergence of the algorithm is to randomize it. The recent analysis of Nesterov (2012) shows that random coordinate selection can achieve the same convergence rate as G-S. It is worth mentioning

Algorithm 2 A-IPS.
Input n, q and X Initialize β (0) ∈ R p , t ← 0 1: while not converged do 2: that random strategies make complexity bounds much easier to obtain, and are often suitable for modern computational architectures. Experience shows that random sampling without replacement (Wright, 2015) works well in IPS. Concretely, at the start of each cycle, we randomly shuffle the elements in [p] to obtain a new index set Perm[p], and then update the permuted coordinates sequentially. The randomized algorithm, denoted by Accelerated-IPS (A-IPS), is presented in Algorithm 2. In theory, randomized coordinate selection is able to avoid worst-case order of coordinates, and seems to be better than the cyclic rule in an average sense (Richtárik and Takáč, 2014). (Sampling the coordinates in a data-dependent manner may be better, but it involves more computation and is not considered here.) Extending the coordinate-by-coordinate update to block-by-block update may save some computational time, too. Block coordinate descent (BCD) decomposes all unknowns into m blocks, and updates only one block at a time. In this way, a large difficult problem can be reduced to a series of smaller and easier sub-problems so that Newton or quasi-Newton methods such as L-BFGS (Liu and Nocedal, 1989) can be applied with ease. In the block setting, again, G-S takes much fewer iteration steps than the plain cyclic rule, but in terms of total computational time, it incurs significantly more overhead in big data applications due to the calculation of the full gradient. Perhaps surprisingly, we found that some well-known random rules, such as (block) resampling without replacement, may not have satisfactory performance, either. We recommend random blocking followed by cyclic update, and this random block-wise IPS is referred to as B-IPS. See Figure 2 for an illustration. At the beginning of each iteration, all coordinates are randomly shuffled and sequentially assigned into m blocks with sizes g 1 , . . . , g m , and then we update the blocks of coefficients in a cyclic manner. This strategy substantially outperforms the other rules according to our experiments.

A Majorization-Minimization Viewpoint
This section uses the majorization-minimization (MM) principle to study and generalize IPS. We refer to Lange et al. (2000) and Hunter and Lange (2004) for more details on MM algorithms. Rather than directly minimizing l(β) in (5), our goal here is to construct a surrogate function g(β | β − ), with β − denoting the value of β from the last iteration, such that the following properties are satisfied for all β ∈R p : Then, if we set β (t+1) ∈ arg min β g(β | β (t) ), the sequence of iterates satisfies Hence repeatedly minimizing the surrogate function g(· | β − ) guarantees the original objective function l(β) to be monotonically non-increasing. Because the problem under consideration is convex, there are other nice properties of the MM iterates (e.g., Chapter 12 of Lange (2013)). We shall focus on the derivation of surrogate functions in this section. For notational simplicity, define µ − exp(Xβ − ) and µ (t) exp(Xβ (t) ). After getting β (t+1) , one can update µ by which still results in an iterative scaling procedure and could be viewed as a generalized IPS.

GIS and extensions
We derive three surrogate functions all of which are applicable under the non-negative design setting (3). Recall the objective function: Noticing the convexity of l i in β, we can apply Jensen's inequality where α ij satisfy α ij ≥ 0, p j=1 α ij = 1 for all i. We will not allow α ij to be zero when x ij = 0. If x ij = 0, the associated term x ij β j is β j -independent and so we set α ij = 0 for such x ij formally. This gives Let's first study binary designs. With the somewhat naive choice α ij = 1/p, we immediately obtain a surrogate satisfying (11): Given any binary design with x ij = 0 or 1, which covers the contingency table setting, the optimal solution of arg min β g 1 (β | β − ) is given by: The updates of parameter and mean are obtained as follows The MM algorithm shares similarities with the IPS defined in Algorithm 1, but differs in two ways. IPS updates the components of β sequentially (asynchronously), while (17) updates the entire vector synchronously. Correspondingly, the stepsize in MM algorithm is smaller than that used in standard IPS. Now we consider non-negative features. Recall (3), where we have x i+ = 0 for all i from assumption (iii) in Section 1.2. Setting which leads to another surrogate function The optimal β satisfies i Applying the inequality with a = x i+ and b = max i x i,+ = X ∞ R (R > 0 from assumption (iii) in Section 1.2) to (19) gives Now, the optimal solution can be explicitly evaluated: β j = β − j +(1/R) log(x T j n /x T j µ − ), and the iterates are then given by (20) is computationally more efficient than (17) in binary scenarios. (For example, for a three-way table of size 2 × 2 × 100, the main-effects model has p = 102 and R = 4 and so the stepsize in (20) is much larger.) When R = 1, the update of µ corresponds to the GIS by Darroch and Ratcliff (1972). (Darroch & Ratcliff pre-transformed the original design matrix such that all elements are non-negative and all row sums are equal to one.) The MM viewpoint is able to extend GIS, further, to deal with an arbitrary design matrix. Define the row support by S r j {i ∈ [N]|x ij = 0}, and is a valid surrogate function, and the corresponding update of β j has an explicit expression, which degenerates to (20) in the special case of a non-negative design (details omitted).
It is also worth mentioning that MM is capable of deriving block-wise algorithms that update all blocks in parallel (in contrast to the sequential update of block coordinate descent). For example, consider (13) with nonnegative designs. Let {G 1 , . . . , G k , . . . , G m } form a partition of the whole set [p],x T i,k be the ith row vector of the sub-matrix X k , A nice feature is that this surrogate is separable in β k and so all blocks of coefficients can be simultaneously updated. The convergence of the MM algorithm is typically slower than that of BCD, but parallel computing resources should make it much more efficient, which will be investigated in future work.
The only choice to guarantee that g 4 is a surrogate is ζ = 1/ q, exp(Xβ − ) .

Now, minimizing (24) givesβ
and the corresponding proportional scaling onμ is given bẙ Interestingly, this algorithm can be converted to the celebrated IIS (Pietra et al., 1997) that runs on the normalized observed and estimated counts, i.e.,n n/ 1, n andμ µ/ 1, µ .
In (25) and (26), neither the normalization operation nor the intercept update is needed during the iteration. One can extract the intercept estimate at the end.
The reparametrized form (22) offers more options in surrogate construction. A pleasing fact is that the term 1, n log q, exp(Xβ) has uniformly bounded curvature. Let's consider a quadratic function Q, By Taylor expansion, Q is a valid surrogate function provided that A straightforward choice is W = ( 1, n X 2 2 /2) I. It can be refined to W = 1, n X T (I − 11 T /N)X/2 (Bohning and Lindsay, 1988) noticing the matrix in the brackets of (30) has an eigenvector 1 associated with eigenvalue 0. Based on our experience, the latter is better in most large problems and will be adopted unless otherwise specified. In either case, the optimal solution of minβ (29) can be identified as a Bregman divergence D(β |β − ), (29) falls into the computational framework where Nesterov's second acceleration scheme applies (Nesterov, 1988). At each step, it uses two auxiliary sequences and adds momentum terms in updating the iterates. It can be shown rigorously that this ingenious approach leads to improved rate of convergence; see, e.g., Tseng (2010). The resulting algorithm (Algorithm 3) is referred to as the Quadratic-surrogate IPS (Q-IPS). It is worth mentioning that this momentum-based acceleration does not add much additional cost in each step but offers significant improvement over IIS and GIS (cf. Section 5).

Regularized Estimation
Modern statistical applications often involve a large number of variables, where regularization is necessary to achieve estimation accuracy or model parsimony. For example, one can append an ℓ 2 -type penalty to the negative log-likelihood to handle collinearity:

Algorithm 3 Q-IPS.
Input n, q,X where β 1 corresponds to the intercept column and λ ≥ 0 is a regularization parameter. λ can be tuned by AIC (Akaike, 1974) but even fixing it at a small value (say 1e-5) often shows improved accuracy. We recommend including such a mild ℓ 2 -penalty in practical applications, especially those with zero counts. Many of the previously developed algorithms can be easily modified to adapt to (31) and the details are not discussed.
Another popular way of regularization is to enforce sparsity, which can help practitioners select a small set of relevant features. The coordinate descent characterization of IPS enables us to develop its sparse variants on contingency tables. Assume a binary design X (cf. (3)) and consider the following problem subject to an ℓ 1 -or ℓ 0 -penalty where typically λ 1 = 0 and λ j = λ for 2 ≤ j ≤ p. The following theorem can be used to derive the coordinate-wise update for (32).
Theorem 4. Let n, x, q ∈ R N , β ∈ R, λ ≥ 0. Then the solution to the optimization problem min β∈R − n, x β + q, exp(βx) + λ|β| is given bŷ We use the ℓ 1 -penalized problem as an example to show how to modify Algorithm 1 to get a sparsity-pursuing IPS (a similar algorithm can be developed in the ℓ 0 case). Let µ (t,j−1) . When the intercept is not subject to any penalty, it can be updated by β (t,0) ). So the iterative scaling on the mean vector is µ (t,j) . This modified Algorithm 1 is termed ℓ 1 -IPS. In contrast to the greedy procedures for maximum entropy with concurrent feature selection in the literature, our optimization-based algorithm is stable in the sense that there is an associated objective function (32) that is guaranteed to be non-increasing, and no quadratic approximation or line search is required.

Experiments
This section tests various algorithms on both contingency tables and loglinear models with non-binary features. The observed vector n = [n i ] has independent entries satisfying where the mean vector µ * = exp(Xβ * ) and the first column of the design matrix X is assumed to be 1. The details of how to generate the true coefficient vector β * will vary in different setttins. Given each setting, we simulate 20 data sets and report averaged results. The measures to characterize the computational optimization error and statistical estimation error of the t-th iterate are the relative gradient g t ∞ / g 0 ∞ , and the relative estimation error β (t) − β * 2 2 / β * 2 2 , where g t (t ≥ 0) is the gradient of the original objective function (5) evalu-ated at β (t) . The termination criterion is met if g t ∞ / g 0 ∞ ≤ ǫ tol or the running time is greater than t max . By default, ǫ tol = 1e-4 and t max = 600. Typically, we will plot computational and statistical errors (on the log scale for better visibility) against computational time, rather than the number of iterations, since different algorithms may have very different per-iteration complexity.
The algorithms for comparison include IPS (cf. Alg. 1), A-IPS (cf. Alg.  (2012). The simulations are conducted on a PC with 2.9GHz CPU, 16GB memory, and 64-bit Windows 10. In the following, we first run IPS algorithms on contingency tables to examine the power of acceleration brought by different randomization strategies, then compare IIS, GIS, Newton-type algorithms and our proposed generalized IPS on problems with non-binary features, as well as studying their scalability on large-scale data. At the end, a real marketing campaign dataset is analyzed with the sparsity-pursuing IPS.

Power of randomization
The first experiment compares different random schemes on a 10×10×10×10 table with all two-way interactions. This homogeneous association model (Agresti, 2012) has 523 predictors, including 36 main effects and 486 two-way associations in addition to the intercept. The coefficient vector is generated in two ways. In the first setting, the intercept is 2, the last 10 components are sampled independently from 0.5N (1, 1) + 0.5N (3, 1), and the remaining are 0; in the second setting, we set 30 randomly chosen components of β * j to nonzero, sampled from the previous mixture distribution, and set the rest zero. The results averaged over 20 runs are shown in Figure 3.
According to the figure, A-IPS is much faster than IPS in the first setting, and is comparable to it in the second setting, but in both scenarios, A-IPS delivers far more statistically accurate estimates. (The optimization and estimation errors are shown on the log scale.) Perhaps surprisingly, fixed blocking with random block selection does not show the full power of randomization. Our random-blocking-based B-IPS is clearly the winner both in computational efficiency and in statistical estimation. Next, we compare A-IPS and B-IPS with Newton's method on a table of size 10 × 10 × 10 × 10 × 10, where a three-way homogeneous association model including all interactions up to third order is assumed (and thus p = 8,146). The coefficients are generated such that β * 0 = 5, the last 2,000 are independently sampled from N (1, 1) and the rest zero. Seen from Figure 4, Newton's algorithm could not deliver a useful estimate within the time limit. It was quite memory-demanding in the experiment. In comparison, A-IPS and B-IPS offer good scalability and statistical accuracy.

Generalized IPS algorithms
This subsection goes beyond the binary feature setting and we use generalized IPS to handle nonnegative features as well as those having both positive and negative values. By default, the prototype designX (the submatrix of X excluding the intercept) is generated with each row sampled independently from N (0, [ρ |j−k| ]), where ρ = 0.8 and 1 ≤ j, k ≤ p − 1. Some shifting and scaling operations will be performed to avoid overflow issues or to make the design non-negative. The first experiment is to find the winning algorithm among GIS, IIS, and Q-IPS. The celebrated GIS and IIS only run on non-negative features, which can be produced byX ←X − (min i,jxij )11 T . We scale it down further, X ←X/(50 X max ), for better numerical stability. An artifact is that the obtained matrix has all row sums approximately equal, which appears too ideal in the real-world. So we multiply each row by a random factor 1 + |z i | with z i i.i.d ∼ N (0, 1). The true coefficient vector is generated by β * 0 = 1 and β * j i.i.d ∼ 0.5N (10, 1) + 0.5N (−10, 1) for 1 ≤ j ≤ p − 1. The results shown in Figure 5 are for N = 1,000 and p = 100, where we plot the algorithm progress (in terms of optimization error and estimation error) as the computational time or iteration number increases. (To better differentiate GIS and Q-IPS, we only show part of the plots.) Although IIS needs fewer iterations, GIS is found to be more efficient than IIS. This is due to the cost of solving p − 1 nonlinear equations at each iteration of IIS. We also see that Q-IPS outperforms GIS and IIS. Next, we conduct a larger experiment with p = 2,000, 4,000 and N = 20,000 , to compare Q-IPS and B-IPS with Newton. In generating the simulation data, we scale down the prototype design byX ←X/(20 X max ), and set β * j i.i.d ∼ 0.5N (10, 1) + 0.5N (−10, 1) for 1 ≤ j ≤ p − 1 and β * 0 = 10. Figure  6a) demonstrates two typical stages of Newton's method: the damped Newton phase and the quadratically convergent phase (see Boyd and Vandenberghe (2004)). Though converging super fast in the second stage, in 6b), this algorithm took too long to complete the first stage, thereby perhaps less useful in big data applications. In contrast, B-IPS and Q-IPS are able to deliver accurate estimates within the time limit, and B-IPS seems to have better scalability than Q-IPS. Finally, we turn to problems with features not restricted to be nonnegative and compare the performance of Newton, B-IPS and L-BFGS. We set β * 0 = 10 and β * j i.i.d ∼ 0.5N (10, 1) + 0.5N (−10, 1) for 1 ≤ j ≤ p − 1 and scale the prototype design matrix byX ←X/(100 X max ). Fixing N = 50,000, we vary p from 1,000 to 12,000. When p = 10,000 or 12,000, not all methods converge fast, and we set a time limit t max = 600. The relative gradient (relGrad) and the estimation error (estErr) are scaled by 1e+7 and 1e+4, respectively, when reported in Table 1. From the table, we notice that B-IPS is less precise than Newton in general, but having low computational complexity makes it suitable for largescale data applications where moderate accuracy usually suffices. Indeed, Newton's method, when feasible, gives the smallest estimation error, but it easily fails when p is large (say p ≥ 6000). B-IPS 1 (with Newton as the sub-problem solver and g k = 200) has better efficiency and scalability in computation-see the case when p = 10,000, in particular. The same conclusion can be drawn from the comparison between quasi-Newton and B-IPS.
In the experiments, L-BFGS ran out of memory when p > 10,000. B-IPS 2 , which takes L-BFGS as the sub-problem solver and g k = 2000, showed the best scalability. In summary, the benefits brought by BCD, reparametrization, and randomization are impressive in large-scale problems. We conducted even larger experiments (with p ≥ 20,000) to study the scalability of B-IPS; the reader may refer to the Appendix for more details.

ℓ 1 -IPS on real data
The dataset is collected from a Portuguese marketing campaign related to bank deposit subscription (Moro et al., 2014). We use 41,188 instances and 10 categorical variables to study whether a client subscribes a term deposit or not. The information of these variables is shown in Table 2. We group the data at each observed combination level of all categorical variables, and use the total number of successful subscriptions as the response variable.

Variables
Description job type of job (12 levels) marital marital status (4 levels) education education level (8 levels) default whether the client has credit in default (3 levels) housing whether the client has housing loan (3 levels) loan whether the client has personal loan (3 levels) contact contact communication type, cellular or telephone (2 levels) month the month in which the last contact was made (10 levels) day the day of week when the last contact was made (5 levels) poutcome outcome of the previous marketing campaign (3 levels) We consider a three-way association model, including all main effects, second-order interactions, and third-order interactions. This results in a large model with 5,874 predictors. We ran ℓ 1 -IPS to compute the solution path as shown in Figure 7 and used EBIC (Chen and Chen, 2008) for parameter tuning, where a sparse model with only 16 predictors was obtained. See Table 3 for the selected terms; we also labeled all main effects in Figure 7. (We ran the experiment on a two-way association model, too, and found that all the main terms and two-way interaction terms listed in Table 3   Some interesting and useful conclusions can be drawn from the variable selection result. First, month plays an important role in the marketing campaign, as supported by some previous studies (e.g., Moro et al. (2014)). In particular, clients are unlikely to subscribe a term deposit if the last contact occurs in November. Moreover, contact communication type being cellular and the outcome of previous marketing campaign being successful are favorable factors for successful subscription.
According to the table, all of the selected two-way interactions involve the variable poutcome = non-exist (no record in the previous campaign), which separates out this special group of clients. Our model also contains two three-way terms, and notably they show no collinearity with the other selected variables, thereby worthy of consideration. The positive coefficients of the two terms, (job = technician) * (education = professional) * (marital = married) and (poutcome = non-exist) * (education = univ) * (marital = single), indicate that the group of married technicians who received professional training and the group of singles with university-level degrees but no record in the previous marketing campaign tend to subscribe a term deposit.
It is worth mentioning thatβ is not unique if D KL ( n, x q, x ) = λ, in which caseβ can take either 0 or log( n, x / q, x ).

A.4 Large experiments
This part tests the performance of B-IPS on large tables and large designs with dimensionality p ≥ 20000. The experiments were performed on a machine with 2.9GHz CPU and 64GB RAM installed. We consider 6 examples. The first three come from contingency tables, consisting of 200, 225 and 250 binary categorical variables, respectively. All interactions (up to second order) are included, resulting in p = 20,101, p = 25,426, and p = 31,376, respectively. The observed entries are sampled from the table and the number N is fixed at 100,000. In generating the coefficients, we let the intercept take 5, and set all β * j to zero except the last 100 which are sampled from N (−1, 1). The error tolerance ǫ tol in the stopping criterion is 1e-5. The remaining three examples have large design matrices involving non-binary features, where N = 100,000 and p varies from 20,000 to 30,000. In these examples, β * 0 = 10 and β * j are i.i.d. following 0.5N (10, 1) + 0.5N (−10, 1), 1 ≤ j ≤ p − 1. The prototype design matrix is scaled,X ←X/(200 X max ), and ǫ tol = 1e-7. The block sizes in calling B-IPS are fixed at 5000 (or approximately so). The computational time and estimation error for are reported in the following table.