FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

Given multiple datasets of relational data that share a number of dimensions, how can we efficiently decompose our data into latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional side information, e.g., demographic data about the users in the Netflix example above. Incorporating this additional information leads to the coupled factorization problem. So far, it has been solved only for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled datasets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: our algorithm can perform matrix, tensor, and coupled factorization with flexible objective functions, including the Frobenius norm, the Frobenius norm with an $\ell_1$-induced sparsity penalty, and non-negative factorization. (2) Scalability: FlexiFaCT scales to unprecedented sizes in both the data and the model, with up to billions of parameters; FlexiFaCT runs on standard Hadoop. (3) Convergence proofs: we show that FlexiFaCT converges on this variety of objective functions, even with projections.


Introduction
How can we efficiently mine data that capture relations between different entities? Suppose, for instance, that we are given a time-evolving social network, such as Facebook, and we have information about who messages whom, or who becomes friends with whom, and when. These data may be formulated as a three-mode tensor. Suppose now that we also have some side information pertaining to the users, e.g., demographic information. This problem can be formulated as an instance of a so-called coupled factorization, where the two pieces of data, a three-mode (user, user, time) tensor and a (user, demographic) matrix, share a common dimension. Even without the presence of the (user, demographic) matrix, efficient tensor decomposition of truly large datasets can be challenging, and has been attracting increasing interest.
Most prior work has focused either on a specific type of factorization or on a specific loss function (e.g., the Frobenius norm), and thus has a limited range of potential applications. Here we propose FlexiFaCT, a flexible and highly scalable distributed factorization algorithm that attacks a very broad spectrum of problems: FlexiFaCT can handle matrices, tensors, and coupled tensor-matrix settings, crossed with a variety of loss functions, including the Frobenius norm, KL divergence, $\ell_1$ regularization, and non-negativity constraints.
Moreover, FlexiFaCT is very fast and scalable; we show how to implement it on Hadoop, and how to achieve high speed by distributing both the data and the parameters. In Table 1, we provide a comprehensive overview of the state of the art. In short, FlexiFaCT reigns, combining scalability with versatility.
In summary, our main contributions are:

1. Versatility: FlexiFaCT can operate under a wide spectrum of settings, including plain matrix factorization, tensor factorization, and coupled decompositions. Thus, FlexiFaCT includes several recent methods [9], [11] as special cases.
2. Scalability: FlexiFaCT scales very well both with the input size and with the number of model parameters.
3. Proof of convergence: We prove that FlexiFaCT converges, even with constraints like non-negativity. Moreover, we demonstrate this empirically.
4. Usability and Reproducibility: Our implementation runs on stock Hadoop, as opposed to other recent methods [9]. We also open-source our code.
[Table 1: qualitative comparison of FlexiFaCT against DSGD [9], PSGD [21], Matlab, and GigaTensor [11].]

Related Work

Matrix Factorization. The factorization of matrices is a well-studied problem [14]. Solutions range from alternating least squares (ALS) [20] to stochastic gradient descent (SGD) [7]. Gemulla et al. [9] made a breakthrough in scaling up MF: they proposed a distributed version of Stochastic Gradient Descent (SGD) for MF that elegantly exploits the factorization structure of the problem, breaking the matrix into blocks and processing them in parallel. Follow-up works [16, 19] have successfully extended this idea to the matrix completion problem using parallel stochastic updates. We build upon this insightful work, expanding on both the theory and the implementation.

Tensor Factorization. Tensors are multidimensional generalizations of matrices, widely applied in a variety of fields; see Kolda and Bader [12] for a comprehensive overview. The most popular tensor factorization is the so-called Canonical Polyadic (CP) or PARAFAC decomposition [10]. PARAFAC with $\ell_1$ constraints, which FlexiFaCT is able to handle, was introduced in [15]. For in-memory, single-machine computation, the state of the art is the Tensor Toolbox [6], which provides an implementation of the ALS algorithm. Besides ALS, there exist first-order optimization techniques, such as [4]. In [18], the authors propose algorithms for incremental computation of the tensor decomposition, where the tensor is a stream. For large datasets residing on HDFS (the Hadoop Distributed File System), the state of the art is GigaTensor [11], which solves the PARAFAC decomposition using Alternating Least Squares (ALS).

Coupled Factorizations. In this case, we have a tensor and a matrix, or two tensors, or two matrices, that share one dimension, as in the Facebook example of the introduction. Singh et al. [17] jointly factorize two related matrices for better decomposition and provide stochastic updates for this coupled MF optimization. For coupled matrix-tensor factorization, recent work [3] used first-order optimization techniques.

FlexiFaCT Approach
Table 2: Commonly used symbols.

θ : any parameter in the objective
σ : the parameter being updated
∇ : derivative (gradient) operator
η_t : step size at iteration t
R : rank of the decomposition
• : Khatri-Rao product

As mentioned previously, we take on the problem of matrix, tensor, and coupled factorization. In this section we explain the variety of loss functions used in these tasks, the Stochastic Gradient Descent (SGD) update rules, and our partitioning scheme, which allows the SGD work to be distributed. Although much of our description of the matrix factorization setting is similar to [9], we explain it here for completeness and clarity.
Before we begin, it is important to clarify our notation. We use capital boldface script characters, e.g., $\mathcal{X}$, to denote a tensor; capital boldface non-script characters, e.g., $\mathbf{Y}$, to denote a matrix; and lowercase boldface characters, e.g., $\mathbf{y}$, to denote a vector. $\mathcal{X}_{i,j,k}$ denotes the scalar in the $(i,j,k)$ position of the tensor $\mathcal{X}$, $\mathbf{Y}_{i,j}$ denotes the scalar in the $(i,j)$ position of matrix $\mathbf{Y}$, and $y_i$ denotes the scalar in the $i$th position of vector $\mathbf{y}$. We use $\mathbf{Y}_{i,*}$ to denote the vector of scalars $\mathbf{Y}_{i,j}$ for all $j$. Additionally, with a slight overloading of notation, for simplicity and because our matrices and tensors may have only a small percentage of observed values, we say that $(i,j,k) \in \mathcal{X}$ if $\mathcal{X}_{i,j,k}$ is observed. A list of our commonly used symbols is given in Table 2.
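For concreteness, here is a tiny Python sketch (our illustration, not the paper's code) of the coordinate format that this notation implies; the same (i, j, k, value) records are what our MapReduce implementation consumes later.

```python
# Observed entries of a sparse 3-mode tensor in coordinate (COO) form.
# Only observed cells are stored, so "(i, j, k) in X" is a membership
# test; unobserved cells are absent, not zero. Names here are ours.
observed = {
    (0, 2, 1): 4.0,   # X[0,2,1] = 4.0 is observed
    (3, 1, 0): 2.5,
}

assert (0, 2, 1) in observed        # an observed entry
assert (1, 1, 1) not in observed    # an unobserved entry
```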

Optimization Objectives
We begin by explaining how stochastic gradient descent works for our variety of objective functions. We briefly go over the objective functions for simpler cases, such as the Frobenius norm for matrix factorization, before expanding to more complex objectives.
Matrix Factorization. For matrix factorization we would like to approximate our $I \times J$ data matrix $\mathbf{X}$ by $\mathbf{U}\mathbf{V}^T$, where $\mathbf{U}$ is of size $I \times R$ and $\mathbf{V}$ is of size $J \times R$. Therefore, we can define a loss function using the Frobenius norm as follows:

$$L(\mathbf{U}, \mathbf{V}) = \sum_{(i,j) \in \mathbf{X}} L_{\mathbf{X}_{i,j}}(\mathbf{U}, \mathbf{V}), \qquad L_{\mathbf{X}_{i,j}}(\mathbf{U}, \mathbf{V}) = \Big(\mathbf{X}_{i,j} - \sum_{r=1}^{R} \mathbf{U}_{i,r} \mathbf{V}_{j,r}\Big)^2. \tag{3.1}$$

As seen above, we divide our loss function into its component pieces $L_{\mathbf{X}_{i,j}}$, one per observed point $\mathbf{X}_{i,j}$. This decomposition is necessary in order to use stochastic gradient descent.
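To make the pointwise decomposition concrete, the following minimal Python sketch (hypothetical code of our own, not the paper's Java implementation) evaluates $L_{\mathbf{X}_{i,j}}$ and applies one SGD step at a single observed entry:

```python
import numpy as np

def pointwise_loss(X_ij, U, V, i, j):
    """Squared error at one observed entry: (X_ij - sum_r U[i,r]*V[j,r])^2."""
    return (X_ij - U[i].dot(V[j])) ** 2

def sgd_step(X_ij, U, V, i, j, eta):
    """One stochastic gradient update on rows U[i,:] and V[j,:] in place."""
    err = X_ij - U[i].dot(V[j])   # residual at this entry
    grad_U = -2.0 * err * V[j]    # dL/dU[i,:], computed before any update
    grad_V = -2.0 * err * U[i]    # dL/dV[j,:]
    U[i] -= eta * grad_U
    V[j] -= eta * grad_V
```

Note that both gradients are computed from the same residual before either row is modified, so the step matches the pointwise gradient of (3.1).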
Tensor Factorization. For tensor factorization we would like to approximate our $I \times J \times K$ tensor $\mathcal{X}$ by the Khatri-Rao product $\sum_{r=1}^{R} \mathbf{U}_{*,r} \bullet \mathbf{V}_{*,r} \bullet \mathbf{W}_{*,r}$, where $\mathbf{U}$ is of size $I \times R$, $\mathbf{V}$ is of size $J \times R$, and $\mathbf{W}$ is of size $K \times R$. We can analyze the loss in a few different ways. Following the standard Frobenius norm, as is common in PARAFAC, the loss is

$$L(\mathbf{U}, \mathbf{V}, \mathbf{W}) = \sum_{(i,j,k) \in \mathcal{X}} \Big(\mathcal{X}_{i,j,k} - \sum_{r=1}^{R} \mathbf{U}_{i,r} \mathbf{V}_{j,r} \mathbf{W}_{k,r}\Big)^2. \tag{3.2}$$

Similarly, we can induce sparsity in our parameter space with an $\ell_1$ penalty, $\bar{L}(\mathbf{U}, \mathbf{V}, \mathbf{W}) = L(\mathbf{U}, \mathbf{V}, \mathbf{W}) + \lambda \big(\|\mathbf{U}\|_1 + \|\mathbf{V}\|_1 + \|\mathbf{W}\|_1\big)$, or add the constraint $\mathbf{U}, \mathbf{V}, \mathbf{W} \ge 0$, as is common in non-negative matrix factorization (NNMF). These terms are not as clearly separable in the loss function, but, as we will see, the update rules are still separable, as SGD requires. We make a distinction here between $L$ and $\bar{L}$: the objective $\bar{L}$ is obtained by adding the $\ell_1$ or non-negativity constraints to the loss $L$.
Coupled Matrix-Tensor Factorization In this case our data tensor X is approximated by R r=1 U * ,r • V * ,r • W * ,r and our data matrix Y is simultaneously approximated by UA T . Note here we use the same component U in both approximations. As such, our objective function is merely a sum of the losses on each data set: Table 3 denotes the loss objectives for different coupled cases. For each of these we use SGD to minimize our loss and thus approximate our data.
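Because the coupled objective is a plain sum, each observed entry of either dataset contributes its own stochastic gradient to the shared factor. The following Python sketch (our illustration, anticipating the update rules of the next subsection; function and variable names are assumptions) shows this for $\mathbf{U}$:

```python
import numpy as np

def coupled_sgd_step_U(U, V, W, A, point, eta):
    """Update a shared row of U from one observed entry of X or of Y.

    point is ('X', i, j, k, val) or ('Y', i, j, val). Each entry of either
    dataset contributes a stochastic gradient to U, since the coupled
    loss is L_X(U, V, W) + L_Y(U, A).
    """
    if point[0] == 'X':
        _, i, j, k, val = point
        err = val - np.sum(U[i] * V[j] * W[k])   # residual in the tensor
        U[i] += eta * 2.0 * err * (V[j] * W[k])  # descent step on U[i,:]
    else:
        _, i, j, val = point
        err = val - U[i].dot(A[j])               # residual in the matrix
        U[i] += eta * 2.0 * err * A[j]
```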

SGD Updates
For SGD we perform updates to our parameters $\mathbf{U}, \mathbf{V}, \mathbf{W}, \mathbf{A}$, which we collectively refer to as the matrix $\Theta$, whereas $\theta$ denotes an individual component of that matrix. This definition of $\Theta$ and $\theta$ will come in handy for parameter updates based on the gradient at individual data points. For example, the update at an observed entry of tensor $\mathcal{X}$ is

$$\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla L_{\mathcal{X}_{i,j,k}}(\theta^{(t)}). \tag{3.3}$$

For these update rules, the differential with respect to $\sigma = \mathbf{U}_{i,l}$ is

$$\frac{\partial L_{\mathcal{X}_{i,j,k}}}{\partial \mathbf{U}_{i,l}} = -2\Big(\mathcal{X}_{i,j,k} - \sum_{r=1}^{R} \mathbf{U}_{i,r}\mathbf{V}_{j,r}\mathbf{W}_{k,r}\Big)\mathbf{V}_{j,l}\mathbf{W}_{k,l}, \tag{3.4}$$

and similarly for $\sigma = \mathbf{V}_{j,l}$ or $\sigma = \mathbf{W}_{k,l}$. From this we observe that the SGD update for $\mathbf{U}_{i,l}$ at a particular entry $\mathcal{X}_{i,j,k}$ (for a tensor $\mathcal{X}$) depends only on the previous values $\mathbf{U}_{i,r}, \mathbf{V}_{j,r}, \mathbf{W}_{k,r}$, where $r \in \{1, \ldots, R\}$ and $R$ is the chosen rank. The updates for each component are similar in the coupled cases.

[Figure 1: Dividing the paired matrix and tensor into blocks such that no two blocks share any row or column, or any third-mode slice in the case of a tensor.]
In the case of additional components such as an $\ell_1$ penalty or a non-negativity constraint on our parameters, we add a projection to our update rule. For example, for an $\ell_1$ penalty, the update rule is

$$\theta^{(t+1)} = S_{\lambda \eta_t}\big(\theta^{(t)} - \eta_t \nabla L_{\mathcal{X}_{i,j,k}}(\theta^{(t)})\big),$$

where $S_\lambda$ is the soft-thresholding operator, $S_\lambda(x) = \mathrm{sign}(x)\max(|x| - \lambda, 0)$. We can similarly use the projection $\theta^{(t+1)} = \max\big(0,\; \theta^{(t)} - \eta_t \nabla L_{\mathcal{X}_{i,j,k}}(\theta^{(t)})\big)$ for the non-negativity constraint.
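A minimal Python sketch of these projected updates, assuming the standard proximal form in which the shrinkage threshold is scaled by the step size (an assumption on our part):

```python
import numpy as np

def soft_threshold(theta, lam):
    """S_lambda: shrink each coordinate toward zero by lam."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def project_nonnegative(theta):
    """Clamp negative entries to zero (non-negativity constraint)."""
    return np.maximum(theta, 0.0)

def projected_sgd_step(theta_row, grad, eta, lam=None, nonneg=False):
    """Gradient step followed by the appropriate projection(s)."""
    theta_row = theta_row - eta * grad
    if lam is not None:
        theta_row = soft_threshold(theta_row, eta * lam)  # l1 proximal step
    if nonneg:
        theta_row = project_nonnegative(theta_row)
    return theta_row
```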

Blocking for Parallelization
Given this understanding of our optimization objective and SGD update rules, we would like to segment our data in such a way that certain blocks $Z_b$ can be run in parallel, where $Z_b \subseteq \mathcal{X}$. Figure 1 is a pictorial representation of the way we segment our simple matrix or coupled tensor/matrix to enable parallelization. In order to run SGD on our blocks in parallel, we divide them such that no two blocks share common rows or columns. To be more precise, we say that a point $x \in Z_b$ is given by its coordinates in the data, e.g., $x = (x_i, x_j, x_k) \in \mathcal{X}$. Two blocks $Z_b$ and $Z_{b'}$ are non-overlapping if for all $x \in Z_b$ and $x' \in Z_{b'}$ we have $x_i \neq x'_i$, $x_j \neq x'_j$, and $x_k \neq x'_k$. (We will prove later that this allows us to run the blocks in parallel.) We see that in the division shown in Figure 1 no two blocks share common rows or columns. More interestingly, we note that in Figure 1(c) blocks in the tensor $\mathcal{X}$ and the matrix $\mathbf{Y}$ share coordinates in the $i$ dimension; as a result, data points in the same $i$ range must be in the same block across both datasets.
Given this intuition, we now describe our partition function in more detail. We call one set of independent blocks a stratum, and we denote the number of blocks in each stratum by $d$. In order to cover all regions of $\mathcal{X}$ we need multiple strata: for a matrix we require $d$ strata, and for a tensor we require $d^2$ strata. For a stratum $s$ we have blocks $Z_b^{(s)}$, $b = 1, \ldots, d$, chosen so that within a stratum no two blocks overlap in any coordinate range. In our algorithm we run the strata sequentially, but within each stratum we run SGD on the blocks in parallel. We consider running SGD on one stratum to be a subepoch of our algorithm, and running it on all strata an epoch. (Note that the order in which the strata are run does not matter, as long as each is run exactly once per epoch.) We repeat this, iteratively updating our parameters $\theta$, until the algorithm converges. A more formal description of the distributed stochastic gradient algorithm for a tensor (which generalizes easily to matrices and coupled factorizations) is given in Algorithm 1. We next prove that this procedure converges appropriately.
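To illustrate, here is one natural way to index the blocks and $d^2$ strata of a 3-mode tensor in Python (for a matrix, only $s_1$ is needed). This is an assumption-laden sketch of our own; the exact indexing used by Algorithm 1 may differ:

```python
def block_id(coord, dim_size, d):
    """Map a coordinate to one of d contiguous index ranges."""
    return coord * d // dim_size

def stratum_of_point(i, j, k, dims, d):
    """Return (reducer, stratum) for an observed entry of a 3-mode tensor.

    Block b of dimension i is processed by reducer b; in stratum (s1, s2)
    that reducer also owns j-block (b + s1) % d and k-block (b + s2) % d,
    so no two reducers in a stratum share any coordinate range, and the
    d^2 strata together cover all d^3 block combinations.
    """
    I, J, K = dims
    bi = block_id(i, I, d)
    bj = block_id(j, J, d)
    bk = block_id(k, K, d)
    s1 = (bj - bi) % d
    s2 = (bk - bi) % d
    return bi, (s1, s2)   # reducer index and stratum (subepoch) index
```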

Proof of convergence with projections
The FlexiFaCT approach is described in Algorithm 1. We first prove that two blocks in a stratum are interchangeable. We then use this to prove that the sequence of strata forms a regenerative process, defined later in this section. Finally, we use this to prove that our FlexiFaCT approach converges for the tensor and coupled cases.

[Algorithm 1: FlexiFaCT for tensor factorization]
Our generic constrained loss function for the tensor case is the projected objective

$$\bar{L}(\theta), \qquad \theta \in P, \tag{4.10}$$

where $\bar{L}$ is the constrained loss of equation 3.2. In the projected loss (4.10) the parameter always lies in a set $P$ determined by the $\ell_1$ and non-negativity constraints. The set $P$ is a hyperrectangle: there exist $a_i < b_i$, $i = 1, \ldots, r$, such that $P = \{\theta : a_i \le \theta_i \le b_i\}$. Here $\theta$ is a parameter to be updated, as defined in the previous section (equation 3.3). The gradient based on equation 4.10 is

$$\nabla \bar{L}(\theta) = \nabla L(\theta) + p(\theta),$$

where $\theta$ is defined in Table 2, $L$ and $\bar{L}$ are defined in equation 3.2, and the function $p(\cdot)$ is the projection or constraint term of the gradient. The set $C(\theta)$ is the union of the subgradients at $\theta$: when $\theta$ is in the interior of $P$, $C(\theta)$ contains only the zero element, and when $\theta \in \partial P$, the boundary of $P$, it contains the convex cone generated by the subgradients at $\theta$.
Definition 4.1. Two blocks $Z_b$ and $Z_{b'}$ in a given stratum are independent if for each $x \in Z_b$ and $x' \in Z_{b'}$ we have

$$\nabla L_x(\theta) = \nabla L_x\big(\theta - \eta \nabla L_{x'}(\theta)\big) \quad \text{and} \quad \nabla L_{x'}(\theta) = \nabla L_{x'}\big(\theta - \eta \nabla L_x(\theta)\big),$$

where $\nabla L_x(\theta)$ is the partial differential of $L_x$ with respect to $\theta$, and $\theta$ is the parameter we are updating.

Theorem 4.1. Any two blocks $Z_b$ and $Z_{b'}$ in a stratum, as constructed above, are independent.

Proof. For any two points $x \in Z_b$ and $x' \in Z_{b'}$, their rows, columns, and any other coordinates do not overlap. From equation 3.4 we see that $x$ does not modify $\theta$ at any position whose coordinates differ from $x$'s, i.e., rows $\mathbf{U}_{i,*}$ with $i \neq x_i$, $\mathbf{V}_{j,*}$ with $j \neq x_j$, or $\mathbf{W}_{k,*}$ with $k \neq x_k$. Therefore, because $x$ and $x'$ are not equal in any dimension, an update from $\nabla L_x$ will modify different values than $\nabla L_{x'}$, where $\nabla L_x$ is the gradient at point $x$. Additionally, from equation 3.4 we see that any update from $\nabla L_x$ only uses values from $\mathbf{U}_{x_i,*}$, $\mathbf{V}_{x_j,*}$, and $\mathbf{W}_{x_k,*}$, and thus does not use any values that would be updated by $\nabla L_{x'}$. Because updates from $x$ only affect parameters at $x$'s coordinates, updates from $x$ are only based on parameters at $x$'s coordinates, and $x$ and $x'$ have no overlapping coordinates, we conclude that

$$\nabla L_x(\theta) = \nabla L_x\big(\theta - \eta \nabla L_{x'}(\theta)\big) \quad \text{and} \quad \nabla L_{x'}(\theta) = \nabla L_{x'}\big(\theta - \eta \nabla L_x(\theta)\big).$$

Therefore, $Z_b$ and $Z_{b'}$ are independent.

Definition 4.2. A process $P(t), t \ge 0$ is regenerative if there exist time points $0 \le T_0 < T_1 < T_2 < \ldots$ such that the remainder of the process after $T_k$, $\{P(T_k + t) : t \ge 0\}$ for $k \ge 1$: (a) has the same distribution as the remainder of the process after $T_0$, and (b) is independent of the process prior to $T_k$, $\{P(t) : 0 \le t < T_k\}$.
In other words, a stochastic process with certain time points at which, from a probabilistic point of view, the process restarts itself is called a regenerative process. Intuitively, this means a regenerative process can be split into i.i.d. cycles [5].
Based on equation 4.10, the projected SGD updates can be written as

$$\theta^{(t+1)} = \Pi_P\big(\theta^{(t)} + \eta_t \nabla L^{s_i}_x(\theta^{(t)})\big), \tag{4.11}$$

where $\Pi_P(\cdot)$ projects the updated parameter back onto $P$, the gradient being taken with respect to the original loss $L(\mathbf{U}, \mathbf{V}, \mathbf{W})$. The projection step can be further broken up as

$$\theta^{(t+1)} = \theta^{(t)} + \eta_t \nabla L^{s_i}_x(\theta^{(t)}) + \eta_t p(\theta^{(t)}) \quad \text{(using 4.11)}$$
$$= \theta^{(t)} + \eta_t \nabla L_0(\theta^{(t)}) + \eta_t \delta M_t + \eta_t \beta_t + \eta_t p(\theta^{(t)}), \tag{4.14}$$

where $L^{s_i}_x(\theta^{(t)})$ is the loss function at stratum $s_i$ at a point $x$ in iteration $t$, given the parameter value $\theta^{(t)}$ from the previous iteration; $\nabla L_0(\theta^{(t)})$ is the exact gradient in iteration $t$ given the previous parameter value $\theta^{(t)}$; $\delta M_t$ is the noise, i.e., the difference between the stochastic and exact gradients; and $\beta_t$ is the "error" before projection, i.e., the amount by which the update falls outside $P$.
To prove the convergence of the method we require the following conditions, similar to those defined in [13, 9]:

Condition 3. The squared sum of the step sizes $\eta_t$ is bounded, i.e., $\sum_t \eta_t^2 < \infty$.

Condition 4. The noise is a martingale difference sequence: $E[\delta M_{t+1} \mid \mathcal{F}_t] = 0$, where $\mathcal{F}_t$ denotes the history of the process up to iteration $t$.

Condition 5. $E[\eta_t \beta_t] < \infty$ with probability 1. Note that this is a condition on the step size. It implicitly says that the projection must not wander off infinitely far outside the set $P$ over the iterations.
Theorem 4.2. The distributed SGD algorithm for tensor decomposition with projections, as presented in Algorithm 1, converges.
Proof. The primary equation being updated in each iteration is equation 4.14. From Theorem 4.1 we see that the individual blocks in a given stratum are independent of each other's updates and are therefore interchangeable. We can also observe from Algorithm 1 that every one of the $d$ strata is picked exactly once in one cycle, i.e., one epoch (one iteration of the outer while loop). Moreover, two different cycles over the strata, i.e., two iterations of the while loop, are identically distributed and independent. In other words, the while loop forms i.i.d. cycles, and thus a regenerative process. The time period of the cycles is finite and bounded, and consequently so is that of the regenerative process. Moreover, given conditions 1 to 5 as defined above, the corresponding bound of [13] holds for any arbitrary $\kappa$. The proof is similar to [13] and is valid because the noise is a martingale difference sequence and $\eta_i \delta M_i$ and $\eta_i \beta_i$ are equicontinuous sequences ([13], Theorem 2.1, part 1, chapter 5; [9] follows a similar proof up to this point). We can now use this to analyze the updates with a projected loss. We find that equation 4.14 has the same set of stable points as the mean limit

$$\dot{\theta} = \nabla L_0(\theta) + p(\theta), \qquad p(\theta) \in -C(\theta). \tag{4.18}$$

Now we show that equation 4.18 converges. Through a few algebraic manipulations it can be verified that the projection functions $p(\cdot)$ we use, the $\ell_1$ soft-thresholding and the non-negativity projection, are Lipschitz continuous. Following the arguments of [13] (Theorem 2.1, part 2) and with the assumption that the updates $\theta^{(t)}$ are bounded (which follows from conditions 1 to 5 assumed earlier), equation 4.18 converges to a set of stationary points.

MapReduce Implementation of FlexiFaCT
Along with our theoretical analysis, we implemented our algorithm within the MapReduce framework [8]. Specifically, we used the open-source Hadoop [1] implementation of MapReduce. The challenge is to turn the factorization problem into the map() and reduce() functions that Hadoop is designed to handle.
In our implementation we pass the data matrix or tensor as input to the mappers in the form $(i, j, k, \mathcal{X}_{i,j,k})$. We also store our current parameters $\theta^{(s)}$, which may include $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$, and $\mathbf{A}$, on the Hadoop Distributed File System (HDFS).
FlexiFaCT Mapper: Our Mapper function is shown in Algorithm 2. It splits the data into the appropriate blocks and determines the order in which they should be processed within each reducer. We also override the default Hadoop partitioner, which typically partitions on unique key values, so that it partitions only on $b_i$; as a result, each reducer represents a unique set of $i \in I$, i.e., a unique set of rows of $\mathbf{U}$. We additionally override the default Comparator, allowing us to sort the (Key, Value) pairs within each reducer by the subepoch term calculated in the Mapper. While the Mapper itself is quite simple, the calculation of the blocks and of the processing order within each reducer captures the partition function that allows us to perform SGD in this distributed fashion.
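In spirit, the mapper computes the block and subepoch for each record and emits them as keys. Below is a hedged Hadoop-streaming-style sketch in Python; the actual implementation is native Java MapReduce, and the values of D and d as well as the linearization of the (s1, s2) stratum index are our assumptions:

```python
import sys

D = 1_000_000   # dimension size (assumed known to the mapper)
d = 24          # number of row blocks, i.e., reducers

def mapper(stdin=sys.stdin):
    for line in stdin:
        i, j, k, val = line.split()
        i, j, k = int(i), int(j), int(k)
        bi = i * d // D            # row block: the partitioner key
        bj = j * d // D
        bk = k * d // D
        # Subepoch in which reducer bi touches this point: reducer bi
        # owns j-block (bi + s1) % d and k-block (bi + s2) % d in
        # stratum (s1, s2), so we recover (s1, s2) and linearize it
        # into the sort key used by the overridden Comparator.
        s1 = (bj - bi) % d
        s2 = (bk - bi) % d
        subepoch = s1 * d + s2
        print(f"{bi}\t{subepoch}\t{i}\t{j}\t{k}\t{val}")
```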

FlexiFaCT Reducer: Our Reducer function is shown in Algorithm 3. As explained above, each reducer receives all points for a given range of $i$ values, ordered by subepoch. (As before, we use $s$ to denote the subepoch.) The reducer iterates over its points in order, each time updating $\theta^{(s)}$. Each reducer only stores the components of $\theta$ that correspond to its current block in the current stratum. As such, when a new subepoch is reached, it must write its updated $\theta$ values to disk (for another reducer to retrieve) and read the most current $\theta$ values for its next block in the subsequent subepoch.
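A sketch of the reducer loop, with hypothetical helpers load_theta/save_theta standing in for the HDFS reads and writes described above, and sgd_update for the projected SGD step sketched earlier:

```python
import sys

def reducer(sgd_update, load_theta, save_theta, stdin=sys.stdin):
    """Process this reducer's points subepoch by subepoch (sketch only).

    load_theta/save_theta exchange the factor blocks for a subepoch via
    HDFS; sgd_update applies one (projected) SGD step to those blocks.
    All three are assumed helpers, not functions from the paper's code.
    """
    current, theta = None, None
    for line in stdin:
        bi, subepoch, i, j, k, val = line.rstrip("\n").split("\t")
        if subepoch != current:                  # a new subepoch begins
            if theta is not None:
                save_theta(current, theta)       # publish updates to HDFS
            theta = load_theta(subepoch)         # fetch the freshest blocks
            current = subepoch
        theta = sgd_update(theta, int(i), int(j), int(k), float(val))
    if theta is not None:
        save_theta(current, theta)               # flush the final subepoch
```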
We run the MapReduce jobs iteratively. Each MapReduce job constitutes one epoch, using all points in $\mathcal{X}$ to update the full parameter space $\theta$, which is then saved to HDFS. We then use the updated parameters $\theta$ in the subsequent epoch (another run of the MapReduce job). We repeat this for a fixed number of epochs or until the algorithm converges.
Reproducibility and Usability: The above is a high-level overview of our implementation, but it captures the general method we use both to distribute the work and to optimize speed within the MapReduce framework. While this is not the typical way Hadoop is programmed, it requires no modification of the Hadoop framework and can run on any standard Hadoop cluster. Our code is open-source and available at http://alexbeutel.com/l/flexifact; it can run for all of the data types and loss functions described in this paper.
Experiments

Performance Evaluation

In order to assess how scalable and fast FlexiFaCT is, we conducted a series of experiments measuring the running time of FlexiFaCT with respect to: 1) an increasing number of data points, 2) increasing dimensions of the data, and thus of the model, and 3) increasing rank of the factorization. The first aspect concerns scalability in terms of data size, whereas the two latter aspects concern scalability with respect to the size of the parameter space; as we demonstrate in the following experiments, FlexiFaCT scales easily in all three aspects. As a baseline for tensor decomposition, we use GigaTensor [11]. We also compared against PSGD [21]; however, the solutions obtained achieved much worse RMSE, and the algorithm was not able to scale to a very large number of parameters (in either rank or dimensions).
FlexiFaCT was implemented in Java with Hadoop 0.20.1 [8, 1]. We ran the experiments on the OCC-Y cluster. For the sake of experimentation, we created a series of synthetic datasets wherein we were able to control the three aspects we were testing: data size, data dimensions, and rank. Additionally, we validate our method's correctness by presenting experimental evidence (on top of our theoretical guarantees) of monotone convergence. In all cases, the number of reducers was constant and equal to 24.

Synthetic Data Generation. To generate data, we first generate random matrix factors $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$ of the specified dimension $D$ (where $I = J = K = D$) and rank $R = 30$. We then randomly select data points $(i, j, k)$ and add their value $\mathbf{U}_{i,*} \bullet \mathbf{V}_{j,*} \bullet \mathbf{W}_{k,*}$ to the dataset. We do this until we have the desired number of data points for each dataset. Unless otherwise stated, we set $D = 1$ million and the number of data points to 10 million.
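A Python sketch of this generation process follows; the parameter names are ours, and the smaller defaults keep the sketch cheap to run (the paper's experiments use D = 1 million and 10 million points):

```python
import numpy as np

def make_synthetic(D=10_000, R=30, n_points=1_000_000, seed=0):
    """Yield observed entries (i, j, k, value) of a random rank-R tensor.

    I = J = K = D, as in the paper's synthetic setup.
    """
    rng = np.random.default_rng(seed)
    U, V, W = (rng.random((D, R)) for _ in range(3))
    for _ in range(n_points):
        i, j, k = rng.integers(0, D, size=3)
        # Entry value is sum_r U[i,r] * V[j,r] * W[k,r], the (i,j,k)
        # entry of the rank-R model.
        yield i, j, k, float(np.sum(U[i] * V[j] * W[k]))
```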

Scalability
We now test FlexiFaCT on all three types of scalability, to demonstrate that it scales in all dimensions and to unprecedented sizes.

[Figure 2: Running-time scalability of FlexiFaCT. We observe that FlexiFaCT scales very well with respect to all aspects. PSGD can be seen in sub-figure (a) before it runs out of memory. FlexiFaCT was applied both to a tensor and to a matrix-tensor couple, whereas GigaTensor was only applied to a tensor.]

Rank Scalability. In testing the scalability with respect to rank, we ran FlexiFaCT, GigaTensor, and PSGD on a tensor, varying the rank from 25 up to 1000. Figure 2(a) shows the running time for FlexiFaCT, both for tensor and coupled factorizations, as the rank (i.e., one of the parameter dimensions) increases. We can see that FlexiFaCT scales linearly with the rank of the factorization, with similar timing behaviour for plain tensor and coupled factorizations. GigaTensor, on the other hand, slows down significantly for large ranks, because it inverts an $R \times R$ matrix locally at every iteration; we were unable to test GigaTensor on ranks larger than 100 due to this slowdown. Although barely visible in the plot, we also ran PSGD on the data. It too could not scale in rank, due to the size of the parameter space: for rank above 50 the factor matrices could no longer fit in memory on each machine, and thus PSGD could not run. This demonstrates FlexiFaCT's unique scalability in the parameter size. Additionally, we note that with $R = 1000$ and $D = 1$ million, the coupled FlexiFaCT factorization scales to a total parameter space of 4 billion parameters.

Data Dimension Scalability. To test more directly the scalability as the dimension $D$ of the data tensor grows, we created a variety of tensors with dimension varying from 10,000 to 10 million, and decomposed each tensor with $R = 50$. When testing the coupled FlexiFaCT decomposition, we add a coupled matrix with 100,000 data points and the same dimensionality as the main tensor.
In Figure 2(b) we show how coupled factorization using FlexiFaCT scales as the dimensions of the data increase. We observe that FlexiFaCT runs much faster than the baseline, GigaTensor. A likely explanation for the magnitude of the difference is that FlexiFaCT only operates on the observed data points, whereas GigaTensor has to convert unobserved data points to zeros, thus slowing down the computation. We can see that FlexiFaCT behaves very smoothly, scaling linearly with the dimension size. Again, when performing the coupled decomposition, as the dimension scales our total parameter space reaches 2 billion parameters.
Data Size Scalability. Last, for data scalability, we vary the number of observed data points from 1 million to 10 million. Figure 2(c) shows FlexiFaCT's running time as a function of the data size, i.e., the number of observations. We can see that FlexiFaCT again behaves very smoothly, scaling linearly with the number of observed elements. Again, we are significantly faster than GigaTensor; the magnitude of the difference is likely because GigaTensor must fill in unobserved points with zeros in order to run.

[Figure 3: Convergence of FlexiFaCT compared to PSGD [21]. Note that Zinkevich et al. [21] do not claim to handle this problem, because it is not convex.]

Correctness & Monotone Convergence
Besides speed, we experimentally validate that FlexiFaCT indeed monotonically decreases the objective function it is minimizing. To test this, we run on a small synthetic dataset with $D = 10{,}000$ and 10 million data points, making the dataset very dense. We run a factorization with $R = 50$ using our implementation of PSGD and using FlexiFaCT with both $\ell_1$ sparsity and non-negativity constraints, and we monitor the root mean squared error (RMSE). Figure 3 shows that FlexiFaCT decreases the RMSE as expected, and at a much quicker pace than PSGD. It is important to note that the slow convergence of PSGD is expected: the problem (tensor factorization) is not convex, and Zinkevich et al. do not claim that their method works on such problems. However, we use PSGD as a comparison because it is not possible to track the RMSE with GigaTensor, and thus PSGD is the closest competitor.

Conclusion
In this work we have introduced FlexiFaCT, a highly flexible, efficient, and scalable factorization tool. Our main contributions are:

1. Versatility: FlexiFaCT can operate under numerous factorization scenarios, including matrix, tensor, and coupled factorization, as well as with non-negativity and sparsity constraints.
2. Scalability: FlexiFaCT scales very well with the input size, as well as with the number of parameters.
3. Proof of convergence: We provide theoretical guarantees for the convergence of FlexiFaCT, even in the presence of constraints.
4. Reproducibility and Usability: Our implementation works on stock Hadoop; furthermore, we ensure the reproducibility and outreach of FlexiFaCT by making our implementation publicly available.