Large-scale collaborative prediction using a nonparametric random effects model

A nonparametric model is introduced that allows multiple related regression tasks to take inputs from a common data space. Traditional transfer learning models can be inappropriate when the dependence among the outputs cannot be fully resolved by known input-specific and task-specific predictors. The proposed model treats such output responses as conditionally independent given known predictors and appropriate unobserved random effects. The model is nonparametric in the sense that the dimensionality of the random effects is not specified a priori but is instead determined from data. An approach to estimating the model is presented that uses an EM algorithm, which is efficient on a very large-scale collaborative prediction problem. The obtained prediction accuracy is competitive with state-of-the-art results.


Introduction
In transfer learning and multi-task learning, the observed responses of interest can be seen as relational measurements under a pair of heterogeneous conditions. If there are M tasks and N shared observations, the standard regression model forms M separate estimates, where µ is a bias parameter and z j ∈ Z is the vector of input predictors for observation j.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

A more flexible
approach is to model all of the data jointly, as where now m ij = m(x i , z j ) is a regression function based on task-specific predictors x i ∈ X . In order to directly model the dependency between tasks, a multi-task Gaussian process approach (Yu et al., 2007; Bonilla et al., 2008) may assume m ∼ GP(0, Ω ⊗ Σ), where Σ(z j , z j′ ; θ Σ ) ⪰ 0 is a covariance function among inputs and Ω(x i , x i′ ; θ Ω ) ⪰ 0 a covariance function among tasks, which means Cov(m ij , m i′j′ ) = Ω ii′ Σ jj′ . The parameters θ Σ and θ Ω can be estimated, for example, by maximizing the marginalized likelihood.
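To make the Kronecker structure concrete, the following NumPy sketch (with small synthetic covariances) checks the entrywise identity Cov(m ij , m i′j′ ) = Ω ii′ Σ jj′ and illustrates why naive inference scales as O(M 3 N 3 ): the joint covariance is an MN × MN matrix that must be factorized.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4  # tiny example: 3 tasks, 4 inputs

# Synthetic positive-definite covariances among tasks (Omega) and inputs (Sigma).
A = rng.standard_normal((M, M)); Omega = A @ A.T + M * np.eye(M)
B = rng.standard_normal((N, N)); Sigma = B @ B.T + N * np.eye(N)

# m ~ GP(0, Omega ⊗ Sigma): the MN-vector vec(m) has covariance kron(Omega, Sigma).
K = np.kron(Omega, Sigma)

# Entrywise covariance: Cov(m_ij, m_i'j') = Omega[i, i'] * Sigma[j, j'].
i, j, i2, j2 = 0, 1, 2, 3
assert np.isclose(K[i * N + j, i2 * N + j2], Omega[i, i2] * Sigma[j, j2])

# Exact inference requires factorizing the MN x MN matrix K -- feasible only
# for small M and N, hence the O(M^3 N^3) complexity discussed below.
L = np.linalg.cholesky(K)
```

The Cholesky factorization above is the step that becomes prohibitive on collaborative-filtering scales such as Netflix.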
The model is nonparametric in the sense that m lives in an infinite-dimensional function space; it can be more appropriate for capturing complex dependencies than a fixed-dimension model. But this flexibility comes at a computational price. If Y ∈ R M ×N is the random response matrix, including those elements Y ij that are observed and those that need to be predicted, the computational complexity of inference is O(M 3 N 3 ). Inference is thus infeasible if M and/or N are large. For example, in our experiments on the Netflix dataset, M = 480,189 and N = 17,770.
Just as significantly, from a modeling perspective this model may not be suitable when the responses Y ij cannot be fully explained by the predictors x i and z j alone. In such a case the conditional independence assumption implied by Eq. (1) is invalid. This is often the case for relational or networked observations. For instance, in the application of predicting users' unobserved ratings on movies, the observed ratings themselves are often more informative than the users' demographic attributes and the movies' genre attributes.
Motivated by these considerations, in this paper we propose a transfer learning approach that satisfies three desiderata. The model (1) is nonparametric with sufficient flexibility, (2) is appropriate for dependent responses that are not solely conditioned on the given predictors, and (3) supports efficient learning on very large-scale data.

A Random Effects Model
In statistics, a random effects model is a kind of hierarchical linear model that allows for dependent observations occurring in repeated or multilevel structures. The scenario of transfer learning matches such a form: the population is grouped according to tasks, and each group repeatedly generates random observations. A natural extension of the previous model is to assume the Y ij are conditionally independent given m ij ≡ m(x i , z j ) and additional additive random effects f ij , where the distribution of m and f determines the dependence among the variables Y ij . One may wish to choose a form of f that is appropriate for explaining the data dependence, for example, by introducing latent variables u i and v j such that f ij has a parametric form f (u i , v j ; θ f ), as suggested in (Hoff, 2005).
A nonparametric random effects model may assume f ∼ GP(0, ∆ ⊗ Υ), where the covariances ∆ and Υ are parameters to be determined from data. While this model is natural and flexible, it is subject to the computational limitations discussed above. We address these limitations by imposing additional structure and developing an efficient estimation procedure.

Our Model
In the following we introduce a refinement of the above model that has specific structure but exhibits some attractive conceptual and computational properties. We again take a hierarchical linear model, where τ > 0, Ω 0 (x i , x i′ ) is a given covariance function based on task-specific predictors, and f i denotes the function values of the random effects on task i. Following a hierarchical Bayesian framework, we let the covariance function Σ be a random variable sampled from a hyper-prior distribution Σ ∼ IWP(κ, Σ 0 + λδ), which defines an inverse-Wishart process, where λ > 0, κ is a degree-of-freedom parameter, δ is a Dirac delta kernel function, and Σ 0 (z j , z j′ ) is a given covariance function based on input data. The inverse-Wishart process (IWP) is a nonparametric prior for random covariance functions, based on a nonstandard notation of the inverse-Wishart distribution (Dawid, 1981); a brief introduction is given in the Appendix.
The deviation of Y from the population mean µ is decomposed into two parts, the random effects m and f . Let Ω 0 (x i , x i′ ) = ⟨φ(x i ), φ(x i′ )⟩, where φ is an implicit feature mapping; then m can be parameterized in terms of latent Gaussian random variables {β j } whose covariance satisfies E(⟨β j , β j′ ⟩) = Σ j,j′ . Because of the coupling between m i and x i , m i is non-exchangeable given Σ and the predictors x i , while f i is exchangeable under the same condition. The separation of these two kinds of random effects enables computational advantages. For instance, one can let the non-exchangeable part be parametric while the exchangeable part is nonparametric; we develop this strategy in Sec. 3.
We note that independent noise ε ij is no longer separately required in this model; it can be absorbed into the random effects f . Since the covariance Σ is a random variable, it is fully capable of modeling the total variance of the random effects and the independent noise. Indeed, this treatment can considerably improve computational efficiency, as we shall see in Sec. 3.4.
Since the population mean µ can be reliably estimated by pooling all the observations together and computing their average, hereafter, for simplicity of presentation, we assume the Y ij to be centered, so that µ does not appear in the model.

An Equivalent Model
Note that the above row-wise generative model may seem arbitrary, because there is no particular reason to prefer it over the alternative column-wise sampling process, where similarly f j denotes the function values of the random effects on case j, across tasks. It turns out that the two models are equivalent: in both cases, if we compute the marginal distribution of Y , we obtain exactly the same closed-form distribution, where MTP denotes a matrix-variate Student-t process. That is, any finite subset of matrix values Y ∈ R M ×N follows a matrix-variate Student-t distribution (Gupta & Nagar, 1999). A sketch of the proof of the equivalence is given in the Appendix.

Discussion
The model is nonparametric at two levels. First, the covariance functions Σ and Ω are nonparametric: they both live in infinite-dimensional function spaces, and they do not have explicit parametric forms. Second, the dimensionality of any finite Y is not specified a priori, but instead increases with the observation size.
The equivalence between the two generative processes reveals an appealing symmetry in the model. One can either view the model as adapting the free-form covariance function Σ to the data, or, equivalently, as adapting the covariance Ω to the data. The model is therefore somewhat similar to those that deal with both Σ and Ω simultaneously, such as the multi-task Gaussian process discussed in Sec. 1. However, as we shall see in Sec. 3, the model potentially allows us to avoid the prohibitive O(M 3 N 3 ) computational cost without sacrificing significant flexibility.
The model can easily incorporate group-specific effects, as do most classical random-effects models. For example, one can expand the feature representation by adding a constant term √c, such that Ω 0 (x i , x i′ ) = ⟨φ(x i ), φ(x i′ )⟩ + c. Then m can be parameterized with an additional term γ j , the j-th element of γ ∼ GP(0, cΣ), i.e., the column-specific random effects. Similarly, with Σ 0 (z j , z j′ ) = ⟨ϕ(z j ), ϕ(z j′ )⟩ + b, one can handle row-specific random effects as well.

Large-scale Inference and Learning
Given the model and some observed elements, denoted by Y O , we can make predictions by computing the conditional expectation of the unseen elements. To this end, we need to compute the posterior distribution p(Y|Y O , θ), where θ includes Σ 0 , Ω 0 , λ, τ and κ.
It is highly challenging to apply nonparametric models to large-scale problems. We will present several algorithms, each with an analysis of its computational complexity in the context of the Netflix collaborative prediction problem, which contains M = 480,189 users and more than 100 million ratings on N = 17,770 movies. Our estimates of computing time are based on a 2.66GHz Intel PC with 3.0GB memory.

Modeling Large-scale Data
Assuming M ≫ N , as in the Netflix case, it is computationally more convenient to work with the row-wise model. Since the computation is always on a finite matrix Y ∈ R M ×N , including those elements Y ij that are observed and those that need to be predicted, and since the model is consistent under marginalization, in the following we state the model on finite matrices, where IW denotes a finite inverse-Wishart distribution, Σ and Σ 0 are both N × N covariance matrices, Ω 0 is M × M , m i is the i-th row of m, and Y i is the i-th row of Y.
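As a concrete illustration, the finite row-wise generative process can be sketched with SciPy's inverse-Wishart sampler. This is a toy sketch under stated assumptions: all sizes and inputs are synthetic, and the degree-of-freedom parameterization of SciPy's `invwishart` need not match the paper's IWP notation exactly.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
M, N, p = 50, 8, 3           # toy sizes with M > N > p, as in the row-wise model
lam, tau, df = 1.0, 1.0, 12  # df stands in for the degree-of-freedom parameter kappa

# Assumed inputs: task features x (M x p) and an input-side base covariance Sigma_0.
x = rng.standard_normal((M, p))
B0 = rng.standard_normal((N, N))
Sigma0 = B0 @ B0.T + N * np.eye(N)

# Sigma ~ IW(df, Sigma0 + lam*I): one random N x N covariance shared by all rows.
Sigma = invwishart.rvs(df=df, scale=Sigma0 + lam * np.eye(N), random_state=0)

# beta (p x N): latent Gaussian coefficients; here each row is drawn N(0, Sigma / tau).
Lb = np.linalg.cholesky(Sigma / tau)
beta = rng.standard_normal((p, N)) @ Lb.T

# Row-wise sampling: m_i = x_i beta, f_i ~ N(0, Sigma) i.i.d., Y_i = m_i + f_i.
Lf = np.linalg.cholesky(Sigma)
Y = x @ beta + rng.standard_normal((M, N)) @ Lf.T
```

Note how only N × N matrices are ever factorized, never an MN × MN one.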
To further facilitate a large-scale implementation, we assume Ω 0 has finite dimensions, i.e., Ω 0 (x i , x i′ ) = φ(x i ) ⊤ φ(x i′ ), with φ : X → R p . For large-scale data, it is reasonable to assume M ≫ N ≫ p. The finite-dimension assumption is mainly due to computational concerns on large-scale problems; the model itself does not require it. However, the sampled latent covariance functions Σ and Ω are always infinite-dimensional, and so is the function Y . Therefore the overall model is still nonparametric.
Without loss of generality, we let Ω 0 (x i , x i′ ) = ⟨p 1/2 x i , p 1/2 x i′ ⟩. In this case, the random effects m follow a form where β is a p × N random matrix and β k is its k-th row. Then the model becomes

Approximate Inference -Gibbs Sampling
Based on the joint distribution p(Y, β, Σ|θ) described by the model, the prediction can be approximated by a Markov chain Monte Carlo method, where the samples Y (t) , together with β (t) and Σ (t) , are drawn i.i.d. from p(Y, β, Σ|Y O , θ). For Gibbs sampling, whose details are omitted here due to space limitations, the computational cost of one Gibbs iteration is roughly O(2M N 2 ), counting mainly multiplications and ignoring the impact of the factor p, since M ≫ N ≫ p. Compared with O(M 3 N 3 ) for the direct Gaussian process, this is a dramatically reduced complexity. However, since M is large, the cost O(2M N 2 ) is still prohibitive: each Gibbs iteration would take hundreds of hours on the Netflix data.

Approximate Inference -EM
We first introduce some necessary notation. Let J i ⊂ {1, . . . , N } be the index set of the N i observed elements in the row Y i , and let Σ [:,Ji] ∈ R N ×Ni be the matrix obtained by keeping the columns of Σ indexed by J i . Then let Σ [Ji,Ji] ∈ R Ni×Ni be obtained from Σ [:,Ji] by further keeping only the rows indexed by J i . The restrictions of other vectors and matrices to J i are defined similarly. Alternatively, the conditional expectation can be approximated by plugging in the maximum a posteriori estimates of β and Σ. When the number of observations is sufficiently large that the posterior distribution is concentrated around its mode, this approximation can be tight. The expectation-maximization (EM) algorithm is a popular approach to finding the mode, alternating the following two steps:
• E-step: compute the posterior distribution of Y given the current β and Σ, along with its sufficient statistics;
• M-step: optimize β and Σ, and then let β ← β, Σ ← Σ.
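The per-row E-step is standard Gaussian conditioning on the observed index set J i . A minimal sketch (the helper name `e_step_row` and the toy data are ours, not the paper's):

```python
import numpy as np

def e_step_row(Sigma, m_i, y_obs, J_i):
    """Posterior mean and covariance of a full row Y_i given its observed entries.

    Standard Gaussian conditioning with S = Sigma_[J_i, J_i]:
      mean = m_i + Sigma_[:, J_i] S^{-1} (y_obs - m_i[J_i])
      cov  = Sigma - Sigma_[:, J_i] S^{-1} Sigma_[J_i, :]
    """
    S = Sigma[np.ix_(J_i, J_i)]                          # Sigma_[J_i, J_i]
    cross = Sigma[:, J_i]                                # Sigma_[:, J_i]
    v_i = m_i + cross @ np.linalg.solve(S, y_obs - m_i[J_i])
    C_i = Sigma - cross @ np.linalg.solve(S, cross.T)
    return v_i, C_i

# Toy check: conditioning reproduces observed entries with zero posterior variance.
rng = np.random.default_rng(1)
N = 5
B = rng.standard_normal((N, N)); Sigma = B @ B.T + N * np.eye(N)
m_i = np.zeros(N)
J_i = np.array([0, 2])
y_obs = np.array([1.0, -0.5])
v_i, C_i = e_step_row(Sigma, m_i, y_obs, J_i)
assert np.allclose(v_i[J_i], y_obs)
assert np.allclose(C_i[np.ix_(J_i, J_i)], 0.0)
```

Only the N i × N i block S must be inverted per row, which is where the O(N 3 i ) term discussed later comes from.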
It turns out that β can be nicely decoupled from Σ and has a closed form. Plugging this β back into the optimization cost yields a reduced objective in Σ alone. Further detail on the M-step can be found in the Appendix.
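As a hedged illustration, the decoupled update for β plausibly takes the ridge-regression form suggested by the pre-computed statistics x ⊤ x, (x ⊤ x + τ I p ) −1 , and υ ⊤ x used in Sec. 3.4; the exact M-step formula is given in the paper's Appendix. A sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, p, tau = 100, 8, 3, 1.0
x = rng.standard_normal((M, p))
v = rng.standard_normal((M, N))  # stands in for the E-step posterior means of Y

# Hypothetical closed-form M-step update for beta (p x N), in ridge form:
#   beta = (x^T x + tau I_p)^{-1} x^T v
beta_hat = np.linalg.solve(x.T @ x + tau * np.eye(p), x.T @ v)
```

The factor (x ⊤ x + τ I p ) −1 is p × p, so this update is cheap when p is small.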
The overall cost of a single EM iteration seems to be even higher than that of Gibbs sampling, since N 2 is multiplied by the large factor (M + ∑ i N i ), and N is a large multiplicative factor of ∑ i N 2 i . We estimate that, in a single EM step, excluding the O(∑ i N 3 i ) term, the computation would take several thousand hours for the Netflix problem.

Efficient EM Implementation
In order to introduce a faster algorithm, we now fix some notation. Let U i ∈ R N ×Ni be a column selection operator whose (s, t) element is one if the index s equals the t-th element of J i , and zero otherwise, so that Σ [:,Ji] = ΣU i and Σ [Ji,Ji] = U i ⊤ ΣU i . Furthermore, U i Σ [Ji,Ji] U i ⊤ restores an N × N sparse matrix in which the elements of Σ [Ji,Ji] are placed back at their original positions in Σ.
The major cost comes from the E-step, which computes the sufficient statistics υ i and C i for every i. Our key observation is that the M-step only needs C = ∑ i C i + υ ⊤ υ and υ ⊤ x from the previous E-step; to obtain them, it is in fact unnecessary to compute the individual υ i and C i at all. For example, by using Σ [:,Ji] = ΣU i , the required sums can be rewritten so that every multiplication with U i is realized by memory access only, which avoids the expensive per-row dense computations; υ ⊤ x is obtained in a similar way. Before running the EM iterations, we pre-compute x ⊤ x and (x ⊤ x + τ I p ) −1 , since both are used repeatedly in the algorithm. The E-step is then implemented in terms of these pre-computed quantities, and the statistics required by the M-step are accumulated on the fly. With this new implementation, the major cost of a single EM step is greatly reduced.
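The column selection operator U i need never be materialized: multiplying by it amounts to array indexing. A small NumPy sketch (with a synthetic Σ) verifying the identities Σ [:,Ji] = ΣU i , Σ [Ji,Ji] = U i ⊤ ΣU i , and the scatter operation U i Σ [Ji,Ji] U i ⊤ :

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
B = rng.standard_normal((N, N)); Sigma = B @ B.T + N * np.eye(N)
J_i = np.array([1, 3, 4])
N_i = len(J_i)

# Explicit selection operator U_i (N x N_i): U_i[s, t] = 1 iff s == J_i[t].
U_i = np.zeros((N, N_i))
U_i[J_i, np.arange(N_i)] = 1.0

# Multiplication by U_i is just column/row selection -- memory access, not arithmetic.
assert np.allclose(Sigma @ U_i, Sigma[:, J_i])
assert np.allclose(U_i.T @ Sigma @ U_i, Sigma[np.ix_(J_i, J_i)])

# U_i Sigma_[Ji,Ji] U_i^T scatters the small block back into an N x N matrix.
scattered = np.zeros((N, N))
scattered[np.ix_(J_i, J_i)] = Sigma[np.ix_(J_i, J_i)]
assert np.allclose(U_i @ Sigma[np.ix_(J_i, J_i)] @ U_i.T, scattered)
```

In an actual implementation one would use the indexing expressions on the right-hand sides and never form U i at all.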

Related Work
There is a large body of research on multi-task learning using Gaussian processes, including methods that learn the covariance Σ shared across tasks (Lawrence & Platt, 2004; Schwaighofer et al., 2005; Yu et al., 2005), and methods that additionally consider the covariance Ω between tasks (Yu et al., 2007; Bonilla et al., 2008). The methods that only use Σ have been applied to collaborative prediction (Schwaighofer et al., 2005; Yu et al., 2006). However, due to their computational cost, they were both evaluated on very small data sets, where M and N are only in the hundreds.
Our work differs from the above models in two respects: First, a new nonparametric model is proposed that explicitly combines known predictors and random effects for multi-task prediction; Second, considerations in both model design and algorithm engineering are combined to scale the nonparametric model to very large data sets.
Another class of related work is the so-called semiparametric models, where the observations Y ij are linearly generated from a finite number of basis functions randomly sampled from a Gaussian process prior (Teh et al., 2005). Our approach is more closely related to recent work (Zhu et al., 2009), where Y ij is modeled by two sets of multiplicative factors, one sampled from GP(0, Σ) and the other from GP(0, Ω); that is, Y becomes a low-rank random function.
Low-rank matrix factorization is perhaps the most popular, state-of-the-art approach to collaborative filtering, e.g., (Salakhutdinov & Mnih, 2008a). Our model generalizes such methods in the sense that it models an infinite-dimensional relational function. A simplification of our work leads to a nonparametric principal component analysis (Yu et al., 2009). A non-probabilistic nonparametric method, maximum-margin matrix factorization, was introduced in (Srebro et al., 2005). Very few matrix factorization methods use known predictors; one such work is (Hoff, 2005), which introduces low-rank multiplicative random effects, in addition to known predictors, to model networked observations.

EachMovie Data
We conducted experiments on the entire EachMovie data set, which includes 74,424 users' 2,811,718 numeric ratings Y ij ∈ {1, 2, 3, 4, 5, 6} on 1,648 movies. The data are very sparse: 97.17% of the elements are missing. We randomly selected 80% of each user's observed ratings for training and used the remaining 20% for testing. This random partition was repeated 10 times independently. We calculated the RMSE (root mean square error) for each partition and averaged the 10 results to compute the mean and the standard error.
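The evaluation protocol above (a per-user 80/20 split and RMSE) can be sketched as follows; the function names are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)

def split_user_ratings(item_ids, train_frac=0.8, rng=rng):
    """Per-user random split of observed ratings into train/test index sets."""
    perm = rng.permutation(len(item_ids))
    n_train = int(round(train_frac * len(item_ids)))
    return item_ids[perm[:n_train]], item_ids[perm[n_train:]]

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy usage: split one user's 10 observed items 80/20.
items = np.arange(10)
train, test = split_user_ratings(items)
assert len(train) == 8 and len(test) == 2
```

Repeating the split 10 times and averaging the per-partition RMSE gives the reported means and standard errors.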
In our experiments, for each evaluated method, we used one training/testing partition for model selection, including determining hyperparameters and dimensionality. The selected model was then used for the other partitions. For the low-rank matrix factorization methods, we chose the dimensionality from D = 50, 100, 200. We found that larger dimensionality generally yielded better performance, but accuracy tended to saturate once the dimensionality exceeded 100.
We used 1% of the training ratings to set the stopping criterion: for each method, we terminated the iterations once the RMSE on the holdout set was observed to increase, and then included the holdout set to run one more iteration of training. We recorded the total run time of each method. In the experiments, we implemented the following methods:
• User Mean and Movie Mean: baseline methods that predict ratings by computing the empirical mean of the same user's observed ratings, or of other users' ratings on the same movie.
• NREM: Nonparametric random effects model, the work of this paper, which models Y as an infinite-dimensional random function. NREM-1 does not use additional attributes, i.e., we let Σ 0 and Ω 0 be a constant c, as suggested in Sec. 2.3.
For both BSRM-2 and NREM-2, we obtain the user and movie attributes from the top 20 eigenvectors of the binary matrix that indicates whether or not ratings are observed in the training set. Note that there can be different ways to obtain useful attributes; this is not the focus of this paper.
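One plausible way to extract such attributes is a truncated SVD of the binary observation-indicator matrix; the sketch below uses a synthetic sparse matrix and SciPy's `svds` (the paper does not specify the exact decomposition routine used):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A toy binary "rating observed" indicator matrix R (users x movies); in the
# experiments the real training indicator matrix would be used instead.
R = sparse_random(200, 50, density=0.05, format='csr', random_state=3)
R.data[:] = 1.0  # make all stored entries 1, i.e., a binary indicator

k = 20  # number of leading singular vectors, as in the EachMovie setup
U, s, Vt = svds(R, k=k)

user_attrs = U       # one 20-dimensional attribute vector per user
movie_attrs = Vt.T   # one 20-dimensional attribute vector per movie
```

The singular vectors then serve as the x i and z j predictors fed into Ω 0 and Σ 0 .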
All the results are summarized in Tab. 1 and Tab. 2. Our models achieved lower prediction errors than the other methods. This result is perhaps not entirely surprising, because nonparametric models are generally more flexible in explaining complex data than parametric models. On the other hand, the training of our models is even faster, which is a significant result.

Netflix Problem
The data contain 100,480,507 ratings from 480,189 users on 17,770 movies; in this case Y ij ∈ {1, 2, 3, 4, 5}. In addition, Netflix.com provides a validation set of 1,408,395 ratings. In total, 98.81% of the elements of the rating matrix are missing. For evaluation, a set of 2,817,131 ratings is withheld; one must submit predictions to Netflix.com to obtain the RMSE on this test set. We therefore compare our results directly with those reported in the literature, since they were all evaluated on the same test data.
The results are shown in Tab. 3, where Cinematch is the baseline provided by Netflix. Besides the methods introduced in Sec. 5.1, several other methods appear:
• SVD: a method almost the same as FMMMF, using a gradient-based method for optimization (Kurucz et al., 2007).
Note that the running time was sometimes not reported in the papers. For BPMF, we give a rough estimate by assuming it ran for 300 iterations (each taking 220 minutes, as reported in the paper). As in the EachMovie experiments, BSRM-2 and NREM-2 used the top 40 eigenvectors of the binary indicator matrix as additional attributes. As shown in Tab. 3, compared with the matrix factorization methods, NREM-1 likewise used no additional attributes but generated more accurate predictions. Furthermore, the NREMs are highly competitive with the semi-parametric method BSRM. In terms of training efficiency, our nonparametric model turns out to be even faster than the compared parametric and semi-parametric models.
We note that the top performers in the Netflix competition reported better results by combining heterogeneous models; for example, the 2007 progress award winner combined predictions from about one hundred different models (Bell et al., 2007). However, our focus here is not on developing ensemble methods.

Conclusion
In this paper a nonparametric model for multi-task learning is introduced. The contributions are twofold: First, the model provides a novel way to use known attributes to explain the complex dependence of data; Second, an algorithm is proposed to speed up the model on very large-scale problems. This is a dramatic efficiency improvement; for each EM iteration on Netflix, the original thousands of hours are reduced to several hours. The remaining major cost is O(∑ i N 3 i ), from inverting the submatrices Σ [Ji,Ji] . Occasionally a row may have an unusually large number of observations, so we truncate N i by randomly selecting 2000 observations if N i > 2000. In the end, a single EM step takes about 5 hours, and the whole optimization typically converges in about 30 steps. Our experiments demonstrate that the algorithm works very well on two challenging collaborative prediction problems. In future work, it will be promising to perform full Bayesian inference using a parallel Gibbs sampling method.

Table 1. Prediction Error on EachMovie Data

Table 2. Computation Time on EachMovie Data

Table 3. Performance on Netflix Data