A Group-Specific Recommender System

ABSTRACT In recent years, there has been a growing demand to develop efficient recommender systems which track users' preferences and recommend potential items of interest to users. In this article, we propose a group-specific method that uses dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the "cold-start" problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the numbers of ratings from each user and other variables associated with missing patterns. In addition, since this type of data involves large-scale customer records, traditional algorithms are not computationally scalable. To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and large memory storage, and therefore makes scalable computing feasible. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches. Supplementary materials for this article are available online.


Introduction
Recommender systems have drawn great attention since they can be applied to many areas, such as movie reviews, restaurant and hotel selection, financial services, and even identifying gene therapies. Therefore, there is great demand to develop efficient recommender systems which track users' preferences and recommend potential items of interest to users.
However, developing competitive recommender systems brings new challenges, as information from both users and items can grow exponentially, and the corresponding utility matrix representing users' preferences over items is sparse and high-dimensional. Standard methods and algorithms which are not scalable in practice may suffer from rapid deterioration in recommendation accuracy as the volume of data increases.
In addition, it is important to incorporate dynamic features of data instead of one-time usage only, as data could stream in over time and grow exponentially. For example, in the MovieLens 10M data, 96% of the most recent ratings are either from new users or on new items which did not exist before. This implies that information collected at an early time may not be representative of future users and items. This phenomenon is also called the "cold-start" problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. Another important feature of this type of data is that the missing mechanism is likely nonignorable; that is, missingness is associated with unobserved responses. For instance, items with fewer and lower rating scores are less likely to attract other users. Existing recommender systems typically assume ratings are missing completely at random, which may lead to estimation bias.
Content-based filtering and collaborative filtering are two of the most prevalent approaches for recommender systems. Content-based filtering methods (e.g., Lang 1995; Mooney and Roy 2000; Blanco-Fernandez et al. 2008) recommend items by comparing the content of the items with a user's profile, which has the advantage that new items can be recommended upon release. However, domain knowledge is often required to establish a transparent profile for each user (Lops et al. 2011), which entails pre-processing tasks to formulate information vectors for items (Pazzani and Billsus 2007). In addition, content-based filtering suffers from the "cold-start" problem as well when a new user is recruited (Adomavicius and Tuzhilin 2005).
For collaborative filtering, the key idea is to borrow information from similar users to predict their future actions. One significant advantage is that domain knowledge for the items is not required. Popular collaborative filtering approaches include, but are not limited to, singular value decomposition (SVD; Funk 2006; Mazumder et al. 2010), restricted Boltzmann machines (RBM; Salakhutdinov et al. 2007), and nearest neighbor methods (kNN; Bell and Koren 2007). It is well known that an ensemble of these methods can further enhance prediction accuracy (see Cacheda et al. 2011 and Feuerverger et al. 2012 for extensive reviews). However, most existing collaborative filtering approaches do not effectively solve the "cold-start" problem, although various attempts have been made. For example, Park et al. (2006) suggested adding artificial users or items with pre-defined characteristics, while Goldberg et al. (2001), Melville et al. (2002), and Nguyen et al. (2007) considered imputing "pseudo" ratings. Most recently, hybrid systems incorporating content-based auxiliary information have been proposed (e.g., Agarwal and Chen 2009; Nguyen and Zhu 2013; Zhu et al. 2016). Nevertheless, the "cold-start" problem imposes great challenges and has not been effectively solved.
In this article, we propose a group-specific singular value decomposition method that generalizes the SVD model by incorporating between-subject dependency and information about missingness. Specifically, we cluster users or items based on their missingness-related characteristics. We assume that individuals within the same cluster are correlated, while individuals from different clusters are independent. The cluster correlation is incorporated through mixed-effects modeling, assuming that users or items from the same cluster share the same group effects, along with latent factor modeling using singular value decomposition.
The proposed method makes two significant contributions. First, it effectively solves the "cold-start" problem by incorporating group effects. Most collaborative filtering methods rely on subject-specific parameters to predict users' and items' future ratings; however, for a new user or item, the training samples provide no information to estimate such parameters. In contrast, we are able to incorporate additional group information for new users and items to achieve higher prediction accuracy. Second, our clustering strategy takes nonignorable missingness into consideration. In the MovieLens data, we notice that individuals' rating behaviors are highly associated with their missing patterns: movies with higher average rating scores attract more viewers, while frequent viewers tend to be more critical and give low ratings. We cluster individuals into groups based on their nonrandom missingness, which allows us to capture individuals' latent characteristics that are not used in other approaches.
To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and large memory storage, and makes scalable computing feasible in practice. Notice that the proposed algorithm guarantees convergence to a stationary point which is a local minimum along each block direction, while our theoretical results provide general properties of the global minimum. In general, this distinction occurs in nonconvex optimization problems (e.g., Zhou et al. 2013; Zhu et al. 2016), since nonconvex optimization is NP-hard (Ge et al. 2015). Our numerical studies indicate that the proposed method is effective in terms of prediction accuracy. For example, for the MovieLens 1M and 10M data, the proposed method improves prediction accuracy significantly compared to existing competitive recommender system approaches (e.g., Agarwal and Chen 2009; Koren et al. 2009; Mazumder et al. 2010; Zhu et al. 2016).
This article is organized as follows. Section 2 provides the background of the singular value decomposition model. Section 3 presents the proposed method, a new algorithm, and its implementation.
Section 4 establishes the theoretical foundation of the proposed method. In Section 5, we illustrate the performance and robustness of the proposed method through simulation studies. MovieLens 1M and 10M data are analyzed in Section 6. Section 7 provides concluding remarks and discussion.

Background
We provide the background of the singular value decomposition method (Funk 2006) as follows. Let $R = (r_{ui})_{n \times m}$ be the utility matrix, where $n$ is the number of users, $m$ is the number of items, and each $r_{ui}$ is an explicit rating from user $u$ for item $i$ ($u = 1, \ldots, n$; $i = 1, \ldots, m$). The SVD method decomposes the utility matrix $R$ as
$$R \approx P Q^\top,$$
where $R$ is assumed to be low-rank, $P = (p_1, \ldots, p_n)^\top$ is an $n \times K$ user preference matrix, $Q = (q_1, \ldots, q_m)^\top$ is an $m \times K$ item preference matrix, and $K$ is the pre-specified upper bound of the number of latent factors, which corresponds to the rank of $R$.
Here, $q_i$ and $p_u$ are $K$-dimensional latent factors associated with item $i$ and user $u$, respectively, which explain variability in $R$.
The predicted value of $r_{ui}$ given by the SVD method is $\hat r_{ui} = \hat p_u^\top \hat q_i$, where $\hat q_i$ and $\hat p_u$ are estimated iteratively by
$$\hat q_i = \arg\min_{q_i} \sum_{u \in U_i} (r_{ui} - \hat p_u^\top q_i)^2 + \lambda \|q_i\|_2^2, \qquad \hat p_u = \arg\min_{p_u} \sum_{i \in I_u} (r_{ui} - p_u^\top \hat q_i)^2 + \lambda \|p_u\|_2^2.$$
Here, $U_i$ denotes the set of all users who rate item $i$, and $I_u$ is the set of all items rated by user $u$. Different penalty functions can be applied. For example, Zhu et al. (2016) suggested $L_0$ and $L_1$ penalties to achieve sparsity of $P$ and $Q$. In addition, some SVD methods (e.g., Koren 2010; Mazumder et al. 2010; Nguyen and Zhu 2013) are implemented on residuals after a baseline fit, such as linear regression or ANOVA, rather than on the raw ratings $r_{ui}$ directly.
The SVD method can be carried out through several algorithms, for example, alternating least squares (ALS; Carroll and Chang 1970; Harshman 1970; Koren et al. 2009), gradient descent approaches (Wu 2007), and one-feature-at-a-time ALS (Funk 2006).
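To make the ALS updates concrete, here is a minimal Python sketch of the regularized SVD fit. The data layout (a dict of observed ratings) and all names are hypothetical; this is an illustration of the technique, not the authors' implementation:

```python
import numpy as np

def als_svd(R_obs, n, m, K=3, lam=1.0, n_iters=20, seed=0):
    """Sketch of regularized SVD fit by alternating least squares.

    R_obs: dict mapping (u, i) -> observed rating r_ui.
    Each update is a closed-form K-dimensional ridge regression.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n, K))   # user latent factors p_u
    Q = 0.1 * rng.standard_normal((m, K))   # item latent factors q_i
    by_user = {u: [] for u in range(n)}     # I_u: items rated by user u
    by_item = {i: [] for i in range(m)}     # U_i: users who rated item i
    for (u, i), r in R_obs.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    eye = lam * np.eye(K)
    for _ in range(n_iters):
        # update each q_i given current P (ridge regression over U_i)
        for i, pairs in by_item.items():
            if not pairs:
                continue
            A = np.array([P[u] for u, _ in pairs])
            r = np.array([r for _, r in pairs])
            Q[i] = np.linalg.solve(A.T @ A + eye, A.T @ r)
        # update each p_u given current Q (ridge regression over I_u)
        for u, pairs in by_user.items():
            if not pairs:
                continue
            A = np.array([Q[i] for i, _ in pairs])
            r = np.array([r for _, r in pairs])
            P[u] = np.linalg.solve(A.T @ A + eye, A.T @ r)
    return P, Q
```

Because each update only solves a $K$-dimensional ridge regression over one user's or one item's ratings, the per-iteration cost stays modest even when $R$ is very sparse.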

Model Framework
The general framework of the proposed method is constructed as follows. Suppose $x_{ui}$ is a covariate vector corresponding to user $u$ and item $i$. In the rest of this article, we consider $r_{ui} - x_{ui}^\top \hat\beta$ as the new response, where $\hat\beta$ is the linear regression coefficient from fitting $r_{ui}$ on $x_{ui}$. To simplify notation, we still use $r_{ui}$ to denote this residual. In case covariate information is not available, we apply an ANOVA-type model where the grand mean, the user main effects, and the item main effects are subtracted, and replace $r_{ui}$ by its residual.
Let $\theta_{ui} = E(r_{ui})$. We generalize the SVD model and formulate each $\theta_{ui}$ as
$$\theta_{ui} = (p_u + s_{v_u})^\top (q_i + t_{j_i}), \qquad (1)$$
where $s_{v_u}$ and $t_{j_i}$ are $K$-dimensional group effects that are identical across members from the same cluster. We denote users from the $v$th cluster as $V_v$ ($v = 1, \ldots, N$) and items from the $j$th cluster as $J_j$ ($j = 1, \ldots, M$), where $\sum_{v=1}^{N} |V_v| = n$, $\sum_{j=1}^{M} |J_j| = m$, $|\cdot|$ is the cardinality of a set, and $N$ and $M$ are the total numbers of clusters for users and items, respectively. Details about selecting $N$ and $M$ are provided in Section 3.3.
In matrix form, we use $S = (s_1, \ldots, s_N)^\top$ and $T = (t_1, \ldots, t_M)^\top$ to denote the user and item group-effect matrices, respectively. However, the dimensions of $S$ and $T$ are $N \times K$ and $M \times K$, which are not compatible with the dimensions of $P$ and $Q$, respectively. Therefore, we alternatively define $S_c = (s_1 \mathbf{1}_{|V_1|}^\top, \ldots, s_N \mathbf{1}_{|V_N|}^\top)^\top$ and $T_c = (t_1 \mathbf{1}_{|J_1|}^\top, \ldots, t_M \mathbf{1}_{|J_M|}^\top)^\top$, corresponding to group effects from users and items, where $\mathbf{1}_k$ is a $k$-dimensional vector of 1's, and the subscript "c" in $S_c$ and $T_c$ denotes the "complete" forms of the matrices. Let $\Theta = (\theta_{ui})_{n \times m}$; then we have
$$\Theta = (P + S_c)(Q + T_c)^\top,$$
and if there are no group effects, $\Theta$ degenerates to $\Theta = P Q^\top$, which is the same as the SVD model.
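The expansion of the group-effect matrices into their "complete" forms can be sketched as follows; this minimal illustration assumes integer cluster labels, and the names are hypothetical:

```python
import numpy as np

def group_theta(P, Q, S, T, user_cluster, item_cluster):
    """Sketch: Theta = (P + S_c)(Q + T_c)^T, where S_c and T_c repeat each
    cluster's group-effect row for every member of that cluster.

    user_cluster[u] = v gives user u's cluster label; item_cluster[i] = j.
    """
    S_c = S[user_cluster]            # n x K: row v_u of S for each user u
    T_c = T[item_cluster]            # m x K: row j_i of T for each item i
    return (P + S_c) @ (Q + T_c).T   # n x m matrix of theta_ui

# With no group effects (S = T = 0), Theta degenerates to P @ Q.T.
```

The fancy-indexing step `S[user_cluster]` is exactly the $S \mapsto S_c$ expansion: members of the same cluster share one group-effect row.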
Here, the users or items can be clustered based on similar characteristics. For example, we can use missingness-related information such as the number of ratings from each user and each item. Users or items within the same cluster are correlated with each other through the group effects $s_{v_u}$ or $t_{j_i}$, while observations from different clusters are assumed to be independent. In Sections 3, 4, and 5.1, we assume $N$ and $M$ are known, and that members in each cluster are correctly labeled.
Remark 1. For easy operation, one could use users' and items' covariate information for clustering. In fact, (1) is still a generalization of the SVD method even if $N = M = 1$, because $s_{v_u}^\top t_{j_i}$, $p_u^\top t_{j_i}$, and $s_{v_u}^\top q_i$ correspond to the grand mean, the user main effects, and the item main effects, analogous to the ANOVA-type SVD model. Note that covariate information might not be collected from new users and new items. However, missingness-related information is typically available for clustering, and therefore $s_{v_u}$ and $t_{j_i}$ can be used for new users and new items. This is crucial for solving the "cold-start" problem.

Parameter Estimation
In this subsection, we illustrate how to estimate the model parameters from training data. In addition, we develop a new algorithm that embeds back-fitting (Breiman and Friedman 1985) into alternating least squares. This enables us to circumvent large-scale matrix operations through a two-step iteration, and hence significantly improves computational speed and scalability.
Let $\gamma$ be a vectorization of $(P, Q, S, T)$, let $\Omega$ be the set of user-item pairs associated with observed ratings, and let $R_o = \{r_{ui} : (u, i) \in \Omega\}$ be the set of observed ratings. We define the loss function as
$$L(\gamma \mid R_o) = \sum_{(u,i) \in \Omega} (r_{ui} - \theta_{ui})^2 + \lambda \|\gamma\|_2^2, \qquad (2)$$
where $\theta_{ui}$ is given by (1) and $\lambda$ is a tuning parameter. We estimate $\gamma$ via $\hat\gamma = \arg\min_{\gamma} L(\gamma \mid R_o)$. Then, the predicted value of $\theta_{ui}$ is obtained by $\hat\theta_{ui} = (\hat p_u + \hat s_{v_u})^\top (\hat q_i + \hat t_{j_i})$.

The estimation procedure consists of updating $(\hat p_u + \hat s_{v_u})$ and $(\hat q_i + \hat t_{j_i})$ iteratively. Following the strategy of alternating least squares, the latent factors and the group effects associated with item cluster $j$ are estimated by
$$\big(\{\hat q_i\}_{i \in J_j}, \hat t_j\big) = \arg\min_{\{q_i\}, t_j} \sum_{i \in J_j} \sum_{u \in U_i} \big\{r_{ui} - (\hat p_u + \hat s_{v_u})^\top (q_i + t_j)\big\}^2 + \lambda \Big( \sum_{i \in J_j} \|q_i\|_2^2 + \|t_j\|_2^2 \Big). \qquad (3)$$
Similarly, we estimate the latent factors and group effects associated with user cluster $v$:
$$\big(\{\hat p_u\}_{u \in V_v}, \hat s_v\big) = \arg\min_{\{p_u\}, s_v} \sum_{u \in V_v} \sum_{i \in I_u} \big\{r_{ui} - (p_u + s_v)^\top (\hat q_i + \hat t_{j_i})\big\}^2 + \lambda \Big( \sum_{u \in V_v} \|p_u\|_2^2 + \|s_v\|_2^2 \Big). \qquad (4)$$
However, directly solving (3) and (4) by alternating least squares encounters large matrices; in the MovieLens 10M data, it could involve matrices with more than 100,000 rows. We therefore develop a new algorithm which embeds back-fitting into alternating least squares and minimizes each of (3) and (4) iteratively. Specifically, for each item cluster $J_j$ ($j = 1, \ldots, M$), we fix $P$ and $S$, and minimize (3) by estimating $\hat q_i$ and $\hat t_j$ iteratively:
$$\hat q_i = \arg\min_{q_i} \sum_{u \in U_i} \big\{r_{ui} - (\hat p_u + \hat s_{v_u})^\top (q_i + \hat t_j)\big\}^2 + \lambda \|q_i\|_2^2, \quad i \in J_j, \qquad (5)$$
$$\hat t_j = \arg\min_{t_j} \sum_{i \in J_j} \sum_{u \in U_i} \big\{r_{ui} - (\hat p_u + \hat s_{v_u})^\top (\hat q_i + t_j)\big\}^2 + \lambda \|t_j\|_2^2. \qquad (6)$$
For each user cluster $V_v$ ($v = 1, \ldots, N$), we fix $Q$ and $T$, and minimize (4) by estimating $\hat p_u$ and $\hat s_v$ iteratively:
$$\hat p_u = \arg\min_{p_u} \sum_{i \in I_u} \big\{r_{ui} - (p_u + \hat s_v)^\top (\hat q_i + \hat t_{j_i})\big\}^2 + \lambda \|p_u\|_2^2, \quad u \in V_v, \qquad (7)$$
$$\hat s_v = \arg\min_{s_v} \sum_{u \in V_v} \sum_{i \in I_u} \big\{r_{ui} - (\hat p_u + s_v)^\top (\hat q_i + \hat t_{j_i})\big\}^2 + \lambda \|s_v\|_2^2. \qquad (8)$$
The above back-fitting is an iterative algorithm for additive models, whereas alternating least squares is an iterative algorithm for multiplicative models. Although both are blockwise coordinate descent methods in general, their convergence properties differ under our framework: the objective function for the back-fitting algorithm is convex, while the objective function for alternating least squares is nonconvex. Ansley and Kohn (1994) showed that for penalized least-squares problems, the back-fitting algorithm converges to the unique optimal solution from any initial values, while the alternating least-squares algorithm for two blocks only converges to a stationary point (Chen et al. 2012).
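A minimal sketch of one back-fitting pass for a single item cluster, with $P$ and $S$ held fixed as in (5) and (6), might look like the following. The data layout and all names are hypothetical; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def update_item_cluster(ratings, P, S, user_cluster, Q, t_j, items,
                        lam=1.0, n_backfit=5):
    """Sketch of back-fitting within one item cluster J_j, P and S fixed.

    ratings: dict (u, i) -> r_ui; items: list of item ids in cluster J_j.
    Alternates ridge updates of each q_i and of the shared group effect t_j.
    """
    K = Q.shape[1]
    eye = lam * np.eye(K)
    # precompute design rows a_u = p_u + s_{v_u} and responses for each item
    A = {i: np.array([P[u] + S[user_cluster[u]]
                      for (u, ii) in ratings if ii == i]) for i in items}
    r = {i: np.array([ratings[(u, ii)]
                      for (u, ii) in ratings if ii == i]) for i in items}
    for _ in range(n_backfit):
        # step analogous to (5): update q_i for each item, t_j held fixed
        for i in items:
            y = r[i] - A[i] @ t_j
            Q[i] = np.linalg.solve(A[i].T @ A[i] + eye, A[i].T @ y)
        # step analogous to (6): update shared t_j, all q_i held fixed
        G = sum(A[i].T @ A[i] for i in items) + eye
        b = sum(A[i].T @ (r[i] - A[i] @ Q[i]) for i in items)
        t_j = np.linalg.solve(G, b)
    return Q, t_j
```

Note that no matrix larger than $K \times K$ is ever factorized, which mirrors the memory argument made for Algorithm 1: only the ratings of one item (or one cluster) are touched at a time.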
In addition, the proposed algorithm also differs from the blockwise coordinate descent algorithm, which estimates each of $(P, Q, S, T)$ sequentially and iteratively while keeping the other terms constant. The convergence property of the proposed algorithm is illustrated in Section 3.2; the blockwise coordinate descent algorithm does not share this property.

Algorithm
In this section, we provide the detailed algorithm as follows.

Algorithm 1.
Step 1. (Initialization) Input initial values $(P^{(0)}, Q^{(0)}, S^{(0)}, T^{(0)})$ and the tuning parameter $\lambda$.
Step 2. For each item cluster $J_j$ ($j = 1, \ldots, M$): (i) fix $P$ and $S$ at their current values; (ii) update $\hat q_i$ for each $i \in J_j$ via (5); (iii) update $\hat t_j$ via (6); iterate (ii) and (iii) until convergence.
Step 3. For each user cluster $V_v$ ($v = 1, \ldots, N$): (i) fix $Q$ and $T$ at their current values; (ii) update $\hat p_u$ for each $u \in V_v$ via (7); (iii) update $\hat s_v$ via (8); iterate (ii) and (iii) until convergence.
Step 4. Repeat Steps 2 and 3 until the loss function (2) converges.
Note that alternating least squares is performed by conducting Steps 2 and 3 iteratively, while the back-fitting algorithm is carried out within Steps 2 and 3. Parallel computing can be implemented in Steps 2(ii), (iii) and 3(ii), (iii).
Algorithm 1 does not require large computational or storage costs. We denote by $I_{B1}$, $I_{B2}$, and $I_{ALS}$ the numbers of iterations for back-fitting in Steps 2 and 3 and for the ALS, respectively, and by $C_{Ridge}$ the computational complexity of solving a ridge regression with $K$ variables and $\max\{|V_1|, \ldots, |V_N|, |J_1|, \ldots, |J_M|\}$ observations. Then, the computational complexity of Algorithm 1 is no greater than $\{(m + M) I_{B1} + (n + N) I_{B2}\} C_{Ridge} I_{ALS}$. Since both ridge regression and the Lasso have the same computational complexity as ordinary least squares (Efron et al. 2004), the computational cost of the proposed method is indeed no greater than that of Zhu et al. (2016). For the storage cost, Algorithm 1 requires storing only item-specific or user-specific information to solve (5) or (7), and the sizes of the item and user information do not exceed $\max\{|J_1|, \ldots, |J_M|\}$ and $\max\{|V_1|, \ldots, |V_N|\}$ to solve (6) or (8), respectively.
We also establish the convergence property of Algorithm 1 as follows. Let $\gamma^* = \mathrm{vec}(P^*, Q^*, S^*, T^*)$ be a stationary point of $L(\gamma \mid R_o)$ corresponding to two blocks. That is,
$$\mathrm{vec}(P^*, S^*) = \arg\min_{P, S} L\big(\mathrm{vec}(P, Q^*, S, T^*) \mid R_o\big) \quad \text{and} \quad \mathrm{vec}(Q^*, T^*) = \arg\min_{Q, T} L\big(\mathrm{vec}(P^*, Q, S^*, T) \mid R_o\big).$$
The following lemma shows the convergence of Algorithm 1 to a stationary point, which is a local minimum along each block direction. One way to approximate the global minimum is to adopt the branch-and-bound technique and search all possible local minima (Liu et al. 2005). However, this technique could be computationally intensive.

Implementation
In this subsection, we address some implementation issues for the proposed method. To select the tuning parameter $\lambda$, we search over grid points and choose the value minimizing the root mean square error (RMSE) on the validation set. The RMSE on a given set $\Omega_0$ is defined as
$$\mathrm{RMSE}(\Omega_0) = \Big\{ \frac{1}{|\Omega_0|} \sum_{(u,i) \in \Omega_0} (r_{ui} - \hat\theta_{ui})^2 \Big\}^{1/2}.$$
In selecting the number of latent factors $K$, we choose $K$ such that it is sufficiently large and leads to stable estimation. In general, $K$ needs to be larger than the rank of the utility matrix $R$, but not so large as to intensify the computation. Regarding the selection of the numbers of clusters $N$ and $M$, Corollary 2 of Section 4 provides a lower bound on the order of $O(N)$ and $O(M)$. Note that values of $N$ and $M$ that are too small may not have the power to distinguish the proposed method from the SVD method. In practice, if clustering is based on categorical variables, then we can apply the existing categories, and $N$ and $M$ are known. If clustering is based on a continuous variable, we can apply the quantiles of the continuous variable to determine $N$ and $M$ and categorize users and items evenly. We then select the number of clusters through a grid search, similar to the selection of $\lambda$ and $K$. See Wang (2010) for consistent selection of the number of clusters in more general settings.
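The validation-based grid search described above can be sketched generically; here `fit` stands for any routine that returns a dict of predictions, and all names are hypothetical:

```python
import numpy as np

def rmse(pairs, theta_hat):
    """RMSE over a set Omega_0 of (u, i, r_ui) triples, given predictions."""
    se = [(r - theta_hat[(u, i)]) ** 2 for (u, i, r) in pairs]
    return float(np.sqrt(np.mean(se)))

def select_lambda(fit, train, valid, grid):
    """Sketch of grid search: refit at each lambda on the training set and
    keep the value minimizing RMSE on the validation set."""
    return min(grid, key=lambda lam: rmse(valid, fit(train, lam)))
```

The same wrapper can be reused to tune $K$ or the numbers of clusters $N$ and $M$, since they are selected by the same grid-search criterion.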
In particular, for our numerical studies, we split each dataset into 60% training, 15% validation, and 25% testing sets based on the time of ratings (timestamps; Zhu et al. 2016). That is, we use historical data to predict future data. If time information is not available, we use a random split instead.
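A chronological 60/15/25 split can be sketched as follows, assuming each record carries a timestamp (the record layout is hypothetical):

```python
def chronological_split(ratings, train=0.60, valid=0.15):
    """Sketch: sort rated pairs by timestamp and split 60/15/25 so that
    historical data predict future data.

    ratings: list of (timestamp, u, i, r) tuples (hypothetical layout).
    """
    ordered = sorted(ratings, key=lambda rec: rec[0])
    n = len(ordered)
    a, b = int(train * n), int((train + valid) * n)
    return ordered[:a], ordered[a:b], ordered[b:]
```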

Theory
In this section, we provide the theoretical foundation of the proposed method in a general setting; that is, we allow $r_{ui}$ to follow a general class of distributions. In particular, we derive an upper bound for the prediction error in probability, and show that existing approaches that do not use group effects lead to a larger value of the loss function, and are therefore less efficient than the proposed method. Furthermore, we establish a lower bound on the number of clusters which guarantees that the group effects can be detected effectively.
Suppose the expected value of each rating is formulated via a known mean function $\mu$; that is, $E(r_{ui}) = \mu(\theta_{ui})$, where $\theta_{ui}$ is defined as in (1). For example, if $r_{ui}$ is a continuous variable, then $\mu(\theta_{ui}) = \theta_{ui}$; and if the $r_{ui}$'s are binary, then $\mu(\theta_{ui}) = \exp(\theta_{ui}) / \{1 + \exp(\theta_{ui})\}$.
We let $f_{ui} = f(r_{ui} \mid \theta_{ui})$ be the probability density function of $r_{ui}$. Since each $r_{ui}$ is associated with $\gamma$ only through $\theta_{ui}$, we write $f_{ui}(r, \gamma) = f(r_{ui} \mid \theta_{ui})$. We define the likelihood-based loss function as
$$L(\gamma \mid R_o) = -\frac{1}{|\Omega|} \sum_{(u,i) \in \Omega} \log f_{ui}(r_{ui}, \gamma) + \lambda_{|\Omega|} D(\gamma),$$
where $\lambda_{|\Omega|}$ is the penalization coefficient, $|\Omega|$ is the total number of observed ratings, and $D(\cdot)$ is a nonnegative penalty function of $\gamma$; for example, $D(\gamma) = \|\gamma\|_2^2$ for the $L_2$-penalty. Since, in practice, the ratings are typically nonnegative finite values, it is sensible to assume $\|\gamma\|_\infty \le L$, where $L$ is a positive constant. We define the parameter vector space $S(k)$ accordingly. Notice that the dimension of $\gamma$ is $\dim(\gamma) = (n + m + N + M)K$, which goes to infinity as either $n$ or $m$ increases; we therefore allow $k$ to grow with the sample size. Similarly, we define the parameter space $\tilde S(k)$ for each $\theta_{ui}$.

Assumption 1. For some constant $\bar G \ge 0$, $|\log f_{ui}(r, \gamma) - \log f_{ui}(r, \tilde\gamma)| \le G(r_{ui}) |\theta_{ui} - \tilde\theta_{ui}|$ for $\theta_{ui}, \tilde\theta_{ui} \in \tilde S(k)$, where $E G^2(r_{ui}) \le \bar G^2$ for $u = 1, \ldots, n$, $i = 1, \ldots, m$.
In the following, we discuss convergence properties based on the probability density function $f_{ui}$ rather than the parameter itself. This is because, with knowledge of $f_{ui}(r, \gamma)$, we can estimate the predicted rating $E(r_{ui})$, the corresponding variance $\mathrm{var}(r_{ui})$, and other distributional features of $r_{ui}$. Specifically, we employ the Hellinger metric to measure the distance between two density functions, since it is the most convenient metric associated with density functions and is always well-defined (Van de Geer 1993). The Hellinger metric $h(\cdot, \cdot)$ on $\tilde S(k)$ is defined as
$$h(\theta_{ui}, \tilde\theta_{ui}) = \Big[ \frac{1}{2} \int \big\{ f^{1/2}(r \mid \theta_{ui}) - f^{1/2}(r \mid \tilde\theta_{ui}) \big\}^2 \, d\nu(r) \Big]^{1/2},$$
where $\nu(\cdot)$ is a probability measure. Based on Assumption 1, $h(\theta_{ui}, \tilde\theta_{ui})$ is bounded by $\|\theta_{ui} - \tilde\theta_{ui}\|_2$. We now define the Hellinger metric $h_S(\cdot, \cdot)$ on $S(k)$: for $\gamma, \tilde\gamma \in S(k)$, let
$$h_S(\gamma, \tilde\gamma) = \Big\{ \frac{1}{nm} \sum_{u=1}^{n} \sum_{i=1}^{m} h^2(\theta_{ui}, \tilde\theta_{ui}) \Big\}^{1/2}.$$
It is straightforward to show that $h_S$ is still a metric. In the rest of this article, we suppress the subscript and use $h(\cdot, \cdot)$ to denote the Hellinger metric on $S(k)$. In the following, we show that $h(\gamma, \tilde\gamma)$ can be bounded by $\|\gamma - \tilde\gamma\|_2$.
Lemma 2. Under Assumption 1, there exists a constant $d_0 \ge 0$ such that for $\gamma, \tilde\gamma \in S(k)$, $h(\gamma, \tilde\gamma) \le d_0 \|\gamma - \tilde\gamma\|_2$.

Let $\hat\gamma = \arg\min_{\gamma \in S(k)} L(\gamma \mid R_o)$ be a penalized maximum likelihood estimator of $\gamma$. Theorem 1 indicates that $\hat\gamma$ converges to $\gamma$ exponentially in probability, with a convergence rate of $\varepsilon_{|\Omega|}$.

Theorem 1. Under Assumption 1, and supposing $\lambda_{|\Omega|} < \varepsilon_{|\Omega|}^2 / (2 k^2)$, the best possible convergence rate of $\hat\gamma$ is $\varepsilon_{|\Omega|}$, and there exists a constant $c > 0$ such that
$$P\big( h(\hat\gamma, \gamma) \ge \varepsilon_{|\Omega|} \big) \le \exp\big( -c |\Omega| \varepsilon_{|\Omega|}^2 \big).$$

Remark 2. Theorem 1 can be generalized to achieve the convergence property measured by the $L_2$ distance as a special case of Corollary 2 in Shen (1998). However, convergence under the $L_2$ distance is more restrictive than convergence under the Hellinger distance. In addition, the Hellinger distance has the following advantages. First, the convergence rate of $\hat\gamma$ depends only on the size of the parameter space $S(k)$ and the penalization coefficient $\lambda_{|\Omega|}$ (Shen 1998). In contrast, the convergence rate based on the $L_2$ distance depends on additional local and global behavior of $\mathrm{var}\{L(\gamma \mid R_o) - L(\tilde\gamma \mid R_o)\}$. Second, the exponential bound under the Hellinger distance does not rely on the existence of the moment generating function of $G(\cdot)$, which is needed for the exponential bound under the $L_2$ distance.
Remark 3. Note that $\{(n + m)k / |\Omega|\}^{1/2} \{\log(m/k)\}^{1/2}$ is the convergence rate provided by the collaborative prediction method with binary ratings (Srebro et al. 2005). The exact rate comparison is not available here.
Remark 4. The definition of $S(k)$ is for the purpose of achieving the best possible convergence rate. Specifically, let $S \subset \mathbb{R}^{(n+m+N+M)K}$ be the true underlying parameter space. Since $S$ is an infinite-dimensional space when $n$ or $m$ goes to infinity, the $\hat\gamma$ obtained by optimizing over $S$ may not achieve the best possible convergence rate (Shen and Wong 1994). Instead, we adopt the idea of the sieve MLE (Grenander 1981) and approximate $S$ by $S(k)$, which grows as the sample size increases. This ensures that the penalized MLE $\hat\gamma$ on $S(k)$ is capable of achieving the best possible convergence rate (Shen 1998).
Remark 5. If we impose $\|\tilde\gamma - \gamma\|_2 \le d_{n,m}$ with radius $d_{n,m} = \frac{2nm}{d_0^2 (n+m)} \varepsilon_{|\Omega|}^2$, then the entropy of $S(k)$ under Assumption 1 also satisfies the condition of local entropy (Wong and Shen 1995). Consequently, the convergence rate $\varepsilon_{|\Omega|}$ is faster by a factor of $\log(|\Omega|)$ than the rate calculated using global entropy.
We now assume that the density function $f_{ui}$ is a member of the exponential family in its canonical form; that is,
$$f(r_{ui} \mid \theta_{ui}) = \exp\big\{ r_{ui} \theta_{ui} - b(\theta_{ui}) + c(r_{ui}) \big\},$$
where $b(\cdot)$ and $c(\cdot)$ are known functions. In fact, the following results still hold if $f$ is in the overdispersed exponential family.
Suppose $\gamma \in S(k)$ and $\theta_{ui} \in \tilde S(k)$ are the true parameters. Theorem 2 indicates that if misspecified $\tilde\theta_{ui}$'s are not close to the $\theta_{ui}$'s, then the loss function of the corresponding $\tilde\gamma$ cannot be closer than a given threshold to the loss function of $\gamma$, in probability.
Suppose $\gamma^0 \in S(k)$ is a vectorization of $(P, Q, 0, 0)$, which corresponds to models with no group effects. The following Corollary 1 shows that if the true group effects are not close to 0, then existing methods ignoring group effects, such as the SVD model ($\theta^0_{ui} = p_u^\top q_i$), lead to a larger loss in probability than the proposed method.

Corollary 1. Under Assumption 1 and $\lambda_{|\Omega|} < \varepsilon_{|\Omega|}^2 / (2 k^2)$, there exist constants $c_i > 0$, $i = 1, 2$, and $\phi \in (0, 1]$, such that for $\frac{1}{\sqrt{\phi}} \varepsilon_{|\Omega|} > 0$ there exists $\delta_{|\Omega|} > 0$. Assume that at least $\phi nm$ pairs of $(u, i)$ satisfy $|\theta^0_{ui} - \theta_{ui}| > \delta_{|\Omega|}$. Then
$$P\big( L(\gamma^0 \mid R_o) - L(\gamma \mid R_o) \ge c_1 \delta_{|\Omega|}^2 \big) \ge 1 - \exp\big( -c_2 |\Omega| \varepsilon_{|\Omega|}^2 \big).$$

The following corollary provides the minimal rate of $N$ and $M$ in terms of $n$, $m$, $K$, and $|\Omega|$. This implies that the number of clusters should be sufficiently large so that the group effects can be detected.
Corollary 2. Under the assumptions in Theorem 1, the rate of $N$ and $M$ satisfies $N + M \ge O\big( \log(|\Omega|^{1/2} / K^{1/2}) \big)$, if we further assume that the number of ratings is proportional to the size of the utility matrix, that is, $|\Omega| \sim O(nm)$.

The lower bound of $O(N + M)$ is useful in determining the minimal number of clusters. For example, for the MovieLens 10M data, where $|\Omega| = 10{,}000{,}000$, we have the lower bound $\log(|\Omega|^{1/2} / K^{1/2}) \approx 7$ if $K \le 10$.

Simulation Studies
In this section, we provide simulation studies to investigate the numerical performance of the proposed method in finite samples. Specifically, we compare the proposed method with four matrix factorization methods in Section 5.1 under a dynamic setting where new users and new items appear at later times. In Section 5.2, we test the robustness of the proposed model under various degrees of cluster misspecification.

Comparison Under the "Cold-Start" Problem
In this simulation study, we simulate the "cold-start" problem, where new users' and new items' information is not available in the training set. In addition, we simulate that users' behavior is affected by other users' behavior, so the missingness is not missing completely at random. Here, users and items from the same group are generated to be dependent on each other.
In contrast to other simulation studies, we do not generate the entire utility matrix $R$. Instead, we mimic the real-data case, where only a small percentage of ratings is collected. We choose the total number of ratings to be $|\Omega| = (1 - \bar\pi) nm$, where $\bar\pi = 0.7$, 0.8, 0.9, or 0.95 is the missing rate. The following procedure is used to generate these ratings.
We first select the $l$th user-item pair $(u_l, i_l)$, where $l = 1, \ldots, |\Omega|$ indexes the sequence of ratings from the earliest to the latest. If item $i_l$'s current average rating is greater than 0.5, then we assign user $u_l$'s rating $r_{u_l i_l}$ with probability 0.85; otherwise we assign $r_{u_l i_l}$ with probability 0.2. The rating $r_{u_l i_l}$ is generated by $(p_{u_l} + s_{v_{u_l}})^\top (q_{i_l} + t_{j_{i_l}})/3 + \varepsilon$, where $\varepsilon \overset{iid}{\sim} N(0, 1)$. That is, we simulate a setting where users tend to rate highly rated items. Here, $u_l$ and $i_l$ are sampled from $1, \ldots, n$ and $1, \ldots, m$ independently, with weights proportional to the densities of the normal distributions $N(nl/|\Omega|, (0.2n)^2)$ and $N(ml/|\Omega|, (0.2m)^2)$, respectively. That is, ratings appearing at a later time are more likely to correspond to newer users or newer items. If we fail to assign $r_{u_l i_l}$ a value, we redraw $(u_l, i_l)$ and restart this procedure. The selection of $r_{u_l i_l}$ is based on observed information, so the missing mechanism is missing at random (Rubin 1976).
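The generation procedure above can be sketched as follows. This is a simplified illustration (for example, the time index is approximated by the fraction of ratings generated so far), not the exact simulation code:

```python
import numpy as np

def simulate_ratings(P, Q, S, T, uc, ic, n_ratings, seed=0):
    """Sketch of the rating-generation scheme: users tend to rate items whose
    current average rating is high, and later ratings skew toward newer
    (larger-index) users and items via shifted normal-density weights."""
    rng = np.random.default_rng(seed)
    n, m = P.shape[0], Q.shape[0]
    total = {i: 0.0 for i in range(m)}   # running sum of each item's ratings
    count = {i: 0 for i in range(m)}     # running count of each item's ratings
    out = {}
    while len(out) < n_ratings:
        frac = len(out) / n_ratings      # proxy for the time index l/|Omega|
        wu = np.exp(-0.5 * ((np.arange(n) - frac * n) / (0.2 * n)) ** 2)
        wi = np.exp(-0.5 * ((np.arange(m) - frac * m) / (0.2 * m)) ** 2)
        u = rng.choice(n, p=wu / wu.sum())
        i = rng.choice(m, p=wi / wi.sum())
        avg = total[i] / count[i] if count[i] else 0.0
        accept = 0.85 if avg > 0.5 else 0.2  # highly rated items attract raters
        if (u, i) in out or rng.random() > accept:
            continue                          # failed: redraw the pair
        r = (P[u] + S[uc[u]]) @ (Q[i] + T[ic[i]]) / 3 + rng.standard_normal()
        out[(u, i)] = float(r)
        total[i] += r
        count[i] += 1
    return out
```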
We compare the performance of the proposed method with four competitive matrix factorization models: the regularized singular value decomposition method solved by the alternating least-squares algorithm (RSVD; Funk 2006; Koren et al. 2009), a regression-based latent factor model (Agarwal and Chen 2009), a nuclear-norm matrix completion method (Soft-Impute; Mazumder et al. 2010), and a latent factor model with sparsity pursuit (Zhu et al. 2016). The proposed algorithm is implemented in Matlab, and the RSVD is solved in R. For the last three methods, we apply the code at https://github.com/beechung/latent-factor-models, the R package "softImpute," and the code of Zhu et al. (2016), respectively.
For the proposed method, we apply the loss function (2). The tuning parameter $\lambda$ for the proposed method and the RSVD is selected from grid points ranging from 1 to 29 to minimize the RMSE on the validation set. For Agarwal and Chen (2009), we use the default of 10 iterations; for Mazumder et al. (2010), the default $\lambda = 0$ is chosen to achieve convergence to the local minimum; and for Zhu et al. (2016), tuning parameter selection is integrated in their code. We generate simulation settings with the number of latent factors $K = 3$ and 6, and the missing rate $\bar\pi = 0.7$, 0.8, 0.9, 0.95. The means and standard errors of the RMSEs on the testing set are reported in Table 1. The simulation results are based on 500 replications.

Table 1 indicates that the proposed method performs the best across all settings. Overall, the proposed method is relatively robust against different missing rates and different numbers of latent factors, and has the smallest standard error in most settings. In the most extreme case, with $K = 6$ and $\bar\pi = 0.95$, the proposed method is still more than 100% better than the best of the four existing methods in terms of RMSE. The RSVD method performs well when both $\bar\pi$ and $K$ are small, but performs poorly when either $\bar\pi$ or $K$ increases. By contrast, Agarwal and Chen (2009), Mazumder et al. (2010), and Zhu et al. (2016) are able to provide small standard errors when $K = 6$ and $\bar\pi = 0.95$, but have large RMSEs across all settings. Mazumder et al. (2010) occasionally provided outlying results due to a convergence problem when $\bar\pi$ is 0.9 or 0.95; we remove these extreme results in our simulations.

Robustness Against Cluster Misspecification
In this simulation study, we test the robustness of the proposed method when the clusters are misspecified.
We follow the same data-generating process as in the previous study, but allow the cluster assignment to be misspecified. Specifically, we misassign users and items to adjacent clusters with 10%, 30%, and 50% chance, where adjacent clusters are defined as the clusters with the closest group effects. This definition of adjacent clusters reflects the real-data situation: for example, a horror movie might be misclassified as a thriller, but is less likely to be misclassified as a romance.
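The adjacent-cluster misassignment can be sketched as follows, defining the adjacent cluster as the one whose group-effect vector is nearest in Euclidean distance (a minimal sketch with hypothetical names):

```python
import numpy as np

def misassign(labels, group_effects, rate, seed=0):
    """Sketch: with probability `rate`, move each member to the adjacent
    cluster, i.e., the cluster whose group-effect vector is closest to
    its true cluster's effect."""
    rng = np.random.default_rng(seed)
    G = np.asarray(group_effects)
    # pairwise distances between cluster effects; exclude self-distance
    d = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)          # adjacent cluster for each label
    out = np.array(labels)
    flip = rng.random(out.size) < rate  # members to misassign
    out[flip] = nearest[out[flip]]
    return out
```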
The simulation results based on 500 replications are summarized in Table 2. In general, the proposed method is robust against misspecification of the clusters. In comparison with the results in Table 1, the proposed method performs better than the other four methods in all settings, even when 50% of the cluster members are misclassified. On the other hand, the misspecification rate affects the performance of the proposed method to different degrees across settings of $\bar\pi$ and $K$. For example, under $K = 3$ and $\bar\pi = 0.7$, the proposed method with a 50% misspecification rate is 2.7% worse in RMSE than with no misspecification, but becomes 18.8% worse under $K = 6$ and $\bar\pi = 0.95$.

MovieLens Data
We apply the proposed method to the MovieLens 1M and 10M data. The two datasets are collected by GroupLens Research and are available at http://grouplens.org/datasets/movielens. The MovieLens 1M data contain 1,000,209 ratings of 3883 movies by 6040 users, with rating scores ranging from 1 to 5. In addition, the 1M dataset provides demographic information for the users (age, gender, occupation, zipcode), and genres and release dates of the movies. The MovieLens 10M data contain 10,000,054 ratings collected from 71,567 users over 10,681 items, and 99% of the movie ratings are actually missing. Rating scores take values 0.5, 1, . . . , 5, but no user information is available.

Table 1. RMSE (standard error) of the proposed method compared with four existing methods, with the missing rate π = 70%, 80%, 90%, and 95%, and the number of latent factors K = 3 or 6, where RSVD, AC, MHT, and ZSY stand for regularized singular value decomposition, the regression-based latent factor model (Agarwal and Chen 2009), Soft-Impute (Mazumder et al. 2010), and the latent factor model with sparsity pursuit (Zhu et al. 2016).

Figure 1 illustrates the missing pattern of the MovieLens 1M data. Both graphs indicate that the missing mechanism is possibly missing not at random. In the left panel, the right-skewed distribution of ratings per user indicates that only a few users rated a large number of movies: while the median number of ratings is 96, the maximum reaches 2314. The right panel shows that popular movies attract more viewers; that is, the number of ratings for each movie is positively associated with its average rating score, indicating nonignorable missingness.
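These missingness summaries can be computed directly from the raw ratings. The sketch below uses a toy pandas frame with hypothetical column names in place of the real MovieLens files.

```python
import pandas as pd

# Hypothetical ratings frame; the real data are available at
# grouplens.org/datasets/movielens with a similar triple structure.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "movie_id": [10, 11, 12, 10, 11, 10],
    "rating":   [4, 5, 3, 2, 3, 5],
})

# Per-user rating counts: right-skewed in the 1M data
# (median 96 ratings, maximum 2314).
per_user = ratings.groupby("user_id")["rating"].size()

# Per-movie count vs. average score: a positive correlation suggests
# nonignorable missingness (popular movies attract more viewers).
per_movie = ratings.groupby("movie_id")["rating"].agg(["size", "mean"])
corr = per_movie["size"].corr(per_movie["mean"])
```

Plotting the histogram of `per_user` and the scatter of `per_movie` reproduces the two panels of Figure 1 in spirit.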
For the proposed method, we take advantage of missingness information from each user and item for clustering. We observe that users who give a large number of ratings tend to assign low rating scores; therefore, we classify users based on the quantiles of the number of their ratings. For items, we notice that old movies that are still being rated are usually classics and have high average rating scores; therefore, the items are clustered based on their release dates. We use N = 12 and M = 10 as the numbers of clusters for users and items in both datasets. The means of ratings from different clusters are significantly different based on pairwise two-sample t-tests. In addition, we also try a large range of values for N and M, but they do not affect the results very much.
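Quantile-based clustering of users by their rating activity can be sketched as follows; the function is an illustrative stand-in rather than the authors' code (the actual analysis uses N = 12 user clusters).

```python
import numpy as np

def cluster_users_by_activity(n_ratings, n_clusters=12):
    """Assign each user to one of `n_clusters` groups based on the
    quantiles of their number of ratings, so groups are roughly
    equally sized."""
    counts = np.asarray(n_ratings, dtype=float)
    # interior quantile cut points splitting users into n_clusters groups
    edges = np.quantile(counts, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.searchsorted(edges, counts, side="right")
```

Items could be grouped analogously by release date, replacing `n_ratings` with release years.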
The proposed method is compared with the four matrix factorization methods described in Section 5.1. Tuning parameters for each method are selected from grid points to minimize the RMSEs on the validation set. For the proposed method, we apply the loss function (2) and select K = 2 and λ = 12 for the 1M data, and K = 6 and λ = 16 for the 10M data. For Agarwal and Chen (2009), we select K = 1 for both the 1M and 10M data, which requires 25 and 10 iterations of the EM algorithm to guarantee convergence, respectively. For Mazumder et al. (2010), K = 4 and K = 9 are selected for the 1M and 10M data; since using different λ's does not influence the RMSE very much, we apply λ = 0 to estimate the theoretical local minimum. For Zhu et al. (2016), the tuning and the selection of K are handled automatically in their code, and the $L_0$-penalty function is applied. For the RSVD, K = 4 and λ = 7.5 are selected for the 1M data, and K = 4 and λ = 6 for the 10M data. In addition, we also compare the proposed method with the "grand mean imputation" approach, which predicts each rating by the mean of the training set and the validation set, and the "linear regression" approach, which regresses ratings from the training and validation sets on all available covariates from users and items.

Table 3 provides the prediction results on the testing set, which indicate that the proposed method outperforms the other methods quite significantly. For example, for the 1M data, the RMSE of the proposed method is 8.7% less than the RSVD, 19.5% less than Agarwal and Chen (2009), 10.3% less than Mazumder et al. (2010), 9.4% less than Zhu et al. (2016), and 13.2% and 11.6% less than grand mean imputation and linear regression, respectively.

The numerical studies are run on Dell C8220 computing sleds, each with two 10-core Intel Xeon E5-2670V2 processors and 64 GB of RAM. The proposed method uses 0.8 minutes for the 1M data (K = 2 and λ = 12), and 8.3 minutes for the 10M data (K = 6 and λ = 16).
The RSVD uses 6.4 minutes for the 1M data (K = 4 and λ = 7.5), and 7.1 hours for the 10M data (K = 4 and λ = 6). The Agarwal and Chen (2009) method requires 18.1 minutes for the 1M data (K = 1 with 25 iterations) and 1.1 hours for the 10M data (K = 1 with 10 iterations); the Mazumder et al. (2010) method uses 0.3 minutes for the 1M data (K = 4) and 11.6 minutes for the 10M data (K = 9); and Zhu et al. (2016) uses 1.1 minutes for the 1M data and 18.5 minutes for the 10M data.
We also investigate the "cold-start" problem in the MovieLens 10M data, where 96% of the ratings in the testing set are either from new users or on new items which are not available in the training set. We name these ratings "new ratings," in contrast to the "old ratings" given by existing users to existing items. In Table 4, we compare the proposed method with the four competing methods on the "old ratings," the "new ratings," and the entire testing set. On the one hand, the RSVD, Mazumder et al. (2010), Zhu et al. (2016), and the proposed method have similar RMSEs on the "old ratings" set, indicating similar prediction accuracy for existing users and items. On the other hand, the proposed method has the smallest RMSE compared to the other methods on the "new ratings" and the entire testing sets, indicating its superior performance for the "cold-start" problem.

Table 4. RMSE of the proposed method compared with four existing methods on the MovieLens 10M data to study the "cold-start" problem: "old ratings" and "new ratings" stand for ratings in the testing sets given by existing users to existing items, and by new users or to new items, respectively. Here, RSVD, AC, MHT, and ZSY stand for regularized singular value decomposition, the regression-based latent factor model (Agarwal and Chen 2009), Soft-Impute (Mazumder et al. 2010), and the latent factor model with sparsity pursuit (Zhu et al. 2016).
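The split of the testing set into "old ratings" and "new ratings" can be sketched as follows, assuming ratings are stored as (user, item, score) triples; this is an illustration, not the paper's preprocessing code.

```python
def split_cold_start(train, test):
    """Split test ratings into 'old' (existing user AND existing item)
    and 'new' (new user OR new item) — the cold-start subset.

    Each rating is a (user, item, score) triple.
    """
    seen_users = {u for u, _, _ in train}
    seen_items = {i for _, i, _ in train}
    old = [r for r in test if r[0] in seen_users and r[1] in seen_items]
    new = [r for r in test if r[0] not in seen_users or r[1] not in seen_items]
    return old, new
```

Computing the RMSE separately on the two subsets yields the comparison reported in Table 4.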

Discussion
We propose a new recommender system which improves prediction accuracy by incorporating dependency among users and items, in addition to using information from the nonrandom missingness.
In most collaborative filtering methods, training data may not have sufficient information to estimate subject-specific parameters for new users and items. Therefore, only baseline models such as ANOVA or linear regression are applied. For example, for a new user $u$, $\hat{p}_u = 0$, and a method without the group effects gives $\hat{\theta}_{ui} = x_{ui}\hat{\beta}$. In contrast, the proposed method provides the prediction $\hat{\theta}_{ui} = x_{ui}\hat{\beta} + \hat{s}_{v_u}'(\hat{q}_i + \hat{t}_{j_i})$. The interaction term $\hat{s}_{v_u}'\hat{q}_i$ provides the average rating of the $v_u$th cluster on the $i$th item, which guarantees that $\hat{\theta}_{ui}$ is item-specific. The same property also holds for new items. The group effects $s_{v_u}$ and $t_{j_i}$ allow us to borrow information from existing users and items, and provide more accurate recommendations to new subjects.
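A minimal sketch of this prediction rule, with all inputs as hypothetical toy vectors:

```python
import numpy as np

def predict_rating(x_ui, beta, p_u, s_v, q_i, t_j):
    """Group-specific prediction theta = x*beta + (p_u + s_v)'(q_i + t_j).

    For a new user, p_u is the zero vector, so the group effect s_v
    still contributes the interaction s_v'(q_i + t_j), yielding an
    item-specific prediction rather than a baseline-only one.
    """
    return float(x_ui @ beta + (p_u + s_v) @ (q_i + t_j))

# A new user: p_u = 0, but the cluster effect s_v is still informative.
theta_new = predict_rating(
    x_ui=np.array([1.0]),        # covariates (toy)
    beta=np.array([3.0]),        # regression effect (toy)
    p_u=np.zeros(2),             # new user: no individual latent factor
    s_v=np.array([1.0, 0.0]),    # user-cluster group effect (toy)
    q_i=np.array([0.5, 1.0]),    # item latent factor (toy)
    t_j=np.array([0.5, 0.0]),    # item-cluster group effect (toy)
)
```

Dropping `s_v` and `t_j` reduces the rule to the baseline $x_{ui}\hat{\beta}$, which is the same for every item when the user is new.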
The proposed model also takes advantage of missingness information, as users or items may have missing patterns associated with their rating behaviors. Therefore, we propose clustering users and items based on the numbers of their ratings or other variables associated with the missingness. Thus, the group effects $(s_{v_u}, t_{j_i})$ can provide unique latent information which is not available in $x_{ui}$, $p_u$, or $q_i$. Note that if the group effects $(s_{v_u}, t_{j_i})$ are the only factors associated with the missing process, then the proposed method captures the entire missing-not-at-random mechanism. In other words, correctly estimating $(s_{v_u}, t_{j_i})$ enables us to achieve consistent and efficient estimation of $\theta_{ui}$, regardless of the missing mechanism.
One possible future research direction is to bridge the gap between numerical implementation and theoretical justification, where the current numerical algorithm converges to a stationary point, while the theoretical properties are established for the global minimum. In general, this gap occurs in nonconvex optimization problems (e.g., Zhou et al. 2013; Zhu et al. 2016), since nonconvex optimization is NP-hard (Ge et al. 2015). In the global optimization literature, methods ensuring global solutions exist but may require extremely high computational cost; for instance, branch-and-bound methods (Land and Doig 1960; Liu et al. 2005). The most relevant approach is the outer approximation method proposed by Breiman and Cutler (1993), which guarantees a global minimizer but may suffer from slow convergence, making it infeasible for large-scale problems. In addition, further investigation is needed regarding the convergence behavior of parameters under the $L_2$ distance.

Supplementary Materials
The online supplement contains proofs for Lemmas 1 and 2, Theorems 1 and 2, and Corollaries 1 and 2.