A Bayesian Singular Value Decomposition Procedure for Missing Data Imputation

Abstract Missing data are common in empirical studies. Multiple imputation is a method to handle missing values by replacing them with plausible values. A common imputation method is multiple imputation by chained equations (MICE). MICE defines a series of conditional distributions to impute missing values. Although MICE is relatively easy to implement, it may not converge to a proper joint distribution. An alternative strategy is to model the variables jointly using the general location model, but this model can become complex when the number of variables increases. Both approaches require the integration of prior information when there are more variables than cases. We propose a Bayesian model that is based on the singular value decomposition components of a continuous data matrix to impute missing values. The model assumes that the matrix is of low rank by applying double exponential prior distributions to the singular values. We describe an efficient sampling algorithm to estimate the model's parameters and impute the missing data. The performance of the model is compared to current imputation methods in simulated and real datasets. Of all the methods considered, and in most of the simulated and real datasets, the proposed procedure appears to be the most accurate and precise. Supplementary materials for this article are available online.


Introduction
In many biological and social sciences studies, values for some units are missing. A dataset may have missing values for a variety of reasons. For example, missing data arise when survey respondents refuse to answer questions (Brick and Kalton 1996; Jannach et al. 2016), or when participants are lost to follow-up in clinical trials (Hogan, Roy, and Korkontzelou 2004; Morita, Thall, and Müller 2008). Statistical analyses that are applied only to units with complete data may result in biased estimates and misleading conclusions (Little and Rubin 2019). Multiple imputation methods are a set of statistical procedures to handle missing data (Rubin 1996). These methods explicitly fill in missing data entries with plausible values. The main attraction of these methods is that once a dataset with imputed values has been constructed, standard statistical methods for complete data can be used for analysis. Methods that rely on a single imputation of the missing values fail to adjust for the uncertainty that is inherent in the imputation of unobserved values (Rubin 1996). Multiple imputation methods generate H > 1 complete datasets to account for the uncertainty in the imputation process (Rubin 2004).
Multiple imputation procedures can be broadly classified into two general strategies: Fully Conditional Specification (FCS) and Joint Modeling (JM). FCS methods, also known as multiple imputation by chained equations (MICE) (van Buuren et al. 2006), posit a series of univariate conditional models for variables with missing values. These procedures begin with a set of initial values. Using these values, for each variable with missing data, FCS methods estimate a corresponding conditional model. Estimates from this conditional model are used to predict and impute the variable's missing values. Repeating the estimation and prediction steps for each variable is called a cycle (Carpenter and Kenward 2012). These cycles are repeated until "convergence" is achieved, and the set of imputed values is used to complete the first dataset. FCS algorithms continue for additional cycles to obtain the required number of imputed datasets. Continuous outcomes are typically imputed with either a set of univariate regression models or the predictive mean matching method (Rubin and Schenker 1986; Little 1988). The FCS approach is straightforward and widely adopted in many applications (White, Royston, and Wood 2011), but its theoretical properties are not well understood. Specifically, the stationary distributions of FCS methods may not correspond to inference on a multivariate joint distribution of all the variables (Liu et al. 2013). Another limitation arises when the number of cases is smaller than the number of variables. Selecting the explanatory variables for each of the conditional models, or defining prior distributions to identify all of the parameters in these conditional models, can be complex (Carpenter and Kenward 2012). Recently, recursive partitioning algorithms were proposed as possible solutions to this problem (Burgette and Reiter 2010; Doove, Van Buuren, and Dusseldorp 2014; Shah et al. 2014). However, these methods still suffer from the same theoretical limitations (Xu, Daniels, and Winterstein 2016).
JM procedures propose joint models for the entire dataset. Commonly, these procedures assume that the rows in the dataset are independent and follow a joint multivariate distribution.
When all of the variables with missing values are quantitative, a commonly used model assumes that each row in the dataset independently follows a multivariate Normal distribution (Honaker, King, and Blackwell 2011b; Goldstein, Carpenter, and Browne 2014). The multinomial and log-linear models were proposed when all of the variables are categorical (Schafer 1997). The general location and the restricted general location models were proposed to simultaneously describe quantitative and categorical variables (Little and Schluchter 1985; Schafer 1997). These models are statistically valid and computationally efficient, but they lack the flexibility to model various data features, such as skewness or complex dependencies that may arise when the rows of the dataset are dependent (Van Buuren 2018). Moreover, when the number of rows in a dataset is smaller than the number of columns, the multivariate model is not identifiable without prespecified constraints or prior distributions for model parameters (Carpenter and Kenward 2012).
A different JM approach that relaxes the assumption that the rows of the dataset are independent is based on the Bayesian Principal Component Analysis (PCA) model (Audigier, Husson, and Josse 2016). This procedure assumes that each entry in the data matrix has an independent Normal distribution with an expectation calculated from the singular value decomposition (SVD) components of the entire matrix, with a known rank k, and an unknown variance σ^2. Missing values are imputed using a data augmentation algorithm that iterates between sampling σ^2 and imputing the missing values. This procedure ignores the sampling variability of the rank of the matrix and of the estimated SVD components.
Low-rank matrix recovery algorithms have been proposed as single imputation procedures for missing values (Laurent 2009). These algorithms view the missing value estimation process as an optimization problem under certain sparsity constraints on the singular values (Hastie, Tibshirani, and Wainwright 2019). These algorithms do not assume that the rank of the data matrix is known and estimate its rank within the optimization process. However, the variability of the model parameters and the variability of the imputed values are commonly ignored when reporting estimates that are based on these procedures.
Bayesian matrix completion algorithms can provide the variabilities associated with model parameters and imputed values by generating samples from their joint posterior distribution. Yang et al. (2018) devised a Bayesian matrix completion algorithm which assumes that the columns of the matrix follow a multivariate Normal distribution with mean zero and a common precision matrix. A low-rank matrix is induced by assuming a Normal-Inverse-Wishart prior distribution. To improve sampling speed, a variational Bayesian algorithm is implemented (Tzikas, Likas, and Galatsanos 2008). A different Bayesian approach to sample from a low-rank matrix is the Bayesian SVD model (Hoff 2007). In the absence of missing values, the Bayesian SVD model was proposed as a possible procedure to estimate the variabilities of the rank of the matrix and its SVD components by considering them as random variables. The Bayesian SVD algorithm requires sampling on a Stiefel manifold (Chikuse 2003), which can be complex. To overcome this issue, a relaxed SVD factorization that relies on orthogonal low-rank matrix factorization was proposed (Song and Zhu 2016). This algorithm assumes that the eigenvectors are orthogonal, but not orthonormal. Sampling from this relaxed model does not require sampling on a Stiefel manifold, but the factorization may result in multiple equal modes.
We propose a Bayesian low-rank matrix completion algorithm by modifying the Bayesian SVD model in three major ways. First, we introduce a set of double exponential prior distributions on the singular values to adaptively shrink the singular values toward zero. These prior distributions mimic the Lasso penalty used in low-rank matrix recovery algorithms (Laurent 2009). Second, we propose an efficient MCMC procedure that reconstructs the double exponential prior distributions as a mixture of Normal distributions (Park and Casella 2008; Griffin and Brown 2010). Third, we modify the Bayesian SVD model to handle missing values using the Data Augmentation algorithm (Tanner and Wong 1987). The proposed Bayesian SVD multiple imputation model does not assume that the rows of the dataset are independent, and it can handle datasets with more variables than cases. Using simulated data, we examine the operating characteristics of the proposed model and demonstrate that it is superior to existing JM and FCS methods. Lastly, we demonstrate the accuracy and robustness of the new procedure by applying it to data from fine needle aspiration biopsies of breast cancer patients.
The article proceeds as follows: Section 2 describes the Bayesian SVD model and the proposed model with double exponential prior distributions. Section 3 presents simulations to examine the propriety of the proposed method and Section 4 compares the proposed method to existing imputation methods. Section 5 examines the performance of existing imputation methods and the proposed methods on a real dataset, and Section 6 provides discussion and conclusions.

Bayesian SVD Multiple Imputation Model
Let Y = {Y_ij} be an m × n data matrix of continuous entries such that Y_ij ∈ R. Matrix decomposition procedures involve partitioning the matrix Y into constituent elements (Hoff 2007; Tuncer, Tanik, and Allison 2008; Audigier, Husson, and Josse 2016). Formally, let Y = M + E, where M = {M_ij} is an m × n mean matrix and E is a matrix of independent and identically distributed Normal random variables with mean zero and variance 1/φ. The mean matrix M can be decomposed as UDV^T, where U is an m × k matrix with orthonormal columns, V is an n × k matrix with orthonormal columns, and D is a k × k diagonal matrix with nonnegative diagonal elements d_1 ≥ · · · ≥ d_k, where k ≤ min{n, m} is the rank of M. The sets of such orthonormal-column matrices form Stiefel manifolds, denoted V_{k,m} and V_{k,n}, respectively. For a matrix M of rank k, Hoff (2007) proposed a Bayesian matrix decomposition model which assumes uniform prior distributions for U and V over the spaces V_{k,m} and V_{k,n}, respectively. This model is formally defined as

Y = UDV^T + E,  E_ij iid∼ N(0, 1/φ),  U ∼ uniform(V_{k,m}),  V ∼ uniform(V_{k,n}).  (1)

For computational simplicity, the singular values are not restricted in the model to be nonnegative. Hoff (2007) assumed a set of Normal prior distributions for the singular values, d_1, . . . , d_k iid∼ N(μ, 1/ψ), and derived the conditional posterior distributions of U, D, and V when the data matrix is fully observed. We introduce a set of double exponential prior distributions for the singular values, and extend the sampling scheme to incorporate a step for missing value imputation. The double exponential prior distributions produce adaptive shrinkage of the singular value estimates, similar to the Lasso penalty in low-rank matrix recovery algorithms (Park and Casella 2008).

Double Exponential Prior Distributions for Singular Values
Consider a set of double exponential prior distributions for d_1, d_2, . . . , d_k of the form

p(d_j | φ, λ) = (λ φ^{1/2} / 2) exp(−λ φ^{1/2} |d_j|),  j = 1, . . . , k.  (2)

In addition, let the marginal prior distribution of φ be a Gamma distribution with shape parameter ν_0/2 and rate parameter ν_0 σ_0^2/2, φ ∼ Gamma(ν_0/2, ν_0 σ_0^2/2). Park and Casella (2008) noted that it is important for the double exponential prior distribution to condition on φ because it guarantees that the posterior distribution p(φ, λ | Y) is unimodal. Sampling the singular values with this prior distribution can be computationally complex. A possible solution to improve computational efficiency is to represent the double exponential distribution as a scale mixture of Normal distributions (Andrews and Mallows 1974; Park and Casella 2008),

d_j | τ_j, φ ∼ N(0, τ_j/φ),  τ_j ∼ Exponential(λ^2/2),  j = 1, . . . , k.  (3)

Combining (1) and (3), the posterior distribution is

p(U, D, V, φ, λ, τ_1, . . . , τ_k | Y) ∝ p(Y | U, D, V, φ) ∏_{j=1}^{k} p(d_j | τ_j, φ) p(τ_j | λ) π(λ) π(φ),  (4)

where π(λ) and π(φ) are the prior distributions for λ and φ.
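To see the equivalence between (2) and (3) numerically, the following short sketch (in R, with arbitrary values of λ and φ; an illustration added here, not the authors' code) draws from the scale mixture and compares the empirical density to the double exponential density:

    set.seed(1)
    lambda <- 2; phi <- 1
    tau <- rexp(1e5, rate = lambda^2 / 2)             # tau_j ~ Exponential(lambda^2 / 2)
    d <- rnorm(1e5, mean = 0, sd = sqrt(tau / phi))   # d_j | tau_j, phi ~ N(0, tau_j / phi)
    x <- seq(-4, 4, length.out = 200)
    plot(density(d), main = "Scale mixture vs. double exponential")
    lines(x, lambda * sqrt(phi) / 2 * exp(-lambda * sqrt(phi) * abs(x)), lty = 2)

The two curves coincide, confirming that integrating over τ_j recovers the double exponential prior in (2).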

Sampling Algorithm in the Absence of Missing Values
To sample from the posterior distribution p(U, D, V, φ, λ, τ_1, . . . , τ_k | Y, k), we developed an efficient Markov chain Monte Carlo (MCMC) procedure that is based on the original Bayesian SVD sampling procedure (Hoff 2007). Formally, let U_{[,j]} and V_{[,j]} be the jth columns of U and V, respectively. In addition, let U_{[,−j]} and V_{[,−j]} be the matrices of all the remaining columns of U and V, respectively. Lastly, let E_{−j} = Y − U_{[,−j]} D_{[−j,−j]} V_{[,−j]}^T be the residual matrix after removing all components of the decomposition except the jth. As a function of U_{[,j]}, the likelihood can be expressed as

p(Y | U, D, V, φ) ∝ exp{ φ d_j U_{[,j]}^T E_{−j} V_{[,j]} }.  (5)

Conditional on U_{[,−j]}, U_{[,j]} is in the column null space of U_{[,−j]}. Specifically, U_{[,j]} = N_{−j} u_j, where N_{−j} is a basis for the null space of the columns of U_{[,−j]} and u_j is a priori uniform on the (m − (k − 1))-dimensional sphere. Equation (5) shows that the conditional posterior distribution of u_j | Y, D, U_{[,−j]}, V, φ is proportional to exp{ φ d_j u_j^T N_{−j}^T E_{−j} V_{[,j]} }, which is a von Mises–Fisher distribution. Thus, given U_{[,−j]}, U_{[,j]} can be sampled by generating u_j from the von Mises–Fisher (vMF) distribution on the (m − (k − 1))-sphere with parameter φ d_j N_{−j}^T E_{−j} V_{[,j]} and setting U_{[,j]} = N_{−j} u_j. Samples of V_{[,j]} from its corresponding conditional distribution can be generated similarly.
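As an illustration, a minimal sketch of this column update in R, assuming the rstiefel package of Hoff (which provides NullC() for a null-space basis and rmf.vector() for von Mises–Fisher draws on the sphere), could look as follows; it is our rendering of the update, not the authors' published code:

    library(rstiefel)
    update_U_column <- function(Y, U, D, V, phi, j) {
      k <- length(D)
      # residual matrix after removing all components except the jth
      E_j <- Y - U[, -j, drop = FALSE] %*% diag(D[-j], nrow = k - 1) %*% t(V[, -j, drop = FALSE])
      N <- NullC(U[, -j, drop = FALSE])             # basis for the null space of U[, -j]
      kmu <- phi * D[j] * t(N) %*% E_j %*% V[, j]   # vMF concentration parameter
      u <- rmf.vector(kmu)                          # draw u_j on the (m - (k - 1))-sphere
      U[, j] <- N %*% u
      U
    }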
Based on Equation (4), the full conditional distributions of 1/τ_j and d_j are an inverse-Gaussian distribution and a Normal distribution, respectively. To complete the Bayesian model, a prior distribution for the hyperparameter λ is required. One possible prior distribution for λ is the conjugate Gamma distribution with shape parameter α and rate parameter β. To generate S samples from the joint posterior distribution p(U, D, V, φ, λ, τ_1, . . . , τ_k | Y), the steps in Algorithm 1 are repeated S times.
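A sketch of the remaining Gibbs steps is given below (in R, using rinvgauss() from the statmod package). The specific conditional forms, an inverse-Gaussian draw for 1/τ_j, a Normal draw for d_j, and a Gamma draw for λ^2 under a Gamma(α, β) prior, are our reconstruction from the scale-mixture representation in (3) and (4), not the authors' code:

    library(statmod)
    update_tau_d_lambda <- function(Y, U, D, V, phi, tau, lambda, alpha, beta) {
      k <- length(D)
      for (j in 1:k) {
        E_j <- Y - U[, -j, drop = FALSE] %*% diag(D[-j], nrow = k - 1) %*% t(V[, -j, drop = FALSE])
        # d_j | rest ~ Normal: precision combines the likelihood and the N(0, tau_j/phi) prior
        shrink <- 1 + 1 / tau[j]
        mu <- drop(crossprod(U[, j], E_j %*% V[, j])) / shrink
        D[j] <- rnorm(1, mu, 1 / sqrt(phi * shrink))
        # 1/tau_j | rest ~ inverse-Gaussian
        tau[j] <- 1 / rinvgauss(1, mean = sqrt(lambda^2 / (phi * D[j]^2)), shape = lambda^2)
      }
      # lambda^2 | tau ~ Gamma under the conjugate Gamma(alpha, beta) prior
      lambda <- sqrt(rgamma(1, shape = alpha + k, rate = beta + sum(tau) / 2))
      list(D = D, tau = tau, lambda = lambda)
    }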
Because the true rank of the mean matrix M is unknown, we can treat it as a fixed constant or assume that k follows a prior distribution and estimate its posterior distribution in the sampling process (see Hoff (2007) for further details). In some applications of matrix decomposition, the goal is to represent the n variables in the data using a matrix M of rank k ≪ min{m, n} that displays the main patterns in the data. In such applications, estimating the posterior distribution of the rank can be important when describing the k factors that represent the main data patterns. In other applications of matrix decomposition, the goal is to impute missing values in order to implement downstream analyses with statistical tools for complete data. When imputing the matrix, the posterior distribution of the rank is less critical, because researchers are mainly concerned with improving the prediction accuracy of the missing values. Moreover, when the imputed matrices are used in downstream analyses, it has been suggested to rely on saturated rather than restricted models to avoid uncongeniality (Meng 1994; Murray 2018).
When setting the rank to a fixed constant in the Bayesian SVD procedure, k cannot be set at its maximum value min{m, n}, because when k = min{m, n} the conditional distributions of U_{[,j]} or V_{[,j]} are Bernoulli distributions over the support of two unit vectors, both orthogonal to either U_{[,−j]} or V_{[,−j]}. In the MCMC sampling procedure, U_{[,j]} or V_{[,j]} can then only rotate by π from their initial values, which prevents the MCMC sampling procedure from generating meaningful posterior samples. Setting k to min{m, n} − 2 avoids the extreme case of k = min{m, n}, and it provides maximal flexibility for estimating the singular values, because the prior distribution shrinks the nonsignificant singular values. This shrinkage effect on the singular values is demonstrated in Section 3. However, when k = min{m, n} − 2, the Bayesian SVD algorithm requires sampling min{m, n} − 2 eigenvectors, which increases the computation time significantly. In large matrices with limited computational power, a smaller k may be selected. When k is set significantly smaller than min{m, n} − 2, the estimates of M may be biased, because significant positive singular values are set to zero.
In the Gamma prior distribution of φ, the shape parameter is set at ν_0 = 2. This prior distribution was derived based on guidelines for effective sample size computations, which ensure that inference is not dominated by the prior distribution (Morita, Thall, and Müller 2008). For the rate parameter, we use the "Empirical Bayes" method to calculate σ_0^2 by averaging the residual variances over different ranks of M (Hoff 2007; Casella 2001). Algorithm 2 summarizes the steps for estimating σ_0^2 using the "Empirical Bayes" method. In this algorithm, we ignore the dependence between φ and the prior distribution of d_j to gain computational efficiency. This leads to underestimation of σ_0^2; however, we find that this underestimation of the hyperparameter does not have a significant impact on the convergence rate of the MCMC algorithm.
Algorithm 2 Empirical Bayes Algorithm to Estimate σ_0^2
for l ∈ {1, . . . , min{m, n}} do
• Let Û D̂ V̂^T be the least-squares projection of Y onto the set of rank-l matrices;
• Compute the residual variance of Y − Û D̂ V̂^T;
end for
• Set σ_0^2 to the average of the residual variances over the ranks l.
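A direct implementation of Algorithm 2 in R might look as follows (taking the mean squared residual as the residual variance is our assumption):

    estimate_sigma2_0 <- function(Y) {
      s <- svd(Y)
      K <- min(nrow(Y), ncol(Y))
      resid_var <- sapply(seq_len(K), function(l) {
        # least-squares (truncated SVD) projection of Y onto the rank-l matrices
        M_l <- s$u[, 1:l, drop = FALSE] %*% diag(s$d[1:l], nrow = l) %*% t(s$v[, 1:l, drop = FALSE])
        mean((Y - M_l)^2)                 # residual variance of the rank-l fit
      })
      mean(resid_var)                     # average over the ranks l
    }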

Sampling of the Bayesian Lasso Parameter
The model parameter λ governs the magnitude of the Lasso prior distributions on the singular values, because integrating (3) over τ_j recovers the double exponential prior distribution in (2). In contrast to the low-rank matrix recovery algorithms (Laurent 2009), in the Bayesian SVD model λ can be estimated by imposing a prior distribution and sampling from the corresponding posterior distribution, as proposed in Section 2.2, or by using an Empirical Bayes procedure that relies on marginal maximum likelihood (Park and Casella 2008; Casella 2001).
In the Empirical Bayes procedure, λ is estimated at each iteration of the MCMC sampling procedure. Formally, λ^{(r)} is approximated with the Monte Carlo EM (MCEM) update of Park and Casella (2008),

λ^{(r)} = ( 2k / Σ_{j=1}^{k} E_{λ^{(r−1)}}[τ_j | Y] )^{1/2},

where the conditional expectations are approximated by averages over the MCMC samples generated with λ^{(r−1)}. The algorithm is initialized at

λ^{(0)} = K_0 (σ_0^2)^{1/2} / Σ_{j=1}^{K_0} |d̂_j^{(0)}|,

where d̂_1^{(0)}, . . . , d̂_{K_0}^{(0)} are the elements along the diagonal of D̂^{(0)}, and Û^{(0)} D̂^{(0)} V̂^{(0)T} is the least-squares projection of Y onto the set of matrices with rank K_0 = min{m, n}. This λ^{(0)} is the empirical Bayes estimate of λ conditional on σ_0^2 (see the Appendix for further derivation). Because σ_0^2 is underestimated, λ is also underestimated in this procedure. We find in the simulations that this choice of λ^{(0)} does not impact the convergence rate significantly.
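The two estimation strategies can be sketched in R as follows; the formulas mirror the reconstruction above from Park and Casella (2008) and should be read as assumptions pending the Appendix derivation:

    lambda_init <- function(Y, sigma2_0) {
      s <- svd(Y)                                    # rank-K0 projection, K0 = min(m, n)
      length(s$d) * sqrt(sigma2_0) / sum(abs(s$d))   # lambda^(0)
    }
    lambda_mcem_update <- function(tau_samples) {
      # tau_samples: L x k matrix of tau draws generated with lambda^(r-1)
      k <- ncol(tau_samples)
      sqrt(2 * k / sum(colMeans(tau_samples)))       # lambda^(r)
    }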

Sampling Procedure with Missing Values
Let W = {W_ij} be the m × n missing data indicator matrix, where W_ij = 1 if Y_ij is missing and W_ij = 0 otherwise, and partition Y into (Y_obs, Y_mis), where Y_mis denotes the entries of Y with W_ij = 1 and Y_obs denotes the entries of Y with W_ij = 0. Let θ be the population parameters governing Y, and η be the parameters governing the missing data mechanism. We assume that the missing data mechanism is ignorable if it fulfills two conditions (Rubin 1976):
• Missing at random (MAR): p(W = w | Y_obs = y_obs, Y_mis, η) = p(W = w | Y_obs = y_obs, η) for all η, where w and y_obs are realized values of W and Y_obs, respectively.
• θ and η are distinct (i.e., their joint parameter space is the product of the parameter spaces of θ and η). In Bayesian inference, distinctness implies that θ and η are a priori independent.
The data are said to be missing not at random (MNAR) when the first condition is violated. Under MAR, the MCMC scheme in Section 2 can be extended to include a step for missing data imputation using the Data Augmentation algorithm (Tanner and Wong 1987). The Bayesian SVD procedure with missing entries is summarized in Algorithm 3.

Algorithm 3 Bayesian SVD Matrix Completion Algorithm
Initialization:
• Impute Y_mis with the column means of the observed entries;
• Initialize ν_0 and σ_0^2 as proposed in Section 2.2 and λ^{(0)} as proposed in Section 2.3;
• Initialize {τ_1, . . . , τ_k} by sampling from their prior distributions.
For each iteration:
• Update U, D, V, φ, λ, and τ_1, . . . , τ_k as in Algorithm 1;
• Draw Ỹ = UDV^T + Ẽ, where the entries of Ẽ are independent N(0, 1/φ), and set Y = W ⊙ Ỹ + (1 − W) ⊙ Y, where ⊙ is the Hadamard product.
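An end-to-end sketch of Algorithm 3 in R is given below. Here init_state() and update_parameters() are hypothetical helpers standing in for the initialization steps above and for one Algorithm 1 sweep over (U, D, V, φ, τ, λ); the predictive draw for the missing cells follows the model Y = M + E with E_ij ∼ N(0, 1/φ):

    bayes_svd_impute <- function(Y, W, k, S) {
      # initialization: impute missing cells with column means of the observed entries
      for (j in seq_len(ncol(Y))) Y[W[, j] == 1, j] <- mean(Y[W[, j] == 0, j])
      state <- init_state(Y, k)                 # hypothetical: nu_0, sigma2_0, lambda, tau, ...
      draws <- vector("list", S)
      for (s in seq_len(S)) {
        state <- update_parameters(Y, state)    # one Algorithm 1 sweep
        M <- state$U %*% diag(state$D, nrow = k) %*% t(state$V)
        Ytilde <- M + matrix(rnorm(length(Y), sd = 1 / sqrt(state$phi)), nrow(Y))
        Y <- W * Ytilde + (1 - W) * Y           # Hadamard-product update of the missing cells
        draws[[s]] <- Y
      }
      draws                                     # post burn-in draws yield the H imputed datasets
    }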

Evaluating Convergence and Model Fit
To examine the validity and the performance of the proposed Bayesian SVD model, we perform two sets of simulation analyses. The first examines the convergence of the proposed sampling procedure and the fit of the proposed model. This evaluation was performed on complete matrices. The second set of simulations examines the performance of the proposed procedure in comparison to other imputation methods on matrices with missing values. To evaluate Algorithm 1 without missing data, we drew a total of S = R + L posterior samples, where R is the number of burn-in samples and L is the number of samples that were used to make inferences on the singular values. To ensure that the MCMC algorithm is sampling from the posterior distribution, we set R = L = 10,000, for a total of 2 × 10^4 MCMC iterations.

Simulation Configurations
The data matrices in these simulations have m = 20 rows and n = 40 columns. The true rank of these matrices is set as k* = 5, and the singular values are set to D* = diag(16, 14, 10, 2, 1). As in (1), the mean matrix is constructed as M = UD*V^T, with U ∼ uniform(V_{5,20}) and V ∼ uniform(V_{5,40}). In addition, we assume that all elements of E = {e_ij} are independently generated from a standard Normal distribution, N(0, 1).
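This data-generating process can be sketched in R; drawing U and V as the Q factors of Gaussian matrices is a standard construction for uniform draws on the Stiefel manifold and is our assumption about how the sampling was done:

    set.seed(1)
    m <- 20; n <- 40; k_star <- 5
    d_star <- c(16, 14, 10, 2, 1)
    U <- qr.Q(qr(matrix(rnorm(m * k_star), m, k_star)))   # U ~ uniform(V_{5,20})
    V <- qr.Q(qr(matrix(rnorm(n * k_star), n, k_star)))   # V ~ uniform(V_{5,40})
    Y <- U %*% diag(d_star) %*% t(V) + matrix(rnorm(m * n), m, n)   # Y = M + E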

Competing Methods and Evaluation Metrics for Complete Matrices
We compare the proposed Bayesian SVD model with k = min{m, n} − 2 to least-squares estimation, which relies on the true rank, k*, of M. The least-squares estimate is based on principal component analysis and is derived by calculating M̂_LS, the projection of Y onto the set of rank-k* matrices.

Figure 1(1) displays the shrinkage effects on the singular values when the data matrix is fully observed. As λ increases, more singular values are estimated closer to 0. The three largest singular values, {16, 14, 10}, are shrunk to 0 for values of λ ≥ 5. The two smallest singular values, {2, 1}, are shrunk to 0 for values of λ ≥ 1. This shrinkage effect is similar to the one observed for the low-rank matrix recovery algorithm with the Lasso penalty, except that here the singular values are not shrunk to exactly 0. Figure 1(2) displays the MSE of the mean matrix for different values of λ. The vertical lines represent the posterior median and 95% credible interval for λ when it is updated in each MCMC step by the proposed MCEM method. The MSE for the median λ is close to the smallest MSE when λ is set to a constant. Even when the true rank is correctly specified, least-squares estimation results in a larger MSE than the proposed Bayesian SVD model with λ updated using MCEM, where the true rank is unknown and k = min{m, n} − 2. Figure 2 shows a trace plot of λ for each iteration of the MCMC algorithm described in Section 2.3. As expected, the initial λ^{(0)} underestimates λ. However, convergence is not significantly influenced by the choice of λ^{(0)}, as the sampling algorithm converges to the approximate posterior distribution of λ after 2000 iterations.

Simulation Configurations
Similar to Section 3, the mean data matrix M is simulated by sampling U ∼ uniform(V_{m,k*}) and V ∼ uniform(V_{n,k*}), where k* is the rank of the mean matrix. The singular values are simulated by sampling from a distribution that depends on μ_mn = n + m + 2√(mn); the sampling configurations are summarized in Table 1. The error matrix E = {e_ij} is generated such that the e_ij are independently sampled from a standard Normal distribution, N(0, 1), or a t-distribution, t(df = 5). The simulations examined several sizes of matrices. One matrix size included more columns than rows, Y^N_{20×40}, one included more rows than columns, Y^N_{40×20}, and one set of matrices had an equal number of rows and columns, Y^N_{100×100} and Y^t_{100×100}. The true rank of Y^N_{20×40} and Y^N_{40×20} was set at k* = 5, the true rank of Y^t_{100×100} was set at k* = 20, and the true rank of Y^N_{100×100} was set at either k* = 20 or k* = 30. Table 1 depicts the parameters and configurations of the simulations.

Missing Data Mechanisms
We examine three types of missing data mechanisms. The first is the missing completely at random (MCAR) mechanism (Rubin 1976). Under this mechanism, the missing indicators for each cell of the data matrix are simulated independently. Formally, W_ij ∼ Bernoulli(0.15), subject to Σ_{i=1}^{m} W_ij < m for all j and Σ_{j=1}^{n} W_ij < n for all i, to avoid completely missing rows and completely missing columns in the data matrix.
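A minimal sketch of this mechanism in R; the rejection loop is one simple way (our assumption) to enforce the row and column constraints:

    simulate_mcar <- function(m, n, p = 0.15) {
      repeat {
        W <- matrix(rbinom(m * n, 1, p), m, n)
        # require sum_i W_ij < m for all j and sum_j W_ij < n for all i
        if (all(colSums(W) < m) && all(rowSums(W) < n)) return(W)
      }
    }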
The next two mechanisms are nonignorable. These mechanisms are challenging because they can be hard to specify correctly, and they may be based on nonidentified or weakly identified parameters. Both mechanisms are MAR, but they assume that the parameter spaces of θ and η are not distinct. The first nonignorable mechanism relies on the column-wise correlations of the mean matrix, M. Let i*(j) be the column of M that has the largest correlation with column j. Missing data indicators are simulated independently, conditional on the column with the largest correlation. Formally, W_ij ∼ Bernoulli(p_ij), where logit(p_ij) = α + β M_{i,i*(j)}. The slope coefficient, β, is set at 10, and α is chosen such that the average missingness rates are either 15%, 30%, 50%, or 70%. This mechanism generates missingness based on parameters that may be of interest when analyzing the complete dataset, and we denote it NIG_CORR. The largest column-wise correlation ranges between 0.67 and 0.87 for Y^N_{20×40}, between 0.47 and 0.71 for Y^N_{40×20}, and between 0.43 and 0.56 for Y^N_{100×100} when the true rank is k* = 20. The corresponding column-wise correlations of the M matrix range from 0.94 to 0.99 for Y^N_{20×40}, from 0.68 to 0.86 for Y^N_{40×20}, and from 0.87 to 0.99 for Y^N_{100×100}.

The second nonignorable mechanism is based on the largest singular value of M, and is denoted NIG_SV. This mechanism assumes that missing values are generated based on parameters that are estimated during the imputation process, which can lead to biased estimates of the parameters and therefore imprecise imputed values. Formally, let M* = d_s U_{[,s]} V_{[,s]}^T be the matrix generated by setting all of the singular values to 0 except the largest one, where d_s is the largest singular value, U_{[,s]} is the corresponding left-singular vector, and V_{[,s]} is the corresponding right-singular vector. We generate the missing value indicators W_ij from Bernoulli(p_ij), where logit(p_ij) = α + β M*_ij, such that β = 10 and α is chosen so that the average missingness rates are either 15%, 30%, 50%, or 70%.
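A sketch of NIG_CORR in R follows; solving for α by root finding is our assumption about how "α is chosen" to attain a target missingness rate:

    simulate_nig_corr <- function(M, beta = 10, target = 0.15) {
      R <- cor(M); diag(R) <- 0
      partner <- apply(R, 2, which.max)      # i*(j): column most correlated with column j
      Mpart <- M[, partner]                  # column j of Mpart holds M[i, i*(j)]
      f <- function(alpha) mean(plogis(alpha + beta * Mpart)) - target
      alpha <- uniroot(f, c(-200, 200))$root
      p <- plogis(alpha + beta * Mpart)      # logit(p_ij) = alpha + beta * M[i, i*(j)]
      matrix(rbinom(length(p), 1, p), nrow(M), ncol(M))
    }

NIG_SV follows the same template with β M*_ij in place of β M_{i,i*(j)}.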

Imputation Methods
We compare the performance of the Bayesian SVD multiple imputation to currently available methods. When inferences on the imputed matrices are based on the empirical distribution of the MCMC samples from Algorithm 3, a large number of independent samples of Y_mis is required (Little and Rubin 2019). We set R = L = 5000. As seen in Section 3.3, this number of iterations ensures that Algorithm 3 generates samples from the posterior distribution, and it enables us to obtain a large number of independent samples of Y_mis. For matrices Y^N_{20×40} and Y^N_{40×20}, we assumed that k = min{m, n} − 2 = 18. For Y^N_{100×100} and Y^t_{100×100}, we assumed that k = 50, in order to reduce the computation time over the large number of simulations. These assumed ranks are at least 1.5 times greater than the true ranks.
We compare all of the methods using H = 100 imputations. This number of imputations was selected because it is likely to be an upper bound on the number of imputations that would lead to reproducible results (White, Royston, and Wood 2011; Carlin 2014). Moreover, similar results were observed when we increased the number of imputations for the proposed Bayesian SVD method and when we decreased the number of imputations for the existing methods (data not shown). The performance of the proposed Bayesian SVD algorithm was compared to the following four existing imputation methods.

Marginal Mean Imputation imputes missing values with the average of the observed values in each column (Carpenter and Kenward 2014). This is a single imputation method that may result in biased estimates of the correlations between variables (Carpenter and Kenward 2014).
MICE is an FCS procedure that assumes a univariate regression model for each variable with missing values. For continuous outcomes, the most commonly used models are linear regression models or predictive mean matching (Van Buuren 2018). Recently, classification and regression trees (CART) and random forests were proposed as possible imputation methods (Doove, Van Buuren, and Dusseldorp 2014). These methods can result in more reliable inferences by modeling nonlinear relationships between variables (Doove, Van Buuren, and Dusseldorp 2014), and they are implemented in the R mice package (van Buuren and Groothuis-Oudshoorn 2011). Here, we implemented the random forest model as the univariate model, because it can be applied without manual adjustments across all simulation configurations. Because the number of columns is large compared to the number of rows, CART, linear regression, and predictive mean matching tend to fail in some simulation replications and produce improper outputs. When using the random forest algorithm, the set of initial predictor variables comprised the variables whose observed marginal correlations with the response variable were greater than or equal to 0.3. In our numerical studies, we observed that the imputed values were relatively stable after five iterations. Following the recommendations in the literature (White, Royston, and Wood 2011; van Buuren and Groothuis-Oudshoorn 2011), we implemented the algorithm with 5 burn-in cycles followed by 100 cycles to generate the imputed datasets.
Amelia is a JM procedure that assumes a joint multivariate Normal distribution for all of the continuous variables. The mean vector and the covariance matrix are estimated using the Expectation Maximization with Bootstrap (EMB) algorithm (Honaker and King 2010). Using these estimates and their corresponding sampling variances, the missing data are imputed by sampling from the corresponding multivariate Normal distributions. The method is implemented in the R Amelia package (Honaker, King, and Blackwell 2011a). When some variables are highly collinear, or when the number of variables is larger than the number of observations, the model is unidentifiable. To overcome this issue, the package employs the Ridge prior distribution to estimate the model parameters (Honaker, King, and Blackwell 2011a). Similar to MICE, we implemented this procedure with 100 imputed datasets.
missMDA is another JM procedure that assumes Y = UDV^T + E, where UDV^T comprises the SVD components of the mean matrix and E is an error matrix composed of independent Normally distributed entries from N(0, σ^2). The algorithm assumes that the true rank of the matrix is known and uses it as an input. In practice, the rank k is commonly determined using cross-validation (Audigier, Husson, and Josse 2016). missMDA initializes missing values using the column means of the observed data matrix; it then iteratively estimates σ^2 and the missing values using principal components methods that are based on the first k principal components (Audigier, Husson, and Josse 2016). The method is implemented in the R missMDA package. The multiple imputation procedure was implemented with the Bayesian option and 100 imputed datasets.
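For reference, hedged usage sketches of the three R packages as configured above are shown below; dat is a placeholder data frame with missing values, and the specific argument choices (the quickpred() screen at mincor = 0.3 and MIPCA's "Bayes" option) are our reading of the text and of each package's interface:

    library(mice); library(Amelia); library(missMDA)
    pred <- quickpred(dat, mincor = 0.3)       # 0.3 marginal-correlation predictor screen
    imp_mice <- mice(dat, method = "rf", predictorMatrix = pred, m = 100, maxit = 5)
    imp_amelia <- amelia(dat, m = 100)         # EMB; a ridge prior is available via `empri`
    ncp <- estim_ncpPCA(dat)$ncp               # rank chosen by cross-validation
    imp_mda <- MIPCA(dat, ncp = ncp, method.mi = "Bayes", nboot = 100)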

Evaluation Metrics
Point estimates for the mean matrix from the proposed Bayesian SVD model and the four imputation methods described in Section 4.3 were recorded. To estimate M with an existing imputation method, we use M̂_LS, assuming that k = k*, in each completed dataset h. We also examined the performance of the existing methods assuming that M is full rank. The results with both of these ranks were similar; thus, we only report the results with k = k* in the manuscript, and the results assuming full rank are provided as supplementary material. In the Bayesian SVD imputation method, the estimates of M are obtained from the sampling algorithm.

Figure 3 shows the paths of the singular values for different λ values when 15% of the entries are missing in the data matrix Y^N_{20×40}. For all three types of missing data mechanisms, the proposed Bayesian SVD procedure results in λ values whose MSE_W of the mean matrix is close to the corresponding minimal MSE_W of the mean matrix when λ is set to a constant. The trends in the shrinkage of the singular values are similar to those observed in Section 3.3, where none of the singular values are significantly bigger than 0 when λ ≥ 5. The MSE of the mean matrix is larger under NIG_CORR and NIG_SV than under MCAR, because recovering the missing entries is more complex in the nonignorable configurations.

Table 2 compares the proposed Bayesian SVD multiple imputation model with the four multiple imputation methods described in Section 4.3 for different matrix dimensions and error distributions when the rate of missing values is 15%. Under all simulation configurations, the largest average MSE_W is observed for marginal mean imputation, because this method ignores the correlations between variables in the matrix. The average MSE of the mean matrix from the proposed Bayesian SVD model is the smallest among all methods, suggesting that the proposed model provides more accurate estimation of the mean matrix. In addition, the standard deviation of MSE_W across replications from the proposed model is the smallest for Y^N_{20×40} and Y^N_{40×20}, and it is similar to Amelia for Y^N_{100×100} and Y^t_{100×100}. The JM procedure in Amelia generally has the lowest MSE_W among the existing procedures described in Section 4.3.
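The metric MSE_W is used throughout; a minimal helper consistent with the W subscript, under our assumption that it averages the squared error between the estimated and true mean matrices over the missing cells, is:

    mse_w <- function(M_hat, M, W) sum(W * (M_hat - M)^2) / sum(W)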

Estimation of the Mean Matrix
The average MSE_W of Marginal Mean imputation, MICE, Amelia, and missMDA is smaller for Y^N_{40×20} than for Y^N_{20×40}. This probably stems from the ability to estimate model parameters more precisely when there are more cases than variables. The proposed Bayesian SVD imputation has similar MSE_W for both matrices, because the singular values are the same for the original and the transposed matrix.

Figure 4 depicts the distribution of MSE_W across 100 replications when Y^N_{100×100} is simulated with different missingness rates for each of the missing data mechanisms described in Section 4.2. The results are similar to those observed when 15% of the values are missing. The proposed Bayesian SVD method has the smallest MSE_W under all of the examined simulation settings. Across all methods, the variability of MSE_W generally increases as the overall missingness rate increases. For low missingness rates, Amelia performs better than Marginal Mean imputation, MICE, and missMDA. As the missingness rates increase, Amelia performs similarly to these methods. With Y^N_{100×100}, when the number of singular values is k* = 30, trends similar to k* = 20 are observed (data not shown). Specifically, the Bayesian SVD method has the smallest MSE_W in all configurations, and larger differences are observed for the nonignorable missing data configurations.

Table 2. Estimation of the mean matrix: average, standard deviation, and 2.5% and 97.5% quantiles of the mean squared error of the mean matrix (MSE_W) for a 15% missingness rate over 100 replications. Y^N_{20×40} are simulated data matrices with m = 20, n = 40, k* = 5, and e_ij ∼ N(0, 1); Y^N_{40×20} are simulated similarly with m = 40, n = 20. Y^N_{100×100} are simulated data matrices with m = 100, n = 100, k* = 20, and e_ij ∼ N(0, 1). Y^t_{100×100} are simulated data matrices with m = 100, n = 100, k* = 20, and e_ij ∼ t(df = 5). Bayesian SVD denotes the Bayesian SVD model with double exponential priors for the singular values.

Table 3 compares the MSE of the column-wise correlation for the proposed multiple imputation model and the four multiple imputation methods. For Y^N_{20×40} and Y^N_{40×20}, the proposed Bayesian SVD multiple imputation model has the smallest MSE of the column-wise correlation compared to the four methods and for all three types of missing data mechanisms. For Y^N_{100×100} and Y^t_{100×100}, the MSEs for all four methods are relatively small, because the column-wise correlations are calculated using both the imputed values and the observed values. The effects of poor imputation on the correlation are attenuated when the dimension of the data matrix is large and the proportion of missing entries is small.

Table 3. Estimation of the column-wise correlation of the mean matrix: average, standard deviation, and 2.5% and 97.5% quantiles of the mean squared error of the column-wise correlation of the data matrix over 100 replications.

Effect of t-distributed Error
The average MSE_W and the MSE of the column-wise correlation are larger for Y^t_{100×100} than for Y^N_{100×100} for MICE, Amelia, missMDA, and the Bayesian SVD method (Tables 2 and 3), because t(df = 5)-distributed error matrices have more extreme values than standard Normal error matrices. The Bayesian SVD multiple imputation provides the smallest average MSE_W for the mean matrix even when the distribution of E is misspecified, suggesting that the proposed Bayesian SVD model for multiple imputation is robust to component-wise symmetric error matrices. For the column-wise correlation, all four methods have similar MSEs, because the column-wise correlations are based on both the imputed values and the observed values, which attenuates the effects of inaccurate imputation.

Computation Time
We examine the computation time over 100 simulated datasets using one core of an Intel(R) Xeon(R) CPU E5-2650. The average time to complete 10^4 MCMC iterations using the Bayesian SVD with the double exponential prior is 560 seconds for Y^N_{20×40} and 2807 seconds for Y^N_{100×100} (Table 4). Using the Bayesian SVD algorithm with Normal prior distributions increases the average computation time by 14 seconds for Y^N_{20×40}, and the time is approximately the same for Y^N_{100×100} (Table 4). The imputation methods Amelia and missMDA require significantly less time to complete, because they require a smaller number of iterations to converge. Compared to the Bayesian SVD methods, MICE requires significantly more time to impute small matrices, but its running time is shorter for larger matrices. Comparing the number of seconds required to generate one complete dataset for Y^N_{20×40}, the Bayesian SVD method is 12.8 and 3.7 times faster than MICE and missMDA, respectively. This shows that when the autocorrelations between MCMC iterations are low, the Bayesian SVD methods can be computationally efficient for small and medium sized matrices.

Simulations Using Real Data Matrix
In Sections 3 and 4, the mean matrix was simulated with a prespecified rank, and its singular vectors followed uniform distributions on Stiefel manifolds. Real datasets may not follow this matrix construction. To examine the performance of the proposed Bayesian SVD model and compare it to the four imputation methods described in Section 4.3, we used the UCI Breast Cancer Wisconsin (Diagnostic) dataset (Dua and Graff 2017). This dataset includes 30 features computed from digitized images of fine needle aspirates of breast masses. These features describe the characteristics of the cell nuclei present in the images. A total of 569 images are included in the dataset, which has been widely used in the machine learning literature as an example dataset for model testing (Mangasarian and Wolberg 1990). To generate a matrix with a relatively small number of cases, we randomly sampled 50 of the 212 malignant breast cancer cell nuclei. We define these samples as the mean matrix, M. To generate the matrix Y, we perturbed the sampled M matrix of 50 rows and 30 columns by adding a random error distributed as N(0, 1) to each cell. The missing data indicator matrix W is generated based on the three missing data mechanisms described in Section 4.2, such that on average 15% of the cells are missing.

Similar to Section 4, multiple imputation Bayesian SVD with the double exponential prior distribution was implemented with k = min{50, 30} − 2 = 28 and with L = 10^4 MCMC iterations. For MICE, 100 complete datasets were generated, and variables were included in the conditional models if their marginal correlations with the outcome variable were greater than or equal to 0.3. For missMDA, 100 complete datasets were generated, and the rank of the matrix was estimated using cross-validation. For Amelia, 100 complete datasets were generated. For Marginal Mean, MICE, missMDA, and Amelia, we estimated the M matrix assuming that the rank is 28.

Table 5 shows MSE_W and the MSE of the column-wise correlation over 100 replications. Generally, the Bayesian SVD multiple imputation method has the smallest MSE_W and MSE of the column-wise correlation. For NIG_CORR, missMDA has a smaller MSE_W, but Bayesian SVD has a smaller MSE of the column-wise correlation. This suggests that even when the mean matrix is not generated under the Bayesian SVD model assumptions, imputation of missing values using the Bayesian SVD model is generally more precise than marginal mean imputation, MICE, Amelia, and missMDA.
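A sketch of this design in R is given below; wdbc and feature_cols are hypothetical placeholders for the loaded UCI data and its 30 feature columns:

    set.seed(1)
    malignant <- as.matrix(wdbc[wdbc$diagnosis == "M", feature_cols])  # 212 malignant cases
    M <- malignant[sample(nrow(malignant), 50), ]     # 50 x 30 mean matrix
    Y <- M + matrix(rnorm(length(M)), nrow(M))        # N(0, 1) perturbation of each cell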

Discussion
We propose a new joint modeling approach for multiple imputation that is based on the SVD components of a data matrix. The method does not rely on the assumption that the rows of the dataset are independent, and it can be applied to datasets where the number of variables is larger than the number of cases. Simulations suggest that the proposed method results in more precise estimates of the missing values, as well as of the column-wise correlation of the data matrix, compared to existing imputation procedures under various missing data mechanisms. In the simulation analyses, we examined datasets with more rows than columns, datasets with more columns than rows, and datasets with an equal number of rows and columns. Existing methods to impute continuous data, such as the multivariate normal model and fully conditional specification based on random forests, require strong prior distributions to perform model estimation when the number of cases is smaller than or equal to the number of variables. In Amelia, an R package that uses the multivariate normal distribution, the Ridge prior distribution is assumed when many columns are collinear. MICE, which is an FCS approach, requires that the predictors be prespecified for every univariate model. The proposed Bayesian SVD method does not require such specifications, because it relies on a shrinkage prior distribution for the singular values.
The proposed method relies on a full Bayesian representation of the SVD components of the data matrix. The MCMC scheme uses a rejection-sampling method to sample from the von Mises–Fisher distribution, and it can become computationally intensive as the dimension of the data matrix increases (Hoff 2009). One possible way to reduce the computational complexity is to assume that the mean matrix has a lower rank; however, this may result in biased imputed values. Because the imputation process is only implemented once, this computational complexity should be less of a concern. When researchers encounter computational constraints, we suggest selecting k < min{m, n} − 2 when both dimensions of the data matrix are larger than 100. This reduction is reasonable in many cases, because low-dimensional signal matrices occur frequently in practice (Hoff 2007; Josse and Husson 2016; Hastie, Tibshirani, and Wainwright 2019).
In the low-rank matrix recovery literature, a Ridge penalty on the singular values has also been proposed. In our simulations, we did not identify significant differences between the Ridge prior distribution and the double exponential prior distribution. However, in any specific application, the choice could have a significant effect on the imputation process. Selection of a prior distribution should be based on prior knowledge. When such prior knowledge is unavailable, we propose examining the performance of the imputation using posterior predictive checks for both prior distributions and selecting the model with the best performance. Future extensions may use Bayesian Model Averaging to obtain better average predictive performance (Hoeting et al. 1999).
In many real applications, missing data occur in both quantitative and categorical variables, which may limit the usefulness of the proposed method. However, there are multiple cases where the proposed method can be applied to impute missing data. For example, one limitation of the generalized location model is that, as the number of discrete categories increases, there are fewer observations with which to estimate the covariance matrix of the multivariate Normal component. To address this issue and restrict the number of estimated parameters, multivariate linear regression constraints are commonly applied (Schafer 1997). These constraints impose significant restrictions on the distribution of the continuous variables given the categorical variables. Instead, the proposed Bayesian SVD method can be used to impute multiple continuous variables with a small number of observations while relying on weaker constraints. Another example arises in longitudinal or clustered studies, which induce possible correlations across individuals and variables. The proposed Bayesian SVD method can be applied to correlated data without the need to specify an exact correlation structure. When there is a natural multilevel structure, a possible extension of the Bayesian SVD model is to partition the variability into between-group and within-group components and perform the SVD within each component (Husson et al. 2019).
In conclusion, we proposed a novel method to impute missing continuous data based on the Bayesian SVD model. We described a new prior distribution for the model and showed that this Bayesian SVD model is more accurate and precise in imputing missing values than existing methods. However, these benefits may come at the cost of increased computational complexity.