MCMC Strategies for Computing Bayesian Predictive Densities for Censored Multivariate Data

Traditional criteria for comparing alternative Bayesian hierarchical models, such as cross-validation sums of squares, are inappropriate for nonstandard data structures. More flexible cross-validation criteria such as predictive densities facilitate effective evaluations across a broader range of data structures, but do so at the expense of introducing computational challenges. This article considers Markov chain Monte Carlo strategies for calculating Bayesian predictive densities for vector measurements subject to differential component-wise censoring. It discusses computational obstacles in Bayesian computations resulting from both the multivariate and incomplete nature of the data, and suggests two Monte Carlo approaches for implementing predictive density calculations. It illustrates the value of the proposed methods in the context of comparing alternative models for joint distributions of contaminant concentration measurements.


INTRODUCTION
Model comparison and selection are among the most common problems of statistical practice, with numerous procedures for choosing among a set of models proposed in the literature (see, e.g., Kadane and Lazar 2004 and Rao and Wu 2001 for recent reviews). Advances in computational technologies provide practitioners the opportunity to use increasingly rich models, such as Bayesian hierarchical models, to explore structures in complicated data. However, comparing and selecting among complex models challenges standard procedures. Formal Bayesian methods for model selection or model averaging via posterior model probabilities (Draper 1995; Kass and Raftery 1995; Hoeting, Madigan, Raftery, and Volinsky 1999; and Han and Carlin 2001) present notorious computational difficulties that are exacerbated by richly parameterized models. As noted by Han and Carlin (2001), both direct approximations to marginal densities and model space searches require significant effort to implement, and in many cases may remain indeterminate.
Moreover, richly parameterized hierarchical models may also confound less computationally intensive model selection criteria such as AIC (Akaike 1973), BIC (Schwarz 1978; Kass and Raftery 1995), and CIC (Tibshirani and Knight 1999). These criteria penalize model fit by model complexity to favor parsimonious models in an effort to maintain predictive power. Such criteria may be misleading when applied to richly parameterized Bayesian models, because the terms that quantify model complexity are based on asymptotic approximations that replace the realized model dimension with the nominal dimension. When prior information restricts the ability of parameters to vary independently, the effective number of parameters fit by a model may be considerably less than the nominal number, leaving the appropriate measure of complexity ambiguous. Recent efforts have made progress in developing useful definitions of model complexity in these cases (Poskitt 1987; Ye 1998; Pauler, Wakefield, and Kass 1999; Hansen and Yu 2001; Hodges and Sargent 2001; Spiegelhalter, Best, Carlin, and van der Linde 2002), but no consensus yet exists. These factors make cross-validation (Mosteller and Tukey 1977) and other informal predictive evaluations (Laud and Ibrahim 1995; Han and Carlin 2001) attractive alternatives. Unfortunately, traditional cross-validation criteria, most notably sums of squared deviations of predicted from observed values, are not appropriate for complicated data structures. An example of such data is multivariate contaminant concentration data subject to censoring, for which the incompleteness of the data makes sums of squares impossible to evaluate. An alternative, more flexible cross-validation criterion is the predictive density (Gelfand and Dey 1994; Alqallaf and Gustafson 2001). Cross-validation predictive densities compromise between formal Bayes factor calculations and less formal criteria not applicable to complicated data structures.
This article discusses Markov chain Monte Carlo (MCMC) methods for efficiently calculating predictive densities for multivariate data when data are subject to differential component-wise censoring. It reviews the role of censored data in sampling posterior distributions, and shows how extensions of standard methods for achieving such samples do not provide effective means of calculating predictive densities. It then presents two Monte Carlo methods for implementing the calculations. It focuses on multivariate Gaussian data in an arbitrary parametric model structure because of the importance of this widely applicable case, but the principles of the methods are extensible to more general structures. Moreover, because the predictive densities we consider have the marginal densities required for Bayes factors as a special case, the methods may have value for structuring Bayes factor calculations. The article does not discuss the many important practical issues surrounding the use of cross-validation in model comparison, such as the appropriate numbers and sizes of data splits, but rather focuses on computational techniques that benefit all cross-validation schemes with censored multivariate data.
The remainder of the article is organized as follows. Section 2 presents the Bayesian parametric structure for censored data in which the methods are derived. Section 3 reviews the use of predictive density as a cross-validation criterion, and considers the role of censored data in the calculation of predictive densities. Section 4 presents two Monte Carlo approaches to calculating the predictive densities that address the computational obstacles presented by simpler approaches. Section 5 presents applications of the methods to a model comparison problem involving joint distributions of contaminant concentrations, and Section 6 summarizes and discusses possible extensions.

PARAMETRIC STRUCTURE FOR CENSORED DATA
Suppose that, conditional on a value θ ∈ Ω of a vector parameter Θ, the complete-data vectors Z_1, ..., Z_n are independent k-dimensional Gaussian random vectors,

Z_i | Θ = θ ~ N_k(μ_i(θ, x_i), Σ_i(θ, x_i)),  i = 1, ..., n.

The vectors x_i represent known covariate information, and for the remainder of the article all probability statements are implicitly conditional on these covariates. The parameter Θ is arbitrary, and we assume only that we are working in the Bayesian framework where Θ has some (possibly hierarchical) probability model. This general structure encompasses a broad class of models, including Bayesian multivariate regression, Bayesian MANOVA, and more complex hierarchical, spatial, or longitudinal models with multivariate responses.
The focus of the current study is the case where the Z_i are not observed directly. Instead, we observe vectors Y_i whose components are subject to censoring at known points c_ijl and c_iju (mnemonic devices for "lower" and "upper" endpoints). That is, for each component of Z_i, we learn either the actual value Z_ij, or only the partial information that Z_ij ∈ (c_ijl, c_iju). We assume that these censoring points are either a known function of the measurement process, or at least can be conditioned on subsequent to observing the data, so that all probability statements involving censored observations are conditional on fixed values of the censoring points. The motivating example for this structure is multivariate concentration data for constituents in water, which are subject to censoring of sufficiently low concentrations because of limitations of measurement techniques. For generality we allow c_ijl = −∞ and/or c_iju = +∞ to cover cases where components are left censored (c_ijl = −∞), right censored (c_iju = +∞), or completely unobserved (c_ijl = −∞ and c_iju = +∞). The double subscripting by i and j makes explicit the fact that the censoring patterns may vary both across observation vectors i and across components j within observation vectors. For notational convenience we partition the observation vectors Y_i as (Y_io, Y_ic), where Y_io are the observed coordinates and Y_ic are the censored coordinates. We let d_i = dim(Y_ic), 0 ≤ d_i ≤ k, denote the number of censored coordinates. We analogously partition the "complete data" Z_i as (Y_io, Z_ic). We denote the region in which the unobserved Z_ic can take values by C_i, which by the previous assumptions is a Cartesian product of the intervals (c_ijl, c_iju) over the censored coordinates. Finally, we follow common notational convention by denoting realizations of random vectors by bold lowercase, so that the observed data are denoted y = (y_1, ..., y_n).
Because of the censoring, the distribution of Y_i is not absolutely continuous. Rather, it is a mixture of continuous components on R^(k−d_i) and discrete components on the elements of the Cartesian product defining C_i. This complicates the likelihood function and poses the challenges in Bayesian computations that are the focus of this article. Using p generically to denote a density function (precisely which density function is indicated by its arguments) and using φ to denote marginal, conditional, or joint Gaussian densities conditional on Θ = θ, the density of observation i is of the form

p(y_i | θ) = φ(y_io | θ) Pr(Z_ic ∈ C_i | Y_io = y_io, θ) = φ(y_io | θ) ∫_{C_i} φ(z_ic | y_io, θ) dz_ic.   (2.1)

This is a k-dimensional multivariate normal rectangle probability if all coordinates are censored, the usual k-dimensional normal density if all coordinates are observed, and the product of a (k − d_i)-dimensional multivariate normal density function (the marginal density of the observed coordinates) and a d_i-dimensional multivariate normal rectangle probability (the conditional probability of the censored coordinates being censored given the observed coordinates) if d_i coordinates are censored. By the assumed conditional independence of the observations, the joint conditional density of the observations is

p(y | θ) = ∏_{i=1}^{n} p(y_i | θ).

The fundamental assumption of the current study is that it is either computationally infeasible or otherwise undesirable to perform exact calculations of the form given in (2.1) within the MCMC setting. This is a reasonable assumption because the evaluation involves numerical integrations that are likely to be prohibitively expensive to carry out in realistic problems. Such "nested integration" also arises in constrained parameter applications, as discussed by Chen and Shao (1998). In the current application, if one complete cycle through the parameters (i.e., one MCMC step) requires q evaluations of the joint likelihood function, then obtaining M posterior samples requires O(Mqn) numerical evaluations of multivariate normal rectangle probabilities.
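To make (2.1) concrete, the following is a minimal Python sketch (our own helper names; assuming at least one observed coordinate) that evaluates the density of one partially censored Gaussian observation: the marginal density of the observed coordinates times the conditional rectangle probability, here computed by inclusion-exclusion over the corners of C_i. Repeating this kind of exact calculation O(Mqn) times inside an MCMC run is precisely what the article assumes to be infeasible.

```python
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def censored_obs_density(y_obs, obs_idx, cen_idx, lower, upper, mean, cov):
    """Evaluate p(y_i | theta) as in (2.1): marginal Gaussian density of the
    observed coordinates times the conditional rectangle probability that the
    censored coordinates fall in C_i = prod_j (lower_j, upper_j)."""
    y_obs = np.asarray(y_obs, float)
    mean = np.asarray(mean, float)
    cov = np.asarray(cov, float)
    m_o = mean[obs_idx]
    S_oo = cov[np.ix_(obs_idx, obs_idx)]
    dens_obs = multivariate_normal(m_o, S_oo).pdf(y_obs)
    if len(cen_idx) == 0:
        return float(dens_obs)
    # Conditional law of the censored coordinates given the observed ones.
    S_co = cov[np.ix_(cen_idx, obs_idx)]
    S_cc = cov[np.ix_(cen_idx, cen_idx)]
    coef = S_co @ np.linalg.inv(S_oo)
    m_c = mean[cen_idx] + coef @ (y_obs - m_o)
    S_c = S_cc - coef @ S_co.T
    # Rectangle probability by inclusion-exclusion over the 2^d corners of C_i.
    mvn_c = multivariate_normal(m_c, S_c)
    prob = 0.0
    for corner in product(*zip(lower, upper)):
        sign = (-1) ** sum(c == lo for c, lo in zip(corner, lower))
        prob += sign * mvn_c.cdf(np.asarray(corner, float))
    return float(dens_obs * prob)
```

The inclusion-exclusion step requires 2^(d_i) multivariate normal CDF evaluations per observation, which illustrates why exact evaluation becomes burdensome as the number of censored coordinates grows.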
These probability calculations are notoriously challenging, particularly in high dimensions, and have received intense study. A battery of both analytical and stochastic methods has been developed, including functional expansions of the integral (Dutt 1973), re-expression followed by numerical integration (Schervish 1984), and numerous Monte Carlo integration procedures (Geweke 1989; Genz 1992; Keane 1994; Breslaw 1994; Hajivassiliou, McFadden, and Ruud 1996; Vijverberg 1997). The articles by Hajivassiliou et al. (1996) and Gassmann, Deák, and Szántai (2002) provide excellent overviews of the available methods.

THE ROLE OF CENSORING IN BAYESIAN CROSS-VALIDATION CALCULATIONS
This section first reviews cross-validation and predictive density, and then discusses their implementation in the context of censored data. Cross-validation begins by partitioning y into n_0 vectors of training data y^(0) and n_1 vectors of testing data y^(1), with n_0 + n_1 = n. In some cases, commonly called "leave one out" cross-validation (Stone 1974, 1977; Mosteller and Tukey 1977; Geisser and Eddy 1979; Gelfand, Dey, and Chang 1992), n_1 = 1. In other cases, the training data consist of approximately half of the data (Mosteller and Tukey 1977) or other unbalanced allocations (Alqallaf and Gustafson 2001). Then, each model of interest is fit to y^(0) and compared in some manner to y^(1) to assess the degree of agreement between the testing data and the inferences concerning those data that are made by the model. Most often the quantification of agreement is a scalar function of h(y) = E(Y^(1) | y^(0)) − y^(1), such as the sum of squared deviations of predicted values from the observed testing values. Within a set of models being compared, models with smaller values of h(y) are preferred, and h(y) for the best model might be compared to an external standard depending on the substantive problem and the intended uses of the model inferences.
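For fully observed data the traditional criterion is trivial to compute; a sketch (our own function name) that also makes plain why it has no direct analogue once components of y^(1) are censored: the criterion needs every testing value, not interval information.

```python
import numpy as np

def cv_sum_squares(y_test, y_pred):
    """Traditional cross-validation criterion: sum of squared deviations of
    predicted from observed testing values. It requires fully observed
    y_test, which is exactly what component-wise censoring breaks."""
    y_test = np.asarray(y_test, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.sum((y_test - y_pred) ** 2))
```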
Although such summaries can be a natural quantification of agreement with auspicious properties in some settings, they are not directly applicable to the censored data structures presented in Section 2 because the incomplete nature of the data makes h(y) uninterpretable. Although one could modify the criterion to treat the observed and censored observations separately, this is ad hoc and cumbersome when the patterns of censoring and the censoring points vary across vectors. The posterior predictive density of the testing data given the training data (see, e.g., Schervish 1995, p. 18) is a viable alternative with a number of advantages over functions of h(y). Under the assumption that Y_1, ..., Y_n are conditionally independent given their associated covariates and Θ, the density is given by

p(y^(1) | y^(0)) = ∫_Ω p(y^(1) | θ) p(θ | y^(0)) dθ = ∫_Ω [∏_{i=1}^{n_1} p(y_i^(1) | θ)] p(θ | y^(0)) dθ.   (3.1)

Like h(y), the predictive density can be used to quantify agreement between the predictions made by the model and the observed testing data, and allows comparisons of models with different parametric structures because the criterion integrates over the model parameters. More importantly, the predictive density is well-defined for even complex data structures, and is closely related to the Bayes factor. The ratio of predictive densities of the testing data for two different models is known as a "partial" Bayes factor (O'Hagan 1995) because it is equivalent to the Bayes factor assuming that the training data were used to form the prior. Although partial Bayes factors are primarily motivated by applications with improper prior distributions, their generally more stable estimability makes them a pragmatic choice even with proper priors.
The most straightforward method of evaluating the predictive density in (3.1) (outside of simple cases where the integral can be evaluated analytically) is to obtain a sample θ_1, ..., θ_M from p(θ | y^(0)) by MCMC methods and to approximate the predictive density by

p̂(y^(1) | y^(0)) = (1/M) ∑_{m=1}^{M} p(y^(1) | θ_m).   (3.2)

Additional computational stability is achieved by estimating log p(y^(1) | y^(0)) instead. Letting L_m = log p(y^(1) | θ_m) and B = max_m L_m, the final estimate is

log p̂(y^(1) | y^(0)) = B + log[(1/M) ∑_{m=1}^{M} exp(L_m − B)].   (3.3)

This form of the calculation is more numerically stable because it applies the exponential function to numbers that are location shifted relative to the maximum, and thus less likely to result in numerical values of zero after exponentiation. However, censoring presents computational challenges to this straightforward algorithm. Direct implementation of the estimate in (3.2) requires evaluating conditional densities of the form given in (2.1), both for obtaining the sample θ_1, ..., θ_M from p(θ | y^(0)) and for evaluating the conditional predictive densities p(y^(1) | θ_m). As noted in Section 2, we are assuming that it is not feasible to evaluate such densities. It is reasonably straightforward to sidestep the evaluation of (2.1) while obtaining a sample from p(θ | y^(0)) using data augmentation (Tanner and Wong 1987; van Dyk and Meng 2001). Letting z^(0) denote the unobserved portions of the training data (z_1c, ..., z_{n_0 c}), the logic of data augmentation is to obtain a sample from p(θ | y^(0)) by sampling p(θ, z^(0) | y^(0)) and discarding the z^(0). The sample is obtained by iteratively simulating from p(θ | z^(0), y^(0)) and p(z^(0) | θ, y^(0)), which by standard MCMC results converges in distribution to p(θ, z^(0) | y^(0)).
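The location-shift computation in (3.3) is a standard log-sum-exp device; a minimal sketch (function name ours), given the log conditional densities for each sampled parameter:

```python
import numpy as np

def log_predictive_density(log_cond_dens):
    """Stable evaluation of log[(1/M) * sum_m p(y^(1) | theta_m)] from
    log_cond_dens[m] = log p(y^(1) | theta_m), via the shift in (3.3)."""
    L = np.asarray(log_cond_dens, float)
    B = L.max()  # location shift relative to the maximum
    return float(B + np.log(np.mean(np.exp(L - B))))
```

Without the shift, exponentiating log densities on the order of −2000 would underflow to zero; with it, the largest term exponentiates to exactly 1.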
Sampling p(θ | z^(0), y^(0)) proceeds exactly as MCMC would normally proceed had the data been fully observed, with the usual multivariate normal likelihood function rather than the censored likelihood in (2.1). Sampling p(z^(0) | θ, y^(0)) is slightly more complicated. Because the data vectors are assumed to be conditionally independent given Θ, we focus on a single observation vector without loss of generality. The conditional distribution of z_ic^(0) given y_io^(0) and θ is a multivariate normal distribution restricted to lie in the set C_i; that is, a truncated multivariate normal distribution. Obtaining exact samples from this distribution in an efficient manner is not easy. A naive rejection sampler will be exceedingly inefficient when Pr(C_i | y_io^(0), θ) is small, while more sophisticated methods such as importance sampling would require evaluating the multivariate normal probabilities that the data augmentation algorithm is trying to avoid in the first place. Fortunately, a straightforward Gibbs sampling algorithm using the full conditional distribution of each censored coordinate can be used. The conditional distribution of a single censored coordinate j of z_ic^(0) given θ, y_io^(0), and imputed values of all of the other censored coordinates is a truncated univariate normal with range of the form (c_ijl, c_iju). This distribution is easily sampled with the inverse CDF method using a restricted uniform random variable. Thus, replacing the step of sampling from p(z^(0) | θ, y^(0)) with Gibbs sampling steps for each censored coordinate of each observation vector, combined with iteratively sampling from p(θ | z^(0), y^(0)), results in samples that converge in distribution to p(θ, z^(0) | y^(0)).
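The inverse-CDF draw for a single censored coordinate can be sketched as follows (assuming the conditional mean and standard deviation have already been computed from θ, the observed coordinates, and the imputed values of the other censored coordinates; names ours):

```python
import numpy as np
from scipy.stats import norm

def sample_truncnorm(mu, sigma, lo, hi, rng):
    """Inverse-CDF draw from N(mu, sigma^2) truncated to (lo, hi):
    draw a uniform restricted to (Phi((lo-mu)/sigma), Phi((hi-mu)/sigma))
    and map it back through the normal quantile function."""
    a = norm.cdf((lo - mu) / sigma)
    b = norm.cdf((hi - mu) / sigma)
    return mu + sigma * norm.ppf(rng.uniform(a, b))
```

Unlike naive rejection sampling, this step never wastes draws, regardless of how little probability the truncation interval carries.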
The other required part of the predictive density calculation is the evaluation of p(y^(1) | θ_m) for each sampled parameter value. Although the imputation of censored coordinates via data augmentation avoids calculations of the form in (2.1), and thus is valuable in obtaining a sample from p(θ | y^(0)), such methods provide little help in calculating predictive densities for censored data.
To demonstrate why imputation of censored coordinates in the testing data is not useful, first consider sampling the censored observations from their conditional distributions given the observed coordinates, θ, and the fact that the censored observations are known to lie in the censoring regions C_i. We call these the constrained conditional distributions of the censored coordinates; these are the distributions that are used in the data augmentation algorithm for sampling p(θ | y^(0)). As noted previously, sampling directly from these multivariate truncated normal distributions is difficult, and this difficulty is further exacerbated by the fact that we cannot use the Gibbs sampling method discussed previously unless we nest that algorithm within the existing MCMC framework. That is, we would need to carry out an auxiliary Gibbs sampler for each sampled value of θ_m. More problematically, plugging the imputed values of the censored coordinates into the joint conditional density of the complete testing data given θ (as if the data had been fully observed) implies that marginally, for each observation vector, we will be estimating

φ(y_io | θ) [ ∫_{C_i} φ(z_ic | y_io, θ)^2 dz_ic / Pr(Z_ic ∈ C_i | y_io, θ) ].

Because the term in square brackets does not equal Pr(Z_ic ∈ C_i | y_io, θ) in general, this strategy does not perform the correct calculation.
A related strategy samples the censored coordinates from their unconstrained conditional distributions; that is, their conditional distributions given the observed coordinates and θ, ignoring the fact that the censored observations are known to lie in the censoring regions C_i. These can be used to calculate a simulation-consistent estimate of the predictive density using a "brute force" Monte Carlo algorithm, but this algorithm suffers from unacceptably large Monte Carlo variance. The approach is, for each observation vector, to sample the censored coordinates from their unconstrained conditional distributions. If all sampled coordinates happen to fall in C_i (i.e., if they happen to be plausible censored values), set the estimate of p(y_i^(1) | θ) to φ(y_io | θ); otherwise set it to zero. Under this algorithm, the only way a nonzero value of p(y^(1) | θ) is achieved for a particular iteration is when the simulated missing data from every observation vector, as sampled from their unconstrained distributions, happen to fall in their respective censoring regions. Such a strategy, while having the correct expected value (see Section 4.1), will suffer from extreme Monte Carlo variance. A similar strategy, in which we factor the joint density in the reverse order and sample from the unconstrained marginal distribution of the censored coordinates, has the same shortcoming.
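A sketch of the brute-force indicator scheme for a single rectangle probability (our own names) shows why it is unbiased but potentially extremely variable: when the rectangle probability is small, nearly all draws contribute zero, so the estimator is a rescaled Bernoulli with large relative variance.

```python
import numpy as np

def brute_force_rect_prob(mean, cov, lower, upper, n_draws, rng):
    """Brute-force estimate of Pr(lower < Z < upper), Z ~ N(mean, cov):
    draw from the unconstrained normal and average the indicator of
    landing inside the rectangle. Returns (estimate, Monte Carlo s.e.)."""
    Z = rng.multivariate_normal(mean, cov, size=n_draws)
    inside = np.all((Z > lower) & (Z < upper), axis=1).astype(float)
    return inside.mean(), inside.std(ddof=1) / np.sqrt(n_draws)
```

In the predictive density calculation this indicator must be nonzero simultaneously for every testing vector, so the variance problem compounds across observations.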

TWO MONTE CARLO APPROACHES
We provide two strategies to reduce the Monte Carlo variance in the calculation of predictive densities for multivariate censored data. The first is a refined version of the brute force method described in the last section, using a more efficient Monte Carlo estimator for the conditional density in (2.1). It is most useful when the effective parameter space is small enough to permit stable Monte Carlo integration over the entire space. When this fails, the second method, which conditions sequentially on components of the data (thus capitalizing on the strengths of data augmentation), can be used for more stable estimation.

(METHOD 1) DIRECT PREDICTIVE DENSITY CALCULATION
The brute force method of Section 3 is the most naive implementation of a more general class of methods that, rather than evaluating p(y_i^(1) | θ) exactly, replace it with an unbiased Monte Carlo estimate. Specifically, let W_i be a random vector with distribution p(w_i | y_i^(1), θ), and let h_i(W_i; y_i^(1), θ) be a function satisfying E[h_i(W_i; y_i^(1), θ)] = p(y_i^(1) | θ). Because the random variables h_i(W_i; y_i^(1), θ), i = 1, ..., n_1, are conditionally independent given y^(1) and θ,

E[∏_{i=1}^{n_1} h_i(W_i; y_i^(1), θ)] = ∏_{i=1}^{n_1} p(y_i^(1) | θ) = p(y^(1) | θ).   (4.1)

The law of iterated expectations implies that the marginal mean of these random variables is the desired predictive density in (3.1). The brute force method takes p(w_i | y_i^(1), θ) to be the unconstrained conditional distribution of the censored coordinates given the observed coordinates and θ, and sets h_i(W_i; y_i^(1), θ) = φ(y_io | θ) 1{W_i ∈ C_i}. An obvious improvement is to use more efficient, unbiased estimates of the multivariate rectangle probabilities that also can be obtained without extensive computing effort. Although many Monte Carlo methods are available (Gassmann et al. 2002; Hajivassiliou et al. 1996), the method that we propose is the "Geweke-Hajivassiliou-Keane (GHK)" Monte Carlo simulator for multivariate normal rectangle probabilities (Geweke 1989; Keane 1994; Hajivassiliou, McFadden, and Ruud 1996). Both the study by Hajivassiliou et al. (1996) and another study by Geweke, Keane, and Runkle (1997) indicate that this simulator offers an effective balance of accuracy and computational speed. Details on the method are provided in the Appendix.
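The Appendix gives the authors' exact formulation; the following is a generic sketch of the GHK simulator for Pr(lower < Z < upper) with Z ~ N(mean, cov) (variable names ours). It writes Z = mean + Lη with L the Cholesky factor, samples each innovation η_j from its feasible truncated range via the inverse CDF, and averages the product of the univariate truncation probabilities, so every draw contributes a value strictly between 0 and 1 rather than an indicator.

```python
import numpy as np
from scipy.stats import norm

def ghk_rect_prob(mean, cov, lower, upper, n_draws, rng):
    """GHK estimate of Pr(lower < Z < upper) for Z ~ N(mean, cov)."""
    mean = np.asarray(mean, float)
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    d = len(mean)
    L = np.linalg.cholesky(np.asarray(cov, float))
    total = 0.0
    for _ in range(n_draws):
        eta = np.zeros(d)
        weight = 1.0
        for j in range(d):
            # Feasible range for eta_j given the earlier innovations.
            shift = mean[j] + L[j, :j] @ eta[:j]
            a = norm.cdf((lower[j] - shift) / L[j, j])
            b = norm.cdf((upper[j] - shift) / L[j, j])
            weight *= b - a  # Pr(eta_j lands in its feasible interval)
            if weight == 0.0:
                break
            eta[j] = norm.ppf(rng.uniform(a, b))  # truncated draw, inverse CDF
        total += weight
    return total / n_draws
```

Note that for a diagonal covariance matrix the weight is the same for every draw, so the simulator is exact; correlation is what introduces Monte Carlo variability.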
The primary benefit of the proposed method is that it maintains the relative simplicity and computational efficiency of the brute force method, while providing a more statistically efficient estimator because 0 < h_i(W_i; y_i^(1), θ) < 1. Additional statistical efficiency can be achieved by setting h_i(W_i; y_i^(1), θ) to the average of the function over K realizations from the GHK simulator. Because the GHK simulator is consistent, letting K → ∞ provides an exact evaluation of the multivariate normal probabilities and thus of the integrand in (3.1). As noted previously, such exact evaluations will be infeasible in many realistic problems. Moreover, we performed some empirical investigations with various values of K > 1 in the context of the example presented in Section 5 and found that the reduction in the Monte Carlo standard error of the predictive density estimate was modest relative to the additional computational burden. This is because many, and in some cases most, sampled values θ_m contribute little to the overall estimate, and thus obtaining improved precision of the integrals conditional on these parameters does not provide much additional stability to the overall calculation.

(METHOD 2) SEQUENTIAL PREDICTIVE DENSITY CALCULATION
The direct predictive density calculation is carried out via a single Monte Carlo integral over the entire parameter space. However, in some cases the predictive density estimates are subject to considerable Monte Carlo variability, even in long chains. The regions of the parameter space that result in high conditional densities for the testing data can have low probabilities under the posterior distribution based on the training data. Thus, these regions may be visited only rarely in a typical MCMC analysis, similar to the situation faced when trying to calculate the marginal density of the data (e.g., for Bayes factors) using a Monte Carlo sample from the prior distribution. (This is why expending computing resources to obtain precise estimates of the conditional predictive density p(y^(1) | θ_m) for every θ_m in the direct calculation is not very effective.) Part of the motivation for using cross-validation rather than Bayes factors is that this mismatch should be less pronounced; nevertheless, it may still arise, especially in richly parameterized models.
Moreover, the problem is exacerbated as the dimension k of the data vectors increases. Heuristically, in order to assign a high predictive density to a given data vector, a sampled parameter θ must be consistent with each of the k coordinates. If M iterations are required to calculate the predictive density within a given level of Monte Carlo variability when k = 1, then O(M^k) iterations may be necessary to achieve the same order of control of Monte Carlo error when the dimension is k > 1. This is especially true when the dimension of the parameter space itself increases with k, as is extremely common in parametric models that include additional parameters for means, variances, and correlations with each new coordinate. The additional size of the parameter space may make the direct predictive density calculations subject to unacceptably large Monte Carlo variability; examples of this behavior are presented in Section 5. This section presents an alternative method for estimating the predictive density that can substantially reduce the Monte Carlo variability. Although motivated by predictive evaluations for censored observations, the concepts and methods are applicable even to fully observed multidimensional data structures.
Let y_{·j}^(1) = (y_{1j}^(1), ..., y_{n_1 j}^(1)) denote the vector of n_1 observations of coordinate j from the testing data. Let y_{i<j}^(1) denote coordinates 1, ..., j − 1 of observation vector i in the testing data, and let y_{·<j}^(1) denote the collection of coordinates 1, ..., j − 1 from all testing observations, in both cases defined to be empty for j = 1. Then the predictive density for the testing data given the training data can be expressed as a product of conditional predictive densities:

p(y^(1) | y^(0)) = ∏_{j=1}^{k} p(y_{·j}^(1) | y_{·<j}^(1), y^(0)),

where, through conditional independence, each term on the right-hand side can be expressed as the following integral over the parameter space:

p(y_{·j}^(1) | y_{·<j}^(1), y^(0)) = ∫_Ω [∏_{i=1}^{n_1} p(y_{ij}^(1) | y_{i<j}^(1), θ)] p(θ | y_{·<j}^(1), y^(0)) dθ.   (4.2)

That is, the desired predictive density can be calculated as a product of k separate integrals rather than one omnibus integral (or, in practice, as a sum of the logarithms of k separate integrals, as discussed in Section 3). Expressing it in this manner provides several advantages that should help to reduce Monte Carlo variability. Because each successive integral conditions on more of the testing data, the posterior distribution concentrates sequentially on regions of the parameter space that are likely to result in high conditional density for the next coordinates. Equally important, the sequential calculation obviates the need for calculating multivariate normal rectangle probabilities, because each integral involves only products of univariate conditional densities. The sequential calculation is valid for any ordering of the coordinates and, more generally, is not restricted to the prediction of a single coordinate given all previous coordinates: the vectors could instead be partitioned into blocks of arbitrary sizes.
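The factorization underlying the sequential method is easiest to verify in the fully observed Gaussian case, where the joint log density decomposes exactly into a sum of univariate conditional log densities, each with the usual conditional mean and variance. A sketch (our own code; under censoring the univariate density factors become interval probabilities, as discussed below):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def sequential_logpdf(y, mean, cov):
    """Sum of univariate conditional log densities
    log p(y_1) + log p(y_2 | y_1) + ... + log p(y_k | y_1, ..., y_{k-1})
    for a multivariate normal vector; equals the joint log density."""
    y = np.asarray(y, float)
    mean = np.asarray(mean, float)
    cov = np.asarray(cov, float)
    total = 0.0
    for j in range(len(y)):
        if j == 0:
            m, v = mean[0], cov[0, 0]
        else:
            S_prev = cov[:j, :j]
            s = cov[j, :j]
            coef = np.linalg.solve(S_prev, s)  # regression coefficients
            m = mean[j] + coef @ (y[:j] - mean[:j])
            v = cov[j, j] - coef @ s
        total += norm.logpdf(y[j], loc=m, scale=np.sqrt(v))
    return float(total)
```

Each factor involves only univariate normal computations, which is precisely why the sequential method never needs multivariate rectangle probabilities.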
The primary disadvantage of the sequential calculation is that it requires the model to be fit k times. The first fit uses only the training data and calculates the predictive density of the collection of first coordinates of the testing data. The remaining k − 1 fits use the training data together with the first j coordinates of the testing data, treating the remaining k − j coordinates of the testing data as missing, for j = 1, ..., k − 1. However, the inconvenience of the multiple model fits is minor in cases where direct calculations result in estimates too variable to be informative.
In addition, note that in general it will be necessary to condition on censored or missing coordinates in the sequential calculation. To account for the censoring, we again use data augmentation. Let z_{·<j}^(1) denote imputed values of the censored coordinates among the first j − 1 coordinates of the testing data. Given complete data for the first j − 1 coordinates, each factor in the integrand of (4.2) is either a univariate normal density (if y_{ij}^(1) is observed) or a univariate normal interval probability (if y_{ij}^(1) is censored), both of which can be calculated efficiently with standard routines. We use data augmentation to sample from p(θ, z_{·<j}^(1) | y_{·<j}^(1), y^(0)) in much the same way as we did to sample p(θ | y^(0)). We use Gibbs sampling steps to impute each censored coordinate from its conditional distribution given the observed coordinates, the imputed values of all other censored coordinates, and θ. Then, conditional on the complete data, we sample θ from p(θ | z_{·<j}^(1), y_{·<j}^(1), y^(0), z^(0)) exactly as if the data had been fully observed. The integral in (4.2) can then be approximated by averaging the integrand over the MCMC draws from p(θ, z_{·<j}^(1) | y_{·<j}^(1), y^(0)) (as before, practical applications should work on the logarithm scale, as in (3.3)).

EXAMPLE
This section presents an application of the methods to the comparison of models for joint distributions of contaminants in community water systems. Such models are used to predict raw (untreated) water contaminant profiles in all community water systems in the country, helping to quantify uncertainty during the regulatory decision process that sets maximum contaminant levels for drinking water supplies. The cross-validation approach to comparing models is particularly appropriate because predictive validity is central to the application. The details of the substantive problem, the data sources, and the class of models under consideration can be found in Lockwood, Schervish, Gurian, and Small (2004) and Gurian, Small, Lockwood, and Schervish (2001), and are only briefly summarized here.
The data, from the National Arsenic Occurrence Survey (Frey and Edwards 1997), consist of raw water arsenic (As), manganese (Mn), and uranium (U) concentrations from 482 of the approximately 55,000 U.S. community water systems. For each system we know the U.S. state and the source water type (ground water versus surface water), and the goal is to use these limited data to estimate joint raw water distributions of the three substances as a function of location and source water type. That is, we would like to estimate a distribution in each of the 100 cells defined by the 50 states and two source water types. This challenge is made more formidable by the fact that the observation vectors are subject to considerable censoring. The observations in the dataset were measured with a standardized protocol that resulted in left censoring points of 0.5 µg/L for each of the three substances. Fewer than 30% of the observation vectors are fully observed; Figure 1 presents scatterplots of the log contaminant concentrations.
Estimating 100 multivariate distributions from 482 data vectors while maintaining predictive validity requires choosing model complexity that respects the inferential limitations of the available data. The two primary aspects of model complexity that we explored are the spatial resolution of the model and whether the models do or do not include parameters that model residual covariance among contaminants. In particular, we present comparisons of ten different models organized in a two-by-five design. The first factor is whether the contaminants are modeled with independent lognormal distributions within each cell, or with general multivariate lognormal distributions within each cell. The other factor compares a sequence of five increasingly rich spatial resolutions for the model, as follows:
• 1 region (no spatial differentiation);
• 2 regions (Eastern and Western regions of the U.S.);
• 7 regions (defined by the NAOS surveyors (Frey and Edwards 1997));
• 50 states with no spatial correlation;
• 50 states with a distance-based spatial correlation structure for the contaminant means and variances.
In all cases we allow different distributions in ground water and surface water. For all of the multivariate lognormal models that allow nondiagonal covariance matrices within cells, separate correlation matrices for ground water and surface water were estimated, but these correlation matrices were assumed to be common across locations. In all of the models, all of the three-dimensional contaminant observation vectors are assumed to be conditionally independent given the parameters of the model, with correlation between locations and/or source types modeled through the distribution of the parameters. Full details of the model structure, prior distributions, estimation algorithms, and diagnostics can be found in Lockwood et al. (2004).

We use cross-validation predictive densities to help guide our choice about the most effective model structure given the resolution of the data. We randomly split the 482 observations in half into training and testing datasets, and used MCMC to fit each of the ten models under consideration to the training data. We then calculated the predictive density for the testing data using the methods discussed previously. For the five models that treated the contaminants independently within cells, we fit separate models of the appropriate spatial structure to each contaminant, with the predictive density for each contaminant estimated using the straightforward methods of Section 3. The overall predictive density for the testing data was then estimated as the product of the individual predictive densities for each contaminant. For the five models that explicitly respected the multivariate structure of the data, the predictive densities were estimated using the two methods of Section 4.
We also compare our cross-validation predictive density results to those obtained by the deviance information criterion (DIC) (Spiegelhalter et al. 2002), which generalizes AIC to richly parameterized Bayesian hierarchical models. DIC requires the posterior mean of the log-likelihood function, denoted L̄, as well as the log-likelihood function evaluated at the posterior mean of θ, denoted L(θ̄). Apart from a normalizing constant that depends only on the observed data and thus does not affect model comparison, DIC is given by −4L̄ + 2L(θ̄), and models with smaller values are favored. One of the strengths of DIC is that it is relatively easy to calculate. However, censored observations introduce an additional complication: the logs of the Monte Carlo estimates of the multivariate normal rectangle probabilities that were used to sidestep the exact calculation of the conditional densities are not unbiased estimates of the logs of those probabilities. By Jensen's inequality, the log of an estimator such as that on the left side of (4.1) is negatively biased for log p(y_i |θ). Thus, it is necessary to perform exact (or nearly exact) calculations of the multivariate normal probabilities for each sampled parameter vector, which, as discussed previously, is computationally infeasible to do for every draw. Fortunately, because the integrand in (5.1) is generally largest near the mode of the posterior distribution, the number of sampled parameters required to obtain stable estimates of DIC is lower than that for predictive densities. In the current application we estimated L̄ using every 1,000th parameter vector from our MCMC samples. Repeated applications of this procedure indicated that this was sufficient to obtain a reasonably stable estimate of DIC for each model. The results are summarized in Figure 2, which provides the estimated log predictive density (LPD) of the testing data under the ten models.
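For concreteness, the DIC point estimate −4L̄ + 2L(θ̄) can be assembled directly from MCMC output. The sketch below assumes a generic, user-supplied loglik(y, theta) function; in the censored setting of this article each such evaluation would itself require a near-exact multivariate normal rectangle probability.

```python
import numpy as np

def dic(y, theta_draws, loglik):
    """DIC = -4 * Lbar + 2 * L(theta_bar), where Lbar is the posterior
    mean of the log-likelihood over the MCMC draws and L(theta_bar) is
    the log-likelihood at the posterior mean of the parameters.
    Models with smaller DIC are favored."""
    # posterior mean of the log-likelihood, averaged over sampled draws
    Lbar = np.mean([loglik(y, th) for th in theta_draws])
    # log-likelihood evaluated at the posterior mean of theta
    theta_bar = np.mean(theta_draws, axis=0)
    return -4.0 * Lbar + 2.0 * loglik(y, theta_bar)
```

In practice theta_draws would be thinned (e.g., every 1,000th vector, as above) since DIC stabilizes with far fewer draws than the predictive density calculations.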
The six different sequential estimates for each spatial complexity derive from the six possible orderings of the three contaminants. The LPD estimates are scaled relative to the lowest LPD among all models (the independence model with no spatial differentiation). The DIC values for each of the ten models are provided in the right frame of Figure 2, again scaled relative to the least effective model (with the highest DIC). For consistency, in all cases the estimates are based on three million parameter vectors obtained via MCMC from the appropriate posterior distribution given the training data. This number of iterations was chosen to accommodate the most difficult calculations (direct LPD calculation for the spatially rich models); estimates for the simpler models, and for the sequential calculation regardless of model, in practice required far fewer MCMC iterations for reasonable stability. In addition to the LPD point estimates marked by "x", the plots indicate the variability of the estimates in consecutive blocks of 100,000 parameter vectors (Han and Carlin 2001). All models were run on a 1.5 GHz PC running Linux; computing times for each model ranged from approximately 600 minutes to 1,500 minutes depending on the complexity of the model.
The two primary questions that underlie the model comparison are: (1) Are the contaminants sufficiently highly correlated that explicitly modeling the multivariate structure is advantageous? and (2) What is a pragmatic degree of spatial differentiation for the contaminant distributions given the coarse resolution of the data? The general trends evident in Figure 2 are that for the models with a simple spatial structure (one or two regions), there is a clear predictive advantage to modeling additional correlation among the contaminants via the multivariate lognormal, and that the models with the richer spatial structure offer a clear predictive advantage relative to the simpler spatial models.
However, there is a substantively critical interplay between these inferences and the method used to estimate the LPD for the joint models. The direct calculation method seems to suggest that conditional on a richer spatial resolution of the model, there is no additional benefit to modeling the residual correlations among the contaminants. In fact, it would appear that the fully multivariate models predict the testing data more poorly than the independence models when the state-based spatial structure is used. However, notice that there is a much higher degree of Monte Carlo variability in the LPD calculations for the spatially rich models. Extensive empirical investigation of the root of this variability revealed that the parameter spaces for the richer spatial models are so large that Monte Carlo integrals that try to average over the whole space simultaneously are subject to an overwhelming degree of Monte Carlo variability. The regions of the parameter space that concurrently make effective predictions for all contaminants are so small that even three million iterations are not sufficient to visit them frequently (or at all), and thus the integral is estimated inaccurately. This is why the direct calculation would imply that it is more effective to model the contaminants independently when there is richer spatial structure.
This motivates the sequential calculation in which each successive model fit spends more time in the parts of the parameter space most consistent with the testing data because it learns portions of the testing data in turn. This can dramatically improve estimation precision, as shown in Figure 2. When the effective parameter space is reasonably small, the direct and sequential methods give the same results, as expected. However, when the model complexity grows, the estimates from the sequential method are shifted toward higher values than those from the direct methods because they concentrate more heavily on the relevant portions of the parameter space. These regions are never even visited by the direct calculation, explaining the almost complete lack of overlap of the distributions for the two methods. The improved estimates imply different substantive conclusions as well, as the LPD for the joint model is higher than that for the independence model for all spatial resolutions. Thus, even with the richer spatial structure, there is additional predictive power in modeling residual correlation between the contaminants. Most interestingly, the calculations provide insight into the appropriate degree of spatial complexity for the model. Very little spatial differentiation is suboptimal, but so is ignoring large-scale spatial trends in the contaminant distributions. Exploiting these spatial trends by using data from surrounding states to inform the distributions for a given state provides more efficient estimates, and ultimately, more effective predictions of external data.
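The sequential factorization underlying this calculation, log p(y_test | y_train) = Σ_i log p(y_i | y_train, y_1, ..., y_{i−1}), can be illustrated with a conjugate normal model in which each one-step-ahead predictive density is available in closed form. This is a toy sketch (known variance, hypothetical prior values), not the MCMC-based calculation of Section 3, but it shows how each successive term conditions on the testing data already processed.

```python
import numpy as np
from scipy import stats

def sequential_lpd(train, test, sigma=1.0, prior_mu=0.0, prior_tau=10.0):
    """Sequential estimate of log p(test | train) via the telescoping
    factorization sum_i log p(y_i | train, y_1, ..., y_{i-1}).
    Conjugate normal model with known sigma, so each one-step-ahead
    predictive is exact and no MCMC is needed for this sketch."""
    data = list(train)
    lpd = 0.0
    for y in test:
        n = len(data)
        # posterior for the mean given all data conditioned on so far
        prec = 1.0 / prior_tau**2 + n / sigma**2
        post_var = 1.0 / prec
        post_mu = post_var * (prior_mu / prior_tau**2 + np.sum(data) / sigma**2)
        # one-step-ahead predictive: N(post_mu, post_var + sigma^2)
        lpd += stats.norm.logpdf(y, loc=post_mu, scale=np.sqrt(post_var + sigma**2))
        data.append(y)  # fold this test point into the conditioning set
    return lpd
```

Because each term conditions on the testing observations already seen, the sampler (in the MCMC version) concentrates on the regions of the parameter space consistent with the testing data, which is the source of the variance reduction noted above.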
The sequential estimates illustrate an interesting aspect of the role of missing data in the sequential calculation. The primary advantage that the direct calculation has over the sequential calculation is that all composite terms (i.e., all summands in (3.3)) involve a conditional density of censored coordinates given observed coordinates. As discussed in Section 3, the sequential calculation must calculate the conditional density of an observed coordinate given imputed values of missing coordinates obtained via data augmentation. Although the expected value of the sequential estimate is invariant to this as well as to the order of conditioning, certain orders of conditioning may be more advantageous than others depending on the particular data and models under consideration. As is evident in Figure 2, three orders of conditioning result in higher LPD estimates than the other three orders for all but the model with no spatial differentiation. In-depth examination of this case revealed that there was an influential observation with a relatively large arsenic concentration, but for which the uranium coordinate was censored. The estimated contribution of that observation to the overall LPD was sensitive to whether the calculation used the conditional density of the censored coordinate given the observed one, or used the conditional density of the observed one given imputed values of the censored one. The sensitivity was exacerbated by the relatively strong within-cell correlation of these two contaminants. In this case it seems likely that the three orders of conditioning that result in higher LPD values are providing estimates that are closer to the truth because the distributions for the lower three orders have higher variances and generally have upper tails that support values that are consistent with the estimates provided by the higher orders. 
As a practical matter, it is thus advisable to try different orderings of conditioning for the coordinates, as these may reveal subtly influential observation vectors.
Finally, it is interesting to note that while the DIC and sequential predictive density criteria agree that both the independence models and the models that do not provide the full 50-state spatial differentiation are inferior, they disagree somewhat on which of the two 50-state models is preferred. The sequential predictive density calculation favors the 50-state model with spatial correlation, while DIC slightly favors the 50-state model without spatial correlation. Of course, this is based on only a single split of the data, but it does raise a general question about the circumstances under which the two criteria might provide different results.

SUMMARY AND DISCUSSION
Cross-validation is an effective means to compare complex Bayesian models in which formal evaluations may be more difficult. Multivariate censored data present additional challenges by confounding traditional cross-validation criteria. This study focused on the posterior predictive density of the testing data given the training data as an alternative criterion, presenting two approaches for calculating predictive densities for censored data in MCMC applications. Which method is most appropriate may depend on the complexity of the model; the sequential approach helps to reduce Monte Carlo variability and can be used to make decisions where the direct methods are unclear or potentially misleading.
Both computational approaches apply to all predictive density calculations of the form p(y r |y −r ) [e.g., the conditional predictive ordinates of Geisser and Eddy (1979) and others reviewed by Gelfand and Dey (1994)], not only those involving half splits of the data, and thus have direct applicability to formal Bayes factor calculations. In models with sufficiently small parameter spaces and/or sufficiently informative prior distributions, the methods may be applied exactly as described here with a null training dataset (e.g., by direct integration with respect to the prior distribution). The methods also can be coupled straightforwardly with more advanced importance sampling or derivative methods for calculating marginal densities (Han and Carlin 2001) in more complex settings.
As noted, the sequential approach is useful even with fully observed data; it is similar in spirit to calculating marginal densities one observation at a time by sequentially conditioning on observation vectors. Partitioning the data by components rather than by observation vectors [akin to Chib's (1995) method for calculating marginal densities from the Gibbs sampler] limits the number of integrals to k rather than n, and is particularly well-suited to models where the parameter space is approximately partitioned by coordinate.
A promising avenue for future research is the exploration of the trade-off between Monte Carlo errors for the different levels of nesting required for the density calculation. Our experience in this problem indicated that it is likely more valuable to spend time sampling additional parameter vectors than obtaining more precise Monte Carlo integral estimates for a given parameter vector. However, the value (in terms of reducing the overall Monte Carlo variance of the final estimate) of precise estimates of the "inner" or conditional integral depends on the particular value of the parameter vector. It would be useful, for this problem and the numerous problems like it, to develop adaptive algorithms that spend time obtaining precise estimates for only those values of the parameter vector that have high leverage in the overall estimate.
Finally, although the results are presented for censored multivariate Gaussian data, the principles carry over to non-Gaussian data, but may present difficulties if conditional densities cannot be evaluated in closed form. Further exploration of this issue, and empirical investigations of performance in higher dimensional Gaussian problems, are ongoing.

APPENDIX: GHK SIMULATOR
Let Z be a d-dimensional random vector with a nonsingular N_d(µ, Σ) distribution. (Note that in the context of the current study, this should be considered the unconstrained conditional distribution of the d censored coordinates given the k − d observed coordinates and the current value of θ in the MCMC.) The GHK simulator is a Monte Carlo method for estimating Pr(Z ∈ C), where C = {(z_1, . . . , z_d) : c_{jl} ≤ z_j ≤ c_{ju}, j = 1, . . . , d}. We summarize the method as defined by Hajivassiliou et al. (1996). Allow the endpoints of the marginal intervals c_{jl} and c_{ju} to take the values ±∞, with the convention that adding finite numbers to or multiplying finite nonzero numbers by ±∞ still results in ±∞. For a vector (x_1, . . . , x_d), let x_{<j} denote the row vector (x_1, . . . , x_{j−1}), and for a matrix X = ((x_{ij})), let X_{j,<j} denote the row vector consisting of the first j − 1 elements of row j. In both cases, adopt the convention that the vector is null for j = 1. Let L = ((l_{ij})) be the lower triangular Cholesky factor of the covariance matrix Σ; that is, LL′ = Σ. Also, let U be a d-dimensional random vector with a N(0, I_d) distribution, and let u = (u_1, . . . , u_d) denote a general realization of U. Finally, let φ and Φ denote the standard univariate normal density and CDF, respectively. Given the set C defined previously and a vector u, consider the collection of sets C_j(u_{<j}), j = 1, . . . , d, defined by

C_j(u_{<j}) = {u_j : (c_{jl} − µ_j − L_{j,<j} u_{<j}) / l_{jj} ≤ u_j ≤ (c_{ju} − µ_j − L_{j,<j} u_{<j}) / l_{jj}}.

Then

Pr(Z ∈ C) = Pr(LU + µ ∈ C)
          = Pr(U_j ∈ C_j(U_{<j}), j = 1, . . . , d)
          = ∫ [ ∏_{j=1}^d Φ(C_j(u_{<j})) ] [ ∏_{j=1}^d φ(u_j) 1{u_j ∈ C_j(u_{<j})} / Φ(C_j(u_{<j})) ] du,    (A.1)

where Φ(C_j(u_{<j})) denotes the standard normal probability of the interval C_j(u_{<j}). The purpose of this sequence of equations is to transform the probability calculation under the distribution of Z into an expected value under the distribution of a different collection of random variables W = (W_1, . . . , W_d). These random variables are defined recursively: W_1 has a fixed truncated univariate normal distribution, while the distribution of W_j is a truncated univariate normal distribution that is a function of W_{<j}. The joint density of the random variables defined in this manner is equal to the second bracketed term in the integrand of (A.1) (which is the p_i function noted in Section 4.1). A realization w of the random vector W is obtained by recursively sampling from the appropriate univariate truncated normal distributions specified in (A.1), and this realization is used to calculate the quantity ∏_{j=1}^d Φ(C_j(w_{<j})) (which is the h_i function noted in Section 4.1). The GHK method estimates the desired probability as the average of this quantity over K ≥ 1 realizations of w.
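The recursion above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration (the function name and interface are ours, not from Hajivassiliou et al.), and it does not guard against degenerate intervals where the truncated probability underflows to zero.

```python
import numpy as np
from scipy.stats import norm

def ghk(mu, Sigma, lower, upper, K=10000, rng=None):
    """GHK estimate of Pr(lower <= Z <= upper) for Z ~ N(mu, Sigma).
    For each replication, recursively sample W_j from the truncated
    standard normal on C_j(w_{<j}) and accumulate the product of the
    conditional interval probabilities; average over K replications."""
    rng = np.random.default_rng(rng)
    d = len(mu)
    L = np.linalg.cholesky(Sigma)  # lower triangular, L @ L.T = Sigma
    est = np.zeros(K)
    for k in range(K):
        w = np.zeros(d)
        logp = 0.0
        for j in range(d):
            shift = mu[j] + L[j, :j] @ w[:j]
            a = (lower[j] - shift) / L[j, j]  # endpoints of C_j(w_{<j})
            b = (upper[j] - shift) / L[j, j]
            Fa, Fb = norm.cdf(a), norm.cdf(b)
            logp += np.log(Fb - Fa)  # Phi(C_j(w_{<j}))
            # inverse-CDF draw from the truncated standard normal on [a, b]
            w[j] = norm.ppf(rng.uniform(Fa, Fb))
        est[k] = np.exp(logp)  # prod_j Phi(C_j(w_{<j})) for this replication
    return est.mean()
```

Note that for d = 1, or for diagonal Σ, every replication returns the exact rectangle probability, so the Monte Carlo variance arises entirely from the correlation structure.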