Multiple Imputation of Missing or Faulty Values Under Linear Constraints

Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint modeling approach for data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.


INTRODUCTION
Most economic datasets suffer from missing data. For example, among the plants surveyed in the 2007 U.S. Census of Manufactures and across all six-digit NAICS industries, 27% of plants are missing total value of shipments, 32% of plants are missing book values of assets, and 42% of plants are missing cost of materials. As is well known (Little and Rubin 2002), using only the complete cases (all variables are observed) or available cases (all variables for the particular analysis are observed) can cause problems for statistical inferences, even when variables are missing at random (Rubin 1976). By discarding cases with partially observed data, both approaches sacrifice information that could be used to increase precision. Further, using only available cases complicates model comparisons, since different models could be estimated on different sets of cases; standard model comparison strategies do not account for such disparities. For data collected with complex survey designs, using only available cases complicates survey-weighted inference, since the original weights generally are no longer meaningful for the available sample.
An alternative to complete/available cases is to fill in the missing items with multiple imputations (Rubin 1987). The basic idea is to simulate values for the missing items by sampling repeatedly from predictive distributions. This creates m > 1 completed datasets that can be analyzed or, as relevant for many data producers, disseminated to the public. When the imputation models meet certain conditions (Rubin 1987, chap. 4), analysts of the m completed datasets can make valid inferences using complete-data statistical methods and software. Specifically, the analyst computes point and variance estimates of interest with each dataset and combines these estimates using simple formulas developed by Rubin (1987). These formulas serve to propagate the uncertainty introduced by missing data and imputation through the analyst's inferences. See Reiter and Raghunathan (2007) for a review of multiple imputation.
In many settings, imputations of numerical variables must respect linear constraints. These constraints arise particularly in the context of editing faulty data, that is, values that are not plausible (e.g., a plant with one million employees, or a large plant with an average salary per employee of $1 million) or that are inconsistent with other values on the data file (e.g., an establishment reporting an entry year of 2012 that also reports nonzero employment in earlier years). For example, the U.S. Census Bureau requires numerical variables to satisfy an extensive set of ratio constraints intended to catch errors in survey reporting or data capture in the Census of Manufactures, the Annual Survey of Manufactures, and the Services Sectors Censuses (SSC) in the Economic Census (Winkler and Draper 1996; Draper and Winkler 1997; Thompson et al. 2001). Examples of non-U.S. economic data products subject to editing include the Survey of Average Weekly Earnings of the Australian Bureau of Statistics (Lawrence and McDavitt 1994), the Structural Business Survey of Statistics Netherlands (Scholtus and Goksen 2012), and the Monthly Inquiry Into the Distribution and Services Sector of the U.K. Office for National Statistics (Hedlin 2003).
Despite the prevalence and prominence of such contexts, there has been little work on imputation of numerical variables under constraints (De Waal, Pannekoek, and Scholtus 2011). Most statistical agencies use hot deck imputation: for each record with missing values, find a record with complete data that is similar on all observed values, and use that record's values as imputations. However, many hot deck approaches do not guarantee that all constraints are satisfied and, as with hot deck imputation in general, can fail to describe multivariate relationships and typically result in underestimation of uncertainty (Little and Rubin 2002). Tempelman (2007) proposed to impute via truncated multivariate normal distributions that assign zero support to multivariate regions not satisfying the constraints. This approach presumes relationships in the data are well-described by a single multivariate normal distribution, which for skewed economic variables is not likely in practice even after transformations (which themselves can be tricky to implement because the constraints may be difficult to express on the transformed scale). Raghunathan et al. (2001) and Tempelman (2007) proposed to use sequential regression imputation, also called multiple imputation by chained equations (Van Buuren and Oudshoorn 1999), with truncated conditional models. The basic idea is to impute each variable y_j with missing data from its regression on some function of all other variables {y_s : s ≠ j}, enforcing zero support for values of y_j that do not satisfy the constraints. While more flexible than multivariate normal imputation, this technique faces several challenges in practice. Relations among the variables may be interactive and nonlinear, and identifying these complexities in each conditional model can be a laborious task with no guarantee of success.
Furthermore, the set of specified conditional distributions may not correspond to a coherent joint distribution and thus is subject to odd theoretical behaviors. For example, the order in which variables are placed in the sequence could impact the imputations (Baccini et al. 2010).
In this article, we propose a fully Bayesian, flexible joint modeling approach for multiple imputation of missing or faulty data subject to linear constraints. To do so, we use a Dirichlet process mixture of multivariate normal distributions as the base imputation engine, allowing the data to inform the appropriate number of mixture components. The mixture model allows for flexible joint modeling, as it can reflect complex distributional and dependence structures automatically (MacEachern and Müller 1998), and it is readily fit with MCMC techniques (Ishwaran and James 2001). We restrict the support of the mixture model only to regions that satisfy the constraints. To sample from this restricted mixture model, we use the Hit-and-Run sampler (Boneh and Golan 1979; Smith 1980), thereby guaranteeing that the imputations satisfy all constraints. We illustrate the constrained imputation procedure with data from the Colombian Annual Manufacturing Survey, estimating descriptive statistics and coefficients from a regression used in the literature on plant productivity.
The remainder of the article is organized as follows. In Section 2, we review automatic editing and imputation processes, which serves as a motivating setting for our work. In Section 3, we present the constrained Dirichlet process mixture of multivariate normals (CDPMMN) multiple imputation engine, including the MCMC algorithm for generating imputations. In Section 4, we apply the CDPMMN to the Colombian manufacturing data. In Section 5, we conclude with a brief discussion of future research directions.

BACKGROUND ON DATA EDITING
Although the CDPMMN model is of interest as a general estimation and imputation model, it is especially relevant for national statistical agencies and survey organizations (henceforth, agencies) seeking to disseminate high-quality data to the public. These agencies typically dedicate significant resources to imputing missing data and correcting faulty data before dissemination. For example, Granquist and Kovar (1997) estimated that national statistical agencies spend 40% of the total budget for business surveys (20% for household surveys) on edit and imputation processes, and in an internal study of 62 products at Statistics Sweden, Norberg (2009) reported that the agency allocated 32.6% of total costs in business surveys to editing processes. Improving these processes serves as a primary motivation for the CDPMMN imputation engine, as we now describe.
Since the 1960s, national statistical agencies have combined the expertise of subject-matter analysts with automated methods for editing and imputing numerical data. The subject-matter experts create consistency rules, called edit rules or edits, that describe feasible regions of data values. The automated methods identify and replace values that fail the edit rules.
For numerical data, typical edit rules include range restrictions, for example, marginal lower and upper bounds on a single variable; ratio edits, for example, lower and upper bounds on a ratio of two variables; and balance edits, for example, two or more items adding to a total item. In this article, we consider range restrictions and ratio edits, but not balance edits. Formally, let x_ij be the value of the jth variable for the ith subject, where i = 1, . . . , n, and j = 1, . . . , p. For variables with range restrictions, we have L_j ≤ x_ij ≤ U_j, where U_j and L_j are agency-fixed upper and lower limits, respectively. For any two variables (x_ij, x_ik) subject to ratio edits, we have L_jk ≤ x_ij/x_ik ≤ U_jk, where again U_jk and L_jk are agency-fixed upper and lower limits, respectively. An exemplary system of ratio edits is displayed in Table 1. These rules, defined by the U.S. Census Bureau for data collected in the SSC, were used to test editing routines in the 1997 Economic Census (Thompson et al. 2001).
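As a concrete illustration, range restrictions and ratio edits can be checked record by record. The following minimal Python sketch uses hypothetical limits (actual limits are agency-fixed, as in Table 1); the function name and variable names are ours, chosen only for illustration.

```python
# Sketch of range-restriction and ratio-edit checks with hypothetical limits.

def passes_edits(record, range_limits, ratio_limits):
    """Return True when `record` satisfies all range and ratio edits.

    record: dict mapping variable name -> value
    range_limits: {var: (L_j, U_j)}
    ratio_limits: {(var_j, var_k): (L_jk, U_jk)} bounding var_j / var_k;
    assumes denominators are positive for records that pass the range checks.
    """
    for var, (lo, hi) in range_limits.items():
        if not (lo <= record[var] <= hi):
            return False
    for (num, den), (lo, hi) in ratio_limits.items():
        if not (lo <= record[num] / record[den] <= hi):
            return False
    return True

# Hypothetical edits: employment in [1, 10000]; payroll/employment in [10, 200].
ranges = {"employment": (1, 10000)}
ratios = {("payroll", "employment"): (10, 200)}
print(passes_edits({"employment": 50, "payroll": 2500}, ranges, ratios))   # ratio 50 -> True
print(passes_edits({"employment": 50, "payroll": 50000}, ranges, ratios))  # ratio 1000 -> False
```

In practice, a record failing any one check would be flagged for error localization rather than rejected outright.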
The set of edit rules defines a feasible region (a convex polytope) comprising all potential records passing the edits. A potential data record x_i = (x_i1, . . . , x_ip) satisfies the edits if and only if the constraints Ax_i ≤ b are satisfied, for some matrix A and vector b determined by the edit rules. Any record that fails the edits is replaced (imputed) by a record that passes the edits. Finding records that fail the rules is straightforward, at least conceptually, with automated routines and modern computing. However, identifying the faulty values within any record (the error localization procedure) and correcting those erroneous values (the imputation procedure) are far more complicated tasks. Among national statistical agencies, the most commonly used error localization procedure is that of Fellegi and Holt (1976), who developed a set covering algorithm that identifies the minimum number of values to change so that the record satisfies all edit rules. Fellegi and Holt (1976) error localization algorithms are the basis for many automated edit systems, including the U.S. Census Bureau's SPEER (Draper and Winkler 1997), Statistics Canada's GEIS (Whitridge and Kovar 1990), the Spanish National Statistical Institute's DIA (Garcia-Rubio and Villan 1990), Statistics Netherlands' CherryPi (De Waal 2000), and Istituto Nazionale di Statistica's SCIA (Manzari 2004). We do not discuss error localization steps further, as our concern is with the imputation procedure; that is, we assume that the erroneous fields of survey data that violate edit rules have been detected and blanked by the agency, and we focus on imputing the blanked values.

MODEL AND ESTIMATION
We now present the CDPMMN imputation engine. We begin in Section 3.1 by reviewing Dirichlet process mixtures of multivariate normal distributions without any truncation or missing data. We adapt the model in Section 3.2 to handle truncation and present an MCMC algorithm to estimate the model, again assuming no missing data. We modify this algorithm in Section 3.3 to account for missing data, making use of a Hit-and-Run algorithm in the MCMC steps. We discuss several implementation issues in Section 3.4.

Dirichlet Process Mixtures of Normals Without Truncation
Let Y_n = {y_1, . . . , y_n} comprise n complete observations not subject to edit constraints, where each y_i is a p-dimensional vector. To facilitate modeling, we assume that the analyst standardizes each variable before modeling to have observed mean equal to 0 and observed standard deviation equal to 1; see Appendix B in the supplementary material for details of the standardization. We suppose that each individual i belongs to exactly one of K < ∞ latent mixture components. For i = 1, . . . , n, let z_i ∈ {1, . . . , K} indicate the component of individual i, and let π_k = Pr(z_i = k). We assume that π = (π_1, . . . , π_K) is the same for all individuals. Within any component k, we suppose that the p variables follow a component-specific multivariate normal distribution with mean μ_k and variance Σ_k. Let Θ = (μ, Σ, π), where μ = (μ_1, . . . , μ_K) and Σ = (Σ_1, . . . , Σ_K). Mathematically, the finite mixture model can be expressed as

  y_i | z_i, Θ ~ N(μ_{z_i}, Σ_{z_i})    (1)
  z_i | π ~ Discrete(π_1, . . . , π_K).    (2)

Marginalizing over z_i, this model is equivalent to

  f(y_i | Θ) = Σ_{k=1}^{K} π_k N(y_i | μ_k, Σ_k).    (3)

As a mixture of multivariate normal distributions, the model is flexible enough to capture distributional features like skewness and nonlinear relationships that a single multivariate normal distribution would fail to encode. To complete the Bayesian specification, we require prior distributions for Θ. As suggested by Lavine and West (1992), for (μ_k, Σ_k) we use

  μ_k | Σ_k ~ N(μ_0, h^{-1} Σ_k)    (4)
  Σ_k ~ InverseWishart(f, Φ),    (5)

where f is the degrees of freedom, Φ = diag(φ_1, . . . , φ_p), and

  φ_j ~ Gamma(a_φ, b_φ)    (6)

with mean a_φ/b_φ for j = 1, . . . , p. These conditionally conjugate prior distributions simplify the MCMC computations. We defer discussion of these hyperparameter values until Section 3.2, when we present the constrained model. For the component weights, π, we use the stick-breaking representation of a truncated Dirichlet process (Sethuraman 1994; Ishwaran and James 2001),

  π_k = V_k ∏_{g<k} (1 − V_g) for k = 1, . . . , K,    (7)
  V_k ~ Beta(1, α) for k = 1, . . . , K − 1, with V_K = 1,    (8)
  α ~ Gamma(a_α, b_α).    (9)
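The generative process in (1)-(2) can be sketched in one dimension: draw a component label from π, then draw the value from that component's normal distribution. The component parameters below are illustrative, not estimated from data.

```python
import random

# Minimal one-dimensional generative sketch of the finite mixture in (1)-(2):
# draw z_i from pi, then y_i from N(mu_z, sigma_z). Illustrative parameters.

def draw_from_mixture(pi, mu, sigma, rng):
    u, cum = rng.random(), 0.0
    z = len(pi) - 1  # fallback guards against floating-point undershoot
    for k, w in enumerate(pi):
        cum += w
        if u <= cum:
            z = k
            break
    return z, rng.gauss(mu[z], sigma[z])

rng = random.Random(7)
pi, mu, sigma = [0.6, 0.4], [-2.0, 3.0], [0.5, 1.0]
sample = [draw_from_mixture(pi, mu, sigma, rng) for _ in range(2000)]
share_first = sum(1 for z, _ in sample if z == 0) / len(sample)
print(round(share_first, 2))  # close to pi_0 = 0.6
```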
Here, (a α , b α ) are analyst-supplied constants. Following Dunson and Xing (2009), we recommend setting (a α = 0.25, b α = 0.25), which represents a small prior sample size and, hence, vague specification for Gamma distributions. This ensures that the information from the data dominates the posterior distribution (Escobar and West 1995). The specification of prior distributions in (7)-(9) encourages π k to decrease stochastically with k. When α is very small, most of the probability in π is allocated to the first few components, thus reducing the risks of over-fitting the data as well as increasing computational efficiency. We note that Dirichlet process distributions are widely used for mixture models in economic analyses (e.g., Hirano 2002; Gilbride and Lenk 2010). As a prior distribution for π = (π 1 , . . . , π K ), one also could use a symmetric Dirichlet distribution, π ∼ Dir(a/K, . . . , a/K), for example, with a = 1. We expect that imputations based on the truncated Dirichlet process prior distribution and this symmetric Dirichlet distribution generally will not differ noticeably. We use the truncated Dirichlet process prior distribution primarily because (i) it has worked well in other applications of mixture models for multiple imputation of missing data (Si and Reiter 2013; Manrique-Vallier and Reiter forthcoming) and (ii) in our experience, it facilitates reasonably fast MCMC convergence.
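The truncated stick-breaking construction in (7)-(8) can be sketched directly: draw V_k ~ Beta(1, α) for k < K, set V_K = 1, and form each weight from the remaining length of the stick. The value of α below is the recommended 0.25, and the truncation level is illustrative.

```python
import random

# Sketch of the truncated stick-breaking weights: pi_k = V_k * prod_{g<k}(1 - V_g),
# with V_k ~ Beta(1, alpha) for k < K and V_K = 1 so the weights sum to 1.

def stick_breaking_weights(alpha, K, rng):
    pis, remaining = [], 1.0
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        pis.append(v * remaining)
        remaining *= (1.0 - v)
    return pis

rng = random.Random(1)
pi = stick_breaking_weights(alpha=0.25, K=20, rng=rng)
print(round(sum(pi), 6))          # 1.0: setting V_K = 1 closes the stick
print(sum(w > 0.01 for w in pi))  # small alpha concentrates mass on few components
```

A small α makes each V_k close to 1 on average, so the first few breaks consume most of the stick; this is the stochastic decrease of π_k described above.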
With specified hyperparameters, the model can be estimated using a Gibbs sampler, as each full conditional distribution is in closed form; see Ishwaran and James (2001).

Dirichlet Process Mixtures of Normals With Truncation
The model in Section 3.1 has unrestricted support and, hence, is inappropriate for imputation under linear constraints. Instead, when y_i is restricted to lie in some feasible region A, we need to replace (3) with

  f(y_i | Θ, A) = h(A, Θ)^{-1} Σ_{k=1}^{K} π_k N(y_i | μ_k, Σ_k) I(y_i ∈ A),    (10)

where I(·) = 1 when the condition inside the parentheses is true, and I(·) = 0 otherwise. Here, h(A, Θ) is the normalizing constant such that

  h(A, Θ) = ∫_A Σ_{k=1}^{K} π_k N(y | μ_k, Σ_k) dy.    (11)

Equivalently, using the representation conditional on z_i, we need to replace (1) with

  f(y_i | z_i, Θ, A) ∝ N(y_i | μ_{z_i}, Σ_{z_i}) I(y_i ∈ A).    (12)

We leave all other features of the model as described in (2) and (4)-(9). Estimation is complicated by the normalizing constant h(A, Θ), which is not available in closed form. We therefore follow the data augmentation strategy of O'Malley and Zaslavsky (2008), in which we (i) conceive of the observed values as a sample from a larger, hypothetical sample not subject to the constraints, (ii) construct the hypothetical sample by augmenting the observed data with values from outside of A, and (iii) use the augmented data to estimate parameters using the usual, unconstrained Gibbs sampler. With appropriate prior distributions, the resulting parameter estimates correspond to those from the CDPMMN model.
Specifically, we assume that there exists a hypothetical sample of unknown size N > n records, Y_N = {y_1, . . . , y_N}, generated from the unconstrained model in (1)-(2). We consider the observed data Y_n = {y_1, . . . , y_n} to be the subset of Y_N falling in A, where any record of Y_N lies in A with probability h(A, Θ). Thus, for known N, the number of cases in A has distribution

  n | N, Θ ~ Binomial(N, h(A, Θ)).    (16)

We use the prior distribution suggested by Meng and Zaslavsky (2002) and O'Malley and Zaslavsky (2008), p(N) ∝ 1/N. With this construction, we can estimate Θ without computing h(A, Θ) using an MCMC algorithm. Let Z_N = {Z_n, Z_{N−n}} be the set of membership indicators corresponding to Y_N, where Z_n = {z_1, . . . , z_n} and Z_{N−n} = {z_{n+1}, . . . , z_N}. For each group k = 1, . . . , K, let N_k = Σ_{i=1}^{N} I(z_i = k) be the number of cases in the augmented sample in group k; let ȳ_k = Σ_{i: z_i = k} y_i / N_k; and let S_k = Σ_{i: z_i = k} (y_i − ȳ_k)(y_i − ȳ_k)'. After initialization, the MCMC algorithm proceeds with the following Gibbs steps.

S1. For k = 1, . . . , K, sample (μ_k, Σ_k) from the conjugate normal-inverse-Wishart full conditionals implied by (4)-(5), based on (N_k, ȳ_k, S_k).
S2. For k = 1, . . . , K − 1, sample the stick-breaking weights V_k from Beta(1 + N_k, α + Σ_{g>k} N_g), set V_K = 1, and compute π as in (7).
S3. For j = 1, . . . , p, sample each φ_j from its Gamma full conditional.
S4. Given π, sample α from its Gamma full conditional.
S5. For i = 1, . . . , n, sample each z_i from the full conditional Pr(z_i = k | −) = π*_ik, where π*_ik = π_k N(y_i | μ_k, Σ_k) / Σ_{g=1}^{K} π_g N(y_i | μ_g, Σ_g).
S6. Sample (N, Z_{N−n}, Y_{N−n}) jointly from their full conditional distribution following the approach suggested by O'Malley and Zaslavsky (2008). Specifically, based on the result in (16), we draw each (z_i, y_i) ∈ (Z_{N−n}, Y_{N−n}) from a negative binomial data-generating process. To begin, set c_in = c_out = 0. Then, repeatedly draw pairs (z_i, y_i) from (1)-(2): if y_i ∈ A, increment c_in and discard the draw; otherwise, append (z_i, y_i) to (Z_{N−n}, Y_{N−n}) and increment c_out. Stop when c_in = n, and set N = n + c_out.
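Step S6 can be sketched in one dimension as the following rejection-style process, with A taken to be an interval and with illustrative values standing in for the current draws of (π, μ, Σ).

```python
import random

# One-dimensional sketch of step S6: keep drawing (z, y) from the
# unconstrained mixture until n draws land inside A = [lo, hi]; draws
# falling outside A become the augmented cases Y_{N-n}, and N = n + c_out.
# The parameters are illustrative stand-ins for the current MCMC values.

def augment(n, pi, mu, sigma, lo, hi, rng):
    c_in, augmented = 0, []
    while c_in < n:
        u, cum = rng.random(), 0.0
        z = len(pi) - 1  # fallback guards against floating-point undershoot
        for k, w in enumerate(pi):
            cum += w
            if u <= cum:
                z = k
                break
        y = rng.gauss(mu[z], sigma[z])
        if lo <= y <= hi:
            c_in += 1                  # a would-be observed case; discard it
        else:
            augmented.append((z, y))   # outside A: joins Y_{N-n}
    return n + len(augmented), augmented

rng = random.Random(3)
N, aug = augment(n=100, pi=[0.5, 0.5], mu=[0.0, 4.0], sigma=[1.0, 1.0],
                 lo=-2.0, hi=5.0, rng=rng)
print(N >= 100)                                      # True by construction
print(all(not (-2.0 <= y <= 5.0) for _, y in aug))   # True: augmented draws lie outside A
```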
For the prior distributions in (4)-(6), we recommend using a prior mean of μ 0 = 0 since each variable is mean-centered, using f = p + 1 degrees of freedom to ensure a proper distribution without overly constraining , and setting h = 1 mostly for convenience. We recommend setting (a φ , b φ ) to be modest but not too small, so as to allow substantial prior mass at modest-sized variances. Following Dellaportas and Papageorgiou (2006), we use a φ = b φ = 0.25. In the Colombia data illustration, results were insensitive to other sensible choices of hyperparameter values; details of the sensitivity analysis are in Appendix A, available as online supplementary material.

Accounting for Missing Data: Hit-and-Run Algorithm
Having developed an MCMC algorithm for fitting the CDPMMN model without any missing values, we now extend it to include imputation of missing data or, equivalently, imputation of blanked values due to edit rules. Here, we assume the missing data mechanism is ignorable (Rubin 1976). Without loss of generality, suppose that the first s ≤ n records in Y_n have some missing values. Let Y_s = {y_1, . . . , y_s} be these first s records, and Y_{n−s} = {y_{s+1}, . . . , y_n} be the set of fully observed records. For each y_i ∈ Y_s, let y_i = (y_{i,0}, y_{i,1}), where y_{i,0} comprises the missing values and y_{i,1} comprises the observed values. Finally, let Y_{s,0} = {y_{1,0}, . . . , y_{s,0}}, and let Y_{s,1} = {y_{1,1}, . . . , y_{s,1}}.
Formally, we seek to estimate the posterior distribution f(Y_{s,0}, Y_{N−n}, Z_N, Θ, Φ, α, N | Y_{s,1}, Y_{n−s}, A). We do so in two steps, namely (i) given a draw of Y_{s,0} satisfying A, draw values of (Y_{N−n}, Z_N, Θ, Φ, α, N) using S1-S6 from Section 3.2, and (ii) given a draw of (Y_{N−n}, Z_N, Θ, Φ, α, N), draw imputations for Y_{s,0} satisfying A using a Metropolis version of the Hit-and-Run (H&R) sampler (Chen and Schmeiser 1993). Thus, we only need to add an imputation step to the MCMC algorithm from Section 3.2, as we now describe.
We begin by finding the region of feasible imputations for each y_{i,0}, which we write as A_i = {y_{i,0} : (y_{i,0}, y_{i,1}) ∈ A}. For systems of linear constraints like those in Table 1 and typically employed in edit-imputation contexts, A_i can be defined using matrix algebra operations; see Appendix C, supplementary material, for an illustrative example. For more complex linear constraints, feasible regions can be found by linear programming and related optimization techniques.
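For the special case of a record with a single missing variable, A_i reduces to an interval that can be read off the rows of (A, b): each constraint yields a one-sided bound on the missing coordinate given the observed values. The sketch below makes that calculation explicit with illustrative constraints (a range restriction and a ratio edit).

```python
# Sketch: when record i has one missing variable x_j, each constraint
# sum_l a_l x_l <= b yields a one-sided bound on x_j given the observed
# values, and A_i is the intersection of these bounds.

def feasible_interval(A, b, x_obs, j):
    """A: list of rows; b: list of limits; x_obs: record with x_obs[j] ignored."""
    lo, hi = float("-inf"), float("inf")
    for row, b_i in zip(A, b):
        rest = sum(a_l * x_l for l, (a_l, x_l) in enumerate(zip(row, x_obs))
                   if l != j)
        a_j = row[j]
        if a_j > 0:
            hi = min(hi, (b_i - rest) / a_j)
        elif a_j < 0:
            lo = max(lo, (b_i - rest) / a_j)
        # a_j == 0: constraint does not involve x_j
    return lo, hi

# x_0 in [1, 10] and 0.5 <= x_1/x_0 <= 2, with x_1 = 6 observed and x_0 missing:
A = [[-1, 0], [1, 0], [0.5, -1], [-2, 1]]
b = [-1, 10, 0, 0]
print(feasible_interval(A, b, [None, 6.0], j=0))  # (3.0, 10.0)
```

The ratio edit pins x_0 to [3, 12] (since x_1 = 6), and intersecting with the range restriction [1, 10] gives the interval (3, 10).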
The H&R sampler proceeds as follows. At any iteration t of the MCMC sampler, we presume the current values of each y_{i,0}, say y_{i,0}^{(t)}, where i = 1, . . . , s, satisfy the constraints; see Section 3.4 for a suggestion to initialize the chain at feasible values. The basic idea of the H&R sampler is to pick a random direction in R^{p_i}, where p_i is the number of variables in y_{i,0}; follow that direction starting from y_{i,0}^{(t)} until hitting the boundary of A_i, say at some point b_{i,0}^{(t)}; and sample a new point along the line segment with end points (y_{i,0}^{(t)}, b_{i,0}^{(t)}). For convex A_i, the selected point is guaranteed to be inside the feasible region. Formally, we implement this process via the Metropolis step in S7 below.
S7. For i = 1, . . . , s, update each current value y_{i,0}^{(t)} with a Metropolis accept/reject step as follows.

S7.1. Propose a direction d* ~ r(·), where r(·) is a uniform distribution on the surface of the unit sphere in R^{p_i}.
S7.2. Find the set of candidate proposal distances, which we write as Λ_i(d*) = {λ ∈ R : y_{i,0}^{(t)} + λ d* ∈ A_i}.
S7.3. Sample a distance λ* from the uniform distribution on Λ_i(d*), that is, with density g(λ* | d*) = I(λ* ∈ Λ_i(d*)) / M(Λ_i(d*)), where M is a Lebesgue measure.
S7.4. Accept or reject the proposal y_{i,0}^q = y_{i,0}^{(t)} + λ* d* with the acceptance probability ρ_i, where ρ_i is the ratio of the target conditional densities at the proposed and current values. Note that the acceptance probability can be calculated by using the fact that the conditional distribution of y_{i,0} is proportional to that of y_i, that is, (12). After canceling common terms, the acceptance probability simplifies to ρ_i = min{1, N(y_i^q | μ_{z_i}, Σ_{z_i}) / N(y_i^{(t)} | μ_{z_i}, Σ_{z_i})}, where y_i^q = (y_{i,0}^q, y_{i,1}) and y_i^{(t)} = (y_{i,0}^{(t)}, y_{i,1}).

Because the H&R sampler randomly moves in any direction, it can cover multivariate spaces more efficiently than typical Gibbs samplers that move in one direction in each conditional step. The efficiency gain can be substantial when the sample space is a convex polytope and when variables are highly correlated or sharply constrained (Chen and Schmeiser 1993; Berger 1993). Lovász and Vempala (2006) proved that the H&R sampler mixes fast and that mixing time does not depend on the choice of starting points. Hence, for imputing multivariate data subject to constraints, the H&R sampler offers a potentially significant computational advantage over typical Gibbs steps.
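A minimal sketch of one Metropolis H&R update in the spirit of S7 follows, targeting a unit-covariance normal restricted to a unit box; the unit covariance is a stand-in for the component-specific (μ_{z_i}, Σ_{z_i}), and all numbers are illustrative.

```python
import random, math

# Sketch of one Metropolis Hit-and-Run update for a point inside the polytope
# A x <= b, targeting N(mu, I) restricted to the polytope.

def hit_and_run_step(x, A, b, mu, rng):
    p = len(x)
    # S7.1: direction uniform on the unit sphere (normalized Gaussian draw)
    d = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(di * di for di in d))
    d = [di / norm for di in d]
    # S7.2: chord {lambda : x + lambda*d in polytope}
    lam_lo, lam_hi = -1e12, 1e12
    for row, b_i in zip(A, b):
        ad = sum(a * di for a, di in zip(row, d))
        slack = b_i - sum(a * xi for a, xi in zip(row, x))
        if ad > 1e-12:
            lam_hi = min(lam_hi, slack / ad)
        elif ad < -1e-12:
            lam_lo = max(lam_lo, slack / ad)
    # S7.3: uniform proposal along the chord (symmetric proposal)
    lam = rng.uniform(lam_lo, lam_hi)
    prop = [xi + lam * di for xi, di in zip(x, d)]
    # S7.4: accept with the ratio of (unit-covariance) normal kernels
    def log_kernel(v):
        return -0.5 * sum((vi - mi) ** 2 for vi, mi in zip(v, mu))
    ratio = math.exp(min(0.0, log_kernel(prop) - log_kernel(x)))
    return prop if rng.random() < ratio else x

# Unit box 0 <= x_1, x_2 <= 1 written as A x <= b:
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
rng = random.Random(11)
x = [0.5, 0.5]
for _ in range(200):
    x = hit_and_run_step(x, A, b, mu=[0.2, 0.8], rng=rng)
print(all(-1e-9 <= xi <= 1 + 1e-9 for xi in x))  # True: states stay in the polytope
```

Because the uniform-on-chord proposal is symmetric, the acceptance ratio reduces to the ratio of target kernels, mirroring the cancellation noted in S7.4.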
To obtain m completed datasets for use in multiple imputation, one selects m of the sampled Y s,0 after MCMC convergence. These datasets should be spaced sufficiently to be approximately independent (given the observed data). This involves thinning the MCMC samples so that the autocorrelations among parameters are close to zero.

Implementation Issues
In this section, we discuss several features of implementing the CDPMMN imputation engine, beginning with the choice of K. Setting K too small can result in insufficient flexibility to capture complex features of the distributions, but setting K too large is computationally inefficient. We recommend initially setting K to a reasonably large value, say K = 50. Analysts can examine the posterior distribution of the number of unique values of z i across MCMC iterates to diagnose if K is large enough. Significant posterior mass at a number of classes equal to K suggests that the truncation limit should be increased. We note that one has to assume a finite K because of the truncation; otherwise, the model can generate arbitrary components with nearly all mass outside A.
One also has to select hyperparameter values, as we discussed in Section 3.2. Of note here is the value of h in (4), which determines the spread of mean vectors. For example, small values of h generally imply that, a priori, μ k can be far from μ 0 . Empirically, our experience is that small values of h can generate large values of N, the number of augmented values, which can slow the MCMC algorithm. We recommend avoiding very small h, say by setting h > 0.1, to reduce this computational burden. As noted in Section 3.2 and the sensitivity analysis in Appendix A, our application results are robust to changes of h and other hyperparameters.
To initialize the MCMC chains when using the H&R sampler, we need to choose a set of points all within the polytope defined by A. To do so, we find the extreme points of the polytope along each dimension by solving linear programs (Tervonen et al. 2013), and set the starting point of the H&R chain as the arithmetic mean of the extreme points. We run this procedure for finding starting values using the R package "hitandrun" (available at http://cran.r-project.org/web/packages/hitandrun/index.html). In our empirical application, other options of finding starting points within the feasible region, including randomly weighted extreme points or vertex representation, do not change the final results.
With MCMC algorithms it is essential to examine convergence diagnostics. Due to the complexity of the models and many parameters, as well as the truncation and missing values, monitoring convergence of chains is not straightforward. We suggest that MCMC diagnostics focus on the draws of N, α, the ordered samples of π , and, when feasible, values of the π-weighted averages of μ k and k , for example, k π k μ k , rather than specific component parameters, {μ k , k }. Specific component parameters are subject to label switching among the mixture components, which complicates interpretation of the components and MCMC diagnostics; we note that label switching does not affect the multiple imputations.

EMPIRICAL EXAMPLE: COLOMBIAN MANUFACTURING DATA
We illustrate the CDPMMN with multiple imputation and analysis of plant-level data from the Colombian Annual Manufacturing Survey. Following Petrin and White (2011), we use the seven variables in Table 2. We remove a small number of plants with missing item data or nonpositive values, so as to make a clean file on which to evaluate the imputations. The resulting annual sample sizes range from 5770 to 6873.
To illustrate the performance of the CDPMMN imputation engine, we introduce range restrictions on each variable (see Table 2) and ratio edits on each pair of variables (see Table 3) like those used by government agencies in economic data. Limits are wide enough that all existing records satisfy the constraints, thus offering a set of true values to use as a gold standard. However, the limits are close enough to several observed values that the constraints matter, in the sense that estimation and imputation under models that ignore the truncation would have nonnegligible support in the infeasible region. Figure 1 displays a typical example of the linear constraints for two variables, log(USW) and log(CAP).
We then randomly introduce missing values in the observed data, and use the CDPMMN imputation engine to implement multiple imputation of missing values subject to linear constraints. We do so in two empirical studies: a set of analyses designed to compare one-off inferences from the imputed and original data (before introduction of missing values), and a repeated sampling experiment designed to illustrate efficiency gains in using the CDPMMN over complete-case methods.
In both studies, we make inferences for unknown parameters according to the multiple imputation inferences of Rubin (1987), which we review briefly here. Suppose that the analyst seeks to estimate some quantity Q, such as a regression coefficient or population mean. Let {D^(1), . . . , D^(m)} be m completed datasets drawn from the CDPMMN imputation engine. Given completed data, the analyst estimates Q with some point estimator q and estimates its variance with u. For l = 1, . . . , m, let q^(l) and u^(l) be, respectively, the value of q and u computed with D^(l). The analyst uses q̄_m = Σ_{l=1}^{m} q^(l)/m to estimate Q, and T_m = ū_m + (1 + 1/m) b_m to estimate the variance of q̄_m, where ū_m = Σ_{l=1}^{m} u^(l)/m and b_m = Σ_{l=1}^{m} (q^(l) − q̄_m)^2/(m − 1). Inferences are based on a t-distribution with degrees of freedom derived by Rubin (1987); for further details and refinements, see Li, Raghunathan, and Rubin (1991), Meng and Rubin (1992), and Reiter (2007).
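For a scalar estimand, these combining rules can be sketched as follows; the degrees of freedom shown are the standard form from Rubin (1987), and the inputs are illustrative.

```python
# Sketch of Rubin's (1987) combining rules for a scalar estimand: the
# combined point estimate, within/between variances, total variance T_m,
# and the standard degrees of freedom. Assumes b_m > 0.

def rubin_combine(qs, us):
    m = len(qs)
    qbar = sum(qs) / m
    ubar = sum(us) / m                                # within-imputation variance
    bm = sum((q - qbar) ** 2 for q in qs) / (m - 1)   # between-imputation variance
    Tm = ubar + (1 + 1 / m) * bm                      # total variance
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * bm)) ** 2
    return qbar, Tm, df

qbar, Tm, df = rubin_combine([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(qbar)           # 2.0
print(round(Tm, 4))   # 2.3333
print(round(df, 4))   # 6.125
```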

Comparisons of CDPMMN-Imputed and Original Data
For each year t = 1977, . . . , 1991 other than 1981, let D_orig,t be the original data before introduction of missing values. For each year, we generate a set of missing values Y_s,t by randomly blanking data from D_orig,t as follows: 1500 records have one randomly selected missing item, 1000 records have two randomly selected missing items, and 500 records have three randomly selected missing items. One can interpret these as missing data or as blanked fields after an error localization procedure. Missingness rates per variable range from 10.2% to 11.5%. This represents a missing completely at random mechanism (Rubin 1976), so that complete-case analyses based only on records with no missing data, that is, Y_{n−s,t}, would be unbiased but inefficient, sacrificing roughly half of the observations in any one year. We note that these rates of missing data are larger than typical fractions of records that fail edit rules, so as to offer a strong but still realistic challenge to the CDPMMN imputation engine.

Figure 1. The linear constraints for two variables, log(USW) and log(CAP), for 1982 data.
We implement the CDPMMN imputation engine using the model and prior distributions described in Section 3 to create m = 10 completed datasets, each satisfying all constraints. To facilitate model estimation, we work with the standardized natural logarithms of all variables (take logs first, then standardize using observed data).
After imputing on the transformed scale, we transform back to the original scale before making inferences. Of course, we also transform the limits of the linear inequalities to be consistent with the use of standardized logarithms; see Appendix B, available as online supplementary material. We run the MCMC algorithm of Section 3 for 10,000 iterations after a burn-in period, and store every 1000th iterate to obtain the m = 10 completed datasets. We use K = 40 components, which we judged to be sufficient based on the posterior distributions of the numbers of unique z i among the n cases.
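The transformation pipeline (take logs, then standardize by the observed mean and standard deviation of the logs) and its inverse can be sketched as follows; the data values are illustrative. A lower limit L on a positive variable maps to (log L − mean)/sd on the modeling scale, which is how the constraint limits are carried along.

```python
import math

# Sketch of the transformation used before model fitting: natural logs, then
# standardization with the observed mean and standard deviation of the logs;
# imputations are mapped back by inverting both steps.

def log_standardize(values):
    logs = [math.log(v) for v in values]
    mean = sum(logs) / len(logs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / (len(logs) - 1))
    return [(x - mean) / sd for x in logs], mean, sd

def inverse_transform(z, mean, sd):
    return math.exp(z * sd + mean)

data = [10.0, 25.0, 40.0, 150.0]
std, mean, sd = log_standardize(data)
roundtrip = [inverse_transform(z, mean, sd) for z in std]
print(all(abs(a - b) < 1e-8 for a, b in zip(data, roundtrip)))  # True
```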
We begin by focusing on results from the 1991 survey; results from other survey years are similar. Figure 2 summarizes the marginal distributions of log(CAP) in the original and completed datasets. For the full distribution based on all 6607 cases, the distributions are nearly indistinguishable, implying that the CDPMMN completed data approximate the marginal distributions of the original data well. The distribution of the 714 values of log(CAP) in D_orig before blanking is also similar to those in the three completed datasets. Figure 3 displays a scatterplot of log(USW) versus log(CAP) (at least one of these variables is missing for 20.3% of cases) for D_orig and one of the completed datasets; results for other D^(l) are similar. The overall bivariate patterns are very similar, suggesting once again the high quality of the imputations. We note that the shape of the joint density implies that a model based on a single multivariate normal distribution, even absent truncation, would not be appropriate for these data. We next examine inferences for coefficients in the regression used by Petrin and White (2011) to analyze plant productivity, namely a regression of log output on log capital and log labor,

  log(output_i) = β_0 + β_C log(CAP_i) + β_L log(LAB_i) + ε_i,    (17)

where LAB_i = SL_i + USL_i. We estimate regressions independently in each year. Figure 5 displays OLS 95% confidence intervals from D_orig,t and from the CDPMMN multiply imputed datasets in each year. For comparison, it also displays intervals based on only the complete cases. The intervals for β based on the CDPMMN completed datasets are similar to those based on D_orig, though somewhat wider due to the missing values. For comparison, the complete-case results typically have even larger standard errors.

Repeated Simulation
The results in Figure 5 suggest that the CDPMMN multiple imputation offers more efficient inferences than the complete-cases analysis. To verify this further, we perform a repeated sampling study. We treat the 6607 plants in the 1991 data as a population. We then randomly sample 500 independent realizations of D_orig from this population, each comprising 1000 records. For each sampled D_orig, we introduce missing values by blanking one value for 200 randomly selected records, two values for 200 randomly sampled records, and three values for 100 randomly sampled records. We create m = 10 completed datasets with the CDPMMN imputation engine, following the same approach as in Section 4.1, and estimate the regression coefficients in (17) using the multiple imputation point estimator. We also compute the point estimates based on D_orig and on only the complete cases.
To evaluate the point estimators, for each method we compute three quantities for each coefficient in β = (β_0, β_C, β_L) = (β_0, β_1, β_2). The first quantity is the simulated bias, Bias_j = Σ_{r=1}^{500} β̂_{r,j}/500 − β_{pop,j}, for j = 0, 1, 2, where β̂_{r,j} is a method-specific point estimate of β_j in replication r and β_{pop,j} is the value of β_j based on all 6607 cases. The second quantity is the total squared error, TSE_j = Σ_{r=1}^{500} (β̂_{r,j} − β_{pop,j})². The third quantity is the total squared distance between the point estimates based on the imputed (or only complete) data and the original data, TSD_j = Σ_{r=1}^{500} (β̂_{r,j} − β̂_{r,j(orig)})².

Table 4 displays the results of the simulation study. All methods are approximately unbiased, with the complete-cases analysis being far less efficient than the CDPMMN multiple imputation analysis. The CDPMMN multiple imputation closely follows the analysis based on the original data.

Figure 5. Ninety-five percent confidence intervals for β_C and β_L from the original data (first displayed interval), the CDPMMN multiply imputed data (second displayed interval), and the complete cases (third displayed interval). Using the CDPMMN results in intervals similar to those from the original data, and shorter than those from the complete cases.
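The three evaluation quantities can be computed directly from the per-replication estimates. A minimal NumPy sketch (function and argument names are ours):

```python
import numpy as np

def sim_metrics(beta_hat, beta_orig, beta_pop):
    """Simulated bias, total squared error, and total squared distance.
    beta_hat:  (R, J) method-specific estimates, one row per replication
    beta_orig: (R, J) estimates from the original (pre-blanking) data
    beta_pop:  (J,)   population values from all 6607 cases"""
    bias = beta_hat.mean(axis=0) - beta_pop                 # Bias_j
    tse = ((beta_hat - beta_pop) ** 2).sum(axis=0)          # TSE_j
    tsd = ((beta_hat - beta_orig) ** 2).sum(axis=0)         # TSD_j
    return bias, tse, tsd
```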
Since the repeated simulation results are based on a missing completely at random (MCAR) mechanism, we also perform a repeated simulation study with a missing at random (MAR) mechanism; see Appendix D in the supplementary materials for details. The performance of the CDPMMN imputations is akin to that in Table 4: point estimates are approximately unbiased, with roughly a 30% increase in variance due to missing data. The complete-cases analysis results in biased estimates.
For both the MCAR and MAR scenarios, we also examine results for multiple imputation by chained equations using the MICE software (Van Buuren and Groothuis-Oudshoorn 2011) in R. We use the default implementation, which is based on predictive mean matching using linear models. In any completed dataset, there are usually a handful of imputed values that violate the inequality constraints. Agencies seeking to release data clean of violations would have to change these values manually, whereas the CDPMMN imputation automatically respects the constraints. Nonetheless, we found that point estimators from predictive mean matching in MICE and from the CDPMMN imputation performed quite similarly in our simulations.

CONCLUDING REMARKS
The empirical analyses of the Colombia manufacturing data suggest that the CDPMMN offers a flexible engine for generating coherent imputations guaranteed to respect linear inequalities. We expect the CDPMMN to be computationally feasible with efficient parallel computing when the number of variables is modest, say on the order of 40 to 50. For larger numbers of variables, analysts may need to use techniques other than the CDPMMN, for example, models based on conditional independence assumptions such as Bayesian factor models (Aguilar and West 2000). We note that the general strategy of data augmentation combined with a Hit-and-Run sampler can be applied to any Bayesian multivariate model. In contrast, we do not view the number of records as a practically limiting factor, because the computations can be easily parallelized or implemented with GPU computing (Suchard et al. 2010). If computationally necessary, and potentially for improved accuracy, analysts can split the data into subsets of rows, for example, by industry classifications, and estimate the model independently across subsets.
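The core Hit-and-Run move underlying this strategy can be sketched as follows. This is a minimal NumPy sketch assuming the constraints are written as Ax ≤ b; for simplicity it draws uniformly along the feasible chord, whereas in our setting the draw along the chord would come from the conditional posterior. By construction, the move never leaves the constraint region.

```python
import numpy as np

def hit_and_run_step(x, A, b, rng):
    """One Hit-and-Run move inside the polytope {x : A x <= b}.
    Pick a uniform random direction, find the feasible chord through x,
    and sample a new point along that chord."""
    d = rng.standard_normal(len(x))
    d /= np.linalg.norm(d)
    # Feasibility of x + t*d requires t * (A d) <= b - A x, row by row.
    ad = A @ d
    slack = b - A @ x
    t_hi = np.min(slack[ad > 0] / ad[ad > 0]) if np.any(ad > 0) else np.inf
    t_lo = np.max(slack[ad < 0] / ad[ad < 0]) if np.any(ad < 0) else -np.inf
    t = rng.uniform(t_lo, t_hi)   # uniform draw along the chord (sketch only)
    return x + t * d
```

Because each proposed point lies on the chord's intersection with the polytope, every imputed value automatically satisfies all of the linear constraints.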
As with any imputation-model specification, it is prudent to examine the fit of the CDPMMN model for the particular context. As described in Gelman et al. (2005), one can compare the marginal distributions of the observed and imputed values as a "sanity check." Highly dissimilar distributions can suggest that the model does not describe the data well. One also can compute posterior predictive checks (e.g., He et al. 2010; Burgette and Reiter 2010; Si and Reiter 2013). As described by Si and Reiter (2013), the basic idea is to use the imputation model to generate not only Y_s but an entirely new full dataset; that is, create a completed dataset D^(l) = (Y_s^(l), Y_{n−s}) and a replicated dataset R^(l) in which both Y_s and Y_{n−s} are simulated from the imputation model. After generating many pairs (D^(l), R^(l)), one compares each R^(l) with its corresponding D^(l) on statistics of interest, such as regression coefficients and tail-area probabilities. When the statistics are dissimilar, the diagnostic indicates that the imputation model does not generate replicated data that look like the completed data, so it may not be generating plausible imputations for the missing data. When the statistics are not dissimilar, the diagnostic does not indicate evidence of imputation-model inadequacy (with respect to that statistic).
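Once a statistic of interest has been computed on each pair (D^(l), R^(l)), the comparison reduces to a simple posterior predictive probability. A minimal sketch of one common summary (a two-sided tail probability; the function name is ours):

```python
import numpy as np

def ppc_pvalue(stat_completed, stat_replicated):
    """Two-sided posterior predictive probability for one statistic:
    stat_completed[l] is the statistic from D^(l), stat_replicated[l] the
    same statistic from the paired R^(l). Values near 0 or 1 flag that
    replicated data do not look like the completed data."""
    s_d = np.asarray(stat_completed, dtype=float)
    s_r = np.asarray(stat_replicated, dtype=float)
    p = np.mean(s_r > s_d)           # fraction of pairs with R above D
    return 2 * min(p, 1 - p)
```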
Although the CDPMMN is intended primarily for continuous data, similar data augmentation strategies can be applied in other contexts of imputation under constraints. Manrique-Vallier and Reiter (forthcoming) use a truncated Dirichlet process mixture of multinomial distributions for imputation of missing unordered categorical data when the data include structural zeros, that is, certain combinations are constrained to have probability zero (e.g., pregnant men). For mixed categorical and continuous data with no missing categorical variables, one can apply the CDPMMN imputation model separately within cells defined by the implied contingency table, provided sample sizes within each cell are adequate. We are unaware of methods for fitting joint mixture models for mixed data when both the categorical and continuous variables are subject to constraints.
In addition to extending these ideas to mixed-data settings, several key areas in imputation under constraints need future research. Some variables may need to satisfy linear equalities, for example, logical sums. We did not account for these types of constraints here, although we anticipate that the Hit-and-Run sampler can be modified to do so. In edit-imputation settings, it is not clear that the Fellegi and Holt (1976) paradigm for error localization is optimal or advantageous in all settings. Error localization methods based on measurement error models derived from empirical evidence, combined with a fully coherent joint model for the imputation step, could result in higher quality edited/imputed data and subsequent analyses.

SUPPLEMENTARY MATERIALS
Further explanations of results: File consisting of (i) results of simulations that illustrate the insensitivity of multiple imputation inferences to specifications of the prior distributions for the CDPMMN imputation engine, (ii) the expression for the limits of range restrictions and ratio edits when using standardized logarithms in the imputation models, (iii) an illustration of how to find the boundaries of the feasible region when the constraints are represented by range restrictions and ratio edits, and (iv) the simulation study under the MAR assumption. (PDF file)