Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data

ABSTRACT In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a 3-year-old child who is married or a man who is pregnant. In practice, however, reported data often include such impossible combinations due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully use information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U.S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online.


Introduction
In surveys with multiple categorical data items, the reported data frequently include erroneous values, that is, combinations of answers that are theoretically impossible or inconsistent across items. These could result from respondent error, for example, a parent accidentally checks boxes indicating that his child is 5 years old and has attained a university degree, or a person selects "male" for sex and "yes" to answer whether or not the person ever had a hysterectomy. They also could result from data processing errors (Groves and Lyberg 2010; Biemer 2010). Regardless of the source, left uncorrected, erroneous values can diminish the quality of analysis and interpretation of the data (Fuller 1987; Groves 2004; Durrant and Skinner 2006).
Recognizing this, many data stewards such as national statistics institutes (NSIs) purge data of overtly erroneous values in a process known as edit-imputation. Edit-imputation is particularly salient for organizations sharing data with the public: releasing files with erroneous values could cause the public to lose confidence in the validity of the data and in the organization more broadly. In fact, many data stewards invest quite substantial resources in edit-imputation processes, exceeding 30% of survey budgets in some NSIs (Granquist and Kovar 1997; Norberg 2009).
It is generally impractical and cost-prohibitive to recontact all respondents with potentially erroneous values. Thus, many data stewards rely in part on automated methods of edit-imputation, in which algorithms select and "correct" erroneous values with minimal human intervention. These automatic editing systems typically run in two steps: an error localization step, in which some set of each record's values is determined to be in error, and an imputation step, in which those values are replaced with plausibly correct values (De Waal, Pannekoek, and Scholtus 2011). Most editing systems use variants of the error localization approach suggested by Fellegi and Holt (1976): for any record with impossible reported values, change the minimum number of fields (variables) needed to make that record a theoretically possible observation (e.g., see Winkler 1995; Winkler and Petkunas 1997). The subsequent imputation step usually is a variant of single imputation generated from a hot deck or parametric model (e.g., Winkler 2003, 2008).

Kim et al. (2015) pointed out that Fellegi and Holt approaches to edit-imputation, henceforth abbreviated as F-H approaches, have two key drawbacks. First, they do not fully use the information in the data in selecting the fields to impute. To illustrate, consider an example where the variables include sex, hysterectomy status, and age in years. When a record is reported as a male with a hysterectomy who is 20 years old, it seems more plausible to change hysterectomy status to no than to change sex to female, because hysterectomies are relatively uncommon among 20-year-old women (Merrill 2008). The minimum-number-of-fields criterion results in changing one of sex or hysterectomy status. The data steward might select among these two solutions based on some heuristic, for example, change the variable that is more likely to have errors according to experience in other contexts. Second, the organization generally cannot be certain that an F-H (or any) error localization has identified the exact locations of errors. This uncertainty is ignored by specifying a single error localization; hence, analyses of data corrected by F-H approaches underestimate uncertainty (Kim et al. 2015). For example, there are 20-year-old women with hysterectomies, and inferences should reflect that possibility (as well as other feasible variations, including changing multiple variables) as increased uncertainty.
In this article, we propose an alternative to F-H methods for categorical data that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values. The support of the latter model excludes the set of structural zero cells. A similar strategy was used by Kim et al. (2015) for data with only continuous variables, in which true values are required to satisfy prespecified linear inequalities and equalities on the relationships among the variables. By fully integrating the editing and imputation steps, we encode a probability distribution on the error locations that is informed by the observed relationships among the variables in the data, and we fully incorporate uncertainty in the process of selecting and replacing erroneous values. The Markov chain Monte Carlo (MCMC) algorithm for estimating the model generates datasets without structural zeros as by-products. These can be disseminated as public use files and analyzed using multiple imputation techniques (Rubin 1987). We note that multiple imputation for measurement error correction also has been suggested by, among others, Little and Smith (1987), Heitjan and Rubin (1990), Ghosh-Dastidar and Schafer (2003), Cole, Chu, and Greenland (2006), Parker and Schenker (2007), and Blackwell, Honaker, and King (2015).
The remainder of the article is organized as follows. In Section 2, we present the Bayesian hierarchical model for edit-imputation, which we call the EI-DPM model. We first describe the model for measurement error, which comprises sub-models for the error locations and the reported values. We then describe the Dirichlet process mixture model for the underlying true data. In Section 3, we present an MCMC algorithm for sampling from the EI-DPM model that guarantees all edit constraints are satisfied. In Section 4, we report results of simulation studies based on a subset of data from the 2000 U.S. census. Here, we compare the accuracy of inferences from the EI-DPM to those from an F-H edit-imputation approach. In Section 5, we conclude with a discussion of future research directions.

The EI-DPM Model
Suppose we have a sample of n subjects measured on J categorical variables. We associate each subject i ∈ {1, . . . , n} with a true response vector x_i = (x_i1, . . . , x_iJ), where x_ij ∈ C_j = {1, . . . , L_j} for j = 1, . . . , J. True responses, x_i, correspond to the data we would observe if they were perfectly recorded. Let X = (x_1, . . . , x_n) be the vector of all true responses. We do not observe X; rather, we observe the reported data Y = (y_1, . . . , y_n). Here, each y_i ∈ C = C_1 × · · · × C_J is a potentially contaminated version of its corresponding x_i. We assume that observable data are stochastically generated conditional on the true responses through a measurement process, with density M(y | x_i, θ_y) for y ∈ C. If this process is such that M(y | x_i, θ_y) = δ_{x_i}(y), we call it a perfect measurement process. Whenever y_i ≠ x_i, we say that the ith individual's observed data contain errors.
True responses are subject to a set of edit rules that enumerate impossible responses. Formally, edit rules are a subset S ⊂ C for which we know a priori that Pr(x_i ∈ S) = 0, that is, they are a set of structural zeros as defined by Bishop, Fienberg, and Holland (1975). Often S is defined by the organization that collected the data. We assume that the organization has specified a logically consistent S; this can be checked using the methods of Fellegi and Holt (1976) and subsequent work.
Unlike true responses, reported data potentially can take any possible value in C, that is, Pr(y_i ∈ S) > 0. Therefore, whenever y_i ∈ S we know for sure that y_i contains errors. We call errors that result in a direct violation of edit rules detectable. Conversely, whenever x_i ≠ y_i but y_i ∉ S, we call the errors undetectable. We note that it is the existence of errors within a record that is detectable, not necessarily the location of the errors. For example, when a reported record is a male with a hysterectomy, we can be sure that there is an error, but we cannot know, at least not from the record only, whether the error is the reported sex, the reported hysterectomy, or both.
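To make the distinction concrete, the following minimal sketch checks whether a record directly violates a set of edit rules, that is, whether its errors (if any) are detectable. The variable names, level encodings, and the two rules are hypothetical and purely illustrative.

```python
# Minimal sketch: flagging detectable errors. Variable names, level encodings,
# and the two edit rules below are hypothetical, for illustration only.
EDIT_RULES = [
    {"sex": "male", "hysterectomy": "yes"},        # impossible combination
    {"age": "0-13", "marital_status": "married"},  # impossible combination
]

def violates_edits(record, rules=EDIT_RULES):
    """Return True if the record falls in a structural zero cell (detectable error)."""
    return any(all(record.get(var) == level for var, level in rule.items())
               for rule in rules)

# A record with a detectable error: it directly violates the first edit rule.
print(violates_edits({"sex": "male", "hysterectomy": "yes", "age": "20-29"}))    # True
# A record can still contain errors even if it violates no rule (undetectable).
print(violates_edits({"sex": "female", "hysterectomy": "yes", "age": "20-29"}))  # False
```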
We assume that the true response vectors are iid samples from a common data-generating process with density f_x(x | θ_x, S) for x ∈ C. Here the dependence on S stresses the fact that the support of the distribution must be truncated to the set C \ S to avoid inconsistent values. Under this setup, the objective is to use the contaminated data, Y, to estimate the joint distribution of the true responses.

Measurement Model
When we consider X as given, the measurement model M(y | x_i, θ_y) encodes assumptions about how the observable responses are generated from the true responses. It is useful to specify the measurement model in two parts. The first part, the error location model, specifies which among the n × J entries in X are reported incorrectly. For i = 1, . . . , n, let E_i = (E_i1, . . . , E_iJ), where E_ij = 1 when there is an error in the (i, j)th location, and E_ij = 0 otherwise. By definition, E_ij = 1 if and only if x_ij ≠ y_ij. Generically, we write the distribution of the error location model for all n records as p(E | X, θ_y). The second part, the reporting model, specifies which values are observed for fields in error. Generically, we write the distribution of this model as p(Y | X, E, θ_y). Thus, we can write

p(Y, E | X, θ_y) = p(Y | X, E, θ_y) p(E | X, θ_y).

The error location and reporting models can be adapted to reflect a priori assumptions about the measurement error process. For example, one can allow the likelihood of error for variable j to depend on the level of the true response by using a regression of E_ij on x_ij. More simply, one could favor certain combinations of E_i over others. For example, one could assume Pr(E_ij = 1 | x_i, θ_y) = ε_j, allowing differential probability for changing some variables over others. This is sensible when some variables are more reliably reported than others. Alternatively, one might assume that errors are generated completely at random, so that all E_ij have independent Bernoulli distributions with a common error rate ε. This location model defines a perfect measurement model when ε = 0.

For the reporting model, it is computationally convenient to assume that, conditional on the existence of errors, substitutions are independent. Similar independence assumptions are made, for example, in record linkage contexts (e.g., Fellegi and Sunter 1969; Steorts, Hall, and Fienberg 2014). Generically, we write the model as

Pr(y_ij = l* | x_ij = l, E_ij = 1) = q_jl(l*).   (1)

Here, q_jl(l*) is the probability of reporting level l* ∈ {1, . . . , L_j} for variable j when the actual value is l ∈ {1, . . . , L_j}. We require Σ_{l*=1}^{L_j} q_jl(l*) = 1 and q_jl(l) = 0. Interpreted as a generative process, this model simply states that whenever the jth variable is erroneous, the reported data will be sampled from all possible values except for the correct one. Absent any information about the propensities of reporting errors, one can assume a uniform substitution process, making q_jl(l*) ∝ 1 for all l* ≠ l and q_jl(l) = 0. This leads to the probabilities

q_jl(l*) = q_j I(l* ≠ l), where q_j = 1/(L_j − 1).   (2)

We note that the uniform substitution model in (2) is only one example and may not be appropriate for some data. When analysts have information about the substitution process, for example, erroneous values are likely to be categories adjacent to the correct category on a survey form, analysts can and should incorporate that information in the model instead of assuming uniform substitution.

We complete the specification of the measurement error model with prior distributions for θ_y. For example, for a Bernoulli error location model, one can use

ε ∼ Beta(a_ε, b_ε).   (3)

This prior specification has the advantages of being conjugate to the error location model and of having a flexible and natural interpretation. Interpreting the prior expected value a_ε/(a_ε + b_ε) as the "prior error rate," and a_ε + b_ε as the "prior sample number of responses" (see Gelman et al. 2013, p. 35), we can encode beliefs about the quality of the observed data.
When the error rates are allowed to differ across variables rather than setting ε_j = ε for all j, one instead could use ε_j ∼ Beta(a_{ε_j}, b_{ε_j}) in place of (3), selecting values that reflect a priori beliefs about the error rate of each variable.
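As a concrete illustration of this measurement process, the sketch below simulates independent Bernoulli(ε) error locations with uniform substitution over the remaining L_j − 1 levels. The coding of levels as 1, . . . , L_j, the sample size, and the value of ε are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminate(X, levels, eps):
    """Bernoulli(eps) error-location model with uniform substitution.

    X      : (n, J) integer array of true responses, coded 1..L_j per column.
    levels : length-J list of the number of levels L_j of each variable.
    eps    : common per-entry error probability.
    Returns (Y, E): reported data and error-location indicators.
    """
    n, J = X.shape
    E = rng.random((n, J)) < eps          # E_ij ~ Bernoulli(eps), independently
    Y = X.copy()
    for j in range(J):
        for i in np.where(E[:, j])[0]:
            # uniform substitution: any level except the true one, prob 1/(L_j - 1)
            wrong = [l for l in range(1, levels[j] + 1) if l != X[i, j]]
            Y[i, j] = rng.choice(wrong)
    return Y, E.astype(int)

# Toy example: n = 5 records, J = 3 variables with 2, 3, and 4 levels.
levels = [2, 3, 4]
X = np.column_stack([rng.integers(1, L + 1, size=5) for L in levels])
Y, E = contaminate(X, levels, eps=0.4)
```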
A key feature of this formulation is that we do not assume that the error generation depends in any way on the edit rules S. In particular, we do not assume that y_i ∉ S implies y_i = x_i. From a generative perspective, this feature seems a sensible modeling choice: the generation of errors need not be contingent on our ability to detect them. However, this is a departure from the F-H approaches applied in most NSIs. F-H approaches are based on the principle of minimizing the number of changes required for the observed data to satisfy all the edits. This implies that whenever y_i ≠ x_i but y_i ∉ S, that is, the record has errors but satisfies all edits, an F-H approach prescribes no changes to that record. Thus, F-H essentially treats undetectable errors as nonexistent; in other words, it assumes a measurement model that only can generate detectable errors.
Although perhaps unrealistic, measurement models that generate only detectable errors are implicitly used by many NSIs.
Typically these organizations are reluctant to change reported data values as a matter of principle. To adhere to this logic, that is, fixing only records with detectable errors, we can adjust the measurement model in the EI-DPM by setting E_i = (0, . . . , 0) for all records with y_i ∉ S. This effectively forces the model to assume x_i = y_i for all records without detectable errors.
Alternatively, we can enforce changing a small number of fields through the prior distribution on the error rates. Specifying small probabilities of errors with high certainty, for example, using small a_ε/(a_ε + b_ε) with large a_ε + b_ε in (3), implies that errors are rare by default. This results in posterior distributions where individual records are considered erroneous only when strongly suggested by the data, which is most likely because of a direct violation of the edit rules. This specification can be considered as a model-based analog of the F-H principle.

True Response Model
The true response generation model, f_x(x | θ_x, S), in principle can be any multivariate discrete distribution, as long as it is supported only on the set C \ S. In practice, we desire f_x(x | θ_x, S) to be rich enough to capture the relevant multivariate structure in X. One such model is the truncated Bayesian nonparametric latent class model (TNPLCM), introduced by Manrique-Vallier and Reiter (2014a), which we now review.
Latent class models (LCM; Lazarsfeld and Henry 1968; Goodman 1974) are widely used for representing discrete multivariate distributions. Putting aside for the moment the complications posed by the structural zeros, the LCM is the finite mixture of K products of multinomial distributions,

f_x(x | π, λ) = Σ_{k=1}^{K} π_k Π_{j=1}^{J} λ_jk[x_j],   (4)

where π = (π_1, . . . , π_K) are mixture component probabilities such that Σ_{k=1}^{K} π_k = 1, and λ = (λ_jk[l]) are J × K sets of multinomial probabilities with Σ_{l=1}^{L_j} λ_jk[l] = 1 for j = 1, . . . , J and k = 1, . . . , K. For large enough K, this model can represent any probability distribution over C arbitrarily well (Dunson and Xing 2009).
The mixture in (4) can be represented as a two-step generative process for x_i. Let each individual belong to one latent class, z_i ∈ {1, . . . , K}. For any i = 1, . . . , n, we can write (4) as

x_ij | z_i, λ ∼ Discrete(λ_{j z_i}[1], . . . , λ_{j z_i}[L_j]) independently for j = 1, . . . , J,   (5)
z_i | π ∼ Discrete(π_1, . . . , π_K).   (6)

This representation facilitates estimation via Gibbs sampling (Ishwaran and James 2001; Si and Reiter 2013; White and Murphy 2014). Dunson and Xing (2009) suggested letting K = ∞ and using an infinite stick-breaking process (Sethuraman 1994) for the prior distribution on π. This prior distribution has full support on the distribution of probabilities of x, and it does not restrict dependence structures a priori (Dunson and Xing 2009). For an almost-sure approximation (Ishwaran and Zarepour 2002) that is computationally more convenient, one can set K to a large but finite integer (Ishwaran and James 2001; Si and Reiter 2013). In this representation, we let π_k = V_k Π_{h<k} (1 − V_h), where V_k ∼ Beta(1, α) independently for k = 1, . . . , K − 1 and V_K = 1. The prior specification can be completed with diffuse priors α ∼ Gamma(0.25, 0.25) and λ_jk[·] iid∼ Dirichlet_{L_j}(1, . . . , 1) for j = 1, . . . , J and k = 1, . . . , K. See Dunson and Xing (2009) and Si and Reiter (2013) for additional discussion of the prior distributions.
The LCM does not automatically assign zero probability to combinations x ∈ S, that is, potential outcomes known to have probability zero. To enforce Pr(x ∈ S) = 0 when estimating cell probabilities, Manrique-Vallier and Reiter (2014a) introduced the truncated LCM,

f_x(x | π, λ, S) ∝ I(x ∉ S) Σ_{k=1}^{K} π_k Π_{j=1}^{J} λ_jk[x_j].   (7)

Manrique-Vallier and Reiter (2014a) used the K-dimensional stick-breaking process as the prior distribution of π. To estimate model parameters, Manrique-Vallier and Reiter (2014a) relied on a sample augmentation strategy. They considered X, all of whose elements satisfy x_i ∉ S, to be a subset of a hypothetical sample X* of N ≥ n records, directly generated from (5)-(6) without regard to the truncation of the support. Let X_0 and Z_0 be the responses and latent class labels corresponding to those samples in X* that did fall into the set S. Manrique-Vallier and Reiter (2014a) showed that, by using the improper prior distribution p(N) ∝ 1/N (Meng and Zaslavsky 2002), the marginal posterior distribution of the parameters (π, λ, α) after integrating out (N, X_0, Z, Z_0) matches that based on the truncated representation in (7).
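The following sketch illustrates the generative process implied by (5)-(7) with finite K: stick-breaking mixture weights, Dirichlet-distributed λ_jk, and rejection of any draw that lands in S. The dimensions, hyperparameter values, and the toy edit rule are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
J, K = 3, 20                 # number of variables and (finite) latent classes
levels = [2, 3, 4]           # L_j for each variable
alpha = 0.5                  # stick-breaking concentration (illustrative value)

# Stick-breaking construction of the mixture weights pi_1, ..., pi_K (with V_K = 1).
V = np.append(rng.beta(1.0, alpha, size=K - 1), 1.0)
pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

# Per-class multinomial probabilities lambda_jk ~ Dirichlet(1, ..., 1).
lam = [rng.dirichlet(np.ones(L), size=K) for L in levels]   # lam[j][k] has length L_j

def in_S(x):
    """Toy edit rule: levels (1, 1) of the first two variables are impossible."""
    return x[0] == 1 and x[1] == 1

def draw_true_response():
    """Draw x_i from the latent class model, rejecting draws that fall in S."""
    while True:
        z = rng.choice(K, p=pi)                                         # latent class z_i
        x = [rng.choice(levels[j], p=lam[j][z]) + 1 for j in range(J)]  # levels coded 1..L_j
        if not in_S(x):
            return x

X_true = np.array([draw_true_response() for _ in range(10)])
```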

Edit-Imputation Model Used in Illustrations
In the empirical illustration, we use a measurement model with a common error rate ε and a uniform substitution process. Putting this together with a TNPLCM as the true response model, we have

y_ij | x_ij, E_ij = 1 ∼ Uniform(C_j \ {x_ij}),   (8)
y_ij = x_ij when E_ij = 0,   (9)
E_ij | ε ∼ Bernoulli(ε),   (10)
ε ∼ Beta(a_ε, b_ε),   (11)
p(x_i | z_i, λ, S) ∝ I(x_i ∉ S) Π_{j=1}^{J} λ_{j z_i}[x_ij],   (12)
z_i | π ∼ Discrete(π_1, . . . , π_K),   (13)
π_k = V_k Π_{h<k} (1 − V_h), with V_k ∼ Beta(1, α) for k < K, V_K = 1, and α ∼ Gamma(0.25, 0.25),   (14)
λ_jk[·] ∼ Dirichlet_{L_j}(1, . . . , 1).   (15)

Here i ranges from 1 to n, and j from 1 to J.

MCMC Estimation
To estimate the model in (8)-(15), we use the Gibbs sampler outlined in Section 3.2. This sampler uses the fact that editing rules, that is, representations of structural zeros, often can be expressed as the union of disjoint table slices. A table slice is a subset of C defined by fixing a subset of the coordinates of the potential responses, for example, the set {x ∈ C : x_1 = 1, x_3 = 2}. Edit rules frequently are formulated as a collection of combinations of levels of a few variables (often just two at a time) that are deemed impossible. For example, a rule forbidding records where "sex = male" and "hysterectomy = yes" defines a table slice. More complex rules can be decomposed into simple slice definitions. For example, a rule specifying that people younger than 14 years old cannot be married can be translated as the union of all the table slices formed by keeping the variable for marital status fixed at "married," and making the age variable take all the discrete levels corresponding to ages less than 14. Expressing S as table slices can facilitate significant computational gains in the Gibbs sampler for the truncated LCM (Manrique-Vallier and Reiter 2014a, 2014b). Hence, we begin by defining notation for table slices corresponding to edit rules.
For any slice definition μ = (μ_1, . . . , μ_J) ∈ C*, where C* = (C_1 ∪ {*}) × · · · × (C_J ∪ {*}) and * denotes an unrestricted (wildcard) coordinate, let

⟨μ⟩ = {x ∈ C : x_j = μ_j for every j such that μ_j ≠ *}.

We call ⟨μ⟩ the table slice defined by μ. In the example above, the table slice includes all possible x that have (x_1 = 1, x_3 = 2). Note that ⟨μ⟩ is a subset of C, whereas μ is an element of the set C*. The slice definition notation is useful because it allows us to define formal operations in the space C* that map directly into set operations in C. In particular, we define the intersection of slice definitions μ_A and μ_B as the slice definition vector int(μ_A, μ_B) = (γ_1, . . . , γ_J) ∈ C* such that, for any j = 1, . . . , J, γ_j = μ_Bj if μ_Aj = *, and γ_j = μ_Aj if μ_Bj = * or μ_Aj = μ_Bj; when μ_Aj and μ_Bj are distinct non-wildcard levels, the slices are disjoint and the intersection is undefined. It is easy to verify that ⟨int(μ_A, μ_B)⟩ = ⟨μ_A⟩ ∩ ⟨μ_B⟩ whenever ⟨μ_A⟩ ∩ ⟨μ_B⟩ ≠ ∅. For example, the intersection of slice definitions μ_A = (1, *, 2, *, *) and μ_B = (1, 1, *, *, *) is the slice definition int(μ_A, μ_B) = (1, 1, 2, *, *). This slice definition represents the set {x ∈ C : x_1 = 1, x_2 = 1, x_3 = 2}, which in turn corresponds to ⟨μ_A⟩ ∩ ⟨μ_B⟩.
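A minimal sketch of these slice operations, representing the wildcard * as None; the example slice definitions are the ones used above.

```python
# Sketch of the table-slice operations described above. A slice definition is a
# length-J tuple whose entries are either a fixed level or None (the wildcard "*").

def in_slice(x, mu):
    """True if the response vector x lies in the table slice defined by mu."""
    return all(m is None or xi == m for xi, m in zip(x, mu))

def slice_intersection(mu_a, mu_b):
    """int(mu_a, mu_b): slice definition of the intersection, or None if disjoint."""
    out = []
    for a, b in zip(mu_a, mu_b):
        if a is None:
            out.append(b)
        elif b is None or a == b:
            out.append(a)
        else:
            return None        # coordinates conflict: the two slices are disjoint
    return tuple(out)

# Example from the text, with "*" written as None:
mu_a = (1, None, 2, None, None)
mu_b = (1, 1, None, None, None)
print(slice_intersection(mu_a, mu_b))   # (1, 1, 2, None, None)
print(in_slice((1, 1, 2, 3, 4), mu_a))  # True
```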
In what follows we assume that the collection of edit rules is characterized by a collection of C disjoint slice definitions, so that S = ∪_{c=1}^{C} ⟨μ_c⟩. This typically requires a preprocessing step of the original collection of edits. Manrique-Vallier and Reiter (2014a) proposed a simple orthogonalization algorithm based on a repeated application of the int(·, ·) operator that could be used for this purpose. We describe the algorithm in the online supplement.

Gibbs Sampler
We use a Gibbs sampler to estimate the posterior distribution of (X, E, Z, Z_0, X_0, π, λ, α, ε, N) for the model in Section 2.3. Given (Z, Z_0, X_0, π, λ, α, N), we update (X, E, ε) using the steps in Section 3.2.1. Given a set of true responses X, that is, after imputing "corrected" values for fields deemed to be erroneous, we update (Z, Z_0, X_0, π, λ, α, N) using the sampling strategy of Manrique-Vallier and Reiter (2014a). These steps are shown in Section 3.2.2.
After running the Gibbs sampler to convergence, analysts can obtain posterior inferences from relevant parameters. Alternatively, analysts can treat the samples of X generated by this algorithm as multiple instances of "corrected" datasets for use in multiple imputation inference (Rubin 1987). To do so, analysts select a modest number, say M, of datasets sufficiently spaced so that they are approximately independent.
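For completeness, the sketch below shows the standard combining rules of Rubin (1987) that one would apply to the M completed datasets; the point estimates and within-imputation variances passed in are placeholders for quantities computed from each multiply edited-imputed dataset.

```python
import numpy as np
from scipy import stats

def rubin_combine(estimates, variances, alpha=0.05):
    """Combine M completed-data point estimates and variances (Rubin 1987)."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    M = len(estimates)
    qbar = estimates.mean()                 # combined point estimate
    ubar = variances.mean()                 # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    t = ubar + (1.0 + 1.0 / M) * b          # total variance
    df = (M - 1) * (1.0 + ubar / ((1.0 + 1.0 / M) * b)) ** 2
    half = stats.t.ppf(1.0 - alpha / 2.0, df) * np.sqrt(t)
    return qbar, t, (qbar - half, qbar + half)

# Toy usage: a proportion estimated from M = 50 completed datasets of size n = 1000.
qs = np.random.default_rng(2).normal(0.30, 0.01, size=50)   # placeholder estimates
us = np.full(50, 0.30 * 0.70 / 1000)                        # placeholder variances
print(rubin_combine(qs, us))
```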

Sampling (X, E, ε)
For the model with different error rates for each variable, the full conditional distribution of each ε_j is

ε_j | · ∼ Beta(a_{ε_j} + s_j, b_{ε_j} + N − s_j),

where s_j = Σ_{i=1}^{N} I(E_ij = 1) and (a_{ε_j}, b_{ε_j}) are specific hyperparameters for variable j. For the model with ε_j = ε and (a_{ε_j}, b_{ε_j}) = (a_ε, b_ε) for all j, this expression simplifies to

ε | · ∼ Beta(a_ε + s, b_ε + JN − s),

where s = Σ_j s_j.

Sampling from (E_i, x_i) is more involved. Since the vector E_i is completely determined by x_i and y_i, we cannot form Gibbs steps for sampling E_i independently of x_i using the full conditionals p(x_i | . . .) and p(E_i | . . .). Instead, we sample directly from p(x_i, E_i | . . .) using the conditional factorization

p(x_i, E_i | . . .) = p(E_i | x_i, y_i) p(x_i | . . . , −{E_i}),

where p(x_i | . . . , −{E_i}) denotes the pmf of x_i, conditional on all the parameters and data except for E_i. Allowing different ε_j so as to present these pmfs in their most general form, we have

p(x_i = x | . . . , −{E_i}) ∝ I(x ∉ S) Π_{j=1}^{J} λ_{j z_i}[x_j] {(1 − ε_j) I(x_j = y_ij) + ε_j q_{j x_j}(y_ij) I(x_j ≠ y_ij)}.   (20)

Sampling from (20) is difficult because of the factor I(x ∉ S), which induces dependency among the coordinates of x_i. A simple solution is to use a rejection sampling scheme, sampling repeatedly from (20) without considering the truncation until getting a value x_i ∉ S. This method works well when the rejection probability is small. When this is not the case (e.g., under severe prior misspecification for ε; see discussion at the end), we can use a conditional sampling strategy that exploits the special structure of S, which is a disjoint union of table slices. Noting that

p(x_i | . . . , −{E_i}) = Π_{m=1}^{J} p(x_im | x_i1, . . . , x_i,m−1, . . . , −{E_i}),

we sample the coordinates of x_i one by one using their partial conditional distributions. For this we rely on the following result, proved in the Appendix.
Theorem 1. Let the region of structural zeros, S, be defined by a disjoint collection of table slices, that is, S = ∪_{c=1}^{C} ⟨μ_c⟩ with ⟨μ_c⟩ ∩ ⟨μ_{c′}⟩ = ∅ for c ≠ c′. Then, for each m = 1, . . . , J, the partial conditional distribution p(x_im | x_i1, . . . , x_i,m−1, . . . , −{E_i}) is available in closed form; we refer to this expression as (22).

Applying this theorem, we sample from (x_i, E_i) as follows. For m = 1, . . . , J, sample

x_im ∼ Discrete(p_1, . . . , p_{L_m}),

where we compute p_1, . . . , p_{L_m} using result (22) from Theorem 1. We then set E_ij = I(x_ij ≠ y_ij) for j = 1, . . . , J.
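Before turning to the hybrid strategy used in practice, the sketch below illustrates the rejection-sampling alternative for this step under the common-ε, uniform-substitution model of Section 2.3. The array layouts, the in_S predicate, and the attempt cutoff are assumptions made for illustration, and the conditional fallback based on Theorem 1 is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_true_record(y_i, z_i, lam, levels, eps, in_S, max_tries=500):
    """Rejection step for (x_i, E_i): draw each coordinate of x_i from its
    untruncated conditional (the pmf in (20) without the factor I(x not in S)),
    then accept only draws outside S. Returns None after max_tries, signalling
    that a conditional sampler based on Theorem 1 should be used instead."""
    J = len(y_i)
    for _ in range(max_tries):
        x_i = np.empty(J, dtype=int)
        for j in range(J):
            L = levels[j]
            # lambda weight times the reporting probability: (1 - eps) for the
            # reported level, eps / (L_j - 1) for every other level
            w = lam[j][z_i] * np.where(np.arange(1, L + 1) == y_i[j],
                                       1.0 - eps, eps / (L - 1))
            x_i[j] = rng.choice(L, p=w / w.sum()) + 1
        if not in_S(x_i):
            E_i = (x_i != np.asarray(y_i)).astype(int)
            return x_i, E_i
    return None   # fall back to the conditional sampler (not shown)

def update_eps(E, a_eps=1.0, b_eps=1.0):
    """Conjugate Beta update for a common error rate given all error indicators."""
    s = E.sum()
    return rng.beta(a_eps + s, b_eps + E.size - s)
```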
The conditional sampling strategy based on Theorem 1 is guaranteed to work; however, computing probabilities using the formula in (22) can be computationally expensive. As a compromise, we suggest and use a hybrid strategy. We start with the rejection sampling scheme, trying to get a proposal accepted up to a maximum number of trials (we used a cutoff of 500 attempts in our calculations in the next section). After that threshold is reached, we default to the conditional sampling method. In our experience, this hybrid method is reasonably fast and robust, taking advantage of the low computational cost of sampling directly from the LCM when possible and turning to a more sophisticated sampler when not.

Sampling (Z, Z_0, X_0, π, λ, α, N)

We sample (Z, Z_0, X_0, π, λ, α, N) using an adapted version of the seven steps outlined in Manrique-Vallier and Reiter (2014a).

Empirical Study
We now empirically illustrate the performance of the EI-DPM for edit-imputation and compare it to an F-H approach to edit-imputation. To do so, we use a subset of the 5% public use microdata file for the 2000 U.S. decennial census for the state of New York (PUMS; Ruggles et al. 2010). These data also were used by Manrique-Vallier and Reiter (2014a, 2014b). The data comprise H = 953,076 subjects and J = 10 categorical variables: ownership of dwelling (3 levels), mortgage status (4 levels), age (9 levels), sex (2 levels), marital status (6 levels), single race identification (5 levels), educational attainment (11 levels), employment status (4 levels), work disability status (3 levels), and veteran status (3 levels). This results in a contingency table with 2,566,080 cells. The data documentation indicates 60 pairwise combinations of variable levels that are deemed impossible; we take these to form a set of edit rules. For example, for any individual i, the true response to the variable OWNERSHIP (ownership of dwelling) cannot take the value "Rented" at the same time as the variable MORTGAGE (mortgage status) takes the value "No, owned free and clear." After translating these rules into table slice definitions and applying the algorithm of Manrique-Vallier and Reiter (2014a), we end up with 567 disjoint slice definitions. This represents a substantially simplified characterization of the edit rules, as 2,317,030 of the cells correspond to impossible responses. See the online supplementary material for a list of the 60 pairwise edit constraints.
We examine the performances of EI-DPM and the F-H approach under two types of editing preferences. In Section 4.1, we encourage the EI-DPM to correct undetectable (and detectable) errors by assuming ε ∼ Beta(1, 1), which expresses complete ignorance about the nature of the common error rate. In Section 4.2, we discourage EI-DPM from changing anything but detectable errors, as often adopted by NSIs.

Encouraging Undetectable Error Correction
We consider the H records as a population, from which we take 500 independent, random samples of n = 1000 individuals. Since public use files released by the U.S. Bureau of the Census do not contain errors, we contaminate each of the 500 samples using the independent errors and uniform substitution model, with error rate ε = 0.4. Thus, of the 10 × 1000 = 10,000 entries in each test dataset, approximately 4000 have been replaced by a random value different from the actual one. With this error rate, we expect 994.0 records per sample, essentially all of them, to have at least one error. However, not all these errors are detectable. In fact, for a given contaminated sub-sample, only approximately 78% of the records with errors actually contain one or more violations of the edit rules, meaning that about 22% of the records contain only undetectable errors. We note that 40% is a very high error rate; our objective is to put the EI-DPM method through a challenging stress test.
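These expected counts follow directly from the independent-errors contamination model; as a quick check,

```latex
\[
\mathbb{E}[\#\text{entries changed}] = n J \varepsilon = 1000 \times 10 \times 0.4 = 4000,
\qquad
\mathbb{E}[\#\text{records with errors}] = n\bigl(1 - (1-\varepsilon)^{J}\bigr)
  = 1000\bigl(1 - 0.6^{10}\bigr) \approx 994.0 .
\]
```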
For each sample, we use the EI-DPM model to generate 50 multiply imputed datasets. We use multiple imputation combining rules (Rubin 1987) to estimate all 1824 three-way marginal proportions that are estimable from the 500 samples. We also create 50 multiply imputed datasets using the F-H paradigm. Specifically, in each of the 500 samples, we independently employ the R package "EditRules" (De Jonge and Van der Loo 2012) to select a single set of error localizations that minimizes the number of changes needed to force each record to satisfy all edits. Since the error-localization solution need not be unique, in case of ties we use the default behavior of the package, which is to choose one solution randomly. Once we select the location of the errors, we blank and multiply impute the selected values. To make comparisons between F-H and EI-DPM as fair as possible, we generate imputations using the model of Manrique-Vallier and Reiter (2014b), which is similar to the sampler in Section 3.2. We note that for all the editing methods considered here, some analyses may not be congenial (Meng 1994) to the edit-imputation method, which can result in inferences with undesirable theoretical properties, for example, conservative variance estimates.

Figure 1 illustrates the effects of the contamination procedure on the quality of inferences, before any edit-imputation. Here, we contrast the population values and empirical frequencies of each of the 1824 three-way marginal probabilities over the 500 replications, using the samples before and after contamination. Unsurprisingly, the uncontaminated frequencies lie almost perfectly on the main diagonal, and the frequencies from the contaminated data are extremely biased.

Figure 2 displays the 1824 three-way margins estimated from the 50 multiply edited-imputed datasets generated by EI-DPM and by the application of the F-H method. Compared to the estimates obtained from the raw contaminated data in Figure 1, the EI-DPM edit-imputation procedure (Figure 2, left panel) produces remarkably accurate estimates of the target quantities. The F-H edit-imputation approach (Figure 2, right panel) is not as accurate. Even after the error-localization and multiple imputation steps, estimates obtained through the F-H approach exhibit notable bias; in fact, they are similar to those obtained directly from the contaminated data (Figure 1). Figure 3 presents the mean squared error (MSE) for these two groups of estimates. The MSEs of the F-H estimates tend to be around 20 times larger than those from EI-DPM.

Figure 4 displays the empirical coverage rates of 95% confidence intervals for the 1824 three-way marginal population parameters, using EI-DPM as well as the F-H approach. These intervals are based on the methods of Rubin (1987). We also display the empirical coverage of 95% confidence intervals obtained from multiply imputing the faulty values using the true error locations as generated during the data contamination procedure. We use these values, in principle unattainable from the edited-imputed datasets, as a gold-standard reference to calibrate the coverage rates. Most of the intervals from EI-DPM have at least 80% coverage rates. Typically, the undercoverage results from small absolute biases with small standard errors. In contrast, only 21% of the intervals from the F-H procedure have at least 80% coverage rates, and 30% of them are exactly zero.
To examine whether or not similar patterns hold at lower rates of error, we repeat the simulation using ε = 10% in place of ε = 40%. As seen in Figure 5, edit-imputation by the EI-DPM offers more accurate estimates than edit-imputation by the F-H method, although as expected the differences are less pronounced because fewer values require imputation. See the online supplement for additional results and details.
In these simulations, the measurement error model used in this version of EI-DPM correctly describes the actual measurement error process. Because of the use of a correct measurement error model, and the existence of a sufficient number of true data values, the model appropriately gives low probability to reported values being true when they are unusual compared to the rest of the data. In other words, the model favors changing these unusual values to ones that are more like the rest of the data. However, the ability of the model to do so depends on the quality of the measurement error model (as well as the underlying true data model).
To examine the robustness of the EI-DPM to measurement error model misspecification, we repeat the simulation by contaminating data so that the common error and substitution models are misspecified. Specifically, we contaminate the data using distinct column-wise error rates, ε_j, randomly sampled from a Beta(5, 50) distribution. The resulting ε_j have a median of 8.5%, with 95% of values between 3.1% and 17.9%. We also use a nonuniform substitution model, with probabilities proportional to samples from a Gamma(40, 1) distribution. As an example of the probabilities that can result, for a variable with five levels (and hence four possible erroneous values per possible true response) the typical ratio of maximum and minimum probabilities is around 1.4, with 95% of ratios between 1.1 and 1.9. For estimation we use the same EI-DPM model, assuming a common error rate ε (i.e., ε_j = ε for all j) and uniform substitution. The EI-DPM continues to provide reasonable estimates under the misspecified measurement error model, and performs slightly better than the F-H approach both in terms of bias (Figure 6) and coverage (Figure 7). In fact, about 85% of the EI-DPM intervals have coverage greater than 85%, while this proportion falls to 57% for the F-H method. Of course, this is only one example of model misspecification; other types may affect EI-DPM and the F-H approach in different ways.

Encouraging or Enforcing Only Detectable Error Corrections
As discussed in Section 2.1, there are cases in which an NSI might want to take a more conservative approach to edit-imputation, so that the imputation process tends not to alter entries that are not in direct violation of edit rules. Here we explore the two approaches proposed in Section 2.1: using a highly informative prior distribution centered on very low error rates, and setting E_ij = 0 for all j for records i without detectable errors. This latter approach prevents the edit-imputation engine from editing records without detectable errors, akin to the F-H principle.
To illustrate the first approach, we return to the simulation with ε = 0.4 and the uniform substitution model. We use the EI-DPM with the prior distribution ε ∼ Beta(1, 10^5). This expresses a strong (unwarranted in this illustration) belief in the quality of the data, giving the prior specification a weight equivalent to roughly 10 times the data in the posterior inference on ε. As a result, we expect to reduce the probability of detecting errors that are not evident, that is, those that do not result in violations of the edit rules. Using one run with ε = 0.4 as an example, we find that the mean posterior probability of detecting at least one error in records with inconsistencies is exactly 100%, whereas in faulty records without edit violations it drops down to 5.6%. As expected, the failure to catch and correct undetectable errors induces large biases in the estimates (Figure 8, left panel). We note, however, that these biases are noticeably lower than those obtained with the F-H approach (Figure 8, right panel). Stronger priors (e.g., a_ε = 1 and b_ε = 10^6) result in behaviors even closer to, but still more accurate than, the F-H results; see the online supplement for details.
To illustrate the second approach, that is, setting E_ij = 0 for all j for records i without detectable errors, we consider a new simulation design where we randomly contaminate 500 samples of n = 1000 records with rate ε = 0.4, but where we leave records that would have only undetectable errors untouched, that is, we reset y_i = x_i for these records. This still represents a difficult stress test for the methods, in that typically only about 22% of the records are error-free. As evident in Figure 9, the EI-DPM method produces accurate estimates of the target quantities, whereas the F-H method produces highly biased results. We note that for very small error rates, say in the 1% to 2% range, we expect that the differences between the conservative EI-DPM and the F-H method would be practically irrelevant, as the imputation procedures affect few values.
Finally, we test the robustness of the EI-DPM to misspecification in the measurement error model when setting E_ij = 0 for all j for records with no detectable errors. We contaminate the data with different per-variable rates ε_j iid∼ Beta(5, 50) and a nonuniform substitution model with probabilities proportional to samples from a Gamma(40, 1) distribution. We eliminate undetectable errors by resetting y_i = x_i for any record without detectable errors. This results in a median (across the 500 replications) of 278 rows with errors. We fit the misspecified EI-DPM with ε_j = ε for all j and uniform substitution, setting E_ij = 0 for records that do not violate the edit rules. As evident in Figure 10, the EI-DPM and the F-H approach perform similarly, with the former producing slightly less biased results. We find that 61.2% of EI-DPM point estimates have smaller absolute biases than their F-H counterparts. The EI-DPM also performs slightly better than F-H in terms of confidence interval coverage. We find that 95.8% of the EI-DPM intervals versus 90% of the F-H intervals have coverage rates greater than 85%. Additionally, 87.5% of the EI-DPM intervals have larger coverage rates than their F-H counterparts. The online supplementary material contains additional results for both simulations that force E_i = 0 for records with no detectable errors.

Concluding Remarks
In the scenarios we examined, the EI-DPM outperforms the Fellegi-Holt methods. The gains are substantial when the measurement error model in the EI-DPM faithfully describes the measurement error in the data, and smaller (but still evident) when it does not. These benefits stem not so much from the underlying imputation model, which is the same in both editing strategies, as from the stochastic nature of the error selection. The minimum number of fields to impute (MFI) criterion that we use in the simulations does not consider multivariate relationships, whereas the EI-DPM does. Further, the EI-DPM allows for changing multiple fields, even if a single change would satisfy the MFI criterion. There can be situations where changing only the minimum number of fields generates a corrected record with feasible but unlikely values, whereas changing more than the minimum number of fields allows for a more plausible imputation. This clearly can be the case with undetectable errors. Nonetheless, NSIs still may have practical reasons to favor methods that enforce, directly or approximately, the MFI criterion. In such cases, NSIs still can benefit from applying EI-DPM in ways that allow changing undetectable errors, as this can serve as a way of evaluating the quality of the data.
As discussed by Kim et al. (2015), agencies using F-H approaches could force higher probability of editing certain fields based on combinations of values in other fields. For example, for a reported male with a hysterectomy, the agency could decide to change hysterectomy status from yes to no when the person reports being 20 years old, whereas it could change sex from male to female when the person reports being 60 years old. Such heuristics could get cumbersome when based on multivariate relationships involving many variables. The EI-DPM automatically lets the data identify unusual combinations based on relationships among all variables, thereby potentially leveraging important associations that were not anticipated by the agency (Kim et al. 2015). Unlike F-H approaches with such heuristics, the EI-DPM recognizes the uncertainty in the fields to be edited. The 20-year-old person still could be a woman who has undergone a hysterectomy, or a 20-year-old man who has not.
The EI-DPM approach can handle missing values simultaneously with erroneous values. One sets E_ij = 1 for all fields with missing values, forcing x_ij to be imputed. We note that this presumes values are missing at random, as is typical in applications of edit-imputation. It also presumes that the same error location and true response models describe the records with errors and records with missing data.
The EI-DPM method can be computationally efficient, particularly for modest sample sizes. In most of our examples with sample sizes n = 1000, producing the 50 multiply edited-imputed datasets from a single faulty dataset took only approximately 58 sec, using personal desktop computers. Computation times are strongly dependent on the particular application. Naturally, longer computing times are required for larger sample sizes (e.g., it took about 177 sec to create 50 completed datasets with a sample of n = 5000). Longer computation times also are required when the model uses strong prior specifications for the error rates that do not accord with the true values, for example, using ε ∼ Beta(1, 10^5) when in fact ε = 0.4. This results from the difficulty of imputing values within the feasible region C \ S when the error rate is estimated with a severe bias. The proposed sampling algorithm offers ample opportunities for optimization through parallelization. In particular, the many imputation steps (Section 3.2.1, and steps 1 and 6 in Section 3.2.2) can be easily split among different processors or computers. Our current implementation, available upon request, is single-threaded C++ with an interface in R. We are currently testing an R package implementing the model in Section 2.3 that will be made available on CRAN. We note that, as suggested by a reviewer, it also may be possible to speed up the algorithm, perhaps substantially, by using integer programming techniques to enumerate and propose feasible solutions. This may be especially useful for large-scale edit-imputation contexts, for example, involving millions of records.
Independent Bernoulli error location models, potentially with variable-specific error rates, are simple specifications useful for many applied settings. However, sometimes more complex models are appropriate. For example, errors may be correlated within records in situations where individuals have different propensities to commit errors. We conjecture that edit-imputation can be improved in such cases by using mixture models for the vector of error locations. We believe that developing and evaluating such extensions of the EI-DPM is a topic worthy of future research.

Appendix: Proof of Theorem 1
Proof. To compute the required pmf we need to marginalize

Supplementary Materials
The supplementary materials include the description of the algorithm for transforming a collection of table slice definitions into a collection of disjoint definitions. They also include results from additional simulations.

Funding
This research was supported by a grant from the National Science Foundation (SES-11-31897).