Application of iterative hybrid MI approach to household survey data with complex dependence structures

Abstract The multiple indicator cluster survey (MICS) is a household survey tool designed to obtain internationally comparable, statistically rigorous data of standardized indicators related to the health situation of children and women. Missing data in a large number of categorical variables are a serious concern for MICS, following complex dependency structures and inconsistency problems that impose severe challenges to the investigators. Despite the popularity of multiple imputation of missing data, its acceptance and application still lag in large-scale studies with complicated data sets such as MICS. We propose interdependent hybrid multiple imputation (HMI) techniques which combines features of existing MI approaches to handle complex missing data in large scale household surveys. The iterative HMI approach is observed to be a good competitor to the existing approaches, with often smaller root mean square errors, empirical standard errors and standard errors. Regardless of any combination, the iterative HMI method is markedly superior to the existing MI methods in terms of computational efficiency. Results from household data example support the capacity of proposed method to handle complex missing data.


Introduction
Multiple Indicator Cluster Survey (MICS) is an international household survey tool. MICS is developed by the United Nations Children's Fund (UNICEF) to obtain internationally comparable, statistically rigorous data of standardized indicators related to the health situation of children and women. MICS household questionnaire contains information of following dimensions of household head life: education, household characteristics, water and sanitation, salt iodization, hand washing facilities, water quality testing, and results, etc. Such background variables are important for data analysis, modeling, and policy research. MICS datasets have mixed type variables that are both, multilevel categorical and continuous variables.
However, missing data in a large number of variables are a serious concern for household surveys. For example the MICS 2014 house hold data file that we analyze, 26,819 only out of 41,413 observations have complete data on a set of more than 200 background variables. Respondent's may refuse to provide a requested piece of information based on various reasons, such as unwillingness, lack of capability to answer, reservation on sensitivity of question, confidentiality and privacy, etc. This results in the failure to collect complete information. Generally, this nonresponse behavior is referred to as item non-response (INR). Most typically, high rate of INR occurs for simple demographic variables such as age, sex or marital status however, questions related to income or wealth are often related to high rate of INR (e.g., Riphahn and Serfling 2005;Hawkes and Plewis 2006). Beside INR general reasons for the missing datasets include data entry errors, system failures, etc. Besides missing data problem, household surveys also have complex dependency structures and inconsistency problems that impose severe challenges to the investigators. Sensitivity or stigma related with the activity being reported is also a main cause of inconsistent responses whereas, complex dependency structures among units occur due to the clustered nature of the data. However, we mainly focus on the missing data problem and inconsistency and complex dependency problems are not addressed in this article.
Analysis of data for scientific investigations becomes complicated, biased and less efficient in presence of missing information. In recent decades, lots of effort has been made in development of statistical methods to carter missing data. Missing data can be handled by "Multiple Imputation" (MI). MI, first introduced by Rubin (1987), is widely regarded as the "gold standard" approach to handle missing data problem, with many documented advantages over complete case analyses. Multiple random values for the missing data under a statistical model can be generated to estimate the values multiple times using MI. This results in M > 1 multiple complete datasets. MI combines the results which account extra variability caused by the missing data. The complete datasets can be analyzed by using standard statistical procedures or so called "Rubin's inference." Multivariate normal model, the log linear model, or the general location model (Schafer 1997) are examples of MI. Despite the popularity of MI, its acceptance and application still lag in largescale studies with complicated data sets such as MICS data. Hence, MI is restricted in one or the other way and not dedicated to the complex household survey data. For instance complex relationships linked to large households, multiple generation households or two or more households increase the number of potential predictors for each imputation model. More information on complex household survey designs can be found in Binder (1983).
The article is organized as follows: First, we provide a description of notations and assumptions of missing mechanisms then briefly describing some fundamentals of missing data and MI. In Sec. 3 we describe hybrid architectures in details. In Sec. 4 we present the simulations studies, the methods used in the analyses and relevant results to evaluate our proposed approach. Section 5 presents the imputation of the household data. We conclude with a discussion in Sec. 6.

Notations and assumptions of missing mechanisms
In general, there are three types of missingness generating mechanisms. Missing categories can be classified into: (i) missing completely at random (MCAR), (ii) missing at random (MAR), (iii) missing not at random (MNAR) (Little and Rubin 2002). Let Y be the data with n Â p dimensions, where n and p denote the sample size and number of variables, respectively. Assume, y ij refers to the i th value of variable j from Y where i ¼ 1, … , n and j ¼ 1, … , p. Suppose, there are two components of the data set Y ¼ fY miss , Y obs } where, the first component denotes the observed part of the data and the second component is the missing data. Let H be a response indictor matrix with same dimensions as Y indicating, if an element of Y is missing.
are the averages of the within-imputation variances d W m and the between-imputation variance, respectively.

Literature review of existing studies in large-scale complex surveys
There are two general approaches for MI. Fully conditional specification (FCS; also known as sequential regression and MI using chained equations (MICE)) and MI based on the joint posterior distribution of incomplete variables, often referred to as joint modeling (JM) (Raghunathan et al. 2001;van Buuren 2007;Schafer 1997;van Buuren et al. 2006). FCS is an iterative process which cycles through incomplete variables one at a time and imputes data on a variable-by-variable basis. A conditionally specified imputation model known as MICE, visits sequentially each incomplete variable and draws alternately the imputation parameters and the imputed values. FCS MI approach imputes variables one at a time from a series of univariate conditional distributions (van Buuren et al. 2006). FCS approach requires existence of joint distribution for convergence, which is a major downside of this approach. It is possible to get the joint distribution under rather general conditions (Liu et al. 2014;Zhu and Raghunathan 2015). However, correct specification of conditional distributions can guarantee consistency of inferences based on the imputed data even in the absence of joint distribution. In MICE missing values can be present in many variables and user can specifies regression methods according to the types of variables. For example classification and regression tree (CART) (Burgette and Reiter 2010) for categorical variables and predictive mean matching (PMM) (Rubin and Schenker 1986) which is the default imputation technique for continuous data. CART is a nonparametric method. CART uses splitting algorithms to divide the values of a variable into homogeneous subgroups. On the other hand, PMM approach uses predicted value obtained by a linear regression model to impute an observed value. The predicted value is among the values of donor pool which are closest to the value predicted for the missing one. Software packages implementing MICE includes "mice" (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2012), "mi" in R (Su et al. 2011) and "IVEware" in SAS Institute (2014), (Raghunathan, Solenberger, and Van Hoewyk 2002). Despite of many advantages, MICE has few downsides for example, MICE mostly use parametric models. Those models are hard to implement due to lack of compatibility and complex dependencies among variables. Moreover, implementation is difficult due to higher order interactions effects or many nonlinear relations in regression model (see Burgette and Reiter 2010). Implementation of MICE becomes very time consuming in presence of large number of categorical variables. PMM can be problematic, when sample size is large (van Buuren and Groothuis-Oudshoorn 2011) and CART can subject to odd behaviors in high dimensions. Another limitation of CART is that the corresponding joint distribution based on conditional models might not exist (Si and Reiter 2013). Moreover, variables with many levels are preferred to variables with few levels in CART, e.g., Breiman et al. (1984) and Kim and Loh (2001).
Joint modeling (JM) draws missing values simultaneously for all incomplete variables using a multivariate distribution (Schafer 1997). Draws from fitted distribution are used to create imputations. Dirichlet Process Mixture of Products of Multinomial Distributions Model (DPMPM) provides a fully Bayesian, non-parametric JM approach to MI for high dimensional categorical data (Manrique-Vallier and Reiter 2015; Si and Reiter 2013). Dunson and Xing (2009) proposed DPMPM for the first time. This approach uses nonparametric Bayesian versions of latent class models to multiply impute high-dimensional categorical data (Vermunt et al. 2008). The DPMPM imputation routines are implemented in the R software package, "NPBayesImputeCat" (Quanli et al. 2019). Softwares "Realcom-impute" (Carpenter, Goldstein, and Kenward 2011), R package "pan" (Schafer and Zhao 2014), R package "jomo" (Quartagno and Carpenter 2016) implement JM approach.
Like many complex models, the effectiveness of DPMPM still lags in capturing the many features of empirical data. It is not possible to implement JM approach in the multilevel context if missingness also occurs in the random slope variable(s) (Carpenter, Goldstein, and Kenward 2011). Modeling mixed type variables can make the specification of a joint distribution very difficult.
MI approaches described above are available in standard computer packages (SAS, Stata, and R). See Horton and Kleinman (2007) for an overview of available MI procedures and packages. FCS and JM MI approaches were originally proposed for dealing with item nonresponse in crosssectional data sets. Despite of being commonly available in existing softwares, these methods are hard to implement in large scale data sets with many categorical variables and many levels.
In large-scale complex surveys many types of variables with special data situations have to be handled. To do so, several methods have been proposed in the literature over recent years. For example Audigier et al. (2018) deal with quantitative variables. Reiter (2014, 2015), Audigier, Husson, and Josse (2017) among many deal for qualitative and Audigier, Husson, and Josse (2016) and Murray and Reiter (2016) deal for mixed data. Methods for qualitative and mixed data tend to perform well particularly for small number of observations and dataset having multilevel categorical variables. Moreover, these methods often require less execution time. However, some of these approaches require knowledge of complicated models and other need transformations (or other tricks) for continuous variables or assume missing values in few variables. Categorizing of continuous variables can subject to considerable loss of information (van Buuren and Groothuis-Oudshoorn 2011). Husson et al. (2019) have proposed a MI method based on multilevel singular value decomposition (SVD) for quantitative, categorical, or mixed data. This method performs SVD on between and within groups variability of the data. Downside of this method is that it does not take into account the uncertainty associated with predicting missing values from observed values. Goßmann (2016) proposed the application of CART in combination with multiple imputation and data augmentation for large-scale survey. Mislevy (1991) presented the idea to combine multiple imputations with latent variables that were used to estimate population characteristics when individual values were missing in complex surveys. A Bayesian approach for flexible handling of missing values is proposed by Aßmann et al. (2016) which handles continuous and categorically scaled background variables in large-scale surveys. Stekhoven and B€ uhlmann (2012) have presented a machine learning technique based on nonparametric models called random forest models to impute ordinal missing data. It has many desirable properties such that can be applied to a variety of categorical data, a mix of categorical and continuous data. It does not require any specific distributional assumption. It can handle nonlinear relationships among variables (Doove, van Buuren, and Dusseldorp 2014;Shah et al. 2014). Random forest approach to MI is implemented in R packages "mice" (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2012) and "missForest" (Doove, van Buuren, and Dusseldorp 2014;Stekhoven and B€ uhlmann 2012). Shah et al. (2014) found that random forestbased MICE tends to perform better than parametric MICE on survival data. Hybrid MI based on dependence models (Razzak and Heumann 2019a) is another approach to impute complex household survey data. The dependence models impute continuous covariates using FCS MI given the categorical covariates already imputed using JM MI. The Hybrid MI based on dependence models not only yields better predictive performance of generalized linear models (GLMs) (Nelder and Wedderburn 1972) for binary response (Razzak and Heumann 2019b) forthcoming but are also observed to be a good competitor to the existing approaches, with often smaller root mean square errors and less computational cost. However, hybrid dependence models do not use the information of continuous covariates for imputing categorical covariates. In this article, we extend the hybrid imputation approach based on dependence models by categorizing continuous variables. We propose two iterative hybrid imputation approaches for mixed data in complex household surveys where missing values in continuous covariates are imputed by using the information of already imputed categorical variables and continuous variables are categorized to impute categorical variables. We review inference in GLMs with binary response and mixed type missing covariates in large scale survey for a proposed and existing methods.

Proposed hybrid architectures
Consider the motivational question in section one. Performance of JM and FCS approaches to obtain complete information on mixed type covariates in large scale surveys are limited and subject to specific tasks. Moreover, these approaches are generally not equipped to handle a wide range of complexities in large scale data, categorical variables, and different heretical relations. We propose that various features of JM and FCS methods can be combined to obtain complete data with the limitations discussed above. To do so, we propose two easy and simple to implement variants of hybrid architecture that use the idea of categorizing continuous data. In first variant of hybrid architecture, we use the concept of categorizing continuous variables before the imputation of categorical data. Second variant uses initial imputed values. These values are obtained by categorization of continuous data before the imputation of categorical data. Unlike existing approaches, where categorization results in loss of power, proposed approaches restore the continuous variables in their original form. These variants are computational fast and can be applied to both categorical and continuous data in high dimensions. The incomplete data MICS household data, which is used in the real data example latter, contains few continuous variables (e.g., age, time to get water, etc.) and many categorical variables (e.g., sex, type of toilet facility, etc.). The first variant of proposed hybrid architecture generates a complete data MICS set in three steps. It divides the MICS incomplete data in to two sub groups (i.e., one containing incomplete continuous data (Miss num ) and other having incomplete categorical data (Miss cat )).
Step 1: variables in Miss num are categorized Miss num.cat .
Step 2: JM technique is applied on Miss cat given additional covariates Miss num.cat to generate complete categorical data. Complete categorical data generated in this step contains complete categorical variables Imp cat and complete categorized variables Imp num.cat. In first step, categorization allows the information on continuations variables to impute categorical variables.
Step 3: FCS technique is applied to impute missing values in original continuous variables Miss num given additional categorical variables Imp cat .
Step 3, allows the information on categorical variables to impute continuous variables.
Steps 1 to 3 are repeated M times to generate multiple copies of complete data sets. Inference (e.g., mean, regression) can be run on each of the newly created, imputed datasets. Finally, estimates can be combined by using "Rubins rules." Algorithm 1 explains the proposed method in detail. Schematic diagram illustrating the proposed hybrid architecture 1can be seen in supplementary file (see Supplementary material Figure S1). The second variant of proposed hybrid architecture is a two steps approach.

Proposed
Step 1: (a) Initialize values for categorical variables (Imp cat i ) by applying JM approach to Miss cat . (b) Given the initial values for categorical variables, single iteration of the FCS algorithm is run to Miss num for initialization of values for continuous variables Imp num i : Information on categorical variables is used for the generation of Imp num i whereas, no information available on continuous variables is used in generation of Imp cat i : (c) Initial values for continuous variables Imp num i are categorized Imp num:cat i to allow usage of information available on continuous variables for imputing categorical variables.
Step 2: (a) Given the initial categorized variables (Imp num:cat i ) as additional covariates, complete categorical variables with updated values ðImp cat ) are generated by applying JM approach to Miss cat . (b) Given updated values of additional covariates Imp cat , complete continuous variables ðImp num ) with updated values are generated by applying single iteration of FCS approach to Miss num . (c) Updated values of complete continuous variables are categorized ðImp num:cat ). Steps 2(a-c) are repeated M times with new updated values of Imp cat , Imp num and Imp num:cat to obtain M complete data sets. Algorithm 2 explains the proposed method in detail. Schematic diagram illustrating the proposed hybrid architecture 2 is provided in supplementary file (see Supplementary material Figure S2).

A simulation study
To investigate the performance of hybrid architectures via simulation, somewhat large numbers (X ¼ 39) of mixed type variables are generated. To generate first thirty one binary (X b ) variables a multivariate normal (MVN) distribution is used and correlated random covariates C i compromising 1000 observations are generated. The marginal distributions are: The correlation structure is given as: where q ¼ 0.5. Random covariates (C i ) are transformed into binary values (X b ) using the following threshold: In order to generate outcomes for the two multilevel categorical covariates i.e., ðX m 1 and X m 2 Þ, we first generate two random covariates from normal distributions (ND) given as: C 32 $ N l 1 ; ffiffi ffi 2 p À Á , C 33 $ N l 2 ; ffiffi ffi 2 p À Á , where l 1 and l 2 are described as: Further, all observations in C 31 and C 32 are randomly split into various homogeneous groups and two multilevel categorical variables X m 1 and X m 2 are formed with four and six categories, respectively. To encode complex dependence relationships with higher order interactions, we generate another binary covariate X b 32 from Bernoulli distribution with probabilities governed by the logistic regression with logit PrðX b 32 Þ ¼ 0:001 À 0:01X b 1 À 0:09X b 2 À 0:09X b 3 À 0:09X b 4 þ 0:05X b 5 þ 0:08X b 6 À 0:02X b 7 þ 0:08X b 8 þ 0:01X b 9 þ 0:01X b 10 À 0: We then generate outcomes for the two continuous covariates i.e., X n 1 and X n 2 from normal distributions (ND). Description is as follows X n 1 $ Nðl 3 ; ffiffiffiffiffiffi 0:5 p Þ: where, where, Both continuous covariates are highly positively correlated i.e., r ¼ 0.9. Covariate dependent binary response y is generated from Bernoulli distributions with probabilities governed by the logistic regression with Equations 5-10 include high-order interactions to represent the type of complex dependence structures. Imputation approaches based on log-linear models or chained equations may fail to capture these structures. There is no particular importance of the specific values of the coefficients. Nonzero coefficients are specified for higher order interactions for generating complex dependencies. The analysis model of interest is the GLMs with link "logit". The observations in all covariates are missing (at random) with the probabilities based on a logistic probability distribution model. Probabilities for a random covariate X are given as: where i¼{1, … ,39} and j 6 ¼ i. Missingness in X i is attributed solely to other observed variable X j : This yields 10% of the observations to be MAR. Source code for the simulated data is available at GitHub repository (https://github.com/humera123845/code-for-simulated-sample-data).
We use a JM technique called DPMPM MI for categorical variables. DPMPM MI technique is selected due its ability to identify complex dependencies structure among categorical variables and computational efficient qualities in high dimensions. We use a FCS technique called MICE for continuous variables. MICE is selected due to its popularity and applications in wide range of fields. For comparison, two MICE based MI methods namely "Mice CART " (classification and regression trees (CART)) and "Mice DEF " (which uses logistic regression models for categorical and "PMM" for continuous variables as default) are used. Proposed hybrid architectures are implemented as "H.CART" and "H.DEF". The mixtures of multinomial distributions approach is combined with the MICE algorithms "CART" and "Default" in H.CART" and "H.DEF," respectively. Further, we express "H.CART" as "H.CART 1 " and "H.CART 2 " indicating first and second hybrid architectures based on CART. Similarly first and second hybrid architectures based on "default" are expressed as "H.DEF 1 " and "H.DEF 2 ," respectively. JM technique in hybrid architectures is implemented with prior specifications a a ¼ 0.25, b a ¼ 0.25, and somewhat large number of mixture components i.e., k ¼ 80. We used R (R Core Team 2018) version 3.0.1 to perform all calculations. The packages "mice" (van Buuren and Groothuis-Oudshoorn 2011), version 2.17 and "NPBayesImputeCat" (Quanli et al. 2019), version 0.6 were used to perform MICE for continuous data and Non-Parametric Bayesian MI for categorical variables, respectively. These blended versions of joint and sequential modeling MI techniques make it possible to obtain complete datasets with information available on both types of variables. The imputation model contains all of the variables from the generated data in order to preserve the relationships between the variables of interest (Schafer 1997;Moons et al. 2006;White, Royston, and Wood 2011;van Buuren 2012). The parameters of interest are estimated using Rubin's aforementioned method on Z ¼ 1000 simulation runs. Ten imputed data sets for each of the proposed and the MICE MI methods are generated for realistic applications (Fichman and Cummings 2003). Table 1 displays the performance of MI methods for simulated data. Graphical comparisons of the imputation methods based on boxplots (White, Royston, and Wood 2011;van Buuren 2012) of standard errors and point estimates across 1000 simulations for regression coefficients are presented in Figures 1 and 2, respectively.

Evaluation criteria
The quality of MI methods is evaluated based on two error-based measurements i.e., root mean square error (RMSE) and empirical standard errors (ESE) (Akande, Li, and Reiter 2017;Armina et al. 2017). RMSE is computed as a combination of the bias and variance of the estimate (Burton et al. 2006). ESEs can be considered to access the between imputation variations. The smaller values for RMSEs and ESEs indicate better performance (Oba et al. 2003). RMSE and ESE are calculated using the following formulas: where q z M denote the estimated parameter pooled over M imputed data sets and Z simulation runs and b denote original parameters.

Results
There seem to be similarities in structure among all MI methods i.e., all methods are upward biased for binary covariates e.g., X b 1 , whereas, the average point estimates based on default and H.DEF methods are closer to the corresponding true values as compared to other methods. CART and hybrid methods are slightly downward biased for multilevel covariate with six levels e.g., X m 1 5 X b 1 : The average point estimates for multilevel covariate with six levels based on Figure 1. Simulated data: Boxplots for the point estimates across 1000 simulations by imputation methods under Missing at Random (MAR) and ten imputations with 10% of missing data. Point estimates are shown for only three regression coefficients, i.e., for variables X b1 , X m1 5 , X b13 X b30 : The horizontal red lines indicate the respective "true" values.
CART and H.CART methods are closer to the corresponding true values as compared to H.DEF methods. All methods are downward biased for the interaction terms e.g., X b 13 X b 30 , whereas, the average point estimates based on default, CART, H.DEF methods and H.CART 2 method are closer to the corresponding true values as compared to H.CART 1 (Figure 1). Hybrid and CART methods tend to have smaller standard errors as compared to default method for all covariates, whereas the hybrid methods tend to have similar standard errors as compared to CART for most of the cases (Figure 2). The estimated ESEs for the all hybrid methods are smaller for all types of covariates except the binary covariate. H.DEF methods and H.CART 2 show similar or slightly higher ESEs as compared to default and CART methods for the binary covariate. The estimated ESEs for the H.CART 1 are smallest for the multilevel covariate with six levels and H.DEF 2 has smallest ESEs for the interaction terms. All hybrid methods tend to have smaller estimated RMSEs for binary covariate where H.DEF 2 has smallest RMSEs as compared to all methods. The estimated RMSEs for all hybrid methods are slimier to default and CART methods for the multilevel covariate with six levels whereas the H.CART 1 has the smallest RMSEs among others. Similarly for interaction term, all hybrid methods tend to have smaller RMSEs for most of the cases where H.DEF 2 shows smallest RMSE among the remaining methods (Table 1). The estimated ESEs (RMSEs) and averages of point estimates (standard errors) for all coefficients under hybrid architecture 1 and 2 are provided in supplementary file (Supplementary material Tables S1-S4). Boxplots for point Figure 2. Simulated data: Boxplots for the standard errors across 1000 simulations by imputation methods under Missing at Random (MAR) and ten imputations with 10% of missing data. Standard errors are shown for only three regression coefficients, i.e., for variables X b1 , X m1 5 , X b13 X b30 : estimates (standard errors) for all coefficients under hybrid architecture 1 and 2 are given in supplementary file (Supplementary material Figures S3-S18).

Motivation
National studies like Government of Pakistan Economic survey (2008) highlighted that nearly 50 million individuals are deprived from safe drinking water in Pakistan. Our motivation stems from data obtained from MICS Punjab, 2014. MICS in Punjab was conducted in the Punjab province of Pakistan with joint collaboration of the Bureau of Statistics (BOS) Punjab and the United Nations Children's Fund (UNICEF). Final and key findings report, survey plan, list of indicators, questionnaires and training agenda of MICS Punjab 2014 is available for download via a dedicated BOS Punjab website (www.bos.gop.pk). MICS Punjab questionnaire for household contains more than two hundred indicators on variety of household's conditions. For example indicators on house conditions (e.g., number of rooms used for sleeping, main material of floor and roof, etc.), access to general facilities (e.g., electricity, radio, television, non-mobile phone, refrigerator, etc.), source of drinking water (e.g., main source of drinking water and other purposes, location of the water source, duration to get water and come back, person collecting water, treatment for water to make safer for drinking, etc.), sanitation facilities (e.g., type of toilet facility, water available at the place for hand washing, soap or detergent present at place of hand washing, etc.). Binary logistic regressions models can be fitted to describe household trends in access to improved water sources and sanitation facilities. Associated factors like location, demographic and socio-economic, etc. can be further use for prediction. Information based indicators described above can prove to be very useful in policy making in order to improve quality of drinking water and sanitation in Punjab.

Imputation of MICS household data
We use a secondary household data from the Punjab Multiple Indicator Cluster Survey in 2014. A GLM with a logit link is used to describe associations between access to water and sanitation, and geographic, demographic, and socio-economic factors. Most of the background variables related to geographic, demographic, and socio-economic characteristics in MICS data for household are categorical with many categories having complex data structures and large amount of missingness. For example geographical region of Punjab is divided into 36 districts. Living area has two levels i.e., urban or rural. Statistical models based on survey data sets contain both, Estimated bias is simply a difference between root mean square error and empirical standard error. All results are based on 10 imputations and 1000 simulations. Estimates are shown for only three regression coefficients (Coef.) i.e., for variables X b1 , X m1 5 , X b13 X b30 : Bold figures indicate the smallest mean root mean square errors, mean empirical standard errors and amount of bias among various imputation variants.
continuous and categorical variables and it can be tedious for MICE to specify imputation models and interaction terms in presence of such complications (van Buuren and Oudshoorn 1999). Therefore for the proper comparisons, multiple categories for categorical variables were reduced by merging them and a sub-sample of 57 variables is selected which contains information on water and sanitization, hand washing, and household characteristics. For the sake of keeping the analysis comparable and challenging at the same time, variable "Main material of exterior walls" is included in the sub-sample which has 15 levels. Among all these variables, forty nine variables are categorical with multiple categories and remaining are continuous, only two variables are fully observed. The missing data rates in most items were moderate. Items carrying great substantive importance, such as "Person collecting water," 83% values were missing; "Energy use for cooking" indicator was missing at approximately 68%; the indicator on whether the child needed to be physically punished to be brought up properly was missing at approximately 37% (see supplementary file Tables S5 and S6). We assume items are MAR in data under consideration. The R package "VIM" (Templ et al. 2012) is utilized for exploring data and the pattern of missing values. Graphics for the all variables in sub sample are provided in a supplementary file (see Supplementary material Figures S19-S25).

Logistic regression model
To identify key determinants of water quality, we use a dichotomous variable indicating whether the household do anything to the water to make it safer to drink. That is, WT ¼ 0 if household do not do anything to the water to make it safer to drink, 1 if household do anything to the water to make it safer to drink: & where WT denotes water treatment status. We determine two explanatory variables associated with the binary response "WT:" We then used a Logistic regression model, given by where X 1 , X 2 are the predictor variables, "type of area (rural or urban)" and "soap/other material available for washing hands (yes or no)," respectively and p denoted the probability that the household do not do anything to the water to make it safer to drink. The binary predictor "soap_avilb_wash_hand" has the highest amount of missing values (i.e., about 9%) while the amount is rather small in the other two variables (i.e., less than 8% for response "treat_water_make_safe" and less than 6% for predictor "area"). See supplementary file for summary of all variables. Since there are no true values to compare for real data example, we calculated complete case (CC) estimates for comparison purpose. The CC analysis uses only the complete cases (i.e., n ¼ 26,819). The point estimates of GLM for "type of area" and "soap/other material available for washing hands" are 1.361 and 1.111, respectively. Whereas, standard errors for "type of area" and "soap/other material available for washing hands" are 0.106 and 0.052, respectively. Similar to simulation study, point estimates and standards for M ¼ 10 completed data sets across 50 simulations are calculated for real data (see Figures 3 and 4). ESEs and means of point estimates (standard errors) and computational time for various MI methods are shown in Tables 2 and 3, respectively.

Results
We note that the standard errors for all of the coefficients are smaller compared to their point estimates under all MI methods (see Figures 3 and 4). The empirical example with real data indicated that the MICE methods and HMI variants yielded differing point estimates. We noticed that point estimates in both default and CART methods are nearer to the estimates in CC analysis for all cases with larger standard errors as compared to hybrid methods (see Table 2). Figure 4 displays smaller standard errors for hybrid variants (i.e., H.DEF 1 , H.CART 1 , H.DEF 2 , H.CART 2 ) as compared to default and CART methods. ESEs and means of standard errors for hybrid variants are also smaller as compared to other methods (see Table 2) whereas these estimates are smaller for H.DEF 2 and H.CART 2 as compared to H.DEF 1 and H.CART 1 , suggesting better performance over default and CART. Given the results produced by the MI methods, a look at the computation times in Table 3 may be useful for a further comparison. We found that hybrid variants ran quite fast followed by default method whereas, it took almost 5 days by CART to run on standard computers for a small subset of incomplete household data. Surprisingly, this time was reduced to almost half a day when hybrid methods were applied. We also found that hybrid variants also resulted in satisfactory performance when applied the full MICS household data set with hundreds of variables and categories with multiple levels whereas, methods based on MICE were not even able to run this large dataset due to complex structures. Thus, there exist significant differences in terms of the computational efficiency among the MI methods.

Conclusion and future research
This article describes the mechanisms of two hybrid strategies to handle missing data in large scale survey data with complex dependence structures among categorical variables and high percentage of missing information. After comparing the performance of various multiple imputation algorithms, we showed that both proposed hybrid variants of the multiple imputation algorithms were clearly superior to MICE MI methods not only in terms of the accuracy of imputation, but were also markedly superior to the others in terms of the computational efficiency. Practitioners can easily use our proposed methods to handle complex survey data because our techniques rely mostly on previously implemented algorithms. Our current work is limited to MAR mechanism, however, we believe that the biases due to wrongly assumed missingness mechanism are minimal when the imputation models are kept as rich as possible to the extent where they are estimable. We also believe that a data generating processes considered in simulation study can be generalized to a large number of situations. However, we have no sound grounding to prove that the comparisons we make here will always apply for any data. In particular, we have not yet considered alternative categorizations for continuous variables such as ordinal, unordered or multiple categories. Issues like convergence and appropriate selection of predictors are beyond the scope of the present article. This study has for the first time provided an overview and a systematic comparison of previous approaches to MI for large scale complex data implemented in conditional models. We propose that the performance of proposed algorithms can be improved by extending the categorization process of continuous variables to ordinal or multiple categories. Since proposed approach requires the covariates to be strongly correlated in order to work properly, further evaluations with diversity of experimental settings will undoubtedly be needed to account for this.