Systematic handling of missing data in complex study designs – experiences from the Health 2000 and 2011 Surveys

ABSTRACT We present a systematic approach to the practical and comprehensive handling of missing data motivated by our experiences of analyzing longitudinal survey data. We consider the Health 2000 and 2011 Surveys (BRIF8901) where increased non-response and non-participation from 2000 to 2011 was a major issue. The model assumptions involved in the complex sampling design, repeated measurements design, non-participation mechanisms and associations are presented graphically using methodology previously defined as a causal model with design, i.e. a functional causal model extended with the study design. This tool forces the statistician to make the study design and the missing-data mechanism explicit. Using the systematic approach, the sampling probabilities and the participation probabilities can be considered separately. This is beneficial when the performance of missing-data methods are to be compared. Using data from Health 2000 and 2011 Surveys and from national registries, it was found that multiple imputation removed almost all differences between full sample and estimated prevalences. The inverse probability weighting removed more than half and the doubly robust method 60% of the differences. These findings are encouraging since decreasing participation rates are a major problem in population surveys worldwide.


Introduction
Unequal sampling probabilities and selective missing data mechanisms markedly complicate the analysis of survey data [14,35]. Due to these challenges, standard tools and analysis methods are not always directly applicable and modifications are required. Making modifications of this kind can easily require ten-fold the time and effort, compared to a standard analysis flow in studies with simple random sampling (SRS) and complete data.
In surveys, non-response and non-participation are synonyms, which are used to describe missing data [34]. The term 'non-response' is commonly used in questionnaire surveys. The term 'non-participation' on the other hand is commonly used in epidemiological studies such as health examination surveys (HES), in which study subjects need to arrive to the examination clinic in order to participate in the survey. In this work we use the term non-participation for missing data.
From the population perspective, both sampling and non-participation induce missing data: individuals not selected for the sample are missing by design. A decision on sample size is taken during the design phase, on the basis of accuracy targets and budget constraints, and the probabilities associated with the selection process are usually known. By contrast, unit non-response and item non-response lead to unintentional missingness with unknown selection probabilities. These selection probabilities can be estimated only if assumptions such as missing at random (MAR) [37,41,46,51] are feasible or if there is prior knowledge of the selection mechanism [20].
In this paper we aim to present a systematic approach towards the practical handling of missing data. Our motivation is based on our experiences of analysis of the Health 2000 Survey [3] and Health 2011 Survey [31]. The Health 2000 Survey was a national HES performed in Finland in [2000][2001]. The sample of 9922 adults was selected by a stratified two-stage cluster sampling design in 2000. In the following we refer to this survey as the baseline. In the Health 2011 Survey, the study subjects covered by the Health 2000 Survey were invited to a health examination (HE). A new sample of young adults was also selected to compensate for the aging of the original cohort. In 2000, the unweighted participation rate was as high as 93% while in 2011 it was only 73%. If the differences in the participation rates are ignored, the studies may not reliably reflect changes in the population's health between 2000 and 2011. The related data collection is described in detail in Section 2.
An analysis of the Health 2000/2011 Survey requires knowledge of complex survey designs [35], longitudinal analysis [40], handling of missing data [41] and model selection methods. To ensure that all aspects of statistical analysis are considered, we divided the analysis flow into the following four steps: (1) Description of data collection using a graphical model. (2) Derivation of the sampling probabilities for the two-stage design.
(3) Modeling of non-participation. (4) Systematic comparison of alternative methods of handling missing data.
These steps are explained in Section 3.
On the basis of earlier research on non-participation in health surveys, we know that survey non-participants are more often single men and from younger age groups, with a lower socio-economic status and living in urban areas [8,9,13,22,30,39,54]. Nonparticipants have higher mortality than survey participants [17,23,33], more frequent hospitalizations [2,13,28], and worse self-reported health [22]. They are also known to be daily smokers more frequently [13,22], to have more mental disorders [9,29,39] and to be in receipt of disability benefits more often [29,30,39].
Previous studies have reported differences in estimates of population health indicators due to non-participation in the survey results, resulting from the above differences between survey participants and non-participants [11,15,16,38,55,61]. The non-participation profile in the Health 2000/2011 Survey is presented in Section 4 and discussed in Section 5 of this paper.
By linking the complete survey sample to administrative registers, which have been shown to have a good coverage, we can obtain data on various socio-demographic characteristics as well as on health. We apply these data in two ways. First, the sociodemographic variables are used as auxiliary population information to remove the effects of non-participation.
Second, in order to compare the performance of different methods of handling missing data, we need variables which were observed to the full sample, not only to the participants. By linking the complete survey sample to administrative registers, we can obtain health data on disability pension, hospitalization and medicine reimbursements. The prevalence of these variables is estimated using only participants and the various methods. The results are then compared with the prevalences estimated on the basis of the complete survey sample. The results of the comparison are presented in Section 4, and the benefits of the systematic approach are discussed in Section 5.

Data and sampling designs
The details of the sample design are described in the subsections below. To summarize the sample designs, the Health 2000 Survey was based on a stratified proportional to size (PPS) two-stage design with health center districts (HCDs) as primary sampling units (PSUs) and a systematic sample of persons selected at the second stage by the Social Insurance Institution of Finland.
One part of the Health 2011 Survey was simply the full Health 2000 Survey sample, with their given selection probabilities. The 18-28 years olds required an additional sample in order to cover the adult population in 2011.
The inferential populations are (1) the cross-sectional population of those aged 18+ in Finland in 2000, (2) the cross-sectional population of those aged 18+ in 2011, and (3) the cohort of the 2000 population followed to 2011. The following subsections give an overview of the sampling designs, and further details can be found in Supplement [18].

The health 2000 survey
The Health 2000 Survey was a national HES carried out in 2000-2001. For this survey, the target population encompassed persons aged 18 years or older living in mainland Finland on 1 July 2000.
The design was a stratified two-stage cluster sampling design. Twenty geographical strata were based on the 15 largest towns, while the rest of continental Finland was divided into 5 strata based on the university hospital regions.
A total of 80 HCDs were selected for a sample, including the 15 largest towns with probability 1, and a systematic PPS sampling of smaller HCDs (clusters) as the PSUs, in such a way that the sample contained 16 HCDs in each university hospital region. Systematic sampling of persons was applied so that the sample size in each stratum was proportional to the corresponding population base. The total sample size was 9922. For further details, see Laiho et al. [32].

The health 2011 survey
The sample used for the Health 2011 Survey was designed to provide a representative longitudinal data on the Finnish population. First, the 8135 eligible sample members from the baseline Health 2000 Survey were invited to participate in the Health 2011 Survey. Of these 8135 sample members, 1573 had died, 96 had emigrated and 109 had forbidden any contacts in the future [31] during the 11-year follow-up. In addition, the sample was amended with a new sample of 1994 young adults aged 18-28 years, since the baseline sample had aged by 11 years during the follow-up period. Details of this new sample are presented in the supplement.
The baseline sample represents the target population on 1 July 2000. It was further realized that during the period up to 2011, the composition of the original sample had changed due to mortality and emigration. Overcoverage detected in the original sample with respect to the 2011 target population was caused by the same mechanism as with respect to the entire population. We therefore have good grounds for considering the composition of the original sample in 2011 to constitute a proper sample of the 2011 target population. However, undercoverage due to immigration to Finland after the year 2000 was not represented by the baseline sample. The target population therefore became those persons who belonged to the baseline target population and who were alive and resident in Finland in the year 2011. The migration of study subjects between strata was also considered representative with respect to migration within the target population. However, because the allocation of the sample between strata in 2011 could not be controlled by means of the sampling design (except in the case of young adults), the stratum sample sizes were considered random.
In addition to the old sample, a new sample of young adults aged 18-28 years was sampled. The new sample was selected from the same areas as in the Health 2000 Survey.

Register data
We had a possibility of linking the full survey sample, to several administrative registers, which have been shown to have a good coverage (see, e.g. [62]). This was done using the personal ID numbers provided for everyone who resides for at least one year in Finland. The samples were collected by the Population Register Centre using information on the day of birth and municipality of residence as the list variables.
Administrative registers to which the survey data were linked were as follows: (1) Care Register for Health Care [59] (former Hospital Discharge Register). (2) Reimbursement of medical expenses [26] from the Social Insurance Institution of Finland, and (3) Disability benefits and services [25] provided by the Social Insurance Institution of Finland. All these registers are national registers which are regularly updated.

Description of data collection using a graphical model
The statistical analysis undertaken requires that the study design be expressed in a precise form (Step (1) in the procedure presented in the introduction). For this task, we apply a graphical model extended to present also the study design and the missing-data mechanism in a formal way. The concept has been introduced with the name 'causal models with design' [24] and it relays on functional causal models defined by Pearl [42,43]. We acknowledge that some readers may have a different view on causality and may not want to talk about causal effects in the absence of factual interventions. These readers may interpret the graphical model as an associational model that gives a useful factorization of the joint distribution. We believe, however, that although our aim is to estimate population statistics without hypotheses of causal relationships, causal considerations are beneficial to understand the study design and the missing-data mechanism.
Following Karvanen [24], the nodes of the graphical model are divided into three classes: causal nodes, selection nodes, and data nodes. Here the variables of interest in the population are called causal nodes. These variables are not directly observed but measurements of them are performed in relation to the individuals selected for the sample. The selection nodes function as indicators of sampling and participation, and have the possible values 1 (selected) and 0 (not selected). The selection nodes form a chain where each selection node must have at least one parental selection node. The unique population node r is an ancestor of all selection nodes and has a value of r = 1 for all individuals in the population. The data nodes represent actual measurements. For individuals not selected for the sample the measurement is missing. If X is a causal node and r is a selection node, the value of the data node X * is defined deterministically.
where NA stands for missing data. Figure 1 shows the graphical model for the Health 2000 and 2011 Surveys. The graph seeks to describe the study design and the overall causal/association structure. The causal nodes V 0i , X 0i , V 1i and X 1i represent population level variables and the data nodes V * 0i , X * 0i , V * 1i and X * 1i represent the measurements on them. Table 1 presents the notation. The first subscript, 0 or 1, refers to the year, 2000 or 2011, respectively. The second subscript i refers to the individual. Lowercase r denotes a sampling indicator and uppercase R denotes a participation indicator. The selection nodes r i , r 0i , r Ai , r Bi and r 1i correspond to the sampling design, and R 0i and R 1i to the missing-data mechanism. Index A corresponds to the baseline sample and B to the new sample of young adults in 2011.
The variables (data nodes) of interest can be divided into two groups: strata and registry data available for all individuals in the population, and the covariates measured at the baseline for the participants. Variable V * 0i represents any strata and registry variable in 2000, and X * 0i represents any baseline covariate measured in 2000 for the subject i. Similarly, variable V * 1i represents any strata and registry variable in 2011, and X * 1i represents any baseline covariate measured in 2011 for the subject i. There can be both unit and item nonresponse in the covariates. The causal/association structures between the registry variables or between the covariates are not specified in the graph because they are not needed in the current analysis.
The population consists of all individuals for whom there was a positive probability of being selected for the study in 2000 or in 2011. This included persons aged 18 years or older and living in mainland Finland on 1 July 2000 and persons aged 18-28 years and living in mainland Finland in 2011. A hybrid population of this kind is of technical importance, because all selection probabilities and population distributions are now defined with respect to this population. From the epidemiological perspective we are naturally more Table 1. Notation corresponding to the population and sample sizes in stratum s and HCD k, as well as to the observed information and weights for individual i.

Year 2000
Year 2011 All ages All ages Note It is assumed that the registry variables V 0i precede the baseline covariates. In reality, factors such as health, education, and socio-demographic status have a complicated causal structure which can be modelled only if the life courses of the individuals in question are understood in detail. Such modeling is not required for the present analysis.
The probability of being included in the sample {i : r 0i = 1} depended on the strata variables age and geographical area, which were available from the administrative registries. It is assumed that in 2000 the strata and registry variables V 0i and the baseline covariates X 0i were associated with the strata and registry variables and the baseline covariates given in the Health 2011 Survey. The register variables included information describing whether the subject was alive and lived in Finland in 2011, and was eligible for invitation to participate in the survey. Each subject selected for the sample decided individually whether or not to participate in the survey. We made the MAR assumption that this decision R 0i depended on both the observed registry variables and the baseline covariates. The item non-response can be similarly modeled using a separate response indicator for each covariate measurement. In the Health 2011 Survey, each subject selected for the sample also made a decision R 1i on whether or not to participate in the survey.

Sampling probabilities for the two-stage design
In Step (2), the sampling probabilities for 2000 and 2011 are derived from the design. In the terms set in Figure 1, we define the probabilities P{r 0i = 1|V * 0i } and P{r 1i = 1|V * 1i , r 0i }.
The notation required to calculate the sampling probabilities is presented in Table 1. Let us assume that V 1i contains V 0i . Based on the boundaries in 2000, the true population sizes in both 2011 and 2000 were obtained from Statistics Finland.
In the Health 2000 Survey, the inclusion probability for subject i, who was Age 0i years old, belonged to stratum s := Stratum i and HCD k := HCD i , was defined as It should be noted that if individual i was selected for the sample (r 0i = 1), then the corresponding HCD was also selected, which explains the first equality in Equation (2). The sampling intervals (N sk 0, [30,80) [30,∞) in the systematic sampling of individuals were equal for the age groups '18-29 ' and '30-79 ' years, and half of those in the age group '80 years and older', making the sampling probabilities twice as high for the oldest age group [12]. The sample sizes n sk 0,[30,∞) = n sk 0,[30,80) + n sk 0,[80,∞) were equal in each PSU k within stratum s. Within the strata in which element-level sampling was employed the sample sizes were proportional to the corresponding population sizes.
The 'size' of cluster k was the corresponding population size N sk 0 aged 30 or older. The same sampling intervals were used for the '18-29 years' age group. The sampling weight was defined as v 0i := 1/P{r 0i = 1|Age 0i }.
All study subjects from the Health 2000 Survey, who were still alive and living in Finland, were invited to participate in the Health 2011 Survey. The sampling probabilities for this part of the sample are therefore as follows: where Z i is a register variable indicating that the study subject is alive and lives in Finland in 2011 (Z i = 1) or has died or left Finland (Z i = 0). Because the new sample for the '18-28 years' age group was created using
Where the participation rates are as high as in the Health 2000 Survey, it would be common practice to use the calibration weights of the Health 2000 Survey [12]. We refer to these as the baseline weights below. For the year 2000, the baseline weights are derived from Equation (2), with no further adjustments for non-participation. It is assumed that weighted statistics such as means and prevalences provide representative results on the target population. In addition to the cross-sectional statistics, follow-up data can also be analyzed using baseline weights and the methods applied for cohort studies.
Formally, the participation probability can be stated as The proportionality in Equation (4) holds in the case that the MAR assumption for the non-participation is valid. Non-response was accounted for using the calibration weights based on language, gender, age, and area [12], thus In 2011, participation rates were significantly lower and the means and prevalences of the population in 2011 cannot be reliably estimated using the baseline weights. From Figure 1 we can see that it is also assumed that the decision on participating R 1i in the Health 2011 Survey depends on the health status and other covariates (V 1i , X 1i ) which the survey is intended to measure. This is a missing-not-at-random (MNAR) situation; in general, the participation probabilities cannot be estimated on the basis of the data only. However, the baseline covariates measured in 2000 include information on various risk and lifestyle factors predicting future diseases or functional disabilities, which are expected to be common causes of non-participation in 2011.
We divide the sample in 2011 into two parts. The first part consists of the participants of the Health 2000 Survey (R 0i = 1). We assume that the participation probabilities in 2011 can be modeled reasonably well using the registry variables in 2011 and the covariates measured in 2000.
We model the latter probability using the logistic regression model where α 1 and β are regression parameters. The second part consists of the small group of subjects who did not participate in the Health 2000 Survey (R 0i = 0) and the new sample of young adults. They are handled similarly, but using only the register-based information V 1i .
where α R describes the effect of non-participation in 2000 on participation in 2011. The regression models corresponding to the observed register data V * 1i (and α 0 and α 1 ) depicted the interactions between age group, gender, and education.
Combining the above results, the probability of participation for each individual i can now be expressed as where the sampling probability P{r Bi = 1|V * 1i } is given in Equation (3). For this work we selected numerous variables covering various aspects of the information collected in the Health 2000 Survey. Together with age and gender, and with or without the interactions between the variable and age and/or gender, these variables were then entered one at a time into logistic regression models. The Bayesian information criterion (BIC, [50]) was then applied to assess which variables had better predictive power with respect to non-participation than the model accounting only for age group and gender. The best predictors were then entered as main effects into the same model. Using the Wald test, the variables, whose p-value was below .20 were selected for the final model. This approach was compared with the BIC applied in the multivariate models, which contained the predictors selected by the univariate BIC described above. Due to the item non-response, these steps were conducted using the complete-case data ignoring the sampling design. See Section 4 for a description of the selected model.
Diagnostics of the weighting model using the le Cessie -van Houwelingen -Copas -Hosmer unweighted sum of squares test for global goodness of fit [21] gave p-value .85. A large p-value indicates that the model cannot be disqualified with the data.

Comparison of missing-data methods
The models derived above are used for Step (4). Commonly applied methods in handling missing data are the inverse probability weighted (IPW) analysis [7,45], doubly robust (DR) methods [5] and multiple imputation (MI) [48,49]. The MI methods have previously been applied, for example, in longitudinal data analyses [56] and survey analyses [19]. More specifically, we use the MI method called the multivariate imputation using chained equations [63,64]. All these methods apply the sampling weights described in Section 3.2 to provide results which represent the population. The IPW and DR methods also require weights which handle the non-participation and which were defined in Section 3.3. As these methods are widely known and they are applied here using generally available software, we describe the details in the Supplement [18]. Three imputation models using the predictive mean matching (PMM, [36,47]) method are applied, and the imputed data sets are analyzed using the baseline weights [53]. MI1 contains only the register-based socio-demographic variables, MI2 the same variables as IPW and DR methods above, and MI3 in addition to those variables contained in MI2, also the biological risk factors measured at the baseline survey. In addition to the PMM method, also the recursive partitioning and regression trees (RPART, [58], option cart in package mice) were applied to two other MI models. MI4 contained the same variables as MI3. MI5 contained all variables selected by the Akaike information criterion (AIC, [1]) applied to age and gender adjusted univariate logistic regression models using complete-case data without sampling design or weights. As a result, MI5 contained a much larger group of variables than the other imputation models. Supplement Table I presents the selected additional variables. Neither the sampling design nor the weights were applied in the MI procedures. 50 imputed data sets were generated as it has been suggested that the number of imputations should be greater than the percentage of missing data [66].
Although alternative missing-data methods were available, in this work we concentrate on these methods. Bayesian inference has benefits in terms of handling missing data, since the uncertainties involved in both the parameters and data can be properly accounted for in the predictive distributions of the missing-data values. Tanner and Wong [57] introduced the data augmentation method, in which the missing data are integrated out from the joint distribution of the data and the model parameters. The posterior distribution of the parameters can then be obtained from the resulting distribution. In practice, this integration can be handled numerically using the Markov chain Monte Carlo (MCMC) methods [44] and, for example, the OpenBugs software [60].
Various methods of handling missing data were applied to calculating prevalences adjusted for non-participation. The true prevalences of the disability benefits in 2009, hospitalizations in 2010, and reimbursement of medications in 2011, which were available from the administrative registries for all sample members, were calculated as the sampling weighted prevalences using the full sample. When assessing the goodness of the missing-data methods, the values of these outcome variables were set missing to the non-participants before imputation or calibration of weights.
Younger individuals under the age of 30 years were excluded since these health-related events are rare, entailing that the prevalences are likely to be close to zero and making differences between various methods difficult to detect.

Results
Participation rates were much lower in the youngest age group than in the '50-74 years' age group. They were also lower in the oldest age group ( Table 2). The participation rate for the HE was lower than the participation rate in any part of the survey including the questionnaires, interviews and HE (total). A total of 8135 sample members of the In 2000, participants appeared to have higher hospitalization prevalences than nonparticipants under 65 years, but prevalences were lower among older participants (Supplement Table II, [18]). In 2011 these differences appeared only in those older than 65. Non-participants had higher disability pension prevalences in 2011.
Univariate model selection demonstrated that the self-reported work ability score (together with the interaction between age group, gender, and education) was the best predictor for the response in 2011: BIC = 5331. The null model containing only the interaction between age group, gender and education gave the following result: BIC = 5337. A better BIC value than the null model was also recorded for the following predictors: language (Finnish vs. Swedish or other); participation frequency in clubs or associations; self-reported health status; and the work ability. When one or more of these predictors were entered into the regression model simultaneously, both the model selection and the Wald tests showed that the best results were achieved for the work ability, language and participation frequency in clubs or associations in addition to the interaction of age group, gender and education (BIC = 5325).
The participation rate was very low among men with low educational attainment and amongst the '29-34 years' age group as shown by the OR estimates of the main effects (Supplement Table III [18]). Poor self-reported work ability, language other than Finnish, and either no or high activity in clubs or associations also decreased participation in 2011.
Accompanied by the baseline weights, multiple imputation based on the chained equations appeared to remove almost all differences between full sample estimates and estimates based on the participants prevalences ( Table 3). The soundness of the imputation methods and models varied depending on the outcome. The RPART method was the best in two out of three outcomes and PMM in one. The large imputation model (MI5) was best only for one outcome. In two outcomes out of three, the estimate based on the large imputation model (MI3) was the closest to the true prevalence. The estimated standard errors were the smaller the larger the imputation model was. These results are in accordance with the general fact that the more information is available for predicting the missing values, the more  accurate the estimates and the smaller the standard errors are. However, there seemed to be almost no benefits in adding a large number of additional variables based on the AIC in MI5. Baseline weights had almost no influence at the baseline, nor did they remove the differences for the 2011 results. Adjusted for the non-participation in 2011, the weights improved the estimates considerably by removing more than half of the differences. The doubly robust method appeared to provide even better results by removing at least 60% of the differences for all three outcome variables. In the Health 2000 Survey, the participation rate was very high, thus the differences were small.

Discussion
The systematic approach we have presented, for the analysis of complex study design with missing data, has several benefits over less systematic, ad hoc approaches. Causal models with design force statisticians to make the study design and the missing-data mechanism explicit, although in many population surveys hypotheses concerning causal relationships are not relevant. In such association studies these models can be considered as graphical models describing both associations between variables and the temporal ordering of variables. This helps statisticians to think logically and makes it easier for the reader to follow the presentation. The need to communicate the study design accurately is also stressed in the reporting guidelines such as the STROBE Statement [65]. Using the systematic approach, the complex likelihood can be split into parts of manageable size. Most importantly, sampling probabilities and participation probabilities are considered separately, since the former are known by their design and the latter are estimated from the data. This helps us to see the common factors between the alternative missing-data methods and to observe the differences between these methods. The systematic approach was beneficial to integrating the complex sampling design and non-participation within a single probabilistic framework during the analysis of the Health 2000/2011 data. This was particularly useful in estimating the weights for the IPW method. NMAR based on possibly unobservable causal nodes could be a hypothetical model for non-participation, whereas the actual estimable model for non-participation assuming MAR is based on observations (data nodes). The systematic approach reveals where assumptions have been made in order to render the model identifiable. These assumptions can also be communicated to other researchers using graphs as in Figure 1.
Gelman [14] compared weighting and modeling as possible solutions to handling survey design and non-participation in statistical analyses. While both approaches had advantages and disadvantages, in our case the potential complexity involved in constructing the weights was at least partially realized. In a multi-wave survey such as the Health 2011 Survey, in which each wave involved several parts such as the HE, interviews and questionnaires, the definition of participation alone becomes complicated. A different set of weights would be needed for each definition of this kind. Multiple imputation can handle not only different participation profiles, but also item non-participation. The imputed data can then be analyzed using survey analysis methods to handle different sampling probabilities, unit non-response and clustering as we have done. This approach can be a partial solution to achieving the aim expressed by Gelman [14]: 'Our ideal procedure should be as easy to use as hierarchical modeling, with population information included using poststratification.' Related to hierarchical modeling, we did not account for intra-cluster correlations, because we anticipated that migration during the 11-year follow-up period would have eroded such associations, but this can create bias in variance estimates [27].
In the original sample for 2000, 109 persons (around 1% of the sample) specifically forbade any further contacts after the Health 2000 Survey. However, due to the written consent in 2000, register-based follow-up and utilization of the baseline data are possible thus allowing the missing-data analyses.
As we have demonstrated in the case of health-related outcomes, which are often associated with non-participation, various statistical methods to handle missing data appear to be effective in reducing the differences between full sample estimates and estimates based on the participants. In particular, multiple imputation with chained equations appeared to be the best method in our comparison. In addition, the doubly robust method provided good results. These findings are encouraging, since decreasing participation rates have been a major problem in population surveys worldwide. For practical work, tools for statistical analyses based on multiple imputation are available in many statistical software packages, for example in SAS, Stata, SPSS and R, whereas tools intended for the application of doubly robust methods are not commonly available.
Our results demonstrate that the more baseline information can be utilized in the MI, the more accurate results can be obtained. In our case the participation rates were exceptionally high at the baseline, but it is likely that similar benefits can be obtained also in other multi-wave surveys with decreasing participation rates. Furthermore, in multi-wave surveys information on individuals, who did not participate in the early waves but participated in the later ones, can also be utilized in the MI. Multi-wave surveys can provide more information, which improves the accuracy of the results compared to those obtained by only cross-sectional surveys.
It has been shown that a weighting model should contain good predictors of the outcome rather than the missingness [4,6,52]. Our aim in the Health 2011 Survey has been to provide researchers general-purpose tools (based on the IPW as weights are easy to use also for non-statisticians) to handle missing data. As there are a couple of thousand variables aimed to cover a large variety of lifestyle and risk factors as well as health and other outcomes, we cannot provide optimized weights for all research questions, thus we concentrated on variables with predictive power on the missingness. We compared five imputation models based on different amount of register or baseline survey information, which improved the results. The imputation model MI2 utilized basically the same information as the resurvey weights, but the MI2 results were slightly closer to the true prevalences. The MI3 contained also the BMI, systolic blood pressure and smoking (measured in 2000), which are important risk factors to various health outcomes such as the ones we have applied in our work. It was also notable that the MI3 model, which contained the important predictors of health outcomes, has slightly smaller standard errors, which in accordance with earlier results [10,52] that inclusion of auxiliary variables associated with the outcome in the imputation model can improve accuracy of the estimates. The differences between MI3, which appeared to perform best, and other methods were not very large, however. Results based on the RPART method (MI4 and MI5) were practically as good as MI2 and MI3. MI5 contained a considerably larger number of variables selected using the AIC than the other imputation or weighting models, but it did not perform better than the other imputation models, thus a relatively small number of variables in the imputation model seemed to be sufficient in our case.
Many similarities with the results reported in the literature can be observed in the patterns of non-participation in the Health 2011 Survey. In the Health 2011 Survey, the participation rates were lower among men, younger age groups and those with lower educational attainments. Similar results have been reported for other studies [8,9,13,22,30,39,54]. In addition, our observation that non-participants are more often in receipt of disability pensions than participants echoes previous reports from Finland [30], Norway [29] and Sweden [39].

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the Academy of Finland [grant number 266251]. The authors have no relevant financial or nonfinancial relationships to disclose.