Generalized Linear Mixed Models With Crossed Effects and Unit-specific Survey Weights

Abstract Mixed models are frequently used in social and economic analysis. Typically, the analyzed data are gathered through a survey sample from the population of interest. The underlying survey design may be non-ignorable, leading to possible bias in the regression parameters if not accounted for. To circumvent this problem, survey weights should be included in the estimation process. We propose a survey-weighted generalized linear mixed model allowing for unit-specific survey weights and a flexible random effects structure. An estimation procedure is proposed and evaluated in two simulation studies under different scenarios. The first simulation study strongly encourages the use of survey-weighted generalized linear mixed models if the survey design is non-ignorable. The second one replicates a previous simulation study in order to assess the competitiveness of the proposed method against established approaches and software. Further, the proposed method is used to re-estimate a regression on reading proficiency using US PISA data from 2000. Supplementary files for this article are available online.


Introduction
In a recent article, Lumley and Scott (2017) resumed the work of Pfeffermann (1993) and outlined the importance of survey weighting in regression analysis when empirical research is done on complex survey data. If not accounted for, a non-ignorable sampling design can lead to erroneous inference (Pfeffermann 1993; Lumley and Scott 2017). Consequently, the authors advise incorporating survey weights in the estimation procedure. An overview of the use of survey weights in regression analysis in general is given in Pfeffermann (1993). The use of survey weights in the classical generalized linear regression context is widely understood (Kott 2018). However, the inclusion of survey weights in a (generalized) linear mixed model ((G)LMM) framework is a non-trivial task, as Lumley and Scott (2017) noted: "All the design-based inference we discuss is for marginal models. Design-based inference even for linear mixed models is substantially more complicated. In the special case where the clusters in the model are the same as the clusters in the sampling design, there is literature on consistent estimation [...] and some implementations [...]."
It is the more general case thereof that we analyze in this work.
The GLMM framework extends the classical generalized linear model by additional unobserved random effects structures. An exemplary application of a GLMM is the sampling of pupils in classes in schools, where unobserved random effects are assumed on both classes and teachers. The teachers give lessons in various classes, leading to a crossed random effects structure (e.g., Leckie 2009). To the best of our knowledge, there is up to now no proposal on how to handle such a crossed effects structure considering survey weights. A study on reading proficiency of teenage U.S. students presented by Rabe-Hesketh and Skrondal (2006) will serve as an exemplary application. Besides illustrating the theoretical developments, it also allows us to distinguish our approach from the established ones.

CONTACT Jan Pablo Burgard burgardj@uni-trier.de University of Trier, FB IV - Economics, Trier, Germany. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JCGS.
For the two-level LMM case, Pfeffermann et al. (1998) proposed to use inverse selection probabilities at each stage to reduce the bias due to non-ignorability of the survey design. For the case of nested GLMMs, where the unobserved random effects structure is hierarchically ordered and corresponds to the sampling stages, Rabe-Hesketh and Skrondal (2006) propose a pseudo-likelihood estimation method accounting for weights in each sampling stage. However, in practice, data producers often provide only a single survey weight for the dataset and, to reduce disclosure risk, the sampling stages are often not communicated. Alternative approaches are found in Rao, Verret, and Hidiroglou (2013), who proposed an estimating equation approach for the LMM scenario and the two-stage cluster sampling scheme only. In the same set-up, Oguz-Alper and Berger (2016) derived survey-weighted estimators through empirical maximum likelihood (EML). Yi, Rao, and Li (2016) elaborated a composite likelihood approach that is applicable to mixed models whose endogenous variable belongs to the exponential family. Though Yi, Rao, and Li's (2016) approach is already a valuable extension from linear mixed models to GLMMs, it nonetheless requires that the design mirrors the random effects structure of the model.
Our contribution is to allow for survey-weighted GLMM estimation with a flexible random effects structure including both nested and crossed random effects. As already mentioned, surveys often provide only a single survey weight per unit in the sample. Therefore, in contrast to Rabe-Hesketh and Skrondal (2006), a single survey weight per unit suffices for the presented method. We provide the underlying theory and propose an algorithm to fit such GLMMs. Further, we present simulation results supporting our findings.
Section 2 gives a brief description of the working example that is used alongside the theoretical development. Section 3 introduces the estimation algorithm for GLMMs under complex survey designs. In this context, we describe the underlying theoretical requirements for the estimator's consistency in Section 3.2. Section 4 describes the path from the original statistical model to the presented algorithm. In Section 4.2, we derive the standard error estimators of the estimated regression parameters. Another special focus lies on the computational implementation, as the algorithm is computationally intensive; this is discussed in Section 5. Afterwards, a simulation study in Section 6 demonstrates the possible gains of the survey-weighted GLMM. In addition, we compare our estimation method to the simulation results in Rabe-Hesketh and Skrondal (2006). The final section discusses possible further research and concludes.

Working Example: Estimating Reading Proficiency
As a real-world application, we consider a model on reading proficiency of 15-year-old U.S. students using data from the PISA study in 2000. These data were also analyzed in Rabe-Hesketh and Skrondal (2006) in order to demonstrate their pseudo-likelihood approach to survey-weighting in GLMMs. The dataset is freely available at https://www.stata-press.com/data/r15/me.html. Thus, a direct comparison between the algorithm in Rabe-Hesketh and Skrondal (2006) and ours is possible. However, we would like to emphasize that the presented algorithm allows survey-weighting using marginal survey weights and models with a crossed random effects structure. That is, the presented approach can also be applied when survey weights for each sampling stage are not available or when the model includes additional effects, say, on teachers in schools.
The sampling design of the PISA study in the United States in 2000 is a three-stage cluster design, in which schools (Stage 2) within sampled geographic areas (Stage 1) are selected. Then, students within schools (Stage 3) are the final sampling units. Public schools with a high share of students belonging to minorities are oversampled. Within the stratum of those high-share-of-minorities schools, the sampling probability is proportional to (estimated) school size. Students in these schools who belong to a minority are oversampled, too. More detailed numbers on the design can be found in Rabe-Hesketh and Skrondal (2006).
It is possible that the sampling design of the example is non-ignorable for a statistical model on reading proficiency: assume, for example, that reading proficiency in large schools is generally lower, especially when the share of minorities is high. Presume a statistical model that accounts only for personal and family characteristics of the student. Then this tendency is reflected in disproportionately negative idiosyncratic errors for students of large schools. In that case, the design is not independent of the endogenous variable given the explanatories, and thus the design is non-ignorable. Weighting then becomes necessary for correct inference on the reading proficiency of the complete 15-year-old population in the United States. Obviously, ignorability depends on the chosen model.
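To illustrate why a non-ignorable design distorts unweighted estimates, the following sketch simulates a hypothetical population in which low-proficiency units are oversampled. All specifics (the population size, the logistic form of the inclusion probabilities, the roughly 5% sampling fraction) are assumptions for illustration only, not the PISA design.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical finite population: a latent proficiency score per student.
N = 200_000
y = rng.normal(size=N)

# Non-ignorable design: low-y units are oversampled, so the inclusion
# probability depends on the endogenous variable itself.
pi = 1.0 / (1.0 + np.exp(2.0 * y))            # larger for low y
pi = 0.05 * pi / pi.mean()                    # ~5% expected sampling fraction

s = rng.random(N) < pi                        # Poisson sampling
w = 1.0 / pi[s]                               # basic survey weights w_j = 1/pi_j

naive = y[s].mean()                           # ignores the design: biased
ht = np.sum(w * y[s]) / np.sum(w)             # weighted (Hajek-type) mean
print(naive, ht, y.mean())
```

The unweighted sample mean lands well below the population mean, while the inverse-probability-weighted mean recovers it up to sampling noise.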

Estimation Algorithm
In the following, we assume as a data-generating process (DGP) a generalized linear mixed model (GLMM) where x_j ∈ R^p and z_j ∈ R^q are nonrandom explanatory variables. The likelihood of the model depends on these explanatories, too, but for brevity, we omit them in the subsequent formulae. For a binary endogenous variable, this model was first introduced by Stiratelli, Laird, and Ware (1984). In the systematic component (3.1), the linear predictor η_j is a function of x_j with fixed effect vector β and of z_j interacting with the multivariate Gaussian random effects G, whose realization is denoted by γ. The symmetric positive definite covariance matrix Σ is assumed to originate from at most q(q+1)/2 disjoint parameters σ, i.e., Σ = Σ(σ).
F is a distribution from the exponential family with corresponding probability density f, scale parameter ϕ and inverse link function g. As the scale parameter ϕ does not affect the estimation of ψ := (β^T, σ^T)^T through maximum likelihood (ML) estimation (Wedderburn 1976), we omit it in the following formulae for brevity, too.
In our working example, Y_j would be a binary variable indicating whether student j achieved the top two reading proficiency levels defined in the PISA study. x_j contains explanatories such as gender, indicators for the education level of the parents, indicators for foreign-born parents and the international socio-economic index of student j. z_j is an indicator vector for the school that student j attends.
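As a sketch of such a DGP, the snippet below simulates a binary outcome from a logit GLMM with crossed random effects. The sizes, fixed effects and variance components are made-up values, and the crossed school-by-teacher layout mirrors the motivating example from the introduction rather than the PISA model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 20 schools crossed with 15 teachers, 3000 students.
n, p, n_school, n_teacher = 3000, 3, 20, 15
beta = np.array([-0.5, 0.8, -0.3])            # fixed effects (assumed values)
sigma_school, sigma_teacher = 0.6, 0.4        # random-effect std deviations

X = rng.normal(size=(n, p))
school = rng.integers(n_school, size=n)
teacher = rng.integers(n_teacher, size=n)     # crossed with schools, not nested

# Realization gamma of the Gaussian random effects G
g_school = rng.normal(0, sigma_school, n_school)
g_teacher = rng.normal(0, sigma_teacher, n_teacher)

# Systematic component: eta_j = x_j' beta + z_j' gamma
eta = X @ beta + g_school[school] + g_teacher[teacher]
prob = 1.0 / (1.0 + np.exp(-eta))             # inverse logit link g
y = rng.binomial(1, prob)                     # F: Bernoulli, exponential family
```

Because every school-teacher combination can occur, z_j (encoded here by the two index vectors) does not induce a block-diagonal structure, which is exactly the crossed case the algorithm must handle.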
Before going into deeper detail of the estimation process, we first introduce survey sampling theory and our estimator of the population log-likelihood as it will be used in Algorithm 3.1. The development of the algorithm is postponed to the next section and can be skipped by practitioners.
Assume that there are N realizations of the DGP described in Equations (3.1) to (3.4). We denote the index set {1, . . . , N} =: U as the finite population; for j ∈ U, we observe the realization y_j of Y_j and, for the complete finite population, γ of G. The DGP is also called the superpopulation model. In our working example, U is the set of all 15-year-old students in the United States. From U, a survey sample s ⊂ U is drawn as a realization of the random variable S that follows the random process P_S, the survey design. Accordingly, we can define the inclusion probability of a unit j, P_S(j ∈ S) =: π_j. Let δ_S denote the indicator function of whether a unit is included in a realization of the random variable S. The inverse inclusion probability π_j^{-1} is the basic survey weight of unit j, and we set w_j := π_j^{-1}. The Horvitz-Thompson estimator (Horvitz and Thompson 1952) of the population log-likelihood is then

  LL(ψ; y, γ_S) = Σ_{j∈S} w_j log f(y_j | γ_S; β)  [term A]  +  log φ(γ_S | σ)  [term B],   (3.5)

where φ denotes the density of a multivariate Gaussian variable G_S centered at zero and with a covariance matrix Σ_S of the same structure as Σ(σ). Σ_S possibly has a lower dimension than Σ if there is a population cluster from which no unit enters the sample. The differentiation between G_S and G is necessary, as not all clusters that constitute a random effect in the population U may be drawn into the sample S. For example, in the PISA study, schools all over the United States represent clusters in which students are gathered. Students of these schools share a single random effect. However, not every school is sampled, and for those that are not, a prediction of the random effect is impossible. Hence, G_S is a subvector of G.
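For concreteness, a minimal sketch of the estimator (3.5) for a binary logit model might look as follows. The function, its arguments, and the dense covariance handling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ht_loglik(y, X, Z, w, beta, gamma, Sigma_S):
    """Horvitz-Thompson estimate of the population log-likelihood, i.e. the
    two terms of (3.5), sketched for a binary logit model.  y, X, Z, w refer
    to the sampled units only; gamma and Sigma_S are restricted to the
    clusters that actually appear in the sample (the G_S above)."""
    eta = X @ beta + Z @ gamma
    # Term A: survey-weighted sum of unit log-densities log f(y_j | gamma; beta)
    A = np.sum(w * (y * eta - np.log1p(np.exp(eta))))
    # Term B: log-density of the sampled random effects, log phi(gamma | sigma)
    q = len(gamma)
    _, logdet = np.linalg.slogdet(Sigma_S)
    B = -0.5 * (q * np.log(2 * np.pi) + logdet
                + gamma @ np.linalg.solve(Sigma_S, gamma))
    return A + B
```

Note that only term A carries survey weights; term B is the (unweighted) Gaussian log-density of the sampled random effects, which is the reason the independence assumption G ⊥ S discussed below is needed.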
When G_S ≠ G, the proportion between the terms A and B in (3.5) does not reflect the relationship in the finite population likelihood (introduced in Equation (4.1)). We can re-establish the correct proportion by rescaling the weights w_i, i ∈ U, such that Σ_{i∈S} w_i = |S| =: n rather than Σ_{i∈S} w_i = N. In the following, we assume that the survey weights are appropriately rescaled and, for ease of notation, we sometimes write G and γ instead of their survey sample analogues G_S and γ_S. As pointed out in Section 3.2, we must assume G ⊥ S in order to ensure consistency, because the term B is not survey-weighted.
A common estimation method in the framework of mixed models is the EM algorithm (Laird and Ware 1982). However, the analytical effort in the integration step is hardly manageable, as the conditional distribution of G given the observations y in the sample s is q-dimensional and possibly nonnormal. We suggest replacing the analytical integration by a Monte-Carlo approximation, a method that not only accounts for random effect crossings (as in the formulations of McCulloch 1997; Booth and Hobert 1999) but also allows us to respect the survey design P_S. As the conditional distribution of the random effects given the observed data (necessary for the conditional expectation) depends on P_S and the exponential family F, it is not pre-implemented in statistical software, and therefore an MC-sampler is needed in the integration step. One such MC-sampler, suggested in Booth and Hobert (1999), is importance sampling, where we choose to center the proposal around G's current mode with variance equal to the inverse negative Hessian of the log-likelihood.
We denote by Ĥ_s the Hessian of (3.5) with respect to γ_S, given a parameter vector ψ and a sample realization S = s. The final estimation algorithm for GLMMs under a flexible random effects structure and a complex survey design is then given in Algorithm 3.1. It is a Monte-Carlo EM (MCEM) algorithm, as the E-step of the classical EM algorithm is, as already mentioned, replaced by Monte-Carlo integration. Furthermore, sampling from the required distribution is not generally implementable, which is why we additionally employ importance sampling in the MC-integration.
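A compact sketch of the MCEM idea behind Algorithm 3.1 is given below for a survey-weighted random-intercept logit model. The synthetic data, the diagonal Hessian shortcut (valid only for a single random intercept), the growth factor for the MC sample size, and the use of a general-purpose optimizer are all simplifying assumptions, not the paper's actual algorithm.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Tiny synthetic data set (hypothetical): K clusters, rescaled unit weights.
K, n = 10, 400
grp = rng.integers(K, size=n)
x = rng.normal(size=n)
y = rng.binomial(1, 1/(1 + np.exp(-(0.5*x + rng.normal(0, 1.0, K)[grp]))))
w = rng.uniform(0.5, 1.5, n)                       # stand-in survey weights

def joint_ll(beta, sig2, gamma):
    """Survey-weighted joint log-likelihood: term A + term B of (3.5)."""
    eta = beta*x + gamma[grp]
    A = np.sum(w * (y*eta - np.log1p(np.exp(eta))))
    B = -0.5*np.sum(gamma**2)/sig2 - 0.5*K*np.log(2*np.pi*sig2)
    return A + B

beta, sig2, B_mc = 0.0, 1.0, 200
for it in range(15):
    # E-step: mode gamma_hat and curvature of the joint log-likelihood
    mode = minimize(lambda g: -joint_ll(beta, sig2, g), np.zeros(K)).x
    p = 1/(1 + np.exp(-(beta*x + mode[grp])))
    negH = np.bincount(grp, w*p*(1-p), minlength=K) + 1/sig2   # diagonal -H
    # importance sampling with proposal N(mode, -H^{-1})
    gam = mode + rng.normal(size=(B_mc, K))/np.sqrt(negH)
    log_iw = (np.array([joint_ll(beta, sig2, g) for g in gam])
              + 0.5*np.sum(negH*(gam - mode)**2, axis=1))
    omega = np.exp(log_iw - log_iw.max())
    omega /= omega.sum()
    # M-step: update beta and sigma^2 from the importance-weighted MC sample
    def Q(b):
        eta = b[0]*x + gam[:, grp]
        return -np.sum(omega[:, None] * w * (y*eta - np.log1p(np.exp(eta))))
    beta = minimize(Q, np.array([beta])).x[0]
    sig2 = float(np.sum(omega * np.mean(gam**2, axis=1)))
    B_mc = int(B_mc * 1.2)                   # grow MC draws along the E-steps
```

The growing number of draws B_mc reflects the requirement, discussed in Section 3.2, that the MC error must shrink as the iterates approach the stationary point.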

Theoretical Properties
The EM-algorithm is known to converge to a stationary point of the likelihood (Dempster, Laird, and Rubin 1977), and a closer inspection of (3.5) shows that the log-likelihood is separable between the fixed effects parameters β (and ϕ) in term A and the random effects parameters σ in term B. Furthermore, if no special structure is assumed for the covariance matrix besides positive definiteness and symmetry, the log-density of the zero-centered normal distribution is concave in σ. Thus, the maximizer σ̂ is unique. Concerning β, which appears only in term A, we also have a unique stationary point for common link functions g^{-1} (Wedderburn 1976). In that case, the EM-algorithm consequently converges to the unique stationary point of the likelihood, which is thus the maximum.
However, this need not be the case when using MC-integration in the E-step. A general overview of convergence properties of the MCEM-algorithm, including the case of GLMMs, is given in Neath et al. (2013). There are several notions of convergence in the MCEM setting; they depend on the number of Monte-Carlo draws in each E-step, on whether this number is increasing, and on the employed sampler. Generally, the number B of realized random draws must increase along the E-steps (as stated in Algorithm 3.1) in order to ensure that the MC-error decreases as the estimates approach the stationary point. By the law of large numbers (Owen 2013, chap. 9), the MC-average converges to the model expectation of the HT-estimator of the log-likelihood. Neath et al. (2013) summarize conditions under which the solutions of the MCEM converge almost surely to a connected set of stationary points of the likelihood function, and as the likelihood is strictly concave in our set-up, we state that our estimator ψ̂ converges almost surely to the maximizer of the survey-weighted joint likelihood.
We write E_m(LL) instead of E_m(L̂L) in order to underline that h(·|y, s) ≠ h(·|y, U): as the conditional density given a sample s differs from that given the population U, the estimator is not the expected value of (3.8). When the design meets certain requirements that are discussed below, especially when S ⊥ G, we claim that this difference vanishes asymptotically. Therefore, the exponential thereof also converges (with weight rescaling, multiplied by a constant) in probability. We combine both findings to analyze the overall convergence.
Assume that the following asymptotic assumptions hold for the superpopulation and the survey design: the number of domains for the random effects grows as k → ∞; that is, in our setting, the expected dimension of G_S goes to infinity, i.e., the number of domains that are expected to enter the sample with at least one unit grows unboundedly as k → ∞. These assumptions exclude a theoretical setting where, for example, the random effects represent clusters in the population and the design, and only the number of secondary sampling units per cluster increases. Conditions 3, 4, and 5 are Conditions H1, H2, and H4 from Chauvet (2014). Together with the fact that the summands in the log-likelihood (3.5) take the role of the 'variable of interest' in Chauvet (2014), and the existence of the summands' second moments (H3 in Chauvet (2014)), we can ensure that the HT estimator for term A in (3.5) is consistent. Conditions 2 and 6 ensure the consistency of the ML variance estimator for a Gaussian covariance matrix. Conditions 6 and 7 ensure that the design is asymptotically independent of G, which is important so that term B is a good (rescaled) estimator for log φ(γ|σ), which in turn is only consistent when q → ∞.
Under these assumptions, we find that convergence holds only in probability when the number of MC-samples for the integration increases within the estimation process, as stated in McCulloch (1997), Booth and Hobert (1999), and in Algorithm 3.1. The population log-likelihood (4.1), which we discuss in the next section where we derive the algorithm, is continuously differentiable with respect to ψ. The same applies to its expectation given the observed data. So the implicitly defined estimator is a model-design consistent estimator, too, which according to Dempster, Laird, and Rubin (1977) converges to a stationary point of the log-likelihood. As the log-likelihood, however, has a unique maximum for common link functions (Wedderburn 1976), the estimator ψ̂ is consistent. In a certain sense, our result is a development of what Rao, Verret, and Hidiroglou (2013) call a 'marginal model' (as we only require one survey weight per observation) and of the classical pseudo-ML approach of Asparouhov and Muthen (2006) and Rabe-Hesketh and Skrondal (2006). In these works, the classical pseudo-ML approach becomes tractable because the log-likelihood log ∫ exp(LL) dλ(γ) is assumed to factorize due to the nested random effects structure, where the random effects of different stages are independent of each other. However, the easier analytical tractability comes at the expense of a strict random effects structure. As we circumvent the need for analytical tractability by applying MC-integration, we have fewer restrictions on the random effects structure: the survey weight that we employ is the inverse of the product of conditional inclusion probabilities over the stages, hence it is the marginal survey weight, i.e., the inverse of the unit's marginal inclusion probability. The less restrictive random effects structure comes at a price, though: we must assume S ⊥ G, which is not necessary in Rabe-Hesketh and Skrondal (2006).
Furthermore, as one reviewer pointed out, there is some information loss if the random effects structure is indeed nested: the marginal survey weight does not differentiate which inclusion probability is involved at which level of a multistage cluster design. This may lead to a loss of efficiency in the estimation. On the other hand, the integral approximation methods in Rabe-Hesketh and Skrondal (2006) become more inaccurate the higher the integration dimension. In this regard, MC-integration helps to maintain efficiency. The impact of survey weights per sampling stage on efficiency may depend on the survey design, though. We refer to the appendix, where we study the impact of sampling stage-based survey weights in a simulation study.

Point Estimation
As motivated in the previous section, Algorithm 3.1 relies on an approximation and estimation of the finite population log-likelihood, where U denotes the index set {1, . . . , N} and Y_j = y_j is the jth realization of the DGP (3.1) to (3.3) under G = γ. In our working example, y_j would be the outcome "student j achieved one of the two highest reading proficiency levels" (y_j = 1) or "did not achieve" (y_j = 0) for all 15-year-old students in the United States (the ensemble of these students thus is U). G then denotes a school-specific effect on individual reading skills. The students actually drawn into the PISA study are the elements of the realization s of the random variable S. If students were sampled from all schools in the US, an adequate estimator would be the Horvitz-Thompson estimator (Horvitz and Thompson 1952) as in (3.5) with B = log φ(γ|σ). However, as some schools do not enter the PISA study, only a subvector γ_S of γ can enter the sample log-likelihood. This yields a random part B_S in (3.5) instead of B, which has the same structure as B but is of lower dimension. Hence, in order to give the random effects G adequate 'importance' (compared to term A) in the subsequent optimization process, we require a rescaling of the survey weights w_j, j ∈ U.
As E[Σ_{j∈U} δ_S(j) w_j] = N, we suggest multiplying the survey weights by the factor n/N, where n is the expected sample size: the likelihood part B_S then has the correct dimension for a simple random sample of size n, and as this term cannot be marginally weighted, term A must be adjusted.
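The rescaling step can be sketched in a few lines; `rescale_weights` is a hypothetical helper that, for simplicity, uses the realized sample size in place of the expected one.

```python
import numpy as np

def rescale_weights(w):
    """Rescale basic survey weights w_j = 1/pi_j so that they sum to the
    (realized) sample size n instead of the population size N."""
    w = np.asarray(w, dtype=float)
    return w * len(w) / w.sum()

w = np.array([10., 20., 30., 40.])   # basic weights, summing to N_hat = 100
ws = rescale_weights(w)              # same relative weights, summing to n = 4
```

The rescaling only changes the overall scale of term A relative to term B; the relative weighting between units is preserved.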
So, if the random effects realization were observed, maximization of the HT-estimate with S = s would represent a sort of pseudo joint likelihood estimation of the superpopulation parameter vector ψ. This idea is summarized in Figure 1 and reflects the considerations in Isaki and Fuller (1982) on (simple) regression modeling in complex samples. In fact, the sampling approach in superpopulation models has been studied in several influential works, such as Chambless and Boyle (1985), Yi, Rao, and Li (2016), Lumley and Scott (2017), and Kott (2018).
However, random effects are unobserved heterogeneity; that is, estimates cannot be obtained directly from (3.5). A common method for ML-estimation under partially unobserved data is the EM-algorithm (Dempster, Laird, and Rubin 1977). This algorithm consists of two iterating steps: in the kth E-step, the expected likelihood of the observed data is recalculated given the current maximizer ψ_k; in the M-step, the expected likelihood given the observed data is maximized. The EM-algorithm was first proposed for GLMMs in Laird and Ware (1982). As the integration becomes analytically intractable in the nonnormal setting described here, the E-step can be replaced by Monte-Carlo integration. In the case of GLMMs, this was suggested in McCulloch (1997) and Booth and Hobert (1999). Chen, Zhang, and Davidian (2002) use an MCEM-approach for GLMM estimation, too, but they relax the normality assumption on G, whereas we relax the model with respect to the survey design. The novelty of our approach is that the EM-algorithm is applied not to the population log-likelihood (4.1) but to its HT-estimator. Zipunnikov and Booth (2006) introduced weights in their MCEM-algorithm, too, but their focus lies rather on a correction for overdispersion and heteroscedasticity.
The different motivation for weighting in our setting, namely the design P_S, also has implications for the conditional density against which we integrate in the E-step: as pointed out in the previous section, we integrate with respect to a realization of h(·|y, S) instead of h(·|y, U) (see Equation (3.7)); the latter would return an estimator E_m(L̂L) that would be preferable in terms of the noise from the MC-integration steps.
On the other hand, conditioning on S allows us to respect the (potentially complex) sampling design in both steps of the algorithm, the maximization and the expectation, in contrast to the pairwise likelihood in Yi, Rao, and Li (2016) and the pseudo-likelihood approach of Rabe-Hesketh and Skrondal (2006). There, the likelihood LL is survey-weighted, but for the conditional density it is assumed that h(·|y, S) = h(·|y, U). In consequence, the expectation of A becomes scale-variant but is congruent between sampling stages and random effects. Furthermore, it can be argued that the additional noise through the MC-integration is counterbalanced by the fact that Rabe-Hesketh, Skrondal, and Pickles (2002) integrate a linear approximation to the GLMM and additionally face numerical integration errors through adaptive quadrature, which is avoided here. The linear approximation in Rabe-Hesketh, Skrondal, and Pickles (2002), however, yields an additional error that does not depend on the numerical integration method and can yield an additional bias (Yi, Rao, and Li 2016).

Standard Errors
For the MCEM-algorithm on GLMMs, the literature (e.g., Tan, Tian, and Fang 2007) suggests using as variance estimator of the fixed effects estimator β̂ the inverse of the observed Fisher information introduced in Louis (1982). In the case of MC-integration, the Fisher information is replaced by an MC-estimate thereof (Booth and Hobert 1999). In a classical model-based estimation, the observed information matrix corresponds to the negative expected Hessian of the joint likelihood minus the expected variance of its gradient. In our case, the derivation is rather with respect to the HT-estimator of (4.1), as we take into account the survey design P_S. That is, the expectation (with respect to G) of the first and second derivatives of (3.5) is required. Originally, the observed Fisher information given in Louis (1982) was intended for the traditional EM-algorithm. Oakes (1999) derives the observed information matrix of the EM-algorithm differently but arrives at the same result at the stationary point.
Replacing the integral by an MC-integral is, following the MCEM training procedure, easy: after convergence, the simulated random variables γ_1, . . . , γ_B and the corresponding importance weights are available and once again allow an MC-integration to approximate the integral. That is, we suggest using the estimated information matrix

  Î_β = −Ê(∂²LL/∂β² | y, s) − V̂(∇_β LL | y, s),   (4.3)

where ∂²LL/∂β² denotes the Hessian matrix and ∇_β LL the gradient of the joint likelihood with respect to the parameter vector β at position β̂. From the final E-step, one can keep the simulated random effects and importance weights in order to define Monte-Carlo-based estimators of these conditional moments. Such estimators have been suggested in Oakes (1999) and Tan, Tian, and Fang (2007) for the case without survey weights. Booth and Hobert (1999) used the inverse of the information matrix, too, but not as a final variance estimator; instead, they used the inverse observed Fisher matrix to determine stopping rules for the number of MC random generations in the Monte-Carlo E-step. Chen, Zhang, and Davidian (2002), on the other hand, found numerical problems with the MC-version of the information matrix (the MC-estimated Fisher information (4.3) is not necessarily positive definite). In consequence, we suggest using the closest positive semidefinite symmetric matrix when the estimate Î_β is not positive semidefinite (Higham 1988).
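A possible implementation of the projection onto the positive semidefinite cone (eigenvalue clipping, which for a symmetric input coincides with Higham's nearest PSD matrix in the Frobenius norm) and of the resulting standard errors is sketched below; the function names are ours.

```python
import numpy as np

def nearest_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone by
    clipping negative eigenvalues (in the spirit of Higham 1988)."""
    B = (A + A.T) / 2                      # symmetrize against MC noise
    vals, vecs = np.linalg.eigh(B)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T

def fixed_effect_se(I_beta):
    """Standard errors from an estimated information matrix that may fail
    to be positive definite due to MC error."""
    cov = np.linalg.pinv(nearest_psd(I_beta))   # pseudo-inverse: projection
                                                # may produce zero eigenvalues
    return np.sqrt(np.diag(cov))
```

The pseudo-inverse is used because the projection can leave exact zero eigenvalues, for which an ordinary inverse does not exist.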
Having an estimator of the information matrix, its inverse Î_β^{-1} is a lower bound for the variance-covariance matrix, and the square roots of its diagonal elements are thus lower bounds of the standard errors of the fixed effects estimator β̂ (Rao 1992).

Numerical Aspects of the Implementation
For a given survey realization S = s, sampling from the distribution h(·|y, s) is necessary for the MC-integration in the E-step. However, as already mentioned, h(·|y, s) is high-dimensional and not pre-implemented in statistical software packages, because it depends on S = s under P_S and on Y = y; it is thus very specific to a dataset. Furthermore, if Y is not Gaussian, the distribution is not normal either. Consequently, Monte-Carlo sampling from a good proposal distribution is required.
McCulloch (1997) suggested a Metropolis algorithm, which can take much time to scan the sample space of G and yields correlated random generations of G. Booth and Hobert (1999) suggested rejection sampling or importance sampling. Rejection sampling, however, may become inefficient in the case of low acceptance rates (i.e., when the proposal is not chosen wisely). The normal distribution centered at 0 with covariance matrix Σ(σ̂) would be an option for a proposal, for both rejection and importance sampling. It has the advantage that only the conditional density of Y_j | G_S must be evaluated. However, if the variance components of Σ are moderate to large, and the modes of G|Y therefore have a moderate distance to zero, this proposal is inefficient: it leads to high rejection rates and often almost zero-valued importance weights. Therefore, for the importance sampling, Booth and Hobert (1999) suggest a t-distribution centered at the current mode of G_S and with the inverse negative Hessian as covariance matrix. Zipunnikov and Booth (2006) propose a quasi-MC method named randomized spherical radial integration, arguing that a transformation of the random effects in combination with random orthogonal matrices allows one to reduce the MC-draws and consequently the computational burden. However, if there is no predefined structure on the covariance matrix (except for symmetry), and there are crossed random effects, the dimension of the required random orthogonal matrix is q × q, which is cumbersome to generate. In our working example, 148 high schools are sampled into the survey, and thus the required random orthogonal matrix would be of dimension 148 × 148.
Given the named inconveniences of the other algorithms, we suggest using importance sampling in each E-step of Algorithm 3.1 with the proposal distribution N(γ̂_s, −Ĥ_s^{-1}). In order to improve sampling from the tails, the covariance matrix can be scaled by a factor slightly greater than 1. This proposal distribution (5.1) was applied in Pinheiro and Bates (1995) and is relatively easy to compute. If Z^T Z is not block diagonal with Z containing only zeros and ones, and/or the number of random effects is large (q ≫ 0), computing Ĥ and its inverse is cumbersome, too. Furthermore, realizations of multivariate normals are usually obtained by multiplying an iid standard normal vector by a square root of −Ĥ_s^{-1}. This is a computational burden that must be repeated in each Monte-Carlo E-step when the proposal (5.1) is used, because the Hessian is updated in each step. A solution is the modified BFGS-algorithm described in Powell (1987), which returns the mode γ̂_s in each MCEM-step together with a q × q matrix C such that CC^T = −Ĥ_s^{-1}. We can then use the matrix C resulting from the optimization in each step to generate random vectors with the respective mode and covariance.
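The reuse of a factor C with CC^T = −Ĥ_s^{-1} can be sketched as follows. Here C is obtained once via a Cholesky factorization purely for illustration, whereas in the algorithm it would come out of the modified BFGS optimizer; the inflation factor 1.1 and the 2 × 2 Hessian are arbitrary example values.

```python
import numpy as np

def draw_proposals(mode, C, B, scale=1.1, rng=None):
    """Draw B proposals gamma_b = mode + scale * C z_b with z_b iid standard
    normal, so that the proposal covariance is scale^2 * C C^T = scale^2 *
    (-H^{-1}); `scale` > 1 slightly inflates the covariance for the tails."""
    if rng is None:
        rng = np.random.default_rng()
    Z = rng.standard_normal((B, len(mode)))
    return mode + scale * Z @ C.T

# For illustration, C comes from the negated Hessian via Cholesky; the factor
# returned by the modified BFGS of Powell (1987) would serve the same purpose
# without an explicit matrix inversion.
negH = np.array([[4.0, 1.0], [1.0, 3.0]])        # a stand-in for -H_s
C = np.linalg.cholesky(np.linalg.inv(negH))      # C C^T = -H^{-1}
draws = draw_proposals(np.array([0.5, -0.2]), C, B=1000,
                       rng=np.random.default_rng(7))
```

Since C is produced as a by-product of the mode search, no additional matrix square root has to be computed in each E-step.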
Note that the mode γ̂_s given ψ̂ is the unique maximizer of the log-likelihood (under the link functions given in Wedderburn (1976)): the GLM part A of the likelihood (3.5) is strictly concave in the linear predictor η, and the multivariate normal part (here denoted by B) is strictly concave in γ, too. Thus, A + B is strictly concave and has a unique maximum.
Another aspect of the numerical implementation is that the MC integration in the E-step turns Algorithm 3.1 into a stochastic optimization procedure. Though we have already discussed the convergence of the MCEM algorithm (Neath et al. 2013), the application might exhibit some instabilities. Especially in the binary data case, we found, like McCulloch (1997), that the algorithm quickly reaches a neighborhood of the ML estimators, but the values may then keep oscillating and can even leave that neighborhood. We therefore keep track of the expected log-likelihood and store the parameter values that have returned the highest estimated expected log-likelihood.
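This safeguard can be sketched generically as follows (all callables are placeholders for a concrete model; `run_mcem` and its argument names are our illustrative choices):

```python
import copy

def run_mcem(psi0, e_step, m_step, expected_loglik, n_iter=100):
    """Stochastic MCEM loop that tracks the best iterate seen so far.
    `e_step(psi)` returns MC draws/weights, `m_step(psi, draws)` updates
    psi, and `expected_loglik(psi, draws)` evaluates the estimated
    Q-function. Because the iterates may oscillate around the ML
    estimate, the best iterate is stored and returned."""
    psi, best_psi, best_q = psi0, copy.deepcopy(psi0), float("-inf")
    for it in range(n_iter):
        draws = e_step(psi)               # Monte-Carlo E-step
        psi = m_step(psi, draws)          # M-step update
        q = expected_loglik(psi, draws)   # estimated expected log-likelihood
        if q > best_q:                    # keep the best iterate
            best_q, best_psi = q, copy.deepcopy(psi)
    return best_psi, best_q
```

Returning the stored best iterate, rather than the last one, mirrors the strategy described above of keeping the parameters with the highest estimated expected log-likelihood.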

Application to Real World Data
We estimate a logit mixed model on our working example, the U.S. PISA study data from 2000. This working example allows a direct comparison with the results presented in Rabe-Hesketh and Skrondal (2006). In order to soften the impact of MC error in our algorithm, we estimate the model 500 times and present boxplots of these estimates. The exact model is as follows: the endogenous variable is an indicator of whether a student achieved one of the top two reading proficiency levels. The explanatories are
- female: dummy for the student's gender
- isei: international socio-economic index
- high school: dummy for whether the highest education level of either parent is high school
- college: dummy for whether the highest education level of either parent is college
- one for: dummy for whether one parent is foreign born
- both for: dummy for whether both parents are foreign born
For the schools, a random intercept is assumed whose variance is denoted "revar." The results of the point estimates are summarized in Figure 2, to which the results of Rabe-Hesketh and Skrondal (2006) are added. Most point estimates agree with those of Rabe-Hesketh and Skrondal (2006). Nonetheless, our estimation procedure returns a higher impact of the parents' educational achievement and a lower effect of one parent being foreign born. The intercept estimate is lower with our proposed algorithm; however, as the intercept rarely has a causal interpretation, we consider these deviations from Rabe-Hesketh and Skrondal's results to be minor. In addition, note that on average, the proposed estimation Algorithm 3.1 returns a less biased intercept estimator than the pseudo-log-likelihood estimation in Rabe-Hesketh and Skrondal (2006) under weight scaling method 1 (cf. replication study 6.2.1), which is used there for the real-world data.

Replication of Rabe-Hesketh and Skrondal (2006)
Data-Generating Process. The DGP in Rabe-Hesketh and Skrondal (2006) can be summarized as follows: k = 1, . . . , 1000 Monte-Carlo finite populations were generated according to the random intercept logistic regression model (6.1), with cluster sizes N_j^(1) ∈ {. . . , 10, 20, 50, 100}. The first-stage cluster sampling is informative, and consequently the number of sampled domains is random, too. A similar design of the inclusion probability is implemented at the second stage given the realization Y_i^k = y_i^k (6.7). The design is thus nonignorable at both stages; if it were not accounted for, the random effects' variance would be systematically underestimated.
Results. In the following, we present boxplots in order to give an idea of the distribution of our proposed model parameter estimators. In Figure 3, we added the average point estimates that Rabe-Hesketh and Skrondal (2006) obtained in their simulation study. We find that our method offers an advantage over the pseudo-log-likelihood estimation under weight scaling, especially for the fixed effects. The bias decreases with increasing cluster sizes. However, we underestimate the standard deviation of the random effects. While the MCEM outperforms the pseudo-log-likelihood approach under weight scaling for N_j^(1) = 10 and has approximately the same bias for N_j^(1) = 20, it is worse for very small and very large clusters. Note, however, that these results agree with the application to the real-world data: the average (and median) number of students within high schools in the dataset is about 14, which is close to the expected sample size within clusters for N_j^(1) = 20. There, we found that the estimated variance of the school random effect using the MCEM is slightly larger than that found by Rabe-Hesketh and Skrondal (2006), the same pattern that occurs here for N_j^(1) = 10, 20.

Simulations under Crossed Random Effects

Simulation Set-up.
Data-Generating Process. We conduct a model-based simulation study under finite populations: in each of the k = 1, . . . , 2000 simulation runs, a finite population of size N = 4000 is generated according to the following superpopulation model:

G_{d_1}^k ∼ N(0, σ_1^2), d_1 = 1, . . . , 20 (6.9)

G_{d_2}^k ∼ N(0, (0.7 0.5; 0.5 1.3)), d_2 = 1, . . . , 30 (6.10)

We arrange the parameters as ψ^T = (β_0, β_1, β_2, σ_1^2, σ_2^2, ρ, σ_3^2) = (4, −2, −1, 4, 0.7, 0.5, 1.3). ψ is chosen such that the linear predictors return sensible conditional expectations μ_i^k = g(η_i^k) under both the identity link g = id and the logit link g = logit^{-1}. The distribution F is once Gaussian and once Bernoulli; that is, we run simulations for a linear mixed model and a logit mixed model with crossed random effects structures. Domains (clusters) for the first random effect are indexed by d_1 and domains for the second random effect are indexed by d_2; the latter are not nested within the domains d_1 = 1, . . . , 20. The crossing of the two domain types does not follow a pattern: the cross-classifications U_{d_1} ∩ U_{d_2} were assigned once at random and kept constant over all simulations (that is, z_i does not change over the simulations). All domains of type 1 have the same number of units, N_{d_1} = 200. The affiliation to domains of type 2 is also uniform, meaning that N_{d_2} is either 133 or 134. The explanatories are also constant over the simulation runs and were generated once according to X_1 ∼ N(2, 1) and X_2 ∼ Exp(1).

Ignorable Design. The simulation is designed to show the strengths and weaknesses of the proposed wGLMM approach under non-ignorable survey designs. Furthermore, we show that the loss of efficiency due to weighting under ignorable survey designs is acceptable. For this reason, we first implement an ignorable sampling design under the DGP: given the explanatories, the random sample S is independent of the endogenous variable, (Y|X, Z) ⊥ S. From each finite population realization k, k = 1, . . . , 2000, a stratified random sample is drawn. The strata are the domains d_1 = 1, . . . , 20, and the sample size in each domain is n_{d_1} = 10. The domains d_2, in contrast, are unplanned, as they randomly cross the domains d_1. This leads to random sample sizes for the domains d_2 and possibly to a sample size n_{d_2} = 0. In this set-up, the alternative algorithms reviewed in the introduction would have trouble taking into account either the random effects G_{d_2} (e.g., the algorithm introduced in Rabe-Hesketh and Skrondal (2006)) or the survey weights (e.g., the R package lme4, Bates (2011)). Because of the unplanned sampling in domain type 2, it might happen that the observed realizations γ_S differ from γ, and thus the outlined rescaling of the survey weights is necessary.
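The finite-population generation under this crossed design might be sketched as follows. The layout, covariate distributions, and parameter values are those stated in the text; how the bivariate effect G_{d_2} enters the linear predictor (here: a random intercept plus a random slope on x_1) and the unit-variance Gaussian error are our illustrative assumptions, since the text's design matrix z_i fixes these choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Population layout from the text: N = 4000, 20 type-1 domains of 200
# units, 30 type-2 domains of 133 or 134 units, crossing fixed at random.
N, D1, D2 = 4000, 20, 30
d1 = np.repeat(np.arange(D1), N // D1)
d2 = rng.permutation(np.arange(N) % D2)   # sizes 133/134, random crossing

# Explanatories, generated once: X1 ~ N(2, 1), X2 ~ Exp(1).
x1 = rng.normal(2.0, 1.0, size=N)
x2 = rng.exponential(1.0, size=N)

beta = np.array([4.0, -2.0, -1.0])        # (beta_0, beta_1, beta_2)
g1 = rng.normal(0.0, 2.0, size=D1)        # sigma_1^2 = 4
cov2 = np.array([[0.7, 0.5], [0.5, 1.3]])
g2 = rng.multivariate_normal(np.zeros(2), cov2, size=D2)

# Linear predictor; intercept + slope on x1 for the bivariate effect is
# our illustrative assumption about z_i.
eta = (beta[0] + beta[1] * x1 + beta[2] * x2
       + g1[d1] + g2[d2, 0] + g2[d2, 1] * x1)

y_gauss = eta + rng.normal(size=N)                      # Gaussian response
y_binary = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # logit response
```

The `permutation` of `arange(N) % D2` reproduces the irregular crossing described above: domain-2 sizes are 133 or 134 and the cross-classification is random but fixed once.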
The units within the strata are sampled with unequal probabilities. The first design (π ps) is ignorable: the inclusion probability of unit i in stratum d_1 is proportional to x_{2,i}, that is, π_i = n_{d_1} x_{2,i} / Σ_{j ∈ U_{d_1}} x_{2,j}.
As the elements x_2 are observed in the sample, survey-weighting should not have an effect under this sampling design. Probability-proportional-to-size sampling designs are described by Lohr (2010, chap. 6.1-6.4). In this set-up, the weights are ignorable in expectation (Kott 2018). This design therefore serves as a control for the estimation quality when weighting is not necessary: using the weights nonetheless should lead to some loss in efficiency but should not affect the estimator's expectation.
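A minimal sketch of these inclusion probabilities and the resulting design weights (`pps_inclusion_probs` is our illustrative name; real PPS implementations additionally cap probabilities at 1 or take large units with certainty):

```python
import numpy as np

rng = np.random.default_rng(7)

def pps_inclusion_probs(x2, strata, n_per_stratum):
    """Inclusion probabilities pi_i = n_d * x2_i / sum_{j in d} x2_j,
    proportional to x2 within each stratum (the pi-ps design)."""
    pi = np.empty_like(x2)
    for d in np.unique(strata):
        idx = strata == d
        pi[idx] = n_per_stratum * x2[idx] / x2[idx].sum()
    return pi

# Toy population: 3 strata of 50 units each, size variable x2 ~ Exp(1).
strata = np.repeat(np.arange(3), 50)
x2 = rng.exponential(1.0, size=150)
pi = pps_inclusion_probs(x2, strata, n_per_stratum=10)
weights = 1.0 / pi   # design (survey) weights for sampled units
```

Within each stratum the probabilities sum to the stratum sample size, and the weights are simply their inverses, which is the quantity passed to the weighted estimation procedure.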
Simulation Results under Ignorable Design. The benchmark for our results is what the R package lme4 (Bates 2011) returns without the inclusion of survey weights. The results are presented in Tables 1 and 2. We find that, for the fixed effects and a Gaussian dependent variable, the MCEM algorithm yields unbiased estimators, and there is only a slight loss in efficiency in comparison with ignoring the survey weights in lme4. In addition, the results for the random effects variance components are reasonable, too. There is a slight bias in the variance components of the effects crossing the survey design. This bias, however, is of about the same magnitude as that returned by lme4.
For the binary dependent variable, on the other hand, the integral approximation in lme4 partially leads to tremendous errors, such that any efficiency loss that might result from the (unnecessary) inclusion of survey weights in our MCEM algorithm is clearly outweighed by that error. Furthermore, we remark that some simulation runs of the lme4 functions aborted and would have required further fine-tuning of the estimation hyperparameters, whereas our algorithm managed to finish these particular simulation runs.
Nonetheless, a bias of the MCEM algorithm is observable, especially for the crossed random effects' parameters.
In the second step, we report results for the discussed standard error estimator. As discussed in previous sections, the Fisher information of the EM algorithm (Louis 1982) has been suggested by several authors. However, to the best of our knowledge, we are the first to present its performance in a MC simulation study. We benchmark the standard error estimators against the MC-estimated standard error. The results are summarized in Tables 3 and 4; negative variance estimates (negative diagonal elements of Î_β^{-1}) were disregarded. In cases where Î_β^{-1} was not positive definite, the closest positive definite matrix was taken ('nearPD' in the table).
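The Louis (1982) estimator can be sketched generically: the observed information is E[−H_c | y] − Var[S_c | y], estimated from complete-data scores and Hessians evaluated at M (possibly importance-weighted) draws of the random effects. The helper below is our illustrative implementation, not the paper's code:

```python
import numpy as np

def louis_information(scores, hessians, weights=None):
    """MC estimate of the observed information via Louis (1982):
    I(psi) = E[-H_c | y] - Var[S_c | y],
    from complete-data scores (M, p) and Hessians (M, p, p) evaluated
    at M draws of the random effects, optionally importance-weighted."""
    scores = np.asarray(scores)
    hessians = np.asarray(hessians)
    M = scores.shape[0]
    w = np.full(M, 1.0 / M) if weights is None else np.asarray(weights) / np.sum(weights)
    s_bar = w @ scores                              # E[S_c | y]
    e_neg_h = -np.einsum('m,mij->ij', w, hessians)  # E[-H_c | y]
    var_s = np.einsum('m,mi,mj->ij', w, scores, scores) - np.outer(s_bar, s_bar)
    return e_neg_h - var_s

def louis_std_errors(scores, hessians, weights=None):
    """Standard errors: sqrt of the diagonal of the inverse information."""
    info = louis_information(scores, hessians, weights)
    return np.sqrt(np.diag(np.linalg.inv(info)))
```

A useful sanity check: with no missing data (all draws identical), the variance term vanishes and the estimate reduces to the complete-data information.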
It becomes clear that the inverse of the Fisher information is only a lower bound of the variance and, hence, its square root is only a lower bound of the standard errors. In contrast to lme4, the standard error estimator is biased downwards. This is even the case under the Gaussian DGP with ignorable design, i.e., in the most favorable setting. For the logit regression, the standard error estimator based on the inverse Fisher information behaves similarly to the linear regression case: it is a lower bound. In comparison with lme4, however, the standard errors of the MCEM algorithm are nonetheless more trustworthy due to the large integration errors that occur in lme4 and translate to the MC-based standard error.
Nonignorable Design. The second design (ups), in contrast, is non-ignorable. This design is selected to underpin the importance of survey weights when model assumptions such as the ignorability of the design do not hold. For this scenario, we define the inclusion probability of unit i in stratum d_1 in three steps (and, due to the model dependency, for each finite population realization k), where τ_{d_1,p} is the p-quantile of the residuals e_i^k in domain d_1 in the kth finite population. Consequently, the inclusion probability is a function of an unobserved model error and thus non-ignorable, so that (Y|X, Z) ⊥ S no longer holds (Pfeffermann 1993). Under this set-up, the expectation of the error in the sample does not equal zero, not even when conditioned on the explanatories (Kott 2018). This scenario corresponds to the violation of the assumption in Kott (2018) that makes survey weights nonignorable. As the regression seeks to minimize the random error and observations with a highly positive error are oversampled, we expect the regression intercept to be biased upwards. However, the information on the oversampling of units with positive error realizations is contained in the survey weights and can thus be accounted for when the weights are included. A similar nonignorable design was studied in the simulation of Pfeffermann et al. (1998), where the authors constructed the inclusion probability as a function of the random effects.
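The flavor of such a residual-quantile design can be sketched as follows. This is an illustration only: the paper's exact three-step construction differs, and `informative_probs` with its `p` and `boost` parameters are our names and choices:

```python
import numpy as np

rng = np.random.default_rng(11)

def informative_probs(resid, strata, n_per_stratum, p=0.8, boost=3.0):
    """Illustrative nonignorable design: within each stratum, units whose
    model error exceeds the p-quantile tau_{d,p} receive `boost` times
    the base selection probability. This reproduces only the key feature
    described in the text: units with large positive errors are
    oversampled, which biases the intercept upwards if unweighted."""
    pi = np.empty_like(resid)
    for d in np.unique(strata):
        idx = strata == d
        tau = np.quantile(resid[idx], p)             # tau_{d_1, p}
        size = np.where(resid[idx] > tau, boost, 1.0)
        pi[idx] = n_per_stratum * size / size.sum()
    return pi

# The inclusion probability depends on the unobserved residual, so the
# sampled errors no longer have conditional mean zero.
resid = rng.normal(size=1000)
strata = np.repeat(np.arange(5), 200)
pi = informative_probs(resid, strata, n_per_stratum=20)
```

Because the probability depends on an unobserved model error rather than on the covariates, no conditioning on X can restore ignorability, which is exactly why the weights carry information here.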
Simulation Results under Nonignorable Design. As discussed above, survey-weighting becomes essential under a nonignorable survey design. This becomes especially clear when looking at the intercept in Tables 5 and 6. In the linear case, except for the estimates of the fixed effect intercept, the results are similar to those under the ignorable design, owing to the construction of the sample informativity. The relatively large bias of the intercept (about 22%) is especially critical when the regression model is used for prediction, for example in small area estimation.
The biases of the variance components under the MCEM algorithm are only marginally greater than those under the ignorable survey design. Furthermore, they are (overall) smaller than those resulting when the survey design is ignored.
For nonlinear models such as the logit regression, on the other hand, things change dramatically as can be seen in Table 6. The combination of integration error and ignored sample informativity leads to severely biased estimators for both fixed effects and variance components using lme4.
Note that in this case, any inference stemming from standard error estimators must be erroneous. Standard error estimators are reported in Tables 7 and 8.
The lower bound standard error estimators for the MCEM algorithm behave similarly to the case of an ignorable survey design, which makes us confident that the bias stems only from the lower bound property. An improved estimator similar to the square roots of the diagonals of the inverse Fisher information could then yield approximately correct standard error estimators. For the Gaussian dependent variable, on the other hand, the standard error estimator of the intercept in lme4 would lead to wrong inference, too, due to the bias. In the binary case, the large biases in all three fixed effects outweigh any variance that is captured by the standard error estimator of lme4; hence, there is a severe underestimation.

Discussion
In this article, we proposed to combine the Monte-Carlo EM algorithm, which allows for flexible mixed models (McCulloch 1997; Booth and Hobert 1999; Chen, Zhang, and Davidian 2002; Zipunnikov and Booth 2006), with the inclusion of survey weights (e.g., Chambless and Boyle 1985). To the best of our knowledge, survey weighting in GLMMs has until now come from another direction (e.g., Pfeffermann et al. 1998; Rabe-Hesketh and Skrondal 2006; Rao, Verret, and Hidiroglou 2013; Yi, Rao, and Li 2016). The previous approaches for survey weighting in mixed models make it difficult to include a single, unit-specific survey weight and crossed random effect patterns.
We have circumvented this problem by the use of the MCEM. In addition, we find that our proposed estimator is design-consistent for the finite population parameters. These are in turn model-consistent for the parameters in the data-generating process (DGP). Overall, the suggested MCEM thus yields a model-design consistent estimator at the sample level. Therefore, our approach is a powerful tool in both econometric research, where the focus lies on the DGP, and survey statistics, where a model under the finite population can assist the estimation process.
For example, small area estimation often uses a multi-level framework in order to increase the efficiency of mean estimators in small domains (Battese, Harter, and Fuller 1988). However, those estimators are only model-unbiased under the correct model and ignorable sampling. The survey weights could thus protect the point estimators when the design is non-ignorable. In combination with Box-Cox transformations (Box and Cox 1964) to improve the error distribution assumptions, as suggested in Li and Lahiri (2007), Gurka et al. (2006), and Rojas-Perilla et al. (2017), a powerful and flexible tool might result. Another application, yet to be studied, would be to replace the survey weights in the algorithm with nonresponse propensities in order to correct for a missing-at-random nonresponse pattern. Furthermore, our approach could be of use in the context of imputation: instead of response propensity scores, a (mixed) survey-weighted regression imputation becomes possible.
Besides the theoretical background, we have discussed the computational problems that occur for flexible covariance matrices and the resulting high dimensionality of the random vector G. With the application to U.S. PISA 2000 data, we demonstrate the applicability in real data situations.
A final simulation study revealed that the fixed effects estimators are not harmed by survey weighting when the design is ignorable and benefit from it when it is not. When the domains for which a random effect pattern is assumed reflect sampling stages, we find the same for the variance component estimation. However, further investigation is needed into how to stabilize the variance component estimation for unplanned domains, especially when the endogenous variable is binary. We have furthermore contrasted our estimation algorithm with an established estimation procedure and find it competitive.
Finally, we have evaluated the inverse observed Fisher information for standard error estimation. To the best of our knowledge, this method had been proposed in the literature but not evaluated in a simulation study before. Though the tendency of the standard error estimator is stable across scenarios, we find that further research is needed: the Fisher information can yield at most lower bounds for the standard errors.
To conclude, we introduced unit-specific weights into the generalized linear mixed model framework. This is important as most surveys only provide a single survey weight that must thus account for the complete sampling design. The proposed methodology is, therefore, an important tool to reconcile classical statistical modeling with the reality of empirical research where only samples generated by complex surveys are available.