Comparing estimation approaches for generalized additive mixed models with binary outcomes

Generalized additive mixed models (GAMMs) extend generalized linear mixed models (GLMMs) to allow the covariates to be nonparametrically associated with the response. Estimation of such models for correlated binary data is challenging and estimation techniques often yield contrasting results. Via simulations, we compared the performance of the Bayesian and likelihood-based methods for estimating the components of GAMMs under a wide range of conditions. For the Bayesian method, we also assessed the sensitivity of the results to the choice of prior distributions of the variance components. In addition, we investigated the effect of multicollinearity among covariates on the estimation of the model components. We then applied the methods to the Bangladesh Demographic Health Survey data to identify the factors associated with the malnutrition of children in Bangladesh. While no method uniformly performed best in estimating all components of the model, the Bayesian method using half-Cauchy priors for variance components generally performed better, especially for small cluster size. The overall curve fitting performance was sensitive to the prior selection for the Bayesian methods and also to the extent of multicollinearity.


Introduction
Generalized additive mixed models (GAMMs) [1] are a flexible version of generalized linear mixed models (GLMMs) [2] that use additive nonparametric functions to model the effect of continuous covariates and random effects to model correlation between responses.Statistical inference in GAMMs involves inference concerning the parametric and nonparametric mean functions as well as inference on the correlation parameters.
The smooth mean functions in a GAMM can be estimated in many different ways such as smoothing splines [1,3,4], regression splines [5] and penalized splines [6][7][8][9].However, we focus on penalized splines for estimating the nonparametric functions in a GAMM.Penalized smoothers can be represented as mixed model components which results in GAMMs to be considered as particular cases of GLMMs (see, e.g.[10][11][12]).Eventually, the mixed model methodology and software can be used to make systematic inference on all model components of the GAMMs.One of the main advantages of this approach is that the smoothing parameter which controls the trade off between under-and over-smoothing can be directly estimated from the model as a function of variance components.
Under the mixed model representation of GAMMs for binary data, the likelihood function needs to be approximated as it does not have a closed form.Lin and Zhang [1] extended the penalized quasi-likelihood (PQL) [2] and proposed a double PQL (DPQL) to make approximate inference using smoothing splines.The DPQL method, however, yields biased estimates for binary clustered data with small cluster size [1,13].Wood and Scheipl [14] provided an R function gamma4 that implemented the Laplace (full) approximation approach to the likelihood function.More refined approximation methods using adaptive Gaussian quadrature are not feasible as the mixed model representation of smooth functions in GAMMs involve a large number of random effects [9].A Bayesian approach is an attractive alternative that enjoys exact inference under Bayesian machinery.However, it is computationally expensive and requires specification of the prior distributions which is not always trivial, especially for variance components (see, e.g.[15]).
Although several methods are available for estimating GAMMs, in practice the applied statistician may have to choose among widely contrasting results depending on the method used (see, e.g.[16]).We, therefore, investigate the performance of the different likelihood based and Bayesian estimation methods in ideal and non-ideal data situations.We also investigate the sensitivity of the Bayesian fit to the prior specifications of the variance of random effects.Our aim is to develop guidance for analysing correlated binary data with possibly non-linear functional forms between the independent variables and the response.
Further, in the recent GAMMs literature, the emphasis has been given only on the formulation of smooth functions and construction of various approaches for estimating the models.To our knowledge, very few studies are available on the validity of the model fit in adverse circumstances, for example, in the presence of multicollinearity among the regressors.In this paper, we also we investigate the impact of multicollinearity on the estimation of the GAMM parameters by allowing the covariates to be moderate to highly correlated.We then apply these methods to the Bangladesh Demographic Health Survey (BDHS-2014) data to study the determinants of malnutrition among children in Bangladesh.

Generalized additive mixed model (GAMM)
Let Y i = (Y i1 , . . ., Y in i ) T be the response vector in which Y ij is the observation for the member j of cluster i (i = 1, 2, . . ., m; j = 1, 2, . . ., n i ).Conditional on a q × 1 vector b of random effects, Y i are assumed to be independent observations from an exponential family distribution [17].In a generalized additive mixed model, Y i are assumed to satisfy where is a monotonic link function, β 0 is the intercept, x i = (x i1 . . .x ip ) T are observed realizations from p covariates associated with fixed effects, z i are q × 1 vector of covariates associated with random effects, f r (•) is a centred and twicedifferentiable smooth function, and b ∼ N(0, D γ ) in which γ is a vector of variance components.
In a GAMM, it is ideally assumed that the fixed effect covariates are statistically independent.However, this might not always be true.Explanatory variables can be highly correlated in situations like exploratory data analysis or in data mining, where many and redundant variables are often included in the modelling phase.In a linear regression setting, near-linear dependencies among the covariates give rise to the problem of multicollinearity, which leads to inflated variances of the coefficient of interest.The exactnonlinear dependencies among covariates are termed as concurvity, which leads to infinite solutions of generalized additive models (GAMs) [18].

Penalized spline estimation
For estimating smooth functions f r (•) (r = 1, . . ., p), we adopt thin plate regression splines [19,20] that are both computationally efficient and stable.Considering a large number of knots (K r ) for flexibility, f r (•) is represented by where B r is a basis matrix consists of thin plate regression spline basis functions B rk (x ri ) and β r is a vector of regression coefficients.To avoid overfitting, we impose the constraint: for some nonnegative constant c, where S r is a penalty matrix.To ensure identifiability, we construct f r (•) by absorbing the centreing constraint into the basis so that the sum of the elements of f r (•) is zero.

Penalized spline as mixed model
Following [11], we re-parametrize f r (•) based on the eigendecomposition as where β rF is the vector of fixed effects with its associated design matrix B rF that consists of the columns of B r for which the penalty matrix S r has eigenvalues as zero and β rR is the vector of random effects with the associated design matrix B rR = B r U r W −1 +r in which U r is the matrix containing eigenvectors of S r corresponding to the strictly positive eigenvalues (E r ) arranged in descending order of magnitude, and W +r is the diagonal matrix containing E r on the leading diagonal.The random effects β rR ∼ N(0, τ r I) with smoothing parameter λ r = 1/τ r .

Estimation
The maximum likelihood estimation of the GAMM (6) involves a high dimensional integral over the random effects.In general, the likelihood does not have a closed-form as the integral is intractable unless data are normally distributed.As such, numerical approximation or Monte Carlo approaches are required.For the GAMM (6) estimation, two popular approximation approaches are: (i) double penalized quasi-likelihood (DPQL) [1] and (ii) full Laplace approximation [14].However, both methods tend to yield biased estimates in a GAMM, especially for binary data with small cluster size [1,13].As mentioned earlier, better approximation method with higher degree of accuracy such as adaptive Gaussian quadrature is not feasible due to the large number of random effects in a GAMM [9].A Bayesian approach is an appealing alternative to likelihood-based approximations, in which Markov chain Monte Carlo (MCMC) methods are usually used to make inferences by drawing samples from the posterior distribution of the parameters.Using noninformative priors for the parameters, the results may be comparable to the likelihoodbased approaches.However, the Bayesian method involves higher computational effort and suitable choice of the prior distributions.The estimates of the variance components in Bayesian mixed models are known to be sensitive to the prior specification (see, e.g.[15]).For the fixed effect parameters β ∈ β F , a common choice of non-informative prior is β ∼ N(0, 10 6 ).For the variance components σ 2 ∈ {τ , γ }, different popular choices of priors include (i) inverse-gamma (IG) prior: σ 2 ∼ IG( , ) with set to a very small value such as 0.1 or 0.01 or 0.001; (ii) uniform prior on a wide range, e.g.σ ∼ U(0, 100); and (iii) half-Cauchy prior: P(σ ) ∝ (σ 2 + s 2 ) −1 with a high value for the scale parameter s, for example, s = 25.A detailed account of the prior selection is provided by Gelman [15].
There are currently a number of software platforms for fitting GAMMs via MCMC including WinBUGS [21], OpenBUGS [22], JAGS [23], BayesX [24], R-INLA [25], and STAN [26].However, for simulation and data analysis, we use JAGS (Just Another Gibbs Sampler) which is a mature and declarative language for Bayesian model fitting with reasonable computation time.

Data generation
Following Lin and Zhang [1], we generated 1000 replicate datasets in which each dataset comprised m = 100 clusters of homogeneous size n i = 3 or 5.We considered a random intercepts (Z ij = 1) model that includes a binary outcome (Y ), a binary exposure (E), and two continuous covariates (X 1 , X 2 ).The binary exposure, was generated from Bin(1, 0.5).Two linearly correlated standard uniform U(0, 1) covariates (X 1 and X 2 ) were generated from the Gaussian Copula such that the point-biserial correlations between the binary exposure and continuous covariates were ρ e.x 1 = ρ e.x 2 ≈ 0.30.Four levels of empirical correlation between the continuous covariates were investigated: ρ x 1 .x 2 = 0, 0.5, 0.7, 0.9.Conditional on the cluster-specific random intercepts b i ∼ N(0, σ 2 int ) with σ 2 int = 0.25, 0.5, 0.75, observations Y ij were simulated from Bin(1, p ij ) according to a logistic additive mixed model, where β 1 = 0.7 (giving OR ≈ 2), f 1 (x 1 ) and f 2 (x 2 ) are bimodal and unimodal functions, respectively defined as in which (.) is a gamma function.The two functions were centred so that the means of f 1 and f 2 were 0 over the distinct values of X 1 and X 2 .The value of β 0 was chosen so that the overall prevalence of a favourable outcome (Y = 1) was either 0.5 or 0.25.Note that in the adopted simulation setting, we may consider E as a binary treatment indicator (exposure) and (X 1 , X 2 ) as confounders; or, either of (X 1 , X 2 ) as the exposure of interest and the others as confounders; or, all (E, X 1 , X 2 ) as predictors of Y in a predictive model setting.

Analysis of simulated data
Each simulated dataset was analysed by fitting a logistic additive mixed effects model of the form (7) in which each smooth term was represented by a penalized thin plate regression spline.Following Ruppert [7], we considered a large number of knots (K = 20) and the knot positions were specified as where 1 ≤ k ≤ K. Representing the penalized regression smoothers as mixed model component, and imposing the centreing constraint on each smoother, we estimated the model parameters via the following methods: (1) DPQL under maximum likelihood (ML) or restricted ML (REML) estimation.Standard errors of the estimated fixed effect and smooth terms were obtained following Lin and Zhang [1].(2) Laplace approximation, where standard errors of the estimated fixed and smooth effects were obtained similarly to that for DPQL.(3) A Bayesian approach using noninformative priors for all parameters.Specifically, using N(0, 10 6 ) priors for all fixed effects, and three alternative independent priors for each variance component: (i) Uniform (0, 100); (ii) Half-Cauchy by setting s = 25; and (iii) Inverse Gamma (0.001, 0.001).Using priors (i)-(iii), Bayesian methods are referred to later as, respectively: Bayesian-UNIF, Bayesian-HC, and Bayesian-IG.The Bayesian estimates were medians from 55,000 iterations of the MCMC algorithm after discarding the first 5, 000 burn-in iterations.We ran a single chain and thinned it by keeping every 50th iteration.
All simulations and analyses were carried out in R software employing glmmPQL and gamm4 functions for DPQL and Laplace approximate ML methods, respectively.As mentioned earlier, the Bayesian analysis via MCMC was performed using JAGS.The R package R2jags [27] was used to call JAGS and export results to R.

Performance indicators
For each method, we compared the percentage relative bias (PRB) and empirical mean squared error (MSE) of the fixed and random effects estimators.The PRB was computed as PRB = Bias TrueValue × 100.
To assess the curve fitting performance, we used the following criteria: (i) mean average squared error (MASE); (ii) mean average coverage probability (MACP); and (iii) mean average confidence length (MACL).The MASE measures the closeness of the estimated curve to the true function whereas the MACP and MACL take into account the uncertainty of the of the estimated curves.
The MASE was defined as the mean over the 1000 replicated datasets of the average squared error, The 95% pointwise MACP and MACL were obtained as the means of the 1000 average coverage probabilities (ACP) and average credible intervals lengths (ACL), respectively defined as where 1(•) denotes an indicator function; fL and fU are the lower and upper limits of the point-wise CI, respectively.
To compare the fits graphically, we presented smoothers of the pointwise fitted values of the nonparametric functions; and smoothed pointwise 95% coverage probabilities of the true functions from 1000 simulated datasets.The smoothers were obtained using penalized thin plate regression splines (with K = 20) while considering fj (x j )(j = 1, 2) and logit-transformed coverage probabilities from 1000 replications as continuous outcomes.

Comparison of Methods
Figure 1 and supplementary Table S1 present simulation results when different methods of estimating GAMMs were used.All methods exhibited bias in estimating the variance of the random intercept σ 2 int .The Bayesian-HC method, however, yielded least biased estimates in most scenarios considered.The Bayesian-UNIF produced overestimated σ 2 int whereas it was always underestimated by the Bayesian-IG method.Like Bayesian-IG, the Laplace method also yielded largely underestimated σ 2 int , but nonetheless provided lowest MSE.Similar to the Bayesian-UNIF, σ 2 int was always overestimated by the DPQL method; and REML estimates were more biased than ML estimates.
The fixed effect β 1 (i.e. the coefficient of the binary exposure E) was well estimated by all methods with negligible bias in all cases.It was however better estimated (less biased) by the Bayesian-IG method (when n i = 3) and the Laplace approximation method (when n i = 5).The Bayesian-UNIF slightly overestimated β 1 whereas Bayesian-HC performed in between the Bayesian-UNIF and Bayesian-IG methods.Interestingly, β 1 was better estimated by those methods that underestimated σ 2 int .The curve fitting performance by different methods were comparable (Figure 1 : c − h).The Laplace method performed relatively poorly in reproducing f 1 (x 1 ) but did reasonably well in estimating f 2 (x 2 ) in all cases.The DPQL (REML) method, on the other hand, recaptured the true functions best in most cases based on the minimal MASE.The Bayesian methods also performed well, especially for small cluster size (n i = 3) and Bayesian-UNIF performed better in estimating f 1 (x 1 ), whereas f 2 (x 2 ) was better estimated by the Bayesian-IG.Note that the methods that overestimated the variance component σ 2 int recaptured the peaks and troughs better than those underestimated σ 2 int .The Bayesian methods using three alternative priors had similar average coverage probabilities (MACP), close to the nominal level of 95% for both curves, although a slightly higher coverage was apparent for the Bayesian-UNIF in all cases.The DPQL (ML and REML) and Laplace methods had similar but lower (around 90%) coverage probabilities than the nominal level for both curves when n i = 5.For n i = 3, coverage probabilities further decreased for both methods but DPQL had higher coverage than the Laplace method.Overall, the DPQL (REML) yielded higher coverages than DPQL (ML) in all cases, although the difference was only marked when cluster size was small (n i = 3).
In terms of MACL, a higher coverage probability was associated with a wider credible interval.The Bayesian-UNIF yielded widest credible intervals and resulted in highest coverage probabilities in all cases.As expected, the fully Bayesian methods had relatively wider CIs as they take into account the uncertainty around the unknown parameters.Of the frequentist methods, DPQL yielded narrower confidence intervals than the Laplace method for both curves in most cases.
Figure 2 illustrates the ability of the GAMMs estimated by different methods to recapture the true functions for one particular scenario (when m = 100, n i = 5, σ 2 int = 0.5 and ρ x 1 .x 2 = 0).The upper panel of Figure 2 shows the true curves f j (x j )(j = 1, 2) and the smoothed estimated curves fj (x j ).The reconstructed nonparametric functions had noticeable negative biases when curvature was high.The Bayesian methods (Bayesian-UNIF and Bayesian-HC), in general, worked well in estimating the high curvature areas but did inadequately at the flat regions.The reverse was true for the frequentist methods.Around the peaks, the Bayesian-IG and Laplace methods performed relatively poorly.
The lower panel of Figure 2 compares the empirical pointwise coverage probabilities of the 95% credible intervals (CIs) of f 1 (x 1 ) and f 2 (x 2 ) obtained using different estimation methods.The coverage probabilities of the CIs from all frequentist and Bayesian methods were very close to the nominal value (95%), except when biases in the estimated nonparametric functions were noticeable.At x-values where the bias was visible, CIs from DPQL and Laplace methods yielded very low coverage probabilities.In such cases, coverage probabilities from the fully Bayesian methods were also low, but nonetheless better than the frequentist methods.
For all methods, estimation bias and variance of the model components decreased when we increased the cluster size (n i ), and increased when we increased the value of σ 2 int .No convergence problem was noted for any method when cluster size n i = 5.For a smaller cluster size (n i = 3), only the DPQL failed to converge or reported error messages relating to singularity.Depending on the scenario, this failure rate ranged from 14% to 25% of the total generated datasets and results from these simulated datasets were omitted for all methods.
Computational time for the DPQL, Laplacian approximation, and Bayesian methods were also recorded.For m = 100 clusters of size n i = 5, on average, the per replicate timings were 30 seconds for the DPQL method, 40 seconds for the Laplace method, and 7.5 minutes for the Bayesian-HC method.Computational times for the Bayesian methods using other priors for the variance components were very similar.
Overall, similar relative performances were observed for different methods when the prevalence of a positive outcome (Y = 1) was 25%, although bias, MSE, MASE and MACL were relatively higher (results not shown).

Effect of Multicollinearity
Figure 3 shows the effect of multicollinearity on the estimation of the GAMM components.We considered a particular simulation scenario (m = 100, n i = 5, event probability = 0.5, σ 2 int = 0.5, ρ e.x 1 = ρ e.x 2 ≈ 0.30) and present results from the three estimation techniques: DPQL(ML), Laplace, and Bayesian-HC.Results from all other scenarios considered are reported in Tables S2 and S3 in the supplementary materials.
The estimation bias of the variance component σ 2 int did not change by the extent of multicollinearity for any method except for the DPQL.However, for all methods, the MSE of σ 2 int increased as ρ x 1 .x 2 was increased (Figure 3(a)).The estimate of β 1 was robust to the multicollinearity in that neither bias nor MSE of β1 had changed markedly for any of the methods as ρ x 1 .x 2 was increased.
For the two nonparametric functions, both MASE (measuring the distance between fitted and true curves) and MACL (measuring uncertainty) of the estimated curves increased gradually up until ρ x 1 .x 2 = 0.7 and then sharply at ρ x 1 .x 2 = 0.9 for all methods (Figure 3(c)-(f)).The MASEs increased by more than twice at ρ x 1 .x 2 = 0.9 compared to ρ x 1 .x 2 = 0. Nevertheless, the true functions were recaptured reasonably well by all methods even when the correlation was high (see Figure 4).The MACP, remained almost unchanged across all levels of multicollinearity for both functions (Figure 3

(g)-(h)).
A similar effect of collinearity between X 1 and X 2 was observed in other simulation scenarios (see supplementary Tables S2 and S3), although bias, MSE, MASE and MACL were much higher for n i = 3 compared to n i = 5.As ρ x 1 .x 2 was increased, the convergence problem for DPQL (ML and REML) method also increased in small cluster size (n i = 3).For example, with m = 100, n i = 3, σ 2 int = 0.5, models failing to converge by DPQL (ML and REML) method increased from 16.7% to 23.6% of the total datasets when ρ x 1 .x 2 was increased from 0 to 0.9.

Application to BDHS-2014 data
In this section we applied the different approaches outlined in Section 2 for estimating GAMM to identify the factors associated with malnutrition status of children in Bangladesh using data from the BDHS-2014 survey.

Malnutrition, stunting and potential risk factors
Malnutrition affects people in every country and has been a major challenge for developing countries [28].In Bangladesh, poor nutritional status in early childhood is an important health problem.Around 36% of children under 5 years of age are stunted (low weight for age) in Bangladesh [29].The country aims to bring down the stunting rate to zero percent by 2030 among children under age five.As such, it is important to identify the potential risk factors that affect the early life malnutrition rate in Bangladesh.
Several indices may represent nutritional status such as stunting, wasting and underweight.Here we use stunting because it reflects impaired growth and development, and is considered as a primary indication of chronic malnutrition.A child is defined as stunted if his/her height-for-age is more than two standard deviations below the world health organization (WHO) child growth standard medians.Stunting in early life has many adverse effects including recurrent disease in early childhood, poor cognition and educational performance, and increased risk of nutrition-related chronic diseases in adult life [30].
Stunting can be caused by a variety of factors.Several studies have identified several socio-economic, demographic and health related factors that influence the stunting or malnutrition in Bangladesh (see, e.g.[31][32][33][34]).However, most studies did not consider the possible non-linear effects of the continuous covariates and none of them considered the possible correlation among children raised/living in the same household.Ignoring such correlations among outcome data may invalidate statistical inference.
In this analysis, we evaluated the association of the socio-economic, demographic and health characteristics with stunting status after accounting for the within household correlation and allowing for the possible nonlinear effects of the continuous covariates.

Data and models
Malnutrition data were extracted from the Bangladesh Demographic and Health Survey (BDHS) 2014.The BDHS-2014 is a nationwide large survey that used two-stage stratified random sampling for data collection.A sampling frame was constructed from the list of census enumeration areas (EAs) of the 2011 population and housing census of the people's republic of Bangladesh.The primary sampling unit (PSU) was an EA formed to have around 120 households on average.In the first stage, a total of 600 EAs was selected with probability proportional to their size: 393 from rural area and 207 from urban area.Then a complete household list was created for each selected EA to prove a sampling frame for the second-stage.In the second stage, a systematic sample of 30 households (on average) was selected per EA.The survey interviewed a total of 17,863 ever-married women in the selected households and collected various information on socio-economic, demographic and health related characteristics.
In order to identify the potential risk factors of malnutrition in Bangladeshi children, we consider a binary indicator 'stunting' (yes vs. no) as dependent variable.Based on the literature, we considered the following continuous and categorical factors as independent variables: mother's current age (in years), mother's body mass index (BMI), age of child (in years), sex of child (boy vs. girl), birth order (first-born vs. second-or higher-born), place of residence (rural vs. urban), geographical region of residence (Barisal vs. Chittagong vs. Dhaka vs. Khulna vs. Rajshahi vs. Rangpur vs. Sylhet), mother's education (no education vs. primary vs. secondary vs. higher), mother's working status (yes vs. no), wealth index (poor vs. middle class vs. rich), mass media exposure (not exposed to newspaper/radio/TV vs. exposed to at least one of these media for at least once a week), mother's NGO involvement (yes vs. no), family violence (yes vs. no), and age difference between spouses ( < 4 years vs. ≥ 4 years).
We included all children born 5 years preceding the BDHS-2014 survey and a total of 7886 children were identified.There were 894 missing data for some of the characteristics included in the analysis.We omitted all these missing data and our final analysis was based on the data from 6992 children who were born to 6829 distinct mothers.Most of the mothers had only one child under 5 years of age and the number of such children per mother varied between 1 and 5.Note that a subset of this dataset was previously described and analysed by Tasmiah and Bari [31].
In order to account for the potential correlation among outcomes and also to model the possible non-linear effects of the continuous covariates, we adopted a logistic additive mixed models in which all continuous covariates were modelled non-parametrically by using penalized thin plate regression splines of rank 20 (K = 20).More specifically, conditional on cluster-specific (or household-specific) random intercepts b i ∼ N(0, σ 2 int ), the binary outcomes Y ij (stunting status: 0 = no, 1 = yes) were assumed to be independent and follow a semiparametric logistic model where i = 1, . . ., 6829 indexes the cluster (or household), j = 1, . . ., m i denotes the number of children in the ith household.The fixed effects covariates x ij included an intercept, geographical region, residence, sex of child, birth order, mother's education, media exposure, mother's working status, wealth index, mother's NGO involvement, family violence, and age difference between spouses.The f 1 (mother s age ij ), f 2 (mother s BMI ij ) and f 3 (child s age ij ) are centred twice-differentiable smooth functions of mother's age, mother's BMI and child's age, respectively.We fitted model (11) using DPQL (REML), Laplace approximation, and Bayesian-HC methods.For DPQL and Laplace methods, standard errors of the estimated fixed effects and smooth functions were obtained following the procedure mentioned in Section 3.2.For Bayesian-HC, we ran 2 chains with 55000 iterations after discarding the initial 5000 burn-in iterations.The chains were thinned by keeping every 50th iteration and estimates were the sample medians.Convergence of the chains were assessed following Gelman and Rubin [35] and also visually examining the trace plot, density plot, and sample autocorrelation function for each parameter.All analyses were performed in R.

Results
The first three columns of Table 1 present the summaries of selected characteristics of the children and their mothers.Overall, 36% of the children were stunted.The highest prevalence of childhood stunting was observed in Sylhet division (48.3%) whereas it was lowest in Khulna division (27.2%).Stunting was more frequent in children living in rural areas (38.0%compared to 30.8% in urban areas) and for those born to poor mothers (45.5% vs. 25.3% for rich mothers), uneducated mothers (49.2% vs. 19.1% for highly educated mothers), working mothers (39.4%5 vs. 34.5% for non-working mothers), mothers not exposed to media (44.2% vs. 30.6%who were exposed to media), mothers affected by the family violence (38.7% vs. 34.5% with no family violence), and born as second or later child (34.3% compared to 30.2% for firstborn child).The ORs of stunting from the model fits via DPQL, Laplace and Bayesian-HC are reported in the last three columns of Table 1.We observed good mixing of the chains for the Bayesian-HC method and it yielded relatively stronger ORs (away from 1) with wider CIs than the DPQL and Laplace methods for most of the characteristics.Below we interpret results from the Bayesian-HC method.The risk of childhood stunting was statistically significantly associated with the division of residence, child's birth order, mother's education, mother's working status, and mother's wealth index.The odds of stunting was lower in all other divisions compared to Sylhet.Second-or later-born children were more likely to be stunted compared to the first-born children (OR = 1.34,CI: 1.14, 1.58).The higher education of mothers was associated with the lower odds of child malnutrition (OR = 0.45 for higher vs. no education; OR = 0.58 for secondary vs. no education).Children born to rich mothers were less likely to suffer from malnutrition (OR = 0.49 for rich vs. poor; OR = 0.82 for middle class vs. poor).Further, children born to working mothers were more likely to be stunted compared to those born to non-working mothers (OR = 1.28,CI: 1.10, 1.48).
Figure 5 shows the estimated nonparametric functions of mother's age, mother's BMI and child's age for the DPQL, Laplace and Bayesian-HC methods.The 95% pointwise credible intervals of the curves (shaded grey region) are displayed only for the Bayesian-HC method.Different estimation techniques suggested different trends for the effect of mother's BMI; however, similar trends were observed for the effects of mother's age and child's age.The curve estimated via Bayesian-HC indicated an increased risk of stunting for children born to mothers younger than 25 years (Figure 5(a)).The risk declined sharply up until their mothers' age of 25, declined gradually thereafter up to age 32 and then stabilized, although a slight increase in the risk was apparent after 40 years.The DPQL and Laplacian fits closely reproduced the behaviour of the Bayesian-HC fit.
The Bayesian-HC estimated smooth curve showed an increased risk of stunting for children born to mothers with extreme BMI ( < 20 and > 45) (Figure 5(b)).The convex shaped risk function decreased gradually to reach the minimum at BMI of 35 and then increased gradually.The DPQL fit closely followed the Bayesian-HC fit up until the BMI of 30 and then showed a slow increase relative to the Bayesian fit.The Laplacian fit, on the other hand, completely missed the non-linear association and indicated a downward sloping linear trend.With regard to the age of children, the risk of malnutrition was higher for children aged more than 18 months.The risk functions estimated by different methods were very similar, and the risk increased sharply up until the age of 18 months and then stabilized with some fluctuations.

Discussion
For analysing correlated binary data, how best to estimate flexible curves in a GAMM is unclear, especially when cluster sizes are small and there exist multicollinearity among the covariates.Many of the proposed methods yield biased estimates for model components.We compared the performance of the Bayesian and frequentist approaches for estimating GAMMs in which the flexible nonparametric functions were estimated by the mixed model representation of penalized thin plate regression splines.
Results from the Bayesian and frequentist methods were comparable.No single method appeared to perform best in estimating all components of the model.However, the Bayesian method with half-Cauchy prior for variance components was found perform well in estimating all model components, especially for small cluster size.Different methods behaved differently in estimating nonparametric functions depending on the shape of the association.Methods, that relatively overestimated the variance components, re-captured the true curves better.
Results from the Bayesian methods were sensitive to the prior specification of variance components.The inverse-Gamma (IG) prior specification underestimated variance parameters (and hence overestimated smoothing parameters) resulted in oversmoothed nonparametric curves.The reverse was true for the uniform prior specification.Using half-Cauchy priors appeared to be a compromise between the uniform and IG priors based on the all performance indicators.
The DPQL method exhibited bias (in estimating model components) and had frequent convergence problems for small cluster size (n i = 3).However, it generally showed a good performance for a reasonably large cluster size (n i = 5) and the findings were in line with Lin and Zhang [1] except for the variance estimate of the random intercept; the variance was underestimated in [1] whereas we found it to be overestimated.This may be due to the difference in spline functions used for curve fitting, or using different statistical software.We used penalized thin plate regression splines and the glmmPQL function in R to estimate model parameters, whereas Lin and Zhang [1] adopted natural cubic smoothing splines and estimated it via SAS macro glimmix.
Similar to the Bayesian-IG, the Laplace method was found to underestimate the variance components, leading to oversmoothed nonparametric curves.Nevertheless, the fixed effect parameter associated with the binary exposure was very well estimated for large cluster sizes.In general, the computation was relatively fast and no convergence problem was reported, confirming the results in Wood and Scheipl [14]: the Laplace method is numerically more robust than the DPQL.
Of note, it has previously been shown under different settings that the Bayesian approach is sensitive to the prior specifications and that the PQL method is biased for binary outcome, especially for small cluster sizes in a GLMM.Therefore, it is not surprising that the similar findings were observed for estimating GAMMs under the mixed model representation of smooth terms.However, we compared the performance of all methods together under the same conditions in a GAMM and explored the magnitude and direction of the bias.We demonstrated the strength of the Bayesian methods when the system is stressed, e.g. when the cluster size was small.Although the DPQL is not typically recommended for binary data, we found that it estimated the nonparametric functions well as long as the method converged.The Laplace method was found to oversmooth the nonparametric functions, indicating that it might end up with fitting a straight line when the nonlinear association is not very strong; the same observation holds for the Bayesian-IG method.
We also assessed the effect of multicollinearity on the estimates of the GAMM components.We found that the curve fitting performance was affected by the higher degree of multicollinearity: increased bias and decreased precision were observed for estimated curves by all methods.The fixed effect estimate of the binary exposure ( β1 ) was insensitive (in terms of both bias and precision) to the multicollinearity between two continuous covariates in all cases.This may be due to the fact that even in the presence of high multicollinearity, the estimated curves still recaptured the true functions reasonably well (Figure 4), and as such the estimation of β 1 was not affected that much owing to the adequate adjustment of confounding factors without leaving notable residual confounding.
We analysed BDHS-2014 data using a GAMM and identified a number of factors that were associated with the child malnutrition in Bangladesh.We found that age and BMI of mothers as well as age of children had nonlinear impacts on stunting; the fitted curves varied with respect to estimation methods.For example, the Laplace method completely failed to capture the nonlinear effect of mother's BMI while other methods were able to detect it.This is in line with the simulation results which indicated that the Laplace method could fail to detect a nonlinear effect when the effect is not so strong.Unlike simulation results, the estimated curves for mother's age and child's age by different methods were very similar.This may be due to the simple shapes of nonlinear association for mother's age and child's age, and our simulation results indicated that for simple shapes of association different methods would perform similarly.
In this study, we assessed the frequentist properties of the Bayesian methods which are rarely reported in the literature due to the computational intensity.While a wide range of data generation scenarios were considered to compare the performance of the Bayesian and frequentist methods, the overall investigation was nevertheless limited to the simple scenarios.For example, we considered estimating GAMM that involved only random intercept, two specific shapes of nonlinear associations, and the number of clusters set to 100.However, given the number of scenarios necessary to systematically compare the methods and investigate the effect of multicollinearity even in such a simplified case, these restrictions at this stage seem reasonable.Although a range of scenarios was investigated in the simulations, the range was not exhaustive and the results from this study may not be applicable to situations not considered.Moreover, while we investigated the effect of multicollinearity, the effect of concurvity (that is when the continuous covariates are nonlinearly associated) has not been explored in a GAMM setting.This may be considered a promising aspect of future investigation.
Overall, our simulation results suggested using Bayesian estimation with half-Cauchy prior for variance components, especially when the cluster size is small.The Bayesian methods, however, are computationally intensive and require specialized code, limiting their use in practice.If the main interest lies in accurate estimation of the fixed effects only, the Laplacian approximation would be a worthy alternative which is fast and easy to implement with widely used standard procedures.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
Dr. Benedetti is supported by the Fonds de recherche Santé Québec (FRQS).

Figure 2 .
Figure 2. True and estimated curves of the estimated nonparametric functions based on 1000 replications (upper panel) and smoothed pointwise coverage probabilities of the 95 % confidence intervals (lower panel) for 1000 replicated datasets.These results are for the data generation scenario with m = 100, n i = 5, σ 2 int = 0.5, ρ x 1 .x 2 = 0.The curves estimated by Bayesian-HC were between those obtained by Bayesian-IG and Bayesian-UNIF and have not been displayed here to make other fits more visible.

Figure 4 .
Figure 4. True and estimated curves from the Bayesian (half-Cauchy priors) estimated nonparametric functions (left: f1 (x 1 ) and right: f2 (x 2 )) at different levels of correlation between x 1 and x 2 based on 1000 replications.These results are for the data generation scenario with m = 100, n i = 5, σ 2 int = 0.5.

Figure 5 .
Figure5.Bayesian and frequentist estimates of f 1 (mother's age), f 2 (mother's BMI) and f 3 (age of children) for the logit of the prevalence of stunting.The shaded regions are the 95% credible intervals from the fully Bayesian (MCMC-HC) fit.

Table 1 .
Results from BDHS data analysis.