Dealing with separation or near-to-separation in the model for multinomial response with application to childhood health seeking behavior data from a complex survey

Separation or monotone-likelihood can be observed in fitting process of a multinomial logistic model using maximum likelihood estimation (MLE) when sample size is small and/or one of the outcome categories is rare and/or there is one or more influential covariates, resulting in infinite or biased estimate of at least one regression coefficient of the model. This study investigated empirically to identify the optimal data condition to define both ‘separation’ and ‘near-to-separation’ (partial separation) and explored their consequences in MLE and provided a solution by applying a penalized likelihood approach, which has been proposed in the literature, by adding a Jeffreys prior-based penalty term to the original likelihood function to remove the first-order bias in the MLEs of the multinomial logit model via equivalent Poisson regression. Furthermore, the penalized estimating equation (PMLE) is extended to a weighted estimating equation allowing for survey-weight for analyzing data from a complex survey. The simulation study suggests that the PMLE outperforms the MLE by providing smaller amount of bias and mean squared of error and better coverage. The methods are applied to analyze data on choice of health facility for treatment of childhood diseases.


Introduction
Separation or monotone-likelihood may occur in the fitting process of the multinomial logistic regression model using maximum likelihood estimation (MLE) procedure, resulting in an infinite estimate of at least one regression coefficients of the model. As described in many studies [5,15,16], separation occurs when the responses and non-responses are separated by a predictor, particularly binary predictor, or a non-trivial linear combination of several predictors. The phenomenon of separation is more likely to occur when the sample size is small, or number of response category is high or one of the categories is rare, or there is one or more strong predictors or combination of them. Let us consider an example (Table 1) describing different forms of separation caused by a dichotomous covariate (X) in Table 1. Example of separation (left) or near-to-separation (right) due to a dichotomous predictor X (covariate) against outcome Y. Number in each cell indicates number of observations. Separation Near-to-separation Y Y multinomial response Y. In the first case, the response is completely separated (more than one empty cell if we compare response categories 2 and 3) or quasi-completely separated (one empty cell if we consider response categories 1 and 3) by the predictor. In the second case, all are non-zero cells but there is at least one cell with a few observations, which we termed as 'near-to-separation' (partial separation). For multinomial response, we empirically determined a cutoff (threshold) of such observations, which is maximum 15% of the total sample in the data, to define the case as 'near-to-separation'. Details of the empirical evaluation are discussed later in Section 3. Although complete or quasi-complete separation is rare in practice, the situation of 'near-to-separation' is more common. As reported by Mondol and Rahman [20] in correlated binary data, this form of separation often makes trouble to achieve convergence and yields bias in the estimate to some extent, and hence it cannot be ignored. Such forms of separation can also be found in the multinomial data from a complex survey design, for example, in 'childhood health seeking behavior' data extracted from the database of the Bangladesh Demographic and Health Survey (BDHS) 2014. The BDHS employed two-stage stratified cluster sampling design to make a stratumspecific estimate of the outcome for some stratum such as urban and rural in addition to seven administrative divisions of the country, which may help the policy-makers for designing evidence-based policies. Therefore, when looking for a stratum-specific estimate of health seeking behavior (choice of health facilities: private/public/local-pharmacy, for treatment of acute respiratory infection (ARI)), one can observe the existence of both complete or quasi-complete separation (for both division and urban stratum) and nearto-separation (for rural stratum) created by the covariate 'household economic condition' (poorest/poorer/middle/richer/richest) calculated from household assets owned by the parents ( Table 2). The MLE of the regression coefficient associated with covariate 'economic condition' in the multinomial logit model of the 'choice of facilities' fitted for the urban stratum (where separation exists) is reported with an extreme high value of the odds ratio (OR) and very wider Wald-based confidence interval, which lead to invalid inference [24]. For the rural stratum (where near-to-separation exists), the corresponding odds ratio value is also reported to be large in some cases. Similar results are observed for a divisional stratum (results not shown). These results suggest that maximum likelihood-based multinomial logit models for health seeking behavior data may be failed to provide consistent estimates and valid inference. The results are similar to those found in many other studies [4,12,15,16], which described consequences of separation in the MLE including frequent convergence failure, and even if it converges, it provides biased or often infinite estimates of at least one regression coefficient and standard error (SE) leading to misleading inference.
A number of studies discussed small sample bias adjustment in the ML estimates of the regression coefficient of the generalized linear model [2,3,[6][7][8]10,21,23] while other Table 2. Observed data and odds ratio of the three different population: urban, rural and both for the covariate economic status obtained from the fitted MLE (adjusted for other covariates) to the data on childhood health seeking behavior for treatment of ARI.  Poorest  6  3  0  57  13  10  63  16  10  P oorer  3  5  4  42  10  18  45  15  22  Middle  6  4  5  41  5  16  47  9  21  Richer  14  7  9  18  4  11  32  11  20  Richest  5  5  18  6  1  9  11  6  27  Total  34  24  36  164  33  64  198  57  studies discussed solution to the problem of separation [15,16]. Of them, Firth's [10] proposal was based on reducing the first-order (O(n −1 )) bias in the MLEs by adding a penalty term in the score function, which was equivalent to Jeffrey's invariant prior. Heinze and Schemper [16] extended Firth's penalized approach for solving the problem of separation in binary logistic regression. Firth's approach was also extended to conditional logit [22], proportional odds [19], multinomial logistic regression [4] and non-linear GLM [17] for a small sample bias reduction. Furthermore, Kosmidis and Firth [18] explored the connection between multinomial logistic regression models and Poisson log-linear models for cell counts and proposed a penalized maximum likelihood estimators procedure to remove the first-order bias in the MLEs of the multinomial logit model using Poisson trick. Although small sample bias reduction and solution to separation has been proposed for multinomial outcome by Bull et al. [4] and later by Kosmidis and Firth [18] via an equivalent Poisson regression, none of them elaborately investigate the performance methods in the presence of separation, particularly 'near-to-separation', which is more common in practice. The present study investigates the consequences of all forms of separation in the maximum likelihood and penalized maximum likelihood approaches using an extensive simulation study and compared the results to provide some practical recommendations. In addition, the paper extended the penalized methods incorporating sampling weight for analyzing the above-mentioned health seeking behavior data extracted from a survey with complex design. The resulting weighted penalized method achieves convergence and obtains finite estimate of the regression in the presence of both separation and near-to-separation. The paper is organized as follows. Section 2 reviews the existing maximum likelihoodbased multinomial model and describes the penalized methods incorporating the surveyweight. The simulation study is described in Section 3 and an application of the methods for analyzing health seeking behavior data in Section 4. Finally , in Section 5, the main findings of the paper are discussed.

Multinomial logit model
Let us consider a multinomial response Y with k categories and their corresponding probabilities π 1 , π 2 , . . . , π k . Let q = k − 1 and β T =(β T 1 , β T 2 , . . . , β T q ) be the vector of the pq model parameters. Assume that we have observed n pairs (y i , x i ) with y i = (y i1 , . . . , y iq ) T the vector of observed frequencies which are realizations of a multinomially distributed random vector Y i with index m i , and x i , a p × 1 vector of known covariate values. The observed frequency from the kth category is y ik = m i − q s=1 y is and π i = (π i1 , . . . , π iq ) T is the q × 1 vector of the corresponding category probabilities. By definition, the probability of the kth category is π ik = 1 − q s=1 π is . The matrix X with rows x T i is assumed to be of full rank and for the presence of an intercept parameter in the model, the first element of x i is set to one for every i = 1, 2, . . . , n. Let Z i = 1 q ⊗ x T i be the q × pq model matrix. The log-odds can be expressed as where z ist is the (s, t)th element of Z i and 1 q is the q × q identity matrix.

MLE procedure
According to the notations in Ref. [9], the multinomial log-likelihood can be written as The corresponding score function for the coefficient vector β has the following form: The maximum likelihood estimator (MLE) for the regression coefficient is the solution to the equation U(β) = 0, for which Newton-Rapshon approximation is commonly applied [9].

Review of Firth's penalized method for GLM
To remove the first-order (O(n −1 )) bias in the MLE of the tth regression parameter of GLM, Firth [10] proposed the following score function: with the penalty term where U t (β) ≡ ∂l(β)/∂β t = 0 is the standard score function based on the log-likelihood function l(β) = log L(β), and I(β) −1 is the inverse of the corresponding information matrix I(β) = −∂ 2 l(β)/∂β 2 evaluated at β. The corresponding penalized log-likelihood and likelihood functions against the above-modified score equation are l * (β) = l(β) + 1/2 log |I(β)| and L * (β) = L(β)|I(β)| 1/2 , respectively. The penalty function is known as Jeffreys invariant prior for this problem and its influence is asymptotically negligible. Bull et al. [4] applied Firth's general idea to the multinomial logit models for finite sample bias reduction in the regression estimate. For multinomial logit model, the observed information matrix F(β) = Z T WZ, with W being a nq × nq block-diagonal matrix with q × q blocks W i = {w isk }, w iss = π is (1 − π is ) for s = k and w isk = −π is π ik otherwise. Then, the penalized score function given in Equation (3) can be replaced by the following form for the multinomial logit model : where K is is a q × q symmetric matrix with (u, v)th element, the third-order cumulants of Y i and H i denotes the ith diagonal block of the nq × nq 'hat' matrix H = Z(Z T WZ) −1 Z T W consisting of n 2 blocks, each of dimension q × q. After some algebra, the above penalized score function takes the following form: where h ius is the (s, u)th element of H i . The solution of the equation U * t (β) = 0 (from Equation (4)) is computationally tedious due to the complex structure of the hat matrix. Kosmidis and Firth [18] proposed an alternative penalized score function for the multinomial logit model using the equivalent Poisson trick, which is more flexible (simple and computationally efficient) for illustrative purpose. The details of this approach are discussed below.

Solving the problem of separation in multinomial response using equivalent Poisson regression
The multinomial model in Equation (1) can be embedded conveniently into a Poisson loglinear model. Using the indicator function δ sk (1, if s = k and 0 otherwise), the log-linear model for multinomial response can be written as (using the similar notations used in Ref. [18]): where μ is are the expectations of the independent Poisson random variables Y is , τ i = k s=1 μ is and φ i nuisance parameters. As described by Kosmidis and Firth [18], both the multinomial logit model given in Equation (1) and the Poisson log-linear model in Equation (5) are full exponential families and hence Firth's [10] bias-reduced penalty term can be directly applied to the likelihood of each model. However, they argued that under the given parameterization θ † = (β T , τ T ) T , the Firth-type penalized Poisson likelihood cannot be factorized as the product of the required multinomial likelihood (for β) and a factor free of β. Therefore, standard computation of bias-reduced estimates for the full parameter vector θ † in the Poisson log-linear model does not deliver bias-reduced estimates of β of the multinomial model. Kosmidis and Firth [18] provided a solution to work with the restricted version of the Poisson model derived by imposing the constraint m i = τ i . For more details about this constraint, see Ref. [18]. Under this constraint, the Poisson model in Equation (5) becomes into a canonically linked generalized non-linear model of the form: To simplify the computational algorithm of the likelihood score equation of the above non-linear model, Kosmidis and Firth [18] applied the general results of the penalized likelihood of the generalized non-linear model (Equation (13)) derived in Ref. [17] and proposed the following bias-reducing adjusted score functions for the above model: where F † is the expected information on θ † , D 2 (ζ is ; θ † ) denotes the (n + pq) × (n + pq) Hessian matrix of ζ is with respect to θ † , and z † ist is the (s, t)th component of the k × (n + pq) matrix does not depend on s and substituting for z † ist , Kosmidis and Firth [18] proposed the following simple form of the adjusted score functions for β: where g ist is the (s, t)th component of G i . Only the quantity h iss in the above bias-reducing score equation will be affected due to the constraint τ i = m i . Kosmidis and Firth [18] also proved using Theorem 1 that these adjusted score functions are identical to those (Equation (4)) obtained directly from the penalization of the likelihood for the model in Equation (1). The penalized estimating equation (PMLE) estimates of β can be estimated using the following standard iterative process proposed by Kosmidis and Firth [18]. New values β (j+1) obtained from the candidate β (j) by solving the following iterative equation: iss calculated for the restricted parameterization. As suggested by Kosmidis and Firth [18], Equation (7) can be implemented by the following steps: i . (iii) fit model in Equation (5) by maximum likelihood but using the adjusted responses y is +h The β-block of the inverse of the expected information matrix evaluated at the reducedbias estimates (termed as PMLE) can be used to produce valid SEs for the estimators. The penalized method using the Poisson trick presented above was originally proposed for finite sample ( first-order) bias reduction in the MLE of the multinomial model. However, these bias-reducing score functions in Equation (6) can be used for solving the problems due to separation and near-to-separation discussed earlier in the multinomial response. Because the functions given in Equation (6) guarantee finite estimates of β even while there is empty cell in one or more response categories, which was not possible by the standard score functions (Equation (2)) derived without penalization. This idea is similar to those proposed by Heinze and Schemper [16] for solving the problem of separation in the logistic regression. The corresponding estimate of SEs of the PMLE can be obtained from the roots of the diagonal elements of the hat matrix H i .

Weighted PMLE for complex survey data
The main motivation of the paper is to apply the above method to analyze data (mentioned earlier) extracted from a cross-sectional survey with stratified cluster design (a multi-stage design), which is a complex design. Therefore, incorporating individual sampling weight calculated from the inverse probability of selection in estimation of the model is extremely required to have accurate estimate of the regression parameters and associated SE. Let Y hjik be a multinomial response with k categories for the ith subject (i = 1, 2, . . . , n hj ) from the jth cluster (j = 1, 2, . . . , m h ) and from the hth stratum (h = 1, 2, . . . , H). Again, let δ hji be an indicator variable with value 1 if the subject hji is selected in the sample and 0 otherwise and P(δ hji = 1) be the probability of being selected into the survey, which is fixed by study design and may depend on covariates or additional variables such as screening variables not in the Poisson model for multinomial response of interest [11]. This probability can be calculated by multiplying the probability of selection in each stage in the multi-stage sampling scheme. Therefore, each subject in the sample has known weight w hji = δ hji /p hji . Incorporating the sampling weight, the penalized estimating function given in Equation (6) becomes to the following weighted estimating function for survey data: Solving U †,w t = 0 provides weighted PMLE for the survey data. The equation can be solved using the same procedure as described earlier and the SE of the weighted PMLE can be obtained from the squared root of the robust variance estimator.

Simulation study
An extensive simulation study was performed to investigate the consequence of separation and near-to-separation in MLE for multinomial model and evaluate the performance of PMLE over MLE. We performed simulation considering data from a study with simple random sampling.

Data generation
Let us consider multinomial data y is for the ith observation and sth response category (i = 1, 2, . . . , n; s = 1, 2, . . . , k) with p = 2 covariates, one of which was considered as binary and other one as continuous. The realizations x isb of the binary covariate X b were generated from Bernoulli distribution with probability of event γ and x isc of the continuous covariate X c from standard normal distribution. We considered a multinomial response with three response categories and the first response category as the reference category, then the multinomial responses y i = (y i1 , y i2 , y i3 ) were generated using the multinomial distribution with the probabilities derived from the following model: The probabilities corresponding to the multinomial response categories satisfied the condition π i1 + π i2 + π i3 = 1. The values of β 02 and β 03 determine the overall prevalence of the response category in the data, and the values of β 12b , β 22c , β 13b and β 23c represent the effect size indicating the strength of association between the covariate and the response.

Simulation scenarios
Using the above settings, data were generated for several simulation scenarios with the evidence of separation or near-to-separation. To create separation, we treated binary covariate X b as an influential covariate by considering a relatively large value of β 12b and β 13b (compared to β 22c and β 23c associated with X c ) so that it creates any form of separation defined earlier (Table 1). This setting confirmed complete or quasi-complete separation (at least one zero cell in a contingency table between the binary covariate X b and response Y) for some simulated datasets and created near-to-separation (there is at least one cell with a few observations in the contingency table) for most of the other simulation sets. In order to determine the optimal cutoff point of the cell with few observations, we initially performed an extra simulation study before conducting the final simulation by considering different cutoff points as the percentage of observations of the total sample, which are 5%, 8%, 10%, 12%, 15%, 18% and 20%. The idea is to determine the optimal cutoff point (as percentage) for which there is negligible bias in the estimate of the regression coefficient. We then consider this cutoff point as the rule of thumb below which one can consider the dataset with evidence of near-to-separation. This simulation was performed for scenarios with the true values of the coefficients associated with the binary covariate β 12b = 1.3, β 13b = 1.2 and sample size n = 50 and 100. For each scenarios, we plot the bias of the estimated regression coefficient over 1000 simulations. The empirical result showed that the bias in both the MLE and PMLE is near to zero for cell frequency greater than 15% of the total sample ( Figure 1). We then defined near-to-separation in the final simulation if there is at least one cell with observations less than 15% of the total sample. For the final simulation, we considered several scenarios created by varying the sample size n and degree of imbalance in the prevalence of the response category, proportion of event in binary covariate (γ ), magnitude of the log-odds ratio relating the binary covariate and the response (β 12b , β 13b ). We considered the sample size n as 15, The above simulation scenarios were considered for two cases based on the distribution of the response category. The case 1 is termed as 'homogeneous distribution of the response category', where the percentage distribution of the response categories are approximately equal (approx. 33% in each category), and case 2 is termed as 'heterogeneous distribution of the response category', where such distribution of the response category is unequal. These two cases were considered by fixing such a value of β 02 and β 03 which control the overall prevalence of the response category in the data.

Fitting the model and evaluating the performance
For each simulation scenario, we created 1000 replications of dataset and fitted the model in Equation (1) using MLE and the model in Equation (5) using PMLE to each of the simulated datasets. For both the MLE and PMLE estimates, we calculated the bias as bias(β r ) =β r − β r whereβ r = 1000 s=1β rs /1000, the mean squared of error (MSE) as MSE(β r ) = 1000 r=1 (β rs − β r ) 2 /1000. We also reported analytical SE as the average of the analytical estimates over the number of simulations, and the simulated standard error (SimSE) as the standard deviation of the estimates over the number of simulations. The coverage of the confidence interval for both MLE and PMLE estimators was calculated as the percentage of confidence intervals which include the true value over 1000 simulations. In all aspects of performance, we reported the number of convergence failure and summarized the results over the simulation for which convergence achieved.
All computations were performed using R version 3.5.2. For the standard MLE, the function 'multinom' belonging to R library 'nnet' was used, and for PMLE the function 'brmultinom' belonging to R library 'brglm2' was used.

Results
We first summarized likelihood of separation (complete or quasi-complete) or nearto-separation (in terms of percentage) over 1000 simulations under different scenarios considered (supplementary Table A.1). The results revealed that the likelihood of separation was positively correlated with degree of uneven proportions of event and non-event in the binary covariate (γ ) and the magnitude of log-odds ratio (β 12b , β 13b ). However, the chances of separation (complete or quasi-complete) decreased with the increasing sample size. The opposite scenarios can be observed for the near-to-separation. The results suggest that the separation is likely to occur for small sample or even for large sample with rare outcome along with extremely uneven distribution of event and non-event in binary covariate.

Case 1: homogeneous distribution of the response categories
For the parameter set β = (β 02 , β 12b , β 22c , β 03 , β 13b , β 23c ) = (−0.6, 1.3, .65, −0.5, 1.2, 0.5) and for γ = 0.5, the percentage distribution of the response category is approximately equal even if the sample size vary. Under this homogeneous condition, the results revealed that in the presence of separation (complete or quasi-complete), the MLE either failed to achieve convergence or, if achieved, provided large estimates of the regression coefficients (β 12b , β 22c , β 13b , β 23c ) ( Table 3). The number of convergence failure decreased with the increasing sample size. The MLE also reported large SE, particularly for the regression coefficients (β 12b , β 13b ) associated with the binary covariate (X b ), which created separation. In contrast, the PMLE achieved convergence for all these cases and provided finite estimate of the regression coefficients. In particular, the PMLE slightly overestimated the true value of the regression coefficients (β 12b , β 13b ) associated with X b and underestimated the true value of those associated with X c . Similar results can be observed for all other simulation scenarios with evidence of separation (results not shown). As the MLE showed convergence failure, large amount of bias in the estimate and high SE, we therefore restricted to evaluate the other properties such as MSE and coverage of the confidence interval for these scenarios.
For the simulations with evidence of near-to-separation, the amount of bias and MSE for the coefficients associated with the binary covariate decreased with increasing sample size, which is true for both the MLE and PMLE (Figure 2). However, the amount of both bias and MSE were observed to be low for the PMLE compared to those associated with the MLE. The amount of bias and MSE for both the MLE and PMLE slightly increased with increasing value of the true regression coefficients (β 12b and β 13b ) associated with the binary covariate ( Figure 3). The PMLE showed substantially low amount of bias and MSE compared to the MLE. The bias and MSE for the both estimates were observed to be slightly lower when the proportion of event was around 0.5 ( Figure 4). As usual, the PMLE showed smaller amount of bias and MSE than the MLE. Furthermore, we evaluated the performance of MLE and PMLE by increasing the number of binary covariate (K). The results suggested that the amount of bias and MSE in both the MLE and PMLE increased with increasing number of binary covariate but were greater for MLE than PMLE ( Figure 5).
For the regression coefficients (β 22c , β 23c ) of the continuous covariate which is not responsible for creating near-to-separation, both the MLE and PMLE, in general, provided smaller amount of bias and MSE compared to those associated with the binary covariate that created near-to-separation ( Figure 6). Here, the MLE reported bias and MSE to some extent when sample size is small and the amount of bias and MSE decreased with increasing sample size. In contrast, the PMLE showed negligible amount of bias and MSE, even for small sample.
Some additional results (analytical SE, SimSE, width of confidence interval and coverage) of the MLE and PMLE from the simulation sets with evidence of near-to-separation were summarized in Table 4. The results revealed that in the presence of near-to-separation, both MLE and PMLE achieve convergence and provide finite estimates for the regression coefficients. However, the MLE provided large estimate of the SE (but smaller than the simulation SE) and wider confidence interval compared to those associated with PMLE.

Case 2: heterogeneous distribution of the response categories
For the parameter set β = (β 02 , β 12b , β 22c , β 03 , β 13b , β 23c ) = (0.3, 1.3, .65, 0.1, 1.2, 0.5) and for γ = 0.5, the percentage distribution of the response category was heterogeneous (unequal) even if the sample size vary. Under this heterogeneous condition, similar pattern of the results can be observed here in the presence of complete or quasi-complete separation with convergence failure and infinitely large estimates for the MLE of the regression coefficients particularly that associated with the binary covariate (supplementary Table  A.2). In contrast, the PMLE showed greater improvement by achieving convergence and providing finite estimate. The number of simulations with complete or quasi-complete separation was greater under heterogeneous case than the homogeneous case. The number of simulations with such conditions decreased with the increasing sample size. Similar results were also observed for all other simulation scenarios with evidence of complete or quasi-complete separation (results not shown).
In the presence of near-to-separation, the results suggested that the amount of MSE in both the MLE and PMLE decreased with increasing sample size and MSE were observed to be low for the PMLE estimate compared to those associated with MLE estimate (supplementary Figure A.1). Again, the amount of MSE is observed to be substantially high for the MLE compared to the PMLE for any value of the true effect size and the proportion of the binary covariate. For the regression coefficient of the continuous covariate (β 22c , β 23c ) that is not responsible for creating near-to-separation, both the MLE and PMLE, in general, provided smaller amount of MSE compared to those associated with the coefficient of the binary covariate (β 12b , β 13b ) that created near-to-separation. Here, the PMLE also provided comparatively smaller amount of MSE than the MLE (supplementary Figure A.2). The MSE also decreased with increasing sample size and for large sample, the MLE and PMLE provide comparable results. Like the results for homogeneous distribution of the response category, similar pattern of the results can be observed here with the large value of SE, wider confidence interval for the MLE in comparison with the PMLE (supplementary Table A.3).

Data and variables
The methods are applied to analyze the 'childhood health seeking behavior' data extracted from the database of the BDHS 2014 conducted between June 2014 and November 2014 through the collaborative efforts of the National Institute of Population Research and Training (NIPORT), ICF International (USA), and Mitra and Associates as part of the world-wide demographic and health survey (DHS) program. The BDHS is a nationally representative cross-sectional survey that employed two-stage stratified cluster sampling scheme, where in the first stage, enumeration areas (EA) were randomly selected (75% from rural area and 25% from urban area) from each of the seven administrative divisions, and in the second stage, households were selected following systematic sampling method from the each selected EA [25]. The main objective of such national level survey is to provide updated facts and figures on health and demographic indicators for the policy-makers.
Childhood health seeking behavior information (choice of service providers/facilities) has been collected for all children of age under 5 years for their illness (due to diarrhea, ARI, fever ) in the two weeks preceding the survey. Of the total children (n = 7886), 406 were reported to suffer from ARI, 2768 from fever and 371 from diarrhea. The information about the choice of service providers for the sick children is also collected from the respondent (their parents and caregiver). Of the children with any illness, 355 with ARI, 1999 with fever and 280 with diarrhea received treatment from medical facilities. The health seeking  behavior is defined as their choice of service providers or place from where they first sought treatment. The outcome variable 'choice of health facilities' is categorized as (i) local facilities -drug dealer of pharmacy or unqualified doctor such as so-called untrained village doctor, (ii) public facilities -public hospital, district hospital, maternal and child welfare center, upazila health complex or other public sector, (iii) private facilities -private hospital, NGO static clinic, private chamber or other private sector.
Of these facilities, the most expensive one is the private facility, which is followed by the public and the local facilities. The local facilities are equipped with all unqualified medical providers who basically sell the medicine and hence most of the time they charge for medicine but not for consultation. Generally, people from poor household seek for treatment from such local facilities [1]. In addition, due to lack of knowledge, some people often use these facilities, even in most cases government facility provides health service with very low cost. However, government facilities often suffer from shortage of health-care providers, and hence it takes time to get service. Several covariates are considered here including wealth index (poorest, poorer, middle, richer, richest), place of residence (urban, rural), current working status of the parent's (yes, no), mother's education (none, primary, secondary, higher) and respondent's age. Of them wealth index (the socio-economic status of the parents) is considered to be an influential covariate that may greatly influence the choice of health facilities, because medical expenses significantly differ among the above three facilities.

Analysis and results
As this survey used a two-stage stratified sampling scheme where urban and rural are two important strata, obtaining separate estimates for urban and rural strata, in addition to national estimate, is a major concern for the decision makers to make appropriate policies for urban and rural part of the country. Therefore, we performed separate analysis for urban and rural strata in addition to combined analysis with full dataset. Summary statistics from the contingency table of the outcome variable (choice of facilities) for each type of disease (ARI/fever/diarrhea) suggest that the outcome variables related to ARI and diarrhea for urban sample is separated by the wealth index (Table 2) and those for fever is nearly separated (near-to-separation) by the same covariate (results not shown). For rural sample, the outcome variables for both ARI and diarrhea are also nearly separated by the wealth index. The education level also created a sort of near-to-separation for ARI and diarrhea for the urban sample (results not shown).
For the choice of health facilities for each childhood illness (ARI, fever, diarrhea), multinomial models with both MLE and PMLE version were fitted separately for the urban and rural strata and combined data. Again, as the sampling design of this survey is based on a multi-stage design, we therefore fitted the model incorporating households sampling weight and reported the weighted estimate of the regression coefficients and its SE. In each model, we explored the relationship between health seeking behavior of parents for illness (ARI, fever, diarrhea) and the covariates wealth index, place of residence, current Table 5. Goodness-of-fit statistics of the models fitted using MLE and PMLE for the data from different population (urban, rural and both) on childhood health seeking behavior for treatment of ARI, fever and diarrhea. Note: AIC, Akaike information criterion; LR, likelihood ratio; *, significant test statistics.
working status, mother's education and respondent's age. Furthermore, in each model, we compared the public and private facilities with the local facilities as reference category. The goodness-of-fit statistics (AIC, deviance residuals, likelihood ratio test, and Hosmer-Lameshow test) of each model fitted using both MLE and PMLE showed comparable results ( Table 5). The goodness-of-fit for all models (models for different types of illness and sub-groups in Tables 6-8) are found to be satisfactory for at least one of the goodness-of-fit criterion applied here.
The results for urban stratum revealed that the MLE provided infinitely large estimates of the odds ratio associated with the wealth index that created quasi-complete separation in the outcome (choice of health facilities) for the illness due to ARI, which is not interpretable ( Table 6). In contrast, the PMLE provided comparatively very smaller odds ratio associated with the respective covariate. In addition, the PMLE provided comparatively smaller value for each of the other estimates of the OR associated with education than the MLE. When the model fitted for rural stratum (for which sample size is larger than the urban stratum), a remarkable difference in the estimates between the MLE and PMLE is also observed, particularly for the OR associated with the wealth index that created near-to-separation, with relatively smaller value for the PMLE. For the combined data, a negligible difference in the estimates between two approaches is observed.
The fever data are comparatively large and there was no evidence of complete separation except for near-to-separation in the urban stratum. However, there was evidence of separation due to wealth index in diarrhea data for the urban stratum and a sort of near-to separation in the rural stratum. Similar results were observed in the analysis of fever (Table 7) and diarrhea (Table 8) data with noticeable difference in the estimates of regression coefficients while separation exists and small difference when there is evidence of near-to-separation and negligible difference when there is no evidence of any kind of separation.

Discussion
Complete separation or near-to-separation (partial separation) is a serious problem in the fitting process of a model. Although the data with complete or quasi-complete separation  are rare but nearly separated data (near-to-separation) are very common in practice. However, it is essential to explore the data condition (rule of thumb) to define near-toseparation. This study determined a threshold value (< 15%) of the observed count in at least one cell of a contingency table of a binary covariate and the outcome. Furthermore, this study investigated the consequences of both separation and near-to-separation in the maximum likelihood-based standard multinomial model and addressed the problems by applying the penalized likelihood-based Poisson regression for multinomial response. The simulation results showed that in the presence of complete or quasi-complete separation, the MLE often failed to achieve convergence and/or provided large or infinite estimates of the regression coefficient and Wald-confidence interval, which is not interpretable. In contrast, the PMLE showed great improvement by achieving convergence and providing finite estimate with very small amount of bias and MSE for the corresponding regression coefficient. In the presence of near-to-separation, the standard MLE provided large amount of bias and MSE in the estimate of the regression coefficient associated with the covariate that creates near-to-separation. The amount of bias and MSE were reported to be comparatively high for small sample, large true regression coefficient, the uneven distribution of the event and non-event of the binary covariate. Because for each of these cases, the chance of such near-to-separation was high. For all these scenarios, the PMLE showed improvement over MLE by reducing bias and MSE to some extent and providing narrower confidence interval and accurate coverage. Even for large sample with rare outcome category, the MLE still provided large amount of bias and MSE while the PMLE offered improvement to some extent.
In contrast, the amount of bias and MSE for both the MLE and PMLE for the regression coefficient associated with the continuous covariate that is not responsible for creating any kind of separation were comparatively lower than those associated with the binary covariate. Likewise, the PMLE reported smaller amount of bias and MSE than those in the MLE when sample size was small.
This study also extended the penalized estimation method to analyze data from a complex survey by incorporating individual survey-weight (inverse probability weighting) and provided an application of the methods for analyzing health seeking behavior (choice of health-care facilities for childhood illness) data with evidence of separation or near-toseparation. The results supported the simulation findings. All the findings of this paper are similar to those reported in Ref. [4], which investigated finite sample bias in both Firth's type penalized estimate and MLE for the multinomial logistic regression. However, they ignored the issues related to near-to-separation in the multinomial outcome, which may occur frequently even for moderate-to large sample with rare outcome category.
Finally, the PMLE for the multinomial logit model via Poisson regression framework showed significant improvement over the standard MLE-based multinomial logit model in the presence of separation or near-to-separation created by a covariate. The chance of separation or near-to-separation is reported to be high when sample size is small or even large but the prevalence of one or more outcome category is low or there is a number of influential covariates or combination of them, and the problems due to such separation is non-negligible for multinomial response. Based on the findings of the study, we recommend to use PMLE if sample size is small (n ≤ 50). Again, if the sample size is large, we suggest to explore the data to identify any evidence of separation or near-to-separation, and if exists, recommend to use PMLE. This study explored the use of penalized methods to analyze the sparse multinomial data from a complex survey, however, further study is required to develop appropriate goodness-of-fit methods aligned with penalized maximum likelihood approach. Residualbased goodness-of-fit assessments discussed in some recent studies [13,14] may be an alternative choice in addition to traditional approaches such as Akaike information criteria, Hosmer-Lameshow test and likelihood ratio test.