Robust Inference in Generalized Linear Models

Robust inference on the parameters of generalized linear models is performed using the weighted likelihood method. Two cases are considered: a case with replicated observations and a case with a single observation of the dependent variable for each combination of the explanatory variables. The first case is common in the design of experiments, while the second case arises in observational studies. Theoretical results and computational results on real datasets are presented and compared with other existing techniques.


Introduction
Generalized linear models (GLMs) are commonly used in many different fields, and inference is usually based on the likelihood function and related quantities. The likelihood method has good asymptotic properties when the true model generating the data coincides with the hypothesized model; however, it leads to unreliable results when the two models differ. Robust procedures have been proposed by several authors in order to obtain more stable inference. M-estimation for binary response variables was first introduced by Pregibon (1982), Stefanski et al. (1986), and Copas (1988) (see also Carroll and Pederson, 1993; Künsch et al., 1989). A similar approach was used by Bianco and Yohai (1996) to obtain a consistent and more robust estimator, which Croux and Haesbroeck (2003) further improved with a fast implementation scheme. M-estimators based on quasi-likelihood were introduced by Cantoni and Ronchetti (2001). Croux et al. (2002) studied the behavior of maximum likelihood estimators in the presence of outliers, while Rousseeuw and Christmann (2003) provided an approach for the logistic regression model. Markatou et al. (1997) introduced the weighted likelihood approach for discrete models with particular attention to the logistic regression case, and Markatou et al. (1998) extended the method to continuous models. Hosseinian (2009) studied a different form of weighted likelihood and applied it to binary and Poisson regression. Ruckstuhl and Welsh (2001) compared minimum distance methods with other robust methods. Müller and Neykov (2003) used the trimmed likelihood for generalized linear models. Most of these methods are designed for logistic and Poisson regression and cannot be easily extended to other generalized linear models.
In this article, we investigate the use of the weighted likelihood method to obtain robust estimation, inference, and model selection for generalized linear models with distributions in the regular exponential family. As in Markatou et al. (1998), outliers are "observations that are highly unlikely to occur under the hypothesized model." This concept differs from the usual "geometric" distance-based notion of outlier because it depends on a "probabilistic" measure.
In the context of regression, the above type of outliers is often called vertical outliers since they affect the response variable, while horizontal outliers affect the explanatory variables and produce leverage points. We show through Monte Carlo experiments that the proposed procedure is robust against vertical outliers. The effect of horizontal outliers is not investigated here.
Even though there exist some studies on the weighted likelihood method for replicated observations, to the best of our knowledge, there is no prior literature on weighted likelihood methods in GLM for few or no replicated observations. The new method utilizes Anscombe residuals and is implemented in the R package wle available on CRAN.
The article is organized as follows. Section 2 discusses the estimation procedure, with particular emphasis on the use of Anscombe residuals. Section 3 is devoted to inference, analysis of variance decomposition, and model selection. Section 4 provides an example based on a real dataset. Some computational results obtained from Monte Carlo simulations for Poisson, gamma, and inverse Gaussian models are reported in Section 5. Finally, Section 6 contains conclusions and remarks. Supplemental material is available at www.dst.unive.it/∼claudio/uploads/LSSP 2013 0307.

Robust Estimation in Generalized Linear Models
The weighted likelihood estimates (WLE; Markatou et al., 1998) are obtained as solutions of weighted likelihood estimating equations (WLEE), a modified form of the maximum likelihood estimating equations in which each score contribution is associated with a weight that may depend on the unknown parameters. Lindsay (1994) shows that, for discrete models and certain types of distance/divergence measures between distributions, the corresponding estimating equations can be written as WLEE with a particular form of the weight function. Markatou et al. (1998) extended this approach to continuous models. Hereafter, we review the construction of the weights for a random sample y_1, …, y_n with a common unknown parameter θ. Later, we consider the extension to the generalized linear model, which has a sample of independent vectors (y_1, x_1), …, (y_n, x_n), where each response y_i has an expectation that depends on the explanatory variables x_i.
The weights are defined as

w(y; θ, f*) = min{ 1, [A(δ(y; θ, f*)) + 1]^+ / (δ(y; θ, f*) + 1) },

where [·]^+ denotes the positive part, δ(·) is the Pearson residual function, and A(δ) is the Residual Adjustment Function (RAF; Lindsay, 1994). Hence, the WLE is the solution of

Σ_{i=1}^n w(y_i; θ, f*) u(y_i; θ) = 0,

where u(y; θ) is the classical score function of the hypothesized model. Whenever possible we use the simpler notation w_i(θ) for w(y_i; θ, f*). The Pearson residual function

δ(y; θ, f*) = f*(y) / m*(y; θ) − 1

measures the agreement between the hypothesized model and the empirical distribution of the observations, where f*(y) = ∫ k(y; t, h) dF̂_n(t) is a nonparametric density estimate based on the data, k(y; t, h) is a kernel with bandwidth h, and F̂_n is the empirical distribution function of the sample. Here m*(y; θ) = ∫ k(y; t, h) dM(t; θ) is a smoothed version of the model density for a continuous random variable; for a discrete model the smoothing is not needed. A discussion of bandwidth selection is presented in the Appendix.
The residual adjustment function operates on the Pearson residuals as the Huber ψ-function operates on the structural residuals. When A(δ) = δ, the weights are w_i(θ) ≡ 1 and the weighted likelihood estimating equations reduce to the maximum likelihood estimating equations. The literature contains several proposals for the RAF; here we recall two families. The RAF based on the Power Divergence Measure (PDM; Cressie and Read, 1984; Read and Cressie, 1988), introduced by Lindsay (1994), is

A_PDM(δ) = τ [ (δ + 1)^{1/τ} − 1 ].

Special cases are maximum likelihood (τ = 1), Hellinger distance (τ = 2), Kullback-Leibler divergence (τ → ∞), and Neyman's chi-square (τ = −1). The RAF based on the Generalized Kullback-Leibler divergence (GKL), introduced in Park and Basu (2003), is

A_GKL(δ) = log(τ δ + 1) / τ.

Special cases are maximum likelihood (τ → 0) and Kullback-Leibler divergence (τ = 1). This RAF can be interpreted as a linear combination of the likelihood divergence and the Kullback-Leibler divergence. We use this family in all subsequent examples.
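The GKL residual adjustment function and the resulting downweighting rule can be sketched in Python as follows. This is a minimal illustration, not the wle package implementation; the function names are ours, and we assume the standard weight construction of Markatou et al. (1998), w = min{1, [A(δ) + 1]^+ / (δ + 1)}:

```python
import math

def raf_gkl(delta, tau=0.25):
    """RAF based on the Generalized Kullback-Leibler divergence
    (Park and Basu, 2003): A(delta) = log(tau * delta + 1) / tau.
    As tau -> 0 it approaches A(delta) = delta (maximum likelihood)."""
    return math.log(tau * delta + 1.0) / tau

def wle_weight(delta, tau=0.25):
    """Weight attached to a score contribution:
    w = min{1, [A(delta) + 1]^+ / (delta + 1)}.
    delta = 0 (perfect agreement with the model) gives weight 1;
    large Pearson residuals are heavily downweighted."""
    num = max(raf_gkl(delta, tau) + 1.0, 0.0)   # positive part
    return min(1.0, num / (delta + 1.0))
```

Observations with Pearson residual close to zero receive full weight, while observations in low-probability regions of the model receive weights near zero.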
As an illustration, Fig. 1 displays, for the gamma family with shape s and rate r and density m(y; s, r), the Pearson residual function evaluated at (s, r) = (24, 8) (central panel), the contaminated distribution f(y) = 0.95 m(y; 24, 8) + 0.05 m(y; 48, 32) (left panel), and the weight function using the generalized Kullback-Leibler RAF with parameter τ = 0.25 (right panel). The figure shows that the interval with high probability under the contaminating distribution (dotted line) is well identified by the Pearson residual function, and the observations in that interval are heavily downweighted.
Example 2.1 (Movimprese dataset). The weighted likelihood estimate λ̂_w for the parameter λ of a Poisson distribution is the solution of the fixed-point equation

λ = Σ_{i=1}^n w_i(λ) y_i / Σ_{i=1}^n w_i(λ).

The left panel of Fig. 2 shows the conditional distribution of SPISC for the Udine province, while the right panel displays the final weights. It is evident from the figure that there are unusual observations in the right tail of the distribution, incompatible with the Poisson model for the bulk of the data, and these observations are heavily downweighted by the WLE procedure. Notice that removing the observations with weight smaller than 0.1, which could be considered outliers, reduces the MLE to 5.347.
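The fixed-point iteration for the Poisson case can be sketched as follows. This is a simplified illustration, not the wle package implementation: we start from the sample median rather than from the subsampling scheme described later, use the GKL RAF with τ = 0.25, and exploit the fact that no kernel smoothing is needed for discrete models:

```python
import math

def log_poisson_pmf(y, lam):
    """Log of the Poisson probability mass function."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

def wle_poisson(y, tau=0.25, n_iter=200):
    """Weighted likelihood estimate of a Poisson mean via the fixed-point
    equation lambda = sum_i w_i(lambda) y_i / sum_i w_i(lambda)."""
    n = len(y)
    # empirical pmf f*(y); for discrete models no smoothing is needed
    emp = {}
    for v in y:
        emp[v] = emp.get(v, 0.0) + 1.0 / n
    lam = float(sorted(y)[n // 2])    # robust starting value (median)
    for _ in range(n_iter):
        ws = []
        for v in y:
            # Pearson residual: empirical pmf vs hypothesized Poisson pmf
            delta = emp[v] / math.exp(log_poisson_pmf(v, lam)) - 1.0
            a = math.log(tau * delta + 1.0) / tau          # GKL RAF
            ws.append(min(1.0, max(a + 1.0, 0.0) / (delta + 1.0)))
        lam = sum(w * v for w, v in zip(ws, y)) / sum(ws)
    return lam
```

On a sample whose bulk is Poisson-like with a few large counts, the iteration settles near the mean of the bulk, while the plain MLE (the sample mean) is pulled toward the outliers.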
In a generalized linear model (McCullagh and Nelder, 1989), a response y has density m(y; θ) ∝ exp((yθ − b(θ))/φ + c(y, φ)), where θ is the natural parameter and φ is the dispersion parameter. Let μ = E(y) = b'(θ) be the mean and Var(y) = V(μ)φ the variance of y, where V(μ) = b''(θ) is the variance function. The mean μ depends on the linear predictor η = x^t β through the monotone link function g, with η = g(μ). Here x is the p-vector of explanatory variables and β is a p-vector of parameters.
Given n independent responses y = (y_1, y_2, …, y_n) and the corresponding n × p design matrix X, the log-likelihood is ℓ(β) = Σ_{i=1}^n [(y_i θ_i − b(θ_i))/φ_i + c(y_i, φ_i)]. By differentiation, we obtain the score function ∂ℓ/∂β = (∂η^t/∂β)(∂θ/∂η^t)(∂ℓ/∂θ) = X^t u(β), where the design matrix X = ∂η/∂β^t does not depend on β. Since the score equation X^t u(β) = 0 is nonlinear, an iteratively reweighted least squares (IRLS) algorithm is often used, in which the adjusted dependent variable z = Xβ + g'(μ)(y − μ) is regressed on the columns of X using the diagonal weight matrix Ω(β) with elements ω_i(β) = [φ_i V(μ_i) g'(μ_i)^2]^{-1}. The maximum likelihood estimates are obtained by repeating the IRLS step until convergence is achieved.
The weighted likelihood estimating equations are obtained by attaching a diagonal weight matrix W(β), with elements w_i(β) based on the Pearson residuals, to the score contributions; that is, the weighted likelihood estimates are the solutions of X^t W(β) u(β) = 0. The solutions can be obtained by a two-step algorithm. The first step evaluates the matrix W(β); the second step is the IRLS algorithm with W = W(β) kept fixed:

β_new = (X^t Ω^{1/2} W Ω^{1/2} X)^{-1} X^t Ω^{1/2} W (Ω^{1/2} X β + Ω^{-1/2} u(β)),

and, since both W and Ω are diagonal matrices, a simpler expression is

β_new = (X^t Ω W X)^{-1} X^t Ω W z.

Thus, the weight matrix Ω(β) used in the IRLS algorithm for the MLE is replaced by a new diagonal weight matrix with elements ω_i(β) w_i(β). The IRLS algorithm is iterated until convergence is attained; then the matrix W(β) is updated and inserted in the IRLS. The two-step process terminates when the elements of W(β) remain unchanged.
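The classical IRLS step can be sketched for a Poisson regression with log link (a minimal illustration with names of our choosing; the weighted likelihood version simply multiplies the IRLS weights ω_i by the fixed downweights w_i(β) inside the inner loop):

```python
import numpy as np

def irls_poisson(X, y, n_iter=50):
    """IRLS for a Poisson GLM with log link.  Each step regresses the
    adjusted dependent variable z = X beta + g'(mu) (y - mu) on the
    columns of X with diagonal weights omega_i = 1 / (V(mu_i) g'(mu_i)^2);
    for the log link g'(mu) = 1/mu and V(mu) = mu, so omega_i = mu_i."""
    mu = y + 0.5                       # crude starting values
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = eta + (y - mu) / mu        # adjusted dependent variable
        WX = X * mu[:, None]           # Omega X with Omega = diag(mu)
        beta = np.linalg.solve(X.T @ WX, WX.T @ z)
        eta = X @ beta
        mu = np.exp(eta)
    return beta
```

At convergence the score equation X^t (y − μ) = 0 (canonical link) is satisfied, which gives a simple numerical check of the fit.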
To clarify the definition of the weights, we now use a different notation. Let Y_j ∼ m(·; θ_j(β), φ_j), j = 1, …, J, be J independent random variables and let y_j = (y_{j,1}, y_{j,2}, …, y_{j,n_j}) be n_j replicates of Y_j, so that y = (y_1, y_2, …, y_J). The weight w_{jk}(β) = w(y_{jk}; θ_j(β), φ_j, f*_j) is based on the Pearson residual function evaluated at the conditional distribution of Y_j, that is,

δ_{jk} = f*_j(y_{jk}) / m*_j(y_{jk}; θ_j(β)) − 1,

where f*_j(·) is a nonparametric estimate of the conditional density (probability mass function in the case of discrete models) based on the n_j replicates at level j. If X is a fixed (balanced) design matrix with a small number J of distinct values of (θ, φ) compared to the sample size n, then we have an adequate number of observations n_j to estimate all the distinct conditional distributions f*_j(·). In practice, problems arise when the sample size n is small relative to J: in a balanced design this implies small values of n_j for all j, and in an unbalanced design some n_j might be very small. In either case, "good" nonparametric estimates f*_j(·) may not be available for some or all conditional distributions. A naive solution is to use the observations corresponding to (θ, φ) values in a neighborhood of (θ_j, φ_j) in order to estimate f*_j(·) at y_{jk}. We adopt a more general solution using the Anscombe residuals r_{jk} (j = 1, …, J; k = 1, …, n_j), because their asymptotic distribution under the hypothesized model is as close to normal as possible. The Anscombe residuals were introduced in Cox and Snell (1968) and Loynes (1969) (see also Brant, 1987; Pierce and Schafer, 1986) and are obtained by the transformation R(y) = ∫^y V(μ)^{-1/3} dμ applied to both y and μ, adjusted for the scale R'(μ) V(μ)^{1/2}, that is,

A(y; μ) = (R(y) − R(μ)) / (R'(μ) V(μ)^{1/2}).

Anscombe residuals were also considered in McCullagh and Nelder (1989, sec. 2.4.2).
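The resulting residuals for the three families used later in the article (Poisson, gamma, inverse Gaussian) can be sketched as follows, for unit dispersion; the formulas match those quoted in Section 5, while the function names are ours:

```python
import math

def anscombe_poisson(y, mu):
    """V(mu) = mu, so R(y) = (3/2) y^(2/3); after scaling:
    3 (y^(2/3) - mu^(2/3)) / (2 mu^(1/6))."""
    return 3.0 * (y ** (2 / 3) - mu ** (2 / 3)) / (2.0 * mu ** (1 / 6))

def anscombe_gamma(y, mu):
    """V(mu) = mu^2, so R(y) = 3 y^(1/3); after scaling:
    3 (y^(1/3) - mu^(1/3)) / mu^(1/3)."""
    return 3.0 * (y ** (1 / 3) - mu ** (1 / 3)) / mu ** (1 / 3)

def anscombe_inverse_gaussian(y, mu):
    """V(mu) = mu^3, so R(y) = log(y); after scaling:
    (log(y) - log(mu)) / sqrt(mu)."""
    return (math.log(y) - math.log(mu)) / math.sqrt(mu)
```

In each case the residual vanishes at y = μ and, under the hypothesized model, is approximately standard normal.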
Using the Anscombe residuals allows us to estimate a single nonparametric distribution, common to all residuals regardless of the level j. Let r_{jk} = A(y_{jk}; μ_j(β), φ_j) be the Anscombe residuals and F̂_n(r) their empirical distribution function. We let f̂(r) = ∫ k(r; t, h) dF̂_n(t) be the nonparametric density estimate. An estimate of the smoothed model m̃*(·) of the Anscombe residual distribution is obtained by simulation, as follows. Let p̂_j = n_j / n be the proportion of observations at level j.
In each iteration of the two-step algorithm, we sample v observations (z_1, …, z_v) from the mixture model Σ_{j=1}^J p̂_j m(z; θ̂_j(β), φ̂_j) and calculate their Anscombe residuals (r̃_1, …, r̃_v), with empirical distribution function F̂_v(·). A consistent estimate of the smoothed model is then m̃*(r) = ∫ k(r; t, h) dF̂_v(t). Comparing f̂(r) with m̃*(r) provides the Pearson residuals

δ(r) = f̂(r) / m̃*(r) − 1.

The calculations can often be simplified, since in most situations the distribution m̃(·) of the Anscombe residuals is well approximated by the standard normal distribution; see the supplementary material and Section 4 for empirical evidence of this fact. In most generalized linear models, the φ_i are unknown; in some cases φ_i = a_i φ with known constants a_i. For unknown φ, a robust estimate can be obtained using a weighted likelihood estimating equation, with weight matrix W(β) as defined above and score function determined by the model. Alternatively, a robust estimate of the variance Var(y) can be defined, leading to the robust method-of-moments estimate

φ̂ = Σ_{i=1}^n w_i(β̂) (y_i − μ̂_i)^2 / V(μ̂_i) / Σ_{i=1}^n w_i(β̂),   (1)

where β̂ and μ̂_i are the WLE estimates of the corresponding quantities. In the gamma family, practical experience suggests that the estimate based on weighted likelihood is more stable than the weighted moment estimate (1), while the opposite holds for the inverse Gaussian family. In most cases, the score function for β does not depend on φ; however, the weight matrix W(β) does depend on φ, hence an iterated algorithm must be used. Starting values for the algorithm can be obtained by subsampling: a subsample of small size is drawn from the original dataset, maximum likelihood estimates are computed on it, and these are used as initial values in the weights for the original dataset. The subsampling is repeated several times, searching for all possible roots of the weighted likelihood estimating equations.
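When the standard-normal approximation for the Anscombe residual distribution is adopted, the Pearson residuals on the residual scale can be sketched as below. The names are ours; the kernel estimate f̂ is compared with the model density smoothed by the same kernel, i.e. a normal density with standard deviation sqrt(1 + h²), since convolving a standard normal with a Gaussian kernel of bandwidth h inflates the variance to 1 + h²:

```python
import math

def gauss(x, s=1.0):
    """Normal density with mean 0 and standard deviation s."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def pearson_residual_from_anscombe(r0, residuals, h=0.3):
    """delta(r0) = f_hat(r0) / m*(r0) - 1, where f_hat is a Gaussian
    kernel density estimate of the Anscombe residuals with bandwidth h,
    and the smoothed model m* is N(0, 1 + h^2) under the standard-normal
    approximation for the residual distribution."""
    f_hat = sum(gauss(r0 - r, h) for r in residuals) / len(residuals)
    m_star = gauss(r0, math.sqrt(1.0 + h * h))
    return f_hat / m_star - 1.0
```

Residual values where the data put mass but the model does not yield very large δ (and hence near-zero weights), while empty regions give δ close to −1.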
To find subsamples without outliers, the size of the subsamples should be as small as possible, subject to the number of parameters to be estimated and to the stability of the algorithm. In practice, 20 subsamples of size equal to the number of parameters plus 2 prove to be effective.
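The subsampling scheme for starting values can be sketched as follows (index generation only, with a hypothetical helper name; the MLE fit on each subsample is omitted):

```python
import random

def subsample_starts(n, n_params, n_subs=20, seed=0):
    """Draw n_subs index subsets of size n_params + 2 without replacement;
    an MLE fitted to each subset provides one candidate starting value
    for the weighted likelihood estimating equations."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), n_params + 2)) for _ in range(n_subs)]
```

Each candidate start is then iterated to convergence, and the search over subsamples is what allows the detection of multiple roots of the WLEE.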
Anscombe residuals can be avoided in discrete models when all the n_j are sufficiently large. In this case, the roots of the WLEE coincide with the stationary points of the related minimum distance/disparity approach, as shown in Markatou et al. (1997); hence, the appropriate solution can be identified as the root that minimizes the distance/disparity. In all other situations, the relation with the minimum distance/disparity approach is lost, and the method in Agostinelli (2006) can be applied to choose the appropriate root.
Asymptotic and robustness properties of the WLEE are studied in Markatou et al. (1997, 1998). Under the assumption that the sample size for each distinct value of (θ, φ) tends to infinity (n_j → ∞ for all j), so that the Anscombe residuals are not needed, the asymptotic and robustness properties for GLMs in the exponential family can be obtained directly from Markatou et al. (1997, 1998). In particular, in the absence of contamination the estimators β̂ are consistent and asymptotically normally distributed with variance matrix (X^t Ω(β) X)^{-1}, which leads to a first-order efficient estimator. Under contamination the distribution is not derived theoretically, but Monte Carlo simulations show that the distribution of β̂ can be well approximated by a normal distribution with variance matrix estimated by (X^t Ω(β̂) W(β̂) X)^{-1}. We refer to Markatou et al. (1997, 1998) for the theoretical details. If the use of the Anscombe residuals is unavoidable, that is, if at least some of the n_j do not go to infinity, or the number of distinct values of (θ, φ) tends to infinity at the same speed as the sample size, then the asymptotic and robustness properties have yet to be established. In the Appendix, Theorem A.1, we prove that the Pearson residuals based on Anscombe residuals converge to zero uniformly almost surely under the assumed model. This is sufficient to establish consistency and asymptotic normality of the estimators, following the results in Markatou et al. (1997, 1998). Robustness properties are studied through Monte Carlo simulation; see Section 5 and the supplemental material for complete results. Agostinelli and Markatou (2001) introduced robust alternatives to the classical tests: the Wald test, Wilks' likelihood ratio test, and Rao's score test. Under no contamination, the robust tests are asymptotically equivalent to the corresponding classical tests.
In the presence of contamination, the level and power breakdown points (He et al., 1990) of the robust tests are equal to the breakdown point of the corresponding weighted likelihood point estimators. These tests can be extended to the generalized linear model framework. Consider the system of hypotheses H_0: β ∈ B_0 versus H_1: β ∈ B \ B_0, where B_0 is a subspace of the full parametric space B. The weighted Wald-type statistic T_w is based on the weighted likelihood estimate β̂_0 obtained inside the parametric space restricted to the null hypothesis (β_0 ∈ B_0). Notice that the weights w_i(β̂) are evaluated at the estimate of β in the full parametric space B: it is easy to show (Agostinelli and Markatou, 2001) that weights based on the restricted space B_0 can lead to inconsistent tests even when there is no contamination. Under the null hypothesis, the T_w test is approximately distributed as χ²_{p−q}, where q is the dimension of the space B_0. The weighted log-likelihood ratio test for the above system of hypotheses is

Λ_w = 2 Σ_{i=1}^n w_i(β̂) [ℓ(y_i; β̂) − ℓ(y_i; β̂_0)],

which has an approximate χ²_{p−q} distribution under the null hypothesis and is based on a weighted form of the deviance. Following Agostinelli (2002), we define a weighted Akaike Information Criterion to perform robust model selection between nested models,

WAIC_α = −2 Σ_{i=1}^n w_i(β̂) ℓ(y_i; β̂_0) + α q,

where β̂_0 is defined as above and the p − q remaining coefficients are treated as zeros. The constant α is equal to 2, while setting α = ln(n) gives a weighted Bayesian Information Criterion. It is necessary to use weights evaluated at the full model (β ∈ B) to ensure asymptotic consistency of the method and to preserve the relation between the classical log-likelihood ratio test and the Akaike Information Criterion in their weighted forms. Let A and B denote two nested models included in the full model. A weighted deviance decomposition is defined as follows. Let η̂ be the estimate of η under the saturated model, that is, the model without constraints on the η's. When β has dimension n and the map between β and η is one-to-one, the maximization of the weighted likelihood (using weights fixed at the full model) does not depend on the weights. Then the scaled deviance for model A is defined as the weighted log-likelihood difference between the saturated model and model A, where β̂_A and the corresponding η̂_A, φ̂_{A,i} are obtained by maximizing the weighted likelihood Σ_{i=1}^n w_i(β̂) ℓ(y_i; θ_i(β), φ_i), with weights fixed at the full model.
A deviance table can be constructed from the deviance differences, which have an approximate χ² distribution with degrees of freedom equal to the difference in degrees of freedom between the two models B and A. When φ is known, a goodness-of-fit measure is the weighted Pearson statistic

P_w = Σ_{i=1}^n w_i(β̂) (y_i − μ̂_i)² / V(μ̂_i),

where μ̂_i are the fitted mean values.
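The weighted Pearson statistic can be sketched as follows (var_fun is the variance function V(μ) of the assumed family; the names are ours):

```python
def weighted_pearson_stat(y, mu, w, var_fun):
    """Weighted Pearson goodness-of-fit statistic:
    P_w = sum_i w_i * (y_i - mu_i)^2 / V(mu_i)."""
    return sum(wi * (yi - mi) ** 2 / var_fun(mi)
               for yi, mi, wi in zip(y, mu, w))
```

With all weights equal to 1 this reduces to the classical Pearson chi-square statistic; downweighted observations contribute proportionally less to the lack-of-fit measure.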

Example: Number of New Enterprises in Friuli Venezia Giulia Region
We illustrate our robust procedures using the dataset of Example 2.1. Since 1982, Infocamere (an IT consortium of the Italian Chambers of Commerce, http://www.infocamere.it) has been providing, on a regular basis, information (the Movimprese database) about the demography of Italian companies. For every quarter of the year, they report the birth/death counts of the companies registered in the Chambers of Commerce, stratified by economic sector, legal form, and region. We consider the data of the agricultural sector in the Friuli Venezia Giulia region during the period 1995-2009. In 1995, the database was transferred to a new platform and this led to several problems. In particular, many old companies were recorded as new ones, and this introduced anomalous counts, as displayed in Fig. 2. Table 2 reports the estimates (Est.), the estimated standard errors (S.E.), and the p-values (P-val.) for three different methods: maximum likelihood (MLE), using the function glm in the R package stats; the Cantoni and Ronchetti robust procedure (ROB), using the function glmrob in the R package robustbase; and the weighted likelihood (WLE), using the function wle.glm available in the R package wle. The SPREG variable is statistically insignificant for both MLE and ROB, while it is statistically significant for WLE. The second trimester (TRIM2) is not significant for either ROB or WLE. The Gorizia province (PROVGO) is not significant for WLE, while it is highly significant for both MLE and ROB. We also notice that the sign of the fourth-trimester coefficient (TRIM4) in the robust procedures is opposite to that of the MLE.
Table 3 displays the results of three different model selection procedures applied to the eight sub-models: the Akaike Information Criterion (AIC), the Weighted Akaike Information Criterion (WAIC), and the Weighted Akaike Information Criterion with weights based on Anscombe residuals (WAIC_A). The table is ordered by WAIC, which provides the same ranking as WAIC_A. The best model according to WAIC is the full model, while AIC drops the SPREG variable.
For further investigation, an analysis of deviance is performed and the results are shown in Table 4. Both MLE (based on Pearson's statistic) and ROB (based on the quasi-deviance decomposition) give large p-values for the SPREG variable; however, the weighted Pearson statistic based on WLE is highly significant. All methods agree on the importance of the TRIM variable.

Monte Carlo Simulation
In this section, we present partial results of extensive Monte Carlo simulation experiments; complete results are available in the supplemental material. The experiments involve three models: Poisson, gamma, and inverse Gaussian. The main design factors are the sample size (70, 140, 280, 560) and the level of contamination (0%, 5%, 10%). Only vertical outliers are investigated; leverage points are not considered here. In our experiments, we compare the performance of three methods: the maximum likelihood estimate (MLE), the Cantoni and Ronchetti robust estimate (ROB, when available), and the proposed weighted likelihood estimate (WLE). For each combination of model and design factors, we run 5,000 replications using the R software (R Development Core Team, 2011).
In our simulation for the Poisson family, Y ∼ Poisson(λ(x^t β)) with log link, we define four independent explanatory variables X_i, i = 1, …, 4, uniformly distributed on the discrete set {−3, −2, −1, 0, 1, 2, 3}. The vector of parameters β is set to {−0.25, −0.25, 0, 0} with intercept β_0 = 2; the dispersion parameter φ is known and set to 1. The Anscombe residual (Cox and Snell, 1968) for the Poisson model is 3(y^{2/3} − μ^{2/3}) / (2 μ^{1/6}). Contaminated observations are generated as vertical outliers. For MLE and ROB we employ, respectively, the functions glm (stats package) and glmrob (robustbase package) with default values. WLE is implemented in the function wle.glm (wle package) using the Generalized Kullback-Leibler (GKL) residual adjustment function with parameter τ. Three settings are considered: (i) τ = 0.05 (WLE-0.05); (ii) τ = 0.1 (WLE-0.1); and (iii) weights w_i(β) obtained from the Anscombe residuals with smoothing parameter h = 0.031 and τ = 0.05 (WLE-0.05-Asy). We first perform our simulations for discrete explanatory variables and describe the results in terms of bias. With 0% contamination, MLE, ROB, and WLE-0.05-Asy are unbiased for all sample sizes. WLE-0.1 and WLE-0.05 are unbiased for β_3 and β_4, but exhibit slight bias for β_1 and β_2 and substantial bias for the intercept. Under contamination, MLE breaks down, while ROB shows bias for β_1, …, β_4 with an almost unbiased intercept. WLE-0.1 and WLE-0.05 are stable; however, their biases for β_1 and β_2 increase slightly and the intercept is very biased. WLE-0.05-Asy performs well for all sample sizes and levels of contamination, with no bias for β_1, …, β_4 and only slight bias in the intercept. Figure 4 shows boxplots of the bias distributions for 10% contamination and sample size 140, which summarize well the performance of the methods.
In Table 5, we report the actual significance level of a Wald-type test of the null hypothesis β_3 = 0 against the alternative β_3 ≠ 0 when the nominal significance level is fixed at 5%. The MLE is useless in the presence of contamination for all sample sizes; the ROB method behaves relatively well for low contamination levels but becomes quite distorted for moderate contamination levels, with the distortion increasing with the sample size; the WLE in both versions shows high stability of the level for all sample sizes. A very similar pattern is observed for the other nominal levels we investigated (0.01, 0.025, 0.1). WLE-0.05 is slightly conservative, while WLE-0.05-Asy is slightly liberal.
In Fig. 5, we report the estimated distribution of the p-values for testing the null hypothesis H_0: β_3 = 0 and β_4 = 0 against the alternative H_1: β_3 ≠ 0 and/or β_4 ≠ 0, using different deviance tests, for sample size 140. Since we sampled from the null hypothesis, we expect a uniform distribution of the p-values. The density of the p-values is estimated using the Gaussian copula-based density estimator introduced in Jones and Henderson (2007), with bandwidth equal to 0.3 in all cases. While the performance of the methods is good and equivalent in the absence of contamination, the behavior is very different in the presence of contamination: the density of the p-values corresponding to MLE is concentrated in a right neighborhood of zero (not displayed in the plot), and the density of the p-values for the robust quasi-deviance test (ROB method) is highly nonuniform. In contrast, WLE-0.05 and WLE-0.05-Asy have almost uniform p-value distributions, with WLE-0.05-Asy performing better when no contamination is present. Similar behavior is observed for all sample sizes.
In our simulation for the gamma family, Y ∼ gamma(μ(x^t β), φ) with log link, we define four independent explanatory variables X_i, i = 1, …, 4, uniformly distributed on the discrete set {1, 2, 3}. The vector of parameters β is set to {1, 1, 0, 0} with intercept β_0 = 1 and dispersion parameter φ = 1. For this experiment, we additionally consider sample sizes 1,120 and 2,240 and a 15% contamination level. The Anscombe residual for the gamma model (McCullagh and Nelder, 1989) is 3(y^{1/3} − μ^{1/3}) / μ^{1/3}. Outliers are generated by the following process: when μ < 200 the response variable is sampled from |N(3500, 25)|, while for μ ≥ 200 it is sampled from |N(0.1, 0.0001)|. For MLE and ROB we employ, respectively, the functions glm (stats package) and glmrob (robustbase package) with default values. WLE is implemented in the function wle.glm (wle package) using the Generalized Kullback-Leibler (GKL) residual adjustment function with parameter τ = 0.1. The following settings are used. The estimation of the dispersion parameter is based either on the weighted method-of-moments estimate (WLE-0.1-moment, see Eq. 1) or on the weighted likelihood estimator (WLE-0.1-likelihood), with weights based either on the model or on the Anscombe residuals. The folded standard normal distribution is used as kernel in the nonparametric kernel density estimators and in the evaluation of the smoothed model. For the usual weights, the smoothing parameter is set iteratively to k μ̂_i φ̂, where the constant k is set heuristically to 1.024, 0.512, 0.256, 0.128, 0.064, 0.032 (smooth) according to the sample size. We further consider constants equal to k/2 (smooth/2), again according to the sample size.
For the asymptotic weights, a standard normal kernel is used with bandwidth iteratively fixed to k σ̂, with k set to 0.031 (asmooth/2) or 0.062 (asmooth) regardless of the sample size, where σ̂ is the current estimate of the standard deviation; the smoothed model is a normal model with zero mean and standard deviation equal to (1 + b) σ̂.
For all sample sizes and no contamination, most of the considered methods are equivalent. The WLE based on likelihood and asymptotic Anscombe residuals (WLE 0.1, asmooth/2, likelihood) and the Cantoni and Ronchetti method (ROB) perform equivalently in most cases and have the best performance. The WLE based on likelihood and asymptotic Anscombe residuals with double bandwidth (WLE 0.1, asmooth, likelihood) shows a deterioration in the estimation of φ. The WLE variants based on the method of moments for the estimation of φ (WLE 0.1, smooth, moment and WLE 0.1, smooth/2, moment) always show a bias for this parameter (stable over contamination levels); apart from that, their performance is very good. The WLE based on likelihood with bandwidth smooth/2 most of the time has the worst performance, with large biases for the parameters β_1, β_2, and φ. For the Wald test, all methods perform well; however, the WLE variants based on the method of moments (WLE 0.1, smooth, moment and WLE 0.1, smooth/2, moment) show a reduced power as the sample size grows, which makes them useless for large sample sizes at all levels of contamination. In the ANOVA decomposition based on the chi-square distribution, the WLE based on likelihood and Anscombe residuals (WLE 0.1, asmooth/2, likelihood) performs best for all sample sizes and contamination levels. The Cantoni and Ronchetti method (ROB) performs equally well only for small sample sizes: starting from sample size 280, its performance deteriorates significantly in the presence of contamination. The WLE based on likelihood (WLE 0.1, smooth, likelihood and WLE 0.1, smooth/2, likelihood) shows quite stable performance, but its p-value distribution is not uniform under contamination; similar behavior is observed for WLE 0.1, asmooth, likelihood. None of the variants based on the method of moments performs very well. Very similar results are obtained using an F-type test in the ANOVA.
In this case, ROB starts to deteriorate slightly later, depending on the contamination level: from sample size 560 at 15% contamination, 1,120 at 10%, and 2,240 at 5%.
In the third experiment, we consider the inverse Gaussian family, Y ∼ IG(μ(x^t β), φ) with link g(μ) = 1/μ². The predictors are four independent variables distributed as a discrete uniform on {1, 2, 3}. The vector of parameters β is set to (0.6, 0.6, 0, 0) with intercept β_0 = 0.01 and dispersion parameter φ = 0.01. In this experiment, we consider sample sizes 70, 140, 280, 560, and the outliers are generated from an inverse Gaussian distribution with β = (0, 0, 0, 0) and dispersion parameter equal to 10, with contamination levels 0%, 5%, 10%. The Anscombe residual for the inverse Gaussian model (McCullagh and Nelder, 1989) is (log(y) − log(μ)) / √μ. For MLE, the function glm (stats package) is used, while WLE is implemented in the function wle.glm (wle package) using the Generalized Kullback-Leibler (GKL) residual adjustment function. Settings similar to those used for the gamma family are considered here.
The best results for estimation and inference are achieved by the weighted likelihood approach with the weighted method of moments for the estimation of the dispersion parameter, regardless of the bandwidth (WLE 0.1, smooth/2, moment and WLE 0.1, smooth, moment). The performance of the WLE with the dispersion parameter estimated by weighted likelihood is not very good, especially for small sample sizes. Among these, the WLE with bandwidth smooth is the best, while the use of the smaller bandwidth (smooth/2) leads to worse results.
In the case of no contamination and for all sample sizes, all the studied versions of the WLE perform very well, apart from WLE 0.1, smooth/2, likelihood, for which a bias is observed in the estimation of β1 and β2. Under contamination, WLE 0.1, smooth/2, moment and WLE 0.1, smooth, moment perform best, while WLE 0.1, asmooth/2, moment and WLE 0.1, asmooth, moment show a loss of efficiency in the estimation of the β parameters and a moderate bias in the estimation of the dispersion parameter.
To study the behavior of the weighted Wald test, we evaluate the statistic for the β3 parameter. WLE 0.1, smooth/2, moment and WLE 0.1, smooth, moment show a level close to the nominal one, and their performance improves with the sample size. To investigate the power of the weighted Wald test, we test H0: β1 = 0 against H1: β1 ≠ 0. Again, WLE 0.1, smooth/2, moment and WLE 0.1, smooth, moment show very high power regardless of the level of contamination and the sample size.
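For reference, the Wald statistic used in these comparisons has the usual form W = (β̂_j / se(β̂_j))^2, compared against a χ²(1) quantile. A minimal sketch (our illustration, not the wle API; the estimate and standard error below are hypothetical inputs):

```python
import math

def wald_test(beta_hat, se):
    """Wald statistic W = (beta_hat/se)^2 and its chi-square(1) p-value.
    The chi2(1) survival function of z^2 equals the two-sided normal
    p-value of z, computed here via math.erfc."""
    z = beta_hat / se
    w = z * z
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w, p_value

# e.g. testing H0: beta3 = 0 with a hypothetical estimate and its s.e.
w, p = wald_test(0.05, 0.10)
```

In the weighted version, β̂_j and its standard error simply come from the weighted fit; the reference distribution is unchanged.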
When the likelihood ratio test is used for the analysis of variance (ANOVA), with both the Chi-squared test and the F-test, WLE 0.1, smooth/2, moment and WLE 0.1, smooth, moment behave stably, with a p-value distribution very close to the uniform distribution. The corresponding methods based on the Anscombe residuals (WLE 0.1, asmooth/2, moment and WLE 0.1, asmooth, moment) are useful only under moderate contamination (around 5%); for larger contamination they become unstable.

Conclusions
We have provided a general framework for robust estimation, inference, and model selection in generalized linear models (GLMs) for the exponential family. We have used and adapted the weighted likelihood (WLE) method to GLMs. This method is closely related to minimum distance/disparity methods and has the following main features: (i) outliers are defined on a "probabilistic" scale, rather than on a "geometrical" scale, avoiding the location-scale formulation; (ii) the WLE is a first-order asymptotically efficient estimator under the model with no contamination; and (iii) the WLE has demonstrated good robustness properties.
We have extended the method of Markatou et al. (1997, 1998) to generalized linear models by providing a complete robust inferential tool. We have introduced a modified version of the Pearson residual function based on the Anscombe residual, which is applicable to all types of datasets (with or without replicated observations) and to exponential family models. The Monte Carlo simulation results show the superior performance of our method compared with other robust approaches. Our method is implemented in the function wle.glm, made available on CRAN in the R package wle. This function provides estimates and inferences for the binomial, Poisson, gamma, and inverse Gaussian models.

A.1. Asymptotic Behavior of the Pearson Residuals Based on Anscombe Residuals
Theorem A.1. Assume that (i) the kernel k(z, t, h) is a bounded-variation density kernel and the smoothing parameter h is a positive constant; (ii) the observations (y_i, x_i^t), i = 1, ..., n, come from a GLM with distribution M(y, x^t; θ0, φ0), density m(y, x^t; θ0, φ0), and (θ0, φ0) belonging to the parameter space (Θ, Φ). Let r_i, i = 1, ..., n, be the Anscombe residuals evaluated at the observation (y_i, x_i^t) and parameter (θ0, φ0), and let G(r; θ0, φ0) be the distribution of the Anscombe residuals. Then the Pearson residuals based on the Anscombe residuals converge to zero almost surely as n → ∞ and v → ∞. Proof. Let f̂(r) = ∫ k(r, t, h) dF̂_n(t) be the nonparametric kernel density estimate based on the Anscombe residuals r_i, and let m̂(r) = ∫ k(r, t, h) dF̂_v(t), where F̂_v(t) is the empirical distribution based on a sample of size v from G(r; θ0, φ0). We need to show that W_n = sup_r |f̂(r) − m̂(r)| → 0 almost surely as n → ∞ and v → ∞. We have W_n = sup_r |∫ k(r, t, h) dF̂_n(t) − ∫ k(r, t, h) dF̂_v(t)| ≤ ξ sup_t |F̂_n(t) − F̂_v(t)| ≤ ξ [sup_t |F̂_n(t) − G(t; θ0, φ0)| + sup_t |F̂_v(t) − G(t; θ0, φ0)|], where ξ is the total variation of the kernel k. The result follows from the Glivenko–Cantelli lemma.
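The convergence asserted in Theorem A.1 can be checked numerically. The following is our own illustration, not part of the original proof: with standard normal "residuals" and a Gaussian kernel, the kernel density estimate f̂ should approach the exactly smoothed model density N(0, 1 + h²), so the sup-distance on a grid shrinks as n grows.

```python
import numpy as np

def sup_distance(n, h, rng):
    """Sup over a grid of |f_hat(r) - m*(r)|, where f_hat is a
    Gaussian-kernel density estimate with bandwidth h of n standard
    normal draws and m* is the smoothed model density N(0, 1 + h^2)."""
    r = rng.standard_normal(n)
    grid = np.linspace(-4.0, 4.0, 161)
    # f_hat(g) = (1/n) * sum_i phi_h(g - r_i)
    z = (grid[:, None] - r[None, :]) / h
    f_hat = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    s2 = 1.0 + h * h
    m_star = np.exp(-0.5 * grid**2 / s2) / np.sqrt(2 * np.pi * s2)
    return float(np.abs(f_hat - m_star).max())

rng = np.random.default_rng(0)
w_small = sup_distance(200, 0.5, rng)
w_large = sup_distance(20000, 0.5, rng)
```

Here both f̂ and m* use the same smoothing, mirroring the construction in the theorem where the model side is smoothed with the same kernel as the data side.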

A.2. Bandwidth Selection
When evaluating the weights for continuous models, a smoothing parameter must be chosen. In finite samples, this parameter has an impact on the robustness of the procedure. Markatou et al. (1998) suggest the following procedure to determine the bandwidth when the underlying model is normal and the kernel is normal as well. Consider the asymptotic form of the Pearson residual for a point-mass contamination at cσ with contamination level ε: δ_a(c) = ε { [(k + 1)/k]^{1/2} exp[c^2/(2(1 + k))] − 1 }.
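The normal-case expression is straightforward to evaluate. A small sketch (our illustration; the subsequent step of inserting δ_a(c) into the weight function is omitted here, since it depends on the chosen RAF):

```python
import math

def delta_normal(c, eps, k):
    """Asymptotic Pearson residual for point-mass contamination at c*sigma,
    normal model and normal kernel with variance k*sigma^2."""
    return eps * (math.sqrt((k + 1.0) / k)
                  * math.exp(c * c / (2.0 * (1.0 + k))) - 1.0)

# delta_a(c) decreases in k: a larger bandwidth smooths the outlier away less sharply
d1 = delta_normal(4.0, 0.1, 0.0031)
d2 = delta_normal(4.0, 0.1, 0.05)
```

Since δ_a(c) is monotone in k, solving "weight of δ_a(c) equals w_a" for k can be done with any one-dimensional root finder.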
Inserting δ_a(c) into the weight function and equating it to a prefixed value w_a yields k. Consequently, the bandwidth is set to h = kσ̂, where σ̂ is an estimate of the scale parameter, so that the bandwidth is updated at each iteration. Using this approach, we obtain k = 0.0031 for a weight w_a equal to 0.03, a level of contamination ε = 0.1, an outlier located at c = 4, the GKL RAF, and τ = 0.1. Notice that the value of k is independent of the parameters μ and σ of the normal distribution. This approach is used whenever Anscombe residuals are considered in our examples and simulations. For other continuous models, such as the gamma or the inverse Gaussian, similar results could be obtained; however, they depend on the unknown parameters. For a given distribution F taking values on the positive line, with location μ and scale σ, and kernel k(y, t, kσ), the asymptotic Pearson residual becomes δ_a(c) = ε [ k(cσ + μ, cσ + μ, kσ) / ∫ k(cσ + μ, t, kσ) dF_{μ,σ}(t) − 1 ], which can be evaluated numerically for given values of c, ε, μ, and σ; proceeding as above, we obtain a value for k. For instance, the gamma distribution with shape α = 3, scale β = 1, μ = αβ, σ = β√α, c = 6, ε = 0.1, and w_a = 0.1 leads to k = 0.024, while the inverse Gaussian distribution with location μ = 1, σ = μ^{3/2}, c = 5, ε = 0.1, and w_a = 0.1 leads to k = 0.039.
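The positive-line expression for δ_a(c) can be evaluated with simple numerical integration. The sketch below is our own illustration for the gamma example: the midpoint-rule discretization, the integration limit, and all names are our choices, and the mapping from δ_a(c) to k via the weight function is again left out as RAF-dependent.

```python
import math

def delta_positive(c, eps, mu, sigma, k, pdf, upper, m=20000):
    """Asymptotic Pearson residual for a positive-line model F with
    density pdf: outlier at y0 = mu + c*sigma, Gaussian kernel of
    scale k*sigma, delta = eps * (K(y0, y0) / int K(y0, t) dF(t) - 1).
    The integral is approximated by a midpoint rule on (0, upper)."""
    y0 = mu + c * sigma
    h = k * sigma  # kernel scale, as in the text
    def K(y, t):
        return math.exp(-0.5 * ((y - t) / h) ** 2) / (h * math.sqrt(2 * math.pi))
    step = upper / m
    denom = sum(K(y0, (i + 0.5) * step) * pdf((i + 0.5) * step)
                for i in range(m)) * step
    return eps * (K(y0, y0) / denom - 1.0)

def gamma_pdf(t, a=3.0):
    """Gamma(shape=3, scale=1) density, so mu = 3 and sigma = sqrt(3)."""
    return t ** (a - 1) * math.exp(-t) / math.gamma(a)

d_gam = delta_positive(6.0, 0.1, 3.0, math.sqrt(3.0), 0.024, gamma_pdf, 40.0)
```

The resulting δ_a(c) is large, as expected for an outlier six scale units into the gamma tail; a narrower kernel (smaller k) makes it larger still, which is the trade-off the bandwidth selection exploits.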