Robust estimation for non-homogeneous data and the selection of the optimal tuning parameter: the density power divergence approach

The density power divergence (DPD) measure, defined in terms of a single tuning parameter α, has proved to be a popular tool in the area of robust estimation [1]. Recently, Ghosh and Basu [5] rigorously established the asymptotic properties of the minimum density power divergence estimators (MDPDEs) in the case of independent non-homogeneous observations. In this paper, we present an extensive numerical study describing the performance of the method in the case of linear regression, the most common setup involving non-homogeneous data. In addition, we extend the existing methods for the selection of the optimal robustness tuning parameter from the case of independent and identically distributed (i.i.d.) data to the case of non-homogeneous observations. Proper selection of the tuning parameter is critical to the appropriateness of the resulting analysis. The selection of the optimal robustness tuning parameter is explored in the context of the linear regression problem through an extensive numerical study involving real and simulated data.


Introduction
Density-based minimum distance inference is an important component of parametric statistical inference. In minimum distance inference, the parameter of interest is estimated by minimizing a suitable measure of discrepancy between the data and an assumed parametric model. The discrepancy is quantified by a divergence measure between two density functions or two distribution functions. In this paper, we will concentrate on the density-based approach. Consider a random sample X_1, . . . , X_n of size n from a population whose true density function is g. Suppose we model g by a parametric family of densities {f_θ : θ ∈ Θ ⊆ ℝ^p}, and let ρ(f_1, f_2) be a density-based divergence measure between two probability densities f_1 and f_2 with respect to a common dominating measure. Then, the minimum distance estimator of the unknown parameter θ based on the divergence ρ(·, ·) is defined as the minimizer of ρ(ĝ_n, f_θ) with respect to θ over the parameter space Θ, where ĝ_n is some nonparametric estimate of the true density g based on the observed sample. In the case of discrete models the vector of relative frequencies provides a natural choice for ĝ_n, but in the case of continuous models one generally needs to use a kernel density estimator in place of ĝ_n. See Vajda [13], Pardo [10] and Basu et al. [2] for an elaborate description of the theory of minimum distance inference and several illustrative examples.
*Corresponding author. Email: ayanbasu@isical.ac.in. © 2015 Taylor & Francis.
It is important to emphasise that the measures of divergence used in minimum distance estimation are not necessarily mathematical metrics. They need not be symmetric in their arguments, and need not satisfy the triangle inequality. The only properties demanded are that the measure should be non-negative and should equal zero if and only if its two arguments are identically equal. We refer to all such divergences as 'statistical distances' or simply 'distances'.
Many of the minimum distance estimators based on density functions have strong robustness properties, which make them particularly useful in situations where model misspecifications are suspected or the data set contains outliers inconsistent with the model assumptions. One of the notable estimators in this connection is the minimum density power divergence estimator (MDPDE), which provides a natural downweighting of the outlying observations with a weight proportional to the density of the observation in relation to the assumed parametric model [1]. A particularly attractive feature of this method of estimation is that the objective function to be minimized can be constructed without taking recourse to any nonparametric density estimation technique like kernel density estimation or methods based on splines. As a result the method becomes substantially simpler, both in theory and practice, than minimum Hellinger distance estimation and related methods. Basu et al. [2] provide an extended description of the methods based on the DPD and other related minimum distance methods.
We now consider the situation where the observations are independent and share a common parameter in their distributions, but are not identically distributed; we refer to such a scenario as one involving non-homogeneous observations. To be applicable to the case of independent but non-homogeneous data, the minimum DPD method requires an extension of the theory existing in the case of independent and identically distributed (i.i.d.) data. Recently, Ghosh and Basu [5] have provided this extension. Part of the present paper deals with an extensive simulation study which illustrates the performance of the method in the case of linear regression. The results clearly bear out the desirable properties of the minimum DPD method and complement the theoretical findings of Ghosh and Basu [5].
The DPD family is indexed by a single tuning parameter α, which controls the trade-off between robustness and efficiency, with larger α indicating enhanced robustness. To be useful in practice, the minimum DPD proposal must be accompanied by a recommendation for the choice of an optimal tuning parameter. In many real situations, the choice of α has a significant effect on the value of the estimate. The present paper also deals with the selection of the optimal tuning parameter α in problems where the data are independent but non-homogeneous. In the process, we appropriately extend the techniques that have been proposed for the selection of the optimal tuning parameter in the case of i.i.d. data. We propose this as a general method that is not specific to the regression case; however, for the purpose of illustration we explore the optimal selection of the tuning parameter through extensive simulation studies and real data examples.
The rest of the paper is organized as follows. In Section 2, we briefly review the density power divergence (DPD) measure and discuss the application of this divergence in robust estimation, both in the case of i.i.d. and non-homogeneous data. In this section, we also develop the selection of the optimal tuning parameter in the case of non-homogeneous data and briefly discuss the mechanics of the method in the case of linear regression. Section 3 contains all the numerical illustrations of the paper which includes simulation studies, real data examples and selection of the optimal tuning parameter in specific examples. Section 4 has some concluding remarks.

The i.i.d. data problem
The DPD measure d_α(g, f) between the densities g and f is defined, in terms of the tuning parameter α (≥ 0), as

d_α(g, f) = ∫ [ f^{1+α}(x) − (1 + 1/α) f^α(x) g(x) + (1/α) g^{1+α}(x) ] dx,  α > 0,   (1)

d_0(g, f) = lim_{α→0} d_α(g, f) = ∫ g(x) ln ( g(x)/f(x) ) dx,   (2)

where ln represents the natural logarithm. Given an i.i.d. random sample Y_1, . . . , Y_n and a parametric family F = {f_θ : θ ∈ Θ ⊆ ℝ^p} of densities which models the true data-generating density g, the minimum DPD estimator θ̂_α of θ is defined by the relation d_α(ĝ, f_{θ̂_α}) = min_{θ∈Θ} d_α(ĝ, f_θ), where ĝ represents a suitable nonparametric estimate of g. As already mentioned, a positive feature of the DPD measure is that it does not require any nonparametric smoothing technique for any value of α; substituting the empirical distribution function in place of the true unknown distribution function allows us to build suitable approximations to the divergence measures in Equation (1) or Equation (2). Simplifications of this nature are not possible in the case of most other density-based minimum distance methods (such as the minimum Hellinger distance estimator).
When estimating θ by minimizing d_α(g, f_θ), the last term on the right-hand side of Equation (1) does not involve the unknown parameter; hence, the essential objective function to be minimized, after approximating the true distribution with the empirical, turns out to be

∫ f_θ^{1+α}(x) dx − (1 + 1/α) (1/n) Σ_{i=1}^n f_θ^α(Y_i).   (3)

Under differentiability of the model, the MDPDE is obtained by solving the estimating equation

(1/n) Σ_{i=1}^n u_θ(Y_i) f_θ^α(Y_i) − ∫ u_θ(y) f_θ^{1+α}(y) dy = 0,   (4)

where u_θ(y) = ∇ ln f_θ(y) is the usual likelihood score function and ∇ is the gradient with respect to θ. From the first term of Equation (4), it is clear that each likelihood score value is weighted by a power of the model density, so that observations unlikely under the model are automatically downweighted; thus, the robustness of the procedure is intuitively obvious. The strength of downweighting increases with α; α = 0 indicates no downweighting, and in this case the MDPDE is the maximum likelihood estimator (MLE) of θ. The second term in Equation (4) is simply the adjustment needed to make the estimating equation unbiased under the model.
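To make the construction above concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that minimizes the empirical DPD objective for an i.i.d. normal model. It uses the closed-form normal integral ∫ f^{1+α} = (2πσ²)^{−α/2}(1 + α)^{−1/2}; the function names and starting values are our choices.

```python
import numpy as np
from scipy.optimize import minimize

def mdpde_normal(y, alpha):
    """Minimum DPD estimate of (mu, sigma) for an i.i.d. normal sample.

    Minimizes  int f^{1+alpha} - (1 + 1/alpha) * mean(f^alpha(Y_i)),
    using the closed form  int f^{1+alpha} = (2*pi*sig^2)^(-alpha/2)/sqrt(1+alpha).
    alpha = 0 is handled as the negative log-likelihood (the MLE)."""
    y = np.asarray(y, dtype=float)

    def objective(par):
        mu, log_sig = par
        sig = np.exp(log_sig)            # keep sigma positive
        z2 = ((y - mu) / sig) ** 2
        if alpha == 0:                   # MLE limit of the DPD
            return np.mean(0.5 * z2) + np.log(sig)
        falpha = (2 * np.pi * sig**2) ** (-alpha / 2) * np.exp(-alpha * z2 / 2)
        integral = (2 * np.pi * sig**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
        return integral - (1 + 1 / alpha) * np.mean(falpha)

    # robust starting values: median and MAD-based scale
    start = np.array([np.median(y),
                      np.log(np.median(np.abs(y - np.median(y))) / 0.6745)])
    res = minimize(objective, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # 5% outliers
print(mdpde_normal(y, 0.0))   # MLE: pulled toward the outliers
print(mdpde_normal(y, 0.5))   # alpha = 0.5: stays close to the clean (0, 1)
```

Running the sketch shows the automatic downweighting at work: the α = 0 fit is dragged by the contaminating cluster, while the α = 0.5 fit essentially ignores it.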
The popularity of the MDPDE is mainly due to its high robustness with minimal loss in efficiency and no use of nonparametric kernel smoothing even under continuous models. The MDPDE can also be seen to belong to the class of M-estimators [9]. Basu et al. [1] presented the asymptotic properties of the MDPDE along with its influence function (IF), which is a classical measure of the theoretical robustness of any estimator [7]. Whenever the true density g belongs to the model family, i.e. g = f_θ, the IF of the MDPDE (T_α, say) has an interesting form given by

IF(y; T_α, F_θ) = J_α^{−1} [ f_θ^α(y) u_θ(y) − ξ_α ],  J_α = ∫ u_θ(x) u_θ^T(x) f_θ^{1+α}(x) dx,  ξ_α = ∫ u_θ(x) f_θ^{1+α}(x) dx,   (5)

where y is the contamination point and F_θ is the distribution function of f_θ. Clearly, the IF of the MDPDE is bounded at any α > 0 for most parametric models, implying their robustness. Furthermore, Basu et al. [1] also showed that for the particular case of normal models with both parameters unknown, the breakdown point for the MDPDE of the mean parameter is α(1 + α)^{−3/2}; recently Ghosh et al. [6] generalized this breakdown point result by showing an asymptotic breakdown point of 0.5 for the general location-scale model and a general class of minimum distance estimators including the MDPDEs with α > 0.
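The boundedness of the IF for α > 0 can be seen numerically. The sketch below (our own illustration) evaluates the y-dependent factor f_θ^α(y) u_θ(y) of the IF in Equation (5) for a normal mean with known scale, leaving out the constant matrix J_α^{−1}; for α = 0 this factor reduces to the unbounded MLE score.

```python
import numpy as np

def if_factor_normal_mean(y, mu=0.0, sigma=1.0, alpha=0.5):
    """y-dependent part of the IF of the MDPDE of a normal mean:
    the score (y - mu)/sigma^2 times the density-power weight f^alpha(y)
    (normalizing constants absorbed; alpha = 0 gives the raw MLE score)."""
    z = (y - mu) / sigma
    weight = np.exp(-alpha * z**2 / 2)   # density-power downweighting
    return z / sigma * weight

y = np.linspace(-10, 10, 2001)
print(np.max(np.abs(if_factor_normal_mean(y, alpha=0.0))))  # grows with |y|
print(np.max(np.abs(if_factor_normal_mean(y, alpha=0.5))))  # bounded, redescending
```

For α = 0.5 the factor peaks at z = √2 and then redescends to zero, which is exactly the behavior that produces the bounded IF discussed above.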

The case of non-homogeneous observations
Adapting the DPD method to the case of non-homogeneous observations in a fully rigorous manner requires a substantial and nontrivial extension of the approach followed in the case of i.i.d. data. Ghosh and Basu [5] have recently provided this generalization. Assume that our observed data Y_1, . . . , Y_n are independent but, for each i, Y_i ∼ g_i, where g_1, . . . , g_n are possibly different densities with respect to some common dominating measure. Examples of such g_i are provided by a sequence of normal distributions with different means μ_i and a common variance σ², or a sequence of gamma distributions with shape parameters α_i and a common scale β. A sequence of Bernoulli distributions with different success probabilities p_i may also be a useful example in the case of logistic regression, where each p_i may be linked to a common parameter β through the explanatory variables. We want to model g_i by the family F_{i,θ} = {f_i(·; θ) : θ ∈ Θ} for all i = 1, 2, . . .; thus, while the distributions are possibly different, they all share a common parameter θ. Ghosh and Basu [5] proposed the minimization of the average divergence between the data points and the models, given by

(1/n) Σ_{i=1}^n d_α(ĝ_i, f_i(·; θ)),   (6)

where ĝ_i is a density estimate of g_i; ignoring the terms not involving θ, the above may be approximated by

(1/n) Σ_{i=1}^n [ ∫ f_i^{1+α}(x; θ) dx − (1 + 1/α) f_i^α(Y_i; θ) ] = (1/n) Σ_{i=1}^n V_i(Y_i; θ),

where V_i(·; θ) is the indicated term within the square brackets in the above equation. These authors also derived the asymptotic distribution of the MDPDE θ̂_n. Under the regularity conditions listed in [5], it was proved that Ω_n^{−1/2} Ψ_n [n^{1/2}(θ̂_n − θ_α)] has an asymptotic p-dimensional normal distribution with (vector) mean 0 and covariance matrix I_p, the p-dimensional identity matrix, where θ_α is the minimizer of (1/n) Σ_{i=1}^n d_α(g_i, f_i(·; θ)), u_i(y; θ) = ∇ ln f_i(y; θ), and where

Ψ_n = (1/n) Σ_{i=1}^n J^{(i)},  Ω_n = (1/n) Σ_{i=1}^n [ ∫ u_i(y; θ) u_i^T(y; θ) f_i^{2α}(y; θ) g_i(y) dy − ξ_i ξ_i^T ],  ξ_i = ∫ u_i(y; θ) f_i^α(y; θ) g_i(y) dy,

with J^{(i)} reducing to ∫ u_i(y; θ) u_i^T(y; θ) f_i^{1+α}(y; θ) dy when g_i belongs to the model family; see [5] for the general form of J^{(i)}. Since the focal point of this proposal is the robustness of the estimator, one should check its IF to investigate this issue. Ghosh and Basu [5] extended the definition of the IF from the i.i.d.
case [7] to the present case of non-homogeneous observations and derived the same for the proposed MDPDE. In fact, in the case of the non-homogeneous setup, the contamination can be in any one direction or in all the directions with respect to the index i. When the contamination is only in the i_0th direction (at the contamination point t_{i_0}), the IF of the MDPDE T_α turns out to be [5]

IF_{i_0}(t_{i_0}; T_α) = Ψ_n^{−1} (1/n) [ f_{i_0}^α(t_{i_0}; θ) u_{i_0}(t_{i_0}; θ) − ξ_{i_0} ],  ξ_i = ∫ u_i(y; θ) f_i^α(y; θ) dG_i(y),

where G_i is the distribution function of g_i. On the other hand, when contamination exists in all the data points, the IF becomes [5]

IF(t_1, . . . , t_n; T_α) = Ψ_n^{−1} (1/n) Σ_{i=1}^n [ f_i^α(t_i; θ) u_i(t_i; θ) − ξ_i ],

where t_i is the contamination point in the ith direction. In particular, letting t_i = y, G_i = F_θ and f_i = f in the above, we get back the IF of the MDPDE for the i.i.d. case given in Equation (5). See Ghosh and Basu [5] for a detailed discussion of the IF and summary measures based on it.
For most parametric models, the IF of the MDPDE is again seen to be bounded for all α > 0, implying robustness. Ghosh and Basu [5] also proved a breakdown result for the MDPDE under a special location-scale type model f_i(y; θ) = (1/σ)f((y − l_i(μ))/σ) with θ = (μ, σ) under the non-homogeneous setup. Under Assumptions (BP1)-(BP3) of Ghosh and Basu [5], the asymptotic breakdown point of the MDPDE of the location parameter μ with fixed σ is shown to be at least 1/2 at the model distribution.
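A small worked example may help fix ideas for the non-homogeneous objective. The sketch below (our own toy setup, not one of the paper's examples) takes Y_i ~ Poisson(θ·t_i) with known exposures t_i, so the observations are independent but not identically distributed while sharing the common parameter θ, and minimizes (1/n) Σ_i V_i(Y_i; θ) by a grid search; for the discrete model the integral in V_i becomes a (truncated) sum over the support.

```python
import numpy as np
from scipy.stats import poisson

def mdpde_poisson_rate(y, t, alpha, theta_grid=None):
    """MDPDE of a common rate theta when Y_i ~ Poisson(theta * t_i)
    independently, with known exposures t_i (illustrative toy model).

    Minimizes (1/n) sum_i V_i(Y_i; theta), where for a discrete model
    V_i(y; theta) = sum_k f_i(k)^{1+alpha} - (1 + 1/alpha) f_i(y)^alpha."""
    y, t = np.asarray(y), np.asarray(t, float)
    if theta_grid is None:
        theta_grid = np.linspace(0.1, 10.0, 496)
    k = np.arange(0, 200)                   # truncated Poisson support
    best, best_val = None, np.inf
    for th in theta_grid:
        pk = poisson.pmf(k[None, :], th * t[:, None])   # (n, K) pmf table
        vi = (pk ** (1 + alpha)).sum(axis=1) \
             - (1 + 1 / alpha) * poisson.pmf(y, th * t) ** alpha
        if vi.mean() < best_val:
            best, best_val = th, vi.mean()
    return best

rng = np.random.default_rng(4)
t = rng.uniform(0.5, 2.0, 100)
y = rng.poisson(2.0 * t)
y[:10] += 40                                # gross outliers in ten counts
print(mdpde_poisson_rate(y, t, alpha=0.5))  # close to the true rate 2
```

Even with ten grossly inflated counts, the α = 0.5 fit recovers a rate near the true value 2, because the contaminated observations receive pmf-power weights that are essentially zero.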

Selecting the optimal tuning parameter
We get the most efficient but highly non-robust MLE as the solution for the case α = 0; as α increases, so does the robustness of the estimator, albeit at the cost of efficiency. Thus, a natural question that arises in applying the method to practical real data scenarios is how to choose α appropriately. There have been some attempts to select the 'optimal' α for determining the MDPDE of θ in the i.i.d. case (see, e.g. [8,14]). We will modify the approach of Warwick and Jones [14] to determine the optimal value of the tuning parameter for the MDPDEs in the non-homogeneous data context.
For the sake of completeness, we briefly describe the method in the case of i.i.d. data. Let f_{θ*} be the target density, so that θ* is the target parameter. Let

g(x) = (1 − ε) f_{θ*}(x) + ε δ_y(x)

be a contaminated version of the target density, where δ_y(x) is the Dirac delta function at the value y. In general g need not involve the model density f_θ at all, but the above setup is useful in motivating the choice of the tuning parameter. For a given α, let θ_α denote the value of θ which minimizes d_α(g, f_θ). In a given situation involving an i.i.d. random sample from g, let θ̂_α denote the MDPDE of θ. Warwick and Jones [14] proposed to minimize an asymptotic approximation of the summed mean-squared error (MSE) E[(θ̂_α − θ*)^T(θ̂_α − θ*)] to choose the optimal α. Under standard regularity conditions, the asymptotic bias and variance of θ̂_α are given by (θ_α − θ*) and (1/n) Trace[J_α^{−1}(θ_α) K_α(θ_α) J_α^{−1}(θ_α)], respectively, where J_α(θ) and K_α(θ) are as given in Equations (3) and (4) of Warwick and Jones [14]. Thus, we have

E[(θ̂_α − θ*)^T(θ̂_α − θ*)] ≈ (θ_α − θ*)^T(θ_α − θ*) + (1/n) Trace[J_α^{−1}(θ_α) K_α(θ_α) J_α^{−1}(θ_α)].   (12)

The asymptotic variance component may be estimated by substituting θ̂_α for θ_α and the sample average for the population average under g in the definitions of J_α(θ) and K_α(θ). To estimate the bias component, θ_α may be substituted by its estimate θ̂_α. But there is no obvious choice of θ*. Warwick and Jones [14] considered several 'pilot' estimators of θ as the input for θ* in the minimization process and compared their effect through simulation studies. Based on their explorations, they recommended the use of the MDPDE corresponding to α = 1 as the pilot estimate. The approach of Hong and Kim [8] for choosing the optimal α may be considered a special case of the above approach with the pilot estimator being the MDPDE θ̂_α itself. Thus, their minimization depends only on the second term on the right-hand side of Equation (12).
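The Warwick-Jones criterion can be sketched in a few lines for a simple special case. The code below (our own simplification) treats the normal mean with known σ, for which the model-based asymptotic variance of the MDPDE has the closed form (σ²/n)(1 + α²/(1 + 2α))^{3/2}; this replaces the general J/K sandwich, and the estimator itself is computed by a grid search.

```python
import numpy as np

def mdpde_mean(y, alpha, sigma=1.0):
    # grid-search MDPDE of a normal mean with known sigma (alpha > 0)
    grid = np.linspace(y.min(), y.max(), 4001)
    z2 = ((y[None, :] - grid[:, None]) / sigma) ** 2
    c = (2 * np.pi * sigma**2) ** (-alpha / 2)
    obj = (c / np.sqrt(1 + alpha)
           - (1 + 1 / alpha) * c * np.exp(-alpha * z2 / 2)).mean(axis=1)
    return grid[np.argmin(obj)]

def wj_optimal_alpha(y, sigma=1.0, pilot_alpha=0.5):
    """Warwick-Jones-type choice of alpha for a normal mean, known sigma:
    MSE_hat(alpha) = (theta_hat_alpha - theta_pilot)^2
                   + (sigma^2/n) * (1 + alpha^2/(1 + 2*alpha))**1.5,
    the second term being the model-based asymptotic variance of the MDPDE
    (a simplification of the general J/K formulas)."""
    n = len(y)
    pilot = mdpde_mean(y, pilot_alpha, sigma)
    alphas = np.linspace(0.05, 1.0, 20)
    mse = [(mdpde_mean(y, a, sigma) - pilot) ** 2
           + sigma**2 / n * (1 + a**2 / (1 + 2 * a)) ** 1.5 for a in alphas]
    return alphas[int(np.argmin(mse))]

rng = np.random.default_rng(7)
dirty = np.concatenate([rng.normal(0, 1, 180), rng.normal(8, 1, 20)])
print(wj_optimal_alpha(dirty))  # contamination keeps the choice away from zero
```

With 10% contamination the bias term penalizes small α heavily, so the criterion settles on a value of α bounded away from zero, in line with the discussion above.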
We will now generalize the approach of Warwick and Jones [14] for determining the optimal value of the robustness tuning parameter α in our non-homogeneous setup. Note that there could be other optimality criteria apart from the minimization of the MSE. However, the MSE is a popular and widely used criterion with an easy interpretation, and it appears to work well in the case of i.i.d. data [14]. Thus, we have chosen to continue with this particular optimality criterion in the present paper. Now, from the asymptotic distribution of our proposed MDPDE as discussed in Section 2.2, it follows that asymptotically

E[(θ̂_α − θ*)^T(θ̂_α − θ*)] ≈ (θ_α − θ*)^T(θ_α − θ*) + (1/n) Trace[Ψ_n^{−1} Ω_n Ψ_n^{−1}].   (13)

Thus, we need to minimize an estimate of the above approximation in order to obtain an optimum value of α. Based on the given non-homogeneous observations, we can estimate the second part of the above as (1/n) Trace[Ψ̂_{n,α}^{−1} Ω̂_{n,α} Ψ̂_{n,α}^{−1}], where Ψ̂_{n,α} and Ω̂_{n,α} are consistent estimators of Ψ_n and Ω_n, respectively, corresponding to the tuning parameter α. The estimation of Ψ_n can be done in a natural manner by replacing θ_α by θ̂_α and the population averages with respect to the g_i by the corresponding sample values, so that we have

Ψ̂_{n,α} = (1/n) Σ_{i=1}^n Ĵ^{(i)}(θ̂_α),

where Ĵ^{(i)} is obtained from J^{(i)} by evaluating the expectation under g_i at the single observed value Y_i. However, we will normally have only one observation from each g_i, and in such situations it is not possible to estimate a variance based on only one observation. Therefore, we must go beyond the Warwick-Jones proposal to estimate Ω_n, and for this purpose we replace θ_α by θ̂_α and the g_i by the model densities f_i(·; θ̂_α). Thus, we use the approximation

Ω̂_{n,α} = (1/n) Σ_{i=1}^n [ ∫ u_i(y; θ̂_α) u_i^T(y; θ̂_α) f_i^{1+2α}(y; θ̂_α) dy − ξ_i(θ̂_α) ξ_i^T(θ̂_α) ],  ξ_i(θ) = ∫ u_i(y; θ) f_i^{1+α}(y; θ) dy.

To estimate the squared bias term (θ_α − θ*)^T(θ_α − θ*), we replace θ_α by its estimate θ̂_α. However, as in the i.i.d. case, there is no obvious choice of θ*. We will use several robust estimates of θ as our 'pilot' estimator in place of θ* in the minimization process.
We will explore the role of the pilot estimators on the choice of optimum α through several simulation studies and real data examples in the context of linear regression and make a recommendation at the end.
As in the case of independent and identically distributed observations, here also the approach of Hong and Kim [8] turns out to be a special case of the Warwick and Jones approach, where one minimizes only the variance component. This process may be viewed as one where the MDPDE itself is chosen to be the pilot estimate. This particular case of the pilot estimator is referred to as the 'no bias' case in the rest of the manuscript.
Note that several (intuitively) reasonable approximations have been used in this section. The closeness of these approximations may be of some concern to the reader; however, as we will see later in our numerical illustrations, this is indeed not a practical problem for the implementation of the proposed method.
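The non-homogeneous version of the criterion can also be sketched concretely. In the toy model Y_i ~ N(θ, σ_i²) with known, observation-specific σ_i (our own illustrative choice; the paper's application is linear regression), the scalar-parameter versions of Ψ_n and Ω_n evaluated at the model have closed forms, and the criterion reduces to a squared bias against a pilot plus Ω̂/(nΨ̂²).

```python
import numpy as np

def mdpde_common_mean(y, sig, alpha):
    # grid-search MDPDE of theta for Y_i ~ N(theta, sig_i^2), alpha > 0
    grid = np.linspace(y.min(), y.max(), 4001)
    z2 = ((y[None, :] - grid[:, None]) / sig[None, :]) ** 2
    c = (2 * np.pi * sig**2) ** (-alpha / 2)
    obj = (c / np.sqrt(1 + alpha)
           - (1 + 1 / alpha) * c * np.exp(-alpha * z2 / 2)).mean(axis=1)
    return grid[np.argmin(obj)]

def optimal_alpha_nonhom(y, sig, pilot_alpha=0.5):
    """Tuning-parameter selection for the scalar toy model Y_i ~ N(theta,
    sig_i^2) with known sig_i: estimated MSE(alpha) =
    (theta_hat_alpha - theta_pilot)^2 + Omega_hat/(n * Psi_hat^2),
    with Psi_n, Omega_n evaluated at the fitted model densities."""
    n = len(y)
    pilot = mdpde_common_mean(y, sig, pilot_alpha)
    alphas = np.linspace(0.05, 1.0, 20)
    crit = []
    for a in alphas:
        th = mdpde_common_mean(y, sig, a)
        psi = np.mean((2 * np.pi * sig**2) ** (-a / 2) / ((1 + a) ** 1.5 * sig**2))
        omega = np.mean((2 * np.pi * sig**2) ** (-a) / ((1 + 2 * a) ** 1.5 * sig**2))
        crit.append((th - pilot) ** 2 + omega / (n * psi**2))
    return alphas[int(np.argmin(crit))]

rng = np.random.default_rng(11)
sig = rng.uniform(0.5, 2.0, 150)
y = rng.normal(5.0, sig)
y[:15] += 8.0                       # 10% moderate outliers
print(optimal_alpha_nonhom(y, sig)) # selected alpha bounded away from zero
```

The closed forms used for Ψ_n and Ω_n follow from ∫ u² f^{1+c} dy = (2πσ²)^{−c/2}(1 + c)^{−3/2}/σ² for the normal density; for a vector parameter one would instead compute the trace of the sandwich matrix as in Equation (13).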

An application: selecting the optimum α in normal linear regression
The linear regression problem with normal errors constitutes the most prominent example of estimation under independent and non-homogeneous data. While the asymptotic properties of the MDPDEs in the regression setup have only recently been established in a formal way by Ghosh and Basu [5], Basu et al. [1] had briefly touched upon the use of the DPD measure in fitting robust regression models; a slightly expanded discussion was provided in the University of Oslo technical report by the same set of authors (with the same title and year of citation). Scott [12] gave an example involving the use of the MDPDE corresponding to α = 1 (the L_2 distance) in the regression scenario. Some simulation results illustrating the performance of the MDPDEs of the regression parameters were presented by Durio and Isaia [4].
Consider the linear regression model

y_i = x_i^T β + ε_i,  i = 1, . . . , n,

where the errors ε_i are i.i.d. normal with mean zero and variance σ², x_i^T = (1, x_{i1}, . . . , x_{ip}) is the vector of predictor variables corresponding to the ith observation and β = (β_0, β_1, . . . , β_p)^T represents the regression coefficients. We will assume that the x_i are fixed; under this setup the MDPDE of the parameter θ = (β^T, σ²)^T can be obtained by minimizing the expression in Equation (6) with f_i being the N(x_i^T β, σ²) density. Ghosh and Basu [5] presented the theoretical asymptotic properties and the robustness properties of the proposed MDPDE under the linear regression model. They have also demonstrated that, whenever the error variance σ² is fixed, the asymptotic breakdown point of the proposed MDPDE of the regression coefficient β is exactly 0.5 at the model. The advantages of the proposed MDPDE of the regression parameter have also been discussed in comparison with the existing robust regression methods. In the present paper, for the sake of completeness, we only note the IF of the proposed MDPDE (T_α^β, T_α^σ) of (β, σ) under contamination only in the i_0th direction; for the regression coefficients it takes the form

IF(t_{i_0}; T_α^β) = (1 + α)^{3/2} ( Σ_{i=1}^n x_i x_i^T )^{−1} x_{i_0} (t_{i_0} − x_{i_0}^T β) exp( −α (t_{i_0} − x_{i_0}^T β)² / (2σ²) ),

while the IF of T_α^σ depends on t_{i_0} only through the residual and carries the same exponential factor. Clearly, these IFs depend on the residual of the regression and the predictor values, and they are both bounded in the contamination point t_{i_0} for all α > 0 and for any direction i_0. Hence, the proposed MDPDEs with α > 0 are always robust with respect to contamination in any data point. However, the IFs at α = 0 are clearly unbounded, implying the non-robust nature of the corresponding MLEs. Similar robustness properties can also be observed for contamination in all the directions, and the extent of robustness always increases as α increases; see Ghosh and Basu [5] for details. Here, we consider the problem of finding the optimum α in the context of the above linear regression model.
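A minimal implementation of the MDPDE for this regression model may be useful as a reference point. The sketch below (our own code, with our own choice of optimizer and starting values) minimizes (1/n) Σ_i V_i(y_i; θ) with f_i the N(x_i^T β, σ²) density, started from a Theil-Sen-type fit so that the optimizer lands in the robust basin of the objective.

```python
import numpy as np
from scipy.optimize import minimize

def mdpde_linreg(x, y, alpha):
    """MDPDE of (b0, b1, sigma) in y = b0 + b1*x + N(0, sigma^2) (alpha > 0),
    minimizing the averaged DPD objective with the closed-form normal
    integral int f^{1+alpha} = (2*pi*sig^2)^(-alpha/2)/sqrt(1+alpha)."""
    x, y = np.asarray(x, float), np.asarray(y, float)

    def objective(par):
        b0, b1, log_sig = par
        sig = np.exp(log_sig)
        z2 = ((y - b0 - b1 * x) / sig) ** 2
        const = (2 * np.pi * sig**2) ** (-alpha / 2)
        return np.mean(const / np.sqrt(1 + alpha)
                       - (1 + 1 / alpha) * const * np.exp(-alpha * z2 / 2))

    # robust starting values: median of pairwise slopes, median intercept
    i, j = np.triu_indices(len(x), k=1)
    ok = x[j] != x[i]
    b1_0 = np.median((y[j][ok] - y[i][ok]) / (x[j][ok] - x[i][ok]))
    b0_0 = np.median(y - b1_0 * x)
    s0 = np.median(np.abs(y - b0_0 - b1_0 * x)) / 0.6745
    res = minimize(objective, [b0_0, b1_0, np.log(s0)], method="Nelder-Mead")
    b0, b1, log_sig = res.x
    return b0, b1, np.exp(log_sig)

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 60)
y = 3 + 2 * x + rng.normal(0, np.sqrt(3), 60)
y[:8] += 25                                  # gross outliers in the y-direction
print(mdpde_linreg(x, y, alpha=0.5))         # near the true (3, 2, sqrt(3))
```

Despite roughly 13% vertical contamination, the α = 0.5 fit stays near the true parameters, in agreement with the bounded IF derived above.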
As explained earlier, we can obtain expressions for the estimated bias and variance components, as functions of α, based on the observed data. We then minimize the resulting estimated MSE with respect to α in [0, 1] to obtain the optimum value of the robustness tuning parameter. The choice of the optimum α depends on the choice of the 'pilot' estimator used in estimating the bias component. We will explore this dependence and the performance of the proposed methodology through simulation studies in the following sections.
The robustness of the MDPDEs increases with α, but at the cost of their asymptotic efficiency. We consider the estimators corresponding to α > 1 to be of little practical use on account of their low efficiency, and feel that nothing will be lost in restricting the search of the optimal parameter to the interval [0, 1] as we have done.
Durio and Isaia [4] have also proposed a strategy to find the optimum α in the context of the normal linear regression model. Unlike the current proposal, which is quick, simple and intuitive, their method involves an elaborate Monte Carlo testing procedure and has high computational complexity.

Numerical studies
In this section, we numerically explore the properties of the proposed estimators along the aims of the present paper. We do this with both real and simulated data and in either case touch upon the selection of the optimal tuning parameter together with a description of the robustness properties of the proposed method.

Real data examples
Example 1 (Hertzsprung-Russell data of the star cluster) The Hertzsprung-Russell diagram of the star cluster CYG OB1 contains 47 stars in the direction of Cygnus [11, Table 3, Chapter 2]. For these data, the independent variable x is the logarithm of the effective temperature at the surface of the star (T_e), and the dependent variable y is the logarithm of its light intensity (L/L_0). The data were analyzed by Rousseeuw and Leroy [11], who inferred that there are two groups of data points; the four stars in the upper right corner of the scatter plot (Figure 1) are giant stars that act as leverage points. The MDPDEs of the regression parameters for several values of α are presented in Table 1. It is clear that the four leverage points exert a significant pull on the MLEs of the regression parameters, to the extent that the corresponding residuals fail to separate the two groups of data. Note that the MLEs of the regression parameters are also the MDPDEs corresponding to α = 0, as well as the ordinary least-squares (OLS) estimators. However, like the highly robust (but inefficient) LMS estimator, the minimum DPD estimators with α ≥ 0.3 can successfully ignore the outliers and produce robust fits which easily separate the two groups. In Table 2, the optimal α corresponding to different pilot estimators are presented. The last column represents the Hong and Kim proposal. It is clearly seen that, except for pilot estimators which correspond to MDPDEs with very small values of α, the selected optimal value of α is fairly stable. In Figure 1, we plot the estimated OLS line, the estimated LMS line and the MDPDE line with the optimal tuning parameter corresponding to the pilot estimator given by the MDPDE at α = 0.5; from Table 2 it may be seen that this optimal parameter is α = 0.93.
Example 2 (Belgium telephone call data) This example has a heavy contamination in the y-direction. The data set [11, Table 2, Chapter 2] is the result of a Belgian Statistical Survey by the Ministry of Economy and contains the total number (in tens of millions) of international phone calls made in a year from 1950 to 1973. Due to the use of another recording system (giving the total number of minutes of these calls) from 1964 to 1969, the data contain large outliers in the y-direction in that range. The years 1963 and 1970 are also partially affected.
The estimates of the regression coefficients for the regression of the number of calls on the independent variable (year) obtained by minimum DPD estimation with several values of α are presented in Table 3. Clearly, the MLEs corresponding to α = 0 are highly affected by the outliers, but like the LMS method the MDPDEs with α ≥ 0.3 are highly resistant to the outliers, giving an excellent fit to the rest of the observations. Notice that in this case the estimated optimal tuning parameter is much more variable (see Table 4); this appears to be a consequence of the fact that in this case all the estimators in the range α ∈ [0.3, 1] are practically identical, and any choice of α in this range works equally well. In Figure 2, we plot the estimated OLS line, the estimated LMS line and the MDPDE line with the optimal tuning parameter corresponding to the pilot estimator given by the MDPDE at α = 0.5; from Table 4 it may be seen that this optimal parameter is α = 0.63.
Example 3 (Salinity data) Finally, as an example of a multiple regression model with a masking effect, we consider the 'Salinity data' [3,11]. The set contains measurements of water salinity and river discharge taken in North Carolina's Pamlico Sound. Rousseeuw and Leroy [11] fitted a multiple linear regression to these data with salinity as the dependent variable; the independent variables are salinity lagged by two weeks (x_1), the number of biweekly periods elapsed since the beginning of the spring season (x_2), and the volume of river discharge into the sound (x_3). According to the physical description of the data given by Carroll and Ruppert [3], cases 5 and 16 in the data correspond to periods of very heavy discharge, but cases 3 and 16 conspire to hide the discrepant case 5, producing a masking effect in the data. Table 5 presents the estimates of the regression coefficients obtained by the minimum DPD method for several values of α.
It was observed by Rousseeuw and Leroy [11] that the OLS/ML estimators (corresponding to α = 0) cannot separate the influential 5th and 16th observations due to the masking effect. But the robust LMS estimator is not affected by this masking and easily distinguishes these two observations by their large residuals. Table 5 shows that the MDPDEs with α ≥ 0.3 remain pretty stable. The OLS fit after discarding the influential 5th and 16th observations generates (23.3862, 0.6998, −0.25, −0.8356) as the vector of regression coefficients, quite close to the robust MDPDEs obtained here.
The tuning parameter selection criterion, when applied to these data, leads to an optimal value of α = 1, except for pilot estimators corresponding to very small (close to zero) values of α (Table 6). Clearly, the heavy contamination present here makes a very strong downweighting necessary.

Simulation study
In this section, we perform a detailed numerical exploration of the performance of the MDPDEs in the linear regression context. This exploration has two components. First, we study the general stability of the estimators under contamination. The second part involves a (primarily graphical) study of the effect of the pilot estimator on the selection of the optimal α.

Simulation scheme
We consider the simple linear regression case under the model

y_i = β_0 + β_1 x_i + ε_i,  i = 1, . . . , n,

where the ε_i are i.i.d. normal with mean 0 and variance σ². Here, we compute the empirical bias and MSE of the MDPDEs of β_0, β_1 and σ for various values of the tuning parameter α (α = 0 gives the MLE/OLS). We simulate, on purpose, the ε_i and x_i from distributions that create specific scenarios of large residuals and leverage points as follows: (1) We choose x randomly from the N(10, 5) distribution, where the latter argument is the variance component; the error is chosen randomly from a N(0, 3) distribution. This gives the case of a fully pure model which has no artificially introduced large residuals or leverage points. For each simulation experiment, we chose the true values of the parameters to be (β_0, β_1, σ) = (3, 2, √3), and the bias and MSEs are calculated around these target values.
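A generator for samples of this kind can be sketched as follows (our own reading of the scheme: the proportions e_err and e_x used in the tables below are interpreted as the fractions of vertical outliers and leverage points, and the specific contaminating distributions in the code are our assumptions for illustration).

```python
import numpy as np

def simulate_regression(n, e_err, e_x, rng=None):
    """One sample from a contaminated simple linear regression with
    x ~ N(10, 5) (variance 5), errors ~ N(0, 3), and true parameters
    (beta0, beta1, sigma) = (3, 2, sqrt(3)). A fraction e_err of the
    errors is shifted to create vertical outliers, and a fraction e_x
    of the design points is moved far from the bulk (leverage points);
    the contaminating distributions here are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    x = rng.normal(10, np.sqrt(5), n)
    err = rng.normal(0, np.sqrt(3), n)
    n_err, n_x = int(e_err * n), int(e_x * n)
    err[:n_err] += 20.0                      # large vertical outliers
    if n_x:
        x[-n_x:] = rng.normal(30, 1, n_x)    # leverage points in the design
    return x, 3.0 + 2.0 * x + err

x, y = simulate_regression(100, e_err=0.10, e_x=0.05,
                           rng=np.random.default_rng(0))
print(((y - 3 - 2 * x) > 10).sum())   # gross vertical outliers by construction
```

Setting e_err = e_x = 0 recovers the pure-model scenario (1) above; raising either proportion produces the large-residual and leverage-point scenarios whose effects are summarized in Tables 7-9.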

Bias and MSE of MDPDE
We have used 1000 replications to estimate the empirical bias and MSE of the regression parameters. The results, for several combinations of α, e_err and e_x, are given in Tables 7-9. From Table 7, it may be clearly seen that while the estimators corresponding to small values of α can have large biases depending on the combination of e_err and e_x values, the large-α MDPDEs control this bias neatly and give quite satisfactory results. The same is seen in the case of the MSE in Table 8.
In Table 9, the overall MSE of (β̂_0, β̂_1, σ̂)^T is presented for several values of the tuning parameter. Here, the MSE is the sum of the MSEs of the three individual estimators, which is nothing but the empirical version of E[(θ̂_α − θ*)^T(θ̂_α − θ*)]. The entries in the column corresponding to α = 0 are the average optimal mean squared errors obtained through the appropriate approximation of Equation (13), as discussed in Section 2.3, when the MDPDE with α = 0 is the pilot estimator. The entries for the other columns of α are similarly interpreted. It is clear from the table that the optimum choice of α obtained by minimizing only the variance component (no bias) is the best in terms of the MSE of the corresponding MDPDE, both in the presence and in the absence of outliers and (or) leverage points. However, in general the bias term must be incorporated, and our suggestion, based on the tables presented here and other simulations, is to use the MDPDE corresponding to α = 0.5 as the pilot estimator while computing the bias component. We acknowledge, however, that several values of α in a two-sided neighborhood of α = 0.5 worked almost equally well in our numerical studies.
Next, we provide an extensive graphical investigation of the role of the pilot estimator in determining the optimal choice of α. For this purpose, we have chosen different contaminations and leverage point scenarios. Based on the finding of Tables 7-9, we have some idea of where the optimal tuning parameter should be for each perturbation scenario. We want to explore to what extent the selection of our optimal tuning parameter is consistent with the findings of the above tables and to what extent the results are influenced by the choice of the pilot estimator.
For each scenario, we have selected a number of possible pilot estimators, which include a few MDPDEs and also the iteratively reweighted least-squares (IRLS) estimators of the regression parameters based on Huber's ψ function; for the last-mentioned estimator, the input residuals r_i at any stage are obtained by applying Huber's ψ function to the standardized residuals r_{0i}/(σ̂ √(1 − h_i)), where r_{0i} are the residuals at the previous iteration, σ̂ = MAD/0.6745, MAD is the median absolute deviation of the residuals about their median, and h_i is the leverage value from the least-squares fit. We will refer to this as the IRLS-Huber estimator. We have not used the LMS estimator as the pilot because of its low efficiency. For each scenario and each pilot estimator, 100 replications are run, generating 100 'optimal' tuning parameters. We have then made a histogram of these 100 values to get an idea of whether the findings are in the acceptable range. Our findings can be briefly summarized as follows: (1) When the data are generated from the pure model, an overwhelming majority of the estimated tuning parameters are equal to or around α = 0, irrespective of the pilot estimator (Figure 3). This is how it should be, and indicates that in such cases the algorithm is choosing the correct tuning parameter most of the time. Interestingly, for each pilot estimator, there is always a small chunk of values at or around α = 1. The latter phenomenon appears to be always true, except for the pilot estimators corresponding to the MDPDE(α = 0) and the IRLS-Huber estimator.
(2) This situation changes progressively as the level of perturbation increases. However, the MDPDE at α = 0 (the MLE) is clearly inadequate as a pilot, since it appears to choose α = 0 itself as the optimal value of the tuning parameter in a large number of cases even when the error distribution is fairly heavily contaminated, or when there are a large number of leverage points. This is not entirely unexpected, as the elimination of a large bias component allows α = 0 to stay at the top in such cases. (3) As the perturbation becomes heavier, the large bar around α = 0 melts away (see the supplementary material). The estimated tuning parameter shifts toward larger values along the α axis. For small perturbations, there is a significant concentration of optimal values slightly above, but not equal to, the value α = 0 (see Figure 1 of the supplementary material), as would seem appropriate. (4) The degree of shift of the optimal values to the right along the α axis appears to be consistent with the degree of perturbation. (5) The IRLS (Huber) pilot estimators do not appear to generate useful optimal values all the time. It would appear that the MDPDEs corresponding to α = 0.3, 0.5 and 1 are the most useful pilots in the present context. However, apart from these empirical suggestions, stronger theoretical recommendations may be of some practical interest. In the future, we hope to develop estimates of the bias component without the use of a 'pilot' estimator. (6) On the whole, there is very little difference in the overall shape of the histograms between the pilot estimators with α = 0.5 and α = 1.
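For reference, the IRLS-Huber pilot described above can be sketched as follows (our own implementation; the exact standardization used in the paper's display is an assumption here, namely dividing the previous-stage residuals by σ̂ √(1 − h_i) before applying Huber's weights).

```python
import numpy as np

def irls_huber(x, y, c=1.345, n_iter=50):
    """Sketch of an IRLS-Huber regression estimator: weighted least squares
    with Huber weights applied to residuals standardized by
    sigma_hat = MAD/0.6745 and the least-squares leverages h_i."""
    X = np.column_stack([np.ones(len(y)), x])
    H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix
    h = np.clip(np.diag(H), 0.0, 0.999)             # leverage values
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745   # MAD scale
        u = r / (s * np.sqrt(1 - h))                       # standardized residuals
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))   # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 3 + 2 * x + rng.normal(0, 1, 80)
y[:8] += 15                       # 10% vertical outliers
print(irls_huber(x, y))           # closer to (3, 2) than plain least squares
```

Because the monotone Huber ψ only bounds (rather than redescends) the influence of outliers, this pilot resists moderate contamination but, as noted in item (5), it does not always produce useful optimal values of α.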

Concluding remarks
In this paper, we have tried to explore, in detail, the applicability of the minimum DPD estimation method in the case of non-homogeneous data. The robustness of these methods arises in a natural way as a function of a single tuning parameter, and since this parameter controls the trade-off between robustness and efficiency, a proper selection of the tuning parameter in any given problem becomes important. Together with the study of the robustness of the estimators, we have tried to select the optimal tuning parameter appropriately, following the approach of Warwick and Jones [14], and have carefully studied the role of the pilot estimator. We trust that on the whole this establishes the practical utility of the minimum DPD estimators and, together with the works of Scott [12] and Durio and Isaia [4], nicely complements the theoretical findings of Ghosh and Basu [5]. We hope that practitioners will find the method to be of value in solving real-life problems.