Evaluating the relative merits of competing models based on empirical likelihood ratio test

ABSTRACT Competing models arise naturally in many research fields, such as survival analysis and economics, when the same phenomenon of interest is explained by different researchers using different theories or according to different experiences. The model selection problem is therefore of great importance to subsequent inference: inference under a misspecified or inappropriate model is risky. Existing model selection tests, such as Vuong's tests [26] and Shi's non-degenerate test [21], suffer from the variance estimation and from departures of the likelihood ratios from normality. To circumvent these dilemmas, we propose in this paper an empirical likelihood ratio (ELR) test for model selection. Following Shi [21], a bias correction method is proposed for the ELR test to enhance its performance. A simulation study and a real-data analysis illustrate the performance of the proposed ELR tests.


Introduction
Competing models arise naturally in many research fields, such as economics and survival analysis, when the same phenomenon of interest is explained by different researchers using different theories or according to different experiences. Examples include lognormal versus exponential modeling of lifetime data in survival analysis [6], and Keynesian versus new classical explanations of unemployment in economics [18]. The comparison of competing models, or model selection, is therefore of great importance to subsequent inference: inference under a misspecified or inappropriate model is risky. Since Cox [6,7] first formulated this problem in terms of hypothesis testing rather than discrimination, it has attracted considerable attention in the literature. See [5,9,16,21,23,26] and references therein.
A natural way to achieve model selection is to first introduce a statistical measure of the closeness between two models, and then recommend the one closer to the underlying true model. The most popular closeness measure in model selection is the Kullback-Leibler information criterion (KLIC; [1,2]). Cox's [6] centered log-likelihood ratio test, proposed under the assumption that one of the competing models is true, is in fact based on the KLIC of the alternative model from the null model. This test has been applied to the testing of linear and nonlinear regression models [3,17] and to more-than-one alternatives [22]. However, it loses power if neither model is true, which is often the case.
Without any model assumption on the data, Vuong [26] proposed studentized tests based on log-likelihood ratios, which in essence compare the KLICs of the two models from the underlying true model. When constructing his tests, Vuong [26] distinguished non-overlapping and overlapping competing models. For non-overlapping models, Vuong's test [26] is a Student's t-test based on the log-likelihood ratios, calibrated by the standard normal distribution. For overlapping models, Vuong [26] proposed a two-step test: (1) test whether the log-likelihood ratio has variance zero; (2) if the null in step (1) is rejected, apply the test proposed for non-overlapping models. The null hypothesis that the competing models are equally close to the truth is rejected only if both steps reject.
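To fix ideas, the one-step Vuong test can be sketched in a few lines: fit each candidate model by (pseudo) maximum likelihood, form the pointwise log-likelihood ratios, and studentize their mean. The lognormal-versus-exponential pair below echoes the survival-analysis example from the Introduction; the simulated data, sample size, and seed are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy import stats

def vuong_test(logf, logg):
    """One-step Vuong statistic: studentized mean of the pointwise
    log-likelihood ratios.  |VT| > z_{1-a/2} rejects equal fit."""
    li = logf - logg
    n = len(li)
    omega = li.std(ddof=1)          # sample s.d. of the ratios
    return np.sqrt(n) * li.mean() / omega

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# pseudo-MLEs (closed form) of the two competing models
mu, sig = np.log(y).mean(), np.log(y).std()   # lognormal model F
lam = 1.0 / y.mean()                          # exponential model G

logf = stats.lognorm.logpdf(y, s=sig, scale=np.exp(mu))
logg = stats.expon.logpdf(y, scale=1.0 / lam)

vt = vuong_test(logf, logg)   # large positive values favor model F
```

Here a positive VT points toward the lognormal model, as expected since the data were generated from it; under equal fit, VT is compared with standard normal quantiles.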
In general, Vuong's test has good power if the variance of the log-likelihood ratio is away from zero or the two competing models have clearly different KLICs from the true model. Otherwise, it may have severe size distortion in finite samples, in both the overlapping and the non-overlapping case [21]. By studying the asymptotic performance of Vuong's test along a series of local alternatives, Shi [21] found that the size distortion is mainly due to asymptotic biases in both the numerator and the denominator of Vuong's test statistic. A modified Vuong test was consequently proposed by correcting both biases, and was further enhanced with a simulation-intensive calibration method.
Vuong's test and Shi's modified Vuong test are both Student-type tests, which necessitate estimating the variance of the likelihood ratio and perform well only if the likelihood ratio is approximately normally distributed. When the variance is rather small, it is difficult to estimate accurately, which increases the variation of these tests and may therefore inflate both type I and type II errors. They also lose power if the distribution of the likelihood ratio is far from normal.
We propose in this paper a model selection test based on empirical likelihood (EL; [13,14]), a popular non-parametric tool for statistical inference [11,12,19]. Comparing the KLICs of the two competing models is equivalent to testing whether the mean of the log-likelihood ratio is zero, which motivates our EL ratio test for model selection. Further, in the spirit of Shi's modification strategy, a bias-correction method is also proposed for the EL ratio test to enhance its finite-sample performance. A significant advantage of the proposed test over Student-type tests is that it neither involves a variance estimate nor depends on the normality of the likelihood ratio, and is therefore expected to perform better in more general situations. Our simulation results confirm this point. We find that Vuong's test often inflates its type I error substantially, so its power is questionable. The proposed bias-corrected EL ratio test not only has the most accurate type I errors, but is also uniformly more powerful than Shi's test; the latter strictly controls its type I error but is somewhat conservative, and therefore may lose power. Another significant advantage of the proposed test is that, since it is calibrated by its limiting χ_1^2 distribution, its critical values and p-values are always readily available; in comparison, Shi's test is calibrated by a computation-intensive searching method, which is rather time-consuming.
We remark that the problem considered in this paper is to test the null hypothesis that two competing models are equally appropriate for modeling the given data. Even under the null hypothesis, we have no idea whether either of the two models is true. This is an essential difference from the well-known goodness-of-fit testing problem, which assumes that the true model is contained in either the null or the alternative hypothesis. The goodness-of-fit testing problem is a fundamental research problem in statistics and has been extensively investigated by means of divergence measures (see, e.g. [15]). If a non-parametric assumption is imposed on the alternative, non-parametric goodness-of-fit methods, such as density-based EL techniques, Kolmogorov-Smirnov type procedures, and kernel-based approaches, can be applied [8,24]. However, these approaches generally do not apply to the problem considered in this paper, where in general none of the models under consideration is true.
The rest of the paper is organized as follows. We define notation and review Vuong's and Shi's tests in Section 2. The proposed EL ratio test is presented in Section 3, together with its asymptotic properties. The size and power of the proposed test are investigated in Section 4 by comparison with existing tests. In Section 5, we analyze a real data-set to illustrate the usefulness of the EL ratio test. All proofs are deferred to the supplemental material for clarity.

Problem formulation
Suppose we have n independent and identically distributed (iid) copies {(Y_i, X_i) : i = 1, 2, . . . , n} of a random vector (Y, X) and two competing parametric probability models F = {f(y|x; α) : α ∈ A} and G = {g(y|x; β) : β ∈ B}. Given the data, we wish to know which model better fits the conditional density function of Y given X.
Following [1,2,26], we take the KLIC as a measure of the distance between a candidate model and the true model, or equivalently as a measure of the goodness of a candidate model. Suppose the true conditional density function of Y given X is q(y|x) with distribution Q(y|x). We define the distance between the family F and the true distribution to be the minimum KLIC, D(Q, F) = min_{α∈A} E_0{log q(Y|X) − log f(Y|X; α)}, where E_0 denotes expectation with respect to the true joint distribution of (Y, X). The minimizer α* is called the pseudo-true value of α. Similarly we define D(Q, G) with pseudo-true value β*. Writing ℓ_i(φ) = log f(Y_i|X_i; α) − log g(Y_i|X_i; β) with φ^τ = (α^τ, β^τ) and φ*^τ = (α*^τ, β*^τ), the model selection problem, in terms of hypothesis testing, is to test whether the two models have the same distance from the underlying true model, that is, H_0 : E_0{ℓ_i(φ*)} = 0. If H_0 is rejected with E_0{ℓ_i(φ*)} > 0, we conclude that F is closer to the truth than G; otherwise, we conclude that G provides the better fit for the data.

EL ratio test
Under the hypothesis testing formulation in Equation (1), any test for a mean is applicable to the model selection problem by taking {ℓ_i(φ)} as observations, were φ known. This strategy still works with unknown φ if an appropriate estimate φ̂ is plugged in.
Let α̂ and β̂ be the pseudo maximum likelihood estimators of α and β, and write φ̂^τ = (α̂^τ, β̂^τ) and Z_i = ℓ_i(φ̂). We propose to test (1) by the empirical likelihood ratio (ELR; [13,14]) test applied to the Z_i's. In determining the accompanying critical values, we find that the ELR test has two totally different limiting behaviors in two exclusive cases of the null hypothesis H_0: (a) H_0 is true and f(y|x; α*) ≠ g(y|x; β*); (b) H_0 is true and f(y|x; α*) = g(y|x; β*) almost surely. For ease of presentation, we define matrices A and B in terms of the first- and second-order derivatives of ℓ_i(φ), where ∇ is the differentiation operator ∂/∂φ with φ^τ = (α^τ, β^τ). Paralleling Assumptions 1-5 of [26], we make the following assumptions on the competing models under consideration.
(C1) The parameter spaces A ⊂ R^{d_1} and B ⊂ R^{d_2} are both compact. (C2) (Differentiability and integrability) (i) For all (y, x) on their supports, f(y|x; α) and g(y|x; β) are three times differentiable with respect to α and β, respectively. (ii) There exists a non-negative dominating function H(y, x) that integrably bounds these derivatives. The next theorem presents the limiting distributions of the ELR test.
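For readers unfamiliar with empirical likelihood, the computational core of the ELR test, Owen-style EL for a zero mean applied to scalar observations, can be sketched as follows; the scalar Lagrange multiplier is found by root finding over its feasible interval. The simulated inputs and seed are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_ratio(z):
    """Owen's empirical likelihood ratio statistic for H0: E[Z] = 0
    (scalar case), calibrated against the chi-square(1) distribution."""
    z = np.asarray(z, dtype=float)
    if z.min() >= 0 or z.max() <= 0:
        return np.inf                  # 0 outside the convex hull: reject
    n = len(z)
    # The multiplier solves sum z_i / (1 + lam z_i) = 0 on the feasible
    # interval where all weights 1 + lam z_i stay positive.
    eps = 1e-10
    lo = -1.0 / z.max() + eps
    hi = -1.0 / z.min() - eps
    g = lambda lam: np.sum(z / (1.0 + lam * z))
    lam = brentq(g, lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(1)
z0 = rng.normal(0.0, 1.0, size=200)    # mean zero: H0 true
z1 = rng.normal(0.5, 1.0, size=200)    # mean 0.5: H0 false

p0 = chi2.sf(el_ratio(z0), df=1)
p1 = chi2.sf(el_ratio(z1), df=1)
```

In the model selection setting the inputs would be the estimated log-likelihood ratios Z_i rather than raw data; since the Z_i are not iid, the chi-square calibration requires the theory developed in this section.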

Theorem 2.1: Assume conditions (C1) and (C2). In case (a), the ELR converges in distribution to the χ_1^2 distribution. In case (b), the limiting distribution of the ELR is a quadratic form in ξ, where ξ, of the same length as (α^τ, β^τ), is a standard multivariate normal random vector.
The relationship between the two competing models generally falls into three cases: (i) non-nested, (ii) nested, and (iii) overlapping. When the two models are nested, or are overlapping with q(y|x) ∈ F ∩ G, H_0 is equivalent to case (b), and a more complicated rejection principle based on the second part of Theorem 2.1 is needed to test (1). When the two models are non-nested, or are overlapping but q(y|x) ∉ F ∩ G, H_0 is equivalent to case (a), and the first part of Theorem 2.1 recommends rejecting H_0 when the ELR exceeds the (1 − α)th quantile of χ_1^2. We may adopt this method in all cases, although it may inflate the type I error in case (b). When H_0 is rejected, we recommend model F if the sample mean of the Z_i's is positive, and model G otherwise. Instead of the ELR, Vuong [26] proposed to test (1) using a Student's t-test statistic VT. The two totally different limiting behaviors of Vuong's test lead to two testing strategies, in which the critical value in the degenerate case is the α quantile of a quadratic form in ξ involving Γ, and Γ̂ is a root-n consistent estimator of Γ given later.
The variance estimation may cause the Vuong test to fluctuate dramatically when the two models are closely competing and the corresponding ω^2 and ω̂_n^2 are very close to zero. This results in size distortion, that is, type I errors at a distance from the nominal significance level. Vuong's two-step test suffers from the same problem. This calls for a bias-correction technique to improve the efficiency of Vuong's test.

Bias-corrected ELR
By local asymptotic theory, Shi [21] found that the size distortion of Vuong's tests is mainly caused by the biases in both the numerator and the denominator of the Vuong test statistic VT. A bias-corrected numerator is constructed in terms of tr(Γ̂) = tr(Â^{-1}B̂), where Â and B̂ are estimates of A and B, respectively. The bias in the denominator cannot be eliminated, but it can be diminished or adjusted: Shi [21] proposed to modify the denominator to ω̃_n^2 = ω̂_n^2 + c · tr(Γ̂^2)/n, where c is a tuning parameter. With these preparations, Shi [21] proposed to test (1) by her non-degenerate test (NDT) statistic. Both the constant c and the accompanying critical values are determined by a computation-intensive critical-value determination procedure; we refer the reader to Shi [21] for details.
In our theoretical analysis, we find that the ELR is equivalent to the squared Vuong statistic up to a negligible term. This implies that the ELR may also suffer from the size distortion problem, which is mainly due to the biases in the numerator and the denominator. For the Vuong statistic VT, one can tell clearly where the biases come from, because VT has an explicit fractional form whose numerator and denominator both have closed forms. However, neither the ELR nor its signed root R has a closed form; therefore Shi's [21] bias correction method does not apply to the ELR or its signed root R, because the source of the bias is unclear.
For simplicity, we ignore the bias in the numerator and define a bias-corrected signed root R_c of the ELR. The proposed testing rule is to reject H_0 if |R_c| > z_{1-α/2}. An immediate advantage of this test over Shi's test is its convenience in practice: it requires neither a tuning parameter nor a computation-intensive critical-value determination procedure. What is more, our simulations (see Section 4) indicate that the bias-corrected ELR test usually has comparable or even better performance than Vuong's test and Shi's NDT. As pointed out by an anonymous referee, one might instead correct the bias of the ELR test in Equation (2) by the strategy of Chen [4] and Vexler et al. [25], who proposed bias corrections for the t-test and the ELR test for a mean by carefully studying their Edgeworth expansions; the goal of those corrections is to improve the approximation accuracy of the type I error from O(n^{-1/2}) to o(n^{-1/2}). However, when this strategy is applied to the ELR test in Equation (2), deriving an Edgeworth expansion of R, and hence a bias-corrected ELR test, is formidable because the 'observations' Z_i are not iid. Hence, to keep the test simple and easy to use, we adopt R_c as the proposed test for model comparison.

Extension to moment-based models
The proposed ELR tests also apply to moment-based models. Suppose the two competing models are defined by moment restrictions indexed by parameters α and β, respectively. We define the profile empirical log-likelihood (up to a non-random constant) of α and β through the associated Lagrange multipliers λ_1(α) and λ_2(β), let α̂ be the maximizer over α, and let λ̂_1 = λ_1(α̂). We define β̂ and λ̂_2 = λ_2(β̂) similarly.
The empirical KL distance d̂(F, G) between the two moment models is then defined from these profile quantities. If models F and G are equally appropriate for fitting the data, then d̂(F, G) tends to be small; otherwise it should be at a distance from zero. We define the pseudo-true value α* of α and the pseudo-true value λ_{1*} = λ_{1*}(α*) of λ_1, and define β*, λ_{2*}(β) and λ_{2*} in the same way. The KL distance between the two moment models is then defined at the pseudo-true values, and the formal testing problem is whether this distance is zero. Let φ = (α, λ_1, β, λ_2)^τ, φ* = (α*, λ_{1*}, β*, λ_{2*})^τ and φ̂ = (α̂, λ̂_1, β̂, λ̂_2)^τ, and define ℓ_i(φ) accordingly. The testing problem is equivalent to testing E_0{ℓ_i(φ*)} = 0, and the proposed EL and EL_c tests apply to this problem directly by setting Z_i = ℓ_i(φ̂).
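To make the profiling step concrete, the sketch below computes the profile empirical log-likelihood ratio of a single over-identified moment model and minimizes it over the parameter. The moment restrictions E(Y − α) = 0 and E{(Y − α)^2 − α} = 0 (a hypothetical Poisson-type, mean-equals-variance model), the inner Newton solver, and the data-generating process are illustrative assumptions rather than the paper's specification; comparing two such models would then difference the resulting pointwise profile terms.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def inner_lambda(M, tol=1e-10, max_iter=100):
    """Damped Newton solver for the Lagrange multiplier maximizing
    sum_i log(1 + lam' m_i) subject to 1 + lam' m_i > 0 (moment rows m_i)."""
    lam = np.zeros(M.shape[1])

    def obj(l):
        w = 1.0 + M @ l
        return np.sum(np.log(w)) if np.all(w > 1e-8) else -np.inf

    for _ in range(max_iter):
        w = 1.0 + M @ lam
        grad = (M / w[:, None]).sum(axis=0)
        if np.abs(grad).max() < tol:
            break
        hess = -np.einsum('ij,ik,i->jk', M, M, 1.0 / w**2)
        step = np.linalg.solve(hess, -grad)   # Newton ascent direction
        t, cur = 1.0, obj(lam)
        for _ in range(60):                   # backtrack into the feasible region
            if obj(lam + t * step) > cur:
                break
            t *= 0.5
        lam = lam + t * step
    return lam

def profile_elr(y, alpha):
    """Profile empirical log-likelihood ratio of the over-identified model
    E(Y - alpha) = 0, E{(Y - alpha)^2 - alpha} = 0."""
    M = np.column_stack([y - alpha, (y - alpha)**2 - alpha])
    lam = inner_lambda(M)
    return 2.0 * np.sum(np.log1p(M @ lam))

rng = np.random.default_rng(3)
y = rng.poisson(2.0, size=500).astype(float)

# outer step: the fitted alpha minimizes the profile ELR
alpha_hat = minimize_scalar(lambda a: profile_elr(y, a),
                            bounds=(0.5, 5.0), method='bounded').x
```

Under these (correctly specified) restrictions, alpha_hat should be close to the true mean 2, and the minimized profile ELR plays the role of an over-identification statistic.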

Simulation study
In this section, we report Monte Carlo simulation results evaluating the performance of the two proposed ELR tests: the signed root R of the ELR (EL) and the bias-corrected signed root R_c of the ELR (EL_c). This is achieved by comparing them with three existing tests: the one-step Vuong test (1-step VT), the two-step Vuong test (2-step VT), and Shi's [21] NDT.
Example 4.1 (Normal regression; [21]): Suppose the true underlying data-generating process is a normal regression, where X_1 and X_2 are d_1- and d_2-variate covariates, (X_1^τ, X_2^τ, ε) follows the (d_1 + d_2 + 1)-variate standard normal distribution, and a_1, a_2 ∈ [0, 1]. With data generated from this process, the two competing models are fitted and the tests applied. Note: In each pair (p_1, p_2) reported in Table 1, p_1 denotes the probability of rejecting H_0 and supporting F, and p_2 the probability of rejecting H_0 and supporting G.
We generated 5000 data-sets from Example 4.1 under each of the 15 settings and computed the simulated rejection probabilities (in percentages) of the tests under consideration. In addition, we record the proportion (denoted Var.T) of rejections of the hypothesis that the variance of the likelihood ratio is zero. The simulation results are reported in Table 1.
The first panel of Table 1 gives the simulated type I errors of the five tests under comparison. Since the two competing models have equal goodness-of-fit for the data, ideally the probability that a test rejects the null hypothesis and recommends either model should be at most 2.5% at the 0.05 significance level. However, only the bias-corrected ELR and Shi's [21] NDT achieve this, and the former has closer-to-nominal one-sided type I errors than the latter, which is somewhat conservative. The original ELR and the two Vuong tests often have severely inflated one-sided type I errors, particularly when the parameter dimension is large (case d_2 = 19) or the sample size is small (case n = 100). This also implies that the EL does carry a non-negligible bias and that the EL_c succeeds in correcting it, leading to rather accurate type I errors.
Simulated power comparisons are presented in the second and third panels of Table 1, corresponding respectively to the (a_1, a_2) values given in cases (H2) and (H3). The power comparison is only meaningful between the bias-corrected ELR and Shi's [21] NDT, because only these two tests control their type I errors. Case (H2) is designed so that model F is better than model G; the EL_c test has uniformly larger power than the NDT in detecting this. We make similar observations in case (H3), which is designed so that model G is better than model F.
Example 4.2: This example is designed to mimic the model selection setting for the number of doctor visits in Section 5.1. We hope that the simulation results shed light on the performance of the tests under comparison and provide evidence of their relative efficiency.
Intuitively, when π is small, E{ℓ_i(φ*)} tends to be negative and the Geometric model fits better, while when π is large, E{ℓ_i(φ*)} tends to be positive and the Poisson model fits better. Based on extra-large samples (sample size 100,000), we found that E{ℓ_i(φ*)} = 0 when π = 0.875. In our simulations, the sample size is n = 100 and the number of replications is 2000. We generate data from the mixture model with π varying from 0.675 to 0.975 in increments of 0.02. When π < 0.875, model G fits the data better, while when π > 0.875, model F fits the data better. Our simulation results are tabulated in Table 2. The powers of the two-step VT are not reported because they are almost the same as those of the one-step VT.
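The mechanics of this example can be sketched as follows: fit the Poisson model F and the Geometric model G by maximum likelihood and inspect the sign of the mean log-likelihood ratio. The common mean, sample size, and seed below are illustrative assumptions; in particular, the crossing point π = 0.875 reported above depends on the paper's own mixture parameters, which this sketch does not reproduce.

```python
import numpy as np
from scipy import stats

def mean_loglik_ratio(y):
    """Mean log-likelihood ratio between the Poisson model F and the
    Geometric model G (support 0, 1, 2, ...), each fitted by MLE.
    Positive values favor Poisson; negative values favor Geometric."""
    lam = y.mean()                    # Poisson MLE
    p = 1.0 / (1.0 + y.mean())       # Geometric MLE for P(Y=k) = p(1-p)^k
    logf = stats.poisson.logpmf(y, lam)
    logg = stats.geom.logpmf(y + 1, p)   # scipy's geom is supported on 1, 2, ...
    return (logf - logg).mean()

rng = np.random.default_rng(2)
n, mu = 5000, 3.0                    # illustrative sample size and common mean

def sample_mixture(pi):
    """Draw from pi * Poisson(mu) + (1 - pi) * Geometric with mean mu."""
    take_poisson = rng.random(n) < pi
    return np.where(take_poisson,
                    rng.poisson(mu, n),
                    rng.geometric(1.0 / (1.0 + mu), n) - 1)

lbar_pois = mean_loglik_ratio(sample_mixture(1.0))   # pure Poisson data
lbar_geom = mean_loglik_ratio(sample_mixture(0.0))   # pure Geometric data
```

With pure Poisson data the mean ratio is positive, with pure Geometric data it is negative, and sweeping π between 0 and 1 traces out the sign change exploited in this example.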
When π = 0.875, that is, when the null hypothesis H_0 : E{ℓ_i(φ*)} = 0 holds, the powers (bold numbers) given in Table 2 are actually type I errors. We find that only the EL_c controls its type I error below the 5% significance level; its type I error is also the closest to 5%. The Vuong tests have the largest and excessive type I errors, as pointed out by Shi [21]. With the bias-correction strategy, the EL_c and Shi's NDT reduce the type I errors of the EL and Vuong's VT tests, respectively.
In terms of power, among all the tests the EL_c has the largest power when π < 0.875, where the Geometric model fits better, and the smallest power when π > 0.875, where the Poisson model fits better. Because of the type I error comparison above, only the power comparison between the EL_c and the NDT is meaningful.

Real-data analysis
We illustrate the proposed bias-corrected ELR test by analyzing a data set taken from the first 12 annual waves (1984 through 1995) of the German Socioeconomic Panel. This data set, studied by Greene [10] and Riphahn et al. [20], consists of 27,326 observations on 25 variables, including the number of doctor visits in the last three months (Docvis), the number of hospital visits in the last calendar year (Hospvis), and numerous other socio-demographic variables such as age (Age), education (Edu), household income (Income) and having kids or not (Kids). We choose y = Docvis or Hospvis, and x = (Age, Edu, Income, Kids). Following Example 14.10 of [10], we consider the model selection problem between the competing models (7) and (8) for the conditional probability of y given x. The histograms of Docvis and Hospvis are highly skewed: among all 27,326 values, there are 10,135 zeros in Docvis and 24,931 zeros in Hospvis. For a better presentation, in Figure 1 we display only the non-zero values of Docvis and Hospvis. The log-likelihood ratios ℓ_i(φ̂) for y = Docvis or Hospvis are also calculated and displayed in Figure 1.
We find that the variance estimates ω̂^2 = 1,276,948.35 (Docvis) and 86,986.38 (Hospvis) are both larger than the accompanying critical values 583.82 and 1137.52 at the 5% significance level. Thus the one- and two-step Vuong tests lead to the same decision. Table 3 presents the test statistics of the proposed ELR and bias-corrected ELR tests, the one-step Vuong test, and Shi's NDT.
The critical values proposed by Shi [21] for the NDT are 2.0229 and 2.1949 for y = Docvis and Hospvis, respectively. Clearly all four tests conclude that (1) the Poisson model and the Geometric model are not equally appropriate for the data at the 5% significance level, and (2) the Geometric model fits better, because the mean of ℓ_i(φ̂) is negative and a negative E{ℓ_i(φ*)} supports G. Meanwhile, the absolute value of the EL_c statistic is much larger than those of the VT and NDT statistics. Given that the critical values are all around 2, this implies that the proposed EL_c test provides much stronger evidence for the superiority of the Geometric model. Two observations make this conclusion unsurprising. On the one hand, according to our simulation experience in Example 4.2, the EL_c test has the most accurate type I error and the largest power in supporting the Geometric model among the four tests. On the other hand, Student-type tests rely to some extent on the normality of the likelihood ratios, while EL-type tests do not; we observe from Figure 1 that the likelihood ratios for y = Docvis and Hospvis are both severely skewed and far from normally distributed. It is therefore natural that the EL and EL_c tests have better power than the VT and NDT in this real-data analysis.