Testing the Number of Components in Normal Mixture Regression Models

Testing the number of components in finite normal mixture models is a long-standing challenge because of its nonregularity. This article studies likelihood-based testing of the number of components in normal mixture regression models with heteroscedastic components. We construct a likelihood-based test of the null hypothesis of m0 components against the alternative hypothesis of m0 + 1 components for any m0. The null asymptotic distribution of the proposed modified EM test statistic is the maximum of m0 random variables that can be easily simulated. The simulations show that the proposed test has very good finite sample size and power properties. Supplementary materials for this article are available online.


INTRODUCTION
Finite mixtures of normal distributions and regressions have been used in numerous empirical applications in diverse fields such as biological, physical, and social sciences, including economics and finance (see, e.g., Kon and Jen 1978;Quandt and Ramsey 1978;Kon 1984;Tucker 1992;Venkataraman 1997;Conway and Deb 2005). Mixture-of-expert models with normal component distribution (see, e.g., Jacobs et al. 1991) can also be viewed as finite mixture of normal regression models. Comprehensive theoretical accounts and examples of applications have been provided by several authors, including (Titterington, Smith, and Makov 1985;Lindsay 1995;McLachlan and Peel 2000).
The number of components is an important parameter in applications of finite mixture models. In economics, the number of components often represents the number of unobservable types or abilities. In other applications, the number of components signifies the number of clusters or latent classes in the data. Despite its importance, testing for the number of components in normal mixture regression models has been a long-standing unsolved problem because the standard asymptotic analysis of the likelihood ratio test (LRT) statistic breaks down due to problems such as nonidentifiable parameters and the true parameter being on the boundary of the parameter space. Numerous papers have been written on the subject of the LRT for the number of components (see, e.g., Ghosh and Sen 1985;Chernoff and Lander 1995;Lemdani and Pons 1997;Chen andChen 2001, 2003;Garel 2001Garel , 2005Chen, Chen, and Kalbfleisch 2004), and the asymptotic distribution of the LRT statistic for general finite mixture models has been derived as a functional of the Gaussian process (Dacunha-Castelle and Gassiat 1999;Liu and Shao 2003;Zhu and Zhang 2004;Azaïs, Gassiat, and Mercadier 2009).
Hiroyuki Kasahara, Vancouver School of Economics, University of British Columbia, 997-1873 East Mall, Vancouver, BC V6T 1Z1, Canada (E-mail: hkasahar@mail.ubc.ca). Katsumi Shimotsu, Faculty of Economics, University of Tokyo, Tokyo 113-0033 (E-mail: shimotsu@e.u-tokyo.ac.jp). The authors are grateful to the editor, the associated editor, and two anonymous referees whose comments greatly improved the article. This research is supported by the Natural Science and Engineering Research Council of Canada and JSPS Grant-in-Aid for Scientific Research (C) No. 26380267.
This article develops a likelihood-based testing procedure of the null hypothesis of m 0 components against the alternative hypothesis of m 0 + 1 components for a general m 0 ≥ 1 in heteroscedastic normal mixture regression models. To this end, we introduce a new reparameterization that substantially simplifies the analysis. Under this reparameterization, the log-likelihood function is locally approximated by a quadratic function of polynomials of parameters, and a standard analysis goes through with some adjustment using the results of Andrews (1999) and Zhu and Zhang (2006), who generalize Andrews (1999). We propose a modified EM test by building on this local quadratic representation and extending the EM approach pioneered by Li, Chen, and Marriott (2009) and Li and Chen (2010). The asymptotic null distribution of the proposed modified EM test statistic is the maximum of m 0 random variables, which can be easily simulated. In particular, when no regressor is present, the asymptotic null distribution is the maximum of m 0 chi-squared random variables with two degrees of freedom. Furthermore, the modified EM test does not suffer from the infinite Fisher information problem.
To the best of our knowledge, no likelihood-based test has yet been developed for testing the null hypothesis H 0 : m = m 0 with m 0 ≥ 1 against the alternative hypothesis H A : m = m 0 + 1 in normal regression mixtures. Chen and Li (2009) developed an EM test for m 0 = 1 in heteroscedastic normal mixtures, and Chen, Li, and Fu (2012) developed an EM test for testing H 0 : m = m 0 against H A : m > m 0 by splitting each component into two, thereby in effect testing against H A : m = 2m 0 , but neither Chen and Li (2009) nor Chen, Li, and Fu (2012) accommodated regressors. Shen and He (2015) develop an EM test for testing H 0 : m 0 = 1 in normal regression mixtures with covariate-dependent mixing proportions, in which the Fisher information matrix is regular. The test of Shen and He (2015) has a simpler limiting distribution, but its power is low when the mixing proportion does not depend on covariates. Our test is designed for such models, and as such, our test and theirs are mutually complementary.
Model selection procedures have been proposed for estimating the number of components in mixture models (see, e.g., Henna 1985;Lindsay and Roeder 1992;Windham and Cutler 1992;Roeder 1994;Chen and Kalbfleisch 1996;Keribin 2000;James, Priebe, and Marchette 2001;Miloslavsky and van der Laan 2003;Woo and Sriram 2006;Chen and Khalili 2008). As discussed by Chen, Li, and Fu (2012), while model selection procedures seek to find a parsimonious model that adequately describes the observed data, the number of components is often linked to scientific propositions, and a hypothesis test can be used to check their validity.
The remainder of this article is organized as follows. Section 2 introduces finite normal mixture regression models. Sections 3 and 4 establish the local quadratic approximation in testing the null hypothesis of m 0 components against the alternative of m 0 + 1 components. Section 5 introduces the modified EM test. Section 6 reports the simulation results, and empirical examples are provided in Section 7. The supplementary appendix contains proofs, auxiliary results, and additional empirical examples. All limits below are taken as n → ∞, unless stated otherwise. Let := denote "equals by definition." For a k × 1 vector a and a function f (a), let ∇ a f (a) denote the k × 1 vector of the derivative (∂/∂ a)f (a), and let ∇ aa f (a) denote the k × k vector of the derivative (∂/∂ a∂ a )f (a).
The regularity conditions for a standard asymptotic analysis fails in finite mixture models because (i) under H 01 , α is not identified, and the Fisher information matrix for the other parameters becomes singular; (ii) under H 02 , α is on the boundary of the parameter space, and either θ 1 or θ 2 is not identified. In addition to the failure of regularity conditions that is common to all finite mixture models, the normal mixture model (2) has additional undesirable mathematical properties, as discussed in Chen and Li (2009): (a) The Fisher information for testing H 02 is not finite unless the range of σ 2 1 /σ 2 2 is restricted. (b) The derivatives of f 2 (y|x, z; ϑ 2 ) of different orders are linearly dependent because ∇ μμ f (y|x, z; γ , θ , σ 2 ) = 2∇ σ 2 f (y|x, z; γ , θ , σ 2 ) (loss of strong identifiability). (c) The log-likelihood function is unbounded and the maximum likelihood estimate fails to exist (Kiefer and Wolfowitz 1956;Hartigan 1985).
Proposition 2. Suppose that Assumptions 1 and 2 hold. Then, under the null hypothesis H 0 : m = 1, for α ∈ (0, 1) and σ ∈ (0, 1), we have (a) for any δ > 0, lim sup n→∞ (10). The asymptotic null distribution of LR n ( 1 ) is characterized by the supremum of the quadratic-form representation of the log-likelihood ratio under the constraint implied by the limit of n as n → ∞. As shown in the proof of Proposition 3, the limit of n is given by the union of 1 λ and 2 λ , where q λ := 2 + 2q + q(q + 1)/2 and Partition S and W = I −1 S as S = (S η , S λ ) , W = (W η , W λ ), where S η and W η are (p + q + 2) × 1, and S λ and W λ are q λ × 1. Define (15) The following proposition establishes the asymptotic null distribution of the LRT statistic. When the model does not have a conditioning variable X, the set of admissible values of t λn converges to R 2 , and LR n ( 1 ) converges to χ 2 (2) in distribution.
When X contains dummy variables that take value 0 or 1, Assumption 2(b) fails because a dummy variable and its square in U 2 are perfectly correlated with each other. The following proposition handles such cases. Define whose rows are the basis of the orthogonal complement of the row space of B.

LOCAL QUADRATIC APPROXIMATION FOR TESTING H
In this section, we develop a local quadratic approximation for testing the null hypothesis of m 0 components against the alternative of m 0 + 1 components for general m 0 ≥ 1. We consider a random sample of n independent observa- We assume (θ * 1 , σ 2 * 1 ) < · · · < (θ * m 0 , σ 2 * m 0 ) for identification. Let the density of an (m 0 + 1)-component mixture model be where and H 0,2h : α h = 0. The inequality constraints are imposed on (θ j , σ 2 j ) for identification.
We focus on testing H 01 because (i) the LRT statistic for testing H 02 has infinite Fisher information unless a stringent restriction is imposed on the admissible values of σ 2 j , and (ii) implementing the LRT for H 02 is practically difficult because the asymptotic null distribution depends on the functional of a multidimensional Gaussian process. Define the set of values of ϑ m 0 +1 that yields the true density (16) as (17) generates the true m 0 -component density (16) and define ϒ * 1 := ϒ * 11 ∪ · · · ∪ ϒ * 1m 0 . Similar to Section 3, we consider the MLE in the constrained parameter space . . , m 0 + 1, and define the LRT statistic for testing H 01 as , andθ m 0 := arg max ϑ m 0 ∈ ϑm 0 ( σ ) L 0,n (ϑ m 0 ). Collect the score vector for testing H 0,11 , . . . , H 0,1m 0 into one vector as The following proposition gives the asymptotic null distribution of the LRT statistic for testing H 01 . In the neighborhood of ϒ * 1h , the log-likelihood function permits a similar quadratic approximation to the one we derived in Section 3. Consequently, the LRT statistic is asymptotically distributed as the maximum of m 0 random variables, each of which is the maximum of two random variables.
In the remainder of this section, we derive a necessary and sufficient condition under which the LRT statistic for testing H 02 has finite Fisher information. For brevity, we focus on the case without (X, Z). The score for testing H 0,2h : . Define the subset of ϒ * corresponding to H 0,2h : α h = 0 as ϒ * 2h := {ϑ m 0 +1 ∈ ϑ m 0 +1 : is not identified when α h = 0, the Fisher information of the LRT for testing H 0,2h : α h = 0 depends on the supremum of the variance of ∇ α h ln f m 0 +1 (Y i ; ϑ m 0 +1 ) over ϑ m 0 +1 ∈ ϒ * 2h . As shown in the following proposition, the Fisher information is infinite, unless a stringent restriction is imposed on the admissible values of σ 2 j .

Proposition 8. Suppose that Assumptions 1 and 2 and H n
λ.η ) and lett λn be an estimator of t λn in (11). Definet j λ, in the same manner ast j λ in (15) but using W λ + in place of W λ . Then, (a) LR 1,n ( 1 ),

Choice of Penalty Function
To apply our modified EM test, we need to specify the set T , number of iterations K, and penalty function for p n (σ 2 ). Based on our experience, we recommend T = {0.5} and K = 2 or 3; following the recommendation given by Chen, Li, and Fu (2012), we set p n σ 2 j ;σ 2 j = −a n σ 2 whereσ 2 j is the estimate from the m 0 -component model. p n (σ 2 j ;σ 2 j ) satisfies Assumptions 5 and 6 if a n = o p (n 1/4 ). In models with a regressor, we use an additional restriction σ j ≥ 0.01σ j , which does not change the theoretical results but improves finite sample performance.
For the model with a conditioning variable X, the value of a n that gives accurate Type I errors is sensitive to the dimension of X. For testing H 0 : m = 1, we choose the value of a n depending on the dimension of X so that Type I errors are accurate at n = 200 using 10, 000 replications. For the cases where m 0 = 2 and m 0 = 3, developing a data-dependent empirical formula for a n is difficult, and therefore, we set the value of a n as in Table 4 and use both asymptotic and bootstrap critical values.

Simulation Results
For the model without conditioning variable X, we examine the finite sample performance of our proposed modified EM test by simulations and compare it with that of the EM test of Chen, Li, and Fu (2012) [CLF, hereafter]. Computation was done using R (R Core Team 2014). We use 5, 000 replications, and the sample sizes are set to 200 and 400. The m 0 -component estimateθ m 0 in (21) is computed with the penalty function p n (σ 2 j ) defined in (22), where a n = 1/n andσ 2 j is set to the sample variance of the data for all j.
We simulated Type I error rates for 12 null models with orders 2 and 3 that are the same as in CLF, as specified in Table 1. The simulation results for H 0 : m = 2 are summarized in Figure 1. Overall, the finite sample size properties of the modified EM test are very good and similar to those of the EM test of CLF, even though the size of the modified EM test tends to have more dispersion across models. Figure 2 reports the simulation results for H 0 : m = 3. The results are similar to those for H 0 : m = 2. We examined the power of the modified EM test by considering (−2.5, 0, 2.5) (1, 1, 1 In the α j columns, A refers to (α 1 , α 2 , α 3 ) = (1/3, 1/3, 1/3) and B refers to (α 1 , α 2 , α 3 ) = (0.4, 0.2, 0.4).
10 alternative models with order 3, and 8 alternative models with order 4. The simulation results are summarized in Tables 2  and 3. In both tables, the first four models are the same as those used in CLF. In many cases, our modified EM test has stronger power than the EM test of CLF. The power difference is the most significant when σ j 's are heterogeneous.
We simulated Type I error rates for 16 null models as specified in Table 7 with orders 2 and 3 and with a scalar conditioning variable. Figures 3 and 4 summarize the simulation results, where bootstrap critical values with 199 bootstrap replications and asymptotic critical values are used. When compared with Figures 1 and 2, the Type I error rates vary more across the models. When testing H 0 : m = 2, the bootstrap test performs well, while the asymptotic test is oversized in some models when n = 200. When testing H 0 : m = 3, the modified EM test performs fairly well, even though it tends to overreject and underreject when n = 200 and n = 400, respectively. Tables 8 and 9 report power simulation results when testing H 0 : m = 2 and H 0 : m = 3, respectively. The modified EM test has a good power under both bootstrap and asymptotic critical values.

Analysis of Stock Returns
Using 4639 observations of daily returns for 30 stocks in the Dow Jones Industrial Average, Kon (1984) estimated a finite mixture of normal distributions. Kon (1984) selected the number of components using LRT, but based on the invalid chi-squared asymptotic distribution. We reestimated the number of components by sequentially applying the modified EM test with K = 2 to test m = k against the alternative m = k + 1 for k = 1, 2, 3 at the (1/3) × 5% significance level. Table 10 shows the frequency of the number of components selected by Kon (1984), the modified EM test, Akaike information criteria (AIC), and Bayesian information criteria (BIC). In Kon's dataset, some stocks return data contain a substantial number of observations with y i = 0.0, which leads to degeneracy (i.e, μ j = 0 andσ j → 0). To deal with this problem, we discarded estimates such that max j (σ j )/ min j (σ j ) > 100. Kon (1984) often chooses a two-component model over other procedures. Compared with the modified EM test, AIC tends to select a larger number of components, whereas BIC selects a model with fewer components.
whereas the two-component model estimated that α 2 comprised approximately 35.6% of the genes, and classified more genes as overexpressed compared to the three-component model.

Spread of a Viral Infection in Potato Plants
Turner (2000) used normal regression mixtures to model the spread of a viral infection in potato plants by aphids. The dataset is from an experiment described in Boiteau et al. (1998). In each experiment, a grid of 81 potato plants was placed on the floor of a flight chamber, and varying number of aphids between   1 and 320 were released near the center of the grid, and the plants were taken out to measure the number of infected plants after one day. A total of 51 such experiments were conducted. The response variable is the number of infected plants, and the explanatory variables are the constant and the number of aphids. When testing H 0 : m = 1 against H A : m = 2, the asymptotic p-values of the modified EM test are 0.000 at K = 2 and 3, providing strong evidence against the one-component model. On the other hand, when testing H 0 : m = 2 against H A : m = 3, the bootstrapped p-values of the modified EM test are 0.142 and 0.154 at K = 2 and 3, respectively. Therefore, a two-component model is not rejected at the 10% significance level.