Testing quantitative trait locus effects in genetic backcross studies with double recombination occurring

Testing the existence of quantitative trait locus (QTL) effects is an important task in QTL mapping studies. In this paper, we assume that the phenotype distributions come from a location-scale distribution family, and we consider testing the QTL effects in both location and scale in backcross studies in which double recombination may occur. Without the equal-scale assumption, the log-likelihood function is unbounded, which invalidates the traditional likelihood ratio test. To deal with this problem, we propose a penalized likelihood ratio test (PLRT) for testing the QTL effects. The null limiting distribution of the PLRT is shown to be the supremum of a chi-square process. As a complement, we also investigate the null limiting distribution of the likelihood ratio test under the equal-scale assumption. The limiting distributions of the two tests under local alternatives are also studied. Simulation studies are performed to evaluate the asymptotic results, and a real-data example is given for illustration.


Introduction
Quantitative trait loci (QTL) mapping is a crucial tool for dissecting the genetic factors that affect the variation of many quantitative traits in humans, plants and animals. Testing the existence of a QTL effect is the starting point of QTL mapping studies. If the QTL effect does exist, we proceed to identify the location of the QTL and estimate its genetic effect.
In QTL mapping studies, the interval mapping method developed by Lander and Botstein [10] has been popular for testing the existence of a QTL effect ever since it was proposed. Let $M$ and $N$ be two markers on a chromosome, and suppose that the genotypes of the two markers and the phenotype data of interest can be observed. Testing the existence of a putative QTL, say $Q$, between $M$ and $N$ is referred to as QTL interval mapping. In this paper, we consider testing problems in a backcross design in which double recombination may occur. In this design, the possible genotypes are $MM$ and $Mm$ at $M$, $NN$ and $Nn$ at $N$, and $QQ$ and $Qq$ at $Q$. For each progeny, the genotype of the two flanking markers is $MM/NN$, $MM/Nn$, $Mm/NN$ or $Mm/Nn$. The phenotype data can therefore be divided into four groups according to the marker genotypes, and the groups are independent of each other. Let $\{y_{ij}, j = 1, \ldots, n_i\}$ $(i = 1, \ldots, 4)$ denote the phenotype data corresponding to the four marker genotypes. Denote by $r$, $r_1$ and $r_2$ the recombination frequencies between $M$ and $N$, between $M$ and $Q$, and between $N$ and $Q$, respectively. Assuming no interference and allowing double recombination, we have $r = r_1 + r_2 - 2r_1r_2$, and hence $r_2 = (r - r_1)/(1 - 2r_1)$. The two markers $M$ and $N$ are observed, hence $r$ is known, while $r_1$ and $r_2$ are unknown. Let $f_1$ and $f_2$ be the phenotype probability density functions corresponding to $QQ$ and $Qq$. According to [6] or [21], interval mapping in this design is described by model (1), under which each of the four samples follows a two-component mixture of $f_1$ and $f_2$ with mixing proportions determined by $r_1$. Testing the existence of the QTL effect is equivalent to testing

$H_0: f_1 = f_2$ versus $H_1: f_1 \neq f_2$. (2)

In recent decades, many studies have investigated the testing problem (2) using the likelihood ratio test (LRT). Assuming that $f_1$ and $f_2$ are normal densities with equal variance, Chen and Chen [1] established the consistency of the maximum likelihood estimators of the parameters and the asymptotic properties of the log-likelihood ratio. Wu et al. [20] explored the LRT for the problem with $f_1$ and $f_2$ being one-parameter kernel functions. Zhang et al. [22] also investigated the LRT for the problem under a finite non-linear regression mixture model, in which the kernel function is determined by one parameter and some random covariates.
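The recombination relation and the mixture structure described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the function names are hypothetical, the components are taken to be normal for concreteness, and the mixing weights are the standard backcross conditional probabilities of genotype $QQ$ given the flanking marker genotype under no interference, which are assumed here to correspond to model (1).

```python
import random


def r2_from(r, r1):
    """Recombination frequency between Q and N, solved from r = r1 + r2 - 2*r1*r2."""
    return (r - r1) / (1.0 - 2.0 * r1)


def qq_weights(r, r1):
    """P(QTL genotype is QQ | marker genotype), no interference.

    Order of groups: MM/NN, MM/Nn, Mm/NN, Mm/Nn. These are the standard
    backcross conditional probabilities (an assumption, see lead-in).
    """
    r2 = r2_from(r, r1)
    return [
        (1 - r1) * (1 - r2) / (1 - r),  # no recombination on either side
        (1 - r1) * r2 / r,              # recombination between Q and N only
        r1 * (1 - r2) / r,              # recombination between M and Q only
        r1 * r2 / (1 - r),              # double recombination
    ]


def simulate(n, r, r1, mu1, sigma1, mu2, sigma2, seed=0):
    """Draw n phenotypes, grouped by marker genotype, with normal f1 and f2."""
    rng = random.Random(seed)
    w = qq_weights(r, r1)
    # Marker-genotype frequencies in a backcross: (1-r)/2, r/2, r/2, (1-r)/2.
    p = [(1 - r) / 2, r / 2, r / 2, (1 - r) / 2]
    groups = [[], [], [], []]
    for _ in range(n):
        i = rng.choices(range(4), weights=p)[0]
        if rng.random() < w[i]:
            y = rng.gauss(mu1, sigma1)  # genotype QQ, density f1
        else:
            y = rng.gauss(mu2, sigma2)  # genotype Qq, density f2
        groups[i].append(y)
    return groups
```

With double recombination allowed, the first and fourth groups are themselves mixtures: `qq_weights(0.2, 0.1)[3]` is small but positive, so a few $Mm/Nn$ progeny carry $QQ$.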
Many QTL studies have investigated the LRT for testing QTL effects under the assumption that no double recombination occurs, so their models differ from model (1). Among them, Kim et al. [8] derived the limiting distribution of the LRT statistic for the one-parameter exponential family mixture model. Recently, in consideration of the economic value of a QTL effect in the variance [9,19], Liu et al. [12] used the LRT to test the QTL effect in both location and scale under a location-scale distribution family, and established the null limiting distribution together with an explicit representation of it. We emphasize again that the LRTs in Liu et al. [12] are established under no double recombination in the backcross design, while in this paper we allow double recombination. The assumption of no double recombination implies that the first and the fourth samples in model (1) come from $f_1$ and $f_2$, respectively, rather than from a mixture of $f_1$ and $f_2$. Consequently, our model in (1) is different from that in Liu et al. [12].
In this paper, we assume that the phenotype distributions come from a location-scale distribution family and, motivated by Liu et al. [12], we consider testing the QTL effects in both location and scale. Specifically, in model (1), we assume $f_h(y) = f(y; \mu_h, \sigma_h)$ with $f(y; \mu, \sigma) = \sigma^{-1} f((y - \mu)/\sigma; 0, 1)$, where $f(y; 0, 1)$ is a known probability density function, and $\mu$ and $\sigma$ are the location and scale parameters, respectively. Then, the testing problem in (2) becomes

$H_0: (\mu_1, \sigma_1) = (\mu_2, \sigma_2)$ versus $H_1: (\mu_1, \sigma_1) \neq (\mu_2, \sigma_2)$. (3)

However, the testing problem in (3) under model (1) with the location-scale distribution assumption is very challenging because of some undesirable properties of mixture models. For example, the likelihood function is unbounded [3,4,17], so the maximum likelihood estimators (MLEs) are not well defined. We emphasize that the likelihood function under the model considered in Liu et al. [12] is bounded. Because of the unboundedness, classical testing methods such as the LRT [5,13] cannot be applied directly. To deal with the problem, motivated by the ideas in [2,4], we construct a penalized log-likelihood function; maximizing it yields well-defined penalized MLEs of the unknown parameters. We then propose a penalized LRT (PLRT) for the testing problem in (3). Deriving the null limiting distribution of the PLRT is technically challenging. A key step is to show the consistency of the penalized MLEs under the null model. Existing consistency results, such as those of Chen et al. [4] and Tanaka [17], cannot be used directly because our model is different from theirs. Despite the rather complicated derivations, we establish the consistency of the penalized MLEs (Lemma 1 in the supplementary material). Based on this consistency, we show that the null limiting distribution of the PLRT is the supremum of a chi-square process.
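The unboundedness of the likelihood, and the way a penalty on the scale restores boundedness, can be seen numerically on a toy normal mixture. This is a sketch under stated assumptions: the data, the mixing weight 0.5, the tuning constant `a_n` and the penalty form $-a_n\{s^2/\sigma^2 + \log(\sigma^2/s^2)\}$ are illustrative choices of a kind used in the normal-mixture literature, not necessarily the penalty adopted in this paper.

```python
import math


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(ys, w, mu1, sigma1, mu2, sigma2):
    """Log-likelihood of the mixture w*N(mu1, sigma1^2) + (1-w)*N(mu2, sigma2^2)."""
    return sum(math.log(w * normal_pdf(y, mu1, sigma1)
                        + (1.0 - w) * normal_pdf(y, mu2, sigma2)) for y in ys)


def penalty(sigma, s2, a_n):
    """Assumed penalty; it tends to -infinity as sigma tends to 0."""
    return -a_n * (s2 / sigma ** 2 + math.log(sigma ** 2 / s2))


ys = [-1.2, -0.4, 0.1, 0.7, 1.5]            # toy data
n = len(ys)
mean = sum(ys) / n
s2 = sum((y - mean) ** 2 for y in ys) / n   # sample variance
a_n = 1.0 / n                               # hypothetical tuning constant


def plain(sigma1):
    # Centre component 1 on the first observation and shrink its scale.
    return mixture_loglik(ys, 0.5, ys[0], sigma1, 0.0, 1.0)


def penalized(sigma1):
    return plain(sigma1) + penalty(sigma1, s2, a_n) + penalty(1.0, s2, a_n)


for sigma1 in (1e-1, 1e-3, 1e-6):
    print(sigma1, round(plain(sigma1), 2), round(penalized(sigma1), 2))
```

As `sigma1` shrinks, the plain log-likelihood keeps growing (roughly like $-\log \sigma_1$), while the penalized version falls sharply, so its maximizer stays away from $\sigma_1 = 0$.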
As a complement, we also study the asymptotic properties of the LRT for the existence of a QTL effect in the location only (i.e. f 1 and f 2 have the same unknown scale parameter). The asymptotic properties of the tests under a local alternative are also investigated. The rest of the paper is organized as follows. The PLRT, the LRT designed under the same unknown scale parameter, and their asymptotic properties are given in Section 2. Section 3 investigates the finite-sample performance of the PLRT and the LRT via simulation studies. A real-data example is given in Section 4 for the illustration of the proposed tests. Section 5 concludes the paper with some discussions. For clarity, the proofs are provided in the Appendix and the supplementary material.
We reject $H_0$ when $R_n$ exceeds a critical value determined by the null limiting distribution of $R_n$. Before presenting the null limiting distribution, we introduce some notation. Let $n = \sum_{i=1}^{4} n_i$ be the total sample size. We assume that $n_i/n$ converges to a constant $p_i > 0$, $i = 1, \ldots, 4$. In the genetic backcross studies described in Section 1, the $p_i$'s are related to $r$, the recombination frequency between the two markers $M$ and $N$, in the following way (see [21]): $p_1 = p_4 = (1 - r)/2$ and $p_2 = p_3 = r/2$. Let $z_{hk}$ $(h, k = 1, 2)$ be independent and identically distributed standard normal random variables, and define from them the processes $Z_h(r_1)$, $h = 1, 2$. It is clear that $\{Z_h(r_1) : r_1 \in [0, r]\}$ $(h = 1, 2)$ are independent Gaussian processes.
The following theorem shows the root-n consistency of the penalized MLE of (μ 1 , μ 2 , σ 1 , σ 2 ) and the limiting distribution of R n . For presentational continuity, we put the long proof in the Appendix and the supplementary material.
The commonly used location-scale distributions, such as the normal, logistic, extreme-value and $t$ distributions, all satisfy Conditions A1-A7. In each case, as long as $p_n(\sigma)$ satisfies Conditions C1-C4, the PLRT and its asymptotic properties are applicable.
As a complement, we further consider the asymptotic properties of the LRT under the assumption that $\sigma_1 = \sigma_2 = \sigma$. In this case, $l_n(r_1, \mu_1, \mu_2, \sigma, \sigma)$ is bounded, hence penalty functions are not needed, and the LRT statistic, denoted $R_n^*$, is defined from the unpenalized log-likelihood. The next theorem studies the asymptotic behaviour of $(\hat\mu_1^*, \hat\mu_2^*, \hat\sigma^*)$ and $R_n^*$. Its proof is similar to that of Theorem 2.1 and is omitted to save space.

Theorem 2.2: Assume the same conditions in Theorem
Theorem 2.2 extends the results of Chen and Chen [1] to location-scale distribution families. Unlike the results in Chen and Chen [1], Theorem 2.2 does not require the parameter space to be compact.
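To make the equal-scale LRT concrete, the statistic can be approximated numerically as twice the gap between the maximized log-likelihood over $(r_1, \mu_1, \mu_2, \sigma)$ and the maximized null log-likelihood. The sketch below is only illustrative: it assumes normal components, the standard backcross mixing weights (assumed to match model (1)), a grid over $r_1 \in (0, r)$, and an EM iteration for fixed $r_1$. The function names, the grid and the starting values are hypothetical choices, not the authors' procedure.

```python
import math


def qq_weights(r, r1):
    """P(QQ | marker genotype) for MM/NN, MM/Nn, Mm/NN, Mm/Nn (assumed form)."""
    r2 = (r - r1) / (1.0 - 2.0 * r1)
    return [(1 - r1) * (1 - r2) / (1 - r), (1 - r1) * r2 / r,
            r1 * (1 - r2) / r, r1 * r2 / (1 - r)]


def norm_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def loglik(groups, w, mu1, mu2, sigma):
    """Log-likelihood with a common scale sigma for both components."""
    return sum(math.log(w[i] * norm_pdf(y, mu1, sigma)
                        + (1 - w[i]) * norm_pdf(y, mu2, sigma))
               for i in range(4) for y in groups[i])


def em_fit(groups, w, mu1, mu2, sigma, iters=200):
    """EM updates for (mu1, mu2, sigma) with the mixing weights w held fixed."""
    n = sum(len(g) for g in groups)
    for _ in range(iters):
        post = []  # pairs (posterior probability of QQ, observation)
        for i in range(4):
            for y in groups[i]:
                a = w[i] * norm_pdf(y, mu1, sigma)
                b = (1 - w[i]) * norm_pdf(y, mu2, sigma)
                post.append((a / (a + b), y))
        t_sum = sum(t for t, _ in post)
        mu1 = sum(t * y for t, y in post) / t_sum
        mu2 = sum((1 - t) * y for t, y in post) / (n - t_sum)
        ss = sum(t * (y - mu1) ** 2 + (1 - t) * (y - mu2) ** 2 for t, y in post)
        sigma = math.sqrt(ss / n)
    return mu1, mu2, sigma


def lrt_equal_scale(groups, r, grid_size=9):
    """Grid-and-EM approximation to the equal-scale LRT statistic."""
    ys = [y for g in groups for y in g]
    n = len(ys)
    mean = sum(ys) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in ys) / n)
    null_ll = sum(math.log(norm_pdf(y, mean, sd)) for y in ys)
    # Start from the means of the first and fourth groups (must be non-empty).
    start1 = sum(groups[0]) / len(groups[0])
    start2 = sum(groups[3]) / len(groups[3])
    best = null_ll  # the null model is nested in the alternative
    for k in range(1, grid_size + 1):
        w = qq_weights(r, r * k / (grid_size + 1))
        mu1, mu2, sigma = em_fit(groups, w, start1, start2, sd)
        best = max(best, loglik(groups, w, mu1, mu2, sigma))
    return 2.0 * (best - null_ll)
```

The critical value for such a statistic would come from the limiting distribution in Theorem 2.2, not from a standard chi-square table; the sketch only computes the statistic itself.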

Asymptotic properties under local alternatives
In this section, we investigate the asymptotic properties of the PLRT and the LRT under the local alternative hypothesis $H_A^n$, in which $\delta_\mu$ and $\delta_\sigma$ are positive constants and $\sigma_0 > \delta_\sigma$. Let $\chi^2_m(c)$ denote the noncentral chi-squared distribution with non-centrality parameter $c$ and $m$ degrees of freedom.
with respect to $f(y; 0, 1)$, and let $A$ be the variance-covariance matrix of $(T, U)$ with respect to $f(y; 0, 1)$, where $(T, U)$ is defined in Condition A5. The definitions of $\alpha_0$ and $\theta_0$ are the same as those of $\alpha$ and $\theta$ with $r_1$ replaced by $r_{10}$. For convenience of presentation, the proof of Theorem 2.3 is deferred to the Appendix. Theorem 2.3 indicates that the two tests $R_n$ and $R_n^*$ are both consistent under the local alternative $H_A^n$ in (6). Note that $\delta_\sigma$ appears in the limiting distribution of $R_n$ but not in that of $R_n^*$. Hence, we expect $R_n$ to be more powerful than $R_n^*$ when $\sigma_1$ and $\sigma_2$ differ substantially, i.e. when $\delta_\sigma$ is far from 0. This is confirmed in the simulation study.
First, we show the empirical type I errors of $R_n$ and $R_n^*$ in Table 1. Their critical values are calculated from the limiting distributions in Theorems 2.1-2.2. For the empirical type I errors, data are generated from the normal distribution $N(0, 1)$ and the logistic distribution $\mathrm{Logistic}(0, 1)$. All the empirical type I errors are calculated based on 10,000 repetitions.

Table 1. Empirical type I errors of $R_n$ and $R_n^*$. The random samples are generated from model (1) with $f_1 = f_2 = N(0, 1)$ and with $f_1 = f_2 = \mathrm{Logistic}(0, 1)$.

From Table 1, we see that when the sample size is small (e.g. $n = 50$), $R_n$ and $R_n^*$ tend to have inflated empirical type I errors, while for larger sample sizes (e.g. $n \geq 100$) their empirical type I errors are quite close to the nominal levels. This implies that the null limiting distributions of $R_n$ and $R_n^*$ approximate their finite-sample distributions well as long as the sample size is not small.
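When judging whether an empirical type I error is "close to" its nominal level, it helps to know the Monte Carlo noise implied by the number of repetitions. The following sketch computes the binomial standard error of an estimated rejection rate; the function name is hypothetical and the calculation is a standard one, not something reported in the paper.

```python
import math


def mc_standard_error(p, reps):
    """Binomial standard error of a rejection rate p estimated from reps runs."""
    return math.sqrt(p * (1.0 - p) / reps)


# Type I error at the nominal 5% level with 10,000 repetitions.
se_size = mc_standard_error(0.05, 10_000)
# Worst-case (p = 0.5) standard error of a power estimate from 1000 repetitions.
se_power = mc_standard_error(0.5, 1_000)
print(f"{100 * se_size:.2f}%  {100 * se_power:.2f}%")
```

With 10,000 repetitions the standard error at the 5% level is about 0.22 percentage points, so deviations of a few tenths of a percentage point from the nominal level are within simulation noise.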
Second, we study the power of the proposed tests. For comparison, the multiple-sample Anderson-Darling test ([15]; denoted AD) and the two LRTs of Liu et al. [12], designed under no double recombination, are chosen as competitors. Denote by $\tilde R_n$ and $\tilde R_n^*$ the two LRTs in Liu et al. [12] for $\sigma_1 \neq \sigma_2$ and $\sigma_1 = \sigma_2$, respectively. Data are generated from model (1). Let $d_1$ be the distance between the putative QTL and one of the two flanking markers, so that $r_1 = 0.5(1 - e^{-2d_1/100})$. We consider two values for $d_1$, $0.5d$ and $0.25d$, and the following combinations of $f_1$ and $f_2$.
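The map from distance to recombination frequency used above, $r_1 = 0.5(1 - e^{-2d_1/100})$, is Haldane's map function with distance in centimorgans. The sketch below evaluates it and checks that it is consistent with the no-interference relation $r = r_1 + r_2 - 2r_1r_2$ from Section 1. The marker distance of 20 cM and the reading of $d$ as the distance between the two flanking markers are assumptions made only for this example.

```python
import math


def haldane(d_cm):
    """Haldane map function: recombination frequency for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


d = 20.0  # hypothetical distance between the flanking markers, in cM
for frac in (0.5, 0.25):
    d1 = frac * d
    r = haldane(d)
    r1 = haldane(d1)       # between M and the putative QTL
    r2 = haldane(d - d1)   # between the putative QTL and N
    # Under no interference the three frequencies satisfy r = r1 + r2 - 2*r1*r2.
    print(frac, round(r1, 4), round(r2, 4), abs(r - (r1 + r2 - 2 * r1 * r2)) < 1e-12)
```

The identity holds exactly because $1 - 2r = (1 - 2r_1)(1 - 2r_2)$ under Haldane's function, which is why $r_2$ can be recovered from $r$ and $r_1$ as $(r - r_1)/(1 - 2r_1)$.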
The QTL affects only the location in Cases I and IV; both the location and scale in Cases II and V; and only the scale in Cases III and VI. To save space, we present the results only for $n = 100$ and $\alpha = 5\%$. The simulated powers of $R_n$, $R_n^*$, $\tilde R_n$, $\tilde R_n^*$ and AD are presented in Tables 2 and 3. From Table 1 and the simulation results in Liu et al. [12], we know that the null limiting distributions of the five tests approximate their finite-sample distributions well when the sample size $n \geq 100$. In view of this, all the critical values are determined from the null limiting distributions. All the power calculations are based on 1000 repetitions.

Table 3. Power (%) comparison of $R_n$, $R_n^*$, $\tilde R_n$, $\tilde R_n^*$ and AD. The random samples are generated from model (1), in which $f_1 = \mathrm{Logistic}(0, 1)$ and $f_2 = \mathrm{Logistic}(0.8, 1)$ in Case IV, $f_1 = \mathrm{Logistic}(0, 1)$ and $f_2 = \mathrm{Logistic}(0.8, 1.35)$ in Case V, and $f_1 = \mathrm{Logistic}(0, 1)$ and $f_2 = \mathrm{Logistic}(0, 1.5)$ in Case VI. The significance level is $\alpha = 5\%$, and the sample size is $n = 100$.

From Tables 2 and 3, we see that in most cases $R_n$ and $R_n^*$ are more powerful than $\tilde R_n$ and $\tilde R_n^*$, respectively. In Cases I and IV, i.e. when the QTL affects only the location, $R_n^*$ is more powerful than $\tilde R_n$, while in Cases II-III and V-VI, i.e. whenever the QTL affects the scale, $\tilde R_n$ is more powerful than $R_n^*$. In Cases III and VI, the QTL affects only the scale and $R_n^*$ has almost no power. In all cases, the tests $R_n$, $R_n^*$, $\tilde R_n$ and $\tilde R_n^*$ are more powerful than the AD test.

Real data
In this section, we apply our method to a real-data example. The data can be found in the R package qtl and concern bristle number in chromosome X recombinant isogenic lines of Drosophila melanogaster. There are 92 chromosome X recombinant isogenic lines, derived from inbred lines that were selected for low (A) and high (B) abdominal bristle numbers. Each line is typed at 17 genetic markers on chromosome X, which gives 16 intervals. For each line, the average abdominal bristle number of females was recorded. Our goal is to test for the existence of genes on chromosome X that affect the abdominal bristle number of females. Here, we choose the normal distribution as the component in model (1). We use the tests $R_n$, $R_n^*$, $\tilde R_n$ and $\tilde R_n^*$ to test the QTL effects on the 16 intervals. The p-values of the four tests are summarized in Table 4. The results in Table 4 show strong evidence of QTL effects on intervals No. 6-7, 9-11 and 14. From the four p-values on intervals No. 9-11, we conclude that a QTL effect exists in both mean and variance. On interval No. 14, the p-value of $R_n$ (or $\tilde R_n$) is very small, while that of $R_n^*$ (or $\tilde R_n^*$) is large, which implies that the QTL effect exists only in the variance. On intervals No. 10 and No. 14, the p-values of $R_n$ are slightly smaller than those of $\tilde R_n$, which suggests that double recombination may occur in meiosis.

Conclusions and discussions
In this paper, we develop a PLRT to test the QTL effect in both location and scale when double recombination may occur. The null limiting distribution of the PLRT is shown to be the supremum of a chi-square process. Establishing this limiting distribution is very challenging, because deriving the consistency of the penalized MLEs is quite complicated. As a complement, we also investigate the null limiting distribution of the likelihood ratio test for testing the QTL effect in location only. The limiting distributions of the two tests under local alternatives are also studied. Simulation studies and a real-data analysis show that the proposed tests perform well.
In this paper, we deal with the unboundedness of the likelihood by imposing penalty functions on the scale parameters. An alternative is to confine $\sigma_h$ $(h = 1, 2)$ to $[d_0, \infty)$, with $d_0$ being some small constant [7,14]. However, this method may suffer from two problems: (1) the true parameter may not, at least theoretically, satisfy the constraint imposed on $\sigma_h$, and if the true parameter lies outside the constraint, the MLE is no longer consistent (see [4]); and (2) the tuning parameter $d_0$ must be selected, which complicates the likelihood ratio test.
When $f$ is unknown, nonparametric or semi-parametric testing methods may be needed. This topic is beyond the scope of this article and warrants future research.

Funding
Next, we list the regularity conditions for the penalty function $p_n(\sigma)$.
We allow $p_n$ to depend on the data. To ensure that the test $R_n$ is location-scale invariant, we recommend choosing a $p_n$ that also satisfies

A.2 Two technical lemmas
In the proofs of Theorems 2.1 and 2.2, the first step is to establish the consistency of the penalized MLEs under the null model, which follows from the next lemma. It states that any estimator of $(r_1, \mu_1, \mu_2, \sigma_1, \sigma_2)$ with a large penalized likelihood value is consistent for $\mu_h$ and $\sigma_h$, $h = 1, 2$, under the null model. Since both $R_n$ and $R_n^*$ are invariant to location and scale transformations, we assume that $(\mu_0, \sigma_0) = (0, 1)$.

Lemma A.1: Assume the same conditions as in Theorem 2.1. Let $(\bar r_1, \bar\mu_1, \bar\mu_2, \bar\sigma_1, \bar\sigma_2)$ be any estimator of $(r_1, \mu_1, \mu_2, \sigma_1, \sigma_2)$ such that $pl_n(\bar r_1, \bar\mu_1, \bar\mu_2, \bar\sigma_1, \bar\sigma_2) - pl_n(0.5r, 0, 0, 1, 1) > c > -\infty$ for some constant $c$. Then, under the null model $f(y; 0, 1)$, $(\bar\mu_h, \bar\sigma_h)$ is consistent for $(0, 1)$, $h = 1, 2$.

The proof of Lemma A.1 is quite long and technically involved. For convenience of presentation, we leave it to the supplementary material.
In the next lemma, we strengthen the conclusion of Lemma A.1 by providing an order assessment. For convenience of presentation, we first introduce some notation. From (A1), we know that $\bar m_1 + \bar m_2(\alpha) = (\bar m_{10}(\alpha), \bar m_{01}(\alpha))^{\tau}$. Define $\bar\alpha$ and $\bar\theta$ as $\alpha$ and $\theta$ in model (1) with $\bar r_1$ in place of $r_1$.

A.3.1 Proof of part (i)
Applying the results in Lemma A.2, we obtain the results in Part (i).
After the second Taylor expansion, under the null, we have