Markov chain composite likelihood and its application in genetic recombination model

Phylogenetic Trees are critical in human genome research for investigating human evolution and identifying disease-associated genetic markers. New high-throughput genome sequencing technologies raise an urgent need to develop statistical methods that can construct phylogenetic trees from long genome sequences with quick computation speeds, while considering various biological complexities. Though an ancestral mixture model has been proposed [Chen SC, Lindsay BG. Building mixture trees from binary sequence data. Biometrika. 2006;93(4):843–860. doi: 10.1093/biomet/93.4.843] to this end by allowing genetic mutations over generations, another essential evolution factor, genetic recombination, is missed. Therefore, in this paper, we develop a novel genetic recombination model for tree construction and propose to use Markov chain composite likelihood (MCCL) to make model estimation computationally feasible. To further reduce computation complexity, a hierarchical estimator is constructed to estimate unknown ancestral distributions through MCCL. Simulation studies and real data example show that our proposed methods perform well and fast, so have the potential for implementation in long sequence genome data.


Introduction
Phylogenetic trees, also referred to as evolutionary trees, are diagrammatic representations that reflect the inferred evolutionary relationships among organisms or various biological species based on similarities and differences in their physical or genetic characteristics.A phylogenetic tree contains nodes that are connected by branches.The nodes represent a group of organisms, and the branches depict the evolutionary path and inferred historical relationships among different nodes.The root of the tree represents the most common ancestor, and node shows the divergence of descendants from a common ancestor [1].
Phylogenetic trees have broad biological applications.They help in understanding the origin of human [2] and human evolution history and population expansion [3][4][5].
The first column contains the rsID of the SNPs.The second column contains the physical position of these SNPs on chromosome 1.The last column denotes the major allele (i.e. the most common allele) for each SNPs in this data set.The other columns are the SNPs for each haplotype sequence.
In epidemiology, phylogenetic trees are also used in tracing the origin and spread of infectious diseases [6][7][8].Phylogenetic trees are usually derived from DNA or protein sequences.For example, Figure 1, presented in Chen and Lindsay [9], shows a constructed phylogenetic tree by using mitochondrial DNA data.In this paper, we will specifically focus on phylogenetic trees stemmed from human haplotype SNPs sequences.A single nucleotide polymorphism (SNP) is a position on the DNA for which there is variation in the nucleotide (A for adenine, T for thymine, C for cytosine, or G for guanine) at that single position among humans; and a haplotype is a DNA sequence inherited from a single parent.
In the past decade, quickly developed sequencing technology, such as the next generation sequencing, has enabled researchers to obtain whole genome DNA sequences at relatively low price.Large international genomic projects, such as the International HapMap Project and the 1000 Genomes Project [10,11], have offered access to high-quality whole genome SNPs data collected from thousands of individuals belonging to a vast number of populations.For example, the third phase of the International HapMap Project contains haplotypes from 11 ancestry groups.Among these ancestry groups, the population of Yoruba in Ibadan, Nigeria (YRI) trio samples contains 200 whole genome haplotype sequences, in which there are 116,415 SNPs identified on chromosome 1.Table 1 lists the first 10 SNPs on chromosome 1 as an example of raw haplotype data.The rapid increase of genome sequence data has enhanced the significance of phylogenetic trees in various biology branches but with considerable statistical and computational challenges [12].Hence there is an urgent need to develop statistical methods that can construct phylogenetic trees from long DNA sequences by taking various biological complexities into account.
Statistically, building phylogenetic trees is equivalent to estimating unknown distributions of ancestral SNP sequences from current descendant ones, with consideration of realistic biology complexities such as mutation and recombination.To this end, an ancestral mixture model [9] was proposed by Chen and Lindsay in 2006 to construct evolutionary trees while allowing only mutations to accumulate over the number of generations.Inspired by their work, in this paper, we propose a model by accounting for another evolutionary factor, recombination.To reduce significant computation challenges caused by enormous number of recombination possibilities, especially when the length of haplotype SNP sequence is large, we further propose to use the Markov chain composite likelihood (MCCL) method to make statistical inference on the proposed recombination model.
The rest of the paper is organized as follows.In Section 2, we describe the developed recombination model and the challenges on model estimation when the classic likelihood method is implemented.In Section 3, the method of composite likelihood (CL), especially a Markov chain composite likelihood (MCCL) for the recombination model, is discussed explicitly for the purpose of solving computation challenges.In Section 4, we comprehensively explain a hierarchical estimator which allows making fast estimations by sequentially optimizing the corresponding MCCL for the proposed recombination model.The results from simulation studies are summarized in Section 5, to examine the performance of the proposed MCCL and the hierarchical estimator.Finally, a real data example and some further discussions on the proposed method and potential future directions are discussed in Sections 6 and 7.

Recombination model
In this paper, we mainly consider the SNPs data from haplotype sequences.In the majority of SNP data, only two possible character states exist at each of the positions (i.e.A or G at a purine site, and T or C at a pyrimidine site).Thus, we can use binary data to represent the SNPs on a haplotype sequence, and use 0 and 1 to denote major allele (the most common allele for a given SNP) and minor allele (the less common allele for a given SNP), respectively.For example, in Table 1, the major allele for the sixth SNP with ID rs3131969 is letter A. Thus, for this SNP, all letter As will be coded as 0 and all the other letters (i.e.letter G) will be coded to 1, when we analyse this data set.
In addition, during the process of genetic inheritance, two types of principal mechanisms may occur to introduce genetic variations into the population.One is called mutation, which is an alternation in the SNPs (i.e.changes from 0 to 1 or vice versa).Another one is genetic recombination, which refers to the rearrangement of DNA sequences by the breaking and rejoining of chromosomes or chromosome segments between maternal and paternal DNA sequences.The ancestral mixture model [9] only involves mutation, while missing the other important evolution factor, recombination.Hence, in this section, we propose a model to take potential recombination into consideration.This model estimates the unknown ancestral distribution from observed current binary descendant sequences by considering possible recombination.
For n descendant binary sequences, X 1 , . . ., X n , if there are n 1,2 (i, j) descendants with sequence (i, j), the log-likelihood function for unknown ancestral distribution π is

Recombination model with sequence length L
When the sequence length is L, there are L−1 locations where recombination could occur.Assume a constant recombination rate q, a general recombination model is where X = (x 1 , . . ., x L ) represents the current descendent sequence with value x s ∈ {0, 1} on site s = 1, . . ., L, and the notations of π s are defined in a similar way as π 1 (i) before, by summing up π 1,2,...,L (x 1 , x 2 , . . ., x L ) along the desired sites.That is, the probability of observing current descendent sequence X is the summation of the probabilities from one extreme scenario that no recombination occurred and X came from a single ancestor, to the other extreme situation where recombination occurred at all L−1 locations and X came from L ancestors.Similarly, for n observed descendant binary sequences X 1 , . . ., X n , the log-likelihood function is where π = {π 1,...,L (x 1 , . . ., x L )} represents the set of unknown ancestral proportions, n 1,...,L (x 1 , . . ., x L ) denotes the number of observed descendants, which have values as X = (x 1 , . . ., x L ) for x s ∈ {0, 1} and s = 1, . . ., L.
Though we can write out the above general recombination model and the corresponding log-likelihood function explicitly, there are significant computation challenges when applying the classic maximum likelihood estimator due to the enormous number (2 L−1 ) of recombination possibilities when L is large.Therefore, we propose to use composite likelihood (CL) as a substitute of full likelihood to reduce the computation complexity.

Composite likelihood
High dimensional data can bring enormous challenges to inference and estimation with maximum likelihood method, especially when there are complex correlation structures in the data.Some alternative approaches based on likelihood modifications have been developed.For example, pseudo likelihood [13], partial likelihood [14], and pairwise likelihood all address estimation when the full likelihood is intractable.Composite likelihood is a general class of likelihood-based methods [15], and can also be used in complex situations.Composite likelihoods are constructed by taking a product of likelihood terms, each one of which is a marginal or conditional sublikelihood, i.e.CL(θ ) = K k=1 L k (θ ).Because each component likelihood is a true likelihood, the composite likelihood provides a generally consistent method of estimation, but often with an efficiency loss when compared with the full likelihood.A thorough review of composite likelihood method can be found in Varin et al., 2011 [16].
In particular, we use Markov chain composite likelihood (MCCL) to solve the computation complexity for our proposed recombination model.Markov chains are widely used in genetic evolution problems [17] because they provide rich models to describe the evolution of DNA sequences [18,19].Here, we build the Markov chain idea into our recombination model and hence construct the MCCL as a substitute for the full likelihood of the recombination model.
There is a certain natural logic to use the MCCL in this problem.First, it greatly reduces the computational need for finding estimates.Second, the action of recombination breaks the sequences into independent pieces.Thus, if there is any recombination at all between two sites a and b, the values of x a and x b on the descendant X are independent, and so the pair contain little information about the dependence in ancestral sequences at these two sites.Thus, we expect most of the recombination information to come from shorter sequences, which will be built into our MCCL.

Markov chain composite likelihood
Suppose a sequence of random variables Y 1 , Y 2 , . . ., Y L is a Markov chain of order m, we have In the recombination model (1), if we assume that descendant sequences hold the Markov chain structure, then for n observed descendants, X 1 , X 2 , . . ., X n , the log-likelihood function (2) can be written as where f (x s , x s+1 , . . ., x t ) = P(X s = x s , X s+1 = x s+1 , . . ., X t = x t ) is the probability that descendant sequence has values x s , x s+1 , . . ., x t on sites s, s + 1, . . ., t, and can be calculated by recombination probability in Equation ( 1).Thus, we can use Equation (3) as a surrogate for the full log-likelihood, and call this surrogate log-likelihood as the log Markov chain composite likelihood (log MCCL).It is a good surrogate if the observed sequences have the Markov property.In addition, the MCCL converges to full likelihood when the order of Markov chain, m, increases.
If we use the log MCCL (3) to substitute the full likelihood to estimate the unknown ancestral distributions, we only need to estimate the (L − m) different sequential margins of ancestral distributions, i.e. π s,...,s+m (x s , . . ., x s+m ) for s = 1, . . ., L − m, not the full joint ancestral distribution π 1,...,L (x 1 , . . ., x L ).In order to estimate the full joint ancestral distributions, we could make an additional assumption on the joint density, such that they also have an order-m Markov property.Then, after obtaining the (m + 1)-wise marginal ancestral distribution estimations, the joint density can be estimated by This estimation is an approximation to the joint density, one that improves in quality as m increases.
By using order-m MCCL instead of the full likelihood, we focus on estimating (m + 1)wise marginal ancestral distributions rather than the joint distributions.The number of possible recombination, that need to be considered when estimating the marginal ancestral distributions by applying MCCL, is (L − m) × 2 m .This number is much less than 2 L−1 , the number of potential recombination that need to be accounted for when we use full likelihood to estimate the joint ancestral distributions.As a consequence, compared with the full likelihood, the proposed log MCCL (3) can reduce the computation burden significantly, especially when m L.
However, there are still a number of challenges on optimizing the log MCCL (3) to obtain the estimations, which will be discussed in the following subsection.
In addition, there exist several restrictions on the parameter space for π s,...,s+m (x s , . . ., x s+m ).
The first restriction is the between zero and one constraint.That is, for s = 1, . . ., L − m, The second one is sum to one restriction.That is, for s = 1, . . ., L − m, we have The third restriction is that the unknown parameters should satisfy lower order margin consistency property, which means the estimated (m + 1)-wise margins must satisfy π s+1,...,s+m+1 (x s+1 , . . ., x s+m+1 ), All these complicated constraints make it difficult to build an estimator that will maximize the MCCL globally within the parameter space.For this reason, we have focused on this problem first from the perspective of finding quick (and possibly inefficient) methods to come near to maximize MCCL for long sequences L.

Estimation in MCCL
In this section, we propose a fast estimation method, a hierarchical estimator, to estimate unknown (m + 1)-wise margins, π s,...,s+m (x s , . . ., x s+m ) hierarchically based on the developed order-m MCCL (3).The idea for this hierarchical estimator is, we first estimate onewise margin, then estimate the pairwise margins given all the onewise fixed and so on, until we have the estimates for the desired (m + 1)-wise marginal distributions, given all lower margins fixed.The proposed hierarchical estimator requires one to change the unknown parameters, π s,...,s+m (x s , . . ., x s+m ), into a new parameter system.Hence, we first demonstrate the reparametrization in this section, followed by detailed discussions on the developed hierarchical estimator.

Pairwise margins
To illustrate the idea of a hierarchical estimator in greater detail in the simplest situation, we start from the situation where order-1 log MCCL is applied to the recombination model and all pairwise margins, π s,s+1 (x s , x s+1 ) for s = 1, . . ., L − 1 and x s , x s+1 ∈ {0, 1}, need to be estimated.Under this situation, the target log MCCL (3) is We start from estimating π s (x s ) for all s = 1, . . ., L and x s ∈ {0, 1}.By maximizing the order-0 MCCL (i.e. the product of onewise marginal likelihood), we obtain πs (x s ) = n s (x s )/n.To estimate the next higher margins π s,s+1 (x s , x s+1 ), we can apply conditional maximization by fixing all onewise margins as πs (x s ).Given all the onewise margins fixed, the unknown parameter π s,s+1 (x s , x s+1 ) only appears in the log MCCL (5) in the term f (x s , x s+1 ), and does not depend on other neighbour pairs f (x t , x t+1 ) or f (x s ).Thus, to estimate π s,s+1 (x s , x s+1 ), we can maximize the following reduced log-likelihood function where f (x s , x s+1 ) = P(X s = x s , X s+1 = x s+1 ) = (1 − q)π s,s+1 (x s , x s+1 ) + qπ s (x s )π s+1 (x s+1 ), and n s,s+1 (x s , x s+1 ) is the number of descendant sequences having values x s on site s and x s+1 on site s + 1.
Since each pairwise margin, π s,s+1 (x s , x s+1 ) and π s+1,s+2 (x s+1 , x s+2 ), has three constraints stated above, there are in total six constraints on a threewise margin and only two of the eight π s,s+1,s+2 (x s , x s+1 , x s+2 ) are free.Notice that as long as the pairwise margins satisfy sum to one constraints, the threewise margins automatically hold sum to one constraints given the satisfaction of lower order margin consistency.
Again, the corresponding order-m log MCCL can be separated into terms for each parameter φ s (x), and the log-likelihood part used to estimate φ s (x) is simply a sum of terms: where f (x s , x, x s+m ) = P(X s = x s , X = x, X s+m = x s+m ) can be calculated by Equation (1).
From the above reparametrization, we can see that the φ parameters used here are separable in the log-likelihood function, and consequently, greatly simplifies algorithms that maximize the likelihood.

A hierarchical estimator
In Section 4.1, we have introduced our idea for the new estimator briefly: we first estimate all the onewise marginal ancestral distributions, then estimate the pairwise margins given all the onewise fixed.The threewise margins are estimated by fixing all the pairwise ones.We follow this process hierarchically until we obtain the estimates for the desired (m + 1)wise marginal distributions.We call this new estimator as the hierarchical estimator and discuss it explicitly in this subsection.
We estimate π s,s+1 (x s , x s+1 ) through new parameters φ s = π s,s+1 (0, 0) given πs (x s ) fixed.To estimate φ s , we maximize the log-likelihood function (6) by solving the score equation It can be proved that l(φ s ) is a concave function with respect to φ s , which implies there typically exists a unique local maximum for l(φ s ) (Appendix 1).It is also easy to see that d dφ s l(φ s ) achieves zero when given πs (x s ) = n s (x s )/n, where s = 1, . . ., L and x s ∈ {0, 1}.Notice that the above solution for the score equation is the one when there is no constraint on the parameter space.However, φ s is restricted to be max{0, πs (0) + πs+1 (0) − 1} ≤ φ s ≤ min{ πs (0), πs+1 (0)}.If φs is in this interval, then it is the desired constrained MLE with respect to log-likelihood function (6).If φs isn't in parameter space, then the desired constrained MLE must be located on one of the two boundaries in the above interval because of the concavity of l(φ s ).In particular, the constrained MLE must be the bound that is closest to φs .

Threewise margins
We first explore the possibility of generalizing the above pairwise estimator to threewise estimator.That is, to estimate π s,s+1,s+2 (x s , x s+1 , x s+2 ), for s = 1, . . ., L − 2, we fix onewise margins πs (x s ) and pairwise margins πs,s+1 (x s , x s+1 ).Then, the unknown parameters, π s,s+1,s+2 (x s , x s+1 , x s+2 ), can be reparametrized as φ s (0) = π s,s+1,s+2 (0, 0, 0) and φ s (1) = π s,s+1,s+2 (0, 1, 0) based on the discussions in Section 4.1.These two new parameters can be estimated by optimizing the log-likelihood function (7).Denote x s+1 = i, for i = {0, 1}, to estimate the parameter φ s (i), one can solve the score equation It can be shown (Appendix 2) that score Equation ( 11) has a solution provided πs (x s ) = n s (x s )/n and πs,s+1 (0, 0) = φs obtained according to Equation (10) (not a solution forced by constraint).However, if the constrained MLE for πs,s+1 (0, 0) is not φs in Equation ( 10), but rather one of two bounds in the parameter sapce, Equation ( 11) has no solution.Hence, the pairwise hierarchical estimator cannot be directly generalized to the threewise margin case.However, by further investigation, we find that solving the score Equation ( 11) is equivalent to solve a cubic equation of φ s (i), where explicit forms of A 3 j , j = 1, 2, 3, 4, can be found in the Appendix 2. In addition, for any cubic equations with real coefficients, there exists either one real root and two conjugate complex roots, or three real roots for this equation.We prove (Appendix 2) that if there exists three real roots for Equation (13), only one of the real roots will be an eligible candidate estimate for φ s (i), such that f (x s , x s+1 = i, x s+2 ) ≥ 0 for x s ∈ {0, 1} and x s+2 ∈ {0, 1}, and this root is the middle one among all three real roots.
Therefore, the hierarchical estimators for threewise marginal ancestral distributions are (1) Fix all onewise margins as πs (x s ) = n s (x s )/n and pairwise margins πs,s+1 (x s , x s+1 ), which is estimated by hierarchical estimator for pairwise margins in Section 4.2.1.(2) Solve the cubic function ( 13) and find all its real roots.
(3) If there is only one real root, then use this root as the candidate estimate of φ s (i), otherwise use the middle real root as the candidate estimate, denoted by φs (i).(4) Check if φs (i) is in the interval (12).If yes, φs (i) is the desired estimate.If no, the desired estimate should be the bound that is closest to φs (i) in the interval ( 12). ( 5) Calculate the estimate for other π s,s+1,s+2 (x s , i, x s+2 ) through reparametrization in Section 4.1.
(2) Solve the cubic function ( 14) and find all its real roots.
(3) If there is only one real root, then use this root as the candidate estimate of φ s (x), otherwise use the middle real root as the candidate estimate, denoted by φs (x).(4) Check if φs (x) is in the interval (8).If yes, φs (x) is the desired estimate.If no, the desired estimate should be the bound that is closest to φs (x) in the interval ( 8). ( 5) Calculate the estimates for other π s,...,s+m (x s , x, x s+m ) through reparametrization in Section 4.1.

Simulation studies
To assess the performance of the proposed MCCL and hierarchical estimator, we conduct a series of simulation studies.Both marginal and joint estimators are evaluated under factorially varied values of fixed quantities m and q.In this section, we first introduce our simulation design, followed by a discussion of simulation results and conclusions.

Design
To simulate data, we first generated an ancestral distribution from haplotype sequence data released by the International HapMap Project phase 3. Particularly, we used the 200 haplotypes on Chromosome 1 from the YRI trio samples to generate the ancestral distribution.
The corresponding SNPs data were first re-coded into binary sequences such that the minor allele is 1 and the major allele is 0.Then, haplotype segments of length 20, starting with the first available recorded SNP on chromosome 1, were selected as the pool for ancestors.For length L = 20, in general, there are 2 20 = 1, 048, 576 possible sequences.Among the 200 YRI chromosome 1 haplotypes segments of length 20, however, there were 91 distinct sequences only, which were used as the ancestors in our simulations.We calculated the relative proportions for these 91 ancestral sequences and used them as the true ancestral proportions.The 91 sequences with positive proportions, and the remaining 1,048,485 sequences with zero proportions are referred to as the true non-zero and true zero ancestral sequences, respectively.We set the recombination probabilities to be, q = 0.005, 0.01, 0.05 and 0.1, in the simulations to best mimic the real-world situation and the corresponding rate used in the International HapMap Project [20][21][22].For each q-value, we generated a sample of size n = 100 simulated descendant sequences for N = 100 replications, from the true ancestral distribution (i.e.91 true non-zero ancestral sequences and their relative proportions) by implementing the proposed recombination model in Equation (1).We examined three Markov chain orders, m = 1, 2, 3, under each q-value.
In the simulation studies, the (m + 1)-wise marginal ancestral proportions, π s,...,s+m (x s , . . ., x s+m ), were estimated by applying the proposed hierarchical estimator based on order-m log MCCL.The estimated joint ancestral distributions were then calculated according to Equation ( 4) by assuming the joint distributions also had order-m Markov property.We used bias and standard error to examine the performance of estimator.In addition, 95% bootstrapped confidence intervals were calculated to further assess estimator accuracy.

Results
The results for estimated marginal ancestral distributions are similar under difference combination of q and m values.That is, overall speaking, the estimations have relatively low bias and low standard error compared to the true values.
To save space, here we use Figure 1 to summarize marginal estimate results when m = 2 and q = 0.01, as an example.Clearly, from the figure, we can see that the maximum absolute bias for parameters, φ s (0) and φ s (1), are very small (both are 0.007) for s = 1, . . ., L − 2 = 18.The standard errors are also quite small with the maximum values being 0.049 and 0.046, respectively, for φ s (0) and φ s (1).In addition, for the majority of estimates, the true parameter values fall within the coverage of the 95% bootstrapped confidence intervals.Though some of the true parameter values are not in the confidence intervals (e.g. for φ s (0) and φ s (1) when s = 4, and for φ s (1) when s = 7 and 18), the distance between the true value and the closest interval is negligibly small, which ranges from 0.001 to 0.002.The same conclusions can be drawn for the marginal estimates obtained under other combinations of q and m, and the corresponding figures are listed in Supplementary Material.When come to the results for estimated joint ancestral distributions, we notice a tradeoff between bias and standard error.Particularly, increasing m results decreased estimation bias but increased the standard error.For example, when q = 0.1, for non-zero true ancestral sequences, the largest observed absolute biases are 0.068, 0.056, and 0.036, respectively, for m = 1, 2, 3. On the other hand, the largest observed standard errors increase from 0.006, to 0.010, and to 0.019, when m increases from 1 to 3 (Figure 2).The same trends are also observed for other q = 0.005, 0.01, 0.05, and the corresponding figures are reported in the Supplementary Material.
We also notice that we successfully identified all 91 true non-zero ancestral sequences under all combinations of m and q values, but the vast majority of the true non-zero sequences are underestimated.When m = 1, 2, only about 25% of the total proportions is assigned to the true non-zero sequences and when m = 3 only about 40% of the total proportions is assigned to the true non-zero sequences (Table 2).The reason for this underestimation is most likely because the true ancestral distribution is not a Markov chain.
However, a question would then naturally raised: Where are those misplaced proportions assigned to?To investigate this question, we use the concept of Hamming distance, which is usually implemented in genetic research to measure the similarity of a group of sequences [23,24].Specifically, the Hamming distance between two sequences X 1 and X 2 is defined as the number of indices in the sequences that differ between the two sequences.That is, for sequence length L, the Hamming distance is Sequences X 1 and X 2 are considered similar if and only if d H (X 1 , X 2 ) ≤ t, where the threshold t can be set manually by the researcher based on biological properties or selected through a clustering algorithm.
In this simulation study, we choose the threshold based on the percentage of sequences that differ.Particularly, let X non−zero and X zero denote the sets of the true non-zero ancestral sequences and true zero ancestral sequences, respectively, and calculate the Hamming distances between each sequence in X zero and each sequence in X non−zero .Consider a subset of X zero , such that d H (X zero , X non−zero ) ≤ 1.The true zero ancestral sequences in this subset are the 'neighbours' of X non−zero in terms of differing by at most one site from any true non-zero ancestral sequences.As a consequence, for the zero true ancestral sequences in this subset, 95% of their sites are the same to the sequences in X non−zero .Hence, we call this subset as the set containing '95% neighbours' of X non−zero .Similarly, we can define a set for '90% neighbours' of X non−zero by including sequences in X zero that satisfy d H (X zero , X non−zero ) ≤ 2, and a set for '85% neighbours' of X non−zero by involving sequences in X zero that satisfy d H (X zero , X non−zero ) ≤ 3. It is easy to see that the set of 95% neighbours is nested in the set of 90% neighbours, which is also a subset of the 85% neighbours set.More specifically, this gives us 1564 sequences in the set for 95% neighbours, 12,070 sequences in the set for 90% neighbours, and 55,374 sequences in the set for 85% neighbours, of X non−zero .
Table 2 shows that, though only a small part of total positive estimated proportions is assigned to true non-zero ancestral sequences, when we include the 95% neighbour sequences of X non−zero , this total proportions increases to about 70% to 85% depending on different m values.The corresponding percentage increases more when we consider 90% neighbour sequences of X non−zero , and nearly all of our misplaced estimated joint ancestral proportions are assigned to 85% neighbours of X non−zero .This suggests that even if our proposed method tends to underestimate the true joint ancestral distributions, the missing densities are not assigned to sequences that are very different from the true ancestors.
Instead, almost all of the missing proportions are placed to the sequences that are quite close, with Hamming distances less than three, to the true ancestors.Finally, we use this simulation study to investigate the computation burden for the proposed method.For one randomly selected simulation replication, we summarize the computation time (in seconds) needed for calculating marginal and joint ancestral distributions, respectively, in Table 3. Note, the computation time is based on a calculation by using R on one processor from a personal laptop equipped with 32.0 GB RAM and Intel Core i7-8650 1.90 GHz Processor.It shows that for a fixed sequence length L, the computation time increases when m increases, but no clear trend for the relationship with q.Most importantly, Table 3 clearly indicates that the proposed MCCL plus hierarchical estimator show promising computation speed, especially for marginal ancestral distribution estimation.

Conclusions
Our simulation studies show that the proposed Markov chain composite likelihood plus hierarchical estimator work well in estimating the marginal ancestral distributions with very small bias and standard error.Though the calculated joint ancestral distributions tend to be underestimated, we successfully identify all true ancestral sequences and reveal the misplaced ancestral proportions are assigned to sequences which are close to the true ancestors.
In addition, with closed-form solutions by solving a cubic equation, the proposed hierarchical estimator has the advantage of fast computation without the need for iterations and consequently can be applied to the scenario when sequence length L is large.
observed in descendent samples, but the estimated ancestral proportion for this sequence is 0.0394 (Table 6).

Discussion
In this paper, we first develop a recombination model in order to estimate the unknown distribution for ancestral haplotype sequences from observed current descendant ones.The model accounts for the possibility that recombination takes place in the genetic processing, with maternal and paternal genes being regrouped in the formation of gametes (sex cells).However, classic likelihood methods are computationally infeasible to be implemented here due to the need to summate numerous recombination possibilities, especially when the length of haplotype sequence is long.Thus, we propose to use composite likelihood, particularly, Markov chain composite likelihood, to estimate the marginal ancestral distributions instead, and calculate the joint ancestral distributions from the estimated marginal ones by assuming that the joint distributions similarly hold the Markov property.
Compared with the classic likelihood method, the implementation of Markov chain composite likelihood significantly reduces the computation burden because only a limited number of potential recombinations need to be considered for estimating the marginal ancestral distributions.In addition, due to the complex constraints for the parameter space, we further develop a hierarchical estimator to estimate the marginal ancestral distributions sequentially from the lower to the desired order of margins, in order to further reduce the computation burden.
However, the proposed hierarchical estimator gives conditional maxima, not the maximum likelihood estimator, for the Markov chain composite likelihood.Thus, some other optimization methods, such as the gradient method, have the potential to add improvement steps so that the resulted estimates are closer to the maximum composite likelihood estimator.
Another potential limitation of the proposed method is the possible direction effect.Note that for an order-m MCCL, the hierarchical estimator estimates (m + 1)-wise marginal ancestral distribution π s,...,s+m , for s = 1, . . ., L − m, depending on all related lower order margins.Thus, theoretically speaking, it does not matter we make estimation beginning from the left-side or the right-side of the sequence.In practice, however, we usually start from one end and go sequentially until the other end of the sequence.Therefore, it is possible that the computation errors also pass to the next step sequentially and the cumulative errors at the end could make the end-site estimation have a larger bias than the start-site estimation.We expect more chance of observing such direction effect when L is large, and one simple solution is that we can make estimations multiple times by using different directions, then use the average as the final estimate to eliminate the potential direction effect.
Last but not least, since our motivation is to develop a computation-efficient tool for constructing phylogenetic tree with the considerations of both genetic mutation and recombination, an important future work is to add a mutation factor into the proposed recombination model.This can be started by combining the recombination model with the ancestral mixture model proposed by Chen and Lindsay [9].The related estimation methods and tree construction algorithms will be developed in future investigations.
Thus, l(φ s ) is a concave function with respect to φ s , which means there typically exists a unique local maximum for l(φ s ).
Given πx (x s ) = n s (x s )/n and by plugging in into the score equation d dφ s l(φ s ) = 0, we can easily prove that φs is the closed from solution for this score equation.
In such situation, which real root should be chosen as the candidate estimate is a critical problem.Below we prove that under such situation, we should select the middle real root because it is the only real root that makes f (x s , x s+1 = i, x s+2 ) ≥ 0 for x s , x s+2 ∈ {0, 1}.Proof: It is easy to see that when φ s (i) take any one of the above four values, we have one of the f (x s , i, x s+1 ) = a s x s ,x s+1 φ s (i) + b s x s x s+1 = 0. Hence the denominators of d dφ s (i) l(φ s (i)) and dφ s (i) 2 l(φ s (i)) are zeroes, which implies that they have no definitions at above four points.
Similarly, we can prove that there could not be real root for the cubic equation ( 13) in the interval (max{φ (2)  s (i), φ ( 3)
Theorem A.5: If there exist three real roots for cubic function (13), there is only one real root is the eligible candidate estimate for φ s (i), such that f (x s , x s+1 = i, x s+2 ) ≥ 0 for x s ∈ {0, 1} and x s+2 ∈ {0, 1}.
Specifically, below, we show that the real roots in the other two intervals are not eligible estimate for φ s (i) because they would cause one of the f (x s , i, x s+2 ) < 0, where x s , x s+2 ∈ {0, 1}.

Figure 2 .
Figure 2. In relation to m, the bias and standard error of joint estimates for true non-zero ancestral sequences when q = 0.1.(a) Bias and (b) standard error.

Table 1 .
First 10 SNPs on chromosome 1 of haplotype sequences from YRI trio samples, released by the International HapMap Project (phase 3).

Table 2 .
Sum of estimated joint ancestral proportions that are assigned to various sets of ancestral sequences.

Table 3 .
Computation time (in seconds) for obtaining marginal and joint estimates from one randomly selected simulation replication.

Table 5 .
Estimated pairwise marginal ancestral distributions for gene PSEN2.

Table 6 .
Top 10ancestors that having largest proportions for gene PSEN2.