Information theoretic approaches to biological sequence analyses

2017-01-13T03:57:54Z (GMT) by Cao, Minh Duc
Molecular biology is the first information processing system on the planet. The genome sequence of an organism stores the genetic information that virtually defines the organism. Analysis of genomic sequences can help elucidate many aspects of life. This thesis investigates approaches for sequence analysis that make use of the information content of the sequences.
The information content of a sequence can be estimated by lossless compression. The thesis develops the expert model, a fast and effective algorithm for compression of biological sequences. The expert model uses a novel adaptive technique to combine predictions from different sub-models for compression based on the well-founded Bayesian statistical framework. Experiments show that the expert model outperforms existing biological compression algorithms on standard DNA and protein sequence data sets while maintaining a practical running time. Moreover, the expert model is capable of compressing long sequences. It is applied to estimate the information content of the genomes of species at various organism levels, including viruses, bacteria, archaea, single cell eukaryotes, invertebrates, plants and mammals. Most importantly, the expert model can produce an estimate of the information content of every symbol in a sequence using background knowledge in the form of known sequences or contexts. This is useful for performing information extraction from genomic sequences.
The thesis suggests that since genomic sequences carry genetic information, sequence analysis can be performed at the information level. A method for pairwise local alignment of genomes, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of sequences to be related if their mutual information is significant. XMAligner is shown to be superior to conventional alignment methods, especially on distantly related sequences or statistically biased data.
The method aligns sequences of eukaryote genome size with only modest hardware requirements. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment.
The information content of sequences can also be used for phylogenetic analysis. The thesis formulates XMDistance, a measure of genetic distance between sequences based on their information content estimated by lossless compression. The measure does not rely on an evolutionary model. It is shown to be proportional to elapsed time if the evolutionary rate is constant. The distance measure can be used for phylogenetic analysis of sequences that cannot be reliably aligned, for example, whole genomes. On a set of simulated data, phylogenetic analysis using XMDistance outperforms the maximum parsimony method and the standard character-based distance measure. For small sequences, the maximum likelihood method, which takes much longer to run, performs better. XMDistance successfully infers plausible trees from real data and, most importantly, manages problematic sets of whole genome sequences.
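The ideas of per-symbol information content, background knowledge and a compression-based distance can be illustrated with a minimal sketch. It is not the expert model or the XMDistance formula from the thesis: it assumes a single adaptive order-k context model with additive smoothing in place of the blended sub-models, and the normalisation in `distance` is a hypothetical choice made for illustration. The names `info_content`, `distance`, `order` and `alpha` are likewise assumptions.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"


def info_content(seq, background="", order=8, alpha=0.05):
    """Return the estimated information content (bits) of each symbol in seq.

    Illustrative stand-in for a compression model: an adaptive order-`order`
    context model with additive smoothing `alpha`. If `background` is given,
    the counts are primed on it first, so the result approximates the
    information in `seq` given the background sequence.
    """
    counts = defaultdict(lambda: defaultdict(float))
    bits = []
    for text, scored in ((background, False), (seq, True)):
        for i, sym in enumerate(text):
            ctx = text[max(0, i - order):i]
            seen = counts[ctx]
            if scored:
                total = sum(seen.values())
                p = (seen[sym] + alpha) / (total + alpha * len(ALPHABET))
                bits.append(-math.log2(p))
            seen[sym] += 1  # adapt the model after predicting the symbol
    return bits


def distance(a, b, order=8):
    """Hypothetical normalised distance (an assumption, not XMDistance):
    information left in each sequence once the other is known, divided by
    the information in both sequences taken on their own."""
    ia = sum(info_content(a, order=order))
    ib = sum(info_content(b, order=order))
    ia_b = sum(info_content(a, background=b, order=order))
    ib_a = sum(info_content(b, background=a, order=order))
    return (ia_b + ib_a) / (ia + ib)


# Simulated data: y is a mutated copy of x, z is unrelated to x.
rng = random.Random(0)
x = "".join(rng.choice(ALPHABET) for _ in range(2000))
y = "".join(rng.choice(ALPHABET) if rng.random() < 0.05 else c for c in x)
z = "".join(rng.choice(ALPHABET) for _ in range(2000))

print("related pair:  ", distance(x, y))
print("unrelated pair:", distance(x, z))
```

In this sketch a related pair should score a smaller distance than an unrelated pair, because knowing one sequence lowers the estimated information content of the other. No alignment is computed at any point, which is the property that makes compression-based measures usable on sequences that cannot be reliably aligned.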