The Statistical Significance of Max-Gap Clusters

Hoberman, Rose; Sankoff, David; Durand, Dannie

doi:10.1184/R1/6104519.v1

File(s) stored externally: http://dx.doi.org/10.1007/978-3-540-32290-0_5

journal contribution

Posted on 2004-10-16. Authored by Rose Hoberman, David Sankoff, Dannie Durand.

Identifying gene clusters, genomic regions that share local similarities in gene organization, is a prerequisite for many different types of genomic analyses, including operon prediction, reconstruction of chromosomal rearrangements, and detection of whole-genome duplications. A number of formal definitions of gene clusters have been proposed, as well as methods for finding such clusters and/or statistical tests for determining their significance. Unfortunately, there is very little overlap between previously published rigorous analytical statistical tests and the definitions used in practice. In this paper, we consider the max-gap cluster: a contiguous region containing a maximal set of homologs, where the number of non-homologous genes between pairs of adjacent homologs is never greater than a predefined, fixed parameter, g. Although this is one of the models most widely used in practice, currently the statistical significance of max-gap clusters can only be evaluated using Monte Carlo simulations because no analytical statistical tests have been developed for it. We give exact expressions for the probability of observing such a cluster by chance, assuming a simple reference-region scenario and random gene order, as well as more efficient methods for approximating this probability. We use these methods to identify which regions of the parameter space yield clusters that are statistically significant. Finally, we discuss some of the challenges in extending this model to whole-genome comparison.
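
As a rough illustration of the Monte Carlo evaluation mentioned in the abstract, the following Python sketch estimates how often a max-gap cluster of a given size arises under random gene order. It is not the authors' method or code. The function names, the parameters n (genes in the genome), m (genes with a homolog in the reference region), and h (minimum number of homologs in a cluster), and the simplification that the m homologs occupy uniformly random positions are all assumptions made for illustration; only the gap parameter g comes from the abstract.

```python
import random


def max_gap_clusters(positions, g):
    """Group marked gene positions into maximal max-gap clusters.

    Adjacent marked genes stay in the same cluster when at most g
    unmarked genes lie between them (hypothetical helper).
    """
    clusters = []
    current = []
    for p in sorted(positions):
        # p - current[-1] - 1 counts the unmarked genes between neighbours.
        if current and p - current[-1] - 1 > g:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters


def largest_cluster_size(positions, g):
    """Number of marked genes in the largest max-gap cluster (0 if none)."""
    return max((len(c) for c in max_gap_clusters(positions, g)), default=0)


def monte_carlo_p_value(n, m, h, g, trials=10000, seed=0):
    """Estimate P(some max-gap cluster holds >= h of the m marked genes).

    Assumption: the m marked genes are placed uniformly at random,
    without replacement, among n genome positions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        positions = rng.sample(range(n), m)
        if largest_cluster_size(positions, g) >= h:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # With g = 1, positions 0, 1, 3 form one cluster and 10, 12 another.
    print(max_gap_clusters([0, 1, 3, 10, 12], g=1))
    # Illustrative parameter values only; not taken from the paper.
    print(monte_carlo_p_value(n=1000, m=10, h=4, g=5))
```

A simulation like this needs many trials to resolve small probabilities, which is the cost that analytical expressions are meant to avoid.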

Keywords

Biological Sciences

Licence

In Copyright
