Despite substantial interest in the species diversity of the human microbiome and its role in human disease, the scale of its genetic diversity has not been quantified, a task essential to understanding human-microbe interactions. To do so, we conducted a cross-study meta-analysis of metagenomes from two niches, the mouth and the gut, amassing 3,655 samples from 13 studies. We found staggering genetic heterogeneity in our dataset, identifying, at the 95% identity level, a total of 45,666,334 non-redundant genes (23,961,508 in the oral microbiome, 22,254,436 in the gut). We found that 50% of the genes in both datasets were “singletons”, meaning they were unique to a single metagenomic sample. Singletons were enriched for discrete functions (compared to non-singletons) and arose from sub-population-specific and extremely rare microbial strains. Overall, these results offer a potential explanation for the large, unexplained heterogeneity observed in microbiome-derived human phenotypes. The resource built from this work is publicly available at https://microbial-genes.bio.
Here, you can download the individual data files that make up our database. The following files are available:
1) Gene catalog consensus sequences (FASTA format) for the oral and gut microbiomes.
2) Gene catalog cluster files for the oral and gut microbiomes. These are in the default output format of the clustering algorithm we used, CD-HIT. They describe the underlying genes within each consensus gene cluster.
3) Our entire database (CSV format).
4) A DIAMOND index of our merged oral and gut gene catalogs.
5) Sample and study metadata.
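
As an illustration of how the cluster files in item 2 might be read, the following is a minimal Python sketch that parses text in CD-HIT's standard `.clstr` layout, where each cluster begins with a `>Cluster N` header, each member line lists a length and a gene name, and the representative is marked with `*`. The gene names in the example are invented for illustration and do not come from the catalog, and the function names are hypothetical. The share of single-member clusters is computed only as a rough proxy: the singletons described above are defined per sample, which this sketch does not check.

```python
import re

# Member lines look like "0\t1203nt, >gene_A... *" (representative)
# or "1\t1180nt, >gene_B... at +/97.20%" (clustered member).
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return {cluster_id: {"representative": name or None, "members": [names]}}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"Unrecognized line: {raw!r}")
        _length, name, status = match.groups()
        clusters[current]["members"].append(name)
        if status == "*":
            clusters[current]["representative"] = name
    return clusters


def single_member_fraction(clusters):
    """Fraction of clusters holding exactly one gene (a rough proxy only)."""
    if not clusters:
        return 0.0
    singles = sum(1 for c in clusters.values() if len(c["members"]) == 1)
    return singles / len(clusters)


# Invented example data in .clstr layout (not taken from the real catalog).
example = """\
>Cluster 0
0\t1203nt, >sampleA_gene1... *
1\t1180nt, >sampleB_gene7... at +/97.20%
>Cluster 1
0\t642nt, >sampleC_gene3... *
""".splitlines()

clusters = parse_clstr(example)
print(clusters["Cluster 0"]["representative"])
print(single_member_fraction(clusters))
```

For the full catalog, the same function could be fed an open file handle instead of a list, since it only iterates over lines; with tens of millions of clusters, streaming and aggregating counts rather than holding every member name in memory would likely be preferable.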