Abstract:
The human microbiome consists of microbes with pan-ecological evolutionary origins, yet a systematic gene-level analysis of microbial life across ecologies is lacking. We quantified the gene content of 14,183 samples across 17 ecologies: 6 human-associated, 7 non-human host-associated (e.g., mouse gut), and 4 from other environmental niches (e.g., soil). At 30% amino acid identity, we identified 117,629,181 non-redundant genes across all samples, 66% of which were singletons, observed in only one sample. We quantified the genetic similarity and “uniqueness” between ecologies, showing that sites such as the human vaginal and skin ecologies had low genetic alpha-diversity yet high beta-diversity, indicating few species but high pangenomic variation. We further identified a set of 1,864 sequences conserved across all ecologies, which indicates overwhelming gene-level conservation across microbial life despite extreme taxonomic variation. However, when clustering at 90% amino acid identity, we did not observe any globally conserved genes, even among those known to be present in all bacteria. This indicates that prior studies, which cluster at, for example, 95% nucleotide identity, have not estimated microbial gene content accurately. We additionally found genes that were differentially abundant in particular groups of ecologies (e.g., human gut and non-human gut genes), identifying discrete functions among these groups. We showed that genes associated with pathogenic taxa are the most likely to appear in multiple ecologies. We provide our databases, as well as the sets of genes described above, as a resource at https://microbial-genes.bio/.
The previous version of this database, which covers only human oral and gut microbiome data, is available at: https://figshare.com/projects/Genetic_Landscape_of_the_Human_Microbiome/62327
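
The gene-level quantities named in the abstract (singletons, genes conserved across all ecologies, and between-ecology dissimilarity) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes genes have already been clustered into non-redundant IDs (for example at 30% amino acid identity), it uses a toy presence/absence mapping invented for illustration, and it uses Jaccard distance as a stand-in for whatever beta-diversity measure the study applied. All function and variable names are hypothetical.

```python
from collections import Counter

# Toy, made-up data: each sample maps to the set of non-redundant gene IDs
# detected in it. Real inputs would come from a clustered gene catalog.
sample_genes = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2"},
    "s3": {"g1", "g4"},
    "s4": {"g1", "g2", "g5"},
}
# Hypothetical ecology label for each sample.
sample_ecology = {"s1": "gut", "s2": "gut", "s3": "soil", "s4": "skin"}


def singleton_fraction(sample_genes):
    """Fraction of non-redundant genes observed in exactly one sample."""
    counts = Counter(g for genes in sample_genes.values() for g in genes)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def genes_by_ecology(sample_genes, sample_ecology):
    """Pool the gene sets of all samples that share an ecology label."""
    pooled = {}
    for sample, genes in sample_genes.items():
        pooled.setdefault(sample_ecology[sample], set()).update(genes)
    return pooled


def conserved_genes(pooled):
    """Genes present in every ecology (intersection of the pooled sets)."""
    sets = list(pooled.values())
    return set.intersection(*sets) if sets else set()


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; an assumed stand-in for a beta-diversity measure."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


pooled = genes_by_ecology(sample_genes, sample_ecology)
print(singleton_fraction(sample_genes))                 # share of genes seen in one sample
print(conserved_genes(pooled))                          # genes found in all ecologies
print(jaccard_distance(pooled["gut"], pooled["soil"]))  # gut vs. soil dissimilarity
```

In this toy setting, raising the clustering identity threshold would split gene IDs into finer clusters, which shrinks the intersection computed by `conserved_genes`; that is the same qualitative effect the abstract reports between 30% and 90% amino acid identity.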