5th_ASM_Sal: Genomic population structure of Salmonella enterica by Zhemin
Background
High-throughput sequencing is now routinely used to type many pathogens, including S. enterica. There are over 50,000 sets of Salmonella reads in public sequence repositories. These data could provide a basis for a global perspective on this species. However, such analyses are confounded by a paucity of standardized typing approaches.
In 2012 we applied 7-gene Multilocus Sequence Typing (MLST) to group the majority of the typed S. enterica isolates into 138 independent genetic clusters, termed eBurstGroups (eBGs), of closely related sequence types (STs). However, MLST lacks resolution below the ST level and cannot reliably resolve deep evolutionary relationships. Managing and comparing genomic data in a scalable way requires more advanced frameworks that describe both the global population and local variation.
Methods
To address these issues, we have developed automatic pipelines within EnteroBase that assemble genomes from reads held in public sequence repositories or uploaded by registered users. These pipelines not only derive classical MLST eBGs from all assembled genomes of adequate quality, but also extend to more discriminatory schemes: ribosomal MLST (rMLST; 51 genes), core genome MLST (cgMLST; 3,002 genes) and whole-genome MLST (wgMLST; 21,065 genes). We have also calculated species trees from core genes in S. enterica using two independent strategies, in order to reconstruct its evolutionary history.
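The schemes above all describe a genome as a profile of allele numbers, one per gene, and differ mainly in how many genes they cover. As a minimal sketch of how two such profiles can be compared (not the EnteroBase implementation), the Python below counts the genes at which two isolates carry different alleles. The function name and the allele numbers are hypothetical and chosen only for illustration; the seven gene names are those of the classical Salmonella MLST scheme.

```python
from typing import Dict, Optional

# A profile maps gene name -> allele number; None marks a gene with no call.
Profile = Dict[str, Optional[int]]


def allelic_distance(a: Profile, b: Profile) -> int:
    """Count genes called in both profiles whose allele numbers differ."""
    distance = 0
    for gene, allele_a in a.items():
        allele_b = b.get(gene)
        # Skip genes that are missing in either profile.
        if allele_a is None or allele_b is None:
            continue
        if allele_a != allele_b:
            distance += 1
    return distance


# Toy 7-gene profiles; the allele numbers are made up for illustration.
isolate_a: Profile = {"aroC": 1, "dnaN": 2, "hemD": 3, "hisD": 4,
                      "purE": 5, "sucA": 6, "thrA": 7}
isolate_b: Profile = {"aroC": 1, "dnaN": 2, "hemD": 3, "hisD": 4,
                      "purE": 9, "sucA": 6, "thrA": None}

print(allelic_distance(isolate_a, isolate_b))
```

The same function applies unchanged to rMLST, cgMLST or wgMLST profiles, since only the number of keys in the dictionaries changes. Ignoring genes without a call is one possible convention for incomplete assemblies, assumed here for simplicity.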
Results
The Salmonella database in EnteroBase serves genotyping data for >47,000 genome assemblies along with 7,000 records from the legacy MLST database. The genomic data define >3,000 rMLST STs (rSTs) and 381 rMLST eBGs (reBGs), which are consistent with legacy eBGs but more discriminatory. More than 90% of the genomes have been assigned to an reBG. Strains within each reBG are uniform in serovar. A species tree of one genome per rST was consistent with the reBGs and contained clear signals of deep phylogenetic structure. Furthermore, fine-grained genetic structures within reBGs were largely resolved with a novel cgMLST scheme. This scheme offers resolution comparable to that of SNP-based analyses while being much more standardized and portable, and it is increasingly used for epidemiological analyses. Using online tools in EnteroBase, users can easily map any isolate onto global, high-definition perspectives of reBGs that were previously unresolvable, such as reBG1 (Typhimurium) or reBG4.1 (Enteritidis).
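Grouping closely related sequence types into clusters such as eBGs or reBGs can be pictured as single-linkage clustering on allelic differences. The Python sketch below illustrates that general idea only: this abstract does not state the rules used to define reBGs, so the function name, the threshold `max_diff` and the toy profiles are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def single_linkage_groups(profiles: Dict[str, Tuple[int, ...]],
                          max_diff: int = 1) -> List[List[str]]:
    """Group STs linked by chains of pairs differing at <= max_diff genes."""
    parent = {name: name for name in profiles}

    def find(x: str) -> str:
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        diff = sum(1 for x, y in zip(profiles[a], profiles[b]) if x != y)
        if diff <= max_diff:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical 7-gene profiles; the ST names and allele numbers are made up.
toy_profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 3, 2),
    "ST_D": (9, 9, 9, 9, 9, 9, 9),
}

print(single_linkage_groups(toy_profiles))
```

In this toy data ST_A and ST_C differ at two genes, yet they fall into the same group because each differs from ST_B at only one gene. ST_D shares no alleles with the others and remains a group of its own.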
Conclusion
EnteroBase provides access to high-resolution genotyping data (MLST, rMLST and cgMLST) and visualization tools, allowing microbiologists to investigate the genomic relationships between all Salmonella serovars of clinical significance through an easy-to-use web interface. We anticipate that it will result in a transformational change in genotypic designations and global communication. EnteroBase is available at http://enterobase.warwick.ac.uk.