ISMB 2021 Talk: A machine learning framework for discovering and enriching metagenomics metadata from open access research articles

Nassar, Maaly; Finn, Robert D.; McEntyre, Johanna; Europe PMC

doi:10.6084/m9.figshare.15077985.v3


presentation

posted on 2021-08-05, 14:07 authored by Maaly Nassar, Robert D. Finn, Johanna McEntyre, Europe PMC

Metagenomics is a culture-independent approach for studying the microbes inhabiting a particular environment. Comparing the composition of samples (functionally and/or taxonomically), either from a longitudinal study or between independent studies, can provide clues as to how the microbiota have adapted to a particular environment. However, to understand the impact of environmental factors on the microbiome, it is also important to account for experimental confounding factors. Metagenomics databases, such as MGnify, provide analytical services that enable consistent functional and taxonomic annotation, mitigating bioinformatic confounding factors. However, a recurring challenge is that key metadata about the sample (e.g. location, pH) and the molecular methods used to extract and sequence the genetic material are often missing from the sequence records. This missing metadata may nevertheless be found in the publications describing the research. When identified, the additional metadata can lead to a substantial increase in data reuse and greater confidence in the interpretation of observed biological trends. Here, we describe a machine learning framework that automatically extracts relevant metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework includes three processes: (1) literature classification and triage, (2) named entity recognition (NER), and (3) database enrichment.
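
To make the three-stage flow concrete, the following is a minimal sketch of how triage, entity extraction, and enrichment could be chained. It is not the framework described in the talk: a keyword count stands in for the trained classifier, regular expressions stand in for the NER model, and every name here (METAGENOMICS_TERMS, ENTITY_PATTERNS, is_relevant, extract_entities, enrich, the example record) is a hypothetical placeholder chosen for illustration.

```python
import re

# Assumption: a keyword list standing in for a trained literature classifier.
METAGENOMICS_TERMS = ("metagenom", "microbiome", "microbiota", "16s rrna")

# Assumption: regex patterns standing in for a trained NER model.
ENTITY_PATTERNS = {
    "pH": re.compile(r"\bpH\s*(?:of|=|was)?\s*(\d+(?:\.\d+)?)"),
    "sequencing_platform": re.compile(r"\b(Illumina \w+|Oxford Nanopore|PacBio)\b"),
}


def is_relevant(text, min_hits=2):
    """Stage 1 (triage): keep articles that mention enough metagenomics terms."""
    lowered = text.lower()
    return sum(term in lowered for term in METAGENOMICS_TERMS) >= min_hits


def extract_entities(text):
    """Stage 2 (NER stand-in): return {entity_type: [mentions]} found in the text."""
    found = {}
    for label, pattern in ENTITY_PATTERNS.items():
        mentions = pattern.findall(text)
        if mentions:
            found[label] = mentions
    return found


def enrich(record, entities):
    """Stage 3 (enrichment): fill only fields that are missing from the record."""
    enriched = dict(record)
    for label, mentions in entities.items():
        if enriched.get(label) is None:
            enriched[label] = mentions[0]
    return enriched


# Invented example text and record, used only to exercise the sketch.
article = (
    "We profiled the soil microbiome using shotgun metagenomics. "
    "Samples had a pH of 6.8 and were sequenced on an Illumina MiSeq."
)
record = {"study": "EXAMPLE_STUDY", "pH": None, "sequencing_platform": None}

if is_relevant(article):
    record = enrich(record, extract_entities(article))
print(record)
```

In this sketch, enrichment never overwrites a value already present in the record, so metadata submitted with the sequence data takes precedence over anything mined from the text. Whether the actual framework makes the same choice is not stated in the abstract.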