Condensed Microbiome Representation using Transfer and Deep Learning to Promote Microbial Composition Prediction

posted on 2019-11-19, 15:00 authored by Sara Cabello, Beatriz Garcia-Jimenez, Mark Wilkinson
  • Advances in Computational Biology Conference (
  • November 2019
  • At: Barcelona

Data produced by metagenomic studies has multiple layers of complexity. Even 16S taxonomic analyses result in high-dimensional, extremely complex data that thwarts knowledge discovery. In this study, we describe a strategy to reduce the dimensionality of microbiome datasets, such that they can be interrogated and explored more easily.


This work brings together deep learning techniques and microbiome data. We selected a particular type of artificial neural network, an autoencoder, to condense a long vector of values into a short vector (i.e. an encoded representation) that is more amenable to various kinds of analysis. In this case, the long vector of values describes a microbiome sample. We further show that we can recover the original vector from the encoded representation with high fidelity.
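The encode/decode idea can be illustrated with a minimal sketch. It is not the model from the poster: the architecture (one tanh encoding layer, one linear decoding layer), the plain gradient-descent training loop, the synthetic data and all names (`train`, `encode`, `decode`, the sizes 20 and 2) are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np

def encode(p, X):
    """Map each long input vector to a short code."""
    return np.tanh(X @ p["W_enc"] + p["b_enc"])

def decode(p, Z):
    """Map each short code back to a vector of the original length."""
    return Z @ p["W_dec"] + p["b_dec"]

def train(X, code_size, epochs=1000, lr=0.05, seed=0):
    """Fit the autoencoder by gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    p = {
        "W_enc": rng.normal(scale=0.1, size=(n_features, code_size)),
        "b_enc": np.zeros(code_size),
        "W_dec": rng.normal(scale=0.1, size=(code_size, n_features)),
        "b_dec": np.zeros(n_features),
    }
    losses = []
    for _ in range(epochs):
        Z = encode(p, X)
        err = decode(p, Z) - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through decoder and encoder.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ p["W_dec"].T) * (1.0 - Z ** 2)  # derivative of tanh
        grads = {
            "W_dec": Z.T @ g_out,
            "b_dec": g_out.sum(axis=0),
            "W_enc": X.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
        }
        for name in p:
            p[name] -= lr * grads[name]
    return p, losses

# Synthetic stand-in for abundance data: 200 samples, 20 features, 2 hidden factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

params, losses = train(X, code_size=2)
codes = encode(params, X)      # short representation, shape (200, 2)
X_rec = decode(params, codes)  # recovered long vectors, shape (200, 20)
print("reconstruction error, first vs last epoch:", losses[0], losses[-1])
```

Downstream analyses would operate on `codes`, calling `decode` only when the full-length vector is needed.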


We transfer knowledge from a previously published dataset of around 5000 maize root microbiome samples into our autoencoder model, which returns a code of 6 rational numbers representing the information contained in the long vector of 717 taxa that describes the microbial composition of those samples. After decoding, we are able to predict 458 of those taxa with a Pearson correlation greater than 0.5, with 0.77 being the average. This compressed representation opens up many novel possibilities for microbiome data analysis, particularly with respect to knowledge retrieval and visualization. The autoencoder structure makes it possible to recover the complete abundance vector from the codified samples, so that all analyses can be performed on the reduced coded data and the long vector recovered only when required. For example, we apply our encoded microbiome to a novel scenario, showing that we are able to predict the final microbial composition (717 taxa, after recovery of the original vector) of maize root microbiome samples using only a few available environmental variables such as plant age, temperature or precipitation. We achieve an average mean squared error of 0.0018, which is more accurate than predictions made without our encoded model.
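The kind of evaluation reported above (a Pearson correlation per taxon between original and recovered abundances, plus a mean squared error) can be sketched as follows. The function names, the tiny example matrices and the handling of constant taxa are assumptions made for illustration. The source does not say which set of taxa the average correlation is taken over; the sketch takes it over the taxa above the threshold, which is only one possible reading.

```python
import numpy as np

def per_taxon_pearson(original, recovered):
    """Pearson r between original and recovered abundances, one value per taxon (column)."""
    o = original - original.mean(axis=0)
    r = recovered - recovered.mean(axis=0)
    num = (o * r).sum(axis=0)
    denom = np.sqrt((o ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    # A taxon that is constant in either matrix has no defined correlation: report NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, num / denom, np.nan)

def summarize(original, recovered, threshold=0.5):
    """Count taxa recovered above the threshold and compute the overall MSE."""
    r = per_taxon_pearson(original, recovered)
    well = np.nan_to_num(r, nan=-1.0) > threshold  # NaN never counts as recovered
    return {
        "n_well_recovered": int(well.sum()),
        # Assumption: the mean is taken over the well-recovered taxa only.
        "mean_r_well_recovered": float(r[well].mean()) if well.any() else float("nan"),
        "mse": float(np.mean((original - recovered) ** 2)),
    }

# Rows are samples, columns are taxa (hypothetical values).
original = np.array([[0.1, 0.5, 0.0],
                     [0.2, 0.4, 0.0],
                     [0.3, 0.3, 0.0],
                     [0.4, 0.2, 0.0]])
recovered = np.array([[0.12, 0.2, 0.01],
                      [0.18, 0.5, 0.00],
                      [0.33, 0.3, 0.02],
                      [0.41, 0.4, 0.00]])

r = per_taxon_pearson(original, recovered)
stats = summarize(original, recovered)
print(r, stats)
```

In this toy example only the first taxon is recovered with a correlation above 0.5; the second is negatively correlated and the third is absent from every original sample, so its correlation is undefined.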

Conclusions and Further work:

This condensed representation could be applied to any environment (gut, ocean, urban soil, etc.) for which a representative set of samples is available. The contributions of our proposed microbiome autoencoder include: a) a novel dimensionality reduction approach that represents a long taxa vector as fewer than ten values; b) the ability to undertake challenging tasks in microbiome data analysis, such as predicting the microbial composition of hundreds of taxa from a small number of features, rather than the more common (and simpler) task of predicting a phenotypic feature of the microbiome-associated host (e.g. age of the plant, productivity or disease) from hundreds of taxa; c) the ability to reuse the encoded version of a microbiome, via transfer learning, in novel but related studies, allowing complex analyses to be undertaken with fewer de novo sequencing samples. The knowledge encoded within the microbiome autoencoder model can be applied to samples from a similar environment, enabling inferences or predictions in studies that would otherwise have insufficient power.
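Contribution (b) can be pictured as a two-step pipeline: map a few environmental variables to the short code, then decode the code into the full composition. The sketch below assumes a least-squares linear map for the first step and uses a fixed random matrix as a stand-in for a trained decoder; the source does not describe the actual predictor, and all names, sizes and data here are hypothetical. Because the synthetic data are noise-free and exactly linear, the near-perfect fit says nothing about the difficulty of the real task.

```python
import numpy as np

def add_intercept(env):
    """Append a column of ones so the linear map has an intercept term."""
    return np.column_stack([env, np.ones(len(env))])

def fit_env_to_code(env, codes):
    """Least-squares linear map from environmental variables to latent codes."""
    coef, *_ = np.linalg.lstsq(add_intercept(env), codes, rcond=None)
    return coef

def predict_composition(env, coef, decode):
    """Predict the code from the environment, then decode it to a full composition."""
    return decode(add_intercept(env) @ coef)

rng = np.random.default_rng(0)
n_samples, n_env, code_size, n_taxa = 50, 3, 2, 8

# Hypothetical environmental variables: plant age, temperature, precipitation.
env = rng.uniform(low=[0, 10, 0], high=[100, 30, 50], size=(n_samples, n_env))

# Stand-in for a trained decoder: a fixed linear map from code to taxa.
D = rng.normal(size=(code_size, n_taxa))

def decode(codes):
    return codes @ D

# Synthetic "true" codes that depend linearly on the environment.
true_codes = env @ rng.normal(scale=0.01, size=(n_env, code_size)) + 0.1
composition = decode(true_codes)

train, test = slice(0, 40), slice(40, None)
coef = fit_env_to_code(env[train], true_codes[train])
predicted = predict_composition(env[test], coef, decode)
mse = float(np.mean((predicted - composition[test]) ** 2))
print("held-out MSE on synthetic data:", mse)
```

Only the small map from environment to code is fitted on the new samples; the decoder is reused unchanged, which is the sense in which a model trained on one dataset could be carried over to a related study.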


Research was supported by the “Severo Ochoa Program for Centres of Excellence in R&D” from the Agencia Estatal de Investigación of Spain (grant SEV-2016-0672 (2017-2021)) to the CBGP. BGJ was supported by a Postdoctoral contract associated with the Severo Ochoa Program.