Bioinformatics pipeline for exome sequencing of complex diseases: variant discovery in 400 exomes

Perez-Gil, Daniel; Sanchis-Juan, Alba; Galán-Chilet, Inmaculada; Martínez-Barquero, Vanesa; DeMarco-Solar, Griselda; Pérez-Soriano, Cristina; García-García, Ana Bárbara; Marin-Garcia, Pablo; Chaves, Felipe Javier

doi:10.6084/m9.figshare.1250001.v2

Version 2 2014-11-24, 12:53

Version 1 2014-11-24, 09:35

poster

posted on 2014-11-24, 09:35 authored by Daniel Perez-Gil, Alba Sanchis-Juan, Inmaculada Galán-Chilet, Vanesa Martínez-Barquero, Griselda DeMarco-Solar, Cristina Pérez-Soriano, Ana Bárbara García-García, Pablo Marin-Garcia, Felipe Javier Chaves

Targeted exome sequencing by massively parallel sequencing is a powerful and affordable way to survey small to large portions of the genome for genetic variation. This strategy is an effective tool for analyzing the genetic basis of diseases and traits that cannot be assessed with conventional gene-discovery strategies. In addition, exome sequencing has become an attractive approach for exploring the contribution of rare alleles to the heritability of complex diseases. All these features set the stage for using exome sequencing in clinical diagnosis and personalized disease-risk profiling.

Identifying genetic variants from raw sequence data involves quality assessment and many processing steps. The main building blocks of a bioinformatics pipeline that calls variants from raw massively parallel sequencing data include quality control, pre-processing, mapping of reads to the reference genome, post-processing, variant calling and visualization. The final steps of the pipeline are the prediction of each variant's effect on the encoded protein and the filtering of candidate variants (prioritization).
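
As an illustration only, the stage order described above can be sketched as a small Python runner. This is not the authors' implementation: the `Variant` fields, the `run_pipeline` and `prioritize` helpers, the effect labels and the 1% allele-frequency cutoff are all hypothetical names and values chosen for the example, and every stage without a supplied function is a pass-through placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Stage names follow the order given in the text. They are labels only;
# no real tool is invoked by this sketch.
PIPELINE_STAGES = [
    "quality_control",
    "pre_processing",
    "mapping",
    "post_processing",
    "variant_calling",
    "visualization",
    "effect_prediction",
    "prioritization",
]


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record used only for this example."""
    chrom: str
    pos: int
    ref: str
    alt: str
    effect: str              # assumed label, e.g. "missense" or "synonymous"
    allele_frequency: float  # assumed population allele frequency


def prioritize(
    variants: List[Variant],
    max_af: float = 0.01,  # assumed cutoff, not taken from the poster
    effects: Tuple[str, ...] = ("missense", "stop_gained", "frameshift"),
) -> List[Variant]:
    """Keep rare variants whose predicted effect is in the allowed set."""
    return [
        v for v in variants
        if v.allele_frequency <= max_af and v.effect in effects
    ]


def run_pipeline(stage_functions: Dict[str, Callable], data):
    """Apply each stage in order; stages without a function pass data through."""
    log = []
    for name in PIPELINE_STAGES:
        step = stage_functions.get(name, lambda x: x)
        data = step(data)
        log.append(name)
    return data, log


# Toy input standing in for the output of variant calling and annotation.
variants = [
    Variant("1", 101, "A", "G", "missense", 0.002),
    Variant("1", 202, "C", "T", "synonymous", 0.001),
    Variant("2", 303, "G", "A", "missense", 0.25),
]

kept, log = run_pipeline({"prioritization": prioritize}, variants)
```

In a real pipeline each placeholder would wrap the corresponding tool; the sketch only shows that the stages form an ordered chain in which prioritization is the final filter on the annotated variants.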

We have implemented a pipeline for exome sequencing and evaluated its performance for variant discovery by analyzing 400 exomes (200 of them from individuals with type 2 diabetes). To implement the pipeline we used well-established genome analysis software as well as applications developed in our lab, all of them open source.

Here we describe the 400-exome case study as a general example of how the proposed pipeline is executed.