
1000 Genomes gVCFs

Version 3 2019-04-11, 16:26
Version 2 2019-04-09, 15:31
Version 1 2019-03-21, 17:17
Posted on 2019-04-11 - 16:26 authored by Jonathan Pevsner

Resource


Genomic VCF (gVCF) files from 1000 Genomes, aligned to hs37d5, to supplement studies with fewer samples. They can be used directly as input to GATK GenotypeGVCFs or Sentieon Genotyper. Programmatic access is available at https://github.com/sean-cho/figshare_onekg.



Query


Specific gVCFs can be queried through the tags and keywords. For example, to query all CEU individuals, enter "population:CEU" into the search bar.


Accepted query values:

Complete population list: http://www.internationalgenome.org/category/population/


population: CEU, YRI, ...

superpopulation: EUR, EAS, ...

sex: male, female

sample_id: NA12878, ...

filetype: gVCF


Examples:


Search for all CEU:

population:CEU


Search for females in CEU or GBR:

(:keyword: population:CEU OR :keyword: population:GBR) AND :keyword: sex:female
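The combined query above can also be assembled programmatically. The following is a minimal sketch in Python that only builds the search string; the helper names `keyword_term` and `build_query` are hypothetical and are not part of figshare or of the figshare_onekg repository. It assumes every term uses the `:keyword:` prefix shown in the second example.

```python
def keyword_term(field, value):
    """Format one tag filter in the ':keyword: field:value' form used above."""
    return f":keyword: {field}:{value}"


def build_query(populations, sex=None):
    """Build a search string matching any of the given populations,
    optionally restricted to one sex (hypothetical helper)."""
    if not populations:
        raise ValueError("at least one population is required")
    terms = [keyword_term("population", p) for p in populations]
    query = " OR ".join(terms)
    if sex is not None:
        # Parenthesize the OR group so that AND applies to the whole group.
        if len(terms) > 1:
            query = f"({query})"
        query = f"{query} AND {keyword_term('sex', sex)}"
    return query


# Reproduces the "females in CEU or GBR" example.
print(build_query(["CEU", "GBR"], sex="female"))
```

The resulting string can be pasted into the search bar in the same way as the hand-written examples.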



Motivation


HaplotypeCaller followed by joint genotyping is a scalable and accurate variant calling approach introduced by the Genome Analysis Toolkit (GATK) and implemented by the Sentieon DNAseq suite. In this model, genotyping accuracy improves as the number of samples increases.


Genomic variant call format (gVCF) files are the inputs to joint genotyping in both GATK and Sentieon. However, there is currently no public gVCF resource. We generated gVCFs from 1000 Genomes data with the aim of facilitating variant discovery in whole genome sequencing (WGS) studies with limited numbers of samples.



Data Set


This resource is derived from the Phase 3 data of the 1000 Genomes Project, which sampled phenotypically normal individuals. The data set consists of 2,530 gVCFs aligned to the hs37d5 (GRCh37-based) genome build. These gVCFs were generated from Illumina paired-end sequencing FASTQs hosted on the 1000 Genomes AWS Collection (https://aws.amazon.com/1000genomes/). Detailed sample information can be found at the 1000 Genomes FTP site, specifically the Phase 3 analysis sequence index file (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.analysis.sequence.index).


These gVCFs are compatible with both GATK and Sentieon tools.
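As an illustration of how a downloaded gVCF might be combined with a study sample, the sketch below assembles GATK command lines in Python without running them. It assumes GATK4, where GenotypeGVCFs takes a single input, so the gVCFs are first merged with CombineGVCFs. All file names and the helper `joint_genotyping_commands` are hypothetical, and this is not necessarily the workflow the authors used.

```python
import shlex


def joint_genotyping_commands(reference, gvcfs,
                              combined="cohort.g.vcf.gz",
                              output="cohort.vcf.gz"):
    """Return (combine_cmd, genotype_cmd) as argument lists.

    Assumes a GATK4 installation exposing the `gatk` wrapper script.
    """
    # Step 1: merge per-sample gVCFs into one multi-sample gVCF.
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["--variant", gvcf]
    combine += ["-O", combined]

    # Step 2: joint-genotype the combined gVCF.
    genotype = ["gatk", "GenotypeGVCFs", "-R", reference,
                "-V", combined, "-O", output]
    return combine, genotype


# Hypothetical inputs: one study sample plus one 1000 Genomes gVCF.
for cmd in joint_genotyping_commands(
        "hs37d5.fa", ["my_sample.g.vcf.gz", "NA12878.g.vcf.gz"]):
    print(shlex.join(cmd))
```

The study sample must be aligned to the same hs37d5 reference for the merge to be meaningful.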



Methods


We used Sentieon best practices, which mirror those of GATK, on the Seven Bridges Genomics (SBG) Cancer Genomics Cloud (CGC) to generate the gVCFs.


Variant calling. FASTQs were mapped to the hs37d5 reference FASTA using bwa-mem. Using Sentieon DNAseq on the Seven Bridges Genomics platform, we performed (1) base quality score recalibration, (2) indel realignment, (3) variant calling using Haplotyper to generate gVCFs, and (4) joint calling using Genotyper.


VCF evaluation. The multisample VCF from Genotyper was assessed using vcfstats from VCFtools, and outlier samples were removed from the final data set.


Benchmarking. We used HG001 and HG002 (trio) from the Genome in a Bottle consortium as gold-standard datasets for benchmarking. Using RTG Tools vcfeval and hap.py, we assessed variant calling metrics (F-measure) at three depths of sequencing (10x, 30x, and 50x) obtained through downsampling.
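For reference, the F-measure is the harmonic mean of precision and recall, computed from true positive, false positive, and false negative call counts. The sketch below shows the arithmetic in Python; the function name `f_measure` is hypothetical and the counts in the usage line are made-up illustrative values, not results from this study.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from call counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not taken from the benchmark).
print(f"{f_measure(tp=90, fp=10, fn=10):.3f}")
```

Tools such as vcfeval and hap.py report these quantities directly, so a helper like this is only needed when recomputing the metric from raw counts.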



Results & Utility


Cohort of unrelated individuals. We showed that variant calling performance improves with an increasing number of samples at low depth of sequencing for HG001. At 10x depth of sequencing, we observed a modest gain in F-measure for SNVs, and improvements for indels up to 30x depth of sequencing.


Trio setting. Most of the improvement from joint genotyping was observed with the inclusion of the parental gVCFs. The only improvement observed was for indels at 10x depth of sequencing. This is expected: because each parent shares ~50% of their variants with the child, including the parental gVCFs increases the prior probability that these variants are called in the child’s sample.


Geographic origin. When using gVCFs of samples from a different geographic origin for joint calling (e.g. when using YRI as opposed to CEU in joint calling for HG001 and HG002 which are CEU), the increase in F-measure is more gradual, and does not achieve the same maximal performance.



Resources

1000 Genomes: http://www.internationalgenome.org/

License

This work is licensed under a CC BY 4.0 license.


FUNDING

This research benefited from the use of credits from the National Institutes of Health (NIH) Cloud Credits Model Pilot, a component of the NIH Big Data to Knowledge (BD2K) program.

Role of somatic mosaicism in autism, schizophrenia, and bipolar disorder brain

National Institute of Mental Health
