
Corpas Family Comparison 23andMe vs Exome

Version 2 2014-11-24, 14:51
Version 1 2014-11-24, 14:36
dataset
posted on 2014-11-24, 14:36 authored by Manuel Corpas, Bastian Greshake Tzovaras

# SNPs Found in Both the 23andMe Data and the Exome SNP Calls, per Family Member
* daughter: 36315 SNPs
* father: 39624 SNPs
* mother: 40234 SNPs
* son: 22233 SNPs (the lower count suggests the v2 chip has an impact)

 

# SNP Calling Quality
Most 'non-concordant' SNPs have a low coverage of less than 10x. At this stage the set is not yet filtered for haploid loci that are given as diploid in the VCF, or for positions where 23andMe simply used the opposite strand for the prediction. See the attached distribution graphs for details.

Illumina's 'Calling Sequencing SNPs' document (http://res.illumina.com/documents/products/technotes/technote_snp_caller_sequencing.pdf) says that at a coverage of 20x genotypes are only 95% certain, and that this rises to 99% at 30x coverage. I therefore only looked at non-concordant SNPs with a coverage of at least 30x, because for all others the non-concordance most likely arises from a lack of coverage.
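The coverage cut-off described above could be sketched roughly as follows. This is only an illustration, not the script used for this dataset: the `Call` record, its field names and the `high_coverage_discordant` helper are assumptions, and the example records are placeholders rather than real calls.

```python
from typing import List, NamedTuple

class Call(NamedTuple):
    """One SNP seen in both data sets (hypothetical record layout)."""
    rsid: str
    exome_gt: str   # genotype from the exome VCF, e.g. "AG"
    chip_gt: str    # genotype from the 23andMe file, e.g. "AG"
    depth: int      # read depth (coverage) at this position in the exome data

MIN_DEPTH = 30  # cut-off taken from the text above

def same_genotype(a: str, b: str) -> bool:
    # Allele order carries no meaning here, so "AG" and "GA" are treated as equal.
    return sorted(a) == sorted(b)

def high_coverage_discordant(calls: List[Call], min_depth: int = MIN_DEPTH) -> List[Call]:
    """Keep only non-concordant calls with at least `min_depth` coverage."""
    return [
        c for c in calls
        if not same_genotype(c.exome_gt, c.chip_gt) and c.depth >= min_depth
    ]

# Placeholder records, made up purely for illustration.
example = [
    Call("snp_a", "AG", "GA", 50),  # concordant, dropped
    Call("snp_b", "AA", "AG", 8),   # non-concordant but low coverage, dropped
    Call("snp_c", "AA", "AG", 46),  # non-concordant with enough coverage, kept
]
kept = high_coverage_discordant(example)
```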

# Filtering Non-Concordant SNPs
First I removed two kinds of SNPs that only look non-concordant. The first are SNPs where the 23andMe and the Exome calls are identical in principle, but 23andMe uses the opposite strand for the SNP prediction (e.g. the Exome says the genotype is AG and 23andMe says it is TC). The second are haploid SNPs (e.g. on the X/Y chromosomes for males) where the haploid prediction of 23andMe matches the diploid exome prediction (for some reason haploid loci are called as homozygous diploid loci in the Exome VCF).
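The two filters above can be illustrated with a small sketch. It is not the code used for this dataset, and the function names are made up for the example. It assumes genotypes are plain strings over A, C, G and T, and treats anything else (for instance no-calls) as not explained.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def normalize(gt: str) -> str:
    # Allele order is not meaningful, so compare genotypes in sorted order.
    return "".join(sorted(gt))

def is_strand_flip(exome_gt: str, chip_gt: str) -> bool:
    """True if the chip genotype equals the exome genotype on the opposite strand."""
    if any(base not in COMPLEMENT for base in exome_gt + chip_gt):
        return False  # no-calls and other symbols are never treated as flips
    flipped = "".join(COMPLEMENT[base] for base in chip_gt)
    return normalize(flipped) == normalize(exome_gt)

def is_haploid_match(exome_gt: str, chip_gt: str) -> bool:
    """True if a haploid chip call matches a homozygous diploid exome call."""
    return len(chip_gt) == 1 and exome_gt == chip_gt * 2

def is_explained(exome_gt: str, chip_gt: str) -> bool:
    """True if an apparent mismatch is only a strand or ploidy difference."""
    return is_strand_flip(exome_gt, chip_gt) or is_haploid_match(exome_gt, chip_gt)
```

With the example from the text, `is_explained("AG", "TC")` is true because TC read on the opposite strand is AG, and a haploid call such as `is_explained("GG", "G")` is true as well. A genuine mismatch such as `is_explained("AA", "AG")` stays in the list.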

## Closer Look at the Resulting SNP Subset
I then had a detailed look at the resulting list of SNPs and compared the non-concordant SNPs in the family context. For most errors it's impossible to say whether the 23andMe data or the Exome data is wrong. This is either because the SNPs for the relatives are non-concordant as well or because they are missing. In some cases no clear Mendelian inheritance error can be found because both inferred genotypes are compatible with the observed family tree. There are three exceptions to this:

rs1056806: For the mother the exome SNP calling gives the genotype as TT while 23andMe gives it as CT. The father's genotype is concordantly CC in both data sets and the daughter's genotype is concordantly CC as well. A CC daughter must have inherited a C from her mother, so the exome call for the mother seems to be wrong. Puzzlingly, the C allele was observed 19 out of 96 times in the mother's exome data, but the genotype was still called as TT.

rs3749488: The genotype for the father is given as CC in the 23andMe data and as AC in the exome SNP calling. The mother's genotype is given as AC in both data sets. Both children have concordant genotypes of AA, which requires an A allele from the father. Thus the father's 23andMe genotype call of CC seems to be wrong and the genotype of AC given by the exome data seems to be the right one.

rs1926736: For the daughter the exome SNP calling gives the genotype as AA, while 23andMe gives it as AG. For the father and the mother the exome and 23andMe data agree: the father's genotype is GG and the mother's genotype is AG. Since the daughter must have inherited a G from her father, her genotype in the exome SNP calls seems to be wrong. The coverage seems to be okay at 46x, but only one allele was observed, so this is either a mapping artifact or an unlikely case where only a single allele was sequenced.
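The reasoning in these three cases is a Mendelian consistency check: a child's genotype must be formed from one allele of each parent. A minimal sketch of such a check, assuming autosomal diploid genotypes written as two-letter strings (the helper name `mendelian_consistent` is made up for this example), could look like this:

```python
from itertools import product

def mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if `child` can be formed from one allele of each parent.

    Assumes autosomal, diploid genotypes such as "AG"; allele order is ignored.
    """
    possible = {"".join(sorted(f + m)) for f, m in product(father, mother)}
    return "".join(sorted(child)) in possible

# The three cases from the text, checked with each of the two competing calls.
# rs1056806: father CC, daughter CC; mother is TT (exome) or CT (23andMe).
rs1056806 = (mendelian_consistent("CC", "TT", "CC"),
             mendelian_consistent("CC", "CT", "CC"))
# rs3749488: mother AC, children AA; father is CC (23andMe) or AC (exome).
rs3749488 = (mendelian_consistent("CC", "AC", "AA"),
             mendelian_consistent("AC", "AC", "AA"))
# rs1926736: father GG, mother AG; daughter is AA (exome) or AG (23andMe).
rs1926736 = (mendelian_consistent("GG", "AG", "AA"),
             mendelian_consistent("GG", "AG", "AG"))
```

In each pair the first call breaks the family tree and the second one fits it, which is why the family data can point to the data set with the wrong call. When both candidate genotypes pass the check, as in most of the other non-concordant SNPs, the trio cannot decide between them.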

Besides the Coverage Distribution Graphs I've also attached the list of non-concordant SNPs and the genotypes for each family member as inferred from the 23andMe and the Exome data. Of most interest will be the last 8 columns. For each family member the Exome genotype (including coverage etc.) from the VCF is given along with the genotype as inferred by 23andMe. Fields with a red background show non-concordant genotype calls.
Fields with a yellow background seem non-concordant, but are probably also just cases where 23andMe used the opposite strand for one allele, e.g. rs11580218: the father's genotype in the exome is given as GA and in 23andMe it's GT. According to dbSNP the only known alleles are G & A. So the "T" call probably just uses the "wrong" strand with respect to dbSNP.
Fields with a green background show where the family data can be used to infer which data set gives the wrong call for a given family member.

 
