NNaall_positions.tsv.gz (208.89 MB)

Variants and annotations of Neandertals

dataset

posted on 2019-05-31, 11:50 authored by Martin Kuhlwilm

Catalog of changes between modern humans and at least one of the Altai Neandertal, Vindija Neandertal, and Denisovan genomes. This file contains all genome-wide differences, information on their abundance in humans, and annotations.

The columns contained in the tab-separated file are:

"POS": chromosome:Position

"human_DAF": Human derived allele frequency #comment1
"Gene_name": Gene name
"REF": Human reference allele
"ALT": Alternative allele #comment2
"CAnc": Human-chimp ancestral allele #comment3
"Altai_GT": Altai Neandertal genotype (0=REF,1=ALT,0.5=het) #comment4
"Altai_filt": Altai filter passing (0=FALSE, 1=TRUE) #comment5
"Altai_allele": Altai allele (REF if GT==0, ALT if GT==0.5 or GT==1)
"Vindija_GT": Vindija genotype
"Vindija_filt": Vindija filter passing
"Vindija_allele": Vindija allele
"Denisova_GT": Denisova genotype
"Denisova_filt": Denisova filter passing
"Denisova_allele": Denisova allele
"Archaic_coverage": Coverage across archaic individuals #comment6
"feature": ENSEMBL features #comment7
"consequence": VEP consequence prediction #comment8
"dbSNP": dbSNP entry
"Thkg_alt_allele": 1000Genomes project non-reference allele (*=no variable allele, only REF found)
"Thkg_freq": 1000Genomes allele frequency
"Chimp_frequency": Frequency in chimpanzee/bonobo #comment9
"impact": VEP impact predicton
"CADD": C-score deleteriousness
"GWAVA": GWAVA deleteriousness score
"HGNC": HGNC identifier
"CCDS": CCDS identifier
"Coding_info": Information for coding sites (cDNA/protein postion,amino acid change, codons) #comment10
"SIFT": SIFT deleteriousness prediction for coding sites
"PolyPhen": PolyPhen deleteriousness prediction for coding sites
"other_feature": TSSDistance=Distance to Transcription Start Site; GENE_PHENO=gene associated with phenotype/disease/trait (from ENSEMBL VEP); PHENO=variant is associated with phenotype/disease/trait; BIOTYPE=biotype of transcript defined by ENSEMBL
"Thkg_subset": More detailed frequency information of 1000Genomes continental populations #comment11
"ExAC_allele": Minor allele observed in ExAC data (exome) #comment12
"ExAC_freq": Minor allele frequency in ExAC data
"other_variation": Detailed ExAC and NHLBI-ESP frequency information #comment13
"Capture_info": Information about individuals from capture experiment #comment14
"Arch_freq": Frequency of derived allele across three archaic individuals
"Arch_filter": Frequency of derived allele across three archaic individuals after filtering

# comment1: When marked with "*", the ancestral allele is uncertain.
# comment2: In some rare cases, across the three archaics, there are different alleles. Then the site would be considered tri-allelic and the ALT field contains, for example, "A,G".
# comment3: There may be "N" and "." for different states of missing CAnc.
# comment4: At tri-allelic sites, 1 refers to the alternative allele noted two columns further; 0.5 is mostly REF/ALT, but in very rare cases two different alternative alleles (ALT1/ALT2); in such cases the information would be incomplete in this table, but can be found in the original genotype calls (http://cdna.eva.mpg.de/neandertal/Vindija/VCF/). The information for the other archaic individuals is analogous.
# comment5: Filtering criteria are: GQ>=20 & Coverage>=5 & Coverage<=105(Altai)|75(Vindija/Denisova) & Heterozygous_balance>=0.2 & no InDels & MapabilityUniqueness35mer==1. Genotypes are retrieved from uniquely mapped sequences with mapping quality>=25.
# comment6: The following format is used: Full_coverage_across_individuals:Altai=X,Proportion_non_reference_reads,Vindija=X,Proportion_non_reference_reads,Denisova=X,Proportion_non_reference_reads[,others=X]. Other individuals from Hybridization capture experiments: VE=Vindija_exome,SE=Sidron_exome,V21=Vindija_chr21,S21=Sidron_chr21.
# comment7: ENSR=regulatory feature; ENST=transcript
# comment8: Often many consequences (3_prime_UTR;downstream etc.) overlap at the same SNP; all are aggregated into one here.
# comment9: Not exactly frequency, but allele counts among 138 Pan chromosomes, wherever applicable (data was filtered, so not all sites are included). Should ideally be 138 or close, i.e. Pan species are alternative/ancestral for all individuals.
# comment10: Could be several cDNA/protein positions, because different transcripts have different start points: codons should mostly be the same.
# comment11: AFR=African,AMR=Americans,EAS=EastAsians,EUR=Europeans,SAS=SouthAsians
# comment12: This is often different from the 1000G and archaic alternative alleles, because ExAC comprises thousands of different exomes with many singleton mutations.
# comment13: ExAC: Adj_MAF=Adjusted global frequency,AFR=African,AMR=American,EAS=EastAsians,FIN=Finnish,NFE=NonFinnishEuropeans,SAS=SouthAsians,OTH=OtherPopulations; NHLBI-ESP: AA=African American,EA=European American
# comment14: Format is Individual:allele,filter_pass,genotype. VE=Vindija_exome,SE=Sidron_exome,V21=Vindija_chr21,S21=Sidron_chr21. The filter requires Coverage>=5 & Coverage<=75(Vindija)|50(Sidron); it serves to include or exclude potential support for other genotypes. For genotypes, see comment4.
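
The "Archaic_coverage" format described in comment6 can be unpacked with a short parser. This is a sketch under stated assumptions rather than a reference implementation: the function name `parse_archaic_coverage` is hypothetical, the example string is invented to match the documented layout, and coverages are assumed to be integers. A token containing "=" is treated as starting a new individual, and a following bare number as that individual's proportion of non-reference reads.

```python
def parse_archaic_coverage(field):
    """Parse 'total:Altai=X,prop,Vindija=X,prop,Denisova=X,prop[,others=X]'."""
    total, _, rest = field.partition(":")
    result = {"total": int(total), "individuals": {}}
    current = None
    for token in rest.split(","):
        if "=" in token:
            # "Name=coverage" starts a new individual.
            name, _, coverage = token.partition("=")
            current = name
            result["individuals"][name] = {
                "coverage": int(coverage),
                "non_ref_proportion": None,
            }
        elif current is not None and token:
            # A bare number belongs to the individual seen just before it.
            result["individuals"][current]["non_ref_proportion"] = float(token)
    return result


# Invented example value following the documented layout (not real data).
example = "95:Altai=50,0.0,Vindija=30,1.0,Denisova=15,1.0,VE=12"
parsed = parse_archaic_coverage(example)
print(parsed["total"], sorted(parsed["individuals"]))
```

Capture individuals such as VE carry a coverage only, so their `non_ref_proportion` stays `None` in this sketch.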