Supplementary data to: An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples

Posted on 2021-01-14, 16:59, authored by MalariaGEN
Data correct at time of upload (16 December 2020). Data maintained at: https://www.malariagen.net/resource/26

BACKGROUND
This Figshare project provides information about data generated by the MalariaGEN Plasmodium falciparum Community Project (https://www.malariagen.net/projects/p-falciparum-community-project) using the version 6 pipeline for variant discovery and genotype calling.

The Plasmodium falciparum Community Project supports groups around the world to integrate parasite genome sequencing into clinical and epidemiological studies of malaria. It comprises multiple partner studies, each with its own research objectives and led by a local investigator. Genome sequencing is performed centrally, and partner studies are free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing (https://www.malariagen.net/resource/1).

Aggregated data from the Community Project were initially released through a companion project called Pf3k (https://www.malariagen.net/projects/pf3k), whose goal was to bring together leading analysts from multiple institutions to benchmark and standardise methods of variant discovery and genotype calling. The Pf3k dataset can be explored using an interactive web application (https://www.malariagen.net/apps/pf3k).

The open dataset was enlarged in 2016 when multiple partner studies contributed to a consortial publication on 3,488 samples from 23 countries (https://www.malariagen.net/resource/16). The variants and genotypes described in this publication used version 3 of the analysis pipeline. Data produced using an earlier version of the data analysis pipeline can be explored using an interactive web application (https://www.malariagen.net/apps/pf).

ABOUT THE V.6 DATA PIPELINE
In 2018 the Plasmodium falciparum Community Project upgraded to version 6 of its variant discovery and genotype calling pipeline. Details of the methods can be found in the accompanying paper and here: ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_extended_methods.pdf

The major change from previous versions is that the version 6 pipeline is based on GATK and utilises findings on genome accessibility generated by the P. falciparum Genetic Crosses Project.

CONTENT OF THE DATA RELEASE
This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: "This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described in ‘MalariaGEN, Ahouidi A, Ali M et al. An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples [version 1; peer review: awaiting peer review]. Wellcome Open Research, 2021, 6:42 DOI: https://doi.org/10.12688/wellcomeopenres.16168.1’".

DATA RESOURCE
1) Study information: Details of the 49 contributing partner studies, including description, contact information and key people. (File1_Pf_6_partner_studies.pdf)

2) Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 7,113 samples from 28 countries. (File2_Pf_6_samples.txt)

3) Measure of complexity of infections: characterisation of within-host diversity (FWS) for 5,970 QC pass samples. (File3_Pf_6_fws.txt)

4) Drug resistance marker genotypes: genotypes at known markers of drug resistance for 7,113 samples, containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3. (File4_Pf_6_drug_resistance_marker_genotypes.txt)

5) Inferred resistance status classification: classification of 5,970 QC pass samples into different types of resistance to 10 drugs or combinations of drugs and to RDT detection: chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine-pyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine, hrp2 and hrp3 gene deletions. (File5_Pf_6_inferred_resistance_status_classification.txt)

6) Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification. (File6_Pf_6_resistance_classification.pdf)

7) Gene differentiation: estimates of global and local differentiation for 5,561 genes. (File7_Pf_6_genes_data.txt)

Short variant genotypes: Genotype calls on 6,051,696 SNPs and short indels in 7,113 samples from 29 countries.
* VCF files available here: ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6_vcf/
* Zarr files available here: ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/Pf_6.zarr.zip
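
As a rough illustration of how genotype calls in VCF form can be summarised, the sketch below counts called and non-reference alleles per variant using only the Python standard library. It relies on the standard VCF column layout (fixed columns followed by FORMAT and one column per sample, with genotypes under the GT key). The sample names and records in the demo are invented for illustration and are not taken from the release. The file names inside the VCF directory are not assumed here.

```python
import io

def alt_allele_counts(vcf_lines):
    """Yield (chrom, pos, n_called, n_alt) for each record in a VCF text stream."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header line
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        # FORMAT (column 9) lists the per-sample keys; locate the genotype key.
        gt_index = fields[8].split(":").index("GT")
        n_called = n_alt = 0
        for sample_field in fields[9:]:
            genotype = sample_field.split(":")[gt_index]
            # Alleles are separated by "/" (unphased) or "|" (phased).
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # missing call
                n_called += 1
                if allele != "0":
                    n_alt += 1  # any non-reference allele
        yield chrom, pos, n_called, n_alt

# Invented demo records (hypothetical samples S1-S3), not real release data.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "Pf3D7_01_v3\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD\t0/0:10,0\t0/1:5,5\t./.:0,0\n"
    "Pf3D7_01_v3\t200\t.\tC\tT,G\t60\tPASS\t.\tGT\t1/1\t2/2\t0/0\n"
)
results = list(alt_allele_counts(demo_vcf))
for record in results:
    print(record)
```

For a compressed VCF, the same function can be fed a handle from `gzip.open(path, "rt")`. For files of this size, a dedicated VCF or zarr library is likely to be far faster than line-by-line parsing.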

A README file (File8_Pf_6_README_20191010.txt) describes in detail all the files included in the release and the format and interpretation of each column, and contains some tips and tricks for accessing genotype data in the zarr file.
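
The sample metadata (item 2) and FWS values (item 3) can be combined with a few lines of Python. The sketch below is a minimal example under several assumptions that should be checked against the README: that both files are tab-delimited, that they share a `Sample` column, that the metadata has `Country` and `QC pass` columns with `True` marking passing samples, and that the FWS file has an `Fws` column. The 0.95 cut-off for treating an infection as essentially clonal is a commonly used convention, given here as an assumption rather than as part of the release. The demo rows are invented.

```python
import csv
import io
from collections import Counter

def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

def clonal_qc_pass_by_country(samples, fws_rows, threshold=0.95):
    """Count QC-pass samples per country whose Fws exceeds the threshold.

    Column names ("Sample", "Country", "QC pass", "Fws") are assumptions;
    the README is the authoritative description of the real columns.
    """
    fws = {row["Sample"]: float(row["Fws"]) for row in fws_rows}
    counts = Counter()
    for row in samples:
        if row["QC pass"] != "True":
            continue  # skip samples that failed QC
        value = fws.get(row["Sample"])
        if value is not None and value > threshold:
            counts[row["Country"]] += 1
    return counts

# Invented demo rows standing in for the two files.
demo_samples = io.StringIO(
    "Sample\tCountry\tQC pass\n"
    "S1\tGhana\tTrue\n"
    "S2\tGhana\tTrue\n"
    "S3\tCambodia\tTrue\n"
    "S4\tCambodia\tFalse\n"
)
demo_fws = io.StringIO(
    "Sample\tFws\n"
    "S1\t0.99\n"
    "S2\t0.60\n"
    "S3\t0.97\n"
)
counts = clonal_qc_pass_by_country(read_tsv(demo_samples), read_tsv(demo_fws))
print(dict(counts))
```

With the real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`.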

SUPPLEMENTARY DATA
The following supplementary data is available as a single document download. (File9_Pf_6_supplementary)

* Supplementary Note
- Analysis of local differentiation score
- The classic 76T chloroquine resistance mutation in crt is found on multiple haplotypes
- Sulfadoxine-pyrimethamine resistance is widespread and associated with many haplotypes
- mdr1 duplications have many different breakpoints
- Artemisinin, piperaquine, and mefloquine resistance
- No evidence of resistance to less commonly used antimalarials
* Supplementary Table 1. Breakdown of analysis set samples by geography
* Supplementary Table 2. Studies contributing samples
* Supplementary Table 3. Summary of discovered variant positions
* Supplementary Table 4. Breakpoints of duplications of gch1
* Supplementary Table 5. Breakpoints of duplications of mdr1
* Supplementary Table 6. Breakpoints of duplications of plasmepsin 2-3
* Supplementary Table 7. Genes ranked by global differentiation score
* Supplementary Table 8. Genes ranked by local differentiation score
* Supplementary Table 9. Number of samples used to determine proportions in Table 2
* Supplementary Table 10. Frequencies of mutations associated with mono- and multi-drug resistance pre- and post-2011
* Supplementary Table 11. Frequency of crt amino acid 72-76 haplotypes
* Supplementary Table 12. Frequencies of dhfr (51, 59, 108, 164) and dhps (437, 540, 581, 613) multi-locus haplotypes
* Supplementary Table 13. Frequency of HRP2 and HRP3 deletions by country
* Supplementary Table 14. Alleles at six mitochondrial positions used for the species identification
* Supplementary Figure 1. Histogram of local differentiation score for all genes

