Viridian sequencing run metadata

Download (121.49 MB)
dataset
posted on 2025-10-27, 10:03 authored by Martin Hunt
<p dir="ltr">Supplementary data file of all sequencing runs and their metadata, for the paper "Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny", Hunt et al. 2025.</p><p dir="ltr">This file is identical to Supplementary Table S1 (10.6084/m9.figshare.27195261) in the bioRxiv preprint in November 2024 (https://doi.org/10.1101/2024.04.29.591666). In the final paper it is not called a supplementary table, hence this Figshare entry, which contains the same table as the preprint.</p><p dir="ltr">The columns are:</p><ul><li>In_may_2024_preprint. Whether or not this run was in the May 2024 preprint. The values are T for true or F for false.</li><li>Study, Sample, Experiment, Run, Platform, Country, Region, Collection_date, First_created. These are all taken directly from the ENA metadata.</li><li>Run_count. The number of runs from the sample accession.</li><li>Date_tree. A consensus date, using up to 3 sources of data for each sample: COG-UK, GISAID, ENA/SRA. This was used for building the trees. Where dates conflicted for a given sample, the order of preference used was the date with highest resolution, then COG-UK, GISAID, and finally ENA/SRA.</li><li>Date_tree_order. This helped define the order in which the samples were added to the tree. The tree was built in two main batches: the May 2024 preprint, and the October 2024 updated preprint. In all cases, only samples with ≤5000 Ns were used. For the May preprint, samples with zero heterozygous calls and a date in the "Date_tree" column were used first, sorted in date order. These samples have “Date” in the “Date_tree_order” column. Next, those with zero heterozygous calls and no date in the "Date_tree" column were added. These samples have “End” in the "Date_tree_order" column. This process was repeated using the samples with 1-3 heterozygous calls, adding samples with a date first, and then adding those with no date. Again, these samples have “Date” and “End” in the “Date_tree_order” column respectively. 
The same process was used for the October preprint tree, except the starting point was the May preprint tree, with new samples added to that tree using the same rules as for the May preprint. There were two exceptions to the above. 1) Where the GISAID date was given priority, the sample was treated as if it had no date and will have “End” in the "Date_tree_order" column. 2) There was one project (PRJEB46220) where data arrived after the October preprint had initially been built. We did one final update, adding in these samples after all the others. These samples have “Last” in the "Date_tree_order" column.</li><li>Viridian_result. This is “PASS” if Viridian finished and made a consensus sequence, “NO_READS” if the reads failed to download, “FAIL_QC” if one of Viridian’s QC requirements was not met, or “FAIL_OTHER” if something else went wrong during processing.</li><li>Genbank_accession. This is the GenBank accession of the assembly.</li><li>Genbank_other_runs. Any other run accessions associated with the GenBank entry, other than that in the Run column.</li><li>In_Viridian_tree. “T” or “F” to indicate if the sample is in the Viridian tree.</li><li>In_intersection. “T” or “F” to indicate if the sample is in the intersection tree.</li><li>Artic_primer_version. If known, the ARTIC primer scheme from ENA metadata.</li><li>Viridian_amplicon_scheme. Amplicon scheme called by Viridian.</li><li>Viridian_N. Number of Ns in the Viridian consensus sequence, after aligning to the reference with MAFFT.</li><li>Genbank_N. Number of Ns in the GenBank consensus sequence, after aligning to the reference with MAFFT.</li><li>Viridian_pangolin/Viridian_scorpio. Pangolin/scorpio call from the Viridian consensus sequence, using pangolin data version 1.21.</li><li>Genbank_pangolin/Genbank_scorpio. Pangolin/scorpio call from the GenBank consensus sequence, using pangolin data version 1.21.</li><li>Genbank_tree_name. Name of the sample in the GenBank tree.</li><li>Viridian_cons_len. 
Length of the Viridian consensus sequence.</li><li>Viridian_cons_het. Number of heterozygous (non-ACGTN) calls in the Viridian consensus sequence.</li><li>Viridian_pangolin_1.29/Viridian_scorpio_1.29. Pangolin/scorpio call from the Viridian consensus sequence, using pangolin data version 1.29.</li></ul><p></p>
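<p dir="ltr">The rules described for Date_tree and Date_tree_order can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the paper. The function names and the dictionary keys (name, n_count, het, date, date_source) are hypothetical. "Resolution" is assumed to mean how many of year, month and day are present in a YYYY[-MM[-DD]] string. Samples with more than 3 heterozygous calls are assumed not to be added, which the description implies but does not state. The sketch covers the ordering within a single batch only and does not model the May/October batches or the “Last” samples.</p>

```python
from typing import Optional

# Source preference stated in the Date_tree description, used to break ties.
SOURCE_PRIORITY = ["COG-UK", "GISAID", "ENA/SRA"]


def date_resolution(date: str) -> int:
    """Assumed measure of resolution: 1 for YYYY, 2 for YYYY-MM, 3 for YYYY-MM-DD."""
    return len(date.split("-"))


def pick_tree_date(dates: dict) -> Optional[tuple]:
    """Return (source, date) for the preferred date, or None if no source has one.

    `dates` maps a source name from SOURCE_PRIORITY to a date string or None.
    """
    candidates = [(src, d) for src, d in dates.items() if d]
    if not candidates:
        return None
    # Highest resolution wins; ties go to the earlier source in SOURCE_PRIORITY.
    return min(
        candidates,
        key=lambda sd: (-date_resolution(sd[1]), SOURCE_PRIORITY.index(sd[0])),
    )


def addition_order(samples: list) -> list:
    """Return sample names in the order they would be added within one batch.

    Each sample is a dict with hypothetical keys:
    name, n_count, het, date (string or None), date_source (string or None).
    """
    # Only samples with at most 5000 Ns are used; more than 3 heterozygous
    # calls is assumed to exclude a sample.
    usable = [s for s in samples if s["n_count"] <= 5000 and s["het"] <= 3]

    def sort_key(s: dict) -> tuple:
        het_tier = 0 if s["het"] == 0 else 1  # zero het calls first, then 1-3
        # Exception 1: a sample whose chosen date came from GISAID is treated
        # as if it had no date ("End").
        dated = s["date"] is not None and s["date_source"] != "GISAID"
        # Dated samples ("Date") sort by date string; undated ones ("End")
        # follow and keep their input order because sorted() is stable.
        return (het_tier, not dated, s["date"] if dated else "")

    return [s["name"] for s in sorted(usable, key=sort_key)]
```

<p dir="ltr">For example, pick_tree_date({"COG-UK": "2021-03", "GISAID": "2021-03-15", "ENA/SRA": "2021"}) selects the GISAID date because it has the highest resolution. With that choice, addition_order would place the sample among the undated ones. Partial dates are compared as plain strings here, which is a simplification.</p>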
