covid19_preprints-2021-11-28.zip (28.83 MB)

covid19_preprints

Version 58 2021-12-16, 19:36

software

posted on 2021-12-16, 19:36 authored by Nicholas Fraser, Bianca Kramer

This repository contains code used to extract details of preprints related to COVID-19 and visualize their distribution over time. Work by Nicholas Fraser and Bianca Kramer.

Preprint data are currently updated on a weekly schedule. Details of each release can be found in data/metadata.json, where release_date is the date on which the data were collected and sample_date is the cut-off posting date for preprints to be included.
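
As a rough illustration of how sample_date acts as a cut-off, the sketch below filters records by posting date. It is written in Python for brevity, whereas the repository's harvesting relies on R packages. The record layout, the posted_date field, the function name and the inclusive comparison are all assumptions made for the example, not the actual structure of data/metadata.json.

```python
from datetime import date


def within_sample(records, sample_date):
    """Keep records posted on or before the cut-off (inclusive by assumption)."""
    cutoff = date.fromisoformat(sample_date)
    return [r for r in records if date.fromisoformat(r["posted_date"]) <= cutoff]


# Hypothetical release entry and records, for illustration only.
release = {"release_date": "2021-12-01", "sample_date": "2021-11-28"}
records = [
    {"id": "a", "posted_date": "2021-11-27"},
    {"id": "b", "posted_date": "2021-11-30"},
]

included = within_sample(records, release["sample_date"])
print([r["id"] for r in included])
```

Record "b" is dropped here because it was posted after the cut-off, even though it existed by the hypothetical collection date.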

Note that this dataset is not exhaustive, but aims to collate information from some of the main sources of preprint metadata.

The process for collecting preprint metadata is documented fully here. In general terms, preprint metadata are harvested from four main sources:

Crossref (using the rcrossref package). All records with the type field defined as posted-content are harvested, as well as records from SSRN (where the type field is instead defined as journal-article). Preprint records are then matched to known preprint repositories based on institution, publisher and group-title metadata fields.
DataCite (using the rdatacite package). All records with the resourceType field defined as Preprint are harvested. Preprint records are matched to known preprint repositories based on client fields.
arXiv (using the aRxiv package). Records are harvested by searching directly for COVID-19 related keywords in titles or abstracts using the built-in search functionality of the arXiv API.
RePEc (using the oai package). All record types are initially harvested, and subsequently filtered for those with the Type field defined as preprint.
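
The matching of Crossref records to known repositories could be sketched roughly as follows. This is a Python illustration rather than the repository's R code, and it assumes each record has already been flattened to plain string fields. The rule table, its entries and the function name are hypothetical; the actual matching rules are those described in the repository's documentation.

```python
# Illustrative rules only: (metadata field, expected value, repository label).
# These entries are assumptions for the example, not the repository's real rule set.
RULES = [
    ("institution", "bioRxiv", "bioRxiv"),
    ("institution", "medRxiv", "medRxiv"),
    ("group-title", "PsyArXiv", "PsyArXiv"),
    ("publisher", "Research Square", "Research Square"),
]


def match_repository(record, rules=RULES):
    """Return the first repository whose rule matches the record, else None."""
    for field, value, repository in rules:
        observed = record.get(field)
        # Compare case-insensitively and ignore surrounding whitespace.
        if observed and observed.strip().lower() == value.lower():
            return repository
    return None


example = {"institution": "bioRxiv", "publisher": "Cold Spring Harbor Laboratory"}
print(match_repository(example))
```

Records that match no rule return None, so they can be reviewed or discarded in a later step.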

For all sources, preprints are classified as being related to COVID-19 on the basis of keyword matches in their titles or abstracts (where available). The search string is defined as: coronavirus OR covid-19 OR sars-cov OR ncov-2019 OR 2019-ncov OR hcov-19 OR sars-2.
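
A minimal sketch of this classification step, in Python rather than the repository's R, might look like the following. The keyword list is taken from the search string above; case-insensitive substring matching and the function name are assumptions, since the exact matching behaviour is not specified here.

```python
import re

# Keywords from the search string; any single match is sufficient (OR logic).
KEYWORDS = [
    "coronavirus", "covid-19", "sars-cov", "ncov-2019",
    "2019-ncov", "hcov-19", "sars-2",
]

# Escape each keyword so hyphens and digits are matched literally.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_covid_related(title, abstract=None):
    """Return True if the title or the abstract (when available) contains a keyword."""
    return any(PATTERN.search(text) for text in (title, abstract) if text)


print(is_covid_related("Structure of the SARS-CoV-2 spike protein"))
print(is_covid_related("Seasonal influenza dynamics", abstract=None))
```

Title and abstract are checked separately so that a missing abstract does not prevent classification on the title alone.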

In some cases, multiple preprint metadata records are registered for a single preprint (e.g. ChemRxiv registers a new Crossref record for each new version of a preprint). In these cases, only the earliest posted version is included in this dataset. Additionally, some preprints are deposited in multiple preprint repositories; in these cases, all preprint records are included.
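
The version handling described above can be sketched as follows, again in Python rather than the repository's R. The field names (source, preprint_id, posted_date) and the idea of a shared preprint_id across versions are assumptions for the example; how versions are actually linked depends on each source's metadata.

```python
def keep_earliest_versions(records):
    """Keep one record per (source, preprint_id): the earliest posted version.

    Records for the same preprint in different repositories have different
    keys, so all of them are retained.
    """
    earliest = {}
    for rec in records:
        key = (rec["source"], rec["preprint_id"])
        # ISO 8601 date strings (YYYY-MM-DD) sort chronologically as text.
        if key not in earliest or rec["posted_date"] < earliest[key]["posted_date"]:
            earliest[key] = rec
    return list(earliest.values())


# Hypothetical records, for illustration only.
records = [
    {"source": "ChemRxiv", "preprint_id": "p1", "version": 2, "posted_date": "2020-05-10"},
    {"source": "ChemRxiv", "preprint_id": "p1", "version": 1, "posted_date": "2020-04-01"},
    {"source": "SSRN", "preprint_id": "p1", "version": 1, "posted_date": "2020-04-03"},
]

deduplicated = keep_earliest_versions(records)
print(len(deduplicated))
```

In this example the second ChemRxiv version is dropped, while the SSRN record for the same preprint is kept as a separate entry.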