2020-06-16T21:31:12Z (GMT) by Nicholas Fraser, Bianca Kramer

This repository contains code used to extract details of preprints related to COVID-19 and visualize their distribution over time. Work by Nicholas Fraser and Bianca Kramer.

Note that this dataset is not exhaustive, but aims to collate information from some of the main sources of preprint metadata.

The process for collecting preprint metadata is documented fully here. In brief, preprint metadata are harvested from three sources: Crossref (using the rcrossref package), DataCite (using the rdatacite package) and arXiv (using the aRxiv package).

With respect to Crossref, all records with the type field defined as posted-content are included, as well as records from SSRN (where the type field is instead defined as journal-article). Preprint records are then matched to known preprint repositories based on institution, publisher and group-title fields, and filtered for partial matches to keywords relating to COVID-19 ("coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov", "hcov-19", "sars-2") in either their titles or abstracts.

For DataCite, all records with the resourceType field defined as Preprint are included. Preprint records are matched to known preprint repositories based on client fields, and filtered for COVID-19 related terms in the same way as for Crossref.

With respect to arXiv, records are harvested by searching directly for COVID-19 related keywords in titles or abstracts using the built-in search functionality of the aRxiv package.
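The keyword filter described above can be illustrated with a minimal sketch. This is not the repository's own code (which is written in R); it is a Python illustration under stated assumptions: records are plain dictionaries with hypothetical "title" and "abstract" keys, "partial match" is taken to mean a substring match, and matching is assumed to be case-insensitive. The keyword list itself is taken from the description above.

```python
# Keywords as listed in the dataset description.
KEYWORDS = (
    "coronavirus", "covid-19", "sars-cov", "ncov-2019",
    "2019-ncov", "hcov-19", "sars-2",
)


def matches_covid_keywords(title, abstract=None):
    """Return True if any keyword occurs as a substring of the title or abstract.

    Assumption: matching is case-insensitive; missing fields are ignored.
    """
    text = " ".join(part for part in (title, abstract) if part).lower()
    return any(keyword in text for keyword in KEYWORDS)


def filter_records(records):
    """Keep only records whose title or abstract matches a keyword.

    Assumption: each record is a dict with hypothetical 'title' and
    'abstract' keys; either may be absent.
    """
    return [
        record for record in records
        if matches_covid_keywords(record.get("title"), record.get("abstract"))
    ]
```

Because the match is on substrings, a title containing "SARS-CoV-2" is caught by the "sars-cov" keyword, so the list does not need to enumerate every spelling variant.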

In some cases, multiple preprint metadata records are registered for a single preprint (e.g. ChemRxiv registers a new Crossref record for each new version of a preprint). In these cases, only the earliest posted version is included in this dataset. Additionally, some preprints are deposited in multiple preprint repositories; in these cases, the records from all repositories are included.
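The version-handling rule above can be sketched as follows. This is an illustrative Python sketch, not the repository's method: the field names "repository", "preprint_id" and "posted_date" are hypothetical, and it assumes that versions of the same preprint share a version-independent identifier within a repository and that dates are ISO-formatted (YYYY-MM-DD). How versions are actually linked in the source metadata is not specified here.

```python
from datetime import date


def keep_earliest_versions(records):
    """Keep one record per (repository, preprint) pair: the earliest posted.

    Assumptions (hypothetical field names):
      - 'repository': name of the preprint repository
      - 'preprint_id': identifier shared by all versions of one preprint
      - 'posted_date': ISO date string, e.g. '2020-03-20'
    Grouping by repository as well as identifier means a preprint deposited
    in several repositories keeps one record per repository.
    """
    earliest = {}
    for record in records:
        key = (record["repository"], record["preprint_id"])
        posted = date.fromisoformat(record["posted_date"])
        current = earliest.get(key)
        if current is None or posted < date.fromisoformat(current["posted_date"]):
            earliest[key] = record
    return list(earliest.values())
```

Keying on the repository as well as the identifier is a deliberate choice in this sketch: it collapses multiple versions within one repository while leaving cross-repository deposits untouched, which mirrors the two rules stated above.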