Note that this dataset is not exhaustive, but aims to collate information from some of the main sources of preprint metadata.
The process for collecting preprint metadata is documented fully here. In brief, preprint metadata are harvested from three sources: Crossref (using the rcrossref package), DataCite (using the rdatacite package) and arXiv (using the aRxiv package).
With respect to Crossref, all records with the type field defined as posted-content are included, as well as records from SSRN (where the type field is instead defined as journal-article). Preprint records are then matched to known preprint repositories based on their metadata fields, and filtered for partial matches to keywords relating to COVID-19 ("coronavirus", "covid-19", "sars-cov", "ncov-2019", "2019-ncov", "hcov-19", "sars-2") in either their titles or abstracts. For DataCite, all records with the resourceType field defined as Preprint are included. Preprint records are matched to known preprint repositories based on their metadata fields, and filtered for COVID-19 related terms in the same way as for Crossref. With respect to arXiv, records are harvested by searching directly for COVID-19 related keywords in titles or abstracts using arXiv's built-in search functionality.
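The keyword filter described above can be illustrated with a short sketch. This is not the code used to build the dataset (which relies on the R packages named earlier); it is a minimal Python illustration of partial matching against titles and abstracts. The function name `matches_covid_keywords` is hypothetical, and case-insensitive matching is an assumption that the text does not state explicitly.

```python
# Keywords as listed in this document.
COVID_KEYWORDS = (
    "coronavirus", "covid-19", "sars-cov", "ncov-2019",
    "2019-ncov", "hcov-19", "sars-2",
)


def matches_covid_keywords(title, abstract=None):
    """Return True if any keyword occurs as a substring of the title or abstract.

    Assumption: matching is case-insensitive. Missing (None or empty)
    fields are simply skipped.
    """
    text = " ".join(part for part in (title, abstract) if part).lower()
    return any(keyword in text for keyword in COVID_KEYWORDS)


# Usage: a partial match in either field is enough.
print(matches_covid_keywords("Structure of the SARS-CoV-2 spike protein"))
print(matches_covid_keywords("Graph colouring heuristics", "No relevant terms."))
```

Because the match is on substrings, "sars-cov" also catches "SARS-CoV-2", which is the sense of "partial matches" used here.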
In some cases, multiple preprint metadata records are registered for a single preprint (e.g. ChemRxiv registers a new Crossref record for each new version of a preprint). In these cases, only the earliest posted version is included in this dataset. Additionally, some preprints are deposited to multiple preprint repositories; in these cases, all preprint records are included.
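The versioning rule can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the pipeline's own code: the record fields `source`, `preprint_id` and `posted_date` are hypothetical names, and the sketch assumes that versions of the same preprint within a repository share a `preprint_id`.

```python
from datetime import date


def keep_earliest_versions(records):
    """Keep one record per (repository, preprint): the earliest posted version.

    Records for the same preprint in different repositories have different
    keys, so all of them are retained.
    """
    earliest = {}
    for record in records:
        key = (record["source"], record["preprint_id"])  # hypothetical fields
        current = earliest.get(key)
        if current is None or record["posted_date"] < current["posted_date"]:
            earliest[key] = record
    return sorted(earliest.values(), key=lambda r: r["posted_date"])


# Usage with made-up records: two versions in one repository, plus the
# same preprint deposited to a second repository.
records = [
    {"source": "ChemRxiv", "preprint_id": "A", "version": 2, "posted_date": date(2020, 4, 10)},
    {"source": "ChemRxiv", "preprint_id": "A", "version": 1, "posted_date": date(2020, 3, 1)},
    {"source": "bioRxiv", "preprint_id": "A", "version": 1, "posted_date": date(2020, 3, 5)},
]
for record in keep_earliest_versions(records):
    print(record["source"], record["version"], record["posted_date"])
```

Keying on the repository as well as the preprint identifier is what keeps cross-repository deposits while collapsing versions within a repository.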