<p>This repository contains code used to extract details of preprints
related to COVID-19 and visualize their distribution over time. Work by <a href="https://orcid.org/0000-0002-7582-6339" rel="nofollow">Nicholas Fraser</a> and <a href="https://orcid.org/0000-0002-5965-6560" rel="nofollow">Bianca Kramer</a>.</p><p><br></p><p>Preprint data are currently updated weekly. Details of each release are recorded in <code>data/metadata.json</code>, where <code>release_date</code> is the date on which the data were collected and <code>sample_date</code> is the cut-off posting date for preprints to be included.</p><p><br></p>
<p>Note that this dataset is not exhaustive, but aims to collate information from some of the main sources of preprint metadata.</p><p><br></p><p>The process for collecting preprint metadata is documented fully <a href="https://github.com/nicholasmfraser/covid19_preprints/blob/master/covid19_preprints.md">here</a>. In general terms, preprint metadata are harvested from four main sources:</p>
<ul><li>
<p>Crossref (using the <a href="https://github.com/ropensci/rcrossref">rcrossref</a> package). All records with the <code>type</code> field defined as <code>posted-content</code> are harvested, as well as records from SSRN (where the <code>type</code> field is instead defined as <code>journal-article</code>). Preprint records are then matched to known preprint repositories based on <code>institution</code>, <code>publisher</code> and <code>group-title</code> metadata fields.</p></li><li>
<p>DataCite (using the <a href="https://github.com/ropensci/rdatacite">rdatacite</a> package). All records with the <code>resourceType</code> field defined as <code>Preprint</code> are harvested. Preprint records are then matched to known preprint repositories based on the <code>client</code> fields.</p>
</li><li>
<p>arXiv (using the <a href="https://github.com/ropensci/aRxiv">aRxiv</a>
package). Records are harvested by searching directly for COVID-19
related keywords in titles or abstracts using the built-in search
functionality of the arXiv API.</p>
</li><li>
<p>RePEc (using the <a href="https://github.com/ropensci/oai">oai</a> package). All record types are initially harvested and subsequently filtered for those with the <code>Type</code> field defined as <code>preprint</code>.</p>
</li></ul>
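<p>The type-based filtering described for each source can be sketched roughly as follows. This is an illustrative Python sketch, not the repository's own code: the function name <code>is_preprint_record</code>, the plain-dictionary record layout and the <code>is_ssrn</code> flag are all assumptions made for the example. Only the field names <code>type</code>, <code>resourceType</code> and <code>Type</code> and their values come from the description above.</p>

```python
def is_preprint_record(source, record):
    """Return True if a harvested record counts as a preprint for its source.

    `record` is assumed to be a plain dict of metadata fields; the real
    harvesting code may represent records differently.
    """
    if source == "crossref":
        if record.get("type") == "posted-content":
            return True
        # SSRN records are typed as journal articles. How SSRN records are
        # recognised is not described here, so "is_ssrn" is a hypothetical flag.
        return record.get("type") == "journal-article" and record.get("is_ssrn", False)
    if source == "datacite":
        return record.get("resourceType") == "Preprint"
    if source == "repec":
        return record.get("Type") == "preprint"
    if source == "arxiv":
        # arXiv records are harvested by keyword search; no type filter is described.
        return True
    raise ValueError(f"unknown source: {source}")
```

<p>Matching the retained records to known preprint repositories (via <code>institution</code>, <code>publisher</code>, <code>group-title</code> or <code>client</code>) is a separate step that this sketch does not cover.</p>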
<p>For all sources, preprints are classified as being related to
COVID-19 on the basis of keyword matches in their titles or abstracts
(where available). The search string is defined as: <code>coronavirus OR covid-19 OR sars-cov OR ncov-2019 OR 2019-ncov OR hcov-19 OR sars-2</code>.</p>
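<p>As a rough illustration, the keyword classification could look like the Python sketch below. The keyword list is taken from the search string above. Treating the match as a case-insensitive substring search, and the function name <code>is_covid_related</code>, are assumptions made for the example rather than details confirmed by the repository.</p>

```python
import re

# Terms from the search string above, combined with OR semantics.
KEYWORDS = [
    "coronavirus", "covid-19", "sars-cov", "ncov-2019",
    "2019-ncov", "hcov-19", "sars-2",
]

# Assumption: case-insensitive substring matching.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_covid_related(title, abstract=None):
    """Return True if any keyword occurs in the title or the abstract."""
    # The abstract is optional because not every source provides one.
    text = " ".join(part for part in (title, abstract) if part)
    return PATTERN.search(text) is not None
```

<p>Under these assumptions, a title containing "SARS-CoV-2" matches through the <code>sars-cov</code> term, while a record with no matching title and no abstract is left out.</p>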
<p>In some cases, multiple preprint metadata records are registered for a
single preprint (e.g. ChemRxiv registers a new Crossref record for each
new version of a preprint). In these cases, only the earliest posted
version is included in this dataset. Additionally, some preprints are
deposited in multiple preprint repositories; in these cases, all
preprint records are included.</p>
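<p>The version handling can be sketched in Python as below. The field names <code>repository</code>, <code>preprint_id</code> and <code>posted_date</code> are hypothetical, as is the assumption that all versions of one preprint in one repository share an identifier and that dates are ISO 8601 strings. The actual pipeline may identify versions differently.</p>

```python
def keep_earliest_versions(records):
    """Keep only the earliest-posted record per (repository, preprint_id).

    The repository is part of the key, so the same preprint deposited in
    two repositories yields two records, one per repository.
    """
    earliest = {}
    for rec in records:
        key = (rec["repository"], rec["preprint_id"])
        # ISO 8601 date strings (YYYY-MM-DD) sort chronologically as text.
        if key not in earliest or rec["posted_date"] < earliest[key]["posted_date"]:
            earliest[key] = rec
    return list(earliest.values())
```

<p>For example, two ChemRxiv versions of the same preprint collapse to the earlier one, whereas a copy of that preprint in a second repository is retained alongside it.</p>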