figshare
300 files

Citations with identifiers in Wikipedia

dataset
posted on 2019-12-17, 07:58 authored by Aaron Halfaker, Bahodir Mansurov, Miriam Redi, Dario Taraborelli

This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.

License

All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

Projects

Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supported as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.

Identifiers

• PubMed IDs (pmid) and PubMed Central IDs (pmcid)
• Digital Object Identifiers (doi)
• International Standard Book Numbers (isbn)
• arXiv IDs (arxiv)

Format

Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.

• page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
• page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
• type -- The type of identifier, e.g. pmid
• id -- The id of the cited source (utf-8), e.g. 18179694
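As an illustration of the row format, here is a minimal Python sketch that reads citation rows into typed records. It assumes the files are tab-separated with a header row naming the six columns listed above; this page does not state the delimiter or whether a header is present, so check an actual file before relying on it. The function name read_citations and the in-memory sample are hypothetical, and the sample row is assembled from the example values given in the field list.

```python
import csv
import io
from datetime import datetime, timezone

# ASSUMPTION: tab-separated values with a header row; the dataset page
# does not specify the delimiter. The row below reuses the example
# values from the field descriptions.
SAMPLE = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
)


def read_citations(handle):
    """Yield one dict per (Wikipedia article, cited source) row."""
    # QUOTE_NONE keeps any quote characters in titles or ids untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "page_id": int(row["page_id"]),
            "page_title": row["page_title"],
            "rev_id": int(row["rev_id"]),
            # Timestamps use the ISO 8601 form shown above, in UTC ("Z").
            "timestamp": datetime.strptime(
                row["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
            "type": row["type"],
            # Kept as a string: DOIs, ISBNs and arXiv ids are not numeric.
            "id": row["id"],
        }


citations = list(read_citations(io.StringIO(SAMPLE)))
```

To read a real file, pass an open text handle instead of the sample, for example open(path, encoding="utf-8", newline="").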

Source code

https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)

A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/

Notes

Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
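For readers who want to check whether identifiers resolve, the following Python sketch maps a (type, id) pair to a public landing URL. It is not the procedure used for the spot-check mentioned above, which this page does not describe. The helper name resolver_url and the choice of resolvers (doi.org, PubMed, arXiv) are assumptions made for illustration. The pmcid and isbn types are left out because the stored form of those identifiers is not documented here.

```python
from urllib.parse import quote

# ASSUMPTION: these resolver endpoints are an illustrative choice,
# not something specified by the dataset.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "arxiv": "https://arxiv.org/abs/{}",
}


def resolver_url(id_type, identifier):
    """Return a landing URL for the identifier, or None if the type is not covered."""
    template = RESOLVERS.get(id_type)
    if template is None:
        return None
    # Identifiers are extracted as-is, so trim whitespace and
    # percent-encode characters that are unsafe in a URL path.
    return template.format(quote(identifier.strip(), safe="/"))
```

For the example row in the Format section, resolver_url("pmid", "18179694") returns https://pubmed.ncbi.nlm.nih.gov/18179694/. Whether a URL actually resolves still has to be checked with an HTTP request, which this sketch does not perform.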

History