Citations with identifiers in Wikipedia
This dataset contains citations with identifiers, extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Projects
Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supported as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.
Identifiers
• PubMed IDs (pmid) and PubMed Central IDs (pmcid)
• Digital Object Identifiers (doi)
• International Standard Book Numbers (isbn)
• arXiv IDs (arxiv)
Format
Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.
• page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
• page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
• type -- The type of identifier, e.g. pmid
• id -- The id of the cited source (utf-8), e.g. 18179694
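The fields above can be read into typed records along the following lines. This is a minimal sketch, not part of the dataset's tooling: it assumes the files are tab-separated with a header row naming the six fields, which this description does not state, and the names Citation and read_citations are hypothetical. Adjust the delimiter if the files you download differ.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterator, NamedTuple, TextIO


class Citation(NamedTuple):
    """One (Wikipedia article, cited source) pair, using the documented fields."""
    page_id: int
    page_title: str
    rev_id: int
    timestamp: datetime
    type: str
    id: str


def read_citations(handle: TextIO) -> Iterator[Citation]:
    # ASSUMPTION: tab-separated values with a header row; not confirmed by the description.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Citation(
            page_id=int(row["page_id"]),
            page_title=row["page_title"],
            rev_id=int(row["rev_id"]),
            # Timestamps end in "Z", i.e. they are in UTC.
            timestamp=datetime.strptime(
                row["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
            type=row["type"],
            # Kept as text: DOIs, ISBNs and arXiv IDs are not plain integers.
            id=row["id"],
        )


# The example values from the field list above, laid out as one assumed row.
sample = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
)
citations = list(read_citations(io.StringIO(sample)))
```

For a real file, pass an open text handle (for example open(path, encoding="utf-8", newline="")) instead of the in-memory sample.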
Source code
https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)
A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
Notes
Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
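Because identifiers are stored as-is, a consumer who wants to check whether one resolves first has to turn the (type, id) pair into a URL. The sketch below shows one possible mapping. It is not the procedure used for the spot-check above. The function name resolver_url is hypothetical, the URL patterns are the public landing pages of doi.org, PubMed, PubMed Central and arXiv, and ISBNs are skipped because no single resolver is assumed for them.

```python
from typing import Optional
from urllib.parse import quote

# ASSUMPTION: public landing-page patterns, not taken from the dataset itself.
RESOLVERS = {
    "doi": "https://doi.org/{}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "pmcid": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}/",
    "arxiv": "https://arxiv.org/abs/{}",
}


def resolver_url(id_type: str, raw_id: str) -> Optional[str]:
    """Build a candidate URL for an identifier, or None if no pattern is known."""
    pattern = RESOLVERS.get(id_type)
    if pattern is None:
        return None  # e.g. isbn: no single resolver assumed here
    value = raw_id.strip()
    if id_type == "pmcid" and value.upper().startswith("PMC"):
        # The pattern already adds the prefix, so drop it if present.
        value = value[3:]
    # Percent-encode anything unsafe in a URL path; "/" is kept for DOIs and old arXiv IDs.
    return pattern.format(quote(value, safe="/"))
```

For example, resolver_url("pmid", "18179694") yields the PubMed page for the sample row in the Format section. Actually fetching the URL, and deciding what counts as "resolves", is left to the caller.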