Citations with identifiers in Wikipedia
This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supports as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.
The extracted identifiers are of the following types:
• PubMed IDs (pmid) and PubMed Central IDs (pmcid)
• Digital Object Identifiers (doi)
• International Standard Book Numbers (isbn)
• arXiv IDs (arxiv)
Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair, together with metadata about when the citation was first added. Rows have the following fields:
• page_id -- The identifier of the Wikipedia article (int), e.g. 1325125
• page_title -- The title of the Wikipedia article (utf-8), e.g. Club cell
• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. 282470030
• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. 2009-04-08T01:52:20Z
• type -- The type of identifier, e.g. pmid
• id -- The id of the cited source (utf-8), e.g. 18179694
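As an illustration of the row layout above, here is a minimal Python sketch that parses rows into typed records. It assumes the files are tab-separated with a header line naming the six fields, which this description does not state, so check the actual files before relying on it. The names Citation, read_citations and SAMPLE are hypothetical, and the sample row is assembled from the example values listed above.

```python
import csv
import io
from datetime import datetime, timezone
from typing import NamedTuple


class Citation(NamedTuple):
    """One (Wikipedia article, cited source) pair."""
    page_id: int
    page_title: str
    rev_id: int
    timestamp: datetime
    type: str
    id: str


def read_citations(handle):
    """Yield one Citation per row.

    Assumption: tab-separated input with a header line naming the fields.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Timestamps look like 2009-04-08T01:52:20Z; the trailing Z means UTC.
        added = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        yield Citation(
            page_id=int(row["page_id"]),
            page_title=row["page_title"],
            rev_id=int(row["rev_id"]),
            timestamp=added.replace(tzinfo=timezone.utc),
            type=row["type"],
            id=row["id"],  # kept as text: DOIs and ISBNs are not numeric
        )


# Hypothetical sample built from the example values in the field list.
SAMPLE = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
)

citations = list(read_citations(io.StringIO(SAMPLE)))
print(citations[0])
```

The id field is deliberately left as a string, since only some identifier types (such as pmid) are purely numeric.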
The extraction code is available at https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT licensed).
A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.
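Readers who want to run their own spot check could start by mapping each (type, id) pair to a URL to open. The Python sketch below is one possible approach and not the procedure used for the figure above. The function name resolver_url and the choice of resolvers (doi.org, PubMed, arXiv) are assumptions. It returns None for pmcid and isbn, because the exact form of pmcid values in the data is not described here and ISBNs have no single canonical resolver.

```python
from typing import Optional
from urllib.parse import quote

# Assumed public resolvers; only three of the identifier types are covered.
RESOLVER_TEMPLATES = {
    "doi": "https://doi.org/{id}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{id}/",
    "arxiv": "https://arxiv.org/abs/{id}",
}


def resolver_url(id_type: str, identifier: str) -> Optional[str]:
    """Return a URL for checking the identifier, or None if the type is not covered."""
    template = RESOLVER_TEMPLATES.get(id_type)
    if template is None:
        return None
    # Identifiers are extracted as-is, so trim whitespace and percent-encode
    # unsafe characters, keeping "/" because DOIs and old arXiv IDs contain it.
    return template.format(id=quote(identifier.strip(), safe="/"))


print(resolver_url("pmid", "18179694"))
```

Whether an identifier actually resolves still has to be confirmed by requesting the URL; this sketch only builds the address.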