Wikipedia Cultural Diversity Dataset

Version 4 2019-01-15, 19:53

Version 3 2019-01-05, 13:19

Version 2 2019-01-05, 12:51

Version 1 2018-09-03, 09:19

dataset

posted on 2019-01-15, 19:53, authored by Marc Miquel-Ribé, David Laniado

For each existing Wikipedia language edition, the dataset contains a classification of the articles that represent its associated cultural context, i.e. all concepts and entities related to the language and to the territories where it is spoken (places, traditions, language, politics, agriculture, biographies, events, etc.).

For each article, the dataset contains a rich set of context-related features, including geolocation, ISO codes, Wikidata properties related to the language or to the corresponding country or territories, and related categories, among other metadata. General article features, such as the number of edits and the number of pageviews, are also included.

The methodology employed to classify articles through machine learning is described in:

Wikipedia Cultural Diversity Observatory: Cultural Context Content Methodology

Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics.

The dataset has several uses, of which we highlight three: 1) assessing the Wikipedia Culture Gap and improving overall cultural diversity, 2) academic research in the Digital Humanities, and 3) technologies based on user-generated content.

You can read more at wcdo.wmflabs.org.

Funding

Wikimedia Foundation

Keywords

wikipedia, wikimedia, cultural diversity, gender gap, culture gap, knowledge equality

Licence

CC BY 4.0
