figshare
302 files

Wikipedia Cultural Diversity Dataset

Version 4 2019-01-15, 19:53
Version 3 2019-01-05, 13:19
Version 2 2019-01-05, 12:51
Version 1 2018-09-03, 09:19
dataset
posted on 2019-01-15, 19:53 authored by Marc Miquel-Ribé, David Laniado
For each existing Wikipedia language edition, the dataset contains a classification of the articles that represent its associated cultural context, i.e. all concepts and entities related to the language and to the territories where it is spoken (places, traditions, language, politics, agriculture, biographies, events, etc.).

For each article, the dataset contains a rich set of context-related features, including geolocation, ISO codes, Wikidata properties related to the language or to the corresponding country or territories, and related categories, among other metadata. General article features, such as the number of edits and the number of pageviews, are also included.
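As a rough illustration of how such per-article records could be handled, the sketch below models one record and computes the share of articles classified as cultural context. Every field name (`in_cultural_context`, `iso_country_code`, `num_edits`, and so on) and both helper functions are hypothetical, since the file format and column names are not stated here; the sample rows are invented placeholders, not values from the dataset.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ArticleRecord:
    # All field names are hypothetical; the real files may use other names.
    title: str
    in_cultural_context: bool          # result of the classification
    iso_country_code: Optional[str]    # ISO code feature, if present
    latitude: Optional[float]          # geolocation feature, if present
    longitude: Optional[float]
    num_edits: int
    num_pageviews: int


def cultural_context_share(records: Sequence[ArticleRecord]) -> float:
    """Return the fraction of records flagged as cultural context (0.0 if empty)."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r.in_cultural_context)
    return flagged / len(records)


def geolocated(records: Sequence[ArticleRecord]) -> List[ArticleRecord]:
    """Keep only the records that carry both coordinates."""
    return [
        r for r in records
        if r.latitude is not None and r.longitude is not None
    ]


# Placeholder rows invented for illustration; not taken from the dataset.
sample = [
    ArticleRecord("Example A", True, "XX", 1.0, 2.0, 10, 100),
    ArticleRecord("Example B", False, None, None, None, 5, 50),
    ArticleRecord("Example C", True, "XX", None, None, 7, 70),
    ArticleRecord("Example D", False, None, 3.0, 4.0, 1, 20),
]

print(cultural_context_share(sample))           # 0.5
print([r.title for r in geolocated(sample)])    # ['Example A', 'Example D']
```

Keeping the optional features as `Optional` fields reflects that not every article necessarily has a geolocation or an ISO code; that too is an assumption about the data rather than a documented property.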

The methodology employed to classify articles through machine learning is described in:
  • Wikipedia Cultural Diversity Observatory: Cultural Context Content Methodology
  • Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics.

The dataset has several uses, of which we highlight three: 1) assessing the Wikipedia Culture Gap and improving overall cultural diversity, 2) academic research in the Digital Humanities, and 3) technologies based on user-generated content.

You can read more at wcdo.wmflabs.org.

Funding

Wikimedia Foundation
