Wikipedia Article Topics for All Languages (based on article outlinks)

Version 3 2021-07-20, 20:13

Version 2 2021-02-03, 21:29

Version 1 2020-07-07, 13:57

dataset

posted on 2021-07-20, 20:13 authored by Isaac Johnson

This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e., links to other Wikipedia articles. The current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be based on other snapshots, as indicated by the filename.

The data is bzip-compressed. Each row is tab-delimited and contains the metadata fields listed below, followed by the predicted probability (rounded to three decimal places to reduce file size) that each topic in the following taxonomy applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia

* qid: the ID of the article's Wikidata item, if it has one -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)

* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)

* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only retaining links to namespace 0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.
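The row layout described above can be read with a short Python sketch. This is a minimal illustration under several assumptions that the description does not confirm: that the four metadata fields appear in the order listed, that the file may start with a header row (controlled by the hypothetical `has_header` flag), and that a missing Wikidata ID is stored as an empty field. The names `TopicRow`, `parse_row`, and `read_rows` are illustrative only.

```python
import bz2
from typing import Iterator, List, NamedTuple, Optional


class TopicRow(NamedTuple):
    wiki_db: str          # e.g. "enwiki"
    qid: Optional[str]    # Wikidata ID, or None if the field is empty (assumption)
    pid: int              # page ID within the wiki
    num_outlinks: int     # number of links the prediction was based on
    scores: List[float]   # one probability per topic, in file column order


def parse_row(line: str) -> TopicRow:
    """Split one tab-delimited line into metadata and topic probabilities."""
    fields = line.rstrip("\n").split("\t")
    # Assumed column order: wiki_db, qid, pid, num_outlinks, then topic scores.
    wiki_db, qid, pid, num_outlinks = fields[:4]
    scores = [float(value) for value in fields[4:]]
    return TopicRow(wiki_db, qid or None, int(pid), int(num_outlinks), scores)


def read_rows(path: str, has_header: bool = True) -> Iterator[TopicRow]:
    """Stream rows from a bzip2-compressed TSV without decompressing it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        if has_header:
            next(handle, None)  # skip the header line, if the file has one
        for line in handle:
            if line.strip():
                yield parse_row(line)
```

Streaming the file row by row keeps memory use low, which may matter because the full dataset covers nearly every article in every language edition. Mapping each score to a topic name would require the column names from the file itself, which are not assumed here.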

For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. The sample includes 201,196 Wikidata IDs, which correspond to 340,290 articles.
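Sampling by Wikidata ID rather than by row can be illustrated with a short sketch. This is not necessarily how the sample file was produced; it is one possible approach, assuming a hash of the ID decides membership so that every language version of the same item is kept or dropped together. The function names `in_sample` and `sample_rows` are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple

# (wiki_db, qid, pid) -- a reduced row used only for this illustration
Row = Tuple[str, str, int]


def in_sample(qid: str, rate: float = 0.01) -> bool:
    """Decide membership from the Wikidata ID alone (hash-based; an assumption)."""
    digest = hashlib.sha256(qid.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_rows(rows: Iterable[Row], rate: float = 0.01) -> List[Row]:
    """Keep every row whose Wikidata ID falls in the sample."""
    return [row for row in rows if in_sample(row[1], rate)]
```

Because the decision depends only on the ID, the number of sampled articles exceeds the number of sampled IDs whenever an item has articles in more than one language, which matches the ratio between the two counts reported above.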