This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. Articles without any valid outlinks -- i.e., links to other Wikipedia articles -- are missing. This version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier or future versions may be based on other snapshots, as indicated by the filename.
The data is bzip2-compressed. Each row is tab-delimited and contains the metadata fields below, followed by the predicted probability (rounded to three decimal places to reduce file size) that each topic in this taxonomy applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy
* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: the article's Wikidata item ID, if it has one -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction. This count excludes links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e., only links to namespace-0 articles in the same wiki that have associated Wikidata IDs are retained. This field is mainly provided to give a sense of how much data the prediction is based upon.
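The outlink-filtering rules above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the actual pipeline code, and the field names (`namespace`, `qid`, `interwiki`) are hypothetical:

```python
def valid_outlinks(links):
    """Keep only links to namespace-0 articles in the same wiki that have
    an associated Wikidata ID (a sketch of the filtering described above;
    the dict keys here are assumptions, not the real dump schema)."""
    return [l for l in links
            if l["namespace"] == 0      # article namespace only
            and l["qid"] is not None    # must have a Wikidata ID
            and not l["interwiki"]]     # same-wiki links only

links = [
    {"namespace": 0,  "qid": "Q42", "interwiki": False},  # kept
    {"namespace": 14, "qid": "Q7",  "interwiki": False},  # category: dropped
    {"namespace": 0,  "qid": None,  "interwiki": False},  # no Wikidata ID: dropped
    {"namespace": 0,  "qid": "Q1",  "interwiki": True},   # interwiki: dropped
]
num_outlinks = len(valid_outlinks(links))  # -> 1
```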
For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
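A minimal sketch of reading the file, given the format described above (bzip2-compressed, tab-delimited, metadata columns followed by topic-probability columns). The header layout and topic labels in the demo data below are assumptions; check the actual file's header row:

```python
import bz2
import csv

METADATA_COLS = ("wiki_db", "qid", "pid", "num_outlinks")

def read_topic_rows(path):
    """Yield (metadata, topic_probabilities) per article from a
    bzip2-compressed, tab-delimited file with a header row."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            meta = {k: row[k] for k in METADATA_COLS}
            # All remaining columns are topic probabilities,
            # rounded to three decimal places in the file.
            topics = {k: float(v) for k, v in row.items()
                      if k not in METADATA_COLS}
            yield meta, topics

if __name__ == "__main__":
    # Tiny synthetic example (not real data) to demonstrate the format:
    import tempfile
    header = "wiki_db\tqid\tpid\tnum_outlinks\tCulture.Biography.Biography*\tSTEM.STEM*\n"
    line = "enwiki\tQ42\t8091\t250\t0.987\t0.123\n"
    with tempfile.NamedTemporaryFile(suffix=".tsv.bz2", delete=False) as tmp:
        tmp.write(bz2.compress((header + line).encode("utf-8")))
    for meta, topics in read_topic_rows(tmp.name):
        print(meta["qid"], topics)
```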
Additionally, a 1% sample file is provided for easier exploration. Sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of that article are included. The sample contains 201,196 Wikidata IDs, corresponding to 340,290 articles.
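Sampling by Wikidata ID can be reproduced with a deterministic per-QID decision. The dataset's actual sampling mechanism is not documented here; the hash-based rule below is just one way to get the described behavior, where every language version of an article is kept or dropped together:

```python
import hashlib

def sampled_in(qid: str, rate: float = 0.01) -> bool:
    """Deterministically decide whether a Wikidata ID falls in the sample
    (an illustrative scheme, not the dataset's documented method).
    Hash the QID and keep it if the hash lands in the lowest `rate`
    fraction of the hash space. Since the decision depends only on the
    QID, all language versions of an article share the same outcome."""
    h = int(hashlib.sha256(qid.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) < rate * 10_000

rows = [("enwiki", "Q800612"), ("eswiki", "Q800612"), ("enwiki", "Q42")]
sample = [r for r in rows if sampled_in(r[1])]
# Rows sharing a QID are always sampled in or out together.
```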