topics_all_wikipedia_articles_202012.tsv.bz2 (1.87 GB)

Wikipedia Article Topics for All Languages (based on article outlinks)

Posted on 03.02.2021, 21:29 by Isaac Johnson
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles.

The data is bzip2-compressed. Each row is tab-delimited and contains the metadata fields below, followed by the predicted probability (rounded to three decimal places to reduce file size) that each topic in the taxonomy applies to the article. The taxonomy is documented at: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

* wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
* qid: the article's Wikidata item ID, if it has one -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction. This count is taken after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links -- i.e. only links to namespace 0 articles in the same wiki that have associated Wikidata IDs are retained. It is mainly provided to give a sense of how much data the prediction is based upon.

For more information, see this model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
