Wikipedia Articles and Associated WikiProject Templates
Dataset posted on 08.06.2020, 16:48 by Isaac Johnson, Aaron Halfaker
== wikiproject_to_template.halfak_20191202.yaml ==
The mapping from each WikiProject's canonical name to all the templates that might be used to tag an article with that WikiProject. This is the mapping that was used to generate this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates: "WikiProject Trade", "WikiProject trade", and "Wptrade".
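As a rough illustration of how this mapping could be used, the sketch below inverts it so that a template name found on a talk page can be resolved to its WikiProject's canonical name. The dict literal stands in for the parsed YAML file (loading it would require a YAML parser, which is not shown), and the function name `build_template_index` is hypothetical, not part of the published scripts.

```python
# Stand-in for the parsed contents of the YAML mapping file (assumption:
# the file parses to a dict of canonical name -> list of template names,
# as the example line in the description suggests).
wikiproject_to_templates = {
    "WikiProject Trade": ["WikiProject Trade", "WikiProject trade", "Wptrade"],
}


def build_template_index(mapping):
    """Invert {canonical name: [template names]} into {template name: canonical name}."""
    index = {}
    for canonical_name, templates in mapping.items():
        for template in templates:
            index[template] = canonical_name
    return index


template_index = build_template_index(wikiproject_to_templates)
print(template_index["Wptrade"])  # prints "WikiProject Trade"
```

If two WikiProjects ever shared a template name, this sketch would silently keep the last one seen. A stricter version could raise an error on such collisions.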
== wikiproject_taxonomy.halfak_20191202.yaml ==
A proposed mapping of WikiProjects to higher-level categories. This mapping has not been applied to the JSON dump contained here. It is based on the WikiProjects' canonical names.
== gather_wikiprojects_per_article.py ==
Old Python script that built the JSON dump described below for English Wikipedia based on wikitext/wikidata dumps (slow and more prone to errors).
== gather_wikiprojects_per_article_pageassessments.py ==
New Python script that builds the JSON dump described below using the PageAssessments MediaWiki table in MariaDB. It is much faster and can handle languages beyond English much more easily.
== labeled_wiki_with_topics_metadata.json.bz2 ==
There is one bzipped JSON file per language (currently Arabic, English, French, Hungarian, Turkish), and each line of a file corresponds to one Wikipedia article in that language. The intended usage of these JSON files is to build topic classification models for Wikipedia articles.
While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages have much sparser labels because they do not cover any WikiProjects in that language that lack an English equivalent (per Wikidata). The other languages are probably best used to supplement the English labels or as a separate test set that might have a different topic distribution.
The following properties are recorded:
* title: Wikipedia article title in that language
* article_revid: Most recent revision ID associated with the article for which a WikiProject assessment was made (might not be current revision ID)
* talk_pid: Page ID corresponding with the talk page for the Wikipedia article
* talk_revid: Most recent revision ID associated with the talk page for which a WikiProject assessment was made (might not be current revision ID)
* wp_templates: List of WikiProject templates from the page_assessments table.
* qid: Wikidata ID corresponding to the Wikipedia article
* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs.
* topics: topic labels associated with the article based on its WikiProject templates and the WikiProject<->Label mapping (wikiproject_taxonomy)
This version is based on the 24 May 2020 page_assessments tables and the 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Of note in comparison to previous versions of this file, the revision IDs are now those that were most recently assessed by a WikiProject, not the current versions of the page. The sitelinks are now given as page IDs, which are more stable and less prone to encoding issues etc. The WikiProject templates are now pulled via the MediaWiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.
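Since each line of a dump is a standalone JSON object, a file can be streamed without decompressing it to disk first. The following is a minimal sketch using only the Python standard library. The helper names `iter_articles` and `count_topics` are hypothetical, and the code assumes that `topics` is a list of strings, which the property description suggests but does not state outright.

```python
import bz2
import json


def iter_articles(path):
    """Yield one dict per non-empty line of a bzip2-compressed JSON-lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_topics(path):
    """Count how many articles carry each topic label (assumes `topics` is a list)."""
    counts = {}
    for article in iter_articles(path):
        for topic in article.get("topics", []):
            counts[topic] = counts.get(topic, 0) + 1
    return counts
```

Calling `count_topics` on one of the downloaded language files would give a quick view of the label distribution, which may help when deciding whether a non-English file is better suited to supplementing the English labels or to serving as a separate test set.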
For example, here is the line for Agatha Christie from the English JSON file:
"Novels/Crime task force",
"Biography/science and academia work group",
"Biography/arts and entertainment work group",
"Archaeology/Women in archaeology task force",