Politicians by Country from the English-language Wikipedia
Dataset posted on 28.10.2017, 17:49 by Os Keyes
This project contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality" and subcategories, along with the code used to generate that data. Both are released under the CC-BY-SA 4.0 license.
The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. Columns are:
1. "country", containing the sanitised country name, extracted from the category name;
2. "page", containing the unsanitised page title;
3. "last_edit", containing the edit ID of the last edit to the page.
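As a minimal sketch of how a file with these three columns could be loaded in R: the helper name read_page_data is hypothetical, and the rows written to the temporary file are placeholders for illustration, not records from the dataset. All columns are read as character so that the edit ID is treated as an identifier rather than a number.

```r
# Hypothetical helper: read a file laid out like data/page_data.csv.
read_page_data <- function(path) {
  read.csv(
    path,
    stringsAsFactors = FALSE,
    colClasses = "character",  # keep country, page and last_edit as text
    encoding = "UTF-8"
  )
}

# Placeholder rows for illustration only; these are not taken from the dataset.
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,page,last_edit",
    "Exampleland,Example Person A,1001",
    "Exampleland,Example Person B,1002",
    "Otherland,Example Person C,1003"),
  sample_path
)

pages <- read_page_data(sample_path)

# Count how many pages are listed under each country.
per_country <- table(pages$country)
```

With the real file, the call would be read_page_data("data/page_data.csv"), assuming the working directory is the project root.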
Country names are inconsistent. Where possible, they have been modified to match the country names found in http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14; however, the PRB dataset contains nations not found in Wikipedia, and vice versa.
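The kind of name harmonisation described above could look roughly like the following sketch. It is not the code in main.R: the function sanitise_country, the lookup table, its entries, and the assumed "Category:<label> politicians" title pattern are all illustrative assumptions.

```r
# Hypothetical lookup from a category label to the name used in a reference list.
# The entries are illustrative and are not copied from main.R.
name_lookup <- c(
  "Antiguan" = "Antigua and Barbuda",
  "French" = "France"
)

sanitise_country <- function(category_title, lookup = name_lookup) {
  # Drop the namespace prefix and the trailing " politicians" (assumed pattern).
  label <- sub("^Category:", "", category_title)
  label <- sub(" politicians$", "", label)
  # Use the lookup where a match exists; otherwise keep the stripped label.
  matched <- lookup[label]
  unname(ifelse(is.na(matched), label, matched))
}

sanitised <- sanitise_country(c(
  "Category:Antiguan politicians",
  "Category:French politicians",
  "Category:Examplian politicians"  # made-up label with no lookup entry
))
```

Labels without a lookup entry pass through unchanged, which mirrors the point that not every name can be matched to the reference list.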
The recursion went only two levels deep into the category tree: someone listed as an Antiguan politician, say, is included, but someone listed exclusively as an Antiguan politician who was assassinated is not.
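The effect of the depth limit can be illustrated with a small sketch over an in-memory tree. This is not the code in retrieve.R, which queries the Wikimedia API; the function collect_pages, the toy tree, and the category and page names are hypothetical. The sketch assumes the root category counts as the first level and the per-country categories as the second.

```r
# Toy category tree (hypothetical; a real run would query the Wikimedia API).
category_tree <- list(
  "Politicians by nationality" = list(
    subcats = c("Antiguan politicians"),
    pages = character(0)
  ),
  "Antiguan politicians" = list(
    subcats = c("Assassinated Antiguan politicians"),
    pages = c("Politician A")
  ),
  "Assassinated Antiguan politicians" = list(
    subcats = character(0),
    pages = c("Politician B")
  )
)

# Collect page titles, descending at most max_levels levels from the root.
collect_pages <- function(category, tree, max_levels = 2, level = 1) {
  node <- tree[[category]]
  if (is.null(node) || level > max_levels) {
    return(character(0))
  }
  found <- node$pages
  for (sub in node$subcats) {
    found <- c(found, collect_pages(sub, tree, max_levels, level + 1))
  }
  unique(found)
}

two_levels <- collect_pages("Politicians by nationality", category_tree, max_levels = 2)
three_levels <- collect_pages("Politicians by nationality", category_tree, max_levels = 3)
```

Under these assumptions, "Politician A" is picked up with the two-level limit, while "Politician B", who appears only in the deeper subcategory, is reached only when the limit is raised to three.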
The code is written in R and is heavily commented; it can be found in the "code" directory and is split into three files:
1. utils.R, which contains utilities for operating the code in the other files;
2. retrieve.R, which contains functions for retrieving the category and page data from Wikipedia;
3. main.R, which executes the data retrieval code and performs sanitisation before writing it to file.