
File(s) not publicly available

Reason: Data sets have been removed due to quality issues.

[deprecated] Reference and map usage across Wikimedia wiki pages

Version 2 2023-12-18, 10:34
Version 1 2023-09-28, 07:50
dataset
posted on 2023-12-18, 10:34 authored by Adam Wight

Errata

Please note that this data set includes some major inaccuracies and should not be used. The data files will be unpublished from their hosting and this metadata will eventually be unpublished as well.

A short list of issues discovered:

  • Many dumps were truncated (T345176).
  • Pages appeared multiple times, with different revision numbers.
  • Revisions were sometimes mixed, with wikitext and HTML coming from different versions of an article.
  • Reference similarity was overcounted when more than two refs shared content.

In particular, the truncation and duplication mean that the aggregate statistics are inaccurate and can't be compared to other data points.

Overview

This data was produced by Wikimedia Germany’s Technical Wishes team, and focuses on real-world usage statistics for reference footnotes (Cite extension) and maps (Kartographer extension) across all main-namespace pages (articles) on about 700 Wikimedia wikis. It was produced by processing the Wikimedia Enterprise HTML dumps, which are a fully parsed rendering of the pages, and by querying the MediaWiki query API to get more detailed information about maps. The data is also accompanied by several more general columns about each page for context.

Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research, but the goal in our case was to understand the potential impact of improving the ways in which references can be reused within a page. The map data was gathered to understand the actual impact of improvements to how external data can be integrated into maps. Both tasks are complicated by the heavy use of wikitext templates, which obscures when and how <ref> and <maplink>/<mapframe> tags are being used. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

License

All files included in this data set are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

The source code is distributed under BSD-3-Clause.

Source code and execution

The program used to create these files is our HTML dump scraper, version 0.1, written in Elixir. It can be run locally, but we used the Wikimedia Cloud VPS in order to have intra-datacenter access to the HTML dump file inputs. Our production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs"

Execution was interrupted and restarted many times in order to make small fixes to the code. We expect that the only class of inconsistency this could have caused is that a small number of article records may be repeated in the per-page summary files, with those pages’ statistics duplicated in the aggregates. Whatever the cause, we found many such duplicates; counts are given in the "duplicates.txt" file.
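Consumers of the data can also detect such duplicates directly from a per-page summary file. Below is a minimal Python sketch, assuming each record carries a page identifier; the field name "page_id" is hypothetical, so check metrics.md for the actual schema:

import gzip
import json
from collections import Counter

# Count how often each page identifier appears in a per-page summary
# file (one JSON object per line, gzip-compressed).
counts = Counter()
with gzip.open("enwiki-20230601-page-summary.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("page_id")] += 1  # "page_id" is a hypothetical field name
duplicates = {page: n for page, n in counts.items() if n > 1}
print(len(duplicates), "pages appear more than once")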

The program is pluggable and configurable: it can be extended by writing new analysis modules. Our team plans to continue development and to run the scraper again in the near future, to track the evolution of the collected metrics over time.

Format

All fields are documented in metrics.md in the code repository. Outputs are mostly split into separate ND-JSON (newline-delimited JSON) and JSON files, and grand totals are gathered into a single CSV file.

Per-page summary files

The first phase of scraping produces a fine-grained report summarizing each page into a few statistics. Each file corresponds to a wiki (using its database name, for example "enwiki" for English Wikipedia) and each line of the file is a JSON object corresponding to a page.

Example file name: enwiki-20230601-page-summary.ndjson.gz

Example metrics:

  • How many <ref> tags are created from templates vs. written directly in the article.
  • How many references contain a template transclusion to produce their content.
  • How many references are unnamed, automatically named, or manually named.
  • How often references are reused via their name.
  • How many references on the same page are copy-pasted, sharing identical or nearly identical content.
  • Whether an article has more than one references list.
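To give a concrete idea of how to consume these files, here is a minimal Python sketch for streaming one; the field names inside each record are documented in metrics.md and are not assumed here:

import gzip
import json

# Stream a per-page summary file: one gzip-compressed JSON object per line.
with gzip.open("enwiki-20230601-page-summary.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)  # one article's statistics, per metrics.md
        print(page)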

Mapdata files

Example file name: enwiki-20230601-mapdata.ndjson.gz

These files give the count of different types of map "external data" on each page. Each line is either an empty JSON object ("{}") or an object including the revid and the number of external data references for maps on that page.

External data is tallied into 9 buckets. The first, "page", means the source is .map data from the Wikimedia Commons server. The remaining 8 combine a type (geoline, geoshape, geomask, or geopoint) with a data source, either "ids" (Wikidata Q-IDs) or "query" (a SPARQL query).
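As an illustration of the bucket naming, a non-empty line might look roughly like the following; the exact key spellings are an assumption on our part, so consult metrics.md for the real schema:

{"revid": 123456789, "page": 1, "geoshape_ids": 2, "geoline_query": 1}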

Mapdata summary files

Each wiki has a summary of map external data counts, containing the sum of each bucket count across the wiki's pages.

Example file name: enwiki-20230601-mapdata-summary.json

Wiki summary files

Per-page statistics are rolled up to the wiki level, and results are stored in a separate file for each wiki. Some statistics are summed and some are averaged; the suffix on each column name indicates which.

Example file name: enwiki-20230601-summary.json
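A consumer separating summed from averaged columns might key off that suffix, as in this sketch; the suffixes "_sum" and "_avg" are assumptions here, so check metrics.md for the actual naming:

import json

with open("enwiki-20230601-summary.json", encoding="utf-8") as f:
    summary = json.load(f)

for column, value in summary.items():
    # Hypothetical suffix convention; metrics.md documents the real one.
    if column.endswith("_sum"):
        print(column, "is a wiki-wide total:", value)
    elif column.endswith("_avg"):
        print(column, "is a per-page average:", value)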

Top-level summary file

There is one file which aggregates the wiki summary statistics, discarding non-numeric fields and formatting them as CSV for ease of use: all-wikis-20230601-summary.csv
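The CSV can be loaded with any standard tooling; for instance, a minimal Python sketch:

import csv

# Each row holds the aggregated numeric statistics for one wiki.
with open("all-wikis-20230601-summary.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)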

Funding

Wikimedia Germany
