Reference usage and in-page reuse, all Wikimedia wikis, snapshot 2024-05-01
Overview
This dataset was produced by Wikimedia Germany’s Technical Wishes team by processing the Wikimedia Enterprise HTML dumps. It focuses on usage statistics for reference footnotes made with the Cite extension, across Main-namespace pages (articles) on nearly all Wikimedia sites.
Our analysis of references was inspired by “Characterizing Wikipedia Citation Usage” and other research. Our specific goal was to understand the potential for improving the ways in which references can be reused within a page.
Reference tags are frequently produced by wikitext templates, which makes them hard to analyze from the wikitext alone. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.
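To make this concrete, here is a minimal sketch in Python of the kind of counting the rendered HTML enables. It assumes Parsoid-style markup, in which every footnote marker carries typeof="mw:Extension/ref" regardless of whether a template produced it; that attribute, and the input file name, are assumptions to verify against the actual dumps, not something defined by this dataset.

    # Minimal sketch: count reference footnote markers in one rendered page.
    # Assumes Parsoid-style HTML where each marker carries
    # typeof="mw:Extension/ref" -- verify against the actual dumps.
    from html.parser import HTMLParser

    class RefCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.refs = 0

        def handle_starttag(self, tag, attrs):
            typeof = dict(attrs).get("typeof") or ""
            if "mw:Extension/ref" in typeof:
                self.refs += 1

    counter = RefCounter()
    with open("page.html", encoding="utf-8") as f:
        counter.feed(f.read())
    print(counter.refs, "reference markers")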
We didn’t look at reuse across pages for this analysis.
License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
The source code is distributed under BSD-3-Clause.
Source code and pluggable framework
The dumps were processed by the HTML dump scraper v0.3.1, written in Elixir.
The job was run on the Wikimedia Analytics Cluster to take advantage of its high-speed access to the HTML dumps. The production configuration is included in the source code repository, and the job was started with the command line:

    MIX_ENV=prod mix run pipeline.exs
Our team plans to continue development of the scraper to support future projects as well.
Suggestions for new or improved analysis units are welcome.
Data format
Files are provided at several levels of granularity, from per-page and per-wiki analysis through all-wikis comparisons.
Files are either ND-JSON (newline-delimited JSON), plain JSON, or CSV.
Column definitions
Columns are documented in metrics.md.
Page summaries
Fine-grained results in which each line summarizes a single wiki page.
Example file name: enwiki-20240501-page-summary.ndjson.gz
Example metrics found in these files:
- How many reference tags are created by templates vs. written directly in the article.
- How many references contain a template transclusion to produce their content.
- How many references are unnamed, automatically named, or manually named.
- How often references are reused via their name.
- How many references on the same page are copy-pasted, sharing identical or nearly identical content.
- Whether an article has more than one references list.
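As a rough sketch of how these files can be consumed, the Python below streams one page-summary file record by record. The field handling is deliberately generic because the authoritative column list lives in metrics.md; only the file name comes from the example above.

    # Minimal sketch: stream a page-summary file without unpacking it first.
    # Each line is one JSON object describing a single page; see metrics.md
    # for the real field names.
    import gzip
    import json

    with gzip.open("enwiki-20240501-page-summary.ndjson.gz", "rt",
                   encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            print(sorted(page.keys()))   # discover the available metrics
            break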
Wiki summaries
Page analyses are rolled up to the wiki level, in a separate file for each wiki.
Example file name: enwiki-20240501-summary.json
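Because each of these files is a single JSON document, it can be loaded in one step; a minimal sketch, using the file name from the example above:

    # Minimal sketch: load one rolled-up wiki summary and list its fields.
    import json

    with open("enwiki-20240501-summary.json", encoding="utf-8") as f:
        summary = json.load(f)
    print(sorted(summary))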
Top-level comparison
Summarized statistics for each wiki are collected into a single file.
Non-scalar fields are discarded for now, and various aggregations are applied, as can be seen from the suffixes on the aggregated column names.
File name: all-wikis-20240501-summary.csv
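A sketch of reading the comparison table and listing columns by their aggregation suffix; the suffixes shown are hypothetical, so inspect the header row for the real ones:

    # Minimal sketch: read the all-wikis comparison and list aggregated
    # columns. The suffixes below are assumptions; check the CSV header.
    import csv

    with open("all-wikis-20240501-summary.csv", newline="",
              encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    suffixes = ("_sum", "_mean", "_max")   # hypothetical
    print([name for name in rows[0] if name.endswith(suffixes)])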
Error count comparison
We’re also collecting a total count for each different Cite error, per wiki.
File name: all-wikis-20240501-cite-error-summary.csv
Environmental costs
There were several rounds of experimentation and mistakes, so the costs below should be multiplied by 3-4 to account for the full project.
The computation took 4.5 days on 24 vCPUs sharing 2 GB of memory, in a data center in Virginia, US. Estimating the environmental impact through https://www.green-algorithms.org/ gives an upper bound of 12.6 kg CO2e, or 40.8 kWh, equivalent to about 72 km driven in a passenger car.
Disk usage was significant as well, with 827 GB read and 4 GB written. At the high estimate of 7 kWh/GB, this could have used as much as 5.8 MWh of energy (roughly 831 GB × 7 kWh/GB), but likely much less, since streaming was contained within one data center.