COVID-19 Pandemic Wikipedia Readership

Version 3 2021-06-07, 14:46

Version 2 2021-05-13, 15:08

Version 1 2021-05-06, 22:02

dataset

posted on 2021-06-07, 14:46 authored by Isaac Johnson, Leila Zia, Joseph Allemandou, Marcel Ruiz Forns, Nuria Ruiz, Fabian Kaelin

This data release includes two Wikipedia datasets on readership of the project during the early COVID-19 pandemic period. The first dataset contains COVID-19 article page views by country; the second contains one-hop navigation where one of the two pages is COVID-19 related. The data covers roughly the first six months of the pandemic, from 1 January 2020 to 30 June 2020. For more background on the pandemic in those months, see English Wikipedia's Timeline of the COVID-19 pandemic.

Wikipedia articles are considered COVID-19 related according to the methodology described here. The list of COVID-19 articles used for the released datasets is available in covid_articles.tsv. For simplicity and transparency, the same list of articles from 20 April 2020 was used for the entire dataset, though in practice new COVID-19-relevant articles were constantly being created as the pandemic evolved.

Privacy considerations

While this data is considered valuable for the insight that it can provide about information-seeking behaviors around the pandemic in its early months across diverse geographies, care must be taken to not inadvertently reveal information about the behavior of individual Wikipedia readers. We put in place a number of filters to release as much data as we can while minimizing the risk to readers.

The Wikimedia Foundation started to release most-viewed articles by country in January 2021. At the beginning of the COVID-19 pandemic, an exemption was made to store reader data about the pandemic with additional privacy protections:

- exclude the page views from users engaged in an edit session

- exclude reader data from specific countries (with a few exceptions)

- the aggregated statistics are based on 50% of reader sessions that involve a pageview to a COVID-19-related article (see covid_pages.tsv). As a control, a 1% random sample of reader sessions that have no pageviews to COVID-19-related articles was kept. In aggregate, we make sure that the 1% non-COVID-19 sample and the 50% COVID-19 sample together represent less than 10% of pageviews for a country for that day. The randomization and filters occur on a daily cadence, with all timestamps in UTC.

- exclude power users, i.e. userhashes with more than 500 pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and in theory also reduces the chance of a single user heavily skewing the data.

- exclude readership from users of the iOS and Android Wikipedia apps.

In effect, the view counts in this dataset represent comparable trends rather than the total amount of traffic from a given country. For more background on per-country readership data, and the COVID-19 privacy protections in particular, see this Phabricator task.
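The power-user exclusion listed above can be illustrated with a minimal sketch. It is not the code used to build the release: the record layout `(userhash, day, page)` and the function name `drop_power_users` are assumptions made for illustration, and only the threshold of 500 pageviews per day comes from the description.

```python
from collections import Counter

# Threshold from the description: userhashes with more than 500 pageviews in a day.
MAX_DAILY_PAGEVIEWS = 500


def drop_power_users(pageviews):
    """Remove every pageview of a (userhash, day) pair that exceeds the threshold.

    `pageviews` is a list of (userhash, day, page) tuples. This layout is a
    hypothetical stand-in for the private reader session data.
    """
    # Count pageviews per user and day.
    per_user_day = Counter((userhash, day) for userhash, day, _page in pageviews)
    # Keep only pageviews from pairs at or below the threshold.
    return [
        pv for pv in pageviews
        if per_user_day[(pv[0], pv[1])] <= MAX_DAILY_PAGEVIEWS
    ]
```

Because the count is keyed on both userhash and day, a reader who is very active on one day is only excluded for that day, which matches the daily cadence of the filters.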

To further minimize privacy risks, a k-anonymity threshold of 100 was applied to the aggregated counts. For example, a page needs to be viewed at least 100 times in a given country and week in order to be included in the dataset. In addition, the view counts are floored to a multiple of 100.
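The thresholding and flooring step can be sketched as follows. This is an illustration of the rule as described, not the release code; the function name `anonymize_count` and its parameters are hypothetical.

```python
def anonymize_count(raw_count, k=100, bucket=100):
    """Apply a k-anonymity threshold, then floor to a multiple of `bucket`.

    Returns None when the count is below the threshold, meaning the row
    would be dropped from the published dataset.
    """
    if raw_count < k:
        return None
    # Integer division discards the remainder, flooring to the bucket size.
    return (raw_count // bucket) * bucket
```

Under this rule a published count of 11700 corresponds to a sampled count anywhere from 11700 to 11799.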

Datasets

The datasets published in this release are derived from a reader session dataset generated by the code in this notebook with the filtering described above. The raw reader session data itself will not be publicly available due to privacy considerations. The datasets described below are similar to the pageviews and clickstream data that the Wikimedia Foundation already publishes, with the addition of country-specific counts.

COVID-19 pageviews

The file covid_pageviews.tsv contains:

- pageview counts for COVID-19 related pages, aggregated by week and country

- k-anonymity threshold of 100

- example: In the 13th week of 2020 (23 March - 29 March 2020), the page 'Pandémie_de_Covid-19_en_Italie' on French Wikipedia was visited 11700 times from readers in Belgium

- as a control bucket, we include pageview counts to all pages aggregated by week and country. Due to privacy considerations during the collection of the data, the control bucket was sampled at ~1% of all view traffic. The view counts for the `control` title are thus proportional to the total number of pageviews to all pages.

The file is ~8 MB and contains ~134000 data points across the 27 weeks, 108 countries, and 168 projects.
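Since the counts are meant to be read as comparable trends, one way to use the `control` rows is to express each page's views relative to the control count for the same week and country. The sketch below assumes rows have already been parsed into `(week, country, project, title, views)` tuples; that layout and the function name `views_per_control` are hypothetical, and the actual column names in covid_pageviews.tsv should be checked before adapting it.

```python
def views_per_control(rows):
    """Return page views divided by the control count for the same week and country.

    `rows` is an iterable of (week, country, project, title, views) tuples,
    an assumed layout. Rows whose title is "control" are the control bucket.
    """
    rows = list(rows)  # iterated twice below
    # Control count per (week, country).
    control = {
        (week, country): views
        for week, country, _project, title, views in rows
        if title == "control"
    }
    ratios = {}
    for week, country, project, title, views in rows:
        if title == "control":
            continue
        base = control.get((week, country))
        if base:  # skip groups with no (or zero) control count
            ratios[(week, country, project, title)] = views / base
    return ratios
```

The resulting ratio is an index for comparing weeks or countries, not a true share of traffic, because the COVID-19 and control counts were sampled at different rates.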

Covid reader session bigrams

The file covid_session_bigrams.tsv contains:

- number of occurrences of visits to pages A -> B, where either A or B is a COVID-19 related article. Note that the bigrams are tuples (from, to) of articles viewed in succession. The underlying mechanism can be clicking on a link in an article, but it may also have been a new search, or reading both articles via links from a third article. In contrast, the clickstream data is based on referral information only

- aggregated by month and country

- k-anonymity threshold of 100

- example: In March of 2020, there were 1000 occurrences of readers accessing the page es.wikipedia/SARS-CoV-2 followed by es.wikipedia/Orthocoronavirinae from Chile

The file is ~10 MB and contains ~90000 bigrams across the 6 months, 96 countries, and 56 projects.
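The bigram definition above can be sketched as follows: take consecutive pairs of pages within a session and keep those where at least one side is in the COVID-19 article list. This is an illustrative reading of the description, not the release code. The names `covid_bigrams` and `count_bigrams` and the representation of a session as an ordered list of page titles are assumptions.

```python
from collections import Counter


def covid_bigrams(session, covid_pages):
    """Return (from, to) pairs of pages viewed in succession within one session.

    `session` is an ordered list of page titles (assumed layout) and
    `covid_pages` is a set of COVID-19 related titles. A pair is kept when
    either the source or the target page is COVID-19 related.
    """
    return [
        (a, b)
        for a, b in zip(session, session[1:])
        if a in covid_pages or b in covid_pages
    ]


def count_bigrams(sessions, covid_pages):
    """Count bigram occurrences across many sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(covid_bigrams(session, covid_pages))
    return counts
```

In the published file these counts are additionally grouped by month and country and passed through the k-anonymity threshold, which this sketch leaves out.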

Contact

Please reach out to research-feedback@wikimedia.org for any questions.

Keywords

Wikipedia COVID-19 data Applied Computer Science

Licence

CC0
