figshare

Libraries in Diff posts (2009–August 2023)

dataset
posted on 2024-05-18, 03:17 authored by Silvia Gutiérrez

Diff (https://diff.wikimedia.org/) is the collaborative, multilingual and multimedia platform for news, updates, and discussions related to the Wikimedia movement, and libraries appear many times in its posts.

This dataset is an attempt to gather for the first time, in one place, all those entries that tell the vibrant stories of libraries, librarians and library enthusiasts who contribute to Open Knowledge through Wikimedia projects.

It was derived from a larger dataset, "Diff posts' titles, authors, full-text, dates, and tags (from 2008-04-11 to 2023-08-31)". The rationale behind the three curated library sub-datasets derived from it is described below:


1. diff_lib_full-text_lang_data.csv contains the 134 posts from the complete dataset that were identified as related to libraries by searching titles, excerpts and tags.
2. diff_lib_full-text_lang_false-positives.csv contains posts that matched library-related terms but were actually about something else, such as "libraries" in the computing sense (collections of pre-written code that perform specific tasks) or "The Wikipedia Library", a digital library for active Wikipedia editors. The criteria used to filter them out are recorded in the notes column. These posts are shared as well because other researchers might consider many of them relevant.
3. diff_lib_full-text_lang_traslations.csv contains versions of posts in other languages. Whenever an English version of a post was published in Diff, that version was added to the first CSV file (diff_lib_full-text_lang_data.csv), because this makes some analyses easier; versions in other languages were added to this third file. All posts have a language code column as well as a column with the ID of the post to which they are related.
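
As a rough illustration of how the related-post ID column could be used, the sketch below groups translations under the posts they belong to. The description does not give the actual column headers, so the names `id`, `related_id` and `lang` are hypothetical placeholders, and the sample rows are made up so the example runs without the real files.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def group_translations(posts, translations,
                       id_col="id", related_col="related_id", lang_col="lang"):
    """Map each post ID to the sorted language codes of its translations.

    The column names are hypothetical; check the real CSV headers first.
    """
    langs_by_post = defaultdict(set)
    for row in translations:
        langs_by_post[row[related_col]].add(row[lang_col])
    # Keep every post from the main file, even those without translations.
    return {row[id_col]: sorted(langs_by_post.get(row[id_col], ()))
            for row in posts}


# Tiny in-memory stand-ins for the two CSV files (made-up rows).
SAMPLE_POSTS = "id,title,lang\n1,Example post A,en\n2,Example post B,en\n"
SAMPLE_TRANSLATIONS = "id,related_id,lang\n10,1,es\n11,1,fr\n"

posts = read_rows(io.StringIO(SAMPLE_POSTS))
translations = read_rows(io.StringIO(SAMPLE_TRANSLATIONS))
result = group_translations(posts, translations)
```

With the downloaded files, the in-memory samples would be replaced by something like `open("diff_lib_full-text_lang_data.csv", newline="", encoding="utf-8")` (the encoding is an assumption), and the column arguments adjusted to match the real headers.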
