The Guardian Reading Dataset

posted on 2025-01-30, 16:38 authored by Frans Van der Sluis, Egon L. van den Broek

This repository hosts The Guardian Reading Dataset, designed to explore readers' experiences with textual complexity, comprehensibility, and interest. The dataset captures detailed subjective and objective measures of readers' interactions with a selection of articles from The Guardian, providing granular insights into how textual features impact reading engagement.

Description

This dataset includes data from 30 readers who participated in 540 reading sessions. Each participant evaluated 18 articles sampled at three levels of textual complexity (low, medium, high), determined by a readability algorithm (Van der Sluis et al., 2014). The data captures subjective appraisals of complexity, comprehensibility, and interest, alongside eye-tracking metrics to provide an objective view of readers' processing difficulty and engagement with the text.

Data File and Structure

Data is stored in .csv format at the reading session level, with each row corresponding to a unique reading session by a participant. Participant identifiers, demographic scales, and trait measures were dropped as part of anonymisation. Key columns are:

stimulus: Identifier for each article.
appraised_complexity: Participant’s perceived complexity of the article.
appraised_comprehensibility: Participant’s perceived comprehensibility of the article.
processing_fluency: Combined score of appraised complexity and appraised comprehensibility.
interest: Participant’s interest rating for the article.
topical_familiarity: Participant’s familiarity with the article’s topic.
familiarity_group: Grouping of articles into balanced blocks (either low, median, or high familiarity).
pupil_diameter: Average pupil diameter across the reading session, reflecting cognitive load during reading.
pupil_corrected: Baseline-corrected pupil diameter per reading session (details on preprocessing and correction can be found in Van der Sluis & van den Broek, 2023).
novelty_comprehensibility_scale_*: Semantic differentials, rating the content as complex – simple (57), not familiar to me – very familiar to me, easy to read – difficult to read (59), easy to understand – hard to understand (60), comprehensible – incomprehensible (61), coherent – incoherent (62), interesting – uninteresting (63), boring – exciting (64). Of these, 59, 60, 61, 62, and 63 were reverse scored in the resulting scales.
fam_answer_*: Participants' topical familiarity rating for five topics per article. Note that each column covers different topics depending on the article.
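
The reverse scoring mentioned for items 59 to 63 can be illustrated with a minimal Python sketch. It assumes a 1 to 7 rating scale, which the description above does not state, so the bounds should be adjusted to the actual scale. The helper names `reverse_score` and `score_item` and the sample ratings are hypothetical and are not part of the dataset.

```python
# Items listed as reverse scored in the dataset description.
REVERSED_ITEMS = {59, 60, 61, 62, 63}


def reverse_score(rating, scale_min=1, scale_max=7):
    """Mirror a rating around the scale midpoint.

    The 1-7 bounds are an assumption; the description does not state the scale.
    """
    if not scale_min <= rating <= scale_max:
        raise ValueError(f"rating {rating} outside {scale_min}-{scale_max}")
    return scale_min + scale_max - rating


def score_item(item_number, rating, scale_min=1, scale_max=7):
    """Return the rating as it would enter a scale, reversing where needed."""
    if item_number in REVERSED_ITEMS:
        return reverse_score(rating, scale_min, scale_max)
    return rating


# Made-up ratings for one reading session, keyed by item number.
session = {57: 2, 59: 6, 63: 1}
scored = {item: score_item(item, rating) for item, rating in session.items()}
print(scored)
```

After reversal, all items of a scale point in the same direction, so they can be averaged or summed without items cancelling each other out.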

A second file, in .db format, stores data at the stimulus (text) level, with each row corresponding to a unique text. In addition to averaged measurements aggregated per article, key columns are:

text: Article content excerpt (first 50 and last 50 words of excerpt presented to study participants).
url: Source URL for full text access.
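
As a rough illustration of how per-article averages relate to the session-level rows, the following Python sketch averages the `interest` column per `stimulus`. Only those two column names come from the description above. The sample rows are made up, the function name is hypothetical, and the exact aggregation used to build the stimulus-level file is not specified here, so this is one plausible reading rather than the authors' procedure.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows standing in for the real session-level CSV file.
SAMPLE_CSV = """stimulus,interest
a1,4
a1,6
a2,3
"""


def mean_interest_per_stimulus(csv_file):
    """Return {stimulus: mean interest} from a session-level CSV file object."""
    ratings = defaultdict(list)
    for row in csv.DictReader(csv_file):
        value = row["interest"].strip()
        if value:  # skip sessions with a missing rating
            ratings[row["stimulus"]].append(float(value))
    return {stimulus: mean(values) for stimulus, values in ratings.items()}


result = mean_interest_per_stimulus(io.StringIO(SAMPLE_CSV))
print(result)
```

To run this on the real data, pass an opened file instead of the in-memory sample, for example `open(path, newline="")` with `path` pointing at the downloaded .csv file.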

Purpose

The purpose of this dataset is to facilitate the analysis of human responses to differences in textual complexity, with a focus on understanding how readers' interest varies with different complexity levels. The controlled conditions and validated data in this dataset make it ideal for assessing the accuracy and applicability of models of textual complexity, ensuring that the findings are both reliable and relevant to actual readers' perceptions and experiences.

Licensing

The dataset contains textual excerpts and metadata from The Guardian, shared under The Guardian’s open license terms (https://www.theguardian.com/info/2022/nov/01/open-licence-terms). Full-text sharing is restricted, but excerpts of up to 100 words may be used with proper attribution.

References

Van der Sluis, F., & van den Broek, E. L. (2023). Feedback beyond accuracy: Using eye-tracking to detect comprehensibility and interest during reading. Journal of the Association for Information Science and Technology, 74(1): 3–16. https://doi.org/10.1002/asi.24657

Van der Sluis, F., van den Broek, E. L., Glassey, R. J., van Dijk, E. M. A. G., & de Jong, F. M. G. (2014). When complexity becomes interesting. Journal of the American Society for Information Science and Technology, 65(7): 1478–1500. https://doi.org/10.1002/asi.23095

Funding

NWO IPPSI-KIEM project Adaptive Text-Mining (ATM) (project-number: 628.005.006)

7th Framework ICT Programme of the European Union project PuppyIR (project number: FP7-ICT-2007-3)
