The Guardian Reading Dataset
This repository hosts The Guardian Reading Dataset, designed to explore readers' experiences with textual complexity, comprehensibility, and interest. The dataset captures detailed subjective and objective measures of readers' interactions with a selection of articles from The Guardian, providing granular insights into how textual features affect reading engagement.
Description
This dataset includes data from 30 readers who participated in 540 reading sessions. Each participant evaluated 18 articles sampled at three levels of textual complexity (low, medium, high), determined by a readability algorithm (Van der Sluis et al., 2014). The data captures subjective appraisals of complexity, comprehensibility, and interest, alongside eye-tracking metrics that provide an objective view of readers' processing difficulty and engagement with the text.
Data File and Structure
Data is stored in .csv format at the reading session level, with each row corresponding to a unique reading session by a participant. Participant identifiers, demographic scales, and trait measures were dropped as part of anonymisation. Key columns are:
stimulus: Identifier for each article.
appraised_complexity: Participant’s perceived complexity of the article.
appraised_comprehensibility: Participant’s perceived comprehensibility of the article.
processing_fluency: Combined score of appraised complexity and appraised comprehensibility.
interest: Participant’s interest rating for the article.
topical_familiarity: Participant’s familiarity with the article’s topic.
familiarity_group: Grouping of articles into balanced blocks (either low, median, or high familiarity).
pupil_diameter: Average pupil diameter across the reading session, reflecting cognitive load during reading.
pupil_corrected: Baseline-corrected pupil diameter per reading session (details on preprocessing and correction can be found in Van der Sluis & van den Broek, 2023).
novelty_comprehensibility_scale_*: Semantic differentials, rating the content as complex – simple (57), not familiar to me – very familiar to me, easy to read – difficult to read (59), easy to understand – hard to understand (60), comprehensible – incomprehensible (61), coherent – incoherent (62), interesting – uninteresting (63), boring – exciting (64). Of these, 59, 60, 61, 62, and 63 were reverse scored in the resulting scales.
fam_answer_*: Participants' topical familiarity rating for five topics per article. Note that each column covers different topics depending on the article.
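Because each row is one reading session, per-article summaries can be derived by grouping rows on the stimulus column. The following is a minimal sketch using only the Python standard library. The column names stimulus, interest, and appraised_complexity come from the list above; the sample rows, the article identifiers a1 and a2, and the helper name mean_by_stimulus are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy stand-in for the session-level file. These values are made up
# for illustration and do not come from the dataset.
SAMPLE_CSV = """stimulus,interest,appraised_complexity
a1,5,3
a1,3,4
a2,6,2
"""


def mean_by_stimulus(csv_file, column):
    """Average one numeric column per article across reading sessions."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row[column] != "":  # skip sessions with a missing value
            values[row["stimulus"]].append(float(row[column]))
    return {stimulus: mean(ratings) for stimulus, ratings in values.items()}


interest = mean_by_stimulus(io.StringIO(SAMPLE_CSV), "interest")
print(interest)
```

With the real file, the in-memory sample would be replaced by a file handle, for example open(path, newline=""), where path points to the .csv file in this repository.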
Data is stored in .db format at the stimulus (text) level, with each row corresponding to a unique text. In addition to averaged measurements aggregated per article, key columns are:
text: Article content excerpt (first 50 and last 50 words of excerpt presented to study participants).
url: Source URL for full text access.
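This README does not state which database engine produced the .db file. One plausible reading is SQLite, but that is an assumption. The sketch below first checks for the 16-byte header that every SQLite 3 database file starts with, and only then lists the tables it contains. The helper name list_tables is hypothetical, and no table names are assumed.

```python
import sqlite3

# Every SQLite 3 database file begins with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def list_tables(path):
    """Return the table names in `path`, or None if it is not an SQLite file."""
    with open(path, "rb") as handle:
        if handle.read(16) != SQLITE_MAGIC:
            return None  # some other format; the SQLite assumption does not hold
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

If the function returns a list, the text and url columns can be read with an ordinary SELECT against the relevant table. If it returns None, the file needs a different reader.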
Purpose
The purpose of this dataset is to facilitate the analysis of human responses to differences in textual complexity, with a focus on how readers' interest varies across complexity levels. The controlled conditions and validated data make it suited to assessing the accuracy and applicability of models of textual complexity against actual readers' perceptions and experiences.
Licensing
The dataset contains textual excerpts and metadata from The Guardian, shared under The Guardian’s open license terms (https://www.theguardian.com/info/2022/nov/01/open-licence-terms). Full-text sharing is restricted, but excerpts of up to 100 words may be used with proper attribution.
References
Van der Sluis, F., & van den Broek, E. L. (2023). Feedback beyond accuracy: Using eye-tracking to detect comprehensibility and interest during reading. Journal of the Association for Information Science and Technology, 74(1): 3–16. https://doi.org/10.1002/asi.24657
Van der Sluis, F., van den Broek, E. L., Glassey, R. J., van Dijk, E. M. A. G., & de Jong, F. M. G. (2014). When complexity becomes interesting. Journal of the American Society for Information Science and Technology, 65(7): 1478–1500. https://doi.org/10.1002/asi.23095