Annotated dataset to assess the accuracy of the textual description of cultural heritage records

posted on 19.12.2020, 09:49 by Matteo Lorenzini, Marco Rospocher, Sara Tonelli
The dataset contains more than 100K textual descriptions of cultural items from Cultura Italia, the Italian National Cultural aggregator. Each description is labeled either HIGH or LOW quality, according to its adherence to the standard cataloguing guidelines provided by the Istituto Centrale per il Catalogo e la Documentazione (ICCD). More precisely, a description is labeled HIGH quality if both the object and the subject of the item it describes are described according to the ICCD guidelines, and LOW quality in all other cases. Most of the dataset was manually annotated; the remaining ~30K descriptions were automatically labeled LOW quality because of their length (fewer than 3 tokens) or their provenance from old (pre-2012), uncurated collections. The dataset was developed to support the training and testing of ML text classification approaches for automatically assessing the quality of textual descriptions in digital Cultural Heritage repositories.
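The automatic labeling rule described above can be sketched roughly as follows. This is an illustration only, not the authors' procedure: the function name `auto_label_low`, its parameters, and the whitespace tokenization are assumptions, and the "uncurated" condition is reduced here to a simple year check.

```python
def auto_label_low(description, collection_year=None, min_tokens=3, cutoff_year=2012):
    """Return True if a description would be auto-labeled LOW quality.

    Assumptions (not taken from the dataset documentation):
    - tokens are obtained by splitting on whitespace;
    - provenance is approximated by a hypothetical collection year.
    """
    # Rule 1: the description is shorter than min_tokens tokens.
    too_short = len(description.split()) < min_tokens
    # Rule 2: the description comes from a collection older than cutoff_year.
    too_old = collection_year is not None and collection_year < cutoff_year
    return too_short or too_old
```

Descriptions for which the function returns False would still need manual annotation against the ICCD guidelines; the rule only identifies cases labeled LOW without review.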

The dataset is provided as a CSV file, where each row corresponds to an item from Cultura Italia, and contains the textual description of the item, the domain of the item (OpereArteVisiva/RepertoArcheologico/Architettura) and the quality label (Low_Quality/High_Quality).
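As a minimal sketch of how such a file might be read, the snippet below counts quality labels per domain using Python's standard `csv` module. The column names (`description`, `domain`, `quality`) and the sample rows are hypothetical and used only for illustration; check the header of the actual CSV file before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real header may differ.
DOMAIN_COL = "domain"
LABEL_COL = "quality"


def count_labels_by_domain(csv_file):
    """Count (domain, label) pairs in a CSV stream that has a header row."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row[DOMAIN_COL], row[LABEL_COL])] += 1
    return counts


# Tiny invented sample standing in for the real file.
SAMPLE = io.StringIO(
    "description,domain,quality\n"
    '"Dipinto a olio su tela raffigurante una Madonna con Bambino",OpereArteVisiva,High_Quality\n'
    "Anfora,RepertoArcheologico,Low_Quality\n"
    "Chiesa,Architettura,Low_Quality\n"
)

counts = count_labels_by_domain(SAMPLE)
```

For the real dataset, the in-memory sample would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.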

The textual descriptions in the dataset are provided by Cultura Italia under a "Public Domain" license. The whole dataset, including the annotations, is openly distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence.