
Data citation and digital identifiers for time series data / environmental research infrastructures

journal contribution
posted on 08.01.2015, 12:40, authored by Robert Huber, Ari Asmi, Justin Buck, Jesus Marco de Lucas, Michael Diepenbroek, Alberto Michelini, participants of the joint COOPEUS/ENVRI/EUDAT PID workshop

In the age of data-driven science, the re-use of data and the compilation of existing data from monitoring infrastructures have become an integral part of research. For the sake of transparency and reproducibility, it is crucial to be able to unambiguously identify the data that were used as the basis of a publication. Globally unique, resolvable, and persistent digital identifiers (PIDs) for digital data sets are an important tool for achieving this goal, as they enable unambiguous links between published research results and their underlying data. This unambiguous identification also allows data to be cited. Proven, community-based examples are the use of GenBank identifiers in the biological literature and data citation using DOIs (digital object identifiers), which are already widely used in the scholarly literature.

Identifying discrete digital objects is straightforward, and their citations can be formatted by analogy with literature citations. Ongoing, open time series do not seem to fit this pattern. A major prerequisite for the proper use of PIDs in data citations is the persistence of the identifiers together with the integrity of the associated data sets. This raises questions when PIDs are to be used for unfinished data sets or open time series. Such data are typically generated within research infrastructures during long-lasting experiments such as satellite missions or environmental monitoring campaigns, or in permanent installations such as natural hazard detection and early warning systems (e.g., seismic traces acquired by field stations).
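
To make the integrity problem concrete, the following is a minimal sketch and not a method taken from this publication. It assumes that a time series is a list of (timestamp, value) pairs and that integrity is checked with a SHA-256 checksum; the function name `interval_checksum` and the sample readings are hypothetical. It shows that a checksum over a closed time interval stays stable while the open series keeps growing, whereas a checksum over the whole series does not.

```python
import hashlib
from datetime import datetime


def interval_checksum(samples, start, end):
    """Return a SHA-256 hex digest over samples with start <= timestamp < end.

    Hypothetical helper: samples are (datetime, float) pairs and are
    serialised in time order so the digest does not depend on list order.
    """
    digest = hashlib.sha256()
    for timestamp, value in sorted(samples):
        if start <= timestamp < end:
            line = f"{timestamp.isoformat()},{value!r}\n"
            digest.update(line.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical open time series: two readings so far.
series = [
    (datetime(2014, 1, 1, 0), 1.5),
    (datetime(2014, 1, 1, 1), 1.7),
]

day_start, day_end = datetime(2014, 1, 1), datetime(2014, 1, 2)

closed_before = interval_checksum(series, day_start, day_end)
whole_before = interval_checksum(series, datetime.min, datetime.max)

# The experiment continues: a new reading arrives outside the cited interval.
series.append((datetime(2014, 1, 2, 0), 2.1))

closed_after = interval_checksum(series, day_start, day_end)
whole_after = interval_checksum(series, datetime.min, datetime.max)

print("closed interval unchanged:", closed_before == closed_after)
print("whole series unchanged:", whole_before == whole_after)
```

Under these assumptions, an identifier that refers to the closed interval can keep pointing at verifiably unchanged data, while an identifier for the whole open series cannot make the same guarantee without some notion of versioning.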

Open time series data are often used in research while experiments are still ongoing, and results may be published before the underlying data set has been closed and publicly released. It is therefore important to enable the scientific community to cite these data properly in their publications. Yet what does “persistence” of data mean for an ongoing time series? How does it relate to versioning? What is the granularity of a time series? In this publication we discuss and compare solutions currently used in some major European research infrastructures and propose transparent solutions that allow time series data to be cited using PIDs.