Precise Data Identification Services for Long Tail Research Data

While sophisticated research infrastructures assist scientists
in managing massive volumes of data, the so-called long tail
of research data frequently suffers from a lack of such
services. This is mostly due to the complexity caused by the
variety of data to be managed and a lack of easily
standardisable procedures in highly diverse research settings.
Yet, as even domains in this long tail of research data are
increasingly data-driven, scientists need efficient means to
communicate precisely which version and subset of data was
used in a particular study, to enable reproducibility and
comparability of results and to foster data re-use.
This paper presents three implementations of systems
supporting such data identification services for
comma-separated values (CSV) files, a dominant format for data
exchange in these settings. The implementations are based on
the recommendations of the Working Group on Dynamic Data
Citation of the Research Data Alliance (RDA). They provide
implicit change tracking of all data modifications, while
precise subsets are identified via the respective subsetting
process. This enhances reproducibility of experiments and
allows efficient sharing of specific subsets of data even in
highly dynamic data settings.
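
To make the idea concrete, the following is a minimal sketch of how
implicit change tracking and subset identification could fit together
for CSV data. It is not one of the three implementations presented in
this paper. All names (VersionedTable, QueryStore, identify, resolve)
are hypothetical, the store is held in memory, a logical counter stands
in for a real timestamp, and a random UUID stands in for a persistent
identifier. Each upload of a CSV file is compared with the current
state, so that changed or deleted records are closed rather than
overwritten. A subset is identified by storing the subsetting query
together with the time of execution and a hash of the result.

```python
import csv
import hashlib
import io
import uuid
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """In-memory table keeping every record version with a validity interval."""
    rows: list = field(default_factory=list)  # dicts: key, data, valid_from, valid_to
    clock: int = 0  # logical timestamp (assumption: stands in for a real timestamp)

    def upload(self, csv_text, key_column):
        """Ingest a CSV revision; changes are detected by comparing with current rows."""
        self.clock += 1
        incoming = {r[key_column]: r for r in csv.DictReader(io.StringIO(csv_text))}
        for row in self.rows:
            if row["valid_to"] is not None:
                continue  # already superseded by an earlier upload
            if incoming.get(row["key"]) == row["data"]:
                del incoming[row["key"]]      # unchanged record, keep as is
            else:
                row["valid_to"] = self.clock  # record was updated or deleted
        for key, data in incoming.items():    # new records and new versions
            self.rows.append({"key": key, "data": data,
                              "valid_from": self.clock, "valid_to": None})
        return self.clock

    def select(self, column, value, as_of):
        """Return records matching column == value as they were at time as_of."""
        hits = [r for r in self.rows
                if r["valid_from"] <= as_of
                and (r["valid_to"] is None or as_of < r["valid_to"])
                and r["data"].get(column) == value]
        return [r["data"] for r in sorted(hits, key=lambda r: r["key"])]


def fingerprint(records):
    """Hash a result set so a later re-execution can be verified."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(repr(sorted(record.items())).encode("utf-8"))
    return digest.hexdigest()


class QueryStore:
    """Stores subsetting queries with execution time and result hash."""

    def __init__(self, table):
        self.table = table
        self.queries = {}

    def identify(self, column, value):
        """Run the subsetting query now and return an identifier for the result."""
        as_of = self.table.clock
        result = self.table.select(column, value, as_of)
        pid = str(uuid.uuid4())  # assumption: placeholder for a persistent identifier
        self.queries[pid] = {"column": column, "value": value,
                             "as_of": as_of, "hash": fingerprint(result)}
        return pid

    def resolve(self, pid):
        """Re-execute the stored query against the data as of its original time."""
        q = self.queries[pid]
        result = self.table.select(q["column"], q["value"], q["as_of"])
        if fingerprint(result) != q["hash"]:
            raise ValueError("re-executed subset does not match the stored hash")
        return result
```

In this sketch, a researcher would call identify() at the moment a
subset is used in a study and cite the returned identifier. Calling
resolve() with that identifier later returns the same records, even if
newer CSV revisions have been uploaded in the meantime.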