Precise Data Identification Services for Long Tail Research Data

While sophisticated research infrastructures assist scientists in managing massive volumes of data, the so-called long tail of research data frequently suffers from a lack of such services. This is mostly due to the complexity caused by the variety of data to be managed and a lack of easily standardisable procedures in highly diverse research settings. Yet, as even domains in this long tail of research data are increasingly data-driven, scientists need efficient means to communicate precisely which version and subset of data was used in a particular study, in order to enable reproducibility and comparability of results and to foster data re-use.

This paper presents three implementations of systems supporting such data identification services for comma-separated value (CSV) files, a dominant format for data exchange in these settings. The implementations are based on the recommendations of the Working Group on Dynamic Data Citation of the Research Data Alliance (RDA). They provide implicit change tracking of all data modifications, while precise subsets are identified via the respective subsetting process. This enhances the reproducibility of experiments and allows efficient sharing of specific subsets of data even in highly dynamic data settings.
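
The general idea of implicit change tracking combined with subset identification can be illustrated with a minimal sketch. It is not one of the three implementations presented in this paper. The class and method names (`VersionedCsvStore`, `cite_subset`, `resolve`) are hypothetical, an integer counter stands in for real timestamps, and simple column-equality filters stand in for an actual subsetting query. The sketch assumes that each CSV file has a unique key column.

```python
import csv
import hashlib
import io
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RowVersion:
    key: str
    values: Dict[str, str]
    valid_from: int
    valid_to: Optional[int] = None  # None means the row version is still current


@dataclass
class VersionedCsvStore:
    """Hypothetical in-memory store: row-level versions plus a query store."""
    key_column: str
    clock: int = 0  # logical clock standing in for timestamps (assumption)
    rows: List[RowVersion] = field(default_factory=list)
    queries: Dict[str, dict] = field(default_factory=dict)

    def load(self, csv_text: str) -> int:
        """Upload a new revision of the CSV; changes are tracked implicitly."""
        self.clock += 1
        reader = csv.DictReader(io.StringIO(csv_text))
        incoming = {r[self.key_column]: dict(r) for r in reader}
        current = {r.key: r for r in self.rows if r.valid_to is None}
        # Close row versions that were deleted or changed in this revision.
        for key, old in current.items():
            if incoming.get(key) != old.values:
                old.valid_to = self.clock
        # Open new row versions for inserted or changed rows.
        for key, values in incoming.items():
            if key not in current or current[key].values != values:
                self.rows.append(RowVersion(key, values, self.clock))
        return self.clock

    def snapshot(self, at: int) -> List[Dict[str, str]]:
        """Return the data as it was at logical time `at`, in a stable order."""
        live = [r for r in self.rows
                if r.valid_from <= at and (r.valid_to is None or at < r.valid_to)]
        return [r.values for r in sorted(live, key=lambda r: r.key)]

    def _select(self, filters: Dict[str, str], at: int) -> List[Dict[str, str]]:
        return [row for row in self.snapshot(at)
                if all(row.get(col) == val for col, val in filters.items())]

    @staticmethod
    def _digest(obj) -> str:
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

    def cite_subset(self, filters: Dict[str, str]) -> str:
        """Store the subsetting filters with the current time; return an identifier."""
        at = self.clock
        result = self._select(filters, at)
        pid = "subset-" + self._digest({"filters": filters, "at": at})[:12]
        self.queries[pid] = {"filters": dict(filters), "at": at,
                             "result_hash": self._digest(result)}
        return pid

    def resolve(self, pid: str) -> List[Dict[str, str]]:
        """Re-execute a stored subset against the data as of its stored time."""
        q = self.queries[pid]
        result = self._select(q["filters"], q["at"])
        if self._digest(result) != q["result_hash"]:
            raise ValueError("re-executed subset does not match the stored hash")
        return result


store = VersionedCsvStore(key_column="id")
store.load("id,species,count\n1,oak,4\n2,pine,7\n")
pid = store.cite_subset({"species": "oak"})
store.load("id,species,count\n1,oak,5\n2,pine,7\n3,oak,1\n")  # later revision
cited_rows = store.resolve(pid)  # still the rows as they were when cited
```

In this sketch the identifier refers to the stored filters and the logical time rather than to a copy of the data, so the cited subset remains retrievable after later uploads change the file.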