Simple, Standards-Based Archiving in Dataverse

Archival copies serve a variety of purposes, from simply acting as an additional backup copy to ensuring that data and metadata can be read and understood far into the future, when the original software stack used to create a dataset may no longer exist. They can also serve as a verifiable snapshot of specific dataset versions, allow independent management of individual datasets, and act as an export/transfer mechanism.

We describe the work done to implement an extensible, standards-based archiving mechanism within Dataverse and to support archiving to DuraCloud/Chronopolis and Google Cloud Storage. This work includes enhancing Dataverse’s workflow mechanism, associating metadata blocks with external community vocabularies using the Object Reuse and Exchange (ORE) format, developing an extensible ‘Submit To Archive’ Command class, and adapting work from the SEAD DataNet project to create self-describing zipped archive files (BagIt) following the recommendations of the Research Data Alliance Research Data Repository Interoperability Working Group.

These contributions to Dataverse provide an archiving mechanism that is simple to configure and easy to extend to support other archives. The mechanism produces a single archive file per Dataset version that is both human- and machine-readable, providing a complete copy of research published through Dataverse that can be preserved using any repository that provides file or object (e.g., S3) storage.
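
To make the idea of a single, self-describing archive file more concrete, the following minimal Python sketch builds a zipped BagIt bag in memory. It is an illustration under stated assumptions, not the Dataverse implementation: the function name build_bag, the metadata/oai-ore.jsonld tag-file path, the example identifier, and the contents of the ORE map are all hypothetical. Only the bag declaration (bagit.txt), the data/ payload directory, and the manifest-sha256.txt layout follow the BagIt specification.

```python
import hashlib
import io
import json
import zipfile


def build_bag(bag_name, files, ore_map):
    """Return the bytes of a zipped BagIt bag (illustrative sketch only).

    files:   dict mapping a relative payload path to its bytes
    ore_map: dict written as JSON to a tag file (path chosen for illustration)
    """
    buf = io.BytesIO()
    manifest_lines = []
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Bag declaration required by the BagIt specification.
        zf.writestr(
            f"{bag_name}/bagit.txt",
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        )
        for rel_path, content in sorted(files.items()):
            # Payload files live under data/ and are listed in the manifest.
            payload_path = f"data/{rel_path}"
            zf.writestr(f"{bag_name}/{payload_path}", content)
            digest = hashlib.sha256(content).hexdigest()
            manifest_lines.append(f"{digest}  {payload_path}\n")
        zf.writestr(f"{bag_name}/manifest-sha256.txt", "".join(manifest_lines))
        # Hypothetical location for the ORE metadata describing the dataset.
        zf.writestr(
            f"{bag_name}/metadata/oai-ore.jsonld",
            json.dumps(ore_map, indent=2),
        )
    return buf.getvalue()


# Example with made-up content; the identifier below is a placeholder.
example_files = {"readme.txt": b"hello archive\n"}
example_ore = {
    "ore:describes": {
        "@id": "doi:10.5072/FK2/EXAMPLE",
        "ore:aggregates": ["data/readme.txt"],
    }
}
bag_bytes = build_bag("example-bag", example_files, example_ore)
```

Because the manifest records a checksum for every payload file, a recipient can verify the bag without access to the system that produced it, which is what allows the archive file to be preserved in any file or object store.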