Codemeta: A Rosetta Stone for Software Metadata

2017-01-30T19:44:29Z (GMT) by Carl Boettiger
Software is critical to robust and efficient scientific discovery across disciplines, and yet is rarely valued or even understood. Researchers need to be able to discover and understand scientific software to apply it to their projects, but the approaches for documenting software are typically language specific and not interoperable. This project will have a broad impact on multiple disciplines by increasing the interoperability and consistency of software descriptions and by providing examples that illustrate the utility of interoperable software repositories for citation, discovery, archiving and preservation of scientific software. Research relies heavily on scientific software, and a large and growing fraction of researchers must now develop custom software to conduct their own research. Despite this, infrastructure to support the preservation, discovery, reuse, and attribution of software lags substantially behind that of other research products such as journal articles and research data. This frustrates the progress of science in several ways: lacking a way to discover and access software written by other researchers means that multiple teams must re-invent the same wheel. Limited re-use or accreditation of software also discourages researchers from investing more time to improve the performance, reliability or usability of the software they write. This lag is driven not so much by a lack of technology as it is by a lack of unity: existing mechanisms to archive, document, index, share, discover, and cite software contributions are varied among research disciplines and among software archives, and rarely consistent with best practices. The project will convene key stakeholders from software and data repositories to address this issue by aligning existing software metadata approaches. 
This alignment of software documentation will increase the efficiency and scale of research across disciplines, and simplify the process for researchers to collaborate on interdisciplinary projects.

This project will have three distinct phases:

1. Define a crosswalk table between existing metadata schemas for software.
2. Develop prototype applications illustrating the value of crosswalk metadata.
3. Assess and communicate the impact of results.

The researchers will convene a meeting of repository and science stakeholders to harmonize approaches to software metadata. Rather than try to define yet another standard, they will map the correspondences between standards already in use: a Rosetta stone of software metadata. In this process, the investigators will identify the use cases that have guided existing software metadata descriptions (e.g., more or different metadata may be needed to install software than to cite it, and even more to extend it), and then agree upon which metadata concepts are needed for each use case. This phase will also identify use cases that are not fully supported by existing software repositories (for instance, Zenodo is interested in associating software with funders as a use case but does not recognize funder identifiers yet).

This will set the stage for the second phase, in which the crosswalk table will be used to harmonize the implementation of software metadata in three major repositories that support software deposition (KNB, Zenodo, Figshare). The researchers will modify the software and provenance metadata terms used in the DataONE federation to be interoperable with the crosswalk, and create a tool for generating and uploading software with this metadata to the KNB repository (a member repository of DataONE). Collaborators will extend the existing integration between the software repository GitHub and the data repositories Zenodo and Figshare to provide interoperable software metadata.
In the final phase, the team will conduct an assessment with researchers at a relevant scientific meeting to evaluate the effectiveness of the crosswalk for the identified software use cases and will summarize results in a scientific paper.
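
The crosswalk idea from the first phase can be pictured as a small lookup table: one row per shared concept, one column per metadata format. The sketch below is only an illustration under stated assumptions. The three formats chosen (an R `DESCRIPTION` file, an npm `package.json`, and Python packaging metadata), the concept names, and the `translate` helper are all hypothetical choices made for this example; they are not the crosswalk the project will produce.

```python
# Illustrative crosswalk (an assumption for this sketch, not the project's table).
# Each row is a shared concept; each column gives the field name one format uses.
CROSSWALK = {
    "name": {
        "r_description": "Package",
        "npm_package_json": "name",
        "python_metadata": "name",
    },
    "version": {
        "r_description": "Version",
        "npm_package_json": "version",
        "python_metadata": "version",
    },
    "license": {
        "r_description": "License",
        "npm_package_json": "license",
        "python_metadata": "license",
    },
}


def translate(record, source, target):
    """Re-key a metadata record from the `source` format to the `target` format.

    Returns a pair (translated, unmapped). Fields with no crosswalk entry are
    reported in `unmapped` instead of being dropped silently.
    """
    # Build a direct source-field -> target-field lookup from the table rows.
    source_to_target = {
        columns[source]: columns[target]
        for columns in CROSSWALK.values()
        if source in columns and target in columns
    }
    translated, unmapped = {}, {}
    for field, value in record.items():
        if field in source_to_target:
            translated[source_to_target[field]] = value
        else:
            unmapped[field] = value
    return translated, unmapped
```

Calling `translate({"Package": "demo", "Version": "0.1.0", "Suggests": "testthat"}, "r_description", "npm_package_json")` re-keys the first two fields and reports `Suggests` as unmapped. Fields that land in `unmapped` correspond to the gaps the abstract describes, where a use case is not yet supported by a given schema or repository.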