Building portable analytical environments to improve sustainability of computational-analysis pipelines in the sciences

Version 7 2014-11-17, 12:11
Version 6 2014-11-17, 12:11
Version 5 2014-11-14, 05:30
Version 4 2014-11-14, 05:30
Version 3 2014-07-24, 13:34
Version 2 2014-07-23, 19:51
Version 1 2014-07-22, 04:00
journal contribution
posted on 2014-11-17, 12:11 authored by Stephen Piccolo

In many scientific studies, researchers use software tools to execute algorithms and analyze data. However, these tools are executable only within a computational environment that includes an operating system and dependent software libraries. Thus, for a scientist to execute such an analysis, she/he must create (or identify) an environment that provides these components. If, at a later time, the scientist wishes to re-execute the analysis, or to enable others to execute it, she/he must have a comprehensive description of the computational environment. But changes to computer configurations are common, and configurations often vary between research labs. A difference in any operating-system component or dependent software library may prevent a pipeline from executing properly or may lead to different analytical outputs. This paper advocates that scientists execute analyses within virtual machines and/or software containers, which can encapsulate operating-system and software components as well as all execution scripts and configuration parameters necessary to execute an analysis. Such environments can be shared easily with others and published alongside a manuscript that describes the analysis. Others can then recreate the experimental setting, validate the analysis, and build upon it, even many years into the future. This paper describes a methodology for creating such environments using existing open-access resources. The approach is most effective when scientists begin creating the environment at the outset of an analysis, store source code and data in public repositories, and use command-line scripts to execute the analysis. Advantages and limitations of the approach are also described.
