Building portable analytical environments to improve sustainability of computational-analysis pipelines in the sciences

2014-11-17T12:11:36Z (GMT) by Stephen Piccolo
<p>In many scientific studies, researchers use software tools to execute algorithms and analyze data. However, these tools are executable only within a computational environment that includes an operating system and dependent software libraries. Thus, for a scientist to execute such an analysis, she/he must create (or identify) an environment that provides these components. If, at a later time, the scientist wishes to re-execute the analysis, or enable others to execute it, she/he must have a comprehensive description of the computational environment. But changes to computer configurations are common, and configurations often vary between research labs. A difference in any operating-system component or dependent software library may prevent a pipeline from executing properly or may lead to different analytical outputs. This paper advocates that scientists execute analyses within virtual machines and/or software containers, which can encapsulate operating-system and software components as well as all execution scripts and configuration parameters necessary to execute an analysis. Such environments can be shared easily with others and published alongside a manuscript that describes the analysis. Others can then recreate the experimental setting, validate the analysis, and build upon it, even many years into the future. This paper describes a methodology for creating such environments using existing open-access resources. The approach is most effective when scientists begin creating the environment at the outset of an analysis, store source code and data in public repositories, and use command-line scripts to execute the analysis. Advantages and limitations of the approach are also described.</p>