Scientific workflows are a mainstream solution to process large-scale
modeling, simulations, and data analytics computations in distributed
systems, and have supported traditional and breakthrough research
across several domains. While scientific workflows have enabled
large-scale scientific computations and data analysis, and lowered the
barriers for experiment sharing, preservation (including provenance),
and reuse between heterogeneous platforms (HTC and HPC), the
reproducibility of an end-to-end scientific experiment is hindered by
the lack of methodologies to capture pre- and post-analysis steps
performed outside the scope of the workflow execution. Online notebook
technologies (e.g., Jupyter Notebook) have emerged as open-source web
applications that allow scientists to create and share documents that
contain live code, equations, visualizations, and explanatory text.
Jupyter Notebooks have strong potential to reduce the gap between
researchers and the complex knowledge required to run large-scale
scientific workflows via a programmatic high-level interface to
access/manage workflow capabilities. This poster describes our approach
for integrating the Pegasus workflow management system with Jupyter to
foster ease of use, reproducibility (all the information needed to run an
experiment is in a single place), and reuse (notebooks are portable if
running in equivalent environments). Since version 4.8, Pegasus has
provided a Python API to declare and manage workflows from Jupyter. The
user can create a notebook and declare a workflow application using the
Pegasus DAX API, which allows scientists to specify data or control
dependencies between computational jobs. This API encapsulates most
Pegasus commands (e.g., plan, run, and statistics) and supports
workflow creation, execution, and monitoring. It also provides
mechanisms to define Pegasus catalogs (site, replica, and
transformation) and to generate tutorial example workflows.
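
To illustrate what specifying data dependencies between computational jobs means in practice, the following minimal sketch derives an execution order from the files each job reads and writes. It uses only the Python standard library and is not the Pegasus API: the `Job` class, the `data_dependencies` helper, and all job and file names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Job:
    """Hypothetical stand-in for a workflow job (not a Pegasus class)."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def data_dependencies(jobs):
    """Map each job name to the names of the jobs that produce its inputs."""
    # Record which job writes each file.
    producer = {f: job.name for job in jobs for f in job.outputs}
    # A job depends on the producer of every input file that some job writes;
    # files nobody produces (e.g. raw inputs) add no dependency.
    return {
        job.name: {producer[f] for f in job.inputs if f in producer}
        for job in jobs
    }


# Jobs are deliberately declared out of order; the order is inferred from data.
jobs = [
    Job("analyze", inputs={"clean.csv"}, outputs={"stats.json"}),
    Job("preprocess", inputs={"raw.csv"}, outputs={"clean.csv"}),
    Job("plot", inputs={"clean.csv", "stats.json"}, outputs={"figure.png"}),
]

deps = data_dependencies(jobs)
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['preprocess', 'analyze', 'plot']
```

In a notebook, a cell like this keeps the workflow declaration next to the pre- and post-analysis code. A workflow management system would additionally plan, submit, and monitor the jobs, which this sketch does not attempt.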
Funding
National Science Foundation under the OAC SI2-SSI program, grant #1664162