Provenance-Aware Scalable Seismic Data Processing with Portability
Most of our understanding of the Earth’s interior comes from seismology. Over the past decade, the success of large-scale projects such as the USArray component of EarthScope has massively increased the volume of data available to the seismology community. Data sets of this size have exposed the limitations of the data processing infrastructure currently available to seismologists. As a step toward addressing this issue, we devised a new framework for seismic data processing and management, which we call the Massive Parallel Analysis System for Seismologists (MsPASS). MsPASS leverages existing big data technologies: (1) a scalable parallel processing framework based on a dataflow computation model (Spark), (2) a NoSQL database system centered on a document store (MongoDB), and (3) a container-based virtualization environment (Docker and Singularity).