Better Data Discoverability in Science Gateways

Science gateways primarily focused on remote job execution management generate domain specific output data mainly readable by application specific parsers and post processing utilities. For example, computational chemistry data outputs encode molecule information, convergence of the simulation and energy values. Such domain-specific information is non-trivial to search in a generic fashion. It is thus desirable to add a wide range of application-specific and user-specific post-processing features that may include remote executions of scripts and smaller applications that don’t require scheduling on clusters. It is also desirable to support integrations with searching, indexing, and general purpose data analysis and mining tools provided by the Apache “big data” software stack. As gateways become tenants to general purpose platform services, providing a general purpose infrastructure that enables these application specific post-processing steps is an interesting architectural challenge. Furthermore, it is desirable to share results from the post-processing and indexing. In this paper, we discuss how we have incorporated a new automated application output indexing system for the SEAGrid Science Gateway using Apache Airavata that will parse and index generated output for easy querying. We also examine data sharing and automated data publication so that another user can reuse the results without running an already executed experiment and hence reduce resource utilization.