SIE-SSI: EVOLVE: Open MPI for Next Generation Architectures and Applications

<p>For nearly two decades, the Message Passing Interface (MPI) has been an essential part of the High-Performance Computing ecosystem and consequently a key enabler for important scientific breakthroughs. It is a fundamental building block for most large-scale simulations in physics, chemistry, biology, materials science, and engineering. Open MPI is an open source implementation of the MPI specification, widely used and adopted by the research community as well as industry. The Open MPI library is jointly developed and maintained by a consortium of academic institutions, national labs, and industrial partners. It is installed on virtually all large-scale computer systems in the US as well as in the rest of the world. The goal of this project is to enhance and modernize the Open MPI library in the context of the ongoing evolution of modern computer systems, and to ensure its future operability on all upcoming architectures. We aim to implement fundamental software techniques that can be used in many-core systems to execute MPI-based parallel applications more efficiently, and to tolerate process and memory failures at all scales, from current systems up to the extreme scales expected before the end of the decade.<br><br>Open MPI is an open source implementation of the Message Passing Interface (MPI) specification. The MPI API is currently being extended to address the needs of application developers in terms of efficiency, productivity, and resilience. The project will also support academic involvement in the design, development, and evaluation of the Open MPI software, and ensure academic presence in the MPI Forum. The goal of this proposal is to enhance the Open MPI software library, focusing on two aspects: (1) Extend Open MPI to support new features of the MPI specification. Open MPI will continue to support all new features of current and upcoming MPI specifications.
The two most significant areas within the context of this proposal are (a) extensions to better support hybrid programming models and (b) support for fault tolerance in MPI applications. To improve support for hybrid programming models, the MPI Forum is currently considering introducing the notion of MPI Endpoints, which could be used by different threads of an MPI rank to instantiate multiple separate communication contexts. The goal within this project is to develop an implementation of endpoints that supports effective hybrid programming models, and to extend the concept to other aspects of parallel applications such as file I/O operations. One of the project partners (UTK) leads the current proposal in the MPI Forum to expose failures and ensure the continued execution of MPI applications. In the context of this SSI proposal, the goal is to harden, improve, and expand the existing User-Level Failure Mitigation (ULFM) implementation in Open MPI and thus enable end users to design application-specific resilience approaches for future platforms. (2) Enhance the Open MPI core to support new architectures and improve scalability. While Open MPI has demonstrated very good scalability in the past, there is significant work to be done to ensure similarly good performance on future architectures. Specifically, we propose a groundbreaking rework of the startup environment that will improve process launch scalability, increase support for asynchronous progress of operations, enable support for accelerators, and reduce sensitivity to system noise. The project will also enhance the support for file I/O operations in the Open MPI package by expanding our work on highly scalable collective I/O operations through delegation and by exploring the use of burst buffers as temporary storage.</p>