2020-01-30T20:54:16Z (GMT) by
This project will develop an open-source software service, the Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE), that improves the reliability and performance of data storage systems for the nation's largest supercomputers. As simulations and computations represent reality more faithfully, they grow in scale, and so does the volume of data they consume and generate. To store and move this data, supercomputing systems are built on a backbone of massively parallel data storage systems. Because of their parallel nature, these storage systems can move data at hundreds of times the speed of conventional storage systems, enabling otherwise impractical computations. The performance these storage systems provide is accompanied by a complexity that often causes them to run well below their optimal performance and, in some instances, to fail. The result is wasted computational time and, ultimately, lost scientific progress. Tools that could shed light on these problems and improve storage system reliability and performance are not yet mature enough for current and future computing systems. PFSTRASE will fill this gap by continually and automatically monitoring storage system health and performance, and by providing insights through an easy-to-use interface that will improve the reliability and performance of storage and supercomputer systems.