SI2-SSI: FAMII: High Performance and Scalable Fabric Analysis, Monitoring and Introspection Infrastructure for HPC and Big Data
Hari Subramoni
Dhabaleswar K. Panda
Karen Tomko
10.6084/m9.figshare.11803875.v2
https://figshare.com/articles/presentation/SI2-SSI_FAMII_High_Performance_and_Scalable_Fabric_Analysis_Monitoring_and_Introspection_Infrastructure_for_HPC_and_Big_Data/11803875
<p>As heterogeneous computing (CPUs, GPUs, etc.) and
networking (NVLinks, X-Bus, etc.) hardware continue to advance, it becomes
increasingly essential and challenging to understand the interactions between
High-Performance Computing (HPC) and Deep Learning applications/frameworks, the
communication middleware they rely on, the underlying communication fabric
this high-performance middleware depends on, and the schedulers that manage
HPC clusters. Such understanding will enable application developers/users,
system administrators, and middleware developers to maximize the efficiency and
performance of individual components that comprise a modern HPC system and
solve different grand challenge problems. Moreover, determining the root cause
of performance degradation is complex for the domain scientist. The scale of
emerging HPC clusters further exacerbates the problem. These issues lead to the
following broad challenge: How can we design a tool that enables in-depth
understanding of the communication traffic on the interconnect and GPU through
tight integration with the MPI runtime at scale?</p>
2020-02-05 02:45:50
NSF-CSSI-2020-Talk
Network Monitoring
MVAPICH
FAMII
Performance Analysis
HPC Clusters
Computer System Architecture