figshare
Browse
HPC_asia2024.pdf (1.19 MB)

Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication

Download (1.19 MB)
conference contribution
posted on 2024-07-08, 20:23 authored by Andres Sewell, Ke Fan, Ahmedur Rahman Shovon, Landon Dyken, Sidharth Kumar, Steve Petruzza
In high-performance computing, collective communication is critical for facilitating comprehensive data exchange involving all processes within an MPI communicator. Due to their inherently global nature, many collective operations present scalability challenges, particularly the all-to-all data shuffle with its quadratic communication pattern. Using a logarithmic communication pattern, the Bruck algorithm was designed to provide communication efficiency for all-to-all data shuffles involving short-sized messages. The Bruck algorithm has been extensively used to facilitate global data shuffles in a multi-CPU environment and is also part of the MPICH and Open MPI implementations. This work presents the first investigation of using the Bruck algorithm for all-to-all communication in multi-GPU systems using the NVIDIA Collective Communications Library (NCCL). Our experimental study demonstrates that while the Bruck algorithm exhibits superior performance for small-sized messages in a multi-CPU environment, the same advantages are not evident for multi-GPU environments. Furthermore, we describe and compare an optimized Bruck algorithm implementation in NCCL and compare it to NCCL's default all-to-all and MPI-based implementations. Finally, we discuss the challenges and opportunities of implementing new multi-GPU collectives using NCCL's public-facing API.

Funding

Collaborative Research: SHF: Small: Scalable and Extensible I/O Runtime and Tools for Next Generation Adaptive Data Layouts | Funder: National Science Foundation | Grant ID: CCF-2401274

Collaborative Research: SHF: Small: Scalable and Extensible I/O Runtime and Tools for Next Generation Adaptive Data Layouts | Funder: National Science Foundation | Grant ID: 2401274

Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale | Funder: University of Alabama at Birmingham | Grant ID: 2316157

Collaborative Research: PPoSS: Large: A Full-stack Approach to Declarative Analytics at Scale | Funder: University of Alabama at Birmingham

History

Citation

Sewell, A., Fan, K., Shovon, A. R., Dyken, L., Kumar, S.Petruzza, S. (2024, January). Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (pp. 127-133). Association for Computing Machinery (ACM). https://doi.org/10.1145/3635035.3635047

Publisher

Association for Computing Machinery (ACM)

Usage metrics

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC