Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness
This work presents and evaluates algorithms for MPI collective communication operations on high performance systems. Collective communication algorithms are extensively investigated, and a universal algorithm to improve the performance of MPI collective operations on hierarchical clusters is introduced. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication. The universal algorithm shows impressive performance results with a variety of collectives, improving upon the MPICH algorithms as well as the Cray MPT algorithms. Speedups average 15x - 30x for most collectives with improved scalability up to 65536 cores.^ Further novel improvements are also proposed for inter-node communication. By utilizing algorithms which take advantage of multiple senders from the same shared memory buffer, an additional speedup of 2.5x can be achieved. The discussion also evaluates special-purpose extensions to improve intra-node communication. These extensions return a shared memory or copy-on-write protected buffer from the collective, which reduces or completely eliminates the second phase of intra-node communication.^ The second part of this work improves the performance of MPI collective communication operations in the presence of imbalanced processes arrival times. High performance collective communications are crucial for the performance and scalability of applications, and imbalanced process arrival times are common in these applications. A micro-benchmark is used to investigate the nature of process imbalance with perfectly balanced workloads, and understand the nature of inter- versus intra-node imbalance. These insights are then used to develop imbalance tolerant reduction, broadcast, and alltoall algorithms, which minimize the synchronization delay observed by early arriving processes. These algorithms have been implemented and tested on a Cray XE6 using up to 32k cores with varying buffer sizes and levels of imbalance. Results show speedups over MPICH averaging 18.9x for reduce, 5.3x for broadcast, and 6.9x for alltoall in the presence of high, but not unreasonable, imbalance ^
Vijay S. Pai, Purdue University.
Off-Campus Purdue Users:
To access this dissertation, please log in to our