Open Access Dissertations

Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness

Benjamin Scott Parsons, Purdue University

Abstract

This work presents and evaluates algorithms for MPI collective communication operations on high-performance systems. Collective communication algorithms are extensively investigated, and a universal algorithm that improves the performance of MPI collective operations on hierarchical clusters is introduced. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication. The universal algorithm shows strong performance results with a variety of collectives, improving upon both the MPICH algorithms and the Cray MPT algorithms. Speedups average 15x to 30x for most collectives, with improved scalability up to 65536 cores.

Further novel improvements are also proposed for inter-node communication. By using algorithms that take advantage of multiple senders transmitting from the same shared-memory buffer, an additional speedup of 2.5x can be achieved. The discussion also evaluates special-purpose extensions to improve intra-node communication. These extensions return a shared-memory or copy-on-write protected buffer from the collective, which reduces or completely eliminates the second phase of intra-node communication.

The second part of this work improves the performance of MPI collective communication operations in the presence of imbalanced process arrival times. High-performance collective communications are crucial for the performance and scalability of applications, and imbalanced process arrival times are common in these applications. A micro-benchmark is used to investigate the nature of process imbalance under perfectly balanced workloads and to understand the nature of inter-node versus intra-node imbalance. These insights are then used to develop imbalance-tolerant reduction, broadcast, and alltoall algorithms, which minimize the synchronization delay observed by early-arriving processes. These algorithms have been implemented and tested on a Cray XE6 using up to 32k cores with varying buffer sizes and levels of imbalance. Results show speedups over MPICH averaging 18.9x for reduce, 5.3x for broadcast, and 6.9x for alltoall in the presence of high, but not unreasonable, imbalance.
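
The hierarchical structure outlined in the abstract (an intra-node phase, an inter-node phase among one process per node, and a second intra-node phase) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the dissertation's implementation: the names `hierarchical_allreduce`, `inter_node_participants`, and `node_of_rank` are hypothetical, the operation is assumed to be a sum, and plain Python arithmetic stands in for the shared-memory buffers and the MPI collective.

```python
from collections import defaultdict


def hierarchical_allreduce(values, node_of_rank):
    """Simulate a two-level sum-allreduce in a single process.

    values[r] is rank r's contribution; node_of_rank[r] is the
    (hypothetical) id of the node that hosts rank r.
    """
    # Phase 1 (intra-node): each node combines the values of its own ranks.
    # This stands in for a reduction through a shared-memory buffer.
    node_partial = defaultdict(int)
    for rank, value in enumerate(values):
        node_partial[node_of_rank[rank]] += value

    # Phase 2 (inter-node): one leader per node takes part in an ordinary,
    # hierarchy-unaware collective. Here that collective is a plain sum.
    total = sum(node_partial.values())

    # Phase 3 (intra-node): every rank reads the result from its node.
    return [total for _ in values]


def inter_node_participants(node_of_rank):
    """Count the processes in the inter-node phase (one leader per node)."""
    return len(set(node_of_rank))


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6, 7, 8]
    node_of_rank = [0, 0, 0, 0, 1, 1, 1, 1]  # 2 nodes, 4 ranks each
    print(hierarchical_allreduce(values, node_of_rank))
    print(inter_node_participants(node_of_rank))
```

In this toy layout only two of the eight ranks take part in the inter-node phase, which is the property that lets an unmodified collective be reused there. The simulation says nothing about timing, so it does not reproduce any of the reported speedups.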

Disciplines

Computer Engineering

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

First Advisor

Vijay Pai

Committee Chair

Vijay Pai

Committee Member 1

Milind Kulkarni

Committee Member 2

Mithuna S. Thottethodi

Committee Member 3

Samuel P. Midkiff

Date of Award

Winter 2015

Recommended Citation

Parsons, Benjamin Scott, "Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness" (2015). Open Access Dissertations. 533.
https://docs.lib.purdue.edu/open_access_dissertations/533

Included in

Computer Engineering Commons
