Efficient high performance collective communication for distributed memory environments

Qasim Ali, Purdue University

Abstract

Collective communication allows efficient communication and synchronization among a collection of processes, unlike point-to-point communication, which involves only a pair of communicating processes. Achieving high performance for both kernels and full-scale applications running on a distributed memory system requires an efficient implementation of collective communication operations, and developing such an implementation requires attention to both algorithmic and hardware issues. This dissertation proposes novel, efficient collective communication algorithms and describes their implementation. The algorithms target distributed memory machines: both clusters (with nodes that are either SMPs or uniprocessors) and accelerator-based machines (e.g., IBM’s Cell processor, which is used as the accelerator core in IBM’s Roadrunner, the world’s fastest supercomputer). For the cluster-of-workstations environment, the dissertation proposes efficient asynchronous and concurrent collective operations, a generalized reduction algorithm, and parallel reductions. For the Cell processor, it describes the implementation of very fast barrier synchronization, broadcast, all-gather, reduce, and all-reduce collectives, which work on both single- and dual-Cell machines. These collectives take into account the impact of both concurrency and data traffic on the on-chip and off-chip interconnects. The implementations for both a cluster of workstations and the Cell processor achieve performance superior to the previously published state of the art. The dissertation also presents and validates performance models for a variety of high-performance collective communication algorithms for systems with Cell processors. The models extend the PLogP model, a well-known point-to-point performance model, by accounting for the unique hardware characteristics of the Cell (e.g., heterogeneous interconnects and DMA engines) and by applying the model to collective communication. Finally, the dissertation presents experimental results that validate the algorithm designs and the effectiveness of the models.
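As a rough illustration of how a point-to-point model such as PLogP can be applied to a collective, the sketch below builds the round-by-round schedule of a generic binomial-tree broadcast and multiplies the number of rounds by a per-message cost of L + g(m). This is a textbook-style estimate, not the dissertation's algorithms or its Cell-specific model; the function names and the latency and gap values are hypothetical and chosen only for illustration.

```python
def binomial_broadcast_rounds(num_procs, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) ranks.

    Generic binomial-tree broadcast: in every round, each rank that already
    holds the message forwards it to one rank that does not.
    """
    rounds = []
    have_data = 1  # how many ranks (numbered relative to the root) hold the message
    while have_data < num_procs:
        pairs = []
        for rel in range(have_data):
            peer = rel + have_data
            if peer < num_procs:
                sender = (rel + root) % num_procs
                receiver = (peer + root) % num_procs
                pairs.append((sender, receiver))
        rounds.append(pairs)
        have_data *= 2
    return rounds


def plogp_broadcast_estimate(num_procs, latency, gap):
    """Estimate broadcast time as rounds * (L + g(m)).

    `latency` stands for L and `gap` for g(m) at the message size of interest;
    both are assumed inputs, not measured values.
    """
    return len(binomial_broadcast_rounds(num_procs)) * (latency + gap)


if __name__ == "__main__":
    for step, pairs in enumerate(binomial_broadcast_rounds(8)):
        print("round", step, pairs)
    # Hypothetical parameters in arbitrary time units.
    print(plogp_broadcast_estimate(8, latency=2.0, gap=1.0))
```

A hardware-aware model of the kind described in the abstract would replace the single latency and gap values with terms that depend on which interconnect each transfer crosses; the sketch above deliberately ignores that.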

Degree

Ph.D.

Advisors

Pai, Purdue University.

Subject Area

Information science; Computer science
