Automatic scaling of OpenMP applications beyond shared memory

Okwan Kwon, Purdue University

Abstract

The development of high-productivity programming environments that support efficient programs on distributed-memory architectures is one of the most pressing needs in parallel computing today. Many of today's parallel computer platforms have a distributed-memory architecture, as, most likely, will future multi-cores. Despite many approaches to improved programming models, such as HPF, Co-array Fortran, TreadMarks, and UPC, the state of the art for these platforms is to write explicit message-passing programs using MPI. This process is tedious, but it allows high-performance applications to be developed. Because expert software engineers are needed, many parallel computing platforms are inaccessible to the typical programmer. The OpenMP programming model has been gaining popularity for writing shared-memory applications because of the ease with which it expresses parallelism, using simple directives and clauses on top of serial program source code. To extend OpenMP's high programmability to distributed-memory systems, this dissertation presents a fully automated OpenMP-to-MPI translation system that consists of a translator and a runtime system. The system successfully executes the OpenMP versions of all regular, repetitive applications of the NAS Parallel Benchmarks on clusters. We describe the implementation of the system, which introduces a novel, clean compile-time/runtime interface for generating inter-thread communication messages. Communication accuracy is one of the key factors in achieving performance comparable to hand-written MPI. We discuss intrinsic limitations of compile-time techniques for generating efficient communication messages and, as a solution, propose a hybrid compiler-runtime translation scheme that features a new runtime data flow analysis technique and a compiler technique that makes a conservative analysis more accurate. Enhancing data affinity and locality is also a critical issue, and we discuss four data-affinity problems that arise in translating shared-memory applications into message-passing variants. To resolve these issues, we propose corresponding compiler/runtime optimizations. In this dissertation, we evaluate numerical and scientific applications with repetitive communication patterns on a medium-sized laboratory cluster. We quantitatively compare compile-time and runtime communication-generation schemes, as well as the overheads of the runtime techniques. We also present and discuss the performance of the translated programs, including the improvement from our data-affinity optimizations, and compare them with the MPI, HPF, and UPC versions of the benchmarks. The results show that our twelve translated programs achieve, on average, 88% of the performance of the hand-coded MPI programs.
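To make the kind of transformation concrete, the following is a minimal illustrative sketch, not output of the dissertation's translator: a simple OpenMP stencil loop shown in a comment, and a hand-written MPI rendering of the same computation in which each rank owns a block of iterations and exchanges boundary (halo) elements with its neighbors. The array size N, the stencil, and the block partitioning are hypothetical; the actual system derives the partitioning and the communication messages automatically through its compile-time/runtime interface.

/*
 * Illustrative sketch only (assumed example, not the translator's output):
 * an OpenMP parallel loop and an equivalent hand-written MPI version with
 * block partitioning and neighbor (halo) exchange.
 */
#include <mpi.h>
#include <stdio.h>

#define N 1024

static double a[N], b[N];

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Shared-memory form: one directive parallelizes the loop.
     *
     *   #pragma omp parallel for
     *   for (int i = 1; i < N - 1; i++)
     *       b[i] = 0.5 * (a[i - 1] + a[i + 1]);
     *
     * Message-passing form: each rank owns a contiguous block of the
     * iteration space and must fetch the boundary elements it reads
     * from neighboring ranks before computing.
     */
    int lo = 1 + rank * (N - 2) / nprocs;           /* first owned iteration */
    int hi = 1 + (rank + 1) * (N - 2) / nprocs;     /* exclusive upper bound */

    /* Exchange halo elements a[lo - 1] and a[hi] with the neighbors. */
    if (rank > 0)
        MPI_Sendrecv(&a[lo],     1, MPI_DOUBLE, rank - 1, 0,
                     &a[lo - 1], 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank < nprocs - 1)
        MPI_Sendrecv(&a[hi - 1], 1, MPI_DOUBLE, rank + 1, 0,
                     &a[hi],     1, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Each rank computes only its own block of b. */
    for (int i = lo; i < hi; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    if (rank == 0)
        printf("b[1] = %f\n", b[1]);

    MPI_Finalize();
    return 0;
}

In this hand-written version the programmer chose the partitioning and the exchanged elements; the dissertation's hybrid scheme instead determines, through compile-time analysis refined by runtime data flow analysis, which array elements written by one thread are read by another, and generates only those messages.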

Degree

Ph.D.

Advisors

Eigenmann, Purdue University.

Subject Area

Computer Engineering
