Parallelization and performance-tuning: Automating two essential techniques in the multicore era

Chirag Uday Dave, Purdue University

Abstract

In today’s multicore era, parallelizing serial code is essential to exploit the performance potential of modern architectures. Parallelization, especially of legacy code, is challenging: manual effort must be directed either toward algorithmic modifications or toward analysis of computationally intensive code sections for the best possible parallel performance, both of which are difficult and time-consuming. Automatic parallelization uses sophisticated compile-time techniques to identify parallelism in serial programs, reducing the burden on the program developer. This work considers the implementation of key parallelization techniques, such as data dependence analysis and advanced points-to and alias analysis, in the source-to-source parallelizing compiler Cetus. Auto-parallelization results are reported for a set of benchmarks from the NAS Parallel and SPEC OMPM2001 suites. A key difficulty in using automatic parallelization, however, is that optimizing compilers are generally unable to estimate the performance of an application, or even of a program section, at compile time, so the task of performance improvement invariably falls to the developer. Automatic tuning uses static analysis and runtime performance metrics to determine the compile-time configuration that yields the best application performance. This work describes an offline tuning approach that couples Cetus with an additional tuning framework to tune parallel application performance. An existing, generic tuning algorithm, Combined Elimination, is used to study the effect of serializing parallelizable loops based on measured whole-program execution time. The outcome is a combination of parallel loops that is guaranteed to match or improve on the performance of the original program. The results of the autotuning approach are compared against hand-parallelized C benchmarks from the SPEC OMPM2001 and NAS Parallel suites.
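As an illustration of the compile-time reasoning a data dependence analysis performs, below is a minimal sketch of the classic GCD dependence test (in Python; the function name and interface are illustrative, not Cetus's actual API). For two affine accesses to the same array inside a loop, a cross-iteration dependence is possible only if gcd of the index coefficients divides the difference of the constant offsets; when it does not, the loop is proven safe to parallelize.

```python
from math import gcd

def gcd_test(write_coef, write_const, read_coef, read_const):
    """GCD dependence test for a loop body of the form
        a[write_coef*i + write_const] = ... a[read_coef*i + read_const] ...
    Returns True if a cross-iteration dependence is POSSIBLE.
    The test is conservative: True does not prove a dependence exists,
    but False proves independence, so the loop may run in parallel.
    """
    # The dependence equation
    #   write_coef*i - read_coef*j = read_const - write_const
    # has an integer solution iff gcd(write_coef, read_coef)
    # divides the right-hand side.
    g = gcd(write_coef, read_coef)
    return (read_const - write_const) % g == 0

# a[2*i] = a[2*i + 1]: writes even elements, reads odd ones -> independent
assert gcd_test(2, 0, 2, 1) is False
# a[i] = a[i + 1]: each iteration reads the next one's element -> possible dependence
assert gcd_test(1, 0, 1, 1) is True
```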
The auto-parallelized and auto-tuned versions are close to or better than serial performance in most cases and consistently outperform state-of-the-art parallelizers such as Intel’s ICC. Additional parallelization techniques and further extraction of beneficial parallelism could improve the tuning results further.
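The Combined Elimination search over parallel loops can be sketched as follows (a simplified Python model; the `measure` callback and the loop identifiers are hypothetical stand-ins for the tuning framework's actual timing harness). Starting from the fully parallel program, each loop is tentatively serialized to record its relative impact on whole-program time, and loops whose serialization helps are then eliminated one at a time, re-measuring after each elimination, until no further improvement is found.

```python
def combined_elimination(loops, measure):
    """Combined Elimination over a set of parallelizable loops.

    loops:   iterable of loop identifiers
    measure: callback taking the set of loops kept parallel and
             returning the measured whole-program execution time
    Returns the set of loops to keep parallel.
    """
    kept = set(loops)
    base = measure(kept)
    improved = True
    while improved:
        improved = False
        # Relative impact (RIP) of serializing each loop individually:
        # negative means serializing that loop speeds the program up.
        rips = {l: (measure(kept - {l}) - base) / base for l in sorted(kept)}
        harmful = sorted((l for l, r in rips.items() if r < 0),
                         key=lambda l: rips[l])
        # Eliminate harmful loops greedily, re-measuring after each one,
        # since eliminations can interact with each other.
        for l in harmful:
            t = measure(kept - {l})
            if t < base:
                kept.discard(l)
                base = t
                improved = True
    return kept

# Synthetic example: L1 and L3 carry beneficial parallelism,
# while parallel execution of L2 adds overhead.
def measure(parallel):
    t = 10.0
    if 'L2' in parallel:
        t += 2.0                      # harmful parallel loop
    t -= len(parallel & {'L1', 'L3'})  # beneficial parallel loops
    return t

assert combined_elimination(['L1', 'L2', 'L3'], measure) == {'L1', 'L3'}
```

Because every elimination is accepted only if the re-measured time actually drops below the current baseline, the search can never return a configuration slower than the fully parallel starting point, which matches the "equal or improve" guarantee described above.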

Degree

M.S.E.C.E.

Advisors

Eigenmann, Purdue University.

Subject Area

Computer Engineering
