A journey through performance evaluation, tuning, and analysis of parallelized applications and parallel architectures: A quantitative approach
In today's multicore era, with persistently improving fabrication technology, the new challenge is to find applications (i.e., killer apps) that exploit the increased computational power. Automatic parallelization of sequential programs, combined with tuning techniques, is an alternative to manual parallelization that saves programmer time and effort. Hand parallelization is a tedious, error-prone process. A key difficulty is that parallelizing compilers are generally unable to estimate the performance impact of an optimization on a whole program or a program section at compile time; hence, the ultimate performance decision today rests with the developer. Building an autotuning system to remedy this situation is not a trivial task. Automatic parallelization concentrates on finding any possible parallelism in the program, whereas tuning systems help identify efficient parallel code segments and profitable optimization techniques. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data.

With the renewed relevance of autoparallelizers, a comprehensive evaluation will identify strengths and weaknesses in the underlying techniques and direct researchers as well as engineers to potential improvements. No comprehensive study has been conducted on modern parallelizing compilers for today's multicore systems. Such a study needs to evaluate different levels of techniques and their interactions, which requires efficiently navigating a large search space of optimization variants. With the recently introduced non-trivial parallel architectures, programmers need to learn the behavior of these systems with respect to their programs in order to orchestrate them for maximal utilization of the vast number of available CPU cycles.

In this dissertation, we take a journey through parallel applications and parallel architectures using a quantitative approach.
This work presents a portable empirical autotuning system that operates at program-section granularity and partitions the compiler options into groups that can be tuned independently. To our knowledge, this is the first approach delivering an autoparallelization system that ensures performance improvements for nearly all programs, eliminating the users' need to "experiment" with such tools to strive for the highest application performance. This method has the potential to substantially increase productivity and is thus of critical importance for exploiting the increased computational power of today's multicores.

We present an experimental methodology for comprehensively evaluating the effectiveness of parallelizing compilers and their underlying optimization techniques. The methodology takes advantage of the proposed customizable tuning system, which can efficiently evaluate a large space of optimization variants. We applied the proposed methodology to five modern parallelizing compilers and their tuning capabilities; we report speedups, parallel coverage, and the number of parallel loops, using the NAS Benchmarks as a program suite. As there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the importance of these techniques arises. This work evaluates the impact of the individual optimization techniques on overall program performance and discusses their mutual interactions. We study the differences between compilers based on the polyhedral model and those based on abstract syntax trees. We also study the scalability of the IBM Blue Gene/Q and Intel MIC architectures as representatives of modern multicore systems.

We found parallelizers to be reasonably successful in about half of the given science and engineering programs. Advanced versions of some of the techniques identified as most successful in previous generations of compilers are also most important today, while other techniques have risen significantly in impact.
An important finding is also that some techniques substitute for each other. Furthermore, we found that automatic tuning can lead to significant additional performance and sometimes matches or outperforms hand-parallelized programs. We analyze specific reasons for the measured performance and the potential for improvement of automatic parallelization. On average over all programs, the Blue Gene/Q and MIC systems achieved a scalability factor of 1.5.
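The group-wise tuning idea described above can be sketched in a few lines: each option group is explored independently while the other groups are held at a baseline setting, reducing the search from the product of the group sizes to their sum. The flag names, group structure, and the timing stand-in below are illustrative assumptions for exposition, not the dissertation's actual option set or tuner; a real system would invoke the compiler and time the generated code section.

```python
import itertools

# Hypothetical option groups; names are illustrative only.
OPTION_GROUPS = {
    "parallelization": ["-no-parallel", "-parallel"],
    "loop_transform": ["-no-unroll", "-unroll", "-unroll-and-jam"],
    "tiling": ["-no-tile", "-tile=32", "-tile=64"],
}

def measure(config):
    """Stand-in for compiling with `config` and timing one program
    section. A real tuner would run the program; here a synthetic
    cost models the effect of each flag (lower is better)."""
    cost = 10.0
    if config["parallelization"] == "-parallel":
        cost -= 4.0
    if config["loop_transform"] == "-unroll-and-jam":
        cost -= 1.5
    elif config["loop_transform"] == "-unroll":
        cost -= 1.0
    if config["tiling"] == "-tile=64":
        cost -= 0.5
    return cost

def tune_independently(groups, baseline):
    """Tune each group in isolation, holding the others at baseline.
    Search effort is the SUM of group sizes, not their product."""
    best = dict(baseline)
    trials = 0
    for group, options in groups.items():
        best_opt, best_time = None, float("inf")
        for opt in options:
            candidate = dict(baseline)
            candidate[group] = opt
            t = measure(candidate)
            trials += 1
            if t < best_time:
                best_opt, best_time = opt, t
        best[group] = best_opt
    return best, trials

baseline = {g: opts[0] for g, opts in OPTION_GROUPS.items()}
best, trials = tune_independently(OPTION_GROUPS, baseline)
exhaustive = len(list(itertools.product(*OPTION_GROUPS.values())))
print(best["parallelization"], trials, exhaustive)  # -parallel 8 18
```

With these toy groups, independent tuning needs 8 timing runs versus 18 for exhaustive search; the gap widens rapidly as groups grow, which is why partitioning options into independently tunable groups makes empirical search over large optimization spaces tractable (when cross-group interactions are weak).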
Rudolf Eigenmann, Purdue University.
Engineering, Computer; Computer Science