Improving Productivity of Accelerator Computing Through Programming Models and Compiler Optimizations
Abstract
During the past decade, accelerators, such as NVIDIA CUDA GPUs and Intel Xeon Phis, have seen increasing popularity for their performance and have been employed by many applications. Accelerators draw their performance from massive parallelism and complex architectures, and programmers must exploit these unique characteristics and architectural features to obtain high performance. The answer to this need has been specialized programming models, such as CUDA and OpenCL. However, the specialization comes at the cost of programmability, which can hinder programmer productivity and prevent potential applications from adopting accelerators. Programmer productivity can be improved by either reducing the programming effort or improving the quality of the outcome. This thesis explores productivity improvement from both directions and offers an improved programming model and compiler optimizations for accelerator systems.

To improve the programmability of accelerators, prior work has studied high-level programming models on individual compute nodes, such as OpenMPC and HiCUDA. Two standardized high-level programming models for accelerators, OpenACC and OpenMP 4.x, have also emerged in the past few years. However, the lack of support for distributed computer architectures in these models leaves the programmability issue open. The first part of the thesis presents HYDRA, a source-to-source translation system that generates programs for distributed environments from a simple shared-address-space program; specifically, it translates OpenMP into MPI+accelerator programs. HYDRA uses a simple OpenMP model in which programmers specify only the parallel regions and the shared data in the program. Prior to HYDRA, several programming models for accelerator clusters had been proposed; however, these models use accelerator-specific constructs to obtain the required information. HYDRA instead explores simpler programming models and automatic approaches to acquire that information, thus improving the programmability of accelerator clusters. The thesis also presents compile-time analyses and optimizations that ensure scalability of the generated programs on accelerator clusters, and it provides support for multiple accelerator architectures.
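As an illustration of this input model, the sketch below shows the kind of plain OpenMP program a HYDRA-style translator would accept (a hypothetical example; the array names and sizes are illustrative, not taken from the thesis). Only the parallel region and the shared data are specified; partitioning the iteration space across nodes, generating the MPI communication, and offloading work to the accelerator would all be left to the translator.

    /* Hypothetical sketch of the simple shared-address input model.
     * The programmer marks only the parallel region and shared data;
     * distribution and accelerator offload are generated automatically. */
    #include <stdio.h>
    #define N 1024

    int main(void) {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* Only the parallel region and the shared data are specified. */
        #pragma omp parallel for shared(a, b, c)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }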
The availability of HYDRA provides a common programming model for NVIDIA GPUs and Xeon Phis. The second part of this thesis takes this opportunity to comparatively study the two accelerators in terms of performance and productivity over a wide range of applications. Prior work has studied the performance difference between NVIDIA GPUs and Intel Xeon Phis; however, those projects mainly focused on single applications, and a productivity comparison is missing. The challenge in comparing productivity is the difficulty of measuring programming effort. With HYDRA, the effort can be held constant across architectures by using a single source code. Productivity is then evaluated by how close the generated programs come to the capability of each accelerator, represented by hand-optimized variants of the same programs.

The last part of this thesis investigates opportunities for further performance optimization on accelerators, specifically on NVIDIA GPUs. Program execution on GPUs relies on efficient and balanced usage of on-chip resources, such as registers, shared memory, and thread blocks. If the available resources cannot meet the demand, GPUs reduce the number of concurrent threads, potentially leaving other resources underutilized. Most often, registers are the main limiter, while shared memory, an on-chip, user-managed memory, is rarely employed; the de facto approach ignores this fast memory space and spills excess registers to the slower local memory. The last chapter of the thesis presents a compile-time, assembly-level optimization that reduces register usage in GPU programs by moving registers to shared memory on post-Kepler GPUs. Prior work proposed register allocation algorithms that also utilize shared memory for spilled registers; the most directly related algorithm relies on nvcc to restrict register usage and converts local-memory spills to shared memory, accepting less efficient code in exchange for lower register use. In contrast, the proposed optimization is stand-alone and can operate on the more efficient code variant. It is supported by a compile-time performance predictor that uses information present in the assembly program to choose the best code variant from among the different register allocation methods.
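The underlying trade can be sketched at the source level (a hypothetical CUDA fragment for illustration only; the optimization described above operates at the assembly level, and all names here are invented). Instead of letting an over-allocated value spill to slow, off-chip local memory, each thread keeps it in an otherwise-unused, per-thread slot of fast, on-chip shared memory:

    /* Hypothetical source-level illustration of register-to-shared-memory
     * demotion; the thesis optimization performs this on assembly code. */
    #define BLOCK 256

    __global__ void kernel(const float *in, float *out, int n) {
        /* One spill slot per thread in otherwise-unused shared memory. */
        __shared__ float spill[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float t = in[i] * 2.0f;      /* value that would otherwise spill */
        spill[threadIdx.x] = t;      /* "spill" to fast shared memory    */

        /* ... long computation that exhausts the register budget ... */

        out[i] = spill[threadIdx.x]; /* "reload" from shared memory      */
    }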
Degree
Ph.D.
Advisors
Rudolf Eigenmann, Purdue University.
Subject Area
Computer Engineering