Multiple Fragments Multiple Threads (MFMT) for Control Flow Irregularity in GPGPUs

Di Mo, Purdue University

Abstract

General Purpose Graphics Processing Units (GPGPUs) rose to prominence with the release of Nvidia's Fermi architecture in 2009, which introduced the Single Instruction Multiple Thread (SIMT) execution model and dramatically improved the CUDA programming model used to program GPGPUs. Since then, GPGPUs have grown into an alternative to traditional CPU-based computing for a variety of parallel applications such as neural networks, Big-Data analytics, and machine learning. However, the SIMT execution model breaks down for programs with control divergence (namely branches) because the system can support only a single instruction stream: threads that do not take the currently executing branch must wait their turn, often leading to dramatic performance loss. To combat this, this paper proposes a Multiple Fragments Multiple Threads (MFMT) architecture that allows multiple instruction streams to execute in parallel on GPGPUs. MFMT's key insight is that a small number of control flow paths can be supported by current GPGPUs without major modifications, instead harnessing resources that are under-utilized in control-divergent applications. This means that, with minimal area and energy costs, GPGPUs can be transformed to support a small but meaningful number of instruction streams for different control flow paths (dubbed fragments), dramatically benefiting control-divergent programs. For non-control-divergent applications, MFMT naturally reverts to the original SIMT execution model and maintains its original high level of performance. MFMT is evaluated on a GPGPU architectural simulator against the baseline SIMT scheme as well as two current state-of-the-art solutions for control divergence: Multi-Path and Variable Warp Sizing (VWS). On a variety of control-divergent workloads, MFMT achieves 11% speedup over a baseline GPGPU at -3% dynamic energy overhead (i.e., a 3% energy reduction), while Multi-Path and Variable Warp Sizing achieve 4% and 3% speedup at -2% and 2% dynamic energy overhead, respectively.
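To make the control-divergence problem concrete, the following is a minimal CUDA sketch (the kernel, data, and thresholds are hypothetical illustrations, not taken from the thesis) in which threads of the same 32-thread warp take different branch directions. Under baseline SIMT the two paths are serialized, one after the other, which is exactly the lost parallelism that MFMT's concurrent fragments aim to recover.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel illustrating intra-warp control divergence:
// lanes of the same warp take different branches, so a baseline SIMT
// pipeline executes path A and path B back to back instead of in parallel.
__global__ void divergent_scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f) {          // path A: taken by some lanes of the warp
        out[i] = in[i] * 2.0f;
    } else {                     // path B: taken by the remaining lanes
        out[i] = in[i] * 0.5f;
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    // Alternate signs so every warp is guaranteed to diverge.
    for (int i = 0; i < n; ++i) in[i] = (i % 2) ? 1.0f : -1.0f;

    divergent_scale<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0]=%f out[1]=%f\n", out[0], out[1]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}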

Degree

M.S.E.C.E.

Advisors

Vijaykumar, Purdue University.

Subject Area

Computer Engineering
