Characterizing the intra-warp address distribution and bandwidth demands of GPGPUs
Abstract
General-purpose Graphics Processing Units (GPGPUs) are an important class of architectures that offer energy-efficient, high-performance computation for data-parallel workloads. GPGPUs use single-instruction, multiple-data (SIMD) hardware as the core execution engines, typically with 32 to 64 lanes of data width. Such SIMD operation is key to achieving high performance; however, if the memory demands of the different lanes in the "warp" cannot be satisfied, overall system performance can suffer.

There are two challenges in handling such heavy demand for memory bandwidth. First, the hardware necessary to coalesce multiple accesses to the same cache block, a key function necessary to reduce the demand for memory bandwidth, can be a source of delay and complexity. Ideally, all duplicate accesses should be coalesced into a single access. Memory coalescing hardware, if designed for the worst case, can result in either high area and delay overheads or wasted bandwidth. Second, bandwidth demands can vary significantly. Ideally, all memory accesses of a warp should proceed in parallel. Unfortunately, it is prohibitively expensive to design a memory subsystem for the worst-case bandwidth demand, in which each lane accesses a different cache block.

The goal of this thesis is to characterize the intra-warp memory-access behavior of GPGPU workloads to inform memory subsystem designs. The goal is not to propose and evaluate hardware optimizations based on this characterization; I leave such optimizations for future work with my collaborator, Hector Enrique Rodriguez-Simmonds. Specifically, I characterize two properties that have the potential to lead to optimizations in the memory subsystem. First, I demonstrate that there is significant access monotonicity at both the cache-block and page levels. This is significant because my collaborator's work reveals that access monotonicity can be leveraged to significantly simplify address-coalescing logic.
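The two notions above, coalescing duplicate cache-block accesses and intra-warp access monotonicity, can be sketched concretely. The following is a minimal illustration, not the thesis's methodology; the block size, warp width, and example addresses are assumptions chosen for clarity.

```python
BLOCK_SIZE = 128   # bytes per cache block (assumed for illustration)
WARP_WIDTH = 32    # lanes per warp (assumed for illustration)

def coalesce(lane_addresses):
    """Map each lane's byte address to a cache-block ID and merge
    duplicates, returning (per-lane block IDs, unique blocks to fetch)."""
    blocks = [addr // BLOCK_SIZE for addr in lane_addresses]
    unique_blocks = sorted(set(blocks))
    return blocks, unique_blocks

def is_monotonic(blocks):
    """A warp's accesses are monotonic if the per-lane block IDs
    never decrease from one lane to the next."""
    return all(a <= b for a, b in zip(blocks, blocks[1:]))

# Example: a unit-stride access, where lane i reads a 4-byte word
# at base + 4*i -- a common pattern in data-parallel kernels.
addrs = [0x1000 + 4 * lane for lane in range(WARP_WIDTH)]
blocks, unique = coalesce(addrs)
print(is_monotonic(blocks))   # True: unit-stride accesses are monotonic
print(len(unique))            # 1: all 32 lanes coalesce into one block
```

In this unit-stride example the entire warp collapses to a single cache-block fetch, which is exactly the case that coalescing hardware is designed to exploit; monotonicity is the property that allows such hardware to be simplified.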
Second, I characterize memory bandwidth patterns by the number of unique blocks and pages accessed on a per-warp basis. My study motivates a novel horizontal cache organization called a "cache spectrum" (in contrast to traditional, vertical cache hierarchies) to maximize the number of unique accesses that can be served simultaneously. Finally, further optimizations are possible if the warps that access a large number of blocks are predictable. I examine two simple techniques to measure the predictability of intra-warp bandwidth demands. My (negative) results reveal that more sophisticated predictors may need to be explored.
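The per-warp bandwidth demand described above can be sketched as a simple count of unique cache blocks and pages touched by one warp. This is an illustrative sketch only; the block size, page size, and the strided example are assumptions, not values from the thesis.

```python
BLOCK_SIZE = 128    # bytes per cache block (assumed for illustration)
PAGE_SIZE = 4096    # bytes per page (assumed for illustration)

def warp_demand(lane_addresses):
    """Return (unique cache blocks, unique pages) touched by one warp,
    a simple proxy for its memory bandwidth demand."""
    blocks = {addr // BLOCK_SIZE for addr in lane_addresses}
    pages = {addr // PAGE_SIZE for addr in lane_addresses}
    return len(blocks), len(pages)

# Worst case at the block level: each of 32 lanes strides a full
# block apart, so no two lanes share a cache block.
scattered = [lane * BLOCK_SIZE for lane in range(32)]
print(warp_demand(scattered))   # (32, 1): 32 unique blocks, one page
```

The example shows why worst-case provisioning is expensive: a single warp can demand 32 distinct block fetches, even while all of its accesses fall within one page.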
Degree
M.S.E.C.E.
Advisors
Thottethodi, Purdue University.
Subject Area
Computer Engineering