Date of Award

Spring 2014

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Electrical and Computer Engineering

First Advisor

Mithuna S. Thottethodi

Committee Member 1

Anand Raghunathan

Committee Member 2

T. N. Vijaykumar

Abstract

General-purpose Graphics Processing Units (GPGPUs) are an important class of architectures that offer energy-efficient, high-performance computation for data-parallel workloads. GPGPUs use single-instruction, multiple-data (SIMD) hardware as the core execution engines with (typically) 32 to 64 lanes of data width. Such SIMD operation is key to achieving high performance; however, if memory demands of the different lanes in the "warp" cannot be satisfied, overall system performance can suffer.

There are two challenges in handling such heavy demand for memory bandwidth. First, the hardware necessary to coalesce multiple accesses to the same cache block--a key function necessary to reduce the demand for memory bandwidth--can be a source of delay complexity. Ideally, all duplicate accesses must be coalesced into a single access. Memory coalescing hardware, if designed for the worst case, can result in either high area and delay overheads, or wasted bandwidth. Second, bandwidth demands can vary significantly. Ideally, all memory accesses of a warp must proceed in parallel. Unfortunately, it is prohibitively expensive to design a memory subsystem for the worst-case bandwidth demand where each lane accesses a different cache block.
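
As a rough illustration of the coalescing problem described above, the following Python sketch collapses the per-lane addresses of one warp into the set of distinct cache blocks they touch. It is a software model only, not the hardware discussed in the thesis. The 128-byte block size, the 32-lane warp, and the names `BLOCK_BYTES` and `coalesce` are assumptions made for this example.

```python
# Illustrative software model of intra-warp address coalescing.
# BLOCK_BYTES and the 32-lane warp are assumptions for this sketch,
# not parameters taken from the thesis.
BLOCK_BYTES = 128


def coalesce(lane_addresses, block_bytes=BLOCK_BYTES):
    """Return the distinct cache-block base addresses touched by one warp,
    in order of first appearance across the lanes."""
    seen = set()
    blocks = []
    for addr in lane_addresses:
        base = (addr // block_bytes) * block_bytes  # align down to the block
        if base not in seen:
            seen.add(base)
            blocks.append(base)
    return blocks


# Best case: 32 lanes read consecutive 4-byte words inside one block.
unit_stride = [0x1000 + 4 * lane for lane in range(32)]

# Worst case: every lane touches a different block.
scattered = [0x1000 + BLOCK_BYTES * lane for lane in range(32)]

if __name__ == "__main__":
    print(len(coalesce(unit_stride)), "block request(s) for unit stride")
    print(len(coalesce(scattered)), "block request(s) for scattered access")
```

The two inputs bracket the range the abstract describes: a single block request when all lanes fall in the same block, and one request per lane when each lane accesses a different block.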

The goal of this thesis is to characterize the memory-access behavior of GPGPU workloads within warps to inform memory subsystem designs. The goal is not to propose and evaluate hardware optimizations based on this characterization. I leave such optimizations for future work with my collaborator Hector Enrique Rodriguez-Simmonds. Specifically, I characterize two properties which have the potential to lead to optimizations in the memory subsystem. First, I demonstrate that there is significant access monotonicity at both the cache-block and page levels. This is significant because my collaborator's work reveals that access monotonicity can be leveraged to significantly simplify address coalescing logic. Second, I characterize the memory bandwidth patterns by the number of unique blocks and pages accessed on a per-warp basis. My study motivates a novel horizontal cache organization called a "cache spectrum" (in contrast to traditional, vertical cache hierarchies) to maximize the number of unique accesses that can be served simultaneously. Finally, further optimizations are possible if the warps that access a large number of blocks are predictable. I examine two simple techniques to measure predictability of access patterns for intra-warp bandwidth demands. My (negative) results reveal that more sophisticated predictors may need to be explored.
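
The two properties characterized above can be made concrete with a small Python sketch that profiles one warp's addresses. The abstract does not define access monotonicity precisely, so this example assumes it means that block (or page) identifiers never decrease as the lane index increases. The 128-byte block, the 4 KiB page, and the names `unit_ids`, `is_monotonic`, and `warp_profile` are likewise assumptions for illustration.

```python
# Hedged sketch: per-warp profile of unique blocks/pages and monotonicity.
# Assumed definition: a warp is "monotonic" at some granularity when the
# block (or page) ID is non-decreasing from lane 0 to the last lane.
# Block and page sizes below are assumptions, not values from the thesis.


def unit_ids(lane_addresses, unit_bytes):
    """Map each lane's byte address to its block or page identifier."""
    return [addr // unit_bytes for addr in lane_addresses]


def is_monotonic(ids):
    """True when identifiers never decrease across consecutive lanes."""
    return all(a <= b for a, b in zip(ids, ids[1:]))


def warp_profile(lane_addresses, block_bytes=128, page_bytes=4096):
    """Summarize one warp's memory instruction at block and page level."""
    blocks = unit_ids(lane_addresses, block_bytes)
    pages = unit_ids(lane_addresses, page_bytes)
    return {
        "unique_blocks": len(set(blocks)),
        "unique_pages": len(set(pages)),
        "block_monotonic": is_monotonic(blocks),
        "page_monotonic": is_monotonic(pages),
    }


if __name__ == "__main__":
    strided = [64 * lane for lane in range(32)]  # two lanes per block
    shuffled = [0, 256, 128, 4096]               # out of order at block level
    print(warp_profile(strided))
    print(warp_profile(shuffled))
```

In the second input the page identifiers are still in order even though the block identifiers are not, which shows why the two granularities are measured separately.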

Recommended Citation

Holic, Calvin, "Characterizing the Intra-warp Address Distribution and Bandwidth Demands of GPGPUs" (2014). Open Access Theses. 192.
https://docs.lib.purdue.edu/open_access_theses/192

Included in

Computer Engineering Commons

Open Access Theses

Characterizing the Intra-warp Address Distribution and Bandwidth Demands of GPGPUs
