Balancing bandwidth and capacity in area-limited GPGPU caches
General-Purpose Graphics Processing Units (GPGPUs) have shown enormous promise in enabling high-throughput, data-parallel computation. As high-performance parallel computation engines, GPGPUs require significant memory system performance to ensure bottleneck-free computation. Specifically, they need significant cache capacity and high bandwidth within a limited silicon area. Unfortunately, the three-way trade-off among area, cache capacity, and cache bandwidth makes it impossible to achieve all three goals. For example, it is possible to build caches with both high capacity and high bandwidth (via banking/multiporting), but only at the cost of large silicon area. Similarly, one may design compact (in silicon area) caches of fairly high capacity if one is willing to sacrifice bandwidth, or compact caches of high bandwidth if one is willing to sacrifice capacity. The key contribution of this thesis is the design of a cache hierarchy that balances capacity and bandwidth demands within silicon area constraints to maximize GPGPU performance. My design leverages a key insight: high-bandwidth cache hits - cache hits in which a warp accesses several cache blocks - typically target the most recently used blocks. To exploit this observation, my design uses a small, high-bandwidth auxiliary cache in parallel with a traditional L1 cache. While accesses to the two caches proceed in parallel, the allocation of blocks is hierarchical: blocks are allocated only in the high-bandwidth auxiliary cache and trickle down to the L1 upon eviction from the auxiliary cache. Evaluation by simulation using GPGPU-Sim shows that, for the PolyBench benchmark suite, my design achieves better performance than conventional L1-only cache designs of equal area and/or capacity.
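The parallel-lookup, hierarchical-allocation policy described above can be illustrated with a minimal Python sketch. This is a behavioral model only, not the thesis's implementation: the class name, cache sizes, and the use of simple LRU replacement in both structures are assumptions made for illustration, and the sequential probing of the two caches stands in for what would be a parallel hardware lookup.

```python
from collections import OrderedDict

class TwoLevelLookupCache:
    """Hypothetical sketch: a small high-bandwidth auxiliary cache
    probed alongside a conventional L1. New blocks are allocated only
    in the auxiliary cache; auxiliary evictions trickle down to L1."""

    def __init__(self, aux_blocks, l1_blocks):
        self.aux = OrderedDict()   # LRU order: most recent at the end
        self.l1 = OrderedDict()
        self.aux_blocks = aux_blocks
        self.l1_blocks = l1_blocks

    def access(self, addr):
        # In hardware both caches are probed in parallel; probing them
        # in sequence here is only a modeling convenience.
        if addr in self.aux:
            self.aux.move_to_end(addr)   # refresh LRU position
            return "aux-hit"
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "l1-hit"
        # Miss: allocate only in the auxiliary cache (hierarchical fill).
        self._allocate_aux(addr)
        return "miss"

    def _allocate_aux(self, addr):
        if len(self.aux) >= self.aux_blocks:
            victim, _ = self.aux.popitem(last=False)  # evict LRU block
            self._trickle_to_l1(victim)
        self.aux[addr] = True

    def _trickle_to_l1(self, addr):
        if len(self.l1) >= self.l1_blocks:
            self.l1.popitem(last=False)  # evict LRU block from L1
        self.l1[addr] = True
```

For example, with a two-block auxiliary cache, accessing blocks A, B, then C evicts A from the auxiliary cache into the L1, so a later access to A hits in the L1 while recently touched blocks still hit in the high-bandwidth auxiliary cache.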
THOTTETHODI, Purdue University.