Exploiting intra-warp address monotonicity for fast memory coalescing in GPUs
Graphics Processing Units (GPUs) are growing increasingly popular as general purpose compute accelerators. GPUs are best suited for applications which have abundant data parallelism wherein the computation expressed as a single thread can be applied over a large set of data items. One key constraint that affects application performance on GPUs is that the underlying hardware is single-instruction, multiple data (SIMD) hardware which requires parallel instructions from the multiple threads to execute in a lock-step manner. The benefits of lock-step execution can be seriously degraded if the threads diverge (because of memory or branches). Specifically in the case of memory, the addresses from each thread in a SIMD "wavefront/warp" must be coalesced to enable parallel memory access to minimize divergence. The general problem of coalescing assumes arbitrary address distribution which can be slow. This thesis aims to exploit intra-warp address monotonicity (as measured in a recent study by Holic) to achieve fast memory coalescing. Holic's study reveals the intra-warp addresses are monotonically increasing or decreasing in the common case. The key contributions of this thesis are twofold. First, I design novel hardware coalescing mechanisms to achieve fast-coalescing and quantify the area/delay of my coalescing designs. Second, I quantify the impact of fast-coalescing on overall GPU performance for a suite of GPU benchmarks.
Thottethodi, Purdue University.
Off-Campus Purdue Users:
To access this dissertation, please log in to our