Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices
Abstract
Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, such as Graphics Processing Units (GPUs), are a compelling way to achieve considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law.

In the HPC and deep learning (DL) domains, the death of single-chip GPU performance scaling will usher in a renaissance of multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technologies will enable single-package systems composed of multiple chiplets that continue to scale even as per-chip transistor counts do not. In this evolving, massively parallel NUMA landscape, the placement of data on each chiplet or discrete GPU card, and the scheduling of the threads that use that data, are critical factors in system performance and power consumption.

Outside the supercomputer space, general-purpose compute units remain the main driver of a data center’s total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline’s frontend. Coupled with this hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates, which has led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.

In this dissertation, I address these paradigm shifts by answering three questions: (1) how do we overcome non-uniform memory access overheads in next-generation multi-chiplet GPUs in the era of DL-driven workloads? (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and their request similarity? and (3) how can we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?

To this end, I propose three frameworks and systems to address these challenges. First, to improve the quality of GPU research produced by the academic community, I developed Accel-Sim, a new GPU simulation framework that helps keep simulators up to date with contemporary designs. Using a counter-by-counter validation of the GPU memory system, Accel-Sim decreases cycle error from 94% in state-of-the-art simulation to 15%.

Second, to maintain GPU performance scalability in the twilight of Moore’s Law, I propose a programmer-transparent Locality-Aware Data Management (LADM) system designed to operate on massive logical GPUs composed of multiple discrete devices, which are themselves composed of chiplets. LADM has two key components: a threadblock-centric, compiler-assisted index analysis and a runtime system that performs adaptive data placement, threadblock scheduling, and cache insertion. Compared to state-of-the-art multi-GPU scheduling, LADM reduces inter-chip memory traffic by 4× and captures 82% of the performance of an unbuildable monolithic chip.

Third, to exploit the similarity among contemporary microservice requests, I propose a new class of computing hardware, the Request Processing Unit (RPU), which modifies out-of-order CPU cores to execute microservices using a Single Instruction Multiple Request (SIMR) execution model. Our solution leverages the CPU’s programmability and latency optimizations while still exploiting the GPU’s SIMT efficiency and memory model scalability.
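The cycle-error figures above come from validating simulated cycle counts against hardware performance counters. As a minimal sketch of how such a validation metric could be computed (the kernel names and cycle counts below are illustrative placeholders, not measurements from Accel-Sim or real hardware, and the exact methodology is the one described in the dissertation), a per-kernel mean absolute relative error might look like this:

# Hypothetical sketch: mean absolute relative cycle error across kernels.
# All names and numbers here are placeholders for illustration only.

def mean_absolute_relative_error(simulated, measured):
    """Average of |sim - hw| / hw over the kernels present in both dicts."""
    common = simulated.keys() & measured.keys()
    errors = [abs(simulated[k] - measured[k]) / measured[k] for k in common]
    return sum(errors) / len(errors)

hardware_cycles  = {"sgemm": 1_200_000, "stencil": 850_000, "spmv": 640_000}
simulated_cycles = {"sgemm": 1_350_000, "stencil": 790_000, "spmv": 710_000}

print(f"cycle error: {mean_absolute_relative_error(simulated_cycles, hardware_cycles):.1%}")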
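LADM reasons about locality at threadblock granularity: a compiler pass predicts which data tile each threadblock will touch, and the runtime co-locates that tile and the threadblock on the same chiplet. The sketch below is not LADM’s actual index analysis or placement policy; it is a simplified, assumption-laden illustration of the general co-placement idea, with a made-up block-to-tile mapping and a round-robin tile-to-chiplet mapping.

# Simplified sketch of locality-aware co-placement of threadblocks and the
# data tiles they access across chiplets. Illustration only, not LADM itself.

NUM_CHIPLETS = 4

def tile_for_block(block_id, blocks_per_tile=8):
    """Assumed compiler-derived mapping: threadblock -> data tile it mostly reads."""
    return block_id // blocks_per_tile

def home_chiplet(tile_id):
    """Assumed placement policy: spread tiles round-robin across chiplets."""
    return tile_id % NUM_CHIPLETS

def schedule(num_blocks):
    """Schedule each threadblock on the chiplet that owns its predicted tile,
    so most of its memory traffic stays on-chiplet."""
    return {block: home_chiplet(tile_for_block(block)) for block in range(num_blocks)}

if __name__ == "__main__":
    for block, chiplet in schedule(32).items():
        print(f"threadblock {block:2d} -> chiplet {chiplet}")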
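The SIMR model rests on the observation that many concurrent microservice requests execute the same handler on different inputs, much as SIMT threads execute the same kernel on different data. The sketch below only conveys that batching intuition in software; the actual RPU groups requests in hardware and runs them on modified out-of-order cores. The toy handler and function names are assumptions for illustration.

# Toy software illustration of the Single Instruction Multiple Request idea:
# coalesce requests that invoke the same handler and process each group
# together, amortizing per-request front-end work. Intuition only, not the
# RPU microarchitecture.

from collections import defaultdict

def price_lookup(item_id):
    """Toy microservice handler: same code path, different data per request."""
    return {"item": item_id, "price_cents": 100 + (item_id % 7) * 25}

def batch_by_handler(requests):
    """Group (handler, argument) pairs by handler, mimicking SIMR grouping."""
    groups = defaultdict(list)
    for handler, arg in requests:
        groups[handler].append(arg)
    return groups

if __name__ == "__main__":
    incoming = [(price_lookup, i) for i in range(8)]
    for handler, args in batch_by_handler(incoming).items():
        # One batched pass over the group instead of eight independent ones.
        responses = [handler(a) for a in args]
        print(handler.__name__, responses[:2], "...")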
Degree
Ph.D.
Advisors
Rogers, Purdue University.
Subject Area
Artificial intelligence; Energy