Several recent papers argue that, because Dennard scaling of the supply voltage is slowing down, future multicore performance will be limited by dark silicon, where an increasing number of cores are kept powered down for lack of power. Customizing the cores to improve power efficiency may incur increased effort for hardware design, verification, and test, and may degrade programmability. In this paper, we show that dark silicon is sub-optimal in performance and avoidable, and that a gentler, evolutionary path for multicores exists. We make two key observations: (1) previous papers examine voltage-frequency-scaled designs on the power-performance Pareto frontier, whereas the frontier extends to a new region, derived by frequency scaling alone, where voltage-scaled designs are infeasible; and (2) because memory latency improves only slowly over generations, the performance of future multicores’ workloads will be dominated by memory latency. Guided by these observations and a simple analytical model, we exploit (1) the sub-linear impact of clock speed on performance in the presence of memory latency, and (2) the super-linear impact of throughput on queuing delays. Accordingly, we propose an evolutionary path for multicores, called successive frequency unscaling (SFU). Compared to dark silicon, SFU keeps significantly more cores powered, running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. The higher active core count enables more memory-level parallelism, non-linearly offsetting the slower clock and yielding higher performance than dark silicon. For memory-intensive workloads, full SFU, where all the cores are powered up, performs 81% better than dark silicon at the 11 nm technology node.
For enterprise workloads, where both throughput and response times are important, we employ controlled SFU (C-SFU), which moderately slows down the clock and powers many, but not all, cores to achieve 29% better throughput than dark silicon at the 11 nm technology node. The higher throughput non-linearly reduces queuing delays and thereby compensates for the slower clock, keeping C-SFU’s total response latency within ±10% of that of dark silicon.
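The two effects named in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's analytical model. It assumes that each request costs a clock-dependent compute time plus a fixed memory stall time, that power grows in proportion to cores × frequency at a fixed voltage (leakage is ignored), and that each core behaves as an independent M/M/1 queue with the load split evenly. All function names and constants are hypothetical.

```python
# Toy model only: every constant here is a made-up, illustrative value.

def time_per_request(freq, compute_time=1.0, mem_time=4.0):
    """Service time: compute part scales with 1/freq, memory stalls do not."""
    return compute_time / freq + mem_time


def throughput(cores, freq):
    """Aggregate requests per unit time across all powered cores."""
    return cores / time_per_request(freq)


def response_time(cores, freq, arrival_rate):
    """Mean M/M/1 response time per core, with load split evenly (assumption)."""
    service_rate = 1.0 / time_per_request(freq)
    per_core_arrivals = arrival_rate / cores
    if per_core_arrivals >= service_rate:
        raise ValueError("unstable queue: load exceeds capacity")
    return 1.0 / (service_rate - per_core_arrivals)


TOTAL_CORES = 64
POWER_BUDGET = 32.0   # assumed: power ~ cores * freq at fixed voltage
ARRIVAL_RATE = 5.0    # assumed total request arrival rate

# Dark silicon: full clock, power only as many cores as the budget allows.
dark_freq = 1.0
dark_cores = int(POWER_BUDGET / dark_freq)

# Frequency unscaling: power every core, lower the clock to fit the budget.
sfu_cores = TOTAL_CORES
sfu_freq = POWER_BUDGET / TOTAL_CORES

for name, cores, freq in [("dark silicon", dark_cores, dark_freq),
                          ("frequency unscaling", sfu_cores, sfu_freq)]:
    print(f"{name}: cores={cores}, freq={freq:.2f}, "
          f"service={time_per_request(freq):.2f}, "
          f"throughput={throughput(cores, freq):.2f}, "
          f"response={response_time(cores, freq, ARRIVAL_RATE):.2f}")
```

With these assumed constants, halving the clock lengthens each request's service time only modestly because memory stalls dominate, while doubling the powered cores raises aggregate throughput and shortens the mean response time. The numbers are illustrative and do not correspond to the 81% and 29% results reported above.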