Effective Management of Shared Last-Level Cache Performance for Chip Multiprocessors

Abhisek Pan, Purdue University

Abstract

Current architectural trends of rising on-chip core counts and worsening power-performance penalties for off-chip memory accesses have made the shared last-level cache (LLC) one of the major determinants of multicore performance. In this thesis, I propose and explore hardware and software techniques for improving the performance of shared LLCs for parallel applications running on multicores. This thesis makes two key contributions. First, I propose a hardware-only way-partitioning policy to improve shared LLC performance for symmetric multithreaded programs running on multicores. Unlike prior work on way-partitioning for unrelated threads in a multiprogramming workload, the domain of multithreaded programs requires both throughput and fairness. Additionally, these workloads show no obvious thread differences to exploit: program threads see nearly identical IPC and data reuse as they progress, as expected for a well-written, load-balanced data-parallel program. Despite this balance and symmetry among threads, I show that a balanced partitioning of cache ways between threads is suboptimal. Instead, I propose a strategy of temporarily imbalancing the partitions between threads to improve cache utilization, adapting to the locality behavior of the threads as captured by dynamic set-specific reuse distance (SSRD). Cumulative SSRD histograms have knees that correspond to distinct important working sets; thus, cache ways can be taken away from a thread with only minimal performance impact if that thread is currently operating far from a knee. Those ways can then be given to a single “preferred” thread to push it over its next knee. The preferred thread is chosen in round-robin fashion to ensure balanced progress over the execution. The proposed algorithm also effectively handles scenarios in which an unpartitioned cache would outperform any explicit partitioning.
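The knee-driven reallocation described above can be sketched as follows. This is a minimal illustrative model, not the dissertation's actual algorithm: the function names, the histogram encoding (`cum[d]` = fraction of accesses with reuse distance at most `d` ways), and the flatness threshold `eps` are all assumptions. The idea it demonstrates is that a thread whose last allocated way sits on a flat region of its cumulative SSRD histogram (far from a knee) can donate that way to the round-robin preferred thread with little loss.

```python
def marginal_gain(cum, w):
    """Extra hit fraction contributed by the w-th allocated way, read off a
    cumulative SSRD histogram (cum[0] = fraction captured by one way)."""
    return cum[w - 1] - (cum[w - 2] if w >= 2 else 0.0)

def rebalance(ways, hists, preferred, eps=0.02):
    """One reallocation step (hypothetical sketch): every non-preferred
    thread whose last way adds less than eps hit fraction -- i.e. it is
    operating on a flat region far from a knee -- donates that way to the
    preferred thread, pushing the preferred thread toward its next knee."""
    new = list(ways)
    for t, cum in enumerate(hists):
        if t != preferred and new[t] > 1 and marginal_gain(cum, new[t]) < eps:
            new[t] -= 1
            new[preferred] += 1
    return new
```

For example, a thread whose histogram saturates at three ways donates its fourth way, while a thread with a knee at five ways is pushed over it; if every thread is still climbing toward a knee, the partition is left untouched.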
This dynamic partition-imbalance algorithm achieves up to a 44% reduction in execution time and a 91% reduction in misses over an unpartitioned shared cache for symmetric multithreaded applications from the PARSEC-2.0 and SPEC OMP suites.

Second, I develop a hardware-software technique for shared LLC management for task-parallel programs. Task-parallel programming models that extract concurrency at runtime from input annotations present a promising paradigm for writing parallel programs for today's multi- and many-core heterogeneous systems. By managing dependencies, data movement, task assignment, and orchestration, these models markedly simplify the programming effort of parallelization while exposing higher levels of concurrency. In addition, the use of a runtime platform enables innovations in the hardware-software interface that allow the hardware to be highly responsive to the characteristics of the application, and vice versa. I show that, for task-parallel applications running on multicores with a shared LLC, the concurrency-extraction framework can be used to substantially improve the efficiency of the shared LLC. I develop a task-based cache-partitioning technique that leverages the dependence-tracking and look-ahead capabilities of the runtime. Based on the input annotations of future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse and to evict blocks with no future reuse. These instructions allow the hardware to preserve all the blocks for a subset of the future tasks and to evict dead blocks early, yielding a considerable improvement in cache efficiency over existing thread-centric cache-management policies. Thread-centric policies cannot track the complex patterns of data reuse among tasks, which may be assigned to arbitrary cores; as a result, they replace blocks needed by future tasks, leading to poor overall hit rates.
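The runtime-guided eviction idea can be sketched as below. This is a hypothetical model of the mechanism, not the dissertation's actual hardware-software interface: the task dictionary shape, the `future_reuse_regions` and `pick_victim` names, and the flat region granularity are all illustrative assumptions. It shows the core point that the runtime's look-ahead over annotated task inputs and outputs identifies which cached data is dead, so the cache can evict dead blocks before falling back to recency-based replacement.

```python
def future_reuse_regions(task_queue):
    """Union of the data regions named by the input/output annotations of all
    pending tasks -- the runtime's look-ahead view of future reuse."""
    return {region for task in task_queue
            for region in task["inputs"] + task["outputs"]}

def pick_victim(cache_set, live_regions):
    """Choose a victim block in one cache set (sketch): prefer a dead block,
    i.e. one whose region no pending task will touch; otherwise fall back to
    LRU, with the front of the list as the least recently used block."""
    for blk in cache_set:
        if blk["region"] not in live_regions:
            return blk       # dead block: evict early
    return cache_set[0]      # all blocks live: plain LRU fallback
```

A thread-centric policy sees only per-core access streams, whereas this runtime-level view survives tasks migrating across arbitrary cores, which is what the task-based technique exploits.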
The proposed hardware-software technique yields a mean improvement of 18% in application performance and a mean reduction of 26% in misses over an LRU-managed LLC for a set of input-annotated task-parallel programs using the OmpSs programming model implemented on the NANOS++ runtime. In contrast, the state-of-the-art thread-based partitioning scheme (proposed in the first part of this thesis) suffers an average performance loss of 2% and an average increase of 15% in misses over the same baseline.

Degree

Ph.D.

Advisors

Pai, Purdue University.

Subject Area

Computer Engineering | Computer Science
