Optimizing Memory Performance for Scientific and Mobile Applications

Malek R Musleh, Purdue University

Abstract

Microprocessor performance has been improving at roughly 60% per year. Memory access times, however, have improved by less than 10% per year. The resulting gap between logic and memory performance has forced microprocessor designs toward complex and power-hungry architectures that support out-of-order and speculative execution. Multi-cores have successfully delivered performance improvements over the past decade; however, they now face a problem: increased latency of read-write shared data communication. This problem is further exacerbated by the emergence of unstructured workloads that perform poorly under traditional invalidation-based cache-coherence protocols. Improved microprocessor performance has also accelerated the growth of mobile devices and tablets, making them the primary computing devices. These devices predominantly run Web applications whose qualitative and quantitative characteristics differ significantly from those of conventional scientific workloads. These qualitative differences contribute to heavy instruction and data interference in the unified last-level cache (ULLC). This observation, combined with the limited capacity of the ULLC, contributes to a significant performance bottleneck. In the first part of my thesis, I focus on developing an adaptive run-time scheme to answer the four key questions of read-write sharing: 1) what data is being shared, 2) who is sharing the data, 3) when the sharing occurs, and 4) how the data should be communicated. Although several related papers attempt to address this issue, many of them fall short of the desired objective because they only partially address the key concerns or target very specific sharing patterns, making them unsuitable for unstructured workloads. I present a holistic approach that effectively addresses these concerns across a diverse set of scientific workloads scalable to 32 cores.
In the second part of my thesis, I present a dynamic cache-partitioning scheme aimed at improving the efficiency of the ULLC. I show that conventional techniques, such as statically partitioning the cache, as well as techniques applied in other domains, perform poorly in the context of mobile platforms. I present the key insights specific to the cache inefficiency of mobile platforms that make previous techniques unsuitable, and show how the dynamic scheme uses these observations to decide how to prioritize between instruction and data references.

Degree

Ph.D.

Advisors

Pai, Purdue University.

Subject Area

Computer Engineering
