Processing-in-memory techniques for hiding memory access latency

Wessam M Hassanein, Purdue University

Abstract

As the gap between processor and memory speeds widens, program performance is increasingly dependent on memory access latency. Prefetching is a common technique to hide this latency and has traditionally been based on prediction. However, memory-bound applications have large data working sets and complex data access patterns that defy address prediction. Precomputation-based prefetching overcomes these drawbacks by pre-executing the code that generates complex irregular addresses. Precomputation can occur either on the processor side or on the memory side. Memory-side precomputation has the advantage of low memory access latency and avoids the increased fetch and execution resource contention typical of processor-side precomputation mechanisms.

This thesis presents the concepts and implementation of new techniques that use Processing-In-Memory (PIM) to hide memory access latency. We introduce a hybrid software/hardware approach for memory-side precomputation. The proposed design keeps hardware complexity low on the processor side while achieving high performance. We propose a novel compiler algorithm that selects precomputation slices for speculative pre-execution in memory. Our approach uses static slice selection to avoid the associated hardware cost. The algorithm is fully automated, allows fine-tuning of the constructed slices, and avoids the extensive profiling feedback for slice construction that is common in precomputation techniques. On the hardware side, we propose new approaches for dynamic slice filtering, slice prioritization based on Earliest Deadline First scheduling, and data forwarding from the memory side to the processor side. To evaluate the performance of the proposed techniques, we built a cycle-accurate PIM simulator of an aggressive out-of-order processor with accurate simulation of bus and memory contention.
The performance study, using the SPEC CPU2000 and Olden benchmarks, demonstrates the effectiveness of the proposed techniques in terms of speedup and reduction in average load access latency. The results show a speedup of up to 2.04 (1.32 on average) over an aggressive superscalar processor. The average load access latency decreases by up to 71% (37% on average).
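To illustrate the Earliest Deadline First idea behind the slice prioritization, the following is a minimal software sketch. It assumes a hypothetical model in which each pending precomputation slice carries a deadline (the estimated cycle at which the main processor will need the data the slice prefetches), and the memory-side engine always issues the slice with the nearest deadline. The `SliceScheduler` class, slice names, and cycle values are illustrative assumptions, not the thesis's actual hardware mechanism.

```python
import heapq

class SliceScheduler:
    """EDF-style prioritization of pending precomputation slices.

    Each slice is enqueued with a deadline cycle; next_slice() always
    returns the pending slice whose deadline is earliest.
    """

    def __init__(self):
        self._heap = []      # min-heap ordered by (deadline, arrival order)
        self._counter = 0    # tie-breaker so equal deadlines stay FIFO

    def enqueue(self, slice_id, deadline_cycle):
        heapq.heappush(self._heap, (deadline_cycle, self._counter, slice_id))
        self._counter += 1

    def next_slice(self):
        """Pop and return the slice with the earliest deadline, or None."""
        if not self._heap:
            return None
        _, _, slice_id = heapq.heappop(self._heap)
        return slice_id

# Usage: slices arriving out of deadline order are issued earliest-first.
sched = SliceScheduler()
sched.enqueue("slice_A", deadline_cycle=900)
sched.enqueue("slice_B", deadline_cycle=400)
sched.enqueue("slice_C", deadline_cycle=650)
order = [sched.next_slice(), sched.next_slice(), sched.next_slice()]
# order == ["slice_B", "slice_C", "slice_A"]
```

In hardware, the same policy would be realized with a small priority queue over in-flight slices rather than a software heap; the point of the sketch is only the ordering discipline.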

Degree

Ph.D.

Advisors

Eigenmann, Purdue University.

Subject Area

Electrical engineering|Computer science

