This paper presents a novel compiler algorithm for selecting program slices that prefetch load values concurrently with program execution. The algorithm is evaluated in the context of an intelligent memory system whose architecture consists of a main processor and a simple memory processor. The intelligent memory system pre-executes program slices and forwards the values of critical loads to the main processor ahead of their use. The compiler algorithm selects the program slices to run on the memory processor and inserts instructions that synchronize the main and memory processors. Experiments with the generated code on a cycle-accurate simulator show a speedup of up to 1.33 (1.13 on average) over an aggressively latency-optimized system running fully optimized code.