Software support for ordering memory operations in parallel systems

Xing Fang, Purdue University

Abstract

Parallel processing is essential to exploiting the potential of multi-core processors. Correct and efficient programming for parallel machines is a notoriously difficult job done well by only a few select, well-trained programmers. However, parallel platforms are becoming ubiquitous, requiring far more programs to be written by ordinary programmers. This motivates the implementation of new parallel programming paradigms that are efficient and easy to reason about and use.

Modern processors, when used as part of a shared memory system, implement relaxed memory models, that is, models in which loads and stores that do not reference the same memory location are allowed to execute in a different order than they appear in the program. Programming languages implement memory (or consistency) models that require memory references to be executed in order beyond what the processor's relaxed consistency model guarantees, i.e., they have a stricter memory model. An extreme example of a stricter memory model is the sequentially consistent memory model. A stricter model is thought by many to be easier to reason about than a relaxed model. Current processors provide fence instructions that allow these stricter orders to be enforced. We present a flow-based fence insertion algorithm for effectively enforcing the required orders. This algorithm is implemented in the Pensieve-Jikes compiler, and data showing its effectiveness are provided.

New architectures have been proposed that aim to support high-performance sequential consistency by committing groups of instructions (chunks) at one time. Aggressive compiler support is required to break programs into reasonably sized chunks at strategic places, attaining a high-performance sequentially consistent environment. In the second half of this dissertation we present in detail a compiler algorithm and implementation that performs full-program automatic formation of chunks for such a blocked architecture. We show, for the first time, that fully automatic techniques with no programmer intervention provide a sequentially consistent system with higher performance than conventional machines with relaxed memory models. For 8 full Java programs, we show that compiler-generated code running on a simulated 4-processor blocked architecture supporting sequential consistency runs on average 5% faster than code on a conventional architecture supporting the more relaxed Java memory model.
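As an illustration of the kind of ordering the fence insertion algorithm enforces, the following sketch (not taken from the dissertation; the class name and the use of Java 9's VarHandle.fullFence() are assumptions made here for exposition) shows a flag/data handoff in which a relaxed memory model may reorder the plain stores or loads, so a reader could observe ready == true while data is still 0. The explicit fences stand in for the machine fence instructions a compiler such as Pensieve would insert automatically to enforce the stricter order.

import java.lang.invoke.VarHandle;

public class FenceExample {
    static int data = 0;           // plain fields: compiler and hardware may reorder accesses
    static boolean ready = false;

    static void writer() {
        data = 42;
        VarHandle.fullFence();     // keep the data store ordered before the ready store
        ready = true;
    }

    static void reader() {
        if (ready) {
            VarHandle.fullFence(); // keep the ready load ordered before the data load
            System.out.println("observed data = " + data); // 42 is assured only because of the fences
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread w = new Thread(FenceExample::writer);
        Thread r = new Thread(FenceExample::reader);
        w.start(); r.start();
        w.join();  r.join();
    }
}

Absent the fences (or a declaration such as volatile, which the Java memory model uses to provide these guarantees at the language level), nothing prevents the write of ready from becoming visible before the write of data, which is exactly the class of reordering the dissertation's fence insertion and chunk-based approaches are designed to rule out.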

Degree

Ph.D.

Advisors

Midkiff, Purdue University.

Subject Area

Computer Engineering
