Abstract

Traditional monolithic superscalar architectures, which extract instruction-level parallelism (ILP) to achieve high performance, are not only becoming less effective in improving the clock speed and ILP but also worsening in design complexity and reliability across generations. Chip multiprocessors (CMPs), which exploit thread-level parallelism (TLP), are emerging as an alternative. In one form of TLP, the compiler/programmer extracts truly independent explicit threads from the program, and in another, the compiler/hardware partitions the program into speculatively independent implicit threads. However, explicit threading is hard to program manually and, if automated, is limited in performance due to serialization of unanalyzable program segments. Implicit threading, on the other hand, requires buffering of program state to handle misspeculations, and is limited in performance due to buffer overflow in large threads and dependences in small threads. We propose the Multiplex architecture to unify implicit and explicit threading by exploiting the similarities between the two schemes. Multiplex employs implicit threading to alleviate serialization in unanalyzable program segments, and explicit threading to remove buffering requirements and eliminate small threads in analyzable segments. We present hardware and compiler mechanisms for selection, dispatch, and data communication to unify explicit and implicit threads within a single application. We describe the Multiplex Unified Coherence and Speculative versioning (MUCS) protocol which provides unified support for coherence in explicit threads and speculative versioning in implicit threads of an application executing on multiple cores with private caches. On the ten SPECfp95 and three Perfect benchmarks, neither an implicitly-threaded nor explicitly-threaded architecture performs consistently better across the benchmarks, and for several benchmarks there is a large performance gap between the two architectures. Multiplex matches or outperforms the better of the two architectures for every benchmark and, on average, outperforms the better architecture by 16%.

Date of this Version

October 2000
