Microarchitecture for defect tolerance and resiliency

Ethan Schuchman, Purdue University

Abstract

Continued device scaling allows faster and more complex CPUs but comes at the cost of an increased likelihood of CPU failures. This thesis addresses this worsening problem at the architectural level, proposing and evaluating three microarchitectures designed to compensate for increasing failure rates. The first microarchitecture targets defects that are evident immediately after fabrication or arise during burn-in. Conventionally, CPUs with such defects are destroyed, reducing both yield and profitability. This thesis proposes a superscalar architecture that allows defects to be isolated to architectural components (within a single CPU core) that can be disabled, leaving a functionally correct CPU and increasing yield. The second microarchitecture targets failures that arise in the field. In this microarchitecture I augment current Simultaneous Multi-Threading (SMT) hardware to redundantly execute instructions on different microarchitectural structures within the same CPU core. This microarchitecture thus monitors constantly for failures, allowing defects to be detected as soon as they arise. The third microarchitecture also targets failures that arise in the field but trades off some detection latency to significantly reduce the energy and performance cost of redundancy-based detection. In this microarchitecture I propose area-efficient architectural support that allows high-coverage testing phases to be transparently interleaved with computation in the field.

Degree

Ph.D.

Advisors

Vijaykumar, Purdue University.

Subject Area

Computer science
