Hardware and software mechanisms for multithreading in uniprocessors and heterogeneous multiprocessors
Abstract
This thesis proposes, develops, and evaluates hardware and software mechanisms that enhance the efficiency and performance of multithreading in uniprocessors and in heterogeneous multiprocessors. Hardware synchronization mechanisms are shown via simulation to provide a performance improvement of between 0% and 400%, depending on the workload and the synchronization frequency, for a constrained simulation model. Further results show that available parallelism decreases as synchronization overhead increases. In addition, a VHDL implementation of the functionality required to support hardware synchronization is discussed. Novel context-switch criteria are proposed that allow processors to better tolerate memory-access latency. While this advantage has been discussed previously for multithreaded processors, this work is the first to examine the performance of context-switch criteria based on architectural features that themselves provide some latency tolerance, such as out-of-order dispatch and lockup-free caches. Results from a detailed multiprocessor simulator show a performance improvement of up to 35% over no multithreading, although many of the criteria examined result in a performance decrease. “Virtual Processors” are shown to provide a performance advantage for applications that have been parallelized for a homogeneous multiprocessor when they execute on a heterogeneous multiprocessor. Three additional modifications are also discussed that provide ease-of-use or ease-of-design advantages: more efficient interrupt support, starting and stopping threads, and entry to the operating system. The performance improvement is measured using several scientific programs (from the SPLASH-2 benchmark suite) and one commercial program (C4.5, a decision tree induction application).
This work is the first to use C4.5 as a benchmark application; thus, this thesis presents the first complete characterization of the memory-hierarchy behavior of C4.5, presents the first parallelization of decision tree induction optimized for a ccNUMA architecture, characterizes the parallel version, and examines decision tree induction as a possible benchmark application.
Degree
Ph.D.
Advisors
Fortes, Purdue University.
Subject Area
Electrical engineering