Date of Award


Degree Type


Degree Name

Master of Science (MS)


Computer and Information Technology

First Advisor

Thomas J. Hacker

Committee Chair

Thomas J. Hacker

Committee Member 1

Phillip T. Rawles

Committee Member 2

Julia Taylor


High performance computing clusters provide an efficient and cost-effective way to tackle large and complex problems. These clusters draw their computing power from widely available and relatively inexpensive commodity hardware. However, commodity hardware is prone to frequent failures, which can cause the processes executing on the affected components to fail. Hence, high performance clusters often suffer from poor reliability. Each failure generates additional costs, which increase the overall cost of running the cluster. To prevent processes from failing, proactive fault tolerance strategies may be used in these cluster systems. The scheduler is an appropriate place to apply such strategies, since it decides where each job runs and can therefore help prevent job failures.

In this thesis, we have implemented an approach that incorporates reliability awareness into the scheduler. The reliability of each resource in the cluster is estimated from historic system logs. The scheduler decides where to place each job based on the job's reliability need and the predicted reliability of the computing nodes. The reliability need is calculated from the characteristics of the job; typically, large and complex jobs have a high reliability need. The scheduler assigns jobs with a high reliability need to resources that can provide an adequate level of reliability, and steers them away from resources with low reliability. The lower-reliability resources are allocated to jobs with a low reliability need. By matching jobs to resources on these reliability characteristics, large and complex jobs are statistically less likely to fail than under a typical node assignment strategy. Hence, this approach can reduce the costs associated with failures and improve the overall reliability of the system.
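
The matching idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the exponential failure model, the use of node-hours as a proxy for a job's reliability need, the greedy ordering, and all names (`Node`, `Job`, `assign`) are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failures: int          # failures counted in the historic log window
    observed_hours: float  # length of that log window, in hours

    def failure_rate(self) -> float:
        # Failures per hour, estimated from the historic log (assumption).
        return self.failures / self.observed_hours

    def reliability(self, runtime_hours: float) -> float:
        # Assumed exponential model: probability of no failure during the run.
        return math.exp(-self.failure_rate() * runtime_hours)


@dataclass
class Job:
    name: str
    nodes_needed: int
    runtime_hours: float

    def reliability_need(self) -> float:
        # Hypothetical proxy: larger, longer jobs need more reliable nodes.
        return self.nodes_needed * self.runtime_hours


def assign(jobs, nodes):
    """Greedy sketch: the neediest jobs take the most reliable free nodes."""
    free = sorted(nodes, key=lambda n: n.failure_rate())  # most reliable first
    placement = {}
    for job in sorted(jobs, key=lambda j: j.reliability_need(), reverse=True):
        if job.nodes_needed > len(free):
            placement[job.name] = []  # not enough free nodes; job would wait
            continue
        placement[job.name] = [n.name for n in free[:job.nodes_needed]]
        free = free[job.nodes_needed:]
    return placement


# Hypothetical cluster and workload, for illustration only.
nodes = [
    Node("n3", failures=20, observed_hours=1000.0),
    Node("n1", failures=1, observed_hours=1000.0),
    Node("n4", failures=50, observed_hours=1000.0),
    Node("n2", failures=5, observed_hours=1000.0),
]
jobs = [
    Job("small", nodes_needed=1, runtime_hours=1.0),
    Job("big", nodes_needed=2, runtime_hours=48.0),
]
placement = assign(jobs, nodes)
print(placement)
```

In this sketch the large job is placed on the two nodes with the lowest historic failure rates, and the small job takes the next best node that remains. A production scheduler would also have to account for queueing, priorities, and changing node state, none of which is modeled here.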