Date of Award

Spring 2014

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer and Information Technology

First Advisor

Thomas J. Hacker

Committee Member 1

Eric T. Matson

Committee Member 2

John A. Springer

Committee Member 3

Tomasz W. Wlodarczyk

Abstract

In high performance computing systems, parallel applications request large numbers of resources for long periods of time. If a resource fails while an application is running, every application using that resource fails with it. The probability of application failure is therefore tied to the inherent reliability of the resources the application uses. Our investigation of high performance computing systems operating in the field revealed significant differences in the measured operational reliability of individual computing nodes. By making the scheduler aware of the reliability of individual nodes, along with the predicted reliability needs of parallel applications, the most reliable resources can be matched with the most demanding applications to reduce the probability of application failure arising from resource failure. This thesis describes a new approach to resource allocation that can enhance the reliability, and reduce the cost of failures, of large-scale parallel applications running on high performance computing systems. The approach is based on a multi-class Erlang loss system, which makes it possible to partition system resources by predicted resource reliability and to size each partition so that the probability of blocking requests to that partition is bounded, while simultaneously improving the reliability of the most demanding parallel applications running on the system. Using this model, the partition mean time to failure (MTTF) is maximized, and the probability that a scheduling system's resource requests to each partition are blocked can be controlled. The technique can also be used to size the system so that it services peak loads with a bounded probability of blocking resource requests. This approach would be useful for high performance computing system operators seeking to improve the reliability, efficiency, and cost-effectiveness of their systems.
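
The idea of sizing a partition to bound its blocking probability can be illustrated with the classical single-class Erlang B formula. The sketch below is a simplification, not the multi-class model developed in the thesis: it assumes Poisson request arrivals, each request occupying exactly one node, and an offered load expressed in Erlangs (arrival rate times mean holding time). The function names `erlang_b` and `min_nodes_for_blocking` are hypothetical and chosen only for this illustration.

```python
def erlang_b(nodes: int, offered_load: float) -> float:
    """Blocking probability of an M/M/c/c loss system (Erlang B).

    Assumption for illustration: one node per request, single class.
    Uses the standard recursion B(0) = 1,
    B(n) = a * B(n-1) / (n + a * B(n-1)), which avoids large factorials.
    """
    if nodes < 0 or offered_load < 0:
        raise ValueError("nodes and offered_load must be non-negative")
    blocking = 1.0
    for n in range(1, nodes + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def min_nodes_for_blocking(offered_load: float, max_blocking: float) -> int:
    """Smallest partition size whose blocking probability is <= max_blocking."""
    if not 0.0 < max_blocking < 1.0:
        raise ValueError("max_blocking must be strictly between 0 and 1")
    nodes = 0
    # Erlang B decreases monotonically in the number of nodes,
    # so a linear search terminates at the minimum size.
    while erlang_b(nodes, offered_load) > max_blocking:
        nodes += 1
    return nodes


if __name__ == "__main__":
    # Hypothetical partition loads in Erlangs; not data from the thesis.
    for name, load in [("high-reliability", 40.0), ("standard", 120.0)]:
        size = min_nodes_for_blocking(load, max_blocking=0.01)
        print(f"{name}: {size} nodes, blocking = {erlang_b(size, load):.4f}")
```

In a multi-class setting, where requests ask for different numbers of nodes, the same sizing loop would need a multi-class blocking calculation in place of `erlang_b`.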
