Date of Award

12-2016

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer and Information Technology

First Advisor

Thomas J. Hacker

Committee Chair

Thomas J. Hacker

Committee Member 1

Eric T. Matson

Committee Member 2

John A. Springer

Abstract

High performance computing (HPC) systems frequently suffer hardware errors and failures that degrade the performance of the jobs running on them. We analyzed system logs from two HPC systems at Purdue University and built statistical models of memory and hard disk errors. We also created a small-scale error injection testbed, built on a customized QEMU build, libvirt, and Python, that lets HPC application programmers test and debug their programs in a faulty environment and so write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models that drive the injection.
