Date of Award
12-2016
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer and Information Technology
First Advisor
Thomas J. Hacker
Committee Chair
Thomas J. Hacker
Committee Member 1
Eric T. Matson
Committee Member 2
John A. Springer
Abstract
High-performance computing (HPC) systems frequently suffer hardware errors and failures that degrade the performance of the jobs they run. We analyzed system logs from two HPC systems at Purdue University and built statistical models of memory and hard disk errors. Using a customized QEMU build, libvirt, and Python, we created a small-scale error injection testbed in which HPC application programmers can test and debug their programs in a faulty environment, so that they can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models that drive the injection.
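As a rough illustration of how a statistical model might drive injection, the sketch below draws error inter-arrival times from a distribution and builds an injection schedule. It is not the thesis's implementation: the abstract does not say which distributions were fitted, so the Weibull model and its parameters are assumptions, and the names `sample_injection_times`, `run_schedule`, and the `inject` callback are hypothetical. In the actual testbed the injection step would go through the modified QEMU and libvirt, which is not shown here.

```python
import random
from typing import Callable, List


def sample_injection_times(scale_s: float, shape: float,
                           horizon_s: float, rng: random.Random) -> List[float]:
    """Return injection times (in seconds) within [0, horizon_s].

    Inter-arrival gaps are drawn from a Weibull distribution. The choice of
    Weibull and the parameter values are assumptions for illustration only.
    """
    times: List[float] = []
    t = 0.0
    while True:
        # random.weibullvariate(alpha, beta): alpha is scale, beta is shape.
        t += rng.weibullvariate(scale_s, shape)
        if t > horizon_s:
            break
        times.append(t)
    return times


def run_schedule(times: List[float], inject: Callable[[float], None]) -> int:
    """Call the (hypothetical) inject callback once per scheduled time.

    A real driver would wait until each time arrives and then trigger a
    fault in the guest; here the callback just receives the timestamp.
    """
    for t in times:
        inject(t)
    return len(times)


# Example: one simulated hour with assumed parameters.
rng = random.Random(42)  # fixed seed so the schedule is reproducible
schedule = sample_injection_times(scale_s=600.0, shape=0.7,
                                  horizon_s=3600.0, rng=rng)
injected: List[float] = []
count = run_schedule(schedule, injected.append)
print(f"scheduled {count} injections in one simulated hour")
```

Keeping the sampling step separate from the injection callback means the same schedule can be replayed against a dry-run logger or a real fault injector without changing the model code.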
Recommended Citation
St. John, Jason R., "A small-scale testbed for large-scale reliable computing" (2016). Open Access Theses. 899.
https://docs.lib.purdue.edu/open_access_theses/899