A small-scale testbed for large-scale reliable computing

Jason R St. John, Purdue University


High-performance computing (HPC) systems frequently suffer hardware errors and failures that degrade the performance of the jobs run on them. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. Using a customized QEMU build, libvirt, and Python, we created a small-scale error injection testbed in which HPC application programmers can test and debug their programs in a faulty environment, so that they can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used to drive the injection.




Thomas J. Hacker, Purdue University.

Subject Area

Information technology|Computer science
