Romero, Raul F. M.S., Purdue University, August, 2010. Live Migration of Parallel Applications. Major Professor: Thomas J. Hacker.
It has been observed in engineering and scientific data centers that the absence of a clear separation between software and hardware can severely affect parallel applications. Applications that run across several nodes are especially vulnerable, because a single computational failure on one node often causes the entire application to produce incorrect results or even to terminate. Addressing this low observed reliability requires a combination of proactive and reactive measures that preserve the state of parallel jobs running on degraded nodes, so that runtime errors in parallel applications can be avoided.
This thesis addressed the critical problem of low reliability in parallel jobs by implementing a fault tolerance approach based on OpenVZ virtualization. By running parallel applications inside virtual machines, this study showed that it was feasible to make parallel jobs independent of any particular hardware/software implementation; when a degraded node is detected, the virtual machines running on it can therefore be migrated, together with their parallel jobs, to a healthier node. This study examined the correctness and performance of live migration on hosts loaded with parallel jobs, and determined that the state of parallel applications can be preserved efficiently when their virtual machines are live-migrated to a more reliable node.
Virtualization, Parallel Applications, MPI, OpenVZ, MPICH2, LAM/MPI, OMEN, HPL
Computer & Information Technology
Jeffrey L. Brewer
Master of Science