Abstract

Romero, Raul F. M.S., Purdue University, August, 2010. Live Migration of Parallel Applications. Major Professor: Thomas J. Hacker.

It has been observed in engineering and scientific data centers that the absence of a clear separation between software and hardware can severely affect parallel applications. Applications that run across several nodes are especially vulnerable because a single computational failure on one node often causes the entire application to produce incorrect results or even to terminate. This low observed reliability calls for a combination of proactive and reactive measures that preserve the state of parallel jobs running on degraded nodes, so that runtime errors in parallel applications can be avoided.

This thesis addressed the critical problem of low reliability in parallel jobs by implementing a fault tolerance approach based on OpenVZ virtualization. By running parallel applications inside virtual machines, this study showed that it was feasible to make parallel jobs independent of any particular hardware/software implementation; when a degraded node is detected, the virtual machines running on that node can therefore be migrated, together with their parallel jobs, to a healthier node. This study examined the correctness and performance of live migration on hosts loaded with parallel jobs, and determined that it is possible to efficiently preserve the state of parallel applications when their virtual machines are live-migrated to a more reliable node.

Keywords

Virtualization, Parallel Applications, MPI, OpenVZ, MPICH2, LAM/MPI, OMEN, HPL

Date of this Version

7-14-2010

Department

Computer & Information Technology

Department Head

Jeffrey L. Brewer

Month of Graduation

August

Year of Graduation

2010

Degree

Master of Science

Head of Graduate Program

Gary Bertoline

Advisor 1 or Chair of Committee

Thomas Hacker

Committee Member 1

John Springer

Committee Member 2

Eric Matson
