Failure-aware checkpointing in fine-grained cycle sharing systems


Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amountof idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the local users of the host. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply the knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present the schemes of selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems to optimize the network overhead of transferring checkpoints and the work lost due to unavailability of a storage host when needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately and the cost of delaying that to a later time, which is a function of the resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods which use checkpointing with a fixed periodicity on dedicated checkpoint servers.


checkpointing, cycle-sharing, fault tolerance, performance reliability

Date of this Version



HPDC '07 Proceedings of the 16th international symposium on High performance distributed computing