On increasing reliability and availability in distributed database systems

Shy-Renn Sharon Lian, Purdue University

Abstract

This thesis proposes several mechanisms to deal with recovery and data availability issues in distributed systems. Checkpointing in a distributed system is essential for recovery to a globally consistent state after failure. We present a checkpointing/rollback recovery algorithm in which each process takes checkpoints independently. In our approach a number of checkpointing processes, a number of rollback processes, and computations on operational processes can proceed concurrently while tolerating the failure of an arbitrary number of processes. During recovery after a failure, a process invokes a two phase rollback algorithm. In the first phase, it collects information about relevant message exchanges in the system and uses it in the second phase to determine both the set of processes that must roll back and the set of checkpoints up to which rollback must occur. We have implemented the checkpointing and rollback recovery algorithm and evaluated its performance in a real processing environment. The evaluation measures the overhead due to time spent in executing the algorithm and the cost in terms of computational time and message traffic. We identify the components that make up the execution time of the algorithm and study how each of them contributes to the total execution time. A typed token scheme for managing replicated data in distributed database systems is proposed in this thesis. Compared to previous schemes, for each object, a set of tokens is used. Each token represents a specific capability for the allowable operations on the object. By distributing tokens to different physical copies of the object, the object can be made available for different operations in various partitions of the network during a failure. Two types of replication for each of these tokens are proposed. One is based on the semantics of operations and the other is based on the semantics of the object. When failures are anticipated, tokens can be redistributed to maintain high availability. We present a merge recovery scheme for efficient recovery of database consistency from dynamic partitioning and merge. In the merge recovery scheme, information needed for recovery is organized in a partition tree so that missing updates can be efficiently carried out when partition merge. The merge recovery scheme is used to support the typed token scheme for efficient partition merge. It can also be used to extend some other replica control protocol.

Degree

Ph.D.

Advisors

Bhargava, Purdue University.

Subject Area

Computer science

Off-Campus Purdue Users:
To access this dissertation, please log in to our
proxy server
.

Share

COinS