Experimental analysis of replication in distributed systems

Abdelsalam Ali Helal, Purdue University

Abstract

The main objective of replication in distributed database systems is to increase data availability. However, the overhead associated with replication may impair the performance of transaction processing. Moreover, in the presence of changing failure and transaction characteristics, static replication schemes are so restrictive that they may actually decrease availability. The purpose of this research is to show how adaptability and data reconfiguration can be used in conjunction with static replication schemes to achieve and maintain higher levels of availability. The basis of this research is an integrated study of the availability and performance of replication methods. The availability analysis is performed through an analytical model that encompasses transaction and database parameters, site and communication link failures, and the parameters of the replication methods. The performance evaluation is conducted on the second version of the RAID distributed database system developed at Purdue, which includes off-line replication management, a stand-alone replication control server, a quorum-based interface to a library of replication methods, quorum selection heuristics, a surveillance facility, and a dynamic data reconfiguration protocol. Using the availability model and the RAID system together, a series of experiments is conducted. We study static replication schemes and develop local policies for their efficient use. We show how partial replication and the deferred write approach can be used with the read-one-write-all method to increase its availability and reduce its message traffic and computation overhead. We show how we choose the weights and thresholds for the quorum consensus method, and how we reduce its message traffic overhead by partial replication. We also present and compare three quorum selection heuristics that are used by the replication controller quorum interface when multiple quorums are available for an object.
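As a rough illustration of what choosing weights and thresholds involves, the sketch below checks the standard weighted voting conditions for quorum consensus (read and write quorums must intersect, and so must any two write quorums) and tests whether a set of reachable sites can form a quorum. The function names, site names, and weights are hypothetical and are not taken from RAID; this is not one of the three quorum selection heuristics mentioned above.

```python
def valid_thresholds(weights, read_threshold, write_threshold):
    """Check the usual quorum consensus intersection conditions.

    weights: dict mapping site name -> number of votes held by its copy.
    """
    total = sum(weights.values())
    # Every read quorum must overlap every write quorum,
    # and any two write quorums must overlap each other.
    return (read_threshold + write_threshold > total
            and 2 * write_threshold > total)


def has_quorum(weights, up_sites, threshold):
    """True if the reachable sites together hold at least `threshold` votes."""
    return sum(weights[s] for s in up_sites if s in weights) >= threshold


# Hypothetical three-site configuration (illustrative values only).
weights = {"A": 2, "B": 1, "C": 1}
read_threshold, write_threshold = 2, 3

print(valid_thresholds(weights, read_threshold, write_threshold))
print(has_quorum(weights, {"A"}, read_threshold))
print(has_quorum(weights, {"A"}, write_threshold))
```

Under these assumptions, read-one-write-all corresponds to the special case in which every copy has weight 1, the read threshold is 1, and the write threshold equals the number of copies.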
After developing local policies for static replication schemes, we proceed to study how to adapt the use of these schemes to perturbations in parameters such as the transaction read/write mix and the reliabilities of sites and communication links. We define the practical degree of replication as the least degree of replication at which availability and response time are within 10% of their optimum values. We then examine the effect of the transaction update percentage and RAID site reliabilities on the practical degree of replication. Finally, we examine the impact of surveillance on the performance of transaction processing during failures. We show that the overhead of surveillance can always be expressed as a fixed additional cost in MIPS, whereas its benefits lie in maintaining a response time during failures that is comparable to the response time before the failure. We also show that surveillance exhibits a throughput anomaly in which the number of committed transactions drops for short periods of failure.
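The definition of the practical degree of replication can be made concrete with a small sketch. The function name, the input format, and the sample numbers are hypothetical. The sketch assumes that "optimum" means the highest availability and the lowest response time among the measured degrees, and that "within 10%" is taken relative to those optima.

```python
def practical_degree(measurements, tolerance=0.10):
    """Return the least degree of replication whose availability and
    response time are both within `tolerance` of their optimum values.

    measurements: dict mapping degree -> (availability, response_time).
    Returns None if no single degree satisfies both conditions.
    """
    if not measurements:
        raise ValueError("no measurements given")
    best_avail = max(a for a, _ in measurements.values())
    best_resp = min(r for _, r in measurements.values())
    for degree in sorted(measurements):
        avail, resp = measurements[degree]
        if (avail >= best_avail * (1 - tolerance)
                and resp <= best_resp * (1 + tolerance)):
            return degree
    return None


# Made-up numbers for illustration only (not results from the dissertation).
sample = {
    1: (0.80, 100.0),
    2: (0.93, 104.0),
    3: (0.97, 108.0),
    4: (0.98, 125.0),
}
print(practical_degree(sample))  # prints 2
```

With these sample values, two copies already bring availability and response time within 10% of the best values observed, so adding further copies gives little additional benefit.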

Degree

Ph.D.

Advisors

Bhargava, Purdue University.

Subject Area

Computer science
