Non-intrusive detection and diagnosis of failures in high throughput distributed systems

Gunjan Khanna, Purdue University

Abstract

Distributed systems form an integral part of human life, from ATMs to the Domain Name Service. Typical distributed systems consist of distributed services interacting through messages. Failures in these systems often cause large financial losses or human catastrophes. Efficient detection and diagnosis of cascaded, non-fail-silent failures is extremely challenging because of legacy code, the black-box nature of application entities, scalability requirements, and state space explosion. Current error detection and diagnosis protocols suffer from one or more of the following problems: they are specific to one application, require intrusive changes to the application, do not scale, impose additional load on the application, or work offline and cannot detect (or diagnose) failures at runtime. In this thesis, we propose the Monitor, a scalable, autonomous fault detection and diagnosis framework. The Monitor observes only the external messages exchanged between the components of the application and is unaware of any internal state transitions of the application entities. The Monitor uses a rule base of allowable behavior and performs fast matching of incoming messages against it. We propose a sampling approach that adjusts the sampling rate according to the incoming packet rate so that the Monitor's capacity is not exceeded. We use a distributed deployment of Monitors across the Purdue WAN to demonstrate the framework's effectiveness. We compare the Monitor's performance in diagnosing faults with that of Pinpoint, the state-of-the-art diagnosis approach.
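The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes the Monitor's capacity is a known number of messages per second and that the sampling rate is simply the ratio of capacity to incoming rate. The class name `AdaptiveSampler` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSampler:
    """Hypothetical sampler that keeps the inspected rate at or below `capacity`."""

    capacity: float       # messages/second the monitor can match (assumed known)
    _credit: float = 0.0  # fractional carry-over between messages

    def sampling_rate(self, incoming_rate: float) -> float:
        """Fraction of messages to inspect at the given incoming rate."""
        if incoming_rate <= self.capacity:
            return 1.0  # below capacity: inspect every message
        return self.capacity / incoming_rate

    def should_inspect(self, incoming_rate: float) -> bool:
        """Deterministic thinning: inspect a message each time credit reaches 1."""
        self._credit += self.sampling_rate(incoming_rate)
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False


if __name__ == "__main__":
    sampler = AdaptiveSampler(capacity=100.0)
    # At four times the assumed capacity, one message in four is inspected.
    inspected = sum(sampler.should_inspect(400.0) for _ in range(1000))
    print(f"inspected {inspected} of 1000 messages")
```

A deterministic credit counter is used here instead of random sampling only to make the behavior easy to reason about. A real monitor would also need to estimate the incoming rate online, which this sketch takes as an input.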

Degree

Ph.D.

Advisors

Bagchi, Purdue University.

Subject Area

Electrical engineering
