Department of Electrical and Computer Engineering Technical Reports

Probabilistic Diagnosis through Non-Intrusive Monitoring in Distributed Applications

Gunjan Khanna
Mike Yu Cheng
Jagadeesh Dyaberi
Saurabh Bagchi
Miguel P. Correia, University of Lisbon, Portugal
Paulo Vérissimo, University of Lisbon, Portugal

Abstract

required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging because errors can propagate quickly in high-throughput distributed applications. The diagnosis often needs to be probabilistic because of imperfect observability of the payload system, the inability to do white-box testing, constraints on the amount of state that can be maintained at the diagnostic process, and imperfect tests used to verify the system. In this paper, we extend an existing Monitor architecture for probabilistic diagnosis of failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal and aggregate graph between the PEs based on their communication and uses this graph, together with a rule base, to diagnose the failure. The Monitor computes, for each suspected PE, the probability that the error originated in that PE and propagated to the failure detection site. The framework is applied to a test-bed consisting of a reliable multicast protocol executing on the Purdue campus-wide network. Error injection experiments are performed to evaluate the accuracy and the performance overhead of the diagnostic process.
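To make the idea in the abstract concrete, the following is a minimal sketch and not the report's actual algorithm. It assumes the observed traffic is available as (sender, receiver) pairs, ignores message timing and the rule base, and uses two hypothetical inputs that do not appear in the abstract: a per-PE prior `p_error` and a single per-hop propagation probability `p_propagate`. The function names `build_causal_graph` and `rank_suspects` are likewise illustrative.

```python
from collections import defaultdict


def build_causal_graph(messages):
    """Map each receiver to the set of PEs that sent it a message.

    `messages` is an iterable of (sender, receiver) pairs, as an external
    monitor might record them. Timing is ignored (a simplification).
    """
    incoming = defaultdict(set)
    for sender, receiver in messages:
        incoming[receiver].add(sender)
    return incoming


def rank_suspects(incoming, failed_pe, p_error, p_propagate, max_depth=3):
    """Return a normalized score per suspected PE.

    Score = (hypothetical) prior that the error originated at the PE
    times the best-path probability that it propagated to `failed_pe`.
    """
    best_reach = {failed_pe: 1.0}
    frontier = [(failed_pe, 1.0, 0)]
    while frontier:
        node, reach, depth = frontier.pop()
        if depth == max_depth:
            continue  # bound the search, mirroring limited monitor state
        for sender in incoming.get(node, ()):
            candidate = reach * p_propagate
            if candidate > best_reach.get(sender, 0.0):
                best_reach[sender] = candidate
                frontier.append((sender, candidate, depth + 1))

    scores = {pe: p_error.get(pe, 0.0) * r for pe, r in best_reach.items()}
    total = sum(scores.values())
    return {pe: s / total for pe, s in scores.items()} if total else scores


# Example: A -> B -> C and D -> C, with the failure detected at C.
observed = [("A", "B"), ("B", "C"), ("D", "C")]
graph = build_causal_graph(observed)
priors = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}
ranking = rank_suspects(graph, "C", priors, p_propagate=0.5)
```

With equal priors, PEs closer to the detection site score higher in this sketch, because each additional hop multiplies in another propagation probability.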

Keywords

Distributed system diagnosis, runtime monitoring, reliable multicast protocol, probabilistic diagnosis, error injection based evaluation

Date of this Version

December 2005

