When a dependability outage occurs in a distributed internet infrastructure, detecting the failure is often not enough; it must also be diagnosed, i.e., its source must be identified. Complex applications deployed in multi-tier environments, such as the classic three-tier e-commerce system, make diagnosis challenging because of fast error propagation, black-box applications, constraints on the diagnosis delay, limits on the amount of state that can be maintained, and imperfect diagnostic tests. Here, we propose a probabilistic diagnosis model for arbitrary failures in components of a distributed application. The monitoring system (the Monitor) passively observes the message exchanges between the components and, at runtime, performs a probabilistic diagnosis of the component that was the root cause of a detected failure. The diagnosis model takes into account the possibility of service failures, link failures, test imperfection, and imperfect observability at the monitoring station. We demonstrate the approach by applying it to a J2EE-based e-commerce application called Pet Store, exercised with a workload of browse-and-buy user transactions. We compare our approach with Pinpoint by quantifying the latency and accuracy of the two systems. The Monitor outperforms Pinpoint, achieving comparably accurate diagnosis with higher precision in a shorter time.
Distributed system diagnosis, runtime monitoring, probabilistic diagnosis, fault injection based