Distributed systems are now a necessity for almost all large enterprises. A bug that causes failures and crashes across such a large infrastructure is a programmer's nightmare. With ever-increasing code sizes and processing needs, programmers need a tool that helps them identify the potential causes of a bug and reduces debugging time, so that the bug can be fixed quickly. We present our solution, Orion+, which compares system metrics at several layers, namely hardware, OS, middleware, and application. It then uses the association information provided by the stack traces of normal and abnormal runs to narrow the specified buggy code region down to a particular sequence of function calls that contains the bug or is most affected by it. We benchmarked Orion+ against known, already-fixed bugs in open source software and found that it provides root cause analysis for all of the benchmark bugs.


Problem diagnosis, Distributed software, Metric mining, Root cause analysis

Date of this Version