Abstract. Today’s distributed systems need runtime error detection to catch errors arising from software bugs, hardware errors, or unexpected operating conditions. A prominent class of error detection techniques operates in a stateful manner, i.e., it keeps track of the state of the application being monitored and then matches state-based rules. Large-scale distributed applications generate a high volume of messages that can overwhelm the capacity of a stateful detection system. An existing approach to handling this is to randomly sample the messages and process only the sampled subset. However, this approach leads to non-determinism in the detection system’s view of what state the application is in, which in turn degrades the quality of detection. We present an intelligent sampling and Hidden Markov Model (HMM)-based technique to select the messages and states that the detection system processes such that the non-determinism is minimized. We also present a mechanism for selecting which computationally intensive rules to match, based on the likelihood of detecting an error if a rule is completely matched. We demonstrate the techniques in a detection system called Monitor, applied to a J2EE distributed online banking application. We empirically evaluate the performance of Monitor under different load conditions and compare it to a previous system called Pinpoint.

Date of this Version