Reliable identification of significant sets of episodes in event sequences
Abstract
In this thesis we present a solution to the problem of identification of significant sets of episodes in event sequences. In order to determine the significance of an episode in a monitored event sequence, we compare its observed frequency to its frequency in a reference sequence. The reference sequence in our work is represented by a variable-length Markov model of generating symbols in the reference sequence. An episode is significant if the probability that it would have a given frequency by chance, in the reference sequence, is very small. In order to identify significant episodes we first show how to select the sliding window size to ensure that a discovered episode is meaningful and then we show how to compute a lower threshold for under-represented and an upper threshold for overrepresented significant episodes. The frequency of occurrence alone is not enough to determine significance, i.e., an infrequent episode can be more significant than a frequent one, and the significance depends on the structure of the episode and on probabilistic characteristics of the reference and monitored event streams. As an extension, we propose a novel method for providing approximate answers, with probabilistic guarantees, to a class of ad hoc sliding window queries referencing past data in data streams. The queries in that class compute the frequency of past windows that satisfy given join conditions among tuples in a window comprising multiple streams. To represent the join conditions consisting of intra-stream and inter-stream constraints between tuples in the window we introduce a concept of a 2D-episode.
Degree
Ph.D.
Advisors
Atallah, Purdue University.
Subject Area
Computer science
Off-Campus Purdue Users:
To access this dissertation, please log in to our
proxy server.