Global detectors of unusual words: Design, implementation, and applications to pattern discovery in biosequences
The enormous growth of biomolecular databases makes it increasingly important to have fast and automatic methods to process, analyze and understand such massive amounts of data. In the domain of sequence analysis, a prominent role is played by a family of methods which are designed to discover unusual patterns. Unusually frequent or rare words are implicated in various facets of biological function and structure. With sequence data becoming massively available, tasks akin to an exhaustive enumeration and testing of word frequencies in a whole genome are becoming increasingly appealing and yet pose significant computational burdens even when limited to words of bounded maximum length. In addition, the display of the huge tables possibly resulting from these counts poses significant problems of visualization and inference.

In this thesis, we show efficient and practical algorithms for the problem of detecting words that are, by some measure, over- or under-represented in the context of larger sequences. The design is based on subtly interwoven properties of statistics, pattern matching and combinatorics on words. These properties enable us to limit drastically and a priori the set of over- or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and to visualize such words in a fast and practically useful way. We also demonstrate that such anomaly detectors can be used successfully to discover exact patterns in biological sequences, by reporting results of a software tool, called VERBUMCULUS, on simulated data and test cases of practical interest.
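To make the notion of an over- or under-represented word concrete, the following is a minimal sketch of the naive approach for a single fixed word length, not the algorithms developed in the thesis. It assumes an i.i.d. (Bernoulli) model for the symbols, estimates symbol probabilities from the sequence itself, and uses a binomial approximation of the variance that ignores the self-overlap of words. The function name `word_scores` and its parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def word_scores(seq, k):
    """Return a z-like score for every length-k word that occurs in seq.

    Assumptions (illustrative only): symbols are i.i.d. with probabilities
    estimated from seq, and the variance is the binomial approximation,
    which ignores overlaps between occurrences of the same word.
    """
    n = len(seq)
    positions = n - k + 1          # number of starting positions for a k-mer
    if positions <= 0:
        return {}

    # Empirical symbol probabilities.
    symbol_counts = Counter(seq)
    probs = {c: symbol_counts[c] / n for c in symbol_counts}

    # Observed count of each k-mer, found by sliding a window over seq.
    observed = Counter(seq[i:i + k] for i in range(positions))

    scores = {}
    for word, obs in observed.items():
        p = 1.0
        for c in word:
            p *= probs[c]          # probability of the word under the i.i.d. model
        expected = positions * p
        variance = positions * p * (1.0 - p)
        # Positive score: more frequent than expected; negative: rarer.
        scores[word] = (obs - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores


if __name__ == "__main__":
    scores = word_scores("ATATATATATAT", 2)
    for word, z in sorted(scores.items(), key=lambda item: -item[1]):
        print(word, round(z, 3))
```

This sketch scores only words that actually occur, so words that are entirely absent from the sequence are never reported, and it must be rerun for each word length. Repeating such a scan for all lengths over a whole genome is the kind of exhaustive enumeration whose cost motivates limiting the set of candidate words a priori.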
Major Professor: A. Apostolico, Purdue University.
Biology, Genetics|Computer Science