A method for extracting highlights in document corpora
Abstract
This work introduces the concept of "highlights": recurrent themes or topics in the components of a corpus. We propose a computationally efficient method for extracting these highlights from corpora of text documents, where the text can comprise any alphanumeric strings, such as words (from any language) or source code. The method is based on a low-cost, corpus-dependent topic extraction procedure that yields a sparse representation of each document in the topic space. The topics that occur abnormally often are then identified as the highlights of the corpus. This highlighting procedure can be repeated within subsets of the corpus (e.g., the documents associated with a given highlight) to analyze the corpus further in a hierarchical fashion. We demonstrate the method on different corpora and discuss the results.
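The highlighting step described above can be illustrated with a minimal sketch. The abstract does not state how "abnormally often" is measured, so the z-score test on per-topic document counts used here is an assumption, as are the names `extract_highlights`, `documents_with`, and the `z_threshold` parameter. The topic extraction procedure itself is not reproduced: the sketch assumes each document has already been reduced to a sparse set of topic labels.

```python
from collections import Counter
from statistics import mean, pstdev


def extract_highlights(doc_topics, z_threshold=2.0):
    """Return the topics whose document frequency is unusually high.

    doc_topics: one set of topic labels per document (the sparse
        document representation in the topic space).
    z_threshold: hypothetical cutoff; the z-score criterion is an
        assumption, not the criterion used in the thesis.
    """
    # Count, for each topic, how many documents contain it.
    counts = Counter(t for topics in doc_topics for t in set(topics))
    if len(counts) < 2:
        return []
    mu = mean(counts.values())
    sigma = pstdev(counts.values())
    if sigma == 0:
        # Every topic is equally frequent, so none stands out.
        return []
    return sorted(t for t, c in counts.items() if (c - mu) / sigma > z_threshold)


def documents_with(doc_topics, highlight):
    """Select the sub-corpus of documents associated with one highlight."""
    return [topics for topics in doc_topics if highlight in topics]


# Toy corpus: topic "a" occurs in every document, the others once each.
corpus = [{"a", f"t{i}"} for i in range(10)]
highlights = extract_highlights(corpus)
# Hierarchical step: repeat the procedure on the documents of one highlight.
sub_corpus = documents_with(corpus, highlights[0])
```

Calling `extract_highlights` again on `sub_corpus`, typically after discarding the highlight that defined it, corresponds to the hierarchical analysis mentioned in the abstract.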
Degree
M.S.E.C.E.
Advisors
Boutin, Purdue University.
Subject Area
Statistics|Computer Engineering