A method for extracting highlights in document corpora

Shiv R Biddanda, Purdue University

Abstract

This work introduces the concept of "highlights," which are recurrent themes or topics in the components of a corpus. We propose a computationally efficient method for extracting these highlights from corpora of text documents, where the text can comprise any alphanumeric strings, such as words (from any language) or source code. The method is based on a low-cost, corpus-dependent topic extraction procedure that yields a sparse representation of each document in the topic space. Topics that occur abnormally often are then identified as the highlights of the corpus. This highlighting procedure can be repeated within subsets of the corpus (e.g., the documents associated with a given highlight) to analyze the corpus further in a hierarchical fashion. We demonstrate the method on several corpora and discuss the results.
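
The pipeline described in the abstract (per-document topic extraction, a sparse topic representation, then flagging topics that occur abnormally often) can be illustrated with a minimal sketch. The abstract does not specify the topic extraction procedure or the abnormality criterion, so everything concrete here is an assumption made for illustration: the hypothetical `document_topics` uses each document's most frequent tokens as its topics, and the hypothetical `extract_highlights` treats a topic as a highlight when its document frequency has a z-score above `z_threshold`. It is not the thesis's actual method.

```python
from collections import Counter
from statistics import mean, pstdev


def document_topics(doc, top_k=3):
    """Stand-in topic extraction (assumption): the top_k most frequent tokens."""
    counts = Counter(doc.lower().split())
    # Rank by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token for token, _ in ranked[:top_k]}


def extract_highlights(docs, top_k=3, z_threshold=1.5):
    """Map each highlight topic to the indices of the documents containing it.

    Assumption: "abnormally often" means the topic's document frequency lies
    more than z_threshold standard deviations above the mean over all topics.
    """
    # Sparse representation: each document is just its small set of topics.
    topic_sets = [document_topics(d, top_k) for d in docs]
    doc_freq = Counter(t for topics in topic_sets for t in topics)
    if not doc_freq:
        return {}
    freqs = list(doc_freq.values())
    mu, sigma = mean(freqs), pstdev(freqs)
    if sigma == 0:
        return {}  # every topic is equally common, so none stands out
    return {
        topic: [i for i, topics in enumerate(topic_sets) if topic in topics]
        for topic, freq in doc_freq.items()
        if (freq - mu) / sigma > z_threshold
    }
```

Under these assumptions, the hierarchical analysis mentioned above would amount to calling `extract_highlights` again on only the documents whose indices are returned for a given highlight.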

Degree

M.S.E.C.E.

Advisors

Boutin, Purdue University.

Subject Area

Statistics|Computer Engineering
