Automatic methods for content-based access and summarization of video sequences
Abstract
Depending on the specific information they are seeking, users want flexible and intuitive methods for searching and browsing multimedia libraries. However, the cost of manually extracting the metadata needed to support such functionality can be prohibitively high. Therefore, over the last decade there has been great interest in designing and building systems that automatically analyze and index multimedia data and retrieve its relevant parts. In this work we describe algorithms that facilitate browsing, searching, and summarization of video sequences. We propose algorithms for detecting cut and dissolve shot transitions based on a binary regression tree classifier framework. Our system detects these transitions with high accuracy. We discuss stochastic models of video program genres, such as news programs or sitcoms, and show how these models can be applied to automatically detect the genre of a given program. We investigate the use of hidden Markov models (HMMs) and stochastic context-free grammars (SCFGs) for this modeling. Since the computational complexity of SCFG training is high, we develop a hybrid HMM-SCFG model that considerably reduces training time. Deriving compact representations of video sequences that are intuitive for users and let them browse large collections of video data easily and quickly is fast becoming one of the most important topics in content-based video processing. Such representations, which we collectively refer to as video summaries, rapidly give the user information about the content of the sequence being examined while preserving its essential message. We propose an automated method to generate video skims for information-rich video programs, such as documentaries, educational videos, and presentations, using statistical analysis of speech transcripts obtained from the audio by automatic speech recognition (ASR).
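As a hedged illustration of how trained stochastic models can label a program's genre, the sketch below scores an observation sequence under one discrete HMM per genre using the scaled forward algorithm, then picks the genre with the highest log-likelihood. The class `DiscreteHMM`, the function `classify_genre`, the two-symbol observation alphabet, and all parameter values are hypothetical. The abstract does not say which features are used, and this sketch covers neither the SCFG models nor the hybrid HMM-SCFG model.

```python
import math
from typing import Dict, List, Sequence


class DiscreteHMM:
    """Hypothetical discrete-observation HMM: initial probs pi, transitions A, emissions B."""

    def __init__(self, pi: List[float], A: List[List[float]], B: List[List[float]]):
        self.pi = pi  # pi[i] = P(first state = i)
        self.A = A    # A[j][i] = P(next state = i | current state = j)
        self.B = B    # B[i][k] = P(symbol k | state i)

    def log_likelihood(self, obs: Sequence[int]) -> float:
        """Return log P(obs | model) via the scaled forward algorithm."""
        if not obs:
            return 0.0
        n = len(self.pi)
        alpha = [self.pi[i] * self.B[i][obs[0]] for i in range(n)]
        log_p = 0.0
        for t in range(len(obs)):
            if t > 0:
                # Propagate the normalized forward variables one step and absorb symbol obs[t].
                alpha = [
                    sum(alpha[j] * self.A[j][i] for j in range(n)) * self.B[i][obs[t]]
                    for i in range(n)
                ]
            c = sum(alpha)  # scaling factor; the product of all c equals P(obs | model)
            if c == 0.0:
                return float("-inf")
            alpha = [a / c for a in alpha]
            log_p += math.log(c)
        return log_p


def classify_genre(obs: Sequence[int], models: Dict[str, DiscreteHMM]) -> str:
    """Maximum-likelihood decision: the genre whose HMM best explains obs."""
    return max(models, key=lambda genre: models[genre].log_likelihood(obs))


# Toy models with invented numbers; symbols 0 and 1 stand for two hypothetical shot classes.
SHARED_A = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "news": DiscreteHMM([0.5, 0.5], SHARED_A, [[0.8, 0.2], [0.6, 0.4]]),
    "sitcom": DiscreteHMM([0.5, 0.5], SHARED_A, [[0.2, 0.8], [0.3, 0.7]]),
}
print(classify_genre([0, 0, 0, 1, 0, 0], models))  # news
```

Rescaling the forward variables at every step keeps long programs from underflowing to zero. Because the log-likelihoods are compared directly, the sketch implicitly assumes equal prior probabilities for all genres.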
Ideally, one would like the generated summaries both to be detailed and to cover most of the important points of the full program from which they were derived. Clearly, for high summarization ratios it is impossible to satisfy both of these constraints. Our summarization approach quantifies these two concepts and maximizes a weighted sum of detail and coverage functions to obtain a trade-off between the two. We also study objective evaluation methods for video summaries. We evaluate summaries produced by a number of algorithms using a question-and-answer evaluation scheme and discuss other methods of summary evaluation. In the final part of the dissertation we describe a real-world video processing application that makes use of many of the algorithms introduced in this work. In this application we generate a list of the unique people appearing in a news program using a combination of visual and audio features. This system greatly facilitates the indexing of news programs and may also be used as part of an automatic open-caption insertion system.
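To make the detail/coverage trade-off concrete, here is a minimal sketch that assumes the transcript has already been split into segments, each tagged with a topic label and a duration. The abstract does not define the actual detail and coverage functions, so the ones below are hypothetical stand-ins. Here, `coverage` is the fraction of topics touched by the skim, and `detail` is the average fraction of each covered topic's duration that the skim includes. `best_skim` maximizes `w * detail + (1 - w) * coverage` under a duration budget by exhaustive search. That search is exponential in the number of segments, so it is only suitable for toy inputs.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[str, float]  # hypothetical: (topic label, duration in seconds)


def coverage(selected: Sequence[int], segments: List[Segment]) -> float:
    """Fraction of all topics that appear in the selected segments."""
    all_topics = {topic for topic, _ in segments}
    if not all_topics:
        return 0.0
    covered = {segments[i][0] for i in selected}
    return len(covered) / len(all_topics)


def detail(selected: Sequence[int], segments: List[Segment]) -> float:
    """Mean, over covered topics, of the share of that topic's duration that is kept."""
    covered = {segments[i][0] for i in selected}
    if not covered:
        return 0.0
    total: Dict[str, float] = {}
    for topic, dur in segments:
        total[topic] = total.get(topic, 0.0) + dur
    kept: Dict[str, float] = {}
    for i in selected:
        topic, dur = segments[i]
        kept[topic] = kept.get(topic, 0.0) + dur
    return sum(kept[t] / total[t] for t in covered) / len(covered)


def best_skim(segments: List[Segment], budget: float, w: float) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively find the subset within budget maximizing w*detail + (1-w)*coverage."""
    best: Tuple[int, ...] = ()
    best_score = float("-inf")
    n = len(segments)
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if sum(segments[i][1] for i in subset) > budget:
                continue  # skim would exceed the allowed duration
            score = w * detail(subset, segments) + (1 - w) * coverage(subset, segments)
            if score > best_score:  # strict ">" keeps the first optimum found
                best, best_score = subset, score
    return best, best_score


segs = [("a", 10.0), ("a", 10.0), ("b", 10.0), ("b", 10.0)]
print(best_skim(segs, 20.0, 1.0))  # ((0, 1), 1.0): detail only, one topic in full
print(best_skim(segs, 20.0, 0.0))  # ((0, 2), 1.0): coverage only, one segment per topic
```

With `w = 1` the budget goes entirely to one topic, and with `w = 0` it is spread across topics. Intermediate weights move between these extremes, which is the kind of trade-off the weighted sum is meant to control.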
Degree
Ph.D.
Advisors
Delp, Purdue University.
Subject Area
Electrical engineering