Structural event detection for rich transcription of speech

Yang Liu, Purdue University

Abstract

Although speech recognition technology has improved significantly during the past few decades, current speech recognition systems output only a stream of words, without the structural information that could aid a human reader and downstream language processing modules. This thesis focuses on the automatic detection of several helpful structural events in speech, including sentence boundaries, type of utterance, filled pauses, discourse markers, and edit disfluencies. The systems evaluated combine prosodic cues and textual information sources in a variety of ways to support automatic detection of these structural events. Experiments were conducted across corpora (conversational speech and broadcast news speech) and with transcriptions of different quality (human transcriptions versus recognition output). Because structural events are much less frequent than non-events, the imbalanced data problem is investigated for training the decision tree prosody model component of our system. A variety of sampling approaches and bagging are used to address this imbalance. Significant performance improvements are obtained via bagging. Some of the sampling methods are also useful, depending on the performance metric used. Sentence boundary detection and disfluency detection are affected differently by sampling, bagging, and boosting, which suggests inherent differences between the two tasks. A variety of methods for combining knowledge sources are examined: a hidden Markov model (HMM), the maximum entropy (Maxent) model, and the conditional random field (CRF). The Maxent and CRF approaches are discriminatively trained to model the posterior probabilities and thus correlate with the performance measures. They also support the use of more correlated features and so enable the combination of a variety of textual information sources. The HMM and CRF both model sequence information, unlike the Maxent model, which explicitly models local information. A model that combines these three approaches is superior to any method alone. Interactions with other research efforts suggest that the methods developed in this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty meeting corpus) and to similar tasks (e.g., a gestural model, dialog act segmentation and classification).
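As a rough illustration of the sampling and bagging idea mentioned in the abstract, the sketch below trains an ensemble of classifiers on downsampled, bootstrapped copies of an imbalanced training set and averages their votes into a posterior estimate. It is not the thesis implementation: the function names, the toy data, the single made-up numeric feature, and the use of a one-threshold decision stump in place of a full decision tree are all assumptions made for brevity.

```python
import random

def downsample(events, non_events, rng):
    """Keep every event and an equal-size random subset of non-events."""
    k = min(len(events), len(non_events))
    return events + rng.sample(non_events, k)

def train_stump(data):
    """Fit a decision stump: predict 'event' when feature >= threshold.

    Hypothetical stand-in for a decision tree; picks the data value
    that gives the fewest training errors.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in data}):
        errors = sum((x >= threshold) != bool(y) for x, y in data)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def train_bagged(data, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap of a downsampled set."""
    rng = random.Random(seed)
    events = [d for d in data if d[1] == 1]
    non_events = [d for d in data if d[1] == 0]
    thresholds = []
    for _ in range(n_models):
        balanced = downsample(events, non_events, rng)
        # Bootstrap replicate: sample with replacement, same size.
        bootstrap = [rng.choice(balanced) for _ in balanced]
        thresholds.append(train_stump(bootstrap))
    return thresholds

def posterior(thresholds, x):
    """Estimated event probability: fraction of stumps voting 'event'."""
    return sum(x >= t for t in thresholds) / len(thresholds)

# Toy imbalanced data (assumed): 90 non-events with small feature
# values, 10 events with large ones. Each item is (feature, label).
data = [(i * 0.003, 0) for i in range(90)]
data += [(0.5 + j * 0.05, 1) for j in range(10)]

thresholds = train_bagged(data)
print(posterior(thresholds, 0.9), posterior(thresholds, 0.05))
```

Averaging the votes of many classifiers, each trained on a different balanced sample, yields a smoother posterior estimate than a single classifier trained on one downsampled set, and this is one common motivation for pairing sampling with bagging.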

Degree

Ph.D.

Advisors

Harper, Purdue University.

Subject Area

Electrical engineering; Computer science
