Statistical parsing and language modeling based on constraint dependency grammar

Wen Wang, Purdue University

Abstract

This thesis focuses on the development of effective and efficient language models (LMs) for speech recognition systems. We selected Constraint Dependency Grammar (CDG) as the underlying framework because CDG parses can be lexicalized at the word level with a rich set of lexical features for modeling subcategorization and wh-movement without a combinatorial explosion of the parameter space, and because CDG is able to model languages with crossing dependencies and free word ordering. Two types of LMs were developed: an almost-parsing LM and a full parser-based LM. The quality of these LMs gained significantly from the insights obtained from initial CDG grammar induction experiments. The almost-parsing LM uses a data structure derived from CDG parses called a SuperARV that tightly integrates knowledge of words, lexical features, and syntactic constraints. The full CDG parser-based LM utilizes complete parse information obtained by adding the modifiee links to the SuperARVs assigned to each word in a sentence in order to capture important long-distance dependency constraints. We have evaluated the almost-parsing LM on a variety of large vocabulary continuous speech recognition (LVCSR) tasks and found that it reduced recognition error rates significantly compared to commonly used word-based LMs, achieving performance competitive with state-of-the-art parser-based LMs at a significantly lower time complexity. The full CDG parser-based LM, when evaluated on the DARPA Wall Street Journal CSR task, outperformed the almost-parsing LM and produced performance comparable to or exceeding that of state-of-the-art parser-based LMs.

Degree

Ph.D.

Advisors

Harper, Purdue University.

Subject Area

Electrical engineering; Artificial intelligence
