This technical report concerns the development of a probabilistic Constraint Dependency Grammar (CDG) language model for speech recognition tasks. We have developed methods to quickly annotate a medium-sized corpus of sentences and extract high-quality CDGs, and we have evaluated the quality of these grammars. Using the corpus of CDG parses, we have constructed and evaluated a language model that incorporates syntactically and semantically enriched part-of-speech (POS) tags. The N-gram language model based on the enriched tags improves perplexity and word error rate on the test corpus compared to both a standard word-based N-gram language model and a POS-based N-gram language model. Future work focuses on developing a probabilistic CDG language model that incrementally builds a hidden dependency parse structure using syntactic and lexical constraints. Partial parse information will serve as the history of a word, enabling the use of long-distance dependency information for word prediction. The model will tightly integrate tagging with parsing, and will use dependency constraints, subcategorization/expectation constraints, and lexical features of words to generate parse structures. The model will search the parse space in a left-to-right, bottom-up manner so that it can be integrated directly with a speech recognizer. Additionally, distance measures and punctuation information will be investigated to refine the modeling of dependency structures.
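The class-based factoring behind a tag-based N-gram model can be sketched as follows. This is a minimal toy illustration, not the report's implementation: it assumes a bigram factoring P(word_i | word_{i-1}) ≈ P(tag_i | tag_{i-1}) · P(word_i | tag_i), with invented enriched tags and a two-sentence hypothetical corpus, and computes perplexity by maximum likelihood without smoothing.

```python
import math
from collections import defaultdict

# Hypothetical tagged corpus: (word, enriched-tag) pairs per sentence.
# The tag names (e.g. "NOUN-subj") are illustrative, not from the report.
corpus = [
    [("the", "DET"), ("dog", "NOUN-subj"), ("barks", "VERB-intrans")],
    [("the", "DET"), ("cat", "NOUN-subj"), ("sleeps", "VERB-intrans")],
]

tag_bigrams = defaultdict(int)   # counts of (prev_tag, tag)
context_counts = defaultdict(int)  # counts of prev_tag as a context
emissions = defaultdict(int)     # counts of (word, tag)
tag_totals = defaultdict(int)    # counts of each tag, for emission normalisation

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1
        context_counts[prev] += 1
        emissions[(word, tag)] += 1
        tag_totals[tag] += 1
        prev = tag

def prob(prev_tag, word, tag):
    """P(tag | prev_tag) * P(word | tag), maximum-likelihood estimates."""
    p_trans = tag_bigrams[(prev_tag, tag)] / context_counts[prev_tag]
    p_emit = emissions[(word, tag)] / tag_totals[tag]
    return p_trans * p_emit

# Perplexity of the (training) corpus under the class-based bigram model.
log_prob, n = 0.0, 0
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        log_prob += math.log2(prob(prev, word, tag))
        n += 1
        prev = tag
ppl = 2 ** (-log_prob / n)
print(f"perplexity = {ppl:.3f}")
```

Because the enriched tags group words into equivalence classes, the transition distribution is estimated over tags rather than words, which is what reduces sparsity relative to a word-based N-gram model.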
Constraint Dependency Grammar, Grammar Induction, Language Modeling, Statistical Parsing