This technical report concerns the development of a probabilistic Constraint Dependency Grammar (CDG) language model for speech recognition tasks. We have developed methods to quickly annotate a medium-sized corpus of sentences and extract high-quality CDGs. We have also evaluated the quality of these grammars. Using the corpus of CDG parses, we have constructed and evaluated a language model that incorporates syntactically and semantically enriched Part-of-Speech (POS) tags. The N-gram language model based on the enriched tags improves perplexity and word error rate on our test corpus compared to a standard word-based N-gram language model and an N-gram POS-based language model. Future work focuses on developing a probabilistic CDG language model that incrementally builds up a hidden dependency parse structure using syntactic and lexical constraints. Partial parse information will be used as the history of a word to enable the use of long-distance dependency information for word prediction. The model will tightly integrate tagging with parsing and use dependency constraints, subcategorization/expectation constraints, and lexical features of words to generate parse structures. The model will search the parse space in a left-to-right, bottom-up manner so that it can be integrated directly with a speech recognizer. Additionally, distance measures and punctuation information will be investigated to refine the modeling of dependency structures.


Constraint Dependency Grammar, Grammar Induction, Language Modeling, Statistical Parsing

April 2001