Parsing and tagging sentences containing lexically ambiguous and unknown tokens

Scott Matthew Thede, Purdue University

Abstract

We present a parsing system designed to parse sentences containing unknown words as accurately as possible. Our post-mortem parsing algorithm combines syntactic parsing rules, morphological recognition, and a closed-class lexicon with a method that first attempts to parse a sentence using a limited prediction for unknown words and, if that attempt fails, reparses the sentence with a broader prediction. This allows great flexibility while parsing and can improve both accuracy and efficiency on sentences that contain unknown words. Experiments involving hand-created and computer-generated morphological recognizers are performed. We also develop a part-of-speech tagging system designed to accurately tag sentences, including those containing unknown words. The system is based on a basic hidden Markov model but uses second-order approximations for the probability distributions instead of first-order ones. The second-order approximations increase tagging accuracy without increasing asymptotic running time over traditional trigram taggers. A dynamic smoothing technique addresses sparse data by attaching more weight to events that occur more frequently. Tags for unknown words are predicted by statistical estimation from the training corpus, based on word endings only. Information from suffixes of different lengths is combined in a weighted voting scheme, smoothed in a fashion similar to that used for the second-order HMM. This tagging model achieves state-of-the-art accuracies. Finally, the use of syntactic parsing rules to increase tagging accuracy is considered. Allowing a parser to veto candidate tag sequences that violate syntactic rules is shown to reduce tagging errors by 28% on the TIMIT corpus. This enhancement is useful for corpora that have rule sets defined.
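The suffix-based prediction for unknown words described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the maximum suffix length of 3, and the weight `n / (n + 1)` are all assumptions chosen only to show the general shape of a weighted vote in which more frequently observed suffixes carry more weight.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 3  # assumption: longest word ending considered


def train_suffix_counts(tagged_words, max_len=MAX_SUFFIX):
    """Count how often each tag occurs with each word-final suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for k in range(1, min(max_len, len(w)) + 1):
            counts[w[-k:]][tag] += 1
    return counts


def predict_tag_distribution(word, counts, max_len=MAX_SUFFIX):
    """Combine the tag distributions of all matching suffixes in a weighted vote."""
    w = word.lower()
    scores = Counter()
    for k in range(1, min(max_len, len(w)) + 1):
        tag_counts = counts.get(w[-k:])
        if not tag_counts:
            continue  # this suffix was never seen in training
        n = sum(tag_counts.values())
        # Assumed weighting: suffixes seen more often get a weight closer to 1.
        weight = n / (n + 1.0)
        for tag, c in tag_counts.items():
            scores[tag] += weight * c / n
    total = sum(scores.values())
    # Normalize to a probability distribution; empty if no suffix matched.
    return {tag: s / total for tag, s in scores.items()} if total else {}
```

In a full tagger, a distribution like this would stand in for the lexical probabilities of a word that never appeared in the training corpus, and the HMM's contextual probabilities would then select among the candidate tags.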

Degree

Ph.D.

Advisors

Harper, Purdue University.

Subject Area

Electrical engineering; Computer science
