Prosodic disambiguation in automatic speech understanding of Thai

Siripong Potisuk, Purdue University

Abstract

This research studies the role of prosody in automatic speech understanding systems. It is believed that incorporating prosodic information into the current speech recognition scheme will improve performance. Prosody can be defined as changes in fundamental frequency (F0), timing, and intensity of speech, and it is used to signal linguistic and affective information. Linguistic prosody, which signals grammatical information at the syllable, word, or sentence level (for example, stress or intonation), is of primary interest. The language chosen for this investigation is Thai. Thai belongs to the class of tone languages, in which variations in F0 at the syllable level signal differences in lexical meaning. Every Thai syllable carries a lexically contrastive F0 contour, or tone, and Thai has five tones: mid, low, falling, high, and rising. Three specific issues are addressed: (1) automatic stress detection; (2) automatic tone classification; and (3) constraint dependency parsing with prosodic disambiguation. Two experiments were designed to study empirically the acoustic characteristics of stressed and unstressed syllables in terms of the vowel length distinction and the relative importance of each acoustic correlate in signaling stress. We then developed a stress classification algorithm based on a Bayesian classifier with linear discriminant scores. A duration normalization procedure based on the mean rhyme duration for each syllable type was used to neutralize durational differences due to differences in segmental composition. Our classifier achieved a 97% classification accuracy. Next, coarticulatory effects among tones were examined through an acoustic experiment using trisyllabic sequences. We then developed an analysis-by-synthesis method of tone classification based on our extension to Fujisaki's model of F0 contour synthesis.
We also developed an F0 normalization procedure that uses an equivalent-rectangular-bandwidth (ERB) scale conversion and z-score normalization to account for intra- and interspeaker variability, together with a time-varying mean-scaling procedure to account for the declination effect. Our tone classifier achieved an 89.1% classification accuracy. Finally, prosodic constraints were incorporated into the language model for Thai. We extended PARSEC, a constraint-based language parser, to include prosodic constraints for ambiguity resolution. These constraints determine whether the input prosodic structure agrees with that of each of the competing sentence hypotheses of an ambiguous utterance.
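
The F0 normalization step described in the abstract (ERB scale conversion followed by z-score normalization) can be illustrated with a minimal sketch. The abstract does not say which ERB formula the dissertation uses, so the Glasberg and Moore (1990) ERB-rate formula is assumed here. The function names are hypothetical, the z-score is assumed to be computed over one speaker's F0 samples, and the time-varying mean-scaling for declination is not modeled.

```python
import math
from statistics import mean, pstdev


def hz_to_erb_rate(f_hz):
    """Convert a frequency in Hz to the ERB-rate scale.

    Assumption: Glasberg & Moore (1990) formula; the dissertation
    may use a different ERB variant.
    """
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def zscore_normalize(values):
    """Z-score a sequence using its own mean and population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A flat contour carries no variation to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_f0(f0_hz):
    """ERB-convert, then z-score, the F0 samples of a single speaker."""
    erb_values = [hz_to_erb_rate(f) for f in f0_hz]
    return zscore_normalize(erb_values)


if __name__ == "__main__":
    # Illustrative F0 samples in Hz, not data from the dissertation.
    print(normalize_f0([100.0, 120.0, 150.0, 180.0]))
```

After this transformation, contours from speakers with different pitch ranges are expressed in the same unitless scale (zero mean, unit variance per speaker), which is what makes them comparable as classifier input.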

Degree

Ph.D.

Advisors

Harper, Purdue University.

Subject Area

Electrical engineering; Computer science; Communication
