Speech Repairs, Intonational Boundaries and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialog


Authors: Peter A. Heeman
Published: Technical Report 673
Dept. of Computer Science, U. Rochester, December 1997.
Doctoral dissertation.

Also available from the Computation and Language E-Print Archive as cmp-lg/9712009.

Abstract: Interactive spoken dialog provides many new challenges for natural language understanding systems. One of the most critical challenges is simply determining the speaker's intended utterances: both segmenting a speaker's turn into utterances and determining the intended words in each utterance. Even assuming perfect word recognition, the latter problem is complicated by the occurrence of speech repairs, which occur where the speaker goes back and changes (or repeats) something she just said. The words that are replaced or repeated are no longer part of the intended utterance, and so need to be identified. The two problems of segmenting the turn into utterances and resolving speech repairs are strongly intertwined with a third problem: identifying discourse markers. Lexical items that can function as discourse markers, such as "well" and "okay," are ambiguous as to whether they are introducing an utterance unit, signaling a speech repair, or are simply part of the context of an utterance, as in "that's okay." Spoken dialog systems need to address these three issues together and early on in the processing stream. In fact, just as these three issues are closely intertwined with each other, they are also intertwined with identifying the syntactic role or part-of-speech (POS) of each word and the speech recognition problem of predicting the next word given the previous words.

In this thesis, we present a statistical language model for resolving these issues. Rather than finding the best word interpretation for an acoustic signal, we redefine the speech recognition problem so that it also identifies the POS tags, discourse markers, speech repairs and intonational phrase endings (a major cue in determining utterance units). Adding these extra elements to the speech recognition problem actually allows it to better predict the words involved, since we are able to make use of the predictions of boundary tones, discourse markers and speech repairs to better account for what word will occur next. Furthermore, we can take advantage of acoustic information, such as silence information, which tends to co-occur with speech repairs and intonational phrase endings, that current language models can only regard as noise in the acoustic signal. The output of this language model is a much fuller account of the speaker's turn, with a part-of-speech tag assigned to each word, intonational phrase endings and discourse markers identified, and speech repairs detected and corrected. In fact, the identification of the intonational phrase endings, discourse markers, and resolution of the speech repairs allows the speech recognizer to model the speaker's utterances, rather than simply the words involved, and thus it can return a more meaningful analysis of the speaker's turn for later processing.
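
As a rough illustration of what redefining the speech recognition problem might look like, the sketch below contrasts the standard word-only formulation with a joint one. The notation is an assumption made here for illustration and is not necessarily that of the thesis: A is the acoustic signal, W the word sequence, P the POS tags, D the discourse marker labels, R the speech repair labels, and T the intonational phrase (boundary tone) labels.

```latex
% Standard formulation: find the most probable word sequence for the signal.
\hat{W} = \arg\max_{W} \Pr(W \mid A)
        = \arg\max_{W} \Pr(A \mid W)\,\Pr(W)

% Joint formulation (illustrative notation): also recover the POS tags,
% discourse markers, speech repairs, and intonational phrase endings.
(\hat{W}, \hat{P}, \hat{D}, \hat{R}, \hat{T})
  = \arg\max_{W,P,D,R,T} \Pr(W, P, D, R, T \mid A)
  = \arg\max_{W,P,D,R,T} \Pr(A \mid W, P, D, R, T)\,\Pr(W, P, D, R, T)

% By the chain rule, the language model term factors over word positions i,
% so each word is predicted jointly with its labels given the full history.
\Pr(W, P, D, R, T)
  = \prod_{i=1}^{N} \Pr(w_i, p_i, d_i, r_i, t_i \mid
      w_{1:i-1}, p_{1:i-1}, d_{1:i-1}, r_{1:i-1}, t_{1:i-1})
```

Under this reading, the extra labels in the conditioning history are what allow predictions of boundary tones, discourse markers and speech repairs to inform the prediction of the next word, and the acoustic term is where cues such as silence could enter.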

Paper: Pdf
Note: My 1997 ACL paper is a shorter version of the second half of the thesis, and my 1997 Eurospeech and 1998 VCL papers are shorter versions of the first half.