| Authors: | Peter A. Heeman |
| Published: |
Technical Report 673, Dept. of Computer Science, U. Rochester, December 1997. Doctoral dissertation. Also available from the Computation and Language E-Print Archive as cmp-lg/9712009. |
| Abstract: |
Interactive spoken dialog provides many new challenges for natural
language understanding systems. One of the most critical challenges
is simply determining the speaker's intended utterances: both
segmenting a speaker's turn into utterances and determining the
intended words in each utterance. Even assuming perfect word
recognition, the latter problem is complicated by the occurrence of
speech repairs, which occur where the speaker goes back and changes
(or repeats) something she just said. The words that are replaced or
repeated are no longer part of the intended utterance, and so need to
be identified. The two problems of segmenting the turn into
utterances and resolving speech repairs are strongly intertwined with
a third problem: identifying discourse markers. Lexical items that
can function as discourse markers, such as "well" and "okay," are
ambiguous as to whether they are introducing an utterance unit,
signaling a speech repair, or are simply part of the context of an
utterance, as in "that's okay." Spoken dialog systems need to
address these three issues together and early on in the processing
stream. In fact, just as these three issues are closely intertwined
with each other, they are also intertwined with identifying the
syntactic role or part-of-speech (POS) of each word and the speech
recognition problem of predicting the next word given the previous
words. In this thesis, we present a statistical language model for resolving these issues. Rather than finding the best word interpretation for an acoustic signal, we redefine the speech recognition problem so that it also identifies the POS tags, discourse markers, speech repairs and intonational phrase endings (a major cue in determining utterance units). Adding these extra elements to the speech recognition problem actually allows it to better predict the words involved, since we are able to make use of the predictions of boundary tones, discourse markers and speech repairs to better account for what word will occur next. Furthermore, we can take advantage of acoustic information, such as silence information, which tends to co-occur with speech repairs and intonational phrase endings, that current language models can only regard as noise in the acoustic signal. The output of this language model is a much fuller account of the speaker's turn, with a part-of-speech tag assigned to each word, intonational phrase endings and discourse markers identified, and speech repairs detected and corrected. In fact, the identification of the intonational phrase endings and discourse markers and the resolution of the speech repairs allow the speech recognizer to model the speaker's utterances, rather than simply the words involved, and thus it can return a more meaningful analysis of the speaker's turn for later processing. |
| Paper: | |
| Note: | My 1997 ACL paper is a shorter version of the second half of the thesis, and my 1997 Eurospeech and 1998 VCL papers are shorter versions of the first half. |