Managing Conversation: Bringing a Human Touch To Speech Technology
BY EMMETT COIN
Imagine a computer that could speak and truly
communicate, with feeling and the proper range of
emotions. That is the ultimate goal of conversational
speech technology, a goal that is closer than many
people think. Conversational systems provide a
functional and potentially engaging way to interact
naturally with a computer, for three main reasons.
First, conversation is fast. Human speech has evolved
to maximize communication while minimizing errors.
Application developers argue that commands are faster
when spoken rather than clicked or typed, and certainly
in highly specialized and critical situations unique
command-based systems do evolve naturally. But how much
does one invest to become productive and efficient with
a specialized system?
Second, conversation is innate. We don't need to
teach this to anyone - diverse personalities can find a
common conversational modality. Most children are
competent conversationalists at three years of age.
Finally, conversation is inherently error correcting.
In fact, it presumes an error-prone channel.
Conversational partners are always on the alert for
non-meaningful and contradictory exchanges. Whether for
business, education, or even fun, speech is an intuitive
experience for people.
BRIEF HISTORY OF SPEECH
The earliest steps toward the goal of man/machine
conversation focused on speech production. Synthetic
speech generation research dates back more than one
hundred years. In 1922, J. Q. Stewart built a machine
consisting of two coupled resonances excited by
electrical impulses. It could produce vowel-like sounds
by tuning the resonances to different frequencies. Some
of the speech sounds, vowels specifically, can be
thought of as musical chords. So, specific resonances
allowed the passage of specific "notes" in the "chord."
The difference between an "ah" and an "oh" is something
like the difference between a C major and a D minor
chord.
Jump forward to the late 1960s, when digital
computers allowed research to rapidly advance, making it
possible to write programs to control loudness, pitch,
and resonances. Today, synthetic speech generators,
which we call text-to-speech (TTS) systems, use a
combination of pronunciation rules and dictionaries, and
have essentially limitless vocabularies. Also in the late
1960s, digital computers made practical the
progenitors of the automatic speech recognition (ASR)
systems we have today. At first, simple template matches
were made for patterns of those "chords" of speech.
Later, genuine phonetic models that accepted compact,
linguistic notations for large clusters of expected
utterances were developed.
The current technology is being used in many ways.
Some basic conversational applications allow users to
access information, such as e-mail or stock market
information, over a phone, which eliminates the need to
carry a laptop. As speech recognition and understanding
technology continues to evolve, more and more
applications are appearing on the market. Clearly, such
interfaces could handle a wide range of tasks.
How will we create all those different potential
conversations? At present, these applications are
"one-up" programs, each piggy-backing the one before.
Should each application be a complete new program? As
humans, we don't learn how to talk to a hardware store
clerk and then completely relearn from scratch how to
talk to a toy store clerk. What can we reuse? What are
the new parts? To move forward, we need a conversation
management system.
CONVERSATION MANAGEMENT SYSTEM
What does the conversation manager (CM) do? Overall,
it guides the conversation. Consider one exchange, where
a person talks and then a computer talks. The computer
"listens" to the person with expectations of what it
could hear. This is done by notifying the ASR that
something within a certain context is expected.
So, an e-mail system tuned for ordinary conversation
would not expect to hear a person say "Buy 200 shares of
IBM at market." The system, however, might guess at the
person's intention. Then, if the person concurs with
that intention, the system will advance the
conversation.
If something is to be done, maybe "read the next
e-mail from Bill," then a real-world function must
access the e-mail repository, preprocess it to deal with
the e-mail-specific idiosyncrasies that would not be
handled by the generic TTS system, and return the
modified version of the e-mail to the CM. The CM sends
the e-mail to the TTS system to be spoken and begins
listening immediately, in the event that the person
wishes to interrupt.
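The preprocessing step can be pictured with a small Python sketch. Which idiosyncrasies matter is an assumption made here for illustration (quoted reply lines, a signature block, Web addresses, and emoticons); the function name `prepare_for_tts` is hypothetical and not part of any real product.

```python
import re


def prepare_for_tts(body):
    """Turn a raw e-mail body into text a generic TTS engine can read."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue                     # skip quoted reply text
        if line.strip() == "--":
            break                        # stop at the signature delimiter
        kept.append(line.strip())
    text = " ".join(line for line in kept if line)
    text = re.sub(r"https?://\S+", "a web link", text)  # do not spell out URLs
    text = re.sub(r":-?\)", "", text)                   # drop simple emoticons
    return re.sub(r"\s+", " ", text).strip()
```

The CM would pass the returned string to the TTS system rather than the raw message.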
PARTS OF A CONVERSATION MANAGER
Conversation Plan
The conversation plan is the biggest unknown in
conversational systems. A simple programmatic approach
is common: A programmer writes a standard C/C++ program
that does it all. It loads the ASR with a context and
monitors the result status of the ASR. If a result is
returned, the program decides whether to accept or
reject. It changes its state correspondingly. Then the
cycle repeats. In this approach, all the subtlety of the
language and the conversational manner is created and
controlled by programmers. If it is to work well, the
program requires a sophisticated developer with
substantial linguistic and programming skill. This
approach will not remain viable for much longer.
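The load, listen, decide, and repeat cycle might look roughly like the sketch below. It is written in Python rather than C/C++ for brevity. The recognizer is a scripted stand-in, and names such as `ScriptedRecognizer`, `CONTEXTS`, and the 0.6 acceptance threshold are assumptions for illustration, not any vendor's API.

```python
# Hypothetical stand-in for an ASR engine: it "hears" scripted results.
class ScriptedRecognizer:
    def __init__(self, results):
        self._results = iter(results)
        self.context = None

    def load_context(self, phrases):
        # A real engine would load a grammar here; we just remember it.
        self.context = set(phrases)

    def listen(self):
        # Returns (text, confidence), or None when the script runs out.
        return next(self._results, None)


# Everything about the conversation is hard-coded by the programmer.
CONTEXTS = {
    "main": {"read next": "reading", "quit": "done"},
    "reading": {"delete it": "main", "keep it": "main"},
}


def run_dialog(asr, threshold=0.6):
    state, log = "main", []
    while state != "done":
        asr.load_context(CONTEXTS[state])        # tell the ASR what to expect
        result = asr.listen()                    # monitor the result status
        if result is None:
            break
        text, confidence = result
        if text in asr.context and confidence >= threshold:
            log.append((state, text, "accept"))
            state = CONTEXTS[state][text]        # change state accordingly
        else:
            log.append((state, text, "reject"))  # stay put and listen again
    return state, log
```

Even in this toy form, every phrase and every transition lives in program code, which is why changing the conversation means changing the program.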
Another approach is to use a generic state machine
that removes the specifics of what is said and heard
from the realm of C/C++ code. In this kind of system, a
new division of labor emerges: CM programmers, who need
not be traditional computer programmers (and who, in
fact, should be linguistic experts), define a
conversation, while C/C++ programmers support the
generic state engine.
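A minimal sketch of that separation, again in Python: the conversation is plain data that a non-programmer could edit, and a small generic engine walks through it. The JSON layout and the `StateEngine` class are hypothetical, invented for this example.

```python
import json

# Conversation definition: plain data, no program logic.
CONVERSATION = json.loads("""
{
  "start": "idle",
  "states": {
    "idle": {
      "prompt": "You have new mail.",
      "expect": {"read it": "reading", "goodbye": "end"}
    },
    "reading": {
      "prompt": "Reading the message.",
      "expect": {"next": "reading", "stop": "idle"}
    },
    "end": {"prompt": "Goodbye.", "expect": {}}
  }
}
""")


class StateEngine:
    """Generic engine: knows nothing about e-mail, only about the data."""

    def __init__(self, definition):
        self.states = definition["states"]
        self.current = definition["start"]

    def prompt(self):
        return self.states[self.current]["prompt"]

    def expected(self):
        # The phrases the ASR should listen for in this state.
        return sorted(self.states[self.current]["expect"])

    def hear(self, utterance):
        # Unexpected input leaves the state unchanged.
        transitions = self.states[self.current]["expect"]
        self.current = transitions.get(utterance, self.current)
        return self.prompt()
```

Swapping in a different definition would yield a different conversation without touching the engine.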
More sophisticated methods involve logic programming
and reasoning systems. In these systems, the
conversation is guided by rules. As conditions in the
conversation change, the rules are reevaluated against
various propositions (such as <READ_NEXT_EMAIL>),
and some result determines the action. It could be
simple: read it or not. It could be more
complicated: If it is found, and not from Bill, and not
more than one minute long, and not already forwarded
from someone else, and it is very likely that this was
correctly recognized, then read it.
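The more complicated case can be written down as a list of named conditions. In the sketch below, plain Python predicates stand in for a real logic programming or reasoning system; the `Email` fields and the 0.9 threshold for "very likely" are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    sender: str
    spoken_seconds: float   # estimated time to read aloud
    forwarded: bool


# Each rule is a named predicate over (email, recognition confidence).
RULES = [
    ("found", lambda mail, conf: mail is not None),
    ("not from Bill", lambda mail, conf: mail.sender != "Bill"),
    ("at most one minute", lambda mail, conf: mail.spoken_seconds <= 60),
    ("not forwarded", lambda mail, conf: not mail.forwarded),
    ("confidently recognized", lambda mail, conf: conf >= 0.9),
]


def decide(mail: Optional[Email], confidence: float):
    """Evaluate the READ_NEXT_EMAIL proposition against the rules."""
    for name, test in RULES:
        if not test(mail, confidence):
            return ("SKIP", name)   # report the first rule that failed
    return ("READ", None)
```

Reporting which rule failed gives the CM something to say back to the person instead of silently doing nothing.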
Automatic Speech Recognition (ASR)
ASR listens to and transcribes what the
person is saying. This is technology that can be bought
off-the-shelf today - many vendors offer some kind of
speech recognition package. While they are not all
interchangeable, they are remarkably similar. Of all the
phonetic, grammar-based recognizers, the vast majority
are solidly based on the Hidden Markov Model (HMM)
methodology. Some key differences between the products
are telephone versus microphone models, host- and/or
DSP-based models, models with dynamic word lists at run
time, and models that dynamically include uttered
elements in grammar.
A big distinction for a telephony model is the
quality of recognition over the network. As humans, we
use a great deal of contextual and common sense
information to compensate for the dramatic loss of
acoustic detail. For example, how often do you confuse
words that have "s," "f," "sh," or "ch" sounds in them?
These all have high-frequency components that are
removed by the phone system. This is one reason that
dictation over the phone is still impractical.
The high-density telephony solutions usually include
a hardware assist for some or all of the recognition
task. Some solutions have a proprietary card set, while
others take advantage of hardware such as Dialogic's
Antares DSP board.
Natural Language Understanding (NLU)
NLU determines what the person meant.
This is a transform function that converts any of the
ways that the user population might say something to the
computer into a class of action. "Gimme the next one"
and "Would you read the next e-mail, please" are
equivalent at some points in an e-mail reader
application. At those points these and a myriad of other
phrases must resolve to <READ_NEXT_EMAIL>.
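As a rough illustration of that many-to-one transform, the Python sketch below maps a handful of phrasings onto one class of action with regular expressions. The patterns and the `DELETE_EMAIL` class are hypothetical; a real system would need far broader coverage and would likely use the recognizer's grammar rather than ad hoc patterns.

```python
import re

# Hypothetical patterns for two classes of action.
INTENT_PATTERNS = {
    "READ_NEXT_EMAIL": [
        r"\b(gimme|give me|read|play|get)\b.*\bnext\b",
        r"^next( one| e-?mail| message)?( please)?$",
    ],
    "DELETE_EMAIL": [
        r"\b(delete|erase|trash)\b",
    ],
}


def normalize(utterance):
    # Lowercase and drop punctuation (keeping hyphens) before matching.
    return re.sub(r"[^a-z0-9\- ]", "", utterance.lower()).strip()


def understand(utterance):
    text = normalize(utterance)
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return intent
    return None   # no class of action matched
```

Both example phrases from the text resolve to the same class here, while an out-of-context request such as a stock order resolves to nothing.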
Context Tracker (For ASR And NLU)
The context tracker is usually tied in
tightly to the conversation plan, but this element is
worth pointing out since it is key to any conversational
player, human or machine. Let's say a person is talking
about shoes. Are they brake shoes for a car or bowling
shoes?
The context tracker first needs to answer the
question: What are we talking about right now? Because
the CM is engineered, we must consider the trade-offs.
One important trade-off is that the more the system
"expects" to hear, the more likely it is to
misunderstand.
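One way to picture the tracker is as a small object that remembers the current topic, uses it to resolve ambiguous words, and narrows what the ASR listens for. The topics, phrase lists, and class name in this Python sketch are invented for illustration.

```python
# Hypothetical vocabulary: each topic lists the phrases worth listening for.
TOPIC_PHRASES = {
    "auto repair": {"brake shoes", "brake pads", "oil filter"},
    "bowling": {"bowling shoes", "bowling ball", "lane rental"},
}

# Ambiguous words and what they mean under each topic.
SENSES = {
    "shoes": {"auto repair": "brake shoes", "bowling": "bowling shoes"},
}


class ContextTracker:
    def __init__(self):
        self.topic = None

    def observe(self, phrase):
        # Any unambiguous phrase tells us what we are talking about.
        for topic, phrases in TOPIC_PHRASES.items():
            if phrase in phrases:
                self.topic = topic

    def resolve(self, word):
        # Returns None when the word is still ambiguous.
        return SENSES.get(word, {}).get(self.topic)

    def expected_phrases(self):
        # Narrow the ASR's expectations once the topic is known;
        # listening for everything raises the odds of a mix-up.
        if self.topic is None:
            return set().union(*TOPIC_PHRASES.values())
        return TOPIC_PHRASES[self.topic]
```

Before any topic is established the tracker must listen for all six phrases; after hearing "oil filter" it listens for three, and "shoes" can only mean brake shoes.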
Text-To-Speech (TTS)
TTS generates spoken replies to a human query. In
conversational systems, the computer will speak with its
own voice (TTS) and not a patchwork quilt of human
recordings. In an e-mail reader application, it is
obvious that we must use TTS. As applications tackle
more tasks that manipulate larger amounts of data, the
recorded prompts and responses become unwieldy at best
(expensive, time consuming, hard to maintain, limited
voice model availability, etc.).
These recording issues, however, will soon become
irrelevant. First, TTS systems are getting better all
the time. In addition, some amazing things can be done
with prosody transplantation (imposing real human
intonation on generated TTS).
Interface To Real World Functions
This is the one component of a conversational system
that should be written by real programmers. Only a real
program is going to access a POP3 server, or some
proprietary data structure.
WHAT IS STILL MISSING?
What ASR Technology Lacks
Remember, current ASR technology turns speech into a
telegram. It is all caps
and contains no punctuation. Yet half of what we say is
how we say it. How does ASR handle questions? If I ask
you when we can schedule a meeting, you might say
"tomorrow." You would mean <MEET_TOMORROW>. But if
you said "tomorrow?" you would mean
<QUERY_TOMORROW_OK>. ASR only gives us the word:
TOMORROW.
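To make the gap concrete, the Python sketch below shows what a CM could do if the recognizer also passed along a pitch contour. It treats a rising contour as a question, which is a deliberate oversimplification of real intonation; the function names, the per-frame pitch list, and the 15 percent threshold are all assumptions.

```python
def is_rising(pitch_hz, min_rise=1.15):
    """Crude check: does pitch end noticeably higher than it starts?

    pitch_hz is a list of per-frame pitch estimates for one utterance.
    """
    if len(pitch_hz) < 4:
        return False
    half = len(pitch_hz) // 2
    start = sum(pitch_hz[:half]) / half
    end = sum(pitch_hz[half:]) / (len(pitch_hz) - half)
    return end >= start * min_rise


def interpret(word, pitch_hz):
    # The recognizer supplies only the word; the contour adds the "how".
    if word == "TOMORROW":
        return "QUERY_TOMORROW_OK" if is_rising(pitch_hz) else "MEET_TOMORROW"
    return None
```

With only the word to go on, both meanings collapse into one; with even this crude contour test they can be told apart.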
The same problem occurs with emphasis. "Three, six,
eight?" "NO, three, SEVEN, eight." It seems so natural
to give a little context (the three and the eight) and
then to stress what was different. A child understands
this easily, but current ASR doesn't give us a clue.
That's one reason correcting phone numbers or account
numbers via voice seems so awkward in today's
applications.
Laughter, vocal pauses such as "ummm," coughs, and
background noises are always present, and must be
accounted for. It would be nice for the speech recognizer to say
"Bless you" after someone sneezes, but it is critical
that "ahh choo" must not be misunderstood to mean
<YES_DELETE>.
What TTS Technology Lacks
It lacks prosodic "warmth." Prosody is the music of speech.
It is the pitch, energy, and duration of the syllables.
From our studies, it is clearly more important that the
TTS voice speak like a human than that it
sound like a human. To get the prosody right,
the CM must convey the purpose behind the things being
said.
Conversational skills give computers a higher degree
of personality and accessibility than ever before. Done
well, speech recognition and text-to-speech can ease
many of the frustrations users experience. Clearly, the
foundation has been set for meaningful conversation with
computers. As they are refined, conversational systems
will gradually become just another wonder of technology
that is taken for granted, is as ubiquitous as a word
processor, and is one that we cannot imagine living
without.
Emmett Coin is a self-described industrial poet.
Intellivoice Communications, Inc. delivers
speech-activated communications products to consumers
throughout the communications industry. Intellivoice
also sells its products to wireline and wireless
communications companies with the goal of making
conversational voice interfaces a standard feature in
public networks. The privately held Atlanta-based
company has grown from a small organization that focused
only on customer interactive voice response applications
to an expert in conversational speech recognition
interfaces for public networks. For more information,
contact the company at 404-816-3535 or visit their Web
site at http://www.intellivoice.com/.