May/June 2002
Separating Purpose from Presentation
By Mike Terry
Separating the core of a speech application,
its purpose, from its presentation is vital for
the growth of such applications. The need for
the distinction arises from the idea of allowing
a speech application to be presented in multiple
ways and on multiple devices. In the past,
applications were created with the central logic
and the presentation details together. Even
newer technologies like XML did not address this
issue. With multimodality coming over the
horizon, this distinction will have to be
further explored.
The World Wide Web Consortium (W3C) has
already begun addressing this distinction
with the introduction of XForms, the
successor to HTML forms. While HTML forms do not
consider the modality of a device, XForms allows
for different presentations on multiple user
interfaces, all operating on the same
back end. For example, a PDA can present the
same form as a desktop computer, without
losing the original intent of the designer.
According to Bill Scholz, director of
engineering for Natural Language Speech
Solutions at Unisys, “These forms are comprised
of separate sections that describe what the form
does and how it works. This allows for
understanding what is common and what is
different among display models.”
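The separation Scholz describes can be sketched in a few lines of Python. This is not XForms syntax, only an illustration of the idea: one device-neutral description of what the form collects, and separate functions that decide how to present it. The reservation form, its field names, and both renderer functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """One item of data the form collects: the 'purpose' side."""
    name: str
    label: str
    choices: Tuple[str, ...] = ()


# Purpose: what the form collects, with no reference to any device.
# This form and its fields are hypothetical.
RESERVATION_FORM = (
    Field("city", "Destination city"),
    Field("seat", "Seat preference", choices=("aisle", "window")),
)


def render_visual(fields):
    """Presentation for a screen: one labeled input line per field."""
    lines = []
    for f in fields:
        hint = f" ({' / '.join(f.choices)})" if f.choices else ""
        lines.append(f"{f.label}{hint}: ____")
    return "\n".join(lines)


def render_voice(fields):
    """Presentation for a telephone: one spoken prompt per field."""
    prompts = []
    for f in fields:
        if f.choices:
            options = " or ".join(f.choices)
            prompts.append(f"{f.label}? Please say {options}.")
        else:
            prompts.append(f"{f.label}? Please say it now.")
    return prompts


print(render_visual(RESERVATION_FORM))
print(render_voice(RESERVATION_FORM))
```

Adding a third presentation, say for a PDA, would mean writing one more renderer; the form description itself would not change.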
The growth towards multimodality is fostering
this effort, with the newly formed SALT Forum
providing the muscle behind the push. According
to Scholz, “The growth towards multimodal will
be slow. First, we will see applications that we
can interact with by either voice or browser.
Then a new class of applications will emerge
which interact with us using multiple modes
simultaneously. For example, a user might say
‘show me on the map how to get from here to the
nearest restaurant’ or ‘show me a graph of my
stock portfolio's value over the last week’.”
There are two routes by which
multimodality can emerge. The first is to
start with a graphical user interface
(GUI) and add speech and other modes. The
second is to reverse the order, starting with a
voice user interface (VUI) and adding
visuals and graphics. Both routes aim to create a
dialog between the user and the device.
This isn’t to say that SALT, by aiming for
multimodality, is going to supersede VoiceXML or
vice versa. Scholz believes, “SALT and VoiceXML
will eventually coexist. Our industry will find
a way to merge them or provide a framework in
which they can comfortably coexist for the
benefit of the end-user.”
It is the benefit of the end user that we
must keep in mind when developing applications, for
if they are not user-friendly and relatively
easy to adopt, acceptance will be unlikely. The
W3C and its recently convened
Multimodal Working Group may be the place to
decide how to solve the multimodal problem.
However, a simple yet very important question
must be asked now: What is the best method of
attaining our dreams of multimodality? Separating
the presentation of the application from the
purpose of the application is definitely a
starting point.
Mike Terry is the editor for Speech
Technology Magazine. He can be reached at
mike@amcommpublications.com.
Speech is NOT Dialog
By Emmett Coin
CM - Conversation Management
Conversation
Management puts the emphasis on the mechanics of conversing
as opposed to just satisfying the dialog goal. Of course,
the goal is important, but a conversation that goes “with
the grain” will be judged more acceptable by a human.
Conversation is about following the grain.
ASR - Automatic Speech Recognition and TTS - Text To Speech
All of us are familiar with ASR and TTS. ASR
detects and extracts the words embedded in an utterance. It
does this by using some formal expectation (n-grams,
grammars, etc.) and selecting a best fit of one of the
expectations to the sounds in the utterance. ASR does not
know what those words mean.
TTS accepts an utterance as text and attempts to generate
an acoustic (spoken) version. As an aside, TTS does try to
understand some of the meaning of the words in order to
resolve issues of pronunciation: “Last week I read a book to
learn how to perfect the way I will read the word perfect.”
To do a good job, the TTS must distinguish verb/noun
“perfect” and past/future “read” issues.
NLU - Natural Language Understanding
Natural Language Understanding is another commonly used term. NLU
attempts to structure a text sentence so that
specific elements can be referenced logically and directly
(e.g., the adjective modifying the noun that is the subject
of the sentence). Perhaps you have seen a 1950s-era
high-school English text with a section on “sentence
diagramming.” You can think of NLU as a program that accepts
a text sentence and outputs a description of how to draw the
diagram, along with additional information about the
categories to which the words on each branch relate.
SA - Synthetic Agent
A Synthetic Agent is an agent that has
a degree of autonomy
while still having a clear purpose.
Is there a difference between speech recognition and
conversation management? The recognizer hears what was said
and then the computer just does something and responds,
right? Actually there are big differences between the
problem of deciphering the words contained in an utterance
and the problem of carrying on a conversation.
ASR is primarily a physics-bound task. Many
methodologies from the general field of signal
processing have been brought to bear on it. It is an
acoustical pattern-matching problem not too different from a
sonar-based ship identification task.
CM is a mental modeling task. Given a record of the prior
exchange of utterances, what should I say next? What types
of utterances from my conversational partner are most likely
in response to what I will say next?
Can't CM just be part of ASR?
Why can’t CM just
evolve as a smooth extension of ASR? Well, for one thing,
they are quite different things.
ASR operates in the domain of one utterance, and CM is
the realization of one specific chain of utterances out of a
large pool of potential chains. DNA defines which
amino acids are assembled linearly as beads on a string that
subsequently fold into the incredibly complex 3D objects we call
proteins. In much the same way, utterances strung together fold into
conversations. If you can bear one more analogy: ASR is to
CM as standing is to walking.
The emergent level of conversation is related to and
relies on the initial level of utterances, but it is an
entirely different kind of thing.
De-construct then generate
Another fundamental
difference is that ASR is a deductive technology and CM is
generative.
ASR decomposes an utterance against an expectation.
- It attacks a segment of sound by reducing it to
minimal elements of energy at specific times and
frequencies.
- It assembles the smallest pieces into somewhat larger
pieces (phonemes) and then into syllables or words.
- It finds the assembly that best fits for a given
segment of sound.
- And, not insignificantly, it benefits from the kinds
of expectations gleaned from the conversation
level.
CM predicts a future state.
- It relies on a history of the conversation up to the
present and anticipates potential future moves.
- The more accurate its predictions, the better the
conversation.
- It should know when to lead AND when to follow.
- A conversation merges the goals of two minds.
VoiceXML and SALT meet CM
Exactly how do
VoiceXML and SALT meet the problems of CM? And how do they
support speech technology and bridge the gap between ASR and
CM?
They succeed at encapsulation of the hardware, ASR, TTS,
telephony and other platform issues. And they bode well for
the potential of more portable voice applications. While
they don’t supply any functionality at the level of CM, they
do provide conventional programmatic control as well as a
starting place to experiment and prototype some CM support.
These CM features can be provided via separate, encapsulated
code that is accessed through JavaScript conventions.
They are very procedural and do not hide the conversation
details. They are very flat and everything can be controlled
anywhere and at any time. In fact they encourage and/or
require tinkering with even the most basic conversational
moves. Many platform vendors will continue to incorporate
their particular flavors and require developers to generate
slightly different, specific versions for each platform.
The future
Any development language that gets
bigger by getting broader but not deeper will stifle the
development of higher order CM behaviors. These styles of
languages have a growing number of low-level features, each
with a large number of options. Nature approaches complexity
by using layers and hierarchy. In order for these languages
to become truly complex, they will need to lose details, not
add them.
Three scenarios for ASR/CM in the near future:
A higher-level representation will be necessary: a system
that takes on some of the generic minutiae and universal
strategies of conversation. In the beginning it will
automate simple, yet very human behaviors such as back-channel
confirmation, greeting and departure banter, or
not-recognized gambits. Later it will represent
parameterized conversations that are built on base level
templates. For instance, domains that might discuss
information about books and about magazines might both be
based on a simpler domain that has a representation for the
conversation commonalities involving information about
printed word publications. The elements of editorial staff,
readership, and publication schedule would be layered on the
base domain about printed word publications and result in a
domain about magazine information that would inherit its
ability to talk about general literary content. The domain
for novels might add other elements to the base such as
character summary, setting, or chapters. Not only will this
make complex conversations easier to build, but also they
will have consistency as the conversation moves through
different domains.
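The layering described above maps naturally onto class inheritance. The following is a minimal sketch under that assumption, not a description of any existing CM system. The class names, the `topics` attribute, and the base topics "title" and "publisher" are hypothetical; the magazine and novel topics are the ones named in the text.

```python
class PublicationDomain:
    """Base domain: what can be discussed about any printed publication."""
    topics = ("title", "publisher", "literary content")

    @classmethod
    def all_topics(cls):
        # Walk the class hierarchy from the base layer up, collecting
        # the topics each layer contributes.
        seen = []
        for layer in reversed(cls.__mro__):
            for topic in layer.__dict__.get("topics", ()):
                if topic not in seen:
                    seen.append(topic)
        return seen

    def can_discuss(self, topic):
        return topic in self.all_topics()


class MagazineDomain(PublicationDomain):
    """Adds magazine-specific elements on top of the base domain."""
    topics = ("editorial staff", "readership", "publication schedule")


class NovelDomain(PublicationDomain):
    """Adds novel-specific elements on top of the base domain."""
    topics = ("character summary", "setting", "chapters")


magazine = MagazineDomain()
print(magazine.can_discuss("literary content"))  # inherited from the base
print(magazine.can_discuss("chapters"))          # belongs to novels only
```

Because both derived domains share one base, anything the base knows how to talk about is handled the same way in each of them, which is the consistency the scenario is after.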
There will be more autonomy for the SA. This will begin
with subtle variation of generated speech. These variations
will be constructed for the purpose of introducing novelty
and to allow the SA to use conversational techniques such as
conversational ellipsis. Ellipsis refers to the elimination
of elements that are understood at that point in the
conversation and so it improves the conversation’s
bandwidth. For example, suppose you were scheduling several time
slots for a conference room using an SA. You might hear on
the first reservation, “What time do you want to schedule
Meeting Room A for today?” On the second reservation you
might hear “What time do you want Room A?” And on the third,
“What time for Room A?” The SA based on a CM will be able to
manage that behavior without all the bother of numerous
tests and branches in a procedural representation.
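As a rough illustration of the room-scheduling example, the sketch below keeps the ellipsis logic in one place instead of scattering tests and branches through the application. It is a toy under stated assumptions: the `RoomScheduler` class and its method names are hypothetical, and a real CM would decide what is "understood" from the conversation history rather than from a simple counter.

```python
class RoomScheduler:
    """Toy prompt generator that drops already-understood elements."""

    # Prompt variants, ordered from fullest to most elliptical.
    TEMPLATES = (
        "What time do you want to schedule Meeting {room} for today?",
        "What time do you want {room}?",
        "What time for {room}?",
    )

    def __init__(self):
        # How many times each room has already been asked about.
        self._times_asked = {}

    def prompt(self, room):
        asked = self._times_asked.get(room, 0)
        self._times_asked[room] = asked + 1
        # Never go past the shortest available variant.
        level = min(asked, len(self.TEMPLATES) - 1)
        return self.TEMPLATES[level].format(room=room)


scheduler = RoomScheduler()
for _ in range(3):
    print(scheduler.prompt("Room A"))
```

The three printed prompts are the three from the example above. Asking about a different room starts again from the full prompt, since nothing about that room is yet understood.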
A CM that learns conversations may become practical.
Today most ASR engines learn phonemes and words by listening
to a human-annotated set of natural human utterances. It has
been a long time since anyone has written a program to
recognize the vowel “ah.” This may also be the most
effective way to capture the natural characteristics of real
conversation. Humans would annotate a large corpus of
natural conversations between humans. Offline analysis would
discover the patterns and compute the probabilities. Then,
using these analyses, the CM could statistically predict the
most likely conversational move that a real human would have
made.
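One very simple form this offline analysis could take is counting which conversational move follows which in the annotated corpus. The sketch below assumes each conversation has already been annotated as a sequence of move labels; the labels, the tiny corpus, and the function names are all hypothetical, and a practical system would need far more context than the single preceding move.

```python
from collections import Counter, defaultdict


def train(conversations):
    """Count how often each move follows each other move."""
    counts = defaultdict(Counter)
    for moves in conversations:
        for prev, nxt in zip(moves, moves[1:]):
            counts[prev][nxt] += 1
    return counts


def predict(counts, prev):
    """Return (most likely next move, its estimated probability)."""
    followers = counts.get(prev)
    if not followers:
        return None, 0.0
    move, n = followers.most_common(1)[0]
    return move, n / sum(followers.values())


# Hypothetical human-annotated conversations, one list of moves each.
CORPUS = [
    ["greet", "request", "confirm", "thank", "goodbye"],
    ["greet", "request", "clarify", "confirm", "goodbye"],
    ["greet", "request", "confirm", "goodbye"],
]

model = train(CORPUS)
print(predict(model, "request"))
```

In this corpus a "request" is followed by "confirm" in two of three conversations, so that is the predicted move, with an estimated probability of two thirds.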
Emmett Coin is the founder and CEO of ejTalk. He can
be reached at emmett@ejtalk.com.