Speech Technology



May/June 2002


Separating Purpose from Presentation

By Mike Terry

Separating the core of a speech application, its purpose, from its presentation is vital for the growth of such applications. The need for the distinction arises from the idea of allowing a speech application to be presented in multiple ways and on multiple devices. In the past, applications were created with the central logic and the presentation details together. Even newer technologies like XML did not address this issue. With multimodality coming over the horizon, this distinction will have to be further explored.

The World Wide Web Consortium (W3C) has already begun addressing the distinction issue with the introduction of XForms. XForms are the successor to HTML forms. While HTML forms do not consider the modality of a device, XForms allow for different presentations on multiple user interfaces, while operating on the same back-end. For example, a PDA can express the same presentation as a desktop computer, without losing the original intent of the designer. According to Bill Scholz, director of engineering for Natural Language Speech Solutions at Unisys, “These forms are comprised of separate sections that describe what the form does and how it works. This allows for understanding what is common and what is different among display models.”
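
The separation described above can be illustrated with a minimal Python sketch. This is not XForms syntax; it only mimics the idea of one device-independent description of what a form collects, presented in two different ways. The names `Field`, `ORDER_FORM`, `render_visual`, and `render_voice`, as well as the example fields, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One piece of data the form collects, independent of any device."""
    name: str
    label: str


# The "purpose": what the form collects (hypothetical example fields).
ORDER_FORM = [Field("city", "Departure city"), Field("date", "Travel date")]


def render_visual(fields):
    """Present the form as text-entry lines for a screen."""
    return [f"{f.label}: [________]" for f in fields]


def render_voice(fields):
    """Present the same form as spoken prompts for a voice interface."""
    return [f"Please say the {f.label.lower()}." for f in fields]


if __name__ == "__main__":
    for line in render_visual(ORDER_FORM) + render_voice(ORDER_FORM):
        print(line)
```

Adding a new presentation in this sketch means adding a new rendering function; the description of what the form collects does not change.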

The growth towards multimodality is fostering this effort, with the newly formed SALT Forum providing the muscle behind the push. According to Scholz, “The growth towards multimodal will be slow. First, we will see applications that we can interact with by either voice or browser. Then a new class of applications will emerge which interact with us using multiple modes simultaneously. For example, a user might say ‘show me on the map how to get from here to the nearest restaurant’ or ‘show me a graph of my stock portfolio's value over the last week’.”

There are two routes by which multimodality can emerge. The first is to start with a graphical user interface (GUI) and add speech and other modes. The second is to reverse the order, starting with a voice user interface (VUI) and adding visuals and graphics. Both approaches aim to create a dialog between the user and the device.

This isn’t to say that SALT, by aiming for multimodality, is going to supersede VoiceXML or vice versa. Scholz believes, “SALT and VoiceXML will eventually coexist. Our industry will find a way to merge them or provide a framework in which they can comfortably coexist for the benefit of the end-user.”

It is the benefit of the end user that we must keep in mind when developing applications, for if they are not user-friendly and relatively easy to adopt, acceptance will be unlikely. The W3C, and its recently convened organization, the Multimodal Working Group, may be the way to decide how to solve the multimodal problem. However, a simple, yet very important question must be asked now: what is the best method of attaining our dreams of multimodality? Separating the presentation of the application from the purpose of the application is definitely a starting point.

Mike Terry is the editor for Speech Technology Magazine. He can be reached at mike@amcommpublications.com.

Speech is NOT Dialog

By Emmett Coin

CM - Conversation Management
Conversation Management puts the emphasis on the mechanics of conversing as opposed to just satisfying the dialog goal. Of course, the goal is important, but a conversation that goes “with the grain” will be judged more acceptable by a human. Conversation is about following the grain.

ASR - Automatic Speech Recognition and TTS - Text To Speech
All of us are familiar with ASR and TTS. ASR detects and extracts the words embedded in an utterance. It does this by using some formal expectation (n-grams, grammars, etc.) and selecting a best fit of one of the expectations to the sounds in the utterance. ASR does not know what those words mean.

TTS accepts an utterance as text and attempts to generate an acoustic (spoken) version. As an aside, TTS does try to understand some of the meaning of the words in order to resolve issues of pronunciation: “Last week I read a book to learn how to perfect the way I will read the word perfect.” To do a good job, the TTS must distinguish verb/noun “perfect” and past/future “read” issues.

NLU - Natural Language Understanding
Natural Language Understanding is another commonly used term. NLU attempts to structure a text sentence so that specific elements can be referenced logically and directly (e.g., the adjective modifying the noun that is the subject of the sentence). Perhaps you have seen a circa-1950s high-school English text with a section on “sentence diagramming.” You can think of NLU as a program that accepts a text sentence and outputs a description of how to draw the diagram, along with additional information about the categories that the specific words on each branch relate to.

SA - Synthetic Agent
Synthetic Agent is the concept of an agent that has a degree of autonomy while still having a clear purpose.


Is there a difference between speech recognition and conversation management? The recognizer hears what was said and then the computer just does something and responds, right? Actually there are big differences between the problem of deciphering the words contained in an utterance and the problem of carrying on a conversation.

ASR is primarily a physics-bound task. Many methodologies from the general field of signal processing have been brought to bear on it. It is an acoustical pattern-matching problem not too different from a sonar-based ship identification task.

CM is a mental modeling task. Given a record of the prior exchange of utterances, what should I say next? What types of utterances from my conversational partner are most likely in response to what I will say next?

Can't CM just be part of ASR?
Why can’t CM just evolve as a smooth extension of ASR? Well, for one thing they are quite different things.

ASR operates in the domain of one utterance, and CM is the realization of one specific chain of utterances out of a large pool of potential chains. Much as DNA defines which amino acids are assembled linearly, like beads on a string, and subsequently fold into the incredibly complex 3D objects we call proteins, utterances strung together fold into conversations. If you can bear one more analogy: ASR is to CM as standing is to walking.

The emergent level of conversation is related to and relies on the initial level of utterances, but it is an entirely different kind of thing.

De-construct then generate
Another fundamental difference is that ASR is a deductive technology and CM is generative.

ASR decomposes an utterance against an expectation.

  • It attacks a segment of sound by reducing it to minimal elements of energy at specific times and frequencies.
  • It assembles the smallest pieces into somewhat larger pieces (phonemes) and then into syllables or words.
  • It finds the assembly that best fits for a given segment of sound.
  • And, not insignificantly, it benefits from the kinds of expectations gleaned from the conversation level.
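
The best-fit step in the list above can be illustrated with a deliberately simplified Python sketch. Real recognizers score acoustic features against statistical models; here the "sound" is stood in for by a noisy text string, and similarity is measured with the standard library's `difflib.SequenceMatcher`. The function name `best_fit` and the example grammar are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def best_fit(observed, expectations):
    """Return (phrase, score) for the expected phrase most similar to the input.

    `observed` is a noisy stand-in for the sounds in an utterance;
    `expectations` plays the role of the grammar the recognizer was given.
    """
    scored = [
        (SequenceMatcher(None, observed, phrase).ratio(), phrase)
        for phrase in expectations
    ]
    score, phrase = max(scored)
    return phrase, score


if __name__ == "__main__":
    # Hypothetical grammar of three expected phrases.
    grammar = ["check my balance", "transfer funds", "speak to an agent"]
    print(best_fit("chek my ballance", grammar))
```

Note that the function only picks the closest expectation. As the article says of ASR, it has no idea what the chosen words mean.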

CM predicts a future state.

  • It relies on a history of the conversation up to the present and anticipates potential future moves.
  • The more accurate its predictions, the better the conversation.
  • It should know when to lead AND when to follow.
  • A conversation merges the goals of two minds.

VoiceXML and SALT meet CM
Exactly how do VoiceXML and SALT meet the problems of CM? And how do they support speech technology and bridge the gap between ASR and CM?

They succeed at encapsulation of the hardware, ASR, TTS, telephony and other platform issues. And they bode well for the potential of more portable voice applications. While they don’t supply any functionality at the level of CM, they do provide conventional programmatic control as well as a starting place to experiment and prototype some CM support. These CM features can be provided via separate, encapsulated code that is accessed through JavaScript conventions.

They are very procedural and do not hide the conversation details. They are very flat and everything can be controlled anywhere and at any time. In fact they encourage and/or require tinkering with even the most basic conversational moves. Many platform vendors will continue to incorporate their particular flavors and require developers to generate slightly different, specific versions for each platform.

The future
Any development language that gets bigger by getting broader but not deeper will stifle the development of higher-order CM behaviors. Languages of this style have a growing number of low-level features, each with a large number of options. Nature approaches complexity by using layers and hierarchy. In order for these languages to become truly complex, they will need to lose details, not add them.

Three scenarios for ASR/CM in the near future:

A higher-level representation will be necessary: a system that takes on some of the generic minutiae and universal strategies of conversation. In the beginning it will automate simple, yet very human behaviors such as back-channel confirmation, greeting and departure banter, or not-recognized gambits. Later it will represent parameterized conversations that are built on base-level templates. For instance, domains that discuss information about books and about magazines might both be based on a simpler domain that represents the conversation commonalities involving printed-word publications. The elements of editorial staff, readership, and publication schedule would be layered on the base domain about printed-word publications, resulting in a domain about magazine information that inherits the ability to talk about general literary content. The domain for novels might add other elements to the base, such as character summary, setting, or chapters. Not only will this make complex conversations easier to build, but the conversations will also stay consistent as they move through different domains.
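
The layering of domains described above maps loosely onto class inheritance. The following Python sketch is one possible reading, not a description of any existing CM system. The class names and the base topics "title" and "author" are hypothetical; the magazine and novel topics are taken from the paragraph above.

```python
class PublicationDomain:
    """Base domain: topics common to any printed-word publication."""

    topics = ("title", "author", "literary content")  # assumed base topics

    @classmethod
    def all_topics(cls):
        """Collect topics from the base domain down to this one, base first."""
        collected = []
        for klass in reversed(cls.__mro__):
            for topic in vars(klass).get("topics", ()):
                if topic not in collected:
                    collected.append(topic)
        return collected

    def can_discuss(self, topic):
        """True if this domain, or any domain it is layered on, covers the topic."""
        return topic in self.all_topics()


class MagazineDomain(PublicationDomain):
    """Adds magazine-specific elements on top of the base domain."""

    topics = ("editorial staff", "readership", "publication schedule")


class NovelDomain(PublicationDomain):
    """Adds novel-specific elements on top of the base domain."""

    topics = ("character summary", "setting", "chapters")


if __name__ == "__main__":
    print(MagazineDomain.all_topics())
    print(NovelDomain().can_discuss("readership"))
```

Both derived domains can talk about literary content without restating it, which is the consistency benefit the paragraph above points to.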

There will be more autonomy for the SA. This will begin with subtle variation of generated speech. These variations will be constructed to introduce novelty and to allow the SA to use conversational techniques such as conversational ellipsis. Ellipsis refers to the elimination of elements that are already understood at that point in the conversation, and so it improves the conversation’s bandwidth. For example, suppose you were scheduling several time slots for a conference room using an SA. On the first reservation you might hear, “What time do you want to schedule Meeting Room A for today?” On the second reservation you might hear, “What time do you want Room A?” And on the third, “What time for Room A?” The SA based on a CM will be able to manage that behavior without all the bother of numerous tests and branches in a procedural representation.
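
The meeting-room example above can be sketched in a few lines of Python. This is a minimal illustration that simply steps through the three prompts quoted in the article; a real CM would presumably derive the shortened forms from what has already been established in the conversation rather than from a fixed list. The class name `EllipsisPrompter` is hypothetical.

```python
# The three prompts quoted in the article, from fullest to most elliptical.
PROMPTS = [
    "What time do you want to schedule Meeting Room A for today?",
    "What time do you want Room A?",
    "What time for Room A?",
]


class EllipsisPrompter:
    """Hand out progressively shorter prompts as context accumulates."""

    def __init__(self, prompts):
        self.prompts = list(prompts)
        self.turns = 0  # how many times this question has been asked so far

    def next_prompt(self):
        # Move one step toward the shortest wording, then stay there.
        index = min(self.turns, len(self.prompts) - 1)
        self.turns += 1
        return self.prompts[index]


if __name__ == "__main__":
    prompter = EllipsisPrompter(PROMPTS)
    for _ in range(4):
        print(prompter.next_prompt())
```

The application code only asks for the next prompt; the bookkeeping that would otherwise be spread across tests and branches lives in one place.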

A CM that learns conversations may become practical. Today most ASR engines learn phonemes and words by listening to a human-annotated set of natural human utterances. It has been a long time since anyone has written a program to recognize the vowel “ah.” Learning from data may also be the most effective way to capture the natural characteristics of real conversation. Humans would annotate a large corpus of natural conversations between humans. Offline analysis would discover the patterns and compute the probabilities. Then, using these analyses, the CM could predict statistically the most likely conversational move that a real human would have made.
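
One simple way to picture this kind of offline analysis is a table of how often each conversational move follows another. The Python sketch below assumes, purely for illustration, that each annotated conversation is a list of move labels and that only the immediately preceding move matters; the labels, the tiny `CORPUS`, and the function names `train` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotated corpus: each conversation is a sequence of move labels.
CORPUS = [
    ["greet", "greet", "request", "confirm", "thank", "bye"],
    ["greet", "request", "clarify", "request", "confirm", "bye"],
    ["greet", "request", "confirm", "bye"],
]


def train(conversations):
    """Count how often each move directly follows each other move."""
    counts = defaultdict(Counter)
    for moves in conversations:
        for prev, nxt in zip(moves, moves[1:]):
            counts[prev][nxt] += 1
    return counts


def predict(counts, last_move):
    """Return (most likely next move, estimated probability), or None if unseen."""
    followers = counts.get(last_move)
    if not followers:
        return None
    move, n = followers.most_common(1)[0]
    return move, n / sum(followers.values())


if __name__ == "__main__":
    model = train(CORPUS)
    print(predict(model, "request"))
```

In this toy corpus, "request" is followed by "confirm" three times out of four, so that is the move the sketch predicts. A practical system would need far more data and a longer view of the conversation history.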

Emmett Coin is the founder and CEO of ejTalk. He can be reached at emmett@ejtalk.com.


