Communications Solutions Magazine
TMCnet.com Columnist

July 1998


Managing Conversation: Bringing a Human Touch To Speech Technology

BY EMMETT COIN

Imagine a computer that could speak and truly communicate, with feeling and the proper range of emotions. That is the ultimate goal of conversational speech technology, a goal that is closer than many people think. Conversational systems provide a functional and potentially engaging way to interact naturally with a computer, for three main reasons.

First, conversation is fast. Human speech has evolved to maximize communication while minimizing errors. Application developers argue that commands are faster when spoken rather than clicked or typed, and certainly in highly specialized and critical situations unique command-based systems do evolve naturally. But how much does one invest to become productive and efficient with a specialized system?

Second, conversation is innate. We don't need to teach this to anyone - diverse personalities can find a common conversational modality. Most children are competent conversationalists at three years of age.

Finally, conversation is inherently error correcting. In fact, it presumes an error-prone channel. Conversational partners are always on the alert for non-meaningful and contradictory exchanges. Whether for business, education, or even fun, speech is an intuitive experience for people.

BRIEF HISTORY OF SPEECH
The earliest steps toward the goal of man/machine conversation focused on speech production. Synthetic speech generation research dates back more than one hundred years. In the early 1920s, J. Q. Stewart built a machine consisting of two coupled resonances excited by electrical impulses. It could produce vowel-like sounds by tuning the resonances to different frequencies. Some speech sounds, vowels specifically, can be thought of as musical chords, so specific resonances allowed the passage of specific "notes" in the "chord." The difference between an "ah" and an "oh" is something like the difference between a C major and a D minor chord.

Jump forward to the late 1960s, when digital computers allowed research to rapidly advance, making it possible to write programs to control loudness, pitch, and resonances. Today, synthetic speech generators, which we call text-to-speech (TTS) systems, use a combination of pronunciation rules and dictionaries, and have essentially limitless vocabularies. At about the same time, digital computers made practical the progenitors of the automatic speech recognition (ASR) systems we have today. At first, simple template matches were made for patterns of those "chords" of speech. Later, genuine phonetic models that accepted compact, linguistic notations for large clusters of expected utterances were developed.

The current technology is being used in many ways. Some basic conversational applications allow users to access information, such as e-mail or stock market information, over a phone, which eliminates the need to carry a laptop. As speech and understanding technology continues to evolve, more and more applications are appearing on the market. Clearly, such interfaces could do a wide range of tasks.

How will we create all those different potential conversations? At present, these applications are "one-off" programs, each piggybacking on the one before. Should each application be a complete new program? As humans, we don't learn how to talk to a hardware store clerk and then completely relearn from scratch how to talk to a toy store clerk. What can we reuse? What are the new parts? To move forward, we need a conversation management system.

CONVERSATION MANAGEMENT SYSTEM
What does the conversation manager (CM) do? Overall, it guides the conversation. Consider one exchange, where a person talks and then a computer talks. The computer "listens" to the person with expectations of what it could hear. This is done by notifying the ASR that something within a certain context is expected.

So, an e-mail system tuned for ordinary conversation would not expect to hear a person say "Buy 200 shares of IBM at market." The system, however, might guess at the person's intention. Then, if the person concurs with that intention, the system will advance the conversation.

If something is to be done, maybe "read the next e-mail from Bill," then a real-world function must access the e-mail repository, preprocess it to deal with the e-mail-specific idiosyncrasies that would not be handled by the generic TTS system, and return the modified version of the e-mail to the CM. The CM sends the e-mail to the TTS system to be spoken and begins listening immediately, in the event that the person wishes to interrupt.
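The exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Recognition` record, the `one_exchange` function, the four callables standing in for the ASR, NLU, real-world function, and TTS, and the 0.8 acceptance threshold are all hypothetical names and values, not part of any vendor's product. Interruption (listening while speaking) is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recognition:
    text: str          # what the (stubbed) recognizer heard
    confidence: float  # hypothetical score between 0.0 and 1.0


def one_exchange(
    listen: Callable[[str], Recognition],        # stand-in for the ASR, told which context to expect
    understand: Callable[[str], Optional[str]],  # stand-in for the NLU
    act: Callable[[str], str],                   # stand-in for the real-world function
    speak: Callable[[str], None],                # stand-in for the TTS
    context: str,
    accept_threshold: float = 0.8,               # assumed value, tune per application
) -> str:
    """Run one person-talks, computer-talks exchange and report what happened."""
    heard = listen(context)
    intent = understand(heard.text)
    if intent is None:
        # Nothing in this context matched what was heard.
        speak("Sorry, I did not understand.")
        return "rejected"
    if heard.confidence < accept_threshold:
        # Guess at the intention and ask the person to concur.
        speak(f"Did you mean {intent}?")
        return "confirming"
    # Confident enough: perform the action and speak its result.
    speak(act(intent))
    return "done"
```

Passing the four components in as plain callables keeps the sketch independent of any particular recognizer or synthesizer; a real CM would also track state between exchanges.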

PARTS OF A CONVERSATION MANAGER
Conversation Plan
The conversation plan is the biggest unknown in conversational systems. A simple programmatic approach is common: A programmer writes a standard C/C++ program that does it all. It loads the ASR with a context and monitors the result status of the ASR. If a result is returned, the program decides whether to accept or reject it, and changes its state correspondingly. Then the cycle repeats. In this approach, all the subtlety of the language and the conversational manner is created and controlled by programmers. If it is to work well, the program requires a sophisticated developer with substantial linguistic and programming skill. This approach will not remain a viable solution for much longer.

Another approach is to use a generic state machine that removes the specifics of what is said and heard from the realm of C/C++ code. In this kind of system, a new level of design paradigm is created - CM programmers who may not be traditional computer programmers (who, in fact, should be linguistic experts) define a conversation, and C/C++ programmers support the generic state engine.
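A minimal sketch of the generic state machine idea follows, written in Python rather than C/C++ for brevity. The conversation lives entirely in a data table that a non-programmer could edit, while the small engine underneath stays the same for every application. The state names, prompts, intent names, and the `step` and `run` functions are hypothetical and chosen only for illustration.

```python
# The conversation definition: pure data, no program logic.
CONVERSATION = {
    "MAIN": {
        "prompt": "You have new mail. What would you like to do?",
        "transitions": {"READ_NEXT_EMAIL": "READING", "QUIT": "DONE"},
    },
    "READING": {
        "prompt": "Reading the next message.",
        "transitions": {"READ_NEXT_EMAIL": "READING", "STOP": "MAIN", "QUIT": "DONE"},
    },
    "DONE": {"prompt": "Goodbye.", "transitions": {}},
}


def step(table, state, intent):
    """Return the next state; an intent the state does not expect leaves it unchanged."""
    return table[state]["transitions"].get(intent, state)


def run(table, start, intents):
    """Feed a sequence of intents through the table and collect the prompts spoken."""
    state = start
    prompts = [table[state]["prompt"]]
    for intent in intents:
        state = step(table, state, intent)
        prompts.append(table[state]["prompt"])
    return state, prompts
```

Calling `run(CONVERSATION, "MAIN", ["READ_NEXT_EMAIL", "STOP", "QUIT"])` walks from the main menu into reading, back out, and on to the goodbye state. Changing the conversation means editing the table, not the engine.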

More sophisticated methods involve logic programming and reasoning systems. In these systems, the conversation is guided by rules. As conditions in the conversation change, the rules are reevaluated against various propositions (such as <READ_NEXT_EMAIL>), and some result determines the action. It could be simple: Just to read it or not. It could be more complicated: If it is found, and not from Bill, and not more than one minute long, and not already forwarded from someone else, and it is very likely that this was correctly recognized, then read it.
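The more complicated rule above can be written out as a short Python sketch. It is a hand-coded conjunction rather than a real logic programming or reasoning engine, and the `Email` record, its field names, the `should_read` function, and the 0.9 confidence threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    sender: str
    spoken_seconds: float  # estimated time to read the message aloud
    is_forwarded: bool     # already forwarded from someone else


def should_read(
    email: Optional[Email],
    recognition_confidence: float,
    min_confidence: float = 0.9,  # assumed cutoff for "very likely recognized correctly"
) -> bool:
    """Evaluate the <READ_NEXT_EMAIL> proposition against the current conditions."""
    if email is None:
        # The message was not found, so no other rule applies.
        return False
    rules = [
        email.sender != "Bill",                    # not from Bill
        email.spoken_seconds <= 60,                # not more than one minute long
        not email.is_forwarded,                    # not already forwarded
        recognition_confidence >= min_confidence,  # request was very likely heard correctly
    ]
    return all(rules)
```

Keeping the conditions in a list makes it easy to see which rule blocked an action, which is the kind of question a rule-based CM needs to answer when it explains itself to the user.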

Automatic Speech Recognition (ASR)
ASR listens to and transcribes what the person is saying. This is technology that can be bought off-the-shelf today - many vendors offer some kind of speech recognition package. While they are not all interchangeable, they are remarkably similar. Of all the phonetic, grammar-based recognizers, the vast majority are solidly based on the Hidden Markov Model (HMM) methodology. Some key differences between the products are telephone versus microphone models, host- and/or DSP-based models, models with dynamic word lists at run time, and models that dynamically include uttered elements in grammar.

A big distinction for a telephony model is the quality of recognition over the network. As humans, we use a great deal of contextual and common sense information to compensate for the dramatic loss of acoustic detail. For example, how often do you confuse words that have "s," "f," "sh," or "ch" sounds in them? These all have high-frequency components that are removed by the phone system. This is one reason that dictation over the phone is still impractical.

The high-density telephony solutions usually include a hardware assist for some or all of the recognition task. Some solutions have a proprietary card set, while others take advantage of hardware such as Dialogic's Antares DSP board.

Natural Language Understanding (NLU)
NLU needs to know what the person meant. This is a transform function that converts any of the ways that the user population might say something to the computer into a class of action. "Gimme the next one" and "Would you read the next e-mail, please" are equivalent at some points in an e-mail reader application. At those points these and a myriad of other phrases must resolve to <READ_NEXT_EMAIL>.
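As a rough illustration of this transform, the Python sketch below maps several phrasings onto one class of action using regular expressions. Real NLU components are far richer than this; the pattern list and the `normalize` and `understand` function names are hypothetical, and the patterns cover only a handful of the myriad phrases a real application would need.

```python
import re
from typing import Optional

# Hypothetical patterns for one point in an e-mail reader conversation.
INTENT_PATTERNS = {
    "<READ_NEXT_EMAIL>": [
        r"\b(gimme|give me|read|play|get)\b.*\bnext\b",
        r"^next( one| e-?mail| message)?$",
    ],
}


def normalize(utterance: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = utterance.lower()
    text = re.sub(r"[^a-z0-9\- ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def understand(utterance: str) -> Optional[str]:
    """Return the class of action for an utterance, or None if nothing matches."""
    text = normalize(utterance)
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return None
```

Both "Gimme the next one" and "Would you read the next e-mail, please" resolve to `<READ_NEXT_EMAIL>` here, while an out-of-context request such as "Buy 200 shares of IBM at market." returns `None`.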

Context Tracker (For ASR And NLU)
The context tracker is usually tied in tightly to the conversation plan, but this element is worth pointing out since it is key to any conversational player, human or machine. Let's say a person is talking about shoes. Are they brake shoes for a car or bowling shoes?

The context tracker first needs to answer the question: What are we talking about right now? Because the CM is engineered, we must consider the trade-offs. One important trade-off is that the more the system attempts to "expect" to hear, the more likely it is to misunderstand.
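The shoes example can be sketched as a tiny context tracker in Python. The cue words, the sense table, and the `ContextTracker` class are invented for illustration; a real tracker would be tied into the conversation plan and would also narrow what the ASR expects to hear.

```python
# Hypothetical cue words that reveal what the conversation is about.
TOPIC_CUES = {
    "car": "automotive",
    "brake": "automotive",
    "bowling": "bowling",
    "lane": "bowling",
}

# Hypothetical senses of an ambiguous word, keyed by topic.
SENSES = {
    "shoes": {"automotive": "brake shoes", "bowling": "bowling shoes"},
}


class ContextTracker:
    """Remember the current topic and use it to resolve ambiguous words."""

    def __init__(self):
        self.topic = None  # "What are we talking about right now?"

    def observe(self, utterance):
        """Update the topic if the utterance contains a known cue word."""
        for word in utterance.lower().split():
            if word in TOPIC_CUES:
                self.topic = TOPIC_CUES[word]

    def resolve(self, word):
        """Return the sense of a word in the current context, or None if unknown."""
        return SENSES.get(word.lower(), {}).get(self.topic)
```

When `resolve` returns `None`, the CM has no basis for a guess and should ask the person rather than assume, which reflects the trade-off noted above: a narrower expectation misunderstands less often.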

Text-To-Speech (TTS)
TTS generates spoken replies to a human query. In conversational systems, the computer will speak with its own voice (TTS) and not a patchwork quilt of human recordings. In an e-mail reader application, it is obvious that we must use TTS. As applications tackle more tasks that manipulate larger amounts of data, recorded prompts and responses become unwieldy at best (expensive, time consuming, hard to maintain, limited voice model availability, etc.).

These recording issues, however, will soon become irrelevant. First, TTS systems are getting better all the time. In addition, some amazing things can be done with prosody transplantation (imposing real human intonation on generated TTS).

Interface To Real World Functions
This is the one component of a conversational system that should be written by real programmers. Only a real program is going to access a POP3 server, or some proprietary data structure.

WHAT IS STILL MISSING?
What ASR Technology Lacks

Remember, current ASR technology turns speech into a telegram. It is all caps and contains no punctuation. Yet half of what we say is how we say it. How does ASR handle questions? If I ask you when we can schedule a meeting, you might say "tomorrow." You would mean <MEET_TOMORROW>. But if you said "tomorrow?" you would mean <QUERY_TOMORROW_OK>. ASR only gives us the word: TOMORROW.
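To make the distinction concrete, here is a hedged Python sketch of what a CM could do if the recognizer also reported a pitch contour. That is an assumption: as noted above, current ASR returns only the word. The list of pitch samples in hertz, the 1.1 rise ratio, and the `is_rising` and `interpret` functions are hypothetical.

```python
def is_rising(pitch_hz, min_rise_ratio=1.1):
    """True if the utterance ends noticeably higher in pitch than it began."""
    if len(pitch_hz) < 2:
        return False
    return pitch_hz[-1] >= pitch_hz[0] * min_rise_ratio


def interpret(word, pitch_hz):
    """Combine the recognized word with its (assumed) pitch contour."""
    if word.upper() != "TOMORROW":
        return None
    # A rising contour is treated as a question, a flat or falling one as a statement.
    return "<QUERY_TOMORROW_OK>" if is_rising(pitch_hz) else "<MEET_TOMORROW>"
```

With a falling contour such as `[120, 118, 110]` the word resolves to `<MEET_TOMORROW>`; with a rising one such as `[120, 130, 160]` it resolves to `<QUERY_TOMORROW_OK>`. Real question intonation is more varied than a simple end-versus-start comparison.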

The same problem occurs with emphasis. "Three, six, eight?" "NO, three, SEVEN, eight." It seems so natural to give a little context (the three and the eight) and then to stress what was different. A child understands this easily, but current ASR doesn't give us a clue. That's one reason correcting phone numbers or account numbers via voice seems so awkward in today's applications.
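If a recognizer did mark which words were stressed, the correction above could be applied mechanically. The Python sketch below assumes such stress flags exist, which current ASR does not provide; the `apply_correction` function and its input format are hypothetical. The unstressed digits act as context to locate the spot, and the stressed digits replace what was there.

```python
def apply_correction(original, correction):
    """Apply a spoken correction to a digit sequence.

    original:   list of digits as first understood, e.g. [3, 6, 8]
    correction: list of (digit, stressed) pairs, e.g. [(3, False), (7, True), (8, False)]
    Returns the corrected list, or None if the context cannot be located.
    """
    if not any(not stressed for _, stressed in correction):
        # With no unstressed context there is nothing to anchor the correction.
        return None
    n = len(correction)
    for offset in range(len(original) - n + 1):
        window = original[offset:offset + n]
        # The unstressed digits must match the original at this position.
        context_matches = all(
            digit == heard
            for (digit, stressed), heard in zip(correction, window)
            if not stressed
        )
        if context_matches:
            fixed = list(original)
            for i, (digit, stressed) in enumerate(correction):
                if stressed:
                    fixed[offset + i] = digit
            return fixed
    return None
```

For a longer number, `apply_correction([5, 5, 5, 3, 6, 8, 2], [(3, False), (7, True), (8, False)])` finds the "three ... eight" context in the middle and swaps the six for a seven, leaving the rest untouched.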

Laughter, vocal pauses such as "ummm," coughs, and background noises are always present, and must be accounted for. It would be nice for a speech recognizer to say "Bless you" after someone sneezes, but it is critical that "ahh choo" is not misunderstood to mean <YES_DELETE>.
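One defensive pattern, sketched below in Python, is to discard non-speech tokens before interpretation and to require an explicit, unambiguous "yes" before any destructive action. The token spellings in `NON_SPEECH` and the `meaningful_words` and `confirms_delete` functions are hypothetical; whether a given recognizer labels coughs or fillers at all is an assumption.

```python
# Hypothetical labels a recognizer might emit for fillers and noise.
NON_SPEECH = {"um", "ummm", "uh", "ahh", "choo", "[cough]", "[laughter]", "[noise]"}


def meaningful_words(tokens):
    """Drop fillers and noise labels, keeping only lowercase content words."""
    return [t.lower() for t in tokens if t.lower() not in NON_SPEECH]


def confirms_delete(tokens):
    """Only a lone, explicit 'yes' may resolve to <YES_DELETE>."""
    return meaningful_words(tokens) == ["yes"]
```

Under this sketch a sneeze transcribed as `["ahh", "choo"]` confirms nothing, while `["ummm", "yes"]` still counts as a confirmation.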

What TTS Technology Lacks
It lacks prosodic "warmth." Prosody is the music of speech: the pitch, energy, and duration of the syllables. From our studies, it is clearly more important that the TTS voice speak like a human than that it sound like a human. To get the prosody right, the CM must convey the purpose behind the things being said.

Conversational skills give computers a higher degree of personality and accessibility than ever before. Done well, speech recognition and text-to-speech can ease many of the frustrations users experience. Clearly, the foundation has been set for meaningful conversation with computers. As they are refined, conversational systems will gradually become just another wonder of technology that is taken for granted, is as ubiquitous as a word processor, and is one that we cannot imagine living without.

Emmett Coin is a self-described industrial poet. Intellivoice Communications, Inc. delivers speech-activated communications products to consumers throughout the communications industry. Intellivoice also sells its products to wireline and wireless communications companies with the goal of making conversational voice interfaces a standard feature in public networks. The privately held Atlanta-based company has grown from a small organization that focused only on customer interactive voice response applications to an expert in conversational speech recognition interfaces for public networks. For more information, contact the company at 404-816-3535 or visit their Web site at http://www.intellivoice.com/.

 


Technology Marketing Corporation, One Technology Plaza, Norwalk, CT 06854 USA
Ph: 800-243-6002, 203-852-6800; Fx: 203-853-2845
General comments: tmc@tmcnet.com. Comments about this site: webmaster@tmcnet.com.
© Technology Marketing Corp. 1997-2001
