
Soul

Soul Searching and Synthetics

There has been much talk lately about synthetic human-like entities. The topic has come up in connection with some recent films (A.I. and Final Fantasy) and in recent press releases announcing computer-generated speech technology. Most of us are intrigued by copies of ourselves. Some have implied that we are “almost there” when it comes to synthesizing a virtual human with the computer. The truth is that these technologies are currently much more like puppeteering. There is little chance that these kinds of creations will fool any observers soon. And if something does approach that quality, then (by some quick back-of-the-envelope calculations) it will be more costly than “generating” twenty or thirty real people the old-fashioned way (birth, nurture, shelter, education, etc.).

The two most important sensory interfaces for a virtual human are sight and sound (a la Max Headroom). There are surprisingly parallel issues between the visual and vocal aspects of these synthetic beings:

The static representation is good (still photos of Aki in Final Fantasy, or a benign phrase from the AT&T speech synthesizer). In a still frame Aki looks pretty good. Watch her walk, too; it often looks reasonably natural. That’s because it is natural. It isn’t synthetic. It’s human! The CGI artists digitized humans walking around with markers on key body points (think of where a puppeteer might tie strings), tracked those markers in 3D, and then applied the motion to the static body models. They had to, because we are very sensitive to any atypical movement. Even Disney animators do this kind of “tracing” of human dynamics. Consider an animator as a kind of synthesizer of moving characters. The animator is capable of drawing the necessary motion. But in important emotional moments, where the movement is telling us a lot, animators have resorted to tracing human movement (e.g. the eerily natural solo by the Little Mermaid). I do not say this to diminish the superb talents of animators; it is just to illustrate the exceptional sensitivity of all of us as viewers.

Concerning the human voice, AT&T’s text-to-speech (TTS) is quite good. In my opinion it is currently the best of the breed. (Since I devote my time and energy to human-computer conversational systems, I feel qualified to say this.) Its quality is very human in the same sense that a static image can be quite human. But half, or maybe more, of what we say is how we say it. Most voice impersonators don’t so much sound like their target celebrity as speak like them. Linguists call this prosody, the music of speech. Why is my Hamlet boring and hard to comprehend, but Kenneth Branagh’s clear and engaging? The words are exactly the same.

There can be some degree of manual control (strings to pull) over the prosody of a TTS synthesizer, but as in synthetic visuals, even the most basic modifications seem odd unless they are based on a template of real human dynamics. A technique called prosody transplantation, which is roughly equivalent to digitizing the prosodic movement of a specific human utterance, can be used to reshape (re-sing) the purely synthesized version of the exact same words. But this dynamic of speech is still very far from being synthesizable. The prosody tells the listener how the speaker feels about what was said. That’s why the original Star Trek ship’s computer was not terribly menacing, even when it was going awry. By the way she spoke, we subliminally presumed that she operated on facts. But by the way HAL spoke, we suspected that he operated from beliefs. HAL meant what he said. We have evolved to notice things like that.

Capturing the visual and/or vocal qualities of a celebrity is certainly a substantial accomplishment. But this is far from synthesizing a satisfying replica of that celebrity. With the state of the art today, an actor or impersonator will have to put on the celebrity’s “technology costume” and mimic the body language and speaking style in order to create a convincing portrayal.

It will be a while before we have a convincing synthetic F.D.R. fireside chat. His excellent baritone (remember, speech is music) convinced us by the way he said things: the nuance, the timing, the emphasis or de-emphasis. Intra- and inter-word modulations tell us everything. Also, the real speaker’s choice of words and gestures is as important as their delivery. Imagine a speech “performed” by Rosie O’Donnell that is re-performed by a synthetic Judy Woodruff (Judy’s voice and face, Rosie’s phrasing and body language). If we never saw the real Judy it might be fine, but an intercut sequence of the real and synthetic Judy will always make for jarring transitions. Much of what we are is dynamic. In a very literal sense, actions (and prosody) do speak louder than words. These subtle behaviors coordinate to signal to others who we are and what we are thinking.

In A.I., Spielberg attempted to show that nuance is a major, if not the major, issue of believability. The more specialized robotic mechas had a flatter feel to their dialog and movement. We couldn’t quite sense what they were “thinking.” We may have concluded that they weren’t “thinking.” Spielberg knew that as humans we would effortlessly notice very simple, very subtle differences, just as we can tell from the way a friend is walking that the blister on their right foot is almost healed, or whether that friend’s remark about missing the dinner party last week is sad or maybe only wistful, even though the words would be exactly the same.

Please don’t think that I’m belittling this technology, since synthetic conversationalists are the core of my chosen profession. I really love this stuff! However, it is very apparent that synthesizing the magic of humanity will take a while for both the visual and linguistic components.

Emmett Coin
July 12, 2001




November 18, 2003 Copyright © 1997-2003 ejTalk