"The point of philosophy is to start with something so simple as not to seem worth stating, and to end with something so paradoxical that no one will believe it." - Bertrand Russell, The Philosophy of Logical Atomism
Who are we?
ejTalk (pronounced "edge talk") is a corporation founded with the goal of advancing automated conversation by developing and improving current speech technologies to create the conversational machine.
What do we do?
Taking this conversational approach, instead of an application-based approach, is what makes ejTalk unique in the world of speech. Where most seek to create a specific product with a narrow focus merely incorporating speech technology, ejTalk broadens the scope, to discover and build the science behind conversation. We build a broad and powerful framework that makes it easier to create specific, compelling speech applications that are natural.
Judge for yourself.
Here is a brief history of past approaches so you can compare and understand for yourself what makes ejTalk unique.
There are three parts of speech that a computer must deal with in order to recognize and respond to human dialog:
1. What does it hear?
2. How does it talk?
3. What does it say?
The first two questions have been the primary focus of speech science throughout its short, 50-year existence.
What does the computer hear? For that we need a recognizer, or ASR (Automatic Speech Recognition), to transcribe our spoken words into digital bytes that a computer can read.
How does it talk? The computer needs to be able to respond, and there are two distinct approaches: recorded speech and synthesized speech (also called TTS: Text-To-Speech).
Recorded speech involves a person sitting down in front of a microphone and pre-recording words and phrases that are assembled at runtime to create the computer's spoken output. Unfortunately this approach is very constricting, requiring the person recording the voice to anticipate all possible responses that the computer may need to make, which ultimately will dramatically limit the range and variability of what it can say. So, with the exception of very simple scenarios, it is impossible for an actor to record every possible outcome to every possible conversation. In addition, recorded speech freezes the rhythm and stress of natural speech. Not only are there many things to say, but there are many ways we want to say them.
Synthetic speech is essentially a computer program that mimics the way human speech sounds. It may sound "electronic" and therefore unnatural at the moment, but TTS engines offer much more flexibility when it comes to what a computer says and how it can say it. An actor's tone may sound better to the human ear if the dialog is short and narrowly focused, but when it comes to a realistic, conversational response, TTS wins by a long shot. Ultimately it will be the only way to go.
So, at the current state of speech technology, a computer can recognize the anticipated words that are being spoken and it can talk back, but the huge issue left is: What does it say?
Currently there are very few applications out there that can react conversationally. Why? Well, speaking about things requires a certain amount of knowledge, and two of the big issues A.I. (Artificial Intelligence) faces are how to represent knowledge and how to manipulate that knowledge. Scientists and philosophers have been trying to define what intelligence is since early antiquity, and even now we are not much closer to a concrete answer. We know it has something to do with thinking, being able to reason, solve problems, judge, understand, adapt to unpredictable environments (and the list goes on), but how to go about representing these fundamental behaviors is a very big, very complicated problem.
Communication? That's where ejTalk comes in.
A computer could be imbued with all the knowledge in the world, but without the ability to communicate it would be entirely useless. At ejTalk, we are concerned with strategizing and engineering the basic principles of natural conversation. Let's say we inform a child of a basic fact: one nucleus contains one proton. Now, if you ask the child what a proton is, they will not "know" what it is, but they will be able to infer that it's inside the nucleus. The child has no comprehension of what a proton is but they can automatically associate it with a nucleus and generalize on that knowledge, something computers don't do that well at the moment.
ejTalk's focus is on engineering these basic principles of speech - the common sense of "the conversation." We're a conglomeration of scientists focused on a general approach to automated conversation. The structure of conversation is too often neglected. Most speech-based systems today use simple state machines. The strength of these state-transition systems lies in their simple and specific behavior. If an application is designed to get the current stock quotes, then there are only so many things a machine needs to recognize and say. Companies abbreviate their names with letters and stock prices are simple numbers, so technically the computer would only need to hear and respond with letters and numbers. Simple and straightforward, but the flaw is that these basic behaviors only apply to the stock quote domain. If you were to ask the same machine to give you a weather forecast, it would be lost. This is because there are no generalized conversational mechanisms that make it easy to add a new topic or behavior. A programmer would essentially have to start from scratch in order to get the computer to talk about weather.
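To make the limitation concrete, here is a minimal sketch of the kind of hand-written state machine described above. It is only an illustration: the state names, the `step` function, and the sample prices are all hypothetical and are not taken from any real product.

```python
# Hypothetical stock-quote dialog written as a simple state machine.
# The symbols and prices below are made-up sample data.
PRICES = {"IBM": 120.5, "GE": 15.25}


def step(state, utterance):
    """Handle one turn of the dialog and return (next_state, reply)."""
    if state == "ASK_SYMBOL":
        symbol = utterance.strip().upper()
        if symbol in PRICES:
            return "DONE", f"{symbol} is trading at {PRICES[symbol]}."
        # Anything that is not a known symbol sends us back to the same prompt.
        return "ASK_SYMBOL", "Sorry, which stock symbol?"
    return "DONE", "Goodbye."


print(step("ASK_SYMBOL", "ibm"))
print(step("ASK_SYMBOL", "what is the weather"))
```

The second call shows the problem: a weather question is simply treated as an unrecognized stock symbol. Supporting it would mean writing new states and new transitions by hand, because nothing in this design is shared across topics.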
So, why can't we simply increase the recognition vocabulary? If we did that, programmers would have to fill up grammars with more and more anticipated responses, which would lead to micromanagement of the conversation. Yes, the computer would be able to recognize a larger number of utterances, but managing the responses for consistency and combinatorial possibilities quickly becomes unfeasible. Can you script infinity? The real problem is not with speech recognition or speech synthesis, but with how to manage the conversation.
When humans talk to humans they use a whole collection of common sense behaviors. For example, if you tell a colleague that you're flying to Bora Bora on June 7th, 2009 and they ask when you plan on returning, you might respond with "I'll be back on the 9th." In this colloquial exchange, the context implies that you meant June 9th, 2009 without needing to say the entire date. It is theoretically possible to program a specific application to do that sort of behavior by micro-managing the conversational flow, but a much simpler approach for the developer would be to have the conversation engine know how and when to reduce the date detail. We solve this problem by manipulating a date token. The date token is passed along to the framework of the application, and the ejTalker generic dialog engine can then regulate how much of the date needs to be spoken. Delegating this detail not only makes the design and maintenance much easier, but it ensures that your application behaves consistently. As the application grows, it will continue to be of "one mind," with all the pieces fitting together instead of remaining separate and more difficult to maintain.
The example of the date token is only one part of a larger concept we at ejTalk call meta-language, which was developed as a way to abstract and organize types of information. Meta-language allows the developer to delegate natural behavior that would otherwise be expensive and tedious to implement by the current methods. Developed by Emmett Coin and JQua, the ejTalker dialog engine is an extensible architecture for managing conversation using meta-language. It is the core concept behind Cassandra.
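The idea of reducing date detail from context can be sketched in a few lines. This is not the ejTalker implementation, whose internals are not described here; the `speak_date` and `ordinal` functions are hypothetical names invented for illustration, and the sketch assumes the context is simply the last date mentioned in the conversation.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 9 -> '9th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def speak_date(target, context=None):
    """Speak only the parts of `target` that `context` does not already imply."""
    if context is not None and (target.year, target.month) == (context.year, context.month):
        return f"the {ordinal(target.day)}"       # same month and year: day only
    month_day = f"{MONTHS[target.month - 1]} {ordinal(target.day)}"
    if context is not None and target.year == context.year:
        return month_day                          # same year: month and day
    return f"{month_day}, {target.year}"          # no shared context: full date


departure = date(2009, 6, 7)
print(speak_date(departure))                      # first mention, full date
print(speak_date(date(2009, 6, 9), departure))    # return date, reduced
```

Because the reduction lives in one place, every part of an application that speaks a date would shorten it in the same way, which is the consistency benefit described above.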
If you're interested in learning more about meta-language, head on over to Her Story to see how we use it.