Why Should I Use Speech in My Application?

Typically, mention of speech recognition or text-to-speech brings to mind the computer on Star Trek or HAL from the movie 2001: A Space Odyssey. The computer's synthesized voice sounds like a human voice. The speech recognition is even better: it never listens in on conversations between people, and when given a command, the computer never makes a mistake. It's great. If computers had perfect speech recognition and text-to-speech like that today, everyone would be talking to computers. Unfortunately, it will be decades before speech technology is that good.

Even though speech technology isn't perfect, it's still useful in applications. Unfortunately, because some people expect more from speech recognition and text-to-speech than what the technology can deliver, they dismiss the technology outright. I will describe some reasonable expectations for speech technology and show you what you can accomplish with its current capabilities.

Why Should I Use Speech If It's Not Perfect?

Computer-based videos were introduced in multimedia PCs around 1991. They were a cool idea, but when the videos first appeared, they were really small. A 386/33 megahertz machine could support a video of about 80x50 pixels at 15 frames per second (fps) with low-quality sound. Since then, computer videos have improved significantly. They're now 320x200 pixels at 24 fps, conveying roughly 25 times as much data per second as in 1991. These improvements in resolution and frame rate are still small compared to the ultimate video experience: a digital surround-sound movie. Real movies have about 4000x2000-pixel resolution, 24-bit color, and 4 channels of 44 kHz, 16-bit audio. That's several hundred times the data rate and processor speed of the videos you get on today's PCs.

So what does this have to do with speech recognition and text-to-speech?

Even though computer videos are significantly inferior to movies, many applications benefit from computer video technology, especially games and multimedia titles. Videos are so necessary to some applications that they wouldn't be able to function without them--even though computer videos are light years away from a real movie.

Speech recognition and text-to-speech are new technologies, just like computer videos. Although speech technologies aren't nearly as good as those on "Star Trek," many applications can get good use out of them.

Does Speech Replace the Keyboard and Mouse?

You might now be asking: If speech isn't perfect, how can I use it for my application? After all, speech is replacing the keyboard and mouse, and the keyboard never makes a mistake.

Don't worry.

Speech recognition will not replace the keyboard and the mouse, nor will text-to-speech eliminate text. Speech recognition and text-to-speech are just two more user interface "devices" available to an application developer. Speech recognition can be added to an already long list of user-interface "devices" such as the keyboard, mouse, joystick, and pen. Likewise, text-to-speech can be added to text, graphics, animated videos, and sound. Very few applications will use speech as their sole means of communication with users. Most will mix and match according to the strengths of each device.

If you think about it, applications already use several user-interface devices to communicate with users. A game makes use of both a joystick and keyboard. Users manipulate the joystick to "tell" the computer which direction they want their character to move. The keyboard is used to type in commands, such as "Say 'Hello.'" The joystick is a better input device for movement, while the keyboard is better for entering text. The game's choice of output "devices" works the same way. The application uses a combination of pictures, text, and sound to communicate with users. When an enemy aircraft blows up, a game doesn't draw "Boom!" in large letters. Instead, the game plays a recording of an explosion.

Speech is just another "device." As such, it has its place among other widely used devices.

What Can Speech Do?

Now that I've described reasonable expectations of speech technology, I'll tell you a bit about the current capabilities of text-to-speech and speech recognition.

Text-to-speech comes in two flavors: synthesized text-to-speech and concatenated text-to-speech.

Synthesized speech is what people typically think of when I mention text-to-speech. It reads text by analyzing the words and having the computer figure out the phonetic pronunciations for the words. The phonemes are then passed into a complex algorithm that simulates the human vocal tract and emits the sound. This method allows the text-to-speech to speak any word, even made-up ones like "Zamphoon," but it produces a voice that has very little emotion and is distinctly not human. You'd use this if you knew that the application had to speak, but you couldn't predict what it would need to say. Synthesized speech usually requires a 486/33 megahertz machine with 1 megabyte of working-set RAM.

Concatenated text-to-speech does something different. It analyzes the text, pulls recordings of words and phrases out of a prerecorded library, and concatenates the digital audio recordings. Because the voice is just a recording that you've made, it sounds good. Unfortunately, if the text includes a word or phrase that you didn't record, the text-to-speech can't say it. Concatenated text-to-speech can be viewed as a form of audio compression because words or common phrases have to be recorded only once. For example, many telephone applications will have a recording for, "Press 1 to play new message; press 2 to send a fax," and so on, and another recording for, "Press 1 to fast-forward; press 2 to rewind." A concatenated text-to-speech system will have only one recording of "press" rather than four. If concatenated text-to-speech doesn't seem that much different to you from recording your own .WAV files, you're right. However, concatenated text-to-speech will save you development time and bugs, allowing you to add more features to your software. Because concatenated text-to-speech just plays a .WAV file, it takes very little processor power and only a bit of memory, since most of the audio is stored on disk.
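The prompt-sharing idea can be sketched in a few lines of Python. This is a toy illustration with invented file names, not any real engine's API: each word or phrase maps to a single prerecorded clip, a prompt is spoken by concatenating clips, and anything that was never recorded simply cannot be said.

```python
# Toy sketch of concatenated text-to-speech (hypothetical library and file
# names).  Note that "press" is recorded only once, however many prompts
# use it -- the "audio compression" effect described above.
LIBRARY = {
    "press": "press.wav",
    "one": "one.wav",
    "two": "two.wav",
    "to play new messages": "play.wav",
    "to rewind": "rewind.wav",
}

def concatenate(prompt):
    """Greedily match the longest recorded phrase at each position."""
    words = prompt.lower().split()
    clips, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):      # try longest phrase first
            phrase = " ".join(words[i:j])
            if phrase in LIBRARY:
                clips.append(LIBRARY[phrase])
                i = j
                break
        else:
            # A word that was never recorded cannot be spoken at all.
            raise KeyError(f"no recording for: {words[i]!r}")
    return clips

print(concatenate("Press one to play new messages"))
# -> ['press.wav', 'one.wav', 'play.wav']
```

A real engine would of course splice audio samples rather than return file names, but the lookup-and-concatenate structure is the same.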

Speech recognition is somewhat more complicated to classify than text-to-speech. Each speech recognition engine has three characteristics:

  1. Continuous vs. discrete: If speech recognition is continuous, users can speak to the system naturally. If it's discrete, users need to pause between each word. Obviously, continuous recognition is preferred over discrete recognition, but continuous recognition takes more processing power.
  2. Vocabulary size: Speech recognition can support a small or large vocabulary. Small-vocabulary recognition allows users to give simple commands to their computers. To dictate a document, the system must have large-vocabulary recognition. Large-vocabulary recognition takes a lot more processor power and memory than small-vocabulary recognition.
  3. Speaker dependency: Speaker-independent speech recognition works right out of the box, while speaker-dependent systems require that each user spend about 30 minutes training the system to his or her voice.

Although any combination of the three characteristics is possible, two combinations are popular today.

"Command and Control" speech recognition is continuous, small vocabulary, and speaker independent. This means that users can use several hundred different commands or phrases. If a user says a command that is not in the list, the speech-recognition system will return either "not recognized," or will think it heard a similar-sounding command. Because users of Command and Control can say only specific phrases, the phrases must be either visible on the screen--so intuitive that all users will know what to say--or the users must learn what phrases they can say. Command and Control speech recognition requires a 486/66 megahertz machine with 1 megabyte of working-set RAM.

"Discrete Dictation" speech recognition is discrete, large vocabulary, and speaker dependent. It's used to dictate text into word processors or mail, or for natural-language commands. Although users may say anything they wish, they must leave pauses between words, making the speech unnatural. Discrete dictation requires a Pentium/60 megahertz machine with 8 megabytes of RAM.

Sample Uses for Speech Technologies

It's impossible to say, "Thou shalt use speech to do XXX." Not only is every application different, but speech technology is so new that application writers haven't had much experience with it. As a general rule, use speech when it provides significant advantages to users, not just for the "Wow!" effect. Try to find uses that make the application easier to use, or allow users to access features more quickly than with the keyboard or mouse. In the case of games, use speech to add realism to the interaction with characters, making the game more involving and fun.

Telephony applications benefit most from speech recognition and text-to-speech because their sole means of communicating with users is through the telephone audio. Text-to-speech is not only easier to use than manually recording all of the prompts, but it also enables new features that allow arbitrary text to be read, such as names, addresses, and e-mail. Speech recognition is a great substitute for touch-tone menus not only because it's more natural and flexible, but also because many users don't have touch-tone phones. Several engine vendors have phone numbers you can call for a demonstration of telephony applications.

Games, especially adventure games, are the next beneficiaries of speech. Any game in which the user "talks" with characters will enjoy the fruits of speech. Speech recognition allows users to really talk to the characters, which is a lot more fun than typing in a response. Rather than just printing out what characters "speak," use text-to-speech to really make them speak. Not only does this increase the realism of the game, but it allows users to take their hands off the keyboard, sit back in their chairs, and enjoy the game.

Multimedia titles can use concatenated text-to-speech for recording audio passages, treating it like improved audio compression. They can also incorporate speech recognition so that users can sit back and relax, just as they can with games.

Of course, there are other applications that can use speech. For example, some applications use speech because they require that the user's hands be free for other things, or because the user can't see the screen.

Microsoft Speech API Documentation

This was just a brief overview of what you can do with speech technology. The material was taken from Intro.doc, which is part of the documentation included with the Microsoft Speech SDK. For more detail, download the API and continue reading the Microsoft Speech API documentation.

Overview of Speech Technologies

Speech recognition is the ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker. Text-to-speech is the ability of a computer to convert text information into synthetic speech output.

Speech recognition and text-to-speech use engines, which are the programs that do the actual work of recognizing speech or playing text. Most speech-recognition engines convert incoming audio data to engine-specific phonemes, which are then translated into text that an application can use. (A phoneme is the smallest structural unit of sound that can be used to distinguish one utterance from another in a spoken language.) A text-to-speech engine performs the same process, in reverse. Engines are supplied by vendors that specialize in speech software; they may be bundled with new audio-enabled computers and sound cards, purchased separately, or licensed from the vendor.

The speech-recognition engine transcribes audio data received from an audio source, such as a microphone or a telephone line. The text-to-speech engine converts text to audio data, which is sent to an audio destination, such as a speaker, a headphone, or a telephone line. Under some circumstances, an engine may be able to transcribe audio data to or from a file.

An engine typically provides more than one mode for recognizing speech or playing text. For example, a speech-recognition engine will have a mode for each language or dialect that it can recognize. Likewise, a text-to-speech engine will have a mode for each voice, which plays text in a different speaking style or personality. Other modes may be optimized for a particular audio sampling rate, such as 8 kilohertz (kHz) for use over a telephone line.

Speech recognition can be as simple as a predefined set of voice commands that an application can recognize. More complex speech recognition involves the use of a grammar, which defines a set of words and phrases that can be recognized. A grammar may use rules to predict the most likely words to follow the word just spoken, or it may define a context that identifies the subject of dictation and the expected style of language.

Both speech-recognition and text-to-speech engines may make use of a pronunciation lexicon, which is a database of correct pronunciations for words and phrases to be recognized or played.

An engine’s approach to recognizing speech or playing text determines the quality of speech in an application — that is, the accuracy of recognition or clarity of playback — and the amount of effort required from the user to get good accuracy or clarity. The engine’s approach also affects the processor speed and memory required by an application; it may also influence the application’s features or the design of its user interface.

Speech Recognition

This section provides an overview of speech technology. An understanding of both speech recognition and text-to-speech will help you decide how to best incorporate speech in your application and how to choose a technology that supports what you want to do.

All speech recognition involves detecting and recognizing words. Most speech-recognition engines can be categorized by the way in which they perform these basic tasks:

· Word separation. The degree of isolation between words required for the engine to recognize a word.

· Speaker dependence. The degree to which the engine is restricted to a particular speaker.

· Matching techniques. The method that the engine uses to match a detected word to known words in its vocabulary.

· Vocabulary size. The number of words that the engine searches to match a word.

Word Separation

Speech-recognition engines typically require one of the following types of verbal input to detect words:

· Discrete speech. Every word must be isolated by a pause before and after the word — usually about a quarter of a second — for the engine to recognize it. Discrete speech recognition requires much less processing than word-spotting or continuous speech, but it is less user-friendly.

· Word-spotting. A series of words may be spoken in a continuous utterance, but the engine recognizes only one word or phrase. For example, if a word-spotting engine listens for the word "time" and the user says "Tell me the time" or "Time to go," the engine recognizes only the word "time."

Word-spotting is used when a limited number of commands or answers are expected from the user and the way that the user speaks the commands is either unpredictable or unimportant.

Continuous speech recognizers can also be used to do word-spotting.

· Continuous speech. The engine encounters a continuous utterance with no pauses between words, but it can recognize the words that were spoken.

Continuous speech recognition is the best technology from a usability standpoint, because it is the most natural speaking style for human beings. However, it is the most computationally intensive because identifying the beginning and ending of words is very difficult — much like reading printed text without spaces or punctuation.
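The word-spotting behavior described above can be sketched in a few lines of Python. This is a toy illustration, not a real recognizer: the "engine" scans a continuous utterance and reports only the first keyword it was listening for, ignoring everything else the user said.

```python
def spot(utterance, keywords):
    """Return the first keyword heard anywhere in a continuous utterance,
    or None if no keyword was spoken.  A real engine works on audio, not
    text, but the selective-listening behavior is the same."""
    for word in utterance.lower().split():
        if word in keywords:
            return word
    return None

print(spot("Tell me the time", {"time", "day"}))  # -> time
print(spot("Time to go", {"time", "day"}))        # -> time
```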

Dependence on the Speaker

Speech-recognition engines may require training to recognize speech well for a particular speaker, or they may be able to adapt to a greater or lesser extent. Engines can be grouped into these categories:

· Speaker-dependent. The engine requires the user to train it to recognize his or her voice. Training usually involves speaking a series of pre-selected phrases. Each new speaker must perform the same training.

Some speaker-dependent engines work without training, but their accuracy usually starts below 95 percent and does not improve until the user completes the training. This technique takes the least amount of processing, but it is frustrating for most users because the training is tedious, taking anywhere from five minutes to several hours.

· Speaker-adaptive. The engine trains itself to recognize the user’s voice while the user performs ordinary tasks. Accuracy usually starts at about 90 percent, but rises to more acceptable levels after a few hours of use.

Two considerations must be taken into account with speaker-adaptive technology. First, the user must somehow inform the engine when it makes a mistake so that it does not learn based on the mistake. Second, even though recognition improves for the individual user, other people who try to use the system will get higher error rates until they have used the system for a while.

· Speaker-independent. The engine starts with an accuracy above 95 percent for most users (those who speak without accents). Almost all speaker-independent engines have training or adaptive abilities that improve their accuracy by a few more percentage points, but they do not require the use of such training. Speaker-independent systems require several times the computational power of speaker-dependent systems.

Matching Techniques

Speech-recognition engines match a detected word to a known word using one of these techniques:

· Whole-word matching. The engine compares the incoming digital-audio signal against a prerecorded template of the word. This technique takes much less processing than subword matching, but it requires that the user (or someone) prerecord every word that will be recognized — sometimes several hundred thousand words. Whole-word templates also require large amounts of storage (between 50 and 512 bytes per word) and are practical only if the recognition vocabulary is known when the application is developed.

· Subword matching. The engine looks for subwords — usually phonemes — and then performs further pattern recognition on those. This technique takes more processing than whole-word matching, but it requires much less storage (between 5 and 20 bytes per word). In addition, the pronunciation of the word can be guessed from English text without requiring the user to speak the word beforehand.
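The storage difference between the two techniques can be made concrete with a toy sketch (illustrative only, with an invented vocabulary; real engines perform statistical pattern matching on audio, not exact comparisons). In subword matching, each vocabulary word is stored as a short phoneme spelling rather than as a full audio template:

```python
# Toy illustration of subword matching: the engine detects a phoneme
# sequence, then looks for a vocabulary word whose stored phoneme
# spelling matches.  Storing a handful of phonemes per word is why
# subword systems need far less storage than whole-word audio templates.
VOCAB = {
    "time": ("t", "ay", "m"),
    "day":  ("d", "ey"),
}

def match_subword(detected_phonemes):
    for word, spelling in VOCAB.items():
        if tuple(detected_phonemes) == spelling:
            return word
    return None

print(match_subword(["t", "ay", "m"]))  # -> time
```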

Vocabulary Size

Speech-recognition engines typically support several different sizes of vocabulary. Vocabulary size does not represent the total number of words that a given engine can recognize. Instead, it determines the number of words that the engine can recognize accurately in a given state, which is defined by the word or words that were spoken before the current point in time. For example, if an engine is listening for "Tell me the time," "Tell me the day," and "What year is it?" and the user has already said "tell," "me," and "the," the current state has these two words: "time" and "day."
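The notion of a state can be sketched directly from the example above (a toy illustration, not a real grammar engine): given the phrases the engine is listening for and the words spoken so far, the state is the set of words that can legally come next.

```python
# Toy sketch of grammar state: the words the engine must distinguish
# at this moment, given what has already been spoken.
PHRASES = ["tell me the time", "tell me the day", "what year is it"]

def current_state(spoken):
    """Return the set of words the engine listens for next,
    after the prefix `spoken` (a list of words) has been heard."""
    n = len(spoken)
    state = set()
    for phrase in PHRASES:
        words = phrase.split()
        if words[:n] == spoken and n < len(words):
            state.add(words[n])
    return state

print(current_state(["tell", "me", "the"]))  # -> {'time', 'day'}
```

Note that the vocabulary size the engine needs is the size of the largest such state, not the total number of words across all phrases.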

Speech-recognition engines support these vocabulary sizes:

· Small vocabulary. The engine can accurately recognize about 50 different words in a given state. Although many small-vocabulary engines can exceed this number, larger numbers of words significantly reduce the recognition accuracy and increase the computation load. Small-vocabulary engines are acceptable for command and control, but not for dictation.

· Medium vocabulary. The engine can accurately recognize about 1000 different words in a given state. Medium-vocabulary engines are good for natural language command and control or data entry.

· Large vocabulary. The engine can recognize several thousand words in a given state, perhaps up to 100,000. Large-vocabulary engines require much more computation than small-vocabulary engines, but the large vocabulary is necessary for dictation.

Typical Speech Recognizers on the Market

Three types of speech recognizers are common on the market today:

· Command and Control. The engine allows users to give simple commands, such as "Minimize Window" or "Send mail to Fred," to the computer. It also allows for limited data entry, such as numbers. Command and Control engines use continuous speech, are speaker independent, use subword matching, and have a small vocabulary. To find out more about what you can do with these recognizers, look in the "Voice Commands" section.

· Discrete Dictation. The engine allows users to dictate arbitrary text as long as they leave pauses between words, providing text entry rates of up to 50 words per minute. Discrete Dictation engines use discrete speech, are speaker adaptive, use subword matching, and have a large vocabulary. To find out more about what you can do with these recognizers, look in the "Dictation" section.

· Continuous Dictation. The engine allows users to dictate arbitrary text using continuous speech, providing text entry rates of up to 120 words per minute. Continuous Dictation engines use continuous speech, are speaker dependent, use subword matching, and have a large vocabulary. To find out more about what you can do with these recognizers, look in the "Dictation" section.


Text-to-Speech

To render text into speech, a text-to-speech engine must first determine the phonemes required to speak a word and then translate those phonemes into digital-audio data. Most text-to-speech engines can be categorized by the method that they use to translate phonemes into audible sound.

A text-to-speech engine translates text or phonemes into audible sound in one of three ways: by concatenating whole-word recordings, by synthesis, or by diphone concatenation.

· Concatenated Word. Although concatenated-word systems are not really synthesizers, they are among the most commonly used text-to-speech systems around. In a concatenated-word engine, the application designer provides recordings for phrases and individual words. The engine pastes the recordings together to speak out a sentence or phrase. If you use voice mail, then you've heard one of these engines speaking: "[You have] [three] [new messages]." The engine has recordings for "You have," all of the digits, and "new messages."

· Synthesis. A text-to-speech engine that uses synthesis generates sounds like those created by the human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. The voice produced by current synthesis technology tends to sound less human than a voice produced by diphone concatenation, but it is possible to obtain different qualities of voice by changing a few parameters.

· Diphone Concatenation. A text-to-speech engine that uses diphone concatenation links short digital-audio segments together and performs intersegment smoothing to produce a continuous sound. Each diphone consists of two phonemes, one that leads into the sound and one that finishes it. For example, the word "hello" consists of the phonemes h eh l œ; the corresponding diphones are silence-h, h-eh, eh-l, l-œ, and œ-silence.

Diphones are acquired by recording many hours of a human voice and painstakingly identifying the beginning and ending of phonemes. Although this technique can produce a more realistic voice, it takes a considerable amount of work to create a new voice and the voice is not localizable because the phonemes are specific to the speaker’s language.
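The phoneme-to-diphone step is mechanical and easy to sketch (a toy illustration; "oe" stands in here for the final vowel symbol used in the text):

```python
def diphones(phonemes):
    """Split a phoneme sequence into diphones, padding with silence so
    the first diphone leads in from silence and the last trails out."""
    seq = ["silence"] + list(phonemes) + ["silence"]
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

print(diphones(["h", "eh", "l", "oe"]))
# -> ['silence-h', 'h-eh', 'eh-l', 'l-oe', 'oe-silence']
```

The hard part, as the text notes, is not this bookkeeping but recording and segmenting the hours of audio behind each diphone.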

To find out more about text-to-speech look in the "Voice Text" section.

Speech Application Programming Interface (SAPI)

The Microsoft Speech Application Programming Interface (SAPI) uses the OLE Component Object Model (COM) architecture under Win32® (Windows® 9x and Windows NT® 3.51 and later). The speech architecture is divided into two levels: a high level that is designed for ease and speed of implementation, and a low level that gives applications complete control of the technology.

DTalk implements both the low- and high-level SAPI COM objects as native VCL for use in Delphi and C++ Builder. The DTalk components are not ActiveX controls; they are native VCL controls wrapping a COM specification.

Included with the DTalk package is Speech.pas (or .dcu), which contains an Object Pascal translation of the SAPI header.

How to Implement Using the Low-Level Interfaces

Note that some knowledge of COM is needed to understand the rest of this document. The following material is presented to give a more in-depth introduction to the technology but is purely supplemental and can be skipped. DTalk takes care of all of these details so you don't need to deal with them.

The Speech API has lower-level interfaces which provide detailed control over the speech recognition and text-to-speech processes. The high-level modules (voice commands and voice text) actually use these same low-level interfaces and objects to do speech recognition.

Low-Level Speech Recognition

When an application uses the low-level speech-recognition interfaces, it is talking directly to the engine. This provides the application with a high level of control. This section won't go into detail about how the engine object is used. However, an architectural overview will give you an idea of the processes involved.

The low-level Speech API consists of a number of interfaces and objects that work together to perform speech recognition. Here's how the process works:

  1. The application determines where the speech recognition's audio should come from and creates an audio-source object through which the engine acquires the data. The Windows operating systems supply a default audio-source object that gets its audio from the multimedia wave-in device, but the application can use customized audio sources, such as one that acquires audio from a .WAV file or a specialized hardware device.
  2. The application, through a speech-recognition enumerator object, locates the desired speech-recognition engine, creates an instance of the engine and passes it the audio-source object.
  3. The engine object has a dialog with the audio-source object to find a common data format for the digital audio data, such as pulse code modulation (PCM). Once a suitable format is established, the engine creates an audio-source notification sink that it passes to the audio-source object. From then on, the audio-source object submits digital audio data to the engine through the notification sink.
  4. The application can then register a main notification sink that receives grammar-independent notifications, such as whether or not the user is speaking.
  5. When it is ready, the application creates one or more grammar objects. These are similar to the Voice-menu object in voice commands but more flexible in syntax recognition.
  6. To find out what words the user spoke; the application creates a grammar-notification sink for every grammar object. When the grammar object recognizes a word or phrase, or has other grammar-specific information for the application, it calls functions in the grammar-notification sinks. The application responds to the notifications and takes whatever actions are necessary.
  7. Typically, when a grammar object recognizes speech, it sends the grammar-notification sink a string indicating what was spoken. However, the engine may know a lot more information than this, such as alternative phrases that may have been spoken, timing, or even who is speaking the phrase. An application can find this information by requesting a results object for the phrase and interrogating the results object for more information.

For more information look in the "Low-Level Speech Recognition API" section of the SAPI documentation.

Low-Level Text-to-Speech

When an application uses the low-level text-to-speech interfaces, it is talking directly to the engine. This provides the application with a high level of control. This section won't go into detail about how the engine object is used. However, an architectural overview will give you an idea of the processes involved. The low-level Speech API consists of a number of interfaces and objects that work together to produce speech. Here's how the process works:

  1. The application determines where the text-to-speech audio should be sent and creates an audio-destination object through which the engine sends the data. The Windows operating systems supply an audio-destination object that sends audio to the default multimedia wave-out device, but the application may use customized audio destinations, such as an audio destination that writes to a .WAV file.
  2. The application, through a text-to-speech enumerator object (provided by Microsoft), locates a text-to-speech engine and voice that it wants to use. It then creates an instance of the engine object and passes it the audio-destination object.
  3. The engine object has a dialog with the audio-destination object to find a common data format for the digital audio. Once an acceptable format is established, the engine creates an Audio-Destination Notification Sink that it passes to the audio-destination object. From then on, the audio-destination object submits status information to the engine through the notification sink.
  4. The application can then register a Main Notification Sink that receives buffer-independent notifications, such as whether the synthesized voice is speaking and mouth positions for animation.
  5. When it is ready, the application passes one or more text buffers down to the engine. These will be queued up and then spoken (to the audio destination) by the engine.
  6. To find out what words are currently being spoken, the application can create a Buffer Notification Sink for every buffer object. When the engine speaks a word, reaches a bookmark, or some other event occurs, it calls functions in the Buffer Notification Sinks. The notification sink is released when the buffer is finished being read.
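Steps 5 and 6 can be sketched the same way as the recognition sinks. Again, every name here is invented for illustration, not taken from SAPI: text buffers are queued on the engine, and a per-buffer notification sink is called back as the buffer is spoken and when it finishes.

```python
# Toy sketch of queued text buffers with per-buffer sinks (invented names).
class BufferNotifySink:
    def __init__(self):
        self.events = []

    def word_spoken(self, word):
        self.events.append(("word", word))

    def buffer_done(self):
        self.events.append(("done", None))

class ToyTTSEngine:
    def __init__(self):
        self.queue = []

    def text_data(self, text, sink):
        self.queue.append((text, sink))     # buffers are queued up...

    def run(self):
        for text, sink in self.queue:       # ...then spoken in order
            for word in text.split():
                sink.word_spoken(word)      # real engines also report bookmarks
            sink.buffer_done()              # sink released when the buffer ends
        self.queue.clear()

sink = BufferNotifySink()
engine = ToyTTSEngine()
engine.text_data("hello world", sink)
engine.run()
print(sink.events)
# -> [('word', 'hello'), ('word', 'world'), ('done', None)]
```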

For more information, look in the "Low-Level Text-to-Speech" section of the SAPI documentation.