Why Should I Use Speech in My Application?
Whenever I mention speech
recognition or text-to-speech, people instantly think of the
computer on "Star Trek." The computer's synthesized
voice sounds exactly like a human voice. The speech
recognition is even better. It never listens in on
conversations between people, and when given a command, the
computer never makes a mistake. It's great. If computers had
perfect speech recognition and text-to-speech like that
today, everyone would be talking to computers. Unfortunately,
it will be decades before speech technology is that good.
Even though speech technology
isn't perfect, it's still useful to applications.
Unfortunately, because some people expect more from speech
recognition and text-to-speech than what the technology can
deliver, they dismiss the technology outright. I will
describe some reasonable expectations for speech technology
and show you what you can accomplish with its current
capabilities.
Why Should I Use Speech If
It's Not Perfect?
Computer-based videos were
introduced in multimedia PCs around 1991. They were a cool
idea, but when the videos first appeared, they were really
small. A 386/33 megahertz machine could support a video about
80x50 pixels with 15 frames per second (fps) and low-quality
sound. Since then computer videos have improved
significantly. They're now 320x200 pixels with 24 fps,
conveying roughly 25 times as much data per second as in 1991. These
improvements in resolution and fps are still small compared
to the ultimate video experience: a digital surround-sound
movie. Real movies have about 4000x2000-pixel resolution,
24-bit color, and 4 channels of 44 kHz, 16-bit audio. That's
several hundred times the data rate and processor speed of
the videos you get on today's PCs.
So what does this have to
do with speech recognition and text-to-speech?
Even though computer videos
are significantly inferior to movies, many applications
benefit from computer video technology, especially games and
multimedia titles. Videos are so necessary to some
applications that they wouldn't be able to function without
them--even though computer videos are light years away from a
real movie experience.
Speech recognition and
text-to-speech are new technologies, just like computer
videos. Although speech technologies aren't nearly as good as
those on "Star Trek," many applications can get
good use out of them.
Does Speech Replace the
Keyboard and Mouse?
You might now be asking: If
speech isn't perfect, how can I use it for my application?
After all, speech is replacing the keyboard and mouse, and
the keyboard never makes a mistake.
Speech recognition will not
replace the keyboard and the mouse, nor will text-to-speech
eliminate text. Speech recognition and text-to-speech are
just two more user interface "devices" available to
an application developer. Speech recognition can be added to
an already long list of user-interface "devices"
such as the keyboard, mouse, joystick, and pen. Likewise,
text-to-speech can be added to text, graphics, animated
videos, and sound. Very few applications will use speech as
their sole means of communication with users. Most will mix
and match according to the strengths of each device.
If you think about it,
applications already use several user-interface devices to
communicate with users. A game makes use of both a joystick
and keyboard. Users manipulate the joystick to
"tell" the computer which direction they want their
character to move. The keyboard is used to type in commands,
such as "Say 'Hello.'" The joystick is a better
input device for movement, while the keyboard is better for
entering text. The game's choice of output
"devices" works the same way. The application uses
a combination of pictures, text, and sound to communicate
with users. When an enemy aircraft blows up, a game doesn't
draw "Boom!" in large letters. Instead, the game
plays a recording of an explosion.
Speech is just another
"device." As such, it has its place among other
widely used devices.
What Can Speech Do?
Now that I've described
reasonable expectations of speech technology, I'll tell you a
bit about the current capabilities of text-to-speech and
speech recognition.
Text-to-speech comes in two
flavors: synthesized text-to-speech and concatenated
text-to-speech.
Synthesized speech is what
people typically think of when I mention text-to-speech. It
reads text by analyzing the words and having the computer
figure out the phonetic pronunciations for the words. The
phonemes are then passed into a complex algorithm that
simulates the human vocal tract and emits the sound. This
method allows the text-to-speech to speak any word, even
made-up ones like "Zamphoon," but it produces a
voice that has very little emotion and is distinctly not
human. You'd use this if you knew that the application had to
speak, but you couldn't predict what it would need to say.
Synthesized speech usually requires a 486/33 megahertz
machine with 1 megabyte of working-set RAM.
Concatenated text-to-speech does something different. It analyzes the text and pulls
recordings, words, and phrases out of a prerecorded library.
The digital audio recordings are concatenated. Because the
voice is just a recording that you've made, it sounds good.
Unfortunately, if the text includes a word or phrase that you
didn't record, the text-to-speech can't say it. Concatenated
text-to-speech can be viewed as a form of audio compression
because words or common phrases have to be recorded only
once. For example, many telephone applications will have a
recording for, "Press 1 to play new message; press 2 to
send a fax," and so on, and another recording for,
"Press 1 to fast-forward; press 2 to rewind." A
concatenated text-to-speech will have only one recording of
"press" rather than four. If concatenated
text-to-speech doesn't seem that much different to you from
recording your own .WAV files, you're right. However,
concatenated text-to-speech will save you development time
and bugs, allowing you to add more features to your software.
Because concatenated text-to-speech just plays a .WAV file,
it takes very little processor power and only a bit of
memory, since most of the audio is stored on disk.
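The lookup-and-concatenate behavior described above can be sketched in a few lines. This is a toy illustration, not the SDK's API: the phrase library, file names, and greedy longest-match strategy are all assumptions chosen to mirror the "press" example in the text.

```python
# Toy sketch of concatenated text-to-speech: match the longest recorded
# phrase at each position in the text and string the clips together.
# Phrase names and clip files are hypothetical stand-ins for .WAV data.

RECORDINGS = {
    "press": "press.wav",
    "1": "one.wav",
    "2": "two.wav",
    "to play new message": "play_msg.wav",
    "to fast-forward": "ffwd.wav",
    "to rewind": "rewind.wav",
}

def concatenate(text: str) -> list[str]:
    """Return the sequence of clips to play for the given text."""
    words = text.lower().split()
    clips, i = [], 0
    while i < len(words):
        # Prefer the longest recorded phrase starting at position i.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in RECORDINGS:
                clips.append(RECORDINGS[phrase])
                i = j
                break
        else:
            # A real engine simply cannot say unrecorded words.
            raise ValueError(f"no recording for: {words[i]!r}")
    return clips

print(concatenate("Press 1 to fast-forward"))
# ['press.wav', 'one.wav', 'ffwd.wav']
```

Note how "press" is stored once but reused by every prompt, which is why the text describes this technique as a form of audio compression.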
Speech recognition is somewhat
more complicated to classify than text-to-speech. Each speech
recognition engine has three characteristics: word separation
(continuous versus discrete speech), vocabulary size, and
speaker dependence.
Although any combination of
the three characteristics is possible, two combinations are
common today. "Command and Control" speech recognition is continuous, small
vocabulary, and speaker independent. This means that users
can use several hundred different commands or phrases. If a
user says a command that is not in the list, the
speech-recognition system will return either "not
recognized," or will think it heard a similar-sounding
command. Because users of Command and Control can say only
specific phrases, the phrases must be either visible on the
screen--so intuitive that all users will know what to say--or
the users must learn what phrases they can say. Command and
Control speech recognition requires a 486/66 megahertz
machine with 1 megabyte of working-set RAM.
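A Command and Control recognizer's behavior, as described above, can be mimicked in miniature: match the utterance against a fixed phrase list, and either reject it or map it to the closest-sounding known command. Everything here is illustrative; `difflib` string similarity merely stands in for real acoustic matching, which operates on audio rather than text.

```python
# Toy command-and-control matching against a fixed vocabulary.
# Out-of-vocabulary input is either rejected ("not recognized") or,
# if it is close enough, confused with a similar known command.
import difflib

COMMANDS = ["minimize window", "maximize window", "send mail", "open file"]

def recognize(utterance: str) -> str:
    matches = difflib.get_close_matches(utterance.lower(), COMMANDS,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else "not recognized"

print(recognize("Minimize Window"))   # exact phrase in the vocabulary
print(recognize("minimze window"))    # near miss mapped to a known command
print(recognize("play some music"))   # out-of-vocabulary input
```

The `cutoff` threshold plays the role of the engine's confidence score: below it, the recognizer reports "not recognized" rather than guessing.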
"Discrete Dictation" speech recognition is discrete, large vocabulary, and speaker
dependent. It's used to dictate text into word processors or
mail, or for natural-language commands. Although users may
say anything they wish, they must leave pauses between words,
making the speech unnatural. Discrete dictation requires a
Pentium/60 megahertz machine with 8 megabytes of RAM.
Sample Uses for Speech
It's impossible to say,
"Thou shalt use speech to do XXX." Not only is
every application different, but speech technology is so new
that application writers haven't had much experience with it.
As a general rule, use speech when it provides significant
advantages to users, not just for the "Wow!"
effect. Try to find uses that make the application easier to
use, or allow users to access features more quickly than with
the keyboard or mouse. In the case of games, use speech to
add realism to the interaction with characters, making the
game more involving and fun.
Telephony applications benefit
most from speech recognition and text-to-speech because their
sole means of communicating with users is through the
telephone audio. Text-to-speech is not only easier to use
than manually recording all of the prompts, but it also
enables new features that allow arbitrary text to be read,
such as names, addresses, and e-mail. Speech recognition is a
great substitute for touch-tone menus not only because it's
more natural and flexible, but also because many users don't
have touch-tone phones. Several engine vendors have phone
numbers you can call for a demonstration of telephony
applications.
Games, especially adventure
games, are the next beneficiaries of speech. Any game in
which the user "talks" with characters will enjoy
the fruits of speech. Speech recognition allows users to
really talk to the characters, which is a lot more fun than
typing in a response. Rather than just printing out what
characters "speak," use text-to-speech to really
make them speak. Not only does this increase the realism of
the game, but it allows users to take their hands off the
keyboard, sit back in their chairs, and enjoy the game.
Multimedia titles can use
concatenated text-to-speech for recording audio passages,
treating it like improved audio compression. They can also
incorporate speech recognition so that users can sit back and
relax, just as they can with games.
Of course, there are other
applications that can use speech. For example, some
applications use speech because they require that the user's
hands be free for other things, or because the user can't see
the screen.
Microsoft Speech API
This was just a brief overview
of what you can do with speech technology. The material was
taken from Intro.doc, which is part of the documentation
included with the Microsoft Speech SDK 3.0. You should download
the API and continue reading the Microsoft Speech API
Documentation for more detail.
Overview of Speech Technologies
Speech recognition is
the ability of a computer to understand the spoken word for
the purpose of receiving command and data input from the
speaker. Text-to-speech is the ability of a computer
to convert text information into synthetic speech output.
Speech recognition and
text-to-speech use engines, which are the programs
that do the actual work of recognizing speech or playing
text. Most speech-recognition engines convert incoming audio
data to engine-specific phonemes, which are then translated
into text that an application can use. (A phoneme is
the smallest structural unit of sound that can be used to
distinguish one utterance from another in a spoken language.)
A text-to-speech engine performs the same process, in
reverse. Engines are supplied by vendors that specialize in
speech software; they may be bundled with new audio-enabled
computers and sound cards, purchased separately, or licensed
from the vendor.
The speech-recognition engine
transcribes audio data received from an audio source,
such as a microphone or a telephone line. The text-to-speech
engine converts text to audio data, which is sent to an audio
destination, such as a speaker, a headphone, or a
telephone line. Under some circumstances, an engine may be
able to transcribe audio data to or from a file.
An engine typically provides
more than one mode for recognizing speech or playing
text. For example, a speech-recognition engine will have a
mode for each language or dialect that it can recognize.
Likewise, a text-to-speech engine will have a mode for each voice,
which plays text in a different speaking style or
personality. Other modes may be optimized for a particular
audio sampling rate, such as 8 kilohertz (kHz) for use over a
telephone line.
Speech recognition can be as
simple as a predefined set of voice commands that an
application can recognize. More complex speech recognition
involves the use of a grammar, which defines a set of
words and phrases that can be recognized. A grammar may use rules
to predict the most likely words to follow the word just
spoken, or it may define a context that identifies the
subject of dictation and the expected style of language.
Both speech-recognition and
text-to-speech engines may make use of a pronunciation
lexicon, which is a database of correct pronunciations
for words and phrases to be recognized or played.
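At its simplest, a pronunciation lexicon is a word-to-phonemes mapping, which a short sketch can illustrate. The phoneme symbols below are ARPAbet-style and chosen purely for illustration; they are not taken from any particular engine's lexicon.

```python
# Minimal sketch of a pronunciation lexicon: a database of phoneme
# sequences keyed by word. Phoneme symbols here are illustrative only.

LEXICON = {
    "hello":    ["HH", "AH", "L", "OW"],
    "computer": ["K", "AH", "M", "P", "Y", "UW", "T", "ER"],
}

def pronounce(word: str):
    """Return the phonemes for a word, or None if it isn't listed."""
    return LEXICON.get(word.lower())

print(pronounce("Hello"))     # ['HH', 'AH', 'L', 'OW']
print(pronounce("zamphoon"))  # None -- an engine would fall back to
                              # letter-to-sound rules for unknown words
```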
An engine's approach to
recognizing speech or playing text determines the quality of
speech in an application (that is, the accuracy of
recognition or clarity of playback) and the amount of
effort required from the user to get good accuracy or
clarity. The engine's approach also affects the
processor speed and memory required by an application; it may
also influence the application's features or the design
of its user interface.
This section provides an
overview of speech technology. An understanding of both
speech recognition and text-to-speech will help you decide
how to best incorporate speech in your application and how to
choose a technology that supports what you want to do.
All speech recognition
involves detecting and recognizing words. Most
speech-recognition engines can be categorized by the way in
which they perform these basic tasks:
· Word separation. The degree
of isolation between words required for the engine to
recognize a word.
· Speaker dependence. The degree to
which the engine is restricted to a particular speaker.
· Matching techniques. The method that
the engine uses to match a detected word to known words
in its vocabulary.
· Vocabulary size. The number of words
that the engine searches to match a word.
Word Separation
Engines typically require one of the following types of verbal input
to detect words:
· Discrete speech. Every word must be
isolated by a pause before and after the word
(usually about a quarter of a second) for the engine
to recognize it. Discrete speech recognition requires
much less processing than word-spotting or continuous
speech, but it is less user-friendly.
· Word-spotting. A series of words may
be spoken in a continuous utterance, but the engine
recognizes only one word or phrase. For example, if a
word-spotting engine listens for the word
"time" and the user says "Tell me the
time" or "Time to go," the engine
recognizes only the word "time."
Word-spotting is used when
a limited number of commands or answers are expected from
the user and the way that the user speaks the commands is
either unpredictable or unimportant.
Continuous-speech recognizers can also be used to do word-spotting.
· Continuous speech. The engine
encounters a continuous utterance with no pauses between
words, but it can recognize the words that were spoken.
Continuous-speech recognition is the best technology from a usability
standpoint, because it is the most natural speaking style
for human beings. However, it is the most computationally
intensive because identifying the beginning and ending of
words is very difficult, much like reading printed
text without spaces or punctuation.
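Of the three input styles above, word-spotting is the easiest to sketch. The toy version below scans a text transcript for a single keyword and ignores everything else; a real engine performs the equivalent search over the audio signal itself.

```python
# Toy word-spotting: the "engine" listens for one keyword anywhere in a
# continuous utterance and ignores the surrounding words. Text stands
# in for audio here.

def spot(utterance: str, keyword: str) -> bool:
    return keyword.lower() in utterance.lower().split()

print(spot("Tell me the time", "time"))  # True
print(spot("Time to go", "time"))        # True
print(spot("What day is it", "time"))    # False
```

This mirrors the example in the text: both "Tell me the time" and "Time to go" trigger on the single word "time."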
Dependence on the Speaker
Speech-recognition engines may
require training to recognize speech well for a particular
speaker, or they may be able to adapt to a greater or lesser
extent. Engines can be grouped into these categories:
· Speaker-dependent. The engine
requires the user to train it to recognize his or her
voice. Training usually involves speaking a series of
pre-selected phrases. Each new speaker must perform this
training. The engine can work without training, but its
accuracy usually starts below 95 percent and does not improve
until the user completes the training. This technique takes the least
amount of processing, but it is frustrating for most
users because the training is tedious, taking anywhere
from five minutes to several hours.
· Speaker-adaptive. The engine trains
itself to recognize the user's voice while the user
performs ordinary tasks. Accuracy usually starts at about
90 percent, but rises to more acceptable levels after a
few hours of use.
Two considerations must be
taken into account with speaker-adaptive technology.
First, the user must somehow inform the engine when it
makes a mistake so that it does not learn based on the
mistake. Second, even though recognition improves for the
individual user, other people who try to use the system
will get higher error rates until they have used the
system for a while.
· Speaker-independent. The engine
starts with an accuracy above 95 percent for most users
(those who speak without accents). Almost all
speaker-independent engines have training or adaptive
abilities that improve their accuracy by a few more
percentage points, but they do not require the use of
such training. Speaker-independent systems require
several times the computational power of speaker-dependent
systems.
Matching Techniques
Engines match a detected word to a known word using one of these
techniques:
· Whole-word matching. The engine
compares the incoming digital-audio signal against a
prerecorded template of the word. This technique takes
much less processing than subword matching, but it
requires that the user (or someone) prerecord every word
that will be recognized, sometimes several hundred
thousand words. Whole-word templates also require large
amounts of storage (between 50 and 512 bytes per word)
and are practical only if the recognition vocabulary is
known when the application is developed.
· Subword matching. The engine looks
for subwords (usually phonemes) and then
performs further pattern recognition on those. This
technique takes more processing than whole-word matching,
but it requires much less storage (between 5 and 20 bytes
per word). In addition, the pronunciation of the word can
be guessed from English text without requiring the user
to speak the word beforehand.
Vocabulary Size
Engines typically support several different sizes of vocabulary.
Vocabulary size does not represent the total number of words
that a given engine can recognize. Instead, it determines the
number of words that the engine can recognize accurately in a
given state, which is defined by the word or words that were
spoken before the current point in time. For example, if an
engine is listening for "Tell me the time,"
"Tell me the day," and "What year is it?"
and the user has already said "tell,"
"me," and "the," the current state has
these two words: "time" and "day."
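The "current state" idea in the example above can be made concrete with a small sketch: given the words spoken so far, compute which words the grammar allows next. The phrase list is taken directly from the text; the rest is an illustrative assumption.

```python
# Sketch of how a grammar narrows the active vocabulary: only words
# that can legally follow the spoken prefix are "in state".

PHRASES = [
    ["tell", "me", "the", "time"],
    ["tell", "me", "the", "day"],
    ["what", "year", "is", "it"],
]

def active_words(spoken: list[str]) -> set[str]:
    """Words the engine listens for after the given prefix."""
    n = len(spoken)
    return {p[n] for p in PHRASES if p[:n] == spoken and len(p) > n}

print(active_words([]))                     # e.g. {'tell', 'what'}
print(active_words(["tell", "me", "the"]))  # e.g. {'time', 'day'}
```

After "tell," "me," and "the," the engine need only distinguish "time" from "day," which is why vocabulary size is measured per state rather than over the whole language.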
Engines typically support these vocabulary sizes:
· Small vocabulary. The engine can
accurately recognize about 50 different words in a given
state. Although many small-vocabulary engines can exceed
this number, larger numbers of words significantly reduce
the recognition accuracy and increase the computation
load. Small-vocabulary engines are acceptable for command
and control, but not for dictation.
· Medium vocabulary. The engine can
accurately recognize about 1000 different words in a
given state. Medium-vocabulary engines are good for
natural language command and control or data entry.
· Large vocabulary. The engine can
recognize several thousand words in a given state,
perhaps up to 100,000. Large-vocabulary engines require
much more computation than small-vocabulary engines, but
the large vocabulary is necessary for dictation.
Typical Speech Recognizers
on the Market
Three types of speech
recognizers are common on the market today:
· Command and Control. The engine
allows users to give simple commands, like "Minimize
Window" or "Send mail to Fred" to the
computer. It also allows for limited data entry such as
numbers. Command and Control engines are continuous
speech, speaker independent, sub-word modeling, and use a
small vocabulary. To find out more about what you can do with
these recognizers, look in the "Voice Commands" section.
· Discrete Dictation. The engine
allows users to speak in arbitrary text so long as the
user leaves pauses between words, providing text entry
rates of up to 50 words per minute. Discrete dictation
engines are discrete speech, speaker adaptive, sub-word
modeling, and use a large vocabulary. To find out more about
what you can do with these recognizers, look in the
"Dictation" section.
· Continuous Dictation. The engine
allows users to speak in arbitrary text using continuous
speech, providing text entry rates of up to 120 words per
minute. Continuous Dictation engines are continuous
speech, speaker dependent, sub-word modeling, and use a
large vocabulary. To find out more about what you can do with
these recognizers, look in the "Dictation" section.
Text-to-Speech
To render text into speech, a
text-to-speech engine must first determine the phonemes
required to speak a word and then translate those phonemes
into digital-audio data. Most text-to-speech engines can be
categorized by the method that they use to translate phonemes
into audible sound.
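The two-stage pipeline just described (text to phonemes, then phonemes to audio) can be sketched with stubs. The lexicon entries and the audio "rendering" below are placeholders; a real engine's letter-to-sound rules and vocal-tract or diphone synthesis would replace them.

```python
# Sketch of the two-stage text-to-speech pipeline: (1) text analysis
# produces phonemes, (2) the phonemes are rendered to audio data.
# Lexicon entries and render_phonemes() are illustrative stubs.

LEXICON = {
    "good":    ["G", "UH", "D"],
    "morning": ["M", "AO", "R", "N", "IH", "NG"],
}

def text_to_phonemes(text: str) -> list[str]:
    phonemes = []
    for word in text.lower().split():
        # Unknown words would go through letter-to-sound rules instead.
        phonemes.extend(LEXICON.get(word, ["?"]))
    return phonemes

def render_phonemes(phonemes: list[str]) -> bytes:
    # Stand-in for synthesis: a real engine emits digital-audio samples.
    return "|".join(phonemes).encode()

audio = render_phonemes(text_to_phonemes("Good morning"))
print(audio)  # b'G|UH|D|M|AO|R|N|IH|NG'
```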
A text-to-speech engine
translates text or phonemes into audible sound in one of
several ways, either by synthesis or by diphone concatenation.