Are You Talking to Me? Speech on Mac OS X
by FJ de Kermadec
Editor's Note -- Apple's recent announcement of Spoken Interface has moved speech recognition to the forefront. However, Mac OS X has included speech recognition and synthesis technologies for quite some time, and in this article we delve into the often misunderstood world of talking to your Mac.
The Early Speech Days
The documentation provided by Apple states that the Speech Manager -- the component that takes care of piping the text into the Speech Synthesizer -- was first introduced in 1993. Once again, this shows how innovative Apple can be. Computers of that time were very different from what we know today, and adding speech capabilities to a consumer product -- even thinking about it -- was a real breakthrough. If not somewhat crazy.
The Rebirth of Cool
However, Speech really was born again with the introduction of Mac OS X, especially in the two latest releases, Jaguar and Panther. The new audio capabilities of Mac OS X, along with Apple's renewed commitment to this amazing technology, have combined to produce what is widely considered to be the most convenient and advanced speech technology available.
Therefore, if you have tried and abandoned Speech during the last century -- the Mac OS 9 days, in other words -- give it another try.
Users who got used to the voice verification feature in Mac OS 9 (vocal password) should not despair. It is not currently built into Mac OS X, but theoretically nothing prevents Apple from adding it again if it is widely requested. This feature actually worked OK and can be considered to be very secure since, even if an attacker knows your pass phrase, he cannot "borrow" your voice. And no, recordings of your voice won't fool the system.
The Goals of Speech
When Apple began to build speech into the Mac OS, they formed a team composed of some of the world's leading speech and language scientists, aiming to bring the user-computer interaction mechanisms to a whole other level.
The Speech technology is in fact built in two parts: a speech synthesizer that your Mac can use to communicate with you (to read text on demand, but also to keep you informed about the status of a process), and a speech-recognition technology that allows you to talk to your Mac and give it the commands you would usually send with a keyboard and mouse.
Since Speech is built in right at the core of Mac OS X, there is no need to install a special application or devices to make it work. Although the way a specific application reacts to spoken commands in detail is up to the developer, any Mac OS X application can, to a certain extent, be controlled by voice.
This amazing integration is mainly due to the development tools and elements that Apple provides to developers. For example, once Apple builds speech-control capabilities into the standard interface elements produced by Interface Builder, the tool it hands out to developers, every application built with those elements can be controlled using standard commands.
Of course, customization is always possible: you can, at any time, add your very own commands to the speech recognition engine -- more on that later. There is, however, no need to worry: Apple ships Mac OS X with a predefined set that will allow you to perform the most common tasks -- browse the Web, check your email, etc. -- right out of the box.
What Can it Do for Me?
With a little practice, you will be able to forget about your keyboard and mouse and do much of what you already do while leaning back in your chair, thereby reducing the risk of physical injury. In fact, several HR departments are now encouraging people to use these features whenever possible for exactly this reason: to decrease the incidence of repetitive strain injuries in the workplace.
Speech can also, when used along with more traditional input devices, make your computing experience more productive and enjoyable. If you want to check your mail while working on an important report for your boss, you do not need to stop what you're doing. Simply say "Get my mail" and let your Mac do the work for you.
Of course, Speech is also very handy for users with disabilities since it allows them to interact with their computer without having to ask for assistance. Thanks to Speech, the Mac has become the computer of choice for visually impaired users who can enjoy quality voices and excellent voice recognition. Indeed, the feedback provided by Speech can allow a user who does not see the screen to determine whether the command he gave to the computer took effect or not and what the status of the request is.
There is also, let's face it, a "coolness" factor that will convince many users to turn Speech on. But before doing so, you should be warned that Speech is highly addictive!
Why Isn't Speech More Successful?
When asked, most Mac users will tell you that they have tried Speech, asked the computer to give them the time, then turned it off because they did not see it as valuable. Usually they thought the voice recognition was unreliable or the voices used by the computer weren't pleasant.
As with any technology, there is a short learning curve before you can really master it and feel comfortable speaking with your computer. After all, this is a brand new way of interacting with a machine and you may need a few hours to feel relaxed and speak normally again.
Voices are also computationally expensive and, up until recently, many computers couldn't deal with extremely complex, natural-sounding voices. The good news is that the incredible computing power packed in the latest Macs allows the Speech team to release increasingly natural-sounding voices and speech synthesizers, making the interaction with a computer even more pleasant. This will be especially noticeable for Panther users.
No synthetic voice sounds perfectly natural. Keep in mind that the speech-synthesis technologies on which some phone systems rely are heavily trained and specialized for a narrow vocabulary. Ask your virtual reception desk to pronounce the word "asteroids" and you will probably hear the most unnatural voice ever. Your Mac, by contrast, is able to pronounce any word you give it in a natural way. Developers can go even further and use the various tools Apple puts at their disposal to fine-tune the speech synthesis in their applications. Few take the time to do that right now, but when they do, the results are striking.
As you can see, the quality of voices has increased over time. For example, Vicki, the new default Panther voice -- and last in this demo -- takes up 27.6 MB instead of the more traditional 1.5 MB that older voices used, roughly 18 times as much.
The Speech Synthesizer has also evolved a lot and is now able to distinguish common abbreviations and to add emphasis to long sentences and paragraphs automatically, making speech sound much more natural. This is especially noticeable when you read long text documents. The voice is now much livelier and lifelike since it better duplicates the emphasis a real-life speaker would put on different parts of the text.
How Does Speech Work?
Understanding how Speech works can provide you with valuable information to better take advantage of this technology. In this part, I will try to provide you with an in-depth look at the speech recognition engine as well as answer a few basic structural questions.
Speech Managers and Synthesizers
The process that converts the text string that must be read into the sound that goes out of your speakers can be roughly divided into four steps:
- The application passes a string or buffer of text to the Speech Manager. The developer may choose to give additional instructions along with the text to alter the way the text is pronounced and make it even more natural if he wishes.
- The Speech Manager pipes the strings that need to be read into the Speech Synthesizer. In itself, it does not do any sound-processing work, but instead provides developers with an easy way to interact with Speech.
- The Speech Synthesizer accepts the data from the Speech Manager and will take care of converting it into audible speech. It is sometimes referred to as the "speech engine." To do that, it relies on the various elements we have already seen: dictionaries, sets of rules and exceptions, and an understanding of the context in which the phrase to speak is placed. It will also alter the way it generates the sound according to the commands that were passed along with the text.
- The information is then passed to the Sound Manager that will take care of communicating with the audio hardware so that you can hear the voice.
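The four steps above can be pictured as a chain of hand-offs. The following Python sketch is only a toy model of that chain: the class names mirror the components named in the text, but every method, parameter, default value, and the fake "audio" string are hypothetical and do not correspond to Apple's actual Speech Manager API.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechRequest:
    """Step 1: what the application hands over (text plus optional hints)."""
    text: str
    hints: dict = field(default_factory=dict)  # hypothetical, e.g. {"rate": 200}


class SoundManager:
    """Step 4: stands in for the layer that talks to the audio hardware."""

    def __init__(self):
        self.played = []

    def play(self, audio):
        # A real system would send samples to the speakers; we just record them.
        self.played.append(audio)


class SpeechSynthesizer:
    """Step 3: converts text into 'audio' (here, merely a tagged string)."""

    def __init__(self, sound_manager):
        self.sound_manager = sound_manager

    def synthesize(self, request):
        rate = request.hints.get("rate", 175)  # assumed default, for illustration
        audio = f"<audio rate={rate}>{request.text}</audio>"
        self.sound_manager.play(audio)
        return audio


class SpeechManager:
    """Step 2: does no sound processing itself, only forwards requests."""

    def __init__(self, synthesizer):
        self.synthesizer = synthesizer

    def speak(self, text, **hints):
        return self.synthesizer.synthesize(SpeechRequest(text, hints))


sound = SoundManager()
manager = SpeechManager(SpeechSynthesizer(sound))
result = manager.speak("You have new mail", rate=200)
```

The point of the layering is that the application only ever talks to the manager; the synthesizer and the audio layer could be swapped without the application noticing.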
As a general rule, the more RAM and processing power your computer has, the better the voice will sound. This is because a Speech Synthesizer heavily relies on your computer's resources to perform its calculations. Of course, any modern Mac is able to speak perfectly, but do not expect your old Performa to do as well as a PowerMac G5.
What's a Voice Anyway?
A voice is a set of characteristics, defined as parameters, that specify a particular quality of speech. Synthetic voices are like natural voices: all of them are different and, from their characteristics, you can guess the age and sex of the speaker. Voices can talk slower or faster, but in the end you cannot change their base characteristics -- just as you can alter your own voice but never entirely change it. Panther comes with 22 voices, but you can theoretically add more if you like.
Indeed, the Speech architecture is very flexible and, as years go by, more and more developers have created add-ons for it, to extend its capabilities and provide even more natural-sounding voices.
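The idea of a voice as a fixed set of parameters with only a few adjustable settings can be sketched as a small data structure. This is an illustration under stated assumptions, not Apple's voice format: the field names, the `with_rate` helper, and the sample values for Vicki are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Voice:
    """Base characteristics are frozen: they identify the voice."""
    name: str
    gender: str
    age: int           # apparent age of the speaker (illustrative value)
    base_pitch: float  # arbitrary units, for illustration only
    rate: int = 175    # speaking rate, the one setting we allow to vary


def with_rate(voice, rate):
    """Return a copy that talks slower or faster; base traits are untouched."""
    return replace(voice, rate=rate)


# Sample values are made up for the example.
vicki = Voice(name="Vicki", gender="female", age=35, base_pitch=50.0)
fast_vicki = with_rate(vicki, 220)
```

Making the dataclass frozen mirrors the point made above: you can ask a voice to speak faster, but you get the same voice back, not a different speaker.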
What Makes Speech Better than the Competition?
Do you remember when, at the beginning of this article, I told you that you may have in fact used Speech for months without knowing it? That's because the work going on with the Speech group at Apple doesn't stop at making voices and recognizing what you say. Far from it!
Indeed, for the Apple team speech cannot be distinguished from language and there can be no reliable speech technology without a good understanding of the language that is spoken. That's why Apple developed a complex set of rules that allows your Mac to truly analyze the text before it is spoken. Of course, Speech relies on a 121,000-word dictionary that tells it how the most common words are pronounced ... but what about the others? What about the context in which these words are placed? While some other technologies don't care, your Mac does.
This very same technology allows the Junk Mail feature in Mail to reach 98% accuracy when it is properly trained, and it serves as the basis for the Japanese input method. If, like millions of Mac users, you have wondered how Mail does its magic, keep in mind the phrase "adaptive latent semantic analysis."
However, although this attention to the context in which a word or phrase is spoken is essential, the end user is more likely to notice something even more appealing: the speech recognition is speaker independent and does not require any training.
In other words, you do not need to read some predefined text for hours to allow Speech to get used to you or your environment. It also means that you can switch between Macs without worrying about retraining the system.
Much in the same way, Speech can adapt itself to very diverse environments and is able to cancel out the background noise. Therefore, there is no need to pay much attention to your environment as long as the background noise stays constant -- think a restaurant where all the background conversations mix to create a relatively constant noise.
Thanks to this flexibility, Speech does not require any additional hardware. Of course, Speech addicts may wish to purchase a headset to further increase the accuracy of the speech recognition in hostile environments, but this really isn't needed as long as you plan to use it in regular conditions, such as a room of reasonable size with no strong echo -- an office, your living room, or the company's cafeteria, as opposed to an empty lecture hall, an underground cave, or an acoustic rock concert. This also means you do not have to wear the special noise-cancellation headphones provided by some other manufacturers. These are nice, but in many cases they do not provide any real help and are impractical.
However, to truly understand what makes the difference, we need to get a bit geeky and see in-depth how Speech works.