Speakable Web Services

I have to confess I'm still tempted to dismiss this speech stuff as an amusing parlor trick. But it may finally be reaching a tipping point. "Look, Dad's talking to the computer," my kids snickered. When I showed my son he could play GnuChess using voice commands, though, he was riveted. It's a case-by-case thing, but when an application has a limited control vocabulary ("pawn a2 to a4"), the Mac's speaker-independent speech recognition can give you hands-free control that's accurate and more effective than mouse control. Well, to be honest, mostly accurate. I'm having a little trouble getting GnuChess to distinguish between "d" and "e" -- a problem that could be solved by also supporting "delta" and "echo."
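The "delta"/"echo" fix can be sketched using the same "SpeechRecognitionServer" scripting interface that appears later in this piece. The alias list and the mapping step here are my own illustration of the idea, not GnuChess's actual implementation:

```applescript
tell application "SpeechRecognitionServer"
	-- Accept NATO-alphabet aliases alongside the acoustically similar letters
	set fileNames to {"a", "b", "c", "delta", "echo", "f", "g", "h"}
	set theFile to listen for fileNames with prompt "Which file?" giving up after 10
	-- Map the aliases back to the chess files they stand for
	if theFile is "delta" then set theFile to "d"
	if theFile is "echo" then set theFile to "e"
end tell
```

Because the recognizer only has to choose among eight distinct-sounding words, accuracy goes up without any training.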

Not many of the XMethods services are likely candidates for voice treatment. Complex inputs and outputs don't make much sense. You can build IVR-style (interactive voice response) menus, like so:

tell application "SpeechRecognitionServer"
   local choices
   set choices to {"Temperature", "BlogStats"}
   set thePrompt to "What do you need to know?"
   try
      set theResult to listen for choices with prompt thePrompt giving up after 10
      say (do shell script "/Users/Jon/" & theResult)
   end try
end tell

Unless you really want to inflict voice trees on yourself, though, you'll tire of this approach once the novelty wears off. Complex output is a nonstarter as well. It's faster to read than to hear anything longer than a word or short phrase, the Mac's synthesized voices work best on short snippets, and there's no way for the computer to usefully speak structured output.

Namespace Management

The namespace made of the files in the Speakable Items folder is active system-wide. There are also separate per-application namespaces. For example, the Speakable Items/Internet Explorer subfolder defines voice commands just for MSIE. You can, in fact, extend that namespace in a hands-free manner, using the "Make this page speakable" voice command. If the current page is Google News, for example, then "Make this page speakable" prompts with the page's HTML doctitle, "Google News." When the prompt is active, the valid speech commands are "Save" and "Cancel." If you say "Save," you will create a voice-activated bookmark triggered by the phrase "Google News." Pretty darned slick! It's IE-specific, though, and that's a shame because I prefer Mozilla on the Mac to the IE version (5.2) that came with the TiBook.

The per-application namespaces are segregated from one another, but as you extend the main namespace, you'll start to run into conflicts. New commands that sound too much like existing ones will cause misrecognition. The problem is easily solved, though. Just open the Speakable Items folder and rename files -- either pre-existing items or your new items -- in order to step around these conflicts.

As you build up vocabularies, it's easy to forget that the recognition engine is speaker-independent but language-dependent. For example, I've been enjoying Brent Simmons' Huevos, a nifty little tool that can float in a small window and send a search term to any of a user-defined set of Web sites. The voice command to launch it -- "Switch to Huevos" -- works best when I Anglicize the name as "Hoo -- eee -- vos." Apple's site says that a Spanish recognizer is available but, for now, I'm still trying to decide whether to mangle the pronunciation of Huevos or rename it for speech purposes.

Speech control of computers is mainly considered to be an assistive technology. In my case, there's certainly an element of that. After too many years of typing and mousing, my wrists are chronically sore, and I'm happy to avoid all keystrokes and mouseclicks that I can. Most of that wear and tear is from writing and programming, though, so until I can come to terms with dictation (as, I'm told, the prolific author David Pogue has done), voice control won't help much. But Apple's implementation has made me rethink the mixed-mode user interface. Consider, for example, the mechanism for picking one of the fifty U.S. states in a Web form. Some sites ask you to type the two-letter abbreviation, but most offer a picklist. Scanning a list of 50 items is unproductive. I can use completion to skip to the N section of the list, but adding an H takes me to Hawaii, not New Hampshire. Here's a well-defined namespace that could probably be accessed using speech more quickly and naturally than by any other method. I suspect the same holds true for many multiple-choice situations in data entry forms and elsewhere.
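A state picker along these lines could, in principle, be driven by the same scriptable recognizer shown earlier. This is only a sketch with an abbreviated list; a real form would enumerate all fifty names:

```applescript
tell application "SpeechRecognitionServer"
	-- Abbreviated list for illustration; a real picker would offer all fifty states
	set stateNames to {"Hawaii", "New Hampshire", "New Jersey", "New York"}
	set thePick to listen for stateNames with prompt "Which state?" giving up after 10
	-- thePick can then be mapped to the form's two-letter abbreviation
end tell
```

Saying "New Hampshire" selects it directly, with none of the completion ambiguity a keyboard-driven picklist suffers from.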

Consider another Brent Simmons application, the popular RSS newsreader NetNewsWire Lite. It's already more usefully speakable than most OS X apps I've tried. Along with menu navigation, you can speak the crucial commands "Next Unread," "Mark All as Unread," and "Open in Browser." These are more mnemonic than their keyboard equivalents (Command-G, Command-Shift-K, and Command-B), and especially in the case of Command-Shift-K, more accessible, too. An interesting refinement would be to voice-enable random access to feeds, just as MSIE allows spoken random access to items on the Go and Favorites menus. I've got 128 subscriptions, for example. It would be cool to say "Sam Ruby" and jump straight to Sam's blog. Or to say "Jeremy," and jump to a completion list showing "Allaire" and "Zawodny," and speak one of those surnames to finalize the selection.
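The two-stage "Jeremy" scenario could be sketched the same way, again using the scriptable recognizer rather than anything NetNewsWire actually exposes; the names and prompts are purely illustrative:

```applescript
tell application "SpeechRecognitionServer"
	set firstNames to {"Sam", "Jeremy"}
	set thePick to listen for firstNames with prompt "Which feed?" giving up after 10
	if thePick is "Jeremy" then
		-- Narrow the choice to the surnames that match the spoken first name
		set thePick to listen for {"Allaire", "Zawodny"} with prompt "Which Jeremy?" giving up after 10
	end if
end tell
```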

As software services multiply, so do their control vocabularies. XML manages this proliferation using namespaces. Per-application or per-service speech-enabling can use the same strategy to reduce the hard problem of open-ended speech recognition to an easier one that can be solved in useful and practical ways.

Jon Udell is an author, information architect, software developer, and new media innovator.
