Vlingo’s Adaptive Speech Recognition Promises an End to Typing on Your Phone Keyboard

It’s the technology journalist’s downfall: The hot technology that you suspect isn’t quite ripe but you can’t help writing about anyway.

In 2003, when I was a senior editor at MIT’s Technology Review (and, in the interest of full disclosure, Bob was editor in chief), speech recognition and natural-language processing were firmly in that category, yet we went ahead and published a cover story I’d written called “Computers That Speak Your Language.” Exercising a little more enthusiasm than usual, I predicted that computers equipped with speech recognition software “may soon be able to interpret almost any conversation, or to retrieve almost any information a Web user wants.”

But as is well known to anyone who’s been stuck in the telephone “death spiral”—the actual term experts use for one of those interactive voice-response conversations where you can’t get the frakkin’ computer to transfer you to a real person—most voice interfaces are still a long way from maturity. To the charges of overeagerness and a little gullibility, I plead guilty.

And yet—as hesitant as I am to go there once again—I am compelled to report that speech recognition technology may be taking a real lurch forward. Last week, the folks at Harvard Square startup Vlingo gave me a demonstration of a cell-phone-based voice-recognition system that not only works, but gets more accurate as more people use it.

Vlingo, formerly called Moebius, is officially emerging from stealth mode today and unveiling Vlingo Find, an application that users of Sprint mobile phones can download at www.vlingomobile.com. The software, which works over the phone’s data connection, allows users to search for business phone numbers nationwide simply by speaking the business name or category and the city name—for example, “Thai food in Cambridge, Massachusetts.”

If that sounds like a glorified 411 service without the human operator, that’s what it is. But that’s more remarkable than it sounds. After all, there is no human operator—nobody listening to the words and typing them into a computer somewhere. Instead, software runs the digitized sound of the user’s voice through statistical language models to make its best guess at what the user said, then sends that text back to the user’s screen for verification. If it’s correct, the user can press “search” to get the phone number. If there’s an error, it can be corrected by using the phone keypad cursor buttons to scroll back to the mistaken word and speaking it again.

The system works—in fact, it works so well that Vlingo officials who wanted to show me how easy it is to correct an error couldn’t get the system to make a transcription mistake in the first place. And Vlingo Find is just a taste of what Vlingo’s planning. The company is in talks with mobile phone manufacturers and cellular carriers to apply speech recognition to any task that normally requires tedious triple-typing on a standard mobile phone’s 12-button keypad, such as entering a URL in a Web browser or searching an online music store for a specific MP3. Vlingo “unlocks access to the mobile Internet with the power of voice,” says CEO David Grannan.

If you’ve had any experience with older speech recognition systems, you know that they achieve accuracy in one of two ways: either by drastically constraining the context so that only a few specific words might come up (“For billing inquiries, say 1, for technical support, say 2”) or through a long training process in which the speaker reads a prepared text and the software learns to recognize that speaker’s (and only that speaker’s) speech patterns.

If there were a Nobel Prize for software, a speaker-independent speech recognition system with a truly unconstrained vocabulary would win it. Vlingo’s system doesn’t do that, but it comes close, using an “adaptive” approach pioneered by the company’s technical founder, Michael Phillips. When I interviewed him last, for my 2003 story, Phillips was the principal scientist at interactive voice-response company Speechworks (which was later acquired for $132 million in stock by Scansoft, which renamed itself Nuance after acquiring a Speechworks competitor by that name). Speechworks’ technology was both grammar-based—loosely put, it expected words to occur in a certain order, such as subject-verb-object—and statistical, in that it used machine-learning techniques to comb databases of phrases for likely matches.

Today, Vlingo’s system uses both techniques, but with an ingenious addition: the ability to improve over time using the corrections users enter into their phones. “We are totally into adaptation,” says Phillips. “It’s the only way to make this work.” Vlingo’s so-called “adaptive hierarchical language models” are far too complex to run on mobile processors, so the heavy lifting actually takes place on servers accessed over the cellular data network. Remarkably, this doesn’t slow things down much—Vlingo can send back a transcription in only a fraction of the time it takes to complete a search once the query is actually submitted.
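To make the idea of adaptation concrete, here is a minimal sketch in Python. It is not Vlingo's algorithm, which the company has not described in detail; it is a toy word-frequency model, with hypothetical names (`AdaptiveRescorer`, `learn`, `best`), that re-ranks candidate transcriptions and shifts its preferences when a user confirms a correction.

```python
import math
from collections import Counter


class AdaptiveRescorer:
    """Toy word-frequency model that re-ranks candidate transcriptions.

    Illustrative only: a real recognizer would combine acoustic scores
    with far richer language models than smoothed word counts.
    """

    def __init__(self, seed_phrases):
        # "Canned" bootstrap data, used before any user has corrected anything.
        self.counts = Counter()
        for phrase in seed_phrases:
            self.learn(phrase)

    def learn(self, phrase):
        """Fold a confirmed or corrected phrase back into the model."""
        for word in phrase.lower().split():
            self.counts[word] += 1

    def score(self, phrase):
        """Log-probability of a phrase under add-one-smoothed word counts."""
        total = sum(self.counts.values())
        vocab = len(self.counts) + 1  # +1 leaves room for unseen words
        return sum(
            math.log((self.counts[word] + 1) / (total + vocab))
            for word in phrase.lower().split()
        )

    def best(self, candidates):
        """Return the highest-scoring candidate (first one wins on ties)."""
        return max(candidates, key=self.score)


model = AdaptiveRescorer(
    ["thai food in cambridge", "pizza in boston", "coffee in cambridge"]
)

# Two acoustically similar guesses; the model has seen neither town name.
guesses = ["sushi in summerville", "sushi in somerville"]
before = model.best(guesses)

# A user scrolls back, re-speaks the word, and confirms the corrected text.
model.learn("sushi in somerville")
after = model.best(guesses)

print(before)  # sushi in summerville
print(after)   # sushi in somerville
```

In this sketch a single correction is enough to tip the ranking. A shared, server-side model of the kind the article describes would pool corrections from many users.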

Vlingo is staffing up using $6.5 million in Series A funding from local venture firms Charles River Ventures and Sigma Ventures. “The technology has come a long way,” says Izhar Armony, general partner at Charles River, which was also an investor in Speechworks. “Mike’s brilliance was to realize that you can have the best of both worlds, user independence and an unrestricted grammar, if there is collective learning inside the network. The system gets better and better as more people call in. This is why we were so excited [about funding Vlingo].”

Of course, some bootstrapping using canned training information was needed to get the system up and running, so the system may still be raw around the edges. But Phillips says he expects it to improve quickly once people start using Vlingo Find. Eventually, Phillips says, Vlingo could put an end to triple-typing. “The goal is to get carriers to buy into this as a standard part of the software ecosystem on their handsets,” he says. And that might lead to a generation of phones that really do speak your language.

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/