fine-tune its own language models. “If the last two words I saw were ‘the dog’ and I have a little ambiguity about the next word, it’s more likely to be ‘ran’ than ‘pan,’” Cohen explains. “The language models tell you the probabilities of all possible next words. We have been able to train enormous language models for Voice Search because we have so much textual data from Google.com.”
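What Cohen is describing is an n-gram language model. The toy sketch below illustrates the idea with a trigram model in Python; it is an invented illustration, not Google’s production system, and the corpus and function names are made up for the example. Given the two-word history “the dog,” the counts make “ran” the likely continuation, while “pan” gets no probability in that context.

```python
from collections import Counter, defaultdict

# Tiny toy corpus standing in for the web-scale text Cohen describes.
corpus = (
    "the dog ran to the park . "
    "the dog ran after the ball . "
    "the cat knocked over the pan . "
    "she set the pan on the stove ."
).split()

# Count how often each word follows a given two-word history (a trigram model).
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def next_word_probs(history):
    """Return P(word | history) estimated from raw trigram counts."""
    followers = counts[history]
    total = sum(followers.values())
    if not total:  # unseen history: no estimate from this toy model
        return {}
    return {word: n / total for word, n in followers.items()}

probs = next_word_probs(("the", "dog"))
print(probs.get("ran", 0.0))  # 1.0 -- "ran" always follows "the dog" here
print(probs.get("pan", 0.0))  # 0.0 -- "pan" never follows "the dog" here
```

A production recognizer would smooth these estimates and train on vastly more text, but the principle is the same: the language model ranks candidate next words so the recognizer can resolve acoustic ambiguity.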
Over time, speech recognition capabilities have popped up in more and more Google products. When Google Voice went public in the spring of 2009, it included a voicemail transcription feature courtesy of Cohen’s team. Early in 2010, YouTube began using Google’s transcription engine to publish written transcripts alongside every YouTube video, and YouTube viewers now have the option of seeing the transcribed text on screen, just like closed-captioning on television.
But mobile is still where most of the action is. Google’s Voice Actions service, introduced last August, lets Android users control their phones via voice—for instance, they can initiate calls, send e-mail and text messages, call up music, or search maps on the Web. (This feature is called Voice Commands on some phones.) And the Voice Input feature on certain Android phones adds a microphone button to the virtual keypad, allowing users to speak within any app where text entry is required.
“In general, our vision for [speech recognition on] mobile is complete ubiquity,” says Cohen. “That’s not where we are now, but it is where we are trying to get to. Anytime the user wants to interact by voice, they should be able to.” That even includes interacting with speakers of other languages. Cohen says Google’s speech recognition researchers work closely with their colleagues in machine translation, the subject of the next article in this series. The day isn’t far off, he says, when the two teams will be able to release a “speech in, speech out” application that combines speech recognition, machine translation, and speech synthesis for near-real-time translation between people speaking different languages.
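As a rough sketch of how such a “speech in, speech out” pipeline composes, the three systems chain end to end. Every name below is a hypothetical placeholder for illustration, not a real Google API.

```python
# Hypothetical "speech in, speech out" pipeline: three independently
# built systems chained together. All function names are placeholders.

def recognize_speech(audio: bytes, lang: str) -> str:
    """Speech recognition: source-language audio -> source-language text."""
    raise NotImplementedError("stand-in for an ASR engine")

def translate(text: str, source: str, target: str) -> str:
    """Machine translation: source-language text -> target-language text."""
    raise NotImplementedError("stand-in for an MT engine")

def synthesize(text: str, lang: str) -> bytes:
    """Speech synthesis: target-language text -> target-language audio."""
    raise NotImplementedError("stand-in for a TTS engine")

def speech_to_speech(audio: bytes, source: str, target: str) -> bytes:
    """Speech in, speech out: compose the three stages in order."""
    text = recognize_speech(audio, source)
    translated = translate(text, source, target)
    return synthesize(translated, target)
```

The appeal of this design is modularity: each stage can be improved independently, though recognition errors inevitably propagate into the translation that follows.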
“The speech effort could be viewed as something that enhances almost all of Google’s services,” says Cohen. “We can organize your voice mails, we can show you the information on the audio track of a YouTube video, you can do searches by voice. A large portion of the world’s information is spoken—that’s the bottom line. It was a big missing piece of the puzzle, and it needs to be included. It’s an enabler of a much wider array of usage scenarios, and I think that what we’ll see over time is all kinds of new applications that people would never have thought of before,” all of them powered by user-provided training data. That is precisely what Schmidt had in mind in Berlin when he quoted sci-fi author William Gibson: “Google is made of us, a sort of coral reef of human minds and their products.”
Coming in Part 2: A look at the role of big data in Google’s machine translation effort, led by Franz Josef Och.
[Update, 2/28/11: A convenient single-page version of all three parts of “Inside Google’s Age of Augmented Humanity” is now available.]