Inside Google’s Age of Augmented Humanity: Part 1

perception, and second, related to sounds,” Cohen says today. “And the natural thing was speech recognition.”

Cohen started studying speech at Menlo Park’s SRI International in 1984, as the principal investigator in a series of DARPA-funded studies in acoustic modeling. By that time, a fundamental change in the science of speech was already underway, he says. For decades, early speech researchers had hoped that it would be possible to teach computers to understand speech by giving them linguistic knowledge—general rules about word usage and pronunciation. But starting in the 1970s, an engineering-oriented camp had emerged that rejected this approach as impractical. “These engineers came along, saying, ‘We will never know everything about those details, so let’s just write algorithms that can learn from data,’” Cohen recounts. “There was friction between the linguists and the engineers, and the engineers were winning by quite a bit.”

But around the mid-1980s, Cohen says, “the linguists and the engineers started talking to each other.” The linguists realized that their rules-based approach was too complex and inflexible, while the engineers realized their statistical models needed more structure. One result was the creation of context-dependent statistical models of speech that, for the first time, could take “co-articulation” into account—the fact that the pronunciation of each phoneme, or sound unit, in a word is influenced by the preceding and following phonemes. There would no longer be just one statistical profile for the sound waves constituting a long “a” sound, for example; there would be different models for “a” for all of the contexts in which it occurs.

“The engineers, to this day, still follow the fundamental statistical, machine-learning, data-driven approaches,” Cohen says. “But by learning a bit about linguistic structure—that words are built from phonemes and that particular realizations of these phonemes are context-dependent—they were able to build richer models that could learn much more of the fine details about speech than they had before.”
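To make that structure concrete, here is a minimal Python sketch of context-dependent phoneme units, often called “triphones” in recognizers of that era. The phoneme symbols, function name, and boundary markers are illustrative assumptions, not code from SRI’s or Google’s systems.

```python
# A minimal sketch of context-dependent phoneme units ("triphones").
# Instead of one statistical model per phoneme, keep a separate model
# for each phoneme in its left/right context, so co-articulation can
# be learned from data. Phoneme symbols loosely follow ARPAbet; this
# is illustrative, not production speech-recognition code.

def to_triphones(phonemes):
    """Expand a phoneme sequence into (left, center, right) units."""
    padded = ["<s>"] + list(phonemes) + ["</s>"]  # utterance boundaries
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

# "make" -> /m/ /ey/ /k/: the long "a" (/ey/) is no longer one generic
# model; it becomes "/ey/ preceded by /m/ and followed by /k/", which is
# a different unit from the /ey/ in, say, "say" or "able".
print(to_triphones(["m", "ey", "k"]))
# [('<s>', 'm', 'ey'), ('m', 'ey', 'k'), ('ey', 'k', '</s>')]

# Each distinct unit gets its own acoustic model (historically a hidden
# Markov model over audio features), trained from recorded speech:
acoustic_models = {unit: [] for unit in to_triphones(["m", "ey", "k"])}
```

In practice, the number of possible contexts exploded combinatorially, so systems of that period clustered acoustically similar contexts and let them share models; the payoff Cohen describes is that even these shared models captured far more pronunciation detail than a single context-independent model per phoneme ever could.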

Cohen took much of that learning with him when he co-founded Nuance, a Menlo Park, CA-based spinoff of SRI International, in 1994. (Much later, SRI would also spin off Siri, the personal assistant startup bought last year by Apple.) He spent a decade building up the company’s strength in telephone-based voice-response systems for corporate call centers—the kind of technology that lets customers get flight status updates from airlines by speaking the flight numbers, for example.

The Burlington, MA-based company now known as Nuance Communications was formerly a Nuance competitor called ScanSoft; it adopted the Nuance name after acquiring the Menlo Park startup in 2005. But by that time Cohen had left Nuance for Google. He says several factors lured him there. One was the fact that statistical speech-recognition models were inherently limited by computing speed and memory, and by the amount of training data available. “Google had way more compute power than anybody had, and over time, the ability to have way more data than anybody had,” Cohen says. “The biggest bottleneck in the research was, ‘How can we build a much bigger model?’ It was definitely an opportunity.”

But there were other aspects to this opportunity. After 10 years working on speech recognition for landline telephone systems at Nuance, Cohen wanted to try something different, and “mobile was looking more and more important as a platform, as a place where speech technology would be

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH, and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats.