perception, and second, related to sounds,” Cohen says today. “And the natural thing was speech recognition.”
Cohen started studying speech at Menlo Park’s SRI International in 1984, as the principal investigator in a series of DARPA-funded studies in acoustic modeling. By that time, a fundamental change in the science of speech was already underway, he says. For decades, early speech researchers had hoped that it would be possible to teach computers to understand speech by giving them linguistic knowledge—general rules about word usage and pronunciation. But starting in the 1970s, an engineering-oriented camp had emerged that rejected this approach as impractical. “These engineers came along, saying, ‘We will never know everything about those details, so let’s just write algorithms that can learn from data,’” Cohen recounts. “There was friction between the linguists and the engineers, and the engineers were winning by quite a bit.”
But around the mid-1980s, Cohen says, “the linguists and the engineers started talking to each other.” The linguists realized that their rules-based approach was too complex and inflexible, while the engineers realized their statistical models needed more structure. One result was the creation of context-dependent statistical models of speech that, for the first time, could take “co-articulation” into account—the fact that the pronunciation of each phoneme, or sound unit, in a word is influenced by the preceding and following phonemes. There would no longer be just one statistical profile for the sound waves constituting a long “a” sound, for example; there would be different models for “a” for all of the contexts in which it occurs.
“The engineers, to this day, still follow the fundamental statistical, machine-learning, data-driven approaches,” Cohen says. “But by learning a bit about linguistic structure—that words are built in phonemes and that particular realizations of these phonemes are context-dependent—they were able to build richer models that could learn much more of the fine details about speech than they had before.”
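To make the idea concrete, here is a minimal illustrative sketch in Python of the difference between a one-model-per-phoneme inventory and a context-dependent (“triphone”) inventory of the kind described above. It is not Cohen’s or SRI’s actual code; the word list and the simplified phoneme transcriptions are invented purely for the example.

```python
# Illustrative only: contrasts one-model-per-phoneme with context-dependent
# ("triphone") units, where each phoneme gets a separate model keyed on its
# left and right neighbors, so co-articulation effects can be learned from data.

def to_triphones(phonemes):
    """Expand a phoneme sequence into left-context, phoneme, right-context units."""
    padded = ["<s>"] + phonemes + ["</s>"]  # boundary markers
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# Simplified, made-up transcriptions; "ey" stands in for the long "a" sound.
lexicon = {"able": ["ey", "b", "l"], "eight": ["ey", "t"]}

# Context-independent inventory: one statistical model per phoneme.
monophone_models = {p for phones in lexicon.values() for p in phones}

# Context-dependent inventory: one model per triphone actually observed.
# In a real recognizer each of these units would be an HMM with its own
# acoustic distributions, trained from data.
triphone_models = {u for phones in lexicon.values() for u in to_triphones(phones)}

print(sorted(monophone_models))   # ['b', 'ey', 'l', 't']
print(sorted(triphone_models))
# ['<s>-ey+b', '<s>-ey+t', 'b-l+</s>', 'ey-b+l', 'ey-t+</s>']
# The long "a" (ey) now has different models in "able" and "eight",
# because its realization depends on the following phoneme.
```

The point is simply that the same phoneme spawns a separate statistical model for each neighboring context in which it appears, which is what allows a data-driven system to capture co-articulation.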
Cohen took much of that learning with him when he co-founded Nuance, a Menlo Park, CA-based spinoff of SRI International, in 1994. (Much later, SRI would also spin off Siri, the personal assistant startup bought last year by Apple.) He spent a decade building up the company’s strength in telephone-based voice-response systems for corporate call centers—the kind of technology that lets customers get flight status updates from airlines by speaking the flight numbers, for example.
The Burlington, MA-based company now called Nuance Communications was formerly a Nuance competitor called ScanSoft, and it adopted the Nuance name after it acquired the Menlo Park startup in 2005. But by that time Cohen had left Nuance for Google. He says several factors lured him in. One was the fact that statistical speech-recognition models were inherently limited by computing speed and memory, and by the amount of training data available. “Google had way more compute power than anybody had, and over time, the ability to have way more data than anybody had,” Cohen says. “The biggest bottleneck in the research being, ‘How can we build a much bigger model?’ it was definitely an opportunity.”
But there were other aspects to this opportunity. After 10 years working on speech recognition for landline telephone systems at Nuance, Cohen wanted to try something different, and “mobile was looking more and more important as a platform, as a place where speech technology would be