voice translation, when we do picture identification, all [the smartphone] does is send a request to the supercomputers that then do all the work.”
And the key thing about those supercomputers—though Schmidt alluded to it only briefly—is that they’re stuffed with data, petabytes of data about what humans say and write and where they go and what they like. This data is drawn from the real world, generated by the same people who use all of Google’s services. And the company’s agility when it comes to collecting, storing, and analyzing it is perhaps its greatest but least appreciated capability.
The power of this data was the one consistent theme in a series of interviews I conducted in late 2010 with Google research directors in the fundamental areas of speech recognition, machine translation, and computer vision. It turns out that many of the problems that have stymied researchers in cognitive science and artificial intelligence for decades—understanding the rules behind grammar, for instance, or building models of perception in the visual cortex—give way before great volumes of data, which can simply be mined for statistical connections.
Unlike the large, structured language corpuses used by the speech-recognition or machine-translation experts of yesteryear, this data doesn’t have to be transcribed or annotated to yield insights. The structure and the patterns arise from the way the data was generated, and the contexts in which Google collects it. It turns out, for example, that meaningful relationships can be extracted from search logs—the more people who search for “IBM stock price” or “Apple Computer stock price,” the clearer it becomes that there is a class of things, i.e., companies, with an attribute called “stock price.” Google’s algorithms glean this from Google’s own users in a process computer scientists call “unsupervised learning.”
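To make the idea concrete, here is a minimal sketch, in Python, of the kind of pattern mining described above. It is not Google’s actual pipeline: the sample queries and the list of attribute phrases are hypothetical stand-ins, and a real system would discover such phrases statistically from billions of queries rather than take them from a hand-picked list. The sketch simply shows how grouping query prefixes by a shared attribute phrase lets a class of entities (companies, in this case) emerge without any labeled data.

```python
# Illustrative sketch only: infer entity classes from raw search queries by
# grouping the prefixes that co-occur with a shared attribute phrase.
from collections import defaultdict

# Hypothetical sample of search-log queries; real logs are vastly larger.
queries = [
    "IBM stock price",
    "Apple Computer stock price",
    "Google stock price",
    "IBM headquarters",
    "Apple Computer headquarters",
    "paris weather",
]

def mine_attribute_classes(queries, attribute_phrases, min_entities=2):
    """Group query prefixes by the attribute phrase they are paired with."""
    classes = defaultdict(set)
    for q in queries:
        for attr in attribute_phrases:
            if q.lower().endswith(" " + attr):
                entity = q[: -(len(attr) + 1)].strip()
                classes[attr].add(entity)
    # Keep only attributes shared by several entities: evidence of a "class"
    # of things (e.g. companies) that all have that attribute.
    return {a: ents for a, ents in classes.items() if len(ents) >= min_entities}

if __name__ == "__main__":
    # The attribute phrases here are assumed for illustration.
    results = mine_attribute_classes(
        queries, ["stock price", "headquarters", "weather"]
    )
    for attr, entities in results.items():
        print(f'Entities sharing the attribute "{attr}": {sorted(entities)}')
```

Run on the sample above, the sketch reports that “IBM,” “Apple Computer,” and “Google” share the “stock price” attribute, while “paris weather” is discarded because only one entity supports it. That filtering step is a toy version of the statistical evidence a large-scale unsupervised learner accumulates from real user behavior.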
“This is a form of artificial intelligence,” Schmidt observed in Berlin. “It’s intelligence where the computer does what it does well and it helps us think better…The computer and the human, together, each does something better because the other is helping.”
In a series of three articles this week, I’ll look more closely at this human-computer symbiosis and how Google is exploiting it, starting with the area of speech recognition. (Subsequent articles will examine machine translation and computer vision.) Research in these areas is advancing so fast that the outlines of Schmidt’s vision of augmented humanity are already becoming clear, especially for owners of Android phones, where Google deploys its new mobile technologies first and most deeply.
Obviously, Google has competition in the market for mobile information services. Over time, its biggest competitor in this area is likely to be Apple, which controls one of the world’s most popular smartphone platforms and recently acquired, in the form of a startup called Siri, a search and personal-assistant technology built on many of the same machine-learning principles espoused by Google’s researchers.
But Google has substantial assets in its favor: a large and talented research staff, one of the world’s largest distributed computing infrastructures, and most importantly, a vast trove of data for unsupervised learning. It seems likely, therefore, that much of the innovation making our phones more powerful over the coming years will emerge from Mountain View.
The Linguists and the Engineers
Today Michael Cohen leads Google’s speech technology efforts. But he actually started out as a composer and guitarist, making a living for seven years writing music for piano, violin, orchestra, and jazz bands. As a musician, he says, he was always interested in the mechanics of auditory perception—why certain kinds of sound make musical sense to the human brain, while others are just noise.
A side interest in computer music eventually led him into computer science proper. “That very naturally led me, first of all, to wanting to work on something relating to