It’s a staple of every film depiction of killer androids since The Terminator: the moment when the audience watches through the robot’s eyes as it scans a human face, compares the person to a photo stored in its memory, and targets its unlucky victim for elimination.
That’s computer vision in action—but it’s actually one of the easiest examples, from a computational point of view. It’s a simple case of testing whether an acquired image matches a stored one. What if the android doesn’t know whether its target is a human or an animal or a rock, and it has to compare everything it sees against the whole universe of digital images? That’s the more general problem in computer vision, and it’s very, very hard.
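To see why the first problem is so much easier, consider a toy sketch in Python (using the Pillow and NumPy libraries; this is purely illustrative, not Google’s code or anyone’s actual targeting system). Checking whether a captured frame resembles one stored photo can be done with a crude fingerprint comparison:

```python
# Illustrative sketch only: comparing an acquired image against a single
# stored reference with a simple "average hash" fingerprint.
from PIL import Image
import numpy as np

def average_hash(path, size=8):
    """Shrink an image to a tiny grayscale grid and threshold it at its mean,
    yielding a small boolean fingerprint that tolerates minor lighting changes."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return pixels > pixels.mean()  # boolean 8x8 fingerprint

def looks_like(acquired_path, stored_path, max_differing_bits=10):
    """Return True if the captured image is close to the stored reference."""
    diff = np.count_nonzero(average_hash(acquired_path) != average_hash(stored_path))
    return diff <= max_differing_bits

# Example (hypothetical file names): compare a freshly captured frame
# against a stored target photo.
# print(looks_like("captured_frame.jpg", "stored_target.jpg"))
```

Recognizing an arbitrary object, by contrast, offers no single stored reference to compare against, and that is what makes the general problem so hard.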
But just as we saw with the case of statistical machine translation in Part 2 of this series, real computer science is catching up with, and in some cases outpacing, science fiction. And here, again, Google’s software engineers are helping push the boundaries of what’s possible. Google made its name helping people find textual data on the Web, and it makes nearly all of its money selling text-based ads. But the company also has a deep interest in programming machines to comprehend the visual world—not so that they can terminate people more easily (not until Skynet takes over, anyway) but so that they can supply us with more information about all the unidentified or under-described objects we come across in our daily lives.
I’ve already described how Google’s speech recognition tools help you initiate searches by speaking to your smartphone rather than pecking away at its tiny keyboard. With Google Goggles, a visual search tool that debuted on Android mobile phones in December 2009 and on the Apple iPhone in October 2010, your phone’s built-in camera becomes the input channel, and the images you capture become the search queries. For limited categories of things—bar codes, text on signs or restaurant menus, book covers, famous paintings, wine labels, company logos—Goggles already works extremely well. And Google’s computer vision team is training its software to recognize many more types of things. In the near future, according to Hartmut Neven, the company’s technical lead manager for image recognition, Goggles might be able to tell a maple leaf from an oak leaf, or look at a chess board and suggest your next move.
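Conceptually, the flow is simple: the photo itself is the query. Here is a hypothetical Python sketch of what such a client interaction might look like; the endpoint URL and response fields are placeholders I’ve invented for illustration, not Goggles’ actual interface:

```python
# Hypothetical sketch of a visual-search client: the captured photo is the query.
# The endpoint and response format below are invented placeholders.
import requests

HYPOTHETICAL_ENDPOINT = "https://example.com/visual-search"  # placeholder, not a real service

def search_by_image(image_path):
    """Upload a captured photo and return whatever labeled matches come back."""
    with open(image_path, "rb") as f:
        response = requests.post(HYPOTHETICAL_ENDPOINT, files={"image": f}, timeout=10)
    response.raise_for_status()
    return response.json()  # e.g. [{"label": "wine label", "confidence": 0.91}, ...]

# Example: a photo of a book cover or a restaurant menu becomes the query.
# for match in search_by_image("captured_photo.jpg"):
#     print(match["label"], match["confidence"])
```

The hard part, of course, happens on the server side, where the captured image has to be matched against indexed bar codes, book covers, paintings, logos, and the rest.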
Goggles is the most experimental, and the most audacious, of the technologies that Google CEO Eric Schmidt described in a recent speech in Berlin as the harbingers of an age of “augmented humanity.” Even more than the company’s speech recognition or machine translation tools, the software that Neven’s team is building—which is naturally tailored for smartphones and other sensor-laden mobile platforms—points toward a future where Google may be at hand to mediate nearly every instance of human curiosity.
“It is indeed not many years out where you can have this little bird looking over your shoulder, interpreting the scenes that you are seeing and pretty much for every piece in the scene—art, buildings, the people around you,” Neven told me in an interview late last year. “You can see that we will soon approach the point where the artificial system knows much more about what you are looking at than you know yourself.”
Going Universal
Neven, like most of the polymaths at Google, started out studying subjects completely unrelated to search. In his case, it was classical physics, followed by a stint in theoretical neurobiology, where he applied methods from statistical physics to understanding how the brain makes sense of information from the nervous system.
“One of the most fascinating objects of study in nature is the human brain, understanding how we learn, how we perceive,” Neven says. “Conscious experience is one of the big riddles in science. I am less and less optimistic that we will ever solve them—they’re probably not even