Inside Google’s Age of Augmented Humanity, Part 3: Computer Vision Puts a “Bird on Your Shoulder”

It’s a staple of every film depiction of killer androids since Terminator: the moment when the audience watches through the robot’s eyes as it scans a human face, compares the person to a photo stored in its memory, and targets its unlucky victim for elimination.

That’s computer vision in action—but it’s actually one of the easiest examples, from a computational point of view. It’s a simple case of testing whether an acquired image matches a stored one. What if the android doesn’t know whether its target is a human or an animal or a rock, and it has to compare everything it sees against the whole universe of digital images? That’s the more general problem in computer vision, and it’s very, very hard.
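To make the contrast concrete, here is a minimal sketch, in Python, of that "easy" one-to-one case: deciding whether a captured frame resembles a single stored reference image by comparing crude perceptual fingerprints. The file names, hash size, and threshold are illustrative assumptions, and this is not the technique Google uses; recognizing arbitrary objects against the whole universe of images demands far more sophisticated machinery.

    # Illustrative sketch only: one-to-one image matching via a difference hash.
    import numpy as np
    from PIL import Image

    def dhash(path, hash_size=8):
        """Compute a coarse fingerprint from the image's horizontal gradients."""
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        pixels = np.asarray(img, dtype=np.int16)
        # Compare each pixel to its right-hand neighbor to get a bit pattern.
        return (pixels[:, 1:] > pixels[:, :-1]).flatten()

    def looks_like(query_path, reference_path, max_differing_bits=10):
        """Return True if the query image is 'close enough' to the stored reference."""
        distance = np.count_nonzero(dhash(query_path) != dhash(reference_path))
        return distance <= max_differing_bits

    # Hypothetical usage:
    # print(looks_like("camera_frame.jpg", "stored_target.jpg"))

A fixed threshold works tolerably when there is exactly one target to check against; it breaks down completely once the system must decide, from scratch, what kind of thing it is looking at.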

But just as we saw with the case of statistical machine translation in Part 2 of this series, real computer science is catching up with, and in some cases outpacing, science fiction. And here, again, Google’s software engineers are helping push the boundaries of what’s possible. Google made its name helping people find textual data on the Web, and it makes nearly all of its money selling text-based ads. But the company also has a deep interest in programming machines to comprehend the visual world—not so that they can terminate people more easily (not until Skynet takes over, anyway) but so that they can supply us with more information about all the unidentified or under-described objects we come across in our daily lives.

I’ve already described how Google’s speech recognition tools help you initiate searches by speaking to your smartphone rather than pecking away at its tiny keyboard. With Google Goggles, a visual search tool that debuted on Android mobile phones in December 2009 and on the Apple iPhone in October 2010, your phone’s built-in camera becomes the input channel, and the images you capture become the search queries. For limited categories of things—bar codes, text on signs or restaurant menus, book covers, famous paintings, wine labels, company logos—Goggles already works extremely well. And Google’s computer vision team is training its software to recognize many more types of things. In the near future, according to Hartmut Neven, the company’s technical lead manager for image recognition, Goggles might be able to tell a maple leaf from an oak leaf, or look at a chess board and suggest your next move.

Goggles is the most experimental, and the most audacious, of the technologies that Google CEO Eric Schmidt described in a recent speech in Berlin as the harbingers of an age of “augmented humanity.” Even more than the company’s speech recognition or machine translation tools, the software that Neven’s team is building—which is naturally tailored for smartphones and other sensor-laden mobile platforms—points toward a future where Google may be at hand to mediate nearly every instance of human curiosity.

“It is indeed not many years out where you can have this little bird looking over your shoulder, interpreting the scenes that you are seeing and pretty much for every piece in the scene—art, buildings, the people around you,” Neven told me in an interview late last year. “You can see that we will soon approach the point where the artificial system knows much more about what you are looking at than you know yourself.”

Going Universal

Neven, like most of the polymaths at Google, started out studying subjects completely unrelated to search. In his case, it was classical physics, followed by a stint in theoretical neurobiology, where he applied methods from statistical physics to understanding how the brain makes sense of information from the nervous system.

“One of the most fascinating objects of study in nature is the human brain, understanding how we learn, how we perceive,” Neven says. “Conscious experience is one of the big riddles in science. I am less and less optimistic that we will ever solve them—they’re probably not even
