amenable to the scientific method. But any step toward illuminating those questions, I find extremely fascinating.”
He sees computer vision as one of the steps. “If you have a theory about how the brain may recognize something, it’s surely nice if you can write a software program that does something similar,” he says. “That by no means proves that the brain does it the same way, but at least you have reached an understanding of how, in principle, it could be done.”
It’s pretty clear that the brain doesn’t interpret optical signals by starting from abstract definitions of what constitutes an edge, a curve, an angle, or a color. Nor does it have the benefit of captions or other metadata. The point—which I won’t belabor again here, since we’ve already seen it at work in the cases of Google’s efforts in speech recognition and machine translation—is that Neven’s approach to image recognition was data-driven from the start, relying on computers to sift through the huge piles of 1s and 0s that make up digital images and sniff out the statistical similarities between them. “We have, early on, and sooner than other groups, banked very heavily on machine learning as opposed to model-based vision,” he says.
Trained in Germany, Neven spent the late 1990s and early 2000s at the University of Southern California, in labs devoted to computational vision and human-machine interfaces. After tiring of the grant-writing treadmill, he struck out on his own, co-founding a company called Eyematic around a unique and very specific application of computer vision: using video from a standard camcorder to “drive” computer-generated characters in 3D. When that technology failed to pay off, Neven started Neven Vision, which began from the same foundation—facial feature tracking—but wound up exploring areas as diverse as biometric tools for law enforcement and visual searches for mobile commerce. “What Goggles is today, we started out working on at Neven Vision on a much smaller scale,” he says. “Take an image of a Coke can, and be entered in a sweepstakes. Simple, early applications that would generate revenue.”
How much revenue Neven Vision actually generated isn’t on record—but the company did have a reputation for building some of the most accurate face recognition software on the market, which was Google’s stated reason for acquiring the company in 2006. The team’s first assignment, Neven says, was to put face recognition into Picasa—the photo management system Google had purchased a couple of years before.
Given how far his team’s computer vision tools have evolved since then, Neven Vision probably should have held out for more money in the acquisition, Neven jokes today. “We said, ‘We can do more than face recognition; one of our main products is visual mobile search.’ They knew it, but they kept a poker face and said, ‘All we want is the face recognition, we are just going to pay for that.’”
Once the Picasa project was done, Neven’s team had to figure out what to do next. His initial pitch to his managers was to build a visual search app for packaged consumer goods. That was when Google’s poker face came off. “We said, ‘Let’s do a verticalized app that supports users in finding information about products.’ And then one of our very senior engineers, Udi Manber, came to the meeting and said, ‘No, no, it shouldn’t be vertical. It’s in Google’s DNA to go universal. We understand if you can’t quite do it yet, but that should be the ambition.’” The team was being told, in other words, to build a visual search tool that could identify anything.
That was “a little bit of a scary prospect,” Neven says. But on the other hand, the team had already developed modules or “engines” that were pretty good at recognizing things within a few categories, such as famous structures (the Eiffel Tower, the Golden Gate Bridge). And it had seen the benefits of doing things at Google scale. Neven Vision’s original face recognition algorithm had achieved a