Diffbot Is Using Computer Vision to Reinvent the Semantic Web

You know how the Picturephone, a half-billion-dollar project at AT&T back in the 1960s and 1970s, turned out to be a huge commercial flop, but two-way video communication eventually came back with a vengeance in the form of Skype and FaceTime and Google Hangouts? Well, something similar is going on with the Semantic Web.

That’s the proposal, dating back almost to the invention of the Web in the 1990s, that the various parts of Web pages should be tagged so that machines, as well as people, can make inferences based on the information they contain. The idea has never gotten very far, mainly because the burden of tagging all that content would fall to humans, which makes it expensive and tedious. But now it looks like the original goal of making digital content more comprehensible to computers might be achievable at far lower cost, thanks to better software.

Diffbot is building that software. This unusual startup—the first ever to emerge from the Stanford-based accelerator StartX, back in 2009—is using computer vision technology similar to that used for robotics applications such as self-driving cars to classify the parts of Web pages so that they can be reassembled in other forms. AOL is one of the startup’s first big customers and its landlord. It’s using Diffbot’s technology to assemble Editions by AOL, the personalized, iPad-based magazine composed of content culled from AOL properties like the Huffington Post, TechCrunch, and Engadget.

NPR's top news page as interpreted by Diffbot

I went down to AOL’s Palo Alto campus last month to meet Diffbot’s founder and CEO, Mike Tung, and its vice president of products, John Davi. They didn’t deliberately set out to solve the Semantic Web problem, any more than the founders of Skype set out to build an affordable Picturephone. But their venture, which has attracted about $2 million in backing from Andy Bechtolsheim and a raft of other angel investing stars, is already on its way to creating one of the world’s largest structured indexes of unstructured Web content.

Without relying on HTML tags (which can actually be used to trick traditional Web crawling software), Diffbot can look at a news page and tell what’s a headline, what’s a byline, where the article text begins and ends, what’s an advertisement, and so forth. What practical use can companies make of that, and where’s the profit in it for Diffbot? Well, aside from AOL, the startup’s software is already being used in some interesting places: reading app maker Pocket (formerly Read It Later) uses it to extract article text from websites, and content discovery service StumbleUpon employs it to screen out spam.

In fact, companies pay Diffbot to analyze more than 100 million unique URLs per month. And that’s just the beginning. Building outward from its early focus on news articles, the startup is creating new algorithms that could make sense of many kinds of sites, such as e-commerce catalogs. The individual elements of those sites could then be served up in almost any context. Imagine a Siri for shopping, to take just one example. “We’re building a series of wedges that will add up to a complete view of the Web,” says Davi. “We are excited about having them all under our belt, so there can be a fully indexed, reverse-engineered Semantic Web.”

What follows is a highly compressed version of my conversation with Tung and Davi.

Xconomy: Where did you guys meet, and how did you end up working on Diffbot?

Mike Tung: I worked at Microsoft on Windows Vista right out of high school, then went to college at Cal and studied electrical engineering for two years, then went to Stanford to start a PhD in computer science, specializing in AI. When I first moved to Silicon Valley, I also worked at a bunch of startups. I was engineer number four at TheFind, which was a product search company that built the world’s largest product index. I worked on search at Yahoo and eBay, and also did a bunch of contract work. I took the patent bar and worked as a patent lawyer for a couple of years, writing 3G and 4G patents for Panasonic and Matsushita. I first met John when we were working at a startup called ClickTV, which was a video-player-search-engine thing. It was pretty advanced for its time.

Diffbot began when I was in grad school at Stanford [in 2005]. There was this one quarter where I was taking a lot of classes, so I made this tool for myself to keep track of all of them. I would put in the URL for the class website, and whenever a professor would upload new slides or content, Diffbot would find that and download it to my phone. I always felt like I knew what was going on in my classes without having to attend every single one.

It was useful, and my friends started asking me whether they could use it. So I turned it into a Web service and

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/