Diffbot Is Using Computer Vision to Reinvent the Semantic Web

started running it out of a dorm at Stanford. And people started adding a bunch of different kinds of URLs to Diffbot outside of classes, like they might add Craigslist if they were searching for a job or a product, or Facebook if they wanted to see if their ex’s profile had changed.

X: So I assume the name “Diffbot” related to comparing the old and new versions of a website and detecting the differences?

MT: Yes, but just doing deltas on Web pages doesn’t work too well. It turns out that on the modern Web, every page refresh changes the ads and the counters. You have to be a little more intelligent.
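To make that concrete, here is a minimal Python sketch (the URL and fetch helper are hypothetical, not anything Diffbot has described) of what a naive delta looks like: two snapshots of the same page, fetched moments apart and line-diffed with the standard library. Because rotating ad slots, tracking parameters, and view counters change on every refresh, most of the output is churn rather than a change a person would care about.

```python
import difflib
import urllib.request

URL = "https://example.com/some-page"  # hypothetical page to watch

def fetch(url: str) -> list[str]:
    """Fetch a page and return its HTML as a list of lines."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace").splitlines()

# Two snapshots of the same URL taken at different times.
old_snapshot = fetch(URL)
new_snapshot = fetch(URL)

# A naive line-level delta: ad markup, tracking pixels, and counters differ
# on every refresh, so almost everything shows up as "changed".
noisy_diff = list(difflib.unified_diff(old_snapshot, new_snapshot, lineterm=""))
print(f"{len(noisy_diff)} diff lines, most of them meaningless churn")
```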

That’s where understanding the page comes into play. I was studying machine learning at Stanford, and in particular one project I had worked on was the vision system for the self-driving car [Stanford’s entry in the 2007 DARPA Urban Challenge]. This was the stereo camera system that would compute the depth of a scene and say, ‘This is a cactus, this is drivable dirt, this is not drivable dirt, this is a cliff, this is a very narrow passageway.’ I realized that one way of making Diffbot generalizable was to apply computer vision to Web pages. Not to say, ‘This is a cactus and this is a pedestrian,’ but to say, ‘This is an advertisement and this is a footer and this is a product.’

A human being can look at a Web page and very easily tell what type of page it is without even looking at the text, and that is what we are teaching Diffbot to do. The goal is to build a machine-readable version of the entire Web.
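The interview doesn't describe Diffbot's actual model, so the following is only a toy sketch of the idea: describe each rendered block by its visual geometry alone (position, size, aspect ratio; the feature values and labels below are invented for illustration) and let an off-the-shelf classifier guess whether it is an advertisement, a footer, or a product.

```python
from sklearn.neighbors import KNeighborsClassifier

# Each rendered element is described by purely visual features measured from
# the laid-out page, not by its text:
#   [x, y, width, height, distance_from_top, aspect_ratio]
# All numbers and labels here are made up for illustration.
TRAIN_FEATURES = [
    [300.0, 0.0, 728.0, 90.0, 0.0, 8.1],         # banner-shaped box at the top
    [0.0, 2400.0, 1280.0, 120.0, 2400.0, 10.7],   # full-width strip at the bottom
    [450.0, 600.0, 400.0, 400.0, 600.0, 1.0],     # large square block mid-page
]
TRAIN_LABELS = ["advertisement", "footer", "product"]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(TRAIN_FEATURES, TRAIN_LABELS)

# A new element: wide, short, and pinned to the very bottom of the page.
candidate = [[0.0, 3100.0, 1280.0, 100.0, 3100.0, 12.8]]
print(clf.predict(candidate))  # nearest training example is the footer-shaped box
```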

X: Isn’t that what Tim Berners-Lee has been talking about for years—building a Semantic Web that’s machine-readable?

MT: It seems that every three years or so a new Semantic Web technology gets hyped up again. There was RSS, RDF, OWL, and now it’s Open Graph and the Knowledge Graph. The central problem—why none of these have really gone mainstream—is that you are requiring humans to tag the content twice, once for the machine’s benefit and once for the actual humans. Because you are placing so much onus on the content creators, you are never going to have all of the content in any given system. So it will be fragmented into different Semantic Web file formats, and because of that you will never have an app that allows you to search and evaluate all that information.
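To see the double-tagging burden concretely, consider a made-up product page (the HTML below is invented, not from any real site): the publisher has to state the same facts once in Open Graph meta tags for machines and again in the visible markup for people, and nothing forces the two to stay in sync or to exist at all.

```python
from bs4 import BeautifulSoup

# A made-up product page: the publisher describes it twice, once in
# machine-readable Open Graph tags and once in the visible HTML.
HTML = """
<html>
  <head>
    <title>Acme Anvil - 40% off</title>
    <meta property="og:type" content="product">
    <meta property="og:title" content="Acme Anvil">
    <meta property="og:description" content="A very heavy anvil.">
  </head>
  <body>
    <h1>Acme Anvil - 40% off</h1>
    <p>A very heavy anvil. Ships in two days.</p>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# What a Semantic Web consumer sees: only whatever the author bothered to tag.
og_tags = {tag["property"]: tag["content"]
           for tag in soup.find_all("meta", property=True)}
print(og_tags)

# What a human sees: the visible page, which can say more (or something else).
print(soup.h1.get_text(), "/", soup.p.get_text())
```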

But what if you analyze the page itself? That is where we have an opportunity, by applying computer vision to eliminate the problem of manual tagging. And we have reached a certain point in the technology continuum where it is actually possible—where the CPUs are fast enough and the machine learning technology is good enough that we have a good shot at doing it with high accuracy.

X: Why are you so convinced that a human-tagged Semantic Web would never work?

MT: The number one point is that people are lazy. The second is that people lie. Google used to read the meta tags and keywords at the top of a Web page, and so people would start stuffing those areas with everything. It didn’t correspond to what actual humans saw. The same thing holds for Semantic Web formats. Whenever you have things indexed separately, you start to see spam. By using a robot to look at the page, you are keeping it above that.

X: Talk about the computer vision aspect of Diffbot. How literal is the comparison to the cameras and radar on robot cars?

MT: We use the very same techniques used in computer vision, for example object detection and edge detection. If you are a customer, you give us a URL to analyze. We render the page using a virtual WebKit browser in the cloud. It will render the page, run the JavaScript, and lay everything out with the CSS rules and everything. Then we have these hooks into WebKit that
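The interview doesn't spell out Diffbot's in-house renderer or its WebKit hooks, but the general recipe MT describes can be approximated with an off-the-shelf headless browser. The sketch below uses Playwright's WebKit engine as a stand-in: render the page, let the JavaScript and CSS settle, then read back each element's laid-out bounding box, which is the kind of raw visual input a region classifier would consume.

```python
from playwright.sync_api import sync_playwright

URL = "https://example.com"  # placeholder; any page to render

with sync_playwright() as p:
    # Headless WebKit standing in for Diffbot's own cloud renderer.
    browser = p.webkit.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")  # let scripts and styles settle

    # Read back the laid-out geometry of each element: the visual raw material
    # a model could classify as advertisement, footer, product, and so on.
    for handle in page.query_selector_all("img, a, p, div"):
        box = handle.bounding_box()  # None if the element isn't rendered
        if box:
            tag = handle.evaluate("el => el.tagName")
            print(tag, round(box["x"]), round(box["y"]),
                  round(box["width"]), round(box["height"]))

    browser.close()
```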

Author: Wade Roush
