Diffbot Is Using Computer Vision to Reinvent the Semantic Web

the Web. Most of the stuff a typical Web crawler goes through never appears in any search results. Most of the Web is crap.

X: Are people finding uses for the technology that you may not have thought of?

MT: We had a hackathon last year where a guy came in and built an app for his father, who is blind. It runs Diffbot on a page and makes it into a radio station. For someone who is blind, browsing a news site is usually a really poor experience. The usual screen readers will read the entire page, including the nav bars and the ads and the text. The screen readers have no context about what is important on the page. Using Diffbot to be his father’s eyes, this guy could parse the page and read it in a way that is much more natural.

JD: AOL’s Editions app is one of the more interesting use cases that I’ve seen. It’s an iPad app that features both their own content as well as snippets from across the Web, in a daily issue. I spent five years running engineering for the media solutions group at Cisco, selling a Web platform for media companies, and the biggest problem we faced was dealing with the excess of content management systems that all media companies have. In the case of Editions, AOL has myriad properties that they want to merge into this single app. But rather than consolidate TechCrunch and Engadget and the Huffington Post and a half dozen other sites, they use Diffbot to build a kind of content management system on the fly from the rendered Web pages. They extract the content and deliver it on the fly as if it came from a CMS right to the iPad magazine.

StumbleUpon is another interesting one. They use Diffbot as their moderation queue. Whenever a new website is submitted to their index, they want to make sure it’s legitimate before it’s available for stumbling. They have to rule out people who stumble a page, then swap it out for spam. So they run Diffbot on the source page, pipe that into their moderation queue, and if it looks like a legitimate page they can monitor that and keep checking on a regular basis to see how much it changes. If it has changed much between day 1 and day 10, it might warrant human intervention.
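The day-1 versus day-10 comparison described above can be sketched in a few lines of Python. This is only an illustration, not StumbleUpon's or Diffbot's actual method: it assumes the extracted page text is already available as plain strings, and the function names and the 0.6 threshold are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: below this similarity, send the page to a human moderator.
REVIEW_THRESHOLD = 0.6


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0..1 word-level similarity ratio between two extracted texts."""
    return SequenceMatcher(None, old_text.split(), new_text.split()).ratio()


def needs_human_review(old_text: str, new_text: str,
                       threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a page whose extracted text has drifted far from the earlier snapshot."""
    return similarity(old_text, new_text) < threshold


# Made-up snapshots standing in for text extracted on day 1 and day 10.
day1 = "Ten tips for growing tomatoes in a small garden"
day10_same = "Ten tips for growing tomatoes in a small garden plot"
day10_spam = "Buy cheap watches now best prices click here"

print(needs_human_review(day1, day10_same))  # small edit, stays in the index
print(needs_human_review(day1, day10_spam))  # content swapped, goes to review
```

Comparing the extracted text rather than the raw HTML means that changes to ads or navigation would not trigger a review by themselves; only a change in the main content would.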

X: Aren’t there a lot of news reader apps these days that are doing the same thing you’re doing when it comes to identifying and isolating the text of a news article? That’s what Instapaper and Pocket and Readability and Zite are all doing.

MT: We power a lot of those apps. Our audience is the developers who work at those companies, who use our API to create their experience.

JD: We make it a lot more affordable to make those kinds of forays. When you look at building your own customized extraction tools, you are talking about multiple developers over weeks or months, to build something that is more brittle than what we offer out of the gate. Our ultimate goal is to be not only better but a lot cheaper than what you could build.

X: It’s not totally clear yet, though, whether publications or apps that aggregate lots of content from elsewhere, like Editions or even Flipboard, are going to be profitable in the long term, and where publishing is going as a business. Don’t you guys feel there’s some risk in tying your fortunes to such a troubled industry?

MT: The more interesting question is how do you monetize the Semantic Web, and where is the money in building the structured information. Articles are only one page type. Another that I mentioned is products. If you could show products on a cell phone, and people could buy the product and we could make that transaction happen, that is one very tangible way of making money. I think there is a lot of value in having structured information, because you can connect people more directly to what they want. Once we have the entire Web in machine-readable format, anybody who wants to use any sort of data can use the Diffbot view of it, and I think a lot of those apps can make money. Look at Siri—it’s great but it only works with the 10 or so sources that it’s hard-coded to work with. If you were able to combine Siri with Diffbot, Siri could operate on the Web and take a query and actually do it for you.

X: What page types will you move on to next? Did you start with articles because those are easiest?

MT: I wouldn’t say they were easiest, but they are pretty prevalent on the Web. A variety of factors help us prioritize what we should do next. One signal is what is the prevalence of that type of page on the Web. If doing one page type lets us knock out 30 percent of the Web, maybe we will go for it.

X: Will there always be a need for Diffbot, or with the transition to HTML 5, will Web pages gradually get more structure on their own?

MT: If you look at the ratio of unstructured pages to structured, it’s actually going in the opposite direction. I think human beings are creative, and they design pages for other humans. No matter what, people will find a way to create documents that lie outside of the well-defined tags, whether it’s HTML 5 or Flash or PDF or Xbox. What they all have in common is that they are just vessels that we can easily train and adapt Diffbot to work with.

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/