Could a Little Startup Called Diffbot Be the Next Google?


Diffbot's software starts by rendering each Web page the way a human visitor would see it on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Machine-learning algorithms then compare this geometric data to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.
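To make the idea concrete, here is a minimal sketch in Python (using scikit-learn) of how geometric features from those hand-drawn rectangles could train a classifier to label regions on a newly rendered page. This is an illustration only; Diffbot has not published its actual features or model, and every number below is invented.

```python
# Toy illustration (not Diffbot's actual model): learn to label page regions
# as "headline", "body", or "price" from purely geometric features.
from sklearn.tree import DecisionTreeClassifier

# Each training example is one annotated rectangle on a rendered page:
# (x, y, width, height, font_size) -- all hypothetical numbers.
X_train = [
    (40,  80, 600,  60, 32),   # big, wide box near the top  -> headline
    (40, 200, 600, 900, 14),   # tall column of small text   -> body text
    (700, 250, 80,  30, 22),   # small box in the sidebar    -> price
    (45,  90, 580,  55, 30),
    (42, 210, 590, 850, 13),
    (710, 260, 75,  28, 24),
]
y_train = ["headline", "body", "price", "headline", "body", "price"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# At extraction time, measure the rectangles on a newly rendered page
# and ask the model what each one probably is.
new_regions = [(38, 85, 610, 58, 31), (705, 255, 78, 29, 23)]
print(model.predict(new_regions))   # e.g. ['headline' 'price']
```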

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.
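To see why, consider what HTML-only extraction looks like in practice: a rule-based scraper needs a hand-maintained selector map for every site, and it breaks the moment a redesign changes the tags. The domains and CSS selectors in this sketch are invented.

```python
# Why HTML-only extraction is brittle: every CMS marks up "the headline"
# differently, so a scraper needs a hand-written rule per site.
from bs4 import BeautifulSoup

SITE_RULES = {  # hypothetical domains and selectors
    "example-wordpress-site.com": {"headline": "h1.entry-title", "byline": "span.author"},
    "example-custom-cms.com":     {"headline": "div#storyHead h2", "byline": "p.writer a"},
}

def extract(domain, html):
    # Fails for any site not in the table, and silently goes stale
    # whenever a site's redesign changes its markup.
    rules = SITE_RULES[domain]
    soup = BeautifulSoup(html, "html.parser")
    return {field: soup.select_one(selector).get_text(strip=True)
            for field, selector in rules.items()}

html = '<h1 class="entry-title">Diffbot raises a round</h1><span class="author">Wade Roush</span>'
print(extract("example-wordpress-site.com", html))
# {'headline': 'Diffbot raises a round', 'byline': 'Wade Roush'}
```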

Hence the computer-vision approach. “What we’re trying to do is reverse-engineer the Web presentation and turn it back into structured relations,” Diffbot chief scientist Scott Waterman explains.

Engineers assemble a search server at the Diffbot house.

The first type of page that Diffbot mastered was the news story. By the time I first met Tung and Diffbot vice president John Davi in 2012, they’d already gotten very good at parsing articles on the Web, and today the Diffbot “Article API,” or application programming interface, is used by hundreds of companies to extract text and reformat it for presentation in Web or mobile news readers. Digg, Instapaper, Onswipe, and Reverb are among Diffbot’s customers. “They’re saying ‘Holy crap, how am I going to get clean text out of this sea of [Web pages],’ and in that case, we’re a developer’s best friend,” Davi says. “We turn an insurmountable problem into one you can solve with an API integration.”
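In practice, “an API integration” amounts to a single HTTP call per URL. The sketch below shows roughly what that looks like in Python; the endpoint version, parameter names, and response fields follow Diffbot’s later public documentation and may differ from what is current, and the token and article URL are placeholders.

```python
# Minimal sketch of calling the Article API with Python's requests library.
import requests

resp = requests.get(
    "https://api.diffbot.com/v3/article",          # endpoint version assumed
    params={
        "token": "YOUR_DIFFBOT_TOKEN",             # placeholder credential
        "url": "https://www.xconomy.com/some-article/",  # placeholder URL
    },
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# A successful response typically carries the cleaned text plus metadata
# such as the title and author, ready to reformat for a mobile news reader.
# (Field names assumed from later documentation.)
article = data.get("objects", [{}])[0]
print(article.get("title"))
print(article.get("author"))
print((article.get("text") or "")[:200])
```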

But articles were just the first page type that Tung’s crew wanted to make Diffbot understand. Today the company offers four APIs—for articles, images, products, and home pages—as well as a classifier that can automatically determine the page type for any URL, and a “Crawlbot” that can comb through entire sites, rather than just specific URLs.
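From a developer’s point of view the classifier works the same way: one call, and the response reports which page type Diffbot detected. A rough sketch, with the endpoint name (Diffbot later documented this as the Analyze API) and response fields treated as assumptions:

```python
# Sketch of the automatic page-type classifier; endpoint and field names
# are assumptions based on Diffbot's later public documentation.
import requests

def classify(url, token="YOUR_DIFFBOT_TOKEN"):
    resp = requests.get(
        "https://api.diffbot.com/v3/analyze",
        params={"token": token, "url": url},
        timeout=30,
    )
    resp.raise_for_status()
    # The response reports the detected page type ("article", "image",
    # "product", ...) alongside the fields extracted for that type.
    return resp.json().get("type", "unknown")

print(classify("https://www.example-store.com/widget-42"))  # e.g. "product"
```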

Davi says the startup rushed to finish the images API after it studied a few days’ worth of Twitter posts from mid-2012 and realized that images made up a whopping 36 percent of the material being shared on the microblogging network. That was a tipoff that understanding image pages would allow the company to parse a huge chunk of the human-readable Web.

But finishing the product API was a much more strategic and potentially lucrative move. The reason is simple: anyone who sells or promotes anything on the Web wants to be able to show the price, and wants to know how competitors are pricing the same wares. “All of the various product-discovery startups—pinning, bookmarking, search, e-commerce companies—want pricing information,” Tung says.

Pinterest is a client, for example. Tung and Davi say Diffbot analyzes the entire “firehose” of data that Pinterest users are putting on their pinboards, including the pages that pins link to, mainly to figure out which pins represent products on e-commerce sites.

“The ability to turn on user-facing features based on product data is a potential future revenue stream for these bookmarking sites,” Davi explains. “Say 15 percent of pins are products. They can say, ‘Let’s find out the pricing and availability, then let’s tell the user that this product they just pinned is available at Amazon for $5 less,’ or that it’s just gone on sale somewhere.” If the tip leads to a transaction, the pinning or bookmarking site is then in line for an affiliate commission.
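The logic Davi is describing is straightforward once the product data exists. A toy sketch, with all prices and retailers invented:

```python
# Toy version of the feature Davi describes: once a pinned page is known to
# be a product with a price, compare it against other retailers' offers and
# alert the user to a cheaper listing.
def cheaper_offers(pinned_product, other_offers):
    """Return in-stock offers priced below the pinned product."""
    return [
        offer for offer in other_offers
        if offer["in_stock"] and offer["price"] < pinned_product["price"]
    ]

pin = {"title": "Enamel teapot", "price": 34.99, "retailer": "example-shop.com"}
offers = [
    {"retailer": "amazon.com", "price": 29.99, "in_stock": True},
    {"retailer": "qvc.com",    "price": 36.50, "in_stock": True},
]

for offer in cheaper_offers(pin, offers):
    saving = pin["price"] - offer["price"]
    print(f"The teapot you just pinned is ${saving:.2f} cheaper at {offer['retailer']}.")
    # If the user clicks through and buys, the pinning site collects
    # an affiliate commission on the sale.
```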

Another user of the product API builds Facebook ads from pages on e-commerce sites. “They end up using our Crawlbot, combined with the product API, to extract data from entire retail sites like Target or QVC, drop the product data into their backend, and generate ads on the fly,” Davi says.
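A rough sketch of that Crawlbot-plus-product-API combination follows; the parameter names are based on Diffbot’s later public documentation and may not match exactly, and the token and seed URL are placeholders.

```python
# Sketch: point Crawlbot at an entire retail site and run every page it
# finds through the Product API to pull structured product records.
import requests

TOKEN = "YOUR_DIFFBOT_TOKEN"   # placeholder credential

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",               # endpoint assumed
    data={
        "token": TOKEN,
        "name": "example-retailer-crawl",
        "seeds": "https://www.example-retailer.com/",  # placeholder seed URL
        # Process each crawled page with the Product API so prices,
        # availability, and images land in the customer's backend.
        "apiUrl": "https://api.diffbot.com/v3/product",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # crawl-job status; extracted results are fetched later in bulk
```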

There’s a real business here. Customers who tap the Diffbot APIs up to 250,000 times per month are expected to pay a $300 monthly fee. If your calls are closer to the 5-million-a-month mark, you’ll pay $5,000, and at higher volumes, “custom” pricing goes into effect. One of the major search engines (the startup isn’t allowed to say which one) is paying Diffbot “to improve the richness of their search interface,” Tung says. Almost all of the company’s deals result from inbound inquiries, he says, which means he hasn’t yet needed to hire a sales director.

And there are many page types left to tackle. There will eventually be APIs for things like comment pages, discussion forums, product reviews, social-media status updates, and pages with embedded audio and video (though the startup doesn’t plan to analyze the actual content of media files). Add in the less common kinds of pages such as

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a Ph.D. in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH, and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/.