a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.
Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.
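To make the idea concrete, here is a minimal sketch of how hand-labeled rectangles might be turned into a classifier for page elements. It is not Diffbot's actual method: the `Box` fields, the sample coordinates, and the nearest-centroid technique are all assumptions chosen for illustration, standing in for whatever features and models the company really uses.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Box:
    """A rendered page element (hypothetical features, as page fractions)."""
    x: float        # left edge, fraction of page width
    y: float        # top edge, fraction of page height
    w: float        # width, fraction of page width
    h: float        # height, fraction of page height
    font_px: float  # dominant font size in pixels


def features(box):
    # Scale font size so it is comparable to the fractional coordinates.
    return (box.x, box.y, box.w, box.h, box.font_px / 100.0)


def train(labeled):
    """Average the feature vectors of the hand-labeled boxes for each label."""
    sums, counts = {}, {}
    for label, box in labeled:
        f = features(box)
        acc = sums.setdefault(label, [0.0] * len(f))
        for i, value in enumerate(f):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}


def classify(box, centroids):
    """Return the label whose average box is closest to this one."""
    f = features(box)
    return min(centroids, key=lambda label: dist(f, centroids[label]))


# Made-up training rectangles, standing in for the ones humans draw by hand.
labeled = [
    ("headline", Box(0.10, 0.05, 0.80, 0.06, 36)),
    ("headline", Box(0.10, 0.08, 0.70, 0.05, 32)),
    ("body",     Box(0.10, 0.30, 0.60, 0.50, 16)),
    ("body",     Box(0.10, 0.25, 0.60, 0.60, 15)),
    ("price",    Box(0.75, 0.20, 0.10, 0.03, 22)),
    ("price",    Box(0.80, 0.22, 0.12, 0.03, 24)),
]
centroids = train(labeled)

print(classify(Box(0.12, 0.06, 0.75, 0.05, 34), centroids))  # headline
print(classify(Box(0.78, 0.20, 0.10, 0.03, 20), centroids))  # price
```

A production system would presumably use far richer visual features and a stronger model, but the shape of the pipeline is the same: measure each element, then compare it with patterns learned from labeled examples.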
The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability, not for parsing by machines. Given that every content-management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.
Hence the computer-vision approach. “What we’re trying to do is reverse-engineer the Web presentation and turn it back into structured relations,” Diffbot chief scientist Scott Waterman explains.
The first type of page that Diffbot mastered was the news story. By the time I first met Tung and Diffbot vice president John Davi in 2012, they’d already gotten very good at parsing articles on the Web, and today the Diffbot “Article API,” or application programming interface, is used by hundreds of companies to extract text and reformat it for presentation in Web or mobile news readers. Digg, Instapaper, Onswipe, and Reverb are among Diffbot’s customers. “They’re saying ‘Holy crap, how am I going to get clean text out of this sea of [Web pages],’ and in that case, we’re a developer’s best friend,” Davi says. “We turn an insurmountable problem into one you can solve with an API integration.”
But articles were just the first page type that Tung’s crew wanted to make Diffbot understand. Today the company offers four APIs—for articles, images, products, and home pages—as well as a classifier that can automatically determine the page type for any URL, and a “Crawlbot” that can comb through entire sites, rather than just specific URLs.
Davi says the startup rushed to finish the images API after it studied a few days’ worth of Twitter posts from mid-2012 and realized that images made up a whopping 36 percent of the material being shared on the microblogging network. That was a tipoff that understanding image pages would allow the company to parse a huge chunk of the human-readable Web.
But finishing the product API was a much more strategic and potentially lucrative move. The reason is simple: anyone who sells or promotes anything on the Web wants to be able to show the price, and wants to know how competitors are pricing the same wares. “All of the various product-discovery startups—pinning, bookmarking, search, e-commerce companies—want pricing information,” Tung says.
Pinterest is a client, for example. Tung and Davi say Diffbot analyzes the entire “firehose” of data that Pinterest users are putting on their pinboards, including the pages that pins link to, mainly in order to figure out which pins represent products on e-commerce sites.
“The ability to turn on user-facing features based on product data is a potential future revenue stream for these bookmarking sites,” Davi explains. “Say 15 percent of pins are products. They can say, ‘Let’s find out the pricing and availability, then let’s tell the user that this product they just pinned is available at Amazon for $5 less,’ or that it’s just gone on sale somewhere.” If the tip leads to a transaction, the pinning or bookmarking site is then in line for an affiliate commission.
Another user of the product API builds Facebook ads from pages on e-commerce sites. “They end up using our Crawlbot, combined with the product API, to extract data from entire retail sites like Target or QVC, drop the product data into their backend, and generate ads on the fly,” Davi says.
There’s a real business here. Customers who tap the Diffbot APIs up to 250,000 times per month are expected to pay a $300 monthly fee. If your calls are closer to the 5-million-a-month mark, you’ll pay $5,000, and at higher volumes, “custom” pricing goes into effect. One of the major search engines (the startup isn’t allowed to say which one) is paying Diffbot “to improve the richness of their search interface,” Tung says. Almost all of the company’s deals result from inbound inquiries, he says, which means he hasn’t yet needed to hire a sales director.
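For a rough sense of what those numbers mean per request, the quick calculation below divides each quoted fee by its quoted call volume. The function name is made up for illustration, and the figures are only the two price points mentioned above. How calls between those volumes are billed is not stated, so no tier boundaries are assumed.

```python
def per_call_rate(calls_per_month, monthly_fee):
    """Effective price of one API call at a quoted volume and fee."""
    return monthly_fee / calls_per_month


# The two price points quoted in the article: (calls per month, monthly fee in USD).
quoted = [(250_000, 300), (5_000_000, 5_000)]

for calls, fee in quoted:
    rate = per_call_rate(calls, fee)
    print(f"{calls:,} calls/month at ${fee:,}: ${rate:.4f} per call")
```

At the stated caps, that works out to about $0.0012 per call on the smaller plan and $0.0010 per call at the 5-million mark.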
And there are many page types left to tackle. There will eventually be APIs for things like comment pages, discussion forums, product reviews, social-media status updates, and pages with embedded audio and video (though the startup doesn’t plan to analyze the actual content of media files). Add in the less common kinds of pages such as