Diffbot Challenges Google Supremacy With Rival Knowledge Graph

When you do a Google search at your desktop for a common health condition, you’ll get links to tons of webpages you can sift through in hopes of finding the specific facts you want.

But if you’ve searched Google from a mobile device recently, you may have been rewarded with a summary of important facts about the disorder, culled for you from many websites. You’re tapping into what Google calls its Knowledge Graph.

Palo Alto, CA-based artificial intelligence startup Diffbot has built its whole business on that second kind of search—ferreting out data points scattered across many websites and pulling them together into Big Data resources that can be queried, combined, and rearranged. The upstart company—14 engineers in a backyard bungalow—now says its own data mega-map, called the Global Index, is a bigger database than Google’s multibillion-fact Knowledge Graph.

This kind of structured data—Web facts organized into a searchable database—is the resource behind the most popular mobile apps, says Diffbot founder and CEO Mike Tung. Such apps can answer questions like, “What is the best Thai restaurant in this neighborhood?” Diffbot’s mission is to capture everything online—articles, images, videos, comments, reviews, the works—and keep it updated.
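To make “structured data” concrete: the Python sketch below uses invented records and field names (not Diffbot’s actual schema) to show how, once Web facts are extracted into uniform objects, a question like the Thai-restaurant one reduces to a simple filter-and-sort.

```python
# Hypothetical sketch: the records and fields here are invented for
# illustration and are not Diffbot's schema. The point is that structured
# data turns "best Thai restaurant in this neighborhood" into a query.

restaurants = [
    {"name": "Thai Basil",   "cuisine": "thai",    "neighborhood": "downtown", "rating": 4.6},
    {"name": "Siam Garden",  "cuisine": "thai",    "neighborhood": "midtown",  "rating": 4.2},
    {"name": "Pasta Palace", "cuisine": "italian", "neighborhood": "downtown", "rating": 4.8},
]

def best(records, cuisine, neighborhood):
    """Return the highest-rated record matching a cuisine and neighborhood."""
    matches = [r for r in records
               if r["cuisine"] == cuisine and r["neighborhood"] == neighborhood]
    return max(matches, key=lambda r: r["rating"], default=None)

print(best(restaurants, "thai", "downtown"))  # -> the Thai Basil record
```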

“We are working to create a structured version of the Web,” Tung says. “We’re quite serious about that.”

The company developed elements of its Web-crawling methods as it served customer needs over the past few years, but only started proactively spidering the Web for its own purposes in the past few months. Its Global Index now contains more than 600 million objects (an object can be anything from a celebrity to an Ikea chair model) and 19 billion facts. Diffbot clocks Google’s Knowledge Graph at about 570 million objects and 18 billion facts.

Diffbot, founded in 2008, is already covering its operating expenses by enhancing other search engines, including Microsoft’s Bing and DuckDuckGo, and by powering apps for companies such as Cisco and AOL, Tung says. Diffbot subscribers can build apps based on narrowly targeted searches that answer questions such as, “What’s the best price in my region for Nike cross trainers?”
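In rough outline, an app built on such a subscription might call Diffbot’s hosted Product API for each retailer page it tracks. In the sketch below, the v3 endpoint and the “objects” and “offerPrice” field names are assumptions based on Diffbot’s public API documentation, and the token and page URL are placeholders.

```python
# Hedged sketch of a Diffbot Product API call. The v3 endpoint and the
# "objects"/"offerPrice" field names are assumptions from Diffbot's public
# docs; the token and URL are placeholders, not working values.
import json
import urllib.parse
import urllib.request

TOKEN = "YOUR_DIFFBOT_TOKEN"  # placeholder: issued with a Diffbot subscription
page = "https://www.example-store.com/nike-cross-trainers"  # placeholder URL

query = urllib.parse.urlencode({"token": TOKEN, "url": page})
with urllib.request.urlopen(f"https://api.diffbot.com/v3/product?{query}") as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Each extracted product arrives as a uniform object; a price-comparison app
# would collect these across many retailer pages and sort by price.
for product in data.get("objects", []):
    print(product.get("title"), product.get("offerPrice"))
```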

But Diffbot has larger ambitions, and it’s raising money to support them. The company just banked $500,000 from Bloomberg Beta, bringing its angel round up to $3 million. It wouldn’t be surprising to see a Series A round raised this year, Tung says. As a hint about Diffbot’s ultimate interests, its CEO suggests the company may help answer a long-debated question: Can computers ever duplicate human intelligence?

Diffbot has been exploring the art of teaching machines to function like a human researcher—compiling facts from multiple online sources so they can be combined and compared for many purposes. The company began building its Global Index by storing results from URL searches requested by customers, but in recent months Diffbot has been analyzing websites to build its index at a rate of up to 15 million pages a day.

Its artificial intelligence bots are doing the work without human supervision. Tung says Google’s Knowledge Graph, by contrast, has relied significantly on human curation.

“Our approach is fairly radical in that there’s no human behind the curtain,” Tung says. “This is why we were able to catch up in such a short time.”

Diffbot assembles its own servers at its Palo Alto bungalow. They’re not the kind you can rent from a cloud storage outfit: the company’s standard crawling and indexing machines pack 32 terabytes of solid-state storage, 192 GB of RAM, and 40 CPU cores. Diffbot now has 100 servers in a guarded co-location space in Fremont, CA, where fiber-optic cables link them to all the Internet service providers in the world, Tung says.

The crawlbots are adding millions of new objects to Diffbot’s index every day. Tung envisions adding thousands more servers, perhaps tens of thousands.

“If we just throw more resources at it, we can generate structured data at true scale,” Tung says.

Last year, Xconomy’s much-missed San Francisco editor Wade Roush asked the question, “Could a Little Startup Called Diffbot Be the Next Google?” in his article about Diffbot’s mission to cover the Web more fully by taking search further than conventional search engines. Diffbot had developed bots that can “read” a webpage the way humans do, distinguishing among different parts of the layout such as headlines, main text, side columns, and so on.

With this computer-vision capability, the machines can tell the difference between types of Web pages—article pages, home pages, and product offerings where prices are displayed. They can reshuffle these layout elements to reformat a Web page for mobile device screens—a chore that companies pay Diffbot to take on, and one of its early sources of revenue.
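Diffbot hasn’t published the internals of these classifiers, so the toy rule set below is purely illustrative: it only names the kinds of layout signals (price elements, body-text length, link density) a page-type model might weigh. Diffbot’s real system is machine-learned, not hand-written rules like these.

```python
# Illustrative stand-in, not Diffbot's actual model: guessing a page type
# from crude layout features. Diffbot's real classifiers are learned from
# rendered pages; these hand-written rules only name plausible signals.

def classify_page(features: dict) -> str:
    """Guess a page type from simple layout/content features."""
    if features.get("price_elements", 0) and features.get("buy_buttons", 0):
        return "product"
    if features.get("body_text_chars", 0) > 2000 and features.get("link_density", 1.0) < 0.3:
        return "article"
    return "homepage"  # link-heavy hub pages fall through to here

print(classify_page({"price_elements": 3, "buy_buttons": 1}))         # product
print(classify_page({"body_text_chars": 5400, "link_density": 0.1}))  # article
print(classify_page({"body_text_chars": 300, "link_density": 0.8}))   # homepage
```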

The bots also “learn” where they’re likely to find certain information on a page, such as prices or author names on articles. They can extract information from images, videos, blogs, and the discussion threads that follow published articles. The company’s newest product, the Discussion API, has become a tool for marketers who want to check brand reputations, Tung says.
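As a rough illustration of that brand-monitoring use, the sketch below pulls one comment thread through the Discussion API and counts mentions of a brand name. Again, the v3 endpoint and the “objects”/“posts”/“text” field names are assumptions drawn from Diffbot’s public docs, with placeholder token and URL.

```python
# Hedged sketch of a Discussion API call for brand monitoring. Endpoint and
# field names ("objects", "posts", "text") are assumptions from Diffbot's
# public docs; the token and URL are placeholders.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({
    "token": "YOUR_DIFFBOT_TOKEN",                         # placeholder
    "url": "https://www.example-news.com/story#comments",  # placeholder
})
with urllib.request.urlopen(f"https://api.diffbot.com/v3/discussion?{query}") as resp:
    thread = json.loads(resp.read().decode("utf-8"))

# Count how often a brand name surfaces in the extracted comments.
brand = "Nike"
posts = [p for obj in thread.get("objects", []) for p in obj.get("posts", [])]
mentions = sum(brand.lower() in p.get("text", "").lower() for p in posts)
print(f"{brand} mentioned in {mentions} of {len(posts)} comments")
```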

Companies also come to Diffbot to

Author: Bernadette Tansey

Bernadette Tansey is a former editor of Xconomy San Francisco. She has covered information technology, biotechnology, business, law, environment, and government as a Bay Area journalist. She has written about edtech, mobile apps, social media startups, and life sciences companies for Xconomy, and tracked the adoption of Web tools by small businesses for CNBC. She was a biotechnology reporter for the business section of the San Francisco Chronicle, where she also wrote about software developers and early commercial companies in nanotechnology and synthetic biology.