Could a Little Startup Called Diffbot Be the Next Google?

In tech journalism, it’s inadvisable to call any company “the next Google.” It’s almost always breathless hype or marked naïveté.

After all, people have been predicting the search giant’s demise for nearly as long as the company has existed. I wrote a Technology Review cover story called “Search Beyond Google” nearly 10 years ago. But with unlimited brainpower and money at its disposal, the company has managed to stay at the forefront in search, while also getting very good at other things, like mobile hardware.

So when I tell you that a seven-employee company called Diffbot really could be the next Google, I need to be very specific about what I mean.

I don’t mean that the tiny Palo Alto, CA-based startup is going to put Google out of business. In fact, Diffbot may already be partnering with Google. And there’s a good chance Google will just acqui-hire the startup at some point, thereby preempting the very interesting branch of the timeline where Diffbot gets big on its own.

And I don’t mean that Diffbot is going to redefine the search business. Not the search business as we’ve known it, anyway.

What I do mean is that Diffbot is poised to help the consumer and business worlds make sense of today’s more diverse Internet—one that takes many more forms, and is being put to many more uses, than the Web as it looked back in the 1990s, when Google was born.

Diffbot’s business is to use a combination of crawling software, computer vision, and machine learning to classify documents on the Web and break down each page type into its component parts. (The startup thinks there are about 20 of these types.) This allows people or programs to ask very specific questions about those parts—questions that can’t be answered very well using traditional search technology.

In other words, Diffbot is to today’s Internet as Google was to the Web of 1998. It’s a tool that can impose structure and meaning on resources that are currently disorganized and inaccessible, for a price that many businesses are willing to pay. And so far, that’s a game that Google itself doesn’t seem to want to play.

The Diffbot team. Left to right: Bharath Bhat, Scott Waterman, Mike Tung, Emmanuel Charon, Dan Steinberg, John Davi. Not shown: Matt Wells.

After writing my first story about Diffbot back in July 2012, I wanted to know about the latest progress at the Stanford-born startup, so I paid a visit to Diffbot’s new headquarters—a quiet backyard bungalow that feels insulated from all the nearby traffic on El Camino and Embarcadero Road. There, the Diffbot crew put aside their laptops for an hour to update me about the company’s ambitious vision. It hasn’t changed much since 2012, but it’s been fleshed out in key respects.

Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.
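To make the crawler's predicament concrete, here is a minimal sketch using only Python's standard-library HTML parser. The markup, class names, and the `collect` helper are all invented for illustration; none of this is Diffbot's code or a real site's HTML. It shows that the parser recovers tags and text, but nothing in the markup says which piece is the product name, the description, or a review.

```python
from html.parser import HTMLParser

# Hypothetical product-page markup, invented for illustration only.
SAMPLE_HTML = """
<div class="c1"><span class="t9">Acme Kettle</span></div>
<div class="c2">Boils a liter of water in three minutes.</div>
<div class="c3"><span>Great kettle, would buy again.</span></div>
"""


class TagTextCollector(HTMLParser):
    """Records (enclosing tag, text) pairs, roughly what a naive crawler sees."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags
        self.pairs = []    # (tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.pairs.append((self._stack[-1], text))


def collect(html):
    """Parse the HTML and return the (tag, text) pairs found in it."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


pairs = collect(SAMPLE_HTML)

# Assumed set of human-meaningful roles; the generic tags match none of them.
ROLES = {"headline", "product", "title", "review", "comment", "body"}
labeled = [tag for tag, _ in pairs if tag in ROLES]

for tag, text in pairs:
    print(f"<{tag}>: {text}")
print("elements with a semantic role:", len(labeled))
```

Every text fragment arrives wrapped in a generic `div` or `span`, so a program has no label to tell the product name from the review. A human reader gets that information from layout and visual cues rather than from the markup.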

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH, and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/