In tech journalism, it’s inadvisable to call any company “the next Google.” It’s almost always breathless hype or marked naïveté.
After all, people have been predicting the search giant’s demise for nearly as long as the company has existed. I wrote a Technology Review cover story called “Search Beyond Google” nearly 10 years ago. But with unlimited brainpower and money at its disposal, the company has managed to stay at the forefront in search, while also getting very good at other things, like mobile hardware.
So when I tell you that a seven-employee company called Diffbot really could be the next Google, I need to be very specific about what I mean.
I don’t mean that the tiny Palo Alto, CA-based startup is going to put Google out of business. In fact, Diffbot may already be partnering with Google. And there’s a good chance Google will just acqui-hire the startup at some point, thereby preempting the very interesting branch of the timeline where Diffbot gets big on its own.
And I don’t mean that Diffbot is going to redefine the search business. Not the search business as we’ve known it, anyway.
What I do mean is that Diffbot is poised to help the consumer and business worlds make sense of today’s more diverse Internet—one that takes many more forms, and is being put to many more uses, than the Web as it looked back in the 1990s, when Google was born.
Diffbot’s business is to use a combination of crawling software, computer vision, and machine learning to classify documents on the Web and break down each page type into its component parts. (The startup thinks there are about 20 of these types.) This allows people or programs to ask very specific questions about those parts—questions that can’t be answered very well using traditional search technology.
In other words, Diffbot is to today’s Internet as Google was to the Web of 1998. It’s a tool that can impose structure and meaning on resources that are currently disorganized and inaccessible, for a price that many businesses are willing to pay. And so far, that’s a game that Google itself doesn’t seem to want to play.
After writing my first story about Diffbot back in July 2012, I wanted to know about the latest progress at the Stanford-born startup, so I paid a visit to Diffbot’s new headquarters—a quiet backyard bungalow that feels insulated from all the nearby traffic on El Camino and Embarcadero Road. There, the Diffbot crew put aside their laptops for an hour to update me about the company’s ambitious vision. It hasn’t changed much since 2012, but it’s been fleshed out in key respects.
Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.
But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.
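To make that gap concrete, here is a minimal Python sketch of what a naive crawler gets out of a page. The markup, the class names, and the `TagCollector` and `crawl` helpers are all invented for illustration; none of it comes from Diffbot or from any real site.

```python
from html.parser import HTMLParser

# Hypothetical product-page markup, invented for this example.
# Note that the tags and class names say nothing about what each part *is*.
PAGE = """
<div class="c1"><span class="t9">Acme Kettle</span></div>
<div class="c2"><span class="t4">$24.99</span></div>
<div class="c3"><p>Boils fast. Would buy again.</p></div>
"""


class TagCollector(HTMLParser):
    """Records (innermost tag, text) pairs, roughly what a naive crawler sees."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags
        self.seen = []     # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.seen.append((self._stack[-1], text))


def crawl(html):
    """Parse the HTML and return the text fragments with their enclosing tag."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    for tag, text in crawl(PAGE):
        print(f"{tag}: {text}")
```

Run against the sample, the sketch reports two `span`s and a `p`. Nothing in that output says which fragment is the product name, which is the price, and which is a review, even though a person glancing at the rendered page would know at once.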
Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on