documents, charts, FAQs, locations, event listings, personal profiles, recipes, games, and error messages, and there are about 20 important page types altogether, Tung says. “Once we have all 20, we will essentially be able to cover the gamut, and convert most of the Web into a database structure,” he says.
Okay—why is that important, and how could it lead to a Google-scale opportunity?
As I’ve been trying to hint, the Web is a far richer place today than it was in 1998, when most pages were limited to text and images. Moreover, Web data is being tapped in new ways—and the majority of the entities using it aren’t even human.
That’s both alarming and intriguing. A study released last month by Silicon Valley Web security firm Incapsula showed that only 38.5 percent of all website traffic comes from real people. Another 30.5 percent comes from malicious bots, including scrapers, spammers, and impersonators—which is, of course, a serious problem. But on the upside, the final 31 percent of traffic comes from search engines and “good bots.”
This includes all of the services, from Instapaper to Flipboard to Pinterest, that extract data for presentation in other forms, leading, ultimately, to more page views for the original publisher. And it includes the growing category of specialized search engines and virtual personal assistants, from Wolfram Alpha to Siri and Google Now, that scour the Web to perform specific tasks for their human masters.
These bots are doing complicated things, which means they thrive on structure. And for them, Diffbot makes the Web a more welcoming place. For one thing, it gives them the ability to launch far more detailed searches against the raw data. “If we have Nike’s entire catalog as a database, you can make queries like ‘Show me x from Nike.com where the price is less than $70,’ and the things you get back aren’t Web pages optimized for viewing on a screen, but the actual records,” Tung says.
For that kind of search—which is more akin to a database query in SQL, the Structured Query Language, than to a keyword-based search—Google just won’t cut it. “Text search gets you only so far,” Waterman says. “When you start to understand the meaning of aspects of the page and glue them together, then you can do all kinds of other things.”
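To make the distinction concrete, here is a minimal sketch, assuming a handful of hypothetical extracted product records loaded into an ordinary in-memory SQLite table. The field names and prices are invented for illustration; this is not Diffbot’s actual data model or API, just the shape of query Tung and Waterman are describing.

```python
import sqlite3

# Hypothetical records, as a page-analysis service might extract them from
# individual product pages. The schema and values are illustrative only.
products = [
    ("Nike Air Zoom Pegasus",  "nike.com", "running shoe", 120.00),
    ("Nike Dri-FIT Tee",       "nike.com", "t-shirt",       35.00),
    ("Nike Heritage Backpack", "nike.com", "backpack",      55.00),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (title TEXT, site TEXT, category TEXT, price REAL)"
)
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", products)

# A structured query, not a keyword search: filter on an extracted attribute
# (price) and get the matching records themselves back.
for row in conn.execute(
    "SELECT title, price FROM product WHERE site = 'nike.com' AND price < 70"
):
    print(row)
# ('Nike Dri-FIT Tee', 35.0)
# ('Nike Heritage Backpack', 55.0)
```

The answer comes back as typed fields rather than a ranked list of pages to click through, which is exactly the kind of result a bot can act on without any further parsing.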
The first company that figures out how to map today’s more complex Web, and open it fully to automated traffic, stands to occupy a central place in tomorrow’s Internet economy. For as soon as data is readable by machines, Tung points out, Tim Berners-Lee’s vision of the Semantic Web will finally begin to take concrete shape. “New knowledge can be created with old knowledge,” he says. “Apps become like mini-AIs that take information, do some value-add with it, and produce other information.”
In September, Diffbot announced that it had brought on Matt Wells, the creator of an open-source search engine called Gigablast. Alongside Google, Bing, and Blekko, Gigablast is one of the only U.S.-based search engines to maintain its own index of the Web; at one time, its index of 12 billion pages was second only to Google’s.
“I believe in Mike’s vision, I see what he’s trying to do, and I thought it would be good to team up with a lot of smart people,” Wells told me. The hire is a sign that Diffbot’s ambitions extend beyond selling access to its APIs to something potentially much bigger: constructing a new kind of search engine, built around new types of queries and new ways of formulating intent. And to do that, Diffbot will obviously need its own global index. “We want to convert the entire Web into a structured database,” Tung says. “Matt is one person who has done that Web-scale crawling before. Most of his competitors were teams of thousands of people with millions of dollars.”
So, in the end, Diffbot is a small group of super-talented engineers and machine-learning experts who want to analyze and structure the Web on a huge scale. Yet the Googleplex is just five miles away—and what would be a life-altering amount of money for any of Diffbot’s team members would be pocket change for Google.
It’s probably silly to imagine a future where Diffbot grows to 10,000 employees and becomes the substrate for a community of AIs, working to make us all happier, more comfortable, and more informed; that is to say, where our online existence isn’t ruled solely by Google and the NSA. But it’s nice to think that it’s possible.