Could a Little Startup Called Diffbot Be the Next Google?

documents, charts, FAQs, locations, event listings, personal profiles, recipes, games, and error messages, and there are about 20 important page types altogether, Tung says. “Once we have all 20, we will essentially be able to cover the gamut, and convert most of the Web into a database structure,” he says.

Okay—why is that important, and how could it lead to a Google-scale opportunity?

As I’ve been trying to hint, the Web is a far richer place today than it was in 1998, when most pages were limited to text and images. Moreover, Web data is being tapped in new ways—and the majority of the entities using it aren’t even human.

That’s both alarming and intriguing. A study released last month by Silicon Valley Web security firm Incapsula showed that only 38.5 percent of all website traffic comes from real people. Another 30.5 percent comes from malicious bots, including scrapers, spammers, and impersonators, which is, of course, a serious problem. But on the upside, the final 31 percent of traffic comes from search engines and “good bots.”

This includes all of the services, from Instapaper to Flipboard to Pinterest, that extract data for presentation in other forms, leading, ultimately, to more page views for the original publisher. And it includes the growing category of specialized search engines and virtual personal assistants, from Wolfram Alpha to Siri and Google Now, that scour the Web to perform specific tasks for their human masters.

These bots are doing complicated things, which means they thrive on structure. And for them, Diffbot makes the Web a more welcoming place. For one thing, it gives them the ability to launch far more detailed searches against the raw data. “If we have Nike’s entire catalog as a database, you can select queries like ‘Show me x from Nike.com where the price is less than $70,’ and the things you get back aren’t Web pages optimized for viewing on a screen, but the actual record,” Tung says.

For that kind of search—which is more akin to a database query in SQL, the Structured Query Language, than to a keyword-based search—Google just won’t cut it. “Text search gets you only so far,” Waterman says. “When you start to understand the meaning of aspects of the page and glue them together, then you can do all kinds of other things.”
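To make the contrast with keyword search concrete, here is a minimal sketch of the kind of query Tung describes, written in Python against an in-memory SQLite table. Everything in it is an assumption made for illustration: the table layout, the column names, and the sample records are hypothetical, and it does not use Diffbot's actual API or data.

```python
import sqlite3

# Hypothetical records standing in for product pages that have already
# been converted into structured data. Names and prices are invented.
PRODUCTS = [
    ("Example Running Shoe", 64.99, "nike.com"),
    ("Example Training Tee", 30.00, "nike.com"),
    ("Example Basketball Shoe", 120.00, "nike.com"),
    ("Example Hiking Boot", 55.00, "example-outdoors.com"),
]


def build_catalog(rows):
    """Load (title, price, site) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (title TEXT, price REAL, site TEXT)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    return conn


def products_under(conn, site, max_price):
    """Return (title, price) records from one site priced below max_price."""
    cursor = conn.execute(
        "SELECT title, price FROM products "
        "WHERE site = ? AND price < ? ORDER BY price",
        (site, max_price),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    catalog = build_catalog(PRODUCTS)
    # "Show me x from Nike.com where the price is less than $70"
    for title, price in products_under(catalog, "nike.com", 70):
        print(f"{title}: ${price:.2f}")
```

The point of the sketch is that the result is a list of records with typed fields, which another program can filter, sort, or combine, rather than a list of links to pages that a person then has to read.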

The first company that figures out how to map today’s more complex Web, and open it fully to automated traffic, stands to occupy a central place in tomorrow’s Internet economy. For as soon as data is readable by machines, Tung points out, Tim Berners-Lee’s vision of the Semantic Web will finally begin to take concrete shape. “New knowledge can be created with old knowledge,” he says. “Apps become like mini-AIs that take information, do some value-add with it, and produce other information.”

In September, Diffbot announced that it had brought on Matt Wells, the creator of an open-source search engine called Gigablast. Alongside Google, Bing, and Blekko, Gigablast is one of the few U.S.-based search engines to maintain its own index of the Web; at one time, its index of 12 billion pages was second only to Google’s.

“I believe in Mike’s vision, I see what he’s trying to do, and I thought it would be good to team up with a lot of smart people,” Wells told me. The hire is a sign that Diffbot’s ambitions extend beyond selling access to its APIs to something potentially much bigger: constructing a new kind of search engine, built around new types of queries and new ways of formulating intent. And to do that, Diffbot will obviously need its own global index. “We want to convert the entire Web into a structured database,” Tung says. “Matt is one person who has done that Web-scale crawling before. Most of his competitors were teams of thousands of people with millions of dollars.”

So, in the end, Diffbot is a small group of super-talented engineers and machine-learning experts who want to analyze and structure the Web on a huge scale. Yet the Googleplex is just five miles away—and what would be a life-altering amount of money for any of Diffbot’s team members would be pocket change for Google.

So it’s probably silly to imagine a future where Diffbot grows to 10,000 employees and becomes the substrate for a community of AIs, working to make us all happier, more comfortable, and more informed; that is to say, where our online existence isn’t ruled solely by Google and the NSA. But it’s nice to think that it’s possible.

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats.

My personal site: waderoush.com

My social media coordinates:
Twitter: @wroush
Facebook: facebook.com/wade.roush
LinkedIn: linkedin.com/in/waderoush
Google+: google.com/+WadeRoush
YouTube: youtube.com/wroush1967
Flickr: flickr.com/photos/wroush/
Pinterest: pinterest.com/waderoush/