Big data is a “hackneyed term,” said Michael Stonebraker. “I try hard not to use it.”
It was wintertime when I sat down with a few database experts in Boston to talk shop. Stonebraker, an MIT professor and entrepreneur, is one of those graybeards who were working in big data long before it was called big data, and who will likely be doing so long after the term has faded.
In hindsight, his remark was a clear sign that the marketing hype around “big data” had peaked. Everyone was using the term, and no one seemed to know what it really meant—or how it could benefit mainstream businesses and reward data-savvy entrepreneurs.
The premise of big data, at least, is easy to grasp: more and more information is being collected, stored, and analyzed, from click streams to sales records to mobile-device locations. What hasn’t been easy is translating all that data into insights that help organizations make better decisions. That goes for retail, finance, healthcare, marketing, wireless, Internet commerce—name the industry and you’ll hear the lament that corporations aren’t fully capitalizing on their digital assets.
The underlying reason is that “big data” as a technology area has been a mirage. There’s no magic button, only myriad software techniques that may or may not work for problems specific to particular industries.
But a recent wave of startups has identified new classes of problems, showing where big-data capabilities are heading in the next few years. “It’s really not about big data. It’s about the most useful data,” says Andy Palmer, a co-founder (with Stonebraker) of Vertica Systems and Tamr, both data-related companies. He’s focused on giving companies the ability to access the information that’s most relevant, often hidden, and “high-quality enough to answer compelling questions.”
Tamr, where Palmer is currently CEO, is working on “data curation”: software that helps organizations understand and connect their many different data sources and formats. The idea is to use a combination of statistics and human experts to show customers how their records are interrelated, identify redundancies and errors, and scrub the data so it can be used effectively. The Cambridge, MA-based startup has done pilot tests with Novartis, Thomson Reuters, and other enterprises.
There are broader terms for this sort of unsexy software—data wrangling, plumbing, “munging,” or janitor work—but the goal is a real one: to help businesses make better decisions faster, and save money. And a market for such services seems to be emerging: other startups vying for a piece of the pie include Trifacta, Paxata, and ClearStory in data preparation, and Attivio and Bedrock Data in data integration.
Bedrock Data, for example, has developed software that “synchronizes” data across different business systems, such as customer relationship management, e-mail, marketing, and finance; the idea is to break down barriers between departments and make sure different teams’ records are consistent with each other. Meanwhile, the data-prep companies, including Tamr, are making tools meant to automate the traditional, labor-intensive “extract, transform, and load” (ETL) process used to prepare data for data warehouses.
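For readers who have not seen one, the three ETL steps can be sketched in a few lines of Python. This is only a toy illustration, not how Tamr or any other company named here builds its software: the sample records, the column names, the `customers` table, and the rule that a repeated e-mail address means a duplicate are all assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns and values are invented for illustration.
RAW_CSV = """name,email,amount
 Ada Lovelace ,ADA@EXAMPLE.COM,100.50
Grace Hopper,grace@example.com,not-a-number
ada lovelace,ada@example.com,100.50
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize fields, drop unparseable rows, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        name = " ".join(row["name"].split()).title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if email in seen:
            continue  # assumption: a repeated e-mail address is a duplicate record
        seen.add(email)
        cleaned.append((name, email, amount))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, email TEXT PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, email, amount FROM customers").fetchall()
print(rows)
```

Of the three sample rows, only one survives: the second has an amount that is not a number, and the third repeats the first customer's e-mail address. In practice the transform step is where most of the manual labor goes, which is the part the data-prep startups are trying to automate.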
But once the data is cleaned up and shared, how do companies actually make sense of it all? That’s a separate story, and it lies in the domain of analytics.
The field has seen a lot of consolidation and investment in recent months, with big players such as Intel, Hewlett-Packard, and Teradata buying into companies including Cloudera, Hortonworks, and Hadapt. A particularly hot sector has matured around Hadoop, an open-source framework for storing and processing large data sets across clusters of computers. Many tech companies are writing software to make Hadoop industrial strength and integrate it with new and existing types of databases.
As Palmer sees it, analytics is increasingly moving into vertical industries and niche applications. RStudio, led by JJ Allaire and based in Boston, is one of the emerging leaders, though it’s hard to understand what the company does if you don’t use R, an open-source programming language for statistical computing that is popular with data scientists. Suffice it to say, RStudio makes tools for large-scale statistical analysis, and the kinds of companies that use R include Bank of America, Facebook, Ford, Google, Uber, and Zillow.
With more targeted analytics tools, big businesses can collect data from new sources, such as sensors or social media, and start to squeeze useful insights from them. “Enterprise companies need to take a page from Internet companies,” Palmer says. “They need to get more analytical.”
Some examples of niche approaches in analytics: Vast, based in Austin, TX, is tackling Web search and analysis in the