Where the Red Book Meets the Unicorn

Mike Stonebraker – my partner, friend, and winner of the 2014 A.M. Turing Award – has been on the cutting edge of modern database research, development, and deployment for over 40 years. The cutting edge … for 40 years.

From the 1970s through the 1990s, Mike was the main architect of many systems that had a huge impact on the database world – including Ingres, Postgres, and Mariposa – and he founded startup companies for each. He was famously at the center of the debate over object-oriented database systems vs. relational database systems and the ultimate evolution of object-relational mapping tools such as the popular Hibernate.

Over the past 10+ years, Mike and I have worked together as partners on big data technologies and companies like Vertica, VoltDB, Paradigm4, and Tamr, at the intersection of commercial and academic research:

Vertica:

  • Paper published at VLDB Conference, Trondheim, Norway, 2005 (link here)
  • C-Store declared public domain
  • C-Store Code published (website)
  • Vertica founded with fresh code base

VoltDB:

  • Paper published at VLDB Conference, Auckland, New Zealand, 2008 (link here)
  • H-Store declared public domain
  • Code published on public website
  • VoltDB started as open source skunkworks within Vertica, with fresh code base
  • VoltDB spun out of Vertica into separate company

Paradigm4:

  • Paper published at 23rd International Conference, SSDBM 2011, Portland, OR (link here)
  • Code published on public website
  • Started SciDB open source project
  • Paradigm4 started based on SciDB open source project

Tamr:

  • Paper published at 6th Biennial Conference on Innovative Data Systems Research (CIDR ’13), Asilomar, California (link here)
  • Data Tamer declared public domain
  • A copy of the code was put on GitHub
  • Tamr founded and fresh code base implemented

All this was accomplished through a combination of fearlessness about failure, open and transparent communication of ideas … and willingness to change. It was fueled by an approach that has never been more relevant for database researchers as we all move boldly into the big data era: Do your work at the intersection of the academic and commercial worlds.

In academia, you have the opportunity to question assumptions free of the shackles of corporate politics, short-term requirements, and non-technical middle managers <insert favorite “Office Space” reference here>. In the meantime, the commercial world today remains the only place to test and prove database systems theory at scale.

This academic-meets-commercial approach isn’t specific to Mike, database research, or even the IT domain. In Cambridge, Bob Langer runs the largest biomedical engineering lab in the world at MIT and is recognized as the most cited engineer in history. He also has 1,000+ patents that have been licensed/sublicensed to more than 300 companies. Bob’s prodigious academic and commercial accomplishments are inextricably linked. He has spent decades deriving the absolute best from what each sector has to offer – testing some of the most advanced biomedical theory at real-world scale. Bob, like Mike, is a true rock star on both the academic and commercial sides of his field.

Database research, though, is particularly suited for this academic/commercial intersection. In fact, the only way to do really (really) large-scale research is on the commercial side. Database icons (and close friends) Jim Gray, Dave DeWitt, and Mike understood the value of developing and testing with massive commercial datasets long before the term “big data” was coined – and built their amazing careers on academic and commercial foundations.

Mike has pursued this academic/commercial course since the 1970s, aggressively testing database theory in commercial systems on real applications. He had some radical hypotheses that could be proved through research. But he knew he had to have the data to really know they were true. That’s the advantage of big, hairy, real-world commercial data.

This is even more true today – in the big data era – where extreme-scale research into new and important database systems is happening at companies like LinkedIn, Google, Twitter, and Facebook.

My advice to all next-gen computer scientists/software engineers interested in big data research and development is simple: follow the example of Gray, DeWitt, Stonebraker, and others before you dive too deeply into anything.

They learned critical lessons along the way at the intersection of their academic and commercial work. In fact, there has been a lot of “reinventing the proverbial wheel” with respect to distributed database systems over the past 15 years. Five-plus years ago, for example, Mike said MapReduce wasn’t going to amount to much (reference here). It wasn’t the most popular opinion at the time. But eventually people started to think that he might have been right (reference here). IMHO, Mike and Dave DeWitt were pretty much dead on. Many of these new database infrastructure projects are going to come to terms with the fact that, eventually, they are all wrestling with building a database system of one sort or another. And that the lessons learned in database systems over the past 40 years – relational, distributed, object-oriented, federated, or otherwise – are useful to avoid wasting time and money (see here about the MySQL sharding trend of the 2000s and why it was a huge distraction).

They also weren’t afraid to get things wrong. Like QUEL, a relational database query language associated with Ingres, which Mike promoted as superior to SQL in the 1980s. IBM and Oracle led the move toward SQL, largely stranding QUEL. Mike moved on. And so did we, better off in the end for the chances taken, as always, in the interest of advancing ideas, challenging assumptions, and figuring out what really works.

Right or wrong, Mike and his cohort not only learned their lessons, but they also unabashedly committed them to posterity in papers and other publications. In Mike’s case, that means the “Red Book” (Readings in Database Systems), co-edited with Joseph Hellerstein, first published in 1988 and updated ever since. Think of it as a lifetime of lessons learned at the intersection of academia and commerce by Mike and his peers on database management system theory and practice.

If you have any desire to research or deploy database systems seriously, the Red Book is required reading – preferably before you finish undergrad. Do it, and you might avoid the same hard lessons that people have been learning (brutally in some cases) over and over again in commercial settings about distributed database systems at scale.

In other words, much of what you need to know about large-scale distributed database systems is right there at your fingertips. Ignore it at your own risk.

Congrats to Mike on the acceptance of his Turing Award, and thanks to him for all his contributions, his partnership, his friendship, and his willingness to never take himself too seriously.

Author: Andy Palmer

Andy Palmer is a serial entrepreneur who specializes in accelerating the growth of early-stage, mission-driven startups. Andy has helped found and/or fund more than 50 innovative companies in technology, health care, and the life sciences. Andy’s unique blend of strategic perspective and disciplined tactical execution is suited to environments where uncertainty is the rule rather than the exception. Andy has a specific passion for projects at the intersection of computer science and the life sciences. Most recently, Andy co-founded Tamr, a next-generation data curation company, and Koa Labs, a startup club in the heart of Harvard Square, Cambridge, MA.