Big Data at Facebook—A Glossary

Facebook, like many engineering-driven companies, is seldom satisfied with off-the-shelf solutions for its computing problems. Its software teams regularly come up with new algorithms or management systems meant to make the company’s infrastructure more reliable and scalable. Many of these projects are offshoots of open-source technologies like Hadoop, and Facebook ends up contributing many of its innovations back to the open-source community. Here’s a list of projects the company has described in public, alphabetized by code name.

Avatarnode—A fail-safe version of the Namenode metadata server in Hadoop that improves the reliability of Hadoop clusters.

Claspin—A monitoring and visualization tool that shows Facebook engineers which servers in a cluster are underperforming or failing.

Corona—A system that improves the way jobs are scheduled and managed in Hadoop; open-sourced in 2012.

Dragonstone—The code name for the first server design released by Facebook through the Open Compute Project.

Gatekeeper—A service that controls which users see which experimental features on Facebook, and prevents overlapping changes from appearing on the same page.

Graph Search—A new service that shows Facebook users search results filtered according to the preferences of people in their networks.

Hadoop—A system that makes it simple to distribute computing jobs across dozens to thousands of servers; developed in large part at Yahoo, heavily adopted by Facebook, and now managed by the Apache Software Foundation.

Haystack—Facebook’s custom-built infrastructure for storing photos.

HipHop—A system that reduces CPU usage on Web servers at Facebook by transforming Facebook’s PHP source code into C++ before it’s compiled to machine code. Open-sourced in 2010.

Hive—A data warehouse system that makes it easier to query data in large Hadoop clusters. Open-sourced in 2008 and now managed by the Apache Software Foundation.

Peregrine—A system for querying data in Hadoop clusters in near-real-time, without having the query wait as part of a batch-job system.

Prism—A system that makes a database distributed across multiple data centers behave as if it’s contained within a single data center, by replicating and moving data as needed.

Scuba—A Web-based system that makes it easier for engineers to dissect statistics about the performance of Facebook’s infrastructure.

TAO—A distributed database that lets Facebook engineers treat users and the relationships between them as if they were nodes and edges in a true graph database.
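
For readers who want a concrete picture of the "nodes and edges" idea in the TAO entry, here is a minimal Python sketch. It is not TAO's actual interface: the class name TinyGraphStore, its methods, and the sample users are all hypothetical, and the whole thing lives in memory on one machine rather than in a distributed database.

```python
from collections import defaultdict


class TinyGraphStore:
    """Hypothetical in-memory stand-in for a graph-style store (not TAO's real API)."""

    def __init__(self):
        # Nodes: object id -> dictionary of attributes.
        self.objects = {}
        # Edges: (source id, edge type) -> list of target ids, in insertion order.
        self.assocs = defaultdict(list)

    def add_object(self, obj_id, **attrs):
        """Store a node, such as a user, with arbitrary attributes."""
        self.objects[obj_id] = attrs

    def add_assoc(self, src, edge_type, dst):
        """Store a directed, typed edge from one node to another."""
        self.assocs[(src, edge_type)].append(dst)

    def get_assocs(self, src, edge_type):
        """Return the ids reachable from src over edges of the given type."""
        return list(self.assocs.get((src, edge_type), []))


store = TinyGraphStore()
store.add_object(1, name="Alice")
store.add_object(2, name="Bob")

# A friendship is modeled as two directed edges, one in each direction.
store.add_assoc(1, "friend", 2)
store.add_assoc(2, "friend", 1)

friends_of_alice = [store.objects[i]["name"] for i in store.get_assocs(1, "friend")]
print(friends_of_alice)
```

The point of the sketch is only the shape of the data: users are nodes with attributes, and relationships are typed edges that can be looked up from either end.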

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats.

My personal site: waderoush.com

My social media coordinates:
Twitter: @wroush
Facebook: facebook.com/wade.roush
LinkedIn: linkedin.com/in/waderoush
Google+: google.com/+WadeRoush
YouTube: youtube.com/wroush1967
Flickr: flickr.com/photos/wroush/
Pinterest: pinterest.com/waderoush/