speed up certain kinds of queries. If your CEO wants a regular report on sales by region, for example, you set up your whole data store around those statistics.
Facebook doesn’t think this way. “If you only expect certain types of questions to be asked, you only store that data,” says Sameet Agarwal, a director of engineering at Facebook who came to the company in 2011 after 16 years at Microsoft. And that practice, in turn, “tends to bias you toward the answers you were already expecting.”
Facebook’s approach “has always been to store all the data first, before you know what questions you want to ask, because that will lead you to more interesting questions,” Agarwal says.
Indeed, some of the activity data stored in Facebook’s back end dates all the way back to 2005, when the company was just two years old, Agarwal says. He says his team only deletes data when it is required to do so for security, privacy, or regulatory reasons.
But that commitment to storing everything causes considerable bother. To keep all of its analytics data accessible, Facebook operates several Hadoop clusters, each consisting of several thousand servers. The largest cluster stores more than 100 petabytes. (A petabyte is a thousand terabytes, or a million gigabytes.)
To support both its back end and its front end, the company has built or is building numerous data centers in far-flung locations, from Oregon to North Carolina to Sweden. According to the IPO registration documents it filed in 2012, Facebook spent $606 million on its data center infrastructure in 2011. Each new data center will have a capacity of roughly 3 exabytes (3,000 petabytes).
The offline analytics clusters are the biggest of the big databases at Facebook, because they can’t easily be divided or “sharded” into geographic subsets the way the front-end databases can be, according to Agarwal. To answer the most interesting questions, “you want all the data for all the users in one place,” he explains.
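To make the contrast concrete, here is a minimal Python sketch of the kind of user-ID sharding a front-end tier can lean on; the shard count and names are hypothetical, not Facebook's actual layout. Per-user lookups stay on a single shard, but any question about all users has to fan out across every shard, which is exactly why the analytics clusters keep everything in one place.

```python
# Illustrative sketch of front-end-style sharding by user ID.
# Shard count and names are hypothetical, not Facebook's real setup.

NUM_SHARDS = 4
SHARDS = {i: {} for i in range(NUM_SHARDS)}  # shard_id -> {user_id: profile}

def shard_for(user_id: int) -> int:
    """Route each user to one shard; all front-end reads and writes
    for that user touch only this shard."""
    return user_id % NUM_SHARDS

def save_profile(user_id: int, profile: dict) -> None:
    SHARDS[shard_for(user_id)][user_id] = profile

def load_profile(user_id: int) -> dict:
    return SHARDS[shard_for(user_id)][user_id]

def count_all_users() -> int:
    # A cross-user analytics question can't stay on one shard --
    # it has to visit every shard.
    return sum(len(users) for users in SHARDS.values())
```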
Keeping these huge clusters running is the job of Parikh’s team. It might sound like a low-pressure task compared to maintaining the front end, but Agarwal says there’s little room for downtime, for at least two reasons. First, roughly a quarter of the people inside Facebook depend on the analytics platform to do their daily jobs. Second, the back end actually powers many of the features users see on the front end, such as “People You May Know,” a list of friend suggestions based on an analysis of each Facebook user’s social network and personal history.
“The longer this is down, the more your user experience will degrade over time,” says Janardhan. “The ads we serve might not be as relevant, the people and the suggestions for doing things that we serve might be less relevant, and so forth.”
To keep such large collections of machines running with close to 100 percent availability, Facebook turns to extreme automation. There are other ways to run a large company network, of course—most corporations with a distributed system of data centers would have a network operations center, or NOC, staffed by scores of people around the clock. Again, that’s not the Facebook way.
The heavy engineering bias at Facebook means there’s a culture of “impatience or intolerance for trivial work,” in Janardhan’s words. The first time a data center engineer gets awakened at 2:00 am to fix a faulty server or network switch, they might apply a temporary band-aid, Janardhan says. “But if they are woken up twice in a week for the same problem, they will ensure they fix it with some automation to ensure it never happens again.”
What does that mean in practice? For one thing, it means Facebook puts a lot of thought into handling server failures (which is a corollary of its decision to use cheap, lowest-common-denominator hardware in its data centers; more on that below). In Facebook’s Hadoop clusters, there are always three copies of every file. Copy A and Copy B usually live within a single rack (a rack consists of 20 to 40 servers, each housing 18 to 36 terabytes of data). Copy C always lives in another rack.
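As a rough illustration of that placement rule—two copies in one rack, the third in a different rack—here is a short Python sketch. The rack and server names are made up, and real HDFS has its own, more elaborate placement policy.

```python
import random

# Pick three servers for a new block: copies A and B in one rack,
# copy C in a different rack. Names here are purely illustrative.
def place_replicas(racks: dict) -> list:
    """`racks` maps rack name -> list of server names (needs at least two racks)."""
    primary_rack, backup_rack = random.sample(sorted(racks), 2)
    copy_a, copy_b = random.sample(racks[primary_rack], 2)
    copy_c = random.choice(racks[backup_rack])
    return [copy_a, copy_b, copy_c]

racks = {
    "rack-1": [f"rack-1/server-{n}" for n in range(20)],
    "rack-2": [f"rack-2/server-{n}" for n in range(20)],
}
print(place_replicas(racks))  # e.g. two rack-1 servers plus one rack-2 server
```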
A mini-database called the Namenode keeps track of the locations of these files. If the rack holding A and B or the switch controlling that rack fails for any reason, the Namenode automatically reroutes incoming data requests to Copy C, and it creates new A and B copies on a third rack.
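A toy version of that failure handling might look like the Python below; this is illustrative only, and the real Namenode logic in HDFS is far more involved.

```python
# Toy model of the failover described above: reads skip replicas on a dead
# rack, and lost copies are re-created on another rack.
class NamenodeSketch:
    def __init__(self):
        # block_id -> list of (rack, server) pairs for the three copies
        self.block_locations = {}
        self.dead_racks = set()

    def read(self, block_id):
        """Serve a read from any live replica; if the rack holding copies
        A and B is down, this falls through to copy C."""
        for rack, server in self.block_locations[block_id]:
            if rack not in self.dead_racks:
                return rack, server
        raise RuntimeError("all replicas unavailable")

    def handle_rack_failure(self, failed_rack, spare_servers):
        """Mark the rack dead and top every block back up to three copies,
        using (rack, server) pairs on a third rack."""
        self.dead_racks.add(failed_rack)
        for block_id, locations in self.block_locations.items():
            live = [loc for loc in locations if loc[0] != failed_rack]
            live += spare_servers[: 3 - len(live)]
            self.block_locations[block_id] = live
```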
That’s all standard in Hadoop—but as you might guess from this explanation, the machine running the Namenode is the single point of failure in any Hadoop cluster. If it goes down, the whole cluster goes offline. To cover that contingency, Facebook invented yet another mechanism. It’s called Avatarnode, and it’s already been given back to the Hadoop community as an open-source tool. (Check out this post on Facebook’s “Under the Hood” engineering blog if you’re dying to know the details.)
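The underlying idea is a hot standby for the cluster’s metadata node. The Python sketch below shows that general pattern only; it is not Avatarnode’s actual code or API, and the class names are invented for illustration.

```python
# Conceptual hot-standby pattern: if the active metadata node stops
# answering, clients promote the standby and keep going.
class MetadataNode:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def lookup(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"location-of-{block_id}"  # placeholder answer

class FailoverClient:
    def __init__(self, active, standby):
        self.active, self.standby = active, standby

    def lookup(self, block_id):
        try:
            return self.active.lookup(block_id)
        except ConnectionError:
            # Swap roles and retry on the standby.
            self.active, self.standby = self.standby, self.active
            return self.active.lookup(block_id)
```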
The big idea at Facebook, Janardhan says, is to “make the machine do the work.” The company’s largest photo-storage cluster is “north of 100 petabytes,” he says. At most companies, it would take hundreds of people to maintain a database even half that size. Facebook has exactly five.
The Red Button or the Blue Button?
Facebook’s infrastructure engineers may obsess about storage, but they aren’t pack rats. The whole point of maintaining such an elaborate data back end is to allow continuous analysis and experimentation on the front end.
“There are literally tens of thousands of experiments running in production across our billion users at any given time,” says Parikh. “Some are subtle and some are very significant changes. One critical part of this testing is being able to measure the response.”
Thanks to off-the-shelf tools like Optimizely or Adobe’s Test&Target, virtually any company with a website can do what’s called A/B testing or multivariate testing. The idea is to create two or more variations of your website, split up the incoming traffic so that visitors see one or the other, then measure which variation performs better.
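In code, the traffic-splitting step can be as simple as hashing each visitor into a stable bucket. Here is a minimal Python sketch; the experiment and variant names are invented, and this is not any vendor’s or Facebook’s actual implementation.

```python
import hashlib

# Deterministically assign each visitor to a variant so they always see the
# same version, then tally outcomes per variant.
VARIANTS = ("red_button", "blue_button")  # hypothetical experiment

def assign_variant(user_id: str, experiment: str) -> str:
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

results = {v: {"views": 0, "clicks": 0} for v in VARIANTS}

def record_view(user_id: str) -> None:
    results[assign_variant(user_id, "signup_button_color")]["views"] += 1

def record_click(user_id: str) -> None:
    results[assign_variant(user_id, "signup_button_color")]["clicks"] += 1

# After enough traffic, compare click-through rates between the two variants.
```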