Data Domain Founder, Kai Li, on EMC Acquisition and the Future of Data Storage

compete in price with tape library solutions. When the cost is roughly equal, the value propositions of fast and reliable recovery, and automatic disaster recovery, become very appealing to customers.

What separates Data Domain from many startups is how we invented the technology. We identified a very painful problem in data centers first, and invented technology to solve the problem, instead of inventing a technology and then looking for a market. Because of this, we are essentially executing the same business plan as when we formed, during the last downturn, one month after September 11 [2001]. That’s one of the main reasons Data Domain was able to lead the market, and that’s why deduplication is becoming such an important technology.

X: How does deduplication work, and how is it different from regular data compression (like WinZip)?

KL: Data compression has been used since the late 70s. The main observation we had was that previous “local compression” encodes data within a small window of bytes, say 100 kilobytes, and looks for redundancy only inside that window. This method achieves roughly 2-to-1 compression [half the data] on average. Deduplication is fundamentally different. Instead of looking at a 100-kilobyte window, you make the window really large: as large as the entire storage system, or a network of storage systems. By doing so, we can reduce the data footprint by an order of magnitude. Then the challenge is how to keep track of the data segments, and how to find the duplicates at high speed.
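
To make the idea concrete, here is a minimal sketch of deduplication, not Data Domain’s actual design: it assumes fixed-size segments, SHA-256 fingerprints, and an in-memory dictionary, all simplifications. Production systems use more sophisticated segmentation and on-disk indexes to find duplicates at high speed, which is exactly the challenge Li describes.

```python
# Illustrative deduplication sketch (assumptions: fixed 8 KB segments,
# SHA-256 fingerprints, in-memory index). Only segments with previously
# unseen fingerprints are stored; duplicates become references.
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumed segment size for this sketch


class DedupStore:
    def __init__(self):
        self.segments = {}  # fingerprint -> unique segment bytes
        self.recipes = {}   # object name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE]
            fp = hashlib.sha256(segment).hexdigest()
            # Store the segment only if this fingerprint is new.
            self.segments.setdefault(fp, segment)
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name):
        # Reassemble the object from its recipe of fingerprints.
        return b"".join(self.segments[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(s) for s in self.segments.values())


# Two backups that share most of their content consume little extra space.
store = DedupStore()
monday = b"A" * 80_000 + b"unchanged database pages" * 1_000
tuesday = monday + b"a few new transactions"
store.write("backup-mon", monday)
store.write("backup-tue", tuesday)
print(store.stored_bytes(), "bytes stored for",
      len(monday) + len(tuesday), "bytes written")
```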

X: Why is this such a big deal for companies?

KL: This is a classical disruptive technology in IT. By disruption, I mean replacing the existing infrastructure, as opposed to incremental improvements. We disrupted tape libraries and disk-based storage for backup, near-online, and archival use cases. Because deduplication reduces the data footprint by an order of magnitude, it brings substantial value to large data centers.

Deduplication solved three problems. The first is to get rid of tape infrastructure for backups. When you tell data center customers you’ll replace backup tape libraries, that’s [an easy sell]. The second is to move data offsite easily. When you compress data, you can also move it over a wide area network more easily. Especially for corporate intranets, the cost of bandwidth has not come down much in the past 10 years. Moving uncompressed data over a T3 [communication] line is not feasible. A T3 moves about half a terabyte per day and costs about $72,000 a year. In the case of an Oracle database, you limit your database to half a terabyte if you want to do a full backup every day. If you translate that to dollars per gigabyte, it’s $300 per gigabyte for two years, which is more expensive, by more than an order of magnitude, than primary storage. The situation gets worse because the number of hours in a day does not increase, while data volume keeps growing. But with deduplication, it costs about the same as moving tapes by physical transportation. And the third problem it solves is storing near-online data, which is infrequently accessed but makes up the majority of data. For “nearline” data, we can provide customers with a very economical storage system. That is especially helpful during an economic downturn.
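
One way to reconstruct Li’s back-of-the-envelope figure, using only the numbers quoted above (half a terabyte per day, $72,000 a year, rounded to roughly $300 per gigabyte):

```python
# Reconstruction of the T3 arithmetic quoted above (figures as stated,
# rounding assumed): the line limits you to ~500 GB of full backup per day,
# so per gigabyte of daily backup capacity, two years of line cost is:
t3_cost_per_year = 72_000        # dollars per year, as quoted
daily_capacity_gb = 500          # ~half a terabyte per day, as quoted

two_year_cost = 2 * t3_cost_per_year                  # $144,000
cost_per_gb = two_year_cost / daily_capacity_gb       # ~$288
print(f"${cost_per_gb:.0f} per gigabyte over two years")
# -> roughly $288, i.e. about $300/GB, more than an order of magnitude
#    above typical primary storage cost per gigabyte.
```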

X: So what stage is deduplication at with big data centers? Is it becoming mainstream?

KL: Deduplication is still at a relatively early stage. If you look at the tape library market, it’s about $3 billion. The consensus is that data deduplication will become a multibillion-dollar market. Data Domain did $274 million in revenue last year. This year, the guidance is in the ballpark of $360 million. Data Domain is arguably the leader in the deduplication storage market. You can see the market growing to a billion dollars a year soon.

X: But the field is crowded with competitors, including your new parent company.

KL: There are many players in deduplication storage. Their go-to-market strategies are different. Avamar, acquired by EMC [in 2006], was one of the early competitors. What they have been doing is applying deduplication technology to backup software. That’s what EMC has been selling

Author: Gregory T. Huang

Greg is a veteran journalist who has covered a wide range of science, technology, and business topics. As former editor in chief, he oversaw daily news, features, and events across Xconomy's national network. Before joining Xconomy, he was a features editor at New Scientist magazine, where he edited and wrote articles on physics, technology, and neuroscience. Previously he was senior writer at Technology Review, where he reported on emerging technologies, R&D, and advances in computing, robotics, and applied physics. His writing has also appeared in Wired, Nature, and The Atlantic Monthly’s website. He was named a New York Times professional fellow in 2003. Greg is the co-author of Guanxi (Simon & Schuster, 2006), about Microsoft in China and the global competition for talent and technology. Before becoming a journalist, he did research at MIT’s Artificial Intelligence Lab. He has published 20 papers in scientific journals and conferences and spoken on innovation at Adobe, Amazon, eBay, Google, HP, Microsoft, Yahoo, and other organizations. He has a Master’s and Ph.D. in electrical engineering and computer science from MIT, and a B.S. in electrical engineering from the University of Illinois, Urbana-Champaign.