Technologies have a funny way of needing decades to become commercially viable. Flash memory, for example, has been around since the early 1980s, but SSDs didn’t take off until just a few years ago. And how long had the industry been trying to make tablets work?
The same goes for data deduplication (or dedupe for short). The concept has been around for decades, but only in recent years has it suddenly become a hot market.
Deduplication is just what the name implies: it looks for exact duplicates of files across multiple backups and removes the extra copies. Because it compares SHA-1 hashes of file contents, a file with even one changed byte produces a different hash and is kept as a new version. The only files removed are byte-for-byte identical copies.
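As a minimal sketch of that file-level approach, the snippet below hashes every file under a directory and groups byte-for-byte duplicates; the /backups path and function names are illustrative only, not any vendor’s implementation.

```python
import hashlib
from pathlib import Path

def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-1 digest of a file, reading it in 1 MB pieces."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for piece in iter(lambda: f.read(chunk_size), b""):
            h.update(piece)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by content hash; any group with more than
    one entry is a set of exact duplicates that dedupe could collapse."""
    groups: dict[str, list[Path]] = {}
    for path in root.rglob("*"):
        if path.is_file():
            groups.setdefault(sha1_of_file(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates(Path("/backups")).items():
        # Keep the first copy; everything after it is a removable duplicate.
        print(digest, "->", [str(p) for p in paths[1:]])
```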
Dedupe has become hot recently because data sets are getting ridiculously large, and it’s not just simple files such as emails and Word documents: all kinds of unstructured data, especially multimedia, are taking up huge amounts of space.
“I think the challenge the market is facing is scaling it up,” said Haseeb Budhani, chief product officer at Infineta Systems, maker of WAN deduplication products. “In every application of dedupe, the problem is performance. There’s more data on the wire. It used to be [that] a terabyte was a big number; now it’s petabytes.”
There are three major types of deduplication taking shape in the industry now: backup dedupe, primary storage dedupe, and network dedupe. Each category has firms that specialize in it, and each has its own story to tell.
Backup dedupe is the oldest and most common type. Marco Coulter, research director for the storage practice at TheInfoPro, a 451 Group company, has proclaimed EMC the leader, with a threefold market-share lead over the competition: “EMC are miles ahead of everyone thanks to the Data Domain acquisition… In the backup dedupe market, it’s a done deal. I don’t see how anyone can catch EMC’s lead in that segment.”
The backup market is the logical place for dedupe to take off. If a firm is making a snapshot of all of its servers on a daily basis, those duplicates add up fast. Backup deduplication, or the process of running dedupe on backed-up data, can shrink the amount of storage space consumed by a factor of 10 to 30, because all the duplicates are gone.
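To see why daily snapshots dedupe so well, here is a back-of-the-envelope calculation; the 1 TB data set and 2 percent daily change rate are illustrative assumptions, not figures from EMC or TheInfoPro.

```python
# Illustrative arithmetic only: 30 daily full backups of a 1 TB data set,
# with roughly 2% of the data changing each day.
full_backups = 30
dataset_tb = 1.0
daily_change = 0.02

raw_tb = full_backups * dataset_tb  # what plain disk or tape would hold
unique_tb = dataset_tb + (full_backups - 1) * dataset_tb * daily_change  # only unique data survives dedupe

print(f"raw: {raw_tb:.1f} TB, deduped: {unique_tb:.2f} TB, ratio: {raw_tb / unique_tb:.1f}x")
# -> raw: 30.0 TB, deduped: 1.58 TB, ratio: 19.0x -- squarely in the 10x-30x range cited above
```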
Rob Emsley, senior director of product marketing at EMC’s backup recovery systems division, cited TheInfoPro research that said 65 percent of customers surveyed had backup dedupe in use in the first half of 2012, an increase of 20 percentage points from the first half of 2011.
The ten- to thirty-fold efficiency of backup dedupe is allowing customers to move away from tape, Emsley said, which has multiple benefits given how hard it is to dedupe data that sits on tape.
“The economics of tape vs. disk has always been one of the age-old questions,” Emsley said. “Because tape doesn’t support dedupe, the ability to store what you need to store on deduplicated disk makes the cost dramatically more affordable, allowing customers to move to a disk-based backup structure for data they need to retain.”
Now, companies can move data to storage disk and replicate it over a network to an off-site location, rather than make tapes and transport them off-site. Emsley said he knows of more than 1,000 accounts in EMC’s customer base alone that have gone tapeless in recent years thanks to efficiency gained from backup dedupe.
Then there’s primary storage deduplication, which runs the dedupe process on active data in the most frequently accessed storage areas. Primary dedupe has only recently started taking off, said Coulter, citing NetApp as the leader. But it’s been a long road because of its performance impact.
“Most people weren’t switching it on because it does introduce a performance overhead. But [we’re] seeing people start to adopt it on lower tiers of data. None of these areas need high performance,” Coulter said.
Marc Farley, senior director of customer and community programs at StorSimple, which does primary storage dedupe, acknowledges the I/O penalty and that primary dedupe doesn’t rack up gains as huge as backup dedupe’s. “Dedupe is computationally expensive, and while you’re trying to serve I/O to apps, it’s hard to have dedupe going on at the same time,” he said.
What’s making primary storage dedupe viable is the advent of SSD-based primary storage. In the traditional enterprise, the most-frequently accessed data resided on 15,000-RPM hard drives. The next tier down was 10,000-RPM drives. After that, less used data was moved to 7,200-RPM drives—and after that came tape.
SSDs running on a 16x PCI Express card are thousands of times faster than 15,000-RPM drives, making dedupe less punishing. “We see a lot more primary dedupe in flash arrays simply because they are so much faster than rotating disk. The bottleneck is the storage. Flash relieves that bottleneck,” said Farley.
The ratios are still unknown, but he guesses that primary dedupe will go from about a three-fold reduction of redundant data to five-fold or more, which would roughly double the amount of data the same storage can hold. “If you can buy half the storage you used to, that’s huge,” Farley added.
Infineta Systems’ Budhani insists that WAN deduplication is the hardest of the three because it requires scanning data as it goes over the wire instead of sitting on a storage system.
“If I have multiple data centers [connected by] a WAN, and that link is used to replicate data or push VMs or carry out file transfers,” he said, “customers would like to reduce bandwidth between the two because that gets expensive.”
Infineta can scan a full 10-gigabit pipe in real time. To do that, it offers its own hardware, the Data Mobility Switch, built around custom silicon: the company took programmable FPGA chips from Xilinx and programmed them to examine the data in parallel, byte by byte.
Infineta went with custom chips over Intel Xeons. While Xeons are very fast if you’re solving a compute problem, they aren’t as good at heavy memory I/O. “Memory utilization can be less than 20 percent. Looking up things randomly for read or write, utilization is low,” Budhani said.
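The article doesn’t describe Infineta’s actual algorithm, but the general technique for spotting repeated byte runs in an in-flight stream is content-defined chunking with a rolling hash. The sketch below is a hedged software illustration of that idea; the function names, window size, and chunk-size targets are assumptions, and a hardware appliance would do this in silicon rather than Python.

```python
import hashlib

def chunk_stream(data: bytes, window: int = 48, mask: int = 0x1FFF,
                 min_chunk: int = 2048) -> list[bytes]:
    """Split a byte stream into variable-size chunks at content-defined
    boundaries, using a polynomial rolling hash over a sliding window so
    that identical runs of bytes line up even when they shift position."""
    BASE, MOD = 257, 1 << 32
    pow_w = pow(BASE, window, MOD)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD  # drop the byte leaving the window
        # Declare a boundary when the low bits of the hash hit a fixed pattern,
        # which with a 13-bit mask gives chunks of roughly 8 KB on average.
        if i + 1 - start >= min_chunk and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def bytes_to_send(data: bytes, seen: set[str]) -> int:
    """Return how many bytes actually need to cross the link, assuming the
    far end already holds every chunk whose fingerprint is in `seen`."""
    sent = 0
    for chunk in chunk_stream(data):
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in seen:
            seen.add(digest)
            sent += len(chunk)  # unknown chunk: ship the bytes
        # known chunk: ship only a short reference instead of the data
    return sent
```

In practice the receiving end keeps the same fingerprint index, so a repeated chunk costs only a reference a few bytes long instead of the full payload.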
The results vary widely, however: some customers see a 15-fold reduction in data while others get only a two-fold reduction. It all depends on the use case, he added.
The challenge this year is juggling two opposing trends, Coulter said. Budgets are being constrained while capacity continues to grow: “2012 is the year of optimization, the simple balance of budgets and capacity growth. Companies need to fit more into less. Dedupe is one solution, thin provisioning is another, and so is automated tiering.”
Automated tiering is the one technology Coulter said is hotter than dedupe. That’s the automated movement of blocks of data between different tiers of storage, from most requested and accessed to the least. Data will automatically be shuttled between SSD arrays and tape drives on either end of the storage spectrum, as well as hard drives in between.
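As a rough illustration of the policy being automated, the toy sketch below ranks blocks by access count and fills the fastest tier first; the tier names, capacities, and block IDs are all invented for the example and don’t reflect any particular vendor’s implementation.

```python
from collections import Counter

# Illustrative tiers, fastest to slowest; real arrays track many more metrics.
TIERS = ["ssd", "15k_disk", "7200_disk", "tape"]

def retier(access_counts: Counter, capacity: dict[str, int]) -> dict[str, list[str]]:
    """Assign block IDs to tiers: the most-accessed blocks fill the fastest
    tier first, and the coldest blocks spill down toward tape."""
    placement = {tier: [] for tier in TIERS}
    hot_first = [block for block, _ in access_counts.most_common()]
    tier_iter = iter(TIERS)
    tier = next(tier_iter)
    for block in hot_first:
        while len(placement[tier]) >= capacity[tier]:
            tier = next(tier_iter)  # current tier is full: move down a tier
        placement[tier].append(block)
    return placement

if __name__ == "__main__":
    counts = Counter({"blk1": 900, "blk2": 850, "blk3": 40, "blk4": 3, "blk5": 0})
    caps = {"ssd": 2, "15k_disk": 1, "7200_disk": 1, "tape": 10}
    print(retier(counts, caps))
    # blk1/blk2 land on SSD, blk3 on 15k disk, blk4 on 7,200-RPM disk, blk5 on tape
```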
Tape, he continued, is declining. While that dip is only just beginning, he has encountered people who have reached zero tape. Some companies will always need it for compliance reasons, but many others don’t.