This post is a follow-up to Deduplication 101. In my 101 post, I tried to establish three things:
- Deduplication isn't a magical variant of compression; it works because we back up the same thing over and over and over again.
- Change in the object over time degrades our ability to deduplicate.
- Deduplicated data can compress just like "normal" backup data, and compression helps our ability to deduplicate. (A quick back-of-the-envelope example follows this list.)
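To make that last point concrete, here is a back-of-the-envelope sketch (the ratios are made up purely for illustration) of how the two reductions stack: deduplication removes redundant chunks, compression then shrinks the unique chunks that remain, and the overall savings multiply.

```python
# Hypothetical, illustrative ratios: not measurements from any product.
dedup_ratio = 10.0        # 10:1 from deduplication alone
compression_ratio = 2.0   # 2:1 from compressing the unique chunks

backup_tb = 1.0
stored_tb = backup_tb / (dedup_ratio * compression_ratio)

print(f"{backup_tb} TB of backup data -> {stored_tb:.2f} TB on disk "
      f"({dedup_ratio * compression_ratio:.0f}:1 overall)")
```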
I also made the point that, in most cases, deduplication from different vendors and of different types will achieve more or less the same deduplication ratio and storage savings.
Well, it's time to talk about some of the caveats to that statement: when and where it is more or less true, and why.
To start with, let's review the different types of deduplication, and where deduplication happens.
We can break deduplication down into source and target deduplication. We can also differentiate between hardware and software-only implementations. Interestingly, source deduplication only happens in software, and target deduplication only happens in hardware. (I included FalconStor in the hardware camp, but practically speaking they require the end user to build their own hardware configuration and then deploy the software on top of it... not exactly an ideal practice. In reality, I think they are relying on OEMs like Sun to do the heavy lifting around the integration.)
It is also true that target deduplication can be in-band, out-of-band, or a bit of both. (Hello, Sepaton and others... honestly, despite vendor protestations, the "bit of both" approach is really just obfuscation: they are out of band. Data is written to disk, then deduplicated. It may be deduplicated pretty darn quickly after it is written to disk, assuming a bunch more data hasn't been thrown at the appliance in the meantime, but it is still after.)
The interesting thing that emerges here, however, is that being in-band, out-of-band, or something in between has zero impact on the final deduplication ratio.
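To see why, here is a minimal sketch (the chunk size, function names, and data are all my own illustration, not any vendor's implementation). Whether you deduplicate chunks as they arrive or sweep them after they land on disk, you end up keeping exactly the same set of unique chunks:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking, purely for illustration

def chunks(data: bytes):
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

def inline_dedup(stream: bytes) -> dict:
    """In-band: hash each chunk as it arrives; store only unseen chunks."""
    store = {}
    for chunk in chunks(stream):
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store

def post_process_dedup(stream: bytes) -> dict:
    """Out-of-band: land everything on disk first, deduplicate afterwards."""
    landing_area = list(chunks(stream))    # the full, undeduplicated write
    store = {}
    for chunk in landing_area:             # the later deduplication sweep
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store

backup = (b"A" * CHUNK_SIZE) * 5 + (b"B" * CHUNK_SIZE) * 3
assert inline_dedup(backup).keys() == post_process_dedup(backup).keys()
# Same unique chunks either way; the timing differs, the ratio does not.
```

What in-band versus out-of-band does affect is ingest behavior and how much landing space you need, not the ratio.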
However, we do occasionally see some pretty significant differences between source- and target-based deduplication. In fact, I can refer to one recent test performed by a customer that showed source deduplication out-performing target deduplication by roughly 60%. In this particular test, the following results were reported:
- Total data protected (the "source" data): 195 GB
- Total backup data with Avamar source deduplication: 79 GB
- Total backup data with Data Domain target deduplication: 128 GB
- Total backup data on tape: 1.35 TB
- Total number of full backups: 9
In this example, Avamar (source) deduplication was actually 61% more efficient than Data Domain (target) deduplication. Not only was it more efficient, it was also much faster: backups typically completed with Avamar in less than 25% of the time they took with a traditional backup application and Data Domain. Backups with the "traditional" backup application (which I won't name, because I don't think it makes any difference at all; they all would have behaved more or less the same in this circumstance) took about 6 hours on average. Backups with Avamar took 1.5 hours for the first backup, and 20 to 30 minutes for subsequent backups.
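If you want to check the arithmetic, the reported numbers reduce to a few divisions (the variable names are mine; the figures are the customer's):

```python
source_gb = 195    # data protected per full backup
avamar_gb = 79     # stored after Avamar source deduplication
dd_gb     = 128    # stored after Data Domain target deduplication
tape_gb   = 1350   # 1.35 TB of undeduplicated backup data on tape

print(f"Avamar advantage: {dd_gb / avamar_gb - 1:.0%}")   # ~62%, in line with the 61% above
print(f"Avamar vs. tape:  {tape_gb / avamar_gb:.1f}:1")   # ~17:1, discussed below
print(f"DD vs. tape:      {tape_gb / dd_gb:.1f}:1")       # ~10.5:1
```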
So, there are two very significant things here: source deduplication can be much more effective at deduplicating backup data, and it can also be much faster.
It is also worth pointing out that this was a customer trial. EMC had no influence over the result. It was a pure "bake-off" between competitive solutions, conducted by a customer who had no stake in achieving one result or another. It is as close to a vendor-neutral result as we are able to get.
Two other things happened here that are also really important, and very useful in providing insight into deduplication:
- We achieved an extraordinarily high amount of deduplication (17:1) given the number of full backups. Remembering my previous generalization that deduplication comes from backing up the same data set multiple times, we would normally conclude that the deduplication ratio cannot exceed the number of full backups; in this case, with 9 full backups, we should not have seen better than 9:1 deduplication. However, the customer had a very large amount of intra-object commonality: the same blocks repeated within the dataset. This is not all that common, except in VMware and Windows environments. The test server, in this case, was a large Windows file server. So, another useful generalization: the more intra-object commonality, the better the deduplication ratio will be. (The first sketch after this list shows how intra-object commonality breaks that 9:1 ceiling.)
- Source deduplication was much more effective because it could recognize common objects at the source, and understand where the natural divisions within the objects are. We can generalize: the better the deduplication method understands the source data, the higher the deduplication we can achieve. (The second sketch below illustrates why those natural divisions matter.)
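On the first point: here is a minimal sketch, using a made-up dataset, of how intra-object commonality breaks the "ratio cannot exceed the number of fulls" ceiling. A dataset whose blocks repeat internally deduplicates against itself, so even the very first backup shrinks:

```python
import hashlib

CHUNK = 4096

def unique_chunks(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

# Hypothetical file server: 10 blocks on disk, but only 2 distinct ones
# (think many copies of the same OS files, templates, and documents).
dataset = (b"A" * CHUNK) * 7 + (b"B" * CHUNK) * 3

fulls = 9
raw_chunks    = fulls * (len(dataset) // CHUNK)  # 90 chunks written to tape
stored_chunks = len(unique_chunks(dataset))      # 2 unique chunks kept

print(f"{raw_chunks / stored_chunks:.0f}:1")  # 45:1, well past the 9:1 bound
```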
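And on the second point, a concrete illustration of why "understanding where the natural divisions are" matters. The toy content-defined chunker below is entirely my own sketch (real products use far more sophisticated methods): insert two bytes at the front of a file, and fixed-size chunks all shift out of alignment and stop matching, while content-defined boundaries re-synchronize and almost every chunk still deduplicates:

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 64):
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes, window: int = 4, mask: int = 0x3F):
    """Cut a boundary wherever a small window of bytes hashes to a magic
    value, so boundaries stick to the content instead of to byte offsets."""
    out, start = [], 0
    for i in range(window, len(data)):
        h = hashlib.sha256(data[i - window:i]).digest()[0]
        if h & mask == 0:          # ~1-in-64 chance -> ~64-byte chunks
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out

random.seed(0)
old = bytes(random.randrange(256) for _ in range(8000))
new = b"XX" + old   # insert two bytes at the front; everything shifts

def shared(a, b):
    return len(set(a) & set(b))

print(shared(fixed_chunks(old), fixed_chunks(new)))      # ~0 chunks survive
print(shared(content_chunks(old), content_chunks(new)))  # nearly all survive
```

This alignment problem is one reason a method that understands the data's natural divisions can deduplicate more of it.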