I saw this little quote on Techtarget today: "File-level deduplication will save a relatively small amount of space on your disk/tape archive. Block-level deduplication will save more space on your disk/tape archive, and variable block-level deduplication will save even more space on your disk/tape archive." For those of you interested in reading the whole article, it can be found here (registration required).
While the statement is true, it is a little light on detail, and it downplays the importance and impact of the different technology choices. To set expectations, we could expect roughly the following deduplication ratios when backing up the same data set with the same retention policy:
- File level deduplication: 3:1 to 5:1
- Fixed block level deduplication: 5:1 to 10:1
- Variable block level deduplication: 30:1 or better
So there is a pretty substantial difference here. While capacity savings should not be the be-all and end-all of a technology choice around deduplication, differences of this magnitude will certainly come into play. Bear in mind that as the dedup ratios increase, the incremental capacity savings decrease. Given a 1 PB backup data set:
- File level deduplication would require 200 to 330 TB of disk.
- Fixed block deduplication would require 100 to 200 TB of disk.
- Variable block deduplication would require about 33 TB of disk or less.
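The arithmetic behind these figures is simply the logical backup size divided by the deduplication ratio. The short sketch below reproduces it; the function name `disk_needed_tb` is made up for illustration, and it assumes 1 PB = 1000 TB.

```python
def disk_needed_tb(logical_tb: float, ratio: float) -> float:
    """Physical disk needed to hold logical_tb of backups at a ratio:1 dedup ratio."""
    return logical_tb / ratio


LOGICAL_TB = 1000  # 1 PB, assuming 1 PB = 1000 TB

for ratio in (3, 5, 10, 30):
    needed = disk_needed_tb(LOGICAL_TB, ratio)
    saved_pct = 100 * (1 - 1 / ratio)  # share of the logical data that is not stored
    print(f"{ratio:>2}:1 -> {needed:6.1f} TB on disk, {saved_pct:4.1f}% saved")
```

Moving from 3:1 to 5:1 frees roughly 133 TB, while moving from 10:1 to 30:1 frees roughly 67 TB, which is the diminishing return mentioned above. Even so, 100 TB versus about 33 TB is a threefold difference in disk to buy.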
This is a significant differentiator. Fixed block schemes are much less efficient, and you should absolutely take care to understand whether your vendor offers file level, fixed block, or variable block deduplication.
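One common reason for the gap is that a small insertion shifts every later fixed-size block, so those blocks no longer match what is already stored, whereas variable (content-defined) boundaries move with the data. The toy sketch below illustrates the idea only. It is not Avamar's or any other product's actual algorithm: the function names, the 64-byte block size, the 16-byte window, and the use of BLAKE2 as the boundary hash are all assumptions chosen for readability. Real systems typically use a true rolling hash and enforce minimum and maximum chunk sizes.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut wherever a hash of the preceding `window` bytes has its low bits zero.

    Boundaries depend only on nearby content, not on absolute offsets.
    With mask 0x3F the average chunk is roughly 64 bytes.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared_fraction(old_chunks: list, new_chunks: list) -> float:
    """Fraction of the new backup's bytes that sit in chunks already stored."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    dup = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in seen)
    return dup / sum(len(c) for c in new_chunks)


rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(8192))
edited = original[:100] + b"X" + original[100:]  # insert one byte near the start

fixed = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
variable = shared_fraction(variable_chunks(original), variable_chunks(edited))
print(f"fixed blocks:    {fixed:.1%} of the edited data already stored")
print(f"variable blocks: {variable:.1%} of the edited data already stored")
```

With fixed blocks, only the block before the insertion still matches. With variable blocks, the boundaries resynchronize shortly after the inserted byte, so nearly all of the edited data deduplicates against the original.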
As an endnote, I would say that based on my personal observations, these differences are very real. I have encountered several situations in the last few months where Avamar has demonstrated much higher deduplication savings than products which only use fixed block deduplication. There is a good reason why EMC has implemented variable block deduplication in both products in our deduplication portfolio.
While it is true you can see very high dedupe levels with variable block level deduplication, I find that you're shifting some of the expense of the operation to another area. If you're just deduplicating a file server, it's not such a big deal. But if you're trying to deduplicate Oracle with Avamar, you end up having a more expensive operation occur on the database than just dumping the backups to a disk. I have to caveat that my sample size is low, so I'm completely open to your response and would be ecstatic to be proven wrong.
I still like Avamar. I just think that some of the work that you have to do in order to see the high level of dedupe may not help medium-sized companies as much as it does Fortune 500 companies, where servers sit around mostly idle all of the time and so pushing an Oracle box harder isn't such a big deal.
My opinions are my own and do not represent my company.
Posted by: DanielJdoughty | January 28, 2010 at 11:54 AM
Daniel;
The situation you speak of may well be the case--and this is an ideal use case for target deduplication. Avamar may still be appropriate but there are a host of issues to consider.
As an interesting aside, most database backups with deduplication default to a fixed block deduplication of 8 KB, because that is the size of a database block in most cases anyway. So it turns out to be more efficient to do this. On the other hand, we still achieve similar net deduplication ratios to the variable length dedup that I discussed above (in part due to how well databases compress, and assuming that we are talking about a database with an average change rate).
Posted by: Scott Waterhouse | January 28, 2010 at 12:56 PM
Hi Scott, good topic. The way I describe file/fixed/variable dedupe is that it's a tradeoff between space savings and performance impact. File dedupe (SIS) will get you some small savings, but it's very easy to do. Variable block will get you the most savings, but it takes the longest to do. Fixed block-level dedupe is somewhere in the middle.
One is not better than the other; the choice depends on the user's tolerance for performance overhead vs. the desired space savings.
Thanks,
DrDedupe
Posted by: DrDedupe | January 29, 2010 at 08:01 AM