When EMC first entered the target deduplication market last spring, one of the things that I was really looking forward to was the opportunity to bring some honesty to the discussion of deduplication. In particular, I had long watched various parties claim that with their technology you would get 25:1 deduplication. Or 50:1. Or whatever number seemed to strike the fancy of the sales rep and the marketing department.
The thing that strikes me now, in the fall, is that not much has changed in terms of the behavior of those vendors. And we now have a new set of folks claiming specific ratios (ahem, Sepaton, I am talking about you and your ridiculous "guarantee" on Exchange).
Frankly, it is time for the insanity to stop.
To know a deduplication ratio, you need to know 3 things:
- Compression ratio: how well does the data compress in normal backup situations (i.e. to tape)?
- Change rate: how much of the data changes at a block or segment level on a daily basis? Note that this is different from, and lower than, the incremental change rate seen by a normal backup application--which simply captures change at an object level.
- Retention period: because the ratio is that of data on disk (deduplicated) to total data that is captured by the backup process, we must know how much data is captured. Therefore, we must know how often full backups are conducted, how often incremental backups are conducted, and how long each backup type is retained. Without this knowledge one of the two numbers in the ratio cannot be determined.
If you don't know those three things, you simply cannot state a deduplication ratio with any level of honesty.
It is impossible.
What's more, I would argue, you can't even make much in the way of a useful generalization. Why? Because there is no such thing as a "normal" retention period. Or frequency of backup.
I talk to hundreds of customers a year. And the only generalization that I can make about backup retention periods is that I can't make a generalization. Some people keep backups for a month. Some for 7 years. Some forever. Some keep weekly fulls for 7 years. Some keep daily fulls for 7 years (ouch). Some do synthetic fulls. Some do fulls every time they back up a database. Some do incrementals. I would even go so far as to say that there is not even a bell curve here, or an "average" retention policy. It is simply all over the board.
So, a deduplication ratio is the ratio of the total amount of data captured by the backup to the amount of data retained on disk after deduplication.
If I keep 50 full backups of 10 TB of data, and I store (after deduplication) only 20 TB of data, then my ratio is 25:1.
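The arithmetic in that example can be checked with a minimal Python sketch. The function name `dedup_ratio` and its parameters are illustrative and do not come from any product.

```python
def dedup_ratio(captured_tb: float, stored_tb: float) -> float:
    """Ratio of data captured by backup to data kept on disk after deduplication."""
    if stored_tb <= 0:
        raise ValueError("stored_tb must be positive")
    return captured_tb / stored_tb


# 50 retained full backups of a 10 TB data set, 20 TB on disk after deduplication.
fulls_retained = 50
dataset_tb = 10
stored_tb = 20

ratio = dedup_ratio(fulls_retained * dataset_tb, stored_tb)
print(f"{ratio:.0f}:1")  # prints 25:1
```

Halve the number of retained fulls and, even if the stored amount stayed at 20 TB, the ratio would drop to 12.5:1. The same product and the same data give a different number.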
But unless I know how many full backups I am keeping, I simply can't say what my deduplication ratio is going to be.
And even when I do know that, I still need to know the rate of change within the data, and how well that data will compress.
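To show how the three factors interact, here is a rough sketch in Python. It is a deliberately simplified model, not a sizing tool: it assumes a fulls-only schedule, a data set that stays the same size, a fixed fraction of new segments in each full, and compression that applies evenly to everything stored. The function name, the 2% change rate, and the 2:1 compression figure are all illustrative assumptions.

```python
def estimate_ratio(fulls_retained: int, change_rate: float, compression: float) -> float:
    """Estimate a dedup ratio for a fulls-only schedule (simplified model).

    Assumptions: constant data set size, each full after the first adds
    `change_rate` (as a fraction of one full) of new unique segments, and
    `compression` (e.g. 2.0 for 2:1) applies evenly to all stored data.
    """
    if fulls_retained < 1:
        raise ValueError("fulls_retained must be at least 1")
    if compression <= 0:
        raise ValueError("compression must be positive")
    # Work in units of one full backup.
    captured = fulls_retained
    stored = (1 + (fulls_retained - 1) * change_rate) / compression
    return captured / stored


# Same change rate and compression, different retention: very different ratios.
for fulls in (4, 18, 50):
    ratio = estimate_ratio(fulls, change_rate=0.02, compression=2.0)
    print(f"{fulls} fulls retained -> about {ratio:.1f}:1")
```

Nothing about the deduplication technology changes between the three lines of output; only the retention assumption does.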
I look forward to the day when vendors stop claiming (and therefore customers stop expecting) a given ratio. Because on that day we will all finally understand that the only honest answer anybody can give to the question "how much deduplication am I going to get?" is "it depends". It depends on the three factors above.
But I guess it is easier for some vendors to just skip the tough part--educating our customers, and actually having an honest conversation with them about the things which affect deduplication--and say "we do 25:1... now how many units do you want?"
I would add to your blog that because of the lack of generalization in the backup retention area, and other key factors in the de-duplication of data, it falls upon us, the vendors (sw or hw), to explain those factors clearly to our clients and to guide them, through white papers or best-practice guidelines, to change their behavior so they can take full advantage of, and evaluate the viability of, the new paradigm that is de-duplication in their environment. It is a bit of a chasm between the traditional and the new, after all. Just because a vendor touts de-duplication doesn't make their product(s) a panacea. You need a product that can offer de-duplication as a solid feature amongst other solid features to give our clients the flexibility to optimize their backup and recovery environments.
That's my $.02 worth :-)
Toe-Knee
Posted by: Toe-Knee | September 25, 2008 at 06:16 AM
De-duplication ratios have everything to do with the type of data and the number of copies or retention period. If the data is Microsoft Office, database and email, and you keep longer retention such as 18 weeks, you can hit upwards of 50 to 1. However, if the data is pre-compressed, the ratio will be poor, and if the retention period is 4 weeks the ratio will be lower. ExaGrid has hundreds of installations of disk-based backup with de-duplication behind existing backup servers and we see two things. The first is that about 2% of the data changes from backup to backup. So once you have the first copy, each subsequent copy only takes 2% more space. Across our customer base we see ratios of 10 to 1 all the way to 50 to 1 depending on the type of data and the length of retention (number of nights and weeks kept). Therefore, the ratio can range greatly because the variations of data types and retention periods are endless. Hope this helps.
Posted by: Bill Andrews - ExaGrid | September 25, 2008 at 06:15 PM
I've read your post and completely agree that you cannot know dedupe ratio until you know some things about the environment. IMHO, the big ones are frequency of full backups and retention period. Compression ratio isn't usually an issue, but I know that it can be with some data types.
I don't see why you think that SEPATON's guarantee is "ridiculous," or why you feel the need to put "guarantee" in quotes. If you read the fine print of the guarantee (available at http://tinyurl.com/3kyehz), you would know that it addresses the things you listed as requirements.
It is only for "NetBackup v5.1, v6.0, and v6.5 with Microsoft Exchange 2003 Agent (Windows 2003) data," using "full backups of Microsoft Exchange 2003 data five times per week" and "thirty days" of retention.
So they address everything you said except for compression ratio, which I'm sure they just used a conservative number on. (Exchange compresses quite well in comparison to other data types.)
Isn't it possible that they've done enough deduped backups of customers' Exchange data to know what their dedupe minimum is for a given set of conditions, and offer that as a guarantee, given those conditions?
Wouldn't it be quite stupid of them to be making this up, given that the guarantee says that they'll give them $50,000 of free disk if it's not met?
Posted by: W. Curtis Preston | September 26, 2008 at 11:14 AM
I think you're forgetting about the impact of the data growth rate on the de-dupe ratio. By growth I mean the introduction of new, unique data to the dataset. If the growth of data is low (e.g. system backups) then the de-dupe ratio will be very high from one full backup to the next and continue to increase. If the growth rate is high (e.g. unstructured data) then there will be too much new data introduced at each backup event and the de-dupe ratio will level off very quickly.
Posted by: Joe Walsh | October 02, 2008 at 10:55 AM
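The growth effect described in the comment above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a measurement: each full is larger than the previous one by a fixed fraction (compounding), all new data is unique, existing data never changes, every full is retained, and compression is ignored. The function name `ratio_with_growth` is illustrative.

```python
def ratio_with_growth(fulls_retained: int, growth_per_backup: float) -> float:
    """Dedup ratio when each full is `growth_per_backup` larger than the last.

    Assumptions: growth compounds, new data is entirely unique, existing
    data never changes, and compression is ignored.
    """
    if fulls_retained < 1:
        raise ValueError("fulls_retained must be at least 1")
    # Size of each full, in units of the first full backup.
    sizes = [(1 + growth_per_backup) ** k for k in range(fulls_retained)]
    captured = sum(sizes)  # every full is captured in its entirety
    stored = sizes[-1]     # unique data on disk equals the newest, largest full
    return captured / stored


for growth in (0.001, 0.05):
    ratios = [round(ratio_with_growth(n, growth), 1) for n in (10, 50, 100)]
    print(f"growth {growth:.1%} per backup: ratios at 10/50/100 fulls = {ratios}")
```

Under these assumptions the ratio can never exceed (1 + g) / g, where g is the growth per backup. At 5% growth per backup that ceiling is 21:1 no matter how many fulls are kept, while at 0.1% growth the ratio keeps climbing across the same range.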
Excellent point. You are absolutely correct.
Posted by: Scott Waterhouse | October 14, 2008 at 11:32 AM
With respect to the "ridiculous" guarantee, Curtis: it is ridiculous because it makes an assumption and then only details it in the fine print.
But the assumption is that you are doing full backups every day (well, 5 days a week). That is exactly the assumption that is so misplaced and that I was trying to highlight in my post on deception.
In other respects, I know you are right: Exchange does compress well, and there is very little risk as a result. (Unless, of course, the customer has very small inboxes, which will lead to very high rates of change, and therefore very low dedup ratios.)
Bottom line: the claim of any particular ratio bugs me. Especially when the assumptions are in the fine print and not stated up front.
Posted by: Scott Waterhouse | October 14, 2008 at 11:49 AM