Normally, very high deduplication ratios are something to be pretty excited about. Particularly if you are the one getting them. If you get 450:1 deduplication, you are probably a pretty happy camper. It probably means you have saved yourself a pile of time, money, and administrative effort. It probably means your TCO is lower, your risk is reduced, and your backups are a lot more reliable.
So, life is good, right?
Well, maybe. But maybe not. Sometimes extravagantly high deduplication ratios are a sign that something is wrong. Really wrong.
I want to take the opportunity to talk about why, if you are getting 450:1 deduplication ratios, you may want to make some changes.
(By the way, I didn't pull the 450:1 number out of my hat... it is referenced here.)
So how can you get 450:1? Generally, there are only two ways:
- You have an extraordinarily high amount of intra-backup commonality. Basically, you are backing up multiple copies of the same file, a lot.
- You are doing a lot of full backups of a data set that doesn't change very much at all.
There are really not a lot of other alternatives.
So, if the first case applies to you, what does this mean? Well, again, there are really only two reasons why this might be the case.
The first of these is that you are doing VMware backup. Generally speaking, VMware backups can get very high deduplication ratios because you are backing up a lot of data that is the same from one VM to the next. Put another way: any two .vmdk files will have a lot of common segments between them (if nothing else, think of all the Windows files that are more or less identical and that will be found in every VM). If you have a lot of VMs, you will have a lot of data that is duplicated. No surprise, then, that when you go to back it up, you will get a lot of deduplication. (Incidentally, this is one of the cases in which we see Avamar getting deduplication ratios of 600:1 or better.) So if this is you, and you are getting very high deduplication ratios because you are doing a huge amount of VMware backup, you are excused. Give yourself a pat on the back, because life is good, and your very high deduplication ratios are not a sign of trouble.
However, the second reason why you can have a lot of intra-backup commonality is that you have a lot of file servers littered with multiple copies of the same files. Meaning that a bunch of people have saved the same file (more or less, with minor edits) to their personal directories. Or that you still have people saving .pst or .nsf files to their personal directories. This is bad.
How do you know if that is the case? Well, there are a bunch of free utilities that you can download that will do a quick analysis of data redundancy on your servers or NAS boxes. Or you could engage EMC to do a quick and free assessment of your environment--we can quickly and painlessly tell you just how much redundant data you have. The final way you can spot this is if you get a significant amount of deduplication on your first full backup. If you reduce a 10 TB backup to 2 TB the first time you back up to a deduplication appliance, and you aren't backing up exclusively VMware systems, you have this problem.
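If you just want a quick back-of-the-envelope look yourself, a few lines of Python will do the job. This is only a rough sketch, not any particular vendor's tool (the path, chunk size, and report format are all placeholders): it walks a directory tree, hashes file contents, and reports how much of the space is taken up by redundant copies.

```python
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(root):
    """Group files under 'root' by content hash and report how much is redundant."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_hash[sha256_of(path)].append(path)
            except OSError:
                pass  # unreadable file; skip it
    total = sum(os.path.getsize(p) for paths in by_hash.values() for p in paths)
    unique = sum(os.path.getsize(paths[0]) for paths in by_hash.values())
    redundant = total - unique
    pct = 100 * redundant / total if total else 0
    print(f"Total data:     {total / 2**30:.1f} GiB")
    print(f"Unique data:    {unique / 2**30:.1f} GiB")
    print(f"Redundant data: {redundant / 2**30:.1f} GiB ({pct:.0f}% of the total)")

if __name__ == "__main__":
    scan(sys.argv[1])  # e.g. python dupe_scan.py /path/to/fileshare
```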
Therefore, if this is in fact the case--if you do have a lot of redundant data in your file environment--then yes, you will get very high deduplication ratios. However, this is one of the times when that ratio should be telling you that something is fundamentally broken earlier in the backup/archive chain. Simply put: all that redundant data should be eliminated before it even gets to your backup device. It should be archived.
By archiving the data you will single-instance it. Single-instancing is the process of getting rid of multiple copies of the same file (or email attachment) and storing a single instance, with appropriate pointers for the multiple copies. By archiving it, you will also move it off (expensive) primary storage and onto (less expensive) secondary or archival storage.
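To make the single-instancing idea concrete, here is a toy sketch (purely illustrative; it is not how Centera or any specific archiving product is implemented): the archive keeps one copy of each unique piece of content, keyed by its hash, and every logical location is just a cheap pointer to that copy.

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressed store: one copy of each unique blob, many pointers."""

    def __init__(self):
        self.blobs = {}     # content hash -> the actual bytes (stored once)
        self.pointers = {}  # logical path  -> content hash (cheap stub)

    def archive(self, path, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)   # stored only if not already present
        self.pointers[path] = key
        return key

    def retrieve(self, path):
        return self.blobs[self.pointers[path]]

store = SingleInstanceStore()
report = b"Q3 sales deck, the same 40 MB attachment everyone saved..."
# Ten users saved the same file to their home directories:
for user in range(10):
    store.archive(f"/home/user{user}/q3_deck.ppt", report)

print(len(store.pointers), "logical copies,", len(store.blobs), "physical copy")
```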
The notion is straightforward: it doesn't matter how well you deduplicate, it is always cheaper and better to archive the data so that you never have to back it up in the first place. Cheaper, because you don't have to pay for primary storage, backup storage, network bandwidth, etc. Better, because you have a scalable process (which backup, in this context, is not) that is less burdensome, less risky, better serves user needs, and can ultimately add value by providing searchable metadata indices of archived data and by supporting compliance (both of which are also beyond the scope of backup).
Now, case number two above--a lot of full backups of static data--is also a guaranteed sign that you are a good candidate for archiving. If you back up the same data set, full, many times, all you are doing is moving a whole bunch of copies of exactly the same thing over your network to store them on disk or tape.
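A little back-of-the-envelope arithmetic shows how quickly that practice inflates the ratio. The numbers below are invented purely for illustration: a 10 TB data set, a full backup every day for a year, and only 0.01% of the data changing per day.

```python
# Back-of-the-envelope: what repeated full backups of a nearly static
# data set do to the deduplication ratio. All numbers are illustrative.

full_size_tb = 10.0         # size of each full backup
daily_change_rate = 0.0001  # 0.01% of the data actually changes per day
fulls_per_year = 365        # a full backup every day, retained all year

# Data sent to the backup target vs. unique data it actually has to keep.
logical_tb = full_size_tb * fulls_per_year
unique_tb = full_size_tb + full_size_tb * daily_change_rate * fulls_per_year

print(f"Logical data protected: {logical_tb:,.0f} TB")
print(f"Unique data stored:     {unique_tb:,.1f} TB")
print(f"Deduplication ratio:    {logical_tb / unique_tb:,.0f}:1")  # roughly 350:1
```

Roughly 350:1, and the ratio only climbs the longer you retain the fulls and the less the data changes--which is exactly how you end up in 450:1 territory without doing anything clever.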
This is senseless. This data should be archived.
By archiving it, you will no longer have to move it over the network every time you do a full backup. And just as in the first case, you will no longer need the primary storage to keep it on, either. By putting the data on an archival device like the Centera, you will get it out of your backup stream entirely. Again, it is just cheaper and better to archive data than it is to back it up.
Incidentally, I will also offer the following observation: the customers I see doing a lot of full backups are usually doing them because they have a compliance requirement of some sort. If that is the case, then you have another, very compelling reason to consider archiving the data rather than backing it up in the traditional way. When it comes to meeting compliance requirements, archival storage is superior to tape or deduplicated disk in almost every way.
To conclude: if you are getting deduplication ratios of hundreds to one, and you are not doing primarily VMware backup, then you will almost certainly benefit from archiving before you back up.