As a follow-up to my two previous posts, Deduplication 101 and Deduplication 201, I am going to put up a deduplication calculator. I am doing so with some (considerable) reservations, because by no means should this be mistaken for a reliable or accurate tool. It is not for sizing or forecasting. It is more right than it is wrong, but that is as far as I am willing to go. Consider yourself warned!
So what does it do? Two things. First, it lets you get a very general idea of how much deduplication you can expect--but see my caveats below as far as that goes. Second, it exposes some of the things that make deduplication work really, really well, or really, really poorly. By playing with the numbers you can very quickly see the impact that changing the key factors has on the deduplication ratio.
The model is included here: dedupcalc.xls
The Reader's Digest version is this: the numbers in light blue can and should be changed to model an environment--they are in the "inputs" box. The numbers in darker blue are the output of the model and can be found below in the "outputs" box.
To use it to get a quick understanding of the amount of deduplication you can expect, the first two inputs are key. How much data do you have? And, are you going to store archival images (quarterly and annual backups, often retained for 7 years or more) on the deduplication appliance?
Right away, we can notice something really important: how we answer the question about archival images makes a big difference to the final deduplication ratio. And so it should. The general rule is: the more archival images you store, the lower your deduplication ratio is going to be. Why? Because an archival image represents data as it was a long time ago, and substantial change has likely occurred at the block level since it was made. It is therefore composed of much more unique data. Unique data will not deduplicate.
However, every rule has an exception. If you have a very low data change rate, or if the change occurs primarily within the same blocks, then the impact should be less. You can see this in the model by reducing the change rate to 0.5% or less. Then it doesn't matter so much how you answer the archival question! What the tool cannot do, however, is account for both the daily change rate (which will contribute to degrading the deduplication ratio) and the archival change rate (which only degrades deduplication if the change is spread across different blocks, rather than the same blocks changing repeatedly).
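To make that concrete, here is a tiny back-of-the-envelope illustration. It is my own simplification--it assumes every changed block is a brand-new block (the worst case) and says nothing about how the spreadsheet actually does its math:

```python
# Rough illustration (my own simplification, not the math in dedupcalc.xls):
# if blocks change independently at a fixed daily rate, the fraction of an
# archival image that is unique -- and so will not deduplicate -- grows with
# the time elapsed since the previous retained copy.
for daily_change in (0.02, 0.01, 0.005):
    for days_apart in (90, 365):
        # Worst case: every changed block is a new block.
        unique = min(1.0, daily_change * days_apart)
        print(f"{daily_change:.1%}/day, images {days_apart} days apart: "
              f"~{unique:.0%} unique")
```

If the changes keep landing on the same blocks instead of new ones, the unique fraction is far lower--which is exactly the exception described above, and exactly the distinction the tool cannot make for you.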
After answering the two big questions, it is just a matter of filling in the blanks (a rough sketch of the arithmetic follows this list). You need to input:
- The daily change rate of the data. This is probably the hardest input to get right, because the most accurate results come from knowing the block-level change rate, which is not the same as the change rate from an incremental backup. (But the data can be had if you have a CDP product such as RecoverPoint, or a replication product, that reports on such things.)
- The compression ratio of the data. This is the same ratio you would see if you backed the data up to tape.
- The number of full versions, i.e. the number of weekly and monthly (and daily, for databases) full backups that are retained.
- The number of incremental versions. This likely only matters for unstructured data (files and application binaries).
- The number of archive versions, i.e. the versions that are retained for the long term--usually one to seven years, sometimes longer.
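Here is the rough sketch of the arithmetic I promised. To be clear about what is mine and what is not: the parameter names, the assumption of weekly fulls, the 90-day archive spacing, and the "every changed block is a new block" worst case are all my simplifications for illustration--none of the formulas below are lifted from dedupcalc.xls.

```python
# A minimal, back-of-the-envelope sketch of the kind of calculation a tool
# like this performs.  All assumptions (weekly fulls, 90-day archive spacing,
# worst-case block change behaviour) are my own, not the spreadsheet's.

def estimate_dedup(data_tb, daily_change, compression,
                   full_versions, incremental_versions, archive_versions,
                   days_between_archives=90):
    """Return (tape_tb, stored_tb, ratio) for one simplified backup policy."""
    # Capacity the same retention would consume on tape, after compression
    # (apples-to-apples, per the compression note further down).
    tape_tb = (full_versions * data_tb
               + incremental_versions * data_tb * daily_change
               + archive_versions * data_tb) / compression

    # Capacity a dedup appliance might consume: one compressed baseline plus
    # only the new (unique) blocks each later version contributes.  Each
    # archive image is assumed to deduplicate only against its predecessor.
    new_per_full = min(1.0, daily_change * 7)
    new_per_archive = min(1.0, daily_change * days_between_archives)
    stored_tb = (data_tb
                 + max(0, full_versions - 1) * data_tb * new_per_full
                 + incremental_versions * data_tb * daily_change
                 + archive_versions * data_tb * new_per_archive) / compression

    return tape_tb, stored_tb, tape_tb / stored_tb


if __name__ == "__main__":
    # 100 TB, 1% daily block change, 2:1 compression, 12 fulls, 30 incrementals
    for archives in (0, 28):  # without, then with, 7 years of quarterly images
        tape, stored, ratio = estimate_dedup(100, 0.01, 2.0, 12, 30, archives)
        print(f"archives={archives}: ~{ratio:.1f}:1 "
              f"({stored:.0f} TB stored vs {tape:.0f} TB on tape)")
```

Notice that in this toy version the compression ratio cancels out of the ratio itself, because both sides are compressed--that is the apples-to-apples point I make in the compression note below. The capacities themselves, of course, very much depend on it.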
And once you have input the data, you can see (very) approximately what deduplication ratio you are likely to achieve. Please bear in mind the caveats that I mention at the bottom of the post.
Overall, the tool is very useful for helping us appreciate the difference that change rate, compression, and the number of versions retained can make to the deduplication ratio.
(A quick note on compression: I have included compression in the figures for the "amount of data retained on tape", unlike every other tool of this sort that I am familiar with, which do not. The reason they don't? I am sure it has nothing to do with the fact that not accounting for compression on tape makes the deduplication ratio look better by a factor of 2 or 3--whatever the average compression rate is. No. No way that could be the case, right? Not even NetApp would do that! Wait a minute... Sure they would. They did. So I have included it because I want to give as close to an "apples to apples" comparison of capacity as possible.)
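A quick arithmetic check of what that accounting choice does to the headline number (all figures below are invented purely for illustration):

```python
# Made-up figures, purely to show the effect of the accounting choice above.
logical_tb = 1000.0   # uncompressed data protected over the retention period
compression = 2.0     # typical 2:1 tape compression
stored_tb = 50.0      # hypothetical capacity consumed on the dedup appliance

tape_tb = logical_tb / compression  # what tape would actually hold
print(f"vs. compressed tape:   {tape_tb / stored_tb:.0f}:1")     # 10:1
print(f"vs. uncompressed data: {logical_tb / stored_tb:.0f}:1")  # 20:1
```

Same appliance, same data, but the second way of quoting it looks exactly one compression factor better.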
The caveats include, but are not limited to:
- The tool does not account for intra-object commonality. This can matter a lot (see Deduplication 201, for example).
- The tool does a relatively poor job of accounting for archival images and their impact on deduplication. It is usefully instructive, but poor as a predictive model, in this respect.
- The tool does not account for data growth over time.
- The tool does not account for TSM file backups with the progressive incremental methodology. Of course, you can model other types of TSM backup. For TSM progressive incremental backup pools, estimate 4:1 deduplication.
- The tool does not account for spare, scratch, overhead, or replication capacity (all of which would presumably be required on a real device).
- All of these things can be modeled much more accurately. The problem is that the complexity of the model goes up with the square of the accuracy, so it would take a lot more effort to get only a little more accurate. Many others at EMC and I can do this with much more precision, but a blog is not the right forum! Further, as complexity goes up, the ability of the model to be instructive goes down. So it is what it is. For now, it is the right balance of accuracy and instructional utility.
So, is it accurate? Sort of.
Is it instructive? I hope so. It should be.
But, if you need to know the real answer, the actual deduplication ratio, then you need to talk to me or one of my counterparts at EMC. We do have the tools to give you a precise answer to the question: what deduplication ratio will I get with my data, in my environment, with my retention policies?
I also hope, however, that you can appreciate after looking at the tool and reading Deduplication 101 and 201 that for any organization to advertise something like "our box will get 25:1 deduplication" is a little disingenuous. You need to know a lot more about the backup environment, the data, and the policies before you can give an informed answer about the amount of deduplication anybody will get. And anybody who tells you different is probably just as comfortable selling you a bridge in New York or marshy land in Florida as they are a deduplication solution for backup.