After posting the deduplication calculator last time, I got a few questions about source versus target deduplication, and whether the calculator was good for source deduplication scenarios.
The short answer is that it is not.
It is a reasonable tool for estimating target deduplication (with all the caveats that I included last time) but doesn't tackle source deduplication.
So I fixed it.
The tool now deals with source and target deduplication. All the same caveats apply: the tool is primarily useful for instructive purposes, not scientifically accurate sizing of deduplication appliances. It will be more right than wrong, but it may not be right enough to stake your precious budget allocation on. Or your job!
The newest version of the deduplication calculator is here: Download dedupcalcv2.xls
Two things in particular are substantially different when it comes to source deduplication:
- The deduplication ratio goes up. Significantly. There are two primary reasons for this. The first is that source deduplication is more efficient than target deduplication. Because it runs closer to the data, it can better understand that data and break it up into segments more intelligently, which improves the deduplication ratio. This is particularly true for VMware and Windows file servers. The second is that the places where source deduplication is most often employed--VMware and Windows file servers--have a higher degree of inter-server or inter-machine commonality than generic servers. Simply put, I am more likely to find commonality between two VM images than I am between an Exchange server and an SAP server.
- You can save substantially on bandwidth from the source. The calculator takes this into account, and offers an estimated bandwidth usage for replication. Like the deduplication component, the emphasis here is on estimated.
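To make the bandwidth point concrete, here is a minimal sketch of the arithmetic. It is not the formula the spreadsheet uses; the function names, the single flat deduplication ratio, and the example numbers are all assumptions for illustration. The idea is simply that with source deduplication only the unique data leaves the client, while with target deduplication the full backup crosses the network before it is reduced.

```python
def estimate_transfer_gb(backup_gb: float, dedup_ratio: float, source_side: bool) -> float:
    """Estimate the GB that leave the client for one backup job.

    Assumes a single flat deduplication ratio (e.g. 20.0 for 20:1), which is
    a simplification; real ratios vary by data type and over time.
    """
    if backup_gb < 0:
        raise ValueError("backup_gb must be non-negative")
    if dedup_ratio < 1.0:
        raise ValueError("dedup_ratio must be at least 1.0 (1:1 means no savings)")
    # Source deduplication reduces the data before it is sent;
    # target deduplication reduces it only after it arrives.
    return backup_gb / dedup_ratio if source_side else backup_gb


def bandwidth_saved_pct(backup_gb: float, dedup_ratio: float) -> float:
    """Percentage of client-side traffic avoided by deduplicating at the source."""
    if backup_gb == 0:
        return 0.0
    sent = estimate_transfer_gb(backup_gb, dedup_ratio, source_side=True)
    return 100.0 * (backup_gb - sent) / backup_gb


if __name__ == "__main__":
    # Hypothetical example: a 1,000 GB backup at an assumed 20:1 ratio.
    print(estimate_transfer_gb(1000, 20, source_side=True))
    print(estimate_transfer_gb(1000, 20, source_side=False))
    print(bandwidth_saved_pct(1000, 20))
```

As with the calculator itself, treat the output as an estimate for comparison purposes, not a sizing guarantee.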
With that said, the calculator is useful. In particular, it enables us to quickly and easily see the likely differences in deduplication ratios between source and target deduplication, and see how much more efficient source deduplication can be.
It is also worth noting that most organizations, particularly larger ones, will likely end up with a mixture of source and target deduplication: source deduplication where bandwidth is a concern, such as remote backup and VMware, and target deduplication where there is plenty of bandwidth available to get backup jobs from the client to the destination storage. There are other factors that may push us in one direction or the other--source or target--but the availability of bandwidth is certainly the biggest single factor.
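One rough way to think about the bandwidth question is to ask whether the undeduplicated backup would fit through the link inside the backup window. The sketch below is only a rule of thumb under stated assumptions: the function names, the decimal unit conversion, and the "fits in the window" test are mine, not part of the calculator, and it ignores the other factors mentioned above.

```python
def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to move data_gb over a link of link_mbps megabits per second.

    Uses decimal units (1 GB = 8,000 megabits) and assumes the link is fully
    available to the backup job, which is optimistic.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    seconds = data_gb * 8000 / link_mbps
    return seconds / 3600


def suggest_dedup(data_gb: float, link_mbps: float, window_hours: float) -> str:
    """Hypothetical heuristic: prefer target deduplication when the full,
    undeduplicated backup fits in the window; otherwise prefer source."""
    if transfer_hours(data_gb, link_mbps) <= window_hours:
        return "target"
    return "source"


if __name__ == "__main__":
    # Hypothetical example: 450 GB nightly backup, 8-hour window.
    print(suggest_dedup(450, 1000, 8))  # fast local link
    print(suggest_dedup(450, 10, 8))    # slow remote link
```

In practice the answer will rarely be this clean, but the comparison shows why remote sites tend to land on the source side.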