Curtis Preston penned something last week asking if global deduplication matters. I am not sure if Curtis would count me in the group that questions global deduplication's reality, or not...
Of course I believe it is real!
The real issues are actually two-fold: does it matter, and if so, how much?
Mr. Preston claims that it only matters if you back up more than 10-20 TB. Well, that is a pretty broad range. I would actually amend this to a more specific number: for a single EMC DL3000 that is 18 TB (12 hours of deduplication), and for a DL4000 with deduplication enabled it is 30 TB. Why the larger number? Because with deferred deduplication and faster ingest speeds, it is perfectly legitimate to deduplicate throughout the day. For the sake of operational realities, I am assuming you are deduplicating for 20 hours with this number. That still leaves ample time for other niceties, like restore, for example.
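To make the daily-volume figures easier to check, here is a minimal sketch of the arithmetic behind them. It uses only the numbers quoted above (18 TB in 12 hours, 30 TB in 20 hours). The function name and the idea of an "implied rate" are illustrative assumptions, not published EMC specifications. Both figures work out to the same sustained rate of 1.5 TB per hour, so the larger DL4000 number comes from the longer deduplication window.

```python
def implied_rate_tb_per_hour(daily_tb, window_hours):
    """Sustained deduplication rate implied by a daily volume and a time window.

    Hypothetical helper: it simply divides the quoted volume by the quoted window.
    """
    return daily_tb / window_hours


# Figures quoted in the post: (daily TB, hours spent deduplicating).
quoted = {"DL3000": (18, 12), "DL4000": (30, 20)}

rates = {
    model: implied_rate_tb_per_hour(tb, hours)
    for model, (tb, hours) in quoted.items()
}

for model, rate in rates.items():
    print(f"{model}: {rate} TB/hour implied")
```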
But global deduplication is also a matter of capacity.
So let's follow this logic through just a little bit, making an assumption of 20:1 deduplication. Both the DL3000 and the DL4000 from EMC have 148 TB of usable deduplication storage. At 20:1, that would be roughly 3 PB of backup data. 3 PB of backup data is roughly 7,500 LTO3 tapes (no compression) or 3,750 at 2:1 compression.
So, more accurately, not having global dedup is like not having a tape library that scales past 3,750 cartridges, which describes pretty much everybody in practice. Yes, some very big SL8500s and IBM libraries can scale that far, but in practice they tend not to because of floor space restrictions. Instead, customers opt for multiple libraries, as these don't require contiguous rack space (we are talking about 30-40' or more of contiguous rack space!).
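The tape comparison can be sketched as a short calculation. This assumes LTO3's native capacity of 400 GB per cartridge and decimal units (1 TB = 1000 GB); the helper names are hypothetical. Note that 148 TB at 20:1 is exactly 2,960 TB, which the text rounds to 3 PB before counting tapes, so the sketch prints both the exact and the rounded case.

```python
LTO3_NATIVE_GB = 400  # native LTO3 cartridge capacity; decimal units assumed


def logical_tb(usable_tb, dedup_ratio):
    """Backup data (TB) that fits in usable_tb of disk at a given dedup ratio."""
    return usable_tb * dedup_ratio


def lto3_tapes(data_tb, compression=1):
    """Whole LTO3 cartridges needed for data_tb at an integer compression ratio."""
    data_gb = data_tb * 1000
    per_tape_gb = LTO3_NATIVE_GB * compression
    return -(-data_gb // per_tape_gb)  # ceiling division on integers


exact_tb = logical_tb(148, 20)  # 148 TB usable at 20:1
rounded_tb = 3000               # the "roughly 3 PB" used in the text

for label, tb in (("exact", exact_tb), ("rounded", rounded_tb)):
    print(label, tb, "TB:", lto3_tapes(tb), "tapes native,",
          lto3_tapes(tb, 2), "tapes at 2:1")
```

Using integer gigabytes rather than fractional terabytes keeps the cartridge count free of floating-point rounding surprises.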
So, does global dedup matter?
The answer is: if you back up more than 30 TB per day, and if you want to retain more than 3 PB of backup data on disk, it may matter. If you back up less than 30 TB per day, and want to retain less than 3 PB of backup data on disk, it doesn't matter at all. Not within the EMC product range, anyway (clearly, these numbers will differ for different vendor implementations of deduplication, based on their ingest speeds and total storage capacity).
But let's extend our thinking a little further.
In internal testing, EMC has seen approximately a 10% deduplication benefit from commonality between data sets from different backup clients. In other words, 90% of the capacity savings of deduplication comes from repeatedly backing up data from the same client.
With that in mind, we could say that without global deduplication, we would get 18:1 rather than 20:1 as we assumed above. Making that assumption, let's look at a DL4406 with deduplication. A DL4406 has a single user interface, a single set of virtual resources, and a single policy engine. However, it can have up to two deduplication engines. That would give us a net capacity of 296 TB of usable deduplicated disk, which, at 18:1, is equivalent to 5.3 PB.
Again, compare that to LTO3 tapes: that is 13,320 tapes uncompressed, or 6,660 at 2:1 compression.
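The two-engine case follows the same arithmetic and can be sketched as below. The assumptions are a 10% reduction applied to the 20:1 ratio, two engines of 148 TB each, and 400 GB native LTO3 cartridges in decimal units; all function names are hypothetical. Worked through, 296 TB at 18:1 is 5,328 TB, which is 13,320 cartridges uncompressed and half that, 6,660, at 2:1.

```python
LTO3_NATIVE_GB = 400  # native LTO3 cartridge capacity; decimal units assumed


def ratio_without_global(global_ratio, cross_client_pct):
    """Dedup ratio left after removing the cross-client share (integer percent)."""
    return global_ratio * (100 - cross_client_pct) // 100


def lto3_tapes(data_tb, compression=1):
    """Whole LTO3 cartridges needed for data_tb at an integer compression ratio."""
    return -(-(data_tb * 1000) // (LTO3_NATIVE_GB * compression))


engines = 2                 # maximum quoted for the DL4406
usable_per_engine_tb = 148  # usable deduplication storage per engine

local_ratio = ratio_without_global(20, 10)           # 18
usable_tb = engines * usable_per_engine_tb           # 296
retained_tb = usable_tb * local_ratio                # 5328

print(f"ratio {local_ratio}:1, {usable_tb} TB usable, {retained_tb} TB retained")
print(lto3_tapes(retained_tb), "tapes native,",
      lto3_tapes(retained_tb, 2), "tapes at 2:1")
```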
So now we have our real answer. Does global deduplication matter? Yes, if you have more than 60 TB to back up each day, or care to retain more than 5.3 PB on disk.
If it does matter, how much does it matter?
Well, that is subjective. I would suggest that it is one of many criteria that matter, including cost, manageability, scalability, compatibility with backup applications and source data types, ability to replicate deduplicated data, support for RAID6, policy control over the deduplication process and timing, and so on.
Honestly, where to rank it amongst all these other criteria, and any of your own you may want to include on that list, I will leave to your discretion.
And finally, a footnote: EMC does have a globally aware deduplication solution: Avamar. Mr. Preston doesn't mention it at all, but it should be noted that it is in our portfolio, and is an important, and rapidly growing, component of it.