Apparently the folks over at Sepaton noticed my comments about the deduplication approach of DeltaStor, and have responded with additional information.
First off: welcome to the blogosphere, JL. It is great to see Sepaton participate in the conversation. In the hope that you would rather engage in dialogue than devolve the discussion into a vitriolic and hyperbolic FUD-fest a la NetApp (and in the belief that that is probably what my readers would prefer, too), I am going to make an earnest attempt to keep this on the level and refer only to the specific claims I made, and the counter-claims. In doing so, we can hopefully see some interesting things about deduplication and return on investment along the way.
The Sepaton post is here for those that are interested.
The first claim JL makes is this:
Given that there are at least 5 or 6 common backup applications, each with at least 2 or 3 currently supported versions, and probably 10 or 15 applications, each with at least 2 or 3 currently supported versions, the number of combinations approaches a million pretty rapidly
This is a tremendous overstatement. HP’s (and SEPATON’s) implementation of DeltaStor is targeted at the enterprise. If you look at enterprise datacenters there is actually a very small number of backup applications in use. You will typically see NetBackup, TSM and much less frequently Legato; this narrows the scope of testing substantially.
This is not the case in small environments where you see many more applications such as BackupExec, ARCserve and others which is why HP is selling a backup application agnostic product for these environments. (SEPATON is focused on the enterprise and so our solution is not targeted here.)
OK, the text in italics is from my (original) post. The following two paragraphs are the response. And he is right. Sort of. What can I say, combinations and permutations were never my strong suit.
A more accurate number would be found by: (number of backup applications) x (number of currently supported versions of those applications) x (number of major applications) x (number of currently supported versions of those applications) x (number of backup agent options).
If I redo the math, we can see that: there are at least four backup applications that I regularly run into with my enterprise customers: TSM, Networker, NetBackup, and OmniBack. There are usually three supported versions of these at any one time. There are at least ten major applications and databases (Oracle, DB2, Exchange, Notes, SAP, SharePoint, Documentum, SQL, etc.). Each has, again, two or three currently supported versions. Finally, for every database backup, you typically have a choice as to whether to run native, through the backup agent, or in conjunction with a third party agent; to keep the math conservative, I will count just two of those options. (And we have excluded minor applications, databases, and backup applications--even though it would not surprise me to learn that this is 25% of the enterprise market.)
Therefore, the correct math is: 4 x 3 x 10 x 3 x 2 = 720.
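For the curious, here is a quick sketch of that multiplication in code; the counts are my own estimates from the paragraph above, not anyone's official support matrix:

```python
# Back-of-the-envelope support-matrix math; counts are my own estimates.
backup_apps = 4        # TSM, Networker, NetBackup, OmniBack
backup_versions = 3    # currently supported versions of each backup application
major_apps = 10        # Oracle, DB2, Exchange, Notes, SAP, SharePoint, Documentum, SQL, ...
app_versions = 3       # currently supported versions of each major application
agent_options = 2      # native dump vs. backup-application agent (a conservative count)

combinations = backup_apps * backup_versions * major_apps * app_versions * agent_options
print(combinations)    # 720 configurations to qualify and then keep current
```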
So, putting hyperbole aside, the support situation (and just as importantly, the mandate to test every one of those configurations) is a pretty heavy burden. JL also dismisses my argument that "there are many different versions of supported backup applications and many different application modules" as bogus, because while "This is true ... it is misleading. The power of backup applications is their ability to maintain tape compatibility across versions." That is only partially true: tape formats do change. More seriously (and I am trying to avoid FUD as hard as I can here...), I think that certain vendors have a vested interest in their approach to deduplication succeeding, and alternative approaches failing. Would those vendors deliberately change tape formats to ensure that? Well, who knows?
At the end of the day, I think you can fairly choose between a deduplication appliance that supports data from any source and solutions that only support data from some sources, some of the time. Everything else being equal, I think we would all choose the former. (And when I say any source, any time, that is more or less true, excluding only oddball solutions like iSeries--and before anybody gets agitated about that, I know there are hundreds of thousands of iSeries boxes out there; it is just that they rarely get protected by the same backup and restore application and infrastructure as "Open" systems!)
The second major claim JL makes is:
do you want to buy into an architecture that severely limits what you can and cannot deduplicate? Or do you want an architecture that can deduplicate anything?
I would suggest an alternative question: do you want a generic deduplication solution that supports all applications, reduces your performance by 90% (200 MB/sec vs 2200 MB/sec), and provides mediocre deduplication ratios, or do you want an enterprise focused solution that provides the fastest performance, most scalability, most granular deduplication ratios and is optimized for your backup application?
Again, italics are my original text, the second paragraph is JL's. So... two claims here really: one, general purpose deduplication is slow; two, general purpose deduplication delivers worse deduplication ratios. (There is a third too, that of scalability, but I dealt with that, at least in a tangential way, in my post: "How Big is Big Enough?")
Well, "slow" is a relative term. Let me just point out that EMC offers a VTL that deduplicates that can ingest data at 2,200 MB/s. Given that such a VTL would perform roughly as well as 50 LTO3 drives in the real world (because there are a lot of things which contribute to it being difficult to impossible to sustain the rated throughput of a tape drive over 8 hours consistently) I think it is tough to describe that as slow. But, again, it is a subjective term. So your mileage may vary!
With respect to deduplication ratios: this is interesting. I think the claim here is, more accurately: general purpose deduplication achieves lower deduplication ratios than targeted deduplication such as DeltaStor. I have two comments about such a claim:
- Show me the evidence! I have yet to see that targeted, application specific deduplication makes much of a difference. Sometimes it can, but as a general rule, it does not.
- More importantly, I don't think it matters. To make this specific, let's just assume that the claim is accurate. Let's assume that "general" deduplication like EMC's can achieve 25:1 on a given data set. Let's also assume that DeltaStor deduplication achieves 50:1. Twice as good! So how much storage does this save us? About 2 TB per 100 TB of data backed up. That's right, 2%. Two disk drives (using the 1 TB drive sizes currently in our deduplication products). Big deal. The more interesting question that falls out of this is: is it worth it? If I have to accept a limited support matrix, one that is difficult to maintain and keep up to date, would I accept that for a gain of a mere 2%? Finally, I should note that crediting targeted deduplication solutions like DeltaStor with a 2x deduplication advantage is being exceedingly charitable. It is far more likely that this technology will help you gain 10% on your deduplication ratios--but thinking of it in terms of the "worst" case is instructive in that it lets us see how inconsequential additional gains to deduplication ratios are after you get past 20:1 or so (see the sketch below).
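And here is the sketch: stored capacity for 100 TB of backup data at a range of deduplication ratios (the 25:1 and 50:1 figures are the hypothetical ones used above), which shows how quickly the returns diminish once you get past 20:1 or so:

```python
# Stored capacity for 100 TB of backups at various deduplication ratios.
# The ratios are hypothetical, chosen to match the example above.
backed_up_tb = 100
for ratio in (10, 20, 25, 50, 100):
    stored_tb = backed_up_tb / ratio
    print(f"{ratio}:1 -> {stored_tb:.1f} TB stored")

# 25:1 stores 4 TB and 50:1 stores 2 TB: doubling the ratio saves only
# 2 TB, i.e. 2% of the 100 TB backed up.
```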
My conclusion, and, incidentally, the conclusion that EMC engineering has come to over the last several years of dialogue, is that it is not worth it. We are willing to make a small sacrifice in terms of deduplication ratios, in the rare cases where this is actually true, to achieve the benefit of a general purpose deduplication device that can deduplicate any data, from any source, from any backup application, and do it in-line or post-process.