EMC first introduced our DL3D data deduplication appliances approximately one month ago. Almost immediately after the introduction, I made a claim (probably outrageous to some people's way of thinking) that EMC had swiftly and decisively surpassed the competition. My exact words were:
We have the most flexible, scaleable, highest performing, and most reliable general purpose deduplication products available today. And you can take all of those literally: there is no prevarication, no misrepresentation, no subjectivity in any one of those.
At the time I wrote them, they were true.
Fortunately, they still are!
I think the how and the why of the truth here is pretty neat stuff.
However, I have received some questions about this claim. Some friendly. Some less so. In particular, certain competitors have tried to portray these claims as less than wholly accurate. Fair enough, that is their job.
(By way of a fairly important aside, however: I would like to see the portrait our competitors paint as at least factually accurate. For example, if someone says that the DL3D products are not the fastest general purpose deduplication products available, I would hope they could say that it is because their product is x percent faster than the DL3D; not that they would say that our DL3D is y percent slower than we say it is, unless they have conclusive, impartial evidence to back that claim. So far the only instances I have are of the latter approach, not the former. Sigh.)
But let's take this opportunity to set the record straight. Let's discuss just how fast the DL3D is, under what conditions, and why there is no one single answer.
And yes I can see the folks over at Data Domain grinning right now. "No one single answer! I knew it! They are only faster when the Sun and the Moon are in perfect conjunction in Mars and the backup administrator carries a rabbit's foot!" Just kidding. No, even though there is no one single answer to "how fast are you?" the fact is that even the worst case number for the DL3D is equal to or better than competitive offerings.
But the fact that there are multiple answers arises from how we do deduplication. And I need to discuss that, in order for the performance explanation to make sense. So, for those of you that just want to know how fast is fast, you can either wait for the next post, or go riding with Valentino Rossi at Mugello (7 in a row?!?).
For those of you still with me, I am going to make a pretty huge generalization: most, if not all, other deduplication appliances available today offer you one of two choices on when to deduplicate. They are either in-line or out-of-band. In-line appliances do deduplication as the backup data stream is sent to the appliance. Out-of-band appliances wait until after the data is written. Normally this means that the backup is finished, and then deduplication begins.
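To make the in-line versus out-of-band distinction concrete, here is a toy sketch. It is emphatically not how the DL3D (or anyone's appliance) is implemented: the fixed 4-byte chunks, the SHA-256 fingerprints, and the `ToyStore` class are all assumptions made up for illustration. The only point is the timing: in-line deduplicates on the way in, out-of-band lands the data in native format and deduplicates later.

```python
import hashlib

CHUNK_SIZE = 4  # absurdly small, purely so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut the backup stream into fixed-size chunks (an assumption of this toy)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ToyStore:
    """Hypothetical store illustrating when deduplication happens."""

    def __init__(self):
        self.unique = {}    # fingerprint -> chunk bytes, stored once
        self.recipes = []   # one list of fingerprints per deduplicated backup
        self.native = []    # backups stored as written, not yet deduplicated

    def _deduplicate(self, data: bytes) -> None:
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.unique.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        self.recipes.append(recipe)

    def write_inline(self, data: bytes) -> None:
        """In-line: deduplicate as the stream arrives."""
        self._deduplicate(data)

    def write_native(self, data: bytes) -> None:
        """Out-of-band, step 1: land the data in its native format."""
        self.native.append(data)

    def run_out_of_band(self) -> None:
        """Out-of-band, step 2: deduplicate everything that was landed earlier."""
        while self.native:
            self._deduplicate(self.native.pop(0))

    def bytes_used(self) -> int:
        stored_unique = sum(len(c) for c in self.unique.values())
        stored_native = sum(len(b) for b in self.native)
        return stored_unique + stored_native


backup = b"AAAABBBBAAAA"  # 12 bytes, with one repeated chunk

inline = ToyStore()
inline.write_inline(backup)
print(inline.bytes_used())       # 8: space is saved immediately

deferred = ToyStore()
deferred.write_native(backup)
print(deferred.bytes_used())     # 12: full native size until dedup runs
deferred.run_out_of_band()
print(deferred.bytes_used())     # 8: same end state, reached later
```

Both paths end up consuming the same space; what differs is when the work is done and how long the native copy sits on disk.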
At EMC we took a different approach with the DL3D. We realized that different people have different priorities. So we decided to give you a choice. The DL3D lets you choose when to do deduplication: in-line, out-of-band, or never.
What is more, you don't have to make just one choice. You can choose which approach you want per VTL or per file share--flexible infrastructure indeed!
In our example above, the backup server in group A has a policy selected that ensures the DL3D will never deduplicate the data. (Caveat below.) Data will be written in, and stored in, its native format. It will not be deduplicated. Naturally this results in no capacity savings!
Backup server group B has scheduled deduplication set; deduplication happens out-of-band. This means that data will get written in the native format, and stored in that native format for some period of time. At some set point in the day, the appliance will examine this data, perform deduplication operations, and reduce the capacity consumed by the backup data by the appropriate amount.
And backup server group C has in-line deduplication turned on. This means that data is deduplicated as it is written. Data doesn't ever sit on the DL3D consuming all the capacity of the native backup: space is immediately saved.
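If it helps to see the three groups side by side, the per-target choice can be pictured as a simple lookup. This is only a sketch of the idea: the target names, the `DedupPolicy` enum, and both functions are hypothetical, and nothing here reflects the DL3D's actual configuration interface.

```python
from enum import Enum


class DedupPolicy(Enum):
    NEVER = "never"          # group A: store native, no capacity savings
    SCHEDULED = "scheduled"  # group B: store native now, deduplicate out-of-band
    INLINE = "inline"        # group C: deduplicate as the data is written


# Hypothetical targets: one policy per VTL or per file share.
TARGET_POLICY = {
    "vtl_group_a": DedupPolicy.NEVER,
    "vtl_group_b": DedupPolicy.SCHEDULED,
    "share_group_c": DedupPolicy.INLINE,
}


def handle_write(target: str) -> str:
    """Describe what happens to incoming backup data for a given target."""
    if TARGET_POLICY[target] is DedupPolicy.INLINE:
        return "deduplicate, then store"
    return "store in native format"


def targets_for_scheduled_run() -> list:
    """Targets whose stored data is examined at the set point in the day."""
    return [t for t, p in TARGET_POLICY.items() if p is DedupPolicy.SCHEDULED]


for name in TARGET_POLICY:
    print(name, "->", handle_write(name))
print("scheduled run covers:", targets_for_scheduled_run())
```

Groups A and B look identical at write time; they only diverge when the scheduled run comes around.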
Naturally, there are different performance characteristics for each of these approaches. And now that we have understood the alternatives, next time out we can get into what the performance characteristics are, specifically, of the different choices. Stay tuned, and don't touch that dial!
Footnote: I said above that there is a caveat to the statement that when deduplication is turned off, the DL3D will never deduplicate the data. And here it is: that is actually not entirely true. At any time, should the available capacity of the device fall below 30%, data will get deduplicated. Irrespective of when that data was written, or if it was written with a policy of "never deduplicate" or "scheduled deduplication." The simple logic here is that running out of space on your backup appliance is a bad thing. A really bad thing. So this simple measure reduces, as much as is possible, the likelihood of that happening.
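For the programmers in the audience, the caveat boils down to one extra check that wins over whatever policy was selected. The sketch below is my own illustration of that rule, not product code: the function names, the string policy labels, and the idea of passing in a "scheduled window" flag are all assumptions. The one number taken from the text is the 30% threshold.

```python
LOW_SPACE_THRESHOLD = 0.30  # "should the available capacity ... fall below 30%"


def low_space_override(free_bytes: int, total_bytes: int) -> bool:
    """True when available capacity has fallen below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < LOW_SPACE_THRESHOLD


def should_dedupe_stored_data(policy: str, in_scheduled_window: bool,
                              free_bytes: int, total_bytes: int) -> bool:
    """Decide whether data already sitting in native format gets deduplicated.

    `policy` is a hypothetical label: "never" or "scheduled".
    """
    if low_space_override(free_bytes, total_bytes):
        return True  # running out of space trumps the selected policy
    return policy == "scheduled" and in_scheduled_window


# Plenty of space: "never" really does mean never.
print(should_dedupe_stored_data("never", False, 80, 100))      # False
# Available capacity at 29%: deduplicated regardless of policy.
print(should_dedupe_stored_data("never", False, 29, 100))      # True
# Scheduled policy outside its window, but space is low.
print(should_dedupe_stored_data("scheduled", False, 10, 100))  # True
```

Note that the comparison is strict: at exactly 30% available, this sketch leaves the data alone, which is how I read "fall below."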