Curtis Preston raised an interesting point over at Backup Central regarding the need for a common set of standards for the interaction of deduplication appliances and backup applications.
Now, on one hand I agree with Curtis' basic premise (even if I do slightly disagree with some of his historical characterizations). In fact, I have twice made pleas on this blog to similar effect, here and here. Having said that, I will readily admit that W. Curtis Preston is taking the request to the next level. Rather than just ask backup application vendors to implement the means to take advantage of the target-side replica of a deduplication appliance, he is asking the community to build a common set of APIs, an open standard, around this.
So, a couple of things. First: everything in this post is strictly my opinion. Nothing in here has been read, endorsed, or reviewed by anybody else at EMC. Having said that: Mr. Preston is right. This is needed, and needed badly. I have been saying it for a while, and his request is the logical extension of the arguments I have made before. So rather than arguing with Curtis about history or debating how to do this, let me state clearly what I think such an effort should achieve, with the hope that this can constructively advance the conversation.
First, if I understand Curtis correctly, what he is asking for is an open standard, a set of APIs, that lets us do two things with deduplication appliances: one, make a tape copy of the data that sits on them; and two, replicate that data to a target location in such a way that the backup application is aware of the version of the data that sits on both the source and the target, and can utilize each. So here is what I think the open standard should accomplish, and some of the issues that it will have to deal with:
- Are we going to make this work for both virtual tape and NAS-type deduplication appliances? That is a non-trivial distinction, even though it probably shouldn't be.
- Who is going to be the brains of the operation? Is this something the backup application will merely moderate, but with the majority of the control exercised by the deduplication appliance? Or will the backup application take control, and give specific instructions to the deduplication appliance, which is otherwise limited in its role? For what it is worth, I strongly favour the backup application doing the "heavy lifting" for the simple reason that deduplication appliances need all the CPU cycles they can get to do deduplication. Adding functionality at the expense of performance is not the way to go.
- Are we going to get deduplicated data to tape (locally or remotely)? Or will we "only" be able to move data in a fully hydrated state, effectively re-hydrating it back to its full size before writing it to tape? Only one vendor has accomplished the first so far, but that should change shortly. Stay tuned to these pages for more on that later.
- Are we going to be able to move the data in such a way that it is intelligible and useful to a separate server? In NetWorker, that would be a server in a different zone; for NetBackup, that would be a different Master server; and for TSM, it would be any other server (ahem). It would certainly be desirable if the data were accompanied by sufficient metadata to be operationally useful to a server other than the one responsible for the original backup. In my opinion, the fact that this data is likely to have a very long retention period only further emphasizes the importance of having a metadata "wrapper".
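To make that last point concrete, here is a minimal sketch of what such a self-describing metadata "wrapper" might carry alongside a replica or tape copy. Every field name here is my own illustration, not any vendor's or standard's actual format; the point is simply that a different server, or a future version of the application, needs enough context to catalog and recover the data.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical metadata "wrapper" travelling with a replica or tape copy.
# All field names are illustrative, not any vendor's real schema.
@dataclass
class BackupCopyManifest:
    backup_app: str          # e.g. "NetWorker", "NetBackup", "TSM"
    app_version: str         # version of the app that wrote the backup
    catalog_format: str      # version of the metadata schema itself
    client_host: str         # host that was backed up
    dataset: str             # save set / policy / filespace identifier
    backup_time: str         # ISO 8601 timestamp of the original backup
    retention_until: str     # ISO 8601 retention expiry
    deduplicated: bool       # stored reduced, or fully hydrated?
    copy_type: str           # "replica" or "tape"

    def to_json(self) -> str:
        """Serialize so any server (or later version) can parse it."""
        return json.dumps(asdict(self), indent=2)

manifest = BackupCopyManifest(
    backup_app="ExampleBackupApp",
    app_version="7.6",
    catalog_format="1.0",
    client_host="fileserver01",
    dataset="/export/home",
    backup_time="2009-06-01T02:00:00Z",
    retention_until="2016-06-01T00:00:00Z",
    deduplicated=True,
    copy_type="replica",
)
print(manifest.to_json())
```

Note the `catalog_format` field: versioning the wrapper itself is what lets a much later release of the application still make sense of a copy written years earlier.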
At the end of the day, here is the ideal: we want to be able to back data up. We want to be able to have a deduplication appliance reduce the size of that backup. We want to be able to move the reduced data set to another location, via replication. We want to be able to move the reduced data set to tape. And we want sufficient metadata to accompany either move so that a different backup server (albeit the same application) can understand the contents of the tape and/or replica, and utilize it operationally for recovery or any other purpose. And we want the metadata to be sufficiently robust that a different version of the backup application can leverage the data for recovery. (Because how likely is it that five years later I am going to have the right version of the backup application around to read the tape or replica?)
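Sketched as code, the hypothetical standard might amount to a small interface like the one below, with the backup application staying in control (the "brains") and the appliance doing only the bookkeeping. Every class and method name here is an assumption of mine for illustration, not a proposed or existing API.

```python
from abc import ABC, abstractmethod

class DedupApplianceAPI(ABC):
    """Hypothetical open-standard surface a deduplication appliance
    would expose to a backup application. All names illustrative."""

    @abstractmethod
    def replicate(self, image_id: str, target: str) -> str:
        """Replicate a backup image, in deduplicated form, to a remote
        appliance; return a handle the backup application catalogs."""

    @abstractmethod
    def copy_to_tape(self, image_id: str, hydrate: bool) -> str:
        """Write a backup image to tape, deduplicated or fully
        re-hydrated, and return a handle for the catalog."""


class InMemoryAppliance(DedupApplianceAPI):
    """Toy stand-in: the appliance only records what it was told to do,
    leaving its CPU cycles free for deduplication itself."""

    def __init__(self) -> None:
        self.copies: list = []

    def replicate(self, image_id: str, target: str) -> str:
        copy_id = f"{image_id}@{target}"
        self.copies.append(("replica", copy_id))
        return copy_id

    def copy_to_tape(self, image_id: str, hydrate: bool) -> str:
        form = "hydrated" if hydrate else "dedup"
        copy_id = f"{image_id}:tape:{form}"
        self.copies.append(("tape", copy_id))
        return copy_id


# The backup application drives both moves and records every resulting
# copy in its own catalog, so source, replica, and tape are all usable.
appliance = InMemoryAppliance()
catalog = {
    "replica": appliance.replicate("saveset-001", "dr-site"),
    "tape": appliance.copy_to_tape("saveset-001", hydrate=False),
}
print(catalog)
```

The design choice worth noticing is that the appliance never decides anything: it executes simple instructions and returns handles, which keeps the intelligence, and the catalog of every copy, in the backup application where it belongs.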
So how about it? Anything else you would add to this list? Who do you think should "own" this process? Which vendors need to participate in order for it to be meaningful?