After my post two weeks ago, I wanted to come back to this topic and revisit some of the issues, largely thanks to a couple of very insightful comments from Curtis Preston. He made some good suggestions with respect to replicating deduplicated data sets.
A brief recap of my prior post is this: how do you get the backup application to recognize the deduplicated data set at both the source and the target in such a way that you can use the data at both sites (for recoveries, data portability, data refreshes, etc.)? It isn't easy! I pointed out a couple of work-arounds, and Curtis added two. Because I think they are important, I am going to quote Curtis wholesale here.
The first alternative is as follows:
My first kludge is a modification to yours. Instead of copying one set of virtual tapes to another set, I’d suggest using the inline tape copy functionality of NetBackup, CommVault, and Backup Express. My experience has been that it slows down the incoming backup by 10-15%, but when you’re done, you’ve got two copies. Then you can replicate one and not the other. It’s the same as your idea, but using inline tape copy instead of a regular copy. (And, remember since it’s deduped, two copies takes up the same space as one copy.)
So this is true. I will say that Curtis' experience is a bit better than mine, in that I have seen a bigger degradation in the speed of the incoming backup--sometimes as large as 50%. But there are a lot of factors at work here: network speeds and backup server speeds start to matter. And Curtis is also right in saying that this takes no additional capacity: the two copies are exactly the same data, and therefore the second copy gets completely deduplicated.
However, there is one gotcha, and it is a big one: both incoming copies count toward the total available deduplication bandwidth (or capacity, if this is an out-of-band appliance). Meaning this: if my stream is 100 MB/s and I twin it, I end up with 200 MB/s worth of data being transmitted to the deduplication appliance, effectively doubling the load on the device. If the appliance can ingest data at 400 MB/s, my hypothetical 100 MB/s backup stream, once twinned, would actually consume half of the available ingest capacity. This may or may not matter to you, depending on how much of your backup window and how much of your appliance's total available throughput is currently being consumed. (And if this is an out-of-band appliance, you will double both the CPU load for deduplication and the temporary storage requirement for the night's backup--for a 10 TB backup I will need a 20 TB buffer to store the data prior to deduplication.)
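To make the arithmetic concrete, here is a minimal sketch using the hypothetical numbers from the paragraph above (a 100 MB/s stream, a 400 MB/s appliance, a 10 TB nightly backup)--nothing below is a vendor specification:

```python
# Back-of-the-envelope math for the inline-copy gotcha described above.
# All figures are the hypothetical ones from this post, not vendor specs.

stream_mb_s = 100            # incoming backup stream
copies = 2                   # inline tape copy = two identical streams
appliance_ingest_mb_s = 400  # rated ingest of the dedupe appliance

effective_load = stream_mb_s * copies                   # 200 MB/s hits the appliance
fraction_used = effective_load / appliance_ingest_mb_s  # 0.50

print(f"Effective ingest load: {effective_load} MB/s "
      f"({fraction_used:.0%} of appliance capacity)")

# Out-of-band appliance: the night's backups land on staging disk first,
# so the temporary buffer doubles as well (10 TB backed up -> 20 TB buffer).
backup_tb = 10
print(f"Staging buffer needed: {backup_tb * copies} TB")
```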
And one more thing: this is a really good example of why the backup application needs to relinquish some control to the appliance. It is a "killer application": allowing the appliance to replicate an object logically, without actually requiring two writes and two sets of CPU operations, would enable enormous flexibility at the application level. Imagine what I could do if my backup application could just say: make me another copy of that, and assign it a different object ID and/or retention period. Not having to run that I/O through the application is hugely important.
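To illustrate what I mean--and this is purely imaginary, every name below is invented and no appliance exposes exactly this today--the request from the backup application might look something like:

```python
# Purely hypothetical sketch of an appliance-side "logical copy" operation.
# Nothing here corresponds to a real vendor API; all names are made up.

import uuid
from dataclasses import dataclass
from typing import List

@dataclass
class BackupObject:
    object_id: str
    retention_days: int
    blocks: List[str]   # references to deduplicated blocks, not the data itself

def logical_copy(source: BackupObject, retention_days: int) -> BackupObject:
    """Ask the appliance for a second copy without re-reading or re-writing
    any data: the new object simply points at the same deduplicated blocks."""
    return BackupObject(
        object_id=str(uuid.uuid4()),    # new identity for the backup catalog
        retention_days=retention_days,  # independent retention period
        blocks=source.blocks,           # zero additional I/O or capacity
    )

nightly = BackupObject("backup-2008-11-05", retention_days=30, blocks=["b1", "b2"])
offsite = logical_copy(nightly, retention_days=365)
print(offsite.object_id, offsite.retention_days)
```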
The second alternative Curtis raises is this:
Second, there’s another kludge that I use if we’re talking NFS/CIFS-based systems such as Data Domain, Quantum, NEC, or EMC. There’s usually no problem with having another media server copying backups from the same disk, right? (Such as two systems backing up to a single NFS mount point.) So, let’s make the media server that’s backing up to NFS mount A and the media server that’s going to copy backups from NFS mount B (which is a replicated copy of NFS mount A) THINK they’re accessing the same system. All you do is mount them as the same name (with some backup software you may need to fake out the local hosts file if the software actually notices the NFS server name), and the media server accessing the replicated backup thinks its accessing the same backups that the original media server wrote. The only problem here is that you can’t start the copy until the replication is finished. So you “just” need a script to coordinate that.
Again, all true. In fact I like this idea a lot. And I think Curtis recognizes some of the challenges in scripting this (it is not necessarily a trivial script). Setting that aside, the reason I didn't advocate this in the original post is simple: it scares me. There is something about having two devices on the same network with the same name--and having each device accessed by a common application--with only a tricked-out local hosts file to prevent potential issues around open files and locking. That is to say, the only thing that prevents the one application on two different hosts from both having simultaneous read/write access is the local hosts file. This scares me a little. Maybe it shouldn't. But in my opinion there is a small amount of risk here. It is risk that can be mitigated with documentation and procedure (so that, for example, if your regular backup administrator is away and a media server has issues, a systems administrator doesn't "fix" that local hosts file). So weigh that risk for yourself.
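For what it is worth, the coordination Curtis mentions might be sketched roughly like this. Everything here is a placeholder: the replication-status check (a pretend marker file) and the copy command depend entirely on your appliance and your backup software, so treat it as pseudocode with imports rather than something to run as-is.

```python
# Rough sketch of the "wait for replication, then copy" coordination script.
# The marker file and the copy command are invented placeholders; every
# appliance and backup product signals completion and runs copies differently.

import os
import subprocess
import time

REPLICA_MOUNT = "/backup/nfs_mount_B"   # replicated copy of NFS mount A
MARKER = os.path.join(REPLICA_MOUNT, ".replication_complete")  # pretend signal
POLL_SECONDS = 300

def replication_finished() -> bool:
    """Illustration only: pretend the appliance drops a marker file when the
    nightly replication completes. Swap in your appliance's real status check
    (CLI, SNMP, REST, log scrape) here."""
    return os.path.exists(MARKER)

def main() -> None:
    # Don't let media server B touch the replica until it is quiesced.
    while not replication_finished():
        time.sleep(POLL_SECONDS)

    # Placeholder for the copy/duplication job on media server B; substitute
    # whatever command your backup software actually uses.
    subprocess.run(["/opt/backup/bin/start_copy_job", REPLICA_MOUNT], check=True)

if __name__ == "__main__":
    main()
```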
Two more thoughts that I would like to conclude with:
- There is no good way to fix this problem with TSM. TSM does not support the notion of twinned writes, and all copies of data have to go through the TSM server (there is no real notion of a media server, and no analog to a storage node in TSM). If anybody cares to suggest one, I am all ears.
- Making use of your target replicated backup depends on having access to the backup database. In normal operations this may not be a big deal (as long as the two sites share a single backup zone and master server). In a disaster, it is a much bigger deal. One possible solution that appeals to me here is running your backup master within a VM. This works for NetWorker, and it works for NetBackup (it is one of several solutions, according to them). I like the approach a lot. It also *might* offer a solution for TSM users. If there is any demand for it, I will explore this topic in a future post.
If others have more to add to this discussion, please comment below. As I said before, I think this is a pressing issue for all backup applications. And I think that as deduplicated targets for backup become the standard (and tape becomes a long term archival repository only) this issue will become more pressing for more users.
Hey, look pal. ;) If you’re going to use my name, at least put the “W.” in there. It’s all about the brand, baby! Don’t make me start calling you “Cott.”
Just a few thoughts:
“both incoming copies count toward the total available deduplication bandwidth”
That’s a very good point that I hadn’t thought about, but the original idea also takes up bandwidth, although not dedupe bandwidth in your architecture. (The read operation required by the copy would create I/O, but in your architecture it would typically be doing a read from an original, non-truncated copy. But it’s I/O nonetheless.)
“this is a really good example of why the backup application needs to relinquish some control to the appliance”
Agreed, which is why I was so hard on you for being so hard on Symantec, the first backup software ISV to do something like that.
“There is something about having two devices on the same network with the same name--and having each device accessed by a common application--with only a tricked-out local hosts file to prevent potential issues around open files and locking.”
You’re absolutely right. It’s like democracy. It’s the worst form of government, but it beats all the rest. (I forgot who said that first; it definitely wasn’t me.)
“TSM does not support the notion of twinned writes”
I actually believe they do, but very few people use it. One of the biggest concerns of TSM users during backups is the number of mount requests they’ll get for tape drives; using tape twinning would actually exacerbate that problem.
One other possible solution to the problem is to replicate all backups from a device connected to backup server A to a device connected to backup server B, and have backup server B inventory and scan the contents of the tapes as they show up on the other side. Not only is this possible, it’s actually been automated and is now shipping in a supported solution from Overland. It currently supports only Backup Exec, but the architecture will support any product that allows you to scan the contents of its tapes--which is just about any product except TSM. You cannot scan tape contents back in with TSM like you can with other backup products. No catalog, no restore.
“One possible solution that appeals to me here is running your backup master within a VM. This works for NetWorker, and it works for NetBackup (it is one of several solutions according to them)”
I don’t know who you’re talking to at Symantec, but that is an unsupported option. I’ve done it. It totally “works,” but performance is abysmal. I’ve actually VM’d a lot of backup servers for testing purposes, and my experience has been the same regardless of which backup software we’re talking about.
Posted by: W. Curtis Preston | November 06, 2008 at 11:45 AM
So the problem is not the dedupe and replication appliance but the interface to the backup software. Why not test CommVault? First, configure the dedupe appliance as a NAS or CIFS share and NOT as a VTL. To do a restore from the appliance at the DR site (or from tape), just change the restore source to 2 or 3. In case of a DR (or just power off the backup server and appliance at the source site to test), CommVault automatically goes to the DR-site appliance. Good luck,
Posted by: Ernie Denzer | January 03, 2009 at 12:06 PM