
March 19, 2009

Comments


NK

I don't want to speak for Curtis, but I think what he's getting at is the way in which the DL scales in terms of ingest rate.

The DL4000 can ingest about 1.5TB/hr, which equates to about 30TB a day, but you also have to consider the other work it has to do - like getting the data offsite and internally reclaiming storage capacity (data that has expired and no longer has any pointers associated with it). Other backup solutions, like TSM, also need to find time to reclaim expired data.
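
As a quick sanity check on those numbers (a sketch in Python; the 1.5TB/hr figure is from above, and the 4-hour housekeeping allowance is purely my illustrative assumption):

```python
# Back-of-envelope ingest math. Only the 1.5 TB/hr rate comes from the
# comment above; the housekeeping split is a hypothetical assumption.
ingest_tb_per_hr = 1.5
hours_in_day = 24
housekeeping_hours = 4  # hypothetical: offsite copy + space reclamation

max_daily_tb = ingest_tb_per_hr * hours_in_day                           # 36 TB
usable_daily_tb = ingest_tb_per_hr * (hours_in_day - housekeeping_hours) # 30 TB

print(f"Theoretical 24x7 ingest: {max_daily_tb:.0f} TB/day")
print(f"With {housekeeping_hours}h of housekeeping: {usable_daily_tb:.0f} TB/day")
```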

I actually think that the main benefit de-dupe provides is the ability to shift to a completely tapeless solution using electronic vaulting. The biggest impediment to doing so today is the huge bandwidth cost associated with vaulting backup data, especially when distance is involved (big pipe + long distance = $$$). If, however, you dedupe the data before you vault it, then bandwidth costs should come down significantly, and that makes dedupe very attractive to the many people who want to get away from tape completely.
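
To put rough numbers on the bandwidth savings (all figures here are hypothetical; real dedupe ratios vary widely by data type and retention):

```python
# Sketch: sustained WAN bandwidth needed to vault one day's backups in
# 24 hours, with and without dedupe. All figures are hypothetical.
backup_tb_per_day = 10
dedupe_ratio = 10            # assume 10:1 reduction after the first full
seconds_per_day = 86_400

raw_gbps = backup_tb_per_day * 8_000 / seconds_per_day   # Gbit/s, decimal units
deduped_gbps = raw_gbps / dedupe_ratio

print(f"Without dedupe: ~{raw_gbps:.2f} Gbit/s sustained")          # ~0.93
print(f"With {dedupe_ratio}:1 dedupe: ~{deduped_gbps:.3f} Gbit/s")  # ~0.093
```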

Take this into a practical solution and put in place a 24hr RPO (which most clients aim for in a backup solution): that means you have to complete backup + de-dupe + offsite copy within 24hrs.

In a post-process type of solution, none of these can really occur concurrently; each must have its own window to work in. Overlapping those windows could mean some backup data misses the 24hr RPO window.

You can play with these windows to a certain extent - by adding or subtracting bandwidth, for example - but I think you'll find that when distance is involved, latency will ultimately limit replication performance, and throwing more bandwidth at the solution may not actually collapse the window. That, by the way, is a topic I haven't seen many vendors talk about: how they replicate and how latency affects their solution.
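
On that latency point, the single-stream arithmetic is worth spelling out (a sketch with illustrative numbers; vendors differ in how they work around this, e.g. with window scaling or parallel streams):

```python
# Why more bandwidth alone may not collapse the window: a single TCP
# stream is capped at roughly window_size / round_trip_time, no matter
# how big the pipe is. These numbers are illustrative assumptions.
window_bytes = 64 * 1024   # a classic 64 KB TCP window
rtt_s = 0.05               # ~50 ms round trip, e.g. a cross-country link

ceiling_mbps = window_bytes * 8 / rtt_s / 1e6
print(f"Per-stream ceiling: ~{ceiling_mbps:.1f} Mbit/s")   # ~10.5 Mbit/s
# Even on a gigabit link, one stream with these parameters moves only
# ~10 Mbit/s; distance, not pipe size, is the bottleneck.
```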

Now, what would help collapse the window is increased de-dupe performance, but there is no way to get that with the 3D solution without buying another box. So while you could buy another box to scale up your dedupe ingest rate and collapse the window, it's another box to buy, another box to manage, and the global dedupe storage efficiency will be lower... how much lower is perhaps debatable.

Anyway, hope this makes sense.

W. Curtis Preston

I said something to the effect that "if you back up more than 10-20 TB per day, global dedupe matters." (I give a range because it applies to all vendors.)

Using your numbers, you said "if you back up more than 30 TB per day, and if you want to retain more than 3 PB of backup data on disk, it may matter."

If you change that AND to an OR, I'm perfectly fine with your logic (if either condition happens, you need global dedupe). But I have a caveat.

You keep saying that it's acceptable to let dedupe take 20 hours. I'll agree with that IF the customer isn't replicating and/or that replicated copy isn't their first copy offsite, because with a 20-hour dedupe window, backups that finish at 8 AM won't get offsite until after 4 PM. (Consider a 12-hour backup window that starts at 8 PM. If you dedupe & replicate for 20 hours starting at 8 PM, you'll finish deduping at 4 PM, after which the last of the backups will be replicated offsite. Depending on replication speed, you may need a few more hours to replicate.) That's 6-10 hours later than most people want to get their backups offsite, compared to the Iron Mountain truck showing up at 9 AM. Dedupe and replication should make things better, not worse. And if you're taking 20 hours to dedupe, your RTO is worse than with tape.
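
Spelling out the clock math in that example (times on a 24-hour clock; purely illustrative):

```python
# Timeline from the example above: a 12-hour backup window and a 20-hour
# post-process dedupe/replication window, both starting at 8 PM.
backup_start = 20    # 8 PM
backup_hours = 12
dedupe_hours = 20

backup_end = (backup_start + backup_hours) % 24   # 8 AM the next day
dedupe_end = (backup_start + dedupe_hours) % 24   # 4 PM the next day

print(f"Backups finish at {backup_end:02d}:00")   # 08:00
print(f"Dedupe finishes at {dedupe_end:02d}:00")  # 16:00
# The last backups can't start replicating until 16:00 - roughly seven
# hours after a 9 AM Iron Mountain pickup would have had tapes offsite.
```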

Now, on to the second part of your post. I'm not advocating global dedupe so that you can compare everything to everything (e.g. Oracle to Exchange). I completely agree that this buys you very little.

I'm advocating it so that Exchange will always be compared to Exchange (or Oracle with Oracle), no matter how many heads I have or which one I back it up to. If I have global dedupe, the system won't end up storing multiple base copies of the same data when I back it up to a second (or third or fourth) head for performance reasons. If you take a look at my response to your comments on my blog entry (https://www.backupcentral.com/content/view/231/47/), you'll find that I explain how those extra base copies can seriously add up - as in a 400% difference in the amount of RAW DISK I'd have to buy (in a 5-node system without global dedupe vs a 5-node system with global dedupe).

And how did you get from 30 to 60 TB? If you did it by adding your two engines together, that's not valid: you don't have global dedupe between those two engines. If I load balance across your two front-end VTLs and dedupe with two back-end dedupe engines, I'll end up storing TWO FULL base copies of the data, increasing my RAW disk storage requirements by 75% or so.
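
Here's a toy model of that base-copy math (my numbers below are purely illustrative, so the percentages won't match the linked analysis exactly; the real ratios depend on dataset size, change rate, and retention):

```python
# Toy model: each head WITHOUT global dedupe stores its own full base
# copy of the dataset; with global dedupe there is one shared base copy.
def raw_disk_tb(heads, base_tb, deduped_daily_tb, days, global_dedupe):
    base_copies = 1 if global_dedupe else heads
    return base_copies * base_tb + deduped_daily_tb * days

base_tb, daily_tb, days = 100, 2, 30   # hypothetical dataset and retention

for n in (2, 5):
    with_gd = raw_disk_tb(n, base_tb, daily_tb, days, global_dedupe=True)
    without = raw_disk_tb(n, base_tb, daily_tb, days, global_dedupe=False)
    print(f"{n} heads: {without:.0f} TB raw without global dedupe vs "
          f"{with_gd:.0f} TB with it ({without / with_gd - 1:.0%} more)")
```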

You are right; there are multiple criteria when selecting a product. Of the criteria you mentioned, cost, manageability, and scalability are all significantly affected by global dedupe or the lack thereof (if you're a large shop). As for the rest of your list:

* Compatibility with backup applications and source data types (that's a show stopper)

* Ability to replicate deduplicated data (a show stopper if you plan on doing that)

* Support for RAID 6 (should be a show stopper, IMHO, but everybody but NetApp has this now)

* Policy control over the deduplication process and timing (very important, but most post-process vendors have this as well)

But the one that a lot of vendors DON'T have is global dedupe, and I do believe it's really important for large customers, which is why I harp on it so much.

As to Avamar... Did you know I helped write the requirements for that product and helped build the first TCO model for it? Like the Quantum engine that you're familiar with, Avamar (then called Undoo) is just right up the road from where I live. So of course I am aware of it, and I mention it all the time in my backup and dedupe schools. If you search my blog, you'll find that I've mentioned it no less than 11 times there as well. So please don't say that I don't talk about it. Right now we're talking VTLs and target dedupe, and Avamar is source dedupe, so I'm not mentioning Avamar because it's irrelevant to the discussion at hand.

