
July 31, 2008

Comments


Chuck Hollis

Great point -- couldn't agree more!

Bruce Clarke

Good story. Thanks for reinforcing an issue Data Domain has been trying to get communicated for over a year. Of course, reading the RMAN Best Practice paper on our website makes very similar points. One further point you might want to add is that the damage caused by multiplexing is less significant on archived logfiles since they generally don't deduplicate well anyway. So expanding your script to break out the logfiles and back them up in a separate step has also proven to be beneficial.

Good post!

W. Curtis Preston

First, I think if you take the time to review some of the posts at my blog (http://www.backupcentral.com/content/blogsection/4/47/), you'll find some of the analysis you say is missing out there. I couldn't agree more that adopting dedupe without understanding how the product you're adopting works is the quickest way from the infatuation curve to the trough of disillusionment. For example, I know some customers that purchased a Data Domain box to use as their disk pool for TSM (where most people only store a day or so's worth of data). They only got 3:1 dedupe and complained loudly. Umm... How did you think it worked? You have to have something to compare against. If you don't have the previous versions of the files in the disk pool, how is it going to find any commonality?
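To make the "something to compare against" point concrete, here is a minimal sketch in Python. It assumes a toy deduplicator with fixed 4 KB chunks and synthetic backup images in which 5 of 100 chunks change per day; the chunk size, the change rate, and the names `dedupe_ratio` and `make_image` are all hypothetical and do not describe how Data Domain or TSM actually work.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, for illustration only


def make_image(day, chunks=100, changed=5):
    """Build a synthetic backup image; only the first `changed` chunks vary by day."""
    parts = []
    for i in range(chunks):
        tag = f"day{day}" if i < changed else "base"
        seed = f"{tag}-chunk{i}".encode()
        # A 32-byte digest repeated 128 times gives one 4096-byte chunk.
        parts.append(hashlib.sha256(seed).digest() * (CHUNK_SIZE // 32))
    return b"".join(parts)


def dedupe_ratio(images, chunk_size=CHUNK_SIZE):
    """Return logical bytes divided by unique bytes stored across all images."""
    seen = set()
    logical = 0
    stored = 0
    for image in images:
        for offset in range(0, len(image), chunk_size):
            chunk = image[offset:offset + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return logical / stored


one_day = [make_image(0)]
ten_days = [make_image(day) for day in range(10)]
print(dedupe_ratio(one_day))             # 1.0
print(round(dedupe_ratio(ten_days), 1))  # 6.9
```

The synthetic image has no redundancy within a single backup, so one day of data yields 1:1 here; real data usually has some internal duplication, which is where a figure like 3:1 can come from. The ratio only climbs once earlier versions are retained.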

I do think you are projecting and generalizing a bit here. Not all dedupe products are equally affected by multiplexing. One vendor's dedupe ratio will be 1:1 if you multiplex, and another vendor's ratio is not affected by it at all. Suffice it to say that your recommendations are appropriate for EMC, Quantum and (based on comment above) also true for Data Domain. It may or may not apply equally to other vendors.
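As a rough illustration of why interleaving can hurt some products and not others, the sketch below assumes a naive deduplicator that cuts the incoming stream into fixed 4 KB chunks, and two backup streams that write in 1 KB blocks. Both sizes and all names (`blocks`, `interleave`, `dedupe_ratio`) are hypothetical; products that use variable-size chunking or that understand the backup format may behave very differently.

```python
import hashlib

BLOCK = 1024  # hypothetical write size of one backup stream
CHUNK = 4096  # hypothetical fixed dedupe chunk size


def blocks(tag, count):
    """Create `count` distinct 1 KB blocks for the stream named `tag`."""
    return [hashlib.sha256(f"{tag}{i}".encode()).digest() * (BLOCK // 32)
            for i in range(count)]


def interleave(a, b, run):
    """Multiplex two streams, taking `run` blocks from each in turn."""
    out = []
    for i in range(0, max(len(a), len(b)), run):
        out.extend(a[i:i + run])
        out.extend(b[i:i + run])
    return b"".join(out)


def dedupe_ratio(images):
    """Return logical bytes divided by unique fixed-size chunks stored."""
    seen, logical, stored = set(), 0, 0
    for image in images:
        for offset in range(0, len(image), CHUNK):
            chunk = image[offset:offset + CHUNK]
            logical += len(chunk)
            if chunk not in seen:
                seen.add(chunk)
                stored += len(chunk)
    return logical / stored


a, b = blocks("A", 8), blocks("B", 8)

# Two nights of identical data, each stream written contiguously.
separate = [b"".join(a + b), b"".join(a + b)]

# The same data, but the interleave pattern differs from night to night.
multiplexed = [interleave(a, b, 1), interleave(a, b, 2)]

print(dedupe_ratio(separate))     # 2.0
print(dedupe_ratio(multiplexed))  # 1.0
```

The data is byte-for-byte identical on both nights in each case. Only the ordering on the wire changes, and that alone is enough to stop this toy deduplicator from finding any matching chunks.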

Finally, I think you're discussing multiplexing and multistreaming as if they're the same. (It's not your fault: Oracle misuses the term in its own documentation.)

Multiplexing (AKA interleaving) is the practice of combining several streams of backup data into one stream. This is what NetWorker & NetBackup do when trying to make a tape go faster.

Multistreaming is the practice of creating multiple streams from a single source (e.g. a server or Oracle database). What you're describing with Oracle is multistreaming, not multiplexing.

Now, if you sent that multistreamed backup to NetWorker and set target sessions to something greater than 1 on your virtual tape drive, that multistreamed backup would get multiplexed/interleaved together.
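The two definitions above can be sketched in a few lines of Python. This is only an illustration of the terms: the function names `multistream` and `multiplex`, the round-robin assignment, and the file and block labels are all assumptions, not anything taken from Oracle, NetWorker, or NetBackup.

```python
from itertools import zip_longest


def multistream(files, channels):
    """Multistreaming: split one source's files across several streams."""
    streams = [[] for _ in range(channels)]
    for index, name in enumerate(files):
        streams[index % channels].append(name)
    return streams


def multiplex(streams):
    """Multiplexing: interleave blocks from several streams into one stream."""
    merged = []
    for group in zip_longest(*streams):
        merged.extend(block for block in group if block is not None)
    return merged


# One database, five datafiles, two channels: two independent streams.
print(multistream(["df1", "df2", "df3", "df4", "df5"], 2))
# [['df1', 'df3', 'df5'], ['df2', 'df4']]

# Two streams arriving at one device: a single interleaved stream.
print(multiplex([["A1", "A2", "A3"], ["B1", "B2"]]))
# ['A1', 'B1', 'A2', 'B2', 'A3']
```

Multistreaming on its own leaves each stream's contents contiguous; it is the second step, interleaving at the device, that mixes blocks from different streams together.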

Scott Waterhouse

TSM is a great example of how this has gone awry. I have had many conversations with customers that indicate vendors have promised a particular deduplication ratio and not bothered to elaborate on how that would play out with TSM. Personally, I have seen ratios more like 3:1 to 5:1 with TSM data associated with file type nodes. But as I said, it is instructive because deduplication doesn’t magically reduce the volume of data by a certain amount; it does so largely by finding multiple instances of the same data. (I know I am preaching to the converted here, Curtis, but just a bit of background for others reading this…)

Anyway, just to close off the TSM thing, my personal belief is that larger environments will benefit by not deduplicating the component of TSM that is associated with file data. I discussed this in my “Tales of Two TSMs” posts, but basically, TSM is schizophrenic. Part of it will deduplicate well (anything that is email or db/structured data) and part won’t (anything that is file data). So deduplicate one and don’t deduplicate the other. Which sounds like it fits the model for a DL4000 3D pretty well, given the discussion we were having in the comments here: http://thebackupblog.typepad.com/thebackupblog/2008/07/whac-a-mole-part-ii.html. :)

Projecting? Possibly! Generalizing? Probably. I will plead guilty, and say that to a certain extent this is one of the limitations of the format. Any generalization I make will probably have half a dozen exceptions, some of them worth further discussion, some fairly trivial. It pains me not to elaborate sometimes, but I don’t want each post to turn into one massive digression either. Anyway, will this impact different vendors differently? Maybe. If it does, perhaps they can chime in. I don’t have specific detailed knowledge about Diligent or Sepaton in this respect, but I suspect we all have the same issue. (JL? Care to comment?)

And yes, I was playing fast and loose with multiplexing vs. multistreaming. In truth we can make a closer analogy to NetWorker: allocate channel is the equivalent of parallelism, files per set is the equivalent of number of sessions. (Although different in that NetWorker will, by default, spread I/O traffic across multiple devices, as determined by parallelism, only interleaving if the number of sessions is greater than the amount of parallelism. Oracle defaults to interleaving as long as the files per set is greater than one.)
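Restated as code, the difference in defaults described above looks roughly like this. It is a sketch of the rule of thumb in this comment only; the function names are hypothetical and neither product exposes anything like them.

```python
def networker_interleaves(sessions, parallelism):
    """NetWorker, as described above: interleave only when sessions exceed parallelism."""
    return sessions > parallelism


def rman_interleaves(files_per_set):
    """Oracle RMAN, as described above: interleave whenever files per set exceeds one."""
    return files_per_set > 1


print(networker_interleaves(sessions=4, parallelism=4))  # False
print(networker_interleaves(sessions=8, parallelism=4))  # True
print(rman_interleaves(files_per_set=1))                 # False
print(rman_interleaves(files_per_set=4))                 # True
```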

The comments to this entry are closed.
