October 15, 2008

Comments

Chuck Hollis

Interesting line of thinking, Scott.

One area that's unexplored here is the magnifying effect of dedupe on losing a disk.

It's one thing to lose a single backup set. It's another thing to put a very neat hole in *every* backup set you own.

The risk tradeoffs are one thing when we're discussing ordinary single-instance data (RAID 5 vs. RAID 6).

I think they're entirely different if a data block is used in hundreds or thousands of different places.

-- Chuck

Scott Waterhouse

Absolutely true. If the VTL has dedup functionality, then any catastrophic failure (double disk under RAID 5, triple disk under RAID 6) will likely mean data loss for the entire array. So, if you have dedup, you are more or less obliged to use RAID 6, because the risk of a failure impacts not just a (small) subset of the data on the array, but the complete array.

To follow the math though: on a virtual tape library with 12 RAID 6 groups (total capacity 48 TB), you would have 12 times the risk, meaning 12 x 4,000. So if NetApp put dedup on their VTL without going to RAID 6, we could say they have 48,000 times the risk of data loss of an EDL.

Of course the reason I didn't go there in the original post is that NetApp doesn't do dedup on their VTL, and likely won't for quite some time, from what I hear.

Put in the most blunt way possible: they "only" have 4,000 times the risk because they are behind on delivering functionality...

Michael Burgess

A lot of numbers are thrown around about VTLs, RAID levels, and so forth, but I read nothing about the probability of this happening, and I see no substantiation of these numbers. Yes, VTLs are backed by cheaper SATA drives, which for the most part are unreliable because they are cheaply made. My question is how one arrived at these numbers. I would like to see substantiating data.

Scott Waterhouse

I tried to get the IBM red book that Alex cites to go deeper into the numbers, and his link is broken. I have asked him to update it, but until then I could not find the document even through Google.

Having said that, I think the argument comes down to this:

1) Is the probability of a double disk failure with 1 TB drives under RAID 5 non-trivial? Let's say it is (as NetApp has been saying for years, very emphatically). If it is, then they also claim RAID 6 is 4,000 times less likely to fail. RAID 5 is bad, RAID 6 (or RAID-DP) is good, because you reduce the non-trivial chance of failure by 4,000 times.

2) The risk of double disk failure is trivial (call it zero). 4000 times zero is still zero. The whole discussion is pointless, we all have nothing to talk about, except NetApp, who should post a huge apology, as they have basically claimed that anything other than RAID 6/DP is dangerous and stupid because of the risk of failure, and heavily criticized EMC because we don't always use RAID 6.

So, I agree that we should know what the chance of failure is. That would be incredibly helpful. But as long as you accept that it is non-zero and non-trivial, the question remains: why is NetApp's NearStore VTL not using RAID 6?

The consistency police are comin', and their sirens are screaming!

Scott Waterhouse

Hi Michael, the link is now fixed, or you can get to the relevant document directly here: ftp://service.boulder.ibm.com/storage/isv/NS3574-0.pdf

The short version for our purposes seems to be that RAID 5 has a 6% chance of data loss for a 7+1 group of disks over 5 years; RAID 6 reduces this chance to 0.002%.

Now, I don't know about you, but intuitively that RAID 5 number seems pretty high to me. However, that is what IBM and NetApp are claiming. It would mean that if you had a NearStore VTL with 15 RAID 5 groups (or VTL RAID), you would have a 90% chance of data loss after 5 years. I know that probability people and statisticians are choking over my incorrect math here, but it is close enough.

Not sure where that all leaves us, other than to say that IBM and NetApp apparently believe RAID 5 (VTL RAID) poses a substantial risk of data loss over 5 years. So my reasoning, I think, stands.

The only real way of proving or disproving things at the end of the day would be to lay our hands on real figures from the real world (failed RAID rebuilds) and somehow I don't think that anybody is going to be rushing to pony up that information.

Aaron Huslage

Your math overestimates the probability of failure in your comment by about 30%. You've double-counted the intersection of the two failures and come up with the wrong number. The actual number you were looking for is 60% on the RAID 5 ( P(disk A fails) + P(disk B fails) - P(disk A and B both fail)).
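
To make the correction above concrete, here is a minimal Python sketch. It assumes the 15 RAID 5 groups fail independently and that each has the 6% five-year chance of data loss quoted from the IBM paper; the function name `p_any_group_loss` is made up for illustration. Under those assumptions, summing 6% fifteen times gives 90%, while the chance that at least one group loses data comes out at roughly 60%.

```python
def p_any_group_loss(p_group: float, n_groups: int) -> float:
    """Chance that at least one of n independent RAID groups loses data.

    Computed as the complement of "every group survives".
    Independence between groups is an assumption of this sketch.
    """
    return 1.0 - (1.0 - p_group) ** n_groups


# 5-year data-loss probability for one 7+1 RAID 5 group, as quoted in the thread.
p = 0.06
n = 15  # number of RAID 5 groups in the hypothetical VTL

naive = n * p                     # simple sum, which double-counts overlapping failures
exact = p_any_group_loss(p, n)    # at-least-one-failure probability

print(f"naive sum:          {naive:.1%}")
print(f"independent groups: {exact:.1%}")
```

The complement form scales to any number of groups, whereas the inclusion-exclusion expression written out for two disks grows more terms with every group added.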

Scott Waterhouse

Aaron... those are the conclusions that IBM drew in the paper that NetApp uses to support their contentions that anything other than RAID 6 is the wrong approach.

If you are referring to the other article (NearStore VTL guarantees failure), then I think 6% x 6% equals 0.36%? So the probability of failure in a collection of two RAID 5 groups of 7 disks is 11.64%? Or am I mistaken? I won't claim to be so much as a part-time statistician.
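
For reference, the two-group case can be written out as a worked equation. This assumes groups A and B fail independently, each with the 6% five-year figure quoted earlier in the thread; the symbols A and B are labels chosen here for illustration.

```latex
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A)\,P(B) \\
            &= 0.06 + 0.06 - (0.06)(0.06) \\
            &= 0.12 - 0.0036 \\
            &= 0.1164 \approx 11.64\%
\end{align*}
```

The same value follows from the complement: $1 - (1 - 0.06)^2 = 1 - 0.8836 = 0.1164$.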
