My previous post on this subject incited some really good discussion. A lot of it was in the comments, and much of it was far-ranging and discursive, so I thought I would attempt to summarize the dialogue so far and try to come to some conclusions.
My original post stemmed from a comment by a NetApp blogger claiming that RAID-5 is dangerous because it will inevitably lead to data loss through double drive failure. The linchpin of this argument is the risk associated with double drive failures and their potential for data loss; the real claim, then, is that any RAID scheme which does not use double parity is dangerous and will inevitably lead to data loss. Alex had one important caveat, and I will repeat it here: big drive sizes increase this risk.
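For the curious, here is a back-of-envelope sketch of why bigger drives make this worse. The MTBF figure and the 14-drive group size are my own illustrative assumptions, not anything NetApp published; the point is simply that the exposure window, and with it the odds of a second failure during a rebuild, scales with drive size:

```python
import math

# Back-of-envelope only: assumes exponentially distributed drive failures,
# an illustrative 500,000-hour MTBF, and a 14-drive RAID group (13
# surviving drives during the rebuild). All numbers are hypothetical.
def p_second_failure(drives_remaining, rebuild_hours, mtbf_hours=500_000):
    """Probability that at least one more drive fails during the rebuild window."""
    return 1 - math.exp(-drives_remaining * rebuild_hours / mtbf_hours)

for size_gb, rebuild_h in [(320, 4.5), (1000, 13.5)]:
    p = p_second_failure(drives_remaining=13, rebuild_hours=rebuild_h)
    print(f"{size_gb} GB drives, {rebuild_h} h rebuild: P(second failure) = {p:.4%}")
```

The absolute probabilities depend entirely on the assumed MTBF, but the ratio does not: triple the rebuild window and you roughly triple the chance of a second failure.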
So the question becomes, why does NetApp use a single parity RAID scheme in their NearStore VTL if that is what they believe? And more importantly, I guess, why would you ever deploy a device that was "dangerous" and would "inevitably" have data loss?
The comments that came back from NetApp commentators (though not from Alex), to my surprise, only reinforced the stance: RAID with single parity is dangerous. In fact, this is a well-established and long-held belief over at NetApp. Geert singled out RAID without DP as a bad thing, claiming that unlike all other forms of RAID, RAID-DP "dramatically increases the *data availability* by surviving the 'inevitable' dual disk failure". Val Bercovici has historically held the same stance, dating back at least to his dialogue with StorageMojo.
So full points for consistency, guys. It seems RAID-DP is the way to go in NetApp's book, and everything else is unacceptably dangerous: everything else will inevitably experience double disk failure, data loss is a given, and in general all other forms of RAID should be avoided like the plague.
If that is true, the original question holds now more than ever: why would NetApp sell, and why would anybody buy, a NetApp product like the NearStore VTL that doesn't use DP? After all, you are going to lose your backup data; it is only a matter of time.
Well, to that question, Alex basically said (and I am paraphrasing): "sure, NearStore is not RAID-DP, and sure, that's really bad, but it is not as bad as it could be, because VTL RAID isn't quite as bad as normal single parity RAID". Why? Alex gave three reasons: 1) VTL RAID stops writing to a degraded RAID group; 2) VTL RAID rebuilds proceed much faster; 3) VTL self-tuning.
So let's deal with them in order:
1) VTL RAID stops writing to a RAID group after it fails. That's great. The issue here, however, is how long are you exposed, and is that too long? Long enough that a double disk failure becomes inevitable? And when we put it that way, reason 1 really becomes a contributing factor to reason 2, but isn't by itself a reason that the lack of RAID-DP on a NearStore is any more tolerable.
2) VTL RAID rebuilds proceed faster. As evidence, Alex quotes an article that is two years old and that used 320 GB drives. (More on that article in a little bit.) Even with 320 GB drives, the rebuild took 4.5 hours, so it seems reasonable to assume that with 1 TB drives we would be looking at a rebuild time of roughly 13.5 hours. So much for much faster. In fact, with currently shipping 1 TB drives, the NearStore VTL leaves you exposed for a much longer period than NetApp admits at first. And if 13.5 hours of exposure is not enough to be called "a serious and measurable risk", what is? Which brings us to point #3.
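If you want to check my arithmetic, the 13.5-hour figure is nothing more than the published 4.5-hour rebuild scaled linearly with capacity, which assumes the rebuild is throughput-bound (a charitable assumption, if anything):

```python
baseline_gb, baseline_hours = 320, 4.5   # from the two-year-old article
drive_gb = 1000                          # currently shipping 1 TB drives

# Assumption: rebuild time scales linearly with drive capacity, i.e. the
# rebuild is limited by how much data must be reconstructed. I round the
# scale factor down to 3x in the text.
scale = drive_gb / baseline_gb           # ~3.1x the data per drive
print(f"Estimated rebuild time: {baseline_hours * scale:.1f} hours")  # ~14.1 hours
```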
3) Self-tuning. Self-tuning increases the impact of a RAID group failure. Although the table cited in Alex's follow-up post raises as many questions as it answers (how big were the drives on the arrays? how long had they been in production? does the dispersal of virtual cartridges change as the array ages?), it does show that in practice, cartridges will be distributed across 2-4 RAID groups. Let's assume we are emulating LTO-1, at 100 GB per cartridge. A 6 TB VTL RAID group would then have capacity for 60 cartridges, but would likely hold data from 120-240 virtual cartridges. In other words, the inevitable double disk failure will impact as many as 240 virtual cartridges.
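The cartridge math, spelled out (the LTO-1 capacity and 6 TB group size are from above; the 2-4x dispersal comes from Alex's own table):

```python
raid_group_tb = 6
cartridge_gb = 100                                        # LTO-1: 100 GB/cartridge
whole_cartridges = raid_group_tb * 1000 // cartridge_gb   # 60 fit in one group

# Self-tuning spreads each virtual cartridge across 2-4 RAID groups, so one
# group holds fragments of 2-4x more cartridges than it could store whole.
for dispersal in (2, 4):
    print(f"dispersed across {dispersal} groups: "
          f"{whole_cartridges * dispersal} cartridges touched by a group failure")
```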
So let's do some math. Let's assume NetApp is right about the rebuild time for an EMC EDL (they are not, but I will get to that): 20 hours. Let's assume RAID group sizes are the same: 6 TB. Assume virtual cartridge sizing is the same. Looked at from a risk-over-time perspective, the NearStore puts 120-240 cartridges at risk for 13.5 hours, while the comparable EMC array puts 60 cartridges at risk for 20 hours. That works out to the NearStore carrying 1.35 to 2.7 times the exposure. And of course, the EMC EDL actually uses RAID-6, so the only array with any real risk of data loss is NetApp's.
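Here is that comparison as cartridge-hours of exposure, a crude figure of merit of my own devising (cartridges at risk multiplied by the length of the rebuild window):

```python
emc_risk = 60 * 20.0                 # EDL: 60 cartridges x 20 h = 1,200 cartridge-hours
for cartridges in (120, 240):        # NearStore best- and worst-case dispersal
    risk = cartridges * 13.5
    print(f"{cartridges} cartridges x 13.5 h = {risk:.0f} cartridge-hours, "
          f"{risk / emc_risk:.2f}x the EMC exposure")
# 120 -> 1,620 cartridge-hours (1.35x); 240 -> 3,240 cartridge-hours (2.70x)
```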
So, none of the three factors cited by Alex mitigates the risk. According to NetApp, single parity RAID is dangerous, and data loss is inevitable. VTL RAID is no exception--its rebuilds are barely any faster than RAID-5 rebuilds, and the amount of data exposed during a failure is 2-4 times greater than with standard RAID-5.
Overall, what NetApp's own explanations demonstrate is this: not only is VTL-RAID no better than standard "dangerous" single parity schemes like RAID-4 or RAID-5, it is actually riskier.
Long footnote about that article NetApp referenced: no surprise, NetApp and Veritest did it again--yet another "impartial" report that favors NetApp. Amazing. But dig a bit under the covers (and it doesn't take much digging), and you will find enough errors to render the study worthless:
- Outdated equipment. When the report was published, the EMC array tested was already N-1, and had been for almost a full year.
- Outdated software. Similarly, the firmware on the array was N-2; with the then-current firmware, Veritest would have seen much faster RAID rebuilds on the EMC array.
- Incorrect procedures. None of the configuration methods used by Veritest are supported by EMC. Specific commands must be run to add capacity, for example, and Veritest did not use them. This would have a severe negative impact on performance.
- Block sizes for the tests were inconsistent between the two arrays, and not generally reflective of real-world backup block sizes.
- Specific tuning parameters were selected to exclude key EMC performance features--roughly equivalent to turning off NetApp's self-tuning.
- Veritest added capacity to the EMC array in a way that was unlikely to add performance. Despite this, the EMC array gained performance as capacity was added, while the NetApp array began to decline in performance (after only 4 shelves--25% of total capacity).
I could go on, but I don't really think it is necessary. A quick perusal by anybody knowledgeable about EMC Disk Libraries would reveal so many errors in method and fact as to make any conclusions drawn from them meaningless. Enough said.