
August 20, 2008

Comments


Jered Floyd

Absolutely right. I don't think NetApp is using RAID 5 -- they use RAID 4 because their log-structured file system (WAFL) makes it so the typical read-before-write downside to RAID 5 isn't an issue -- but it's still a single parity scheme. Using a single-parity scheme with terabyte drives is just asking for failure! I made a small video explaining the problem for those more visually minded. (Looks like you don't allow links in the comments -- you can go find this video over at www.permabit.com.)

Where Alex says that RAID 6 can protect against two drive failures, he's being similarly misleading. I suppose if you count a bit error as a drive failure it's correct, but RAID 6 can't protect against double spindle failures for the exact same reason. And double spindle failures do happen.

Honestly, I don't understand why NetApp is trying to position themselves in the VTL business. They've got a great product for primary file storage, but it's not priced or positioned right for VTL. Trying to claim Filers as a backup target is just diminishing their brand.

Jered Floyd
CTO, Permabit Technology Corp.

Scott Waterhouse

Which raises another interesting point that NetApp doesn't like to make mention of: their VTL is not based on their WAFL technology. They got it when they bought Alacritus, and it is a completely different OS.

As a result, I am pretty confident that it is RAID-5 protection, but not positive.

And no, I am not a big fan of using filers (theirs, ours, or ones from any other company) as a backup target. Not unless you add a bunch of additional capability around performance, tuning, management, error correction and authenticity. And take away a bunch else around visibility of the file system, exposure of the storage management layer, and so on. But that is just my $0.02.

James Curpy

But, EMC did not think RAID-6 was important just last year:

http://chucksblog.typepad.com/chucks_blog/2007/01/to_raid_6_or_no.html

the storage anarchist

Jered -

I don't follow your comment about RAID 6 not protecting against double drive failures. True, you can still get double drive failures, and even unrecovered bit-errors on members of the same RAID rank across two drives, but with dual-parity RAID 6 this still will not result in data loss. It takes a THIRD failure before you lose data, and while mathematically possible, you have better odds of winning PowerBall with a single ticket!

Alex McDonald

Thanks for the reference to my blog, Scott. New as I am to blogging, I occasionally feel a little like I'm sending email to myself. Seriously, I'm pleased to see it's being read.

I'm going to deal with these points in more detail on my blog; hopefully before I go on holiday in a day's time. The VTL points I'll treat in more depth, but just a quick heads up to the article and Jered's comments;

1. NetApp's VTL is not based on RAID-5. Nor is it a "filer".

2. NetApp's primary storage supports RAID-4 or RAID-6 (aka RAID-DP). RAID-6 is the default and the recommended RAID setting.

3. RAID-6 *does* protect against two concurrent spindle failures. I'm surprised that Jered would think otherwise.

I'll post back here at some point with an update.


Scott Waterhouse

James;

There are a few points to remember. Since Chuck was talking generically about all EMC storage platforms, I will focus in a little on the Clariion, as that is what is used in our EDL and DL3D.

First, Clariion is the only "mid-range" (and I use that word advisedly, because Clariion is a pretty potent mid-range device) system to achieve a real world 99.999% availability. Even with RAID-5. That is not some marketing mumbo jumbo number, that is what really happens in the field based on dial home reports from all installed systems.
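
For a sense of scale, 99.999% availability leaves very little room for downtime. A quick sketch of the arithmetic, assuming a 365-day year and treating availability as a simple fraction of elapsed time (the function name is just for illustration, not anything from either vendor):

```python
def allowed_downtime_minutes(availability_pct, days=365):
    """Minutes of downtime per period permitted by an availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

five_nines = allowed_downtime_minutes(99.999)
print(f"{five_nines:.2f} minutes per year")
```

Under these assumptions, five nines works out to a little over five minutes of downtime per year.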

Secondly, at the time our EDL line only used 500 GB drives, meaning rebuild times would be half of what they are for current 1 TB drives. So the exposure was considerably less.
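
The exposure argument rests on rebuild time scaling with drive capacity. A rough sketch, assuming a constant sustained rebuild rate; the 50 MB/s figure is a made-up placeholder, not a measured number for any product:

```python
def rebuild_hours(capacity_gb, rate_mb_per_s=50.0):
    """Hours to rewrite a whole drive at a constant rate (decimal units)."""
    seconds = capacity_gb * 1000 / rate_mb_per_s
    return seconds / 3600

small = rebuild_hours(500)    # 500 GB drive
large = rebuild_hours(1000)   # 1 TB drive
print(f"500 GB: {small:.1f} h, 1 TB: {large:.1f} h, ratio: {large / small:.1f}")
```

Whatever rate is assumed, the ratio stays at 2, which is the only part of the claim this sketch supports; real rebuilds also depend on controller load and group size.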

Finally, I can't ever recall a time where EMC actively sold a product we considered "dangerous" to data. Not a very responsible thing for a vendor to do, in my opinion. And NetApp sure likes to talk about responsibility.

Alex McDonald

NetApp has certification of five nines; http://media.netapp.com/documents/ar1056.pdf

"IDC has found NetApp’s methodology to be rigorous and consistent with other areas of the IT industry for characterizing storage system availability in terms of percentage of uptime over a given time period. This sound method for monitoring and the associated analysis have led NetApp to determine that its monitored storage systems have achieved greater than 99.999% uptime as measured on a rolling basis between July 2006 and January 2008."

Scott Waterhouse

Alex, thanks for the responses.

With respect to the 5 9's claim, I will say duly noted and I retract the "only" part of my statement above. I would note however that the study quoted does not deal with VTL architectures, and as you have recognized, NetApp's VTL does not utilize Filer/FAS architecture. If it is OK with everyone, I will just say I concede the point with respect to generic storage, but in the interest of keeping this focused on virtual tape I would say I still have significant concerns. It doesn't sound like we could accurately say that the NetApp Near Store VTL line or the storage it uses is observed to be 99.999% available.

With respect to the RAID type: it may be called VTL RAID, and you claim it is not RAID 5, 4 or 6, but the most specific information on the web just says that they are 4+1 (originally) and 6+1 (currently) RAID groups. That indicates that they do have exposure to single drive failure, and would be subject to the concerns described in my posting. I do look forward to the clarification on what VTL RAID is, exactly! I admit that I am intrigued, because there are not a lot of RAID types left to choose from!
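
To make the single-parity exposure concrete, here is a minimal sketch of a generic 4+1 XOR parity group. It assumes plain XOR parity as used in RAID 4/5; whether NetApp's "VTL RAID" works this way is not established in this thread, and the helper names are illustrative only:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four data blocks plus one parity block (a "4+1" group).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose one data block: XOR of the survivors and parity recovers it.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]

# Lose two data blocks: the single parity equation only yields their XOR,
# so neither block can be recovered individually.
xor_of_missing = xor_blocks(data[:2] + [parity])
assert xor_of_missing == xor_blocks(data[2:])
```

With one parity block, a second loss in the same group leaves one equation and two unknowns, which is the exposure being discussed.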

Jered Floyd

Storage Anarchist, Alex,

I say that RAID-6 / RAID-DP doesn't sufficiently protect against a double spindle failure because of the chance of an uncorrectable bit error during rebuild. Consider a typical configuration for RAID-DP with 14 data drives and 2 parity drives. If you lost both drives (it happens), you have to read the remainder of the drives, all 14, perfectly in order to reconstruct. That's 14 TB. If you're using 1 TB drives with a bit error rate of 1 in 10^14 bits, that's 1 bit in 12.5 TB. You're basically guaranteed to hit an uncorrectable block and lose data.
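
The arithmetic above can be checked with a short sketch. It assumes decimal terabytes, independent bit errors at exactly the quoted rate, and no remaining redundancy to correct them; real drives and arrays are messier, and the function name is illustrative:

```python
import math

def prob_unrecoverable_read(data_tb, bit_error_rate=1e-14):
    """Chance of at least one unrecoverable bit error while reading data_tb."""
    bits = data_tb * 1e12 * 8                 # decimal terabytes to bits
    # 1 - (1 - ber)**bits, computed stably for tiny ber and huge bit counts
    return -math.expm1(bits * math.log1p(-bit_error_rate))

expected_errors = 14 * 1e12 * 8 * 1e-14       # mean number of bad bits in 14 TB
p = prob_unrecoverable_read(14)
print(f"expected bit errors: {expected_errors:.2f}, P(at least one): {p:.2f}")
```

Under this simple model, reading 14 TB gives a bit more than one expected error and roughly a two-in-three chance of hitting at least one: a high risk, though short of a literal certainty.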

I've explained this in more detail in my blog at http://permabit.wordpress.com/2008/08/15/multiple-drive-failures-raid-6-vs-rain-ec/

Jered Floyd
CTO, Permabit Technology Corp

P.S. Scott, is there any way to subscribe to comments, or comment responses?

Geert

Jered,

Correct me if I'm wrong, but as far as I'm concerned you are describing a TRIPLE disk failure in your example. Drives don't typically fail two at a time; typically only one fails, followed by the guaranteed bit error during reconstruction. RAID-6 DOES protect against that (two disk) failure.
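
The disagreement here comes down to how failures are counted. A minimal sketch, assuming the usual erasure-coding rule that a stripe with p parity blocks survives at most p lost blocks, and counting an unrecoverable read error on a surviving drive during rebuild as one more lost block in the affected stripe (names are illustrative):

```python
def stripe_survives(parity_blocks, failed_drives, read_errors_in_stripe=0):
    """True if the stripe can still be reconstructed.

    Each failed drive and each unreadable block in the stripe counts as
    one erasure; a stripe tolerates at most `parity_blocks` erasures.
    """
    erasures = failed_drives + read_errors_in_stripe
    return erasures <= parity_blocks

# One failed drive plus a read error during rebuild:
print("RAID 5:", stripe_survives(1, 1, 1))   # two erasures, one parity
print("RAID 6:", stripe_survives(2, 1, 1))   # two erasures, two parities

# Two failed drives plus a read error during rebuild:
print("RAID 6:", stripe_survives(2, 2, 1))   # three erasures, two parities
```

In this counting, the scenario of two dead drives plus a bit error is three erasures, which is why it is being called a triple failure.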

Corbett already described that perfectly in 2004:
http://www.usenix.org/events/fast04/tech/corbett.html.

Oh, and Scott; 99.999% *system* uptime does NOT guarantee 99.999% *data* availability.... Your disks may be perfectly spinning and LUNs available, but if you need to restore that mission critical database from tape onto those (now empty) LUNs (because you suffered a double disk failure on your RAID-5 or RAID-10 set) one still has something to explain to the CTO, right....?

RAID-DP will get you that extra mile in terms of *data* availability...

(I won't even start on fast Snapshot recovery for protection against logical errors.... which adds a few extra point to the *data* availability metric)

Scott Waterhouse

Geert;

You are correct, 99.999% system uptime is not semantically equivalent to 99.999% data availability.

While this is correct, I think it isn't relevant. EDL (standard and deduplicated) uses RAID-6. (And let's not confuse people by saying that RAID-DP will get you that extra mile... RAID-6 will, and RAID-DP is a subset of RAID-6. If you want to debate the merits and drawbacks of the two, that is fine, but a subject for a different post perhaps!). Clariion and DMX offer RAID-6. Centera uses a different form of double parity, but is *very* roughly equivalent to RAID-6 in terms of disk failure.

Again, you are correct, but the point is not truly cogent. The same can be said for Snapshots... Again, lots of good technical discussion that could be had (and has been had) about our competitive approaches to the problem; but this may not be the correct forum.

Finally, the discussion is awesome. I welcome the feedback, criticism, and alternate points of view from competitive organizations. And it is great to see that this hasn't devolved into a flame fest. Having said all that, I don't see how any of the points made detract from the central thesis of the post: NetApp's claim that RAID using single parity is inappropriate for backup systems employing large disks, and the simultaneous presence of exactly that in the Near Store.

Val Bercovici

(Let's see if this comment passes the censor's guidelines)

Hmmm...

1. Scott - thanks for proving EMC's absolute maximum "safe" usable capacity (with acceptable performance) on *every single* storage platform you ship is 50%. After all, EMC recommends RAID1 (including "content mirroring" on -ahem- Centera CAS) for improved disk failure protection and rebuild times compared to RAID3 & RAID5.

2. Jered's complete misunderstanding of what constitutes a triple disk failure just proved I'd be overqualified to work at Permabit. I sure hope he's not still writing any of their data integrity code!

3. I can confirm NetApp's VTL is not based on Data ONTAP & WAFL. I can also confirm that Tape Smart Sizing works exactly as advertised, and that the concept of "self tuning" on any type of storage platform continues to mystify EMC bloggers everywhere.

4. Can anyone point me to a single VTL deployment from any vendor configured with *complete* double-disk failure protection?
(a) Let's please not use Jered's curious definition for "complete double-disk failure protection"
(b) For bonus points, can anyone point me to a published VTL best-practices guide where RAID6 is recommended? If such a beast exists, please elaborate on the performance expectations. After all, isn't that why customers are deploying VTLs to begin with?

Oh, and if you want to see dedupe on NetApp VTL, you won't have to wait long. I'll personally have more to say on that very very soon :)

As it is, NetApp is already the recognized dedupe market leader by units deployed, customer count and capacity shipped, so we're getting a lot of practice on that subject!

So Scott, is this comment "ad-hominem-free" enough for you?

-Val
NetApp's CTO-At-Large

Val Bercovici

BTW - The topic of "Smart Sizing" gives me a great opportunity to introduce one of our latest blogs from a top VTL guru:

http://blogs.netapp.com/absolute/2008/08/a-bird-flew-thr.html

Gilda helps elaborate how to reduce customer shelf space by 50-66% for cloned tapes using NetApp VTL!

Geert

Scott,

I wasn't commenting on the direct relation to RAID and VTL (I will have Alex take care of that himself), but merely on Jered's comment around RAID-6 dual disk failure protection and yours around "uptime" (vs. "data availability").

I do want to respond to your statements above though. While we are all aware of the fact that CLARiiON and DMX "offer" RAID-6, I have yet to see one best practice that shows the use of it in real life workloads (besides backup, but then again; isn't DMX a bit expensive for backup...). NetApp on the other hand has a long history of actually *implementing* RAID-DP in those environments across application tiers, system models and disk types. Actually, it has been a best practice like forever (http://media.netapp.com/documents/tr-3437.pdf). And btw; RAID-DP is *not* a "subset" of RAID-6, it is an *implementation* of RAID-6 as recognized by SNIA and even Microsoft (http://technet.microsoft.com/en-us/library/bb738146(EXCHG.80).aspx); "RAID-DP from NetApp is a proprietary implementation of RAID double parity for data protection. RAID-DP falls within the Storage Network Industry Association definition of RAID-6. [...] Whereas current RAID-6 implementations incur an I/O performance penalty as a result of introducing an additional parity block, RAID-DP is optimized in terms of reducing read I/Os due to the way the NetApp controller handles parity write operations. [...]"

And since both RAID-DP and NetApp Snapshot (and thin provisioning for that matter) don't incur a performance hit, they're both used simultaneously to raise the bar for storage performance benchmarking (as independently audited by SPC): http://blogs.netapp.com/on_the_edge/2008/01/netapp-raises-t.html

(forgive me the direct comparison against CLARiiON, as that's not my point).

So why is RAID-DP important for *data availability*?

Well, EMC's own CLARiiON RAID-6 whitepaper describes it perfectly:
1) "Even with good RAID protection, it is still important to evaluate threats to data loss and implement proper backup and replication technologies to ensure that your most important data is protected in all situations." (only "your *most* important data..."?)
2) "The added reliability [of RAID 6] may come at the cost of performance when compared with other RAID types. RAID 6 has a disadvantage to RAID 5 and RAID 1/0 for small, random writes and system write bandwidth performance [which are most applications], but other I/O profiles are not affected as significantly." (...as significantly...???)

(check out the doc here: http://www.emc.com/collateral/hardware/white-papers/h2891-clariion-raid-6.pdf)

So this is exactly why RAID-DP gets deployed by default and is the de facto standard on all NetApp implementations; it dramatically increases the *data availability* by surviving the "inevitable" dual disk failure (so you don't need to fall back on your backup/painful restore), and *without* the cost of performance or additional parity overhead. No sacrifice....

Scott Waterhouse

Val, welcome back! And yes, I do appreciate the total absence of any ad hominem attacks.

Responding to your points, in order:

1) I think you drew a conclusion from facts not in evidence, to borrow a phrase. As RAID-6 carries a penalty of n/(n-2) as a percentage, unless you have 4 disks in your RAID-6 group, I don't see how you get to 50%?

2) Ummm, yeah.

3) Tape smart sizing may work as advertised. But so does the square wheel. Point being they are both unnecessary and useless.

4) I might be missing something here. EMC uses RAID-6 on our EDL virtual tape. How does this not protect backup data against a complete double disk failure? And I don't believe our best practices guide elaborates on RAID levels for EDL because there is no choice--it is an appliance after all, and all storage is automatically configured with no user intervention required. As far as performance, for various reasons, we found that RAID-6 offered a negligible to non-existent performance impact on our EDL systems. In summary: we use RAID-6, we are extremely happy with the performance vs. other RAID schemas, as are our customers.
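
The capacity arithmetic behind point 1 can be sketched as follows, assuming a RAID-6 group of n equal-sized drives with two drives' worth of parity (the function name is illustrative):

```python
def usable_fraction(n_drives, parity_drives=2):
    """Fraction of raw capacity left for data in a parity RAID group."""
    if n_drives <= parity_drives:
        raise ValueError("group needs more drives than parity")
    return (n_drives - parity_drives) / n_drives

for n in (4, 8, 16):
    print(f"{n} drives: {usable_fraction(n):.1%} usable")
```

Only the 4-drive group lands at 50% usable. A 16-drive group, such as the 14 data plus 2 parity layout mentioned earlier in the thread, keeps 87.5% of its raw capacity.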

And finally, I do look forward to the dialogue around NetApp's VTL with dedup when it finally creeps forth from the labs.

Competition makes us all better, and, in my opinion only, deduplication for backup is still very much at the point where a rising tide lifts all ships.

PS. While I am not too interested in using this blog to advertise NetApp blogs, you were so civil in your posting that really, why not? Let's let David Blaine take the stage, shall we?

Scott Waterhouse

Geert;

I promised that I wasn't going to be drawn into a debate about the merits of double parity on dedicated parity drives vs double parity with one bit on a parity drive and the other striped diagonally. And so I am not. Let's just say that NetApp took one approach, and EMC another. Since your approach was the easy (only) logical extension of RAID-4 WAFL, I guess I am not surprised you believe in its virtues. We had more choice.

With respect to performance, I will repeat the response I made to Val: for EDL, utilizing RAID-6 made a negligible to nonexistent impact on performance. We are extremely happy with the performance of our EDL.

On to the core issue: "this is exactly why RAID-DP gets deployed by default and is the de facto standard on all NetApp implementations; it dramatically increases the *data availability* by surviving the "inevitable" dual disk failure"

Well, it's not. Near Store VTL does not use RAID-DP.

It uses a single parity drive, and is susceptible to single disk failures. Despite all the responses by NetApp on this subject, that single fact remains undisputed. And honestly, I think you just made my point for me: if dual disk failure is inevitable, then it is inevitable you will lose data from a NetApp VTL. And the way a NetApp VTL is constructed, you will almost certainly not lose just one virtual cartridge, you will almost certainly lose most or all virtual cartridges on the appliance.

If dual disk failure is inevitable, if single parity implementations are dangerous, then why use it on your VTL? And why use it in a way that amplifies the chances of a failure by 48 times? And why use it in a way that amplifies the consequences of a failure even more significantly?

Stephen McDonald

Scott,

First off, nice to "see" you again. I think we may have met in another life, but then again there might be a lot of Scotts working at EMC.

I would have to partially agree that the VTL is dangerous for some of the reasons you mention, but I find myself asking at what point do we draw a line between risk and practicality? This discussion could be continued to the point of arguing RAID7/10 over RAID6 etc. Assuming that tape isn’t dead (as you mentioned in a previous blog of yours), if virtual tapes in the VTL are incrementally copied to tape every day (a la TSM style) the data exposure of a double failure leading to a complete loss of all virtual tapes is minimal. Backups fail much more often than drives, and tapes fail about as often as disks if not more. There are many other risks backup data suffers beyond the disk array it sits on and to focus on only one seems a bit excessive. What are the actual statistical differences anyways so we can quantify the risk we’re talking about here?

With regards to smart tape sizing I wholeheartedly disagree with your comments of it being unnecessary. The EMC CDL architecture, which has the backup application manage the copying of data to tape via a dedicated media server, is simple and effective for the purpose of managing mixed media sizes, but it doesn’t work very well for Netbackup. With smart tape sizing on the VTL, the process to copy the data to tape is handled outside of the backup application, and the tape gets completely filled as well. This works well with Netbackup as Netbackup’s process to move data is horribly inefficient to the point of being problematic in certain situations. As an exercise, try performing 1000 backups averaging 40 MB in size and migrating that to tape via the CDL using one tape drive. Sure, you can use multiple tape drives in parallel, and performance will probably improve linearly based on what is seen by Netbackup with generic disk pools, but you can also have more than 1000 small backups… Conversely, with the VTL the process is a lot more streamlined, the IO is offloaded to the disk array and at the same time tapes are fully utilized. I believe the NetApp engineers had considered this as there was a paper that commented on the speed of Netbackup Vaulting versus NetApp direct tape creation. http://whitepapers.pcmag.com/whitepaper2207/

Give me a shout and we can discuss it further over coffee and a whiteboard if you are the same Scott that I think I know.

Cheers!

Scott Waterhouse

Hey Stephen, good to see you again as well.

So, your points in order:

1) Reliability. What can I say, I agree. It is a case of how much is too much. Having said that, if NetApp themselves are going to say that single parity is dangerous with 1 TB drives, then I think the responsible thing to do would be to only sell dual parity RAID-6. I just think the inconsistency is worth examining. And if you can have a choice of RAID 5 or RAID 6 with the big drives, I'll take RAID 6 every time.

2) Breaking this down (as I did in the follow up): you can use your Backup App or not. If you do, none of this matters. (For the record, EMC gives you both choices...) If you don't, then you can use NetApp DTC w/ TSS, or EMC Tape Caching, and achieve more or less the same result: you use the appliance to create the tape. All good. The question is: would you waste tape space without it (meaning would the size of a compressed virtual volume be different than the size of the compressed physical volume)? Not in my experience. YMMV. (By the way, in your example of 1000 backups of 40 MB each, wouldn't you need 1000 physical tape cartridges if you wrote them to 1000 virtual cartridges? I haven't seen TSS or DTC being advertised as capable of volume stacking? If you wrote them sequentially to one or more virtual volumes, then either appliance can efficiently create the physical cartridge without intervention from the backup application. But again, TSS doesn't make any difference there).

