Edit: Note there is revised math in the comments section.
Well, the bloggers at NetApp have been threatening to introduce deduplication for their VTL for a while now, so it should come as no surprise that they finally did it. And just like that, they have assumed the unenviable position of being the last major vendor to announce a dedup product for backup.
(Just a quick roll call here: EMC. Check. HP. Yep. IBM. You bet. Sun. Check. Dell. Roger that. Data Domain. Naturally. Sepaton. Yes. Falconstor. Yes. So NetApp, welcome, at last, to the game.)
Naturally, I had a lot of questions when I heard the announcement. Would it do replication? Is it a variable or fixed block scheme? In-line or post-process? How does it protect the hash index? And, most importantly of all, did NetApp decide to stick with VTL RAID rather than moving to RAID DP?
The last question is really the most crucial.
NetApp apparently believes that RAID 5 (of which VTL RAID is a subset) is 4,000 times as likely to lead to catastrophic data loss as RAID DP. We can put that more precisely: any single-parity RAID scheme with 7 data drives is 3,955 times more likely, over 5 years, to lead to data loss, where data loss means that your RAID group is unrecoverable. For those who are interested in how NetApp came to this conclusion, the paper can be found here.
So why does this matter with deduplication? It matters because of the way a deduplication appliance necessarily lays out data across the RAID groups.
Deduplication ultimately reduces all backup data to segments. Those segments can be hashed, indexed, parity protected, self-describing, and so on, but ultimately all backups become reduced to a large number of segments. Segments are often between 8 and 128 kB in size. So you can see that a 10 TB backup will have a lot of segments associated with it. About 156 million of them, if they were 64 kB, or 312 million if they were 32 kB.
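The segment count is just back-of-the-envelope division: backup size over segment size. Here's a quick sketch (decimal units assumed; real appliances will vary in segment size and metadata overhead):

```python
TB = 10**12  # decimal terabyte
KB = 10**3   # decimal kilobyte

backup_bytes = 10 * TB

# Segment count for a couple of plausible segment sizes
for seg_kb in (64, 32):
    segments = backup_bytes // (seg_kb * KB)
    print(f"{seg_kb} kB segments: {segments:,}")
# 64 kB segments: 156,250,000
# 32 kB segments: 312,500,000
```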
So a LUN will contain many, many segments. And each segment will almost certainly have many pointers to it--one for every duplicate instance of that data. Think of it this way: an appliance that is getting 20:1 deduplication will have an average of 20 pointers to each segment of data.
Those pointers will come from lots of different backups, and be associated with 20 different virtual cartridges. At the end of the day, all this means that segments from every backup will likely be found on every LUN. Every virtual tape cartridge associated with that system will likely be spread over every LUN and RAID group.
So what happens if a RAID group fails?
If a RAID group fails, you will lose all of the segments associated with that RAID group. But those segments are almost certainly referred to by every backup, and every virtual cartridge, ever stored on that system.
So if you lose a RAID group, you will lose all virtual cartridges. You will lose all your backups.
You don't just lose some of your data, you lose it all. A single RAID group failure is equivalent to a complete systems failure.
NetApp claims that VTL RAID is 4,000 times more likely to lose data than RAID 6, and is therefore dangerous to your data.
Now it seems that once you add deduplication to their VTL it is much more than 4,000 times more dangerous. It is actually 4,000 times more dangerous, multiplied by the number of RAID groups. In a system with 100 TB of useable capacity, this would be equivalent to 66,667 times more dangerous.
Let's put some specifics on this. NetApp thinks that a single-parity RAID scheme with 7 drives has a 6% probability of data loss over 5 years. When the failure of any individual component causes system failure, and the components fail independently, the probability of system failure is:

1 - (1 - P(A)) * (1 - P(B)) * ... * (1 - P(x))
In the case of a 48 TB useable system, we will have 8 RAID groups of 6 TB useable each (that is, eight 6+1 RAID 5 groups of 7 disks each).
The probability of failure is:
1 - (1 - 6%)^8 = 1 - 0.94^8 ≈ 39%
In the case of a 102 TB useable system, we will have 17 RAID groups of 6 TB useable each (that is, seventeen 6+1 RAID 5 groups of 7 disks each). The probability of failure is 1 - 0.94^17 ≈ 65%.
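The arithmetic for any number of RAID groups follows from the complement rule: the system survives only if every group survives. A minimal sketch, taking NetApp's 6% per-group figure as given:

```python
def p_system_loss(p_group, n_groups):
    """Probability that at least one of n independent RAID groups
    suffers unrecoverable data loss over the period, given a
    per-group loss probability of p_group."""
    return 1 - (1 - p_group) ** n_groups

for n in (8, 17):
    print(f"{n} RAID groups: {p_system_loss(0.06, n):.0%} chance of data loss")
# 8 RAID groups: 39% chance of data loss
# 17 RAID groups: 65% chance of data loss
```

Note that naively summing the per-group probabilities overstates the risk (and tops out above 100%, which should be a red flag); the complement form is the correct way to combine independent failure probabilities.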
So I have a suggestion (tongue firmly in cheek). NetApp seems pretty eager to offer a data deduplication guarantee for space savings on their filers. How about they offer a data loss guarantee with their NearStore VTL? Because it does seem pretty much guaranteed.
One last thought: how do the rest of us mitigate the risk of a failure leading to data loss? Well honestly, you have two choices: RAID 6 and replication. NetApp doesn't do the former, and unfortunately they don't do the latter either. That's right: no replication with the NearStore VTL. Which means that if you want to replicate, you don't get any of the bandwidth savings that deduplication offers. You would have to replicate using your backup application, consuming 20 times more bandwidth than you would if you replicated only the deduplicated data.