I just saw an interesting article by Chris Mellor over at The Register. Chris was speculating on why we haven't seen any deduplicated backup on tape. Because, hey, if I can squeeze 20 times as much data on a disk using deduplication technology, doing so on tape has to be a good thing, right?
And it would allow tape vendors to show that the technology has some relevance beyond long term archiving.
Chris has a lot of good explanations as to why we don't have dedup on tape yet—except for CommVault and EMC. But he misses the biggest reason of all: deduplication might be really good for backup, but it is horrible beyond belief when it comes to restore. And as I, David Chapa, and every other backup specialist will tell you, it is all about restore.
Why is it so terrible? Seek time.
The average seek time of an LTO-4 tape drive is 57 seconds, which means that every time I want to read a bit of data that is not sequentially adjacent to the last piece of data I was reading, my tape drive is going to spend 57 seconds finding it.
But wait, it gets worse! How many pieces is a file going to be broken up into? Well, let's look at a 2 MB Word document. On average, a good variable-length deduplication technology will use a segment size of somewhere around 16 KB. That means that our 2 MB document will be composed of 128 segments. Even if some of those segments are, in fact, adjacent to each other on tape, many will not be. In fact, even if 3/4ths of them are adjacent, the remaining 1/4 (32 segments, each costing a seek) will push the restore time for that file to nearly 35 minutes. (I have allowed an extra 3 minutes for library find and drive load times.)
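If you want to check my arithmetic, here it is as a few lines of Python. The 57-second seek, the 16 KB segment size, the one-in-four non-adjacent segments, and the 3-minute load allowance are the assumptions stated above, not measured values:

```python
# Back-of-the-envelope restore time for one deduplicated 2 MB file on LTO-4.
# All inputs are the article's assumptions, not benchmarks.

FILE_SIZE_KB = 2 * 1024   # 2 MB document
SEGMENT_KB = 16           # typical variable-length dedup segment size
SEEK_SECONDS = 57         # LTO-4 average seek time
LOAD_SECONDS = 3 * 60     # allowance for library find and drive load

segments = FILE_SIZE_KB // SEGMENT_KB   # 128 segments
seeks = segments // 4                   # assume only 1 in 4 needs a seek
restore_minutes = (seeks * SEEK_SECONDS + LOAD_SECONDS) / 60

print(segments)          # 128
print(seeks)             # 32
print(restore_minutes)   # ~33.4 minutes, i.e. "nearly 35"
```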
That thumping sound you hear in the background is your backup administrator banging his head against a brick wall. And hoping that unconsciousness will swiftly relieve him of this madness.
So that is one file.
But what if we have 200 file restores to do in a day? (And I work with a lot of shops that service 500+ restore requests daily).
Well, we would need more than 12 tape drives exclusively doing restores all day long. Not a very good use of resources.
And what if we had a file server with 5,000,000 files that we needed to restore?
Would you believe it would take one tape drive more than 285 years to do the restore?
That’s an SLA I would like to see.
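For the curious, the drive-count and 285-year figures fall out of the same per-file seek cost. A sketch, under two assumptions of my own that make the numbers line up: seek time only (32 seeks of 57 seconds per file, no load time), and an 8-hour daily restore window for the drive count:

```python
# Scaling the per-file seek cost to a day's restore load and to a big file
# server. Assumptions (mine): 32 seeks x 57 s per file, seek time only,
# and an 8-hour daily restore window.

SEEK_SECONDS = 57
SEEKS_PER_FILE = 32
per_file_s = SEEKS_PER_FILE * SEEK_SECONDS   # 1824 s, about 30 minutes

# 200 file restores inside an 8-hour window
drives_needed = 200 * per_file_s / (8 * 3600)
print(drives_needed)   # ~12.7 -> "more than 12 tape drives"

# 5,000,000 files on a single drive, running nonstop
years = 5_000_000 * per_file_s / (3600 * 24 * 365)
print(years)           # ~289 -> "more than 285 years"
```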
So while I will never say it will never happen… It will never happen.
Now, let me back up a step too (as Chris rightly does in the later part of his article). There might be a use case here for archiving… Perhaps. After all, EMC uses deduplication to tape for Avamar, where it is acceptable from an SLA point of view to restore the tape to disk first, before initiating the actual data restore. This two-step restore makes sense because it is not done very often; it is a deep archive, to be referenced very rarely, not a repository for operational restore.
So if the technology were to be considered for long-term archives, there may be a glimmer of a use case there. Although I would also add the further observation that these deep archives can probably be approached from different technology directions to deliver more value to the business: for example, an application that specifically archives business-important data from file systems and email repositories. Because without some metadata and some indexing, the value of these deep archives really does begin to approach zero. A final observation: deep archives, when properly indexed and single-instanced, will benefit from deduplication relatively less than standard backup data, due to lower redundancy of the data at the file and segment level.