The Impact of Deduplication Methodology on Deduplication Ratios
Whew... big title, hopefully the rest of the post will be a little lighter on the polysyllabic words. Ooops...
Well, what's a day without a little irony?
On to the subject at hand: as we have discussed in the past, there are two times at which deduplication can happen: in-line, and out-of-band. In-line deduplication happens when the appliance intercepts the data stream from the backup application before it is written to disk, deduplicates it by comparing it to data it already has, and writes out only net new chunks of data. And I say chunks, because it could be bytes, blocks, segments, or something else! Out-of-band deduplication happens when the data is written to disk, and some time later, probably when the backup is complete, the appliance begins processing the data and removing redundant chunks and replacing them with pointers.
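To make the in-line case a little more concrete, here is a minimal sketch of the "compare, then write only net new chunks" idea. Everything in it is an assumption made for illustration: the fixed 4-byte chunk size, the SHA-256 fingerprints, and the `InlineStore` class are hypothetical and not how any particular appliance works.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; absurdly small, chosen only to keep the example readable


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class InlineStore:
    """Toy in-line deduplicating store: each chunk is checked before it is written."""

    def __init__(self):
        self.chunk_by_hash = {}  # fingerprint -> chunk bytes actually stored
        self.backups = []        # each backup is a list of fingerprints (pointers)

    def write_backup(self, data: bytes) -> int:
        """Ingest one backup and return how many bytes were newly written."""
        written = 0
        pointers = []
        for chunk in chunks(data):
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunk_by_hash:
                # Net new chunk: this is the only case that consumes disk.
                self.chunk_by_hash[fingerprint] = chunk
                written += len(chunk)
            pointers.append(fingerprint)
        self.backups.append(pointers)
        return written

    def restore(self, index: int) -> bytes:
        """Rebuild a backup by following its pointers."""
        return b"".join(self.chunk_by_hash[fp] for fp in self.backups[index])


store = InlineStore()
first = store.write_backup(b"AAAABBBBCCCCDDDD")
second = store.write_backup(b"AAAABBBBXXXXDDDD")
print(first, second)  # 16 4
```

A post-process design would end up with the same chunk table, but only after landing the whole second backup on disk first. That difference is what the rest of this post is about.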
Both methods have their merits, and neither particularly makes a difference to the amount of deduplication it is possible to get. However, somewhat counter-intuitively, which method you use does make a difference to how much disk you require.
Let's examine that a little closer...
In my example, I am going to assume that my backup job is 100 TB per night. I am going to take compression out of the equation, just to simplify the discussion. Further, I am going to assume there is no commonality of chunks or segments within the backup. (Both of these should have no impact on the conclusions of this discussion.) Finally, I am going to assume that there are 4 TB of changed data per night.
So, with in-line deduplication, I am going to do a full backup. That generates 100 TB of data for my appliance to store. Remember, no compression, no commonality within the backup, so the fact that I am deduplicating before I write actually saves me no disk at all yet. The next night, however, I am going to do another full backup. Because I am deduplicating before I write, I am only going to write 4 TB. And the next night? Same thing: another 4 TB.
That means that after 30 days I have written 216 TB, although I have backed up 3000 TB of data. After 60 days, I have written 336 TB, and backed up 6000 TB.
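The arithmetic behind those numbers is simple enough to sketch. The function names below are mine, and the model assumes exactly the simplifications stated above (one 100 TB full per night, 4 TB of net new data from the second night on).

```python
FULL_BACKUP_TB = 100   # size of each nightly full backup
NIGHTLY_CHANGE_TB = 4  # net new data written per night after the first


def inline_disk_written(days: int) -> int:
    """TB written to disk after `days` nightly fulls with in-line deduplication."""
    # Night 1 writes the whole backup; every later night writes only the change.
    return FULL_BACKUP_TB + NIGHTLY_CHANGE_TB * (days - 1)


def data_backed_up(days: int) -> int:
    """TB the backup application has sent to the appliance after `days` nights."""
    return FULL_BACKUP_TB * days


for days in (30, 60):
    print(days, inline_disk_written(days), data_backed_up(days))
# 30 216 3000
# 60 336 6000
```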
OK... hold on here! There is something I want to note: there are two ways to talk about deduplication ratios, and we need to keep this in mind whenever a vendor quotes one. They are:
- The first way talks about the ratio of source data to changed data (written). In my example this is 100 TB to 4 TB, or a deduplication ratio of 25:1.
- The second way is to talk about the ratio of source data to data stored. In my example above, after 30 days, this is 3000 TB to 216 TB, or a deduplication ratio of 13.89:1.
Knowing which number is being used is critical when you are trying to decipher how much storage you need!
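As a quick sketch of the difference, here are the two ratios computed from the numbers in my example. The function names are just labels I made up for the two definitions.

```python
def per_night_ratio(source_tb: float, changed_tb: float) -> float:
    """Source data : changed data for a single backup."""
    return source_tb / changed_tb


def cumulative_ratio(backed_up_tb: float, stored_tb: float) -> float:
    """Total data backed up : total data stored over the retention period."""
    return backed_up_tb / stored_tb


print(f"{per_night_ratio(100, 4):.2f}:1")      # 25.00:1
print(f"{cumulative_ratio(3000, 216):.2f}:1")  # 13.89:1
```

Same appliance, same data, and the quoted number differs by almost a factor of two depending on which definition is used.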
OK, back to the real discussion...
What happens with out-of-band or post-process deduplication? Well, the first backup will be 100 TB. The deduplication process will then start, and accomplish nothing. The second backup will also consume 100 TB of disk. After the fact, the deduplication process will reduce this to 4 TB, which means my net storage requirement is 104 TB, but I need to provision 200 TB of disk within my deduplication appliance. The third day is similar to the second--I store 108 TB after the deduplication process is complete, but I require 204 TB of disk.
This means that after 30 days, I require 312 TB of disk (although I will only store 216 TB), and after 60 days I require 432 TB of disk (although I will only keep 336 TB after the final deduplication process runs).
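One way to check those figures is to walk through the nights one at a time and track the high-water mark. This is a rough sketch under the same assumptions as before. It also assumes the deduplication pass always finishes before the next backup starts, which is my simplification.

```python
def post_process_disk(days: int, full_tb: int = 100, change_tb: int = 4):
    """Return (TB stored after the last dedup pass, peak TB of disk required)."""
    stored = 0
    peak = 0
    for day in range(1, days + 1):
        # The full backup lands on disk before any deduplication happens.
        landed = stored + full_tb
        peak = max(peak, landed)
        # After the pass, only the net new data from this backup remains.
        stored += full_tb if day == 1 else change_tb
    return stored, peak


for days in (2, 3, 30, 60):
    print(days, post_process_disk(days))
# 2 (104, 200)
# 3 (108, 204)
# 30 (216, 312)
# 60 (336, 432)
```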
So, if I want to discuss deduplication ratios as the amount of data backed up : the amount of disk required, I can get the following:
| | In-Line | Post-Process |
| --- | --- | --- |
| 30 Days | 13.89:1 | 9.62:1 |
| 60 Days | 17.86:1 | 13.89:1 |
Interesting? I think so!
So, if post-process deduplication is actually less space efficient, then why does anybody do it? Why does EMC offer it as an option? I mean, deduplication is all about space savings, right?
Well, hold on. There is more to it than that! Speed, effectiveness of deduplication ratios, etc. matter too. So do manageability, reliability, scalability, and a whole bunch of other good things.
But with respect to in-line vs. out-of-band, there are two big claims made as to why you would choose to do out-of-band deduplication. One is performance. The other is effectiveness of the deduplication.
First up: performance. It is true. If you go back to my post on DL3D performance, you can see that out-of-band is faster. OK, good. So it is a trade-off. Fair enough: depending on your requirements, and depending on your priorities, you might choose one rather than the other.
Second: effectiveness. Some post-process deduplication vendors claim that out-of-band deduplication is actually intrinsically more effective. To go back to my prior example, they claim that rather than only being able to reduce the backups to 4 TB of change data, post-process deduplication can do better. Some go so far as to say that it is twice as good: that the net amount of deduplicated data stored in this case would be only 2 TB.
Now, I have to say that I have yet to see any data that would support such a conclusion. But, for the sake of illustration, let's say it is true. What does that mean to my storage? Well, it means that after 30 days I would store 158 TB of data, although I would require 256 TB of disk. After 60 days, I would store 218 TB of data, but I would require 316 TB of disk.
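For anyone who wants to plug in their own numbers, the stored and required figures can be written in closed form with the nightly net new data as a parameter. This is a sketch with names I have made up. The 2 TB value is the vendor claim taken at face value for illustration, not something I have measured.

```python
def stored_tb(days: int, change_tb: float, full_tb: float = 100) -> float:
    """TB left on disk after the dedup pass following backup number `days`."""
    return full_tb + change_tb * (days - 1)


def required_tb(days: int, change_tb: float, full_tb: float = 100) -> float:
    """Peak TB needed: yesterday's deduplicated data plus tonight's full backup."""
    if days == 1:
        return full_tb
    return stored_tb(days - 1, change_tb, full_tb) + full_tb


for change in (4, 2):
    for days in (30, 60):
        print(change, days, stored_tb(days, change), required_tb(days, change))
# 4 30 216 312
# 4 60 336 432
# 2 30 158 256
# 2 60 218 316
```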
Now my deduplication ratio chart looks like this:
| | In-Line | Post-Process |
| --- | --- | --- |
| 30 Days | 13.89:1 | 11.72:1 |
| 60 Days | 17.86:1 | 18.99:1 |
So, I actually need a longer retention period for this to matter! Even if post-process deduplication is twice as efficient at deduplicating data, it is still less space efficient in the short term. In this example the two approaches need the same 296 TB of disk at day 50, so only when we retain the data for longer than about 50 days does post-process become more space efficient.
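To pin down where the crossover falls in this simple model, a brute-force search over retention periods is enough. The sketch below assumes the same 100 TB fulls, 4 TB per night for in-line, and the claimed 2 TB per night for post-process. The function names are mine.

```python
def inline_disk(days: int, change_tb: float = 4, full_tb: float = 100) -> float:
    """Disk required with in-line deduplication after `days` nights."""
    return full_tb + change_tb * (days - 1)


def post_process_disk(days: int, change_tb: float = 2, full_tb: float = 100) -> float:
    """Peak disk required with post-process deduplication after `days` nights."""
    if days == 1:
        return full_tb
    # Two fulls on the second night, then the deduplicated change for each later night.
    return 2 * full_tb + change_tb * (days - 2)


def first_day_post_process_wins(max_days: int = 365):
    """First retention period (in days) where post-process needs strictly less disk."""
    for days in range(1, max_days + 1):
        if post_process_disk(days) < inline_disk(days):
            return days
    return None


print(inline_disk(50), post_process_disk(50))  # 296 296
print(first_day_post_process_wins())           # 51
```

So the two are level at day 50, and post-process only pulls ahead from day 51 onward, which sits between the 30 and 60 day rows of the chart.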
Now, two final notes.
First, as I have said, I have seen no evidence that post-process deduplication can actually be much more efficient at all. Based on a bunch of factors I would make an informed estimate that it may be 10% more efficient in some cases. If that is the case (say 3.6 TB stored per night instead of 4 TB), the break-even in my example does not arrive until around day 242, so we can safely conclude that post-process is less space efficient than in-line deduplication for any retention period short of roughly eight months.
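For the curious, the break-even day in this simple model can be worked out directly: in-line needs F + c(n - 1) TB and post-process needs 2F + c'(n - 2) TB, so they are equal at n = (F + c - 2c') / (c - c'). The sketch below assumes that "10% more efficient" means 3.6 TB of net new data per night instead of 4 TB, which is my reading rather than any vendor's definition.

```python
from fractions import Fraction


def break_even_day(inline_change_tb, post_change_tb, full_tb=100):
    """Day on which in-line and post-process need the same amount of disk.

    Returns None when post-process stores no less per night than in-line,
    because it can then never catch up.
    """
    c_in = Fraction(str(inline_change_tb))
    c_pp = Fraction(str(post_change_tb))
    if c_pp >= c_in:
        return None
    # Solve F + c_in*(n - 1) == 2F + c_pp*(n - 2) for n.
    return (Fraction(full_tb) + c_in - 2 * c_pp) / (c_in - c_pp)


print(break_even_day(4, 2))    # 50
print(break_even_day(4, 3.6))  # 242
```

In other words, with a 10% advantage post-process would need roughly eight months of retention in this example before it caught up on disk, well beyond the 30 and 60 day windows considered here.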
Second, and finally, the other factors besides deduplication performance are restore performance and replication performance. There are actually four components to post-process deduplication performance: initial write to disk, deduplication speed, time to finish replication of the day's backup, and restore speed. When EMC (and I would suppose other vendors) say that post-process is faster, we are really saying that the first two components are faster than with in-line deduplication. The third and fourth may or may not be.
Again, there are other considerations beyond performance, but we can see that if space utilization was one of them, that would incline us to choose in-line deduplication. In a coming post I will look at the performance impacts of the two approaches, and we shall see what other interesting differences we can highlight!