
November 05, 2008



W. Curtis Preston

You can say that your product does “immediate” deduplication, or “concurrent” deduplication, but saying it does inline deduplication is misleading. It’s the equivalent of calling asynchronous replication “synchronous” just because the target system is updated seconds after the source system. You’re not synchronous unless you hold up the original write until it’s replicated, and you’re not doing inline dedupe unless you dedupe before you write to disk. It’s also the equivalent of Symantec calling Continuous Protection Server continuous data protection (CDP) because it’s “continuously protecting the data.” You can’t recover to any point in time (you can only recover to when you last took a snapshot), so it doesn’t meet the definition of CDP – and your product doesn’t meet the definition of inline dedupe.

The definition of inline deduplication (which was decided at least five years before EMC entered the target dedupe market by OEMing Quantum) is dedupe that is done in such a way that the native, non-deduped data is never written to disk – ever. Since your own description of how your dedupe system works is that it, “waits for 250 MB of data … before deduplication begins. Data is written in its "native" format, and deduplicated…” you are not doing inline dedupe.

You are doing post-process deduplication with a very small unit of work, processing it immediately after backups. Where most post-process dedupe systems will wait until a virtual tape is put back on a virtual shelf – or until a backup file is closed (NAS) – before starting dedupe, you’re starting as soon as you receive 250 MB of data, so you’re starting the process a bit sooner. But you’re still processing the data after (post) you write it to disk – hence the term post-process.

I’m pretty sure I know what you’re thinking: post-processing systems wait until the entire backup is done, THEN dedupe it. This is a misconception that was started by inline dedupe sales reps and was never based in fact. Every post-processing vendor I know has always been able to process the backups as they’re coming in – just like you and Quantum can. You also have the choice to wait until all backups are done to start deduping data.
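For readers following along, the inline vs. post-process distinction being argued here can be sketched in a few lines of purely illustrative Python – the hashing, segment handling, and the `unit` parameter are stand-ins for the idea, not either vendor's actual implementation:

```python
import hashlib

def inline_dedupe(stream, store):
    """Inline: the dedupe decision happens BEFORE any write, so native,
    non-deduped data never touches disk."""
    for segment in stream:
        key = hashlib.sha1(segment).hexdigest()
        if key not in store:        # only unique segments are ever written
            store[key] = segment
        yield key

def post_process_dedupe(stream, disk_cache, store, unit=250 * 2**20):
    """Post-process: native-format data lands on disk first, then is
    deduplicated one unit of work at a time (250 MB here, per the DL3D
    description quoted above) -- dedupe starts as soon as a unit fills,
    not when the whole backup finishes."""
    buffered = 0
    for segment in stream:
        disk_cache.append(segment)  # the native write happens first
        buffered += len(segment)
        if buffered >= unit:
            while disk_cache:       # dedupe the cached unit immediately
                seg = disk_cache.pop(0)
                key = hashlib.sha1(seg).hexdigest()
                store.setdefault(key, seg)
            buffered = 0
```

Both paths end up storing only unique segments; the disagreement is entirely about whether the native data is ever written to disk along the way.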

I decided to put the rest of my response in my blog:

Scott Waterhouse

W Curtis... I welcome the debate, but... I think there are two levels to this conversation: one is what the user cares about. And honestly I think that just boils down to a performance conversation--in the sense of throughput, time to finish backup, and time to finish replication. The other level is the technical minutiae of what is really going on with the device. The second is also partly semantics in this case.

Let me start with the second level. You say that you don't want the market confused. Fair enough. Neither do we. Having said that, I think you are trying to fit a square peg into a round hole in order to define something. (By the way, we are fine with the "immediate" definition if you prefer that to in line.)

Why do I say that? You further wrote: "the second reason that the differentiation between inline and post-process is important is that post-processing systems can get “behind” and inline systems cannot." Well, the DL3D can't either. When it is running in "in-line" mode it will not build up a backlog of data in its cache. It will deduplicate data as it is received. If you send it data faster than it can deduplicate, it will bottleneck (and slow down the reception of data). Just like any other in line system.

So honestly, the DL3D meets at least one of the tests you proposed for in line deduplication.

Now back to the first level--and the reason I brought up the subject in the first place: why does it matter to users? Only for performance. And in this respect, the DL3D approach to in line (or immediate) deduplication has a huge advantage over competitive approaches. Our approach allows you to restore data up to 6 times faster than you can from an appliance that doesn't employ any sort of disk cache (like Data Domain). So it has every performance characteristic of an in line solution (a limit on write speeds, replication tied to deduplication and happening simultaneously, restore from truly deduped data at 1/4 the speed of a write to the system, etc.), except that if you are restoring from cached or non-truncated data, you can get up to a six-times performance improvement. And it seemed to me that was worth mentioning.
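Taking the figures above at face value (the 1/4-of-write-speed and 6x-from-cache ratios are Scott's claims, not independent measurements), the blended restore rate is simple arithmetic; the function name and the 400 MB/s example rate below are purely illustrative:

```python
def blended_restore_rate(write_rate_mb_s, cached_fraction):
    """Time-weighted restore rate when `cached_fraction` of the data comes
    from the native-format disk cache and the rest from truly deduped storage.
    Ratios follow the claims quoted above."""
    deduped_rate = write_rate_mb_s / 4.0  # "1/4 the speed of a write to the system"
    cached_rate = deduped_rate * 6.0      # "up to 6 times faster" from cache
    # total time per MB is a weighted harmonic mean of the two rates
    time_per_mb = (cached_fraction / cached_rate
                   + (1 - cached_fraction) / deduped_rate)
    return 1.0 / time_per_mb
```

Under these assumptions, a system that writes at a hypothetical 400 MB/s would restore at 600 MB/s entirely from cache but only 100 MB/s entirely from deduped data – which is why how much of a restore hits the cache matters so much to the comparison.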

W. Curtis Preston

As I just said in my response to your latest comment on my blog, I disagree that the 1/4th or 1/6th restore numbers you're stating are typical of the industry. I have seen completely different numbers from a number of your competitors.

Scott Waterhouse

Fair enough. Our experience differs.

Having said that, there are so many components to the performance conversation that it can be tough to make comparisons: how many streams on backup, how many on restore, over FC or IP, in-line (immediate) or delayed, how old is the data, how fast is the client, etc.

At the end of the day, it is important for vendors to honestly communicate these factors.

How come *no other vendor* posts *any* meaningful performance data? Do I think we go far enough at EMC? Personally? No. But we do disclose far more than others.

