
July 21, 2008

Comments


W. Curtis Preston

You mention that "EMC offers a VTL that deduplicates that can ingest data at 2,200 MB/s" but you do not mention how fast that data is deduplicated. My understanding is that it's approximately 1/5th of the ingest rate you advertise. Therefore, while you can ingest data at 2200 MB/s, you can only do so for 4-5 hours per day if you want to dedupe it all. Assuming a typical 12-hour backup window and a 24-hour dedupe window, my math puts the "real" ingest rate at less than half of what you mentioned.

Scott Waterhouse

DL4000 3D performance really has two important dimensions: VTL performance, and deduplication performance. Because the VTL is the only interface that customers will interact with, it is an important one. And it is more than just a cache for out-of-band deduplication, because we allow deduplication to happen on a schedule. Meaning that data written to the VTL might be deduplicated a day later, a week later, or never. It depends on your SLAs, it depends on your restore requirements, and it depends on how long you want to retain the data. So far, in enterprise backup environments (clearly where the system is aimed at) this message has seemed to make sense to the customers I talk to: they do have different requirements for different applications, tiers of storage, and so on.

So, VTL performance is: 2,200 MB/s native. We can actually do a fair bit better than that based on "real world" internal tests, but that is the easily achievable number we choose for marketing purposes. The other important way to describe performance is 1,600 MB/s with hardware compression enabled (and most people do enable it for the capacity benefits).

Finally, the most conservative way to estimate compound performance (when there are simultaneous reads and writes) is that they won't add up to more than 1,600 MB/s with compression on. In truth, they often do--you might be able to get 1,000 MB/s of write at the same time as you get 800 MB/s of read, for example. But to avoid setting unrealistic expectations that this is "easy" or will happen in every circumstance, let's say that aggregate read and write will not add up to more than 1,600 MB/s.

Deduplication performance is 400 MB/s. That can run 24 hours a day, because it is a post-process deduplication. I typically don't recommend a scenario in which it would run for more than 20 hours a day on average, because I want to leave room for future growth, for restore requests, and the like. But at 20 hours per day, that is, roughly, 30 TB per day of deduplication capability.
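As a rough check on the "roughly 30 TB per day" figure, the arithmetic can be sketched as below. This is only an illustration: it assumes decimal units (1 TB = 1,000,000 MB) and a constant 400 MB/s rate, and the function name is made up for the example.

```python
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3_600


def daily_dedupe_capacity_tb(rate_mb_s: float, hours_per_day: float) -> float:
    """TB that a post-process dedupe engine can work through in one day,
    assuming it sustains rate_mb_s for hours_per_day hours."""
    return rate_mb_s * hours_per_day * SECONDS_PER_HOUR / MB_PER_TB


capacity_20h = daily_dedupe_capacity_tb(400, 20)  # the recommended ceiling
capacity_24h = daily_dedupe_capacity_tb(400, 24)  # running flat out
print(f"{capacity_20h:.1f} TB/day at 20 h, {capacity_24h:.1f} TB/day at 24 h")
```

Under these assumptions the 20-hour figure comes out a little under 30 TB, which is consistent with "roughly".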

W. Curtis Preston

If the device can only dedupe 30 TB a day, then it can only ingest 30 TB a day -- assuming you're going to dedupe it all. Therefore, if you ingest data at 2200 MB/s, you can only do so for 3.8 hours.

Here's my math:
30,000 GB / 2.2 GB/s = 13,636 seconds, or about 3.8 hours

But, if you need to use it for 12 hours (a typical backup window), it could only ingest data at about 700 MB/s.

Here's my math:
30,000 GB * 1,000 MB/GB / (12 hr * 3,600 seconds/hr) = 694 MB/s
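
The two calculations above can be reproduced with a short sketch. It assumes decimal units (1 GB = 1,000 MB) and takes the 30,000 GB daily dedupe limit as given; the function names are invented for illustration.

```python
# Assumptions: decimal units, and a fixed daily dedupe limit of 30,000 GB.
DAILY_DEDUPE_GB = 30_000
MB_PER_GB = 1_000
SECONDS_PER_HOUR = 3_600


def hours_to_fill(ingest_mb_s: float) -> float:
    """Hours of ingest at ingest_mb_s before the daily dedupe limit is reached."""
    return DAILY_DEDUPE_GB * MB_PER_GB / ingest_mb_s / SECONDS_PER_HOUR


def sustainable_rate_mb_s(window_hours: float) -> float:
    """Average ingest rate that spreads the daily dedupe limit over a backup window."""
    return DAILY_DEDUPE_GB * MB_PER_GB / (window_hours * SECONDS_PER_HOUR)


print(f"{hours_to_fill(2_200):.1f} hours at 2,200 MB/s")
print(f"{sustainable_rate_mb_s(12):.0f} MB/s over a 12-hour window")
```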

Yes, I know that the system has the capability to NOT dedupe some data, and that data could/should be excluded from these calculations, but since this system will likely be compared to other dedupe systems, it's important to understand its "real" throughput number if all data is to be deduped.

One final note: I also think/know that it will take time to move the data from the ingest VTL to the dedupe VTL (due to the 4000's "unique" design), and that this time could also impact the amount of data that can be deduped in a day, but I'm unsure of how to account for that.

Scott Waterhouse

Curtis;

I think you need to distinguish between ingest for the VTL and ingest for the dedup engine that is bonded to the VTL. The device can ingest 2,200/1,600/1,200 MB/s (depending on which perspective). Up to 30 TB/day of this can be deduplicated. The rest can be held for as long as you want (and as the maximum of 675 usable TB in the VTL permits).

Using an intake number of 1,200 MB/s (inclusive of compression and simultaneous write to deduplication engine), that is about 4 TB/hour. So the system can deduplicate in a day about what would be written in an 8 hour backup window.
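
For comparison across the three intake figures quoted above, here is a minimal sketch of hourly volume and the number of hours of writing that add up to a day of deduplication. It assumes decimal units (1 TB = 1,000,000 MB) and the 30 TB/day figure from earlier in the thread; the names are hypothetical.

```python
# Assumptions: decimal units, and 30 TB/day of deduplication capacity.
MB_PER_TB = 1_000_000
SECONDS_PER_HOUR = 3_600
DAILY_DEDUPE_TB = 30


def tb_per_hour(rate_mb_s: float) -> float:
    """TB written in one hour at a constant rate_mb_s."""
    return rate_mb_s * SECONDS_PER_HOUR / MB_PER_TB


def window_hours(rate_mb_s: float, daily_dedupe_tb: float = DAILY_DEDUPE_TB) -> float:
    """Hours of writing at rate_mb_s that equal one day of dedupe capacity."""
    return daily_dedupe_tb / tb_per_hour(rate_mb_s)


for rate in (2_200, 1_600, 1_200):
    print(f"{rate:>5} MB/s: {tb_per_hour(rate):.2f} TB/hour, "
          f"{window_hours(rate):.1f} hours to reach {DAILY_DEDUPE_TB} TB")
```

With these inputs, 1,200 MB/s works out to a bit over 4 TB/hour and a window of roughly 7 hours, in the same range as the "about 8 hour" estimate.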

In reference to your last sentence, all numbers discussed assume that the system is also reading or writing from the VTL--that is, the deduplication engine can ingest data at 400 MB/s irrespective of whatever else may be going on on the VTL.

So what is the real logic of all this? Three things: not everything ingested will deduplicate well (high change rate db, for example). Not everything will be kept for long enough to justify putting it on deduplicated storage. And some things you will want to be able to restore faster than deduplicated storage permits--so you want to leave them on the VTL for that period of time for which you want high speed restore.

Net net? The 4406 3D can ingest more than it can dedup in a day. There is a very big VTL space of 675 TB to write data to that is exclusive of the additional 148 TB of deduplicated storage. But yes, if you wrote at 2,200 MB/s (because you only cared about performance) you could write more to the VTL than the deduplication engine can ingest in a day.
