One of the toughest issues to address with deduplication is performance. Backup performance is hard to characterize even under the best of circumstances, because so many things can improve (or degrade) it in a backup environment. The same holds true for deduplication appliances. A host of factors affect their performance, and quantifying those factors is hard, not least because the thing driving data to the appliance, the backup application, is itself difficult to measure reliably.
Having said that, there are five key questions that I think users should ask when it comes to performance and deduplication:
- How fast is my ingest speed with in-line deduplication?
- How fast is my deduplication speed if I am using delayed or scheduled deduplication? That is really a two-part question: how fast will initial writes be? And how fast will the post-process deduplication be?
- How fast are restores? Again, this is a two-part question: how fast are restores from deduplicated data? And how fast are restores from native, or fully hydrated, data, if it exists on the system?
- How fast is replication?
- And finally the most interesting question of all: can you characterize the performance of a system under multiple workloads? What if a system is ingesting, deduplicating, and replicating all at once? What happens to performance for each of those tasks in that case?
I have written a great deal in this blog about the performance of the EMC systems in these different circumstances. I have also tried to highlight cases in which the competition was, in my estimation, either not being very straightforward with their performance characterizations or, worse, making no meaningful characterization at all.
So when JL over at Sepaton said, in our last conversation on the subject, that further details on Sepaton's performance would be forthcoming, I was inclined to believe him. Now, JL has revealed the further details. And this amounts to a single metric: 25 TB per day per node.
Only one image is appropriate at this point, and it absolutely is a little old lady screaming "Where's the beef?"
Come on, Sepaton, any intelligent customer is going to need more detail than that! Given that Sepaton does post-process deduplication (only), we have to wonder: how fast are initial writes? How fast does deduplication proceed after the fact? How fast does replication occur? What if it is bi-directional replication? When does a virtual cartridge become eligible for replication? How fast are restores from anything more than a day old?
Until you understand numbers like these, and until your vendor can have a meaningful performance conversation that covers every one of these aspects, you could be in for some unpleasant surprises in enterprise deployments of deduplication technologies.
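To show why a single "TB per day" figure says so little, here is a rough back-of-envelope sketch that converts it into a sustained transfer rate. It rests on assumptions the published number does not settle: that "TB" means decimal terabytes, and that the figure is an average over some window of hours. The function name and the `window_hours` parameter are mine, purely for illustration.

```python
SECONDS_PER_HOUR = 3600
BYTES_PER_TB = 1e12  # assumption: decimal terabytes, not binary TiB
BYTES_PER_MB = 1e6   # assumption: decimal megabytes


def tb_per_day_to_mb_per_s(tb_per_day, window_hours=24.0):
    """Convert a daily volume into the average MB/s needed to move it.

    window_hours is the (hypothetical) number of hours per day in which
    the work actually has to be completed, such as a backup window.
    """
    if not 0 < window_hours <= 24:
        raise ValueError("window_hours must be in the range (0, 24]")
    total_bytes = tb_per_day * BYTES_PER_TB
    seconds = window_hours * SECONDS_PER_HOUR
    return total_bytes / seconds / BYTES_PER_MB


if __name__ == "__main__":
    # The same 25 TB/day claim, read against three different windows.
    for hours in (24, 12, 8):
        rate = tb_per_day_to_mb_per_s(25, window_hours=hours)
        print(f"25 TB/day over {hours:>2} h = {rate:,.2f} MB/s")
```

Spread across a full 24 hours, 25 TB per day works out to roughly 289 MB/s per node; squeezed into an 8-hour window, it is about 868 MB/s. Which of those the number describes, and whether it refers to ingest, post-process deduplication, replication, or restore, is exactly what the single metric leaves out.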