Last time out, we talked about Avamar's global client deduplication. This time, I want to go into some more depth on what makes Avamar so fast for some types of backups.
Saying something is "fast" is always a bit subjective. I know that some things like Lamborghinis are fast. I know that Valentino Rossi is fast. But how fast is Avamar?
On average, Avamar is about 90% faster than most other backup products at backing up an unstructured data set. And in backup, speed matters. In fact, I would argue that it is the most important thing a backup application or platform can offer. So finding a solution that lets you cut backup times by 90% is important. And in some cases, it is crucial. For platforms and infrastructure that have experienced 24+ hour backup windows, this is a decisive factor.
How does Avamar do it? Well, Avamar has a pretty elegant way to reduce the amount of time that it takes to determine what needs to be backed up.
The technique is another way in which Avamar makes use of hashes, and is beautifully elegant in design and execution. And as far as I know, Avamar is the only backup application to make use of hashing in this way. The process works like this:
First, Avamar builds a hash of the filesystem root for a given client. Then it breaks the filesystem down into successively smaller pieces: dividing the filesystem into two, then dividing those pieces into two each, and so on. Each piece gets a hash value for its current state. This process continues until Avamar reaches units of 100 individual files, which also get hashed.
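To make the idea concrete, here is a minimal sketch in Python of that kind of hash tree. This is my own illustration, not Avamar's implementation: the function names, the SHA-1 choice, and the per-file "fingerprint" (standing in for whatever file state the real client actually hashes) are all assumptions for the sake of the example.

```python
import hashlib

GROUP_SIZE = 100  # leaf size from the description above; illustrative only


def node_hash(child_hashes):
    """A node's hash is the hash of its children's hashes concatenated."""
    return hashlib.sha1("".join(child_hashes).encode()).hexdigest()


def build_tree(files):
    """Build a hash tree over a sorted list of (path, fingerprint) pairs.

    The list is split in half recursively until a piece holds at most
    GROUP_SIZE files. Returns a nested dict: leaves look like
    {'hash': ..., 'files': [...]}, internal nodes look like
    {'hash': ..., 'left': ..., 'right': ...}.
    """
    if len(files) <= GROUP_SIZE:
        # Leaf: hash the per-file fingerprints directly.
        h = node_hash(f"{path}:{fp}" for path, fp in files)
        return {"hash": h, "files": files}
    mid = len(files) // 2
    left = build_tree(files[:mid])
    right = build_tree(files[mid:])
    # Internal node: hash over the two child hashes, so any change
    # below propagates all the way up to the root.
    return {"hash": node_hash([left["hash"], right["hash"]]),
            "left": left, "right": right}
```

Because each internal hash is derived from its children, changing a single file's fingerprint changes its leaf hash and every hash on the path up to the root, which is exactly the propagation behavior described next.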
Any time that a file changes, the hashes will change too. If a file changes, the hash for that group of files will change, as will the hash for each of the successively larger units above it. This enables Avamar to very rapidly isolate files that have changed—in a way that is far faster, and far more effective, than the alternative, brute force approach of asking the OS on a file-by-file basis if a file has changed (used by pretty much every other backup application).
Once the Avamar client has isolated the group of files in which a change has occurred, it will then determine the specific file. Once the file is identified, the process of chunking and hashing described in the previous post begins.
This is easiest to visualize in terms of directories. Let's assume that I have C:\, and C:\ has two directories of equal size: \documents and \system. Further, \system is divided into \applications and \library, also equal in size. And \library has a bunch of files in it. If one of those files changes, then Avamar will know that C:\ has changed, as has \system and \library. Assuming that nothing has changed in \documents, Avamar will not need to look at that branch of the filesystem hash tree at all. And after doing one hash check on \system, Avamar can determine that \applications has not changed, and doesn't need to be looked at further.
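The pruned traversal can be sketched in a few lines of Python. Again, this is my own illustration of the technique, not Avamar's code: it assumes hash trees shaped as nested dicts, where a leaf is {'hash': ..., 'files': [(path, fingerprint), ...]} and an internal node is {'hash': ..., 'left': ..., 'right': ...}.

```python
def changed_files(old, new):
    """Walk two hash trees in parallel, yielding paths of changed files.

    Any subtree whose root hash matches the previous backup's is pruned
    immediately; the walk only descends where hashes differ, so most of
    the filesystem is never examined at all.
    """
    if old is not None and old["hash"] == new["hash"]:
        return  # entire subtree unchanged: skip it
    if "files" in new:
        # Leaf: fall back to a per-file comparison within this small group.
        old_files = dict(old["files"]) if old and "files" in old else {}
        for path, fp in new["files"]:
            if old_files.get(path) != fp:
                yield path
        return
    yield from changed_files(old.get("left") if old else None, new["left"])
    yield from changed_files(old.get("right") if old else None, new["right"])
```

If nothing has changed anywhere, the very first hash comparison at the root prunes the entire walk, which is the best case described below.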
In this fashion, Avamar can typically eliminate 95% or more of the data from being examined at all (beyond a very few hash checks).
Take this to the extreme: what if I have a system whose C:\ holds 4 million files? And let's say this system hasn't changed at all since the last backup. How long will Avamar take to back this up? About 30 seconds. It will compare the hash of C:\ from today to the one from yesterday, determine they are the same and nothing has changed, do some housekeeping, and be done. How long would this take for a typical backup application? Somewhere between 4 and 8 hours. And the differences are just as dramatic when we see systems with change levels of 2-5% of the total files (pretty typical for a file server).
So it is this beautiful solution to the crucial backup problem of determining what has changed that enables Avamar to be so fast, and so much faster than anything else on the market.
Note that this discussion is largely about unstructured data sets—sets where there are lots and lots of files present. Usually many millions of files. You can see how the technique that I described above really only applies to unstructured data sets. Structured data sets, databases and the like, typically have a very small number of files: normally the OS, application and database binaries, and the database (.dbf or whatever) file itself. The majority of the work to back up structured data sets is in deduplicating and/or transmitting these huge database files. Which means that Avamar can't employ the methodology it uses on unstructured data sets to eliminate 90% or more of the data from being examined. With databases, we just have to brute force it.