I spend a lot of time talking to people about Avamar. I think it is a great application that pretty radically reinvents backup architectures. It does so many things well, and has consistently been ahead of the curve in terms of delivering advanced backup technologies. For example, Avamar was one of the first to market with a true disk-based deduplication solution for backup.
And often I tell people that Avamar's overt association with deduplication is one of its challenges too, in the sense that Avamar actually does two things at the client level that are really important. It is not just deduplication, but the duration and resource utilization of the Avamar client that are so impressive. Not only does deduplication reduce the amount of data to be backed up by 99% or more in many cases, but the duration of an Avamar backup is usually 90% less than the duration of a standard backup. To those of us who are concerned about the growing amount of data and its impact on backup windows, this is a fundamentally important characteristic, and one that is, in my opinion, just as important as deduplication itself. (And one that I will discuss in a follow-up post.)
Having said that, the focus of this post is the nature of global client deduplication for Avamar. This is one of those things that I often get questions about. And they basically all come down to this: if Avamar is doing deduplication at the client level, how can deduplication be "global"? How can deduplication be cross-client in the Avamar architecture?
To answer this, let's follow the progression of the deduplication process in Avamar.
So what happens first? The first time that a client runs through the deduplication process, it will break all the data on a system down into segments (also called "chunks" by Avamar). Chunks are variable in length: the size depends on the type of data and on where within a data structure the piece resides. The chunk size from the beginning of a file may be different than the chunk size from the middle of the file. Not incidentally, this variable-length chunk sizing is partly responsible for the very high deduplication ratios that Avamar can achieve.
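To make the idea of variable-length chunks a little more concrete, here is a minimal sketch of content-defined chunking in Python. To be clear, this is not Avamar's actual algorithm, which I am not reproducing here. It is a generic gear-style rolling hash, and the function name `split_chunks`, the `GEAR` table, and the size parameters are all assumptions chosen for illustration. The point is only that boundaries are picked from the content itself, so chunk sizes vary.

```python
import random

# Hypothetical lookup table: one pseudo-random 32-bit value per byte value.
# Seeded so the same input always produces the same chunk boundaries.
_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes, min_size: int = 32, max_size: int = 512,
                 mask: int = 0x3F) -> list[bytes]:
    """Split data into variable-length chunks using a gear rolling hash.

    A boundary is declared when the low bits of the rolling hash are zero,
    so the cut points depend on the bytes themselves, not on fixed offsets.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shift the old hash and mix in the new byte; old bytes age out.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_size and (h & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks


if __name__ == "__main__":
    sample_rng = random.Random(1)
    sample = bytes(sample_rng.getrandbits(8) for _ in range(4096))
    print([len(c) for c in split_chunks(sample)])
```

The `min_size` and `max_size` limits simply keep chunks from becoming uselessly small or unboundedly large. A real product would tune these very differently.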
As those segments are determined, each one is hashed. So every segment gets a unique hash ID, derived from its contents, that can be used to identify that segment in the future. And each hash associated with a segment on a client is actually stored in two places: once on the client (in the p-cache) and once on the server (as part of the meta-data the server retains).
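As a rough illustration of hashing a chunk and remembering the hash in two places, consider the sketch below. The use of SHA-1 is my assumption for the example, and `client_cache` and `server_index` are hypothetical stand-ins for the client-side cache and the server-side meta-data, not Avamar's real data structures.

```python
import hashlib

# Hypothetical stand-ins: a set for the client-side hash cache and a dict
# for what the server retains (hash -> chunk bytes).
client_cache: set[str] = set()
server_index: dict[str, bytes] = {}


def chunk_id(chunk: bytes) -> str:
    """Return a content-derived ID: identical bytes always give the same ID."""
    return hashlib.sha1(chunk).hexdigest()


def record_chunk(chunk: bytes) -> str:
    """Hash a chunk and note the hash on both the client and the server side."""
    digest = chunk_id(chunk)
    client_cache.add(digest)
    server_index.setdefault(digest, chunk)  # stored once, however often seen
    return digest


if __name__ == "__main__":
    first = record_chunk(b"slide 1: agenda")
    second = record_chunk(b"slide 1: agenda")
    print(first == second, len(server_index))
```

Because the ID is computed from the bytes alone, two machines that have never talked to each other will still arrive at the same ID for the same chunk. That property is what the rest of the process relies on.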
But here is the key: before any actual data is sent to the server for backup, the hashes for those chunks are sent to the server first, and an "is-present" operation is done. Essentially, the client asks the server: do you already have this chunk of data? Normally, these hashes are bundled together in large groups by the client to reduce the chattiness of the backup process and reduce the number of IP packets generated during a backup.
If the server determines, by matching the hashes, that it is already retaining a chunk of data that is present on the client, then there is no need for the client to send that data to the server. Only when the chunk is globally unique (that is, not already stored by a previous backup from this or any other client) is the actual chunk transmitted to the server for retention.
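The exchange described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `DedupServer`, `is_present`, `put`, and `backup_chunks` are hypothetical names of my own, the "network" is just a method call, and SHA-1 is assumed as the hash. It shows hashes going to the server in batches, and chunk data following only when the server reports it missing.

```python
import hashlib
from typing import Iterator


def sha1_hex(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


class DedupServer:
    """Toy server: keeps one copy of each chunk, keyed by its hash."""

    def __init__(self) -> None:
        self.store: dict[str, bytes] = {}

    def is_present(self, digests: list[str]) -> list[bool]:
        # One answer per hash in the batch, in the same order.
        return [d in self.store for d in digests]

    def put(self, digest: str, chunk: bytes) -> None:
        self.store[digest] = chunk


def batches(items: list[bytes], size: int) -> Iterator[list[bytes]]:
    for i in range(0, len(items), size):
        yield items[i:i + size]


def backup_chunks(server: DedupServer, chunks: list[bytes],
                  batch_size: int = 64) -> int:
    """Back up chunks; return how many bytes of chunk data were sent."""
    sent_bytes = 0
    for batch in batches(chunks, batch_size):
        digests = [sha1_hex(c) for c in batch]
        present = server.is_present(digests)  # one query per batch
        uploaded: set[str] = set()  # avoids resending repeats within a batch
        for digest, chunk, have in zip(digests, batch, present):
            if have or digest in uploaded:
                continue
            server.put(digest, chunk)
            uploaded.add(digest)
            sent_bytes += len(chunk)
    return sent_bytes


if __name__ == "__main__":
    server = DedupServer()
    data = [b"a" * 10, b"b" * 10, b"a" * 10]
    print(backup_chunks(server, data, batch_size=2))
    print(backup_chunks(server, data, batch_size=2))
```

Grouping the hashes into batches is what keeps the conversation from becoming one round trip per chunk. The batch size here is arbitrary.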
So if I back up a presentation, and a colleague backs up that same presentation the next day, his backup client is going to talk to the Avamar server, and the server will inform it that the chunks associated with the presentation are already present and don't need to be sent again. All that will be sent is the meta-data associated with the presentation (essentially, we want the Avamar server to know that the client had the file on that day, where it sat in the directory structure, when it was last changed, and so on).
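Here is the presentation scenario as a small, self-contained Python sketch. Everything in it is hypothetical: the `Server` and `FileRecord` classes, the client names, the dates, and the tiny fixed-size chunks (used only to keep the example short, where the real product uses variable-length chunks). It is meant to show that the second client transfers no chunk data, yet the server still records a full per-client, per-day entry from which the file can be restored.

```python
import hashlib
from dataclasses import dataclass

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


@dataclass
class FileRecord:
    """Meta-data kept per client and backup: where, when, and which chunks."""
    path: str
    modified: str
    chunk_ids: list[str]


class Server:
    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # shared by all clients
        self.catalog: dict[tuple[str, str], list[FileRecord]] = {}

    def backup_file(self, client: str, date: str, path: str,
                    modified: str, data: bytes) -> int:
        """Record a file for one client; return chunk bytes transferred."""
        pieces = [data[i:i + CHUNK_SIZE]
                  for i in range(0, len(data), CHUNK_SIZE)]
        ids = [hashlib.sha1(p).hexdigest() for p in pieces]
        sent = 0
        for cid, piece in zip(ids, pieces):
            if cid not in self.chunks:  # only globally new chunks travel
                self.chunks[cid] = piece
                sent += len(piece)
        record = FileRecord(path, modified, ids)
        self.catalog.setdefault((client, date), []).append(record)
        return sent

    def restore(self, client: str, date: str, path: str) -> bytes:
        for record in self.catalog[(client, date)]:
            if record.path == path:
                return b"".join(self.chunks[c] for c in record.chunk_ids)
        raise FileNotFoundError(path)


if __name__ == "__main__":
    server = Server()
    deck = b"quarterly results deck"
    print(server.backup_file("me", "day-1", "/docs/deck.ppt", "day-0", deck))
    print(server.backup_file("colleague", "day-2", "/home/deck.ppt",
                             "day-0", deck))
    print(server.restore("colleague", "day-2", "/home/deck.ppt") == deck)
```

Note that the two clients keep the file under different paths. The chunk store does not care, because the path lives only in each client's own record.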
In this way, Avamar can perform backups locally, but deduplicate globally. And as a result, Avamar can typically reduce network traffic between backup clients and the Avamar server, often by 99% or more. In turn, this makes Avamar extremely effective at backing up remote offices and desktop/laptop environments. There are few solutions that can offer greater reductions in network traffic during backup, along with truly global deduplication.