I have been getting lots and lots of questions about Avamar lately. And there seems to be lots of confusion about what Avamar is, what it does, where it fits, and so on. And because I have to wait 5 days to talk about all the really cool stuff that I have been alluding to with my countdown, I thought a quick tour through the world of Avamar might be fun.
So buckle your seatbelts. It is going to be weird at times. You might wonder if you should have chosen the blue pill instead of the red pill. But it will be worth it.
With that said, the first question I usually get is: "What is Avamar? Is it a backup application? Or is it hardware? We know it does deduplication, but what is it?"
And the answer is: both. And more.
Avamar is a backup application. It has a client, and a server. Clients send backup data to servers.
Avamar is hardware (sort of) in the sense that EMC sells a product called a Data Store. A Data Store is an appliance that runs the server component of the Avamar application. It offers all the usual benefits that appliances have over the build-it-yourself approach. You can read the marketing; I don't need to repeat it here.
But Avamar is more, because it is also now a component of EMC Networker. Meaning that Networker has now implemented Avamar functionality, and that a given client can do "traditional" backups, Avamar backups, or both. More on that later.
So even though Avamar can be one of three things, let's focus in on the application itself for this post. We will set aside the Data Store implementation, and the Networker integration, for another time.
First up: the client. The client is, in a lot of ways, the brains of the operation. The client does all the deduplication. But, as is often the way with things, the how is the interesting part.
- The client does deduplication on a segment basis. Segments are, well, bigger than blocks and smaller than files. They are also variable in length. And this is where some of the "secret sauce" of Avamar is. Variable in length because Avamar is smart enough to know where within a data structure the segment is, and what type of data it is, and size it accordingly. For example, the segment from the start of a PowerPoint will be different in size from a segment in the middle of a .pdf or the start of an Excel spreadsheet.
- Essentially therefore the client looks for globally unique segments. However it is a little like a hippy: it thinks globally and acts locally. (And Jed and Tony just had a minor cardiac event with that analogy, I am sure...) Here is what happens, simplified:
- The client looks for changes in the file system, and then changes in files, and then new segments. (There is some very cool technology here by the way--it is one of the ways in which Avamar can back up big file systems 90% faster, or more, than any competitive backup application.)
- Once it identifies a new segment, it hashes it (twice) to produce a unique fingerprint.
- Once it has collected all the new fingerprints, it queries the server, and asks if the Avamar server already has any of the fingerprints. Quite often, it does. In fact, Avamar can do as much as 50% better at identifying commonality within data on different client systems than other, target deduplication solutions.
- Finally, it gets a message back from the server identifying all the segments that are already stored. The client can then do the math, subtract out those segments already on the server, and transmit only truly unique segments. And this is HUGE. The practical implication here is that we only transmit a tiny fraction of the data that a normal backup would generate. When we say that Avamar reduces network requirements by as much as 99.9%, this is why. The Avamar client typically uses less than 1% of the network (LAN and/or WAN) capacity that a normal backup application requires.
- As a result, only globally unique segments are sent to the server. Each segment may be referenced multiple times (representing its presence in each backup, and across several clients) but it is only stored once (logically speaking).
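For anyone who thinks better in code, here is a minimal sketch of that exchange. It is an illustration only, not Avamar's implementation: the `Server` class, `split_segments`, `fingerprint`, and `back_up` are hypothetical names, the segments are fixed-size rather than variable-length, and a single SHA-1 stands in for the real (double) hashing, whose details I have not described here.

```python
import hashlib


class Server:
    """Hypothetical stand-in for the server: each segment is stored once, keyed by fingerprint."""

    def __init__(self):
        self.segments = {}

    def known(self, fingerprints):
        # Answer the client's query: which of these fingerprints are already stored?
        return {fp for fp in fingerprints if fp in self.segments}

    def store(self, new_segments):
        self.segments.update(new_segments)


def fingerprint(segment):
    # Assumption: one SHA-1 pass, purely for illustration.
    return hashlib.sha1(segment).hexdigest()


def split_segments(data, size=8):
    # Assumption: fixed-size segments to keep the sketch short.
    return [data[i:i + size] for i in range(0, len(data), size)]


def back_up(data, server):
    """Send only segments the server does not already hold.

    Returns the ordered list of fingerprints for this backup and the
    number of segment bytes actually transmitted.
    """
    segments = split_segments(data)
    by_fp = {fingerprint(s): s for s in segments}
    already_stored = server.known(by_fp.keys())
    to_send = {fp: s for fp, s in by_fp.items() if fp not in already_stored}
    server.store(to_send)
    recipe = [fingerprint(s) for s in segments]
    return recipe, sum(len(s) for s in to_send.values())


server = Server()
_, sent_first = back_up(b"AAAAAAAABBBBBBBBCCCCCCCC", server)   # everything is new
_, sent_second = back_up(b"AAAAAAAABBBBBBBBDDDDDDDD", server)  # only the last segment is new
print(sent_first, sent_second)
```

The second backup could just as well come from a different client: the server only cares whether it has seen the fingerprint before, which is the "thinks globally" part.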
Finally, for the sake of being complete, two more things we should note:
- Each backup is called a "snap-up". For Avamar, a snap-up is a backup, and vice versa.
- Each snap-up is a full backup. Because all references to segments are pointers anyway (Avamar never stores a full "file" on the server), every day Avamar has a complete pointer reference to everything on the client. So from an Avamar perspective, we only do full backups. This is conceptually similar, sort of, to TSM. But of course it is done at the segment level, rather than the file level. I raise the issue primarily because there has been some comparison between the two methodologies in the media. However, TSM properly only does progressive incrementals (for unstructured data), and Avamar does full snap-ups every day.
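To make the "every snap-up is a full" idea concrete, here is a small sketch under simplifying assumptions. The names `store`, `snap_up`, and `restore` are hypothetical, and SHA-1 is just a convenient fingerprint for the example. Each snap-up is recorded as a complete, ordered list of pointers, even though only unseen segments are actually added to the store.

```python
import hashlib

# Hypothetical segment store: fingerprint -> segment bytes, each kept once.
store = {}


def snap_up(segments):
    """Record a snap-up as a complete ordered list of fingerprints.

    Only segments not yet in the store are added; the returned list
    still references every segment, so it describes the full data set.
    """
    recipe = []
    for seg in segments:
        fp = hashlib.sha1(seg).hexdigest()
        store.setdefault(fp, seg)
        recipe.append(fp)
    return recipe


def restore(recipe):
    # Any snap-up can be restored on its own, with no chain of incrementals to replay.
    return b"".join(store[fp] for fp in recipe)


day1 = snap_up([b"header", b"body-v1", b"footer"])
day2 = snap_up([b"header", b"body-v2", b"footer"])  # only "body-v2" is newly stored
```

Both `day1` and `day2` are full pointer lists, so either restores directly, yet the store holds four segments rather than six.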
So that is the client, at a high level. Next up, the server.
Stay tuned.
I think you left out a step, where the Avamar client checks a local cache to see if new fingerprints match those of segments that have previously been backed up from that particular host, which further reduces the amount of network traffic. (That's my understanding anyway.)
--------
Walter: true. I think it was implied, but I am posting your comment here just to clarify things. Thanks for the comment--Scott.
Posted by: Walter | May 15, 2008 at 11:25 AM
How do you properly size a new grid and make sure it does not fill up with data?
We are using a 3rd party hosted Avamar solution and it works really well. Is it best to start with a smaller grid and add to it as necessary? We have a large mix of Networker tape backups and outsourced Avamar.
Posted by: Frank | December 23, 2009 at 02:51 PM
Frank;
Sizing can be a bit of an art. EMC has a tool that can accurately size an environment, and I would advise you try to get your Avamar provider or EMC to use it for your environment. To size accurately you need to account for commonality across platforms, change rate, retention times, amount of source data, and so on.
You can get an estimate by using a dedup calculator (like the one I link to) but that doesn't take into account commonality across Avamar clients, and doesn't size for a grid... The sizing tool really is the best way.
Assuming it is not grossly more than you require, I usually recommend starting with a DS5 or DS6 (5 or 6 node grid) as upgrades from those configurations follow an easier, less disruptive path than upgrades from single/dual node configurations.
Posted by: Scott Waterhouse | December 23, 2009 at 03:09 PM
Scott,
Thanks for your reply. We went through a mini sizing exercise way back in January while we were going through a POC with our Avamar service provider.
We have multiple locations and datacenters.
The interest in tapeless backup (and restore) has exploded in our company as a result. I have 10 locations across the globe using Avamar now.
We were pricing out a 13.3TB Avamar grid thinking that this would be large enough for today and leave us room for future growth.
Should we revisit the EMC sizing tool? I would say - yes.
I know it's the right thing for our company but it's difficult to understand and explain how all the moving parts work.
What is your stance on large databases - SQL, Domino, etc.? I have so many questions :).
Sizing sure is an art!!!
Posted by: Frank | December 23, 2009 at 03:47 PM
I would definitely advise going through a new sizing exercise. Bear in mind that the new Avamar nodes are of a different physical size, and you want to work with your sizer to ensure they are sizing based on the older 2 TB nodes.
As far as databases go, it depends on what you consider large! ;)
Generically, anything under 1 TB or so is fine (with a possible exception of Domino servers which seem to generate exceptionally high change rates). Anything between 1-2 TB should be carefully considered. What is the change rate? What is the tolerance of the host to a backup process? Can you run a proxy server? Is it a VM or a physical system? Anything above 2 TB may be OK, but would almost certainly require a proxy. Another huge generalization: you are probably only going to do this if this is the last thing you have that you want to put on Avamar--i.e. doing this means you can turn off your traditional backup.
You said you had NetWorker, so your other strategy might be to run NW + Data Domain systems for databases and high change rate large size datasets, and Avamar for the remainder (remote, smaller, VMware, etc.).
If you have other questions just post them up, and if they seem common to me I will address them in a separate post.
Posted by: Scott Waterhouse | December 23, 2009 at 04:10 PM
Actually, they are not large - it is the daily change rate of the database. So right now, this is expensive using the 3rd party provider since they charge us for daily changes.
Database size - 120GB
Daily change rate - 50 to 60 GB
The Domino server in question is a physical machine that will be converted to a VM in the new year.
This is a great blog, btw.
Do you provide contact information? I can certainly send you mine.
Posted by: Fiaria | December 23, 2009 at 04:53 PM
You can find me at scott dot waterhouse at gmail dot com or at my linkedin profile: http://ca.linkedin.com/in/sjwaterhouse Send me a note at gmail if you want my EMC address (or you can figure it out easily if you know our standard format of first name underscore last name at emc dot com).
With that kind of change rate your options are limited, unfortunately. It would be interesting to explore hosting a DD box at your DR provider site to see if that would be any less expensive.
Posted by: Scott Waterhouse | December 23, 2009 at 05:12 PM