What will backup in the cloud look like? Chuck Hollis wrote a post a few weeks ago on the subject.
I have been thinking about the issues that he raised ever since. Mostly because there was something about the piece that didn't quite satisfy me. There was something missing. Something that I couldn't put my finger on at first.
So I asked a more fundamental question: why do we back up the way we do today?
And the only good answer I could come up with was: because that is the way we have always done it.
And I really don't like that answer.
In and of itself it is almost never a good answer. It may be that we have always done it that way, and there are good reasons for doing it that way. But if we are just doing it because we have always done it, and there is no other reason, that isn't good enough.
Not when the process is so badly broken.
Make no mistake: backup today is broken. Badly broken. It involves too many pieces. Too many components of the infrastructure. Has too many dependencies. Takes too much administrative effort. Takes too long. And it doesn't do a very good job of leveraging new technologies.
All because we take a host-centric view of backup.
What I mean is best seen by taking a look at the data flow in a traditional backup environment, where data is owned by hosts--both in the logical and the physical sense (except in the rare case of genuine clustered file systems or databases). Consider the diagram below:
When it comes time to back up, it is almost always the case that the host that owns the data (#2 above--the application server or servers) is responsible for the backup of that data. Thus, in most cases this host reads the data from storage (#3 above) and sends the data to another host (#1 above--the backup server--which can be a Media Server or a Storage Node, if you are a NetBackup or a NetWorker user respectively), which then sends the data to the backup target (#4). These days that target can be physical tape, virtual tape with or without deduplication, or just disk.
What is the matter with this? Well, to begin with: the data traverses the network 3 times (storage to application server, application server to backup server, backup server to backup target), and drives I/O on all four components in our infrastructure: the backup server, application server, storage, and backup target. That is a heck of a workload for (just) backup.
But wait. It gets worse.
What if you need to replicate the target? Well, there are really two scenarios:
- You let the backup target (#4) do the replication. This is a really good solution in the sense that I minimize the involvement of other parts of the infrastructure, and I minimize the load on the network--especially if my backup target deduplicates. On the other hand, this is not such a great solution if I want backup catalog consistency: all backup applications have a long way to go (with NetBackup OST showing some early leadership and direction in this respect) in achieving this. And, in fairness, it is not just the fault of the backup applications, as the backup targets also have a long way to go in terms of integrating with the backup applications.
- You let the backup server (#1) do the replication. This is the far more common approach, due to the problem with the first approach: backup application/backup target integration and catalog consistency. Unfortunately, it means that you put the data on the network two more times, and drive I/O on both the backup target and the backup server. Again. For a grand total of 5 times that the data has to traverse the network.
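The hop counting above is easy to check with a few lines of Python. This is only a sketch: the component labels mirror the numbered boxes in the diagram, while the replica target and the helper names (`count_traversals`, `components_touched`) are hypothetical and not part of any backup product. The one-hop cost of target-driven replication is my own reading of the first scenario.

```python
# Hypothetical labels for the numbered components in the diagram.
BACKUP_SERVER = "#1 backup server"
APP_SERVER = "#2 application server"
STORAGE = "#3 storage"
BACKUP_TARGET = "#4 backup target"
REPLICA_TARGET = "replica backup target"  # assumed second-site target

# Each tuple is one trip of the backup data across the network.
TRADITIONAL = [
    (STORAGE, APP_SERVER),           # the owning host reads the data
    (APP_SERVER, BACKUP_SERVER),     # the host sends it to the backup server
    (BACKUP_SERVER, BACKUP_TARGET),  # the backup server writes it to the target
]

# Replication driven by the backup server: read the data back, then send it on.
SERVER_DRIVEN_REPLICATION = [
    (BACKUP_TARGET, BACKUP_SERVER),
    (BACKUP_SERVER, REPLICA_TARGET),
]

# Replication driven by the backup target itself (assumed to be a single hop).
TARGET_DRIVEN_REPLICATION = [
    (BACKUP_TARGET, REPLICA_TARGET),
]


def count_traversals(*paths):
    """Total number of times the data crosses the network."""
    return sum(len(path) for path in paths)


def components_touched(*paths):
    """Every component that has to do I/O on the data."""
    return {component for path in paths for hop in path for component in hop}


print(count_traversals(TRADITIONAL))                             # 3
print(count_traversals(TRADITIONAL, SERVER_DRIVEN_REPLICATION))  # 5
print(count_traversals(TRADITIONAL, TARGET_DRIVEN_REPLICATION))  # 4
```
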
To circle back a bit here, what struck me most about Chuck's discussion is that none of this really changes if I put my backup in the cloud, or I am backing up servers/infrastructure in the cloud.
Yes, my architectural diagram will change somewhat, but mostly in minor ways: the same basic architecture and logical components remain. And yes, there is some opportunity to reduce the amount of data that traverses the network and reduce the number of logical components if I use Avamar, with its source-based deduplication. Only in that case will I significantly reduce the network load--perhaps by 99%--and reduce the number of components, as my backup server and backup target essentially become one component (I combine #1 and #4).
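To make the idea of source-based deduplication concrete, here is a minimal sketch. It is emphatically not how Avamar works internally: the fixed 4 KB chunk size, the `DedupTarget` class, and the `backup` function are all assumptions made for illustration. The point is only that the source hashes its data and ships just the chunks the target has never seen.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class DedupTarget:
    """Hypothetical stand-in for the combined backup server/target (#1 and #4)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the target does not hold yet (metadata only)."""
        return {d for d in digests if d not in self.chunks}

    def store(self, digest, chunk):
        self.chunks[digest] = chunk


def backup(data, target):
    """Send only the chunks the target lacks; return the payload bytes sent."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    digests = [hashlib.sha256(chunk).hexdigest() for chunk in chunks]
    needed = target.missing(digests)
    sent = 0
    for digest, chunk in zip(digests, chunks):
        if digest in needed:
            target.store(digest, chunk)
            needed.discard(digest)  # each unique chunk crosses the network once
            sent += len(chunk)
    return sent


target = DedupTarget()
monday = b"a" * 4096 + b"b" * 4096 + b"a" * 4096
tuesday = b"a" * 4096 + b"c" * 4096

print(backup(monday, target))   # 8192: the repeated "a" chunk is sent once
print(backup(monday, target))   # 0: nothing new to send
print(backup(tuesday, target))  # 4096: only the changed chunk is sent
```

How close you get to a figure like 99% depends entirely on how much of the data repeats between backups; the sketch shows the mechanism, not a measurement.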
Now I would argue that these are pretty good reasons to consider Avamar.
But I am looking for a more general solution. A solution that can further reduce the amount of I/O on the network, and further reduce the number of infrastructure components involved. Ideally, the solution should also reduce the complexity, increase the reliability, reduce the administrative effort, and reduce the amount of time it takes to do a backup.
Chuck's post is particularly significant and timely in that if we are moving infrastructure and services into the cloud, there is a lot of rethinking and rearchitecting that needs to be done. So what better time to rethink and rearchitect backup?
Collectively we, the vendors of infrastructure and backup applications, as well as the end users of these components, have a huge opportunity to make things better. To make a change in the way we do backup. Let me put that another way: to simply move the process and procedure that we follow now into the cloud would be to waste a huge opportunity.
So what should backup look like? Consider the following diagram:
In this case, the backup server's job (and the role of application servers in backup) is reduced to job scheduling, sending and receiving meta-data, and managing the backup catalog database. So the backup server (#1) will instruct the storage (#3) to begin backup. The storage will transmit the data to the backup target (#4) and it will be deduplicated there, at the target. Alternatively, if the storage does deduplication at the source, we can imagine that the same process is followed, only there will be much less data traffic on the network, as only deduplicated data needs to be sent to the backup target.
The backup server would manage the lifecycle of backup images, the replication of the backup catalog and policies to secondary sites, and the migration of data between tiers of service at the backup target.
To contrast this with the "traditional" approach to backup: data traverses the network just 1 time, and drives I/O on storage and the backup target only. Only meta-data needs to traverse the network to the backup server from storage and the backup target. Application servers (#2) are not involved in backup operations at all--unless an application-consistent backup is desired, in which case the application server should only need to alert the storage that the data is in a consistent state, and can be snapped or cloned to create an image from which to drive backup.
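The contrast between the two data paths can be sketched by tagging every transfer as either data or meta-data. The `Transfer` type and the two helper functions below are hypothetical names used only for this illustration; the flows themselves follow the descriptions in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    src: str
    dst: str
    kind: str  # "data" or "metadata"


# Traditional, host-centric backup: the data itself makes three trips.
TRADITIONAL = [
    Transfer("#3 storage", "#2 application server", "data"),
    Transfer("#2 application server", "#1 backup server", "data"),
    Transfer("#1 backup server", "#4 backup target", "data"),
]

# Proposed, data-centric backup: one bulk transfer, the rest is meta-data.
PROPOSED = [
    Transfer("#1 backup server", "#3 storage", "metadata"),        # "begin backup"
    Transfer("#3 storage", "#4 backup target", "data"),            # the only data hop
    Transfer("#3 storage", "#1 backup server", "metadata"),        # catalog updates
    Transfer("#4 backup target", "#1 backup server", "metadata"),  # catalog updates
]


def data_traversals(flow):
    """How many times the backup data itself crosses the network."""
    return sum(1 for t in flow if t.kind == "data")


def data_components(flow):
    """Components that have to do I/O on the backup data."""
    return {c for t in flow if t.kind == "data" for c in (t.src, t.dst)}


print(data_traversals(TRADITIONAL))       # 3
print(data_traversals(PROPOSED))          # 1
print(sorted(data_components(PROPOSED)))  # ['#3 storage', '#4 backup target']
```
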
To me, two things really stand out about this approach.
First, we are adopting a more data-centric, less host-centric approach, and a more physical, less logical, view of backup. Hosts matter only insofar as they are the initiators or targets for meta-data; and, in the case of the backup server, as host and manager of the policy catalog and retention database. We are making backup all about the data, and not about every piece of the infrastructure. We are stripping away every extraneous component, leaving only the bare essentials.
Second, we are overtly changing backup from an application to a service. With the appropriately structured and appropriately located retention syntax, models and policies (again, this should not be driven by the backup server but should be driven by the application/service as part of its core service definition within the cloud; the backup server may be a repository--only--but even that is of questionable value) every application/server that joins your infrastructure/cloud automatically has backup provisioned and provided as a scheduled service. With little to no intervention by administrators, and little to no management.
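One way to picture backup as a service is a service definition that carries its own backup policy, so that joining the cloud provisions backup as a side effect. Everything in this sketch is hypothetical (`BackupPolicy`, `ServiceDefinition`, `Cloud.join`, and the field names); it shows the shape of the idea, not an existing API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BackupPolicy:
    frequency_hours: int
    retention_days: int
    app_consistent: bool = False  # whether the application must signal a consistent state


@dataclass
class ServiceDefinition:
    name: str
    volumes: list
    backup: BackupPolicy  # the policy travels with the service, not the backup server


@dataclass
class Cloud:
    # (service name, volume) -> policy; stands in for a backup schedule
    schedule: dict = field(default_factory=dict)

    def join(self, service):
        # Backup is provisioned when the service joins; no separate admin step.
        for volume in service.volumes:
            self.schedule[(service.name, volume)] = service.backup


cloud = Cloud()
crm = ServiceDefinition(
    name="crm",
    volumes=["vol1", "vol2"],
    backup=BackupPolicy(frequency_hours=24, retention_days=30, app_consistent=True),
)
cloud.join(crm)
print(len(cloud.schedule))  # 2
```
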
It is time to radically simplify backup. It is time to radically reduce the load and operational complexity of backup.
It is time to fix backup.
Isn't sending data from the media on which it lives to the backup media, without having to go through the network, what NDMP is / was supposed to do?
Posted by: David Magda | June 23, 2009 at 04:31 PM
David... yes it is. Unfortunately, NDMP only works for filers (EMC Celerra and NetApp primarily) and carries some unfortunate legacy baggage in terms of how it defines backups (level 0, level 1, etc.). It doesn't work on just any old storage device, and probably isn't really well suited to backup of structured data sets. Lately it has been forced to do some weird things too--OST, anybody? At the end of the day, I guess I would have to chat to somebody that understands the guts of the protocol a lot better than I to see if it could be generalized to suit the purpose I describe above, or not. My suspicion is that it might be better to start again, but that is mostly a WAG.
Posted by: Scott Waterhouse | June 23, 2009 at 05:05 PM
This is known within Tivoli Storage Manager (TSM) as a LAN-free backup. The component used for this, the Storage Agent, has been around now for years. The actual data is copied over the SAN infrastructure (storage to backup target) and the metadata flows over the LAN. An entire RedBook (along with the picture you've drawn above) is written about this subject. Maybe you'll want to look into this and you'll see some similarities with your story.
Posted by: Tommy Hueber | June 24, 2009 at 12:41 AM
Tommy;
The LAN-free client is still only half way there to what I describe. With a LAN-free client, the data path still includes the application server; so in my diagram above the data would follow the path: #3 --> #2 --> #4. From a TSM server perspective, you do get the benefit that it only has to deal with meta-data, but the application server still has to bear an I/O load as well as generate meta-data. There are analogues for NetBackup and NetWorker in a SAN Media Server and a Dedicated Storage Node. But again, these are only half way to where we could go.
I apologize if I confused the issue by using network to describe both FC and IP connections. I did it deliberately because I feel that at some point we are going to have a converged network, and it was just simpler to discuss the issue in those terms.
Posted by: Scott Waterhouse | June 24, 2009 at 07:04 AM
Scott, you're correct. Data movement directly from #3 to #4 is known as server-free and needs some sort of 'data mover'.
Posted by: Tommy Hueber | June 26, 2009 at 01:15 AM
NetBackup RealTime does what you describe as well in terms of 1-traverse of the network. When you first turn RealTime on it does a 3rd party SAN copy from the storage array to another storage array without going through the host - just storage to storage. Then the changes the host makes are just tracked in a journal after the initial mirror/sync which is also only 1-traverse (two if you count it for each array it goes to). That journal + the initial mirror allows for any-point-in-time recovery. Being a part of NetBackup also allows the use of the NBU app agents to do transaction consistent bookmarks where no data is sent (0-traverse!), just a time stamp is inserted into RealTime. You could also do off-host backups to a backup app like NBU, NetWorker or TSM where additional copies are made to tape or disk - although it would do a 2-traverse across the network since the backup app media server reads the data.
Posted by: Joe Pfeiffer | June 26, 2009 at 03:05 PM