What will backup in the cloud look like? Chuck Hollis wrote a post a few weeks ago on the subject.
I have been thinking about the issues that he raised ever since. Mostly because there was something about the piece that didn't quite satisfy me. There was something missing. Something that I couldn't put my finger on at first.
So I asked a more fundamental question: why do we back up the way we do today?
And the only good answer I could come up with was: because that is the way we have always done it.
And I really don't like that answer.
In and of itself it is almost never a good answer. It may be that we have always done it that way, and there are good reasons for doing it that way. But if we are just doing it because we have always done it, and there is no other reason, that isn't good enough.
Not when the process is so badly broken.
Make no mistake: backup today is broken. Badly broken. It involves too many pieces. Too many components of the infrastructure. Has too many dependencies. Takes too much administrative effort. Takes too long. And it doesn't do a very good job of leveraging new technologies.
All because we take a host-centric view of backup.
What I mean is best seen by taking a look at the data flow in a traditional backup environment, where data is owned by hosts--both in the logical and the physical sense (except in the rare case of genuine clustered file systems or databases). Consider the diagram below:
When it comes time to back up, it is almost always the case that the host that owns the data (#2 above--the application server or servers) is responsible for the backup of that data. Thus, in most cases this host reads the data from storage (#3 above) and sends the data to another host (#1 above--the backup server--which can be a Media Server or a Storage Node, if you are a NetBackup or a NetWorker user respectively), which then sends the data to the backup target (#4). These days that target can be physical tape, virtual tape with or without deduplication, or just disk.
What is wrong with this? Well, to begin with: the data traverses the network 3 times, and drives I/O on all four components in our infrastructure: the backup server, the application server, the storage, and the backup target. That is a heck of a workload for (just) backup.
But wait. It gets worse.
What if you need to replicate the target? Well, there are really two scenarios:
- You let the backup target (#4) do the replication. This is a really good solution in the sense that it minimizes the involvement of other parts of the infrastructure and minimizes the load on the network--especially if the backup target deduplicates. On the other hand, it is not such a great solution if you want backup catalog consistency: all backup applications have a long way to go in achieving this (with NetBackup OST showing some early leadership and direction in this respect). And, in fairness, it is not just the fault of the backup applications, as the backup targets also have a long way to go in terms of integrating with the backup applications.
- You let the backup server (#1) do the replication. This is the far more common approach, due to the problem with the first approach: backup application/backup target integration and catalog consistency. Unfortunately, it means that you put the data on the network two more times, and drive I/O on both the backup target and the backup server. Again. For a grand total of 5 times that the data has to traverse the network.
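The traversal counts above can be made concrete with a toy model. This is purely illustrative: the component names follow the #1--#4 numbering in the text, but the hop list and the `io_load` helper are my own invention, not any real backup API.

```python
# Toy model of the traditional backup data path: each tuple is one
# trip of the data across the network, (sender, receiver).
NETWORK_HOPS_TRADITIONAL = [
    ("storage (#3)", "application server (#2)"),        # host reads its own data
    ("application server (#2)", "backup server (#1)"),  # host sends it to the backup server
    ("backup server (#1)", "backup target (#4)"),       # backup server writes it to the target
]

# Replication driven by the backup server (#1) adds two more hops:
NETWORK_HOPS_WITH_REPLICATION = NETWORK_HOPS_TRADITIONAL + [
    ("backup target (#4)", "backup server (#1)"),       # read the backup image back
    ("backup server (#1)", "remote backup target"),     # send it to the secondary site
]

def io_load(hops):
    """Count how many times each component drives I/O for the backup."""
    load = {}
    for src, dst in hops:
        for component in (src, dst):
            load[component] = load.get(component, 0) + 1
    return load

print(len(NETWORK_HOPS_TRADITIONAL))       # 3 network traversals for backup alone
print(len(NETWORK_HOPS_WITH_REPLICATION))  # 5 once the backup server replicates
print(io_load(NETWORK_HOPS_WITH_REPLICATION))
```

Note what the I/O tally shows: with server-driven replication, the backup server touches the data four separate times--the busiest component in the whole flow, for a job that is conceptually just "copy the data somewhere safe."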
To circle back a bit here, what struck me most about Chuck's discussion is that none of this really changes if I put my backup in the cloud, or I am backing up servers/infrastructure in the cloud.
Yes, my architectural diagram will change somewhat. But mostly in minor ways; the same basic architecture and logical components remain. And yes, there is some opportunity to reduce the amount of data that traverses the network and reduce the number of logical components if I use Avamar, with its source-based deduplication. In that case I will significantly reduce the network load--perhaps by 99%--and I will reduce the number of components, as my backup server and backup target essentially become one component (combining #1 and #4).
Now I would argue that these are pretty good reasons to consider Avamar.
But I am looking for a more general solution. A solution that can further reduce the amount of I/O on the network, and further reduce the number of infrastructure components involved. Ideally, the solution should also reduce the complexity, increase the reliability, reduce the administrative effort, and reduce the amount of time it takes to do a backup.
Chuck's post is particularly significant and timely in that if we are moving infrastructure and services into the cloud, there is a lot of rethinking and rearchitecting that needs to be done. So what better time to rethink and rearchitect backup?
Collectively we, the vendors of infrastructure and backup applications, as well as the end users of these components, have a huge opportunity to make things better. To make a change in the way we do backup. Let me put that another way: to simply move the process and procedure that we follow now into the cloud would be to waste a huge opportunity.
So what should backup look like? Consider the following diagram:
In this case, the backup server's job (and the role of application servers in backup) is reduced to job scheduling, sending and receiving meta-data, and managing the backup catalog database. So the backup server (#1) will instruct the storage (#3) to begin backup. The storage will transmit the data to the backup target (#4), and it will be deduplicated there, at the target. Alternately, if the storage does deduplication at the source, we can imagine that the same process is followed, only there will be much less data traffic on the network, as only deduplicated data needs to be sent to the backup target.
The backup server would manage the lifecycle of backup images, the replication of the backup catalog and policies to secondary sites, and the migration of data between tiers of service at the backup target.
To contrast this with the "traditional" approach to backup: data traverses the network just once, and drives I/O on the storage and the backup target only. Only meta-data needs to traverse the network to the backup server from the storage and the backup target. Application servers (#2) are not involved in backup operations at all--unless an application-consistent backup is desired, in which case the application server should only need to alert the storage that the data is in a consistent state, and can be snapped or cloned to create an image from which to drive backup.
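The control-plane/data-plane split described above can be sketched in a few lines. All class and method names here are hypothetical illustrations of the idea, not a real product interface; the key property is that only metadata ever reaches the backup server, while the one bulk data transfer goes straight from storage to target.

```python
class BackupServer:
    """#1: schedules jobs and keeps the catalog -- handles no bulk data."""
    def __init__(self):
        self.catalog = []

    def run_backup(self, storage, target):
        metadata = storage.send_to(target)  # instruct the storage; get metadata back
        self.catalog.append(metadata)       # only metadata lands on the server
        return metadata

class Storage:
    """#3: owns the data and streams it directly to the target."""
    def __init__(self, blocks):
        self.blocks = blocks

    def send_to(self, target):
        written = target.write(self.blocks)  # the single data traversal
        return {"blocks_sent": written}

class BackupTarget:
    """#4: receives the stream and deduplicates at the target."""
    def __init__(self):
        self.store = set()

    def write(self, blocks):
        self.store.update(blocks)  # identical blocks are stored only once
        return len(blocks)

server = BackupServer()
storage = Storage(["a", "b", "a"])
target = BackupTarget()
server.run_backup(storage, target)
print(server.catalog)      # [{'blocks_sent': 3}] -- metadata only
print(len(target.store))   # 2 unique blocks kept at the target
```

The design choice that matters is in `run_backup`: the backup server issues an instruction and records the result, but the data itself never passes through it.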
To me, two things really stand out about this approach.
First, we are adopting a more data-centric, less host-centric approach, and a more physical, less logical, view of backup. Hosts matter only insofar as they are the initiators or targets for meta-data; and, in the case of the backup server, as host and manager of the policy catalog and retention database. We are making backup all about the data, and not about every piece of the infrastructure. We are stripping away every extraneous component, leaving only the bare essentials.
Second, we are overtly changing backup from an application into a service. With appropriately structured and appropriately located retention syntax, models, and policies (again, these should be driven not by the backup server but by the application/service as part of its core service definition within the cloud; the backup server may serve as a repository--only--but even that is of questionable value), every application/server that joins your infrastructure/cloud automatically has backup provisioned and provided as a scheduled service--with little to no intervention by administrators, and little to no management.
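What "backup as a service" might mean operationally can be sketched as follows. The policy fields, service classes, and the `Cloud.join` registry are all invented for illustration; the point is that a backup policy is derived from the service definition at join time, with no administrator in the loop.

```python
# Hypothetical per-service-class backup policies, part of the cloud's
# service definitions rather than the backup server's configuration.
DEFAULT_POLICIES = {
    "database": {"schedule": "hourly", "retention_days": 35},
    "web":      {"schedule": "daily",  "retention_days": 14},
}

FALLBACK_POLICY = {"schedule": "daily", "retention_days": 7}

class Cloud:
    def __init__(self):
        self.backup_schedule = {}

    def join(self, server_name, service_class):
        # Backup is provisioned as a side effect of joining the cloud --
        # no ticket, no backup administrator, no per-host client install.
        policy = DEFAULT_POLICIES.get(service_class, FALLBACK_POLICY)
        self.backup_schedule[server_name] = policy
        return policy

cloud = Cloud()
print(cloud.join("orders-db-01", "database"))        # gets the database policy
print(cloud.join("batch-worker-07", "unknown"))      # still covered, via the fallback
```

Even a server of an unrecognized service class is covered by a default policy the moment it joins--which is the opposite of today's model, where an unprotected host stays unprotected until someone notices.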
It is time to radically simplify backup. It is time to radically reduce the load and operational complexity of backup.
It is time to fix backup.