Over the last little while, we have become used to seeing discussions of data and storage for Big Data, as if Big Data possessed a unique set of challenges around the acquisition, management, replication, and protection of data. And so it does.
Because after reviewing the latest set of statistics from the 2011 Digital Universe study, it is clear that the challenges (or "opportunities," as Chuck put it on a recent call) faced by Big Data users today are going to be the same challenges that most enterprise IT shops will have to face in the years to come.
So for the long-term thinkers in the backup and recovery community, I think there is some value in paying attention to the problems Big Data users face with respect to the backup and recovery of their data sets, because their solutions are likely to shape enterprise data protection in the years to come.
Generally speaking, Big Data can be separated into two types: structured and unstructured. And one of the things the Digital Universe study makes clear is that unstructured data is growing faster than structured data.
The striking thing here is that unstructured data is growing on a path similar to Moore's Law: doubling every eighteen to twenty-four months. Structured data is not growing as fast, but it is still growing faster than staffing.
And one of the key things to emerge from the latest Digital Universe study is just that: staffing is not keeping pace with growth. Moreover, over the next ten years, there will be (see the quick calculation after this list):
- 10X the number of servers (virtual and physical) worldwide
- 50X the amount of information managed by enterprise datacenters
- 75X the number of files
- Only 1.5X the number of IT professionals in the world
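
Taken together, those multipliers make the staffing gap easy to quantify. Here is a minimal back-of-envelope sketch; only the multipliers come from the study, and the arithmetic is mine:

```python
# Back-of-envelope: Digital Universe ten-year growth multipliers.
# Only the multipliers come from the study; the ratios are simple arithmetic.
servers_x, data_x, files_x = 10, 50, 75
staff_x = 1.5

print(f"Servers per IT professional: {servers_x / staff_x:.1f}x today's load")  # ~6.7x
print(f"Data per IT professional:    {data_x / staff_x:.1f}x today's load")     # ~33.3x
print(f"Files per IT professional:   {files_x / staff_x:.1f}x today's load")    # 50.0x
```

In other words, each administrator ends up responsible for roughly thirty times the data and fifty times the files they manage today. That is the context for the questions that follow.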
So it is time to ask ourselves: what does this mean for our approach to backup? Could your present backup infrastructure and methodology deal with 10 times the number of servers that you currently protect? Could it scale to protect 50 times the amount of information? Could it deal with 75 times as many files or objects?
I think there are actually two sets of answers to those questions: one technical and one based on business or cost considerations.
First, the technical: for most organizations, I suspect the answers are yes, maybe, and no respectively. Scaling the number of clients is not such a big challenge for most backup platforms. Scaling to protect 50 times as much data is probably not such a big issue so long as your backup target platform can keep up; I wrote on that particular issue here.
Scaling to protect 75 times as many files, however, is an entirely different challenge. As any good backup administrator knows, dealing with large unstructured repositories of data doesn't challenge you in the normal way. Throughput is usually limited not by the overall size of the data and the platform it resides on, but by the ability of the operating system and backup application to process the file system and metadata. File servers with 20 or 50 million files don't take days to back up because of the amount of data associated with them; they take days to back up because it takes that long for most backup applications to sort and process the metadata the file system and operating system are providing them. And with the exception of Avamar, which tends to back up large unstructured repositories about 10 times faster than any other backup application, there is just no relief in sight for this particular problem.
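
A rough model makes the bottleneck visible. The rates below are illustrative assumptions, not measured benchmarks, but the shape of the result holds for any plausible numbers:

```python
# Rough model: full-backup wall-clock time for a large file server.
# Both rates are illustrative assumptions, not measured benchmarks.
files = 50_000_000        # files on the server
data_mb = 10_000_000      # total data: 10 TB, expressed in MB
metadata_rate = 1_000     # files/sec the backup app can walk and catalogue
stream_rate = 500         # MB/sec the backup target can ingest

metadata_hours = files / metadata_rate / 3600
streaming_hours = data_mb / stream_rate / 3600

print(f"Metadata walk:  {metadata_hours:.1f} hours")    # ~13.9 hours
print(f"Data streaming: {streaming_hours:.1f} hours")   # ~5.6 hours
print(f"At 75x the files: {metadata_hours * 75 / 24:.0f} days just to walk metadata")  # ~43 days
```

Even with a target fast enough to drain the data itself in an afternoon, the metadata walk dominates, and it scales with file count rather than capacity.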
From a business point of view, however, the answer may well be the same: yes, maybe, and no.
Yes, because presumably as the number of servers we protect grows, so will the capacity of an individual backup server. Costs will remain more or less static in that respect.
Maybe, because as I discussed in the other post referenced above, so long as your deduplication target can sustain Intel curves (Moore's Law) with respect to throughput and capacity, and not Seagate curves (doubling every ten years), your backup infrastructure will be able to keep pace with your structured data growth. Again, costs will remain more or less static over time. Continual investment will be required, but the magnitude of that investment will not change dramatically.
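
To see how far apart those two curves end up, compound each doubling interval over a decade; a small sketch, assuming the intervals given above:

```python
# Compound growth over ten years under the two doubling intervals above.
years = 10
intel_growth = 2 ** (years / 2)     # Moore's Law: doubling roughly every two years
seagate_growth = 2 ** (years / 10)  # disk curve: doubling roughly every ten years

print(f"Intel curve over {years} years:   {intel_growth:.0f}x")    # 32x
print(f"Seagate curve over {years} years: {seagate_growth:.0f}x")  # 2x
```

A 32x improvement roughly keeps pace with the study's projected 50x data growth; a 2x improvement leaves you an order of magnitude behind.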
And no, because the cost of protecting an unstructured data store that is 75x larger than today's unstructured repositories may no longer be justifiable.
And here is the heart of the problem: as tactical problems become strategic problems (and you had better believe that a data protection bill for $125m would be a strategic problem for the business), we will need to rethink how we protect data, how we back it up, and what retention policies are appropriate.
The next question is, of course, how do we accomplish that? How do you work with your business to determine what an appropriate, and appropriately costly, protection mechanism for data is? There is really only one (good) answer to this question, in my opinion: what is the data worth?
Let me make one final observation, based largely on personal experience and not founded in any exhaustive or statistically rigorous survey: the Big Data consumers of today tend to have a very exact understanding of the value of the data they own. Most organizations that don't have Big Data, even if they have very large amounts of data, don't have any idea how much their data is worth. Genomics companies, geology and geophysics companies, Big Media companies: all of them understand very precisely what their content is worth.
And they can pick an appropriate data protection strategy. Maybe spending $125m would be appropriate if the data were worth $20bn. (It wasn't in the case above.)
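
Put as arithmetic (using the hypothetical figures from the example above):

```python
# Protection spend as a fraction of data value (hypothetical figures from above).
protection_cost = 125e6   # $125m data protection bill
data_value = 20e9         # $20bn asset value

print(f"Protection spend: {protection_cost / data_value:.2%} of data value")  # ~0.6%
```

Well under one percent of asset value might be a defensible insurance premium; the same bill against data worth a fraction of that clearly is not, which is exactly the judgment the organization above had to make.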
But I have yet to work with an organization outside of the Big Data owners, even one with multiple PB of data, that knows what its data is worth. And if you don't know what your data is worth, how will you determine what the best backup strategy is? How will you be able to defend the current approach, or justify changing to a new one? Alternately, you may one day be faced with a bill that escalates to $125m; in that case, it is pretty easy to figure out that you have to change something. But what will you change to?
In my opinion, these are the big tough questions that the expanding Digital Universe will force us to ask.