If you have been involved with backup and recovery for any length of time, you probably know two things: one, you should be doing DR tests, and two, you don't do them as often as you should (or ever, in some cases).
Why do they get performed so infrequently? Because they are painful. And I mean really painful. Try packing up 5000 tape cartridges, some of your top IT operations people, and a whole bunch of operations manuals and run books, and heading off to a cold room without windows for three days of work. That would be three eighteen-hour days, usually. Over a weekend. Not only is this not a whole lot of fun for anybody involved, but there are two other problems.
(True story here... A customer of mine once shipped people and tapes more than 3000 km away to do a test. Unfortunately, the people ended up at one site 3000 km away, and the tapes ended up at another site that was also 3000 km from home but about 1000 km away from the people. That made for a bad weekend.)
In sum, if you are a betting person, you would bet on failure, not success. And you would be getting a lot better odds of a payoff than Vegas would afford you.
These issues were pretty well known and understood at a major retailer that we were working with at EMC Backup and Recovery Systems Division. They had a lot of tapes. The DR tests were expensive. And they were difficult. In fact, the testing was tough enough that they had never completed a successful DR exercise with TSM. Every time they did a test, they had three days in their cold windowless room, and every time they used all three days, and every time they didn't get a lot further than restoring the TSM (backup) server itself.
Then we began to talk to them about Data Domain and how it worked from a DR perspective. How it could minimize the bandwidth needed to get their data off site to their DR facility. How data would be replicated as it was backed up, so that a DR copy would be ready soon after the initial backup was complete (and by soon, think in terms of minutes). How there would be no more tapes to ship. How they wouldn't have to load thousands of tapes into a library at the DR site. How this meant that they could begin restoring their backup server immediately, not after a prolonged load and inventory process on a tape library.
They were convinced, but there was a problem. We were only two months away from the test. They had a question: was there any way that we could install three Data Domain systems, switch over to those systems from their existing backup and recovery infrastructure, complete the replication of their TSM storage pools, and be ready for a DR test in two months? Bear in mind that they hadn’t even placed anything on order yet!
Fast forward two months. The Data Domain systems were installed, with two DD880 systems at two different sites replicating to a single DD880 system at the DR facility. About 80 TB of physical capacity (and just shy of 1 PB of logical capacity) was replicated. The systems were performing admirably at keeping up with TSM's ingest and ongoing reclamation requirements.
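For anybody who wants to see what those capacity numbers imply, here is a quick back-of-the-envelope sketch. It assumes "just shy of 1 PB" can be rounded up to 1000 TB (decimal units), so the figures it produces are an upper bound, not a measured result. The function names are mine and purely illustrative; they are not part of any Data Domain or TSM tooling.

```python
def dedup_ratio(logical_tb: float, physical_tb: float) -> float:
    """Return the logical-to-physical ratio (e.g. 12.5 means 12.5:1)."""
    if physical_tb <= 0:
        raise ValueError("physical capacity must be positive")
    return logical_tb / physical_tb


def space_reduction_pct(logical_tb: float, physical_tb: float) -> float:
    """Return the percentage of logical data that never had to be stored or shipped."""
    if logical_tb <= 0:
        raise ValueError("logical capacity must be positive")
    return (1.0 - physical_tb / logical_tb) * 100.0


# Figures from the story. The logical value is an ASSUMPTION: "just shy of
# 1 PB" is rounded up to 1000 TB, so treat the results as an upper bound.
PHYSICAL_TB = 80.0
LOGICAL_TB = 1000.0

if __name__ == "__main__":
    ratio = dedup_ratio(LOGICAL_TB, PHYSICAL_TB)
    saved = space_reduction_pct(LOGICAL_TB, PHYSICAL_TB)
    print(f"Reduction ratio: roughly {ratio:.1f}:1")
    print(f"Space (and replication traffic) avoided: roughly {saved:.0f}%")
```

Under that rounding assumption, the ratio works out to something a little under 12.5:1, which is the arithmetic behind why replicating this much backup data to the DR site was feasible at all.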
And when it came time to do the DR test, the customer was able to use one of the most underrated capabilities of the Data Domain platform: snap copies. Snaps turn out to be doubly valuable in a TSM environment, because they can be used to accomplish two things. First, you can make a second copy of your backup data that you are free to do anything with. Be disruptive. Delete certain tapes. Try to mess things up. It doesn't matter, because your primary copy remains untouched. Second, TSM has an architectural issue with replication: it doesn't have a way to have two different TSM servers access the same set of storage pools. That makes it very difficult to leverage the contents of a copy pool to do a DR test, because copy pools are still "owned" by the server at the primary site. But with snap copies, I can simply mount the snap to the TSM server at my DR site and have a copy of my backup data available immediately. And I can do whatever sort of testing I want against this copy, including destructive testing, and it will do no harm to the replicated copy at the DR site.
So what were the results of the test? It took a day and a half. They demonstrated success by recovering TSM and key production servers. And they were done. They got to go home and spend the rest of the weekend doing things somewhat more entertaining than looking at the progress bar on a restore operation on the TSM management console. (A category which would include, in my opinion, watching paint dry. I'm guessing they chose to have a couple of beers, however.)
It was so easy it was boring. And for the first time ever they had the security of knowing for sure that if they suffered a real disaster they could recover their business. Because they had tested it.
(As a footnote for the technically precise in the crowd: there is a way to share storage pools between TSM servers but it is both relatively complex and relatively crippled. Certainly not all that helpful in a DR test scenario. So nobody that I am familiar with actually uses the capability.)