I have been spending a lot of time with TSM accounts lately, so I thought I would take the opportunity to discuss some of the gotchas and lessons learned from these environments as they moved from a tape-centric backup environment to a disk-based backup environment using EMC Data Domain technology.
The first and biggest thing each of these customers has realized is the incredible impact the Data Domain systems have on the schedule that TSM operates under. TSM is architected a bit differently (understatement of the year) than most other backup products. And for a lot of reasons, the TSM server is usually busy 24 hours a day. In fact, it could almost always be busy for more than 24 hours a day, but there just isn't enough time or resources to let it do everything it wants and needs to do.
Let's take a look at a typical day in the life of a TSM server:
20:00 to 08:00 backups to disk cache (even in 100% tape environments)
21:00 to 09:00 migration from disk cache to tape
06:00 to 12:00 copy activity to create daily off-site (copy pool) tapes
12:00 to 13:00 TSM database backup to tape
13:00 to 13:30 copy activity to create off-site copy of TSM db (copy pool)
13:00 to 16:00 migration from disk cache to tape
16:00 to 17:00 expiration activity
17:00 to 20:00 reclamation activity
Now I have simplified this a bit: normally some of these activities are interwoven. Migration from disk cache to tape is often an ongoing process. Reclamation often spills over into the backup window. But overall, it is a fair generalization to say that almost every TSM server out there using tape needs more hours than there are in a day to get through this activity.
So what usually gets left out? Reclamation. 98% of the time, the shortfall in time and resources is made up by reducing the amount of reclamation that happens. In turn this means that more and more tape is consumed, and the density of data on tape drops. In TSM environments, it is not unusual to see less than 50% of the total tape capacity used by current valid backup data. In some cases, I have seen 70-90% of tapes wasted on unreclaimed data.
Now let's look at a typical schedule after implementing a Data Domain system:
20:00 to 04:00 backups to disk cache (still sometimes required even with Data Domain)
20:00 to 04:00 backups to Data Domain from LAN-free clients
21:00 to 06:00 migration from disk cache to tape
06:30 to 07:30 TSM database backup to Data Domain
08:00 to 11:00 expiration activity
11:00 to 15:00 reclamation activity
So what has happened? First, we have moved larger backup clients to a LAN-free method, sending their data directly to Data Domain. A Data Domain system offers far more virtual resources (up to 256 virtual tape drives, or 512 on a GDA) than most environments ever have access to in the physical world, and that is what makes this simple but enormously beneficial architectural change possible.
Second, we have got rid of the copy pool activities. These are a huge drain on the time and resources of the TSM server during the course of the day. By eliminating them entirely, and replacing them with Data Domain replication, we save many hours of processing. Incidentally, we also reduce the size of the TSM database (because it no longer holds entries for every backup object, and every version of each object, retained off-site). We cut the TSM server's reclamation workload roughly in half, because it does not need to do reclamation processing against the copy pool. We reduce the CPU and I/O load on the server. These are all very good things.
Third, database backups and off-site copies are complete by 7:30 in the morning. This means that a full copy of the TSM backup pool and database is available for disaster recovery many hours earlier in the day than it would be with physical tape. In fact, depending on the duration of the copy pool job and the timing of your couriers, off-site tape may not actually make it off-site for 24-36 hours for some TSM users. This means, by the way, that the best RPO those users can achieve is 72 hours. With Data Domain, we have an RPO of no more than 30 hours.
Finally, reclamation is going to run far faster. In general, reclamation from virtual tape on Data Domain is going to run four to ten times faster than reclamation from physical tape. This means that we can be aggressive with our reclamation policies, and reclaim a tape when it has 20-30% expired data, rather than 70-90%. The result is far more efficient use of our backup infrastructure.
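To make the threshold change concrete, here is a minimal Python sketch of how a lower reclamation threshold picks up more volumes. The volume names and percentages are made up for illustration, and the `>=` comparison is my assumption for the sketch, not a statement of exactly how TSM evaluates its threshold.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    pct_expired: float  # percent of the volume occupied by expired data


def eligible_for_reclamation(volumes, threshold_pct):
    """Return the names of volumes whose expired data meets the threshold."""
    return [v.name for v in volumes if v.pct_expired >= threshold_pct]


# Hypothetical volumes, purely for illustration.
volumes = [
    Volume("VOL001", 15),
    Volume("VOL002", 25),
    Volume("VOL003", 45),
    Volume("VOL004", 75),
    Volume("VOL005", 90),
]

# A conservative policy, typical of a time-starved tape environment.
tape_policy = eligible_for_reclamation(volumes, 70)

# An aggressive policy, affordable once reclamation runs much faster.
aggressive_policy = eligible_for_reclamation(volumes, 25)

print(tape_policy)        # ['VOL004', 'VOL005']
print(aggressive_policy)  # ['VOL002', 'VOL003', 'VOL004', 'VOL005']
```

With the 70% threshold, VOL002 and VOL003 keep carrying expired data indefinitely. With the 25% threshold they get cleaned up too, which is where the better density comes from.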
The net result here is that we have taken a typical TSM server from requiring 30+ hours to get through its daily activities, to 18-20 hours to get through the same activities.
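As a rough check on the 18-hour end of that range, the sketch below takes the two schedules listed earlier and counts how many wall-clock hours of the day have at least one activity running. The function names are my own, and the windows are the simplified ones from this post, not measurements from any particular server. Note that the tape schedule can only ever show 24 hours here, because it has already been trimmed to fit in a day; the 30+ hours it really wants cannot be seen from the schedule alone.

```python
DAY = 24 * 60  # minutes in a day


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def busy_minutes(windows):
    """Wall-clock minutes per day covered by at least one window."""
    busy = [False] * DAY
    for start, end in windows:
        s, e = to_minutes(start), to_minutes(end)
        length = (e - s) % DAY  # modulo handles windows that cross midnight
        for offset in range(length):
            busy[(s + offset) % DAY] = True
    return sum(busy)


# The typical tape-based day from the first schedule.
TAPE = [
    ("20:00", "08:00"), ("21:00", "09:00"), ("06:00", "12:00"),
    ("12:00", "13:00"), ("13:00", "13:30"), ("13:00", "16:00"),
    ("16:00", "17:00"), ("17:00", "20:00"),
]

# The same server after moving to Data Domain, from the second schedule.
DATA_DOMAIN = [
    ("20:00", "04:00"), ("20:00", "04:00"), ("21:00", "06:00"),
    ("06:30", "07:30"), ("08:00", "11:00"), ("11:00", "15:00"),
]

print(busy_minutes(TAPE) / 60)         # 24.0
print(busy_minutes(DATA_DOMAIN) / 60)  # 18.0
```

The six hours of slack in the second schedule is what lets reclamation run to completion instead of being the thing that gets cut.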
These are the first and most significant benefits that our customers are realizing when they pair TSM with Data Domain technology. Next up: gotchas and approach.