Friday, January 25, 2008

When no disaster recovery plan helps

Regardless of how "prepared" and "ready" one feel for a disaster, it will, in one form or another, inevitably happen. The best thing you could do is continuously revise and test your disaster recovery plan, strengthening it each time against any kind of disaster you can think of. Things generally go wrong when you least expect them to go wrong.

I was getting chills reading about Charter Communications, a St. Louis based ISP, accidentally deleted 14,000 active email accounts along with any attachments that they carried. All the deleted data of active customers is irretrievable. As someone who is responsible for data of one of the top 15 heaviest trafficked site in the world, according to Alexa, I know, I'd HATE to be in shoes of the person responsible for this.

As I was reading the news story, I was constantly thinking about the title of Jay and Mike's 2006 presentation: "What Do You Mean There's No Backup?"

Once a disaster happens, you can immediately think of the possible ways it could have been avoided. The real challenge is implementing ways of avoiding all types of disasters before they happen.

For instance, to protect against such a disaster, or at the very least, be able to recover from its effects, Charter communications could have:

1. fully tested the script on a QA/test box to ensure no test records of active users are deleted.
2. created a backup of the data by creating a file system snapshot just before running the script. That way deleted data can be recovered. Depending on your operating system/storage system, there are a lot of tools available that let you take file system snapshots such as fssnap (Solaris), LVM (Linux).
3. had a recoverable backup. There are a lot of cases out there where either no backup exists or the one that does exist, turns out to be corrupt. With a periodic backup, Charter could have, for instance, just announced to their customers that they lost their new emails since last week, instead of dropping the ball and saying that *all* their email is lost. Even having an off-site backup in this case would help if selective restore from that backup was possible.

BTW, Just a few days ago, I was testing a random sample of backups and found backups of a database to be corrupt. That triggered a system wide check of backups. The best way I have found is to have a list of backups from all databases sent to me by email. My report contains information about backups running at the time the script was generated and the backups that were created the previous night.

4. If the data deleted was on a database such as MySQL, recovery from this disaster would be possible by keeping a slave intentionally behind.

What are some of the other ways you can think of to avoid a disaster or to execute a recovery plan?

There are many ways a disaster like this can be triggered. A few, seemingly bizarre but very real, that come to my mind:

- What if you accidentally re-run a previously executed DELETE command, stored in your mysql client history, in a hurry, on the wrong server? Or you re-ran a disastrous command in your shell history in the wrong directory?

- What if you used a list of IDs generated from your QA/test machine to delete users from production machines/databases? Oh and the IDs were generated from an auto-increment column?

Can you think of more?

Sure, there are ways to prevent against each kind of disaster. The question then is: Are you prepared against 'all' of them?

The disaster recovery plan of your company may help your steer out of such a disaster, but in the case of Charter, their DR plan didn't cover this. They do, however, have plans to reimburse their customers $50. Don't know if that'd be sufficient to keep customers from switching.

If you are someone responsible for administering and executing disaster recovery plan(s) for your company, you may find my "Disaster is Inevitable--Are you ready" session at the MySQL Conference 2008 interesting. Plus, we can have great conversation afterward. :)


See also: disaster recovery, disaster recovery journal, mysql conference

No comments: