There’s no way you’ll be successful in recovery without testing. Full stop.
You might think you have a solid plan in place, but there are so many issues that can arise throughout a recovery, from communications to logistics to an array of technical glitches. And if you don’t test your plan ahead of time, it won’t be long before you learn the true meaning of Murphy’s law.
Whether you’re already a testing evangelist or need to brush up on the importance of disaster recovery (DR) testing, here’s an overview of why testing matters, what can happen without testing, and guidelines for how to approach testing to ensure your systems are ready for recovery.
Why disaster recovery testing is important
Discovering your DR plan is ineffective in the midst of a disaster is everyone’s worst nightmare. That’s why testing your DR plan ahead of time and on a regular basis is critical. Rigorous testing is the only way to uncover and fix issues in your plan.
And trust me, a lot can go wrong when a disaster strikes.
Communications is a common issue. You’re in the middle of a crisis and have a problem, but don’t know who to call. Or worse, you have the name and number of the person you’re supposed to call, but when you dial the number, you find out that person no longer works for the company.
Technical glitches run the gamut, but one that I’ve seen derail DR plans again and again is missing data. An absolute requirement for a successful recovery is to have all of your required data in place. Given the complexity and interdependencies among today’s systems, missing even one critical piece of data can result in an inability to bring up your production applications in your recovery environment.
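One way to catch missing data before a test is to keep an inventory of every data set your production applications depend on. You can then compare it with what was actually restored at the recovery site. The sketch below assumes both lists are already available as plain names. The function `find_missing_data` and the example data set names are hypothetical.

```python
from typing import Iterable


def find_missing_data(required: Iterable[str], restored: Iterable[str]) -> list[str]:
    """Return the required data items absent from the recovery environment, sorted."""
    return sorted(set(required) - set(restored))


# Hypothetical inventory of data the production applications depend on.
REQUIRED = ["customer_db", "orders_db", "payroll_master", "gl_ledger"]
# Hypothetical list of what was actually restored at the recovery site.
RESTORED = ["customer_db", "orders_db", "gl_ledger"]

missing = find_missing_data(REQUIRED, RESTORED)
if missing:
    print("Recovery blocked; missing:", ", ".join(missing))
```

The check only covers what is listed in the inventory. Its value depends on keeping that inventory current as systems and their interdependencies change.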
DR testing matters because a recovery depends not only on your recovery procedures but also on coordination, collaboration and sequencing. You must maintain all three across your storage, network, applications, databases and other platforms.
You may assume you can get that done, but you need confirmation that a given input will produce the desired outcome. There’s only one way to discover whether that’s true and what pitfalls you might face: testing.
How often should you test your disaster recovery plan?
There’s no universal answer here, but there is a spectrum of testing frequency based on how much downtime your business can afford. If your business functions can stay down for up to a week (an unusually long allowance), testing once per year could be fine. If your systems can never be down, you owe it to yourself to test much more frequently.
Many companies have a two-day or one-day recovery requirement. If your requirement is 48 hours, you should be testing twice per year. If you have a one-day requirement, you should be in your recovery environment testing once per quarter. Testing at that frequency helps you confirm whether anything in your environment has changed since your last test.
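The rule of thumb above can be written down as a small lookup. This is a sketch under stated assumptions. The function name `min_tests_per_year` is hypothetical, and the 24- and 48-hour thresholds come from the text. The text only addresses the one-week case, so treating every requirement longer than 48 hours as "once per year" is an assumption. A zero-downtime requirement is rejected rather than mapped to a number, because it calls for a much more frequent program than any of these tiers.

```python
def min_tests_per_year(recovery_hours: float) -> int:
    """Minimum DR tests per year for a recovery-time requirement given in hours."""
    if recovery_hours <= 0:
        # Systems that can never be down need a far more frequent program than these tiers.
        raise ValueError("a zero-downtime requirement needs a much more frequent test program")
    if recovery_hours <= 24:
        return 4  # one-day requirement: quarterly
    if recovery_hours <= 48:
        return 2  # two-day requirement: twice per year
    return 1  # assumption: anything longer (e.g. a week) is treated as annual


for hours in (24, 48, 168):
    per_year = min_tests_per_year(hours)
    print(f"{hours}h requirement: {per_year} test(s) per year, every {12 // per_year} months")
```

A requirement shorter than one day still returns the quarterly minimum here. Treat the result as a floor rather than a target.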
Another factor to consider is any major change to your environment or to internal or external requirements. Say you typically test in March and September, but your company is changing its processing capabilities at the end of June to meet corporate requirements. In that case, you should consider a one-off test to make sure those changes are reflected in your DR plan.
There are a few principles to keep in mind about DR testing, especially in the wake of a major disruption like the COVID-19 pandemic. Take note of these guidelines:
Testing lapses happen for any number of reasons. Some companies have been stretched by changes during the pandemic and have had to defer tests for a couple of months while they work through problems they’ve never faced before. That’s fine as long as you get back on schedule with your testing program.
But don’t treat a temporary shift in priorities as an excuse to forgo testing. Never let a temporary lapse turn into a permanent one.
Organizations have been able to execute DR plans remotely for a while now, so whether your team is working in the office or from home, they should be able to participate in a recovery event.
You should, however, make sure your team has working access to the recovery center if working arrangements have changed. If you’ve always worked from the office and haven’t set up remote access from home, that’s an adjustment you’ll need to make.
There’s no reason to fool yourself by going easy on your DR test. If you’re running a test and find out you’re missing a file, your instant reaction might be to go back to production, get that file, and continue with your test. But to ensure you can successfully recover, bringing something forward from production should be an automatic fail because that’s not possible in a true disaster.
However, you can set out certain documented exceptions: steps in the full plan that you simply can’t execute as part of a test. For example, after a disaster, you’ll want to establish a link from your recovery system to your bank to process payroll. Before a test, though, you can decide to leave that bank connection out of the testing process. You don’t want the bank to process payroll during the test. You do, however, want to take the payroll system as far as you can: create the payment file and analyze whether it matches what you would send to the bank.
Don’t cheat on the test, but make informed, proactive decisions on steps to exclude.
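The pass/fail logic described above can be captured in a short sketch. It assumes each test step is recorded with two facts: whether the step completed, and whether anyone pulled data from production to finish it. The `StepResult` type, the `grade_test` function and the step names are all hypothetical. Steps on the documented exclusion list, agreed before the test, are skipped. Any other step that relied on production is an automatic fail.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    completed: bool
    used_production_data: bool = False  # True if something was copied back from production mid-test


def grade_test(results: list[StepResult], documented_exclusions: set[str]) -> tuple[bool, list[str]]:
    """Grade a DR test; exclusions must be documented before the test starts."""
    failures: list[str] = []
    for step in results:
        if step.name in documented_exclusions:
            continue  # planned exclusion, e.g. the live connection to the bank
        if step.used_production_data:
            failures.append(f"{step.name}: pulled data from production")
        elif not step.completed:
            failures.append(f"{step.name}: not completed")
    return (not failures, failures)


# Hypothetical exclusion agreed before the test: never transmit payroll to the bank.
EXCLUSIONS = {"send_payroll_to_bank"}
RESULTS = [
    StepResult("restore_payroll_db", completed=True),
    StepResult("generate_payroll_file", completed=True),
    StepResult("send_payroll_to_bank", completed=False),
    StepResult("restore_general_ledger", completed=True, used_production_data=True),
]

passed, failures = grade_test(RESULTS, EXCLUSIONS)
print("PASS" if passed else "FAIL", failures)
```

The exclusion set is passed in rather than edited during the run. This reflects the idea that exclusions are proactive decisions, not shortcuts taken once a problem appears.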
During a disaster you will face many obstacles and challenges. But with consistent, successful testing, your DR plan doesn’t have to be one of them.
No one wants to discover an issue when they no longer have the ability to fix it. Testing gives you that chance. You can’t be successful in recovery without it.