2. Test your BC/DR plan in a realistic way to ensure it actually works.
An untested plan is a failed plan. If you've never tested your plan realistically, it will almost certainly fail, because people won't know which end is up when the chaos hits. There are many kinds of tests, and you will want to leverage them all: tests that focus on one particular process, tests that assume a limited type of disaster, and, unfortunately these days, tests where we have to imagine our primary site has been turned into a big, smoking crater.
Testing accomplishes multiple goals: it verifies whether your recovery procedures are correct (or perhaps more importantly, incorrect!), and it makes people familiar with the procedures so they can function in a crisis situation. Tests are practice runs for your plan – and practice makes perfect.
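To make that concrete, here is a minimal sketch of the kind of automated check that can anchor a recurring drill: restore the latest backup to a standby environment, poll the application until it answers, and compare the elapsed time against your Recovery Time Objective. The restore script, hostname, and health endpoint below are hypothetical placeholders, so treat this as an outline to adapt to your own tooling rather than a finished test.

```python
#!/usr/bin/env python3
"""Minimal DR drill sketch: restore a backup to a standby host and check the
recovery time against an RTO. The command, host, and endpoint are
hypothetical placeholders -- substitute your own tooling."""

import subprocess
import time
import urllib.request

RTO_SECONDS = 4 * 60 * 60  # example objective: recover within 4 hours
RESTORE_CMD = ["./restore_latest_backup.sh", "--target", "staging-dr-host"]  # placeholder script
HEALTH_URL = "https://staging-dr-host.example.com/healthz"                   # placeholder endpoint


def run_drill() -> bool:
    start = time.monotonic()

    # Step 1: restore the most recent backup onto the standby environment.
    subprocess.run(RESTORE_CMD, check=True)

    # Step 2: poll the application's health endpoint until it answers.
    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                if resp.status == 200:
                    break
        except OSError:
            pass  # not up yet; keep polling
        if time.monotonic() - start > RTO_SECONDS:
            print("FAIL: application did not come back within the RTO")
            return False
        time.sleep(30)

    elapsed = time.monotonic() - start
    print(f"PASS: recovered in {elapsed / 60:.1f} minutes (RTO {RTO_SECONDS / 60:.0f} minutes)")
    return True


if __name__ == "__main__":
    run_drill()
```

Run something like this on a schedule and feed its pass/fail result into your monitoring, and a stale or broken recovery procedure will announce itself long before a real disaster does.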
3. Ensure production changes are properly reflected in the BC/DR plan.
Let's say you have a very clear idea of your critical business processes, the associated applications and their SLAs, your infrastructure and data sources, and what steps are necessary to recover everything within your Recovery Point and Recovery Time Objectives. In fact, your plan is a beautiful thing: documented, tested, and proven!
Now, picture yourself six months down the line. You have deployed a new application system, moving applications off a physical server platform and into the cloud. Guess what? All your plans and tests related to that area of the business have become outdated and irrelevant overnight. If you have a disaster, all your hard work will have been for nothing because you won’t be able to recover anything (at least not without a lot of late nights and hair pulling). Unless, of course, you practice rigorous change management and keep your plan in alignment with your production environment.
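One lightweight way to catch that kind of drift is to compare what is actually running in production against what the plan says you can recover. The sketch below assumes two exported text files with hypothetical names, one listing current production systems (say, pulled from your CMDB or cloud account) and one listing the systems named in your recovery run-books; anything that appears in only one of them is a gap to fix.

```python
#!/usr/bin/env python3
"""Sketch of a drift check between production and the BC/DR plan.
Assumes two exported text files (names are hypothetical): one listing the
systems currently in production, one listing the systems the plan covers."""

from pathlib import Path

PRODUCTION_INVENTORY = Path("production_inventory.txt")  # e.g. exported from your CMDB or cloud account
DR_PLAN_INVENTORY = Path("dr_plan_inventory.txt")        # systems named in the recovery run-books


def load(path: Path) -> set[str]:
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def main() -> None:
    production = load(PRODUCTION_INVENTORY)
    planned = load(DR_PLAN_INVENTORY)

    missing_from_plan = production - planned   # new systems the plan doesn't cover yet
    stale_in_plan = planned - production       # retired systems still cluttering the plan

    for name in sorted(missing_from_plan):
        print(f"NOT IN PLAN: {name}")
    for name in sorted(stale_in_plan):
        print(f"STALE ENTRY: {name}")

    if not missing_from_plan and not stale_in_plan:
        print("Plan and production are in sync.")


if __name__ == "__main__":
    main()
```

Hooking a check like this into your change-management or deployment process turns "update the DR plan" from a good intention into a gate.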
4. Have a plan that is consistent and accessible even in the event of a major disaster.
You could have the best BC/DR plan in the world, but if people can't get to the documents and run-books in a time of crisis, it's all useless. So if, like many companies, your plans are documented in the form of PDF, Word, Excel, and/or Visio files, you need to make sure all of this is organized and accessible to your team even if the primary infrastructure is destroyed. And never underestimate the importance of version control: keeping only the most current documents in the system avoids confusion. Having an inaccessible or inconsistent plan is almost as bad as having no plan at all.
Whether you simply store these documents in the cloud or build your business continuity / disaster recovery plans in a living disaster recovery planning system, ensuring unhindered access to them during a serious event is critical.
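As one illustration (not the only way to do it), here is a rough sketch that pushes a folder of run-book files to a versioned S3 bucket using boto3; the bucket name and local folder are placeholders, and any object store with versioning and off-site replication would serve the same purpose.

```python
#!/usr/bin/env python3
"""Sketch: push BC/DR run-books to a versioned, off-site S3 bucket so the
latest copy survives loss of the primary site. Bucket name and local folder
are hypothetical; any cloud object store with versioning would work."""

from pathlib import Path

import boto3  # pip install boto3; credentials come from the usual AWS config

BUCKET = "example-bcdr-runbooks"      # placeholder bucket name
RUNBOOK_DIR = Path("runbooks")        # local folder of PDF/Word/Excel/Visio files

s3 = boto3.client("s3")

# Keep every revision of every document -- version history is the point.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

for doc in RUNBOOK_DIR.rglob("*"):
    if doc.is_file():
        key = str(doc.relative_to(RUNBOOK_DIR))
        s3.upload_file(str(doc), BUCKET, key)
        print(f"uploaded {key}")
```

Because versioning is enabled, every upload keeps its history, so the team can always tell which revision of the plan is current.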
5. Always have several people fully trained on the BC/DR plan … preferably some of whom are outside the production region.
Too often, BC/DR is the purview of a single IT person or department at a company. If any critical person gets sick, leaves the company, or – heaven forbid – is rendered unavailable in a disaster, the company is left with tons of recovery documentation no one knows how to execute.
The solution is straightforward: Train and involve several people and departments on your BC/DR plan. And, if at all possible, train at least one of those people or teams outside your primary data center region. That way, if a widespread problem or incident renders the people near the primary data center unavailable, the team outside the region can step in to fill the role. A managed recovery program can also help meet this need, whereby the processes and procedures for application recovery are taught to another team of people in another region (or even outsourced to a third party).
6. Remember Murphy's Law: "Whatever can go wrong, will go wrong."
No matter how careful and diligent you are in preparing your BC/DR plan, no matter how many times you test it, you are going to have some snafu during the real thing. And that's where good old-fashioned ingenuity and creativity come into play. That's also when it pays to have a trusted B-Team or partner with experience recovering your application environments, so they can help you out of a jam.
You can also mitigate the effects of Murphy's Law by thinking about contingency plans. During tests, ask your team, "If this fails despite our best efforts, what would we do? How would we recover?" Think through manual recovery procedures, assume key personnel will not be available, and have the team trade roles during tests and practice so that each person knows the others' jobs… in other words, keep your team's mind limber and innovative. Make sure more than one person or group knows which way is up, and no crisis will keep you down for long.
So there you are: 6 BC/DR best practices from the trenches. These will keep you moving forward no matter what outage, incident, or disaster may strike. So march on, and semper fi!
This blog was originally posted on Forbes.com