By Sungard AS
As a Marine, I learned the hard way how difficult it is to think straight under pressure. Now, after two decades in the trenches of business continuity and disaster recovery (BC/DR), I find that some of the same lessons apply to information technology! Since becoming a business technologist in the mid-1990s, I've been ambushed by unexpected problems, had to "march" at double-time to meet our business objectives, and sweated it out waiting to see if a plan of attack was going to work the way our marching orders said it would!
So, I'd like to save you some of the blood, sweat, and tears I've experienced along the way by sharing these six valuable lessons learned from the trenches. They may seem like common sense… but you'd be surprised how many people ignore them and fail.
Let's say you have your primary data centre, including all your applications and data, in Atlanta, Georgia. I don't recommend building your secondary data centre in a town just 15 miles away. A single event could easily disable both data centres.
Though a "safe" distance depends on geography and other factors, a general rule is this: you want to maintain a full copy of your mission-critical data at least 150 miles away, and have the IT resources needed to use that data for business continuance and recovery. That will usually be sufficient to ensure both data centres are not affected by a single disaster, and your company will be able to continue doing business even if a serious outage or loss occurs.
Ideally, the two data centres would also be on separate power grids. In spite of this general rule, there may be specific use cases where it's perfectly appropriate to locate your primary and secondary data centres near each other. Just do your homework or consult with an expert before making any final decisions.
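As a rough illustration, the 150-mile rule of thumb above can be checked with a great-circle (haversine) distance calculation. This is only a sketch: the coordinates are approximate, the nearby and distant sites are hypothetical examples, and the function names are made up for illustration. It says nothing about power grids, flood plains, or the other factors that matter.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; good enough for a rule of thumb

def great_circle_miles(a, b):
    """Haversine distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def meets_distance_rule(primary, secondary, minimum_miles=150.0):
    """True if the secondary site is at least minimum_miles from the primary."""
    return great_circle_miles(primary, secondary) >= minimum_miles

# Approximate coordinates, assumed for illustration only.
ATLANTA = (33.749, -84.388)       # primary data centre from the example above
NEARBY_SITE = (33.953, -84.550)   # hypothetical site a short drive from Atlanta
DISTANT_SITE = (35.227, -80.843)  # hypothetical site in another state

if __name__ == "__main__":
    for label, site in (("nearby", NEARBY_SITE), ("distant", DISTANT_SITE)):
        miles = great_circle_miles(ATLANTA, site)
        verdict = "meets" if meets_distance_rule(ATLANTA, site) else "fails"
        print(f"{label} site: {miles:.0f} miles, {verdict} the 150-mile rule")
```

Treat a passing result as a starting point for the homework, not a substitute for it.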
An untested plan is a failed plan. If you've never tested your plan realistically, it almost certainly will fail because people won't know which end is up when the chaos hits. There are many kinds of tests, and you will want to leverage them all. There are tests that focus on one particular process. There are tests that assume a limited type of disaster. And unfortunately these days, there are tests where we have to imagine our primary site has been turned into a big, smoking crater.
Testing accomplishes multiple goals: it verifies whether your recovery procedures are correct (or perhaps more importantly, incorrect!), and it makes people familiar with the procedures so they can function in a crisis situation. Tests are practice runs for your plan – and practice makes perfect.
Let's say you have a very clear idea of your critical business processes, the associated applications and their SLAs, your infrastructure and data sources, and what steps are necessary to recover everything within your Recovery Point and Recovery Time Objectives. In fact, your plan is a beautiful thing: documented, tested, and proven!
Now, picture yourself six months down the line. You have deployed a new application system, moving applications off a physical server platform and into the cloud. Guess what? All your plans and tests related to that area of the business have become outdated and irrelevant overnight. If you have a disaster, all your hard work will have been for nothing because you won't be able to recover anything (at least not without a lot of late nights and hair pulling). Unless, of course, you practice rigorous change management and keep your plan in alignment with your production environment.
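One way to picture that alignment is a simple check that compares what is running in production against what the recovery plan documents. The sketch below is hypothetical throughout: the `System` and `PlanEntry` records, their fields, and the sample data are assumptions made for illustration, not part of any particular change management tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class System:
    """A production system as it exists today (hypothetical record)."""
    name: str
    platform: str
    last_changed: date

@dataclass(frozen=True)
class PlanEntry:
    """What the BC/DR plan says about a system (hypothetical record)."""
    system_name: str
    platform: str
    last_tested: date

def find_stale_entries(systems, plan):
    """Return (system name, reason) pairs where the plan no longer matches production."""
    plan_by_name = {entry.system_name: entry for entry in plan}
    findings = []
    for system in systems:
        entry = plan_by_name.get(system.name)
        if entry is None:
            findings.append((system.name, "not covered by the plan"))
        elif entry.platform != system.platform:
            findings.append((system.name, "plan documents a different platform"))
        elif entry.last_tested < system.last_changed:
            findings.append((system.name, "changed since last recovery test"))
    return findings

# Sample data, invented for illustration.
production = [
    System("billing", "cloud", date(2024, 6, 1)),      # migrated off a physical server
    System("payroll", "physical", date(2023, 11, 1)),  # unchanged since before the last test
    System("crm", "cloud", date(2024, 5, 20)),         # newly deployed
]
plan = [
    PlanEntry("billing", "physical", date(2024, 1, 15)),
    PlanEntry("payroll", "physical", date(2024, 1, 15)),
]

if __name__ == "__main__":
    for name, reason in find_stale_entries(production, plan):
        print(f"{name}: {reason}")
```

Running a check like this as part of every production change is one way to find out that a plan has gone stale before a disaster does it for you.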
You could have the best BC/DR plan in the world, but if people can't get to the documents and run-books in a time of crisis, it's all useless. So if, like many companies, your plans are documented in the form of PDF, Word, Excel, and/or Visio files, you need to make sure all of this is organised and accessible to your team even if the primary infrastructure is destroyed. And never underestimate the importance of version control and of keeping the most up-to-date documents in the system to avoid confusion. Having an inaccessible or inconsistent plan is almost as bad as having no plan at all.
Whether you simply store these documents in the cloud or build your business continuity and disaster recovery plans in a living disaster recovery planning system, ensuring unhindered access to them during a serious event is critical.
Too often, BC/DR is the purview of a single IT person or department at a company. If any critical person gets sick, leaves the company, or – heaven forbid – is rendered unavailable in a disaster, the company is left with tons of recovery documentation no one knows how to execute.
The solution is straightforward: Train and involve several people and departments on your BC/DR plan. And, if at all possible, train at least one of those people or teams outside your primary data centre region. That way, if a widespread problem or incident renders the people near the primary data centre unavailable, the team outside the region can step in to fill the role. A managed recovery program can also help meet this need, whereby the processes and procedures for application recovery are taught to another team of people in another region (or even outsourced to a third party).
No matter how careful and diligent you are in preparing your BC/DR plan, no matter how many times you test it, you are going to have some snafu during the real thing. And that's where good old-fashioned ingenuity and creativity come into play. That's also when it pays to have a trusted B-Team or partner with experience recovering your application environments, so they can help you out of a jam.
You can also mitigate the effects of Murphy's Law by thinking about contingency plans. During tests, ask your team, "If this fails, regardless of our best efforts, what would we do? How would we recover?" Think through manual recovery procedures, assume key personnel would not be available, and have team members trade roles during tests and practice to make sure they each know the others' jobs… in other words, keep your team's minds limber and innovative. Make sure more than one person or group knows which way is up, and no crisis will keep you down for long.
So there you are: six BC/DR best practices from the trenches. These will keep you moving forward no matter what outage, incident, or disaster may strike. So march on, and semper fi!
This blog was originally posted on Forbes.com