Airline Outages are a Lesson for All Businesses When It Comes To IT Disaster Recovery Plans
February 1, 2017
By Sungard AS
Another week, another airline has a critical information system go down, stranding thousands of travellers and making a big splash in the news. If you think these incidents seem to be happening with increasing frequency, you are right.
Technology is no longer simply used to improve efficiency and increase performance in the airline industry; it is now fundamental to how these businesses operate. If that technology fails, an airline doesn’t just slow down, it comes to a full stop. This is not unique to the airline industry. The difference is that airline system outages are highly visible, which underlines the importance of investing in resilient and quickly recoverable technology and systems.
In addressing this, airlines need to build production resiliency and DR plans by application tier, focusing on the business and customer impact of each system. The most recent Delta outages impacted websites, mobile apps, and airport departure screens. Are these considered critical applications? Are the application dependencies understood and managed? I’m not much of an expert in airlines (though goodness knows I spend enough time on planes that I ought to be), but I do know something about fully recoverable and highly resilient production systems.
The more critical the system, the more important it is to have multiple plans in place in case Plan A or Plan B fails. In the case of Delta last August, a minor problem became a major disruption when a power outage hit and it turned out that 300 of Delta’s 7,000 servers weren’t connected to the backup power system. The result was thousands of cancelled flights. What happens when the failover fails?
More and more, airlines need to look beyond just Plan A and Plan B. No doubt, Plan A starts with building production resiliency into the application itself, so that it can withstand common infrastructure problems. Plan B might include failover to a standby system with minimal loss of data or delay. In this age of ransomware and data compromise or corruption, it’s becoming increasingly clear that companies also need a Plan C that enables them to recover from replicated backups at a remote location where that data is properly isolated.
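To make the layering concrete, here is a minimal sketch of how the three plans might be chosen during an incident. The `Incident` fields and the `choose_plan` function are hypothetical names invented for illustration, and the decision order is an assumption rather than a prescribed runbook.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical snapshot of an incident; all fields are illustrative."""
    primary_site_up: bool   # is the production site still serving traffic?
    standby_healthy: bool   # did the failover target pass its health check?
    data_trusted: bool      # False if ransomware or corruption is suspected


def choose_plan(incident: Incident) -> str:
    """Return "A", "B" or "C" for the plan to invoke (assumed decision order)."""
    # Suspect data may already have been copied to the standby,
    # so go straight to the isolated backups.
    if not incident.data_trusted:
        return "C"
    # Plan A: the application's built-in resiliency absorbs the fault.
    if incident.primary_site_up:
        return "A"
    # Plan B: fail over to the standby system.
    if incident.standby_healthy:
        return "B"
    # The failover failed too: recover from the remote, isolated copies.
    return "C"
```

Note that the data check comes first in this sketch: a healthy standby is of little use if the corruption has been replicated to it, which is the case Plan C exists for.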
Finally, I sometimes wonder how often airlines truly test their disaster recovery plans. A plan that isn’t tested regularly is likely to be out of date as the systems, people, and processes change over time. You need to build muscle memory in your organisation for what to do and, more importantly, who does it.
Here is a quick checklist that airlines can use to assess their risk of having a “full stop” systems outage.
When did you do your last application tiering and business impact analysis? How often do you do this?
How do you categorise systems such as those that drive websites, mobile apps, and airport departure screens? Are they considered mission critical? Important?
How do you manage the complexity of application interdependencies, and how do you keep that picture up to date as systems change?
[Plan A] What applications are considered mission critical and have production resiliency – high availability / fault tolerance / load balancing – built in?
[Plan B] Which applications are considered Tier 2 or Tier 3? What are your application recovery time objectives (RTOs) and recovery point objectives (RPOs)?
[Plan C] What are your DR plans in the event of a production site being reduced to rubble by a major disaster? What are the plans around replication of offsite data? How quickly could applications be brought up in that scenario? What about in the scenario of a malware attack or data compromise or corruption? Do you have isolated copies of your data to which the malware or corruption cannot propagate?
How often do you update and test your DR plans? What success do you have with your DR testing? What percent of your apps do you test?
What percent of total revenue costs is your IT budget? What percent of your IT budget is spent on production resiliency, modernization, and DR? Are you underinvesting?