Another week, another airline has a critical information system go down, stranding thousands of travellers and making a big splash in the news. If you think these incidents seem to be happening with increasing frequency, you are right.
Technology is no longer simply used to improve efficiency and increase performance in the airline industry; it is now fundamental to how these businesses operate. If that technology fails, an airline doesn’t just slow down, it comes to a full stop. While this is not unique to the airline industry, the difference is that system outages are highly visible, highlighting the importance of investing in resilient and quickly recoverable technology and systems.
In addressing this, airlines need to build production resiliency and DR plans by application tier, focusing on the business and customer impact of each system. The most recent Delta outages impacted websites, mobile apps, and airport departure screens. Are these considered critical applications? Are the application dependencies understood and managed? I’m not much of an expert in airlines (though goodness knows I spend enough time on planes that I ought to be), but I do know something about fully recoverable and highly resilient production systems.
The more critical the system, the more important it is to have multiple plans in place in case Plan A or Plan B fails. In the case of Delta last August, a minor problem became a major disruption when a power outage hit and it turned out that 300 of Delta’s 7,000 servers weren’t connected to the backup power system, causing thousands of cancelled flights. What happens when the failover fails?
More and more, airlines need to look beyond just Plan A/Plan B. No doubt, Plan A starts with building production resiliency into the application itself, so that it is resilient to common infrastructure problems. Plan B might include failover to a standby system with minimal loss of data or delay. In this day of ransomware and data compromises/corruption, it’s becoming increasingly clear that companies also need a Plan C that enables them to recover from replicated backups at a remote location where that data is properly isolated.
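To make the Plan A/Plan B/Plan C idea concrete, here is a minimal sketch of an ordered fallback chain in Python. It is an illustration only, not how any airline actually implements recovery: the `recover` function, the `RecoveryError` class, and the plan names in the usage note are all hypothetical, and each plan is assumed to be a callable that returns `True` on success.

```python
from __future__ import annotations

from typing import Callable, Sequence


class RecoveryError(Exception):
    """Raised when every recovery plan has failed (hypothetical helper)."""


def recover(plans: Sequence[tuple[str, Callable[[], bool]]]) -> str:
    """Try each plan in order and return the name of the first that succeeds.

    Each plan is a (name, attempt) pair. A plan counts as failed if its
    attempt returns False or raises an exception.
    """
    for name, attempt in plans:
        try:
            if attempt():
                return name
        except Exception:
            # A plan that raises is treated the same as one that reports failure,
            # so the chain keeps going instead of stopping at a broken failover.
            continue
    raise RecoveryError("all recovery plans failed")
```

As a usage example, `recover([("in-app resiliency", plan_a), ("standby failover", plan_b), ("isolated remote backup", plan_c)])` would return `"isolated remote backup"` if the first two plans fail, where `plan_a`, `plan_b`, and `plan_c` are placeholders for whatever checks your own systems expose. The point of the sketch is that the chain has to keep going when the failover itself fails.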
Finally, I sometimes wonder how often airlines truly test their disaster recovery plans. A plan that isn’t tested regularly is likely to be out of date as the systems, people, and processes change over time. You need to build muscle memory in your organisation for what to do and, more importantly, who does it.
Here is a quick checklist that airlines can use to assess their risk of having a “full stop” systems outage.