Cloud services provider Fastly experienced an outage in June 2021 that took websites such as Reddit, Twitch, Hulu, The New York Times, and more offline for nearly an hour. Unfortunately, incidents like this are far too common.
IT outages cost businesses $5,600 per minute – equating to over $300,000 per hour – according to Gartner and can result in long-lasting damage to their brand’s reputation. In this episode of IT Availability Now, Edwina Murphy, Lead Cloud Specialist for the EMEA Region at Sungard AS, discusses how to keep your organization from suffering the same fate. Listen to the full episode to learn:
- Critical steps businesses must take to prevent IT outages
- Ways to make certain your employees are prepared for disruption
- How to ensure you can successfully recover in your disaster recovery location if production and IT systems fail
Brian Fawcett is a Senior Manager of Global Sales Engagement at Sungard AS. With over 15 years of experience in a range of industries, he specializes in forming enterprise-wide global talent and learning development programs. Brian has enriched corporate learning culture by matching organizational vision and core values to curricula, leading to application and impact.
Ed Murphy is the EMEA cloud specialist team lead at Sungard AS, where she supports customers through complex challenges and helps them adopt creative, robust, efficient, secure and resilient cloud solutions to transform their business. Ed has over a decade of experience working with private, public and hybrid cloud solutions and managed services.
The full transcript of this episode is below.
BRIAN FAWCETT (BF): The financial consequences of an IT failure are well documented. According to Gartner, IT outages cost companies an average of $5,600 per minute, which amounts to more than $300,000 per hour, and the reputational ramifications are just as damaging.
I'm your host Brian Fawcett and this is IT Availability Now, the show that tells stories of business resilience from the people who keep the digital world available.
Back in June 2021, cloud services provider Fastly experienced an outage that created widespread chaos, as people trying to access sites like Reddit, Twitch, Hulu, The New York Times, and more were met with the dreaded 503 error message for the better part of an hour. While Fastly is far from the only company to find itself in this situation, news stories like this leave a negative impression that can ultimately damage your brand.
Fortunately there are steps you can take to avoid these issues. Here today to run through these steps is Ed Murphy. She's a Lead Cloud Specialist in the EMEA region at Sungard AS.
Welcome to the show Ed.
EDWINA MURPHY (EM): Thanks very much Brian and I'm glad to be here.
(BF): So as IT grows more complex, it creates greater opportunities for failure. So what must businesses do to prevent these problems from arising?
(EM): I think, yes, it's no secret that it creates greater opportunities for failure. I think there are three vital things that organizations need to be looking at.
The first is to architect for both resiliency and high availability, where you can.
Secondly, implement a disaster recovery plan and remember if it's not written down, it's not a plan.
And I would say, thirdly, account for the people, the processes, and the governance. All too often we see the focus fall on the technology, and errors often arise from an approach that is too technology focused.
What I would also say, Brian, is that it's not just that businesses want to prevent these problems from arising. If problems do arise, they want not only to protect their brand and image but also to potentially gain a competitive edge in those scenarios. So it is increasingly important, particularly because, as you said, IT is more complex. As we now know from the recent Flexera State of the Cloud Report 2021, 78% of enterprises have hybrid IT, and that is making disaster recovery even harder.
(BF): So let's examine each of these areas a bit more. Why is it so important that organizations prioritize resiliency and high availability?
(EM): So businesses today need non-stop operations. They are expected to architect for higher availability. Customer demand has never been higher than it is today and that's only going to continue. So you must build resiliency into your production and architect those production systems so that they have that greater level of availability to ensure that your customers can transact with you in whatever way they need to, whenever they want.
And build redundancy and fault tolerance into your production systems. You want to ensure that happens so that from the user experience they know nothing that's going on in the background within your organization.
There is an acceptance that high availability systems are more expensive. And we also need to be mindful that, whilst they're fantastic from a speed perspective, they don't actually address every type of disaster. So for some of those systemic disasters out there, you know, like natural disasters, cyber attacks, pandemics, and so on, high availability alone will not necessarily ensure that your business can survive and thrive.
(BF): And that's why employing a DR plan comes into play right?
(EM): Precisely. If your production and IT systems fail, you need the technology and solutions in place to recover your business into your DR location, wherever that might be. Just replicating your data isn't enough. That's not a recovery plan, that's not necessarily going to enable your business to recover.
(BF): So what does an effective DR plan entail?
(EM): So a detailed plan of action before, during, and after a disruption.
From a technology side, digital tools are an absolute must. Software used to carry out risk assessments, performance monitoring, and security checks is imperative.
You will also need to consider data collection. That means analyzing the business processes and identifying the priorities, such as tiering your applications and data based on business value. Define your RTOs and RPOs. For your recovery time objective, the question is how long the business can afford to be down: do you need to say to your business and your customers ‘I must be recovered within two hours, four hours’? For the recovery point objective, the question is more around how much data you can afford to lose. You know, does it have to be no more than an hour's worth, a day's worth, two days' worth, etc.
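To make the RTO and RPO questions above concrete, here is a minimal Python sketch of how a recovery could be checked against those two objectives. The tier names, the specific hour values, and the `meets_objectives` helper are hypothetical illustrations, not something taken from the episode or from any Sungard AS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # recovery time objective: how long the business can afford to be down
    rpo: timedelta  # recovery point objective: how much data the business can afford to lose


def meets_objectives(objective, failed_at, restored_at, last_good_copy_at):
    """Compare one recovery against its RTO and RPO."""
    downtime = restored_at - failed_at
    data_loss_window = failed_at - last_good_copy_at
    return {
        "rto_met": downtime <= objective.rto,
        "rpo_met": data_loss_window <= objective.rpo,
    }


# Hypothetical tiers: applications grouped by business value (values are assumptions).
TIERS = {
    "tier1": RecoveryObjective(rto=timedelta(hours=2), rpo=timedelta(hours=1)),
    "tier2": RecoveryObjective(rto=timedelta(hours=4), rpo=timedelta(days=1)),
}

failed = datetime(2021, 6, 8, 10, 0)
result = meets_objectives(
    TIERS["tier1"],
    failed_at=failed,
    restored_at=failed + timedelta(hours=3),           # back up after three hours
    last_good_copy_at=failed - timedelta(minutes=30),  # last replica was 30 minutes old
)
print(result)
```

In this made-up scenario the tier 1 application misses its two-hour RTO but stays within its one-hour RPO, which is the kind of gap a recovery test is meant to surface.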
And when you're looking at this, I would urge organizations to consider working with a third-party partner. That’s largely because they can give advice and create blueprints for your specific organization, factoring in all your applications and how they hang together within your environment.
That goes back to what we were considering about the complexity of hybrid environments now, and ensuring you have that governance across the stack.
And also consider using cloud-based recovery, like disaster recovery as a service (DRaaS). There are some fantastic high-level benefits here. First and foremost is cost, because you're not having to pay for dual environments, do lots of software maintenance, and so on. Ensuring that DR actually mirrors production, which is also important, can be a little bit tricky as well, and it is far easier to achieve with cloud-based recovery.
The other benefit of using it as a service is this: if you think about it, within your organization you would have to employ a highly expert disaster recovery team, which in theory you would hope you don't have to use that often, so that can be quite expensive. But if you're using a partner who handles thousands of invocations across all the different disaster recovery scenarios, then a lot of that intellectual property is brought into your service. You can therefore take advantage of people, processes, and technology that are already predefined, with a blueprint for you.
(BF): It all makes sense. So, you've addressed the technology aspect. How do employees fit into this puzzle?
(EM): As you rightly say, technology and systems are only part of the solution. Your people and processes and governance are just as critical. So you must make sure you have that dedicated DR team that’s properly and consistently trained so that they understand their roles in the event of a disruption, and build that muscle memory so that they're fully prepared.
Also test your DR plan regularly. When organizations run DR themselves on premises, it can be very, very hard to actually create the time to do the disaster recovery testing. It often gets left aside, and that can sometimes be very damaging because even when customers have DR plans, they’re not necessarily up to date with the changes that have been made within the environment. Testing really is the only way to uncover and rectify problems with a plan, including potentially those technical glitches, communication problems, or logistical issues.
(BF): Yeah, so how often should businesses test this whole DR plan for their organization?
(EM): So that completely varies. Largely it’s based on how much downtime your business can afford. But for some organizations it can even be driven by compliance or regulation, depending on what sectors they're in or what their customers demand of them. We are seeing an increasing number of our customers whose own customers are asking to see their DR plans, and to understand how long they might be down for, before they will even place business with them. So it's becoming extremely critical.
Largely though, on average, a two-day recovery requirement tends to have two test periods in a year. If a customer has a one-day recovery requirement, then typically we see that kind of testing around once per quarter. That’s certainly what we would recommend, and as I mentioned before, it is imperative that you factor in those major changes to your environment, or to internal or external requirements that might sit around that particular DR plan.
(BF): That’s great information. IT failures come in all shapes and sizes and can happen for any number of reasons, but in the end, the business ramifications are the same: financial losses and hits to your brand reputation. By architecting for resiliency and availability, ensuring your employees are prepared for disruption, and implementing a DR plan and regularly testing it, you're less likely to experience an IT failure that leaves your business scrambling for damage control. Ed, I appreciate you being here today.
(EM): Thanks Brian. I had a lot of fun. So again, thanks for having me.
(BF): Ed Murphy is Lead Cloud Specialist for the EMEA region at Sungard AS. You can find the show notes for this episode at SungardAS.com/ITAvailabilityNow.
Please subscribe to the show on your podcast platform of choice to get new episodes as soon as they’re available.
IT Availability Now is a production of Sungard Availability Services.
I’m your host, Brian Fawcett, and until next time, stay available.