ALISON BROOKER (AB): Ring Doorbell notifications went quiet, Alexa devices stopped responding to commands and shows on streaming platforms were inexplicably paused when Amazon Web Services experienced a sudden outage earlier this month. The outage caused major disruptions to a number of services, from airlines to delivery services, showing just how much falls under a cloud provider’s scope.
I'm your guest host, Alison Brooker, and this is IT Availability Now, the show that tells stories of business resilience from the people who keep the digital world available.
On today's episode, Servaas Verbiest, Lead Cloud Evangelist at Sungard AS, discusses what happened with this outage and how businesses can mitigate the risk of disruptions like this in the future.
Servaas, thanks for joining us today.
SERVAAS VERBIEST (SV): Thanks for having me.
(AB): Now we don't know all the details about the outage at this point, but can you walk us through what we know so far?
(SV): So I can tell you what I know. Obviously, there's probably some remediation going on on their side, but from what they’ve released, it appears there were some API issues that really impacted US-EAST-1, which is a popular region for people to deploy services in AWS.
(AB): That's great. We hear all the time in the media that outages, in general, are pretty common, so what makes this one so significant?
(SV): So, on the surface, you know, I mentioned one component. And you would imagine that that one thing wouldn't cause as much chaos. But the reality of the situation is APIs are used to make things happen in the cloud and bind services together. So when you have an issue where application interfaces can't function appropriately, there can be turmoil.
From what I saw, everything from airlines’ ability to deliver services to their clients, to streaming platforms and even logistics associated with packages was impacted, which, you know, causes a shift in everyday life. With the significant market share that they've maintained and the sheer volume of business they have, what would potentially be a small outage for other providers causes larger ripples when we’re starting to talk about somebody like Amazon.
(AB): Absolutely. Do cloud providers typically experience outages of this magnitude and has this ever happened before?
(SV): Well, I mean, outages happen to all cloud providers, right? Inevitably your service provider is going to have an outage if you leverage them. Realistically, that's one of the reasons they establish SLAs and they also provide options to mitigate against those risks. And it's not limited to simple infrastructure-as-a-service providers.
Look at the large outage before this one with Facebook. You know, people typically see that as a social media platform that they use to communicate with friends or share photos or experiences, but the reality of the situation is that it is a platform that gets used for authentication into a lot of services. So, you know, when people log into their Spotify account, or other accounts where they purchase goods and services, they access that account by authenticating with Facebook.
And if that's unavailable, you can't buy things or access that service, which inevitably impacts that provider's ability to get revenue, right? So when you start to leverage cloud providers or SaaS platforms as a part of your business's delivery model, when they have an issue, you're the one who is going to be feeling the pain.
(AB): Absolutely. Can you tell us a bit about what businesses can do to mitigate the risk of disruptions?
(SV): So it really breaks down to three things. It's balancing and potentially changing the way you leverage providers so that your risk is not really aligned in one place, right? If you can't do that, you’ve got to really look at your design to re-examine the ecosystem that you're part of to see if there's a way for you to distribute workloads or services that you're leveraging in a way that, if there is a regional issue, you're still available and can conduct business. And most of these platforms, they have options that allow you to do that.
Where it kind of falls off is more in line with the Facebook example that I used because, you know, if you're using that medium, and that's your only option, you can't control how they protect their services. So in that case, you’ve just got to look to see if you can offer other authentication mechanisms or something to provide that function if that's unavailable.
(AB): Right, so can we dive into that one a bit more, starting with balancing and changing providers, and how should businesses do that?
(SV): So, if you look at the business processes that drive a transaction, and the applications associated with those processes, that's a great place to start, right? Because certain platforms will have specific features or benefits that align to application stacks or certain industries that you're not going to be able to change. But the stuff that is more general and can be hosted and distributed on, you know, standardised IaaS offerings, that's low-hanging fruit that people can use to mitigate the risk of provider-related issues.
Now, there's a lot more that goes into it and I can probably talk to it for days. And that's, frankly, why people work with Sungard, but you also have to re-examine those designs. And when I say re-examine a design, it's looking for the points of failure or opportunities where things aren’t decoupled and you have that single thing that's driving an action or supporting a process. If there's no way for you to break away from that, you've got some serious re-architecture and design to do, but in most cases, integrating at least one other option, so that you have a backdoor alternative if the primary isn't available, is going to be your best case. And, you know, you should be consistently re-evaluating how your organisation consumes services from a cloud and software-as-a-service perspective because with the level of innovation that the marketplace is seeing now - and it's greater than it's ever been - there are always going to be great options available if you've got the expertise in house to really capitalise on them.
(AB): One other thing I wanted to touch on, Servaas. You mentioned earlier in the conversation about the regions that experienced an outage and that this outage only affected those certain regions. Is there something specific businesses should be thinking about when re-examining their designs and their cloud strategy? And you know, what else should they consider to avoid this kind of outage or issue?
(SV): I mean, look, just don't put all your eggs in one basket. Don't take it for granted that that individual deployment that you have is going to be fully protected. Try to diversify. And look at your design and try to think about consuming services in a way that mitigates against risks in a financially responsible way. No one has an infinite budget, right? There's no business that I've worked with that says, you know, “you tell me what I need. I have unlimited amounts of funds to put towards this.” We're operating in an economy that requires us to be intelligent with our resources. So developing a centre of excellence with people who understand the operational and financial components, so that you can put the appropriate levels of risk mitigation against the processes and services that you need to protect, is also really good practice to bring into consideration.
(AB): That definitely makes sense. You also mentioned leveraging new services. Can you expand on that a bit?
(SV): So by the time we're done with this conversation, these cloud providers will probably have released at least one new service. They do that how many times a day, how many days of the week, right? So the ecosystem is constantly being populated with new features and functions that optimise or make the consumption of certain things easier. And you just need to be aware of those because, you know, there are some risks that you have to accept when you're starting to adopt leading-edge technology, but the level of testing and verification that goes into the release of a service is much better than it was a few years ago.
And frankly, that's because of how competitive the marketplace is and specifically when you look at these hyperscale providers, they're not just dealing with a race to zero to offer the cheapest and most cost-effective platform, they're also driving towards being the most innovative. Right? And they know that combining that optimised cost with the highest level of innovation is really what's going to capture their customer base. So you shouldn't miss out on opportunities because you're not looking for those services. That's really the only point I’m trying to make there.
(AB): That's great. These are all great points, Servaas.
Disruptions like the recent AWS outage can happen to any service provider, and businesses need to prepare and understand how to mitigate potential risks. By balancing providers, examining your cloud strategy and leveraging new services, you'll be better positioned to withstand disruptions like this in the future.
Servaas, thanks so much for joining us today and filling us in on what happened here and how businesses can mitigate outages like this.
(SV): My pleasure.
(AB): You can find the show notes for this episode at SungardAS.com/ITAvailabilityNow.
Please subscribe to the show on your podcast platform of choice to get new episodes as soon as they’re available.
IT Availability Now is a production of Sungard Availability Services.
I’m your guest host, Alison Brooker, and until next time, stay available.