Amazon Web Services (AWS) had a fairly public outage this past weekend, and much has been written about it. This event certainly impacted a lot of applications and the people who use and support them, but it got me thinking about how far we’ve come on availability as an industry over the past 20 years. When events like this occur, I would humbly inject some perspective alongside the hyperbole:
- Technology fails. Stuff breaks. Designing for failure is not a new concept, but it is an important one that both providers and their customers are still learning and applying.
- Every provider has outages. What matters is how they adapt and improve to minimize the frequency and customer impact of these events.
- Outages at public clouds are more visible due to their scope of impact and transparency. The rise of mega-scale, homogeneous infrastructure is going to result in fewer but more impactful incidents. Combined with today’s level of status reporting (see AWS’ Service Health Dashboard), we simply know much more about incidents in near-real time. This doesn’t mean public clouds have lower availability.
- You can achieve enterprise-grade high availability on public clouds. The tools at our disposal for maximizing uptime in a public cloud like AWS are better than they’ve ever been.
- Recoverability and high availability are still complex technical challenges (aka fun challenges). The technologies have gotten better, but our expectations of availability have increased at an equal or greater pace. We just don’t tolerate downtime like we did even just a few years ago.
For those of us responsible for keeping applications up and running, the events of this past weekend should serve as a reminder that we need to design with availability and resiliency best practices in mind. The fact is that many organizations have plenty of applications that haven’t been rebuilt to take advantage of the latest public cloud functionality and documented best practices. Rebuilding them takes new approaches to infrastructure and applications, using new technologies, processes and skill sets. Practically speaking, we simply can’t refactor all of our apps overnight, so we have to assess the value of availability for each application and prioritize accordingly. We must also transform our processes and the skills of our teams at the same time. For those managing large enterprise application portfolios, this can be a daunting task.
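One rough way to start that assessment is to translate availability targets into a yearly downtime budget, and to see what redundancy might buy. The sketch below is only an illustration: the application names and targets are hypothetical, the function names are my own, and the redundancy formula assumes replicas fail independently, which real outages often violate.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Hours of downtime per year allowed by an availability target (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` replicas is enough to serve.

    Assumes failures are independent, which is an idealization.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


# Hypothetical portfolio: application name -> availability target.
targets = {"billing": 0.9999, "intranet-wiki": 0.99}

for name, target in targets.items():
    budget = annual_downtime_hours(target)
    print(f"{name}: about {budget:.2f} hours of downtime per year")

# Two independent 99% replicas, either of which can serve traffic.
print(f"two replicas at 99%: {redundant_availability(0.99, 2):.4%}")
```

Numbers like these won’t make the prioritization decision for you, but they make it easier to compare the cost of redesigning an application against the downtime the business can actually accept.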
The size and scope of the challenge are no excuse for not getting started. To do our part, we’ve been working on ways to make applications more recoverable on public clouds like AWS, using functionality such as RDS read replicas, Route 53, CloudWatch, DynamoDB Streams, CloudFormation, EBS, S3, Glacier and more. AWS re:Invent is coming up October 6-9, and Sungard Availability Services will be there to share more about some of the exciting things we’re doing on AWS. Please stop by the booth, say hello and find out more.
Thanks for reading!