A little chaos now & then is the best test for resilience

Resilience is “the capacity to recover quickly from difficulties, or toughness.” With the rise in both natural disasters and cyberthreats, today’s businesses must ensure not only their physical resilience, but the resilience of their IT systems so they can continually provide a great customer experience.

But how do you know if you’re prepared for the worst? Test, test, test. In fact, one method of testing is known as “chaos engineering,” which is defined as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (From http://principlesofchaos.org/)

The goal of chaos testing is to expose weaknesses in your systems before they manifest themselves as some end-user service being down. By doing this on purpose, you and your systems become better at handling unforeseen failures.

Typically, though, conventional test plans don’t cover a service’s complete failure, or high latency in a service’s response. Couple this with the fact that almost all modern IT systems are highly distributed, and other issues arise, such as cascading failures, that are very hard to foresee from a test team’s perspective.
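
To make those two failure modes concrete, here is a minimal Python sketch of injecting an outage or extra latency into a service call. It is an illustration only, not how any particular chaos tool works; the names `with_chaos`, `InjectedFailure` and `fetch_with_fallback` are hypothetical and chosen for this example.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real outage when the experiment injects a failure."""


def with_chaos(call, failure_rate=0.0, latency_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so that a fraction of invocations fail outright
    and every remaining invocation is delayed by `latency_seconds`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # Simulate a complete failure of the service.
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        # Simulate a slow response before handing over to the real call.
        if latency_seconds > 0:
            sleep(latency_seconds)
        return call(*args, **kwargs)

    return wrapped


def fetch_with_fallback(fetch, fallback):
    """A caller that degrades gracefully instead of passing the failure on."""
    try:
        return fetch()
    except InjectedFailure:
        return fallback
```

Wrapping a dependency with `failure_rate=1.0` and checking that the caller still returns its fallback is a small-scale version of the question chaos testing asks: does the failure stay contained, or does it cascade?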

A simple way to begin your testing regime is by looking at recent production issues and asking whether you could have caught any of those problems by experimenting earlier on. Many traditional enterprises have a Problem Management group that can help to spearhead this discussion, or you can check with your DevOps/Service team(s). Some IT organisations introduce system degradation using a tool like Chaos Monkey, which Netflix invented in 2011 to gauge the resilience of its IT infrastructure.

Remember that your goal is not to cause problems, but to reveal them. Be careful not to overlook the type and amount of traffic being created by your tests. Tools like the Chaos Automation Platform (ChAP, another test bed built within Netflix) provide ways to route a percentage of your internet traffic to the experiment and thereby help ‘increase the safety, cadence, and breadth of experimentation.’
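
As a rough illustration of limiting an experiment to a percentage of traffic, the sketch below assigns requests to the experiment group by hashing a request identifier. This is an assumption about one common way to do it, not a description of how ChAP works; the function name `in_experiment` and the request IDs are hypothetical.

```python
import hashlib


def in_experiment(request_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` % of request IDs in the experiment group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so percentages resolve to 0.01 %.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100
```

Because the decision depends only on the identifier, the same request always lands in the same group, and setting `percent` to 0 switches the experiment off for everyone.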

In other words, it is worth shifting your thinking from “What happens if this fails?” to “What happens when this fails?” At Sungard AS, for example, we strive to provide state-of-the-art resilience to our customers by looking at these experiments as ‘continuous limited scope disaster recovery.’ This helps us identify and address issues before they become news – which is GOOD news for you.

Find out more about the discipline, frameworks and maturity around Chaos Engineering in our CTO blog about true resiliency.

Photo by Erik Eastman on Unsplash
