A little chaos now & then is the best test for resilience

October 17, 2018

Resilience is “the capacity to recover quickly from difficulties, or toughness.” With the rise in both natural disasters and cyberthreats, today’s businesses must ensure not only their physical resilience but also the resilience of their IT systems, so they can continually provide a great customer experience.

But how do you know if you’re prepared for the worst? Test, test, test. In fact, one method of testing is known as “chaos engineering,” which is defined as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (From http://principlesofchaos.org/)

The goal of chaos testing is to expose weaknesses in your systems before they show up as an outage in an end-user service. By inducing failures on purpose, under controlled conditions, you and your systems become better at handling unforeseen ones.

Typical test plans, though, don’t check for a service’s complete failure, or for high latency in a service’s response. Couple this with the fact that almost all modern IT systems are highly distributed, and you get further problems, such as cascading failures, that are very hard to foresee from a test team’s perspective.
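To make those two failure modes concrete, here is a minimal Python sketch of fault injection around a single service call. Everything in it is an assumption made for illustration: `with_chaos`, `InjectedFailure`, `fetch_price` and `price_with_fallback` are hypothetical names, not part of Chaos Monkey or any other tool mentioned in this post.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised instead of a real response when the experiment injects a failure."""


def with_chaos(call, failure_rate=0.0, extra_latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so it is delayed by `extra_latency_s` and fails with probability `failure_rate`.

    `rng` and `sleep` are injectable so an experiment can be reproduced and tested.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if extra_latency_s > 0:
            sleep(extra_latency_s)  # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure")  # simulate a dead dependency
        return call(*args, **kwargs)

    return wrapped


def fetch_price(item_id):
    """Hypothetical downstream service call (stands in for a real network request)."""
    return {"item": item_id, "price": 9.99}


def price_with_fallback(fetch, item_id):
    """The behavior under test: degrade gracefully instead of passing the failure on."""
    try:
        return fetch(item_id)
    except InjectedFailure:
        return {"item": item_id, "price": None}


if __name__ == "__main__":
    flaky = with_chaos(fetch_price, failure_rate=0.3, extra_latency_s=0.2)
    for _ in range(5):
        print(price_with_fallback(flaky, "sku-1"))
```

The question the experiment answers is whether the caller degrades gracefully (here, by returning a record with no price) or lets the failure cascade to whatever depends on it.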

A simple way to begin your testing regime is by looking at recent production issues and discovering whether you could have caught any of those problems by experimenting earlier on. Many traditional enterprises have a Problem Management group that can help to spearhead this discussion, or you can check with your DevOps/Service team(s). Some IT organizations introduce system degradation using a tool like Chaos Monkey, which Netflix invented in 2011 to gauge the resilience of its IT infrastructure.

Remember that your goal is not to cause problems, but to reveal them. Be careful not to overlook the type and amount of traffic being created by your tests. Tools like the Chaos Automation Platform (ChAP, another test bed built within Netflix) provide ways to route a percentage of your internet traffic to the experiment and thereby help ‘increase the safety, cadence, and breadth of experimentation.’
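As a rough illustration of what limiting an experiment to a percentage of traffic can mean, the Python sketch below assigns each request to the experiment by hashing an identifier. It is not how ChAP works internally; `bucket`, `in_experiment` and the request IDs are hypothetical, and the hashing scheme is just one simple way to get a stable split.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01 percent


def bucket(request_id: str) -> int:
    """Map an identifier to a stable bucket in the range [0, BUCKETS)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_experiment(request_id: str, percent: float) -> bool:
    """Return True if this request falls inside the experiment's share of traffic.

    `percent` is expressed from 0 to 100; the same ID always gets the same answer.
    """
    return bucket(request_id) < percent * BUCKETS / 100


if __name__ == "__main__":
    ids = [f"request-{n}" for n in range(10_000)]
    exposed = sum(in_experiment(i, percent=1.0) for i in ids)
    print(f"{exposed} of {len(ids)} requests routed to the experiment")
```

Because the assignment is deterministic, a given user or request stays in the same group for the whole experiment, and setting `percent` to 0 switches the experiment off for everyone.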

In other words, it is worth shifting your thinking from “What happens if this fails?” to “What happens when this fails?” At Sungard AS, for example, we strive to provide state-of-the-art resilience to our customers by looking at these experiments as ‘continuous limited scope disaster recovery.’ This helps us identify and address issues before they become news – which is GOOD news for you.

Find out more about the discipline, frameworks and maturity around Chaos Engineering in our CTO blog about true resiliency.

Photo by Erik Eastman on Unsplash
