Chaos Engineering — Simulate AZ Failures on AWS

Chaos engineering is about introducing turbulent conditions that systems are likely to face in production environments.  These chaos experiments uncover new information, which can then be used to make changes to code, making our systems more resilient than they were before. Chaos experiments are not equivalent to Testing.  In Testing, we check system response against a predefined expected result. However, in the case of chaos experiment, we don’t have a predefined outcome.  The experiment gives us new information about the system, which can then be used for the betterment of systems.

In this article, I will walk you through how you can create chaos experiment of Availability Zone (AZ) failure on AWS.  Highly available applications need to be resilient against AZ failures.  Your application, for example, a Kubernetes cluster spanning across multi-AZ, should be able to survive such AZ failures. These chaos simulations allow you to check and prepare for that.