The Principles of Chaos Engineering

Resilience is something those who use Kubernetes to run apps and microservices in containers aim for. When a system is resilient, it can handle losing a portion of its microservices and components without the entire system becoming inaccessible.

Resilience is achieved by integrating loosely coupled microservices. When a system is resilient, microservices can be updated or taken down without having to bring the entire system down. Scaling becomes easier too, since you don’t have to scale the whole cloud environment at once.

Chaos Engineering — Simulate AZ Failures on AWS

Chaos engineering is about introducing turbulent conditions that systems are likely to face in production environments.  These chaos experiments uncover new information, which can then be used to make changes to code, making our systems more resilient than they were before. Chaos experiments are not equivalent to Testing.  In Testing, we check system response against a predefined expected result. However, in the case of chaos experiment, we don’t have a predefined outcome.  The experiment gives us new information about the system, which can then be used for the betterment of systems.

In this article, I will walk you through how you can create chaos experiment of Availability Zone (AZ) failure on AWS.  Highly available applications need to be resilient against AZ failures.  Your application, for example, a Kubernetes cluster spanning across multi-AZ, should be able to survive such AZ failures. These chaos simulations allow you to check and prepare for that.