Systematic and Chaotic Testing: A Way to Achieve Cloud Resilience

In today’s digital era, where downtime can mean shutting down, it is imperative to build resilient cloud infrastructure. During the pandemic, for example, IT maintenance teams could no longer be on-premises to reboot a server in the data center. If on-premises hardware goes down in that situation, access to data and software is blocked, productivity halts, and the business loses money. One solution is to move IT operations to cloud infrastructure that ensures security through round-the-clock technical support from remote teams. Here the cloud essentially comes to the rescue.

Companies now rely heavily on the cloud, so observability and resilience of cloud operations have become imperative: downtime now means disconnection and business loss.

How Chaos Mesh Helps Apache APISIX Improve System Stability

Apache APISIX is a cloud-native, high-performance, scalable API gateway for microservices. It is one of the Apache Software Foundation's top-level projects and serves hundreds of companies around the world, processing their mission-critical traffic across industries including finance, the Internet, manufacturing, retail, and operators. Our customers include NASA, the European Union's digital factory, China Mobile, and Tencent.

As the community grows, Apache APISIX's features interact more frequently with external components, which makes the system more complex and increases the likelihood of errors. To identify potential system failures and build confidence in the production environment, we introduced the concept of Chaos Engineering.

Chaos Engineering Makes Disciplined Microservices

Chaos and discipline: together, these two words sound like an oxymoron. You might be wondering how chaos can make microservices disciplined.

But discipline means the absence of chaos, so until you have experienced chaos, you cannot be disciplined.

The Principles of Chaos Engineering

Resilience is what teams that use Kubernetes to run apps and microservices in containers aim for. When a system is resilient, it can lose a portion of its microservices and components without the entire system becoming inaccessible.

Resilience is achieved by integrating loosely coupled microservices. When a system is resilient, microservices can be updated or taken down without having to bring the entire system down. Scaling becomes easier too, since you don’t have to scale the whole cloud environment at once.
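
The idea of losing part of a system without losing the whole can be sketched as a toy simulation. The following is a minimal illustration in Python, not a Kubernetes or Chaos Mesh API: the Service class, the service names, and the replica counts are all hypothetical, and "available" is assumed to mean that every service still has at least one healthy replica.

```python
import random


class Service:
    """A hypothetical microservice with a fixed number of replicas."""

    def __init__(self, name, replicas):
        self.name = name
        self.healthy = [True] * replicas

    def kill_random_replica(self, rng):
        """Mark one healthy replica as down; return False if none were left."""
        alive = [i for i, ok in enumerate(self.healthy) if ok]
        if not alive:
            return False
        self.healthy[rng.choice(alive)] = False
        return True

    def is_available(self):
        # Assumption: one healthy replica is enough to serve traffic.
        return any(self.healthy)


def system_available(services):
    """The system counts as up only if every service is still available."""
    return all(s.is_available() for s in services)


def run_experiment(services, kills, seed=0):
    """Inject `kills` random replica failures, then report availability."""
    rng = random.Random(seed)
    for _ in range(kills):
        target = rng.choice(services)
        target.kill_random_replica(rng)
    return system_available(services)


if __name__ == "__main__":
    services = [Service("cart", 3), Service("search", 3)]
    print("still available:", run_experiment(services, kills=2))
```

With three replicas per service, two injected failures can never take out a whole service, so the simulated system stays available. A service with a single replica has no such margin, which is the kind of weakness a chaos experiment is meant to expose before production does.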

A Key to Success: Failure with Chaos Engineering [Video]

Test in Production is back! In May we hosted the Meetup at the Microsoft Reactor in San Francisco. The focus of this event was the culture of failure. Specifically, we wanted to hear how the culture of failure (avoiding failure, recovering from failure, and learning from failure) has an impact on how we test in production.

Ana Medina, Chaos Engineer at Gremlin, spoke about how performing Chaos Engineering experiments and celebrating failure help engineers build muscle memory, spend more time building features, and build more resilient complex systems.