You already have everything you need for Chaos Engineering


We know getting started with Chaos Engineering can be scary. Teams want to be prepared and know that they have everything in place before getting started. Maybe you’re concerned about not having observability and monitoring, or you don’t think your staging environment is a good replica of production, or you don’t have the right tools to run chaos experiments.

While these are valid concerns, the truth is you probably already have everything you need, and anything you're missing is easier to put in place than you think. In this article, we'll explain everything you need, where you need to be, and how you can get started.

You already have a culture of reliability

Culture is an important part of reliability: a strong cultural focus is what pushes engineers to build more resilient systems. The thing is, engineers already want to build resilient systems. Unreliable, buggy systems fail more often, creating extra work and stress for the engineers who maintain them. Oftentimes engineers simply don't have time to find and fix bugs because they're busy building new features.

By giving engineers time to focus on reliability, you’re reducing your risk of incidents and outages later on.

You already have enough monitoring and observability

You don’t need a full-blown observability practice to get started with Chaos Engineering. Yes, you do need to understand how your systems are operating so that you can compare their behavior before an experiment and during an experiment. But as long as you’re able to quantify the metrics that are important to you and your team (for example, the golden signals of latency, traffic demand, error rate, and resource saturation) and track their changes over time, you’re ready to go. You don’t need to start with fully developed SLOs.
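Comparing a metric's baseline to its value during an experiment can be as simple as a tolerance check. This is a minimal sketch, not a specific tool's API; the latency figures and the 20% threshold are illustrative, and the numbers would come from whatever monitoring system you already use (a Prometheus query, a CloudWatch call, and so on):

```python
def within_tolerance(baseline, observed, max_increase_pct=20.0):
    """Return True if `observed` grew no more than `max_increase_pct`
    percent over `baseline`."""
    if baseline == 0:
        return observed == 0
    return (observed - baseline) / baseline * 100 <= max_increase_pct

# Illustrative values: p99 latency before and during a chaos experiment.
baseline_p99 = 180.0  # ms, measured before the experiment
during_p99 = 205.0    # ms, measured while the experiment runs

if within_tolerance(baseline_p99, during_p99):
    print("latency held steady -- hypothesis confirmed")
else:
    print("latency degraded -- halt the experiment and investigate")
```

The same check works for any golden signal you can quantify: error rate, saturation, or traffic, each with its own tolerance.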

You already have a test environment

Trying to reproduce your production environment in a staging environment is a Sisyphean task. You’re never going to get the same behaviors in pre-production as you would in production. I know what you’re thinking, and the answer is yes: you can actually do Chaos Engineering safely in production! But there are smart ways of going about it.

The most important step is to limit your blast radius, or the number of systems impacted by your chaos experiments. In production, the blast radius also includes the users who are potentially impacted by your experiments. If you serve thousands of customers per day, running a chaos experiment across your entire production stack could potentially cost you millions of dollars.

The key is to control the impact. Start by reducing your blast radius to the smallest number of systems needed to test your hypothesis. In addition, use techniques like canary deployments to segment off part of your production infrastructure and traffic. The benefit of a canary deployment is that it’s still production traffic running on production hardware, but it constrains the blast radius to a much smaller customer base. This is how teams like Netflix run chaos experiments, load tests, and regression tests.
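The routing logic behind a canary is just a weighted split. Here is a hedged sketch, assuming a 5% canary weight and pool names chosen purely for illustration; in practice your load balancer or service mesh would do this, not application code:

```python
import random

# Fraction of production requests routed to the canary fleet.
# 5% is an illustrative starting point, not a recommendation.
CANARY_WEIGHT = 0.05

def choose_pool(rng=random.random):
    """Route a request: a small slice of real production traffic goes to
    the canary, constraining the blast radius of any experiment run there."""
    return "canary" if rng() < CANARY_WEIGHT else "stable"
```

Because the canary serves real traffic on real hardware, experiment results transfer to production, while a failed experiment affects only the small slice of users routed to it.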

An even safer approach is to use dark launches, where traffic is duplicated and sent to both production systems and a separate, “dark” production environment. This lets you run experiments in a production replica without actually impacting users.
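The core of a dark launch is traffic mirroring: serve every request from production, send a copy to the dark environment, and throw the copy's response away. A minimal sketch, with `prod_backend` and `dark_backend` as hypothetical stand-ins for your real services:

```python
def handle(request, prod_backend, dark_backend):
    """Serve `request` from production while mirroring a copy to a
    'dark' replica whose response is discarded."""
    response = prod_backend(request)  # the user only ever sees this
    try:
        dark_backend(request)         # mirrored copy; result is discarded
    except Exception:
        pass                          # dark-environment failures never reach users
    return response
```

Anything you break in the dark environment during an experiment stays invisible to users, which is what makes this the safest of the production approaches.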

You already have a way to continuously validate your resiliency

Chaos Engineering is a practice that requires repeated experimentation and validation that your systems are resilient. The good news is that if you have an automated deployment process (such as a CI/CD pipeline), you already have everything you need. Many Chaos Engineering tools can integrate into a CI/CD pipeline, letting you run experiments during a build process. You can deploy your changes to a canary or dark environment, run experiments to validate their resilience, and if your experiments fail, roll back the deployment automatically.
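That deploy-experiment-promote-or-rollback loop can be expressed in a few lines. In this sketch every function name (`deploy_canary`, `run_experiment`, `promote`, `rollback`) is a hypothetical stand-in for your deployment tooling and your chaos platform's API, not any specific product's interface:

```python
def pipeline(deploy_canary, run_experiment, promote, rollback):
    """Gate a deployment on a chaos experiment: promote the canary if
    the experiment passes, roll it back automatically if it fails."""
    deploy_canary()
    if run_experiment():  # True means the system stayed within its SLO
        promote()
        return "promoted"
    rollback()
    return "rolled back"
```

In a real pipeline these stages would be jobs in your CI/CD configuration rather than Python functions, but the control flow is the same.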

Just like with automated QA testing, the practice of “continuous chaos” means you can validate your services against a full range of potential failure modes without having to run the experiments yourself. You can continuously add new experiments to your test suite and ensure your services don’t have regressions or introduce new failure modes.

You already have access to Chaos Engineering tools

Getting started with Chaos Engineering is easier than you think. There are a ton of open source tools available, as well as fully managed platforms like Gremlin. Open source tools let you extend your Chaos Engineering practice by modifying the tool to your needs, while managed platforms involve minimal setup and configuration. Ultimately, success with Chaos Engineering depends on three things: having a culture of reliability, having enough observability in place to understand how your systems are working, and having the tools to run chaos experiments and systems to experiment on. As long as you have these, you're good to go.

