Simple AWS

Chaos Engineering on AWS: Building Resilient Systems

Guille Ojeda
March 31, 2024

Hope is not a strategy. That's why we test our software, instead of just hoping it works as intended. But as we're focused on testing features, there are many critical aspects that we often forget about or leave for later, such as resilience.

In this article we'll discuss Chaos Engineering, which is the practice of intentionally breaking your own system to understand how to make it more resilient. We'll also briefly introduce AWS Fault Injection Simulator, a service that lets you run chaos experiments on your AWS infrastructure.

Understanding Chaos Engineering Principles

Chaos Engineering is about proactively testing and validating the resilience of distributed systems. It's based on the idea that by intentionally breaking our own stuff (in a controlled way), we can identify and resolve weaknesses before things inevitably break in an uncontrolled way and cause serious outages.

The principles of Chaos Engineering can be summed up in four key points:

Define steady state: Identify the normal behavior of your system and establish metrics to measure it.
Hypothesize: Develop a hypothesis about how your system will respond to a specific failure scenario.
Introduce controlled failures: Deliberately inject failures into your system in a controlled manner.
Observe and analyze: Monitor how your system responds to the failures and compare the results to the experiment's hypothesis.
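
To make the four points concrete, here's a minimal sketch of an experiment loop in Python. Everything in it is made up for illustration: the `Metrics` fields, the thresholds, and the toy two-replica "system" are assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile response time


def run_experiment(
    measure: Callable[[], Metrics],
    is_steady: Callable[[Metrics], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> bool:
    """Return True if the hypothesis 'the system stays in steady state' held."""
    # 1. Define steady state: refuse to start if the system is already unhealthy.
    if not is_steady(measure()):
        raise RuntimeError("System is not in steady state; aborting experiment")
    # 2. Hypothesize: the steady-state check keeps passing during the failure.
    # 3. Introduce a controlled failure.
    inject_failure()
    try:
        # 4. Observe and compare against the hypothesis.
        return is_steady(measure())
    finally:
        rollback()  # always undo the failure, even if measuring raises


# Toy system (hypothetical): two replicas; losing one raises latency, not errors.
state = {"replicas": 2}


def measure() -> Metrics:
    return Metrics(
        error_rate=0.0 if state["replicas"] > 0 else 1.0,
        p99_latency_ms=120.0 if state["replicas"] == 2 else 180.0,
    )


def is_steady(m: Metrics) -> bool:
    return m.error_rate < 0.01 and m.p99_latency_ms < 250.0


def kill_one_replica() -> None:
    state["replicas"] -= 1


def restore_replicas() -> None:
    state["replicas"] = 2


hypothesis_held = run_experiment(measure, is_steady, kill_one_replica, restore_replicas)
print("Hypothesis held:", hypothesis_held)
```

The rollback sits in a `finally` block on purpose: whatever happens while observing, the failure gets undone.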

Benefits of Chaos Engineering

So why go through the trouble of intentionally breaking things? Here are some of the benefits:

Increased confidence in system resilience: By regularly testing your system's ability to handle failures, you can have greater confidence that it will perform as expected in the face of real-world outages. Taken to the extreme, this makes real outages indistinguishable from failures you induced, meaning outages are a non-event.
Proactive identification of weaknesses: Chaos Engineering helps you identify and address weaknesses in your system before they cause real-world impact. Like having features break in your tests before they break in the hands of users.
Improved incident response: By simulating failures in a controlled environment, you can practice and refine your incident response processes. Practice is everything in these situations, and a well practiced incident response protocol is just as important as system resilience.
Reduced mean time to recovery (MTTR): By identifying and addressing weaknesses proactively, you can reduce the time it takes to recover from real-world outages. Again, this comes down to practice.

Steady State and Turbulent State

To effectively apply Chaos Engineering, it's important to understand the concepts of steady state and turbulent state. Steady state refers to the normal, expected behavior of your system. It's the baseline against which you measure the impact of your experiments. To define steady state, you need to identify key metrics that indicate your system is functioning as expected, such as response time, error rate, and throughput. Turbulent state is the opposite: it's the behavior of your system when it's experiencing failures or unexpected conditions.

Steady state is the easy part: everything is fine and you don't need to do anything. The goal of Chaos Engineering is to observe how your system behaves in a turbulent state, and ensure that it can recover and return to steady state.

When designing Chaos Engineering experiments, it's important to have a clear understanding of your system's steady state and the specific turbulent conditions you want to test. This allows you to create targeted, meaningful experiments that provide valuable insights into your system's resilience. It's just like any other test: you don't test random stuff, you test specific features in specific ways with specific inputs.
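
Here's one way you might turn "steady state" into something you can check, as a small Python sketch. The limits in `STEADY_STATE` and the sample data are invented for the example; your real numbers should come from observing your own system.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling division
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_seconds):
    """requests: list of (latency_ms, succeeded) tuples seen in the window."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "error_rate": errors / len(requests),
        "throughput_rps": len(requests) / window_seconds,
    }


# Hypothetical steady-state limits for an imaginary service.
STEADY_STATE = {"p95_latency_ms": 300, "error_rate": 0.01, "throughput_rps": 1.0}


def in_steady_state(summary, limits=STEADY_STATE):
    return (
        summary["p95_latency_ms"] <= limits["p95_latency_ms"]
        and summary["error_rate"] <= limits["error_rate"]
        and summary["throughput_rps"] >= limits["throughput_rps"]
    )


normal = [(100, True)] * 19 + [(250, True)]
turbulent = [(100, True)] * 10 + [(900, False)] * 10

print(in_steady_state(summarize(normal, window_seconds=10)))     # True
print(in_steady_state(summarize(turbulent, window_seconds=10)))  # False
```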

Blast Radius and Containment

Another key concept in Chaos Engineering is blast radius. Blast radius refers to the scope and impact of a failure or experiment. In other words, how much stuff blows up because of a particular failure?

When designing Chaos Engineering experiments, it's important to carefully consider and control the blast radius. You want to introduce failures that are significant enough to provide valuable insights, but not so severe that they cause widespread damage or disruption. In other words, make sure you break only what you want to break, and not more. Be especially careful about cascading failures: situations where breaking one component causes another one to break. For example, breaking a database can cause a backend system to overload, because it can't finish processing requests and runs out of memory.

Techniques for containing blast radius include:

Targeting specific components or services rather than the entire system
Conducting experiments in isolated environments, such as staging or test environments
Implementing safety controls and rollback mechanisms to quickly recover from failures
Gradually increasing the scope and severity of experiments over time
Really understanding your architecture and how it fails (this will in part come from running chaos engineering experiments, so don't feel bad if you don't get this right the first time)
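
As a small illustration of the first technique, here's a Python sketch that picks a bounded subset of resources to break. The fleet, the `chaos-ready` tag, and the function name are all hypothetical. FIS offers the same idea through its `COUNT(n)` and `PERCENT(n)` selection modes, so treat this as a picture of the concept rather than something you'd need to write yourself.

```python
import random


def pick_targets(instances, tag, percent, max_count, seed=None):
    """Pick a bounded random subset of the instances that carry the given tag."""
    eligible = [instance for instance in instances if tag in instance["tags"]]
    wanted = len(eligible) * percent // 100  # round down: err on the small side
    count = min(wanted, max_count)           # hard cap on the blast radius
    return random.Random(seed).sample(eligible, count)


# Hypothetical fleet: 10 instances, only 8 opted in to chaos experiments.
fleet = [
    {"id": f"i-{n:04d}", "tags": {"chaos-ready"} if n < 8 else set()}
    for n in range(10)
]

chosen = pick_targets(fleet, tag="chaos-ready", percent=25, max_count=3, seed=42)
print([instance["id"] for instance in chosen])
```

Two limits apply at once: a percentage of the eligible resources and an absolute cap. Untagged instances are never touched, no matter what the numbers say.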

AWS Fault Injection Simulator

AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a fully managed service that makes it easy to perform controlled Chaos Engineering experiments on your AWS workloads. With AWS FIS, you can inject failures into your application stack to test your system's resilience and validate your recovery procedures.

Key features and benefits of FIS include:

Pre-built actions for common failure scenarios, like EC2 instance termination, EBS volume failure, and AZ outages
Flexibility to create custom actions using AWS Systems Manager documents
Integration with AWS CloudFormation and AWS CodePipeline for automation
Built-in safety controls and stopping conditions to minimize blast radius
Detailed experiment logs and metrics for analysis and reporting

I already wrote an article about Chaos Engineering Using AWS Fault Injection Simulator, which will give you the practical details of FIS. I decided to still write this part, just to give you the theory and context behind it.

Fault Injection Simulator Actions and Targets

An action is a specific type of failure that you want to introduce, like terminating an EC2 instance or causing network latency. FIS provides several pre-built actions that you can use, and you can also create custom actions that use AWS Systems Manager (SSM) documents.

A target is the resource or group of resources that you apply the action to. Targets can be specific resources, like specific EC2 instances or EBS volumes, or groups of resources, such as Auto Scaling groups or ECS clusters.

When you create chaos experiments in FIS you define the actions and targets that you want to include. This allows you to create targeted, repeatable experiments that test specific failure scenarios and components.
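
As a rough illustration, here's how one target and one action might look as Python dictionaries, in the shape the FIS API uses. The names `api-instances` and `stop-api-instances` and the `chaos-ready` tag are made up for this example.

```python
import json

# A target: which resources the experiment is allowed to touch.
targets = {
    "api-instances": {                             # hypothetical target name
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"chaos-ready": "true"},   # hypothetical tag
        "selectionMode": "COUNT(1)",               # only one matching instance
    }
}

# An action: which failure to inject, and into which target.
actions = {
    "stop-api-instances": {                        # hypothetical action name
        "actionId": "aws:ec2:stop-instances",
        # ISO 8601 duration: start the instances again after 5 minutes.
        "parameters": {"startInstancesAfterDuration": "PT5M"},
        # "Instances" is the target key this action expects.
        "targets": {"Instances": "api-instances"},
    }
}

print(json.dumps({"targets": targets, "actions": actions}, indent=2))
```

Note how the action refers to the target by name. That's what lets you reuse one target across several actions.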

Experiment Templates and Execution

To run a Chaos Engineering experiment in FIS, first you need to create an experiment template, where you define the actions, targets, and parameters for your experiment.

When creating a FIS experiment template, you have to specify the following:

The name and description of your experiment
The actions to include and their parameters (e.g. duration, frequency)
The targets to apply the actions to
The stop conditions for the experiment, defined as CloudWatch alarms (e.g. an alarm on an error rate threshold)
The IAM role that FIS assumes to run the actions
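
Here's a sketch of how those pieces could be assembled in Python, plus the IAM role that FIS needs in order to act on your resources. `build_template` is a hypothetical helper, and the ARNs and names are placeholders. The resulting dictionary follows the parameter names of boto3's `create_experiment_template`, so it should be usable as `fis.create_experiment_template(**template)`, but this sketch doesn't call AWS.

```python
def build_template(description, role_arn, alarm_arns, targets, actions):
    """Assemble keyword arguments for an FIS experiment template."""
    # Every action must point at a target that actually exists.
    for name, action in actions.items():
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                raise ValueError(
                    f"action {name!r} uses undefined target {target_name!r}"
                )
    # A choice made for this sketch: never build a template with no safety net.
    if not alarm_arns:
        raise ValueError("refusing to build a template without stop conditions")
    return {
        "description": description,
        "roleArn": role_arn,
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns
        ],
        "targets": targets,
        "actions": actions,
        "tags": {"Name": description},
    }


template = build_template(
    description="Terminate one API instance",
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    alarm_arns=[
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-5xx-rate"  # placeholder
    ],
    targets={
        "api": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "terminate": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "api"},
        }
    },
)
print(sorted(template))
```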

Once you have an experiment template, you can execute it to start the experiment. FIS will then run the specified actions against the targets, and monitor the results.

During the experiment you can monitor the progress and status of the actions, as well as any stop conditions that have been triggered. FIS provides detailed logs and metrics for each experiment, allowing you to analyze the results and identify any issues or areas for improvement.

After you've run your experiment, you should analyze the results and take any actions needed. You can repeat experiments any time you want, by running the experiment templates again.
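
If you want to start an experiment from a script, the flow could look like the sketch below. `run_and_wait` is a hypothetical helper that calls `start_experiment` and `get_experiment` the way a boto3 FIS client exposes them. `FakeFis` is a stand-in I've added so the example runs without AWS access; with real credentials you'd pass `boto3.client("fis")` instead.

```python
import time
import uuid

# Statuses that mean the experiment is still going. Anything else is final.
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def run_and_wait(fis, template_id, poll_seconds=10.0, sleep=time.sleep):
    """Start an experiment from a template and poll until it finishes."""
    started = fis.start_experiment(
        experimentTemplateId=template_id,
        clientToken=str(uuid.uuid4()),  # idempotency token
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] not in IN_PROGRESS:
            return state["status"], state.get("reason", "")
        sleep(poll_seconds)


class FakeFis:
    """Stand-in for a boto3 FIS client, returning a scripted list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def start_experiment(self, experimentTemplateId, clientToken):
        return {"experiment": {"id": "EXP123", "state": {"status": "pending"}}}

    def get_experiment(self, id):
        return {"experiment": {"id": id, "state": {"status": next(self._statuses)}}}


status, reason = run_and_wait(
    FakeFis(["initiating", "running", "completed"]),
    template_id="EXT456",        # placeholder template ID
    sleep=lambda seconds: None,  # don't actually wait in the demo
)
print(status)  # completed
```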

Designing Chaos Engineering Experiments on AWS

The most important thing is to start with clear goals and hypotheses. What do you want to learn about your system's resilience? What specific failure scenarios do you want to test? By answering these questions upfront, you'll be creating meaningful experiments, instead of just testing random stuff.

Here are some best practices for designing experiments:

Focus on testing critical components and services first
Start with small, low-risk experiments and gradually increase scope and severity (because you'll get things wrong in the beginning)
Define clear success criteria and metrics for each experiment (i.e. don't just break things randomly and act like that solves anything)
Involve stakeholders from across the organization to ensure what you test is aligned with business objectives (like you should do for anything in IT, but that's a discussion for another article)

Experiment Planning and Prioritization

Once you have a clear idea of the experiments you want to run, prioritize them based on their potential impact and risk. Some factors to consider when prioritizing experiments:

The criticality of the component or service being tested (i.e. how important this is)
The potential impact of a failure on end-users or business objectives
The likelihood of a failure occurring in the real world
The complexity and risk of the experiment itself
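
If you like having a number to argue about, you could turn these factors into a rough score. The formula and the 1 to 5 scales below are my own assumption, not an established method: value goes up with criticality, impact and likelihood, and down with the risk of the experiment itself.

```python
def priority(experiment):
    """Higher score means run it sooner. All factors are rated 1 (low) to 5 (high)."""
    value = (
        experiment["criticality"] * experiment["impact"] * experiment["likelihood"]
    )
    return value / experiment["experiment_risk"]


# Hypothetical backlog of experiments with made-up ratings.
backlog = [
    {"name": "Fail over primary database",
     "criticality": 5, "impact": 5, "likelihood": 2, "experiment_risk": 4},
    {"name": "Add latency to reporting service",
     "criticality": 2, "impact": 2, "likelihood": 3, "experiment_risk": 1},
    {"name": "Terminate one API instance",
     "criticality": 5, "impact": 4, "likelihood": 4, "experiment_risk": 1},
]

ranked = sorted(backlog, key=priority, reverse=True)
for experiment in ranked:
    print(f"{priority(experiment):6.1f}  {experiment['name']}")
```

The point isn't the exact numbers. It's that writing them down forces the conversation about what matters most.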

Again, it's also important to collaborate with stakeholders from across the organization when planning experiments. This includes involving teams like operations (if you have a separate ops team), development (again, if this is separate), and business stakeholders. You're the one responsible for your job, but you also need to make sure your job is adding value to the organization.

Experiment Execution and Monitoring

When executing experiments, it's important to:

Communicate the experiment plan and timeline to all relevant stakeholders. Don't break stuff unannounced!
Make sure that there are safety controls and rollback mechanisms in place. Don't break stuff you can't fix!
Monitor the experiment in real-time to identify any issues or unexpected behavior. Don't break stuff blindly!
Collect detailed logs and metrics for later analysis and reporting. Don't break stuff for no reason!

Pay attention to metrics like response time, error rate, and resource utilization, as well as logs from relevant services and components. If something breaks during the experiment, it's important to have a clear process in place for triaging and resolving the problem. This may include rolling back the experiment (totally cool! you can try again), adjusting parameters, or escalating to the appropriate teams for a more detailed investigation.

Analyzing and Interpreting Experiment Results

The first step in analyzing experiment results is to collect and aggregate all relevant data and metrics (which is the entire reason you're running these experiments!). This may include data from CloudWatch, X-Ray, and CloudTrail, as well as application metrics and logs.

Once you have the data, it's time to look for patterns and anomalies that can give you insights into how the system behaves under stress. Some key things to look for are:

Unexpected spikes or drops in key metrics, like response time or error rate
Cascading failures or dependencies between services and components
Resource constraints or bottlenecks that limit scalability or resilience
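
A simple starting point for spotting spikes and drops is to compare each metric during the experiment against its steady-state baseline. The sketch below flags anything that moved more than a tolerance. The 20% tolerance and the numbers are made up, and baselines are assumed to be non-zero.

```python
def find_anomalies(baseline, experiment, tolerance=0.20):
    """Return metrics whose relative change from baseline exceeds the tolerance."""
    findings = {}
    for name, base in baseline.items():
        change = (experiment[name] - base) / base  # assumes base != 0
        if abs(change) > tolerance:
            findings[name] = round(change, 2)
    return findings


# Hypothetical measurements before and during an experiment.
baseline = {"p95_latency_ms": 200.0, "error_rate": 0.01, "throughput_rps": 50.0}
during = {"p95_latency_ms": 230.0, "error_rate": 0.04, "throughput_rps": 48.0}

findings = find_anomalies(baseline, during)
print(findings)  # {'error_rate': 3.0}
```

Here latency rose 15% and throughput dropped 4%, both within tolerance, while the error rate quadrupled (a change of 3.0, meaning +300%).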

Based on the analysis, you can then identify areas for improvement and develop a plan for addressing any issues or weaknesses that were identified. This may include things like:

Optimizing resources and scaling
Improving error handling and retry mechanisms
Refactoring your application to be more resilient to failures
Updating runbooks and processes based on the lessons you learned

Measuring and Quantifying Resilience

One of the key benefits of Chaos Engineering is the ability to quantify and measure the resilience of your system over time, beyond just calculating a theoretical availability based on AWS services. Stakeholders probably won't care much about your specific numbers for error rates, downtime or MTTR, but really understanding how your system fails will let you communicate the risks and potential loss of value more clearly.

These are the metrics you want to track:

Mean time to recovery (MTTR): The average time it takes to recover from a failure or outage
Error rate: The percentage of requests that result in error responses or failures
Availability: The percentage of time that the system is functioning properly
Latency: The time it takes for the system to respond to requests

Another interesting metric is Mean Time Between Failures (MTBF), which is the average time between one failure and the next. You can't get that from chaos experiments, though. You'll need to track your production failures for it.
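
To show how these numbers fall out of plain incident records, here's a small Python sketch. The incident list is invented, and MTBF is computed as the average gap between the start of one failure and the start of the next, which is one of several definitions in use.

```python
def resilience_metrics(incidents, window_minutes):
    """incidents: non-empty list of (start_minute, end_minute), sorted by start."""
    durations = [end - start for start, end in incidents]
    downtime = sum(durations)
    starts = [start for start, _ in incidents]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return {
        "mttr_minutes": downtime / len(incidents),
        "availability": 1 - downtime / window_minutes,
        # MTBF needs at least two failures to have a gap to measure.
        "mtbf_minutes": sum(gaps) / len(gaps) if gaps else None,
    }


def error_rate(failed_requests, total_requests):
    return failed_requests / total_requests


# Hypothetical data: three outages in a 10,000-minute observation window.
incidents = [(1000, 1030), (4000, 4010), (7000, 7020)]
metrics = resilience_metrics(incidents, window_minutes=10_000)

print(metrics["mttr_minutes"])            # 20.0
print(round(metrics["availability"], 4))  # 0.994
print(metrics["mtbf_minutes"])            # 3000.0
print(error_rate(25, 10_000))             # 0.0025
```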

Implementing Chaos Engineering in Your AWS Environment

The most basic way to do this is by using AWS Fault Injection Simulator to create Chaos Experiments. Let's explore a few additional things that you might want to do.

Infrastructure as Code and Chaos Engineering

Depending on your setup, using Infrastructure as Code (IaC) tools like AWS CloudFormation or Terraform to create your experiments ranges from a nice extra to a key practice. By defining your experiments as code, you can version control them and automate the entire process of running them (assuming you're already creating your infrastructure using IaC).
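
As an example, here's a sketch that generates a CloudFormation template containing an `AWS::FIS::ExperimentTemplate` resource, written in Python so it can be printed as JSON. The logical names, the tag, and the two parameters are placeholders I chose. Check the resource's documentation before relying on it.

```python
import json

cfn_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "FisRoleArn": {"Type": "String"},    # role FIS assumes (placeholder name)
        "StopAlarmArn": {"Type": "String"},  # CloudWatch alarm used as stop condition
    },
    "Resources": {
        "StopOneInstanceExperiment": {
            "Type": "AWS::FIS::ExperimentTemplate",
            "Properties": {
                "Description": "Stop one tagged EC2 instance for 5 minutes",
                "RoleArn": {"Ref": "FisRoleArn"},
                "StopConditions": [
                    {"Source": "aws:cloudwatch:alarm", "Value": {"Ref": "StopAlarmArn"}}
                ],
                "Targets": {
                    "ApiInstances": {
                        "ResourceType": "aws:ec2:instance",
                        "ResourceTags": {"chaos-ready": "true"},  # hypothetical tag
                        "SelectionMode": "COUNT(1)",
                    }
                },
                "Actions": {
                    "StopInstances": {
                        "ActionId": "aws:ec2:stop-instances",
                        "Parameters": {"startInstancesAfterDuration": "PT5M"},
                        "Targets": {"Instances": "ApiInstances"},
                    }
                },
                "Tags": {"Name": "stop-one-instance"},
            },
        }
    },
}

print(json.dumps(cfn_template, indent=2))
```

Once the experiment lives in a file like this, it goes through code review and version control like the rest of your infrastructure.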

Chaos Engineering in CI/CD Pipelines

Already running chaos experiments in an automated way? Well, why not run them automatically before every deployment? This means incorporating them into your CI/CD pipelines, and effectively into your deployment process as another form of testing.

Keep in mind the following things though:

Define clear success and failure criteria for experiments
Integrate experiment results into your overall test reporting and analysis
Automate the rollback or promotion of changes based on the experiments' results
Make sure that experiments are run in a safe and controlled manner, such as in a dedicated environment
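
A pipeline gate based on these points could be as small as the sketch below. `gate` and `main` are hypothetical names. The idea is that a pipeline step runs the experiment, collects which success criteria were violated, and uses the exit code to promote or roll back.

```python
def gate(experiment_status, violations):
    """Decide what the pipeline should do after a chaos experiment."""
    # Anything other than a clean finish (e.g. a stop condition fired) blocks the release.
    if experiment_status != "completed":
        return "rollback", f"experiment ended with status {experiment_status!r}"
    if violations:
        return "rollback", "success criteria violated: " + ", ".join(violations)
    return "promote", "experiment completed and all success criteria held"


def main(experiment_status, violations):
    decision, reason = gate(experiment_status, violations)
    print(f"{decision}: {reason}")
    # A non-zero exit code fails the pipeline stage.
    return 0 if decision == "promote" else 1


exit_code = main("completed", [])
failing_exit_code = main("completed", ["p95_latency_ms above 300"])
```

In a real pipeline you'd pass the returned value to `sys.exit` so the stage fails when the decision is to roll back.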

Conclusion

Chaos Engineering sounds a little crazy at first. You're proactively injecting failures and testing your system's response, which will break stuff. However, this allows you to identify weaknesses and areas for improvement before they cause real-world impact.

AWS Fault Injection Simulator is a great platform for creating and running Chaos Engineering experiments on AWS. It has pre-built actions, integrates with other AWS services, and lets you set safety controls.

When implementing Chaos Engineering in your own AWS infrastructure, it's important to start small and focus on critical components and services first. Define clear goals and success criteria, collaborate with stakeholders to understand what's important to the business, and automate your experiments where possible.

Ultimately, the goal of Chaos Engineering is to test resilience, just like you test any functional requirements. In fact, resilience is a functional requirement! It's how the system behaves when something fails, which is about behavior. It's just a bit different in how you test it, that's all. Continuous testing is the only way to consistent quality, and continuous resilience testing is the only way to consistent availability.
