Welcome to the third part in our series about non-functional testing! This series kicked off as a result of the big problems surrounding Taylor Swift’s Eras tour ticket sales. Our contention is that it’s not a simple matter of Performance & Load Testing, but also the confluence of Performance, Security, and Chaos Testing that needs to be part of your strategic test plan.
In this article, we’re going to talk about the next topic: Chaos Testing!
In today's complex and interconnected world, even minor failures can have significant consequences, and the impact of these failures can be difficult to predict or manage. Chaos engineering provides a way for engineers to simulate real-world failures and faults in a safe and controlled environment and observe how their systems respond. By doing so, they can identify weaknesses in the system and take appropriate measures to prevent future failures.
In this blog, we will explore what chaos engineering is, how it works, why it is important, and some popular tools to get started with chaos experimentation.
Chaos engineering is the practice of intentionally introducing failures and faults into a system in a controlled manner to test its resilience and ability to recover. The idea is to simulate real-world scenarios that could cause a system to fail, such as a sudden surge in traffic or a hardware failure, and observe how the system responds. By doing so, engineers can identify weaknesses in the system and make improvements to prevent or minimize the impact of future failures.
Without this practice, you can only be reactive in your approach to outages and root-cause analysis, which is not ideal if there is downtime in your production systems. With it, you are always preparing for the worst-case scenario, taking pre-emptive measures and building confidence in your infrastructure.
The first step is to define the system that will be tested. This can include any aspect of a software application or infrastructure, such as servers, databases, networks, or APIs. We want to be able to interpret the results of our testing clearly, so a foundational understanding of the system architecture is important.
The next step is to identify potential weaknesses in the system that could lead to failure. This can be done through a variety of methods, such as analyzing logs, monitoring performance metrics, or conducting threat modeling. Prioritizing and narrowing the area under test leads to more useful results and better mitigating actions.
Once potential weaknesses have been identified, the next step is to design experiments that will intentionally introduce failures and faults into the system. These experiments should be carefully designed to minimize any negative impact on users and should be conducted in a controlled environment.
The final step is to analyze the results of the experiments and use them to improve the system. This can involve identifying new weaknesses, updating existing systems, or implementing new features to better handle failures.
Here’s an example of a Chaos Test you can run on an API:
Define the primary API under test – probably a microservice with some upstream dependencies.
Analyze the dependencies for relative impact and “surface area intersection” (i.e., the amount of interaction between the two APIs).
Determine which of the upstream services has the least amount of impact on the primary API. Let’s call this service “Omega.”
Run an experiment to incrementally add load to the primary API.
During the execution of the load test, shut off the “Omega” API.
You’ll learn a great deal about your systems (and the teams who implemented them) when you start shutting down dependent microservices during a load test.
Then, repeat the test for the other dependencies.
Your goal from this test (and subsequent tests) is first to understand the effects of such a shutdown on a system, then to design fallbacks, failovers, and graceful exits when such an event occurs in production.
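The steps above can be sketched as a small in-process simulation. This is only an illustration under stated assumptions: `FakeService`, `primary_api`, and `run_experiment` are hypothetical names, the "load" is a simple loop rather than a real load-testing tool, and the fallback shown is just one possible design for a low-impact dependency such as "Omega."

```python
from dataclasses import dataclass


class DependencyDown(Exception):
    """Raised when a simulated upstream service is unavailable."""


@dataclass
class FakeService:
    """Stand-in for an upstream dependency (hypothetical, in-process only)."""
    name: str
    up: bool = True

    def call(self) -> str:
        if not self.up:
            raise DependencyDown(self.name)
        return f"{self.name}: ok"


def primary_api(critical: FakeService, omega: FakeService) -> dict:
    """Simulated primary API: requires `critical`, treats Omega as optional."""
    body = {"core": critical.call()}
    try:
        body["extras"] = omega.call()
    except DependencyDown:
        body["extras"] = None  # graceful fallback: degrade instead of failing
    return body


def run_experiment(load_steps, shutoff_step):
    """Add load step by step and shut Omega off at `shutoff_step`."""
    critical, omega = FakeService("critical"), FakeService("omega")
    results = []
    for step, requests in enumerate(load_steps):
        if step == shutoff_step:
            omega.up = False  # the chaos event
        ok = degraded = failed = 0
        for _ in range(requests):
            try:
                response = primary_api(critical, omega)
            except DependencyDown:
                failed += 1
                continue
            if response["extras"] is None:
                degraded += 1
            else:
                ok += 1
        results.append(
            {"requests": requests, "ok": ok, "degraded": degraded, "failed": failed}
        )
    return results


if __name__ == "__main__":
    for row in run_experiment(load_steps=[10, 50, 100, 200], shutoff_step=2):
        print(row)
```

In a real experiment, the interesting output is the same kind of table: how many requests succeeded, degraded, or failed at each load level after the dependency went away. Removing the `try`/`except` in `primary_api` shows what the absence of a fallback looks like.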
Disaster Recovery (DR): Organizations are more and more exposed to liability as a result of poor Disaster Recovery planning. Customers are asking for DR reports, and these reports and results from experiments are part of the process of gaining regulatory compliance in some industries. One of the outputs of Chaos Engineering is a thorough DR report (especially if this is one of the goals from the beginning).
Increased System Resilience: By identifying weaknesses and vulnerabilities in a system, you can make targeted improvements that increase the system's overall resilience. This can result in fewer incidents, faster recovery times, and improved system performance.
Enhanced Confidence in Systems: By regularly practicing chaos engineering, you can gain greater confidence in the resilience of your systems. This can help teams feel more comfortable deploying changes and rolling out new features, knowing that they have a robust process for identifying and addressing any issues that arise.
More Efficient Resource Allocation: By identifying and addressing weaknesses in a system proactively, you can avoid costly downtime and other issues that can arise from unexpected failures. This can help teams allocate resources more efficiently, focusing on areas that will have the greatest impact on the system's overall performance and resilience.
Improved Organizational Culture: Finally, practicing chaos engineering can help promote a culture of continuous improvement and experimentation. By embracing failure and using it as an opportunity to learn and grow, teams can become more innovative and better equipped to handle the changing demands of modern technology environments.
System Security: As we noted in the article that inspired this series, Performance, Security, and Chaos Testing do not live in silos. In a sense, Chaos Testing is Security testing, because it aims to close down vulnerabilities in your system by illuminating all dark corners. You can’t guarantee that a system is secure if you don’t know how it behaves in error conditions you didn’t test.
By adding artificial delays to specific network connections, you can test how well systems handle slow or unresponsive services. This can help identify potential bottlenecks and other performance issues.
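As a rough sketch of the idea, the snippet below wraps a function so that every call is delayed, then checks whether the caller's timeout and fallback behave as intended. The names `with_latency`, `call_with_timeout`, and `fetch_price` are hypothetical; real tools usually inject the delay at the network layer rather than in application code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap `func` so that every call is delayed by `delay_seconds`."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)  # the injected latency
        return func(*args, **kwargs)
    return delayed


def call_with_timeout(func, timeout_seconds, fallback):
    """Call `func`; return `fallback` if no result arrives within the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; its worker thread finishes on its own.
        pool.shutdown(wait=False)


def fetch_price():
    """Hypothetical downstream call that normally answers immediately."""
    return 42


if __name__ == "__main__":
    slow_fetch_price = with_latency(fetch_price, delay_seconds=0.5)
    print("healthy:", call_with_timeout(fetch_price, 1.0, fallback=None))
    print("delayed:", call_with_timeout(slow_fetch_price, 0.05, fallback=None))
```

The useful question is what the caller does when the fallback is returned: serve a cached value, hide a feature, or fail the whole request.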
By intentionally killing processes or services, you can test how well systems recover from unexpected failures. This can help identify areas for improvement in automated recovery processes and other incident response measures.
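A minimal sketch of this kind of experiment, assuming a toy supervisor written for the example: the worker is just a child Python process that sleeps, and `Supervisor` and `kill_and_measure_recovery` are hypothetical names, not part of any chaos tool. In practice the supervisor would be your orchestrator or process manager, running independently of the test.

```python
import subprocess
import sys
import time

# Hypothetical worker: a child Python process that simply sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


class Supervisor:
    """Toy supervisor that restarts its worker once it has exited."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.restarts = 0
        self.proc = subprocess.Popen(cmd)

    def ensure_running(self):
        """Start a new worker if the current one is no longer alive."""
        if self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.cmd)
            self.restarts += 1

    def stop(self):
        """Terminate the worker and wait for it to exit."""
        self.proc.kill()
        self.proc.wait()


def kill_and_measure_recovery(supervisor, poll_interval=0.05, deadline=5.0):
    """Kill the worker, then time how long the supervisor takes to recover."""
    supervisor.proc.kill()  # the chaos event
    supervisor.proc.wait()  # make sure the process is really gone
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        supervisor.ensure_running()
        if supervisor.proc.poll() is None:
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None  # no recovery within the deadline


if __name__ == "__main__":
    supervisor = Supervisor(WORKER_CMD)
    try:
        recovery = kill_and_measure_recovery(supervisor)
        print("recovery time (s):", recovery, "restarts:", supervisor.restarts)
    finally:
        supervisor.stop()
```

The recovery time and restart count are the kind of measurements worth tracking across runs, since a regression in either one points at the automated recovery path.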
By introducing faults, such as network errors or database failures, you can test how well systems handle unexpected errors. This can help identify areas for improvement in error handling and recovery processes.
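One way to picture this is a wrapper that makes a call fail at a configurable rate, combined with the retry logic you want to evaluate. The sketch below is an assumption-laden illustration: `InjectedFault`, `inject_faults`, `call_with_retries`, `success_rate`, and `read_record` are all hypothetical names, and the faults are raised in-process rather than by a real network or database.

```python
import random


class InjectedFault(Exception):
    """Simulated error, such as a dropped connection or a failed query."""


def inject_faults(func, failure_rate, rng):
    """Wrap `func` so that a fraction `failure_rate` of calls raise a fault."""
    def flaky(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected fault")
        return func(*args, **kwargs)
    return flaky


def call_with_retries(func, attempts):
    """Call `func` up to `attempts` times; re-raise if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts:
                raise


def success_rate(func, attempts, trials):
    """Fraction of trials in which the retried call eventually succeeded."""
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(func, attempts)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


def read_record():
    """Hypothetical database read that always succeeds when not sabotaged."""
    return {"id": 1}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    flaky_read = inject_faults(read_record, failure_rate=0.5, rng=rng)
    print("no retries:", success_rate(flaky_read, attempts=1, trials=1000))
    print("3 attempts:", success_rate(flaky_read, attempts=3, trials=1000))
```

Comparing success rates with and without retries gives a first indication of whether the error handling is doing useful work at a given fault rate.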
Load and Performance Testing is meant to stress-test a system under “normal” and “peak” conditions. It is not necessarily intended to anticipate DDoS-level conditions (or Taylor Swift announcements). When you get above a certain threshold of traffic, you’re entering the field of Chaos Engineering. This isn’t for finding normal system bottlenecks; it is for exposing areas of the system that are vulnerable to catastrophic failure.
Chaos Engineering helps companies identify and mitigate potential system failures before they occur. By intentionally introducing failures into a system, engineers can gain a deeper understanding of how the system works and how it responds to different types of failures. This can help them make improvements to prevent future failures and ensure that their systems are resilient and reliable.
Here are some real-world scenarios that show why this matters, and what can happen when you are not prepared:
A popular sporting event is taking place, and your product receives a surge of traffic. Your application fails and there are outages in production. Revenue is lost, your reputation is damaged, and new user sign-ups suffer. Meanwhile, your product is vulnerable to attacks from unknown threat vectors.
You have launched a new business and promoted a special offer, but the database storing customer data fails and offers are not being issued correctly. Your customer base loses trust, and you make a bad first impression.
One region of AWS suffers an outage. Your whole system isn’t hosted in that region, only one small microservice that handles credit card transactions for that area of the country. Suddenly, millions of users in six states can’t buy anything, but you haven’t figured out what’s going on yet because your system isn’t experiencing an outage, per se; you’re getting reports directly from users. Troubleshooting this is very difficult if you haven’t run across these conditions before.
Here is a brief list of products and tools that can be used for chaos engineering. There are many others available, both open-source and paid, so it's worthwhile to define your expectations and requirements before adopting one.
An established enterprise solution is Gremlin. They follow a process of baseline, remediate, and automate, and provide an easy-to-use interface for Chaos Engineering and reliability management.
‘Traditional approaches to improving reliability don’t fit modern software development. Gremlin's Reliability Management platform includes everything you need to standardize and automate reliability at scale—without waiting for incidents.’
The original ‘Chaos Monkey’ built by Netflix is available as open source and is a good starting point. It comes with detailed documentation to help you get started.
‘Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.’
A solution that offers a bit of both, but specializes in Kubernetes infrastructure, is Chaos Mesh. It is open-source and receives frequent updates and improvements. It provides chaos orchestration and experiment monitoring dashboards for visual reporting and analysis.
‘Using Chaos Mesh, you can conveniently simulate various abnormalities that might occur in reality during the development, testing, and production environments and find potential problems in the system…You can easily design your Chaos scenarios on the Web UI and monitor the status of Chaos experiments.’
Chaos engineering is a valuable practice that can help companies proactively identify and mitigate potential system failures. By intentionally introducing failures into a system, engineers can gain a deeper understanding of how the system works and how it responds to different types of failures. This can help them make improvements to prevent future failures and ensure that their systems are resilient and reliable. As the complexity of software applications and infrastructure continues to grow, the importance of chaos engineering will only increase.
Gary Parker is currently working as a Senior QA Architect, responsible for QA architecture, tooling, frameworks, and processes. He specializes in front-end web and mobile technologies and has almost 10 years of experience in the QA industry across many different domains, products, and environments. He enjoys writing technical blogs as a way to keep up-to-date with the industry and ensure a deeper understanding of the topics at hand. You can also follow him on Twitter.