Large-scale distributed software systems are changing the playing field for software engineering. The industry has been quick to adopt practices that make development more flexible and deployment faster. That raises an urgent question: how much confidence can we have in the complex systems we put into production?
Even when the individual services in a distributed system all work properly, it is the interactions between them that matter. When you least expect it, those interactions can produce unpredictable, and often unfavorable, outcomes. This uncertainty makes distributed systems inherently chaotic, and disruptive real-world events that hit production environments only make matters worse.
Defects that can lead to disaster
It is essential to identify weaknesses before they manifest as abnormal behavior across the system. These weaknesses can take many forms, including:
- Improper fallback settings when a service is unavailable.
- Outages when a downstream dependency receives more traffic than it can handle.
- Retry storms caused by improperly tuned timeouts.
- Cascading failures when a single point of failure crashes.
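One common mitigation for the retry-storm weakness above is exponential backoff with jitter, so that thousands of clients do not all retry at the same instant. Here is a minimal sketch; the function name and parameters are illustrative, not from any particular library:

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Retry `op` with exponential backoff and full jitter.

    Capped, randomized delays keep a brief timeout from turning into a
    retry storm, where every client hammers the dependency in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random amount up to the capped backoff.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injected so the behavior can be tested without real delays.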
It is important to proactively surface the most significant of these weaknesses before they have a chance to affect customers in production. That calls for a way to manage the chaos inherent in these systems, so that teams can take full advantage of growing flexibility and velocity while remaining confident in their production deployments, despite the complexity they represent.
Addressing the chaos in distributed systems at scale requires a systemic approach, one that builds confidence in those systems' ability to withstand realistic conditions. We learn about a system's behavior by observing it during a controlled experiment. This is what is known as 'chaos engineering'.
What is it?
Chaos engineering is a disciplined approach to identifying failures before they become outages. It builds confidence in a system's ability to handle unstable and unpredictable conditions. By proactively testing how a system responds under stress, you can identify and fix failures before they spiral out of control.
The process involves setting up monitoring tools and actively running chaos experiments against your systems in production, so you can watch in real time how your service responds to pressure. Chaos engineering practices look different for every team that adopts them, but they always amount to one technique: deliberately injecting chaos into your systems.
Chaos engineering lets teams learn from failure in its truest form, and gain knowledge and insight from the performance of their applications and infrastructure. The practice allows you to compare what you believe will happen with what actually happens in your systems. In effect, you break things on purpose in order to learn how to build stronger systems.
The concept grew out of a basic, straightforward fact about software development. A software system's ability to tolerate failures while still providing adequate quality of service, usually generalized as resiliency, is typically a requirement. Yet development teams frequently fail to meet it, largely due to factors such as short deadlines or a lack of knowledge of the field. The overall purpose of the chaos engineering technique is to help meet this resilience requirement.
The history of chaos engineering traces back to Netflix. Over the years, the company evolved its infrastructure to support increasingly complex activity, a crucial step for a customer base that has grown to more than 100 million users in over 190 countries.
The company's original rental and streaming services ran on on-premises servers. Successful as this was, the setup created a single point of failure, along with several other issues. In August 2008, corruption in a major database caused a three-day outage during which Netflix was unable to ship any DVDs to its customers. Netflix engineers responded by searching for an alternative architecture, and in 2011 they made a major change: they moved the company's monolithic on-premise stack to a distributed, cloud-based architecture running on Amazon Web Services.
The new architecture, consisting of hundreds of microservices, removed that single point of failure. But it also introduced new forms of complexity that demanded more dependable, fault-tolerant systems. It was at this point that Netflix's engineering teams learned an important lesson: the best way to avoid failure is to fail constantly.
Pursuing that goal led to the creation of Chaos Monkey, a tool Netflix uses to deliberately cause failures in random places at random intervals throughout its systems. As the tool's maintainers on GitHub put it:
“Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment.”
By using Chaos Monkey, engineers can determine whether the services they are building are resilient enough to tolerate sudden failures.
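The core idea can be illustrated with a toy sketch. This is not Netflix's implementation; it models a fleet as a plain list of instance IDs and takes the platform's termination call (an AWS or Kubernetes API call in practice) as a parameter:

```python
import random

def chaos_monkey(fleet, terminate, probability=0.2, rng=random):
    """Randomly terminate instances from `fleet`, Chaos Monkey style.

    `fleet` is a list of instance IDs; `terminate` is whatever your
    platform uses to kill one instance. Each instance is independently
    selected with the given probability. Returns the terminated IDs.
    """
    victims = [i for i in fleet if rng.random() < probability]
    for instance_id in victims:
        terminate(instance_id)
    return victims
```

Injecting `rng` makes the selection reproducible in tests; the real tool, of course, runs on a schedule against live infrastructure.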
Not technically “chaotic”
Kolton Andrus, CEO of the chaos engineering startup Gremlin, previously worked at both Google and Netflix. He has suggested thinking of chaos engineering as a kind of flu shot. On the surface, it seems crazy to deliberately inject something harmful into your body in the hope of preventing future illness. Yet, as Andrus argues, the same approach works for distributed cloud-based systems.
At its core, chaos engineering involves carefully injecting harm into systems to test how they respond. This gives companies the chance to prepare and practice for potential outages and failures, and to minimize the effects of downtime before it occurs.
The keyword in chaos engineering is "carefully." Contrary to what its name implies, chaos engineering is not necessarily chaotic, at least not in the usual sense. In fact, very few of these tests are random or turbulent. Chaos engineering instead involves thoughtful, planned experiments designed to demonstrate how your systems behave when they experience failure.
As Russ Miles, founder and CEO of ChaosIQ.io (a European chaos engineering platform), put it in an interview:
“Of all the chaos engineering experiments that I have conducted with customers over the last year, I can probably count just one or two that have had a random quota to them. Most of them are very careful, very controlled, proper experiments. It really has nothing to do with randomness, unless randomness is the thing you’re trying to test for.”
There are some key principles to keep in mind when exploring chaos engineering further. The following principles describe an ideal application of the method, particularly with respect to the process of experimentation. The degree to which you pursue these principles corresponds strongly to the confidence you can have in a distributed system.
1 – Define what is “normal” for your system
Chaos engineering is similar to the scientific method: without a control group and a variable group, you have nothing to measure against. It is therefore essential to first define your application's or service's "normal" state.
A team's first priority should be defining the key metrics it needs for tracking, monitoring, and measuring its system's output, so it can determine what constitutes normal behavior. Once a team understands which metrics indicate that its service is stable, it can define the thresholds that pinpoint when the system is suffering. Every team that wants to apply chaos engineering principles must first understand what its service looks like when it is operating properly.
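The threshold idea above can be sketched as a steady-state check. The metric names and bounds here are hypothetical examples; you would substitute whatever your monitoring exports:

```python
def steady_state_ok(metrics, thresholds):
    """Return True if every key metric is within its allowed bounds.

    `metrics` maps metric name -> current value; `thresholds` maps
    metric name -> (low, high). A missing metric counts as a violation,
    since you cannot claim "normal" for a value you are not observing.
    """
    return all(
        low <= metrics.get(name, float("nan")) <= high
        for name, (low, high) in thresholds.items()
    )
```

An experiment runner would poll this check during a chaos test and abort the experiment as soon as it returns False.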
2 – Disrupt your system’s “normal” in a realistic manner
There are plenty of ways a system can fail, and the best thing to do is acknowledge them. Hold brainstorming sessions with your team to identify realistic ways your system could fail, and take note of the failures that are more likely than others. Then think about how to disrupt your system's "normal": anything from synthetic traffic spikes to intentionally destroying servers. The possibilities for chaos are endless, but it is imperative that your experiments and chaos tests reflect scenarios that are actually likely to happen.
Sooner or later, you will learn how these failures affect the system as a whole. Only then can you make real changes to your technology, which in turn results in more resilient services.
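One realistic disruption is a dependency that slows down or starts erroring. A simple way to rehearse that is to wrap a call with a fault-injecting decorator; this sketch is illustrative and uses injected `rng` and `sleep` hooks rather than any specific chaos tool:

```python
import random
import time

def inject_faults(fn, error_rate=0.0, extra_latency=0.0,
                  rng=random, sleep=time.sleep):
    """Wrap `fn` so calls sometimes fail or slow down.

    Added latency mimics a slow downstream dependency; raised errors
    mimic an unavailable one. Both are realistic "normal" disruptions.
    """
    def wrapped(*args, **kwargs):
        if extra_latency:
            sleep(extra_latency)          # simulate a slow network hop
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

Running your service's call path through such a wrapper shows whether fallbacks and timeouts actually engage when the dependency misbehaves.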
3 – Minimize the blast radius
Experiments, by nature, seek out the unknown, including the unpredictable repercussions of failure. The idea is to shed light on notable vulnerabilities without accidentally blowing everything up. This is what's called "minimizing the blast radius."
Whenever you run tests against the unknowns in your applications and infrastructure, negative customer impact is possible. It is ultimately the responsibility of the chaos organizer to properly contain the blast radius of the tests, and to ensure that the team is prepared for incident response, just in case.
As long as the blast radius is properly contained, these outages and failures can generate enlightening insights without causing destructive harm to customers, and in the process help your team build more robust software for the future.
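A common way to contain the blast radius is to expose only a tiny, deterministic slice of traffic to the experiment. One possible approach (the percentages and bucketing scheme here are just an example) is to hash the request or user ID into buckets:

```python
import hashlib

def in_blast_radius(request_id, percent=1.0):
    """Deterministically select ~`percent`% of requests for an experiment.

    Hashing the ID keeps the affected slice small and stable: the same
    request always lands in or out, which makes impact easy to track
    and the experiment easy to shrink or switch off.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 10000
    return bucket < percent * 100
```

Fault injection is then applied only when `in_blast_radius(...)` returns True, so an experiment gone wrong touches roughly one request in a hundred rather than all of them.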
4 – Test in production
Suppose you want to jump right into chaos testing. Starting in staging is fine, but eventually you will want to run chaos experiments in production. Realistically, applying the principles of chaos to production environments is the only way to truly see how failures and outages will affect your system and your customers. Given the risks of chaos engineering in production, minimizing the blast radius of your experiments becomes all the more crucial.
5 – Continuous chaos
One of the great things about Netflix's 'Simian Army' is that its tools continuously run chaos through the architecture. Constant chaos tests help the team pinpoint problems, freeing more time for building new features and services. By automating chaos to the same degree as your CI/CD (Continuous Integration/Continuous Deployment) pipeline, you continuously deliver improvements that strengthen both current systems and future ones.
Deeper knowledge of the system lets a team develop new services with fewer issues. When incidents inevitably accompany new releases, you can detect the problems much faster and mitigate them with very little customer impact.
6 – Test with more confidence
Confidence is a key part of chaos engineering. Do your research before running any chaos experiment so you can minimize the blast radius, and make sure engineers are readily available in case of an emergency.
In the end, though, you simply have to let the chaos loose and see what happens. Trust your approach to chaos testing and take notes as you conduct it; only then can you learn from your mistakes. Doing so will improve how you approach chaos engineering and deliver better systems faster.
The teams behind these systems obviously reap the benefits of chaos engineering. But where does that leave everyone else: the customers, the business, and even the technology itself? They get their fair share of the perks as well.
Customers see a considerable increase in the service's availability and durability, meaning far fewer outages disrupting their daily lives. For businesses, chaos engineering can prevent extremely large losses, not just in revenue but also in maintenance costs, while making engineers' work more satisfying and convenient. The outcome is better on-call training for engineering teams and improvements to the company-wide SEV (incident) management program.
Finally, there are the technical benefits. The insights gained from chaos experiments can mean a substantial reduction in incidents and on-call burden, a better understanding of system failure modes, improved system design, faster detection of SEVs, and far fewer repeat SEVs.
Chaos experiments often target the classic fallacies of distributed computing, noteworthy misconceptions that include the following:
- The network is 100% reliable
- There is zero latency
- There is an infinite amount of bandwidth
- The network is completely secure
- Topology does not change in any way
- There is only one administrator
- The cost of transport is zero
- The network is homogeneous
Most of these fallacies drive the design of chaos engineering experiments, such as "packet-loss attacks" and "latency attacks." A network outage, for example, can provoke a wide variety of application failures with severe customer impact.
Applications might stall while waiting for a packet, or permanently consume excessive memory or other system resources. Even after a network outage ends, applications may fail to retry delayed operations, or may retry but be too aggressive about it. In some cases, applications will require a manual restart. It is wise to test each of these scenarios and prepare for them.
All in all, chaos engineering is a powerful practice. It is gradually changing the way we design software and helping to develop large-scale operations. Where other practices focus on velocity and flexibility, chaos engineering tackles systemic uncertainty. The principles above help build the confidence to innovate quickly at massive scale while giving customers the high-quality experiences they deserve.
Before you get into chaos engineering, you of course need to stabilize your people, processes, and technology. You should have a proactive system for detecting and responding to incidents, and you should not be spending most of your time responding to incidents in production.
In short, secure your system against real chaos before you start running intentional chaos. Then, once you are ready to practice chaos engineering, have a comprehensive system for incident response and mitigation in place; it will help you visualize your system's performance and collaborate when an incident occurs.