How to make your system Fault Tolerant? Crash it! — Netflix Chaos Monkey
Netflix is a global video streaming giant. It receives roughly 1 billion visits per month, with an average visit lasting about 27 minutes. If you are a software developer, you know that is a huge scale to operate at.
Netflix's systems are mostly hosted in the cloud, and running in the cloud is all about redundancy and fault tolerance. Since no single application component can guarantee 100% uptime (and even the most expensive hardware eventually fails), companies like Netflix have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. Simply adding fault tolerance techniques to a system at this scale is not enough, though: the system also has to be prepared for failures that occur once in a blue moon, and recovery paths for such failures are rarely exercised. And no matter how hard you try, you can't guarantee that every developer in a company of this size always writes code that is resilient to hardware failures. Enter Chaos Monkey.
“By deliberately inducing faults, you ensure that the fault tolerance machinery is continually exercised and tested which can increase your confidence that faults will be handled correctly when they occur naturally.” — Martin Kleppmann
Chaos Monkey is a tool that randomly disables production instances to make sure the system can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data centre (or cloud region) to randomly shoot down instances and chew through cables, all while the system continues serving customers without interruption.
According to Netflix — “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.”
Chaos Monkey randomly picks instances from the production fleet and shuts them down during business hours. This has built a habit among Netflix engineers of designing more resilient applications that can easily withstand these kinds of failures.
According to Netflix — “Knowing that this would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world. We value Chaos Monkey as a highly effective tool for improving the quality of our service.”
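To make the idea concrete, here is a minimal sketch in Python of the behaviour described above: pick one healthy instance at random and disable it, but only during business hours. This is a toy in-memory simulation, not Netflix's actual implementation. The names `Instance`, `ServiceGroup`, `is_business_hours` and `unleash_monkey` are hypothetical, and the Monday to Friday, 09:00 to 17:00 window is an assumption made for illustration.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical stand-in for a production server.
    instance_id: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    # Hypothetical group of redundant instances behind one service.
    name: str
    instances: List[Instance] = field(default_factory=list)

    def healthy_instances(self) -> List[Instance]:
        return [i for i in self.instances if i.healthy]

    def handle_request(self) -> str:
        # Route to any healthy instance; fail only if none are left.
        healthy = self.healthy_instances()
        if not healthy:
            raise RuntimeError(f"{self.name}: no healthy instances left")
        return random.choice(healthy).instance_id


def is_business_hours(now: datetime) -> bool:
    # Assumed window: Monday to Friday, 09:00 to 17:00.
    return now.weekday() < 5 and 9 <= now.hour < 17


def unleash_monkey(group: ServiceGroup, now: datetime,
                   rng: random.Random) -> Optional[Instance]:
    # Disable one random healthy instance, but only while engineers
    # are around to watch. Returns the victim, or None if nothing ran.
    if not is_business_hours(now):
        return None
    healthy = group.healthy_instances()
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    group = ServiceGroup("api", [Instance(f"i-{n}") for n in range(3)])
    victim = unleash_monkey(group, datetime(2024, 1, 3, 12, 0), random.Random())
    print("disabled:", victim.instance_id if victim else None)
    # With redundancy in place, the service still answers.
    print("request served by:", group.handle_request())
```

The point of the sketch is the last two lines: because the group has redundant instances, losing one at random does not stop requests from being served. A service with a single instance would fail here immediately, which is exactly the kind of weakness the tool is meant to surface while people are at their desks.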
This practice is called Chaos Engineering. Many large tech companies use it to better understand their distributed systems and microservice architectures, including Twilio, Netflix, LinkedIn, Facebook, Google, Microsoft, Amazon, and many others, and the list keeps growing.
So, do you think it's crazy or not?
If you have made it this far, please like this article, and follow #tech-granth for more articles like it.