Netflix was an early proponent of chaos engineering culture. Source: Shutterstock

AWS brings chaos engineering to the business masses

A discipline pioneered at streaming giant Netflix around a decade ago, chaos engineering is a practice that allows companies to test the resilience of their systems by simulating worst-case scenarios.

The practice is essential to the reliability of the services we use and rely on every day. It’s comparable to running fire drills within increasingly complex, microservice-laden IT engine rooms. Chaos engineering enables development teams to find and fix flaws in systems before they cause a problem. A breakdown or interruption to services is hugely damaging for digital businesses – just look at Google’s Cloud and Workspace outages this week.

So far the practice has been the preserve of large companies such as Uber, JP Morgan Chase, and Amazon itself. With every company now a tech company to some degree, AWS’ launch of Fault Injection Simulator aims to bring chaos engineering to the mainstream as a new XaaS offering.

“We believe that chaos engineering is for everyone, not just shops running at Amazon or Netflix scale. And that’s why today I’m excited to pre-announce a new service built to simplify the process of running chaos experiments in the cloud,” Amazon CTO Werner Vogels said at AWS re:Invent this week. 

Vogels said that AWS Fault Injection Simulator is a fully managed service for running ‘fire drills’ on applications hosted on AWS.

“FIS makes it easy to run safe experiments. We built it to follow the typical chaos experimental workflow where you understand your steady-state, set a hypothesis, and inject faults into your application. When the experiment is over, FIS will tell you if your hypothesis was confirmed, and you can use the data collected by CloudWatch to decide where you need to make improvements,” he explained.
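The workflow Vogels describes (measure the steady state, set a hypothesis, inject a fault, check the result) can be illustrated with a small toy in plain Python. This is only a sketch under stated assumptions: it does not use the Fault Injection Simulator API, and every name in it (make_service, run_experiment, the replica list, the 1% error threshold) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline_error_rate: float   # steady state, before the fault
    faulted_error_rate: float    # measured while the fault is active
    hypothesis_confirmed: bool   # did the system stay within the threshold?


def make_service(replicas_up: List[bool], retries: int) -> Callable[[int], bool]:
    """Toy service (hypothetical): each request goes to one replica and,
    on failure, is retried on up to `retries` further replicas."""
    def handle(request_id: int) -> bool:
        n = len(replicas_up)
        for attempt in range(retries + 1):
            if replicas_up[(request_id + attempt) % n]:
                return True
        return False
    return handle


def error_rate(service: Callable[[int], bool], requests: int) -> float:
    """Fraction of requests that fail."""
    failures = sum(1 for i in range(requests) if not service(i))
    return failures / requests


def run_experiment(replicas_up: List[bool], retries: int, fault_index: int,
                   max_error_rate: float, requests: int = 300) -> ExperimentResult:
    service = make_service(replicas_up, retries)

    # 1. Understand the steady state.
    baseline = error_rate(service, requests)

    # 2. Hypothesis: with one replica down, the error rate stays
    #    at or below max_error_rate.

    # 3. Inject the fault, measure, and always roll the fault back.
    replicas_up[fault_index] = False
    try:
        faulted = error_rate(service, requests)
    finally:
        replicas_up[fault_index] = True

    # 4. Report whether the hypothesis held.
    return ExperimentResult(baseline, faulted, faulted <= max_error_rate)


if __name__ == "__main__":
    for retries in (1, 0):
        result = run_experiment([True, True, True], retries=retries,
                                fault_index=0, max_error_rate=0.01)
        print(f"retries={retries}: {result}")
```

In this toy, the version with one retry survives the loss of a replica, while the version without retries does not. That is the kind of flaw such an experiment is meant to surface before a real outage does.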

Earlier this year, our sister publication TechHQ interviewed Kolton Andrus, the CEO of Gremlin, a chaos engineering SaaS company that provides companies with the tools to safely run tests to improve the health of their systems.

“There’s a little bit of [apprehension] until you’ve run it yourself, and the world hasn’t caught on fire,” he said on initial attitudes toward the practice. “There’s no doubt it’s just the best practice – it’s how [all] software will be built in ten years.”

While outages of Google Workspace, or those of trading apps, e-commerce websites, and streaming services, may seem trivial in the scheme of things, the importance of airtight resilience hits home when we take stock of the more central role software now plays in areas such as healthcare, online voting, and even autonomous driving.

“There’s a lot of critical parts of society that are going to be emerging over the next 10 and 20 years, whether it’s drones or self-driving cars, whether it’s elections, whether it’s how money is transferred and exchanged – there’s a lot of important things where people’s safety could be at risk,” said Andrus.

“I’m a big believer that if you want me to get in a self-driving car, I would hope that you’re taking every step possible to mitigate risk and ensure that the system will operate when things go wrong.”

AWS plans to make Fault Injection Simulator available next year.