Chaos & Resilience Engineering talk
@jabenninghoff
Update: by request, I’ve posted handouts for my Secure360 version of this talk here.
I’m giving a talk next Tuesday (9/24) at the September OWASP MSP Meeting on “Chaos & Resilience Engineering”. Because the talk is told as a story and a demo, I won’t be posting copies of the slides, but I am including an abstract and a list of references here. The talk tells the story of my journey to find chaos engineering, introduces chaos engineering, describes how it is complemented by resilience engineering, and discusses how to get started and join the movement.
Note: I presented a version of this talk at an internal company conference in 2019, which led me to create the Chaos & Resilience Engineering Guild. Later on, I left security and moved to Infrastructure to start a Site Reliability Engineering practice.
Abstract
Chaos engineering started at Netflix in 2011 with the invention of the Chaos Monkey, a tool that intentionally disrupted systems on the production network to discover systemic weaknesses so that they could be removed. Since then, the Chaos Monkey has grown to become the Simian Army, and chaos engineering has spread to a global community that develops free & commercial tools to facilitate experiments in QA and production.
My journey to chaos & resilience engineering started in 2009 with my desire to find a better way, leading me to the world of safety science and to its connection to the work at Netflix, Etsy, and elsewhere. In this talk, I’ll explain chaos engineering, the prerequisites for doing it in production, and how it relates to resilience. I will share some of the work I’ve done in chaos engineering (in a small way) and resilience engineering (in a larger way), and also ask attendees to share their own experiences in chaos & resilience engineering - you might not realize how easy it is to get started, or know that you’re already doing it!
My Journey to Chaos Engineering
- Risk Homeostasis
- The Checklist Manifesto
- How Complex Systems Fail (video)
- Engineering a Safer World
- STAMP/STPA/CAST
- Managing Risk and System Change
- Secure360
Chaos & Resilience Engineering
- Chaos Monkey
- Simian Army (retired)
- Chaos Engineering Book
- Awesome Chaos Engineering
- Gremlin (free, limited-feature version available for up to 5 nodes)
- Gremlin Demo
- Principles of Chaos Engineering:
  - Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  - Hypothesize that this steady state will continue in both the control group and the experimental group.
  - Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  - Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
- Resilience Engineering Book
- Four Potentials of Resilience
- Etsy Blameless Post-Mortem
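To make the four Principles of Chaos Engineering listed above a little more concrete, here is a minimal sketch in Python. It is a toy simulation, not the API of Chaos Monkey, Gremlin, or any other tool: the `Service` class, the `failover` flag, the `success_rate` metric, and the 1% tolerance are all hypothetical names and values I made up for illustration.

```python
class Service:
    """Toy service (hypothetical): requests are spread round-robin over replicas."""

    def __init__(self, replica_count=3, failover=False):
        self.healthy = [True] * replica_count
        self.failover = failover
        self._next = 0

    def handle_request(self):
        n = len(self.healthy)
        first = self._next
        self._next = (self._next + 1) % n
        # With failover, a request may try every replica; without it, only one.
        attempts = n if self.failover else 1
        return any(self.healthy[(first + k) % n] for k in range(attempts))


def success_rate(service, requests=300):
    """Principle 1: steady state as a measurable output (fraction of successful requests)."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(failover, tolerance=0.01):
    # Principle 2: hypothesize that steady state holds in both groups.
    control = Service(failover=failover)
    experimental = Service(failover=failover)

    # Principle 3: introduce a real-world event (here, one crashed replica).
    experimental.healthy[0] = False

    # Principle 4: try to disprove the hypothesis by comparing the two groups.
    control_rate = success_rate(control)
    experimental_rate = success_rate(experimental)
    hypothesis_holds = abs(control_rate - experimental_rate) <= tolerance
    return control_rate, experimental_rate, hypothesis_holds


if __name__ == "__main__":
    for failover in (False, True):
        control_rate, experimental_rate, holds = run_experiment(failover)
        print(f"failover={failover}: control={control_rate:.2f}, "
              f"experimental={experimental_rate:.2f}, hypothesis holds: {holds}")
```

In this toy model, the experimental group without failover serves only two thirds of its requests, so the hypothesis is disproved and a weakness is found; with failover enabled, both groups stay at 100% and the hypothesis survives. A real experiment would measure a production metric instead of a simulated one, but the shape of the comparison is the same.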
How to get started and join the movement
- After-Action Review:
  - What was expected to happen?
  - What actually happened?
  - Why were these different?
  - What has been learned?
- John Boyd’s OODA Loop
- Situation Awareness
- Safety II
- FMEA
- STPA/CAST Handbooks
- Veracode State of the Software V9