Chaos & Resilience Engineering @ Secure360
· @jabenninghoff
Update: by request, I’ve posted handouts for this talk here.
I’m speaking at Secure360 on May 5, 2020, presenting an updated version of “Chaos & Resilience Engineering”. As I’ve done before, I won’t be posting copies of the slides. Instead, I’m posting an updated list of references from the talk here.
Note: this post includes some additional references (italicized) that are not in the final version of the talk.
My story is told in three acts and a closing section: my journey to find chaos engineering (ACT I), chaos engineering and how resilience engineering complements it (ACT II), what I’ve learned so far (ACT III), and how to get started with chaos & resilience engineering (END).
ACT I: My Journey to Chaos Engineering
- Risk Homeostasis
- The Checklist Manifesto
- How Complex Systems Fail (video)
- Engineering a Safer World
- STAMP/STPA/CAST
- TCD: Managing Risk and System Change
- Lund: Human Factors & System Safety
ACT II: Chaos & Resilience Engineering
- Chaos Monkey
- Simian Army (retired)
- Chaos Engineering Book
- Chaos Engineering: System Resiliency in Practice (new book, 2020)
- Principles of Chaos Engineering
- Resilience Engineering Book
- The Four Potentials of Resilience
- Etsy Blameless Post-Mortem
ACT III: What I’ve learned so far
- Lesson 1: Incident Management Teams in Technology are similar to those in Oil & Gas
  - Crichton, M. T., Lauche, K., & Flin, R. (2005). Incident Command Skills in the Management of an Oil Industry Drilling Incident: a Case Study (PDF)
  - Muhren, W. J., van den Eede, G. G. P., & van de Walle, B. A. (2007). Organizational learning for the incident management process (PDF)
  - Situation Awareness
  - Dossier 1: A sociotechnical case study of an IT major incident management team
- Lesson 2: Safety has risk assessment methods that can be applied to computer systems
  - NIST 800-30
  - STPA Handbook
  - FMEA (Failure mode and effects analysis)
  - GameDay Discussion (2012)
  - Dossier 3: A comparison of NIST and STPA risk assessment methods applied to an informational website
- Lesson 3: Changes cause outages
END: How to get started with chaos & resilience engineering
- Chaos Engineering – break stuff
- Resilience Engineering – fix stuff
- DevOps – build stuff
- information-safety.org resources