Two years of study and research have changed how I see risk. Safety science taught me that improving performance is the key to managing risk, and that studying successes is the key to risk analysis. The ‘New School’ of safety argues that you can’t have a science of non-events; safety comes through being successful more often, not through failing less. Research in DevOps, software security, and security programs shows a strong link between general performance and security performance. In many (but not all) cases, organizations most effectively reduce cybersecurity risk by improving general performance, not by improving one-dimensional security or reliability performance.
This talk presents a new model of security performance that informs how we can maximize the value of our security investments: by improving existing organizational capabilities, or creating new ones in response to new and emerging threats, where general performance falls short. It reviews the theory that improving performance improves safety, how that theory relates to cybersecurity risk, evidence from my own and others’ research that supports it, and how it can be used to analyze and manage risk more effectively.
Talk
The talk is broken into three sections and covers both the theory and how to apply it to best improve security performance.
- Assumptions backed by accepted theory
  - Assumption 1: organizations are sociotechnical systems
  - Assumption 2: all failures are systems failures
- Arguments for a new theoretical model backed by evidence
  - Argument 1: resilience improves through performance
  - Argument 2: security performance is correlated with general performance
- Implications of the model for information risk management: optimize risk management based on your performance mode
  - Mode 1: improve general performance
  - Mode 2: add security enhancements to general performance
  - Mode 3: create security-specific systems
- Guided Adaptability
- Work against the adversary
Overall, I think the talk went better than I expected. While the theory supports some potentially controversial conclusions, like “retire your vulnerability management program”, I had good engagement from the audience, ran out of time for questions, and spent some time afterwards talking with a few attendees in the hall.
I got the survey results back pretty quickly. Only 9 people responded, which was maybe 10-20% of the audience (I’m not a good judge of crowd size), but those responses were very positive, with ~90% of respondents saying they would attend my future talks. My weakest score was “I am interested in hearing more of this topic”, which scored just below “agree”.
Slides
My slides with notes, including references, are here.
Slides from all presenters at Secure360 (who provided them) are available here, and most of my past talks and security blog posts are available at https://transvasive.com.
Other Versions
I presented this talk two other times:
- As mentioned above, the original version was presented at SIRAcon 2021; those slides are available here.
- I was also selected to be a keynote speaker at my company’s internal technical conference in 2022! This was a condensed, 30-minute version of the talk, and draft slides from it can be found here.
Thanks to a generous sponsor, an artist created visual notes for my SIRAcon presentation! As a cool bonus, I have a laminated plaque with the visual in my office.
Last August, I took on a new role at my company and changed my title to Resilience Engineer. That leads to an obvious question: what is Resilience Engineering?
Resilience Engineering (RE) as a concept emerged from safety science in the early 2000s. While the oldest reference to “Resilience Engineering” appears to be a paper written by David Woods in 2003[1], the most-cited work is the book Resilience Engineering: Concepts and Precepts, a collection of chapters from the first Resilience Engineering symposium in 2004.[2] In that book and in subsequent publications, there have been many definitions of RE. This post is my attempt to succinctly define Resilience Engineering as I practice it, which is:
Resilience Engineering is the practice of working with people and technology to build software systems that fail less often and recover faster by improving system performance.
Let’s break that definition down further:
Resilience and Resilience Engineering
Resilience is a concept from ecology that describes a system’s ability to dynamically withstand and recover from unexpected disruptions, rather than maintain a predictable, static state.[3] Whereas resilience in ecological systems is the result of the interplay between variability and natural selection, Resilience Engineering seeks to achieve the same results through deliberate management of the variability of performance:
“Since both failures and successes are the outcome of normal performance variability, safety cannot be achieved by constraining or eliminating that. Instead, it is necessary to study both successes and failures, and to find ways to reinforce the variability that leads to successes as well as dampen the variability that leads to adverse outcomes.”[4]
As both definitions make clear, resilience isn’t achieved through stability; rather, it is achieved through variability.
Working with people and technology
Systems safety recognizes that people are an integral part of the system; one can’t talk about aviation safety without talking about the technology of the plane and air traffic control, the people (the pilots and controllers), and the interplay between the people and the technology. Similarly, the software systems I work with consist of the code, the machines running the code, and the people who write and maintain the code. The software engineers and the systems they build comprise a sociotechnical system, with both technological/process and social/psychological components.
Further, while technology can’t be ignored, beyond a baseline level of technology people are the main contributors to resilience (or the lack thereof); most advances in aviation safety over the past 50+ years have come from human factors research, and it is no accident that safety science is usually housed in the psychology department. For this reason, I focus my efforts on people, and on the relationship between people and technology.
Systems that fail less often and recover faster
‘Systems that fail less often and recover faster’ is an over-simplification of resilience, but that statement accurately describes the value proposition of Resilience Engineering in technology; organizations are increasingly reliant on software systems, to the point where software has become safety-critical. We have come to expect that our software systems just work: that failures are infrequent, and that systems (the software and the people together) recover quickly from unexpected disruptions.
This is a distinctly different goal from ecological resilience: it isn’t enough to build systems that simply survive; they also need to be productive. This is a challenge unique to Resilience Engineering, as it requires both limiting and encouraging variability.
Improving system performance
For me, the key to understanding Resilience Engineering is HOW to achieve resilience. Historically within technology, security and operations have sought to prevent failures (outages, breaches) through centralized control, which does work, but suffers from limitations that RE seeks to overcome.[5] The shift in approach starts with the premise that we can’t have a science of non-events, a science of accidents that don’t happen.[6] Safety-II (an alternative to traditional ‘Safety-I’) proposes that resilience is the result of factors that make things go right more often - working safely, something that can be studied. Under this model, there is no safety-productivity tradeoff, since improving outcomes leads to improvements in both productivity and resilience.
The work of the DevOps Research and Assessment group at Google demonstrates this concept within software: as organizations improve performance (deployment frequency and lead time for changes), they also improve resilience (time to restore service, change failure rate).[7] I’ve found that this approach works more generally; through RE, I seek to help teams improve their performance, and to help leaders improve the performance between teams by managing organizational factors.
Other Perspectives
Resilience Engineering is a diverse space, and there is a small but growing group of practitioners and researchers applying it to software systems. Two notable groups are the Resilience Engineering Association and the Learning From Incidents community. I’ve also recently discovered the work of Dr. Drew Rae and Dr. David Provan through their Safety of Work podcast. Their paper on Resilience Engineering in practice is aimed at traditional safety professionals, but I’ve found its ideas easily adapted to software systems.
As a practitioner-researcher myself, I’m hoping to adapt and apply the science to software systems, to improve the profession, and to contribute to the collective knowledge of Resilience Engineering.
Future Articles
Update: I’ve been asked to elaborate on the ideas behind Resilience Engineering, so I’ve added this section to cover a plan for future articles on the topic:
- The origins and history of Resilience Engineering
- Parallels between Cybersecurity, Operations, and Safety
Hollnagel, E. (2008). Preface: Resilience Engineering in a Nutshell. In E. Hollnagel, C. P. Nemeth, & S. Dekker (Eds.), Resilience Engineering Perspectives, Volume 1: Remaining Sensitive to the Possibility of Failure (pp. ix-xii). Ashgate.
Provan, D. J., Woods, D. D., Dekker, S. W. A., & Rae, A. J. (2020). Safety II professionals: How resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, 106740. https://doi.org/10.1016/j.ress.2019.106740
Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/
Around the time of SIRAcon 2020, I decided to start using R. I needed a data analysis tool that would allow me to conduct traditional statistical analysis, and I wanted one that would be valuable to learn and would support exploratory analysis as well. Originally I considered SPSS (free to students) and RStudio. The tradeoffs between the two were pretty clear: SPSS is very easy to use, but expensive, proprietary, and old; RStudio and R have a tougher learning curve, but are free and open source, under active development, and backed by a large online community. After reading a thread on the SIRA mailing list, I was leaning towards R, and re-watched Elliot Murphy’s 2019 SIRAcon presentation on using notebooks, which led me to consider both R Markdown and Python Jupyter Notebooks. I did more searching and reading, and finally settled on R Notebooks for a few reasons: they are more disciplined (no strange side effects from running code out of order), they cause fewer environment problems, they have the support of the RStudio company, they produce better visualizations, and R is simply the more data-sciency language.
The SIRA community was quite supportive of this idea when I asked for suggestions on getting started in the BOF session, and recommended Teacup Giraffes and Tidy Tuesday for learning R, and on my own I found RStudio recommendations. Of course, being a sysadmin at heart, I set out to figure out how exactly to best install R and RStudio, and manage the notebooks in git.
Installation on macOS was easy enough: just `brew install r` and `brew cask install rstudio`. GitHub published a tutorial in 2018 on integrating RStudio with GitHub, and I started working through it. I quickly discovered that while the tutorial was helpful, it wasn’t quite the setup I wanted; it published R Markdown through GitHub Pages, but wouldn’t directly support the automatically generated HTML of R Notebooks. Side note: the consensus was to use html_notebook as a working document and html_document to publish. After more searching, I was able to get Notebooks working on GitHub using the method described in rstudio/rmarkdown #1020: checking the .nb.html files into git and serving them through GitHub Pages, so that you can view the rendered HTML instead of just the HTML code.
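The resulting workflow can be sketched in a few shell commands (the repository and file names here are illustrative, not the ones I actually used):

```shell
# Work in a scratch directory and create an illustrative project repo.
cd "$(mktemp -d)"
mkdir r-notebooks && cd r-notebooks
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"  # placeholder identity

# The notebook source and the .nb.html that RStudio renders alongside it
# (normally produced when you save the .Rmd; created empty here).
touch analysis.Rmd analysis.nb.html

# The key step: commit the rendered .nb.html rather than ignoring it,
# so GitHub Pages can serve the rendered HTML directly.
git add analysis.Rmd analysis.nb.html
git commit -q -m "Add notebook and rendered HTML for GitHub Pages"

# With Pages enabled for this branch, the rendered notebook would be
# viewable at https://<user>.github.io/r-notebooks/analysis.nb.html
git ls-files   # lists analysis.Rmd and analysis.nb.html
```

Note that committing generated HTML is the opposite of the usual advice for build artifacts; it’s what makes the Pages trick work without a separate build step.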
Working through this, I noted that RStudio is quite good at automatically downloading and installing packages as needed; it triggered installation of rmarkdown and supporting packages when I created a new R Notebook, and of readr when I imported data from CSV. That got me thinking: what about package management? While R doesn’t seem to pose the level of challenge that Python or Ruby do, managing packages on a per-project basis is a best practice I learned from using Bundler to manage the code of this site (the only gem I install outside a project is Bundler). So I went looking for the R equivalent…
I first found Packrat and then its replacement, renv (Packrat is still maintained, but all new development has shifted to renv). Setting it up is as simple as install.packages("renv") and renv::init(), and RStudio has published documentation to help you get started.
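A minimal sketch of the renv lifecycle, assuming R is on the PATH and you run these from the project directory:

```shell
# Install renv once, then initialize it inside the project.
# renv::init() creates a project-local library, a renv.lock lockfile,
# and a .Rprofile that activates the library on startup.
Rscript -e 'install.packages("renv", repos = "https://cloud.r-project.org")'
Rscript -e 'renv::init()'

# After adding or updating packages, record the exact versions in use:
Rscript -e 'renv::snapshot()'

# On a fresh clone, recreate the library from renv.lock:
Rscript -e 'renv::restore()'
```

This mirrors the Gemfile/Gemfile.lock workflow from Bundler: the lockfile is committed, the library itself is not.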
This left one final question: how exactly to install R? Homebrew offers two methods: installing the official binaries with `brew cask install r`, or just `brew install r`. Poking around further, I found that the cask method was sub-optimal, as it installs into /usr/local, which causes issues with brew doctor. Interestingly, I also found that Homebrew’s R doesn’t include all R features, but the same author, Luis Puerto, offered a solution to install all the things. I haven’t tried it yet, but I may go with homebrew-r-srf as suggested by Luis (or a fork of it).
What’s next? At some point I plan to integrate GitHub Actions for testing and to create a CI/CD pipeline of sorts for Pages. And, of course, to actually use R for data analysis…
Update: I tested homebrew-r-srf, and am going with Homebrew’s r formula instead. There was some weirdness with the install/uninstall (/usr/local/lib/R was left over), I don’t know whether I’ll need the optional features, and Homebrew’s r now uses OpenBLAS. If I find I actually need any of the missing capabilities, I’ll likely write my own formula.