Information Safety

Improving technology through lessons from safety.

Interested in applying lessons from safety to security? Learn more at security-differently.com!

The Definitive Introduction to the DORA Research

I’ve spent a good deal of time over the last three years studying software delivery performance, both by learning from the work of Nicole Forsgren and the DevOps Research and Assessment (DORA) team at Google and by conducting my own research. I’ve often needed to explain the research to others, especially in the context of the “four metrics”, and so set out to write this, the definitive introduction to the research (well, at least my definitive version).

DevOps Metrics

The research program that is now run by the DORA team originated in 2013, when Nicole Forsgren, a PhD researcher, joined two early DevOps champions, Jez Humble and Gene Kim, to work on the Puppet 2014 State of DevOps Report. The team combined practical experience with the rigor of academic research to create a report that established a causal relationship between specific DevOps practices and organizational performance, as measured by three key metrics, which were later expanded to four. The key metrics, along with their definitions taken from the DORA Quick Check, are listed below:

  • Deployment frequency: For the primary application or service you work on, how often does your organization deploy code to production or release it to end users?
  • Lead time for changes: For the primary application or service you work on, what is your lead time for changes (that is, how long does it take to go from code committed to code successfully running in production)?
  • Time to restore service: For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?
  • Change failure rate: For the primary application or service you work on, what percentage of changes to production or releases to users result in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, patch)?

DevOps and Performance

The findings of the DORA research can be summarized succinctly as: performance begets performance. A visual map of the program shows all of the predictive relationships the team has discovered: many technical, cultural, management, and leadership practices associated with the DevOps movement have been shown to improve Software Delivery Performance (as measured by the four metrics), and ultimately to improve organizational performance. This is the key finding of the body of work: organizations that improve their software delivery performance improve both their commercial performance (profitability, market share, and productivity) and non-commercial performance (quantity of goods and services, operating efficiency, customer satisfaction, quality of products or services, and achieving organization or mission goals).

Over time, the program has investigated and identified additional practices that predict improved performance, and added a “fifth metric”, Reliability: the degree to which a team can keep promises and assertions about the software they operate, which includes availability, latency, performance, and scalability. The 2021 Accelerate State of DevOps Report calls this metric “[the] ability to meet or exceed their reliability targets”; expressed another way, this could be measured as how well the organization meets their Service Level Objectives.

It is important to stress that the factors that improve performance extend beyond the technical practices typically thought of as “DevOps”, including CI/CD (Continuous Integration/Continuous Delivery). Many of the factors are cultural, including softer concepts like Trust, Voice, and Autonomy, and some factors are self-reinforcing: for example, Software Delivery Performance predicts improved Lean Product Management, and improved Lean Product Management predicts improved Software Delivery Performance. A central theme is a leadership focus on creating a supportive culture and environment, while allowing teams significant delegated authority in making decisions about the software they build and support. In my own research studying DevOps adoption and performance, I identified that the organizational system can have a significant impact on team performance: teams can be constrained by mandatory enterprise practices, such as change management.

Measuring Performance

Since 2015, the DORA researchers have reported on the profiles of “Low”, “Medium”, “High”, and sometimes “Elite” organizations. Using cluster analysis, the team identified data-driven categories of performance. These categories serve as useful benchmarks and show how the metrics relate to each other:

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment frequency | On-demand (multiple deploys per day) | Between once per week and once per month | Between once per month and once every six months | Fewer than once per six months |
| Lead time for changes | Less than one hour | Between one day and one week | Between one month and six months | More than six months |
| Time to restore service | Less than one hour | Less than one day | Between one day and one week | More than six months |
| Change failure rate | 0%-15% | 16%-30% | 16%-30% | 16%-30% |

Categories of performance from 2021 Accelerate State of DevOps Report

The clusters highlight a larger theme: higher-performing organizations perform better across all measures of performance, which extends beyond the four metrics; for example, organizations that do better at meeting reliability targets and shifting left on security have higher software delivery performance as measured by the four metrics.

It is also notable that there has been variability in these categories over time. In the prior 2019 Accelerate State of DevOps Report (no report was produced in 2020), the profiles were:

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment frequency | On-demand (multiple deploys per day) | Between once per day and once per week | Between once per week and once per month | Between once per month and once every six months |
| Lead time for changes | Less than one day | Between one day and one week | Between one week and one month | Between one month and six months |
| Time to restore service | Less than one hour | Less than one day | Less than one day | Between one week and one month |
| Change failure rate | 0%-15% | 0%-15% | 0%-15% | 46%-60% |

Categories of performance from 2019 Accelerate State of DevOps Report

I find the 2019 profiles to be more useful benchmarks, at least for my work comparing team performance within a larger organization, as the relationship between the metrics is clearer, and fits better with my own experience of team performance.
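As a concrete illustration of how I use these benchmarks, the 2019 thresholds can be expressed as a simple lookup. This sketch is my own, not an official DORA tool: the function and constants are invented for illustration, and it buckets only one metric (lead time for changes), whereas DORA derives the profiles from cluster analysis across all four metrics.

```python
# Illustrative only: bucket a team's "lead time for changes" into the
# 2019 report's performance profiles. DORA derives profiles via cluster
# analysis across all metrics; this per-metric lookup is a simplification.

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY
HOURS_PER_MONTH = 30 * HOURS_PER_DAY

# (upper bound in hours, profile name), following the 2019 table
LEAD_TIME_PROFILES = [
    (HOURS_PER_DAY, "Elite"),        # less than one day
    (HOURS_PER_WEEK, "High"),        # one day to one week
    (HOURS_PER_MONTH, "Medium"),     # one week to one month
    (6 * HOURS_PER_MONTH, "Low"),    # one month to six months
]

def lead_time_profile(lead_time_hours: float) -> str:
    """Return the 2019 profile whose lead-time range contains the value."""
    for upper_bound, profile in LEAD_TIME_PROFILES:
        if lead_time_hours < upper_bound:
            return profile
    return "Low"  # anything slower than six months

print(lead_time_profile(4))       # Elite
print(lead_time_profile(3 * 24))  # High
```

The same table-as-data approach makes it easy to swap in the 2021 thresholds, or whichever profile year fits your organization best.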

The DORA group uses survey research to measure software delivery performance out of necessity: obtaining and comparing direct data across organizations is impractical. Within an organization, however, it is feasible to partially or fully automate collection of these metrics (as I have done). One approach is to collect data each time code is deployed, using the deployment automation itself: write a log entry recording when each deployment occurred, the application or service, the lead time (the difference between the deployment time and when the code was committed), and the type of deployment (normal or hotfix). This allows calculation of three of the four metrics over time: Deployment frequency, Lead time for changes, and Change failure rate. Time to restore service can be measured as part of the incident (outage) response process, ideally in an automated way, such as by pulling data from the trouble ticket system.
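A minimal sketch of that aggregation, assuming a deploy log like the one described above (the record fields, services, and numbers here are hypothetical, not from any particular tool):

```python
from datetime import datetime

# One record per production deploy, written by the deployment automation.
# Field names are illustrative.
deploy_log = [
    {"service": "billing", "deployed_at": datetime(2022, 6, 1, 10, 0),
     "committed_at": datetime(2022, 5, 31, 16, 0), "kind": "normal"},
    {"service": "billing", "deployed_at": datetime(2022, 6, 2, 9, 30),
     "committed_at": datetime(2022, 6, 1, 11, 0), "kind": "hotfix"},
    {"service": "billing", "deployed_at": datetime(2022, 6, 3, 14, 0),
     "committed_at": datetime(2022, 6, 3, 9, 0), "kind": "normal"},
]

days_observed = 30  # length of the reporting window

# Deployment frequency: deploys per day over the observation window.
deployment_frequency = len(deploy_log) / days_observed

# Lead time for changes: median commit-to-deploy time.
lead_times = sorted(d["deployed_at"] - d["committed_at"] for d in deploy_log)
median_lead_time = lead_times[len(lead_times) // 2]

# Change failure rate: fraction of deploys that were hotfixes. This is a
# simplification; strictly, the *failed* change is the one that required
# the hotfix, but over time the two rates converge.
change_failure_rate = sum(d["kind"] == "hotfix" for d in deploy_log) / len(deploy_log)

print(deployment_frequency, median_lead_time, change_failure_rate)
```

In practice I compute these per service and per time window, which also lets you watch each team's trend rather than a single point-in-time score.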

How to Improve

So, how do you improve software delivery performance? The simple answer is “adopt all the practices that the research shows improve performance”, but how do you get started? In her 2017 talk The Key to High Performance: What the Data Says, Forsgren cites a specific example: “By focusing on trunk-based development and streamlining their change approval processes, Capital One saw stunning improvements in just two months,” with a 20x increase in releases (some applications deploying to production 30+ times a day) and no increase in incidents. In the same talk, she offers some general advice: “It depends,” and suggests looking at decoupling architecture, adopting a lightweight change approval process, and full continuous integration. My take is that organizations should adopt the DevOps ways of working first, to support cultural change.

Regardless of how, make sure to measure: measuring outcomes using the four metrics will help identify opportunities to improve and measure improvement over time.

References

DevOps Research & Assessment. Explore DORA’s research program. https://www.devops-research.com/research.html

Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The science behind DevOps: Building and scaling high performing technology organizations (First ed.). IT Revolution.

Forsgren, N., Kim, G., Kersten, N., & Humble, J. (2014). 2014 State of DevOps Report. Puppet Labs, IT Revolution Press, ThoughtWorks. https://nicolefv.com/resources

Google. (2020). DORA DevOps Quick Check. https://www.devops-research.com/quickcheck.html

Smith, D., Villalba, D., Irvine, M., Stanke, D., & Harvey, N. (2021). 2021 Accelerate State of DevOps Report. Google Cloud. https://cloud.google.com/devops/state-of-devops/

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site reliability engineering: How Google runs production systems (First ed.). O’Reilly. https://landing.google.com/sre/sre-book/toc/index.html

Benninghoff, J. (2021). A cross-team study of factors contributing to software systems resilience at a large health care company [Master’s thesis, Trinity College Dublin]. Ireland.

2015 State of DevOps Report. (2015). Puppet Labs, IT Revolution. https://nicolefv.com/resources

Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/

Forsgren, N. (2017). The Key to High Performance: What the Data Says. https://www.youtube.com/watch?v=RBuPlMTXuFc&t=25s


Secure360 2022

A couple of weeks ago, I spoke at Secure360 2022! My talk, “What Safety Science taught me about Information Risk” was an updated version of my SIRAcon 2021 talk (available in the members area at https://www.societyinforisk.org).

Session Description

Two years of study and research have changed how I see risk. Safety science taught me that improving performance is the key to managing risk, and that studying successes is the key to risk analysis. The ‘New School’ of safety argues that you can’t have a science of non-events; safety comes through being successful more often, not failing less. Research in DevOps, Software Security, and Security Programs shows a strong link between general and security performance. In many (but not all) cases, organizations most effectively reduce cybersecurity risk by improving general performance, not by improving one-dimensional security or reliability performance.

This talk presents a new model for security performance that informs how we can maximize the value of our security investments, by focusing on improving existing or creating new organizational capabilities in response to new and emerging threats, where general performance falls short. It will review both the theory that improving performance improves safety, how that relates to cybersecurity risk, evidence from my own and others’ research that supports this theory, and how it can be used to analyze and manage risk more effectively.

Talk

The talk is broken down into three sections, covering both the theory and how to apply it to best improve security performance.

  • Assumptions backed by accepted theory
    • Assumption 1: organizations are sociotechnical systems
    • Assumption 2: all failures are systems failures
  • Arguments for a new theoretical model backed by evidence
    • Argument 1: resilience improves through performance
    • Argument 2: security performance is correlated with general performance
  • Implications of the model for information risk management: optimize risk management based on your performance mode
    • Mode 1: improve general performance
    • Mode 2: add security enhancements to general performance
    • Mode 3: create security-specific systems
    • Guided Adaptability
    • Work against the adversary

Overall, I think the talk went better than I expected. While the theory supports some potentially controversial conclusions, like “retire your vulnerability management program”, I had good engagement from the audience, ran out of time for questions, and spent some time afterwards talking with a few attendees in the hall.

I got the survey results back pretty quickly. Only 9 people responded, which was maybe 10-20% of the audience (I’m not a good judge of crowd size), but those responses were very positive, with ~90% of attendees saying they would attend my future talks. My weakest score was “I am interested in hearing more of this topic”, which scored just below “agree”.

Slides

My slides with notes, including references, are here.

Slides from all presenters at Secure360 (who provided them) are available here, and most of my past talks and security blog posts are available at https://transvasive.com.

Other Versions

I presented this talk two other times:

  • As mentioned above, the original version was presented at SIRAcon 2021, those slides are available here.
  • I was also selected to be a keynote speaker at my internal company technical conference in 2022! This was a condensed, 30-minute version of the talk, and draft slides from that talk can be found here.

Thanks to a generous sponsor, an artist created visual notes for my SIRAcon presentation! As a cool bonus, I have a laminated plaque with the visual in my office.

visual notes


What is Resilience Engineering?

Last August, I took on a new role at my company and changed my title to Resilience Engineer, which leads to an obvious question: what is Resilience Engineering?

Resilience Engineering (RE) as a concept emerged from safety science in the early 2000s. While the oldest reference to “Resilience Engineering” appears to be a paper written by David Woods in 2003,[1] the most-cited work is the book Resilience Engineering: Concepts and Precepts, a collection of chapters from the first Resilience Engineering symposium in 2004.[2] In that book and in subsequent publications, there have been many definitions of RE. This post is my attempt to succinctly define Resilience Engineering as I practice it, which is:

Resilience Engineering is the practice of working with people and technology to build software systems that fail less often and recover faster by improving system performance.

Let’s break that definition down further:

Resilience and Resilience Engineering

Resilience is a concept from ecology that describes a system’s ability to dynamically withstand and recover from unexpected disruptions, rather than maintain a predictable, static state.[3] Whereas resilience in ecological systems is the result of the interplay between variability and natural selection, Resilience Engineering seeks to achieve the same results through deliberate management of the variability of performance:

“Since both failures and successes are the outcome of normal performance variability, safety cannot be achieved by constraining or eliminating that. Instead, it is necessary to study both successes and failures, and to find ways to reinforce the variability that leads to successes as well as dampen the variability that leads to adverse outcomes.”[4]

As both definitions make clear, resilience isn’t achieved through stability; rather, it is achieved through variability.

Working with people and technology

Systems safety recognizes that people are an integral part of the system; one can’t talk about aviation safety without talking about the technology (the plane and air traffic control), the people (the pilots and controllers), and the interplay between the people and the technology. Similarly, the software systems I work with consist of the code, the machines running the code, and the people who write and maintain the code. The software engineers and the systems they build comprise a sociotechnical system, with both technological/process and social/psychological components.

Further, while technology can’t be ignored, beyond a baseline level of technology, people are the main contributor to resilience or lack thereof; most advances in aviation safety over the past 50+ years have come from human factors research, and it is not by accident that safety science is usually part of the psychology department. For this reason, I focus my efforts on people, and the relationship between people and technology.

Systems that fail less often and recover faster

‘Systems that fail less often and recover faster’ is an over-simplification of resilience, but that statement accurately describes the value proposition of Resilience Engineering in technology; organizations are increasingly reliant on software systems, to the point where software has become safety-critical. We have come to expect that our software systems just work, so that failures are infrequent and systems (the software and the people together) are able to recover from unexpected disruptions quickly.

This is a distinctly different goal from ecological resilience: it isn’t enough to build systems that simply survive; they also need to be productive. This is a challenge unique to Resilience Engineering, as it requires both limiting and encouraging variability.

Improving system performance

For me, the key to understanding Resilience Engineering is HOW to achieve resilience. Historically within technology, security and operations have sought to prevent failures (outages, breaches) through centralized control, which does work, but suffers from limitations that RE seeks to overcome.[5] The shift in approach starts with the premise that we can’t have a science of non-events, a science of accidents that don’t happen.[6] Safety-II (an alternative to traditional ‘Safety-I’) proposes that resilience is the result of factors that make things go right more often - working safely, something that can be studied. Under this model, there is no safety-productivity tradeoff, since improving outcomes leads to improvements in both productivity and resilience.

The work of the DevOps Research and Assessment group at Google demonstrates this concept within software: as organizations improve performance (deployment frequency and lead time for changes) they also improve resilience (time to restore service, change failure rate).[7] I’ve found that this approach works more generally, and through RE, seek to help teams improve their performance and help leaders to improve the performance between teams by managing organizational factors.

Other Perspectives

Resilience Engineering is a diverse space and there is a small but growing group of practitioners and researchers that are applying it to software systems. Two notable groups are the Resilience Engineering Association and the Learning From Incidents community. I’ve also recently discovered the work of Dr Drew Rae and Dr David Provan through their Safety of Work podcast. Their paper on Resilience Engineering in practice is aimed at traditional safety professionals but I’ve found its ideas easily adapted to software systems.

As a practitioner-researcher myself, I’m hoping to adapt and apply the science to software systems, to improve the profession, as well as contribute to the collective knowledge of Resilience Engineering.

Future Articles

Update: I’ve been asked to elaborate on the ideas behind Resilience Engineering, so I’ve added this section to cover a plan for future articles on the topic:

  • The origins and history of Resilience Engineering
  • Parallels between Cybersecurity, Operations, and Safety
  • Is DevOps culture High Reliability culture?
  • My research in software systems resilience

Updates and links will be posted here.

  1. Woods, D., & Wreathall, J. (2003). Managing Risk Proactively: The Emergence of Resilience Engineering. https://www.researchgate.net/publication/228711828_Managing_Risk_Proactively_The_Emergence_of_Resilience_Engineering 

  2. Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience engineering: Concepts and precepts. Ashgate. 

  3. Holling, C. S. (1973). Resilience and stability of ecological systems. Annual Review of Ecology & Systematics, 4, 1-23. https://doi.org/10.1146/annurev.es.04.110173.000245 

  4. Hollnagel, E. (2008). Preface : Resilience Engineering in a Nutshell. In E. Hollnagel, C. P. Nemeth, & S. Dekker (Eds.), Resilience Engineering Perspectives, Volume 1: Remaining Sensitive to the Possibility of Failure (pp. ix-xii). Ashgate. 

  5. Provan, D. J., Woods, D. D., Dekker, S. W. A., & Rae, A. J. (2020). Safety II professionals: How resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, 106740. https://doi.org/10.1016/j.ress.2019.106740 

  6. Hollnagel, E. (2014). Is safety a subject for science? Safety Science, 67, 21-24. https://doi.org/10.1016/j.ssci.2013.07.025 

  7. Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/ 
