Information Safety

Improving technology through lessons from safety.

On Deadlines

Getting rid of deadlines improves safety and security.

I was recently discussing the safety of Artificial Intelligence (AI) with a colleague. He sent me a link to an episode of the Medicine & Machine Learning podcast featuring Munjal Shah, the co-founder and CEO of Hippocratic AI. In our conversation, my colleague mentioned a quote from the podcast on deadlines:

“You can’t say you have a deadline and say you’re safety-first. One of those things is not true.”

This resonated with me for a specific reason. One of the most effective software engineering teams I’ve worked with was Core Engineering. As the name suggests, Core Engineering created “core” libraries and tools for other software engineering teams to use, like a standardized logging facility and components for authentication and authorization. (Like many large companies, there was value in creating components tailored to our environment). This required writing code, and also high quality documentation and examples to make the components accessible to a broad audience of developers of different skill levels and experience at the company.

The Core Engineering team was notable for a couple of key reasons: first, they were high performing in nearly all aspects of software development: their software was high quality, and the few bugs (including security flaws) that were discovered were fixed quickly, they had high levels of automation including automated tests, wrote excellent documentation, and had effective leadership. Second, and more importantly, they typically did not have hard deadlines for the work they did. Because the software they built wasn’t directly used by clients or internal teams, they had a high level of control over the timing of the release of their software.

Hard deadlines

Why was the lack of hard deadlines so important to their performance? This changed the incentives for the team, giving them the time and space to focus on building it right - put another way, being safety-first.

Because the components the team built were used by many applications, a failure would have broad impact. Additionally, the team wasn’t dependent on client or internal stakeholder funds. Taken together, this meant that the cost and risk of missing a date was low, but the impact of an outage or security flaw was much higher. Almost by accident, the work environment favored prioritizing work to promote confidentiality, integrity, and availability, and so the team did.

Practical Advice

So, to create a safety-first environment, you can’t have deadlines. Just get rid of them and all will be good, right? If it were that easy…

One of the things I’ve learned from safety is that there are always conflicts and trade-offs to manage. Getting rid of deadlines is just not practical; there will always be situations where there is a very real cost of not completing work on time, which in some cases can be quite large (compliance to a new government regulation comes to mind).

So what can be done? This is where safety and security professionals can help, by understanding and reducing the goal conflicts and supporting decisions to prioritize safety or security ahead of deadlines when the risks are too great. Simply being mindful of this conflict can help manage it - leaders can create time and space for teams to prioritize safety and security as best they can, which may be less when the team is under a deadline, but can be more once the deadline has passed. Put another way, they can make the choice to take on security tech debt in the short term to meet an important client obligation and pay it back by raising the priority of security when the production pressures are reduced.

While this won’t always work, acknowledging that the conflict exists and that it will never be fully resolved is a good start.

comment

Security Differently

For a while, I’ve been seeing evidence that cybersecurity, especially traditional security, has been stagnant; adding security controls hasn’t appreciably improved outcomes and we continue to struggle with basic problems like vulnerabilities. (as Cyentia discovered and reported in its Prioritization to Prediction series, organizations of all sizes only fix about 10% of their vulnerabilities in any given month)

Many organizations have accumulated 20+ years of security policies, standards, and controls, without significantly removing rules that may no longer be needed, and organizations of all sizes continue to experience security breaches.

Safety faced a similar problem 10-15 years ago. Safety scientists and practitioners saw that security outcomes were stagnant and looked for new approaches. One of these, Safety Differently, was created by notable safety science academic Sidney Dekker in 2012. It was part of the emerging acknowledgement that the traditional method of avoiding accidents through policies, procedures, and controls was no longer driving improvements in safety.

Safety Differently argues that three main principles drive traditional thinking:

  • Workers are considered the cause of poor safety performance. Workers make mistakes, they violate rules, and they ultimately make safety numbers look bad. That is, workers represent a problem that an organization needs to solve.
  • Because of this, organizations intervene to try and influence workers’ behavior. Managers develop strict guidelines and tell workers what to do, because they cannot be trusted to operate safely alone.
  • Organizations measure their safety success through the absence of negative events.

Safety Differently advocates a switch from a top-down to a bottom-up approach, adopting new principles:

  • People are not the problem to control, they are the solution. Learn how your workers create success on a daily basis and harness their skills and competencies to build a safer workplace.
  • Rather than intervening in worker behavior, intervene in the conditions of their work. This involves collaborating with front-line staff and providing them with the right tools and environment to get the job done safely. The key here is intervening in workplace conditions rather than worker behavior.
  • Measure safety as the presence of positive capacities. If you want to stop things from going wrong, enhance the capacities that make things go right.

What does this have to do with cybersecurity? I believe that we’re seeing the same thing in security: historically, we’ve focused on constraining worker behavior to prevent cybersecurity breaches, and the limits of that approach are becoming increasingly clear. Adapting concepts from Safety Differently offers a solution, by supporting success and focusing on positive capacities: Security Differently.

Adopting Security Differently

In practical terms, what would adopting Security Differently look like? The Safety Differently Movie provides good insights into how this would apply to security and evidence of its effectiveness:

  1. Most importantly, the organization’s top leadership must take responsibility for security. Since security performance can’t be separated from organizational performance, security can’t be “the CISO’s problem” or even “the CIO’s problem.” A key part of this shift is acknowledging that it is our workers - not our security team - that create security.

  2. A clear shift in ownership of security performance to the Operations and Engineering teams. As I argued in a 2021 talk, many positive security outcomes are well within the capabilities of Technology organizations. One example is vulnerability management, which is solved through proactively updating software and refreshing technology - something all technology teams can do.

  3. Likely, a much smaller security team. The head of health and safety for Origin noted that he cut the size of his team from 20-30 people to 5, as his team had to give up safety performance management. This doesn’t necessarily mean that security spending is significantly reduced, because of the shift in ownership of security performance.

  4. A focus on positive measures of security performance (Security Metrics). A security team is still needed to measure security outcomes, like successfully defending against an attack, as well as measures that have been shown to contribute to success, like security updates and secure configuration.

  5. A significant reduction of security policies and procedures, along with training on Security Differently concepts. The Australian grocery Woolworth’s found that the combination of eliminating national safety procedures and training in Safety Differently led to the best outcomes: fewer accidents in the store, along with the highest levels of safety ownership and engagement.

  6. Asking, not telling. While ownership shifts to the Technology team, security expertise is still needed, to coach and support security performance - advising developers on how to fix security bugs - and develop new capacities to address novel threats (the Solarwinds Attack is an example). Simply asking teams, “what do you need to be secure?” is a key part of improving their performance.

In an organization that has fully adopted Security Differently, top leadership (CEO/CIO) sets security goals, the Security team keeps score through evidence-based metrics aligned to those goals, provides expertise and support to achieve the goals, and develops or acquires new defenses when new threats emerge (in practice, this happens infrequently).

Importantly, investment in Security Differently is not a cost, rather, it is an investment in improved organizational performance. By changing the focus from preventing bad outcomes to creating positive outcomes and developing organizational capacities not only improves security, but also improves quality, engagement, and overall organizational performance. (And also reduced incident response costs!) Evidence of this affect can be found in the DORA Research, which can be summarized as “performance begets performance”: the technical capabilities of DevOps, including shifting left on security, improve software delivery performance and ultimately organizational performance.

Adopting Security Differently can both improve both efficiency and outcomes, much like the traffic experiment from Safety Differently: when traffic engineers removed traffic controls from a key mixed-use intersection in Drachten, they forced people to take greater responsibility for safety, and what looked riskier on the surface was much safer, reducing annual accidents from 10 to 1, and also eliminated gridlock.

In a future article I will continue to explore the idea of reimagining the role of security through related work in safety.

comment

The Definitive Introduction to the DORA Research

I’ve spent a good deal of time over the last three years studying software delivery performance, both learning from the work of Nicole Forsgren and the DevOps Research and Assessment (DORA) team at Google, as well as conducting my own research. I’ve often needed to explain the research to others, especially in the context of the “four metrics”, and set out to write this, the definitive introduction to the research (well, at least my definitive version).

DevOps Metrics

The research program that is now run by the DORA team originated in 2013, when Nicole Forsgren, a PhD researcher, joined two early DevOps champions, Jez Humble and Gene Kim, to work on the Puppet 2014 State of DevOps Report. The team combined practical experience with the rigor academic research to create a report that established a causal relationship between specific DevOps practices and organizational performance, as measured by three key metrics, which were later expanded to four. The key metrics, along with their definitions taken from the DORA Quick Check are listed below:

  • Deployment frequency: For the primary application or service you work on, how often does your organization deploy code to production or release it to end users?
  • Lead time for changes: For the primary application or service you work on, what is your lead time for changes (that is, how long does it take to go from code committed to code successfully running in production)?
  • Time to restore service: For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?
  • Change failure rate: For the primary application or service you work on, what percentage of changes to production or releases to users result in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, patch)?

DevOps and Performance

The findings of the DORA research can be summarized succinctly as: performance begets performance. A visual map of the program shows all of the predictive relationships the team has discovered: many technical, cultural, management, and leadership practices associated with the DevOps movement have been shown to improve Software Delivery Performance (as measured by the four metrics), and ultimately to improve organizational performance. This is the key finding of the body of work: organizations that improve their software delivery performance improve both their commercial performance (profitability, market share, and productivity) and non-commercial performance (quantity of goods and services, operating efficiency, customer satisfaction, quality of products or services, and achieving organization or mission goals). Over time, the program has investigated and identified additional practices that predict improved performance, and added a “fifth metric”, Reliability, the degree to which a team can keep promises and assertions about the software they operate, which includes availability, latency, performance, and scalability. The 2021 Accelerate State of DevOps Report calls this metric “[the] ability to meet or exceed their reliability targets;” expressed another way, this could be measured as how well the organization meets their Service Level Objectives.

It is important to stress that the factors that improve performance extend beyond the technical practices typically thought of as “DevOps”, including CI/CD (Continuous Integration/Continuous Delivery). Many of the factors are cultural, including softer concepts like Trust, Voice, and Autonomy, and some factors are self-reinforcing: for example, Software Delivery Performance predicts improved Lean Product Management, and improved Lean Product Management predicts improved Software Delivery Performance. A central theme is a leadership focus on creating a supportive culture and environment, while allowing teams significant delegated authority in making decisions about the software they build and support. In my own research studying DevOps adoption and performance, I identified that the organizational system can have a significant impact on team performance: teams can be constrained by mandatory enterprise practices, such as change management.

Measuring Performance

Starting in 2015, the DORA researchers have reported on the profiles of “Low”, “Medium”, “High”, and sometimes “Elite” organizations. Using cluster analysis, the team identified data-driven categories of performance. These categories serve as useful benchmarks, and show how the metrics relate to each other:

Metric Elite High Medium Low
Deployment frequency On-demand (multiple deploys per day) Between once per week and once per month Between once per month and once every 6 months Fewer than once per six months
Lead time for changes Less than one hour Between one day and one week Between one month and six months More than six months
Time to restore service Less than one hour Less than one day Between one day and one week More than six months
Change failure rate 0%-15% 16%-30% 16%-30% 16%-30%

Categories of performance from 2021 Accelerate State of DevOps Report

The clusters highlight a larger theme: higher-performing organizations perform better across all measures of performance, which extends beyond the four metrics; for example, organizations that do better at meeting reliability targets and shifting left on security have higher software delivery performance as measured by the four metrics.

It is also notable that there has been variability in these categories over time. In the prior 2019 Accelerate State of DevOps Report (no report was produced in 2020), the profiles were:

Metric Elite High Medium Low
Deployment frequency On-demand (multiple deploys per day) Between once per day and once per week Between once per week and once per month Between once per month and once every six months
Lead time for changes Less than one day Between one day and one week Between one week and one month Between one month and six months
Time to restore service Less than one hour Less than one day Less than one day Between one week and one month
Change failure rate 0-15% 0-15% 0-15% 46-60%

Categories of performance from 2019 Accelerate State of DevOps Report

I find the 2019 profiles to be more useful benchmarks, at least for my work comparing team performance within a larger organization, as the relationship between the metrics is clearer, and fits better with my own experience of team performance.

The DORA group uses survey research to measure software delivery performance, out of necessity: obtaining and comparing direct data across organizations is impractical. However, it is feasible to implement partially or fully automated collection of these metrics within an organization (as I have done). One way of doing so is by collecting data each time code is deployed, using the code deployment automation itself. Writing a log of when each deployment occurred along with the application or service, a calculation of lead time measuring the difference between deployment time and when code was committed, and the type of deployment (normal or hotfix) allows calculation of three of the four metrics over time: Deployment frequency, Lead time for change, and Change failure rate. Time to restore service can be measured as part of the incident (outage) response process, ideally in an automated way, such as pulling data from the trouble ticket system.

How to Improve

So, how do you improve software delivery performance? The simple answer is “adopt all the practices that the research shows improves performance”, but how do you get started? In her 2017 talk The Key to High Performance: What the Data Says, Forsgren cites a specific example: “By focusing on trunk-based development and streamlining their change approval processes, Capital One saw stunning improvements in just two months,” with a 20x increase in releases, some applications deploying to production in a day 30+ times] and no increase in incidents. In the same talk, she offers some general device: “It depends,” and suggests looking at decoupling architecture, adopting a lightweight change approval process, and full continuous integration. My take is that organizations should adopt the DevOps ways of working first, to support cultural change.

Regardless of how, make sure to measure: measuring outcomes using the four metrics will help identify opportunities to improve and measure improvement over time.

References

DevOps Research & Assessment. Explore DORA’s research program. https://www.devops-research.com/research.html

Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate : the science behind DevOps : building and scaling high performing technology organizations (First edition. ed.). IT Revolution.

Forsgren, N., Kim, G., Kersten, N., & Humble, J. (2014). 2014 State of DevOps Report. Puppet Labs, IT Revolution Press, ThoughtWorks. https://nicolefv.com/resources

Google. (2020). DORA DevOps Quick Check. https://www.devops-research.com/quickcheck.html

Smith, D., Villalba, D., Irvine, M., Stanke, D., & Harvey, N. (2021). 2021 Accelerate State of DevOps Report. Google Cloud. https://cloud.google.com/devops/state-of-devops/

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site reliability engineering : how Google runs production systems (First edition. ed.). O’Reilly. https://landing.google.com/sre/sre-book/toc/index.html

Benninghoff, J. (2021). A cross-team study of factors contributing to software systems resilience at a large health care company [Master’s thesis, Trinity College Dublin]. Ireland.

2015 State of DevOps Report. (2015). Puppet Labs, IT Revolution. https://nicolefv.com/resources

Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/

Forsgren, N. (2017). The Key to High Performance: What the Data Says. https://www.youtube.com/watch?v=RBuPlMTXuFc&t=25s

comment