As I’ve mentioned in recent posts, I’m cataloging my past presentations, and this is the last one! This talk from SIRAcon 2023, “Measuring and Communicating Availability Risk”, summarizes my experiences leading Site Reliability Engineering (SRE).
The particular focus of my SRE work was on measuring availability and availability risk, and I learned quite a bit over the three years or so I did SRE. One of the key lessons was that the value of measuring availability using Service Level Objectives (SLOs) lies in decision support (SIRA helped with this framing). That is, SLOs and the associated measurements help teams decide what to do: operationally during an incident, tactically over the course of a month, and strategically over several months and into the future.
Our biggest success came from measuring availability in ways that supported all three timescales. Using an explicitly defined, customer-focused measure of “available”, we were able to construct visualizations that helped during incidents (real-time), during maintenance planning (one month), and for longer-term work (many months).
A key element of this success was the business imperative: the work supported a large and important client, who had just negotiated a significant increase in availability by no longer allowing us to exclude scheduled downtime from our availability target. The Service Level Indicator (SLI) we created helped our incident responders understand outages better, and the SLO we created allowed our teams to schedule maintenance with confidence, or to confidently defer it. A hidden benefit was that the metrics, being based on direct observations from our monitoring tools, aligned the different stakeholders around a common view of how available our systems were. The new approach we developed was even adopted by our client as an improvement.
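The error-budget arithmetic behind this kind of SLO can be sketched in a few lines. This is a minimal illustration, not the actual measure from the engagement: the 99.9% target, the 30-day window, and the downtime figures are all assumed numbers, and the key modeling choice shown is that scheduled downtime counts against the budget along with unplanned outages.

```python
# Sketch of SLO error-budget arithmetic. All figures are illustrative
# assumptions, not the actual numbers from the client engagement.

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Total downtime allowed in the window for a given SLO target."""
    return (1.0 - slo_target) * window_minutes

def remaining_budget(slo_target: float, window_minutes: int,
                     unplanned_down: float, scheduled_down: float) -> float:
    """Error budget left when scheduled downtime also counts against
    the target (as in the client agreement described above)."""
    budget = error_budget_minutes(slo_target, window_minutes)
    return budget - (unplanned_down + scheduled_down)

# Example: a 99.9% target over a 30-day month (43,200 minutes)
# gives roughly a 43.2-minute error budget.
budget = error_budget_minutes(0.999, 43_200)
left = remaining_budget(0.999, 43_200,
                        unplanned_down=10.0,   # minutes of outages so far
                        scheduled_down=20.0)   # minutes of planned maintenance
# With about 13 minutes of budget left, a 30-minute maintenance window
# would overspend the budget and could be confidently deferred.
```

The point of framing it this way is decision support: the remaining budget turns “can we afford this maintenance window?” into a simple comparison rather than a debate.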
A copy of my slides is here, and the visual notes from the talk are below! As a bonus, you’ll get to see a photo of my dog, Gertie, which was added at the last minute as part of an ongoing cats-vs-dogs competition at the conference.
As I posted yesterday, I’m working through cataloging my past presentations, and I’m nearly done! Today I’m sharing a talk from SIRAcon 2022 that’s quite different from what I typically do, “Making R Work for You (With Automation)”.
Many SIRA members do data analysis as part of their work, and talk about the results of their analysis at SIRAcon. However, we don’t often talk about the mechanics of our craft: how we go about doing data analysis. In 2019, though, Elliot Murphy gave a talk about just that, showing how to use Jupyter (Python) and R Notebooks for data analysis. His presentation inspired me to start working with R using R Notebooks, and I wanted to share what I’d learned, and built, to automate my workflow.
I think the talk went reasonably well, although it was hard to say for sure, as the conference was once again virtual that year. Unfortunately, some of the key attendees weren’t able to make it, and I didn’t get their feedback, although one of them later watched the replay and shared that what I did was similar to his approach.
Aside from learning how to write better R code, I learned a couple of things from the experience (both doing it and talking about it):
Doing something brings deeper knowledge than reading about it. One of my goals with R was to learn good software engineering practices (documentation, testing, source code control, etc.), including DevOps practices (continuous integration and continuous delivery, CI/CD). While my experience was limited mainly to working by myself, I did come away with a better and deeper understanding of what it’s like to write modern software.
If writing software was more physically demanding, we’d probably do a better job creating tools and automation to help with the writing. As I noted in my talk, the carpenters who worked on our house spent a whole day setting up their environment to make it easier to move materials they were removing to the dumpster, and didn’t try to just brute-force the work. Experience and the challenge of physical labor led them to an economy of movement.
A copy of my slides is here, and the visual notes from the talk are below!
As I work through cataloging my past presentations this week, I’ve come across a few that I haven’t yet posted here (or on https://transvasive.com). I’ll be posting them here over the next three days.
One of the “missing” talks was a short slide deck I put together as part of a “Papers We Love” discussion on Learning from Cyber Incidents: Adapting Aviation Safety Models to Cybersecurity, a paper published by a working group organized by Harvard’s Belfer Center to explore the concept of creating a “Cyber NTSB”.
I came across this paper having met one of the lead authors, Adam Shostack. Adam especially has been interested in creating a “Cyber NTSB”, an idea we share, although I likely take a broader interest in adapting safety to cybersecurity.
The paper is well written, and the workshop seems to have been well thought out: it included presentations from people actually working at the NTSB, grounding the discussion in work-as-done rather than work-as-imagined at the NTSB. It also included a session led by the psychologist and safety scientist David Woods on cross-domain learning; as I discovered in my own studies, safety doesn’t translate directly between domains (for example, between aviation and marine safety). The findings are sound, follow current safety science thinking, and are included in the slides.
For me, the practical takeaways were and remain:
A recurring theme is the discussion of blame, and how the NTSB specifically avoids assigning liability in accident investigations, since avoiding blame improves learning.
There are domain-specific challenges unique to security; don’t blindly copy what works in aviation safety.
Near-miss reporting is an important complement to incident investigation; share stories of the close calls.