A Chat with Lex Neva of SRE Weekly

Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, covering everything from SRE best practices to the sociotechnical side of systems to major outages in the news.

I had always figured Lex must be among the most well-read people in SRE, and likely #1. I met up with Lex on a call and was so excited to chat with him about how SRE Weekly came to be, how it continues to run, and his perspective on SRE.

What’s Difficult About Problem Detection? Three Key Takeaways

Welcome to episode 4 of our series "From Theory to Practice." Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, director of production support at Tala, and Laura Nolan, principal software engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.

It can be tempting to gloss over problem detection when building an incident management process. The process might start with classifying and triaging the problem and declaring an incident accordingly. The fact that the problem was detected in the first place is treated as a given, something assumed to have already happened before the process starts. Sometimes it is as simple as your monitoring tools or a customer report bringing your attention to an outage or other anomaly. But there will always be problems that won’t be caught with conventional means, and those are often the ones needing the most attention.
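
To make that concrete, here is a minimal sketch (in Python, not taken from the episode) of the kind of threshold check conventional monitoring relies on; the error-rate threshold and request counts are illustrative assumptions.

```python
# A hypothetical threshold check of the kind most monitoring stacks implement.
# It catches problems that cross a known line, but says nothing about slow
# degradations or failure modes no one thought to measure.

def should_alert(error_count: int, request_count: int, threshold: float = 0.05) -> bool:
    """Alert when the error rate over a window exceeds a fixed threshold."""
    if request_count == 0:
        return False  # no traffic at all is its own, easy-to-miss problem
    return (error_count / request_count) > threshold

# 30 failures out of 1,000 requests is only a 3% error rate, so a 5% threshold
# stays silent, even if every one of those failures hit your biggest customer.
print(should_alert(error_count=30, request_count=1000))  # False
```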

SRE: From Theory to Practice: What’s Difficult About Tech Debt?

In episode 3 of "From Theory to Practice," Blameless’s Matt Davis and Kurt Andersen were joined by Liz Fong-Jones of Honeycomb.io and Jean Clermont of Flatiron to discuss two words dreaded by every engineer: technical debt. So what is technical debt? Even if you haven’t heard the term, I’m sure you’ve experienced it: the parts of your system that are broken or not quite up to par, but that no one seems to have the time to work on.

Pretend your software system is a house. Tech debt is the leak in your sink that you haven’t gotten around to fixing yet. Tech debt is the messy office you haven’t organized in a while. It’s also the new shelf you bought but haven’t installed. To-dos like these quickly build up over time. Even if each task is quick on its own, there are so many of them that it’s tough to know where to start.

SRE: From Theory to Practice: What’s Difficult About Incident Command?

A few weeks ago, we released episode two of our ongoing webinar series, "SRE: From Theory to Practice." In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “What’s difficult about incident command?” When things go wrong, who is in charge? And what is it like to take on that role? To discuss, Jake Englund and Matt Davis from Blameless were joined by Varun Pal, Staff SRE at Procore, and Alyson Van Hardenburg, Engineering Manager at Honeycomb.

To explore how organizations felt about incident command, we asked about the role on our community Slack channel, an open space for SRE discussion. We found that most organizations don’t have dedicated incident commander roles. Instead, on-call engineers are trained to take on the command role when appropriate. Because of this wide range of people who could end up wearing the incident commander hat, it’s important to have an empathetic understanding of exactly what the role entails.

SRE: From Theory to Practice: What’s Difficult About On-call?

We wanted to tackle one of the major challenges facing organizations: on-call. "SRE: From Theory to Practice - What’s Difficult About On-call" sees Blameless engineers Kurt Andersen and Matt Davis joined by Yvonne Lam, staff software engineer at Kong, and Charles Cary, CEO of Shoreline, for a fireside chat about everything on-call. 

As software becomes more ubiquitous and necessary in our lives, our standards for reliability grow alongside it. It’s no longer acceptable for an app to go down for days, or even hours. But incidents are inevitable in such complex systems, and automated incident response can’t handle every problem.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot in the context of software engineering. We ask questions about service availability and how much it matters to specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application affects the entire business, with significant costs. A reliability-first mindset is a business imperative that all teams should share.

But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams? In this blog post, we’ll investigate:

Building an SRE Team With Specialization

As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization. Most structures are some combination of the two, with some SREs embedded in specific project teams and other work handled centrally by the SRE team.

When looking at centralized SRE teams, there are further distinctions to make based on the role of each SRE. One perspective says every SRE should be a generalist, capable of performing every duty of the role. This has the advantage of being very robust: if each SRE can do any given job, no single person’s absence causes an issue. On the other hand, you could run into a “jack of all trades, master of none” problem, where no one develops the deep expertise that some problems demand. This is where the specialization perspective can help.

How Disaster Ready Are Your Backup Systems, Really?

In SRE, we believe that some failure is inevitable. Complex systems receiving updates will eventually experience incidents that you can’t anticipate. What you can do is be ready to mitigate the damage of these incidents as much as possible.

One facet of disaster readiness is incident response: setting up procedures to resolve the incident and restore service as quickly as possible. Another is reducing the chances of failure with tactics like eliminating single points of failure. Today, we’ll talk about a third type of readiness: having backup systems and redundancies to quickly restore function when things go very wrong.
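
As a rough illustration, here is a hypothetical restore drill in Python; the SQLite backup, the `orders` table, and the row-count check are assumptions chosen for the example. The point is that a backup only counts as readiness once you have actually restored from it.

```python
# Hypothetical sketch: periodically restore a backup into a scratch database
# and sanity-check its contents, rather than trusting that the backup "exists."
import os
import shutil
import sqlite3
import tempfile

def verify_backup(backup_path: str, expected_min_rows: int) -> bool:
    """Restore a SQLite backup into a scratch copy and check it holds real data."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        restored = os.path.join(scratch_dir, "restored.db")
        shutil.copyfile(backup_path, restored)  # the "restore" step
        conn = sqlite3.connect(restored)
        try:
            # `orders` is a placeholder for whatever table proves the data is real.
            (row_count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        finally:
            conn.close()
        return row_count >= expected_min_rows
```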

How To Write Meaningful Retrospectives

One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are 7 main elements to a retrospective. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions. In this article, we’ll break down how to elevate these 7 elements to produce more meaningful retrospectives.

1. Messages to Stakeholders

Incident retrospectives can be the core of your post-incident communication with customers and other stakeholders. We talk a lot about how retrospectives function best when they involve input and feedback from all relevant stakeholders. That doesn't necessarily mean squeezing tons of folks into one meeting or sending out one long PDF to a large group without thoughtful consideration.

How To Analyze Contributing Factors Blamelessly

SRE advocates addressing problems blamelessly. When something goes wrong, don't try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don't fear blame, leading to more initiative and innovation.

Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we'll look at:

Here’s What SLIs AREN’T

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics collected from the system. SLIs transform lower-level machine data into something that captures user happiness.

Your organization might already have processes with this same goal. Techniques like real-time telemetry and using synthetic data also build metrics that meaningfully represent service health. In this article, we’ll break down how these techniques vary, and the unique benefits of adopting SLIs.
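
As a sketch of that transformation (assumed for this post, not taken from it), here is how raw request records might become a single availability SLI in Python; the 300 ms latency target and the record format are illustrative.

```python
# Illustrative only: turn low-level request records (status code, latency in ms)
# into one number that approximates user happiness.

def availability_sli(requests: list[tuple[int, float]], latency_target_ms: float = 300.0) -> float:
    """Fraction of valid requests that were both successful and fast enough."""
    valid = [r for r in requests if r[0] != 0]  # drop malformed records
    good = [r for r in valid if r[0] < 500 and r[1] <= latency_target_ms]
    return len(good) / len(valid) if valid else 1.0

requests = [(200, 120.0), (200, 450.0), (503, 80.0), (200, 95.0)]
print(f"SLI: {availability_sli(requests):.2%}")  # 50.00% -- only 2 of 4 requests were "good"
```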

What are MTTx Metrics Good For?

Introduction

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.
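
For instance, mean time to resolve (MTTR) is just an average over incident durations, as in this small sketch (the timestamps are made up for illustration):

```python
# Illustrative MTTR calculation from (detected, resolved) timestamp pairs.
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration from detection to resolution."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 9, 45)),     # 45 minutes
    (datetime(2023, 1, 12, 14, 0), datetime(2023, 1, 12, 20, 0)),  # 6 hours
    (datetime(2023, 1, 20, 3, 0), datetime(2023, 1, 20, 3, 30)),   # 30 minutes
]
print(mean_time_to_resolve(incidents))  # 2:25:00 -- the average hides how different these incidents were
```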

Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:

So You Want an SRE Tool. Do You Build, Buy, or Open Source?

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?

This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different benefits. We’ll help you decide which solution is your best fit by breaking down the pros and cons. In this blog post, we’ll cover:

How to Scale for Reliability and Trust

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one that customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well.

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency. It isn’t a problem that you can solve by throwing resources at it. Your organization will have to adapt its way of thinking and prioritization. In this blog post, we’ll look at how to:

It’s All Chaos! And It Makes for Resilience at Scale

Introduction

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.
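
As a toy example of the idea (not any particular chaos tool’s API), the Python sketch below wraps a dependency call so that a configurable fraction of calls fails or slows down, letting you watch how the caller copes:

```python
# Hypothetical fault injection: wrap a dependency call so some calls fail or
# are delayed, then observe whether callers degrade gracefully.
import random
import time

def chaos_wrap(func, failure_rate: float = 0.1, added_latency_s: float = 0.5):
    """Return a version of `func` that randomly injects latency and errors."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("chaos: injected dependency failure")
        time.sleep(added_latency_s)  # injected latency on every call that survives
        return func(*args, **kwargs)
    return wrapped

def fetch_profile(user_id: int) -> dict:
    return {"user_id": user_id, "name": "example"}  # stand-in for a real dependency

flaky_fetch = chaos_wrap(fetch_profile, failure_rate=0.2)  # use in a test harness, not blindly in production
```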

But integrating chaos engineering with other SRE tools and practices can be challenging. To get the most from your experiments, you’ll need to tie in learnings across all your reliability practices. You’ll also need to adjust your chaos engineering as your organization scales. In this blog post, we’ll look at:

How To Build an SRE Team With a Growth Mindset

Introduction

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow and encourage this mindset across the organization.

In this blog post, we’ll cover:

Writing Better Production Readiness Checklists

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been relied upon in industries such as surgery and aviation for almost a century, and they owe that long and widespread adoption to their usefulness.

Checklists can help limit errors when deploying code to production. In this blog post, we’ll cover:

How to Have a Cloud Transition You Can Be Proud Of

In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. Cloud infrastructure can be more reliable than in-house servers for reasons including:

  • Large hosting providers have many infrastructure redundancies, which means individual servers can fail without affecting customers
  • Cloud providers benefit from strong security measures to mitigate breaches
  • Cloud platforms offer high bandwidth and capacity, reducing the risk of capacity-related outages

However, as with all things, cloud providers present their own risks and challenges as well. Teams will want to take advantage of the benefits while accounting for these limitations. To do this, your DevOps practices must be built with the cloud in mind. In this blog, we’ll look at how SRE helps with migrating and operating in the cloud, as well as share some tips on how to maximize reliability.