incident management | The Blog Pros

March 30, 2022

How To Build a Strong Incident Response Process

When building an incident response process, it’s easy to get overwhelmed by all the moving parts. Less is more: focus first on building solid foundations that you can develop over time. Here are three things we think form a key part of a strong process.

I’d recommend taking these one at a time as you introduce incident response across your organization.

March 30, 2022

SRE vs. Platform Engineering: The Key Differences, Explained

Site Reliability Engineering (SRE) teams and Platform Engineering teams share similar goals, like maximizing automation and reducing toil, and similar methodologies. However, they have different priorities and use somewhat different tools to achieve them.

What are SREs? What are platform engineers? How is each role similar and different? This article explains.

March 29, 2022

Why a Site Reliability Engineer Is Important to Your CI/CD Pipeline

This is an article from DZone's 2022 DevOps Trend Report.


Continuous integration and continuous deployment are the two major components of DevOps principles. Every organization that wants to move away from the traditional way of working has to learn, design, and implement a mature CI/CD pipeline. Having a mature CI/CD pipeline is a good start for site reliability engineering, but alone, it’s not enough. The site reliability engineering (SRE) methodology brings a new perspective to the software development life cycle by aiming to achieve reliability at scale.

March 11, 2022

What Does AIOps Mean for SREs? It’s Complicated

If you’re an SRE, you might view AIOps with great excitement. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier.

Alternatively, SREs may choose to view AIOps with disdain. They might think of AIOps as just a fancy buzzword that doesn’t live up to its promises, and that can become a distraction from the SRE tools that really matter.

March 5, 2022

What SREs Can Learn From Capt. Sully: When To Follow Playbooks

When are you smarter than your playbooks, and when are your playbooks smarter than you?

That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested and that we should stick to them at all costs.

February 18, 2022

Why and How SREs Can Benefit From Feature Flags

When you think of who uses feature flags, your mind most likely goes to developers. In general, feature flags are closely associated with software engineering.

But Site Reliability Engineers (SREs), too, can benefit from feature flags. SREs may not be the ones to create feature flags, but they should work closely with developers to ensure that the applications their teams support include feature flags.

January 21, 2022

A Primer on the History and Evolution of Incident Management to Today

What’s the history of incident management?

If you’re an SRE, you may be so caught up in the day-to-day work of managing reliability and responding to incidents that you never take time to step back and ask that question. And that’s a shame because SREs didn’t invent incident management concepts and strategies on their own.

January 14, 2022

Top 5 Incidents and Outages of 2021

Now that 2021 has come and gone, it’s possible for SREs to look back definitively at the major incidents that occurred during the past year. Let’s do that in this post by examining outages on platforms like AWS, Verizon, and beyond — and what SREs can learn from these incidents.

AWS Network Incident

2021 was not an excellent year for AWS, which suffered multiple network outages.

January 10, 2022

What Log4j Vulnerability Means for SREs

If you’re an SRE, you’ve almost certainly heard all about Log4Shell, the Log4j vulnerability that some analysts are calling the worst software security flaw in decades. And you’ve also hopefully by now patched any systems you manage to fix the vulnerability (if you haven’t, go do that right away!).

Even after you’ve patched Log4Shell in your environments, though, you shouldn’t put the vulnerability in the back of your mind. For SREs, there are some important lessons to glean from this fiasco.

December 3, 2021

Who Needs Site Reliability Engineers (SREs)?

From its humble origins as a role inside Google, site reliability engineering has become a type of position or team that a wide variety of companies now embrace.

But that doesn’t mean that every company under the sun needs SREs. In this article, we unpack how to determine whether SREs should be a part of a given organization. In so doing, we aim to help both employers who are trying to figure out whether they should invest in SREs, and SREs wondering which types of companies are looking for the skills they stand to offer.

December 2, 2021

Incident Management Process and Tools

Incident management is one of the most critical processes a software development team has to get right. Service outages can be costly to the business, and teams need an efficient way to respond to and resolve these issues quickly. For example, many organizations report downtime costing more than 300,000 euros per hour, according to Gartner. For some web-based services, that number can be dramatically higher. In this article, we will discuss how critical it is to have a reliable method to prioritize incidents, how to get to resolution faster, and how to offer better service to end users.

What is Incident Management?

First of all, what is incident management exactly? It is the process used by DevOps and software development teams to respond to an unplanned event or service interruption and restore the service to its operational state.

November 26, 2021

6 Steps SREs Should Take to Prepare for Black Friday and Cyber Monday 2021

Being an SRE is a tough (if rewarding) job on any day of the year. But it's especially challenging on Black Friday and Cyber Monday, the post-Thanksgiving event that has become the biggest online shopping day of the year. We'll refer to it as Cyber Monday throughout this guide.

And for 2021, Cyber Monday promises to include not just the standard challenges associated with massive spikes in traffic but also a spike in cybersecurity attacks, which the FBI expects to surge in frequency this holiday season. And although security may not be SREs' main job, they'll be expected to assist security and DevSecOps teams in confronting the reliability threats that hackers pose.

November 19, 2021

History of SRE: Why Google Invented the SRE Role

If you know anything about the origins of Site Reliability Engineering, or SRE, you know that the concept was born at Google.

But why did Google establish the SRE role? And how did SRE spread from the search giant to companies of all types -- including but not limited to Web-scale businesses with massive reliability needs?

October 30, 2021

SRE vs. SWE: Similarities and Differences

SRE and SWE: These acronyms are only a letter apart, and they refer to similar roles within the realm of software development and management. However, SREs and SWEs are distinct types of jobs, even if the tools and skill sets associated with them overlap to a certain degree.

What is an SRE, what is an SWE, and how are SRE and SWE roles similar and different? Keep reading to find out.

October 23, 2021

An Introduction to Incident Response Roles

In the world of reliability engineering, folks talk frequently about “incident response teams.” But they rarely explain what, exactly, an incident response team looks like, how it’s structured, or which roles organizations should define for incident response.

That’s a problem because your incident response team is only as effective as the roles that go into it. Without the right structure and responsibilities, you risk leaving gaps in your incident response plan that could undercut your team’s ability to respond quickly and efficiently to all aspects of an incident.

October 8, 2021

What SREs Can Learn From Facebook’s Largest Outage

Facebook’s October 2021 outage was the type of event that gives SREs nightmares: a series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.

It’s also a pretty big learning opportunity for SREs. The outage is a lesson in how even expertly planned systems can sometimes fail, despite having multiple layers of reliability built in.

October 1, 2021

Google’s State of DevOps 2021 Report: What SREs Need to Know

SRE and DevOps deliver the best value when used together. Culture is key to avoiding burnout. You need the cloud more than ever.

These are among the main takeaways from Google Cloud’s latest Accelerate State of DevOps report, which examines how companies are using DevOps practices. Based in part on a survey of more than 32,000 professionals, the 2021 report, which was compiled by the DevOps Research and Assessment (DORA) team, identifies best practices that DevOps and SRE teams can use today to achieve operational excellence.

September 4, 2021

The Role of SREs in Observability

How do you achieve observability, which means the ability to understand the internal state of a system based on external outputs?

The most obvious answer to that question is to deploy observability tools, which can collect and correlate data from multiple sources to provide visibility into the internal state of a system.