site reliability engineering

March 30, 2022

Monitor Kubernetes Events With Falco For Free

Kubernetes is now the platform of choice for many companies to manage their applications both on-premises and in the cloud. Its emergence a few years ago drastically changed the way we work. The flexibility of this platform has allowed us to increase the productivity of engineering teams, which in turn calls for new working methods better suited to this dynamic environment.

Kubernetes also required security control processes to be adapted so that the system stays reliable. Falco is a tool that fits into this ecosystem.

February 23, 2022

Trino, Superset, and Ranger on Kubernetes: What, Why, How?

This article is an opinionated SRE point of view on an open-source stack to easily query, graph, audit, and secure any kind of access to data across multiple data sources. This post is the first part of a series of articles dedicated to MLOps topics. So, let’s start with the theory!

What Is Trino?

Trino is an open-source distributed SQL query engine that can be used to run ad hoc and batch queries against multiple types of data sources. Trino is not a database; it is an engine that aims to run fast analytical queries on big data file systems (like Hadoop, AWS S3, Google Cloud Storage, etc.), but also on various sources of distributed data (like MySQL, MongoDB, Cassandra, Kafka, Druid, etc.). One of the great advantages of Trino is its ability to query different datasets and then join the results, which makes data easier to access.

January 21, 2022

Site Reliability Engineer (SRE) Roles and Responsibilities

Software development is getting faster and more complex – frustrating IT operations teams more than ever. So, DevOps gained popularity in order to combat siloed workflows, decreased collaboration, and a lack of visibility. While establishing a culture of DevOps has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone specifically dedicated to developing systems that increase site reliability and performance. That’s where a site reliability engineer (SRE) comes into the picture.

The concept of SRE was initially brought to life by Google engineer Ben Treynor. Google later published its popular SRE book, helping the movement gain traction in the industry. Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.

December 3, 2021

Who Needs Site Reliability Engineers (SREs)?

From its humble origins as a role inside Google, site reliability engineering has become a type of position or team that a wide variety of companies now embrace.

But that doesn’t mean that every company under the sun needs SREs. In this article, we unpack how to determine whether SREs should be a part of a given organization. In so doing, we aim to help both employers who are trying to figure out whether they should invest in SREs, and SREs wondering which types of companies are looking for the skills they stand to offer.

November 13, 2021

Services Don’t Have to Be Eight-9s Reliable [Video]

Viktor: You are known in the observability community and SRE community very well. I’ve followed your work for a while during my time at Confluent, so I’m super excited to speak with you. Can you please tell us a little bit about yourself? Like what do you do? And what are you up to these days?

Liz: Sure. So I’ve worked as a site reliability engineer for roughly 15 years, and I took this interesting pivot about five years ago. I switched from being a site reliability engineer on individual teams like Google Flights or Google Cloud Load Balancer to advocating for the wider SRE community. It turns out that there are more people outside of Google practicing SRE than there are inside of Google practicing SRE.

November 7, 2021

SRE vs. DevOps: Responsibilities, Differences, and Salaries

There is significant debate around the differences between Site Reliability Engineering (SRE) and DevOps. Given that there are certain similarities between these two approaches to software development and deployment, it isn’t uncommon for people to use these terms interchangeably.

However, SRE and DevOps have distinct identities and processes in place to meet the requisite goals. This article will highlight the differences between the two with regard to fundamentals, associated responsibilities, and salary.

October 30, 2021

SRE vs. SWE: Similarities and Differences

SRE and SWE: These acronyms are only a letter apart, and they refer to similar roles within the realm of software development and management. However, SREs and SWEs are distinct types of jobs, even if the tools and skill sets associated with them overlap to a certain degree.

What is an SRE, what is an SWE, and how are SRE and SWE roles similar and different? Keep reading to find out.

October 19, 2021

Free Resources To Become SRE/DevOps Engineer

The purpose of this post is to centralize a set of free resources that offer a way to understand and develop Site Reliability Engineering (SRE) and DevOps skills. The content of this post draws on several years of experience in the industry and a willingness to share content that may still be unknown to people who would like to advance in their career or open themselves to new opportunities.

The purpose is not to explain what an SRE is or what the DevOps methodology is, but to describe what is probably the most important aspect of these roles: continuous learning.

September 22, 2021

Prometheus Blackbox: What? Why? How?

Introduction

Today, Prometheus is widely used in production by organizations. In 2016, it was the second project to join CNCF and, in 2018, the second project to graduate, after Kubernetes. As the project has seen a growing commercial ecosystem of implementers and adopters, a need has emerged to address specific aspects already implemented in older monitoring tools like Nagios. Blackbox service testing is one of them.

What Is Prometheus Blackbox?

As everyone knows, Prometheus is an open-source, metrics-based monitoring system. Prometheus does one thing, and it does it well. It has a powerful data model and a query language to analyze how applications and infrastructure perform.

May 29, 2021

Advice for Someone Moving From SRE to Backend Engineering

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

In the post, OP mentions a couple of things driving the motivation for the transition. One is a concern that they may be losing development skills because they’re spending so much time creating scripts and automating. The other reason is that they’re having trouble adjusting to on-call life.

April 21, 2021

What are MTTx Metrics Good For?

Introduction

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.
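To make the definition concrete, here is a minimal sketch of how one such metric, mean time to resolution, could be computed. The incident timestamps and the `mean_time_between` helper are hypothetical and only there for illustration; a real team would pull these values from its incident tracker.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2021, 4, 1, 10, 0), datetime(2021, 4, 1, 10, 30)),
    (datetime(2021, 4, 5, 14, 0), datetime(2021, 4, 5, 15, 30)),
    (datetime(2021, 4, 9, 8, 0), datetime(2021, 4, 9, 8, 15)),
]


def mean_time_between(records):
    """Average the elapsed time between the two timestamps of each record."""
    seconds = [(end - start).total_seconds() for start, end in records]
    return timedelta(seconds=mean(seconds))


# The "x" here is resolution: average time from detection to resolution.
mttr = mean_time_between(incidents)
print(mttr)  # 0:45:00
```

The three incidents took 30, 90, and 15 minutes, so the mean is 45 minutes even though two of the three were resolved faster than that. A single long incident can dominate the average.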

Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:

April 11, 2021

So You Want an SRE Tool. Do You Build, Buy, or Open Source?

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?

This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different benefits. We’ll help you decide which solution is your best fit by breaking down the pros and cons. In this blog post, we’ll cover:

March 28, 2021

How to Scale for Reliability and Trust

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one that customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well.

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency. It isn’t a problem that you can solve by throwing resources at it. Your organization will have to adapt its way of thinking and prioritization. In this blog post, we’ll look at how to:

March 19, 2021

It’s All Chaos! And It Makes for Resilience at Scale

Introduction

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

But integrating chaos engineering with other SRE tools and practices can be challenging. To get the most from your experiments, you’ll need to tie in learnings across all your reliability practices. You’ll also need to adjust your chaos engineering as your organization scales. In this blog post, we’ll look at:

March 12, 2021

How To Build an SRE Team With a Growth Mindset

Introduction

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow and encourage this mindset across the organization.

In this blog post, we’ll cover:

March 6, 2021

SRE and Organizational Transformation: Lessons from Activist Organizers

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone.

Any kind of major transformation like this requires a change in culture, which is a catch-all term for changing people’s principles and behaviors. As your organization grows, this will extend beyond product and engineering. At some point you also need to convince the key power-holders in your organization to invest in this transformation.

March 3, 2021

SRE2AUX: How Flight Controllers Were the First SREs

In the beginning, there were flight controllers. These were a strange breed. In the early days of the US Manned Space Program, most American households, regardless of class or race, knew the names of the astronauts: John Glenn, Alan Shepard, Neil Armstrong. The manned space program was a unifying force of national pride.

But no one knew the names of the anonymous men, and later women, who got the astronauts to orbit, to the moon, and most importantly, got them back to Earth. The Apollo 13 mission changed all that, not because it was successful, but because it was a successful failure; no one died.

February 17, 2021

Writing Better Production Readiness Checklists

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries, such as surgery and aviation, for almost a century. Checklists owe this long and widespread adoption to their usefulness.

Checklists can help limit errors when deploying code to production. In this blog post, we’ll cover: