Creating an SRE Practice: Why and How

This is an article from DZone's 2022 Performance and Site Reliability Trend Report.

For more:


Read the Report

Site reliability engineering (SRE) is the state of the art for ensuring services are reliable and perform well. SRE practices power some of the most successful websites in the world. In this article, I'll discuss who site reliability engineers (SREs) are, what they do, key philosophies shared by successful SRE teams, and how to start migrating your operations teams to the SRE model.

The Next Frontier for Observability: Data Ownership With OpenTelemetry

Observability is a mindset that lets you use data to answer questions about business processes. In short, collecting as much data as possible from the components of your business — including applications and key business metrics — then using an AI-powered tool to help consolidate and make sense of this huge volume of data gives you observability into your business.

Having observability for your business and applications lets you make smarter decisions, faster. You save time troubleshooting and can proactively solve problems before they impact customers.

Site Reliability Engineer (SRE) Roles and Responsibilities

Software development is getting faster and more complex – frustrating IT operations teams more than ever. So, DevOps gained popularity in order to combat siloed workflows, decreased collaboration, and a lack of visibility. While establishing a culture of DevOps has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone specifically dedicated to developing systems that increase site reliability and performance. That’s where a site reliability engineer (SRE) comes into the picture.

The concept of SRE was initially brought to life by Google engineer, Ben Treynor. Then, shortly after implementing SRE, they published their popular SRE eBook – helping the movement gain traction in the industry. Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.

Observability: It’s Not What You Think

What Is Observability?

Observability is a mindset that enables you to answer any question about your entire business through the collection and analysis of data. If you ask other folks, Observability is the dry control theory definition of “monitoring the internal state of a system by looking at its output,” or it’s the very technical definition of “metrics, traces, and logs.” While these are correct, Observability isn’t just one thing you implement, then proudly declare “now this system has Observability™.” Building Observability into your business lets you answer questions about your business.

What Kind of Questions?

Of course, the basic “what happened in our app when this error count spiked up” questions can be answered with Observability tools, but that’s barely scratching the surface of what Observability  actually is. What an Observability mindset lets you do is to figure out why the error count spiked up. If you’re intimately familiar with your app and all of its dependencies, then perhaps you can get this insight from a monitoring system, but as modern apps become increasingly more complex, the ability to maintain the state of them in your head becomes more and more challenging. Business demands, feature launches, A/B tests, refactoring into microservices… things like this all combine to create ever-increasing entropy, so knowing everything about your system without help gets more difficult by the day.