site reliability engineering | The Blog Pros

January 4, 2021

Modern Operations Best Practices From Engineering Leaders at New Relic and Tenable

As reliability shifts left, more companies are adopting SRE best practices. These practices go beyond conducting incident retrospectives. At their heart are a blameless culture and a desire to grow from each incident.

In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed how teams can embrace SRE best practices and make cultural shifts towards blamelessness. The Executive Fireside Chat members included:

December 14, 2020

What Financial Crises Can Teach Us About SRE

In light of the pandemic, the global economy is suffering. While this downturn is extreme, it’s not irreparable. In fact, after experiencing economic meltdowns such as the Great Depression and the Great Recession, we’ve learned a great deal about how to regulate our economies to prevail through and recover from such upsets.

In the article “Are We Safer? The Case for Strengthening the Bagehot Arsenal,” Tim Geithner, former United States Secretary of the Treasury and President of the Federal Reserve Bank of New York, focuses on how disaster happens, disaster response, and the craft of financial crisis management. From Geithner’s experience and research, we can draw parallels between how financial crises are managed and how we can view SRE as a crisis prevention and response solution within our organizations.

December 13, 2020

Here Are the Top Predictions for SRE in 2021

Who else is glad that 2020 is almost over? We’ve had one of the most difficult years in recent history. With everything going on, it’s been difficult to think further than a few days out, much less into the new year. But, we’re hopeful that 2021 will be a better year for everyone. And we’re predicting some exciting things in the future for SRE.

Here are our two cents: SRE adoption will only continue to grow. Yet the practice and culture shift, rather than the role, will take priority in 2021. More people (not only SREs) will have a reliability mindset, which means reliability will keep shifting left through the software lifecycle. SLIs, SLOs, and error budget policies will become common practice. Practices such as observability, runbook automation, and blameless retrospectives will continue to be table stakes.
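To make the SLI, SLO, and error budget terms concrete, here is a minimal Python sketch. It assumes a simple event-based SLI (good events divided by total events) and treats the error budget as the share of events the SLO allows to fail. The function name, field names, and sample numbers are hypothetical and chosen only for illustration. They do not come from the post.

```python
def error_budget_report(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error budget usage for one SLO window.

    Assumption: the SLI is the fraction of good events in the window,
    and the error budget is the fraction of events allowed to be bad.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= bad_events <= total_events")

    # SLI: observed share of good events.
    sli = (total_events - bad_events) / total_events
    # Error budget: number of bad events the SLO tolerates in this window.
    allowed_bad_events = (1.0 - slo_target) * total_events
    # Share of that budget already spent (values above 1.0 mean it is exhausted).
    budget_consumed = bad_events / allowed_bad_events

    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad_events,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: a 99.9% SLO, one million requests, 250 failures.
    print(error_budget_report(0.999, 1_000_000, 250))
```

An error budget policy would then attach actions to `budget_consumed`, for example pausing feature releases once the budget is spent. The threshold and the action are decisions for each team.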

November 10, 2020

Engineers, Stop Hoarding Your Metrics

Metrics are the golden ticket to knowing what’s going on with your system… or so everyone thinks. But there can be too much of a good thing. Are your metrics really doing you any favors? Are they letting you see into what your customers truly want from you? If not, you might have a problem. You might be fetishizing your metrics. The good news is you’re definitely not alone.

Like The Hobbit’s dragon Smaug lying on his pile of gold, never spending and only hoarding, many of us stockpile pretty, feel-good, but useless metrics that never make a difference. In fact, they could be clouding your ability to get the context and clarity you need from your metrics. In this blog post, we’ll help you kick your fetish and move away from Smaug-ing up all your metrics.

November 4, 2020

Insights On Chaos Engineering and SRE With Yury Niño Roa

Blameless recently had the pleasure of interviewing Yury Niño Roa, Site Reliability Engineer, Solutions Architect and Chaos Engineering Advocate at ADL Digital Labs. She’s worked in roles ranging from solutions architect, to software engineering professor, to DevOps engineer, to SRE. Additionally, Yury is an avid blogger and conference speaker who regularly presents at events such as Chaos Conf, DevOpsDays Bogotá, and more.

In this interview, we’ll delve into what draws Yury to SRE and chaos engineering, how she defines resilience, as well as her predictions on emerging trends in the SRE landscape.

October 7, 2020

4 Signs That Software Reliability Should Be Your Top Priority

You know the companies that break away from the pack. You buy their products with prime shipping, you ride in their cars. You’ve seen them disrupt entire industries. It might seem like giants such as Amazon and Uber have always existed as towering pillars of profit, but that’s not so. What sets companies like these apart is a crucial piece of knowledge. They spotted the tipping point at which reliability becomes the top priority for a software company’s success.

Pinpointing this tipping point is hard. After all, many companies can’t afford to stop shipping new features to shore up their software. Timing the transition to reliability well can launch a company ahead of the competition and win the market (e.g. Amazon, Home Depot). But missing it can spell a company’s or even an industry’s doom (e.g. Barnes & Noble, Forever 21, and Gymboree in the retail apocalypse). Luckily, there are signs as you approach the tipping point. From examining over 300 companies, we’ve identified four.

September 24, 2020

Availability, Maintainability, Reliability: What’s the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.
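As a rough illustration of how these terms relate, the Python sketch below uses the common steady-state definition of availability, MTBF / (MTBF + MTTR). Here MTBF (mean time between failures) stands in for reliability and MTTR (mean time to repair) stands in for maintainability. This is a standard textbook formula, not necessarily the breakdown the post itself uses. The function names and sample figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes(avail: float, window_days: int = 30) -> float:
    """Expected downtime, in minutes, over a window at a given availability."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("avail must be between 0 and 1")
    return (1.0 - avail) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical service: fails about every 999 hours, takes 1 hour to repair.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.4%}")
    print(f"expected downtime per 30 days: {downtime_minutes(a):.1f} minutes")
```

Under this definition, the same availability can come from rare but long outages or from frequent but short ones. That is one reason the distinction between the terms matters when prioritizing work.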

July 22, 2020

How SLOs Help Your Team With Service Ownership

Service ownership is becoming a best practice for teams looking to innovate while maintaining the level of reliability that customers expect. Service ownership means seeing the service through its entire lifecycle. In short, it means you build it, you run it. You’ll be responsible for the service’s security, reliability, performance, and quality.

This doesn’t mean you won’t have help from SREs to optimize or automate toil. It does mean that, as a developer, you need to build with quality in mind as you’ll be on call if your code breaks.

July 2, 2020

Twitter’s Reliability Journey

Twitter’s SRE team is one of the most advanced in the industry, managing the services that capture the pulse of the world every single day, and throughout the moments that connect us all. We had the privilege of interviewing Brian Brophy, Sr. Staff SRE; Carrie Fernandez, Head of Site Reliability Engineering; JP Doherty, Engineering Manager; and Zachary Kiehl, Sr. Staff SRE, to learn about how SRE is practiced at Twitter.

As a company, Twitter is approximately 4,800 employees strong with offices around the world. SRE has been part of the engineering organization formally since 2012, though foundational practices around reliability and operations began emerging earlier. Today, SRE at Twitter features both embedded and core/central engagement models, with team members that hold the SRE title as well as those without but who perform SRE responsibilities. Regardless of their role or title, a key mantra among those who care about reliability is “let’s break things better the next time.”

June 30, 2020

Reduce Engineering Problems With a Resiliency Mindset

Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.

Increase Cognitive Capacity With Runbooks

As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially with growing systems complexity and reliance on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, all cases where an incident is involved), incident runbooks can help guide you through the process and maximize the use of time.

June 12, 2020

Site Reliability Engineering (SRE) 101 With DevOps vs SRE

Consider the Scenario Below

An Independent Software Vendor (ISV) developed a financial application for a global investment firm that serves global conglomerates, leading central banks, asset managers, broking firms, and governmental bodies. The development strategy for the application encompassed a well-thought-out DevOps plan with cutting-edge agile tools. This ensured zero-downtime deployment at maximum productivity. The app handled financial transactions in real time at an enormous scale, while safeguarding sensitive customer data and facilitating uninterrupted workflow. One unfortunate day, the application crashed, and the investment firm suffered a severe backlash (monetarily and morally) from its customers.

Here is the backstory: the application’s workflow exchange had crossed its transactional threshold limit, and a lack of responsive remedial action crippled the infrastructure. The intelligent automation brought forth by DevOps was confined mainly to the development and deployment environments. IT operations thus remained susceptible to challenges.

March 23, 2020

The Importance of SRE (and How It’s Changing)

Today, companies are increasingly turning to the cloud to push services out to multiple geographies and ever-increasing user bases. Scaling up is of course beneficial, but maintaining reliability, security and safety standards at scale presents a significant challenge.

There’s no handbook to assist with delivering a service effectively at scale, but firms would do well to follow the example of larger companies that have led the way in the cloud. As one of the four big techs, AKA GAFA, Google is both a forerunner and a prime example of how to build and run services at scale. A core component of its success was its implementation of, and continued focus on, site reliability engineering (SRE).

January 27, 2020

Four DevOps Trends for 2020

Last year, I put together a few thoughts on what I saw as the emerging DevOps trends for 2019. As we enter a new year and decade, I thought it might be useful to do the same for 2020. A common theme in this year’s trends concerns the way in which firms are dealing with delivering services at scale in the cloud, which I think could be a grand trend for the decade, and one I wanted to highlight from the outset. For now, here are four trends for the year ahead.

1. Site Reliability Engineering

As more and more companies leverage the cloud to host their services, how do they manage large user bases around the globe without a large 24x7 Operations team? Embracing failure and observing standard setters such as Google, Netflix, and Spotify, firms are looking to site reliability engineering (SRE) for the answers.