Chaos Engineering – The Practice Behind Controlling Chaos

Chaos Engineering might sound like a buzzword, but take it from someone who used to joke that his job title was Chief Chaos Engineer (more on that later): it is much more than buzz or a passing fad. It’s a practice.

The world can be a scary place, and more and more companies are turning to Chaos Engineering to proactively poke and prod their systems. In doing so, they are improving reliability and guarding against unexpected production failures and unplanned downtime.

Top 5 Incidents and Outages of 2021

Now that 2021 has come and gone, SREs can look back definitively at the major incidents of the past year. Let’s do that in this post by examining outages at AWS, Verizon, and other providers, along with what SREs can learn from them.

AWS Network Incident

2021 was not an excellent year for AWS, which suffered multiple network outages.

6 Steps SREs Should Take to Prepare for Black Friday and Cyber Monday 2021

Being an SRE is a tough (if rewarding) job on any day of the year. But it's especially challenging on Black Friday and Cyber Monday, the post-Thanksgiving sales period that includes the biggest online shopping day of the year. For simplicity, we'll refer to the whole period as Cyber Monday throughout this guide.

And for 2021, Cyber Monday promises to include not just the standard challenges associated with massive spikes in traffic but also a spike in cybersecurity attacks, which the FBI expects to surge in frequency this holiday season. And although security may not be SREs' main job, they'll be expected to assist security and DevSecOps teams in confronting the reliability threats that hackers pose.

7 Essential Tools for SREs

Introduction

Mastering the concepts at the core of reliability is the first step in becoming an SRE. But you also need tools to put those concepts into practice.

Which types of tools do SREs need to do their jobs? And what are the best tools in each category? This article answers these questions by discussing what SREs should think about when building their toolbox. It walks through the key categories of tools for SREs to leverage and suggests specific options in each one.

3 Lessons DevOps Can Learn From 5 Biggest Outages of Q2 2020

‘Learn from the mistakes of others. You can't live long enough to make them all yourself’ – Eleanor Roosevelt.

Nobody is immune to outages, but it’s better to learn from others’ mistakes than from your own. The second quarter of 2020 was marked by several serious outages at prominent services including IBM Cloud, GitHub, Slack, Zoom and even T-Mobile (Source: StatusGator Report). I’m sure you noticed these outages, as our team did. I decided to share the lessons we learned from this downtime, hoping we can all grow from it.

When Machine Identities Go Bad

Managing machine identities, such as SSL/TLS certificates, is boring, right? It’s not inspiring work, and it’s easily overlooked or forgotten in the day-to-day onslaught of changes and incidents in a typical enterprise technology department. And they seem like such little things… but when certificates go bad, well, life can turn pretty dark. Here are some real-life nightmares that resulted from mismanaged machine identities.

1. Expired Certificates Delayed Breach Detection

The notorious breach at Equifax: talk about reputational damage, right? Nearly 150 million customer records were stolen, including dates of birth and Social Security numbers. That’s a lot of people having sleepless nights about ID fraud thanks to an error somewhere in Equifax’s approach to cybersecurity. While the initial attack exploited a Struts vulnerability (a common one I still frequently see during application scanning), detecting the breach took 76 days. The reason it took 76 days: a misconfigured device that was supposed to inspect encrypted traffic on the network. The reason for the misconfiguration: a digital certificate that had expired ten months previously.
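An expiry like that one is cheap to catch ahead of time. The following is a minimal sketch, not a description of how Equifax or any particular product monitors certificates. It assumes you already have the certificate's "notAfter" string in the format Python's ssl module reports (for example "Jan  1 00:00:00 2024 GMT"); the function names and the 30-day warning window are hypothetical choices for illustration.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Return the days left before a certificate expires (negative if expired).

    `not_after` is the certificate's notAfter string, e.g. "Jan  1 00:00:00 2024 GMT".
    `now` can be passed in to make the check reproducible; it defaults to the current UTC time.
    """
    # ssl.cert_time_to_seconds converts the notAfter string to seconds since the epoch.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return (expires - current).total_seconds() / 86400


def needs_attention(not_after: str, warn_days: float = 30,
                    now: Optional[datetime] = None) -> bool:
    """True when the certificate is expired or expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run on a schedule against an inventory of certificates, a check like needs_attention would flag a certificate well before it lapses, rather than ten months after.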

Five Benefits of DevOps for Database and How to Achieve Them

One of the major benefits of DevOps is that it speeds up the development and delivery of applications and other software. It increases efficiency, reduces errors, and better leverages IT talent. But these benefits can be delayed when database changes are also required, because most DevOps teams don’t encompass databases. As a result, many DevOps teams work in a partially divided environment, which causes delays that reduce productivity and increase costs.

The strongest DevOps teams include database administrators (DBAs), with database DevOps functioning as a natural component of DevOps processes. Applying DevOps processes to database changes, and folding database teams into the wider DevOps team to form a single team, can help increase efficiency and deliver better results to end users. Once implemented, database DevOps contributes to a leaner, more reliable, and faster development process. By adopting database DevOps, companies typically see five major benefits.