How To Build a Strong Incident Response Process

When building an incident response process, it’s easy to get overwhelmed by all the moving parts. Less is more: focus first on building solid foundations that you can develop over time. Here are three things we think form a key part of a strong process.

I’d recommend taking these one at a time as you introduce incident response throughout your organization.

What Is an SRE? How To Land an SRE Role Today

What is SRE?

Site Reliability Engineering (SRE) is a relatively new term in the software industry. It applies a software engineering approach to system management and problem-solving. Think of it as a new form of system administration.

In SRE, software engineers take on tasks that are usually performed by the operations team. The engineers themselves are responsible for the availability, latency, performance, capacity, scalability, and deployment of software systems.

Five Ways Developers Can Help SREs

It is not easy to be a Site Reliability Engineer (SRE). Monitoring system infrastructure and aligning it with key reliability metrics is a daunting task. A software engineer's job, by contrast, is to deliver high-quality software.

Relationships between software engineers and site reliability engineers can sometimes be tricky. To begin with, developers are generally assigned to write code that goes into production. Then, there are SREs who are responsible for improving the product's reliability and performance. 

Monitoring Kubernetes cert-manager Certificates With BotKube

The monitoring and alerting stack is a crucial part of SRE practice. BotKube helps here by monitoring your Kubernetes cluster and sending notifications to your messaging platform or any other configured sink. In this blog post, we will configure BotKube to watch cert-manager's Certificate custom resources.

What is BotKube?

BotKube is a messaging tool for monitoring and debugging Kubernetes clusters. It integrates with messaging platforms such as Slack, Mattermost, and Microsoft Teams to help you monitor your Kubernetes cluster(s), debug critical deployments, and get recommendations for standard practices by running checks on Kubernetes resources.

Top SRE Toolchain Used by Site Reliability Engineers

Introduction

Site reliability engineering (SRE) practices help organizations keep their deliverables running smoothly, reliably, and resiliently. Following SRE best practices takes a set of well-defined tools deployed at every phase of the production system.

This blog post lists the top tools in the SRE toolchain and their significance in ensuring the reliability of the architecture.

So You Want an SRE Tool. Do You Build, Buy, or Open Source?

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?

This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different benefits. We’ll help you decide which solution is your best fit by breaking down the pros and cons. In this blog post, we’ll cover: