This Is the Most Underappreciated Skill for SREs

Delivering great software and sustainable systems is a team sport. Without the support of all stakeholders, adoption initiatives often fail. In successful initiatives, SREs are responsible for bringing together all resources and team members to help resolve reliability-related issues.

But bringing these resources together takes much more effort than people think. SREs do a lot of glue work to make these collaborative efforts happen. Glue work refers to tasks that are essential to a project’s success, even if they don’t contribute to the codebase.

The Secret of Communicating Incident Retrospectives

In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.

One of the most valuable tools in sharing these lessons is the incident retrospective or postmortem. These documents are built after the incident response process and reviewed in internal meetings. Sometimes an edited version is shared with external stakeholders. In this blog post, we’ll show how to coordinate incident retrospectives across different stakeholder groups, how to cultivate a culture of blamelessness during the process, and how to drive change from key findings.

Little Known Ways to Better Use Your Error Budgets

One of the most versatile and foundational SRE tools is the SLO, or service level objective. The SLO is a threshold set for key reliability metrics. When incidents push the metric over the threshold, a response launches to prevent further damage. Conversely, as long as you meet your SLO, you can continue to ship new code. The space you have before you breach this threshold is the error budget. When evaluating new developments, you can judge if the error budget can accommodate the potential risk of unreliability.
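
As a rough illustration of the arithmetic described above, here is a minimal Python sketch. It assumes a request-based SLO, where the budget is the number of failed requests you can tolerate in a measurement window. The function names and the figures in the usage note are hypothetical, chosen only for illustration.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Failed events allowed in the window while still meeting the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% success objective.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo_target, total_events) - failed_events


def can_absorb(slo_target: float, total_events: int, failed_events: int,
               expected_new_failures: int) -> bool:
    """True if the remaining budget covers the estimated risk of a change."""
    remaining = budget_remaining(slo_target, total_events, failed_events)
    return remaining >= expected_new_failures
```

For example, with a 99.9% target over 1,000,000 requests, the budget is roughly 1,000 failed requests. If 400 have already failed, a change expected to cause about 500 more failures would fit within the budget, while one expected to cause 700 would not.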

We generally think of the error budget as a tool for developers. It helps them understand tradeoffs between development velocity and reliability. But error budgets can be helpful to many roles throughout the organization. In this blog post, we’ll look at how error budgets can help teams across the organization, such as QA, legal, executives, and more. We’ll also look at ways engineers can use error budgets beyond development planning.

Four Ways SRE Helps New Employees Onboard

Onboarding is an essential yet challenging part of the hiring process. As your organization matures, more of its processes become unique. This makes it harder for new employees to get up to speed. Investing in custom processes and tooling to achieve your specific goals is a valuable practice. But you must balance this with an investment in onboarding.

Fortunately, an investment in SRE is also an investment in onboarding, as one of the important goals of SRE is to help democratize context across software teams. At first, SRE may seem like an area with a steep learning curve. The diversity of skills expected of the SRE role can make it difficult to hire for. However, these skills help broaden engineers’ abilities and their understanding of their organization’s systems. The SRE mentality can provide insights into many areas, including onboarding itself.