Systematic and Chaotic Testing: A Way to Achieve Cloud Resilience

In today’s digital era, where downtime can mean a full shutdown, building resilient cloud infrastructure is imperative. During the pandemic, for example, IT maintenance teams could no longer be on premises to reboot a failed server in the data center. When on-premises hardware goes down, access to data and software is blocked, productivity stalls, and the business takes a loss. The solution is to move IT operations to cloud infrastructure, which keeps systems secure and available with round-the-clock support from remote teams. In that sense, the cloud acts as a safety net.

Companies are now leaning heavily on the cloud, and as a result, observability and resilience of cloud operations have become imperative: downtime now equates to disconnection and business loss.

eBPF: Observability with Zero Code Instrumentation [Video]

Current observability practice is largely based on manual instrumentation, which requires adding code at relevant points in the application’s business logic to generate telemetry data. This can become quite burdensome and creates a barrier to entry for teams wishing to implement observability, especially in Kubernetes environments and microservices architectures.
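To make that burden concrete, here is a minimal, hypothetical sketch of what manual instrumentation typically looks like with the OpenTelemetry Python API; the function and attribute names are illustrative, not taken from the article, and this is exactly the per-function code that kernel-level approaches like eBPF aim to make unnecessary.

```python
# A minimal, hypothetical sketch of manual instrumentation with the
# OpenTelemetry Python API. The function and attribute names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def charge_order(order_id: str, amount_cents: int) -> None:
    # Every business-logic function that should emit telemetry needs
    # explicit code like this span block, added and maintained by hand.
    with tracer.start_as_current_span("charge_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount_cents", amount_cents)
        ...  # actual payment logic goes here
```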

eBPF is an exciting technology for Linux kernel-level instrumentation, which bears the promise of no-code instrumentation and easier observability into Kubernetes environments (alongside other benefits for networking and security).

Developer’s Guide to Building Notification Systems: Part 4 – Observability and Analytics

This is the fourth and final post in our series on how you, the developer, can build or improve your company’s notification system. It follows the first post about identifying user requirements, the second about designing with scalability and reliability in mind, and the third about setting up routing and preferences. In this piece, we will learn about using observability and analytics to set your system and company up for success.

Developing an application can often feel like you're building in the dark. Even after development, gathering and organizing performance data is invaluable for ongoing maintenance. This is where observability comes in—it’s the ability to monitor your application’s operation and understand what it’s doing. With close monitoring, observability is a superpower that allows developers to use various data points to foresee potential errors or outages and make informed decisions to prevent these from occurring.
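As a rough illustration of the data points involved, the sketch below instruments a hypothetical notification send path with the prometheus_client library; the metric names, labels, and the deliver() helper are all illustrative assumptions, not part of the series.

```python
# A hypothetical sketch of instrumenting a notification send path with the
# prometheus_client library. Metric names, labels, and deliver() are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

NOTIFICATIONS_SENT = Counter(
    "notifications_sent_total",
    "Notifications sent, by channel and outcome",
    ["channel", "outcome"],
)
SEND_LATENCY = Histogram(
    "notification_send_seconds",
    "Time spent handing a notification to the provider",
    ["channel"],
)

def deliver(channel: str, payload: dict) -> None:
    """Stand-in for the real provider call (email, SMS, push, ...)."""

def send_notification(channel: str, payload: dict) -> None:
    start = time.perf_counter()
    try:
        deliver(channel, payload)
        NOTIFICATIONS_SENT.labels(channel=channel, outcome="ok").inc()
    except Exception:
        NOTIFICATIONS_SENT.labels(channel=channel, outcome="error").inc()
        raise
    finally:
        SEND_LATENCY.labels(channel=channel).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper to collect
    send_notification("email", {"to": "user@example.com", "body": "hi"})
```

Counting outcomes and timing deliveries like this is what turns "the system feels slow" into questions you can actually answer from data.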

Extending Apache SkyWalking With Non-Breaking Breakpoints

Non-breaking breakpoints are breakpoints specifically designed for live production environments. With non-breaking breakpoints, reproducing production bugs locally or in staging is conveniently replaced with capturing them directly in production.

Like regular breakpoints, non-breaking breakpoints can be:

Systems Observability

What Is Observability?

According to Wikipedia: "Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In control theory, the observability and controllability of a linear system are mathematical duals."

In simple words, it is how the system describes its internal state through its external outputs. 

Getting Started With Observability for Distributed Systems

To net the full benefits of a distributed system, applications' underlying architectures must achieve various company-level objectives including agility, velocity, and speed to market. Implementing a reliable observability strategy, plus the right tools for your specific business requirements, will give teams the insights needed to properly operate and manage their entire distributed ecosystem on an ongoing basis.

This Refcard covers the three pillars of observability — metrics, logs, and traces — and how they not only complement an organization's monitoring efforts but also work together to help profile, interpret, and optimize system-wide performance.
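As a rough sketch of how the three pillars work together, the hypothetical example below emits all three signals around a single operation and uses the trace ID to tie the log line back to the span; every name in it is illustrative.

```python
# A hypothetical sketch emitting all three pillars around one operation:
# a span (trace), a log line carrying the trace ID, and a counter (metric).
# All names are illustrative.
import logging
from opentelemetry import trace
from prometheus_client import Counter

logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout")
CHECKOUTS = Counter("checkouts_total", "Completed checkouts")

def checkout(cart_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:          # trace
        trace_id = format(span.get_span_context().trace_id, "032x")
        logger.info("checkout started cart=%s trace_id=%s",          # log
                    cart_id, trace_id)
        # ... business logic ...
        CHECKOUTS.inc()                                               # metric
```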

Getting Started With OpenTelemetry

With the growing number of mass migrations to the cloud, OpenTelemetry addresses new challenges by simplifying and automating telemetry data collection. OpenTelemetry is an open-source collection of tools, APIs, SDKs, and specifications whose purpose is to standardize how telemetry data is modeled and collected. It has proven effective for observability and aims to become the standard way to implement it.

In this Refcard, we introduce core OpenTelemetry architecture components, key concepts and features, and how to set up for tracing and exporting telemetry data.
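For a flavor of that setup, here is a minimal, hypothetical configuration of the OpenTelemetry Python SDK that wires up a tracer provider and exports spans; it prints spans to the console, whereas an OTLP exporter pointed at a collector would be the more typical production choice.

```python
# A minimal, hypothetical tracing setup with the OpenTelemetry Python SDK,
# exporting spans to the console; an OTLP exporter pointed at a collector
# would be the more typical production choice.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "demo-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("startup-check"):
    pass  # any span created in this process is now exported
```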

Observability: It’s Not What You Think

What Is Observability?

Observability is a mindset that enables you to answer any question about your entire business through the collection and analysis of data. If you ask other folks, Observability is the dry control theory definition of “monitoring the internal state of a system by looking at its output,” or it’s the very technical definition of “metrics, traces, and logs.” While these are correct, Observability isn’t just one thing you implement, then proudly declare “now this system has Observability™.” Building Observability into your business lets you answer questions about your business.

What Kind of Questions?

Of course, the basic “what happened in our app when this error count spiked up” questions can be answered with Observability tools, but that’s barely scratching the surface of what Observability actually is. What an Observability mindset lets you do is figure out why the error count spiked up. If you’re intimately familiar with your app and all of its dependencies, then perhaps you can get this insight from a monitoring system, but as modern apps become increasingly complex, keeping their state in your head becomes more and more challenging. Business demands, feature launches, A/B tests, refactoring into microservices… all of these combine to create ever-increasing entropy, so knowing everything about your system without help gets harder by the day.
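One hypothetical way to put that mindset into practice is to record rich, high-cardinality context on every unit of work, so an error spike can later be sliced by customer, build, or feature flag rather than guessed at; the sketch below does this with OpenTelemetry span attributes, and all names in it are illustrative.

```python
# A hypothetical sketch of that mindset: attach high-cardinality context
# (customer, build, feature flag) to every unit of work so an error spike
# can later be grouped by any of these dimensions. Names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("orders")

def handle_request(customer_id: str, build_id: str, flags: dict) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("deploy.build_id", build_id)
        span.set_attribute("flags.new_checkout", bool(flags.get("new_checkout")))
        try:
            ...  # business logic
        except Exception as exc:
            span.record_exception(exc)
            span.set_attribute("error", True)
            raise
```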

Services Don’t Have to Be Eight-9s Reliable [Video]

Viktor: You are known in the observability community and SRE community very well. I’ve followed your work for a while during my time at Confluent, so I’m super excited to speak with you. Can you please tell us a little bit about yourself? Like what do you do? And what are you up to these days?

Liz: Sure. So I’ve worked as a site reliability engineer for roughly 15 years, and I took this interesting pivot about five years ago. I switched from being a site reliability engineer on individual teams like Google Flights or Google Cloud Load Balancer to advocating for the wider SRE community. It turns out that there are more people outside of Google practicing SRE than there are inside of Google practicing SRE.

Prometheus Definitive Guide: Prometheus Operator

In this blog post, we will focus on how to install and manage Prometheus on a Kubernetes cluster using the Prometheus Operator and Helm. Let’s get started!

What Is an Operator?

Before moving on to installing Prometheus with the Prometheus Operator, let’s first go over some of the key concepts needed to understand the Operator.
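To give a sense of the model, here is a hypothetical sketch of what managing Prometheus through the Operator looks like: rather than editing Prometheus configuration directly, you create custom resources, such as a ServiceMonitor, that the Operator watches and turns into scrape configuration. The example uses the official Kubernetes Python client and assumes the Operator’s CRDs are already installed; the names and namespace are illustrative.

```python
# A hypothetical sketch of managing scrape targets through the Operator:
# instead of editing Prometheus configuration, you create custom resources
# (here, a ServiceMonitor) that the Operator watches. Assumes the Operator's
# CRDs are installed and a kubeconfig is available; names are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod

service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "demo-app", "labels": {"release": "prometheus"}},
    "spec": {
        "selector": {"matchLabels": {"app": "demo-app"}},
        "endpoints": [{"port": "metrics", "interval": "30s"}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="monitoring.coreos.com",
    version="v1",
    namespace="monitoring",
    plural="servicemonitors",
    body=service_monitor,
)
```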

Two Keys to Agile Transformation

Enterprise agility has rapidly become one of the most crucial variables for a business’s long-term resiliency. With the COVID-19 pandemic, never-ending disruptions to the global supply chain, and nearly every industry’s typical processes flipped upside down, there has never been a more important time to prioritize agility than now.

While the rewards of agility are high, organizations are struggling to implement key technologies and adhere to the methodologies that can get them there. In fact, the 14th Annual State of Agile Report notes that 59 percent of organizations are following agile principles, but only 4 percent of these organizations are getting the full benefit. But the ability to adapt quickly, seize new opportunities, and reduce costs is critical for survival in the ever-evolving and hypercompetitive digital age.

Trace-Based Testing with OpenTelemetry: Meet Open Source Malabi

By Yuri Shkuro, creator and maintainer of Jaeger, and the Co-Founder & CTO of Aspecto.


If you deal with distributed applications at scale, you probably use tracing. And if you use tracing data, you already realize its crucial role in understanding your system and the relationships between system components, as many software issues are caused by failed interactions between these components.
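Malabi itself targets JavaScript services, so the following is only a language-agnostic, hypothetical illustration of the trace-based testing idea: run the code under test, then assert on the spans it produced, here collected with the OpenTelemetry Python SDK’s in-memory exporter. All names are illustrative.

```python
# Malabi targets JavaScript services; this is only a language-agnostic,
# hypothetical illustration of trace-based testing in Python: run the code
# under test, then assert on the spans it produced, here collected with the
# OpenTelemetry SDK's in-memory exporter. All names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-test")

def create_order() -> None:  # stand-in for the code under test
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.items", 3)

def test_create_order_emits_expected_span() -> None:
    create_order()
    spans = exporter.get_finished_spans()
    assert any(
        s.name == "create_order" and s.attributes.get("order.items") == 3
        for s in spans
    )

test_create_order_emits_expected_span()
```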

Understanding How CALMS Extends To Observability

As our observability and DevOps practices continue to converge, a similar alignment happens with our frameworks and goals. In fact, when one considers that observability is about richer and deeper data from our environments, it is evident that the frameworks aren’t changing so much as adapting to faster insights.

Let’s take a look at CALMS. CALMS (Culture, Automation, Lean, Measurement, Sharing) was created by Jez Humble and is meant as a method of assessing how an organization is adapting to DevOps practices. However, as we add observability, CALMS can extend to our observability practice as well.

Prometheus Blackbox: What? Why? How?

Introduction

Today, Prometheus is used widely in production by organizations. In 2016, it was the second project to join CNCF and, in 2018, the second project to be graduated after Kubernetes. As the project has seen a growing commercial ecosystem of implementers and adopters, a need has emerged to address specific aspects already implemented in older monitoring tools like Nagios. Blackbox service testing is one of them.

What Is Prometheus Blackbox?

As everyone knows, Prometheus is an open-source, metrics-based monitoring system. Prometheus does one thing, and it does it well. It has a powerful data model and a query language to analyze how applications and infrastructure perform.
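Blackbox service testing adds probing from the outside to that picture. As a hypothetical sketch, the snippet below asks a locally running Blackbox exporter (default port 9115, with an http_2xx module configured, both assumptions) to probe a target and prints the resulting probe metrics, which Prometheus would normally scrape via the same /probe endpoint.

```python
# A hypothetical check against a locally running Blackbox exporter (default
# port 9115, with an http_2xx module configured): ask it to probe a target
# from the outside and print the resulting metrics, which Prometheus would
# normally scrape via the same /probe endpoint.
import urllib.parse
import urllib.request

target = "https://example.com"
params = urllib.parse.urlencode({"target": target, "module": "http_2xx"})
url = f"http://localhost:9115/probe?{params}"

with urllib.request.urlopen(url, timeout=10) as resp:
    body = resp.read().decode()

for line in body.splitlines():
    if line.startswith(("probe_success", "probe_duration_seconds")):
        print(line)  # e.g. "probe_success 1" when the target is reachable
```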

Observability: Let Your IDE Debug for You

Current events have brought an even stronger push by many enterprises to scale operations across cloud-native and distributed environments. To survive and thrive, companies must now seriously look at cloud-native technologies—such as API management and integration solutions, cloud-native products, integration platform as a service (iPaaS), and low-code platforms—that are easy to use, accelerate time to market, and enable re-use and sharing. However, due to their distributed nature, these cloud-native applications have a higher level of management complexity, which increases as they scale.

Building observability into applications allows teams to automatically collect and analyze data about applications. Such analysis allows us to optimize applications and resolve issues before they impact users. Furthermore, it significantly reduces the debugging time of issues that occur in applications at runtime. This allows developers to focus more on productive tasks, such as implementing high-impact features. 

Five Ways Developers Can Help SREs

It is not easy to be a Site Reliability Engineer (SRE). Monitoring system infrastructure and aligning it with key reliability metrics is a daunting task. A software engineer’s job, by contrast, is to deliver high-quality software.

Relationships between software engineers and site reliability engineers can sometimes be tricky. To begin with, developers are generally assigned to write code that goes into production. Then, there are SREs who are responsible for improving the product's reliability and performance. 

The Role of SREs in Observability

How do you achieve observability, which means the ability to understand the internal state of a system based on external outputs?

The most obvious answer to that question is to deploy observability tools, which can collect and correlate data from multiple sources to provide visibility into the internal state of a system.

Progressive Delivery: A Detailed Overview

Every developer has been there before: You release a new feature expecting a smooth ride, only to have something go awry in the back end at the last minute. 

An event like this can derail a launch, letting down customers and leaving you scratching your head wondering what went wrong.