Observability Recipes

What Is Observability?

Observability is the ability to draw valid conclusions about what is currently happening in a system and why it is happening.

Guiding Principles for Observability 

  1. The context and sequential flow of each end-to-end request matter most. We need to be able to see what is having an issue, which other parts are or might be affected, and what the issues have in common when things go wrong.
  2. We must be able to cut the data in many ways and correlate the different aspects of a request (e.g. filter by user, session, or server node, and combine any of these with other attributes).
  3. Use questions to drive the features required for observability instead of relying on what we can already see.
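The second principle can be illustrated with a minimal Python sketch. It assumes each request is captured as one record; the record type, its field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEvent:
    """One record per request; every field name here is illustrative."""
    request_id: str
    user: str
    session: str
    node: str
    status: int
    duration_ms: float


def select(events, **criteria):
    """Return the events whose attributes match every given criterion."""
    return [
        e for e in events
        if all(getattr(e, name) == wanted for name, wanted in criteria.items())
    ]


# Made-up sample data.
EVENTS = [
    RequestEvent("r1", "alice", "s1", "node-a", 200, 35.0),
    RequestEvent("r2", "alice", "s1", "node-b", 500, 910.0),
    RequestEvent("r3", "bob", "s2", "node-b", 500, 870.0),
    RequestEvent("r4", "bob", "s2", "node-a", 200, 41.0),
]

# Which requests failed, and what do they have in common?
failed = select(EVENTS, status=500)
print(sorted({e.node for e in failed}))

# Any attribute can be combined with any other.
print(select(EVENTS, user="alice", node="node-b"))
```

Because the filter accepts any combination of attributes, the same data can be cut per user, per session, per node, or by any mix of them without deciding the question in advance.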

Observability Components

Component: What it means

Metrics: Metrics are numeric values that help evaluate a service's overall behavior over time. They comprise a set of data points from which a system's performance can be derived.

Typical examples are:
  • uptime
  • response time
  • requests per second
  • CPU/RAM utilisation

Events: An event is a collection of data points about what it took to complete a unit of work. Events are records of selected significant points in time, with metadata to provide context.

Typical examples are:
  • change of a workflow status
  • batch job completion

Logs: Logs are important for troubleshooting and for understanding a problem. They provide detailed data and context so that one can re-create and diagnose a problem.

Typical examples are:
  • application logs
  • server logs
  • error logs
  • debug logs

Traces: Traces show the step-by-step journey of a request or action as it moves through the system. They give specific insight into the flow and help identify errors and find bottlenecks so that these can be rectified and optimised.

Visualisation: Data needs to be presented in a visual, easy-to-comprehend way that allows it to be correlated and connections to be drawn between the different data points and events occurring in the system. This provides context that is otherwise not easily identifiable by looking at individual metrics alone.
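To make the distinction between these components concrete, here is a minimal Python sketch in which a single request produces a metric, a log line, trace spans, and an event, all correlated by one trace ID. It uses only the standard library; the function names, the in-memory stores, and the order-handling scenario are assumptions made for illustration and do not reflect any particular tool. Visualisation is out of scope for the sketch.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # metrics: numeric values aggregated over time
events = []           # events: significant points, with metadata
logs = []             # logs: detailed records for troubleshooting
spans = []            # trace: the timed steps of one request


def log(trace_id, level, message):
    logs.append({"trace_id": trace_id, "level": level, "message": message})


def traced_step(trace_id, name, func):
    """Run one step of a request and record it as a span."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_order(order_id):
    """Hypothetical unit of work that emits all four signal types."""
    trace_id = uuid.uuid4().hex   # shared key that correlates the signals
    metrics["requests_total"] += 1
    log(trace_id, "INFO", f"received order {order_id}")
    traced_step(trace_id, "validate", lambda: order_id > 0)
    traced_step(trace_id, "persist", lambda: time.sleep(0.01))
    events.append({"trace_id": trace_id, "type": "order_status_changed",
                   "order_id": order_id, "new_status": "CONFIRMED"})
    return trace_id


trace_id = handle_order(42)
# The trace lists the steps of this request in the order they ran.
print(json.dumps([s["name"] for s in spans if s["trace_id"] == trace_id]))
```

The shared trace ID is the design choice that matters here: it is what lets a log line, a span, and an event be tied back to the same request later.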


OpenTelemetry: A Way to Achieve Observability

We all understand that proper data analytics is crucial to the success of an organization. But what if your analytics could do more than help you troubleshoot current problems? Splunk is building a future where data analytics proactively solves problems before they occur.

Data is essential to success and innovation for modern organizations. However, no commercial vendor has an effective single instrument or tool to collect data from all of an organization’s applications.

4 Key Observability Metrics for Distributed Applications

A common architectural design pattern these days is to break up an application monolith into smaller microservices. Each microservice is then responsible for a specific aspect or feature of your app. For example, one microservice might be responsible for serving external API requests, while another might handle data fetching for your frontend. 

Designing a robust and fail-safe infrastructure in this way can be challenging; monitoring the operations of all these microservices together can be even harder. It's best not to simply rely on your application logs for an understanding of your systems' successes and errors. Setting up proper monitoring will provide you with a more complete picture, but it can be difficult to know where to start. In this post, we'll cover service areas your metrics should focus on to ensure you're not missing key insights.

5 Essential Diagnostic Views to Fix Hive Queries

A perpetual debate rages about the effectiveness of the modern-day data analyst in a distributed computing environment. Analysts are used to SQL returning answers to their questions in short order. The RDBMS user is often unable to comprehend the root cause when queries don’t return results for multiple hours. Opinions are divided, despite broad acceptance of the fact that query engines such as Hive and Spark are complex even for the best engineers. At Acceldata, we see full table scans run on multi-terabyte tables to get a count of rows, which, to say the least, is taboo in the Hadoop world. What results is a frustrating conversation between cluster admins and data users, one that lacks supporting data because that data is hard to collect. It is also a fact that data needs conversion into insights to make business decisions. More importantly, the value in Big Data needs to be unlocked without delays.

From here, we start at the point where the Hadoop admin/engineer is ready to unravel the scores of metrics and interpret the reasons for poor performance and for resources being taken away from the cluster, causing:

Monitoring Velero Backup and Restore With BotKube

One of the key challenges for Kubernetes Day 2 operations is observability, i.e. having a holistic view of your system’s health. This is where BotKube helps to improve your monitoring experience of your Kubernetes clusters by sending notifications to supported messaging platforms. BotKube helps you solve several interesting use cases, for example, monitoring Velero backup failure or certificate issue/expiry status by cert-manager, etc. In this blog, we will configure BotKube to monitor your Velero backups and restores.

What Is BotKube?

BotKube is a messaging tool for monitoring and debugging Kubernetes clusters. BotKube can be integrated with multiple messaging platforms like - Slack, Mattermost, or Microsoft Teams to help you monitor your Kubernetes cluster(s), debug critical deployments, and gives recommendations for standard practices by running checks on the Kubernetes resources. — BotKube website

How to Be An Effective Engineering Manager By Investing In The Right Tools

After working with a diverse set of software engineering teams, we at Moesif have gained a unique perspective on what traits enable engineers to take on leadership positions and become outstanding managers vs others who have a harder time rising through the ranks. Gaining an advanced title and responsibility is not an easy task. After all, there are far more software engineers than executives at a company regardless of size. So how do you stand out to land that awesome VP or C-suite role? To understand what makes a great leader, it’s best to understand from the perspective of the person driving the promotion.

The mindset of leadership

Engineering naturally applies logical reasoning to a set of problems or challenges, whether that is building and scaling a new data platform or gaining a leadership title. Engineers love to “grade” a system on metrics such as performance or defect rate. This can even manifest in orchestrated mass testing activities such as the familiar hackathon. Yet, the CEO or CTO probably doesn’t care how fast you can code an Uber clone or hook up some API. Even the usual textbook soft skills matter less. While it’s handy to have great communication and speaking skills, there are many executives who dread going up on a stage to present some eye candy. So what is it?

How to Show the Business Value of Your APIs with Embedded Metrics

When you’re providing APIs to your customers, you want to ensure they are getting value from them. At the same time, the best APIs are designed to be fully automated without requiring human intervention. This can leave your customers in the dark on whether your API is even being used by the organization and if you’re meeting any SLA obligations in your enterprise contracts.

Types of metrics to surface

Most API-first companies have some sort of developer portal where customers can log in, manage API keys, and customize features. This area is also a great place to expose key metrics that show customers how much value they are getting from your API. This can be as simple as a counter showing the number of API transactions made within a billing period, or it can include additional metrics about what those transactions are. Each customer has different metrics they want to look at. Developers will want to look at access logs, whereas product and engineering leadership are more interested in usage and performance metrics. Finally, the finance department may need to look at billing usage for capacity and financial planning.

What is API Observability

API Observability is a key component in properly executing APIOps Cycles and ensuring you’re building something of value for your API users. If you’re not familiar with APIOps Cycles, take a look at this guide, which provides an agile framework to quickly build APIs that are business-oriented and serve customer needs. API Observability itself is an evolution of traditional monitoring and was born out of control systems theory.

Traditional monitoring focuses on tracking known unknowns. This means you already know what to measure, such as requests per second or errors per second. While the metric value may be unknown beforehand, you already know what to measure or probe, for example with a counter that tracks requests in buckets. This makes it possible to report on the health of a system (like Red, Yellow, Green), but it is a poor tool for troubleshooting engineering or business issues, which usually require asking arbitrary questions.
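The difference can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the request records, their field names, and the `ask` helper are hypothetical and are not part of any monitoring product.

```python
from collections import Counter

# Raw request records; field names and values are made up for illustration.
REQUESTS = [
    {"path": "/orders", "status": 200, "customer": "acme", "latency_ms": 40},
    {"path": "/orders", "status": 500, "customer": "acme", "latency_ms": 900},
    {"path": "/users", "status": 200, "customer": "globex", "latency_ms": 35},
    {"path": "/orders", "status": 500, "customer": "acme", "latency_ms": 950},
]

# Monitoring: the question is fixed in advance (requests per status class),
# so a pre-defined counter with buckets is enough to answer it.
status_buckets = Counter(f"{r['status'] // 100}xx" for r in REQUESTS)
print(status_buckets)


# Observability: keep the raw records so that new questions can be asked later.
def ask(records, predicate, key):
    """Count the records that match any predicate, grouped by any field."""
    return Counter(r[key] for r in records if predicate(r))


# A question nobody planned for: which customers are hit by server errors?
print(ask(REQUESTS, lambda r: r["status"] >= 500, "customer"))
```

The bucketed counter can report health at a glance, but once the raw records are discarded it cannot say which customer or path is affected; the second approach keeps that option open at the cost of storing more data.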

How AIOps Helps in Application Monitoring

There’s no one-size-fits-all approach regarding application monitoring, especially for companies using applications in various cloud environments. Companies are rapidly investing in microservices, mobile apps, data science programs, data ops, etc. Subsequently, they’re also integrating monitoring tools to improve domain-centric monitoring abilities.

AIOps tools help streamline the use of monitoring applications. They allow companies that depend heavily on application services to efficiently manage the complexities of IT workflows and monitoring tools. AIOps extends machine learning and automation abilities to IT operations. These robust technologies aim to detect vulnerabilities and issues so that they can be resolved, determine operational trends, and simplify the remediation of the problems that affect applications’ performance and availability.

Does Observability Throw You for a Loop?

Our new mantra for managing and maintaining the health and functionality of our apps and environments is observability. Observability is the quality of software, services, platforms, or products that allows us to understand how systems are behaving. Without the new sources of data giving us insights, our modern cloud-native applications would be quite a challenge to monitor. Observability, that deep data, is the new fuel for our developers and DevOps engineers.

The dual of observability is controllability. Observability is the ability to infer the internal state of a 'machine' from externally exposed signals. Controllability is the ability to control input to direct the internal state to the desired outcome. While driving, observing a red stoplight means controlling our vehicle by pressing the brakes (or in some modern vehicles, having the brakes applied automatically for us).
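For readers who want the control-theory background, the two notions have a standard formal statement. The block below assumes a linear time-invariant system with state x, input u, and output y; that model is the usual textbook setting and is an assumption added here, not something the text itself specifies.

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^n \\[4pt]
\text{observable} &\iff \operatorname{rank}\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n \\[4pt]
\text{controllable} &\iff \operatorname{rank}\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n
\end{aligned}
```

The duality is visible in the two rank conditions: the pair (A, C) is observable exactly when the pair (Aᵀ, Cᵀ) is controllable.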

Remote Debugging: What It Means for Java Applications

Following the lingering promise of managed infrastructure, reduced operational cost, and resiliency, cloud computing has seen phenomenal adoption over the past decade. As software development marches towards the cloud, we soon realize that this shift warrants rethinking our debugging strategies. As software systems leverage these advancements in cloud computing and distributed systems, we see gaps emerging in debugging that cannot be filled by the traditional methods of logging and breakpoints.

For example, a major issue while using breakpoints is that the codebase needs to be run in debug mode. Therefore, we are not actually replicating the actual state of our systems taking into consideration multi-threading, distributed services, and dependencies on remote services in a cloud-native environment along with multi-service architecture. Similarly, logs offer no respite, as they may be cumbersome and even costly to execute and store.

Monitoring Kubernetes cert-manager Certificates With BotKube

The monitoring and alerting stack is a crucial part of the SRE practices. That’s where BotKube helps you monitor your Kubernetes cluster and send notifications to your messaging platform or any other configured sink. In this blog post, we will be configuring BotKube to watch the Kubernetes cert-manager certificates CustomResources.

Security and GitOps

As we all know and firmly believe, applications and infrastructures need to be secured, but the shipping processes of this whole ecosystem also need to be.

In a previous article, we introduced GitOps as a methodology to improve the velocity of the development and the management of an entire infrastructure. But there are many other benefits from GitOps, and one of them is the potential improvement of security.

Survivorship Bias in Observability

During World War II, a mathematician named Abraham Wald worked on a problem –  identifying where to add armor to planes based on the aircraft that returned from missions and their bullet puncture patterns. The obvious and accepted thought was that the bullets represented the problem areas for the planes. Wald pointed out that the problem areas weren’t actually these areas, because these planes survived. He found that the missing planes had unknown data, indicating other problem areas existed. In fact, the pattern for the surviving planes showed the areas that weren’t problematic.

By McGeddon - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=53081927

Containerized 5G Infrastructure Visibility

Cloud native and containerized architectures are becoming the de facto design standard for 5G networks and applications. In the telecommunications industry, the players are focused on building out 5G Stand Alone (SA) deployments to deliver the promise of faster connection speeds to enable IoT, medical, and autonomous use cases – not to mention improved communications, support for streaming real-time content, and the promise of myriad new applications and services. In working with Tier 1 operators, MVNOs, and analytics providers, we are encountering a staggering issue: they can no longer adequately monitor, correlate, and measure critical network and application communication events at the container level and across the infrastructure.

As we have illustrated through our demonstrations and proof of concept deployments of our Containerized Visibility Fabric (CVF) with telco and related technology suppliers, the most common phrases we’re hearing during the engagements are:

Monitoring vs. Observability

The IT sector has become far more complex in recent times: more environments, more connected devices, more data, and more updates. As such, the legacy methods used to monitor modern distributed applications and to manage predictable failures no longer work optimally. Monitoring is a crucial factor in growth and in keeping pace with the challenges that technology brings.

Observability tends to streamline complexities. To efficiently diagnose and debug code, the system must be observable along the lines of its microservices architecture. But what makes this new IT buzzword different from monitoring?

Common Use Cases for Observability With AIOps

“We can't build tomorrow using yesterday's tools.” - Scott McDonald

IT infrastructures have been evolving constantly and rapidly, along with Big Data. Businesses worldwide are moving from predictable and static physical systems to intuitive software resources that can reconfigure and adapt based on consumer behaviours.