Achieving Observability: How to Address the Unknown Unknowns in Your Application

22nd Oct 2019

Imagine if nature documentaries were composed solely of stationary footage shot in the wild. 

As a viewer, you would likely catch a few glimpses of wildlife and may even be able to see some interesting animal behavior, but your ability to gain any understanding of the ecosystem would be severely stunted. That’s why there are many more tools (and people) involved in the filming and editing process for shows like National Geographic. 

The goal of their documentaries is not to monitor the animals, but to provide observability into the wider ecosystem, specifically for educational purposes.

That’s not to say that monitoring isn’t important. In fact, that’s most likely how nature documentaries begin. First, the crew needs to be aware that something is happening, then they can collect shots with more details and different angles. By the time you’re watching at home, there’s been extensive editing and narrating added in for complete context.

Here’s the catch – when the crew detects action in the field (via monitoring), they need to be there ready to capture the full story to share with their audience. They can’t be out in the field without their equipment and expect to run back and forth in time to catch the action.

When it comes to our applications, the question is – how helpful is monitoring if you aren’t proactively capturing the data you need to understand what’s causing the errors you find? The answer… Not very. That’s why it’s important for us to go beyond monitoring and work on expanding our capabilities in terms of observability.

Observability 101

The concept of observability was first introduced by Hungarian-American engineer Rudolf E. Kalman for the field of linear dynamic systems. “In control theory, it’s a measure for how well internal states of a system can be inferred by knowledge of its external outputs.”

When applied to the world of software, observability is how well you can understand what’s going on inside your application based on its accessible outputs. Outside the realm of errors and application failures, observability might describe how well a developer understands why a certain feature works the way it does.

This example shows that the metrics and context needed to attain what we can call a “reasonable” level of observability depend on the requirements of the individual system. We wouldn’t necessarily expect a developer at one company to understand how the features of another company’s application work behind the scenes. Likewise, an e-commerce application and a healthcare system will require different metrics to be observable.

In order to improve observability, we need to leverage 3rd-party tools to provide additional, relevant data that enables us to understand and resolve any unexpected application behavior.

Bringing Observability to the Team

Every company that operates software has at least the most basic level of observability thanks to our long-time friends: log files.

Unfortunately, in most cases we don’t know ahead of time what’s going to break, which means the data we need never gets logged. Occasionally, we may have a hunch that a certain method is more likely to fail, but even then, the data that’s written to the logs is usually shallow and requires additional investigation and context.

A key measure of observability is how well you can answer “why did this happen?” without needing to ask additional questions. Teams that rely heavily on log files to understand and troubleshoot issues generally have low observability, because logs often lead to hours, if not days, of follow-up questions. To improve observability, it’s important to focus on proactive solutions.
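As a rough illustration of what “proactive” can mean in code, here is a minimal Python sketch that records a call’s arguments and the local variables of the failing frame at the moment an exception is raised, rather than relying on log statements written in advance. The decorator name `capture_context`, the logger name, and the `apply_discount` function are all hypothetical, made up for this example; this is not how any particular tool works, only one possible way to attach context to an error.

```python
import functools
import json
import logging

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("observability_sketch")


def capture_context(func):
    """Log the call's arguments and the failing frame's locals when func raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Walk to the innermost frame, where the error was actually raised.
            tb = exc.__traceback__
            while tb.tb_next is not None:
                tb = tb.tb_next
            record = {
                "function": func.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "error": f"{type(exc).__name__}: {exc}",
                "locals": {k: repr(v) for k, v in tb.tb_frame.f_locals.items()},
            }
            logger.error(json.dumps(record))
            raise  # re-raise: the sketch adds context, it never swallows errors

    return wrapper


@capture_context
def apply_discount(price, percent):
    # Hypothetical business function, for illustration only.
    factor = 1 - percent / 100
    if factor < 0:
        raise ValueError("discount exceeds 100%")
    return price * factor
```

Calling `apply_discount(100, 150)` still raises `ValueError`, but the log entry now also carries `price`, `percent` and `factor` as they were when the failure happened, which answers some of the follow-up questions up front. A real system would also need to think about sensitive data and the cost of serializing large objects.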

Remember how our friends from National Geographic had all of their equipment with them already? If you have to run back to camp to grab your camera, write more log statements and redeploy your application… You still have some way to go to reach a reasonable level of observability.

Organizations, especially at the enterprise level, use much more than log files to collect data around application errors and slowdowns. Application Performance Monitoring (APM) tools also play a significant part in most teams’ tooling stacks. Traditional APM tools provide additional context by helping teams identify when and where their application is experiencing performance or availability issues.

These tools, and more, are commonly used by companies to monitor application behavior and to investigate the root cause of issues. Each tool that provides additional context into the internal function of the application increases our observability. Once your tool stack provides the full context needed to understand the root cause of any issue, you’ve reached a reasonable level of observability.

Creating a Culture that Supports Observability

Just like with every other buzzword in the industry – CI/CD, DevOps, etc. – achieving a reasonable level of observability means cultivating a culture that values and supports such a goal. There are several layers that need to be addressed and integrated together, but let’s break it down.

The first step is to create a proactive mindset towards system hygiene. This is a concept that’s worth revisiting from our recent post by Pierre Bouchard, in which he discusses how to instill a quality-focused mentality across teams. He writes, “from beginning to end, all members should be trained to build and deploy code with quality in mind or there is a high risk that every change will create heavy technical debt.”

When he talks more specifically about the concept of code hygiene, Bouchard refers to a mindset that goes beyond just the desire to write high-quality code. It involves the planning process and understanding the purpose behind code changes. It doesn’t mean going beyond our abilities to predict the future, but having the foresight to plan for the unpredictable as much as we can.

We all know that releasing code to production is like sending it out into the wild, and we can all do a better job of making sure we have all the equipment we need to understand its behavior. That equipment includes our mindset, which forces us to consider our goals much earlier in development by answering the following questions:

  • What are we trying to improve?
  • How can we measure success or failure?
  • What metrics do we need to measure this?
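To make the three questions above concrete, here is a small Python sketch in which a team writes its answers down as data before shipping, then checks an observed value against them. The `Goal` class, the `checkout_error_rate` metric name, the 1% threshold and the request counts are all hypothetical, chosen purely for illustration; your own goals and metrics will depend on your system.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """Answers to the three questions, written down before the code ships."""

    improving: str      # What are we trying to improve?
    metric: str         # What metric do we need to measure this?
    max_allowed: float  # How do we tell success from failure?

    def is_met(self, observed: float) -> bool:
        # Success means the observed value stays at or below the threshold.
        return observed <= self.max_allowed


def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


# Hypothetical goal and numbers, for illustration only.
checkout_goal = Goal(
    improving="checkout reliability",
    metric="checkout_error_rate",
    max_allowed=0.01,
)
observed = error_rate(failed=3, total=1000)
print(checkout_goal.metric, observed, checkout_goal.is_met(observed))
```

The point of the sketch is the order of operations: the goal and its threshold exist before any production data does, so the data needed to judge success is already being collected when something goes wrong.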

Once this has been established, radical transparency and open communication follow naturally and further enable improvements in this area. When standards are established and issues are clearly communicated, problem areas in development can more easily be identified and resolved.

Without the human aspect in place, no tool in the world can give an organization a reasonable level of observability.

Final thoughts

Observability is about more than just monitoring your system – it’s about understanding it.

In order to achieve a higher level of observability, tooling and culture are two major factors that should be addressed. There are many different ways to attain the best level of observability for your organization, so understanding what you need in order to be proactive is crucial.

OverOps helps teams achieve better observability by mapping and analyzing code at runtime to provide actionable insights into critical issues such as new or resurfaced errors. To learn more about the metrics that OverOps collects, check out this blog post or request a demo with one of our solution engineers. 

Do you have any other tips for improving observability? We’d love to hear about them in the comments below.

Tali is a content manager at OverOps covering topics related to software monitoring challenges. She has a degree in theoretical mathematics, and in her free time, she enjoys drawing, practicing yoga and spending time with animals.
