Metrics Matter: The 4 Types of Code-Level Data Harness Service Reliability Management Collects

All the data in the world means nothing if it’s not the right data. But when it comes to delivering reliable software and troubleshooting issues, what is the right data?

To answer this question, we created a framework that helps organizations pinpoint critical gaps in data and metrics that are holding them back on their reliability journeys. At the foundation of this framework is the concept of Continuous Reliability (CR), or the notion of balancing speed, complexity, and quality by taking a continuous, proactive approach to reliability across the SDLC. When it comes to CR, it’s not just about what data you can capture, but how you analyze and leverage it.

With increasingly complex systems and ever-growing expectations for digital customer experiences, traditional tooling and the shallow data it provides are insufficient. To fully understand what’s going on inside your application and maintain stability, data must be collected at the code level.

One of the things that makes Harness Service Reliability Management (SRM) a powerful reliability tool is the way that we capture, analyze and present code-level data across the software delivery lifecycle. In this post, we’ll break down the four key types of data SRM captures and why they’re critical to advancing your journey toward Continuous Reliability.

1. Code Metrics

Capturing all the information about events occurring in your code is critical to deciphering which issues need to be addressed. Before you can effectively prioritize and fix critical code-level issues, you first need visibility into exactly which issues are occurring.

At the most basic level, SRM automatically captures 100% of events happening within your application in both test and production – even those missed by your logging framework or APM tools. This includes:

Logged errors and warnings
Uncaught and swallowed exceptions
Slowdowns and APM bottlenecks

With SRM, you no longer need to rely on logs and foresight into which events to capture, what to include in a log statement, or how to analyze it.
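To make the idea of a swallowed exception concrete, here is a minimal Python sketch. It is not how the SRM agent works (that instruments the JVM and .NET CLR); it only illustrates, using Python's standard `sys.settrace` hook, that a runtime can observe an exception the code catches and never logs. The names `parse_port`, `run_traced` and `captured` are hypothetical.

```python
import sys

captured = []  # (exception type, function name, line number) per raised exception

def _tracer(frame, event, arg):
    # The interpreter reports an "exception" event even when the code
    # later catches the exception and never logs it.
    if event == "exception":
        exc_type, _value, _tb = arg
        captured.append((exc_type.__name__, frame.f_code.co_name, frame.f_lineno))
    return _tracer  # keep tracing inside this frame

def parse_port(raw):
    try:
        return int(raw)
    except ValueError:
        return 8080  # swallowed: no log line, no re-raise

def run_traced(func, *args):
    # Install the trace hook only for the duration of the call.
    sys.settrace(_tracer)
    try:
        return func(*args)
    finally:
        sys.settrace(None)

port = run_traced(parse_port, "not-a-number")
```

The caller only ever sees the fallback value, yet `captured` holds a record of the `ValueError` raised inside `parse_port`. A log-based approach would have no trace of it.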

On top of detecting every event, SRM applies a layer of intelligence to automatically prioritize all events based on severity. That way, your team can focus on the issues that matter most.

SRM takes into account whether an error is new, when it was first and last seen, how many times it has occurred, and whether its rate has suddenly increased. It can then mark an error as severe based on criteria such as whether a new or increasing error is uncaught, or whether its volume and rate exceed a certain threshold. It compares events against established baselines and averages to pinpoint anomalies and notify DevOps and SRE teams of events that require immediate resolution.
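The criteria described above can be sketched as a simple rule. This is an illustration only, not SRM's actual scoring logic: the `ErrorStats` fields, the threshold values and the `is_severe` function are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class ErrorStats:
    is_new: bool          # first seen in the current release
    is_uncaught: bool     # escaped every handler
    count: int            # occurrences in the current window
    calls: int            # invocations of the code path in the same window
    baseline_rate: float  # historical failure rate for this error

# Hypothetical thresholds, chosen only for illustration.
MIN_VOLUME = 50
MIN_RATE = 0.10
SPIKE_FACTOR = 2.0

def is_severe(e: ErrorStats) -> bool:
    rate = e.count / e.calls if e.calls else 0.0
    # "Increasing" here means the rate is well above its own baseline.
    increasing = e.baseline_rate > 0 and rate >= SPIKE_FACTOR * e.baseline_rate
    if not (e.is_new or increasing):
        return False
    # A new or increasing error is severe if it is uncaught,
    # or if both its volume and its rate cross the thresholds.
    return e.is_uncaught or (e.count >= MIN_VOLUME and rate >= MIN_RATE)
```

For example, a brand-new uncaught error is flagged after a single occurrence, while a long-standing caught error with a steady rate is not, however often it fires.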

2. True Root Cause

Many APM vendors will tell you that they provide the root cause of an issue, including “code-level” insights. What they actually mean is that they provide you with a stack trace. Stack traces, while useful, only help identify the layer of code where an issue occurred. From there, you’re left to your own devices, including spending time manually digging through shallow log files to find context that can help you reproduce the issue.

Service Reliability Management helps you go beyond the stack trace, capturing deep data, down to the lowest level of detail – without dependency on developer or operational foresight. This includes:

The source code executing at the moment of the incident, captured directly from the JVM
The exact offending line of code
Key data and variables associated with the incident
DEBUG and TRACE log statements
Environment and container variables
The ability to map events to specific applications, releases, services, etc.
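To show the difference between a stack trace and this kind of snapshot, here is a small Python analogy. It is a sketch, not the SRM agent: it walks to the innermost frame of an exception and records the offending line, the local variables and a piece of environment data. The `snapshot` and `apply_discount` functions and the choice of the `HOSTNAME` variable are assumptions made for the example.

```python
import linecache
import os

def snapshot(exc):
    # Walk to the innermost frame, where the exception was actually raised.
    tb = exc.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    frame = tb.tb_frame
    filename = frame.f_code.co_filename
    return {
        "error": type(exc).__name__,
        "function": frame.f_code.co_name,
        "line_number": tb.tb_lineno,
        # Empty if the source file is not available on disk.
        "source_line": linecache.getline(filename, tb.tb_lineno).strip(),
        # Variable state at the moment of failure, not just the call path.
        "variables": {k: repr(v) for k, v in frame.f_locals.items()},
        # HOSTNAME is only an example of environment data worth keeping.
        "environment": {k: os.environ[k] for k in ("HOSTNAME",) if k in os.environ},
    }

def apply_discount(price, percent):
    factor = 100 / percent  # fails when percent is 0
    return price / factor

try:
    apply_discount(50, 0)
except ZeroDivisionError as exc:
    report = snapshot(exc)
```

A stack trace alone would point at `apply_discount`. The snapshot also shows that `percent` was `0` when the failure happened, which is usually what you need to reproduce the issue.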

3. Transactions & Performance Metrics

In the context of software development and reliability, a transaction is a sequence of calls that are treated as a unit, often based on a user-facing function. When a transaction fails, customer experience is often impacted, so it’s important to be able to identify and prioritize these failures in the context of the transactions that they impact.

SRM captures data about every transaction failure, ranging from how many times it happened, to how many transactions failed, to the response time of the transaction. Using insights from the code events we mentioned above, we can determine the success of a transaction by correlating errors, exceptions and slowdowns within a given timeframe and surface this data to our users.
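One simple way to picture this correlation is to match code events to transactions by thread and time window. The sketch below assumes that model; the `Transaction` and `CodeEvent` types and the matching rule are hypothetical and far simpler than what a production tool would need.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    name: str
    thread: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class CodeEvent:
    kind: str     # "error", "exception" or "slowdown"
    thread: str
    at: float     # seconds

def failed(tx, events):
    """A transaction counts as failed if any code event occurred on its
    thread between its start and end."""
    return any(e.thread == tx.thread and tx.start <= e.at <= tx.end for e in events)

def failure_summary(transactions, events):
    """Return {transaction name: (total runs, failed runs)}."""
    summary = {}
    for tx in transactions:
        total, failures = summary.get(tx.name, (0, 0))
        summary[tx.name] = (total + 1, failures + int(failed(tx, events)))
    return summary

transactions = [
    Transaction("checkout", "t1", 0.0, 1.2),
    Transaction("checkout", "t2", 0.5, 0.9),
    Transaction("search", "t1", 2.0, 2.1),
]
events = [CodeEvent("exception", "t1", 0.7)]
summary = failure_summary(transactions, events)
```

Here `summary` reports one failed run out of two for `checkout` and none for `search`, so the exception can be prioritized by the transaction it affected.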

Performance metrics include throughput (the number of transactions that occur during a given period of time) and response time baselines. The ability to capture data about application performance is critical to understanding what your end users are experiencing, as well as correlating related events that may help with identifying the root cause.
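As a rough illustration of these two metrics, the sketch below computes throughput over a time window and flags a response time that is far above its historical baseline. The three-sigma rule and the function names are assumptions for the example, not SRM's method.

```python
from statistics import mean, pstdev

def throughput(timestamps, window_start, window_seconds):
    """Transactions per second inside [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    in_window = [t for t in timestamps if window_start <= t < window_end]
    return len(in_window) / window_seconds

def is_slowdown(response_ms, history_ms, sigmas=3.0):
    """Flag a response time well above the historical baseline."""
    baseline = mean(history_ms)
    spread = pstdev(history_ms)
    return response_ms > baseline + sigmas * spread

rate = throughput([0.1, 0.5, 0.9, 1.5, 2.2], window_start=0.0, window_seconds=2.0)
history = [100, 110, 90, 105, 95]  # past response times in milliseconds
```

With this history the baseline is 100 ms, so a 180 ms response is flagged as a slowdown while a 115 ms response is treated as normal variation.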

4. System Metrics

SRM focuses on data at the code level of your application, but we recognize the importance of correlating code-level failures with other aspects of your system. For example, what impact did your latest deployment have on CPU/memory utilization? Are there any blocked threads related to this failure? Was this CPU spike caused by the application?

Through the SRM reliability dashboards, you can correlate events, transactions and performance metrics with things like garbage collection, threads, CPU, class loading and memory consumption, giving you a more comprehensive view of dependencies indirectly related to your application.
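The idea of attaching system state to a code event can be sketched with Python's standard library. This is only an analogy for the JVM metrics listed above: loaded modules stand in for class loading, and the allocated block count is a CPython-specific stand-in for memory consumption. The `system_snapshot` and `record_event` names are hypothetical.

```python
import gc
import sys
import threading
import time

def system_snapshot():
    return {
        "cpu_seconds": time.process_time(),       # CPU time used by this process
        "threads": threading.active_count(),      # live threads
        "gc_collections": sum(g["collections"] for g in gc.get_stats()),
        "loaded_modules": len(sys.modules),       # rough analog of class loading
        "allocated_blocks": sys.getallocatedblocks(),  # CPython-specific memory proxy
    }

def record_event(name):
    # Store the system state alongside the event so the two can be
    # correlated later, for example on a dashboard.
    return {"event": name, "at": time.time(), "system": system_snapshot()}

event = record_event("checkout failure")
```

Because every event carries its own snapshot, a question like "was this CPU spike caused by the application?" becomes a matter of comparing snapshots before and after a deployment.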

How Do We Do It?

What allows SRM to capture this depth and breadth of data that other monitoring tools simply can’t? The not-so-secret secret to our unique capabilities is a combination of a few key elements:

Code Mapping & Runtime Code Analysis – As code is loaded during server startup, SRM maps it and assigns a unique fingerprint to every code instruction. Then, at runtime, the resulting code graph is used to efficiently and securely access memory and capture application state.
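As a loose analogy for instruction-level fingerprinting, the Python sketch below walks the bytecode of a function with the standard `dis` module and assigns each instruction a stable hash. It does not reflect how SRM maps JVM or .NET code. The key format and the `fingerprint_instructions` name are assumptions made for the example.

```python
import dis
import hashlib

def fingerprint_instructions(func):
    """Map each bytecode offset of func to a stable fingerprint."""
    code = func.__code__
    table = {}
    for ins in dis.get_instructions(func):
        # Location plus opcode identifies the instruction within this build.
        key = f"{code.co_filename}:{code.co_name}:{ins.offset}:{ins.opname}"
        table[ins.offset] = hashlib.sha256(key.encode()).hexdigest()[:16]
    return table

def total(items):
    return sum(items)

code_map = fingerprint_instructions(total)
```

Building the map once, when the code is loaded, means a runtime event only needs to carry a short fingerprint to be tied back to an exact location in the code.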

Software Data Optimization – Our agent operates between the JVM / .NET CLR and processor to capture real-time code and variable state from live microservices and produce optimized software data that provides 10x granularity and context, with minimal performance overhead.

Advanced Machine Learning – SRM runs this high-fidelity data through machine learning algorithms for de-duplication, classification and anomaly detection and delivers Code Quality reports that are based on the code’s runtime behavior.
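De-duplication is the easiest of these steps to picture. The sketch below is rule-based rather than machine learning, and it is not SRM's algorithm: it groups events whose exception type and call path match, ignoring the message text. The `event_fingerprint` and `deduplicate` functions are hypothetical.

```python
import hashlib
from collections import Counter

def event_fingerprint(exc_type, frames):
    """Hash the exception type and the call path, ignoring the message,
    so repeated occurrences collapse into one entry."""
    key = exc_type + "|" + ">".join(frames)
    return hashlib.sha256(key.encode()).hexdigest()[:12]

def deduplicate(events):
    counts = Counter()
    first_seen = {}
    for exc_type, frames, message in events:
        fp = event_fingerprint(exc_type, frames)
        counts[fp] += 1
        first_seen.setdefault(fp, (exc_type, message))  # keep one example
    return counts, first_seen

raw = [
    ("TimeoutError", ["handle", "fetch"], "user 17 timed out"),
    ("TimeoutError", ["handle", "fetch"], "user 42 timed out"),
    ("KeyError", ["handle", "render"], "'title'"),
]
counts, first_seen = deduplicate(raw)
```

The two timeouts differ only in their message, so they collapse into one entry with a count of two. The per-fingerprint counts are also the natural input for classification and anomaly detection.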

To learn more about how SRM can help you capture deeper data, schedule a call with one of our engineers.

Conclusion

The powerful combination of data and analysis is the key to enterprise-scale observability and reliability. SRM not only helps your team capture a complete picture of how your code is executing and the errors and slowdowns that occur, but also analyzes and adds meaning to that data so you know exactly which issues to prioritize.