4 Reasons Log Files Suck and How to Enrich Your Splunk with Code-Aware Machine Data

14th Aug 2018

Most organizations know of, and probably use, Splunk. It aggregates and simplifies machine data for dev and ops teams, letting them capture, index, and correlate real-time data in a searchable repository. Splunk helps organizations understand where things go wrong in applications, optimize the customer experience, and identify the fingerprints of fraud. It has become a standard in the data center.

Splunk does so by analyzing machine data – the digital exhaust created by the systems, technologies, and infrastructure that power the business. It ingests this data from a myriad of sources, but one of the largest contributors is the log files associated with our applications. They can provide some initial insight into what went wrong and into the overall quality of an application.

Four Challenges with Log Files: Why You Only See Part of the Picture

Log files help identify when or where something has gone wrong, but troubleshooting an issue still takes a lot of work to get to the “why” or “what” of the failure. There are four challenges:

  1. Noise: Finding what’s important in noisy log files means parsing and searching through text. You need to be able to identify regressions in correlation with code changes and agile deployments, and to detect anomalies among billions of errors.
  2. Depth: Getting the right level of detail means changing the way you think about logging and making sure your development teams follow a consistent best practice. Splunk provides some great resources for this, but it remains a challenge because it is a manual process.
  3. Tracking: To understand how an error occurred or to trace an event back, we must add explicit structure across numerous log files to tie events together. This becomes a particular stress point with the introduction of microservices.
  4. Visibility: Often the issues that cause the most pain are the ones you may not even know about, such as uncaught and swallowed exceptions.
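
To make the Tracking challenge concrete, here is a minimal sketch of the kind of explicit structure it refers to: one correlation ID attached to every log line a request produces, so that events from different services can be tied together in a search. The service names, the `correlation_id` field, and the `handle_order` function are hypothetical and chosen only for illustration.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be searched by name."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # correlation_id arrives through the `extra` argument of the log call.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(service_name, handler):
    """Create a logger for one (hypothetical) service that writes JSON lines."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_order(orders_log, payments_log):
    # One ID is generated where the request enters the system and is then
    # passed along to every downstream service that logs on its behalf.
    correlation_id = str(uuid.uuid4())
    extra = {"correlation_id": correlation_id}
    orders_log.info("order received", extra=extra)
    payments_log.error("card declined", extra=extra)
    return correlation_id
```

Every service has to follow the same convention for this to work, which is why the manual effort grows with the number of microservices.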

Log files have been with us for ages and are of huge value, but is there another way to collect more in-depth, complete information? If everyone uses log files, how do we differentiate at this level?

OverOps Provides a New Source of Machine Data

At OverOps we deliver a wholly new approach for gaining insight into the quality of our apps and services. Using both static and dynamic code analysis, we capture complete, code-aware insight into every known and unknown error and exception at the moment they occur.
With a log file you might determine which function failed and capture a few variables, but OverOps provides:

  • the value of every single variable across the entire execution stack
  • the complete state of the JVM, including heap and garbage collection
  • the value of every environment variable
  • the last 250 log statements (in all environments, including debug/trace in prod)
  • the frequency and failure rate of each error
  • the classification of errors as new or reintroduced
  • the release or build number associated with the event
  • and, with integration into your code repository, a mapping from an error back to a developer or team.

This unique approach not only captures complete information but it is also code-aware. This allows us to capture the log statements as noted above. And, since we know what code was to be executed in all environments, we can also capture the uncaught and swallowed errors that you would never have visibility into. Often these are some of the most painful production issues.
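
A swallowed error is easy to picture in code. The sketch below is a generic illustration, written in Python rather than on the JVM, and the function names are hypothetical. The first function catches the exception and discards it, so nothing ever reaches a log file; the second records the same failure with its stack trace.

```python
import logging

log = logging.getLogger("inventory")


def parse_quantity(raw):
    """Return the quantity as an int, or 0 when the input is unusable."""
    try:
        return int(raw)
    except ValueError:
        # Swallowed: no log statement and no re-raise. The failure leaves no
        # trace in any log file, and the caller only sees a quantity of 0.
        return 0


def parse_quantity_logged(raw):
    """Same fallback value, but the failure is recorded with its stack trace."""
    try:
        return int(raw)
    except ValueError:
        log.exception("could not parse quantity %r", raw)
        return 0
```

A log-based tool can only ever see the second case, because the first one never writes anything to search for.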

Enriching Splunk with Machine Data from OverOps

There are three ways in which OverOps integrates directly with Splunk. We can help developers troubleshoot more quickly, give DevOps the metrics and insight to gauge the overall quality of software, and fuel new AIOps initiatives. Let’s explore each:

OverOps Inserts Links into Your Current Splunk Log Files

With OverOps links inserted into a log file, you can find where the error occurred using Splunk, and then follow the link to OverOps for complete insight into what happened. A Splunk user sees these links in the log events and can connect directly to the OverOps Root Cause Analysis UI to troubleshoot the issue with this complete information. This can both improve developer productivity and have a huge impact on the QA-to-dev conversation, as your team is armed with exact details about every event.

Figure: OverOps links from Splunk to the OverOps Root Cause Analysis screen

OverOps Data Within Your Splunk Metrics Dashboard

OverOps also exposes all the data we collect through direct integration with our platform. We publish an API and offer an option to deliver the information via StatsD. With this integration you can profile the collected data in a Splunk dashboard and see at a glance the applications, deployments, servers, and methods that were affected by different issues. It provides granular information about application-level events and errors classified as “New Today”, “Resurfaced”, “Network”, “Database”, and more. You can analyze and monitor your application’s health, including the number of new errors that were detected, the number of errors that resurfaced, and other information that is useful to your team and product.
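
For readers unfamiliar with StatsD, each metric travels as a short text line of the form `<bucket>:<value>|<type>`, where `c` marks a counter. The sketch below shows how per-category error counts could be rendered in that format. The bucket naming scheme (`<app>.errors.<category>`) and the function name are assumptions made for illustration; the actual metric names OverOps publishes are not specified here.

```python
def to_statsd_lines(app, counts):
    """Render error counts per category as StatsD counter lines.

    `app` and the category keys are hypothetical names; the only fixed part
    is the StatsD wire format itself: <bucket>:<value>|c for a counter.
    """
    return [
        f"{app}.errors.{category}:{count}|c"
        for category, count in sorted(counts.items())
    ]


if __name__ == "__main__":
    for line in to_statsd_lines("checkout", {"new": 3, "resurfaced": 1}):
        print(line)
```

A StatsD server conventionally listens for such lines on UDP port 8125, and a dashboard can then chart each bucket over time.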

IT Service Intelligence Integration and How Great Data Makes AIOps a Differentiator

A key emerging capability from Splunk is its IT Service Intelligence (ITSI) product, which applies artificial intelligence on top of events. It gives you visibility across IT and business services, and enables you to use AI to switch from reactive to predictive IT. OverOps understands the importance of data in this new effort, and our granular set of contextual insights can provide a valuable baseline for that AI to work from.

Figure: Splunk Service Analyzer

OverOps Helps You Cut Through the Noise Inside Your Log Files

Automatically detect anomalies in billions of log events and errors. OverOps automatically classifies events and helps you find the signal in the noise by analyzing application code at the JVM level to enhance log data at the moment of the event. It allows you to detect anomalies without complex regex queries or manual code instrumentation, deduplicate billions of log events into accurate analytics without parsing and searching through text, and get to the root cause of any error or bottleneck with one click. With OverOps, you can see DEBUG statements regardless of the log verbosity setting, and get the complete source code, stack trace, and variables for any event, whether or not they were logged. You can also stream this data directly into Grafana for visualization and correlation with metrics data.
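
To illustrate the idea of deduplicating by code rather than by text, here is a minimal sketch. It is not OverOps' implementation, which works at the JVM level; the `ErrorEvent` type and its fields are hypothetical. It identifies an event by its exception type and the code location that threw it, so occurrences whose message text differs still collapse into a single entry with a count.

```python
from collections import Counter
from typing import NamedTuple


class ErrorEvent(NamedTuple):
    exception_type: str
    location: str  # code location that threw, e.g. "OrderService.submit"
    message: str   # free text, often different on every occurrence


def fingerprint(event):
    # Identity comes from the code, not from the message text, so no regex
    # is needed to strip out IDs, timestamps, or other variable parts.
    return (event.exception_type, event.location)


def deduplicate(events):
    """Collapse a stream of events into occurrence counts per fingerprint."""
    return Counter(fingerprint(event) for event in events)
```

With text-based matching, the two messages "order 1041 not found" and "order 2203 not found" look like different errors unless a pattern is written for them; with a code-based fingerprint they are counted as the same one.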

Figure: Visibility into DEBUG-level logs and uncaught exceptions

While log files are still extremely valuable, this new source of data can provide additional value and help you accelerate the delivery of more reliable software. Further, with OverOps you can gain insight into the quality of your apps and services in lower-level environments such as dev, test, and staging without increasing the size of your Splunk implementation.
Learn more about how OverOps and Splunk can work together to help your application.

Jim is a 20-year veteran of tech and a developer turned marketer. During his time at OverOps, Jim was responsible for the company narrative and drove field enablement and press/analyst relations.
