Eliminating the need to replicate the error

16th Jun 2021

7 min read

Here we go again: the development team gets a bug report from a customer experiencing an issue in production. The report has insufficient information to actually engage on the issue: no logs attached to the case, a barely usable reproduction path, and no information on the data that led to the issue. Sound familiar?

Suppose a nightly job starts randomly failing about 12% of the time without any explanation. The teams end up in a loop between development, support, and the customer on a quest to reproduce the error so it can be diagnosed and fixed.

Given the sensitivity of their data, some customers may not be allowed to send you data or log files. Requests for redacted sample data lead to long wait times or outright refusals, and ultimately to customer dissatisfaction.

With insufficient information to reproduce the issue, the developer, the support team, and the customer all enter the same frustrating “Insufficient Info” loop.

Note that this loop has many forms and iterations, but the end goal is to move from “Insufficient Info” to the point of having enough information to engage on the issue. For example: 

  • The support engineer may ask the customer to increase the logging level, reproduce the issue, and send back the resulting log files.
  • The developer may team up with the support engineer on one or more conference calls to try to replicate the issue and gather more information. 
  • The developer may create one or more debug builds with lots of extra logging code (essentially breadcrumbs) to provide clues about what the issue might be, send a build to the customer, have them reproduce the issue, and collect yet more log files (yes folks, log files suck). A sketch of that breadcrumb-style logging follows this list.
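
To make the pain concrete, here is a minimal sketch of what that breadcrumb-style logging tends to look like. It assumes a Java service using SLF4J; the class and variable names (NightlyJob, bucket, key) are hypothetical and only illustrate the pattern, not any real customer code.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical nightly job, instrumented with "breadcrumb" debug logging in the
// hope that the customer's next log file finally explains the failure.
public class NightlyJob {
    private static final Logger log = LoggerFactory.getLogger(NightlyJob.class);

    public void run(String bucket, String key) {
        log.debug("Starting nightly job, bucket={}, key={}", bucket, key); // breadcrumb
        try {
            log.debug("About to load resource, key={}", key);              // breadcrumb
            loadResource(bucket, key);
            log.debug("Resource loaded successfully, key={}", key);        // breadcrumb
        } catch (RuntimeException e) {
            // Unless debug logging was enabled when the failure happened, this line
            // is often the only clue that ever makes it back to the developer.
            log.error("Nightly job failed, bucket={}, key={}", bucket, key, e);
            throw e;
        }
    }

    private void loadResource(String bucket, String key) {
        // ... the actual work the job performs ...
    }
}
```

The debug lines only help if the customer runs with debug logging enabled and sends the files back, which is exactly the round trip we are trying to avoid.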

And for each of these efforts, the result is often the same: “Insufficient Info”, try again. Developer time wasted.

Each iteration through the “Insufficient Info” loop increases the Mean Time To Resolve (MTTR). MTTR includes the time spent detecting the failure, the time spent diagnosing and repairing it, and the time spent ensuring that the particular failure doesn’t reoccur.

This is where OverOps comes in.

One of the most important features of OverOps’ technology is that the data leading to an exception is automatically captured when the exception is detected by the OverOps agent.

With OverOps’ dynamic code analysis (DCA) running in your production environment, the developer can go directly to the exception in the Automated Root Cause (ARC) screen and navigate the source code in the UI, inspecting the variable state that led to the exception. This is DCA for application reliability in action.

Let’s look at key elements of the ARC screen when an exception happens on an application running in production. In this example:

  • a customer reported that once or twice a week they see an error in the logs about a failure to load an S3 resource (see the sketch after this list for roughly what such a load path looks like).
  • OverOps is running in their production environment.
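
For context, the failing code path is roughly of this shape. This is a minimal sketch assuming the application reads the resource with the AWS SDK for Java; the class, method, and variable names (ResourceLoader, loadResource, bucket, key) are hypothetical stand-ins, not the customer’s actual code.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

// Hypothetical loader: the kind of call site that shows up in the ARC stack trace
// when the requested S3 key cannot be found.
public class ResourceLoader {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public S3Object loadResource(String bucket, String key) {
        // Throws an AmazonS3Exception (error code "NoSuchKey") if the object is missing.
        // The value of "key" at this exact moment is the variable state OverOps captures
        // alongside the exception.
        return s3.getObject(bucket, key);
    }
}
```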

Let’s jump into OverOps and take a detailed look at the exception.

When you zoom into an exception, you are presented with the ARC screen, which has four notable sections (some variable values are intentionally blurred by the author to protect sensitive information).

Let’s zoom in on each item:

The stack trace is presented in the left-hand window (some values intentionally blurred). Every entry in the stack is selectable and takes you to the corresponding line in the code window.

 

The code window shows you the code at and around the method selected in the stack trace. Every highlighted variable name can be hovered over to show the variable’s value.

 

The recorded variables window provides a table of all the variable values. Each item in the table can be expanded to show the complete value.

 

The line chart at the top shows points in time where the same exception has occurred, and each dot corresponds to a previously taken snapshot. You can switch from the current snapshot to any of the ones recorded in the past to compare the most recent occurrence of an exception to a past occurrence.

So now, when talking with the customer and support engineering, we’re thinking ahead about possible areas we might need to examine. For example:

  1. Invalid configuration – perhaps a configuration file points to a file that doesn’t exist, and the system keeps trying to load it, throwing this exception over and over.
  2. Hard-coded file name – perhaps a class has a hard-coded file name and the system keeps trying to load a file that isn’t there.
  3. Data integrity issue – a database record somewhere indicates that something can be found in a file, but the file has been deleted (expired under a retention policy, cleaned up, etc.).
  4. Computational issue – the file name is computed somewhere, and the algorithm computing it has a bug.

With OverOps, you can quickly compare the variable values captured with the exception over time to see whether it’s the same file name every time. If the system is looking for the same file name over and over, the line of inquiry leads to #1 or #2 above. If the file name is different every time, we’re likely looking at #3 or #4. So let’s do that!

This is the most recent snapshot of the exception, as denoted by the top red arrow pointing to the right-most snapshot. The bottom red arrow shows that the variable “key” is highlighted (partially blurred except for the portion pointed to by the arrow). Here, the file name (S3 key name) is “[blurred].49a9085f3762.[blurred]”.

Now, let’s look at a snapshot from the past:

Note here that the top arrow shows that the snapshot was from May 12th, and the bottom arrow once again points to a (partially blurred) value of the variable “key”. Here, the file name is “[blurred].a22d0872c596.[blurred]”.

The file names being sought are completely different, so we can rule out #1 and #2 as avenues of investigation and focus on #3 and #4 instead. With this information, we just cut out 50% of the testing and troubleshooting time and effort simply by looking at this exception and its snapshots over time in OverOps.

Conclusion

With OverOps, the developer is armed with the variable values up and down the stack and can quickly create a unit or integration test that replicates the issue.
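
As a minimal sketch of that last step, here is what such a replicating test might look like, assuming JUnit 5 and the hypothetical ResourceLoader from the earlier sketch. The bucket and key strings are placeholders standing in for the (blurred) values read off the recorded-variables window, and the test assumes access to an environment where the captured object is genuinely absent.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import com.amazonaws.services.s3.model.AmazonS3Exception;
import org.junit.jupiter.api.Test;

// Integration-style test built directly from the variable state captured in the snapshot,
// so the failure replicates on the first run instead of after weeks of back and forth.
class ResourceLoaderReplicationTest {

    @Test
    void loadingTheCapturedKeyReproducesTheFailure() {
        ResourceLoader loader = new ResourceLoader();

        // Placeholder values lifted from the recorded-variables window of the snapshot.
        String bucket = "customer-bucket";          // placeholder
        String key = "prefix.49a9085f3762.suffix";  // placeholder for the blurred captured key

        // The production exception should reproduce immediately, with no customer round trip.
        assertThrows(AmazonS3Exception.class, () -> loader.loadResource(bucket, key));
    }
}
```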

Having the failing state in hand provides an extremely early exit from the “Insufficient Info” loop. With that, the developer can analyze and diagnose the issue and create the fix. It’s a game-changer.

This can minimize or even eliminate the frustrating “Insufficient Info” loop, which in turn decreases MTTR. When MTTR decreases, product quality increases, and so does customer satisfaction. And when developers and support engineers aren’t spending so much time in the “Insufficient Info” loop, they spend more time on feature development, which increases job satisfaction.

To see OverOps in action, take a 1-minute guided product tour here.

Marc is the Director of Engineering at OverOps, focused on growing and expanding the global dev teams. Prior to OverOps, Marc founded three companies and worked in almost every facet of the organization as Engineering Manager, Chief Engineer, IT Director, Director of Engineering Operations, VP of Engineering, and VP of Software Security and Compliance.

