Here we go again: the development team gets a bug report from a customer experiencing an issue in production, but the report contains too little information to actually engage on the issue. No logs attached to the case, a barely usable reproduction path, and no detail on the data that led to the issue. Sound familiar?
Suppose a nightly job starts failing about 12% of the time, seemingly at random and with no explanation. The teams end up in a loop between development, support, and the customer on a quest to reproduce the error so it can be diagnosed and fixed.
Given the sensitivity of their data, some customers may not be allowed to send you data or log files at all. Requests for redacted sample data lead to long wait times or outright refusals, and ultimately to customer dissatisfaction.
With insufficient information to reproduce, the developer, the support team, and the customer all enter the same frustrating “Insufficient Info” loop:
Note that this loop has many forms and iterations, but the end goal is to move from “Insufficient Info” to the point of having enough information to engage on the issue. For example:
- The support engineer may ask the customer to increase the logging level, reproduce the issue, and send back the resulting log files.
- The developer may team up with the support engineer on one or more conference calls to try to replicate the issue and gather more information.
- The developer may create one or more debug builds with lots of extra logging code (essentially breadcrumbs) to provide clues about what the issue may be, send a build to the customer, have them reproduce the issue, and collect yet more log files (yes folks, log files suck).
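To make that last option concrete, a "breadcrumb" debug build often amounts to nothing more than extra log statements that record variable state around the suspect code path. Here is a minimal sketch using `java.util.logging`; the class, method, and argument names are hypothetical, not taken from any real product:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of "breadcrumb" logging added to a debug build.
public class BreadcrumbDemo {
    private static final Logger LOG = Logger.getLogger(BreadcrumbDemo.class.getName());

    // Stand-in for the production code path under suspicion.
    static String resolveResource(String bucket, String key) {
        // Breadcrumb: record the inputs before the risky operation.
        LOG.log(Level.INFO, "resolveResource bucket={0}, key={1}", new Object[] { bucket, key });
        if (key == null || key.isEmpty()) {
            // Breadcrumb: record the state that led to the failure.
            LOG.log(Level.WARNING, "empty key for bucket={0}", bucket);
            throw new IllegalArgumentException("key must not be empty");
        }
        return bucket + "/" + key;
    }

    public static void main(String[] args) {
        System.out.println(resolveResource("reports-bucket", "2020/01/15/report.csv"));
    }
}
```

The catch, of course, is that the customer has to run this build, reproduce the issue, and ship the logs back before anyone learns anything.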
And for each of these efforts, the result is often the same: “Insufficient Info”, try again. Developer time wasted.
Each iteration through the “Insufficient Info” loop increases Mean Time To Resolve (MTTR). MTTR includes the time spent detecting the failure, the time spent diagnosing and repairing it, and the time spent ensuring that the particular failure doesn’t recur.
This is where OverOps comes in.
One of the most important features of OverOps’ technology is that the data leading to an exception is automatically captured when the exception is detected by the OverOps agent.
With OverOps’ dynamic code analysis (DCA) running in your production environment, the developer can go directly to the exception in the Automated Root Cause (ARC) screen, and navigate through the source code user interface inspecting the variable state that led to the exception. This is DCA for application reliability in action.
Let’s look at key elements of the ARC screen when an exception happens on an application running in production. In this example:
- a customer reported that once or twice a week they see an error in the logs about a failure to load an S3 resource.
- OverOps is running in their production environment.
Let’s jump into OverOps and take a detailed look at the exception.
When you zoom into an exception, you are presented with the ARC screen which has four notable sections (some variable values are intentionally blurred by the author to protect sensitive information):
Let’s zoom in on each item:
So now, when talking with the customer and support engineering, we’re thinking ahead about possible areas we might need to examine. For example:
- Invalid configuration – perhaps a configuration entry points to a file that doesn’t exist, and we’re getting this exception over and over because the system keeps trying to load it.
- Hard-coded file name – perhaps a class has a hard-coded file name, and the system keeps trying to load a file that doesn’t exist.
- Data integrity issue – there is a database record somewhere that indicates that something can be found in a file, but the file has been deleted (expired, retention policy, etc).
- Computational issue – the file name is being somehow computed, and the algorithm computing the file name has a bug.
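As a hypothetical illustration of #4, a computed file name can be wrong only intermittently, which would fit a job that fails some nights and not others. A classic example (the path layout and file name below are invented for this sketch) is using the week-based-year pattern `YYYY` where the calendar-year pattern `yyyy` was intended:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of hypothesis #4: a buggy file-name computation.
public class ReportKey {
    static String reportKey(LocalDate date) {
        // BUG: "YYYY" is the ISO week-based year, not the calendar year.
        // For most dates the two agree, so the job usually succeeds.
        return date.format(DateTimeFormatter.ofPattern("YYYY/MM/dd")) + "/report.csv";
    }

    public static void main(String[] args) {
        System.out.println(reportKey(LocalDate.of(2019, 6, 15)));  // 2019/06/15/report.csv
        // Near the year boundary the computed path points at a file
        // that was never written, so the load fails.
        System.out.println(reportKey(LocalDate.of(2019, 12, 30))); // 2020/12/30/report.csv
    }
}
```

A bug like this is miserable to find from logs alone, but it falls out immediately once you can see the actual file-name values computed at each failure.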
With OverOps, you can quickly compare the variable values captured with each occurrence of the exception over time to see whether it’s the same file name. If the system is looking for the same file name over and over, the line of inquiry leads to #1 or #2 above. If the file name is different every time, we’re likely looking at #3 or #4. So let’s do that!

Now, let’s look at a snapshot from the past: the file names being sought are completely different, so we can rule out #1 and #2 as avenues of investigation and focus on #3 and #4. With this information, we just cut the testing and troubleshooting effort in half simply by looking at this exception and its snapshots over time in OverOps.
With OverOps, the developer is armed with the variable values up and down the stack and can quickly create a unit or integration test that replicates the issue.
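For instance, a captured file path that no longer exists can be dropped straight into a minimal reproducing test. This is only a sketch; the path, class, and method names below are placeholders standing in for values read off the ARC screen:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: replaying the failure with the captured variable state.
public class MissingFileRepro {

    // Returns true when the captured path triggers the same exception
    // the customer saw in production; any other outcome falsifies the hypothesis.
    static boolean reproduces(Path captured) {
        try {
            Files.readAllBytes(captured); // stand-in for the failing production call
            return false;
        } catch (NoSuchFileException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder for the file name captured at the exception.
        Path captured = Paths.get("2020", "12", "30", "report.csv");
        System.out.println("reproduced: " + reproduces(captured));
    }
}
```

Once a test like this fails for the same reason the production code did, the fix can be developed and verified entirely in-house, with no further round trips to the customer.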
This provides an extremely early exit from the “Insufficient Info” loop. With that, the developer can analyze and diagnose the issue and create the fix. It’s a game-changer.
This can minimize or even eliminate the frustrating “Insufficient Info” loop, which in turn decreases MTTR. When MTTR decreases, product quality increases, and so does customer satisfaction. And when developers and support engineers aren’t spending so much time in the “Insufficient Info” loop, they have more time for feature development, which increases job satisfaction.
To see OverOps in action, take a 1-minute guided product tour here.