The Challenge
For a massive digital retailer, code-level application issues can have a significant impact on customers and public reputation. People rely on its applications to accomplish important daily tasks, like grocery shopping or prescription renewals, efficiently. Any failed transaction or unplanned downtime limits the retailer’s primary value to its users, and can mean major inconvenience and lost revenue. For example, in its pharmacy app, a simple error can cause prescription processing issues that leave patients unable to access critical medication.
Prior to OverOps, the retailer struggled to get to the root cause of an issue quickly, despite having a comprehensive ecosystem of monitoring and troubleshooting tools in place. Its APM solution identified when an error occurred, but not where or why. Even with heavy investment in log management tools, the logs only captured the issues the team was able to anticipate. When an unexpected problem occurred, such as a customer’s order failing at checkout, the logs were virtually useless.
Additionally, the errors lacked critical context, such as which order failed or what state the user’s shopping cart was in at the time of the failure. Without this insight, developers had to manually go back and try to recreate the problem, usually an exercise in futility. This resulted not only in a poor user experience and reduced customer satisfaction, but also in a lot of developer time spent on troubleshooting rather than innovating. Developers were out of commission for days trying to figure out what went wrong.
The Solution
OverOps gives the retailer’s development and DevOps teams an unprecedented level of insight into their code across the entire pipeline, from development to production. By arming the team with complete data around every error, including the source code, variables and environment state, OverOps eliminates guesswork from the troubleshooting process.
With OverOps, their developers no longer need foresight to predict failure scenarios. When an error is detected – even one that wasn’t in the logs – the team can easily access OverOps and get straight to the root cause with the complete context of how the issue happened.
This has drastically cut down debugging time and increased the team’s efficiency. Rather than spending multiple days, if not weeks, tracking down exactly what happened to cause a customer-impacting incident, they go straight to OverOps and are able to reproduce and resolve the issue in minutes, leaving more time to build new features and products.
“OverOps eliminates the cycle of guesswork in troubleshooting by giving our developers unprecedented visibility into how and why production failures happen.”
How are you integrating OverOps with your daily workflow?
OverOps is now an integral part of our Splunk workflow and has helped optimize that investment. By providing a link to every error’s analysis directly within Splunk, OverOps fits easily into our existing troubleshooting workflow and adds an extra layer of context on top of our existing log insight.