Customer story

Leading Canadian Retailer Eliminates the Guesswork from Troubleshooting Critical Errors with OverOps

80,000 Grocery Orders Per Week

10 million+ Loyalty Program Members

$40B+ In Revenue

135,000+ Employees
Highlights
  • Increased developer efficiency by cutting troubleshooting time from weeks to minutes.
  • OverOps enables the retailer to take a proactive approach to error resolution, without relying on foresight or logs.
  • OverOps is an integral part of the company’s reliability strategy for its primary digital channels including the loyalty program and grocery delivery app.
  • OverOps frees up developers' time so they can focus their energy on building new features and out-innovating competitors.
Senior Director, Software Engineering

The Challenge

For a massive digital retailer, code-level application issues can have a significant impact on customers and public reputation. People rely on these applications to accomplish important daily tasks, like grocery shopping or prescription renewals, efficiently. Any failed transaction or unplanned downtime undermines the retailer’s primary value to its users, and can mean major inconvenience and lost revenue. In its pharmacy app, for example, a simple error can cause prescription processing issues that leave patients unable to access critical medication.

Prior to OverOps, the retailer struggled to get to the root cause of an issue quickly, despite having a comprehensive ecosystem of monitoring and troubleshooting tools in place. Its APM solution identified when an error occurred, but not where or why. Even with heavy investment in log management tools, the logs captured only the issues the team had been able to anticipate. When an unexpected problem occurred, such as a customer’s order failing at checkout, the logs were virtually useless.

Additionally, the errors lacked critical context, such as which order failed or what state the user’s shopping cart was in at the time of the failure. Without this insight, the team had to go back and try to recreate the problem manually, which was usually an exercise in futility. The result was not only a poor user experience and reduced customer satisfaction, but also a lot of developer time spent on troubleshooting rather than innovating. Developers were out of commission for days trying to figure out what went wrong.

The Solution

OverOps gives the retailer’s development and DevOps teams an unprecedented level of insight into their code across the entire pipeline, from development to production. By arming the team with full data on every error, including the complete source code, variables and environment state, OverOps eliminates guesswork from their troubleshooting process.

With OverOps, the retailer’s developers no longer need foresight to predict failure scenarios. When an error is detected, even one that wasn’t in the logs, the team can open OverOps and get straight to the root cause with the complete context of how the issue happened.

This has drastically cut down debugging time and increased the team’s efficiency. Rather than spending multiple days, if not weeks, tracking down exactly what happened to cause a customer-impacting incident, they go straight to OverOps and are able to reproduce and resolve the issue in minutes, leaving more time to build new features and products.

“OverOps eliminates the cycle of guesswork in troubleshooting by giving our developers unprecedented visibility into how and why production failures happen.”

How are you integrating OverOps with your daily workflow?

OverOps is now an integral part of our Splunk workflow and has helped optimize that investment. By providing a link to every error’s analysis directly within Splunk, OverOps fits easily into our existing troubleshooting workflow and adds an extra layer of context on top of our existing log insight.