- Reduced MTTI and MTTR by over 90%
- Met larger corporate goals tied to customer experience automation
- Cost savings on early error detection and resolution prior to production
Site Reliability Engineer
Expedia Group, Inc. is one of the world’s largest travel platforms. Its self-service platform lets users arrange and manage their entire travel journey online, including lodging, airlines, car rentals and more, from start to finish. Expedia’s top priority is a streamlined customer experience.
In light of the global pandemic and resulting uncertainty around travel, Expedia’s customer service live agents were struggling to keep up with the increased volume of questions regarding cancellations, reschedules and itinerary changes. Thus, they increasingly turned to Expedia’s digital application to alleviate the demand in a timely manner and deliver a positive customer experience.
Because user transactions happen in real time, site reliability is paramount for the Virtual Agent Platform (VAP) team, which is responsible for the applications, microservices and components of the Expedia website and third-party APIs. The team cannot afford downtime or time wasted troubleshooting customer-impacting errors. They must be able to identify errors early in the release cycle and to quickly detect and resolve any anomalies that reach production; however, visibility had proved to be a challenge for the team.
“Prevention is always better than the cure. With OverOps, you’re preventing the heart attack before it ever happens.” – Gavan McLaughlin, Site Reliability Engineer, Expedia
In order to field user conversations across multiple channels, properties and languages, the VAP requires a distributed event-driven architecture that is highly scalable and complex, and their existing tooling could not deliver the necessary metrics and insight. To accommodate an asynchronous system, they used a clunky, time-consuming workflow that relied heavily on sifting through logs to find the root cause of an issue.
Expedia needed a solution that could help them shift left, identifying code quality issues early in the release cycle and stopping NullPointers from reaching production. It also had to provide deeper visibility across the pipeline so the team could better understand where their code was most susceptible and resolve issues faster.
OverOps’ dynamic code analysis arms the Expedia VAP team with contextual code-level data across the entire pipeline. This allows them to monitor application health by code tier and deployment, reduce reliance on logging and Splunk searches, and properly identify which errors are new and critical in any release, application or environment – all without compromising speed.
“With OverOps, you’re not sacrificing speed to improve quality.”
Expedia started by using OverOps in pre-production to prevent expensive production outages, but quickly saw value and expanded into their production environment after just four months. The team has improved not only overall code quality but also developer productivity. Using OverOps, the VAP engineering team was able to reduce MTTI and MTTR by over 90%, allowing them to spend less time troubleshooting errors and more time driving innovation and providing an exceptional customer experience.
“The moment we see an anomaly, OverOps helps us quickly understand the entire context around the problem so we can reproduce it.”
OverOps arms Expedia leadership with a quantifiable measurement of quality tied directly to the customer experience and bottom line. Additionally, OverOps’ flexible licensing model and thorough security measures support Expedia’s enterprise environment without sacrificing value, jeopardizing sensitive data or bogging down the team with financial frustrations.
How are you integrating OverOps into your daily workflow?
OverOps is part of Expedia’s internal KUMO (tools and services) pipeline and integrates seamlessly into the team’s homegrown Haystack tool for tracing. The platform’s headless nature allows the team to publish critical OverOps data to third-party applications like Splunk and arm developers with critical insight across the pipeline.
In pre-production, OverOps serves as a quality gate after unit and end-to-end tests are run. OverOps’ rich runtime data helps define the team’s go/no-go decision tree based on critical exceptions to move safe releases into production.
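As a rough illustration of the kind of go/no-go rule described above, the sketch below blocks a release when new, critical exceptions appear during pre-production testing. It does not use the OverOps API: the `ExceptionEvent` record, its fields, the `gate_decision` function and the `max_new_critical` threshold are all hypothetical names chosen for this example, and the team's actual decision tree is not described in detail here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ExceptionEvent:
    # Hypothetical record shape for illustration; not the OverOps data model.
    name: str          # exception type, e.g. "NullPointerException"
    is_new: bool       # first seen in the release under test
    is_critical: bool  # matches the team's own list of critical exception types


def gate_decision(
    events: Iterable[ExceptionEvent],
    max_new_critical: int = 0,
) -> Tuple[str, List[str]]:
    """Return ("go" or "no-go", sorted names of the blocking exception types)."""
    # Only exceptions that are both new in this release and critical can block it.
    blocking = sorted({e.name for e in events if e.is_new and e.is_critical})
    decision = "go" if len(blocking) <= max_new_critical else "no-go"
    return decision, blocking


if __name__ == "__main__":
    observed = [
        ExceptionEvent("NullPointerException", is_new=True, is_critical=True),
        ExceptionEvent("TimeoutException", is_new=False, is_critical=True),
        ExceptionEvent("IllegalStateException", is_new=True, is_critical=False),
    ]
    decision, blocking = gate_decision(observed)
    print(decision, blocking)
```

In this sketch, a pre-existing critical exception or a new but non-critical one does not block the release; only the combination of the two does. A real gate would likely weigh more signals than these two flags.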
In production, OverOps’ integration with Splunk automatically inserts root cause links directly into the log files that are part of the team’s existing workflow. In one instance, the team was dealing with a time-sensitive production issue. They were able to go straight to their logs and click on the OverOps tiny URL to directly access the code and variables tied to the error. They reproduced the issue in less than 15 minutes and began working on a hotfix rather than having to roll back the release.