TripAdvisor is the world’s largest online resource and category leader for travel destination activities.
Production Monitoring Ecosystem
Key challenges and pain points:
Working in a PCI Level One compliant environment means that it’s nearly impossible to replicate issues locally, and production access is limited as well. On top of that, there’s legacy code whose original developers are long gone and where things are logged just in case, which leads to very noisy (and costly) logs. All of this made finding and understanding errors and exceptions a real challenge for us.
Example problem that OverOps helped resolve:
We saw a performance degradation following a release. We looked through our APM tools but couldn’t find anything obvious in the code that might have caused it. Then we received an email from OverOps showing that many “gets” from a cache were failing, so objects were being regenerated.
OverOps then showed us all the variable values across the callstack for that error in production, and enabled us to fix this issue quickly and easily.
Another case came after we released a major version: we saw a sudden spike of 500 errors on one of our APIs, and our logs were filling up with exceptions. We had to roll back the release, search the logs and code, create a hotfix with extra logging, release the new version and wait for replication. That took three days.
Now when we release a version, OverOps alerts us about errors in real time, shows us the variables and lets us easily reproduce and solve the issue. OverOps turned days of work into minutes. Short of attaching a debugger in production, OverOps is the next best thing.
How are you integrating OverOps with your daily workflow?
OverOps’s email digests alert us to new errors detected in our application during the last 24 hours. These emails also notify us about specific errors that exceed the target volumes we’ve set up in advance.
Watch Steve Rogers' QCon talk on OverOps - The New Way to Debug Java in Production.