What Is OverOps (And How It All Started With Downtime)

12th Mar 2013


Where should we begin?
Yes, we know who we’re dealing with – we’ll start with the short answer (skip the story) and tell you exactly what we do.
OverOps was built to help developers understand why their software is ‘misbehaving’ (exceptions, illegal states, thread latency). Notifying you that there’s a problem with a server, and maybe pointing to its location, is nice, but our mission at OverOps is different – we’re going deeper, much deeper, into the code. We want to tell you exactly which variables, conditions and objects in your code are causing the problem, in staging or production.
We’re hoping to change how developers debug server code because, yes, it’s a mess today.
Our vision (and product) is pretty straightforward – when something goes wrong on your server we’ll show you:

  • All the methods which led there, both active and completed (think of it as a super call stack), starting from the first call into your code within the thread.
  • Values of all the variables which are relevant to the problem (getting warmer).
  • History of all the relevant variable assignments – quickly answering how a variable ended up with a certain value (even warmer).
  • Thread data – all of this is presented across the different threads that led to the exception, showing which variables were assigned where and why (bingo!).
Shhh debugging... - OverOps's team (Chen, Niv, Iris, Tal & Dor) in a debugging session
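
To make the "assignment history" and "thread data" items listed earlier a little more concrete, here is a minimal sketch of the idea in plain Java. It is only an illustration of the concept, not how OverOps works internally: the `AssignmentHistory` and `AssignmentHistoryDemo` classes, their methods, and the variable name `user` are all hypothetical, and the recording here is done by hand rather than automatically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration only: these classes are not OverOps's API or implementation.
final class AssignmentHistory {

    /** One recorded assignment: what was assigned, by which thread, in which method. */
    static final class Assignment {
        final String variable;
        final Object value;
        final String thread;
        final String method;

        Assignment(String variable, Object value, String thread, String method) {
            this.variable = variable;
            this.value = value;
            this.thread = thread;
            this.method = method;
        }

        @Override
        public String toString() {
            return variable + " = " + value + " (thread " + thread + ", in " + method + ")";
        }
    }

    // Thread-safe list, since assignments are recorded from several threads.
    private final List<Assignment> log = new CopyOnWriteArrayList<>();

    /** Records an assignment made by the calling method on the current thread. */
    void record(String variable, Object value) {
        // Frame 0 of a fresh Throwable is record() itself; frame 1 is its caller.
        String caller = new Throwable().getStackTrace()[1].getMethodName();
        log.add(new Assignment(variable, value, Thread.currentThread().getName(), caller));
    }

    /** Returns every recorded assignment to the given variable, oldest first. */
    List<Assignment> historyOf(String variable) {
        List<Assignment> result = new ArrayList<>();
        for (Assignment a : log) {
            if (a.variable.equals(variable)) {
                result.add(a);
            }
        }
        return result;
    }
}

final class AssignmentHistoryDemo {
    static void loadUser(AssignmentHistory history) { history.record("user", "alice"); }

    static void clearCache(AssignmentHistory history) { history.record("user", null); }

    /** Runs the two writers on separate threads, one after the other. */
    static AssignmentHistory run() {
        AssignmentHistory history = new AssignmentHistory();
        Thread loader = new Thread(() -> loadUser(history), "loader");
        Thread cleaner = new Thread(() -> clearCache(history), "cache-cleaner");
        try {
            loader.start();
            loader.join();   // join before starting the next thread so the order is deterministic
            cleaner.start();
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return history;
    }

    public static void main(String[] args) {
        // The last entry shows which thread and method left "user" null.
        for (AssignmentHistory.Assignment a : run().historyOf("user")) {
            System.out.println(a);
        }
    }
}
```

Reading the history from oldest to newest answers the "how did this variable end up with this value?" question directly: the final entry names the thread and the method that performed the last assignment.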


Here are a few situations where you’d want to use OverOps:

  • You’ve deployed new code and now your servers are suffering. This can happen during the night, or even better, a weekend night. If we’re talking about downtime it’s usually on your birthday, or much better – your spouse’s. That’s when you want to understand exactly which strange input is causing your code to throw exceptions or which method is blocking (and much more importantly – why).
  • You have this mystical and elusive bug you’ve been chasing for a long time. Some of your co-workers blame the full moon, others think it’s caused by evil hackers from exotic countries. Understand once and for all which input, state and thread combination is leading to it.
  • You’re facing a problem on one of your production servers and need to collect data to understand it better. Deploying a new version to production with extra logging is a huge pain (if possible at all). OverOps lets you set breakpoints – choose a location and a condition within your running code and start receiving data immediately, without stopping your app or redeploying code.
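
The "location plus condition" idea in the last scenario can be sketched in a few lines of Java. This is a hand-written approximation under stated assumptions, not OverOps's API: `SnapshotPoint`, `SnapshotDemo`, and `applyDiscount` are hypothetical names, and in this sketch the call to `hit` is placed in the code by hand, whereas the point of the product is that you would not have to redeploy to add it. What the sketch does show is the non-blocking behaviour: when the condition holds, the local state is copied and the thread carries on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

// Hypothetical illustration only: these classes are not OverOps's API or implementation.
final class SnapshotPoint<T> {
    private final String location;         // label for the code location being watched
    private final Predicate<T> condition;  // state is captured only when this holds
    private final List<Map<String, Object>> snapshots = new CopyOnWriteArrayList<>();

    SnapshotPoint(String location, Predicate<T> condition) {
        this.location = location;
        this.condition = condition;
    }

    /** Called at the watched location; copies the locals and returns without pausing the thread. */
    void hit(T watched, Map<String, Object> locals) {
        if (!condition.test(watched)) {
            return;
        }
        Map<String, Object> snapshot = new LinkedHashMap<>(locals);
        snapshot.put("location", location);
        snapshot.put("thread", Thread.currentThread().getName());
        snapshots.add(snapshot);
    }

    /** Returns a copy of everything captured so far. */
    List<Map<String, Object>> snapshots() {
        return new ArrayList<>(snapshots);
    }
}

final class SnapshotDemo {
    /** Example business method with a bug: a percent above 100 yields a negative price. */
    static int applyDiscount(int priceCents, int percent, SnapshotPoint<Integer> point) {
        int discounted = priceCents - (priceCents * percent / 100);

        Map<String, Object> locals = new LinkedHashMap<>();
        locals.put("priceCents", priceCents);
        locals.put("percent", percent);
        locals.put("discounted", discounted);
        point.hit(discounted, locals);

        return discounted;
    }

    public static void main(String[] args) {
        // Condition: capture state only when the resulting price is negative.
        SnapshotPoint<Integer> negativePrice =
                new SnapshotPoint<>("applyDiscount", value -> value < 0);

        applyDiscount(1000, 10, negativePrice);   // condition is false, nothing captured
        applyDiscount(1000, 150, negativePrice);  // condition is true, locals captured

        System.out.println(negativePrice.snapshots());
    }
}
```

Only the second call is captured, so the snapshot contains exactly the input (`percent` of 150) that produced the bad state, which is the kind of data you would otherwise add logging and redeploy to get.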

How was OverOps born and where’s it going?
Like every good story, there’s some amount of pain and suffering involved. The idea came about when we experienced major downtime – the usual story, right after the launch. We had to scale from five thousand users to over a million in a few weeks. That’s where the pain part came in (think long nights and long weekends). The first incentive for OverOps was simple: we went looking for a tool that would help us understand more quickly what was causing errors within our code (not just tell us that the server was slower, or give us a call stack), and we couldn’t find one.
Then we started looking at the wider picture. While most of our server environment has completely changed over the past few years (moving to the cloud, new ways of deploying code, new DBs, etc.), the way we debug server code has remained the same. Debugging in Eclipse or IntelliJ is a joy. Debugging a running production server is going back to the stone (log) age.
To create what we wanted we had to break a lot of assumptions. Understanding what’s causing code to fail in production is a very compute-intensive operation. It can’t really be done by your overworked local JVM as your app runs (that’s why you get so little information when it does fail). To answer the kind of questions we wanted to ask (“why is this field null?”), we take the load off your servers and move it to the cloud. We index your code graph in the cloud, and when a new exception happens in your app we can query which variables and conditions are causing it, figure out which data is most valuable to you, and work out the best way to get it. Moving all the heavy work off your server is the key to keeping performance overhead to a minimum (a tool that slows your app is no good).
That’s our vision – we want you (and us) to spend less time debugging, sadly gazing at weird-looking bugs and scraping data from logs (while wishing you could reproduce the problem locally), and more time developing new stuff (okay, that’s what we like doing in our free time, you can enjoy your own hobbies).
Coming soon to a server near you.
