God-Mode in Production Code

30th Jun 2014

10 min read

Debugging is like taxes – everybody likes to write code, few like to pay for it. If you catch errors during development – things aren’t bad. You’ve got your IDE with breakpoints, watches, tooltips and plenty of time to reproduce. And even more importantly – you can fix things before they can do any real harm.
When code fails in production, all that goes out of the window.
With debuggers no longer an option, it’s up to you to use log files and stack traces to try to determine the source code and variable state combination that caused the error.

OverOps is trying to level the playing field by making it just as easy to fix Java and Scala code in production as it is on your desktop.

It detects errors and exceptions in server code, provides analytics to help prioritize them, and captures the source code and values of variables that caused them.

OverOps was founded in 2012, and has been in beta for the last year with over 200 companies.

How Would You Use OverOps?

Developers use many debugging tools today, ranging from command-line debuggers to dynamic tracers and log analyzers. OverOps focuses heavily on production debugging, breaking that process into three steps –

  1. Detection – know when a new error has been introduced into your environment at either staging or production, or when an existing one has increased in frequency.
  2. Prioritization – get the metrics needed to decide if and when to fix it.
  3. Analysis – get the actual source code and combination of variable values that caused it. Think of it like a debugger that automatically turns on once an error happens, collects the variables and source code for later review, and then lets the code continue executing.

1. Detecting Errors

OverOps operates at the native JVM level, which allows it to detect and show you any form of exception or error in your code, regardless of whether it was thrown by the application code, the JVM, or a third-party library, and regardless of how it was caught. The same is true for logged errors and HTTP errors.
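To see why "regardless of how it was caught" matters, consider a minimal sketch of an exception that is thrown and swallowed inside one method. The class and method names here are hypothetical and are not part of any OverOps API. Nothing reaches a log file, so a log-based approach would never see this error, while a tool observing exceptions at the JVM level still could.

```java
// Hypothetical example: names are illustrative, not part of any OverOps API.
final class SwallowedException {
    static final int DEFAULT_PORT = 8080;

    // Parses a port number, silently falling back to a default on bad input.
    // The NumberFormatException is thrown and caught inside this method,
    // so no stack trace is ever written to a log file.
    static int parsePort(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_PORT; // exception swallowed here, nothing logged
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePort("9090"));  // valid input, no exception
        System.out.println(parsePort("90 90")); // throws internally, returns the default
    }
}
```

From the caller's point of view the second call simply returns a value, which is exactly the kind of error that tends to go unnoticed in production.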

You can see and sort through all the errors in the OverOps dashboard, which operates as a sort of spreadsheet for all the errors in your application. You can sort and filter them by the most recent ones, ones that have recently increased in volume, or by a specific type (e.g. uncaught NullPointerExceptions).

When a new location in your code begins firing an error, OverOps will notify you by email. It also sends daily digests that summarize which new errors have been introduced into your code, and top errors across your cluster.

2. Prioritizing Errors

Once you’ve seen an error through the dashboard or email, the next step is to decide whether you want to do something about it right now, tomorrow or next quarter. For this you’ll need to understand its actual impact. This requires correlating a number of metrics, including how often it’s happening, when it started, and whether it’s related to a recent change in the code.

To help with this process OverOps provides a set of metrics for each error –

  • When it started. The first and last time that location in your code fired that type of exception.
  • Code changes. For every method in the error’s call stack, OverOps shows where its code was modified on that machine in the day or week prior to the error. OverOps detects deployments by assigning a unique binary signature to each .jar, .war, or .class loaded into the JVM. So when code breaks, it can tell when it was deployed onto that machine and into the application in general.
  • Frequency. One of the most important aspects of prioritizing an error is frequency – both absolute and relative to the number of calls into the code. If an exception was fired 1,000 times today, that may be significant if the code was called 5,000 times. But if the code was called a million times, this may actually be okay. For this reason OverOps shows both the number of times an error occurred and the percentage of the total calls to that code which that number represents.
  • Trends. Some exceptions represent normal application logic such as cache misses, login failures, or conditional update failures. That may be normal, but what if an error has increased by 40% since the last deployment? You’ll probably want to know about that. So OverOps tracks trends for each error, to show whether it has increased in the past hour, day or week.
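The arithmetic behind the frequency and trend metrics can be sketched in a few lines. This is only an illustration using the numbers from the list above. The class and method names are hypothetical, and the counts used for the 40% trend (500 and 700) are made up for the example. It does not describe how OverOps computes these values internally.

```java
// Hypothetical helper: illustrates the arithmetic only, not OverOps internals.
final class ErrorMetrics {

    // Percentage of calls into a code location that ended in the error.
    static double failureRatePercent(long errors, long calls) {
        if (calls <= 0) {
            throw new IllegalArgumentException("calls must be positive");
        }
        return 100.0 * errors / calls;
    }

    // Relative change, in percent, between an earlier and a later error count.
    static double changePercent(long before, long after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        System.out.println(failureRatePercent(1000, 5000));      // prints 20.0
        System.out.println(failureRatePercent(1000, 1_000_000)); // prints 0.1
        System.out.println(changePercent(500, 700));             // prints 40.0
    }
}
```

The same 1,000 errors are a 20% failure rate in one case and 0.1% in the other, which is why the absolute count alone is a poor basis for prioritization.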

3. Analyzing Errors

Once you’ve decided you want to fix an error, as a developer you’ll most likely need two things – the source code which was executing on that machine, and the variable state at the moment of error. OverOps captures that information to show a reconstruction of the source code running on that machine, and the variable and object values across the stack. This enables you to quickly discover any mismatches between the two, allowing you to determine the root cause of the error.

To display source code OverOps will either decompile bytecode in the cloud as necessary, or use .jar source files and Scala source directories if present on the machine.

Distributed debugging. OverOps will also show the source code and variable values across machines. So if machine A makes an HTTP call into machine B which fires an error, OverOps will show not just the code on machine B (where it may already be too late to do anything), but also the code and variable values across any number of machines calling into it.

This is done through a process of “reverse signalling”, where the machine that fires the error signals back to the machines calling into it that an error has occurred and that they need to collect error data for the call. OverOps then correlates these snapshots into one “story” which is presented to you.

This is especially efficient from a developer’s perspective when compared to working from a stack trace in a log file, where it can be challenging to identify and access the machine which originated the call, and then find the relevant variable data for that call (assuming it was logged) within the logs.

Communicating Errors Between Teams

The universal language of Java errors is the stack trace. Stack traces are the currency in which errors are described and passed along between dev, QA and Ops teams. One of the things OverOps does is make stack traces smarter, so that they contain not only a description of what happened, but also of when and why it happened.

Whenever an error is logged, OverOps makes a small addition to the stack trace called a power link. This hyperlink lets a developer jump directly from the stack trace into the error’s source code, variable state and analytics. The link persists as part of the stack trace, even when it’s shared by email or pasted into a bug tracking system such as Jira or Bugzilla. This enables developers to get much better data from Ops or QA using the same workflow they use today, without having to instrument, redeploy and reproduce an error to get to the variable state which caused it.


Debugging during development is materially different from debugging in production. When you run a JVM in debug mode using either JDWP or JVMTI, you’re enabling hooks within the JVM that let a debugger receive notifications when low-level events such as exceptions happen, or pause execution at specific bytecode locations for things like step-over and breakpoints.

The downside is that enabling these hooks prevents the JIT compiler from performing some of the optimizations it would normally do, which reduces the speed at which your code executes. One example is exception callbacks. When enabled by a debugger, they prevent the JIT compiler from fully optimizing try / catch clauses, and can cause code to revert to interpreted mode when an exception is thrown in order to make the callback into the debugger. With this comes a significant drop in speed – especially at scale.

OverOps approaches this challenge by combining static bytecode analysis in the cloud (similar to tools like Coverity) with dynamic data collection at the native JVM level.

OverOps offloads compute-intensive operations (such as bytecode analysis and data reduction) from your machine to its servers. At the machine level it instruments not only the bytecode that’s loaded into the JVM for compilation, but also the resulting x86 machine code that’s produced from it. This enables it to collect low-level data and intercept signals without incurring the continuous performance overhead that a normal debugger would have. Through this it can operate with an average performance overhead of less than 3% once analysis of your code has been completed.

Installing On Your Machine

OverOps runs on Windows 7 / 8 / Server, OS X and major Linux flavors. You can install it using standard Linux wget / curl commands, through DEB or RPM packages, or with Chef. After you install OverOps, the next step is to add an agent argument to your JVM. Once your application launches, OverOps will analyze your code for the first time. Any exception or error that your code encounters will be detected and tracked. Since OverOps operates at the JVM level, it’s agnostic to which frameworks (e.g. Guava) or web containers (e.g. Tomcat, Play) you use. No code or additional configuration changes are needed to run.
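The article does not show the exact agent argument, so the sketch below does not guess at it. Instead, it shows one generic way to check from inside a running application which agent-loading options the JVM was actually started with, using the standard `-agentlib:`, `-agentpath:` and `-javaagent:` prefixes. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: lists the agent-related options passed to this JVM.
final class AgentArguments {

    // Keeps only the JVM options that load a native or Java agent.
    static List<String> agentArguments(List<String> jvmArguments) {
        return jvmArguments.stream()
                .filter(arg -> arg.startsWith("-agentlib:")
                        || arg.startsWith("-agentpath:")
                        || arg.startsWith("-javaagent:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Options given to the JVM itself (not the arguments passed to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> agents = agentArguments(jvmArgs);
        if (agents.isEmpty()) {
            System.out.println("No agent arguments found on this JVM.");
        } else {
            agents.forEach(arg -> System.out.println("Agent argument: " + arg));
        }
    }
}
```

A check like this can be useful when the argument is set in a startup script or container configuration and it is not obvious whether it reached the process.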

How Is Data Stored

A big challenge when it comes to collecting data directly from the application is storing and securing it. This is especially true if the data collected contains personally identifiable information (PII), such as user names or credit card numbers.

OverOps provides two modes for storing data: hosted and on-premises. In hosted mode, source and variable data is encrypted on your machine using your private 256-bit AES encryption key before it’s sent to OverOps. Your data can only be decrypted by you, using your private encryption key. This is similar to the way you would secure access to an AWS instance using a private key.
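As a rough illustration of encrypting data locally with a 256-bit AES key before it leaves the machine, here is a minimal sketch using the standard `javax.crypto` API. The article only states that a private 256-bit AES key is used. The GCM mode, the 12-byte IV, the message layout and all class and method names below are assumptions made for the example, not a description of OverOps’ actual scheme.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Hypothetical sketch: AES-256 in GCM mode is an assumption, not OverOps' documented scheme.
final class LocalEncryption {
    static final int IV_BYTES = 12;   // commonly recommended IV size for GCM
    static final int TAG_BITS = 128;  // authentication tag length

    // Generates a fresh 256-bit AES key that never has to leave this machine.
    static SecretKey newKey() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        return generator.generateKey();
    }

    // Encrypts the data and returns IV followed by ciphertext (which includes the tag).
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv); // a new random IV for every message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] message = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, message, 0, iv.length);
        System.arraycopy(ciphertext, 0, message, iv.length, ciphertext.length);
        return message;
    }

    // Reads the IV from the front of the message and decrypts the rest.
    static byte[] decrypt(SecretKey key, byte[] message) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] sealed = encrypt(key, "userName=alice".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, sealed), StandardCharsets.UTF_8));
    }
}
```

Only the output of `encrypt` would be sent over the network in a scheme like this. Without the key, the receiving side can store the bytes but cannot read them.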

In on-premises mode, data is not sent to OverOps servers, but to a designated server located on-premises, where it gets stored. When you open an error inside of OverOps, instead of pulling that data into the browser from OverOps servers, it is pulled from the on-premises server using a RESTful API call. You can customize the way in which data is stored (e.g. file system, relational DBs, key / value store), and the method by which users authenticate against it. This enables you to manage data access and retention in accordance with your own organization’s internal policies.


OverOps is trying to improve one of the most basic operations developers do every day – fixing their code when it breaks. You can open up a trial account and test OverOps on your application. Share your notes, thoughts and questions with us in the comments section below – we’d love to hear them!

As a co-founder and CTO, Tal is responsible for overseeing OverOps' product and engineering strategy. Previously, Tal was co-founder and CEO at VisualTao, acquired by Autodesk Inc. (ADSK). Following that, Tal was the Director for the AutoCAD global Cloud and Mobile product line. Plays Jazz drums and Skypes, sometimes simultaneously.
