The OverOps Story: It’s the End of Log Files as We Know It

13th Oct 2016


One click and one second to root cause analysis. See how it all started.

OverOps was built to help developers know when, where and why code breaks in production. After an amazing journey with 300% growth in new customers and a brand new name, it’s time to see how it all started – from the basic idea up to the OverOps we know today.

Pains and Challenges of Scaling

OverOps was founded in 2011, based on an idea that originated from the team’s first company, VisualTao (acquired by Autodesk in 2009).
VisualTao enabled designers and engineers to edit, share, and collaborate over 2D and 3D designs. After its acquisition, VisualTao was relaunched as AutoCAD web and mobile, which became the biggest launch in well over a decade for Autodesk. It was the company’s flagship $1.3B product line, servicing over 20M professional designers and engineers worldwide.
AutoCAD used the best APM and log analyzers on the market to find errors and exceptions in its own app. Despite having those tools, the team still had to sift through massive amounts of log files whenever the application broke, searching for the reason why.
Through the pains and challenges of scaling AutoCAD, the OverOps team knew what they were looking for when facing an error in production.

Making Logs Better

APM tools are great at telling you when web pages render or mobile apps respond slowly, but they don’t show you the actual root cause. When an application breaks, in 9 out of 10 cases it’s the logs that give dev and ops the actual root cause of an error.
The inherent problem with logs is that they contain millions or billions of text events, and require you to search through them. Logs assume that you know what to look for, and that it’s in fact in the log. It’s a highly reactive process.
Errors might lie dormant in log files for weeks, impact the user’s experience and only surface after the damage is done.
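To make the "you have to know what to look for" problem concrete, here is a minimal sketch. The log lines, their format and the `search_logs` helper are all made up for illustration; they do not come from any real system or from OverOps.

```python
import re

# Hypothetical log lines; the format and messages are invented for this example.
LOG_LINES = [
    "2016-10-13 10:00:01 INFO  OrderService - order 1001 accepted",
    "2016-10-13 10:00:02 ERROR OrderService - NullPointerException while pricing order 1002",
    "2016-10-13 10:00:03 WARN  PaymentClient - retrying charge for order 1003",
    "2016-10-13 10:00:04 INFO  PaymentClient - charge for order 1003 gave up after 3 attempts",
]


def search_logs(lines, pattern):
    """Return the lines that match a regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Searching for what we already expect to find works fine.
error_hits = search_logs(LOG_LINES, r"\bERROR\b")

# The failed charge was logged at INFO with unexpected wording,
# so the same search never surfaces it.
missed = [
    line for line in LOG_LINES
    if "gave up" in line and line not in error_hits
]
```

The search only finds failures that were logged the way the searcher expected. The abandoned payment is a real problem for the user, yet it stays invisible until someone thinks to look for that exact wording.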
That’s why the team wasn’t looking for a better log analyzer, but better logs. Instead of getting a massive amount of information in one giant file, we wanted a one-click zoom-in view that includes the actual cause of any error.
Zooming in would provide the complete source code and variable state that caused each error. Zooming out would give a broader look, including data about when new errors were introduced and when critical ones increase. The team wanted to enable this without having to parse and regex through massive amounts of unstructured log data.
Some developers considered it science fiction.

Building OverOps

Most tools today parse issues out of terabytes of log files downstream, where the intelligence of the code has been lost, and try to reconstruct them from text. OverOps uses a completely different technology to detect and analyze issues in real-time within the application.
OverOps uses a micro-agent that operates between the software VM (e.g. JVM, CLR) and the processor. That gives it two “superpowers”:

  • It can see all events in the application, regardless of whether they originate from the application code, 3rd party or JVM code
  • Dev and ops can react 10X quicker than through classic techniques such as bytecode instrumentation or logging

This enables OverOps to “fingerprint” any event as it occurs in real-time in production. You can know exactly whether or not it’s new, when it was introduced, how often it’s been happening, and out of how many calls into the code.
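The idea of fingerprinting can be sketched roughly as follows. This is only an illustration of the general concept, not OverOps' actual implementation: the `fingerprint` function, the `EventTracker` class and the frame strings are all hypothetical, and the sketch assumes an event is identified by its error type plus the code locations on its call stack.

```python
import hashlib
from collections import defaultdict


def fingerprint(error_type, frames):
    """Hash the error type and code locations into a short, stable ID.

    Only the type and call-stack locations are hashed, not the message,
    so the same failure with different data maps to the same fingerprint.
    """
    key = error_type + "|" + "|".join(frames)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


class EventTracker:
    """Tracks, per fingerprint, when it was first seen and how often it occurs."""

    def __init__(self):
        self.first_seen = {}                 # fingerprint -> time of first occurrence
        self.error_count = defaultdict(int)  # fingerprint -> number of occurrences
        self.call_count = defaultdict(int)   # code location -> total calls

    def record_call(self, location):
        self.call_count[location] += 1

    def record_error(self, error_type, frames, timestamp):
        """Record an error and return (fingerprint, is_new)."""
        fp = fingerprint(error_type, frames)
        is_new = fp not in self.first_seen
        if is_new:
            self.first_seen[fp] = timestamp
        self.error_count[fp] += 1
        return fp, is_new

    def failure_rate(self, fp, location):
        """Errors with this fingerprint out of all calls into `location`."""
        calls = self.call_count[location]
        return self.error_count[fp] / calls if calls else 0.0


# Usage: four calls into a (hypothetical) method, two of which fail the same way.
tracker = EventTracker()
frames = ["PricingService.price:42", "OrderController.submit:17"]
for _ in range(4):
    tracker.record_call("PricingService.price")
fp, is_new = tracker.record_error("NullPointerException", frames, timestamp=1000)
fp2, is_new2 = tracker.record_error("NullPointerException", frames, timestamp=1060)
rate = tracker.failure_rate(fp, "PricingService.price")
```

Because the fingerprint ignores the message text, both failures collapse into one event that answers the questions above: it was new at timestamp 1000, it has happened twice, and it affects two out of four calls.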

Meet the Micro-Agent

When the micro-agent detects a critical event, it zooms in and captures the actual source code that was executing within the JVM, and the complete variable state across the entire call stack (including debug log statements that never appear in the logs). That information provides you with the full root cause analysis needed to fix an issue, without having to spend hours or days reproducing it.
This information has literally 100X more variable state data than you would get in a log file, without having to change and redeploy code to capture more state. The agents operate at under 3% CPU.
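As a rough analogy for what "variable state across the entire call stack" means, the sketch below walks a Python traceback and records the local variables of every frame. It is not how the micro-agent works (that operates at the JVM level and needs no code changes, whereas this sketch needs an explicit handler); `capture_state`, `checkout` and `apply_discount` are hypothetical names used only for illustration.

```python
def capture_state(exc):
    """Collect function name, line number and local variables for each
    frame on the exception's call stack, outermost frame first."""
    snapshot = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        snapshot.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return snapshot


# Hypothetical application code, used only to trigger an error.
def apply_discount(price, discount_percent):
    rate = 100 / discount_percent  # fails when discount_percent is 0
    return price / rate


def checkout(order_id, price):
    discount_percent = 0
    return apply_discount(price, discount_percent)


def demo():
    try:
        return checkout(order_id=1002, price=59.0)
    except ZeroDivisionError as exc:
        return capture_state(exc)


state = demo()
innermost = state[-1]  # the frame where the error was raised
```

A typical log line would only say that a division by zero occurred. The snapshot also shows that `discount_percent` was `0` inside `apply_discount` and that the failing call belonged to order `1002`, which is the kind of context that otherwise requires adding log statements and redeploying.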
The tool alerts developers on new or increasing errors within a second of them happening in production. The alerts can be sent out directly through Slack, HipChat, PagerDuty or JIRA and include a link directly to the complete source code, variables and debug-level log statements.
This enables customers to move from issues lying dormant for days or weeks and then erupting, to detecting them in seconds. The dashboard gives developers the information they need to fix these errors in one click, without having to go through the back-and-forth with ops to get logs and reproduce an error. One second, one click: it’s that easy.
OverOps runs as SaaS, Hybrid or fully on-premises, and requires no changes to code or build.
We have 150+ customers such as Amdocs, TripAdvisor, RMS and others.
If this sounds interesting to you, sign up for a free 14-day trial.

As a co-founder and CTO, Tal is responsible for overseeing OverOps' product and engineering strategy. Previously, Tal was co-founder and CEO at VisualTao, acquired by Autodesk Inc. (ADSK). Following that, Tal was the Director for the AutoCAD global Cloud and Mobile product line. Plays Jazz drums and Skypes, sometimes simultaneously.
