5 Ways OverOps Customers Save Weeks of Work

 ● 17th Nov 2019

Hello OverOps Community! Jonathan Seper here with my first ever blog post. I’ve been on the Customer Success Team at OverOps for two years now, working mostly with our existing customers to help them navigate our software and resources to enable massive success.

One of the things I love most is seeing technology succeed and have a real impact on the world – not just OverOps software, but the unique applications our users are building.

Every day I see our customers creating better end-user experiences, winning or creating new markets (in particular with the move to offering cloud-based solutions), finding peace in their DevOps practice, implementing high innovation with low risk, reducing costs, and – most of all – making the world a better place using software.

I’m seeing products like economical banking for disenfranchised persons, easier and cheaper delivery of healthcare, and smarter AI enabling safer working conditions for labourers. Our customers are doing all these things and more.

Over the years, working with our customers (and the amazingly smart team here at OverOps), I’ve collected this list of top 5 “hacks” for OverOps Customers to maximize the value they get from us:

1. Leverage Low-Risk Agent Technology

It should go without saying, but we want to be the ones solving your problems, not creating them. Our agents are ultra-lightweight and send data to a centralized collector to keep overhead to a minimum. During implementation, our Customer Success team walks every customer through a phased approach: starting in a lower environment, architecting correctly, and always confirming the performance impact before deploying to production.

Furthermore, the OverOps agent is easy to turn off without causing dependency issues in your own application. One benefit of this is the ability to do sampling. With a reasonable amount of coverage of similar types of services, we collect the data you need to ensure application reliability for your users. Plus, if you’re able to shift left and fix code issues before they reach production, then why wouldn’t you? More on that in #5…

2. Use Remote Collectors to Boost Performance

Almost all of our customers are moving towards microservices, which usually run on smaller-capacity units of compute.

Years ago, our standard was what we called a local collector that ran on the JVM, along with our Agent, at a one-to-one ratio (indeed this is still the default on our free trial). The current best practice is to leverage a separate environment with a remote collector.

This architecture ensures that CPU-intensive activities such as data redaction and code fingerprinting happen away from your application. One collector can handle multiple environments and hundreds of agents, which helps maintain our ultra-low performance impact so we can run well in production.

3. Access True Root Cause Directly from Logs and APM

Our Automated Root Cause (ARC) Screen is perhaps one of the biggest differentiators between us and the rest of the tools on the market. The ARC screen shows the source code and variable state where the error occurred. It also includes our LogView, which provides TRACE- and DEBUG-level logging without having to manually enable those logging levels within your application.

Pretty much everything we do is centered around putting this data into the hands of your developers at the right time. TinyLinks embedded into your log files are the most popular and powerful way to access the ARC Screen. We automatically insert our links directly into your existing logs at runtime so it doesn’t depend on which logging or APM tool you are using.

The most powerful aspect of this is that your developers do not need to learn and check new user interfaces, because the data they need to fix any issue is right within the tools they are already using. In many cases, they don’t even know they are using OverOps.

OverOps’ TinyLinks as they appear in Splunk logs

OverOps’ TinyLinks as they appear in New Relic

This particular customer (below) is accessing OverOps’ Root Cause right from their logs over 50% of the time (Open Hits are views of the ARC screen).

OverOps customer usage data

One important caveat – OverOps also catches swallowed and uncaught exceptions, so embedding us into your log files just scratches the surface. To take full advantage of OverOps’ capabilities, you still need to set up a mechanism for your teams to see those events for which no logs exist. This may include views and intelligent alerts, code quality gates, etc.

4. Reduce Alert Fatigue with Intelligent Alerts

Java, and legacy code in general, can be very noisy with lots of exceptions being thrown and no clear logic as to what should take priority. Customers have told me ‘Jonathan, there is nothing I would like more than to shut down our app for two years and rewrite everything, but that is simply not possible.’

Our new Reliability Dashboards are the answer to the noise. With an added layer of AI that automatically prioritizes what is important to fix based on your own criteria, things like new and resurfaced errors are fairly obvious. As one customer told us, “My developers don’t need to figure out or guess what to work on every day, instead they are just handed the few critical/new issues and how to fix them.”

OverOps Reliability Dashboard

OverOps identifies issues at runtime and classifies them based on the following criteria:

  • New – what are the new errors that didn’t exist in the previous release?
  • Critical – what are the most urgent errors to address?
  • Increasing – which errors increased in rate since the previous build?
  • Slowdowns – which transactions are taking longer than usual to execute?
  • Resurfaced – which errors happened again although they were supposed to be resolved?
  • Volume – has the total number of errors exceeded a predefined threshold?
  • Unique – how many unique errors are happening?

That said, it’s important to consider what is unique and important to your app. The scoring system for critical issues can easily be adjusted on the settings page so that certain parts of your application or certain error types carry more weight in the scores and alerts. Exceptions that are ‘Customer Impacting’ are often key.
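To make the idea of weighted scoring concrete, here is a minimal Python sketch. It is not OverOps' actual scoring algorithm or API. The category names mirror the list above, while the weights, the customer-impacting multiplier, the `Event` class and the function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights per category; in practice these would be tuned
# in the product settings rather than hard-coded.
CATEGORY_WEIGHTS = {
    "new": 3.0,
    "critical": 5.0,
    "increasing": 2.0,
    "slowdown": 2.0,
    "resurfaced": 4.0,
}

# Hypothetical multiplier for events tagged as customer impacting.
CUSTOMER_IMPACTING_MULTIPLIER = 2.0


@dataclass(frozen=True)
class Event:
    name: str
    categories: frozenset
    customer_impacting: bool = False


def score(event):
    """Sum the weights of every category the event falls into."""
    base = sum(CATEGORY_WEIGHTS.get(c, 0.0) for c in event.categories)
    if event.customer_impacting:
        base *= CUSTOMER_IMPACTING_MULTIPLIER
    return base


def prioritize(events):
    """Return events sorted from highest to lowest score."""
    return sorted(events, key=score, reverse=True)


# Example data (invented for this sketch).
events = [
    Event("NullPointerException in CartService", frozenset({"new"})),
    Event("TimeoutException in PaymentClient",
          frozenset({"critical", "resurfaced"}), customer_impacting=True),
    Event("NumberFormatException in ReportJob", frozenset({"increasing"})),
]
for e in prioritize(events):
    print(f"{score(e):5.1f}  {e.name}")
```

The point of the sketch is that raising one weight, or tagging one class of exception as customer impacting, is enough to move an event to the top of the list without touching anything else.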

For example, one e-commerce customer wants to know if any response times start to approach 4 seconds, as their revenue immediately tanks when this happens. They can easily mark this as a critical event. Monitoring SLAs and specific new customers is another common ask.
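A rule like the 4-second one could be sketched as follows. This is only an illustration in Python, not how OverOps evaluates slowdowns. The 4-second limit comes from the example above, while the choice of the 95th percentile, the 80% "approaching" ratio and the function names are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_critical(response_times, limit=4.0, warn_ratio=0.8, pct=95):
    """Flag a transaction once its tail latency gets close to the limit.

    limit      -- response time in seconds at which revenue suffers
    warn_ratio -- hypothetical: treat 80% of the limit as "approaching"
    pct        -- hypothetical: judge by the 95th percentile, not the mean
    """
    return percentile(response_times, pct) >= limit * warn_ratio


print(is_critical([1.0, 1.2, 3.5]))  # tail latency is near the limit
print(is_critical([1.0, 1.2, 2.0]))  # comfortably below the limit
```

Judging by a high percentile rather than the average is a deliberate choice here: a small share of slow checkouts can hurt revenue long before the mean response time moves.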

Other customers use us to make their AI smarter (e.g. aligning sub-20 millisecond responses with their ‘bid vs win’ ratio). Another example is events related to the data storage and ingestion layer. Most customers want to be extra vigilant here because they not only need to fix the error, but also need to go back and retroactively correct the bad data, which can be very difficult and time consuming.

Data ingestion is often key for our customers when onboarding their own customers. The move to cloud-based solutions is very compelling, but the onboarded data MUST be accurate, and this is a key use case for most of our customers.

5. Shift Left – or Right

About a quarter of our customers are now using OverOps specifically in pre-production environments as part of their shift left strategy (because it’s way cheaper to fix code before it goes into production). Another quarter of our customers use us in production to identify and resolve issues before they affect their users. Half use us in both places.

The value provided in each of these use cases (test and production, respectively) is distinct. On the one hand, you shift left and don’t promote bad code; on the other hand, you find and fix issues in production before they impact customer experience.

OverOps Reliability Report

Using recently released CI/CD integrations, teams can leverage OverOps data in pre-production to identify errors missed by QA and test automation and automatically block critical issues from being promoted.

This capability is supported by an open-source scoring system that takes into account new or increasing errors and slowdowns to determine the stability of a release. We offer 6 code quality gate types; most customers start with a focus on New and Critical.
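As a rough illustration of how a quality gate can block a promotion, here is a minimal Python sketch. It does not use the OverOps CI/CD integrations or their real report format. The `report` dictionary, the gate names as keys and both function names are hypothetical stand-ins for whatever your pipeline actually receives.

```python
def evaluate_gates(report, enabled_gates=("new", "critical")):
    """Return the enabled gates that this release fails.

    report        -- hypothetical mapping of gate name to number of
                     offending events found in the release
    enabled_gates -- gates the team has chosen to enforce; this sketch
                     starts with New and Critical only
    """
    return [gate for gate in enabled_gates if report.get(gate, 0) > 0]


def exit_code(report, enabled_gates=("new", "critical")):
    """Map the gate result to a process exit code for a CI step."""
    return 1 if evaluate_gates(report, enabled_gates) else 0


# Example data (invented for this sketch).
report = {"new": 2, "critical": 0, "resurfaced": 1}
failed = evaluate_gates(report)
if failed:
    print("Blocking promotion; failed gates: " + ", ".join(failed))
else:
    print("All gates passed")
```

In a pipeline, the CI step would typically pass `exit_code(report)` to `sys.exit`, so that a non-zero result stops the build from being promoted. Note that the resurfaced error in the example does not block anything until that gate is added to `enabled_gates`, which is how a team can start narrow and tighten over time.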

Final Thoughts

Our strongest offering to our customers is the data we’re able to provide them. Just like with any other massive influx of data, it’s crucial to have a measured approach for managing the noise. These “hacks” are not really hacks at all, but a collection of some of our best practices for getting the most out of OverOps.

I hope this has given you some useful insight into how to get the most out of the OverOps Platform and the data we collect for you. For a more in-depth look at any or all of these great features, request a demo with a member of our Solutions Engineering team or try it out for yourself.

We look forward to working with you all. If there is ever anything I can do personally to help, or if you have ideas on any of the above, I would love to discuss them with you in the comments below or via email – jonathan.seper@overops.com.

Jonathan is the VP of Customer Success at OverOps, helping to onboard new customers and ensure their continued success. He is a seasoned professional who truly enjoys supporting teams to create disruptive technology. In his free time, you can find him fly fishing with friends and family.
