OverOps #OpsView: Introducing Microservice and Server Views

 ● 07th Feb 2017

New OverOps feature: #opsview and error locations at the microservice level.

Present and Future State

Up until now with OverOps you would use the main dashboard (app.overops.com) to see the number of log errors, warnings, HTTP errors and caught / uncaught exceptions impacting your environment as a whole. Views such as “New Today”, “New this Week”, and “Logged Errors” help you slice the data to focus on the errors which are most critical to you. You can further create your own custom views by using private and team-wide filters. You could use these, for example, to show all AWS errors or NullPointerExceptions from a specific app or code package that you own.
For each view you can easily set an alert to proactively know when a new error is introduced (e.g. you just introduced a new IndexOutOfBoundsException) or when an error exceeds its allowed threshold (e.g. alert me any time a NullPointerException occurs, or when more than 3 OutOfMemoryErrors occur), and route that alert to the right people through a Slack channel, HipChat room, JIRA project and more.
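
To make the two alert conditions concrete, here is a minimal sketch of the logic in Python. It is not the OverOps API: the function name, the `known_types` set and the `thresholds` mapping are all assumptions made for illustration.

```python
from collections import Counter

def alerts(events, known_types, thresholds):
    """Return alert messages for new error types and exceeded thresholds.

    events:      list of error type names seen in the current window
    known_types: error types already seen before this window (hypothetical)
    thresholds:  allowed occurrences per error type (hypothetical)
    """
    counts = Counter(events)
    messages = []
    for error_type, count in sorted(counts.items()):
        # Condition 1: an error type we have never seen before.
        if error_type not in known_types:
            messages.append(f"new error: {error_type}")
        # Condition 2: a known limit was exceeded.
        limit = thresholds.get(error_type)
        if limit is not None and count > limit:
            messages.append(
                f"{error_type} exceeded threshold {limit} ({count} occurrences)"
            )
    return messages

events = ["NullPointerException"] + ["OutOfMemoryError"] * 4 + ["IndexOutOfBoundsException"]
known = {"NullPointerException", "OutOfMemoryError"}
# A threshold of 0 means "alert on any occurrence".
limits = {"NullPointerException": 0, "OutOfMemoryError": 3}

for message in alerts(events, known, limits):
    print(message)
```

Routing each message to Slack, HipChat or JIRA would be a separate step and is left out of the sketch.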

Microservice Views in 3 Quick Steps

To make things even more powerful, today we delivered the next step in this story by enabling you to see errors broken down by microservices and servers. This new capability enables you to do some pretty powerful things, and get real-time answers to questions such as:

  • Did we introduce any new errors into any of our microservices during our last deployment?
  • Are any of our microservices encountering a spike in the number of log errors over the last hour, and if so, which ones and why?
  • Did any of our microservices encounter any uncaught exceptions today?

You can access all of this information using the new Application / Server / Error grouping mode selector. Through it you can aggregate data based on the microservice or machine from which errors are coming. Let’s see this puppy in action:

Step 1: Find Where the Problem Is – Microservices in Action

Within the “Application” grouping mode we see all the error data within the environment broken down by microservice. We can see a spike in the overall number of errors in the environment and immediately see that it’s coming from JVMs in the “prod-SQS-5” microservice.
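
Conceptually, a grouping mode is an aggregation of the same error events under a different key. The sketch below illustrates that idea only. The event fields, the server names and the `billing` application are hypothetical, and this is not how OverOps stores or queries its data.

```python
from collections import Counter

# Hypothetical error events; field names and values are illustrative only.
EVENTS = [
    {"application": "prod-SQS-5", "server": "host-a", "type": "log error"},
    {"application": "prod-SQS-5", "server": "host-b", "type": "log error"},
    {"application": "prod-SQS-5", "server": "host-a", "type": "uncaught exception"},
    {"application": "billing", "server": "host-c", "type": "log error"},
]

def count_by(events, grouping_key):
    """Count events per value of grouping_key ("application" or "server")."""
    return Counter(event[grouping_key] for event in events)

by_application = count_by(EVENTS, "application")
by_server = count_by(EVENTS, "server")

# The group with the most errors is the first place to look.
noisiest, total = by_application.most_common(1)[0]
print(noisiest, total)
```

Switching the key from `"application"` to `"server"` gives the per-machine breakdown from the same data.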

Step 2: Find What the Problem Is – Zooming into a Microservice

With one click on the “prod-SQS-5” row we can zoom in and see a deduplicated list of all the log errors occurring in the microservice, based on where they’re emitted from in the code. Even if an error happened thousands or millions of times, OverOps will automatically reduce it into a single metric that we can easily see.
The list shows the exact log error that’s causing the spike.
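
Deduplicating by code location can be pictured as counting events under a key built from where the error was emitted, not from its message text. The following is a rough sketch under that assumption. The class names, methods and line numbers are made up, and the real product may key on different attributes.

```python
from collections import Counter

# Hypothetical log errors; three share a code location but differ in message.
LOG_ERRORS = [
    {"class": "QueueReader", "method": "poll", "line": 88, "message": "timeout after 30s"},
    {"class": "QueueReader", "method": "poll", "line": 88, "message": "timeout after 31s"},
    {"class": "QueueReader", "method": "poll", "line": 88, "message": "timeout after 29s"},
    {"class": "Parser", "method": "parse", "line": 12, "message": "bad header"},
]

def deduplicate(events):
    """Collapse events into (code location, occurrence count), most frequent first."""
    counts = Counter((e["class"], e["method"], e["line"]) for e in events)
    return counts.most_common()

for location, occurrences in deduplicate(LOG_ERRORS):
    print(location, occurrences)
```

Because the message text is not part of the key, variations in the logged values do not split one error into many rows.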

Step 3: Find How to Fix the Problem – Getting to the Source Code and Variable State

With one click on the spiking error we can jump in and see the complete source code and variable state behind it, including its preceding 250 lines of DEBUG-level log statements and the memory state of the JVM at that moment. That’s all the information we need to reproduce and fix the issue. Sweet!

Quick Recap

That was pretty cool: with three clicks we went from a 20,000 ft view of our environment, which may include hundreds of servers, JVMs and microservices, to the cause of an issue on a critical microservice, and then drilled in to see the complete code and variable state behind it.
This allows us to move directly from an Ops role, where we look for spikes, anomalies and the introduction of new errors into our environment, to a developer role focused on fixing the issue.
And to make things even better, this new feature requires no updates to the OverOps agent, so you can begin enjoying it right now!

Bonus round

Those of you with a keen eye for detail will notice that the new “Grouping Mode” selector has a fourth grouping mode called “Deployments”. With this grouping mode, which we’ll be releasing soon, you’ll be able to intersect issues with the microservices they impact and the deployment in which they were introduced. Each error will be tied to the specific deployment from which it originated.
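
One way to think about tying an error to its originating deployment is to record the earliest deployment in which each error was seen. The sketch below shows that idea only. The deployment names `v41` and `v42`, the event fields and both function names are assumptions, and the upcoming feature may work differently.

```python
def first_seen(events, deployment_order):
    """Map each error to the earliest deployment in which it appeared."""
    rank = {name: i for i, name in enumerate(deployment_order)}
    origin = {}
    for event in events:
        error, deployment = event["error"], event["deployment"]
        if error not in origin or rank[deployment] < rank[origin[error]]:
            origin[error] = deployment
    return origin

def introduced_in(events, deployment_order, deployment):
    """List the errors whose first appearance was in the given deployment."""
    origin = first_seen(events, deployment_order)
    return sorted(e for e, d in origin.items() if d == deployment)

# Hypothetical data: deployments listed oldest first.
DEPLOYMENTS = ["v41", "v42"]
EVENTS = [
    {"error": "QueueReader.poll:88", "deployment": "v42"},
    {"error": "Parser.parse:12", "deployment": "v41"},
    {"error": "Parser.parse:12", "deployment": "v42"},
    {"error": "QueueReader.poll:88", "deployment": "v42"},
]

print(introduced_in(EVENTS, DEPLOYMENTS, "v42"))
```

An error that also occurred in an older deployment, like `Parser.parse:12` here, is not counted as introduced by the newer one.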

Final Thoughts

To bridge the gap between Ops and dev, Ops can see whether a new deployment increased the number of errors in a target group of machines or applications, and compare that to previous releases. From a dev perspective, you can see exactly which new issues were introduced with a new release, either in staging or production, zoom in to see the code and variable state behind them, and resolve errors before they impact users.
Stay tuned for Deployments Mode, which is slated for release soon. In the meantime, if you have any questions or comments, let us know in the comments section below 🙂

As a co-founder and CTO, Tal is responsible for overseeing OverOps' product and engineering strategy. Previously, Tal was co-founder and CEO at VisualTao, acquired by Autodesk Inc. (ADSK). Following that, Tal was the Director for the AutoCAD global Cloud and Mobile product line. Plays Jazz drums and Skypes, sometimes simultaneously.
