New OverOps feature: #opsview and error locations at the microservice level.
Present and Future State
Up until now with OverOps you would use the main dashboard (app.overops.com) to see the number of log errors, warnings, HTTP errors and caught / uncaught exceptions impacting your environment as a whole. Views such as “New Today”, “New this Week”, and “Logged Errors” help you slice the data to focus on the errors that are most critical to you. You can also create your own custom views using private and team-wide filters. For example, you could show all AWS errors, or all NullPointerExceptions from a specific app or code package that you own.
For each view you can easily set an alert so that you know right away when a new error is introduced (e.g. a new IndexOutOfBoundsException appears) or when an error exceeds its allowed threshold (e.g. alert me any time a NullPointerException occurs, or when more than 3 OutOfMemoryErrors occur). You can route that alert to the right people through a Slack channel, HipChat room, JIRA project and more.
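To make the two kinds of alert concrete, here is a minimal sketch in Python of how such rules could be evaluated over one time window. It is an illustration only, not the OverOps implementation or API: `ThresholdRule`, `evaluate_alerts` and the sample data are all hypothetical names made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: alert when an error type occurs more than max_count times."""
    error_type: str
    max_count: int


def evaluate_alerts(known_types, window_events, rules):
    """Return alert messages for one time window.

    known_types   -- error types already seen before this window
    window_events -- error type names observed during the window
    rules         -- ThresholdRule objects to check
    """
    counts = Counter(window_events)
    alerts = []

    # Kind 1: an error type that has never been seen before.
    for error_type in sorted(set(counts) - set(known_types)):
        alerts.append(f"new error introduced: {error_type}")

    # Kind 2: an error type that exceeded its allowed count.
    for rule in rules:
        if counts[rule.error_type] > rule.max_count:
            alerts.append(
                f"{rule.error_type} occurred {counts[rule.error_type]} times "
                f"(allowed: {rule.max_count})"
            )
    return alerts


alerts = evaluate_alerts(
    known_types={"NullPointerException", "OutOfMemoryError"},
    window_events=["IndexOutOfBoundsException", "NullPointerException"]
    + ["OutOfMemoryError"] * 4,
    rules=[
        ThresholdRule("NullPointerException", 0),  # alert on any occurrence
        ThresholdRule("OutOfMemoryError", 3),      # alert on more than 3
    ],
)
for message in alerts:
    print(message)
```

Routing each message to Slack, HipChat or JIRA would be a separate step and is left out of the sketch.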
Microservice Views in 3 Quick Steps
To make things even more powerful, today we delivered the next step in this story by enabling you to see errors broken down by microservices and servers. This new capability gives you real-time answers to questions such as:
- Did we introduce any new errors into any of our microservices during our last deployment?
- Are any of our microservices encountering a spike in the number of log errors over the last hour, and if so, which ones and why?
- Did any of our microservices encounter any uncaught exceptions today?
You can access all of this information using the new Application / Server / Error grouping mode selector. Through it you can aggregate data based on the microservice or machine from which errors are coming. Let’s see this puppy in action:
Step 1: Find Where the Problem Is – Microservices in Action
Within the “Application” grouping mode we see all the error data within the environment broken down by microservice. We can see a spike in the overall number of errors in the environment and immediately see that it’s coming from JVMs in the “prod-SQS-5” microservice.
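Conceptually, the “Application” grouping mode is a group-by over error events. The Python sketch below shows the idea under stated assumptions: events are simple `(timestamp, application)` pairs, a spike means the current hour exceeds three times the average of earlier hours, and `prod-web-1` is a made-up second microservice. None of this reflects how OverOps computes its views internally.

```python
from collections import defaultdict


def errors_by_application(events, bucket_seconds=3600):
    """Count errors per application and per time bucket (one hour by default)."""
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, application in events:
        counts[application][timestamp // bucket_seconds] += 1
    return counts


def spiking_applications(counts, current_bucket, factor=3.0):
    """Return applications whose current bucket exceeds `factor` times
    their average count over earlier buckets."""
    spiking = []
    for application, buckets in counts.items():
        earlier = [n for b, n in buckets.items() if b < current_bucket]
        baseline = sum(earlier) / len(earlier) if earlier else 0
        current = buckets.get(current_bucket, 0)
        # Floor the baseline at 1 so very quiet services are not flagged
        # for going from zero errors to one or two.
        if current > factor * max(baseline, 1):
            spiking.append(application)
    return sorted(spiking)


# Sample data: prod-SQS-5 jumps from 2 errors per hour to 10 in the third hour,
# while the hypothetical prod-web-1 stays flat at 2 per hour.
events = []
for bucket, sqs_count in enumerate([2, 2, 10]):
    events += [(bucket * 3600 + i, "prod-SQS-5") for i in range(sqs_count)]
    events += [(bucket * 3600 + i, "prod-web-1") for i in range(2)]

counts = errors_by_application(events)
spiking = spiking_applications(counts, current_bucket=2)
print(spiking)
```

Grouping by server instead of by application would only change the key used in `errors_by_application`.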
Step 2: Find What the Problem Is – Zooming into a Microservice
With one click on the “prod-SQS-5” row we can zoom in and see a deduplicated list of all the log errors occurring in the microservice, based on where they’re emitted from in the code. Even if an error happened thousands or millions of times, OverOps automatically reduces it to a single entry that we can easily see.
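The deduplication idea can be sketched in a few lines of Python: key each log error by the code location that emitted it, rather than by its message, and count occurrences per key. The field names (`class`, `method`, `line`) and the sample class `SqsConsumer` are hypothetical, and this is an illustration of the concept rather than the OverOps implementation.

```python
from collections import Counter


def deduplicate(log_errors):
    """Collapse log errors emitted from the same code location into one
    entry with an occurrence count, most frequent first."""
    counts = Counter((e["class"], e["method"], e["line"]) for e in log_errors)
    return counts.most_common()


# 1,000 errors with different messages but the same origin, plus 2 from elsewhere.
log_errors = [
    {"class": "SqsConsumer", "method": "poll", "line": 42,
     "message": f"Failed to read message {i}"}
    for i in range(1000)
] + [
    {"class": "SqsConsumer", "method": "ack", "line": 77,
     "message": "Ack timed out"}
] * 2

deduplicated = deduplicate(log_errors)
for location, count in deduplicated:
    print(location, count)
```

Because the message text is not part of the key, messages that embed IDs or timestamps still collapse into a single row.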
We can see the exact log error that’s causing the spike:
Step 3: Find How to Fix the Problem – Getting to the Source Code and Variable State
With one click on the spiking error we can jump in and see the complete source code and variable state behind it, including the 250 lines of DEBUG-level log statements that preceded it and the memory state of the JVM at that moment. That’s all the information we need to reproduce and fix the issue. Sweet!
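To picture what “the preceding 250 log statements” means, one common way to keep a fixed-size window of recent lines is a ring buffer. The Python sketch below uses `collections.deque` with `maxlen` for this. It is an analogy only: `RecentLogBuffer` is a hypothetical class, and nothing here describes how the OverOps agent actually captures log statements.

```python
from collections import deque


class RecentLogBuffer:
    """Keeps only the most recent `capacity` log lines (hypothetical helper)."""

    def __init__(self, capacity=250):
        # A deque with maxlen silently drops the oldest item when full.
        self._lines = deque(maxlen=capacity)

    def record(self, line):
        self._lines.append(line)

    def snapshot(self):
        """Return the buffered lines, oldest first, e.g. when an error fires."""
        return list(self._lines)


buffer = RecentLogBuffer(capacity=250)
for i in range(1000):
    buffer.record(f"DEBUG step {i}")

# At the moment of the error, grab the context that led up to it.
context = buffer.snapshot()
print(len(context), context[0], context[-1])
```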
That was pretty cool. With three clicks we went from a 20,000 ft view of our environment, which may include hundreds of servers, JVMs and microservices, to the cause of an issue on a critical microservice, and then drilled in to see the complete code and variable state that caused it.
This allows us to move directly from an Ops role where we look for spikes, anomalies and introduction of new errors into our environment, to a developer role focused on fixing the issue.
And to make things even better, this new feature requires no updates to the OverOps agent, so you can begin enjoying it right now!
Those of you with a keen eye for detail will notice that the new “Grouping Mode” selector has a fourth grouping mode called “Deployments”. With this grouping mode, which we’ll be releasing soon, you’ll be able to intersect issues with the microservices they impact and the deployment in which they were introduced. Each error will be tied to the specific deployment from which it originated.
To bridge the gap between Ops and dev, Ops can see whether a new deployment increased the number of errors in a target group of machines or applications, and compare that to previous releases. From a dev perspective, you can see exactly which new issues were introduced with a new release, either in staging or production, zoom in to see the code and variable state behind them, and resolve errors before they impact users.
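Since Deployments mode has not been released yet, the following Python sketch only illustrates the kind of comparison described above, not the feature itself. It assumes each deployment is summarized as a mapping from error location to occurrence count; `compare_deployments` and the sample locations are hypothetical.

```python
def compare_deployments(previous, current):
    """Compare two deployments, each given as {error_location: count}.

    Returns the errors that first appear in `current` and the change in
    the total error count between the two deployments.
    """
    new_errors = sorted(set(current) - set(previous))
    total_change = sum(current.values()) - sum(previous.values())
    return {"new_errors": new_errors, "total_change": total_change}


previous_release = {"OrderService.submit:88": 5}
current_release = {"OrderService.submit:88": 4, "CartService.add:17": 9}

report = compare_deployments(previous_release, current_release)
print(report)
```

The `total_change` figure answers the Ops question (did this release raise the error count?), while `new_errors` answers the dev question (which issues did this release introduce?).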
Stay tuned for Deployments Mode, which is slated for release soon. In the meantime, if you have any questions or comments, let us know in the comments section below 🙂