How to Overcome Kubernetes Monitoring Challenges

14th May 2019

5 min read

As more and more companies pivot to providing software as a service, end users have come to expect that ‘service’ to be always available, fast, responsive, error-free, and continuously updated with new features.

Containerized microservices enable developers and DevOps engineers to meet these demands. Microservices are simple to develop, test, deploy, and scale, but they’re not without their own challenges.

Each microservice must be individually configured, deployed, and monitored. This is no small task as the number of microservices for any given application can be considerable. Manually managing the entire system quickly becomes intractable, requiring sophisticated automation just to maintain it. Kubernetes solves many of these problems, making it easy to both create and manage a cluster, as well as individual containers, deployments, and services running on that cluster.

What is Kubernetes?

Kubernetes (k8s) is an open source container management system for automating the deployment, scaling, and management of containerized applications, and it has quickly become by far the most popular container management system available. An open source Cloud Native Computing Foundation project based on Google’s Borg, Kubernetes has one of the largest open source communities, backed by thousands of contributors and top enterprise companies including Microsoft, Google, and Amazon. Kubernetes runs in the cloud, in hybrid data centers, and in on-premises data centers, allowing maximum flexibility without vendor lock-in.

With the rise of infrastructure as a service (IaaS) and infrastructure as code (IaC), we now have the ability to provision resources and services on demand, and we can apply software development practices and tools to operations, allowing for version control, code review, and CI/CD integration at the infrastructure level.
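As a concrete illustration of infrastructure as code, a Kubernetes Deployment manifest like the one below can be checked into version control, reviewed, and applied through a CI/CD pipeline just like application source. This is a minimal sketch; the `web` name and container image are hypothetical:

```yaml
# Hypothetical Deployment manifest. Stored in version control, it becomes
# reviewable, diffable infrastructure as code.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3              # Kubernetes keeps three pods running at all times
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # hypothetical image
          ports:
            - containerPort: 8080
```

Applying the manifest (for example with `kubectl apply -f web-deployment.yaml`) tells the cluster the desired state, and Kubernetes continuously works to keep the running system matching it.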

Monitoring Challenges

With so many moving pieces, getting to the bottom of an issue in large distributed systems is challenging. Compared to a traditional environment, a Kubernetes cluster tends to have significantly more servers and services, and therefore more logs and other areas to investigate when something goes wrong.

Where in a traditional monolithic environment one might need to search through a log or two, with microservices one must search through many more logs – one or more for each microservice involved in the issue being troubleshot. Sifting through logs from so many services is time consuming and often unhelpful for discovering the true root cause of the issue.

Likewise, where before there may have been only a handful of servers and services involved in any single transaction, in Kubernetes there are usually many more components involved. To determine which microservices to investigate, tracing headers are often added to each transaction, making it easier to discover which microservices were involved and which ultimately failed. Unfortunately, adding these headers requires a code change, and even when we know which service failed, we still must resort to logs to try to discover why.
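The tracing idea described above can be sketched in a few lines: each service forwards an incoming correlation ID (or mints one at the edge) so every log line and downstream call in a transaction can be tied together. This is a minimal sketch, not a production tracer; the `X-Request-ID` header name is a common convention, and the function names here are hypothetical:

```python
import uuid

def handle_request(incoming_headers):
    """Propagate (or mint) a correlation ID for this transaction.

    Returns the trace ID to stamp on log lines, plus the headers to
    attach to any downstream service calls.
    """
    # Reuse the caller's ID if present; otherwise this service is the
    # entry point and generates a fresh one.
    trace_id = incoming_headers.get("X-Request-ID") or str(uuid.uuid4())
    outgoing_headers = {"X-Request-ID": trace_id}
    return trace_id, outgoing_headers
```

Because the same ID appears in every service's logs for a given transaction, a single search for that ID reconstructs the request's path through the cluster – which is exactly why the code change is worth making.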

Monitoring Solutions

Monitoring solutions for Kubernetes aren’t much different from traditional monitoring tools. For example, those using Splunk in a traditional environment can continue using Splunk in their cluster. Similarly, ELK users can use the popular open source EFK logging stack to aggregate and search through logs cluster-wide.

APM tools and popular open source tools including Prometheus and Zipkin enable monitoring and tracing, providing significant insight into resource consumption and transaction flow through the system. While these tools will tell you which service is consuming too much memory or CPU, or which service failed with the dreaded generic “500: Internal Server Error,” they won’t tell you why a failure occurred.
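A Prometheus alerting rule makes this limitation concrete: it can flag the symptom – say, a spike in 500 responses – but not the cause. The rule below is illustrative only; the metric and label names (`http_requests_total`, `status`, `service`) and the 5% threshold are assumptions that depend on how your services are instrumented:

```yaml
# Illustrative Prometheus alerting rule: fires when a service's 5xx rate
# exceeds 5% of its requests for five minutes. It tells you *which*
# service is failing, not *why*.
groups:
  - name: service-errors
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} is returning >5% 5xx responses"
```

Once the alert fires, the investigation still falls back to logs and traces to explain the failure – the gap the next section addresses.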

Despite improved tools geared specifically towards Kubernetes, it remains difficult to get to the true root cause of any given problem. Logs ultimately rely on developer insights to know what to log. Properly implemented, tracing will reveal in which service an error originated, however we’re still left to discover why the error occurred in the first place. APM and monitoring tools may help in discovering if a container is consuming too many resources, but again we’re left without an answer as to why.

Why OverOps is Different

OverOps continuously monitors microservices for anomalies at the JVM-level, detecting issues without relying on logs or other metrics. When paired with monitoring and logging tools, OverOps augments these tools, providing the data needed to discover the true root cause of an issue.

OverOps reveals why individual microservices fail, capturing not only logged errors, but also caught and uncaught exceptions. OverOps captures the full JVM state and exact variable values at the time of the issue, and pinpoints the specific container, deployment, and line of code that caused the failure. Our new Reliability Dashboards take it a step further by detecting anomalies such as slowdowns, new errors, and increasing error rates. Best of all, OverOps is built for the enterprise, able to scale horizontally to monitor thousands of JVMs concurrently, and provides real-time insights across your entire cluster.


Dave is an Integration Engineer at OverOps with nearly a decade of software experience in a variety of engineering roles including frontend, fullstack, and devops. Dave’s focus lately has been on breaking down legacy monolithic apps into containerized microservices running in Kubernetes. In his free time, you’ll find him skiing and hiking in the mountains of Colorado.
