The quest towards a complete monitoring dashboard is peaking, and your tools want to help you achieve it
During the last few months, we’ve taken an in-depth look at Application Performance Management (APM) tools and at log management tools and analyzers, to better understand how to use the data we get from them. While it’s not a new trend, companies seem to have realized recently that these two elements are inseparable and should be combined under one roof.
In the following post, we’ll try to understand the motivations behind this move, what’s on offer and whether it’s indeed the complete solution. Spoiler: something is missing. But let’s not get ahead of ourselves.
What’s the difference between APM and logs?
Application Performance Management (APM) tools and log management tools share the same bottom-line goal – they want to help you gain a deeper understanding of what’s happening within your app. While the goal is similar, the execution is different, and each focuses on a different aspect of the application.
APM tools provide analytics around application performance. These analytics can include how long it takes to execute different elements within the code, how long it takes for certain transactions to complete and so on.
These tools also show us how users experience our application. They let us monitor production environments and application load (transactions, requests and pages per second), calculate response times and get a general indication of what might be causing a delayed response.
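To make the idea of timing a transaction concrete, here is a minimal sketch in Python. It is not how any particular APM product works; the `timed` helper, the `timings` dictionary and the `checkout` function are all hypothetical names used for illustration, and a real agent would ship its measurements to a backend instead of keeping them in memory.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store: transaction name -> list of durations in seconds.
timings = {}


@contextmanager
def timed(name):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - start)


def checkout():
    # Stand-in for real work such as a database call or an HTTP request.
    time.sleep(0.01)


with timed("checkout"):
    checkout()

# Average response time for the "checkout" transaction so far.
average = sum(timings["checkout"]) / len(timings["checkout"])
print(f"checkout: {average:.4f}s on average")
```

A measurement like this tells us that "checkout" was slow, but not what happened inside it.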
This information focuses more on “how” and not “what”. To find out what caused a certain transaction to fail, we need to look deeper. To understand what happened in the application and on the server when an exception was thrown, we have to sift through the log files in search of an answer.
Log files contain a lot of information, such as machine data, business metrics that may include sales transactions and user behavior, as well as information about product-related issues. The main challenge when using logs is that they often contain an unmanageable number of entries. In most cases, trying to figure out what happened is pretty similar to finding a very specific needle in a needle-stack. And that’s where log management tools and analyzers fit in.
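As a rough illustration of the needle-stack problem, the sketch below filters a handful of log lines by level and by a piece of text. The log format, the sample lines and the `find_entries` function are assumptions made up for this example; real log management tools index and query far larger volumes.

```python
import re

# Assumed log format for this sketch: "<timestamp> <LEVEL> <message>"
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def find_entries(lines, level="ERROR", contains=""):
    """Return parsed entries with the given level whose message contains the text."""
    matches = []
    for line in lines:
        parsed = LINE.match(line)
        if parsed and parsed.group("level") == level and contains in parsed.group("msg"):
            matches.append(parsed.groupdict())
    return matches


# Made-up sample data, not taken from a real system.
sample = [
    "2018-05-01T10:00:00Z INFO user 42 logged in",
    "2018-05-01T10:00:01Z ERROR payment failed for order 1001",
    "2018-05-01T10:00:02Z WARN slow response from inventory",
    "2018-05-01T10:00:03Z ERROR payment failed for order 1002",
]

hits = find_entries(sample, contains="order 1002")
for hit in hits:
    print(hit["ts"], hit["msg"])
```

With four lines this is trivial. With millions of entries a day, the same search only works if the entry we need was logged in the first place.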
Tools that are making the switch
As we mentioned, APM and log management tools give us much-needed information and insights into our application, and each one of those fields holds a world of information. That’s why most vendors choose to focus on one of these solutions, offering the best capabilities in its field. Or at least, so it was up until recently.
In 2018, many tools started to shift towards an all-inclusive suite, offering log management and APM under the same roof and adding insights and business intelligence on top. One possible risk is that these vendors are spreading themselves too thin. Expanding their area of focus might mean that they won’t be able to handle both elements as well as they handled one.
Many engineering teams already use these two tools in some variation, using them to get a better view of their application. However, if you’re interested in putting all of your eggs in one basket, there are some tools that offer APM, logs, and BI combined into one package:
Elastic has a list of tools that aim to give its users a complete overview of their application. These tools include Elastic APM, which uses Kibana to let developers visualize performance data. Logstash is the solution that ships the logs from your application and servers into Elasticsearch, which then stores this data and allows you to run operations on it.
Datadog announced recently that they’re entering the log management and APM space, offering log analysis as well as application and system monitoring. In addition to automatically collecting logs across services, applications, and platforms, this new addition will allow users to trace requests from end to end across distributed systems, track performance and instrument code, among other abilities.
Stackify isn’t new to this trend, and the company has been saying for a while that APM alone is not enough. According to the company, both APM and logs are needed to give context to the issues that might occur and to help pinpoint where in the code they are occurring.
Sematext is another “old-timer” in the combined logs and APM space, or as it says on their website – SPM and logs. The concept of SPM is similar to APM, in that it captures events within the application and offers transaction tracing, application mapping (AppMap), network mapping (NetMap) and more. Their log management tool, Logsene, offers a hosted ELK stack to monitor and analyze your logs.
The missing piece between APM and logs
APM and logging tools are great, but they have one main downside – you still have to rely on logs for troubleshooting. This means that you still have to go through a manual process of choosing where and when to log something, and the resulting log entries often include little to no context about what actually happened when the code ran.
Log files are noisy and usually contain countless errors, and the number grows larger every single day. While APM and log tools try to sort this chaos out, it all comes down to what developers managed to catch, and where they chose to log it in the first place. And if this wasn’t bad enough, over 50% of logging statements are written incorrectly, and they (usually) can’t help us find the real root cause of our errors.
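The difference between a log statement with and without context is easy to see in a small sketch. The example below uses Python’s standard logging module; the `charge` function, the "orders" logger name and the order values are hypothetical and only serve to illustrate the point.

```python
import io
import logging

# Capture log output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger


def charge(order_id, amount):
    """Hypothetical payment step that fails on a non-positive amount."""
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return True
    except ValueError:
        # A low-context version would be: logger.error("charge failed")
        # Here we add the variable values, and logger.exception() appends the stack trace.
        logger.exception("charge failed order_id=%s amount=%s", order_id, amount)
        return False


charge(1001, -5)
output = stream.getvalue()
print(output)
```

Even this richer statement only captures the values the developer thought to include, at the one spot where they chose to log.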
Trying to find the issue is often a tedious process, and it can take up to 25% of your developers’ work week. This is a big issue for large companies as well. Comcast, for example, told us that at some point they had so many ineffective alerts that the team started to disregard them as noise.
Comcast understood that there must be a better way to handle these issues, even with the tools they had. That’s why they chose to integrate OverOps as part of their automated deployment model, which helps them instrument their application servers.
Now, when one of their existing tools detects an issue, OverOps adds relevant information by collecting a snapshot of the complete state at the point of execution, for both known and unknown issues. This snapshot is used to speed up the investigation of critical production application errors and can also be used by other tools, such as APM and log management tools, to optimize delivery of applications from development through production.
If you want to learn more about how Comcast used OverOps on its flagship X1 XFINITY platform in order to stay on top of every new error, check out this webinar.
The combination of APM and log tools seems like a step in the right direction, especially since most companies already use both in some form. Together they give us a new way to increase the observability of our applications, servers, and environments. However, we need to remember that whichever tool we choose, it’s still up to the development team to make sure they log everything they can in order to catch exceptions, errors or bugs.