What is the ultimate alerting strategy to make sure your alerts are meaningful and not just noise?
Production monitoring is critical to your application’s success; we know that. But how can you be sure that the right information is reaching the right people? Automating the monitoring process is only effective when actionable information gets to the right person, and the answer is automated alerting. Still, there are some elements and guidelines that can help us get the most out of our monitoring techniques, no matter what they are.
To help you develop a better workflow, we have identified the top benefits that your alerts can offer you. Let us check them out.
Timeliness – Know as soon as something bad happens
Our applications and servers are always running and working, and there is a lot going on at any given moment. That is why it is important to stay on top of new errors when they are first introduced into the system.
Even if you are a big fan of sifting through log files, they only give you a retroactive perspective of what happened to the application, servers, or users. Some would say that timing is everything and getting alerts in real time is critical for your business. We want to fix issues before they severely impact users or our application.
This is where third-party tools and integrations are valuable, notifying us the minute something happens. The concept might sound less appealing when an alert goes off at 3:00 AM or during your night out, but you still cannot deny its importance.
When it comes to the production environment, every second counts, and you want to know the minute an error is introduced.
Context is key to understanding issues
Knowing when an error has occurred is important; the next step is understanding where it is happening. Aleksey Vorona, Senior Java Developer at xMatters, told us that for his company, context is the most important ingredient when it comes to alerts: “Once an error is introduced into the application, you want to have as much information as possible so you can understand it. This context could be the machine on which the application was running, user IDs and the developer that owns the error. The more information you have, the easier it is to understand the issue”.
Context is everything. And when it comes to alerts, it is about the different values and elements that will help you understand exactly what happened. For example, it would benefit you to know if a new deployment introduced new errors, or to get alerts when the number of logged errors or uncaught exceptions exceeds a certain threshold. You will also want to know whether a certain error is new or recurring, and what made it appear or reappear.
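As a rough illustration of the threshold alerts mentioned above, here is a minimal sliding-window counter that fires when logged errors exceed a limit within a time window. This is a hypothetical sketch of the concept, not the implementation of any particular monitoring tool; the class and parameter names are ours.

```python
import time
from collections import deque
from typing import Optional


class ThresholdAlert:
    """Fire when more than `max_errors` occur within `window_seconds`.

    Illustrative only; real alerting tools implement this (and much more)
    for you.
    """

    def __init__(self, max_errors: int, window_seconds: float):
        self.max_errors = max_errors
        self.window = window_seconds
        self.timestamps = deque()

    def record_error(self, now: Optional[float] = None) -> bool:
        """Record one error; return True if the threshold is breached."""
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        # Drop events that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors
```

With a limit of 3 errors per 60 seconds, the fourth error inside the window trips the alert, while a quiet period lets old errors age out.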
Breaking it down further, there are 5 critical values we want to see in each error:
- What error was introduced into the system?
- Where in the code did it happen?
- How many times has it happened, and how urgent is it?
- When was this error first seen?
- When did it last occur?
These are some of the issues we faced ourselves here at OverOps while helping developers, managers and DevOps teams automate their manual error handling processes. Since each team has its own unique way of handling issues, we created a customizable dashboard in which you can quickly see these top 5 values for each error.
OverOps allows you to identify critical errors quickly, understand where they happened within the code and know if they are critical or not.
You need to know when, where, what and how many errors and exceptions happen to understand their importance and urgency.
Root cause detection – Why did it happen in the first place?
Now that we are getting automated alerts with the right context, it is time to understand why they happened in the first place. For most engineering teams, this is the time to hit the log files and start searching for that needle in the haystack. That is, if the error was logged in the first place. However, we see that the top performing teams have a different way of doing things.
Applications typically fire hundreds of thousands or even millions of errors each day, and it becomes a real challenge to get down to their root cause in a scalable manner without wasting days searching. For large companies such as Intuit, searching through the logs was not helpful; Sumit Nagal, Principal Engineer in Quality at Intuit, points out that “Even if we did find the issues within the logs, some of them were not reproducible. Finding, reproducing, and solving issues within these areas is a real challenge.”
Instead of sifting through logs trying to find critical issues and closing tickets with a label stating, “could not reproduce”, Intuit chose to use OverOps. With OverOps, the development team can immediately identify the cause of each exception, along with the variables that caused it. The company can improve the development team’s productivity significantly by giving them the root cause with just a single click.
Getting to the root cause, along with the full source code and variables, will help you understand why errors happened in the first place.
Communication – Keeping the team synced
You cannot handle alerts without having everyone on the development team on board. That is why communication is a key aspect when it comes to alerts. First, it is important to assign the alert to the right person. The team should all be on the same page, knowing what each one of them is responsible for and who is working on which element of the application.
Some teams might think that this process is not as important as it should be, and they assign team members to handle alerts only after they “go off”. However, that is bad practice, and it is not as effective as some would hope.
Imagine the following scenario: it is a Saturday night and the application crashes. Alerts are being sent to various people across the company and some team members are trying to help. However, they did not handle that part of the application or the code. You now have 7 team members trying to talk to each other, trying to understand what needs to be done to solve it.
This is caused by a lack of communication in earlier parts of the project, leaving team members unaware of who is in charge, what was deployed or how to handle incidents when alerts are sent out.
Communication is important, and you should work on making it better as part of your error handling process.
Accountability – Making sure the right person is handling the alert
Continuing the theme of communication from the previous section, an important part of this concept is knowing that the alert reaches the right person, and that he or she is taking care of it. We might know which team member was the last one to touch the code before it broke, but is he the one responsible for fixing it right now? In our interview with him, Aleksey Vorona pointed out that it is important for him to know who is in charge of every alert or issue that arises. The person who wrote the code may be more likely to handle it better than other members of the team, or there may be another team member better equipped to resolve it.
The bottom line is that by automating your alerts, you can direct exception handling tasks directly to the team member that is responsible for them. Otherwise, the right people might miss important information and accountability goes out the window. It is problems like these that can lead to unhappy users, performance issues or even a complete crash of servers and systems.
Team members should be alerted to issues that they are responsible for maintaining, so that it is always clear who is accountable for which tasks.
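Routing alerts by ownership can be as simple as a lookup table mapping each component to the person accountable for it, with an on-call fallback. The components and addresses below are made-up placeholders, and real alerting platforms offer far richer routing rules; this only sketches the idea.

```python
# Hypothetical ownership table: component name -> responsible team member.
OWNERS = {
    "billing": "alice@example.com",
    "checkout": "bob@example.com",
}

# Fallback recipient when no explicit owner is registered.
ON_CALL = "oncall@example.com"


def route_alert(component: str) -> str:
    """Return the address that should receive alerts for a component."""
    return OWNERS.get(component, ON_CALL)
```

Keeping the table explicit means that when an alert fires on a Saturday night, it goes to one accountable person instead of seven confused ones.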
Process – The alert handling cycle
You have your team members communicating and working together, which is great. However, you still need to create a game plan for the team to follow. A good example of a game plan is having an informed exception handling strategy rather than treating each event in isolation.
Exceptions are one of the core elements of a production environment, and they usually serve as a warning signal that requires attention. When exceptions are misused, they may lead to performance issues, hurting the application and its users without your knowledge.
How do you prevent this from happening? One way is to implement an Inbox Zero policy as your company’s game plan: a process in which unique exceptions are acknowledged, taken care of and eventually eliminated as soon as they are introduced.
We have researched how companies handle their exceptions and found that some have a tendency to push them off to a “later” date, just like emails. We found that companies that implement an Inbox Zero policy have a better understanding of how their application works, clearer log files and developers focused on important new projects. We will cover this more in the next chapter.
Find the right game plan for you and implement it as part of a better alert handling process.
Integrations? Yes please!
Handling alerts on your own might work, but it is not scalable in the long run. For companies such as Comcast, servicing over 23 million X1 XFINITY devices, it is almost impossible to know which alerts are critical and should be handled as soon as possible. This is where third-party tools and integrations will be your best friends.
After integrating OverOps with their automated deployment model, Comcast was able to instrument their application servers. The company deploys a new version of their application on a weekly basis, and OverOps helps them identify unknown error conditions that Comcast did not foresee. Watch John McCann, Executive Director of Product Engineering at Comcast Cable, explain how OverOps helps the company automate their deployments.
Integrations can also be helpful in your current alerting workflow. For example, Aleksey Vorona from xMatters works on developing a unified platform for IT alerting and developed an integration with OverOps. The integration allows the company to get access to critical information, such as the variable state that caused each error and alert the right team member.
Use third-party tools and integrations to supercharge your alerts and make them meaningful.
Alerts are important, but there is much more to them than just adding them to your application. You want to make sure you know why they happened in the first place, how you should handle them and how you can make the most of them, versus just knowing that something bad happened. Automated alerting is an important part of the monitoring system. We need the right people to know when, where and why things go wrong in production so that they can fix them as soon as possible.