The course of digital transformation never did run smooth. This week’s caucus catastrophe offers an important lesson in the value of knowing how to quickly identify and prioritize critical application errors.
For those of you following the political headlines, you’ve probably heard plenty about the “Shadow app” Iowa caucus disaster. The short of it was that a new mobile app and backend were used to tally the votes of the U.S. Democratic primary elections in the state of Iowa – the first of 50 primary races.
The app, which was only 5 months old, failed spectacularly, delaying the results of the elections by days… The developers of the app reported the issue was a “coding error” and that they had fixed it. Of course, by that time the damage had already been done, making headlines in almost every major news outlet worldwide. Oops!
Most of us in the business of writing code and delivering software will probably (thankfully) never be put in a situation where we have to ship out a mission-critical application so quickly, and support it when it fails with the eyes of the world on us. But the act of having to rush a feature out to meet a specific business or product milestone, and the feeling of impending doom and angst that comes with it, is something that many if not most of us are deeply familiar with.
I thought that this would be a good opportunity to look at some strategies we’ve found to be effective when releasing code under tight conditions.
The Two Things You Should Always Look for in a New Release
In most situations, you won’t be shipping a brand new application into the wild, but more likely a new feature or capability within an existing application. The risk in this case would either be in broken behavior for the new feature, or – even worse – unexpected impact on existing ones. The challenge here is to detect abnormal behavior or deviations from normal execution as quickly as possible, and to identify the origin, the cause, and the quickest path to resolution.
There are two primary things to look for: new errors and spiking errors. So how do we do this?
APMs can be helpful, but they usually look at deviations from known conditions and health checks, whereas in this case you’re looking specifically at the impact of new code on the application, without the luxury of knowing exactly what to look out for ahead of time. Typically, the most likely place abnormal behavior or failures will be reported is the application’s log stream. However, your logs probably contain millions of events, which makes spotting issues related to your actual deployment (vs. ones that are just part of baseline application behavior – good or bad) a real challenge.
Technique #1: Detecting New Errors 🐞
With this in mind, new errors should be your first goal. The big question is: how can you tell which errors are new? Ideally, you can assign each error its own unique “ID,” which is shared by all errors of the same type coming from the same place in the code. From there, you can check to see whether any errors in the log stream have unique IDs that are new. Unfortunately, that’s easier said than done. How can you assign an ID to an error that is totally unexpected?
The most effective approach we’ve seen is using the origin of the error in the code (i.e. the error’s stack trace and exception class type). A very handy technique when logging any exception that has a call stack is logging the stack trace hashcode.
Simply toString() the exception object’s stack trace, making sure to remove line numbers (these change frequently and generate noise), then hash the result and log that hashcode as part of the error entry – a deduplicated exception ID. At that point you can use your log analyzer to search for new IDs in the log stream, and review each one to ensure it is not a direct result of your code change.
The advantage here is that the codes are inherently deduplicated: unlike log messages, they are not affected by the variable state of the app at the point of error (state that is usually woven into the error message itself).
This approach also performs well: since the stack is about to be logged anyway, grabbing the stack trace and appending it to a StringBuilder without line numbers is computationally cheap, making it well suited to production use-cases.
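As a minimal sketch, the deduplicated ID described above might be computed like this, assuming plain JDK with no libraries (class and method names here are illustrative, not from any particular framework):

```java
// Sketch of a deduplicated error ID: hash the exception's class name
// plus its stack frames with line numbers stripped, so the same failure
// from the same code path always yields the same ID.
public class ErrorId {

    static String errorId(Throwable t) {
        StringBuilder sb = new StringBuilder(t.getClass().getName());
        for (StackTraceElement frame : t.getStackTrace()) {
            // Deliberately omit frame.getLineNumber() -- line numbers
            // shift on every edit and would fragment the IDs.
            sb.append('|')
              .append(frame.getClassName())
              .append('.')
              .append(frame.getMethodName());
        }
        return Integer.toHexString(sb.toString().hashCode());
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("tally failed");
        } catch (IllegalStateException e) {
            // In real code you would log this alongside the message, e.g.
            // log.error("tally failed, errorId={}", errorId(e), e);
            System.out.println("errorId=" + errorId(e));
        }
    }
}
```

Because line numbers are stripped, two occurrences of the same exception type thrown from the same code path hash to the same ID even after unrelated edits shift the file around.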
The caveat with this approach is that it requires foresight: you need to guesstimate where issues might occur and add logging there ahead of time, so it will only cover the issues you were able to log in the first place. This is one of the reasons we built OverOps, and why companies use it to identify critical issues, regardless of whether they were logged or not.
Technique #2: Detecting Error Spikes 📈
The next thing to look out for are errors that are not necessarily new, but whose volume has increased since our code change. These are most likely errors revolving around network or DB connectivity, message queuing and the like. Such errors usually signify that an application’s environment is in an unstable state.
In the case of the Iowa elections, the app’s backend was unable to process the results, most likely producing backend errors and HTTP error responses that were sent back to the mobile client.
Here, the most effective technique is to log the category of every error or exception, rather than a specific call stack. Your DB client alone may declare dozens of exception types, and without a lot of manual work you won’t have all those types readily classified for you at the moment an exception is thrown.
It’s important to group different exceptions into logical groups, as our goal is to identify whether a specific operational aspect of our app (i.e. DB, network, queue) is behaving unexpectedly in a way for which we may not have had the foresight to put an APM health check in place.
A really effective method is to print a shorthand form of the exception namespace (such as the first letter of each word in the namespace plus the last word) as a defined field of the error message. For example, “java.sql.SQLException” -> “EID=jsql”. From that point you can easily query your log analyzer for the top 20 exception IDs (“EID”) and see whether any of those graphs show a noticeable volume change following the deployment. That can be a strong leading indicator that something bad is going on and worth looking into.
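One way to derive that shorthand is sketched below. It assumes the “jsql” example means the first letter of each leading package segment plus the full final segment; the exact abbreviation scheme is a convention you define yourself, and the class and method names here are illustrative:

```java
// Sketch of the "EID" shorthand: abbreviate the exception's package
// name to the first letter of each leading segment plus the full last
// segment, e.g. java.sql.SQLException -> "jsql".
public class ExceptionEid {

    static String eid(Class<? extends Throwable> type) {
        Package pkg = type.getPackage();
        if (pkg == null) {
            // Class in the default package: fall back to its simple name
            return type.getSimpleName().toLowerCase();
        }
        String[] segments = pkg.getName().split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.length - 1; i++) {
            sb.append(segments[i].charAt(0)); // first letter of each leading segment
        }
        sb.append(segments[segments.length - 1]); // last segment in full
        return sb.toString();
    }

    public static void main(String[] args) {
        // Emit as a defined field of the error message for easy querying
        System.out.println("EID=" + eid(java.sql.SQLException.class));
    }
}
```

Logging this as a structured field (rather than burying it in free text) is what makes the top-20 query in your log analyzer a one-liner.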
While there are many things to look out for when releasing a critical feature under time constraints, nailing down new and spiking errors during the first 24 hours after a release – or even before it reaches production – is something that can make all the difference between a successful deployment and a war story that still sends shivers down your spine years later.
If these techniques seem interesting to you but may be too much work to implement, you might want to check out OverOps. By analyzing your code as it changes over time, and assigning unique identifiers to all errors and exceptions, OverOps tells you after each release exactly which new errors have been introduced and which ones have spiked. OverOps can even show you the state of the code at the moment an error occurred to help you reproduce it much faster without having to add logging or increase log verbosity.
It’s my sincere hope that one day, when you’re faced with a high-stakes release, you’ll have some of these techniques in place, be it by coding them into your app, with OverOps, or with any other tool, so you don’t have to experience anything near what those poor developers and angry Iowa voters felt this week!