Knowing when a release is “good to go” is, well… complicated. We might feel pretty confident with the release if it’s passed all of our tests in pre-prod, but that still doesn’t mean that we won’t face issues in production. Instead of releasing new code and hoping for the best, every SRE or DevOps team should implement code quality gates to block bad code from reaching production.
There are two main issues here that need to be addressed. The first is testing coverage. Our testing is only as good as our coverage, and even reaching 100% code coverage doesn’t really mean 100% testing coverage. In complex systems, it’s impossible to foresee all the different code and infra combinations, especially with the release of a new feature.
The second issue is Anomaly Detection. Even in pre-production environments, the noise-to-signal ratio for large scale systems is high. If something unexpected happens, how do we even know about it? How do we know the quality of our new release compared to the previous one? Can we trust, based on our tests, that this new release won’t break anything for our customers?
It seems the industry needs to craft a more effective approach, an approach that uses Machine Learning to automatically block bad code from moving up the chain with code quality gates. First, we need to agree on a few objective metrics that we’ll use to quantify the quality of a release. Once that’s done, we can set up our quality gates to block a release that doesn’t meet our quality standards, before it moves from staging to production.
These are the four metrics we came up with to define our quality gates, ordered from the simplest to the most complex:
The 4 Code Quality Gates You Need
Gate #1 – Error Volume
This quality gate is the most basic one – does this release increase our error volume? The challenge here is that we need to be able to capture this data correctly and incorporate all errors, not only those that are logged. That includes everything, even swallowed and uncaught exceptions, as those might actually be more important than those in the logs.
The second stage is normalizing the data. Volume only makes sense when compared with throughput: you’ll undoubtedly see a higher error volume when more throughput is pumped into the system, so normalizing the error volume into a percentage is critical. The next thing we need is the ability to deduplicate the data, so we can easily see what makes up the bulk of the volume, and whether it’s benign or severe.
Code Quality Gate: The normalized error rate of an application should not increase between releases.
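As a rough sketch, this gate boils down to a single comparison of normalized rates. The field names, counts, and dictionary shape below are hypothetical, not a specific tool’s API:

```python
# Hypothetical sketch of gate #1: compare normalized error rates between releases.
def error_rate(error_count: int, request_count: int) -> float:
    """Normalize raw error volume against throughput."""
    if request_count == 0:
        return 0.0
    return error_count / request_count

def gate_error_volume(baseline: dict, candidate: dict) -> bool:
    """Pass the gate only if the candidate release's normalized error
    rate has not increased relative to the baseline release."""
    baseline_rate = error_rate(baseline["errors"], baseline["requests"])
    candidate_rate = error_rate(candidate["errors"], candidate["requests"])
    return candidate_rate <= baseline_rate

# Absolute error volume grew, but throughput grew faster, so the
# normalized rate actually dropped and the gate passes.
baseline = {"errors": 120, "requests": 100_000}   # 0.12% error rate
candidate = {"errors": 150, "requests": 200_000}  # 0.075% error rate
print(gate_error_volume(baseline, candidate))  # True
```

Note that comparing raw counts instead of rates here would have wrongly blocked this release, which is exactly why normalization comes first.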
Gate #2 – Unique Error Count
This leads us directly to the second gate which is the ability to determine which errors make up the volume and whether or not the unique count has gone up or down since our last release. At the core of this is the ability to transform an ocean of separate events into a time series that we can chart and split into core components. Once we transform this mass of code and log events into a set of analytics, we can begin to see where these errors are coming from – which apps, containers, locations in the code, and under what conditions.
This gives us a picture of the code quality, as well as the performance cost of having those events in the code and emitting them, in addition to their impact on the reliability of the app. This becomes even more important when looking at key applications or reusable components (i.e. tiers) in the code, such as payment processing or the DAL, where more errors can be a very negative indicator.
Gate: The number of unique errors, especially in key applications or code tiers, should not increase between releases.
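The transformation this gate depends on can be sketched as deduplicating raw events into unique-error fingerprints. The fingerprint fields and example locations here are illustrative assumptions:

```python
# Hypothetical sketch of gate #2: deduplicate raw error events into
# unique errors and compare the unique count between two releases.
def fingerprint(event: dict) -> tuple:
    """Collapse a raw event into its unique-error identity: the same
    exception type at the same code location counts as one error."""
    return (event["type"], event["location"])

def gate_unique_errors(baseline_events: list, candidate_events: list) -> bool:
    """Pass only if the number of unique errors has not grown."""
    baseline_unique = {fingerprint(e) for e in baseline_events}
    candidate_unique = {fingerprint(e) for e in candidate_events}
    return len(candidate_unique) <= len(baseline_unique)

# 10,000 events from one code location are still a single unique error.
baseline = [{"type": "IOException", "location": "dal/Query.java:71"}] * 10_000
candidate = [
    {"type": "IOException", "location": "dal/Query.java:71"},
    {"type": "NullPointerException", "location": "pay/Charge.java:88"},
]
print(gate_unique_errors(baseline, candidate))  # False: unique count went 1 -> 2
```

In practice the fingerprint would include more signal (stack depth, app, container), but the principle is the same: the gate counts distinct error identities, not raw volume.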
Gate #3 – New Errors
Drawing directly from that, once we’ve broken down all the errors in our environments into components, we want to be able to quickly separate from what could be hundreds, or sometimes even thousands of locations in the code, the ones that are new and have just been introduced by this release.
Especially if we are the caretakers of an enterprise application, one with an honored 15-year legacy, which brings with it a mass of existing errors, we shouldn’t be introducing new ones into the environment. Or at least not ones that we did not expect. When dealing with a massive release or major infrastructure change, you might encounter a large number of these “micro fractures”, in the form of a dozen new errors – that’s just reality. So what do you do? Set up a quality gate, of course!
For this, it’s important to be able to prioritize events, to make sure that even when new errors are introduced into the environment, they aren’t severe. What makes for a severe error you ask? There are two key attributes to consider.
The first is whether the error is of a critical type or in a critical component. A socket exception is much more benign and can be perfectly normal; a NullPointer, DivideByZero or IndexOutOfBounds is a direct result of code and data not working well together. There is little excuse for these, and their impact will likely lead to unexpected results for our customers. We don’t want those.
The second key attribute of a critical new error is a high error rate. There is a difference between new errors that happen for a short period of time during a deployment and something that has happened more than 500 times in more than 10% of the calls (again – everything must be normalized). We don’t want those either.
Gate: New errors of a critical type or with a high error rate should block a release.
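Putting the two severity attributes together, the gate might look like the sketch below. The type names, thresholds, and event fields are assumptions made for illustration, not a fixed API:

```python
# Illustrative sketch of gate #3: block on new errors that are critical
# by type, or that fire at a high normalized rate.
CRITICAL_TYPES = {"NullPointerException", "ArithmeticException",
                  "IndexOutOfBoundsException"}

def gate_new_errors(baseline_fingerprints: set, candidate_errors: list,
                    count_threshold: int = 500,
                    rate_threshold: float = 0.10) -> bool:
    """Pass unless a newly introduced error is of a critical type,
    or has occurred often and in a high share of calls."""
    for err in candidate_errors:
        fp = (err["type"], err["location"])
        if fp in baseline_fingerprints:
            continue  # pre-existing error; the regression gate handles it
        critical_type = err["type"] in CRITICAL_TYPES
        high_rate = err["count"] > count_threshold and err["rate"] > rate_threshold
        if critical_type or high_rate:
            return False
    return True

baseline = {("SocketException", "net/Pool.java:42")}
# A brand-new NullPointerException in payment code blocks the release,
# even though it has only fired three times so far.
errors = [{"type": "NullPointerException", "location": "pay/Charge.java:88",
           "count": 3, "rate": 0.001}]
print(gate_new_errors(baseline, errors))  # False
```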
Gate #4 – Regressions & slowdowns
This brings us to the last, and probably the most sophisticated, quality gate. Here we’re using the data obtained from the first two gates to look not at new errors, but at the behavior of existing errors within the system. In this case, we’re looking for regressions in the form of pre-existing errors that are happening at a higher rate, or slowdowns that deviate from previous performance.
Creating a quality gate for regressions is more complex as things are all relative. For something to be considered regressed or slow, it needs to be compared against itself. For this, a baseline must be established. We’ll also need to determine our tolerance – what is the level of regression or slowdown we are willing to tolerate?
It’s bold to say we have no tolerance for regressions or slowdowns in our systems, but life is far from perfect and environments are noisy. Instead, let’s start by defining an increase of more than 50% as a regression, and an increase of more than 100% as a severe regression / slowdown. With severe regressions or slowdowns, our release should not be promoted – at least not without inspection of the anomaly. This is the most complex of the quality gates, but also the one with the strongest predictive quality with respect to potential outages and severity-1 incidents.
Gate: Severe regressions and slowdowns should block a release.
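Using the 50% / 100% tolerances above, the baseline comparison might be sketched like this; the error identities and rate fields are illustrative assumptions:

```python
# Sketch of gate #4: classify each pre-existing error against its own
# baseline, with the 50% / 100% tolerances described in the text.
def classify(baseline_rate: float, current_rate: float) -> str:
    """>50% increase over baseline is a regression,
    >100% is a severe regression."""
    if baseline_rate <= 0:
        return "new"  # no baseline to compare against; handled by gate #3
    increase = (current_rate - baseline_rate) / baseline_rate
    if increase > 1.0:
        return "severe"
    if increase > 0.5:
        return "regression"
    return "ok"

def gate_regressions(errors: list) -> bool:
    """Block the release only on severe regressions; plain regressions
    are flagged for inspection but do not block on their own."""
    return all(classify(e["baseline_rate"], e["current_rate"]) != "severe"
               for e in errors)

print(classify(0.10, 0.16))  # 'regression' (60% increase)
print(classify(0.10, 0.25))  # 'severe' (150% increase)
print(gate_regressions([{"baseline_rate": 0.10, "current_rate": 0.25}]))  # False
```

The same relative-increase comparison applies to slowdowns, with response-time baselines substituted for error rates.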
With these four gates, we wanted to define a benchmark that is powerful and broadly applicable to complex environments, but also isn’t overly complex to the point that it remains an academic exercise.
Using this as a basis, we’ve enabled teams to confidently release code to production. Learn more about how you can incorporate these quality gates into your own release cycle with integrations to your CI/CD tooling.