Grep the Word “Error” in Your Logs – Can You Find the One That’s Costing You $1M?

30th Apr 2019

You may not take all of your errors seriously, especially your caught errors, but they’re still costing you money.

According to a report from the Consortium for IT Software Quality, developers introduce an average of 100 to 150 errors for every thousand lines of code they deploy. It’s pretty shocking that an estimated 10-15% of code running in production contains errors, but don’t worry. There’s a bright side. Kind of.

Not all of those errors are serious, assuming by serious we’re talking about the errors that contribute to application downtime, transaction failures or any other kind of customer-facing performance issue. The rest are the kinds of errors that don’t have any visible impact on customer experience or the company’s bottom line. They’re the kinds of errors that slowly, and often continuously, accumulate because the cost of fixing them seems to heavily outweigh the cost of just letting them be.

In this post, we’ll examine how all application errors, big and small, contribute to additional costs for companies ranging from adolescent start-ups to full-grown Fortune 500s.  

Serious Errors Introduce Serious Costs

Let’s assume that just 10% of the errors introduced to our code eventually cause performance issues or failures. As is pointed out in the same CISQ report, “even if only a small fraction—say 10 percent—of these errors are serious, then a relatively small application of 20,000 lines of code will have roughly 200 serious coding errors.”
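To make the arithmetic behind that quote explicit, here is a minimal Python sketch. The function name and its parameters are illustrative assumptions of ours, not something taken from the CISQ report; the inputs are simply the figures quoted above.

```python
def estimate_serious_errors(lines_of_code, errors_per_kloc, serious_fraction):
    """Estimate how many serious errors a codebase of a given size contains."""
    total_errors = lines_of_code / 1000 * errors_per_kloc
    return round(total_errors * serious_fraction)


# Figures from the text: 20,000 lines of code, 100 to 150 errors per
# 1,000 lines, and 10% of all errors assumed to be serious.
low = estimate_serious_errors(20_000, 100, 0.10)
high = estimate_serious_errors(20_000, 150, 0.10)
print(f"Estimated serious errors: {low} to {high}")
```

The low end of that range corresponds to the report's "roughly 200"; the high end uses the upper bound of 150 errors per thousand lines.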

These errors contribute to costs that are generally visible and quantifiable. Take, for instance, the fees incurred from breaches of SLA contracts or the revenue lost from failed transactions. These are the most visible costs, but there are other quantifiable costs that are less obvious.

Engineering time dedicated to troubleshooting can also be calculated. It is easy to overlook or to dismiss as a necessary expense, but its scale only becomes clear once you look at the actual number.

According to our recent survey of more than 2,400 IT professionals (and industry standards for engineering salaries), companies are paying on average $15,000 per developer per year for debugging and troubleshooting duties alone. In a company with 100 developers, that adds up to roughly $1.5 million a year spent solely on developer time debugging issues. Now, we’re not suggesting this cost can be eliminated entirely, but imagine if debugging time could be reduced by just 10%. That amounts to a potential $150k in savings each year.
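The same calculation can be adapted to your own team size. The sketch below is a rough illustration only: the function names are hypothetical, and the $15,000 figure and the 10% reduction are the numbers from the paragraph above, not measurements of any particular company.

```python
def annual_debugging_cost(developers, cost_per_developer_usd):
    """Total yearly spend on debugging across the whole team."""
    return developers * cost_per_developer_usd


def potential_savings(total_cost_usd, reduction_fraction):
    """Yearly savings if debugging time drops by the given fraction."""
    return total_cost_usd * reduction_fraction


# Figures from the text: 100 developers at $15,000 each per year,
# with debugging time reduced by 10%.
total = annual_debugging_cost(100, 15_000)
saved = potential_savings(total, 0.10)
print(f"Annual debugging cost: ${total:,.0f}")
print(f"Savings at a 10% reduction: ${saved:,.0f}")
```

Swapping in your own headcount and per-developer cost gives a first-order estimate; it ignores differences in salary and in how much time each developer actually spends debugging.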

Aside from the quantifiable costs, there are the intangible costs that we pay for these kinds of errors. The kinds of costs that keep our executives up at night. The cost of brand damage from dissatisfied customers ranting on Twitter. The cost of replacing employees who aren’t satisfied with the type or quality of work that they’re producing. These costs can’t be quantified precisely because so many complex factors contribute to them. They’re not easy to pin down, but they also cannot be ignored.

Those Caught Exceptions are Costing You Too

On the one hand, we have serious errors which carry serious costs. These errors are crashing our applications, impacting performance and causing unexpected behavior for customers. On the other hand, we have tons of caught exceptions that aren’t making waves or really disturbing anyone. At least, not as individual errors.

The biggest contribution that these “non-serious” errors make to our costs is not as individual events. As standalone exceptions and errors, they’re benign: they don’t directly impact the application’s performance, so we let them slide. One problem with our current frame of mind is that we stop there. We tell ourselves that we’ll never live in a world with zero exceptions, so why bother with those that aren’t causing problems?

NOTE: Not everyone shares that outlook. Check out this post we wrote after talking to teams that approach exceptions like emails and have implemented inbox-zero policies for exception handling.

Anyway, lofty goals for eliminating exceptions aside, there is value in confronting all of the “benign” errors that keep showing up in your application and piling up in your logs. Imagine a single “harmless” logged error that ends up occurring a million times over the course of a week.

With the average size of a log statement being about 0.000002GB (2×10^-6 GB, or roughly 2KB), that’s 2GB for the week, for which ingestion and storage costs add up to about $100. Not too bad. Even if that volume continues, it’s just $5,200 for a year. It really doesn’t seem too bad, until you remember that we’re talking about logging costs for just a single logged error. We didn’t even ask you what you found when you grepped your logs for “error”.
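Here is a small Python sketch of that logging estimate, using only the figures from the paragraph above. The constant names are our own, and the per-GB price is merely what the article's numbers imply ($100 for 2GB), not a quote from any logging vendor.

```python
GB_PER_LOG_STATEMENT = 2e-6       # about 2 KB per statement, as in the text
OCCURRENCES_PER_WEEK = 1_000_000  # one "harmless" error, logged a million times
WEEKLY_COST_USD = 100             # the text's estimate for ingestion and storage
WEEKS_PER_YEAR = 52

# Volume generated by this single logged error each week.
weekly_volume_gb = GB_PER_LOG_STATEMENT * OCCURRENCES_PER_WEEK

# Price per GB implied by the article's figures (an inference, not a vendor rate).
implied_cost_per_gb = WEEKLY_COST_USD / weekly_volume_gb

# Yearly cost if the same volume continues every week.
yearly_cost_usd = WEEKLY_COST_USD * WEEKS_PER_YEAR

print(f"Weekly volume: {weekly_volume_gb:.1f} GB")
print(f"Implied cost per GB: ${implied_cost_per_gb:.0f}")
print(f"Yearly cost for this one error: ${yearly_cost_usd:,}")
```

Multiplying the yearly figure by the number of distinct recurring errors in your own logs gives a rough sense of how quickly the total grows.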

Excessive errors also use CPU and memory resources and may require additional cloud and data center infrastructure to support them or additional licenses to provision new servers. A significant portion of the R&D budget is spent on infrastructure overhead costs such as hardware and network licensing, storage and third-party services. Error volume is a determining factor in many, if not all, of those costs.

The point is, despite their friendly disposition, when you let even those seemingly benign errors add up, your maintenance and overhead costs are sure to go up as well.

Final Thoughts

Whether or not an error is “serious” is fairly subjective. Does “serious” mean the error crashed the application? I’d call that serious. Does it mean that it negatively affected users? You could definitely say that. Does it mean that it’s affecting the company’s bottom line? I’d say so.

By that last definition, as we saw in this post, every error is serious. That can be pretty overwhelming, but it doesn’t mean that you should stop everything and implement an inbox-zero policy for handling exceptions. It just means that we can’t assume an exception isn’t critical simply because our systems are up and running and no customers are complaining.

Identification and prioritization should be standard practice for ALL errors, according to their potential for causing major performance problems or contributing to excessive overhead costs. OverOps not only helps companies with these crucial steps, it also provides teams with all of the data they need to prevent errors from reaching production and to quickly resolve any that are found anywhere in the SDLC, including production. See how it works here.

PS. Interested in seeing how the costs add up in your company? Try our calculator to uncover the hidden costs of errors in your application.

Tali is a content manager at OverOps covering topics related to software monitoring challenges. She has a degree in theoretical mathematics, and in her free time, she enjoys drawing, practicing yoga and spending time with animals.
