The DevOps Paradox: How to Balance Speed, Quality and Complexity

12th Feb 2020

5 min read

OverOps sat down with the hosts of the DevOps Paradox podcast to discuss continuous reliability and the multitude of challenges facing today’s DevOps teams.

We all know code is being shipped faster than ever these days, but this shifting velocity comes at a price. The faster companies release new features, the harder it is to ensure the quality and reliability of software. Testing times shrink, and undetected errors make their way into production, leading to a degraded customer experience.

Recently, OverOps VP of Solution Engineering Eric Mizell chatted with Darin Pope and Viktor Farcic from DevOps Paradox about all things software delivery, from technical debt and automated testing, to production failures and continuous reliability. You can give the full podcast episode a listen here.

Below we’ve highlighted just a few of our favorite soundbites and takeaways from their conversation.

On Continuous Reliability…

Eric Mizell: There is this notion of a Venn diagram where I have speed, or “how fast can I move, how fast can I deploy and develop code?” I have quality, which is “how good or reliable is my code?” And complexity, which is “am I moving to microservices or is my application monolithic?” 

The notion of continuous reliability is about how do I get to the middle of this Venn diagram where I can deploy code faster and I can have better feedback loops to catch errors sooner in a shift left world – before they get to production. 

On Release Velocity Challenges…

Eric Mizell: As a software engineer, the biggest challenge is trying to deliver quality code across a continuum to production. In the new world of CI/CD where we’re trying to move faster to keep up with the Amazons and Googles of the world – where they deliver or deploy code hourly – how do I move to that world when I’m coming from a monolith or heavier code bases where I deliver code every six months?

Viktor Farcic: I think that most companies throughout the history of software engineering focused primarily on how to make something not fail in production. That’s why we would have year-long cycles. But the part of that theory I was always curious about is that you would fail every single time and you would still think that it makes sense.

On Business Needs, Production Failures & Testing Time Constraints…

Eric Mizell: There’s a lot of talk about “I need more code coverage” or “I need better testing,” but the reality is code is never really tested until it gets to production. So you get this vicious cycle of how do I get those test cases back into production or into my QA cycle and move faster? 

When writing software, we would think of every permutation that could break the code, and spend weeks on this. You don’t have that time anymore. The product has to get to production. Business doesn’t wait.

We have to change our mindset to how do I move faster, but have a way to catch the issues in the code or the issues in my pipeline – whether it’s the database, middleware, whatever – before it impacts my customers, without having to take the months of time.

Viktor Farcic: But it’s not only about speed. I would actually say that being faster comes later. First, I think we need to abandon the idea that you’re going to make something that works in production from the first attempt, 100%. Once you accept that, you go faster and in shorter iterations – then at least we can control the scope of that failure better.

On Technical Debt & Changing the Status Quo…

Eric Mizell: Everyone has accepted technical debt and legacy debt. Your application has tons of errors, thousands of these things fail? Oh well, no one is complaining. I reboot. It’s what I call the status quo: This is what we’ve had; it’s what we accept; this is how we do stuff, how we’ve always done it, and it’s fine. 

It needs to change. I’m actually starting to hear “error budgets” coming up with companies and technical debt burndown budgets, bug bounties. They pay them to come in on a weekend and for every bug they bash, they get a little stipend or little perk.

Everything is a cause and effect. If I fix all this technical debt, guess what? I don’t have to reboot every night at midnight because of a memory leak. I fixed it. These deadlock issues or thread problems or my database index problems go away if I fix them, and guess what? Cause and effect – everything runs better. My SLAs are met, applications run better, less headache for everyone. Less Sev1s. 

Viktor Farcic: It’s kind of like I acknowledge that I have technical debt, but at least I can work on not increasing it. 

I think that many people don’t have a valid point of reference. On average, it takes us three months to release a new feature and that’s normal. Because your only point of reference is yourself. As long as you think it’s normal to release once a year, then why would you fix technical debt? You’re within the boundaries of normal.

On OverOps…

Eric Mizell: OverOps is about identifying, prioritizing and resolving your most critical exceptions before they impact your customers. We actually will show you source code and all the variables that were moving through the system at the time of the event, and help you to triage and troubleshoot problems in a fraction of the time. So not diving through logs, looking at stack traces, just trying to figure out what’s happening.

To see the OverOps Continuous Reliability solution in action, sign up for a free trial or request a personalized demo. For more great insights and developer war stories from Darin, Viktor and Eric, check out the full podcast recording and transcript here.

Nicole is a communications and product marketing manager at OverOps. Her expertise includes technologies ranging from artificial intelligence and predictive analysis to DevOps, incident management and more.
