The Bridge Between Dev and Ops Needs Automated Structural Visibility

26th Feb 2019

7 min read

Guest post by Jason English, Principal Analyst, Intellyx

November 7, 1940. When the first Tacoma Narrows Bridge opened to traffic that summer, it was considered an engineering marvel of span and flexibility. Unfortunately, its engineers never anticipated that mere wind could harm such a grand structure. And so, on this day, sustained wind gusts set the innovative bridge flexing and bucking in a harmonic pattern, ultimately tearing 'Galloping Gertie' apart.

The twin towers of this famous bridge collapse could stand for Dev and Ops: two separate collaborators that suspended disbelief, shared accountability and, with DevOps, made things move faster. For a while.

Failures in production are not for lack of trying to bridge this gap from both sides.

The bridge to DevOps, paved with automation

I recently spent some time talking about the state of the "Dev vs. Ops blame game" with OverOps' VP of Solution Engineering, Eric Mizell, who spends much of his time advising teams at companies in the middle of a DevOps crossing.

His firm sponsored this widely sampled Dev vs. Ops – State of Accountability study (surveying more than 2,000 Dev and Ops professionals) to track the impact of DevOps on the culture of collaboration between these two once-separate sides of the software delivery function.

It has been 10 years since we first heard of DevOps, and the movement has become a very popular way to rethink agile development and continuous releases around customer value. More than 82% of organizations say they have adopted some aspect of DevOps, a remarkable awareness stat compared to 5 years ago, when adoption was certainly less than half. A few (17%) even claimed complete DevOps adoption, according to the study.

“Change is being mandated. [DevOps] was slow-roll at the beginning, but now the evolution is moving much faster,” Mizell says. “By this time next year, the number of companies you will see that are fully DevOps enabled will be doubled.”

Companies are indeed creating enviable levels of automation through DevOps, with some accelerating releases from months to days, and faster still.

But DevOps is not a platform, nor a standard set of tools supported by a vendor. As we modernize the SDLC toward this new, faster approach, it becomes subject to inconsistent traffic patterns and unpredictable winds of change, exposing critical design flaws and instability in the face of real world conditions.

Cultural improvement, with an accountability gap

First, good news: DevOps improves development culture, with 73% of respondents answering that both dev and ops teams ultimately share responsibility for application reliability. That’s huge, as it means we’re avoiding the ‘blame game’ of Dev throwing code over the wall to Ops, and Ops trying to block releases to prevent production failures.

However, following the positive boost to culture and release frequency, an uncertain reality sets in. With so many participants in complex enterprise application environments, with faster releases and thousands of components, nobody knows who is ultimately responsible when applications break.

Perhaps the scariest stat: 52% said they rely on customers to tell them about errors, and three-quarters use manual processes to discover errors.

“How do you know when something is happening? Despite the stats, I’d say 90 percent of the real issues come from customers,” Mizell said. “They tell me ‘I don’t want to get lambasted on Twitter.’”

How do companies get better accountability, and greater ownership of resolution, from everyone?
Well, everyone wants to do work that matters. If developers and ops have only customer complaints, finger-pointing, and a pile of logs gathered from every service and server, how can they improve delivery quality?

Full-stack developers and talented operations people are in short supply, and likely to leave if exposed to constant firefighting. If this struggle continues, the retention problem becomes more critical than ever.

Instrumentation of code is not enough

What happened to the core DevOps way of ‘automating all the things’ from CI/CD practices?
Or, if 70% of development teams say they are automating tests (according to survey numbers), why are 25% of all IT professionals spending around one full day per week troubleshooting, and 42% spending a half day or more?

That's losing at least 10-20% of the whole team's value in fruitless labor.
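Some back-of-envelope arithmetic shows how those survey figures translate to lost capacity (a sketch only: it assumes a 5-day week, treats the two survey groups as disjoint, and counts "half a day or more" at its half-day floor):

```python
# Back-of-envelope estimate of team capacity lost to troubleshooting.
# Assumptions (not from the survey itself): 5-day work week, disjoint
# groups, and "half a day or more" counted at its half-day minimum.
week_days = 5

share_full_day = 0.25   # 25% lose ~1 full day per week
share_half_day = 0.42   # 42% lose at least half a day per week

# Average days lost per person per week, as a conservative floor.
lost_days_per_person = share_full_day * 1.0 + share_half_day * 0.5

lost_fraction = lost_days_per_person / week_days
print(f"Conservative floor on lost capacity: {lost_fraction:.1%} of the week")
```

Since both figures are floors ("around one full day", "half a day or more"), the real number lands comfortably in the 10-20% range cited above.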

"I've sat with customers who've had one issue occupy 40 people for three full days — do the math on that," Mizell said. "We had another customer that would see everything go down every 10 days. So they had people rebooting everything every 8 days to avoid it."
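Doing the math on that anecdote is straightforward (the hourly rate below is a hypothetical figure for illustration, not from the interview):

```python
# Cost of the single production issue from the anecdote above.
people = 40
days = 3
hours_per_day = 8    # assumption: a standard workday
loaded_rate = 75     # assumption: illustrative loaded cost per hour, USD

person_days = people * days
cost = person_days * hours_per_day * loaded_rate

print(f"{person_days} person-days, ~${cost:,} at the assumed rate")
```

One incident burning more than a hundred person-days makes the troubleshooting percentages above feel conservative.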

Like our suspension bridge, the automation solution can't be rigid. It needs to be flexible enough to support constant shifts in coding and testing techniques, a mix of new and old components, open source frameworks, and new service integrations.

So should developers instrument their own code, a practice DevOps inherited from CI/CD?
"You can't really have code monitoring code," Mizell says.

The issue with self-instrumentation is that in addition to being extra work, it only traps problems the developer already anticipated at the time they wrote the code.
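To see why, consider a minimal sketch (hypothetical Python, not OverOps code): hand-written instrumentation only traps the failure modes the author imagined while writing it.

```python
# Hand-rolled instrumentation: the developer anticipated bad numeric input.
def parse_price(raw):
    try:
        return float(raw)
    except ValueError:
        # Anticipated failure: log it and fall back to a default.
        print(f"warn: could not parse {raw!r}")
        return 0.0

print(parse_price("19.99"))   # happy path
print(parse_price("N/A"))     # the anticipated error is trapped and logged

# ...but an unanticipated input type sails right past the instrumentation:
try:
    parse_price(None)         # TypeError escapes; no log, no variable state
except TypeError as exc:
    print(f"escaped instrumentation: {exc}")
```

The `except ValueError` block captures only the case the developer foresaw; the `TypeError` leaves no trace in the application's own logs, which is exactly the gap Mizell describes.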

The same applies to APM and ITOps monitoring tools alone. They can do a good job providing warnings and alerts. The savvy SRE can respond to a notification by throwing more infrastructure at a performance problem, but it is still hard to drill down to root causes to report back to development.

Automated Structural Visibility

The best way to reinforce a bridge to carry more than its own weight isn't to add more brick-and-mortar supports; it is a suspension model, with loose binding. This forces a rethink of the design approach, so the infrastructure becomes fully exposed.

“It’s scary, but I see companies with thousands of errors a day. And with so much noise, how do I know what to work on?” Mizell said. “The key is visibility. I need a better ability to know what is an exception. What haven’t I seen before?”

Depending on self-reporting, introspection and gating tools that only check for known structural bugs consumes development cycles. It also leaves much of the heavy lifting of root cause analysis for operations teams to unravel when problems appear in production.

And when IT Ops reports issues back to development without traceability to source code, it often produces a 'works on my machine' response from developers and prematurely closed tickets, especially when teams are incentivized on 'closing tickets' and 'number of check-ins'.

So this bridge can be crossed, if structural visibility is automated.

One major media company took what had been a one-to-two day exception handling process, with development teams sifting through logs, down to about 15 minutes, using automated structural visibility to find one errant variable. They used an AIOps platform from OverOps alongside their existing APM and release management solutions to burn down hundreds more such nagging exceptions.

The Intellyx Take

We revisit our suspension bridge with lessons learned. The old road is a monument at the bottom of the channel, but the monolithic pillars remained. Engineers rebuilt with the struts and structure fully exposed to the light of day, and monitored for visibility. Manual inspection is now augmented with sensors and traffic cams ready to report the first sign of unexpected conditions.

To share accountability between Dev and Ops, you need to share visibility, down to code and metrics. This needs to be done in as fully automated a way as possible.

When both sides share the same view into real-world application conditions, seeing exceptions as they would impact customers, they can finally bridge the accountability gap.

There is also a big morale question of aligning incentives. The more you can encourage shared accountability between Dev and Ops teams around successful, sustainable software releases, with positive business outcomes, the better.

Reliability in this DevOps world demands automated, real-time visibility into application code at every release stage.

Contents © 2019, Intellyx LLC. At the time of this writing, OverOps is an Intellyx client. Interview content from a Jan 2019 webinar review of the OverOps Dev vs. Ops State of Accountability survey results.


Jason “JE” English is Principal Analyst at digital transformation advisory firm Intellyx. Drawing from 20+ years of expertise in enterprise software and services, in his role he is focused on covering how agile collaboration between customers, partners and employees accelerates innovation in DevOps.
