According to Gartner, the average cost of IT downtime is $5,600 per minute, which typically works out to between $140,000 per hour on the low end and $540,000 per hour on the high end. And that's not even counting the "hidden" costs of downtime, such as erosion of trust, something that can happen especially fast when it comes to financial services applications.
But how do we reduce downtime without spending more on prevention than the downtime itself would cost?
Unplanned downtime is far worse than planned downtime because you didn't anticipate it, and a total system failure is the worst case of all. Unfortunately, unplanned downtime is a given. The question is: how do you deal with it?
The Harsh Reality: There Will Always Be Some Downtime
With a distributed architecture, being up all the time is simply unrealistic.
So, instead of chasing the unrealistic goal of 100% uptime, or even 99.999% uptime, a better route is to focus on fallback and rollback strategies.
If you accept, at least on the surface, that software failures are going to happen, then you can start to focus on how best to respond to the inevitable downtime situations.
How can you get things up and running again as quickly as possible?
There are many strategies to employ for this, including canary or A/B testing, rolling out fewer features per node (perhaps just two), and avoiding upgrading everything at once so that you can turn things on and off quickly.
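To make the canary idea concrete, here is a minimal sketch in Python. The `CanaryRouter` class, its thresholds, and the routing logic are illustrative assumptions, not any particular product's API: a small fraction of traffic goes to the new version, the fraction grows step by step, and it drops back to zero the moment the error rate spikes.

```python
import random


class CanaryRouter:
    """Routes a fraction of requests to the new version; rolls back on errors."""

    def __init__(self, error_threshold=0.05, step=0.1):
        self.canary_fraction = 0.0          # share of traffic on the new version
        self.error_threshold = error_threshold
        self.step = step
        self.errors = 0
        self.requests = 0

    def route(self):
        """Return 'canary' or 'stable' for an incoming request."""
        self.requests += 1
        return "canary" if random.random() < self.canary_fraction else "stable"

    def record_error(self):
        self.errors += 1

    def evaluate(self):
        """Advance the rollout, or roll back entirely if errors spike."""
        error_rate = self.errors / max(self.requests, 1)
        if error_rate > self.error_threshold:
            self.canary_fraction = 0.0      # instant rollback: all traffic to stable
        else:
            self.canary_fraction = min(1.0, self.canary_fraction + self.step)
        self.errors = self.requests = 0     # fresh window for the next evaluation
```

The point is the shape, not the numbers: rollback is a cheap, instant state change rather than a redeploy.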
But all of the above comes down to being prepared.
Why Planning and Strategizing Are Key
In many respects, microservices and distributed systems enable us to plan for failures far more easily than the monolithic apps of the past.
That said, microservices won’t solve the problem for you. The key to quick recovery is to plan ahead and do things at the whiteboard first:
- What’s your failover strategy?
- What’s your rollback strategy?
- How will this be deployed in your nightly builds, and how can you push out updates on a regular basis?
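To make the failover question concrete, here is a minimal sketch in Python. The `FailoverClient` class and its endpoint names are hypothetical; in practice this logic usually lives in a load balancer or service mesh, but the shape is the same: try the primary, fall through to a secondary, and surface an error only when every option is exhausted.

```python
class FailoverClient:
    def __init__(self, endpoints):
        self.endpoints = endpoints          # ordered by preference: primary first

    def call(self, request, send):
        """Try each endpoint in order; raise only if all of them fail."""
        last_error = None
        for endpoint in self.endpoints:
            try:
                return send(endpoint, request)
            except ConnectionError as exc:
                last_error = exc            # fall through to the next endpoint
        raise last_error


# Usage sketch with a stubbed transport: the primary is "down", so the
# request transparently lands on the secondary.
def send(endpoint, request):
    if endpoint == "primary":
        raise ConnectionError("primary is down")
    return f"{endpoint}:{request}"


client = FailoverClient(["primary", "secondary"])
```

Answering this question at the whiteboard, before the outage, is exactly the preparation the section above is arguing for.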
DevOps teams still struggle with a "cross our fingers" approach to deployments, and that's a major part of the unplanned downtime problem.
“Failure is Always an Option”
As they say in the TV show MythBusters, “Failure is always an option.”
That is, just because bad things happen doesn't mean they have to be disastrous. You can recover quickly, and it's never an "all or nothing" proposition. You can turn features on or off and identify things that are likely to cause problems before they occur.
If you can get better at:
- Identifying problems before they happen;
- Recovering from them before a customer notices;
- And failing over gracefully when the inevitable downtime happens…
Then you will be fulfilling your role as a software leader.
A lot of this has to do with the modularity of microservices.
Part of what microservices allow us to do is separate concerns: deploying in a piecemeal fashion and using tools like feature flags to decouple the code-to-deployment motion from the turning-on-the-features motion.
Turning things on gradually allows us to root out problems and prevent the all-or-nothing approach to deployment that so often leads to major failures.
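Here is a minimal sketch of that feature-flag pattern in Python, with hypothetical flag names, stubbed payment functions, and an in-memory flag store (real systems typically use a flag service or configuration database). The new code path ships dark; flipping the flag, not redeploying, turns the feature on or off.

```python
FLAGS = {
    "new-payment-flow": False,        # code is deployed, but the feature is dark
    "redesigned-dashboard": True,
}


def is_enabled(flag, flags=FLAGS):
    return flags.get(flag, False)     # default to off for unknown flags


# Stubbed implementations standing in for real payment paths.
def process_payment_v1(order):
    return f"v1:{order}"


def process_payment_v2(order):
    return f"v2:{order}"


def checkout(order):
    if is_enabled("new-payment-flow"):
        return process_payment_v2(order)
    return process_payment_v1(order)  # old path stays live for instant rollback
```

Because both paths are deployed, turning the feature off is a data change that takes effect immediately, which is what prevents the all-or-nothing failure mode described above.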
Related Panel Discussion: 4 BIG Software App Challenges For FinServ
This past May, Bob Kemper, VP of Worldwide Engineering at OverOps, and Anders Wallgren, VP of Technology Strategy at CloudBees, took part in a panel discussion on the 4 BIG Software App Challenges For FinServ. In part 2, the two discuss the challenges of avoiding downtime: