OverOps CTO Tal Weiss shares a few best practices for avoiding digital disasters like the recent Zoom outage.
When we first wrote about the impact of the coronavirus in March, almost a third of the U.S., and over a billion people around the world, were in virtual lockdown. Now, five months later, there are fewer people in lockdown, but the threat of COVID looms as large as ever.
Since March, we’ve seen the grave realities of adjusting to a global pandemic: businesses on a successful trajectory suddenly shut down completely; massive layoffs across nearly every major industry; basic business processes now held up for days by safety precautions. The list goes on. And even for the businesses that have managed to thrive in this tumultuous time, there are still significant hurdles and hard lessons to be learned.
Yesterday morning marked the return to semi-normalcy for many young people across the United States: the first day of a new school year. In true COVID fashion, most schools had spent the summer preparing for a virtual school year, focusing on video conference-based curricula. But as comedic timing would have it, students who woke up yesterday to join their peers and teachers remotely via Zoom found themselves unable to log on.
— Happening Hoops (@happeninghoops) August 24, 2020
Zoom began receiving reports of customer issues just before 9am ET. It wasn’t until about an hour later that the company announced they had identified the source of the issue – “an application-level bug,” according to Zoom’s President of Product and Engineering – and were working on rolling out a fix. The problems continued through the morning, and it wasn’t until shortly after 1pm ET that they officially announced the issue had been fully resolved.
A four-hour service degradation is a nightmare for any company on any day, let alone for one of the world’s most mission-critical businesses on one of its biggest days of the year. If something like this can happen to Zoom, a modern company that has presumably invested in cutting-edge engineering practices, is there any hope for other companies to avoid this kind of catastrophe?
Although not every edge case can be predicted, and not every team has the luxury of delaying a risky release, there are measures you can take not only to catch application-level issues earlier in the release cycle, but also to detect those that make it to production before your customers feel the pain. Below are just a few of them:
- Automate, Automate, Automate
As we’ve mentioned before, developers are great at writing code, but inherently limited in their ability to foresee where it will break down later. Given the massive volume of operational data and noise that high-scale environments produce, the task of detecting software issues in production and gathering information on them should be automated. The time and resources traditionally wasted on manually identifying and reproducing issues need to become a thing of the past.
- Focus on What’s New and What’s Spiking
Many of today’s tools only focus on deviations from known conditions and health checks, but when you’re looking at the impact of a new release on your application, you don’t have the luxury of knowing everything you need to look out for ahead of time. And when a critical production error occurs, you definitely don’t want to waste precious minutes searching through shallow log files that might not even contain the answers you’re looking for. Establishing a mechanism for identifying never-before-seen errors – both in pre-production and your customer-facing environment – is critical for quickly resolving unexpected issues. You can read more about exactly how to do this on your own here, or check out OverOps – we do this automatically. 😎
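To make the idea of a "never-before-seen" error concrete, here is a minimal sketch in Python. It is not how OverOps or any particular tool works; it simply assumes each error can be reduced to a fingerprint (exception type plus the code location that raised it) and compared against fingerprints recorded before the release. The `ErrorEvent` record, its fields, and both function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical, simplified error record taken from logs or an agent."""
    exception_type: str
    message: str
    top_frame: str  # assumed format: "module.function:line"


def fingerprint(event: ErrorEvent) -> str:
    """Reduce an event to a stable key so repeats of the same error collapse.

    The message is deliberately ignored because it often contains variable
    data (IDs, durations) that would make every occurrence look unique.
    """
    # Drop the trailing line number so unrelated edits do not change the key.
    frame = re.sub(r":\d+$", "", event.top_frame)
    raw = f"{event.exception_type}|{frame}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def find_new_errors(baseline: set[str], events: list[ErrorEvent]) -> list[ErrorEvent]:
    """Return the first occurrence of each error not present in the baseline."""
    seen = set(baseline)
    new_events = []
    for event in events:
        key = fingerprint(event)
        if key not in seen:
            seen.add(key)
            new_events.append(event)
    return new_events
```

In this sketch the baseline would be built from the fingerprints observed on the previous release, and the same check could run in both pre-production and production. Ignoring the message and line number is a design choice that trades some precision for far fewer false "new" alerts.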
- Think Differently About Peak Days – and Prepare Accordingly
This advice is twofold. First, start thinking differently about peak days. Just as the needs of your business have recently changed, so too have the needs of your customers. Throw out the pre-COVID playbook and start rethinking when your service might provide the most value to your users. Last year, the first day of the school year would not necessarily have been an important day for a company like Zoom, but we’re living in a new normal. Avoid the pitfalls encountered by companies like Robinhood and try to get a step ahead if you can.
Second, in anticipation of these peak days or seasons, prepare your mission-critical applications for possible edge cases and failures. Great care should be put into testing in pre-production, as well as into troubleshooting unexpected production issues in real time.
When a system experiences a dramatic change in input, as ecommerce platforms often do on major shopping holidays, the main risk is that a set of unforeseen errors quickly cascades across the environment and leads to systemic failure. What began as a local surge of errors in one or more services or components can rapidly generate errors in downstream or dependent services. This makes it extremely important to be able to ascertain which errors (out of what could be millions flooding the system) are new. See the advice above.
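Alongside new errors, it can help to flag known errors whose volume suddenly jumps during a surge. The following is a rough sketch under simple assumptions: error counts are already grouped by some key and collected over two windows of equal length. The function name and the `min_count` and `ratio` thresholds are illustrative placeholders rather than recommended values.

```python
from collections import Counter


def find_spiking_errors(baseline_counts, current_counts, min_count=50, ratio=3.0):
    """Return error keys whose current count is at least `ratio` times baseline.

    Keys absent from the baseline are skipped here: they are new rather than
    spiking and are assumed to be handled by a separate check.
    """
    spiking = []
    for key, count in current_counts.items():
        previous = baseline_counts.get(key, 0)
        if previous == 0:
            continue  # new error, not a spike of a known one
        # Require a minimum volume so rare errors do not trigger on noise.
        if count >= min_count and count >= ratio * previous:
            spiking.append(key)
    # Loudest errors first, so triage starts with the biggest surge.
    return sorted(spiking, key=lambda k: current_counts[k], reverse=True)


# Example: only "timeout" is both frequent enough and well above its baseline.
baseline = Counter({"timeout": 40, "npe": 100, "rare": 2})
current = Counter({"timeout": 400, "npe": 110, "rare": 10, "brand_new": 900})
spikes = find_spiking_errors(baseline, current)
```

A fixed ratio is the simplest possible rule. On a peak day, overall traffic rises too, so in practice you may prefer to compare error rates per request rather than raw counts.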
These are tricky times, but one silver lining of the many unprecedented challenges we’re facing might just be a renewed focus on delivering better quality software. It is my sincere hope that as we continue to settle into this strange reality, the practices I mentioned above will help your team avoid disaster.