What’s the connection between Robinhood services, Coronavirus and your application’s health? It’s all about the environment.
For those of you reading this post long after its publication, it might be hard to really sense the mood in which it was written – a weird time where the natural world and our physical/virtual infrastructures are colliding head-on.
The recent Coronavirus outbreak has wreaked havoc on many industries, as well as financial markets. This virus and the volatility it has brought into our lives are also having a heavy impact on the software we use every day.
Much in the same way that the virus represents an “edge case” that both our immune system and pharma-made vaccines have not foreseen, so too does its impact on mission-critical software expose many edge cases with major financial and business impact.
The case in point is Robinhood.
For the past few years, the unicorn startup has been disrupting the way many millennials invest and manage their money, making it easier to buy and sell stock and cryptocurrencies directly from your phone with minimal bank involvement. For major banks that rely on wealth management as a cornerstone of their business, Robinhood’s success has been quite a thorn in their side.
For Robinhood, the burden of proof was to show they can provide an infrastructure that is as scalable, reliable and secure as that of major banks who have been developing their trading infrastructure for the last quarter-century.
That promise fell flat this week when the market volatility brought on by COVID-19 triggered a set of edge cases that brought Robinhood’s service to its knees, leading to a deluge of bad coverage and a formal apology from its founders.
Contrary to the recent infamous Iowa Democratic primaries fiasco where an immature app that was never tested failed magnificently, Robinhood’s infrastructure is at the cutting edge of modern software engineering and testing. As such, we feel a good deal of sympathy for those at the front line of Robinhood’s DevOps team who are dealing with a set of unprecedented circumstances at a scale that neither they nor the financial markets themselves have experienced before.
This brings us to the question of what a DevOps/SRE team can do when a complex system encounters a surge of incoming data that their application was not fully designed to handle: a true instance of edge cases at scale. How can they ensure continuous reliability (CR)?
When discussing CR, we often focus on the impact of software changes on the reliability and security of a given application. We’ve previously outlined a set of techniques that teams can employ to reduce the probability of code changes impacting production environments and users.
However, in this case, the cause of the issue isn’t a change in code, but the dramatic change in input. This change manifests in both input frequency and unexpected variable states.
The main risk in situations like this is that a set of unforeseen errors that the system was never built to properly handle can quickly cascade across the environment, leading to systemic failure. At that point, what began as a local surge of errors in one or more services/components can rapidly begin to generate errors in any downstream or dependent services.
In effect, what begins as error volumes that number in the thousands can quickly grow to billions, drowning all log streams and monitoring tools in a torrent of duplicate alerts.
At this point, it becomes almost impossible to tell the cause of the issue from its symptoms. The more quickly the cracks in the dam can be identified, the more quickly they can be sealed before the dam collapses. The good news is that the techniques used to identify the root errors (vs cascaded ones) are very similar to the practices employed when verifying a new release as part of an effective CR pipeline.
The first step the team should take is to ascertain which errors (out of what could be millions of errors flooding the system) are new. This is important because edge cases exposed in times of high data volatility will most likely break brittle areas of the code that were not designed to handle this influx of data.
It is from that point that things usually start to deteriorate. Therefore it is critical as part of both the firefighting and the post-mortem to identify the original cracks in the system. The challenge in doing this is that it is incredibly hard to know from a massive, overflowing log stream which errors are new vs pre-existing and re-triggered by the incident.
This is where fingerprinting of errors based on their code location can be the difference between company heroes and a week (and weekend) of sleepless nights. By identifying and triaging new errors (i.e. those that the system has never experienced prior to the incident) the team stands the best chance of putting a lid on the issue quickly.
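To make the idea concrete, here is a minimal sketch of fingerprinting by code location, written in Python purely for illustration. It is not how any particular tool does it: the names `fingerprint` and `NewErrorDetector` are hypothetical, and the choice to hash the exception type plus the file and function of each stack frame (ignoring messages and line numbers) is an assumption made for this example.

```python
import hashlib
import traceback


def fingerprint(exc: BaseException) -> str:
    """Hash the exception type plus the code locations in its traceback.

    Messages and line numbers are left out on purpose (an assumption of this
    sketch), so the same failure keeps one fingerprint across varying inputs.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__]
    parts += [f"{frame.filename}:{frame.name}" for frame in frames]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]


class NewErrorDetector:
    """Hypothetical helper: remembers known fingerprints and flags unseen ones."""

    def __init__(self, known_fingerprints=()):
        # Fingerprints recorded before the incident (for example, last week's).
        self.known = set(known_fingerprints)

    def is_new(self, exc: BaseException) -> bool:
        fp = fingerprint(exc)
        if fp in self.known:
            return False
        # Remember it so the same error is only reported as new once.
        self.known.add(fp)
        return True
```

Seeded with the fingerprints collected before the incident, a detector like this would report each never-before-seen code location once, no matter how many millions of times the error repeats in the log stream.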
A few years ago when working with a major financial institution, we were performing a post-mortem on an infrastructure issue that brought their trading system around the world to a grinding halt. After days of analysis, we found that a new class of messages that were sent into their queuing system infrastructure exceeded the allowed size for that specific queue. The size of messages tested during acceptance tests was smaller than what they would encounter in production during a specific surge in trading.
Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and what would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt.
When the post-mortem was complete, we saw that if they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.
The skeptical reader would surely ask, “What if no new errors are detected, even if we could find them quickly?” A good question indeed. In that case, we move to the second modality, which is anomaly detection via deduplication.
With this approach, the team’s goal is to quickly see which pre-existing errors within the system are surging when compared to their baseline (i.e. ‘normal’ behavior). Those errors that begin surging before the system goes into convulsions are usually the bellwethers of the storm, and as such if identified and addressed quickly, can prevent the entire herd from falling over the cliff 🐑🐑🐑.
This approach, much like new error detection, is reliant on having the ability to fingerprint errors in a way which enables the team to quickly deduplicate errors and to “translate” their log streams from an infinite set of discrete events into a set of metrics that can be used to detect specific increases and correlate them to their origin within the code.
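As a rough illustration of that "translation" step, the sketch below collapses a stream of discrete error events into per-fingerprint counts per time bucket. The event shape (a pair of epoch seconds and a fingerprint string), the function name `bucket_counts`, and the 60-second bucket size are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def bucket_counts(events, bucket_seconds=60):
    """Collapse (epoch_seconds, fingerprint) events into per-bucket counts.

    Returns {fingerprint: Counter({bucket_start_epoch: count})}, so an
    unbounded stream of duplicate events becomes one small time series
    per error signature.
    """
    series = defaultdict(Counter)
    for epoch_seconds, fp in events:
        # Round the timestamp down to the start of its bucket.
        bucket = int(epoch_seconds) // bucket_seconds * bucket_seconds
        series[fp][bucket] += 1
    return series
```

Once errors are in this form, a million identical log lines become a single number per minute, which is something an alerting rule or an anomaly detection algorithm can actually work with.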
This also requires a long enough baseline (usually a week or more) so that an anomaly detection algorithm can spot which specific error signatures began spiking first.
If we apply this to the example of the bank queuing system provided above, in a case where no new error was generated by the illegal queueing operation (i.e. it has happened in the past at much lower frequency), it is critical that the team be able to spot which errors began spiking before the overall error volume of the environment began surging.
Given that ability, they would have been able to identify the original surge and take faster action to remediate – even if no new errors were brought on by the issue.
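One simple way to approximate this is sketched below. It assumes per-fingerprint counts are already available as lists of bucket counts, one list for the baseline period and one for the incident window, and it swaps in a plain z-score test as the anomaly check. That test, the threshold of 3.0, and the names `first_surge_bucket` and `rank_by_first_surge` are illustrative assumptions, not a description of any specific product's algorithm.

```python
from statistics import mean, pstdev


def first_surge_bucket(baseline_counts, incident_counts, threshold=3.0):
    """Return the index of the first incident bucket whose count exceeds the
    baseline mean by more than `threshold` standard deviations, else None."""
    mu = mean(baseline_counts)
    # A perfectly flat baseline has zero spread; fall back to 1.0 (assumption).
    sigma = pstdev(baseline_counts) or 1.0
    for i, count in enumerate(incident_counts):
        if (count - mu) / sigma > threshold:
            return i
    return None


def rank_by_first_surge(baselines, incidents, threshold=3.0):
    """Order fingerprints by when they first surged, earliest first."""
    surges = []
    for fp, incident_counts in incidents.items():
        baseline = baselines.get(fp)
        if not baseline:
            continue  # no history: this is a new error, handled separately
        idx = first_surge_bucket(baseline, incident_counts, threshold)
        if idx is not None:
            surges.append((idx, fp))
    return [fp for idx, fp in sorted(surges)]
```

In the bank example, a signature for the rejected oversized messages would be expected to sort ahead of the downstream timeouts and overflow errors, pointing the team at the first crack rather than at the flood that followed.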
While we may never know exactly what happened within the confines of the Robinhood data center, the damage to their company is something that will long be talked about within the trading industry. At OverOps we work with many customers who are recovering from similar catastrophic events (or are hoping to avoid them) by implementing continuous reliability capabilities that can be used during both software changes and, as in the case of the last few days, environment changes.
Regardless of which tools or practices you use, it is critical that you verify reliability under both normal conditions and those of acute duress to keep your software healthy and reliable.