Know your unknowns with the help of continuous reliability.
Let’s rewind to the year 2002 (this should give you an idea of how long I have been working in software development). Mr Donald Rumsfeld, then United States Secretary of Defense, had this to say in response to a question:
While this statement was not made in reference to software, the underlying principles are applicable to the way we think about software troubleshooting. When preparing any software application for a production deployment, there are a number of important items to consider:
Known Knowns
These are predictable, usually ‘managed’ scenarios that occur frequently in a production environment. They need to be validated manually or with automation tools, static analyzers and the like. There is a plethora of processes and tools to handle these. Let’s dive deeper.
- Unit tests: Usually run by a developer, or when a developer integrates their code with that of a larger team. These test small pockets of functionality and often use mock objects or mock data to isolate the code under test.
- Automated Regression Tests (or ART): Automated regression or integration tests run as part of a suite before code is promoted. These are more exhaustive and can cover cross-module or cross-application scenarios.
- Manual tests: A Business Analyst (or Quality Assurance person) would go through a list of predefined tests to validate the application. Oftentimes they veer from this predefined list based on the need of the hour.
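To make the unit test bullet concrete, here is a minimal sketch of a test that uses a mock object. The function `calculate_commission` and the `rate_service` collaborator are hypothetical names invented for this illustration; only `unittest.mock.Mock` comes from the Python standard library.

```python
from unittest.mock import Mock


def calculate_commission(policy_id, rate_service):
    """Return the commission for a policy (hypothetical example function)."""
    # rate_service is an assumed collaborator that returns (premium, rate).
    premium, rate = rate_service.lookup(policy_id)
    return round(premium * rate, 2)


def test_commission_uses_rate_from_service():
    # The mock stands in for a real rate service (database, remote API, etc.),
    # so the test exercises only the calculation logic.
    rate_service = Mock()
    rate_service.lookup.return_value = (1000.0, 0.05)

    assert calculate_commission("P-1", rate_service) == 50.0
    rate_service.lookup.assert_called_once_with("P-1")
```

A test runner such as pytest would pick up the `test_` function automatically. Note what the test covers: one small pocket of functionality, with every external dependency under the developer's control.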
As the name implies, this usually accounts for the “Known” and managed flavors. Foresight is not needed as most scenarios are contained.
Known Unknowns
In production, there are a number of expected scenarios that are unfortunately not accounted for. Some of them include:
- Volume and Scale: In this age of COVID-19, volume and scale are incredibly difficult to predict. We have already seen large brokerage firms go down. While most enterprises nowadays test for volume and scale, without precedent, the right scale is often tough to estimate. The eventual goal is to make sure that what works for 100 customers/transactions should also work for 10 million customers/transactions. This applies to both infrastructure and application scale.
- ALL known Business scenarios: No application can validate every possible scenario. For example, back when I was a developer writing code to calculate commissions for insurance agents, every state had different licensing, registration and commission rules. It was not possible to test every permutation and combination. Scenarios like these are where we start to make logical assumptions.
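A quick back-of-the-envelope count shows why testing every combination is unrealistic. The dimensions and their sizes below are made up for illustration and are not the rules of any real commission system.

```python
from itertools import product

# Hypothetical rule dimensions; a real system would have its own (and more).
states = [f"state_{i}" for i in range(50)]
license_types = ["resident", "non_resident", "temporary"]
registration_statuses = ["active", "pending", "lapsed"]
product_lines = ["life", "health", "auto", "home"]
commission_tiers = ["new_business", "renewal", "bonus"]

# Every combination of one value per dimension is a distinct scenario.
scenarios = list(product(states, license_types, registration_statuses,
                         product_lines, commission_tiers))
scenario_count = len(scenarios)
print(scenario_count)  # 5400
```

Five modest dimensions already produce thousands of scenarios, and each new dimension multiplies the total.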
Finally, while there are multiple variations of the ‘known unknowns’, as the name implies, this list covers errors that are expected but unaccounted for.
Unknown Unknowns
Production is the wild wild west. Anything can happen: unpredictable errors, bugs, slowdowns, scale and performance issues, etc. Every application will have some. And then some more.
Most of them are impossible to predict and cannot be uncovered using traditional methods:
- Log Analysis tools: What happens when there exists no correlation between the logs and the unknowns?
- APM tools: How does one troubleshoot when the unknown is not a slowdown and cannot be identified with traditional APM metrics?
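As an illustration of both questions, consider a swallowed exception. The sketch below uses invented function names and is not taken from any real application; it shows an error that leaves no trace in the logs and causes no slowdown.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")


def parse_quantity(raw):
    """Parse an order quantity, silently falling back to 1 on bad input."""
    try:
        return int(raw)
    except ValueError:
        # Swallowed: no log line, no metric, no added latency.
        return 1


def place_order(item, raw_quantity):
    quantity = parse_quantity(raw_quantity)
    logger.info("order placed: %s x%d", item, quantity)
    return quantity


# The customer typed the letter "O" instead of a zero, so "1O" is not a number.
place_order("widget", "1O")
```

The only log line reports a successful order for one widget. Nothing in the logs correlates with the failure, and response time is unaffected, so neither tool has anything to flag.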
Handling the ‘Known Unknowns’ and ‘Unknown Unknowns’
Traditional CI/CD and reliability practices stop at the point of release, that is, when code moves into production. The “unknowns” usually surface after that move, hence the disconnect.
Continuous Reliability (CR) bridges this gap. CR is defined by the ability to prevent code changes from impacting the customer through the combination of automated code quality gates, contextual feedback loops and observability within a CI/CD workflow. When an issue does reach production, Continuous Reliability helps identify and resolve the issue before the customer is impacted.
While there is no magic pill to handle the “unknowns”, OverOps’ Continuous Reliability solution allows enterprises to continuously Identify, Prevent and Resolve issues as they happen.
OverOps identifies every error, irrespective of whether it is reported or not. If you have an “unknown” error that happens at 2:00 AM after your Friday release, OverOps immediately notifies you. Let’s assume the application owner gets to work at 9:00 AM the following business day. They will have all the context about the issue in their inbox before the customer can report it, and can prioritize it based on criticality and business needs.
OverOps provides the ability to prevent bad code from being promoted into production by using quality gates as part of your CI/CD process. While “unknown unknowns” usually manifest in production, the ability to recreate and prevent them in a pre-production environment saves time, money and lots of energy.
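To give a feel for what a quality gate does, here is a generic sketch. It is not OverOps’ API or implementation; `ErrorEvent`, `fingerprint` and `evaluate_gate` are hypothetical names, and the rule (block on any new critical error, or on more new errors than a threshold) is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    fingerprint: str  # hypothetical stable ID for an error (type + location)
    critical: bool


def evaluate_gate(baseline, candidate, max_new_errors=0):
    """Compare a candidate build's errors to a baseline build's errors.

    Returns (passed, new_errors), where new_errors are the candidate
    errors whose fingerprints do not appear in the baseline.
    """
    known = {event.fingerprint for event in baseline}
    new_errors = [e for e in candidate if e.fingerprint not in known]
    has_new_critical = any(e.critical for e in new_errors)
    passed = len(new_errors) <= max_new_errors and not has_new_critical
    return passed, new_errors


baseline = [ErrorEvent("NullPointer@OrderService:42", critical=False)]
candidate = baseline + [ErrorEvent("Timeout@PaymentClient:17", critical=True)]
passed, new_errors = evaluate_gate(baseline, candidate)
```

In this example `passed` is `False`, because the candidate build introduces a critical error the baseline never produced. A CI step could exit with a non-zero status in that case to stop the promotion.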
The biggest issue with the “unknowns” is determining context. OverOps provides valuable context about every error, including error metrics, the code graph, the source code, the exact line where the error occurred and, finally, the data associated with the error. This in turn gives DevOps engineers and developers the contextual information they need to resolve issues faster.
In summary, the reality is that ALL software has bugs of varying shapes and sizes. Some of them are known, and some unknown. Some of them impact the customer negatively, and some don’t. Regardless, the goal of every Support, DevOps, SRE, Development and Application team is to identify, prioritize and resolve them quickly. OverOps helps them do just that. Try it for yourself.