AIOps, at its core, is a data-driven practice of bridging resources and leveraging AI and machine learning to make predictions based on historical data.
AIOps seems to be all the rage these days, and it’s not hard to figure out why. It sounds like a new magical solution to resolving all errors ever! Ok maybe not quite, but it certainly promises to add serious firepower to DevOps teams’ monitoring arsenals.
Read on for some basic AIOps 101, the major value propositions for implementing AIOps solutions, and a deep dive into the data you need to make the magic work (it’s not really magic, but it might as well be).
Let’s do it.
First defined by Gartner in 2016, AIOps means taking artificial intelligence (AI) and machine learning (ML) practices and applying them to monitoring and error resolution for IT Operations. Traditional challenges that IT Operations teams face may eventually be answered by AIOps. For now, teams have already started applying some ML to in-house monitoring practices, and some have adopted off-the-shelf AI solutions like Splunk’s IT Service Intelligence or Moogsoft’s AIOps tool.
Machine learning and artificial intelligence are complex concepts, but understanding the benefits of applying them to application monitoring and operations is relatively straightforward. All you really need to know about AI and machine learning (ML) as they relate to your application is that they’re all about recognizing patterns in historical data and making predictions based on those patterns.
Think about the endless amounts of data produced by your application every minute. With access to the right data, these tools may even be able to predict when the next issue will arise. For example, with historical data about an application’s releases and the number of new issues introduced in each release, they can predict how many issues future releases are likely to have.
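To make the release example concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that release size (lines changed) is the signal used to predict new issues, and it fits a plain least-squares line to made-up history. The function names and the numbers are hypothetical, and real AIOps tools likely use richer models and many more signals.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar

def predict_new_issues(history, lines_changed):
    """Estimate new issues for a release of the given size (never below zero)."""
    slope, intercept = fit_line([h[0] for h in history], [h[1] for h in history])
    return max(0.0, slope * lines_changed + intercept)

# Hypothetical history: (lines changed in the release, new issues it introduced)
history = [(300, 1), (800, 3), (1200, 4), (1500, 5), (2500, 9)]
print(round(predict_new_issues(history, 2000), 1))
```

The prediction is only as good as the history behind it: with five data points and one feature, this is a rough trend line rather than a forecast you would page someone over.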
In 2017, usage of AIOps tools in enterprise application development was at 5%. Growth estimates for this number are astronomical, with Gartner estimating that by 2019 we’ll be at 25% of enterprises adopting some kind of AIOps tool and 40% by 2022.
With such a low number of companies having implemented one of these tools at this time, successful case studies of companies using these tools are few and far between. For now, the general sentiment seems to be that AIOps may be too good to be true. The idea certainly sounds like the answer to our prayers, but what capabilities do the tools really have? Sure, AI and ML give tools predictive capabilities, but if they rely on historical data for a knowledge-base then they can only predict events based on data that’s been routinely collected.
A good example of something an AIOps tool would be able to do is find patterns in application performance like CPU usage, and predict when a spike is likely to occur again. For these kinds of performance metrics, AIOps tools can be instrumental in proactively identifying slowdowns and other performance concerns. The next question, then, is how much value do we gain from this information?
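As a rough illustration of the CPU example, the sketch below builds an hour-of-day profile from past CPU samples and reports the hours where a spike has historically been the norm. The sample data, the 80% threshold, and the function names are assumptions made for this example, not the behavior of any particular tool.

```python
from collections import defaultdict
from statistics import mean

def hourly_profile(samples):
    """Average CPU percentage for each hour of the day seen in the samples."""
    by_hour = defaultdict(list)
    for hour, cpu_percent in samples:
        by_hour[hour].append(cpu_percent)
    return {hour: mean(values) for hour, values in by_hour.items()}

def likely_spike_hours(samples, threshold=80.0):
    """Hours whose historical average CPU exceeds the threshold."""
    profile = hourly_profile(samples)
    return sorted(hour for hour, avg in profile.items() if avg > threshold)

# Hypothetical samples: (hour of day, CPU percent)
samples = [(9, 85.0), (9, 91.0), (14, 40.0), (14, 55.0), (20, 78.0), (20, 88.0)]
print(likely_spike_hours(samples))  # [9, 20]
```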
3 Value Props for Adding AIOps to Your Tooling Arsenal
Be proactive instead of reactive.
This is an easy one. AIOps gives us predictive capabilities so the first thing we want to do is know when an issue is going to happen and address or resolve it before it reaches our customers. With data on circumstances that previously led to issues, it’s possible to predict when a similar issue is likely to repeat itself.
One of the most obvious trends that would be caught by any AIOps tool worth its salt (and hopefully by anybody working on e-commerce apps) is the traffic surge that occurs around the holidays. AIOps tools can analyze vast amounts of data, though, and are able to identify patterns that are much more complex than simple seasonal trends. Using these tools, we expect to be able to identify anomalous events in advance so that we can prepare for them.
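One simple way to think about "anomalous" is a value that sits far outside what comparable periods looked like in the past. The sketch below uses a basic z-score check, which is a stand-in chosen for this example rather than what any specific AIOps product does. The history values and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=3.0):
    """True when value is more than z_threshold standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical requests per minute at the same hour on previous weeks
same_hour_history = [100, 102, 98, 101, 99]
print(is_anomalous(same_hour_history, 150))  # True
print(is_anomalous(same_hour_history, 101))  # False
```

Comparing against the same hour or the same weekday keeps ordinary seasonal swings from being flagged, which leaves the alert budget for the patterns that are actually unusual.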
Save time and effort monitoring performance and reliability.
Monitoring and alerting platforms are already excellent at what they do. Alerts can be set up to go out to the “right” person immediately after an event occurs, and that person can immediately begin investigating what happened. With AIOps the same thing is possible, except the alert comes before the issue occurs. Something happening in the app that resembles what happened before a previous issue arose triggers an alert that the issue is likely to occur again.
This is obviously an oversimplified explanation of what AIOps does, but you get the picture. We’re already ahead of where we were without it. We got the alert and nothing really happened yet. I’m starting to feel like a broken record, but it’s worth emphasizing here – with access to the right data, these alerts can even include highly specific root cause analysis. It all depends on what data you feed the algorithms.
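The alert-before-the-issue idea can be sketched in a few lines. The example below learns, from past data, how often each event type was followed by an incident within a time window, then alerts when a high-risk event type shows up again. The event names, the 300-second window, and the 0.8 cutoff are all hypothetical, and this counting approach is a deliberately simple stand-in for whatever model a real tool would use.

```python
from collections import Counter

def precursor_rates(events, incidents, window=300):
    """Fraction of each event type's occurrences followed by an incident within `window` seconds."""
    seen, followed = Counter(), Counter()
    for timestamp, event_type in events:
        seen[event_type] += 1
        if any(0 < t - timestamp <= window for t in incidents):
            followed[event_type] += 1
    return {etype: followed[etype] / seen[etype] for etype in seen}

def should_alert(event_type, rates, min_rate=0.8):
    """Alert when this event type has usually preceded an incident."""
    return rates.get(event_type, 0.0) >= min_rate

# Hypothetical history: (timestamp in seconds, event type) and incident timestamps
events = [(0, "pool_exhausted"), (100, "cache_miss"), (1000, "pool_exhausted"),
          (1100, "cache_miss"), (5000, "cache_miss")]
incidents = [200, 1200]

rates = precursor_rates(events, incidents)
print(should_alert("pool_exhausted", rates))  # True
print(should_alert("cache_miss", rates))      # False
```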
Connect issues to business impact.
Tying performance metrics and event data to things like revenue and customer satisfaction helps DevOps teams understand how issue resolution time and error regression (for example) directly impact the company’s bottom line.
This understanding gives further clarification of how to prioritize the investigation and resolution of one issue over another. It also helps teams see the potential return on investment (ROI) for adopting new tooling or hiring new team members.
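As a small illustration of prioritizing by business impact, the sketch below ranks open issues by an estimated hourly revenue loss. The `Issue` fields, the sample issues, and the loss formula (failed requests times average revenue per request) are assumptions invented for this example; a real implementation would pull these figures from your own business data.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    name: str
    failed_requests_per_hour: int
    revenue_per_request: float  # hypothetical average revenue tied to one request

def estimated_hourly_loss(issue):
    """Rough revenue at risk per hour while the issue stays open."""
    return issue.failed_requests_per_hour * issue.revenue_per_request

def prioritize(issues):
    """Order issues from highest to lowest estimated hourly loss."""
    return sorted(issues, key=estimated_hourly_loss, reverse=True)

# Hypothetical open issues
issues = [
    Issue("search timeout", 500, 0.02),
    Issue("checkout failure", 40, 35.0),
]
for issue in prioritize(issues):
    print(issue.name, estimated_hourly_loss(issue))
```

Here the checkout failure outranks the search timeout even though it affects far fewer requests, which is exactly the kind of ordering that raw error counts alone would get wrong.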
Getting the Right Data For Your AIOps Implementation
In a recent article, TechBeacon writes that, “at its core, AIOps is data-driven, so it requires access to all relevant operations data, including unstructured machine data such as logs, metrics, streaming data, API outputs, and device data.” (See? It’s not just me saying that the data is important!) Beyond that, AIOps acts as a bridge between existing tooling to tear down data silos.
So, what data do you really need to get your hands on?
Most (if not all) IT Operations teams are already using APM solutions like AppDynamics or Dynatrace and log aggregators like Splunk. With access to historical performance metrics, AIOps tools and solutions can predict when performance issues like slowdowns are likely to occur. The prediction could be based on seasonal (or other) surges in traffic volume, or on a seemingly arbitrary event that no human would pay much attention to but that has preceded previous slowdowns.
Log data can also be leveraged for AIOps, though it’s not an ideal data source. Events can, to a certain extent, be analyzed and filtered to reduce noise and to create a hierarchy of priority. The main issue with logs when it comes to AI and ML is that log statements are written manually by humans. Anything that wasn’t logged has no log data attached to it, and what is logged is generally inconsistent and shallow. It’s a toss-up whether the logs will contain any variable information at all about the code where the event occurred (more than 50% of log statements don’t contain any variable information).
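To show what filtering logs for noise and building a priority hierarchy might look like, here is a minimal sketch. It assumes a hypothetical log format of `LEVEL message`, collapses messages that differ only in numbers or hex ids into one template, and ranks the results by severity and then frequency. The severity ordering and the masking rule are choices made for this example.

```python
import re
from collections import Counter

# Lower rank means higher priority; unknown levels sort last.
SEVERITY_RANK = {"ERROR": 0, "WARN": 1, "INFO": 2, "DEBUG": 3}

def template_of(message):
    """Mask numbers and hex ids so repeated events collapse to one template."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<*>", message)

def summarize(lines):
    """Count (level, template) pairs, ranked by severity and then by frequency."""
    counts = Counter()
    for line in lines:
        level, _, message = line.partition(" ")
        counts[(level, template_of(message))] += 1
    return sorted(counts.items(),
                  key=lambda kv: (SEVERITY_RANK.get(kv[0][0], 99), -kv[1]))

# Hypothetical log lines in "LEVEL message" form
lines = [
    "INFO user 42 logged in",
    "INFO user 7 logged in",
    "ERROR payment 991 failed",
    "INFO cache warmed",
]
for (level, template), count in summarize(lines):
    print(level, template, count)
```

Note that the two login lines collapse into one template precisely because the user id is the only variable in them. A statement like "cache warmed" carries no variable information at all, which is the shallow-log problem described above.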
To complement these types of data, what we really need to be able to access is code-aware machine data. We want to know, down to the line of code and the variables that passed through it, what happened in the application that caused an error. By leveraging this data with AIOps solutions, we can know more than just when an issue is likely to occur: we can predict errors and exceptions along with their detailed root cause analysis.
OverOps is a software reliability platform that applies machine learning to your code as it’s running to automatically identify anomalies such as newly introduced issues, error regressions and slowdowns, and to provide their root cause to the right person.
When an anomalous failure occurs, a complete picture of the code is provided, including:
- Execution stack, source code and complete variable state
- Previous 250 log statements (including DEBUG- and INFO-level, even in production)
- Frequency and failure rate for ALL known and unknown errors and exceptions
- Classification of new versus reintroduced errors
- Which release or build is associated with each specific event
- Additional event analytics
The data collected is unique to the OverOps platform but can easily be leveraged by other tools like Splunk, AppD and Grafana for monitoring, alerting and visualization purposes. Not only does access to this information greatly reduce resolution time for issues in all environments including production, it provides deep visibility into the overall quality of new deployments and of the application as a whole.
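To give a feel for the "new versus reintroduced" classification mentioned in the list above, here is a generic sketch. It is not how OverOps implements it; the fingerprint format (exception type plus top stack frame), the status values, and the function name are all hypothetical and exist only to illustrate the idea.

```python
def classify_error(fingerprint, history):
    """Classify an error given a map of fingerprint -> "open" or "resolved"."""
    status = history.get(fingerprint)
    if status is None:
        return "new"            # never seen before
    if status == "resolved":
        return "reintroduced"   # was fixed once and has come back
    return "known"              # already open and being tracked

# Hypothetical fingerprints: exception type plus the top stack frame
history = {
    "NullPointerException@CartService.total": "resolved",
    "TimeoutException@SearchClient.query": "open",
}
print(classify_error("NullPointerException@CartService.total", history))  # reintroduced
print(classify_error("IllegalStateException@Checkout.submit", history))   # new
```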