If you’re interested in the IT Ops Management space (go here for a general intro), it is likely that you have heard of AIOps.
But is it just a buzzword that everyone paints over their solution? Is it really going to change how we operate applications, services, infrastructure and so on?
Is there magic behind that?
Let me share some background, so you can make up your own mind about what it means for you.
I like definitions, especially simple ones. With that, I recommend one I found over at Gartner:
AIOps is the application of machine learning (ML) and data science to IT operations problems.
It’s a little businesslike, and it doesn’t touch on the aspect of user experience. Forrester mentions this in its glossary:
Software that applies AI/ML or other advanced analytics to business and operations data to make correlations and provide prescriptive and predictive answers in real time. These insights produce real-time business performance KPIs, allow teams to resolve incidents faster, and help avoid incidents altogether.
Look at: “Providing … answers”, and “resolving incidents faster & avoiding them”:
Those are critical points to me. I believe Ops Management should always be focussed on reducing MTTR, increasing MTBF and lowering the associated cost of management. It is also something we have been focusing on with our portfolio. More on that later.
For now, let’s look at some IT Operations problems, and what the application of ML could do in that context:
- Too much noise: IT environments are growing, and so is the number of applications. Instrumentation is also getting broader and deeper. As a result, Ops teams, whether traditional, DevOps or SRE, face a rapidly increasing amount of information. In essence, they need to filter out what is relevant so they can act quickly.
- Knowledge not current enough: Applications are becoming more dynamic, and microservices and rapid deployment cycles are becoming very popular. The state of your environment is therefore far from static, and keeping up with those changes is tough.
- Too slow in responding: Operators struggle to respond in time. This is often caused by too much noise. Another reason is the high number of changes in the system under observation.
Now how could Machine Learning help here?
Let’s look at some solution approaches for each of the challenges above.
- Filter for me and help me prioritise:
An AIOps system could use Machine Learning based on historic evidence of your data to:
⁃ Group pieces of information, e.g. events, that belong together
⁃ Show you the most relevant events first
⁃ Help you pinpoint a probable cause
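
To make the grouping and ranking idea a bit more concrete, here is a minimal Python sketch. It is not how any particular product works. It assumes a deliberately simple stand-in for "learning from historic evidence": counting which event types showed up together in past incidents. The function names, the `type` and `severity` fields and the `min_support` threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def learn_related_pairs(history, min_support=2):
    """Return event-type pairs that appeared together in at least
    `min_support` past incidents. `history` is a list of sets of event types."""
    counts = Counter()
    for incident in history:
        for pair in combinations(sorted(incident), 2):
            counts[pair] += 1
    return {pair for pair, seen in counts.items() if seen >= min_support}


def group_and_rank(events, related_pairs):
    """Group events whose types are linked by `related_pairs`, then return
    the groups ordered by their highest severity, most severe first."""
    def is_related(event, group):
        return any(
            tuple(sorted((event["type"], other["type"]))) in related_pairs
            for other in group
        )

    groups = []
    for event in events:
        linked = [g for g in groups if is_related(event, g)]
        rest = [g for g in groups if not is_related(event, g)]
        # Merge the new event with every group it is related to.
        merged = [e for g in linked for e in g] + [event]
        groups = rest + [merged]
    return sorted(groups, key=lambda g: max(e["severity"] for e in g), reverse=True)
```

A real system would use far richer signals (timing, topology, text similarity), but the shape is the same: learn relations from history, group what belongs together, and put the most severe group at the top of the list.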
To gather that evidence and deliver the right recommendations, it needs to continuously discover and collect data from a broad range of sources. Those sources range from logs to metrics and events. This also leads to the next point:
- Give me context, and keep up with all the changes automatically:
An AIOps system should always be on top of the latest developments in your infrastructure. It should pull live information from management systems, like Hypervisors, Container Orchestrators, CI/CD pipelines etc. Additionally, it should also store history. It should be able to detect dependencies between services, applications and infrastructure. Events and incidents should be shown in the context of such dependencies.
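
As a rough illustration of why dependency context matters, the sketch below takes a small dependency map and a set of alerting components and keeps only those that have nothing alerting beneath them. This is a simplification I am assuming for the example: the map is hand-written rather than discovered, it is expected to be free of cycles, and `probable_causes` and the component names are hypothetical.

```python
def probable_causes(depends_on, alerting):
    """Return alerting components with no alerting component beneath them.

    `depends_on` maps a component to the components it relies on.
    `alerting` is the collection of components that currently raise events.
    """
    alerting = set(alerting)

    def beneath(node, seen=None):
        # Collect everything `node` depends on, directly or indirectly.
        seen = set() if seen is None else seen
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                beneath(dep, seen)
        return seen

    return sorted(n for n in alerting if not (beneath(n) & alerting))
```

If a web shop, its checkout API and the database behind it all alert at once, only the database is returned, which is usually the better place to start looking.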
- Help me respond, or respond automatically:
An AIOps system should provide recommendations on resolution actions. Alternatively, it should be able to run automated actions as a response to events & incidents.
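
The difference between recommending and acting can be sketched as a small runbook lookup. Everything here is an assumption for illustration: the `RUNBOOK` table, the event fields and the `restart_service` placeholder are invented, and a real action would call out to an automation or orchestration tool instead of returning a string.

```python
def restart_service(event):
    # Placeholder only: stands in for a call to a real automation tool.
    return f"restarted {event['resource']}"


# Hypothetical runbook: every entry has advice, some also have an automated action.
RUNBOOK = {
    "service_down": {"advice": "Restart the service", "action": restart_service},
    "disk_full": {"advice": "Free space or extend the volume", "action": None},
}


def respond(event, runbook=RUNBOOK):
    """Return a (mode, detail) pair: run the action, recommend, or escalate."""
    entry = runbook.get(event["type"])
    if entry is None:
        return ("escalate", "No known resolution, route to an operator")
    if entry["action"] is not None:
        return ("automated", entry["action"](event))
    return ("recommend", entry["advice"])
```

Keeping advice and action side by side makes it easy to start with recommendations only and switch on automation per event type once the team trusts it.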
With Netcool, we are addressing many of the areas above, and you can read about that here:
- General overview of getting from IT Ops to AI Ops
- How topology helps you in finding weak spots in your environments
- How we use Machine Learning to detect correlation in event data (with videos!)
I really believe AIOps is going to help master the challenges mentioned above. It could be a lifesaver for IT Ops Management.
What do you think? I’d also love to hear about your most important IT Operations problems, so please leave a comment.