The release of Netcool Operations Insight 1.6 has taken Operations Management and AIOps to the next level, from the all-new, completely redesigned, incident-centric UX to the dramatic advancement in event correlation, which combines machine learning, tribal knowledge, and topology into a single, unified correlation engine. So, what’s “wow” about Netcool Operations Insight 1.6? Let’s take a closer look.
ALL-NEW ANALYTICS & CORRELATION ENGINE
There are many different ways to skin the proverbial cat, and the same is true of event correlation. Netcool Operations Insight has previously centred around two main ways to do event correlation: analytics-based and scope-based. Analytics-based event correlation (also referred to as temporal correlation) uses machine-learning algorithms to analyse the event history, looking for events that consistently occur together. It then automatically groups these events together if they occur again in future. Scope-based grouping works on the basis of grouping events that occur “at the same place, at the same time”. It allows the user to define what the scope should be. For example, the scope might be the geographical location the events have come from, or it might be a logical grouping, such as a line-of-business or application group.
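The “same place, same time” idea behind scope-based grouping can be sketched in a few lines. This is only an illustration, not the product's implementation: the field names `id`, `scope` and `time`, and the 300-second window, are assumptions made for the example.

```python
def group_by_scope(events, window_seconds=300):
    """Group events that share a scope and arrive within a time window.

    Each event is a dict with hypothetical keys: "id", "scope", "time" (seconds).
    A group stays open while events for its scope keep arriving within the window.
    """
    groups = []
    open_groups = {}  # scope -> the most recent group for that scope
    for event in sorted(events, key=lambda e: e["time"]):
        current = open_groups.get(event["scope"])
        if current is not None and event["time"] - current["last_seen"] <= window_seconds:
            # Same place, close enough in time: join the existing group.
            current["events"].append(event["id"])
            current["last_seen"] = event["time"]
        else:
            # First event for this scope, or the previous group has gone quiet.
            current = {"scope": event["scope"], "events": [event["id"]],
                       "last_seen": event["time"]}
            open_groups[event["scope"]] = current
            groups.append(current)
    return [(g["scope"], g["events"]) for g in groups]


sample_events = [
    {"id": "e1", "scope": "London", "time": 0},
    {"id": "e2", "scope": "London", "time": 60},
    {"id": "e3", "scope": "Paris", "time": 90},
    {"id": "e4", "scope": "London", "time": 1000},
]
print(group_by_scope(sample_events))
```

Here the two early London events form one group, the Paris event forms its own group, and the late London event starts a new group because it falls outside the window.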
The Event Analytics engine has been completely rebuilt for Netcool Operations Insight 1.6 as a cloud-native application. Although there were techniques available that allowed you to combine event-grouping capabilities (see my previous blog), Netcool Operations Insight 1.6 now does this automatically, and creates “super groups” for you. A key difference in version 1.6 is that an event can now be a member of different types of groups simultaneously. It is the events that are members of more than one group that define how super-grouping is done: groups that have overlapping members are automatically combined.
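The rule that groups with overlapping members are combined can be illustrated with a union-find structure. This is a minimal sketch of the idea only; the group names and event identifiers are made up, and nothing here reflects how the product implements it internally.

```python
def merge_overlapping_groups(groups):
    """Merge groups that share at least one event into super groups.

    groups: dict mapping a (hypothetical) group name to a set of event ids.
    Returns a sorted list of sorted event-id lists, one per super group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every member of a group is joined to that group's first member.
    for members in groups.values():
        members = list(members)
        for member in members:
            find(member)
        for other in members[1:]:
            union(members[0], other)

    super_groups = {}
    for event in parent:
        super_groups.setdefault(find(event), set()).add(event)
    return sorted(sorted(g) for g in super_groups.values())


sample_groups = {
    "scope:site-A": {"e1", "e2", "e3"},
    "temporal:a": {"e3", "e4"},   # overlaps site-A through e3
    "scope:site-B": {"e5", "e6"},
}
print(merge_overlapping_groups(sample_groups))
```

Because `e3` belongs to both the scope-based and the temporal group, those two are merged, while the unrelated site-B group is left alone.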
This is a big deal. Although there are many ways that event correlation can be done, experience has shown that no one method will work for every scenario. Analytics and machine learning techniques consume data in an attempt to detect patterns and thereby predict future eventualities. But what if a given scenario has not been seen before? Similarly, scope-based correlation is excellent at automatically grouping together events that occur at the same place. But what if the incident affects multiple places, or happens between places? How do you link these groups of events together?
Netcool Operations Insight uses multiple event correlation techniques simultaneously and collaboratively, which lets it correlate events into incidents more completely. Not only does more complete and accurate correlation reduce the number of Event Viewer rows presented to operations, it has also been shown in the field to significantly reduce Mean-Time-To-Repair (MTTR) as well as the number of trouble tickets created. A large North-American cable provider successfully used these techniques to reduce ticket counts by 75%. They also calculated that their average MTTR dropped by around 63% as a result of having events correctly correlated together.
In the following image, you can see a group of events in the Event Viewer, containing five events. In the “Grouping” column, four of the events are marked with a Venn diagram icon, indicating that they’re part of the group by virtue of a scope-based association. Two of the events are marked with a clock icon, indicating that they’re part of the group by virtue of a temporal (analytics-based) association. The correlation engine has determined that one of the events is a member of both, hence it has merged the groups together into a super-group.
This feature is big news: now you don’t have to choose which correlation technique to use; you can instead leverage multiple capabilities together. This will eventually include topology-based correlation – but more on that shortly.
ALL-NEW UX WITH INTUITIVE INCIDENT-CENTRIC WORKFLOWS
Netcool Operations Insight 1.6 brings with it an all-new incident-centric workflow based user experience. As depicted above, the Event Viewer has been enhanced to include grouping and Seasonality indicators. On clicking on the “Investigate” hyperlink, the user is taken to a more detailed view of the incident, called the Incident Viewer.
From here, the user can expand each event in turn to inspect its values. At the right-hand side, the user can see which events have Seasonal attributes, and how each one’s membership of the group or incident has come about. In this example, the first two critical events are both Seasonal and have been linked via a temporal relationship “a”. Remember that a temporal relationship is discovered by Netcool Operations Insight’s Event Analytics, whereby events are grouped together on the basis that they have previously and consistently occurred together. The second critical event, and the remainder of the minor events, are included in the grouping by virtue of a scope-based correlation, marked “soc”. It is the second critical event, which is a member of both groupings, that ties the two groupings together. Remember, whenever overlap is seen between members of different groups of events, the event correlation engine automatically merges those groups together.
ANALYTICS PRESENTED IN-CONTEXT
Netcool Operations Insight 1.6 provides transparency to the Event Analytics by providing contextual detail of the analysis. The first of the screen shots below shows that, when the user clicks on group “a”, detailed information is presented explaining how the events have come to be grouped together: they have occurred together multiple times in the past. In this case, the two events have been seen to occur together 14 times since February 18. The user can then click on “More info” to dig deeper into the historic occurrences of these events. Once the user has assessed the grouping, it can be approved or rejected, which determines whether or not future occurrences of these events are grouped.
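A figure such as “seen together 14 times” comes from counting historic co-occurrences. The sketch below shows one simple way such a count could be produced, by bucketing occurrences into fixed time windows. The fixed 900-second window and the event type names are assumptions for illustration; the actual algorithm used by Event Analytics is not described here.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(occurrences, window_seconds=900):
    """Count, per pair of event types, how many time windows contained both.

    occurrences: iterable of (event_type, timestamp_in_seconds) tuples.
    """
    windows = {}
    for event_type, timestamp in occurrences:
        # Integer division assigns each occurrence to a fixed-size window.
        windows.setdefault(timestamp // window_seconds, set()).add(event_type)

    counts = Counter()
    for types in windows.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts


history = [
    ("link_down", 10), ("bgp_drop", 20),
    ("link_down", 1000), ("bgp_drop", 1100), ("fan_fail", 1200),
    ("link_down", 5000),
]
pair_counts = count_cooccurrences(history)
print(pair_counts[("bgp_drop", "link_down")])
```

Pairs with a high count relative to how often each event occurs on its own would be the natural candidates for a temporal grouping.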
In the screen shot below, the user has clicked on the Seasonal marker for the first event and discovered how this particular event is Seasonal. In this case, the marked event has historically occurred on Mondays between 3pm and 4pm. As before, the user can click on “More info” to see historic occurrences of this event in the event history. This Seasonal information provides potentially key clues as to what has caused the event to occur, and also what may have contributed to the incident.
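To make the idea of a Seasonal event concrete, the following sketch finds the weekday-and-hour slot in which an event has most often occurred. It is a simplified illustration under assumed inputs (a list of `datetime` values), not the seasonality analysis the product performs.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def dominant_slot(timestamps):
    """Return (weekday name, hour, share) for the most common weekday/hour slot.

    share is the fraction of all occurrences that fell into that slot.
    """
    slots = Counter((ts.weekday(), ts.hour) for ts in timestamps)
    (weekday, hour), hits = slots.most_common(1)[0]
    return WEEKDAYS[weekday], hour, hits / len(timestamps)


occurrences = [
    datetime(2019, 3, 4, 15, 10),   # Monday
    datetime(2019, 3, 11, 15, 42),  # Monday
    datetime(2019, 3, 18, 15, 5),   # Monday
    datetime(2019, 3, 6, 9, 0),     # Wednesday
]
print(dominant_slot(occurrences))
```

An event whose occurrences concentrate heavily in one slot, as here, is the kind that would be worth flagging as Seasonal.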
NO RELIANCE ON THE REPORTER DATABASE
In Netcool Operations Insight 1.5 and earlier, the Event Analytics engine worked off the contents of the REPORTER database to do its analysis. This meant that when an analytics configuration was run, the Event Analytics engine would pull the selected dataset from the target database, do its processing, and then tabulate the results. In practice, this caused issues for some clients. First, poorly performing databases, or ones that contained enormous amounts of event data, took inordinate amounts of time to analyse. This also resulted in time-out issues in the UI when users queried the system for historic event occurrences. Second, many REPORTER databases are not deployed in a best-practice manner. In some cases, columns required by the analytics were missing or used for other purposes. This meant the analytics could not run, or yielded no meaningful results.
The Event Analytics engine in Netcool Operations Insight 1.6 now has no reliance on the REPORTER database. It takes a direct feed from the ObjectServer and stores internally everything it needs to do analytics and drive the UI. In practice, this makes it all work much faster, and dramatically improves the user experience. Note that it is possible to prime the new Event Analytics engine with historic event data, if you have it. Netcool Operations Insight 1.6 comes with a utility that can be used to ingest data from your REPORTER database, to get off to a fast start. The REPORTER database is no longer a required component at deployment time, however, and the Event Analytics engine does not rely on it subsequently.
DEPLOY FIRST or REVIEW FIRST
The default deployment model for Netcool Operations Insight 1.6 is to activate Event Analytics and automatically group events that it has learned historically always occur together. This mode of operation is called “deploy-first”, where groupings are automatically deployed as they are discovered by the Event Analytics engine. Alternatively, the system can be set up so that all discovered groupings must first be validated before they are allowed to perform automatic event grouping. This alternative mode of operation is called “review-first”.
We saw earlier that one of the new features of Netcool Operations Insight 1.6 is to provide transparency of the analytics to the users within the Incident Viewer. Users can see why events are grouped together, as well as drill down into the historic occurrences of the grouping. In both deploy-first and review-first modes of operation, users can also get access to the found groupings via the Manage Policies portlet. In the Manage Policies portlet, an administrator can review the Live groupings and the Suggested ones. In deploy-first mode, all groupings are automatically made “live” unless a grouping is rejected by a user. In review-first mode, all groupings are first suggested and have to be validated before they are made live. As before, these groupings can be assessed, the previous historic occurrences examined, and then the grouping approved or rejected. This allows anyone who later looks at an occurrence of the group of events in the Incident Viewer to see that the grouping has been specifically approved and validated.
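The difference between the two modes comes down to what happens to a grouping that nobody has reviewed yet. The small sketch below captures that rule as described above; the function name and the string values used for the mode and the review decision are illustrative assumptions.

```python
def is_live(mode, decision):
    """Decide whether a discovered grouping should actively group events.

    mode:     "deploy-first" or "review-first"
    decision: None (not yet reviewed), "approved" or "rejected"
    """
    if decision == "rejected":
        return False                    # a rejected grouping is never live
    if mode == "deploy-first":
        return True                     # live unless someone rejects it
    if mode == "review-first":
        return decision == "approved"   # suggested until someone approves it
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("deploy-first", "review-first"):
    for decision in (None, "approved", "rejected"):
        print(mode, decision, is_live(mode, decision))
```

The only case where the two modes differ is the unreviewed grouping: live under deploy-first, still only suggested under review-first.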
If the system is running in review-first mode, the system helps a user with the review process by automatically ranking the groupings in the Suggested view. Factors that affect a grouping’s ranking include: when the grouping was last seen, what the maximum Severity of the events was, the number of events in the group, and the number of times the grouping has been seen. This is very helpful in leading the user to the groupings that would bring the highest value to the business. If approved, groupings move from the Suggested box to the Live box.
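As a rough illustration of how the four factors listed above might be combined into a ranking, the sketch below computes a weighted score per grouping. The weights, the field names, and the assumed 0 to 5 severity scale are all hypothetical; the actual ranking formula is not described here.

```python
from dataclasses import dataclass


@dataclass
class SuggestedGrouping:
    name: str
    days_since_last_seen: int
    max_severity: int   # assumed scale: 0 (clear) to 5 (critical)
    event_count: int
    times_seen: int


def rank(groupings):
    """Order suggested groupings from most to least worth reviewing."""
    def score(g):
        recency = 1.0 / (1 + g.days_since_last_seen)  # recent groupings score higher
        # Hypothetical weights, chosen only to make the example concrete.
        return (3.0 * g.max_severity + 2.0 * recency
                + 0.5 * g.event_count + 0.2 * g.times_seen)
    return sorted(groupings, key=score, reverse=True)


suggested = [
    SuggestedGrouping("b", days_since_last_seen=30, max_severity=3, event_count=6, times_seen=4),
    SuggestedGrouping("a", days_since_last_seen=1, max_severity=5, event_count=2, times_seen=14),
    SuggestedGrouping("c", days_since_last_seen=0, max_severity=4, event_count=3, times_seen=2),
]
print([g.name for g in rank(suggested)])
```

With these made-up weights, the recent, critical, frequently seen grouping “a” rises to the top of the Suggested list.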
AWARD-WINNING TOPOLOGY VIEWER
One of the jewels in the crown of Netcool Operations Insight 1.6 is its award-winning Topology Viewer.
With the ever increasing trend towards more dynamic, cloud, and multi-cloud environments, being able to visualise how your environment is connected in a single pane of glass has become a vital element in being able to support and manage it. The Topology Viewer component in Netcool Operations Insight 1.6 has a number of key capabilities that make it essential for the management of dynamic environments.
Just as traditional Netcool Probes collect events, the job of the Observers is to collect topology data. The library of topology ingestion Observers includes off-the-shelf ones designed to connect to specific types of topology source, such as Kubernetes or VMware, and generic ones that can be used to ingest custom topology data, for example from a file or via the REST API.
One of the principal design elements of the Topology Viewer was that it needed to be able to consume and depict topology data in real-time. Many of the Observers plug directly into dynamic orchestration systems, consume the topology changes published by the target system, and update the topology in Netcool on-the-fly. This feature is essential in allowing an operations team to effectively manage a highly dynamic environment. Not only does an operator need to be able to see how things are connected now, they need to be able to “go back in time” to see how things were connected at the time the events occurred.
TIMELINE AND DELTA
The Netcool Topology Viewer stores received topology data, so that a user can “go back in time” and view how the environment was connected at a previous point in time. By switching on the DELTA view, a user can also see what has changed. This is an essential capability for both troubleshooting a current issue, as well as for doing a debrief after a major outage.
In the image shown above, DELTA mode is enabled and the pins have been positioned on two different points in the timeline. The topology view is consequently showing the user the differences in the topology between the two points in time. The timeline suggests the number of Relationships increased between those two points in time, and the topology shows that a number of new elements were indeed added. Items marked with a small purple plus icon were added between the two time points. Similarly, items that are greyed out and marked with a small black minus icon were removed between the two time points.
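Conceptually, a delta between two points in time is a set difference between two topology snapshots. The sketch below shows that idea with plain Python sets; the snapshot layout (a `resources` set and a `relationships` set of pairs) and the resource names are assumptions for the example, not the product's data model.

```python
def topology_delta(before, after):
    """Compare two topology snapshots and report what was added and removed.

    Each snapshot is a dict with a "resources" set and a "relationships" set
    of (source, target) tuples.
    """
    delta = {}
    for kind in ("resources", "relationships"):
        delta[kind] = {
            "added": sorted(after[kind] - before[kind]),     # the "plus" items
            "removed": sorted(before[kind] - after[kind]),   # the "minus" items
        }
    return delta


snapshot_t1 = {
    "resources": {"node-1", "pod-a", "pod-b"},
    "relationships": {("node-1", "pod-a"), ("node-1", "pod-b")},
}
snapshot_t2 = {
    "resources": {"node-1", "pod-a", "pod-c", "pod-d"},
    "relationships": {("node-1", "pod-a"), ("node-1", "pod-c"), ("node-1", "pod-d")},
}
print(topology_delta(snapshot_t1, snapshot_t2))
```

In this example `pod-b` would be shown as removed, and `pod-c` and `pod-d` as added, along with their relationships to `node-1`.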
Another feature of the topology viewer is the ability to stitch or merge together topology parts that have originated from different sources. An example of this might be a hybrid environment whereby some parts of the managed environment reside in an on-premise containerised environment and other parts in a public cloud environment. The various parts may collectively make up a service and have mutual dependencies, hence being able to visualise all the parts simultaneously as well as the connectedness is vital to troubleshooting any potential problems. Topology parts from multiple different sources can all be stitched together in a similar manner to provide visualisation of the entire estate, on a single pane of glass.
TOPOLOGY-BASED EVENT CORRELATION
Correlating events coming from highly dynamic environments where resources are created and destroyed on-the-fly in a more-or-less random fashion is very difficult, if not impossible, without a view of the topology at the time the events occurred. In such environments, trying to deduce event relationship based on an analytical assessment of the historic event data is of limited value. After all, it is very difficult to predict how events should be correlated together in future, by looking at how events occurred together in the past, if those events came from topology that no longer exists!
Currently in development, and highly anticipated, is Netcool Operations Insight 1.6’s ability to perform topology-based event correlation. This feature will allow users to define topology subsection templates, called sub-topologies, so that events occurring within the same sub-topology type can be correlated. For example, we may define a sub-topology template to have four things of certain types strung together in a line. The automation would then look to correlate events together that come from devices of these types connected together in the prescribed manner.
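The “four things of certain types strung together in a line” example can be pictured as a path search over the topology graph. The sketch below is purely illustrative, since the feature is described as still in development: the resource names, resource types, and the representation of a template as an ordered list of types are all assumptions.

```python
def match_linear_template(template, node_types, edges):
    """Find simple paths whose resource types match the template, in order.

    template:   ordered list of resource types, e.g. four types in a line
    node_types: dict mapping resource name to resource type
    edges:      iterable of undirected (resource, resource) connections
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    matches = []

    def extend(path):
        if len(path) == len(template):
            matches.append(tuple(path))
            return
        wanted = template[len(path)]
        for nxt in sorted(neighbours.get(path[-1], ())):
            # Only follow connections to unvisited resources of the next type.
            if nxt not in path and node_types[nxt] == wanted:
                extend(path + [nxt])

    for node in sorted(node_types):
        if node_types[node] == template[0]:
            extend([node])
    return matches


types = {"host1": "host", "host2": "host", "sw1": "switch", "sw2": "switch",
         "rtr1": "router", "fw1": "firewall"}
links = [("host1", "sw1"), ("sw1", "rtr1"), ("rtr1", "fw1"),
         ("host2", "sw2"), ("sw2", "rtr1")]
print(match_linear_template(["host", "switch", "router", "firewall"], types, links))
```

Events raised by the resources in any one matched path would then be candidates for correlation into the same group.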
The topology-based event correlation will work in conjunction with the scope-based and analytics-based correlation capabilities in a collaborative fashion, to create super groups. This capability will enable clients to finally close the loop on many of the edge case correlation scenarios that may not have been possible to elegantly solve before.
Consider the following diagram:
In this scenario, the outage is caused by a link going down between two parts of the environment, causing events to be generated in multiple places. The analytics-based correlation has grouped together the three events circled in orange, since this is a known grouping based on previously observed event data. Similarly, the scope-based correlation has created two groupings of events based on the events’ respective locations. With these two correlation mechanisms alone, two incidents would be created.
Enter: the third correlation capability. Due to a predefined sub-topology template that defines the four resource types indicated in the diagram, the topology-based event correlation has correctly correlated the four events coming from the four resources circled in green. Since there is overlap between this topology-based grouping and both of the scope-based event groups, the event grouping engine will merge the events covered by all these groups into one super group, thereby creating just the one incident, and hence only one ticket.
INDUSTRY LEADING EVENT CORRELATION
As we know, scope-based event correlation will correlate events within the same scope; make the scope too large, however, and you risk correlating events together incorrectly. Analytics-based event correlation leverages machine learning capabilities to learn which events have historically occurred together, but in many cases a given event scenario has not been seen often enough before to be validated. Topology-based correlation allows us to define connectivity templates that define how alarms from specific types of connected things can be correlated.
Experience in the field has taught us that, while all three approaches are very powerful, no one approach alone will fulfil every possible use-case. Netcool Operations Insight 1.6 is unique in that it leverages multiple powerful event correlation techniques, simultaneously and collaboratively, allowing far greater and more accurate event correlation than any single approach can achieve alone.
OTHER NEW CAPABILITIES
Netcool Operations Insight 1.6 comes with a number of other new capabilities to make managing your Hybrid Cloud environment easier.
INBOUND AND OUTBOUND INTEGRATIONS WIZARD
IBM’s Cloud Event Manager (CEM) is a Software-As-A-Service offering that provides a cloud-based Operations Management system. If you’re familiar with CEM, you will be familiar with the convenient wizard-driven inbound and outbound integrations it comes with. New in Netcool Operations Insight 1.6 is entitlement to use CEM as part of your cloud-platform-based deployment. Once installed, these new wizards will help you to quickly set up event integrations both in and out of Netcool, with just a few clicks. This set of off-the-shelf integrations is being built on and expanded continuously.
The new inbound event integrations include:
The new outbound event integrations include:
NOTE: both inbound and outbound integrations offer a generic web hook option, in the case that an off-the-shelf option is not available.
RUNBOOK AUTOMATION AND ALERT NOTIFICATION
Additional embedded components of CEM are the Runbook Automation (RBA) and Alert Notification (AN) capabilities. These also come as part of your entitlement with Netcool Operations Insight 1.6 onwards.
A runbook is essentially a set of instructions or steps that can be followed to resolve a problem. A “manual” runbook is a set of manual instructions that an operator might carry out themselves with no automation involved. For example, it might involve copying and pasting commands that would resolve an issue, such as resetting a link on a switch. A “semi-automated” runbook is one that provides a set of steps that are initiated by the operator, but where the execution of each step is automated. For example, the operator might be presented with a button that, when clicked, connects to the switch and resets the link. In either case, runbooks are context-sensitive to the event they were launched from, and the runbook instructions and steps are populated based on the selected event. An example of a semi-automated runbook is shown below:
At the end of each runbook is the option for the operator to give feedback on the runbook – for example, did it work or not? Did any of the steps fail? If a runbook author sees that their runbook was used 100 times in the past month, and was always successful, they might look to make the runbook fully automatic. This means the runbook will run without any user intervention and introduce self-healing elements to the environment. The beauty of this approach is that a fully-automated resolution can be tried and tested organically in production by real users, before it is let loose on the environment. Human interaction is needed only if the runbook fails to resolve the problem. Hence a great deal of resource can be saved by automating many of the mundane repetitive corrective tasks that use up a lot of operators’ time.
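The promotion decision described above, where a runbook that has been used many times and always succeeded becomes a candidate for full automation, can be expressed as a simple check. The function name is hypothetical, and the default threshold of 100 runs is taken from the example in the text rather than from any product setting.

```python
def ready_for_full_automation(outcomes, min_runs=100):
    """Suggest promoting a runbook once it has a long, unbroken success record.

    outcomes: list of booleans from operator feedback, True meaning the
              runbook resolved the problem on that run.
    """
    return len(outcomes) >= min_runs and all(outcomes)


print(ready_for_full_automation([True] * 100))            # long, clean record
print(ready_for_full_automation([True] * 99 + [False]))   # one reported failure
print(ready_for_full_automation([True] * 20))             # not enough evidence yet
```

In practice a runbook author would probably weigh other factors as well, such as how recent the runs were and how risky the automated steps are.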
Alert Notification allows for the uploading of work rotas and allows for the specification of how operations teams should be contacted in the case of an escalation. Notification options include an SMS, an email, or a voice call.
DEPLOY ON CLOUD
As with Netcool Operations Insight 1.5, version 1.6 deploys onto IBM Cloud Private (ICP). What’s new in version 1.6 is the ability to also deploy onto OpenShift. Both ICP and OpenShift can be installed either locally on-premise or onto public cloud (i.e. PaaS) environments.
IBM Netcool Operations Insight 1.6 brings a wealth of new capabilities and new technology that make it more intelligent and more powerful than ever. Its all-new user experience provides a more intuitive, incident-centric way of working that makes problem determination quicker and easier, and that gives transparency to the analytics done under the covers. Its ground-breaking new correlation capabilities (scope-based, analytics-based, and coming soon, topology-based) work collaboratively to enable event correlation in ways not possible before.
All this helps to drive down ticket counts, Mean-Time-To-Know, and Mean-Time-To-Repair. It includes the new Topology Viewer, which is essential for visualising and managing highly dynamic and multi-cloud environments. It provides simpler wizard-driven integration into third party hybrid-cloud applications. It now includes Runbook Automation, which allows for the organic development of self-healing environments, and Alert Notification, which allows users and groups to be notified in a manner of their choice when incidents need to be escalated.
IBM Netcool Operations Insight 1.6 raises the bar yet again, and takes AIOps to the next level.