
Building a closed-loop automation system by using metrics, logs, and topology information

Today’s cloud-native and microservices-based architectures rely on a complex infrastructure that is made up of various hardware and software components. This increasingly complex infrastructure makes it difficult to troubleshoot and resolve issues quickly. Closed-loop automation systems help transform network and IT operations by using AI-driven automation to detect anomalies, determine resolution, and implement the required changes in a highly automated framework. You can learn more about the use cases for closed-loop automation, the benefits of closed-loop automation, and some of the challenges in implementing a closed-loop automation system in our first article in this series, “An introduction to closed-loop automation.”

One common use case for closed-loop automation is traffic flow optimization. By implementing a closed-loop automation system, teams can automatically correct issues like network anomalies within the provisioned network infrastructure. You can learn more about how to implement a closed-loop automation system that uses IBM Cloud Pak for Network Automation, IBM Cloud Pak for Watson AIOps, and Cognitive Automation services in our second article in this series, “Use closed-loop automation to optimize network performance and resolve network issues.”

At the heart of the closed-loop automation system for the traffic flow optimization use case are the components of IBM Cloud Pak for Watson AIOps. In this third and final article in this series on closed-loop automation, we explain how you can use Cloud Pak for Watson AIOps to analyze data from across your runtime ecosystem, consuming metric, log, event, and topology data to correlate, predict, and address network issues before they impact the performance of your environment.

We also provide information about how you can use metrics, logs, and topology information to identify the source of failure in a real-world network function deployment. We describe in detail the following capabilities of closed-loop automation systems that use these Cloud Pak for Watson AIOps components:

  • Detecting metric anomalies in Metric Manager
  • Detecting log anomalies in AI Manager
  • Combining metric and log anomalies in the Event Grouping Service
  • Localizing the fault to a particular node by using the topology in Topology Manager
  • Triggering the closed-loop action via event information sent to Event Manager

We briefly cover how these specialized components work together to enable detection and identification of failure sources in the network.

Figure: Components

The preceding high-level flow chart illustrates how we configured Cloud Pak for Watson AIOps, in the closed-loop automation use case for traffic flow optimization, to process event, metric, and log data, determine actionable insights, and run the necessary actions.

During configuration and training, the AI Manager creates models by training on log data from applications, infrastructure, and the network. The Metric Manager is trained on performance metric data to create a model familiar with normal operating behavior of the KPIs. During operations, log data is analyzed by the AI Manager while time-series metric data is analyzed by the Metric Manager in parallel processes.

After the AI Manager and Metric Manager generate alerts, the Event Grouping Service (part of AI Manager) ingests them and groups related events into a single event output. The Event Grouping Service uses topology information from the Topology Manager to identify the root cause through fault localization and to calculate which neighboring nodes in the blast radius are affected by the fault.

This single actionable event (called a derived story in our flow chart) can be shared with a subject matter expert (SME) through ChatOps tools like Slack or PagerDuty, or with an automated system like the Event Manager (in Cloud Pak for Watson AIOps) or Cloud Pak for Network Automation, which is what we used in this traffic flow optimization use case.

The final flow in the chart shows the Event Manager, which can organize, deduplicate, and track events through the event’s entire lifecycle. The Event Manager also maintains what actions to take for different events and triggers follow-up actions. The event information can then be shared with other tools like Cloud Pak for Network Automation to take specific actions to address and resolve the fault.

Now, let’s explore the individual components of Cloud Pak for Watson AIOps and their role in providing closed-loop automation in the traffic flow optimization use case.

Handling system metrics

We use the Metric Manager to build a model of the normal operating behavior of the time series data that is collected from system metrics and to detect anomalies at run time. The Metric Manager’s anomaly detection algorithms use numerous statistical and analytics techniques to detect anomalies.

During training, the analytics algorithm analyzes the metrics in the source data to learn their behavior and creates a mathematical model of what it learned from the data in its training window. An algorithm is trained only when a metric has sufficient data, and a model is created only if the available data can be modeled accurately by the algorithm; otherwise, the model is rejected in the validation step. The algorithms are retrained at regular intervals to update the mathematical model so that it represents the metric data as accurately as possible.

After an algorithm creates a model for a metric, it can detect anomalies in the data that it receives for that metric at subsequent intervals. It compares subsequent data that is extracted with data in the model so it can identify any changes in system behavior.
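The Metric Manager's algorithms are proprietary, but the general idea of learning a model of normal behavior from a training window can be sketched in a few lines of Python. The following is a minimal sketch, assuming a simple mean-and-standard-deviation band; the function name, tolerance factor, and minimum-points check are illustrative assumptions, not the product's actual model.

```python
import statistics

def train_baseline(training_values, min_points=24, tolerance=3.0):
    """Learn a simple baseline band (mean +/- tolerance * stddev) for one metric.

    Returns None when there is not enough data to model the metric accurately,
    mirroring the validation step described above.
    """
    if len(training_values) < min_points:
        return None  # not enough data to train this metric yet
    mean = statistics.mean(training_values)
    stdev = statistics.pstdev(training_values)
    return {"mean": mean, "lower": mean - tolerance * stdev, "upper": mean + tolerance * stdev}

# Example: ICMP round-trip-time samples (in ms) from a healthy training window
rtt_training = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3] * 5   # 30 data points
print(train_baseline(rtt_training))
```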

Metric anomaly detection

After an algorithm trains and creates a model for a metric, it compares the value of the metric with the model information at each interval. If a metric's value fits within the model information, the algorithm takes no further action. However, if the metric's value deviates from the model information, the algorithm detects an anomalous pattern. The algorithm then uses various properties, such as the minimum number of intervals for which a metric must be anomalous, to determine whether to output an anomaly event.

By default, an algorithm sends an anomaly event if it detects an anomalous pattern for a metric on 3 of the previous 6 intervals at which data was received. Therefore, the Metric Manager generates anomaly events only when it is confident that anomalous behavior is occurring and is worth investigating.
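The following sketch shows how such a confirmation rule could be applied on top of a learned baseline band like the one above. The 3-of-6 default comes from the behavior described in this section, while the data structures and field names are illustrative assumptions.

```python
from collections import deque

# A baseline band of the kind produced by the training sketch above (values in ms)
model = {"lower": 11.0, "upper": 13.2}

def detect_metric_anomalies(values, model, window=6, threshold=3):
    """Emit an anomaly event only when at least `threshold` of the last
    `window` intervals fall outside the learned baseline band."""
    recent = deque(maxlen=window)
    events = []
    for interval, value in enumerate(values):
        recent.append(not (model["lower"] <= value <= model["upper"]))
        if sum(recent) >= threshold:
            events.append({"interval": interval, "value": value,
                           "anomalous_intervals": f"{sum(recent)} of last {len(recent)}"})
    return events

# Live RTT values between Core.R3 and Core.R4 that start to drift upward
live_rtt = [12.0, 12.2, 18.5, 19.1, 12.1, 18.9, 19.4]
print(detect_metric_anomalies(live_rtt, model))
```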

For the traffic flow optimization use case, the data source is performance metrics (raw metric data) from all of the nodes of our system. The Metric Manager analyzes this time series metric data from the Juniper NorthStar SDN Controller and can detect anomalies, as shown in the following image. The image shows an anomaly being detected on the ICMP Round Trip Time delay between two of the nodes in the topology (Core.R3 and Core.R4). The detected anomaly shows a deviation from normal values.

Figure: Anomaly detection

Handling system logs

In our use case, we use the AI Manager's log anomaly detector to consume logs from all components of our system and perform log-based anomaly detection. The logs that are collected from services in our environment are in rsyslog format and are fed to the AI Manager as input.

The following figure is an example of raw logs that are collected from our infrastructure.

Figure: Raw logs

Normalization

The AI Manager consumes data from various sources such as logs, events, and tickets, and must normalize, or standardize, that data into a common format to better handle the information present in the input data.

In the case of logs, normalization is performed by converting the input rsyslog messages into a closely related logging format, such as the LogDNA format.

The following image shows an example of a normalized log; in this case, it corresponds to the first line of the raw log shown in the previous image.

Figure: Normalized log
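To make the normalization step concrete, here is a minimal sketch that maps an rsyslog line into a LogDNA-style JSON record. The regular expression, the field names (_ts, _host, _app, _line), and the sample log line are simplified assumptions for illustration, not the AI Manager's internal schema.

```python
import json
import re

# Very simplified rsyslog pattern: "<timestamp> <host> <app>[<pid>]: <message>"
RSYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s+(?P<host>\S+)\s+"
    r"(?P<app>[\w./-]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<message>.*)$"
)

def normalize(raw_line):
    """Convert one rsyslog line into a LogDNA-style JSON record."""
    match = RSYSLOG_RE.match(raw_line)
    if not match:
        return None  # leave unparseable lines for a fallback handler
    fields = match.groupdict()
    return {"_ts": fields["timestamp"], "_host": fields["host"],
            "_app": fields["app"], "pid": fields["pid"], "_line": fields["message"]}

raw = "Apr 20 10:15:32 Core.R3 inetd[7362]: /usr/sbin/sshd[31282]: exited, status 255"
print(json.dumps(normalize(raw), indent=2))
```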

Templatization

After normalization, the input logs are passed to the templatization engine. A template is a generic message string from which many lines in the log output are generated. Templatization also detects the positions of any parameters present in those lines.

The templatization engine first runs the lines of the log file through a classifier that classifies them into erroneous or non-erroneous groups, which helps separate lines that pertain to the healthy state of the system. A template miner then runs the lines through a pretrained model to map them to templates and generates a different ID for each unique template. So, each line in the log file is mapped to a template ID. The following figure shows a template that was extracted by the templatization engine while parsing logs of our system.

Figure: Template

The template subfield in this example shows the text that forms the base for this line of the log file, while the process pid and the exit_status code are detected as parameters. The original log line for this template looked like inetd[7362]: /usr/sbin/sshd[31282]: exited, status 255.

The example also shows that the classifier detected this line as a non-erroneous line, as shown by "error_flag": "False".
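The following minimal sketch illustrates the parameter-masking idea behind templatization, using the same example line. Production template miners (and the AI Manager's pretrained model) are far more sophisticated; the regular expressions and template IDs here are illustrative assumptions.

```python
import re

templates = {}  # template string -> template ID

def templatize(message):
    """Mask variable parameters (numbers, paths) and map the result to a stable ID."""
    params = re.findall(r"/[\w./-]+|\d+", message)        # capture parameters first
    template = re.sub(r"/[\w./-]+|\d+", "<*>", message)   # then mask them
    template_id = templates.setdefault(template, len(templates) + 1)
    return {"template_id": template_id, "template": template, "parameters": params}

print(templatize("inetd[7362]: /usr/sbin/sshd[31282]: exited, status 255"))
print(templatize("inetd[8110]: /usr/sbin/sshd[31501]: exited, status 1"))
# Both lines map to the same template ID; only the pid and exit_status parameters differ.
```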

Training for logs

The AI Manager must be trained on sample logs from a "healthy" system state to create a model of the normal behavior of the system logs. While training, the AI Manager runs the training logs through normalization and templatization, creating a model of the unique templates seen in the logs. It also collects information about how frequently each unique log template occurs in the training data. A model is created for each unique entity detected as generating log output in the input data, such as Core.R3 and Core.R4.
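As a rough illustration of this training step, the sketch below builds a per-entity count of which template IDs occur in healthy-state logs. The data structures are assumptions for illustration, not the AI Manager's internal model format.

```python
from collections import Counter, defaultdict

def train_log_model(healthy_records):
    """Build a per-entity model of which templates occur, and how often, in healthy logs.

    `healthy_records` is an iterable of (entity, template_id) pairs produced by
    normalization and templatization of healthy-state logs.
    """
    model = defaultdict(Counter)
    for entity, template_id in healthy_records:
        model[entity][template_id] += 1
    return model

healthy = [("Core.R3", 1), ("Core.R3", 2), ("Core.R3", 1), ("Core.R4", 1)]
log_model = train_log_model(healthy)
print(dict(log_model))   # e.g. {'Core.R3': Counter({1: 2, 2: 1}), 'Core.R4': Counter({1: 1})}
```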

Log anomaly detection

We feed logs from various components of our system into the AI Manager in real time. The AI Manager converts them to normalized logs that are then run through the templatization engine in batches called windows of logs.

For each window, a count_vector is generated, which maintains the occurrence frequency of the various templates present in the log output. The data collected in each window is then compared to the healthy-state model of the system that was created during training. A deviation of the system from the healthy state is detected as an anomaly. For example, the presence of a new log template (an error template) is identified by the log anomaly detector as a log anomaly, which is then consumed by the Event Grouping Service.
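The sketch below illustrates this detection step under the same assumptions as the training sketch: a window's count vector is compared against the healthy-state model, and any template an entity never produced during training is flagged. The thresholds and structures are illustrative only.

```python
from collections import Counter

# Healthy-state model of the kind built in the training sketch above:
# entity -> {template_id: frequency in training data}
log_model = {"Core.R3": {1: 2, 2: 1}, "Core.R4": {1: 1}}

def detect_log_anomalies(window_records, log_model):
    """Compare a window's count vector against the healthy-state model and
    report any template that an entity never produced during training."""
    count_vector = Counter(window_records)          # (entity, template_id) -> count
    anomalies = []
    for (entity, template_id), count in count_vector.items():
        if template_id not in log_model.get(entity, {}):
            anomalies.append({"entity": entity, "template_id": template_id,
                              "count": count, "reason": "template not seen during training"})
    return anomalies

window = [("Core.R3", 1), ("Core.R3", 99), ("Core.R4", 1)]   # template 99 is a new error template
print(detect_log_anomalies(window, log_model))
```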

Handling system topology

In our use case, we use the Topology Manager, also known as the Agile Service Manager (ASM), to generate a topology database through active discovery of networks, connections, and applications, and by using existing sources or known connections.

The Topology Manager allows DevOps personnel, SREs, and operations team members to have real-time visibility of complex distributed workloads and infrastructures by observing the interactions and connections in the form of a topology. The topology also helps teams quickly find the blast radius (the distance from the source of a fault to other components) of an issue, distinguish symptoms from root causes, and visualize topology changes over time, all of which help teams discover any deviations in the topology.

The following figure is a snapshot of the topology of our traffic flow optimization use case.

Figure: Topology snapshot

The topology is stored in the form of a graph, where nodes represent an application, a service, or a component of an application or the infrastructure, and the edges represent the interaction relationships between them. The topology uses various edge labels and edge types such as memberOf, runsOn, and accessedVia. In the traffic flow optimization use case, we have a simple dependsOn relationship based on the direction of traffic flow between the nodes.
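For illustration, a dependsOn topology like this one can be represented as a simple adjacency list, as in the sketch below. Only Core.R3 and Core.R4 come from the use case; the other node name is hypothetical, and the real Topology Manager stores far more metadata per node and edge.

```python
# dependsOn edges follow the direction of traffic flow; Edge.R1 is a hypothetical node
depends_on = {
    "Edge.R1": ["Core.R3"],
    "Core.R3": ["Core.R4"],
    "Core.R4": [],
}

def downstream(node, graph):
    """Return every node that `node` depends on, directly or transitively."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

print(downstream("Edge.R1", depends_on))   # {'Core.R3', 'Core.R4'}
```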

The following image shows the topology in a JSON format, which shows all of the nodes and the relationships among them.

Figure: Topology in JSON

Combining metric and log anomalies

The AI Manager’s Event Grouping Service groups alerts coming from various sources that pertain to the same fault to provide a better understanding and analysis of the root cause of the fault. It also localizes the fault to a point of failure and performs impact domain (blast radius) calculations.

The Event Grouping Service applies multiple algorithms to group alerts. Temporal grouping is applied to alerts that are strongly correlated in their occurrence times and in the entities that emit the error logs. Template grouping is applied to alerts that have a similar description. For example, if the underlying log lines (more accurately, templates) present in two alerts are similar, then it's possible that a failure at one place causes the same error to be thrown by upstream or downstream services.

For temporal grouping, the system uses the occurrence times of two alerts to group them together, but alerts can also be grouped if the log lines corresponding to the errors in different alerts were emitted by the same entity (code/module/node).

For template grouping, the system considers each entity that emitted a log line in the alert window (for example, Core.R3 and Core.R4), and then calculates a similarity score between the log templates emitted by each entity across the alerts. A strong similarity score forms the basis of a template-based cluster of alerts.

The Event Grouping Service applies these algorithms to all alerts until no more correlation is possible or a single alert group is formed.
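The sketch below illustrates the two heuristics in simplified form: a temporal check on occurrence times and emitting entities, and a template check using Jaccard similarity over the template sets behind two alerts. The field names, time window, similarity measure, and thresholds are assumptions for illustration only, not the Event Grouping Service's actual logic.

```python
def temporally_close(alert_a, alert_b, window_seconds=300):
    """Temporal grouping: alerts whose occurrence times are close, or that were
    emitted by the same entity, are candidates for the same group."""
    same_entity = alert_a["entity"] == alert_b["entity"]
    close_in_time = abs(alert_a["time"] - alert_b["time"]) <= window_seconds
    return close_in_time or same_entity

def template_similarity(alert_a, alert_b):
    """Template grouping: Jaccard similarity of the template sets behind two alerts."""
    a, b = set(alert_a["templates"]), set(alert_b["templates"])
    return len(a & b) / len(a | b) if a | b else 0.0

alert_1 = {"entity": "Core.R3", "time": 1000, "templates": {12, 47}}
alert_2 = {"entity": "Core.R4", "time": 1120, "templates": {12, 47, 51}}
if temporally_close(alert_1, alert_2) or template_similarity(alert_1, alert_2) > 0.5:
    print("group alerts 1 and 2 into a single event")
```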

Fault localization

As the name suggests, fault localization is the process of tracing back the fault propagation and pinpointing the faulty component among the many components of a complex distributed system.

After the Event Grouping Service generates grouped alerts (where anomalies are grouped together), the entities mentioned in each grouped anomaly, along with the dependency graph from the Topology Manager, are used to perform fault localization and blast radius calculation.

Figure: Topology Manager
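A minimal sketch of the idea, assuming the adjacency-list topology from the earlier sketch: the root-cause candidate is the alerted entity that no other alerted entity sits downstream of (in the direction of traffic flow), and the blast radius is everything that transitively depends on it. The real fault localization in the AI Manager is considerably more elaborate; the function and node names here are illustrative.

```python
def localize_fault(alerted_entities, depends_on):
    """Pick as the root-cause candidate the alerted entity whose downstream set
    contains no other alerted entity, then compute the blast radius as every
    node that transitively depends on it."""
    def downstream(node):
        seen, stack = set(), list(depends_on.get(node, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(depends_on.get(current, []))
        return seen

    root = next(e for e in alerted_entities
                if not (downstream(e) & (set(alerted_entities) - {e})))
    blast_radius = {node for node in depends_on if root in downstream(node)}
    return root, blast_radius

depends_on = {"Edge.R1": ["Core.R3"], "Core.R3": ["Core.R4"], "Core.R4": []}
print(localize_fault(["Core.R3", "Core.R4"], depends_on))
# e.g. ('Core.R4', {'Edge.R1', 'Core.R3'})
```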

The following sample output from the Event Grouping Service in the traffic flow optimization use case shows the fault localization output.

Figure: Output from the Event Grouping Service

After performing event grouping and fault localization, the AI Manager sends a single actionable event as the output to trigger further action.

Handling generated events

In the traffic flow optimization use case, we used the Event Manager to combine and deduplicate the received alerts and events, while tracking the lifecycle of the event through remediation and resolution actions.

The Event Manager maintains which actions to take in response to different events and triggers follow-up events as needed.

In our use case, the event that is detected by the AI Manager's Event Grouping Service is sent as input to our Event Manager. The Event Manager can consolidate it further with events from other sources. The events are consolidated, normalized, and stored in an in-memory database called the Object Server. As the Event Manager tracks an event through its lifecycle, it appropriately prioritizes the incident and triggers actions or policies on other tools and systems.
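The Event Manager's Object Server handles consolidation and deduplication internally; the sketch below only illustrates the general idea of deduplicating on a key and keeping a tally of repeat occurrences. The field names and key choice are hypothetical, not the Event Manager's actual schema.

```python
event_table = {}   # deduplication key -> stored event

def insert_event(event):
    """Deduplicate events on (node, alert type): repeated occurrences bump a
    tally and refresh the last-occurrence time instead of creating new rows."""
    key = (event["node"], event["alert_type"])
    if key in event_table:
        stored = event_table[key]
        stored["tally"] += 1
        stored["last_occurrence"] = event["time"]
    else:
        event_table[key] = {**event, "tally": 1, "last_occurrence": event["time"]}

insert_event({"node": "Core.R4", "alert_type": "RTT_DEGRADED", "time": 1000})
insert_event({"node": "Core.R4", "alert_type": "RTT_DEGRADED", "time": 1060})
print(event_table[("Core.R4", "RTT_DEGRADED")]["tally"])   # 2
```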

The following figure is a screen capture of one of the events in the Event Manager dashboard.

Figure: Event Manager dashboard

In our traffic flow optimization use case, the Event Manager triggers remediation by sending the event to our Network Operations Center. As the remediation steps take place, it tracks the active anomaly through any changes until it is resolved, at which point it can trigger additional follow-up actions or notifications, completing the closed-loop scenario.

Summary

This article describes how, in a real-world system, basic signals like logs and performance metrics collected from the application and infrastructure can be used to detect faults and errors at run time. The topology of the system, along with any fault information, can be used to pinpoint the fault to a specific entity. The fault can then be tracked through its entire lifecycle, giving an SME a complete view of fault detection and fault remediation through automated or manual steps. Thus, the components of IBM Cloud Pak for Watson AIOps help to deliver a closed-loop automation system.

While Cloud Pak for Watson AIOps provides a suite of products and components, we focused on the tools that helped us enable the traffic flow optimization use case. Your enterprise can extend this architecture to any complex use case like IP Multimedia Subsystem (IMS) or 5G Core deployments.