
Building a closed loop automation system for real-world telco network workloads

Closed-loop automation systems enable companies to transform network and IT operations by using AI-driven automation to detect anomalies, determine resolutions, and implement the required changes within a highly automated framework. To learn more about closed loop automation and its benefits, challenges, and architectures, read our article, “An introduction to closed-loop automation.”

To better understand how to set up a closed-loop automation system, let’s look at a traffic flow optimization use case. We’ll show how AI-driven automation can help teams automatically correct issues like network anomalies within the provisioned network infrastructure by using three main components: Cloud Pak for Network Automation for orchestration, Cloud Pak for Watson AIOps for assurance, and a Cognitive Automation component for AI.

The network components in our traffic flow optimization use case consist of a set of virtual Juniper switches managed by a Juniper SDN Controller and orchestrated by IBM Cloud Pak for Network Automation.

When an anomaly is detected in the network (in this case, increased network load on one of the L3/L2 transport tunnels), you can proactively divert the network traffic from its primary flow to a backup flow while the issues on the primary flow are fixed automatically in parallel. After the issue is fixed, Cloud Pak for Watson AIOps and Cloud Pak for Network Automation can reroute that traffic back to the primary flow.

The following figure shows the components in our implementation of the closed-loop automation system for traffic flow optimization.

Architecture of closed-loop automation system for traffic flow optimization use case

The following flow chart shows the steps and components of the closed-loop automation system.

Flowchart of closed-loop automation system components

Let’s step through the components of this closed-loop automation system:

  1. Cloud Pak for Watson AIOps learns the normal behavior of the system by training its Metric Manager on the system metrics and training its AI Manager on the logs of applications under study in the system.
  2. Based on the model of normal system behavior, both the Metric Manager and AI Manager ingest real-time metrics and logs respectively, analyzing the system for anomalous behavior.
  3. A network anomaly is created by increasing network load between two of the system nodes.
  4. Both the Metric Manager and AI Manager are able to detect the anomalous behavior in the system because of a change in the metrics and errors in the logs, and they generate an alarm, which is then sent to the Event Manager.
  5. The Event Manager displays the alarms on the network events dashboard, which can be monitored by any SME. The Event Manager then sends an alert to the Cognitive Automation component.
  6. The Cognitive Automation component is trained using documents that contain relevant information on various problems and their solutions. It identifies the appropriate fix for the generated alert and sends an appropriate “next steps execution request” to Cloud Pak for Network Automation.
  7. Based on the recommended actions by the Cognitive Automation component, Cloud Pak for Network Automation performs the appropriate actions using the SDN Controller on the transport network.

Orchestration using Cloud Pak for Network Automation

IBM Cloud Pak for Network Automation provides multi-vendor and multi-domain service-level orchestration capabilities. It provides orchestration capabilities to directly provision software like xNFs and can also integrate with vendor- or function-specific element managers and controllers for deployment orchestration.

The Cloud Pak for Network Automation service designer allows anyone to create design templates that define service-chaining between the onboarded network functions and to create complex services. It also allows anyone to create behavior tests for test, pre-production, and production environments.

Cloud Pak for Network Automation follows intent-driven orchestration where it models the desired service operational state rather than pre-programming workflows.
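To illustrate the idea of intent-driven orchestration, here is a minimal sketch in Python. It is not the Cloud Pak for Network Automation API; the `ServiceState` class, the `plan_transitions` function, and the step names are hypothetical and exist only for this example. The point is that the caller declares the desired operational state, and the steps are derived by comparing it with the current state rather than being scripted in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    """Simplified, hypothetical model of a service's operational state."""
    installed: bool
    active_route: str  # "primary" or "backup"


def plan_transitions(current: ServiceState, desired: ServiceState) -> list[str]:
    """Derive the steps that move the service from current to desired state."""
    steps = []
    if desired.installed and not current.installed:
        steps.append("install")
    if desired.installed and current.active_route != desired.active_route:
        steps.append(f"switch_route:{desired.active_route}")
    if not desired.installed and current.installed:
        steps.append("uninstall")
    return steps


current = ServiceState(installed=True, active_route="primary")
desired = ServiceState(installed=True, active_route="backup")
print(plan_transitions(current, desired))
```

When the current state already matches the desired state, the plan is empty, so applying the same intent repeatedly causes no further changes.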

In our traffic flow optimization use case, Cloud Pak for Network Automation integrates with Juniper’s SDN controller to provision and manage lifecycle operations of Juniper virtual switches (vMXs).

The service design example in the figure below contains a set of Cloud Pak for Network Automation service definitions (or assemblies) chained together to represent the network service templates for Juniper transport network setup including L2-L3 label-switched path routes.

Network service diagram for Cloud Pak for Network Automation service definitions

Assurance using Cloud Pak for Watson AIOps

IBM Cloud Pak for Watson AIOps is a suite of products designed to manage data, metrics, events, and more in a software runtime environment. The individual components cater to specific needs while being able to work with each other to provide complete support for software runtime infrastructure management and monitoring.

In our use case, we used these components of Cloud Pak for Watson AIOps for assurance:

  • Metric Manager, which is used to monitor and analyze metrics and KPIs across various technological silos, such as processes, containers, VMs, network links, systems, and so on. The Metric Manager can consume metrics that it collects from these sources to build models for normal operational behavior of these systems and then analyze the metrics in real time to generate alarms when detecting a deviation from normal behavior. These deviations can be processed by other components like the AI Manager for further correlation or the Event Manager to trigger an appropriate action.

  • AI Manager, which combines infrastructure and operations management into a consolidated structure across various assets including business applications, infrastructure components, virtualized components, network and storage devices, and protocols. It analyzes unstructured data collected from all the resources at run time to provide actionable insights into their faults and failures and to perform root cause analysis. The log anomaly detector collects logs from various sources like LogDNA, Splunk, and so on, automatically learns normal log patterns from training data, creates a model of normal log behavior, and then performs real-time detection of anomalies through log analysis.

  • Topology Manager, which is used to fetch and visualize the topology of components and interactions between different components of application and infrastructure services. Topology Manager maintains the topology containing components and their interactions with other components of the system. It dynamically evolves the topology by learning new information through discovery of new components and their interactions with existing parts of the system under study. It also supports manually uploading static topology based on initial build and deployment information of a service.

  • Event Manager, which monitors and manages events that occur throughout the lifecycle of the entire stack of application deployments. It collects, classifies, normalizes, and deduplicates events. It can also perform event enrichment for analytics, event correlation, and event grouping either via manual rules or via built-in algorithms. Event Manager processes the events in real time to provide actionable insights that can be consumed by an orchestration platform to perform specific actions.
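As a rough illustration of what normalizing and deduplicating events means, consider the following Python sketch. It does not reflect how Event Manager is implemented; the `Event` fields, the raw event format, and the choice of (node, summary) as the deduplication key are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical normalized event record."""
    node: str
    summary: str
    severity: int
    count: int = 1  # number of raw events folded into this record


def normalize(raw: dict) -> Event:
    """Bring a raw event into a consistent shape (trimmed, lowercased summary)."""
    return Event(
        node=raw["node"].strip(),
        summary=raw["summary"].strip().lower(),
        severity=int(raw["severity"]),
    )


def deduplicate(raw_events: list[dict]) -> list[Event]:
    """Fold repeated events for the same node and summary into one record."""
    seen: dict[tuple[str, str], Event] = {}
    for raw in raw_events:
        event = normalize(raw)
        key = (event.node, event.summary)
        if key in seen:
            seen[key].count += 1
            seen[key].severity = event.severity  # keep the most recent severity
        else:
            seen[key] = event
    return list(seen.values())


raw_events = [
    {"node": "Core.R3", "summary": "High link delay", "severity": "3"},
    {"node": "Core.R3 ", "summary": "high link delay", "severity": 3},
    {"node": "Core.R1", "summary": "Interface flap", "severity": 3},
]
for event in deduplicate(raw_events):
    print(event)
```

Keeping a count on the surviving record preserves the information that the same condition was reported more than once.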

In our traffic flow optimization use case, we train Metric Manager on the standard delay of the links connecting the nodes of our system. We use the AI Manager’s log anomaly detector to consume logs from all the components of our system and detect anomalies in them. We use an event grouping service to combine the anomalies detected from logs with those detected from metrics (using Metric Manager), and then perform fault localization to target closed-loop remediation at the specific entity that caused the fault. AI Manager consumes the topology from the Topology Manager to localize faults to a specific component, in this case the faulty network link between two nodes, which narrows down the steps needed to analyze and remediate the fault. We use Event Manager to record the faults detected by the AI Manager and Metric Manager and to trigger the action that switches the network from its primary route to a backup route when a fault is detected. The Event Manager also tracks the lifecycle of individual events, including whether an event has been marked as resolved. We use this information to switch the network back to the primary route.
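The idea of learning a baseline for link delay and flagging deviations from it can be sketched in a few lines of Python. This is not Metric Manager's algorithm, which the article does not describe; the sketch substitutes a simple z-score test, and the function names, sample values, and threshold are assumptions for illustration.

```python
from statistics import mean, stdev


def train_baseline(delays_ms):
    """Learn a baseline (mean, standard deviation) from normal link delays."""
    return mean(delays_ms), stdev(delays_ms)


def is_anomalous(delay_ms, baseline, threshold=3.0):
    """Flag a delay that is more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return delay_ms != mu
    return abs(delay_ms - mu) / sigma > threshold


# Hypothetical delay samples (in milliseconds) collected during normal operation
training = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
baseline = train_baseline(training)

print(is_anomalous(10.2, baseline))  # a delay within the normal range
print(is_anomalous(25.0, baseline))  # a delay under increased network load
```

A production system would typically also account for seasonality and trends in the metric, which a single mean and standard deviation cannot capture.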

The Cloud Pak for Watson AIOps components have complex algorithms and capabilities, and here we have provided only an overview of the components used in the traffic flow optimization use case. We will provide a more detailed explanation of how they work and how they are used in the traffic flow optimization use case in the next article.

AI using Cognitive Automation

Our Cognitive Automation component adds the power of artificial intelligence (AI) to the self-healing and optimization workflows of the Network Operations Center (NOC). Traditional “closed loop” controls in a NOC consist of workflows that are developed in an external runbook automation (RBA) tool, which requires a separate technical team to manage and maintain because NOC requirements are continuously changing.

The Cognitive Automation component instead uses AI to guide machine-to-machine (M2M) communications and to simplify the creation and maintenance of closed-loop or open-loop workflows. It simplifies the process to the extent that users such as NOC engineers or business analysts can change a workflow without any programming know-how.

Our Cognitive Automation component seamlessly integrates with Cloud Pak for Watson AIOps and Cloud Pak for Network Automation to easily resolve network faults or inconsistencies.

The following figure shows the sub-components that comprise the Cognitive Automation component:

Cognitive automation component's sub-components

The Cognitive Automation component consists of the following sub-components:

  • IBM Watson Discovery Service. The core natural language processing (NLP) engine of the cognitive automation solution. It provides the base platform to train the NLP models based on the NOC processes and methods of procedures.

  • Event Manager component of Cloud Pak for Watson AIOps. The component that controls the runtime execution of the solution. It receives continuous incoming events from Cloud Pak for Watson AIOps that require cognitive self-healing resolutions.

In our use case, when an event occurs, Cloud Pak for Watson AIOps identifies the event and persists it in the Event Store. The Cognitive Automation component polls for or receives the event by using its Watcher service. It then identifies and processes events based on the documented methods of procedure.

When an event notification is received, the following troubleshooting steps are executed by interpreting the specific methods of procedure, using IBM Watson Discovery:

  • If the alarm was raised at node “Core.R3” (the router where the increased network load fault is injected), then based on the severity (that is, 3 or 0) it moves the assembly to either “backup” or “primary”. Severity is provided by Cloud Pak for Watson AIOps; a severity of 3 means that a fault is detected, and a severity of 0 means that the fault is resolved.
  • If the alarm was raised at any other node, it takes no action and exits the process.

Note that the information about which node the fault occurred on, such as “Core.R3” in our use case, is also provided in the event by Cloud Pak for Watson AIOps using fault localization.
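The decision rule described above can be expressed as a short Python sketch. In the actual system this logic is derived by interpreting the methods of procedure with IBM Watson Discovery rather than being hard-coded, so the `select_route` function and the constant names here are purely illustrative. The article defines only severities 3 and 0; treating any other severity as "no action" is an assumption of this sketch.

```python
from typing import Optional

TARGET_NODE = "Core.R3"   # router where the fault is injected in this use case
SEVERITY_FAULT = 3        # fault detected
SEVERITY_RESOLVED = 0     # fault resolved


def select_route(node: str, severity: int) -> Optional[str]:
    """Return the route to move the assembly to, or None when no action is taken."""
    if node != TARGET_NODE:
        return None  # alarm raised at another node: exit without action
    if severity == SEVERITY_FAULT:
        return "backup"
    if severity == SEVERITY_RESOLVED:
        return "primary"
    return None  # assumption: other severities are ignored


print(select_route("Core.R3", 3))
print(select_route("Core.R3", 0))
print(select_route("Core.R1", 3))
```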

A closed-loop automation system

Let’s now bring together the different components that we talked about in this article and identify how they come together to create a closed loop automation system.

  • We started by provisioning the service using Cloud Pak for Network Automation. After the service had been provisioned and had been running for some time, we saw the anomaly first appear in our service.
  • Metric Manager was then able to identify the deviation in the service metrics, and AI Manager was able to identify errors in the logs being collected from the service in real time.
  • AI Manager was also able to group metrics and log anomalies together using the event grouping service, and it was able to pinpoint faults to a single node by performing fault localization, and then forward an event with the same information to the Event Manager.
  • Event Manager then filters the events and sends an event to our AI running as part of the Cognitive Automation component.
  • After the Cognitive Automation component receives the event, it further identifies the specific problem, identifies a solution to the problem with a degree of confidence, and then finally sends the solution to Cloud Pak for Network Automation.
  • Cloud Pak for Network Automation executes the steps identified by the Cognitive Automation component and applies them to our service. This triggers a “heal intent” that restores the service to its correct configuration, which completes the closed loop.

Summary

In this article, we presented a detailed description of a traffic flow optimization use case and explained how different IBM products like Cloud Pak for Network Automation, Cloud Pak for Watson AIOps, and our Cognitive Automation component work together to enable a closed-loop automation system that operates in real time.