IT and network operations are becoming increasingly complex, so troubleshooting and resolving issues quickly is critical. This complexity arises because of the number of applications, the variety of hardware and software in the infrastructure, the volume of data, and the large number of business processes that are part of network and IT operations.
While cloud-native and microservices-based architectures are becoming popular for their agility, ease of development, scalability, and roll-out of upgrades, they also increase the number of components that need to be analyzed. Troubleshooting and root-cause analysis are harder with the explosion of data available from individual microservices.
Closed-loop automation systems enable companies to transform network and IT operations by using AI-driven automation to detect anomalies, determine a resolution, and implement the required changes within a continuous, highly automated framework. Closed-loop automation helps solve many problems before they even become issues.
What is closed-loop automation?
A simple closed-loop implementation detects issues before they occur. Predictive models analyze the relevant data and recommend a change to the orchestration layer, which then implements it.
In more complex cases, closed-loop automation combines predictive insights with additional AI systems to determine a resolution. The AI system is trained to resolve these issues and is integrated with a robotics automation system to automate the resolution process. If the AI system has high confidence that the suggested resolution is correct, it invokes the orchestration engine to implement the solution automatically. If not, a trouble ticket is generated, and an engineer works to resolve the issue.
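The confidence-based hand-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Resolution` type, the action names, and the 0.9 cutoff are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; a real system would tune this


@dataclass
class Resolution:
    action: str        # hypothetical action name understood by the orchestrator
    confidence: float  # confidence of the AI system in the action, 0.0 to 1.0


def handle_issue(resolution: Resolution,
                 threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Automate a high-confidence resolution, otherwise open a trouble ticket."""
    if resolution.confidence >= threshold:
        return f"orchestrate:{resolution.action}"
    return f"ticket:{resolution.action}"


print(handle_issue(Resolution("restart_router_process", 0.97)))
print(handle_issue(Resolution("reroute_traffic", 0.55)))
```

The first call is routed to the orchestrator and the second becomes a ticket for an engineer, which mirrors the two branches of the loop.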
The following image provides an overview of a closed-loop automation system that addresses issues of varying complexity.
Closed-loop automation enables these key capabilities in network operations:
- Anomaly detection – Anomaly detection uses large volumes of real-time, time-series data from networks, applications, database metrics, operating systems, and so on to identify patterns and anomalies and to prompt predictive actions.
- Intelligent alerts – In a typical operations environment, multiple connected components may raise alerts that all relate to the same failure event. These alerts add to the volume that operations teams must handle, and 20 percent of overall alert volume is false-positive. Closed-loop automation uses machine learning models to find patterns in a series of alerts so that the alerts can be tied to causes and known actions and then corrected accordingly.
- Predictive planning – Organizations can use machine learning algorithms to predict how application and network behaviors depend on seasonality and other factors, so that appropriate corrective actions are taken and systems perform optimally.
- Root-cause analysis – Closed-loop automation uses data to identify all anomalies in the service path and uses AI to map them to the most likely cause of a particular incident. It applies various AI algorithms to improve the accuracy of root-cause identification and implements the required remediation steps.
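To make the anomaly detection capability concrete, here is a minimal sketch that flags points in a time series that deviate strongly from the recent past. It uses a simple rolling z-score rather than a trained model, and the window size, threshold, and sample latency values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, z_threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # [7]
```

A production system would typically account for seasonality and use far more data, but the shape of the loop is the same: compare each new observation with a learned baseline and raise the outliers.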
IT Ops teams apply AI and automation to network operations in stages:
- Simplify and focus – Use analytics and machine learning to reduce events, focus on high-value services and clients, and create real-time topology.
- Predict to get ahead – Use machine learning algorithms and analytics to digest time-series data, find patterns and seasonality in network behavior, and predict abnormal behaviors that are precursors to events.
- Augment the process – Automate operations end to end by applying cognitive process automation with robotics, and use artificial intelligence to determine what is needed and the next steps.
- Augment staff – Apply AI and automation to speed problem resolution and learning. Consolidate the structured and unstructured data needed for the problem, provide guidance in natural language, and capture the learning for continual improvement and augmentation of operations.
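As a rough illustration of the first stage, reducing events, the sketch below groups alerts that share an upstream dependency and arrive close together in time, so that one failure is handled as one incident. It uses a fixed rule in place of a machine learning model, and the `Alert` fields, the `upstream` attribute, and the 60-second window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary start
    component: str    # component that raised the alert
    upstream: str     # hypothetical field: shared dependency taken from topology


def group_alerts(alerts, window_s=60.0):
    """Group alerts with the same upstream that fall within window_s of the
    first alert in their group."""
    groups = []
    latest = {}  # upstream name -> most recent group for that dependency
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest.get(alert.upstream)
        if group is not None and alert.timestamp - group[0].timestamp <= window_s:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest[alert.upstream] = group
    return groups


alerts = [
    Alert(0, "web-1", "db-primary"),
    Alert(5, "web-2", "db-primary"),
    Alert(12, "api-1", "db-primary"),
    Alert(300, "web-1", "db-primary"),
]
print([len(group) for group in group_alerts(alerts)])  # [3, 1]
```

Four raw alerts collapse into two incidents here. A learned model would replace the fixed window and the topology lookup, but the goal of presenting fewer, better-explained events is the same.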
Use cases for implementing closed-loop automation
Now that you have a basic understanding of closed-loop automation, let’s look at some use cases:
- Container orchestration with Kubernetes
- Intent-based management of 5G radio access networks
- Closed-loop automation in security
- Traffic flow optimization with closed-loop automation
Use case #1: Container orchestration with Kubernetes
Container orchestration engines, such as Kubernetes, support a limited set of closed-loop automation scenarios, such as self-healing and auto-scaling. Kubernetes internally executes a control loop in which it continuously monitors the state of the applications deployed in the cluster and compares it with the declarative specification of the desired state provided by application developers. If the current state does not match the desired state, the control loop takes the necessary actions to reach the desired state of the system.
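The control-loop idea can be sketched independently of Kubernetes itself. The following is a toy reconciler, not the Kubernetes implementation: the state is reduced to replica counts per application, and the application names and the `reconcile` and `apply_actions` helpers are hypothetical.

```python
def reconcile(desired: dict, current: dict) -> list:
    """Compare desired and current replica counts and return the actions needed."""
    actions = []
    for app, want in desired.items():
        have = current.get(app, 0)
        if have < want:
            actions.append(("start", app, want - have))
        elif have > want:
            actions.append(("stop", app, have - want))
    # Anything running that is not in the desired state should be removed.
    for app, have in current.items():
        if app not in desired and have > 0:
            actions.append(("stop", app, have))
    return actions


def apply_actions(actions: list, current: dict) -> dict:
    """Apply the actions to the current state (stands in for the real cluster)."""
    for verb, app, count in actions:
        delta = count if verb == "start" else -count
        current[app] = current.get(app, 0) + delta
    return current


desired = {"web": 3, "api": 2}
current = {"web": 1, "api": 2, "legacy": 1}

actions = reconcile(desired, current)
print(actions)  # [('start', 'web', 2), ('stop', 'legacy', 1)]
current = apply_actions(actions, current)
print(reconcile(desired, current))  # [] once the states match
```

Running the loop repeatedly is what makes it self-correcting: any drift introduced between iterations shows up as a new, non-empty list of actions.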
Application developers can specify liveness and readiness probes as part of the deployment specification. Kubernetes periodically invokes these probes to check whether the application state is normal. If a liveness probe fails a preconfigured number of consecutive times, Kubernetes restarts the container to self-heal the application. If a readiness probe fails, Kubernetes stops routing traffic to the pod until the probe succeeds again. This is a simple closed-loop automation scenario that Kubernetes supports natively.
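The restart rule for a liveness-style probe can be simulated as follows. This is a simplified model rather than Kubernetes code: probe outcomes are given as a list of booleans, and the default of three consecutive failures is used as an assumed threshold.

```python
def restart_points(probe_results, failure_threshold=3):
    """Return the indexes of probe runs after which a restart is triggered.

    probe_results is an iterable of booleans (True means the probe succeeded).
    """
    restarts = []
    consecutive_failures = 0
    for i, ok in enumerate(probe_results):
        if ok:
            consecutive_failures = 0  # any success resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts.append(i)
            consecutive_failures = 0  # the restarted container starts fresh
    return restarts


results = [True, False, True, False, False, False, True]
print(restart_points(results))  # [5]
```

The isolated failure at index 1 is ignored because the next probe succeeds, while the three failures in a row at indexes 3 to 5 trigger a restart.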
Another scenario pertains to auto-scaling. Application developers can specify simple policies based on thresholds of certain metrics. If the metric crosses the specified thresholds, Kubernetes can spawn additional containers or remove surplus containers to automatically scale the service up or down. Kubernetes supports other actions as part of its control loop and is also designed to be programmatically extensible.
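For threshold-based scaling, the Kubernetes Horizontal Pod Autoscaler documentation describes the core rule as desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule; the replica bounds and the metric values in the examples are assumptions for illustration, and the real autoscaler has additional behavior (such as stabilization windows) that is not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Compute a replica count from the ratio of observed to target metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))   # 6, scale up
print(desired_replicas(4, current_metric=30, target_metric=60))   # 2, scale down
print(desired_replicas(4, current_metric=62, target_metric=60))   # 4, within tolerance
print(desired_replicas(4, current_metric=600, target_metric=60))  # 10, capped
```

The tolerance band avoids scaling on small fluctuations around the target, and the minimum and maximum bounds keep a misbehaving metric from scaling the service without limit.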
Use case #2: Intent-based management of 5G radio access networks
The transition from 4G to 5G represents a radical transformation in communications technology, enabling service providers to support differentiated and guaranteed service for enterprise and industrial applications with varying traffic characteristics. 5G can support ultra-low-latency communication, high bandwidth, and the ability to simultaneously communicate with millions of devices.
A typical radio access network (RAN) makes millions of decisions every second regarding which user to serve and how to serve them. These decisions have a tangible impact on the service quality and any SLAs that need to be guaranteed. In 2G and 3G systems, the possible settings and configuration parameters were few, and it was possible to estimate how any change in a setting would impact end-user services. In modern 5G systems that support multiple services and applications simultaneously, it is practically impossible to predict how any configuration changes would impact end users.
For this reason, the RAN is evolving to an intent-based management framework, where the service provider would specify the services desired, any high-level business policies, and priorities across different users and services as intents. Automated management and orchestration systems would continuously monitor current service levels against the specified intents, automatically translate them into changes in technical parameters and settings, and dynamically adapt. The RAN Intelligent Controller is a concept developed by the O-RAN Alliance, where the controller leverages AI and machine learning for service assurance and closed-loop automation. It provides an open platform where applications can be developed that can build on top of one another, working with a broad array of data and insights ranging from application-level information to radio-signal strength and related parameters to end-user related information like mobility trajectories. This enables a high degree of network programmability fine-tuned and optimized to specific user-level service assurance.
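A heavily simplified sketch of the monitoring half of intent-based management is shown below: intents are expressed as service-level targets with priorities, and measured values are checked against them to decide where to act first. The `Intent` fields, service names, and latency figures are hypothetical, and translating a violation into actual radio parameter changes is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    service: str
    max_latency_ms: float  # hypothetical service-level target
    priority: int          # lower number means more important


def find_violations(intents, measured_latency_ms):
    """Return (service, excess latency) for violated intents, most important first."""
    violations = []
    for intent in intents:
        measured = measured_latency_ms.get(intent.service)
        if measured is not None and measured > intent.max_latency_ms:
            excess = round(measured - intent.max_latency_ms, 1)
            violations.append((intent.priority, intent.service, excess))
    return [(service, excess) for _, service, excess in sorted(violations)]


intents = [
    Intent("video", max_latency_ms=50, priority=2),
    Intent("robot-control", max_latency_ms=5, priority=1),
    Intent("iot", max_latency_ms=200, priority=3),
]
measured = {"video": 72.0, "robot-control": 8.5, "iot": 120.0}
print(find_violations(intents, measured))
```

In this example the robot-control service is handled before video because its intent carries the higher priority, even though video exceeds its target by a larger margin.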
The O-RAN specification calls out two types of RAN Intelligent Controllers. A non-RT (non-realtime) RAN Intelligent Controller supports non-real-time intelligent RAN optimizations, with control loops that typically take 1 second or more, for use cases such as policy-based control and network planning. A near-RT (near-realtime) RAN Intelligent Controller supports intelligent near-real-time control loops, typically between 10 milliseconds and 1 second, for use cases that include beam-forming, scheduling, fast spectrum management, quality of service assurance, slice optimization, and mobility management. This is shown in the image below (connections shown are logical connections and not representative of specific interfaces).
Non-realtime rApps and near-realtime xApps can be hosted on the corresponding RAN Intelligent Controller platforms to provide value-added services that handle specific traffic classes or users, and provide specific data processing, analytics, or orchestration functions. These rApps and xApps can be developed independently, making the RAN an open platform for innovation.
Use case #3: Closed-loop automation in security
As mentioned, a closed loop takes a manual process, analyzes it, and automates it. This approach can be applied in other areas of IT support. One such area is cybersecurity, where reports of stolen data and even attacks on financial institutions are fairly regular.
We can apply closed-loop automation in this area to help enhance the security of our data. Suppose an unknown user tries to access our data. We can use tooling like IBM QRadar SIEM to detect the issue, find the appropriate action, and apply the changes with an orchestrator.
The following image shows an example of closed-loop automation for security. The loop detects the event, investigates it, and then responds to it using the orchestrator.
For example, we can move the files to a location that is out of the attacker’s reach while simultaneously identifying how the user was able to access the data and applying a way to block that vulnerability. This is just a simple example of closed-loop automation in cybersecurity, in which we automate the handling of the threat instead of waiting for someone to identify, manage, and apply the security fixes.
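The detect, investigate, and respond stages can be outlined as three small functions chained into a loop. This sketch does not use any SIEM or orchestrator API: the event fields, the allow-list of known users, and the action strings are all hypothetical stand-ins for what real tooling would provide.

```python
KNOWN_USERS = {"alice", "bob"}  # hypothetical allow-list


def detect(event: dict) -> bool:
    """Flag access attempts from users that are not on the allow-list."""
    return event["user"] not in KNOWN_USERS


def investigate(event: dict) -> dict:
    """Collect what was targeted and how the user got in."""
    return {
        "resource": event["resource"],
        "entry_point": event.get("source_ip", "unknown"),
    }


def respond(finding: dict) -> list:
    """Produce actions for the orchestrator: protect the data, close the path."""
    return [
        f"quarantine:{finding['resource']}",
        f"block:{finding['entry_point']}",
    ]


def closed_loop(events: list) -> list:
    actions = []
    for event in events:
        if detect(event):
            actions.extend(respond(investigate(event)))
    return actions


events = [
    {"user": "alice", "resource": "/data/reports", "source_ip": "198.51.100.4"},
    {"user": "mallory", "resource": "/data/payroll", "source_ip": "203.0.113.7"},
]
print(closed_loop(events))
```

Only the second event produces actions, one to move the targeted data out of reach and one to block the entry point, without waiting for a person to work through each step.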
Use case #4: Traffic flow optimization with closed-loop automation
In this use case, operations analytics, operations/service management, cognitive operations, and MANO (Management and Orchestration) work together to automatically correct issues within the provisioned infrastructure’s network.
When an anomaly is detected, we can proactively divert the network traffic to a backup flow while we automatically fix the issues on the primary flow. After the issue is fixed, our AI and orchestrator can reroute that traffic back to the primary flow. We will expand on this use case in the next article, where we explore the role of AI and analytics in automation.
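The divert-and-restore behavior can be modeled as a small state machine. This is a sketch under assumptions: the health of the primary flow is reduced to a boolean per sample, and the `recover_after` hold-down (requiring two healthy samples in a row before switching back) is an added design choice to avoid flapping, not something specified above.

```python
def route_traffic(health_samples, recover_after=2):
    """Return which flow carries traffic after each health sample of the primary."""
    active = "primary"
    healthy_streak = 0
    path_log = []
    for healthy in health_samples:
        healthy_streak = healthy_streak + 1 if healthy else 0
        if active == "primary" and not healthy:
            active = "backup"   # anomaly detected: divert traffic
        elif active == "backup" and healthy_streak >= recover_after:
            active = "primary"  # primary is stable again: restore traffic
        path_log.append(active)
    return path_log


samples = [True, False, False, True, True, True]
print(route_traffic(samples))
# ['primary', 'backup', 'backup', 'backup', 'primary', 'primary']
```

Traffic moves to the backup flow on the first unhealthy sample and returns to the primary flow only after the primary has looked healthy for two consecutive samples.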
Benefits of closed-loop automation
Consider these benefits of closed-loop automation:
- Improved network reliability through automation built on AI – With the ability to automatically identify issues on our network, we can not only fix an issue automatically but also apply alternate network paths that help mitigate its effects while the fix is applied. We will expand on this when we discuss our traffic flow optimization solution in the next article.
- Superior customer experience leading to reduced customer churn – By automatically resolving issues that may arise, we can help ensure that the end-user customer faces minimal service interruptions.
- Manual tasks are reduced through automation, increasing workforce productivity – With closed-loop automation, we can offload tasks from our network engineers, who can then focus on other issues that require further inspection, improving overall efficiency.
- Mean time to resolution for incidents is decreased, providing improved network services, better network performance, and a faster rollout of new services.
In this article, we introduced closed-loop automation and the problems it helps solve. We then looked at multiple use cases of closed-loop automation in areas such as Kubernetes, intelligent RAN, security, and traffic flow optimization. We also reviewed some of the major benefits of closed-loop automation. In the next article of this series, we will cover the traffic flow optimization use case in detail and provide an overview of the technologies used to realize it, including IBM Cloud Pak for Network Automation and IBM Cloud Pak for Watson AIOps.