Automation-powered AIOps

Instana, one of the core components of IBM’s AIOps portfolio, is an enterprise-grade full-stack observability platform, while Ansible Automation Platform is an enterprise framework for building and operating IT automation at scale, from hybrid cloud to the edge. At first glance, the relationship between these two platforms may not be obvious. However, we built a proof of concept (PoC) that shows how easily the two platforms can be integrated to optimize IT operations by applying AI (AIOps).

Beyond the common automation use cases, where the Ansible Automation Platform performs operating system (OS) hardening, upgrading, patching, and package installation, we explored some new use cases with IBM Observability by Instana APM (Instana). Together, Instana and the Ansible Automation Platform enable application SREs, developers, infrastructure operators, and business owners to automate and streamline app-centric full-stack observability in a typical hybrid multi-cloud environment.

In this article, we review three use cases that our PoC supports:

  • Automating the deployment of Instana agents to tens of Kubernetes or Red Hat OpenShift clusters and thousands of VMs, with various operating systems like RHEL, Windows, or Ubuntu, to gain full observability
  • Ensuring service continuity with auto-remediation by triggering an Ansible playbook through a webhook to recover the service whenever an offline event is observed in Instana
  • Using GitOps for Day 2 operations to equip application SREs, developers, and infrastructure operators with deeper insights for critical services (such as MySQL) that require technology-specific configurations

For more background on our PoC and the supported personas, read the article “The Power of AI and the Science of Operations.”

Architecture of our PoC

The PoC is based on a typical hybrid cloud environment with a high-performance database running on virtual machines (VMs) and microservices-based applications running on Kubernetes.

The PoC consists of two applications running on a managed Red Hat OpenShift cluster on IBM Cloud (Red Hat OpenShift Kubernetes Service, or ROKS for short):

  • The Robot Shop application runs natively on OpenShift.
  • The Spring Boot application has two parts: the application runs on OpenShift while its backend MySQL database runs on a virtual machine.

The Ansible Automation Platform is integrated with Instana through a webhook. It is also configured with a Git repository that holds the playbooks and the extra technology-specific configuration for the desired components (MySQL in this case), which overrides the default settings in Instana.

The overall architecture is shown in the following figure.

Architecture of the Proof of Concept for Instana and Ansible

Demonstration of our PoC

Watch us discuss and walk through our proof of concept in this video:

The rest of this article also details the three use cases represented in our PoC.

Deploying Instana agents using Ansible playbooks

We start off with a clean slate in Instana, with the Infrastructure View dashboard showing the instana-cluster. Instana has been configured to monitor itself by simply deploying the Instana agent into the cluster. Note that this is a simple single-VM setup just for demo purposes.

Instana cluster

To monitor the “to-be-managed” infrastructure such as public clouds and on-prem platforms including Docker, Kubernetes, Red Hat OpenShift, Cloud Foundry, VMware Tanzu, and any virtual machines, you must install the Instana agents on these infrastructure components.

To manually install an agent, you can run the automatically generated “one-line deployment” command that is available in the Instana user interface.

Deploy Instana agents to infrastructure components in UI

However, the manual installation of Instana agents to hundreds or even thousands of nodes and clusters is not scalable, requires too much effort, and is error prone. Instead, the recommended approach for installing at scale is to use an automation platform such as the Ansible Automation Platform.

In our PoC, we make use of survey forms in the Ansible Automation Platform to abstract and input the required information and have Ansible execute the tasks. This can be done on Kubernetes, Red Hat OpenShift, and VMs running different operating systems including Red Hat Enterprise Linux (RHEL), Windows, and Ubuntu, in a fast, accurate and consistent manner.

Ansible Automation Platform survey forms

All activities are centrally logged, and we use the role-based access control (RBAC) feature in the Ansible Automation Platform to delegate the execution to a designated group of infrastructure operators or application SREs. For more details, refer to the playbook.

Upon executing the playbooks, the instana-agent operator will be deployed on Red Hat OpenShift and the agents will be deployed on the Kubernetes nodes based on the agent’s custom resource (CR).
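As a rough illustration of what such a playbook has to render for each cluster, the following Python sketch builds an agent custom resource as a plain dictionary. The `instana.io/v1` and `InstanaAgent` layout and the `spec` field names are assumptions based on the publicly documented instana-agent operator and should be verified against the operator documentation. The function name, the namespace default, the endpoint host, the cluster name, and the key are hypothetical and are not taken from the PoC playbook.

```python
import json


def build_agent_cr(agent_key, endpoint_host, endpoint_port, zone, cluster_name,
                   namespace="instana-agent"):
    """Return an InstanaAgent custom resource as a plain dict.

    Field names are assumed from the operator's documented CR layout;
    verify them against the operator version you deploy.
    """
    if not agent_key:
        raise ValueError("agent_key must not be empty")
    return {
        "apiVersion": "instana.io/v1",
        "kind": "InstanaAgent",
        "metadata": {"name": "instana-agent", "namespace": namespace},
        "spec": {
            "zone": {"name": zone},
            "cluster": {"name": cluster_name},
            "agent": {
                "key": agent_key,
                "endpointHost": endpoint_host,
                # The port is rendered as a string in the manifest.
                "endpointPort": str(endpoint_port),
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values; in the PoC these would come from the survey form.
    cr = build_agent_cr(
        agent_key="REPLACE_ME",
        endpoint_host="instana.example.com",
        endpoint_port=443,
        zone="ansible_managed_IBM_ROKS",
        cluster_name="demo-cluster",
    )
    print(json.dumps(cr, indent=2))
```

Keeping the resource as data makes it straightforward to template one manifest per cluster from the same survey inputs.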

Command output of Instana agents deployed to Kubernetes nodes

As part of the automation, an Instana agent will also be installed on the VM that hosts the MySQL database, as a systemd-managed daemon.

Ansible templates

See the playbook for details.

Command output of Instana agents deployed to VM with MySQL database

As a result of Ansible automation, all Instana agents for the PoC are deployed. The Instana dashboard shows both the Red Hat OpenShift cluster and the MySQL VM in different zones: ansible_managed_IBM_ROKS and ansible_managed_zone respectively. Zones in Instana are logical groups of infrastructure for a better management experience. While Instana is app-centric, it also offers a comprehensive Infrastructure View to visualize the overall infrastructure as transparent zones and boxes.

Instana Infrastructure View showing the overall infrastructure

Using Instana Application Perspectives (AP) for app-centric observability

Once deployed, the Instana agents automatically activate sensors to collect observability data for each technology with very low overhead. The agent adds less than 0.1 CPU and 80 megabytes of memory to the managed platform to capture all the important metrics, traces, and events with one-second granularity.

The role of application SREs typically focuses on three major areas:

  • Application-centric observability
  • Root cause analysis (RCA)
  • Automated alerts and actions

Instana provides a powerful concept called Application Perspectives, where a series of out-of-the-box models are available so that the application SRE can easily create a customized application monitoring dashboard for specific services of highest interest. This is important, especially in the microservices world, where an application can contain a large number of different components.

Instana application perspective

Once the application perspective is created, a fine-grained dashboard is generated automatically. The dashboard contains the relevant widgets for capturing and visualizing the golden signals in a single view.

Instana fine-grained dashboard

The application SRE can easily start the analysis by diving deeper into a particular issue, with the metrics, traces, events, and logs aggregated together, by clicking the Analyze Calls button.

Instana analyze call details

The application SRE can dive into the stack trace in the right panel to find out which line of code caused or is causing the problem. This is extremely helpful for Root Cause Analysis (RCA) and can significantly reduce the Mean Time To Repair (MTTR).

Instana stack trace

For more information about the capabilities of Instana, review the Instana documentation.

Providing service continuity with auto-remediation

Beyond identifying the problem and finding the root cause of the issue, wouldn’t it be great to fix it automatically? Let’s see how Ansible Automation Platform can be integrated with Instana (via webhook) to automatically recover the application in our PoC.

To achieve auto-remediation, we first need to set up a custom event rule in Instana for the signature that we are monitoring, which is the “MySQL Down” event in the PoC.

Instana custom event rule for auto-remediation

Next, we set up a webhook for the Alert Channel in Instana where we associate the Ansible playbook or workflow with the custom alert. By doing so, Instana can trigger the appropriate playbook or workflow to remediate the issue when it detects a database failure.

Ansible set up webhook

The next part of the demo simulates a failure by deliberately shutting down the MySQL service. We can see that we no longer receive a response from the application endpoint once the database service goes down.

Simulating a failure

The failure will be registered as an event in Instana.

Instana registered event

Instana then triggers the webhook with a REST API call to the Ansible Automation Platform, which executes the associated playbook to restart the MySQL service, thereby recovering the application.
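To make the shape of such a call concrete, here is a minimal Python sketch that builds (but does not send) a request to launch a job template. It assumes the automation controller's `/api/v2/job_templates/<id>/launch/` endpoint with bearer-token authentication. The article does not show the exact URL, authentication method, or variables used in the PoC, so the controller URL, template ID, token, and `extra_vars` below are all hypothetical.

```python
import json
import urllib.request


def build_launch_request(controller_url, job_template_id, token, extra_vars):
    """Build a POST request that launches a job template.

    Assumes the controller's /api/v2/job_templates/<id>/launch/ endpoint
    and bearer-token authentication; both should be checked against the
    installed controller version.
    """
    url = (
        f"{controller_url.rstrip('/')}"
        f"/api/v2/job_templates/{job_template_id}/launch/"
    )
    body = json.dumps({"extra_vars": extra_vars}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    req = build_launch_request(
        controller_url="https://aap.example.com",
        job_template_id=42,
        token="REPLACE_ME",
        extra_vars={"service_name": "mysqld"},
    )
    # Print instead of sending, so the sketch has no side effects.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. Separating construction from sending keeps the call easy to inspect and test.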

Ansible REST API call triggered

Note that Ansible Automation Platform can also be integrated with IT Service Management (ITSM) systems such as ServiceNow to create, update and close tickets. The Ansible Certified Content for ServiceNow makes it possible for Ansible to create a ticket in ServiceNow, resolve the problem identified by Instana, and then update and close the ticket once it’s done. We can gain huge operational efficiency in this way as it helps to free up the SRE team to work on other more important and critical issues.

GitOps for Day 2 and Beyond

Instana offers powerful default settings that let you start the observability journey with a zero-configuration experience. For instance, if we dive deeper into our MySQL service, we can see that there is a dynamic “stack” concept (three layers in our case):

  • MySQL Service
  • mysqld Daemon Process
  • Linux Host

Instana dynamic stack for MySQL database service

Every process within the managed host can be monitored with some basic metrics, such as CPU, memory, and open files.

Instana basic metrics for each managed host in stack

But if we want to gain greater insights into the MySQL service (for example, queries, average query latency, threads connected, wait events, or schemas), we need to provide additional configuration that is specific to MySQL. The same applies to other “advanced” services. Once the MySQL configuration details are provided, Instana can retrieve more substantial information, enabling infrastructure operators and application SREs to address Day 2 application issues more effectively.

In the case of MySQL, the following configuration actions must be executed:

  1. Create a MySQL user with appropriate permissions.
  2. Provide the credentials of that user to Instana so that the Instana agent’s sensor can use them to access the MySQL instance.
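The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PoC's playbook: the privilege list and the `ON *.*` scope are placeholders (consult the Instana MySQL sensor documentation for the permissions it actually requires), the `com.instana.plugin.mysql` key is assumed from the agent's configuration conventions, and the function names and sample values are hypothetical.

```python
import re

# Conservative patterns so the generated SQL cannot be broken by odd input.
_NAME = re.compile(r"^[A-Za-z0-9_]+$")
_HOST = re.compile(r"^[A-Za-z0-9_.%-]+$")
_PRIV = re.compile(r"^[A-Z]+( [A-Z]+)*$")


def mysql_user_statements(user, host, password, privileges):
    """Step 1: return the SQL statements that create and grant the user."""
    if not _NAME.match(user):
        raise ValueError("unsupported characters in user name")
    if not _HOST.match(host):
        raise ValueError("unsupported characters in host")
    if "'" in password or "\\" in password:
        raise ValueError("password must not contain quotes or backslashes")
    if not privileges or not all(_PRIV.match(p) for p in privileges):
        raise ValueError("privileges must be upper-case SQL keywords")
    account = f"'{user}'@'{host}'"
    return [
        f"CREATE USER {account} IDENTIFIED BY '{password}';",
        # The global scope is a placeholder; narrow it to what the sensor needs.
        f"GRANT {', '.join(privileges)} ON *.* TO {account};",
    ]


def agent_mysql_config(user, password):
    """Step 2: render the agent configuration fragment (assumed key name)."""
    return (
        "com.instana.plugin.mysql:\n"
        f"  user: '{user}'\n"
        f"  password: '{password}'\n"
    )


if __name__ == "__main__":
    # Hypothetical user, password, and privilege list.
    for statement in mysql_user_statements(
        "instana", "localhost", "REPLACE_ME", ["SELECT"]
    ):
        print(statement)
    print(agent_mysql_config("instana", "REPLACE_ME"))
```

In practice, the password would typically come from a secrets store rather than being committed to a repository in plain text.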

Fortunately, the above configuration actions can be executed seamlessly by leveraging the built-in GitOps feature in Instana. Using a Git repository as the single source of truth means that we can easily update the configurations on the managed nodes by updating the files in the Source Code Management (SCM) platform, GitHub in this case, through a Pull Request (PR) process with a review and approval mechanism. The Instana agent will pick up the changes from the configured Git repository and apply them on the managed nodes. We can enable the GitOps experience by configuring the agent’s Configuration Management.

Instana agent configuration

But as mentioned earlier, it can be challenging to manually update individual agent configurations at scale. Ansible can be used to automate the process of MySQL user creation, assigning required permissions to MySQL users, and enabling GitOps for hundreds or even thousands of agents.

Ansible automating the configuration

Upon successful execution of the playbooks, the agent is configured with GitOps enabled, and Instana can retrieve additional MySQL information, including the number of queries, query latency, and the number of connected threads.

Instana insights from GitOps

As a result, more insights are available for our targeted personas to assist them in Day 2 operations.

Instana GitOps Day 2 operations

Conclusion

In this PoC, we have gone through specific use cases, showcasing Instana’s features and how Ansible Automation can help augment these capabilities to fulfill the expectations of infrastructure operators, application SREs, developers, and business owners, for full-stack observability of applications and infrastructure.

Besides observability, there are additional capabilities that should be considered:

  • Infrastructure automation for both traditional (that is, virtualization) and cloud-native (for example, Kubernetes or OpenShift) infrastructure
  • Application performance assurance with continuously optimized resources
  • Proactive resolution with AI-powered insights and actions

These are the building blocks of a complete stack of AI-powered solutions for IT and application operations, known as AIOps. AIOps brings confidence and efficiency to managing and operating IT infrastructure and applications at scale, across hybrid multi-cloud landscapes. IBM, together with Red Hat, has been proactively leading the market with the AIOps stack illustrated below. Customers can adopt all or some of the components that make the most sense for their environments to accelerate their digital transformation journey.

AIOps stack for automated AIOps

We will continue to discover and build more practical, real-world use cases that demonstrate the power of AI and the science of operations. Reach out to us if you have further questions or comments on what was discussed!

Learn more about observability-driven development by exploring more content on the Instana hub. Or, learn more about multi-tier application deployments on Kubernetes using Ansible in this tutorial.