Strengthening Application Performance with Turbonomic
Scale, provision, de-provision, and optimize resources for performance, cost, and availability using IBM Turbonomic Application Resource Management, an AIOps and automation platform
AIOps is the application of AI to IT operations. The objective of AIOps is not to mimic human intelligence but to apply algorithms to solve specific problems, often much faster, much more accurately, and at much higher scale than a human.
As applications become more distributed and complex, and as the infrastructure those applications run on gets more distributed and complex (often spanning from data centers to public cloud to edge computing), it becomes untenable to keep applications performing reliably and efficiently at scale without AIOps.
Enterprises that adopt AIOps are discovering that their developers, SREs, and Ops teams can be more productive and spend more time on innovation, because AIOps frees them from troubleshooting problems, performing root cause analysis, conducting routine maintenance, and other “keeping the lights on” activities.
Most IT organizations spend significant resources on day-to-day operations within their IT environments. In many enterprises, so many alerts are generated by different monitoring tools that a class of monitoring tools has emerged to filter out all but the most severe performance issues or risks, which are then surfaced for IT staff to investigate and remediate.
These tools bring advanced analytics and logic-based capabilities (the AI in AIOps) to classify alerts that can safely be ignored and suppress them from view, so staff can more quickly identify the root cause of a significant issue or, better yet, address a risk before it becomes a big problem.
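To make that concrete, here is a minimal Python sketch of the kind of deduplication and severity-based suppression such tools apply. The alert fields, severity scale, and threshold are hypothetical illustrations, not any specific product’s logic:

```python
from collections import Counter

# Hypothetical alert stream; real AIOps tools ingest these from monitoring APIs.
alerts = [
    {"source": "node-12", "signal": "cpu_high",     "severity": 2},
    {"source": "node-12", "signal": "cpu_high",     "severity": 2},
    {"source": "db-01",   "signal": "disk_latency", "severity": 5},
    {"source": "web-03",  "signal": "pod_restart",  "severity": 1},
]

SEVERITY_FLOOR = 4  # assumed policy: suppress anything below this severity

# Deduplicate identical alerts and count how often each one fired.
counts = Counter((a["source"], a["signal"], a["severity"]) for a in alerts)

# Surface only the high-severity alerts, annotated with their repeat count.
for (source, signal, severity), seen in counts.items():
    if severity >= SEVERITY_FLOOR:
        print(f"INVESTIGATE {source}: {signal} (severity {severity}, fired {seen}x)")
```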
AIOps and Turbonomic Application Resource Management
IBM Turbonomic Application Resource Management (Turbonomic) is an application resource management platform that agentlessly gathers instrumentation data from the applications down to the infrastructure in order to make decisions about how to scale, provision, de-provision, and optimize resources for performance, cost, and availability.
The most important objective of IT infrastructure is to provide applications with the resources they need to deliver their service levels. A companion objective is to do it as cost-efficiently as possible and adapt to changing environment and application demand scenarios by dynamically adjusting resources over time. Key capabilities of an application resource management platform include:
- Application-aware optimization
- Support for on-premises, hybrid, and multicloud deployments
- Full-stack visibility and control across the entire environment
- Trustworthy and automatable actions
- Enforcement of business policy and compliance through abstraction, analytics, and automation
Turbonomic also integrates simply into your application and infrastructure lifecycle, so that these actionable decisions become a seamless part of your day-to-day operations.
In the following figure, you can see how Turbonomic’s integrations with full-stack technology platforms allow it to pull in analytics and make decisions: it automatically maps resource dependencies and application demand, then matches resources to assure performance and reduce unnecessary costs.
The automation layer then interacts with the target systems to take actions such as scaling CPU, memory, storage, networks, IaaS, and Kubernetes containers, pods, and nodes, and much more.
These decisions are not only used for your real-time environment, but also used to provide intelligent, data-driven guidance about how to architect your environment for the future.
A primary challenge for large enterprises is the exponential complexity inherent in modern applications built on microservices and deployed on modern containerized multicloud infrastructure.
Enterprises are rapidly adopting AIOps, with machine learning embedded in many of their monitoring and management systems, including application performance management (APM), service management, infrastructure as code, and configuration management. Turbonomic serves as a control plane for this modern application hosting platform, tying those systems together and scaling to millions of managed elements in a single instance.
As AIOps technology continues to evolve in independent tools, Turbonomic learns about the changes through its integrations and incorporates that data into the Turbonomic AI engine’s decision process, enabling large enterprise development and IT organizations to grow along with their technology investments.
Turbonomic makes sure that applications get the resources they need, when they need them, continuously and automatically. By understanding the demand of applications and matching those resource needs to the available infrastructure, Turbonomic assures the performance of applications, and it does so as efficiently as possible at the lowest cost. It also maintains your important business and IT compliance requirements, such as application availability, SLO compliance, data and application locality, and operational cost management.
First, Turbonomic does automatic, agentless discovery of your applications and infrastructure by simply targeting the management endpoints, such as your APM platform (such as Instana or Dynatrace), your container platform (such as Red Hat OpenShift, Kubernetes, cloud-based EKS (Elastic Kubernetes Service), AKS (Azure Kubernetes Service), GKE (Google Kubernetes Engine), or VMware Tanzu), your cloud provider (Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform), and your physical and virtual compute and storage platforms.
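Targets like these can be registered through the UI or the REST API. Here is a hedged Python sketch of registering an AWS target; the hostname is made up, and while the /api/v3/login and /api/v3/targets endpoints follow the documented v3 API shape, confirm the exact target types and input fields against the Swagger docs in your own instance:

```python
import requests

# Hypothetical Turbonomic instance; replace with your own hostname.
BASE = "https://turbonomic.example.com/api/v3"

session = requests.Session()
session.verify = False  # sandbox convenience only; keep TLS verification on in production

# Authenticate with form-encoded credentials (the v3 login pattern).
session.post(f"{BASE}/login", data={"username": "administrator", "password": "..."})

# Register an AWS discovery target. Field names are illustrative assumptions.
target = {
    "category": "Cloud Management",
    "type": "AWS",
    "inputFields": [
        {"name": "address",  "value": "aws.amazon.com"},
        {"name": "username", "value": "ACCESS_KEY_ID"},
        {"name": "password", "value": "SECRET_ACCESS_KEY"},
    ],
}
resp = session.post(f"{BASE}/targets", json=target)
resp.raise_for_status()
print("Target registered:", resp.json().get("displayName"))
```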
Next, the application and infrastructure analytics are used by the Turbonomic AI platform to discover what application and infrastructure resources are needed and available, from the application all the way down through the physical, virtual, containerized, and cloud stack.
These analytics are used to generate actions, such as placing, scaling, rescheduling, provisioning, and de-provisioning resources, to assure performance without unnecessary waste and spending.
Turbonomic use cases
General AIOps use cases include activities like:
- Real-time anomaly detection for risk mitigation and problem avoidance
- Faster root cause analysis through event correlation when problems do occur
- Low-priority alert suppression, so high-priority alerts get better visibility
- Capacity planning and management, based upon predictive analytics
- IT service management automation
Let’s consider the specific operational use cases that are the most common among new Turbonomic environments:
- AppOps: Why is my application having issues?
- ContainerOps: How is my container cluster doing?
- CloudOps: How do I pick which compute and storage type to use?
AppOps: Why is my application having issues?
The core question we get when the service desk calls is “Why is my application having issues?” Not “Which server is failing?” or “Which transaction flow is slow?”, but the simple top-level question: “What’s wrong with the RobotShop app?”
Application developers might have enabled application performance management (APM) on their application, which provides an observability lens to help analyze application issues:
While this is helpful, it does not always reveal the true source of an issue, which usually lies outside of the code. Non-application issues often show up as general latency, because there is no ability to trace them further down the resource layers into the infrastructure itself.
Let’s take the same application: it is not generating application-specific issues in our APM view, but Turbonomic shows resource challenges that are causing performance problems.
Clicking through the application and infrastructure tiers shows you the various components and their dependent resources and health.
But, most importantly, you can go one step further than the visibility and observability features by clicking the Actions view for any part of the application to see the actions that need to be taken to bring the application back to a healthy state:
In just this one business application, application developers can take numerous, dynamically changing actions to assure performance and to make sure the application runs as efficiently as possible, so there is no runaway spending on cloud and Kubernetes node costs.
The actions you see also consider all the other parts of the environment, including applications, containers, pods, nodes, and the cloud instances or virtualization and bare metal resources.
This is important because you might think you need to add memory or double the millicores assigned to a container, but that change will impact other applications on the same infrastructure. The value of this system-level optimization and system-level automation is that you can ensure that every application is operating with optimal performance.
Recommended actions are the default setting, but once you switch to manual mode you also have the option to take the actions right in the UI. In this sandbox environment, every user is currently in read-only (observer) mode, which is why the Execute Action button is greyed out.
Each action provides utilization data and a before-action and an after-action view to show what the result will be. These actions can include scaling, provisioning, deprovisioning, placement, and reserved capacity purchasing:
Every decision is made based on optimizing CPU, memory, storage, networking, and the cost of the workload. Actions are prioritized based on severity.
Actions can be taken in Turbonomic through the UI or through a fully RESTful API, and nearly every action can also be automated. Automation of actions is done without any agents: Turbonomic uses the native platform APIs to take the action and update resources. Turbonomic APIs are fully documented in Swagger, which you can view in the Turbonomic sandbox.
For example, a simple call can return actions, and requests can be parameterized in a POST body. Or, you can query by entity and include this in CI/CD processes or other workflows, because every action on an entity is accessible via the API.
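As a minimal sketch of that pattern, the following Python snippet pulls the pending resize actions for a single entity. The entity UUID and hostname are hypothetical, and the filter fields follow the shape of the v3 ActionApiInputDTO; verify the details against the Swagger docs in your own deployment:

```python
import requests

BASE = "https://turbonomic.example.com/api/v3"  # hypothetical instance

session = requests.Session()
session.post(f"{BASE}/login", data={"username": "administrator", "password": "..."})

entity_uuid = "73f3948b1229"  # hypothetical UUID, e.g. from a prior /search call

# Parameterize the query in the POST body: only resize actions that are ready.
action_filter = {"actionTypeList": ["RESIZE"], "actionStateList": ["READY"]}
resp = session.post(f"{BASE}/entities/{entity_uuid}/actions", json=action_filter)
resp.raise_for_status()

# Each action carries the detail needed to review, approve, or automate it.
for action in resp.json():
    print(action.get("actionType"), "-", action.get("details"))
```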
Each action has a pre-action, an action, and a post-action interaction opportunity to easily integrate with.
There is also out-of-the-box integration with ServiceNow, so you can have tickets generated and closed automatically and included in your business approval workflows. This means your AppOps and ITOps teams can use their existing ServiceNow workflows without ever having to log in to Turbonomic or to your cloud or on-premises infrastructure provider.
With decision automation, your application sizing and resourcing are optimized, and as an application developer you remove the guesswork of how to allocate CPU, memory, storage, and network resources.
As your application usage changes over time, resources can adjust automatically, optimizing against real-time requirements and an understanding of historical peaks, averages, and percentile utilization. The end result is better performance and reduced waste, which means real per-minute savings in the cloud.
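To make the percentile idea concrete, here is a small Python sketch of sizing to a 95th percentile rather than an absolute peak. The synthetic utilization data and the 20% headroom policy are illustrative assumptions, not Turbonomic’s actual algorithm:

```python
import random
import statistics

# Simulate a month of hourly CPU utilization samples (percent of allocation):
# mostly moderate load plus a handful of short spikes. Purely synthetic data.
random.seed(42)
samples = [min(100, max(0, random.gauss(30, 10))) for _ in range(720)]
samples += [random.uniform(75, 95) for _ in range(10)]  # rare peak hours

p95 = sorted(samples)[int(0.95 * len(samples))]  # nearest-rank 95th percentile
peak = max(samples)
avg = statistics.mean(samples)

# Sizing to the 95th percentile plus headroom, instead of the absolute peak,
# avoids paying for capacity that is needed less than 5% of the time.
recommended = p95 * 1.2  # 20% headroom is an assumed policy for illustration
print(f"avg={avg:.0f}% peak={peak:.0f}% p95={p95:.0f}% "
      f"-> size to ~{recommended:.0f}% of the current CPU allocation")
```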
ContainerOps: How is my cluster doing?
Container clusters are operationally complex. While Red Hat OpenShift greatly reduces that complexity by making application deployments simple, issues will still occur with things like:
- Container sizing issues – memory and CPU allocations that don’t match application utilization, because we often guess on the high side (see the sizing sketch after this list)
- Pod placement issues – the initial placement may have been satisfactory, but the environment changed and the pod has not crossed a threshold that would cause it to be rescheduled
- Node utilization issues – nodes are usually sized too large in the hope of leaving enough resources for peaks and growth, which creates unnecessary and costly overhead
- CPU throttling – is CPU scheduling causing throttling and queuing in the compute layer that impacts performance?
- Cluster sizing – what is the optimal size of the cluster to support my applications without unnecessary waste and costs?
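For context on the first item, a container resize ultimately amounts to the kind of change sketched below with the official Kubernetes Python client. The deployment, namespace, and resource values are hypothetical; Turbonomic computes and executes such changes automatically rather than requiring a script like this:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (in-cluster config also works).
config.load_kube_config()
apps = client.AppsV1Api()

# Resize the hypothetical "cart" container to match observed utilization
# instead of a guessed-high allocation.
patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "cart",
        "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},  # match real demand
            "limits":   {"cpu": "500m", "memory": "512Mi"},  # headroom, not waste
        },
    }]}}}
}
apps.patch_namespaced_deployment(name="cart", namespace="robot-shop", body=patch)
```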
Understanding the operational health of the Kubernetes hosting environment is critical to the success of the applications and the IT organization. Today, many teams still apply classic practices, such as quotas and manually managed scaling thresholds, which creates extra manual work and is not adaptive to the changing needs of the applications.
Kubernetes endpoints can be OpenShift, managed Kubernetes on cloud providers (GKE (Google Kubernetes Engine), EKS (AWS Elastic Kubernetes Service), AKS (Azure Kubernetes Service)), or any upstream Kubernetes implementation.
Views can be from the cluster, namespace, node pool, node, pod, container, or application. Turbonomic is automatically stitching together the relationships from the applications down through to the underlying node for you.
Narrowing to a single cluster shows all resources operating within the cluster and where the cluster is located, which in this case is OpenShift Container Platform running in an AWS environment.
Actions are available to optimize, size, and provision containers and container pods, and to allocate namespace and node resources. This optimization creates truly elastic Kubernetes environments and gets rid of the rigid methods of assigning quotas and limits that simply do not account for real-time changes in application needs.
As applications change in utilization and demand, you see both vertical and horizontal scaling.
Another powerful capability is preventive actions, such as rescheduling a pod to a new node to assure performance without causing contention on the node. The normal behavior of Kubernetes is to wait for the app to fail and be rescheduled or, worse, for the node to suffer and suddenly reschedule multiple pods to new cluster nodes. By actively rescheduling resources to prevent failure, you ensure better performance and health of the overall environment.
CPU throttling is incredibly difficult to track down as applications dynamically change, but it can cause real performance challenges for every application running on the container node. Actions that identify throttling as a constraint are presented (see the following figure) to make changes that relieve that congestion:
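Under the hood, throttling shows up in the kernel’s CFS statistics, which cAdvisor exposes as the standard container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total metrics. Here is a hedged Python sketch of spotting throttled pods from a Prometheus endpoint; the Prometheus URL, namespace, and 25% threshold are assumptions:

```python
import requests

PROM = "http://prometheus.example.com:9090/api/v1/query"  # hypothetical endpoint

# Fraction of CFS scheduling periods in which each container was throttled.
query = (
    'rate(container_cpu_cfs_throttled_periods_total{namespace="robot-shop"}[5m])'
    ' / rate(container_cpu_cfs_periods_total{namespace="robot-shop"}[5m])'
)

result = requests.get(PROM, params={"query": query}).json()["data"]["result"]
for series in result:
    pod = series["metric"].get("pod", "unknown")
    throttled = float(series["value"][1])
    if throttled > 0.25:  # assumed threshold: throttled in >25% of periods
        print(f"{pod}: throttled {throttled:.0%} of the time; consider raising CPU limits")
```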
Managing the Kubernetes cluster is also part of Turbonomic’s capabilities, providing more intelligent decisions for sizing and scaling the underlying Kubernetes nodes. This leads to optimized node infrastructure that delivers better performance without unnecessary overhead, because Turbonomic can dynamically scale from the container down to the node preventively.
ITOps teams can run Kubernetes at higher utilization because Turbonomic can scale the Kubernetes node layer when needed.
Scaling up happens as demand increases, but scaling down also happens as demand is reduced, cutting the unnecessary cost of running under-utilized Kubernetes infrastructure.
These same optimization and utilization practices and capabilities are also available for any on-premises virtualization platform and for Kubernetes on bare metal.
CloudOps: How do I pick which compute and storage type to use?
Dozens upon dozens of options exist for cloud compute, cloud storage, and cloud DBaaS configurations, which turns into millions of combinations. That makes the simple question of “What IaaS and DBaaS resources are the optimal choice for my cloud application workloads?” surprisingly hard to answer.
Choosing the optimal compute resources (memory and CPU) is enough of a challenge, but then you must also choose the compute instances that support the optimal storage capacity and throughput needed.
Human nature has us guess on the high side. This estimation usually sizes for peaks or potential usage, but it leaves both performance and money wasted. The major advantage of on-demand cloud resources is being able to change at any time.
In Turbonomic, you start in the Cloud view, which lets you see all your accounts and the top-level actions you can take to both improve performance and reduce cloud costs by optimizing your application workloads across IaaS and relational database PaaS.
You can choose any part of the application environment on the left-hand side, or begin with any cloud (Amazon Web Services, Microsoft Azure, Google Cloud Platform) in the Top Accounts view.
You can start at any account or subscription, or drill into the resource groups. The important piece for CloudOps teams and cloud application developers is the actions that can be taken to assure performance, health, and cost optimization simultaneously.
It’s important that you, as the app developer or CloudOps team, have the detail needed to understand why an action needs to be taken and the result once you take it.
Actions can be taken on-demand or automated at any context layer (application, instance, container, group, zone, region, and so on).
You can also schedule when actions can occur to ensure that you maintain application availability and compliance at all times.
Your actions also include buying and optimizing reserved capacity: by best matching which workloads can leverage existing reserved instance (RI) capacity, or purchasing new capacity, you can greatly reduce long-term costs without impacting performance.
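The underlying trade-off is easy to state: reserved capacity is billed whether or not the instance runs, so it only pays off for workloads that run enough of the time. A back-of-the-envelope Python sketch with made-up prices:

```python
# Back-of-the-envelope RI break-even calculation. All prices are hypothetical.
ON_DEMAND_HOURLY = 0.192     # $/hour for an assumed instance type
RI_EFFECTIVE_HOURLY = 0.121  # $/hour for a 1-year reservation, amortized
HOURS_PER_MONTH = 730

# An RI is billed for every hour regardless of use, so its monthly cost is flat.
ri_monthly = RI_EFFECTIVE_HOURLY * HOURS_PER_MONTH

for utilization in (0.3, 0.6, 0.9):  # fraction of the month the workload runs
    od_monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH * utilization
    winner = "RI" if ri_monthly < od_monthly else "on-demand"
    print(f"utilization {utilization:.0%}: on-demand ${od_monthly:.0f}/mo "
          f"vs RI ${ri_monthly:.0f}/mo -> {winner}")
```

With these assumed prices, the break-even sits at roughly 63% utilization, which is why matching workloads to existing RI capacity matters as much as the purchase itself.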
Conclusion
There are many more use cases, especially in hybrid deployments, that we can explore. Operations teams, development teams, and the next generation of roles (such as Site Reliability Engineer (SRE), DevOps Engineer, and Platform Engineer) are changing the way that we manage applications and infrastructure.
By having an adaptive and infrastructure-agnostic AIOps and automation platform like Turbonomic, teams can reduce the time and cost wasted on day-to-day operations. The ultimate goal for all of these use cases is to move towards truly self-driving operations.
If you want to find out more and try the complete self-service sandbox environment, go to https://turbonomic.com/try to get your own account in a live SaaS deployment of Turbonomic and walk through some use cases and the general user experience.