Observability, insights, and automation

Observability is the extent to which operations teams can understand the internal state or condition of a complex system based only on knowledge of its external outputs. A highly observable system (an application or service) is therefore one that externalizes enough data for teams to understand how it is executing and what its current state is. You can read more about observability in the IBM Cloud Learn Hub.

Because developers are increasingly responsible for more of the application lifecycle thanks to modern DevOps best practices, teams must instrument their systems to be highly observable.

The self-instrumentation approach does, however, have limitations. It works well for new projects, but for existing deployments and solutions, instrumentation must be retrofitted onto potentially very old codebases. Teams can also only self-instrument deployments that they own and have source code access to, which means it can be difficult or impossible to instrument dependencies or third-party packages. And for external services, such as databases, teams are reliant on the providers of those services to make them observable.

IBM Observability with Instana, a software observability platform, solves this problem by providing auto-instrumentation for a range of programming languages and runtimes, including Java and .NET, so teams always have rich observability for their deployments; it also provides configuration and code SDKs for adding further data. Instana supports open source instrumentation, such as the open source data collection project OpenTelemetry, allowing development teams to define the observability they need for their applications. Instana also supports over 260 additional third-party technologies and services, providing end-to-end observability from mobile to mainframe and across a wide range of technologies.
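
For the instrumentation that teams add themselves, a common route is OpenTelemetry. The following is a minimal sketch of creating a custom span with the OpenTelemetry Python SDK; the service name, function, and console exporter are illustrative assumptions, and a real deployment would export to a collector or observability backend instead.

```python
# Minimal OpenTelemetry tracing sketch; names and the console exporter are
# illustrative only, not a specific Instana configuration.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def process_order(order_id: str) -> None:
    # Each call produces one span describing this unit of work.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic would run here ...

process_order("A-1001")
```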

Instana technology support and capabilities

The three pillars of observability

Metrics, logs, and distributed traces are often referred to as the three pillars of observability. Each pillar provides a different type of external output for a system, as the short sketch after this list illustrates:

  • Metrics provide a continuous point-in-time view of the system.
  • Logs provide a view of events and errors occurring in the system.
  • Traces provide a per-request or per-transaction view of the system.
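
As a toy illustration of the three output types, the following sketch (standard library only, with hypothetical names) has a single request increment a counter that would feed a metric, write a log line describing an event, and emit a simple per-request trace record.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("payments")

request_count = 0  # metric: an aggregate counter a collector would scrape periodically

def handle_request() -> None:
    global request_count
    trace_id = uuid.uuid4().hex            # an ID that identifies this one request
    start = time.monotonic()
    request_count += 1                     # metric pillar: point-in-time, aggregate numbers
    log.info("trace=%s handling payment request", trace_id)   # log pillar: a discrete event
    # ... business logic would run here ...
    duration_ms = (time.monotonic() - start) * 1000
    # trace pillar: a per-request record of what ran and how long it took
    log.info("trace=%s span=handle_request duration_ms=%.2f", trace_id, duration_ms)

handle_request()
```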

The three pillars of observability

Multiple open source projects provide implementations of the pillars, with each project largely aligned to one pillar: for example, Prometheus for metrics, OpenTracing for tracing, and the Elasticsearch-Logstash-Kibana (ELK) stack for logging. Those projects are often combined, and groups such as the observability teams at Twitter and Netflix have described how they use these pillars and open source projects to make their systems observable.

These three pillars of observability provide vital sources of data, or telemetry, for the system and for the individual requests it handles. However, other types of data can also be invaluable in understanding the internal state of a system. End-user monitoring extends observability to front-end clients, such as web or mobile applications, capturing the activity taking place there and how those applications interact with back-end services and IT systems. Profiling provides deep, code-level data that gives a continuous point-in-time view of the system's execution and can be used to understand its resource usage. Instana provides both.

To provide a highly observable system, you need correlation across each of the pillars of telemetry so that the data can be used in context. While the term “pillars” implies that they are siloed, there is significant value in having correlation and context between metrics, traces, and logs: for example, being able to understand the metrics or logs that occur inside the scope of a specific request, or being able to aggregate and roll up multiple requests to generate metrics.
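
One generic way to create that context, sketched below with the OpenTelemetry Python API and the standard logging module (this is an illustration, not Instana's specific mechanism), is to stamp every log record with the ID of the trace that was active when it was emitted.

```python
# Correlating logs with traces: attach the current trace ID to each log record.
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # Render the 128-bit trace ID as hex, or "-" when no span is active.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)

# Any log line emitted while a span is active now carries that span's trace ID,
# so a backend can show exactly which logs occurred in the scope of a request.
```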

Correlation and context between the three pillars of observability

The value of the pillars is limited if the data is collected by separate technologies and viewed in different dashboards. For a highly observable system, all telemetry must be correlated, aggregated, and viewed together. Additionally, any given component in an IT system is affected by the other components and systems around it, so its telemetry needs to be viewed in the context of the wider environment.

The observability pyramid

Any IT system, component, or application has a complex web of other systems, components, and applications that it is dependent on, making it impossible to understand a specific system, component, or application without also understanding its dependencies and the holistic environment that surrounds it. Additionally, different roles in any organization require different views of the overall system.

These views generally form at one of the four layers of the observability pyramid, with each layer of primary interest to a different persona. Understanding a specific layer requires understanding the context of the system and the environment as a whole.

The four layers of the observability pyramid

For example, if an IT infrastructure administrator wanted to understand the resource usage of the infrastructure, they would need to understand the platforms the infrastructure supports, along with all of the processes deployed onto those platforms, the services and applications that those processes provide, and, ultimately, the business processes those services and applications support.

Similarly, if the SRE for a microservice wanted to understand the slow performance of a REST request, they would need to understand the resources, such as CPU and memory, available to the microservice’s process. Those resources depend on what the platform and infrastructure provide, which in turn can be affected by other processes on the same platform or infrastructure competing for CPU and memory.

Observability platforms like Instana provide the required context by creating and maintaining a dynamic graph: an always up-to-date graph of the relationships and dependencies between the layers of the observability pyramid.

The dynamic graph of vertical relationships across the four layers of the observability pyramid

This dynamic graph maintains a constant understanding of the vertical relationship between a business application or process, the replicas or instances of an application on which it executes, where those replicas are deployed, and the underlying infrastructure on which they run. The dynamic graph also understands and maintains the horizontal relationships between any given replica or instance of an application and its upstream and downstream dependencies, such as the other IT systems or components that make requests or calls to it, or that it makes requests to.
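
The following sketch is an illustrative, heavily simplified model of such a graph, not Instana's internal data structure: vertical "runs on" edges link an application instance to its process and host, and horizontal "calls" edges link it to its downstream dependencies. All entity names are hypothetical.

```python
from collections import defaultdict

# Vertical relationships: application instance -> process -> infrastructure host.
runs_on = {
    "orders-app-1": "orders-process-1",
    "orders-process-1": "worker-node-3",
}
# Horizontal relationships: a caller and its downstream dependencies.
calls = defaultdict(list, {"checkout-app-1": ["orders-app-1"]})

def context_of(entity: str) -> dict:
    """Return everything this entity runs on plus everything it calls."""
    current, vertical = entity, []
    while current in runs_on:          # walk the vertical relationship downwards
        current = runs_on[current]
        vertical.append(current)
    return {"runs_on": vertical, "calls": calls[entity]}

print(context_of("orders-app-1"))
# {'runs_on': ['orders-process-1', 'worker-node-3'], 'calls': []}
```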

IT systems can therefore be observed not just with a rich set of data but also in the context of their wider environment. This contextualization becomes even more valuable when we move from observing a system to analyzing it, deriving insights from it, and understanding the root cause of issues such as errors, poor performance, or loss of availability.

Analysis and insights

A rich set of telemetry that is fully contextualized achieves the goal of observability, making it possible to understand the internal state of a system. However, working with the volume of data and the rich information that a highly observable system provides leads to cognitive overload for the individuals and operators observing it. They receive more information than they can comfortably handle, leaving them unable to process and comprehend the data and make appropriate decisions based on it.

Complex dashboards with large numbers of graphs, charts, and metrics aren’t enough to help operators. To make the volume of data manageable, it needs to be distilled down to a set of meaningful indicators that operators observing the system can focus on. These typically take two forms: signals and events (or alerts).

The four golden signals

The four golden signals are the most valuable metrics for any service that exposes an interface, whether that interface is used by other services or by end users. The four signals, illustrated in the sketch after this list, are:

  • Latency — How long it takes to handle or service a request against the interface
  • Calls or traffic — Volume of traffic against the interface, such as requests per second
  • Errors — Rate at which requests against the interface result in an error
  • Saturation — Current utilization of the service, typically as a percentage of its total capacity
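
As one hedged illustration of how these signals can be derived, the sketch below computes all four from a window of request records; the record fields, window length, and capacity figure are assumptions for the example, not values prescribed by Instana.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    duration_ms: float
    error: bool

WINDOW_SECONDS = 60     # length of the observation window (assumed)
CAPACITY_RPS = 500.0    # assumed maximum requests/second the service can handle

def golden_signals(records: list[RequestRecord]) -> dict:
    calls = len(records)
    errors = sum(1 for r in records if r.error)
    latencies = sorted(r.duration_ms for r in records) or [0.0]
    p95_latency = latencies[int(0.95 * (len(latencies) - 1))]
    rps = calls / WINDOW_SECONDS
    return {
        "latency_p95_ms": p95_latency,                   # latency
        "calls_per_second": rps,                         # calls or traffic
        "error_rate": errors / calls if calls else 0.0,  # errors
        "saturation": rps / CAPACITY_RPS,                # saturation
    }
```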

Instana supports the golden signals, presenting the first three of them, the user-facing signals, as the primary view for each service or configured application. This makes it easy to see and understand the quality of service being provided to the clients of that service or application. Instana also provides saturation information as a measure of resource allocation and utilization for the services.

By using the four golden signals, observability platforms like Instana reduce the amount of data an operator needs to focus on and simplify the view, but the signals do not necessarily summarize the entire state of the system. Additionally, they still require the constant attention of an operator to determine whether one of the signals is highlighting a problem or projecting that a problem may soon occur.

To continue the analysis, operators need events and alerts. By applying a set of conditions to the data, the system can raise events when the data doesn’t meet those conditions and, ultimately, alert the operator to the events that are of high importance.

Conditions, events, and alerts

At the most basic level, conditions can be applied as a set of rules to detect situations that are potentially or definitely problematic. An example set of conditions might be to look at the available disk space on a given infrastructure host and create a “Warning”-level event when disk space is low, and a “Critical”-level event when disk space is exhausted.
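
A minimal sketch of that rule is shown below; the specific thresholds are assumptions for illustration, not Instana's built-in defaults.

```python
from typing import Optional

def disk_space_event(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Evaluate the disk-space condition and return an event severity, if any."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct <= 1.0:
        return "Critical: disk space exhausted"
    if free_pct <= 10.0:
        return "Warning: disk space low"
    return None   # condition not met, no event is raised
```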

Essentially, systems can use automation to watch the wide range of telemetry, reducing the cognitive load on the operator and ensuring that problematic conditions don’t get missed. Instana provides a built-in event-rules capability, which is pre-populated with a knowledge base of over 260 conditions, as well as the ability for operators to create and apply custom rules to create new events.

Instana also provides the ability to set smart alerts on the four golden signals. These alerts can use either static or dynamic thresholds for each of the signals; a dynamic threshold looks at the past history of the signal and is designed to detect anomalies or deviations from the service’s typical behavior. Additionally, Instana provides the ability to set and monitor service-level indicators (SLIs), error budgets, and service-level objectives (SLOs) on the four golden signals so that SREs can set availability and performance goals and be alerted on them.
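
As a rough illustration of the dynamic-threshold idea (this is not Instana's actual algorithm), the sketch below flags a data point as anomalous when it deviates from the signal's recent history by more than a few standard deviations.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, k: float = 3.0) -> bool:
    """Flag values more than k standard deviations away from the recent mean."""
    if len(history) < 2:
        return False                      # not enough history for a baseline
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * max(sigma, 1e-9)

# Latency history in milliseconds, followed by a sudden spike.
print(is_anomalous([110, 120, 115, 118, 112, 117], 480))  # True
```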

Event storms, event grouping, and root-cause analysis

Moving to signals, conditions, and events greatly reduces the volume of data that operators must understand and interpret. However, the data often needs to be distilled further.

When a fault occurs in an IT system, the highly connected nature of the environment often means that the fault manifests in several ways, causing multiple conditions to be triggered and a storm of events to be generated.

When an underlying infrastructure node fails and restarts, for example, conditions are likely to be triggered not just for the node itself but also for every process running on it and every instance of every service that those processes provide. Conditions are also likely to be triggered for any upstream service that calls the affected services. As a result, a large number of events can be generated in a short period of time, with a single event representing the origin of the fault and the others generated by the effects of that original fault.

When these event storms occur, it’s vital to be able to group together events that relate to the same underlying fault and to identify which event and fault is the root cause versus those that are downstream effects.

Instana can group related events together and carry out root-cause analysis, using its dynamic graph to correlate the events and build a timeline that identifies the underlying root cause.
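
The sketch below shows one simple root-cause heuristic over a dependency graph; it is illustrative only and not Instana's actual analysis. Given a burst of events, each attached to an entity, entities connected in the graph are treated as one incident, and the root-cause candidate is the affected entity that does not depend on any other affected entity.

```python
# "X depends on Y" edges; the topology is an assumed example.
depends_on = {
    "checkout-service": ["orders-service"],
    "orders-service": ["orders-process"],
    "orders-process": ["worker-node-3"],
}

def root_cause_candidate(event_entities: set[str]) -> str:
    # The candidate is the furthest-downstream affected entity: the one whose
    # own dependencies are all healthy (or that has no recorded dependencies).
    for entity in event_entities:
        if not any(dep in event_entities for dep in depends_on.get(entity, [])):
            return entity
    return next(iter(event_entities))

storm = {"checkout-service", "orders-service", "orders-process", "worker-node-3"}
print(root_cause_candidate(storm))   # worker-node-3
```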

Alert channels

This use of conditions, events, grouping, and root-cause analysis means that operators can be alerted when issues require their attention, along with information on where to focus. Those alerts need to reach the operator directly so that they don’t have to spend their time watching dashboards for events to appear. Instana supports propagating alerts and events to a range of systems (a generic forwarding sketch follows the list), which can be used to:

  • Notify the operator by using systems like Slack, Microsoft Teams, Google Chat, or a notification system like PagerDuty
  • Integrate with incident-management systems such as IBM Netcool Operations Insights
  • Carry out further analysis and trigger automated handling of the alert by using AI-based automated IT operations solutions like IBM Cloud Pak for Watson AIOps
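
As a generic sketch of the forwarding step (not Instana's built-in integrations, and with a placeholder webhook URL), an alert can be pushed to a chat or notification tool that accepts an incoming webhook:

```python
import json
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/alerts"   # placeholder endpoint

def notify(alert_title: str, severity: str) -> None:
    """POST a short alert message to an incoming-webhook endpoint."""
    payload = json.dumps({"text": f"[{severity}] {alert_title}"}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)   # fire-and-forget notification

# notify("Error rate above SLO for checkout-service", "CRITICAL")
```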

Action and automation

Comprehensive observability for IT systems, together with an observability layer that can automatically detect and identify faults in near real time, especially when those faults impact external interfaces and users, is becoming increasingly important to enterprises. However, detection and identification represent only the first part of the incident-resolution process.

Ops teams also need to diagnose the issue to understand why the component that was the root cause of the incident developed a fault, repair the diagnosed fault to restore service, and resolve the underlying issue to ensure that the same fault does not occur again. These tasks are often lengthy, and dramatically shortening them requires accelerating the IT service management (ITSM) process, which a solution like IBM Cloud Pak for Watson AIOps can provide.

IBM Cloud Pak for Watson AIOps can take the alert generated by Instana and launch a collaborative engagement across the various parties involved in resolving IT incidents, including SREs, IT operations, DevOps, and developers, using ChatOps. It can also recommend best actions to the engaged team, including executing runbooks to carry out diagnostics and analysis and to repair and restore the affected components, drastically shortening the overall incident-resolution process. The following figure shows the improvements to mean time to diagnose and mean time to repair when using Instana and IBM Cloud Pak for Watson AIOps together, reducing the duration of an IT incident from hours to minutes and arming SREs and operations teams with tools to assist with, or even fully automate, the incident-resolution process.

Improvements to mean time to diagnose and mean time to repair using IBM Cloud Pak for Watson AIOps

Conclusion

Vastly reducing the elapsed time of the IT service management incident process, and therefore the impact on clients, requires comprehensive observability. With comprehensive observability, operations teams can:

  • Understand the underlying state of the systems
  • Perform analyses and discover insights to detect faults and isolate them to root causes
  • Provide automation to rapidly carry out diagnoses and repair

The combination of Instana and IBM Cloud Pak for Watson AIOps provides that end-to-end set of capabilities, making it possible to automate large parts of the incident-management process, reduce costs, and improve uptime and availability for your deployments.