
Observability-driven development

Simplify and streamline DevOps work with Instana, a full-stack observability platform.

By

Bright Zheng

Observability is the understanding of the internal state or condition of a system based on the knowledge of its external outputs. Gartner suggests that "observability is the evolution of monitoring into a process that offers insight into digital business applications, speeds innovation and enhances customer experience." Observability has certainly evolved from monitoring, but it takes a big step forward: based on the telemetry data, monitoring tells you what’s wrong, whereas observability tells you why.

Observability vs. monitoring is not an either-or proposition. If we compare application performance monitoring (APM) to an iceberg, monitoring shows you the obvious signs above the water, while observability dives deep below the surface to give you a complete picture of why the application is performing a certain way. That picture comes with complete context, supported by dynamically correlated telemetry data and insights, so that developers (or SREs or operators) can take intelligent actions immediately and with confidence.

While the value of observability might be obvious, adding observability into your applications can sound like a heavy task.

In this article, I explore observability from an application developer perspective, focusing on what challenges developers might be facing. I also show how we can simplify and streamline the work, with an enterprise-grade full-stack observability platform, like IBM Instana Observability (or Instana for short), which is a key product in IBM’s AIOps platform.

As a fast-growing APM leader, Instana provides these observability capabilities:

  • No significant code change is required in the apps, even the legacy ones, which can be instrumented automatically by Instana’s AutoTrace technology. Instana also provides broad support for open source frameworks and standards (such as OpenTelemetry) for applications that are manually instrumented or that provide additional custom data.
  • An end-to-end pipeline can be expected: collect, ingest, store, and process data, and then identify the desired insights.
  • Full-stack observability from a top-down approach for websites, mobile apps, backend applications, platforms, and infrastructure.
  • Dashboards with built-in golden signals for each supported technology, with extensive Site Reliability Engineering (SRE) knowledge and expertise included by default.
  • An intuitive approach to performing root cause analysis that leads to dramatically reduced mean time to repair (MTTR).

Enabling observability in apps

Is it a big deal to bring observability into our applications? Developers will almost always say yes. But why is that so?

If you’re a developer, and you’re tasked to bring observability into a Java application, you need to consider:

  • What the application stack is, such as Spring Boot, JPA, and more.
  • Which frameworks or libraries to use for logging (such as Log4j), metrics (such as Micrometer), and tracing (such as OpenTelemetry, Jaeger, or MicroProfile).
  • Where to instrument your code, and why.
  • Which tool to use to collect the observability data (metrics, logs, and traces), and which backend server to send the data to.
  • How to visualize and use all of this data to improve your app.
  • How to bring observability to third-party frameworks or libraries that you’re not able, or not willing, to instrument or change.
  • How to bring services outside the application stack (such as messaging or databases) into the observability landscape.

If you’re a polyglot developer, you might also be tasked with rolling observability out to other applications that are written in Golang, Node.js, Python, or others.

Observability might seem to benefit operations teams more than development teams. However, as more teams implement DevOps practices, and now, one step further, DevSecOps practices, the boundary between “Dev” and “Ops” is blurring. Observability brings as much value to developers as it does to operations teams, as more developers share the same service level objectives (SLOs) of maintaining the high availability, resiliency, and operability of our apps.

So, how much additional work will be introduced into our development processes? Simply put, how much code change is needed? Nothing really comes for free, but fortunately existing standards and open source communities have been working together to address many observability issues, one by one and language by language. Let’s dig a little deeper into the Java domain and explore enabling observability in our apps with the three observability pillars: metrics, tracing, and logging.

Metrics

For metrics, numerous trusted frameworks exist, and we can often make configuration changes, instead of direct code changes, to properly generate and expose metrics.

For example, if we’re using Spring Boot with Maven, we can start with these dependencies, among others, in our pom.xml file:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

We then include Micrometer’s Prometheus registry (Micrometer acts as a facade for metrics, much like SLF4J does for logging):

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <scope>runtime</scope>
</dependency>

Then, we need to update the endpoint exposure declaration in the application.properties file (or application.yml, if you prefer the YAML format) with something like this:

management.endpoints.web.exposure.include=health,info,prometheus

After we make this update, we can access the /actuator/prometheus endpoint of our application and see Prometheus-formatted metrics covering the JVM, embedded middleware (Tomcat by default), executor pools, the process, garbage collection (GC), and more. Many tools can scrape such a Prometheus endpoint and carry the data forward for metrics-based monitoring.
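To make that endpoint's output concrete, here is a small, self-contained sketch (plain JDK only, no Micrometer) that renders a couple of gauges in the Prometheus text exposition format. The metric names and values are illustrative; in a real Spring Boot app, Micrometer generates this output for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy sketch of the Prometheus text exposition format that the
// /actuator/prometheus endpoint returns. In a real Spring Boot app,
// Micrometer renders this for you; names and values here are illustrative.
public class PrometheusExpositionDemo {

    // Render each gauge as a "# TYPE" comment line followed by "name value".
    static String render(Map<String, Double> gauges) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : gauges.entrySet()) {
            sb.append("# TYPE ").append(e.getKey()).append(" gauge\n")
              .append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> gauges = new LinkedHashMap<>();
        gauges.put("jvm_threads_live_threads", 23.0);
        gauges.put("tomcat_sessions_active_current_sessions", 4.0);
        System.out.print(render(gauges));
    }
}
```

This line-oriented format is what Prometheus-compatible scrapers parse on every collection cycle.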

Tracing

Today, for tracing, developers expect to instrument their applications automatically, with only runtime or configuration changes, or even no changes at all.

In Java, thanks to Java agent technology (the Instrumentation API and the Attach API), most popular frameworks, libraries, and applications can be instrumented automatically.

You can also look at the latest open source projects, like OpenTelemetry and MicroProfile, to instrument your code and to generate, collect, and export telemetry data (metrics, logs, and traces) in a standardized way. MicroProfile is an open source, community-driven specification for Java applications that offers an array of APIs to enable effective cloud-native development, including MicroProfile Tracing, which makes use of OpenTelemetry to enable effective cloud-native tracing. You can learn more about how to "enable observability for your cloud-native Java applications" using OpenTelemetry and MicroProfile. You can also consider Jaeger, the open source, end-to-end distributed tracing project, which now offers native support for OpenTelemetry. In most cases, however, you can avoid the heavy lifting of manual instrumentation by simply embracing the automated approach that these open standards bring. You can review the OpenTelemetry Instrumentation for Java GitHub repository to see how it works.
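To show how little is needed for the automated approach, the OpenTelemetry Java agent is typically attached with a single JVM flag. The agent jar path, service name, and application jar below are placeholders:

```shell
# Attach the OpenTelemetry Java agent at startup; no application code changes.
# The agent auto-instruments supported frameworks and exports the traces.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -jar my-app.jar
```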

Logging

More than likely, you’ve already fully embraced logging and are using a logging library like Log4j or a logging facade like SLF4J to expose critical information for debugging and troubleshooting.

We log sensible data points around important logic and processes so that we can better track how they actually run. We can then simply rely on the logging framework’s configuration to expose the logs, in the necessary format, to the destinations for log output or aggregation, whether that is standard output or standard error, a file system, or a remote endpoint connected over TCP/IP.
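The same pattern can be sketched, dependency-free, with the JDK's built-in java.util.logging (the Log4j and SLF4J setups mentioned above follow the same shape): the call sites just emit records, while the handler configuration decides format and destination. The logger name and messages here are illustrative.

```java
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Minimal sketch with java.util.logging: destination and format come from
// handler configuration, not from the logging call sites themselves.
public class LoggingDemo {
    public static void main(String[] args) {
        Logger logger = Logger.getLogger("checkout");
        logger.setUseParentHandlers(false); // drop the default stderr handler
        // Route log records to standard output with a simple text format;
        // swapping this handler redirects logs to a file or remote endpoint.
        StreamHandler handler = new StreamHandler(System.out, new SimpleFormatter());
        handler.setLevel(Level.ALL);
        logger.addHandler(handler);

        logger.info("order 1234 accepted");            // log around important logic
        logger.warning("payment retry for order 1234");

        handler.flush(); // StreamHandler buffers; flush before exit
    }
}
```

Pointing the handler at standard output keeps the app friendly to container log aggregation, since collectors typically tail stdout/stderr.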

End-to-end observability

After walking through what a developer might have to consider for the three major observability pillars, we can be more optimistic: the need for code change can be largely eliminated.

But, there is more to enabling observability than just instrumenting our application code. After our code is ready, to enable end-to-end observability, we must still consider:

  • The application monitoring tools to collect our observability data
  • How to ingest the observability data into a proper backend of our choice
  • How to persist the data with a fine-tuned retention strategy
  • How to process the data that we’ve collected along the long pipeline, so that we can maximize its value
  • The insights that can be surfaced, not only for our applications but also for our platforms (such as container platforms like Kubernetes or Red Hat OpenShift) and infrastructure (such as a virtualization platform like vSphere or the IaaS of any public cloud), so that different personas such as SREs, ITOps, or AppOps are fully covered
  • Whether we can have an intuitive UI for a better user experience so that we can focus on the actionable insights instead of the huge amount of the data points that we’ve collected

Making observability truly end to end, even after we’ve taken care of our code, sounds like a daunting job. Fortunately, many existing projects fulfill these needs. For example:

  • We can pick Prometheus and Grafana to build the metric-based monitoring solution.
  • We can pick Jaeger to build the tracing backend.
  • We can pick Elasticsearch + Logstash + Kibana, the so-called ELK stack (Elastic Stack), or its EFK variant, which uses Fluentd instead of Logstash, to build the logging solution.
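For a rough sense of this do-it-yourself route, a local sandbox of such backends can be sketched with Docker Compose. The images and ports below are the projects' public defaults; a production setup would also need persistent storage, scrape configuration, and much more.

```yaml
# Illustrative sandbox only: one container per backend, default ports.
services:
  prometheus:          # metrics backend
    image: prom/prometheus
    ports: ["9090:9090"]
  grafana:             # dashboards
    image: grafana/grafana
    ports: ["3000:3000"]
  jaeger:              # tracing backend (all-in-one, in-memory storage)
    image: jaegertracing/all-in-one
    ports: ["16686:16686"]
```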

But still, that’s a lot of work to do, right? And how can we maximize the value of these three observability pillars, within a proper context, so that we can streamline the user experience for the relevant personas and enable better insights and root cause analysis (RCA)? And what about operability, if we have to maintain several siloed backends?

Enabling full-stack, end-to-end observability with Instana

At this point, you might have realized how important it is to have a full-stack, end-to-end observability platform, so that we can start with low or even no code change and still enjoy the value that observability brings to application developers. A modern APM platform like Instana is exactly the right tool that I recommend to get the job done. Why? Read on.

Instana is a full-stack, end-to-end observability platform

What does “full-stack, end-to-end observability platform” exactly mean to you, as a developer?

“Full-stack, end-to-end” here means that the platform has the necessary coverage and a complete solution for observability needs, as illustrated in the following diagram:

  • Vertically for the application stack, down to its dependent platform and operating system.
  • Horizontally for the upstream or downstream services that your application interacts with.
  • An end-to-end pipeline from application instrumentation to data collection, ingestion, storage, and processing, all the way to the desired insights.
  • A consistent observability user experience for the users, applications, platforms, and infrastructure.
  • An intuitive and smart alerting system for everything we care about, with built-in tuning mechanisms.
  • An automation engine with an action catalog and policies that goes beyond “traditional observability” to shorten mean time to repair (MTTR).

Graphic showing how Instana is a full-stack observability platform

Instana has a lightweight one-agent design with a huge ecosystem of plugins, called sensors, that covers hundreds of technologies, with sensible defaults and SRE knowledge included by default. Application developers can start their observability journey with little to no configuration.

For metrics, Instana embraces existing standards and frameworks, so we can simply start with what we already have: for example, Dropwizard, Java Management Extensions (JMX), Micrometer, Prometheus, or StatsD.

For tracing, with Instana AutoTrace, Instana supports almost all popular programming languages, including Java, Golang, .NET, Node.js, Python, PHP, Ruby, Scala, and more. The enablement process varies from language to language, depending on the nature of the language itself, but Instana AutoTrace strives to offer the smoothest possible experience for developers. For example, in Java we can simply do nothing, while in Node.js we need to install the package by npm install --save @instana/collector and then activate it from within the application by requiring and initializing it as the first statement: require('@instana/collector')();. That’s it!

For logging, Instana works well with many popular logging frameworks (such as Log4j, Log4j2, or Logback) through auto-instrumentation, and it integrates with log management platforms like Coralogix, ELK (or EFK), Humio, LogDNA, and Splunk. It can offer context-based linkage while you navigate from Instana, for a holistic view of all relevant logs.

Instana supports almost all cloud and infrastructure platforms too, so even SaaS services like Amazon Relational Database Service (RDS) and serverless technologies like AWS Lambda, Azure Functions, or Google Cloud Run can be monitored and brought into our observability landscape.

I’d encourage you to review the complete list of supported technologies in the Instana documentation.

One of Instana’s fundamental product philosophies is, “Deploy the Instana agent; it does all the work for you.” You deploy the agent, and its sensors are smart enough to discover the technologies and do the right job for you: instrument the stack, collect the right data, and send it to the backend through a secure HTTP/2 endpoint.

Deploying the agent is a one-liner experience: the UI generates the proper one-liner that we can copy, paste, and run on the host that we want to manage.

Screen capture of installing Instana agents

Instana offers context-based actionable insights

The agent continuously discovers the technologies on its host and activates the right sensors for the actual work. Each activated sensor does the following:

  • Collects the observability data for the technology it knows best
  • Identifies which process on the same host is sending the data
  • Correlates the data by using semantic relationships that describe how the different entities in your stack (zones, hosts, clusters, namespaces, containers, middleware, the apps, and so on) work with one another
  • Aggregates and sends the data, along with this context, to the Instana backend

Instana’s backend processing engine associates the incoming data points with the overall stack that powers those applications and builds them into a dynamic graph. This graph powers a series of features, such as unbounded analytics, that let you navigate through the complete stack from wherever you are. Each technology also gets a dedicated dashboard, with all golden signals and SRE knowledge included by default.

The following screen capture shows the stack of a Spring Boot app, including the Spring Boot app itself, its JVM, its process, the container, the Pod, the worker node, and the VM, all of which can be explored within such a dynamic graph.

Screen capture of an app in an Instana dynamic graph

When you click any layer of the stack, a dedicated dashboard is displayed. The following screen capture shows the JVM dashboard behind the Spring Boot app, displayed after clicking the JVM layer; all of its signals are dedicated to this JVM.

Screen capture of an app dashboard in Instana

Nowadays, an application is a logical construct that might include tens or even hundreds of components. Instana offers a series of prebuilt models within its application perspectives module for you to define and manage an application:

  • Services or Endpoints: The simplest way to construct an application perspective is by selecting the collection of services or endpoints directly, which might be just a single function, such as a payment service.
  • A Critical User Journey: If you want to monitor the user interactions across a set of services to identify the user journeys, this model, which focuses on service level indicators (SLIs) and service level objectives (SLOs), is the right one to apply.
  • Environment or Region: You can create an application perspective by selecting the services based on an environment (such as production or staging that might be described by the agent’s zone), based on a region (such as US East), or based on a set of hosts.
  • An Important Customer or Tenant: By using custom tags or HTTP parameter information, an application perspective can be constructed for your important customer or tenant.
  • Kubernetes or Container: You can create an application perspective using a platform-oriented way to specify the collection of services, such as by Kubernetes cluster name, namespace, or tags. Tags are available for quite a few popular platforms like Kubernetes, OpenShift, Docker, Marathon, and Nomad.
  • Request Attributes: You can create an application perspective by selecting request attributes such as HTTP headers or query parameters.
  • Technology: You can create an application perspective using a group of services based on a technology (such as MySQL) or application name.
  • Custom Tags: You can add your own metadata as custom tags by using the SDK; the custom tag data specifies the collection of services or endpoints to use in building an application perspective.

By following the application perspective wizard, developers can easily define their application.

Screen capture of application perspective wizard in Instana

After you complete the application perspective wizard, Instana automatically generates a dashboard for you, with all golden signals built in.

Screen capture of app dashboard in Instana

You use this dashboard to explore all of the observability data that Instana collects and correlates for you. For example, you can look at the architecture that Instana discovers:

Screen capture of architecture dependencies in Instana

Or, you can look into the services within your applications:

Screen capture of an application's services in Instana

Each technology or service is also discoverable and navigable through the dynamic graph. You can click the Stack or Upstream/Downstream buttons to discover further insights.

Screen capture of an application's stack in Instana

If you click the View in Analyze button, you can analyze the calls, latency, or errors that occurred in the transactions. The dynamic graph helps us aggregate metrics, traces, logs, and even the stack trace in one single view, so that we can streamline root cause analysis (RCA) significantly. You can see not only how the request traversed your microservices and upstream and downstream services, but also where the error occurred and why. You can even tell which line of code triggered the error or the slowness!

Screen capture of an analytics of app in Instana

Instana also offers end user monitoring (EUM) to understand the performance that users experience when they visit our websites or use our mobile apps.

Screen capture of end user monitoring of an app in Instana

Developers or SREs can trace back to the backend components by simply clicking the View Backend Trace button, making the trace end to end: from how the end user experiences the service to how the backend components perform.

Screen capture of back end trace in Instana

Instana’s synthetic monitoring can continuously test performance, especially for public-facing websites and portals, so that we always know whether our APIs are healthy and performant and how our end users are experiencing our services.

Screen capture of synthetic monitoring in Instana

Automation is part of Instana’s capabilities as well.

With the action catalog, Instana offers built-in, AI-generated actions powered by IBM watsonx, and customers can build their own actions based on their existing knowledge and operational experience.

Screen capture of Automation and Action Catalog in Instana

If something bad happens, the built-in AI models indicate the probable root cause, inferred from historical data and the current context. Meanwhile, actions can be associated with the alert through predefined policies, and even recommended by the built-in AI models.

The relevant personas can fix the problem by simply clicking the Run button, which triggers the action immediately. If you trust that such associated or recommended actions can address the issues, you can automate their execution as well. Instana offers mechanisms to gradually make the actions trustworthy.

Screen capture of event triggering in Instana

Conclusion

Bringing end-to-end observability into our IT landscape can be quite challenging, especially considering the many languages that we develop in, the diversified infrastructure that we work with, and the segmented tools in the community. Instana, as a full-stack, end-to-end observability platform, addresses these challenges: it can eliminate most of the overhead in our development processes, streamline our daily operations, and eventually shorten the mean time to repair (MTTR).

For an in-depth review of the Instana solution components, check out this article, "Real-time monitoring of microservices and cloud-native applications using IBM Instana SaaS."