
Implement distributed tracing in a cloud-native space

Microservices and distributed tracing

Microservice-based architectures might be easy and fun to develop, but they can become challenging to run, support, and debug. Because of their distributed running model (for example, in a Kubernetes cluster), it is almost impossible to track exactly where a request passes or which synchronous or asynchronous calls it triggers, because the services run in containers that are spread across various infrastructure nodes and can be terminated at any time by the Kubernetes controller.

In the past few years, driven by the need for distributed transaction monitoring and root cause analysis in complex distributed microservice environments, Jaeger has emerged as one of the key players in addressing the issue of traceability. With distributed tracing, you can follow a single request through its entire journey, from its initial source endpoint across the whole microservice ecosystem graph, crossing application domain boundaries, different protocols, and different programming languages.

This article shows distributed tracing in action with a hello-world sample application on a Kubernetes cluster that you provision and configure to use Jaeger. Then, it delves into some advanced topics that you need to address when you are designing for production, such as data persistence, data retention, and multitenancy.

Set up the sample application on IBM Cloud

To see how Jaeger works, let’s set up a Kubernetes cluster on IBM Cloud and enable the Istio plug-in, which comes with Jaeger installed as its tracing system. Then, we’ll install a sample application to verify that everything works as designed.

  1. Log in to IBM Cloud. To provision a standard Kubernetes cluster (at least 3 worker nodes, each with 4 cores and 16 GB of memory, in a single zone), go to the Kubernetes clusters page and click Create cluster.

  2. On the Kubernetes clusters page, click the newly created cluster to open its Overview page.

    IBM Cloud Kubernetes Overview page

  3. Click Add-ons in the left-navigation menu, then on the Managed Istio card click Install, and click Install again to confirm. The Managed Istio add-on provides a seamless installation of the Istio control plane components and integrates with platform logging, monitoring, and tracing tools. The installation might take a few minutes.

  4. Next, connect to the newly created Kubernetes cluster from a terminal so that you can configure its Kubernetes resources. Click Access in the left-navigation menu and follow the instructions.

    IBM Cloud Kubernetes Access page

  5. Customize the managed Istio installation to activate the tracing mechanism. You can read about this in more detail in the IBM Cloud documentation, but all you need to do is edit the managed-istio-custom configmap. Run the following:

     kubectl edit cm managed-istio-custom -n ibm-operators
    

    Add the following two lines under the data section:

     istio-meshConfig-enableTracing: "true"
     istio-pilot-traceSampling: "100.0"
    
  6. Starting with the managed Istio add-on version 1.8.0, Jaeger — which is the tracing mechanism — is not installed by default. However, you can easily install Jaeger (detailed here) by running the following:

     kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.8/samples/addons/jaeger.yaml
    
  7. Install the BookInfo sample application provided by IBM Cloud, as detailed in IBM Cloud Docs; a rough sketch of the commands follows this list.
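The IBM Cloud documentation walks you through the full installation, so treat the following only as a rough sketch of what the procedure boils down to; it assumes the upstream Istio release-1.8 sample manifests, the same release branch used for the Jaeger add-on above:

# Enable automatic Istio sidecar injection for the default namespace
kubectl label namespace default istio-injection=enabled

# Deploy the BookInfo microservices and the gateway that exposes them
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.8/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.8/samples/bookinfo/networking/bookinfo-gateway.yaml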

Once the BookInfo sample application is installed, make sure that the microservices and their corresponding pods are deployed by running the following:

kubectl get pods -n default

BookInfo Application Pods Running

Try refreshing the sample application page at http://YOUR_PUBLIC_IP/productpage several times to create traffic in the system, or generate the requests from a terminal as sketched below.
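A minimal sketch, assuming YOUR_PUBLIC_IP is the same public IP placeholder used above:

# Send 20 requests to the product page so that traces are produced
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" "http://YOUR_PUBLIC_IP/productpage"
done

Now that you have induced interaction between the application’s microservices, let’s look at the Jaeger UI and see that the traces really show up there. We’ll do a port-forward on the service that exposes the tracing mechanism: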

kubectl port-forward svc/tracing -n istio-system 8080:80

Open http://localhost:8080 in your browser to see something similar to the following image:

Jaeger UI on Sample Application

Default Jaeger deployment model on IBM Cloud

What has happened so far is nothing really special: you followed some clear step-by-step instructions and deployed a sample application that demonstrates tracing. This is a nice, clear example if you intend to stay in the proof-of-concept area. In reality, you can’t move what you have done so far into a production system, for many reasons, such as data persistence, extensibility, multitenancy, and reliability.

Let’s look into the current Jaeger configuration and deployment. First, let’s find the deployment name under which Jaeger was installed:

kubectl get deploy -n istio-system

You should get something similar to the following:

Deployments

If you inspect the jaeger deployment (for example, with the command shown after this list), you can conclude the following:

  • The deployment uses an all-in-one image of Jaeger that encapsulates all the inner components in the same Docker image. This might be enough for development or testing purposes, but in production you need to be able to allocate resources to the different components and experiment with the various ways of connecting agents, collectors, and persistence storage.

      image: docker.io/jaegertracing/all-in-one:1.20
    
  • It uses in-memory storage for traces. If, for some reason, the Jaeger pod in the istio-system namespace gets restarted, all the tracing data is lost.

  • There is no control over the sampling rate and strategy (more details about this topic in the next section).
  • The deployment relies on the default add-on configuration, which eases the entire process of getting it up and running, at the cost of extensibility and reliability. It’s clear that the add-on is intended more for proof-of-concept purposes than for intensive, production-like workloads.
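The quickest way to perform this inspection yourself is to dump the deployment definition and look at its image, storage, and sampling settings:

kubectl get deployment jaeger -n istio-system -o yaml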

Advanced aspects of distributed tracing

Jaeger has two main deployment models:

  • Tracing data being persisted directly to storage
  • Tracing data being written to Kafka as a preliminary buffer

The main inner components include the following:

  • Agent: Listens for spans sent over UDP, batches them, and sends them to the collector. The agent abstracts the routing and discovery of the collectors away from the client.
  • Collector: Receives the traces from Jaeger agents and runs them through the processing pipeline, which includes validating, indexing, transforming, and storing the traces in a pluggable storage back end (Cassandra, Elasticsearch, or Kafka).
  • Query service: Retrieves the traces from the Jaeger back end and hosts a UI to display the information.
  • Jaeger client libs: Libraries specific to the client’s programming languages, responsible for instrumenting the code.

For a more detailed view of Jaeger’s inner components, I recommend that you look at the Jaeger Architecture page.

Tracing data persistence

The tracing data, which is picked up by the agents and processed by the collectors, is stored using one of the following pluggable storage back ends: Cassandra, Elasticsearch, Kafka, gRPC plug-in, Badger (available only in all-in-one deployments), and memory (available in all-in-one deployments). You can specify the type of the back-end storage by setting the SPAN_STORAGE_TYPE environment variable. It can even hold a comma-separated list of valid types, where only the first item in the list is used for read/write while the rest are used only for writing operations.
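As a minimal, hypothetical sketch (the jaeger-collector deployment and jaeger namespace names here are placeholders, not resources created earlier in this article), pointing a collector at Elasticsearch could look like the following. SPAN_STORAGE_TYPE is the variable described above; ES_SERVER_URLS is the companion variable that tells the collector where Elasticsearch lives:

# Hypothetical deployment and namespace names
# SPAN_STORAGE_TYPE also accepts a comma-separated list, such as elasticsearch,kafka
kubectl set env deployment/jaeger-collector -n jaeger \
  SPAN_STORAGE_TYPE=elasticsearch \
  ES_SERVER_URLS=http://elasticsearch-master.jaeger.svc.cluster.local:9200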

Clearly for production deployments, you want to opt for one of the long-term storage options so that you can introduce tracing capabilities to address the unknown unknowns of your software’s failures.

For large scale production deployments, the Jaeger team recommends the Elasticsearch backend.

Automatically remove the persisted data

Depending on the type of application you have, you might want to store the tracing data for different periods of time. For auditing and analytics purposes, you might want to keep data for many years, while other applications need to store tracing data for only a couple of months. The data retention mechanism depends largely on the selected type of back-end storage. This article only explores the first two options, Cassandra and Elasticsearch, because they cover more than 90% of the use cases.

  • Using Cassandra for back-end storage, you can configure the TRACE_TTL (time to live for trace data) setting when the Cassandra keyspace is initialized, whether you rely on the Cassandra instance that comes with the Jaeger Helm chart release or you install it yourself. If you use the default installation method, the TTL defaults to 2 days (see the sketch after this list).

    Cassandra keyspace defaults

  • Using Elasticsearch for back-end storage is a bit different because Elasticsearch doesn’t offer a TTL property for its indexes. The approach in this case is to create an es-index-cleaner cronjob that, based on its configuration, removes the old indexes from Elasticsearch. It is available if you install Jaeger through the official Helm chart and configure its dedicated section:

      esIndexCleaner:
        enabled: true
        image: jaegertracing/jaeger-es-index-cleaner
        tag: latest
        extraEnv: []
          # - name: ROLLOVER
          #   value: 'true'
        schedule: "55 23 * * *"
        numberOfDays: 7
        successfulJobsHistoryLimit: 3
    

This configuration runs at 23:55 every day, removes the indexes older than 7 days, and keeps the history of the last 3 successful jobs. It also supports rolling over the indexes through the environment variables.
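For the Cassandra option in the first bullet above, the TTL is set when the keyspace schema is created. A minimal sketch, mirroring the jaeger-cassandra-schema command used later in this article (TRACE_TTL is expressed in seconds, and the keyspace name is a placeholder):

# 2592000 seconds = 30 days of trace retention
docker run --link cassandra:cassandra \
  -e MODE=test \
  -e KEYSPACE=<TENANT_NAME> \
  -e TRACE_TTL=2592000 \
  jaegertracing/jaeger-cassandra-schema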

Multitenancy

When deploying a solution, you want to make sure that the tracing data can be accessed only by the people or teams intended to see it (I’m not talking about authentication, yet). That is, if you are hosting multiple solutions or clients in your cloud-native Kubernetes environment, you don’t want the tracing data for all your clients bundled together. This is multitenancy: the ability of a single application instance to be used by multiple guests, or tenants. Whether a tenant is an external client application or an internal business unit or department is specific to your organization; either way, the principle stands.

Multitenancy is typically not an easy topic because there is no single easy way to solve the problem. Jaeger was not built with multitenancy in mind, so you have to address this aspect yourself. You can start by identifying your specific scenario and what you are trying to solve, and then configure and deploy Jaeger for that purpose. Let’s look at some options that you have in the Jaeger space for multitenancy; most likely your use case is one of them, or a variation of one.

Because the tracing data passes through several components and layers (it is collected by the agents, sent to the collectors for processing, and finally persisted), you can apply the multitenant logic (that is, splitting the traffic and the data) at any one of these layers. This means that you can:

  • Include the tenant information at the span level.
  • Address the multitenancy at the agent/collector level.
  • Split the tracing data at the storage layer.
  • Not bother with it at all and just install a Jaeger instance for every solution (the worst possible option, because you don’t use any of the cloud-native benefits and it fails to scale with the number of deployed applications).

All of these options are discussed in more detail in the article Jaeger and multitenancy.

Typically, when you implement a multitenancy solution, you want to fulfill the following specific requirements when it comes to tracing data:

  • One Elasticsearch instance supporting all the tenants.
  • Each tenant’s trace data has to be persisted separately, with its own retention timeframe.
  • Ability for every tenant to view and query only its own tracing data.
  • As little development effort as possible, thus reusing the existing Jaeger functionality.

Implementing an Elasticsearch-based solution

Jaeger is extremely flexible in how it works with Elasticsearch. You can use an existing Elasticsearch instance, install a new dedicated one just for the tracing tool, or allow the Helm chart to install one for you. Every option has its own advantages, and you should consider all of them when planning your installation.

To implement this solution with Elasticsearch as the persistence layer (a practical exercise is detailed in Jaeger’s multitenancy with Elasticsearch):

  1. Install an Elasticsearch instance in a separate namespace — you want to manage your own Elasticsearch cluster.
  2. Install a Jaeger collector for each tenant, configured to use the Elasticsearch cluster with the specific tenant name/ID and configure the Jaeger’s Elasticsearch index cleaner.
  3. Install the Jaeger agents as sidecars to the services that are being traced.

First, install Elasticsearch using the official Helm chart:

helm upgrade --install --debug --namespace jaeger -f jaeger-es-values.yaml elastic elastic/elasticsearch

The specific values in the provided values file jaeger-es-values.yaml for this release are:

clusterName: "elasticsearch"
nodeGroup: "master"
masterService: ""
roles:
  master: "true"
  ingest: "true"
  data: "true"
httpPort: 9200
transportPort: 9300
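Assuming the jaeger namespace from the helm command above, you can confirm that the Elasticsearch pods are up before moving on:

kubectl get pods -n jaeger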

Next, you’ll use the Jaeger Helm chart to configure and deploy the Jaeger collector component, the UI, and the index-cleaning job. Perform this step for each tenant; you can automate it in an as-a-service context.

helm upgrade --install --debug --namespace tenant-app -f jaeger-tenant-values.yaml jaeger-collector jaegertracing/jaeger

The following shows the specifics of the provided values file jaeger-tenant-values.yaml:

agent:
  enabled: false
collector:
  enabled: true
  image: jaegertracing/jaeger-collector
  pullPolicy: IfNotPresent
query:
  enabled: true
  image: jaegertracing/jaeger-query
  basePath: /ops/jaeger/<TENANT_NAME>
esIndexCleaner:
  enabled: true
  image: jaegertracing/jaeger-es-index-cleaner
  schedule: "59 23 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  numberOfDays: 3
storage:
  type: elasticsearch
  elasticsearch:
    host: elasticsearch-master.jaeger.svc.cluster.local
    indexPrefix: <TENANT_NAME>
    extraEnv:
      - name: INDEX_PREFIX
        value: <TENANT_NAME>
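The exact resource names depend on your Helm release name, but with the jaeger-collector release used above you can quickly check that the per-tenant collector, query UI, and index-cleaner cronjob were created:

kubectl get deploy,svc,cronjob -n tenant-app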

The only thing remaining is to deploy the Jaeger agents. Typically, they are deployed as network daemons (through a DaemonSet) that listen for the spans being sampled. However, you can also deploy the agents as sidecars to the tenant’s services, in which case you need to make sure you link each of them to the proper collector for its tenant. That means that for your deployment, you would need to add an additional container, such as the jaeger-agent described here:

apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
    spec:
      containers:
      - image: yourimagerepository/hello-my-image
        name: my-app-cntr
        ports:
        - containerPort: 8080
      - image: jaegertracing/jaeger-agent:1.17.0
        name: jaeger-agent
        resources:
          limits:
            cpu: 20m
            memory: 20Mi
        args: ["--reporter.grpc.host-port=jaeger-<TENANT_NAME>-collector.jaeger.svc.cluster.local:14250"]

For more details on the specifics of this option, read Jaeger’s multitenancy with Elasticsearch.

Implementing a Cassandra-based solution

The process of implementing Cassandra-based multitenancy is similar to the Elasticsearch one. First, you install the Cassandra deployment shared by all tenants:

helm upgrade --install --namespace jaeger cassandra bitnami/cassandra

Next, you initialize the Cassandra keyspace and schema:

docker run --name jaeger-cassandra-schema --link cassandra:cassandra -e MODE=test -e KEYSPACE=<TENANT_NAME> jaegertracing/jaeger-cassandra-schema

Then, deploy the collector for each tenant:

docker run --link cassandra:cassandra -e CASSANDRA_SERVERS=cassandra -e CASSANDRA_KEYSPACE=<TENANT_NAME> jaegertracing/jaeger-collector

Just as in the previous section, Implementing an Elasticsearch-based solution, you only need to deploy the Jaeger agents, either as DaemonSets or as sidecars to the tenant’s services.

Conclusion

More and more companies are starting to use the cloud in a smart way, tapping into the real benefits of cloud computing and redesigning monolithic solutions into microservice-based ones. During the transition to microservices, besides the benefits of cloud-based distributed deployments and serverless computing, we also inherit the difficulties of such architectures, including interservice communication, distributed logging, and transactions that span services.

In this article, we instantiated an IBM Cloud-hosted Kubernetes cluster, installed an Istio plug-in, and deployed a sample application to see the Jaeger distributed tracing setup. In the last part of the article, we looked into more advanced aspects of distributed tracing such as data persistence options, index cleaning, and multitenancy.