Recently, IBM, Google, and Lyft announced Istio, an open-source platform aimed at helping organizations manage distributed applications comprising multiple small, independent modules we refer to as microservices. One of the key principles behind Istio is to provide uniform observability across the microservice infrastructure, without requiring special tooling. In this article I am going to talk about a specific aspect of observability that is of paramount importance to enable developers and operators to make sense of the overall behavior of modern applications—distributed tracing.
The benefits of modularization are well-known in software engineering; however, distributed cloud-native applications comprising a multitude of microservices pose additional challenges to debugging and troubleshooting, especially as the number of such components grows and their inter-relationships become more complex. For a long time, logs and metrics have been the de facto means to analyze the behavior of applications and debug them. Unfortunately, logs and metrics collected from individual microservices do not provide a holistic view of the application behavior, rendering them insufficient to reveal the cause of many problems and bugs.
Suppose an application suddenly begins to perform one of its user-facing functions slowly. In trying to reason about this problem, one might ask the following questions:
- How can the component(s) at the root of the issue be identified?
- What are all interactions among the microservices?
- Which such interactions are in the critical path of the observed malfunction?
- Which microservice is the performance bottleneck?
- Is there any unexpected interplay between microservices?
Clearly, looking at each microservice’s logs and metrics individually would become a nightmare and provide little insight into answering the questions above.
How does Istio help with debugging microservice performance?
At the heart of the Istio service mesh is Envoy, an open-source L7 proxy and communication bus designed, announced, and popularized by Lyft. In Istio, each microservice is collocated with an Envoy instance, which is responsible for handling all incoming and outgoing network traffic. Thus, in effect, all Envoy instances together form a communication conduit through which the microservices interact with one another; each Envoy instance can therefore participate in monitoring all inter-service API calls and record, among other things, how long each call takes and whether or not it completes successfully.
While working on Istio, we decided to tap into Envoy’s aforementioned power of observation to assist developers with debugging and troubleshooting. Envoy already supported a commercial solution for distributed tracing; as part of our work, we contributed code to Envoy to support Zipkin, an open-source distributed-tracing technology. Our main goal was to empower developers to clearly see the big picture portraying all interactions among all microservices.
Our Zipkin-related code in Envoy works as follows. Every time a microservice makes an outbound call, Envoy starts a new span. A span represents a full interaction between a pair of microservices, from the time the caller (client) makes a request until it receives the response from the other party (server). The Envoy running on the client side of a call starts a span and records the time right before forwarding the request to the Envoy on the server side which, in turn, realizes that a span is in progress and records the time when the request was received. When the microservice is done processing the request, the corresponding Envoy records the time and sends the response to the originating party. Then, on the client side, perceiving that the span has been completed, Envoy records the time when the response was received and computes the span duration. Given that only the span creator computes the span duration, this value is unaffected by clock skew between the hosts involved. Each of the two Envoys involved in this interaction has partial knowledge of the span: one knows when the client started the request and received the response; the other, when the server received the request and sent out the response. Each Envoy posts its own view of the span to Zipkin, which will then merge all the timing information. Hence, in addition to knowing how long the called microservice took to serve the request, Zipkin will also know, assuming synchronized clocks, how much time the request and the response were in transit in the network.
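The arithmetic behind the merge described above can be sketched as follows. This is an illustrative model, not Envoy’s or Zipkin’s actual code; the function name and timestamp conventions are assumptions made for the example.

```python
# Sketch of merging the two partial views of a span: the client-side Envoy
# contributes (client_send, client_recv) from its own clock, and the
# server-side Envoy contributes (server_recv, server_send) from its clock.
# All timestamps are in milliseconds.

def merge_span(client_send, client_recv, server_recv, server_send):
    # The span duration uses only the client's clock, so it is
    # unaffected by clock skew between the two hosts.
    duration = client_recv - client_send
    # Server-side processing time uses only the server's clock.
    server_processing = server_send - server_recv
    # Transit times mix the two clocks, so they are only meaningful
    # if the hosts' clocks are synchronized.
    request_transit = server_recv - client_send
    response_transit = client_recv - server_send
    return {
        "duration": duration,
        "server_processing": server_processing,
        "request_transit": request_transit,
        "response_transit": response_transit,
    }

# Client sends at t=0 and gets the response at t=62; the server receives
# the request at t=5 and replies at t=55 (synchronized clocks assumed).
print(merge_span(0, 62, 5, 55))
```

Note that even with badly skewed clocks the span duration and the server-side processing time remain correct; only the two transit times would be distorted.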
When serving a request, a microservice might need to call another microservice, leading to the creation of a span causally related to the previous one. All spans that share the same root span form a trace. So, a trace captures the entire chain of causally-related requests necessary to satisfy the root request. Similar to how our code allows Zipkin to merge all timing information for a span, it also allows it to merge related spans into a single trace.
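The span-to-trace assembly just described can be illustrated with a toy sketch. The data shapes below are made up for the example (Zipkin’s real model is richer); the service names come from the Bookinfo sample discussed later.

```python
# Toy illustration of assembling spans that share a trace into a tree
# rooted at the span with no parent (the root request).

def build_trace(spans):
    """spans: list of dicts with 'id', 'parent' (None for the root), 'name'."""
    children = {}
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span
        else:
            children.setdefault(span["parent"], []).append(span)

    def subtree(span):
        return {
            "name": span["name"],
            "children": [subtree(c) for c in children.get(span["id"], [])],
        }

    return subtree(root)

spans = [
    {"id": "a", "parent": None, "name": "productpage"},
    {"id": "b", "parent": "a", "name": "details"},
    {"id": "c", "parent": "a", "name": "reviews"},
    {"id": "d", "parent": "c", "name": "ratings"},
]
print(build_trace(spans))
```

The resulting tree mirrors the hierarchical view Zipkin renders: productpage at the top, its calls to details and reviews beneath it, and the reviews-to-ratings call nested one level deeper.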
As a result, given a time interval of interest, one can quickly see on the Zipkin UI:
- all traces representing chained interactions between pairs of microservices that originate from the same root;
- at which point in time in a chain a particular microservice called another;
- the duration of a chain of requests, and that of each of its constituents; and
- how long a microservice took to act on the response received from another, before making the next request in the chain.
All that information is presented in a hierarchical view with a timeline for visual inspection; furthermore, all tracing data can be queried in the UI or via Zipkin’s REST API for more sophisticated analysis, if needed.
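For programmatic analysis, Zipkin’s REST API exposes a traces endpoint that can be queried by service name and time window. The `/api/v2/traces` endpoint and its `serviceName`, `lookback`, and `limit` parameters are part of Zipkin’s API; the host and port below are assumptions about a typical deployment.

```python
# Build a query URL for Zipkin's REST API to fetch recent traces
# involving a given service.
from urllib.parse import urlencode

ZIPKIN = "http://zipkin:9411"  # assumed address of the Zipkin collector UI/API

def traces_url(service, lookback_ms=600_000, limit=10):
    """Return the URL to fetch up to `limit` traces for `service`
    from the last `lookback_ms` milliseconds."""
    query = urlencode({
        "serviceName": service,
        "lookback": lookback_ms,
        "limit": limit,
    })
    return f"{ZIPKIN}/api/v2/traces?{query}"

print(traces_url("productpage"))
# An HTTP GET on this URL returns a JSON array in which each element is a
# complete trace, i.e., a list of merged spans.
```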
Distributed tracing in action with Zipkin
Now, let us play with Zipkin distributed tracing in Istio. To that end, consider the sample Bookinfo application that is part of Istio’s distribution. This application illustrates some functions typically available in online bookstores. It comprises the following microservices:
- productpage: The productpage microservice calls the details and reviews microservices to populate the webpage seen by the users.
- details: The details microservice provides all details of a given book.
- reviews: The reviews microservice provides the readers’ reviews of a given book. It may also call the ratings microservice to get user ratings information.
- ratings: The ratings microservice retrieves book-ranking information that accompanies a book review.
Moreover, suppose there are 3 versions of the reviews microservice, such that:
- version v1 does not call the ratings service; and
- versions v2 and v3 call the ratings service, but have different timeout policies.
The following figure depicts the overall architecture of the application in the Istio ecosystem. It shows the expected interactions between the microservices and illustrates the different behaviors of the 3 versions of reviews. Note that each service is collocated with an Envoy instance, through which all service-to-service communication goes. A front-end Envoy is present as well.
Below is an example of tracing data transparently collected by Envoy, as it appears on the Zipkin UI.
The trace (request chain) above shows the service-to-service interactions when the application is providing information on a book to a user. At the top of the hierarchy we see how long it took the productpage service to respond to the user. That time (188.740ms) amounts to an initial processing done by productpage, followed by an interaction with details, some more processing, a call to reviews, and a final processing before returning the data. In particular, productpage calls the details service ~38ms after receiving the user request; it takes the details service ~61ms to return a response to productpage. After a short period of time, productpage calls the reviews service, which sends a response ~59ms later. Note that, in this case, version v1 of the reviews service was used, since it did not call ratings.
On the Zipkin UI one can click on each of the lines of the trace to see the precise start and end times and some metadata for each call (span). For example, the following figure shows the information on the call to the details service (second line). Note how all timing information was merged by Zipkin. The span details contain the times recorded by both client and server, as well as the relative times.
A different request chain results when version v2 of the reviews service is exercised. Unlike the previous case, we would expect to see a call from reviews to ratings. This is shown below:
In the above trace, the reviews service spends more than half of its time waiting for the ratings service to return a response. Overall, it takes reviews version v2 ~82ms to complete its work, whereas v1’s time was ~59ms. The whole trace contains 4 network calls, not 3 as before.
The two traces analyzed above capture the expected behavior of the sample application. Let us now examine a scenario where something goes wrong. To force this situation, I will inject a delay of 7s affecting the communication between reviews version v2 and ratings. (I used Istio itself to inject this delay; for information on how to do so, check the Istio documentation.) Below is the trace corresponding to this scenario:
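For reference, a fault-injection rule of this kind looks roughly like the sketch below in recent Istio releases. The exact resource kind and field names have changed across Istio versions, so treat this as illustrative and consult the documentation for your release:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        fixedDelay: 7s    # hold every matched request for 7 seconds
        percentage:
          value: 100      # apply the delay to 100% of requests
    route:
    - destination:
        host: ratings
```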
This time around, the trace, which contains a total of 6 network calls, is quite long compared to the previous ones: it lasts for ~10s. It unveils interesting characteristics of the inter-relationships between the services. Our injected delay of 7s matches the time it takes the ratings service to respond to reviews. However, after a timeout of ~3s, productpage attempts to call reviews one more time. In this second call productpage also times out after 3s; it then gives up and returns an incomplete webpage (no reviews data) to the user. The unexpected behavior here is that, when productpage returns to the user (within ~6s), the two calls from reviews to ratings are still in progress and will finish later. In other words, the timeout for the communication between reviews and ratings is not compatible with the timeout between productpage and reviews.
It is worth noting that, if one checks the details of the two calls made by productpage to reviews by clicking on the corresponding spans on the trace, Zipkin will show that the response size and response code are both 0, signifying that the call was aborted.
Finally, let us explore one more scenario, where we keep our injected delay but use reviews version v3 instead of v2. The code of reviews v3 times out earlier if ratings takes too long to respond. The resulting trace is presented next:
With reviews v3 we obtain a different overall behavior. The entire chain of requests takes ~5.6s, as opposed to ~10s, because reviews v3 times out before productpage decides to abort the call. Consequently, there are no pending requests when productpage returns the incomplete webpage to the user. Note that the calls from productpage to reviews are highlighted in red by Zipkin. The reason is that reviews v3 explicitly returns a 500 HTTP code to productpage when giving up on ratings. This can be seen in the metadata of the calls in Zipkin.
Tracing requires minimal application cooperation
Cloud-native applications in Istio can benefit from Zipkin distributed tracing without having to use any library whatsoever because the actual tracing is performed by the Envoy proxies collocated with the microservices, as explained before. The only (minimal) expected cooperation from the microservices’ code is the propagation of Zipkin-related context in HTTP headers when making outbound calls. For details on which HTTP headers need to be captured and propagated, check the Istio documentation.
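This propagation amounts to copying a fixed set of headers from the incoming request into every outbound call made while serving it. The header names below are the B3/Zipkin headers listed in Istio’s documentation; the function and data shapes are framework-agnostic illustrations, not a specific library’s API.

```python
# Copy the Zipkin/B3 tracing headers from an incoming request's headers
# into the headers of an outbound call, so Envoy can tie the two spans
# into the same trace.

TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
]

def extract_trace_headers(incoming_headers):
    """Return only the tracing headers present on the incoming request.

    HTTP header names are case-insensitive, so compare in lowercase.
    """
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# E.g., inside a request handler, before calling a downstream service:
incoming = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
    "Accept": "application/json",  # non-tracing headers are not forwarded
}
outbound_headers = extract_trace_headers(incoming)
print(outbound_headers)
```

In practice this logic usually lives in a small middleware or HTTP-client wrapper so individual handlers never have to think about it.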
Using Envoy and Zipkin for distributed tracing within the Istio service mesh
This article explains why distributed tracing is important for modern, distributed applications, and describes how Istio supports Zipkin distributed tracing. To demonstrate how such a mechanism can help developers assimilate a global, holistic view of the application behavior and uncover unexpected service-to-service interactions, the article covers four tracing scenarios. The scenarios presented here use a sample application made available with Istio, and can actually be tried out by users.
Related articles on Istio
- Introducing the Istio open source microservices mesh
- Managing microservices with Istio
- IBM, Google and Lyft give microservices a ride on the Istio Service Mesh