Observability, which comprises monitoring, logging, tracing, and alerting, is an important architectural concern when using the microservices and event-driven architecture (EDA) styles, primarily because:
A large number of deployments require automation and centralization of monitoring/observability
The asynchronous and distributed nature of the architecture makes it difficult to correlate metrics produced by multiple components
Addressing this architectural concern provides simplified management and quick turnaround time for resolving runtime issues. It also provides insights that can help in making informed architectural, design, deployment, and infrastructure decisions to improve the non-functional characteristics of the platform. Additionally, useful business and operations insights can be obtained by engineering the emission, collection, and visualization of custom metrics.
However, it is often a neglected architectural concern. This tutorial describes guidelines and best practices for the monitoring aspect of observability for Java and Spring Boot microservices using open source tools such as Micrometer, Prometheus, and Grafana.
Prerequisites
Before you begin this tutorial, you need to set up the following environment:
A Java IDE for cloning and editing the code in the git repo
Estimated time
It should take you about 2 hours to complete this tutorial.
Brief overview of monitoring
The main objectives for a monitoring tool are:
Monitor the application's performance
Provide self-service to stakeholders (development team, infrastructure team, operational users, maintenance teams, and business users)
Assist in performing quick root cause analysis (RCA)
Establish the application's performance baseline
If using the cloud, provide the ability to monitor cloud usage costs and to monitor different cloud services in an integrated way
Monitoring mainly comprises the following four sets of activities:
Instrumentation of the application(s) - Instrumenting the application to emit the metrics that are of importance to the application monitoring and maintenance teams, as well as for the business users. There are many non-intrusive ways for emitting metrics, the most popular ones being "byte-code instrumentation," "aspect-oriented programming," and "JMX."
Metrics collection - Collecting metrics from the applications and persisting them in a repository/data store. The repository then provides a way to query and aggregate data for visualization. Some of the popular collectors are Prometheus, StatsD, and Datadog. Most metrics collection tools are time-series repositories and provide advanced querying capabilities.
Metrics visualization - Visualization tools query the metrics repository to build views and dashboards for end-user consumption. They provide a rich user interface to perform various kinds of operations on the metrics, such as aggregation, drill-down, and so on.
Alerts and notifications - When metrics breach defined thresholds (for instance, CPU utilization above 80% for more than 10 minutes), human intervention might be required, so alerting and notifications are important. Most visualization tools provide alerting and notification capabilities.
There are many open source and commercial products available for monitoring. Some of the notable commercial products are AppDynamics, Dynatrace, Datadog, LogDNA, and Sysdig. Open-source tools are typically used in combination; some very popular combinations are Prometheus + Grafana, Elasticsearch-Logstash-Kibana (ELK), and StatsD + Graphite.
Monitoring guidance for microservices
Aim for uniformity in the types of metrics collected across all microservices. This increases the reusability of dashboards and simplifies aggregation and drill-down of metrics so that they can be visualized at different levels.
What to monitor
A microservice will expose an API and/or consume events and messages. During processing, it might invoke its own business components, connect to a database, invoke technical services (caching, auditing, etc.), invoke other microservices, and/or publish events and messages. It is beneficial to monitor metrics at these different stages of processing because doing so provides detailed insights on performance and exceptions, which in turn helps in quick analysis of issues.
Commonly collected metrics relevant to event-driven architecture (EDA) and microservices include:
Resource utilization metrics
Resource utilization - CPU, memory, disk utilization, network utilization, etc
JVM heap and GC metrics - GC overhead, GC time, heap (and its distinct regions) utilization
JVM thread utilization - blocked, runnable, and waiting threads; connection use time
Application metrics
Availability, latency, throughput, status, exceptions, and more for different architectural layers of the microservice, such as:
Controller layer - for HTTP/REST method calls
Service, DAO, and integration layers - for business, data access, and integration calls
For application metrics, ideally the entry and exit point in each architectural layer of the microservice should be instrumented.
Critical metrics characteristics for microservices
The following three characteristics of metrics are important when monitoring microservices:
Dimensionality
Time series/Rate aggregation
Metrics viewpoints
Dimensionality
Dimensions control how a metric is aggregated, as well as the extent of drill-down on a particular metric. They are realized by adding tags to a metric; a tag is a name=value pair. Tags are used to qualify the fetching or aggregation of metrics through queries to the monitoring system. This is an important characteristic for monitoring microservices due to the large number of deployments: in an ecosystem of microservices, multiple microservices (or even different components of a microservice) emit metrics with the same names. To distinguish between them, you qualify the metrics with dimensions.
For instance, consider the metric http_server_requests_seconds_count. If there is more than one API endpoint (which is the case in an ecosystem of microservices), then without dimensions you can only view the aggregated values of this metric at the platform level; it won't be possible to get a distribution across different API endpoints. Adding a uri tag to the metric while emitting it enables fetching this distribution. Take a look at the following example, which illustrates this characteristic.
If http_server_requests_seconds_count is emitted with the following tags:
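For example, a single sample of this metric might carry the following tags (the values shown are illustrative; exception, method, outcome, status, and uri come from Actuator, while appName, env, and instanceId are common tags added by the application):

http_server_requests_seconds_count{appName="order-service", env="dev", instanceId="order-service-1", method="GET", uri="/api/orders", status="200", outcome="SUCCESS", exception="None"}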
Then the http_server_requests_seconds_count can be aggregated at the appName level, at the instanceId level, by HTTP response status, or by outcome, as demonstrated by the following queries:
# Count distribution by status for a given environment
sum by (status) (http_server_requests_seconds_count{env="$env"})

# Count distribution by uri and status for a given environment
sum by (uri, status) (http_server_requests_seconds_count{env="$env"})

# Count distribution by uri, status, and appName for a given environment
sum by (uri, status, appName) (http_server_requests_seconds_count{env="$env"})
Tags can also be used as query criteria. Note the usage of the env tag, where $env is a Grafana dashboard's variable for user input "environment."
Time series/Rate aggregation
The ability to aggregate metrics over time is important for identifying patterns in the application's performance, such as correlating performance with load patterns, building a performance profile for a day/week/month, and creating the application's performance baseline.
Metrics viewpoints
This is a derived characteristic and provides the ability to group metrics together for ease of visualization and use. For instance:
A dashboard that depicts the availability status of all microservices of the platform
A drill-down (detailed) view per microservice to view the detailed metrics of a microservice
A cluster-level and a detailed view of metrics for middleware components, such as the event broker
Instrumenting a Spring Boot microservice
This section covers instrumentation of a microservice and its REST controllers, service beans, component beans, and data access objects. It also covers instrumentation of Kafka consumers, Kafka producers, and Camel routes, which is relevant if Kafka, spring-cloud-stream, or Apache Camel is used for integration or EDA.
To help with the monitoring and management of a microservice, enable the Spring Boot Actuator by adding spring-boot-starter-actuator as a dependency. Multiple HTTP and JMX endpoints to monitor the application are available out of the box, including basic monitoring of a microservice's health, beans, application information, and environment information.
Additional metrics are available out of the box, including:
Cache metrics (out of the box for Caffeine, EhCache2, Hazelcast, or any JSR-107-compliant cache)
Tomcat metrics
Spring integration metrics
Metrics endpoint
Actuator also creates an endpoint for metrics. By default, it is /actuator/metrics. It needs to be exposed through Spring configuration. The following is a sample configuration:
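A minimal sketch of this configuration in application.yml, exposing just the endpoints used in this tutorial, could look like the following:

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus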
To integrate with metrics tools, Spring Boot Actuator provides auto-configuration for Micrometer. Micrometer provides a facade over a plethora of monitoring systems, including Prometheus. This tutorial assumes some familiarity with Micrometer concepts. Micrometer provides three main meter types for collecting metrics:
Counter - typically used to count occurrences, method executions, exceptions, and so on
Timer - used for measuring time duration and also occurrences; typically used for measuring latencies
Gauge - single point in time metric; for instance, number of threads
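As a brief illustration of these meter types (the class and meter names here are hypothetical), meters are created against a MeterRegistry:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.atomic.AtomicInteger;

public class OrderMetrics {

    private final Counter failedOrders;        // Counter - counts occurrences
    private final Timer processingTimer;       // Timer - measures latency and counts invocations
    private final AtomicInteger activeOrders;  // backing state for a Gauge - point-in-time value

    public OrderMetrics(MeterRegistry registry) {
        this.failedOrders = Counter.builder("orders.failed").register(registry);
        this.processingTimer = Timer.builder("orders.processing.time").register(registry);
        this.activeOrders = registry.gauge("orders.active", new AtomicInteger(0));
    }

    public void processOrder(Runnable order) {
        activeOrders.incrementAndGet();
        try {
            processingTimer.record(order);   // records the execution time
        } catch (RuntimeException e) {
            failedOrders.increment();        // counts the failure
            throw e;
        } finally {
            activeOrders.decrementAndGet();
        }
    }
}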
Integration with Prometheus
Since Prometheus collects metrics by polling, integrating Prometheus and Micrometer is a relatively simple two-step process:
Add the micrometer-registry-prometheus dependency.
Declare a bean of type MeterRegistryCustomizer<PrometheusMeterRegistry>.
This is an optional step. However, it is recommended, as it provides a mechanism to customize the MeterRegistry. This is useful for declaring common tags (dimensions) for the metrics data that would be collected by Micrometer. This helps in metrics drill-down. It is especially useful when there are a lot of microservices and/or multiple instances of each microservice. Typical common tags could be applicationName, instanceName, and environment. This would allow you to build aggregated visualizations across instances and applications as well as be able to drill down to a particular instance/application/environment.
Once configured, Actuator will expose an endpoint /actuator/prometheus, which should be enabled in Spring configuration. A job needs to be added in Prometheus through its configuration to scrape this endpoint at the specified frequency.
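On the Prometheus side, a scrape job for this endpoint could look like the following sketch (the job name, target, and interval are illustrative):

scrape_configs:
  - job_name: 'spring-boot-microservices'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8080']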
The configuration class that declares the MeterRegistryCustomizer can be written as part of a framework library so that all microservice implementations can reuse it. Tag values can be supplied using system/application properties.
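A sketch of such a configuration class, assuming the tag values come from application properties (the property names shown are illustrative):

import io.micrometer.prometheus.PrometheusMeterRegistry;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfiguration {

    // Applies common tags (dimensions) to every metric emitted by this instance
    @Bean
    public MeterRegistryCustomizer<PrometheusMeterRegistry> metricsCommonTags(
            @Value("${app.name}") String appName,
            @Value("${app.instance-id}") String instanceId,
            @Value("${app.environment}") String env) {
        return registry -> registry.config()
                .commonTags("appName", appName, "instanceId", instanceId, "env", env);
    }
}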
Some application-level metrics are available out of the box, while others require additional techniques. The following summarizes, per component type, how latency, throughput, and exception metrics can be collected (the approach is the same for all three metric types):
REST controllers - out of the box with the @Timed annotation
Service beans - through custom reusable Spring AOP aspects
DAO beans - through custom reusable Spring AOP aspects
Integration components - through custom reusable Spring AOP aspects
Logging, caching, and JDBC connection pools - out of the box
spring-cloud-stream channels - out of the box if spring-cloud-stream is used
Kafka producers - through custom MeterBinder beans
Kafka consumers - out of the box
Outbound HTTP/REST (RestTemplate) calls - out of the box
Camel routes - partial support available; custom instrumentation of routes is required
Instrumenting REST Controllers
The quickest and easiest way to instrument REST controllers is to use the @Timed annotation on the controller or on individual methods of the controller. @Timed automatically adds these tags to the timer: exception, method, outcome, status, uri. It is also possible to supply additional tags to the @Timed annotation.
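For example (the controller, endpoint, and extra tag shown here are illustrative):

import io.micrometer.core.annotation.Timed;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Collections;
import java.util.List;

@RestController
public class OrderController {

    // Timed with the default exception, method, outcome, status, and uri tags,
    // plus an extra tag identifying the architectural layer
    @Timed(extraTags = {"layer", "controller"})
    @GetMapping("/api/orders")
    public List<String> getOrders() {
        return Collections.emptyList(); // placeholder body
    }
}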
Instrumenting different architectural layers of a microservice
A microservice would typically have Controller, Service, DAO, and Integration layers. Controllers don't require any additional instrumentation once the @Timed annotation is applied to them. For the Service, DAO, and Integration layers, developers create custom beans annotated with @Service or @Component. Metrics related to latency, throughput, and exceptions can provide vital insights, but the code needs to be instrumented to capture them; they can be gathered easily using Micrometer's Timer and Counter meters. A common class that instruments services and components can be created using spring-aop and reused across all microservices. Using @Around and @AfterThrowing advice, metrics can be generated without adding any code to the service/component classes and methods. Consider the following guidelines when developing such an aspect:
Create reusable annotations to apply to different types of Components/Services. For example, custom annotations, such as @MonitoredService, @MonitoredDAO, and @MonitoredIntegrationComponent, can be applied to services, data access objects, and integration components, respectively.
Define multiple pointcuts that apply the advice to the different types of components carrying the above-mentioned annotations.
Apply appropriate tags to the metric so that drill-down or slicing of metrics is possible. For instance, tags such as componentClass, componentType, methodName, and exceptionClass can be used. With these tags and common-tags, the metric would be emitted as follows:
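For example, with common tags applied, the exception counter used later in this tutorial might be emitted to Prometheus in a form such as this (the values are illustrative):

component_invocation_exception_counter_total{appName="order-service", env="dev", instanceId="order-service-1", componentClass="OrderServiceImpl", componentType="Service", methodName="createOrder", exceptionClass="TimeoutException"} 3.0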
This abstracts all the instrumentation logic out of the microservices into a set of reusable aspects and annotations. The microservices developer just has to apply the respective annotations to their classes.
A sample instrumented Service class needs only the following annotations on it. All the methods in the Service class then automatically become candidates for the serviceResponseTimeAdvice and serviceExceptionMonitoringAdvice.
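The following is a minimal, illustrative sketch of this approach, assuming spring-boot-starter-aop is on the classpath: a hypothetical @MonitoredService annotation, a reusable aspect with the two advices, and an annotated service class. The package, class, and metric names are illustrative, not the tutorial's exact library code.

package com.example.monitoring;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Reusable marker annotation for service beans that should be monitored
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@interface MonitoredService {
}

// Reusable aspect: times method executions and counts exceptions for monitored services
@Aspect
@Component
class ServiceMonitoringAspect {

    private final MeterRegistry registry;

    ServiceMonitoringAspect(MeterRegistry registry) {
        this.registry = registry;
    }

    // serviceResponseTimeAdvice - records latency and throughput for every method call
    @Around("@within(com.example.monitoring.MonitoredService)")
    public Object serviceResponseTimeAdvice(ProceedingJoinPoint pjp) throws Throwable {
        Timer.Sample sample = Timer.start(registry);
        try {
            return pjp.proceed();
        } finally {
            sample.stop(Timer.builder("component.invocation.time") // illustrative timer name
                    .tag("componentClass", pjp.getTarget().getClass().getSimpleName())
                    .tag("componentType", "Service")
                    .tag("methodName", pjp.getSignature().getName())
                    .register(registry));
        }
    }

    // serviceExceptionMonitoringAdvice - counts exceptions; appears in Prometheus as
    // component_invocation_exception_counter_total
    @AfterThrowing(pointcut = "@within(com.example.monitoring.MonitoredService)", throwing = "ex")
    public void serviceExceptionMonitoringAdvice(JoinPoint jp, Throwable ex) {
        registry.counter("component.invocation.exception.counter",
                "componentClass", jp.getTarget().getClass().getSimpleName(),
                "componentType", "Service",
                "methodName", jp.getSignature().getName(),
                "exceptionClass", ex.getClass().getSimpleName())
                .increment();
    }
}

// A sample instrumented Service class: the annotations are all that is needed
@Service
@MonitoredService
class OrderService {

    public String createOrder(String payload) {
        // Business logic only; timing and exception metrics are applied by the aspect
        return "order-created";
    }
}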
Instrumentation of outbound HTTP/REST calls is provided out of the box by spring-boot-actuator. However, for this to work, the RestTemplate should be obtained from the auto-configured RestTemplateBuilder bean. Additionally, custom tags can be added to the metrics if a custom bean of type RestTemplateExchangeTagsProvider is provided.
The following configuration class illustrates this:
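A sketch of such a configuration class, assuming a Spring Boot 2.x Actuator (the clientType tag is an illustrative custom tag):

import io.micrometer.core.instrument.Tag;
import org.springframework.boot.actuate.metrics.web.client.RestTemplateExchangeTags;
import org.springframework.boot.actuate.metrics.web.client.RestTemplateExchangeTagsProvider;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.util.Arrays;

@Configuration
public class RestClientMetricsConfiguration {

    // The RestTemplate must come from the auto-configured builder so that
    // Actuator can instrument it with the http.client.requests metric
    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        return builder.build();
    }

    // Optional: replaces the default tag provider to add a custom tag
    @Bean
    public RestTemplateExchangeTagsProvider restTemplateExchangeTagsProvider() {
        return (urlTemplate, request, response) -> Arrays.asList(
                RestTemplateExchangeTags.method(request),
                RestTemplateExchangeTags.uri(urlTemplate != null ? urlTemplate : request.getURI().getPath()),
                RestTemplateExchangeTags.status(response),
                Tag.of("clientType", "restTemplate"));
    }
}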
Instrumenting Kafka Consumers
Kafka Consumers are instrumented by Actuator by default. More than 30 metrics related to Kafka Consumers are collected by Actuator and Micrometer, and the common tags are also applied to them. Some of the notable metrics are kafka_consumer_records_consumed_total_records_total, kafka_consumer_bytes_consumed_total_bytes_total, and kafka_consumer_records_lag_avg_records. Using dimensions, these can then be grouped by Kafka topic, partition, and more.
Instrumenting Kafka Producers
Kafka Producers are NOT instrumented by Actuator by default. The Kafka producer client has its own metrics implementation. To register these metrics with Micrometer, define a bean of type MeterBinder for each KafkaProducer<?,?>. This MeterBinder creates and registers Gauges with the Micrometer registry. With this approach, more than 50 Kafka Producer metrics can be collected. The common tags, plus additional tags supplied while building the gauges, provide multiple dimensions for these metrics.
The following code shows what a typical implementation of MeterBinder would look like:
public class KafkaProducerMonitor implements MeterBinder {

    // Filter out metrics that don't produce a double
    private Set<String> filterOutMetrics;

    // Need to store the reference of the metric - else it might get garbage collected.
    // KafkaMetric is a custom implementation that holds a reference to the MetricName and KafkaProducer
    private Set<KafkaMetric> bindedMetrics;

    private KafkaProducer<?, ?> kafkaProducer;
    private Iterable<Tag> tags;

    public KafkaProducerMonitor(KafkaProducer kafkaProducer, MeterRegistry registry, Iterable<Tag> tags) {
        ...
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        Map<MetricName, ? extends Metric> metrics = kafkaProducer.metrics();
        if (MapUtils.isNotEmpty(metrics)) {
            metrics.keySet().stream()
                    .filter(metricName -> !filterOutMetrics.contains(metricName.name()))
                    .forEach(metricName -> {
                        logger.debug("Registering Kafka Producer Metric: {}", metricName);
                        KafkaMetric metric = new KafkaMetric(metricName, kafkaProducer);
                        bindedMetrics.add(metric);
                        Gauge.builder("kafka-producer-" + metricName.name(), metric, KafkaMetric::getMetricValue)
                                .tags(tags)
                                .register(registry);
                    });
        }
    }
}
Note: There are other third-party components that emit metrics but are not integrated with Micrometer. In such cases, the pattern mentioned above can be leveraged; one example is Apache Ignite.
Camel integration
If Apache Camel is being used for integration, there would be integration and processing Routes in the application. It makes sense to have metrics at Route level as well. Camel provides endpoints for Micrometer through its camel-micrometer component. Adding the camel-micrometer dependency in the application's pom enables Micrometer endpoints to start/stop timers and increment counters. These can be used to collect route-level metrics. Other Camel-specific beans, such as those of type org.apache.camel.Processor, can be instrumented using the AOP approach previously described.
To enable micrometer endpoints, add camel-micrometer dependency as follows:
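For example, in the application's pom.xml (with Spring Boot, the camel-micrometer-starter can be used instead; the version should match the Camel release in use):

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-micrometer</artifactId>
    <version>${camel.version}</version>
</dependency>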
As you can see, a large number of metrics can be collected and exposed to Prometheus using:
Out-of-the-box metrics provided by Actuator.
Custom metrics through instrumenting the code using AOP and MeterBinder. All of this custom instrumentation code is reusable and can be built as a library, which is consumed by all microservices implementations.
Both methods provide a consistent and minimally intrusive way of collecting metrics across multiple microservices and their multiple instances.
Prometheus integration with other third-party systems
Prometheus has a healthy development ecosystem. There are multiple libraries and servers that are available for exporting metrics of third-party systems to Prometheus, which are catalogued at Prometheus Exporters. For instance, the mongodb_exporter can be used to export MongoDB metrics into Prometheus.
Apache Kafka makes its metrics available over JMX. They can be exported into Prometheus as described in the following section.
Integrating Kafka with Prometheus
If you are using Kafka as your message/event broker, then integration of Kafka metrics with Prometheus is not out of the box. A jmx_exporter needs to be used. This needs to be configured on the Kafka Brokers, and then the brokers will start exposing metrics over HTTP. jmx_exporter requires a configuration file (.yml). A sample configuration is provided in the examples folder of the jmx_exporter repository.
For this tutorial, we build a custom Kafka image only for the purpose of demonstration. Instructions for building a custom Kafka image with jmx_exporter are provided in the code repository's README.md
Building Dashboards in Grafana
Once the metrics are registered with Prometheus Meter Registry and Prometheus is up and running, it will start collecting the metrics. These metrics can now be used to build different monitoring dashboards in Grafana. Multiple dashboards are required for different viewpoints. It is a good practice to have these dashboards:
Platform overview dashboard, which provides availability status of each microservice and other software components of the platform (for example, Kafka). This type of dashboard can also report aggregated metrics at the platform level for request-rates (HTTP request rates, Kafka consumption request rates, and more), and exception counts.
Microservices drill-down dashboard, which provides detailed metrics for an instance of a microservice. It is important to declare variables in Grafana that correspond to the different tags used in the metrics, for example, appName, env, instanceId, and more.
Middleware monitoring dashboard, which provides a detailed drill-down view of the middleware components. These are specific to the middleware (for example, Kafka dashboard). Here, also, it is important to declare variables so that metrics can be observed at cluster level as well as at instance level.
Using dimensionality for drill-down and aggregation
While reporting metrics, tags are added to the metrics. These tags can be used in Prometheus queries to aggregate or drill down on the metrics. For instance, at the platform overview level, one would like to view the total number of exceptions in the platform. This can be easily done using the following query:
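For example, using the exception counter emitted by the reusable aspects described earlier:

sum(component_invocation_exception_counter_total{env="$env"})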
Now to drill down the same metric at method- and exception-type level, the Prometheus query would be as follows:
sum by(appName, instanceId, componentClass, methodName, exceptionClass)(component_invocation_exception_counter_total{env="$env", appName="$application", instance="$instance"})
This produces a breakdown of exception counts per application instance, component class, method, and exception class.
Note the $ variables. These are defined as variables in the dashboard. Grafana will populate them based on different metrics available in Prometheus. The user of the dashboard can choose their respective values, and that can be used to dynamically change the metric visualization without creating new visualizations in Grafana.
As another example, consider the following Prometheus query for visualizing the throughput of the service beans in a particular microservice instance.
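Assuming the aspects register a timer named component.invocation.time (an illustrative name; substitute whatever your aspects use), the per-method throughput can be derived from the rate of the timer's count series:

sum by (componentClass, methodName) (rate(component_invocation_time_seconds_count{env="$env", appName="$application", instance="$instance", componentType="Service"}[5m]))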
Sample platform overview dashboard
The following dashboard visualizes metrics at the platform level:
It provides:
HTTP Request Rate and Kafka Consumption Rate for all REST Controller methods and Kafka Consumers
Availability status of all microservices instances and Kafka cluster.
Note that each visualization here is a hyperlink for a particular microservice instance, providing navigation to the detailed drill-down dashboard of that instance.
Failed HTTP Requests and Service Errors for all microservices instances.
A breakdown of exceptions for all microservices instances.
Sample microservices drill-down dashboard
This dashboard is organized into multiple sections, called "rows" in Grafana. It provides all the metrics of a particular instance of a microservice. Note that it is a single dashboard with user inputs for environment, microservice, instanceId, and so on. By changing the values of these user inputs, one can view metrics for any microservice of the platform.
Note: There are multiple screenshots, since many metrics have been visualized for demonstration.
The dashboard's different metric sections include: microservice instance-level metrics, HTTP controller metrics, service metrics, HTTP client metrics, Kafka producer metrics, and JDBC connection pool metrics.
Sample Kafka monitoring dashboard
This dashboard covers Kafka broker metrics and Kafka messaging statistics.
Conclusion
Monitoring of Spring Boot microservices is made simple with spring-boot-actuator, Micrometer, and Spring AOP. Combining these powerful frameworks provides a way to build comprehensive monitoring capabilities for microservices.
An important aspect of monitoring is consistency of metrics across multiple microservices and their multiple instances, which makes monitoring and troubleshooting easy and intuitive even when there are hundreds of microservices.
Another important aspect of monitoring is different viewpoints. This can be achieved by using dimensionality and rate aggregation characteristics of metrics. Tools such as Prometheus and Grafana support this out of the box. Developers just need to ensure that the metrics being emitted have the correct set of tags on them (this in turn can be achieved easily through reusable and common aspects, and Spring configuration).
By applying this guidance, it is possible to have consistent and comprehensive monitoring for all microservices with zero to minimal intrusive glue code.
Sample code
The code examples provided in this tutorial are available on GitHub.