Back in August 2015, Raj Singh wrote about the Simple Metrics Collector App that he and David Taieb had put together. It was a simple app that collected web analytics data and stored it in an IBM Cloudant database.
In this post, I'll describe how I took their work and rebuilt it to fit into a Microservices stack. Instead of the app writing data to a database directly, it now posts data to a choice of queues: Redis, RabbitMQ, or IBM Message Hub (Apache Kafka). You can then deploy further Microservice apps to consume the queued data for storage, analytics, and real-time reporting.
How does it work?
The Simple Metrics Collector Microservice repository contains the source code and instructions for installation and configuration. You can deploy to Bluemix with a single click or run locally against local or remote services.
Instead of writing data to Cloudant using its HTTP API, the new code passes the incoming data to one of a series of modules, which write the data to a choice of queues, PubSub channels, or message brokers. The module is chosen by the QUEUE_TYPE environment variable, which can take one of the following values:
- stdout: write to the terminal only
- redis_queue: write to a Redis queue
- redis_pubsub: write to a Redis pubsub channel
- rabbit_queue: write to a RabbitMQ queue
- rabbit_pubsub: write to a RabbitMQ pubsub channel
- kafka: write to an Apache Kafka or IBM Message Hub topic
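The environment-driven plugin switch can be sketched like this. This is an illustrative sketch in Node.js, not the repository's actual code: the plugin table is reduced to stubs so the selection logic itself is runnable without a broker.

```javascript
// Illustrative sketch: pick a queue plugin based on QUEUE_TYPE.
// Real plugins would talk to Redis/RabbitMQ/Kafka; each is reduced
// to a stub here so only the selection logic is demonstrated.
const plugins = {
  stdout:       { add: (payload) => console.log(JSON.stringify(payload)) },
  redis_queue:  { add: (payload) => { /* RPUSH payload onto a Redis list */ } },
  redis_pubsub: { add: (payload) => { /* PUBLISH payload to a Redis channel */ } }
  // ...rabbit_queue, rabbit_pubsub, and kafka would follow the same shape
};

const queueType = process.env.QUEUE_TYPE || 'stdout';
const queue = plugins[queueType];
if (!queue) {
  throw new Error(`Unknown QUEUE_TYPE: ${queueType}`);
}

// Every plugin exposes the same add() interface, so the rest of the
// app never needs to know which queue system is behind it.
queue.add({ path: '/home', ts: Date.now() });
```

Because every plugin presents the same interface, swapping queue systems is a configuration change rather than a code change.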
When the app runs, the value of the QUEUE_TYPE environment variable determines which of the plugins from the plugins directory is loaded. Each plugin has its own customised add function, which handles communication with its queue system.
It's simple to add your own plugin: just add a new file to the plugins directory! If you want to write a ZeroMQ, MQLight, or MQTT module, then fork the code and get writing. We'd be happy to accept your changes in a Pull Request.
The difference between Queues and PubSub
My previous post, Get in line! An intro to Queues and PubSub, covers this in detail, but here's a quick summary:
A queue stores data payloads in the order they were received. Workers read the oldest item or items from the queue. Some queue software (RabbitMQ & Redis) pushes data to the workers. In others (Apache Kafka), the worker polls the queue for data. With a queue, we can scale the number of workers to deal with increasing workload because the load is split evenly between the workers. Queues buffer the data until consumers have had a chance to consume it, and in the case of RabbitMQ, provide a feedback mechanism to ensure that the worker acknowledges completion of each task.
A PubSub channel also receives time-ordered data and sends it to consumers, but each consumer gets all the data. PubSub doesn't share the data between the connected workers; every worker gets a copy of the data. This lets you use the same data stream for several purposes, like storage, streaming analytics, and real-time dashboards. PubSub systems tend not to buffer data; the consumers must be connected to receive notification of incoming data (the exception being Apache Kafka, whose topics can behave like buffer queues or pubsub channels).
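The distinction can be shown with a toy in-memory model (no real broker involved): a queue hands each item to exactly one worker, splitting the load, while pubsub copies every item to every subscriber.

```javascript
// Toy in-memory illustration (not real broker code): queue vs. pubsub.
const items = ['a', 'b', 'c', 'd'];

// Queue semantics: each item goes to exactly one worker,
// so the workload is divided between them.
const workers = [[], []];
items.forEach((item, i) => workers[i % workers.length].push(item));
// workers → [['a', 'c'], ['b', 'd']]

// PubSub semantics: every subscriber receives a copy of every item.
const subscribers = [[], []];
items.forEach((item) => subscribers.forEach((s) => s.push(item)));
// subscribers → [['a','b','c','d'], ['a','b','c','d']]
```

Adding a third worker to the queue reduces each worker's share; adding a third subscriber to the channel just produces a third full copy of the stream.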
The Simple Metrics Collector Microservice allows you to choose between Queue and PubSub modes for each of the supported queuing platforms, Redis, RabbitMQ, and Apache Kafka.
Consuming the data
Having the Metrics Collector Microservice running on its own isn't the full story, as it merely queues the data ready for consumption. It is nothing without some other microservices that consume the data. This is where the Metrics Collector Storage Microservice comes in. This microservice can also connect to a Redis, RabbitMQ, or Kafka queue, but this time to receive incoming data. Any data that arrives on the queue (or PubSub channel) is written to one of a choice of databases:
- IBM Cloudant / Apache CouchDB
- MongoDB
- ElasticSearch
All three storage services are capable of storing JSON data, so little effort is required to store the stream of data from the hub. The Metrics Collector Storage Microservice uses the DATABASE_TYPE environment variable to decide which storage mechanism to use:
- stdout: output to the console only
- cloudant: save to Cloudant
- mongodb: save to MongoDB
- elasticsearch: save to ElasticSearch
The Metrics Collector Storage Microservice uses each database's bulk storage API to efficiently store data in batches, rather than write each individual record as it arrives. As with the Metrics Collector Microservice, there is a plugin architecture so that the code can easily be extended to deal with other queues or other database targets.
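The batching idea can be sketched with a simple in-memory buffer. This is an illustrative sketch, not the repository's code: records accumulate until a size threshold is hit, then the whole batch is handed to the database's bulk API in one call.

```javascript
// Illustrative batching buffer: collect incoming records and flush
// them in bulk, instead of issuing one write per record.
function makeBatcher(bulkWrite, batchSize) {
  let buffer = [];
  return {
    add(record) {
      buffer.push(record);
      if (buffer.length >= batchSize) this.flush();
    },
    flush() {
      if (buffer.length === 0) return;
      // e.g. Cloudant/CouchDB's _bulk_docs or ElasticSearch's _bulk endpoint
      bulkWrite(buffer);
      buffer = [];
    }
  };
}

// Usage: count how many bulk calls a stream of 5 records produces.
const batches = [];
const batcher = makeBatcher((docs) => batches.push(docs.length), 2);
for (let i = 0; i < 5; i++) batcher.add({ id: i });
batcher.flush(); // flush the trailing partial batch
// batches → [2, 2, 1]
```

A production version would also flush on a timer, so a quiet period doesn't leave records stranded in the buffer.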
The Cloudant/CouchDB/MongoDB/ElasticSearch service can be run locally or remotely; all of the services are available to try in IBM's Bluemix platform-as-a-service.
Microservices give you flexibility
If we run our Microservice in PubSub mode, then we can deploy any number of consuming microservices to that channel for storage, analytics, and real-time reporting. We could deploy two instances of the Metrics Collector Storage Microservice: one configured to write the data to MongoDB, and the other to write the data to ElasticSearch.
If the volume of data is too large for a single producer service, then additional services can be created and the traffic shared between them.
The producer microservice is decoupled from the consumers (and vice versa) by the hub in between.
Why such a complicated architecture?
Scale. Imagine we have a web app generating a lot of data. We may need more than one server collecting the metrics from the web, and a large queue cluster buffering the data. We can then add any number of consumers, with as many nodes as we need to deal with the load. As long as the queue can handle the volume of data, we can add producer nodes and consumer nodes to match the workload.
The Simple Metrics Collector Microservice is a demonstration of the principle of Microservices, using web metrics as an example. Your application could collect any kind of data: Internet of Things readings, server logs, mobile game usage statistics, or weather reports. But the theme is the same: build your application to scale from the beginning, and add computing power to the cluster to deal with the volume of data being generated.