When I first began developing Reactive Systems I did so because of a measurable business need. I started working on the Walmart Canada project in 2012 — one of the first large-scale reactive projects to be delivered with the Scala programming language and Lightbend (formerly known as Typesafe) platform.
I embraced reactive because it solved the real-world problems I was faced with at the time. I would have adopted many of these techniques even if they didn’t have a formal name.
When first launched, the new Walmart Canada platform needed to be responsive on a multitude of levels:
- Responsive to users by serving each request in under 1 millisecond
- Responsive to demand by handling 5 million page views per minute on day one
- Responsive to a growing user base as the improved user experience drove exponentially more visits (a ‘virtuous cycle’)
- Responsive on days of high traffic (such as Black Friday) when page views can and do spike exponentially; simultaneously reduce CPU cycles and the overall operational costs associated with traditional software architectures
- Responsive to the needs of the business by providing the ability to change and customize the website without code —- essentially turning walmart.ca into a CMS
While seven years old now —- which is dog years in technology —- the refreshed Walmart Canada platform as it originally launched with Scala, Play, and Akka, is a case study worth studying in the reactive community. The business goals came before the technical solution, and the business requirements were a near 1:1 mapping to the principles of Reactive Systems as outlined by the reactive manifesto. Tools such as Lagom are built with these core principles in mind, which is how the latest breed of reactive frameworks reduces the overhead involved of having to build out your own reactive plumbing.
With the right team and budget, achieving the before-mentioned business goals would be relatively straightforward on a greenfield project. However, doing so while integrating with dozens of heritage back-end systems is another story altogether. This is at the heart of learning to successfully develop Reactive Systems. A Reactive System must adapt to a changing business environment, operational environment, and development landscape, all within the constraint of working with existing systems that need to be evolved rather than replaced.
Building your next reactive project on a solid underpinning is so important, which is why we’re comfortable to recommend Lagom, Akka, and the rest of the technologies that we’ve covered in this series. Our choices are very intentional, which is necessary for commercial projects that will require a certain level of support, security, indemnification, commercial terms, and so forth. Reactive Stock Trader was built with the enterprise developer in mind as the potential seed of inspiration for upcoming projects. We’ve covered almost every facet of the Lightbend Platform, aside from Spark, Flink, and Spark ML.
This final unit summarizes the architecture we have demonstrated over the course of the entire series and outlines some next steps to consider for those about to embark on a real-world reactive project. While internalizing the lessons in this series, take comfort in the fact that many of these techniques are proven in systems that handle massive amounts of business every single day.
Let’s recap what we covered in this series. Then we will provide a complete overview of Reactive Stock Trader’s final architecture.
Designing event-driven systems
In Unit 1, we outlined a technique for designing event-driven systems called event storming, along with the proven modeling technique of DDD (domain-driven design). When combined, these techniques are a highly effective approach to designing real-time systems.
An additional modeling technique that has begun to gain traction since Unit 1 was published is event modeling, a term coined by Adam Dymitruk. Event modeling builds on the work of Greg Young, creator of CQRS, and the work of Alberto Brandolini, the creator of event storming. Event modeling is a lightweight design technique that adds additional structure to an event storming exercise, specifically by creating swimlanes for each bounded context, with personas and external systems in separate interaction swimlanes.
UX and UI
One of the most valuable side effects of event storming and DDD is that the models produced are comprehensible by a diverse set of stakeholders, including UX and UI team members. This helps to ensure that the UI and UX requirements completely align with the way that technical and business stakeholders visualize the flow of the system.
By the end of this unit, the importance of including a diverse set of team members, including designers and product experts, should be obvious.
Data, data types, and compiler guarantees
In Unit 3, we laid out the basics of how commands and services interact within each other, along with patterns like Algabraec Data Types (ADTs), which will help to ensure the Java compiler handles as many edge cases as possible when processing trade requests without having to resort to extensive testing. Rather than use if-else-then statements to handle various types of trades (market, limit, stop-limit), we used the visitor pattern along with ADTs to guarantee that all cases are handled, and a compiler error will be thrown if a type of trade is added without the appropriate handling logic.
This is an interesting —- and perhaps novel —- topic to cover in the Java development community, because it advocates leveraging the compiler rather than testing techniques, such as TDD.
Services and API
In Unit 4, we covered the core principles of working with Lagom along with the core principles of asynchronous programming. We rounded out the unit by covering how to properly structure a single Lagom microservice into packages, controllers, and routes. We wrapped up by exploring the basics of implementing services, along with how service descriptors and the service registry works. This unit is a transition from theory to practice, preparing readers for the detailed implementation work to follow.
Event sourced persistence
The focus of Unit 5 is persistence. We outlined the effective application of event sourcing to eliminate the need for mutable data structures, which helps to ensure that entities are fully persistent and can be recovered in the event of a crash or restart. Mastering this unit will help you to properly create, persist, and emit events in a reactive system.
CQRS (Command Query Responsibility Segregation)
In Unit 6, we covered the relationship between commands and queries, and, more specifically, how to separate the command path from the query path in order to effectively optimize our system.
A common cause of latency in modern systems is the improper handling of complex queries in real-time using the same threads as those serving requests and handling commands. By separating reads and writes, including the full separation of read models from write models, we can individually optimize reads and writes.
Once we fully separate the command channel from the query channel, we can begin to significantly optimize the performance, reliability, and latency of queries (Unit 7). Separating commands from queries can radically reduce query latency by orders of magnitude as we move processing off of the main UI threads and into asynchronous worker threads.
Through the use of Lagom read-side processors along with Cassandra (Unit 8), we can support the handling of long-running, multi-stage transactions without resorting to brittle techniques such as distributed two-phase commits or
ThreadLocal techniques and sticky sessions.
Transaction handling is one of the most difficult aspects of distributed systems development and where the most amount of consideration should be spent. To be successful with reactive systems development, transaction handling will be a key aspect of design and development that requires proficiency. This is where you can also expect some refinement and change within reactive tools themselves, as no de-facto transaction handling standard has emerged, even in Lagom.
In Unit 9, we covered the Message Broker API and described how to integrate multiple bounded contexts together without necessarily depending on REST. However, enterprise integrations will require a number of techniques, from utilizing a message broker, to REST, to legacy filesystem integration.
We don’t need to wait around before querying the system synchronously. Using the Lagom PubSub API (Unit 10), we can subscribe to raw events or changes to read-side models from within a Lagom service, and then, using Reactive Streams, push the data to any interested external party (such as our Vue.js UI).
Deploying to Kubernetes
Building a system based on reactive microservices is one piece of the puzzle. In Unit 11, we covered how to package and deploy individual Lagom services to Minikube on your local developer environment. This will set up readers before a production deployment, as it covers the main steps to get ready for a real deployment. It also covers some of the convenience functionality that comes with Lagom through its integration with the sbt build tool.
Congratulations for completing the entire series! The rest of this unit will summarize the architecture of Reactive Stock Trader. We will also recommend next steps for taking your reactive systems knowledge to the next level.
Reactive Stock Trader architecture
In real-world systems, coupling and mutability leads to brittle, inflexible, hard-to-understand architectures. Our goal for the architecture of Reactive Stock Trader was to demonstrate how to separate concerns on all levels. In this spirit, we have walked down a path towards a microservices-based architecture, with each microservice defined by a bounded context.
In spite of the complexity of microservices, the system remains easy to work with as it’s self-contained within a single logical build. The combination of Lagom and sbt makes microservices smooth to work with during local development phases.
In summary, the architecture of Reactive Stock Trader delivers the benefits of enhanced runtime performance through optimization techniques such as CQRS, increased resilience through cluster-aware microservices, and the productivity benefits of working with a single logical system. It’s effectively like the convenience of working with a monolith at development time, while delivering all of the benefits of microservices at runtime.
Let’s now step through a high-level overview of each major architectural category of Reactive Stock Trader.
The main way to influence a change in Reactive Stock Trader is to create and issue a command. Once a command is created and emitted, we can see how it flows through the various layers of our architecture.
After being submitted, a command must be validated. After a command is validated, we have two mechanisms to emit status back to the UI or other interested systems (which are not mutually exclusive):
- Return a response directly
- Subscribe to updates using the PubSub API and stream changes to projected views
The following diagram captures this flow as a pattern that applies to almost every command in Reactive Stock Trader, and almost any command that will be added in the future.
Are there any situations in which a command would follow a different path than outlined above?
Yes. Imagine that a command was created and submitted by an external service. Rather than use the PubSub API to broadcast the event to subscribers, we would instead leverage the Message Broker API, as we covered in Unit 9. This would add a few additional stages to the above flow between the ‘persistent entity’ and ‘PubSub’ swimlanes, but otherwise the flow would remain mostly intact.
As events are persisted, a projected view is kept up to date by read-side processors. When a query is executed, a Lagom service queries the read-side model without having to do any complex joins or computation. This keeps queries very snappy.
Imagine a query that asks, “How many shares does Customer X have of Stock Y?“
All client holdings will already be precomputed — in other words, projected — from the raw events asynchronously. Using the data pump pattern implemented by a collection of
ReadSideProcessor instances, you can set up any number of read-side processors that will work in the background to monitor changes in the journal and act appropriately to keep views up to date. When a CQL query is finally issued, rather than performing any intensive processing in real time, the hard work has already been done in the background, and the views are up to date. This essentially removes CPU-induced latency from the query path, leaving only IO latency remaining.
As we get closer and closer to the production environment, a common question is, “Why not just use a database rather than persistent entities?”
In order to keep our system as performant as possible, Lagom stores every active entity in memory and looks it up whenever necessary without having to perform an expensive SQL query or wait for IO latency. If an entity becomes infrequently accessed, it is passivated by Lagom, which means that it is removed from memory but is available upon request by restoring the entity’s current state from the event log. This brings the performance benefits of in-memory storage together with the cost savings of storing entities in cheaper forms of storage such as SSDs and spinning disks.
The core principle of Lagom is that the building blocks of a successful system are entities, and entities should be extremely fast and easy to work with. Lagom manages this under the hood with Akka Cluster Sharding, which carves up all of the nodes in a cluster (for instance, nodes in Kubernetes), assigns shards to each node, and then distributes entities to shards.
In our command path diagram above, we can see find or create entity. Under the hood, Akka handles the complexity of sharding and messaging for Lagom entities.
The following diagram paints the picture of how cluster sharding works in Akka.
The diagram represents a single bounded context, which forms a cluster. The role of the
ShardCoordinator is to decide how the shards are distributed, not how messages are forwarded to entities. The first time a specific
Entity is messaged, Akka will discover the current
ShardRegion of the entity from the
ShardCoordinator and cache that address. Then, messages can begin to flow to the entity. Each message to the final recipient — the entity — is actually delivered to the
ShardRegion. From there, the message is forwarded to the correct
Shard, which forwards the message to the
Developers who plan to work with Lagom in a production environment should become familiar with the concepts of cluster sharding and understand how Lagom manages entities under the hood in order to successfully support a production system. While Lagom provides this functionality ‘out of the box,’ configuring this for production, or troubleshooting issues in production, will require some depth of knowledge of the underlying Akka implementation.
A system isn’t reactive without a mechanism to stream data. Lagom relies on Reactive Streams and Akka Streams under the hood. Lagom mostly obfuscates the complexity of both; however, developers will need a high-level understanding of streaming semantics in order to successfully develop a reactive system.
The Reactive Stock Trader architecture streams events by connecting Lagom services with Play controllers using Reactive Streams, and then seamlessly connects Play with Vue.js over a WebSocket connection using built in convenience methods within Play to treat a WebSocket connection as a sink for streams. Lagom handles streaming using the Akka PubSub API, which is the source of the stream.
In Reactive Stock Trader, we only stream in one direction, from Lagom to the UI. All input is handled by commands over REST rather than streaming commands over a WebSocket connection, although this is supported by Play and Lagom. In other words, Lagom supports bi-directional streaming, but there is no use case for this in Reactive Stock Trader.
Integrating between bounded contexts
Rather than point-to-point integration between bounded contexts over REST, we leverage Kafka as an intermediary between bounded contexts and integrate with Kafka using the Message Broker API. Each bounded context is expected to publish interesting events by default, giving other bounded contexts the opportunity to subscribe to domain events.
In the above figure, we show two bounded contexts, but imagine dozens or even hundreds of microservices all connected together via REST. Point-to-point RESTful integration will create a spaghetti of dependencies, coupling services, and creating a future maintenance nightmare. By intermediating direct service-to-service connections using Kafka via pub-sub, we greatly simplify future maintenance and also create a durable record of events that influence the behavior of each microservice. This also makes testing easy, as we can simply inject test events into Kafka and observe the behavior of our services.
In the example above, we have the following properties:
- Entities are sharded across multiple nodes in a cluster
- The PubSub API is used for communication within a bounded context
- Bounded contexts group logically related aggregates together with a fluent API
- Public APIs can be expressed both as RESTful endpoints and message brokers like Kafka
- The Message Broker API is used for communication across bounded contexts
- Point-to-point REST integration between two bounded contexts should be avoided; only the BFF should communicate with the REST (service) interface of a bounded context
Integrating with the world
Within a system networking boundary, each microservice (bounded context) will expose both a REST endpoint and also a number of Kafka topics. However, individual microservices should not be exposed directly to the public.
In order to integrate your system with the outside world, ingress should happen only via public REST along with WS endpoints exposed by your BFF (Backend for Frontend). Authentication, authorization, and other core policies should be a huge focus of work in the BFF tier, along with protocol translations and any other plumbing code that will help to keep microservices focused on pure business logic.
Each microservice will also have service endpoints, but access should be strictly controlled at a network level. Essentially, everything ‘below’ the BFF waterline will be within a trusted zone.
Also notice that we discourage direct point-to-point integration between microservices. The Message Broker API can be leveraged to disintermediate this potentially tangled mess with Kafka as the message broker of choice.
That’s not to say RESTful integration between services will never happen, but it should be a very temporary measure, such as during initial POC development, and should be considered strategic technical debt.
Individual microservices will also need to include their own individual security measures. Death Star Security is a term that describes microservice-based architectures with robust outer protections, but a soft core that, once breached, can take down the entire system. Essentially, you will want to keep bad actors out of your system, but assume that some of them will get past your outer layer of defense. Guard core services appropriately!
Deploying to production
Now that we have a general idea of our architecture, along with a complete refresher of the contents of the entire series, let’s take some time to cover additional considerations before deploying a reactive system to a real production environment.
As we covered in Unit 11, Minikube is a great way to gain an understanding into all of the steps involved in a real deployment, but there are many other deployment topics to cover —- too many topics for this series. However, let’s cover a few of the most important, which should point you in the right direction for future learnings.
Deploying with OpenShift
Red Hat OpenShift is a hybrid cloud enterprise Kubernetes application platform aimed at polishing the native interfaces and operational capabilities for administrators, operators, and developers. Reactive Platform is certified by Lightbend to run on OpenShift, making it a solid production deployment target for any user of Play, Akka, and/or Lagom. As an added benefit, OpenShift can be deployed to any popular cloud provider, including IBM Cloud, as well as on-prem on a client’s own infrastructure.
One consideration before adopting OpenShift is to understand the role of Operators versus Helm Charts. Helm is only one part of the ‘Operator Maturity Model,’ which aims to become the standard way to package and manage deployments in future versions of Kubernetes.
The above diagram from the OpenShift blog “Make a Kubernetes Operator in 15 minutes with Helm” showcases the difference between a Helm-powered operator versus a Go-powered operator. What you have seen in this series only represents the basics of installation to Kubernetes, but there are many more options for infrastructure automation.
Learn more about the Operator framework.
Red Hat OpenShift Cluster on IBM Cloud
One of the easiest ways to deploy an OpenShift-ready system is to leverage Red Hat OpenShift Cluster on IBM Cloud. This managed OpenShift Container Platform service gives enterprise developers a viable alternative to the traditional self-managed hosting options of Kubernetes on AWS or GCP.
With OpenShift, you get all of the benefits of Kubernetes, such as intelligent scheduling, self-healing, horizontal scaling, service discovery and load balancing, and automated rollouts/rollbacks. With OpenShift, you gain heightened cluster and app security, global availability, access to the OpenShift catalog, and access to the complete IBM Cloud Platform services, such as Watson, IoT, and Blockchain.
A system like Reactive Stock Trader is unlikely to live inside its own little bubble and will require integration within a much larger enterprise system. This is where a service mesh may come into play. Istio is arguably the most prevalent open source service mesh available for Kubernetes.
“Istio is open technology that provides a way for developers to seamlessly connect, manage, and secure networks of different microservices — regardless of platform, source, or vendor.” – IBM Cloud Managed Istio
Istio can simplify networking within an enterprise environment by making interoperability between a microservice-based system, such as Reactive Stock Trader, and a broader enterprise system more streamlined. It also improves the quality of service within a microservice-based system by making it easier to properly configure security, encryption, authorization, and rate limiting options.
Istio is available as an add-on to Kubernetes services available on most public clouds, for example IBM Kubernetes Cluster Service. OpenShift Service Mesh, also based on Istio, comes as an integral part of Red Hat OpenShift.
Both of these managed Istio implementations will simplify network configuration and cluster management. In addition, they’ll make it easier to take advantage of monitoring and logging services, as well as other API’s to data services, Blockchain, Watson, and so on. For those users who may be exploring the possibilities of machine learning and AI, native integration with Watson is an especially compelling offer.
Configuring Cassandra for production
Apache Cassandra is not a general purpose datastore. Lagom handles some of the complexity and nuances of Cassandra, but there are a host of differences between Cassandra and a typical RDBMS.
In Cassandra, your tables must be designed from the ground up for queries. Cassandra does not support joins, so a single query must be serviceable by a single table. A query will either return a single record or multiple records, but all of these will come from the same table. This is because Cassandra is very fast in terms of a single lookup where it knows the partition key and reads from the same partition.
All Cassandra queries should use a partition key. If your queries don’t use a partition key, they will suffer from poor performance.
This serves the Lagom use case very well as we’re using Cassandra for read-side queries. However, when designing these queries, we’ll need to think about the partition key strategy, which we did not cover in this series.
Cassandra does not support the immediate deletion of records. When a record is first deleted, a tombstone in Cassandra represents the deleted record and will be read with subsequent queries until purged.
Within Reactive Stock Trader, we are not deleting queries. But, if your use case requires the significant deletion of queries, you should understand the concept of tombstones and possibly explore leveraging a standard RDBMS in place of Cassandra. Although we do not advocate for deleting events, the deletion of events and queries may happen for a number of reasons, including GDPR compliance.
Consistency and replication
Consistency and replication needs to be evaluated on a query by query basis. When a table is created, its replication factor is determined; the replication factor is how many nodes the data is replicated to. When a record is written, the number of nodes that must be written to are determined.
Before deploying Cassandra to production, you should have a good handle on consistency levels and ensure that your queries reflect your desired level of consistency. Also, for those looking to understand consistency in a deeper way, I highly recommend reviewing the blog post about Cassandra from Kyle Kingsbury, the creator of Jepsen.
Before going live with a Lagom deployment backed by Cassandra, it’s important to understand a few additional characteristics of Cassandra. Mainly, Cassandra is I/O bound and relies heavily on speedy disk performance. For those who plan to run Cassandra on cloud hardware, it’s important to use the largest instances with local storage wherever possible (within cost constraints). The recommendation from Cassandra experts will almost certainly be to use dedicated hardware for Cassandra.
Before going live, we highly recommend to check your configuration against the recommendations from Datastax, which should point you in the right direction.
Configuring Kafka for production
The Message Broker API relies on Kafka, so it will need to be properly tuned for production. Generally speaking, Kafka has very good default settings. The easiest recommendation to using Kafka in production is to use a fully managed service such as Red Hat AMQ Streams or IBM Event Streams, and to also read Neha Narkhede’s excellent Kafka book along with the official docs.
Production configuration of Lagom and Play
In Unit 11, we showed a pattern for configuring Lagom for production using the naming standard of
application.prod.config for production configuration files. This pattern should solve many of your initial production configuration needs.
Moving beyond the basic configuration outlined in Unit 11, it helps to keep in mind that Play (BFF) and Lagom (microservices) will need to be separately configured for production. When configuring Lagom microservices, it also helps to understand that Akka configuration can be applied as well.
Serialization is a broad topic but something that will require consideration before going to production. The deepest depths of serialization is out of scope for this series, but we would like to put it on your radar as an important topic that will need to be explored well before a production deployment, and absolutely before your microservices system grows beyond a few simple bounded contexts.
A short series of recommendations are as follows:
Serialization/deserialization with Kafka using the Lagom Message Broker API should use Avro. Avro is very well supported by the IBM Event Stream Schema Registry, for example. And for any real production system, you will absolutely want to use schemas and schema registry for all domain events being stored by Kafka.
The Lagom documentation recommends JSON as the preferred serialization format:
- between Lagom services
- to the journal (with Cassandra)
messaging within a cluster using the PubSub API
Some teams may wish to use Protobuf for the journal for serialization with Cassandra. In production systems based on Akka, I have had success using Protobuf with Akka Persistence, and also feel comfortable recommending this with Lagom.
There is no single right answer to serialization, so you’ll need to explore the various options at your disposal and then work within the bounds of Lagom to configure your serialization strategy appropriately. Understand that it is a specialized topic, but one that affects reactive systems in profound ways.
Regardless of whether you use JSON, Avro, or Protobuf, the only hard limit is not to use Java serialization. There are several issues when using Java serialization, including the lack of support for schema evolution, poor performance, and a high vulnerability to attacks.
Observability, monitoring, and performance
Being able to monitor your application is a prerequisite to being ready to deploy to production. Understanding the health of your Kubernetes pods, the state of connections between them, congestion of inter-service network connections, and so forth are all important indicators of the application in the context of the deployed infrastructure. The tooling required to monitor KPIs will most likely be provided by the platform on which you choose to deploy your application. For instance, Prometheus and Grafana are powerful open source tools for monitoring and visualizing KPIs and setting alerts. These, along with other tools, form part of the management console within IBM Cloud Pak for Applications, a cloud-native application platform optimized for Red Hat OpenShift.
However, often you will need to get a deeper view into your applications performance, preferably before they are deployed into production. For this, there are commercial add-ons available for Akka, Play, and Lagom in Reactive Platform. Telemetry libraries allow you to instrument your applications for deep visibility, down to the actor level. And Reactive Platform Console uses customizable Prometheus and Grafana dashboards to visualize the data and set alerts.
Reactive Stock Trader touches upon all components of a reactive architecture — from structure, to communications patterns, to persistence, to streaming. We hope that you find it a valuable starting point before you embark on your next — or first — adventure in building real-world reactive systems!
About the author
Kevin Webber is a developer, architect, and technical leader based in Toronto, Canada. Kevin brings almost two decades of programming and architecture experience in Java and the JVM to projects, with a specialization in training and development in areas such as event-driven architecture, microservices, cloud-enablement, and machine learning enablement. Kevin can be found at kevinwebber.ca, and also on Medium, YouTube, Twitter, and LinkedIn.
A special thanks to the co-authors and reviewers of this series: Dana Harrington (BlueCat), Markus Eisele (Lightbend), Wade Waldron (Lightbend), Doug Weatherbee (Lightbend), Sarah Domina (IBM), and Peter Gollmar (IBM); and to the rest of the team at Lightbend and IBM who have helped with suggestions, corrections, and, most importantly, by developing the tools that we used throughout this series.