In Part 1 of this article, I gave an overview of the challenges of implementing microservices. In this part, I will give some best practices for developing microservices at scale and re-evaluate the promises of microservices against what we learned.
New style applications
The new architectural style used for microservices removes capabilities that were used in traditional applications, most notably support for distributed transactions and shared session state. Also, the components within an application become more independent, so best practices that previously applied mostly between applications now need to be implemented between the microservices within an application.
Here is a list of practices that you should use in microservices-style applications:
- Eventual consistency: Transactions are limited to a single microservice, so there is a higher probability that inconsistencies across services happen due to processing errors. Experience has shown that, especially in user-driven applications, inconsistencies are not as “bad” as perceived if fixed in a timely manner. Often, just a reload of a page is needed. Embrace this concept, plan for failure and thus plan for asynchronous replication in database clusters, or reconciliation and recovery workflows for potential problems. Look at patterns like SAGAs to ensure consistency across microservices where absolutely needed.
- Conflict-free data structures / log-style databases / Single-Writer principle: When transactions are limited to a single call, you will often have to resort to patterns like optimistic locking. In simple terms, check whether a value is still as expected before you overwrite it. To reduce the potential for conflicts, you could use conflict-free data structures. These are data structures you can always modify with a set of specifically allowed operations, without creating any conflict. In a log-style database you could, for example, just write a new version of an object, and the newest one will then “win.” Such a pattern also helps with scaling the database and reducing conflicts in active-active database cluster setups. Another pattern is the Single-Writer principle: Define for each data item which service is the single owner that can modify it (and knows how to do so consistently and efficiently), and have all others only read it.
- Versioning support: Microservices are developed and deployed independently from each other. To support the evolution of a microservice, including changes in the service operations or exchange data structures, a microservice should support multiple versions of its service interface at the same time, until all service consumers have moved on. This could be an adapter from the old API to the new API, or, if the underlying database structures permit, even running an old and a new version of the service in parallel. Note that the interface of a microservice with its stored data is similar to a service API here: changes in the data structures need to be managed as well, so a microservice may need to support multiple object-model versions for its stored data.
- Resilience: With microservices, there are more (internal and external) interfaces in an application, so each microservice should be able to handle a misbehaving service provider that it uses. Circuit breakers “switch off” a service provider that is too slow. Bulkheads (resource limits per service) limit resource usage in a consumed service, so one service cannot use resources like database connections reserved for another service. Throttling limits the throughput (of requests, or of messages in a streaming application) such that the load can be handled. In general, a microservice should implement a “graceful degradation” approach: if underlying resources are not available, the service should not fail, but offer reduced functionality. Note that resilience features are often implemented in the service mesh in which a microservice is deployed (see below), or in “sidecars” that are deployed next to the service, so that the microservice does not have to implement these features itself.
- Scalable (No-)SQL database: Classical databases provide features like referential integrity checking, strict data typing, etc. With microservices and their increased rate of change, strict data typing is becoming a hindrance. Every time you need to change the data model of a table, all the contents of that table potentially need to be migrated or re-organized, with potential downtimes. NoSQL databases relax the type checking. You can store (almost) anything you like, so you can have multiple versions of a data structure in the same table (or collection) at the same time (you only need to make sure the application can handle these). Also, modern NoSQL databases can more easily scale to large instances by automatically distributing their data.
- Autoscaling: In a microservices application, each of the components can be scaled separately. This helps optimize the resource usage, as only the parts of the application that actually need resources use them. For example, a nightly background process can use the same resources that the high-volume website uses during the day. Distributing resources automatically using autoscaling (e.g., based on the number of requests) helps reduce the overall resource usage.
- Service mesh: To integrate your services, use a service mesh such as Istio. It provides a lot of the functionality you need, like TLS termination (Transport Layer Security encryption of connections), service discovery, load balancing, rate limiting, circuit breakers, and many more, so that you don’t have to implement these in your own service code.
- Security by design: Design your application for security from the start, taking into account requirements such as the GDPR privacy regulations. Implement security features like TLS, encryption at rest, key management, etc. right from the beginning. Introducing them later in the development cycle can cause high refactoring cost.
- Infrastructure automation: Containerize your application components so you can deploy the identical container across environments. Define the infrastructure setup “as code” in configuration files that are maintained under version control. Automate everything so that the components can be deployed automatically, whether to update production or to create a new test environment.
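To make the resilience item in the list above more concrete, here is a minimal sketch of a circuit breaker combined with graceful degradation. It is an illustration under assumptions, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError` and `fetch_recommendations` are hypothetical, and the breaker counts raised exceptions (for example a client-side timeout for a provider that is too slow) rather than measuring latency itself. In practice a service mesh or sidecar would often provide this for you.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the provider."""


class CircuitBreaker:
    """Hypothetical breaker: opens after max_failures consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("provider is switched off")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_recommendations(breaker, provider):
    """Graceful degradation: return an empty list instead of failing."""
    try:
        return breaker.call(provider)
    except Exception:
        return []
```

The caller keeps working with reduced functionality (no recommendations) while the provider is unavailable, and the provider is not hit again until the cool-down has passed.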
Look out for other microservice best practices like the 12-factor-app, to avoid unwanted dependencies.
In general, try to look at modern tooling to see what fits your application best. kafkaorderentry is a (maybe overly extreme) example of how to build an order management system based on the Kafka event processing platform instead of classical (relational or even NoSQL) databases. It does not always have to be the classical approach.
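The idea behind such a log-based design can be sketched without any Kafka dependency: the current state is not stored directly, but derived by replaying an append-only list of events. The following is a simplified illustration only. The event types (`OrderCreated`, `ItemAdded`, `OrderShipped`), their fields, and the `Order` class are hypothetical, and a plain Python list stands in for a topic.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    items: dict = field(default_factory=dict)  # sku -> quantity
    status: str = "new"


def apply_event(orders, event):
    """Apply one event to the in-memory view (event shapes are assumptions)."""
    kind = event["type"]
    oid = event["order_id"]
    if kind == "OrderCreated":
        orders[oid] = Order(oid)
    elif kind == "ItemAdded":
        order = orders[oid]
        order.items[event["sku"]] = order.items.get(event["sku"], 0) + event["qty"]
    elif kind == "OrderShipped":
        orders[oid].status = "shipped"
    return orders


def replay(log):
    """Rebuild the current state by replaying the append-only log from the start."""
    orders = {}
    for event in log:
        apply_event(orders, event)
    return orders


log = [
    {"type": "OrderCreated", "order_id": "o1"},
    {"type": "ItemAdded", "order_id": "o1", "sku": "book", "qty": 2},
    {"type": "ItemAdded", "order_id": "o1", "sku": "book", "qty": 1},
    {"type": "OrderShipped", "order_id": "o1"},
]
state = replay(log)
```

Because events are only ever appended, writers never overwrite each other's data, and any consumer can rebuild its own view of the orders from the same log.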
Reference tool set and architecture
The “principle” (if not “dogma”) of microservices is to use the “best technology” for a given purpose. This makes sense in principle, but in the cases I have seen, the criteria for deciding which technology would be “best” did not take into account the effort for operations and support. That resulted in a zoo of tools, with multiple different tools for the same purpose within one application. Also, it costs a lot of effort for every team to develop and manage its own build and deployment pipeline, only to unify and consolidate them later.
To overcome this effort, even a microservices project should consider the run and operations cost for each selected tool and runtime. To reduce this cost, experience from other projects, or at least a common denominator setup, should be re-used and only exceptions should be managed when necessary. A microservices development team, even if freed from the usual standards and regulations to provide quick business value, does not need to reinvent the wheel.
- Architecture management helps the development team to be efficient in the long run.
- Restrict the set of tools, programming languages and runtimes, database types, and more to reduce the operations and support effort in the long run. Support exceptions where necessary.
- A common build and deployment pipeline relieves the development team from these tasks, and ensures a common set of quality and deployment standards that can be checked automatically (e.g., code quality rules, use of TLS, etc.).
- Common guidelines and standards, such as interfaces ensuring interoperability.
- Clear definition of data type owner (structure of objects).
- Include a service mesh component to “outsource” resilience features from the microservices to the mesh, making service development easier.
Microservices provide and consume stable, versioned APIs, so that each service can evolve independently as needed. A small focused team (the proposed measure is the “two pizzas” size of 5-7 people) is the ideal development organization, responsible for the full stack: from the database, through the service layer, to the service interface or even the user interface. This proves to be a challenge, as the required skill set increases compared to a classical team-per-layer approach.
In some cases, a feature requires coordinated effort across multiple Microservices. A data model change may ripple through several services. An application may be just too large to be implemented in a single team.
Beware of Conway’s law! It states that the architecture resembles the organization structure of your development teams. It is one reason to argue for microservice development teams being responsible for the full stack, and in this case increases the speed of development and innovation, e.g., as seen on teamorg. However, when you need coordination and communication across microservices, this becomes more difficult.
In such cases, the approach often is to “go waterfall”: a change in one service in the first sprint, then the change in another service in the next sprint. However, this slows development down and leads to increased effort when feedback loops are needed to change something that was done in an earlier sprint (like adding a field that was missed earlier).
Teams implementing different microservices must be able “to talk” to each other during a sprint, so a change can be implemented in parallel in a coordinated way. If you cannot have short-lived “feature teams” across the microservices teams, at least nominate someone in each team to be responsible for the communication and coordination with the other teams for that specific feature. Ensure that project planning takes into account the extra communication and coordination effort for cross-microservice activities.
Also make sure that the code is integrated across teams and tested as often as possible, if not multiple times a day, to “fail fast” and to fix any problems that arise quickly.
Testing and analysis support
Testing an application or microservice that runs 24/7 and is deployed regularly, maybe even multiple times a day, only works when tests are automated. Test automation and integration into the build and deployment pipeline is key here. Make sure you have a testing strategy from the beginning, so you can set up your test pipeline right from the start. Changing tests just to fit an approach introduced too late is extra effort that can be avoided.
Test data generation should also be considered. This can either be synthetic data, or an anonymized set of production data (although GDPR and data privacy may restrict its use). The latter especially provides a larger and more realistic variability and thus better test data quality. Each problem that has been fixed should potentially become a new set of test data.
Non-functional testing is essential as well. Non-functional requirements are often considered as “implemented because we use <your favorite scalability pattern, database or event processing tool>.” However, without a test to prove the system quality for a non-functional requirement, it should be considered as not implemented. Non-functional testing here means not only performance (response times) and throughput (load) testing, but also availability and resilience testing. While performance and throughput testing can be automated rather easily, resilience testing is more challenging. Netflix, for example, has automated their availability and resilience testing with their “Chaos Monkey” that randomly disables computers in their production(!) infrastructure to see how the other services react.
Even the best load and resilience testing may still overlook a situation that could potentially render the system unavailable or unresponsive. A good way to reduce the risk of such situations is to do canary releases, where you only gradually move the users from the old version of a service to a new version. Note that this could be done by routing requests to different instances of a service that have different versions.
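The gradual move of users can be illustrated with a small routing sketch. It assumes that requests carry a stable user ID and that routing is decided per user. The function names `bucket` and `choose_version` are hypothetical, and in a real setup a service mesh or load balancer would typically make this decision instead of application code.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Route the user to 'new' if their bucket falls below the canary share."""
    return "new" if bucket(user_id) < canary_percent else "old"
```

Because the bucket depends only on the user ID, a user who has been moved to the new version stays there as the canary share is raised, for example from 5 to 25 to 100 percent. Setting the share back to 0 returns everyone to the old version.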
To support testing and error analysis (in test as well as in production), the system should be able to trace requests across microservices, for example using a correlation ID and tools such as opentracing.io and Zipkin. An ELK stack (Elasticsearch, Logstash, Kibana) or something similar should collect the logs and enable the team to search and filter logs without having to manually copy them from the servers. Prometheus could be used to monitor the infrastructure and applications, with Grafana providing easy-to-use time-series dashboards of system parameters.
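As a rough sketch of the correlation ID idea, the following uses only the Python standard library. The header name `X-Correlation-ID` is an assumption (a common convention, not a standard), and `accept_request`, `outgoing_headers` and `build_logger` are hypothetical helpers. A tracing library would normally handle this propagation for you.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

HEADER = "X-Correlation-ID"  # assumed header name


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def accept_request(headers):
    """Reuse the caller's ID if present, otherwise start a new trace."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to calls made to downstream services."""
    return {HEADER: correlation_id.get()}


def build_logger(name="orders", stream=None):
    """Create a logger whose lines start with the correlation ID."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # None means sys.stderr
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With every service logging the same ID for a given request, the log collection tooling can filter on that one value to see the request's path across all microservices.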
DevOps and agile
The development approach that’s used for microservices is usually agile, where changes can be developed and deployed quickly. The coordination of multiple microservices requires an organizational structure that ensures:
- Dependencies between microservices are managed, and
- Overall technical guidelines are adhered to.
You should employ agile frameworks that are made to work at scale, be it “scrum of scrums” or the “Scaled Agile Framework” (SAFe). For example, the System and Solution Architect role acts as an enabling function for the development process, overseeing the system as a whole.
Use a DevOps approach to quickly get feedback from production and end users. Support this approach by implementing monitoring tooling for the application. Here, synergy with testing can be achieved if the tools used in testing, like ELK, Prometheus and opentracing are used in production as well (also see the “Smarter Monitoring” approach).
One principle of the microservices approach is “you build it, you run it.” But even if the microservices team is available on call in the beginning, be prepared for the maintenance to be taken over by another team at some point. So make sure that documentation and analysis tooling, like logging and monitoring support, are “good enough” to enable outside developers to understand and analyze the microservice. For example, ensure that architectural decisions and the rationale behind them are properly documented. Otherwise, such decisions become carved in stone, as after some time no one knows why they were made the way they were, and no one is brave enough to challenge them. Or, the other way around, mistakes from the past are repeated.
Do we fulfill our promises?
Now that we’ve described the promises of microservices, let’s have a look and evaluate these promises with what we have learned:
- Promise of agility, and faster time-to-market: Microservices are indeed easier to understand, implement, and deploy. Setting up a full development, deployment, and support pipeline still takes its time, but this is often hidden behind the quick development of a Minimal Viable Product (MVP) that lacks many of the non-functional qualities needed for full production. Similarly, microservices may benefit from lessons learned and proven setups that are re-used from other microservices, but this needs to be actively managed, as the microservices approach advocates doing everything yourself in each team.
- Promise of more innovation, achieved by using the best tech for each problem: A microservice has less impact than a monolith, so there is less risk of endangering other parts of an application with a new, innovative service. The risk is that, due to personal preferences, marginal (or only perceived) advantages of a specific technology lead to a zoo of different tools for similar problems, resulting in a higher maintenance cost in the long term.
- Promise of better resilience: Microservices are deployed in smaller units, and individually protected by resilience services of the mesh, so the zone of failure is limited compared to a monolithic application. Mesh tooling can automatically restart failing microservices, which is more difficult and slower to achieve with monoliths.
- Promise of better scalability: The initial resource usage of microservices may be larger than with a monolith due to the additional base resource usage. But microservices can be scaled individually, and with a better reaction time. Scalability can be achieved with monoliths too, but scalability with microservices is more granular and only those microservices with load need to be scaled.
- Promise of better reusability: Business-domain-oriented services can be reused more easily, as separating business domains reduces the risk of monolithic service interfaces that cannot easily be reused. Loose coupling between services makes it easier to separate out and reuse a service. A properly modularized “monolith” may have the same qualities, but not all monoliths are built this way, and the separate development and deployment enforces modularization with microservices.
- Promise of improved ROI and better TCO: An agile approach starts quick but introduces cost for rework to implement the full (functional and non-functional) requirements. Having each microservice team implement its own build and deployment pipeline or other infrastructure multiplies this implementation cost. A microservice team should be full-stack: from requirements, coding, databases, build and deployment, to operations and operations support, increasing skill requirements and thus development costs. Without strict resource management, even infrastructure cost triggered by duplicate build pipelines, test infrastructures and more may be higher.
One main reason for delays and cost overruns in microservices projects probably is expectation management, where a service can be developed quickly, but it may need to evolve in several steps until it can fulfill the full production requirements. Friction between multiple microservices teams is not seen and coordination cost not recognized.
Microservices are a push to properly modularize your applications. If you’re already doing this in your monolith, the difference is the granularity of deployment. This granularity allows better scalability, reuse and resilience. On the other hand, a “monolith” can benefit from the infrastructure developed for microservices, like containerization, service mesh, autoscaling approaches, and sidecars with circuit breakers or request rate limiting. The key indeed is proper modularization, so that even a monolith can be broken up later if, for example, some services need to be reused.
The microservices approach proposes that each team is responsible for its service. But if in your traditional approach the UI team does not “talk” to the service team, who does not talk to your infrastructure team, expect problems with one microservices team not communicating with another microservice team. In the end, a developer has two frames of responsibility: the vertical one to deliver business functionality and the horizontal one to deliver the necessary non-functional qualities. If you don’t take care of both of them in your organizational setup, Conway’s law will hit you either way. If you choose the monolith approach, define vertical sub-teams responsible to implement specific business functionality. If you choose the microservices approach, define horizontal teams responsible to ensure the non-functional qualities. (Hint: guilds are a start, but every team should participate, and the guilds should have authoritative power too.)
A large and complex application still stays large and complex, no matter what approach. The implementation teams need to talk to each other in a matrix kind of way, no matter what approach. Microservices are no silver bullet. In the end, no matter what approach, you have to THINK.
A great course to learn about application architecture using the Microservices approach on the cloud is the “IBM Cloud Garage Application Architect Bootcamp” course. Many thanks go to my colleagues Christian Bongard, Wilhelm Burtz, Jim Laredo and Doug Davis, who have reviewed the article and commented on it.