by Samir Nasser | Published June 19, 2019
There are unique performance and resiliency engineering concerns for hybrid cloud software solutions that, if not addressed, can lead to serious service level agreement (SLA) issues, such as outages and poor performance. This article provides actionable recommendations to address common challenges that can afflict your hybrid cloud solution.
Prerequisites: Basic understanding of performance and resiliency engineering.
Estimated time: You should be able to read this article within 10 minutes.
Software performance is often measured by how fast a solution executes. For example, the load time of a web page and the response time of a REST service call are both examples of the response time metric. Likewise, the number of transactions a web site can process per unit of time is an example of another key performance metric: throughput.
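As an illustrative sketch, the two metrics can be captured together with a small harness. The `measure` helper and the simulated 5 ms handler below are hypothetical, not from the article:

```python
import time

def measure(requests, handler):
    """Measure per-request response time and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        handler(req)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "avg_response_s": sum(latencies) / len(latencies),  # response time
        "throughput_rps": len(requests) / elapsed,          # throughput
    }

# Stand-in handler that simulates roughly 5 ms of work per request.
stats = measure(range(20), lambda _: time.sleep(0.005))
```

The same harness works for any callable, which is handy when you want the two metrics reported side by side during a test run.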
Although those two metrics are often considered during software performance testing, other metrics that describe the resiliency of your solution can be just as important and sometimes even more so. For example, if a component fails, how should the overall solution behave? How fast should it recover from the failure? Is the recovery automatic or manual? Will a replica component take over the functionality of the failed component? These questions probe the resiliency requirements of your solution.
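To make those resiliency questions concrete, here is a minimal sketch of one automatic-recovery pattern: retry the failing component with exponential backoff, then fail over to a replica. The function names, retry counts, and timings are illustrative assumptions:

```python
import time

def call_with_failover(primary, replica, retries=2, backoff_s=0.01):
    """Try the primary component; on repeated failure, fail over to a replica."""
    for attempt in range(retries):
        try:
            return primary()
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return replica()  # automatic failover: the replica takes over

# Example: the primary always fails, so the replica answers.
def broken_primary():
    raise ConnectionError("primary down")

result = call_with_failover(broken_primary, lambda: "served by replica")
```

Whether recovery should be automatic like this, and how long the retries may take, are exactly the resiliency requirements you need to pin down before testing.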
In the hybrid cloud software solution world, unique performance and resiliency challenges can come up if the right mitigations are not considered. Below are nine best practices to consider.
Performance and resiliency testing is crucial for any software solution, but the unique challenges described below make it especially important for hybrid cloud solutions.
Unlike an on-premises solution hosted in your own data center, a hybrid cloud solution can span large geographic distances. These distances deserve serious thought; otherwise, your hybrid cloud solution can suffer from performance and resiliency problems.
For example, consider a solution with a web component running in a cloud that needs to interact with a database component hosted on premises. The larger the geographic distance between the two components, the higher the network latency. Depending on the performance requirements of the application, the observed behavior may not be acceptable, and a design change may be required.
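A back-of-the-envelope calculation shows why distance matters, especially for chatty interactions, where latency is multiplied by the number of sequential round trips. The round-trip counts and RTT values below are illustrative assumptions:

```python
def network_time_ms(round_trips, rtt_ms):
    """Estimate time spent on the wire for a sequence of sequential calls."""
    return round_trips * rtt_ms

# A page load that issues 25 sequential database calls:
same_region = network_time_ms(25, 2)    # ~2 ms RTT within one data center
cross_region = network_time_ms(25, 80)  # ~80 ms RTT across a continent
# 25 * 2 = 50 ms versus 25 * 80 = 2000 ms: the same code, 40x slower
```

The multiplier effect is why reducing round trips often helps more than shaving milliseconds off any single call.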
Potential remedial actions to alleviate the impact of geographic distance on performance include colocating chatty components in the same region, caching or replicating data closer to its consumers, and batching calls to reduce the number of network round trips.
Components located on premises and across various clouds have unique security requirements. Several layers of security are usually involved before these interactions are allowed; they may include network switches, firewalls, virtual private networks, and specialized security appliances that employ various encryption and decryption schemes. Because an on-premises solution component runs in a different network from a public cloud component, rigorous security measures, including firewalls, restrict access to certain traffic. These security layers come at a cost. Moreover, because there are more security layers between an on-premises component and one in the cloud than between two on-premises components, there are more resiliency issues to consider: any of those layers can fail.
A hybrid cloud solution gives you the flexibility to implement each component in the language best suited to it. However, that flexibility may come at a price. A solution implemented in several languages needs more performance experts to tune it: each language has its own performance best practices that an expert must apply when tuning a particular component. More performance debugging tools will also be needed during the testing and tuning phase, because a tool that works with a component implemented in one language will likely not work with a component implemented in another. Understanding what facilities a language offers for performance and resiliency troubleshooting therefore becomes crucial.
To optimize the performance and resiliency of a software solution during testing, or to ensure that its behavior is maintained in production, you may need to make changes in any layer of the solution stack. Unlike an on-premises solution, a hybrid cloud solution has a number of different stakeholders with various levels of change control. On one end of the spectrum, you, as the solution owner, have full control of the on-premises solution stack, including all of its layers. On the other end, you have little to no control when using a software as a service (SaaS) model. In between, you may have control from the operating system level up, from the middleware level up, at the application component level only, or constrained fine-grained control, such as the ability to restart a component that your SaaS provider offers as a service.
The link between control and performance or resiliency engineering may not be immediately clear, so consider this real-world case: your solution has components located on premises and in two clouds from two different providers. The component in the first cloud frequently interacts, over a secure connection, with the component in the second cloud, acting as its client. The provider hosting the second component then changes its server certificate without informing you or the provider hosting the first component. The first component can no longer reach the second, and the resiliency of your solution fails entirely because of a control issue. Not having the full control you would have in a traditional on-premises solution can create performance and resiliency challenges that are unique to the hybrid cloud world.
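One defensive measure is to pin and monitor the provider's certificate fingerprint so that an unannounced rotation surfaces as an alert rather than an outage. This sketch uses stand-in certificate bytes; a real check would compute the fingerprint of the DER-encoded certificate fetched from the server:

```python
import hashlib

def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()

def check_pin(cert_der: bytes, pinned: str) -> bool:
    """Detect an unannounced certificate change before it breaks clients."""
    return fingerprint(cert_der) == pinned

# Illustration with stand-in certificate bytes:
old_cert = b"provider-cert-v1"
pinned = fingerprint(old_cert)
unchanged = check_pin(old_cert, pinned)           # True: certificate unchanged
rotated = check_pin(b"provider-cert-v2", pinned)  # False: raise an alert
```

Running such a check periodically against the remote endpoint gives you early warning about a change you cannot control but must still react to.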
To optimize software for performance and resiliency, you must, at a minimum, have access to solution logs and performance metrics. Without them, you cannot proactively tell whether your solution is running as expected. Visibility into the runtimes of your solution components is more challenging in a hybrid cloud configuration than in an on-premises one. While your cloud service provider has total visibility into what is happening to your components, you may only be able to choose from a range of visibility levels, both for performance metrics and for other dimensions, such as stack configuration.
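Even when provider-side visibility is limited, you can instrument your own components. Here is a minimal in-process sketch (the `Metrics` class and the metric name are hypothetical) that records the counters and latencies you would later export to a monitoring system:

```python
import logging
import time
from collections import defaultdict

class Metrics:
    """Minimal in-process metrics: counters plus recorded latencies."""
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def observe(self, name, seconds):
        self.counters[name] += 1
        self.timings[name].append(seconds)

metrics = Metrics()
log = logging.getLogger("solution")

def handle_request():
    t0 = time.perf_counter()
    # ... the component's real work would happen here ...
    elapsed = time.perf_counter() - t0
    metrics.observe("requests", elapsed)                 # performance metric
    log.info("request served in %.3f ms", elapsed * 1000)  # log line

handle_request()
```

Instrumentation you own travels with the component, so it keeps working regardless of which cloud the component lands in.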
Visibility does not mean control: you may be able to see things but not act upon them. A lack of visibility impacts performance and resiliency engineering in several ways:
- Without performance metrics, you cannot detect that response time or throughput is degrading before your users do.
- Without logs, you cannot diagnose the root cause of a failure or a slowdown.
- Without visibility into the lower layers of the stack, you cannot tell whether a problem originates in your component or in the provider's infrastructure.
A software solution cannot be optimized for performance and resiliency without the proper tools for testing, monitoring, and diagnostics, and for making changes when needed. However, the tools that work with one cloud environment may not work with another. Additionally, one cloud provider may have tools that others do not.
For example, one of your cloud providers may use Prometheus for performance monitoring, while another may use something different. Similarly, one of your providers may use Istio to manage the runtime traffic between microservices clients and microservices, while another may use Hystrix.
In the traditional on-premises world, you always know that a particular solution component, such as a database, can be reached at a specific IP address. This may not be the case in a hybrid cloud solution. In a Kubernetes cluster, for example, only the IP addresses of the cluster nodes are known; the Kubernetes control plane can move a solution's containers from one node to another, driven by events and scheduling policies. To determine where the containers are at any point in time, you need a tool, namely kubectl. Because a component in this case does not have a fixed IP address, collecting its performance metrics and logs to assess its performance and resiliency must be done differently than in an on-premises environment. Moving a containerized component between nodes can also change the performance characteristics of the overall solution; for example, a component may be moved from a relatively idle node to a busy one where multiple components contend for physical resources.
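The indirection can be pictured with a toy registry: components address each other by a stable service name, and the platform (Kubernetes DNS, in practice) updates the name-to-address mapping when a container moves. The service names and IP addresses below are made up:

```python
# Current mapping of stable service names to pod addresses.
registry = {"orders-db": "10.42.3.17"}

def resolve(service_name):
    """Look up the current address of a service by its stable name."""
    return registry[service_name]

addr_before = resolve("orders-db")
registry["orders-db"] = "10.42.7.4"  # scheduler moved the pod to another node
addr_after = resolve("orders-db")    # same name, new address
```

Tooling that collects metrics or logs should therefore key off the stable name (or labels), never a captured IP address.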
Software solution components often run on infrastructure shared with other components, from the same or from other cloud clients, which can create performance and resiliency challenges. For workloads deployed in a Kubernetes cluster, containers can be moved from one node to another when certain resource constraints occur, but Kubernetes does not resolve every kind of constraint, nor does it help non-containerized solution components. For components that must provide predictable performance, private or dedicated infrastructure can help, though it should be treated as a remedial option for components that cannot get predictable behavior in a shared cloud environment.
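The scheduling side of this problem can be sketched as a simple fit check, mirroring how a Kubernetes-style scheduler filters out nodes that lack the free CPU or memory a pod requests. The capacities and requests below are illustrative numbers:

```python
def fits(node_free_cpu_m, node_free_mem_mi, pod_requests):
    """Check whether a pod's resource requests fit on a node,
    in the style of a Kubernetes scheduler's node filter."""
    cpu_m, mem_mi = pod_requests  # millicores, MiB
    return cpu_m <= node_free_cpu_m and mem_mi <= node_free_mem_mi

# Node with 500 millicores and 1 GiB free:
ok = fits(500, 1024, (250, 512))       # True: the pod fits
too_big = fits(500, 1024, (800, 512))  # False: CPU request exceeds free capacity
```

Setting honest resource requests is what lets the platform keep noisy neighbors from starving components that need predictable behavior.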
I would like to thank Tien Nguyen, IBM Distinguished Engineer, for his excellent feedback on this article. I would also like to thank Surya Duggirala, IBM Cloud Platform Engineering Guild Leader, for his review of and feedback on this article.