Application availability in a cloud environment

When you’re working in a cloud environment, it’s inevitable that you will need to upgrade the version you’re currently running to the most current version available. But how do you maintain the availability of your containerized applications during an upgrade?

In this blog post, I’ll discuss some tips for doing this and then walk through a scenario where I upgrade my IBM Cloud Private instance as an example. In an ideal situation, the application would experience zero downtime, which is what we’ll try to achieve.

How a typical containerized application works

Let’s see what a general containerized application looks like by using a sample online banking (OLB) application as our scenario. The diagram below illustrates the inner workings of a typical user app:

sample online banking app

  1. It has a front-end application that runs as a set of pod replicas and a back-end database server that runs as another set of pod replicas. The front-end pods and back-end pods communicate with each other, and the front-end application exposes itself to users at an externally reachable URL.
  2. For pod-to-pod internal access, a service is usually used inside the cluster. A set of pods exposes its functionality as a service, and other pods can then access it by calling the service name. When a pod calls a service name, a domain name service (KubeDNS in this particular scenario) resolves the service name to the cluster IP, and traffic to the cluster IP is load balanced across the back-end pods of the service. In our OLB example, the back-end application exposes itself as a back-end service, and the front-end pods call the back-end pods through that service name. The front-end application exposes a front-end service name in the same way. (To be clear, the diagram doesn’t show the front-end service workflow.)
  3. For external access, an ingress can be configured to expose a service at an externally reachable URL. In our OLB example, the front-end application also exposes itself as an ingress, and external users can access the OLB through the ingress URL. (A minimal sketch of this setup follows the list.)
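
To make this concrete, here is a minimal sketch of how the front end might be exposed, first as a service for pod-to-pod access and then as an ingress for external users. The names (olb-frontend, olb.example.com), ports, and ingress API version are illustrative assumptions rather than details of the actual OLB app:

    # Minimal, illustrative manifests; names, ports, and API versions are assumptions.
    apiVersion: v1
    kind: Service
    metadata:
      name: olb-frontend
    spec:
      selector:
        app: olb-frontend            # matches the labels on the front-end pods
      ports:
      - port: 80                     # port that other pods reach via the service name
        targetPort: 8080             # container port on the front-end pods
    ---
    apiVersion: extensions/v1beta1   # ingress API group in Kubernetes versions of this era
    kind: Ingress
    metadata:
      name: olb-frontend
    spec:
      rules:
      - host: olb.example.com        # external URL that users access
        http:
          paths:
          - path: /
            backend:
              serviceName: olb-frontend
              servicePort: 80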

What happens during an update?

So what can you expect from your applications’ availability during your cloud version upgrade?

Generally, you can expect the following three stages for your containerized application during an update:

  1. Preparing the cluster for upgrade
  2. Upgrading the cluster core components
  3. Upgrading the cluster add-on components

In the first stage, preparing the cluster for upgrade, the cluster data is normally backed up before the real upgrade process begins. In the second stage, the core components are upgraded to newer versions; using IBM Cloud Private as an example, components like the apiserver, controller-manager, and scheduler move to newer versions. Generally, applications don’t call the core components directly, so the first two stages won’t affect users’ applications!

In the third stage, add-on components like Calico, KubeDNS, and the NGINX Ingress Controller are upgraded. Because users’ applications rely on these components, you can expect some (minimal) outage during this portion of the upgrade.

Tips to implement ahead of an update

There are four places where outages can occur during an update.

  1. Container

    Besides the three upgrade stages mentioned above, you may also need to upgrade the container runtime version, such as Docker. Upgrading Docker restarts all the running containers on the host, which affects your application availability and is an outage that is hard to avoid.

  2. Container network

    Pod-to-pod communication depends on the stability of the container network, and upgrading network components may affect it. How stable the container network stays depends on your cloud cluster. Using ICP as an example, ICP uses Calico as the default Container Network Interface (CNI) plug-in, and there is no container network downtime during an upgrade.

  3. DNS

    As the typical containerized application above shows, internal pod communication depends on the cluster domain name service. Achieving a zero-downtime upgrade requires that DNS support both graceful shutdown and rolling upgrade. Graceful shutdown ensures that in-flight requests finish before a pod exits. A rolling upgrade requires at least two pod instances so that a pod is always available during the upgrade.

    If the upgrade can’t achieve graceful shutdown and rolling upgrade, the outage can last a few seconds during a DNS upgrade. During the outage, internal calls to a pod by service name might fail. (The deployment sketch after this list shows both properties.)

  4. Load balancer

    If the load balancer upgrade can’t achieve graceful shutdown and rolling upgrade, there will be an outage that affects external access.
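
To make graceful shutdown and rolling upgrade concrete at the workload level, here is a minimal deployment sketch for a hypothetical back end (olb-backend); the replica count, image, grace period, and preStop delay are illustrative assumptions. These are the same two properties the DNS and load balancer components need in order to upgrade without an outage:

    # Illustrative only: the name, image, and timing values are assumptions.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: olb-backend
    spec:
      replicas: 2                            # at least 2 pods, so one is always available
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1                  # replace pods one at a time
          maxSurge: 1
      selector:
        matchLabels:
          app: olb-backend
      template:
        metadata:
          labels:
            app: olb-backend
        spec:
          terminationGracePeriodSeconds: 30  # time for in-flight requests to finish
          containers:
          - name: olb-backend
            image: olb-backend:1.0           # hypothetical image
            lifecycle:
              preStop:
                exec:
                  command: ["sleep", "10"]   # keep serving briefly while endpoints update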

In general, DNS and load balancer outages can occur during an update, and they might affect your user application and therefore your users’ experience. Below are tips you can apply ahead of an update to improve application availability.

  1. Address a domain name service outage (KubeDNS)

    Use KubeDNS as an example. In Kubernetes, upgrading KubeDNS through its DaemonSet achieves a rolling upgrade; the outage comes mainly from KubeDNS not shutting down gracefully.

    To close this gap, you can add a preStop hook to the KubeDNS DaemonSet so that each KubeDNS pod shuts down gracefully within 10 seconds:

       lifecycle:
         preStop:
           exec:
             command:
             - sleep
             - 10s
    

    I also strongly suggest that you add a retry mechanism at the code level ahead of a known update. Many factors can affect pod-to-pod communication, such as an unstable network, and a retry mechanism helps improve your application’s availability. For example, below is a snippet that retries the connection up to 10 times.

     int retries = 10;
     boolean connected = false;
     for (int i = 0; i < retries && !connected; i++) {
       try {
         // your connection code logic to http://service-name
         connected = true;   // success: stop retrying
       } catch (Exception e) {
         // connection failed: loop and try again (log or back off as needed)
       }
     }
    
  2. Address a load balancer outage (NGINX Ingress Controller)

    Let’s use the NGINX Ingress Controller as an example. NGINX, the core binary used inside the NGINX Ingress Controller, already shuts down gracefully, and a rolling upgrade can be achieved by upgrading the DaemonSet.

    In an HA environment, an external load balancer (for example, HAProxy) is usually used to route requests to the cluster’s NGINX Ingress Controller. Here, the outage is mostly due to the time window before the external load balancer detects that the old NGINX Ingress Controller pod has exited.

    The NGINX Ingress Controller provides a default health check URI, http://node-ip/healthz, which can be used to build a better health check. HAProxy, for example, can perform a health check against an HTTP service. Here is an example configuration for the health check:

     listen icp-proxy
         bind :80, :443
         mode tcp
         option tcplog
         option httpchk GET /healthz
         http-check expect status 200
         server server1 172.16.205.111 check fall 3 rise 2
         server server2 172.16.205.112 check fall 3 rise 2
         server server3 172.16.205.113 check fall 3 rise 2
    

    Notes:

    • option httpchk GET /healthz means that GET is the HTTP method used for the health check request and /healthz is the URI it requests.
    • http-check expect status 200 sets the expected response status code to 200.
    • check fall 3 rise 2 sets the number of consecutive failed checks (3) before a server is considered DOWN and the number of consecutive successful checks (2) before it is considered UP again.

      The accuracy of the health check depends on your external load balancer. For HAProxy in our testing, there is a time window of around 2-3 seconds before it detects that an NGINX Ingress Controller pod is DOWN. So I strongly suggest that your application implement a retry mechanism if it’s sensitive to connection failures.

During your cloud upgrade, you might need to revert to an earlier version of your cloud instance if the upgrade fails. If you do revert, you can still keep your application available by avoiding the creation of new workloads or deployments, because the revert rolls the cluster data back to its previous state and any new workloads and deployments will be lost.

Testing results

Let’s look at our online banking (OLB) application’s availability during an upgrade. We’re using HAProxy as the load balancer and JMeter to test the connection during the upgrade.

# ./jmeter -n -t OLB.jmx -JHOST=load-balancer-hostname -JPORT=80 -j LOGS/jmeter/jMeter_test_log -l results.jtl -e -o LOGS/jmeter/resultReport -JTHREAD=100 -JDURATION=6000 -JRAMP=300
Creating summariser <summary>
Created the tree successfully using OLB.jmx
Starting the test @ Wed Jan 09 06:36:29 PST 2019 (1547044589185)
Waiting for possible Shutdown/StopTestNow/Heapdump message on port 4445
summary +      1 in 00:00:00 =    2.7/s Avg:   122 Min:   122 Max:   122 Err:     0 (0.00%) Active: 1 Started: 1 Finished: 0
summary +   1572 in 00:00:30 =   52.5/s Avg:   104 Min:   102 Max:   118 Err:     0 (0.00%) Active: 11 Started: 11 Finished: 0
summary =   1573 in 00:00:30 =   51.9/s Avg:   104 Min:   102 Max:   122 Err:     0 (0.00%)
summary +   4427 in 00:00:30 =  147.5/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%) Active: 21 Started: 21 Finished: 0
summary =   6000 in 00:01:00 =   99.4/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%)
summary +   7313 in 00:00:30 =  243.9/s Avg:   104 Min:   102 Max:   124 Err:     0 (0.00%) Active: 31 Started: 31 Finished: 0
summary =  13313 in 00:01:30 =  147.4/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%)
summary +  10172 in 00:00:30 =  339.1/s Avg:   104 Min:   102 Max:   131 Err:     0 (0.00%) Active: 41 Started: 41 Finished: 0
summary =  23485 in 00:02:00 =  195.2/s Avg:   104 Min:   102 Max:   131 Err:     0 (0.00%)
summary +  12990 in 00:00:30 =  433.0/s Avg:   104 Min:   102 Max:   237 Err:     5 (0.04%) Active: 51 Started: 51 Finished: 0
summary =  36475 in 00:02:30 =  242.7/s Avg:   104 Min:   102 Max:   237 Err:     5 (0.01%)
summary +  14292 in 00:00:30 =  476.4/s Avg:   114 Min:    69 Max:  6013 Err:    48 (0.34%) Active: 61 Started: 61 Finished: 0
summary =  50767 in 00:03:00 =  281.5/s Avg:   107 Min:    69 Max:  6013 Err:    53 (0.10%)
summary +  18618 in 00:00:30 =  620.6/s Avg:   106 Min:   102 Max:  6011 Err:     3 (0.02%) Active: 71 Started: 71 Finished: 0
summary =  69385 in 00:03:30 =  329.9/s Avg:   107 Min:    69 Max:  6013 Err:    56 (0.08%)
summary +  21588 in 00:00:30 =  719.5/s Avg:   104 Min:   102 Max:   133 Err:     0 (0.00%) Active: 81 Started: 81 Finished: 0
summary =  90973 in 00:04:00 =  378.5/s Avg:   106 Min:    69 Max:  6013 Err:    56 (0.06%)
summary +  24390 in 00:00:30 =  812.9/s Avg:   104 Min:   102 Max:   135 Err:     0 (0.00%) Active: 91 Started: 91 Finished: 0
summary = 115363 in 00:04:30 =  426.8/s Avg:   106 Min:    69 Max:  6013 Err:    56 (0.05%)

We can see from the output above that JMeter keeps running during the entire cloud upgrade process. The summary report counts failed connections every 30 seconds and accumulates them into a running total. From the results, we can also see that:

  • Between 2:30 and 3:00, there is a short period of downtime, which happens while the NGINX Ingress Controller is upgrading, because of the health check time window mentioned above.
  • In those 30 seconds, JMeter sent 14292 requests and 48 failed, a failure rate of about 0.34%. The failures look scattered across the 30-second interval, but in actuality they happen at the moment the old NGINX Ingress Controller pod exits. So, if your retry mechanism can handle it, your cloud upgrade won’t affect your app’s availability!

IBM Cloud Private and upgrade features

With the recent update to IBM Cloud Private (ICP) version 3.1.2, developers who are interested in a platform for developing on-premises, containerized applications can benefit from the new multi-version upgrade feature. (This means that you can upgrade from version 3.1.1 to 3.1.2 and from version 3.1.0 to 3.1.2.) ICP also provides support for user application availability when you upgrade to 3.1.2 in a high availability (HA) IBM Cloud Private cluster.

If you already use or want to use ICP, upgrading to ICP 3.1.2 means that management components are rolled to newer versions while application pods continue to run during the upgrade. In general, traffic to applications continues to be routed. (Refer to the ICP 3.1.2 Knowledge Center and note that there is still a short outage during the upgrade.)

Other features of ICP’s upgrades include:

  1. User application pods won’t be affected. All the pods keep running, which means no pods exit or restart.
  2. For internal pod-to-pod communication, there is no downtime in the pod container network.
  3. For internal access, KubeDNS keeps working when services are called by name. (As the ICP 3.1.2 Knowledge Center mentions, there is still a short outage during the kube-dns upgrade.)
  4. For external access, ICP uses the NGINX Ingress Controller. You need to configure the health check in your external load balancer to avoid an outage.

Next steps

Multicloud environments are on the rise, so it’s important to evaluate your cloud needs. If you don’t already have a cloud computing environment where access is limited to members of an enterprise and partner networks, then you might want to consider checking out IBM Cloud Private. Already an ICP user? See what you can do with our code patterns that are based on ICP. Or test your knowledge on multi-cloud management with our Learning Path.

Qing Hao