By Julian Wang, Neil Graham, Vijay Sundaresan, Mel Bakhshi | Published April 20, 2018
In this article, we discuss best practices for getting the best performance from applications running in the Liberty profile of IBM® WebSphere® Application Server on IBM Power® System S9xx and L922 systems (based on the recent IBM POWER9™ processor technology). These best practices should apply to most Java™ applications, even those running outside WebSphere Application Server. We use the deployment of a particular application, Acme Air, running in IBM Cloud Private as a case study to demonstrate the best practices and their benefits. For completeness, and to reflect the growing importance of IBM Cloud Private, we also discuss some techniques we used to tune the IBM Cloud Private environment for POWER9.
Applications that run on optimized hardware can effectively utilize the underlying resources of the system. As an example, microservice-based applications based on WebSphere Liberty, which is included as part of IBM Cloud Private, provided 1.86 times per core performance, 43% lower solution costs, and 1.66 times better price-performance on IBM Power L922 compared to Intel® Xeon® Gold 6130.
The use of microservices in cloud-native applications has several benefits, especially when running on optimized hardware. We demonstrate these benefits through several techniques on POWER9 hardware. We also discuss the best practices that we employed when running a Java microservices application on POWER9. These techniques can help you get the maximum performance out of Java microservices running on IBM Cloud Private.
This section provides a set of application performance guidelines for Java- and Liberty-based workloads. We will refer to Acme Air as a case study in their usage.
In general, transactional Java-based applications benefit from multiple threads to produce higher throughput. The Acme Air workload is no exception: it is an online flight reservation system designed to serve a large number of web API calls (the measure of transactions, or throughput) per day.
Each processor core in the Power S9xx and L922 servers supports up to eight simultaneous multithreading (SMT) threads in hardware (AC922 cores have up to four threads). On Linux, each such SMT thread is represented as a virtual processor, whereas on IBM AIX®, each is represented as a logical processor. In other words, each POWER9 processor core supports running up to eight vCPUs on Linux or eight logical processors on AIX. Regardless of the terminology, core SMT level is an option that customers can choose during system boot or change dynamically in an active system. Obviously, the total number of virtual or logical processors in a system or partition depends on the core SMT level chosen. The following table shows the relationships:
For example, a system with four processor cores running in the SMT4 mode will have 16 virtual or logical processors while one with the same number of processor cores running in the SMT8 mode will have 32 virtual or logical processors.
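The relationship between cores, SMT level, and visible processors is a simple product. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and the set of accepted SMT levels (1, 2, 4, 8) is an assumption for S9xx and L922 systems.

```python
def logical_processors(cores: int, smt_level: int) -> int:
    """Return the number of vCPUs (Linux) or logical processors (AIX).

    Assumes SMT levels of 1, 2, 4, or 8; this helper is illustrative only.
    """
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    if smt_level not in (1, 2, 4, 8):
        raise ValueError("SMT level must be 1, 2, 4, or 8")
    return cores * smt_level


# The two cases from the text: four cores in SMT4 and in SMT8 mode.
print(logical_processors(4, 4))  # 16
print(logical_processors(4, 8))  # 32
```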
Large workloads using many threads on many-core systems face extra challenges with respect to concurrency and scaling. In such cases, steps can be taken to decrease contention on shared resources and reduce overhead. However, for Java workloads on POWER9, we strongly recommend SMT8 mode as the default for running the system. We have evaluated a substantial set of workloads to identify the performance benefit going from the SMT4 mode to the SMT8 mode on the system. The following table shows some of them:
More details of the DayTrader7 workload throughput on a six-core partition of a Power S924 are shown in Figure 1.
Most applications do benefit from SMT, but some applications do not scale with an increased number of vCPUs or logical CPUs on an SMT-enabled system. One way to address such an application scalability issue is to change to a lower SMT mode with fewer vCPUs or logical CPUs. This can usually be done dynamically without rebooting, and there are generally few concerns about adverse performance effects from the change in system topology.
On the other hand, when the SMT level (mode) is increased, we strongly recommend doing it over a maintenance window or reboot, and don’t recommend doing it dynamically, unless you are aware of the following few possible side effects and necessary care has been taken to address them:
You can enable large page support on systems that support it by starting Java with the -Xlp option. On certain processors, the JVM now starts with large pages enabled by default. Large page usage is primarily intended to provide performance improvements to applications that allocate a great deal of memory and frequently access that memory. The large page performance improvements are a result of the reduced number of misses in the translation lookaside buffer (TLB). The TLB maps a larger virtual storage area range and thus causes this improvement. Large page support must be available in the kernel, and enabled, so that Java can use large pages.
Refer to Configuring large page memory allocation for more details about it.
Another option is to run with transparent huge pages (THP) enabled in the kernel, and in this case, the kernel uses large pages even though the user has not configured large pages explicitly. Transparent huge pages generally improve the throughput performance of Java applications, but it is possible that for certain non-Java applications the feature is not beneficial, or even harmful in terms of performance. Because this kernel feature can only be enabled system-wide, care must be taken to set the transparent huge pages switch to “madvise” instead of enabling unconditionally (using “always” for the huge pages switch) for all applications on the system. This allows the kernel to selectively use transparent huge pages only in beneficial cases.
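On Linux, the current transparent huge pages mode is reported in /sys/kernel/mm/transparent_hugepage/enabled, where the active choice is shown in square brackets (for example, "always [madvise] never"). The following is a minimal sketch for checking which mode is active; the helper name is hypothetical, and the script only reads the setting without changing it.

```python
from pathlib import Path

# Standard sysfs location of the system-wide THP switch on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from a line such as
    'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print(active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("transparent huge pages are not exposed by this kernel")
```

If the reported mode is "always" on a system that also hosts non-Java applications, that is a cue to consider switching to "madvise" as described above.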
Figure 2 shows the number of page faults with and without large pages enabled (large pages were used only in Liberty, through the -Xlp option). There is a big drop in the number of page faults when large pages are enabled on POWER9.
One of the many consequences of running on the cloud is that there are usually a greater number of layers involved when compared with running on a bare metal machine. The layers could refer to the virtualization of different resources (processor, storage, and so on), or they could simply be that completing a task may involve more network hops across a large farm of cloud machines, because the location of the machine having the resource (say, the database) is more uncertain than it is in a more controlled on-premises environment. Regardless, the presence of more layers invariably leads to performance overheads and so the latency associated with each task could be significantly higher when running on the cloud. Depending on application design, higher latency environments can require many more application threads to fully use the available CPU resources, as threads may spend time blocked on remote task execution.
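To see why latency drives thread count, consider a common sizing rule of thumb: if each request computes for some time and then blocks on a remote call, roughly CPUs x (1 + blocked time / compute time) threads are needed to keep every CPU busy. The sketch below applies that heuristic; it is an assumption used for illustration and not the algorithm used by Liberty, and the timings are made-up example values.

```python
import math


def threads_to_saturate(cpus: int, compute_ms: float, blocked_ms: float) -> int:
    """Estimate the threads needed to keep `cpus` busy when each request
    computes for compute_ms and waits on remote work for blocked_ms.

    Rule-of-thumb estimate only; real workloads need measurement.
    """
    if cpus < 1 or compute_ms <= 0 or blocked_ms < 0:
        raise ValueError("cpus and compute_ms must be positive, blocked_ms non-negative")
    return math.ceil(cpus * (1 + blocked_ms / compute_ms))


# Hypothetical example: 64 logical CPUs, 2 ms of computation per request.
print(threads_to_saturate(64, 2.0, 0.0))  # 64  (no remote waits)
print(threads_to_saturate(64, 2.0, 6.0))  # 256 (6 ms blocked per request)
```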
Starting in Liberty 18.104.22.168, the default thread pool autonomics were enhanced to perform better in cloud (high latency) scenarios, thereby removing the need to manually tune the Liberty thread pool settings. You can find many more details of these changes for better out-of-the-box performance in cloud scenarios at: https://developer.ibm.com/wasdev/docs/was-liberty-threading-and-why-you-probably-dont-need-to-tune-it/.
If customers have a significant variety in their workloads and deployment environments, the latencies experienced by those workloads in different deployment environments could vary quite a lot, making it a challenge to tune manually in all cases. This is the main reason that we recommend that customers use the default Liberty thread pool autonomics: they are designed to adapt transparently in all cases, ensuring that the customer does not suffer from sub-optimal performance because manual tuning that worked well in one case did not work well in others. Note that if a customer’s objective is to get every last percentage of performance and the customer is willing to invest the time to manually tune each deployment, the results will likely be at least as good as (and possibly even better than) those achieved by the Liberty thread pool autonomics. But we consider the cost of tuning to be significant enough to be impractical for most customers, and in those cases it would be better to rely on the Liberty thread pool autonomics to get nearly optimal performance in all deployments at minimal operational cost.
Note that the Liberty thread pool size is unrelated to the number of garbage collection (GC) threads employed by the JVM. Because GC is a stop-the-world process, the JVM applies as many threads as it considers optimal (usually the number of logical CPUs) to parallelize GC activities and shorten the GC pause duration.
The IBM Cloud Private infrastructure shown in Table 1 is configured for running the Acme Air workload.
We used the following software stack to deploy the microservices:
At a high level, the workload is made up of five Java-based microservices and three MongoDB databases. These services are based on WebSphere Application Server Liberty Docker images. Figure 3 shows a view of the microservices interactions within the IBM Cloud Private environment, where we used JMeter as a workload driver that sends transactions through a proxy server that is part of the IBM Cloud Private cluster. To saturate the CPU on the worker node, the microservices were scaled up to a total of 20 JVM instances (four JVM instances per microservice).
You can find more details about the Acme Air workload at: https://github.com/blueperf
The following sections describe the tunings we used for the IBM Cloud Private application.
The mpstat reports captured during the workload showed that all of the inbound traffic was coming through a single interrupt request (IRQ) queue and being routed to a single CPU, which ran at up to 100% utilization. Figure 4 and Figure 5 show that single CPU spending 99% of its time in SoftIRQ processing related to the interrupts of the network adapter used for data communication. This confirms that the NIC was delivering all the interrupts to one receive (RX) queue instead of distributing them across the available RX queues.
Whichever CPU serves that single RX queue gets overwhelmed (99% busy). We first moved the adapter to a different PCI slot so that it would connect directly to a CPU socket rather than being switched through an intermediate device, in case the RX queue issue was related to the intervening device. Though that change is beneficial for network communication in general, it did not resolve the high SoftIRQ load. To resolve the issue, we needed to manually make all CPUs (0-63) available to each of the adapter’s interrupts. For example, if the adapter has eight queues available and the first queue (IRQ 72 here) is bound to a single CPU in /proc/irq/72/smp_affinity_list, use the following command to allow the queue to use all the CPUs:
echo 0-63 > /proc/irq/72/smp_affinity_list
This change allowed us to distribute the SoftIRQ load across all available CPU cores.
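Because CPU numbering starts at zero, the highest CPU on a 64-CPU system is 63. The following is a minimal sketch that builds the affinity range and the corresponding command for a given IRQ; the function names are hypothetical, the IRQ number 72 is taken from the example above, and actually writing the file requires root privileges (the sketch defaults to a dry run).

```python
def cpu_list(cpu_count: int) -> str:
    """Format CPUs 0..cpu_count-1 in the range syntax accepted by
    /proc/irq/<n>/smp_affinity_list."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be a positive integer")
    return "0" if cpu_count == 1 else f"0-{cpu_count - 1}"


def affinity_path(irq: int) -> str:
    """Return the procfs file that holds the CPU list for an IRQ."""
    return f"/proc/irq/{irq}/smp_affinity_list"


def spread_irq(irq: int, cpu_count: int, dry_run: bool = True) -> str:
    """Return the equivalent shell command; write the file only if
    dry_run is False (needs root)."""
    cpus = cpu_list(cpu_count)
    if not dry_run:
        with open(affinity_path(irq), "w") as handle:
            handle.write(cpus)
    return f"echo {cpus} > {affinity_path(irq)}"


# Eight cores in SMT8 mode expose 64 CPUs, numbered 0 through 63.
print(spread_irq(72, 64))  # echo 0-63 > /proc/irq/72/smp_affinity_list
```

The same call would be repeated for each of the adapter's queue IRQs.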
Figure 6 shows mpstat after the change.
The Data Stream Control Register (DSCR) controls how aggressively memory is prefetched for loads and stores. Generally, for sequential data access patterns, hardware-based memory prefetching can improve performance by reducing the impact of cache miss latency: it prefetches data cache lines from L2, L3, and main memory into the L1 data cache for quick access. Performance of Acme Air improves slightly (1-2%) when the hardware prefetch is turned off. The Acme Air data access can be considered random, and we believe this is the reason why the hardware prefetcher does not improve performance. This is in fact fairly typical of Java applications, so it is worth checking on a case-by-case basis whether the hardware prefetcher should be enabled (to ensure that cases where the prefetcher helps do not get missed).
For performance measurements, we wanted to run at a higher CPU frequency, and because the Acme Air workload is CPU-bound, we set the frequency to the performance level through the Advanced System Management (ASM) interface. Figure 7 shows the setting used.
To confirm that we were using the highest CPU frequency during the workload, we used the following command from the ASM interface (shown with a line of its output):
getclockspeed pu.ex pu_coreclock mhz -all
p9n.ex k0:n0:s0:p00:c13137
We enabled verbose GC logging and collected the logs for each run along with the server logs. We added the option -Xverbosegclog:./logs/verbosegc.%seq.log,20,10000 to the Liberty JVM arguments so that the verbose GC logs appear in the same location as the other server logs. The verbose GC logs showed that the booking service JVM was spending more than 15% of the processor time doing GC during a test period.
The GC log showed that the level of GC activity was excessive, which led us to try a larger heap on all the Liberty instances. This reduced the CPU time spent in GC and freed it for application work. By default, each JVM starts a number of GC threads equal to the number of hardware threads, and when a JVM performs a GC operation, all of its GC threads are busy. This might work well on a small system with a low CPU count, but if many JVMs run in a single OS, GC thread contention can occur. Ideally, the total number of GC threads across all JVMs should not exceed four times the number of hardware threads. With eight cores at SMT8 (64 hardware threads) and 20 JVMs, we set the number of GC threads per JVM to 8.
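The GC thread guideline can be checked with simple arithmetic. The following is a minimal sketch, assuming the guideline applies to the sum of GC threads across all JVMs on the system; the function names are hypothetical, and the numbers are the ones from this case study.

```python
def gc_thread_budget(cores: int, smt_level: int, factor: int = 4) -> int:
    """Upper bound on total GC threads: factor x hardware threads."""
    return cores * smt_level * factor


def total_gc_threads(jvms: int, gc_threads_per_jvm: int) -> int:
    """Total GC threads when every JVM uses the same -Xgcthreads value."""
    return jvms * gc_threads_per_jvm


cores, smt_level, jvms = 8, 8, 20
hardware_threads = cores * smt_level           # 64
budget = gc_thread_budget(cores, smt_level)    # 256

# Default behaviour: one GC thread per hardware thread in every JVM.
print(total_gc_threads(jvms, hardware_threads))   # 1280, far above the budget
# Setting used in the case study: -Xgcthreads8.
print(total_gc_threads(jvms, 8))                  # 160, within the budget
print(total_gc_threads(jvms, 8) <= budget)        # True
# Largest per-JVM value that would still fit the budget.
print(budget // jvms)                             # 12
```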
The following JVM options were used during the runtime:
-Xmx2g
-Xmn1792m
-Xgcthreads8
-Xlp
-Dhttp.keepalive=true
-Dhttp.maxConnections=700
-Dcom.acmeair.client.CustomerClient/mp-rest/url=http://nginx1/customer
-Dcom.acmeair.client.FlightClient/mp-rest/url=http://nginx1/flight
-Xverbosegclog:./logs/verbosegc.%seq.log,20,10000
The default settings of the ingress controller in IBM Cloud Private work for most development-related activities. However, for a workload with high network traffic, we needed to adjust them. The ingress controller of IBM Cloud Private is based on NGINX. Refer to: https://www.nginx.com/blog/introducing-nginx-kubernetes-ingress-controller/
To handle the JMeter traffic, we needed to allow more worker connections and increase the timeout values for the IBM Cloud Private proxy server through the ingress controller. Figure 8 shows the settings that were added.
When deploying the Acme Air microservices within IBM Cloud Private, we also need to deploy ingress services that are used for this application. Refer to the following URL for ingress service configuration: https://www.ibm.com/support/knowledgecenter/en/SS5PWC/front_end_config_cfc_task.html
Because the ingress service exposes the application to the external requests, we need to change its configuration to handle more requests. The following settings were used:
ingress.kubernetes.io/rewrite-target: /
ingress.kubernetes.io/ssl-redirect: "false"
ingress.kubernetes.io/connection-proxy-header
In this article, we set out some general best practices for getting the most out of your WebSphere Application Server or Java application running on the new POWER9 Enterprise systems. We used these, together with a number of best practices that are specific to the IBM Cloud Private environment, to get great performance out of an application running on IBM Cloud Private. As the data we present here and elsewhere shows, the combination of IBM Cloud Private, WebSphere Application Server Liberty, and POWER9 systems offers great performance with attractive price points and impressive density.