If you’ve ever used WAS traditional, you’re used to having lots of thread pools, and you’re used to tuning them. You want to maximize the performance of your server, and adjusting thread pool sizes is one of the most effective ways of doing so.

So it’s only natural that, the first time you create a Liberty server, you want to find all the bells and whistles for configuring the thread pools and you want to play around with them. You might even be tempted to adjust the thread pool settings before deploying your first application because you just know that you’re going to need to, right?

Wrong (probably).

The Liberty threading model is, quite simply, completely different from the WAS traditional threading model. First of all, WAS traditional has multiple thread pools, whereas Liberty has a single thread pool called the default executor. This doesn’t mean that every single thread in a thread dump is in this thread pool. If you take a thread dump and look closely, you’ll see a bunch of utility threads such as OSGi framework threads, JVM garbage collection threads, Java NIO selector threads, and so on.

What I mean by a single thread pool is that all of the application code runs in a single thread pool. (This isn’t completely true … there are a few edge cases where application code might run outside of the default executor, but it’s not worth worrying about.)

Okay, so if all application code runs in a single thread pool, it must be REALLY important to tune it. Right?

Nope, not really. The defaults are actually very good. More importantly, the defaults are very good for a wide range of workload types.

Threading settings

Let’s take a look at some of the threading settings that are configured on the executor element, what their defaults are, and what they mean:

  • coreThreads – This is essentially a ‘minimum threads’ value; once Liberty creates enough threads to exceed the coreThreads value, we’ll never get rid of threads to drop below it. The underlying executor creates a new thread for each piece of offered work until there are coreThreads threads in the pool. Once the coreThreads size is reached, the Liberty thread pool auto-tuning algorithm controls the number of threads in the range between coreThreads and maxThreads.

    The default value for coreThreads is -1, which means that at runtime we set coreThreads to a multiple of the number of hardware threads on your system. (Currently, that multiple is 2, but we reserve the right to change that.)

  • maxThreads – This one is pretty obvious. It’s the maximum number of threads that we can possibly create for this thread pool. Ever.

    The default value is -1, which translates to MAX_INT or, essentially, infinite.

    You might be thinking, isn’t ‘infinite’ kind of a bad default for maxThreads? It’s actually quite a sensible default. That’s because Liberty uses an auto-tuning algorithm to find the sweet spot for how many threads the server needs. I’ll go into more detail below but, essentially, Liberty is always playing around and adjusting the number of threads in the pool in-between the defined bounds for coreThreads and maxThreads.

    Saying that the default for maxThreads is infinite is basically saying “do NOT restrict the Liberty auto-tuning algorithm”: let it do its job with no bounds. Don’t worry, though; setting maxThreads to the default doesn’t mean that Liberty WILL create MAX_INT threads. We technically could, but it would never be beneficial to do so, so we never come remotely close.

  • keepAlive – The name implies that this is the amount of time an idle thread will remain in the pool before it goes away. However, due to the details of how the auto-tuning algorithm works (explained below), this setting never comes into play. The default is 60s but, as I said, it simply never comes into play.

  • name – This is the name of the thread pool, and it’s also part of the name of the threads that live in this pool. The default is Default Executor. Don’t change it. There’s no point, and it just makes it more difficult to find the default executor threads in a thread dump if you happen to be looking for them.

  • rejectedWorkPolicy – This is what happens when a piece of work gets submitted to the executor but the work queue that backs the executor is full. You can choose to either have the submitting thread run the work, or you can choose to have an exception be thrown to the submitter. Here’s the thing, though… the work queue that backs the default executor is infinite. If we reject work, it’s because your server is out of memory, in which case you’ve got bigger problems than what to do with rejected work. The default is to throw an exception to the submitter, and there’s no reason to change it.

  • stealPolicy – This is a dead setting. Prior to Liberty V8.5.5.2, the default executor used a series of thread-local work queues that could steal from each other in an attempt to boost performance. As it turns out, it didn’t boost performance much, if at all, and it caused a lot of headaches. So we removed this feature from the default executor but, for backwards compatibility, we still have to honor this configuration option. The stealPolicy setting controlled some of the behavior of that work-stealing feature but now that the feature is gone, this setting does absolutely nothing.

Alright, so to summarize thus far, the only executor settings that are remotely interesting to change are coreThreads and maxThreads. These settings, as already discussed, serve as bounds for the Liberty auto-tuning algorithm. Let’s get into a little more detail about how that algorithm works.
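For reference, both bounds are set on the executor element in server.xml. A minimal sketch (the values shown are purely illustrative, not recommendations; the defaults of -1 are usually what you want):

```xml
<!-- server.xml: these bounds constrain the auto-tuning algorithm.
     Illustrative values only; omit the element entirely to keep the
     defaults (coreThreads="-1" maxThreads="-1"). -->
<executor coreThreads="16" maxThreads="256"/>
```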

How the auto-tuning algorithm works

The Liberty default executor is broken into two pieces: (a) the underlying implementation, and (b) the controller thread. The underlying implementation is the actual physical thread pool, and the controller thread determines the thread pool size of the underlying implementation. The controller thread is free to choose any pool size in between coreThreads and maxThreads but note that, from the perspective of the underlying implementation, the pool size is always constant.

Here’s an example. Let’s say you use the defaults for coreThreads and maxThreads and that you have 8 hardware threads on your system. At run time, the controller thread is bounded by coreThreads of 16 and maxThreads of MAX_INT. Let’s say that the controller thread determines that 18 is the current optimal number of threads. The controller thread then sets BOTH coreThreads and maxThreads of the underlying implementation to the same value, 18.

Note that this is why the keepAlive setting of the executor is useless. At the level where it matters (on the underlying implementation), coreThreads is always equal to maxThreads, so there are never any idle threads sitting around waiting to go away.

Now, just because the controller thread determined that 18 is the optimal pool size RIGHT NOW doesn’t mean that it can’t change its mind. It’s actually running on a loop and analyzing throughput. It’s constantly recalculating what the optimal pool size is based on its throughput observations.

If you start a server and don’t run any workload, it’ll settle on a lower value for the pool size. If you ramp up the workload, the server will adjust and increase the number of threads, although I should note that this might take a few minutes. The algorithm doesn’t want to over-respond to quick changes in workload, so it does take its time increasing the number of threads (or decreasing them, if workload gets reduced).
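Liberty’s actual controller logic is internal and considerably more sophisticated, but the basic idea of throughput-driven sizing can be sketched as a toy hill-climbing loop. Everything below is invented for illustration (including the fake throughput curve that peaks at 18 threads); it is not Liberty source code:

```java
// Toy simulation of a throughput-driven, hill-climbing pool controller.
// Idea: measure throughput at the current size, and keep moving the size
// in whichever direction last improved throughput, within the bounds.
public class PoolControllerSketch {

    // Hypothetical workload: throughput peaks at a pool size of 18.
    static double observedThroughput(int poolSize) {
        return 100.0 - 2.0 * Math.abs(poolSize - 18);
    }

    static int tune(int coreThreads, int maxThreads, int cycles) {
        int size = coreThreads;          // start at the lower bound
        int direction = +1;              // first guess: grow the pool
        double last = observedThroughput(size);
        for (int i = 0; i < cycles; i++) {
            int candidate = Math.max(coreThreads, Math.min(maxThreads, size + direction));
            double current = observedThroughput(candidate);
            if (current < last) {
                direction = -direction;  // throughput got worse: reverse course
            }
            size = candidate;
            last = current;
        }
        return size;
    }

    public static void main(String[] args) {
        int settled = tune(16, Integer.MAX_VALUE, 100);
        System.out.println("Controller settled near pool size " + settled);
    }
}
```

Each loop iteration stands in for one controller evaluation cycle; the real controller also has to cope with noisy throughput measurements, which is exactly what the historical-data and aging enhancements described later in this post address.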

Fighting deadlocks in the executor

Those are the basics of how the threading model works, but let me discuss one more detail that comes into play in Liberty V8.5.5.6. Prior to that version, it was sometimes possible to deadlock the executor. In other words, all threads in the executor would be occupied, but they would all be waiting for OTHER work to complete: work that had been queued to the executor, which had no threads left to run it. The Liberty auto-tuning algorithm did not handle this situation very well and would sometimes give up trying to add threads to break the deadlock.

This behavior led a lot of folks to set the coreThreads value of the executor to a high number to ensure that the executor never deadlocked. However, in V8.5.5.6, we modified the auto-tuning algorithm to aggressively fight deadlocks. Now, it is essentially impossible for the executor to deadlock. So if you’ve manually set coreThreads in the past to avoid executor deadlocks, you might want to consider reverting back to the default once you move to V8.5.5.6.

That’s it in a nutshell.

What should you take away from this? Don’t tweak the default executor settings (unless you really, really have to)! We try really, really hard to make the defaults work for as many types of workloads as possible. Yes, there will be some edge cases where you may need to adjust coreThreads and maxThreads, but at least try the defaults first.

Optimizing for cloud

One of the many consequences of running on the cloud is that there are usually more “layers” involved than when running on a bare-metal machine. The layers could be the virtualization of different resources (CPU, storage, etc.), or simply that completing a task may involve more network hops, since the location of the machine holding a resource (say, the database) is more uncertain across a large farm of cloud machines than it is in a more controlled on-premises environment. Regardless, the presence of more layers invariably adds performance overhead, so the latency associated with each task can be significantly higher when running on the cloud. Depending on application design, higher-latency environments can require many more application threads to fully exploit the available CPU resources, since threads may spend time blocked on remote task execution.

Starting in a later Liberty release, the default thread pool autonomics were enhanced to perform better in cloud (high-latency) scenarios.

Higher-latency tasks may require large thread pool sizes for optimal throughput. The prior controller adjusted the pool size by only +/-1 thread at each evaluation cycle, which could take a long time to grow the pool to a large size. With this enhancement, the controller adjusts the pool size by a multiple of the number of hardware threads available. Using hardware-thread-based increments makes the adjustments proportional to the computing resources available for the Liberty threads to use, and allows the pool size to grow more quickly. The hardware-thread multiplier used to determine the increment/decrement value starts at one and is increased as the pool size grows past threshold levels (the thresholds are also defined as multiples of the number of hardware threads), so that the size of each adjustment remains proportional to the pool size.
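The exact thresholds and multipliers are internal to Liberty, but the shape of that logic can be sketched roughly like this (a hypothetical illustration; THRESHOLD_FACTOR is an invented constant, not Liberty’s actual value):

```java
// Hypothetical sketch: the pool-size adjustment step is a multiple of the
// hardware thread count, and the multiple grows as the pool crosses
// thresholds that are themselves multiples of the hardware thread count.
public class PoolIncrementSketch {
    // Assumed threshold: bump the multiplier every 4 x hwThreads of pool size.
    static final int THRESHOLD_FACTOR = 4;

    static int poolIncrement(int poolSize, int hwThreads) {
        int multiplier = 1 + poolSize / (THRESHOLD_FACTOR * hwThreads);
        return multiplier * hwThreads;
    }

    public static void main(String[] args) {
        int hw = 8; // e.g. a machine with 8 hardware threads
        System.out.println(poolIncrement(16, hw));  // small pool: step of 8
        System.out.println(poolIncrement(100, hw)); // larger pool: bigger step
    }
}
```

The point of this shape is that a pool of a few dozen threads moves in small steps, while a pool of several hundred threads moves in proportionally larger ones, so growth time stays roughly bounded.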

As discussed earlier, the controller uses observations of throughput at different pool sizes to decide whether to grow or shrink the pool. However, there are a variety of factors other than the pool size that can affect throughput, such as competing workloads, Java GC pauses, or unknown delay factors on other systems involved in the transaction. The prior implementation considered only historical throughput data in the range of the current pool size +/-1. Within that narrow range, the correlation between pool size and throughput may not be very strong, given the other factors that can affect throughput. So the enhanced controller considers throughput for a broader range of pool sizes when making grow/shrink decisions: it looks at historical throughput data for ‘current-pool-size +/- (pool-increment * N)’, where N is an internal constant. This broader scope improves the correlation between pool size and throughput, making it less likely that the controller will be misled by random noise (throughput variation not related to pool size) in the historical data.

Applications can undergo a change in behavior (a ‘phase change’) for a variety of reasons: application input, the environment, and the language runtime can all change over time. Such workload changes could cause the historical throughput data to be unrepresentative of the current state of the system. To reduce the probability of making grow/shrink decisions based on unrepresentative throughput data, an aging factor was introduced that discards the data for a pool size if that pool size has not been tried recently. Aging out old data improves the controller’s ability to adapt to changes in workload conditions.

The CPU resources available to the Liberty server are an important input to thread pool grow/shrink decisions; in particular, if CPU usage is already high, adding threads is unlikely to improve throughput. This is pretty much common sense; manual thread pool tuning exercises always consider CPU utilization when seeking the optimal pool size. As part of this enhancement, a ‘CPU high’ indicator was added to the thread pool controller: if a ‘CPU high’ condition is detected, the controller is less inclined to grow the thread pool and more inclined to shrink it. The CPU state is monitored by reading Java MBeans for Process CPU (percent utilization of the CPUs available to the Java process) and System CPU (percent utilization of all CPU resources in the system).
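Those two metrics are available from any HotSpot JVM through the com.sun.management extension of OperatingSystemMXBean. A minimal reading, to show where the numbers come from (this is not Liberty’s internal monitoring code):

```java
import java.lang.management.ManagementFactory;

// Reads the process-level and system-level CPU utilization metrics
// mentioned above via the com.sun.management OperatingSystemMXBean.
public class CpuReading {
    public static void main(String[] args) {
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        // Both values are fractions in [0.0, 1.0], or a negative value
        // when the metric is not yet available (e.g. on the first call).
        double processCpu = os.getProcessCpuLoad(); // this JVM's share of CPU
        double systemCpu  = os.getSystemCpuLoad();  // whole-system utilization
                                                    // (deprecated in favor of
                                                    // getCpuLoad() on JDK 14+)

        System.out.printf("process CPU: %.3f, system CPU: %.3f%n",
                processCpu, systemCpu);
    }
}
```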

These changes to Liberty produced significant speedups in application throughput in several high latency test cases, using the default thread pool executor without any tuning. These high-latency throughput improvements were achieved without generating too many threads for low-latency workloads, so low-latency throughput was on par with the historical controller implementation.

Updated 2018-03-16 by Gary DeVal for a WebSphere Liberty release.

10 comments on “WebSphere Liberty threading (and why you probably don’t need to tune it)”

  1. Hi Joe, let me get some clarifications on your scenario. In your example, where is the Liberty server; is it the front-end server running the batch job, or is it the backend server pulling messages off of the queue and processing them?

    Are you using the thread pool to throttle how quickly messages get placed ONTO the queue by the frontend, or how quickly they are picked up by an activation spec (or listener port) on the backend? I’m assuming that you are talking about the latter.

    If that is the case, I will have to check with the JMS folks about what kind of throttles exist on the activation spec. If you didn’t throttle the activation spec at all, then what would happen is that messages would get picked up and placed onto the global queue of the default executor at a very fast pace. The size of the global queue would increase quite rapidly.

    The actual throughput of the server should be okay, because the auto-tuning algorithm is still going to adjust the number of threads to obtain max throughput. For example, if 1,000 messages got pulled off the queue and queued to the default executor global queue on a 2 processor system, the auto-tuning algorithm still probably won’t create more than 8 or so threads on average. This can vary wildly depending on how CPU vs I/O intensive the work is, but the algorithm will handle it. The messages will get processed at a solid, hopefully optimal, rate.

    The downside here is that all of these messages on the global queue will block other work, so if you have mixed workload (some HTTP work going into the server, for example), it might get stuck on the global queue behind the messages.

    So for that reason, I can definitely see why throttling at the activation spec layer would be desirable. I’m not sure what kind of support exists for that. If you could please just let me know that I understand your question correctly, I will then follow up with the JMS folks and get you an answer.

    • What you said about the activation spec is interesting… and I am interested in the kind of support that WLP has… though typically I will probably have different JMS endpoints for different-priority workloads in the first place.

      But actually my case is WLP as the front-end server (running a batch job), and I need to throttle how quickly messages get placed ONTO the queue (to the other end, which is a mainframe).

      • I would normally suggest the use of a managed executor service or managed thread factory (the concurrent-1.0 feature in WAS Liberty) with the maximum number of threads limited to the desired value. However, I’m not aware of any directly configurable way to limit the number of threads here (yes, it can be done programmatically using the thread factory, but that isn’t as elegant as a work manager with a thread limit in WAS classic). So I’m also curious.

  2. Good article, explains it very well.

    I suspect that with the control settings in the classic edition, people might be using them for things you might not have anticipated. So maybe here is one… we used to run a batch job that processes rows in an incoming file and sends an MQ message to the backend. We do not want the backend to be overwhelmed if, out of the blue, someone drops an unexpectedly large file. So we limit the concurrency with the max thread setting of the thread pool allocated to this job.

    Now if in WLP we can’t have a dedicated pool for this job, what are the alternatives? I think a dedicated JMS connection pool for this job with its own max conn setting?

    • Oops, I posted a new comment instead of replying directly to yours. Please see above for my response.

    • Hi Joe, just wanted to let you know that I didn’t forget about your question. I personally don’t have JMS experience, but I’m trying to locate someone who does and who can answer your question.

    • Here is a response from my colleague, Venu, who is very familiar with JMS:

      “I have gone through it, and I feel it may not be possible to control this from the front end: as soon as a connection is available to the front end, messages can be sent, putting them onto the queue immediately. The only way is to throttle this flow by reducing the availability of threads and connections for the front end. However, as you mentioned in your mail, this may not be feasible, as messages can still flood in even with limited connections and threads.

      From the back end, i.e. the activation specification, it can be done using the following parameters:

      Maximum concurrent endpoints
      Maximum batch size

      Here is a detailed explanation from the cWAS infocenter (https://www-01.ibm.com/support/knowledgecenter/SSAW57_7.0.0/com.ibm.websphere.nd.doc/info/ae/ae/tjn0027_.html); however, it is relevant to Liberty also.

      Maximum concurrent endpoints

      The default messaging provider enables the throttling of message delivery to a message-driven bean through the Maximum concurrent endpoints configuration option on the JMS activation specification used to deploy the bean.
      The maximum number of instances of each message-driven bean is controlled by the Maximum concurrent endpoint setting in the activation specification used to deploy the message-driven bean. This maximum concurrency limit helps prevent a temporary build up of messages from starting too many MDB instances. By default, the maximum number of concurrent MDB instances is set to 10.
      The Maximum concurrent endpoints field limits the number of endpoints (instances of a given message-driven bean) that process messages concurrently. If the maximum has been reached, new messages are not accepted from the messaging engine for delivery until an endpoint finishes its current processing.
      If the available message count (queue depth) associated with a message-driven bean is frequently high, and if your server can handle more concurrent work, you can benefit from increasing the maximum concurrency setting.
      If you set the maximum concurrency for a message-driven bean, be sure that you specify a value smaller than the maximum number of endpoint instances that can be created by the adapter that the message-driven bean is bound to. If necessary, increase the endpoint instance limit.

      Maximum batch size
      An activation specification also has a Maximum batch size that refers to how many messages can be allocated to an endpoint in one batch for serial delivery. So, for example, if you have set the Maximum concurrent endpoints property to 10 and the Maximum batch Size property to 3, then there can be up to 10 endpoints each processing up to 3 messages giving a total of 30 messages allocated to that message-driven bean. If there are multiple message-driven beans deployed against a single activation specification then these maximum values apply to each message-driven bean individually.

      There is one more new JMS 2.0 feature called delivery delay, which can be set on the front end. Delivery delay is a JMS API that has to be set while sending messages; it ensures that a message is eligible for delivery to the consumer (i.e. the MDB) only after the delivery delay elapses.
      http://www.oracle.com/technetwork/articles/java/jms2messaging-1954190.html (Please look for delivery delay).”

      Note that he mentioned it might be possible to throttle on the front end by limiting threads / connections, but although this MIGHT work it very much depends on how quickly those threads are processing the file. If they process it very quickly, it is still possible for a small number of threads to overwhelm the queue. This isn’t a “hard throttle”, but just an attempt to slow things down enough.

      The backend really seems like the place to control this, with the queue absorbing the messages from the frontend while the backend catches up at its own pace.

      Hope this helps!
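A plain-JDK sketch of the “throttle by limiting threads” idea discussed in this thread: guard each send with a Semaphore so that at most a fixed number of sends are in flight, independent of the pool size. The limit of 3, and the sleep standing in for the actual MQ send, are assumptions for illustration; in Liberty the executor would typically come from a managed executor or managed thread factory rather than a raw JDK pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Caps in-flight work at MAX_IN_FLIGHT regardless of how big the
// executor's own pool is: a hard throttle on concurrent sends.
public class ThrottledSender {
    static final int MAX_IN_FLIGHT = 3;        // assumed throttle value
    static final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger maxObserved = new AtomicInteger();

    static void send(ExecutorService pool) throws InterruptedException {
        permits.acquire();                     // blocks once the cap is reached
        pool.execute(() -> {
            try {
                int now = inFlight.incrementAndGet();
                maxObserved.accumulateAndGet(now, Math::max);
                Thread.sleep(10);              // stand-in for the MQ send
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                inFlight.decrementAndGet();
                permits.release();             // free a slot for the next send
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 50; i++) send(pool);
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println("max concurrent sends observed: " + maxObserved.get());
    }
}
```

Because the permit is acquired on the submitting thread, the file-reading loop itself slows down once the cap is reached, which is exactly the back-pressure behavior the batch-job scenario above is after.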
