Optimize JVM start-up with Eclipse OpenJ9
SharedClasses caching, dynamic AOT, and more
Application start-up time matters to different people for different reasons. For instance, software developers who repeatedly go through the code-compile-test cycle care a lot about start-up time because they want to reach the point where their new code is exercised sooner. Start-up time is also important for fast recovery after planned or unplanned outages due to hardware or software upgrades or failures. In the cloud, start-up time is very relevant when an autoscaler needs to quickly bring up additional instances of an application to handle a temporary increase in load.
In Eclipse OpenJ9, JVM start-up time is considered a prime performance metric. To that end, OpenJ9 modifies its internal heuristics during the start-up phase in order to improve the start-up time of applications. So, how does OpenJ9 detect start-up phases? While the exact details are beyond the scope of this article, it suffices to say that its phase-detection mechanism is based on the observation that start-up is typically characterized by intense class loading and bytecode interpretation, accompanied by sustained JIT compilation activity.
Since interpretation is a very expensive process (interpreted code is typically 10x slower than native code), the goal of the OpenJ9 JIT compiler is to reduce interpretation as soon as possible. Thus, the JIT will aim to compile as many Java methods as fast as possible without being too concerned about generated code quality during this phase. To this end, during start-up, the JIT compiler may choose to:
- Downgrade the optimization level for methods. These low-optimized method bodies can be upgraded later through various mechanisms, of which the most important is guarded-counting-recompilation (GCR). GCR is a recompilation mechanism based on invocation counting, but it is “guarded” in the sense that counting is disabled during the start-up phase.
- Reduce invocation counts for interpreted methods. In OpenJ9, the decision to compile an interpreted method is based on an invocation-counting mechanism. While using low invocation thresholds may be appealing from a start-up perspective, this leads to low quality of profiling data collected while methods run interpreted, data that is essential for the optimizer to achieve good throughput. Thus, OpenJ9 may use reduced invocation counts only during the start-up phase.
- Give priority to first-time-compilation requests (that move a method from interpreted to native code) and to compilations that look cheaper. The queue that holds outstanding compilation requests is implemented as a priority queue, and OpenJ9 will favor compilations that provide the best “value.” In the spirit of this idea, recompilations and compilations at higher optimization levels are placed towards the end of the queue.
Two major mechanisms aimed at improving start-up time in OpenJ9 are the shared classes cache (SCC) and dynamic ahead-of-time (AOT) compilation technology. These technologies directly address the major sources of overhead.
The SCC is a memory-mapped file that stores mainly three pieces of data:
- ROMClasses
- AOT-generated code
- Interpreter profiler information
In OpenJ9, a Java .class file is first transformed into an internal representation called ROMClass, which contains all of the class’s immutable data. Loading a ROMClass from the SCC is much faster than loading the class from its .class file because:
- The class’s data is fetched from memory, instead of disk
- The transformation to ROMClass and some verification has already happened
Note, however, that not all classes can be stored in the SCC. The main condition is that the classloader used to perform the load be SCC-“aware.” The easiest way to accomplish this prerequisite is to have a classloader that extends java.net.URLClassLoader. Learn more about SCC technology in “Class sharing in Eclipse OpenJ9” (IBM Developer, June 2018).
Dynamic AOT compilation is a mechanism in which Java methods compiled in one JVM invocation are stored in the SCC and reused in subsequent JVM invocations. Loading an AOT-compiled body from the SCC is much faster and much cheaper than performing a JIT compilation. So start-up can be greatly improved for two reasons:
- The compilation overhead is significantly reduced (thus the JIT compilation threads will steal fewer CPU cycles from the application threads)
- Methods transition faster from interpreted to compiled status
One caveat is that, due to technological limitations arising out of the need to share AOT-compiled bodies across JVM invocations, the code quality for AOT compilations is somewhat lower than the quality for JIT compilations. OpenJ9 overcomes this shortcoming by:
- Recompiling frequently executed AOT bodies
- Restricting AOT generation to the start-up phase of an application
The interpreter profiler mechanism in OpenJ9 collects profiling data about branch bias (taken or not-taken) and targets for interface invokes, virtual invokes, and instanceof operations. This information is crucial to the JIT optimizer, but unfortunately, the profile-gathering process comes at a high overhead, which adversely affects start-up time. The solution adopted by OpenJ9 is to store the collected profiling data into the SCC and to use it in subsequent runs, while turning the interpreter profiler off during start-up in those runs. A watchdog mechanism may turn the interpreter profiler on sooner if too many of the queries for profiling data come back empty-handed.
What should the user do to improve start-up time?
While many of the start-up-oriented heuristics are enabled by default in OpenJ9, in some cases user input is needed.
Configure and tune SCC/AOT
At the time of this writing, SCC and dynamic AOT are not enabled by default. Instead, the user needs to specify the -Xshareclasses command line option. See the -Xshareclasses documentation on GitHub for a comprehensive list of sub-options. One of the common pitfalls is that the default size of the SCC may be too small for the application. This situation can be detected by printing the SCC statistics with the printStats sub-option and checking the SCC occupancy.
If the output shows “Cache is 100% full,” then it’s likely that the application could benefit from a larger SCC. In past releases of OpenJ9, in order to increase the SCC size with the -Xscmx option, you had to destroy the existing SCC and create a new one. Starting with OpenJ9 v0.9.0, this procedure has been streamlined because the SCC defines a soft limit and a hard limit. When the soft limit is reached, the SCC is declared full, but this size can be increased up to the hard limit without destroying the SCC. See the -Xscmx documentation on GitHub for more details about adjusting the SCC size.
Regarding AOT code size, by default, OpenJ9 does not set any explicit limit, meaning that AOT code can be stored in the SCC until it becomes full. However, be mindful that some applications set an AOT space limit internally by using the -Xscmaxaot<size> option. Prominent examples include WebSphere Application Server and WebSphere Liberty. If the applications running on top of these application servers are particularly large, the user could consider increasing the AOT space limit to extract further start-up time improvements. Statistics about the AOT space limit and occupancy can be obtained with the printStats option.
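As a minimal sketch of the options discussed in this section, the snippet below assembles one command line that runs an application with class sharing enabled and another that prints the statistics of the same cache. The cache name myAppCache, the 100 MB cache size, the 16 MB AOT limit, and app.jar are placeholders chosen for illustration, not recommended values. The commands are echoed rather than executed so that the snippet also runs where no OpenJ9 JDK is installed.

```shell
# Placeholder values (assumptions): substitute your own name, sizes, and jar.
SCC_NAME="myAppCache"
SCC_SIZE="100M"
AOT_LIMIT="16M"
APP_JAR="app.jar"

# Run the application with class sharing (and therefore dynamic AOT) enabled.
# -Xscmx gives the cache size used when the cache is first created;
# -Xscmaxaot caps the space that AOT code may occupy in the cache.
RUN_CMD="java -Xshareclasses:name=${SCC_NAME} -Xscmx${SCC_SIZE} -Xscmaxaot${AOT_LIMIT} -jar ${APP_JAR}"

# Print statistics for the same cache, including occupancy and AOT usage.
STATS_CMD="java -Xshareclasses:name=${SCC_NAME},printStats"

echo "$RUN_CMD"
echo "$STATS_CMD"
```

Running the printed statistics command after a few application runs shows whether the cache or its AOT space has filled up.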
The -Xtune:virtualized option is recommended for CPU-constrained environments like the ones typically found in the cloud. Internally, the option makes the JIT compiler more conservative with its inlining and recompilation decisions (thus saving CPU), while the GC module is less eager in expanding the heap (thus lowering the memory footprint). These changes are expected to reduce the CPU consumed by the JIT compilation threads by 20-30% and to improve footprint by 3-5%, at the expense of a small (2-3%) throughput loss. By itself, this option has a minimal effect on start-up time, but it provides a nice start-up boost when used in conjunction with AOT. The reason is twofold:
- It enables -Xaot:forceaot, which is an option that instructs the JIT compiler to bypass its usual heuristics and generate as much AOT code as possible
- The optimization level for all AOT compilations is raised from “cold” (typically used during start-up) to “warm”
A word of caution, though: While -Xtune:virtualized coupled with a large SCC is a great way of improving start-up and ramp-up of applications, throughput may be affected because, as explained previously, the AOT code quality does not match the JIT code quality, and the recompilation mechanism is subdued significantly such that many AOT bodies do not get recompiled.
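A command line that couples -Xtune:virtualized with a larger SCC might look like the sketch below. The cache name, the 100 MB size, and app.jar are placeholders rather than recommendations, and the command is echoed rather than executed so that the snippet runs without an OpenJ9 JDK.

```shell
# Placeholder values (assumptions): substitute your own name, size, and jar.
SCC_NAME="myAppCache"
SCC_SIZE="100M"

# -Xtune:virtualized only pays off at start-up when AOT code can be stored,
# so class sharing is enabled on the same command line.
TUNED_CMD="java -Xtune:virtualized -Xshareclasses:name=${SCC_NAME} -Xscmx${SCC_SIZE} -jar app.jar"

echo "$TUNED_CMD"
```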
The -Xquickstart option is recommended in the following situations:
- Very short applications where there isn’t enough time to amortize the cost of JIT compilations
- Graphical/interactive applications that may feel jerky due to interference from the JIT compilation activity
- When the user considers start-up time to be the most important performance metric
As expected, the changes under -Xquickstart mode are geared towards a fast start-up experience. The JIT will completely disable the interpreter profiler, downgrade all first-time compilations to the “cold” optimization level, reduce method invocation counts, and turn on -Xaot:forceaot mode (where applicable). Note that the absence of the interpreter profiler, while beneficial from the start-up point of view, will lead to lower levels of throughput (it’s not uncommon to see throughput drop by 40% when -Xquickstart is used).
Other settings that affect start-up time
For those who want to go the extra mile, there is another option that could further reduce start-up time, though to a lesser degree. The default initial size for the Java heap in OpenJ9 is 8MB, and increasing this value with -Xms<size> could improve start-up time due to lower GC overhead. However, the downside is that memory consumption will increase slightly. For instance, in our Liberty+DT7 experiments, we have seen a 6% reduction in start-up time from adding -Xms256M to the command line at the expense of an 8% footprint increase.
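For illustration, the initial heap size can be added to an existing command line as sketched below. The 256M value is the one quoted above; app.jar is a placeholder, and the command is echoed rather than executed so that the snippet runs without a JDK.

```shell
# Initial heap size from the experiment described above.
INITIAL_HEAP="256M"

# -Xms sets the initial Java heap size; a larger value means fewer
# heap expansions and garbage collections while the application starts.
HEAP_CMD="java -Xms${INITIAL_HEAP} -jar app.jar"

echo "$HEAP_CMD"
```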
With so many options to choose from, you might ask which one you should pick. To answer this question, we ran some start-up experiments against the Daytrader7 benchmark application installed on top of WebSphere Liberty 184.108.40.206 and using OpenJDK8 with OpenJ9 build OpenJDK8U_x64_linux_openj9_linuxXL_2018-09-27-08-47, downloaded from AdoptOpenJDK.
Figure 1. Start-up comparison between different OpenJ9 options
As shown in Figure 1, between SCC/AOT and -Xquickstart, the former should be considered first because it provides larger start-up improvements (while having a very small effect on throughput, if any). There are situations, though, when the SCC cannot achieve its intended purpose, for example, because many Java classes are loaded by a classloader that is not SCC aware. In these cases, -Xquickstart can be considered as a substitute.
Figure 2. Effect of SCC/AOT size on start-up time
It should be noted that -Xquickstart can be used in conjunction with SCC/AOT, and this combination is able to achieve better start-up improvements than any of its components alone. As seen in Figure 2, adding -Xquickstart to the default SCC/AOT settings employed by WebSphere Liberty (SCC=60MB/AOT=8MB) improves start-up by another 7%. In contrast, the addition of -Xtune:virtualized appears to slow down the start-up of the application by 2%. However, this is only an artifact of the small SCC and AOT space in WebSphere Liberty: since under -Xtune:virtualized, the AOT code is compiled at a “warm” optimization level (instead of “cold”), the resulting compiled bodies are bigger and they exceed the capacity of the SCC. If we increase the SCC and AOT sizes to 75MB and 20MB respectively, then we can see the true potential of -Xtune:virtualized, which reduces start-up time by 13% compared to the default Liberty setting.
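The configurations compared in Figure 2 can be written down as option sets, as in the sketch below. Expressing the Liberty defaults as -Xscmx60M and -Xscmaxaot8M is an assumption made for illustration (the text above only states the sizes), and the configuration labels, cache names, and app.jar are hypothetical. The commands are echoed rather than executed so that the snippet runs without an OpenJ9 JDK.

```shell
# Map a configuration label (hypothetical names) to the JVM options it stands for.
startup_opts() {
  case "$1" in
    liberty-default)  echo "-Xscmx60M -Xscmaxaot8M" ;;
    with-quickstart)  echo "-Xscmx60M -Xscmaxaot8M -Xquickstart" ;;
    with-virtualized) echo "-Xscmx75M -Xscmaxaot20M -Xtune:virtualized" ;;
    *) echo "unknown configuration: $1" >&2; return 1 ;;
  esac
}

# Use a separate cache per configuration so that one run does not
# reuse classes or AOT code stored by another.
for cfg in liberty-default with-quickstart with-virtualized; do
  echo "java -Xshareclasses:name=scc-${cfg} $(startup_opts "$cfg") -jar app.jar"
done
```

Keeping the option sets in one place makes it easier to repeat the same start-up measurement for each configuration.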
To conclude, if start-up matters to you, SCC/AOT combined with -Xtune:virtualized should be one of the first configurations to try. Just make sure you set an appropriate size for the SCC and AOT space.