Java performance optimization is of interest to developers and IT specialists alike. In this series of articles we will help you understand many of the performance-critical components that make up the IBM SDK for Java, and provide the high-level background you need to understand how performance is affected.
This first article will take a look at the basic architecture of the IBM J9 Java Virtual Machine (JVM) with a focus on tuning the JIT compiler. We will also touch on the use of the shared classes cache, which can improve startup time.
Other articles in this series:
- Part two: Memory management, garbage collection (GC) policies, and GC diagnostic tools.
- Part three: Java concurrency and locking.
Links to further information are provided throughout the article.
J9 JVM architecture basics
The J9 Java Virtual Machine
The J9 JVM, developed by IBM, is the basis of all IBM SDKs since version 5. It is described in the J9 Virtual Machine (JVM) topic in IBM Knowledge Center. J9 is fully compliant with the Java Virtual Machine specification and scales from mobile phones up to IBM z Systems mainframes. It is available on Intel (Linux and Windows), POWER (Linux and AIX), ARM (Linux), and z Systems (Linux and z/OS) platforms.
The J9 JVM comprises a number of components, including:
- Class loader
- Platform port library layer
- Garbage collector (GC)
- Just-In-Time (JIT) compiler (codenamed Testarossa or TR JIT in J9)
- JVM Application Programming Interface (API)
- Monitoring and Diagnostic component
These components are illustrated in the following diagram:
The class loader and interpreter are fundamental components of the JVM that typically get exercised from the time when the application is started. The platform port library layer provides an abstraction layer between the JVM and the underlying OS, allowing most of the platform-specific details such as file I/O and memory allocation to be managed in one place in the codebase.
Many different Java Class Library (JCL) versions can be configured with the base J9 VM to produce different versions of the IBM SDK for Java.
The Testarossa JIT compiler
The JIT compiler used by J9 is named Testarossa and is a high performance, production compiler that contains an extensive suite of classical as well as Java-specific optimizations.
The output from the JIT compiler is optimized machine code for each method and is often tailored to exploit features of the specific hardware that the program is being run on.
The diagram illustrates the various components of the Testarossa JIT compiler at a high level. The IL (intermediate language) generator is fairly simple by design. It performs a straight translation of the input Java bytecode into the Testarossa IL format that is used during the rest of the compilation process. The cross platform optimizer is where most of the analyses and optimizations are done. This component is the most complex piece of the JIT compiler and is where a majority of the compiling time is spent. The optimizer analyzes and transforms the IL representation and passes it to the code generator to translate into machine instructions suitable for the platform the program is being run on.
Platform-specific decisions, such as which instruction would be best to employ on the particular version of the hardware, are taken by the code generator along with register allocation and binary code generation.
Throughout the compilation process, the JIT compiler queries other components of the JVM and checks which Java command-line options are in effect. The code that handles these queries is likewise abstracted into a separate component.
Conceptually, part of the JIT compiler is present even when no compilation is in progress; this component is referred to as the runtime environment. It includes the runtime helper routines used by the compiled code that the JIT compiler generates, as well as the metadata needed to support actions such as generating correct exception stack traces.
The Testarossa JIT compiler compiles one method at a time. That is, the input to the compiler is the method’s bytecode plus the rest of the JVM execution state (for example, the classes that are currently loaded). JIT compilation happens transparently for the most part as there are JVM internal “control” heuristics that decide what, when, and how to compile. These heuristics are designed to balance various performance metrics such as application throughput, rampup/startup, and footprint.
There are some differences between JIT compilation heuristics used in different JVM implementations, such as Hotspot or J9, that can affect “out of the box” performance significantly. Differences include the maximum number of compilation threads, the number of invocations of the method before it gets compiled, and the heuristics for recompiling.
The Testarossa JIT compiler compiles methods asynchronously in most cases. That is, Java threads do not halt execution and wait for the compilation to complete. Instead, a method continues to be executed by either interpreting or using the result of an earlier compilation. When the present compilation is finished, the generated code is then used for executing the method.
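The idea of asynchronous compilation can be sketched with a toy model. The following is not how J9 is implemented: the class name AsyncCompileSketch and its methods are hypothetical, and both the "interpreted" and the "compiled" versions are ordinary Java lambdas that stand in for the real thing. The sketch only shows the pattern described above: callers keep running the currently installed version while a background task prepares a faster one and then publishes it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntUnaryOperator;

// Toy model only: both versions are plain lambdas, not real interpreted or compiled code.
class AsyncCompileSketch {
    // Stand-in for the interpreted version of a method.
    static final IntUnaryOperator INTERPRETED = x -> x * 2;
    // Stand-in for the JIT-generated version; it must compute the same result.
    static final IntUnaryOperator COMPILED = x -> x << 1;

    // The version that callers currently dispatch to.
    private final AtomicReference<IntUnaryOperator> current =
            new AtomicReference<>(INTERPRETED);

    // Starts a background "compilation" and returns immediately.
    CompletableFuture<Void> requestCompilation() {
        return CompletableFuture
                .supplyAsync(() -> COMPILED)   // runs on a background thread
                .thenAccept(current::set);     // install the new version when it is ready
    }

    // Application threads never wait: they run whichever version is installed.
    int invoke(int x) {
        return current.get().applyAsInt(x);
    }

    boolean isCompiled() {
        return current.get() == COMPILED;
    }

    public static void main(String[] args) {
        AsyncCompileSketch method = new AsyncCompileSketch();
        CompletableFuture<Void> compilation = method.requestCompilation();
        // May run either version; the result is the same both ways.
        System.out.println(method.invoke(21));
        // Only this demo waits, so that the switch can be observed.
        compilation.join();
        System.out.println(method.isCompiled());
        System.out.println(method.invoke(21));
    }
}
```

The single atomic reference is the design point: installing new code is one publication step, so a caller sees either the old version or the new one, never a partial state.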
The Testarossa JIT compiler also uses tiered compilation, meaning that the same method might be compiled multiple times. The first compilation is usually inexpensive and applies only a subset of the optimizations, to keep compilation time low. A JIT sampling thread keeps track of how much time is spent in different methods, and methods that appear to be critical are chosen for recompilation with more optimizations. The Testarossa JIT compiler uses more than 70 different optimization passes and has several different optimization strategies. Details can be found in the following articles:
- The JIT compiler in IBM Knowledge Center
- Insides of IBM J9 JVM for Java 6 Architecture and Features
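The effect of tiered compilation is easiest to see as warm-up: the same method tends to run faster in later batches of calls than in the first one. The following is a minimal sketch for observing this; the class name WarmupDemo, the batch sizes, and the workload are arbitrary choices for illustration. Timings depend on the machine and the JVM, so no particular numbers are implied, and a serious measurement would use a benchmarking harness rather than hand-written timing.

```java
// Minimal warm-up demo: times the same method in successive batches of calls.
class WarmupDemo {

    // A small, frequently called method: a typical candidate for JIT compilation.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Calls sumOfSquares(n) `calls` times and returns the elapsed nanoseconds.
    static long timeBatch(int calls, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int c = 0; c < calls; c++) {
            sink += sumOfSquares(n);
        }
        long elapsed = System.nanoTime() - start;
        // Use the result so the work cannot be discarded as dead code.
        if (sink == 42) {
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        for (int batch = 1; batch <= 5; batch++) {
            long nanos = timeBatch(20_000, 1_000);
            System.out.printf("batch %d: %d us%n", batch, nanos / 1_000);
        }
    }
}
```

Running the same program with the standard -Xint option, which disables JIT compilation, gives an interpreter-only baseline for comparison.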
JIT compilation is more than just traditional compilation done dynamically.
One of the main advantages of compiling at run time is that speculative techniques such as class-hierarchy-based optimizations and feedback-directed optimizations (FDO) can be done without any user involvement.
Profiling (that is, tracking runtime information by instrumenting the program) is done in J9 both in the interpreter and in the JIT compiler. It can improve performance significantly, because the compiler can learn about values that can only be known at run time (for example, values input by a user while the program is running). See, for example, Experiences in designing a robust and scalable interpreter profiling framework.
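As an illustration of the kind of code that benefits from these speculative techniques, consider a call through an interface that has only one implementation loaded. The sketch below is an assumption-laden example, not a description of what J9 does with this exact program: the names DevirtualizationDemo, Shape, and Square are hypothetical, and whether a given JIT compiler actually devirtualizes or inlines the call depends on its heuristics.

```java
// Example of a call site that a JIT compiler may optimize speculatively.
class DevirtualizationDemo {

    interface Shape {
        double area();
    }

    // The only Shape implementation loaded in this program.
    static final class Square implements Shape {
        private final double side;

        Square(double side) {
            this.side = side;
        }

        @Override
        public double area() {
            return side * side;
        }
    }

    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            // Virtual call site. While Square is the only implementation loaded,
            // a compiler may speculate that this always calls Square.area().
            // Loading a second implementation later would invalidate that guess.
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = new Shape[1_000];
        for (int i = 0; i < shapes.length; i++) {
            shapes[i] = new Square(2.0);
        }
        double total = 0;
        // Repeat the call so that the method becomes hot.
        for (int round = 0; round < 10_000; round++) {
            total += totalArea(shapes);
        }
        System.out.println(total);
    }
}
```

A static compiler cannot safely make this assumption, because it cannot know which classes will be loaded; a JIT compiler can make it and fall back if the class hierarchy changes.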
The other advantage of compiling at run time is that JIT compiler optimizations can exploit all the features of the version of the architecture it is running on, in contrast to a static compiler that must work with all supported architectures.
Tuning the Testarossa JIT compiler
The JIT compiler generally works quite transparently to the end user.
This is usually desirable because it is able to make decisions based on information it collects while it is executing. However, there are inevitably tradeoffs (such as startup and compile time versus throughput) and some tuning options are available to the user. Two in particular are worth mentioning:
- -Xtune:virtualized calibrates the JVM’s internal heuristics to favor lower JIT compilation time (usually by 25% or more) and a smaller footprint, at the expense of a small amount of throughput (usually less than 5%). This is useful in resource-constrained environments (for example, virtualized environments) where memory or CPU is over-committed.
- -Xquickstart calibrates the JVM’s internal heuristics to favor very fast startup, but usually at a significant cost in throughput (possibly 30% or more). This is useful for client and GUI applications where quick startup and responsiveness are key parts of the workflow.
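When experimenting with these options, it can help to confirm from inside the application which ones the JVM was actually started with. The following minimal sketch uses the standard java.lang.management API to list the JVM input arguments; the class name JitOptionCheck and the helper hasOption are hypothetical names chosen for this example. It only reports what was passed to the JVM and does not verify what effect an option had.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Reports whether the JIT tuning options discussed above were passed to this JVM.
class JitOptionCheck {

    // Returns true if any JVM argument starts with the given prefix.
    static boolean hasOption(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("JVM: " + System.getProperty("java.vm.name"));
        System.out.println("-Xquickstart: " + hasOption(jvmArgs, "-Xquickstart"));
        System.out.println("-Xtune:virtualized: " + hasOption(jvmArgs, "-Xtune:virtualized"));
    }
}
```

For example, starting the program with java -Xquickstart JitOptionCheck on an IBM SDK should report the first option as present.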
J9 shared classes technology
J9’s shared classes technology is useful for reducing the overheads (footprint and startup time) related to class loading.
In addition to storing classes in the shared classes cache, J9 also stores ahead-of-time (AOT) compiled code there. This code can be reused by different JVMs connected to the same shared classes cache, without each JVM incurring the overhead of doing those compilations itself. A consuming JVM has to validate only that the AOT compiled code and associated metadata can be used, and then relocate the code into its own address space to start using the compiled method.
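An application can check whether class sharing is active in the JVM it is running on. The following sketch assumes that the IBM SDK provides a class com.ibm.oti.shared.Shared with a static isSharingEnabled() method; it looks the class up reflectively so that the program also compiles and runs on JVMs that do not provide it. The class name SharedClassesCheck and the three status strings are hypothetical choices for this example.

```java
import java.lang.reflect.Method;

// Checks whether class sharing is enabled, without a compile-time dependency
// on IBM-specific classes.
class SharedClassesCheck {

    // Returns "enabled", "disabled", or "unavailable" (the class or method
    // does not exist on this JVM, or could not be called).
    static String sharingStatus(String className) {
        try {
            Class<?> shared = Class.forName(className);
            Method isSharingEnabled = shared.getMethod("isSharingEnabled");
            Object result = isSharingEnabled.invoke(null); // static method, no receiver
            return Boolean.TRUE.equals(result) ? "enabled" : "disabled";
        } catch (ReflectiveOperationException | LinkageError | SecurityException e) {
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        // Assumed class name from the IBM SDK; not present on other JVMs.
        System.out.println("Class sharing: " + sharingStatus("com.ibm.oti.shared.Shared"));
    }
}
```

On an IBM SDK, class sharing is typically turned on with the -Xshareclasses command-line option; without it, the check is expected to report that sharing is disabled.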
If shared classes are used in conjunction with the JIT compiler tuning options described above, compilation overhead and startup time can be further improved. Further information about shared-classes technology can be found in the following articles:
- Class data sharing in IBM Knowledge Center
- Java technology, IBM style: Class sharing in IBM developerWorks
- Enhance performance with class sharing in IBM developerWorks
In the next article we will focus on memory management, providing information about the garbage collection policies that can be used for different workloads. We will also look at how diagnostic tools can be used to monitor GC performance, pinpoint problems, and suggest tuning options.
Authors: Vijay Sundaresan, Aleksandar Micic, Daniel Heidinga, and Babneet Singh.