Java virtual machines (JVMs) use built-in garbage collection (GC) to automatically manage memory within the Java heap. When an allocation failure occurs, a GC cycle is triggered to reclaim memory from objects that are no longer referenced. Typical GC implementations in modern JVMs require ‘stop-the-world’ pauses, where application threads are suspended while the GC relocates live objects. These pauses are needed because the JVM must ensure that every reference to a live object, as seen by application threads, stays accurate while the GC moves objects within the heap. However, for Java applications that have large heaps and where response times are critical, ‘stop-the-world’ pauses can cause unpredictable spikes in response times during a GC cycle.

To help mitigate this problem, the IBM J9 Virtual Machine (J9VM) incorporates a new execution mode from the Eclipse OpenJ9 JVM Project that aims to minimize the time spent in ‘stop-the-world’ pauses by garbage collecting in parallel with running application threads. This mode works in conjunction with the new Guarded Storage Facility on the IBM z14™ (z14) mainframe.

Figure (a) shows the traditional approach to garbage collection. Application threads are suspended, which allows the GC to safely move live objects and then update all the references. When a GC cycle completes, the application threads resume and are guaranteed to find only updated references.

Figure (b) shows the new approach, where stale objects are collected in parallel with running application threads. To achieve this, the GC must ensure that any reference to a live object from the application thread is valid before and after the GC has moved that object. During a GC cycle, the application thread validates all reference accesses from the heap and performs a reference relocation if a stale reference is detected. The Guarded Storage (GS) Facility provides hardware-based support to detect when potentially stale references are accessed. If an application thread loads a stale reference to an object that has been moved by the GC, the GS Facility detects this operation and triggers an interrupt to allow the reference to be relocated. As seen in Figure (b), compared to the traditional approach, the GS Facility allows application threads to execute concurrently with GC, thereby reducing ‘stop-the-world’ pauses and improving overall response times.
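To make the guarded-load idea concrete, here is a minimal software sketch in Java. It is a toy model, not OpenJ9 code: heap addresses are plain integers, the guarded region bounds, the forwarding table, and every name in it (`GuardedLoadSketch`, `guardedLoad`, `relocate`) are assumptions made up for illustration, and the `if` check stands in for what the GS Facility does in hardware. Repairing the slot after relocation is also an assumption about what a handler might do.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: integers stand in for heap addresses. Not OpenJ9 code.
class GuardedLoadSketch {
    // Hypothetical bounds of the region being evacuated (end is exclusive).
    static final int GUARDED_START = 100;
    static final int GUARDED_END = 200;

    final Map<Integer, String> objects = new HashMap<>();     // address -> payload
    final Map<Integer, Integer> forwarding = new HashMap<>(); // old address -> new address
    int nextFree = GUARDED_END; // bump pointer just past the guarded region
    int trapCount = 0;

    boolean inGuardedRegion(int ref) {
        return ref >= GUARDED_START && ref < GUARDED_END;
    }

    // Stand-in for the hardware check: load a reference from a slot and
    // "trap" if the loaded value points into the guarded region.
    int guardedLoad(int[] slots, int index) {
        int ref = slots[index];
        if (inGuardedRegion(ref)) {
            trapCount++;
            ref = relocate(ref);
            slots[index] = ref; // assumption: the handler also repairs the slot
        }
        return ref;
    }

    // Stand-in for the trap handler: move the object once, then reuse the
    // forwarding entry for every other stale reference to it.
    int relocate(int oldRef) {
        Integer moved = forwarding.get(oldRef);
        if (moved == null) {
            moved = nextFree++;
            objects.put(moved, objects.remove(oldRef));
            forwarding.put(oldRef, moved);
        }
        return moved;
    }

    public static void main(String[] args) {
        GuardedLoadSketch heap = new GuardedLoadSketch();
        heap.objects.put(100, "A");
        int[] slots = {100, 100}; // two fields referencing the same object

        int first = heap.guardedLoad(slots, 0);  // stale: traps and moves the object
        int second = heap.guardedLoad(slots, 0); // slot already repaired: no trap
        int third = heap.guardedLoad(slots, 1);  // stale: traps, reuses forwarding entry

        System.out.println(first + " " + second + " " + third + " traps=" + heap.trapCount);
    }
}
```

Running `main` prints `200 200 200 traps=2`: the application only ever sees the relocated address, and a trap is taken only when a stale reference is actually loaded. In the real implementation the check costs nothing in software on the fast path because the hardware performs it as part of the load.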

4 comments on "Reducing Garbage Collection pause times with Concurrent Scavenge and the Guarded Storage Facility"

  1. Is there more detail on this? From your description, this approach doesn’t sound very new. E.g., see Appel, Ellis, Li (PLDI 1988) to name but one technique, and more recently Azul’s Pauseless GC (VEE 2005).

    • Hi Richard,

      I just published a second post with a bit more detail on our approach at https://developer.ibm.com/javasdk/2017/09/25/concurrent-scavenge-using-guarded-storage-facility-works/ . We’re also planning on publishing some more in the future. You’re right in that the idea behind performing concurrent GC is not new. What’s different in J9 is that the concurrent aspect is only enabled in the Generational Concurrent GC policy, and specifically only in the Nursery region. As such, it targets workloads that have large heaps and are response-time critical, since the absolute throughput will be affected by GC threads running concurrently with Java threads. Additionally, the Guarded Storage Facility is part of the z14 mainframe, which is a general-purpose machine, unlike Azul’s hardware, which was specially designed for virtual machines.

      Edit: Some other developers had a few more things to say about your question. Our implementation is similar to Azul’s in that we use a Loaded Value Barrier (which on z14 is called a Guarded Load). In terms of differences, we do not have extra passes like their re-mapping phase, and a concurrent scavenge cycle only occurs ~10% of the time when running with very large heaps which is the type of workload this mode targets.

      • Richard Jones September 27, 2017

        Hi

        This is interesting and the GS Facility looks useful. You say that when an access is made to an address in a guarded region, the interrupt handler migrates the object. How expensive is the trap and handler (e.g. for a typically sized object)? Dealing with the case when there are many accesses to a guarded region sounds expensive (lots of traps) if you do this on an object-by-object basis. Or do you evacuate region by region (in which case, how do you fix up all references to that region)? Happy to discuss more off-line.

        • Hi,

          I think the “How Guarded Storage Works” article originally had an error which I fixed a couple of days ago; a Guarded Load works by loading from an address, and if the value that is loaded is a reference into a guarded region, then the hardware traps.

          The trap is a few hundred cycles, and the handler costs more since it calls a GC hook (see https://github.com/eclipse/openj9/blob/master/runtime/vm/zcinterp.m4#L155 and https://github.com/eclipse/openj9/blob/master/runtime/vm/guardedstorage.c#L34) for every trap. You’re right in that many accesses to a guarded region can get expensive, which is why a concurrent scavenge cycle only occurs around 10% of the time (i.e., the nursery region is only guarded 10% of the time). Unfortunately I’m not the expert on the GC evacuation heuristics (since I work on the JIT Compiler), but feel free to send me an email at dsouzai@ca.ibm.com and I can CC those who would be able to give you better answers about the actual concurrent scavenge aspect.
