Java’s new Z Garbage Collector (ZGC) is very exciting

https://www.opsian.com/blog/javas-new-zgc-is-very-exciting

  • It’s almost surreal that the JVM is now designing around terabyte (!) heaps.
  • Java 12 saw a new default GC, but not this one. It’s called Shenandoah.
  • ZGC is marked production ready as of Java 15, but it’s still not the default as of Java 16. I’m not sure if this is because of its newness or other considerations.
  • Enable ZGC with:
    -XX:+UseZGC -Xmx<size> -Xlog:gc
    

ZGC’s design targets a future where these kinds of capacities are common: multi-terabyte heaps with low (<10ms) pause times and impact on overall application performance (<15% on throughput).

Modern collectors carry out this process in several phases:

  • Parallel: multiple GC threads
  • Serial: one GC thread
  • Stop the world: Application threads are suspended
  • Concurrent: Application threads are not suspended
  • Incremental: GC can terminate early and still accomplish something

To achieve its goals ZGC uses two techniques new to Hotspot Garbage Collectors: coloured pointers and load barriers.

Pointer coloring

On 64-bit platforms (ZGC is 64-bit only) a pointer can address vastly more memory than a system can realistically have and so it’s possible to use some of the other bits to store state. ZGC restricts itself to 4Tb heaps which require 42-bits, leaving 22-bits of possible state of which it currently uses 4-bits: finalizable, remap, mark0 and mark1.

One problem with pointer colouring is that it can create additional work when you need to dereference the pointer because you need to mask out the information bits. for x86, the ZGC team use a neat multi-mapping trick.

Multi-mapping involves mapping different ranges of virtual memory to the same physical memory. Since by design only one of remap, mark0 and mark1 can be 1 at any point in time, it’s possible to do this with three mappings.

Is this because there are only three possible values the 4 bits can take here?

Load Barriers

Load barriers are pieces of code that run whenever an application thread loads a reference from the heap. This is in contrast to write-barriers used by other GCs, such as G1.

GC Cycle

Marking

Marking involves finding and labelling in some way all heap objects that could be accessible to the running application, in other words finding the objects that aren’t garbage. ZGC’s Marking is broken up into three phases.

  • The first phase is a Stop The World one where the GC roots are marked as being live. GC roots are things like local variables, which the application can use to access the rest of the objects on the heap. If an object isn’t reachable by a walk of the object graph starting from the roots then there’s no way an application could access it and it is considered garbage.
  • ZGC begins the next phase which concurrently walks the object graph and marks all accessible objects. During this phase the load barrier is testing all loaded references against a mask that determines if they have been marked or not yet for this phase, if a reference hasn’t been marked then it is added to a queue for marking.
  • there’s a final, brief, Stop The World phase which handles some edge cases (which we’re going to ignore for now) and then marking is complete.

Relocation

Relocation involves moving live objects around in order to free up sections of the heap.

Sounds like the rationale for this is defragmentation.

Stages:

  • ZGC divides the heap up in to pages and at the start of this phase it concurrently selects a set of pages whose live objects need to be relocated. When the relocation set is selected, there is a Stop The World pause where ZGC relocates any objects in the relocation set that are referenced as a root (local variables, etc.) and remaps their reference to the new location.
  • the next phase is concurrent relocation. During this phase GC threads walk the relocation set and relocate all the objects in the pages it contains. Application threads can also relocate objects in the relocation set if they try to load them before the GC has relocated them, this is achieved through the load barrier (how is coordination achieved)?

Performance

ZGC’s SPECjbb 2015 throughput numbers are roughly comparable with the Parallel GC (which optimises for throughput) but with an average pause time of 1ms and a max of 4ms. This is in contrast to G1 and Parallel who had average pause times in excess of 200ms.

Edit