Wednesday, November 2, 2011

Java GC, HotSpot's CMS promotion buffers

Recently, I unfairly blamed promotion local allocation buffers (PLABs) for fragmentation of the old space under the concurrent mark sweep (CMS) garbage collector. I was very wrong. In this article, I am going to explain in detail how PLABs really work.

PLABs

PLAB stands for promotion local allocation buffer. PLABs are used during young collection. Young collection in CMS (and in all other garbage collectors in the HotSpot JVM) is a stop-the-world copy collection. CMS may use multiple threads for young collection, and each of these threads may need to allocate space for the objects it copies, either in survivor space or in old space. PLABs are needed so that threads do not compete for the shared data structures that manage free memory. Each thread has one PLAB for survivor space and one for old space. Free memory in survivor space is contiguous, and so are survivor PLABs, which are simply contiguous blocks. On the other hand, free memory in old space (under the CMS collector) is fragmented and managed via a sophisticated dictionary of free chunks ...

Free list space (FLS)

The CMS collector cannot compact old space (actually it can, but compaction involves a long stop-the-world pause, often referred to as a GC freeze). The memory manager works with lists of free chunks to manage fragmented free space. As a countermeasure against fragmentation, chunks of free space are grouped by size. If available, a free chunk of exactly the required size is used to serve an allocation request. If chunks of a given size are exhausted, the memory manager splits a larger chunk into several smaller ones to satisfy demand. Consecutive free chunks can also be coalesced to create larger ones (coalescing is done along with sweeping during the concurrent GC cycle). This splitting/coalescing logic is controlled by complex heuristics and by per-size chunk demand statistics.
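To make the exact-fit and splitting behaviour concrete, here is a minimal Java sketch of size-indexed free lists. It is a toy model, not HotSpot's implementation: the class and method names, the 256-word limit and the "split the smallest larger chunk" policy are assumptions made for illustration, and coalescing and demand statistics are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of size-indexed free lists; sizes and addresses are in heap words. */
class IndexedFreeLists {
    // Largest chunk size tracked by this toy index (hypothetical value).
    static final int MAX_INDEXED = 256;

    // freeLists.get(size) holds start addresses of free chunks of exactly that size.
    private final List<Deque<Integer>> freeLists = new ArrayList<>();

    IndexedFreeLists() {
        for (int i = 0; i <= MAX_INDEXED; i++) {
            freeLists.add(new ArrayDeque<>());
        }
    }

    void addFreeChunk(int address, int size) {
        freeLists.get(size).push(address);
    }

    int freeCount(int size) {
        return freeLists.get(size).size();
    }

    /** Returns the address of a chunk of exactly {@code size} words, or -1 if none can be carved. */
    int allocate(int size) {
        if (!freeLists.get(size).isEmpty()) {
            return freeLists.get(size).pop(); // exact fit
        }
        // No exact fit: split the smallest larger chunk that is available.
        for (int larger = size + 1; larger <= MAX_INDEXED; larger++) {
            if (!freeLists.get(larger).isEmpty()) {
                int address = freeLists.get(larger).pop();
                // The remainder goes back to the list for its own size.
                addFreeChunk(address + size, larger - size);
                return address;
            }
        }
        return -1; // a real allocator would fall back to its pool of large chunks
    }
}
```

In this model, a request for 16 words served from a 40-word chunk leaves a 24-word remainder on the 24-word list, which is how splitting feeds the lists for smaller sizes.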

Old space PLABs

Naturally, old space PLABs mimic the structure of the indexed free list space. Each thread preallocates a certain number of chunks of each size below 257 heap words (larger chunks are allocated from the global space). The number of chunks of each size to preallocate is controlled by statistics. The following JVM flag enables verbose reporting of old space PLAB sizing (too verbose for production, though).
-XX:+PrintOldPLAB
At the beginning of each young collection we will see the following lines in the GC log:
6.347: [ParNew ...
...
0[10]: 722/5239/897
0[12]: 846/5922/987
0[14]: 666/5100/850
...
1[12]: 229/3296/987
1[14]: 2/2621/850
1[16]: 69/1812/564
1[18]: 247/1160/290
...
[10]: 905
[12]: 1002
[14]: 865
[16]: 567
...
The first lines are statistics from each scavenger (young collector) thread, in the following format:
<tid>[<chunk size>]: <num_retire>/<num_blocks>/<blocks_to_claim>
tid - GC thread ID,
chunk size - chunk size in heap words,
num_retire - number of free chunks in PLAB at the end of young GC,
num_blocks - number of chunks allocated from FLS to PLAB during young GC,
blocks_to_claim - desired number of blocks to refill PLAB.
The next few lines show the estimated number of chunks (per size) to be preallocated (per GC thread) at the beginning of the next young collection:
[<chunk size>]: <blocks_to_claim>
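If you want to post-process this output, the per-thread lines are easy to parse. The following Java sketch assumes the line format exactly as shown above; the class name, field names and the derived blocksUsed() helper are my own and are not part of the JVM.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One per-thread line of -XX:+PrintOldPLAB output, in the format described above. */
class OldPlabLogLine {
    // <tid>[<chunk size>]: <num_retire>/<num_blocks>/<blocks_to_claim>
    private static final Pattern PER_THREAD =
            Pattern.compile("(\\d+)\\[(\\d+)\\]: (\\d+)/(\\d+)/(\\d+)");

    final int tid;
    final int chunkSize;
    final long numRetire;
    final long numBlocks;
    final long blocksToClaim;

    private OldPlabLogLine(Matcher m) {
        tid = Integer.parseInt(m.group(1));
        chunkSize = Integer.parseInt(m.group(2));
        numRetire = Long.parseLong(m.group(3));
        numBlocks = Long.parseLong(m.group(4));
        blocksToClaim = Long.parseLong(m.group(5));
    }

    /** Returns the parsed line, or null if the line is not a per-thread statistics line. */
    static OldPlabLogLine parse(String line) {
        Matcher m = PER_THREAD.matcher(line.trim());
        return m.matches() ? new OldPlabLogLine(m) : null;
    }

    /** Chunks taken from the FLS minus chunks left unused in the PLAB (derived from the definitions above). */
    long blocksUsed() {
        return numBlocks - numRetire;
    }
}
```

For the sample line 0[10]: 722/5239/897 this gives thread 0, chunk size 10 and 5239 - 722 = 4517 chunks actually consumed. The per-size summary lines such as [10]: 905 do not match the pattern and are returned as null.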

Calculating desired blocks to claim

The initial number of blocks (chunks) per chunk size is configured via the -XX:CMSParPromoteBlocksToClaim=<n> JVM command line option (-XX:OldPLABSize=<n> is an alias for this option if CMS GC is used). Unless resizing of old PLABs is disabled by the -XX:-ResizeOldPLAB option, the desired PLAB size is adjusted after each young GC.
The ideal desired number of blocks per chunk size is calculated by the following formula:
$$\text{blocks\_to\_claim}_{ideal} = \min\left(\text{CMSOldPLABMax},\ \max\left(\text{CMSOldPLABMin},\ \frac{\text{num\_blocks}}{\text{ParallelGCThreads} \times \text{CMSOldPLABNumRefills}}\right)\right)$$
but the effective value is exponentially smoothed over time:
$$\text{blocks\_to\_claim}_{next} = (1 - w) \cdot \text{blocks\_to\_claim}_{prev} + w \cdot \text{blocks\_to\_claim}_{ideal}$$
where w is configured via -XX:OldPLABWeight (given as a percentage; the default of 50 corresponds to w = 0.5).
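The arithmetic of the two formulas above can be sketched in a few lines of Java. This is only an illustration of the formulas as stated in this article, not code from the JVM; the class and method names are hypothetical, and the thread count, refill count and bounds used in the example are values picked for the arithmetic.

```java
/** Illustration of the old PLAB sizing formulas; names are hypothetical. */
class OldPlabSizing {
    /** Clamp num_blocks / (threads * refills) into [plabMin, plabMax]. */
    static long idealBlocksToClaim(long numBlocks, int gcThreads, int numRefills,
                                   long plabMin, long plabMax) {
        long perRefill = numBlocks / ((long) gcThreads * numRefills);
        return Math.min(plabMax, Math.max(plabMin, perRefill));
    }

    /** Exponential smoothing: w is the weight of the newest sample, between 0 and 1. */
    static double smooth(double previous, double ideal, double w) {
        return (1 - w) * previous + w * ideal;
    }
}
```

For example, with num_blocks = 5239, 2 GC threads and 4 refills, the ideal value is 5239 / 8 = 654 (integer division). Smoothing that against a previous value of 897 with w = 0.5 gives 775.5.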

On-the-fly PLAB resizing

During young collection, if the chunk list for a certain size is exhausted, the thread refills it from the global free space pool (allocating the same number of chunks as at the beginning of the collection). Normally a thread has to refill a chunk list a few times during a collection (-XX:CMSOldPLABNumRefills sets the desired number of refills). However, if the initial estimate was too small, the GC thread will refill its chunk list too often (a refill requires a global lock on the memory manager, so it may be slow). If on-the-fly PLAB resizing is enabled, the JVM tries to detect such conditions and resize the PLAB in the middle of the young collection.
-XX:+CMSOldPLABResizeQuicker enables on-the-fly PLAB resizing (disabled by default).
A few more options offer additional tuning:
-XX:CMSOldPLABToleranceFactor=4 tolerance of the phase-change detector for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityFactor=2 gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityCeiling=10 clamping of the gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
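To show how a tolerance, a gain and a ceiling on the gain could fit together, here is a hedged Java sketch of one plausible feedback rule. This is an assumption for illustration only: the article does not give the formula, the exact computation in HotSpot may differ, and all names below are hypothetical.

```java
/** One plausible shape of on-the-fly PLAB resizing; not taken from HotSpot. */
class QuickPlabResize {
    /**
     * @param blocksToClaim     current refill size for this chunk size (must be positive)
     * @param numBlocksSoFar    chunks this thread has already taken during the current scavenge
     * @param numRefills        desired number of refills per scavenge
     * @param toleranceFactor   how far past the expected volume we go before reacting
     * @param reactivityFactor  gain applied once the tolerance is exceeded
     * @param reactivityCeiling upper bound on the overshoot multiple
     * @param plabMax           hard upper bound on the refill size
     */
    static long refillSize(long blocksToClaim, long numBlocksSoFar, int numRefills,
                           int toleranceFactor, int reactivityFactor,
                           int reactivityCeiling, long plabMax) {
        if (blocksToClaim <= 0) {
            throw new IllegalArgumentException("blocksToClaim must be positive");
        }
        // How many times over the tolerated volume has this thread already allocated?
        long tolerated = (long) toleranceFactor * numRefills * blocksToClaim;
        long multiple = Math.min(numBlocksSoFar / tolerated, reactivityCeiling);
        // Grow the refill size in proportion to the overshoot, then clamp it.
        long resized = blocksToClaim + reactivityFactor * multiple * blocksToClaim;
        return Math.min(resized, plabMax);
    }
}
```

With a refill size of 10, 4 refills, tolerance 4 and gain 2, nothing changes until the thread has taken 160 chunks. At 400 chunks the overshoot multiple is 2 and the next refill asks for 10 + 2 * 2 * 10 = 50 chunks.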

Conclusion

I spent some time digging through the OpenJDK code to make sure that I really understand this now. It was educational. This article has brought up and explained a few more arcane JVM options, though I doubt that I will ever use them in practice. The problem with heap fragmentation is that you have to run an application for a really long time before fragmentation manifests itself. Most of the options above require trial and error (even though -XX:+PrintOldPLAB might give you some insight into your application). It is much easier just to give the damn JVM a little more memory (hey, RAM is cheap nowadays) than to spend a day tuning arcane options.
Anyway, I hope it was as educational for you as it was for me.

9 comments:

  1. The equations are all mangled. Display chars are incorrect.

    When you said "Young collection in CMS (and all other garbage collectors in HotSpot JVM) is a stop-the-world copy collection." didn't you mean the ParNewGC's parallel copy collection?

  2. Correct, ParNewGC is a parallel stop-the-world copy collection. In GC land "parallel" means stop-the-world + multithreaded.

    (Thanks for pointing out mangled chars)

  3. Hello, thank you for this informative article.

    I have a question here, in your previous article - "Java GC, HotSpot's CMS and heap fragmentation", you said that "Concurrent Mark Sweep is used only to collect old space.", then what do you mean by "Young collection in CMS"? Do you mean the ParNew or Serial GC algorithm used together with CMS?

  4. Old space is collected concurrently. Young space is collected by a stop-the-world copy collector, either in parallel (ParNew) or in a single thread (DefNew).

    See http://aragozin.blogspot.com/2011/09/hotspot-jvm-garbage-collection-options.html

    for a list of possible combinations of young and old space collectors.

  5. Thanks for sharing, really arcane. This article helps me understand PLABs etc.

  6. " I have unfairly blamed promotion local allocation buffers (PLAB) for fragmentation of old space using concurrent mark sweep garbage collector. I was very wrong." hmmm... can you elaborate on what you mean by saying you were wrong? I think I missed that aspect of the article.

    1. Follow the link just before that phrase. In that article there are a few paragraphs explaining how PLABs destroy the effect of FLS. That turned out to be wrong, and the remarks added to that article explain why.

  7. I want to ask about the difference between these two parameters: num_blocks and blocks_to_claim. I think one is the actual number of chunks and the other is just an expected number.

  8. If you want an accurate answer, the only way is to consult the OpenJDK source.
    It has been some time since I investigated this code, and I'm afraid of being wrong.
