8292296: Use multiple threads to process ParallelGC deferred updates #10313
This is a follow-up to an initial patch I posted a while back to hotspot-gc-dev: https://mail.openjdk.org/pipermail/hotspot-gc-dev/2022-August/039905.html

The problem here is that some applications, including SPECjbb, spend a lot of time in the "Deferred Updates" stage of parallel compaction if they happen to generate a lot of objects that cross region boundaries.

The patch above parallelises the existing serial processing of deferred updates on the main VM thread. However, I think we can solve this in a simpler way: each GC worker thread keeps a private list of the deferred objects it encountered during compaction and, once all regions have been compacted, processes that private list. We know that `compaction_with_stealing_work()` won't return until all regions have been compacted, because otherwise `terminator->offer_termination()` would return false and the worker thread would attempt to steal tasks from another thread.

The advantage of this approach over a separate parallel deferred updates step is that we don't have to add heuristics for when to start worker threads and how many to start, which has the potential to cause regressions in some cases. Processing the deferred objects on the worker thread shouldn't be any slower than the existing serial scan on the VM thread, even if all the deferred objects end up on the queue of one thread (there's no attempt to balance or work-steal between threads). We also avoid having to scan each region for deferred objects in the common case where there are none in a space.

The new per-thread deferred objects list is dynamically allocated, but its size is bounded by the number of 512k heap regions, as we will push at most one pointer per region.

With SPECjbb on AWS c7g.16xlarge I see median full GC pause times reduce by around 20%, with a corresponding ~1% increase in critical-jOPS averaged over several runs. On the "derby" benchmark from SPECjvm I also see an improvement in median full GC pause times of around 11%. I tried a variety of other benchmarks from Dacapo and SPECjvm but couldn't see any other significant effect: it seems quite dependent on the type and size of objects allocated.

Tested tier1-3 with -XX:+UseParallelGC.
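To make the scheme described above concrete, here is a minimal standalone sketch, not the HotSpot code. Each worker claims regions from a shared counter, records any deferred object on a private list, waits until every region is done, and then processes only its own list. `Region`, `compact_with_private_deferred_lists` and `max_deferred_entries` are hypothetical names invented for this illustration; the spin-wait is a crude stand-in for `terminator->offer_termination()`, and the 512 KiB region size is taken from the description.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a heap region; not a HotSpot type.
struct Region {
    bool has_deferred_object;  // an object crossing the region boundary was seen
    int  deferred_id;          // illustrative handle for that object
};

// Runs num_workers threads over the regions and returns how many deferred
// updates were processed in total.
std::size_t compact_with_private_deferred_lists(const std::vector<Region>& regions,
                                                unsigned num_workers) {
    assert(num_workers > 0);
    std::atomic<std::size_t> next_region{0};
    std::atomic<std::size_t> regions_done{0};
    std::atomic<std::size_t> updates_processed{0};

    auto worker = [&]() {
        std::vector<int> deferred;  // private to this worker, no locking needed

        // Phase 1: claim and "compact" regions, noting deferred objects.
        for (;;) {
            std::size_t i = next_region.fetch_add(1);
            if (i >= regions.size()) break;
            if (regions[i].has_deferred_object) {
                deferred.push_back(regions[i].deferred_id);  // at most one per region
            }
            regions_done.fetch_add(1);
        }

        // Stand-in for termination: wait until every region has been compacted.
        while (regions_done.load() < regions.size()) {
            std::this_thread::yield();
        }

        // Phase 2: process only this worker's list (no balancing or stealing).
        for (int id : deferred) {
            (void)id;  // a real collector would update the object's contents here
            updates_processed.fetch_add(1);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < num_workers; ++w) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return updates_processed.load();
}

// Upper bound on entries in one worker's list: one pointer per region.
// The default region size is an assumption based on the "512k" figure above.
constexpr std::uint64_t max_deferred_entries(std::uint64_t heap_bytes,
                                             std::uint64_t region_bytes = 512 * 1024) {
    return (heap_bytes + region_bytes - 1) / region_bytes;
}
```

Under these assumptions, a 32 GiB heap has 65536 regions, so even if a single worker ended up with every deferred object its list would hold at most 65536 pointers (512 KiB with 8-byte pointers).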
Lgtm sans the assert issue Albert mentioned. I agree that this is a much nicer solution than adding heuristics again.
@nick-arm This change now passes all automated pre-integration checks.
}

cm->update_contents(cast_to_oop(addr));
assert(oopDesc::is_oop(cast_to_oop(addr)), "Expected an oop at " PTR_FORMAT, p2i(cast_to_oop(addr)));
I believe this can be moved up a bit, e.g. between L2601 and L2602.
Never mind; the assert should be after the obj body is properly updated. It's good as is.
Thanks for the reviews! Any more comments or is this change ok to integrate now?
I'd say, ship it... 🚢
/integrate
Going to push as commit 3fa6778.
Your commit was automatically rebased without conflicts.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10313/head:pull/10313
$ git checkout pull/10313
Update a local copy of the PR:
$ git checkout pull/10313
$ git pull https://git.openjdk.org/jdk pull/10313/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10313
View PR using the GUI difftool:
$ git pr show -t 10313
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10313.diff