8312182: THPs cause huge RSS due to thread start timing issue #1679
This backport pull request has now been updated with issues from the original commit.
Process-wise and for the historical record, if you build a composite patch out of several issues, you need to at least mark those issues as fixed by this PR. Tell the bot like this:
/issue add 8310233 8312394 8312620 8314139 8312585
@tstuefe Adding additional issue to issue list:
Friendly ping. It would be good to get this fixed in time for the next CPU.
I have concerns about this, for several reasons:
This also does not look like a recent regression, but rather a long-standing bug, right?
@shipilev Thanks for looking at this. I debated with myself for a long time whether this was the right approach. I did not choose to build a composite patch out of laziness (if anything, downporting the issues separately and verbatim would have been much simpler, albeit slower). By providing a minimal (not cobbled together but carefully selected) patch I minimize the problem surface, because I leave out code that has nothing to do with the goal of this patch: the static hugepage detection of JDK-8310233 ("Fix THP detection on Linux").
We have two trailing bugs in product code: (a) JDK-8312394 is of no concern since it only affects code I explicitly left out of the patch (it does not affect THPs). It would be a concern were I to downport the patches individually and verbatim. That leaves (b), which is a bug in a super obscure context (building on Windows with WSL, which carries an arguably broken Linux kernel). One bug is not a very long tail.
I think this patch is just not that risky.
But that process would carry more risk since I would have to downport unnecessary parts and have time windows with unfixed bugs. This also shows a shortcoming of the review process. If I downport stuff verbatim, reviews are simple since they are mostly mechanical mental diffing; but that does not yield the ideal patch, which would be one that is small and confined.
We have customers running with THP always enabled who are unwilling or unable to change that; they would be happy about a fix. I'm fine with postponing this patch (not that I have any choice, since I lack reviews and the window is almost closed). But the whole discussion leaves me dissatisfied with the practice of downporting whole patch trees to get a single issue fixed. We recently had similar discussions when downporting openjdk/jdk11u-dev#2035, which ended up a far bigger change than necessary, carrying a lot of code for the sole purpose of keeping a low delta to upstream.
Okay, I'll withdraw. Let's do it piece by piece as usual.
Thanks! Yes, piece by piece would be the right approach here. I don't actually mind clustering several changes into one PR, as long as the PR commits tell the story well: what was picked, in what order, and what changes were made along the way. I also don't mind bringing in safe-ish improvements to the common code if that resolves a significant deviation from mainline. A palpable number of the deviations we made over the years bit us in the back at unfortunate times...
Unclean composite backport to jdk17u. Fixes JDK-8312182 - "THPs cause huge RSS due to thread start timing issue" (https://bugs.openjdk.org/browse/JDK-8312182)
Problem:
On a machine with transparent huge pages (THP) unconditionally enabled (/sys/kernel/mm/transparent_hugepage/enabled = "always"), the JVM may show a huge memory footprint (RSS) and degraded thread start performance.
The following factors make the problem more severe and more likely:
For a detailed discussion of the underlying problem, please see openjdk/jdk#14919.
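To check whether a machine is in the affected configuration, the sysfs file named above can be read directly. The kernel lists all modes in that file and marks the active one with square brackets, for example `[always] madvise never`. The following is a minimal sketch in Python; the function names `parse_thp_mode` and `read_thp_mode` are made up for illustration and are not part of the JDK or of this patch.

```python
import re
from pathlib import Path

# Path taken from the problem description above.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs content, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED):
    """Read the active THP mode; None if the file is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_thp_mode()
    print(f"THP mode: {mode}")
    if mode == "always":
        print("THP is unconditionally enabled; this is the affected setup.")
```

Keeping the parsing separate from the file access makes the logic testable on machines without THP support, where the sysfs file does not exist.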
In JDK head, the issue was fixed with a sequence of patches:
However, JDK-8312182 itself needed one preparatory fix:
and then we had several corner-case test problems which are fixed with:
and finally, we decided to rename the switch that allows switching off the THP mitigation, with a final patch:
Instead of downporting these 7 patches verbatim, I prepared a composite patch containing only the necessary mitigation and mitigation tests.
This is similar to the jdk11u downport, but in jdk17u, JDK-8303215 had already been backported. Therefore there are some minor differences.
This patch does:
The patch needs some infrastructure, but I downported only the necessary parts: the helper class "HugePages", which is used in head to scan the operating system for information about THP settings. I only included the parts that deal with THPs and left the rest out.
The patch also includes a regression test.
Testing:
I manually tested the JVM on Linux x64 with THP=always:
Without the patch (-Xmx1g -Xms1g -XX:+AlwaysPreTouch -Xss2m, 10000 threads started), I see slow thread startup and 11 GB - 14 GB of RSS.
The patched version comes up a lot faster and only shows 1.3 GB of RSS.
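As a rough plausibility check on these numbers: if every one of the 10000 thread stacks (`-Xss2m`) were fully resident, the stacks alone would account for about 19.5 GiB, or about 20.5 GiB together with the 1 GiB pre-touched heap. The observed 11 GB to 14 GB stays below that bound. The sketch below does this arithmetic and shows one way to read a process's RSS from `/proc/<pid>/status` on Linux. The helper names are hypothetical, and treating each stack as fully resident is a worst-case assumption, not something measured in this PR.

```python
import re

MIB = 1024 * 1024
GIB = 1024 * MIB


def worst_case_rss_bytes(threads, stack_bytes, heap_bytes):
    """Upper bound: every thread stack fully resident, plus the pre-touched heap.

    Assumption (not measured): each stack is resident in its entirety.
    """
    return threads * stack_bytes + heap_bytes


def parse_vmrss_kb(status_text):
    """Extract VmRSS (in kB, as reported by the kernel) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Settings from the manual test: 10000 threads, -Xss2m, -Xmx1g -Xms1g.
    bound = worst_case_rss_bytes(10_000, 2 * MIB, 1 * GIB)
    print(f"worst-case bound: {bound / GIB:.1f} GiB")
    try:
        with open("/proc/self/status") as f:
            print(f"own VmRSS: {parse_vmrss_kb(f.read())} kB")
    except OSError:
        print("no /proc available (not Linux?)")
```

To inspect a running JVM instead of the script itself, the same parser can be pointed at `/proc/<pid>/status` for the JVM's process ID.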
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk17u-dev.git pull/1679/head:pull/1679
$ git checkout pull/1679
Update a local copy of the PR:
$ git checkout pull/1679
$ git pull https://git.openjdk.org/jdk17u-dev.git pull/1679/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 1679
View PR using the GUI difftool:
$ git pr show -t 1679
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk17u-dev/pull/1679.diff