8340241: RISC-V: Returns mispredicted #21406
@robehn This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit can be adjusted as needed with pull request commands such as /summary, /contributor and /issue.
Great finding. Apparently, we didn't realize the impact of these prediction hints before. Let me try this on hardware from other vendors to see.
Thanks! The issue in C2 is that you now need to kill CR if your code may, in any scenario, execute a JALR (assuming the code does return), and that is not obvious.
Hi, I witnessed a performance improvement on another vendor's hardware too. Minor comments after a cursory look. Will take a closer look. Thanks.
Ah. Now I see what you mean. Thanks.
Awesome, thanks!
Thanks for the update. Seems that we missed the jump_link in MacroAssembler::trampoline_call [1]? I also witnessed another place where we missed killing the rflags after this change. See comment for details.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4272
(PS: Ignore this, as I just noticed that it's a jal instead of a jalr by the jump_link.)
For C2 calls, i.e. when the compiler is doing a deliberate call, RFLAGS is SOC (save-on-call).
@robehn this pull request cannot be integrated as-is because of merge conflicts. To resolve them and update the pull request:
git checkout remove_t0
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Thanks for the update. I hope to finish the first round of review tomorrow. BTW: It would be good to know how this may affect other benchmark workloads, like specjbb2015, etc.
Thank you for the in-depth review! Maybe @Hamlin-Li can take it for a spin?
Latest version LGTM. Thanks for fixing this!
@RealFYang @luhenry thanks!
Hi, thanks for doing the check. Yeah, I think we should be safe to go.
/integrate
Going to push as commit 66ddaaa.
Your commit was automatically rebased without conflicts.
Hi, please consider.
RISC-V doesn't have dedicated call/ret instructions.
Instead, the registers used in the jal/jalr instructions determine whether this is a JUMP or a CALL/RET.
The CPU has a return-address stack (RAS) where it stores return addresses for prediction.
There are two possible link registers for calling conventions: x1 and x5 (or both, for co-routines).
This stack is updated according to this table (from the unpriv manual, 2.5.1. Unconditional Jumps) for JALR:
And additionally:
"A JAL instruction should push the return address onto a return-address stack (RAS) only when rd is 'x1' or x5."
As the JDK uses x5 (t0) as its main scratch register, all plain jumps are actually calls, and calls are co-routine calls (push and pop).
This causes performance issues, as the predictions are often wrong.
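A minimal sketch of the hint rules from that section of the unprivileged manual, written as a small Python model. The function names are hypothetical and this is not code from the patch; it only restates the rule that x1 and x5 are the registers treated as link registers.

```python
# x1 (ra) and x5 (t0) are the only registers the RAS hints treat as link registers.
LINK_REGS = {1, 5}


def jalr_ras_action(rd: int, rs1: int) -> str:
    """Return the RAS hint for a JALR with the given register numbers."""
    rd_link = rd in LINK_REGS
    rs1_link = rs1 in LINK_REGS
    if not rd_link and not rs1_link:
        return "none"
    if not rd_link:
        return "pop"            # rs1 is a link register: looks like a return
    if not rs1_link:
        return "push"           # rd is a link register: looks like a call
    # Both are link registers: co-routine style unless they are the same register.
    return "push" if rd == rs1 else "pop, then push"


def jal_ras_action(rd: int) -> str:
    """Return the RAS hint for a JAL: push only when rd is x1 or x5."""
    return "push" if rd in LINK_REGS else "none"
```

Under this model an indirect jump through t0, jalr_ras_action(0, 5), is a "pop", while the same jump through t1, jalr_ras_action(0, 6), is "none". A call that links in ra and jumps through t0, jalr_ras_action(1, 5), is "pop, then push".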
Average time for 10 best iterations (VF2):
For some workloads, e.g. a call to a small function in a loop, it really matters.
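The small-function-in-a-loop case can be illustrated with a toy RAS simulation. This is only a sketch under stated assumptions: the instruction sequence (a call, one indirect jump through the scratch register inside the callee, then a return), the addresses, and the helper names are all hypothetical, and a real predictor is more complex than a plain stack. It says nothing about the measured VF2 numbers.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0)


def step(ras, rd, rs1, return_addr):
    """Apply the JALR RAS hint; return the popped prediction, or None."""
    rd_link, rs1_link = rd in LINK_REGS, rs1 in LINK_REGS
    predicted = None
    # Pop when rs1 is a link register, except when rd is the same link register.
    if rs1_link and not (rd_link and rd == rs1):
        predicted = ras.pop() if ras else None
    # Push when rd is a link register.
    if rd_link:
        ras.append(return_addr)
    return predicted


def mispredicted_returns(scratch, iterations=1000):
    """Count returns whose RAS prediction does not match the real return address."""
    ras, misses = [], 0
    call_return_addr = 0x1004  # hypothetical address of the instruction after the call
    for _ in range(iterations):
        step(ras, rd=1, rs1=7, return_addr=call_return_addr)  # call: jalr ra, 0(x7)
        step(ras, rd=0, rs1=scratch, return_addr=0)           # jump inside the callee
        predicted = step(ras, rd=0, rs1=1, return_addr=0)     # ret: jalr x0, 0(ra)
        if predicted != call_return_addr:
            misses += 1
    return misses
```

In this model mispredicted_returns(5) misses on every iteration, because the jump through t0 pops the entry the call just pushed, while mispredicted_returns(6) misses on none.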
This patch blacklists x5 (t0) for JAL/JALR, as we only use the x1 calling convention.
It also changes all jumps to use x6 (t1) instead of x5 (t0).
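A sketch of what such a blacklist could look like at the point where a jump is emitted. The helper name and the error text are hypothetical, and it is an assumption here that the check covers both rd and rs1; the actual check in the patch may be shaped differently.

```python
T0 = 5  # x5, the register that must no longer appear in jumps


def check_jump_registers(mnemonic, rd, rs1=None):
    """Reject a JAL/JALR that would use x5 (t0) as destination or base register."""
    if rd == T0 or rs1 == T0:
        raise ValueError(f"{mnemonic}: x5 (t0) must not be used for jumps")
    return True
```

With this guard, check_jump_registers("jalr", rd=0, rs1=6) passes, while check_jump_registers("jalr", rd=0, rs1=5) raises, so a forgotten t0 shows up at code-emission time rather than as a silent misprediction.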
This patch was done incrementally, i.e. the first change removed the default t0.
I visited all places making jumps to make sure t1 was available.
Then I changed the default to t1 and removed the argument in many cases.
Other approaches were tested, e.g. completely switching t0 <-> t1.
This was much harder and more intrusive, as you need to do the switch completely in one go.
The use of x6 (t1) as the flag register in C2 was luckily not an issue, as RFLAGS is always killed when making a jump.
But please inspect this.
Note that jumping to a label was a bit more tricky. To solve that, this patch uses only JAL when no register is supplied, which is now the default. We never jump to a label so far away that we need a longer range.
But please consider this carefully.
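For reference when reviewing the label case: JAL encodes a signed, 2-byte-aligned offset of 21 bits, so it reaches roughly ±1 MiB from the jump. A minimal sketch of that range check follows; the function name is hypothetical and not taken from the patch.

```python
JAL_OFFSET_BITS = 21  # 20-bit immediate field, offsets are multiples of 2 bytes


def fits_in_jal(offset: int) -> bool:
    """Return True if a byte offset can be encoded in a single JAL."""
    lowest = -(1 << (JAL_OFFSET_BITS - 1))        # -1 MiB
    highest = (1 << (JAL_OFFSET_BITS - 1)) - 2    # +1 MiB - 2 bytes
    return offset % 2 == 0 and lowest <= offset <= highest
```

Any label whose distance fails this check would need the longer auipc + jalr sequence, which is the case the description says does not occur for label jumps.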
Secondly note CompiledICData was moved to x5(/t0), as x1+x6 (ra/t1) is used for the call.
Please inspect this also (as this can go silently unnoticed while causing the VEP to go into the runtime for an IC miss).
Arguably this is a performance bug, not an enhancement.
No issues found running tier1 to tier3 (t1->t3) with fastdebug; re-testing more to make sure.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21406/head:pull/21406
$ git checkout pull/21406
Update a local copy of the PR:
$ git checkout pull/21406
$ git pull https://git.openjdk.org/jdk.git pull/21406/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21406
View PR using the GUI difftool:
$ git pr show -t 21406
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21406.diff