-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8332587: RISC-V: secondary_super_cache does not scale well #19320
Conversation
👋 Welcome back gcao! A progress list of the required criteria for merging this PR into |
@zifeihan This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 18 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@Hamlin-Li, @RealFYang) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Webrevs
|
Hey, is there performance comparison data when zbb is not supported? Just want to know if it can also get performance gain in that situation, then this optimization is a more generic one. The reason for ask this question is that seems JDK-8180450 is an optimization for cache line, it should or could bring benefit even if zbb is not supported? |
Hi, the following is an implementation using scalar assembly when zbb is not available. diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
index ce6696c18a8..93e1045d2d4 100644
--- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
+++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
@@ -3481,6 +3481,36 @@ void MacroAssembler::check_klass_subtype_slow_path(Register sub_klass,
bind(L_fallthrough);
}
+void MacroAssembler::population_count(Register dst, Register src,
+ Register tmp1, Register tmp2) {
+
+ if (UsePopCountInstruction) {
+ cpop(dst, src);
+ } else {
+ assert_different_registers(src, tmp1, tmp2);
+ assert_different_registers(dst, tmp1, tmp2);
+ Label loop, done;
+
+ mv(tmp1, src);
+ // dst = 0;
+ // while(tmp1 != 0) {
+ // dst++;
+ // tmp1 &= (tmp1 - 1);
+ // }
+ mv(dst, zr);
+ beqz(tmp1, done);
+ {
+ bind(loop);
+ addi(dst, dst, 1);
+ mv(tmp2, tmp1);
+ addi(tmp2, tmp2, -1);
+ andr(tmp1, tmp1, tmp2);
+ bnez(tmp1, loop);
+ }
+ bind(done);
+ }
+}
+
// Ensure that the inline code and the stub are using the same registers.
#define LOOKUP_SECONDARY_SUPERS_TABLE_REGISTERS \
do { \
@@ -3533,7 +3563,7 @@ bool MacroAssembler::lookup_secondary_supers_table(Register r_sub_klass,
// Get the first array index that can contain super_klass into r_array_index.
if (bit != 0) {
slli(r_array_index, r_bitmap, (Klass::SECONDARY_SUPERS_TABLE_MASK - bit));
- cpop(r_array_index, r_array_index);
+ population_count(r_array_index, r_array_index, t0, tmp1);
} else {
mv(r_array_index, (u1)1);
}
diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp
index 2575d5ea2ff..3e4930d5605 100644
--- a/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp
+++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp
@@ -322,6 +322,8 @@ class MacroAssembler: public Assembler {
Label* L_success,
Label* L_failure);
+ void population_count(Register dst, Register src, Register tmp1, Register tmp2);
+
// As above, but with a constant super_klass.
// The result is in Register result, not the condition codes.
bool lookup_secondary_supers_table(Register r_sub_klass,
JMH tested on HiFive Unmatched(has not Zbb):
JMH tested on LicheePi 4A(has not Zbb):
With the above test data, we see that there is a performance decrease from testNegative55 to testNegative64 when zbb is not available. This is because a loop is needed to count the number of 1 when Zbb is not available. Therefore, the current patch is only optimized when Zbb is available. |
Hi, Thanks for the review. Sorry for being late. I've implemented the scalar register version before, but tested it and had data performance decrease, so the scalar implementation is not included in this Patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation. Seems this patch will also use some instructions in zba/zbs extension. Will they also impact the final performance?
Also has some minor comments first.
I'm sorry for the Banana Pi BPI-F3 board data I forgot to specify that Banana Pi BPI-F3 board doesn't enable zba, zbb, zbs by default, here zbb is enabled manually by me. So the Banana Pi BPI-F3 board JMH data here is the performance JMH data without zba,zbs enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments.
@Hamlin-Li : Hi Hamlin, I've fixed all of the above comments. I've enabled Zba, Zbb, Zbs and tested JMH before and the performance has improved a bit compared to just enabling Zbb, I'll be re-testing the JMH with Zba, Zbb,Zbs enabled today and I'll update here when I'm done with it. |
JMH tested on Banana Pi BPI-F3 board (has Zba,Zbb,Zbs) and Enable UseZba, UseZbb, UseZbs
With patch:
|
There are a bit regression in cases of testNegative63/64, although these might be rare cases or not very common cases, but it's worth to have a try to improve it if possible. |
Hi, Sorry for being late. Thanks for the suggestion, I gave it a try. diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
index 61e8016db4a..af1649c061f 100644
--- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
+++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp
@@ -3664,6 +3664,32 @@ bool MacroAssembler::lookup_secondary_supers_table(Register r_sub_klass,
test_bit(t0, r_bitmap, bit);
beqz(t0, L_fallthrough);
+ // We will consult the secondary-super array.
+ ld(r_array_base, Address(r_sub_klass, in_bytes(Klass::secondary_supers_offset())));
+
+ mv(tmp3, r_bitmap);
+ if (bit != 0) {
+ ror_imm(tmp3, tmp3, bit);
+ }
+
+ Label skip;
+ addi(t0, tmp3, (u1)1);
+ bnez(t0, skip);
+
+ // Load the array length.
+ lwu(r_array_length, Address(r_array_base, Array<Klass*>::length_offset_in_bytes()));
+ // And adjust the array base to point to the data.
+ // NB! Effectively increments current slot index by 1.
+ assert(Array<Klass*>::base_offset_in_bytes() == wordSize, "");
+ addi(r_array_base, r_array_base, Array<Klass*>::base_offset_in_bytes());
+
+ repne_scan(r_array_base, r_super_klass, r_array_length, t0);
+ bne(r_super_klass, t0, L_fallthrough);
+ mv(result, zr);
+ beqz(result, L_fallthrough);
+
+ bind(skip);
+
// Get the first array index that can contain super_klass into r_array_index.
if (bit != 0) {
slli(r_array_index, r_bitmap, (Klass::SECONDARY_SUPERS_TABLE_MASK - bit));
@@ -3672,9 +3698,6 @@ bool MacroAssembler::lookup_secondary_supers_table(Register r_sub_klass,
mv(r_array_index, (u1)1);
}
- // We will consult the secondary-super array.
- ld(r_array_base, Address(r_sub_klass, in_bytes(Klass::secondary_supers_offset())));
-
// The value i in r_array_index is >= 1, so even though r_array_base
// points to the length, we don't need to adjust it to point to the data.
assert(Array<Klass*>::base_offset_in_bytes() == wordSize, "Adjust this code");
JMH tested on Banana Pi BPI-F3 board (has Zbb) and Enable UseZbb, Not Enable UseZba, UseZbs
After the fix it got better with the cases of testNegative63/64. As said above, this case is very rare, Is there a need for this additional complexity? |
Thanks for updating! With the fix, although it improves the perf for testNegative63/64, but seems it brings some regression for testNegative55-62, in this sense the fix should not be taken. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience.
I was thinking to jump to L_bitmap_full
in lookup_secondary_supers_table_slow_path
, in this way I guess it might address the performance issue when bitmap full, and not introduce regression in other cases. But I'm not sure how much complication will be brought into the implementation. So, let's skip this rare case optmization.
Some minor comment, otherwise looks good.
So many superinterfaces is very rate. So rare, in fact, that it may never have happened in production Java code. |
It's worth running the "before" test with |
JMH tested on Banana Pi BPI-F3 board (has Zbb) and Enable UseZbb, and use
With patch
@theRealAph : Hi, Can you take a look? Thanks |
That all looks like it's working as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@Hamlin-Li @RealFYang @theRealAph : In the case of scalar register implementations, as discussed above, |
I suggest to add scalar version (i.e. use population_count instead when zbb is not supported). |
Thanks all for the review. |
/sponsor |
Going to push as commit 001d686.
Your commit was automatically rebased without conflicts. |
@Hamlin-Li @zifeihan Pushed as commit 001d686. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
|
||
// Check for wraparound. | ||
Label skip; | ||
bge(r_array_length, r_array_index, skip); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this test is correct? I would have thought it would be bgt
. If length == index, then you must set index to 0. I would have expected this to fail testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Yes, I agree that we should use bgt
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this test is correct? I would have thought it would be
bgt
. If length == index, then you must set index to 0. I would have expected this to fail testing.
Thanks, I will fix it like blt(r_array_index,r_array_length,skip);
Linked JBS: https://bugs.openjdk.org/browse/JDK-8334843.
Implementation of subtype checking JDK-8180450 for linux-riscv64.
This optimization depends on availability of the Zbb extension which has the cpop instruction.
Correctness testing:
-XX:+VerifySecondarySupers
JMH tested on Banana Pi BPI-F3 board (has Zbb) and Enable UseZbb, Not Enable UseZba, UseZbs
Original:
With patch:
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19320/head:pull/19320
$ git checkout pull/19320
Update a local copy of the PR:
$ git checkout pull/19320
$ git pull https://git.openjdk.org/jdk.git pull/19320/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19320
View PR using the GUI difftool:
$ git pr show -t 19320
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19320.diff
Webrev
Link to Webrev Comment