8283091: Support type conversion between different data sizes in SLP #7806
Conversation
Good job! Did you forget the micro-benchmark VectorLoop.java?
Thanks for your review. Sorry for missing it. I'll add it ASAP :-)
Thanks. Done.
The patch is to support type conversion between different data sizes like ConvL2F and ConvL2I. Anybody can help review it? Thanks :) |
Can you explain why |
There seem to be conflicts with jdk master, so please merge with the latest version.
Thanks for your review. On NEON, there are no real vector instructions to do the conversion from double to int; it is implemented with 5 instructions (see jdk/src/hotspot/cpu/aarch64/aarch64_neon.ad, line 512, at commit 8c18705).
So shall we disable |
I'm afraid we can't. We still need to support it in VectorAPI. |
@fg1417 Thank you for suggesting this optimization. I see that it has not been updated for some time. Do you still intend to work on it? Please update to the latest JDK (Loom was integrated) and run performance again. Also include the % of changes. I have the same concern as @DamonFool about regression when vectorizing some conversions. Maybe we should have an additional |
And separate thank you for updating tests to verify correctness of this optimization! |
@vnkozlov thanks for your review and kind suggestion! I'll update the patch to resolve the potential performance regression. |
@fg1417 I don't see a new update in this PR. Please also show performance numbers with the new changes.
Here is the perf uplift data (ns/op) on different machines for the latest patch: NEON perf change (ns/op), SVE perf change (ns/op), X86 perf change (ns/op), AVX512 perf change (ns/op).
@vnkozlov I updated the patch to resolve the potential performance regression and also updated the performance numbers in the comment. But the patch depends on the fix in 8287517: C2: assert(vlen_in_bytes == 64) failed (Pull Request #8961 by sviswa7, openjdk/jdk), because when we enable the type conversion and loop induction at the same time, we can vectorize more scenarios. With MaxVectorSize=16, the case here would fail. All jtreg tests passed except that one.
Please help review. Thanks.
Which
But only webrev was affected so it is fine. |
Latest changes look good. I have one comment about method name.
These changes are related to #8877, which also limits vectors in the auto-vectorizer. #8877 has priority since it fixed that issue.
Also, the JDK 19 fork is coming in 3 days: https://openjdk.java.net/projects/jdk/19/
I would advise you to push these changes into JDK 20 after the fork.
I will start my testing and do approval based on results.
src/hotspot/share/opto/matcher.hpp (Outdated)
@@ -325,6 +325,10 @@ class Matcher : public PhaseTransform {
  // should generate this one.
  static const bool match_rule_supported(int opcode);

  // Identify extra cases that we might want to vectorize automatically
  // And exclude cases which are not profitable to auto-vectorize.
  static const bool match_rule_supported_vectorization(int opcode, int vlen, BasicType bt);
Maybe we should rename it to match_rule_supported_superword. That is how we call the auto-vectorizer in C2.
Done.
And updated after #8877. Thanks.
Good.
I submitted testing.
Please consider adding an IR framework test to make sure the expected vector nodes are generated.
Testing results are good.
@fg1417 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 45 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @sviswa7) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Added an IR framework testcase and updated to the latest JDK. Thanks for your review @vnkozlov.
Thank you for the test. @sviswa7, can you review the latest changes so we have 2 reviews?
Looks good to me.
/integrate
/sponsor
Going to push as commit a179590.
Your commit was automatically rebased without conflicts.
This change introduced a regression, see JDK-8290910.
After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
int <-> double
float <-> long
int <-> long
float <-> double
A typical test case:
int[] a;
double[] b;
for (int i = start; i < limit; i++) {
b[i] = (double) a[i];
}
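For reference, here is a self-contained, runnable version of the loop above (the start/limit bounds are replaced with full-array bounds purely for illustration):

```java
import java.util.Arrays;

public class ConvI2DLoop {
    // The int -> double conversion loop from the description above,
    // a candidate for SLP auto-vectorization with this patch.
    static double[] convert(int[] a) {
        double[] b = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = (double) a[i];
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(convert(new int[]{1, 2, 3})));
    }
}
```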
Our expected OptoAssembly code for one iteration is like below:
add R12, R2, R11, LShiftL #2
vector_load V16,[R12, #16]
vectorcast_i2d V16, V16 # convert I to D vector
add R11, R1, R11, LShiftL #3 # ptr
add R13, R11, #16 # ptr
vector_store [R13], V16
To enable the vectorization, the patch solves the following problems in the SLP.
There are three main operations in the case above: LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector? If we decide it separately for each operation node, as we did before the patch in SuperWord::combine_packs(), a 128-bit vector will hold 4 LoadI, 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes into a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing those 2 elements, the sequence becomes invalid. As a result, we should look through the whole def-use chain
and then pick the minimum of these vector sizes (in element counts), as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type.
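As a hypothetical illustration of the idea (a sketch in Java, not HotSpot's actual C++ code), the common pack size is limited by the largest element size anywhere in the def-use chain:

```java
import java.util.List;

public class PackSizeSketch {
    // Sketch of the idea behind SuperWord::max_vector_size_in_ud_chain():
    // with a fixed hardware vector width, the common pack size is the
    // minimum lane count supported by any node in the def-use chain,
    // i.e. it is limited by the largest element size in the chain.
    static int maxPackSize(int vectorWidthBytes, List<Integer> elemSizesInChain) {
        int maxElemSize = elemSizesInChain.stream().max(Integer::compare).orElseThrow();
        return vectorWidthBytes / maxElemSize;
    }

    public static void main(String[] args) {
        // LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes), 128-bit vector:
        // 4 LoadI lanes would fit on their own, but the chain limits us to 2 lanes.
        System.out.println(maxPackSize(16, List.of(4, 8, 8))); // 2
    }
}
```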
After this, LoadI nodes no longer make full use of the whole vector and occupy only part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to this situation.
In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. Here, 4 for LoadI and 8 for ConvI2D mean the same thing: both mark that the node is the second node in the whole vector, and the difference between 4 and 8 is only due to their own data sizes. In this situation, we should remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by rescaling the target alignment taken from the use node. We believe that, assuming the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16 and 24 and their def nodes, LoadI, have alignments of 8 and 12, these two LoadI nodes should be packed as a pair as well.
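The alignment normalization can be sketched as follows (the method name and shape are illustrative, not HotSpot's code): the use node's alignment is rescaled from its own data size to the def node's data size before lane positions are compared.

```java
public class AlignmentScaleSketch {
    // Rescale a use node's alignment (in bytes) from the use's data size
    // to the def's data size, so that the same lane position compares equal
    // even when a conversion changes the element size.
    static int scaleToDef(int useAlignment, int useDataSize, int defDataSize) {
        return useAlignment / useDataSize * defDataSize;
    }

    public static void main(String[] args) {
        // With a 512-bit vector: ConvI2D uses (8-byte lanes) at alignments
        // 16 and 24 correspond to LoadI defs (4-byte lanes) at 8 and 12.
        System.out.println(scaleToDef(16, 8, 4)); // 8
        System.out.println(scaleToDef(24, 8, 4)); // 12
    }
}
```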
Similarly, when determining whether the vectorization is profitable, a type conversion between different data sizes takes a type of one size and produces a type of another size, hence special checks on alignment and size should be applied, as we do in SuperWord::is_vector_use().
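That extra check can be sketched like this (hypothetical names and shape; the real logic lives in SuperWord::is_vector_use()): the def pack and use pack must have the same lane count, and each use lane's alignment, rescaled to the def's data size, must match the corresponding def lane's alignment.

```java
public class VectorUseCheckSketch {
    // Hypothetical sketch: a packed def feeds a packed use across a
    // size-changing conversion only if lane counts match and every use
    // lane maps to the same lane position on the def side.
    static boolean isVectorUse(int[] defAligns, int defSize,
                               int[] useAligns, int useSize) {
        if (defAligns.length != useAligns.length) {
            return false;
        }
        for (int i = 0; i < defAligns.length; i++) {
            if (useAligns[i] / useSize * defSize != defAligns[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 2 LoadI defs (4-byte lanes) at {0, 4} feeding
        // 2 ConvI2D uses (8-byte lanes) at {0, 8}: valid.
        System.out.println(isVectorUse(new int[]{0, 4}, 4, new int[]{0, 8}, 8)); // true
    }
}
```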
After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
Here is the test data (-XX:+UseSuperWord) on NEON:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
perf data on X86:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
perf data on AVX512:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7806/head:pull/7806
$ git checkout pull/7806
Update a local copy of the PR:
$ git checkout pull/7806
$ git pull https://git.openjdk.org/jdk pull/7806/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7806
View PR using the GUI difftool:
$ git pr show -t 7806
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7806.diff