-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8348868: AArch64: Add backend support for SelectFromTwoVector #23570
base: master
Are you sure you want to change the base?
Conversation
This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. For 64-bit vector length : Neon tbl instruction is generated for T_SHORT and T_BYTE types only. For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - Benchmark (size) Mode Cnt Gain SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
@Bhavana-Kilambi The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks!
|
||
// --------------------------------SelectFromTwoVector ----------------------------- | ||
|
||
instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a similar rule for VectorRearrange
such as rearrange_HS_neon
. To unify, can we use the similar name style for this rule?
instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2, | |
instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2, |
(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16)); | ||
match(Set dst (SelectFromTwoVector (Binary index src1) src2)); | ||
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2); | ||
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the same match rule name in the format. Thanks!
ins_pipe(pipe_slow); | ||
%} | ||
|
||
instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add comment before the rule why v17
and v18
are used explicitly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @theRealAph , apologies for the late response. The tbl instruction needs both the source registers to be consecutive and I could not find a way to make the register allocator choose two consecutive registers for this operation and decided to hard code them. As v0-v7 are used for function arguments, v8-v15 are non-volatile which are not needed for this purpose (as we dont want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 for the source registers. Please let me know if this is the right way to approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it is, yes. Thanks.
(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16)); | ||
match(Set dst (SelectFromTwoVector (Binary index src1) src2)); | ||
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2); | ||
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful here. select_from_two_vectors_SIFNeon
seems to alter src1
, so you need a USE_KILL
effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!
Maybe that should be USE_DEF or TEMP_DEF.
Hi @XiaohongGong Thanks for your review comments :) I will get back soon with a new patchset addressing your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
data-1: UseSVE=0
Before After Gain
Benchmark Mode Threads Samples Unit Score Score Error (99.9%) Score Score Error (99.9%) Ratio Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 400.850304 1.109497 35229.489297 62.602965 87.88 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 201.425559 0.478769 18457.865560 21.655711 91.63 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 55.623907 0.238778 55.479367 0.259319 0.99 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 27.700079 0.073881 27.782368 0.125652 1.00 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 108.179064 0.490253 5137.062026 22.341864 47.48 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 54.354705 0.235878 2600.296050 11.659880 47.83 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 107.876699 0.362950 6092.072276 26.235411 56.47 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 54.173753 0.137934 3083.301351 23.996634 56.91 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 55.828919 0.197490 55.278519 0.543387 0.99 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 27.841811 0.197133 27.701294 0.170357 0.99 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 212.256878 0.610474 12284.067528 22.269728 57.87 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 106.237899 0.292940 6195.468269 10.163818 58.31 2048
Since "double" and "long" type are not supported on Neon, no obvious performance change is observed for selectFromDoubleVector
or selectFromLongVector
. It's as expected.
data-2: UseSVE=2
Before After Gain
Benchmark Mode Threads Samples Unit Score Score Error (99.9%) Score Score Error (99.9%) Ratio Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 401.283626 1.185346 35212.914922 48.146517 87.75 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 200.442895 0.354457 18484.335484 31.659515 92.21 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 56.093979 0.259369 3870.627049 15.037254 69.00 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 27.761792 0.150907 1981.828293 2.749076 71.38 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 108.203125 0.593284 5791.568827 14.214889 53.52 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 54.388489 0.238700 2956.726043 10.504617 54.36 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 108.362433 0.290180 9389.915021 84.968822 86.65 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 53.982112 0.210067 4790.062993 2.123039 88.73 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 55.583716 0.222332 4725.276744 6.347278 85.01 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 27.967713 0.143626 2328.371821 15.504931 83.25 2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 212.137873 0.586753 18484.651452 8.215293 87.13 1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 105.692702 0.641425 9386.506869 80.958276 88.80 2048
note-1: "double" and "long" are supported on SVE2, hence we observed obvious performance uplifts for selectFromDoubleVector
and selectFromLongVector
now. It's as expected.
note-2: I observed much difference between data-2 and your data listed in the commit msg. in your data, 1.4~1.7x
is gained for "byte|float|int|short" types. However, my data is much bigger, i.e. 53~92x
.
it's a bit wired.
V17, V17_H, V17_J, V17_K | ||
); | ||
|
||
// Class for vector register v18 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use upper case
// Class for vector register v18 | |
// Class for vector register V18 |
@@ -4221,6 +4221,16 @@ template<typename R, typename... Rx> | |||
Assembler(CodeBuffer* code) : AbstractAssembler(code) { | |||
} | |||
|
|||
// SVE2 programmable table lookup in two vector table | |||
void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using sve_tbl
here.
- the SVE1 insn is
sve_tbl
as well, but we can distinguish them thanks to function overloading - following the same naming style of other sve2 instructions
Thanks for your review comments @shqking On aarch64 after the above commit, we expect the lowering operations to be generated as we have support for both of these operations but in the inline expander for SelectFromTwoVector, it did not consider targets that do not need to generate VectorLoadShuffle node (like aarch64) for the Lowering operation - jdk/src/hotspot/share/opto/vectorIntrinsics.cpp Line 2736 in e1d0a9c
As a result, the compiler was not generating the VectorRearrange + VectorBlend operation on aarch64 as it is supposed to when SelectFromTwoVector is not supported. The default java impl was being executed which is too slow. So after my small change in vectorIntrinsics.cpp file, the Lowered vector operations are being correctly generated. I felt it would be right to compare the numbers after the change I made in vectorIntrinsics.cpp file with this patch that adds support for SelectFromTwoVector so that we are comparing performance with (VectorRearrange + VectorBlend) vs SelectFromTwoVector rather than compare it with default java implementation. If we compare the performance of this patch with the master branch then the numbers you have shown are correct. Hope this explanation helps :) |
Thanks for your explanation. Sounds reasonable to me. @Bhavana-Kilambi |
} | ||
mulv(dst, size2, index, tmp1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use vector lsl
instead of mul
here, so that we can also support D types for NEON/SVE1 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XiaohongGong , thanks I'll give it a try and get back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bhavana-Kilambi , left shift can not get right indexes here as values 0x2, 0x4
is landed in each B lane. Maybe we can just try with bsl
for D size types, as it has only two lanes for long/double types with 128-bit vector length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, bsl
only accepts 8B/16B, but it can also work for other types. We need to keep all bits of the lane to 1/0 (e.g. [0xffffffffffffffff, 0x0000000000000000]
for T2D
type). You can take the implementation of VectorBlend
as a reference.
BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I met the same issues. I have tested that using a pattern with bsl
can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose src
input or swap two elements in src
based on the comparing result with bsl
. Hope this could help you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your inputs. I'll look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Bhavana-Kilambi , I'v created a new PR #23790 to implement the VectorRearrange
for small lane count vector types like 2D
. I think the implementation is quite same with what we discussed here. Any feedback please let me know. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XiaohongGong thank you! I will check it out. Apologies for being so slow in responding (got pulled into something else). I will update this PR with my latest patch soon. Thanks!
This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.
This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -
Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570
$ git checkout pull/23570
Update a local copy of the PR:
$ git checkout pull/23570
$ git pull https://git.openjdk.org/jdk.git pull/23570/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23570
View PR using the GUI difftool:
$ git pr show -t 23570
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23570.diff
Using Webrev
Link to Webrev Comment