Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Bhavana-Kilambi
Copy link
Contributor

@Bhavana-Kilambi Bhavana-Kilambi commented Feb 11, 2025

This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.

It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.

For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.

For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.

This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.

Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -

Benchmark                                      (size)   Mode    Cnt     Gain
SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49

Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8348868: AArch64: Add backend support for SelectFromTwoVector (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570
$ git checkout pull/23570

Update a local copy of the PR:
$ git checkout pull/23570
$ git pull https://git.openjdk.org/jdk.git pull/23570/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23570

View PR using the GUI difftool:
$ git pr show -t 23570

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23570.diff

Using Webrev

Link to Webrev Comment

Sorry, something went wrong.

This patch adds aarch64 backend support for SelectFromTwoVector
operation which was recently introduced in VectorAPI.

It implements this operation using a two table vector lookup instruction -
"tbl" which is available only in Neon and SVE2.

For 64-bit vector length : Neon tbl instruction is generated for T_SHORT
and T_BYTE types only.

For 128-bit vector length : Neon tbl instruction is generated if UseSVE <
2 and SVE2 "tbl" instruction is generated if UseSVE == 2.

For > 128-bit vector length : Currently there are no machines which have
vector length > 128-bit and support SVE2. For all those machines with vector
length > 128-bit and UseSVE < 2, this operation is not supported. The
inline expander for this operation would fail and lowered IR will be
generated which is a mix of two rearrange and one blend operation.

This patch also adds a boolean "need_load_shuffle" in the inline
expander for this operation to test if the platform requires
VectorLoadShuffle operation to be generated. Without this, the lowering
IR was not being generated on aarch64 and the performance was quite
poor.

Performance numbers with this patch on a 128-bit, SVE2 supporting
machine is shown below -

Benchmark                                      (size)   Mode    Cnt     Gain
SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49

Gain column refers to the ratio of thrpt between this patch and the
master branch after applying changes in the inline expander.
@bridgekeeper
Copy link

bridgekeeper bot commented Feb 11, 2025

👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Feb 11, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@Bhavana-Kilambi Bhavana-Kilambi marked this pull request as ready for review February 11, 2025 20:23
@openjdk
Copy link

openjdk bot commented Feb 11, 2025

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review labels Feb 11, 2025
@mlbridge
Copy link

mlbridge bot commented Feb 11, 2025

Webrevs

Copy link

@XiaohongGong XiaohongGong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks!


// --------------------------------SelectFromTwoVector -----------------------------

instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a similar rule for VectorRearrange such as rearrange_HS_neon. To unify, can we use the similar name style for this rule?

Suggested change
instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,
instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2,

(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the same match rule name in the format. Thanks!

ins_pipe(pipe_slow);
%}

instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add comment before the rule why v17 and v18 are used explicitly here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @theRealAph , apologies for the late response. The tbl instruction needs both the source registers to be consecutive and I could not find a way to make the register allocator choose two consecutive registers for this operation and decided to hard code them. As v0-v7 are used for function arguments, v8-v15 are non-volatile which are not needed for this purpose (as we dont want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 for the source registers. Please let me know if this is the right way to approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it is, yes. Thanks.

(UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
match(Set dst (SelectFromTwoVector (Binary index src1) src2));
effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful here. select_from_two_vectors_SIFNeon seems to alter src1, so you need a USE_KILL effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!

Maybe that should be USE_DEF or TEMP_DEF.

@Bhavana-Kilambi
Copy link
Contributor Author

Hi @XiaohongGong Thanks for your review comments :) I will get back soon with a new patchset addressing your comments.

Copy link
Contributor

@shqking shqking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.

data-1: UseSVE=0

                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 400.850304 1.109497            35229.489297 62.602965           87.88  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 201.425559 0.478769            18457.865560 21.655711           91.63  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 55.623907  0.238778            55.479367    0.259319            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.700079  0.073881            27.782368    0.125652            1.00   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.179064 0.490253            5137.062026  22.341864           47.48  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.354705  0.235878            2600.296050  11.659880           47.83  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 107.876699 0.362950            6092.072276  26.235411           56.47  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 54.173753  0.137934            3083.301351  23.996634           56.91  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.828919  0.197490            55.278519    0.543387            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.841811  0.197133            27.701294    0.170357            0.99   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.256878 0.610474            12284.067528 22.269728           57.87  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 106.237899 0.292940            6195.468269  10.163818           58.31  2048

Since "double" and "long" type are not supported on Neon, no obvious performance change is observed for selectFromDoubleVector or selectFromLongVector. It's as expected.

data-2: UseSVE=2

                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 401.283626 1.185346            35212.914922 48.146517           87.75  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 200.442895 0.354457            18484.335484 31.659515           92.21  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 56.093979  0.259369            3870.627049  15.037254           69.00  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.761792  0.150907            1981.828293  2.749076            71.38  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.203125 0.593284            5791.568827  14.214889           53.52  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.388489  0.238700            2956.726043  10.504617           54.36  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 108.362433 0.290180            9389.915021  84.968822           86.65  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 53.982112  0.210067            4790.062993  2.123039            88.73  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.583716  0.222332            4725.276744  6.347278            85.01  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.967713  0.143626            2328.371821  15.504931           83.25  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.137873 0.586753            18484.651452 8.215293            87.13  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 105.692702 0.641425            9386.506869  80.958276           88.80  2048

note-1: "double" and "long" are supported on SVE2, hence we observed obvious performance uplifts for selectFromDoubleVector and selectFromLongVector now. It's as expected.
note-2: I observed much difference between data-2 and your data listed in the commit msg. in your data, 1.4~1.7x is gained for "byte|float|int|short" types. However, my data is much bigger, i.e. 53~92x.
it's a bit wired.

V17, V17_H, V17_J, V17_K
);

// Class for vector register v18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use upper case

Suggested change
// Class for vector register v18
// Class for vector register V18

@@ -4221,6 +4221,16 @@ template<typename R, typename... Rx>
Assembler(CodeBuffer* code) : AbstractAssembler(code) {
}

// SVE2 programmable table lookup in two vector table
void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using sve_tbl here.

  1. the SVE1 insn is sve_tbl as well, but we can distinguish them thanks to function overloading
  2. following the same naming style of other sve2 instructions

@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Feb 18, 2025

Thanks for your review comments @shqking
This commit added mid-end support for SelectFromTwoVector operation - 709914f. It adds intrinsics for SelectFromTwoVector operation and on machines that do not support this operation, a lowering vector operation (VectorRearrange + VectorBlend combination) is generated.

On aarch64 after the above commit, we expect the lowering operations to be generated as we have support for both of these operations but in the inline expander for SelectFromTwoVector, it did not consider targets that do not need to generate VectorLoadShuffle node (like aarch64) for the Lowering operation -

!arch_supports_vector(Op_VectorLoadShuffle, num_elem, index_elem_bt, VecMaskNotUsed) ||
.
As a result, the compiler was not generating the VectorRearrange + VectorBlend operation on aarch64 as it is supposed to when SelectFromTwoVector is not supported. The default java impl was being executed which is too slow. So after my small change in vectorIntrinsics.cpp file, the Lowered vector operations are being correctly generated.

I felt it would be right to compare the numbers after the change I made in vectorIntrinsics.cpp file with this patch that adds support for SelectFromTwoVector so that we are comparing performance with (VectorRearrange + VectorBlend) vs SelectFromTwoVector rather than compare it with default java implementation. If we compare the performance of this patch with the master branch then the numbers you have shown are correct. Hope this explanation helps :)

@shqking
Copy link
Contributor

shqking commented Feb 19, 2025

Thanks for your explanation. Sounds reasonable to me. @Bhavana-Kilambi

Comment on lines +2734 to +2735
}
mulv(dst, size2, index, tmp1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use vector lsl instead of mul here, so that we can also support D types for NEON/SVE1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XiaohongGong , thanks I'll give it a try and get back.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bhavana-Kilambi , left shift can not get right indexes here as values 0x2, 0x4 is landed in each B lane. Maybe we can just try with bsl for D size types, as it has only two lanes for long/double types with 128-bit vector length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, bsl only accepts 8B/16B, but it can also work for other types. We need to keep all bits of the lane to 1/0 (e.g. [0xffffffffffffffff, 0x0000000000000000] for T2D type). You can take the implementation of VectorBlend as a reference.

BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I met the same issues. I have tested that using a pattern with bsl can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose src input or swap two elements in src based on the comparing result with bsl. Hope this could help you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your inputs. I'll look into this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Bhavana-Kilambi , I'v created a new PR #23790 to implement the VectorRearrange for small lane count vector types like 2D. I think the implementation is quite same with what we discussed here. Any feedback please let me know. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XiaohongGong thank you! I will check it out. Apologies for being so slow in responding (got pulled into something else). I will update this PR with my latest patch soon. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

None yet

4 participants