8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Bhavana-Kilambi · 2025-02-11T20:20:54Z

This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.

It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.

For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.

For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.

This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.

Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -

Benchmark                                      (size)   Mode    Cnt     Gain
SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49

Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8348868: AArch64: Add backend support for SelectFromTwoVector (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570
$ git checkout pull/23570

Update a local copy of the PR:
$ git checkout pull/23570
$ git pull https://git.openjdk.org/jdk.git pull/23570/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23570

View PR using the GUI difftool:
$ git pr show -t 23570

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23570.diff

Using Webrev

Link to Webrev Comment

This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. For 64-bit vector length : Neon tbl instruction is generated for T_SHORT and T_BYTE types only. For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - Benchmark (size) Mode Cnt Gain SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.

bridgekeeper · 2025-02-11T20:21:45Z

👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-02-11T20:22:47Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-02-11T20:23:17Z

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-02-11T20:27:43Z

Webrevs

00: Full (65bd521d)

XiaohongGong

Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks!

XiaohongGong · 2025-02-12T02:34:20Z

src/hotspot/cpu/aarch64/aarch64_vector.ad

+
+// --------------------------------SelectFromTwoVector -----------------------------
+
+instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,


We have a similar rule for VectorRearrange such as rearrange_HS_neon. To unify, can we use the similar name style for this rule?

Suggested change

instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,

instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2,

XiaohongGong · 2025-02-12T02:35:00Z

src/hotspot/cpu/aarch64/aarch64_vector.ad

+            (UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
+  match(Set dst (SelectFromTwoVector (Binary index src1) src2));
+  effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
+  format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}


Please use the same match rule name in the format. Thanks!

XiaohongGong · 2025-02-12T03:00:13Z

src/hotspot/cpu/aarch64/aarch64_vector.ad

+  ins_pipe(pipe_slow);
+%}
+
+instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{


Could you please add comment before the rule why v17 and v18 are used explicitly here?

I'm still curious.

Hi @theRealAph , apologies for the late response. The tbl instruction needs both the source registers to be consecutive and I could not find a way to make the register allocator choose two consecutive registers for this operation and decided to hard code them. As v0-v7 are used for function arguments, v8-v15 are non-volatile which are not needed for this purpose (as we dont want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 for the source registers. Please let me know if this is the right way to approach.

I suppose it is, yes. Thanks.

theRealAph · 2025-02-12T08:58:20Z

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4

+            (UseSVE < 2 || Matcher::vector_length_in_bytes(n) < 16));
+  match(Set dst (SelectFromTwoVector (Binary index src1) src2));
+  effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
+  format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}


Be careful here. select_from_two_vectors_SIFNeon seems to alter src1, so you need a USE_KILL effect.

@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!

@theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks!

Maybe that should be USE_DEF or TEMP_DEF.

Bhavana-Kilambi · 2025-02-12T13:50:50Z

Hi @XiaohongGong Thanks for your review comments :) I will get back soon with a new patchset addressing your comments.

shqking

Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.

data-1: UseSVE=0

                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 400.850304 1.109497            35229.489297 62.602965           87.88  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 201.425559 0.478769            18457.865560 21.655711           91.63  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 55.623907  0.238778            55.479367    0.259319            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.700079  0.073881            27.782368    0.125652            1.00   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.179064 0.490253            5137.062026  22.341864           47.48  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.354705  0.235878            2600.296050  11.659880           47.83  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 107.876699 0.362950            6092.072276  26.235411           56.47  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 54.173753  0.137934            3083.301351  23.996634           56.91  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.828919  0.197490            55.278519    0.543387            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.841811  0.197133            27.701294    0.170357            0.99   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.256878 0.610474            12284.067528 22.269728           57.87  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 106.237899 0.292940            6195.468269  10.163818           58.31  2048

Since "double" and "long" type are not supported on Neon, no obvious performance change is observed for selectFromDoubleVector or selectFromLongVector. It's as expected.

data-2: UseSVE=2

                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 401.283626 1.185346            35212.914922 48.146517           87.75  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 200.442895 0.354457            18484.335484 31.659515           92.21  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 56.093979  0.259369            3870.627049  15.037254           69.00  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.761792  0.150907            1981.828293  2.749076            71.38  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.203125 0.593284            5791.568827  14.214889           53.52  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.388489  0.238700            2956.726043  10.504617           54.36  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 108.362433 0.290180            9389.915021  84.968822           86.65  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 53.982112  0.210067            4790.062993  2.123039            88.73  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.583716  0.222332            4725.276744  6.347278            85.01  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.967713  0.143626            2328.371821  15.504931           83.25  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.137873 0.586753            18484.651452 8.215293            87.13  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 105.692702 0.641425            9386.506869  80.958276           88.80  2048

note-1: "double" and "long" are supported on SVE2, hence we observed obvious performance uplifts for selectFromDoubleVector and selectFromLongVector now. It's as expected.
note-2: I observed much difference between data-2 and your data listed in the commit msg. in your data, 1.4~1.7x is gained for "byte|float|int|short" types. However, my data is much bigger, i.e. 53~92x.
it's a bit wired.

shqking · 2025-02-18T14:02:04Z

src/hotspot/cpu/aarch64/aarch64.ad

+    V17, V17_H, V17_J, V17_K
+);
+
+// Class for vector register v18


nit: use upper case

Suggested change

// Class for vector register v18

// Class for vector register V18

shqking · 2025-02-18T14:08:17Z

src/hotspot/cpu/aarch64/assembler_aarch64.hpp

@@ -4221,6 +4221,16 @@ template<typename R, typename... Rx>
  Assembler(CodeBuffer* code) : AbstractAssembler(code) {
  }

+  // SVE2 programmable table lookup in two vector table
+  void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1,


I suggest using sve_tbl here.

the SVE1 insn is sve_tbl as well, but we can distinguish them thanks to function overloading

following the same naming style of other sve2 instructions

Bhavana-Kilambi · 2025-02-18T15:34:11Z

Thanks for your review comments @shqking
This commit added mid-end support for SelectFromTwoVector operation - 709914f. It adds intrinsics for SelectFromTwoVector operation and on machines that do not support this operation, a lowering vector operation (VectorRearrange + VectorBlend combination) is generated.

On aarch64 after the above commit, we expect the lowering operations to be generated as we have support for both of these operations but in the inline expander for SelectFromTwoVector, it did not consider targets that do not need to generate VectorLoadShuffle node (like aarch64) for the Lowering operation -

jdk/src/hotspot/share/opto/vectorIntrinsics.cpp

Line 2736 in e1d0a9c

    
           !arch_supports_vector(Op_VectorLoadShuffle, num_elem, index_elem_bt, VecMaskNotUsed) ||

.
As a result, the compiler was not generating the VectorRearrange + VectorBlend operation on aarch64 as it is supposed to when SelectFromTwoVector is not supported. The default java impl was being executed which is too slow. So after my small change in vectorIntrinsics.cpp file, the Lowered vector operations are being correctly generated.

I felt it would be right to compare the numbers after the change I made in vectorIntrinsics.cpp file with this patch that adds support for SelectFromTwoVector so that we are comparing performance with (VectorRearrange + VectorBlend) vs SelectFromTwoVector rather than compare it with default java implementation. If we compare the performance of this patch with the master branch then the numbers you have shown are correct. Hope this explanation helps :)

shqking · 2025-02-19T01:28:46Z

Thanks for your explanation. Sounds reasonable to me. @Bhavana-Kilambi

XiaohongGong · 2025-02-19T07:40:23Z

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp

+  }
+  mulv(dst, size2, index, tmp1);


Can we use vector lsl instead of mul here, so that we can also support D types for NEON/SVE1 ?

@XiaohongGong , thanks I'll give it a try and get back.

@Bhavana-Kilambi , left shift can not get right indexes here as values 0x2, 0x4 is landed in each B lane. Maybe we can just try with bsl for D size types, as it has only two lanes for long/double types with 128-bit vector length.

Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl.

Yes, bsl only accepts 8B/16B, but it can also work for other types. We need to keep all bits of the lane to 1/0 (e.g. [0xffffffffffffffff, 0x0000000000000000] for T2D type). You can take the implementation of VectorBlend as a reference.

BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I met the same issues. I have tested that using a pattern with bsl can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose src input or swap two elements in src based on the comparing result with bsl. Hope this could help you!

Thank you for your inputs. I'll look into this.

Hi @Bhavana-Kilambi , I'v created a new PR #23790 to implement the VectorRearrange for small lane count vector types like 2D. I think the implementation is quite same with what we discussed here. Any feedback please let me know. Thanks!

@XiaohongGong thank you! I will check it out. Apologies for being so slow in responding (got pulled into something else). I will update this PR with my latest patch soon. Thanks!

Bhavana-Kilambi marked this pull request as ready for review February 11, 2025 20:23

openjdk bot added hotspot-compiler rfr labels Feb 11, 2025

XiaohongGong reviewed Feb 12, 2025

View reviewed changes

theRealAph reviewed Feb 12, 2025

View reviewed changes

shqking reviewed Feb 18, 2025

View reviewed changes

XiaohongGong reviewed Feb 19, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8348868: AArch64: Add backend support for SelectFromTwoVector #23570

8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Bhavana-Kilambi commented Feb 11, 2025 •

edited by openjdk bot

Loading

bridgekeeper bot commented Feb 11, 2025

openjdk bot commented Feb 11, 2025

openjdk bot commented Feb 11, 2025

mlbridge bot commented Feb 11, 2025

XiaohongGong left a comment

XiaohongGong Feb 12, 2025

XiaohongGong Feb 12, 2025

XiaohongGong Feb 12, 2025

theRealAph Feb 19, 2025

Bhavana-Kilambi Feb 19, 2025

theRealAph Feb 20, 2025

theRealAph Feb 12, 2025

Bhavana-Kilambi Feb 12, 2025

theRealAph Feb 12, 2025

Bhavana-Kilambi commented Feb 12, 2025

shqking left a comment

shqking Feb 18, 2025

shqking Feb 18, 2025

Bhavana-Kilambi commented Feb 18, 2025 •

edited

Loading

shqking commented Feb 19, 2025

XiaohongGong Feb 19, 2025

Bhavana-Kilambi Feb 19, 2025

XiaohongGong Feb 20, 2025

Bhavana-Kilambi Feb 20, 2025

XiaohongGong Feb 24, 2025

Bhavana-Kilambi Feb 24, 2025

XiaohongGong Feb 27, 2025

Bhavana-Kilambi Mar 3, 2025


		// --------------------------------SelectFromTwoVector -----------------------------

		instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,

	instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2,
	instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2,

	// Class for vector register v18
	// Class for vector register V18

8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Are you sure you want to change the base?

8348868: AArch64: Add backend support for SelectFromTwoVector #23570

Conversation

Bhavana-Kilambi commented Feb 11, 2025 • edited by openjdk bot Loading

Progress

Issue

Reviewing

bridgekeeper bot commented Feb 11, 2025

openjdk bot commented Feb 11, 2025

openjdk bot commented Feb 11, 2025

mlbridge bot commented Feb 11, 2025

Webrevs

XiaohongGong left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bhavana-Kilambi commented Feb 12, 2025

shqking left a comment

Choose a reason for hiding this comment

data-1: UseSVE=0

data-2: UseSVE=2

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bhavana-Kilambi commented Feb 18, 2025 • edited Loading

shqking commented Feb 19, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bhavana-Kilambi commented Feb 11, 2025 •

edited by openjdk bot

Loading

Bhavana-Kilambi commented Feb 18, 2025 •

edited

Loading