Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet() #14200

Closed
wants to merge 2 commits into from

Conversation

e1iu
Copy link
Member

@e1iu e1iu commented May 29, 2023

VectorMask.laneIsSet() [1] is implemented based on VectorMask.toLong() [2], and it's performance highly depends on the intrinsification of toLong(). However, if toLong() is failed to intrinsify, on some architectures or unsupported species, it's much more expensive than pure getBits(). Besides, some CPUs (e.g. with Arm Neon) may not have efficient instructions to implementation toLong(), so we propose to intrinsify VectorMask.laneIsSet separately.

This patch optimize laneIsSet() by calling the existing intrinsic method VectorSupport.extract(), which actually does not introduce new intrinsic method. The C2 compiler intrinsification logic to support _VectorExtract has also been extended to better support laneIsSet(). It tries to extract the mask's lane value with an ExtractUB node if the hardware backend supports it. While on hardware without ExtractUB backend support , c2 will still try to generate toLong() related nodes, which behaves the same as before the patch.

Key changes in this patch:

  1. Reuse intrinsic VectorSupport.extract() in Java side. No new intrinsic method is introduced.

  2. In compiler, ExtractUBNode is generated if backend support is. If not, the original "toLong" pattern is generated if it's implemented. Otherwise, it uses the default Java getBits[i] rather than the expensive and complicated toLong() based implementation.

  3. Enable ExtractUBNode on AArch64 to extract the lane value for a vector mask in compiler, together with changing its bottom type to TypeInt::BOOL. This helps optimize the conditional selection generated by

    
        public boolean laneIsSet(int i) {
            return VectorSupport.extract(..., defaultImpl) == 1L;
        }
    
    

[Test]
hotspot:compiler/vectorapi and jdk/incubator/vector passed.

[Performance]

Below shows the performance gain on 128-bit vector size Neon machine. For 64 and 128 SPECIES, the improvment caused by this intrinsics. For other SPECIES which can not be intrinfied, performance gain comes from the default Java implementation changes, i.e. getBits[i] vs. toLong().

Benchmark                               Gain (after/before)
microMaskLaneIsSetByte128_con           2.47
microMaskLaneIsSetByte128_var           1.82
microMaskLaneIsSetByte256_con           3.01
microMaskLaneIsSetByte256_var           3.04
microMaskLaneIsSetByte512_con           4.83
microMaskLaneIsSetByte512_var           4.86
microMaskLaneIsSetByte64_con            1.57
microMaskLaneIsSetByte64_var            1.18
microMaskLaneIsSetDouble128_con         1.19
microMaskLaneIsSetDouble128_var         1.13
microMaskLaneIsSetDouble256_con         1.33
microMaskLaneIsSetDouble256_var         1.40
microMaskLaneIsSetDouble512_con         1.62
microMaskLaneIsSetDouble512_var         1.70
microMaskLaneIsSetDouble64_con          1.18
microMaskLaneIsSetDouble64_var          1.21
microMaskLaneIsSetFloat128_con          1.56
microMaskLaneIsSetFloat128_var          1.25
microMaskLaneIsSetFloat256_con          1.63
microMaskLaneIsSetFloat256_var          1.68
microMaskLaneIsSetFloat512_con          2.13
microMaskLaneIsSetFloat512_var          2.18
microMaskLaneIsSetFloat64_con           1.21
microMaskLaneIsSetFloat64_var           1.11
microMaskLaneIsSetInt128_con            1.54
microMaskLaneIsSetInt128_var            1.24
microMaskLaneIsSetInt256_con            1.63
microMaskLaneIsSetInt256_var            1.72
microMaskLaneIsSetInt512_con            2.09
microMaskLaneIsSetInt512_var            2.19
microMaskLaneIsSetInt64_con             1.22
microMaskLaneIsSetInt64_var             1.13
microMaskLaneIsSetLong128_con           1.21
microMaskLaneIsSetLong128_var           1.08
microMaskLaneIsSetLong256_con           1.29
microMaskLaneIsSetLong256_var           1.41
microMaskLaneIsSetLong512_con           1.62
microMaskLaneIsSetLong512_var           1.67
microMaskLaneIsSetLong64_con            1.18
microMaskLaneIsSetLong64_var            1.21
microMaskLaneIsSetShort128_con          1.57
microMaskLaneIsSetShort128_var          1.19
microMaskLaneIsSetShort256_con          2.09
microMaskLaneIsSetShort256_var          2.17
microMaskLaneIsSetShort512_con          3.00
microMaskLaneIsSetShort512_var          3.08
microMaskLaneIsSetShort64_con           1.56
microMaskLaneIsSetShort64_var           1.25

No regression on x86.

[1]


[2]
[3]


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet() (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14200/head:pull/14200
$ git checkout pull/14200

Update a local copy of the PR:
$ git checkout pull/14200
$ git pull https://git.openjdk.org/jdk.git pull/14200/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14200

View PR using the GUI difftool:
$ git pr show -t 14200

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14200.diff

Webrev

Link to Webrev Comment

VectorMask.laneIsSet() [1] is implemented based on VectorMask.toLong()
[2], and it's performance highly depends on the intrinsification of
toLong(). However, if `toLong()` is failed to intrinsify, on some
architectures or unsupported species, it's much more expensive than pure
getBits(). Besides, some CPUs (e.g. with Arm Neon) may not have
efficient instructions to implementation toLong(), so we propose to
intrinsify VectorMask.laneIsSet separately.

This patch optimize laneIsSet() by calling the existing intrinsic method
VectorSupport.extract(), which actually does not introduce new intrinsic
method. The C2 compiler intrinsification logic to support _VectorExtract
has also been extended to better support laneIsSet(). It tries to
extract the mask's lane value with an ExtractUB node if the hardware
backend supports it. While on hardware without ExtractUB backend support
, c2 will still try to generate toLong() related nodes, which behaves
the same as before the patch.

Key changes in this patch:

1. Reuse intrinsic `VectorSupport.extract()` in Java side. No new
   intrinsic method is introduced.
2. In compiler, `ExtractUBNode` is generated if backend support is. If
   not, the original "toLong" pattern is generated if it's implemented.
   Otherwise, it uses the default Java `getBits[i]` rather than the
   expensive and complicated toLong() based implementation.
3. Enable `ExtractUBNode` on AArch64 to extract the lane value for a
   vector mask in compiler, together with changing its bottom type to
   TypeInt::BOOL. This helps optimize the conditional selection
   generated by

   ```

       public boolean laneIsSet(int i) {
           return VectorSupport.extract(..., defaultImpl) == 1L;
       }

   ```

[Test]
hotspot:compiler/vectorapi and jdk/incubator/vector passed.

[Performance]

Below shows the performance gain on 128-bit vector size Neon machine.
For 64 and 128 SPECIES, the improvment caused by this intrinsics.
For other SPECIES which can not be intrinfied, performance gain comes
from the default Java implementation changes, i.e. getBits[i] vs.
toLong().

Benchmark                               Gain (after/before)
microMaskLaneIsSetByte128_con           2.47
microMaskLaneIsSetByte128_var           1.82
microMaskLaneIsSetByte256_con           3.01
microMaskLaneIsSetByte256_var           3.04
microMaskLaneIsSetByte512_con           4.83
microMaskLaneIsSetByte512_var           4.86
microMaskLaneIsSetByte64_con            1.57
microMaskLaneIsSetByte64_var            1.18
microMaskLaneIsSetDouble128_con         1.19
microMaskLaneIsSetDouble128_var         1.13
microMaskLaneIsSetDouble256_con         1.33
microMaskLaneIsSetDouble256_var         1.40
microMaskLaneIsSetDouble512_con         1.62
microMaskLaneIsSetDouble512_var         1.70
microMaskLaneIsSetDouble64_con          1.18
microMaskLaneIsSetDouble64_var          1.21
microMaskLaneIsSetFloat128_con          1.56
microMaskLaneIsSetFloat128_var          1.25
microMaskLaneIsSetFloat256_con          1.63
microMaskLaneIsSetFloat256_var          1.68
microMaskLaneIsSetFloat512_con          2.13
microMaskLaneIsSetFloat512_var          2.18
microMaskLaneIsSetFloat64_con           1.21
microMaskLaneIsSetFloat64_var           1.11
microMaskLaneIsSetInt128_con            1.54
microMaskLaneIsSetInt128_var            1.24
microMaskLaneIsSetInt256_con            1.63
microMaskLaneIsSetInt256_var            1.72
microMaskLaneIsSetInt512_con            2.09
microMaskLaneIsSetInt512_var            2.19
microMaskLaneIsSetInt64_con             1.22
microMaskLaneIsSetInt64_var             1.13
microMaskLaneIsSetLong128_con           1.21
microMaskLaneIsSetLong128_var           1.08
microMaskLaneIsSetLong256_con           1.29
microMaskLaneIsSetLong256_var           1.41
microMaskLaneIsSetLong512_con           1.62
microMaskLaneIsSetLong512_var           1.67
microMaskLaneIsSetLong64_con            1.18
microMaskLaneIsSetLong64_var            1.21
microMaskLaneIsSetShort128_con          1.57
microMaskLaneIsSetShort128_var          1.19
microMaskLaneIsSetShort256_con          2.09
microMaskLaneIsSetShort256_var          2.17
microMaskLaneIsSetShort512_con          3.00
microMaskLaneIsSetShort512_var          3.08
microMaskLaneIsSetShort64_con           1.56
microMaskLaneIsSetShort64_var           1.25

No regression on x86.

[1] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java#L72
[2] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int128Vector.java#L726
[3] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java#L195

Change-Id: I3463624a04f29ba3f22356414e13ee9d13db7988
@bridgekeeper
Copy link

bridgekeeper bot commented May 29, 2023

👋 Welcome back eliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 29, 2023
@openjdk
Copy link

openjdk bot commented May 29, 2023

@e1iu The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels May 29, 2023
@e1iu
Copy link
Member Author

e1iu commented May 29, 2023

/label remove core-libs

@openjdk openjdk bot removed the core-libs core-libs-dev@openjdk.org label May 29, 2023
@openjdk
Copy link

openjdk bot commented May 29, 2023

@e1iu
The core-libs label was successfully removed.

@e1iu
Copy link
Member Author

e1iu commented May 29, 2023

Compared with Vector.lane(), this patch does not use the switch-stmt for
VectorMask.laneIsSet() to transform variable index to constant, e.g.,

    public byte lane(int i) {
        switch(i) {
            case 0: return laneHelper(0);
            case 1: return laneHelper(1);
            case 2: return laneHelper(2);
            case 3: return laneHelper(3);
            case 4: return laneHelper(4);
            case 5: return laneHelper(5);
            case 6: return laneHelper(6);
            case 7: return laneHelper(7);
            case 8: return laneHelper(8);
            case 9: return laneHelper(9);
            case 10: return laneHelper(10);
            case 11: return laneHelper(11);
            case 12: return laneHelper(12);
            case 13: return laneHelper(13);
                 ...
                 ...
            default: throw new IllegalArgumentException("Index " + i + " must be zero or positive, and less than " + VLENGTH);
        }
    }

The switch-stmt ensures the constant position for ExtractNode, which
may be friendly for most architectures to generate efficient code. E.g.,
on AArch64, single instruction umov can extract scalar element from
vector for constant position. Nevertheless, it has some defects:

  1. The switch-stmt is NOT cost free.
  2. The switch-stmt maybe too huge to be inlined to callee. Thus it can
    hurt the performance.
  3. It can NOT work for max species.

Besides, even if the index is not a constant, we can still intrinsify
the code to use TBL insn. Hence in my implementation, I've relax the
intrinsification condition to accept non-constant index.

Finally, to avoid performance regression on x86, this patch abandoned
this approach, generated backend rules for ExtractUB accepting both
immeidiate and register.

@mlbridge
Copy link

mlbridge bot commented May 29, 2023

Webrevs

@e1iu
Copy link
Member Author

e1iu commented Jun 19, 2023

Could anyone can help to take a look? @PaulSandoz @jatin-bhateja

@PaulSandoz
Copy link
Member

This looks a good design, leveraging the extract intrinsic and C2 nodes. Java bits look good. For the benchmark you can simplify if you wish by returning the result rather than using an explicit black hole. Submitted the PR for tier 1-3 testing, will report back once complete.

@PaulSandoz
Copy link
Member

Getting crashes on linux-x64-debug when using these VM options:

-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting

For a test tag of tier3-vector-avx512 when running tests for open/test/jdk/:jdk_vector.

Relevant bits from the HS error log file:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/cd627e65-f015-4fb1-a1d2-b6c9b8127f98-S9618/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/24abb99d-ff0d-447b-9153-bb7048d6d487/runs/d1fc7a0c-634f-4df4-9d5f-6a464bb177b9/workspace/open/src/hotspot/share/opto/vectorIntrinsics.cpp:2586), pid=1090716, tid=1090731
#  assert(!Matcher::has_predicated_vectors()) failed: should be
#
...
Current CompileTask:
C2:   9402  989    b        jdk.incubator.vector.Int256Vector$Int256Mask::laneIsSet (38 bytes)

Stack: [0x00007fa325bfc000,0x00007fa325cfd000],  sp=0x00007fa325cf78f0,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1815626]  LibraryCallKit::inline_vector_extract()+0xc26  (vectorIntrinsics.cpp:2586)
V  [libjvm.so+0x121cc94]  LibraryIntrinsic::generate(JVMState*)+0x1c4  (library_call.cpp:117)
V  [libjvm.so+0x853141]  CallGenerator::do_late_inline_helper()+0x9b1  (callGenerator.cpp:695)
V  [libjvm.so+0x9ea704]  Compile::inline_incrementally_one()+0xd4  (compile.cpp:2015)
V  [libjvm.so+0x9eb873]  Compile::inline_incrementally(PhaseIterGVN&)+0x273  (compile.cpp:2098)
V  [libjvm.so+0x9eebc7]  Compile::Optimize()+0x427  (compile.cpp:2233)
V  [libjvm.so+0x9f1f75]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1aa5  (compile.cpp:839)
V  [libjvm.so+0x84bc04]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x3c4  (c2compiler.cpp:118)
V  [libjvm.so+0x9fdf10]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xa00  (compileBroker.cpp:2265)
V  [libjvm.so+0x9fed98]  CompileBroker::compiler_thread_loop()+0x618  (compileBroker.cpp:1944)
V  [libjvm.so+0xeb6dec]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:719)
V  [libjvm.so+0x17970aa]  Thread::call_run()+0xba  (thread.cpp:217)
V  [libjvm.so+0x149715c]  thread_native_entry(Thread*)+0x11c  (os_linux.cpp:775)
...
model name	: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt_good wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities
...

Change-Id: Ib223c4048b29875a62a27d6081ad36a125dec144
@e1iu
Copy link
Member Author

e1iu commented Jul 4, 2023

@PaulSandoz

Sorry for the delay.

I removed that assertion. Since in some cases, even predicated feature supported, vectormask can not be in used.

@e1iu
Copy link
Member Author

e1iu commented Jul 11, 2023

@PaulSandoz @jatin-bhateja Could you help to take a look? Thanks.

Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests pass. Java changes look good, but HotSpot reviews are also required.

@openjdk
Copy link

openjdk bot commented Jul 17, 2023

@e1iu This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet()

Reviewed-by: psandoz, xgong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 674 new commits pushed to the master branch:

  • a742767: 8312246: NPE when HSDB visits bad oop
  • 37c756a: 8305506: Add support for fractional values of SafepointTimeoutDelay
  • dfe764e: 8309032: jpackage does not work for module projects unless --module-path is specified
  • 61ab270: 8310835: Address gaps in -Xlint:serial checks
  • 5d57b5c: 6983726: Reimplement MethodHandleProxies.asInterfaceInstance
  • 71cac8c: 8311663: Additional refactoring of Locale tests to JUnit
  • aa23fd9: 8311879: SA ClassWriter generates invalid invokedynamic code
  • 6f66213: 8312014: [s390x] TestSigInfoInHsErrFile.java Failure
  • b5b6f4e: 8312164: Refactor Arrays.hashCode for long, boolean, double, float, and Object arrays
  • 14cf035: 8302987: Add uniform and spatially equidistributed bounded double streams to RandomGenerator
  • ... and 664 more: https://git.openjdk.org/jdk/compare/e21f865d84c7c861843ff568019e1ad11d280a50...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 17, 2023
Copy link

@XiaohongGong XiaohongGong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@e1iu
Copy link
Member Author

e1iu commented Jul 21, 2023

/integrate

@openjdk
Copy link

openjdk bot commented Jul 21, 2023

Going to push as commit d4aacdb.
Since your change was applied there have been 685 commits pushed to the master branch:

  • 783de32: 8300051: assert(JvmtiEnvBase::environments_might_exist()) failed: to enter event controller, JVM TI environments must exist
  • 4e8f331: 8312443: sun.security should use toLowerCase(Locale.ROOT)
  • d7b9416: 8199149: Improve the exception message thrown by VarHandle of unsupported operation
  • 354c660: 8307185: pkcs11 native libraries make JNI calls into java code while holding GC lock
  • bae2247: 8308591: JLine as the default Console provider
  • b772e67: 8312395: Improve assertions in growableArray
  • 9fa944e: 8312019: Simplify and modernize java.util.BitSet.equals
  • fe41910: 8312459: Problem list java/awt/GraphicsDevice/DisplayModes/CycleDMImage.java for macOS
  • 8d29329: 8312320: Remove javax/rmi/ssl/SSLSocketParametersTest.sh from ProblemList
  • 94eb44b: 8312394: [linux] SIGSEGV if kernel was built without hugepage support
  • ... and 675 more: https://git.openjdk.org/jdk/compare/e21f865d84c7c861843ff568019e1ad11d280a50...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 21, 2023
@openjdk openjdk bot closed this Jul 21, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 21, 2023
@openjdk
Copy link

openjdk bot commented Jul 21, 2023

@e1iu Pushed as commit d4aacdb.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@e1iu e1iu deleted the ENTLLT-6278-external branch November 3, 2023 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
3 participants