New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet() #14200
Conversation
VectorMask.laneIsSet() [1] is implemented based on VectorMask.toLong() [2], and it's performance highly depends on the intrinsification of toLong(). However, if `toLong()` is failed to intrinsify, on some architectures or unsupported species, it's much more expensive than pure getBits(). Besides, some CPUs (e.g. with Arm Neon) may not have efficient instructions to implementation toLong(), so we propose to intrinsify VectorMask.laneIsSet separately. This patch optimize laneIsSet() by calling the existing intrinsic method VectorSupport.extract(), which actually does not introduce new intrinsic method. The C2 compiler intrinsification logic to support _VectorExtract has also been extended to better support laneIsSet(). It tries to extract the mask's lane value with an ExtractUB node if the hardware backend supports it. While on hardware without ExtractUB backend support , c2 will still try to generate toLong() related nodes, which behaves the same as before the patch. Key changes in this patch: 1. Reuse intrinsic `VectorSupport.extract()` in Java side. No new intrinsic method is introduced. 2. In compiler, `ExtractUBNode` is generated if backend support is. If not, the original "toLong" pattern is generated if it's implemented. Otherwise, it uses the default Java `getBits[i]` rather than the expensive and complicated toLong() based implementation. 3. Enable `ExtractUBNode` on AArch64 to extract the lane value for a vector mask in compiler, together with changing its bottom type to TypeInt::BOOL. This helps optimize the conditional selection generated by ``` public boolean laneIsSet(int i) { return VectorSupport.extract(..., defaultImpl) == 1L; } ``` [Test] hotspot:compiler/vectorapi and jdk/incubator/vector passed. [Performance] Below shows the performance gain on 128-bit vector size Neon machine. For 64 and 128 SPECIES, the improvment caused by this intrinsics. For other SPECIES which can not be intrinfied, performance gain comes from the default Java implementation changes, i.e. getBits[i] vs. toLong(). Benchmark Gain (after/before) microMaskLaneIsSetByte128_con 2.47 microMaskLaneIsSetByte128_var 1.82 microMaskLaneIsSetByte256_con 3.01 microMaskLaneIsSetByte256_var 3.04 microMaskLaneIsSetByte512_con 4.83 microMaskLaneIsSetByte512_var 4.86 microMaskLaneIsSetByte64_con 1.57 microMaskLaneIsSetByte64_var 1.18 microMaskLaneIsSetDouble128_con 1.19 microMaskLaneIsSetDouble128_var 1.13 microMaskLaneIsSetDouble256_con 1.33 microMaskLaneIsSetDouble256_var 1.40 microMaskLaneIsSetDouble512_con 1.62 microMaskLaneIsSetDouble512_var 1.70 microMaskLaneIsSetDouble64_con 1.18 microMaskLaneIsSetDouble64_var 1.21 microMaskLaneIsSetFloat128_con 1.56 microMaskLaneIsSetFloat128_var 1.25 microMaskLaneIsSetFloat256_con 1.63 microMaskLaneIsSetFloat256_var 1.68 microMaskLaneIsSetFloat512_con 2.13 microMaskLaneIsSetFloat512_var 2.18 microMaskLaneIsSetFloat64_con 1.21 microMaskLaneIsSetFloat64_var 1.11 microMaskLaneIsSetInt128_con 1.54 microMaskLaneIsSetInt128_var 1.24 microMaskLaneIsSetInt256_con 1.63 microMaskLaneIsSetInt256_var 1.72 microMaskLaneIsSetInt512_con 2.09 microMaskLaneIsSetInt512_var 2.19 microMaskLaneIsSetInt64_con 1.22 microMaskLaneIsSetInt64_var 1.13 microMaskLaneIsSetLong128_con 1.21 microMaskLaneIsSetLong128_var 1.08 microMaskLaneIsSetLong256_con 1.29 microMaskLaneIsSetLong256_var 1.41 microMaskLaneIsSetLong512_con 1.62 microMaskLaneIsSetLong512_var 1.67 microMaskLaneIsSetLong64_con 1.18 microMaskLaneIsSetLong64_var 1.21 microMaskLaneIsSetShort128_con 1.57 microMaskLaneIsSetShort128_var 1.19 microMaskLaneIsSetShort256_con 2.09 microMaskLaneIsSetShort256_var 2.17 microMaskLaneIsSetShort512_con 3.00 microMaskLaneIsSetShort512_var 3.08 microMaskLaneIsSetShort64_con 1.56 microMaskLaneIsSetShort64_var 1.25 No regression on x86. [1] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java#L72 [2] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int128Vector.java#L726 [3] https://github.com/openjdk/jdk/blob/3d3eaed9133dbe728ca8e00a626d33f7e35ba9ff/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java#L195 Change-Id: I3463624a04f29ba3f22356414e13ee9d13db7988
👋 Welcome back eliu! A progress list of the required criteria for merging this PR into |
/label remove core-libs |
@e1iu |
Compared with Vector.lane(), this patch does not use the switch-stmt for
The switch-stmt ensures the constant position for ExtractNode, which
Besides, even if the index is not a constant, we can still intrinsify Finally, to avoid performance regression on x86, this patch abandoned |
Webrevs
|
Could anyone can help to take a look? @PaulSandoz @jatin-bhateja |
This looks a good design, leveraging the extract intrinsic and C2 nodes. Java bits look good. For the benchmark you can simplify if you wish by returning the result rather than using an explicit black hole. Submitted the PR for tier 1-3 testing, will report back once complete. |
Getting crashes on linux-x64-debug when using these VM options:
For a test tag of Relevant bits from the HS error log file:
|
Sorry for the delay. I removed that assertion. Since in some cases, even predicated feature supported, vectormask can not be in used. |
@PaulSandoz @jatin-bhateja Could you help to take a look? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests pass. Java changes look good, but HotSpot reviews are also required.
@e1iu This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 674 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
/integrate |
Going to push as commit d4aacdb.
Your commit was automatically rebased without conflicts. |
VectorMask.laneIsSet() [1] is implemented based on VectorMask.toLong() [2], and it's performance highly depends on the intrinsification of toLong(). However, if
toLong()
is failed to intrinsify, on some architectures or unsupported species, it's much more expensive than pure getBits(). Besides, some CPUs (e.g. with Arm Neon) may not have efficient instructions to implementation toLong(), so we propose to intrinsify VectorMask.laneIsSet separately.This patch optimize laneIsSet() by calling the existing intrinsic method VectorSupport.extract(), which actually does not introduce new intrinsic method. The C2 compiler intrinsification logic to support _VectorExtract has also been extended to better support laneIsSet(). It tries to extract the mask's lane value with an ExtractUB node if the hardware backend supports it. While on hardware without ExtractUB backend support , c2 will still try to generate toLong() related nodes, which behaves the same as before the patch.
Key changes in this patch:
Reuse intrinsic
VectorSupport.extract()
in Java side. No new intrinsic method is introduced.In compiler,
ExtractUBNode
is generated if backend support is. If not, the original "toLong" pattern is generated if it's implemented. Otherwise, it uses the default JavagetBits[i]
rather than the expensive and complicated toLong() based implementation.Enable
ExtractUBNode
on AArch64 to extract the lane value for a vector mask in compiler, together with changing its bottom type to TypeInt::BOOL. This helps optimize the conditional selection generated by[Test]
hotspot:compiler/vectorapi and jdk/incubator/vector passed.
[Performance]
Below shows the performance gain on 128-bit vector size Neon machine. For 64 and 128 SPECIES, the improvment caused by this intrinsics. For other SPECIES which can not be intrinfied, performance gain comes from the default Java implementation changes, i.e. getBits[i] vs. toLong().
No regression on x86.
[1]
jdk/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java
Line 72 in 3d3eaed
[2]
jdk/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int128Vector.java
Line 726 in 3d3eaed
[3]
jdk/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java
Line 195 in 3d3eaed
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14200/head:pull/14200
$ git checkout pull/14200
Update a local copy of the PR:
$ git checkout pull/14200
$ git pull https://git.openjdk.org/jdk.git pull/14200/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 14200
View PR using the GUI difftool:
$ git pr show -t 14200
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14200.diff
Webrev
Link to Webrev Comment