Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8292289: [vectorapi] Improve the implementation of VectorTestNode #9855

Closed
wants to merge 31 commits into from

Conversation

merykitty
Copy link
Member

@merykitty merykitty commented Aug 12, 2022

This patch modifies the node generation of VectorSupport::test to emit a CMoveINode, which is picked up by BoolNode::Ideal(PhaseGVN*, bool) to connect the VectorTestNode directly to the BoolNode, removing the redundant operations of materialising the test result in a GP register and do a CmpI to get back the flags. As a result, VectorMask<T>::alltrue is compiled into machine codes:

    vptest xmm0, xmm1
    jb if_true
if_false:

instead of:

    vptest xmm0, xmm1
    setb r10
    movzbl r10
    testl r10
    jne if_true
if_false:

The results of jdk.incubator.vector.ArrayMismatchBenchmark shows noticeable improvements:

                                                                                   Before                  After
Benchmark                                      Prefix    Size   Mode  Cnt       Score      Error       Score     Error    Units  Change
ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt   10  217345.383 ± 8316.444  222279.381 ± 2660.983  ops/ms   +2.3%
ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt   10  113918.406 ± 1618.836  116268.691 ± 1291.899  ops/ms   +2.1%
ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt   10     702.066 ±   72.862     797.806 ±   16.429  ops/ms  +13.6%
ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt   10  146096.564 ± 2401.258  145338.910 ±  687.453  ops/ms   -0.5%
ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt   10   60598.181 ± 1259.397   69041.519 ± 1073.156  ops/ms  +13.9%
ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt   10     316.814 ±   10.975     408.770 ±    5.281  ops/ms  +29.0%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt   10  195674.549 ± 1200.166  188482.433 ± 1872.076  ops/ms   -3.7%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt   10   44357.169 ±  473.013   42293.411 ± 2838.255  ops/ms   -4.7%
ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt   10      68.199 ±    5.410      67.628 ±    3.241  ops/ms   -0.8%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt   10  107722.450 ± 1677.607  111060.400 ±  982.230  ops/ms   +3.1%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt   10   16692.645 ± 1002.599   21440.506 ± 1618.266  ops/ms  +28.4%
ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt   10      32.984 ±    0.548      33.202 ±    2.365  ops/ms   +0.7%
ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt   10  335458.217 ± 3154.842  379944.254 ± 5703.134  ops/ms  +13.3%
ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt   10   58505.302 ±  786.312   56721.368 ± 2497.052  ops/ms   -3.0%
ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt   10     133.037 ±   11.415     139.537 ±    4.667  ops/ms   +4.9%
ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt   10  117943.802 ± 2281.349  112409.365 ± 2110.055  ops/ms   -4.7%
ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt   10   27060.015 ±  795.619   33756.613 ±  826.533  ops/ms  +24.7%
ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt   10      57.558 ±    8.927      66.951 ±    4.381  ops/ms  +16.3%
ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt   10  182963.715 ± 1042.497  182438.405 ± 2120.832  ops/ms   -0.3%
ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt   10   36672.215 ±  614.821   35397.398 ± 1609.235  ops/ms   -3.5%
ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt   10      66.438 ±    2.142      65.427 ±    2.270  ops/ms   -1.5%
ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt   10  110393.047 ±  497.853  115165.845 ± 5381.674  ops/ms   +4.3%
ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt   10   14720.765 ±  661.350   19871.096 ±  201.464  ops/ms  +35.0%
ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt   10      30.760 ±    0.821      31.933 ±    1.352  ops/ms   +3.8%

I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8292289: [vectorapi] Improve the implementation of VectorTestNode

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9855/head:pull/9855
$ git checkout pull/9855

Update a local copy of the PR:
$ git checkout pull/9855
$ git pull https://git.openjdk.org/jdk pull/9855/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9855

View PR using the GUI difftool:
$ git pr show -t 9855

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9855.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 12, 2022

👋 Welcome back merykitty! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@merykitty merykitty changed the title [vectorapi] Improve the implementation of VectorTestNode 8292289: [vectorapi] Improve the implementation of VectorTestNode Aug 12, 2022
@openjdk
Copy link

openjdk bot commented Aug 12, 2022

@merykitty The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Aug 12, 2022
@merykitty
Copy link
Member Author

/label hotspot-compiler

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler hotspot-compiler-dev@openjdk.org labels Aug 12, 2022
@openjdk
Copy link

openjdk bot commented Aug 12, 2022

@merykitty
The hotspot-compiler label was successfully added.

@mlbridge
Copy link

mlbridge bot commented Aug 12, 2022

@XiaohongGong
Copy link

Please rebase this PR and resolve the conflicts. Then I can have an internal tests for the aarch64 change parts. Thanks a lot!

@openjdk
Copy link

openjdk bot commented Sep 23, 2022

@vnkozlov
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@merykitty
Copy link
Member Author

Similar to what has been discussed in #10192 , a mask of 512 bits is not matched into an XMMRegister, I have removed the untaken code paths.

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 10, 2022

@merykitty This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@merykitty
Copy link
Member Author

May I have another review for this PR, please? Thank you very much.

Copy link
Contributor

@shqking shqking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm running some tests on AArch64 platform (both Neon and SVE).

@merykitty
Copy link
Member Author

@shqking Thanks a lot for your reviews, I have addressed those in the commits

Copy link
Contributor

@shqking shqking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My test passed.

Performance testing.

Here shows the data on AArch64 Neon.

                                                                                          Before                 After
  Benchmark                                       (prefix)  (size)   Mode  Cnt        Score      Error       Score      Error   Units  Change
  ArrayMismatchBenchmark.mismatchVectorByte            0.5       9  thrpt    5   188228.938 ± 1304.492  188036.510 ± 1566.468  ops/ms   -0.1%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5     257  thrpt    5    50714.740 ±  251.973   52319.498 ±   73.201  ops/ms    3.2%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5  100000  thrpt    5      188.442 ±    0.829     226.975 ±    1.196  ops/ms   20.4%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0       9  thrpt    5   107464.833 ±    7.967  107461.853 ±   17.613  ops/ms    0.0%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0     257  thrpt    5    27873.228 ±  298.765   28854.655 ±  108.520  ops/ms    3.5%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0  100000  thrpt    5       91.318 ±    0.032      90.234 ±    0.049  ops/ms   -1.2%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5       9  thrpt    5   104424.609 ±   35.394  111375.725 ±  336.651  ops/ms    6.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5     257  thrpt    5     9466.861 ±   46.815    9523.362 ±   12.216  ops/ms    0.6%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5  100000  thrpt    5       21.572 ±    0.054      22.462 ±    0.273  ops/ms    4.1%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0       9  thrpt    5    65201.598 ± 1202.724   70891.579 ±  576.866  ops/ms    8.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0     257  thrpt    5     3931.683 ±    0.432    4241.834 ±    0.531  ops/ms    7.9%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0  100000  thrpt    5        9.641 ±    0.005      10.209 ±    0.007  ops/ms    5.9%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5       9  thrpt    5   112517.132 ± 1266.658  117607.730 ±  658.935  ops/ms    4.5%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5     257  thrpt    5    14627.711 ±  135.210   19549.735 ±  169.208  ops/ms   33.6%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5  100000  thrpt    5       40.599 ±    0.116      45.500 ±    0.105  ops/ms   12.1%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0       9  thrpt    5    86951.685 ±  770.519   88705.394 ±  112.681  ops/ms    2.0%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0     257  thrpt    5     8229.636 ±    6.450    9400.670 ±    0.555  ops/ms   14.2%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0  100000  thrpt    5       20.437 ±    0.032      25.996 ±    0.142  ops/ms   27.2%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5       9  thrpt    5    98752.429 ±    8.477  106053.731 ±  168.779  ops/ms    7.4%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5     257  thrpt    5     9486.680 ±  113.035    9888.039 ±   10.357  ops/ms    4.2%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5  100000  thrpt    5       22.884 ±    0.118      22.469 ±    0.096  ops/ms   -1.8%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0       9  thrpt    5    72150.373 ±  441.019   71092.863 ±  746.174  ops/ms   -1.5%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0     257  thrpt    5     4604.599 ±   42.457    4690.037 ±    7.356  ops/ms    1.9%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0  100000  thrpt    5       11.147 ±    0.013      11.423 ±    0.012  ops/ms    2.5%

Here shows the data on AArch64 256-bit SVE.

                                                                                           Before                 After
  Benchmark                                       (prefix)  (size)   Mode  Cnt        Score       Error       Score       Error   Units  Change
  ArrayMismatchBenchmark.mismatchVectorByte            0.5       9  thrpt    5   332188.434 ±  1867.441  326994.114 ±  9458.795  ops/ms   -1.6%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5     257  thrpt    5   107444.966 ±  5050.526  100516.133 ±  1436.484  ops/ms   -6.4%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5  100000  thrpt    5      440.107 ±     0.135     460.557 ±     0.276  ops/ms    4.6%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0       9  thrpt    5   194751.414 ±  1218.965  196489.976 ±    70.422  ops/ms    0.9%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0     257  thrpt    5    68305.755 ±   102.463   71301.912 ±   214.791  ops/ms    4.4%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0  100000  thrpt    5      213.639 ±     0.310     212.501 ±     0.200  ops/ms   -0.5%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5       9  thrpt    5   184926.046 ±  1429.361  197673.463 ±  2065.066  ops/ms    6.9%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5     257  thrpt    5    27664.974 ±   211.233   30272.798 ±   122.976  ops/ms    9.4%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5  100000  thrpt    5       82.780 ±     0.078      72.316 ±     0.121  ops/ms  -12.6%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0       9  thrpt    5   133433.039 ±    23.047  138097.066 ±   321.764  ops/ms    3.5%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0     257  thrpt    5     9332.847 ±    47.940    9679.395 ±    15.648  ops/ms    3.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0  100000  thrpt    5       25.563 ±     0.010      29.525 ±     1.410  ops/ms   15.5%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5       9  thrpt    5   409670.146 ± 15888.302  385940.625 ±  6430.431  ops/ms   -5.8%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5     257  thrpt    5    36565.150 ±  1295.056   39837.700 ±    82.828  ops/ms    8.9%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5  100000  thrpt    5      115.997 ±     0.986     112.612 ±     0.280  ops/ms   -2.9%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0       9  thrpt    5   153095.509 ±   760.043  159605.937 ±   114.691  ops/ms    4.3%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0     257  thrpt    5    20747.445 ±    28.624   21301.590 ±    64.918  ops/ms    2.7%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0  100000  thrpt    5       52.865 ±     0.033      53.757 ±     0.134  ops/ms    1.7%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5       9  thrpt    5   177529.884 ±   145.103  178435.461 ±  2410.473  ops/ms    0.5%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5     257  thrpt    5    20538.232 ±     7.532   20563.490 ±    53.205  ops/ms    0.1%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5  100000  thrpt    5       50.875 ±     0.736      52.826 ±     0.058  ops/ms    3.8%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0       9  thrpt    5   135797.506 ±   333.638  138437.942 ±    97.186  ops/ms    1.9%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0     257  thrpt    5    10561.460 ±    74.946   10337.813 ±    39.726  ops/ms   -2.1%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0  100000  thrpt    5       26.027 ±     0.020      26.224 ±     0.046  ops/ms    0.8%

I think the performance is acceptable.

Jtreg testing

  1. on AARCH64 Neon, I ran tier1~3.
  2. on AArch64 SVE, I ran the cases under the following directories
  "test/hotspot/jtreg/compiler/vectorapi/"
  "test/jdk/jdk/incubator/vector/"
  "test/hotspot/jtreg/compiler/vectorization/"

Besides the CMOVE_I issue in TestVectorTest.java as I mentioned before, all other test cases passed.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to run testing again and also do our performance testing.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regular testing passed. I am waiting performance results.

@vnkozlov
Copy link
Contributor

vnkozlov commented Dec 7, 2022

Rerunning DaCapo and Renaissance which shows some variations.

@merykitty
Copy link
Member Author

@vnkozlov I believe VectorTestNode is only used in Vector API, which should not affect those benchmarks?

@vnkozlov
Copy link
Contributor

vnkozlov commented Dec 8, 2022

@vnkozlov I believe VectorTestNode is only used in Vector API, which should not affect those benchmarks?

Yes, but you have changes in general code related to Cmp node.
I see that DaCapo don't show regression after rerun. Renaissance is still running and numbers are "all over places" - it is not stable.

Anyway, I am approving this changes based on data I got.

@merykitty
Copy link
Member Author

@vnkozlov Ah I get it, thanks a lot for your reviews, can I merge the patch now?

@vnkozlov
Copy link
Contributor

vnkozlov commented Dec 8, 2022

Yes, you can integrate.

@merykitty
Copy link
Member Author

Thank everyone for your kind reviews and suggestions.
/integrate

@openjdk
Copy link

openjdk bot commented Dec 8, 2022

Going to push as commit 3dfadee.
Since your change was applied there have been 238 commits pushed to the master branch:

  • d35e840: 8297295: Remove ThreadGroup.allowThreadSuspension
  • 175e3d3: 8296149: Start of release updates for JDK 21
  • d562d3f: 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination
  • fc52f21: 8298255: JFR provide information about dynamization of number of compiler threads
  • e555d54: 8298383: JFR: GenerateJfrFiles.java lacks copyright header
  • c084431: 8298379: JFR: Some UNTIMED events only sets endTime
  • ea108f5: 8298129: Let checkpoint event sizes grow beyond u4 limit
  • 165dcdd: 8297718: Make NMT free:ing protocol more granular
  • fbe7b00: 8298173: GarbageCollectionNotificationContentTest test failed: no decrease in Eden usage
  • d8ef60b: 8298272: Clean up ProblemList
  • ... and 228 more: https://git.openjdk.org/jdk/compare/2f83b5c487f112c175d081ca5882f5032518937a...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 8, 2022
@openjdk openjdk bot closed this Dec 8, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 8, 2022
@openjdk
Copy link

openjdk bot commented Dec 8, 2022

@merykitty Pushed as commit 3dfadee.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@merykitty merykitty deleted the improveVTest branch April 27, 2023 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

None yet

4 participants