8313406: nep_invoker_blob can be simplified more #15089

Closed
wants to merge 1 commit into from

Member

@YaSuenag YaSuenag commented Jul 31, 2023

In FFM, a native function is called via nep_invoker_blob. For a function with two arguments, the stub looks like this:

Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
--------------------------------------------------------------------------------
  0x00007fcae394cd80: pushq %rbp
  0x00007fcae394cd81: movq %rsp, %rbp
  0x00007fcae394cd84: subq $0, %rsp
 ;; { argument shuffle
  0x00007fcae394cd88: movq %r8, %rax
  0x00007fcae394cd8b: movq %rsi, %r10
  0x00007fcae394cd8e: movq %rcx, %rsi
  0x00007fcae394cd91: movq %rdx, %rdi
 ;; } argument shuffle
  0x00007fcae394cd94: callq *%r10
  0x00007fcae394cd97: leave
  0x00007fcae394cd98: retq

subq $0, %rsp reserves shadow space on the stack, and movq %r8, %rax passes the number of vector arguments for a variadic function, so neither is necessary in some cases. They should be removed when they are not needed, as follows:

Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
--------------------------------------------------------------------------------
  0x00007fd8778e2880: pushq %rbp
  0x00007fd8778e2881: movq %rsp, %rbp
 ;; { argument shuffle
  0x00007fd8778e2884: movq %rsi, %r10
  0x00007fd8778e2887: movq %rcx, %rsi
  0x00007fd8778e288a: movq %rdx, %rdi
 ;; } argument shuffle
  0x00007fd8778e288d: callq *%r10
  0x00007fd8778e2890: leave
  0x00007fd8778e2891: retq
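
The difference between the two listings can be modeled with a small toy emitter. This is only a sketch and not HotSpot's stub generator: the class name `StubSketch` and the parameters `shadowSpaceBytes` and `needsVectorCount` are hypothetical, chosen to illustrate that the two instructions are emitted only when they do something.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT HotSpot's stub generator.
class StubSketch {
    // shadowSpaceBytes and needsVectorCount are hypothetical parameters.
    static List<String> emit(int shadowSpaceBytes, boolean needsVectorCount) {
        List<String> code = new ArrayList<>();
        code.add("pushq %rbp");
        code.add("movq %rsp, %rbp");
        // Reserve stack space only when there is something to reserve.
        if (shadowSpaceBytes > 0) {
            code.add("subq $" + shadowSpaceBytes + ", %rsp");
        }
        // Set %rax only when the callee expects it.
        if (needsVectorCount) {
            code.add("movq %r8, %rax");
        }
        // The remaining moves are the same in both listings.
        code.add("movq %rsi, %r10");
        code.add("movq %rcx, %rsi");
        code.add("movq %rdx, %rdi");
        code.add("callq *%r10");
        code.add("leave");
        code.add("retq");
        return code;
    }

    public static void main(String[] args) {
        // Prints the shorter instruction sequence.
        emit(0, false).forEach(System.out::println);
    }
}
```

With `emit(0, false)` the model produces the same eight instructions as the second listing.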

All java/foreign jtreg tests pass.

We can see this stub code in an ffmasm testcase by running with -XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode and the hsdis library. This testcase links the code with Linker.Option.isTrivial().

After this change, FFM performance on another ffmasm testcase was improved:

before:

Benchmark                           Mode  Cnt          Score          Error  Units
FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s

after:

Benchmark                           Mode  Cnt          Score          Error  Units
FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s

Environment:

  • CPU: AMD Ryzen 3 3300X
  • OS: Fedora 38 x86_64 (Kernel 6.3.8-200.fc38.x86_64)
  • Hyper-V 4vCPU, 8GB mem

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8313406: nep_invoker_blob can be simplified more (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15089/head:pull/15089
$ git checkout pull/15089

Update a local copy of the PR:
$ git checkout pull/15089
$ git pull https://git.openjdk.org/jdk.git pull/15089/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15089

View PR using the GUI difftool:
$ git pr show -t 15089

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15089.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Jul 31, 2023

👋 Welcome back ysuenaga! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Jul 31, 2023

@YaSuenag The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Jul 31, 2023
@YaSuenag YaSuenag marked this pull request as ready for review July 31, 2023 15:03
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 31, 2023

mlbridge bot commented Jul 31, 2023

Webrevs

Member

@JornVernee JornVernee left a comment

This looks good. I'm running tests in our CI.

> After this change, FFM performance on another ffmasm testcase was improved:

If you consider the 'error' in those benchmarks, only the first 2 digits of the throughput are significant, so I don't think we can claim that this improves performance based on these numbers alone.
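
As a rough check of that point, the reported invokeFFMRDTSC scores and errors can be compared directly. This is a minimal sketch: the class `ErrorOverlap` and its `overlaps` helper are hypothetical names, and treating each result as a simple score ± error interval is an assumption made for illustration.

```java
class ErrorOverlap {
    // True when [a - ea, a + ea] and [b - eb, b + eb] share at least one point.
    static boolean overlaps(double a, double ea, double b, double eb) {
        return Math.abs(a - b) <= ea + eb;
    }

    public static void main(String[] args) {
        // invokeFFMRDTSC numbers copied from the benchmark tables in this PR.
        double before = 106664071.816, beforeErr = 14396524.718;
        double after = 107622971.525, afterErr = 12249767.134;

        System.out.printf("difference: %.3f ops/s%n", after - before);
        System.out.println("intervals overlap: "
                + overlaps(before, beforeErr, after, afterErr));
    }
}
```

The difference between the two scores is under one million ops/s while each error is above twelve million, so the intervals overlap.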

However, I think the improvement in code quality is also a plus, just for making things easier to read/understand. I think these changes are non-invasive, and IIRC in the past we've always set RAX mostly since the code shape at that time didn't allow us to avoid it easily. So, I'm okay with this patch.

Member

@JornVernee JornVernee left a comment

All green


openjdk bot commented Aug 1, 2023

@YaSuenag This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8313406: nep_invoker_blob can be simplified more

Reviewed-by: jvernee, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 157 new commits pushed to the master branch:

  • 595fdd3: 8314059: Remove PKCS7.verify()
  • 49b2984: 8313854: Some tests in serviceability area fail on localized Windows platform
  • c132176: 8114830: (fs) Files.copy fails due to interference from something else changing the file system
  • e56d3bc: 8313657: com.sun.jndi.ldap.Connection.cleanup does not close connections on SocketTimeoutErrors
  • 4b2703a: 8313678: SymbolTable can leak Symbols during cleanup
  • f41c267: 8314045: ArithmeticException in GaloisCounterMode
  • 911d1db: 8314078: HotSpotConstantPool.lookupField() asserts due to field changes in ConstantPool.cpp
  • 6574dd7: 8314025: Remove JUnit-based test in java/lang/invoke from problem list
  • 207bd00: 8313756: [BACKOUT] 8308682: Enhance AES performance
  • 823f5b9: 8308850: Change JVM options with small ranges that get -Wconversion warnings to 32 bits
  • ... and 147 more: https://git.openjdk.org/jdk/compare/6fca28988794b52a6aa974bed1ed6f4f07e0994b...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 1, 2023
Member

JornVernee commented Aug 2, 2023

FWIW, if you want to look into reducing the generated code further, I think we can potentially reduce the amount of shuffling between registers that's needed by reordering the arguments on the Java side so that each VMStorage corresponding to an argument of the leaf method handle is the same as the register for that argument in the Java calling convention.

I think the right place to do this is in DowncallLinker where we are creating the NativeEntryPoint. The way I think it should work:

  1. compute the Java calling convention's argument registers for the leaf method type.
  2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered method type, such that the VMStorage/type for a particular argument index matches the register for the same index used in the Java calling convention as much as possible.
  3. use the re-ordered VMStorage[] + MethodType to create the native entry point + native method handle
  4. apply the same reordering in reverse to the arguments of the created native method handle (using MethodHandles::permuteArguments) so that the resulting method handle has the original argument order/method type.

Pushing this shuffling to the Java side will allow the JIT to reduce data motion, and this should result in reduced shuffling being needed overall I think.
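
Step 4 above can be sketched with plain method handles. This is only an illustration of how MethodHandles::permuteArguments restores an argument order. The class `PermuteSketch` and its `reordered` method are hypothetical stand-ins for the re-ordered native method handle, and no native code is involved.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PermuteSketch {
    // Hypothetical stand-in for the re-ordered handle: it takes (b, c, a).
    static String reordered(long b, long c, long a) {
        return "a=" + a + " b=" + b + " c=" + c;
    }

    // Builds a handle that accepts (a, b, c) and forwards to reordered(b, c, a).
    static MethodHandle restoreOriginalOrder() throws ReflectiveOperationException {
        MethodType type = MethodType.methodType(
                String.class, long.class, long.class, long.class);
        MethodHandle target = MethodHandles.lookup()
                .findStatic(PermuteSketch.class, "reordered", type);
        // reorder[i] is the index of the caller's argument that feeds
        // parameter i of the target: b <- 1, c <- 2, a <- 0.
        return MethodHandles.permuteArguments(target, type, 1, 2, 0);
    }

    // Convenience wrapper so callers do not have to handle Throwable.
    static String call(long a, long b, long c) {
        try {
            return (String) restoreOriginalOrder().invokeExact(a, b, c);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(1L, 2L, 3L)); // prints "a=1 b=2 c=3"
    }
}
```

The caller still sees the original (a, b, c) order, while the target receives its parameters in the re-ordered positions.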

Member Author

YaSuenag commented Aug 2, 2023

@JornVernee Thanks for your review! I will integrate this when I get a second reviewer.

> I think we can potentially reduce the amount of shuffling between registers that's needed by reordering the arguments on the Java side so that each VMStorage corresponding to an argument of the leaf method handle is the same as the register for that argument in the Java calling convention.

It would be great! I guess you suggested that ArgumentShuffle in HotSpot moves into DowncallLinker, right? To be honest, I don't understand this well yet, and I have no testbed other than Linux x64, so it is difficult for me to work on this now.

Again, this idea is great. I'd like to call native functions via FFM with less overhead, so I'm happy to help if I can.

Member

JornVernee commented Aug 2, 2023

> I guess you suggested that ArgumentShuffle in HotSpot moves into DowncallLinker, right?

No, ArgumentShuffle should stay inside HotSpot. We cannot do all the shuffling on the Java side. We can only eliminate some of the register moves that are needed by re-ordering the arguments on the Java side.

For instance, if you look at the comment in assembler_x86.hpp, where we define j_rarg* Register constants, you'll see this:

//        |-------------------------------------------------------|
//        | c_rarg0   c_rarg1  c_rarg2 c_rarg3 c_rarg4 c_rarg5    |
//        |-------------------------------------------------------|
//        | rcx       rdx      r8      r9      rdi*    rsi*       | windows (* not a c_rarg)
//        | rdi       rsi      rdx     rcx     r8      r9         | solaris/linux
//        |-------------------------------------------------------|
//        | j_rarg5   j_rarg0  j_rarg1 j_rarg2 j_rarg3 j_rarg4    |
//        |-------------------------------------------------------|

i.e. all the registers in the Java calling convention are 'off by one' compared to the native calling convention. This makes sense for JNI since we need to prepend the JNIEnv* to the start of the argument list, but it doesn't make sense for Panama.

Let's say we have a native function taking six longs. On Linux/x64 the VMStorage[] for the arguments (the one we use when creating the NativeEntryPoint inside DowncallLinker) would be:

[rdi, rsi, rdx, rcx, r8, r9]

i.e. the first argument we pass on the Java side gets moved (by the downcall stub) into rdi, the second into rsi, etc. This doesn't match the incoming registers of the Java calling convention, where the first argument is passed in rsi, the second is passed in rdx, etc. (i.e. off-by-one).

We can simply re-arrange the entries in the VMStorage[] to match the order of registers in the Java calling convention:

[rsi, rdx, rcx, r8, r9, rdi]

i.e. the argument that should go into rdi is passed in the sixth position instead. Since now the registers for each argument match the Java calling convention, the downcall stub doesn't need to do any shuffling! (I'm being very hand-wavey here. Figuring out how to correctly do the re-ordering is the hard part of this).

Ok, but now the arguments we pass to the downcall stub are going to go in the wrong registers as well :( So, to compensate for that, we have to also re-order the incoming argument values on the Java side so that each argument will correspond to its original register again. To do that, we just need to pass the first argument in the sixth position as well, and shift the remaining five arguments forward by one spot.
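
The bookkeeping for this example can be sketched as follows, using the two six-register lists shown above and the Linux/x64 row of the j_rarg table. The class `ShuffleCount` and its methods are hypothetical. The model only compares register names per argument index, and it ignores the target address, stack arguments, and everything else the real downcall stub handles.

```java
import java.util.Arrays;
import java.util.List;

class ShuffleCount {
    // j_rarg0..j_rarg5 on Linux/x64, read off the table quoted above.
    static final List<String> JAVA_ARGS =
            List.of("rsi", "rdx", "rcx", "r8", "r9", "rdi");

    // Number of arguments whose incoming Java register differs from
    // the register requested by the VMStorage[] entry at the same index.
    static int movesNeeded(List<String> vmStorage) {
        int moves = 0;
        for (int i = 0; i < vmStorage.size(); i++) {
            if (!vmStorage.get(i).equals(JAVA_ARGS.get(i))) {
                moves++;
            }
        }
        return moves;
    }

    // For each position in the re-ordered list, the index of the
    // original argument that has to be passed there.
    static int[] permutation(List<String> original, List<String> reordered) {
        int[] p = new int[reordered.size()];
        for (int i = 0; i < p.length; i++) {
            p[i] = original.indexOf(reordered.get(i));
        }
        return p;
    }

    public static void main(String[] args) {
        List<String> original = List.of("rdi", "rsi", "rdx", "rcx", "r8", "r9");
        List<String> reordered = List.of("rsi", "rdx", "rcx", "r8", "r9", "rdi");
        System.out.println("moves before: " + movesNeeded(original));  // 6
        System.out.println("moves after:  " + movesNeeded(reordered)); // 0
        // Prints [1, 2, 3, 4, 5, 0]: the first argument goes last.
        System.out.println(Arrays.toString(permutation(original, reordered)));
    }
}
```

The resulting index array has the shape that MethodHandles::permuteArguments expects for its reorder parameter when the re-ordered handle is the target.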

Make sense?

Member Author

YaSuenag commented Aug 3, 2023

Ideally it would be best to eliminate the shuffling completely, but I think that is impossible. We have to use MethodHandles::permuteArguments to apply the reordering to the NativeMethodHandle as you said, and then shuffling would remain somewhere even if we could eliminate it from the NEP stub.

Thus this topic would be lower priority if my guess is correct.

@JornVernee
Member

> then shuffling would remain somewhere even if we could eliminate it from the NEP stub.

Since things on the Java side are visible to the JIT, it should be able to avoid the extra data motion.

> Thus this topic would be lower priority if my guess is correct.

Yes, it is lower priority. It's relatively complex to solve, and also CPUs, in my experience, don't generally care that much about the shuffling. They probably just change their internal register allocation table instead of doing the actual moves.

Also, it will ultimately help more to implement C2 intrinsics for native calls, as that avoids going through the downcall stub altogether. I have an old POC for that which I will dust off.

Member Author

YaSuenag commented Aug 6, 2023

Can I get a second reviewer?

@YaSuenag
Member Author

PING: could you review this PR? I need one more reviewer to push.

This PR has passed java/foreign jtreg tests and CI in Oracle.

Contributor

@iwanowww iwanowww left a comment

Looks good.

@YaSuenag
Member Author

/integrate


openjdk bot commented Aug 14, 2023

Going to push as commit 583cb75.
Since your change was applied there have been 160 commits pushed to the master branch:

  • 0074b48: 8312597: Convert TraceTypeProfile to UL
  • 1f1c5c6: 8314241: Add test/jdk/sun/security/pkcs/pkcs7/SignerOrder.java to ProblemList
  • f142470: 8311981: Test gc/stringdedup/TestStringDeduplicationAgeThreshold.java#ZGenerational timed out
  • 595fdd3: 8314059: Remove PKCS7.verify()
  • 49b2984: 8313854: Some tests in serviceability area fail on localized Windows platform
  • c132176: 8114830: (fs) Files.copy fails due to interference from something else changing the file system
  • e56d3bc: 8313657: com.sun.jndi.ldap.Connection.cleanup does not close connections on SocketTimeoutErrors
  • 4b2703a: 8313678: SymbolTable can leak Symbols during cleanup
  • f41c267: 8314045: ArithmeticException in GaloisCounterMode
  • 911d1db: 8314078: HotSpotConstantPool.lookupField() asserts due to field changes in ConstantPool.cpp
  • ... and 150 more: https://git.openjdk.org/jdk/compare/6fca28988794b52a6aa974bed1ed6f4f07e0994b...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 14, 2023
@openjdk openjdk bot closed this Aug 14, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 14, 2023

openjdk bot commented Aug 14, 2023

@YaSuenag Pushed as commit 583cb75.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@YaSuenag YaSuenag deleted the JDK-8313406 branch August 14, 2023 23:14
Labels
core-libs core-libs-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
3 participants