Skip to content

8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview) #12708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 43 commits into from

Conversation

TheRealMDoerr
Copy link
Contributor

@TheRealMDoerr TheRealMDoerr commented Feb 22, 2023

Implementation of "Foreign Function & Memory API" for linux on Power (Little Endian) according to "Power Architecture 64-Bit ELF V2 ABI Specification".

This PR does not include code for VaList support because it's supposed to get removed by JDK-8299736. I've kept the related tests disabled for this platform and throw an exception instead. Note that the ABI doesn't precisely specify variable argument lists. Instead, it refers to <stdarg.h> (2.2.4 Variable Argument Lists).

Big Endian support is implemented to some extend, but not complete. E.g. structs with size not divisible by 8 are not passed correctly (see useABIv2 in CallArranger.java). Big Endian is excluded by selecting ARCH.equals("ppc64le") (CABI.java) only.

There's another limitation: This PR only accepts structures with size divisible by 4. (An IllegalArgumentException gets thrown otherwise.) I think arbitrary sizes are not usable on other platforms, either, because SharedUtils.primitiveCarrierForSize only accepts powers of 2. Update: Resolved after merging of JDK-8303017

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

I had to make changes to shared code and code for other platforms:

  1. Pass type information when creating VMStorage objects from VMReg. This is needed for the following reasons:
  • PPC64 ABI requires integer types to get extended to 64 bit (also see CCallingConventionRequiresIntsAsLongs in existing hotspot code). We need to know the type or at least the bit width for that.
  • Floating point load / store instructions need the correct width to select between the correct IEEE 754 formats. The register representation in single FP registers is always IEEE 754 double precision on PPC64.
  • Big Endian also needs usage of the precise size. Storing 8 Bytes and loading 4 Bytes yields different values than on Little Endian!
  1. It happens that a NativeMemorySegmentImpl is used as a raw pointer (with byteSize() == 0) while running TestUpcallScope. Hence, existing size checks don't work (see MemorySegment.java). As a workaround, I'm just skipping the check in this particular case. Please check if this makes sense or if there's a better fix (possibly as separate RFE). Update: This issue is resolved by 2nd commit.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/12708/head:pull/12708
$ git checkout pull/12708

Update a local copy of the PR:
$ git checkout pull/12708
$ git pull https://git.openjdk.org/jdk.git pull/12708/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12708

View PR using the GUI difftool:
$ git pr show -t 12708

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12708.diff

Webrev

Link to Webrev Comment

Sorry, something went wrong.

@bridgekeeper
Copy link

bridgekeeper bot commented Feb 22, 2023

👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 22, 2023
@openjdk
Copy link

openjdk bot commented Feb 22, 2023

@TheRealMDoerr The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Feb 22, 2023
@minborg
Copy link
Contributor

minborg commented Feb 22, 2023

In the most recent version of the Panama FFM API, any memory layout (including struct and padding layouts) are always byte aligned.

@JornVernee
Copy link
Member

I will do a more thorough review soon.

Some preliminary comments:

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)

This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

I had to make changes to shared code and code for other platforms:

1. Pass type information when creating `VMStorage` objects from `VMReg`. This is needed for the following reasons:


* PPC64 ABI requires integer types to get extended to 64 bit (also see CCallingConventionRequiresIntsAsLongs in existing hotspot code). We need to know the type or at least the bit width for that.

* Floating point load / store instructions need the correct width to select between the correct IEEE 754 formats. The register representation in single FP registers is always IEEE 754 double precision on PPC64.

* Big Endian also needs usage of the precise size. Storing 8 Bytes and loading 4 Bytes yields different values than on Little Endian!

I think supplying the BasicType is fine. VMReg doesn't have any width information attached to it, and that's why a complementary BasicType is needed. I'm glad to see that you could make it work with the register masks for VMStorage :)

WRT the extension of int -> long. This could potentially also be handled in Java by adding the conversion as a Cast binding variant, and then adding the widening casts in CallArranger. (I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense). Since the extension seems to be a figment of the C ABI, that could be preferable, since it has the benefit of the VM code staying ABI-agnostic. This is potentially important if we want to add other ABIs in the future. But, we can also cross that bridge when we get to it (and there are probably more bridges to cross in that case too). So, up to you, really. (It's similar to the discussion surrounding floats for RISCV, if you followed that)

2. It happens that a `NativeMemorySegmentImpl` is used as a raw pointer (with byteSize() == 0) while running TestUpcallScope. Hence, existing size checks don't work (see MemorySegment.java). As a workaround, I'm just skipping the check in this particular case. Please check if this makes sense or if there's a better fix (possibly as separate RFE).

Zero-length memory segments are supposed to be resized before they are written to or read from (see Zero-length memory segments). We shouldn't disable the check for them, as that would have far-reaching implications for the safety design of the memory access API.

Can you explain a bit more about where/why/how the issue occurs?

@JornVernee
Copy link
Member

JornVernee commented Feb 22, 2023

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

/**
* The {@code T*} native type.
*/
public static final ValueLayout.OfAddress C_POINTER = ValueLayout.ADDRESS.withBitAlignment(64);
Copy link
Member

@JornVernee JornVernee Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where the issue with the check in MemorySegment::copy comes from. Note how other platforms add a call to asUnbounded for the created layout, which makes any pointer boxed using this layout writable/readable (such as the in memory return pointer for upcalls).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint! You found it pretty quickly. I had missed that when rebasing my early prototype. Fixed.

@@ -1356,7 +1356,11 @@ static void copy(MemorySegment srcSegment, ValueLayout srcElementLayout, long sr
}
long size = elementCount * srcElementLayout.byteSize();
srcImpl.checkAccess(srcOffset, size, true);
dstImpl.checkAccess(dstOffset, size, false);
if (dstImpl instanceof NativeMemorySegmentImpl && dstImpl.byteSize() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Jorn said, this change is not acceptable, as it allows bulk copy disregarding the segment real size. In such cases, the issue is always a missing unsafe resize somewhere in the linker code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this workaround. I'm glad to get rid of it.

@mcimadamore
Copy link
Contributor

Thanks for looking into this port!

@TheRealMDoerr
Copy link
Contributor Author

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

@TheRealMDoerr
Copy link
Contributor Author

I will do a more thorough review soon.

Thanks a lot!

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)

This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

Maybe I need to think a bit more about it. I don't really like the extra cases for INTEGER_AND_FLOAT and STACK_AND_FLOAT. On the other side, doing it in the CallArranger would break the design of factoring out the allocation from the binding generation.
In addition, it seems like PPC64 is even more tricky than the Windows case. I need to pass 2 float arguments in a GP reg (or stack slot) plus one of these 2 floats in float register F13. I think this can get implemented more easily in the backend. Do you agree?

@TheRealMDoerr
Copy link
Contributor Author

I have removed the size restriction for structs. Passing a struct consisting of 1 char works (on Little Endian). However, passing a struct consisting of 3 chars doesn't (getting IndexOutOfBoundsException: Out of bound access on segment MemorySegment). Neither on PPC64, nor on x86. Is that known or should I file a bug for that?

@TheRealMDoerr
Copy link
Contributor Author

I should add tests for the tricky corner cases like the following ones:

EXPORT struct S_FF  f_S_S_FF(float p0, float p1, float p2, float p3, float p4, float p5, float p6, float p7, float p8, float p9, float p10, float p11, struct S_FF p12, float p13) { return p12; }
EXPORT float        f_F_S_FF(float p0, float p1, float p2, float p3, float p4, float p5, float p6, float p7, float p8, float p9, float p10, float p11, struct S_FF p12, float p13) { return p13; }
EXPORT struct S_FF  f_S_SSSSSSS_FF(struct S_FF p0, struct S_FF p1, struct S_FF p2, struct S_FF p3, struct S_FF p4, struct S_FF p5, struct S_FF p6, float p7) { return p6; }
EXPORT float        f_F_SSSSSSS_FF(struct S_FF p0, struct S_FF p1, struct S_FF p2, struct S_FF p3, struct S_FF p4, struct S_FF p5, struct S_FF p6, float p7) { return p7; }

Can I add them to the existing libraries? If so, what is the correct naming scheme and what is needed to get them executed (adding the EXPORT alone is not sufficient). Or should I create a separate test for these cases?
Advice will be appreciated!

@mcimadamore
Copy link
Contributor

Can I add them to the existing libraries? If so, what is the correct naming scheme and what is needed to get them executed (adding the EXPORT alone is not sufficient). Or should I create a separate test for these cases? Advice will be appreciated!

There are two kinds of tests for the linker - some tests (e.g. TestDowncallXYZ and TestUpcallXYZ) execute end to end test with several shapes, and make sure that things work.

Then there are ABI specific tests (e.g. see TestXYZCallArranger). These latter tests are typically used to stress tests corners of specific ABIs - and they are easier to write as you can just provide the input (some function descriptor) then test that the resulting set of bindings is the expected one. This allows for much more in-depth testing.

@mcimadamore
Copy link
Contributor

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

@JornVernee
Copy link
Member

JornVernee commented Feb 23, 2023

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

The design philosophy has been to put as much as possible on the Java side, and there are a few reasons for that:

  1. For maintainability. Generated assembly is ultimately harder to debug, compared to Java code (especially in interpreted mode using -Djdk.internal.foreign.*Linker.USE_SPEC=false). (Though, there might also be some personal bias here)
  2. Moving things to the Java side makes it visible to the JIT, which means it has the opportunity to be optimized away, or otherwise optimized together with the surrounding Java code. While anything put into the downcall stub is fixed.
  3. If we want to intrisify linkToNative in C2 later, having downcall stubs be simple and consistent across platforms makes that much easier. Anything that's special in the stub's code would have to be replicated by the JIT as well.

So... WRT efficiency, I think it depends. I've found in the past that adding a few more move instructions to the downcall stub didn't visibly affect performance. This might be because the CPU is good at just aliasing the registers instead of performing an actual move, or because it's just noise next to the membar we do on the return path. Ultimately, I don't think it matters much for performance, though (you could measure). I think the maintainability/future-proofing from implementing in Java is more important.

@JornVernee
Copy link
Member

I will do a more thorough review soon.

Thanks a lot!

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)
This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

Maybe I need to think a bit more about it. I don't really like the extra cases for INTEGER_AND_FLOAT and STACK_AND_FLOAT. On the other side, doing it in the CallArranger would break the design of factoring out the allocation from the binding generation. In addition, it seems like PPC64 is even more tricky than the Windows case. I need to pass 2 float arguments in a GP reg (or stack slot) plus one of these 2 floats in float register F13. I think this can get implemented more easily in the backend. Do you agree?

I think the same arguments apply here. I'll have a more thorough look at the patch and then get back to you on this.

@JornVernee
Copy link
Member

JornVernee commented Feb 23, 2023

I have removed the size restriction for structs. Passing a struct consisting of 1 char works (on Little Endian). However, passing a struct consisting of 3 chars doesn't (getting IndexOutOfBoundsException: Out of bound access on segment MemorySegment). Neither on PPC64, nor on x86. Is that known or should I file a bug for that?

I think you might be running into: https://bugs.openjdk.org/browse/JDK-8303017 which was recently found. (If you have a simpler test case please add it to the JBS issue, in a comment)

Edit: I've found a simpler test case as well. It's an issue with structs that are not the size of a power of two, since we can only read/write a power of two at a time. And I see now what your comment about primitveCarrierForSize is about :)

@JornVernee
Copy link
Member

JornVernee commented Feb 23, 2023

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

At the moment, there is a forced indirection between the Java code and downcall stub, so the JIT can not do any optimizations that have to take both sides into account. The downcall stub is opaque to the JIT. (though, perhaps in the long run we can add intrinsification of linkToNative that can generate the code in the downcall stub as part of the JIT's own IR, which would solve this)

@mcimadamore
Copy link
Contributor

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

At the moment, there is a forced indirection between the Java code and downcall stub, so the JIT can not do any optimizations that have to take both sides into account. The downcall stub is opaque to the JIT. (though, perhaps in the long run we can add intrinsification of linkToNative that can generate the code in the downcall stub as part of the JIT's own IR, which would solve this)

I meant generating extsw when emitting the stub (since when we emit the stub we can see the bindings). But I suppose the problem there is that the VM only sees low level bindings such as moves, it doesn't see bindings such as casts.

@JornVernee
Copy link
Member

I meant generating extsw when emitting the stub (since when we emit the stub we can see the bindings). But I suppose the problem there is that the VM only sees low level bindings such as moves, it doesn't see bindings such as casts.

Oh, sorry, I see what you mean now. But yeah, the VM stub only handles moves, and I think we want to keep it that way (for reasons outlined). The VM stub can be viewed as a low-level primitive that accepts a set of register/stack values, and moves them into the corresponding locations. It's not really supposed to be doing any kind of processing of the values it receives or returns.

@TheRealMDoerr
Copy link
Contributor Author

TheRealMDoerr commented Feb 24, 2023

Some more remarks about other issues:

  • Uploaded my simple reproducer to JDK-8303017
  • Using oversized load / stores is problematic. Don't forget that OpenJDK still supports Big Endian platforms (AIX, s390x).
  • The result of NativeCallingConvention::calling_convention is interpreted as size, but it returns the max offset. That's off by one slot. Should I file a bug for that? (PPC64 is not affected because it doesn't use the result.)
  • Since the membar on the return path was mentioned: I think it would be good to enable UseSystemMemoryBarrier by default on operating systems which support it. Maybe we should discuss this with @robehn.

@robehn
Copy link
Contributor

robehn commented Feb 24, 2023

  • Since the membar on the return path was mentioned: I think it would be good to enable UseSystemMemoryBarrier by default on operating systems which support it. Maybe we should discuss this with @robehn.

Changing the default is fine by me.
There is, AFAIK, one case that needs to read the thread state a lot, thus emitting sysmembars alot, JFR with very high sampling rate. Other than that there are no issues that I know about.
Maybe it would be good to test at what sampling interval we notice a change?
Also I think it's not best match to have the flag experimental when we are in some sense using it. Maybe diagnostic?

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Martin,

finally I've completed a pass over the hotspot part of the port. This seems a good point to share the few comments and questions I've collected so far.

In general the changes do look very good. Hardly any shared code changes.
Nice work!

Cheers, Richard.

}

allocated_frame_size = align_up(allocated_frame_size, StackAlignmentInBytes);
_frame_size_slots = allocated_frame_size >> LogBytesPerInt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VMRegImpl::stack_slot_size could be used when converting from size in bytes to size in slots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would be better readable. But the following code should also be adapted, then:

  int framesize() const {
    return (_frame_size_slots >> (LogBytesPerWord - LogBytesPerInt));
  }

Maybe it makes sense to do some cleanup for all platforms.


// Load the invoker, as NEP -> .invoker
__ verify_oop(nep_reg);
__ ld(temp_target, jdk_internal_foreign_abi_NativeEntryPoint::downcall_stub_address_offset_in_bytes(), nep_reg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other platforms use access_load_at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I have no idea why. It does the same but with a more complicated API.
I just noticed that other platforms use NONZERO. I think I should at least add that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does the same but with a more complicated API.

AFAIK It depends on the GC that's being used. access_load_at will make sure the right GC barriers are inserted (mostly for concurrent GCs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it, the access API is an abstraction to be used instead of raw loads. It hides details. See for instance TemplateTable::getfield_or_static on x86 where it is also used. PPC lags behind in making use of the access API.
With a fancy new GC the oop in nep_reg could be stale, requiring some processing which would be taken care of by using the access API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GC barriers are used when loading or storing an oop. No GC we currently have (not even the generational ones) use barriers for loading a plain address from an oop. The PPC64 implementation of the BarrierSetAssembler currently has Unimplemented() for non-oop types and all GCs are implemented.
Maybe it was intended for some future GC or other feature which has not yet reached the official repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just figured it out. It was introduced by https://bugs.openjdk.org/browse/JDK-8203172 (on aarch64) which mentions Shenandoah and future GCs. However, the Shenandoah comment says "non-reference load, no additional barrier is needed" and it doesn't use barriers in such a case. So, for the time being, I'll keep the normal load (because access_load_at is not ready for non-oop types). But I should add the NONZERO check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, since Shenandoah changed their load barriers we have been cleaning away the usages of the Access API for loads and stores to primitive values. There's no such support in the C++ Runtime code.

Copy link
Member

@JornVernee JornVernee May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, since this is loading a long (which represents an address that points into the code cache) I think we're fine without using the access API then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. The code had been written for the previous version of Shenandoah (1.0). No current GC uses barriers for non-oop types and the C++ Runtime doesn't support it any more as Stefan pointed out.
It is still possible to use the access API on other platforms, but it does nothing more than a plain load/store for non-oop types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with doing a plain access. I don't like the difference to other ports as it will at least waste time of people with less expertise in the area (e.g. new to the project). No need to continue the discussion in this pr though.


__ block_comment("{ receiver ");
__ load_const_optimized(R3_ARG1, (intptr_t)receiver, R0);
__ resolve_jobject(R3_ARG1, tmp, R31, MacroAssembler::PRESERVATION_FRAME_LR_GP_FP_REGS); // kills R31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a simplification the receiver could be resolved in UpcallLinker::on_entry and returned in JavaThread::_vm_result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a nice enhancement proposal for all platforms. The register spilling code in resolve_jobject can get lengthy dependent on the selected GC. Doing it in the C code (which we call anyway above) would make the upcall stubs smaller.
@JornVernee: What do you think about this idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a nice idea. The resolution here pre-dates the time where we called into the VM.

case T_CHAR :
case T_BYTE :
case T_SHORT :
case T_INT : segment_mask = REG32_MASK; break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why the segment_mask depends on bt on ppc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of the segment_mask can be defined for each platform. I'm using it to encode the information if a value on the Java side uses a 32 or 64 bit slot. In case of 32 bit values, the C side requires all 64 register bits to get defined values (ints get sign extended, floats get converted to double-precision format).

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Martin,
I've made a pass over the Java part (except HFA). I found the specs hard to understand but most specs are like this.
I'll finish the rest beginning of next week.
Cheers, Richard.

*/
package jdk.internal.foreign.abi.ppc64;

import java.lang.foreign.AddressLayout;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imports are not grouped and ordered alphabetically.
(Very much as the aarch64 version)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public static final int MAX_REGISTER_ARGUMENTS = 8;
public static final int MAX_FLOAT_REGISTER_ARGUMENTS = 13;

// This is derived from the 64-Bit ELF V2 ABI spec, restricted to what's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says ABI V2 but the code seems to handle V1 too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's derived from ABI v2, but v1 is compatible. Added that to the comment.


class StorageCalculator {
private final boolean forArguments;
private boolean forVarArgs = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had kept it in case another PPC64 OS would need it, but I guess it's unlikely. So, I just removed it. Could get added back easily if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. There are other examples of redundant code that might serve a purpose in the future. I honestly don't like that.
In the case of forVarArgs porters can still find it in the aarch64 version :)

// TODO: Big Endian can't pass partially used slots correctly in some cases with:
// !useABIv2() && layout.byteSize() > 8 && layout.byteSize() % 8 != 0

// Allocate individual fields as gp slots (regs and stack).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You explained to me, it's not individual (struct) fields that are handled here. Looks like registers and 8 byte stack slots are allocated to completely cover the struct. Would be good if you could change the comment and names in the code to better reflect this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted to aarch64 implementation (using MAX_COPY_SIZE).

private static LinuxPPC64leLinker instance;

public static LinuxPPC64leLinker getInstance() {
if (instance == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other platforms optimized this to return a constant (probably after you forked off the port).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Adapted.

* public constants CallArranger.ABIv1/2.
*/
public abstract class CallArranger {
protected abstract boolean useABIv2();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be refactored into a static method with the same trick that is used in LinuxPPC64leLinker::getInstance. Callers could be static then and you could delete CallArranger::ABIv2 and ABIv2CallArranger.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like?

    protected static final boolean useABIv2 = CABI.current() == CABI.LINUX_PPC_64_LE;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better to read, but would make the PPC64 CallArranger dependent on the current CABI.
Note that there are tests which use

import jdk.internal.foreign.abi.aarch64.CallArranger;
...
CallArranger.LINUX.getBindings(mt, fd, false);

for example. The tests are designed to run on all platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better to read, but would make the PPC64 CallArranger dependent on the current CABI. Note that there are tests which use

import jdk.internal.foreign.abi.aarch64.CallArranger;
...
CallArranger.LINUX.getBindings(mt, fd, false);

for example. The tests are designed to run on all platforms.

I see, thanks. Would be nice to have some for PPC64 too :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, yes. I didn't find time for figuring out what would be useful tests. We could still add some in the future or with the big endian port.
Another idea: Would the following be better?
final boolean useABIv2 = (this.getClass() == ABIv2CallArranger.class);
That would also allow getting rid of the method useABIv2().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better final boolean useABIv2 = (this instanceof ABIv2CallArranger);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at 70736be

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good.

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Martin,

there seems to be a mismatch between this pr and the 64-bit ELF ABI V2 for PPC. In fact all dynamically generated calls that have to conform to ABI V2 are affected. I'm giving a short summary of our discussion this afternoon.

Very briefly: ABI V2 states that a Parameter Save Area (PSA) shall be allocated unless all parameters can be passed in registers as indicated by the caller's prototype, whereas the port always allocates a PSA of 8 double words.
(Details under "Parameter Save Area" in "2.2.3.3. Optional Save Areas" of ELF ABI V2)

It is not wrong what we're doing. It is like we didn't know the prototype of the call targets. But for most calls [1] we are wasting stack space (and confusing everybody that tries to match the implementation with the spec).
Interestingly ABI V1 states that a PSA of at least 8 double words is always needed. Looks like we've missed that change.

I have conducted a little experiment and compiled the following test program using Compiler Explorer [2]

#include <stdint.h>

int64_t test_callee(int64_t p1, int64_t p2, int64_t p3, int64_t p4);

int64_t test_call(int64_t p1, int64_t p2, int64_t p3, int64_t p4) {
    return test_callee(p1, p2, p3, p4);
}

This is the -O2 output for ELF ABI V2 (little endian)
Note: the stdu allocates just the minimal frame of 4 double words without PSA.

test_call:                              # @test_call
.Lfunc_gep0:
        addis 2, 12, .TOC.-.Lfunc_gep0@ha
        addi 2, 2, .TOC.-.Lfunc_gep0@l
        mflr 0
        stdu 1, -32(1)
        std 0, 48(1)
        bl test_callee
        nop
        addi 1, 1, 32
        ld 0, 16(1)
        mtlr 0
        blr

This is the -O2 output for ELF ABI V1 (big endian)

test_call:                              # @test_call
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        mflr 0
        stdu 1, -112(1)
        std 0, 128(1)
        bl test_callee
        nop
        addi 1, 1, 112
        ld 0, 16(1)
        mtlr 0
        blr

Note: the stdu allocates a much larger frame because it accomodates a PSA of 8 double words.

I'd suggest to keep the current well tested version but add comments to the code that a PSA is always allocated even though ABI V2 does not require it. This should also be explained in the JBS item.

Furthermore RFEs should be filed to adopt ABI V2 in the FFM API port and in the hotspot port to PPC64le. There's quite a bit of room for improvement there. Ironically I've very recently fixed JDK-8306111 citing the ABI V2 spec without realizing that the fix is not needed for V2, just for V1.

[1] Exceptions that do require a PSA are calls with a really long or variable parameter list or if a prototype is not available

[2] Experiment with "Compiler Explorer"

// (native_abi_reg_args is native_abi_minframe plus space for 8 argument register spill slots)
assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
// Note: For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0
int register_save_area_slots = MAX2(_input_registers.length(), 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both specs, ABI V1 and V2, call this "Parameter Save Area" we should use the same name.

Suggested change
int register_save_area_slots = MAX2(_input_registers.length(), 8);
int parameter_save_area_slots = MAX2(_input_registers.length(), 8);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Changed.

// This may be a bit more than needed when HFA is used (see CallArranger.java).
// (native_abi_reg_args is native_abi_minframe plus space for 8 argument register spill slots)
assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
// Note: For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to understand. It should be explained that we allocate a PSA even though ABI V2 only requires it if not all parameters can be passed in registers.

stackAlloc(4, STACK_SLOT_SIZE); // Skip first half of stack slot.
stack = stackAlloc(4, 4);
} else {
stack = stackAlloc(is32Bit ? 4 : 8, STACK_SLOT_SIZE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a stack slot is always allocated. Please explain that for ABI V2 this is actually only required if it is know from a prototype that not all parameters can be passed in registers and that we plan to change this.

Copy link
Contributor Author

@TheRealMDoerr TheRealMDoerr May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically computes the stack layout. We need to count all slots to get the right offset for the registers which actually get written on stack. The first such register will hit native_abi_minframe_size + 8 slots. If fewer registers are used, the counted stack slots will not be used.
The decision whether we allocate the Parameter Save Area or not is done in the downcall stub and doesn't depend on the stackAllocs.

@TheRealMDoerr
Copy link
Contributor Author

Thanks for publishing our discussion, here. The unnecessary PSA affects other areas of hotspot much more than Panama. Yes, we should file an RFE. I think one for hotspot is sufficient as the downcall stub is part of it. I don't think it needs extra treatment.

assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
// The Parameter Save Area needs to be at least 8 slots for ABIv1.
// ABIv2 allows omitting it when all parameters can get passed in registers. We currently don't optimize this.
// For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0
Copy link
Member

@reinrich reinrich May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PSA is also needed if the parameter list is variable in length. Is the expression (_input_registers.length() > 8) ? _input_registers.length() : 0 correct in that case too?
Otherwise: ABIv2 allows omitting it if the callee's prototype indicates that stack parameters are not expected. We currently don't optimize this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now. This is not obvious though. There are a few layers of abstraction at play which hide this. A comment is needed. Maybe like this:

            // With ABIv1 a Parameter Save Area of at least 8 double words is always needed.
            // ABIv2 allows omitting it if the callee's prototype indicates that stack parameters are not expected.
            // We currently don't optimize this (see DowncallStubGenerator in the backend).
            if (reg == null) return stack;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe omitting the PSA is wrong for varargs, but we don't have this information in the backend. So, I think we should simply not optimize it. Reserving 64 Byte stack space should be affordable for a downcall even if it's not always needed. The Java side could compute it, but there's no way to pass this information to the backend. I've improved the comments. Please take a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe omitting the PSA is wrong for varargs, but we don't have this information in the backend.

It is clearly wrong.

So, I think we should simply not optimize it.

Already agreed for this version.

Reserving 64 Byte stack space should be affordable for a downcall even if it's not always needed.

It is hardly ever needed to begin with. It's just not pretty to allocate the PSA for little test functions. It'll confuse people that compare the downcall stub to what the C compiler produces. They might even think we havn't understood the ABI ;)

The Java side could compute it, but there's no way to pass this information to the backend.

I'm not sure about that. First idea would be to keep the allocated stack in nextStorage if needed and pass it to the backend.

I've improved the comments. Please take a look.

Great! Thanks :)

@reinrich
Copy link
Member

Thanks for publishing our discussion, here. The unnecessary PSA affects other areas of hotspot much more than Panama. Yes, we should file an RFE. I think one for hotspot is sufficient as the downcall stub is part of it. I don't think it needs extra treatment.

That's fine. It should have a little list of areas to be revisited.

Adoc:

  • Runtime calls by the interpreter, c1, and c2
  • Interpreted and compiled JNI calls
  • FFM API ("Panama") calls
  • Runtime calls by continuation intrisics
  • Runtime calls by GC barriers

Subtasks can be generated from that list.

Copy link
Member

@reinrich reinrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good to go now.
(maybe after a final round of tests)

Thanks a lot!
Richard.

@TheRealMDoerr
Copy link
Contributor Author

Thanks a lot for all discussions, feedback and the reviews! This was really helpful! I'm planning to integrate tomorrow.

@TheRealMDoerr
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented May 24, 2023

Going to push as commit 20f1535.
Since your change was applied there have been 58 commits pushed to the master branch:

  • 466ec30: 8302736: Major performance regression in Math.log on aarch64
  • 05c095c: 8308151: [JVMCI] capture JVMCI exceptions in hs-err
  • beb75e6: 8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
  • 2836c34: 8179502: Enhance OCSP, CRL and Certificate Fetch Timeouts
  • 8ffa264: 8306698: Add overloads to MethodTypeDesc::of
  • 6b27dad: 8301154: SunPKCS11 KeyStore deleteEntry results in dangling PrivateKey entries
  • ed0e956: 8308716: ProblemList java/util/concurrent/ScheduledThreadPoolExecutor/BasicCancelTest.java with genzgc on windows-x64
  • bddf483: 8303942: os::write should write completely
  • ab241b3: 8306706: Support out-of-line code generation for MachNodes
  • 710453c: 8308016: Use snippets in java.io package
  • ... and 48 more: https://git.openjdk.org/jdk/compare/939344b8433b32166f42ad73ae3d96e84b033478...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 24, 2023
@openjdk openjdk bot closed this May 24, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 24, 2023
@TheRealMDoerr TheRealMDoerr deleted the PPC64_Panama branch May 24, 2023 08:39
@openjdk
Copy link

openjdk bot commented May 24, 2023

@TheRealMDoerr Pushed as commit 20f1535.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated shenandoah shenandoah-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

None yet

8 participants