8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview) #12708

TheRealMDoerr · 2023-02-22T05:31:46Z

Implementation of "Foreign Function & Memory API" for linux on Power (Little Endian) according to "Power Architecture 64-Bit ELF V2 ABI Specification".

This PR does not include code for VaList support because it's supposed to get removed by JDK-8299736. I've kept the related tests disabled for this platform and throw an exception instead. Note that the ABI doesn't precisely specify variable argument lists. Instead, it refers to <stdarg.h> (2.2.4 Variable Argument Lists).

Big Endian support is implemented to some extend, but not complete. E.g. structs with size not divisible by 8 are not passed correctly (see useABIv2 in CallArranger.java). Big Endian is excluded by selecting ARCH.equals("ppc64le") (CABI.java) only.

There's another limitation: This PR only accepts structures with size divisible by 4. (An IllegalArgumentException gets thrown otherwise.) I think arbitrary sizes are not usable on other platforms, either, because SharedUtils.primitiveCarrierForSize only accepts powers of 2. Update: Resolved after merging of JDK-8303017

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

I had to make changes to shared code and code for other platforms:

Pass type information when creating VMStorage objects from VMReg. This is needed for the following reasons:

PPC64 ABI requires integer types to get extended to 64 bit (also see CCallingConventionRequiresIntsAsLongs in existing hotspot code). We need to know the type or at least the bit width for that.
Floating point load / store instructions need the correct width to select between the correct IEEE 754 formats. The register representation in single FP registers is always IEEE 754 double precision on PPC64.
Big Endian also needs usage of the precise size. Storing 8 Bytes and loading 4 Bytes yields different values than on Little Endian!

It happens that a NativeMemorySegmentImpl is used as a raw pointer (with byteSize() == 0) while running TestUpcallScope. Hence, existing size checks don't work (see MemorySegment.java). As a workaround, I'm just skipping the check in this particular case. Please check if this makes sense or if there's a better fix (possibly as separate RFE). Update: This issue is resolved by 2nd commit.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview)

Reviewers

Jorn Vernee (@JornVernee - Reviewer) ⚠️ Review applies to 4666aa22
Richard Reingruber (@reinrich - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/12708/head:pull/12708
$ git checkout pull/12708

Update a local copy of the PR:
$ git checkout pull/12708
$ git pull https://git.openjdk.org/jdk.git pull/12708/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12708

View PR using the GUI difftool:
$ git pr show -t 12708

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12708.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2023-02-22T05:33:14Z

👋 Welcome back mdoerr! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2023-02-22T05:35:28Z

@TheRealMDoerr The following labels will be automatically applied to this pull request:

core-libs
hotspot
shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2023-02-22T05:39:34Z

Webrevs

minborg · 2023-02-22T14:07:55Z

In the most recent version of the Panama FFM API, any memory layout (including struct and padding layouts) are always byte aligned.

JornVernee · 2023-02-22T17:03:16Z

I will do a more thorough review soon.

Some preliminary comments:

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)

This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

I had to make changes to shared code and code for other platforms:

1. Pass type information when creating `VMStorage` objects from `VMReg`. This is needed for the following reasons:


* PPC64 ABI requires integer types to get extended to 64 bit (also see CCallingConventionRequiresIntsAsLongs in existing hotspot code). We need to know the type or at least the bit width for that.

* Floating point load / store instructions need the correct width to select between the correct IEEE 754 formats. The register representation in single FP registers is always IEEE 754 double precision on PPC64.

* Big Endian also needs usage of the precise size. Storing 8 Bytes and loading 4 Bytes yields different values than on Little Endian!

I think supplying the BasicType is fine. VMReg doesn't have any width information attached to it, and that's why a complementary BasicType is needed. I'm glad to see that you could make it work with the register masks for VMStorage :)

WRT the extension of int -> long. This could potentially also be handled in Java by adding the conversion as a Cast binding variant, and then adding the widening casts in CallArranger. (I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense). Since the extension seems to be a figment of the C ABI, that could be preferable, since it has the benefit of the VM code staying ABI-agnostic. This is potentially important if we want to add other ABIs in the future. But, we can also cross that bridge when we get to it (and there are probably more bridges to cross in that case too). So, up to you, really. (It's similar to the discussion surrounding floats for RISCV, if you followed that)

2. It happens that a `NativeMemorySegmentImpl` is used as a raw pointer (with byteSize() == 0) while running TestUpcallScope. Hence, existing size checks don't work (see MemorySegment.java). As a workaround, I'm just skipping the check in this particular case. Please check if this makes sense or if there's a better fix (possibly as separate RFE).

Zero-length memory segments are supposed to be resized before they are written to or read from (see Zero-length memory segments). We shouldn't disable the check for them, as that would have far-reaching implications for the safety design of the memory access API.

Can you explain a bit more about where/why/how the issue occurs?

JornVernee · 2023-02-22T17:33:05Z

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

JornVernee · 2023-02-22T18:23:54Z

src/java.base/share/classes/jdk/internal/foreign/PlatformLayouts.java

+        /**
+         * The {@code T*} native type.
+         */
+        public static final ValueLayout.OfAddress C_POINTER = ValueLayout.ADDRESS.withBitAlignment(64);


I think this is where the issue with the check in MemorySegment::copy comes from. Note how other platforms add a call to asUnbounded for the created layout, which makes any pointer boxed using this layout writable/readable (such as the in memory return pointer for upcalls).

Thanks for the hint! You found it pretty quickly. I had missed that when rebasing my early prototype. Fixed.

mcimadamore · 2023-02-22T18:31:45Z

src/java.base/share/classes/java/lang/foreign/MemorySegment.java

@@ -1356,7 +1356,11 @@ static void copy(MemorySegment srcSegment, ValueLayout srcElementLayout, long sr
        }
        long size = elementCount * srcElementLayout.byteSize();
        srcImpl.checkAccess(srcOffset, size, true);
-        dstImpl.checkAccess(dstOffset, size, false);
+        if (dstImpl instanceof NativeMemorySegmentImpl && dstImpl.byteSize() == 0) {


As Jorn said, this change is not acceptable, as it allows bulk copy disregarding the segment real size. In such cases, the issue is always a missing unsafe resize somewhere in the linker code.

Removed this workaround. I'm glad to get rid of it.

mcimadamore · 2023-02-22T18:32:26Z

Thanks for looking into this port!

TheRealMDoerr · 2023-02-23T04:44:18Z

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

TheRealMDoerr · 2023-02-23T04:53:34Z

I will do a more thorough review soon.

Thanks a lot!

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)

This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

Maybe I need to think a bit more about it. I don't really like the extra cases for INTEGER_AND_FLOAT and STACK_AND_FLOAT. On the other side, doing it in the CallArranger would break the design of factoring out the allocation from the binding generation.
In addition, it seems like PPC64 is even more tricky than the Windows case. I need to pass 2 float arguments in a GP reg (or stack slot) plus one of these 2 floats in float register F13. I think this can get implemented more easily in the backend. Do you agree?

TheRealMDoerr · 2023-02-23T06:21:05Z

I have removed the size restriction for structs. Passing a struct consisting of 1 char works (on Little Endian). However, passing a struct consisting of 3 chars doesn't (getting IndexOutOfBoundsException: Out of bound access on segment MemorySegment). Neither on PPC64, nor on x86. Is that known or should I file a bug for that?

TheRealMDoerr · 2023-02-23T07:19:24Z

I should add tests for the tricky corner cases like the following ones:

EXPORT struct S_FF  f_S_S_FF(float p0, float p1, float p2, float p3, float p4, float p5, float p6, float p7, float p8, float p9, float p10, float p11, struct S_FF p12, float p13) { return p12; }
EXPORT float        f_F_S_FF(float p0, float p1, float p2, float p3, float p4, float p5, float p6, float p7, float p8, float p9, float p10, float p11, struct S_FF p12, float p13) { return p13; }
EXPORT struct S_FF  f_S_SSSSSSS_FF(struct S_FF p0, struct S_FF p1, struct S_FF p2, struct S_FF p3, struct S_FF p4, struct S_FF p5, struct S_FF p6, float p7) { return p6; }
EXPORT float        f_F_SSSSSSS_FF(struct S_FF p0, struct S_FF p1, struct S_FF p2, struct S_FF p3, struct S_FF p4, struct S_FF p5, struct S_FF p6, float p7) { return p7; }

Can I add them to the existing libraries? If so, what is the correct naming scheme and what is needed to get them executed (adding the EXPORT alone is not sufficient). Or should I create a separate test for these cases?
Advice will be appreciated!

mcimadamore · 2023-02-23T10:15:41Z

Can I add them to the existing libraries? If so, what is the correct naming scheme and what is needed to get them executed (adding the EXPORT alone is not sufficient). Or should I create a separate test for these cases? Advice will be appreciated!

There are two kinds of tests for the linker - some tests (e.g. TestDowncallXYZ and TestUpcallXYZ) execute end to end test with several shapes, and make sure that things work.

Then there are ABI specific tests (e.g. see TestXYZCallArranger). These latter tests are typically used to stress tests corners of specific ABIs - and they are easier to write as you can just provide the input (some function descriptor) then test that the resulting set of bindings is the expected one. This allows for much more in-depth testing.

mcimadamore · 2023-02-23T10:20:31Z

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

JornVernee · 2023-02-23T14:49:47Z

(I'd be happy to implement the needed changes in shared code if you want, since it touches BindingSpecializer which is pretty dense)

FYI: master...JornVernee:jdk:I2L (assuming I2L has the same semantics as extsw). Then just add a .cast(int.class, long.class) wherever currently an int is vmStored in the PPC CallArranger.

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

The design philosophy has been to put as much as possible on the Java side, and there are a few reasons for that:

For maintainability. Generated assembly is ultimately harder to debug, compared to Java code (especially in interpreted mode using -Djdk.internal.foreign.*Linker.USE_SPEC=false). (Though, there might also be some personal bias here)
Moving things to the Java side makes it visible to the JIT, which means it has the opportunity to be optimized away, or otherwise optimized together with the surrounding Java code. While anything put into the downcall stub is fixed.
If we want to intrisify linkToNative in C2 later, having downcall stubs be simple and consistent across platforms makes that much easier. Anything that's special in the stub's code would have to be replicated by the JIT as well.

So... WRT efficiency, I think it depends. I've found in the past that adding a few more move instructions to the downcall stub didn't visibly affect performance. This might be because the CPU is good at just aliasing the registers instead of performing an actual move, or because it's just noise next to the membar we do on the return path. Ultimately, I don't think it matters much for performance, though (you could measure). I think the maintainability/future-proofing from implementing in Java is more important.

JornVernee · 2023-02-23T14:56:18Z

I will do a more thorough review soon.

Thanks a lot!

The ABI has some tricky corner cases related to HFA (Homogeneous Float Aggregate). The same argument may need to get passed in both, a FP reg and a GP reg or stack slot (see "no partial DW rule"). This cases are not covered by the existing tests.

FWIW, we have to do this for Windows vararg floats as well (here)
This can be done by dup-ing the value, and using 2 vmStores. (each vmStore corresponding to a single register/stack location). Doing something similar might be simpler than the INTEGER_AND_FLOAT and STACK_AND_FLOAT storage types you're using right now. I'm not sure if that is related to the other limitations you mention? Might be interesting to look into. (perhaps as a separate RFE. I don't have a big issue since the current approach stays in PPC-only code)

Maybe I need to think a bit more about it. I don't really like the extra cases for INTEGER_AND_FLOAT and STACK_AND_FLOAT. On the other side, doing it in the CallArranger would break the design of factoring out the allocation from the binding generation. In addition, it seems like PPC64 is even more tricky than the Windows case. I need to pass 2 float arguments in a GP reg (or stack slot) plus one of these 2 floats in float register F13. I think this can get implemented more easily in the backend. Do you agree?

I think the same arguments apply here. I'll have a more thorough look at the patch and then get back to you on this.

JornVernee · 2023-02-23T14:59:34Z

I have removed the size restriction for structs. Passing a struct consisting of 1 char works (on Little Endian). However, passing a struct consisting of 3 chars doesn't (getting IndexOutOfBoundsException: Out of bound access on segment MemorySegment). Neither on PPC64, nor on x86. Is that known or should I file a bug for that?

I think you might be running into: https://bugs.openjdk.org/browse/JDK-8303017 which was recently found. (If you have a simpler test case please add it to the JBS issue, in a comment)

Edit: I've found a simpler test case as well. It's an issue with structs that are not the size of a power of two, since we can only read/write a power of two at a time. And I see now what your comment about primitveCarrierForSize is about :)

JornVernee · 2023-02-23T15:02:48Z

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

At the moment, there is a forced indirection between the Java code and downcall stub, so the JIT can not do any optimizations that have to take both sides into account. The downcall stub is opaque to the JIT. (though, perhaps in the long run we can add intrinsification of linkToNative that can generate the code in the downcall stub as part of the JIT's own IR, which would solve this)

mcimadamore · 2023-02-23T16:48:30Z

Correct, extsw performs a I2L conversion. I had thought about this already, but I think my current implementation is more efficient as it combines register moves with the 64 bit extend. Your proposal would generate separate extend and move instructions, right?

Would it be possible to generate a biding cast + move and the recognize the pattern in the HS backend and optimize it away as a extsw ?

At the moment, there is a forced indirection between the Java code and downcall stub, so the JIT can not do any optimizations that have to take both sides into account. The downcall stub is opaque to the JIT. (though, perhaps in the long run we can add intrinsification of linkToNative that can generate the code in the downcall stub as part of the JIT's own IR, which would solve this)

I meant generating extsw when emitting the stub (since when we emit the stub we can see the bindings). But I suppose the problem there is that the VM only sees low level bindings such as moves, it doesn't see bindings such as casts.

JornVernee · 2023-02-23T17:11:23Z

I meant generating extsw when emitting the stub (since when we emit the stub we can see the bindings). But I suppose the problem there is that the VM only sees low level bindings such as moves, it doesn't see bindings such as casts.

Oh, sorry, I see what you mean now. But yeah, the VM stub only handles moves, and I think we want to keep it that way (for reasons outlined). The VM stub can be viewed as a low-level primitive that accepts a set of register/stack values, and moves them into the corresponding locations. It's not really supposed to be doing any kind of processing of the values it receives or returns.

TheRealMDoerr · 2023-02-24T04:03:53Z

Some more remarks about other issues:

Uploaded my simple reproducer to JDK-8303017
Using oversized load / stores is problematic. Don't forget that OpenJDK still supports Big Endian platforms (AIX, s390x).
The result of NativeCallingConvention::calling_convention is interpreted as size, but it returns the max offset. That's off by one slot. Should I file a bug for that? (PPC64 is not affected because it doesn't use the result.)
Since the membar on the return path was mentioned: I think it would be good to enable UseSystemMemoryBarrier by default on operating systems which support it. Maybe we should discuss this with @robehn.

robehn · 2023-02-24T07:12:31Z

Since the membar on the return path was mentioned: I think it would be good to enable UseSystemMemoryBarrier by default on operating systems which support it. Maybe we should discuss this with @robehn.

Changing the default is fine by me.
There is, AFAIK, one case that needs to read the thread state a lot, thus emitting sysmembars alot, JFR with very high sampling rate. Other than that there are no issues that I know about.
Maybe it would be good to test at what sampling interval we notice a change?
Also I think it's not best match to have the flag experimental when we are in some sense using it. Maybe diagnostic?

… instructions for moving between FP and GP reg. Improve comments.

reinrich

Hi Martin,

finally I've completed a pass over the hotspot part of the port. This seems a good point to share the few comments and questions I've collected so far.

In general the changes do look very good. Hardly any shared code changes.
Nice work!

Cheers, Richard.

reinrich · 2023-03-27T16:54:31Z

src/hotspot/cpu/ppc/downcallLinker_ppc.cpp

+  }
+
+  allocated_frame_size = align_up(allocated_frame_size, StackAlignmentInBytes);
+  _frame_size_slots = allocated_frame_size >> LogBytesPerInt;


VMRegImpl::stack_slot_size could be used when converting from size in bytes to size in slots.

Yes, I think this would be better readable. But the following code should also be adapted, then:

int framesize() const { return (_frame_size_slots >> (LogBytesPerWord - LogBytesPerInt)); }

Maybe it makes sense to do some cleanup for all platforms.

src/hotspot/cpu/ppc/frame_ppc.cpp

reinrich · 2023-04-26T14:41:51Z

src/hotspot/cpu/ppc/methodHandles_ppc.cpp

+
+  // Load the invoker, as NEP -> .invoker
+  __ verify_oop(nep_reg);
+  __ ld(temp_target, jdk_internal_foreign_abi_NativeEntryPoint::downcall_stub_address_offset_in_bytes(), nep_reg);


Other platforms use access_load_at.

Interesting. I have no idea why. It does the same but with a more complicated API.
I just noticed that other platforms use NONZERO. I think I should at least add that.

It does the same but with a more complicated API.

AFAIK It depends on the GC that's being used. access_load_at will make sure the right GC barriers are inserted (mostly for concurrent GCs).

As I see it, the access API is an abstraction to be used instead of raw loads. It hides details. See for instance TemplateTable::getfield_or_static on x86 where it is also used. PPC lags behind in making use of the access API.
With a fancy new GC the oop in nep_reg could be stale, requiring some processing which would be taken care of by using the access API.

GC barriers are used when loading or storing an oop. No GC we currently have (not even the generational ones) use barriers for loading a plain address from an oop. The PPC64 implementation of the BarrierSetAssembler currently has Unimplemented() for non-oop types and all GCs are implemented.
Maybe it was intended for some future GC or other feature which has not yet reached the official repo.

I just figured it out. It was introduced by https://bugs.openjdk.org/browse/JDK-8203172 (on aarch64) which mentions Shenandoah and future GCs. However, the Shenandoah comment says "non-reference load, no additional barrier is needed" and it doesn't use barriers in such a case. So, for the time being, I'll keep the normal load (because access_load_at is not ready for non-oop types). But I should add the NONZERO check.

FWIW, since Shenandoah changed their load barriers we have been cleaning away the usages of the Access API for loads and stores to primitive values. There's no such support in the C++ Runtime code.

Ok, since this is loading a long (which represents an address that points into the code cache) I think we're fine without using the access API then?

Correct. The code had been written for the previous version of Shenandoah (1.0). No current GC uses barriers for non-oop types and the C++ Runtime doesn't support it any more as Stefan pointed out.
It is still possible to use the access API on other platforms, but it does nothing more than a plain load/store for non-oop types.

I'm ok with doing a plain access. I don't like the difference to other ports as it will at least waste time of people with less expertise in the area (e.g. new to the project). No need to continue the discussion in this pr though.

src/hotspot/cpu/ppc/upcallLinker_ppc.cpp

reinrich · 2023-04-28T13:18:27Z

src/hotspot/cpu/ppc/upcallLinker_ppc.cpp

+
+  __ block_comment("{ receiver ");
+  __ load_const_optimized(R3_ARG1, (intptr_t)receiver, R0);
+  __ resolve_jobject(R3_ARG1, tmp, R31, MacroAssembler::PRESERVATION_FRAME_LR_GP_FP_REGS); // kills R31


As a simplification the receiver could be resolved in UpcallLinker::on_entry and returned in JavaThread::_vm_result.

This sounds like a nice enhancement proposal for all platforms. The register spilling code in resolve_jobject can get lengthy dependent on the selected GC. Doing it in the C code (which we call anyway above) would make the upcall stubs smaller.
@JornVernee: What do you think about this idea?

Seems like a nice idea. The resolution here pre-dates the time where we called into the VM.

reinrich · 2023-05-09T15:48:52Z

src/hotspot/cpu/ppc/vmstorage_ppc.hpp

+      case T_CHAR   :
+      case T_BYTE   :
+      case T_SHORT  :
+      case T_INT    : segment_mask = REG32_MASK; break;


I wonder why the segment_mask depends on bt on ppc?

The usage of the segment_mask can be defined for each platform. I'm using it to encode the information if a value on the Java side uses a 32 or 64 bit slot. In case of 32 bit values, the C side requires all 64 register bits to get defined values (ints get sign extended, floats get converted to double-precision format).

reinrich

Hi Martin,
I've made a pass over the Java part (except HFA). I found the specs hard to understand but most specs are like this.
I'll finish the rest beginning of next week.
Cheers, Richard.

reinrich · 2023-05-16T14:44:25Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+ */
+package jdk.internal.foreign.abi.ppc64;
+
+import java.lang.foreign.AddressLayout;


Imports are not grouped and ordered alphabetically.
(Very much as the aarch64 version)

reinrich · 2023-05-16T15:28:59Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+    public static final int MAX_REGISTER_ARGUMENTS = 8;
+    public static final int MAX_FLOAT_REGISTER_ARGUMENTS = 13;
+
+    // This is derived from the 64-Bit ELF V2 ABI spec, restricted to what's


The comment says ABI V2 but the code seems to handle V1 too.

It's derived from ABI v2, but v1 is compatible. Added that to the comment.

reinrich · 2023-05-16T15:42:44Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+
+    class StorageCalculator {
+        private final boolean forArguments;
+        private boolean forVarArgs = false;


Seems to be not used.

I had kept it in case another PPC64 OS would need it, but I guess it's unlikely. So, I just removed it. Could get added back easily if needed.

I see. There are other examples of redundant code that might serve a purpose in the future. I honestly don't like that.
In the case of forVarArgs porters can still find it in the aarch64 version :)

reinrich · 2023-05-19T12:42:15Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+            // TODO: Big Endian can't pass partially used slots correctly in some cases with:
+            // !useABIv2() && layout.byteSize() > 8 && layout.byteSize() % 8 != 0
+
+            // Allocate individual fields as gp slots (regs and stack).


You explained to me, it's not individual (struct) fields that are handled here. Looks like registers and 8 byte stack slots are allocated to completely cover the struct. Would be good if you could change the comment and names in the code to better reflect this.

Adapted to aarch64 implementation (using MAX_COPY_SIZE).

reinrich · 2023-05-19T14:11:55Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/linux/LinuxPPC64leLinker.java

+    private static LinuxPPC64leLinker instance;
+
+    public static LinuxPPC64leLinker getInstance() {
+        if (instance == null) {


Other platforms optimized this to return a constant (probably after you forked off the port).

Good catch. Adapted.

reinrich · 2023-05-22T08:53:21Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+ * public constants CallArranger.ABIv1/2.
+ */
+public abstract class CallArranger {
+    protected abstract boolean useABIv2();


This could also be refactored into a static method with the same trick that is used in LinuxPPC64leLinker::getInstance. Callers could be static then and you could delete CallArranger::ABIv2 and ABIv2CallArranger.

Maybe something like?

protected static final boolean useABIv2 = CABI.current() == CABI.LINUX_PPC_64_LE;

That would be better to read, but would make the PPC64 CallArranger dependent on the current CABI.
Note that there are tests which use

import jdk.internal.foreign.abi.aarch64.CallArranger; ... CallArranger.LINUX.getBindings(mt, fd, false);

for example. The tests are designed to run on all platforms.

That would be better to read, but would make the PPC64 CallArranger dependent on the current CABI. Note that there are tests which use

import jdk.internal.foreign.abi.aarch64.CallArranger; ... CallArranger.LINUX.getBindings(mt, fd, false);

for example. The tests are designed to run on all platforms.

I see, thanks. Would be nice to have some for PPC64 too :)

Probably, yes. I didn't find time for figuring out what would be useful tests. We could still add some in the future or with the big endian port.
Another idea: Would the following be better?
final boolean useABIv2 = (this.getClass() == ABIv2CallArranger.class);
That would also allow getting rid of the method useABIv2().

Or better final boolean useABIv2 = (this instanceof ABIv2CallArranger);

Yes, good idea.

Please take a look at 70736be

It looks good.

reinrich

Hi Martin,

there seems to be a mismatch between this pr and the 64-bit ELF ABI V2 for PPC. In fact all dynamically generated calls that have to conform to ABI V2 are affected. I'm giving a short summary of our discussion this afternoon.

Very briefly: ABI V2 states that a Parameter Save Area (PSA) shall be allocated unless all parameters can be passed in registers as indicated by the caller's prototype, whereas the port always allocates a PSA of 8 double words.
(Details under "Parameter Save Area" in "2.2.3.3. Optional Save Areas" of ELF ABI V2)

It is not wrong what we're doing. It is like we didn't know the prototype of the call targets. But for most calls [1] we are wasting stack space (and confusing everybody that tries to match the implementation with the spec).
Interestingly ABI V1 states that a PSA of at least 8 double words is always needed. Looks like we've missed that change.

I have conducted a little experiment and compiled the following test program using Compiler Explorer [2]

#include <stdint.h>

int64_t test_callee(int64_t p1, int64_t p2, int64_t p3, int64_t p4);

int64_t test_call(int64_t p1, int64_t p2, int64_t p3, int64_t p4) {
    return test_callee(p1, p2, p3, p4);
}

This is the -O2 output for ELF ABI V2 (little endian)
Note: the stdu allocates just the minimal frame of 4 double words without PSA.

test_call:                              # @test_call
.Lfunc_gep0:
        addis 2, 12, .TOC.-.Lfunc_gep0@ha
        addi 2, 2, .TOC.-.Lfunc_gep0@l
        mflr 0
        stdu 1, -32(1)
        std 0, 48(1)
        bl test_callee
        nop
        addi 1, 1, 32
        ld 0, 16(1)
        mtlr 0
        blr

This is the -O2 output for ELF ABI V1 (big endian)

test_call:                              # @test_call
        .quad   .Lfunc_begin0
        .quad   .TOC.@tocbase
        .quad   0
.Lfunc_begin0:
        mflr 0
        stdu 1, -112(1)
        std 0, 128(1)
        bl test_callee
        nop
        addi 1, 1, 112
        ld 0, 16(1)
        mtlr 0
        blr

Note: the stdu allocates a much larger frame because it accomodates a PSA of 8 double words.

I'd suggest to keep the current well tested version but add comments to the code that a PSA is always allocated even though ABI V2 does not require it. This should also be explained in the JBS item.

Furthermore RFEs should be filed to adopt ABI V2 in the FFM API port and in the hotspot port to PPC64le. There's quite a bit of room for improvement there. Ironically I've very recently fixed JDK-8306111 citing the ABI V2 spec without realizing that the fix is not needed for V2, just for V1.

[1] Exceptions that do require a PSA are calls with a really long or variable parameter list or if a prototype is not available

[2] Experiment with "Compiler Explorer"

reinrich · 2023-05-22T21:22:50Z

src/hotspot/cpu/ppc/downcallLinker_ppc.cpp

+  // (native_abi_reg_args is native_abi_minframe plus space for 8 argument register spill slots)
+  assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
+  // Note: For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0
+  int register_save_area_slots = MAX2(_input_registers.length(), 8);


Both specs, ABI V1 and V2, call this "Parameter Save Area" we should use the same name.

Suggested change

int register_save_area_slots = MAX2(_input_registers.length(), 8);

int parameter_save_area_slots = MAX2(_input_registers.length(), 8);

Thanks! Changed.

reinrich · 2023-05-22T21:26:26Z

src/hotspot/cpu/ppc/downcallLinker_ppc.cpp

+  // This may be a bit more than needed when HFA is used (see CallArranger.java).
+  // (native_abi_reg_args is native_abi_minframe plus space for 8 argument register spill slots)
+  assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
+  // Note: For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0


This is hard to understand. It should be explained that we allocate a PSA even though ABI V2 only requires it if not all parameters can be passed in registers.

reinrich · 2023-05-22T21:49:27Z

src/java.base/share/classes/jdk/internal/foreign/abi/ppc64/CallArranger.java

+                stackAlloc(4, STACK_SLOT_SIZE); // Skip first half of stack slot.
+                stack = stackAlloc(4, 4);
+            } else {
+                stack = stackAlloc(is32Bit ? 4 : 8, STACK_SLOT_SIZE);


This looks like a stack slot is always allocated. Please explain that for ABI V2 this is actually only required if it is know from a prototype that not all parameters can be passed in registers and that we plan to change this.

This basically computes the stack layout. We need to count all slots to get the right offset for the registers which actually get written on stack. The first such register will hit native_abi_minframe_size + 8 slots. If fewer registers are used, the counted stack slots will not be used.
The decision whether we allocate the Parameter Save Area or not is done in the downcall stub and doesn't depend on the stackAllocs.

TheRealMDoerr · 2023-05-22T22:44:41Z

Thanks for publishing our discussion, here. The unnecessary PSA affects other areas of hotspot much more than Panama. Yes, we should file an RFE. I think one for hotspot is sufficient as the downcall stub is part of it. I don't think it needs extra treatment.

reinrich · 2023-05-23T07:37:37Z

src/hotspot/cpu/ppc/downcallLinker_ppc.cpp

+  assert(_abi._shadow_space_bytes == frame::native_abi_minframe_size, "expected space according to ABI");
+  // The Parameter Save Area needs to be at least 8 slots for ABIv1.
+  // ABIv2 allows omitting it when all parameters can get passed in registers. We currently don't optimize this.
+  // For ABIv2, we only need (_input_registers.length() > 8) ? _input_registers.length() : 0


The PSA is also needed if the parameter list is variable in length. Is the expression (_input_registers.length() > 8) ? _input_registers.length() : 0 correct in that case too?
Otherwise: ABIv2 allows omitting it if the callee's prototype indicates that stack parameters are not expected. We currently don't optimize this.

Ok, I see now. This is not obvious though. There are a few layers of abstraction at play which hide this. A comment is needed. Maybe like this:

// With ABIv1 a Parameter Save Area of at least 8 double words is always needed. // ABIv2 allows omitting it if the callee's prototype indicates that stack parameters are not expected. // We currently don't optimize this (see DowncallStubGenerator in the backend). if (reg == null) return stack;

I believe omitting the PSA is wrong for varargs, but we don't have this information in the backend. So, I think we should simply not optimize it. Reserving 64 Byte stack space should be affordable for a downcall even if it's not always needed. The Java side could compute it, but there's no way to pass this information to the backend. I've improved the comments. Please take a look.

I believe omitting the PSA is wrong for varargs, but we don't have this information in the backend.

It is clearly wrong.

So, I think we should simply not optimize it.

Already agreed for this version.

Reserving 64 Byte stack space should be affordable for a downcall even if it's not always needed.

It is hardly ever needed to begin with. It's just not pretty to allocate the PSA for little test functions. It'll confuse people that compare the downcall stub to what the C compiler produces. They might even think we havn't understood the ABI ;)

The Java side could compute it, but there's no way to pass this information to the backend.

I'm not sure about that. First idea would be to keep the allocated stack in nextStorage if needed and pass it to the backend.

I've improved the comments. Please take a look.

Great! Thanks :)

reinrich · 2023-05-23T07:51:55Z

Thanks for publishing our discussion, here. The unnecessary PSA affects other areas of hotspot much more than Panama. Yes, we should file an RFE. I think one for hotspot is sufficient as the downcall stub is part of it. I don't think it needs extra treatment.

That's fine. It should have a little list of areas to be revisited.

Adoc:

Runtime calls by the interpreter, c1, and c2
Interpreted and compiled JNI calls
FFM API ("Panama") calls
Runtime calls by continuation intrisics
Runtime calls by GC barriers

Subtasks can be generated from that list.

reinrich

This is really good to go now.
(maybe after a final round of tests)

Thanks a lot!
Richard.

TheRealMDoerr · 2023-05-23T20:19:30Z

Thanks a lot for all discussions, feedback and the reviews! This was really helpful! I'm planning to integrate tomorrow.

TheRealMDoerr · 2023-05-24T08:36:10Z

/integrate

openjdk · 2023-05-24T08:38:52Z

Going to push as commit 20f1535.
Since your change was applied there have been 58 commits pushed to the master branch:

466ec30: 8302736: Major performance regression in Math.log on aarch64
05c095c: 8308151: [JVMCI] capture JVMCI exceptions in hs-err
beb75e6: 8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
2836c34: 8179502: Enhance OCSP, CRL and Certificate Fetch Timeouts
8ffa264: 8306698: Add overloads to MethodTypeDesc::of
6b27dad: 8301154: SunPKCS11 KeyStore deleteEntry results in dangling PrivateKey entries
ed0e956: 8308716: ProblemList java/util/concurrent/ScheduledThreadPoolExecutor/BasicCancelTest.java with genzgc on windows-x64
bddf483: 8303942: os::write should write completely
ab241b3: 8306706: Support out-of-line code generation for MachNodes
710453c: 8308016: Use snippets in java.io package
... and 48 more: https://git.openjdk.org/jdk/compare/939344b8433b32166f42ad73ae3d96e84b033478...master

Your commit was automatically rebased without conflicts.

openjdk · 2023-05-24T08:39:06Z

@TheRealMDoerr Pushed as commit 20f1535.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Initial Panama implementation.

4a5debf

openjdk bot added the rfr label Feb 22, 2023

openjdk bot added hotspot core-libs shenandoah labels Feb 22, 2023

JornVernee reviewed Feb 22, 2023

View reviewed changes

mcimadamore reviewed Feb 22, 2023

View reviewed changes

Clean fix for NativeMemorySegmentImpl issue with byteSize 0.

7315fd2

Remove size restriction for structs. Add TODO for Big Endian.

a4d844f

TheRealMDoerr added 2 commits May 5, 2023 16:51

Add test case for passing a double value in a GP register. Use better…

754a19a

… instructions for moving between FP and GP reg. Improve comments.

libTestHFA: Add explicit type conversion to avoid build warning.

74586ab

reinrich reviewed May 9, 2023

View reviewed changes

TheRealMDoerr added 2 commits May 10, 2023 12:50

Replace NULL by nullptr.

9306025

Add NONZERO check for downcall_stub_address_offset_in_bytes().

edcdefb

reinrich reviewed May 19, 2023

View reviewed changes

Cleanup imports, improve comments, updates from other platforms.

b1f0438

reinrich reviewed May 22, 2023

View reviewed changes

TheRealMDoerr added 4 commits May 22, 2023 16:08

Replace abstract method useABIv2().

70736be

Add comment about Register Save Area.

ac5c5dc

Merge remote-tracking branch 'origin' into PPC64_Panama

0c863d0

Adaptation for JDK-8308276.

08a5c14

reinrich suggested changes May 22, 2023

View reviewed changes

Parameter Save Area is the correct name.

b912155

reinrich reviewed May 23, 2023

View reviewed changes

Improve comments about the Parameter Save Area.

5a00d80

reinrich approved these changes May 23, 2023

View reviewed changes

openjdk bot added the integrated label May 24, 2023

openjdk bot closed this May 24, 2023

openjdk bot removed ready rfr labels May 24, 2023

TheRealMDoerr deleted the PPC64_Panama branch May 24, 2023 08:39

JornVernee mentioned this pull request Jul 7, 2023

8311630: [s390] Implementation of Foreign Function & Memory API (Preview) #14801

Closed

3 tasks

TheRealMDoerr mentioned this pull request Oct 21, 2023

8318609: Upcall stubs should be smaller #16290

Closed

3 tasks

	int register_save_area_slots = MAX2(_input_registers.length(), 8);
	int parameter_save_area_slots = MAX2(_input_registers.length(), 8);

8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview) #12708

8303040: linux PPC64le: Implementation of Foreign Function & Memory API (Preview) #12708

Conversation

TheRealMDoerr commented Feb 22, 2023 • edited by openjdk bot Loading

Progress

Issue

Reviewers

Reviewing

Webrev

bridgekeeper bot commented Feb 22, 2023

openjdk bot commented Feb 22, 2023

mlbridge bot commented Feb 22, 2023 • edited Loading

Webrevs

minborg commented Feb 22, 2023

JornVernee commented Feb 22, 2023

JornVernee commented Feb 22, 2023 • edited Loading

JornVernee Feb 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcimadamore commented Feb 22, 2023

TheRealMDoerr commented Feb 23, 2023

TheRealMDoerr commented Feb 23, 2023

TheRealMDoerr commented Feb 23, 2023

TheRealMDoerr commented Feb 23, 2023

mcimadamore commented Feb 23, 2023

mcimadamore commented Feb 23, 2023

JornVernee commented Feb 23, 2023 • edited Loading

JornVernee commented Feb 23, 2023

JornVernee commented Feb 23, 2023 • edited Loading

JornVernee commented Feb 23, 2023 • edited Loading

mcimadamore commented Feb 23, 2023

JornVernee commented Feb 23, 2023

TheRealMDoerr commented Feb 24, 2023 • edited Loading

robehn commented Feb 24, 2023

reinrich left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JornVernee May 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reinrich left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reinrich left a comment

Choose a reason for hiding this comment

TheRealMDoerr commented Feb 22, 2023 •

edited by openjdk bot

Loading

mlbridge bot commented Feb 22, 2023 •

edited

Loading

JornVernee commented Feb 22, 2023 •

edited

Loading

JornVernee Feb 22, 2023 •

edited

Loading

JornVernee commented Feb 23, 2023 •

edited

Loading

JornVernee commented Feb 23, 2023 •

edited

Loading

JornVernee commented Feb 23, 2023 •

edited

Loading

TheRealMDoerr commented Feb 24, 2023 •

edited

Loading

JornVernee May 10, 2023 •

edited

Loading

TheRealMDoerr May 22, 2023 •

edited

Loading

reinrich May 23, 2023 •

edited

Loading