8323116: [REDO] Computational test more than 2x slower when AVX instructions are used #18503
Conversation
👋 Welcome back vamsi-parasa! A progress list of the required criteria for merging this PR will be added to the body of your pull request.
@vamsi-parasa This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 212 new commits pushed to the target branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@sviswa7, @vnkozlov) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.
@vamsi-parasa The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
Looks good to me.
Could I get one more review please? Thanks,
I have one question about the changes in the assembler code. I see you avoided the xor for instructions with memory operands by executing them only without AVX.
I will run our performance testing to see if this change affects performance. Eric did run it, but I don't know which version.
@@ -2031,7 +2031,7 @@ void Assembler::cvtsd2ss(XMMRegister dst, XMMRegister src) {
   NOT_LP64(assert(VM_Version::supports_sse2(), ""));
   InstructionAttr attributes(AVX_128bit, /* rex_w */ VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);
   attributes.set_rex_vex_w_reverted();
-  int encode = simd_prefix_and_encode(dst, dst, src, VEX_SIMD_F2, VEX_OPCODE_0F, &attributes);
+  int encode = simd_prefix_and_encode(dst, src, src, VEX_SIMD_F2, VEX_OPCODE_0F, &attributes);
Can you explain this change?
Similar to #18089, the purpose of this change is to remove the slowdown due to a false dependency. For example, with the current (dst, dst, src) encoding in the case of VCVTSD2SS xmm1, xmm2, xmm3/m64, the instruction converts one double-precision floating-point value in xmm3/m64 to one single-precision floating-point value and merges it with the high bits of xmm2. This merge with the high bits of xmm2 causes a false dependency, since xmm1 and xmm2 are the same register in the (dst, dst, src) encoding.
We remove the false dependency by (1) removing the m64 source in the VCVTSD2SS instruction encoding in the .ad file, (2) loading the m64 source into src before calling VCVTSD2SS, using vmovsd src, m64, which also explicitly zeroes the high bits of src, and then (3) calling VCVTSD2SS dst, src, src. Thus dst[31:0] now gets the result of the convert operation from src[63:0], and passing src as the non-destructive source (NDS) prevents the false dependency.
Thanks,
Vamsi
Thank you for explaining.
This is a downcast from a double-precision to a single-precision value, so only the lower 32 bits of the destination hold the actual conversion result; the upper bits 127:32 are copied from the non-destructive source operand for the VEX-encoded instruction.
VCVTSD2SS (VEX.128 Encoded Version)
DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
The user is only interested in the lower 32 bits of the destination, and passing the source as the NDS prevents the false dependency on AVX targets: instruction dispatch is no longer held up by the false dependency, and the instruction is issued to the OOO backend the moment the source is ready.
This change modifies the defined behaviour of cvtss2sd. Without AVX it retains bits 64-127 of dst, while with AVX those bits are copied from src. I would suggest separating the matching rules instead.
It's a clever trick to dodge the false dependency without compromising correctness.
@jatin-bhateja I get it, but IMO it shouldn't be the responsibility of the assembler to do that; the assembler should emit machine code in a manner that respects what is being written.
This is a downcast from a double-precision to a single-precision value, so only the lower 32 bits of the destination hold the actual conversion result; the upper bits 127:32 are copied from the non-destructive source operand for the VEX-encoded instruction.
Please see the updated description incorporating the correction dst[63:0] -> dst[31:0] for cvtss2sd.
This change modifies the defined behaviour of cvtss2sd. Without AVX it retains bits 64-127 of dst, while with AVX those bits are copied from src. I would suggest separating the matching rules instead.
Please address this; FYI, in similar cases we created separate methods in the MacroAssembler, such as movflt or movdbl. Feel free to disagree, but I think the assembler should not behave differently compared to the corresponding assembly instruction.
And I will run regular testing too.
Next tests failed when running with …
Thank you, Vladimir (@vnkozlov). Will look into the test and fix it.
The KNL-related failure was fixed in the latest commit by adding the check. Could you please have a look at this change? Thanks,
I will submit new testing.
Thank you Vladimir!
@@ -11710,7 +11710,7 @@ int Assembler::vex_prefix_and_encode(int dst_enc, int nds_enc, int src_enc, VexS
     }
   }

-  if (UseAVX > 2) {
+  if (UseAVX > 2 && !attributes->uses_vl()) {
This is already covered by the assertion below.
Without this check, the test fails for KNL (-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting), as Vladimir mentioned. Is there a better way to handle the KNL case?
Yes, there is an easy way. For the instructs where you added the pxor instruction generation, you could change the dst register type from regF to vlRegF. This restricts the xmm register to xmm0-xmm15 for KNL, thereby not needing the evex encoding and, in turn, not needing avx512vl support for pxor.
and likewise from regD to vlRegD.
Yes, there is an easy way. For the instructs where you added the pxor instruction generation, you could change the dst register type from regF to vlRegF.
Thanks Sandhya, will make the changes and push an update.
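To make the suggestion concrete for readers unfamiliar with the AD files, a change of this kind might look roughly as follows. This is a hypothetical sketch in the x86.ad instruct DSL; the rule name, format string, and encoding body are illustrative and not the actual contents of the patch:

```
// Hypothetical sketch -- not the actual x86.ad contents of this PR.
// Using vlRegF instead of regF restricts dst to xmm0-xmm15, so the
// pxor below never needs an EVEX encoding (and thus no AVX512VL) on KNL.
instruct convD2F_reg(vlRegF dst, regD src)   // was: regF dst
%{
  match(Set dst (ConvD2F src));
  format %{ "convert_d2f $dst, $src" %}
  ins_encode %{
    __ pxor($dst$$XMMRegister, $dst$$XMMRegister);    // break the false dependency
    __ cvtsd2ss($dst$$XMMRegister, $src$$XMMRegister);
  %}
  ins_pipe(pipe_slow);
%}
```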
My new testing passed.
@vnkozlov If I understand the proposal from @merykitty correctly, the suggestion is to reserve xmm15 as non-allocatable throughout. This sounds like a big overhead for cases where every xmm register is usable, say in a Vector API kernel. From Vamsi's microbenchmark runs, he has clearly shown that the gain of his optimization is far more than any overhead of doing pxor just before the converts.
Okay. I will wait for the changes @sviswa7 suggested, to use vlRegD and vlRegF.
Thanks Vladimir!
Will make the changes and let you know.
Please see the updated commit which uses vlRegD and vlRegF.
Okay. I need to run testing again.
Yes with my proposal we are losing 1 out of 16 registers, which is a cost. But emitting an additional instruction for every conversion from integer to floating point values is also a cost. A more conservative solution is to use the last register in the allocation chunk which will often be unused, and when it is used, the function should be crowded with other instructions such that this particular dependency will not have a profound effect.
You cannot reach that conclusion; we are trading off here, and this benchmark was chosen because it is bottlenecked by that particular dependency. The situation may not be the same in other cases. Cheers,
@merykitty I would like to disagree; the decision to reserve a register for the entire duration of a program cannot be taken lightly.
Executive (my ;^) decision: we go with the current changes, no xmm15 reservation. I am starting the (I hope final) testing round.
@sviswa7 I didn't disagree with you, I just made a more conservative proposal that uses …
My testing of v05 passed - no new failures.
Thank you Vladimir!
/integrate
@vamsi-parasa
Let us go with Vladimir's executive decision for now and integrate this. Any improvements in subsequent PRs are always welcome.
/sponsor
Going to push as commit 7e5ef79.
Your commit was automatically rebased without conflicts.
@sviswa7 @vamsi-parasa Pushed as commit 7e5ef79. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
The goal of this small PR is to improve the performance of the convert instructions and address the slowdown when AVX>0 is used.
The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
Reviewing
Using git
Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18503/head:pull/18503
$ git checkout pull/18503
Update a local copy of the PR:
$ git checkout pull/18503
$ git pull https://git.openjdk.org/jdk.git pull/18503/head
Using Skara CLI tools
Check out this PR locally:
$ git pr checkout 18503
View PR using the GUI difftool:
$ git pr show -t 18503
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18503.diff