-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8296548: Improve MD5 intrinsic for x86_64 #11054
Conversation
The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput. This change replaces LEA: r1 = r1 + rsi * 1 + t with ADDs: r1 += t; r1 += rsi. Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc. No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc. Similar results can also be observed in TestMD5Intrinsics and TestMD5MultiBlockIntrinsics with a more moderate improvement, e.g. ~15% improvement in throughput on Haswell.
👋 Welcome back yftsai! A progress list of the required criteria for merging this PR into |
Webrevs
|
/label hotspot-compiler |
@eastig |
Could you please post JMH microbenchmarks with and without this change? You can run them with |
Yes, please, post performance data. |
@sviswa7 or @jatin-bhateja do you agree with these changes? |
|
Performance without the optimization on Cascade Lake:
Performance with optimization on Cascade Lake:
|
Performance without the optimization on Ice Lake:
Performance with optimization on Ice Lake:
|
@luhenry, @vnkozlov In the MD5 intrinsic stub we use 3 operand LEA. This LEA is on the critical path. The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), 3.5.1.2:
ADD has had latency 1 and throughput 4 since Haswell (see https://www.agner.org/optimize/instruction_tables.pdf). The patch correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for providing performance data. Looks good. I will run testing.
@yftsai This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @sviswa7) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
Do we have other intrinsics which use LEA (not for this fix)? |
@yftsai can you merge latest JDK sources? Some of GHA testing failures should be fixed. |
My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic). |
I have plans to look at other uses of LEA in Hotspot. I have not started yet due to other urgent work. |
From #7702, I see they are not 3 operand LEA. No need to change them. |
There is a VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() check available which is used to do lea optimizations. |
Thanks you @sviswa7 For this fix, based on IceLake data provided by @yftsai, |
Yes, I agree. The PR looks good to me. |
My testing passed. |
/integrate |
/sponsor |
Going to push as commit 6ead2b0. |
@jatin-bhateja @yftsai Pushed as commit 6ead2b0. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
The LEA instruction loads the effective address, but MD5 intrinsic uses it for computing values than addresses. This usage potentially uses more cycles than ADDs and reduces the throughput.
This change replaces
LEA: r1 = r1 + rsi * 1 + t
with
ADDs: r1 += t; r1 += rsi.
Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.
No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.
Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11054/head:pull/11054
$ git checkout pull/11054
Update a local copy of the PR:
$ git checkout pull/11054
$ git pull https://git.openjdk.org/jdk pull/11054/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 11054
View PR using the GUI difftool:
$ git pr show -t 11054
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11054.diff