
8296548: Improve MD5 intrinsic for x86_64 #11054

Closed
wants to merge 2 commits

Conversation

yftsai
Contributor

@yftsai yftsai commented Nov 9, 2022

The LEA instruction computes an effective address, but the MD5 intrinsic uses it for computing values rather than addresses. This usage can take more cycles than ADDs and reduces throughput.

This change replaces
LEA: r1 = r1 + rsi * 1 + t
with
ADDs: r1 += t; r1 += rsi.

Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.

No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.

Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11054/head:pull/11054
$ git checkout pull/11054

Update a local copy of the PR:
$ git checkout pull/11054
$ git pull https://git.openjdk.org/jdk pull/11054/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 11054

View PR using the GUI difftool:
$ git pr show -t 11054

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11054.diff


@bridgekeeper

bridgekeeper bot commented Nov 9, 2022

👋 Welcome back yftsai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 9, 2022
@openjdk

openjdk bot commented Nov 9, 2022

@yftsai The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 9, 2022
@mlbridge

mlbridge bot commented Nov 9, 2022

Webrevs

@eastig
Member

eastig commented Nov 11, 2022

/label hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 11, 2022
@openjdk

openjdk bot commented Nov 11, 2022

@eastig
The hotspot-compiler label was successfully added.

@luhenry
Member

luhenry commented Nov 14, 2022

Could you please post JMH microbenchmarks with and without this change? You can run them with org.openjdk.bench.java.security.MessageDigests [1]

[1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java
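For reference, the OpenJDK micro suite can be driven through the build system. A typical invocation from the jdk repository root might look like the following (target and option names assumed from the JDK testing docs; adjust for your configuration):

```shell
# Run the MessageDigests JMH microbenchmark, restricted to MD5
make test TEST="micro:java.security.MessageDigests" \
    MICRO="OPTIONS=-p digesterName=md5"
```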

@vnkozlov
Contributor

Yes, please post performance data.
Note that TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are regression/correctness tests.
It would be nice to have proper JMH benchmarks to show the improvement.

@vnkozlov
Contributor

@sviswa7 or @jatin-bhateja do you agree with these changes?

@jatin-bhateja
Member

@sviswa7 or @jatin-bhateja do you agree with these changes?

The patch shows a significant improvement and better port utilization with 3+ micro-ops on CLX.

JDK-With-opt:
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt    2  5613.517          ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt    2    50.026          ops/ms

   43,24,11,23,563      exe_activity.1_ports_util                                     (79.97%)
   54,01,28,04,330      exe_activity.2_ports_util                                     (80.22%)
   25,20,63,64,512      exe_activity.3_ports_util                                     (80.00%)
    6,42,47,64,948      exe_activity.4_ports_util                                     (79.83%)

JDK-baseline:
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt    2  4087.112          ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt    2    35.291          ops/ms

   50,76,35,89,853      exe_activity.1_ports_util                                     (80.09%)
   36,59,68,98,931      exe_activity.2_ports_util                                     (79.89%)
    9,61,69,23,581      exe_activity.3_ports_util                                     (80.02%)
    1,88,94,94,202      exe_activity.4_ports_util                                     (79.98%)

@yftsai
Contributor Author

yftsai commented Nov 15, 2022

Performance without the optimization on Cascade Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score     Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3315.328 ±  65.799  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    27.482 ±   0.006  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2916.207 ± 127.293  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    27.381 ±   0.003  ops/ms

Performance with optimization on Cascade Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score     Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  4474.780 ±  17.583  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    38.926 ±   0.005  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3796.684 ± 153.887  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    38.724 ±   0.005  ops/ms

@yftsai
Contributor Author

yftsai commented Nov 15, 2022

Performance without the optimization on Ice Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  5402.018 ± 17.033  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    43.722 ±  0.003  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  4652.620 ± 35.432  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    43.573 ±  0.016  ops/ms

Performance with optimization on Ice Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  5348.594 ± 14.303  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    43.671 ±  0.008  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  4583.530 ± 12.752  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    43.545 ±  0.006  ops/ms

@eastig
Member

eastig commented Nov 15, 2022

@luhenry, @vnkozlov
Sorry for the uninformative PR description.

In the MD5 intrinsic stub we use a 3-operand LEA, and this LEA is on the critical path.

The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), 3.5.1.2:

In Sandy Bridge microarchitecture, there are two significant changes to the performance characteristics of LEA instruction:
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better.

According to Agner Fog's instruction tables (https://www.agner.org/optimize/instruction_tables.pdf), ADD has had latency 1 and throughput 4 since Haswell, while on Ice Lake LEA performance was improved to latency 1 and throughput 2. This explains why there is no improvement there.

The patch's correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
The microbenchmark we used:

import org.apache.commons.lang3.RandomStringUtils; // third-party dependency: Apache Commons Lang

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.BenchmarkParams;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MD5Benchmark {

    private static final int MAX_INPUTS_COUNT = 1000;
    private static final int MAX_INPUT_LENGTH = 128 * 1024;
    private static List<byte[]> inputs;

    static {
        inputs = new ArrayList<>();
        IntStream.rangeClosed(1, MAX_INPUTS_COUNT).forEach(value -> inputs.add(RandomStringUtils.randomAlphabetic(MAX_INPUT_LENGTH).getBytes(StandardCharsets.UTF_8)));
    }

    @Param({"64", "128", "256", "512", "1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
    private int data_len;

    @State(Scope.Thread)
    public static class InputData {
        byte[] data;
        int count;
        byte[] expectedDigest;
        byte[] digest;

        @Setup
        public void setup(BenchmarkParams params) {
            data = inputs.get(ThreadLocalRandom.current().nextInt(0, MAX_INPUTS_COUNT));
            count = Integer.parseInt(params.getParam("data_len"));
            expectedDigest = calculateMD5Checksum(data, count);
        }

        @TearDown
        public void check() {
            if (!Arrays.equals(expectedDigest, digest)) {
                throw new RuntimeException("Expected md5 digest:\n" + Arrays.toString(expectedDigest) +
                                           "\nGot:\n" + Arrays.toString(digest));
            }
        }
    }

    @Benchmark
    public void testMD5(InputData in) {
        in.digest = calculateMD5Checksum(in.data, in.count);
    }

    private static byte[] calculateMD5Checksum(byte[] input, int count) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(input, 0, count);
            return md5.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
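As a quick sanity check of the code path this intrinsic accelerates, a minimal, self-contained example (not part of the PR) can hash the RFC 1321 test vector "abc" through the same `MessageDigest` API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class Md5Sanity {
    public static void main(String[] args) throws Exception {
        // MessageDigest.digest() is the call that, once the method is
        // compiled, can be served by the HotSpot MD5 intrinsic.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(HexFormat.of().formatHex(digest));
        // prints 900150983cd24fb0d6963f7d28e17f72 (RFC 1321 test vector)
    }
}
```

The digest must be identical with the intrinsic enabled or disabled, which is what the correctness tests above verify.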

Contributor

@vnkozlov vnkozlov left a comment

Thank you all for providing performance data. Looks good. I will run testing.

@openjdk

openjdk bot commented Nov 15, 2022

@yftsai This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8296548: Improve MD5 intrinsic for x86_64

Reviewed-by: kvn, sviswanathan, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@vnkozlov
Contributor

Do we have other intrinsics which use LEA (not for this fix)?

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 15, 2022
@vnkozlov
Contributor

@yftsai can you merge the latest JDK sources? Some of the GHA testing failures should be fixed.

@jnimeh
Member

jnimeh commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic).

@eastig
Member

eastig commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

I have plans to look at other uses of LEA in Hotspot. I have not started yet due to other urgent work.

@eastig
Member

eastig commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic).

From #7702, I see they are not 3-operand LEAs. No need to change them.

@sviswa7

sviswa7 commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.
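For illustration only, a stub generator could use those checks to pick between the two forms. A rough HotSpot-style sketch (not code from this PR; register and temp names are placeholders):

```cpp
// Hypothetical sketch: choose LEA vs. ADDs based on the CPU's LEA latency.
if (VM_Version::supports_fast_3op_lea()) {
  __ leal(r1, Address(r1, rsi, Address::times_1, t));  // fast 3-operand LEA
} else {
  __ addl(r1, t);    // two dependent ADDs avoid the 3-cycle LEA
  __ addl(r1, rsi);  // penalty on pre-Ice Lake cores
}
```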

@vnkozlov
Contributor

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.

Thank you, @sviswa7.

For this fix, based on the Ice Lake data provided by @yftsai, the potential benefit of supports_fast_3op_lea() is not enough to justify increasing the code complexity. Maybe in other places it would be more useful, but not here IMHO.

@sviswa7

sviswa7 commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.

Thank you, @sviswa7.

For this fix, based on the Ice Lake data provided by @yftsai, the potential benefit of supports_fast_3op_lea() is not enough to justify increasing the code complexity. Maybe in other places it would be more useful, but not here IMHO.

Yes, I agree. The PR looks good to me.

@vnkozlov
Contributor

My testing passed.

@yftsai
Contributor Author

yftsai commented Nov 16, 2022

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Nov 16, 2022
@openjdk

openjdk bot commented Nov 16, 2022

@yftsai
Your change (at version be07b34) is now ready to be sponsored by a Committer.

@jatin-bhateja
Member

/sponsor

@openjdk

openjdk bot commented Nov 16, 2022

Going to push as commit 6ead2b0.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 16, 2022
@openjdk openjdk bot closed this Nov 16, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Nov 16, 2022
@openjdk

openjdk bot commented Nov 16, 2022

@jatin-bhateja @yftsai Pushed as commit 6ead2b0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@yftsai yftsai deleted the JDK-8296548 branch November 16, 2022 21:12