
8296548: Improve MD5 intrinsic for x86_64 #11054

Closed
wants to merge 2 commits

Conversation

yftsai
Contributor

@yftsai yftsai commented Nov 9, 2022

The LEA instruction computes an effective address, but the MD5 intrinsic uses it for computing values rather than addresses. This usage can take more cycles than ADDs and reduces throughput.

This change replaces
LEA: r1 = r1 + rsi * 1 + t
with
ADDs: r1 += t; r1 += rsi.

Microbenchmark evaluation shows ~40% performance improvement on Haswell, Broadwell, Skylake, and Cascade Lake. There is ~20% improvement on 2nd gen Epyc.

No performance change for the same microbenchmark on Ice Lake and 3rd gen Epyc.

Similar results can be observed with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics. There is ~15% improvement in throughput on Haswell, Broadwell, Skylake, and Cascade Lake.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11054/head:pull/11054
$ git checkout pull/11054

Update a local copy of the PR:
$ git checkout pull/11054
$ git pull https://git.openjdk.org/jdk pull/11054/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 11054

View PR using the GUI difftool:
$ git pr show -t 11054

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11054.diff


@bridgekeeper

bridgekeeper bot commented Nov 9, 2022

👋 Welcome back yftsai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 9, 2022
@openjdk

openjdk bot commented Nov 9, 2022

@yftsai The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 9, 2022
@mlbridge

mlbridge bot commented Nov 9, 2022

Webrevs

@eastig
Member

eastig commented Nov 11, 2022

/label hotspot-compiler

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Nov 11, 2022
@openjdk

openjdk bot commented Nov 11, 2022

@eastig
The hotspot-compiler label was successfully added.

@luhenry
Member

luhenry commented Nov 14, 2022

Could you please post JMH microbenchmarks with and without this change? You can run them with org.openjdk.bench.java.security.MessageDigests [1]

[1] https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/security/MessageDigests.java
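For reference, the OpenJDK micro suite can be driven through the build system. A typical invocation from the jdk repository root might look like the following (target and option names assumed from the JDK testing docs; adjust for your configuration):

```shell
# Run the MessageDigests JMH microbenchmark, restricted to MD5
make test TEST="micro:java.security.MessageDigests" \
    MICRO="OPTIONS=-p digesterName=md5"
```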

@vnkozlov
Contributor

Yes, please post performance data.
Note that TestMD5Intrinsics and TestMD5MultiBlockIntrinsics are regression/correctness tests.
It would be nice to have proper JMH benchmarks to show the improvement.

@vnkozlov
Contributor

@sviswa7 or @jatin-bhateja do you agree with these changes?

@jatin-bhateja
Member

@sviswa7 or @jatin-bhateja do you agree with these changes?

The patch shows a significant improvement and better port utilization with 3+ micro-ops on CLX.

JDK-With-opt:
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt    2  5613.517          ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt    2    50.026          ops/ms

   43,24,11,23,563      exe_activity.1_ports_util                                     (79.97%)
   54,01,28,04,330      exe_activity.2_ports_util                                     (80.22%)
   25,20,63,64,512      exe_activity.3_ports_util                                     (80.00%)
    6,42,47,64,948      exe_activity.4_ports_util                                     (79.83%)

JDK-baseline:
Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
MessageDigests.digest             md5        64     DEFAULT  thrpt    2  4087.112          ops/ms
MessageDigests.digest             md5     16384     DEFAULT  thrpt    2    35.291          ops/ms

   50,76,35,89,853      exe_activity.1_ports_util                                     (80.09%)
   36,59,68,98,931      exe_activity.2_ports_util                                     (79.89%)
    9,61,69,23,581      exe_activity.3_ports_util                                     (80.02%)
    1,88,94,94,202      exe_activity.4_ports_util                                     (79.98%)

@yftsai
Contributor Author

yftsai commented Nov 15, 2022

Performance without the optimization on Cascade Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score     Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  3315.328 ±  65.799  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    27.482 ±   0.006  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  2916.207 ± 127.293  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    27.381 ±   0.003  ops/ms

Performance with optimization on Cascade Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score     Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  4474.780 ±  17.583  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    38.926 ±   0.005  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  3796.684 ± 153.887  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    38.724 ±   0.005  ops/ms

@yftsai
Contributor Author

yftsai commented Nov 15, 2022

Performance without the optimization on Ice Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  5402.018 ± 17.033  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    43.722 ±  0.003  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  4652.620 ± 35.432  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    43.573 ±  0.016  ops/ms

Performance with optimization on Ice Lake:

Benchmark                    (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
MessageDigests.digest                   md5        64     DEFAULT  thrpt   15  5348.594 ± 14.303  ops/ms
MessageDigests.digest                   md5     16384     DEFAULT  thrpt   15    43.671 ±  0.008  ops/ms
MessageDigests.getAndDigest             md5        64     DEFAULT  thrpt   15  4583.530 ± 12.752  ops/ms
MessageDigests.getAndDigest             md5     16384     DEFAULT  thrpt   15    43.545 ±  0.006  ops/ms

@eastig
Member

eastig commented Nov 15, 2022

@luhenry, @vnkozlov
Sorry for the uninformative PR description.

In the MD5 intrinsic stub we use a 3-operand LEA, and this LEA is on the critical path.

The optimization is done according to the Intel 64 and IA-32 Architectures Optimization Reference Manual (Feb 2022), 3.5.1.2:

In Sandy Bridge microarchitecture, there are two significant changes to the performance characteristics of LEA instruction:
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better.

According to Agner Fog's instruction tables (https://www.agner.org/optimize/instruction_tables.pdf), ADD has had latency 1 and throughput 4 since Haswell, while on Ice Lake LEA performance was improved to latency 1 and throughput 2. This explains why there is no improvement there.

The patch's correctness was tested with TestMD5Intrinsics and TestMD5MultiBlockIntrinsics.
The microbenchmark we used:

import org.apache.commons.lang3.RandomStringUtils; // third-party dependency: Apache Commons Lang

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.BenchmarkParams;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MD5Benchmark {

    private static final int MAX_INPUTS_COUNT = 1000;
    private static final int MAX_INPUT_LENGTH = 128 * 1024;
    private static List<byte[]> inputs;

    static {
        inputs = new ArrayList<>();
        IntStream.rangeClosed(1, MAX_INPUTS_COUNT).forEach(value -> inputs.add(RandomStringUtils.randomAlphabetic(MAX_INPUT_LENGTH).getBytes(StandardCharsets.UTF_8)));
    }

    @Param({"64", "128", "256", "512", "1024", "2048", "4096", "8192", "16384", "32768", "65536", "131072"})
    private int data_len;

    @State(Scope.Thread)
    public static class InputData {
        byte[] data;
        int count;
        byte[] expectedDigest;
        byte[] digest;

        @Setup
        public void setup(BenchmarkParams params) {
            data = inputs.get(ThreadLocalRandom.current().nextInt(0, MAX_INPUTS_COUNT));
            count = Integer.parseInt(params.getParam("data_len"));
            expectedDigest = calculateMD5Checksum(data, count);
        }

        @TearDown
        public void check() {
            if (!Arrays.equals(expectedDigest, digest)) {
                throw new RuntimeException("Expected md5 digest:\n" + Arrays.toString(expectedDigest) +
                                           "\nGot:\n" + Arrays.toString(digest));
            }
        }
    }

    @Benchmark
    public void testMD5(InputData in) {
        in.digest = calculateMD5Checksum(in.data, in.count);
    }

    private static byte[] calculateMD5Checksum(byte[] input, int count) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(input, 0, count);
            return md5.digest();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
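As a quick sanity check of the code path this intrinsic accelerates, a minimal, self-contained example (not part of the PR) can hash the RFC 1321 test vector "abc" through the same `MessageDigest` API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class Md5Sanity {
    public static void main(String[] args) throws Exception {
        // MessageDigest.digest() is the call that, once the method is
        // compiled, can be served by the HotSpot MD5 intrinsic.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(HexFormat.of().formatHex(digest));
        // prints 900150983cd24fb0d6963f7d28e17f72 (RFC 1321 test vector)
    }
}
```

The digest must be identical with the intrinsic enabled or disabled, which is what the correctness tests above verify.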

Contributor

@vnkozlov vnkozlov left a comment

Thank you all for providing performance data. Looks good. I will run testing.

@openjdk

openjdk bot commented Nov 15, 2022

@yftsai This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8296548: Improve MD5 intrinsic for x86_64

Reviewed-by: kvn, sviswanathan, luhenry

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@vnkozlov
Contributor

Do we have other intrinsics which use LEA (not for this fix)?

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 15, 2022
@vnkozlov
Contributor

@yftsai can you merge the latest JDK sources? Some of the GHA testing failures should be fixed.

@jnimeh
Member

jnimeh commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic).

@eastig
Member

eastig commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

I have plans to look at other uses of LEA in Hotspot. I have not started yet due to other urgent work.

@eastig
Member

eastig commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

My pending ChaCha20 intrinsics ( #7702 ) use LEA for getting the address of constant data to be loaded into SIMD registers. That happens before the 10-iteration loop that implements the 20 rounds (which is the critical section of the intrinsic).

From #7702, I see they are not 3-operand LEAs. No need to change them.

@sviswa7

sviswa7 commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.
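For illustration only, a stub generator could use those checks to pick between the two forms. A rough HotSpot-style sketch (not code from this PR; register and temp names are placeholders):

```cpp
// Hypothetical sketch: choose LEA vs. ADDs based on the CPU's LEA latency.
if (VM_Version::supports_fast_3op_lea()) {
  __ leal(r1, Address(r1, rsi, Address::times_1, t));  // fast 3-operand LEA
} else {
  __ addl(r1, t);    // two dependent ADDs avoid the 3-cycle LEA
  __ addl(r1, rsi);  // penalty on pre-Ice Lake cores
}
```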

@vnkozlov
Contributor

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.

Thank you, @sviswa7.

For this fix, based on the Ice Lake data provided by @yftsai, the potential benefit of supports_fast_3op_lea() is not enough to justify increasing the code complexity. Maybe in other places it would be more useful, but not here IMHO.

@sviswa7

sviswa7 commented Nov 15, 2022

Do we have other intrinsics which use LEA (not for this fix)?

There are VM_Version::supports_fast_2op_lea() and VM_Version::supports_fast_3op_lea() checks available, which are used to guard LEA optimizations.

Thank you, @sviswa7.

For this fix, based on the Ice Lake data provided by @yftsai, the potential benefit of supports_fast_3op_lea() is not enough to justify increasing the code complexity. Maybe in other places it would be more useful, but not here IMHO.

Yes, I agree. The PR looks good to me.

@vnkozlov
Contributor

My testing passed.

@yftsai
Contributor Author

yftsai commented Nov 16, 2022

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Nov 16, 2022
@openjdk

openjdk bot commented Nov 16, 2022

@yftsai
Your change (at version be07b34) is now ready to be sponsored by a Committer.

@jatin-bhateja
Member

/sponsor

@openjdk

openjdk bot commented Nov 16, 2022

Going to push as commit 6ead2b0.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 16, 2022
@openjdk openjdk bot closed this Nov 16, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Nov 16, 2022
@openjdk

openjdk bot commented Nov 16, 2022

@jatin-bhateja @yftsai Pushed as commit 6ead2b0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@yftsai yftsai deleted the JDK-8296548 branch November 16, 2022 21:12