8294248: Use less limbs for P256 in EC implementation #10398

XueleiFan · 2022-09-22T20:40:08Z

Hi,

Please review this performance improvement for Secp256R1 implementation in OpenJDK. With this update, there is an about 20% performance improvement for Secp256R1 key generation and signature.

Basically, 256 bits EC curves could use 9 integer limbs for the computation. The current implementation use 10 limbs instead. By reducing the number of limbs, the implementation could benefit from less integer computation (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the performance.

Here are the benchmark numbers without the patch:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
Signatures.sign               64  thrpt   15  1.414 ± 0.022  ops/ms
Signatures.sign              512  thrpt   15  1.418 ± 0.004  ops/ms
Signatures.sign             2048  thrpt   15  1.419 ± 0.005  ops/ms
Signatures.sign            16384  thrpt   15  1.395 ± 0.003  ops/ms

KeyGenerators.keyPairGen          thrpt   15  1.475 ± 0.043  ops/ms

And here are the numbers with the patch applied:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
ECSignature.sign               64  thrpt   15  1.719 ± 0.010  ops/ms
ECSignature.sign              512  thrpt   15  1.704 ± 0.012  ops/ms
ECSignature.sign             2048  thrpt   15  1.699 ± 0.018  ops/ms
ECSignature.sign            16384  thrpt   15  1.681 ± 0.006  ops/ms

KeyGenerators.keyPairGen           thrpt   15  1.881 ± 0.008  ops/ms

Thanks,
Xuelei

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8294248: Use less limbs for P256 in EC implementation

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10398/head:pull/10398
$ git checkout pull/10398

Update a local copy of the PR:
$ git checkout pull/10398
$ git pull https://git.openjdk.org/jdk pull/10398/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10398

View PR using the GUI difftool:
$ git pr show -t 10398

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10398.diff

bridgekeeper · 2022-09-22T20:41:20Z

👋 Welcome back xuelei! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-09-22T20:44:01Z

@XueleiFan The following labels will be automatically applied to this pull request:

build
security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2022-09-22T21:57:16Z

Webrevs

djelinski · 2022-09-23T06:13:10Z

Can you prove that this is correct?
My back-of-the-envelope calculations suggest that this may overflow:

starting with 29 bit limbs
after 2 additions (maxAdds) we have up to 31 bits
then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
and we didn't start reducing yet.

XueleiFan · 2022-09-23T07:39:07Z

Thank you for the feedback, Daniel.

Can you prove that this is correct?

That would depend on the implementation details. The JDK testing passed, although it does not means there is no problem. I will try if I could describe the implementation logic more clear tomorrow. The following is just a quick reply for what I can see now for the case you described.

My back-of-the-envelope calculations suggest that this may overflow:

starting with 29 bit limbs

after 2 additions (maxAdds) we have up to 31 bits

then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long

and we didn't start reducing yet.

For the multiply operation, I guess the result should have been carry reduced and fix in the limbs and sizes. In the generated IntegerPolynomialP256.java, the code looks like:

    protected void mult(long[] a, long[] b, long[] r) {
        ... snipped ...
        long c16 = (a[8] * b[8]);

        carryReduce(r, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16);
    }

carryReduce() is called, and the result 'r' is a set of limbs fit on the limbs size, I guess.

private void carryReduce(...) {
    ... snipped ...
        //carry from position 7
        t0 = (c7 + CARRY_ADD) >> 29;
        c7 -= (t0 << 29);
        c8 += t0;
    ... snipped ...
}

In the description of IntegerPolynomial.java, it is said that "Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException."

Did you see any code that use more than one addition before the multiplication? Or did you refer to something else with "multiply limbs" other than the description in the IntegerPolynomial class?

djelinski · 2022-09-23T09:20:19Z

Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException.

The highlighted part of the comment is incorrect; IntegerPolynomial allows at most 2^maxAdds-1 additions, and maxAdds in your patch is 2:

static FieldParams P256 = new FieldParams(
            "IntegerPolynomialP256", 29, 9, 2 /*maxAdds*/, 256,

see the implementation of add (removed the irrelevant stuff):

        protected boolean isSummand() {
            return numAdds < maxAdds;
        }
        public ImmutableElement add(IntegerModuloP genB) {
            if (!(isSummand() && b.isSummand())) {
                throw new ArithmeticException("Not a valid summand");
            }
            int newNumAdds = Math.max(numAdds, b.numAdds) + 1;
        }

if you change maxAdds to 1, you'll start getting exceptions:

java.lang.ArithmeticException: Not a valid summand
        at java.base/sun.security.util.math.intpoly.IntegerPolynomial$MutableElement.setDifference(IntegerPolynomial.java:731)
        at java.base/sun.security.util.math.intpoly.IntegerPolynomial$MutableElement.setDifference(IntegerPolynomial.java:630)
        at jdk.crypto.ec/sun.security.ec.ECOperations.setSum(ECOperations.java:375)
        at jdk.crypto.ec/sun.security.ec.ECOperations.multiply(ECOperations.java:261)

which means that we perform at least 2 additions without doing a reduce (see ECOperations.java:375).

So, we can multiply 29+2 = 31 bit limbs. Then in multiplication you have:

    protected void mult(long[] a, long[] b, long[] r) {
        long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);

addition of 9 multiplication results; 31+31+lg 9 > 64.

magicus · 2022-09-23T12:08:04Z

/label -build

openjdk · 2022-09-23T12:18:56Z

@magicus
The build label was successfully removed.

XueleiFan · 2022-09-23T22:03:02Z

Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException.

The highlighted part of the comment is incorrect;

@djelinski I got your point now. The scalar multiplication is carefully coded so that at most 2 additions are allowed. If 2+ is required, reducing is used in the code (see the use of setReduced() in ECOperations).

The use of maxAdd may be fragile unless more checking get introduced to detect issues like integer overflow. Alternatively, the additions implementation could be updated to take care of the overflow internally. I will see if there is solution that is simple and effective.

Thanks!

XueleiFan · 2022-09-24T08:17:56Z

I check with the real bytes for the following computation. For Secp256R1, the p is defined.

p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
  = 2^224(2^32 −1) + 2^192 + 2^96 − 1
  = 2^256 - 2^224 + 2^192 + 2^96 − 1

Let's use a full 256 bits integer for following computation, which is bigger than p:

FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

starting with 29 bit limbs

With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian order, as the following:

a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 00FFFFFF

From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits. 29 * 8 + 24 = 256

after 2 additions (maxAdds) we have up to 31 bits

after 2 additions, the 9-29 bits words are:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF

From a[0] to a[7], 31 bits for each limbs, and a[8] is 26 bits. This is the biggest number we could get from the finite filed computation.

then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
and we didn't start reducing yet.

Let's multiply two limbs:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF
b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF

The overflow we cared about is the following computation, the longest additions.

        long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);

Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] and b[8] is 26-bit words.

Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between 31-bits words and 26-bits words. So we have the equivalent computation for the a and b values above:

a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF;
b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF;
long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) 
              = 0xBFFFFFF900000007L
long i0x8 =  (a[0] * b[8]) +  (a[8] * b[0]);
              = 0x03FFFFFEF8000002L

long c8 = i1x7 + i0x8
             = 0xC3FFFFF7F8000009L

c8 fills the 64 bits fully, but it does not overflow.

@djelinski does it sound right to you?

Although there is not overflow if I get the above computation right, it is still not a straightforward thing we can know without detailed computation. Instead, this kind of computation and checking could be performed in FieldGen code automatically, or just reduce limbs operations result (up to no more than twice the modulus) and remove the limit of maxAdd.

This RFE, JDK-8294248, is one of a few performance improvement of the EC implementation in JDK (JDK-8294188). I would like to address the checking or/and reducing improvement in a separately RFE.

djelinski · 2022-09-24T09:36:06Z

@XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the setReduced operation performs a limited reduce that only ensures that limbs fit in bitsPerLimb; the carry from the last limb is only performed in finalCarryReduceLast.

If we start with 29 ones in every limb, the computation will overflow.

Instead, this kind of computation and checking could be performed in FieldGen code automatically, or just reduce limbs operations result (up to no more than twice the modulus) and remove the limit of maxAdd.

Both options are worth exploring.

Thanks for working on this!

XueleiFan · 2022-09-25T05:49:41Z

@XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the setReduced operation performs a limited reduce that only ensures that limbs fit in bitsPerLimb; the carry from the last limb is only performed in finalCarryReduceLast.

Thank you @djelinski, I missed this point.

I tried to prototype and allow only one addition, and use reduce if necessary, and get the following benchmarking numbers:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
Signatures.sign               64  thrpt   15  1.629 ± 0.008  ops/ms
Signatures.sign              512  thrpt   15  1.628 ± 0.007  ops/ms
Signatures.sign             2048  thrpt   15  1.625 ± 0.006  ops/ms
Signatures.sign            16384  thrpt   15  1.588 ± 0.014  ops/ms
ECKeyGen.keyPairGen               thrpt   15  1.779 ±(99.9%) 0.007 ops/ms

The improvement looks positive to me (15%-20% improvement). It implies that it may be performance friendly by reducing or semi-reducing limbs operations rather than using maxAdd controlling. However, as the update would impact more curves in JDK, I would think it twice if it is good to start from it or not.

XueleiFan · 2022-09-25T05:52:01Z

Close this PR for now. I may re-open it again when the long overflow issues get addressed.

XueleiFan · 2022-11-29T19:00:52Z

I may re-open it again when the long overflow issues get addressed.

The issue has been addressed in PR 10624. Please review this update.

I will post the new benchmark numbers soon, for the latest performance after integration of PR 10624, PR 10544 and PR 10893.

XueleiFan · 2022-11-29T20:44:42Z

Here is the latest benchmark numbers, after the integration of #10624, #10544 and #10893.

with this patch:

Benchmark                    (algorithm)  (messageLength)   Mode  Cnt     Score    Error  Units
Signatures.sign                secp256r1               64  thrpt   15  4767.902 ± 26.834  ops/s
Signatures.sign                secp256r1              512  thrpt   15  4755.802 ± 41.800  ops/s
Signatures.sign                secp256r1             2048  thrpt   15  4728.560 ± 55.913  ops/s
Signatures.sign                secp256r1            16384  thrpt   15  4488.473 ± 60.325  ops/s

without this patch:

Benchmark                    (algorithm)  (messageLength)   Mode  Cnt     Score    Error  Units
Signatures.sign                secp256r1               64  thrpt   15  4089.026 ± 22.034  ops/s
Signatures.sign                secp256r1              512  thrpt   15  4081.396 ± 25.416  ops/s
Signatures.sign                secp256r1             2048  thrpt   15  4080.277 ± 24.239  ops/s
Signatures.sign                secp256r1            16384  thrpt   15  3926.398 ± 14.582  ops/s

The performance improvement is about 15%.

Comparing to the numbers in the PR description, the performance improvement is about 240% with the update for this PR, #10624, #10544 and #10893.

djelinski · 2022-11-30T07:46:19Z

Please add a test that verifies that the worst case calculation still produces correct results. That is:

build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)
sum that number with itself until numAdds = maxAdds
square the result
compare the result with the same calculations on BigInteger

XueleiFan · 2022-11-30T19:44:00Z

Please add a test that verifies that the worst case calculation still produces correct results. That is:

build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)

sum that number with itself until numAdds = maxAdds

square the result

compare the result with the same calculations on BigInteger

It makes senses to me. I would like to have an improvement in FieldGen.java instead, so that no illegal params could be set. Let's see if I could make it before integration of this patch.

XueleiFan · 2022-12-03T05:58:16Z

Please add a test that verifies that the worst case calculation still produces correct results. That is:

build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)

sum that number with itself until numAdds = maxAdds

square the result

compare the result with the same calculations on BigInteger

It makes senses to me. I would like to have an improvement in FieldGen.java instead, so that no illegal params could be set. Let's see if I could make it before integration of this patch.

The FieldGen.java improvement is not a small patch. It's on my plan, but it may be not something right for JDK 20 as the deadline is approaching. I will see if it is possible for JDK 21. Close this PR for now. I will open it again if the FieldGen.java get improved. @djelinski Thank you for your time and extremely helpful feedback.

8294248: Use less limbs for P256 in EC implementation

35cf758

openjdk bot added security build labels Sep 22, 2022

remove tailing white spaces

a5e0f77

XueleiFan marked this pull request as ready for review September 22, 2022 21:49

openjdk bot added the rfr label Sep 22, 2022

openjdk bot removed the build label Sep 23, 2022

XueleiFan closed this Sep 25, 2022

XueleiFan mentioned this pull request Oct 9, 2022

8295010: Reduce if required in EC limbs operations #10624

Closed

3 tasks

XueleiFan reopened this Oct 17, 2022

XueleiFan marked this pull request as draft October 17, 2022 05:30

openjdk bot removed the rfr label Oct 17, 2022

XueleiFan and others added 2 commits November 18, 2022 23:17

remove duplicated bench tests

9b724e9

Merge

a9ec1b0

XueleiFan marked this pull request as ready for review November 29, 2022 19:00

openjdk bot added the rfr label Nov 29, 2022

set maxadds

59e4818

XueleiFan closed this Dec 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8294248: Use less limbs for P256 in EC implementation #10398

8294248: Use less limbs for P256 in EC implementation #10398

XueleiFan commented Sep 22, 2022 •

edited by openjdk bot

Loading

bridgekeeper bot commented Sep 22, 2022

openjdk bot commented Sep 22, 2022

mlbridge bot commented Sep 22, 2022 •

edited

Loading

djelinski commented Sep 23, 2022

XueleiFan commented Sep 23, 2022 •

edited

Loading

djelinski commented Sep 23, 2022

magicus commented Sep 23, 2022

openjdk bot commented Sep 23, 2022

XueleiFan commented Sep 23, 2022

XueleiFan commented Sep 24, 2022

djelinski commented Sep 24, 2022

XueleiFan commented Sep 25, 2022

XueleiFan commented Sep 25, 2022

XueleiFan commented Nov 29, 2022 •

edited

Loading

XueleiFan commented Nov 29, 2022

djelinski commented Nov 30, 2022

XueleiFan commented Nov 30, 2022

XueleiFan commented Dec 3, 2022

8294248: Use less limbs for P256 in EC implementation #10398

8294248: Use less limbs for P256 in EC implementation #10398

Conversation

XueleiFan commented Sep 22, 2022 • edited by openjdk bot Loading

Progress

Issue

Reviewing

bridgekeeper bot commented Sep 22, 2022

openjdk bot commented Sep 22, 2022

mlbridge bot commented Sep 22, 2022 • edited Loading

Webrevs

djelinski commented Sep 23, 2022

XueleiFan commented Sep 23, 2022 • edited Loading

djelinski commented Sep 23, 2022

magicus commented Sep 23, 2022

openjdk bot commented Sep 23, 2022

XueleiFan commented Sep 23, 2022

XueleiFan commented Sep 24, 2022

djelinski commented Sep 24, 2022

XueleiFan commented Sep 25, 2022

XueleiFan commented Sep 25, 2022

XueleiFan commented Nov 29, 2022 • edited Loading

XueleiFan commented Nov 29, 2022

djelinski commented Nov 30, 2022

XueleiFan commented Nov 30, 2022

XueleiFan commented Dec 3, 2022

XueleiFan commented Sep 22, 2022 •

edited by openjdk bot

Loading

mlbridge bot commented Sep 22, 2022 •

edited

Loading

XueleiFan commented Sep 23, 2022 •

edited

Loading

XueleiFan commented Nov 29, 2022 •

edited

Loading