Skip to content

8294248: Use less limbs for P256 in EC implementation #10398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

XueleiFan
Copy link
Member

@XueleiFan XueleiFan commented Sep 22, 2022

Hi,

Please review this performance improvement for Secp256R1 implementation in OpenJDK. With this update, there is an about 20% performance improvement for Secp256R1 key generation and signature.

Basically, 256 bits EC curves could use 9 integer limbs for the computation. The current implementation use 10 limbs instead. By reducing the number of limbs, the implementation could benefit from less integer computation (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the performance.

Here are the benchmark numbers without the patch:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
Signatures.sign               64  thrpt   15  1.414 ± 0.022  ops/ms
Signatures.sign              512  thrpt   15  1.418 ± 0.004  ops/ms
Signatures.sign             2048  thrpt   15  1.419 ± 0.005  ops/ms
Signatures.sign            16384  thrpt   15  1.395 ± 0.003  ops/ms

KeyGenerators.keyPairGen          thrpt   15  1.475 ± 0.043  ops/ms

And here are the numbers with the patch applied:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
ECSignature.sign               64  thrpt   15  1.719 ± 0.010  ops/ms
ECSignature.sign              512  thrpt   15  1.704 ± 0.012  ops/ms
ECSignature.sign             2048  thrpt   15  1.699 ± 0.018  ops/ms
ECSignature.sign            16384  thrpt   15  1.681 ± 0.006  ops/ms

KeyGenerators.keyPairGen           thrpt   15  1.881 ± 0.008  ops/ms

Thanks,
Xuelei


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8294248: Use less limbs for P256 in EC implementation

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10398/head:pull/10398
$ git checkout pull/10398

Update a local copy of the PR:
$ git checkout pull/10398
$ git pull https://git.openjdk.org/jdk pull/10398/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10398

View PR using the GUI difftool:
$ git pr show -t 10398

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10398.diff

Sorry, something went wrong.

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 22, 2022

👋 Welcome back xuelei! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Sep 22, 2022

@XueleiFan The following labels will be automatically applied to this pull request:

  • build
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added security security-dev@openjdk.org build build-dev@openjdk.org labels Sep 22, 2022
@XueleiFan XueleiFan marked this pull request as ready for review September 22, 2022 21:49
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 22, 2022
@mlbridge
Copy link

mlbridge bot commented Sep 22, 2022

Webrevs

@djelinski
Copy link
Member

Can you prove that this is correct?
My back-of-the-envelope calculations suggest that this may overflow:

  • starting with 29 bit limbs
  • after 2 additions (maxAdds) we have up to 31 bits
  • then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
  • and we didn't start reducing yet.

@XueleiFan
Copy link
Member Author

XueleiFan commented Sep 23, 2022

Thank you for the feedback, Daniel.

Can you prove that this is correct?

That would depend on the implementation details. The JDK testing passed, although it does not means there is no problem. I will try if I could describe the implementation logic more clear tomorrow. The following is just a quick reply for what I can see now for the case you described.

My back-of-the-envelope calculations suggest that this may overflow:

  • starting with 29 bit limbs
  • after 2 additions (maxAdds) we have up to 31 bits
  • then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
  • and we didn't start reducing yet.

For the multiply operation, I guess the result should have been carry reduced and fix in the limbs and sizes. In the generated IntegerPolynomialP256.java, the code looks like:

    protected void mult(long[] a, long[] b, long[] r) {
        ... snipped ...
        long c16 = (a[8] * b[8]);

        carryReduce(r, c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16);
    }

carryReduce() is called, and the result 'r' is a set of limbs fit on the limbs size, I guess.

private void carryReduce(...) {
    ... snipped ...
        //carry from position 7
        t0 = (c7 + CARRY_ADD) >> 29;
        c7 -= (t0 << 29);
        c8 += t0;
    ... snipped ...
}

In the description of IntegerPolynomial.java, it is said that "Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException."

Did you see any code that use more than one addition before the multiplication? Or did you refer to something else with "multiply limbs" other than the description in the IntegerPolynomial class?

@djelinski
Copy link
Member

Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException.

The highlighted part of the comment is incorrect; IntegerPolynomial allows at most 2^maxAdds-1 additions, and maxAdds in your patch is 2:

static FieldParams P256 = new FieldParams(
            "IntegerPolynomialP256", 29, 9, 2 /*maxAdds*/, 256,

see the implementation of add (removed the irrelevant stuff):

        protected boolean isSummand() {
            return numAdds < maxAdds;
        }
        public ImmutableElement add(IntegerModuloP genB) {
            if (!(isSummand() && b.isSummand())) {
                throw new ArithmeticException("Not a valid summand");
            }
            int newNumAdds = Math.max(numAdds, b.numAdds) + 1;
        }

if you change maxAdds to 1, you'll start getting exceptions:

java.lang.ArithmeticException: Not a valid summand
        at java.base/sun.security.util.math.intpoly.IntegerPolynomial$MutableElement.setDifference(IntegerPolynomial.java:731)
        at java.base/sun.security.util.math.intpoly.IntegerPolynomial$MutableElement.setDifference(IntegerPolynomial.java:630)
        at jdk.crypto.ec/sun.security.ec.ECOperations.setSum(ECOperations.java:375)
        at jdk.crypto.ec/sun.security.ec.ECOperations.multiply(ECOperations.java:261)

which means that we perform at least 2 additions without doing a reduce (see ECOperations.java:375).

So, we can multiply 29+2 = 31 bit limbs. Then in multiplication you have:

    protected void mult(long[] a, long[] b, long[] r) {
        long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);

addition of 9 multiplication results; 31+31+lg 9 > 64.

@magicus
Copy link
Member

magicus commented Sep 23, 2022

/label -build

@openjdk openjdk bot removed the build build-dev@openjdk.org label Sep 23, 2022
@openjdk
Copy link

openjdk bot commented Sep 23, 2022

@magicus
The build label was successfully removed.

@XueleiFan
Copy link
Member Author

Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException.

The highlighted part of the comment is incorrect;

@djelinski I got your point now. The scalar multiplication is carefully coded so that at most 2 additions are allowed. If 2+ is required, reducing is used in the code (see the use of setReduced() in ECOperations).

The use of maxAdd may be fragile unless more checking get introduced to detect issues like integer overflow. Alternatively, the additions implementation could be updated to take care of the overflow internally. I will see if there is solution that is simple and effective.

Thanks!

@XueleiFan
Copy link
Member Author

I check with the real bytes for the following computation. For Secp256R1, the p is defined.

p = FFFFFFFF 00000001 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
  = 2^224(2^32 −1) + 2^192 + 2^96 − 1
  = 2^256 - 2^224 + 2^192 + 2^96 − 1

Let's use a full 256 bits integer for following computation, which is bigger than p:

FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

starting with 29 bit limbs

With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian order, as the following:

a[] = 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 1FFFFFFF 00FFFFFF 

From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits. 29 * 8 + 24 = 256

after 2 additions (maxAdds) we have up to 31 bits

after 2 additions, the 9-29 bits words are:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF

From a[0] to a[7], 31 bits for each limbs, and a[8] is 26 bits. This is the biggest number we could get from the finite filed computation.

then we multiply limbs (62 bits) and add up to numLimbs (9) results together = 65 bits, which overflows long
and we didn't start reducing yet.

Let's multiply two limbs:

a[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF
b[] = 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 7FFFFFFF 03FFFFFF 

The overflow we cared about is the following computation, the longest additions.

        long c8 = (a[0] * b[8]) + (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) + (a[8] * b[0]);

Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] and b[8] is 26-bit words.

Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between 31-bits words and 26-bits words. So we have the equivalent computation for the a and b values above:

a[1]=a[2]=a[3]=a[4]=a[5]=a[6]=a[7]=0x7FFFFFFF;
b[1]=b[2]=b[3]=b[4]=b[5]=b[6]=b[7]=0x7FFFFFFF;
long i1x7 = (a[1] * b[7]) + (a[2] * b[6]) + (a[3] * b[5]) + (a[4] * b[4]) + (a[5] * b[3]) + (a[6] * b[2]) + (a[7] * b[1]) 
              = 0xBFFFFFF900000007L
long i0x8 =  (a[0] * b[8]) +  (a[8] * b[0]);
              = 0x03FFFFFEF8000002L

long c8 = i1x7 + i0x8
             = 0xC3FFFFF7F8000009L

c8 fills the 64 bits fully, but it does not overflow.

@djelinski does it sound right to you?

Although there is not overflow if I get the above computation right, it is still not a straightforward thing we can know without detailed computation. Instead, this kind of computation and checking could be performed in FieldGen code automatically, or just reduce limbs operations result (up to no more than twice the modulus) and remove the limit of maxAdd.

This RFE, JDK-8294248, is one of a few performance improvement of the EC implementation in JDK (JDK-8294188). I would like to address the checking or/and reducing improvement in a separately RFE.

@djelinski
Copy link
Member

@XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the setReduced operation performs a limited reduce that only ensures that limbs fit in bitsPerLimb; the carry from the last limb is only performed in finalCarryReduceLast.

If we start with 29 ones in every limb, the computation will overflow.

Instead, this kind of computation and checking could be performed in FieldGen code automatically, or just reduce limbs operations result (up to no more than twice the modulus) and remove the limit of maxAdd.

Both options are worth exploring.

Thanks for working on this!

@XueleiFan
Copy link
Member Author

@XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the setReduced operation performs a limited reduce that only ensures that limbs fit in bitsPerLimb; the carry from the last limb is only performed in finalCarryReduceLast.

Thank you @djelinski, I missed this point.

I tried to prototype and allow only one addition, and use reduce if necessary, and get the following benchmarking numbers:

Benchmark         (messageLength)   Mode  Cnt  Score   Error   Units
Signatures.sign               64  thrpt   15  1.629 ± 0.008  ops/ms
Signatures.sign              512  thrpt   15  1.628 ± 0.007  ops/ms
Signatures.sign             2048  thrpt   15  1.625 ± 0.006  ops/ms
Signatures.sign            16384  thrpt   15  1.588 ± 0.014  ops/ms
ECKeyGen.keyPairGen               thrpt   15  1.779 ±(99.9%) 0.007 ops/ms

The improvement looks positive to me (15%-20% improvement). It implies that it may be performance friendly by reducing or semi-reducing limbs operations rather than using maxAdd controlling. However, as the update would impact more curves in JDK, I would think it twice if it is good to start from it or not.

@XueleiFan
Copy link
Member Author

Close this PR for now. I may re-open it again when the long overflow issues get addressed.

@XueleiFan XueleiFan closed this Sep 25, 2022
@XueleiFan XueleiFan reopened this Oct 17, 2022
@XueleiFan XueleiFan marked this pull request as draft October 17, 2022 05:30
@openjdk openjdk bot removed the rfr Pull request is ready for review label Oct 17, 2022
@XueleiFan
Copy link
Member Author

XueleiFan commented Nov 29, 2022

I may re-open it again when the long overflow issues get addressed.

The issue has been addressed in PR 10624. Please review this update.

I will post the new benchmark numbers soon, for the latest performance after integration of PR 10624, PR 10544 and PR 10893.

@XueleiFan XueleiFan marked this pull request as ready for review November 29, 2022 19:00
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 29, 2022
@XueleiFan
Copy link
Member Author

Here is the latest benchmark numbers, after the integration of #10624, #10544 and #10893.

with this patch:

Benchmark                    (algorithm)  (messageLength)   Mode  Cnt     Score    Error  Units
Signatures.sign                secp256r1               64  thrpt   15  4767.902 ± 26.834  ops/s
Signatures.sign                secp256r1              512  thrpt   15  4755.802 ± 41.800  ops/s
Signatures.sign                secp256r1             2048  thrpt   15  4728.560 ± 55.913  ops/s
Signatures.sign                secp256r1            16384  thrpt   15  4488.473 ± 60.325  ops/s

without this patch:

Benchmark                    (algorithm)  (messageLength)   Mode  Cnt     Score    Error  Units
Signatures.sign                secp256r1               64  thrpt   15  4089.026 ± 22.034  ops/s
Signatures.sign                secp256r1              512  thrpt   15  4081.396 ± 25.416  ops/s
Signatures.sign                secp256r1             2048  thrpt   15  4080.277 ± 24.239  ops/s
Signatures.sign                secp256r1            16384  thrpt   15  3926.398 ± 14.582  ops/s

The performance improvement is about 15%.

Comparing to the numbers in the PR description, the performance improvement is about 240% with the update for this PR, #10624, #10544 and #10893.

@djelinski
Copy link
Member

Please add a test that verifies that the worst case calculation still produces correct results. That is:

  • build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)
  • sum that number with itself until numAdds = maxAdds
  • square the result
  • compare the result with the same calculations on BigInteger

@XueleiFan
Copy link
Member Author

Please add a test that verifies that the worst case calculation still produces correct results. That is:

  • build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)
  • sum that number with itself until numAdds = maxAdds
  • square the result
  • compare the result with the same calculations on BigInteger

It makes senses to me. I would like to have an improvement in FieldGen.java instead, so that no illegal params could be set. Let's see if I could make it before integration of this patch.

@XueleiFan
Copy link
Member Author

Please add a test that verifies that the worst case calculation still produces correct results. That is:

  • build a number where the limb values are as high as possible (2^(numLimbs*bitsPerLimb)-1, or something close)
  • sum that number with itself until numAdds = maxAdds
  • square the result
  • compare the result with the same calculations on BigInteger

It makes senses to me. I would like to have an improvement in FieldGen.java instead, so that no illegal params could be set. Let's see if I could make it before integration of this patch.

The FieldGen.java improvement is not a small patch. It's on my plan, but it may be not something right for JDK 20 as the deadline is approaching. I will see if it is possible for JDK 21. Close this PR for now. I will open it again if the FieldGen.java get improved. @djelinski Thank you for your time and extremely helpful feedback.

@XueleiFan XueleiFan closed this Dec 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfr Pull request is ready for review security security-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

None yet

3 participants