-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8294248: Use less limbs for P256 in EC implementation #10398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back xuelei! A progress list of the required criteria for merging this PR into |
@XueleiFan The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
Can you prove that this is correct?
|
Thank you for the feedback, Daniel.
That would depend on the implementation details. The JDK testing passed, although it does not means there is no problem. I will try if I could describe the implementation logic more clear tomorrow. The following is just a quick reply for what I can see now for the case you described.
For the multiply operation, I guess the result should have been carry reduced and fix in the limbs and sizes. In the generated IntegerPolynomialP256.java, the code looks like:
carryReduce() is called, and the result 'r' is a set of limbs fit on the limbs size, I guess.
In the description of IntegerPolynomial.java, it is said that "Limb values will always fit within a long, so inputs to multiplication must be less than 32 bits. All IntegerPolynomial implementations allow at most one addition before multiplication. Additions after that will result in an ArithmeticException." Did you see any code that use more than one addition before the multiplication? Or did you refer to something else with "multiply limbs" other than the description in the IntegerPolynomial class? |
The highlighted part of the comment is incorrect; IntegerPolynomial allows at most 2^maxAdds-1 additions, and maxAdds in your patch is 2:
see the implementation of add (removed the irrelevant stuff):
if you change maxAdds to 1, you'll start getting exceptions:
which means that we perform at least 2 additions without doing a reduce (see ECOperations.java:375). So, we can multiply 29+2 = 31 bit limbs. Then in multiplication you have:
addition of 9 multiplication results; 31+31+lg 9 > 64. |
/label -build |
@magicus |
@djelinski I got your point now. The scalar multiplication is carefully coded so that at most 2 additions are allowed. If 2+ is required, reducing is used in the code (see the use of setReduced() in ECOperations). The use of maxAdd may be fragile unless more checking get introduced to detect issues like integer overflow. Alternatively, the additions implementation could be updated to take care of the overflow internally. I will see if there is solution that is simple and effective. Thanks! |
I check with the real bytes for the following computation. For Secp256R1, the p is defined.
Let's use a full 256 bits integer for following computation, which is bigger than p:
With 29 bit limbs, it could be expressed in 9 29-bit words in little-endian order, as the following:
From a[0] to a[7], 29 bits for each limbs, and a[8] is 24 bits. 29 * 8 + 24 = 256
after 2 additions, the 9-29 bits words are:
From a[0] to a[7], 31 bits for each limbs, and a[8] is 26 bits. This is the biggest number we could get from the finite filed computation.
Let's multiply two limbs:
The overflow we cared about is the following computation, the longest additions.
Now we have a[0]-a[7] are 31-bit words, and b[0]-b[7] are 31-bit words. a[8] and b[8] is 26-bit words. Then, we know that (a[0] * b[8]) and (a[8] * b[0]) are multiplying between 31-bits words and 26-bits words. So we have the equivalent computation for the a and b values above:
c8 fills the 64 bits fully, but it does not overflow. @djelinski does it sound right to you? Although there is not overflow if I get the above computation right, it is still not a straightforward thing we can know without detailed computation. Instead, this kind of computation and checking could be performed in FieldGen code automatically, or just reduce limbs operations result (up to no more than twice the modulus) and remove the limit of maxAdd. This RFE, JDK-8294248, is one of a few performance improvement of the EC implementation in JDK (JDK-8294188). I would like to address the checking or/and reducing improvement in a separately RFE. |
@XueleiFan as far as I can tell, we cannot assume that the starting number will be 256 bit; the If we start with 29 ones in every limb, the computation will overflow.
Both options are worth exploring. Thanks for working on this! |
Thank you @djelinski, I missed this point. I tried to prototype and allow only one addition, and use reduce if necessary, and get the following benchmarking numbers:
The improvement looks positive to me (15%-20% improvement). It implies that it may be performance friendly by reducing or semi-reducing limbs operations rather than using maxAdd controlling. However, as the update would impact more curves in JDK, I would think it twice if it is good to start from it or not. |
Close this PR for now. I may re-open it again when the long overflow issues get addressed. |
Here is the latest benchmark numbers, after the integration of #10624, #10544 and #10893. with this patch:
without this patch:
The performance improvement is about 15%. Comparing to the numbers in the PR description, the performance improvement is about 240% with the update for this PR, #10624, #10544 and #10893. |
Please add a test that verifies that the worst case calculation still produces correct results. That is:
|
It makes senses to me. I would like to have an improvement in FieldGen.java instead, so that no illegal params could be set. Let's see if I could make it before integration of this patch. |
The FieldGen.java improvement is not a small patch. It's on my plan, but it may be not something right for JDK 20 as the deadline is approaching. I will see if it is possible for JDK 21. Close this PR for now. I will open it again if the FieldGen.java get improved. @djelinski Thank you for your time and extremely helpful feedback. |
Hi,
Please review this performance improvement for Secp256R1 implementation in OpenJDK. With this update, there is an about 20% performance improvement for Secp256R1 key generation and signature.
Basically, 256 bits EC curves could use 9 integer limbs for the computation. The current implementation use 10 limbs instead. By reducing the number of limbs, the implementation could benefit from less integer computation (add/sub/multiply/square/inverse/mod/pow, etc), and thus improve the performance.
Here are the benchmark numbers without the patch:
And here are the numbers with the patch applied:
Thanks,
Xuelei
Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10398/head:pull/10398
$ git checkout pull/10398
Update a local copy of the PR:
$ git checkout pull/10398
$ git pull https://git.openjdk.org/jdk pull/10398/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10398
View PR using the GUI difftool:
$ git pr show -t 10398
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10398.diff