-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8316426: Optimization for HexFormat.formatHex #15768
Conversation
👋 Welcome back wenshao! A progress list of the required criteria for merging this PR into |
What numbers do you get with this? In my experiments the improvement was either negligible or small enough to hardly matter. I also tried flipping bits similar to what you're doing for uppercasing but saw a net regression from that. I opted for simplicity in #15991 - a simple cleanup that removed a few tiny lookup-tables and still led to a decent improvement. Going the other direction seems like the wrong move. |
The performance test results are as follows: 0. sciprt
1. aliyun_ecs_c8i.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baselinie)
-HexFormatBench.appenderLower 512 avgt 15 2.768 ? 0.034 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 2.796 ? 0.042 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.800 ? 0.032 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 2.781 ? 0.018 us/op
-HexFormatBench.formatLower 512 avgt 15 0.544 ? 0.002 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.548 ? 0.004 us/op
-HexFormatBench.formatUpper 512 avgt 15 0.546 ? 0.007 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.550 ? 0.005 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 3.364 ? 0.015 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 3.770 ? 0.017 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 4.990 ? 0.018 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 3.466 ? 0.017 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.415 ? 0.005 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 0.422 ? 0.005 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.413 ? 0.005 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.423 ? 0.004 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.163 ? 0.001 us/op (+1598.16)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.161 ? 0.001 us/op (+1636.65)
+HexFormatBench.appenderUpper 512 avgt 15 0.251 ? 0.023 us/op (+1015.54)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.266 ? 0.001 us/op (+945.49)
+HexFormatBench.formatLower 512 avgt 15 0.275 ? 0.001 us/op (+97.82)
+HexFormatBench.formatLowerCached 512 avgt 15 0.277 ? 0.001 us/op (+97.84)
+HexFormatBench.formatUpper 512 avgt 15 0.285 ? 0.001 us/op (+91.58)
+HexFormatBench.formatUpperCached 512 avgt 15 0.285 ? 0.001 us/op (+92.99)
+HexFormatBench.toHexDigitsByte 512 avgt 15 3.554 ? 0.028 us/op (-5.35)
+HexFormatBench.toHexDigitsInt 512 avgt 15 3.910 ? 0.015 us/op (-3.59)
+HexFormatBench.toHexDigitsLong 512 avgt 15 5.288 ? 0.018 us/op (-5.64)
+HexFormatBench.toHexDigitsShort 512 avgt 15 3.637 ? 0.012 us/op (-4.71)
+HexFormatBench.toHexLower 512 avgt 15 0.445 ? 0.001 us/op (-6.75)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.442 ? 0.001 us/op (-4.53)
+HexFormatBench.toHexUpper 512 avgt 15 0.445 ? 0.001 us/op (-7.20)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.441 ? 0.001 us/op (-4.09) 2. aliyun_ecs_c8y.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-HexFormatBench.appenderLower 512 avgt 15 2.857 ? 0.791 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 2.832 ? 0.758 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.360 ? 0.010 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 2.361 ? 0.013 us/op
-HexFormatBench.formatLower 512 avgt 15 0.947 ? 0.406 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.616 ? 0.002 us/op
-HexFormatBench.formatUpper 512 avgt 15 1.212 ? 0.411 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.616 ? 0.001 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 5.844 ? 0.264 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 7.392 ? 0.207 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 8.068 ? 0.303 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 6.214 ? 0.266 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.926 ? 0.003 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 1.000 ? 0.005 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.927 ? 0.002 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.999 ? 0.020 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.356 ? 0.001 us/op (+702.53)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.356 ? 0.001 us/op (+695.51)
+HexFormatBench.appenderUpper 512 avgt 15 0.304 ? 0.001 us/op (+676.32)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.304 ? 0.001 us/op (+676.65)
+HexFormatBench.formatLower 512 avgt 15 0.461 ? 0.001 us/op (+105.43)
+HexFormatBench.formatLowerCached 512 avgt 15 0.485 ? 0.001 us/op (+27.02)
+HexFormatBench.formatUpper 512 avgt 15 0.644 ? 0.003 us/op (+88.20)
+HexFormatBench.formatUpperCached 512 avgt 15 0.595 ? 0.003 us/op (+3.53)
+HexFormatBench.toHexDigitsByte 512 avgt 15 5.804 ? 0.237 us/op (+0.69)
+HexFormatBench.toHexDigitsInt 512 avgt 15 7.209 ? 0.212 us/op (+2.54)
+HexFormatBench.toHexDigitsLong 512 avgt 15 8.301 ? 0.422 us/op (-2.81)
+HexFormatBench.toHexDigitsShort 512 avgt 15 5.908 ? 0.255 us/op (+5.18)
+HexFormatBench.toHexLower 512 avgt 15 0.494 ? 0.001 us/op (+87.45)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.494 ? 0.001 us/op (+102.43)
+HexFormatBench.toHexUpper 512 avgt 15 0.494 ? 0.001 us/op (+87.66)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.493 ? 0.001 us/op (+102.64) 3. Mac Book Pro M1 Pro-Benchmark (size) Mode Cnt Score Error Units (baseline)
-HexFormatBench.appenderLower 512 avgt 15 2.867 ? 0.035 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 1.656 ? 0.875 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.813 ? 0.085 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 1.575 ? 0.901 us/op
-HexFormatBench.formatLower 512 avgt 15 0.385 ? 0.001 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.385 ? 0.002 us/op
-HexFormatBench.formatUpper 512 avgt 15 0.385 ? 0.001 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.384 ? 0.001 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 1.688 ? 0.009 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 2.991 ? 0.015 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 3.719 ? 0.081 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 1.868 ? 0.010 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.321 ? 0.001 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 0.322 ? 0.001 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.324 ? 0.001 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.325 ? 0.001 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.212 ? 0.003 us/op (+1252.36)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.211 ? 0.001 us/op (+684.84)
+HexFormatBench.appenderUpper 512 avgt 15 0.199 ? 0.002 us/op (+1313.57)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.198 ? 0.001 us/op (+695.46)
+HexFormatBench.formatLower 512 avgt 15 0.221 ? 0.001 us/op (+74.21)
+HexFormatBench.formatLowerCached 512 avgt 15 0.192 ? 0.001 us/op (+100.53)
+HexFormatBench.formatUpper 512 avgt 15 0.317 ? 0.002 us/op (+21.46)
+HexFormatBench.formatUpperCached 512 avgt 15 0.348 ? 0.003 us/op (+10.35)
+HexFormatBench.toHexDigitsByte 512 avgt 15 1.715 ? 0.011 us/op (-1.58)
+HexFormatBench.toHexDigitsInt 512 avgt 15 2.261 ? 0.012 us/op (+32.29)
+HexFormatBench.toHexDigitsLong 512 avgt 15 3.776 ? 0.023 us/op (-1.51)
+HexFormatBench.toHexDigitsShort 512 avgt 15 1.862 ? 0.011 us/op (+0.33)
+HexFormatBench.toHexLower 512 avgt 15 0.289 ? 0.004 us/op (+11.08)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.294 ? 0.002 us/op (+9.53)
+HexFormatBench.toHexUpper 512 avgt 15 0.288 ? 0.001 us/op (+12.50)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.295 ? 0.001 us/op (+10.17) |
The byte[] length used by HexFormatBench is 512. In this scenario, the performance improvement of using table lookup is significant. |
Is this a common use-case? I could see an argument that formatting small chunks is much more common, where users will experience relatively more cache misses from the table lookup. The gain from specializing for |
Results of an experiment to remove the lookup table use from
Most of the win is from having a loop that is easier for the JIT to optimize; the lookup-table impl is faster than no lookup table in these micros, by a factor that is in the same ballpark as your I'm not convinced 30-100% wins on these micro is worth the increase in cache traffic, and would like to see a deeper analysis of the pros and cons of lookup tables before we go ahead with any optimizations that depend on their use. The no_lookup_table.diff: diff --git a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
index 57a16bb769a..70727375c0b 100644
--- a/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
+++ b/src/java.base/share/classes/java/lang/AbstractStringBuilder.java
@@ -1825,6 +1825,15 @@ private final void appendChars(CharSequence s, int off, int end) {
count += end - off;
}
+ private static char toHighHexDigit(boolean ucase, int value) {
+ value = (value >> 4) & 0xf;
+ return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+ }
+ private static char toLowHexDigit(boolean ucase, int value) {
+ value = (value) & 0xf;
+ return (char) ((value < 10 ? '0' : ucase ? 'A' : 'a') + value);
+ }
+
final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {
Preconditions.checkFromToIndex(fromIndex, toIndex, bytes.length, Preconditions.IOOBE_FORMATTER);
@@ -1832,17 +1841,14 @@ final void appendHex(boolean ucase, byte[] bytes, int fromIndex, int toIndex) {
int charPos = count;
for (int i = fromIndex; i < toIndex; ++i) {
- short packed = HexDigits.digitPair(bytes[i], ucase);
+ byte b = bytes[i];
if (isLatin1()) {
- ByteArrayLittleEndian.setShort(
- value,
- charPos,
- packed);
+ StringLatin1.putChar(value, charPos++, toHighHexDigit(ucase, b));
+ StringLatin1.putChar(value, charPos++, toLowHexDigit(ucase, b));
} else {
- StringUTF16.putChar(value, charPos, packed & 0xff);
- StringUTF16.putChar(value, charPos + 1, packed >> 8);
+ StringUTF16.putChar(value, charPos++, toHighHexDigit(ucase, b));
+ StringUTF16.putChar(value, charPos++, toLowHexDigit(ucase, b));
}
- charPos += 2;
}
count = charPos; |
Add internal methods to StringBuilder for performance optimization, I saw that the implementation of JEP 403 String Template does similar things. class AbstractStringBuilder {
long mix(long lengthCoder) { }
long prepend(long lengthCoder, byte[] buffer) {}
// ...
} However, the StringBuilder.appendHex method can have more usage scenarios and can be considered as a public method. Is it necessary to submit a new PR to add these methods? class AbstractStringBuilder {
public void appendHex(byte[] bytes) {}
public void appendHex(byte[] bytes, boolean ucase) {}
public void appendHex(byte[] bytes, int fromIndex, int toIndex) {}
public void appendHex(byte[] bytes, int fromIndex, int toIndex, boolean ucase) {}
} Regarding the performance of using lookup table, I think it makes sense when the length of byte[] is greater than 8. I think that when the length of byte[] is actually used, there is a high probability that it will be greater than 8. Of course, I just said the number 8 casually, it could be 12, or 16. HexDecimal#DIGITS is a table with a size of 512 bytes. I think that in such a table, when it needs to be used continuously, it is worthwhile to perform table lookup operations. |
I don't think A more high-level idea would be to improve said BTW, here's a patch on top of #15768 that simplifies
Same speed. Drawback is that the diff --git a/src/java.base/share/classes/java/util/HexFormat.java b/src/java.base/share/classes/java/util/HexFormat.java
index e1d88e3ceb6..68a15307fe0 100644
--- a/src/java.base/share/classes/java/util/HexFormat.java
+++ b/src/java.base/share/classes/java/util/HexFormat.java
@@ -407,38 +407,7 @@ public <A extends Appendable> A formatHex(A out, byte[] bytes, int fromIndex, in
int length = toIndex - fromIndex;
if (length > 0) {
try {
- boolean prefixEmpty = prefix.isEmpty();
- boolean suffixEmpty = suffix.isEmpty();
- boolean delimiterEmpty = delimiter.isEmpty();
- if (!prefixEmpty) {
- out.append(prefix);
- }
- if (suffixEmpty && delimiterEmpty && prefixEmpty) {
- if (out instanceof StringBuilder sb) {
- jla.appendHex(sb, digitCase == Case.UPPERCASE, bytes, fromIndex, toIndex);
- } else {
- for (int i = 0; i < length; i++) {
- toHexDigits(out, bytes[fromIndex + i]);
- }
- }
- } else {
- toHexDigits(out, bytes[fromIndex]);
- for (int i = 1; i < length; i++) {
- if (!suffixEmpty) {
- out.append(suffix);
- }
- if (!delimiterEmpty) {
- out.append(delimiter);
- }
- if (!prefixEmpty) {
- out.append(prefix);
- }
- toHexDigits(out, bytes[fromIndex + i]);
- }
- }
- if (!suffixEmpty) {
- out.append(suffix);
- }
+ out.append(formatHex(bytes, fromIndex, toIndex));
} catch (IOException ioe) {
throw new UncheckedIOException(ioe.getMessage(), ioe);
} |
I deleted the newly added AbstractBuilder.appendHex method, Such changes are reduced and performance improvements are similar. The new performance test results are as follows: 1. aliyun_ecs_c8i.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baselinie)
-HexFormatBench.appenderLower 512 avgt 15 2.768 ? 0.034 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 2.796 ? 0.042 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.800 ? 0.032 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 2.781 ? 0.018 us/op
-HexFormatBench.formatLower 512 avgt 15 0.544 ? 0.002 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.548 ? 0.004 us/op
-HexFormatBench.formatUpper 512 avgt 15 0.546 ? 0.007 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.550 ? 0.005 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 3.364 ? 0.015 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 3.770 ? 0.017 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 4.990 ? 0.018 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 3.466 ? 0.017 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.415 ? 0.005 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 0.422 ? 0.005 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.413 ? 0.005 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.423 ? 0.004 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.211 ? 0.002 us/op (+1211.85)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.210 ? 0.004 us/op (+1231.43)
+HexFormatBench.appenderUpper 512 avgt 15 0.289 ? 0.002 us/op (+868.86)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.296 ? 0.019 us/op (+839.53)
+HexFormatBench.formatLower 512 avgt 15 0.265 ? 0.001 us/op (+105.29)
+HexFormatBench.formatLowerCached 512 avgt 15 0.267 ? 0.002 us/op (+105.25)
+HexFormatBench.formatUpper 512 avgt 15 0.274 ? 0.002 us/op (+99.28)
+HexFormatBench.formatUpperCached 512 avgt 15 0.286 ? 0.019 us/op (+92.31)
+HexFormatBench.toHexDigitsByte 512 avgt 15 3.351 ? 0.011 us/op (+0.39)
+HexFormatBench.toHexDigitsInt 512 avgt 15 3.708 ? 0.011 us/op (+1.68)
+HexFormatBench.toHexDigitsLong 512 avgt 15 5.051 ? 0.014 us/op (-1.21)
+HexFormatBench.toHexDigitsShort 512 avgt 15 3.456 ? 0.012 us/op (+0.29)
+HexFormatBench.toHexLower 512 avgt 15 0.445 ? 0.001 us/op (-6.75)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.441 ? 0.001 us/op (-4.31)
+HexFormatBench.toHexUpper 512 avgt 15 0.444 ? 0.001 us/op (-6.99)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.441 ? 0.001 us/op (-4.09) 2. aliyun_ecs_c8y.xlarge
-Benchmark (size) Mode Cnt Score Error Units (baseline)
-HexFormatBench.appenderLower 512 avgt 15 2.857 ? 0.791 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 2.832 ? 0.758 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.360 ? 0.010 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 2.361 ? 0.013 us/op
-HexFormatBench.formatLower 512 avgt 15 0.947 ? 0.406 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.616 ? 0.002 us/op
-HexFormatBench.formatUpper 512 avgt 15 1.212 ? 0.411 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.616 ? 0.001 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 5.844 ? 0.264 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 7.392 ? 0.207 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 8.068 ? 0.303 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 6.214 ? 0.266 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.926 ? 0.003 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 1.000 ? 0.005 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.927 ? 0.002 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.999 ? 0.020 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.343 ? 0.001 us/op (+732.95)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.345 ? 0.001 us/op (+720.87)
+HexFormatBench.appenderUpper 512 avgt 15 0.352 ? 0.002 us/op (+570.46)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.349 ? 0.001 us/op (+576.51)
+HexFormatBench.formatLower 512 avgt 15 0.464 ? 0.001 us/op (+104.10)
+HexFormatBench.formatLowerCached 512 avgt 15 0.484 ? 0.002 us/op (+27.28)
+HexFormatBench.formatUpper 512 avgt 15 0.650 ? 0.001 us/op (+86.47)
+HexFormatBench.formatUpperCached 512 avgt 15 0.598 ? 0.001 us/op (+3.02)
+HexFormatBench.toHexDigitsByte 512 avgt 15 5.591 ? 0.058 us/op (+4.53)
+HexFormatBench.toHexDigitsInt 512 avgt 15 7.080 ? 0.114 us/op (+4.41)
+HexFormatBench.toHexDigitsLong 512 avgt 15 7.754 ? 0.040 us/op (+4.05)
+HexFormatBench.toHexDigitsShort 512 avgt 15 5.779 ? 0.076 us/op (+7.53)
+HexFormatBench.toHexLower 512 avgt 15 0.494 ? 0.001 us/op (+87.45)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.493 ? 0.001 us/op (+102.84)
+HexFormatBench.toHexUpper 512 avgt 15 0.494 ? 0.001 us/op (+87.66)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.493 ? 0.001 us/op (+102.64) 3. Mac Book Pro M1 Pro-Benchmark (size) Mode Cnt Score Error Units (baseline)
-HexFormatBench.appenderLower 512 avgt 15 2.867 ? 0.035 us/op
-HexFormatBench.appenderLowerCached 512 avgt 15 1.656 ? 0.875 us/op
-HexFormatBench.appenderUpper 512 avgt 15 2.813 ? 0.085 us/op
-HexFormatBench.appenderUpperCached 512 avgt 15 1.575 ? 0.901 us/op
-HexFormatBench.formatLower 512 avgt 15 0.385 ? 0.001 us/op
-HexFormatBench.formatLowerCached 512 avgt 15 0.385 ? 0.002 us/op
-HexFormatBench.formatUpper 512 avgt 15 0.385 ? 0.001 us/op
-HexFormatBench.formatUpperCached 512 avgt 15 0.384 ? 0.001 us/op
-HexFormatBench.toHexDigitsByte 512 avgt 15 1.688 ? 0.009 us/op
-HexFormatBench.toHexDigitsInt 512 avgt 15 2.991 ? 0.015 us/op
-HexFormatBench.toHexDigitsLong 512 avgt 15 3.719 ? 0.081 us/op
-HexFormatBench.toHexDigitsShort 512 avgt 15 1.868 ? 0.010 us/op
-HexFormatBench.toHexLower 512 avgt 15 0.321 ? 0.001 us/op
-HexFormatBench.toHexLowerCached 512 avgt 15 0.322 ? 0.001 us/op
-HexFormatBench.toHexUpper 512 avgt 15 0.324 ? 0.001 us/op
-HexFormatBench.toHexUpperCached 512 avgt 15 0.325 ? 0.001 us/op
+Benchmark (size) Mode Cnt Score Error Units (optimized)
+HexFormatBench.appenderLower 512 avgt 15 0.207 ? 0.001 us/op (+1285.03)
+HexFormatBench.appenderLowerCached 512 avgt 15 0.206 ? 0.001 us/op (+703.89)
+HexFormatBench.appenderUpper 512 avgt 15 0.225 ? 0.001 us/op (+1150.23)
+HexFormatBench.appenderUpperCached 512 avgt 15 0.225 ? 0.001 us/op (+600.00)
+HexFormatBench.formatLower 512 avgt 15 0.211 ? 0.003 us/op (+82.47)
+HexFormatBench.formatLowerCached 512 avgt 15 0.186 ? 0.001 us/op (+106.99)
+HexFormatBench.formatUpper 512 avgt 15 0.312 ? 0.001 us/op (+23.40)
+HexFormatBench.formatUpperCached 512 avgt 15 0.344 ? 0.001 us/op (+11.63)
+HexFormatBench.toHexDigitsByte 512 avgt 15 1.718 ? 0.054 us/op (-1.75)
+HexFormatBench.toHexDigitsInt 512 avgt 15 2.255 ? 0.010 us/op (+32.64)
+HexFormatBench.toHexDigitsLong 512 avgt 15 3.764 ? 0.005 us/op (-1.20)
+HexFormatBench.toHexDigitsShort 512 avgt 15 1.858 ? 0.008 us/op (+0.54)
+HexFormatBench.toHexLower 512 avgt 15 0.289 ? 0.004 us/op (+11.08)
+HexFormatBench.toHexLowerCached 512 avgt 15 0.295 ? 0.001 us/op (+9.16)
+HexFormatBench.toHexUpper 512 avgt 15 0.288 ? 0.001 us/op (+12.50)
+HexFormatBench.toHexUpperCached 512 avgt 15 0.297 ? 0.005 us/op (+9.43) |
/label remove security |
@wenshao |
@cl4es Can you help me create an issue for this PR? |
for (int i = 1; i < length; i++) { | ||
out.append(suffix); | ||
out.append(delimiter); | ||
out.append(prefix); | ||
toHexDigits(out, bytes[fromIndex + i]); | ||
} | ||
out.append(suffix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change this whole else
block to
for (int i = 0; i < length; i++) {
if (i > 0)
out.append(delimiter);
out.append(prefix);
toHexDigits(out, bytes[fromIndex + i]);
out.append(suffix);
}
for clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original (and current) is coded to avoid a condition inside the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that the way of writing for_0 combined with if > 0 is easier to understand, The operation overhead of if > 0 is very small, and it will not affect performance when used in a loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coding pattern to handle first and last special cases outside the loop is also common and may help the JIT with loop unrolling and other optimizations. (And yes JITs can recognize all kinds of code and optimize it anyway).
return (char)('a' - 10 + value); | ||
} | ||
return (char)('A' - 10 + value); | ||
return (char) ((value < 10 ? '0' : hexBase) + value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes seem to be the reason toHexDigitsLong
consistently sees a slight performance drop. Can any professional engineer explain why the old version, which would be like (char) (value < 10 ? '0' + value : hexBase + value)
, is slightly faster?
Since the effect of this change hexBase
is not clear, I recommend dropping this for now (remove the field and changes to the 2 methods).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please drop this. Possibly re-attempting it in a future PR, though I'm a bit skeptical of the potential win here.
I did test something like hexBase + value
in #15591 to a similar effect, so opted for just storing digitCase
. My best guess as to why are that the JIT is simply better at (or more eager to) eliminating untaken paths than it is at constant fold the field load. Or the branch isn't eliminated and we're seeing the effect of branch prediction. I'd recommend analyzing the asm generated (via e.g. -prof perfasm
) to understand this in depth
I have created a JBS issue including the 2 confirmed performance improvements in this patch: https://bugs.openjdk.org/browse/JDK-8316426 |
Webrevs
|
HexFormat was designed for presenting bytes to humans, typical line lengths for the useful output are 80-130 characters, so input sizes of 40-65 bytes, probably less when other interpretations (such as latin1 or ascii) are appended. |
public static short digitPair(int i, boolean ucase) { | ||
short v = DIGITS[i & 0xff]; | ||
return ucase | ||
? (short) (v & ~((v & 0b0100_0000_0100_0000) >> 1)) // really: to uppper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever but perhaps ('a' - 'A') to make it clearer where the constant came from
for (int i = 1; i < length; i++) { | ||
out.append(suffix); | ||
out.append(delimiter); | ||
out.append(prefix); | ||
toHexDigits(out, bytes[fromIndex + i]); | ||
} | ||
out.append(suffix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original (and current) is coded to avoid a condition inside the loop.
ok, but perhaps you can shrink it further, if it does not hurt performance, by subtracting 0x20 before indexing and cut the table to 64 bytes. |
If you use DIGITS of size 54, performance will be 10% slower, The code is written like this: public final class HexFormat {
private static final byte[] DIGITS = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, -1, -1, -1, -1, -1,
-1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, 10, 11, 12, 13, 14, 15
// remove 128 - 255
};
public static boolean isHexDigit(int ch) {
// unsigned right shift 8 change to 7
return (ch >= '0' && ch <= 'f' && DIGITS[ch - '0'] >= 0);
}
public static int fromHexDigit(int ch) {
int value;
// unsigned right shift 8 change to 7
if (ch >= '0' && ch <= 'f' && (value = DIGITS[ch - '0']) >= 0)
...
}
} |
/integrate |
It is conventional to get re-reviews of any change and give the reviewers that taken the time to review the PR a chance to review any subsequent changes. |
/integrate |
@RogerRiggs Can this PR be added to /sponsor now? |
FWIW I'll not review or sponsor any PRs that use |
Thanks, I received your opinion and I have removed ByteArrayLittleEndian. Such optimization should be done by JIT. It seems that this optimization is called SuperWord Level Parallelism (SLP). Currently, HotSpot C2 has only completed the loop version and not the linear version. Improvements require JIT experts to submit patches. |
/integrate |
@cl4es Can it be integrated? |
Yes, I think this looks fine. Thank you for your patience. /sponsor |
Going to push as commit 9355431.
Your commit was automatically rebased without conflicts. |
In the improvement of @cl4es PR #15591, the advantages of non-lookup-table were discussed.
But if the input is byte[], using lookup table can improve performance.
For HexFormat#formatHex(Appendable, byte[]) and HexFormat#formatHex(byte[]), If the length of byte[] is larger, the performance of table lookup will be improved more obviously.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15768/head:pull/15768
$ git checkout pull/15768
Update a local copy of the PR:
$ git checkout pull/15768
$ git pull https://git.openjdk.org/jdk.git pull/15768/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 15768
View PR using the GUI difftool:
$ git pr show -t 15768
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15768.diff
Webrev
Link to Webrev Comment