Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

Closed
wants to merge 1 commit into from

Conversation

takiguc
Copy link

@takiguc takiguc commented Jul 6, 2022

OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and DBCS Only charsets.

Charset Mix SBCS DBCS
Japanese EBCDIC - Katakana Cp930 Cp290 Cp300
Korean Cp933 Cp833 Cp834

But OpenJDK does not supports some of "Japanese EBCDIC - English" / "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS Only charsets.

I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency

Charset Mix SBCS DBCS
Japanese EBCDIC - English Cp939 Cp1027 Cp300
Simplified Chinese EBCDIC Cp935 Cp836 Cp837
Traditional Chinese EBCDIC Cp937 (*1) Cp835

*1: Cp037 compatible


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9399/head:pull/9399
$ git checkout pull/9399

Update a local copy of the PR:
$ git checkout pull/9399
$ git pull https://git.openjdk.org/jdk pull/9399/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9399

View PR using the GUI difftool:
$ git pr show -t 9399

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9399.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 6, 2022

👋 Welcome back itakiguchi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 6, 2022

@takiguc The following labels will be automatically applied to this pull request:

  • build
  • i18n
  • nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added nio nio-dev@openjdk.org build build-dev@openjdk.org i18n i18n-dev@openjdk.org labels Jul 6, 2022
@takiguc
Copy link
Author

takiguc commented Jul 6, 2022

/label remove nio
/label add core-libs

@openjdk openjdk bot removed the nio nio-dev@openjdk.org label Jul 6, 2022
@openjdk
Copy link

openjdk bot commented Jul 6, 2022

@takiguc
The nio label was successfully removed.

@takiguc
Copy link
Author

takiguc commented Jul 6, 2022

/label add core-libs

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Jul 6, 2022
@openjdk
Copy link

openjdk bot commented Jul 6, 2022

@takiguc
The core-libs label was successfully added.

@takiguc takiguc changed the title 8289834: Missing SBCS and DBCS Only EBCDIC charsets 8289834: Add SBCS and DBCS Only EBCDIC charsets Jul 6, 2022
@openjdk
Copy link

openjdk bot commented Jul 6, 2022

@takiguc The core-libs label was already applied.

@takiguc
Copy link
Author

takiguc commented Jul 6, 2022

Discussions are available on :
JDK-8289834: Add SBCS and DBCS Only EBCDIC charsets

@takiguc takiguc marked this pull request as ready for review July 6, 2022 16:18
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 6, 2022
@mlbridge
Copy link

mlbridge bot commented Jul 6, 2022

Webrevs

@AlanBateman
Copy link
Contributor

Discussions are available on :
JDK-8289834: Add SBCS and DBCS Only EBCDIC charsets

Yes, I think this need discussion on whether the JDK really needs to keep including and adding more EBCDIC charsets. I understand they can be useful for someone using JDBC to connect to a database on z/OS but this scenario would work equally well if the EBCDIC charsets were deployed on the class path or module path. Do you know if the icu4j project is still alive? I've often wondered why there wasn't more use of the provider mechanism.

@magicus
Copy link
Member

magicus commented Jul 7, 2022

/label -build

@openjdk openjdk bot removed the build build-dev@openjdk.org label Jul 7, 2022
@openjdk
Copy link

openjdk bot commented Jul 7, 2022

@magicus
The build label was successfully removed.

@naotoj
Copy link
Member

naotoj commented Jul 7, 2022

I second Alan's comment here. I am not sure we could justify those legacy less common EBCDIC charsets making into the jdk.charsets module. Instead, utilizing the charset provider mechanism will better fit here to me.

@mlbridge
Copy link

mlbridge bot commented Jul 8, 2022

Mailing list message from Bernd Eckenfels on core-libs-dev:

And also there is no reason why db drivers or host connectors should not ship their own charset support (Oracle JDBC for example had nls_charset addons. My employer also ship a custom EBCDIC encoding which includes some compatibility hacks, and that took some effort to adopt it to the missing ext mechanism).

Having said that, with JPMS a ?legacy ebcdic? encoding module would be possible while still being optional. Maybe in the future a mechanism for modules which can be added (instead of removed) from standard distribution would make that nicer?

Is there a performance restriction for charset if they are not part of a platform module (optimized string access)?

Gruss
Bernd

--
http://bernd.eckenfels.net
________________________________
Von: core-libs-dev <core-libs-dev-retn at openjdk.org> im Auftrag von Alan Bateman <alanb at openjdk.org>
Gesendet: Thursday, July 7, 2022 11:50:39 AM
An: build-dev at openjdk.org <build-dev at openjdk.org>; core-libs-dev at openjdk.org <core-libs-dev at openjdk.org>; i18n-dev at openjdk.org <i18n-dev at openjdk.org>
Betreff: Re: RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

On Wed, 6 Jul 2022 16:18:08 GMT, Ichiroh Takiguchi <itakiguchi at openjdk.org> wrote:

Discussions are available on :
[JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834): Add SBCS and DBCS Only EBCDIC charsets

Yes, I think this need discussion on whether the JDK really needs to keep including and adding more EBCDIC charsets. I understand they can be useful for someone using JDBC to connect to a database on z/OS but this scenario would work equally well if the EBCDIC charsets were deployed on the class path or module path. Do you know if the icu4j project is still alive? I've often wondered why there wasn't more use of the provider mechanism.

-------------

PR: https://git.openjdk.org/jdk/pull/9399
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20220707/2d095f3d/attachment-0001.htm>

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 5, 2022

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@takiguc
Copy link
Author

takiguc commented Aug 8, 2022

Hello @AlanBateman .
Sorry I'm late.
I got some responses from ICU. ICU-22091
I'm not sure if they're interested in the new charset...

As you know sun.nio.cs.ArrayDecoder and sun.nio.cs.ArrayEncoderinterface have performance advantage.
And some other performance advantages are there on built-in charset decoder/encoder.
Is it possible to create simple public API by using sun.nio.cs.SingleByte and sun.nio.cs.DoubleByte* classes?
We'd like to use stable conversion loop.

@AlanBateman
Copy link
Contributor

As you know sun.nio.cs.ArrayDecoder and sun.nio.cs.ArrayEncoderinterface have performance advantage. And some other performance advantages are there on built-in charset decoder/encoder. Is it possible to create simple public API by using sun.nio.cs.SingleByte and sun.nio.cs.DoubleByte* classes? We'd like to use stable conversion loop.

If they have ASCII compatible regions then that may be so but I haven't see any performance data published on that. Do you know if any experiments that have deployed a CharsetProvider for the EBCDIC charsets and compared the performance with the charsets that in the JDK? There may be merit in exploring adding base abstracts implementations of CharsetEncoder/CharsetDecoder to java.nio.charsets.spi to support single and double byte charsets to see how such base implementations might look, how they would help performance, and if there are any security downsides.

@takiguc
Copy link
Author

takiguc commented Aug 26, 2022

Hello @AlanBateman .
Sorry, I'm late.
Test result is attached (not guaranteed).

I created attached small test program, I'm not sure it's good or not

import java.nio.*;
import java.nio.charset.*;

public class tc {
  public static void main(String[] args) throws Exception {
    Charset cs = Charset.forName(args[0]);
    int cnt = Integer.parseInt(args[1]);
    boolean useCA = "1".equals(args[2]);
    boolean useBA = "1".equals(args[3]);
    CharsetEncoder ce = cs.newEncoder();
    byte[] ba = new byte[0x4000];
    for(int i = 0; i < ba.length; i++) {
      ba[i] = (byte) i;
    }
    String s = new String(ba, cs);
    char[] ca = s.toCharArray();
    ByteBuffer bb = useBA ? ByteBuffer.allocate(ca.length) : ByteBuffer.allocateDirect(ca.length);;
    CharBuffer cb = useCA ? CharBuffer.wrap(ca) : CharBuffer.wrap(s);
    System.out.println("CharBuffer.hasArray() = " + cb.hasArray());
    System.out.println("ByteBuffer.hasArray() = " + bb.hasArray());
    long start_t = System.currentTimeMillis();
    for(int i = 0; i < 200; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Warmup: "+(System.currentTimeMillis() - start_t));
    start_t = System.currentTimeMillis();
    for(int i = 0; i < cnt; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Test: "+(System.currentTimeMillis() - start_t));
  }
}

Following test result is just for my test environment

  • CPU: Intel (On-premises environment), not same machine
  • Executed 5 times, the values are their average

Use following options, like
OpenJDK:
java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1
ICU4J
java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1

I used jdk-20 b12
Only A/A with OpenJDK uses ArrayEncoder (ArrayDecoder) interface

A/A A/B B/A B/B
Linux (OpenJDK) 862 1265 1838 1843
Linux (ICU4J) 1450 1410 1152 1138
Windows (OpenJDK) 921 1231 1959 1850
Windows (ICU4J) 1431 1446 2227 2265
Mac (OpenJDK) 820 1163 1799 1774
Mac (ICU4J) 1282 1242 994 1049

Notes:

  • A/A means CharBuffer is created via char[], ByteBuffer is generated by allocate()
  • A/B means CharBuffer is created via char[], ByteBuffer is generated by allocateDirect()
  • B/A means CharBuffer is created via String, ByteBuffer is generated by allocate()
  • B/B means CharBuffer is created via String, ByteBuffer is generated by allocateDirect()

Actually, I'm confused by this result.
Previously, I was just comparing A/A with B/B on OpenJDK's charset.
I didn't think ICU4J's result would make a difference.

Anyway, please evaluate about this result.
And please let me know if I need more investigation.

@AlanBateman
Copy link
Contributor

Use following options, like OpenJDK: java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1 ICU4J java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1

Actually, I'm confused by this result. Previously, I was just comparing A/A with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a difference.

My initial reaction is one of relief that the icu4j provider can be used with current JDK builds. This means there is an option should we decide to stop adding more EBCDIC charsets to the JDK.

The test uses IBM-1047 and I can't tell if the icu4j provider is used or not. Charset doesn't define a provider method but I think would be useful to print cs.getClass() or cs.getClass().getModule() so we know which Charset implementation is used. Also I think any discussion on performance would be better served with a JMH benchmark rather than a standalone test.

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 23, 2022

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@takiguc
Copy link
Author

takiguc commented Oct 3, 2022

Hello @AlanBateman .
Sorry I'm late.

I created Charset SPI JAR x-IBM1047_SPI (custom-charsets.jar) which was ported from sun.nio.cs.SingleByte.java and IBM1047.java (generated one).

Test code:

package com.example;

import java.nio.charset.Charset;
import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {

    final static String s;

    static {
        char[] ca = new char[0x2000];
        for (int i = 0; i < ca.length; i++) {
            ca[i] = (char) (i & 0xFF);
        }
        s = new String(ca);
    }

    @Benchmark
    public void testIBM1047() throws Exception {
        byte[] ba = s.getBytes("IBM1047");
    }

    @Benchmark
    public void testIBM1047_SPI() throws Exception {
        byte[] ba = s.getBytes("x-IBM1047_SPI");
    }

}

All test related files are in JDK-8289834.

Test results are as follows on RHEL8.6 x86_64 (Intel Core i7 3520M) :

1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s
20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

IBM1047 is 2.6 times faster than the SPI version on JDK20.
I think this results are related to JEP 254: Compact Strings .
As I requested before, we'd like to use sun.nio.cs.SingleByte* and sun.nio.cs.DoubleByte* class as public API.

@AlanBateman
Copy link
Contributor

AlanBateman commented Oct 3, 2022

Test results are as follows on RHEL8.6 x86_64 (Intel Core i7 3520M) :

1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s
20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

IBM1047 is 2.6 times faster than the SPI version on JDK20. I think this results are related to JEP 254: Compact Strings . As I requested before, we'd like to use sun.nio.cs.SingleByte* and sun.nio.cs.DoubleByte* class as public API.

I don't think that is a workable option. Maybe you could do more performance analysis on the providers that are using the SPI to see if there are any secure primitives that would make sense to expose in java.nio.charset.spi? I'm hesitate to suggest adding base classes to java.nio.charset.spi but that may be something that could be explored too.

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 31, 2022

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 28, 2022

@takiguc This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Nov 28, 2022
@takiguc
Copy link
Author

takiguc commented Nov 29, 2022

I'm still working on this one.
/open

@openjdk openjdk bot reopened this Nov 29, 2022
@openjdk
Copy link

openjdk bot commented Nov 29, 2022

@takiguc This pull request is now open

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 27, 2022

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 24, 2023

@takiguc This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org i18n i18n-dev@openjdk.org rfr Pull request is ready for review
4 participants