8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

takiguc · 2022-07-06T14:05:39Z

OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and DBCS Only charsets.

Charset	Mix	SBCS	DBCS
Japanese EBCDIC - Katakana	Cp930	Cp290	Cp300
Korean	Cp933	Cp833	Cp834

But OpenJDK does not supports some of "Japanese EBCDIC - English" / "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS Only charsets.

I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency

Charset	Mix	SBCS	DBCS
Japanese EBCDIC - English	Cp939	Cp1027	Cp300
Simplified Chinese EBCDIC	Cp935	Cp836	Cp837
Traditional Chinese EBCDIC	Cp937	(*1)	Cp835

*1: Cp037 compatible

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8289834: Add SBCS and DBCS Only EBCDIC charsets

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9399/head:pull/9399
$ git checkout pull/9399

Update a local copy of the PR:
$ git checkout pull/9399
$ git pull https://git.openjdk.org/jdk pull/9399/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9399

View PR using the GUI difftool:
$ git pr show -t 9399

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9399.diff

bridgekeeper · 2022-07-06T14:06:57Z

👋 Welcome back itakiguchi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-07-06T14:09:35Z

@takiguc The following labels will be automatically applied to this pull request:

build
i18n
nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

takiguc · 2022-07-06T16:04:42Z

/label remove nio
/label add core-libs

openjdk · 2022-07-06T16:05:06Z

@takiguc
The nio label was successfully removed.

takiguc · 2022-07-06T16:06:29Z

/label add core-libs

openjdk · 2022-07-06T16:06:47Z

@takiguc
The core-libs label was successfully added.

openjdk · 2022-07-06T16:08:44Z

@takiguc The core-libs label was already applied.

takiguc · 2022-07-06T16:18:08Z

Discussions are available on :
JDK-8289834: Add SBCS and DBCS Only EBCDIC charsets

mlbridge · 2022-07-06T16:23:21Z

Webrevs

00: Full (58d128d7)

AlanBateman · 2022-07-07T09:47:25Z

Discussions are available on :
JDK-8289834: Add SBCS and DBCS Only EBCDIC charsets

Yes, I think this need discussion on whether the JDK really needs to keep including and adding more EBCDIC charsets. I understand they can be useful for someone using JDBC to connect to a database on z/OS but this scenario would work equally well if the EBCDIC charsets were deployed on the class path or module path. Do you know if the icu4j project is still alive? I've often wondered why there wasn't more use of the provider mechanism.

magicus · 2022-07-07T10:42:10Z

/label -build

openjdk · 2022-07-07T10:42:44Z

@magicus
The build label was successfully removed.

naotoj · 2022-07-07T16:47:35Z

I second Alan's comment here. I am not sure we could justify those legacy less common EBCDIC charsets making into the jdk.charsets module. Instead, utilizing the charset provider mechanism will better fit here to me.

mlbridge · 2022-07-08T08:23:36Z

Mailing list message from Bernd Eckenfels on core-libs-dev:

And also there is no reason why db drivers or host connectors should not ship their own charset support (Oracle JDBC for example had nls_charset addons. My employer also ship a custom EBCDIC encoding which includes some compatibility hacks, and that took some effort to adopt it to the missing ext mechanism).

Having said that, with JPMS a ?legacy ebcdic? encoding module would be possible while still being optional. Maybe in the future a mechanism for modules which can be added (instead of removed) from standard distribution would make that nicer?

Is there a performance restriction for charset if they are not part of a platform module (optimized string access)?

Gruss
Bernd

--
http://bernd.eckenfels.net
________________________________
Von: core-libs-dev <core-libs-dev-retn at openjdk.org> im Auftrag von Alan Bateman <alanb at openjdk.org>
Gesendet: Thursday, July 7, 2022 11:50:39 AM
An: build-dev at openjdk.org <build-dev at openjdk.org>; core-libs-dev at openjdk.org <core-libs-dev at openjdk.org>; i18n-dev at openjdk.org <i18n-dev at openjdk.org>
Betreff: Re: RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

On Wed, 6 Jul 2022 16:18:08 GMT, Ichiroh Takiguchi <itakiguchi at openjdk.org> wrote:

Discussions are available on :
[JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834): Add SBCS and DBCS Only EBCDIC charsets

Yes, I think this need discussion on whether the JDK really needs to keep including and adding more EBCDIC charsets. I understand they can be useful for someone using JDBC to connect to a database on z/OS but this scenario would work equally well if the EBCDIC charsets were deployed on the class path or module path. Do you know if the icu4j project is still alive? I've often wondered why there wasn't more use of the provider mechanism.

-------------

PR: https://git.openjdk.org/jdk/pull/9399
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20220707/2d095f3d/attachment-0001.htm>

bridgekeeper · 2022-08-05T11:24:03Z

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

takiguc · 2022-08-08T00:19:57Z

Hello @AlanBateman .
Sorry I'm late.
I got some responses from ICU. ICU-22091
I'm not sure if they're interested in the new charset...

As you know sun.nio.cs.ArrayDecoder and sun.nio.cs.ArrayEncoderinterface have performance advantage.
And some other performance advantages are there on built-in charset decoder/encoder.
Is it possible to create simple public API by using sun.nio.cs.SingleByte and sun.nio.cs.DoubleByte* classes?
We'd like to use stable conversion loop.

AlanBateman · 2022-08-08T09:22:32Z

As you know sun.nio.cs.ArrayDecoder and sun.nio.cs.ArrayEncoderinterface have performance advantage. And some other performance advantages are there on built-in charset decoder/encoder. Is it possible to create simple public API by using sun.nio.cs.SingleByte and sun.nio.cs.DoubleByte* classes? We'd like to use stable conversion loop.

If they have ASCII compatible regions then that may be so but I haven't see any performance data published on that. Do you know if any experiments that have deployed a CharsetProvider for the EBCDIC charsets and compared the performance with the charsets that in the JDK? There may be merit in exploring adding base abstracts implementations of CharsetEncoder/CharsetDecoder to java.nio.charsets.spi to support single and double byte charsets to see how such base implementations might look, how they would help performance, and if there are any security downsides.

takiguc · 2022-08-26T07:15:30Z

Hello @AlanBateman .
Sorry, I'm late.
Test result is attached (not guaranteed).

I created attached small test program, I'm not sure it's good or not

import java.nio.*;
import java.nio.charset.*;

public class tc {
  public static void main(String[] args) throws Exception {
    Charset cs = Charset.forName(args[0]);
    int cnt = Integer.parseInt(args[1]);
    boolean useCA = "1".equals(args[2]);
    boolean useBA = "1".equals(args[3]);
    CharsetEncoder ce = cs.newEncoder();
    byte[] ba = new byte[0x4000];
    for(int i = 0; i < ba.length; i++) {
      ba[i] = (byte) i;
    }
    String s = new String(ba, cs);
    char[] ca = s.toCharArray();
    ByteBuffer bb = useBA ? ByteBuffer.allocate(ca.length) : ByteBuffer.allocateDirect(ca.length);;
    CharBuffer cb = useCA ? CharBuffer.wrap(ca) : CharBuffer.wrap(s);
    System.out.println("CharBuffer.hasArray() = " + cb.hasArray());
    System.out.println("ByteBuffer.hasArray() = " + bb.hasArray());
    long start_t = System.currentTimeMillis();
    for(int i = 0; i < 200; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Warmup: "+(System.currentTimeMillis() - start_t));
    start_t = System.currentTimeMillis();
    for(int i = 0; i < cnt; i++) {
      ce.reset();
      bb.position(0);
      cb.position(0);
      ce.encode(cb, bb, true);
    }
    System.out.println("Test: "+(System.currentTimeMillis() - start_t));
  }
}

Following test result is just for my test environment

CPU: Intel (On-premises environment), not same machine
Executed 5 times, the values are their average

Use following options, like
OpenJDK:
java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1
ICU4J
java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1

I used jdk-20 b12
Only A/A with OpenJDK uses ArrayEncoder (ArrayDecoder) interface

	A/A	A/B	B/A	B/B
Linux (OpenJDK)	862	1265	1838	1843
Linux (ICU4J)	1450	1410	1152	1138
Windows (OpenJDK)	921	1231	1959	1850
Windows (ICU4J)	1431	1446	2227	2265
Mac (OpenJDK)	820	1163	1799	1774
Mac (ICU4J)	1282	1242	994	1049

Notes:

A/A means CharBuffer is created via char[], ByteBuffer is generated by allocate()
A/B means CharBuffer is created via char[], ByteBuffer is generated by allocateDirect()
B/A means CharBuffer is created via String, ByteBuffer is generated by allocate()
B/B means CharBuffer is created via String, ByteBuffer is generated by allocateDirect()

Actually, I'm confused by this result.
Previously, I was just comparing A/A with B/B on OpenJDK's charset.
I didn't think ICU4J's result would make a difference.

Anyway, please evaluate about this result.
And please let me know if I need more investigation.

AlanBateman · 2022-08-26T09:25:55Z

Use following options, like OpenJDK: java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1 ICU4J java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1

Actually, I'm confused by this result. Previously, I was just comparing A/A with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a difference.

My initial reaction is one of relief that the icu4j provider can be used with current JDK builds. This means there is an option should we decide to stop adding more EBCDIC charsets to the JDK.

The test uses IBM-1047 and I can't tell if the icu4j provider is used or not. Charset doesn't define a provider method but I think would be useful to print cs.getClass() or cs.getClass().getModule() so we know which Charset implementation is used. Also I think any discussion on performance would be better served with a JMH benchmark rather than a standalone test.

bridgekeeper · 2022-09-23T18:11:07Z

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

takiguc · 2022-10-03T07:14:09Z

Hello @AlanBateman .
Sorry I'm late.

I created Charset SPI JAR x-IBM1047_SPI (custom-charsets.jar) which was ported from sun.nio.cs.SingleByte.java and IBM1047.java (generated one).

Test code:

package com.example;

import java.nio.charset.Charset;
import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {

    final static String s;

    static {
        char[] ca = new char[0x2000];
        for (int i = 0; i < ca.length; i++) {
            ca[i] = (char) (i & 0xFF);
        }
        s = new String(ca);
    }

    @Benchmark
    public void testIBM1047() throws Exception {
        byte[] ba = s.getBytes("IBM1047");
    }

    @Benchmark
    public void testIBM1047_SPI() throws Exception {
        byte[] ba = s.getBytes("x-IBM1047_SPI");
    }

}

All test related files are in JDK-8289834.

Test results are as follows on RHEL8.6 x86_64 (Intel Core i7 3520M) :

1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s

20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

IBM1047 is 2.6 times faster than the SPI version on JDK20.
I think this results are related to JEP 254: Compact Strings .
As I requested before, we'd like to use sun.nio.cs.SingleByte* and sun.nio.cs.DoubleByte* class as public API.

AlanBateman · 2022-10-03T08:51:13Z

Test results are as follows on RHEL8.6 x86_64 (Intel Core i7 3520M) :
1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s
20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s
IBM1047 is 2.6 times faster than the SPI version on JDK20. I think this results are related to JEP 254: Compact Strings . As I requested before, we'd like to use sun.nio.cs.SingleByte* and sun.nio.cs.DoubleByte* class as public API.

I don't think that is a workable option. Maybe you could do more performance analysis on the providers that are using the SPI to see if there are any secure primitives that would make sense to expose in java.nio.charset.spi? I'm hesitate to suggest adding base classes to java.nio.charset.spi but that may be something that could be explored too.

bridgekeeper · 2022-10-31T11:42:35Z

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2022-11-28T15:24:29Z

@takiguc This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

takiguc · 2022-11-29T01:33:22Z

I'm still working on this one.
/open

openjdk · 2022-11-29T01:33:45Z

@takiguc This pull request is now open

bridgekeeper · 2022-12-27T03:00:38Z

@takiguc This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2023-01-24T03:58:52Z

@takiguc This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

8289834: Missing SBCS and DBCS Only EBCDIC charsets

58d128d

openjdk bot added nio nio-dev@openjdk.org build build-dev@openjdk.org i18n i18n-dev@openjdk.org labels Jul 6, 2022

openjdk bot removed the nio nio-dev@openjdk.org label Jul 6, 2022

openjdk bot added the core-libs core-libs-dev@openjdk.org label Jul 6, 2022

takiguc changed the title ~~8289834: Missing SBCS and DBCS Only EBCDIC charsets~~ 8289834: Add SBCS and DBCS Only EBCDIC charsets Jul 6, 2022

takiguc marked this pull request as ready for review July 6, 2022 16:18

openjdk bot added the rfr Pull request is ready for review label Jul 6, 2022

openjdk bot removed the build build-dev@openjdk.org label Jul 7, 2022

bridgekeeper bot closed this Nov 28, 2022

openjdk bot reopened this Nov 29, 2022

bridgekeeper bot closed this Jan 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

takiguc commented Jul 6, 2022 •

edited by openjdk bot

bridgekeeper bot commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

mlbridge bot commented Jul 6, 2022

AlanBateman commented Jul 7, 2022

magicus commented Jul 7, 2022

openjdk bot commented Jul 7, 2022

naotoj commented Jul 7, 2022

mlbridge bot commented Jul 8, 2022

bridgekeeper bot commented Aug 5, 2022

takiguc commented Aug 8, 2022

AlanBateman commented Aug 8, 2022

takiguc commented Aug 26, 2022

AlanBateman commented Aug 26, 2022

bridgekeeper bot commented Sep 23, 2022

takiguc commented Oct 3, 2022

AlanBateman commented Oct 3, 2022 •

edited

bridgekeeper bot commented Oct 31, 2022

bridgekeeper bot commented Nov 28, 2022

takiguc commented Nov 29, 2022

openjdk bot commented Nov 29, 2022

bridgekeeper bot commented Dec 27, 2022

bridgekeeper bot commented Jan 24, 2023

8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

8289834: Add SBCS and DBCS Only EBCDIC charsets #9399

Conversation

takiguc commented Jul 6, 2022 • edited by openjdk bot

Progress

Issue

Reviewing

bridgekeeper bot commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

openjdk bot commented Jul 6, 2022

takiguc commented Jul 6, 2022

mlbridge bot commented Jul 6, 2022

Webrevs

AlanBateman commented Jul 7, 2022

magicus commented Jul 7, 2022

openjdk bot commented Jul 7, 2022

naotoj commented Jul 7, 2022

mlbridge bot commented Jul 8, 2022

bridgekeeper bot commented Aug 5, 2022

takiguc commented Aug 8, 2022

AlanBateman commented Aug 8, 2022

takiguc commented Aug 26, 2022

AlanBateman commented Aug 26, 2022

bridgekeeper bot commented Sep 23, 2022

takiguc commented Oct 3, 2022

AlanBateman commented Oct 3, 2022 • edited

bridgekeeper bot commented Oct 31, 2022

bridgekeeper bot commented Nov 28, 2022

takiguc commented Nov 29, 2022

openjdk bot commented Nov 29, 2022

bridgekeeper bot commented Dec 27, 2022

bridgekeeper bot commented Jan 24, 2023

takiguc commented Jul 6, 2022 •

edited by openjdk bot

AlanBateman commented Oct 3, 2022 •

edited