8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs #9533

shipilev · 2022-07-18T07:40:53Z

The test appears to pass fine with G1, but it fails with other GCs, for example Parallel and Shenandoah:

$ CONF=linux-x86_64-server-fastdebug make test TEST=java/io/ObjectStreamClass/ObjectStreamClassCaching.java TEST_VM_OPTS="-XX:+UseParallelGC"

test ObjectStreamClassCaching.testCacheReleaseUnderMemoryPressure(): success
test ObjectStreamClassCaching.testCachingEffectiveness(): failure
java.lang.AssertionError: Cache lost entry although memory was not under pressure expected [false] but found [true]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertFalse(Assert.java:67)

I believe this is because System.gc() is not that reliable about what happens with weak references. As seen with other GCs, they can clear the weakrefs on Full GC. In fact, the test fails with G1 too if we do a second System.gc() in this test. So the test itself is flaky. The fix is to avoid doing System.gc() altogether in that subtest. The subtest is still retained to check that the reference is not cleared for a while.

Additional testing:

Linux x86_64 fastdebug, affected test with -XX:+UseSerialGC, 100 repetitions
Linux x86_64 fastdebug, affected test with -XX:+UseParallelGC, 100 repetitions
Linux x86_64 fastdebug, affected test with -XX:+UseG1GC, 100 repetitions
Linux x86_64 fastdebug, affected test with -XX:+UseShenandoahGC, 100 repetitions
Linux x86_64 fastdebug, affected test with -XX:+UseZGC, 100 repetitions

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs

Reviewers

Roman Kennke (@rkennke - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9533/head:pull/9533
$ git checkout pull/9533

Update a local copy of the PR:
$ git checkout pull/9533
$ git pull https://git.openjdk.org/jdk pull/9533/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9533

View PR using the GUI difftool:
$ git pr show -t 9533

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9533.diff

bridgekeeper · 2022-07-18T07:42:29Z

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2022-07-18T07:45:53Z

@shipilev The following label will be automatically applied to this pull request:

core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2022-07-18T07:48:34Z

Webrevs

00: Full (28426c23)

shipilev · 2022-07-22T06:29:15Z

Still waiting for reviews. @rkennke, maybe? :)

rkennke · 2022-07-22T10:36:32Z

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

See for example: test/hotspot/jtreg/gc/TestReferenceRefersTo.java

shipilev · 2022-07-22T10:39:13Z

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

I think the test relies on the assumption that System.gc() does the weak reference clearing. Which is not given, for example if a concurrent GC cycle is triggered with weak refs processing that purges the cache. I don't think we can solve this with GC control, unless we hack into each of the GCs' policies with WhiteBox...

rkennke · 2022-07-22T10:46:57Z

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

I think the test relies on the assumption that System.gc() does the weak reference clearing. Which is not given, for example if a concurrent GC cycle is triggered without weak refs processing. I don't think we can solve this with GC control, unless we hack into each of the GCs' policies with WhiteBox...

Well, WB GC control can run concurrent GC, and wait until certain control points are reached or conc cycle is finished. An example that does that and checks referents get cleared is:

test/hotspot/jtreg/gc/TestReferenceClearDuringReferenceProcessing.java

TBH, not certain how GC policies play into that, and how well WB is supported in each GC. Might be worth trying?

Other than that, how does removing System.gc() help make the test more reliable?

shipilev · 2022-07-22T10:49:51Z

TBH, not certain how GC policies play into that, and how well WB is supported in each GC. Might be worth trying?
Other than that, how does removing System.gc() help make the test more reliable?

When you call System.gc() on some collectors, they blow out the weak references, which includes the cache in question. The assert then fails, thinking there was no reason to clear the cache. That is, the assert assumes the "memory pressure" is the only way the cache would be dropped, which is evidently not true.

rkennke

Looks good, then. Thanks!

openjdk · 2022-07-22T11:36:14Z

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs

Reviewed-by: rkennke

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 91 new commits pushed to the master branch:

7841e5c: 8290464: Optimize ResourceArea zapping on ResourceMark release
e9f97b2: 8290489: Initial nroff manpage generation for JDK 20
59d85ba: 8290687: serviceability/sa/TestClassDump.java could leave files owned by root on macOS
66f59c2: 8290731: Clean up CDS handling of LambdaForm Species classes
0dda3c1: 8289275: Remove incorrect __declspec(dllimport) attributes from pointers in jdk.crypto.cryptoki
620c8a0: 8289643: File descriptor leak with ProcessBuilder.startPipeline
7ec0132: 8286844: com/sun/jdi/RedefineCrossEvent.java failed with 1 threads completed while VM suspended
80bd8c3: 8290504: Close streams returned by ModuleReader::list
15f4b30: 8290115: ArrayCopyObject JMH has wrong package
4c1cd66: 8288368: simplify code in ValueTaglet, remove redundant code
... and 81 more: https://git.openjdk.org/jdk/compare/87340fd5408d89d9343541ff4fcabde83548a598...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

shipilev · 2022-07-25T10:00:49Z

Thanks! Any other reviews, please?

plevart · 2022-07-25T13:04:46Z

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

shipilev · 2022-07-25T13:13:30Z

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

As I said above, there seems to be no such method. It would require hacking into each GC policy to prevent weakrefs clearing.

plevart · 2022-07-25T13:24:49Z

...my above claims might appear strange to some readers, as the test clearly wraps a cached ObjectStreamClass instance with a WeakReference and then checks whether this WeakReference is cleared or not. So an explanation is in order... The internal cache uses SoftReference(s). Since the cache is encapsulated, the test does not check it directly but uses a trick. It wraps the cached instance with a WeakReference. Since the instance is already wrapped with SoftReference in the cache and there is no strong reference holding it, the instance can be considered Softly Reachable. And we know that GC must clear all XxxReference(s) regardless of their type pointing to the same referent atomically. By detecting in the test that a WeakReference is cleared, we assume that the internal SoftReference is cleared too. If this was not the case, GC could be considered to have a bug and not to conform to the spec. So what we are observing now is either System.gc() clearing SoftReference or a GC bug. I don't believe it is a GC bug though.
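The trick described above can be sketched roughly as follows. This is a minimal illustration, not the actual test or the JDK's cache: the class SoftCacheProbe, the map CACHE and the methods lookup and probe are hypothetical names, and a plain map of SoftReference values stands in for the encapsulated ObjectStreamClass cache.

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCacheProbe {
    // Hypothetical stand-in for the encapsulated cache: values are held only via SoftReference.
    private static final Map<String, SoftReference<Object>> CACHE = new ConcurrentHashMap<>();

    // Returns the cached value, creating and caching a new one if the entry is absent or cleared.
    static Object lookup(String key) {
        SoftReference<Object> ref = CACHE.get(key);
        Object value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = new Object();
            CACHE.put(key, new SoftReference<>(value));
        }
        return value;
    }

    // Wraps the cached value in a WeakReference without keeping a strong reference to it.
    // Once this method returns, the value is only softly reachable (through CACHE).
    static WeakReference<Object> probe(String key) {
        return new WeakReference<>(lookup(key));
    }

    public static void main(String[] args) {
        WeakReference<Object> probe = probe("some.Class");
        System.gc(); // may or may not clear soft references, depending on the collector
        // If the probe is cleared, the SoftReference in CACHE must have been cleared too.
        System.out.println("cache entry cleared: " + (probe.get() == null));
    }
}
```

What main prints depends on the collector and its soft reference policy, which is exactly the flakiness discussed in this thread.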

plevart · 2022-07-25T13:32:04Z

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

As I said above, there seems to be no such method. It would require hacking into each GC policy to prevent weakrefs clearing.

Just to be on the same page... It is not about weakref clearing but about softref clearing. The test assumes that System.gc() would not clear a softref on the 1st call at least.

plevart · 2022-07-25T13:46:00Z

Contemplating how this test could be fixed without using WhiteBox gc testing... The test could wrap the cached instance with a WeakReference as it does now (ref1) and then have a second WeakReference that would wrap an instance of "new Object()" (ref2)... Then the test could gradually allocate more and more heap in a loop, checking both WeakReference(s) as it goes. When ref2 is cleared but ref1 is not, the test would succeed. Any other outcome (such as both refs being cleared at the same point) could be considered a failure. This would assume that GCs would 1st clear XxxReferences of weakly reachable referents and only much later those of softly reachable. I hope this property is universal to all GCs although it is not guaranteed by the spec.
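A rough sketch of that two-probe idea, under stated assumptions: TwoProbeSketch, Verdict, classify and run are hypothetical names, a local SoftReference stands in for the internal cache, and the loop allocates short-lived garbage through a static sink instead of retaining a growing heap, to avoid running into OutOfMemoryError on small heaps. It is not the test that was eventually written.

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class TwoProbeSketch {
    enum Verdict { UNDECIDED, PASS, FAIL }

    // Keeps the most recent allocation reachable so the allocation is not optimized away.
    static volatile byte[] sink;

    // ref2 (weak only) cleared while ref1 (soft-backed) survives: PASS.
    // ref1 cleared at any point, alone or together with ref2: FAIL.
    static Verdict classify(boolean weakOnlyCleared, boolean softBackedCleared) {
        if (softBackedCleared) {
            return Verdict.FAIL;
        }
        return weakOnlyCleared ? Verdict.PASS : Verdict.UNDECIDED;
    }

    static Verdict run(int maxChunks, int chunkBytes) {
        Object cached = new Object();
        SoftReference<Object> cacheEntry = new SoftReference<>(cached); // stand-in for the cache
        WeakReference<Object> ref1 = new WeakReference<>(cached);       // softly reachable referent
        WeakReference<Object> ref2 = new WeakReference<>(new Object()); // weakly reachable referent
        cached = null;
        try {
            for (int i = 0; i < maxChunks; i++) {
                sink = new byte[chunkBytes]; // allocation activity that eventually triggers GC
                Verdict v = classify(ref2.get() == null, ref1.get() == null);
                if (v != Verdict.UNDECIDED) {
                    return v;
                }
            }
            return Verdict.UNDECIDED;
        } finally {
            // Keep the SoftReference itself alive, otherwise the referent is only weakly reachable.
            Reference.reachabilityFence(cacheEntry);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(10_000, 1 << 20));
    }
}
```

An UNDECIDED result means no collection cleared either reference within the allocation budget, so a real test would have to decide whether to treat that as a pass, a failure, or a reason to allocate more.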

plevart · 2022-07-25T14:51:32Z

...I can take this over, unless you want to do it, Aleksey?

shipilev · 2022-07-25T16:41:04Z

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

shipilev · 2022-07-27T12:30:43Z

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

On the other hand, this test is in tier2, so it makes lots of testing with other GCs not clean. I would like to have this fix in, and then do any followups that might make the test more targeted.

plevart · 2022-07-28T11:21:25Z

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

On the other hand, this test is in tier2, so it makes lots of testing with other GCs not clean. I would like to have this fix in, and then do any followups that might make the test more targeted.

By removing System.gc() you effectively make the test a NO-OP. It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction. I would then rather just disable the test until it is fixed properly...

plevart · 2022-07-28T12:30:41Z

@shipilev you said that the test fails with some other GCs (Parallel, Shenandoah, ...). Who is responsible for selecting the GC algorithm? What I'm asking is whether one should explicitly enumerate several @run lines of the test with explicit selection of different GCs (which ones?) or is this a matter of some external test infrastructure that runs all the tests in the suite and pre-selects different GCs for them?

plevart · 2022-07-28T13:42:58Z

I managed to construct/fix test that passes on G1, Shenandoah and ZGC...

plevart@2dbce34

The combination of JVM options: -Xmx10m -XX:SoftRefLRUPolicyMSPerMB=1 ... makes ZGC and Shenandoah very aggressive and practically causes SoftReferences to be treated equally as WeakReferences. By increasing the max. heap size a bit and using defaults for SoftRefLRUPolicyMSPerMB (I believe 1000 ms/MB for G1 and 200 ms/MB for Shenandoah and ZGC), G1 and Shenandoah were happy with the changed test. For ZGC I also had to make sure that the CacheEffectiveness test is executed as 1st test in VM instance, followed by CacheReleaseUnderMemoryPressure.

plevart · 2022-07-28T14:23:17Z

So how do we proceed, @shipilev? Should I open another issue, or are you willing to accept my changes into your issue?

shipilev · 2022-07-28T14:33:40Z

So how do we proceed, @shipilev? Should I open another issue, or are you willing to accept my changes into your issue?

I prefer to commit the simpler (mine) version first, and then follow up with any more sophisticated attempt to fix it.

It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction.

Yes, but not really. There is still a 100ms sleep and reference processing involved, which somewhat verifies that the cache is not blown away immediately/accidentally.

Who is responsible to select the GC algorithm?

CIs routinely test existing suites with different GCs. The PR body contains a sample reproducer how that happens. If you are doing the separate @run statements, you need to also check for @requires vm.gc.XXX, etc.
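As an illustration of the last point, a jtreg test header with per-GC runs might look roughly like the sketch below. The id values and the choice of collectors are assumptions made for this example; the real header of ObjectStreamClassCaching.java also carries other tags and VM options that are not shown here.

```java
/*
 * @test id=Parallel
 * @requires vm.gc.Parallel
 * @run testng/othervm -XX:+UseParallelGC ObjectStreamClassCaching
 */

/*
 * @test id=Shenandoah
 * @requires vm.gc.Shenandoah
 * @run testng/othervm -XX:+UseShenandoahGC ObjectStreamClassCaching
 */

// The test class itself follows these header comments unchanged.
```

The @requires line makes jtreg skip a run when that collector is not available in the VM under test, or when another GC has been selected externally, for example through TEST_VM_OPTS.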

plevart · 2022-07-28T17:34:44Z

It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction.

Yes, but not really. There is still a 100ms sleep and reference processing involved, which somewhat verifies that the cache is not blown away immediately/accidentally.

Just a 100ms sleep means hardly any chance that reference processing will be involved as VM practically does nothing at that time to trigger GC. If you are referring to this:

        // to trigger any ReferenceQueue processing...
        lookupObjectStreamClass(AnotherTestClass.class);

...then the comment is perhaps misleading. Looking up another class just caches another entry and at the same time processes any enqueued References to remove stale cache entries. But for a Reference to be enqueued into a ReferenceQueue, GC threads must 1st clear it and link it into a "pending" list, ReferenceHandler java thread must then unlink it from the "pending" list and enqueue it into Reference's associated ReferenceQueue and only then can the request for another class process it from that ReferenceQueue. But the test is not observing that 3rd part (cleanup of stale cache entries). It observes the 1st part (atomic clearing of all XxxReferences that refer to the same referent) which is performed by GC. So this is hardly going to happen as the sole trigger for that (since System.gc() was removed) remains allocation of new objects and not enough may be allocated to trigger GC.
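The three stages can be illustrated with a small queue-backed cache. This is a sketch under assumptions, not the JDK's implementation: QueueBackedCache, Entry, entryFor and size are hypothetical names, and Reference.enqueue() is called by hand to simulate what the GC and the ReferenceHandler thread would normally do (stages 1 and 2), so that only stage 3, the cleanup on the next lookup, runs for real.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class QueueBackedCache {
    // A cache entry that remembers its key so it can be removed after being enqueued.
    static final class Entry extends SoftReference<Object> {
        final String key;

        Entry(String key, Object value, ReferenceQueue<Object> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<String, Entry> map = new HashMap<>();

    // Stage 3: each lookup first drains the queue and drops stale entries.
    Object lookup(String key) {
        expungeStaleEntries();
        Entry e = map.get(key);
        Object value = (e == null) ? null : e.get();
        if (value == null) {
            value = new Object();
            map.put(key, new Entry(key, value, queue));
        }
        return value;
    }

    private void expungeStaleEntries() {
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            Entry stale = (Entry) r;
            map.remove(stale.key, stale); // remove only if the key still maps to this entry
        }
    }

    Entry entryFor(String key) {
        return map.get(key);
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        QueueBackedCache cache = new QueueBackedCache();
        cache.lookup("A");
        cache.entryFor("A").enqueue(); // simulate stages 1 and 2 for the entry of "A"
        System.out.println("before next lookup: " + cache.size());
        cache.lookup("B");             // stage 3: the stale entry for "A" is expunged here
        System.out.println("after next lookup: " + cache.size());
    }
}
```

Note that the test's WeakReference probe only observes stage 1. The extra lookup exercises stage 3, which has nothing to do unless a collection has already cleared and enqueued the entry.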

But if your purpose was to shut up the failing test, then this is one way to do it.

Who is responsible to select the GC algorithm?

CIs routinely test existing suites with different GCs. The PR body contains a sample reproducer how that happens. If you are doing the separate @run statements, you need to also check for @requires vm.gc.XXX, etc.

Thanks for the hint. I'll add that.

plevart · 2022-07-28T17:45:13Z

What I was trying to say above is that without System.gc() the test would succeed even if caching was not actually there and lookup always constructed new ObjectStreamClass instance. There's not enough allocation activity to clear the WeakReference constructed in the test with the ObjectStreamClass referent even if that referent was only weakly reachable...

shipilev · 2022-07-28T18:10:44Z

All right, fine. This bug takes too much time. Take the bug, but problem-list the test first.

Fix

28426c2

shipilev mentioned this pull request Jul 18, 2022

8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs openjdk/jdk19#27

Closed

8 tasks

openjdk bot added the rfr label Jul 18, 2022

openjdk bot added the core-libs label Jul 18, 2022

rkennke approved these changes Jul 22, 2022

View reviewed changes

openjdk bot added the ready label Jul 22, 2022

shipilev closed this Jul 28, 2022

plevart mentioned this pull request Jul 29, 2022

8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs #9684

Closed

3 tasks

shipilev deleted the JDK-8283276-oscc-test branch September 5, 2022 13:53
