Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs #9533

Closed
wants to merge 1 commit into from

Conversation

shipilev
Copy link
Member

@shipilev shipilev commented Jul 18, 2022

Test appears to pass fine with G1. But it fails with other GCs, for example Parallel, Shenandoah, etc, it fails:

$ CONF=linux-x86_64-server-fastdebug make test TEST=java/io/ObjectStreamClass/ObjectStreamClassCaching.java TEST_VM_OPTS="-XX:+UseParallelGC"

test ObjectStreamClassCaching.testCacheReleaseUnderMemoryPressure(): success
test ObjectStreamClassCaching.testCachingEffectiveness(): failure
java.lang.AssertionError: Cache lost entry although memory was not under pressure expected [false] but found [true]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertFalse(Assert.java:67)

I believe this is because System.gc() is not that reliable about what happens with weak references. As seen with other GCs, they can clear the weakrefs on Full GC. In fact, the test fails with G1 if we do a second System.gc() in this test. So the test itself is flaky. The fix is to avoid doing System.gc() altogether in that subtest. The test is still retained to see that reference is not cleared for a while.

Additional testing:

  • Linux x86_64 fastdebug, affected test with -XX:+UseSerialGC, 100 repetitions
  • Linux x86_64 fastdebug, affected test with -XX:+UseParallelGC, 100 repetitions
  • Linux x86_64 fastdebug, affected test with -XX:+UseG1GC, 100 repetitions
  • Linux x86_64 fastdebug, affected test with -XX:+UseShenandoahGC, 100 repetitions
  • Linux x86_64 fastdebug, affected test with -XX:+UseZGC, 100 repetitions

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9533/head:pull/9533
$ git checkout pull/9533

Update a local copy of the PR:
$ git checkout pull/9533
$ git pull https://git.openjdk.org/jdk pull/9533/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9533

View PR using the GUI difftool:
$ git pr show -t 9533

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9533.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 18, 2022

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 18, 2022
@openjdk
Copy link

openjdk bot commented Jul 18, 2022

@shipilev The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Jul 18, 2022
@mlbridge
Copy link

mlbridge bot commented Jul 18, 2022

Webrevs

@shipilev
Copy link
Member Author

Still waiting for revies. @rkennke, maybe? :)

@rkennke
Copy link
Contributor

rkennke commented Jul 22, 2022

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

See for example: test/hotspot/jtreg/gc/TestReferenceRefersTo.java

@shipilev
Copy link
Member Author

shipilev commented Jul 22, 2022

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

I think the test relies on assumption that System.gc() does the weak reference clearing. Which is not given, for example if concurrent GC cycle is triggered with weak refs processing that purges the cache. I don't think we can solve this with GC control, unless we hack into the each of the GC's policies with Whitebox...

@rkennke
Copy link
Contributor

rkennke commented Jul 22, 2022

WhiteBox has a number of methods that allow better control over GC, e.g. fullGC(), concurrentGCRunToIdle(), etc. Would that help?

I think the test relies on assumption that System.gc() does the weak reference clearing. Which is not given, for example if concurrent GC cycle is triggered without weak refs processing. I don't think we can solve this with GC control, unless we hack into the each of the GC's policies with Whitebox...

Well, WB GC control can run concurrent GC, and wait until certain control points are reached or conc cycle is finished. An example that does that and checks referents get cleared is:

test/hotspot/jtreg/gc/TestReferenceClearDuringReferenceProcessing.java

TBH, not certain how GC policies play into that, and how well WB is supported in each GC. Might be worth trying?

Other that that, how does removinch System.gc() help make the test more reliable?

@shipilev
Copy link
Member Author

TBH, not certain how GC policies play into that, and how well WB is supported in each GC. Might be worth trying?
Other that that, how does removinch System.gc() help make the test more reliable?

When you call System.gc() on some collectors, they blow out the weak references, which includes the cache in question. The assert then fails, thinking there was no reason to clear the cache. That is, the assert assumes the "memory pressure" is the only way the cache would be dropped, which is evidently not true.

Copy link
Contributor

@rkennke rkennke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, then. Thanks!

@openjdk
Copy link

openjdk bot commented Jul 22, 2022

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283276: java/io/ObjectStreamClass/ObjectStreamClassCaching.java fails with various GCs

Reviewed-by: rkennke

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 91 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 22, 2022
@shipilev
Copy link
Member Author

Thanks! Any other reviews, please?

@plevart
Copy link
Contributor

plevart commented Jul 25, 2022

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

@shipilev
Copy link
Member Author

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

As I said above, there seem to be no such method. It would require hacking into each GC policy to prevent weakrefs clearing.

@plevart
Copy link
Contributor

plevart commented Jul 25, 2022

...my above claims might appear strange to some reader as the test clearly wraps a cached ObjectStreamClass instance with a WeakReference and then checks whether this WeakReference is cleared or not. So an explanation is in order... The internal cache uses SoftReference(s). Since the cache is encapsulated, the test does not check it directly but uses a trick. It wraps the cached instance with a WeakReference. Since the instance is already wrapped with SoftReference in the cache and there is no strong reference holding it, the instance can be considered Softly Reachable. And we know that GC must clear all XxxReference(s) regardless of their type pointing to the same referent atomically. By detecting in the test that a WeakReference is cleared, we assume that the internal SoftReference is cleared too. If this was not the case, GC could be considdered to have a bug and is not by the spec. So what we are observing now is either System.gc() clearing SoftReference or a GC bug. I don't belive it is a GC bug though.

@plevart
Copy link
Contributor

plevart commented Jul 25, 2022

Just a note that the failing test is not about checking that cache is cleared after plain System.gc(), but about checking that cache is NOT cleared after System.gc() when there was no real memory pressure. The cache uses SoftReference(s). So it appears that some gc(s) do clear SoftReference(s) on System.gc(). Does WhiteBox GC util have a means to provoke gc without SoftReference processing?

As I said above, there seem to be no such method. It would require hacking into each GC policy to prevent weakrefs clearing.

Just to be on the same page... It is not about weakref clearing but about softref clearing. The test assumes that System.gc() would not clear a softref on the 1st call at least.

@plevart
Copy link
Contributor

plevart commented Jul 25, 2022

Contemplating how this test could be fixed without using WhiteBox gc testing... The test could wrap the cached instance with a WeakReference as it does now (ref1) and then have a second WeakReference that would wrap an instance of "new Object()", (ref2)... Then the test coul gradually allocate more and more heap in a loop, checking both WeakReference(s) as it goes. When ref2 is cleared but ref1 is not, the test would succeed. Any other outcome ( such as both refs being cleared at the same point) could be considered a failure. This would assume that GCs would 1st clear XxxReferences of weakly reachable referents and only much later those of softly reachable. I hope this property is universal to all GCs although it is not guaranteed by the spec.

@plevart
Copy link
Contributor

plevart commented Jul 25, 2022

...I can take this over, unless you want to do it, Aleksey?

@shipilev
Copy link
Member Author

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

@shipilev
Copy link
Member Author

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

On the other hand, this test is in tier2, so it makes lots of testing with other GCs not clean. I would like to have this fix in, and then do any followups that might make the test more targeted.

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

...I can take this over, unless you want to do it, Aleksey?

I find it dubious to try and guess what GCs would do with non-strong refs, but feel free. Don't reassign the bug yet, just see how messy that would be?

On the other hand, this test is in tier2, so it makes lots of testing with other GCs not clean. I would like to have this fix in, and then do any followups that might make the test more targeted.

By removing System.gc() you effectively make the test a NO-OP. It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction. I would then rather just disable the test until it is fixed properly...

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

@shipilev you said that the test fails with some other GCs (Parallel, Shenandoah, ...). Who is responsible to select the GC algorithm? What I'm asking is whether one should explicitly enumerate several @Runs of the test with explicit selection of different GCs (which ones?) or is this a matter of some external test infrastructure that runs all the tests in the suite and pre-selects different GCs for them?

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

I managed to construct/fix test that passes on G1, Shenandoah and ZGC...

plevart@2dbce34

The combination of JVM options: -Xmx10m -XX:SoftRefLRUPolicyMSPerMB=1 ... makes ZGC and Shenandoah very aggressive and practically causes SoftReferences to be treated equally as WeakReferences. By increasing the max. heap size a bit and using defaults for SoftRefLRUPolicyMSPerMB (I believe 1000 ms/MB for G1 and 200 ms/MB for Shenandoah and ZGC), G1 and Shenandoah were happy with the changed test. For ZGC I also had to make sure that the CacheEffectiveness test is executed as 1st test in VM instance, followed by CacheReleaseUnderMemoryPressure.

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

So how do we proceed @shipilev ? I open another issue or are you willing to accept my changes into your issue?

@shipilev
Copy link
Member Author

So how do we proceed @shipilev ? I open another issue or are you willing to accept my changes into your issue?

I prefer to commit the simpler (mine) version first, and the follow up with any more sophisticated attempt to fix it.

It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction.

Yes, but not really. There is still a 100ms sleep and reference processing involved, which somewhat verifies that the cache is not blown away immediately/accidentally.

Who is responsible to select the GC algorithm?

CIs routinely test existing suites with different GCs. The PR body contains a sample reproducer how that happens. If you are doing the separate @run statements, you need to also check for @requires vm.gc.XXX, etc.

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

It then basically tests just that newly constructed SoftReference is not cleared in the next moment after construction.

Yes, but not really. There is still a 100ms sleep and reference processing involved, which somewhat verifies that the cache is not blown away immediately/accidentally.

Just a 100ms sleep means hardly any chance that reference processing will be involved as VM practically does nothing at that time to trigger GC. If you are referring to this:

        // to trigger any ReferenceQueue processing...
        lookupObjectStreamClass(AnotherTestClass.class);

...then the comment is perhaps misleading. Looking up another class just caches another entry and at the same time processes any enqueued References to remove stale cache entries. But for a Reference to be enqueued into a ReferenceQueue, GC threads must 1st clear it and link it into a "pending" list, ReferenceHandler java thread must then unlink it from the "pending" list and enqueue it into Reference's associated ReferenceQueue and only then can the request for another class process it from that ReferenceQueue. But the test is not observing that 3rd part (cleanup of stale cache entries). It observes the 1st part (atomic clearing of all XxxReferences that refer to the same referent) which is performed by GC. So this is hardly going to happen as the sole trigger for that (since System.gc() was removed) remains allocation of new objects and not enough may be allocated to trigger GC.

But if your purpose was to shut up the failing test, then this is one way to do it.

Who is responsible to select the GC algorithm?

CIs routinely test existing suites with different GCs. The PR body contains a sample reproducer how that happens. If you are doing the separate @run statements, you need to also check for @requires vm.gc.XXX, etc.

Thanks for the hint. I'll add that.

@plevart
Copy link
Contributor

plevart commented Jul 28, 2022

What I was trying to say above is that without System.gc() the test would succeed even if caching was not actually there and lookup always constructed new ObjectStreamClass instance. There's not enough allocation activity to clear the WeakReference constructed in the test with the ObjectStreamClass referent even if that referent was only weakly reachable...

@shipilev
Copy link
Member Author

All right, fine. This bug takes too much time. Take the bug, but problem-list the test first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org ready Pull request is ready to be integrated rfr Pull request is ready for review
3 participants