Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) #23630

Closed
wants to merge 22 commits into from

Conversation

dafedafe
Copy link
Contributor

@dafedafe dafedafe commented Feb 14, 2025

Issue

The test src/hotspot/share/opto/c2compiler.cpp fails intermittently due to a crash that happens when trying to allocate code cache space for C1 and C2 in RuntimeStub::new_runtime_stub and SingletonBlob::operator new.

Causes

There are a few call paths during the initialization of C1 and C2 that can lead to the code cache allocations in RuntimeStub::new_runtime_stub (through RuntimeStub::operator new) and SingletonBlob::operator new triggering a fatal error if there is no more space. The paths in question are:

  1. Compiler::init_c1_runtime -> Runtime1::initialize -> Runtime1::generate_blob_for -> Runtime1::generate_blob -> RuntimeStub::new_runtime_stub
  2. C2Compiler::initialize -> C2Compiler::init_c2_runtime -> OptoRuntime::generate -> OptoRuntime::generate_stub -> Compile::Compile -> Compile::Code_Gen -> PhaseOutput::install -> PhaseOutput::install_stub -> RuntimeStub::new_runtime_stub
  3. C2Compiler::initialize -> C2Compiler::init_c2_runtime -> OptoRuntime::generate -> OptoRuntime::generate_uncommon_trap_blob -> UncommonTrapBlob::create -> new UncommonTrapBlob
  4. C2Compiler::initialize -> C2Compiler::init_c2_runtime -> OptoRuntime::generate -> OptoRuntime::generate_exception_blob -> ExceptionBlob::create -> new ExceptionBlob

Solution

Instead of fatally crashing the we can use the alloc_fail_is_fatal flag of RuntimeStub::new_runtime_stub to avoid crashing in cases 1 and 2 and add a similar flag to SingletonBlob::operator new for cases 3 and 4. In the latter case we need to adjust all calls accordingly.

Note: In JDK-8326615 it was argued that increasing the minimum code cache size would solve the issue but that wasn't entirely accurate: doing so possibly decreases the chances of a failed allocation in these 4 places but doesn't totally avoid it.

Testing

The original failing regression test in test/hotspot/jtreg/compiler/startup/StartupOutput.java has been modified to run multiple times with randomized values (within the original failing range) to increase the chances of hitting the fatal assertion.

Tests: Tier 1-4 (windows-x64, linux-x64/aarch64, and macosx-x64/aarch64; release and debug mode)


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23630/head:pull/23630
$ git checkout pull/23630

Update a local copy of the PR:
$ git checkout pull/23630
$ git pull https://git.openjdk.org/jdk.git pull/23630/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23630

View PR using the GUI difftool:
$ git pr show -t 23630

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23630.diff

Using Webrev

Link to Webrev Comment

Sorry, something went wrong.

dafedafe and others added 13 commits February 10, 2025 15:30

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…nal Error (codeBlob.cpp:429) Initial size of CodeCache is too small
Co-authored-by: Tobias Hartmann <tobias.hartmann@oracle.com>
Co-authored-by: Tobias Hartmann <tobias.hartmann@oracle.com>
Co-authored-by: Tobias Hartmann <tobias.hartmann@oracle.com>
Co-authored-by: Tobias Hartmann <tobias.hartmann@oracle.com>
…ime::generate
@bridgekeeper
Copy link

bridgekeeper bot commented Feb 14, 2025

👋 Welcome back dfenacci! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Feb 14, 2025

@dafedafe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash)

Reviewed-by: dlong, adinn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 388 new commits pushed to the master branch:

  • 5709f79: 8352648: JFR: 'jfr query' should not be available in product builds
  • 02a4ce2: 8352147: G1: TestEagerReclaimHumongousRegionsClearMarkBits test takes very long
  • de58009: 8351468: C2: array fill optimization assigns wrong type to intrinsic call
  • a875733: 8352486: [ubsan] compilationMemoryStatistic.cpp:659:21: runtime error: index 64 out of bounds for type const struct unnamed struct
  • 5591f8a: 8351515: C2 incorrectly removes double negation for double and float
  • 56a4ffa: 8352597: [IR Framework] test bug: TestNotCompilable.java fails on product build
  • e23e0f8: 8352591: Missing UnlockDiagnosticVMOptions in VerifyGraphEdgesWithDeadCodeCheckFromSafepoints test
  • adfb120: 8351748: Add class init barrier to AOT-cached Method/Var Handles
  • ee1577b: 8352652: [BACKOUT] nsk/jvmti/ tests should fail when nsk_jvmti_setFailStatus() is called
  • df9210e: 8347706: jvmciEnv.cpp has jvmci includes out of order
  • ... and 378 more: https://git.openjdk.org/jdk/compare/efbad00c4d7931177ccc5e9bce3b30dfbac94010...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title JDK-8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) 8347406: [REDO] C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) Feb 14, 2025
@openjdk
Copy link

openjdk bot commented Feb 14, 2025

@dafedafe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8347406
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Feb 14, 2025
@openjdk
Copy link

openjdk bot commented Feb 14, 2025

@dafedafe The following labels will be automatically applied to this pull request:

  • hotspot
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Feb 14, 2025
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 19, 2025
@dafedafe dafedafe marked this pull request as ready for review February 20, 2025 16:27
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 20, 2025
@mlbridge
Copy link

mlbridge bot commented Feb 20, 2025

Webrevs

@dafedafe
Copy link
Contributor Author

@adinn I noticed I touched quite a few runtime files that you recently refactored. I guess it might make sense if you had a look at them. Thanks!

@dean-long
Copy link
Member

I don't understand why JDK-8326615 didn't work. If the minimum codecache size was too small, can't we just increase it?

}
return _pre_barrier_c1_runtime_code_blob != nullptr && reference_barrier_success;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to return false immediately after each failure, rather than continuing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, totally. Fixed.

@@ -540,4 +540,8 @@ void ZBarrierSetC1::generate_c1_runtime_stubs(BufferBlob* blob) {
generate_c1_store_runtime_stub(blob, true /* self_healing */, "store_barrier_on_oop_field_with_healing");
_store_barrier_on_oop_field_without_healing =
generate_c1_store_runtime_stub(blob, false /* self_healing */, "store_barrier_on_oop_field_without_healing");
return _load_barrier_on_oop_field_preloaded_runtime_stub != nullptr &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why not return false immediately on first failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed too.

if (rs == nullptr) {
C->record_failure("CodeCache is full");
} else {
C->set_stub_entry_point(rs->entry_point());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the deleted rs->is_runtime_stub() assert still useful here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slightly modified one surely is. Inserted it again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it could be moved into the else clause and simplified further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see 👍. Moved. Thanks!

@dafedafe
Copy link
Contributor Author

I don't understand why JDK-8326615 didn't work. If the minimum codecache size was too small, can't we just increase it?

It seems that the small minimum codecache size wasn’t the core of the problem. In JDK-8326615 I’ve actually tried increasing the minimum code cache size (even calculating the minimum C1 and C2 sizes needed) but I kept hitting the same problem seemingly because what triggers the crash is the code cache exceeding its limits exactly during one of the allocations in the 4 call paths listed in the description (any allocation at another point simply turns off C1 or C2 (or both)). Increasing the minimum was possibly making it a bit less probable (i.e. when running something small that would use very little code cache, e.g. java -version) but as soon as running something a bit bigger (requiring more code cache), we could still end up calling the problematic allocations right when the code cache was full.
BTW even if the crash was very dependent on C1/C2 thread scheduling etc., I was able to reproduce it with different minimum code cache sizes relatively frequently.

Even so, it might be a good idea to additionally increase the minimum code cache anyway. @dean-long do you think it would make sense to file an RFE for that?

dafedafe and others added 3 commits February 27, 2025 09:31
Co-authored-by: Dean Long <17332032+dean-long@users.noreply.github.com>
@dean-long
Copy link
Member

Refreshing my memory, isn't the real problem with trying to fix this with a minimum codecache size is that some of these stubs are not allocated during initial single-threaded JVM startup, but later when the first compiler threads start, and that allows other code blobs to fill up the codecache?

@dean-long
Copy link
Member

Even so, it might be a good idea to additionally increase the minimum code cache anyway. @dean-long do you think it would make sense to file an RFE for that?

Sure, if it's still an issue.

@dafedafe
Copy link
Contributor Author

dafedafe commented Mar 3, 2025

Refreshing my memory, isn't the real problem with trying to fix this with a minimum codecache size is that some of these stubs are not allocated during initial single-threaded JVM startup, but later when the first compiler threads start, and that allows other code blobs to fill up the codecache?

Yes, exactly. This seems to be even more of an issue with 2 compiler threads (i.e. C1/C2) since the first can fill up the code cache first at the expense of the other.
The result is that if one compiler thread tries to allocate more space in a full code cache during initialization with one of the 4 call paths above, the VM crashes (but could actually just turn off the compiler thread instead).

@adinn
Copy link
Contributor

adinn commented Mar 24, 2025

@dean-long You are right that the root of the problem is delayed stub init when running in tiered mode. It's actually a race between the C1 and C2 threads to consume code cache space.

Neither C1 nor C2 can generate compiled method code before they have initialized their respective stubs. However, initialization of C2 stubs may potentially occur after a C1 compiler thread has initialized its own stubs and generated some number of C1-compiled methods.

Clearly that can't happen the other way around because in tiered mode a C2 thread will only end up compiling a method after 1) a prior C1-compiled method exists in the runtime and 2) its own stubs have been generated. An implication of 1) is that the C1 stubs have already been generated.

Copy link
Member

@dean-long dean-long left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 24, 2025
Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me too.

@dafedafe
Copy link
Contributor Author

Thank you very much for your reviews @dean-long and @adinn!

@dafedafe
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Mar 25, 2025

Going to push as commit 48fac66.
Since your change was applied there have been 400 commits pushed to the master branch:

  • 99c8a6e: 8350463: AArch64: Add vector rearrange support for small lane count vectors
  • b2da0d3: 8352289: [macos] Review skipped tests in tools/jpackage/macosx/SigningPackage*
  • ba658a7: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
  • 5625b43: 8350429: runtime/NMT/CheckForProperDetailStackTrace.java should only run for debug JVM
  • 2c60fc5: 8352176: Automate setting up environment for mac signing tests
  • 6e6a39d: 8347321: [ubsan] CGGlyphImages.m:553:30: runtime error: nan is outside the range of representable values of type 'unsigned long'
  • b84b292: 8352615: [Test] RISC-V: TestVectorizationMultiInvar.java fails on riscv64 without rvv support
  • a54445f: 8350609: Cleanup unknown unwind opcode (0xB) for windows
  • c87e1be: 8349582: APX NDD code generation for OpenJDK
  • 7d1fe0e: 8339543: [vectorapi] laneHelper and withLaneHelper should be ForceInline
  • ... and 390 more: https://git.openjdk.org/jdk/compare/efbad00c4d7931177ccc5e9bce3b30dfbac94010...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 25, 2025
@openjdk openjdk bot closed this Mar 25, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 25, 2025
@openjdk
Copy link

openjdk bot commented Mar 25, 2025

@dafedafe Pushed as commit 48fac66.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated shenandoah shenandoah-dev@openjdk.org
Development

Successfully merging this pull request may close these issues.

None yet

3 participants