Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8306767: Concurrent repacking of extra data in MethodData is potentially unsafe #16840

Closed
wants to merge 27 commits into from

Conversation

eme64
Copy link
Contributor

@eme64 eme64 commented Nov 28, 2023

As explained in a comment below, we have to ensure that reading/writing/cleaning the extra data all needs to be guarded by the extra_data_lock, and that no safepoint should happen while holding that lock, so that the lock is not broken.

I introduced check_extra_data_locked, where I check that we hold the lock, and if we are a java thread (only those ever safepoint), that we currently are in a NoSafepointVerifier scope, hence we verify that no safepoint will be taken.

I placed check_extra_data_locked in all the places where we access the extra data, and then placed locks (with implicit no-safepoint-verifiers) at the call-site of those places.

I also needed to change the rank of extra_data_lock to nosafepoint and set the Mutex::_no_safepoint_check_flag when taking the lock. Otherwise I could not take the lock from a VM thread.

Testing
Testing: tier1-3 and stress.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8306767: Concurrent repacking of extra data in MethodData is potentially unsafe (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16840/head:pull/16840
$ git checkout pull/16840

Update a local copy of the PR:
$ git checkout pull/16840
$ git pull https://git.openjdk.org/jdk.git pull/16840/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16840

View PR using the GUI difftool:
$ git pr show -t 16840

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16840.diff

Webrev

Link to Webrev Comment

Sorry, something went wrong.

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 28, 2023

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title 8306767 8306767: Concurrent repacking of extra data in MethodData is potentially unsafe Nov 28, 2023
@openjdk
Copy link

openjdk bot commented Nov 28, 2023

@eme64 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Nov 28, 2023
@eme64 eme64 marked this pull request as ready for review November 28, 2023 14:25
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 28, 2023
Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@openjdk
Copy link

openjdk bot commented Nov 28, 2023

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8306767: Concurrent repacking of extra data in MethodData is potentially unsafe

Reviewed-by: eosterlund, roland, coleenp, never

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 44 new commits pushed to the master branch:

  • ffe3bb6: 8324657: Intermittent OOME on exception message create
  • e709842: 8324636: Serial: Remove Generation::block_is_obj
  • 7a798d3: 8324598: use mem_unit when working with sysinfo memory and swap related information
  • 6d36eb7: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target.
  • 9d1a6d1: 8323645: Remove unused internal sun.net.www.protocol.jar.URLJarFileCallBack interface
  • 3059c3b: 8324242: Avoid null check for OopHandle::ptr_raw()
  • 929af9e: 8307788: vmTestbase/gc/gctests/LargeObjects/large003/TestDescription.java timed out
  • e7fdac9: 8324280: RISC-V: Incorrect implementation in VM_Version::parse_satp_mode
  • 3d32c46: 6503196: API doc for DecimalFormat::getMaximumIntegerDigits is unclear
  • 2d5cb97: 8324647: Invalid test group of lib-test after JDK-8323515
  • ... and 34 more: https://git.openjdk.org/jdk/compare/c84af4938647efbc2d6c94efef748446bf6d50b4...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 28, 2023
@@ -2494,7 +2495,10 @@ Deoptimization::query_update_method_data(MethodData* trap_mdo,
// Find the profile data for this BCI. If there isn't one,
// try to allocate one from the MDO's set of spares.
// This will let us detect a repeated trap at this point.
pdata = trap_mdo->allocate_bci_to_data(trap_bci, reason_is_speculate(reason) ? compiled_method : nullptr);
{
MutexLocker ml(trap_mdo->extra_data_lock());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the lock have to be held over the lifetime of the pdata variable? Otherwise an intervening safepoint could repack the MDO rendering pdata invalid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. That was done in a similar place above but not here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So is it that I just have to have a lock on allocate_bci_to_data, or do I have to protect pdata? Because pdata is returned from Deoptimization::query_update_method_data, and so I would have to make the lock take a wider scope than that function, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock needs to be held while accessing the contents of pdata which is very complicated in this particular usage pattern. Maybe the caller needs to refetch the pointer under a lock instead of passing it out of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
And how about other usages of ProfileData from the mdo, like these (there is a few variants of them around):

ProfileData* data = mdo->first_data();
ProfileData* data = mdo->bci_to_data(bci);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there are 2 uses of query_update_method_data. One does not use the return pdata. The other uses it and in some cases updates it. Do you think it is safe to just re-fetch it, or would that potentially cut some connection between the two that should not be cut?
The alternative is just to already get the lock before calling query_update_method_data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that anything that can return data from the extra data section is a potential danger. bci_to_data calls bci_to_extra_data at the end so it seems potentially unsafe which seems like a huge problem since that's used all over the place. Whether the callers are actually getting or expecting record from extra data is unclear. I would suspect that most places where it's used there should already be a preallocated record. The concurrent repacking really makes it hard to ensure the accesses are safe. I think the API would need to make a stronger split between preallocated records and records which might come from the extra data section. I'm honestly not sure how to make this truly safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So this issue is much bigger than query_update_method_data and allocate_bci_to_data, is what you are saying. Sounds like I need to study this much deeper. Maybe we need to refactor the while way we access the records? Maybe any access to the records needs to be guarded with a lock, just to be safe?
If there are concurrent updates, which are guarded by lock, then should not all reads also be guarded? In the end maybe we just need to make all accesses mutually exclusive, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, reading data from the extra data section, requires holding the lock.

@eme64
Copy link
Contributor Author

eme64 commented Nov 30, 2023

Update:
The current fix is not sufficient. I'll summarize my state of knowledge:

We have these accesses to the extra data:

  • read / update of a record (race conditions for counter updates are acceptable, they just make the result imprecise but not incorrect).
  • allocating a new record
  • cleaning out stale records

I have had conversations with @fisk and @rwestrel about which can happen concurrently. There may have been a time when cleaning only happened during SafePoint, and hence there probably was no concurrent read / allocation with it. But this has changed, certainly with ZGC, but also with WB_ClearMethodState (call clearing at any time).

The data structure uses the concept of "monotonicity" to ensure that reads are safe, even with concurrent allocations. The idea is that new records are only appended (into the section with all DataLayout::arg_info_data_tag).

ProfileData* MethodData::bci_to_extra_data_helper(int bci, Method* m, DataLayout*& dp, bool concurrent) {
DataLayout* end = args_data_limit();
for (;; dp = next_extra(dp)) {
assert(dp < end, "moved past end of extra data");
// No need for "Atomic::load_acquire" ops,
// since the data structure is monotonic.
switch(dp->tag()) {

We wrap the extra data in BitData or SpeculativeTrapData objects, which are nothing but pointers to the extra data array/buffer.

Of course concurrent allocations have to be made mutually exclusive, hence we need the extra_data_lock. And @tkrodriguez has found an instance where a lock was not taken, hence this bug report.

The problem now gets more complicated with cleaning. It removes records, and shifts other records "to the left" to compact the extra data. This certainly breaks monotonicity. This would be ok if there could be no concurrent read/update or allocation. But that seems certainly not to hold anymore now.

void MethodData::clean_extra_data_helper(DataLayout* dp, int shift, bool reset) {
if (shift == 0) {
return;
}
if (!reset) {
// Move all cells of trap entry at dp left by "shift" cells
intptr_t* start = (intptr_t*)dp;
intptr_t* end = (intptr_t*)next_extra(dp);
for (intptr_t* ptr = start; ptr < end; ptr++) {
*(ptr-shift) = *ptr;
}
} else {
// Reset "shift" cells stopping at dp
intptr_t* start = ((intptr_t*)dp) - shift;
intptr_t* end = (intptr_t*)dp;
for (intptr_t* ptr = start; ptr < end; ptr++) {
*ptr = 0;
}
}
}

We can protect all of these with locks. But that still is not good enough if there are concurrent reads, which make assumptions about the offsets in the extra data array/buffer. Imagine this:

Thread A: starts iterating over the extra data.
Thread B: cleans out some stale records, and shifts around things.
Thread A: continues iterating, with offsets that are now possibly not correct any more.

I have another concern about what we do today during allocation:

if (create_if_missing && dp < end) {
MutexLocker ml(&_extra_data_lock);
// Check again now that we have the lock. Another thread may
// have added extra data entries.
ProfileData* result = bci_to_extra_data_helper(bci, m, dp, false);
if (result != nullptr || dp >= end) {
return result;
}
assert(dp->tag() == DataLayout::no_tag || (dp->tag() == DataLayout::speculative_trap_data_tag && m != nullptr), "should be free");
assert(next_extra(dp)->tag() == DataLayout::no_tag || next_extra(dp)->tag() == DataLayout::arg_info_data_tag, "should be free or arg info");
u1 tag = m == nullptr ? DataLayout::bit_data_tag : DataLayout::speculative_trap_data_tag;
// SpeculativeTrapData is 2 slots. Make sure we have room.
if (m != nullptr && next_extra(dp)->tag() != DataLayout::no_tag) {
return nullptr;
}
DataLayout temp;
temp.initialize(tag, checked_cast<u2>(bci), 0);
dp->set_header(temp.header());
assert(dp->tag() == tag, "sane");
assert(dp->bci() == bci, "no concurrent allocation");
if (tag == DataLayout::bit_data_tag) {
return new BitData(dp);
} else {
SpeculativeTrapData* data = new SpeculativeTrapData(dp);
data->set_method(m);
return data;
}
}

We take the lock, and append a new record.
But is this writing done safely, such that a concurrent read would be safe? Because currently it must be.
We do set_header, which does a u8 store to the extra data, including the tag. But the payload object is stored separately. I wonder what here is guaranteed to be atomic, maybe the header is stored atomically. But a concurrent read could certainly find complete garbage instead of the (not yet written) payload object. What if a read thus sees the tag for a SpeculativeTrapData, and we read its data->method() field, and it is garbage but not nullptr?

In the light of this, I see 2 alternatives:

  1. add locks everywhere, and make sure no ProfileData* pdata pointer leaks out of a lock. This may take a hit on performance, though @rwestrel thinks that accesses are rare, and if they are not rare then other things may be the bottleneck.
  2. have a completely new data structure that is safe and does not require locking everywhere.

@openjdk
Copy link

openjdk bot commented Nov 30, 2023

@eme64 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8306767
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed ready Pull request is ready to be integrated labels Nov 30, 2023
@openjdk openjdk bot added ready Pull request is ready to be integrated and removed merge-conflict Pull request has merge conflict with target branch labels Nov 30, 2023
@tkrodriguez
Copy link
Contributor

I agree with Roland that none of the accesses from C++ code are performance critical so always requiring a lock shouldn't matter for performance. It is somewhat intrusive though. The alternative is to make the API distinguish clearly between preallocated data and the extra data and adjust all internal usages to select the right one. For instance speculative_trap_data_tag usages are only in the extra_data section so access to those could have a distinct API. The only other thing in extra_data are bit_data_tag so any caller that's looking for a different kind of record is always safe since it must be from the preallocated section. A change like that might be clarifying in general as well but I think there's a question of effort vs benefit.

Also to clarify, I never actually observed this problem in practice but inferred the possibility while addressing MDO concurrency issues with Graal. It would be very hard to notice and very transient but it could lead to crashes since SpeculativeTrapData contains a Method*.

@eme64
Copy link
Contributor Author

eme64 commented Dec 1, 2023

@tkrodriguez Yes, we could now do quite the refactoring. The question is how far I should take this. I think I will just make things safe with locks now, and then if someone really wants to fully refactor all of this, then they can do that separately.

@tkrodriguez
Copy link
Contributor

Sounds reasonable to me.

eme64 added 2 commits January 15, 2024 12:55

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
rm empty line
@eme64 eme64 requested a review from rwestrel January 15, 2024 12:32
Copy link
Contributor

@rwestrel rwestrel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for making the changes. Do we want to file a bug to refactor the code?

@eme64
Copy link
Contributor Author

eme64 commented Jan 17, 2024

@rwestrel thanks for the review! I pushed a comment improvement.

What kind of refactoring are you thinking about? Just to remove the MutexUnlocker occurrence, or something bigger?

@rwestrel
Copy link
Contributor

@rwestrel thanks for the review! I pushed a comment improvement.

What kind of refactoring are you thinking about? Just to remove the MutexUnlocker occurrence, or something bigger?

I was thinking maybe something along the lines of what Tom suggested: " I think the API would need to make a stronger split between preallocated records and records which might come from the extra data section. "
It doesn't seem right that the API can return a pointer to something that's not safe to access unless we own a lock too.

@eme64
Copy link
Contributor Author

eme64 commented Jan 17, 2024

@rwestrel Ok, makes sense. Would you want to do this?
I mostly addressed this because it is a bug, but otherwise I have my focus elsewhere.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't yet reviewed all of this but this mechanism seems unnecessary and I'd like to understand why this would be added.

NoSafepointMutexLocker(Mutex* mutex, Mutex::SafepointCheckFlag flag = Mutex::_safepoint_check_flag) :
NoSafepointMutexLocker(mutex, true, flag) {}
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add this. The locks that are no-safepoint-check-flags have implicit NoSafepointVerifier logic when you take them out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Thanks for the hint. I will look into this. It is a bit confusing what the locks do and do not do with SafePoints. Do they just not safepoint when trying to acquire the lock, or also verify that no safepoint is made while holding the lock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They verify that no safepoints happen while holding the lock. There's a counter in JavaThread.

#ifdef ASSERT
  // Debug support for checking if code allows safepoints or not.
  // Safepoints in the VM can happen because of allocation, invoking a VM operation, or blocking on
  // mutex, or blocking on an object synchronizer (Java locking).
  // If _no_safepoint_count is non-zero, then an assertion failure will happen in any of
  // the above cases. The class NoSafepointVerifier is used to set this counter.
  int _no_safepoint_count;                             // If 0, thread allow a safepoint to happen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the comment should be updated to say we increment this when taking out a no-safepoint-check mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I removed the NoSafepointVerifier. It seems to work, and pass my verification that we are in a no_safepoint scope.

method()->print_codes_on(&ss);
print_code_on(&ss);
print_pcs_on(&ss);
tty->print("%s", ss.as_string()); // print all at once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like these ttyLocker changes should be checked in as a different cleanup, ie removing ttyLocker is a really good thing. Can you make these changes a separate patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I filed the RFE:
JDK-8324129 C2: Remove some ttyLocker usages in preparation for JDK-8306767
#17486

@eme64
Copy link
Contributor Author

eme64 commented Jan 18, 2024

@coleenp thanks for the suggestions.

I will first integrate this RFE: #17486 (please review 😉 )

Then I will integrate and merge it to here.

@eme64
Copy link
Contributor Author

eme64 commented Jan 22, 2024

@coleenp the other change is integrated and merged to here.
@tkrodriguez @fisk @rwestrel would you mind (re-)revieing?

Copy link
Contributor

@fisk fisk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very promising to me. Very nice with the asserts checking the appropriate locking is there when using the extra data section. Great job!

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better. I have a couple of small requests. Thanks.

uint arg_modified(int a) { ArgInfoData *aid = arg_info();
uint arg_modified(int a) { // Lock and avoid breaking lock with Safepoint
MutexLocker ml(extra_data_lock(), Mutex::_no_safepoint_check_flag);
ArgInfoData *aid = arg_info();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these to methodData.inline.hpp so that you don't have to include mutexLocker.hpp in a header file?

JavaThread::current()->is_in_no_safepoint_scope(),
"JavaThread must have NoSafepointVerifier inside lock scope");
#endif
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not performance critical, can you move this to the .cpp file?

Copy link
Contributor

@tkrodriguez tkrodriguez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, code movement looks good. thank you!

@eme64
Copy link
Contributor Author

eme64 commented Jan 25, 2024

Thanks @tkrodriguez @fisk @rwestrel @coleenp for all your help, conversations and suggestions!
One more potentially highly intermittent bug fixed.
/integrate

@openjdk
Copy link

openjdk bot commented Jan 25, 2024

Going to push as commit 746a086.
Since your change was applied there have been 44 commits pushed to the master branch:

  • ffe3bb6: 8324657: Intermittent OOME on exception message create
  • e709842: 8324636: Serial: Remove Generation::block_is_obj
  • 7a798d3: 8324598: use mem_unit when working with sysinfo memory and swap related information
  • 6d36eb7: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target.
  • 9d1a6d1: 8323645: Remove unused internal sun.net.www.protocol.jar.URLJarFileCallBack interface
  • 3059c3b: 8324242: Avoid null check for OopHandle::ptr_raw()
  • 929af9e: 8307788: vmTestbase/gc/gctests/LargeObjects/large003/TestDescription.java timed out
  • e7fdac9: 8324280: RISC-V: Incorrect implementation in VM_Version::parse_satp_mode
  • 3d32c46: 6503196: API doc for DecimalFormat::getMaximumIntegerDigits is unclear
  • 2d5cb97: 8324647: Invalid test group of lib-test after JDK-8323515
  • ... and 34 more: https://git.openjdk.org/jdk/compare/c84af4938647efbc2d6c94efef748446bf6d50b4...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jan 25, 2024
@openjdk openjdk bot closed this Jan 25, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jan 25, 2024
@openjdk
Copy link

openjdk bot commented Jan 25, 2024

@eme64 Pushed as commit 746a086.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

None yet

5 participants