8306767: Concurrent repacking of extra data in MethodData is potentially unsafe #16840
Conversation
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into
Webrevs
Looks good!
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 44 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
```cpp
@@ -2494,7 +2495,10 @@ Deoptimization::query_update_method_data(MethodData* trap_mdo,
      // Find the profile data for this BCI. If there isn't one,
      // try to allocate one from the MDO's set of spares.
      // This will let us detect a repeated trap at this point.
      pdata = trap_mdo->allocate_bci_to_data(trap_bci, reason_is_speculate(reason) ? compiled_method : nullptr);
      {
        MutexLocker ml(trap_mdo->extra_data_lock());
```
Doesn't the lock have to be held over the lifetime of the pdata variable? Otherwise an intervening safepoint could repack the MDO, rendering pdata invalid.
Good point. That was done in a similar place above but not here.
Hmm. So is it that I just have to have a lock on allocate_bci_to_data, or do I have to protect pdata? Because pdata is returned from Deoptimization::query_update_method_data, and so I would have to make the lock take a wider scope than that function, right?
The lock needs to be held while accessing the contents of pdata which is very complicated in this particular usage pattern. Maybe the caller needs to refetch the pointer under a lock instead of passing it out of this method.
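The re-fetch idea suggested here can be sketched as a standalone model (plain C++ with std::mutex, not actual HotSpot code; `Record` and `MdoModel` are made-up names for illustration): instead of passing a raw record pointer out of the locked region, the caller re-derives the record from its bci and does all the work on it while still holding the lock.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Simplified model: records keyed by bci, stored in a container that a
// concurrent "repack" could reorder or compact (invalidating raw pointers).
struct Record { int bci; int trap_count; };

struct MdoModel {
    std::mutex extra_data_lock;
    std::vector<Record> records;

    // Must be called with extra_data_lock held.
    Record* find(int bci) {
        for (Record& r : records)
            if (r.bci == bci) return &r;
        return nullptr;
    }
};

// Instead of returning a Record* (which could dangle after a repack),
// re-fetch and update the record inside one locked region.
void bump_trap_count(MdoModel& mdo, int bci) {
    std::lock_guard<std::mutex> ml(mdo.extra_data_lock);
    Record* r = mdo.find(bci);            // re-fetch under the lock
    if (r == nullptr) {
        mdo.records.push_back({bci, 0});  // allocate a fresh record
        r = &mdo.records.back();
    }
    r->trap_count++;                      // safe: no repack can intervene here
}
```

The key design point is that no pointer into the repackable storage ever escapes the locked scope.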
I see.
And how about other usages of ProfileData from the mdo, like these (there are a few variants of them around)?

```cpp
ProfileData* data = mdo->first_data();
ProfileData* data = mdo->bci_to_data(bci);
```
And there are 2 uses of query_update_method_data. One does not use the returned pdata. The other uses it and in some cases updates it. Do you think it is safe to just re-fetch it, or would that potentially cut some connection between the two that should not be cut?
The alternative is just to take the lock already before calling query_update_method_data.
I think that anything that can return data from the extra data section is a potential danger. bci_to_data calls bci_to_extra_data at the end, so it seems potentially unsafe, which seems like a huge problem since that's used all over the place. Whether the callers are actually getting or expecting a record from extra data is unclear. I would suspect that in most places where it's used there should already be a preallocated record. The concurrent repacking really makes it hard to ensure the accesses are safe. I think the API would need to make a stronger split between preallocated records and records which might come from the extra data section. I'm honestly not sure how to make this truly safe.
Ok. So this issue is much bigger than query_update_method_data and allocate_bci_to_data, is what you are saying. Sounds like I need to study this much deeper. Maybe we need to refactor the whole way we access the records? Maybe any access to the records needs to be guarded with a lock, just to be safe?
If there are concurrent updates, which are guarded by a lock, then shouldn't all reads also be guarded? In the end maybe we just need to make all accesses mutually exclusive, right?
Yeah, reading data from the extra data section requires holding the lock.
Update: We have these accesses to the extra data:
I have had conversations with @fisk and @rwestrel about which can happen concurrently. There may have been a time when cleaning only happened during a safepoint, and hence there probably was no concurrent read / allocation with it. But this has changed, certainly with ZGC, but also with

The data structure uses the concept of "monotonicity" to ensure that reads are safe, even with concurrent allocations. The idea is that new records are only appended (into the section with all

jdk/src/hotspot/share/oops/methodData.cpp, lines 1438 to 1445 (commit a5ccd3b)
We wrap the extra data in BitData or SpeculativeTrapData objects, which are nothing but pointers to the extra data array/buffer.
Of course concurrent allocations have to be made mutually exclusive, hence we need the

The problem now gets more complicated with cleaning. It removes records, and shifts other records "to the left" to compact the extra data. This certainly breaks monotonicity. This would be ok if there could be no concurrent read/update or allocation. But that certainly does not seem to hold anymore now.

jdk/src/hotspot/share/oops/methodData.cpp, lines 1737 to 1756 (commit a5ccd3b)
We can protect all of these with locks. But that still is not good enough if there are concurrent reads, which make assumptions about the offsets in the extra data array/buffer. Imagine this:
I have another concern about what we do today during allocation:

jdk/src/hotspot/share/oops/methodData.cpp, lines 1503 to 1532 (commit a5ccd3b)
We take the lock, and append a new record. But is this writing done safely, such that a concurrent read would be safe? Because currently it must be. We do set_header, which does a u8 store to the extra data, including the tag. But the payload object is stored separately. I wonder what here is guaranteed to be atomic; maybe the header is stored atomically. But a concurrent read could certainly find complete garbage instead of the (not yet written) payload object. What if a read thus sees the tag for a SpeculativeTrapData, and we read its data->method() field, and it is garbage but not nullptr?
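One way to reason about this publication concern is the following standalone C++ model (illustrative only, not the actual HotSpot code): the writer stores the payload first and then publishes the tag with a release store, and the reader acquire-loads the tag before touching the payload. Without this ordering, the reader could observe a valid tag but a garbage payload.

```cpp
#include <atomic>
#include <cassert>

// Model of one extra-data slot: a tag ("header") plus a payload that is
// written separately. Publishing the tag before the payload is written is
// exactly the bug discussed above; release/acquire ordering prevents it.
enum Tag : int { no_tag = 0, speculative_trap_data_tag = 1 };

struct Slot {
    std::atomic<int> tag{no_tag};
    const char* method = nullptr;  // the payload (a Method* in HotSpot)
};

void publish(Slot& s, const char* m) {
    s.method = m;                                // 1. write the payload first
    s.tag.store(speculative_trap_data_tag,
                std::memory_order_release);      // 2. then publish the tag
}

const char* try_read(Slot& s) {
    if (s.tag.load(std::memory_order_acquire) ==
        speculative_trap_data_tag)               // acquire pairs with the release
        return s.method;                         // payload is guaranteed visible
    return nullptr;                              // slot not yet published
}
```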
In light of this, I see 2 alternatives:
@eme64 this pull request can not be integrated into

```
git checkout JDK-8306767
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
```
I agree with Roland that none of the accesses from C++ code are performance critical, so always requiring a lock shouldn't matter for performance. It is somewhat intrusive though. The alternative is to make the API distinguish clearly between preallocated data and the extra data, and adjust all internal usages to select the right one. For instance, speculative_trap_data_tag usages are only in the extra_data section, so access to those could have a distinct API. The only other thing in extra_data are bit_data_tag, so any caller that's looking for a different kind of record is always safe, since it must be from the preallocated section. A change like that might be clarifying in general as well, but I think there's a question of effort vs benefit. Also to clarify, I never actually observed this problem in practice but inferred the possibility while addressing MDO concurrency issues with Graal. It would be very hard to notice and very transient, but it could lead to crashes since SpeculativeTrapData contains a Method*.
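The suggested API split could look roughly like this (a hypothetical sketch; all names and the lock-token idea are made up for illustration): extra-data-only record kinds get an accessor that requires proof the lock is held, while preallocated records stay lock-free.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical API split: speculative-trap records live only in the
// extra-data section (lock required); other kinds are preallocated at
// MDO creation and never move, so no lock is needed to read them.
enum Tag { bit_data_tag, speculative_trap_data_tag, counter_data_tag };

struct ProfileRec { Tag tag; int value; };

struct Mdo {
    std::mutex extra_data_lock;
    ProfileRec prealloc{counter_data_tag, 42};           // stable after creation
    ProfileRec extra{speculative_trap_data_tag, 7};      // may be repacked

    // Safe without a lock: preallocated records never move.
    ProfileRec* preallocated_data() { return &prealloc; }

    // Extra-data records may be repacked; the caller must prove it holds
    // the lock by passing the owning unique_lock.
    ProfileRec* speculative_trap_data(std::unique_lock<std::mutex>& held) {
        assert(held.owns_lock() && held.mutex() == &extra_data_lock);
        return &extra;
    }
};
```

The design point is that the type system (or at least an assert) distinguishes the two access paths, so "is this record from the repackable section?" is answered at the API boundary rather than at every call site.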
@tkrodriguez Yes, we could now do quite the refactoring. The question is how far I should take this. I think I will just make things safe with locks now, and then if someone really wants to fully refactor all of this, they can do that separately.
Sounds reasonable to me.
Looks good to me. Thanks for making the changes. Do we want to file a bug to refactor the code?
@rwestrel thanks for the review! I pushed a comment improvement. What kind of refactoring are you thinking about? Just to remove the
I was thinking maybe something along the lines of what Tom suggested: "I think the API would need to make a stronger split between preallocated records and records which might come from the extra data section."
@rwestrel Ok, makes sense. Would you want to do this?
I haven't yet reviewed all of this but this mechanism seems unnecessary and I'd like to understand why this would be added.
```cpp
NoSafepointMutexLocker(Mutex* mutex, Mutex::SafepointCheckFlag flag = Mutex::_safepoint_check_flag) :
    NoSafepointMutexLocker(mutex, true, flag) {}
};
```
Don't add this. The locks that are no-safepoint-check-flags have implicit NoSafepointVerifier logic when you take them out.
Ah ok. Thanks for the hint. I will look into this. It is a bit confusing what the locks do and do not do with safepoints. Do they just not safepoint when trying to acquire the lock, or do they also verify that no safepoint happens while holding the lock?
They verify that no safepoints happen while holding the lock. There's a counter in JavaThread.
```cpp
#ifdef ASSERT
  // Debug support for checking if code allows safepoints or not.
  // Safepoints in the VM can happen because of allocation, invoking a VM operation, or blocking on
  // mutex, or blocking on an object synchronizer (Java locking).
  // If _no_safepoint_count is non-zero, then an assertion failure will happen in any of
  // the above cases. The class NoSafepointVerifier is used to set this counter.
  int _no_safepoint_count;  // If 0, thread allow a safepoint to happen
```
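The counter mechanism can be modeled in a standalone sketch (plain C++, heavily simplified; not the real JavaThread code): a NoSafepointVerifier-style RAII guard bumps a per-thread counter, and code that could safepoint asserts the counter is zero.

```cpp
#include <cassert>

// Simplified model of JavaThread::_no_safepoint_count and
// NoSafepointVerifier: the RAII guard increments the counter on entry
// and decrements it on exit; safepoint-capable code checks it is zero.
struct ThreadModel {
    int no_safepoint_count = 0;  // 0 => safepoints allowed
    bool allows_safepoint() const { return no_safepoint_count == 0; }
};

class NoSafepointVerifierModel {
    ThreadModel& t_;
public:
    explicit NoSafepointVerifierModel(ThreadModel& t) : t_(t) {
        t_.no_safepoint_count++;   // entering a no-safepoint scope
    }
    ~NoSafepointVerifierModel() {
        t_.no_safepoint_count--;   // leaving the scope
    }
};
```

Because the counter (not a boolean) is used, the scopes nest naturally, which is what lets a no-safepoint-check mutex acquire behave like an implicit verifier inside an outer one.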
I guess the comment should be updated to say we increment this when taking out a no-safepoint-check mutex.
Ok, I removed the NoSafepointVerifier. It seems to work, and passes my verification that we are in a no_safepoint scope.
```cpp
method()->print_codes_on(&ss);
print_code_on(&ss);
print_pcs_on(&ss);
tty->print("%s", ss.as_string());  // print all at once
```
It seems like these ttyLocker changes should be checked in as a different cleanup, i.e. removing ttyLocker is a really good thing. Can you make these changes a separate patch?
Yes, I filed the RFE:
JDK-8324129 C2: Remove some ttyLocker usages in preparation for JDK-8306767
#17486
@coleenp the other change is integrated and merged to here.
This looks very promising to me. Very nice with the asserts checking that the appropriate locking is there when using the extra data section. Great job!
This looks better. I have a couple of small requests. Thanks.
```cpp
uint arg_modified(int a) { ArgInfoData *aid = arg_info();
uint arg_modified(int a) { // Lock and avoid breaking lock with Safepoint
  MutexLocker ml(extra_data_lock(), Mutex::_no_safepoint_check_flag);
  ArgInfoData *aid = arg_info();
```
Can you move these to methodData.inline.hpp so that you don't have to include mutexLocker.hpp in a header file?
```cpp
       JavaThread::current()->is_in_no_safepoint_scope(),
       "JavaThread must have NoSafepointVerifier inside lock scope");
#endif
}
```
Since this is not performance critical, can you move this to the .cpp file?
This looks good to me.
Yes, code movement looks good. thank you!
Thanks @tkrodriguez @fisk @rwestrel @coleenp for all your help, conversations and suggestions!
Going to push as commit 746a086.
Your commit was automatically rebased without conflicts.
As explained in a comment below, we have to ensure that reading/writing/cleaning the extra data is all guarded by the extra_data_lock, and that no safepoint happens while holding that lock, so that the lock is not broken.

I introduced check_extra_data_locked, where I check that we hold the lock, and, if we are a Java thread (only those ever safepoint), that we currently are in a NoSafepointVerifier scope, hence verifying that no safepoint will be taken.

I placed check_extra_data_locked in all the places where we access the extra data, and then placed locks (with implicit no-safepoint verifiers) at the call sites of those places.

I also needed to change the rank of extra_data_lock to nosafepoint and set the Mutex::_no_safepoint_check_flag when taking the lock. Otherwise I could not take the lock from a VM thread.

Testing: tier1-3 and stress.
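A rough standalone model of the scheme described above (plain C++; the names follow the description but the code is illustrative only, not the HotSpot implementation): every extra-data access point calls a check that asserts the lock is held and that the thread is inside a no-safepoint scope.

```cpp
#include <cassert>
#include <mutex>

// Simplified model of check_extra_data_locked: taking extra_data_lock
// also enters a no-safepoint scope (like a no-safepoint-check mutex with
// its implicit NoSafepointVerifier), and every extra-data access asserts
// both conditions.
struct ThreadModel {
    int no_safepoint_count = 0;  // > 0 while inside a no-safepoint scope
    bool in_no_safepoint_scope() const { return no_safepoint_count > 0; }
};

struct MdoModel {
    std::mutex lock_;            // stands in for extra_data_lock
    bool locked_by_owner = false;

    void lock(ThreadModel& t) {
        lock_.lock();
        locked_by_owner = true;
        t.no_safepoint_count++;  // implicit no-safepoint verifier
    }
    void unlock(ThreadModel& t) {
        t.no_safepoint_count--;
        locked_by_owner = false;
        lock_.unlock();
    }

    // Called at every place that touches the extra data.
    void check_extra_data_locked(const ThreadModel& t) const {
        assert(locked_by_owner && "extra_data_lock must be held");
        assert(t.in_no_safepoint_scope() && "no safepoint may happen under the lock");
    }
};
```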
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16840/head:pull/16840
$ git checkout pull/16840
Update a local copy of the PR:
$ git checkout pull/16840
$ git pull https://git.openjdk.org/jdk.git pull/16840/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16840
View PR using the GUI difftool:
$ git pr show -t 16840
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16840.diff
Webrev
Link to Webrev Comment