Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8292695: SIGQUIT and jcmd attaching mechanism does not work with signal chaining library #9955

Closed
wants to merge 6 commits into from

Conversation

caoman
Copy link
Contributor

@caoman caoman commented Aug 20, 2022

Hi all,

Could anyone review this bug fix? See https://bugs.openjdk.org/browse/JDK-8292695 for details.

I changed the temporary handler for SIGQUIT to use a dummy function, and use os::signal() to set it up, just as os::initialize_jdk_signal_support() does.
It is possible that just moving the set_signal_handler(BREAK_SIGNAL, false); in install_signal_handlers() outside of the window bounded by JVM_{begin|end}_signal_setting() could also fix this bug. However, set_signal_handler() and JVM_HANDLE_XXX_SIGNAL() are currently used for signals that support chaining and periodically check, which do not apply to SIGQUIT. I think it is cleaner to use different functions for SIGQUIT.

I also added a test to check that sending SIGQUIT should produce a thread dump on stdout, with and without using libjsig.so.

-Man


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8292695: SIGQUIT and jcmd attaching mechanism does not work with signal chaining library

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9955/head:pull/9955
$ git checkout pull/9955

Update a local copy of the PR:
$ git checkout pull/9955
$ git pull https://git.openjdk.org/jdk pull/9955/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9955

View PR using the GUI difftool:
$ git pr show -t 9955

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9955.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 20, 2022

👋 Welcome back manc! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Aug 20, 2022

@caoman The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Aug 20, 2022
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 22, 2022
@mlbridge
Copy link

mlbridge bot commented Aug 22, 2022

Webrevs

@tstuefe
Copy link
Member

tstuefe commented Aug 22, 2022

@caoman I'm interested in what your launcher does, or is, since I have almost never encountered signal chaining in the field. Would you mind sharing what product the launcher belongs to, and why you enable signal chaining?

Copy link
Member

@navyxliu navyxliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I am not a reviewer.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still trying to gauge the cause and effects here ...

In the meantime, should the test not be put under runtime/jsig ?

@caoman
Copy link
Contributor Author

caoman commented Aug 22, 2022

@tstuefe Our launcher is equipped with almost all our Java applications. It basically does what bin/java does, but with several additional features. One main motivation to have our own launcher, is to static link all user-defined JNI dependency libraries with the launcher, so the applications do not need to use System.loadLibrary() to load .so files dynamically. Using static linking instead of dynamically loading .so files is core characteristic of our deployment environment for almost all languages (C, C++, Java, Python, Go). @jianglizhou is actually working on a project that also statically links HotSpot (libjvm.so and other .so files) with our launcher as well.

Regarding to the signal chaining support from our launcher, we found some existing applications that LD_PRELOADs libjsig.so. We'd like to support it in our launcher so they don't need to depend on libjsig.so. Those applications need signal chaining, because they use an internal C++ library that could install custom signal handlers for error handling.

@caoman
Copy link
Contributor Author

caoman commented Aug 22, 2022

Still trying to gauge the cause and effects here ...

In the meantime, should the test not be put under runtime/jsig ?

The test tests SIGQUIT behavior with and without libjsig.so. So it does not necessarily depend on libjsig.so. I could move it if you prefer runtime/jsig.

@tstuefe
Copy link
Member

tstuefe commented Aug 23, 2022

@tstuefe Our launcher is equipped with almost all our Java applications. It basically does what bin/java does, but with several additional features. One main motivation to have our own launcher, is to static link all user-defined JNI dependency libraries with the launcher, so the applications do not need to use System.loadLibrary() to load .so files dynamically. Using static linking instead of dynamically loading .so files is core characteristic of our deployment environment for almost all languages (C, C++, Java, Python, Go). @jianglizhou is actually working on a project that also statically links HotSpot (libjvm.so and other .so files) with the our launcher as well.

That is interesting. Do you link statically to improve startup speed, or for some other reason?

@caoman
Copy link
Contributor Author

caoman commented Aug 23, 2022

@tstuefe Our launcher is equipped with almost all our Java applications. It basically does what bin/java does, but with several additional features. One main motivation to have our own launcher, is to static link all user-defined JNI dependency libraries with the launcher, so the applications do not need to use System.loadLibrary() to load .so files dynamically. Using static linking instead of dynamically loading .so files is core characteristic of our deployment environment for almost all languages (C, C++, Java, Python, Go). @jianglizhou is actually working on a project that also statically links HotSpot (libjvm.so and other .so files) with the our launcher as well.

That is interesting. Do you link statically to improve startup speed, or for some other reason?

It is not about startup speed, but mainly to ensure correct version and simplify deployment. It is very similar to the advantages of static library as stated on Wikipedia: https://en.wikipedia.org/wiki/Static_library.
@jianglizhou will share more details about this project in a separate email thread or in a public talk soon, so stay tuned.

@caoman
Copy link
Contributor Author

caoman commented Aug 25, 2022

@dholmes-ora Let me know if there's any other issue, or if I could help explain the cause and effects better.

@dholmes-ora
Copy link
Member

@tstuefe are you able to look at this one? Thanks.

@caoman
Copy link
Contributor Author

caoman commented Sep 6, 2022

@tstuefe Any feedback? I'd like to get this submitted soon to unblock our internal use cases relying on SIGQUIT and jcmd.

@tstuefe
Copy link
Member

tstuefe commented Sep 7, 2022

Sorry, missed this one. I'll take a look later today.

@tstuefe
Copy link
Member

tstuefe commented Sep 7, 2022

@caoman While looking at this, I think I found an unrelated bug in libjsig. https://bugs.openjdk.org/browse/JDK-8293466 Could you eyeball it and tell me if my assumption is correct?

@caoman
Copy link
Contributor Author

caoman commented Sep 7, 2022

Thanks for filing the other bug. Responded in that bug.

@tstuefe
Copy link
Member

tstuefe commented Sep 7, 2022

Looking at this, this gets more and more interesting. One thing, libjsig does nothing to prevent user code from changing thread/process sigmask. @caoman has that never been a problem at Google?

@tstuefe
Copy link
Member

tstuefe commented Sep 7, 2022

Found another minor one: https://bugs.openjdk.org/browse/JDK-8293494.

I'm looking at the change now, though.

@tstuefe
Copy link
Member

tstuefe commented Sep 7, 2022

Hi @caoman, @dholmes-ora, @navyxliu,

I wonder if it has to be that complicated.

The problems (both https://bugs.openjdk.org/browse/JDK-8279124 and https://bugs.openjdk.org/browse/JDK-8292695) are caused by SIGQUIT being in a weird space - it is a VM signal, really, but gets handled by the signal dispatcher thread. The signal dispatcher thread is started long after the "real" hotspot handlers are installed. And after the dispatcher thread starts, only then we install the SIGQUIT handler. At that point the libjsig is already in "surveillance" mode where it intercepts stuff.

But there is nothing that prevents us from moving the SIGQUIT handler installation up in time to the hotspot signal handler initialization. Nobody says the SIGQUIT handler has to be installed after the signal dispatcher thread started.

Signal dispatcher thread and handler are only loosely coupled via os::signal_notify(). If the VM receives SIGQUIT in the time window after the SIGQUIT handler is installed but before the signal dispatcher thread is started, handler will just "ring the bell", but thats fine. The notifications would stack, and once the dispatcher thread runs, it will process all these SIGQUIT signals in a delayed fashion.

To test my theory, I built: master...tstuefe:jdk:test-move-sigquit-init-up-in-time .

This moves signal handler installation up in time. It also optionally adds a delay between hotspot signal handler installation and dispatcher thread startup.

And this just works as expected: if we send SIGQUIT to the VM, we get thread dumps. Before dispatcher thread start, these SIGQUITs are "stored", and executed all once dispatcher thread starts (whether or not one wants that is a different question, but easily to modify). And SIGQUITs after dispatcher startup work fine too.

As expected this works with or without libjsig, so it solves JDK-8292695. It also closes the initial time window and therefore does not regress JDK-8279124. And it is a lot simpler to argue about, especially with libjsig present. Now SIGQUIT is handled like any other ordinary hotspot signal.

(and I wonder whether we could handle the shutdown signals the same way, but have not tested).

What do you think?

Am I overlooking something?

@caoman
Copy link
Contributor Author

caoman commented Sep 7, 2022

One thing, libjsig does nothing to prevent user code from changing thread/process sigmask. @caoman has that never been a problem at Google?

We don't actually use libjsig directly. Our launcher reimplemented signal chaining support using the same JVM_begin_signal_setting/JVM_end_signal_setting/JVM_get_signal_action interface.

Our launcher's implementation has calls to pthread_sigmask, but the goal seems to make the overridden sigaction() async-signal-safe. It doesn't prevent user code from changing thread/process sigmask either. However, I haven't heard of issues with user code changing sigmask.

@caoman
Copy link
Contributor Author

caoman commented Sep 7, 2022

But there is nothing that prevents us from moving the SIGQUIT handler installation up in time to the hotspot signal handler initialization. Nobody says the SIGQUIT handler has to be installed after the signal dispatcher thread started.

Thank you for the great idea and experiment! Let me try this approach and rerun the tests.

@caoman
Copy link
Contributor Author

caoman commented Sep 8, 2022

But there is nothing that prevents us from moving the SIGQUIT handler installation up in time to the hotspot signal handler initialization. Nobody says the SIGQUIT handler has to be installed after the signal dispatcher thread started.

Thank you for the great idea and experiment! Let me try this approach and rerun the tests.

Switched to the approach that installs the handler only once.

The SIGQUIT handler has to be installed after jdk_misc_signal_init() has initialized pending_signals and sig_semaphore, which are used by os::signal_notify().
This means SIGQUIT handler is still installed outside libjsig's JVM_begin_signal_setting/JVM_end_signal_setting window, so it not tracked by libjsig like other HotSpot signals.

@tstuefe
Copy link
Member

tstuefe commented Sep 8, 2022

But there is nothing that prevents us from moving the SIGQUIT handler installation up in time to the hotspot signal handler initialization. Nobody says the SIGQUIT handler has to be installed after the signal dispatcher thread started.

Thank you for the great idea and experiment! Let me try this approach and rerun the tests.

Switched to the approach that installs the handler only once.

The SIGQUIT handler has to be installed after jdk_misc_signal_init() has initialized pending_signals and sig_semaphore, which are used by os::signal_notify(). This means SIGQUIT handler is still installed outside libjsig's JVM_begin_signal_setting/JVM_end_signal_setting window, so it not tracked by libjsig like other HotSpot signals.

You can just move jdk_misc_signal_init up a notch in PosixSignals::init(), to precede hotspot signal initialization. Semaphore initialization does not depend on anything much, its just a sem_init(). See my modified example:

77826cf

Then it works as planned, with SIGQUIT now being part of the normal hotspot VM signal installation.

@navyxliu
Copy link
Member

navyxliu commented Sep 8, 2022

I try it out. yeah, it works! One installation but it will remember early SIGBREAK.

Why do you insist to move jdk_misc_signal_init() before install_signal_handlers()? Is there any difference before and after it?

@caoman
Copy link
Contributor Author

caoman commented Sep 8, 2022

Moved jdk_misc_signal_init() earlier as Thomas suggested. Now that libjsig can prevent user overriding SIGBREAK later.
For Windows, jdk_misc_signal_init() seems appropriate as there's a SetConsoleCtrlHandler() above.

Why do you insist to move jdk_misc_signal_init() before install_signal_handlers()? Is there any difference before and after it?

Because SIGBREAK's UserHandler() uses the pending_signals and sig_semaphore variables initialized by jdk_misc_signal_init().

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test that checks that:

if VM is started with -Xrs, and we send a SIGQUIT, the VM shall crash with a core. The crash should look like a system crash, not an hs-err file crash, like this:

Quit (core dumped)

which assures us that no hotspot signal handler is installed and we take therefore the default action on SIGQUIT.

If this string is OS dependent and gives us headaches, the test should check at least that the process exit code is not 0.

And of course the VM should be started with tiny heap etc such to make sure the core file generated will be tiny.

@caoman
Copy link
Contributor Author

caoman commented Sep 9, 2022

Can you also add a test that checks that:

if VM is started with -Xrs, and we send a SIGQUIT, the VM shall crash with a core. The crash should look like a system crash, not an hs-err file crash, like this:

I struggled to get this to work. I found out it is due to https://bugs.openjdk.org/browse/JDK-8234262, and ProcessBuilder changes how SIGQUIT is handled in the subprocess. This is further complicated by how jtreg runs the parent process. The workaround (passing -Xrs to the parent process) does not work through jtreg using "main/othervm -Xrs".

I uploaded the not-working test in here: caoman@73e3a5f

@tstuefe
Copy link
Member

tstuefe commented Sep 9, 2022

Can you also add a test that checks that:
if VM is started with -Xrs, and we send a SIGQUIT, the VM shall crash with a core. The crash should look like a system crash, not an hs-err file crash, like this:

I struggled to get this to work. I found out it is due to https://bugs.openjdk.org/browse/JDK-8234262, and ProcessBuilder changes how SIGQUIT is handled in the subprocess. This is further complicated by how jtreg runs the parent process. The workaround (passing -Xrs to the parent process) does not work through jtreg using "main/othervm -Xrs".

I uploaded the not-working test in here: caoman@73e3a5f

Thank you for trying, and finding this out! I'm fine with postponing this test then to another RFE.

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you for taking my suggestion!

..Thomas

@openjdk
Copy link

openjdk bot commented Sep 9, 2022

@caoman This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8292695: SIGQUIT and jcmd attaching mechanism does not work with signal chaining library

Reviewed-by: xliu, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 18 new commits pushed to the master branch:

  • b8598b0: 8291660: Grapheme support in BreakIterator
  • a14c3a4: 8288933: Improve the implementation of Double/Float.isInfinite
  • 00befdd: 8292675: Add identity transformation for removing redundant AndV/OrV nodes
  • 7169ee5: 8293477: IGV: Upgrade to Netbeans Platform 15
  • 3dd94f3: 8292671: Hotspot Style Guide should allow covariant returns
  • 9d6b028: 8234315: GTK LAF does not gray out disabled JMenu
  • 812d805: 6447816: Provider filtering (getProviders) is not working with OR'd conditions
  • 43e191d: 8293524: RISC-V: Use macro-assembler functions as appropriate
  • 14eb5ad: 8291753: Add JFR event for GC CPU Time
  • 30d4145: 8293230: x86_64: Move AES and GHASH stub definitions into separate source files
  • ... and 8 more: https://git.openjdk.org/jdk/compare/6677227301acf06eb8be264e4eb3e092d0d7442f...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 9, 2022
Copy link
Member

@navyxliu navyxliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new revision still LGTM.

@caoman
Copy link
Contributor Author

caoman commented Sep 9, 2022

/integrate

@openjdk
Copy link

openjdk bot commented Sep 9, 2022

Going to push as commit 45ff10c.
Since your change was applied there have been 19 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 9, 2022
@openjdk openjdk bot closed this Sep 9, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 9, 2022
@openjdk
Copy link

openjdk bot commented Sep 9, 2022

@caoman Pushed as commit 45ff10c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@caoman
Copy link
Contributor Author

caoman commented Sep 9, 2022

Thank you for trying, and finding this out! I'm fine with postponing this test then to another RFE.

SG. Filed https://bugs.openjdk.org/browse/JDK-8293611.

@caoman caoman deleted the JDK-8292695 branch September 9, 2022 19:26
@dholmes-ora
Copy link
Member

Messing around with initialization order always makes me very nervous. While this may have solved the problem at hand it is far from clear to me that it has not introduced unexpected behaviour in other situations - situations that our testing is unlikely to expose.

@tstuefe
Copy link
Member

tstuefe commented Sep 20, 2022

Messing around with initialization order always makes me very nervous. While this may have solved the problem at hand it is far from clear to me that it has not introduced unexpected behaviour in other situations - situations that our testing is unlikely to expose.

Hmm, we put a lot of work and time into this solution. While I acknowledge this is a difficult area, so are many others in hotspot which we are not shy of changing. And we did not change the coding on a whim - the proposed solution IMHO is better than the assortment of piled-on hotfixes we had before and dissolves quite a bit of complexity. That makes the code easier to understand and maintain in the long run. @caoman also provided a regression test, so that's covered too.

But if you have concrete misgivings about the proposed solution, let's discuss it.

@dholmes-ora
Copy link
Member

@tstuefe IIUC the earlier proposal from @caoman simply moved the setting of the BREAK_HANDLER to inside jdk_misc_signal_init. But you then had it moved instead to install_signal_handlers, which then required that install_signal_handlers get moved after jdk_misc_signal_init as the UserHandler depends on initialization that occurs there. I don't understand why that complex double-shuffle was needed/wanted instead of what Man proposed??

@tstuefe
Copy link
Member

tstuefe commented Oct 24, 2022

@tstuefe IIUC the earlier proposal from @caoman simply moved the setting of the BREAK_HANDLER to inside jdk_misc_signal_init. But you then had it moved instead to install_signal_handlers, which then required that install_signal_handlers get moved after jdk_misc_signal_init as the UserHandler depends on initialization that occurs there. I don't understand why that complex double-shuffle was needed/wanted instead of what Man proposed??

See my comment at #9955 (comment). The original solution by @caoman , while it would have worked, was a bandaid atop of a bandaid (see also https://bugs.openjdk.org/browse/JDK-8279124). It was perpetuating a crooked solution, and the crookedness stems from the fact that SIGQUIT is semantically a VM signal, owned by the VM, but initialized outside the libjsig-VM-signal-initialization window. The solution proposed by me makes SIGQUIT initialize like any other VM signal, therefore we can remove the SIGQUIT-related special handling and hopefully don't need further patches in this area.

@dholmes-ora
Copy link
Member

I don't really agree with the rationale, but the fix is correct, and the reordering seems harmless enough on further inspection.

To me this is a JDK serviceability signal, both for dynamic attach and thread dumps, as it is the JDK serving "signal thread" that handles it. Most of the problems in this area have stemmed from initialization issues due to "too early" sending of SIGQUIT to initiate an attach attempt.

@caoman
Copy link
Contributor Author

caoman commented Oct 24, 2022

SIGQUIT is semantically a VM signal, owned by the VM, but initialized outside the libjsig-VM-signal-initialization window. The solution proposed by me makes SIGQUIT initialize like any other VM signal, ...

Yes, this is main reason we installed BREAK_SIGNAL handler inside install_signal_handlers and reordered jdk_misc_signal_init and install_signal_handlers.

There was a concern that with libjsig present and without -Xrs, user could still install a custom BREAK_SIGNAL handler that overrides JVM's handler, see #9955 (comment). This change avoids this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
4 participants