Fix JVM crash when Thread::current() returns nullptr (PROF-13072) #461

Open
jbachorik wants to merge 3 commits into main from jb/prof-13072-thread-current-nullptr-fix

@jbachorik jbachorik commented Apr 10, 2026

What does this PR do?:
Fixes a JVM crash where Thread::current() returns nullptr inside ASGCT/JFR allocation paths when a profiling signal fires during thread initialization.

Motivation:
There is a race window in start_routine_wrapper between Profiler::registerThread() (which arms the per-thread CPU timer, enabling SIGPROF delivery) and routine(params) (which runs thread_native_entry, which in turn calls pd_set_thread() and thereby sets Thread::current() in JVM TLS). If SIGPROF or SIGVTALRM fires in this window, ASGCT can be invoked on a thread where Thread::current() (ELF TLS) is still nullptr, crashing in JFR allocation paths (resource_allocate_bytes called from JfrStackTrace, JfrArtifactSet, etc.). The crash has only been observed in virtualized environments, where OS scheduling makes the race window much more likely to be hit.

Additional Notes:
Two complementary changes:

  1. Narrow the race window (start_routine_wrapper, start_routine_wrapper_spec): move Profiler::registerThread() inside the existing SignalBlocker scope. The timer is then armed while SIGPROF/SIGVTALRM are masked; any pending signal fires only after signals are re-enabled (but before routine(params)) and is discarded by the guard below.

  2. Signal handler guard with one-shot init window (CTimer::signalHandler, WallClockASGCT::signalHandler): when JVMThread::isInitialized() && JVMThread::current() == nullptr, skip the sample — but only if the thread's _init_window countdown is still active (starts at 1, decremented on first skip). JVMThread::current() reads the JVM's own pthread TLS key (found during startup) — the same value pd_set_thread() writes.

    The one-shot countdown distinguishes the two cases where JVMThread::current() is null:

    • JVM threads in the race window: the first signal after the SignalBlocker exits is skipped. POSIX guarantees at most one pending signal of each type, so by the time the window expires, pd_set_thread() has been called.
    • Pure native threads (e.g. NativeThreadCreator): JVMThread::current() is always null. The countdown expires after one skip, and all subsequent signals are sampled normally — no permanent loss of native thread samples.
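The skip decision in change 2 can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `ThreadInitState`, `should_skip_sample`, and the way the flags are passed in are hypothetical names, and the real handler would read `JVMThread::isInitialized()` and `JVMThread::current()` directly and keep the countdown in per-thread storage that is safe to touch from a signal handler.

```cpp
#include <cassert>

// Hypothetical per-thread state. The PR names the counter _init_window but
// does not show where it is stored.
struct ThreadInitState {
    int init_window = 1;  // one-shot: starts at 1, decremented on the first skip
};

// Returns true when the signal handler should drop this sample.
// jvm_key_ready stands in for JVMThread::isInitialized(),
// jvm_thread stands in for the value returned by JVMThread::current().
inline bool should_skip_sample(bool jvm_key_ready, const void* jvm_thread,
                               ThreadInitState& state) {
    if (!jvm_key_ready || jvm_thread != nullptr) {
        return false;  // guard condition not met: sample as usual
    }
    if (state.init_window > 0) {
        --state.init_window;  // consume the one-shot window
        return true;          // possibly a JVM thread still in the race window
    }
    return false;  // window expired: treat as a pure native thread
}

// Usage sketch: the first null observation is skipped, later ones are not.
inline void demo_init_window() {
    ThreadInitState starting_thread;
    assert(should_skip_sample(true, nullptr, starting_thread));
    assert(!should_skip_sample(true, nullptr, starting_thread));
}
```

Keeping the decision separate from the handler makes the two null cases easy to exercise in isolation: a thread that is already attached never consumes its window, while a pure native thread loses at most one sample.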

How to test the change?:
The crash reproduces only in virtualized environments. The benchmark at https://github.com/DataDog/java-profiler/tree/zgu/ctx_benchmark was originally used to trigger it. Unit-level reproduction is not feasible: the race window requires specific scheduling that does not reliably occur on bare metal.

DynamicNativeThread and NativeThreadTest on J9/glibc/aarch64 verify that the guard does not permanently suppress native (non-JVM) thread samples.

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: PROF-13072

Race window between Profiler::registerThread() arming the CPU/wall-clock
timer and thread_native_entry calling pd_set_thread(): if a profiling
signal fires in that window ASGCT can be called with Thread::current()==null
inside the JVM, crashing in JFR allocation paths (resource_allocate_bytes).

Two complementary fixes:
1. Signal handlers (CTimer, WallClockASGCT): skip the sample when the JVM
   pthread key is initialized but the current thread has no JVM TLS value,
   i.e. pd_set_thread() has not yet run.
2. start_routine_wrapper / start_routine_wrapper_spec: move registerThread()
   inside the existing SignalBlocker so the timer is armed while signals are
   masked; any pending signal fires after unblocking and is discarded by (1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
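
The claim that a signal raised while masked is delivered once after unblocking can be checked with a small standalone program. This is an illustrative sketch only: it uses SIGUSR1 as a stand-in for SIGPROF/SIGVTALRM, the function name `deliveries_after_masked_raises` is hypothetical, and it relies on the Linux behavior that standard (non-real-time) signals are not queued.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t g_deliveries = 0;

// Async-signal-safe handler: only bumps a sig_atomic_t counter.
static void count_signal(int) { g_deliveries = g_deliveries + 1; }

// Raises `signo` `raises` times while it is masked, then restores the old
// mask and returns how many times the handler ran. Returns -1 if the handler
// ran while the signal was still masked, which should not happen.
int deliveries_after_masked_raises(int signo, int raises) {
    struct sigaction sa = {};
    struct sigaction old_sa = {};
    sa.sa_handler = count_signal;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, &old_sa);
    assert(rc == 0);

    sigset_t block, previous;
    sigemptyset(&block);
    sigaddset(&block, signo);
    rc = pthread_sigmask(SIG_BLOCK, &block, &previous);
    assert(rc == 0);

    g_deliveries = 0;
    for (int i = 0; i < raises; ++i) {
        raise(signo);  // stays pending because signo is masked
    }
    int while_masked = g_deliveries;

    // Restoring the mask lets the pending signal be delivered before
    // pthread_sigmask returns.
    pthread_sigmask(SIG_SETMASK, &previous, nullptr);
    int after_unblock = g_deliveries;

    sigaction(signo, &old_sa, nullptr);  // restore the previous disposition
    return while_masked == 0 ? after_unblock : -1;
}
```

On Linux, calling `deliveries_after_masked_raises(SIGUSR1, 3)` from a thread that does not already have SIGUSR1 blocked is expected to report a single delivery, which is the property the signal-handler guard in (1) depends on.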
@jbachorik jbachorik added the AI label Apr 10, 2026
Replace the permanent skip (JVMThread::current()==nullptr) with a
one-shot init-window countdown per thread. JVM threads in the race
window get one signal skipped; pure native threads (where
JVMThread::current() is always null, e.g. NativeThreadCreator) are
allowed through after the window expires, restoring their samples.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dd-octo-sts bot commented Apr 10, 2026

CI Test Results

Run: #24248400353 | Commit: 6beef5c | Duration: 27m 40s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-04-10 15:10:10 UTC

Introduce start_window_and_register() noinline helper so that
start_routine_wrapper_spec() never has sigset_t (SignalBlocker)
on its own stack frame, preserving the original design that
prevents musl's stack-protector canary corruption on aarch64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
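
The shape of this helper can be sketched as follows. It is a rough illustration under stated assumptions, not the PR's code: `ScopedSignalBlock` is a hypothetical stand-in for the project's SignalBlocker, `register_thread_stub` stands in for Profiler::registerThread(), and `__attribute__((noinline))` is the GCC/Clang spelling of the attribute.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Hypothetical RAII guard: masks the profiling signals for its lifetime.
class ScopedSignalBlock {
public:
    ScopedSignalBlock() {
        sigset_t to_block;
        sigemptyset(&to_block);
        sigaddset(&to_block, SIGPROF);
        sigaddset(&to_block, SIGVTALRM);
        active_ = pthread_sigmask(SIG_BLOCK, &to_block, &previous_) == 0;
    }
    ~ScopedSignalBlock() {
        if (active_) pthread_sigmask(SIG_SETMASK, &previous_, nullptr);
    }
    ScopedSignalBlock(const ScopedSignalBlock&) = delete;
    ScopedSignalBlock& operator=(const ScopedSignalBlock&) = delete;

private:
    sigset_t previous_;
    bool active_ = false;
};

static int g_registered = 0;
static bool g_masked_during_register = false;

// Stand-in for arming the per-thread timer; records whether both profiling
// signals were masked at the time of the call.
static void register_thread_stub() {
    sigset_t current;
    int rc = pthread_sigmask(SIG_BLOCK, nullptr, &current);  // query only
    assert(rc == 0);
    g_masked_during_register = sigismember(&current, SIGPROF) == 1 &&
                               sigismember(&current, SIGVTALRM) == 1;
    ++g_registered;
}

// noinline keeps the sigset_t objects in this helper's frame rather than in
// the caller's frame.
__attribute__((noinline)) void block_and_register_sketch() {
    ScopedSignalBlock guard;
    register_thread_stub();
}  // mask restored here; a pending signal can be delivered at this point

// Hypothetical wrapper: no sigset_t lives on this function's own stack frame.
void* thread_entry_sketch(void* (*routine)(void*), void* params) {
    block_and_register_sketch();
    return routine(params);
}
```

The wrapper only calls the helper and then the user routine, so the signal-mask storage is gone by the time `routine(params)` runs.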
@jbachorik jbachorik marked this pull request as ready for review April 10, 2026 16:31
@jbachorik jbachorik requested a review from a team as a code owner April 10, 2026 16:31