Route vCPU preemption signals through a dedicated sigwait thread to eliminate HV_EXIT_REASON_UNKNOWN #77

@Max042004

Background

elfuse interrupts a running vCPU (for the cross-process guest-signal transport
and for the per-iteration safety timeout) by sending a host signal to the vCPU
thread, whose handler calls hv_vcpus_exit():

  • SIGUSR2 — cross-process guest-signal doorbell. proc_send_guest_signal()
    (src/syscall/proc.c:358) writes the guest signum to /tmp/elfuse-sig-<pid>
    and sends host SIGUSR2; the receiver's handler
    guest_signal_transport_handler() (proc.c:965) sets g_external_guest_signal
    and calls hv_vcpus_exit(&g_timeout_vcpu, 1).
  • SIGALRM — per-iteration timeout. alarm_handler() sets g_timed_out and
    calls hv_vcpus_exit(&g_timeout_vcpu, 1).

Because the signal is delivered to the vCPU thread while it is inside
hv_vcpu_run, Apple HVF aborts the run with HV_EXIT_REASON_UNKNOWN (0x3)
instead of the clean HV_EXIT_REASON_CANCELED (0) that hv_vcpus_exit()
produces for a vCPU caught between runs. The run loop must therefore treat
UNKNOWN as a possible cancellation, which is ambiguous: a genuine hypervisor
fault could in principle also surface as UNKNOWN, so blindly resuming risks a
silent spin (raised by a cubic review on #76).

#76 (fix-cross-process-signal-el0) carries the EL0-preemption delivery fix and
routes UNKNOWN through the cancellation handler so the already-queued guest
signal is delivered instead of crashing the child. That keeps the ambiguity: the
run loop cannot tell our own preemption from a genuine fault, so it errs toward
resuming. This issue removes the ambiguity at its source rather than reasoning
about it after the fact.

Note: a real hv_vcpu_run API failure (non-HV_SUCCESS return) is already
caught by HV_CHECK_CTX (proc.c:1113) and crashes immediately. The only thing
that reaches the UNKNOWN branch is HV_SUCCESS + exit_reason == UNKNOWN,
i.e. our own hv_vcpus_exit landing mid-execution.

Goal

Deliver every self-directed hv_vcpus_exit from a thread other than the
vCPU thread, so hv_vcpu_run always returns a clean CANCELED. Once no
legitimate path can produce UNKNOWN, the run loop can treat any UNKNOWN as a
hard hypervisor failure and crash with diagnostics — no heuristic needed.
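
A minimal sketch of the intended shape, assuming POSIX threads. kick_vcpus() is a hypothetical stand-in for the real hv_vcpus_exit() call so the sketch builds without Hypervisor.framework, and preempt_sigset(), preempt_thread_start() and preempt_thread_main() are illustrative names, not existing elfuse functions. Typing g_external_guest_signal and g_timed_out as atomic_int is also an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <unistd.h>

/* Flags named in the issue; the atomic_int type is an assumption. */
static atomic_int g_external_guest_signal;
static atomic_int g_timed_out;

/* Hypothetical stand-in for hv_vcpus_exit(): only counts kicks. */
static atomic_int g_kicks;
static void kick_vcpus(void) { atomic_fetch_add(&g_kicks, 1); }

/* The set of signals that must never reach a vCPU thread. */
static void preempt_sigset(sigset_t *set) {
    sigemptyset(set);
    sigaddset(set, SIGUSR2);
    sigaddset(set, SIGALRM);
}

/* Helper thread: consumes the signals synchronously with sigwait(),
 * so no handler ever runs on a thread that is inside hv_vcpu_run. */
static void *preempt_thread_main(void *arg) {
    (void)arg;
    sigset_t set;
    preempt_sigset(&set);
    for (;;) {
        int sig = 0;
        if (sigwait(&set, &sig) != 0)
            continue;
        assert(sig == SIGUSR2 || sig == SIGALRM);
        if (sig == SIGUSR2)
            atomic_store(&g_external_guest_signal, 1);
        else
            atomic_store(&g_timed_out, 1);
        kick_vcpus(); /* off-thread exit request */
    }
    return NULL;
}

/* Block the signals in the caller first; threads created afterwards
 * (including the helper itself) inherit the blocked mask. */
int preempt_thread_start(pthread_t *out) {
    sigset_t set;
    preempt_sigset(&set);
    int rc = pthread_sigmask(SIG_BLOCK, &set, NULL);
    if (rc != 0)
        return rc;
    return pthread_create(out, NULL, preempt_thread_main, NULL);
}
```

The sketch relies on preempt_thread_start() being called before any vCPU thread exists, because a thread created earlier would keep its old mask and could still receive the signal directly.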

Why this is not blocked by an HVF constraint

The relevant HVF rules are:

  • a VM (hv_vm_create) is per process (one per elfuse --fork-child);
  • a vCPU is bound to the thread that created it (hv_vcpu_run must run on
    that thread);
  • hv_vcpus_exit() is explicitly designed to be called from another thread
    to force a VMEXIT.

The helper thread lives in the same host process as the vCPU thread, so there
is no cross-process issue — this is the idiomatic HVF watchdog pattern. The
current code already calls hv_vcpus_exit() on a stored handle
(g_timeout_vcpu), proving the handle is usable off-thread; this change only
moves the call to a dedicated thread.
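
The watchdog pattern mentioned above can be sketched without any signal at all, assuming a condition variable with an absolute deadline. Everything here is illustrative: struct watchdog, watchdog_init(), watchdog_main() and watchdog_disarm() are hypothetical names, fake_vcpus_exit() stands in for the real hv_vcpus_exit() call, and the fired field plays the role of g_timed_out. The sketch is single-shot; a per-iteration timeout would need re-arming.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct watchdog {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    struct timespec deadline; /* absolute, CLOCK_REALTIME */
    bool disarmed;            /* set by the vCPU thread when the run ends */
    bool fired;               /* set by the watchdog when the deadline hits */
};

/* Hypothetical stand-in for hv_vcpus_exit(): only counts calls. */
static int g_fake_exit_calls;
static void fake_vcpus_exit(void) { g_fake_exit_calls++; }

void watchdog_init(struct watchdog *w, long timeout_ms) {
    assert(timeout_ms >= 0);
    pthread_mutex_init(&w->mu, NULL);
    pthread_cond_init(&w->cv, NULL);
    w->disarmed = false;
    w->fired = false;
    clock_gettime(CLOCK_REALTIME, &w->deadline);
    w->deadline.tv_sec += timeout_ms / 1000;
    w->deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (w->deadline.tv_nsec >= 1000000000L) {
        w->deadline.tv_sec += 1;
        w->deadline.tv_nsec -= 1000000000L;
    }
}

/* Runs on the helper thread, never on the vCPU thread. */
void *watchdog_main(void *arg) {
    struct watchdog *w = arg;
    int rc = 0;
    pthread_mutex_lock(&w->mu);
    /* Loop to absorb spurious wakeups; stop on disarm or timeout. */
    while (!w->disarmed && rc == 0)
        rc = pthread_cond_timedwait(&w->cv, &w->mu, &w->deadline);
    if (!w->disarmed && rc == ETIMEDOUT) {
        w->fired = true;
        fake_vcpus_exit();
    }
    pthread_mutex_unlock(&w->mu);
    return NULL;
}

/* Called when the guarded run finishes before the deadline. */
void watchdog_disarm(struct watchdog *w) {
    pthread_mutex_lock(&w->mu);
    w->disarmed = true;
    pthread_cond_signal(&w->cv);
    pthread_mutex_unlock(&w->mu);
}
```

Because the deadline is checked under the same mutex as the disarm flag, a disarm that races with the timeout either wins cleanly or loses cleanly; the exit request is never issued after a successful disarm.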

Work items

  • 1. Signal-mask discipline. Block SIGUSR2 (and SIGALRM, see item 2) on
    every vCPU thread and the main thread via pthread_sigmask, leaving them
    unblocked only on the dedicated sigwait thread. Establish the mask before
    any vCPU thread is created, at every thread-creation site (bootstrap,
    CLONE_THREAD workers, fork-child bring-up), and re-establish it across
    fork/exec so children do not inherit a stale disposition. A single
    missed site silently reintroduces UNKNOWN.

  • 2. Move SIGALRM onto the same path. The "any UNKNOWN is abnormal"
    invariant only holds if all signal-driven hv_vcpus_exit calls leave the
    vCPU thread. Re-home the per-iteration timeout: either have the helper
    thread own the timeout (e.g. sigtimedwait/timer + hv_vcpus_exit) or
    replace alarm() with a mechanism that does not deliver SIGALRM to the
    vCPU thread. Preserve the existing g_timed_out → CRASH_TIMEOUT
    (exit 124) semantics and the guest ITIMER_REAL emulation that currently
    shares alarm().

  • 3. Live-vCPU registry. Replace the single g_timeout_vcpu global with a
    per-process registry of all live vCPU handles (multi-threaded guests run
    worker vCPUs, each on its own thread — g_timeout_vcpu is currently
    last-writer-wins). The helper thread kicks the correct set on a transport
    event; hv_vcpus_exit() accepts a vCPU array, so a single call can exit
    all of them and let the signal/queue machinery sort out delivery. Register
    on vCPU create, unregister on thread exit, guard with a lock.

  • 4. Empirically verify CANCELED, not UNKNOWN. Confirm under stress that
    helper-thread hv_vcpus_exit() against an actively-running vCPU yields
    CANCELED with zero UNKNOWN across many iterations (single- and
    multi-threaded guests, and the cross-process fork case). The CANCELED-vs-
    UNKNOWN split is supported by the code's own comments, but Apple HVF has
    quirks — validate before making UNKNOWN fatal.
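
The registry in item 3 could look roughly like the following sketch. It is not the elfuse implementation: vcpu_handle_t is an assumed stand-in for hv_vcpu_t, fake_vcpus_exit() stands in for hv_vcpus_exit() (mirroring its array-plus-count shape), and MAX_VCPUS and the vcpu_registry_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_VCPUS 64              /* hypothetical upper bound */
typedef uint64_t vcpu_handle_t;   /* assumed stand-in for hv_vcpu_t */

static pthread_mutex_t g_reg_lock = PTHREAD_MUTEX_INITIALIZER;
static vcpu_handle_t g_live[MAX_VCPUS];
static uint32_t g_live_count;

/* Hypothetical stand-in for hv_vcpus_exit(): records the batch size. */
static uint32_t g_last_kick_count;
static int fake_vcpus_exit(vcpu_handle_t *vcpus, uint32_t count) {
    (void)vcpus;
    g_last_kick_count = count;
    return 0;
}

/* Called by a vCPU thread right after it creates its vCPU. */
int vcpu_registry_add(vcpu_handle_t v) {
    int rc = -1;
    pthread_mutex_lock(&g_reg_lock);
    if (g_live_count < MAX_VCPUS) {
        g_live[g_live_count++] = v;
        rc = 0;
    }
    pthread_mutex_unlock(&g_reg_lock);
    return rc;
}

/* Called by a vCPU thread before it destroys its vCPU. */
int vcpu_registry_remove(vcpu_handle_t v) {
    int rc = -1;
    pthread_mutex_lock(&g_reg_lock);
    for (uint32_t i = 0; i < g_live_count; i++) {
        if (g_live[i] == v) {
            g_live[i] = g_live[--g_live_count]; /* swap-remove */
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&g_reg_lock);
    return rc;
}

/* Called by the helper thread: one exit request covers every live vCPU.
 * The lock is held across the call so no handle can be unregistered
 * and destroyed while the request is in flight. */
uint32_t vcpu_registry_kick_all(void) {
    pthread_mutex_lock(&g_reg_lock);
    assert(g_live_count <= MAX_VCPUS);
    uint32_t n = g_live_count;
    if (n > 0)
        fake_vcpus_exit(g_live, n);
    pthread_mutex_unlock(&g_reg_lock);
    return n;
}
```

Kicking every live vCPU is the coarse option item 3 describes; narrowing the set to the vCPUs that actually have a pending signal is left to the signal/queue machinery.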

Acceptance criteria

  • With items 1–3 landed, the run loop treats HV_EXIT_REASON_UNKNOWN as fatal
    (crash_report(CRASH_UNEXPECTED_EXIT)), matching the original pre-#76 else
    branch.
  • test-fork passes 100% over a large batch (e.g. 200+ runs); no
    elfuse --fork-child orphans left behind.
  • test-signal, test-signal-thread, test-mt-fork, and the timeout=0
    validation suite stay green.
  • A stress harness records 0 HV_EXIT_REASON_UNKNOWN exits during heavy
    cross-process signalling.
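
Once UNKNOWN is fatal, the run loop's dispatch reduces to something like the sketch below. The enum is a local mirror used only for illustration: the values for CANCELED (0) and UNKNOWN (0x3) come from the background section, the two middle values are assumed, and classify_exit() and the ACT_* names are hypothetical.

```c
#include <assert.h>

/* Local mirror of the HVF exit reasons (illustrative, see lead-in). */
enum exit_reason {
    EXIT_CANCELED = 0,
    EXIT_EXCEPTION = 1, /* assumed value */
    EXIT_VTIMER = 2,    /* assumed value */
    EXIT_UNKNOWN = 3
};

static_assert(EXIT_CANCELED == 0, "CANCELED is 0 per the background");
static_assert(EXIT_UNKNOWN == 3, "UNKNOWN is 0x3 per the background");

enum run_action {
    ACT_DELIVER_PENDING,       /* our own preemption: deliver queued signals */
    ACT_HANDLE_EXCEPTION,      /* guest trapped: HVC, abort, and so on */
    ACT_HANDLE_VTIMER,
    ACT_CRASH_UNEXPECTED_EXIT  /* crash_report(CRASH_UNEXPECTED_EXIT) */
};

/* With every exit request issued off-thread, CANCELED is the only
 * exit reason our own preemption can produce, so UNKNOWN (and any
 * value not listed) goes straight to the fatal path. */
enum run_action classify_exit(int reason) {
    switch (reason) {
    case EXIT_CANCELED:
        return ACT_DELIVER_PENDING;
    case EXIT_EXCEPTION:
        return ACT_HANDLE_EXCEPTION;
    case EXIT_VTIMER:
        return ACT_HANDLE_VTIMER;
    case EXIT_UNKNOWN:
    default:
        return ACT_CRASH_UNEXPECTED_EXIT;
    }
}
```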

Risks / open questions

  • Signal-mask plumbing touches every thread-creation and fork/exec path; a
    missed site is silent (reintroduces UNKNOWN) — mitigated by item 4's stress
    check.
  • Extra thread per guest process: small overhead, but interacts with the
    fork model (the helper must be re-created, not inherited, in each child).
  • Does the helper-thread hv_vcpus_exit reliably interrupt a vCPU blocked in a
    host syscall issued from inside the HVC handler (e.g. nanosleep), or only
    one executing guest code? Confirm the cross-process wake still works for a
    child parked in a blocking host syscall.

Out of scope
