Run @parcel/watcher in a self-healing child process #250

Open
ymichael wants to merge 3 commits into main from bb/investigate-inotify-watchers-thr_qcv8rurevn

ymichael (Owner) commented on Jun 19, 2026

Why

The host daemon's recursive filesystem watcher (@parcel/watcher) throws on a benign poll() EINTR in its inotify backend instead of retrying. One throw kills the shared native backend, leaks its inotify fds and threads, and can hang the daemon (requiring a manual restart); watches also die silently. There is no upstream fix to wait for: 2.5.6 is the latest release, issue #141 is open with no PR, and the bug is unchanged on the default branch.

VS Code depends on the same library/version for recursive watching and runs it in a separate forked process that it restarts on exit. This PR brings that pattern to bb.

What

Run @parcel/watcher in a forked child process — installed by the daemon at startup (no flag). When the child dies, stops answering liveness pings, or reports a backend error (EINTR), the parent SIGKILLs it (the OS reclaims the leaked fds/threads wholesale) and respawns it, replaying subscriptions. The bug self-heals instead of taking down the daemon or leaving watches dead.
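
The recovery path described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `WatcherSupervisor`, `ChildHandle`, `RecycleReason`, and the `rescan` argument are hypothetical names, the child is an injected fake rather than a real fork, and the respawn happens synchronously here (the real proxy waits for backoff and for the child to report ready).

```typescript
// Hypothetical shapes for illustration; the PR's actual proxy and IPC messages are not shown here.
type RecycleReason = "exit" | "ping-timeout" | "backend-error";

interface ChildHandle {
  pid: number;
  kill(signal: "SIGKILL"): void;
  subscribe(dir: string, rescan: boolean): void;
}

class WatcherSupervisor {
  // The subscription registry lives in the parent so it survives child restarts.
  private readonly roots = new Set<string>();
  private readonly spawnChild: () => ChildHandle;
  private child: ChildHandle;
  readonly recycles: RecycleReason[] = [];

  constructor(spawnChild: () => ChildHandle) {
    this.spawnChild = spawnChild;
    this.child = spawnChild();
  }

  subscribe(dir: string): void {
    this.roots.add(dir);
    this.child.subscribe(dir, false);
  }

  // All three failure signals funnel into the same path: kill, respawn, replay.
  recycle(reason: RecycleReason): void {
    this.recycles.push(reason);
    // SIGKILL lets the OS reclaim whatever the old child leaked.
    this.child.kill("SIGKILL");
    this.child = this.spawnChild();
    // Replay every registered root on the fresh child, asking for a rescan.
    this.roots.forEach((dir) => this.child.subscribe(dir, true));
  }
}
```

Because callers only talk to the supervisor, a recycle is invisible to them apart from the rescan events.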

  • RootSubscription (the only runtime importer of parcel) calls a backend accessor. The daemon installs the subprocess backend, while tests stay in-process and mock parcel directly. Everything above RootSubscription is unchanged.
  • The parent proxy mirrors parcel's subscribe/unsubscribe over IPC, holds the subscription registry, pings for liveness, and recovers on death/wedge/backend-error.
  • Respawns use capped exponential backoff that resets when a child proves healthy — an EINTR storm can't tight-loop, and the watcher always recovers (never permanently gives up).
  • Replayed subscriptions re-emit the root's current entries to reconcile the restart gap; a per-subscription replay failure is surfaced as recoverable so RootSubscription re-establishes via its existence-gated retry.
  • The child ships as its own daemon bundle (bb-parcel-watcher-child.mjs), in the files whitelist + startup artifact check; it exits itself when the parent IPC disconnects (no orphan). On shutdown the daemon disposes the proxy (SIGKILL child + unref timers) so the event loop drains.
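
As a rough sketch of the capped, resettable backoff mentioned in the list above: the class name, the base delay, and the cap are assumptions chosen for illustration, not values from this PR.

```typescript
// Hypothetical helper; the constants are illustrative, not the PR's values.
class RespawnBackoff {
  private failures = 0;
  private readonly baseMs: number;
  private readonly capMs: number;

  constructor(baseMs: number = 100, capMs: number = 30000) {
    this.baseMs = baseMs;
    this.capMs = capMs;
  }

  // Delay before the next respawn attempt: doubles per consecutive failure, never above the cap.
  nextDelayMs(): number {
    const delay = Math.min(this.capMs, this.baseMs * 2 ** this.failures);
    this.failures += 1;
    return delay;
  }

  // Call once a child has proven healthy, so the next failure starts from the base delay again.
  reset(): void {
    this.failures = 0;
  }
}

const backoff = new RespawnBackoff(100, 1000);
const delays = [1, 2, 3, 4, 5].map(() => backoff.nextDelayMs());
// delays is [100, 200, 400, 800, 1000]
backoff.reset();
const afterReset = backoff.nextDelayMs();
// afterReset is 100
```

The cap bounds how long watching can stay down, and because there is no attempt limit the supervisor keeps retrying rather than giving up.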

QA

Two adversarial multi-agent review passes (decompose → independent skeptics refute each finding → dynamic probes). The first found the happy path solid but surfaced 5 critical/high merge-blockers, since fixed in this PR (each with a regression test):

  1. Critical — published bb-app tarball omitted the child bundle (files + artifact check) → permanently dead watching for npx users.
  2. High — proxy not disposed + ping interval not unref'd → daemon hung on graceful shutdown with the child orphaned.
  3. High — replay bypassed the pathExists gate → a transient missing path during respawn became a permanently dead watch (terminal, no retry).
  4. High — permanent give-up + no-backoff respawn loop → recurring EINTR could kill all watches for the daemon's lifetime.
  5. High — subscribe in the spawn→ready window double-subscribed (orphaned watch + double events; reproduced by a live probe).
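
The fix for item 5 can be illustrated with a small sketch. The names `ReadyGatedProxy`, `onChildReady`, and `onChildGone` are hypothetical, and `send` stands in for the IPC channel to the child; only the `childReady` gate itself is taken from the commit message.

```typescript
// Hypothetical sketch of gating eager sends on childReady.
class ReadyGatedProxy {
  private childReady = false;
  private readonly registry = new Set<string>();
  private readonly send: (dir: string) => void;

  constructor(send: (dir: string) => void) {
    this.send = send;
  }

  subscribe(dir: string): void {
    this.registry.add(dir);
    // Send eagerly only to a ready child. In the spawn-to-ready window,
    // replay-on-ready is the single source, so nothing is subscribed twice.
    if (this.childReady) this.send(dir);
  }

  // Called when a freshly spawned child reports ready: replay the whole registry once.
  onChildReady(): void {
    this.childReady = true;
    this.registry.forEach((dir) => this.send(dir));
  }

  // Called when the child exits or is killed; the next child starts out not ready.
  onChildGone(): void {
    this.childReady = false;
  }
}
```

Without the gate, a subscribe arriving before ready would be sent once eagerly and once more by the replay, which is the double-subscribe the probe reproduced.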

A second focused re-QA of the fixes found no new defects (overall risk: low).

Testing

  • 37 @bb/host-watcher tests (9 proxy tests incl. crash/EINTR-recycle/ping-wedge/backoff/recoverable-replay/no-double-subscribe), 355 @bb/host-daemon, 37 bb-app; typecheck + full daemon build + check-bundles green.
  • Real-fork smokes: event → SIGKILL child → respawn (new pid) → events resume with no caller action; and dispose() reaps the child and the event loop drains on its own (shutdown no longer hangs).

Deferred follow-ups (medium/low, not blocking)

  • After a self-heal the blunt rescan re-emits only immediate children, so nested git loose-refs (refs/heads/*) and gap-window deletions aren't reconciled until the next real fs event (short, self-correcting). A precise fix is parcel's getEventsSince snapshot API.
  • Health-monitor inotify metric still reads the daemon's own /proc; point it at the child pid.
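
For the second follow-up, one possible shape is a helper that counts inotify file descriptors for an arbitrary pid. This is a sketch under assumptions: the function names are hypothetical, it is Linux-only, and it does not reflect how the existing health monitor computes its metric. It relies on the fact that an inotify descriptor's symlink under /proc/<pid>/fd reads `anon_inode:inotify`.

```typescript
import { readdirSync, readlinkSync } from "node:fs";

// On Linux, the /proc/<pid>/fd symlink for an inotify descriptor has this target.
function isInotifyLink(target: string): boolean {
  return target === "anon_inode:inotify";
}

// Counts inotify fds held by `pid`; returns null when /proc/<pid>/fd cannot be read
// (non-Linux host, process gone, or insufficient permissions).
function countInotifyFds(pid: number): number | null {
  const fdDir = "/proc/" + pid + "/fd";
  let names: string[];
  try {
    names = readdirSync(fdDir);
  } catch {
    return null;
  }
  let count = 0;
  for (const name of names) {
    try {
      if (isInotifyLink(readlinkSync(fdDir + "/" + name))) count += 1;
    } catch {
      // The fd was closed between readdir and readlink; skip it.
    }
  }
  return count;
}
```

Passing the child's pid instead of the daemon's own would make the metric reflect the process that actually holds the watches.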

🤖 Generated with Claude Code

ymichael and others added 2 commits June 19, 2026 02:38
Run the recursive filesystem watcher in a forked child process behind the
BB_WATCHER_SUBPROCESS flag (off by default, zero behavior change when off).
When the child dies or stops answering liveness pings, the parent SIGKILLs
it — reclaiming leaked inotify fds and parked threads wholesale, which
in-process recovery cannot do — then respawns it and replays subscriptions,
so a parcel inotify EINTR crash/hang degrades to a transparent restart
instead of taking down the daemon. This mirrors how VS Code isolates the
same library (forked watcher process, restart on exit).

- Swap RootSubscription's direct @parcel/watcher import for a backend
  accessor (the single runtime chokepoint); default stays the real
  in-process watcher.
- Add the subprocess backend: parent proxy (subscription registry,
  liveness ping, respawn + replay, bounded restart budget), child handler,
  JSON-safe IPC protocol, and fork channel.
- Close the restart gap: replayed subscriptions carry a rescan flag so the
  new child re-emits the root's current entries and callers reconcile to
  on-disk state.
- Emit the child as its own daemon bundle (bb-parcel-watcher-child.mjs);
  @parcel/watcher stays an external runtime require.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the BB_WATCHER_SUBPROCESS flag. The daemon now installs the
subprocess-isolated watcher backend at startup, so it is the actual behavior
rather than an opt-in. Unit tests inject a fake watcher and stay on the
in-process backend, so they can still mock parcel directly.

Critically, recover from the bug instead of only containing it: a watch-error
from the child (parcel's shared inotify backend dying on EINTR) now recycles
the whole child. The SIGKILL lets the OS reclaim the leaked inotify fds and
parked threads, and the respawn re-arms every watch on a fresh backend, so
watches self-heal instead of going permanently dead. Respawn/recycle events
are logged through the daemon's pino logger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ymichael changed the title from "Isolate @parcel/watcher in a respawnable child process" to "Run @parcel/watcher in a self-healing child process" on Jun 19, 2026

A multi-agent QA pass on this branch surfaced several merge-blocking defects;
this commit fixes the critical + high-severity ones (all with regression tests).

- PACKAGING (critical): the published bb-app `files` whitelist and the startup
  artifact check both omitted bb-parcel-watcher-child.mjs, so an `npx bb-app`
  install would throw on the first subscription and have permanently dead file
  watching. Add it to both.
- SHUTDOWN (high): the proxy was never disposed and its ping interval was not
  unref'd, so a graceful daemon shutdown hung with the child orphaned — the same
  hang/leak class this change exists to prevent. Dispose the backend in
  shutdownRuntimes and unref the ping/respawn timers.
- REPLAY vs pathExists (high): a transient missing path during respawn produced
  a child subscribe-failed that RootSubscription classified as TERMINAL (no
  retry), permanently killing the watch. Surface replay subscribe failures as
  the recoverable rescan signal so RootSubscription re-establishes via its
  existence-gated, backed-off retry.
- RESTART LOOP (high): a permanent give-up after a fixed restart budget plus a
  no-backoff respawn loop could kill all watches for the daemon's lifetime.
  Replace with capped exponential backoff that resets when a child proves
  healthy, so the watcher always recovers and never permanently gives up.
- DOUBLE-SUBSCRIBE (high): a subscribe landing in the spawn->ready window was
  sent eagerly AND replayed on ready, orphaning one parcel watch (leaked inotify
  fd) and double-delivering events. Gate the eager send on childReady so
  replay-on-ready is the single source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>