fix(inputs): close libuv loop after run() returns (fixes flaky macOS SIGBUS)#789
fix(inputs): close libuv loop after run() returns (fixes flaky macOS SIGBUS)#789leoparente wants to merge 3 commits into
Conversation
…nc callback The async stop callback (run on the loop thread inside uv_run) called _io_loop->close() (uv_loop_close). Closing the loop while uv_run is still on the stack frees structures that uv__io_poll keeps iterating, so when the callback returns uv__io_poll dereferences freed memory and the process dies with SIGSEGV/SIGBUS. It is timing-dependent, which is why CI saw it as a flaky macOS crash in unit-tests-input-dnstap / unit-tests-input-flow (deterministic on the macOS 26 runner). Stop the loop in the callback (so uv_run returns) but defer uv_loop_close() to the io thread, after run() returns. Applied to all four libuv-based inputs (dnstap, flow, netprobe receiver + stream).
There was a problem hiding this comment.
Pull request overview
This PR fixes a libuv lifecycle bug in several uvw/libuv-based inputs by ensuring the event loop is only closed after run() returns, avoiding loop closure from within a running libuv callback (a known cause of flaky macOS SIGBUS/SIGSEGV crashes).
Changes:
- Remove
*_io_loop->close()calls fromasync_eventstop callbacks (whileuv_run()is still on the stack). - Add
*_io_loop->close()on the IO thread immediately after*_io_loop->run()returns, across all affected inputs. - Add inline comments documenting the libuv constraint and why closure must be deferred.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/inputs/netprobe/PingProbe.cpp | Defers loop close to IO thread after run() returns; stop callback only stops loop. |
| src/inputs/netprobe/NetProbeInputStream.cpp | Defers loop close to IO thread after run() returns; stop callback stops/closes timer then stops loop. |
| src/inputs/flow/FlowInputStream.cpp | Defers loop close to IO thread after run() returns; stop callback stops/closes timer & UDP handle then stops loop. |
| src/inputs/dnstap/DnstapInputStream.cpp | Defers loop close to IO thread after run() returns for both TCP and unix-socket loops; stop callbacks stop/close server + timer then stop loop. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
LCOV of commit
|
…loop is quiescent Follow-up to the loop-close fix, addressing review feedback: when stop() runs while a dnstap client is connected, the server/timer were closed but the accepted client handles in _tcp_sessions/_unix_sessions were not, so the loop was not quiescent and uv_loop_close() would return EBUSY (and uvw frees the uv_loop_t regardless). Add FrameSessionData::close_handle() and close every connected client on the loop thread in the async stop callback before stopping the loop. Also close the PingReceiver heartbeat timer, which was never closed. dnstap and flow loops are now provably quiescent at close() time. (NetProbe probe handles are still stopped from the control thread via probe->stop(); a loop-thread shutdown for those is a separate pre-existing concern.)
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d0e4005f4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…op is quiescent Address review feedback on the netprobe shutdown path. probe->stop() was called from the control thread and did not close every uvw handle: TcpProbe left _client and _interval_timer open, and PingProbe left _internal_timer open. Those handles stayed active in the loop's handle queue, so when the io thread reached _io_loop->close() uv_loop_close() returned UV_EBUSY (and uvw frees the uv_loop_t regardless). Make each probe's stop() close all handles it owns (guarded against double-close), and invoke probe->stop() from inside the async stop callback on the loop thread instead of from the control thread. This both leaves the loop quiescent for close() and fixes the pre-existing cross-thread uv_close on probe handles. All three input loops (dnstap, flow, netprobe) are now quiescent at close() time.
|
@codex review |
Problem
The macOS CI job (
unit-tests-mac) flakily fails the Test step withunit-tests-input-dnstapand/orunit-tests-input-flowcrashing with a Bus error / SIGSEGV. It is non-deterministic on the runner and became much more frequent on the macOS 26 image.Examples:
Catch2 randomizes test-case order, so the reported crashing case varies between runs — a sign the fault is in shared lifecycle code, not one test's logic.
Root cause
Reproduced locally on macOS 26 (30/30 crashes) and captured under lldb:
EXC_BAD_ACCESS (address=0x10)inuv__io_poll←uv_runDnstapInputStream::stop()→std::thread::join()Each libuv-based input's
async_eventstop callback runs on the loop thread, insideuv_run, and called_io_loop->close()(uv_loop_close).uv_loop_close()is only valid afteruv_run()has returned — calling it from within a running callback frees the loop's internal poll structures, so when the callback returnsuv__io_pollkeeps iterating freed memory and the process dies. It is timing-dependent, hence the flakiness.Fix
Keep
_io_loop->stop()in the async callback (souv_run()returns promptly, even with active client handles), but defer_io_loop->close()to the io thread, afterrun()returns — never inside the running loop. The same anti-pattern existed in all four libuv-based inputs and is fixed in each:src/inputs/dnstap/DnstapInputStream.cpp(tcp + unix socket paths)src/inputs/flow/FlowInputStream.cppsrc/inputs/netprobe/NetProbeInputStream.cppsrc/inputs/netprobe/PingProbe.cpp(
netprobecarried the same latent bug but wasn't crashing yet; fixed for consistency.)Verification
Rebuilt and stress-ran the affected binaries from the ctest working directory on macOS 26:
unit-tests-input-dnstapunit-tests-input-flowunit-tests-input-netprobeNo behavior change beyond the lifecycle ordering; the streams still stop cleanly and join their io thread.
🤖 Generated with Claude Code