Orphan-reap responsive-but-unclaimed shellpers at startup to fix PTY pool exhaustion #1038

@amrmelsayed

Background

Surfaced from PR #1031 review and issue #1030. The contributor's diagnosis of the PTY-exhaustion symptom (posix_spawnp failed / Invalid shellper info JSON after enough Tower restart cycles) is correct, but the proposed fix in #1031 (kill all scoped shellpers on tower stop) breaks the documented "Tower stop does NOT kill shellpers" contract that the persistence work in #274, #832, PR #999, and #991 was built on.

This issue tracks the right place to fix the leak: the startup orphan-reaper.

The leak

killOrphanedShellpers() in packages/codev/src/agent-farm/servers/tower-server.ts (~line 411) runs at Tower startup after reconcileTerminalSessions(). It scans shellper-main processes scoped to this Tower's socketDir and currently:

  • Kills shellpers whose socket doesn't respond → correct, they're corpses.
  • Spares shellpers whose socket responds, regardless of whether reconciliation claimed them → the leak.
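The two rules above can be restated as a tiny decision function. This is a minimal sketch, not the code in tower-server.ts: `ShellperProbe`, `Verdict`, and `currentVerdict` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shape for illustration; not the real type in tower-server.ts.
interface ShellperProbe {
  pid: number;
  responsive: boolean; // the shellper's socket answered a probe
  claimed: boolean;    // reconciliation attached it to an active session
}

type Verdict = "kill" | "spare";

// Behaviour described above: the socket probe alone decides, and
// `claimed` is never consulted.
function currentVerdict(p: ShellperProbe): Verdict {
  return p.responsive ? "spare" : "kill";
}

const corpse: ShellperProbe = { pid: 101, responsive: false, claimed: false };
const inUse: ShellperProbe = { pid: 102, responsive: true, claimed: true };
const leaked: ShellperProbe = { pid: 103, responsive: true, claimed: false };

for (const p of [corpse, inUse, leaked]) {
  console.log(p.pid, currentVerdict(p));
}
```

The third case is the leak: pid 103 is spared even though nothing claimed it, so its PTY stays allocated across restarts.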

The "spare responsive shellpers" guard was added in #341 because a corrupt or empty SQLite would have made every live shellper look unclaimed, and the reaper would have nuked an entire active workspace. The safe fallback was "if it responds, assume it belongs to somebody, don't touch it."

That fallback is overly conservative. A responsive shellper from a Tower instance that no longer exists (workspace deleted, prior run that left files behind, socketDir overlap across users on a multi-user box, etc.) will never be claimed by any future Tower either. Each one holds one macOS PTY. macOS has a small fixed PTY pool — these accumulate and eventually exhaust it.

Proposed fix

After reconcileTerminalSessions() completes and the _reconciling barrier in packages/codev/src/agent-farm/servers/tower-terminals.ts is dropped (so that no client reattach can race with the reap), any scoped responsive shellper still not in the active sessions map is provably orphaned: nothing in Tower's reconciliation path will claim it later.

Two options:

  1. Inline includeResponsive: true second sweep. Add a parameter to killOrphanedShellpers(). The first sweep stays as-is (unresponsive-only, runs before _reconciling drops). The second runs after reconciliation and the barrier-drop, with includeResponsive: true, killing whatever's left in scope that's still unclaimed.

  2. Log-only at startup + explicit operator verb. Don't auto-kill on startup. Log a warning for each detected orphan. Expose afx terminal gc (or similar) for operators to run when they see the symptom. Lower-risk: preserves #341's conservatism and gives operators an explicit verb to reclaim the orphans.
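Both options can be viewed as one selection step with two switches. The sketch below is only an illustration under assumptions: `planSweep`, `ScopedShellper`, `SweepOptions`, and the `logOnly` flag are hypothetical, and it assumes active sessions can be looked up by shellper pid. Only `includeResponsive` comes from option 1 as proposed.

```typescript
// Hypothetical types; the real reaper works on SessionManager data.
interface ScopedShellper { pid: number; responsive: boolean; }

interface SweepOptions {
  includeResponsive: boolean; // option 1: also consider responsive shellpers
  logOnly: boolean;           // option 2: warn about responsive orphans, never kill them
}

interface SweepPlan { kill: number[]; warn: number[]; }

// Decide what a sweep would do; performs no signalling itself.
function planSweep(
  scoped: ScopedShellper[],
  activePids: ReadonlySet<number>, // assumption: sessions are keyed by pid
  opts: SweepOptions,
): SweepPlan {
  const plan: SweepPlan = { kill: [], warn: [] };
  for (const s of scoped) {
    if (activePids.has(s.pid)) continue;   // claimed by reconciliation
    if (!s.responsive) {                   // corpse: reaped in every mode
      plan.kill.push(s.pid);
      continue;
    }
    if (!opts.includeResponsive) continue; // the #341 guard
    (opts.logOnly ? plan.warn : plan.kill).push(s.pid);
  }
  return plan;
}

const scoped: ScopedShellper[] = [
  { pid: 1, responsive: false },
  { pid: 2, responsive: true },
  { pid: 3, responsive: true },
];
const active = new Set<number>([2]);

// First sweep (today's behaviour), then the option 1 second sweep.
const first = planSweep(scoped, active, { includeResponsive: false, logOnly: false });
const remaining = scoped.filter(s => !first.kill.includes(s.pid));
const second = planSweep(remaining, active, { includeResponsive: true, logOnly: false });
console.log(first.kill, second.kill);
```

Keeping the decision separate from the kill makes option 2 a one-flag change and lets the same plan back an operator command such as the proposed afx terminal gc.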

Option 1 is the better default for a healthy long-run state; option 2 is the safer first step. Either way, the implementation reuses SessionManager.findShellperProcesses() and the existing socketDir marker scoping. The actual kill is a process-group SIGTERM → 5s → SIGKILL (PR #1031's killScopedShellpers() is a good shape if reused).
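The SIGTERM → 5s → SIGKILL escalation could look roughly like the following. It is a sketch under assumptions, not PR #1031's killScopedShellpers(): `terminateGroup`, `groupAlive`, and the injected `SendSignal` callback are hypothetical. In Node the callback would typically wrap process.kill, where a negative pid addresses a process group on POSIX systems and signal 0 only checks for existence.

```typescript
type Signal = "SIGTERM" | "SIGKILL" | 0;
// Injected so the sketch has no hard dependency on Node globals.
// In Node: (pid, sig) => { process.kill(pid, sig); }
type SendSignal = (pid: number, signal: Signal) => void;

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Signal 0 delivers nothing; the call throws if no process is reachable.
// Simplification: any error is treated as "gone". A real implementation
// should distinguish a permission error (still alive) from "no such process".
function groupAlive(pgid: number, send: SendSignal): boolean {
  try {
    send(-pgid, 0);
    return true;
  } catch {
    return false;
  }
}

async function terminateGroup(
  pgid: number,
  send: SendSignal,
  graceMs = 5000,
  pollMs = 100,
): Promise<"exited" | "killed"> {
  try {
    send(-pgid, "SIGTERM"); // polite request to the whole group
  } catch {
    return "exited";        // nothing left to signal
  }
  const deadline = Date.now() + graceMs;
  while (Date.now() < deadline) {
    if (!groupAlive(pgid, send)) return "exited";
    await sleep(pollMs);
  }
  if (!groupAlive(pgid, send)) return "exited";
  try {
    send(-pgid, "SIGKILL"); // grace period elapsed
  } catch {
    return "exited";        // it exited between the check and the kill
  }
  return "killed";
}
```

Polling during the grace period lets the reaper return as soon as the group exits instead of always waiting the full 5 seconds.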

Testing surface

  • Corrupt / empty SQLite at startup — reconciliation produces an empty active map; reaper must NOT kill responsive shellpers in this case (so the new pass should run only when reconciliation completed cleanly, not when it raised).
  • Partial reconciliation — some sessions reattached, some failed. Reaper kills only the unclaimed-after-success set.
  • Race against client reattach — verify the _reconciling barrier in tower-terminals.ts keeps clients out until the reap completes.
  • Multi-Tower-on-same-machine — verify socketDir scoping prevents cross-Tower interference.
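The first two cases reduce to a gate on how reconciliation ended. A minimal sketch of that gate, which a unit test could exercise directly, follows; `ReconcileOutcome` and `responsiveReapSet` are hypothetical names, and it assumes reconciliation reports failure explicitly rather than returning an empty map.

```typescript
// Hypothetical result type: reconciliation either completed cleanly
// with the set of claimed pids, or raised.
type ReconcileOutcome =
  | { ok: true; activePids: Set<number> }
  | { ok: false; error: string };

// Which responsive shellpers may the new pass reap?
function responsiveReapSet(outcome: ReconcileOutcome, scopedResponsive: number[]): number[] {
  if (!outcome.ok) return []; // reconciliation raised: keep the #341 behaviour
  const active = outcome.activePids;
  return scopedResponsive.filter(pid => !active.has(pid));
}

const responsive = [10, 11, 12];

// Corrupt SQLite: reconciliation raised, so nothing responsive is reaped.
const afterFailure = responsiveReapSet({ ok: false, error: "db corrupt" }, responsive);

// Partial reconciliation: only pid 11 was reattached.
const afterPartial = responsiveReapSet({ ok: true, activePids: new Set([11]) }, responsive);

console.log(afterFailure, afterPartial);
```

An empty database that reconciles "cleanly" would still yield an empty active set under this gate, so the test suite should pin down how that situation is reported.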

References

Out of scope

Behavior change to afx tower stop — keep the existing contract (Tower stops, shellpers survive, next start reattaches). This is purely a startup-time orphan-detection improvement.

    Labels

    area/tower (Area: Tower server / agent farm CLI)
