Skip to content

Mailroom fd hygiene#1610

Draft
DanGould wants to merge 6 commits into
payjoin:masterfrom
DanGould:mailroom-fd-hygiene
Draft

Mailroom fd hygiene#1610
DanGould wants to merge 6 commits into
payjoin:masterfrom
DanGould:mailroom-fd-hygiene

Conversation

@DanGould

@DanGould DanGould commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Bear with me as I post this draft while it's still in flux and needs a bit of cleaning. Getting it out early so you can see what's up. Follow up to #1608 with actual fixes.

This is the result of adversarial back and forth between Claude and Codex (between environment setup too, so context has been a bit wonky. Commits definitely need cleaning before this gets close to merge)

Pull Request Checklist

Please confirm the following before requesting review:

CONNECT and WebSocket bootstrap tunnels proxied bytes with
copy_bidirectional and no timeout, spawned without any concurrency
limit. Each tunnel pins two file descriptors (the inbound upgraded
socket and the outbound TCP stream), so a stalled or malicious
client could hold descriptors open indefinitely and exhaust the
process limit -- the one relay path whose descriptor use was
unbounded.

Add a shared TunnelLimits holding a semaphore and a timeout. Acquire
an owned permit before any work and reject with 503 when the budget
is exhausted; hold the permit for the tunnel's lifetime. Run the
copy through a timeout that tears the tunnel down once exceeded,
releasing its descriptors. Total tunnel descriptor use is now
bounded by the concurrency cap.
@coveralls

coveralls commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Coverage Report for CI Build 27118534399

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.2%) to 85.137%

Details

  • Coverage decreased (-0.2%) from the base build.
  • Patch coverage: 143 uncovered changes across 5 files (405 of 548 lines covered, 73.91%).
  • 67 coverage regressions across 2 files.

Uncovered Changes

File Changed Covered %
payjoin-mailroom/src/lib.rs 406 298 73.4%
payjoin-mailroom/src/ohttp_relay/bootstrap/ws.rs 28 0 0.0%
payjoin-mailroom/src/ohttp_relay/bootstrap/connect.rs 37 33 89.19%
payjoin-mailroom/src/ohttp_relay/gateway_prober.rs 22 20 90.91%
payjoin-mailroom/src/config.rs 19 18 94.74%
Total (9 files) 548 405 73.91%

Coverage Regressions

67 previously-covered lines in 2 files lost coverage.

File Lines Losing Coverage Coverage
payjoin-cli/src/app/v2/mod.rs 64 51.75%
payjoin-mailroom/src/lib.rs 3 72.35%

Coverage Stats

Coverage Status
Relevant Lines: 15165
Covered Lines: 12911
Line Coverage: 85.14%
Coverage Strength: 360.62 hits per line

💛 - Coveralls

Three paths could consume descriptors without bound.

Cap accepted connections: wrap the listener in a LimitedListener that
acquires an owned semaphore permit per accept, held for the connection
lifetime, so the server sheds rather than exhausts descriptors under
load. Applied to the plain and non-TLS serve paths.

Bound the whole bootstrap tunnel, not just the byte copy: wrap the
post-upgrade tunnel future in one tokio timeout covering the upgrade,
gateway connect, and copy, and drop the now-unused
copy_bidirectional_with_timeout helper.

Bound gateway probes: give the prober a probe timeout and wrap its
outbound request, so a hung gateway cannot pin an outbound descriptor.
@DanGould DanGould force-pushed the mailroom-fd-hygiene branch from 33bf751 to 5688896 Compare June 6, 2026 16:34
DanGould added 3 commits June 7, 2026 18:18
The ACME (TLS-direct, :443) serve path used axum-server's bind and so
bypassed the LimitedListener cap, leaving it able to exhaust
descriptors after the plain path was bounded.

Replace it with a hand-rolled AcmeListener over tokio_rustls_acme's
Incoming plus a LimitedTcpIncoming that applies the same accepted-
connection semaphore, so the ACME path sheds under load like the
others. The rewrite drops axum-server and switches cert renewal to
futures::StreamExt, so remove the now-unused axum-server and
tokio-stream dependencies and the tokio-rustls-acme "axum" feature.
The directory exhausted its descriptors by accumulating idle inbound
connections: axum::serve sets no keep-alive idle timeout and no
connection cap, so connections a peer relay leaves open are never
reclaimed. Because OHTTP hides end users behind relays, a single peer
can pin every descriptor and wedge the listener.

Replace axum::serve with a hyper_util auto::Builder accept loop so
per-connection HTTP state is reachable, and set a header-read timeout
that closes connections idle between requests. It is connection-state
aware: it never fires during an in-flight handler, so a long-poll
awaiting its mailbox (up to config.timeout) is not interrupted.
Keep-alive stays enabled so planned relay-directory HTTP/2
multiplexing remains possible.

Bound concurrency with a configurable global cap that reserves
descriptor headroom below the soft RLIMIT_NOFILE (read via rustix).
Export active-connection and shed counters so accumulation is
observable. Limits are configurable via the [connection] section.
The global connection cap alone lets one peer relay pin every slot and
starve the others. Add a per-source cap (default 1024) keyed on the
connecting peer IP -- IPv4-mapped addresses normalized to IPv4, IPv6
aggregated by a configurable prefix (default /64).

This is peer-relay fairness at the transport layer. It is not end-user
rationing -- end users are hidden behind OHTTP relays, so the directory
only ever sees peer relays -- and it is not the deferred BIP 77
anti-abuse layer. Over-cap connections are shed without consuming a
global permit, and counted in the shed metric.
@DanGould DanGould force-pushed the mailroom-fd-hygiene branch from 5688896 to d7b59f2 Compare June 8, 2026 03:50
The per-source default of 1024 pins one peer relay at the same
threshold that originally wedged the directory: the incident peak was
~1009 connections from a single relay (pj.achow101.com) plus several
thousand more queued in the listen backlog, so real demand from one
relay reached the low thousands. Under OHTTP the directory sees only
the relay, not the end users behind it, so a per-source cap throttles
a legitimate aggregating relay rather than an abuser.

Until relay-directory HTTP/2 multiplexing lands -- which collapses a
relay's footprint from one descriptor per receiver to a handful --
default the per-source cap to the global cap (off) and raise the
global cap to 32768: well above observed demand, still far below the
65536 descriptor ceiling, so the directory sheds gracefully instead of
wedging and no trusted relay is clipped.
@DanGould DanGould mentioned this pull request Jun 11, 2026
2 tasks
@DanGould DanGould force-pushed the mailroom-fd-hygiene branch from 0d0d0f7 to a777a52 Compare June 11, 2026 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants