Skip to content

feat(sync-service): adaptive poll-based wait_until under StatusMonitor congestion#4376

Draft
alco wants to merge 13 commits into
mainfrom
alco/poll-wait-instead-of-genserver-call
Draft

feat(sync-service): adaptive poll-based wait_until under StatusMonitor congestion#4376
alco wants to merge 13 commits into
mainfrom
alco/poll-wait-instead-of-genserver-call

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented May 20, 2026

Summary

Implements #4371: replace the mailbox-based StatusMonitor.wait_until/3 not-ready path with adaptive per-process polling once the StatusMonitor's waiter set crosses a congestion threshold. Bottleneck 2 of the thundering-herd umbrella (#4266).

  • New Electric.PollWait primitive: per-process bounded polling with exponential backoff + jitter.
  • New congestion flag on StatusMonitor: an ETS-backed boolean that flips to true once MapSet.size(state.waiters) >= 100 and clears back to false once the set drains to 0. Wired into all three transitions: enqueue (handle_call({:wait_until, …})), bulk drain (maybe_reply_to_waiters/1), single-timeout drain (:timeout_waiter handle_info).
  • Adaptive wait_until/3 dispatch: the fast path is unchanged. The not-ready and sleeping+block_on_conn_sleeping branches consult congested?/1.
  • StatusMonitor mailbox growth is now bounded by the congestion threshold (100) regardless of total concurrent request volume. Polling waiters never enqueue into state.waiters.

Fast path for already-active stacks remains a single ETS read in the caller process.

Test coverage highlights:

  • PollWait: ready-immediately, timeout, monotonic backoff growth under per-call opts, jitter never produces zero/negative sleeps, :infinity timeout terminates only on ready.
  • StatusMonitor congestion flag: flips at threshold via the natural handle_call enqueue, clears via readiness-drain, clears via :timeout_waiter-drain, stays unset below threshold, gracefully returns false when the table doesn't exist.
  • StatusMonitor adaptive dispatch: polling returns {:ok, :active} on readiness flip, returns {:ok, :read_only} on metadata-ready, returns {:error, _} on timeout, sleeping branch short-circuits before the congestion check, uncongested callers continue to use the GenServer.call path.

alco added 9 commits May 20, 2026 14:34
…r congestion

When the StatusMonitor's waiter set crosses the congestion threshold,
new wait_until/3 callers now poll on PollWait.until/3 against
service_status/1 instead of enqueuing into the StatusMonitor mailbox.
This bounds StatusMonitor mailbox growth under burst load while keeping
the low-latency GenServer.call path for the uncongested common case.

The fast path (:active short-circuit, :waiting+:read_only short-circuit,
:sleeping returning :conn_sleeping when not blocking) is unchanged.
Congestion is consulted only on the not-ready and sleeping+blocking
branches via the shared dispatch_wait/3 chokepoint.

The ETS status table is switched from :protected to :public so tests
can force the congestion flag directly. Reads (congested?/1) already
went through ETS outside the GenServer process.
@alco alco added the claude label May 20, 2026
@netlify
Copy link
Copy Markdown

netlify Bot commented May 20, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit e344565
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a0dd258ef5cc7000888aa83
😎 Deploy Preview https://deploy-preview-4376--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.56%. Comparing base (1ab43f5) to head (7dd7cc3).
⚠️ Report is 14 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4376      +/-   ##
==========================================
+ Coverage   55.84%   59.56%   +3.72%     
==========================================
  Files         245      302      +57     
  Lines       24847    29061    +4214     
  Branches     6878     7859     +981     
==========================================
+ Hits        13876    17311    +3435     
- Misses      10957    11733     +776     
- Partials       14       17       +3     
Flag Coverage Δ
packages/agents 67.52% <ø> (-3.41%) ⬇️
packages/agents-mcp 77.54% <ø> (?)
packages/agents-runtime 81.23% <ø> (-0.05%) ⬇️
packages/agents-server 74.04% <ø> (+0.11%) ⬆️
packages/agents-server-ui 6.53% <ø> (-0.13%) ⬇️
packages/electric-ax 42.61% <ø> (ø)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 94.39% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 59.56% <ø> (+3.72%) ⬆️
unit-tests 59.56% <ø> (+3.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Claude Code Review

Summary

Iteration-2 review. Most of the iteration-1 feedback is addressed (dedup with check_level/2, read_concurrency: true, @spec on congested_threshold/0, @doc false on the test-only cast). The PR also picked up two new pieces of work since iteration 1: do_wait_until was refactored to route through PollWait, and the congestion threshold was bumped from 100 → 1000. Those new pieces are worth a second look.

Previous Review Status

Addressed:

  • Dedup of polling closure vs check_level/2 (status_monitor.ex:259-264) — the polling closure now delegates to check_level/2 so the readiness contract can't drift.
  • read_concurrency: true on the ETS table (status_monitor.ex:41) — captured in code; no longer relying on cross-PR ordering with feat(sync-service): cheap admission control (Tier 0) with post-load_shape reclassification #4359.
  • @spec congested_threshold/0 (status_monitor.ex:80).
  • @doc false on the test-only cast (status_monitor.ex:356-360).

Still open from iteration 1:

  • Hardcoded @congested_threshold — bumped 100 → 1000, but the value is still a module attribute, and congested_threshold/0 is now @doc false, which makes it harder, not easier, to surface the knob in Electric.Config. If 1000 is the post-empirical default, that's fine — but please leave an issue/TODO referencing the configuration work so it doesn't get forgotten.
  • PollWait opts not validated with NimbleOptions (poll_wait.ex:22-28) — still raw Keyword.get/2, so a typo like :inital_interval silently uses the default. Worth tightening if this primitive is meant to be reused (per the moduledoc).
  • congested?/1 rescue comment (status_monitor.ex:73-77) — the rescue is only needed for the missing-table case; a one-line comment would help future readers.

Issues Found

Important (Should Fix)

1. do_wait_until no longer bounds total wait by the requested timeout.

File: packages/sync-service/lib/electric/status_monitor.ex:229-254

Issue: The previous maybe_retry_wait_until/5 recursion decremented the inner timeout on each retry (remaining_timeout = timeout - @spin_prevention_delay), so the cumulative wait was bounded by the original timeout. The new implementation captures timeout in check_fn's closure and passes the original, undecremented value to GenServer.call(pid, {:wait_until, level, timeout}, :infinity) on every retry attempt.

Impact: When the monitor takes K retries to come up during boot, total wait ≈ K * @spin_prevention_delay + timeout. For a caller asking "wait up to 50ms," a 1-retry boot race produces ~60ms total; PollWait can't interrupt the in-flight GenServer.call once its own deadline elapses. This isn't called out in the changeset and isn't covered by tests (none of the new tests exercise the do_wait_until retry path).

Suggested fix: Compute a per-attempt remaining timeout from the PollWait deadline (or pass System.monotonic_time(:millisecond) through the closure and subtract). Alternatively, document the bound explicitly and accept the overshoot — but the changeset should mention it.

2. do_wait_until refactor (patch 12) is scope creep relative to the PR description.

File: packages/sync-service/lib/electric/status_monitor.ex:229-254

Issue: The PR title/body/changeset describe a single change: route the not-ready branch to PollWait when the GenServer is congested. Patch 12 also rewrites the non-congested retry loop (do_wait_until) on top of PollWait. That's a separate behavioral change (see #1 above) tucked under a "refactor" commit.

Impact: Reviewers landing on the changeset/PR description won't know to look at the non-congested path. If the timeout overshoot in #1 turns out to be a regression, the bisect target will be misleading.

Suggested fix: Either split patch 12 into its own PR with its own changeset entry, or update the PR description + changeset to call out the do_wait_until rewrite explicitly, along with the timeout-bound behavior change.

Suggestions (Nice to Have)

3. PollWait's 10ms floor when jitter <= 0.0 is surprising for non-StatusMonitor callers.

File: packages/sync-service/lib/electric/poll_wait.ex:56-57

The floor was raised 1 → 10 in patch 11, presumably to match @spin_prevention_delay for the new do_wait_until integration. But callers passing initial_interval: 5, jitter: 0.0 now get 10ms sleeps, not 5ms. A reader of poll_wait_test.exs:40 already runs into this — the test's comment "With 0 jitter and a 2.0 factor: 5, 10, 20, 20, 20, ..." describes intervals that don't actually occur (they're 10, 10, 20, 20, 20, ...). The test still passes because the assertions are lower bounds, but the comment is now misleading.

Suggested fix: Either document the floor in the moduledoc (it's a hidden surprise for callers configuring small initial intervals), or apply the floor only in the StatusMonitor caller (e.g. pass min_sleep: 10 explicitly). Also update the stale comment in poll_wait_test.exs:59.

4. monitor_unavailable_reason/1 does a second lookup that can disagree with the inner failures.

File: packages/sync-service/lib/electric/status_monitor.ex:281-287

After PollWait times out, this function does another monitor_lookup/1 to construct the error message. If the monitor came up between the last inner failure and this lookup, the caller gets "Stack ... has terminated" when in fact the inner calls were just "not found yet." The OLD code had similar wording, so not a regression — but the redundant lookup could be replaced by threading the last failure reason through the closure (e.g. via an accumulator), which would also remove a small race window.

5. Test-only cast still part of public dispatch.

File: packages/sync-service/lib/electric/status_monitor.ex:356-360

@doc false was added (good), but the clause is still compiled into prod. Consider if Mix.env() != :prod or driving the same behavior through Repatch. Not blocking — pragmatic trade-off, same as iteration 1.

Issue Conformance

PR body covers the adaptive-polling work but does not mention the do_wait_until refactor (patch 12) or the threshold bump to 1000. The changeset has the same gap. Recommend updating both to reflect the actual scope of the diff.

The two related issues (#4266 umbrella, #4371 specific) are referenced in the PR body but not linked formally — adding Closes #4371 would help.


Review iteration: 2 | 2026-05-21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant