Skip to content

feat(sync-service): cheap admission control (Tier 0) with post-load_shape reclassification#4359

Open
alco wants to merge 24 commits into
mainfrom
alco/cheap-admission-control
Open

feat(sync-service): cheap admission control (Tier 0) with post-load_shape reclassification#4359
alco wants to merge 24 commits into
mainfrom
alco/cheap-admission-control

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented May 19, 2026

Summary

Two changes to ServeShapePlug's pipeline that together make admission control cheap to run and accurately bound the resource it's supposed to protect.

  1. Cheap Tier 0 classifier. :resolve_existing_shape is deleted. The admission classifier now reads a single ETS table (shape_meta_table for the stack) via ShapeStatus.has_shape_handle?/2. No SQLite, no GenServer call. Admission moves to the front of the plug pipeline so rejected requests don't pay for validation, shape lookup, or any of the other per-request setup.
  2. Reclassify at snapshot start. Once the Snapshotter signals that the snapshot has started (i.e. it has obtained a connection from the snapshot pool), the in-flight permit moves from :initial to :existing. The :initial cap now bounds requests that are in validate-and-load or contending for a snapshot-pool connection, not requests currently streaming their response.

Pipeline change

Before:

:resolve_existing_shape   ← SQLite read (ShapeDb), runs before admission
:check_admission          ← acquires :initial or :existing permit
:put_resp_content_type
:parse_body
:validate_request
:reject_subquery_shape_compaction_request
:load_shape
:serve_shape_response     ← streams body; permit released in after-clause

After:

:put_resp_content_type                  ← runs first so admission 503s keep Content-Type
:check_admission                        ← cheap classifier (1× :ets.member)
:parse_body
:validate_request
:reject_subquery_shape_compaction_request
:load_shape
:hold_initial_until_snapshot_started    ← :initial-bucket requests park here
:reclassify_admission_kind              ← :initial → :existing swap
:serve_shape_response                   ← streams body

Why this is cheaper

:resolve_existing_shape called into Electric.Shapes.fetch_handle_by_shape/2, which falls through to ShapeDb.handle_for_shape/2 — a SQLite read on the read pool. Issue #4266 calls this out as the first bottleneck on the pre-admission path: every request, including the ones admission is about to reject, performs that read. Under thundering herd, the read pool saturates before admission can shed anything.

The new classifier is :ets.member(shape_meta_table(stack_id), handle). One lookup, no pool, no GenServer.

The rescue on ArgumentError covers the brief window during stack startup where the per-stack meta table hasn't been created yet — those requests classify as :initial, which is the safer bucket.

What this PR addresses, what it doesn't

ShapeCache (resolved within this PR). Before this branch, every shape request — handle or not — could end up in a GenServer.call(ShapeCache, {:create_or_wait_shape_handle, …}, 30_000) on a cache miss, with no upstream cap on concurrent callers. With cheap admission: :existing requests short-circuit on the validate_shape_handle ETS hash check and never touch ShapeCache; only :initial requests can hit {:create_or_wait_shape_handle, …}, and their population is hard-capped by the :initial bucket. ShapeCache's max instantaneous concurrent-caller count is now max_concurrent_requests.initial rather than "however many clients showed up." The 30s @call_timeout now exists only to absorb pathological filesystem tail latencies, not to throttle queue depth.

ShapeCache mailbox-overfill on coalesced creates (related concern, tracked in #4372). Even with the :initial bucket capping the population, all :initial requests for the same uncached shape still pile into the ShapeCache mailbox: handler 1 does the actual creation, handlers 2..N short-circuit via the SQLite critical fetch, but every caller occupies a mailbox slot for the duration of handler 1's work plus its own dispatch latency. #4372 proposes a GenServer-owned per-shape ETS lock observed by callers (via :ets.member before they send the call) plus a PollWait fallback — same pattern as #4371, with tighter backoff defaults to match shape-creation timescales.

StatusMonitor (out of scope, tracked in #4371). Api.validate_params calls StatusMonitor.wait_until for every request. The fast path is an ETS read and is unaffected. The slow path — stack starting, post-deploy, recovering from sleep — funnels every concurrent waiter through the StatusMonitor mailbox and replies serially on the readiness flip. The :initial bucket caps inflight requests across the validate+load_shape span but does not eliminate that single-process reply burst. Acceptable for now; #4371 has the per-process polling design with the adaptive congestion flag.

EtsInspector (out of scope, tracked in #4370). Used in validate_request to build %Shape{}. Warm cache is in-process ETS. Cold cache and PG-degraded paths serialize every concurrent request through the inspector mailbox, and crucially, failed DB lookups are not cached — so under PG pool exhaustion the inspector re-attempts the same failing query for each queued message. The :initial bucket bounds the herd size but does not stop the inspector from re-burning the DB pool on every retry.

Why reclassification happens at snapshot start

release_admission_permit/0 runs in call/2's after clause — i.e., after Api.Response.send_stream/2 has finished draining the body. For an initial snapshot of a non-trivial shape, or a long-poll, that's seconds to minutes.

So before this PR, :initial permits were held for the entire request lifetime:

t=0    t=5ms   t=10ms   t=50ms          t=2000ms
│      │       │        │               │
│ check_admission (:initial +1)
│      │ validate
│              │ load_shape (often coalesces; may spawn Snapshotter)
│                       │ start streaming chunks
│                                       │ done streaming, :initial -1
└────────────── :initial held for 2000ms ─────────────────────────┘

After this change:

t=0    t=5ms   t=10ms   t=50ms   t=N ms          t=2000ms
│      │       │        │        │               │
│ check_admission (:initial +1)
│      │ validate
│              │ load_shape (returns once Consumer + Snapshotter spawned)
│                       │ await_snapshot_start (Snapshotter checks out PG conn)
│                                │ try_swap (:initial -1, :existing +1)
│                                │ start streaming chunks
│                                                │ done streaming, :existing -1
└──────── :initial 50ms + Snapshotter-pool wait ─┤
                                 └────── :existing rest of lifetime ────┘

:initial residency drops from "entire request lifetime" to "validate + load_shape + Snapshotter pool checkout." Two things matter about that last span:

1. For coalesced handlers on a popular shape, the wait is essentially free. ShapeCache.await_snapshot_start short-circuits on ShapeStatus.snapshot_started? (one ETS lookup) when the snapshot is already running. So requests on a warm shape add at most one ETS read between load_shape and the swap.

2. For requests that trigger genuine shape creation, the wait is what makes :initial protect the snapshot pool. When the Snapshotter is queued waiting for a PG connection from the snapshot pool, the handler stays parked in :initial. Once the Snapshotter actually checks out a connection and casts {:snapshot_started, _}, the handler reclassifies and the next handler can enter. So :initial is no longer just "validate-and-load throughput" — it's also "number of requests that have caused a Snapshotter to either start or wait on the pool." Under PG-pool saturation, new requests get rejected at the admission gate (503 with Retry-After) instead of piling up as an unbounded Postgrex checkout queue.

The downstream get_merged_log_stream call no longer issues its own await_snapshot_start: the plug's result is propagated via request.snapshot_status and consumed inline. This avoids racing the consumer's stop_and_clean on the snapshot-failure path, which would otherwise turn legitimate SnapshotError 503s into must_refetch 409s.

Permit-release safety against handler death

Permits are stashed in Process.put(@admission_permit_key, …) at acquire/reclassify time and released in call/2's after clause. The after clause covers all in-process termination paths (normal return, raise, throw, halt, mid-stream client disconnect). The remaining concern — being killed mid-flight by an external EXIT signal — is absorbed by Thousand Island's connection handler GenServer, which sets Process.flag(:trap_exit, true) (deps/thousand_island/lib/thousand_island/handler.ex:343). Trapping converts incoming EXIT signals to mailbox messages, so the handler finishes the current request normally, releases the permit, and only then terminates on the next receive. The remaining gaps (:brutal_kill after shutdown timeout, externally-sent :kill) only happen during BEAM shutdown, where the ETS counters reset along with the VM.

Related

Closes #4291 — AdmissionControl logic becomes expensive under thundering herd

🤖 Generated with Claude Code

alco and others added 2 commits May 19, 2026 15:27
Captures the design and PR-description draft for Tier 0 cheap admission
plus post-load_shape reclassification. Plan tracks #4291.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Atomically moves an in-flight permit between buckets. Returns
{:error, :overloaded} when the destination is at cap and leaves the source
unchanged. Used by the upcoming reclassification step in ServeShapePlug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.73%. Comparing base (eac6dea) to head (a39f565).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4359      +/-   ##
==========================================
+ Coverage   59.18%   59.73%   +0.55%     
==========================================
  Files         305      290      -15     
  Lines       28444    28579     +135     
  Branches     7493     7766     +273     
==========================================
+ Hits        16834    17073     +239     
+ Misses      11596    11489     -107     
- Partials       14       17       +3     
Flag Coverage Δ
electric-telemetry ?
elixir ?
packages/agents 70.92% <ø> (+0.45%) ⬆️
packages/agents-mcp 77.54% <ø> (ø)
packages/agents-runtime 81.27% <ø> (+1.29%) ⬆️
packages/agents-server 73.93% <ø> (+0.77%) ⬆️
packages/agents-server-ui 6.66% <ø> (+0.62%) ⬆️
packages/electric-ax 42.61% <ø> (+0.23%) ⬆️
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 94.39% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 59.73% <ø> (+0.76%) ⬆️
unit-tests 59.73% <ø> (+0.55%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

alco and others added 5 commits May 19, 2026 15:32
Documents the caller-must-hold-from-permit precondition, adds
`current` to swap_rejected telemetry metadata for parity with the
existing reject event, and renames a local variable for consistency
with the rest of the module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fication

Moves :check_admission to the front of ServeShapePlug's pipeline and
classifies requests on a single :ets.member/2 of the per-stack shape-meta
table. Deletes :resolve_existing_shape — admission no longer touches the
SQLite-backed ShapeDb (bottleneck 1 of #4266).

Adds :reclassify_admission_kind between :load_shape and
:serve_shape_response. Once load_shape returns, the handler atomically
swaps its :initial permit for an :existing permit, freeing the :initial
slot for the next validate-and-load wave while this request streams.
If :existing is at cap, the handler keeps its :initial permit; the next
swap attempt will succeed once :existing drains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added a nil-guard in check_admission to handle
api_opts that omit max_concurrent_requests, but that introduced a silent
bypass for misconfiguration. Reverts the guard and gives the Api struct
a real default so direct plug_opts/1 callers (e.g. secure-mode tests)
populate the field automatically. Misconfiguration once again surfaces
as a Map.fetch! crash instead of silently disabling admission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lback logging

- Comment the unknown-handle branch in admission_kind/1 so the two
  :initial arms don't read like a copy-paste oversight.
- Logger.debug on the reclassify_admission_kind fallback (try_swap
  returning :overloaded) so operators can spot the "stuck on :initial"
  path with debug logging enabled. The rejection itself remains
  observable via the existing [:electric, :admission_control,
  :swap_rejected] telemetry emitted by try_swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alco alco marked this pull request as ready for review May 19, 2026 13:59
@alco alco added the claude label May 19, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44d35d2c41

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/sync-service/lib/electric/admission_control.ex Outdated
@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude Code Review

Summary

Incremental review of #4359 (iteration 6). Four new commits, all responses to iteration 5's open suggestions: 43c7384b3 (changeset wording), 5e98e904f (@spec for get_merged_log_stream), 3943d20c8 (residency assertion in router test), and a39f5650d (SnapshotError-vs-must_refetch 503/409 regression test). No behaviour changes; nothing blocking.

What Was Addressed Since Iteration 5

  • 5e98e904f@spec Shapes.get_merged_log_stream/3. Spec lands at shapes.ex:26-31 referencing Api.Request.snapshot_status() (the existing nil | :started | {:error, term()} alias at request.ex:20). Required adding the Electric.Shapes.Api alias at the module top; correctly enumerates the new :snapshot_status opt alongside the pre-existing :since, :up_to, :read_only?. Closes the convention gap from iteration 4.
  • 3943d20c8 — Residency assertion in router test. Adds assert %{initial: 0, existing: 0} = Electric.AdmissionControl.get_current(stack_id) at router_test.exs:2350, right after the {:admission_swap, ...} telemetry assertion. With the request now complete, both buckets should be empty: :initial drained by the try_swap decrement, :existing drained by the after-clause release. This locks the contract that no permit leaks across the swap.
  • a39f5650d — SnapshotError 503-vs-409 regression test. New test at serve_shape_plug_test.exs:326-348 stubs await_snapshot_start to return {:error, Electric.SnapshotError.slow_snapshot_start()} and asserts conn.status == 503 plus message-body parity. Done at the plug level rather than router level (iteration 5 had suggested router), which is fine — the plug-level test exercises the same hold_initial_until_snapshot_startedrequest.snapshot_statusget_merged_log_stream propagation path that prevents the double-await race. A future refactor that re-introduces a second await_snapshot_start on the streaming path would still trip this test.
  • 43c7384b3 — Changeset wording. Iteration 5 had suggested adding a sentence about snapshot-pool checkout protection. The author instead removed the second sentence ("Once load_shape returns, the in-flight permit atomically swaps from the :initial bucket to :existing …") entirely. Defensible: the @magnetised/@alco PR-comment exchange acknowledged that the snapshot-pool protection is "more hypothetical than practical" because the :initial cap can be two orders of magnitude larger than the snapshot pool — overclaiming it in release notes would mislead operators. Result: the changeset now describes only the cheap-admission win (bottleneck 1 of ShapeCache bottlenecks under thundering herd #4266), which is the protection that is operator-visible. Trade-off worth noting: the swap/try_swap/4 itself is no longer mentioned in the changeset, but it's an internal mechanism, not a user-visible API change.

New Code Review (delta only)

Nothing new to flag — all four commits are surgical, no functional changes.

Issues

None.

Suggestions

  • PR description still claims :initial bounds "requests contending for a snapshot-pool connection." As @magnetised pointed out and @alco confirmed in the conversation, the swap fires once the Snapshotter sends the {:snapshot_started, _} cast — i.e. after it checks out the connection but before the snapshot finishes — so a snapshot that is mid-query still holds a pool slot while its corresponding request has already been reclassified to :existing. Strictly, :initial bounds connection-checkout waiters, not "all requests with an outstanding snapshot pool slot." The changeset has been updated to avoid this overclaim; the PR body's summary bullet [Merged on #3] Write inserts, updates and deletes to Vaxine #2 could use the same trim before merge so the merge commit message doesn't carry it forward. Editorial only; not blocking.
  • try_swap/4 clamp behaviour comment (carried over from iteration 5). Still strictly cosmetic. The rollback path at admission_control.ex:163 goes through decr/3 which clamps at 0; the inline comment ("Rollback the claimed permit since the bucket is already at capacity") is correct for the happy invariant, but a future reader auditing whether to_kind can transiently underflow has to chase down decr/3. A one-line note like # decr clamps at 0; correct under "we just incremented" invariant. would preserve the audit trail. Low priority.

Issue Conformance

PR cites #4291 (primary, "Closes"), #4292 (parent), #4266 (bottleneck 1). linked_issues.json now resolves #4291 with its full problem statement — the cheap-gate properties listed there (ETS counters, no GenServer call on reject path, coarse classification from source/method/live/offset, 503 + adaptive Retry-After) all hold in the current implementation. The Retry-After is still the placeholder 5–10s jitter (TODO at serve_shape_plug.ex:341-348), which is consistent with #4291's framing of adaptive retry as a parent-issue goal (#4292), not this PR's scope. Changeset present and now accurate. PR description's bullet #2 wording remains the one editorial nit.

Previous Review Status

  • Iteration 5 suggestions: all four addressed. Changeset wording trimmed (different direction from the suggestion, but defensible given the conversation). @spec added. Residency assertion added. SnapshotError-vs-must_refetch regression test added at plug level.
  • Iteration 3 telemetry/log on swap-reject: still covered by the existing [:electric, :admission_control, :swap_rejected] event.
  • Iteration 5 try_swap/4 clamp comment: still cosmetic, re-listed above.
  • No regressions from the four new commits. The tests added in a39f5650d and 3943d20c8 exercise paths that already worked; they're forward guards, not bug fixes. The @spec is documentation, not enforcement. The changeset trim is release-notes editorial.

Review iteration: 6 | 2026-05-20

alco and others added 8 commits May 19, 2026 16:15
Reverts the struct-level default added in 739ae54. Production already
populates max_concurrent_requests before Api.plug_opts/1 is reached
(via Electric.Application.api_configuration/1 -> Electric.Config.get_env/1
-> the Defaults map at lib/electric/config.ex:73, which is the real source
of truth at %{initial: 300, existing: 10_000}). The struct default was
only ever consulted by tests that build Api opts directly without going
through the production-config plumbing.

Instead, route those test fixtures through Electric.Config.get_env/1 so
they see the same defaults production sees:

- test/support/component_setup.ex build_router_opts/2 (covers most router
  tests via the canonical helper).
- test/electric/plug/serve_shape_plug_logging_test.exs build_plug_opts/1.
- The secure-mode api_opts fixture in router_test.exs.

Misconfiguration once again surfaces as a Map.fetch! crash instead of
being silently papered over by a hardcoded struct default that did not
match the production numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves :put_resp_content_type before :check_admission so 503 admission
rejections still carry `Content-Type: application/json`. Prior order put
admission first, but Api.Response.error/3 -> Api.Response.send/2 does not
set the content-type header itself — only the JSON-encoded body. Strict
clients, CDNs, and observability tooling could fall back to
`application/octet-stream` on the rejection path the PR was optimizing
for. Adds a regression assertion in the existing 503 router test.

Reported by Claude review on PR #4359.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation used a single ets:update_counter/4 call
with a threshold-clamped increment for the destination column and a
decrement for the source column. That combination has two distinct
concurrency hazards under contention:

1. **Codex's race (rollback under-counts to_kind).** When N callers
   race past a saturated to_kind, every caller captures the same
   clamped post-increment value `cap + 1` — but only the first
   increment actually moved the row. All N rollbacks then decrement
   the destination by 1, removing real permits and breaking cap
   enforcement for subsequent acquires until something drains.

2. **Claude's race (concurrent acquire on from_kind sees under-count).**
   Because from_kind was decremented in the same atomic op as the
   to_kind increment, a concurrent try_acquire(from_kind) could
   observe a transiently lowered from_kind, succeed, and push
   from_kind above its cap on swap rollback.

The fix is to switch to two unclamped ops:

  1. Plain increment of to_kind. Each caller's `+1` is real, so each
     rejected caller's rollback removes exactly its own contribution.
  2. If new_to > cap: atomic rollback (to_kind -= 1). from_kind is
     untouched, so the caller keeps its from_kind permit.
  3. Otherwise: atomic decrement of from_kind (success).

Cost: 2 atomic ETS ops on both the success and reject paths (vs. 1 on
success / 2 on reject before). The mid-state between the two success
ops shows total = original_total + 1, which can spuriously reject a
concurrent try_acquire(to_kind) but never over-admit — a strictly
safer trade than the previous "from + to invariant" framing, which
turned out to be wrong under the threshold-clamp behaviour.

Adds two regression tests:

- `concurrent rejected swaps preserve the destination counter`
  (Codex's race).
- `from_kind cap is preserved against concurrent try_acquire during swap`
  (Claude's race).

Also addresses two smaller review notes:

- Adds the missing @SPEC for try_swap/4.
- Switches `swap_rejected` telemetry's `current` measurement from
  `cap` to `new_to` (the actual post-increment value the racer
  captured), matching the `:reject` event's post-increment pattern.

Reported by Codex and Claude reviews on PR #4359.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tables

StatusMonitor, EtsInspector, and LsnTracker each own a read-mostly ETS
table that every shape HTTP request hits on the fast path:

- StatusMonitor: `service_status/1` runs from `Api.validate_params` ->
  `hold_until_stack_ready` for every request.
- EtsInspector `pg_inspector_table`: `load_relation_oid`,
  `load_relation_info`, `load_column_info`, `load_supported_features`
  all do an ETS lookup here before falling back to the GenServer.
- LsnTracker: `determine_global_last_seen_lsn` reads it for every
  non-`offset=:now` request.

All three tables were created without `read_concurrency`, so reads
contend on the default ETS lock together with the infrequent writes
that the owning GenServer makes. The sibling hot-path tables
(shape_meta_table, AdmissionControl counter, PureFileStorage stack
table, WriteBuffer, ConsumerRegistry) already set the flag; these
three are the remaining holes on the per-request read path and the
ones most likely to serialize under the concurrency the new admission
control allows through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reclassify after the Snapshotter signals snapshot_started, not at
load_shape return. Holding :initial across the await_snapshot_start
wait makes admission-gate backpressure cover snapshot-pool contention:
when Snapshotters queue for a PG connection, requests stay parked in
:initial and new requests get rejected at the gate (503 + Retry-After)
instead of piling up as an unbounded Postgrex checkout queue.

The await result is propagated via request.snapshot_status and
consumed by Shapes.get_merged_log_stream, replacing its own
await_snapshot_start call. Issuing it twice would race the consumer's
stop_and_clean on the snapshot-failure path, turning legitimate
SnapshotError 503s into must_refetch 409s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@magnetised
Copy link
Copy Markdown
Contributor

Reclassify at snapshot start. Once the Snapshotter signals that the snapshot has started (i.e. it has obtained a connection from the snapshot pool), the in-flight permit moves from :initial to :existing. The :initial cap now bounds requests that are in validate-and-load or contending for a snapshot-pool connection, not requests currently streaming their response.

@alco active snapshot is still holding a connection so shouldn't it still count towards "things contending for a snapshot-pool connection"?

Copy link
Copy Markdown
Contributor

@magnetised magnetised left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really appreciate the work that's gone into this. awesome stuff. have a couple of questions, one in the review one on the main pr, but am assuming that I've just not understood something.

🌟

with {stack_id, :initial} <- Process.get(@admission_permit_key),
max = Map.fetch!(config[:api].max_concurrent_requests, :existing),
:ok <-
Electric.AdmissionControl.try_swap(stack_id, :initial, :existing, max_concurrent: max) do
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean if the try_swap fails?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to try_swap fails only if the :existing bucket is already at capacity. An unlikely case in practice since we configure it with a higher value than :initial.

In any case, if the swap fails, the current request keeps holding the :initial permit until it's released in the after clause of the call function.

@alco
Copy link
Copy Markdown
Member Author

alco commented May 20, 2026

Reclassify at snapshot start. Once the Snapshotter signals that the snapshot has started (i.e. it has obtained a connection from the snapshot pool), the in-flight permit moves from :initial to :existing. The :initial cap now bounds requests that are in validate-and-load or contending for a snapshot-pool connection, not requests currently streaming their response.

@alco active snapshot is still holding a connection so shouldn't it still count towards "things contending for a snapshot-pool connection"?

You're right.

With the changes made in this PR, the :initial bucket protects ShapeCache primarily. But if initial shape snapshots cannot keep up with the rate of new shape requests, some requests will inevitably be stuck waiting for DB connection checkout or time out on it.

So the protection of the snapshot pull is more hypothetical than practical, considering that the :initial bucket size can be 2 orders of magnitude larger than the connection pool size. But at the point when the shape consumer process is already running and waiting on the snapshot, ShapeCache is supposedly ready to accept the next new shape request. Which could end up being a request for an already existing shape.

In other words, it may end up being a moot point in practice, but reclassifying a shape req mid-flight has potential to make new shape creation faster (by not waiting on the full snapshot creation+serving cycle) but doesn't really make it worse in the thundering herd scenario where shape reqs are already timing out on connection pool checkout en masse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AdmissionControl logic becomes expensive when hit with a thundering herd of shape requests

2 participants