Skip to content

feat(sync-service): Scale SQLite connection pool to 0#3908

Open
magnetised wants to merge 9 commits intomainfrom
magnetised/scale-sqlite-pool
Open

feat(sync-service): Scale SQLite connection pool to 0#3908
magnetised wants to merge 9 commits intomainfrom
magnetised/scale-sqlite-pool

Conversation

@magnetised
Copy link
Contributor

@magnetised magnetised commented Feb 25, 2026

NimblePool allows for lazy initialization and scale down of SQLite connections. This PR adds support for that.

The scaling is disabled for exclusive_mode as an exclusive_mode db could be in-memory, and closing that would be catastrophic.

I've added the number of active connections to the exported data. Will be interesting to see what goes on there, in both the larger instances and the many less-active ones.

This disables sqlite statistics collection by default as the usefulness of the exported numbers is low without memory statistics enabled, and enabling memory stats is potentially causing issues in some deployments.

@codecov
Copy link

codecov bot commented Feb 25, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2491 1 2490 1
View the top 2 failed test(s) by shortest run time
Elixir.Electric.Shapes.ConsumerTest::test transaction handling with real storage writes txn fragments to storage immediately but keeps txn boundaries when flushing
Stack Traces | 0.986s run time
25) test transaction handling with real storage writes txn fragments to storage immediately but keeps txn boundaries when flushing (Electric.Shapes.ConsumerTest)
     .../electric/shapes/consumer_test.exs:1548
     match (=) failed
     code:  assert [{Storage, :append_fragment_to_log!, [log_items, _]}] =
              Support.StorageTracer.collect_traced_calls()
     left:  [{Electric.ShapeCache.Storage, :append_fragment_to_log!, [log_items, _]}]
     right: []
     stacktrace:
       .../electric/shapes/consumer_test.exs:1615: anonymous fn/2 in Electric.Shapes.ConsumerTest."test transaction handling with real storage writes txn fragments to storage immediately but keeps txn boundaries when flushing"/1
       (elixir 1.19.5) lib/enum.ex:961: Enum."-each/2-lists^foreach/1-0-"/2
       .../electric/shapes/consumer_test.exs:1612: (test)
Elixir.Electric.Shapes.ConsumerTest::test transaction handling with real storage transaction fragments are buffered until snapshot xmin is known
Stack Traces | 1.82s run time
13) test transaction handling with real storage transaction fragments are buffered until snapshot xmin is known (Electric.Shapes.ConsumerTest)
     .../electric/shapes/consumer_test.exs:919
     Assertion failed, no matching message after 400ms
     The following variables were pinned:
       tx_offset = 10
     Showing 4 of 4 messages in the mailbox
     code: assert_receive {:flush_boundary_updated, ^tx_offset}
     mailbox:
       pattern: {:flush_boundary_updated, ^tx_offset}
       value:   {:flush_boundary_updated, 9}

       pattern: {:flush_boundary_updated, ^tx_offset}
       value:   {:trace, #PID<0.14010.0>, :call, {Electric.ShapeCache.Storage, :append_fragment_to_log!, [[{LogOffset.new(10, 6), "\"public\".\"test_table\"/\"6\"", :insert, "{\"value\":{\"id\":\"6\"},\"key\":\"\\\"public\\\".\\\"test_table\\\"/\\\"6\\\"\",\"headers\":{\"relation\":[\"public\",\"test_table\"],\"operation\":\"insert\",\"lsn\":\"10\",\"txids\":[100],\"op_position\":6}}"}], {Electric.ShapeCache.PureFileStorage, {:writer_state, {:writer_acc, [], [], 0, LogOffset.new(10, 4), LogOffset.new(9, 6), LogOffset.new(10, 4), LogOffset.new(9, 6), 1603, 1218, 1, true, {LogOffset.last_before_real_offsets(), [{{LogOffset.new(9, 0), nil}, {0, nil}}]}, {:open_files, {:file_descriptor, :prim_file, %{handle: #Reference<0.1969248539.3144548364.144879>, owner: #PID<0.14010.0>, r_buffer: #Reference<0.1969248539.3144548356.209453>, r_ahead_size: 0}}, nil}}, nil, #Reference<0.1969248539.3144548355.211327>, "latest.0", %Electric.ShapeCache.PureFileStorage{buffer_ets: nil, chunk_bytes_threshold: 10485760, shape_handle: "27335419-1772704010522038", stack_id: "Electric.Shapes.ConsumerTest test transaction handling with real storage transaction fragments are buffered until snapshot xmin is known", stack_ets: #Reference<0.1969248539.3144548355.210766>, snapshot_file_timeout: 5000, version: 1}}}]}}

       pattern: {:flush_boundary_updated, ^tx_offset}
       value:   {:trace, #PID<0.14010.0>, :call, {Electric.ShapeCache.Storage, :append_fragment_to_log!, [[{LogOffset.new(10, 8), "\"public\".\"test_table\"/\"2\"", :delete, "{\"value\":{\"id\":\"2\"},\"key\":\"\\\"public\\\".\\\"test_table\\\"/\\\"2\\\"\",\"headers\":{\"last\":true,\"relation\":[\"public\",\"test_table\"],\"operation\":\"delete\",\"lsn\":\"10\",\"txids\":[100],\"op_position\":8}}"}], {Electric.ShapeCache.PureFileStorage, {:writer_state, {:writer_acc, [], [], 0, LogOffset.new(10, 6), LogOffset.new(9, 6), LogOffset.new(10, 6), LogOffset.new(9, 6), 1827, 1387, 2, true, {LogOffset.last_before_real_offsets(), [{{LogOffset.new(9, 0), nil}, {0, nil}}]}, {:open_files, {:file_descriptor, :prim_file, %{handle: #Reference<0.1969248539.3144548364.144879>, owner: #PID<0.14010.0>, r_buffer: #Reference<0.1969248539.3144548356.209453>, r_ahead_size: 0}}, nil}}, nil, #Reference<0.1969248539.3144548355.211327>, "latest.0", %Electric.ShapeCache.PureFileStorage{buffer_ets: nil, chunk_bytes_threshold: 10485760, shape_handle: "27335419-1772704010522038", stack_id: "Electric.Shapes.ConsumerTest test transaction handling with real storage transaction fragments are buffered until snapshot xmin is known", stack_ets: #Reference<0.1969248539.3144548355.210766>, snapshot_file_timeout: 5000, version: 1}}}]}}

       pattern: {:flush_boundary_updated, ^tx_offset}
       value:   {:trace, #PID<0.14010.0>, :call, {Electric.ShapeCache.Storage, :signal_txn_commit!, [100, {Electric.ShapeCache.PureFileStorage, {:writer_state, {:writer_acc, [{{10, 8}, "{\"value\":{\"id\":\"2\"},\"key\":\"\\\"public\\\".\\\"test_table\\\"/\\\"2\\\"\",\"headers\":{\"last\":true,\"relation\":[\"public\",\"test_table\"],\"operation\":\"delete\",\"lsn\":\"10\",\"txids\":[100],\"op_position\":8}}"}], [[], <<0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 8>>, <<0, 0, 0, 25>>, "\"public\".\"test_table\"/\"2\"", <<100, 0, 0, 0, 0, 0, 0, 0, 0, 181>>, "{\"value\":{\"id\":\"2\"},\"key\":\"\\\"public\\\".\\\"test_table\\\"/\\\"2\\\"\",\"headers\":{\"last\":true,\"relation\":[\"public\",\"test_table\"],\"operation\":\"delete\",\"lsn\":\"10\",\"txids\":[100],\"op_position\":8}}"], 236, LogOffset.new(10, 8), LogOffset.new(9, 6), LogOffset.new(10, 6), LogOffset.new(9, 6), 2063, 1568, 2, true, {LogOffset.last_before_real_offsets(), [{{LogOffset.new(9, 0), nil}, {0, nil}}]}, {:open_files, {:file_descriptor, :prim_file, %{handle: #Reference<0.1969248539.3144548364.144879>, owner: #PID<0.14010.0>, r_buffer: #Reference<0.1969248539.3144548356.209453>, r_ahead_size: 0}}, nil}}, #Reference<0.1969248539.3144417282.228452>, #Reference<0.1969248539.3144548355.211327>, "latest.0", %Electric.ShapeCache.PureFileStorage{buffer_ets: nil, chunk_bytes_threshold: 10485760, shape_handle: "27335419-1772704010522038", stack_id: "Electric.Shapes.ConsumerTest test transaction handling with real storage transaction fragments are buffered until snapshot xmin is known", stack_ets: #Reference<0.1969248539.3144548355.210766>, snapshot_file_timeout: 5000, version: 1}}}]}}
     stacktrace:
       .../electric/shapes/consumer_test.exs:1107: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 3 times, most recently from 72269af to 0c0a256 Compare February 25, 2026 14:52
@magnetised magnetised marked this pull request as ready for review February 25, 2026 15:08
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 0c0a256 to 023caa8 Compare February 25, 2026 15:13
@alco alco added the claude label Feb 25, 2026
@claude
Copy link

claude bot commented Feb 25, 2026

Claude Code Review

Summary

Round 16: Four new commits since round 15 — "refactor statistics module", "change supervisor strategy", "don't crash stats task", and a merge. This is a substantial refactor: Statistics no longer initialises via handle_continue; instead it starts lean and receives a one-shot :initialize cast from a Task child after the full supervision tree is up. Stats reads are now fully async via Task.async. The write pool is now lazy and gets an idle timeout, allowing it to scale to 0. Supervisor strategy changed to one_for_all, max_restarts: 0 to propagate child crashes upward. WriteBuffer now traps exits so it can flush before terminating. These are well-motivated changes.

CI is still showing consistent failures that need attention before merging.

What's Working Well

  • Async stats probe — Moving stats reads out of handle_continue into an async Task keeps the GenServer mailbox clear during startup. The probe/schedule split (capability check on first run, lean read on subsequent runs) is clean.
  • connections metric added — Tracking active pool workers via worker_start/worker_stop in NimblePool callbacks and exporting the count in telemetry is a good observability addition.
  • one_for_all, max_restarts: 0 — The strategy change is well-explained in the comment and correctly matches the intent: a crash in any of the tightly coupled ShapeDb children should propagate the whole supervisor to the parent, not silently restart in isolation.
  • WriteBuffer terminate/2 path — Trapping exits and converting them to {:stop, reason, state} is the right pattern to guarantee the flush callback runs on shutdown.
  • stats_sql/2 overloaded on both flags — Handling the (false, true) and (false, false) permutations explicitly is more readable than the previous single-argument form.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

1. dbstat via sqlite_master still broken — disk stats always disabled (from round 15)

File: packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/statistics.ex (probe_capabilities/2)

The sqlite_master query for dbstat availability was flagged last round. It is unchanged:

SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'dbstat'

dbstat is an eponymous virtual table: compiled in via SQLITE_ENABLE_DBSTAT_VTAB, queryable directly (SELECT * FROM dbstat), but never registered in sqlite_master because it has no CREATE VIRTUAL TABLE statement. This query returns 0 rows unconditionally, so fetch_one returns :error, setting dbstat_available?: false on every first run. Disk size is then silently disabled for the lifetime of the process.

The current CI failure "statistics only exports disk usage by default" is consistent with this: the test now calls wait_statistics/2 (10 × 10 ms polling), but even after the probe finishes, enabled.disk is false and the assert disk_size > 0 branch is silently skipped — the test shape passes vacuously but disk stats are broken.

Suggested fix: probe by querying the table directly:

dbstat_available? =
  case Connection.fetch_one(conn, "SELECT 1 FROM dbstat LIMIT 1", []) do
    {:ok, _}    -> true   # row returned
    :error      -> true   # table exists but db is empty (0 rows ≠ table missing)
    {:error, _} ->
      if first_run?,
        do: Logger.warning("SQLite disk size statistics will not be available.")
      false
  end

Only a query error ("no such table: dbstat") means the extension is absent.

2. crashing WriteBuffer restarts entire supervision tree test failing consistently in CI

File: packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs:647-659

Blacksmith CI shows this test failing across 4 concurrent jobs in the latest run, with:

Assertion failed, no matching message after 400ms

The test sends Process.exit(write_buffer_pid, :some_reason). Because WriteBuffer traps exits, this arrives as an {:EXIT, ...} message, which triggers handle_info({:EXIT, _, :some_reason}){:stop, :some_reason, state}terminate/2. The terminate/2 callback flushes the write buffer synchronously to SQLite before the process exits. In a slow CI environment that flush can exceed the default 100 ms assert_receive window (or the explicit 400 ms mentioned in the error).

The test has no explicit timeout, so the 400 ms in the error is the ExUnit default. Adding a longer timeout is a quick fix:

assert_receive {:DOWN, ^buffer_ref, :process, ^write_buffer_pid, :some_reason}, 5000
assert_receive {:DOWN, ^super_ref, :process, ^supervisor_pid, _}, 5000

Alternatively, pre-flush the buffer before the Process.exit/2 call so terminate/2 has nothing to write.

3. wait_statistics/2 polling window too narrow for async probe timing

File: packages/sync-service/test/electric/shape_cache/shape_status/shape_db_test.exs

defp wait_statistics(_ctx, 0), do: :error
defp wait_statistics(ctx, remaining_attempts) do
  case ShapeDb.statistics(ctx.stack_id) do
    {:ok, %{updated_at: %DateTime{}} = stats} -> {:ok, stats}
    {:ok, _} ->
      Process.sleep(10)
      wait_statistics(ctx, remaining_attempts - 1)
  end
end

10 retries × 10 ms = 100 ms total. The probe runs a Task.async that checks out the write pool (lazily opening a connection), runs PRAGMA page_size, queries dbstat/memstat, then sends the result back to the GenServer. In a loaded CI runner this consistently exceeds 100 ms (as seen in the "statistics only exports disk usage by default" CI failure). Increase the window:

defp wait_statistics(_ctx, 0), do: :error
defp wait_statistics(ctx, remaining_attempts) do
  case ShapeDb.statistics(ctx.stack_id) do
    {:ok, %{updated_at: %DateTime{}} = stats} -> {:ok, stats}
    {:ok, _} ->
      Process.sleep(50)
      wait_statistics(ctx, remaining_attempts - 1)
  end
end

50 ms × 10 = 500 ms should cover any reasonable CI latency. Or just use assert_receive in the test with a 2 s timeout against a telemetry event.

Suggestions (Nice to Have)

4. :ok = Sqlite3.release(conn, stmt) bare match (from round 15)

File: packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex (close/1)

Still present. In practice the underlying esqlite3 implementation is a no-op that always returns :ok, so this cannot crash. But the updated spec now says @spec release(connection(), statement()) :: :ok | {:error, term()}, creating a mismatch between spec and assertion. Worth making the two consistent — either keep the assertion and tighten the spec back to :: :ok, or handle {:error, _} and log it.

5. stats_sql(false, false) reachable path

File: packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/statistics.ex

The comment says "should not normally be reached", but given issue #1 above (dbstat always returns false), this path IS reached in practice whenever memory stats are disabled (the default). SELECT 1 WHERE 0 returns 0 rows harmlessly, but analyze_stats([], page_size) produces an empty struct. Fixing #1 would remove the concern; until then it silently no-ops.

Issue Conformance

No linked issue — unchanged from all previous rounds.

Previous Review Status

  • RESOLVED: Dead state fields (pool_opts, exclusive_mode?, page_size: 0 seed) — the init rewrite eliminates them cleanly.
  • OPEN (Important): dbstat via sqlite_master always returns false — not addressed; author declined. This is now visibly causing the CI test failure.
  • OPEN: CI test failures — "crashing WriteBuffer restarts entire supervision tree" failing in 4 concurrent Blacksmith jobs; "statistics only exports disk usage by default" failing intermittently. Both are new test reliability issues introduced in this round.
  • DECLINED: Task.async crash propagation — still carried forward; the try/catch wrapper in the task body mitigates the most common paths.
  • DECLINED: Missing @impl GenServer on callbacks — confirmed: handle_cast({:worker_incr, ...}) has @impl GenServer; the subsequent :initialize clauses correctly omit it. Not an issue.

Review iteration: 16 | 2026-03-05

@alco
Copy link
Member

alco commented Feb 26, 2026

@magnetised I've looked through the issues Claude flagged up, all for them look legit. Please ping me when ready for another review round.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 023caa8 to 90029c5 Compare February 26, 2026 12:29
@blacksmith-sh

This comment has been minimized.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 90029c5 to f8fa293 Compare February 26, 2026 13:43
@blacksmith-sh

This comment has been minimized.

@blacksmith-sh

This comment has been minimized.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from df9efb8 to 292b144 Compare February 26, 2026 15:40
@magnetised
Copy link
Contributor Author

@alco think we can ignore claude's suggestions. ready for another look pls

@magnetised magnetised requested a review from alco February 26, 2026 16:11
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from 9772e20 to 3d410a6 Compare March 2, 2026 10:00
@blacksmith-sh

This comment has been minimized.

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from 812f743 to e95c5ae Compare March 3, 2026 11:59
Copy link
Member

@alco alco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 5 times, most recently from 524f5f8 to b316a68 Compare March 3, 2026 15:20
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch 2 times, most recently from d31116a to 872c8bd Compare March 3, 2026 16:37
@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Mar 3, 2026

Found 2 test failures on Blacksmith runners:

Failures

Test View Logs
Elixir.Electric.Shapes.ConsumerTest/
test transaction handling with real storage transaction fragments are buffered until sn
apshot xmin is known
View Logs
Elixir.Electric.Shapes.ConsumerTest/
test transaction handling with real storage writes txn fragments to storage immediately
but keeps txn boundaries when flushing
View Logs

Fix in Cursor

@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 872c8bd to b281e9b Compare March 4, 2026 15:35
so stats retrieval is not blocked by concurrent reading of stats
because now we're maybe shutting the write connection when idle, this
will log all the time
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from b281e9b to 9b362c9 Compare March 5, 2026 09:37
@magnetised magnetised force-pushed the magnetised/scale-sqlite-pool branch from 9b362c9 to e463c12 Compare March 5, 2026 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants