Summary
The test "handles relation tracker restart" in publication_manager_test.exs:503 has a race condition that causes intermittent CI failures.
Observed in run 22354924262 on main (2026-02-24).
Error
** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:...", {Electric.Replication.PublicationManager.RelationTracker, nil}}}, {:remove_shape, "36215155-..."}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name
Root Cause
The test at test/electric/replication/publication_manager_test.exs:503:
- Line 515: GenServer.stop(relation_tracker_name) kills the RelationTracker
- Line 519: assert_pub_tables(ctx, [ctx.relation], 2_000) polls Postgres publication tables until they match
- Line 522: PublicationManager.remove_shape(ctx.stack_id, shape_handle) does a GenServer.call to the RelationTracker
The problem is that assert_pub_tables checks Postgres state, not whether the RelationTracker GenServer has been re-registered by the supervisor. There's a window where publication tables are correct (from the previous state) but the new RelationTracker process isn't yet alive or hasn't finished handle_continue(:restore_relations, ...).
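The window described above can be illustrated outside the Electric codebase. The following is a minimal sketch, not the project's code: RaceDemo.Tracker, RaceDemo.Registry and RaceDemo.try_call/2 are hypothetical names, and the tracker is a trivial GenServer standing in for RelationTracker. It shows the same :noproc exit for a :via name with no registered process, and that a call made right after GenServer.stop can go either way depending on how quickly the supervisor restarts the child.

```elixir
# Hypothetical stand-in for RelationTracker: a GenServer registered
# under a {:via, Registry, ...} name and started by a supervisor.
defmodule RaceDemo.Tracker do
  use GenServer

  def via, do: {:via, Registry, {RaceDemo.Registry, __MODULE__}}
  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: via())

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call(:ping, _from, state), do: {:reply, :pong, state}
end

defmodule RaceDemo do
  # Returns {:ok, reply}, or {:exit, :noproc} when no live process
  # is registered under the given name.
  def try_call(name, msg) do
    {:ok, GenServer.call(name, msg)}
  catch
    :exit, {:noproc, _} -> {:exit, :noproc}
  end
end

{:ok, _} = Registry.start_link(keys: :unique, name: RaceDemo.Registry)

# Nothing is registered yet, so the call exits with :noproc,
# the same exit reason as in the error report.
before_start = RaceDemo.try_call(RaceDemo.Tracker.via(), :ping)
IO.inspect(before_start, label: "before start")

{:ok, _sup} = Supervisor.start_link([RaceDemo.Tracker], strategy: :one_for_one)

# GenServer.stop returns once the old process has terminated; the
# supervisor restarts the child asynchronously after that.
:ok = GenServer.stop(RaceDemo.Tracker.via())

# Timing dependent: {:ok, :pong} if the restart already happened,
# {:exit, :noproc} if the call lands inside the restart window.
after_stop = RaceDemo.try_call(RaceDemo.Tracker.via(), :ping)
IO.inspect(after_stop, label: "immediately after stop")
```

The second result is intentionally not deterministic, which matches the intermittent nature of the CI failure.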
Suggested Fix
Call RelationTracker.wait_for_restore(ctx.stack_id) before remove_shape on line 522. This function already exists (lines 79-82 of relation_tracker.ex) and blocks until handle_continue(:restore_relations) completes, which guarantees the process is registered and ready.
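To make the ordering concrete, here is a self-contained sketch of the "wait for restore, then call" pattern. Everything in it is hypothetical (RestoreDemo.Tracker, RestoreDemo.Registry, await_registered/2), and its wait_for_restore/1 is a plain GenServer.call written for illustration; it is not Electric's implementation, which the issue does not show. It relies on the OTP guarantee that handle_continue runs right after init, before any queued message is handled. As an extra assumption, it polls for registration first, since a call-based wait would itself exit with :noproc if it ran before the supervisor restarted the child.

```elixir
defmodule RestoreDemo.Tracker do
  use GenServer

  def via, do: {:via, Registry, {RestoreDemo.Registry, __MODULE__}}
  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: via())

  # Hypothetical analogue of wait_for_restore: the reply can only be
  # sent after handle_continue(:restore, ...) has finished.
  def wait_for_restore(timeout \\ 5_000),
    do: GenServer.call(via(), :wait_for_restore, timeout)

  def restored?, do: GenServer.call(via(), :restored?)

  @impl true
  def init(:ok), do: {:ok, %{restored?: false}, {:continue, :restore}}

  @impl true
  def handle_continue(:restore, state) do
    # Simulate a slow restore step.
    Process.sleep(100)
    {:noreply, %{state | restored?: true}}
  end

  @impl true
  def handle_call(:wait_for_restore, _from, state), do: {:reply, :ok, state}
  def handle_call(:restored?, _from, state), do: {:reply, state.restored?, state}
end

defmodule RestoreDemo do
  # Polls until the name resolves to a live pid or the timeout elapses.
  def await_registered(name, timeout_ms \\ 2_000) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    do_await(name, deadline)
  end

  defp do_await(name, deadline) do
    case GenServer.whereis(name) do
      pid when is_pid(pid) ->
        {:ok, pid}

      nil ->
        if System.monotonic_time(:millisecond) < deadline do
          Process.sleep(10)
          do_await(name, deadline)
        else
          {:error, :timeout}
        end
    end
  end
end

{:ok, _} = Registry.start_link(keys: :unique, name: RestoreDemo.Registry)
{:ok, _sup} = Supervisor.start_link([RestoreDemo.Tracker], strategy: :one_for_one)

old_pid = GenServer.whereis(RestoreDemo.Tracker.via())
:ok = GenServer.stop(RestoreDemo.Tracker.via())

# Wait for the supervisor to re-register the process, then for restore.
{:ok, new_pid} = RestoreDemo.await_registered(RestoreDemo.Tracker.via())
:ok = RestoreDemo.Tracker.wait_for_restore()

restored = RestoreDemo.Tracker.restored?()
IO.inspect({new_pid != old_pid, restored}, label: "restarted and restored")
```

In the actual test, the equivalent change would be placing the existing RelationTracker.wait_for_restore(ctx.stack_id) call between assert_pub_tables on line 519 and remove_shape on line 522. Whether that call tolerates the not-yet-registered window on its own depends on its implementation.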
Context: Broader CI Flakiness
While investigating, I looked at all sync-service workflow failures from the last 2 days: 12 failures vs 14 successes (~46% failure rate). The failures are spread across many test files — only 1 of the 12 was this publication_manager_test:
| Test file | Failures |
|---|---|
| shape_cache_test.exs:501 | 4 |
| request_batcher_test.exs:100 | 2 |
| publication_manager_test.exs:503 | 1 |
| api_test.exs:925 | 1 |
| delete_shape_plug_test.exs:100 | 1 |
| shape_db_test.exs:553 | 1 |
| shape_cache_test.exs:877 | 1 |