Skip to content

feat(cluster): wire multi-shard cross-shard communication for server-ng#3269

Open
hubcio wants to merge 1 commit into
masterfrom
crossfire-server-ng
Open

feat(cluster): wire multi-shard cross-shard communication for server-ng#3269
hubcio wants to merge 1 commit into
masterfrom
crossfire-server-ng

Conversation

@hubcio
Copy link
Copy Markdown
Contributor

@hubcio hubcio commented May 15, 2026

core/server-ng bootstrapped exactly one shard with SHARD_ID=0 and
senders=vec![sender] hardcoded; the multi-shard path was dead code.
Cross-shard primitives copied from legacy core/server also did not
fit core/shard's crossfire model (bounded MTx + try_send-or-drop,
fd-transfer coordinator instead of SO_REUSEPORT).

bootstrap() now spawns N OS threads from sharding.cpu_allocation,
each pinned via nix::sched + hwlocality and running its own compio
runtime + IggyMessageBus + IggyShard for the partitions hashed to
that shard. Cross-thread shutdown rides an Arc polled by
a per-shard watchdog, since the bus' Shutdown is !Send and cannot be
triggered from the main thread directly. Partial shard-spawn failure
and shard-thread panic now signal cluster-wide shutdown instead of
hanging; the shutdown watchdog is detached from the bus drain.

ShardFrame becomes a concrete enum (Consensus + Lifecycle); the R
generic is lifted off IggyShard. Named routers (route_metadata /
route_partition / route_consensus_control) replace the duplicated
MessageBag match blocks, and a debug_assert at pump entry catches
receiver-thread mis-binding that the ctor's assert_sender_ordering
cannot see.

ShardMetrics records frame_drops_total (counter, variant+reason
labels), bumped at every inter-shard try_send rejection; without it,
drop-and-recover under VSR retransmit is operationally
indistinguishable from a livelock. The counter is atomic, so it is
safe to bump from !Send compio reactor contexts.

The legacy shard-mapping broadcast subsystem (periodic snapshot
refresh task, three-state MappingSlot table, ReplicaMappingUpdate /
ReplicaMappingClear frames) is retired entirely. Cross-shard replica
routing now flows through the cluster-shared ReplicaOwnerTable: the
owning shard's installer stamps its slot on a successful registry
insert and CAS-clears it on disconnect, so every bus' send_to_replica
slow path reads authoritative state with no broadcast or reconcile
loop. Builder accepts the coordinator at ctor; IggyShard stays
immutable post-construction.

message_bus forward-fn types widen to carry replica/client id, and
send_to_replica routes via the shared owner table so non-zero shards
reconcile against shared state rather than shard 0's private bus.

WAL recovery is serialized across shards: non-zero shards open the
WAL read-only, reject mutating ops (drain, set_snapshot_op) on a
read-only storage, route an invalid WAL header to truncate_or_fail,
and close the read-only fd once recovery completes.

Storage milestones (durable PartitionJournal, durable
(view, commit_op) watermark) and SDK (client_id, request_id)
durability across reconnect remain out of scope; tracked separately.

@hubcio hubcio force-pushed the crossfire-server-ng branch from e0e289d to 84e79d3 Compare May 15, 2026 09:33
@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label May 15, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 50.89392% with 824 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.27%. Comparing base (eff697f) to head (e0067ca).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
core/server-ng/src/bootstrap.rs 20.00% 464 Missing and 4 partials ⚠️
core/shard/src/router.rs 0.00% 111 Missing ⚠️
core/shard/src/builder.rs 0.00% 57 Missing ⚠️
core/shard/src/lib.rs 41.93% 36 Missing ⚠️
core/message_bus/src/lib.rs 82.58% 22 Missing and 5 partials ⚠️
core/server-ng/src/main.rs 0.00% 27 Missing ⚠️
core/journal/src/prepare_journal.rs 88.18% 17 Missing and 9 partials ⚠️
core/metadata/src/stm/mod.rs 72.72% 18 Missing ⚠️
core/shard/src/coordinator.rs 83.65% 17 Missing ⚠️
core/simulator/src/bus.rs 0.00% 12 Missing ⚠️
... and 7 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3269      +/-   ##
============================================
- Coverage     74.18%   71.27%   -2.91%     
  Complexity      943      943              
============================================
  Files          1200     1201       +1     
  Lines        109645   105567    -4078     
  Branches      86535    82472    -4063     
============================================
- Hits          81340    75246    -6094     
- Misses        25567    27371    +1804     
- Partials       2738     2950     +212     
Components Coverage Δ
Rust Core 71.74% <50.89%> (-3.68%) ⬇️
Java SDK 58.44% <ø> (ø)
C# SDK 69.13% <ø> (-0.31%) ⬇️
Python SDK 81.43% <ø> (ø)
Node SDK 91.44% <ø> (-0.10%) ⬇️
Go SDK 39.91% <ø> (ø)
Files with missing lines Coverage Δ
core/common/src/sharding/namespace.rs 95.69% <100.00%> (+0.69%) ⬆️
core/configs/src/server_config/sharding.rs 85.06% <100.00%> (+2.58%) ⬆️
core/consensus/src/impls.rs 78.15% <100.00%> (+0.03%) ⬆️
core/journal/src/file_storage.rs 70.58% <ø> (ø)
core/message_bus/src/client_listener/tcp.rs 80.00% <ø> (ø)
core/message_bus/src/client_listener/tcp_tls.rs 84.61% <ø> (ø)
core/message_bus/src/client_listener/ws.rs 80.76% <ø> (ø)
core/message_bus/src/connector.rs 94.31% <ø> (ø)
core/message_bus/src/installer/mod.rs 62.50% <ø> (+9.86%) ⬆️
core/message_bus/src/installer/tcp_tls.rs 90.00% <ø> (ø)
... and 24 more

... and 124 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@hubcio hubcio force-pushed the crossfire-server-ng branch 2 times, most recently from 92db4e4 to 4ee1f23 Compare May 18, 2026 11:37
Comment thread core/journal/src/prepare_journal.rs
Comment thread core/journal/src/prepare_journal.rs
Comment thread core/message_bus/src/installer/replica.rs Outdated
Comment thread core/server-ng/src/bootstrap.rs
Comment thread core/configs/src/server_config/sharding.rs
Copy link
Copy Markdown

@jayakasadev jayakasadev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All threads resolved. LGTM.

@hubcio hubcio force-pushed the crossfire-server-ng branch 2 times, most recently from 7ef14ed to bcf8da7 Compare May 20, 2026 19:09
Copy link
Copy Markdown
Contributor

@krishvishal krishvishal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit, StartViewChange, DoViewChange, StartView (and Register requests) all carry Operation::Reserved / Operation::Register. Neither is in is_metadata() nor is_partition(), so route_typed falls through to route_consensus_control and hashes header.namespace. Metadata's consensus group is initialized with namespace = 0 (bootstrap.rs:1032), and Murmur3(0u64) >> 16 = 25477, so for every shard_count > 1, the frame is dispatched to a non-zero shard whose IggyMetadata.consensus is None. The receiver hits the fall-through arm and tracing::warn!-drops. VSR retransmit re-routes via the same hash and cluster wedges permanently.

core/server-ng bootstrapped exactly one shard with SHARD_ID=0 and
senders=vec![sender] hardcoded; the multi-shard path was dead code.
Cross-shard primitives copied from legacy core/server also did not
fit core/shard's crossfire model (bounded MTx + try_send-or-drop,
fd-transfer coordinator instead of SO_REUSEPORT).

bootstrap() now spawns N OS threads from sharding.cpu_allocation,
each pinned via nix::sched + hwlocality and running its own compio
runtime + IggyMessageBus + IggyShard for the partitions hashed to
that shard. Cross-thread shutdown rides an Arc<AtomicBool> polled by
a per-shard watchdog, since the bus' Shutdown is !Send and cannot be
triggered from the main thread directly. Partial shard-spawn failure
and shard-thread panic now signal cluster-wide shutdown instead of
hanging.

ShardFrame becomes a concrete enum (Consensus + Lifecycle); the R
generic is lifted off IggyShard. Named routers (route_metadata /
route_partition / route_consensus_control) replace the duplicated
MessageBag match blocks, and a debug_assert at pump entry catches
receiver-thread mis-binding that the ctor's assert_sender_ordering
cannot see.

ShardMetrics records frame_drops_total (counter, variant+reason
labels), bumped at every inter-shard try_send rejection and at the
receiver-side delivery failure paths (partition-namespace miss,
ForwardClientSend failure, consensus misroute). Without it,
drop-and-recover under VSR retransmit is operationally
indistinguishable from a livelock. The counter is atomic, so it is
safe to bump from !Send compio reactor contexts. shard_id is NOT a
label; each shard owns its own Family and the exporter must attach
shard scope at the registry layer.

The legacy shard-mapping broadcast subsystem (periodic snapshot
refresh task, three-state MappingSlot table, ReplicaMappingUpdate /
ReplicaMappingClear frames) is retired entirely. Cross-shard replica
routing now flows through the cluster-shared ReplicaOwnerTable: the
owning shard's installer stamps its slot on a successful registry
insert and CAS-clears it on disconnect, so every bus' send_to_replica
slow path reads authoritative state with no broadcast or reconcile
loop. Builder accepts the coordinator at ctor; IggyShard stays
immutable post-construction.

message_bus forward-fn types widen to carry replica/client id, and
send_to_replica routes via the shared owner table so non-zero shards
reconcile against shared state rather than shard 0's private bus.

Metadata recovery: shard 0 owns the WAL writer (via
PrepareJournal::open) and the metadata consensus replica. Peer shards
never open the WAL; they receive a MetadataHandoff::Waiter factory
bundle broadcast by shard 0 over a single-shot crossfire channel and
reconstruct their mux_stm from the in-memory snapshot it carries.
broadcast_metadata_bundle / await_metadata_bundle poll a shutdown
flag so a shard 0 that panics before broadcasting cannot strand
peers.

PrepareJournal::drain advances snapshot_op only AFTER the WAL rewrite
sequence (tmp create -> write -> fsync -> rename -> fsync parent ->
reopen) is durable. Advancing earlier would leave snapshot_op past
entries still present on disk on any ? failure, letting a future
append() pass the slot collision check and silently evict a live
entry from the index.

server-ng's main installs Ctrl-C with abort-on-failure: without a
working SIGINT handler the shutdown flag would never flip and shard
threads would park indefinitely.

InvalidShardsCount split into ShardsCountZero (allocator produced
zero shards) and ShardsCountOverflow (count past OWNER_NONE - 1) so
the operator-facing message names the actual failure mode.

Bootstrap namespace materialization stream-filters inside read():
heavy (Arc<TopicStats>, Partition) clones are kept only for partitions
owned by this shard; non-owning entries are inserted straight into
the shards_table from the closure. Partial-spawn survivor join fans
out joiner threads so total wait is bounded by the slowest shard,
not N * SHARD_SHUTDOWN_TIMEOUT.

TODOs left in place for follow-ups:
- shutdown watchdog is still .detach()'d; tracking it on the bus
  self-deadlocks because the watchdog drives bus.shutdown(). Needs a
  shared core/task_registry crate ported from core/server.
- primary IggyMetadata::on_prepare pre-advances sequencer + checksum
  in push_prepare_entry before the journal append; an append failure
  leaves state pointing at a phantom op. Needs a rollback path in
  push_prepare_entry.

Storage milestones (durable PartitionJournal, durable
(view, commit_op) watermark) and SDK (client_id, request_id)
durability across reconnect remain out of scope; tracked separately.
@hubcio hubcio force-pushed the crossfire-server-ng branch from bcf8da7 to e0067ca Compare May 21, 2026 10:02
@hubcio
Copy link
Copy Markdown
Contributor Author

hubcio commented May 21, 2026

good catch @krishvishal , should be fixed now. I added a sentinel METADATA_CONSENSUS_NAMESPACE = 1u64 << 63 in core/common/src/sharding/namespace.rs (it's above PACKED_NAMESPACE_MAX so it can't collide with a real partition namespace), and bootstrap now inits the metadata group with that constant at bootstrap.rs:1106 instead of 0.

router got an is_vsr_reserved arm in route_typed that short-circuits any frame with namespace == METADATA_CONSENSUS_NAMESPACE straight to shard 0 at router.rs:172, so metadata consensus traffic never touches the murmur path. per-partition consensus groups still hash through route_consensus_control like before.

I think its good enough for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review PR is waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants