
Conversation

@marcodejongh (Owner)

Implements DistributedStateManager service to enable horizontal scaling
without sticky sessions:

  • Redis-backed connection tracking across instances
  • Distributed user list aggregation from all instances
  • Atomic leader election using Lua scripts
  • Cross-instance session membership validation

Key changes:

  • New DistributedStateManager service with comprehensive test coverage
  • RoomManager integration for seamless multi-instance support
  • Updated requireSessionMember helper for cross-instance auth
  • getSessionUsers now async to support distributed queries

This enables running multiple backend nodes behind a load balancer
without requiring sticky sessions. Leader election is coordinated
via Redis, and user lists are aggregated across all instances.
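For orientation, here is a minimal TypeScript sketch of the surface such a service might expose. The method and field names follow those referenced in the reviews below (registerConnection, removeConnection, joinSession, leaveSession, getSessionMemberCount); everything else is an assumption, not the actual DistributedStateManager signatures.

```typescript
// Hypothetical sketch only; the real DistributedStateManager may differ.
interface ConnectionInfo {
  connectionId: string;
  userId: string;
  instanceId: string;       // which backend node owns this WebSocket
  sessionId: string | null; // null until the connection joins a session
  isLeader: boolean;
  connectedAt: number;
}

interface DistributedStateLike {
  // Redis-backed connection tracking across instances.
  registerConnection(conn: ConnectionInfo): Promise<void>;
  removeConnection(connectionId: string): Promise<void>;
  // Session membership and Lua-based leader election.
  joinSession(connectionId: string, sessionId: string): Promise<{ isLeader: boolean }>;
  leaveSession(connectionId: string): Promise<void>;
  // Aggregation across all instances, not just the local one.
  getSessionMembers(sessionId: string): Promise<ConnectionInfo[]>;
  getSessionMemberCount(sessionId: string): Promise<number>;
}
```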

@vercel

vercel bot commented Dec 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
boardsesh Ready Ready Preview, Comment Dec 31, 2025 11:20am

@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in registerClient (room-manager.ts:112-116): The distributed state registration is fire-and-forget with .catch(). If the Redis registration fails but the local client registration succeeds, the connection will have inconsistent state across instances. Should await this or handle the failure more robustly; a sketch of the awaited approach follows this list.

  2. Missing await in removeClient (room-manager.ts:403-408): Same issue - removeConnection is fire-and-forget. If this fails, the connection will remain as a ghost in distributed state.

  3. findNearbySessions doesn't account for distributed state (room-manager.ts:844): The participantCount only uses local this.sessions.get(session.id)?.size which will be incorrect in multi-instance mode. Should use distributedState.getSessionMemberCount() instead.

  4. Unused variable connectedAt in Lua script (distributed-state.ts:51): The LEADER_ELECTION_SCRIPT accepts connectedAt as ARGV[2] but never uses it. Either remove the parameter or use it for leader priority.

  5. Missing TTL refresh for session members (distributed-state.ts:310-311): Session membership is added to Redis with TTL, but leader status changes don't refresh this TTL. Long-running sessions could have members expire unexpectedly.

  6. Potential null dereference (distributed-state.ts:530): In hashToConnection, hash.sessionId is stored as an empty string when the connection has no session. The ternary hash.sessionId || null on line 532 does convert this back to null, so nothing actually breaks, but storing an empty string while returning null is inconsistent and makes the no-session case easy to mishandle elsewhere.

  7. Test cleanup may fail silently (distributed-state.test.ts:24-27): If redis.keys() returns an empty array and you try to redis.del(...keys), this could cause issues depending on ioredis version. The check if (keys.length > 0) is good, but consider using a try-catch for robustness.
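A minimal sketch of the fix for issues 1 and 2, assuming a simplified RoomManager shape: await the Redis registration and roll back the local entry on failure so the local and distributed views cannot diverge. Types here are illustrative, not the real internals.

```typescript
type ClientInfo = { connectionId: string; userId: string };

interface DistributedStateOps {
  registerConnection(info: ClientInfo & { instanceId: string; connectedAt: number }): Promise<void>;
}

class RoomManagerSketch {
  private clients = new Map<string, ClientInfo>();

  constructor(
    private distributedState: DistributedStateOps,
    private instanceId: string,
  ) {}

  async registerClient(client: ClientInfo): Promise<void> {
    // Register locally first so the connection can be served immediately.
    this.clients.set(client.connectionId, client);
    try {
      // Await the distributed registration instead of fire-and-forget with .catch().
      await this.distributedState.registerConnection({
        ...client,
        instanceId: this.instanceId,
        connectedAt: Date.now(),
      });
    } catch (err) {
      // Roll back local state so local and Redis views stay consistent.
      this.clients.delete(client.connectionId);
      throw err;
    }
  }
}
```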

- Add DistributedStateManager to architecture diagram
- Update leader election section for distributed election
- Expand multi-instance support section with new diagrams
- Add distributed state Redis keys documentation
- Update graceful shutdown sequence for distributed cleanup
- Add distributed state timeouts and limits
- Add new files to Related Files section
- Update server.ts to call roomManager.shutdown() on exit
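The last bullet is the kind of hook sketched below; roomManager.shutdown() is taken from that bullet, while the signal handling around it is an assumption about server.ts rather than the actual code.

```typescript
// Illustrative only: wire process signals to the distributed cleanup so a
// terminating instance removes its connections and leader claims from Redis.
declare const roomManager: { shutdown(): Promise<void> };

async function shutdownGracefully(signal: string): Promise<void> {
  console.log(`Received ${signal}, cleaning up distributed state...`);
  try {
    await roomManager.shutdown();
  } finally {
    process.exit(0);
  }
}

process.on('SIGTERM', () => void shutdownGracefully('SIGTERM'));
process.on('SIGINT', () => void shutdownGracefully('SIGINT'));
```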
@claude

claude bot commented Dec 31, 2025

Claude Review

packages/backend/src/services/distributed-state.ts

  1. Race condition in joinSession (lines 303-333): The multi.exec() completes, then separately getConnection() is called and leader election happens via eval(). Another concurrent join could read stale connectedAt data between these operations.

  2. Missing TTL on session leader key (lines 60, 123): The sessionLeader key is set without a TTL. If all session members disconnect but cleanup fails, the leader key persists forever.

  3. Singleton pattern with module-level state (lines 541-581): Global distributedStateManager variable means tests could leak state between test files. Consider dependency injection instead.

  4. Fire-and-forget in registerClient (room-manager.ts:113): registerConnection errors are caught and logged but not propagated - client appears registered locally but distributed state may be inconsistent.

  5. Inconsistent session cleanup (room-manager.ts:345-347): When last user leaves, saveUsers(sessionId, []) is called on Redis store but distributed state members set isn't explicitly cleaned.

packages/backend/src/graphql/resolvers/shared/helpers.ts

  1. Expensive retry loop (lines 51-72): Up to 8 retries with exponential backoff (~6.4s total) for auth validation seems excessive. Each retry hits Redis twice if distributed state is enabled.

packages/backend/src/tests/distributed-state.test.ts

  1. Tests use redis.keys('boardsesh:*') (lines 22-25): The KEYS command blocks Redis and is O(N) - should use SCAN for production code, though acceptable for tests with small datasets.
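For the KEYS point above, a small sketch of a SCAN-based cleanup using ioredis's scanStream; as the review says, KEYS is acceptable for these tests, so this is only how the non-blocking variant might look.

```typescript
import Redis from 'ioredis';

// Delete all keys matching a prefix without blocking Redis the way KEYS does.
async function deleteByPrefix(redis: Redis, pattern: string): Promise<void> {
  const stream = redis.scanStream({ match: pattern, count: 100 });
  for await (const keys of stream as AsyncIterable<string[]>) {
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}

// Usage in a test hook:
// beforeEach(() => deleteByPrefix(redis, 'boardsesh:*'));
```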

- Fix race condition in registerClient by awaiting distributed state
  registration and rolling back local state on failure
- Fix missing await in removeClient to prevent ghost connections
- Fix findNearbySessions to use distributed state for accurate cross-instance
  participant counts
- Remove unused connectedAt parameter from LEADER_ELECTION_SCRIPT
- Add TTL refresh for session members in ELECT_NEW_LEADER_SCRIPT to prevent
  expiry during long sessions
- Add try-catch to test cleanup in all beforeEach blocks for robustness
- Fix queue-sync-fixes.test.ts to await async registerClient
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in leader election (distributed-state.ts:322-331): The leader election uses two separate Redis operations - eval for the Lua script, then hset for updating the isLeader flag. If the process crashes between these operations, the connection's isLeader flag won't be set despite being the leader. A sketch of an atomic single-script approach follows this list.

  2. Singleton global state without reset (distributed-state.ts:546-584): The module-level distributedStateManager singleton cannot be reset between tests or in hot-reload scenarios. This could cause issues in test isolation or development.

  3. Missing error handling for Lua script failures (distributed-state.ts:362-369): The leaveSession method doesn't handle potential failures from the ELECT_NEW_LEADER_SCRIPT Lua script execution.

  4. Session members TTL not refreshed on activity (distributed-state.ts:316-317): The session members set TTL is only set on join. Long sessions (>4 hours) could have their membership set expire even with active users.

  5. Potential memory leak in cleanupInstanceConnections (distributed-state.ts:484-501): If removeConnection fails for one connection, the loop continues but the failed connection remains tracked. The entire instanceConnections set is then deleted, potentially orphaning the connection data.
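A minimal sketch of the atomic variant suggested in issue 1, with assumed key names: claim the leader key and set the isLeader flag inside one Lua script, so a crash between the two writes cannot leave them out of sync. The real LEADER_ELECTION_SCRIPT and its key layout may differ.

```typescript
// Sketch only; key layout and return convention are assumptions.
const ATOMIC_LEADER_ELECTION_SCRIPT = `
  -- KEYS[1] = session leader key, KEYS[2] = connection hash key
  -- ARGV[1] = connectionId
  local claimed = redis.call('SETNX', KEYS[1], ARGV[1])
  if claimed == 1 or redis.call('GET', KEYS[1]) == ARGV[1] then
    -- Leader key and isLeader flag are updated in the same script execution.
    redis.call('HSET', KEYS[2], 'isLeader', 'true')
    return 1
  end
  return 0
`;

// Usage with ioredis:
// redis.eval(ATOMIC_LEADER_ELECTION_SCRIPT, 2, leaderKey, connectionKey, connectionId)
```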

Test Quality

The tests are comprehensive with good coverage of:

  • Single and multi-instance scenarios
  • Edge cases (stale data, rapid operations, concurrent joins)
  • Leader election and transfer

Missing test coverage:

  • Redis connection failures mid-operation
  • TTL expiration scenarios
  • Heartbeat failure handling

Documentation

Documentation in docs/websocket-implementation.md has been properly updated to cover the new distributed state architecture.

- Fix race condition in leader election by making LEADER_ELECTION_SCRIPT
  atomically set both leader key and isLeader flag in a single Lua operation
- Add error handling for Lua script failures in leaveSession with fallback
  to clear leader key on error
- Add refreshSessionMembership method and update refreshConnection to also
  refresh session membership TTL for long-running sessions
- Fix potential memory leak in cleanupInstanceConnections by force-cleaning
  failed connection data and wrapping operations in try-catch
- Add resetDistributedState function for test isolation and hot-reload
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

1. Potential race condition in removeConnection (distributed-state.ts:243-270)

  • The removeConnection method reads connection state, then performs multiple Redis operations in a pipeline, but if the leader leaves, the leaveSession logic is not called - the caller must handle leader election separately. This could lead to orphaned leader state if not properly orchestrated.

2. Missing old leader isLeader flag reset in Lua script (distributed-state.ts:96-140)

  • ELECT_NEW_LEADER_SCRIPT sets the new leader isLeader flag to true but never sets the old leader flag to false. While the old leader is removed from the session, if they rejoin quickly, their stale isLeader=true could cause inconsistency.

3. Singleton reset not thread-safe (distributed-state.ts:648-650)

  • resetDistributedState() sets the singleton to null without calling stop(), potentially leaving heartbeat intervals running and orphaned connections in Redis.

4. Error swallowed in cleanup (room-manager.ts:409-417)

  • removeClient catches distributed state errors but continues with local cleanup. If distributed state removal fails, the connection data remains in Redis but is removed locally, causing ghost connections.

5. TTL refresh race condition (distributed-state.ts:459-472)

  • refreshConnection reads the connection to get sessionId, then refreshes TTLs in a separate pipeline. If the connection leaves the session between read and write, it refreshes a stale session membership. A sketch of an atomic refresh follows this list.

6. Missing test coverage for error paths

  • No tests for Redis connection failures, Lua script execution failures, or partial failure scenarios during multi-operation transactions.
  • No tests for refreshConnection or refreshSessionMembership methods.
  • No tests for the heartbeat mechanism (start(), updateHeartbeat).

7. Documentation updated correctly

  • docs/websocket-implementation.md has been updated to reflect the new distributed state architecture.
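Issue 5's read-then-refresh race is the kind of thing the REFRESH_TTL_SCRIPT mentioned in the follow-up commit below addresses. A hedged sketch, with assumed key layout and field names, that re-checks membership inside the script before extending the session TTL:

```typescript
// Sketch only; key names and hash fields are assumptions, not the real layout.
const REFRESH_TTL_SCRIPT = `
  -- KEYS[1] = connection hash key, KEYS[2] = session members set key
  -- ARGV[1] = expected sessionId, ARGV[2] = TTL in seconds
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
  local current = redis.call('HGET', KEYS[1], 'sessionId')
  if current == ARGV[1] and current ~= '' then
    -- Only refresh the membership TTL if the connection is still in this session.
    redis.call('EXPIRE', KEYS[2], tonumber(ARGV[2]))
    return 1
  end
  return 0
`;
```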

- Fix race condition in removeConnection by adding automatic leader election
  when removing a leader (controlled via an electNewLeader parameter; sketched below)
- Update ELECT_NEW_LEADER_SCRIPT to reset old leader's isLeader flag before
  electing new leader, preventing stale flag inconsistencies
- Make resetDistributedState async and call stop() to prevent orphaned
  heartbeat intervals; add forceResetDistributedState for sync reset
- Add atomic REFRESH_TTL_SCRIPT Lua script to avoid race conditions in
  refreshConnection where session could change between read and write
- Improve removeClient error handling to return cleanup status and provide
  more detailed logging about distributed state failures
- Simplify cleanupInstanceConnections since removeConnection now handles
  leader election automatically
- Add comprehensive tests for removeConnection with leader election,
  refreshConnection, and old leader flag reset scenarios
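The first bullet above is sketched here: leader election moves into removeConnection itself. Apart from getConnection, which an earlier review mentions, the helper names (electLeaderFor, dropConnectionKeys) are hypothetical stand-ins for the real internals.

```typescript
interface ConnRecord {
  connectionId: string;
  sessionId: string | null;
  isLeader: boolean;
}

interface StateOps {
  getConnection(id: string): Promise<ConnRecord | null>;
  electLeaderFor(sessionId: string, opts: { excluding: string }): Promise<void>;
  dropConnectionKeys(conn: ConnRecord): Promise<void>;
}

async function removeConnection(
  state: StateOps,
  connectionId: string,
  electNewLeader = true,
): Promise<void> {
  const conn = await state.getConnection(connectionId);
  if (!conn) return;

  // If the departing connection led its session, hand leadership off first
  // so remaining members are not left pointing at a departed leader.
  if (electNewLeader && conn.isLeader && conn.sessionId) {
    await state.electLeaderFor(conn.sessionId, { excluding: connectionId });
  }

  // Remove the connection hash, instance tracking and session membership.
  await state.dropConnectionKeys(conn);
}
```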
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in leaveSession (packages/backend/src/services/distributed-state.ts:416-430) - The leaveSession method reads connection.isLeader before the Redis multi transaction, but the multi transaction doesn't use that value atomically. If another process updates the leader status between the read and the transaction commit, the leader election logic could behave incorrectly.

  2. Singleton state pollution across tests (packages/backend/src/services/distributed-state.ts:656-693) - The module-level distributedStateManager singleton is problematic for testing. Tests manually call stop() but the singleton persists. While forceResetDistributedState() exists, the test file doesn't use it, meaning tests that fail mid-execution may leak state to subsequent tests.

  3. Missing error handling for cleanupInstanceConnections loop (packages/backend/src/services/distributed-state.ts:589-596) - The loop continues after errors but the order of cleanup matters. If leader election fails for one connection, subsequent connections in the same session may have incorrect leader state.

  4. Inconsistent null vs empty string handling (packages/backend/src/services/distributed-state.ts:645) - hashToConnection treats empty string sessionId as null, but leaveSession explicitly sets sessionId: '' (line 423). The inconsistency could cause subtle bugs when checking session membership.

  5. Potential memory leak in RoomManager (packages/backend/src/services/room-manager.ts:412-438) - removeClient doesn't clean up this.sessions map entries if the client was in a session. It only removes from this.clients. The session set may retain a stale reference.

  6. No TTL on session leader key (packages/backend/src/services/distributed-state.ts:62-63) - The LEADER_ELECTION_SCRIPT sets the leader key without TTL. If all connections disconnect without proper cleanup, the leader key remains indefinitely. Consider adding TTL matching sessionMembership TTL.

  7. Tests require Redis running (packages/backend/src/__tests__/distributed-state.test.ts:8) - Integration tests hardcode Redis URL but there's no skip logic or mock fallback. Tests will fail in CI environments without Redis.
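One way issue 7 could be handled, assuming a Jest- or Vitest-style describe global and an optional REDIS_URL environment variable: probe Redis once and skip the suite when it is unreachable, rather than failing outright in CI.

```typescript
import Redis from 'ioredis';

const redisUrl = process.env.REDIS_URL ?? 'redis://localhost:6379';

// Probe Redis without the default retry loop so an absent server fails fast.
async function redisAvailable(): Promise<boolean> {
  const probe = new Redis(redisUrl, {
    lazyConnect: true,
    retryStrategy: () => null,
    maxRetriesPerRequest: 0,
  });
  try {
    await probe.connect();
    await probe.ping();
    return true;
  } catch {
    return false;
  } finally {
    probe.disconnect();
  }
}

// In the test file:
// const describeWithRedis = (await redisAvailable()) ? describe : describe.skip;
```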

Documentation

The docs/websocket-implementation.md has been properly updated with the new DistributedStateManager architecture, including Mermaid diagrams and Redis key structures.
