
Conversation

@marcodejongh (Owner)

Implements DistributedStateManager service to enable horizontal scaling
without sticky sessions:

  • Redis-backed connection tracking across instances
  • Distributed user list aggregation from all instances
  • Atomic leader election using Lua scripts
  • Cross-instance session membership validation

Key changes:

  • New DistributedStateManager service with comprehensive test coverage
  • RoomManager integration for seamless multi-instance support
  • Updated requireSessionMember helper for cross-instance auth
  • getSessionUsers now async to support distributed queries

This enables running multiple backend nodes behind a load balancer
without requiring sticky sessions. Leader election is coordinated
via Redis, and user lists are aggregated across all instances.
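For orientation, here is a minimal TypeScript sketch of the surface such a service might expose. The method and field names follow those referenced in the reviews below (registerConnection, removeConnection, joinSession, leaveSession, getSessionMemberCount); everything else is an assumption, not the actual DistributedStateManager signatures.

```typescript
// Hypothetical sketch only; the real DistributedStateManager may differ.
interface ConnectionInfo {
  connectionId: string;
  userId: string;
  instanceId: string;       // which backend node owns this WebSocket
  sessionId: string | null; // null until the connection joins a session
  isLeader: boolean;
  connectedAt: number;
}

interface DistributedStateLike {
  // Redis-backed connection tracking across instances.
  registerConnection(conn: ConnectionInfo): Promise<void>;
  removeConnection(connectionId: string): Promise<void>;
  // Session membership and Lua-based leader election.
  joinSession(connectionId: string, sessionId: string): Promise<{ isLeader: boolean }>;
  leaveSession(connectionId: string): Promise<void>;
  // Aggregation across all instances, not just the local one.
  getSessionMembers(sessionId: string): Promise<ConnectionInfo[]>;
  getSessionMemberCount(sessionId: string): Promise<number>;
}
```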

@vercel

vercel bot commented Dec 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
boardsesh Ready Ready Preview, Comment Dec 31, 2025 11:20am

@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in registerClient (room-manager.ts:112-116): The distributed state registration is fire-and-forget with .catch(). If the Redis registration fails but the local client registration succeeds, the connection will have inconsistent state across instances. Should await this or handle the failure more robustly; a sketch of the awaited approach follows this list.

  2. Missing await in removeClient (room-manager.ts:403-408): Same issue - removeConnection is fire-and-forget. If this fails, the connection will remain as a ghost in distributed state.

  3. findNearbySessions doesn't account for distributed state (room-manager.ts:844): The participantCount only uses local this.sessions.get(session.id)?.size which will be incorrect in multi-instance mode. Should use distributedState.getSessionMemberCount() instead.

  4. Unused variable connectedAt in Lua script (distributed-state.ts:51): The LEADER_ELECTION_SCRIPT accepts connectedAt as ARGV[2] but never uses it. Either remove the parameter or use it for leader priority.

  5. Missing TTL refresh for session members (distributed-state.ts:310-311): Session membership is added to Redis with TTL, but leader status changes don't refresh this TTL. Long-running sessions could have members expire unexpectedly.

  6. Potential null dereference (distributed-state.ts:530): In hashToConnection, hash.sessionId is stored as an empty string when the connection has no session. The ternary hash.sessionId || null on line 532 does convert this back to null, so nothing actually breaks, but storing an empty string while returning null is inconsistent and makes the no-session case easy to mishandle elsewhere.

  7. Test cleanup may fail silently (distributed-state.test.ts:24-27): If redis.keys() returns an empty array and you try to redis.del(...keys), this could cause issues depending on ioredis version. The check if (keys.length > 0) is good, but consider using a try-catch for robustness.
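A minimal sketch of the fix for issues 1 and 2, assuming a simplified RoomManager shape: await the Redis registration and roll back the local entry on failure so the local and distributed views cannot diverge. Types here are illustrative, not the real internals.

```typescript
type ClientInfo = { connectionId: string; userId: string };

interface DistributedStateOps {
  registerConnection(info: ClientInfo & { instanceId: string; connectedAt: number }): Promise<void>;
}

class RoomManagerSketch {
  private clients = new Map<string, ClientInfo>();

  constructor(
    private distributedState: DistributedStateOps,
    private instanceId: string,
  ) {}

  async registerClient(client: ClientInfo): Promise<void> {
    // Register locally first so the connection can be served immediately.
    this.clients.set(client.connectionId, client);
    try {
      // Await the distributed registration instead of fire-and-forget with .catch().
      await this.distributedState.registerConnection({
        ...client,
        instanceId: this.instanceId,
        connectedAt: Date.now(),
      });
    } catch (err) {
      // Roll back local state so local and Redis views stay consistent.
      this.clients.delete(client.connectionId);
      throw err;
    }
  }
}
```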

- Add DistributedStateManager to architecture diagram
- Update leader election section for distributed election
- Expand multi-instance support section with new diagrams
- Add distributed state Redis keys documentation
- Update graceful shutdown sequence for distributed cleanup
- Add distributed state timeouts and limits
- Add new files to Related Files section
- Update server.ts to call roomManager.shutdown() on exit
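The last bullet is the kind of hook sketched below; roomManager.shutdown() is taken from that bullet, while the signal handling around it is an assumption about server.ts rather than the actual code.

```typescript
// Illustrative only: wire process signals to the distributed cleanup so a
// terminating instance removes its connections and leader claims from Redis.
declare const roomManager: { shutdown(): Promise<void> };

async function shutdownGracefully(signal: string): Promise<void> {
  console.log(`Received ${signal}, cleaning up distributed state...`);
  try {
    await roomManager.shutdown();
  } finally {
    process.exit(0);
  }
}

process.on('SIGTERM', () => void shutdownGracefully('SIGTERM'));
process.on('SIGINT', () => void shutdownGracefully('SIGINT'));
```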
@claude

claude bot commented Dec 31, 2025

Claude Review

packages/backend/src/services/distributed-state.ts

  1. Race condition in joinSession (lines 303-333): The multi.exec() completes, then separately getConnection() is called and leader election happens via eval(). Another concurrent join could read stale connectedAt data between these operations.

  2. Missing TTL on session leader key (lines 60, 123): The sessionLeader key is set without a TTL. If all session members disconnect but cleanup fails, the leader key persists forever.

  3. Singleton pattern with module-level state (lines 541-581): Global distributedStateManager variable means tests could leak state between test files. Consider dependency injection instead.

  4. Fire-and-forget in registerClient (room-manager.ts:113): registerConnection errors are caught and logged but not propagated - client appears registered locally but distributed state may be inconsistent.

  5. Inconsistent session cleanup (room-manager.ts:345-347): When last user leaves, saveUsers(sessionId, []) is called on Redis store but distributed state members set isn't explicitly cleaned.

packages/backend/src/graphql/resolvers/shared/helpers.ts

  1. Expensive retry loop (lines 51-72): Up to 8 retries with exponential backoff (~6.4s total) for auth validation seems excessive. Each retry hits Redis twice if distributed state is enabled.

packages/backend/src/tests/distributed-state.test.ts

  1. Tests use redis.keys('boardsesh:*') (lines 22-25): The KEYS command blocks Redis and is O(N) - should use SCAN for production code, though acceptable for tests with small datasets.
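For the KEYS point above, a small sketch of a SCAN-based cleanup using ioredis's scanStream; as the review says, KEYS is acceptable for these tests, so this is only how the non-blocking variant might look.

```typescript
import Redis from 'ioredis';

// Delete all keys matching a prefix without blocking Redis the way KEYS does.
async function deleteByPrefix(redis: Redis, pattern: string): Promise<void> {
  const stream = redis.scanStream({ match: pattern, count: 100 });
  for await (const keys of stream as AsyncIterable<string[]>) {
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}

// Usage in a test hook:
// beforeEach(() => deleteByPrefix(redis, 'boardsesh:*'));
```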

- Fix race condition in registerClient by awaiting distributed state
  registration and rolling back local state on failure
- Fix missing await in removeClient to prevent ghost connections
- Fix findNearbySessions to use distributed state for accurate cross-instance
  participant counts
- Remove unused connectedAt parameter from LEADER_ELECTION_SCRIPT
- Add TTL refresh for session members in ELECT_NEW_LEADER_SCRIPT to prevent
  expiry during long sessions
- Add try-catch to test cleanup in all beforeEach blocks for robustness
- Fix queue-sync-fixes.test.ts to await async registerClient
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in leader election (distributed-state.ts:322-331): The leader election uses two separate Redis operations - eval for the Lua script, then hset for updating the isLeader flag. If the process crashes between these operations, the connection's isLeader flag won't be set despite being the leader. A sketch of an atomic single-script approach follows this list.

  2. Singleton global state without reset (distributed-state.ts:546-584): The module-level distributedStateManager singleton cannot be reset between tests or in hot-reload scenarios. This could cause issues in test isolation or development.

  3. Missing error handling for Lua script failures (distributed-state.ts:362-369): The leaveSession method doesn't handle potential failures from the ELECT_NEW_LEADER_SCRIPT Lua script execution.

  4. Session members TTL not refreshed on activity (distributed-state.ts:316-317): The session members set TTL is only set on join. Long sessions (>4 hours) could have their membership set expire even with active users.

  5. Potential memory leak in cleanupInstanceConnections (distributed-state.ts:484-501): If removeConnection fails for one connection, the loop continues but the failed connection remains tracked. The entire instanceConnections set is then deleted, potentially orphaning the connection data.
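A minimal sketch of the atomic variant suggested in issue 1, with assumed key names: claim the leader key and set the isLeader flag inside one Lua script, so a crash between the two writes cannot leave them out of sync. The real LEADER_ELECTION_SCRIPT and its key layout may differ.

```typescript
// Sketch only; key layout and return convention are assumptions.
const ATOMIC_LEADER_ELECTION_SCRIPT = `
  -- KEYS[1] = session leader key, KEYS[2] = connection hash key
  -- ARGV[1] = connectionId
  local claimed = redis.call('SETNX', KEYS[1], ARGV[1])
  if claimed == 1 or redis.call('GET', KEYS[1]) == ARGV[1] then
    -- Leader key and isLeader flag are updated in the same script execution.
    redis.call('HSET', KEYS[2], 'isLeader', 'true')
    return 1
  end
  return 0
`;

// Usage with ioredis:
// redis.eval(ATOMIC_LEADER_ELECTION_SCRIPT, 2, leaderKey, connectionKey, connectionId)
```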

Test Quality

The tests are comprehensive with good coverage of:

  • Single and multi-instance scenarios
  • Edge cases (stale data, rapid operations, concurrent joins)
  • Leader election and transfer

Missing test coverage:

  • Redis connection failures mid-operation
  • TTL expiration scenarios
  • Heartbeat failure handling

Documentation

Documentation in docs/websocket-implementation.md has been properly updated to cover the new distributed state architecture.

- Fix race condition in leader election by making LEADER_ELECTION_SCRIPT
  atomically set both leader key and isLeader flag in a single Lua operation
- Add error handling for Lua script failures in leaveSession with fallback
  to clear leader key on error
- Add refreshSessionMembership method and update refreshConnection to also
  refresh session membership TTL for long-running sessions
- Fix potential memory leak in cleanupInstanceConnections by force-cleaning
  failed connection data and wrapping operations in try-catch
- Add resetDistributedState function for test isolation and hot-reload
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

1. Potential race condition in removeConnection (distributed-state.ts:243-270)

  • The removeConnection method reads connection state, then performs multiple Redis operations in a pipeline, but if the leader leaves, the leaveSession logic is not called - the caller must handle leader election separately. This could lead to orphaned leader state if not properly orchestrated.

2. Missing old leader isLeader flag reset in Lua script (distributed-state.ts:96-140)

  • ELECT_NEW_LEADER_SCRIPT sets the new leader isLeader flag to true but never sets the old leader flag to false. While the old leader is removed from the session, if they rejoin quickly, their stale isLeader=true could cause inconsistency.

3. Singleton reset not thread-safe (distributed-state.ts:648-650)

  • resetDistributedState() sets the singleton to null without calling stop(), potentially leaving heartbeat intervals running and orphaned connections in Redis.

4. Error swallowed in cleanup (room-manager.ts:409-417)

  • removeClient catches distributed state errors but continues with local cleanup. If distributed state removal fails, the connection data remains in Redis but is removed locally, causing ghost connections.

5. TTL refresh race condition (distributed-state.ts:459-472)

  • refreshConnection reads the connection to get sessionId, then refreshes TTLs in a separate pipeline. If the connection leaves the session between read and write, it refreshes a stale session membership. A sketch of an atomic refresh follows this list.

6. Missing test coverage for error paths

  • No tests for Redis connection failures, Lua script execution failures, or partial failure scenarios during multi-operation transactions.
  • No tests for refreshConnection or refreshSessionMembership methods.
  • No tests for the heartbeat mechanism (start(), updateHeartbeat).

7. Documentation updated correctly

  • docs/websocket-implementation.md has been updated to reflect the new distributed state architecture.
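Issue 5's read-then-refresh race is the kind of thing the REFRESH_TTL_SCRIPT mentioned in the follow-up commit below addresses. A hedged sketch, with assumed key layout and field names, that re-checks membership inside the script before extending the session TTL:

```typescript
// Sketch only; key names and hash fields are assumptions, not the real layout.
const REFRESH_TTL_SCRIPT = `
  -- KEYS[1] = connection hash key, KEYS[2] = session members set key
  -- ARGV[1] = expected sessionId, ARGV[2] = TTL in seconds
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
  local current = redis.call('HGET', KEYS[1], 'sessionId')
  if current == ARGV[1] and current ~= '' then
    -- Only refresh the membership TTL if the connection is still in this session.
    redis.call('EXPIRE', KEYS[2], tonumber(ARGV[2]))
    return 1
  end
  return 0
`;
```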

- Fix race condition in removeConnection by adding automatic leader election
  when removing a leader (controlled via an electNewLeader parameter; sketched below)
- Update ELECT_NEW_LEADER_SCRIPT to reset old leader's isLeader flag before
  electing new leader, preventing stale flag inconsistencies
- Make resetDistributedState async and call stop() to prevent orphaned
  heartbeat intervals; add forceResetDistributedState for sync reset
- Add atomic REFRESH_TTL_SCRIPT Lua script to avoid race conditions in
  refreshConnection where session could change between read and write
- Improve removeClient error handling to return cleanup status and provide
  more detailed logging about distributed state failures
- Simplify cleanupInstanceConnections since removeConnection now handles
  leader election automatically
- Add comprehensive tests for removeConnection with leader election,
  refreshConnection, and old leader flag reset scenarios
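The first bullet above is sketched here: leader election moves into removeConnection itself. Apart from getConnection, which an earlier review mentions, the helper names (electLeaderFor, dropConnectionKeys) are hypothetical stand-ins for the real internals.

```typescript
interface ConnRecord {
  connectionId: string;
  sessionId: string | null;
  isLeader: boolean;
}

interface StateOps {
  getConnection(id: string): Promise<ConnRecord | null>;
  electLeaderFor(sessionId: string, opts: { excluding: string }): Promise<void>;
  dropConnectionKeys(conn: ConnRecord): Promise<void>;
}

async function removeConnection(
  state: StateOps,
  connectionId: string,
  electNewLeader = true,
): Promise<void> {
  const conn = await state.getConnection(connectionId);
  if (!conn) return;

  // If the departing connection led its session, hand leadership off first
  // so remaining members are not left pointing at a departed leader.
  if (electNewLeader && conn.isLeader && conn.sessionId) {
    await state.electLeaderFor(conn.sessionId, { excluding: connectionId });
  }

  // Remove the connection hash, instance tracking and session membership.
  await state.dropConnectionKeys(conn);
}
```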
@claude

claude bot commented Dec 31, 2025

Claude Review

Issues Found

  1. Race condition in leaveSession (packages/backend/src/services/distributed-state.ts:416-430) - The leaveSession method reads connection.isLeader before the Redis multi transaction, but the multi transaction doesn't use that value atomically. If another process updates the leader status between the read and the transaction commit, the leader election logic could behave incorrectly.

  2. Singleton state pollution across tests (packages/backend/src/services/distributed-state.ts:656-693) - The module-level distributedStateManager singleton is problematic for testing. Tests manually call stop() but the singleton persists. While forceResetDistributedState() exists, the test file doesn't use it, meaning tests that fail mid-execution may leak state to subsequent tests.

  3. Missing error handling for cleanupInstanceConnections loop (packages/backend/src/services/distributed-state.ts:589-596) - The loop continues after errors but the order of cleanup matters. If leader election fails for one connection, subsequent connections in the same session may have incorrect leader state.

  4. Inconsistent null vs empty string handling (packages/backend/src/services/distributed-state.ts:645) - hashToConnection treats empty string sessionId as null, but leaveSession explicitly sets sessionId: '' (line 423). The inconsistency could cause subtle bugs when checking session membership.

  5. Potential memory leak in RoomManager (packages/backend/src/services/room-manager.ts:412-438) - removeClient doesn't clean up this.sessions map entries if the client was in a session. It only removes from this.clients. The session set may retain a stale reference.

  6. No TTL on session leader key (packages/backend/src/services/distributed-state.ts:62-63) - The LEADER_ELECTION_SCRIPT sets the leader key without TTL. If all connections disconnect without proper cleanup, the leader key remains indefinitely. Consider adding TTL matching sessionMembership TTL.

  7. Tests require Redis running (packages/backend/src/__tests__/distributed-state.test.ts:8) - Integration tests hardcode Redis URL but there's no skip logic or mock fallback. Tests will fail in CI environments without Redis.
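One way issue 7 could be handled, assuming a Jest- or Vitest-style describe global and an optional REDIS_URL environment variable: probe Redis once and skip the suite when it is unreachable, rather than failing outright in CI.

```typescript
import Redis from 'ioredis';

const redisUrl = process.env.REDIS_URL ?? 'redis://localhost:6379';

// Probe Redis without the default retry loop so an absent server fails fast.
async function redisAvailable(): Promise<boolean> {
  const probe = new Redis(redisUrl, {
    lazyConnect: true,
    retryStrategy: () => null,
    maxRetriesPerRequest: 0,
  });
  try {
    await probe.connect();
    await probe.ping();
    return true;
  } catch {
    return false;
  } finally {
    probe.disconnect();
  }
}

// In the test file:
// const describeWithRedis = (await redisAvailable()) ? describe : describe.skip;
```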

Documentation

The docs/websocket-implementation.md has been properly updated with the new DistributedStateManager architecture, including Mermaid diagrams and Redis key structures.
