Skip to content

Fix signaling races during peer connection setup#364

Open
erikdubbelboer wants to merge 1 commit into
mainfrom
lobby-join-race-conditions
Open

Fix signaling races during peer connection setup#364
erikdubbelboer wants to merge 1 commit into
mainfrom
lobby-join-race-conditions

Conversation

@erikdubbelboer
Copy link
Copy Markdown
Member

@erikdubbelboer erikdubbelboer commented Jun 6, 2026

Serialize incoming signaling messages so WebRTC descriptions and candidates are handled in order. Queue ICE candidates until a remote description is available, ignore stale answers that no longer match the current signaling state, and avoid creating duplicate Peer instances while async TURN credential lookup is in flight.

On the server, have JoinLobby return the peers that existed before the join and only request connections to that snapshot. This avoids racing newly joined peers against each other while the lobby membership changes.

Credential request responses can still resolve immediately because _addPeer may wait for them while handling a queued connect packet.

There are no new tests as these race conditions aren't easy to replicate with simple tests. I do have a test framework I used to find them, but this runs for several minutes and uses multiple headless Chrome browsers. I'm cleaning up this code for a separate pull request.

Serialize incoming signaling messages so WebRTC descriptions and candidates are
handled in order. Queue ICE candidates until a remote description is available,
ignore stale answers that no longer match the current signaling state, and avoid
creating duplicate Peer instances while async TURN credential lookup is in
flight.

On the server, have JoinLobby return the peers that existed before the join and
only request connections to that snapshot. This avoids racing newly joined peers
against each other while the lobby membership changes.

Credential request responses can still resolve immediately because _addPeer may
wait for them while handling a queued connect packet.
Comment thread lib/peer.ts
private makingOffer: boolean = false
private ignoreOffer: boolean = false
private isSettingRemoteAnswerPending: boolean = false
private readonly pendingCandidates: RTCIceCandidate[] = []
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private readonly pendingCandidates: RTCIceCandidate[] = []
private readonly pendingRemoteCandidates: RTCIceCandidate[] = []

I would always include local/remote prefixes for these kinds of state.

@erikdubbelboer
Copy link
Copy Markdown
Member Author

Issue 1: New lobby joiners could both initiate the same peer connection

Before the change:

  1. Peer A is already in a lobby.
  2. Peers B and C join that lobby at nearly the same time.
  3. JoinLobby itself is serialized correctly with SELECT ... FOR UPDATE, so the database update is safe:
    • B appends itself.
    • C appends itself after B.
  4. But after joining, HandleJoinPacket called GetLobby.
  5. GetLobby happens after the transaction is committed, outside the join lock.
  6. So B might join when only A existed, but then call GetLobby after C has also joined.
  7. B now sees [A, B, C] and requests a connection to both A and C.
  8. C also sees [A, B, C] and requests a connection to both A and B.
  9. Now both B and C can initiate a connection to each other.

That matters because RequestConnection assigns roles per initiated pair: the joiner gets one connect packet with polite: true, and the existing peer gets one with polite: false. If both peers initiate, each side can receive conflicting duplicate connect packets for the same peer ID.

The fix is that JoinLobby now returns the peer list captured under the row lock, before appending the new peer. Then HandleJoinPacket only calls RequestConnection for that snapshot.

@erikdubbelboer
Copy link
Copy Markdown
Member Author

Issue 2: Client signaling handlers could interleave and corrupt WebRTC setup

Before the change, every websocket message started handleSignalingMessage(...) independently. The websocket delivers messages in order, but the handler is async, so message handling could overlap.

A bad sequence looked like this:

  1. Client receives a connect packet for peer B.
  2. connect handling calls _addPeer.
  3. _addPeer waits for TURN credentials before constructing the Peer.
  4. While that await is still pending, more websocket messages arrive.
  5. A description or candidate message for B starts handling concurrently.
  6. Depending on timing, the peer might not exist yet, the remote description might not be set yet, or two concurrent connect handlers might both try to create a peer for the same ID.

That led to a few concrete failure modes:

  1. ICE candidates could be applied before setRemoteDescription finished. Browsers reject that with a remote-description-null or invalid-state error.
  2. Duplicate connect packets could race through _addPeer; both would see no peer before awaiting credentials, then both could create Peer instances. The later one overwrote the map entry, while the older one could still have event handlers and pending WebRTC work.
  3. Delayed answers could be applied after the connection was no longer in have-local-offer. setRemoteDescription(answer) is only valid while there is a matching local offer, so stale answers could throw.
  4. Once the message queue was added, a new self-dependency appeared: connect waits for _addPeer, _addPeer waits for TURN credentials, and the credential response comes through the same websocket queue. Without the immediate credential-response bypass, the response would sit behind the connect handler that is waiting for it.

The fix addresses those pieces together:

  1. Incoming signaling messages are serialized through messageQueue, so a description handler can finish before the next candidate handler runs.
  2. ICE candidates are still queued inside Peer if they arrive before remoteDescription, then drained after setRemoteDescription.
  3. _addPeer checks this.peers.has(id) before and after credential lookup, so a peer created while another _addPeer was awaiting credentials prevents duplicate construction.
  4. Stale answers are ignored unless the connection is actually in have-local-offer.
    Credential request responses resolve immediately instead of waiting for the signaling queue, because they are needed to unblock queued connect handling.

@koenbollen
Copy link
Copy Markdown
Collaborator

As discussed in voice. Let's try and find a metric in the analytics that can verify that this change is going to be a positive one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants