Fix signaling races during peer connection setup by erikdubbelboer · Pull Request #364 · poki/netlib

erikdubbelboer · 2026-06-06T09:03:39Z

Serialize incoming signaling messages so WebRTC descriptions and candidates are handled in order. Queue ICE candidates until a remote description is available, ignore stale answers that no longer match the current signaling state, and avoid creating duplicate Peer instances while async TURN credential lookup is in flight.

On the server, have JoinLobby return the peers that existed before the join and only request connections to that snapshot. This avoids racing newly joined peers against each other while the lobby membership changes.

Credential request responses can still resolve immediately because _addPeer may wait for them while handling a queued connect packet.

There are no new tests as these race conditions aren't easy to replicate with simple tests. I do have a test framework I used to find them, but this runs for several minutes and uses multiple headless Chrome browsers. I'm cleaning up this code for a separate pull request.


koenbollen · 2026-06-06T09:15:25Z

  private makingOffer: boolean = false
  private ignoreOffer: boolean = false
  private isSettingRemoteAnswerPending: boolean = false
+  private readonly pendingCandidates: RTCIceCandidate[] = []


Suggested change

-  private readonly pendingCandidates: RTCIceCandidate[] = []
+  private readonly pendingRemoteCandidates: RTCIceCandidate[] = []

I would always include local/remote prefixes for these kinds of state.

erikdubbelboer · 2026-06-06T09:34:40Z

Issue 1: New lobby joiners could both initiate the same peer connection

Before the change:

1. Peer A is already in a lobby.
2. Peers B and C join that lobby at nearly the same time.
3. JoinLobby itself is serialized correctly with SELECT ... FOR UPDATE, so the database update is safe:
   - B appends itself.
   - C appends itself after B.
4. But after joining, HandleJoinPacket called GetLobby.
5. GetLobby happens after the transaction is committed, outside the join lock.
6. So B might join when only A existed, but then call GetLobby after C has also joined.
7. B now sees [A, B, C] and requests a connection to both A and C.
8. C also sees [A, B, C] and requests a connection to both A and B.
9. Now both B and C can initiate a connection to each other.

That matters because RequestConnection assigns roles per initiated pair: the joiner gets one connect packet with polite: true, and the existing peer gets one with polite: false. If both peers initiate, each side can receive conflicting duplicate connect packets for the same peer ID.

The fix is that JoinLobby now returns the peer list captured under the row lock, before appending the new peer. Then HandleJoinPacket only calls RequestConnection for that snapshot.
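
To make the difference concrete, here is a minimal in-memory sketch of the idea. It is not the server code (that is Go with a database row lock); `Lobby`, `requestConnections` and `initiations` are hypothetical names used only for illustration.

```typescript
// Hypothetical in-memory model of the lobby. The real server relies on a
// database row lock (SELECT ... FOR UPDATE), not on this class.
type PeerId = string

class Lobby {
  private readonly peers: PeerId[] = []

  // Returns the peers that were present BEFORE this join, captured in the
  // same critical section that appends the joiner.
  join (peer: PeerId): PeerId[] {
    const existing = [...this.peers] // snapshot before appending
    this.peers.push(peer)
    return existing
  }

  // A later read may already include peers that joined afterwards.
  members (): PeerId[] {
    return [...this.peers]
  }
}

// One entry per initiated pair: the joiner is polite, the existing peer is not.
interface ConnectRequest { joiner: PeerId, existing: PeerId }

function requestConnections (joiner: PeerId, targets: PeerId[]): ConnectRequest[] {
  return targets.filter(p => p !== joiner).map(existing => ({ joiner, existing }))
}

// Counts how many times the unordered pair (a, b) was initiated.
function initiations (reqs: ConnectRequest[], a: PeerId, b: PeerId): number {
  return reqs.filter(r =>
    (r.joiner === a && r.existing === b) || (r.joiner === b && r.existing === a)
  ).length
}

const lobby = new Lobby()
lobby.join('A')

// B and C join back to back, before either handler reads the lobby again.
const snapshotB = lobby.join('B') // ['A']
const snapshotC = lobby.join('C') // ['A', 'B']

// Old behaviour: both handlers re-read the full membership after joining.
const racy = [
  ...requestConnections('B', lobby.members()),
  ...requestConnections('C', lobby.members())
]

// New behaviour: each handler only uses its own pre-join snapshot.
const fixed = [
  ...requestConnections('B', snapshotB),
  ...requestConnections('C', snapshotC)
]

console.log(initiations(racy, 'B', 'C')) // 2: both sides initiate
console.log(initiations(fixed, 'B', 'C')) // 1: only C, the later joiner, initiates
```

With the snapshot, every pair is initiated exactly once, by whichever peer joined later.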

erikdubbelboer · 2026-06-06T09:39:25Z

Issue 2: Client signaling handlers could interleave and corrupt WebRTC setup

Before the change, every websocket message started handleSignalingMessage(...) independently. The websocket delivers messages in order, but the handler is async, so message handling could overlap.
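
A small sketch of that overlap, assuming a hypothetical `handle` function whose connect branch awaits something slow. It is only meant to show why in-order delivery does not imply in-order handling, and how chaining handlers on a promise restores the order; it is not the netlib handler.

```typescript
// Hypothetical stand-ins: `handle` models an async signaling handler whose
// "connect" branch waits on something slow, such as a credential lookup.
interface Msg { type: 'connect' | 'candidate', id: string }

const sleep = async (ms: number): Promise<void> =>
  await new Promise<void>(resolve => setTimeout(resolve, ms))

async function handle (msg: Msg, log: string[]): Promise<void> {
  log.push(`start ${msg.type}`)
  await sleep(msg.type === 'connect' ? 20 : 0)
  log.push(`end ${msg.type}`)
}

const messages: Msg[] = [
  { type: 'connect', id: 'B' },
  { type: 'candidate', id: 'B' }
]

// Unserialized: every message starts its own handler immediately.
async function unserialized (): Promise<string[]> {
  const log: string[] = []
  await Promise.all(messages.map(async m => await handle(m, log)))
  return log
}

// Serialized: each handler is chained behind the previous one.
async function serialized (): Promise<string[]> {
  const log: string[] = []
  let queue: Promise<void> = Promise.resolve()
  for (const m of messages) {
    queue = queue.then(async () => await handle(m, log))
  }
  await queue
  return log
}

void unserialized().then(log => console.log(log.join(', ')))
// start connect, start candidate, end candidate, end connect
void serialized().then(log => console.log(log.join(', ')))
// start connect, end connect, start candidate, end candidate
```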

A bad sequence looked like this:

1. Client receives a connect packet for peer B.
2. connect handling calls _addPeer.
3. _addPeer waits for TURN credentials before constructing the Peer.
4. While that await is still pending, more websocket messages arrive.
5. A description or candidate message for B starts handling concurrently.
6. Depending on timing, the peer might not exist yet, the remote description might not be set yet, or two concurrent connect handlers might both try to create a peer for the same ID.

That led to a few concrete failure modes:

- ICE candidates could be applied before setRemoteDescription finished. Browsers reject that with a remote-description-null or invalid-state error.
- Duplicate connect packets could race through _addPeer; both would see no peer before awaiting credentials, then both could create Peer instances. The later one overwrote the map entry, while the older one could still have event handlers and pending WebRTC work.
- Delayed answers could be applied after the connection was no longer in have-local-offer. setRemoteDescription(answer) is only valid while there is a matching local offer, so stale answers could throw.
- Once the message queue was added, a new self-dependency appeared: connect waits for _addPeer, _addPeer waits for TURN credentials, and the credential response comes through the same websocket queue. Without the immediate credential-response bypass, the response would sit behind the connect handler that is waiting for it.
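
For the candidate and stale-answer cases, a guard along these lines is one way to picture it. This is a sketch, not the Peer class from the diff: `ConnectionLike` is a hypothetical subset of RTCPeerConnection so the example runs outside a browser, and `PeerSketch` and `FakeConnection` are illustration-only names. The field uses the `pendingRemoteCandidates` name suggested in review.

```typescript
// Hypothetical, minimal subset of RTCPeerConnection. Names mirror the WebRTC
// API, but this is an assumption made so the sketch runs without a browser.
type SignalingState = 'stable' | 'have-local-offer' | 'have-remote-offer'
interface Description { type: 'offer' | 'answer', sdp: string }
interface Candidate { candidate: string }

interface ConnectionLike {
  signalingState: SignalingState
  remoteDescription: Description | null
  setRemoteDescription: (d: Description) => Promise<void>
  addIceCandidate: (c: Candidate) => Promise<void>
}

class PeerSketch {
  private readonly conn: ConnectionLike
  private readonly pendingRemoteCandidates: Candidate[] = []

  constructor (conn: ConnectionLike) {
    this.conn = conn
  }

  async onCandidate (c: Candidate): Promise<void> {
    if (this.conn.remoteDescription === null) {
      this.pendingRemoteCandidates.push(c) // too early: hold it back
      return
    }
    await this.conn.addIceCandidate(c)
  }

  // Returns false when the description was ignored as stale.
  async onDescription (d: Description): Promise<boolean> {
    if (d.type === 'answer' && this.conn.signalingState !== 'have-local-offer') {
      return false // no matching local offer: a late or duplicate answer
    }
    await this.conn.setRemoteDescription(d)
    // Drain candidates that arrived before the remote description.
    for (const c of this.pendingRemoteCandidates.splice(0)) {
      await this.conn.addIceCandidate(c)
    }
    return true
  }
}

// Fake connection that records what was applied. Demonstration only.
class FakeConnection implements ConnectionLike {
  signalingState: SignalingState = 'have-local-offer'
  remoteDescription: Description | null = null
  readonly applied: string[] = []

  async setRemoteDescription (d: Description): Promise<void> {
    this.remoteDescription = d
    this.signalingState = d.type === 'answer' ? 'stable' : 'have-remote-offer'
    this.applied.push(`description:${d.type}`)
  }

  async addIceCandidate (c: Candidate): Promise<void> {
    if (this.remoteDescription === null) throw new Error('remote description is null')
    this.applied.push(`candidate:${c.candidate}`)
  }
}

async function demoPeer (): Promise<string[]> {
  const conn = new FakeConnection()
  const peer = new PeerSketch(conn)
  await peer.onCandidate({ candidate: 'c1' }) // queued
  await peer.onDescription({ type: 'answer', sdp: '' }) // applied, c1 drained
  await peer.onDescription({ type: 'answer', sdp: '' }) // stale: ignored
  await peer.onCandidate({ candidate: 'c2' }) // applied directly
  return conn.applied
}
```

In this run `demoPeer()` resolves to `['description:answer', 'candidate:c1', 'candidate:c2']`: the early candidate is held until the answer lands, and the second answer is dropped because the fake connection is already `stable`.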

The fix addresses those pieces together:

- Incoming signaling messages are serialized through messageQueue, so a description handler can finish before the next candidate handler runs.
- ICE candidates are still queued inside Peer if they arrive before remoteDescription is set, then drained after setRemoteDescription.
- _addPeer checks this.peers.has(id) before and after credential lookup, so a peer created while another _addPeer was awaiting credentials prevents duplicate construction.
- Stale answers are ignored unless the connection is actually in have-local-offer.
- Credential request responses resolve immediately instead of waiting for the signaling queue, because they are needed to unblock queued connect handling.
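
The interaction between the queue, the credential bypass and the double check in _addPeer can be sketched roughly as follows. Only the names `messageQueue`, `_addPeer` and `peers` come from the description above; the packet shapes, `SignalingSketch`, `requestCredentials`, `idle` and `tick` are hypothetical.

```typescript
// Hypothetical packet shapes; the real protocol is not shown in this thread.
type Packet =
  | { type: 'connect', id: string }
  | { type: 'credentials', rid: number, token: string }

class SignalingSketch {
  readonly peers = new Map<string, { id: string, token: string }>()
  readonly sent: number[] = [] // credential request ids "sent" to the server
  created = 0
  private messageQueue: Promise<void> = Promise.resolve()
  private readonly waiting = new Map<number, (token: string) => void>()
  private nextRid = 0

  // Called for every websocket message, in arrival order.
  onMessage (p: Packet): void {
    if (p.type === 'credentials') {
      // Bypass the queue: a queued connect handler may be awaiting this
      // response, so queueing it behind that handler would never resolve.
      const resolve = this.waiting.get(p.rid)
      this.waiting.delete(p.rid)
      resolve?.(p.token)
      return
    }
    const id = p.id
    this.messageQueue = this.messageQueue.then(async () => await this._addPeer(id))
  }

  // Resolves once every handler queued so far has finished.
  async idle (): Promise<void> {
    await this.messageQueue
  }

  private async requestCredentials (): Promise<string> {
    const rid = this.nextRid++
    return await new Promise<string>(resolve => {
      this.waiting.set(rid, resolve)
      this.sent.push(rid)
    })
  }

  async _addPeer (id: string): Promise<void> {
    if (this.peers.has(id)) return // already known
    const token = await this.requestCredentials()
    if (this.peers.has(id)) return // created while this call was waiting
    this.peers.set(id, { id, token })
    this.created++
  }
}

const tick = async (): Promise<void> =>
  await new Promise<void>(resolve => setTimeout(resolve, 0))

async function demoQueue (): Promise<{ created: number, requests: number }> {
  const s = new SignalingSketch()
  s.onMessage({ type: 'connect', id: 'B' })
  s.onMessage({ type: 'connect', id: 'B' }) // duplicate connect packet
  await tick() // let the first queued handler send its credential request
  s.onMessage({ type: 'credentials', rid: 0, token: 'turn-token' })
  await s.idle()
  return { created: s.created, requests: s.sent.length }
}
```

In this sketch `demoQueue()` resolves to `{ created: 1, requests: 1 }`: the duplicate connect runs only after the first has finished and sees the existing peer. The second `has` check does not fire on the queued path here; it matters when `_addPeer` is entered concurrently from outside the queue.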

koenbollen · 2026-06-06T10:47:09Z

As discussed over voice: let's try to find a metric in the analytics that can verify this change is a positive one.

erikdubbelboer wants to merge 1 commit into main from lobby-join-race-conditions
