(fix): Gateway connection manager: set a deadline to close stalled handshakes#22506
(fix): Gateway connection manager: set a deadline to close stalled handshakes#22506justinkaseman wants to merge 3 commits into
Conversation
|
👋 justinkaseman, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
I see you updated files related to
|
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: MEDIUM — Changes alter connection liveness behavior (read deadlines driven by keepalive pongs), which can impact node connectivity if misconfigured or if deadline updates fail silently.
This PR adds read deadlines to Gateway node WebSocket connections so half-open/stalled connections are detected and closed, allowing nodes to reconnect.
Changes:
- Set an initial
ReadDeadlineafter handshake finalization and refresh it on each keepalive pong. - Add integration-style tests using an
httptestWebSocket pair to verify idle connections are closed and healthy connections remain alive when pongs are received.
Scrupulous human review recommended for:
FinalizeHandshakeread-deadline + pong handler behavior (error handling and interaction with gorilla/websocket read loop).- Test timing strategy (real-time sleeps / eventual assertions) to avoid flakiness and long suite runtime.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/services/gateway/connectionmanager.go | Adds read deadline initialization and resets it on pong reception to close half-open connections. |
| core/services/gateway/connectionmanager_test.go | Adds WebSocket-pair helpers and tests covering idle-timeout closure and “alive with pongs” behavior. |
Comments suppressed due to low confidence (1)
core/services/gateway/connectionmanager.go:284
- Inside the pong handler,
SetReadDeadline’s return value is ignored. Since the handler can return an error to abort the read loop, it should propagate anySetReadDeadlineerror so the connection is closed deterministically when deadline updates fail.
conn.SetPongHandler(func(data string) error {
if pongWait > 0 {
conn.SetReadDeadline(time.Now().Add(pongWait))
}
| if conn != nil { | ||
| // Set a read deadline so that half-open connections are detected and closed. | ||
| // Use 3x the heartbeat interval to tolerate up to 2 missed pongs. | ||
| pongWait := time.Duration(m.config.HeartbeatIntervalSec) * time.Second * 3 |
There was a problem hiding this comment.
what range of values is HeartbeatIntervalSec allowed to take?
There was a problem hiding this comment.
There is no range.
| } | ||
| conn.SetPongHandler(func(data string) error { | ||
| if pongWait > 0 { | ||
| conn.SetReadDeadline(time.Now().Add(pongWait)) |
There was a problem hiding this comment.
The server also initiates pings for keepalive every heartbeat interval. One potential issue with setting so strict a deadline is that during periods of poor network connectivity, we will loop closing the conn, re-initiating the handshake, eventually connecting, then closing the conn. We could consider increasing the deadline to 10s, which i've generally found to be a safe wait time even in poor network conditions to signal that the connection is pathological.
To understand better - you're calling the symptom here "stalled handshakes", but the logic to handle it better is executed only after FinalizeHandshake returns. After FinalizeHandshake exits (specifically Reset) the handshake has been completed. I'll assume that what we're observing is a handshake completes, but there's no pong received from the client after.
If we want to force a ping check right after FinalizeHandshake completes, we could consider running something like:
pingCtx := context.Background()
errPing := attempt.nodeState.conn.Write(pingCtx, websocket.PingMessage, []byte{})
m.gMetrics.RecordKeepalivePingsSent(pingCtx, attempt.nodeAddress, attempt.nodeState.name, errPing == nil)
if errPing != nil {
m.lggr.Debugw("unable to send immediate post-handshake ping to node",
"nodeAddress", attempt.nodeAddress, "name", attempt.nodeState.name, "err", errPing)
}
I also thought about trying to handle a full ping/pong cycle within FinalizeHandshake. The difficulty is that pong handlers are aysnc, so it would require some stateful logic within a pong handler to check for the first pong, flip an atomic flag its been received, and seen confirmation over a channel to FinalizeHandshake. Given the library doesn't expose a straightforward way to do it, my intuition is that this approach is barking up the wrong tree and we should treat all post-handshake connectivity the same, even if its occured right after Reset is called.
There was a problem hiding this comment.
Currently the HeartbeatIntervalSec is set to 20 sec, so 3 times 20 seconds = 1 min.
I'm just going to make it PongTimeoutSec so we don't have to reason in HeartBeatIntervals as our unit of measure.
I also added the FinalizeHandshake ping check. Good suggestion.
| return network.ErrChallengeInvalidSignature | ||
| } | ||
| if conn != nil { | ||
| // Set a read deadline so that half-open connections are detected and closed. |
There was a problem hiding this comment.
@justinkaseman This feels worthy of a feature flag to me tbh (and maybe a setting for the pongTimeout too)
I think releasing this may not be so safe given that reconnects will presumably cause some errors until we've re-established
There was a problem hiding this comment.
Fair point. Adding a feature flag via pongTImeout config. If 0 (default) it will be the current old behavior. So now we can control it via a job spec.
Also noted to test extensively on staging !
|




Set a read deadline so that half-open connections are detected and closed.
The deadline is reset every time a keepalive pong is received.
If no pongarrives within the deadline, ReadMessage() in the readPump returns an error, the connection is closed, and the node can reconnect.