Skip to content

(fix): Gateway connection manager: set a deadline to close stalled handshakes#22506

Open
justinkaseman wants to merge 3 commits into
developfrom
fix/gateway-connection-ws
Open

(fix): Gateway connection manager: set a deadline to close stalled handshakes#22506
justinkaseman wants to merge 3 commits into
developfrom
fix/gateway-connection-ws

Conversation

@justinkaseman
Copy link
Copy Markdown
Contributor

Set a read deadline so that half-open connections are detected and closed.
The deadline is reset every time a keepalive pong is received.
If no pongarrives within the deadline, ReadMessage() in the readPump returns an error, the connection is closed, and the node can reconnect.

Copilot AI review requested due to automatic review settings May 16, 2026 21:14
@justinkaseman justinkaseman requested review from a team as code owners May 16, 2026 21:14
@github-actions
Copy link
Copy Markdown
Contributor

👋 justinkaseman, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 16, 2026

✅ No conflicts with other open PRs targeting develop

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Risk Rating: MEDIUM — Changes alter connection liveness behavior (read deadlines driven by keepalive pongs), which can impact node connectivity if misconfigured or if deadline updates fail silently.

This PR adds read deadlines to Gateway node WebSocket connections so half-open/stalled connections are detected and closed, allowing nodes to reconnect.

Changes:

  • Set an initial ReadDeadline after handshake finalization and refresh it on each keepalive pong.
  • Add integration-style tests using an httptest WebSocket pair to verify idle connections are closed and healthy connections remain alive when pongs are received.

Scrupulous human review recommended for:

  • FinalizeHandshake read-deadline + pong handler behavior (error handling and interaction with gorilla/websocket read loop).
  • Test timing strategy (real-time sleeps / eventual assertions) to avoid flakiness and long suite runtime.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
core/services/gateway/connectionmanager.go Adds read deadline initialization and resets it on pong reception to close half-open connections.
core/services/gateway/connectionmanager_test.go Adds WebSocket-pair helpers and tests covering idle-timeout closure and “alive with pongs” behavior.
Comments suppressed due to low confidence (1)

core/services/gateway/connectionmanager.go:284

  • Inside the pong handler, SetReadDeadline’s return value is ignored. Since the handler can return an error to abort the read loop, it should propagate any SetReadDeadline error so the connection is closed deterministically when deadline updates fail.
		conn.SetPongHandler(func(data string) error {
			if pongWait > 0 {
				conn.SetReadDeadline(time.Now().Add(pongWait))
			}

Comment thread core/services/gateway/connectionmanager.go Outdated
Comment thread core/services/gateway/connectionmanager_test.go Outdated
@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented May 16, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

View Full Report ↗︎Docs

if conn != nil {
// Set a read deadline so that half-open connections are detected and closed.
// Use 3x the heartbeat interval to tolerate up to 2 missed pongs.
pongWait := time.Duration(m.config.HeartbeatIntervalSec) * time.Second * 3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what range of values is HeartbeatIntervalSec allowed to take?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no range.

}
conn.SetPongHandler(func(data string) error {
if pongWait > 0 {
conn.SetReadDeadline(time.Now().Add(pongWait))
Copy link
Copy Markdown
Contributor

@patrickhuie19 patrickhuie19 May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server also initiates pings for keepalive every heartbeat interval. One potential issue with setting so strict a deadline is that during periods of poor network connectivity, we will loop closing the conn, re-initiating the handshake, eventually connecting, then closing the conn. We could consider increasing the deadline to 10s, which i've generally found to be a safe wait time even in poor network conditions to signal that the connection is pathological.

To understand better - you're calling the symptom here "stalled handshakes", but the logic to handle it better is executed only after FinalizeHandshake returns. After FinalizeHandshake exits (specifically Reset) the handshake has been completed. I'll assume that what we're observing is a handshake completes, but there's no pong received from the client after.

If we want to force a ping check right after FinalizeHandshake completes, we could consider running something like:

		pingCtx := context.Background()
		errPing := attempt.nodeState.conn.Write(pingCtx, websocket.PingMessage, []byte{})
		m.gMetrics.RecordKeepalivePingsSent(pingCtx, attempt.nodeAddress, attempt.nodeState.name, errPing == nil)
		if errPing != nil {
			m.lggr.Debugw("unable to send immediate post-handshake ping to node",
				"nodeAddress", attempt.nodeAddress, "name", attempt.nodeState.name, "err", errPing)
		}

I also thought about trying to handle a full ping/pong cycle within FinalizeHandshake. The difficulty is that pong handlers are aysnc, so it would require some stateful logic within a pong handler to check for the first pong, flip an atomic flag its been received, and seen confirmation over a channel to FinalizeHandshake. Given the library doesn't expose a straightforward way to do it, my intuition is that this approach is barking up the wrong tree and we should treat all post-handshake connectivity the same, even if its occured right after Reset is called.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the HeartbeatIntervalSec is set to 20 sec, so 3 times 20 seconds = 1 min.
I'm just going to make it PongTimeoutSec so we don't have to reason in HeartBeatIntervals as our unit of measure.

I also added the FinalizeHandshake ping check. Good suggestion.

return network.ErrChallengeInvalidSignature
}
if conn != nil {
// Set a read deadline so that half-open connections are detected and closed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinkaseman This feels worthy of a feature flag to me tbh (and maybe a setting for the pongTimeout too)

I think releasing this may not be so safe given that reconnects will presumably cause some errors until we've re-established

Copy link
Copy Markdown
Contributor Author

@justinkaseman justinkaseman May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Adding a feature flag via pongTImeout config. If 0 (default) it will be the current old behavior. So now we can control it via a job spec.

Also noted to test extensively on staging !

@cl-sonarqube-production
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants