(fix): Gateway connection manager: set a deadline to close stalled handshakes by justinkaseman · Pull Request #22506 · smartcontractkit/chainlink

justinkaseman · 2026-05-16T21:14:53Z

Set a read deadline so that half-open connections are detected and closed.
The deadline is reset every time a keepalive pong is received.
If no pongarrives within the deadline, ReadMessage() in the readPump returns an error, the connection is closed, and the node can reconnect.

…alled handshakes

github-actions · 2026-05-16T21:17:07Z

👋 justinkaseman, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

github-actions · 2026-05-16T21:17:59Z

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

#added For any new functionality added.
#breaking_change For any functionality that requires manual action for the node to boot.
#bugfix For bug fixes.
#changed For any change to the existing functionality.
#db_update For any feature that introduces updates to database schema.
#deprecation_notice For any upcoming deprecation functionality.
#internal For changesets that need to be excluded from the final changelog.
#nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
#removed For any functionality/config that is removed.
#updated For any functionality that is updated.
#wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

github-actions · 2026-05-16T21:18:28Z

✅ No conflicts with other open PRs targeting develop

Copilot

Pull request overview

Risk Rating: MEDIUM — Changes alter connection liveness behavior (read deadlines driven by keepalive pongs), which can impact node connectivity if misconfigured or if deadline updates fail silently.

This PR adds read deadlines to Gateway node WebSocket connections so half-open/stalled connections are detected and closed, allowing nodes to reconnect.

Changes:

Set an initial ReadDeadline after handshake finalization and refresh it on each keepalive pong.
Add integration-style tests using an httptest WebSocket pair to verify idle connections are closed and healthy connections remain alive when pongs are received.

Scrupulous human review recommended for:

FinalizeHandshake read-deadline + pong handler behavior (error handling and interaction with gorilla/websocket read loop).
Test timing strategy (real-time sleeps / eventual assertions) to avoid flakiness and long suite runtime.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
core/services/gateway/connectionmanager.go	Adds read deadline initialization and resets it on pong reception to close half-open connections.
core/services/gateway/connectionmanager_test.go	Adds WebSocket-pair helpers and tests covering idle-timeout closure and “alive with pongs” behavior.

Comments suppressed due to low confidence (1)

core/services/gateway/connectionmanager.go:284

Inside the pong handler, SetReadDeadline’s return value is ignored. Since the handler can return an error to abort the read loop, it should propagate any SetReadDeadline error so the connection is closed deterministically when deadline updates fail.

		conn.SetPongHandler(func(data string) error {
			if pongWait > 0 {
				conn.SetReadDeadline(time.Now().Add(pongWait))
			}

trunk-io · 2026-05-16T21:28:40Z

_{View Full Report ↗︎ ⋅ Docs}

patrickhuie19 · 2026-05-17T19:05:59Z

 	if conn != nil {
+		// Set a read deadline so that half-open connections are detected and closed.
+		// Use 3x the heartbeat interval to tolerate up to 2 missed pongs.
+		pongWait := time.Duration(m.config.HeartbeatIntervalSec) * time.Second * 3


what range of values is HeartbeatIntervalSec allowed to take?

There is no range.

patrickhuie19 · 2026-05-17T19:41:46Z

+		}
 		conn.SetPongHandler(func(data string) error {
+			if pongWait > 0 {
+				conn.SetReadDeadline(time.Now().Add(pongWait))


The server also initiates pings for keepalive every heartbeat interval. One potential issue with setting so strict a deadline is that during periods of poor network connectivity, we will loop closing the conn, re-initiating the handshake, eventually connecting, then closing the conn. We could consider increasing the deadline to 10s, which i've generally found to be a safe wait time even in poor network conditions to signal that the connection is pathological.

To understand better - you're calling the symptom here "stalled handshakes", but the logic to handle it better is executed only after FinalizeHandshake returns. After FinalizeHandshake exits (specifically Reset) the handshake has been completed. I'll assume that what we're observing is a handshake completes, but there's no pong received from the client after.

If we want to force a ping check right after FinalizeHandshake completes, we could consider running something like:

pingCtx := context.Background() errPing := attempt.nodeState.conn.Write(pingCtx, websocket.PingMessage, []byte{}) m.gMetrics.RecordKeepalivePingsSent(pingCtx, attempt.nodeAddress, attempt.nodeState.name, errPing == nil) if errPing != nil { m.lggr.Debugw("unable to send immediate post-handshake ping to node", "nodeAddress", attempt.nodeAddress, "name", attempt.nodeState.name, "err", errPing) }

I also thought about trying to handle a full ping/pong cycle within FinalizeHandshake. The difficulty is that pong handlers are aysnc, so it would require some stateful logic within a pong handler to check for the first pong, flip an atomic flag its been received, and seen confirmation over a channel to FinalizeHandshake. Given the library doesn't expose a straightforward way to do it, my intuition is that this approach is barking up the wrong tree and we should treat all post-handshake connectivity the same, even if its occured right after Reset is called.

Currently the HeartbeatIntervalSec is set to 20 sec, so 3 times 20 seconds = 1 min.
I'm just going to make it PongTimeoutSec so we don't have to reason in HeartBeatIntervals as our unit of measure.

I also added the FinalizeHandshake ping check. Good suggestion.

cedric-cordenier · 2026-05-19T09:03:23Z

 		return network.ErrChallengeInvalidSignature
 	}
 	if conn != nil {
+		// Set a read deadline so that half-open connections are detected and closed.


@justinkaseman This feels worthy of a feature flag to me tbh (and maybe a setting for the pongTimeout too)

I think releasing this may not be so safe given that reconnects will presumably cause some errors until we've re-established

Fair point. Adding a feature flag via pongTImeout config. If 0 (default) it will be the current old behavior. So now we can control it via a job spec.

Also noted to test extensively on staging !

cl-sonarqube-production · 2026-05-19T22:20:37Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

(fix): set a read deadline for Gateway connection manager to close st…

97e6ba4

…alled handshakes

Copilot AI review requested due to automatic review settings May 16, 2026 21:14

justinkaseman requested review from a team as code owners May 16, 2026 21:14

product-security-plaid-production Bot requested review from jmank88 and patrickhuie19 May 16, 2026 21:15

Copilot AI reviewed May 16, 2026

View reviewed changes

Comment thread core/services/gateway/connectionmanager.go Outdated

Comment thread core/services/gateway/connectionmanager_test.go Outdated

patrickhuie19 reviewed May 17, 2026

View reviewed changes

cedric-cordenier reviewed May 19, 2026

View reviewed changes

Changes from review

e23d1ff

justinkaseman requested review from cedric-cordenier and patrickhuie19 May 19, 2026 22:07

Lint

97f1c71

justinkaseman requested review from a team as code owners May 19, 2026 22:09

product-security-plaid-production Bot requested a review from ChrisAmora May 19, 2026 22:09

Conversation

justinkaseman commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

trunk-io Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

patrickhuie19 May 17, 2026

Choose a reason for hiding this comment

Uh oh!

justinkaseman May 19, 2026

Choose a reason for hiding this comment

Uh oh!

patrickhuie19 May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinkaseman May 19, 2026

Choose a reason for hiding this comment

Uh oh!

cedric-cordenier May 19, 2026

Choose a reason for hiding this comment

Uh oh!

justinkaseman May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cl-sonarqube-production Bot commented May 19, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented May 16, 2026 •

edited

Loading

trunk-io Bot commented May 16, 2026 •

edited

Loading

patrickhuie19 May 17, 2026 •

edited

Loading

justinkaseman May 19, 2026 •

edited

Loading