Skip to content

Fix idle connection leak in prober#1139

Merged
knative-prow[bot] merged 1 commit into
knative:mainfrom
dsimansk:pr/fix-prober-memory-leak
May 28, 2026
Merged

Fix idle connection leak in prober#1139
knative-prow[bot] merged 1 commit into
knative:mainfrom
dsimansk:pr/fix-prober-memory-leak

Conversation

@dsimansk
Copy link
Copy Markdown
Contributor

Changes

  • Fix memory leak in prober with reusable transport

Per soak tests run on our Serverless distro with serving + net-kourier @maschmid observed a liner memory increase over time that may lead to OOM. It was traced down to idle connection pool steadily increasing.

In a serving-focused soak test with re-creating ksvcs, keeping around stable ~100 ksvcs on the cluster, net-kourier-controller with increased memory limit restarted 2 times on OOM during a 12h soak run.

/cc @maschmid @linkvt @dprotaso

@knative-prow knative-prow Bot requested review from dprotaso and linkvt May 21, 2026 11:59
@knative-prow knative-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@knative-prow knative-prow Bot requested a review from maschmid May 21, 2026 11:59
@knative-prow knative-prow Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.34%. Comparing base (3cf3413) to head (61c8882).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1139      +/-   ##
==========================================
+ Coverage   93.33%   93.34%   +0.01%     
==========================================
  Files          35       35              
  Lines        1005     1007       +2     
==========================================
+ Hits          938      940       +2     
  Misses         53       53              
  Partials       14       14              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/status/status.go Fixed
Copy link
Copy Markdown
Member

@linkvt linkvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, minor comment about the fallback when context has no value set.

withProbeAddr could in theory be inlined but not really important.

Comment thread pkg/status/status.go Fixed
Comment thread pkg/status/status.go Outdated
// because the HTTP client validates that the hostname (not the Host header) matches the server
// TLS certificate Common Name or Alternative Names. Therefore, http.Request.URL is set to the
// hostname and it is substituted here with the target IP.
if target, ok := ctx.Value(probeAddrKey{}).(string); ok {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we want to fallback here, panicking might be a good approach as we expect the probeaddr always to be set?

@dprotaso
Copy link
Copy Markdown
Member

Any reason why we can't just enable DisableKeepAlives on the cloned transport ?

@dsimansk
Copy link
Copy Markdown
Contributor Author

Any reason why we can't just enable DisableKeepAlives on the cloned transport ?

The soak test is running with the patch below. So far it more then halved the memory footprint.
Then hopefully comparing it to reused transport patch. Full cycle takes ~12h to complete. For sure we can start with as smaller DisableKeepAlives change. It'd definitely help.

 vendor/knative.dev/networking/pkg/status/status.go | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/vendor/knative.dev/networking/pkg/status/status.go b/vendor/knative.dev/networking/pkg/status/status.go
index 4d7ebe515..5bfce6727 100644
--- a/vendor/knative.dev/networking/pkg/status/status.go
+++ b/vendor/knative.dev/networking/pkg/status/status.go
@@ -406,6 +406,7 @@ func (m *Prober) processWorkItem() bool {
 		prober.WithHeader(header.ProbeKey, header.ProbeValue),
 		prober.WithHeader(header.HashKey, header.HashValueOverride),
 		m.probeVerifier(item))
+	transport.CloseIdleConnections()
 
 	// In case of cancellation, drop the work item
 	select {
@@ -421,6 +422,7 @@ func (m *Prober) processWorkItem() bool {
 		item.logger.Errorf("Probing of %s failed, IP: %s:%s, ready: %t, error: %v (depth: %d)",
 			item.url, item.podIP, item.podPort, ok, err, m.workQueue.Len())
 	} else {
+		m.workQueue.Forget(obj)
 		m.onProbingSuccess(item.ingressState, item.podState)
 	}
 	return true

@dprotaso
Copy link
Copy Markdown
Member

Curious - how does the soak test work? Is it creating a knative service or something every minute etc?

@dsimansk
Copy link
Copy Markdown
Contributor Author

Curious - how does the soak test work? Is it creating a knative service or something every minute etc?

In a nutshell it should be maintaining a pool of ~100 ksvc in this case over long period of time. On this ksvc set performing periodic re-create with same name, add new revision, send request to scale from zero etc.

@dprotaso
Copy link
Copy Markdown
Member

In a nutshell it should be maintaining a pool of ~100 ksvc in this case over long period of time. On this ksvc set performing periodic re-create with same name, add new revision, send request to scale from zero etc.

Do you have a script?

@dsimansk dsimansk force-pushed the pr/fix-prober-memory-leak branch from 3ab8730 to 61c8882 Compare May 27, 2026 09:27
@dsimansk dsimansk changed the title WIP: Fix memory leak in prober with reusable transport Fix idle connection leak in prober May 27, 2026
@knative-prow knative-prow Bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 27, 2026
@dsimansk
Copy link
Copy Markdown
Contributor Author

dsimansk commented May 27, 2026

In a nutshell it should be maintaining a pool of ~100 ksvc in this case over long period of time. On this ksvc set performing periodic re-create with same name, add new revision, send request to scale from zero etc.

Do you have a script?

AFAIK there's a golang implementation of test framework that manages the soak loop scenario and collects evidence. @maschmid would need to fill in more gaps.

@dprotaso I've updated the PR to do a minimal change with .DisableKeepAlives property and .Forget call.

@dprotaso
Copy link
Copy Markdown
Member

/lgtm
/approve

thanks!

@knative-prow knative-prow Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026
@knative-prow
Copy link
Copy Markdown

knative-prow Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, dsimansk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow Bot merged commit 08ed7ab into knative:main May 28, 2026
27 checks passed
@dsimansk
Copy link
Copy Markdown
Contributor Author

/cherry-pick release-1.21
/cherry-pick release-1.22

@knative-prow-robot
Copy link
Copy Markdown

@dsimansk: new pull request created: #1141

Details

In response to this:

/cherry-pick release-1.21
/cherry-pick release-1.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow-robot
Copy link
Copy Markdown

@dsimansk: new pull request created: #1142

Details

In response to this:

/cherry-pick release-1.21
/cherry-pick release-1.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants