
TLS propagation test #30801

Open
gangwgr wants to merge 1 commit into openshift:main from gangwgr:tls-ci

Conversation

@gangwgr
Contributor

@gangwgr gangwgr commented Feb 19, 2026

Test run - https://privatebin.corp.redhat.com/?b0f760b57d709f07#F2u5YDsSTXrxCVWyKze9Jj4bP424HE2ZfhR6HhEFRJ2Y

Summary by CodeRabbit

  • Tests

    • Added a new TLS verification test suite validating TLS versions and cipher suites across services and namespaces.
    • Verifies operator-observed config and TLS-related environment variables propagate to deployments.
    • Adds wire-level TLS handshake tests and coverage for TLS profile transitions, including validating system stability after profile changes.
  • Chores

    • Registered the new TLS extended tests and added ownership metadata for review/approval.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci
Contributor

openshift-ci bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gangwgr
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gangwgr
Contributor Author

gangwgr commented Feb 19, 2026

/pipeline required

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Contributor Author

gangwgr commented Feb 23, 2026

/retest

@gangwgr
Contributor Author

gangwgr commented Feb 23, 2026

/test e2e-metal-ipi-ovn-ipv6

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr gangwgr changed the title from "[WIP] TLS propagation test" to "TLS propagation test" Feb 23, 2026
@gangwgr
Contributor Author

gangwgr commented Feb 23, 2026

/retest-required

@openshift-trt

openshift-trt bot commented Feb 24, 2026

Job Failure Risk Analysis for sha: 21e1845

Job Name | Failure Risk
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 | High
Failed test: [sig-cluster-lifecycle][OCPFeatureGate:ImageStreamImportMode] ClusterVersion API desired architecture should be valid when architecture is set in release payload metadata [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
This test has passed 98.31% of 3130 runs on release 4.22 [Overall] in the last week.

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Contributor Author

gangwgr commented Feb 25, 2026

/retest-required

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Walkthrough

Adds a new end-to-end TLS ObservedConfig test suite: registers the suite, imports the TLS extended tests, adds OWNERS, and implements tests validating operator observedConfig, deployment TLS env-vars, and wire-level TLS behavior including a TLS profile change/restore flow.

Changes

Cohort / File(s) Summary
Test Suite Registration
pkg/testsuites/standard_suites.go, test/extended/include.go
Registers new openshift/tls-observed-config suite in staticSuites and adds a blank import for the TLS extended tests.
TLS Verification Tests
test/extended/tls/tls_observed_config.go
New Ginkgo test implementing per-target definitions and tests: operator observedConfig checks, deployment TLS env-var propagation checks, wire-level TLS handshakes via port-forward, and a config-change test that switches APIServer TLS profile to Modern then restores it with rollout/wait logic across ClusterOperators.
Package Configuration
test/extended/tls/OWNERS
Adds OWNERS file specifying reviewers and approvers for the TLS test package.
Module Manifest
go.mod
Minor change (single-line update referenced in raw summary).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant DynClient as Dynamic Client
    participant ApiServer as Cluster APIServer
    participant Deployment as Target Deployment
    participant ClusterOp as ClusterOperator

    Test->>DynClient: Fetch operator config (observedConfig)
    DynClient-->>Test: observedConfig (minTLSVersion, cipherSuites)
    Test->>Deployment: Inspect container env vars
    Deployment-->>Test: Env var values
    Test->>Test: Compare env vars vs observedConfig/profile
    Test->>Deployment: Port-forward to service and perform TLS handshake checks
    Deployment-->>Test: TLS handshake results
    Test->>ApiServer: Update TLS profile (Modern) [config-change test]
    ApiServer-->>Test: Profile updated
    Test->>ClusterOp: Wait for related ClusterOperators to stabilize
    ClusterOp-->>Test: Stable
    Test->>ApiServer: Restore original TLS profile
    ApiServer-->>Test: Profile restored
    Test->>ClusterOp: Wait for stabilization
    ClusterOp-->>Test: Stable
sequenceDiagram
    participant Test as Test Suite
    participant Validator as Validator
    participant RolloutMgr as Rollout Watcher
    participant ClusterOp as ClusterOperators

    Test->>Validator: Trigger per-target validations (observedConfig, env, wire)
    Validator-->>Test: Validation results
    Test->>RolloutMgr: Initiate rollout waits after profile change
    RolloutMgr->>ClusterOp: Poll cluster operator statuses
    ClusterOp-->>RolloutMgr: Report rollout progress/completion
    RolloutMgr-->>Test: All relevant operators stabilized

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Test Structure And Quality | ⚠️ Warning | Test functions receive untimed context.Background() and make cluster API calls without timeout protection. | Add timeout contexts to testObservedConfig, testDeploymentTLSEnvVars, and testWireLevelTLS before cluster API operations.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'TLS propagation test' directly aligns with the main changes: adding a comprehensive TLS propagation test suite that validates TLS environment variables, ObservedConfig settings, and wire-level TLS behavior across OpenShift services.
Docstring Coverage | ✅ Passed | Docstring coverage is 91.67%, which is sufficient; the required threshold is 80.00%.
Stable And Deterministic Test Names | ✅ Passed | Test names are stable and deterministic, generated from hardcoded static values in the targets list that do not change between runs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions. The command was terminated due to this error.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/tls/tls_observed_config.go`:
- Around line 702-712: The TLS probes use tls.Dial (in the block handling
host/shouldWork and host/shouldNotWork) which can hang; replace calls to
tls.Dial with tls.DialWithDialer using a net.Dialer that sets a reasonable
Timeout (e.g., a few seconds) so both the successful probe and the
rejected-version probe use a dialer with explicit handshake timeout, preserve
the same error handling around conn, ConnectionState(), and conn.Close(), and
ensure the dialer timeout variable is reused for both DialWithDialer invocations
to avoid duplicated magic values.
- Around line 507-510: The namespace lookup currently treats any error as "not
found" and skips the test; change the error handling after
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}) to import k8s.io/apimachinery/pkg/api/errors as kapierrors
and only call g.Skip when kapierrors.IsNotFound(err) is true, otherwise fail the
test (e.g., call the test failure helper you use in this file) so
auth/transport/transient errors are not masked; apply the same change to the
second occurrence of the same oc.AdminKubeClient().CoreV1().Namespaces().Get
call in this file.
- Around line 654-676: Replace the unreliable stdout check after starting
port-forward (where e2e.StartCmdAndStreamOutput is used and
readPartialFrom(stdout, 1024) is called) with an active TCP readiness probe:
repeatedly attempt to open a TCP connection to 127.0.0.1:localPort until success
or a short deadline elapses, then proceed to call toExecute(localPort). Use the
existing ctx/cancel or a separate short timeout for the probe, keep defers for
stdout/stderr and TryKill(cmd), and ensure probe retries/backoff are short so
the test fails fast if the tunnel never becomes reachable.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between a631c60 and b8f1f94.

📒 Files selected for processing (4)
  • pkg/testsuites/standard_suites.go
  • test/extended/include.go
  • test/extended/tls/OWNERS
  • test/extended/tls/tls_observed_config.go

Comment on lines +507 to +510
_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
if err != nil {
	g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file structure and locate the file
find . -name "tls_observed_config.go" -type f

Repository: openshift/origin

Length of output: 102


🏁 Script executed:

# Get the file size to determine how to read it
wc -l test/extended/tls/tls_observed_config.go

Repository: openshift/origin

Length of output: 104


🏁 Script executed:

# Read the specific lines around 507-510
sed -n '500,515p' test/extended/tls/tls_observed_config.go

Repository: openshift/origin

Length of output: 869


🏁 Script executed:

# Read the specific lines around 580-583
sed -n '575,590p' test/extended/tls/tls_observed_config.go

Repository: openshift/origin

Length of output: 902


🏁 Script executed:

# Check the imports section of the file
head -50 test/extended/tls/tls_observed_config.go

Repository: openshift/origin

Length of output: 1905


🏁 Script executed:

# Search for all instances of namespace Get() calls to find similar patterns
rg 'Namespaces\(\)\.Get' test/extended/tls/tls_observed_config.go -A 3

Repository: openshift/origin

Length of output: 459


🏁 Script executed:

# Search for kapierrors usage in the codebase
rg "kapierrors" test/extended/ -A 2 | head -30

Repository: openshift/origin

Length of output: 2263


🏁 Script executed:

# Search for IsNotFound usage to see the common pattern
rg "IsNotFound" test/extended/ -B 1 -A 1 | head -40

Repository: openshift/origin

Length of output: 3028


🏁 Script executed:

# Check what error packages are imported in similar test files
rg 'apimachinery/pkg/api/errors' test/extended/ | head -10

Repository: openshift/origin

Length of output: 999


Skip only on NotFound errors; fail on other namespace lookup failures.

Current logic treats any API error (including auth failures, transient errors, etc.) as "namespace doesn't exist", which masks real failures and can cause false-green test runs. Restrict the skip to NotFound errors only, and fail on other errors.

This issue appears at both lines 507-510 and 580-583. Add the kapierrors import and use kapierrors.IsNotFound(err) to distinguish between legitimate NotFound scenarios and actual failures:

Fix pattern
 import (
 	"context"
 	"crypto/tls"
 	"encoding/json"
 	"fmt"
 	"io"
 	"math/rand"
 	"os/exec"
 	"strings"
 	"time"

 	g "github.com/onsi/ginkgo/v2"
 	o "github.com/onsi/gomega"

 	corev1 "k8s.io/api/core/v1"
+	kapierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/util/wait"
 	"k8s.io/client-go/util/retry"

 	_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
-	if err != nil {
+	if kapierrors.IsNotFound(err) {
 		g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
 	}
+	o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))

Apply the same change at line 580-583.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
	_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
	if err != nil {
		g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
	}

After:
	_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
	if kapierrors.IsNotFound(err) {
		g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
	}
	o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 507 - 510, The
namespace lookup currently treats any error as "not found" and skips the test;
change the error handling after
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}) to import k8s.io/apimachinery/pkg/api/errors as kapierrors
and only call g.Skip when kapierrors.IsNotFound(err) is true, otherwise fail the
test (e.g., call the test failure helper you use in this file) so
auth/transport/transient errors are not masked; apply the same change to the
second occurrence of the same oc.AdminKubeClient().CoreV1().Namespaces().Get
call in this file.

Comment on lines 702 to 712
conn, err := tls.Dial("tcp", host, shouldWork)
if err != nil {
	return fmt.Errorf("svc/%s in %s: TLS connection that should work failed: %w", t.serviceName, t.namespace, err)
}
negotiated := conn.ConnectionState().Version
conn.Close()
e2e.Logf("svc/%s in %s: TLS connection succeeded, negotiated version: 0x%04x", t.serviceName, t.namespace, negotiated)

// Test that the version that should not work is rejected.
conn, err = tls.Dial("tcp", host, shouldNotWork)
if err == nil {

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's verify the file exists and check the exact content at the specified lines
fd tls_observed_config.go

Repository: openshift/origin

Length of output: 100


🏁 Script executed:

# Once we locate the file, examine the code around lines 702-712
cat -n test/extended/tls/tls_observed_config.go | sed -n '695,720p'

Repository: openshift/origin

Length of output: 1505


🏁 Script executed:

# Also check for the function signature to understand the context better
cat -n test/extended/tls/tls_observed_config.go | sed -n '680,715p'

Repository: openshift/origin

Length of output: 1707


🏁 Script executed:

# Check for other tls.Dial usages in this file to see the pattern
rg "tls\.Dial" test/extended/tls/tls_observed_config.go -n -B 2 -A 2

Repository: openshift/origin

Length of output: 470


🏁 Script executed:

# Check if there are any timeout or context handling patterns in this file
rg "context\.|timeout|Timeout|Deadline" test/extended/tls/tls_observed_config.go -n

Repository: openshift/origin

Length of output: 860


🏁 Script executed:

# Check for imports in the file to see what's already imported
head -50 test/extended/tls/tls_observed_config.go

Repository: openshift/origin

Length of output: 1905


Add explicit handshake timeouts for TLS probes.

tls.Dial without deadlines can hang under network edge cases and block test progression in the e2e suite. Use tls.DialWithDialer with a timeout to ensure reliable test execution.

🔧 Proposed fix
 import (
 	"context"
 	"crypto/tls"
 	"encoding/json"
 	"fmt"
 	"io"
 	"math/rand"
+	"net"
 	"os/exec"
 	"strings"
 	"time"
@@
 func checkTLSConnection(localPort int, shouldWork, shouldNotWork *tls.Config, t tlsTarget) error {
 	host := fmt.Sprintf("localhost:%d", localPort)
+	dialer := &net.Dialer{Timeout: 5 * time.Second}
 	e2e.Logf("Testing TLS behavior against %s (forwarded from svc/%s in %s)", host, t.serviceName, t.namespace)

 	// Test that the expected TLS version works.
-	conn, err := tls.Dial("tcp", host, shouldWork)
+	conn, err := tls.DialWithDialer(dialer, "tcp", host, shouldWork)
@@
 	// Test that the version that should not work is rejected.
-	conn, err = tls.Dial("tcp", host, shouldNotWork)
+	conn, err = tls.DialWithDialer(dialer, "tcp", host, shouldNotWork)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
	conn, err := tls.Dial("tcp", host, shouldWork)
	if err != nil {
		return fmt.Errorf("svc/%s in %s: TLS connection that should work failed: %w", t.serviceName, t.namespace, err)
	}
	negotiated := conn.ConnectionState().Version
	conn.Close()
	e2e.Logf("svc/%s in %s: TLS connection succeeded, negotiated version: 0x%04x", t.serviceName, t.namespace, negotiated)
	// Test that the version that should not work is rejected.
	conn, err = tls.Dial("tcp", host, shouldNotWork)
	if err == nil {

After:
	conn, err := tls.DialWithDialer(dialer, "tcp", host, shouldWork)
	if err != nil {
		return fmt.Errorf("svc/%s in %s: TLS connection that should work failed: %w", t.serviceName, t.namespace, err)
	}
	negotiated := conn.ConnectionState().Version
	conn.Close()
	e2e.Logf("svc/%s in %s: TLS connection succeeded, negotiated version: 0x%04x", t.serviceName, t.namespace, negotiated)
	// Test that the version that should not work is rejected.
	conn, err = tls.DialWithDialer(dialer, "tcp", host, shouldNotWork)
	if err == nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 702 - 712, The TLS
probes use tls.Dial (in the block handling host/shouldWork and
host/shouldNotWork) which can hang; replace calls to tls.Dial with
tls.DialWithDialer using a net.Dialer that sets a reasonable Timeout (e.g., a
few seconds) so both the successful probe and the rejected-version probe use a
dialer with explicit handshake timeout, preserve the same error handling around
conn, ConnectionState(), and conn.Close(), and ensure the dialer timeout
variable is reused for both DialWithDialer invocations to avoid duplicated magic
values.

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Contributor Author

gangwgr commented Feb 26, 2026

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@gangwgr, testwith: could not generate prow job. ERROR:

could not determine cluster for job pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6: error performing request: got unexpected http 404 status code from dispatcher: Cluster not found

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (4)
test/extended/tls/tls_observed_config.go (4)

698-755: ⚠️ Potential issue | 🟠 Major

Add explicit handshake timeouts for TLS dial operations.

tls.Dial without deadlines can hang indefinitely under network edge cases, blocking test progression. This was flagged in a previous review. Use tls.DialWithDialer with a timeout dialer.

🔧 Proposed fix
 func checkTLSConnection(localPort int, shouldWork, shouldNotWork *tls.Config, t tlsTarget) error {
 	// Try both IPv4 and IPv6 localhost addresses for dual-stack support.
 	hosts := []string{
 		fmt.Sprintf("127.0.0.1:%d", localPort), // IPv4
 		fmt.Sprintf("[::1]:%d", localPort),     // IPv6
 	}

 	var host string
 	var conn *tls.Conn
 	var err error
+	dialer := &net.Dialer{Timeout: 5 * time.Second}

 	// Find a working host (try IPv4 first, then IPv6).
 	for _, h := range hosts {
-		conn, err = tls.Dial("tcp", h, shouldWork)
+		conn, err = tls.DialWithDialer(dialer, "tcp", h, shouldWork)
 		if err == nil {
 			host = h
 			break
 		}
 		// ... error handling ...
 	}
 	// ...
 	// Test that the version that should not work is rejected.
-	conn, err = tls.Dial("tcp", host, shouldNotWork)
+	conn, err = tls.DialWithDialer(dialer, "tcp", host, shouldNotWork)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 698 - 755, The TLS
dials in checkTLSConnection can hang; change both tls.Dial calls to use
tls.DialWithDialer with a net.Dialer that sets a timeout (e.g. Dialer{Timeout:
10*time.Second}) so the initial probe (inside the hosts loop using shouldWork)
and the subsequent negative probe (using shouldNotWork) will fail fast instead
of blocking indefinitely; locate tls.Dial invocations in checkTLSConnection and
replace them with tls.DialWithDialer(&net.Dialer{Timeout: <reasonable
duration>}, "tcp", hostOrH, tlsConfig).

506-510: ⚠️ Potential issue | 🟠 Major

Skip only on NotFound errors; fail on other namespace lookup failures.

The current code treats any API error as "namespace doesn't exist" and skips, which masks real failures like auth errors or transient issues. This was flagged in a previous review.

🐛 Proposed fix
 	g.By(fmt.Sprintf("verifying namespace %s exists", t.namespace))
 	_, err := oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
-	if err != nil {
+	if kapierrors.IsNotFound(err) {
 		g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
 	}
+	o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 506 - 510, The code
currently skips on any error from
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}); change it to only skip when the error is a NotFound (use
k8s api errors.IsNotFound(err) from "k8s.io/apimachinery/pkg/api/errors") and
for any other non-nil error fail the test (use g.Fail or a test fatal with the
error) so real API/auth/transient failures aren't masked; keep the existing
g.Skip(fmt.Sprintf("Namespace %s does not exist...", t.namespace)) branch for
IsNotFound and report the actual err in the failing branch.

648-683: ⚠️ Potential issue | 🟠 Major

Add active readiness probe for port-forward tunnel establishment.

The code starts oc port-forward and reads a chunk from stdout, but this doesn't guarantee the tunnel accepts connections. The 10-second context timeout also covers both tunnel setup and TLS test execution, which is too tight. This was flagged in a previous review.

🔧 Proposed fix with TCP readiness probe and extended timeout
 func forwardPortAndExecute(serviceName, namespace, remotePort string, toExecute func(localPort int) error) error {
 	var err error
 	for i := 0; i < 3; i++ {
 		if err = func() error {
-			ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 			defer cancel()
-			localPort := rand.Intn(65534-1025) + 1025
+			localPort := rand.Intn(64510) + 1025 // [1025, 65534]
 			args := []string{
 				"port-forward",
 				fmt.Sprintf("svc/%s", serviceName),
 				fmt.Sprintf("%d:%s", localPort, remotePort),
 				"-n", namespace,
 			}

 			cmd := exec.CommandContext(ctx, "oc", args...)
 			stdout, stderr, err := e2e.StartCmdAndStreamOutput(cmd)
 			if err != nil {
 				return fmt.Errorf("failed to start port-forward: %v", err)
 			}
 			defer stdout.Close()
 			defer stderr.Close()
 			defer e2e.TryKill(cmd)

-			// Give port-forward a moment to establish.
-			e2e.Logf("oc port-forward output: %s", readPartialFrom(stdout, 1024))
+			// Wait until forwarded local port is reachable via active TCP probe.
+			deadline := time.Now().Add(5 * time.Second)
+			for {
+				conn, dialErr := net.DialTimeout("tcp", fmt.Sprintf("127.0.0.1:%d", localPort), 500*time.Millisecond)
+				if dialErr == nil {
+					_ = conn.Close()
+					break
+				}
+				if time.Now().After(deadline) {
+					return fmt.Errorf("port-forward did not become ready for svc/%s in %s: %w", serviceName, namespace, dialErr)
+				}
+				time.Sleep(200 * time.Millisecond)
+			}
+			e2e.Logf("oc port-forward established to svc/%s in %s on local port %d", serviceName, namespace, localPort)
 			return toExecute(localPort)
 		}(); err == nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 648 - 683,
forwardPortAndExecute currently starts oc port-forward with a single 10s context
and only reads stdout (via readPartialFrom/StartCmdAndStreamOutput), which
doesn't guarantee the tunnel accepts connections; change it to create a
dedicated, longer context for the port-forward command (e.g., 30s) and perform
an active TCP readiness probe against localhost:localPort (using net.Dial or
net.DialTimeout) with several short retries and backoff to confirm the tunnel is
accepting connections before calling toExecute; ensure you still defer
stdout.Close(), stderr.Close(), and e2e.TryKill(cmd) on failure, and separate
the port-forward lifecycle from the test execution (i.e., do not have the
port-forward context encompass the entire toExecute run).

579-583: ⚠️ Potential issue | 🟠 Major

Same issue: skip only on NotFound errors.

This has the same error handling issue as testDeploymentTLSEnvVars — any error causes a skip rather than only NotFound errors.

🐛 Proposed fix
 	g.By("verifying namespace exists")
 	_, err = oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace, metav1.GetOptions{})
-	if err != nil {
+	if kapierrors.IsNotFound(err) {
 		g.Skip(fmt.Sprintf("Namespace %s does not exist in this cluster", t.namespace))
 	}
+	o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("failed to query namespace %s", t.namespace))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 579 - 583, The
namespace existence check currently skips the test for any Get error; change it
to only skip when the error is a NotFound error by using
k8s.io/apimachinery/pkg/api/errors.IsNotFound(err) against the result of
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}), i.e. if apierrors.IsNotFound(err) then g.Skip(...) else if
err != nil then fail the test (return or call the test failure helper used in
these tests) so genuine API errors surface instead of being treated as missing
namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/extended/tls/tls_observed_config.go`:
- Around line 698-755: The TLS dials in checkTLSConnection can hang; change both
tls.Dial calls to use tls.DialWithDialer with a net.Dialer that sets a timeout
(e.g. Dialer{Timeout: 10*time.Second}) so the initial probe (inside the hosts
loop using shouldWork) and the subsequent negative probe (using shouldNotWork)
will fail fast instead of blocking indefinitely; locate tls.Dial invocations in
checkTLSConnection and replace them with tls.DialWithDialer(&net.Dialer{Timeout:
<reasonable duration>}, "tcp", hostOrH, tlsConfig).
- Around line 506-510: The code currently skips on any error from
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}); change it to only skip when the error is a NotFound (use
k8s api errors.IsNotFound(err) from "k8s.io/apimachinery/pkg/api/errors") and
for any other non-nil error fail the test (use g.Fail or a test fatal with the
error) so real API/auth/transient failures aren't masked; keep the existing
g.Skip(fmt.Sprintf("Namespace %s does not exist...", t.namespace)) branch for
IsNotFound and report the actual err in the failing branch.
- Around line 648-683: forwardPortAndExecute currently starts oc port-forward
with a single 10s context and only reads stdout (via
readPartialFrom/StartCmdAndStreamOutput), which doesn't guarantee the tunnel
accepts connections; change it to create a dedicated, longer context for the
port-forward command (e.g., 30s) and perform an active TCP readiness probe
against localhost:localPort (using net.Dial or net.DialTimeout) with several
short retries and backoff to confirm the tunnel is accepting connections before
calling toExecute; ensure you still defer stdout.Close(), stderr.Close(), and
e2e.TryKill(cmd) on failure, and separate the port-forward lifecycle from the
test execution (i.e., do not have the port-forward context encompass the entire
toExecute run).
- Around line 579-583: The namespace existence check currently skips the test
for any Get error; change it to only skip when the error is a NotFound error by
using k8s.io/apimachinery/pkg/api/errors.IsNotFound(err) against the result of
oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, t.namespace,
metav1.GetOptions{}), i.e. if apierrors.IsNotFound(err) then g.Skip(...) else if
err != nil then fail the test (return or call the test failure helper used in
these tests) so genuine API errors surface instead of being treated as a missing
namespace.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between b8f1f94 and b0a3132.

📒 Files selected for processing (4)
  • pkg/testsuites/standard_suites.go
  • test/extended/include.go
  • test/extended/tls/OWNERS
  • test/extended/tls/tls_observed_config.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/include.go
  • test/extended/tls/OWNERS

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@gangwgr
Contributor Author

gangwgr commented Feb 26, 2026

@gangwgr
Contributor Author

gangwgr commented Feb 26, 2026

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@gangwgr, testwith: could not generate prow job. ERROR:

could not determine cluster for job pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6: error performing request: got unexpected http 404 status code from dispatcher: Cluster not found

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@gangwgr, testwith: could not generate prow job. ERROR:

could not determine ci op config from metadata: got unexpected http 404 status code from configresolver: failed to get config: could not find any config for branch master on repo openshift/origin

@openshift-trt

openshift-trt bot commented Feb 26, 2026

Job Failure Risk Analysis for sha: b0a3132

Job Name: pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6
Failure Risk: IncompleteTests (tests for this run (25) are below the historical average (2372); not enough tests ran to make a reasonable risk analysis, which could be due to infra, installation, or upgrade problems)

@gangwgr
Contributor Author

gangwgr commented Feb 27, 2026

@openshift-ci
Contributor

openshift-ci bot commented Feb 27, 2026

@gangwgr, testwith: could not generate prow job. ERROR:

could not determine cluster for job pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6: error performing request: got unexpected http 404 status code from dispatcher: Cluster not found


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
test/extended/tls/tls_observed_config.go (3)

202-206: Misplaced GinkgoRecover() call.

g.GinkgoRecover() is intended to be called in goroutines spawned within test code to catch panics and report them to Ginkgo. Placing it at the Describe block level with defer has no effect since the Describe function itself doesn't run in a goroutine that needs panic recovery—Ginkgo already handles panics in test nodes.

Remove this line as it serves no purpose here.

🧹 Proposed fix
 var _ = g.Describe("[sig-api-machinery][Feature:TLSObservedConfig][Suite:openshift/tls-observed-config][Serial]", func() {
-	defer g.GinkgoRecover()
-
 	oc := exutil.NewCLI("tls-observed-config")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 202 - 206, Remove the
misplaced defer g.GinkgoRecover() call from the Describe block in
tls_observed_config.go; specifically, delete the line calling g.GinkgoRecover()
immediately inside the Describe anonymous function (around the top of the
Describe that defines the "[sig-api-machinery]...TLSObservedConfig" suite)
because Ginkgo handles panic recovery for Describe-level nodes and GinkgoRecover
is only necessary inside spawned goroutines.

655-657: Consider seeding random or using ephemeral port range.

The random port selection uses math/rand without explicit seeding (Go 1.20+ auto-seeds, but older versions don't). Also, ports 1025-49151 include registered ports that may be in use. Consider using the ephemeral range (49152-65535) to reduce collision likelihood.

🧹 Proposed improvement
-			localPort := rand.Intn(65534-1025) + 1025
+			// Use ephemeral port range (49152-65535) to reduce collision with registered services
+			localPort := rand.Intn(65535-49152) + 49152
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 655 - 657, The test
picks localPort via rand.Intn without seeding and from a broad range; update the
selection for stability by either seeding math/rand (e.g., seed once with
time.Now().UnixNano()) or—preferably—choose from the IANA ephemeral port range
(49152-65535) to avoid registered ports, and use the localPort variable (where
it's assigned) accordingly; alternatively consider binding to port 0 (letting
the OS pick) where applicable to eliminate collisions.

527-529: Consider checking all containers for TLS env vars.

The code only inspects Containers[0], assuming TLS env vars are always in the first container. If container ordering changes or sidecars are added, this could miss the relevant env vars. Consider iterating over all containers or adding a comment documenting this assumption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/tls/tls_observed_config.go` around lines 527 - 529, The test
currently only inspects deployment.Spec.Template.Spec.Containers[0] for TLS env
vars which can miss values in other containers or sidecars; update the check
that uses deployment.Spec.Template.Spec.Containers, envToMap and logEnvVars to
iterate over all containers (or merge each container's env into a combined map)
and verify t.tlsMinVersionEnvVar in every container's env set, ensuring you call
logEnvVars for each container (or once for the merged map) so no container is
skipped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/extended/tls/tls_observed_config.go`:
- Around line 202-206: Remove the misplaced defer g.GinkgoRecover() call from
the Describe block in tls_observed_config.go; specifically, delete the line
calling g.GinkgoRecover() immediately inside the Describe anonymous function
(around the top of the Describe that defines the
"[sig-api-machinery]...TLSObservedConfig" suite) because Ginkgo handles panic
recovery for Describe-level nodes and GinkgoRecover is only necessary inside
spawned goroutines.
- Around line 655-657: The test picks localPort via rand.Intn without seeding
and from a broad range; update the selection for stability by either seeding
math/rand (e.g., seed once with time.Now().UnixNano()) or—preferably—choose from
the IANA ephemeral port range (49152-65535) to avoid registered ports, and use
the localPort variable (where it's assigned) accordingly; alternatively consider
binding to port 0 (letting the OS pick) where applicable to eliminate
collisions.
- Around line 527-529: The test currently only inspects
deployment.Spec.Template.Spec.Containers[0] for TLS env vars which can miss
values in other containers or sidecars; update the check that uses
deployment.Spec.Template.Spec.Containers, envToMap and logEnvVars to iterate
over all containers (or merge each container's env into a combined map) and
verify t.tlsMinVersionEnvVar in every container's env set, ensuring you call
logEnvVars for each container (or once for the merged map) so no container is
skipped.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between b0a3132 and ee0403d.

📒 Files selected for processing (4)
  • pkg/testsuites/standard_suites.go
  • test/extended/include.go
  • test/extended/tls/OWNERS
  • test/extended/tls/tls_observed_config.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/include.go
  • pkg/testsuites/standard_suites.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Feb 27, 2026

@gangwgr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

ci/prow/e2e-metal-ipi-ovn-ipv6 (commit b0a3132, required): /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-gcp-ovn (commit b0a3132, required): /test e2e-gcp-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

