fix: increase Test_Server_CapabilityError timeout to prevent race wit…#22501

Merged
Tofel merged 1 commit into develop from fix/flaky-CRE-4319-2026-05-15-clean
May 20, 2026

Conversation

Tofel commented May 15, 2026

Summary

  • Increases RequestTimeout from 100ms to 10s in Test_Server_CapabilityError to eliminate a race condition between the server's expiry ticker and async message delivery.

Root cause

RequestTimeout=100ms equalled the ticker interval returned by getServerTickerInterval. On loaded CI machines, the expiry goroutine fires Cancel(Error_TIMEOUT) before all 10 async messages from 10 workflow peers complete delivery through testAsyncMessageBroker's serial sendCh consumer. The 10th OnMessage then finds hasResponse()==true (set by Cancel) and dispatches Error_TIMEOUT instead of the expected Error_INTERNAL_ERROR.
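The ordering problem described above can be reproduced in a small deterministic model. This is only a sketch under simplifying assumptions: the `request` type, `simulate`, and the rule that the capability error is recorded once all peer messages have arrived are hypothetical stand-ins, not the actual server code. It shows that if the expiry fires before the last message is delivered, the first response set (the timeout) wins.

```go
package main

import "fmt"

// Hypothetical stand-ins for the error codes named in the PR description.
const (
	errTimeout  = "Error_TIMEOUT"
	errInternal = "Error_INTERNAL_ERROR"
)

// request is a toy model of a server-side request: the first response wins.
type request struct {
	response string // empty until a response has been set
	received int    // messages delivered so far
	required int    // messages needed before the capability error is recorded
}

func (r *request) hasResponse() bool { return r.response != "" }

// cancel models the expiry path: it sets the response only if none exists yet.
func (r *request) cancel(code string) {
	if !r.hasResponse() {
		r.response = code
	}
}

// onMessage models delivery of one peer message. Once enough messages have
// arrived it records the capability error, unless a response is already set.
func (r *request) onMessage() {
	r.received++
	if r.received >= r.required && !r.hasResponse() {
		r.response = errInternal
	}
}

// simulate delivers `peers` messages one at a time (a serial consumer) and
// fires the expiry once `expireAfter` of them have been delivered.
func simulate(peers, expireAfter int) string {
	r := &request{required: peers}
	for i := 0; i < peers; i++ {
		if i == expireAfter {
			r.cancel(errTimeout)
		}
		r.onMessage()
	}
	r.cancel(errTimeout) // a late expiry is a no-op once a response exists
	return r.response
}

func main() {
	fmt.Println(simulate(10, 9))  // expiry fires before the 10th message
	fmt.Println(simulate(10, 10)) // expiry fires after all 10 messages
}
```

In this model, an expiry that lands after nine deliveries yields the timeout code, while an expiry that lands after all ten yields the internal error the test expects.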

Fix

Raise the timeout to 10s — matching the convention used by other tests in this file. TestErrorCapability returns synchronously, so the test still completes in milliseconds; only the safety margin changes.

Flaky test fixes

| Issue | Test | Trunk |
| --- | --- | --- |
| CRE-4319 | Test_Server_CapabilityError | Trunk test case |

Test plan

  • Test_Server_CapabilityError passed 10/10 runs locally with -race -shuffle=on
  • golangci-lint clean for ./core/capabilities/remote/executable/...
  • CI green

…h expiry ticker

RequestTimeout was 100ms — equal to the ticker interval — causing the expiry
goroutine to fire Cancel(Error_TIMEOUT) before async broker delivery of all 10
messages completed on loaded CI machines.

Fixes CRE-4319
@github-actions

✅ No conflicts with other open PRs targeting develop

trunk-io Bot commented May 15, 2026

Tofel marked this pull request as ready for review May 20, 2026 10:03
Tofel requested a review from a team as a code owner May 20, 2026 10:03
Copilot AI review requested due to automatic review settings May 20, 2026 10:03
Tofel requested a review from a team as a code owner May 20, 2026 10:03

Copilot AI left a comment

Pull request overview

Risk Rating: LOW — test-only change adjusting a timeout value to reduce flakiness; no production logic impact.

This PR addresses CI flakiness in Test_Server_CapabilityError by increasing the server request timeout so the server’s expiry path doesn’t race ahead of asynchronous message delivery.

Changes:

  • Increased the capabilityNodeResponseTimeout (used as RemoteExecutableConfig.RequestTimeout when unset) in Test_Server_CapabilityError from 100ms to 10s.

Scrupulous human review focus:

  • Confirm the new 10s timeout aligns with intended behavior of the expiry ticker/request timeout interaction for this test scenario (i.e., it removes the flake without masking real regressions).

Tofel added this pull request to the merge queue May 20, 2026
github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 20, 2026
Tofel added this pull request to the merge queue May 20, 2026
Merged via the queue into develop with commit 088ed59 May 20, 2026
115 checks passed
Tofel deleted the fix/flaky-CRE-4319-2026-05-15-clean branch May 20, 2026 13:54