Skip to content

Test | Fix Transient Fault handling and other flaky unit tests#4080

Merged
mdaigle merged 5 commits intomainfrom
dev/cheena/fix-flaky-transientfaulthandling
Mar 25, 2026
Merged

Test | Fix Transient Fault handling and other flaky unit tests#4080
mdaigle merged 5 commits intomainfrom
dev/cheena/fix-flaky-transientfaulthandling

Conversation

@cheenamalhotra
Copy link
Member

@cheenamalhotra cheenamalhotra commented Mar 24, 2026

[Attempted fix with Copilot]

The Problem

When SqlConnection.Open() runs on .NET Framework (net462), the internal LoginNoFailover method enters parallel/interval-timer mode by default because TNIR (Transparent Network IP Resolution) is enabled. LoginWithFailover always uses interval timers on all platforms.

These interval timers give each login attempt only a small fraction of the total ConnectTimeout:

Path Fraction With 30s timeout
LoginWithFailover 8% (FailoverTimeoutStep) 2.4s per attempt
LoginNoFailover + TNIR 12.5% (FailoverTimeoutStepForTnir) 1.875s per attempt

On a busy Windows CI machine, when one of these interval timers expires mid-login, the client disconnects and retries internally inside the login loop — before the outer transient-fault retry loop ever sees the error. This produces an extra PreLogin on the TDS server that never gets a corresponding Login7 (the client abandoned the connection).

Normal flow (2 PreLogins):

  1. PreLogin → Login7 → transient error 40613 → outer retry
  2. PreLogin → Login7 → success

Busy CI flow (3 PreLogins):

  1. PreLogin → Login7 → transient error 40613 → outer retry
  2. PreLogin → interval timer expires before Login7 completes → client disconnects → internal retry
  3. PreLogin → Login7 → success

The tests asserted Assert.Equal(2, server.PreLoginCount) — an exact count that doesn't account for these platform-specific internal retries. On .NET (net8.0/net9.0), TNIR doesn't exist and LoginNoFailover only uses interval timers when MultiSubnetFailover=true, so the tests pass reliably there.

The fix adds a Login7Count counter to the test TDS server so we can compute AbandonedPreLoginCount (PreLogins with no Login7), then asserts on PreLoginCount - AbandonedPreLoginCount — giving an exact, deterministic count regardless of how many internal timeout retries occurred.

Additional fixes

  • Fixes the multipart identifier getting skipped due to duplicate IDs:

[xUnit.net 00:00:03.53] UnitTests: Skipping test case with duplicate ID '287fad04d43fa7e5d02c3f11743f0138e5e6db40' ('Microsoft.Data.Common.UnitTests.MultipartIdentifierTests.SinglePartIdentifierWithBracketsAndWhiteSpace_ParsesSuccessfully(partIdentifier: "word1", expected: ["word1"])' and 'Microsoft.Data.Common.UnitTests.MultipartIdentifierTests.SinglePartIdentifierWithBracketsAndWhiteSpace_ParsesSuccessfully(partIdentifier: "word1", expected: ["word1"])')

  • Fixes failover port test failure, by ensuring connection pools are cleared before connecting to failover partner.

  • Fixes SqlTypeWorkaroundsTests discovery by setting DisableDiscoveryEnumeration to true.

@cheenamalhotra cheenamalhotra requested a review from a team as a code owner March 24, 2026 07:12
Copilot AI review requested due to automatic review settings March 24, 2026 07:12
@github-project-automation github-project-automation bot moved this to To triage in SqlClient Board Mar 24, 2026
@cheenamalhotra cheenamalhotra added the Area\Tests Issues that are targeted to tests or test projects label Mar 24, 2026
@cheenamalhotra cheenamalhotra added this to the 7.1.0-preview1 milestone Mar 24, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the simulated TDS server test infrastructure and transient-fault unit tests to avoid flaky assertions on exact PreLogin counts on .NET Framework where internal TNIR/interval-timer retries can introduce extra pre-login attempts.

Changes:

  • Add Login7Count tracking to GenericTdsServer and adjust TransientTdsErrorTdsServer to increment it when bypassing base.OnLogin7Request.
  • Introduce AbandonedPreLoginCount (PreLogins not followed by Login7) for more deterministic test assertions.
  • Remove flaky traits from transient-fault simulated server tests and update assertions to exclude “abandoned” prelogins.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/Microsoft.Data.SqlClient/tests/tools/TDS/TDS.Servers/TransientTdsErrorTdsServer.cs Increments Login7 counter on error path where base handler is skipped.
src/Microsoft.Data.SqlClient/tests/tools/TDS/TDS.Servers/GenericTdsServer.cs Adds Login7 and abandoned-prelogin counting to support deterministic login-attempt assertions.
src/Microsoft.Data.SqlClient/tests/UnitTests/SimulatedServerTests/ConnectionTests.cs Updates transient-fault assertions to use adjusted login-attempt counting; removes flaky traits.
src/Microsoft.Data.SqlClient/tests/UnitTests/SimulatedServerTests/ConnectionRoutingTestsAzure.cs Updates routing transient-fault assertions to use adjusted login-attempt counting; removes flaky trait.
src/Microsoft.Data.SqlClient/tests/UnitTests/SimulatedServerTests/ConnectionRoutingTests.cs Updates routing transient-fault assertions to use adjusted login-attempt counting; removes flaky trait.
src/Microsoft.Data.SqlClient/tests/UnitTests/SimulatedServerTests/ConnectionFailoverTests.cs Updates transient-fault login-attempt assertion to use adjusted counting; removes flaky trait.

@cheenamalhotra cheenamalhotra moved this from To triage to In progress in SqlClient Board Mar 24, 2026
Copilot AI review requested due to automatic review settings March 24, 2026 07:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

@cheenamalhotra cheenamalhotra changed the title Test | Fix Transient Fault handling flaky tests Test | Fix Transient Fault handling and other flaky unit tests Mar 24, 2026
@paulmedynski paulmedynski self-assigned this Mar 24, 2026
paulmedynski
paulmedynski previously approved these changes Mar 24, 2026
@codecov
Copy link

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.42%. Comparing base (60d4b92) to head (e7dd518).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4080      +/-   ##
==========================================
- Coverage   73.22%   68.42%   -4.81%     
==========================================
  Files         280      275       -5     
  Lines       43000    66924   +23924     
==========================================
+ Hits        31486    45791   +14305     
- Misses      11514    21133    +9619     
Flag Coverage Δ
CI-SqlClient ?
PR-SqlClient-Project 68.42% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mdaigle
mdaigle previously approved these changes Mar 24, 2026
Copilot AI review requested due to automatic review settings March 24, 2026 21:12
@cheenamalhotra cheenamalhotra dismissed stale reviews from mdaigle and paulmedynski via e7dd518 March 24, 2026 21:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

@cheenamalhotra cheenamalhotra moved this from In progress to In review in SqlClient Board Mar 24, 2026
@mdaigle mdaigle merged commit 59afd23 into main Mar 25, 2026
303 checks passed
@mdaigle mdaigle deleted the dev/cheena/fix-flaky-transientfaulthandling branch March 25, 2026 16:52
@github-project-automation github-project-automation bot moved this from In review to Done in SqlClient Board Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area\Tests Issues that are targeted to tests or test projects

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants