unflake a System Test: 'test_remote_sampling_rules_retention' #6342

Draft

vlad-scherbich wants to merge 4 commits into main from fix/flaky-test-remote-sampling-rules-retention

Conversation

vlad-scherbich commented Feb 18, 2026

https://datadoghq.atlassian.net/browse/PROF-13796

Motivation

test_remote_sampling_rules_retention fails intermittently in CI. It was observed flaking in a dd-trace-py PR that only changed CI workflow config, with no relation to the test logic (see the failed job).

The root cause is a race condition in set_and_wait_rc: its wait signals (telemetry event + RC ACKNOWLEDGED state) are not config-version-specific, so stale events from a prior RC config can satisfy them before the new sampling rules are actually active in the library.
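To make the race concrete, here is a minimal, self-contained sketch (the names wait_for_rc_ack, get_rc_state, and the state dict shape are hypothetical, not the real set_and_wait_rc API). A wait that is not pinned to the new config's version returns as soon as any ACKNOWLEDGED state is visible, including one left over from the previous config; pinning it to the new version would close the window:

import time

def wait_for_rc_ack(get_rc_state, expected_version=None, timeout=5.0):
    # Poll for an ACKNOWLEDGED remote-config state. With expected_version=None
    # (the flaky behavior), an ACKNOWLEDGED state left over from the previous
    # config satisfies the wait immediately, even though the new sampling
    # rules are not applied yet.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_rc_state()  # hypothetical: {"apply_state": str, "version": int}
        if state["apply_state"] == "ACKNOWLEDGED" and (
            expected_version is None or state["version"] >= expected_version
        ):
            return state
        time.sleep(0.1)
    raise TimeoutError("remote config not acknowledged before timeout")

# Demo: the library still reports the ACK for config v1 while v2 propagates.
stale = {"apply_state": "ACKNOWLEDGED", "version": 1}
assert wait_for_rc_ack(lambda: stale)["version"] == 1  # returns early on stale state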

Changes

Adds a retry loop around the first trace assertion in test_remote_sampling_rules_retention to tolerate the brief propagation window between RC acknowledgment and actual rule application. This is consistent with how other tests in the same file handle similar timing sensitivity (e.g., get_sampled_trace).
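The added loop (quoted from the diff discussed in the review thread below) polls until the span's metrics show the new rule's sample rate:

for _ in range(30):
    trace = send_and_wait_trace(test_library, test_agent, name="test", service="foo")
    span = find_first_span_in_trace_payload(trace)
    if span["metrics"].get("_dd.rule_psr", 1.0) == pytest.approx(0.1):
        break
    time.sleep(0.1)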

Testing

$ ./run.sh PARAMETRIC -L python -k "retention"
...
============================================================================= test context ==============================================================================
Scenario: PARAMETRIC
Logs folder: ./logs_parametric
Library: python@4.4.0
========================================================================== test session starts ==========================================================================
gw0 [1] / gw1 [1] / gw2 [1] / gw3 [1] / gw4 [1] / gw5 [1] / gw6 [1] / gw7 [1] / gw8 [1] / gw9 [1] / gw10 [1] / gw11 [1] / gw12 [1] / gw13 [1] / gw14 [1] / gw15 [1]
.                                                                                                                                                                 [100%]
--------------------------- generated xml file: /Users/vlad.scherbich/go/src/github.com/DataDog/system-tests/logs_parametric/reportJunit.xml ----------------------------
========================================================================== 1 passed in 32.80s ===========================================================================
COMP-LR7JK0FKW1:system-tests vlad.scherbich$ 

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on your PR until the CI passes
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from the RFC owner.
    • Framework is modified, or used in a non-obvious way? -> Get a review from the R&P team.

🚀 Once your PR is reviewed and the CI green, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • Anything but tests/ or manifests/ is modified? I have approval from the R&P team
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added, removed or renamed?

@vlad-scherbich vlad-scherbich requested review from a team as code owners February 18, 2026 22:09
@vlad-scherbich vlad-scherbich requested review from zacharycmontoya and removed request for a team February 18, 2026 22:09
github-actions bot (Contributor) commented Feb 18, 2026

CODEOWNERS have been resolved as:

tests/parametric/test_dynamic_configuration.py                          @DataDog/system-tests-core @DataDog/apm-sdk-capabilities

@vlad-scherbich vlad-scherbich requested review from a team and KowalskiThomas February 18, 2026 22:14
@vlad-scherbich vlad-scherbich marked this pull request as draft February 18, 2026 22:20
Comment on lines 1054 to 1059
for _ in range(30):
    trace = send_and_wait_trace(test_library, test_agent, name="test", service="foo")
    span = find_first_span_in_trace_payload(trace)
    if span["metrics"].get("_dd.rule_psr", 1.0) == pytest.approx(0.1):
        break
    time.sleep(0.1)
Contributor

If I understand correctly, we try up to 30 times (waiting 0.1 seconds between each attempt) to see a span whose metadata shows that the new sampling rules were correctly received and applied?

vlad-scherbich (Author)

Exactly!

@vlad-scherbich vlad-scherbich force-pushed the fix/flaky-test-remote-sampling-rules-retention branch from b65f71b to d22ddf4 on February 19, 2026 13:31
assert_sampling_rate(trace, 0.1)
# After updating the RC config, the library may briefly still be applying the
# previous sampling rules. set_and_wait_rc waits for telemetry and RC acknowledgment,
# but these signals can be satisfied by stale events from the prior config, causing a
cbeauchesne (Collaborator)

Would it be possible to improve the wait_rc condition to wait for the precise new config?

vlad-scherbich (Author) commented Feb 19, 2026

@cbeauchesne, I like this idea.
set_and_wait_rc is just a helper function used only in test_dynamic_configuration.py at 20 call sites.

I'll open a separate PR for the proposed fix, which would be to clear the session before setting the new RC config.
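For illustration, a hedged sketch of that follow-up idea (FakeTestAgent, clear_session, and the other names here are placeholders, not the real system-tests helpers): clearing the session drops any stale events, so the subsequent wait can only be satisfied by the new config's acknowledgment:

class FakeTestAgent:
    # Stand-in for the parametric test agent, for illustration only.
    def __init__(self):
        self.events = ["ACK(v1)"]  # stale ACK left over from the previous config

    def clear_session(self):
        self.events.clear()  # forget everything captured so far

    def set_remote_config(self, config):
        self.events.append(f"ACK({config})")  # the library ACKs the new config

    def wait_for_rc_ack(self):
        assert self.events, "no ACK observed"
        return self.events[-1]

agent = FakeTestAgent()
agent.clear_session()          # key step: no stale event can satisfy the wait
agent.set_remote_config("v2")
assert agent.wait_for_rc_ack() == "ACK(v2)"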

vlad-scherbich (Author)

@cbeauchesne, new PR implementing your suggestion: #6349

Lots of unrelated system tests are failing; do I just re-run them until they pass?

datadog-datadog-prod-us1 bot commented Feb 19, 2026

⚠️ Tests


⚠️ Warnings

❄️ 706 New flaky tests detected

tests.appsec.test_alpha.Test_Basic.test_headers[chi] from system_tests_suite (Datadog)
ValueError: No appsec event validate this condition

self = <tests.appsec.test_alpha.Test_Basic object at 0x7f57d96e87d0>

    def test_headers(self):
        """Via server.request.headers.no_cookies"""
        # Note: we do not check the returned key_path nor rule_id for the alpha version
        address = "server.request.headers.no_cookies"
        pattern = "../"
>       interfaces.library.assert_waf_attack(self.r_headers_1, pattern=pattern, address=address)
...
tests.appsec.test_alpha.Test_Basic.test_headers[echo] from system_tests_suite (Datadog)
ValueError: No appsec event validate this condition

self = <tests.appsec.test_alpha.Test_Basic object at 0x7fada1c89370>

    def test_headers(self):
        """Via server.request.headers.no_cookies"""
        # Note: we do not check the returned key_path nor rule_id for the alpha version
        address = "server.request.headers.no_cookies"
        pattern = "../"
>       interfaces.library.assert_waf_attack(self.r_headers_1, pattern=pattern, address=address)
...
tests.appsec.test_alpha.Test_Basic.test_headers[gin] from system_tests_suite (Datadog)
ValueError: No appsec event validate this condition

self = <tests.appsec.test_alpha.Test_Basic object at 0x7fb8cc2d0320>

    def test_headers(self):
        """Via server.request.headers.no_cookies"""
        # Note: we do not check the returned key_path nor rule_id for the alpha version
        address = "server.request.headers.no_cookies"
        pattern = "../"
>       interfaces.library.assert_waf_attack(self.r_headers_1, pattern=pattern, address=address)
...

ℹ️ Info

🧪 All tests passed

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 9f8ebf4
