unflake a System Test: 'test_remote_sampling_rules_retention'#6342
unflake a System Test: 'test_remote_sampling_rules_retention'#6342vlad-scherbich wants to merge 4 commits intomainfrom
Conversation
|
|
| for _ in range(30): | ||
| trace = send_and_wait_trace(test_library, test_agent, name="test", service="foo") | ||
| span = find_first_span_in_trace_payload(trace) | ||
| if span["metrics"].get("_dd.rule_psr", 1.0) == pytest.approx(0.1): | ||
| break | ||
| time.sleep(0.1) |
There was a problem hiding this comment.
If I understand correctly, we try up to 30 times (waiting 0.1 seconds between each attempt) to see a span whose metadata shows that the new sampling rules were correctly received and applied?
b65f71b to
d22ddf4
Compare
| assert_sampling_rate(trace, 0.1) | ||
| # After updating the RC config, the library may briefly still be applying the | ||
| # previous sampling rules. set_and_wait_rc waits for telemetry and RC acknowledgment, | ||
| # but these signals can be satisfied by stale events from the prior config, causing a |
There was a problem hiding this comment.
Would it be possible to improve the wait_rc condition to wait for the precise new config ?
There was a problem hiding this comment.
@cbeauchesne , I like this idea.
set_and_wait_rc is just a helper function used only in test_dynamic_configuration.py at 20 call sites.
I'll open a separate PR for the proposed fix, which would be to clear the session before setting the new RC config.
There was a problem hiding this comment.
@cbeauchesne , new PR implementing your suggestion: #6349
Lots of unrelated system test fails, do I just re-run them until they pass?
|
✨ Fix all issues with BitsAI or with Cursor
|
https://datadoghq.atlassian.net/browse/PROF-13796
Motivation
test_remote_sampling_rules_retentionfails intermittently in CI. Observed flaking in a dd-trace-py PR that only changes CI workflow config (zero relation to test logic). See failed job.The root cause is a race condition in
set_and_wait_rc: its wait signals (telemetry event + RC ACKNOWLEDGED state) are not config-version-specific, so stale events from a prior RC config can satisfy them before the new sampling rules are actually active in the library.Changes
Adds a retry loop around the first trace assertion in
test_remote_sampling_rules_retentionto tolerate the brief propagation window between RC acknowledgment and actual rule application. This is consistent with how other tests in the same file handle similar timing sensitivity (e.g.,get_sampled_trace).Testing
Workflow
🚀 Once your PR is reviewed and the CI green, you can merge it!
🛟 #apm-shared-testing 🛟
Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P teambuild-XXX-imagelabel is present