
Enable PR-level performance quality gates#5571

Merged
igoragoli merged 21 commits into master from augusto/enable-perf-quality-gates on Apr 14, 2026

Conversation

igoragoli (Contributor) commented Apr 9, 2026

What does this PR do?

Adds a PR-level performance quality gate for microbenchmarks: `microbenchmarks-check-big-regressions`.

The job runs after the microbenchmarks complete and fails if any benchmark regresses by more than 20%, using `bp-runner bp-runner.fail-on-regression.yml --debug`.
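From the description and the commit messages, the gate job presumably looks something like this sketch in `.gitlab/benchmarks.yml` (the stage name and the `needs:` wiring are assumptions, not the PR's actual file):

```yaml
# Hypothetical sketch of the gate job; only the job name, the script
# command, and the "when: always" behavior come from this PR.
microbenchmarks-check-big-regressions:
  stage: benchmarks-gates        # assumed stage name
  needs:
    - job: microbenchmarks       # assumed wiring to the benchmark job
      artifacts: true
  script:
    - bp-runner bp-runner.fail-on-regression.yml --debug
  rules:
    - when: always               # run even if microbenchmarks failed
```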

Motivation:

APMSP-2545 Setup pre-release and PR level quality gates for Ruby

Change log entry

None.

Additional Notes:

Can we bypass this?
Yes. I added a comment with directions on how to do it:

```yaml
# Verify that the microbenchmarks-check-big-regressions CI job has passed. If any regression happened, merging this PR will be blocked.
# If bypassing is necessary, see https://datadoghq.atlassian.net/wiki/x/8YFzMwE for more details.
microbenchmarks-check-big-regressions:
```

Why 20%?

  • This is a default limit for regressions on a single PR.
  • We could have fixed limits via SLOs, like dd-trace-py does.
  • We could also shrink this regression threshold to make it more ambitious, but I'd rather first make sure the benchmarking jobs are up and running correctly and fast, and make the threshold more aggressive later.
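The gate's core check can be sketched in Ruby. This is a paraphrase of what a 20% regression limit means, not bp-runner's actual `fail_on_regression` implementation (the method names and the relative-change formula are assumptions):

```ruby
# Hypothetical sketch of a 20% regression gate. bp-runner's real logic
# (statistical handling, per-metric config) may differ.
REGRESSION_THRESHOLD = 0.20 # fail if a benchmark slows down by more than 20%

# Relative change of the candidate's execution time vs. the baseline's.
def relative_regression(baseline, candidate)
  (candidate - baseline) / baseline.to_f
end

def big_regression?(baseline, candidate)
  relative_regression(baseline, candidate) > REGRESSION_THRESHOLD
end

puts big_regression?(100.0, 115.0) # 15% slower: within the gate => false
puts big_regression?(100.0, 125.0) # 25% slower: gate fails      => true
```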

How to test the change?

microbenchmarks-check-big-regressions running after microbenchmarks in CI: https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-rb/-/jobs/1592412230

github-actions bot commented Apr 9, 2026

Thank you for updating Change log entry section 👏

Visited at: 2026-04-09 14:50:40 UTC

@igoragoli changed the title from "ci: scaffold macrobenchmark quality gates and auto-trigger benchmarks" to "ci: enable performance quality gates" on Apr 9, 2026
igoragoli (Contributor, Author) commented Apr 9, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@igoragoli added the AI Generated label ("Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos") on Apr 9, 2026
pr-commenter bot commented Apr 9, 2026

Benchmarks

Benchmark execution time: 2026-04-14 12:56:25

Comparing candidate commit 515d6bd in PR branch augusto/enable-perf-quality-gates with baseline commit 3260714 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, 1 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.
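The significance rule above can be sketched in Ruby (a paraphrase of the stated rule, not the benchmarking platform's actual code; interval bounds are expressed as fractions, so 1.2% is 0.012):

```ruby
# Sketch of the rule: a change is significant only if the whole
# confidence interval lies outside [-threshold, +threshold].
def significant?(ci_lower, ci_upper, threshold)
  ci_lower > threshold ||  # entirely above +threshold: significantly worse
    ci_upper < -threshold  # entirely below -threshold: significantly better
end

# An interval like [-0.6%, +1.2%] straddles 0%, so it is not
# significant against a 1% threshold:
puts significant?(-0.006, 0.012, 0.01)  # => false

# An interval like [1.3%, 3.1%] clears a 1% threshold entirely,
# so it counts as significantly worse:
puts significant?(0.013, 0.031, 0.01)   # => true
```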

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@igoragoli force-pushed the augusto/enable-perf-quality-gates branch 3 times, most recently from c866a4b to c3caecc, on April 10, 2026 08:31
@igoragoli igoragoli marked this pull request as ready for review April 14, 2026 08:19
@igoragoli igoragoli requested a review from a team as a code owner April 14, 2026 08:19
@igoragoli igoragoli changed the base branch from augusto/add-perf-quality-gate-dd-octo-sts-policy to master April 14, 2026 08:21
Adds a chainguard policy allowing GitLab CI to obtain a short-lived GitHub token with contents:read scope. Used by check-slo-breaches to track SLO threshold changes in git history.

Add macrobenchmarks-gates and macrobenchmarks-notify stages. Include check-slo-breaches and notify-slo-breaches templates from benchmarking-platform-tools. Add placeholder check-slo-breaches job that depends on all 8 macrobenchmark jobs.

Temporarily set macrobenchmarks to auto-trigger on all branches to collect baseline artifacts for SLO threshold generation.

Adds a quality gate that fails on microbenchmark regressions exceeding 20%. Uses bp-runner fail_on_regression step from benchmarking-platform. Runs after microbenchmarks with when: always to catch failures too. Set to allow_failure: true until thresholds are validated.

Replace check-slo-breaches placeholder with real fail_on_breach implementation. Add notify-slo-breaches job to alert on apm-dcs-performance-alerts. Generate 209 SLO thresholds across 42 scenarios using tight strategy (T=5%).

Revert macrobenchmarks to manual trigger on non-master branches.

Move microbenchmarks before macrobenchmarks so macro gates and notify stages are adjacent. Restrict check-slo-breaches and notify-slo-breaches to master only since non-master branches use manual macrobenchmarks.

Drop rules: block from check-slo-breaches and notify-slo-breaches. GitLab ignores top-level when: when rules: is present. Follow dd-trace-py pattern: use when: always with no rules.

Use rules: with when: always on master, default on_success on branches. Remove conflicting top-level when: always which GitLab ignores when rules: is present.
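The `rules:`/`when:` interplay these commit messages describe can be illustrated with a minimal sketch (job name from this PR; the branch condition is illustrative):

```yaml
# A top-level "when:" is ignored once "rules:" is present, so the
# condition has to live inside "rules:" itself (sketch, not the PR's file):
microbenchmarks-check-big-regressions:
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
      when: always        # on master, run even if earlier jobs failed
    - when: on_success    # on other branches, the default behavior
```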
Remove baseline scenarios (not actionable). Keep only:
- normal_operation: agg_http_req_duration p50/p99
- high_load: throughput
- utilization monitors: cpu_usage_percentage, rss

Drop data_received, data_sent, dropped_iterations, http_req_duration. Reduces from 209 to 66 thresholds across 36 scenarios.

Fix macrobenchmarks-notify-slo-breaches referencing wrong job name.

Move when: always into rules for microbenchmarks-check-big-regressions since GitLab ignores top-level when: when rules: is present.

Single-run SLO generation produced a tight RSS threshold (2.73 GB) that doesn't account for cross-run variance. Bump to 3.25 GB based on observed values across multiple runs.

Only PR-level microbenchmark regression checks are needed. Remove check-slo-breaches, notify-slo-breaches, SLO thresholds file, dd-octo-sts policy, and associated stages.
@igoragoli force-pushed the augusto/enable-perf-quality-gates branch from 43a7138 to 5ce82d2 on April 14, 2026 08:25
datadog-official bot commented Apr 14, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.35% (-0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 515d6bd | Docs | Datadog PR Page

Remove allow_failure from microbenchmarks-check-big-regressions.
Restore original stage order (macrobenchmarks before microbenchmarks).
@igoragoli added the dev/ci label ("Involves CircleCI, GitHub Actions, or GitLab") on Apr 14, 2026
@igoragoli changed the title from "ci: enable performance quality gates" to "Enable PR-level performance quality gates" on Apr 14, 2026
Comment thread .gitlab/benchmarks.yml Outdated
Store bp-runner.fail-on-regression.yml in the repo instead of cloning
benchmarking-platform at runtime. Drop redundant CI variable re-exports.
Makes the 20% regression threshold visible and configurable in this repo.
@igoragoli igoragoli merged commit 5cd7833 into master Apr 14, 2026
353 checks passed
@igoragoli igoragoli deleted the augusto/enable-perf-quality-gates branch April 14, 2026 13:25
@github-actions github-actions bot added this to the 2.31.0 milestone Apr 14, 2026