output: engine: Add metrics for backpressure durations #11529

Merged
edsiper merged 2 commits into master from cosmo0920-add-metrics-for-backpressure-durations
Mar 21, 2026

cosmo0920 (Contributor) commented Mar 9, 2026

To observe backpressure, this PR adds backpressure wait metrics to the output and engine code.
These metrics give a clue as to which plugin is heavily loaded or affected by backpressure.
The metric is collected per plugin name, so it will not cause a cardinality explosion.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a new histogram metric to track output backpressure wait times. This metric records how long outputs wait during backpressure-triggered retries, providing visibility into retry latency patterns across different outputs.

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@cosmo0920 cosmo0920 requested a review from edsiper as a code owner March 9, 2026 09:52
coderabbitai bot commented Mar 9, 2026

Walkthrough

The changes introduce a new metric histogram to track backpressure wait times in Fluent Bit output instances. A new cmt_backpressure_wait histogram field is added to the output instance structure, initialized with predefined buckets during setup, and updated with retry duration measurements when backpressure occurs.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Output Instance Structure: include/fluent-bit/flb_output.h | Added cmt_backpressure_wait histogram field to the flb_output_instance struct for storing backpressure wait metric handle. |
| Output Initialization: src/flb_output.c | Introduced output_backpressure_wait_buckets bucket array and histogram creation logic in initialization flow; includes error handling for bucket creation failure. |
| Engine Retry Metrics: src/flb_engine.c | Added histogram observation call during retry scheduling to record backpressure wait duration labeled by output name. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

docs-required

Suggested reviewers

  • edsiper

Poem

🐰 A histogram hops into the scene,
To track what backpressure's been!
Each retry's wait, now measured and bright,
Output delays captured just right! 📊

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'output: engine: Add metrics for backpressure durations' directly and clearly describes the main change: adding new metrics for observing backpressure wait durations across output and engine components. |

@cosmo0920 cosmo0920 added this to the Fluent Bit v5.0 milestone Mar 9, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
src/flb_output.c (1)

51-53: Sub-second buckets may be unused.

Based on the scheduler code in flb_sched_request_create, retry_seconds is computed as backoff_full_jitter(...) + 1, which always returns an integer >= 1. The sub-second buckets (0.010 through 0.500) will never capture any values.

Consider whether these buckets are intentionally included for future use or if they could be simplified to start at 1.0.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/flb_output.c` around lines 51 - 53, The array
output_backpressure_wait_buckets currently includes sub-second values that will
never be selected because flb_sched_request_create computes retry_seconds as
backoff_full_jitter(...) + 1 which yields integers >= 1; either remove the
unused sub-second entries or adjust the jitter calculation—decide on intended
behavior and implement accordingly: if sub-second granularity is not needed,
modify output_backpressure_wait_buckets to start at 1.0 (e.g., {1.0, 2.0, 5.0,
...}); if sub-second retries are intended, change the retry_seconds computation
in flb_sched_request_create/backoff_full_jitter to allow fractional values
without the +1 bias and ensure retry_seconds remains compatible with any callers
that expect integer seconds. Reference symbols:
output_backpressure_wait_buckets, flb_sched_request_create, retry_seconds,
backoff_full_jitter.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9389483-f31c-44d3-8249-5abf8607441e

📥 Commits

Reviewing files that changed from the base of the PR and between a1d9c2a and 1854229.

📒 Files selected for processing (3)
  • include/fluent-bit/flb_output.h
  • src/flb_engine.c
  • src/flb_output.c

@edsiper edsiper merged commit fc8dbd4 into master Mar 21, 2026
2 checks passed
@edsiper edsiper deleted the cosmo0920-add-metrics-for-backpressure-durations branch March 21, 2026 22:12
edsiper (Member) commented Mar 21, 2026

@cosmo0920 we will need to update the docs for this
