output: engine: Add metrics for backpressure durations #11529
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
📝 Walkthrough
The changes introduce a new metric histogram to track backpressure wait times in Fluent Bit output instances. A new …
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (1)
src/flb_output.c (1)
Lines 51-53: Sub-second buckets may be unused. Based on the scheduler code in `flb_sched_request_create`, `retry_seconds` is computed as `backoff_full_jitter(...) + 1`, which always returns an integer >= 1. The sub-second buckets (0.010 through 0.500) will never capture any values.
Consider whether these buckets are intentionally included for future use or could be simplified to start at 1.0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/flb_output.c` around lines 51-53, the array `output_backpressure_wait_buckets` currently includes sub-second values that will never be selected, because `flb_sched_request_create` computes `retry_seconds` as `backoff_full_jitter(...) + 1`, which yields integers >= 1. Either remove the unused sub-second entries or adjust the jitter calculation. Decide on the intended behavior and implement it accordingly: if sub-second granularity is not needed, modify `output_backpressure_wait_buckets` to start at 1.0 (e.g., `{1.0, 2.0, 5.0, ...}`); if sub-second retries are intended, change the `retry_seconds` computation in `flb_sched_request_create`/`backoff_full_jitter` to allow fractional values without the +1 bias, and ensure `retry_seconds` remains compatible with any callers that expect integer seconds. Reference symbols: `output_backpressure_wait_buckets`, `flb_sched_request_create`, `retry_seconds`, `backoff_full_jitter`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a9389483-f31c-44d3-8249-5abf8607441e
📒 Files selected for processing (3)
- `include/fluent-bit/flb_output.h`
- `src/flb_engine.c`
- `src/flb_output.c`
@cosmo0920 we will need to update the docs for this.
To observe backpressure status, we need to add backpressure wait metrics to the output and engine.
These metrics can indicate which plugins are heavily loaded or affected by backpressure.
The metrics are collected per plugin name, so they will not cause a cardinality explosion.
Enter `[N/A]` in the box if an item is not applicable to your change.
Testing
Before we can approve your change, please submit the following in a comment:
If this is a change to packaging of containers or native binaries, then please confirm it works for all targets.
`ok-package-test` label to test for all targets (requires maintainer to do).
Documentation
Backporting
Fluent Bit is licensed under Apache 2.0; by submitting this pull request, I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Release Notes