disk: apply device_include/exclude filtering to latency metrics#23928
disk: apply device_include/exclude filtering to latency metrics#23928gouri-yerra wants to merge 6 commits into
Conversation
collect_latency_metrics() was emitting read_time/write_time metrics for every disk regardless of device_include/device_exclude configuration, while space metrics were correctly filtered. This caused unexpected metrics from excluded devices to appear in customer data. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52db952835
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| def collect_latency_metrics(self): | ||
| for disk_name, disk in psutil.disk_io_counters(perdisk=True).items(): | ||
| self.log.debug('IO Counters: %s -> %s', disk_name, disk) | ||
| if self._device_excluded(disk_name) or not self._device_included(disk_name): |
There was a problem hiding this comment.
Match latency filters against partition device paths
On Linux, the documented device_include/device_exclude values match disk_partitions() device paths such as /dev/sda1, but disk_io_counters(perdisk=True) keys are bare names such as sda1. With a common config like device_exclude: ['/dev/sdb1'], this check still emits latency metrics for sdb1; with device_include: ['/dev/sda1'], it drops the included device's latency metrics because the regex is evaluated only against the bare disk_name. The filter needs to account for the same device path format used by partition metrics.
Useful? React with 👍 / 👎.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e tag Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| def collect_latency_metrics(self): | ||
| for disk_name, disk in psutil.disk_io_counters(perdisk=True).items(): | ||
| self.log.debug('IO Counters: %s -> %s', disk_name, disk) | ||
| if self._device_excluded(disk_name) or not self._device_included(disk_name): |
There was a problem hiding this comment.
Filtering is applied to the raw disk_name from psutil.disk_io_counters(perdisk=True), but device_include/device_exclude patterns are often authored against partition-style device paths. Can we make the matching semantics explicit here (normalization or explicit IO-key matching contract)?
Concrete example:
device_include: ['/dev/sda1']disk_io_counterskeys:{'sda1': ...}
Current behavior skips latency metrics for that device because _device_included('sda1') == False.
Please either normalize before predicate checks or document and lock this behavior with a unit test.
There was a problem hiding this comment.
Good catch. Switched from applying regex patterns directly to IO counter keys to building a set of _base_device_name values from the partitions that already passed exclude_disk() in check(). Since _base_device_name('/dev/sda1') == _base_device_name('sda1') == 'sda1' on Linux, this normalises both sides and bridges the full-path vs bare-name gap. The filter is only active when device_include or device_exclude is configured, so unfiltered configs are unchanged.
iliakur
left a comment
There was a problem hiding this comment.
Requesting changes based on the inline comment about device filter matching semantics in collect_latency_metrics (raw disk_io_counters key format vs configured device patterns).
| def collect_latency_metrics(self): | ||
| for disk_name, disk in psutil.disk_io_counters(perdisk=True).items(): | ||
| self.log.debug('IO Counters: %s -> %s', disk_name, disk) | ||
| if self._device_excluded(disk_name) or not self._device_included(disk_name): |
There was a problem hiding this comment.
Could you clarify intended precedence when both filters match the same device?
Current behavior in this change is:
collect_latency_metricschecksself._device_excluded(disk_name)first- then checks
not self._device_included(disk_name) - so if a device matches both include and exclude, it is skipped (exclude wins)
Is that precedence intended and part of the expected contract for latency metrics?
There was a problem hiding this comment.
No longer relevant, exclude/include precedence is handled entirely by exclude_disk() in check(), which is the existing well-tested logic. The new collect_latency_metrics path just checks membership in the pre-filtered set.
Instead of applying device_include/device_exclude regexes directly to disk_io_counters keys (which use bare names on Linux like 'sda1' while patterns are written against full paths like '/dev/sda1'), build a set of allowed base device names from the partitions that passed exclude_disk() in check(). _base_device_name normalises both sides so the match is format-agnostic. Filter is only active when device_include or device_exclude is configured, preserving original behaviour otherwise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_base_device_name on Windows strips backslashes only, so /dev/sda1 stays as /dev/sda1 and does not normalise to match the bare io_counters key sda1. Use Windows drive-letter format on Windows runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Validation ReportAll 21 validations passed. Show details
|
What does this PR do?
Applies
device_include/device_excludefiltering tocollect_latency_metrics()so thatsystem.disk.read_time,system.disk.write_time,system.disk.read_time_pct, andsystem.disk.write_time_pctrespect the same device filtering as space metrics.Motivation
collect_latency_metrics()iterated overpsutil.disk_io_counters(perdisk=True)and emitted IO metrics for every disk on the system with no filtering applied, while space metrics (system.disk.total,used,free, etc.) correctly honoureddevice_include/device_exclude. This caused customers who configureddevice_includeto still receive latency metrics tagged with excluded devices (e.g.device:c:appearing even when onlydevice:u:was configured).Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged