Skip to content

disk: apply device_include/exclude filtering to latency metrics#23928

Open
gouri-yerra wants to merge 6 commits into
masterfrom
gouri/disk-latency-metrics-device-filter
Open

disk: apply device_include/exclude filtering to latency metrics#23928
gouri-yerra wants to merge 6 commits into
masterfrom
gouri/disk-latency-metrics-device-filter

Conversation

@gouri-yerra
Copy link
Copy Markdown
Contributor

What does this PR do?

Applies device_include/device_exclude filtering to collect_latency_metrics() so that system.disk.read_time, system.disk.write_time, system.disk.read_time_pct, and system.disk.write_time_pct respect the same device filtering as space metrics.

Motivation

collect_latency_metrics() iterated over psutil.disk_io_counters(perdisk=True) and emitted IO metrics for every disk on the system with no filtering applied, while space metrics (system.disk.total, used, free, etc.) correctly honoured device_include/device_exclude. This caused customers who configured device_include to still receive latency metrics tagged with excluded devices (e.g. device:c: appearing even when only device:u: was configured).

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

collect_latency_metrics() was emitting read_time/write_time metrics for
every disk regardless of device_include/device_exclude configuration,
while space metrics were correctly filtered. This caused unexpected
metrics from excluded devices to appear in customer data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gouri-yerra gouri-yerra self-assigned this Jun 3, 2026
@datadog-datadog-prod-us1
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1 Bot commented Jun 3, 2026

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 5 Pipeline jobs failed

PR | test / check   View in Datadog   GitHub Actions

PR | test / test (linux, ubuntu-22.04, disk, Disk on Linux (py3.13), py3.13) / Disk on Linux (py3.13)-py3.13   View in Datadog   GitHub Actions

PR | test / test (windows, windows-2022, disk, Disk on Windows) / Disk on Windows   View in Datadog   GitHub Actions

View all 5 failed jobs.

🧪 2 Tests failed in 1 job

PR | run   GitHub Actions

test_latency_metrics_respect_device_include from test_unit.py   View in Datadog (Fix with Cursor)
Needed exactly 1 candidates for &#39;system.disk.read_time&#39;, got 0
Expected:
        MetricStub(name=&#39;system.disk.read_time&#39;, type=None, value=None, tags=[&#39;device:sda1&#39;, &#39;device_name:sda1&#39;], hostname=None, device=None, flush_first_value=None)
Difference to closest:


Similar submitted:
Score   Most similar

test_latency_metrics_respect_device_include from test_unit.py   View in Datadog (Fix with Cursor)
Needed exactly 0 candidates for &#39;system.disk.read_time&#39;, got 1
Expected:
        MetricStub(name=&#39;system.disk.read_time&#39;, type=None, value=None, tags=[&#39;device:D:&#39;, &#39;device_name:d:&#39;], hostname=None, device=None, flush_first_value=None)
Difference to closest:


Similar submitted:
Score   Most similar
1.00    MetricStub(name=&#39;system.disk.read_time&#39;, type=3, value=50.0, tags=[&#39;device:D:&#39;, &#39;device_name:d:&#39;], hostname=&#39;&#39;, device=None, flush_first_value=False)
0.98    MetricStub(name=&#39;system.disk.read_time&#39;, type=3, value=50.0, tags=[&#39;device:C:&#39;, &#39;device_name:c:&#39;], hostname=&#39;&#39;, device=None, flush_first_value=False)
...

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 38ceb6f | Docs | Datadog PR Page | Give us feedback!

@gouri-yerra gouri-yerra marked this pull request as ready for review June 3, 2026 22:35
@gouri-yerra gouri-yerra requested review from a team as code owners June 3, 2026 22:35
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52db952835

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread disk/datadog_checks/disk/disk.py Outdated
def collect_latency_metrics(self):
for disk_name, disk in psutil.disk_io_counters(perdisk=True).items():
self.log.debug('IO Counters: %s -> %s', disk_name, disk)
if self._device_excluded(disk_name) or not self._device_included(disk_name):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match latency filters against partition device paths

On Linux, the documented device_include/device_exclude values match disk_partitions() device paths such as /dev/sda1, but disk_io_counters(perdisk=True) keys are bare names such as sda1. With a common config like device_exclude: ['/dev/sdb1'], this check still emits latency metrics for sdb1; with device_include: ['/dev/sda1'], it drops the included device's latency metrics because the regex is evaluated only against the bare disk_name. The filter needs to account for the same device path format used by partition metrics.

Useful? React with 👍 / 👎.

gouri-yerra and others added 2 commits June 4, 2026 09:58
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e tag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread disk/datadog_checks/disk/disk.py Outdated
def collect_latency_metrics(self):
for disk_name, disk in psutil.disk_io_counters(perdisk=True).items():
self.log.debug('IO Counters: %s -> %s', disk_name, disk)
if self._device_excluded(disk_name) or not self._device_included(disk_name):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering is applied to the raw disk_name from psutil.disk_io_counters(perdisk=True), but device_include/device_exclude patterns are often authored against partition-style device paths. Can we make the matching semantics explicit here (normalization or explicit IO-key matching contract)?

Concrete example:

  • device_include: ['/dev/sda1']
  • disk_io_counters keys: {'sda1': ...}

Current behavior skips latency metrics for that device because _device_included('sda1') == False.
Please either normalize before predicate checks or document and lock this behavior with a unit test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Switched from applying regex patterns directly to IO counter keys to building a set of _base_device_name values from the partitions that already passed exclude_disk() in check(). Since _base_device_name('/dev/sda1') == _base_device_name('sda1') == 'sda1' on Linux, this normalises both sides and bridges the full-path vs bare-name gap. The filter is only active when device_include or device_exclude is configured, so unfiltered configs are unchanged.

Copy link
Copy Markdown
Contributor

@iliakur iliakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes based on the inline comment about device filter matching semantics in collect_latency_metrics (raw disk_io_counters key format vs configured device patterns).

Comment thread disk/datadog_checks/disk/disk.py Outdated
def collect_latency_metrics(self):
for disk_name, disk in psutil.disk_io_counters(perdisk=True).items():
self.log.debug('IO Counters: %s -> %s', disk_name, disk)
if self._device_excluded(disk_name) or not self._device_included(disk_name):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify intended precedence when both filters match the same device?

Current behavior in this change is:

  • collect_latency_metrics checks self._device_excluded(disk_name) first
  • then checks not self._device_included(disk_name)
  • so if a device matches both include and exclude, it is skipped (exclude wins)

Is that precedence intended and part of the expected contract for latency metrics?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant, exclude/include precedence is handled entirely by exclude_disk() in check(), which is the existing well-tested logic. The new collect_latency_metrics path just checks membership in the pre-filtered set.

gouri-yerra and others added 2 commits June 5, 2026 13:55
Instead of applying device_include/device_exclude regexes directly to
disk_io_counters keys (which use bare names on Linux like 'sda1' while
patterns are written against full paths like '/dev/sda1'), build a set
of allowed base device names from the partitions that passed exclude_disk()
in check(). _base_device_name normalises both sides so the match is
format-agnostic. Filter is only active when device_include or device_exclude
is configured, preserving original behaviour otherwise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_base_device_name on Windows strips backslashes only, so /dev/sda1
stays as /dev/sda1 and does not normalise to match the bare io_counters
key sda1. Use Windows drive-letter format on Windows runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented Jun 5, 2026

Validation Report

All 21 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and code coverage settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
qa-label Validate the pull request declares whether it needs QA for the next Agent release
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants