
Add Kueue integration #23908

Open
gjulianm wants to merge 23 commits into master from guillermo.julian/kueue

Conversation

@gjulianm
Contributor

@gjulianm gjulianm commented Jun 2, 2026

What does this PR do?

Adds a Kueue OpenMetrics integration with curated queue, workload, controller, runtime, and resource-specific metrics for GPU/CPU quota and usage.

Motivation

Kueue exposes queue admission and resource accounting metrics that should be available in Datadog with stable names and useful queue/resource tags.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Validation: ddev --no-interactive test kueue.

@datadog-prod-us1-6

datadog-prod-us1-6 Bot commented Jun 2, 2026

Pipelines  Tests  Code Coverage

⚠️ Warnings

🚦 4 Pipeline jobs failed

PR All | test / j06ca546 / SNMP (GitHub Actions)

Error: Connection error: Max retries exceeded while trying to reach 'ddintegrations.blob.core.windows.net'. Failed to resolve host.

PR All | test / j46da136 / JBoss_WildFly (GitHub Actions)

Error: Could not resolve host 'ddintegrations.blob.core.windows.net' during start-up command execution.

PR All | test / j5a9585a / IBM ACE (GitHub Actions)

Error: Could not resolve host: ddintegrations.blob.core.windows.net while trying to download files.

View all 4 failed jobs.

🧪 20 Tests failed in 1 job

PR All | run   GitHub Actions

test_bulk_table from test_check.py
HTTPSConnectionPool(host='ddintegrations.blob.core.windows.net', port=443): Max retries exceeded with url: /snmp/cisco-3850.snmprec (Caused by NameResolutionError("HTTPSConnection(host='ddintegrations.blob.core.windows.net', port=443): Failed to resolve 'ddintegrations.blob.core.windows.net' ([Errno -2] Name or service not known)"))
test_cast_metrics from test_check.py
HTTPSConnectionPool(host='ddintegrations.blob.core.windows.net', port=443): Max retries exceeded with url: /snmp/cisco-3850.snmprec (Caused by NameResolutionError("HTTPSConnection(host='ddintegrations.blob.core.windows.net', port=443): Failed to resolve 'ddintegrations.blob.core.windows.net' ([Errno -2] Name or service not known)"))

View all 20 test failures

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 84.78%
Overall Coverage: 88.61% (+1.24%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: c0e4d3f

@gjulianm gjulianm force-pushed the guillermo.julian/kueue branch from 826b922 to c111e07 Compare June 3, 2026 10:16
@gjulianm gjulianm marked this pull request as ready for review June 3, 2026 11:47
@gjulianm gjulianm requested review from a team as code owners June 3, 2026 11:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0cf8bd4cc

Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/metadata.csv Outdated
@joepeeples
Contributor

Opened DOCS-14629 to assign a Docs writer and follow up with editorial review.

@joepeeples joepeeples added the "editorial review" label (Waiting on a more in-depth review from a docs team editor) Jun 3, 2026
lucia-sb

This comment was marked as duplicate.

Contributor

@lucia-sb lucia-sb left a comment

Hi @gjulianm, this is a first quick review powered by Claude Code. It uses a team of agents that review different aspects of the code changes, such as code quality, functionality, correctness, and other factors I find important. The rules are tailored to my personal recommendations, and I approved the review before posting it.

Please take a look at the comments and decide whether they should be implemented. If you decide not to implement a comment, make sure to say why; I will personally review both the code and your replies. This is a first iteration that tries to catch the most important things.

My Feedback Legend

Here's a quick guide to the prefixes I use in my comments:

praise: no action needed, just celebrate!
note: just a comment/information, no need to take any action.
question: I need clarification or I'm seeking to understand your approach.
nit: A minor, non-blocking issue (e.g., style, typo). Feel free to ignore.
suggestion: I'm proposing an improvement. This is optional but recommended.
request: A change I believe is necessary before this can be merged.

The only blocking comments are those marked request; any other type of comment can be applied at the developer's discretion.

Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/datadog_checks/kueue/check.py Outdated
Comment thread kueue/tests/fixtures/metrics.txt Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/changelog.d/23908.added Outdated
Comment thread kueue/hatch.toml
@@ -0,0 +1,17 @@
[
Contributor

Question: Service checks are soft-deprecated. Since this is a new integration, should service_checks.json be present at all? The kueue.openmetrics.health check is emitted automatically by OpenMetricsBaseCheckV2 (it cannot be suppressed without disabling the base behavior), but declaring it here may be unnecessary for a new integration. Is this file required by the publishing pipeline, or should it be removed?

Contributor Author

Removed.

Contributor Author

Actually, I couldn't remove it: CI validation fails without it.

gjulianm added 9 commits June 4, 2026 11:39

Use non-default service/pod subnets so the kind cluster's API service IP does not collide with the host environment's Kubernetes networking, which hijacked in-cluster traffic and broke Kueue's webhook cert bootstrap. Also scope the LocalQueue readiness wait to the default namespace.

Rename the generic Go version label before submission so E2E metrics pass tag validation.

Relax metric tag assertions to match the actual tag set emitted by the controller (endpoint, replica_role, cohort tags) instead of pinning an exact subset, and add the missing assets/service_checks.json (with its manifest reference) that assert_service_checks requires.

The controller deployment can report `Available` before its webhook server is actually serving, causing intermittent `connection refused` failures when applying ResourceFlavor/ClusterQueue. Wait for the webhook service endpoints and retry the apply to absorb the brief cert-propagation window.

Metric descriptions referenced the raw Prometheus labels ('cluster_queue', 'local_queue'/'localQueue') instead of the tags Datadog actually emits after remapping ('kueue_cluster_queue', 'kueue_local_queue').

The raw kueue_pending_workloads metric has no cluster_queue in its name, so the cluster_queue. prefix was inconsistent with every other cluster-queue-indexed metric (which keep bare names and just carry the kueue_cluster_queue tag). Drop the prefix to match the source name and the rest of the convention.
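
The label-to-tag remapping mentioned in the last two commit messages can be sketched in plain Python. This is only an illustration, not the integration's code: `LABEL_TO_TAG` and `remap_labels` are hypothetical names, the sample label values are made up, and only the `cluster_queue`/`local_queue` to `kueue_cluster_queue`/`kueue_local_queue` pairs come from the messages above. Mapping the camelCase `localQueue` variant the same way is an assumption.

```python
# Hypothetical sketch: rename raw Prometheus labels to the Datadog tag names
# quoted in the commit messages. Not the integration's actual implementation.
LABEL_TO_TAG = {
    "cluster_queue": "kueue_cluster_queue",
    "local_queue": "kueue_local_queue",
    "localQueue": "kueue_local_queue",  # assumption: camelCase variant maps the same way
}


def remap_labels(labels):
    """Return sorted 'key:value' tags, renaming known labels and keeping the rest."""
    tags = []
    for name, value in labels.items():
        tag_name = LABEL_TO_TAG.get(name, name)  # unknown labels pass through unchanged
        tags.append(f"{tag_name}:{value}")
    return sorted(tags)


# Made-up sample labels for illustration only.
sample = {"cluster_queue": "team-a", "status": "active"}
print(remap_labels(sample))
```

With a mapping like this, the metric name can stay bare (for example `pending_workloads`) while the queue identity travels in the `kueue_cluster_queue` tag, which is the convention the final commit message describes.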
Copy link
Copy Markdown
Contributor Author

gjulianm commented Jun 4, 2026

Hi @lucia-sb, I reviewed and resolved all the comments except for three. Let me know if everything is OK now.

drichards-87
drichards-87 previously approved these changes Jun 4, 2026
Contributor

@drichards-87 drichards-87 left a comment

Left suggestions from Docs and approved the PR.

Comment thread kueue/assets/service_checks.json Outdated
Comment thread kueue/README.md Outdated
Comment thread kueue/README.md Outdated
Comment thread kueue/README.md Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/metadata.csv Outdated
Comment thread kueue/metadata.csv Outdated
@temporal-github-worker-1 temporal-github-worker-1 Bot dismissed drichards-87’s stale review June 5, 2026 06:18

Review from drichards-87 is dismissed. Related teams and files:

  • documentation
    • kueue/README.md
    • kueue/assets/service_checks.json
    • kueue/metadata.csv
@dd-octo-sts
Contributor

dd-octo-sts Bot commented Jun 5, 2026

Validation Report

All 21 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and code coverage settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| qa-label | Validate the pull request declares whether it needs QA for the next Agent release | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

4 participants