Add Kueue integration#23908
Conversation
|
79406d9 to
826b922
Compare
Start with a basic OpenMetrics V2 check that forwards endpoint metrics under the kueue namespace to enable early endpoint validation before adding curated mappings.
826b922 to
c111e07
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0cf8bd4cc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Opened DOCS-14629 to assign a Docs writer and follow up with editorial review. |
lucia-sb
left a comment
There was a problem hiding this comment.
Hi @gjulianm, this is a first quick review powered by Claude Code. This uses a team of agents reviewing different factors of the code changes such as code quality, functionality, correctness and other aspects I find important. Rules are tailored following my personal recommendations and the review has been first approved by me.
Please take a look at the comments and decide whether they should be implemented or not. When deciding not to implement a comment make sure to say why, I will be reviewing both the code and your comments personally. This is a first iteration trying to catch the most important things.
My Feedback Legend
Here's a quick guide to the prefixes I use in my comments:
praise: no action needed, just celebrate!
note: just a comment/information, no need to take any action.
question: I need clarification or I'm seeking to understand your approach.
nit: A minor, non-blocking issue (e.g., style, typo). Feel free to ignore.
suggestion: I'm proposing an improvement. This is optional but recommended.
request: A change I believe is necessary before this can be merged.
The only blocking comments are request, any other type of comment can be applied at discretion of the developer.
| @@ -0,0 +1,17 @@ | |||
| [ | |||
There was a problem hiding this comment.
Question: Service checks are soft-deprecated. Since this is a new integration, should service_checks.json be present at all? The kueue.openmetrics.health check is emitted automatically by OpenMetricsBaseCheckV2 (it cannot be suppressed without disabling the base behavior), but declaring it here may be unnecessary for a new integration. Is this file required by the publishing pipeline, or should it be removed?
There was a problem hiding this comment.
Actually I couldn't remove it CI validation fails without it.
Use non-default service/pod subnets so the kind cluster's API service IP does not collide with the host environment's Kubernetes networking, which hijacked in-cluster traffic and broke Kueue's webhook cert bootstrap. Also scope the LocalQueue readiness wait to the default namespace.
Rename the generic Go version label before submission so E2E metrics pass tag validation.
Relax metric tag assertions to match the actual tag set emitted by the controller (endpoint, replica_role, cohort tags) instead of pinning an exact subset, and add the missing assets/service_checks.json (with its manifest reference) that assert_service_checks requires.
The controller deployment can report `Available` before its webhook server is actually serving, causing intermittent `connection refused` failures when applying ResourceFlavor/ClusterQueue. Wait for the webhook service endpoints and retry the apply to absorb the brief cert-propagation window.
Metric descriptions referenced the raw Prometheus labels ('cluster_queue',
'local_queue'/'localQueue') instead of the tags Datadog actually emits after
remapping ('kueue_cluster_queue', 'kueue_local_queue').
The raw kueue_pending_workloads metric has no cluster_queue in its name, so the cluster_queue. prefix was inconsistent with every other cluster-queue- indexed metric (which keep bare names and just carry the kueue_cluster_queue tag). Drop the prefix to match the source name and the rest of the convention.
|
Hi @lucia-sb, reviewed and resolved all the comments except for three, let me know if everything is ok now. |
drichards-87
left a comment
There was a problem hiding this comment.
Left suggestions from Docs and approved the PR.
Review from drichards-87 is dismissed. Related teams and files:
- documentation
- kueue/README.md
- kueue/assets/service_checks.json
- kueue/metadata.csv
Validation ReportAll 21 validations passed. Show details
|
What does this PR do?
Adds a Kueue OpenMetrics integration with curated queue, workload, controller, runtime, and resource-specific metrics for GPU/CPU quota and usage.
Motivation
Kueue exposes queue admission and resource accounting metrics that should be available in Datadog with stable names and useful queue/resource tags.
Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is mergedValidation:
ddev --no-interactive test kueue.