ROSAENG-60057: Add alert silence pre/post steps for production FVT #80929
@dustman9000: This pull request references ROSAENG-60057 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
|
No actionable comments were generated in the recent review. 🎉
Walkthrough
Two new CI step-registry steps are introduced: rosa-e2e-silence-alerts and rosa-e2e-unsilence-alerts.
Changes
Alert Silence/Unsilence Workflow
Sequence Diagram
sequenceDiagram
rect rgba(100, 149, 237, 0.5)
Note over SilenceScript,RHOBS: Pre-step: rosa-e2e-silence-alerts
participant SilenceScript as rosa-e2e-silence-alerts
participant OIDC as OIDC Token Endpoint
participant RHOBS as RHOBS Cell
SilenceScript->>SilenceScript: Load RHOBS_ENV config
SilenceScript->>OIDC: POST client_credentials
OIDC-->>SilenceScript: access_token
loop For each RHOBS cell
SilenceScript->>RHOBS: POST /api/metrics/v1/hcp/am/api/v2/silences
RHOBS-->>SilenceScript: {silenceID}
SilenceScript->>SilenceScript: Append CELL|silenceID to SHARED_DIR
end
end
rect rgba(144, 238, 144, 0.5)
Note over Job: Main job execution
participant Job as ocm-fvt-periodic-cs-rosa-hcp-ad-production-main
Job->>Job: Run test steps
end
rect rgba(255, 165, 0, 0.5)
Note over UnsilenceScript,RHOBS: Post-step: rosa-e2e-unsilence-alerts
participant UnsilenceScript as rosa-e2e-unsilence-alerts
UnsilenceScript->>UnsilenceScript: Read SHARED_DIR/silence-ids
alt File exists and not empty
UnsilenceScript->>OIDC: POST client_credentials
OIDC-->>UnsilenceScript: access_token
loop For each CELL|SILENCE_ID
UnsilenceScript->>RHOBS: DELETE /api/metrics/v1/hcp/am/api/v2/silences/{SILENCE_ID}
RHOBS-->>UnsilenceScript: 200 OK
end
end
end
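The handoff between the pre-step and the post-step in the diagram can be sketched as follows. This is a minimal illustration, not the PR's actual scripts: the cell URLs and silence IDs are hypothetical, a temporary directory stands in for the CI-provided SHARED_DIR, and the DELETE URLs are only printed, never sent.

```shell
set -eu

# Stand-in for the CI-provided SHARED_DIR (assumption: a temp dir for this sketch).
SHARED_DIR="$(mktemp -d)"
SILENCE_FILE="${SHARED_DIR}/silence-ids"

# Pre-step side: append one CELL|SILENCE_ID record per created silence.
# '|' is the separator because cell URLs themselves contain ':'.
record_silence() {
  printf '%s|%s\n' "$1" "$2" >> "${SILENCE_FILE}"
}

: > "${SILENCE_FILE}"
record_silence "https://cell-a.example.com" "11111111-aaaa"   # hypothetical values
record_silence "https://cell-b.example.com" "22222222-bbbb"   # hypothetical values

# Post-step side: read the records back and rebuild one DELETE URL per silence.
delete_urls=""
while IFS='|' read -r cell silence_id; do
  [ -n "${cell}" ] || continue
  delete_urls="${delete_urls}${cell}/api/metrics/v1/hcp/am/api/v2/silences/${silence_id}
"
done < "${SILENCE_FILE}"

printf '%s' "${delete_urls}"
```

Because the record is split on '|' only, the scheme and any port in the cell URL stay intact when the post-step reconstructs the endpoint.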
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 15 passed
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml (1)
69-69: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Fix standalone redirection syntax.
Line 69 uses a standalone redirection without a command, which shellcheck flags as SC2188. While many shells accept this, it's non-portable and should be written with an explicit no-op command.
🔧 Proposed fix
-> "${SHARED_DIR}/silence-ids"
+: > "${SHARED_DIR}/silence-ids"
Alternatively, use `true >` or `echo -n >`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml` at line 69, the standalone redirection syntax is flagged as SC2188 by shellcheck and is non-portable. Replace the standalone redirection with an explicit no-op command followed by the redirection operator, such as using `true >` or `echo -n >` to make the syntax portable across different shells while maintaining the same behavior.
Source: Linters/SAST tools
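The proposed no-op form can be checked in isolation. The sketch below assumes a throwaway temp file in place of the real silence-ids file and only demonstrates that `: >` creates or truncates its target.

```shell
set -eu

# Throwaway file standing in for ${SHARED_DIR}/silence-ids (assumption for this sketch).
workdir="$(mktemp -d)"
ids_file="${workdir}/silence-ids"

# Simulate stale content left over from an earlier run.
printf 'stale|entry\n' > "${ids_file}"

# ':' is the POSIX no-op builtin; the redirection truncates (or creates) the file.
: > "${ids_file}"

size="$(wc -c < "${ids_file}" | tr -d ' ')"
echo "size after truncation: ${size}"
```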
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml`:
- Around line 31-39: After modifying the
openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml configuration
file to add allow_best_effort_post_steps and update the pre and post step
references, you must run the `make update` command in the repository root to
regenerate the downstream Prow job definitions and metadata artifacts. This
ensures that all generated configurations stay synchronized with the source CI
operator configuration changes.
In
`@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml`:
- Line 12: The timeout field in the rosa-e2e-silence-alerts step is set too low
at 2m0s to accommodate the sequential RHOBS cell requests. Increase the timeout
value from 2m0s to either 3m or 5m to provide adequate margin for the 9
sequential curl requests (each with up to 10-second max-time) plus OIDC token
acquisition and JSON parsing overhead. Consider using 5m as a safe value similar
to the collect-cs-telemetry step, or a minimum of 3m if a tighter constraint is
preferred.
In
`@ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.sh`:
- Around line 28-41: The while loop on line 28 uses a colon as the IFS delimiter
to split CELL and SILENCE_ID, but since CELL is a URL containing colons (e.g.,
https://...), the read command incorrectly splits on all colons rather than just
the separator between CELL and SILENCE_ID, causing malformed DELETE requests on
line 35. Change the IFS delimiter from colon to a character that won't appear in
URLs, such as a pipe character, and update the corresponding separator in the
input file to match so that CELL and SILENCE_ID are parsed correctly.
---
Outside diff comments:
In
`@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml`:
- Line 69: The standalone redirection syntax on line 69 of
rosa-e2e-silence-alerts-ref.yaml is flagged as SC2188 by shellcheck and is
non-portable. Replace the standalone redirection with an explicit no-op command
followed by the redirection operator, such as using `true >` or `echo -n >` to
make the syntax portable across different shells while maintaining the same
behavior.
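The timeout concern raised for rosa-e2e-silence-alerts comes down to simple arithmetic, sketched below. The figures (9 sequential requests, 10-second max-time each) are taken from the review comment; the time needed for OIDC token acquisition and JSON parsing is not stated there, so the sketch reports only the headroom left for it.

```shell
set -eu

cells=9        # sequential RHOBS cell requests, per the review comment
max_time=10    # per-request curl max-time in seconds, per the review comment

# Worst case if every request runs to its max-time.
worst_case=$((cells * max_time))

# Headroom left for token acquisition and parsing under each candidate timeout.
for timeout in 120 180 300; do
  echo "timeout=${timeout}s worst_case_curl=${worst_case}s headroom=$((timeout - worst_case))s"
done

headroom_2m=$((120 - worst_case))
```

With the current 2m0s timeout, a fully degraded run leaves 30 seconds for everything other than the curl calls, which is the margin the reviewer considers too thin.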
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1a66ed23-289a-4987-a435-9f987d0d71fe
📒 Files selected for processing (9)
- ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml
- ci-operator/step-registry/rosa/e2e/silence-alerts/OWNERS
- ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-commands.sh
- ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.metadata.json
- ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml
- ci-operator/step-registry/rosa/e2e/unsilence-alerts/OWNERS
- ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.sh
- ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-ref.metadata.json
- ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-ref.yaml
f0a9ed1 to 85a5aec
|
/pj-rehearse ack |
|
@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Add step registry refs to create and expire RHOBS alertmanager silences during production FVT runs, preventing test clusters from paging the on-call SRE. Pre-step creates regex-based silences matching _id =~ "cs-ci-.*" on all production RHOBS cells via the gateway API. Post-step expires the silences after the job completes. Wired into the cs-rosa-hcp-ad-production-main weekly FVT job. Jira: https://redhat.atlassian.net/browse/ROSAENG-60057
85a5aec to ebcf5ff
|
@dustman9000: all tests passed! |
|
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bmeng, dustman9000 |
|
/pj-rehearse ack |
|
@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
Summary
Context: Production FVT clusters (cs-ci-longname-*) were triggering RHOBS alerts and paging on-call SREs. This automates the silence/unsilence lifecycle through the gateway API.
Test plan
Jira: https://redhat.atlassian.net/browse/ROSAENG-60057
Summary by CodeRabbit
This PR extends OpenShift CI’s ROSA production FVT workflow to automatically silence RHOBS Alertmanager alerts during the Functional Verification Test run, then reliably expire those silences afterward.
What’s Changing
The weekly ocm-fvt-periodic-cs-rosa-hcp-ad-production-main job now includes step hooks:
- rosa-e2e-record-start-time
- rosa-e2e-silence-alerts: creates regex-based RHOBS gateway Alertmanager silences across all configured production (or staging) RHOBS cell endpoints for the job duration.
- rosa-e2e-unsilence-alerts: expires the created silences by reading saved silence IDs.
The job sets allow_best_effort_post_steps: true so the unsilence post-step runs even if the main FVT job fails.
Implementation Details
New step registry: rosa-e2e-silence-alerts
- Runs rosa-e2e-silence-alerts-commands.sh in the ocp/4.18:cli image.
- Uses the RHOBS OIDC credentials at /usr/local/rhobs-oidc.
- Accepts RHOBS_ENV=production (default) or staging to select the appropriate RHOBS cell gateway URLs.
- Builds the silence matcher (SILENCE_MATCHER_NAME default _id, SILENCE_MATCHER_VALUE default cs-ci-.*, isRegex=true, isEqual=true).
- startsAt/endsAt computed in UTC using SILENCE_DURATION_HOURS (default 6).
- createdBy: "rosa-ci-prow"
- Comment references ROSAENG-60057 and a job/PR URL when available.
- Saves each created silence to ${SHARED_DIR}/silence-ids as CELL|SILENCE_ID for later cleanup.
New step registry: rosa-e2e-unsilence-alerts
- Runs rosa-e2e-unsilence-alerts-commands.sh in the ocp/4.18:cli image.
- If ${SHARED_DIR}/silence-ids is missing/empty, it exits successfully.
- For each CELL|SILENCE_ID, it sends an HTTP DELETE to the RHOBS gateway silence endpoint to expire the silence.
Ownership/Registry Updates
- Adds OWNERS files for the silence-alerts and unsilence-alerts step registries.
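The silence payload described in the summary can be sketched as below. This is an illustration under stated assumptions rather than the PR's script: it assumes GNU date, builds the JSON with a plain here-document instead of a JSON tool, and the comment text is hypothetical. The field names follow the Alertmanager v2 silence format, and the defaults come from the summary.

```shell
set -eu

# Defaults as listed in the PR summary.
SILENCE_MATCHER_NAME="${SILENCE_MATCHER_NAME:-_id}"
SILENCE_MATCHER_VALUE="${SILENCE_MATCHER_VALUE:-cs-ci-.*}"
SILENCE_DURATION_HOURS="${SILENCE_DURATION_HOURS:-6}"

# Compute the silence window in UTC.
# Assumes GNU date (-d "@epoch"); BSD/macOS date would need -r instead.
now_epoch="$(date -u +%s)"
end_epoch=$((now_epoch + SILENCE_DURATION_HOURS * 3600))
starts_at="$(date -u -d "@${now_epoch}" +%Y-%m-%dT%H:%M:%SZ)"
ends_at="$(date -u -d "@${end_epoch}" +%Y-%m-%dT%H:%M:%SZ)"

# Naive string interpolation: safe here only because the values contain no
# characters that need JSON escaping. The comment text is hypothetical.
payload="$(cat <<EOF
{
  "matchers": [
    {"name": "${SILENCE_MATCHER_NAME}", "value": "${SILENCE_MATCHER_VALUE}", "isRegex": true, "isEqual": true}
  ],
  "startsAt": "${starts_at}",
  "endsAt": "${ends_at}",
  "createdBy": "rosa-ci-prow",
  "comment": "ROSAENG-60057: silence for production FVT run"
}
EOF
)"

echo "${payload}"
```

Setting SILENCE_DURATION_HOURS well above the expected job length keeps the silence active if the run overruns, while endsAt still bounds it if the unsilence post-step never executes.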