feat: consecutive-window alerts#2472
Conversation
|
|
@mrkaye97 is attempting to deploy a commit to the HyperDX Team on Vercel. A member of the Team first needs to authorize it. |
Greptile SummaryThis PR adds a
Confidence Score: 4/5The change is logically sound and well-tested for the happy path, but there is a known open question about service-downtime gaps allowing non-adjacent ALERT windows to satisfy the consecutive check. The core shouldFireBasedOnConsecutiveWindows implementation correctly gates all three ALERT code paths, the fired field is properly threaded through the aggregation and resolution guard, and three integration tests cover the main scenarios. The open thread item about temporal proximity — where downtime gaps can cause stale ALERT windows to satisfy the consecutive-window requirement — is an acknowledged correctness gap that could produce spurious notifications after extended outages. packages/api/src/tasks/checkAlerts/index.ts — specifically the shouldFireBasedOnConsecutiveWindows query, which lacks a lower-bound timestamp filter to ensure only truly consecutive windows are counted. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[processAlert runs] --> B{meta.type == single_value?}
B -- yes --> C[doesExceedThreshold?]
B -- no --> D[time-series: iterate expectedBuckets]
C -- yes --> E[shouldFireBasedOnConsecutiveWindows]
C -- no --> F[state = OK]
D --> G[bucket has data?]
G -- no, zeroValueIsAlert --> E
G -- yes, per group --> H[doesExceedThreshold?]
H -- yes --> E
H -- no --> I[state = OK]
E --> J{N-1 previous windows all ALERT?}
J -- yes --> K[history.fired = true, trySendNotification ALERT]
J -- no --> L[history.fired = false, notification suppressed]
F --> M[sendNotificationIfResolved]
I --> M
K --> M
L --> M
M --> N{previousHistory.state==ALERT AND fired !== false AND current==OK?}
N -- yes --> O[trySendNotification OK, resolution sent]
N -- no --> P[skip resolution]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[processAlert runs] --> B{meta.type == single_value?}
B -- yes --> C[doesExceedThreshold?]
B -- no --> D[time-series: iterate expectedBuckets]
C -- yes --> E[shouldFireBasedOnConsecutiveWindows]
C -- no --> F[state = OK]
D --> G[bucket has data?]
G -- no, zeroValueIsAlert --> E
G -- yes, per group --> H[doesExceedThreshold?]
H -- yes --> E
H -- no --> I[state = OK]
E --> J{N-1 previous windows all ALERT?}
J -- yes --> K[history.fired = true, trySendNotification ALERT]
J -- no --> L[history.fired = false, notification suppressed]
F --> M[sendNotificationIfResolved]
I --> M
K --> M
L --> M
M --> N{previousHistory.state==ALERT AND fired !== false AND current==OK?}
N -- yes --> O[trySendNotification OK, resolution sent]
N -- no --> P[skip resolution]
Reviews (3): Last reviewed commit: "refactor: group key can be optional" | Re-trigger Greptile |
| const alertHistory = await AlertHistory.find({ | ||
| alert: new mongoose.Types.ObjectId(alert.id), | ||
| ...groupFilter, | ||
| createdAt: { $lt: nowInMinsRoundDown }, | ||
| }) | ||
| .sort({ createdAt: -1 }) | ||
| .limit(numWindowsToLookBack - 1); | ||
|
|
||
| return ( | ||
| alertHistory.length === numWindowsToLookBack - 1 && | ||
| alertHistory.every(h => h.state === AlertState.ALERT) | ||
| ); |
There was a problem hiding this comment.
Consecutive-window check doesn't verify temporal proximity
shouldFireBasedOnConsecutiveWindows queries the N-1 most recent AlertHistory records without filtering by timestamp proximity. If the alert-check service experienced downtime (e.g., hours of outage), old ALERT records from before the gap would still be returned by the sort-and-limit query and would satisfy the every(h => h.state === ALERT) check, causing the current window to fire as if the intervening period had never broken the streak.
Adding a lower-bound timestamp filter — e.g. createdAt: { $gte: new Date(nowInMinsRoundDown.getTime() - (numWindowsToLookBack - 1) * windowSizeInMins * 60_000), $lt: nowInMinsRoundDown } — would ensure only truly consecutive windows are counted.
There was a problem hiding this comment.
This seems reasonable, would love input from a maintainer on this one, as I'm not too familiar here!
0214704 to
897fa74
Compare
Summary
Fixes #2469 (comment)
Adding consecutive-window configuration to alerts, so that you can specify a condition like "only fire this alert after some condition is met for N consecutive windows." This helps prevent flaky alerts (and pages), and cut down on alert noise in many cases.
Not too sure on a few things here, so would love some feedback:
Screenshots or video
Screenshot from my local of the new alert modal:
Testing Locally
I tested by creating a trivial alert that'd fire on any error log that we see at least once per minute for three minutes, and then sent a curl once per minute to trigger it:
Sent webhooks to
webhook.site, and saw logs matching what we'd expect:And then:
How to test on Vercel preview
Preview routes:
Wasn't sure what to put here!
References
Source Issue: #2469 (comment)