Skip to content

feat: consecutive-window alerts#2472

Open
mrkaye97 wants to merge 10 commits into
hyperdxio:mainfrom
mrkaye97:mk/add-multi-window-lookback-alerts
Open

feat: consecutive-window alerts#2472
mrkaye97 wants to merge 10 commits into
hyperdxio:mainfrom
mrkaye97:mk/add-multi-window-lookback-alerts

Conversation

@mrkaye97

@mrkaye97 mrkaye97 commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Fixes #2469 (comment)

Adding consecutive-window configuration to alerts, so that you can specify a condition like "only fire this alert after some condition is met for N consecutive windows." This helps prevent flaky alerts (and pages), and cut down on alert noise in many cases.

Not too sure on a few things here, so would love some feedback:

  1. My UX sense isn't amazing - I thought that doing this in-line made sense, but it may make sense to put the number of windows on its own line below the original alert config, e.g.
  2. One implementation detail I was wondering about is if we'd want to allow for alerting if e.g. three out of five windows alert, or similar. For instance "fire the alert if we see this log line at least once in three of five consecutive windows." I could see this being useful, but it also adds some overhead, and it feels like it'd be easy to add that feature in a non-breaking way in the future if someone would like to have it, so I figured I'd leave it off for now.
  3. In the current implementation, the alert in the UI shows as "firing" even if no notification has actually sent yet, because the condition has been met, but the threshold is not. Not sure if that's okay, or if I should figure out some way around it!

Screenshots or video

Screenshot from my local of the new alert modal:

Screenshot 2026-06-16 at 10 52 39 AM

Testing Locally

I tested by creating a trivial alert that'd fire on any error log that we see at least once per minute for three minutes, and then sent a curl once per minute to trigger it:

 curl -s -X POST http://localhost:30996/v1/logs \
    -H "Content-Type: application/json" \
    -H "Authorization: << ingestion api key >>" \
    -d '{
      "resourceLogs": [{
        "resource": {
          "attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]
        },
        "scopeLogs": [{
          "logRecords": [{
            "timeUnixNano": "'$(date +%s)000000000'",
            "severityText": "ERROR",
            "severityNumber": 17,
            "body": {"stringValue": "something went wrong"}
          }]
        }]  
      }]
    }'

Sent webhooks to webhook.site, and saw logs matching what we'd expect:

  "text": "🚨 Alert for \"test-alert\" - 1 lines found | \n1 lines found, which meets or exceeds the threshold of 1 lines\nTime Range (UTC): [Jun 16 2:52:00 PM - Jun 16 2:53:00 PM)\n\n```\n\"2026-06-16T14:52:00.001000000Z\",\"hdx-oss-dev-api\",\"info\",\"check-alerts started at Tue Jun 16 2026 10:52:00 GMT-0400 (Eastern Daylight Time)\"\n\"2026-06-16T14:52:00.004000000Z\",\"hdx-oss-dev-api\",\"info\",\"Connection established to MongoDB\"\n\"2026-06-16T14:52:00.004000000Z\",\"hdx-oss-dev-api\",\"debug\",\"finished provider initialization\"\n\"2026-06-16T14:52:00.016000000Z\",\"hdx-oss-dev-api\",\"debug\",\"Fetched 1 alert tasks to process\"\n\"2026-06-16T14:52:00.017000000Z\",\"hdx-oss-dev-api\",\"debug\",\"Obtained teams and webhooks for all alertTasks\"\n\n``` | http://localhost:30296/search/6a3162175459c7092c078df3?from=1781621520000&to=1781621580000&isLive=false | ALERT | 1781621520000 | 1781621580000 | 7e645d5a3c56111a86f2cba9d8f46af046943a6f"

And then:

{"text": "✅ Alert for \"test-alert\" - 0 lines found | The alert has been resolved.\nTime Range (UTC): [Jun 16 2:53:00 PM - Jun 16 2:54:00 PM)\n | http://localhost:30296/search/6a3162175459c7092c078df3?from=1781621580000&to=1781621640000&isLive=false | OK | 1781621580000 | 1781621640000 | 7e645d5a3c56111a86f2cba9d8f46af046943a6f"}

How to test on Vercel preview

Preview routes:

Wasn't sure what to put here!

References

Source Issue: #2469 (comment)

@changeset-bot

changeset-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 7687415

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

@mrkaye97 is attempting to deploy a commit to the HyperDX Team on Vercel.

A member of the Team first needs to authorize it.

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds a numConsecutiveWindows setting to alerts, allowing users to require that a threshold condition hold for N consecutive check windows before a notification fires. A new fired boolean field on AlertHistory tracks whether a notification was actually sent for each ALERT window, so the resolution path can correctly skip "resolved" webhooks when no ALERT was ever sent.

  • Adds shouldFireBasedOnConsecutiveWindows helper that queries the N-1 most recent AlertHistory records and returns true only when all are in ALERT state; all three call-sites (single-value, zero-fill, and group-by paths) correctly set fired = false in the suppression branch.
  • Threads fired through getPreviousAlertHistories aggregation and the AggregatedAlertHistory interface so sendNotificationIfResolved can gate on previousHistory?.fired !== false.
  • Adds NumberInput controls to both alert form surfaces (saved-search modal and tile alert editor) that map a value of 1 back to undefined so the default single-window behavior remains unchanged for existing alerts.

Confidence Score: 4/5

The change is logically sound and well-tested for the happy path, but there is a known open question about service-downtime gaps allowing non-adjacent ALERT windows to satisfy the consecutive check.

The core shouldFireBasedOnConsecutiveWindows implementation correctly gates all three ALERT code paths, the fired field is properly threaded through the aggregation and resolution guard, and three integration tests cover the main scenarios. The open thread item about temporal proximity — where downtime gaps can cause stale ALERT windows to satisfy the consecutive-window requirement — is an acknowledged correctness gap that could produce spurious notifications after extended outages.

packages/api/src/tasks/checkAlerts/index.ts — specifically the shouldFireBasedOnConsecutiveWindows query, which lacks a lower-bound timestamp filter to ensure only truly consecutive windows are counted.

Important Files Changed

Filename Overview
packages/api/src/tasks/checkAlerts/index.ts Core implementation of consecutive-window alerting. Adds shouldFireBasedOnConsecutiveWindows helper and fired field tracking. The fired !== false resolution guard is correct; all three call-sites have the else branch setting fired=false.
packages/api/src/models/alertHistory.ts Adds optional fired boolean field to AlertHistory schema to track whether a notification was actually sent for each ALERT window.
packages/api/src/tasks/checkAlerts/tests/checkAlerts.test.ts Adds three integration tests covering: fires on first window (N=1), suppresses until N consecutive ALERTs, and no resolution notification when no alert previously fired.
packages/app/src/components/DBEditTimeChartForm/TileAlertEditor.tsx Adds numConsecutiveWindows NumberInput to the tile alert editor. Minor naming issue: alertnumConsecutiveWindows should be alertNumConsecutiveWindows to follow the established camelCase pattern.
packages/common-utils/src/types.ts Adds numConsecutiveWindows to AlertBaseObjectSchema and AlertsPageItemSchema with correct int/min(1) constraints.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[processAlert runs] --> B{meta.type == single_value?}
    B -- yes --> C[doesExceedThreshold?]
    B -- no --> D[time-series: iterate expectedBuckets]

    C -- yes --> E[shouldFireBasedOnConsecutiveWindows]
    C -- no --> F[state = OK]

    D --> G[bucket has data?]
    G -- no, zeroValueIsAlert --> E
    G -- yes, per group --> H[doesExceedThreshold?]
    H -- yes --> E
    H -- no --> I[state = OK]

    E --> J{N-1 previous windows all ALERT?}
    J -- yes --> K[history.fired = true, trySendNotification ALERT]
    J -- no --> L[history.fired = false, notification suppressed]

    F --> M[sendNotificationIfResolved]
    I --> M
    K --> M
    L --> M

    M --> N{previousHistory.state==ALERT AND fired !== false AND current==OK?}
    N -- yes --> O[trySendNotification OK, resolution sent]
    N -- no --> P[skip resolution]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[processAlert runs] --> B{meta.type == single_value?}
    B -- yes --> C[doesExceedThreshold?]
    B -- no --> D[time-series: iterate expectedBuckets]

    C -- yes --> E[shouldFireBasedOnConsecutiveWindows]
    C -- no --> F[state = OK]

    D --> G[bucket has data?]
    G -- no, zeroValueIsAlert --> E
    G -- yes, per group --> H[doesExceedThreshold?]
    H -- yes --> E
    H -- no --> I[state = OK]

    E --> J{N-1 previous windows all ALERT?}
    J -- yes --> K[history.fired = true, trySendNotification ALERT]
    J -- no --> L[history.fired = false, notification suppressed]

    F --> M[sendNotificationIfResolved]
    I --> M
    K --> M
    L --> M

    M --> N{previousHistory.state==ALERT AND fired !== false AND current==OK?}
    N -- yes --> O[trySendNotification OK, resolution sent]
    N -- no --> P[skip resolution]
Loading

Reviews (3): Last reviewed commit: "refactor: group key can be optional" | Re-trigger Greptile

Comment thread packages/api/src/tasks/checkAlerts/index.ts
Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated
Comment on lines +1011 to +1022
const alertHistory = await AlertHistory.find({
alert: new mongoose.Types.ObjectId(alert.id),
...groupFilter,
createdAt: { $lt: nowInMinsRoundDown },
})
.sort({ createdAt: -1 })
.limit(numWindowsToLookBack - 1);

return (
alertHistory.length === numWindowsToLookBack - 1 &&
alertHistory.every(h => h.state === AlertState.ALERT)
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Consecutive-window check doesn't verify temporal proximity

shouldFireBasedOnConsecutiveWindows queries the N-1 most recent AlertHistory records without filtering by timestamp proximity. If the alert-check service experienced downtime (e.g., hours of outage), old ALERT records from before the gap would still be returned by the sort-and-limit query and would satisfy the every(h => h.state === ALERT) check, causing the current window to fire as if the intervening period had never broken the streak.

Adding a lower-bound timestamp filter — e.g. createdAt: { $gte: new Date(nowInMinsRoundDown.getTime() - (numWindowsToLookBack - 1) * windowSizeInMins * 60_000), $lt: nowInMinsRoundDown } — would ensure only truly consecutive windows are counted.

Fix in Claude Code Fix in Conductor Fix in Cursor Fix in Codex

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, would love input from a maintainer on this one, as I'm not too familiar here!

@mrkaye97 mrkaye97 force-pushed the mk/add-multi-window-lookback-alerts branch from 0214704 to 897fa74 Compare June 16, 2026 15:01
@mrkaye97 mrkaye97 marked this pull request as ready for review June 16, 2026 15:03
Comment thread packages/api/src/tasks/checkAlerts/index.ts Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature request: generalize alert configuration to allow for "fire alert after N occurrences in M consecutive windows"

1 participant