feat(monitor-v2): add failure grace period to PM notifier#4939

Open
md0x wants to merge 3 commits into master from
feat/pm-notifier-failure-grace-period

@md0x md0x commented Mar 12, 2026

Summary

  • Adds a cross-run failure grace period to the Polymarket notifier. Transient API failures (timeouts, 5xx) no longer trigger immediate alerts or permanently mark proposals as handled. Instead, a new FailedProposals Datastore kind records each failing proposal with a firstFailureAt timestamp. The alert fires only once FAILURE_GRACE_PERIOD_SECONDS (default 630s ≈ 2 serverless runs + buffer) has passed. A successful check silently cleans up any prior failure record. Setting the grace period to 0 preserves the original behavior.
  • Also bumps the RETRY_ATTEMPTS default from 1 to 3 and the RETRY_DELAY_MS default from 0 to 1000 for better in-process retry coverage.

Test plan

  • All 36 existing tests pass unchanged (grace period defaults to 0 in tests)
  • 5 new tests covering: first failure suppression, subsequent retry within grace period, alert after grace period exceeded, cleanup on success, and backward-compatible gracePeriod=0 behavior

Transient Polymarket API failures (timeouts, 5xx) currently trigger
immediate Slack alerts and permanently mark the proposal as handled,
preventing any retry on subsequent runs.

This adds a two-layer resiliency mechanism:

Layer 1 — Better in-process HTTP retries:
  - RETRY_ATTEMPTS default 1 → 3
  - RETRY_DELAY_MS default 0 → 1000 (exponential backoff)
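A minimal sketch of what the Layer 1 defaults could mean in practice. This is not the PR's code: `backoffDelayMs` and `retryWithBackoff` are hypothetical names, and the sketch assumes that RETRY_ATTEMPTS counts total attempts and that the delay doubles after each failed attempt.

```typescript
// Hypothetical helpers for illustration only; not taken from the PR.

// Delay before the next attempt, assuming a doubling schedule:
// attempt 0 -> base, attempt 1 -> 2 * base, attempt 2 -> 4 * base.
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Runs fn up to `attempts` times in total, sleeping between failed attempts.
// Defaults mirror the new RETRY_ATTEMPTS (3) and RETRY_DELAY_MS (1000) values.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  const total = Math.max(1, attempts); // always try at least once
  let lastError: unknown;
  for (let attempt = 0; attempt < total; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown below.
      if (attempt < total - 1) {
        const delayMs = backoffDelayMs(attempt, baseDelayMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Under these assumptions, three attempts with a 1000 ms base wait at most 1000 ms + 2000 ms in process before the failure is handed to the cross-run layer.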

Layer 2 — Cross-run failure grace period:
  - New Datastore kind "FailedProposals" tracks proposals that fail
    with firstFailureAt timestamp, failureCount, and lastError
  - First failure stores a record and warns, does NOT alert
  - Subsequent failures within the grace period update the record
  - Only after FAILURE_GRACE_PERIOD_SECONDS (default 630s ≈ 2
    serverless runs + buffer) does the alert fire
  - Successful checks silently clear any prior failure record
  - Setting grace period to 0 preserves the original behavior
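The Layer 2 rules above can be sketched as a pure decision function. This is an illustration under stated assumptions, not the PR's implementation: `FailureRecord`, `FailureOutcome`, and `handleCheckFailure` are hypothetical names, timestamps are assumed to be Unix seconds, and the grace period is assumed to count as exceeded when the elapsed time is greater than or equal to the configured value. Datastore reads and writes are left to the caller.

```typescript
// Hypothetical types and function for illustration only; not taken from the PR.

// Fields named in the PR description for the FailedProposals kind.
interface FailureRecord {
  firstFailureAt: number; // assumed to be Unix seconds
  failureCount: number;
  lastError: string;
}

type FailureOutcome =
  | { action: "suppress"; record: FailureRecord } // store or update the record, warn only
  | { action: "alert" };                           // fire the alert

function handleCheckFailure(
  existing: FailureRecord | undefined,
  nowSeconds: number,
  gracePeriodSeconds: number,
  error: string
): FailureOutcome {
  // A grace period of 0 preserves the original behavior: alert immediately.
  if (gracePeriodSeconds <= 0) return { action: "alert" };

  // First failure: record it and suppress the alert.
  if (existing === undefined) {
    return {
      action: "suppress",
      record: { firstFailureAt: nowSeconds, failureCount: 1, lastError: error },
    };
  }

  // Grace period exceeded (boundary treated as exceeded, by assumption).
  if (nowSeconds - existing.firstFailureAt >= gracePeriodSeconds) {
    return { action: "alert" };
  }

  // Still within the grace period: update the record, keep firstFailureAt.
  return {
    action: "suppress",
    record: { ...existing, failureCount: existing.failureCount + 1, lastError: error },
  };
}
```

On a successful check the caller would delete any existing record for that proposal, so a later failure starts a fresh grace period.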

No changes to the existing NotifiedProposals schema or notification
format.