Skip to content

fix: Handling of site connection issues during outage #3482

Open
rippyboii wants to merge 105 commits intopython-discord:mainfrom
rippyboii:main
Open

fix: Handling of site connection issues during outage #3482
rippyboii wants to merge 105 commits intopython-discord:mainfrom
rippyboii:main

Conversation

@rippyboii
Copy link
Copy Markdown

Summary

This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load.

  • Aggregate extension + cog load failures during startup and report them as a single alert in #mod-log.
  • Add retry + exponential backoff for cogs that depend on external sites/APIs (e.g., temporary 5xx/429/timeouts), with clear mod notifications on final failure.
  • Add unit tests to validate retry behavior, error classification, and startup reporting.

During setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either:

  • stop startup unexpectedly, or
  • fail noisily/fragmentedly, making it hard to see what broke and why.

This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).

Changes

1) Startup failure aggregation (single #mod-log alert)

  • Added utils/startup_reporting.py to format a standardized startup failure message.
  • Updated bot.py to:
    • collect extension + cog load failures (import/setup/add_cog)
    • wait for all load tasks to complete
    • send one aggregated alert summarizing all failures
  • Reporting is defensive: it does not crash if the log channel is unavailable.
  • Startup continues for non-critical failures.

2) Retry + backoff for external/API-dependent cogs

Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.

Covered cogs include:

  • Filtering: 3 attempts with backoff (1s, 2s, 4s); retries on HTTP 429, HTTP 5xx, TimeoutError, OSError; final failure logs + alerts #mod-alerts.
  • Reminders: retry count is configured via URLs.connect_max_retries; warns are logged to Sentry; final failure posts to #mod-log.
  • PythonNews: retries on 408, 429, 5xx, TimeoutError, OSError; on max retries logs + alerts mod_alerts and re-raises to stop startup.
  • Superstarify cogs: added retry + notification and corresponding tests.

Tests / Verification

  • Added unit tests covering:
    • retry-then-success
    • max-retries then alert + failure behavior
    • non-retriable errors
    • retry classification logic
    • aggregated startup failure reporting for faulty extensions/cogs

Suggested checks:

  1. uv run task test
  2. Run the bot and simulate a faulty extension/cog load to confirm a single aggregated #mod-log startup alert.

Moderator Alert in Discord:

mod_alert

Closes #2918

rippyboii and others added 30 commits February 25, 2026 15:50
Alerts the moderators through a discord error message if the loading of the Reminders Cog has failed.
Adds retry logic with time buff to `Filtering.cog_load()`
Changed cog_load() function to retry connecting to api if it fails initially with an exponential delay and limited max attempts.
Add test cases for retrying cog loads and skeleton for new functions
Implement skeleton functions with code for retrying fetch, alerting mods, and checking if retryable
Add unit test for on_member_update and unit test to check _alert_mods_if_loading_failed is being called
kahoujo1 and others added 22 commits March 2, 2026 21:53
fix: remove uncalled method (Closes #36)
refactor: rename variables (Closes #38)
refactor: simplify merging of lines (Closes #39)
refactor: remove explicit context helper function (Closes #45)
refactor: remove dataclass label (Closes #47)
@rippyboii rippyboii requested a review from mbaruh as a code owner March 24, 2026 14:09
@rippyboii
Copy link
Copy Markdown
Author

rippyboii commented Mar 24, 2026

Hello @jb3,

I have opened another PR to keep the things clean. The earlier PR had referenced some commits including reports, which was the part of our assignment but not of this project. This new PR doesn't include any commits refrencing the reports or anything else that's unnecessary to this project.

All the suggestions from the previous PR are adapted in this PR too.

Apologies for delay update from our side.

Comment on lines +113 to +137
for attempt in range(1, URLs.connect_max_retries + 1):
try:
raw_filter_lists = await self.bot.api_client.get("bot/filter/filter_lists")
break
except Exception as error:
is_retryable = is_retryable_api_error(error)
is_last_attempt = attempt == URLs.connect_max_retries

if not is_retryable:
raise

if is_last_attempt:
log.exception("Failed to load filtering data after %d attempts.", URLs.connect_max_retries)
raise

backoff_seconds = URLs.connect_initial_backoff * (2 ** (attempt - 1))
log.warning(
"Failed to load filtering data (attempt %d/%d). Retrying in %d second(s): %s",
attempt,
URLs.connect_max_retries,
backoff_seconds,
error
)
await asyncio.sleep(backoff_seconds)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you repeated this backoff logic many times. Consider moving it to self.bot.api_client.get()

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Thank you for your review,

I’m a bit hesitant to put retries directly into api_client.get() because that may affect all GET calls globally, including cases where retries may be not necessary or need different behavior/logging.

If you think this might not be the problem, I can definately move it to the self.bot.api_client.get()

Comment on lines +58 to +69
self.seen_items = {}
for mailing_list in response:
self.seen_items[mailing_list["name"]] = set(mailing_list["seen_items"])

with sentry_sdk.start_span(description="Update site with new mailing lists"):
for mailing_list in ("pep", *constants.PythonNews.mail_lists):
if mailing_list not in self.seen_items:
await self.bot.api_client.post("bot/mailing-lists", json={"name": mailing_list})
self.seen_items[mailing_list] = set()

self.fetch_new_media.start()
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this in the try?

Copy link
Copy Markdown
Author

@rippyboii rippyboii Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

The intent was to make the whole startup sync step atomic and retry on transient API failures from both the get("bot/mailing-lists") and post("bot/mailing-lists") calls.

except only retries is_retryable_api_error(error) so the local logic errors still raise immediately.

also the try covers all the startup sequence, so the transient failure in either get or the post should trigger the clean retry method. right?


if bot.get_channel(channel_id) is None:
# Can't send a message if the channel doesn't exist, so log instead
log.warning("Failed to send startup failure report: mod_log channel not found.")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the failures be logged in the error?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right! Thank you for pointing it out.

I think a missing mod_log channel means the failure report was silently dropped, which is an error, not just a warning. I will update it soon.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated it with proper error logging. Could you check it again?

Comment on lines +118 to +126
async def _load_one(extension: str) -> None:
try:
await self.load_extension(extension)
log.info(f"Extension successfully loaded: {extension}")

except BaseException as e:
self.extension_load_failures[extension] = e
log.exception(f"Failed to load extension: {extension}")
raise
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason this is defined within _load_extensions instead of it being a method of Bot?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I defined _load_one inside _load_extensions because it’s only used there. I didn't realized we could move it to private Bot method.

If you want, I’m happy to move it to a private Bot method (e.g. _load_single_extension). Shall I?

@rippyboii rippyboii requested a review from rf20008 April 6, 2026 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling of site connection issues during outage.

6 participants