Skip to content

fix: classify gateway responses and add retry/backoff policy (closes #104)#155

Open
Yashash4 wants to merge 1 commit into
brightdata:mainfrom
Yashash4:fix/mcp-104-retry-backoff
Open

fix: classify gateway responses and add retry/backoff policy (closes #104)#155
Yashash4 wants to merge 1 commit into
brightdata:mainfrom
Yashash4:fix/mcp-104-retry-backoff

Conversation

@Yashash4

@Yashash4 Yashash4 commented Jun 11, 2026

Copy link
Copy Markdown

Summary

Split out from #141 per @meirk-brd's review. This PR is only the #104 retry/backoff fix. geo_fanout is held back and will return as its own PR once its target-status sourcing is reworked (see the note at the end).

What #104 asked for

Issue #104 reports intermittent 502 Bad Gateway (and 504) from the gateway under burst load (~50-100 MCP calls), where subsequent requests fail as well, and notes there is no documented retry / backoff policy specific to MCP usage.

The previous base_request retried on every non-4xx error in a tight loop with no delay (timeout: base_timeout only), which under a burst can amplify the overload rather than relieve it, and it had no notion of Retry-After.

The fix: retry_utils.js

Three pure, dependency-free functions (no transport-library coupling), each table-tested.

classify_response(input, now) maps an HTTP response or a thrown transport error to a stable outcome taxonomy:

  • success (2xx)
  • redirect (3xx, given its own self-consistent outcome rather than being mislabeled fatal)
  • retryable (408/425/500/502/503/504, unenumerated 5xx, and retryable network codes like ECONNRESET/ETIMEDOUT/UND_ERR_CONNECT_TIMEOUT)
  • rate_limited (429, surfacing Retry-After)
  • blocked (403/451, terminal, not a silent retry)
  • client_error (other 4xx, terminal)
  • fatal (unclassifiable, never silently retried)

parse_retry_after accepts only a non-negative integer number of seconds or a valid HTTP-date. Anything malformed (fractional 1.5, negative -3, numeric-looking junk) returns null, so the caller falls back to its computed backoff instead of reading it as a past date and retrying immediately.

compute_backoff(attempt, opts, rng) is exponential backoff with full jitter (so a burst of concurrent callers does not retry in lockstep), capped at max_ms, with a server-supplied Retry-After (integer seconds or HTTP-date) always winning. rng is injectable for deterministic tests.

should_retry(classification, attempt, max_retries, opts, rng) combines the above into a {retry, delay_ms} decision honoring the retry budget.

base_request now retries only transient failures, with jittered backoff. New optional env knobs BASE_BACKOFF_MS (default 500) and MAX_BACKOFF_MS (default 30000). The existing BASE_MAX_RETRIES still bounds the attempt count (clamped to the documented 0-3 range, NaN-safe).

Addressing the review on this half

  • README. Documented the new BASE_BACKOFF_MS / MAX_BACKOFF_MS knobs and the new worst-case latency: a single call can now block for tens of seconds before it ultimately fails (with the defaults, up to roughly 90s of backoff across three capped waits), so callers that need to fail fast can lower the knobs.
  • Retry logging. base_request no longer logs one line per attempt. It emits a single concise summary line only when it gives up after at least one retry, so the burst Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 describes (50-100 calls, each retrying up to 3 times) no longer floods stderr. Success and first-try non-retryable failures stay silent.

Tests

Table-driven, using the repo's existing node --test / node:assert/strict setup. No new dependencies.

test/retry-utils.test.js covers the classify_response taxonomy (including the 502/504 from #104 and the 3xx redirect outcome), parse_retry_after (strict rejection of fractional/negative/junk), compute_backoff (jitter modes, cap, and Retry-After override), and the should_retry budget (including a 3xx never looping).

Run npm test: 21 tests, 21 pass, 0 fail (against the current main).

Notes

  • No crypto, no new runtime dependencies, MIT-compatible.
  • Follows the repo's conventions (ES modules, 'use strict'; /*jslint ...*/ header, snake_case, single quotes, 4-space indent).
  • Default behavior is unchanged for existing deployments. With BASE_MAX_RETRIES unset (default 0), a failing call still throws on the first try with no added latency. The retry path is opt-in.

On geo_fanout (held for a follow-up PR)

You were right that classifying axios response.status does not work against the Unlocker /request endpoint, since that is the gateway status (a successful proxy fetch returns 200 even when the target returned 403 or a redirect), and axios follows the 3xx before the tool ever sees it. I will resubmit geo_fanout separately once it reads the target status the way scrape_as_html / extract do (the x-brd-* headers) with maxRedirects: 0, and the tests exercise the real Unlocker response shape rather than synthetic statuses. Happy to compare notes on the status-passthrough when I open it.

…rightdata#104)

Intermittent 502/504 from the MCP gateway under burst load had no retry
guidance and the previous base_request retried every non-4xx in a tight
loop with no delay, which could amplify the overload.

Add retry_utils.js with pure, table-tested helpers:
- classify_response: stable outcome taxonomy (success / redirect /
  retryable / rate_limited / blocked / client_error / fatal) over HTTP
  status and network error codes; 502/504/503/500/408 are retryable,
  403/451 are a first-class BLOCKED outcome, 3xx are a non-retryable
  REDIRECT, 4xx are terminal.
- parse_retry_after: strict RFC 9110 parsing (integer seconds or a
  date-shaped HTTP-date); fractional/negative/junk values return null so
  the caller falls back to computed backoff instead of an immediate retry.
- compute_backoff: exponential backoff with full jitter, capped at
  max_ms, that honors a server Retry-After (seconds or HTTP-date).
- should_retry: budget + delay decision for a retry loop.

Wire base_request to use them so only transient failures are retried,
with jittered backoff so a burst of concurrent calls no longer retries
in lockstep. Backoff is configurable via BASE_BACKOFF_MS (default 500)
and MAX_BACKOFF_MS (default 30000).

Document the new env knobs and the worst-case added latency in the
README. Reduce retry logging from one stderr line per attempt to a single
concise summary line per request (only on final give-up), so the brightdata#104
burst (50-100 calls x up to 3 retries) no longer floods stderr.

Clamp BASE_MAX_RETRIES to a sane 0-3 integer so a negative or
non-numeric value behaves as 0 instead of skipping the request loop
entirely and throwing undefined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant