fix: classify gateway responses and add retry/backoff policy (closes #104) by Yashash4 · Pull Request #155 · brightdata/brightdata-mcp

Yashash4 · 2026-06-11T17:28:05Z

Summary

Split out from #141 per @meirk-brd's review. This PR is only the #104 retry/backoff fix. geo_fanout is held back and will return as its own PR once its target-status sourcing is reworked (see the note at the end).

What #104 asked for

Issue #104 reports intermittent 502 Bad Gateway (and 504) from the gateway under burst load (~50-100 MCP calls), where subsequent requests fail as well, and notes there is no documented retry / backoff policy specific to MCP usage.

The previous base_request retried on every non-4xx error in a tight loop with no delay (timeout: base_timeout only), which under a burst can amplify the overload rather than relieve it, and it had no notion of Retry-After.

The fix: retry_utils.js

Three pure, dependency-free functions (no transport-library coupling), each table-tested.

classify_response(input, now) maps an HTTP response or a thrown transport error to a stable outcome taxonomy:

success (2xx)
redirect (3xx, given its own self-consistent outcome rather than being mislabeled fatal)
retryable (408/425/500/502/503/504, unenumerated 5xx, and retryable network codes like ECONNRESET/ETIMEDOUT/UND_ERR_CONNECT_TIMEOUT)
rate_limited (429, surfacing Retry-After)
blocked (403/451, terminal, not a silent retry)
client_error (other 4xx, terminal)
fatal (unclassifiable, never silently retried)

parse_retry_after accepts only a non-negative integer number of seconds or a valid HTTP-date. Anything malformed (fractional 1.5, negative -3, numeric-looking junk) returns null, so the caller falls back to its computed backoff instead of reading it as a past date and retrying immediately.

compute_backoff(attempt, opts, rng) is exponential backoff with full jitter (so a burst of concurrent callers does not retry in lockstep), capped at max_ms, with a server-supplied Retry-After (integer seconds or HTTP-date) always winning. rng is injectable for deterministic tests.

should_retry(classification, attempt, max_retries, opts, rng) combines the above into a {retry, delay_ms} decision honoring the retry budget.

base_request now retries only transient failures, with jittered backoff. New optional env knobs BASE_BACKOFF_MS (default 500) and MAX_BACKOFF_MS (default 30000). The existing BASE_MAX_RETRIES still bounds the attempt count (clamped to the documented 0-3 range, NaN-safe).

Addressing the review on this half

README. Documented the new BASE_BACKOFF_MS / MAX_BACKOFF_MS knobs and the new worst-case latency: a single call can now block for tens of seconds before it ultimately fails (with the defaults, up to roughly 90s of backoff across three capped waits), so callers that need to fail fast can lower the knobs.
Retry logging. base_request no longer logs one line per attempt. It emits a single concise summary line only when it gives up after at least one retry, so the burst Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 describes (50-100 calls, each retrying up to 3 times) no longer floods stderr. Success and first-try non-retryable failures stay silent.

Tests

Table-driven, using the repo's existing node --test / node:assert/strict setup. No new dependencies.

test/retry-utils.test.js covers the classify_response taxonomy (including the 502/504 from #104 and the 3xx redirect outcome), parse_retry_after (strict rejection of fractional/negative/junk), compute_backoff (jitter modes, cap, and Retry-After override), and the should_retry budget (including a 3xx never looping).

Run npm test: 21 tests, 21 pass, 0 fail (against the current main).

Notes

No crypto, no new runtime dependencies, MIT-compatible.
Follows the repo's conventions (ES modules, 'use strict'; /*jslint ...*/ header, snake_case, single quotes, 4-space indent).
Default behavior is unchanged for existing deployments. With BASE_MAX_RETRIES unset (default 0), a failing call still throws on the first try with no added latency. The retry path is opt-in.

On geo_fanout (held for a follow-up PR)

You were right that classifying axios response.status does not work against the Unlocker /request endpoint, since that is the gateway status (a successful proxy fetch returns 200 even when the target returned 403 or a redirect), and axios follows the 3xx before the tool ever sees it. I will resubmit geo_fanout separately once it reads the target status the way scrape_as_html / extract do (the x-brd-* headers) with maxRedirects: 0, and the tests exercise the real Unlocker response shape rather than synthetic statuses. Happy to compare notes on the status-passthrough when I open it.

…rightdata#104) Intermittent 502/504 from the MCP gateway under burst load had no retry guidance and the previous base_request retried every non-4xx in a tight loop with no delay, which could amplify the overload. Add retry_utils.js with pure, table-tested helpers: - classify_response: stable outcome taxonomy (success / redirect / retryable / rate_limited / blocked / client_error / fatal) over HTTP status and network error codes; 502/504/503/500/408 are retryable, 403/451 are a first-class BLOCKED outcome, 3xx are a non-retryable REDIRECT, 4xx are terminal. - parse_retry_after: strict RFC 9110 parsing (integer seconds or a date-shaped HTTP-date); fractional/negative/junk values return null so the caller falls back to computed backoff instead of an immediate retry. - compute_backoff: exponential backoff with full jitter, capped at max_ms, that honors a server Retry-After (seconds or HTTP-date). - should_retry: budget + delay decision for a retry loop. Wire base_request to use them so only transient failures are retried, with jittered backoff so a burst of concurrent calls no longer retries in lockstep. Backoff is configurable via BASE_BACKOFF_MS (default 500) and MAX_BACKOFF_MS (default 30000). Document the new env knobs and the worst-case added latency in the README. Reduce retry logging from one stderr line per attempt to a single concise summary line per request (only on final give-up), so the brightdata#104 burst (50-100 calls x up to 3 retries) no longer floods stderr. Clamp BASE_MAX_RETRIES to a sane 0-3 integer so a negative or non-numeric value behaves as 0 instead of skipping the request loop entirely and throwing undefined.

This was referenced Jun 11, 2026

geo_fanout tool + classified retry/backoff policy (closes #104) #141

Closed

feat: geo_fanout with first-class classified per-geo results (target status via Unlocker format:json) #156

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: classify gateway responses and add retry/backoff policy (closes #104)#155

fix: classify gateway responses and add retry/backoff policy (closes #104)#155
Yashash4 wants to merge 1 commit into
brightdata:mainfrom
Yashash4:fix/mcp-104-retry-backoff

Yashash4 commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yashash4 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What #104 asked for

The fix: retry_utils.js

Addressing the review on this half

Tests

Notes

On geo_fanout (held for a follow-up PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Yashash4 commented Jun 11, 2026 •

edited

Loading