fix: classify gateway responses and add retry/backoff policy (closes #104)#155
Open
Yashash4 wants to merge 1 commit into
Open
fix: classify gateway responses and add retry/backoff policy (closes #104)#155Yashash4 wants to merge 1 commit into
Yashash4 wants to merge 1 commit into
Conversation
…rightdata#104) Intermittent 502/504 from the MCP gateway under burst load had no retry guidance and the previous base_request retried every non-4xx in a tight loop with no delay, which could amplify the overload. Add retry_utils.js with pure, table-tested helpers: - classify_response: stable outcome taxonomy (success / redirect / retryable / rate_limited / blocked / client_error / fatal) over HTTP status and network error codes; 502/504/503/500/408 are retryable, 403/451 are a first-class BLOCKED outcome, 3xx are a non-retryable REDIRECT, 4xx are terminal. - parse_retry_after: strict RFC 9110 parsing (integer seconds or a date-shaped HTTP-date); fractional/negative/junk values return null so the caller falls back to computed backoff instead of an immediate retry. - compute_backoff: exponential backoff with full jitter, capped at max_ms, that honors a server Retry-After (seconds or HTTP-date). - should_retry: budget + delay decision for a retry loop. Wire base_request to use them so only transient failures are retried, with jittered backoff so a burst of concurrent calls no longer retries in lockstep. Backoff is configurable via BASE_BACKOFF_MS (default 500) and MAX_BACKOFF_MS (default 30000). Document the new env knobs and the worst-case added latency in the README. Reduce retry logging from one stderr line per attempt to a single concise summary line per request (only on final give-up), so the brightdata#104 burst (50-100 calls x up to 3 retries) no longer floods stderr. Clamp BASE_MAX_RETRIES to a sane 0-3 integer so a negative or non-numeric value behaves as 0 instead of skipping the request loop entirely and throwing undefined.
This was referenced Jun 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split out from #141 per @meirk-brd's review. This PR is only the
#104retry/backoff fix.geo_fanoutis held back and will return as its own PR once its target-status sourcing is reworked (see the note at the end).What #104 asked for
Issue #104 reports intermittent
502 Bad Gateway(and504) from the gateway under burst load (~50-100 MCP calls), where subsequent requests fail as well, and notes there is no documented retry / backoff policy specific to MCP usage.The previous
base_requestretried on every non-4xx error in a tight loop with no delay (timeout: base_timeoutonly), which under a burst can amplify the overload rather than relieve it, and it had no notion ofRetry-After.The fix: retry_utils.js
Three pure, dependency-free functions (no transport-library coupling), each table-tested.
classify_response(input, now)maps an HTTP response or a thrown transport error to a stable outcome taxonomy:success(2xx)redirect(3xx, given its own self-consistent outcome rather than being mislabeledfatal)retryable(408/425/500/502/503/504, unenumerated 5xx, and retryable network codes likeECONNRESET/ETIMEDOUT/UND_ERR_CONNECT_TIMEOUT)rate_limited(429, surfacingRetry-After)blocked(403/451, terminal, not a silent retry)client_error(other 4xx, terminal)fatal(unclassifiable, never silently retried)parse_retry_afteraccepts only a non-negative integer number of seconds or a valid HTTP-date. Anything malformed (fractional1.5, negative-3, numeric-looking junk) returnsnull, so the caller falls back to its computed backoff instead of reading it as a past date and retrying immediately.compute_backoff(attempt, opts, rng)is exponential backoff with full jitter (so a burst of concurrent callers does not retry in lockstep), capped atmax_ms, with a server-suppliedRetry-After(integer seconds or HTTP-date) always winning.rngis injectable for deterministic tests.should_retry(classification, attempt, max_retries, opts, rng)combines the above into a{retry, delay_ms}decision honoring the retry budget.base_requestnow retries only transient failures, with jittered backoff. New optional env knobsBASE_BACKOFF_MS(default 500) andMAX_BACKOFF_MS(default 30000). The existingBASE_MAX_RETRIESstill bounds the attempt count (clamped to the documented 0-3 range, NaN-safe).Addressing the review on this half
BASE_BACKOFF_MS/MAX_BACKOFF_MSknobs and the new worst-case latency: a single call can now block for tens of seconds before it ultimately fails (with the defaults, up to roughly 90s of backoff across three capped waits), so callers that need to fail fast can lower the knobs.base_requestno longer logs one line per attempt. It emits a single concise summary line only when it gives up after at least one retry, so the burst Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 describes (50-100 calls, each retrying up to 3 times) no longer floods stderr. Success and first-try non-retryable failures stay silent.Tests
Table-driven, using the repo's existing
node --test/node:assert/strictsetup. No new dependencies.test/retry-utils.test.jscovers theclassify_responsetaxonomy (including the 502/504 from #104 and the 3xxredirectoutcome),parse_retry_after(strict rejection of fractional/negative/junk),compute_backoff(jitter modes, cap, and Retry-After override), and theshould_retrybudget (including a 3xx never looping).Run
npm test: 21 tests, 21 pass, 0 fail (against the currentmain).Notes
'use strict'; /*jslint ...*/header, snake_case, single quotes, 4-space indent).BASE_MAX_RETRIESunset (default 0), a failing call still throws on the first try with no added latency. The retry path is opt-in.On geo_fanout (held for a follow-up PR)
You were right that classifying
axios response.statusdoes not work against the Unlocker/requestendpoint, since that is the gateway status (a successful proxy fetch returns 200 even when the target returned 403 or a redirect), and axios follows the 3xx before the tool ever sees it. I will resubmitgeo_fanoutseparately once it reads the target status the wayscrape_as_html/extractdo (thex-brd-*headers) withmaxRedirects: 0, and the tests exercise the real Unlocker response shape rather than synthetic statuses. Happy to compare notes on the status-passthrough when I open it.