geo_fanout tool + classified retry/backoff policy (closes #104)#141
Yashash4 wants to merge 3 commits into
…rightdata#104)

Intermittent 502/504 from the MCP gateway under burst load had no retry guidance, and the previous base_request retried every non-4xx in a tight loop with no delay, which could amplify the overload.

Add retry_utils.js with pure, table-tested helpers:
- classify_response: stable outcome taxonomy (success / retryable / rate_limited / blocked / client_error / fatal) over HTTP status and network error codes; 502/504/503/500/408 are retryable, 403/451 are a first-class BLOCKED outcome, other 4xx are terminal.
- compute_backoff: exponential backoff with full jitter, capped at max_ms, that honors a server Retry-After (seconds or HTTP-date).
- should_retry: budget + delay decision for a retry loop.

Wire base_request to use them so only transient failures are retried, with jittered backoff so a burst of concurrent calls no longer retries in lockstep. Backoff is configurable via BASE_BACKOFF_MS / MAX_BACKOFF_MS.
geo_fanout fetches the same URL from multiple country exits in parallel (Web Unlocker country targeting) and returns one structured report. A geo that is blocked (403/451), redirected (3xx to a different host), rate-limited (429) or fails transiently becomes a FIRST-CLASS classified result rather than a discarded error, so geo-gating and regional access/price differences are observable at a glance. geo_utils.js holds the pure, table-tested aggregation logic (normalize_geos, build_geo_entry, summarize_fanout) built on the brightdata#104 classify_response taxonomy. The tool is registered in default/pro mode and in the advanced_scraping group.
classify_response now maps 3xx to a dedicated REDIRECT outcome instead of the contradictory FATAL (a redirect is neither a retryable gateway error nor a hard fatal); should_retry leaves it non-retryable so no 3xx retry loop is introduced, and geo_fanout's per-geo outcome metadata is now self-consistent with its REDIRECTED status. parse_retry_after now accepts only a non-negative integer number of seconds or a date-shaped HTTP-date; malformed values (fractional '1.5', negative '-3', numeric-looking junk) return null so the caller falls back to computed backoff instead of being read by permissive Date.parse as a past date and clamped to an immediate retry.
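The strict Retry-After parsing described in this commit could look roughly like the sketch below. It is not the code from retry_utils.js: the millisecond return value, the `now` parameter, and the exact date-shape regex are assumptions made for illustration.

```javascript
'use strict';

// Hypothetical sketch of strict Retry-After parsing.
// Returns a delay in milliseconds (an assumption), or null when malformed.
function parse_retry_after(value, now = Date.now()) {
    if (typeof value !== 'string') {
        return null;
    }
    const text = value.trim();
    // Delta-seconds form: digits only, so '1.5', '-3' and '12abc' are rejected.
    if (/^\d+$/.test(text)) {
        return Number(text) * 1000;
    }
    // HTTP-date form, e.g. 'Wed, 21 Oct 2015 07:28:00 GMT'. Checking the shape
    // first keeps the permissive Date.parse away from numeric-looking junk.
    const http_date = /^[A-Z][a-z]{2}, \d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2} GMT$/;
    if (!http_date.test(text)) {
        return null;
    }
    const when = Date.parse(text);
    if (Number.isNaN(when)) {
        return null;
    }
    // A date already in the past means no extra wait is required.
    return Math.max(0, when - now);
}
```

A `null` result is the signal for the caller to fall back to its computed backoff instead of retrying immediately.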
d83d859 to 55fdd97
Thanks for this, it's a careful and well-structured PR. The retry/backoff helpers are clean, pure, follow the repo conventions, and the table-driven tests are thorough. I want to separate the two halves though, because they stand on pretty different ground for me.

Commit 1 (the #104 retry fix): in favor

This is the right shape for what #104 actually asked for: gateway-level 502/504 under burst load, no backoff, no

Two smaller things on commit 1:
Commit 2 (
Split this per your suggestion, @meirk-brd. The #104 retry/backoff fix is now its own PR, #155, with the README latency/knob docs and the reduced retry logging you flagged folded in (and rebased onto the current

Closing this one to keep the tracker tidy.
geo_fanout tool + classified retry/backoff policy (closes #104)
Summary
This PR adds a resilient cross-country fetch primitive and, as its foundation,
the retry/backoff guidance requested in #104.
Two self-contained commits:
- `fix: classify gateway responses and add retry/backoff policy (closes #104)`: the literal #104 fix; pure, table-tested helpers plus wiring into the existing request path. Cherry-pick-clean on its own.
- `feat: add geo_fanout tool with first-class classified geo results`: a new tool built on top of the #104 classification.
What #104 asked for
Issue #104 reports intermittent `502 Bad Gateway` (and `504`) from the gateway under burst load (~50-100 MCP calls), where subsequent requests fail as well, and notes there is no documented retry / backoff policy specific to MCP usage.

The previous `base_request` retried on every non-4xx error in a tight loop with no delay (`timeout: base_timeout` only), which under a burst can amplify the overload rather than relieve it, and it had no notion of `Retry-After`.

The fix (commit 1): `retry_utils.js`

Three pure, dependency-free functions (no transport library coupling), each table-tested:
- `classify_response(input, now)`: maps an HTTP response or a thrown transport error to a stable outcome taxonomy:
  - `success` (2xx)
  - `redirect` (3xx: follow the `Location`; not a retryable gateway error and not a hard fatal, so it gets its own self-consistent outcome rather than being mislabeled `fatal`)
  - `retryable` (408/425/500/502/503/504, unenumerated 5xx, and retryable network codes like `ECONNRESET`/`ETIMEDOUT`/`UND_ERR_CONNECT_TIMEOUT`)
  - `rate_limited` (429, surfacing `Retry-After`)
  - `blocked` (403/451: a first-class terminal outcome, not a silent retry)
  - `client_error` (other 4xx, terminal)
  - `fatal` (unknown network code / unclassifiable; never silently retried)

  `parse_retry_after` accepts only a non-negative integer number of seconds or a valid HTTP-date; anything malformed (fractional `1.5`, negative `-3`, numeric-looking junk) returns `null` so the caller falls back to its computed backoff, rather than being read as a past date and clamped to an immediate retry.
- `compute_backoff(attempt, opts, rng)`: exponential backoff with full jitter (so a burst of concurrent callers does not retry in lockstep), capped at `max_ms`, and a server-supplied `Retry-After` (integer seconds or HTTP-date) always wins. `rng` is injectable for deterministic tests.
- `should_retry(classification, attempt, max_retries, opts, rng)`: combines the two into a `{retry, delay_ms}` decision honoring the retry budget.

`base_request` now uses these: only transient failures are retried, with jittered backoff. New optional env knobs `BASE_BACKOFF_MS` (default 500) and `MAX_BACKOFF_MS` (default 30000); the existing `BASE_MAX_RETRIES` still bounds the attempt count.
The feature (commit 2): `geo_fanout`

`geo_fanout` fetches the same URL from multiple country exits in parallel (via Web Unlocker's `country` targeting) and returns one structured report.

Crucially, a geo that is blocked (403/451), redirected (3xx to a different
host), rate-limited (429), or fails transiently becomes a first-class classified
result in the report — not a discarded error. This makes geo-gating and regional
price/availability/access differences observable at a glance.
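
To make the "nothing is discarded" idea concrete, the aggregation might look roughly like this. It is a minimal sketch, not the code from `geo_utils.js`: the outcome-to-status mapping, the entry fields, and treating every 3xx as `redirected` (without the cross-host check) are simplifying assumptions.

```javascript
'use strict';

// Assumed mapping from a classification outcome to the per-geo report status.
// Anything unlisted (retryable, client_error, fatal) is reported as 'error'.
const STATUS_BY_OUTCOME = new Map([
    ['success', 'ok'],
    ['blocked', 'blocked'],
    ['redirect', 'redirected'],
    ['rate_limited', 'rate_limited'],
]);

// Every geo yields an entry, whatever its outcome; failures are not thrown away.
function build_geo_entry(geo, classification) {
    return {
        geo,
        status: STATUS_BY_OUTCOME.get(classification.outcome) || 'error',
        http_status: classification.status ?? null,
        outcome: classification.outcome,
    };
}

// Count entries per status and keep the full list alongside the totals.
function summarize_fanout(entries) {
    const summary = {
        total: entries.length,
        ok: 0,
        blocked: 0,
        redirected: 0,
        rate_limited: 0,
        error: 0,
    };
    for (const entry of entries) {
        summary[entry.status] += 1;
    }
    return {
        summary,
        any_blocked: summary.blocked > 0,
        any_redirected: summary.redirected > 0,
        results: entries,
    };
}
```

Because `total` always equals the number of requested geos, a caller can check that the per-status counts add up and that no country silently disappeared from the report.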

Parameters: `url`, `countries` (1-10 deduped 2-letter ISO codes), optional `data_format` (`markdown` default | `raw`). Registered in default/pro mode and the `advanced_scraping` group. The aggregation logic lives in pure, table-tested helpers in `geo_utils.js` (`normalize_geos`, `build_geo_entry`, `summarize_fanout`).

Example result shape:
```json
{
    "summary": {"total": 3, "ok": 1, "blocked": 1, "redirected": 1, "rate_limited": 0, "error": 0},
    "any_blocked": true,
    "any_redirected": true,
    "results": [
        {"geo": "de", "status": "ok", "http_status": 200, "exit_ip": "...", "outcome": "success", "reason": "http 200", "redirect": null},
        {"geo": "be", "status": "blocked", "http_status": 403, "outcome": "blocked", "reason": "http 403 target blocked request"},
        {"geo": "fr", "status": "redirected", "http_status": 302, "outcome": "redirect", "redirect": {"redirected": true, "location": "...", "cross_host": true}}
    ]
}
```

Tests
All table-driven, using the repo's existing `node --test` / `node:assert/strict` setup. No new dependencies.
- `test/retry-utils.test.js`: `classify_response` taxonomy (26 cases incl. the 502/504 from #104 and the 3xx `redirect` outcome), `parse_retry_after` (incl. strict rejection of fractional/negative/junk), `compute_backoff` (jitter modes), `should_retry` budget (incl. a 3xx never looping).
- `test/geo-utils.test.js`: `normalize_geos` (valid + loud-throw cases), `build_geo_entry` (every geo first-class), `summarize_fanout` (nothing dropped).
- `test/geo-fanout-tool.test.js`: boots the server over stdio and asserts the `geo_fanout` tool is registered with the correct schema.

Run: `npm test` → 15 tests, 15 pass, 0 fail.

Notes
'use strict'; /*jslint ...*/header, snake_case, single quotes, 4-space indent).
main, its tests pass standalone) for reviewers who want only the #104 fix.