Skip to content

geo_fanout tool + classified retry/backoff policy (closes #104)#141

Closed
Yashash4 wants to merge 3 commits into
brightdata:mainfrom
Yashash4:feat/geo-fanout-classified-retry
Closed

geo_fanout tool + classified retry/backoff policy (closes #104)#141
Yashash4 wants to merge 3 commits into
brightdata:mainfrom
Yashash4:feat/geo-fanout-classified-retry

Conversation

@Yashash4

Copy link
Copy Markdown

geo_fanout tool + classified retry/backoff policy (closes #104)

Summary

This PR adds a resilient cross-country fetch primitive and, as its foundation,
the retry/backoff guidance requested in #104.

Two self-contained commits:

  1. fix: classify gateway responses and add retry/backoff policy (closes #104)
    — the literal Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 fix; pure, table-tested helpers + wiring into the existing
    request path. Cherry-pick-clean on its own.
  2. feat: add geo_fanout tool with first-class classified geo results
    — a new tool built on top of the Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 classification.

What #104 asked for

Issue #104 reports intermittent 502 Bad Gateway (and 504) from the gateway
under burst load (~50-100 MCP calls), where subsequent requests fail as well,
and notes there is no documented retry / backoff policy specific to MCP usage.

The previous base_request retried on every non-4xx error in a tight loop with
no delay
(timeout: base_timeout only), which under a burst can amplify the
overload rather than relieve it, and it had no notion of Retry-After.

The fix (commit 1) — retry_utils.js

Three pure, dependency-free functions (no transport library coupling), each
table-tested:

  • classify_response(input, now) — maps an HTTP response or a thrown
    transport error to a stable outcome taxonomy:

    • success (2xx)
    • redirect (3xx — follow the Location; not a retryable gateway error and
      not a hard fatal, so it gets its own self-consistent outcome rather than
      being mislabeled fatal)
    • retryable (408/425/500/502/503/504, unenumerated 5xx, and retryable
      network codes like ECONNRESET/ETIMEDOUT/UND_ERR_CONNECT_TIMEOUT)
    • rate_limited (429, surfacing Retry-After)
    • blocked (403/451 — a first-class terminal outcome, not a silent retry)
    • client_error (other 4xx — terminal)
    • fatal (unknown network code / unclassifiable — never silently retried)

    parse_retry_after accepts only a non-negative integer number of seconds
    or a valid HTTP-date; anything malformed (fractional 1.5, negative -3,
    numeric-looking junk) returns null so the caller falls back to its computed
    backoff, rather than being read as a past date and clamped to an immediate
    retry.

  • compute_backoff(attempt, opts, rng) — exponential backoff with full
    jitter
    (so a burst of concurrent callers does not retry in lockstep), capped
    at max_ms, and a server-supplied Retry-After (integer seconds or
    HTTP-date) always wins. rng is injectable for deterministic tests.

  • should_retry(classification, attempt, max_retries, opts, rng) — combines
    the two into a {retry, delay_ms} decision honoring the retry budget.

base_request now uses these: only transient failures are retried, with jittered
backoff. New optional env knobs BASE_BACKOFF_MS (default 500) and
MAX_BACKOFF_MS (default 30000); the existing BASE_MAX_RETRIES still bounds the
attempt count.

The feature (commit 2) — geo_fanout

geo_fanout fetches the same URL from multiple country exits in parallel
(via Web Unlocker's country targeting) and returns one structured report.
Crucially, a geo that is blocked (403/451), redirected (3xx to a different
host), rate-limited (429), or fails transiently becomes a first-class classified
result
in the report — not a discarded error. This makes geo-gating and regional
price/availability/access differences observable at a glance.

Parameters: url, countries (1-10 deduped 2-letter ISO codes), optional
data_format (markdown default | raw). Registered in default/pro mode and the
advanced_scraping group. The aggregation logic lives in pure, table-tested
helpers in geo_utils.js (normalize_geos, build_geo_entry,
summarize_fanout).

Example result shape:

{
  "summary": {"total": 3, "ok": 1, "blocked": 1, "redirected": 1,
              "rate_limited": 0, "error": 0},
  "any_blocked": true,
  "any_redirected": true,
  "results": [
    {"geo": "de", "status": "ok", "http_status": 200, "exit_ip": "...",
     "outcome": "success", "reason": "http 200", "redirect": null},
    {"geo": "be", "status": "blocked", "http_status": 403,
     "outcome": "blocked", "reason": "http 403 target blocked request"},
    {"geo": "fr", "status": "redirected", "http_status": 302,
     "outcome": "redirect",
     "redirect": {"redirected": true, "location": "...", "cross_host": true}}
  ]
}

Tests

All table-driven, using the repo's existing node --test / node:assert/strict
setup. No new dependencies.

  • test/retry-utils.test.jsclassify_response taxonomy (26 cases incl. the
    502/504 from Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 and the 3xx redirect outcome), parse_retry_after (incl.
    strict rejection of fractional/negative/junk), compute_backoff (jitter modes
    • cap + Retry-After override), should_retry budget (incl. a 3xx never looping).
  • test/geo-utils.test.jsnormalize_geos (valid + loud-throw cases),
    build_geo_entry (every geo first-class), summarize_fanout (nothing dropped).
  • test/geo-fanout-tool.test.js — boots the server over stdio and asserts the
    geo_fanout tool is registered with the correct schema.

Run: npm test15 tests, 15 pass, 0 fail.

Notes

Yashash4 added 3 commits May 29, 2026 21:40
…rightdata#104)

Intermittent 502/504 from the MCP gateway under burst load had no retry
guidance and the previous base_request retried every non-4xx in a tight
loop with no delay, which could amplify the overload.

Add retry_utils.js with pure, table-tested helpers:
- classify_response: stable outcome taxonomy (success / retryable /
  rate_limited / blocked / client_error / fatal) over HTTP status and
  network error codes; 502/504/503/500/408 are retryable, 403/451 are a
  first-class BLOCKED outcome, 4xx are terminal.
- compute_backoff: exponential backoff with full jitter, capped at
  max_ms, that honors a server Retry-After (seconds or HTTP-date).
- should_retry: budget + delay decision for a retry loop.

Wire base_request to use them so only transient failures are retried,
with jittered backoff so a burst of concurrent calls no longer retries
in lockstep. Backoff is configurable via BASE_BACKOFF_MS / MAX_BACKOFF_MS.
geo_fanout fetches the same URL from multiple country exits in parallel
(Web Unlocker country targeting) and returns one structured report. A geo
that is blocked (403/451), redirected (3xx to a different host),
rate-limited (429) or fails transiently becomes a FIRST-CLASS classified
result rather than a discarded error, so geo-gating and regional
access/price differences are observable at a glance.

geo_utils.js holds the pure, table-tested aggregation logic
(normalize_geos, build_geo_entry, summarize_fanout) built on the brightdata#104
classify_response taxonomy. The tool is registered in default/pro mode and
in the advanced_scraping group.
classify_response now maps 3xx to a dedicated REDIRECT outcome instead of
the contradictory FATAL (a redirect is neither a retryable gateway error nor
a hard fatal); should_retry leaves it non-retryable so no 3xx retry loop is
introduced, and geo_fanout's per-geo outcome metadata is now self-consistent
with its REDIRECTED status.

parse_retry_after now accepts only a non-negative integer number of seconds
or a date-shaped HTTP-date; malformed values (fractional '1.5', negative '-3',
numeric-looking junk) return null so the caller falls back to computed backoff
instead of being read by permissive Date.parse as a past date and clamped to
an immediate retry.
@Yashash4 Yashash4 force-pushed the feat/geo-fanout-classified-retry branch from d83d859 to 55fdd97 Compare May 29, 2026 16:11
@meirk-brd

Copy link
Copy Markdown
Collaborator

Thanks for this, it's a careful and well-structured PR. The retry/backoff helpers are clean, pure, follow the repo conventions, and the table-driven tests are thorough. I want to separate the two halves though, because they stand on pretty different ground for me.

Commit 1 (the #104 retry fix) — in favor

This is the right shape for what #104 actually asked for: gateway-level 502/504 under burst load, no backoff, no Retry-After handling. The old base_request retried every non-4xx in a tight loop with zero delay, which amplifies a gateway overload rather than relieving it. Classified retries with exponential backoff and full jitter address that directly. parse_retry_after rejecting fractional/negative/junk values so it falls back to computed backoff instead of retrying immediately is a good touch.

Two smaller things on commit 1:

  1. Retry logging volume. Every retry attempt logs to stderr. Under the burst Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance) #104 describes (50-100 calls, each retrying up to 3 times) that's a lot of stderr noise. Minor, but worth being aware of.

  2. Worst-case latency change. A 5xx was already retried before, but now the retry waits between attempts, up to max_ms (default 30s). With BASE_MAX_RETRIES up to 3, a single tool call can now block for tens of seconds before it finally fails. That's the correct resilience tradeoff, but it's a latency change existing callers (and their MCP client timeouts) don't expect. Worth documenting the new worst-case latency and the BASE_BACKOFF_MS / MAX_BACKOFF_MS knobs in the README.

Commit 2 (geo_fanout) — one correctness concern that undercuts the core feature

The premise is detecting per-geo blocked (403/451) and redirected (3xx) by classifying response.status. But against the Web Unlocker /request endpoint, response.status is the gateway's status, not the target site's. The unlocker fetches the target and returns 200 to us on a successful proxy even when the target page itself returned 403 or issued a redirect. The target's real status comes back in the body or in x-brd-* headers, not in axios response.status. The other tools in this file (scrape_as_html, extract, search_engine) all treat it that way.

Two consequences:

  • A geo the target actually blocks (403/451) will usually come back as a successful 200 proxy fetch and get classified ok. The blocked path won't fire the way the examples in the description show.
  • The redirect detection has a second issue: axios follows redirects by default (maxRedirects: 5), so even on a direct fetch a 3xx gets followed and you see the final 200, never the 302. There's no maxRedirects: 0 set, so detect_redirect and the redirected count won't trigger in the live path either.

So the classification machinery is correct in the abstract and well-tested in isolation, but the data it actually receives in production (gateway 200s, redirects already followed) means the headline "first-class blocked/redirected geo results" mostly won't surface. The tests pass because they build synthetic {status: 403} / {status: 302} attempts directly rather than exercising the real unlocker response shape.

To make geo_fanout deliver what it promises, it'd need to read the target status the way the unlocker reports it (via the x-brd-* headers or a mode that surfaces the upstream status) rather than axios response.status. Until that's sorted I'd hold this half, even though the surrounding code is good.

Suggestion

Split the PR. Commit 1 is a clean, valuable fix for #104 and I'd merge it on its own. The author already notes it's cherry-pick-clean. That way the #104 fix isn't blocked behind the geo_fanout question, and geo_fanout can land separately once the target-status sourcing is worked out.

Tests pass locally (15/15 after npm install). Happy to dig into the unlocker status-passthrough side with you if that helps.

@Yashash4

Copy link
Copy Markdown
Author

Split this per your suggestion, @meirk-brd. The #104 retry/backoff fix is now its own PR, #155, with the README latency/knob docs and the reduced retry logging you flagged folded in (and rebased onto the current main).

Closing this one to keep the tracker tidy. geo_fanout will come back as a separate PR once I have reworked it to read the target's status the way scrape_as_html / extract do (the x-brd-* headers) with maxRedirects: 0, and rebuilt the tests around the real Unlocker response shape instead of synthetic statuses. Your call on that was exactly right. Thanks again for the careful review, and I will follow up on that PR to compare notes on the status-passthrough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent 502 Bad Gateway errors from MCP endpoint under moderate call volume (no retry guidance)

2 participants