feat: geo_fanout with first-class classified per-geo results (target status via Unlocker format:json) #156

Open
Yashash4 wants to merge 2 commits into brightdata:main from Yashash4:feat/geo-fanout-target-status

Conversation

@Yashash4

Summary

This is the geo_fanout half of the original #141, reworked to fix the correctness issue @meirk-brd caught in review. It builds on #155 (the #104 retry/backoff helpers, which geo_fanout reuses), so until #155 merges this PR also shows those commits. Once #155 lands I will rebase this onto main so it shows geo_fanout only.

What geo_fanout does

Fetches the same URL from multiple country exits in parallel (via Web Unlocker country targeting) and returns one structured report. A geo that is blocked (403/451), redirected (3xx to a different host), rate-limited (429), or fails transiently becomes a first-class classified result, not a discarded error. Useful for spotting geo-gating and regional price/availability/access differences at a glance.
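
To make the "nothing is discarded" idea concrete, here is a minimal sketch of the fan-out shape. It is not the code in this PR: `geo_fanout_sketch`, `classify_status`, and the injected `fetch_one(url, geo)` function are hypothetical names used only for illustration, and the status mapping is a simplification of the outcomes listed above.

```javascript
// Hypothetical sketch, not the PR's implementation.
// fetch_one is an assumed injected function: (url, geo) => Promise<{status}>.
function classify_status(status) {
    if (status === 403 || status === 451) return 'blocked';
    if (status === 429) return 'rate_limited';
    if (status >= 300 && status < 400) return 'redirected';
    if (status >= 200 && status < 300) return 'ok';
    return 'error';
}

async function geo_fanout_sketch(url, geos, fetch_one) {
    // allSettled waits for every geo, so one failure never hides the others.
    const settled = await Promise.allSettled(
        geos.map((geo) => fetch_one(url, geo)),
    );
    const results = settled.map((outcome, i) => {
        const geo = geos[i];
        if (outcome.status === 'rejected') {
            // A thrown transport error becomes an entry, not a dropped geo.
            const reason = outcome.reason;
            return {
                geo,
                outcome: 'error',
                http_status: null,
                error: String(reason?.code ?? reason),
            };
        }
        const status = outcome.value.status;
        return {geo, outcome: classify_status(status), http_status: status};
    });
    return {
        url,
        results,
        any_blocked: results.some((r) => r.outcome === 'blocked'),
        any_redirected: results.some((r) => r.outcome === 'redirected'),
    };
}
```

Using `Promise.allSettled` rather than `Promise.all` is the design choice that keeps a single rejected geo from aborting the whole report.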

The fix (your review was right)

The earlier version requested format: 'raw' and classified axios response.status, which against the Unlocker /request endpoint is always the gateway's 200, so a target 403/451/3xx came back as ok. The other tools were unaffected because they do not classify per-geo status, but that is precisely geo_fanout's purpose.

The rework requests format: 'json'. The Unlocker then returns a {status_code, headers, body} envelope where status_code is the TARGET's HTTP status and headers are the TARGET's response headers, and it surfaces a 3xx WITHOUT following it (so the Location is preserved). A small pure helper parse_unlocker_json maps that envelope to {status, headers, body}, and classify_response runs on the real target status. No maxRedirects change is needed because json mode does not follow the redirect. data_format still controls how the target body is rendered (markdown via the existing remark pipeline, or raw).
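
For readers who want to see the envelope mapping in isolation, the following is a rough sketch of what such a helper could look like. It is an illustration under assumptions, not the helper from this PR: the name `parse_unlocker_json_sketch` is hypothetical, and the fallback values for malformed input (`status: null`, empty `headers`, `body: null`) are my guess at a reasonable tolerant behavior.

```javascript
// Hypothetical sketch: map the Unlocker json envelope
// {status_code, headers, body} to {status, headers, body}.
// Fallback values for bad input are assumptions, not the PR's behavior.
function parse_unlocker_json_sketch(data) {
    let envelope = data;
    if (typeof envelope === 'string') {
        // The envelope may arrive as a JSON string; non-JSON text is tolerated.
        try {
            envelope = JSON.parse(envelope);
        } catch {
            envelope = null;
        }
    }
    const is_object = envelope !== null
        && typeof envelope === 'object'
        && !Array.isArray(envelope);
    if (!is_object) {
        return {status: null, headers: {}, body: null};
    }
    const has_headers = envelope.headers !== null
        && typeof envelope.headers === 'object';
    return {
        // A missing or non-integer status_code is reported as null.
        status: Number.isInteger(envelope.status_code) ? envelope.status_code : null,
        headers: has_headers ? envelope.headers : {},
        body: envelope.body ?? null,
    };
}
```

Keeping the helper pure (no I/O, no logging) is what makes it easy to cover with table-driven tests.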

I verified this against the LIVE Unlocker, not synthetic fixtures:

  • httpbin.org/status/403 across de+be: both classified blocked, http_status 403, any_blocked true.
  • a 302 redirect across de+be: both redirected, http_status 302, redirect.location captured, cross_host true, any_redirected true.
  • exit_ip is null in json mode (the envelope carries target headers, not the gateway x-brd-* ip); reported as null, never fabricated.
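
The `redirect.location` and `cross_host` fields above could be derived roughly as follows. This is a sketch under assumptions rather than the PR's code: `describe_redirect` is a hypothetical name, and treating an unparseable `Location` as `cross_host: false` is an arbitrary choice made for the example.

```javascript
// Hypothetical sketch: derive {location, cross_host} from target headers.
function describe_redirect(request_url, headers) {
    // Header names may arrive in any case; look Location up case-insensitively.
    const key = Object.keys(headers ?? {})
        .find((name) => name.toLowerCase() === 'location');
    const location = key ? headers[key] : null;
    if (!location) {
        return {location: null, cross_host: false};
    }
    try {
        // Resolve relative Location values against the requested URL.
        const target = new URL(location, request_url);
        const origin = new URL(request_url);
        return {location, cross_host: target.hostname !== origin.hostname};
    } catch {
        // Assumption: an unparseable Location is kept but not flagged.
        return {location, cross_host: false};
    }
}
```

For example, `describe_redirect('https://example.com/a', {location: '/b'})` resolves the relative path against the original host, so a same-host 301 is not reported as cross-host.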

Tests

test/geo-utils.test.js is rebuilt around the REAL Unlocker envelope (the prior tests used synthetic axios statuses, which is what hid the bug). Coverage: parse_unlocker_json (200/403/302/JSON-string/null/non-JSON/missing status_code); build_geo_entry end to end (403 blocked, 451 blocked, 302 cross-host redirected with Location, 301 same-host with cross_host false, 429 rate_limited with Retry-After, 502 error, thrown ECONNRESET error); normalize_geos (valid input plus a loud throw on invalid input); summarize_fanout (no geo dropped). The stdio registration test is unchanged.

npm test: 30 tests, 30 pass, 0 fail.

Notes

  • No crypto, no new runtime dependencies, MIT-compatible. Follows the repo conventions (ES modules, snake_case, single quotes, 4-space indent).
  • geo_fanout reuses base_request, so its per-geo fetches inherit the same classified retry/backoff from #155 (fix: classify gateway responses and add retry/backoff policy, closes #104).
  • Happy to compare notes on the status-passthrough if there is a header you would prefer over the format: 'json' envelope.

Yashash4 added 2 commits June 11, 2026 22:44
…rightdata#104)

Intermittent 502/504 from the MCP gateway under burst load had no retry
guidance and the previous base_request retried every non-4xx in a tight
loop with no delay, which could amplify the overload.

Add retry_utils.js with pure, table-tested helpers:
- classify_response: stable outcome taxonomy (success / redirect /
  retryable / rate_limited / blocked / client_error / fatal) over HTTP
  status and network error codes; 502/504/503/500/408 are retryable,
  403/451 are a first-class BLOCKED outcome, 3xx are a non-retryable
  REDIRECT, other 4xx are terminal.
- parse_retry_after: strict RFC 9110 parsing (integer seconds or a
  date-shaped HTTP-date); fractional/negative/junk values return null so
  the caller falls back to computed backoff instead of an immediate retry.
- compute_backoff: exponential backoff with full jitter, capped at
  max_ms, that honors a server Retry-After (seconds or HTTP-date).
- should_retry: budget + delay decision for a retry loop.
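
As an illustration of the outcome taxonomy described in the list above, a classifier along these lines would produce those categories. This is a sketch, not retry_utils.js: the function name, the argument shape, the set of network error codes beyond ECONNRESET, and the treatment of unlisted 5xx statuses as `fatal` are all assumptions.

```javascript
// Hypothetical sketch of a status/error-code classifier.
const RETRYABLE_STATUSES = new Set([408, 500, 502, 503, 504]);
// Assumed list of transient network error codes (illustrative only).
const RETRYABLE_ERROR_CODES = new Set([
    'ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED', 'EAI_AGAIN',
]);

function classify_response_sketch({status = null, error_code = null} = {}) {
    if (error_code) {
        return RETRYABLE_ERROR_CODES.has(error_code) ? 'retryable' : 'fatal';
    }
    if (!Number.isInteger(status)) return 'fatal';
    if (status >= 200 && status < 300) return 'success';
    if (status >= 300 && status < 400) return 'redirect';
    if (status === 429) return 'rate_limited';
    if (status === 403 || status === 451) return 'blocked';
    // 408 is checked here, before the generic 4xx branch below.
    if (RETRYABLE_STATUSES.has(status)) return 'retryable';
    if (status >= 400 && status < 500) return 'client_error';
    // Assumption: any other status (for example 501) is not worth retrying.
    return 'fatal';
}
```

The order of the checks matters: the specific 4xx cases (429, 403/451, 408) have to be tested before the generic 4xx fallthrough.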

Wire base_request to use them so only transient failures are retried,
with jittered backoff so a burst of concurrent calls no longer retries
in lockstep. Backoff is configurable via BASE_BACKOFF_MS (default 500)
and MAX_BACKOFF_MS (default 30000).
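
A rough sketch of strict Retry-After parsing plus full-jitter backoff follows, for readers unfamiliar with the technique. It is not the code in retry_utils.js: the `_sketch` names and option names are hypothetical, the date-shape regex is one possible reading of "date-shaped", and capping a server Retry-After at max_ms is an assumption.

```javascript
// Hypothetical sketch: strict Retry-After parsing, returning milliseconds.
function parse_retry_after_sketch(value, now_ms = Date.now()) {
    if (typeof value !== 'string') return null;
    const text = value.trim();
    // Integer seconds only: '1.5', '-3' and junk fall through to null.
    if (/^\d+$/.test(text)) return Number(text) * 1000;
    // Accept only an IMF-fixdate shape such as 'Wed, 21 Oct 2015 07:28:00 GMT',
    // because Date.parse on its own accepts far too many formats.
    const http_date = /^[A-Z][a-z]{2}, \d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2} GMT$/;
    if (http_date.test(text)) {
        const at_ms = Date.parse(text);
        if (!Number.isNaN(at_ms)) return Math.max(0, at_ms - now_ms);
    }
    return null;
}

// Full jitter: pick uniformly in [0, min(max_ms, base_ms * 2^attempt)).
// attempt is zero-based; random is injectable so tests are deterministic.
function compute_backoff_sketch(attempt, options = {}) {
    const {
        base_ms = 500,
        max_ms = 30000,
        retry_after = null,
        random = Math.random,
    } = options;
    const server_ms = parse_retry_after_sketch(retry_after);
    // Assumption: a server-provided delay is honored but still capped.
    if (server_ms !== null) return Math.min(server_ms, max_ms);
    const ceiling = Math.min(max_ms, base_ms * 2 ** attempt);
    return Math.floor(random() * ceiling);
}
```

Because each caller draws its own random delay, a burst of concurrent requests that fail together spreads its retries out instead of retrying in lockstep.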

Document the new env knobs and the worst-case added latency in the
README. Reduce retry logging from one stderr line per attempt to a single
concise summary line per request (only on final give-up), so the brightdata#104
burst (50-100 calls x up to 3 retries) no longer floods stderr.

Clamp BASE_MAX_RETRIES to a sane 0-3 integer so a negative or
non-numeric value behaves as 0 instead of skipping the request loop
entirely and throwing undefined.
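
The clamping described above could look roughly like this. It is a sketch with a hypothetical helper name (`clamp_max_retries`), not the code in this commit, and truncating a fractional value such as '2.7' to 2 is an assumption.

```javascript
// Hypothetical sketch: clamp a raw env value to an integer in [min, max].
function clamp_max_retries(raw, {min = 0, max = 3} = {}) {
    const parsed = Number.parseInt(String(raw ?? ''), 10);
    // Non-numeric or missing input behaves as the minimum (0 by default).
    if (!Number.isInteger(parsed)) return min;
    return Math.min(max, Math.max(min, parsed));
}
```

A caller would typically use it as `clamp_max_retries(process.env.BASE_MAX_RETRIES)`, so the request loop always runs at least once regardless of what the environment contains.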
… the gateway 200)

The geo_fanout executor called the Unlocker /request with format:'raw' and then
classified axios response.status, which is always the gateway's 200. A target
403/451/redirect was therefore misclassified as ok, defeating the whole point of
the tool (surfacing geo-gating as a first-class classified result).

Fix: call /request with format:'json'. The response body is the envelope
{status_code, headers, body} where status_code is the TARGET's HTTP status and
headers are the TARGET's response headers. A 3xx is surfaced WITHOUT the Unlocker
following it, so a redirect keeps its Location header. We classify on that real
target status.

- geo_utils.js: add parse_unlocker_json(data) mapping {status_code, headers, body}
  to the {status, headers, body} shape build_geo_entry expects (tolerant of a
  missing/non-object/JSON-string input). build_geo_entry now carries the rendered
  body through and documents exit_ip as best-effort null (the json envelope has no
  gateway x-brd-* headers; we never fabricate an IP).
- server.js: executor uses format:'json' + responseType:'json', parses with
  parse_unlocker_json, passes the target {status, headers} into build_geo_entry,
  and renders the body to markdown via the existing remark/strip pipeline when
  data_format is markdown (raw otherwise). Per-geo country targeting preserved.
- tests: geo-utils fixtures now use the REAL Unlocker envelope shape and drive
  build_geo_entry end-to-end through parse_unlocker_json (403 -> blocked, 302 with
  a cross-host location -> redirected, 200 -> ok, 429 -> rate_limited, thrown
  transport error -> error); parse_unlocker_json gets its own table-driven test.
  The stdio registration test is unchanged.

Based on PR brightdata#155 (retry/backoff for brightdata#104); retry_utils.js classification is reused.