feat: geo_fanout with first-class classified per-geo results (target status via Unlocker format:json)#156
Open
Yashash4 wants to merge 2 commits into
Open
Conversation
…rightdata#104) Intermittent 502/504 from the MCP gateway under burst load had no retry guidance and the previous base_request retried every non-4xx in a tight loop with no delay, which could amplify the overload. Add retry_utils.js with pure, table-tested helpers: - classify_response: stable outcome taxonomy (success / redirect / retryable / rate_limited / blocked / client_error / fatal) over HTTP status and network error codes; 502/504/503/500/408 are retryable, 403/451 are a first-class BLOCKED outcome, 3xx are a non-retryable REDIRECT, 4xx are terminal. - parse_retry_after: strict RFC 9110 parsing (integer seconds or a date-shaped HTTP-date); fractional/negative/junk values return null so the caller falls back to computed backoff instead of an immediate retry. - compute_backoff: exponential backoff with full jitter, capped at max_ms, that honors a server Retry-After (seconds or HTTP-date). - should_retry: budget + delay decision for a retry loop. Wire base_request to use them so only transient failures are retried, with jittered backoff so a burst of concurrent calls no longer retries in lockstep. Backoff is configurable via BASE_BACKOFF_MS (default 500) and MAX_BACKOFF_MS (default 30000). Document the new env knobs and the worst-case added latency in the README. Reduce retry logging from one stderr line per attempt to a single concise summary line per request (only on final give-up), so the brightdata#104 burst (50-100 calls x up to 3 retries) no longer floods stderr. Clamp BASE_MAX_RETRIES to a sane 0-3 integer so a negative or non-numeric value behaves as 0 instead of skipping the request loop entirely and throwing undefined.
… the gateway 200)
The geo_fanout executor called the Unlocker /request with format:'raw' and then
classified axios response.status, which is always the gateway's 200. A target
403/451/redirect was therefore misclassified as ok, defeating the whole point of
the tool (surfacing geo-gating as a first-class classified result).
Fix: call /request with format:'json'. The response body is the envelope
{status_code, headers, body} where status_code is the TARGET's HTTP status and
headers are the TARGET's response headers. A 3xx is surfaced WITHOUT the Unlocker
following it, so a redirect keeps its Location header. We classify on that real
target status.
- geo_utils.js: add parse_unlocker_json(data) mapping {status_code, headers, body}
to the {status, headers, body} shape build_geo_entry expects (tolerant of a
missing/non-object/JSON-string input). build_geo_entry now carries the rendered
body through and documents exit_ip as best-effort null (the json envelope has no
gateway x-brd-* headers; we never fabricate an IP).
- server.js: executor uses format:'json' + responseType:'json', parses with
parse_unlocker_json, passes the target {status, headers} into build_geo_entry,
and renders the body to markdown via the existing remark/strip pipeline when
data_format is markdown (raw otherwise). Per-geo country targeting preserved.
- tests: geo-utils fixtures now use the REAL Unlocker envelope shape and drive
build_geo_entry end-to-end through parse_unlocker_json (403 -> blocked, 302 with
a cross-host location -> redirected, 200 -> ok, 429 -> rate_limited, thrown
transport error -> error); parse_unlocker_json gets its own table-driven test.
The stdio registration test is unchanged.
Based on PR brightdata#155 (retry/backoff for brightdata#104); retry_utils.js classification is reused.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is the
geo_fanouthalf of the original #141, reworked to fix the correctness issue @meirk-brd caught in review. It builds on #155 (the #104 retry/backoff helpers, whichgeo_fanoutreuses), so until #155 merges this PR also shows those commits. Once #155 lands I will rebase this ontomainso it showsgeo_fanoutonly.What geo_fanout does
Fetches the SAME url from multiple country exits in parallel (via the Web Unlocker
countrytargeting) and returns one structured report. A geo that is blocked (403/451), redirected (3xx to a different host), rate-limited (429), or fails transiently becomes a first-class classified result, not a discarded error. Useful for spotting geo-gating and regional price/availability/access differences at a glance.The fix (your review was right)
The earlier version requested
format: 'raw'and classifiedaxios response.status, which against the Unlocker/requestendpoint is always the gateway's 200, so a target 403/451/3xx came back asok. The other tools were unaffected because they do not classify per-geo status, but that is preciselygeo_fanout's purpose.The rework requests
format: 'json'. The Unlocker then returns a{status_code, headers, body}envelope wherestatus_codeis the TARGET's HTTP status andheadersare the TARGET's response headers, and it surfaces a 3xx WITHOUT following it (so theLocationis preserved). A small pure helperparse_unlocker_jsonmaps that envelope to{status, headers, body}, andclassify_responseruns on the real target status. NomaxRedirectschange is needed because json mode does not follow the redirect.data_formatstill controls how the target body is rendered (markdown via the existing remark pipeline, or raw).I verified this against the LIVE Unlocker, not synthetic fixtures:
httpbin.org/status/403across de+be: both classifiedblocked,http_status403,any_blockedtrue.redirected,http_status302,redirect.locationcaptured,cross_hosttrue,any_redirectedtrue.exit_ipis null in json mode (the envelope carries target headers, not the gatewayx-brd-*ip); reported as null, never fabricated.Tests
test/geo-utils.test.jsis rebuilt around the REAL Unlocker envelope (the prior tests used synthetic axios statuses, which is what hid the bug). Coverage:parse_unlocker_json(200/403/302/JSON-string/null/non-JSON/missing status_code);build_geo_entryend to end (403 blocked, 451 blocked, 302 cross-host redirected with Location, 301 same-host cross_host false, 429 rate_limited with Retry-After, 502 error, thrown ECONNRESET error);normalize_geosvalid plus loud-throw;summarize_fanoutnothing dropped. The stdio registration test is unchanged.npm test: 30 tests, 30 pass, 0 fail.Notes
geo_fanoutreusesbase_request, so its per-geo fetches inherit the same classified retry/backoff from fix: classify gateway responses and add retry/backoff policy (closes #104) #155.format: 'json'envelope.