fix(mcp/oauth): discover RFC 8414 §3.1 path-aware metadata URLs#2877
fix(mcp/oauth): discover RFC 8414 §3.1 path-aware metadata URLs#2877dgageot wants to merge 4 commits into
Conversation
Adds TestCatalogOAuthDiscoveryLive, a subtest-per-server probe over every
oauth server in the embedded catalog. For each it verifies the structural
prerequisites docker-agent needs:
- 401 + WWW-Authenticate from the MCP endpoint
- protected-resource metadata reachable
- authorization-server metadata reachable (with openid-configuration
fallback)
- HTTPS registration_endpoint (Dynamic Client Registration)
- PKCE S256 advertised
Skipped by default because it makes real HTTPS calls to ~17 third-party
servers and is unsuitable for CI without explicit opt-in. Run with:
MCP_CATALOG_OAUTH_LIVE=1 go test -run TestCatalogOAuthDiscoveryLive \
-v -count=1 -timeout=180s ./pkg/tools/builtin/mcpcatalog
api.grafbase.com/mcp returns 404 on every plausible path (including the bare origin), and the well-known oauth-protected-resource is gone too. The entry has been dead long enough that it is now actively misleading in search results — surfaced by TestCatalogOAuthDiscoveryLive. If grafbase relaunches their MCP server we can re-add it.
When the issuer URL has a path component (e.g. https://access.stripe.com/mcp), RFC 8414 §3.1 specifies that the well-known suffix must be inserted between the origin and the path: https://access.stripe.com/.well-known/oauth-authorization-server/mcp Stripe's authorization server follows the spec; the legacy 'append the well-known suffix to the full issuer URL' form 404s. Until now getAuthorizationServerMetadata only tried the append form (and an OIDC fallback), so for stripe-remote it silently fell through to createDefaultMetadata — leaving the OAuth flow with synthesised endpoints that fail later in handshake. Refactor metadata discovery to walk an ordered candidate list: 1. RFC 8414 §3.1 path-aware oauth-authorization-server (NEW), 2. legacy append-form oauth-authorization-server (preserved for the many widely-deployed servers that ship that way), 3. path-aware openid-configuration (NEW), 4. append-form openid-configuration. A 200 wins; 404s flow through; any other status is a hard error so we don't paper over a misconfigured auth server. URLs already containing '/.well-known/' pass through untouched. Verified by: - new TestMetadataDiscoveryURLs and TestGetAuthorizationServerMetadata_* unit tests covering all four issuer shapes, - TestCatalogOAuthDiscoveryLive (live probe, opt-in) which now reports 16/16 oauth servers passing — including stripe-remote. The live probe duplicates the same candidate ordering on purpose so it remains a black-box audit independent of any future refactor of the discovery code.
The RFC 8414 §3.1 path-aware variant added in e3fed03 is a speculative guess about where an authorization server *might* publish its metadata. Several real-world deployments answer that URL with something other than 404 (e.g. a gateway that returns 403 for unknown well-known prefixes, or a 5xx from an upstream component) while still serving valid metadata at the legacy 'append-to-issuer' URL. Before this change, getAuthorizationServerMetadata short-circuited on the first non-404 status with a hard error, even though further candidates would have succeeded. That regressed every issuer with a path component whose path-aware endpoint doesn't 404. Probe semantics now: - 200 with a decodable body wins immediately. - Non-404 statuses, transport errors and JSON-decode failures are logged and the next candidate is tried. - If at least one candidate produced a non-404 status or a transport-level error and none returned 200, surface the most diagnostic failure so a misconfigured auth server is still visible. - If every candidate 404'd, fall back to default metadata exactly as before (legacy behaviour). Covered by three new tests: - TestGetAuthorizationServerMetadata_NonFatalCandidateStatus: 403 on candidate #1, 200 on candidate docker#2 → success. - TestGetAuthorizationServerMetadata_AllUnreachableSurfacesError: every candidate 500s → error surfaced. - TestGetAuthorizationServerMetadata_All404FallsBackToDefaults: legacy behaviour preserved.
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
The RFC 8414 §3.1 path-aware OAuth metadata discovery implementation looks correct and well-tested. The refactored getAuthorizationServerMetadata properly walks an ordered candidate list, the URL construction in metadataDiscoveryURLs correctly implements the spec, and the error prioritisation logic (prefer non-404 HTTP status over transport errors) is a reasonable design choice with no silent-failure paths.
Reviewed:
pkg/tools/mcp/oauth.go— candidate-URL builder and probe loop logic ✅pkg/tools/mcp/oauth_test.go— coverage of path-aware, append-form, non-fatal-status, all-404, and all-500 cases ✅pkg/tools/builtin/mcpcatalog/servers.json— Grafbase removal and count update ✅pkg/tools/builtin/mcpcatalog/mcpcatalog_test.go— live probe and its local URL-math copy ✅
Minor observation (no action required): authServerMetadataCandidates in mcpcatalog_test.go is intentionally duplicated from production; the comment acknowledges this. If the production metadataDiscoveryURLs gains new fallbacks in the future, keeping the live probe in sync manually is a small maintenance burden, but it's an acceptable trade-off for black-box independence.
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟡 NEEDS ATTENTION
Two medium-severity issues found in the new RFC 8414 §3.1 path-aware metadata discovery code in pkg/tools/mcp/oauth.go. Both are in newly added code.
Findings summary:
- 🟡 [Medium/Confirmed] Non-200 response body not drained in
fetchAuthorizationServerMetadata— prevents HTTP/1.1 connection reuse across the up-to-4 candidate probes - 🟡 [Medium/Likely]
metadataDiscoveryURLssilently drops query parameters fromauthServerURL— affects real-world multi-tenant auth servers (e.g. Keycloak)
The core RFC 8414 §3.1 URL construction logic, candidate ordering, and error-surfacing behavior look correct. Test coverage for the new flow is thorough.
| } | ||
| oidcReq.Header.Set("Accept", "application/json") | ||
| if resp.StatusCode != http.StatusOK { | ||
| return nil, resp.StatusCode, nil |
There was a problem hiding this comment.
[MEDIUM] Non-200 response body not drained — prevents HTTP/1.1 connection reuse
When resp.StatusCode != http.StatusOK, the function returns early without draining the response body. The defer resp.Body.Close() will close it, but Go's http.Transport requires the body to be fully read to EOF before it can be reused — closing without draining forces a new TCP connection for the next candidate.
This function is called up to 4 times per discovery call (one per candidate URL). Each non-200 response (e.g. 403, 404 from the path-aware variant on Auth0) abandons the connection, so a Stripe-style discovery that hits the path-aware URL 404 before finding the spec-compliant one will open 2–4 connections instead of reusing 1.
Fix: drain the body before returning:
if resp.StatusCode != http.StatusOK {
_, _ = io.Copy(io.Discard, resp.Body)
return nil, resp.StatusCode, nil
}| return nil, fmt.Errorf("invalid authorization server URL %q: %w", authServerURL, err) | ||
| } | ||
| origin := parsed.Scheme + "://" + parsed.Host | ||
| path := strings.TrimSuffix(parsed.Path, "/") |
There was a problem hiding this comment.
[MEDIUM/LIKELY] Query parameters in authServerURL are silently dropped when building candidate URLs
metadataDiscoveryURLs uses only parsed.Scheme, parsed.Host, and parsed.Path — parsed.RawQuery is never included. If authServerURL contains query parameters (e.g. Keycloak's https://sso.example.com/auth?realm=myrealm), the candidate URLs will all omit the query string, pointing to non-existent or wrong metadata endpoints.
RFC 8414 §2 says issuers MUST NOT contain query components, but real-world multi-tenant servers (Keycloak is the most common example) routinely do. The code has no guard that rejects such URLs, so they silently produce broken discovery URLs with no warning logged.
Suggested fix: Either reject URLs with query strings early and surface a clear error, or include parsed.RawQuery in candidate URL construction where appropriate:
path := strings.TrimSuffix(parsed.Path, "/")
query := ""
if parsed.RawQuery != "" {
query = "?" + parsed.RawQuery
}
// ... then append query to each candidate that ends with the path segmentAlternatively, document that query-parameterized issuer URLs are not supported and add a url.Parse + parsed.RawQuery != "" check that returns an explicit error.
aheritier
left a comment
There was a problem hiding this comment.
The RFC 8414 §3.1 implementation is correct and the candidate-walking logic is sound. Two medium issues worth fixing before merge.
Issue 1 — Response body not drained in fetchAuthorizationServerMetadata (connection-pool leak)
When resp.StatusCode != http.StatusOK, the function returns immediately and lets defer resp.Body.Close() fire without having read the body. Go's net/http transport only reuses a TCP connection when the response body is fully consumed before close. With up to four speculative candidate probes per OAuth session init (e.g. three 404s followed by a 200 for Stripe), this leaks three connections per handshake.
Fix — one line, same pattern already used in handleUnmanagedOAuthFlow in this file:
if resp.StatusCode != http.StatusOK {
_, _ = io.ReadAll(resp.Body)
return nil, resp.StatusCode, nil
}Issue 2 — Query parameters in the issuer URL are silently dropped in metadataDiscoveryURLs
The reconstruction uses only parsed.Path, discarding parsed.RawQuery. RFC 8414 §2 says issuers MUST NOT have query components, but multi-tenant Keycloak installations sometimes advertise query-bearing issuer URLs via authorization_servers. Dropping them silently generates wrong discovery URLs with no diagnostic signal. At minimum, add a comment acknowledging the intentional drop (spec-correct but a latent hazard); ideally return an error when parsed.RawQuery != "".
|
|
||
| oidcResp, err := o.metadataClient.Do(oidcReq) | ||
| if err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
Body not drained before return — the defer resp.Body.Close() fires without consuming the body, so the transport discards the TCP connection instead of returning it to the pool.
With up to four candidate probes per discovery call (e.g. three 404s before the RFC 8414 §3.1 hit for Stripe), this creates up to three abandoned connections per OAuth session init. io is already imported.
if resp.StatusCode != http.StatusOK {
_, _ = io.ReadAll(resp.Body)
return nil, resp.StatusCode, nil
}Same pattern used in handleUnmanagedOAuthFlow in this file.
| var metadata AuthorizationServerMetadata | ||
| if err := json.NewDecoder(resp.Body).Decode(&metadata); err != nil { | ||
| return nil, fmt.Errorf("failed to decode metadata from %s: %w", metadataURL, err) | ||
| // No path: spec-compliant and append forms are identical. |
There was a problem hiding this comment.
parsed.RawQuery is never consulted — any query parameters in authServerURL are silently dropped.
RFC 8414 §2 forbids query components in issuer identifiers, so this is technically spec-correct. However, some multi-tenant Keycloak deployments do advertise query-bearing issuer URLs (e.g. https://auth.example.com/realms/master?realm=tenant) via the authorization_servers field in protected-resource metadata. Silently dropping those generates wrong discovery URLs with no diagnostic.
Options:
- Return an error when
parsed.RawQuery != ""— surfaces the misconfiguration explicitly. - Add a comment acknowledging the intentional drop so future maintainers don't have to rediscover this edge case.
OAuth servers publish their authorization-server metadata at multiple URL candidates depending on their implementation and standards compliance. The RFC 8414 §3.1 path-aware URL — placing
oauth-authorization-serverafter the authority and before the resource path — is well-established but docker-agent's discovery code never tried it, silently falling back to hardcoded default metadata that broke the OAuth handshake.A live audit test (TestCatalogOAuthDiscoveryLive) that probes every OAuth server in the embedded MCP catalog revealed two real findings: Grafbase's MCP endpoint is decommissioned, and Stripe's authorization server publishes its metadata at the RFC 8414 §3.1 path-aware URL. Dropping Grafbase (count 45 → 44) and fixing the Stripe case requires updating the metadata discovery logic to walk an ordered candidate list and try multiple URL patterns before giving up.
The refactored
getAuthorizationServerMetadatafunction now walks: RFC 8414 §3.1 path-awareoauth-authorization-server, legacy "append-to-issuer"oauth-authorization-server(preserved for widely-deployed servers that ship that way), then the OIDC variants of each. A 200 wins; non-200 statuses, transport errors, and decode failures are logged and the next candidate is tried. If all fail, the most diagnostic error is surfaced; if every candidate 404'd, default metadata is returned to preserve legacy behavior.Unit tests in
oauth_test.gocover URL discovery ordering, the RFC 8414 path-aware case, backward compatibility with the append-form, and the probe-and-decide flow. After the fix, all 16 OAuth servers in the MCP catalog pass their structural prerequisites (401+WWW-Authenticate challenge, reachable protected-resource and authorization-server metadata, HTTPS registration_endpoint, and PKCE S256).To re-run the live probe:
MCP_CATALOG_OAUTH_LIVE=1 go test -run TestCatalogOAuthDiscoveryLive -v -count=1 -timeout=180s ./pkg/tools/builtin/mcpcatalog