Skip to content

fix(mcp/oauth): discover RFC 8414 §3.1 path-aware metadata URLs#2877

Open
dgageot wants to merge 4 commits into
docker:mainfrom
dgageot:mcp/oauth-discovery-rfc8414
Open

fix(mcp/oauth): discover RFC 8414 §3.1 path-aware metadata URLs#2877
dgageot wants to merge 4 commits into
docker:mainfrom
dgageot:mcp/oauth-discovery-rfc8414

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented May 22, 2026

OAuth servers publish their authorization-server metadata at multiple URL candidates depending on their implementation and standards compliance. The RFC 8414 §3.1 path-aware URL — placing oauth-authorization-server after the authority and before the resource path — is well-established but docker-agent's discovery code never tried it, silently falling back to hardcoded default metadata that broke the OAuth handshake.

A live audit test (TestCatalogOAuthDiscoveryLive) that probes every OAuth server in the embedded MCP catalog revealed two real findings: Grafbase's MCP endpoint is decommissioned, and Stripe's authorization server publishes its metadata at the RFC 8414 §3.1 path-aware URL. Dropping Grafbase (count 45 → 44) and fixing the Stripe case requires updating the metadata discovery logic to walk an ordered candidate list and try multiple URL patterns before giving up.

The refactored getAuthorizationServerMetadata function now walks: RFC 8414 §3.1 path-aware oauth-authorization-server, legacy "append-to-issuer" oauth-authorization-server (preserved for widely-deployed servers that ship that way), then the OIDC variants of each. A 200 wins; non-200 statuses, transport errors, and decode failures are logged and the next candidate is tried. If all fail, the most diagnostic error is surfaced; if every candidate 404'd, default metadata is returned to preserve legacy behavior.

Unit tests in oauth_test.go cover URL discovery ordering, the RFC 8414 path-aware case, backward compatibility with the append-form, and the probe-and-decide flow. After the fix, all 16 OAuth servers in the MCP catalog pass their structural prerequisites (401+WWW-Authenticate challenge, reachable protected-resource and authorization-server metadata, HTTPS registration_endpoint, and PKCE S256).

To re-run the live probe:

MCP_CATALOG_OAUTH_LIVE=1 go test -run TestCatalogOAuthDiscoveryLive -v -count=1 -timeout=180s ./pkg/tools/builtin/mcpcatalog

dgageot added 4 commits May 22, 2026 16:51
Adds TestCatalogOAuthDiscoveryLive, a subtest-per-server probe over every
oauth server in the embedded catalog. For each it verifies the structural
prerequisites docker-agent needs:

  - 401 + WWW-Authenticate from the MCP endpoint
  - protected-resource metadata reachable
  - authorization-server metadata reachable (with openid-configuration
    fallback)
  - HTTPS registration_endpoint (Dynamic Client Registration)
  - PKCE S256 advertised

Skipped by default because it makes real HTTPS calls to ~17 third-party
servers and is unsuitable for CI without explicit opt-in. Run with:

    MCP_CATALOG_OAUTH_LIVE=1 go test -run TestCatalogOAuthDiscoveryLive \
        -v -count=1 -timeout=180s ./pkg/tools/builtin/mcpcatalog
api.grafbase.com/mcp returns 404 on every plausible path (including the
bare origin), and the well-known oauth-protected-resource is gone too.
The entry has been dead long enough that it is now actively misleading
in search results — surfaced by TestCatalogOAuthDiscoveryLive.

If grafbase relaunches their MCP server we can re-add it.
When the issuer URL has a path component (e.g. https://access.stripe.com/mcp),
RFC 8414 §3.1 specifies that the well-known suffix must be inserted between
the origin and the path:

    https://access.stripe.com/.well-known/oauth-authorization-server/mcp

Stripe's authorization server follows the spec; the legacy 'append the
well-known suffix to the full issuer URL' form 404s. Until now
getAuthorizationServerMetadata only tried the append form (and an OIDC
fallback), so for stripe-remote it silently fell through to
createDefaultMetadata — leaving the OAuth flow with synthesised endpoints
that fail later in handshake.

Refactor metadata discovery to walk an ordered candidate list:

  1. RFC 8414 §3.1 path-aware oauth-authorization-server (NEW),
  2. legacy append-form oauth-authorization-server (preserved for the
     many widely-deployed servers that ship that way),
  3. path-aware openid-configuration (NEW),
  4. append-form openid-configuration.

A 200 wins; 404s flow through; any other status is a hard error so we
don't paper over a misconfigured auth server. URLs already containing
'/.well-known/' pass through untouched.

Verified by:

  - new TestMetadataDiscoveryURLs and TestGetAuthorizationServerMetadata_*
    unit tests covering all four issuer shapes,
  - TestCatalogOAuthDiscoveryLive (live probe, opt-in) which now reports
    16/16 oauth servers passing — including stripe-remote.

The live probe duplicates the same candidate ordering on purpose so it
remains a black-box audit independent of any future refactor of the
discovery code.
The RFC 8414 §3.1 path-aware variant added in e3fed03 is a speculative
guess about where an authorization server *might* publish its metadata.
Several real-world deployments answer that URL with something other
than 404 (e.g. a gateway that returns 403 for unknown well-known
prefixes, or a 5xx from an upstream component) while still serving
valid metadata at the legacy 'append-to-issuer' URL.

Before this change, getAuthorizationServerMetadata short-circuited on
the first non-404 status with a hard error, even though further
candidates would have succeeded. That regressed every issuer with a
path component whose path-aware endpoint doesn't 404.

Probe semantics now:

  - 200 with a decodable body wins immediately.
  - Non-404 statuses, transport errors and JSON-decode failures are
    logged and the next candidate is tried.
  - If at least one candidate produced a non-404 status or a
    transport-level error and none returned 200, surface the most
    diagnostic failure so a misconfigured auth server is still visible.
  - If every candidate 404'd, fall back to default metadata exactly as
    before (legacy behaviour).

Covered by three new tests:

  - TestGetAuthorizationServerMetadata_NonFatalCandidateStatus: 403 on
    candidate #1, 200 on candidate docker#2 → success.
  - TestGetAuthorizationServerMetadata_AllUnreachableSurfacesError: every
    candidate 500s → error surfaced.
  - TestGetAuthorizationServerMetadata_All404FallsBackToDefaults:
    legacy behaviour preserved.
@dgageot dgageot requested a review from a team as a code owner May 22, 2026 15:25
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

The RFC 8414 §3.1 path-aware OAuth metadata discovery implementation looks correct and well-tested. The refactored getAuthorizationServerMetadata properly walks an ordered candidate list, the URL construction in metadataDiscoveryURLs correctly implements the spec, and the error prioritisation logic (prefer non-404 HTTP status over transport errors) is a reasonable design choice with no silent-failure paths.

Reviewed:

  • pkg/tools/mcp/oauth.go — candidate-URL builder and probe loop logic ✅
  • pkg/tools/mcp/oauth_test.go — coverage of path-aware, append-form, non-fatal-status, all-404, and all-500 cases ✅
  • pkg/tools/builtin/mcpcatalog/servers.json — Grafbase removal and count update ✅
  • pkg/tools/builtin/mcpcatalog/mcpcatalog_test.go — live probe and its local URL-math copy ✅

Minor observation (no action required): authServerMetadataCandidates in mcpcatalog_test.go is intentionally duplicated from production; the comment acknowledges this. If the production metadataDiscoveryURLs gains new fallbacks in the future, keeping the live probe in sync manually is a small maintenance burden, but it's an acceptable trade-off for black-box independence.

Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Two medium-severity issues found in the new RFC 8414 §3.1 path-aware metadata discovery code in pkg/tools/mcp/oauth.go. Both are in newly added code.

Findings summary:

  • 🟡 [Medium/Confirmed] Non-200 response body not drained in fetchAuthorizationServerMetadata — prevents HTTP/1.1 connection reuse across the up-to-4 candidate probes
  • 🟡 [Medium/Likely] metadataDiscoveryURLs silently drops query parameters from authServerURL — affects real-world multi-tenant auth servers (e.g. Keycloak)

The core RFC 8414 §3.1 URL construction logic, candidate ordering, and error-surfacing behavior look correct. Test coverage for the new flow is thorough.

Comment thread pkg/tools/mcp/oauth.go
}
oidcReq.Header.Set("Accept", "application/json")
if resp.StatusCode != http.StatusOK {
return nil, resp.StatusCode, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Non-200 response body not drained — prevents HTTP/1.1 connection reuse

When resp.StatusCode != http.StatusOK, the function returns early without draining the response body. The defer resp.Body.Close() will close it, but Go's http.Transport requires the body to be fully read to EOF before it can be reused — closing without draining forces a new TCP connection for the next candidate.

This function is called up to 4 times per discovery call (one per candidate URL). Each non-200 response (e.g. 403, 404 from the path-aware variant on Auth0) abandons the connection, so a Stripe-style discovery that hits the path-aware URL 404 before finding the spec-compliant one will open 2–4 connections instead of reusing 1.

Fix: drain the body before returning:

if resp.StatusCode != http.StatusOK {
    _, _ = io.Copy(io.Discard, resp.Body)
    return nil, resp.StatusCode, nil
}

Comment thread pkg/tools/mcp/oauth.go
return nil, fmt.Errorf("invalid authorization server URL %q: %w", authServerURL, err)
}
origin := parsed.Scheme + "://" + parsed.Host
path := strings.TrimSuffix(parsed.Path, "/")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM/LIKELY] Query parameters in authServerURL are silently dropped when building candidate URLs

metadataDiscoveryURLs uses only parsed.Scheme, parsed.Host, and parsed.Pathparsed.RawQuery is never included. If authServerURL contains query parameters (e.g. Keycloak's https://sso.example.com/auth?realm=myrealm), the candidate URLs will all omit the query string, pointing to non-existent or wrong metadata endpoints.

RFC 8414 §2 says issuers MUST NOT contain query components, but real-world multi-tenant servers (Keycloak is the most common example) routinely do. The code has no guard that rejects such URLs, so they silently produce broken discovery URLs with no warning logged.

Suggested fix: Either reject URLs with query strings early and surface a clear error, or include parsed.RawQuery in candidate URL construction where appropriate:

path := strings.TrimSuffix(parsed.Path, "/")
query := ""
if parsed.RawQuery != "" {
    query = "?" + parsed.RawQuery
}
// ... then append query to each candidate that ends with the path segment

Alternatively, document that query-parameterized issuer URLs are not supported and add a url.Parse + parsed.RawQuery != "" check that returns an explicit error.

@aheritier aheritier added area/mcp MCP protocol, MCP tool servers, integration kind/fix PR fixes a bug (maps to fix: commit prefix) labels May 22, 2026
Copy link
Copy Markdown
Contributor

@aheritier aheritier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC 8414 §3.1 implementation is correct and the candidate-walking logic is sound. Two medium issues worth fixing before merge.

Issue 1 — Response body not drained in fetchAuthorizationServerMetadata (connection-pool leak)

When resp.StatusCode != http.StatusOK, the function returns immediately and lets defer resp.Body.Close() fire without having read the body. Go's net/http transport only reuses a TCP connection when the response body is fully consumed before close. With up to four speculative candidate probes per OAuth session init (e.g. three 404s followed by a 200 for Stripe), this leaks three connections per handshake.

Fix — one line, same pattern already used in handleUnmanagedOAuthFlow in this file:

if resp.StatusCode != http.StatusOK {
    _, _ = io.ReadAll(resp.Body)
    return nil, resp.StatusCode, nil
}

Issue 2 — Query parameters in the issuer URL are silently dropped in metadataDiscoveryURLs

The reconstruction uses only parsed.Path, discarding parsed.RawQuery. RFC 8414 §2 says issuers MUST NOT have query components, but multi-tenant Keycloak installations sometimes advertise query-bearing issuer URLs via authorization_servers. Dropping them silently generates wrong discovery URLs with no diagnostic signal. At minimum, add a comment acknowledging the intentional drop (spec-correct but a latent hazard); ideally return an error when parsed.RawQuery != "".

Comment thread pkg/tools/mcp/oauth.go

oidcResp, err := o.metadataClient.Do(oidcReq)
if err != nil {
return nil, err
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Body not drained before return — the defer resp.Body.Close() fires without consuming the body, so the transport discards the TCP connection instead of returning it to the pool.

With up to four candidate probes per discovery call (e.g. three 404s before the RFC 8414 §3.1 hit for Stripe), this creates up to three abandoned connections per OAuth session init. io is already imported.

if resp.StatusCode != http.StatusOK {
    _, _ = io.ReadAll(resp.Body)
    return nil, resp.StatusCode, nil
}

Same pattern used in handleUnmanagedOAuthFlow in this file.

Comment thread pkg/tools/mcp/oauth.go
var metadata AuthorizationServerMetadata
if err := json.NewDecoder(resp.Body).Decode(&metadata); err != nil {
return nil, fmt.Errorf("failed to decode metadata from %s: %w", metadataURL, err)
// No path: spec-compliant and append forms are identical.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parsed.RawQuery is never consulted — any query parameters in authServerURL are silently dropped.

RFC 8414 §2 forbids query components in issuer identifiers, so this is technically spec-correct. However, some multi-tenant Keycloak deployments do advertise query-bearing issuer URLs (e.g. https://auth.example.com/realms/master?realm=tenant) via the authorization_servers field in protected-resource metadata. Silently dropping those generates wrong discovery URLs with no diagnostic.

Options:

  1. Return an error when parsed.RawQuery != "" — surfaces the misconfiguration explicitly.
  2. Add a comment acknowledging the intentional drop so future maintainers don't have to rediscover this edge case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP protocol, MCP tool servers, integration kind/fix PR fixes a bug (maps to fix: commit prefix)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants