Skip to content

Cache vMCP advertised set to avoid audit re-aggregation#5512

Open
tgrunnagle wants to merge 3 commits into
mainfrom
vmcp-core-p2-4b_issue_5493
Open

Cache vMCP advertised set to avoid audit re-aggregation#5512
tgrunnagle wants to merge 3 commits into
mainfrom
vmcp-core-p2-4b_issue_5493

Conversation

@tgrunnagle

Copy link
Copy Markdown
Contributor

Summary

On the vMCP Serve path, audit backend-enrichment resolved a request's backend by calling core.Lookup*, which — by the core's deliberate stateless design — re-aggregates every backend's capabilities on every audited request (O(backends·capabilities), plus a remote discovery fan-out) purely to label the audit event. This was a documented deferral from PR #5491 (P2.4) and is a hard blocker before audit can ship on a live Serve path.

This PR implements the "Serve caches, core is stateless" pattern:

  • Why: Eliminate the per-request second aggregation that audit enrichment incurred on the Serve path. Backend-enrichment middleware runs for every MCP request when AuditConfig != nil, so each audited tools/call / resources/read was paying for a full backend aggregation (and remote fan-out) on top of the aggregation the call itself performs — just to attach a backend label.
  • What: Add a bounded, per-session, node-local advertised-set cache mapping advertised tool name / resource URI → backend display name. It is populated once at session registration from the same core.ListTools / core.ListResources aggregation the session already performs (no extra aggregation), read by audit enrichment keyed by the Mcp-Session-Id header, and evicted on transport-driven session-end paths. A cache miss (e.g. a cross-pod request without session affinity) degrades the audit label gracefully and never falls back to re-aggregation. The legacy server.New path is unchanged (the cache is nil there).

Closes #5493

Type of change

  • Refactoring (no behavior change)

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Verification run on the touched package:

  • go test -race ./pkg/vmcp/server/... passes.
  • golangci-lint run ./pkg/vmcp/server/... reports 0 issues.
  • go build ./... is clean.
  • A repo-wide task lint-fix surfaces only a pre-existing, unrelated G115 in cmd/thv/app/upgrade.go (not touched by this PR).

Acceptance criteria verified by tests:

  1. Audited Serve-path requests no longer trigger a second backend aggregation — TestBackendEnrichmentMiddleware_ServePath/"does not re-aggregate via the core" asserts core.LookupTool and core.ListTools call counts are 0.
  2. The documented hard-blocker for live-Serve audit is resolved.

Changes

File Change
pkg/vmcp/server/serve_advertised_set.go New. advertisedSet (immutable name→backend maps) and advertisedSetCache (bounded LRU, nil-safe, concurrent-safe).
pkg/vmcp/server/serve_advertised_set_test.go New. Unit tests for the cache and advertised-set lookup.
pkg/vmcp/server/backend_enrichment.go Enrichment dispatches on path: Serve reads the per-session cache (no core.Lookup*); legacy reads the context routing table.
pkg/vmcp/server/backend_enrichment_test.go Adds Serve-path coverage, including the no-re-aggregation assertion.
pkg/vmcp/server/serve_handlers.go Builds the advertised name→backend mapping from the existing per-session aggregation; stores at registration, evicts on binding-failure termination.
pkg/vmcp/server/serve.go Constructs the cache (sized to the session manager's CacheCapacity) and wires it onto the Server.
pkg/vmcp/server/server.go Adds the advertisedSets field; re-populates the cache on lazy tool injection (cross-replica) and evicts on registration-failure cleanup.
pkg/vmcp/server/serve_session_test.go Session-lifecycle test updates for cache population/eviction.

Does this introduce a user-facing change?

No. This is a performance/internal change with no user-facing behavior change — the audit event content is identical to the legacy path (both record the backend's human-readable name).

Special notes for reviewers

  • LRU bound rationale: the bound is the leak backstop for the session-end paths the session manager evicts lazily without notifying the transport (client DELETE and TTL/capacity eviction). The transport-driven session-end paths evict eagerly; the LRU guarantees the cache can never grow without limit even when one of the lazy paths leaves an entry behind for a now-dead session ID. Stale entries are harmless (session IDs are UUIDs, never reused) and are evicted under capacity pressure. The default capacity mirrors the session manager's default so the two node-local caches hold roughly the same number of live sessions.
  • Cross-replica behaviour: entries are in-process and node-local, mirroring the MultiSession runtime layer. A request landing on a replica that did not register the session misses and records the event without a backend label rather than re-aggregating, so Serve-path enrichment is aggregation-free regardless of cache state. On lazy tool re-injection, only the tools mapping is re-populated, so a resources/read audit on a non-owning replica misses gracefully.
  • Follow-ups (out of scope, both LOW):
    • The now-unused core.Lookup* interface surface is a candidate for retirement.
    • Cross-pod resources/read audit labels (tools-only on non-owning replicas) can be revisited when the live Serve composition root lands.
  • Part of the vMCP core-extraction epic (Phase 2: Serve transport helper, re-home transport, replace discovery #5431).

tgrunnagle and others added 2 commits June 12, 2026 10:11
On the Serve path, audit backend-enrichment resolved a request's backend
by calling core.Lookup*, which — by the core's stateless design —
re-aggregates every backend's capabilities on every audited request (and
fans out to remote backends with discovery). That doubled the aggregation
work per audited tools/call, resources/read, and prompts/get.

Implement the "Serve caches, core is stateless" pattern for issue #5493:
- Add a bounded, per-session advertised-set cache (name->backend display
  name for tools and resources), populated once at session registration
  from the same ListTools/ListResources aggregation already performed.
- Read enrichment from the cache (keyed by Mcp-Session-Id) instead of
  core.Lookup*; a miss degrades the audit label gracefully and never
  re-aggregates.
- Evict on the transport-driven session-end paths; the LRU bound backstops
  the DELETE/TTL paths the session manager evicts lazily without notifying
  the transport.

The legacy server.New path is unchanged (cache is nil there).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixed issues from code review:
- MEDIUM: Align advertised-set cache capacity validation with the session
  manager — newAdvertisedSetCache now rejects a negative capacity and treats
  0 as the default, instead of silently clamping. Both caches read the same
  CacheCapacity field, so they must agree on what a value means.

Updated the cache validation test and retargeted the Serve construction-failure
test (which used a negative CacheCapacity) to a nil FactoryConfig.Base, since a
negative capacity is now caught before sessionmanager.New.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.25000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.72%. Comparing base (eb503b5) to head (e6e1d25).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/server/serve_handlers.go 85.71% 3 Missing ⚠️
pkg/vmcp/server/serve_advertised_set.go 93.33% 1 Missing and 1 partial ⚠️
pkg/vmcp/server/server.go 71.42% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5512      +/-   ##
==========================================
+ Coverage   69.66%   69.72%   +0.06%     
==========================================
  Files         644      646       +2     
  Lines       65548    65640      +92     
==========================================
+ Hits        45661    45765     +104     
+ Misses      16539    16523      -16     
- Partials     3348     3352       +4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tgrunnagle tgrunnagle left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review — Cache vMCP advertised set (#5512)

Reviewed by 6 specialist agents (concurrency/lifecycle, vMCP architecture & anti-patterns, Go style/comments, test coverage, security/audit, general quality). No HIGH-severity findings; no authorization, routing, or admission bypass. The design is sound and a near-textbook satisfaction of the vMCP anti-pattern rules it operates under: the optimization is evidence-backed and was explicitly deferred to this exact landing point (#9 satisfied), the abstraction is package-internal and touches no shared interface (#8), sessionID is threaded as an explicit parameter rather than via ctx.Value (#1 improved), and advertisedSet is immutable-after-construction with reconstruct-and-replace updates (#10).

The findings below are non-blocking for the current test-only Serve path, but several should be resolved before audit ships on a live Serve composition root.

Consensus findings (≥7/10)

# Severity Finding
F1 MEDIUM lazyInjectSessionTools tools-only store can clobber a fuller cached entry on the owning replica for a zero-tool/has-resource session; the "never clobbers a fuller entry" comment is false in that case
F2 LOW Lifecycle comment claims DELETE/TTL paths "do not notify the transport" — the mcp-go SDK in fact fires OnUnregisterSession on both; vMCP simply doesn't register the hook
F3 MEDIUM Backend label is silently dropped on a cache miss (cross-pod / post-eviction); the path it replaces always resolved one — an audit-completeness regression gated by #5493's own "hard blocker before live audit"
F4 LOW core.LookupTool/LookupResource/LookupPrompt now have zero production callers — dead interface surface (acknowledged out-of-scope; track + annotate)
F5 LOW Test gaps: lazy-inject cache re-population and registration-failure eviction are both untested
F6 LOW Orphan-backend (advertised but absent from registry) records the raw BackendID on Serve but empty WorkloadName on legacy — the "audit parity" comment overstates equivalence (pre-existing behavior, carried over)
F7 LOW fakeCore.lookupResCalls/lookupPromptCalls counters are incremented but never asserted; the resources/read re-aggregation guard is missing

What this PR gets right (verified)

  • AC1 is genuinely guarded: the "does not re-aggregate" test asserts lookupToolCalls==0/listToolsCalls==0 against a fresh fakeCore whose counters are actually wired — not vacuous.
  • Cache is bounded (LRU, validated capacity; 0→default, negative rejected) and nil-receiver-safe (the legacy server.New path leaves it nil and calls evict unconditionally).
  • Stores logical backend display names keyed by a server-generated UUID session ID — no internal addressing in shared state; a miss never affects enforcement (calls still route through enforceSessionBinding + core admission).

Dropped (< 7, for audit only)

const-coupling drift (F8), 22-field Server struct (F9, pre-existing), capacity divergence when audit is disabled (F10), forged-header audit-label confusion (F11, no auth impact), minor test nits (F12), lazy re-inject identity mismatch (F13, mitigated by binding check).

Overall: COMMENT — approve-with-suggestions. None of the findings block merge for a test-only path; F1/F3/F5 are worth resolving (or explicitly tracking against the live-audit blocker) before the production Serve composition root lands.

🤖 Generated with Claude Code

Comment thread pkg/vmcp/server/server.go Outdated
Comment thread pkg/vmcp/server/serve_advertised_set.go Outdated
Comment thread pkg/vmcp/server/backend_enrichment.go Outdated
Comment thread pkg/vmcp/server/serve_handlers.go
Comment thread pkg/vmcp/server/serve_session_test.go
Comment thread pkg/vmcp/server/serve_session_test.go
Addresses #5512 review comments:
- MEDIUM pkg/vmcp/server/server.go (3405516256): F1 — gate lazy re-population
  on a cache miss so a tools-only store never clobbers a fuller {tools,resources}
  entry on the owning replica (a zero-tool/has-resource session reaches this path).
- MEDIUM pkg/vmcp/server/backend_enrichment.go (3405516282): F3 — log a Serve-path
  enrichment cache miss at DEBUG so the audit-label gap is observable; deliberately
  no core.Lookup* fallback (that would reintroduce the re-aggregation the cache removes).
- LOW pkg/vmcp/server/serve_advertised_set.go + serve.go (3405516280): F2 — correct the
  lifecycle comment (mcp-go v0.54.1 fires OnUnregisterSession on DELETE/TTL) and wire
  AddOnUnregisterSession for eager eviction; the LRU bound stays as a backstop.
- LOW pkg/vmcp/server/serve_handlers.go (3405516287): F6 — fix backendDisplayName doc to
  stop claiming legacy parity in the orphan case (Serve records the ID, legacy records "").
- LOW pkg/vmcp/server/serve_session_test.go (3405516296): F7 — assert resources/read and
  prompts/get do not re-aggregate (lookupResCalls/lookupPromptCalls/listResourcesCalls == 0).
- LOW pkg/vmcp/server/serve_session_test.go (3405516300): F5 — test lazy-inject cache
  re-population and registration-failure eviction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jun 12, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review June 12, 2026 19:07

@jerm-dro jerm-dro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocker: This is the overuse of middleware coming home to roost, and the cache is complexity added to compensate for it rather than removing it. The backend-enrichment middleware checks the request method and no-ops for most of them, re-reads and re-Unmarshals the request body, and exists solely to stuff one value into BackendInfo context state — and now it needs a per-session cache to do that without re-aggregating on the Serve path. We're spending a cache, four eviction sites, an LRU backstop, and a capacity knob to keep a misplaced layer performant.

The backend name is produced at the call path, and the audit middleware emits after the handler returns (next.ServeHTTPlogAuditEvent, reading BackendName at emit time). So the handler can write the label directly into the BackendInfo the audit middleware already created — and the enrichment middleware comes off the Serve chain entirely.

// coreSessionTools — pass the name (already computed here) into the closure:
Handler: s.coreToolHandler(sessionID, domainTool.Name,
    s.backendDisplayName(ctx, domainTool.BackendID)),
// coreToolHandler — label where the call is handled:
func (s *Server) coreToolHandler(sessionID, toolName, backendName string) server.ToolHandlerFunc {
    return func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
        if bi, ok := audit.BackendInfoFromContext(ctx); ok && bi != nil {
            bi.BackendName = backendName
        }
        // ... existing binding check + core.CallTool, unchanged ...
    }
}

Same one-liner in coreResourceHandler. Perf should be unchanged (the name is computed once at registration; per-request is an O(1) write, no core.Lookup*) and the log identical (same field, value, emitter).

Net effect: removes serve_advertised_set.go + tests, the advertisedSets field, all four evict calls + the LRU backstop, the resolveBackendName/cachedBackendName split, and drops the enrichment middleware from the Serve path. A layer comes out and no cache goes in. The middleware stays wired only for the legacy path (s.core == nil) until #5445 retires it.

This rests on two things I sketched but didn't implement: that the audit event is emitted after the handler runs, and that the SDK propagates the request context into the handler so BackendInfo is reachable there (coreToolHandler already reads identity from ctx, which suggests it does). Is this approach feasible? WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

P2.4b vMCP Serve: cache per-session advertised set to avoid audit re-aggregation

2 participants