Skip to content

feat(proxy): transparent Anthropic proxy — first slice of #207#208

Merged
hashedone merged 18 commits into
feature/llm-proxyfrom
feat/proxy-transparent
May 27, 2026
Merged

feat(proxy): transparent Anthropic proxy — first slice of #207#208
hashedone merged 18 commits into
feature/llm-proxyfrom
feat/proxy-transparent

Conversation

@hashedone
Copy link
Copy Markdown
Contributor

First slice of #207 (which is a partial of #181): transparent Anthropic LLM proxy.

What this PR ships

  • Schema (024_user_anthropic_keys.sql): per-user encrypted Anthropic key, AES-256-GCM via the existing encryption.rs path (same master key as org signing keys / SSO client secrets).
  • API (/api/v1/me/anthropic-key): GET returns { configured, configured_at } — never the key itself. PUT validates the sk-ant- prefix, upserts. DELETE is idempotent. All three reject org api_keys with a clear 403.
  • Proxy handler (/proxy/anthropic/{*path}): catch-all forwarding to https://api.anthropic.com/{path}. Authenticates via TV session token in x-api-key, swaps in the user's stored Anthropic key, streams the response byte-for-byte via reqwest::Response::bytes_stream() — no SSE parsing, no buffering. Mounted on its own Router so the public GovernorLayer rate-limit does not throttle agent sessions.
  • Header allow-list: forwards accept / accept-encoding / anthropic-version / anthropic-beta / anthropic-dangerous-direct-browser-access / content-type / user-agent; drops host / cookie / authorization / x-forwarded-*. Response forwards a fixed list plus any anthropic-* for forward compat. Set-Cookie is explicitly tested to never leak.
  • Anthropic-shaped error envelope for all proxy-originated failures (auth, missing key, decrypt error, upstream unreachable). Upstream errors pass through verbatim.
  • Web UI (/me/proxy/): SvelteKit page with status + save/replace/remove + copy-block.
  • CLI: new tracevault proxy info command prints proxy URL + env-var setup snippet.
  • Tests: 4 unit (handler internals) + 8 repo-layer (encryption roundtrip + FK cascade) + 11 in-process integration against a wiremock stub upstream + 2 #[ignore]-d real-Anthropic tests for local pre-merge verification.

Out of scope — explicitly deferred

  • Event capture / trace persistence (will be a tee + parse stage on the existing byte stream)
  • Model routing / named routes / dispatching
  • Org-level keys and admin key management
  • OpenAI proxy support
  • Cost attribution, rate limiting, quotas
  • Dedicated long-lived proxy tokens (uses existing auth_sessions; expiry causes silent 401s — documented limitation)

Verification status

Layer Status
cargo fmt --all -- --check
cargo clippy --workspace --tests --all-targets -- -D warnings
cargo test --workspace ✅ (every result ok, 2 ignored as designed)
Web pnpm check (svelte-check) ✅ 0 errors, 0 warnings
Real-Anthropic #[ignore] tests ⏳ run locally pre-merge
Manual smoke: GSD2 driving the local proxy ⏳ run locally pre-merge

How to run the deferred local checks

# Real-Anthropic integration tests
ANTHROPIC_API_KEY=sk-ant-... \
  cargo test -p tracevault-server --test proxy_real_anthropic -- --ignored --nocapture

# Manual smoke
# 1. Run the server + web stack locally
# 2. Visit http://localhost:<port>/me/proxy and save your Anthropic key
# 3. Configure GSD2 (or another tool) with:
#      ANTHROPIC_BASE_URL=http://localhost:<port>/proxy/anthropic
#      ANTHROPIC_API_KEY=<your TV session token from ~/.config/tracevault/credentials.json>
# 4. Run a real multi-turn coding session

Parent issue: #207 (which is a partial of #181)

Target branch: feature/llm-proxy (umbrella branch tracking #181)

@hashedone hashedone added the enhancement New feature or request label May 27, 2026
@hashedone hashedone self-assigned this May 27, 2026
@hashedone hashedone added the enhancement New feature or request label May 27, 2026
///
/// Path layout: `path` is everything after `/proxy/anthropic/` (no leading
/// slash). Query string is forwarded verbatim from the original URI.
pub async fn anthropic_proxy(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anthropic_proxy and its helpers mix auth (token resolution), DB access, decryption, and HTTP proxying in one component; separate concerns (auth resolution, credential loading/decrypting, and request forwarding) to reduce coupling.

Details

✨ AI Reasoning
​The new proxy handler and its helpers combine several distinct responsibilities: authenticating and resolving session tokens, querying the database for user keys, decrypting secrets with the server master key, constructing and forwarding HTTP requests, shaping Anthropic-style error envelopes, and streaming responses back to clients. This bundles persistence, crypto, auth, and networking logic into a single component, increasing coupling and making testing and future changes harder. Isolating these concerns into smaller modules (auth resolution, credential service, and a thin forwarding handler) would improve maintainability.

🔧 How do I fix it?
Split classes that handle database, HTTP, and UI concerns into separate, focused classes.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f358647 — extracted authenticate, forward_to_upstream, and build_downstream_response so the handler is a thin orchestration shell over the three concerns. All proxy_integration tests still pass.

Comment thread web/src/routes/me/proxy/+page.svelte Outdated
Comment on lines +84 to +85
} catch {
// Clipboard API can fail in some browsers / contexts; ignore silently.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch block in copyProxyUrl silently swallows clipboard errors. Add logging or user-visible feedback instead of silently ignoring failures.

Show fix
Suggested change
} catch {
// Clipboard API can fail in some browsers / contexts; ignore silently.
} catch (err) {
// Clipboard API can fail in some browsers / contexts; log and continue.
console.warn('Failed to copy proxy URL to clipboard:', err);
Details

✨ AI Reasoning
​A try/catch block was added that swallows errors silently in copyProxyUrl when the Clipboard API write fails. Silent catches make debugging and monitoring harder because they suppress failure signals; even with an explanatory comment, the error is not logged or surfaced. The handler contains no logging, fallback, or user-visible feedback.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

Comment on lines +28 to +33
const state = await new Promise<{ current: OrgInfo | null }>((resolve) => {
const unsub = orgStore.subscribe((s) => {
resolve(s);
unsub();
});
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsub() is called inside the subscribe callback before unsub is guaranteed initialized; immediate subscriber invocation can throw at runtime and break loadIfNeeded().

Show fix
Suggested change
const state = await new Promise<{ current: OrgInfo | null }>((resolve) => {
const unsub = orgStore.subscribe((s) => {
resolve(s);
unsub();
});
});
let unsub: (() => void) | undefined;
const state = await new Promise<{ current: OrgInfo | null }>((resolve) => {
unsub = orgStore.subscribe((s) => {
resolve(s);
});
});
unsub?.();
Details

✨ AI Reasoning
​The code creates a Promise from orgStore.subscribe and calls the unsubscribe function inside the subscriber callback. In store implementations that invoke the subscriber immediately, the callback runs before the unsubscribe variable has been initialized. That makes the callback access an uninitialized binding, so the logic intended to resolve once and unsubscribe can fail at runtime instead of completing normally.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

hashedone added 13 commits May 27, 2026 16:26
Add user_anthropic_keys table (user_id PK, key_encrypted, key_nonce,
created_at, updated_at) and UserAnthropicKeyRepo with upsert /
get_ciphertext / configured_at / delete. Keys are encrypted at rest via
the existing AES-256-GCM encryption.rs path (same master key as org
signing keys / SSO client secrets); configured_at lets the status API
report key presence without ever touching ciphertext.

Refs #207, parent #181.
GET returns { configured, configured_at } — never the key itself.
PUT validates sk-ant- prefix, requires AppState.encryption_key to be
set, and upserts via UserAnthropicKeyRepo. DELETE is idempotent. All
three reject AuthUser with nil user_id (org-scoped api_keys) with a
clear 403 — the proxy is fundamentally per-user.

Refs #207, parent #181.
Catch-all at /proxy/anthropic/{*path} forwarding to api.anthropic.com.
Authenticates via x-api-key (TV session token resolved against
auth_sessions; org api_keys explicitly rejected with a clear 401);
loads + decrypts the user's stored Anthropic key; rewrites x-api-key
with the upstream value; streams the response body byte-for-byte via
reqwest bytes_stream() — no SSE parsing, no buffering. Allow-list
header forwarding (drops host/cookie/authorization/x-forwarded-*/
set-cookie); response headers forward a fixed allow-list plus any
anthropic-* prefix for forward compat. Proxy-originated errors use
the Anthropic envelope shape so unmodified clients route them through
their existing error paths. Mounted on its own Router so the public
GovernorLayer rate-limit does not throttle agent sessions.

Refs #207, parent #181.
11 tests covering the full proxy lifecycle: non-streaming JSON
passthrough (with header-swap assertion proving the TV token is
replaced by the upstream Anthropic key before reaching the wire),
SSE byte-for-byte streaming, upstream 4xx/5xx verbatim passthrough,
upstream-unreachable -> 502 api_error, missing/invalid/org-key/
no-key-configured auth failures with the Anthropic-shaped envelope,
forbidden-header stripping verified via MockServer.received_requests(),
and the deferred /me/anthropic-key HTTP lifecycle (empty -> PUT-bad ->
PUT-good -> GET-configured -> DELETE -> GET-empty).

Adds AppState.anthropic_upstream_base so tests can redirect to
wiremock without env-var mutation, plus seed_auth_session and
fixture_encryption_key helpers in tests/common.

Refs #207, parent #181.
Two tests against api.anthropic.com that verify byte-fidelity in the
real-network case wiremock cannot model (TLS, HTTP/2, real SSE event
vocabulary). Both are #[ignore]-d so they do not run in CI. Run with:

    ANTHROPIC_API_KEY=sk-ant-... \
      cargo test -p tracevault-server --test proxy_real_anthropic \
      -- --ignored --nocapture

Non-streaming asserts type=message + non-empty content[0].text +
stop_reason. Streaming asserts the body contains message_start and
content_block_delta SSE event markers.

Refs #207, parent #181.
Status / Save / Replace / Remove (with inline confirm) / copy-block
with the proxy base URL and env-var snippet. Saved keys are never
shown again — the GET endpoint returns only { configured,
configured_at } and the form clears after a successful PUT.

Refs #207, parent #181.
New nested subcommand that prints the proxy base URL, the
configured TraceVault server, the credentials path, and a step-by-
step env-var snippet for setting ANTHROPIC_BASE_URL +
ANTHROPIC_API_KEY in Claude Code / GSD2 / Cursor / Codex CLI.
Read-only — no network call.

Refs #207, parent #181.
dotenvy::dotenv() walks up from the test CWD to the workspace root
and loads any .env present, so users can put ANTHROPIC_API_KEY in
.env instead of exporting it per-shell.

Note: DATABASE_URL still has to be set externally because sqlx::test
constructs the pool before the test body runs.

Refs #207, parent #181.
resolve_claude_target() in init.rs branches on io::stdin().is_terminal():
non-TTY -> Shared default, TTY -> interactive prompt. Tests passing
claude_settings=None were taking the interactive branch when run from a
real terminal, which raced on stdin under parallel execution and caused
init_installs_claude_hooks / init_merges_into_existing_settings /
init_no_gitignore_skips_gitignore_update to fail intermittently
depending on test scheduling.

Pass Some(ClaudeSettingsTarget::Shared) explicitly from every test
that previously passed None. This pins behavior independent of the
runner's stdin shape and matches what the tests actually want to
verify.

Follow-up: init_in_directory itself should take an interactive: bool
arg so production callers can't accidentally fall into the same trap.
File a separate issue.

Refs #207, parent #181.
Without a link, /me/proxy/ was reachable only by typing the URL.
Adds a Cable-icon button in the sidebar footer (both expanded and
collapsed variants) above the Log out button, since the Anthropic
key is user-scoped and the footer is where per-user controls live.

Refs #207, parent #181.
/me/proxy/ was rendering without any chrome because the root layout
has no sidebar — only orgs/[slug]/+layout.svelte wires up AppLayout.
Adds web/src/routes/me/+layout.svelte that loads the user's org list,
picks the first org as current if none is set, and renders AppLayout
around the children. The sidebar's links derive their slug from the
org store, so /me/* now has a usable nav back to org-scoped pages.

Refs #207, parent #181.
* me/+layout.svelte: svelte/store calls the subscriber synchronously
  on .subscribe(), so the previous in-callback unsub() was running
  while `unsub` was still in its TDZ. Stash the unsubscribe in an
  outer binding and call it after the Promise resolves.
* me/proxy/+page.svelte: log clipboard-write failures via
  console.warn instead of swallowing silently — easier to debug,
  still no page-level error since the user can copy manually.

Refs #207, parent #181.
Per PR review: extract `authenticate`, `forward_to_upstream`, and
`build_downstream_response` so the handler is a thin orchestration
shell and each concern (auth+credential loading, upstream request
construction, downstream response streaming) lives in its own
private function. Behavior is identical — all 11 proxy_integration
tests still pass.

Refs #207, parent #181.
@hashedone hashedone force-pushed the feat/proxy-transparent branch from f358647 to 40cb68a Compare May 27, 2026 14:28
hashedone added 4 commits May 27, 2026 17:04
Reject keys longer than 256 chars at the PUT /api/v1/me/anthropic-key
endpoint so a malformed paste cannot persist megabytes of junk on the
user_anthropic_keys row. Real Anthropic keys are ~110 chars.

Self-review minor #2 on PR #208.
* Reject path segments equal to `..` at the proxy entry with an
  Anthropic-shaped 400. Today reqwest/url normalize `..` so no
  cross-host escape is possible, but a future regional Anthropic base URL
  with a path prefix could be escaped — guard at the entry point.
* Replace round-tripped `HeaderName::from_bytes` / `HeaderValue::from_bytes`
  in copy_response_headers with a direct `name.clone() / value.clone()` —
  the upstream-validated headers don't need re-validation, and the
  silent-drop branch on validation failure was a confusing prod debug story
  waiting to happen.
* Drop the redundant `to_ascii_lowercase()` in the anthropic-* prefix
  check (http::HeaderName names are lowercase by construction).
* New integration test: `proxy_rejects_path_traversal_segments`.

Self-review minor #1 + minor #3 + nit #1 on PR #208.
Two related production-hardening fixes for the Anthropic proxy router:

1. `DefaultBodyLimit::max(32 * 1024 * 1024)` on the proxy router. Axum's
   default `Bytes` cap of 2 MB silently rejects legitimate Anthropic
   requests (vision inputs, long conversations, large `system` prompts)
   with a 413 that does NOT match the Anthropic error envelope, breaking
   the transparent-proxy contract. 32 MB matches the Anthropic request
   envelope while still bounding worst-case in-flight memory.

2. Dedicated `proxy_http_client` on AppState with
   `connect_timeout: 10s`. The shared `reqwest::Client::new()` used by
   pricing sync has no timeout configured; using it for the proxy meant a
   stalled TCP/TLS handshake on api.anthropic.com would park the proxy task
   indefinitely, holding a DB connection and a router slot — exhausting
   pools under partial Anthropic outage. Intentionally no overall
   `timeout()` because the proxy carries open-ended SSE streams.

Self-review major #1 + major #2 on PR #208.
After rebasing onto main (v0.16.0), Cargo.lock fell behind the workspace's
dependency graph. `cargo check` regenerated it; commit the regeneration
so CI's `cargo check --locked` does not fail on a stale lockfile.
// regardless of how the base URL is configured later.
if path.split(['/', '\\']).any(|seg| seg == "..") {
tracing::warn!(
error_type = "authentication_error",
Copy link
Copy Markdown

@aikido-pr-checks aikido-pr-checks Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path-traversal rejection logs error_type = "authentication_error" but returns ProxyErrorKind::ApiError; this contradictory classification can mislabel incidents and break error-type-based monitoring.

Suggested change
error_type = "authentication_error",
error_type = "api_error",
Details

✨ AI Reasoning
​This branch is handling invalid path traversal input, not authentication. The emitted classification says one error type while the returned envelope uses another. That mismatch makes telemetry and downstream debugging inconsistent for the exact same failure path.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

…tion_error

The log line said `error_type = "authentication_error"` while the
response envelope used `ProxyErrorKind::ApiError`. Path traversal is not
an authentication failure; align the log to match the envelope so
error-type-based telemetry isn't split across two labels for the same
failure.

Addresses Aikido review comment on PR #208.
@hashedone hashedone merged commit 77aac6c into feature/llm-proxy May 27, 2026
3 checks passed
hashedone added a commit that referenced this pull request May 27, 2026
Reject keys longer than 256 chars at the PUT /api/v1/me/anthropic-key
endpoint so a malformed paste cannot persist megabytes of junk on the
user_anthropic_keys row. Real Anthropic keys are ~110 chars.

Self-review minor #2 on PR #208.
hashedone added a commit that referenced this pull request May 27, 2026
* Reject path segments equal to `..` at the proxy entry with an
  Anthropic-shaped 400. Today reqwest/url normalize `..` so no
  cross-host escape is possible, but a future regional Anthropic base URL
  with a path prefix could be escaped — guard at the entry point.
* Replace round-tripped `HeaderName::from_bytes` / `HeaderValue::from_bytes`
  in copy_response_headers with a direct `name.clone() / value.clone()` —
  the upstream-validated headers don't need re-validation, and the
  silent-drop branch on validation failure was a confusing prod debug story
  waiting to happen.
* Drop the redundant `to_ascii_lowercase()` in the anthropic-* prefix
  check (http::HeaderName names are lowercase by construction).
* New integration test: `proxy_rejects_path_traversal_segments`.

Self-review minor #1 + minor #3 + nit #1 on PR #208.
hashedone added a commit that referenced this pull request May 27, 2026
Two related production-hardening fixes for the Anthropic proxy router:

1. `DefaultBodyLimit::max(32 * 1024 * 1024)` on the proxy router. Axum's
   default `Bytes` cap of 2 MB silently rejects legitimate Anthropic
   requests (vision inputs, long conversations, large `system` prompts)
   with a 413 that does NOT match the Anthropic error envelope, breaking
   the transparent-proxy contract. 32 MB matches the Anthropic request
   envelope while still bounding worst-case in-flight memory.

2. Dedicated `proxy_http_client` on AppState with
   `connect_timeout: 10s`. The shared `reqwest::Client::new()` used by
   pricing sync has no timeout configured; using it for the proxy meant a
   stalled TCP/TLS handshake on api.anthropic.com would park the proxy task
   indefinitely, holding a DB connection and a router slot — exhausting
   pools under partial Anthropic outage. Intentionally no overall
   `timeout()` because the proxy carries open-ended SSE streams.

Self-review major #1 + major #2 on PR #208.
@hashedone hashedone deleted the feat/proxy-transparent branch May 27, 2026 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant