fix(auth, cli): sliding session window + actionable check errors#212
fix(auth, cli): sliding session window + actionable check errors#212hashedone wants to merge 3 commits into
Conversation
Previously the only path to recover from an expired `auth_sessions` row was `tracevault login`. Tokens hard-expire at 30 days regardless of how active the user is — daily users hit the cliff just as easily as inactive ones, with no signal until a push gets blocked. Replace the SELECT-only validity check in both `AuthUser` and `OrgAuth` extractors with a single `UPDATE auth_sessions SET expires_at = NOW() + INTERVAL '30 days' WHERE token_hash = AND expires_at > NOW() RETURNING user_id`. Atomic: one round-trip, no race between the validity check and the extension. Net effect: * Active users (anything that talks to TV once per 30 days) never see a token expiry again — the window slides forward on every hit. * Inactive users still hit the 30-day expiry cliff and must re-login, which is the right security tradeoff for genuinely-dormant tokens. * No new endpoint, no client change, no breaking-change to existing callers. Three integration tests against a real Postgres pool exercise the SQL directly: extend-on-hit, no-revive-of-already-expired, no-op-on- unknown-token.
The pre-push hook runs `tracevault check` and exits 1 on any failure. Previously that 1 came with an opaque message — typically "Stream failed (500 Internal Server Error)" or "Server returned 401: Invalid or expired token". Both an agent and a human reader were left guessing what to do. Adds `connectivity_message()` which sniffs the raw error string and appends a specific next step: * 401 / Unauthorized: "Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh." * 403 / Forbidden: org-membership hint (not a re-login). * DNS / connect / timeout / connection refused: network + server_url hint. * 5xx Server Error / Internal: "team has been paged, retry shortly." * Unknown shapes: pass through the raw error, no fake advice. All connectivity-class errors still propagate as Err so the pre-push hook exits non-zero — a TV-tracked repo MUST be evaluated by TV on every push, that contract doesn't change. The fix here is purely the quality of the message the user/agent reads after the block. Six unit tests pin the message shapes for the common error patterns the api_client surfaces today.
Same lockfile regeneration as PR #211 carried — release-plz updated deps in v0.16.0 that this branch picks up. CI runs `cargo check --locked`, so commit the regenerated lock.
| // it, (c) read back the user_id — one round-trip, no race. | ||
| let session_row = sqlx::query_as::<_, (Uuid,)>( | ||
| "SELECT user_id FROM auth_sessions WHERE token_hash = $1 AND expires_at > NOW()", | ||
| "UPDATE auth_sessions |
There was a problem hiding this comment.
Replacing SELECT with UPDATE on every auth turns a read into a write for each request, increasing DB write load and contention; consider the performance cost of frequent writes.
Details
✨ AI Reasoning
The code swaps a SELECT-only check for an UPDATE..RETURNING that extends expires_at on every successful auth. While functionally intended, UPDATE issues a write to the auth_sessions table on every authenticated request. For high-throughput endpoints or many short-lived requests, this converts cheap reads into writes, increasing I/O, WAL generation, and potential contention/latency. The pattern is visible and the impact scales with authentication frequency.
🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
| fn connectivity_message(raw: &str) -> String { | ||
| let lower = raw.to_ascii_lowercase(); | ||
| let action = if lower.contains("401") || lower.contains("unauthorized") { | ||
| Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.") | ||
| } else if lower.contains("403") || lower.contains("forbidden") { | ||
| Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.") | ||
| } else if lower.contains("dns") | ||
| || lower.contains("connect") | ||
| || lower.contains("timed out") | ||
| || lower.contains("timeout") | ||
| || lower.contains("connection refused") | ||
| { | ||
| Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.") | ||
| } else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal")) | ||
| { | ||
| Some("TraceVault server returned an error. The team has likely been paged; retry shortly.") | ||
| } else { | ||
| None | ||
| }; | ||
|
|
||
| match action { | ||
| Some(a) => format!("Policy check could not run: {raw}.\n → {a}"), | ||
| None => format!("Policy check could not run: {raw}."), | ||
| } |
There was a problem hiding this comment.
connectivity_message builds an intermediate Option via a long if/else chain; return the formatted message immediately from each branch to flatten control flow (use early returns instead of action variable + match).
Show fix
| fn connectivity_message(raw: &str) -> String { | |
| let lower = raw.to_ascii_lowercase(); | |
| let action = if lower.contains("401") || lower.contains("unauthorized") { | |
| Some("Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh.") | |
| } else if lower.contains("403") || lower.contains("forbidden") { | |
| Some("Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`.") | |
| } else if lower.contains("dns") | |
| || lower.contains("connect") | |
| || lower.contains("timed out") | |
| || lower.contains("timeout") | |
| || lower.contains("connection refused") | |
| { | |
| Some("Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml.") | |
| } else if lower.contains("5") && (lower.contains("server error") || lower.contains("internal")) | |
| { | |
| Some("TraceVault server returned an error. The team has likely been paged; retry shortly.") | |
| } else { | |
| None | |
| }; | |
| match action { | |
| Some(a) => format!("Policy check could not run: {raw}.\n → {a}"), | |
| None => format!("Policy check could not run: {raw}."), | |
| } | |
| fn connectivity_message(raw: &str) -> String { | |
| if lower.contains("401") || lower.contains("unauthorized") { | |
| return format!("Policy check could not run: {raw}.\n → Session token may be expired. Run `tracevault login --server-url=<server_url>` to refresh."); | |
| } | |
| if lower.contains("403") || lower.contains("forbidden") { | |
| return format!("Policy check could not run: {raw}.\n → Your token lacks access to this repo's policies. Check your org membership and rerun `tracevault login`."); | |
| } | |
| if lower.contains("dns") | |
| || lower.contains("connect") | |
| || lower.contains("timed out") | |
| || lower.contains("timeout") | |
| || lower.contains("connection refused") | |
| { | |
| return format!("Policy check could not run: {raw}.\n → Could not reach the TraceVault server. Check network and `server_url` in .tracevault/config.toml."); | |
| } | |
| if lower.contains("5") && (lower.contains("server error") || lower.contains("internal")) { | |
| return format!("Policy check could not run: {raw}.\n → TraceVault server returned an error. The team has likely been paged; retry shortly."); | |
| } | |
| format!("Policy check could not run: {raw}.") |
Details
✨ AI Reasoning
The function connectivity_message builds an intermediate Option via a long if/else-if chain and then matches that Option to format the output. This adds an extra mental indirection; using early returns (return formatted string as soon as a pattern matches) would flatten control flow and be clearer. The change introduced this new function and its conditional chain, increasing nesting and indirection in the codebase.
Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info
Fixes the bug reported today: an agent's plan ended in
git push --draft-pr, the push got blocked because the TraceVault session token had expired, and Claude Code's only signal was an opaque "Stream failed" — so it hallucinated commands until a human stepped in withtracevault login.Two complementary changes. They make the bug nearly impossible and, when it does happen, ensure the next agent can self-recover.
1. Server: sliding
auth_sessionswindowToday
auth_sessionsrows haveexpires_at = login_time + 30 days. The expiry is a hard cliff: it doesn't matter whether you used TV every day for 30 days or didn't use it at all, you get logged out the same way.Replace the SELECT-only validity check in
AuthUserandOrgAuthextractors with a single atomic statement that bumps the expiry forward on every successful authentication:Net effect: as long as a user touches TV once per 30 days, the token never expires. Genuinely-dormant tokens still hit the cliff and require
tracevault login— the right tradeoff for inactive credentials.No new endpoint, no schema change, no breaking change.
2. CLI: actionable error messages on
tracevault checkfailureA TV-tracked repo MUST be evaluated by TV on every push — that contract doesn't change. The fix here is purely the quality of the message the user/agent reads when the block fires.
connectivity_message()sniffs the raw error and appends a specific next step:Session token may be expired. Run \tracevault login --server-url=<server_url>` to refresh.`Your token lacks access to this repo's policies. Check your org membership and rerun \tracevault login`.`Could not reach the TraceVault server. Check network and \server_url` in .tracevault/config.toml.`TraceVault server returned an error. The team has likely been paged; retry shortly.An agent reading "run
tracevault login --server-url=<server_url>" has a real chance of self-repair. The previous "Stream failed (500)" gave it nothing to work with.Verification
cargo fmt --checkcargo clippy --workspace --tests --all-targets -- -D warningscargo test --workspacesliding_session_testcheck::tests::connectivity_message_*What this PR does NOT change
enforce_check = falseto fail-open on connectivity errors; rejected during review as inconsistent with the enforcement contract.result.blockedfrom the server) still exit 1 with their existing detailed output. That path is untouched.What's missing (future work)
Proper OAuth-style refresh tokens (short-lived access + long-lived refresh). The sliding window handles the active-user case correctly but inactive users still hit the 30-day cliff. Refresh tokens are the standard fix and remain on the roadmap; they're a half-day-plus slice that needs its own design pass.