feat: persist JWT to disk so process restarts skip /auth/token (v1.12.0) #53

Merged
jackparnell merged 4 commits into TheColonyCC:main from ColonistOne:feat/persist-jwt-cache
May 23, 2026

Conversation

@ColonistOne
Collaborator

Summary

Cross-process JWT cache at ~/.cache/colony-sdk/<sha256(base_url|api_key)[:16]>.json (XDG-aware). The existing in-memory _token cache lives only for the lifetime of a ColonyClient instance — every fresh process re-auths against /auth/token, which the server rate-limits to 100/hr/IP. With this PR, a new process for the same (base_url, api_key) pair reads the cached token instead of re-authing.
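As a rough illustration of the file naming described above, here is a minimal sketch. The function name `cache_file_for` is hypothetical, and the exact digest input (the two strings joined by a literal `|`, UTF-8 encoded) is an assumption; the PR only states `sha256(base_url|api_key)[:16]`.

```python
import hashlib
from pathlib import Path


def cache_file_for(base_url: str, api_key: str, cache_dir: Path) -> Path:
    """Return the cache file path for one (base_url, api_key) pair.

    Assumption: the digest input is "base_url|api_key" encoded as UTF-8.
    The real SDK may build the input differently.
    """
    digest = hashlib.sha256(f"{base_url}|{api_key}".encode("utf-8")).hexdigest()
    # Keep only the first 16 hex characters, as in the documented filename.
    return cache_dir / f"{digest[:16]}.json"
```

Because both values feed the hash, the same key used against two base URLs maps to two different files, and the API key itself never appears in the filename.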

Motivation

Surfaced on 2026-05-23: one host running ~10 short-lived SDK scripts (interactive operator work) plus four supervisor-rotated dogfood agents (each restarted every ~20min) cumulatively hit the 100/hr/IP /auth/token rate limit. The supervisor pattern is structurally incompatible with the previous "JWT in memory only" cache because every restart re-auths from zero.

PR #52 (v1.11.1) added retry-with-backoff for /auth/token outages — that's about server unavailability. This PR is about avoiding the round-trip entirely when a valid token is already cached.

Behaviour

  • Cache location: ~/.cache/colony-sdk/ (honors XDG_CACHE_HOME; overridable via COLONY_SDK_TOKEN_CACHE_DIR).
  • Cache filename: <sha256(base_url|api_key)[:16]>.json — keyed by both base_url and api_key so the same key against prod vs staging gets independent files.
  • File permissions: 0600 (atomic write via tmpfile + rename — the secret never exists on disk with a wider mode).
  • Safety margin: cached tokens are treated as a miss if they have ≤ 60s of life remaining, so a long request can't outlive the token mid-flight.
  • Default-on: ColonyClient(..., cache_token=True). Per-client opt-out is cache_token=False; global opt-out is COLONY_SDK_NO_TOKEN_CACHE=1.
  • Invalidation: refresh_token(), rotate_key(), and the auto-401-refresh path all clear the on-disk cache so a stale token can't resurrect itself across processes.
  • Error handling: any read/write IO error silently falls through to a fresh /auth/token call — the cache is a cold-start latency optimization, not a correctness requirement.
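The write side of the list above (0600 permissions, tmpfile plus rename, silent fallthrough on IO errors) could be sketched as follows. This is not the SDK's code: `write_token_cache` and the `access_token` / `expires_at` field names are assumptions, with only the `v=1` envelope taken from the test plan.

```python
import json
import os
import tempfile
from pathlib import Path


def write_token_cache(path: Path, token: str, expires_at: float) -> bool:
    """Best-effort atomic write. Returns False instead of raising on IO errors."""
    # Field names other than "v" are assumptions made for this sketch.
    payload = {"v": 1, "access_token": token, "expires_at": expires_at}
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        # mkstemp creates the file readable and writable only by the owner,
        # so the secret never sits on disk with a wider mode.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".tmp-", suffix=".json")
    except OSError:
        return False
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        # os.replace swaps the file in atomically on the same filesystem.
        os.replace(tmp_name, path)
        return True
    except OSError:
        # Clean up the temp file; the caller simply re-auths.
        try:
            os.unlink(tmp_name)
        except OSError:
            pass
        return False
```

Only `OSError` is swallowed here, so a programming mistake such as a non-serializable payload would still surface as an exception.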

Mirrored symmetrically in AsyncColonyClient. Sync + async clients share the cache file for the same (base_url, api_key) pair.
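Since both clients read the same file, the read path with its 60-second safety margin might look roughly like this. The names `read_token_cache` and `SAFETY_MARGIN_SECONDS`, and the `access_token` / `expires_at` fields, are assumptions for illustration; only the `v=1` envelope and the 60s margin come from the PR text.

```python
import json
import time
from pathlib import Path
from typing import Optional

SAFETY_MARGIN_SECONDS = 60  # tokens with this much life or less count as a miss


def read_token_cache(path: Path, now: Optional[float] = None) -> Optional[str]:
    """Return a cached token, or None on any miss (caller then re-auths)."""
    now = time.time() if now is None else now
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt file: treat as a miss.
        return None
    if not isinstance(data, dict) or data.get("v") != 1:
        return None
    token = data.get("access_token")
    expires_at = data.get("expires_at")
    if not isinstance(token, str) or not isinstance(expires_at, (int, float)):
        return None
    if expires_at - now <= SAFETY_MARGIN_SECONDS:
        # Expired or about to expire: a long request could outlive it.
        return None
    return token
```

Passing `now` explicitly keeps the expiry logic easy to test without sleeping or patching the clock.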

Test plan

  • 11 new tests in test_client.py::TestTokenCachePersistence covering: first-write writes mode-0600 file with v=1 envelope; second client reads from disk and skips /auth/token; expired-token cache miss triggers fresh fetch; corrupt JSON falls through silently; both opt-out paths; per-key + per-base-url cache isolation; refresh_token() clears the file; 401 invalidates the cache; safety-margin treats near-expiry as miss.
  • New tests/conftest.py autouse fixture routes all tests to tmp_path so the suite never touches the real ~/.cache/colony-sdk/ (previously, just running the suite would write 4e03f9ea9cac7702.json to the dev's real cache dir).
  • All 456 existing tests pass (no regressions).
  • ruff check clean, ruff format clean.
  • Manual smoke test: run two consecutive python3 -c "from colony_sdk import ColonyClient; c = ColonyClient('col_...'); print(c.get_me()['username'])" invocations — second one should observably skip /auth/token (visible in server logs if you have access, or via OpenTelemetry tracing if enabled).
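The sandboxing idea behind the conftest fixture can be shown without pytest. The sketch below is a stdlib-only analogue, not the fixture from the PR: `sandboxed_token_cache` is a hypothetical helper, and it relies only on the `COLONY_SDK_TOKEN_CACHE_DIR` override documented above.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator
from unittest import mock


@contextmanager
def sandboxed_token_cache() -> Iterator[str]:
    """Point the token cache at a throwaway directory for the duration."""
    with tempfile.TemporaryDirectory() as tmp:
        # patch.dict restores the previous environment on exit, so the
        # override cannot leak into later tests or the developer's shell.
        with mock.patch.dict(os.environ, {"COLONY_SDK_TOKEN_CACHE_DIR": tmp}):
            yield tmp
```

Any client constructed inside the `with` block would then write its cache file under the temporary directory, which is deleted afterwards.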

🤖 Generated with Claude Code

Cross-process JWT cache at ~/.cache/colony-sdk/ (XDG-aware). The
existing in-memory `_token` cache survives only the lifetime of a
`ColonyClient` instance; every fresh process re-auths against
/auth/token, which the server rate-limits to 100/hr/IP. A single host
running ~10 short-lived SDK scripts plus four supervisor-rotated
dogfood agents can exhaust that budget in an hour.

This change persists the access_token + expiry to disk in
~/.cache/colony-sdk/<sha256(base_url|api_key)[:16]>.json (mode 0600,
atomic write). New processes for the same (base_url, api_key) pair
read the cached token before paying for /auth/token. A 60s safety
margin avoids handing out a token that's about to expire.

Cache invalidation:
- refresh_token() clears both in-memory + on-disk
- rotate_key() clears the OLD key's cache file BEFORE flipping api_key
- 401 responses clear the disk cache so a stale token can't resurrect
  across processes
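
The ordering in rotate_key() matters because the cache filename is derived from the key. A toy sketch of that ordering, assuming a hypothetical `TokenCacheOwner` class (not the SDK's `ColonyClient`) and the same "base_url|api_key" hashing assumption as the filename scheme:

```python
import hashlib
from pathlib import Path


class TokenCacheOwner:
    """Toy stand-in for a client; illustrates invalidation order only."""

    def __init__(self, base_url: str, api_key: str, cache_dir: Path) -> None:
        self.base_url = base_url
        self.api_key = api_key
        self.cache_dir = cache_dir
        self._token = None

    def cache_path(self) -> Path:
        # Assumed digest input; the path depends on the CURRENT api_key.
        raw = f"{self.base_url}|{self.api_key}".encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(raw).hexdigest()[:16]}.json"

    def _clear_cached_token(self) -> None:
        self._token = None
        try:
            self.cache_path().unlink()
        except OSError:
            pass  # a missing or undeletable file is not an error

    def rotate_key(self, new_api_key: str) -> None:
        # Clear first: once api_key changes, cache_path() points at the new
        # key's file and the old key's token would be left behind on disk.
        self._clear_cached_token()
        self.api_key = new_api_key
```

Swapping the two lines in `rotate_key` would delete the (nonexistent) new file and orphan the old one.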

Opt-out:
- per-client: ColonyClient(..., cache_token=False)
- global: COLONY_SDK_NO_TOKEN_CACHE=1

Test sandboxing:
- COLONY_SDK_TOKEN_CACHE_DIR overrides cache dir (used by tests)
- new tests/conftest.py autouse fixture routes all tests to tmp_path
  so token writes never leak into the real ~/.cache during dev

Mirrored in AsyncColonyClient — sync + async share the same cache
file for the same (base_url, api_key) pair.

11 new tests in TestTokenCachePersistence covering: first-write,
load-from-disk, expired-token miss, corrupt-cache fallthrough, both
opt-out paths, per-key and per-base-url isolation, refresh_token
side effects, 401 invalidation, and safety-margin behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov

codecov Bot commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

ColonistOne and others added 2 commits May 23, 2026 09:52
8 new async tests in TestAsyncTokenCachePersistence covering the
same paths as the sync version: first-write, second-client read,
per-client and global opt-out, refresh_token disk-cache cleanup,
corrupt-cache fallthrough, expired-cache miss, and 401 invalidation.

Coverage on async_client.py rises from 95% to 98%. The remaining gap
is the URLError retry-on-network-failure path: the sync suite covers
it, but the async client's version has a slightly different shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… diff)

Adds 8 targeted tests to exercise the branches codecov flagged:

- XDG_CACHE_HOME fallback path (when no explicit override is set)
- ~/.cache fallback path (neither env var set)
- mid-write OSError swallow + tmpfile cleanup (sync + async)
- outer OSError swallow on un-writable cache dir (sync + async)
- _clear_cached_token early-return when cache globally disabled (sync + async)

Now at 100% coverage on both client.py and async_client.py. Same
contract: best-effort under OSError; programmer-error exceptions
still propagate so bugs aren't masked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread src/colony_sdk/async_client.py Outdated
self.retry = retry if retry is not None else RetryConfig()
self.typed = typed
# `cache_token=True` (default) persists the JWT to disk in
# `~/.cache/colony-sdk/` (XDG-aware), shared with the sync
Contributor

I'm a bit unsure about this. Can we expect this path to exist and be writable on all systems?

Would that path work on Windows, for example?

Collaborator Author

Fair point — the original implementation was Linux/XDG-only. Just pushed 1485904 which adds per-platform resolution:

  • Linux / BSD / Unix: $XDG_CACHE_HOME/colony-sdk or ~/.cache/colony-sdk
  • macOS: ~/Library/Caches/colony-sdk (per Apple's File System Programming Guide)
  • Windows: %LOCALAPPDATA%\colony-sdk\Cache (falls back to %APPDATA% if unset, then ~/AppData/Local/... if neither)

COLONY_SDK_TOKEN_CACHE_DIR still wins on every platform as the explicit escape hatch.

7 new tests in TestTokenCachePersistence cover the resolution order on each platform (mocked via sys.platform so they're cross-platform-runnable themselves). Failure mode is still best-effort: if mkdir/open fails for any reason on any platform, the SDK falls through to a fresh /auth/token call without erroring.
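
The resolution order described in this reply could be sketched as below. `resolve_cache_dir` is a hypothetical name, not the SDK's function, and the final Windows fallback path is an assumption, since the comment elides it as `~/AppData/Local/...`. Taking `platform`, `env`, and `home` as parameters mirrors the idea of mocking `sys.platform` so every branch can be exercised on any OS.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    platform: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Pick a per-platform cache directory; the explicit override always wins."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    override = env.get("COLONY_SDK_TOKEN_CACHE_DIR")
    if override:
        return Path(override)
    if platform == "darwin":
        return home / "Library" / "Caches" / "colony-sdk"
    if platform.startswith("win"):
        base = env.get("LOCALAPPDATA") or env.get("APPDATA")
        # Assumption: the home-based fallback mirrors the LOCALAPPDATA layout.
        root = Path(base) if base else home / "AppData" / "Local"
        return root / "colony-sdk" / "Cache"
    # Linux / BSD / other Unix: XDG_CACHE_HOME, else ~/.cache.
    xdg = env.get("XDG_CACHE_HOME")
    return (Path(xdg) if xdg else home / ".cache") / "colony-sdk"
```

Creating the directory is deliberately left out of this function; under the best-effort contract, a failed mkdir at write time just means the token is fetched fresh.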

Comment thread src/colony_sdk/client.py Outdated
# scripts that construct a fresh client per invocation) re-auth against
# `/auth/token` every time, which trips the 100/hr/IP server-side rate
# limit when many cycles happen on the same host — a recurring pattern
# in supervisor-rotated dogfood-agent setups.
Contributor

The very specific case where we saw an issue I don't think needs an explanation in a generic SDK.

Collaborator Author

Agreed — generic SDK code shouldn't carry incident-specific framing. Removed "supervisor-rotated dogfood-agent setups" and the "100/hr/IP" mention from both the module-level docstring block and the _ensure_token rationale comment. Now talks abstractly about "short-lived scripts and processes that recreate a client per invocation" with no deployment specifics. CHANGELOG entry tightened in the same direction.

The PR #52 comment block above (lines 169-190 in the file, the _DEFAULT_AUTH_RETRY rationale) does still reference the 2026-05-21 incident — happy to follow up with a separate cleanup PR if you want that scrubbed too, but flagged it as out of scope for this PR.

Comment thread src/colony_sdk/__init__.py Outdated
from colony_sdk.testing import MockColonyClient

-__version__ = "1.11.1"
+__version__ = "1.12.0"
Contributor

Is this more of a bug fix than a feature? Should we only bump a patch version?

Collaborator Author

Good point — bumped down to 1.11.2.

Reasoning: the on-disk cache is structurally the cross-process completion of the existing in-memory _token cache, not new capability. The user-visible additions (cache_token arg, env vars) are configuration knobs for behaviour that should arguably have been there from the start. Same shape as the 1.11.0 → 1.11.1 auth_token_retry bump in PR #52 — operationally important fix, no semantic change for code that doesn't touch the new args.

Addresses the three review comments on PR TheColonyCC#53:

1. Cross-platform cache dir (#3292517995). The previous implementation
   used `~/.cache/colony-sdk/` unconditionally, which is Linux/XDG-only.
   Now picks per-platform:
     - Linux/BSD/Unix: $XDG_CACHE_HOME/colony-sdk or ~/.cache/colony-sdk
     - macOS: ~/Library/Caches/colony-sdk (per Apple's FSPG)
     - Windows: %LOCALAPPDATA%/colony-sdk/Cache (or %APPDATA% fallback)
   `COLONY_SDK_TOKEN_CACHE_DIR` still wins on every platform.

2. Drop incident-specific comment text (#3292519062). The module-level
   docstring + the `_ensure_token` rationale no longer reference
   "supervisor-rotated dogfood-agent setups" or the "100/hr/IP rate
   limit". Generic SDK doesn't need to know about specific deployment
   patterns. CHANGELOG entry also tightened.

3. Patch bump, not minor (#3292520901). This is the cross-process
   completion of the existing in-memory token cache — not a new
   capability. 1.11.1 → 1.11.2.

Adds 7 new platform-specific tests covering the resolution order on
Linux / macOS / Windows including the LOCALAPPDATA / APPDATA / home
fallback chain. Still 100% patch coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jackparnell jackparnell merged commit 375c717 into TheColonyCC:main May 23, 2026
7 checks passed
2 participants