feat: persist JWT to disk so process restarts skip /auth/token (v1.12.0) #53
Cross-process JWT cache at ~/.cache/colony-sdk/ (XDG-aware).

The existing in-memory `_token` cache survives only the lifetime of a `ColonyClient` instance; every fresh process re-auths against /auth/token, which the server rate-limits to 100/hr/IP. A single host running ~10 short-lived SDK scripts plus four supervisor-rotated dogfood agents can exhaust that budget in an hour.

This change persists the access_token + expiry to disk in ~/.cache/colony-sdk/<sha256(base_url|api_key)[:16]>.json (mode 0600, atomic write). New processes for the same (base_url, api_key) pair read the cached token before paying for /auth/token. A 60s safety margin avoids handing out a token that's about to expire.

Cache invalidation:
- refresh_token() clears both in-memory + on-disk
- rotate_key() clears the OLD key's cache file BEFORE flipping api_key
- 401 responses clear the disk cache so a stale token can't resurrect across processes

Opt-out:
- per-client: ColonyClient(..., cache_token=False)
- global: COLONY_SDK_NO_TOKEN_CACHE=1

Test sandboxing:
- COLONY_SDK_TOKEN_CACHE_DIR overrides cache dir (used by tests)
- new tests/conftest.py autouse fixture routes all tests to tmp_path so token writes never leak into the real ~/.cache during dev

Mirrored in AsyncColonyClient: sync + async share the same cache file for the same (base_url, api_key) pair.

11 new tests in TestTokenCachePersistence covering: first-write, load-from-disk, expired-token miss, corrupt-cache fallthrough, both opt-out paths, per-key and per-base-url isolation, refresh_token side effects, 401 invalidation, and safety-margin behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
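For readers skimming the thread, the lookup described in this commit could be sketched roughly as below. This is not the SDK's code: the function names, the `expires_at` field name, and the exact JSON layout are assumptions made for illustration. Only the file-name scheme, the 60s margin, and the `COLONY_SDK_NO_TOKEN_CACHE` variable come from the commit message.

```python
import hashlib
import json
import os
import time
from pathlib import Path
from typing import Optional

SAFETY_MARGIN_S = 60  # from the commit message: skip tokens within 60s of expiry


def cache_enabled(cache_token: bool = True) -> bool:
    """Both the per-client flag and the global env opt-out must allow caching."""
    # Assumption: only the literal value "1" disables the cache globally.
    return cache_token and os.environ.get("COLONY_SDK_NO_TOKEN_CACHE") != "1"


def cache_file(cache_dir: Path, base_url: str, api_key: str) -> Path:
    """Derive <sha256(base_url|api_key)[:16]>.json inside cache_dir."""
    # Assumption: UTF-8 input and a hex digest truncated to 16 characters.
    digest = hashlib.sha256(f"{base_url}|{api_key}".encode("utf-8")).hexdigest()
    return cache_dir / f"{digest[:16]}.json"


def load_cached_token(path: Path, now: Optional[float] = None) -> Optional[str]:
    """Return a cached token, or None on any miss (absent, corrupt, near expiry)."""
    now = time.time() if now is None else now
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        token = data["access_token"]
        expires_at = float(data["expires_at"])  # hypothetical field name
    except (OSError, ValueError, KeyError, TypeError):
        return None  # unreadable or malformed file is treated as a cache miss
    if not isinstance(token, str) or expires_at - now <= SAFETY_MARGIN_S:
        return None  # expired, or inside the safety margin
    return token
```

A caller would check `cache_enabled(...)` first, then try `load_cached_token(cache_file(...))`, and only hit /auth/token when that returns `None`.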
Codecov Report ✅ All modified and coverable lines are covered by tests.
8 new async tests in TestAsyncTokenCachePersistence covering the same paths as the sync version: first-write, second-client read, per-client and global opt-out, refresh_token disk-cache cleanup, corrupt-cache fallthrough, expired-cache miss, and 401 invalidation.

Coverage was 95% on async_client.py before; now 98%. The remaining 3% is the URLError retry-on-network-failure path, which the sync suite covers but which has a slightly different shape in the async client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… diff)

Adds 8 targeted tests to exercise the branches codecov flagged:
- XDG_CACHE_HOME fallback path (when no explicit override is set)
- ~/.cache fallback path (neither env var set)
- mid-write OSError swallow + tmpfile cleanup (sync + async)
- outer OSError swallow on un-writable cache dir (sync + async)
- _clear_cached_token early-return when cache globally disabled (sync + async)

Now at 100% coverage on both client.py and async_client.py. Same contract: best-effort under OSError; programmer-error exceptions still propagate so bugs aren't masked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
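The best-effort write contract tested above (swallow `OSError`, clean up the temp file, let other exceptions propagate) might look something like the following sketch. `save_token` and the envelope field names are hypothetical, not taken from the SDK; the sketch relies only on standard-library behaviour (`tempfile.mkstemp` creating the file readable and writable by the owner only, and `os.replace` renaming atomically within a directory).

```python
import json
import os
import tempfile
from pathlib import Path


def save_token(path: Path, access_token: str, expires_at: float) -> bool:
    """Best-effort atomic write; returns False instead of raising on OSError."""
    # Serialise outside the try blocks so programmer errors (e.g. TypeError)
    # propagate rather than being swallowed.
    payload = json.dumps({"v": 1, "access_token": access_token, "expires_at": expires_at})
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        # mkstemp creates the file with mode 0600 on POSIX.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".tmp-", suffix=".json")
    except OSError:
        return False  # un-writable cache dir: the caller simply re-auths
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(payload)
        os.replace(tmp_name, path)  # atomic rename within one directory
        return True
    except OSError:
        try:
            os.unlink(tmp_name)  # mid-write failure: remove the temp file
        except OSError:
            pass
        return False
```

Writing to a temp file in the same directory and renaming it means a concurrent reader sees either the old file or the complete new one, never a partial write.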
    self.retry = retry if retry is not None else RetryConfig()
    self.typed = typed
    # `cache_token=True` (default) persists the JWT to disk in
    # `~/.cache/colony-sdk/` (XDG-aware), shared with the sync
I'm a bit unsure about this. Can we expect this path to exist and be writable on all systems?
Would that path work on Windows, for example?
Fair point — the original implementation was Linux/XDG-only. Just pushed 1485904 which adds per-platform resolution:
- Linux / BSD / Unix: `$XDG_CACHE_HOME/colony-sdk` or `~/.cache/colony-sdk`
- macOS: `~/Library/Caches/colony-sdk` (per Apple's File System Programming Guide)
- Windows: `%LOCALAPPDATA%\colony-sdk\Cache` (falls back to `%APPDATA%` if unset, then `~/AppData/Local/...` if neither)
COLONY_SDK_TOKEN_CACHE_DIR still wins on every platform as the explicit escape hatch.
7 new tests in TestTokenCachePersistence cover the resolution order on each platform (mocked via sys.platform so they're cross-platform-runnable themselves). Failure mode is still best-effort: if mkdir/open fails for any reason on any platform, the SDK falls through to a fresh /auth/token call without erroring.
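As a rough illustration of that resolution order (not the code in 1485904), a resolver could take the platform, environment, and home directory as parameters so each branch can be exercised on any OS. `resolve_cache_dir` is a hypothetical name, and the final Windows fallback of `~/AppData/Local/colony-sdk/Cache` is an assumption, since the comment above elides that path.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    platform: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Pick the token cache directory; the explicit override always wins."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    override = env.get("COLONY_SDK_TOKEN_CACHE_DIR")
    if override:
        return Path(override)
    if platform == "darwin":
        return home / "Library" / "Caches" / "colony-sdk"
    if platform == "win32":
        base = env.get("LOCALAPPDATA") or env.get("APPDATA")
        # Assumption: the last-resort path mirrors the LOCALAPPDATA layout.
        root = Path(base) if base else home / "AppData" / "Local"
        return root / "colony-sdk" / "Cache"
    # Everything else is treated as XDG-style (Linux / BSD / Unix).
    xdg = env.get("XDG_CACHE_HOME")
    return (Path(xdg) if xdg else home / ".cache") / "colony-sdk"
```

Injecting `platform` and `env` is one way to get the same cross-platform testability that mocking `sys.platform` gives.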
    # scripts that construct a fresh client per invocation) re-auth against
    # `/auth/token` every time, which trips the 100/hr/IP server-side rate
    # limit when many cycles happen on the same host — a recurring pattern
    # in supervisor-rotated dogfood-agent setups.
I don't think the very specific case where we saw an issue needs explaining in a generic SDK.
Agreed — generic SDK code shouldn't carry incident-specific framing. Removed "supervisor-rotated dogfood-agent setups" and the "100/hr/IP" mention from both the module-level docstring block and the _ensure_token rationale comment. Now talks abstractly about "short-lived scripts and processes that recreate a client per invocation" with no deployment specifics. CHANGELOG entry tightened in the same direction.
The PR #52 comment block above (lines 169-190 in the file, the _DEFAULT_AUTH_RETRY rationale) does still reference the 2026-05-21 incident — happy to follow up with a separate cleanup PR if you want that scrubbed too, but flagged it as out of scope for this PR.
      from colony_sdk.testing import MockColonyClient

    - __version__ = "1.11.1"
    + __version__ = "1.12.0"
Is this more of a bug fix than a feature? Should we only bump a patch version?
Good point — bumped down to 1.11.2.
Reasoning: the on-disk cache is structurally the cross-process completion of the existing in-memory _token cache, not new capability. The user-visible additions (cache_token arg, env vars) are configuration knobs for behaviour that should arguably have been there from the start. Same shape as the 1.11.0 → 1.11.1 auth_token_retry bump in PR #52 — operationally important fix, no semantic change for code that doesn't touch the new args.
Addresses the three review comments on PR TheColonyCC#53:

1. Cross-platform cache dir (#3292517995). The previous implementation used `~/.cache/colony-sdk/` unconditionally, which is Linux/XDG-only. Now picks per-platform:
   - Linux/BSD/Unix: $XDG_CACHE_HOME/colony-sdk or ~/.cache/colony-sdk
   - macOS: ~/Library/Caches/colony-sdk (per Apple's FSPG)
   - Windows: %LOCALAPPDATA%/colony-sdk/Cache (or %APPDATA% fallback)

   `COLONY_SDK_TOKEN_CACHE_DIR` still wins on every platform.
2. Drop incident-specific comment text (#3292519062). The module-level docstring + the `_ensure_token` rationale no longer reference "supervisor-rotated dogfood-agent setups" or the "100/hr/IP rate limit". Generic SDK doesn't need to know about specific deployment patterns. CHANGELOG entry also tightened.
3. Patch bump, not minor (#3292520901). This is the cross-process completion of the existing in-memory token cache, not a new capability. 1.11.1 → 1.11.2.

Adds 7 new platform-specific tests covering the resolution order on Linux / macOS / Windows including the LOCALAPPDATA / APPDATA / home fallback chain. Still 100% patch coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Cross-process JWT cache at `~/.cache/colony-sdk/<sha256(base_url|api_key)[:16]>.json` (XDG-aware). The existing in-memory `_token` cache lives only for the lifetime of a `ColonyClient` instance; every fresh process re-auths against `/auth/token`, which the server rate-limits to 100/hr/IP. With this PR, a new process for the same `(base_url, api_key)` pair reads the cached token instead of re-authing.

Motivation
Surfaced on 2026-05-23: one host running ~10 short-lived SDK scripts (interactive operator work) plus four supervisor-rotated dogfood agents (each restarted every ~20min) cumulatively hit the 100/hr/IP `/auth/token` rate limit. The supervisor pattern is structurally incompatible with the previous "JWT in memory only" cache because every restart re-auths from zero.

PR #52 (v1.11.1) added retry-with-backoff for `/auth/token` outages; that's about server unavailability. This PR is about avoiding the round-trip entirely when a valid token is already cached.

Behaviour
- Cache directory: `~/.cache/colony-sdk/` (honors `XDG_CACHE_HOME`; overridable via `COLONY_SDK_TOKEN_CACHE_DIR`).
- File name: `<sha256(base_url|api_key)[:16]>.json`, keyed by both `base_url` and `api_key` so the same key against prod vs staging gets independent files.
- Enabled by default: `ColonyClient(..., cache_token=True)`. Per-client opt-out is `cache_token=False`; global opt-out is `COLONY_SDK_NO_TOKEN_CACHE=1`.
- `refresh_token()`, `rotate_key()`, and the auto-401-refresh path all clear the on-disk cache so a stale token can't resurrect itself across processes.
- Best-effort: a cache read or write failure falls through to a fresh `/auth/token` call; the cache is a cold-start latency optimization, not a correctness requirement.

Mirrored symmetrically in `AsyncColonyClient`. Sync + async clients share the cache file for the same `(base_url, api_key)` pair.

Test plan
- New tests in `test_client.py::TestTokenCachePersistence` covering: first-write writes mode-0600 file with v=1 envelope; second client reads from disk and skips `/auth/token`; expired-token cache miss triggers fresh fetch; corrupt JSON falls through silently; both opt-out paths; per-key + per-base-url cache isolation; `refresh_token()` clears the file; 401 invalidates the cache; safety-margin treats near-expiry as miss.
- `tests/conftest.py` autouse fixture routes all tests to `tmp_path` so the suite never touches the real `~/.cache/colony-sdk/` (previously, just running the suite would write `4e03f9ea9cac7702.json` to the dev's real cache dir).
- `ruff check` clean, `ruff format` clean.
- Manual check: back-to-back `python3 -c "from colony_sdk import ColonyClient; c = ColonyClient('col_...'); print(c.get_me()['username'])"` invocations; the second one should observably skip `/auth/token` (visible in server logs if you have access, or via OpenTelemetry tracing if enabled).

🤖 Generated with Claude Code
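For scripts or ad-hoc checks that run outside pytest, the isolation that the `tests/conftest.py` fixture provides could be approximated with a small context manager. This is a sketch only: `sandboxed_token_cache` is a hypothetical helper, not part of the SDK or its test suite, and it assumes the cache directory is resolved from `COLONY_SDK_TOKEN_CACHE_DIR` at the time the client reads or writes the token.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

ENV_VAR = "COLONY_SDK_TOKEN_CACHE_DIR"  # override named in the PR description


@contextmanager
def sandboxed_token_cache() -> Iterator[Path]:
    """Point the cache-dir override at a throwaway directory, then restore it."""
    previous = os.environ.get(ENV_VAR)
    with tempfile.TemporaryDirectory(prefix="colony-sdk-test-") as tmp:
        os.environ[ENV_VAR] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore whatever the caller had set before entering the sandbox.
            if previous is None:
                os.environ.pop(ENV_VAR, None)
            else:
                os.environ[ENV_VAR] = previous
```

Any client constructed inside the `with` block would then write its token file under the temporary directory, which is deleted on exit.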