Skip to content

perf(sdk): cache check_beta_features userinfo response per credentials#599

Open
RapidPoseidon wants to merge 1 commit into
mainfrom
perf(sdk)/cache-userinfo-check-beta-features
Open

perf(sdk): cache check_beta_features userinfo response per credentials#599
RapidPoseidon wants to merge 1 commit into
mainfrom
perf(sdk)/cache-userinfo-check-beta-features

Conversation

@RapidPoseidon
Copy link
Copy Markdown
Contributor

@RapidPoseidon RapidPoseidon commented May 22, 2026

Summary

Cache the /connect/userinfo response that RapidataClient.__init__ fires from _check_beta_features, so subsequent client constructions in the same process reuse the result instead of calling identity-service again.

Why

A customer creates many short-lived RapidataClient instances inside one process. Each __init__ calls _check_beta_features, which fires GET https://auth.{env}/connect/userinfo with _request_timeout=1. During traffic bursts observed on 2026-05-22 the same process was generating 1500+ userinfo calls per minute; identity-service slowed under that load and the 1s timeout started failing, surfacing as SDK errors.

Userinfo is stable for the access token's lifetime (~1 day per OIDC config), so the call is redundant across short-lived clients sharing the same credentials.

What changes

  • Module-level cache in rapidata_client.py: dict[(environment, client_id), _UserInfoCacheEntry] with 24h TTL.
  • threading.Lock protects the cache for concurrent client construction.
  • _check_beta_features checks the cache before making the HTTP call; on miss, calls userinfo, stores the response, then applies the user-info / Admin-role side effects.
  • reset_credentials evicts the entry for the current (environment, client_id).
  • clear_all_caches wipes the whole cache.
  • No public API change; no docs/examples updated.
  • The _request_timeout=1 is left alone — the cache, not a higher timeout, is what fixes the burst symptom.

Test plan

  • uv run pyright src/rapidata/rapidata_client → 0 errors
  • uv run python3 -c "from rapidata import RapidataClient; print('OK')" → OK

🔗 Session: https://session-a1ce5546.poseidon.rapidata.internal/

Customers (e.g. Reve.art) construct many short-lived RapidataClient
instances during request bursts (~1500+ per minute from a single
process). Each __init__ called _check_beta_features which fired a
GET /connect/userinfo with a 1s timeout; under burst load
identity-service slowed enough for the calls to time out and surface
as SDK failures.

The userinfo response is stable for the access token's lifetime
(~1 day per OIDC config), so cache it process-wide keyed by
(environment, client_id) with a 24h TTL. The cache is guarded by a
threading.Lock for concurrent client construction. reset_credentials
clears the matching entry and clear_all_caches wipes the whole cache.

Co-Authored-By: luca <luca@rapidata.ai>
@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Code Review

Overall: Solid, well-motivated fix for a real production pain point. The structure is clean, the locking is correct, and the extracted _apply_userinfo helper is an improvement. A few issues worth addressing before merge.


Bugs / Correctness

1. reset_credentials doesn't evict the cache when client_id=None

When a user authenticates via browser flow or token (no client_id passed), self._client_id is None at construction. The cache entry is then stored under (self._environment, sub) — the sub from the API response. But reset_credentials only evicts when self._client_id is not None, so the stale entry is never removed.

After a credential reset, the next RapidataClient() call (with no client_id) will find None cache key, skip the cache lookup, make the HTTP call, and write a fresh entry — but any concurrent instance could still read the stale one in the window between them. More importantly, calling reset_credentials on a no-client_id client is a no-op for the cache.

Mitigation: store the resolved sub on self after a successful userinfo fetch (e.g. self._userinfo_sub), and use it in reset_credentials as a fallback key.


2. No cache hit for the client_id=None path on first call — the core use-case may not be fully covered

The cache lookup block is guarded by if cache_key is not None. For clients constructed without an explicit client_id, the first N concurrent calls all reach the HTTP endpoint (classic thundering-herd on startup). The fix lands writes but misses reads until after the first response. If the customer in question uses env-var credentials (RAPIDATA_CLIENT_ID), this is a non-issue (env vars are resolved to client_id before self._client_id is set). Worth confirming that is the actual path.


Design / Thread Safety

3. Lock released between cache miss and HTTP call (double-fetch)

# lock released here
if cached_result is not None:
    ...
    return

result = ...  # HTTP call happens without the lock

Two threads with the same cache_key can both read a miss, both call identity-service, and both write to cache. This is a benign double-fetch (not a correctness bug), but it could produce a small burst on a cold start. For the reported 1500 req/min pattern this is likely negligible. Just document or add a TODO if you want to close it later.


4. clear_all_caches clears entries for every client in the process

_userinfo_cache is module-level. Calling clear_all_caches() on one RapidataClient instance invalidates cached entries for all other instances in the same process (including those with different credentials). The docstring doesn't mention this. Worth at minimum adding a note, or scoping the clear to (self._environment, self._client_id) unless a full wipe is intentional.


Minor

5. Hardcoded 24 h TTL

_USERINFO_CACHE_TTL_SECONDS is correct for the stated OIDC config, but baking it in makes it hard to tune if the token lifetime changes. A class-level constant or env-var override would help. Low priority — mentioning for visibility.

6. Mutable dict stored in cache

_UserInfoCacheEntry.result holds the exact dict returned by json.loads. If _apply_userinfo ever gains a mutation (even accidentally), all cache readers share the side effect. A defensive result.copy() at write time costs nothing and removes the risk:

_userinfo_cache[effective_key] = _UserInfoCacheEntry(
    result=result.copy(),
    expires_at=time.monotonic() + _USERINFO_CACHE_TTL_SECONDS,
)

What's good

  • from __future__ import annotations added correctly, consistent with project conventions.
  • time.monotonic() — right call for TTL; immune to clock adjustments.
  • threading.Lock placement is correct; no lock held during the HTTP call (avoids blocking other threads).
  • _apply_userinfo extraction makes the logic readable and avoids duplication between cache-hit and cache-miss paths.
  • reset_credentials integration is a good touch — prevents stale admin-flag after a credential swap (for the client_id case).

Recommend: Address issue #1 (stale cache after reset_credentials on no-client_id clients) before merging. The rest are low-risk and can be follow-ups.

@LucStr LucStr requested a review from LinoGiger May 22, 2026 12:38
@LucStr LucStr assigned LucStr and LinoGiger and unassigned LucStr May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants