Skip to content

fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961

Open
meem08 wants to merge 1 commit into
rinafcode:mainfrom
meem08:fix/811-distributed-cache-hit-rate-counters
Open

fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961
meem08 wants to merge 1 commit into
rinafcode:mainfrom
meem08:fix/811-distributed-cache-hit-rate-counters

Conversation

@meem08

@meem08 meem08 commented Jun 30, 2026

Copy link
Copy Markdown

Summary

Fixes #811.

CachingService previously tracked this.hits / this.misses as in-process instance variables. In a horizontally scaled deployment with N pods, the value returned by getStats() (and ingested into the cache_hit_rate_percentage Prometheus gauge via publishHitRateMetrics) reflected only the current pod's slice of traffic. Dashboards and alerting rules driven by this metric were therefore skewed by a factor of ~1/N, masking both regressions and genuine improvements.

This change moves counter storage to Redis using atomic INCR against cache:hits:{type} and cache:misses:{type} so every pod writes to and reads from the same shared source of truth.

Linked Issue

Closes / references #811"CachingService hit/miss counters are in-process and incorrect in multi-instance deployments".

Acceptance Criteria

  • Hit rate is consistent across all pods under load — counters are shared via Redis INCR.
  • Redis keys are namespaced by cache type — keys are cache:hits:{type} / cache:misses:{type} where {type} is derived from the second segment of the cache key (e.g. cache:test:1cache:hits:test).
  • Reset daily — a @Cron(EVERY_DAY_AT_MIDNIGHT, { timeZone: 'UTC' }) job (dailyReset) sweeps both counter namespaces; resetStats(type) is also exposed for manual / rolling-window resets.
  • Unit tests mock Redis counters and verify the reported rate — src/caching/caching.service.spec.ts mocks incr / mget / scan / del and asserts exact key shape, cluster vs local numbers, fallback, and reset paths.

Impacted Files

File Change
src/caching/caching.service.ts Reworked recordHit / recordMiss to INCR per type; made getStats, resetStats, publishHitRateMetrics async and Redis-backed; added dailyReset cron, getAggregateStats, deduped helpers, and graceful local fallback.
src/caching/caching.service.spec.ts Updated to mock Redis client, await async stats APIs, and cover cluster-wide + per-type + aggregate + reset + fallback paths.

Implementation Notes

Counter namespace

Keys are built via the exported helper:

buildCounterKeys('application')
// → { hits: 'cache:hits:application', misses: 'cache:misses:application' }

Cache keys in the codebase already follow a cache:{type}:... convention, so deriveCacheType(key) extracts {type} from the second segment. Keys that don't follow this convention fall back to a single default bucket. This keeps counter-key cardinality bounded by the number of cache types rather than the number of cached entries.

Cluster-wide read path

  • getStats(type) reads cache:hits:{type} and cache:misses:{type} via a single MGET.
  • getAggregateStats() runs two independent SCAN loops (one per MATCH pattern) and concatenates the key sets — sharing a cursor across two parallel MATCH scans would terminate early whenever either pattern reached '0' and silently drop keys.
  • publishHitRateMetrics(type) simply awaits getStats(type) and forwards it to MetricsCollectionService.updateCacheHitRate.

Daily reset

@Cron(CronExpression.EVERY_DAY_AT_MIDNIGHT, { timeZone: 'UTC' })
async dailyReset(): Promise<void> {
  await this.resetStats();
}

resetStats(type?) deletes the targeted keys (DEL cache:hits:{type} cache:misses:{type}) when called with a type, or scans-and-deletes every counter key when called without one.

Resilience

  • CACHE_COUNTER_ENABLE_REDIS=false disables the shared client entirely.
  • CACHE_COUNTER_FALLBACK_LOCAL=true (default) keeps an in-process counter so single-instance dev / test setups and Redis outages still produce non-zero numbers.
  • resolveRedisFromConfig() short-circuits when NODE_ENV === 'test' so DI-instantiated services in tests don't open live Redis connections — tests inject a mock client via the constructor instead.

Public API change (async)

getStats(), resetStats(), and publishHitRateMetrics() are now async. An audit of the codebase (grep -r "cachingService.getStats\|cachingService.publishHitRateMetrics\|cachingService.resetStats") shows no production callers other than the spec we updated. Consumers that use the low-level API (get/set/getOrSet/delete/deleteMany/clear/deleteByPattern) are unaffected.

Testing

pnpm run typecheck          # ✅ tsc passes
pnpm exec jest src/caching/caching.service.spec.ts --no-coverage   # ✅ 16/16 pass

New tests cover:

  • deriveCacheType and buildCounterKeys exported helpers
  • getOrSet increments the correct Redis key for hit/miss
  • publishHitRateMetrics reads from Redis and emits the cluster-wide rate
  • Aggregate stats across multiple types
  • resetStats(type) and resetStats() (full sweep)
  • Fallback to local counters when Redis errors out
  • Throw propagation when fallback is disabled and Redis errors out

Out of Scope

ComputationCacheService has the same in-process counter pattern (computation hit/miss). The issue scopes the fix to CachingService only, so it is not modified by this PR. A follow-up can lift the same helpers into a shared RedisCounterService and adopt them in ComputationCacheService for consistency.

…fcode#811)

CachingService previously tracked this.hits / this.misses as in-process
instance variables. In a horizontally scaled deployment each pod reported
only its own slice of traffic, so getStats() and Prometheus hit-rate
metrics (cache_hit_rate_percentage) were skewed by 1/N.

This change moves counter storage to Redis using atomic INCR on
keys cache:hits:{type} and cache:misses:{type}. The cache type is
derived from the second segment of the cache key (cache:{type}:...),
keeping the cardinality of counter keys bounded.

Highlights:
* recordHit / recordMiss now await INCR against shared Redis keys.
* getStats(), resetStats(), publishHitRateMetrics() are async and read
  from Redis so the reported hit rate reflects cluster-wide traffic.
* New daily @Cron(EVERY_DAY_AT_MIDNIGHT UTC) resets all counter keys.
* getAggregateStats() scans cache:hits:* and cache:misses:* (each in
  its own complete SCAN loop) to expose a single cluster-wide rate.
* Graceful local fallback (CACHE_COUNTER_FALLBACK_LOCAL=true) is kept
  so single-instance dev/test runs and Redis outages still report
  something useful instead of silently zero.
* resolveRedisFromConfig() short-circuits in NODE_ENV=test to avoid
  leaking live Redis connections when other suites DI the service.
* Tests mock the Redis client (incr/mget/scan/del) and assert the
  exact key shape, distributed vs local counters, fallback behaviour
  and daily reset.

Refs rinafcode#811
@drips-wave

drips-wave Bot commented Jun 30, 2026

Copy link
Copy Markdown

@meem08 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@RUKAYAT-CODER

Copy link
Copy Markdown
Contributor

Great job so far

There’s just one blocker — the workflow is failing. Could you take a look and fix it so all checks pass?
You could pull from the main to get the changes before pushing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CachingService hit/miss counters are in-process and incorrect in multi-instance deployments

2 participants