fix(caching): aggregate hit/miss counters across pods via Redis (closes #811) by meem08 · Pull Request #961 · rinafcode/teachLink_backend

meem08 · 2026-06-30T09:49:06Z

Summary

Fixes #811.

CachingService previously tracked this.hits / this.misses as in-process instance variables. In a horizontally scaled deployment with N pods, the value returned by getStats() (and ingested into the cache_hit_rate_percentage Prometheus gauge via publishHitRateMetrics) reflected only the current pod's slice of traffic. Dashboards and alerting rules driven by this metric were therefore skewed by a factor of ~1/N, masking both regressions and genuine improvements.

This change moves counter storage to Redis using atomic INCR against cache:hits:{type} and cache:misses:{type} so every pod writes to and reads from the same shared source of truth.

Linked Issue

Closes / references #811 — "CachingService hit/miss counters are in-process and incorrect in multi-instance deployments".

Acceptance Criteria

Hit rate is consistent across all pods under load — counters are shared via Redis INCR.
Redis keys are namespaced by cache type — keys are cache:hits:{type} / cache:misses:{type} where {type} is derived from the second segment of the cache key (e.g. cache:test:1 → cache:hits:test).
Reset daily — a @Cron(EVERY_DAY_AT_MIDNIGHT, { timeZone: 'UTC' }) job (dailyReset) sweeps both counter namespaces; resetStats(type) is also exposed for manual / rolling-window resets.
Unit tests mock Redis counters and verify the reported rate — src/caching/caching.service.spec.ts mocks incr / mget / scan / del and asserts exact key shape, cluster vs local numbers, fallback, and reset paths.

Impacted Files

File	Change
`src/caching/caching.service.ts`	Reworked `recordHit` / `recordMiss` to `INCR` per type; made `getStats`, `resetStats`, `publishHitRateMetrics` async and Redis-backed; added `dailyReset` cron, `getAggregateStats`, deduped helpers, and graceful local fallback.
`src/caching/caching.service.spec.ts`	Updated to mock Redis client, await async stats APIs, and cover cluster-wide + per-type + aggregate + reset + fallback paths.

Implementation Notes

Counter namespace

Keys are built via the exported helper:

buildCounterKeys('application')
// → { hits: 'cache:hits:application', misses: 'cache:misses:application' }

Cache keys in the codebase already follow a cache:{type}:... convention, so deriveCacheType(key) extracts {type} from the second segment. Keys that don't follow this convention fall back to a single default bucket. This keeps counter-key cardinality bounded by the number of cache types rather than the number of cached entries.

Cluster-wide read path

getStats(type) reads cache:hits:{type} and cache:misses:{type} via a single MGET.
getAggregateStats() runs two independent SCAN loops (one per MATCH pattern) and concatenates the key sets — sharing a cursor across two parallel MATCH scans would terminate early whenever either pattern reached '0' and silently drop keys.
publishHitRateMetrics(type) simply awaits getStats(type) and forwards it to MetricsCollectionService.updateCacheHitRate.

Daily reset

@Cron(CronExpression.EVERY_DAY_AT_MIDNIGHT, { timeZone: 'UTC' })
async dailyReset(): Promise<void> {
  await this.resetStats();
}

resetStats(type?) deletes the targeted keys (DEL cache:hits:{type} cache:misses:{type}) when called with a type, or scans-and-deletes every counter key when called without one.

Resilience

CACHE_COUNTER_ENABLE_REDIS=false disables the shared client entirely.
CACHE_COUNTER_FALLBACK_LOCAL=true (default) keeps an in-process counter so single-instance dev / test setups and Redis outages still produce non-zero numbers.
resolveRedisFromConfig() short-circuits when NODE_ENV === 'test' so DI-instantiated services in tests don't open live Redis connections — tests inject a mock client via the constructor instead.

Public API change (async)

getStats(), resetStats(), and publishHitRateMetrics() are now async. An audit of the codebase (grep -r "cachingService.getStats\|cachingService.publishHitRateMetrics\|cachingService.resetStats") shows no production callers other than the spec we updated. Consumers that use the low-level API (get/set/getOrSet/delete/deleteMany/clear/deleteByPattern) are unaffected.

Testing

pnpm run typecheck          # ✅ tsc passes
pnpm exec jest src/caching/caching.service.spec.ts --no-coverage   # ✅ 16/16 pass

New tests cover:

deriveCacheType and buildCounterKeys exported helpers
getOrSet increments the correct Redis key for hit/miss
publishHitRateMetrics reads from Redis and emits the cluster-wide rate
Aggregate stats across multiple types
resetStats(type) and resetStats() (full sweep)
Fallback to local counters when Redis errors out
Throw propagation when fallback is disabled and Redis errors out

Out of Scope

ComputationCacheService has the same in-process counter pattern (computation hit/miss). The issue scopes the fix to CachingService only, so it is not modified by this PR. A follow-up can lift the same helpers into a shared RedisCounterService and adopt them in ComputationCacheService for consistency.

…fcode#811) CachingService previously tracked this.hits / this.misses as in-process instance variables. In a horizontally scaled deployment each pod reported only its own slice of traffic, so getStats() and Prometheus hit-rate metrics (cache_hit_rate_percentage) were skewed by 1/N. This change moves counter storage to Redis using atomic INCR on keys cache:hits:{type} and cache:misses:{type}. The cache type is derived from the second segment of the cache key (cache:{type}:...), keeping the cardinality of counter keys bounded. Highlights: * recordHit / recordMiss now await INCR against shared Redis keys. * getStats(), resetStats(), publishHitRateMetrics() are async and read from Redis so the reported hit rate reflects cluster-wide traffic. * New daily @Cron(EVERY_DAY_AT_MIDNIGHT UTC) resets all counter keys. * getAggregateStats() scans cache:hits:* and cache:misses:* (each in its own complete SCAN loop) to expose a single cluster-wide rate. * Graceful local fallback (CACHE_COUNTER_FALLBACK_LOCAL=true) is kept so single-instance dev/test runs and Redis outages still report something useful instead of silently zero. * resolveRedisFromConfig() short-circuits in NODE_ENV=test to avoid leaking live Redis connections when other suites DI the service. * Tests mock the Redis client (incr/mget/scan/del) and assert the exact key shape, distributed vs local counters, fallback behaviour and daily reset. Refs rinafcode#811

drips-wave · 2026-06-30T09:49:14Z

@meem08 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

RUKAYAT-CODER · 2026-06-30T20:45:43Z

Great job so far

There’s just one blocker — the workflow is failing. Could you take a look and fix it so all checks pass?
You could pull from the main to get the changes before pushing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961

fix(caching): aggregate hit/miss counters across pods via Redis (closes #811)#961
meem08 wants to merge 1 commit into
rinafcode:mainfrom
meem08:fix/811-distributed-cache-hit-rate-counters

meem08 commented Jun 30, 2026

Uh oh!

drips-wave Bot commented Jun 30, 2026

Uh oh!

RUKAYAT-CODER commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

meem08 commented Jun 30, 2026

Summary

Linked Issue

Acceptance Criteria

Impacted Files

Implementation Notes

Counter namespace

Cluster-wide read path

Daily reset

Resilience

Public API change (async)

Testing

Out of Scope

Uh oh!

drips-wave Bot commented Jun 30, 2026

Uh oh!

RUKAYAT-CODER commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants