fix(sns): collapse concurrent cert-cache misses via singleflight (bug-bash #1/#9) #231

Merged: mastermanas805 merged 1 commit into master from fix/sns-cert-singleflight-2026-06-04 on Jun 3, 2026
@mastermanas805

What

snsVerifier.getCert read the cert cache under RLock and, on a miss, fetched the cert over the network and wrote it under Lock, with nothing coordinating callers between the read miss and the write. In a burst of concurrent SNS deliveries that all miss the cache (cold start, or the instant after the 24h TTL expires), each delivery fires its own fetchCert and then races to overwrite the same map entry: a thundering herd against the AWS cert endpoint.
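To make the herd concrete, here is a minimal sketch of the pre-fix pattern as described above. It is not the code from internal/handlers/sns_verify.go: `naiveVerifier`, `countNaiveFetches`, the injected `fetch` function, and the example URL are all hypothetical, and the 24h TTL is left out. The fake fetch blocks on a channel until every caller has started its own fetch, so the count is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// naiveVerifier is a hypothetical stand-in for the pre-fix snsVerifier.
// TTL handling is omitted to keep the sketch short.
type naiveVerifier struct {
	mu    sync.RWMutex
	cache map[string]string
	fetch func(certURL string) (string, error) // injected so the sketch can count calls
}

func (v *naiveVerifier) getCert(certURL string) (string, error) {
	v.mu.RLock()
	cert, ok := v.cache[certURL]
	v.mu.RUnlock()
	if ok {
		return cert, nil
	}
	// Nothing coordinates callers here: every goroutine that missed fetches.
	cert, err := v.fetch(certURL)
	if err != nil {
		return "", err
	}
	v.mu.Lock()
	v.cache[certURL] = cert // last writer wins
	v.mu.Unlock()
	return cert, nil
}

// countNaiveFetches runs n concurrent cache-miss calls and reports how many
// fetches were issued. The fetch is held open until all n have started.
func countNaiveFetches(n int) int64 {
	var fetches int64
	gate := make(chan struct{})
	started := make(chan struct{}, n)
	v := &naiveVerifier{cache: map[string]string{}}
	v.fetch = func(certURL string) (string, error) {
		atomic.AddInt64(&fetches, 1)
		started <- struct{}{}
		<-gate
		return "PEM for " + certURL, nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = v.getCert("https://sns.example.invalid/cert.pem")
		}()
	}
	for i := 0; i < n; i++ {
		<-started // wait until every caller is inside its own fetch
	}
	close(gate)
	wg.Wait()
	return atomic.LoadInt64(&fetches)
}

func main() {
	fmt.Println(countNaiveFetches(20)) // prints 20
}
```

The map write itself is safe because it happens under Lock; the cost is the N redundant network fetches.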

Fix

Collapse the miss path through a singleflight.Group keyed by certURL (the same primitive already used in internal/cache/redis.go, team_summary.go, billing_usage.go) so a concurrent burst issues exactly one fetch between them; followers receive the leader's result. The fast-path RLock cache hit is unchanged.
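The shape of the fix can be sketched as follows. To keep the example dependency-free, `flightGroup` is a small hand-rolled stand-in for singleflight.Group, not the real package, and `verifier`, `countFetches`, the injected `fetch`, and the example URL are hypothetical. The cache re-check inside the collapsed function is also an assumption of this sketch (it closes the window where a caller misses the cache just before the leader finishes); the PR description does not say whether the actual change does this. The 24h TTL is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flightCall and flightGroup are a minimal stand-in for singleflight.Group.
type flightCall struct {
	done chan struct{}
	cert string
	err  error
}

type flightGroup struct {
	mu    sync.Mutex
	calls map[string]*flightCall
}

// do runs fn once per key at a time; concurrent callers with the same key
// wait for the leader and receive its result.
func (g *flightGroup) do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*flightCall)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		<-c.done // follower: wait for the leader
		return c.cert, c.err
	}
	c := &flightCall{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.cert, c.err = fn() // leader: do the work
	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	close(c.done)
	return c.cert, c.err
}

// verifier is a hypothetical stand-in for snsVerifier.
type verifier struct {
	mu     sync.RWMutex
	cache  map[string]string
	flight flightGroup
	fetch  func(certURL string) (string, error)
}

func (v *verifier) lookup(certURL string) (string, bool) {
	v.mu.RLock()
	defer v.mu.RUnlock()
	cert, ok := v.cache[certURL]
	return cert, ok
}

func (v *verifier) getCert(certURL string) (string, error) {
	if cert, ok := v.lookup(certURL); ok {
		return cert, nil // fast path, unchanged
	}
	return v.flight.do(certURL, func() (string, error) {
		if cert, ok := v.lookup(certURL); ok {
			return cert, nil // re-check (assumption of this sketch)
		}
		cert, err := v.fetch(certURL)
		if err != nil {
			return "", err
		}
		v.mu.Lock()
		v.cache[certURL] = cert
		v.mu.Unlock()
		return cert, nil
	})
}

// countFetches runs n concurrent cache-miss calls with the fetch held open
// on a channel and reports how many fetches were issued.
func countFetches(n int) int64 {
	var fetches int64
	gate := make(chan struct{})
	started := make(chan struct{}, n)
	v := &verifier{cache: map[string]string{}}
	v.fetch = func(certURL string) (string, error) {
		atomic.AddInt64(&fetches, 1)
		started <- struct{}{}
		<-gate
		return "PEM for " + certURL, nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = v.getCert("https://sns.example.invalid/cert.pem")
		}()
	}
	<-started // the leader is in flight
	close(gate)
	wg.Wait()
	return atomic.LoadInt64(&fetches)
}

func main() {
	fmt.Println(countFetches(20)) // prints 1
}
```

With golang.org/x/sync/singleflight the equivalent call is `v, err, _ := group.Do(certURL, fn)`, where fn returns `(interface{}, error)` and the third result reports whether the value was shared with other callers.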

Coverage

Symptom:        N concurrent cache-miss deliveries → N redundant cert fetches (herd)
Enumeration:    grep certCache/getCert internal/handlers/sns_verify.go (single getCert)
Sites found:    1
Sites touched:  1
Coverage test:  TestSNSVerifyFinal2_GetCert_SingleflightCollapsesConcurrentMisses
Live verified:  passes under -race; go tool cover → getCert 100.0%

The test issues 20 concurrent cache-miss getCert calls with the fetch held open on a channel so that all of them collapse onto one in-flight call, then asserts that fetchCert ran exactly once.

🤖 Generated with Claude Code

…-bash #1/#9)

snsVerifier.getCert read the cert cache under RLock and, on a miss,
fetched the cert and wrote it under Lock — with nothing between the
read-miss and the write. A burst of concurrent SNS deliveries that all
miss the cache (cold start, or the instant after the 24h TTL expires)
each fire their own fetchCert over the network and then race to overwrite
the map: a thundering herd against the AWS cert endpoint.

Collapse the miss path through a singleflight.Group keyed by certURL
(same primitive already used in internal/cache/redis.go, team_summary.go)
so a concurrent burst issues exactly ONE fetch between them; followers
receive the leader's result. The fast-path RLock cache hit is unchanged.

Test: TestSNSVerifyFinal2_GetCert_SingleflightCollapsesConcurrentMisses —
20 concurrent missing-cache getCert calls, fetch held open on a channel
so all collapse onto one in-flight call; asserts exactly 1 fetchCert.
Passes under -race; getCert 100% covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mastermanas805 enabled auto-merge (squash) on June 3, 2026 18:53
mastermanas805 merged commit c48d17a into master on Jun 3, 2026
18 checks passed