Skip to content

fix: honor 429 rate limits from VngCloud API#7

Open
anngdinh wants to merge 2 commits into
mainfrom
fix/rate-limit-handling
Open

fix: honor 429 rate limits from VngCloud API#7
anngdinh wants to merge 2 commits into
mainfrom
fix/rate-limit-handling

Conversation

@anngdinh

@anngdinh anngdinh commented May 1, 2026

Copy link
Copy Markdown
Contributor

Summary

When the API returns 429 the controller currently sees a generic error, controller-runtime requeues with exponential backoff, and N reconciles in flight independently keep hammering the bucket so it never refills. This PR makes the controller honor the server's rate-limit signal.

  • New domain.RateLimitError carries RetryAfter parsed from X-Ratelimit-Reset (then Retry-After).
  • domain.SDKError(sdkErr) replaces 55 raw sdkErr.GetError() calls in internal/repository/vngcloud_repo/ so 429s propagate as the typed error.
  • pkg/errs.HandleReconcileError converts a RateLimitError into ctrl.Result{RequeueAfter} via domain.RateLimitRequeueAfter (2s floor, 5m ceiling, +50% jitter) — workqueue items honor the server's wait hint, and independent processes/clusters desynchronize retries instead of re-colliding at reset+0s.
  • IsRateLimitExceeded now uses errors.As (was a TODO stub).

Cross-cluster note

One user can run this controller in many clusters; the rate limit applies per account, not per cluster. Reactive backoff alone covers this because the API server itself is the coordination point — every cluster sees the same 429 and jittered RetryAfter keeps them from re-colliding. Per-process bucket coordination is intentionally not included here; see follow-ups.

Test plan

  • go build ./...
  • go vet ./internal/domain/... ./pkg/errs/... ./internal/repository/vngcloud_repo/...
  • go test ./internal/domain/... ./pkg/errs/... covers parser (status code, X-Ratelimit-Reset, Retry-After fallback, missing headers, nil), RateLimitRequeueAfter (floor/ceiling/jitter), and the HandleReconcileError rate-limit branch (direct + wrapped via fmt.Errorf("%w", ...)).
  • Verify in production: deploy and watch for the previous statusCode 429 lines to stop accumulating; expect rate limited by VngCloud API, requeue after Xs warnings instead.

Follow-ups (not in this PR)

  1. Reduce baseline request rate — TTL-cache ListCertificates and other hot List* calls. Likely the bigger win and attacks the cause, not the symptom.
  2. Process-wide rate gate (golang.org/x/time/rate.Limiter) so a cluster with many concurrent reconciles self-throttles before getting 429'd. Only worth doing if reactive backoff isn't enough.

🤖 Generated with Claude Code

anngdinh and others added 2 commits May 1, 2026 01:28
When the API returns 429 the controller currently sees a generic error,
controller-runtime requeues with exponential backoff, and N reconciles
in flight independently keep hammering the bucket so it never refills.

Surface 429 as a typed domain.RateLimitError carrying the server's
X-Ratelimit-Reset / Retry-After hint, plumb it through every repository
call (sdkErr.GetError -> domain.SDKError), and convert it in
HandleReconcileError to ctrl.Result{RequeueAfter} with a 2s floor, 5m
ceiling, and up to 50% jitter so independent processes don't retry in
lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from v3 to main June 15, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant