Problem
Two resilience gaps that compound under burst load:
1. Octokit throttle plugin never retries
The @octokit/plugin-throttling callbacks onRateLimit and onSecondaryRateLimit in auth.ts only log a warning — they don't return true, so the plugin never retries the request. The plugin's contract is:
- Return true → retry after the retryAfter delay
- Return nothing / false → throw immediately
Current code:
onRateLimit: (retryAfter, options) => {
  logger.warn(`GitHub rate limit: Request quota exhausted...`);
  // implicitly returns undefined → plugin throws, request fails
},
This means any GitHub API call that hits a rate limit immediately fails and propagates up — potentially causing the entire SQS batch to be dropped (the problem fixed by #5129).
2. SSM client uses default retry (standard, 3 attempts)
Under burst (multiple concurrent Lambdas writing JIT configs via PutParameter), SSM's per-account rate limit (~40 TPS for PutParameter standard throughput) is easily exceeded. The default SDK retry:
- retryMode: 'standard': exponential backoff without rate-sensing
- maxAttempts: 3: gives ~3 seconds of retry budget
This is insufficient. After 3 attempts the SDK throws ThrottlingException, which propagates up and (pre-#5129) silently drops the entire SQS batch.
Fix
1. Return true from throttle callbacks (with retry caps)
onRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 2;
},
onSecondaryRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 1;
},
Primary rate limit: retry up to 2 times. Secondary (abuse) rate limit: retry once. These are conservative caps — the plugin handles the retryAfter delay automatically.
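The caps above can be read as a small pure decision function. The following is a minimal sketch for illustration only: `RateLimitKind`, `MAX_RETRIES`, `shouldRetry`, and `totalAttempts` are hypothetical names that do not appear in auth.ts, and the sketch assumes the plugin reports `retryCount` as 0 on the first rate-limit hit.

```typescript
// Hypothetical helper names; only the caps (2 and 1) come from the change itself.
type RateLimitKind = "primary" | "secondary";

// Maximum number of retries allowed per kind of rate limit.
const MAX_RETRIES: Record<RateLimitKind, number> = {
  primary: 2,
  secondary: 1,
};

// Mirrors `options.request.retryCount < N`: true means "retry after retryAfter".
function shouldRetry(kind: RateLimitKind, retryCount: number): boolean {
  return retryCount < MAX_RETRIES[kind];
}

// Counts how many times a request is sent in total if every attempt is rate limited.
// Assumes retryCount starts at 0 and increases by 1 per retry.
function totalAttempts(kind: RateLimitKind): number {
  let retryCount = 0;
  while (shouldRetry(kind, retryCount)) {
    retryCount += 1;
  }
  return 1 + retryCount; // the original request plus each retry
}
```

Under these assumptions a request that keeps hitting the primary limit is sent at most three times, and one that keeps hitting the secondary limit at most twice, before the error propagates.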
2. Adaptive retry with maxAttempts=10 for SSM
const SSM_CLIENT_CONFIG = {
  region: process.env.AWS_REGION,
  maxAttempts: 10,
  retryMode: 'adaptive' as const,
};
adaptive mode adds client-side rate-sensing via a token bucket — when the SDK sees ThrottlingException it slows further calls to match the observed budget. Combined with 10 attempts this gives ~30s of retry per call without hammering the API.
This is safe because runners take ~30-50s to boot before reading their JIT config from SSM, so longer retry on PutParameter is essentially free.
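To make the rate-sensing idea concrete, here is a simplified token-bucket sketch. It is not the AWS SDK's adaptive retry implementation: the class name `AdaptiveTokenBucket`, the halving rule, and the 0.5 requests/s floor are assumptions chosen for illustration. It only shows the general mechanism of slowing the send rate after a throttle response.

```typescript
// Illustrative only; the SDK's real adaptive algorithm differs in its details.
class AdaptiveTokenBucket {
  private fillRatePerSec: number;
  private readonly capacity: number;
  private readonly minRatePerSec: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(fillRatePerSec: number, capacity: number, nowMs: number, minRatePerSec = 0.5) {
    this.fillRatePerSec = fillRatePerSec;
    this.capacity = capacity;
    this.minRatePerSec = minRatePerSec; // assumed floor so the rate never reaches zero
    this.tokens = capacity;             // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Adds tokens for the time elapsed since the last refill, up to capacity.
  private refill(nowMs: number): void {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.fillRatePerSec);
    this.lastRefillMs = nowMs;
  }

  // Returns 0 if the caller may send now (consuming one token),
  // otherwise the number of milliseconds to wait before trying again.
  acquire(nowMs: number): number {
    this.refill(nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil(((1 - this.tokens) / this.fillRatePerSec) * 1000);
  }

  // Called when the service answers with a throttling error:
  // halve the fill rate (assumed policy) and drain the bucket.
  onThrottled(nowMs: number): void {
    this.refill(nowMs);
    this.fillRatePerSec = Math.max(this.minRatePerSec, this.fillRatePerSec / 2);
    this.tokens = 0;
  }

  get rate(): number {
    return this.fillRatePerSec;
  }
}
```

The clock is passed in explicitly so the behaviour is deterministic. A caller would invoke `acquire` before each PutParameter call, sleep for the returned duration if it is non-zero, and call `onThrottled` whenever a ThrottlingException is observed.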
Impact
| Scenario | Before | After |
|---|---|---|
| GitHub rate limit during scale-up | Request fails immediately, batch potentially dropped | Retries 1-2 times with backoff, succeeds if rate limit window passes |
| SSM throttle during JIT config write | ThrottlingException after ~3s, batch fails | Adaptive backoff for ~30s, succeeds once throttle clears |
| Burst of 100 jobs with batch_size=10 | High probability of SSM throttle → orphaned instances | Retry absorbs the throttle, instances get their configs |
Refs: #5024, #5037