fix: Octokit throttle callbacks should retry, and SSM client should use adaptive retry

## Problem

Two resilience gaps that compound under burst load:

### 1. Octokit throttle plugin never retries

The `@octokit/plugin-throttling` callbacks `onRateLimit` and `onSecondaryRateLimit` in `auth.ts` only log a warning — they don't return `true`, so the plugin never retries the request. The plugin's contract is:

- Return `true` → retry after the `retryAfter` delay
- Return nothing / `false` → throw immediately

Current code:
```typescript
onRateLimit: (retryAfter, options) => {
  logger.warn(`GitHub rate limit: Request quota exhausted...`);
  // implicitly returns undefined → plugin throws, request fails
},
```

As a result, any GitHub API call that hits a rate limit fails immediately and the error propagates up, potentially causing the entire SQS batch to be dropped (the problem fixed by #5129).

### 2. SSM client uses default retry (standard, 3 attempts)

Under burst (multiple concurrent Lambdas writing JIT configs via `PutParameter`), SSM's per-account rate limit (~40 TPS for PutParameter standard throughput) is easily exceeded. The default SDK retry:

- `retryMode: 'standard'` — exponential backoff without rate-sensing
- `maxAttempts: 3` — gives ~3 seconds of retry budget

This is insufficient. After 3 attempts the SDK throws `ThrottlingException`, which propagates up and (pre-#5129) silently drops the entire SQS batch.
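As a rough check on the "~3 seconds" figure, the worst-case wait of a capped exponential backoff can be summed by hand. The sketch below is only an illustration: `maxBackoffBudgetMs` is a hypothetical helper, and the 500 ms base delay for throttling errors and the 20 s cap are assumed values that should be verified against the SDK version in use.

```typescript
// Upper bound on the total wait across all retries for capped exponential
// backoff with full jitter: retry n waits at most min(cap, base * 2^n).
function maxBackoffBudgetMs(
  maxAttempts: number,
  baseDelayMs: number,
  capMs: number,
): number {
  let total = 0;
  // maxAttempts counts the initial call, so there are maxAttempts - 1 retries.
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += Math.min(capMs, baseDelayMs * Math.pow(2, retry));
  }
  return total;
}

// Assumed constants (not taken from this issue): 500 ms base, 20 000 ms cap.
const standardBudgetMs = maxBackoffBudgetMs(3, 500, 20000); // 1000 + 2000
```

Because of jitter, the actual wait is usually shorter than this bound, which is why the budget is so easy to exhaust while SSM is still throttling.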

## Fix

### 1. Return `true` from throttle callbacks (with retry caps)

```typescript
onRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 2;
},
onSecondaryRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 1;
},
```

Primary rate limit: retry up to 2 times. Secondary (abuse) rate limit: retry once. These are conservative caps — the plugin handles the `retryAfter` delay automatically.
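To make the effect of the caps concrete, here is a small self-contained simulation of the callback contract. It does not import `@octokit/plugin-throttling`: `ThrottledRequestOptions`, `LimitHandler`, and `countRetriesAllowed` are hypothetical stand-ins, and the driver assumes every attempt is rate limited and that `retryCount` increases by one per retry.

```typescript
// Simplified stand-ins for the plugin's callback arguments (hypothetical
// types for illustration, not the library's own definitions).
interface ThrottledRequestOptions {
  method: string;
  url: string;
  request: { retryCount: number };
}

type LimitHandler = (
  retryAfter: number,
  options: ThrottledRequestOptions,
) => boolean | void;

// Retry caps from the fix: two retries for the primary limit, one for the
// secondary limit.
const onRateLimit: LimitHandler = (_retryAfter, options) =>
  options.request.retryCount < 2;

const onSecondaryRateLimit: LimitHandler = (_retryAfter, options) =>
  options.request.retryCount < 1;

// The pre-fix behavior: log only, implicitly return undefined.
const logOnlyHandler: LimitHandler = () => {
  // logger.warn(...) would go here; nothing is returned
};

// Toy driver: treats every attempt as rate limited and counts how many
// retries the handler allows before the request would be failed.
function countRetriesAllowed(handler: LimitHandler, retryAfter = 60): number {
  const options: ThrottledRequestOptions = {
    method: 'GET',
    url: '/example', // hypothetical endpoint
    request: { retryCount: 0 },
  };
  while (handler(retryAfter, options) === true) {
    options.request.retryCount += 1;
  }
  return options.request.retryCount;
}
```

Under these assumptions, `countRetriesAllowed(onRateLimit)` yields 2, `countRetriesAllowed(onSecondaryRateLimit)` yields 1, and `countRetriesAllowed(logOnlyHandler)` yields 0, which is the "never retries" behavior described in the Problem section.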

### 2. Adaptive retry with maxAttempts=10 for SSM

```typescript
const SSM_CLIENT_CONFIG = {
  region: process.env.AWS_REGION,
  maxAttempts: 10,
  retryMode: 'adaptive' as const,
};
```

`adaptive` mode adds client-side rate-sensing via a token bucket: when the SDK sees a `ThrottlingException`, it slows subsequent calls to match the observed budget. Combined with 10 attempts this gives ~30s of retry per call without hammering the API.

This is safe because runners take ~30-50s to boot before reading their JIT config from SSM, so longer retry on PutParameter is essentially free.
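The idea of client-side rate-sensing can be illustrated with a deliberately simplified token bucket. This is a sketch of the general technique, not the AWS SDK's actual algorithm: the class name `AdaptiveTokenBucket`, the halving on throttle, the 0.5 requests/s floor, and the recovery step are all assumptions chosen for illustration.

```typescript
// Simplified client-side rate limiter (illustrative only).
class AdaptiveTokenBucket {
  private tokens: number;
  private lastRefillMs = 0;

  constructor(
    private ratePerSec: number,
    private readonly capacity: number,
  ) {
    this.tokens = capacity; // start with a full bucket
  }

  // Returns 0 and consumes a token if a request may be sent now; otherwise
  // returns how many ms to wait before calling acquire() again.
  acquire(nowMs: number): number {
    this.refill(nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    const deficit = 1 - this.tokens;
    return Math.ceil((deficit / this.ratePerSec) * 1000);
  }

  // A throttling response was observed: halve the send rate (with a floor).
  onThrottle(): void {
    this.ratePerSec = Math.max(0.5, this.ratePerSec / 2);
  }

  // A request succeeded: recover the send rate gradually.
  onSuccess(): void {
    this.ratePerSec += 0.5;
  }

  currentRate(): number {
    return this.ratePerSec;
  }

  // Add tokens for the time elapsed since the last refill, up to capacity.
  private refill(nowMs: number): void {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.ratePerSec,
    );
    this.lastRefillMs = nowMs;
  }
}
```

In this sketch a caller would invoke `acquire()` before each `PutParameter`, sleep for the returned number of milliseconds if it is non-zero, and report the outcome through `onThrottle()` or `onSuccess()`. The point is that each throttle lowers the rate for later calls from the same client, rather than only delaying the failed call.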

## Impact

| Scenario | Before | After |
|---|---|---|
| GitHub rate limit during scale-up | Request fails immediately, batch potentially dropped | Retries up to 2 times (primary) or once (secondary) after the `retryAfter` delay, succeeds if the rate limit window has passed |
| SSM throttle during JIT config write | ThrottlingException after ~3s, batch fails | Adaptive backoff for ~30s, succeeds once throttle clears |
| Burst of 100 jobs with batch_size=10 | High probability of SSM throttle → orphaned instances | Retry absorbs the throttle, instances get their configs |

Refs: #5024, #5037
