fix: Octokit throttle callbacks should retry, and SSM client should use adaptive retry #5135

@vegardx

Problem

Two resilience gaps that compound under burst load:

1. Octokit throttle plugin never retries

The @octokit/plugin-throttling callbacks onRateLimit and onSecondaryRateLimit in auth.ts only log a warning — they don't return true, so the plugin never retries the request. The plugin's contract is:

  • Return true → retry after the retryAfter delay
  • Return nothing / false → throw immediately

Current code:

onRateLimit: (retryAfter, options) => {
  logger.warn(`GitHub rate limit: Request quota exhausted...`);
  // implicitly returns undefined → plugin throws, request fails
},

As a result, any GitHub API call that hits a rate limit fails immediately and the error propagates up, potentially causing the entire SQS batch to be dropped (the problem fixed by #5129).

2. SSM client uses default retry (standard, 3 attempts)

Under burst (multiple concurrent Lambdas writing JIT configs via PutParameter), SSM's per-account rate limit (~40 TPS for PutParameter standard throughput) is easily exceeded. The default SDK retry:

  • retryMode: 'standard' — exponential backoff without rate-sensing
  • maxAttempts: 3 — gives ~3 seconds of retry budget

This is insufficient. After 3 attempts the SDK throws ThrottlingException, which propagates up and (pre-#5129) silently drops the entire SQS batch.
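To make the retry budget concrete, here is a rough sketch of the worst-case time spent waiting between attempts. It assumes a capped exponential backoff with full jitter, where the delay before retry i is a random value of at most min(cap, base * 2^i). The helper name and the base and cap values are illustrative assumptions, not values taken from this repository, and the SDK's actual schedule may differ.

```typescript
// Upper bound on total backoff time for a capped exponential schedule.
// Hypothetical helper: the schedule shape, baseMs and capMs are assumptions.
function maxBackoffBudgetMs(maxAttempts: number, baseMs: number, capMs: number): number {
  let total = 0;
  // maxAttempts includes the initial call, so there are maxAttempts - 1 retries.
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += Math.min(capMs, baseMs * 2 ** retry);
  }
  return total;
}

// Assumed 500 ms base delay for throttling errors and a 20 s cap per delay.
const threeAttempts = maxBackoffBudgetMs(3, 500, 20000); // 1000 + 2000 = 3000 ms
const tenAttempts = maxBackoffBudgetMs(10, 500, 20000);
```

Under these assumptions the bound grows quickly with maxAttempts until the cap is reached, after which each extra attempt adds at most capMs. Because of jitter, the time actually spent waiting is typically well below the bound.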

Fix

1. Return true from throttle callbacks (with retry caps)

onRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 2;
},
onSecondaryRateLimit: (retryAfter, options) => {
  logger.warn(`...retrying after ${retryAfter}s`);
  return options.request.retryCount < 1;
},

Primary rate limit: retry up to 2 times. Secondary (abuse) rate limit: retry once. These are conservative caps — the plugin handles the retryAfter delay automatically.
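One way to keep the two caps from drifting apart is to build both callbacks from a single factory. The sketch below is a minimal illustration, not the code in auth.ts: makeThrottleHandler, the ThrottleOptions interface, and the log messages are hypothetical, and the interface lists only the fields the callbacks read.

```typescript
// Hypothetical minimal shape of the options object passed to the callbacks.
interface ThrottleOptions {
  method: string;
  url: string;
  request: { retryCount: number };
}

// Builds a throttle callback that retries until maxRetries is reached.
// Returning true asks the plugin to retry; returning false lets the error surface.
function makeThrottleHandler(label: string, maxRetries: number, log: (msg: string) => void) {
  return (retryAfter: number, options: ThrottleOptions): boolean => {
    const willRetry = options.request.retryCount < maxRetries;
    const outcome = willRetry
      ? `retrying after ${retryAfter}s`
      : `giving up after ${options.request.retryCount} retries`;
    log(`${label}: ${options.method} ${options.url}, ${outcome}`);
    return willRetry;
  };
}

// Collect messages in an array here; real code would pass logger.warn instead.
const messages: string[] = [];
const onRateLimit = makeThrottleHandler("GitHub rate limit", 2, (m) => messages.push(m));
const onSecondaryRateLimit = makeThrottleHandler("GitHub secondary rate limit", 1, (m) => messages.push(m));
```

The two handlers can then be passed as the throttle options when constructing the Octokit instance, with the caps stated in exactly one place each.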

2. Adaptive retry with maxAttempts=10 for SSM

const SSM_CLIENT_CONFIG = {
  region: process.env.AWS_REGION,
  maxAttempts: 10,
  retryMode: 'adaptive' as const,
};

The adaptive retry mode adds client-side rate sensing via a token bucket: when the SDK sees a ThrottlingException, it slows subsequent calls to match the observed budget. Combined with 10 attempts, this gives ~30s of retry per call without hammering the API.

This is safe because runners take ~30-50s to boot before reading their JIT config from SSM, so longer retry on PutParameter is essentially free.
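The rate-sensing idea behind adaptive mode can be illustrated with a toy token bucket that halves its send rate whenever the service reports throttling. This is a simplified sketch and not the AWS SDK's implementation: the class name, the halving rule, and the explicit clock parameter are all assumptions made to keep the example small and deterministic.

```typescript
// Simplified client-side limiter; NOT the AWS SDK's implementation.
class ThrottleAwareBucket {
  private tokens: number;
  private ratePerSec: number;
  private readonly capacity: number;
  private lastMs: number;

  constructor(ratePerSec: number, capacity: number, nowMs: number) {
    this.ratePerSec = ratePerSec;
    this.capacity = capacity;
    this.tokens = capacity;
    this.lastMs = nowMs;
  }

  // Returns 0 if a call may be sent now (and consumes a token),
  // otherwise the number of milliseconds to wait before asking again.
  acquire(nowMs: number): number {
    const elapsedSec = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil(((1 - this.tokens) / this.ratePerSec) * 1000);
  }

  // Called when the service answers with a throttling error: halve the send rate.
  onThrottle(): void {
    this.ratePerSec = this.ratePerSec / 2;
  }
}

const bucket = new ThrottleAwareBucket(2, 1, 0); // 2 calls/s, burst of 1, clock starts at 0 ms
const first = bucket.acquire(0);  // a token is available, send immediately
const second = bucket.acquire(0); // bucket is empty, caller must wait
bucket.onThrottle();              // service pushed back, rate drops to 1 call/s
const third = bucket.acquire(0);  // the required wait is now longer
```

A caller would sleep for the returned delay and call acquire again before sending. A fuller limiter would also raise the rate gradually after successful calls, which this sketch leaves out.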

Impact

| Scenario | Before | After |
| --- | --- | --- |
| GitHub rate limit during scale-up | Request fails immediately, batch potentially dropped | Retries 1-2 times with backoff, succeeds if rate limit window passes |
| SSM throttle during JIT config write | ThrottlingException after ~3s, batch fails | Adaptive backoff for ~30s, succeeds once throttle clears |
| Burst of 100 jobs with batch_size=10 | High probability of SSM throttle → orphaned instances | Retry absorbs the throttle, instances get their configs |

Refs: #5024, #5037
