
feat(termination-watcher): deregister runners from GitHub on EC2 termination #5055

Open
jensenbox wants to merge 16 commits into github-aws-runners:main from closient:deregister-runner-on-termination

Conversation

@jensenbox
Contributor

Summary

Extends the existing termination-watcher Lambda to deregister GitHub Actions runners from GitHub when their EC2 instances terminate. This prevents stale "offline" runner entries from accumulating in the organization/repository — a long-standing issue (#804, #1006, #2939) affecting all users of the module.

How it works

  1. When an EC2 instance terminates, the Lambda reads the ghr:Owner and ghr:Type tags from the instance
  2. Authenticates to GitHub using the module's existing App credentials (SSM parameters)
  3. Finds the runner by instance ID in the runner name, then calls the delete API
  4. Errors are logged but never fail the Lambda — metrics collection continues unaffected
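Step 3 above relies on the module's convention of embedding the EC2 instance ID in the runner name. A minimal sketch of that matching (identifiers are illustrative, not the PR's exact deregister.ts code):

```typescript
// Illustrative sketch only: runners are named like "<label>_<instance-id>"
// (e.g. "ubuntu-2404-x64_i-0c86dff9c4dfb59fc"), so a terminated instance's
// runner can be found by checking whether its name contains the instance ID.
interface RunnerInfo {
  id: number;
  name: string;
}

function findRunnerByInstanceId(
  runners: RunnerInfo[],
  instanceId: string,
): RunnerInfo | undefined {
  return runners.find((runner) => runner.name.includes(instanceId));
}
```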

What's included

Lambda changes:

  • deregister.ts — GitHub API deregistration logic reusing the module's existing auth pattern (createAppAuth → installation token)
  • Wired into both termination.ts (BidEvictedEvent) and termination-warning.ts (Spot Interruption Warning)
  • ConfigResolver.ts — adds enableRunnerDeregistration and ghesApiUrl config from env vars
  • 295-line test suite covering org/repo runners, not-found cases, disabled feature, and error handling

Terraform changes:

  • Passes GitHub App SSM parameter ARNs through the module chain to the termination-watcher
  • Adds SSM GetParameter IAM policy when deregistration is enabled
  • Adds PARAMETER_GITHUB_APP_ID_NAME, PARAMETER_GITHUB_APP_KEY_BASE64_NAME, ENABLE_RUNNER_DEREGISTRATION, and GHES_URL environment variables to both Lambda functions
  • Adds an EC2 Instance State-change Notification EventBridge rule (state: shutting-down) that catches all termination types — not just spot-specific events. This covers scale-down, manual termination, ASG termination, and spot reclamation.
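For reference, the EventBridge event pattern described in the last bullet has this standard shape (shown as a TypeScript constant for illustration; the PR defines the rule in Terraform):

```typescript
// Standard EC2 state-change notification pattern; mirrors what the
// Terraform rule described above would match. Illustrative, not the PR's HCL.
const shuttingDownPattern = {
  source: ['aws.ec2'],
  'detail-type': ['EC2 Instance State-change Notification'],
  detail: { state: ['shutting-down'] },
};
```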

New variables on instance_termination_watcher:

  • enable_runner_deregistration (bool, default false)

Design decisions

  • Opt-in: Disabled by default to avoid breaking existing deployments. Enable with enable_runner_deregistration = true.
  • Reuses existing auth pattern: Same @octokit/auth-app + SSM approach used by the control-plane Lambda.
  • Reuses existing Lambda: The state-change EventBridge rule targets the same notification Lambda rather than creating a new one, since both event types provide detail['instance-id'].
  • Graceful failure: All deregistration errors are caught and logged. If the runner is already removed, it logs and returns. The Lambda never fails due to deregistration issues.
  • Supports Org and Repo runners: Reads ghr:Type tag to determine the correct API endpoint.
  • GHES compatible: Passes through the ghes_url variable for GitHub Enterprise Server deployments.
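The org/repo endpoint selection driven by the ghr:Type tag can be sketched as follows (hypothetical helper; the GitHub REST delete endpoints are real, but the PR's actual code may structure this differently — note the log below shows the repo case, where owner is "owner/repo"):

```typescript
// Illustrative sketch: pick the GitHub REST delete endpoint based on the
// ghr:Type tag. For repo runners, `owner` carries "owner/repo".
type RunnerOwnerType = 'Org' | 'Repo';

function deleteRunnerPath(
  ownerType: RunnerOwnerType,
  owner: string,
  runnerId: number,
): string {
  return ownerType === 'Org'
    ? `/orgs/${owner}/actions/runners/${runnerId}`
    : `/repos/${owner}/actions/runners/${runnerId}`;
}
```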

Testing

  • 44 unit tests pass (7 test files), including the new deregister.test.ts
  • Tested in production: manually terminated a runner instance → Lambda triggered within seconds → runner successfully deregistered from GitHub org

Fixes #804

@jensenbox jensenbox requested review from a team as code owners March 6, 2026 07:48
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch 2 times, most recently from f731868 to a9ca792 on March 6, 2026 07:55
Contributor

@Brend-Smits Brend-Smits left a comment


Hey @jensenbox

This is a great addition, thanks a lot for your contribution.

After testing this together with @stuartp44, I ran into a problem when the termination watcher tried to deregister a runner. The error was as follows:

{
    "level": "ERROR",
    "message": "Failed to deregister runner from GitHub",
    "timestamp": "2026-03-06T10:07:02.489Z",
    "service": "spot-termination-notification",
    "sampling_rate": 0,
    "xray_trace_id": "1-69aaa741-3e6ab9e024a6cc5567e5f339",
    "region": "eu-west-1",
    "environment": "framework-dev",
    "module": "deregister",
    "aws-request-id": "87f61dc0-1c03-456a-9bf9-e5542558eac3",
    "function-name": "framework-dev-spot-termination-notification",
    "instanceId": "i-0c86dff9c4dfb59fc",
    "owner": "test-runners/multi-runner",
    "error": {
        "name": "HttpError",
        "location": "file:///var/task/index.js:95395",
        "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
        "stack": "HttpError: Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository\n    at fetchWrapper (file:///var/task/index.js:95395:11)\n    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)\n    at async Job.doExecute (file:///var/task/index.js:83521:18)",
        "status": 422,
        "request": {
            "method": "DELETE",
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "headers": {
                "accept": "application/vnd.github.v3+json",
                "user-agent": "github-aws-runners-termination-watcher octokit-rest.js/22.0.1 octokit-core.js/7.0.6 Node.js/24",
                "authorization": "token [REDACTED]"
            },
            "request": {}
        },
        "response": {
            "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
            "status": 422,
            "headers": {
                "access-control-allow-origin": "*",
                "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
                "content-length": "260",
                "content-security-policy": "default-src 'none'",
                "content-type": "application/json; charset=utf-8",
                "date": "Fri, 06 Mar 2026 10:07:02 GMT",
                "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
                "server": "github.com",
                "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
                "vary": "Accept-Encoding, Accept, X-Requested-With",
                "x-accepted-github-permissions": "administration=write",
                "x-content-type-options": "nosniff",
                "x-frame-options": "deny",
                "x-github-api-version-selected": "2022-11-28",
                "x-github-media-type": "github.v3; format=json",
                "x-github-request-id": "E8C2:1597F:2CF110:372C21:69AAA746",
                "x-ratelimit-limit": "15000",
                "x-ratelimit-remaining": "14994",
                "x-ratelimit-reset": "1772795053",
                "x-ratelimit-resource": "core",
                "x-ratelimit-used": "6",
                "x-xss-protection": "0"
            },
            "data": {
                "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted.",
                "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
                "status": "422"
            }
        }
    }
}

I would suggest adding some sort of retry mechanism with exponential backoff (which may be configurable).
On another note, I also see `Received spot notification for undefined` in the logs. Are you also seeing undefined in your logs?
The rest looks great; looking forward to testing this again 🚀

@jensenbox
Contributor Author

Hey @Brend-Smits, thanks for testing and the detailed report!

I've pushed a fix (653fd67) that addresses both issues:

1. Runner busy 422 — retry with exponential backoff

Added deleteRunnerWithRetry() that catches the specific 422 "currently running a job" error and retries up to 5 times with exponential backoff (1s → 2s → 4s → 8s → 16s). Non-422 errors are not retried and still fail gracefully. Each retry attempt is logged at WARN level so you can observe the behavior:

WARN: Runner is currently running a job, retrying after delay
  { instanceId, runnerId, runnerName, owner, attempt: 1, maxRetries: 5, delayMs: 1000 }
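The delay schedule described above (1s doubling to 16s over 5 attempts) corresponds to a simple exponential computation — a sketch, not the PR's exact deleteRunnerWithRetry() code:

```typescript
// Exponential backoff delay for attempt 1..maxRetries: 1s, 2s, 4s, 8s, 16s.
// Illustrative helper; the actual implementation may compute this inline.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** (attempt - 1);
}
```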

2. "Received spot notification for undefined"

Yes, we were seeing this too! This happens when metrics are disabled (ENABLE_METRICS_SPOT_WARNING=false / ENABLE_METRICS_SPOT_TERMINATION=false) — the metricName is passed as undefined and gets interpolated into the log string. Fixed so the log now reads "Received spot notification" when no metric name is set.
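The fix amounts to conditional interpolation along these lines (hypothetical helper name; the PR builds the string inline):

```typescript
// Sketch of the conditional log message: omit the metric suffix entirely
// when metrics are disabled, instead of interpolating undefined.
function spotNotificationMessage(metricName?: string): string {
  return metricName
    ? `Received spot notification for ${metricName}`
    : 'Received spot notification';
}
```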

We've been running this feature in our production environment (closient) and confirmed both issues in our CloudWatch logs. All 47 tests pass including 3 new tests for the retry logic.

Let us know how retesting goes!

jensenbox and others added 3 commits March 19, 2026 23:02
When EC2 instances running GitHub Actions runners terminate (spot
interruption, scale-down), the runner stays registered as "offline"
in GitHub. This extends the termination-watcher Lambda to deregister
runners via the GitHub API, catching all termination causes.

Lambda changes:
- New deregister.ts with GitHub App auth, runner lookup, and deletion
- ConfigResolver adds enableRunnerDeregistration and ghesApiUrl
- Both termination.ts and termination-warning.ts call deregister
- Dependencies: @octokit/auth-app, @octokit/rest, @aws-github-runner/aws-ssm-util

Terraform changes:
- termination-watcher module: new env vars, conditional SSM IAM policy
- multi-runner module: wire github_app_parameters through, add
  enable_runner_deregistration variable (defaults to true)

Feature-flagged via ENABLE_RUNNER_DEREGISTRATION env var (default false
at module level, true in multi-runner). Deregistration failures are
caught and logged without breaking existing metric functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root (single-runner) module also uses termination-watcher but wasn't
wiring github_app_parameters through. Add enable_runner_deregistration,
github_app_parameters, and ghes_url to the root module's termination
watcher config, matching the multi-runner changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include pre-built Lambda zip for use when referencing this fork branch
as a Terraform module source (no GitHub release available for the
download-lambda module to pull from).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 83eccbd to 03ce697 on March 20, 2026 06:07
The existing spot-specific rules (BidEvictedEvent, Spot Interruption Warning)
only fire on AWS spot reclamations. Scale-down terminations and manual
terminations — the most common causes of stale runners — were not covered.

Add an EC2 Instance State-change Notification rule (state: shutting-down) that
catches ALL termination types. Reuses the same notification Lambda since both
event types have detail['instance-id']. Gated behind enable_runner_deregistration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jensenbox jensenbox force-pushed the deregister-runner-on-termination branch from 03ce697 to db6a268 on March 20, 2026 06:15
jensenbox and others added 5 commits March 20, 2026 18:52
When a runner is terminated while executing a job (e.g., spot reclamation,
power disruption), the GitHub API returns 422 and refuses to delete it.
The runner stays registered as "offline" indefinitely, counting toward
the maximum runner limit and preventing new runners from launching.

Changes:
- scale-down.ts: Add reconcileGitHubRunners() that runs every scale-down
  cycle (every 5 minutes). Lists all GitHub runners, compares against
  live EC2 instances, and deregisters any offline runners whose instances
  no longer exist.
- deregister.ts: Improve 422 error handling — log as warning instead of
  error since the scale-down reconciliation will clean it up.

The reconciliation is controlled by OFFLINE_RUNNER_DEREGISTER_MINUTES
env var (defaults to 10). Set to 0 to disable.
Add @ts-ignore for createAppAuth calls where @octokit/request
and @octokit/types have incompatible retryCount types.
When GitHub returns 422 on runner deletion (runner executing a job),
instead of silently dropping the attempt, enqueue a retry message to
SQS with a 5-minute delay. By that time the EC2 instance has been
terminated and the runner appears offline, allowing clean deletion.

Changes:
- deregister.ts: send 422 failures to DEREGISTER_RETRY_QUEUE_URL SQS
  queue; add handleDeregisterRetry for processing retry messages
- lambda.ts: export deregisterRetry SQS handler
- package.json: add @aws-sdk/client-sqs dependency
- scale-down.ts: remove reconcileGitHubRunners polling (replaced by SQS)
- modules/multi-runner: add environment_variables to
  instance_termination_watcher variable and pass through to Lambda config
- modules/termination-watcher: merge caller-supplied environment_variables
  into notification and handler Lambda env var configs
Add Terraform resources to support the SQS-based deregistration retry
that was added in ed30bf8. When GitHub returns 422 (runner busy), the
termination-watcher Lambda now has infrastructure to queue a delayed
retry:

- SQS queue with 5-minute delivery delay for retry messages
- Dead-letter queue (14-day retention, 3 max receives) for failures
- Dedicated Lambda function (index.deregisterRetry handler)
- SQS event source mapping to trigger the retry Lambda
- IAM policies: SQS send/receive, SSM read, EC2 describe
- IAM policies on notification/termination Lambdas for SQS:SendMessage
- Pass DEREGISTER_RETRY_QUEUE_URL env var to all termination Lambdas
- Rebuild termination-watcher.zip with latest code

Co-Authored-By: Paperclip <noreply@paperclip.ing>
When metrics are disabled, metricName is undefined and gets interpolated
into the log string as literal "undefined". Use conditional interpolation
so the message reads "Received spot notification" when no metric is set.
@jensenbox
Contributor Author

@Brend-Smits — updated with 5 new commits that address both issues you reported, plus more:

Changes since your last review

1. 422 "Runner is busy" — SQS-based retry with DLQ (replaces the in-process retry I initially described)

Instead of retrying in-process with exponential backoff, we now queue a delayed retry via SQS:

  • On 422, the Lambda sends a message to an SQS queue with a 5-minute delay
  • A separate retry Lambda picks it up and attempts deregistration again
  • If still busy, it re-queues (up to 3 attempts via maxReceiveCount)
  • Failed retries land in a DLQ for monitoring

This is more robust than in-process retry because the original Lambda invocation completes quickly, and the retry survives Lambda timeouts.
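The enqueue step in the flow above can be sketched as follows (field names and the helper are hypothetical, not the PR's exact code; the SQS `DelaySeconds` parameter is real and caps at 900 seconds, so a 5-minute delay fits comfortably):

```typescript
// Illustrative sketch: build the SendMessage parameters for the delayed
// deregistration retry. The message carries enough context for the retry
// Lambda to re-attempt the delete on its own.
interface DeregisterRetryMessage {
  instanceId: string;
  owner: string;
  runnerType: 'Org' | 'Repo';
}

function buildRetrySendParams(
  queueUrl: string,
  message: DeregisterRetryMessage,
  delaySeconds = 300, // 5 minutes; SQS allows 0–900
) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(message),
    DelaySeconds: delaySeconds,
  };
}
```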

2. "Received spot notification for undefined" — fixed

The metricName parameter was undefined when metrics were disabled and got interpolated into the log string. Now uses conditional interpolation: Received spot notification (no metric) vs Received spot notification for SpotTermination (with metric).

3. Ghost runner reconciliation

Added handling in the Lambda entrypoint for EC2 Instance State-change events (shutting-down state) — the EventBridge rule we added catches all termination types, not just spot events. The handler extracts instance-id from the event detail and triggers deregistration.

4. @octokit type mismatch fix

Added @ts-ignore for a pre-existing type mismatch between @octokit/request and @octokit/auth-app versions that caused ncc build warnings.

Production validation

We've been running this exact code (pinned at 60fed701) in our production environment (closient) since early March. Results from CloudWatch logs (last 7 days):

| Metric | Count |
| --- | --- |
| Successful direct deregistrations | 5+ |
| 422 → SQS retry → successful deregistration | 1 (runner i-00b9b33032f13e4a3 on Mar 23) |
| 422 → SQS retry → runner already gone | 1 (runner i-002ed1df38283a34a on Mar 22, ephemeral) |
| DLQ messages (permanent failures) | 0 |
| Current offline/ghost runners | 0 (all 21 runners online) |

New Terraform resources (when enable_runner_deregistration = true)

  • aws_sqs_queue.deregister_retry — 5-minute delay, 3 max receives
  • aws_sqs_queue.deregister_retry_dlq — dead letter queue
  • aws_lambda_function.deregister_retry — processes retry messages
  • Associated IAM policies for SQS send/receive and SSM parameter access

All 44 tests pass (7 test files). Ready for re-review!

This repo uses yarn (yarn.lock), not npm. The package-lock.json was
generated during local development and contains a low-severity advisory
(GHSA-j965-2qgj-vjmq) that trips the dependency review check.
…5-c462-wpq7)

Resolves high/moderate severity ReDoS vulnerabilities flagged by
dependency review.
The octokit type mismatch only manifests with certain dependency
resolutions. CI resolves compatible types, so the directives are
flagged as unused.
Resolves moderate severity Stack Overflow vulnerability in yaml package.


Development

Successfully merging this pull request may close these issues.

Deregister Runner Application when Spot Interruption signal is received
