feat(termination-watcher): deregister runners from GitHub on EC2 termination #5055
jensenbox wants to merge 16 commits into github-aws-runners:main
Force-pushed from f731868 to a9ca792
Brend-Smits left a comment
Hey @jensenbox
This is a great addition, thanks a lot for your contribution.
After testing this together with @stuartp44, I ran into a problem when the termination watcher tried to deregister a runner. The error was as follows:
```json
{
  "level": "ERROR",
  "message": "Failed to deregister runner from GitHub",
  "timestamp": "2026-03-06T10:07:02.489Z",
  "service": "spot-termination-notification",
  "sampling_rate": 0,
  "xray_trace_id": "1-69aaa741-3e6ab9e024a6cc5567e5f339",
  "region": "eu-west-1",
  "environment": "framework-dev",
  "module": "deregister",
  "aws-request-id": "87f61dc0-1c03-456a-9bf9-e5542558eac3",
  "function-name": "framework-dev-spot-termination-notification",
  "instanceId": "i-0c86dff9c4dfb59fc",
  "owner": "test-runners/multi-runner",
  "error": {
    "name": "HttpError",
    "location": "file:///var/task/index.js:95395",
    "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
    "stack": "HttpError: Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted. - https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository\n at fetchWrapper (file:///var/task/index.js:95395:11)\n at process.processTicksAndRejections (node:internal/process/task_queues:103:5)\n at async Job.doExecute (file:///var/task/index.js:83521:18)",
    "status": 422,
    "request": {
      "method": "DELETE",
      "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
      "headers": {
        "accept": "application/vnd.github.v3+json",
        "user-agent": "github-aws-runners-termination-watcher octokit-rest.js/22.0.1 octokit-core.js/7.0.6 Node.js/24",
        "authorization": "token [REDACTED]"
      },
      "request": {}
    },
    "response": {
      "url": "https://api.github.com/repos/test-runners/multi-runner/actions/runners/50",
      "status": 422,
      "headers": {
        "access-control-allow-origin": "*",
        "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
        "content-length": "260",
        "content-security-policy": "default-src 'none'",
        "content-type": "application/json; charset=utf-8",
        "date": "Fri, 06 Mar 2026 10:07:02 GMT",
        "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
        "server": "github.com",
        "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
        "vary": "Accept-Encoding, Accept, X-Requested-With",
        "x-accepted-github-permissions": "administration=write",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-github-api-version-selected": "2022-11-28",
        "x-github-media-type": "github.v3; format=json",
        "x-github-request-id": "E8C2:1597F:2CF110:372C21:69AAA746",
        "x-ratelimit-limit": "15000",
        "x-ratelimit-remaining": "14994",
        "x-ratelimit-reset": "1772795053",
        "x-ratelimit-resource": "core",
        "x-ratelimit-used": "6",
        "x-xss-protection": "0"
      },
      "data": {
        "message": "Bad request - Runner ubuntu-2404-x64_i-0c86dff9c4dfb59fc is currently running a job and cannot be deleted.",
        "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#delete-a-self-hosted-runner-from-a-repository",
        "status": "422"
      }
    }
  }
}
```
I would suggest adding some sort of retry mechanism with exponential backoff (which may be configurable).
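A retry along those lines could be sketched like this (illustrative only; `withBackoff`, `backoffDelayMs`, and their parameters are my names, not code from this PR):

```typescript
// Illustrative sketch, not the PR's code: retry a deregistration call,
// backing off exponentially, but only on 422 "runner busy" errors.
const BASE_DELAY_MS = 1000;

// Delay before retrying attempt `attempt` (0-based), doubling each time.
export function backoffDelayMs(attempt: number, baseMs = BASE_DELAY_MS): number {
  return baseMs * 2 ** attempt;
}

export async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = BASE_DELAY_MS,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (e) {
      lastError = e;
      // Rethrow anything that is not the "runner busy" 422 immediately.
      if ((e as { status?: number }).status !== 422) throw e;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
    }
  }
  throw lastError;
}
```

Making `maxAttempts` and `baseMs` parameters is one way to keep the mechanism configurable, as suggested.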
On another note, I also see `Received spot notification for undefined` in the logs. Are you also seeing `undefined` in your logs?
The rest looks great, looking forward to testing this again 🚀
Hey @Brend-Smits, thanks for testing and the detailed report! I've pushed a fix (653fd67) that addresses both issues:

1. Runner busy 422 — added a retry with exponential backoff.
2. "Received spot notification for undefined" — yes, we were seeing this too! It happens when metrics are disabled (`metricName` is undefined).

We've been running this feature in our production environment (closient) and confirmed both issues in our CloudWatch logs. All 47 tests pass, including 3 new tests for the retry logic. Let us know how retesting goes!
Force-pushed from bec1fc0 to 83eccbd
When EC2 instances running GitHub Actions runners terminate (spot interruption, scale-down), the runner stays registered as "offline" in GitHub. This extends the termination-watcher Lambda to deregister runners via the GitHub API, catching all termination causes.

Lambda changes:
- New deregister.ts with GitHub App auth, runner lookup, and deletion
- ConfigResolver adds enableRunnerDeregistration and ghesApiUrl
- Both termination.ts and termination-warning.ts call deregister
- Dependencies: @octokit/auth-app, @octokit/rest, @aws-github-runner/aws-ssm-util

Terraform changes:
- termination-watcher module: new env vars, conditional SSM IAM policy
- multi-runner module: wire github_app_parameters through, add enable_runner_deregistration variable (defaults to true)

Feature-flagged via ENABLE_RUNNER_DEREGISTRATION env var (default false at module level, true in multi-runner). Deregistration failures are caught and logged without breaking existing metric functionality.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
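As a rough sketch of how the runner lookup can pick the right GitHub API endpoint from the instance's `ghr:Type` tag (the helper name and the `'Repo'`/`'Org'` tag values are assumptions, not the PR's exact code; the repo path shape matches the DELETE URL in the error log above):

```typescript
// Sketch: choose the delete endpoint for a runner based on its ghr:Type tag.
// For type "Repo" the owner tag holds "org/repo"; otherwise it holds the org name.
export function deleteRunnerPath(owner: string, runnerType: string, runnerId: number): string {
  return runnerType === 'Repo'
    ? `/repos/${owner}/actions/runners/${runnerId}`
    : `/orgs/${owner}/actions/runners/${runnerId}`;
}
```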
The root (single-runner) module also uses termination-watcher but wasn't wiring github_app_parameters through. Add enable_runner_deregistration, github_app_parameters, and ghes_url to the root module's termination watcher config, matching the multi-runner changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include pre-built Lambda zip for use when referencing this fork branch as a Terraform module source (no GitHub release available for the download-lambda module to pull from).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 83eccbd to 03ce697
The existing spot-specific rules (BidEvictedEvent, Spot Interruption Warning) only fire on AWS spot reclamations. Scale-down terminations and manual terminations — the most common causes of stale runners — were not covered.

Add an EC2 Instance State-change Notification rule (state: shutting-down) that catches ALL termination types. Reuses the same notification Lambda since both event types have detail['instance-id']. Gated behind enable_runner_deregistration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
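A minimal sketch of why one handler can serve both event shapes (simplified event types; not the PR's exact code):

```typescript
// Both spot interruption warnings and instance state-change notifications
// carry detail['instance-id'], so a single handler can extract it.
interface Ec2Event {
  'detail-type': string;
  detail: { 'instance-id': string; state?: string };
}

export function instanceIdFromEvent(event: Ec2Event): string {
  return event.detail['instance-id'];
}

// Treat spot interruption warnings and shutting-down state changes as terminations.
export function isTerminationEvent(event: Ec2Event): boolean {
  return (
    event['detail-type'] === 'EC2 Spot Instance Interruption Warning' ||
    (event['detail-type'] === 'EC2 Instance State-change Notification' &&
      event.detail.state === 'shutting-down')
  );
}
```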
Force-pushed from 03ce697 to db6a268
When a runner is terminated while executing a job (e.g., spot reclamation, power disruption), the GitHub API returns 422 and refuses to delete it. The runner stays registered as "offline" indefinitely, counting toward the maximum runner limit and preventing new runners from launching.

Changes:
- scale-down.ts: Add reconcileGitHubRunners() that runs every scale-down cycle (every 5 minutes). Lists all GitHub runners, compares against live EC2 instances, and deregisters any offline runners whose instances no longer exist.
- deregister.ts: Improve 422 error handling — log as warning instead of error since the scale-down reconciliation will clean it up.

The reconciliation is controlled by the OFFLINE_RUNNER_DEREGISTER_MINUTES env var (defaults to 10). Set to 0 to disable.
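The core of that reconciliation can be sketched as a pure function (illustrative names; it assumes runner names end in the EC2 instance id, as seen in the error log above, e.g. `ubuntu-2404-x64_i-0c86dff9c4dfb59fc`):

```typescript
// Sketch of the reconcile logic, not the PR's exact code.
interface GitHubRunner {
  id: number;
  name: string;   // assumed to embed the instance id, e.g. "ubuntu-2404-x64_i-0abc..."
  status: string; // "online" | "offline"
}

// Extract the EC2 instance id suffix from a runner name, if present.
export function instanceIdFromRunnerName(name: string): string | undefined {
  const match = name.match(/(i-[0-9a-f]+)$/);
  return match?.[1];
}

// Pick offline runners whose backing instance no longer exists.
export function runnersToDeregister(
  runners: GitHubRunner[],
  liveInstanceIds: Set<string>,
): GitHubRunner[] {
  return runners.filter((r) => {
    if (r.status !== 'offline') return false;
    const instanceId = instanceIdFromRunnerName(r.name);
    return instanceId !== undefined && !liveInstanceIds.has(instanceId);
  });
}
```

Keeping the selection pure makes it easy to unit-test independently of the GitHub and EC2 API calls.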
Add @ts-ignore for createAppAuth calls where @octokit/request and @octokit/types have incompatible retryCount types.
When GitHub returns 422 on runner deletion (runner executing a job), instead of silently dropping the attempt, enqueue a retry message to SQS with a 5-minute delay. By that time the EC2 instance has been terminated and the runner appears offline, allowing clean deletion.

Changes:
- deregister.ts: send 422 failures to the DEREGISTER_RETRY_QUEUE_URL SQS queue; add handleDeregisterRetry for processing retry messages
- lambda.ts: export deregisterRetry SQS handler
- package.json: add @aws-sdk/client-sqs dependency
- scale-down.ts: remove reconcileGitHubRunners polling (replaced by SQS)
- modules/multi-runner: add environment_variables to instance_termination_watcher variable and pass through to Lambda config
- modules/termination-watcher: merge caller-supplied environment_variables into notification and handler Lambda env var configs
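The delayed retry can be sketched as follows (the message shape and per-message `DelaySeconds` are assumptions; the PR's Terraform instead configures the 5-minute delay on the queue itself):

```typescript
// Sketch of enqueuing a delayed deregistration retry after a 422 failure.
// Field names here are illustrative, not the PR's exact message schema.
export interface DeregisterRetryMessage {
  instanceId: string;
  owner: string;
  runnerId: number;
  attempt: number;
}

const RETRY_DELAY_SECONDS = 300; // 5 minutes: the instance should be gone by then

// Build the input for an SQS SendMessage call (e.g. @aws-sdk/client-sqs).
export function buildRetrySend(queueUrl: string, msg: DeregisterRetryMessage) {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(msg),
    DelaySeconds: RETRY_DELAY_SECONDS, // SQS allows per-message delays up to 900s
  };
}
```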
Add Terraform resources to support the SQS-based deregistration retry that was added in ed30bf8. When GitHub returns 422 (runner busy), the termination-watcher Lambda now has infrastructure to queue a delayed retry:

- SQS queue with 5-minute delivery delay for retry messages
- Dead-letter queue (14-day retention, 3 max receives) for failures
- Dedicated Lambda function (index.deregisterRetry handler)
- SQS event source mapping to trigger the retry Lambda
- IAM policies: SQS send/receive, SSM read, EC2 describe
- IAM policies on notification/termination Lambdas for SQS:SendMessage
- Pass DEREGISTER_RETRY_QUEUE_URL env var to all termination Lambdas
- Rebuild termination-watcher.zip with latest code

Co-Authored-By: Paperclip <noreply@paperclip.ing>
When metrics are disabled, metricName is undefined and gets interpolated into the log string as literal "undefined". Use conditional interpolation so the message reads "Received spot notification" when no metric is set.
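A minimal sketch of that conditional interpolation (the function name is illustrative):

```typescript
// Only interpolate the metric name into the log message when one is set,
// so a disabled-metrics config no longer logs a literal "undefined".
export function spotNotificationMessage(metricName?: string): string {
  return metricName
    ? `Received spot notification for ${metricName}`
    : 'Received spot notification';
}
```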
|
@Brend-Smits — updated with 5 new commits that address both issues you reported, plus more.

Changes since your last review

1. 422 "Runner is busy" — SQS-based retry with DLQ (replaces the in-process retry I initially described). Instead of retrying in-process with exponential backoff, we now queue a delayed retry via SQS. This is more robust than in-process retry because the original Lambda invocation completes quickly, and the retry survives Lambda timeouts.
2. "Received spot notification for undefined" — fixed. The message now uses conditional interpolation so the metric name only appears when a metric is set.
3. Ghost runner reconciliation — added handling in the Lambda entrypoint for EC2 Instance State-change events (state: `shutting-down`).
4. @octokit type mismatch fix — added `@ts-ignore` for the `createAppAuth` calls with incompatible `retryCount` types.

Production validation

We've been running this exact code (pinned at …) in our production environment (closient).
New Terraform resources (when `enable_runner_deregistration` is enabled): the SQS retry queue with its 5-minute delivery delay, the dead-letter queue, the dedicated retry Lambda with its SQS event source mapping, and the associated IAM policies.
This repo uses yarn (yarn.lock), not npm. The package-lock.json was generated during local development and contains a low-severity advisory (GHSA-j965-2qgj-vjmq) that trips the dependency review check.
…5-c462-wpq7) Resolves high/moderate severity ReDoS vulnerabilities flagged by dependency review.
The octokit type mismatch only manifests with certain dependency resolutions. CI resolves compatible types, so the directives are flagged as unused.
Resolves moderate severity Stack Overflow vulnerability in yaml package.
Summary
Extends the existing termination-watcher Lambda to deregister GitHub Actions runners from GitHub when their EC2 instances terminate. This prevents stale "offline" runner entries from accumulating in the organization/repository — a long-standing issue (#804, #1006, #2939) affecting all users of the module.
How it works
- Reads the `ghr:Owner` and `ghr:Type` tags from the instance

What's included
Lambda changes:
- `deregister.ts` — GitHub API deregistration logic reusing the module's existing auth pattern (`createAppAuth` → installation token)
- Called from both `termination.ts` (BidEvictedEvent) and `termination-warning.ts` (Spot Interruption Warning)
- `ConfigResolver.ts` — adds `enableRunnerDeregistration` and `ghesApiUrl` config from env vars

Terraform changes:
- Conditional `GetParameter` IAM policy when deregistration is enabled
- Passes the `PARAMETER_GITHUB_APP_ID_NAME`, `PARAMETER_GITHUB_APP_KEY_BASE64_NAME`, `ENABLE_RUNNER_DEREGISTRATION`, and `GHES_URL` environment variables to both Lambda functions
- New `EC2 Instance State-change Notification` EventBridge rule (state: `shutting-down`) that catches all termination types — not just spot-specific events. This covers scale-down, manual termination, ASG termination, and spot reclamation.
- New variables on `instance_termination_watcher`: `enable_runner_deregistration` (bool, default `false`)

Design decisions
- Opt-in: deregistration is gated behind `enable_runner_deregistration = true`.
- Auth reuses the `@octokit/auth-app` + SSM approach used by the control-plane Lambda.
- The same notification Lambda handles both event types, since both carry `detail['instance-id']`.
- Uses the `ghr:Type` tag to determine the correct API endpoint.
- Adds a `ghes_url` variable for GitHub Enterprise Server deployments.

Testing
- Unit tests in `deregister.test.ts`

Fixes #804