EC2CreateInstanceOperator could leave EC2 instances running when failing #60904

SameerMesiah97 · 2026-01-21T21:28:44Z

Description

Added best-effort cleanup to EC2CreateInstanceOperator to ensure EC2 instances are terminated when failures occur after successful instance creation.

Previously, the operator could create an EC2 instance via RunInstances and then fail during a post-creation step (for example, when wait_for_completion=True and the waiter fails because the DescribeInstances permission is missing). In these cases, the task failed but the EC2 instance was left running.

The operator now attempts to terminate any created instances if an exception is raised after instance creation. Cleanup is performed opportunistically and does not mask or replace the original exception if termination fails.
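The cleanup behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: `create_instances_with_cleanup`, `FakeWaiter`, and `FakeEC2Client` are hypothetical names, the instance ID is made up, and the fake client only mimics the boto3 EC2 client methods `run_instances`, `get_waiter`, and `terminate_instances` so the example runs without AWS access.

```python
import logging

log = logging.getLogger(__name__)


def create_instances_with_cleanup(client, run_kwargs, wait_for_completion=True):
    """Create instances, then terminate them if a post-creation step raises.

    `client` is assumed to behave like a boto3 EC2 client (hypothetical helper).
    """
    response = client.run_instances(**run_kwargs)
    instance_ids = [inst["InstanceId"] for inst in response["Instances"]]
    try:
        if wait_for_completion:
            client.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    except Exception:
        # Best-effort cleanup: try to terminate, but never hide the original error.
        try:
            client.terminate_instances(InstanceIds=instance_ids)
        except Exception:
            log.exception("Failed to terminate instances %s during cleanup", instance_ids)
        raise  # re-raises the original post-creation exception
    return instance_ids


class FakeWaiter:
    """Simulates a waiter that fails, e.g. due to missing DescribeInstances."""

    def wait(self, **kwargs):
        raise PermissionError("not authorized to perform ec2:DescribeInstances")


class FakeEC2Client:
    """In-memory stand-in for an EC2 client, for illustration only."""

    def __init__(self, terminate_fails=False):
        self.terminated = []
        self.terminate_fails = terminate_fails

    def run_instances(self, **kwargs):
        return {"Instances": [{"InstanceId": "i-0123456789abcdef0"}]}

    def get_waiter(self, name):
        return FakeWaiter()

    def terminate_instances(self, InstanceIds):
        if self.terminate_fails:
            raise RuntimeError("terminate failed")
        self.terminated.extend(InstanceIds)
        return {}
```

With `FakeEC2Client()`, the call raises `PermissionError` and the created instance ID ends up in `client.terminated`. With `FakeEC2Client(terminate_fails=True)`, the same `PermissionError` still propagates and the termination failure is only logged.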

Rationale

EC2CreateInstanceOperator manages the lifecycle of an external infrastructure resource. If instance creation succeeds but subsequent steps fail, leaving the instance running is both surprising and potentially costly.

Failures after instance creation can occur for multiple reasons, including IAM permission errors (for example, missing DescribeInstances) as well as loss of access to observability or metadata systems used during task execution. In all of these cases, the operator has failed to complete successfully from Airflow’s perspective, and execution state may no longer be reliable.

Attempting best-effort cleanup in these scenarios avoids leaving unmanaged EC2 instances running when the task itself has failed, and aligns with the behavior of other Airflow operators that manage external resources. Cleanup failures are logged and do not mask the original exception, preserving existing failure semantics while improving safety.

Tests

Added a unit test verifying that EC2 instances are terminated when a failure occurs after instance creation (simulated via a waiter error).
Added a unit test ensuring that failures during cleanup do not mask or override the original exception raised by the operator.
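The two test cases can be approximated with `unittest.mock`, as in the sketch below. It does not reproduce the tests added in this PR: `create_then_wait` is a hypothetical stand-in for the operator logic, `make_client` is an illustrative helper, and the instance and image IDs are invented.

```python
from unittest import mock


def create_then_wait(client, **run_kwargs):
    # Hypothetical stand-in for the operator logic under test.
    ids = [i["InstanceId"] for i in client.run_instances(**run_kwargs)["Instances"]]
    try:
        client.get_waiter("instance_running").wait(InstanceIds=ids)
    except Exception:
        try:
            client.terminate_instances(InstanceIds=ids)
        except Exception:
            pass  # a real implementation would log this; omitted for brevity
        raise
    return ids


def make_client(waiter_error, terminate_error=None):
    # MagicMock client whose waiter always fails; termination optionally fails too.
    client = mock.MagicMock()
    client.run_instances.return_value = {"Instances": [{"InstanceId": "i-abc"}]}
    client.get_waiter.return_value.wait.side_effect = waiter_error
    client.terminate_instances.side_effect = terminate_error
    return client


def test_terminates_on_waiter_failure():
    client = make_client(RuntimeError("waiter failed"))
    try:
        create_then_wait(client, ImageId="ami-123", MinCount=1, MaxCount=1)
    except RuntimeError as exc:
        assert str(exc) == "waiter failed"
    else:
        raise AssertionError("expected the waiter error to propagate")
    client.terminate_instances.assert_called_once_with(InstanceIds=["i-abc"])


def test_cleanup_failure_does_not_mask_original():
    client = make_client(RuntimeError("waiter failed"), ValueError("terminate failed"))
    try:
        create_then_wait(client, ImageId="ami-123", MinCount=1, MaxCount=1)
    except RuntimeError as exc:
        # The original waiter error surfaces, not the ValueError from cleanup.
        assert str(exc) == "waiter failed"
    else:
        raise AssertionError("expected the waiter error to propagate")
```

Setting `side_effect` on the mocked waiter and on `terminate_instances` is what simulates the two failure modes. If the cleanup error were allowed to escape, the second test would fail with an uncaught `ValueError`.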

Backwards Compatibility

No changes to the public API or operator parameters.

Closes: #60903

…ures occurred after successful instance creation (e.g. waiter failures due to missing DescribeInstances permissions). This change adds best-effort cleanup when post-creation steps fail by attempting to terminate created instances. Cleanup errors are logged but do not mask the original exception. Tests cover successful cleanup on failure and ensure cleanup failures do not override the original error.

vincbeck

Makes sense to me

…ures (apache#60904) occurred after successful instance creation (e.g. waiter failures due to missing DescribeInstances permissions). This change adds best-effort cleanup when post-creation steps fail by attempting to terminate created instances. Cleanup errors are logged but do not mask the original exception. Tests cover successful cleanup on failure and ensure cleanup failures do not override the original error. Co-authored-by: Sameer Mesiah <smesiah971@gmail.com>

SameerMesiah97 requested a review from o-nikolas as a code owner January 21, 2026 21:28

boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Jan 21, 2026

eladkal requested a review from vincbeck January 21, 2026 21:38

vincbeck approved these changes Jan 21, 2026

vincbeck merged commit 88f1645 into apache:main Jan 22, 2026
89 checks passed

SameerMesiah97 changed the title ~~EC2CreateInstanceOperator could leave EC2 instances running when fail…~~ EC2CreateInstanceOperator could leave EC2 instances running when failing Jan 24, 2026

This was referenced Jan 24, 2026

Add best-effort cleanup to EmrCreateJobFlowOperator on post-creation failure #61010

Open

EcsRunTaskOperator leaks ECS task on failure with partial IAM permissions #61050

Open

SameerMesiah97 mentioned this pull request Jan 27, 2026

EksCreateNodegroupOperator leaks EKS nodegroup on failure with partial IAM permissions #61142

Open

2 tasks

vincbeck mentioned this pull request Jan 28, 2026

Status of testing Providers that were prepared on January 28, 2026 #61165

Open

85 tasks

SameerMesiah97 mentioned this pull request Jan 30, 2026

Restrict EC2CreateInstanceOperator cleanup to waiter failures and add guard flag #61272

Open
